Dealing With Non-Standard Orthography 'Information Access to Historical Documents from the Early New High German Period'. Hauser et al. (2007) Todd Shore.

Presentation transcript:

Dealing With Non-Standard Orthography
'Information Access to Historical Documents from the Early New High German Period'. Hauser et al. (2007)
Todd Shore
Project Seminar: 'Unlocking the Secrets of the Past: Text Mining for Historical Documents'
Caroline Sporleder & Martin Schreiber
Dept of Comp. Ling. & Phonetics, Saarland University
29/09/2016

Overview
1) Introduction
2) Problems
3) Solutions
4) Hauser et al.
   4.1) Goals
   4.2) Methods
   4.3) Conclusion

1) Introduction: Why do we need to deal with non-standard orthography?
Motivation: Digitisation, indexing and archiving of physical records

1) Introduction: Why do we need to deal with non-standard orthography?
Compare:
Fater unser, thū thār bist in himile, sī geheilagōt thīn namo, queme thīn rīhhi, sī thīn uuillo, sō her in himile ist, sō sī her in erdu, unsar brōt tagalīhhaz gib uns hiutu, inti furlāz uns unsara sculdi sō uuir furlāzemēs unsarēn sculdīgōn, inti ni gileitēst unsih in costunga, ūzouh arlōsi unsih fon ubile. (Braune & Ebbinghaus 1994)

1) Introduction: Why do we need to deal with non-standard orthography?
Compare:
Unser Vater in dem Himmel! Dein Name werde geheiliget. Dein Reich komme. Dein Wille geschehe auf Erden wie im Himmel. Unser täglich Brot gib uns heute. Und vergib uns unsere Schulden, wie wir unsern Schuldigern vergeben. Und führe uns nicht in Versuchung, sondern erlöse uns von dem Übel. Denn dein ist das Reich und die Kraft und die Herrlichkeit in Ewigkeit. Amen. (Braune & Ebbinghaus 1994)

1) Introduction: Why do we need to deal with non-standard orthography?
Compare:
Vater unser im Himmel, geheiligt werde Dein Name. Dein Reich komme. Dein Wille geschehe, wie im Himmel so auf Erden. Unser tägliches Brot gib uns heute. Und vergib uns unsere Schuld, wie auch wir vergeben unseren Schuldigern. Und führe uns nicht in Versuchung, sondern erlöse uns von dem Bösen. Denn Dein ist das Reich und die Kraft und die Herrlichkeit in Ewigkeit. Amen.
Vaterunser (Pater Noster / the Lord's Prayer) – ökumenische Fassung (ecumenical version)

1) Introduction: Why do we need to deal with non-standard orthography?
Compare:
Fater unser, du pist in himilum. Kauuihit si namo din. Piqhueme rihhi din, Uuesa din uuillo, sama so in himile est, sama in erdu. Pilipi unsraz emizzigaz kip uns eogauuanna. Enti flaz uns unsro sculdi, sama so uuir flazzames unsrem scolom. Enti ni princ unsih in chorunka. Uzzan kaneri unsih fona allem sunton. (Braune & Ebbinghaus 1994)

1) Introduction: Why do we need to deal with non-standard orthography?
Compare:
Üse Vater im Himmel Din Name isch gross und heilig Tue du bi üs dini Herrschaft und dis Gottesriech ufbaue. So, wiener im Himmel erfüllt wird, wömmer au do uf de Erde din Wille erülle. Gib us hüt üsers täglich Brot, und verzei üs üseri Sünde, so wie mer üserne Schldner vergend. Und für us nöd i Versuechig, sondern erlös üs vo alem Böse. Amen. (Rieger 2005)

1) Introduction: Why do we need to deal with non-standard orthography?
● 5 versions of the Vaterunser (Pater Noster / the Lord's Prayer)
● Which time period is each from?
● Which are intelligible to speakers of modern German?
a) 9th century A.D. (Old High German (OHG) – East Franconian)
b) 1545 (Early New High German (ENHG))
c) 1970 (New High German (NHG))
d) 9th century A.D. (Old High German (OHG) – Bavarian)
e) 2005 (New High (Swiss) German)

1) Introduction: Why do we need to deal with non-standard orthography?
● "Standard" orthography is an ideal, not a reality
  – A true "universal" standard only emerged in the 19th century (cf. Duden 1888)
  – Differing "proto-standards"
    – 12th–16th century: Middle Low German (Lübeck-influenced, from the Hanseatic League)
    – Eventually disappeared while High German standard(s) remained
  – Non-adherence to standard(s)
Result: NLP/information extraction problems are compounded
● Not including non-linguistic problems (e.g. OCR)

2) Problems: 4 levels of variation
● Phonological/graphical: e.g. [examples missing from transcript]
● Morphological: Significant variation/change in inflection/derivation
● Lexical: e.g. [examples missing from transcript]
  – ~50% variation in early ENHG text in relation to MHG
● Syntactic: Variation/change in syntactic structures, e.g. some dialects may feature SOV word order more commonly, others SVO; still others may feature both

3) Solutions: Special dictionaries (for time period, dialect, etc.)
– Pros: Manual checking of text; no word-similarity approximations needed
– Cons: Data coverage problems, expensive

3) Solutions: Word similarity approximation
– Rule-based generative matching: Either on- or offline generation of word variations through rules, e.g. Low German to High German p > ff ∨ pf (cf. LG Peper > NHG Pfeffer)
– Word similarity: Matching based on context, e.g.:
  ● Im anfang war das Wort, vnd das Wort war bey Gott, vnd Gott war das Wort, (Luther, 1545 (ENHG))
  ● Im Anfang war das Wort, und das Wort war bei Gott, und Gott war das Wort. (öF, 1970 (NHG))
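The rule-based generative matching idea can be illustrated with a minimal Python sketch. Only the rule p > ff ∨ pf comes from the slide; the names `RULES`, `generate_variants` and `match` are hypothetical, and letting a "p" also stay unchanged is an assumption made for illustration.

```python
from itertools import product

# Hypothetical rule table: each source character maps to its possible rewrites.
# Only p > ff / pf is taken from the slide; keeping "p" unchanged is an assumption.
RULES = {"p": ["p", "pf", "ff"]}

def generate_variants(word, rules=RULES):
    """Return every string obtained by rewriting each character independently."""
    options = [rules.get(ch, [ch]) for ch in word.lower()]
    return {"".join(choice) for choice in product(*options)}

def match(word, lexicon, rules=RULES):
    """Return the lexicon entries (lowercased) found among the generated variants."""
    return generate_variants(word, rules) & {entry.lower() for entry in lexicon}

print(sorted(generate_variants("Peper")))
print(match("Peper", ["Pfeffer", "Papier"]))
```

Even this single rule turns "Peper" into nine candidate strings, of which only "pfeffer" is a real word, so the number of candidates grows quickly as rules are added.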

4.1) Hauser et al.: Goals
a. Digitise a large corpus of early texts
  ● 14th–17th century prints: large amount of data available, unlike (handwritten) manuscripts
  ● First significant amount of German-language text
    – Previously presented only in Latin
b. Improve OCR and general digitisation methods
  ● Many texts in poor condition
  ● Numerous variations in typesetting style
    – Even OCR software designed for 19th-century Gothic print performs poorly

4.2) Hauser et al.: Methods (overview)
● Manually create a small corpus of text
● Handle spelling and compound variations
● Create an electronic dictionary
● Incorporate morphology and syntax
● Incorporate document structure and meta-info
  – Use these to improve OCR and digitisation methods, and to digitise more texts

4.2) Hauser et al.: Methods (lexical issues)
● Encountered ~50% lexical variation in comparison to modern German (cf. ≤15% in later texts; Pilz et al. 2006)
● Four main issues:
  1. New word forms: word x → word y (Eng. translation z)
  2. Latin words: prolific usage of Latin words in texts
  3. Variation in word splitting, e.g. Winterzeit → Winters zeiten ('wintertime')
  4. Partial new word form, e.g. Mönchswesen → Moencherey ('monasticism')

4.2) Hauser et al.: Methods (lexical issues)
● And more, e.g.:
  – Morphemic variation: suffixes/prefixes
  – Typesetting variation
  – Graphemic/phonetic variation, e.g. Abenteuer → Abertheur ('adventure')
  – Innovative characters not in modern German, e.g. für → fůr ('for')
● Database covering first four variations used as a base from which to derive these other, more systematic variations
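Character-level variation of the für → fůr kind could, as a rough first pass, be handled with a lookup table applied before any dictionary matching. The sketch below is an assumption-laden illustration, not the method of Hauser et al.: `CHAR_MAP` and `normalise_characters` are hypothetical names, the ů → ü entry follows the slide's example, and the long s (ſ → s) entry is an addition of this sketch. Mapping a character the same way in every context is a simplification.

```python
import unicodedata

# Hypothetical mapping from historical characters to modern ones.
# ů → ü follows the slide's example (fůr → für); ſ → s is an added assumption.
CHAR_MAP = str.maketrans({"ů": "ü", "ſ": "s"})

def normalise_characters(text):
    """Compose combining sequences (NFC), then replace historical characters."""
    composed = unicodedata.normalize("NFC", text)
    return composed.translate(CHAR_MAP)

print(normalise_characters("fůr"))
```

The NFC step matters because the same visible character may be stored either as one code point or as a base letter plus a combining mark; composing first lets a single table entry cover both encodings.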

4.2) Hauser et al.: Methods (lexical issues)
● Traditional rule-based derivation produces a high number of implausible word variations
● Developed a multi-layer, partially probabilistic, confidence-based derivation system
  – Fuzzy matching based on RegExp and Soundex; edit operations weighted according to context through machine learning
    ● Based on rewriting of sequences of symbols rather than individual symbols (e.g. [examples missing from transcript])
  – Rule-based derivation in the context of the linguistic POS tag
  – To improve recall: an old word is compared to other old words based on word similarity, and their respective derivations are considered
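The idea of weighting edit operations can be illustrated with a small weighted Levenshtein distance in Python. This is a simplified sketch, not the system of Hauser et al.: it works on single characters without context, whereas the slide describes context-dependent weights over symbol sequences. The table `CHEAP_SUBSTITUTIONS` and its values are hypothetical, hand-set stand-ins for weights that would be learned from data; the word pairs vnd/und and bey/bei are taken from the Luther example earlier in the slides.

```python
# Hypothetical substitution costs; in a real system such weights would be
# learned from data rather than set by hand.
CHEAP_SUBSTITUTIONS = {("v", "u"): 0.1, ("y", "i"): 0.1}

def substitution_cost(a, b):
    """Cost of replacing character a with b (symmetric, 1.0 by default)."""
    if a == b:
        return 0.0
    return CHEAP_SUBSTITUTIONS.get((a, b), CHEAP_SUBSTITUTIONS.get((b, a), 1.0))

def weighted_edit_distance(old, new):
    """Levenshtein distance with per-pair substitution costs."""
    previous = [float(j) for j in range(len(new) + 1)]
    for i, a in enumerate(old, start=1):
        current = [float(i)]
        for j, b in enumerate(new, start=1):
            current.append(min(
                previous[j] + 1.0,                           # delete a
                current[j - 1] + 1.0,                        # insert b
                previous[j - 1] + substitution_cost(a, b),   # substitute a with b
            ))
        previous = current
    return previous[-1]

for old, new in [("vnd", "und"), ("bey", "bei"), ("vnd", "bei")]:
    print(old, new, weighted_edit_distance(old, new))
```

With such weights, a historical spelling stays close to its modern counterpart while unrelated words of the same length remain far apart, which is what allows a confidence threshold to filter out implausible candidates.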

4.3) Hauser et al.: Conclusion
● This comprehensive, multi-faceted system may lead to improvements in OCR/digitisation methods
● However, shortcomings in OCR/digitisation make development of such a system more difficult
  – Technological advancement has largely ignored factors present in historical prints

Works cited
Braune, W & E A Ebbinghaus (1994). Althochdeutsches Lesebuch. Tübingen: Max Niemeyer Verlag.
Duden, K (1888). Vollständiges orthographisches Wörterbuch der deutschen Sprache: mit etymologischen Angaben, kurzen Sacherklärungen und Verdeutschungen der Fremdwörter. Leipzig: Verlag des Bibliographischen Instituts.
Hauser, A, M Heller, E Leiss, K U Schulz & C Wanzeck (2007). Information Access to Historical Documents from the Early New High German Period. In Burnard, Dobreva, Fuhr & Lüdeling (eds) Digital historical corpora - architecture, annotation, and retrieval. Dagstuhl: Internationales Begegnungs- und Forschungszentrum für Informatik.
Pilz, T, W Luther & N Fuhr (2006). Rule-based Search in Text Databases with Nonstandard Orthography. Literary and Linguistic Computing, 21(2):
Rieger, R (2005). Vater unser Schweizerdeutsch. In Schönstatt Bewegung Schweiz. Retrieved 19 Feb 2010, from