Dealing With Non-Standard Orthography 'Information Access to Historical Documents from the Early New High German Period'. Hauser et al. (2007) Todd Shore.


1 Dealing With Non-Standard Orthography
'Information Access to Historical Documents from the Early New High German Period'. Hauser et al. (2007)
Todd Shore
Project Seminar: 'Unlocking the Secrets of the Past: Text Mining for Historical Documents' (Caroline Sporleder & Martin Schreiber)
Dept of Comp. Ling. & Phonetics, Saarland University
29/09/2016

2 29/09/2016 Todd Shore 2/20 Overview
1) Introduction
2) Problems
3) Solutions
4) Hauser et al.
  1) Goals
  2) Methods
  3) Conclusion

3 1) Introduction: Why do we need to deal with non-standard orthography?
Motivation: digitisation, indexing and archiving of physical records

4 1) Introduction: Why do we need to deal with non-standard orthography? Compare:
Fater unser, thū thār bist in himile, sī geheilagōt thīn namo, queme thīn rīhhi, sī thīn uuillo, sō her in himile ist, sō sī her in erdu, unsar brōt tagalīhhaz gib uns hiutu, inti furlāz uns unsara sculdi sō uuir furlāzemēs unsarēn sculdīgōn, inti ni gileitēst unsih in costunga, ūzouh arlōsi unsih fon ubile. (Braune & Ebbinghaus 1994)

5 1) Introduction: Why do we need to deal with non-standard orthography? Compare:
Unser Vater in dem Himmel! Dein Name werde geheiliget. Dein Reich komme. Dein Wille geschehe auf Erden wie im Himmel. Unser täglich Brot gib uns heute. Und vergib uns unsere Schulden, wie wir unsern Schuldigern vergeben. Und führe uns nicht in Versuchung, sondern erlöse uns von dem Übel. Denn dein ist das Reich und die Kraft und die Herrlichkeit in Ewigkeit. Amen. (Braune & Ebbinghaus 1994)

6 1) Introduction: Why do we need to deal with non-standard orthography? Compare:
Vater unser im Himmel, geheiligt werde Dein Name. Dein Reich komme. Dein Wille geschehe, wie im Himmel so auf Erden. Unser tägliches Brot gib uns heute. Und vergib uns unsere Schuld, wie auch wir vergeben unseren Schuldigern. Und führe uns nicht in Versuchung, sondern erlöse uns von dem Bösen. Denn Dein ist das Reich und die Kraft und die Herrlichkeit in Ewigkeit. Amen. (Vaterunser (Pater Noster / the Lord's Prayer), ecumenical version)

7 1) Introduction: Why do we need to deal with non-standard orthography? Compare:
Fater unser, du pist in himilum. Kauuihit si namo din. Piqhueme rihhi din, Uuesa din uuillo, sama so in himile est, sama in erdu. Pilipi unsraz emizzigaz kip uns eogauuanna. Enti flaz uns unsro sculdi, sama so uuir flazzames unsrem scolom. Enti ni princ unsih in chorunka. Uzzan kaneri unsih fona allem sunton. (Braune & Ebbinghaus 1994)

8 1) Introduction: Why do we need to deal with non-standard orthography? Compare:
Üse Vater im Himmel Din Name isch gross und heilig Tue du bi üs dini Herrschaft und dis Gottesriech ufbaue. So, wiener im Himmel erfüllt wird, wömmer au do uf de Erde din Wille erülle. Gib us hüt üsers täglich Brot, und verzei üs üseri Sünde, so wie mer üserne Schldner vergend. Und für us nöd i Versuechig, sondern erlös üs vo alem Böse. Amen. (Rieger 2005)

9 1) Introduction: Why do we need to deal with non-standard orthography?
● 5 versions of the Vaterunser (Pater Noster / the Lord's Prayer)
● Which time period is each from?
● Which is intelligible to speakers of modern German?
a) 9th century A.D. (Old High German (OHG), East Franconian)
b) 1545 (Early New High German (ENHG))
c) 1970 (New High German (NHG))
d) 9th century A.D. (Old High German (OHG), Bavarian)
e) 2005 (New High (Swiss) German)

10 1) Introduction: Why do we need to deal with non-standard orthography?
● “Standard” orthography is an ideal, not a reality
  – A true “universal” standard only emerged in the 19th century (cf. Duden 1888)
  – Differing “proto-standards”, e.g. 12th-16th century Middle Low German (Lübeck-influenced, from the Hanseatic League), which eventually disappeared while the High German standard(s) remained
  – Non-adherence to standard(s)
● Result: NLP/information extraction problems are compounded
● This does not include non-linguistic problems (e.g. OCR)

11 2) Problems: 4 levels of variation
● Phonological/graphical: e.g. :
● Morphological: significant variation/change in inflection/derivation
● Lexical: e.g.
  – ~50% variation in early ENHG text in relation to MHG
● Syntactic: variation/change in syntactic structures, e.g. some dialects may feature SOV word order more commonly, others SVO, and still others may feature both

12 3) Solutions: Special dictionaries (for time period, dialect, etc.)
– Pros: manual checking of text; no word-similarity approximations needed
– Cons: data coverage problems; expensive

13 3) Solutions: Word similarity approximation
– Rule-based generative matching: either on- or offline generation of word variations through rules, e.g. Low German to High German p > ff ∨ pf (cf. LG Peper > NHG Pfeffer)
– Word similarity: matching based on context, e.g.:
  ● Im anfang war das Wort, vnd das Wort war bey Gott, vnd Gott war das Wort, (Luther, 1545 (ENHG))
  ● Im Anfang war das Wort, und das Wort war bei Gott, und Gott war das Wort. (ecumenical version, 1970 (NHG))
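To make the rule-based generative matching on this slide concrete, here is a minimal Python sketch. It assumes a single context-free rule taken from the slide (Low German p may surface as p, pf or ff); the names `RULES` and `generate_variants` are hypothetical, and a real system would need many more rules and some way to rank the output.

```python
from itertools import product

# Hypothetical rule table: each source character maps to its possible rewrites.
# Only the p > ff / pf rule comes from the slide; keeping "p" itself as an
# option is an assumption so that unchanged forms are also generated.
RULES = {"p": ["p", "pf", "ff"]}


def generate_variants(word, rules=RULES):
    """Return every form obtainable by applying the rules to each character."""
    # One list of alternatives per character; characters without a rule stay as-is.
    options = [rules.get(ch, [ch]) for ch in word.lower()]
    # The Cartesian product enumerates all combinations of alternatives.
    return {"".join(combo) for combo in product(*options)}


if __name__ == "__main__":
    variants = generate_variants("Peper")
    print(sorted(variants))
    print("pfeffer" in variants)
```

Even this one rule yields nine candidates for a five-letter word, only one of which is the attested modern form, which illustrates why purely generative matching tends to overgenerate.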

14 4.1) Hauser et al.: Goals
a. Digitise a large corpus of early texts
  ● 14th-17th century prints: large amount of data available, unlike (handwritten) manuscripts
  ● First significant amount of German-language text
    – Previously presented only in Latin
b. Improve OCR and general digitisation methods
  ● Many texts in poor condition
  ● Numerous variations in typesetting style
    – Even OCR software designed for 19th-century Gothic print performs poorly

15 4.2) Hauser et al.: Methods (Overview)
● Manually create a small corpus of text
● Handle spelling and compound variations
● Create an electronic dictionary
● Incorporate morphology and syntax
● Incorporate document structure and meta-info
  – Use these to improve OCR and digitisation methods, and to digitise more texts

16 4.2) Hauser et al.: Methods (Lexical issues)
● Encountered ~50% lexical variation in comparison to modern German (cf. ≤15% in later texts; Pilz et al. 2006)
● Four main issues:
  1. New word forms: word x → word y (Eng. translation z)
  2. Latin words: prolific usage of Latin words in texts
  3. Variation in word splitting, e.g. Winterzeit → Winters zeiten ('wintertime')
  4. Partial new word form, e.g. Mönchswesen → Moencherey ('monasticism')

17 4.2) Hauser et al.: Methods (Lexical issues)
● And more, e.g.:
  – Morphemic variation: suffixes/prefixes
  – Typesetting variation
  – Graphemic/phonetic variation, e.g. Abenteuer → Abertheur ('adventure')
  – Innovative characters not in modern German, e.g. für → fůr ('for')
● Database covering the first four variations used as a base from which to derive these other, more systematic variations
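Characters that do not exist in modern German can be handled with a simple mapping before any dictionary lookup. The Python sketch below is an illustration only: the ů → ü entry comes from the für/fůr example on this slide, the long s entry is an added assumption, and `CHAR_MAP` and `normalise` are hypothetical names rather than anything described by Hauser et al.

```python
import unicodedata

# Hypothetical mapping from historical characters to modern equivalents.
CHAR_MAP = {
    "\u016f": "\u00fc",  # u with ring above -> u with diaeresis (slide example: fůr / für)
    "\u017f": "s",       # long s -> s (assumption, not taken from the slide)
}

# Translation table built once, since str.translate expects code-point keys.
_TABLE = str.maketrans(CHAR_MAP)


def normalise(word):
    """Map historical characters in `word` to their modern counterparts."""
    # NFC first, so that "u" + combining ring is treated like the precomposed form.
    composed = unicodedata.normalize("NFC", word)
    return composed.translate(_TABLE)


if __name__ == "__main__":
    print(normalise("f\u016fr"))
```

A one-to-one table like this only covers character-level substitutions; graphemic variation such as Abenteuer → Abertheur needs rules or similarity measures over longer sequences.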

18 4.2) Hauser et al.: Methods (Lexical issues)
● Traditional rule-based derivation produces a high number of implausible word variations
● Developed a multi-layer, partially probabilistic, confidence-based derivation system
  – Fuzzy matching based on RegExp and Soundex; edit operations weighted according to context through machine learning
    ● Based on rewriting of sequences of symbols rather than individual symbols (e.g. → more accurate than →
  – Rule-based derivation in context of linguistic POS tag
  – To improve recall: an old word is compared to other old words based on word similarity, and their respective derivations are considered
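The idea of weighted edit operations can be sketched with a small weighted Levenshtein distance in Python. This is a deliberately simplified stand-in, not the system described on the slide: it works on single symbols rather than sequences, ignores context, and uses hand-picked placeholder weights where the slide mentions weights learned by machine learning. The names `CHEAP_SUBS`, `sub_cost` and `weighted_distance` are hypothetical, and the v/u and y/i pairs are taken from the vnd/und and bey/bei spellings quoted earlier in the deck.

```python
# Hypothetical substitution weights; placeholders for learned values.
CHEAP_SUBS = {("v", "u"): 0.1, ("y", "i"): 0.1}


def sub_cost(a, b):
    """Cost of replacing symbol a with symbol b (symmetric)."""
    if a == b:
        return 0.0
    return CHEAP_SUBS.get((a, b), CHEAP_SUBS.get((b, a), 1.0))


def weighted_distance(old, new):
    """Levenshtein distance with unit insert/delete costs and weighted substitutions."""
    # prev[j] holds the distance between the processed prefix of `old` and new[:j].
    prev = [float(j) for j in range(len(new) + 1)]
    for i, a in enumerate(old, 1):
        curr = [float(i)]
        for j, b in enumerate(new, 1):
            curr.append(min(
                prev[j] + 1.0,               # delete a
                curr[j - 1] + 1.0,           # insert b
                prev[j - 1] + sub_cost(a, b)  # substitute a with b (or match)
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    for old, new in [("vnd", "und"), ("bey", "bei"), ("vnd", "end")]:
        print(old, new, weighted_distance(old, new))
```

With these weights, historically plausible pairs such as vnd/und score much closer than arbitrary one-letter differences such as vnd/end, which is the effect that weighting edit operations is meant to achieve.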

19 4.3) Hauser et al.: Conclusion
● This comprehensive, multi-faceted system may lead to improvements in OCR/digitisation methods
● However, shortcomings in OCR/digitisation make development of such a system more difficult
  – Technological advancement has largely ignored factors present in historical prints

20 Works cited
Braune, W & E A Ebbinghaus (1994). Althochdeutsches Lesebuch. Tübingen: Max Niemeyer Verlag.
Duden, K (1888). Vollständiges orthographisches Wörterbuch der deutschen Sprache: mit etymologischen Angaben, kurzen Sacherklärungen und Verdeutschungen der Fremdwörter. Leipzig: Verlag des Bibliographischen Instituts.
Hauser, A, M Heller, E Leiss, K U Schulz & C Wanzeck (2007). Information Access to Historical Documents from the Early New High German Period. In Burnard, Dobreva, Fuhr & Lüdeling (eds) Digital historical corpora: architecture, annotation, and retrieval. Dagstuhl: Internationales Begegnungs- und Forschungszentrum für Informatik.
Pilz, T, W Luther & N Fuhr (2006). Rule-based Search in Text Databases with Nonstandard Orthography. Literary and Linguistic Computing, 21(2):
Rieger, R (2005). Vater unser Schweizerdeutsch. In Schönstatt Bewegung Schweiz. Retrieved 19 Feb 2010, from

