1 Dictionary Acquisition Using Parallel Text and Co-occurrence Statistics Chris Biemann, Uwe Quasthoff University of Leipzig, NLP-Dept. Friday, May 20, 2005 NODALIDA 2005
2 Problem Description Given: a certain amount of sentence-aligned parallel text. Not available: morphological, grammatical or semantic information; string similarity for cognates; a bilingual dictionary. Wanted: bilingual dictionaries; alignment on word level.
3 Broad Picture Calculating statistically significant translingual co-occurrences yields ranked translation candidates. For alignment, the highest-ranked translation candidates that occur in the sentence pair are linked.
4 Co-occurrence Statistics Co-occurrence: occurrence of two words within a well-defined unit of information (sentence, nearest neighbors, window, ...). Significant co-occurrences reflect relations between words. Threshold on significance measure (log-likelihood): k = number of units containing A and B
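As a minimal sketch, assuming the log-likelihood measure is the common G² ratio over a 2x2 contingency table (the exact variant used by the authors may differ), the significance of a word pair could be computed as follows. The names `log_likelihood`, `a`, `b` and `n` are illustrative assumptions; `k` is the number of units containing both A and B, as on the slide.

```python
import math

def log_likelihood(k: int, a: int, b: int, n: int) -> float:
    """G^2 log-likelihood ratio for two words A and B.

    k: units containing both A and B
    a: units containing A
    b: units containing B
    n: total number of units (e.g. sentences or sentence pairs)
    """
    # Observed cell counts of the 2x2 table: (A,B), (A,not B), (not A,B), (not A,not B)
    observed = [k, a - k, b - k, n - a - b + k]
    # Expected cell counts if A and B were independent
    expected = [a * b / n, a * (n - b) / n, (n - a) * b / n, (n - a) * (n - b) / n]
    g2 = 0.0
    for o, e in zip(observed, expected):
        if o > 0:  # a cell with zero observations contributes nothing
            g2 += o * math.log(o / e)
    return 2.0 * g2

# Pairs whose score exceeds a chosen threshold count as significant co-occurrences.
score_independent = log_likelihood(k=1, a=10, b=10, n=100)
score_associated = log_likelihood(k=10, a=10, b=10, n=100)
```

Note that G² is large for any strong deviation from independence, so a practical implementation would typically also require k to exceed its expected value a*b/n.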
5 Trans-co-occurrences (translingual co-occurrences) Normal co-occurrences: calculation performed on a sentence basis; co-occurrents are frequently found together in sentences. Trans-co-occurrences: calculation performed on bilingual sentence pairs; co-occurrents are frequently found together in bilingual sentence pairs. Hypothesis: significant co-occurrences between words of different languages (= trans-co-occurrences) are translation equivalents.
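The counting step behind trans-co-occurrences can be sketched as below. This is an illustration only: whitespace tokenization, the function name `count_trans_cooccurrences` and the toy sentence pairs are assumptions, not part of the original work. Each word pair is counted once per sentence pair, so the resulting counts can feed a significance measure.

```python
from collections import Counter
from itertools import product

def count_trans_cooccurrences(sentence_pairs):
    """Count, per word and per cross-language word pair, the sentence pairs they occur in."""
    src_freq, tgt_freq, pair_freq = Counter(), Counter(), Counter()
    for src_sentence, tgt_sentence in sentence_pairs:
        # Use sets so that a word is counted once per sentence pair
        src_types = set(src_sentence.split())
        tgt_types = set(tgt_sentence.split())
        src_freq.update(src_types)
        tgt_freq.update(tgt_types)
        # Every source word co-occurs with every target word of the aligned sentence
        pair_freq.update(product(src_types, tgt_types))
    return src_freq, tgt_freq, pair_freq

# Toy data, made up for illustration
pairs = [("das Haus", "the house"), ("das Auto", "the car"), ("ein Haus", "a house")]
src_freq, tgt_freq, pair_freq = count_trans_cooccurrences(pairs)
```

In this toy corpus, `pair_freq[("Haus", "house")]` is 2 while `pair_freq[("Haus", "car")]` is 0, which is the kind of contrast the hypothesis on the slide relies on.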
6 Data: Europarl Transcriptions of the European Parliament, about 1 million sentences per language. Available for Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish and Swedish. Experiments carried out for: English-Danish, English-Dutch, English-German, English-Finnish, English-Italian, English-Portuguese, English-Swedish (chosen because of dictionary availability).
7 Example: In all sentence pairs, and occur together.
8 Example: top-ranked trans-co-occurrences Gesellschaft: (12082), (342), (274), (237), (226), (187), (183), (182), (168), (165), (159), (137), (134), (120), (120), (119), (111), (110), (105), (102) society: (12082), (466), (379), (328), (312), (233), (219), (213), (179), (173), (161), (157), (155), (132), (119), (118), (108), (108), (98), (97) kaum: (825), (470), (362), (278), (254), (186), (88), (68), (63), (53), (51), (47), (40), (38), (37), (32), (32), (31), (29), (28) hardly: (825), (138), (64), (51), (43), (37), (37), (25), (21), (17), (17), (16), (16), (15), (14), (14), (13), (12), (12), (12)
9 Evaluation What is the quality of the determined translation equivalents? Evaluation by comparing results to bilingual dictionaries (freelang) to measure precision. Method: only words that are in the dictionary and have automatic translations are taken into account; determine the proportion of matches among the 3 highest-ranked trans-co-occurrences. Problems: some translations are correct but not found in the dictionary; dictionaries are not adapted to the domain; inflection: dictionaries contain lemmas -> prefix matching; unknown multiword units.
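The evaluation method can be sketched roughly as follows. This is one possible reading, stated as an assumption: a word counts as a hit if any of its top 3 candidates is listed in the dictionary, and exact string match is used instead of prefix matching. The function name `precision_at_n` and the toy data are hypothetical.

```python
def precision_at_n(candidates, dictionary, n=3):
    """Share of evaluated words with a dictionary match among their top-n candidates.

    candidates: word -> ranked list of automatically found translations
    dictionary: word -> set of reference translations
    """
    # Only words that are in the dictionary and have automatic translations are evaluated
    evaluated = [w for w in candidates if w in dictionary and candidates[w]]
    if not evaluated:
        return 0.0
    hits = sum(
        1 for w in evaluated
        if any(c in dictionary[w] for c in candidates[w][:n])
    )
    return hits / len(evaluated)

# Toy data, made up for illustration
toy_candidates = {
    "office": ["Büro", "Amt", "Büros"],
    "word_a": ["x1", "x2", "x3"],
    "word_b": ["y1"],          # not in the dictionary, so it is ignored
}
toy_dictionary = {"office": {"Büro", "Amt"}, "word_a": {"z"}}
toy_precision = precision_at_n(toy_candidates, toy_dictionary)
```

Here two words are evaluated and one of them has a match, giving a precision of 0.5.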
10 Prefix matching The prefix match prfx(A,B) of two strings A and B is defined by $\mathrm{prfx}(A,B) = \frac{\text{length of common prefix of } A \text{ and } B}{\max(\mathrm{length}(A), \mathrm{length}(B))}$ Examples: prfx(Herbert, Herberts) = 7/8 = 0.875; prfx(Baustelle, Baugenehmigung) = 3/14 ≈ 0.21; prfx(Häuserkampf, Häuserkämpfe) = 7/12 ≈ 0.58. A rather crude measure, but it deals more or less with the inflection problem.
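The prefix match defined on this slide is simple enough to write down directly. The sketch below follows that definition; returning 0.0 for two empty strings is an added assumption to avoid a division by zero.

```python
def prfx(a: str, b: str) -> float:
    """Length of the common prefix of a and b, divided by the length of the longer string."""
    if not a and not b:
        return 0.0  # assumption: two empty strings are treated as no match
    common = 0
    for char_a, char_b in zip(a, b):
        if char_a != char_b:
            break
        common += 1
    return common / max(len(a), len(b))

# The three examples from the slide
examples = [
    prfx("Herbert", "Herberts"),
    prfx("Baustelle", "Baugenehmigung"),
    prfx("Häuserkampf", "Häuserkämpfe"),
]
```

The comparison is case-sensitive and works on characters, so a vowel change inside the stem (Häuserkampf vs. Häuserkämpfe) cuts the prefix short, which is why the measure is only a rough answer to inflection.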
11 Sample data from en-de co1-3: top trans-co-occurrences; p1-3: largest prefix match with some dictionary entry of the word.
word (en) | co1 (de) | p1 | co2 (de) | p2 | co3 (de) | p3
absolutely essential | absolut | 0 | unbedingt | | unbedingt notwendig | 0.10
essential | wesentlichen | 0.83 | wesentliche | | ist | 0
office | Büro | 1 | Amt | 1 | Büros | 0.8
pollutants | Schadstoffe | 1 | Schadstoffen | | Emission | 0
expertise | Fachwissen | 0 | Sachverstand | 1 | Sachkenntnis | 1
prescribed | vorgeschrieben | 1 | vorgeschriebenen | | vorgeschriebene | 0.93
means | bedeutet | 1 | Mittel | 1 | heißt | 0.09
bill | Gesetzentwurf | 0.15 | Gesetzesentwurf | | Rechnung | 1
approach | Ansatz | 1 | Konzept | 0 | Vorgehensweise | 0
audit | Prüfung | 0 | Audit | 1 | Rechnungsprüfung | 1
12 Results for freelang-evaluation blue: prfx = 1, red: 0.8 <= prfx < 1, yellow: 0.6 <= prfx < 0.8
13 Manual Evaluation on random samples of 1000 words. Better results: no domain-dependent deficiency of the dictionary; no problems with inflection.
14 Coverage on types Proportion of words with at least 3 trans-co-occurrences in types list
15 Coverage on tokens Proportion of tokens having at least 3 trans-co-occurrences in running text.
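Both coverage figures (on types and on tokens) can be computed from the same inputs. The sketch below is illustrative; the function name `coverage`, its parameters and the toy data are assumptions.

```python
from collections import Counter

def coverage(tokens, candidates, min_candidates=3):
    """Return (type coverage, token coverage) for a running text.

    tokens: list of words in running text
    candidates: word -> list of trans-co-occurrences found for it
    """
    if not tokens:
        return 0.0, 0.0
    counts = Counter(tokens)
    # A word is covered if it has at least min_candidates trans-co-occurrences
    covered = {w for w in counts if len(candidates.get(w, ())) >= min_candidates}
    type_coverage = len(covered) / len(counts)
    token_coverage = sum(counts[w] for w in covered) / len(tokens)
    return type_coverage, token_coverage

# Toy data, made up for illustration
toy_tokens = ["the", "house", "the", "car"]
toy_candidates = {"the": ["die", "der", "das"], "house": ["Haus"]}
type_cov, token_cov = coverage(toy_tokens, toy_candidates)
```

Because frequent words tend to have more candidates, token coverage is usually higher than type coverage; in the toy example it is 0.5 against roughly 0.33.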
16 Comparison with [Sahlgren 2004]
17 Comparison with [Sahlgren 2004]
18 Alignment Given: bilingual sentence pair. Wanted: which word corresponds to which? Method: scan sentence 1 word by word and link each word to its highest-ranked trans-co-occurrence that can be found in sentence 2.
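The alignment method described here can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: the function name `align`, whitespace tokenization and the candidate lists are assumptions, and the ranks in the toy data are made up rather than taken from the Europarl experiments.

```python
def align(sentence1, sentence2, ranked_candidates):
    """Link each word of sentence1 to its best-ranked candidate present in sentence2.

    Returns a list of (source word, target word, rank) triples;
    words without any candidate in sentence2 stay unaligned.
    """
    words2 = set(sentence2)
    links = []
    for word in sentence1:
        for rank, candidate in enumerate(ranked_candidates.get(word, []), start=1):
            if candidate in words2:
                links.append((word, candidate, rank))
                break  # keep only the highest-ranked match
    return links

# Toy candidate lists, made up for illustration
ranked = {
    "Landwirtschaft": ["agricultural", "Agriculture", "farming"],
    "nur": ["only", "just"],
    "Union": ["Union", "European"],
}
german = "Die Landwirtschaft stellt nur 5,5 % der Arbeitsplätze der Union".split()
english = "Agriculture only provides 5.5 % of employment in the Union".split()
links = align(german, english, ranked)
```

The scan is one-directional: several words of sentence 1 may link to the same word of sentence 2, and words of sentence 2 may remain unlinked.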
19 Alignment: Example 1 Red words: no alignment. Blue arrows: errors. Arrow index: rank in trans-co-occurrences.
Die Landwirtschaft stellt nur 5,5 % der Arbeitsplätze der Union.
Agriculture only provides 5.5 % of employment in (the Union)
20 Alignment: Example 2 Grey arrows: multiple alignments for frequent words.
Indem wir den Mitgliedstaaten für die Umsetzung der Richtlinie kein spezifisches Datum setzen,
By not setting a specific date (for the) Member States (to implement) the directive
sondern ihnen einen Zeitraum von drei Monaten nach Inkrafttreten der Richtlinie zugestehen,
and instead giving them a period of three months after its (entry into force),
führen wir eine Flexibilitätsklausel ein,
we are introducing a flexibility clause
die eine unverzügliche Umsetzung gewährleistet.
which ensures that the directive will be implemented without delay
21 Further work Dictionary acquisition: document-level aligned texts; weakly parallel texts or corpora. Alignment: dealing with cognates; symmetric alignment; alignment of phrases and multiword units.
23 References
(Biemann et al 2004): Biemann, Chr.; Bordag, S.; Heyer, G.; Quasthoff, U.; Wolff, Chr.: Language-independent Methods for Compiling Monolingual Lexical Data, Proceedings of CICLing 2004, Seoul, Korea and Springer LNCS 2945, pp , Springer Verlag Berlin Heidelberg
(Sahlgren 2004): Sahlgren, M. (2004): Automatic Bilingual Lexicon Acquisition Using Random Indexing of Aligned Bilingual Data, Proceedings of LREC-2004, Lisboa, Portugal
(Koehn 2002): Koehn, P. (2002): Europarl: A multilingual corpus for evaluation of machine translation,
24 Alignment Evaluation Strong's numbers in the Bible, English-Russian
25 Alignment Evaluation Strong's numbers in the Bible, English-German