1 Dictionary Acquisition Using Parallel Text and Co-occurrence Statistics Chris Biemann, Uwe Quasthoff University of Leipzig, NLP-Dept. Friday, May 20,

Presentation transcript:

1 Dictionary Acquisition Using Parallel Text and Co-occurrence Statistics Chris Biemann, Uwe Quasthoff University of Leipzig, NLP-Dept. Friday, May 20, 2005 NODALIDA 2005

2 Problem Description
Given: certain amounts of sentence-aligned parallel texts
Not available:
- morphological, grammatical, semantic etc. information
- string similarity for cognates
- bilingual dictionary
Wanted:
- bilingual dictionaries
- alignment on word level

3 Broad Picture
Calculation of statistically significant translingual co-occurrences yields ranked translation candidates.
For alignment, the highest-ranked translation candidates that occur in the sentence pair are linked.

4 Co-occurrence Statistics
Co-occurrence: occurrence of two words within a well-defined unit of information (sentence, nearest neighbors, window, ...)
Significant co-occurrences reflect relations between words.
Threshold on significance measure (log-likelihood): k = number of units containing A and B
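The formula on this slide did not survive in the transcript. As a rough illustration only, the sketch below uses the common log-likelihood ratio (G²) over a 2x2 contingency table; this is an assumption and may differ from the exact variant the authors used. The function name and the parameters `a`, `b` and `n` are hypothetical; only `k` is named on the slide.

```python
import math

def log_likelihood(k, a, b, n):
    """G^2 statistic for the co-occurrence of words A and B.

    k: number of units containing both A and B (as on the slide)
    a: number of units containing A          (assumed name)
    b: number of units containing B          (assumed name)
    n: total number of units                 (assumed name)
    """
    # Observed counts of the 2x2 table: (A,B), (A,not B), (not A,B), (not A,not B)
    observed = [k, a - k, b - k, n - a - b + k]
    # Expected counts if A and B occurred independently
    expected = [
        a * b / n,
        a * (n - b) / n,
        (n - a) * b / n,
        (n - a) * (n - b) / n,
    ]
    total = 0.0
    for o, e in zip(observed, expected):
        if o > 0:  # cells with zero observations contribute nothing
            total += o * math.log(o / e)
    return 2.0 * total
```

A pair would then be kept as significant when the value exceeds a chosen threshold. Note that this statistic is two-sided: it is also large when two words co-occur less often than expected.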

5 Trans-co-occurrences (translingual co-occurrences)
Normal co-occurrences:
- Calculation performed on sentence basis
- Co-occurrents can be found frequently together in sentences
Trans-co-occurrences:
- Calculation performed on bilingual sentence pairs
- Co-occurrents can be found frequently together in bilingual sentence pairs
Hypothesis: significant co-occurrences between words of different languages (= trans-co-occurrences) are translation equivalents
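A minimal sketch of the counting step, assuming whitespace tokenization and counting each word at most once per sentence; both are simplifications chosen for illustration, and the function name is hypothetical. The resulting frequencies are the raw inputs a significance measure would need.

```python
from collections import Counter

def count_trans_cooccurrences(sentence_pairs):
    """Count in how many sentence pairs each word and each word pair occurs.

    sentence_pairs: iterable of (source_sentence, target_sentence) strings.
    Returns (source_freq, target_freq, pair_freq) as Counters.
    """
    source_freq, target_freq, pair_freq = Counter(), Counter(), Counter()
    for source, target in sentence_pairs:
        # Use sets so a word counts once per sentence pair
        source_types = set(source.split())
        target_types = set(target.split())
        source_freq.update(source_types)
        target_freq.update(target_types)
        # Every cross-language word pair in the sentence pair co-occurs once
        pair_freq.update((s, t) for s in source_types for t in target_types)
    return source_freq, target_freq, pair_freq

# Toy data, invented for illustration
pairs = [("das Haus", "the house"), ("das Auto", "the car")]
source_freq, target_freq, pair_freq = count_trans_cooccurrences(pairs)
```

On this toy input, `("das", "the")` co-occurs in both pairs while `("Haus", "house")` co-occurs in one.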

6 Data: Europarl
Transcriptions of the European Parliament, about 1 million sentences per language
Available for Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish and Swedish
Experiments carried out for (chosen because of dictionary availability):
- English-Danish
- English-Dutch
- English-German
- English-Finnish
- English-Italian
- English-Portuguese
- English-Swedish

7 Example: In all sentence pairs, and occur together.

8 Example: top-ranked trans-co-occurrences
Gesellschaft: (12082), (342), (274), (237), (226), (187), (183), (182), (168), (165), (159), (137), (134), (120), (120), (119), (111), (110), (105), (102)
society: (12082), (466), (379), (328), (312), (233), (219), (213), (179), (173), (161), (157), (155), (132), (119), (118), (108), (108), (98), (97)
kaum: (825), (470), (362), (278), (254), (186), (88), (68), (63), (53), (51), (47), (40), (38), (37), (32), (32), (31), (29), (28)
hardly: (825), (138), (64), (51), (43), (37), (37), (25), (21), (17), (17), (16), (16), (15), (14), (14), (13), (12), (12), (12)

9 Evaluation
What is the quality of the determined translation equivalents?
Evaluation by comparing results to bilingual dictionaries (freelang) to measure precision
Method:
- Only words that are in the dictionary and have automatic translations are taken into account
- Determine the proportion of matches in the 3 highest-ranked trans-co-occurrences
Problems:
- Some translations are correct but not found in the dictionary
- Dictionaries are not adapted to the domain
- Inflection: dictionaries contain lemmas -> prefix matching
- Unknown multiword units

10 Prefix matching
The prefix match prfx(A,B) of two strings A and B is defined by
prfx(A,B) = (length of common prefix of A and B) / max(length(A), length(B))
Examples:
prfx(Herbert, Herberts) = 7/8 = 0.875
prfx(Baustelle, Baugenehmigung) = 3/14 ≈ 0.214
prfx(Häuserkampf, Häuserkämpfe) = 7/12 ≈ 0.583
A quite crude measure, but it deals more or less with the inflection problem.
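The definition above translates directly into a few lines of code. This is a sketch: returning 0.0 for two empty strings is an assumption, since the slide does not define that case.

```python
def prfx(a, b):
    """Length of the common prefix of a and b, divided by the longer length."""
    if not a and not b:
        return 0.0  # assumption: undefined on the slide, treated as no match
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return common / max(len(a), len(b))

print(prfx("Herbert", "Herberts"))          # 7/8
print(prfx("Baustelle", "Baugenehmigung"))  # 3/14
```

Because only the shared prefix counts, an umlaut in the stem (as in Häuserkampf vs. Häuserkämpfe) cuts the match short even though the two words are inflected forms of the same lemma.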

11 Sample data from en-de
co1-3: top trans-co-occurrences; p1-3: largest prefix match with some dictionary entry of the word.
word (en) | co1 (de) | p1 | co2 (de) | p2 | co3 (de) | p3
absolutely essential | absolut | 0 | unbedingt | | unbedingt notwendig | 0.10
essential | wesentlichen | 0.83 | wesentliche | | ist | 0
office | Büro | 1 | Amt | 1 | Büros | 0.8
pollutants | Schadstoffe | 1 | Schadstoffen | | Emission | 0
expertise | Fachwissen | 0 | Sachverstand | 1 | Sachkenntnis | 1
prescribed | vorgeschrieben | 1 | vorgeschriebenen | | vorgeschriebene | 0.93
means | bedeutet | 1 | Mittel | 1 | heißt | 0.09
bill | Gesetzentwurf | 0.15 | Gesetzesentwurf | | Rechnung | 1
approach | Ansatz | 1 | Konzept | 0 | Vorgehensweise | 0
audit | Prüfung | 0 | Audit | 1 | Rechnungsprüfung | 1

12 Results for freelang evaluation
blue: prfx = 1, red: 0.8 <= prfx < 1, yellow: 0.6 <= prfx < 0.8

13 Manual Evaluation on random samples of 1000 words
Better results:
- no domain-dependent deficiency of the dictionary
- no problems with inflection

14 Coverage on types
Proportion of words with at least 3 trans-co-occurrences in the list of types

15 Coverage on tokens
Proportion of tokens having at least 3 trans-co-occurrences in running text.
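The difference between the two coverage figures can be sketched as follows. The function name, the data layout (a dict from word to its list of trans-co-occurrences) and the toy data are assumptions made for illustration.

```python
def coverage(tokens, candidates, min_candidates=3):
    """Return (type coverage, token coverage) for a tokenized text.

    tokens: list of words in running text
    candidates: dict mapping a word to its list of trans-co-occurrences
    min_candidates: a word counts as covered if it has at least this many
    """
    if not tokens:
        return 0.0, 0.0
    covered = {w for w, c in candidates.items() if len(c) >= min_candidates}
    types = set(tokens)
    type_coverage = len(types & covered) / len(types)
    token_coverage = sum(1 for t in tokens if t in covered) / len(tokens)
    return type_coverage, token_coverage

# Toy data, invented for illustration
tokens = ["the", "the", "the", "house"]
candidates = {"the": ["der", "die", "das"], "house": ["Haus"]}
type_cov, token_cov = coverage(tokens, candidates)
```

Here one of two types is covered (0.5), but three of four tokens (0.75). Frequent words tend to have more trans-co-occurrences, so token coverage is typically the higher figure.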

16 Comparison with [Sahlgren 2004]

17 Comparison with [Sahlgren 2004]

18 Alignment
Given: bilingual sentence pair
Wanted: which word corresponds to which?
Method: scan sentence 1 word by word and link each word to its highest-ranked trans-co-occurrence that can be found in sentence 2.
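The method can be sketched as a greedy lookup. The function name, the data layout and the toy candidate lists are hypothetical; real candidate lists would come from the ranked trans-co-occurrences.

```python
def align(sentence1, sentence2, ranked_candidates):
    """Link each word of sentence1 to its best-ranked candidate in sentence2.

    sentence1, sentence2: lists of tokens
    ranked_candidates: dict mapping a word to its trans-co-occurrences,
                       best first
    Returns a list of (word, linked_word, rank); unaligned words get
    (word, None, None).
    """
    present = set(sentence2)
    links = []
    for word in sentence1:
        for rank, candidate in enumerate(ranked_candidates.get(word, []), start=1):
            if candidate in present:
                links.append((word, candidate, rank))
                break
        else:
            # No candidate of this word occurs in sentence 2
            links.append((word, None, None))
    return links

# Toy candidate lists, invented for illustration
ranked = {"Landwirtschaft": ["agriculture", "agricultural"], "nur": ["only"]}
links = align(
    ["Landwirtschaft", "stellt", "nur"],
    ["agricultural", "policy", "only"],
    ranked,
)
```

The rank stored with each link is the position of the linked word in the candidate list. Since each word is linked independently, several words of sentence 1 may be linked to the same word of sentence 2.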

19 Alignment: Example 1
Red words: no alignment
Blue arrows: errors
Arrow index: rank in trans-co-occurrences
Die Landwirtschaft stellt nur 5,5 % der Arbeitsplätze der Union.
Agriculture only provides 5.5 % of employment in (the Union)

20 Alignment: Example 2
Grey arrows: multiple alignments for frequent words.
Indem wir den Mitgliedstaaten für die Umsetzung der Richtlinie kein spezifisches Datum setzen,
By not setting a specific date (for the) Member States (to implement) the directive
sondern ihnen einen Zeitraum von drei Monaten nach Inkrafttreten der Richtlinie zugestehen,
and instead giving them a period of three months after its (entry into force),
führen wir eine Flexibilitätsklausel ein,
we are introducing a flexibility clause
die eine unverzügliche Umsetzung gewährleistet.
which ensures that the directive will be implemented without delay

21 Further work
Dictionary acquisition:
- document-level aligned texts
- weakly parallel texts or corpora
Alignment:
- Dealing with cognates
- Symmetric alignment
- Alignment of phrases and multiword units

23 References
(Biemann et al 2004) Biemann, Chr.; Bordag, S.; Heyer, G.; Quasthoff, U.; Wolff, Chr.: Language-independent Methods for Compiling Monolingual Lexical Data, Proceedings of CICLing 2004, Seoul, Korea and Springer LNCS 2945, pp , Springer Verlag Berlin Heidelberg
(Sahlgren 2004) Sahlgren, M. (2004): Automatic Bilingual Lexicon Acquisition Using Random Indexing of Aligned Bilingual Data, Proceedings of LREC-2004, Lisboa, Portugal
(Koehn 2002) Koehn, P. (2002): Europarl: A multilingual corpus for evaluation of machine translation,

24 Alignment Evaluation
Strong's numbers in the Bible, English-Russian

25 Alignment Evaluation
Strong's numbers in the Bible, English-German