1 Disentangling from Babylonian Confusion – Unsupervised Language Identification Chris Biemann, Sven Teresniak University of Leipzig, Germany Cicling-05,


1 Disentangling from Babylonian Confusion – Unsupervised Language Identification Chris Biemann, Sven Teresniak University of Leipzig, Germany Cicling-05, Mexico City February 18, 2005

2 Outline
1. Review: Supervised Language Identification
2. Co-occurrence graphs
   - Co-occurrences
   - Visualizing co-occurrences
3. Chinese Whispers Algorithm
   - Finding words of the same language
4. Sorting text by language
   - Evaluation and limitations

3 Review: Supervised Language Identification
- Needs training
- Operates on letter n-grams or common words as features
- Works almost error-free for texts of 500 letters or more
Drawbacks:
- Does not work for previously unknown languages
- Danger of misclassifying instead of reporting "unknown"
Example:
- "xx xxx x xxx …" classified as Nepali
- "öö ö öö ööö …" classified as Persian

4 Co-occurrence Statistics
- Co-occurrence: occurrence of two or more words within a well-defined unit of information (sentence, nearest neighbors, window, ...)
- Significant co-occurrences reflect relations between words.
- Significance measure: log-likelihood
- In the following, sentence-based co-occurrence statistics are used.
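The formula for the significance measure is not part of this transcript. As a rough sketch, assuming the standard log-likelihood ratio (G²) over a 2x2 contingency table of sentence counts, the significance of a word pair could be computed as below. The function name `log_likelihood` and its parameters are illustrative and not taken from the slides.

```python
import math


def log_likelihood(n_ab, n_a, n_b, n):
    """G^2 statistic for words A and B (assumed form, not from the slides).

    n_ab: sentences containing both words
    n_a, n_b: sentences containing A resp. B
    n: total number of sentences
    """
    observed = [
        n_ab,                  # A and B
        n_a - n_ab,            # A without B
        n_b - n_ab,            # B without A
        n - n_a - n_b + n_ab,  # neither
    ]
    expected = [
        n_a * n_b / n,
        n_a * (n - n_b) / n,
        (n - n_a) * n_b / n,
        (n - n_a) * (n - n_b) / n,
    ]
    # Cells with an observed count of zero contribute nothing to the sum.
    return 2.0 * sum(o * math.log(o / e)
                     for o, e in zip(observed, expected) if o > 0)


if __name__ == "__main__":
    print(log_likelihood(10, 10, 10, 100))  # words that always occur together
    print(log_likelihood(1, 10, 10, 100))   # counts that match independence
```

Note that this statistic is large for pairs that co-occur much less often than expected as well as for pairs that co-occur much more often, so a real pipeline would additionally check the direction of the deviation.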

5 Co-occurrence Graphs
The entirety of all significant co-occurrences is a co-occurrence graph G(V,E) with
- V: vertices = words
- E: edges (v1, v2, s), where v1, v2 are words and s is the significance value
The co-occurrence graph is
- weighted
- undirected
Small-world property
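A minimal sketch of how such an edge list might be built from sentences. As a simplifying assumption, the raw sentence co-occurrence count stands in for the significance value s, and `min_count` is a hypothetical threshold; the slides use a log-likelihood significance instead.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_graph(sentences, min_count=2):
    """Return edges (v1, v2, s) of a sentence-based co-occurrence graph.

    Assumption: s is the number of sentences containing both words,
    used here as a placeholder for a proper significance measure.
    """
    pair_counts = Counter()
    for sentence in sentences:
        # Each word counts once per sentence; sorting makes pairs canonical,
        # which keeps the graph undirected.
        words = sorted(set(sentence.lower().split()))
        pair_counts.update(combinations(words, 2))
    return [(v1, v2, count)
            for (v1, v2), count in pair_counts.items()
            if count >= min_count]


if __name__ == "__main__":
    corpus = ["the cat sleeps", "the cat eats", "der hund schläft"]
    print(cooccurrence_graph(corpus))
```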

6 Chinese Whispers - Motivation
- (Small-world) graphs consist of regions with a high clustering coefficient and hubs that connect those regions.
- The nodes within a cluster region should all be assigned the same label.
- Every node gets a label and whispers it to its neighbouring nodes.
- A node changes to a label if most of its neighbours whisper this label, or it invents a new one.
- Under the assumption that strongly connected nodes are semantically close, meaningful clusters should emerge.

7 Chinese Whispers Algorithm
Assign different labels to every node in the graph;
For iteration i from 1 to total_iterations {
  mutation_rate = 1/(i^2);
  For each word w in the graph {
    new_label of w = highest ranked label in neighbourhood of w;
    with probability mutation_rate: new_label of w = new class label;
  }
  labels = new_labels;
}
- Graph clustering algorithm
- Linear time in the number of nodes
- Random mutation can be omitted, but it showed better results for small graphs
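The pseudocode above can be sketched in Python roughly as follows. This is one reading of the slide under several assumptions: edges are (v1, v2, s) triples, "highest ranked" means the largest sum of edge weights per label, ties go to the smallest label, and labels are updated synchronously at the end of each iteration as written on the slide. The function name and the `mutate` and `seed` parameters are illustrative.

```python
import random
from collections import defaultdict


def chinese_whispers(edges, total_iterations=20, mutate=True, seed=0):
    """Cluster an undirected weighted graph given as (v1, v2, s) triples."""
    rng = random.Random(seed)
    graph = defaultdict(dict)
    for v1, v2, s in edges:
        graph[v1][v2] = s
        graph[v2][v1] = s

    # Assign a different integer label to every node.
    labels = {node: i for i, node in enumerate(sorted(graph))}
    next_label = len(labels)

    for i in range(1, total_iterations + 1):
        mutation_rate = 1.0 / (i * i)
        new_labels = {}
        for node, neighbours in graph.items():
            # Sum the edge weights per label over the neighbourhood.
            sums = defaultdict(float)
            for neighbour, s in neighbours.items():
                sums[labels[neighbour]] += s
            # Highest sum wins; ties go to the smallest label (assumption).
            best = max(sums, key=lambda label: (sums[label], -label))
            if mutate and rng.random() < mutation_rate:
                best = next_label  # invent a new class label
                next_label += 1
            new_labels[node] = best
        labels = new_labels  # synchronous update, as on the slide
    return labels


if __name__ == "__main__":
    triangles = [("a", "b", 5), ("b", "c", 5), ("a", "c", 5),
                 ("x", "y", 5), ("y", "z", 5), ("x", "z", 5),
                 ("c", "x", 1)]  # weak bridge between two dense regions
    print(chinese_whispers(triangles, mutate=False))
```

With a purely synchronous update, some graphs never settle (two nodes joined by a single edge simply swap labels each round), so an implementation may prefer to update labels in place while visiting nodes in random order.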

8 Assigning New Labels
Node A changes label from L1 to L3: Sum(L3)=9; Sum(L4)=8; Sum(L2)=5
Other strategies result in different kinds of partitioning:
- threshold for share
- weighting by node degrees
Figure labels: A L1->L3, D L2, E L3, B L4, C L

9 Chinese Whispers on 7 Languages

10 Chinese Whispers on 7 languages

11 Assigning languages to sentences
- Use a word-based language identification tool
- The largest clusters form word lists for the different languages
- A sentence is assigned a cluster label if
  - it contains at least 2 words from the cluster and
  - not more words from another cluster
Questions for evaluation:
- Up to what number of languages is that possible?
- How much can the corpus be biased?
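The assignment rule above might look like this in code. This is a sketch rather than the authors' tool: tokenisation by whitespace, lowercasing, and resolving ties in favour of the first cluster listed are all assumptions, and `assign_cluster` and `word_lists` are illustrative names.

```python
def assign_cluster(sentence, word_lists, min_hits=2):
    """Return the cluster label for a sentence, or None if unclassified.

    word_lists maps a cluster label to the set of words in that cluster.
    """
    if not word_lists:
        return None
    tokens = sentence.lower().split()
    hits = {label: sum(1 for token in tokens if token in words)
            for label, words in word_lists.items()}
    # Pick the cluster with the most hits, so no other cluster has more;
    # on a tie the first cluster in the dictionary wins (assumption).
    best = max(hits, key=hits.get)
    return best if hits[best] >= min_hits else None


if __name__ == "__main__":
    lists = {"cluster_1": {"the", "is", "and"},
             "cluster_2": {"der", "ist", "und"}}
    print(assign_cluster("Der Hund ist hier", lists))  # two hits in cluster_2
    print(assign_cluster("the cat", lists))            # only one hit
```

Very short sentences rarely reach two hits, which matches the slides' observation that headlines and enumerations tend to stay unclassified.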

12 Evaluation: Mix of 7 languages
- Languages used: Dutch, Estonian, English, French, German, Icelandic and Italian
- At least 100 sentences per language are necessary for consistent clusters

13 Common mistakes
Unclassified:
- mostly enumerations of sports teams
- very short sentences, e.g. headlines
- legal act ciphers in the Estonian case, e.g. jõust RT I 1996, 89, 1590
Misclassified: mixed-language sentences, such as
- French: Frönsku orðin "cinéma vérité" þýða "kvikmyndasannleikur"
- English: Die Beatles mit "All you need is love".

14 Evaluation: Bilingual biased
- Language pairs used: English-Estonian, French-Italian, Dutch-German
- 1st language varied between ‘000 sentences, 2nd language 100‘000 sentences
- A factor of up to 200 does not result in deterioration
- Above factor 200, the 1st-language cluster is not distinguishable in size from 2nd-language "noise"

15 Conclusion
- Unsupervised language identification is possible
- It cannot name the languages; it only sorts text by language
- It works for previously undescribed languages, even for dialects
- Accuracy on sentences (here about 120 characters) is comparable to supervised approaches
- When classifying documents, there should be virtually no errors
- Time-linear graph-clustering algorithm

16 Questions? THANK YOU!

17 Small Cooccurring Worlds
Assumed structure of co-occurrence graphs: scale-free small worlds
- short path length between nodes
- high clustering coefficient
- power-law distribution of node degrees
- power-law distribution of component sizes
Node degree: number of (outgoing) edges
Component: connected set of nodes
Power-law distributions are easy to plot.
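To inspect these two distributions on a concrete graph, node degrees and component sizes can be collected as in the sketch below. It assumes an undirected graph given as a list of (v1, v2) pairs, and the function name is illustrative. Plotting the resulting frequencies on log-log axes is the usual way to eyeball a power law.

```python
from collections import Counter, defaultdict


def degrees_and_component_sizes(edges):
    """Return (degree per node, component sizes in descending order)."""
    adjacency = defaultdict(set)
    for v1, v2 in edges:
        adjacency[v1].add(v2)
        adjacency[v2].add(v1)

    degrees = {node: len(neighbours) for node, neighbours in adjacency.items()}

    # Depth-first search to find the connected components.
    seen = set()
    sizes = []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        size = 0
        while stack:
            node = stack.pop()
            size += 1
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        sizes.append(size)
    return degrees, sorted(sizes, reverse=True)


if __name__ == "__main__":
    degrees, sizes = degrees_and_component_sizes(
        [("a", "b"), ("b", "c"), ("x", "y")])
    print(Counter(degrees.values()))  # how many nodes have each degree
    print(sizes)
```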

18 Strategies for adopting a colour
A node changes its colour to a new colour from its neighbourhood if that colour
(1) occurs with the highest significance sum (top)
(2) occurs with the highest significance sum weighted by node degree (a - linear, b - logarithmic) (dist)
(3) occurs with the highest significance sum and its share lies above a certain threshold (vote)
Figure labels: A L1, D L2, E L3, B L4, C L; deg=1, deg=2, deg=3, deg=5, deg=4
Example: colouring A
(1): Sum(L3)=9; Sum(L4)=8; Sum(L2)=5
(2a): wSum(L2)=5; wSum(L4)=4; wSum(L3)=2.2
(2b): wSum(L4)=7.28; wSum(L3)=5.51; wSum(L2)=3.46
(3): nSum(L3)=0.409; nSum(L4)=0.363; nSum(L2)=0.227
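The strategies can be compared on a small neighbourhood. The edge weights, the assignment of degrees to nodes, and the label of C in the sketch below are assumptions, since the slide's figure did not survive; they were chosen because they reproduce the sums given for (1), (2a) and (3). The logarithmic variant (2b) is left out because the slide does not state its exact formula. All names in the code are illustrative.

```python
from collections import defaultdict

# Assumed neighbourhood of node A: (neighbour, label, edge weight, degree).
# These values are reconstructed, not taken from the slide's figure.
NEIGHBOURS = [
    ("D", "L2", 5.0, 1),
    ("B", "L4", 8.0, 2),
    ("E", "L3", 3.0, 3),
    ("C", "L3", 6.0, 5),
]


def label_scores(neighbours, strategy="top"):
    """Score each label: 'top' = weight sum, 'dist' = weight / degree,
    'vote' = share of the total weight sum."""
    scores = defaultdict(float)
    for _name, label, weight, degree in neighbours:
        scores[label] += weight / degree if strategy == "dist" else weight
    if strategy == "vote":
        total = sum(scores.values())
        return {label: score / total for label, score in scores.items()}
    return dict(scores)


def choose_label(neighbours, strategy="top", threshold=0.0):
    """Return the winning label, or None if 'vote' misses the threshold."""
    scores = label_scores(neighbours, strategy)
    best = max(scores, key=scores.get)
    if strategy == "vote" and scores[best] < threshold:
        return None  # the node keeps its current label
    return best


if __name__ == "__main__":
    for name in ("top", "dist", "vote"):
        print(name, label_scores(NEIGHBOURS, name))
```

Under these assumed values the plain sum picks L3, while linear degree weighting picks L2, because the single low-degree neighbour D outweighs the two high-degree L3 neighbours.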

19 7 Clusters – 7 languages
68701:(3792) [...] a-t-elle, a-t-il, a-t-on, aanval, abandonné, abattu, abattus, aborder, abords, abouti, absolu, absolue, acceptent, accepter, accepté, accessibles, accession, accord, accords, accordé, accordée, accusation, accusations, accuse, accusé, accusée, acheter, achevé, achevée, acte, actes, actifs, action, actionnaires, actions, actions-suicides, activement, activiste, activistes, activités, actuelle, actuellement, adeptes, adjoint, admettre, administratif, admis, [...]
80266:(3616) [...] a, abandoned, able, ablösen, aboard, abortion, abortions, about, above, abroad, absence, absolute, absolutely, accepted, accessible, accident, accidents, acclaim, according, accounting, accused, accusing, acid, acidic, acknowledged, acquire, acquistare, acre, acres, across, act, acting, active, activist, activists, acts, actually, added, addicts, adding, additional, address, administration, administrator, admitted, adopt, adopted, adoption, adults, advance,
68952:(3312) [...] abbandonato, abbastanza, abbia, abbiamo, abbiano, abile, abitante, abitanti, abitazioni, abruzzesi, accade, accaduto, accenno, accertare, accertato, accesso, accoglienza, accolto, accordi, accordo, accorta, accorti, accusa, acquisito, ad, addetti, addirittura, addosso, adesso, adottata, aereo, affari, affermato, affetto, affidare, affidato, affiliati, affiliato, affonda, affrontare, affronteranno, agenti, agenzie, agevolare, aggiunge, aggiungere, aggiunto, agli [...]
75760:(3249) [...] af, afar, afgreiðslutíma, afl, afla, aflaheimilda, aflaheimildum, aflann, aflaverðmætið, afli, aflinn, aflýst, afnema, afnotagjalda, afnotagjaldið, afnotagjöld, afnotagjöldin, afnotagjöldum, afrek, afráðið, afstýra, aftur, afurðaverðs, afurðum, aka, al-Qaeda, al-Zawahri, ala, aldar, aldrei, aldri, aldur, allan, allar, allir, allra, allri, alls, allt, alltaf, alltof, allur, almannafjölmiðla, almannamiðla, almenna, almennt, altari, alveg, andvirði, annan, annar, annarra [...]
81089:(2894) [...] an, aandacht, aandachtsgebied, aangehouden, aangekeken, aangenomen, aangepakt, aangesloten, aangevuld, aangewezen, aangezien, aanleiding, aanmerking, aanpak, aanslag, aansluiting, aantal, aantrekkelijk, aanvankelijk, aanvragen, aanwezige, aanwezigheid, aanwijsbaar, aanwijzingen, aanzien, aarde, aardige, abortuspil, abortuswetgeving, acceptabel, achtduizend, achter, achtergrond, achterhalen, achterover, actie, actieve, actuele [...]
68872:(2791) [...] ab, abend, aber, abermals, abgebaut, abgelaufenen, abgeschlossen, abgeschlossenen, abschneiden, abzuwarten, acht, achtzehn, achtziger, afghanischen, akzeptieren, allein, allem, allen, allenfalls, aller, allerdings, allgemein, allgemeinen, alt, alte, alten, alter, am, amerikanische, amerikanischen, amerikanischer, anderen, anderer, anerkennen, angedroht, angegeben, angehende, angekündigt, angenommen, angesichts, angestellt, angetastet [...]
72602:(2247) [...] aadress, aadressi, aadressil, aadressina, aasta, aastaaruande, aastabilansi, aastabilanss, aastaks, aastal, aastas, aastat, aastate, aeg, aegumistähtaeg, aga, agressiooni, ainuaktsionäri, ainult, ainuõigus, ajaks, ajal, ajast, ajutiselt, akt, aktid, aktide, aktsia, aktsiad, aktsiaid, aktsiakapital, aktsiakapitali, aktsiakapitalist, aktsiaraamatusse, aktsiaselts, aktsiaseltsi, aktsiaseltsiga, aktsiaseltsil, aktsiaseltsile, aktsiat, aktsiate, aktsiatega [...]
60154:(195) [...] afferma, créée, dimostrano, dom, domicilio, dovranno, escl, esclusi, escluso, feriale, festivi, festivo, fóru, gleðilegs, gratuita, gravidanza, incarico, intero, inv, io, jr, jäetud, jõust, liðna, lõige, lõiked, nere, næstir, pagamento, punktides, sab, saper, scenario, servono, sindacale, sjálfsm, socc, soccorso, spiegato, spilurum, techniques, ventanni, warnte, zuständige [...]
[...]
84023:(3) Inter, Mailand, Ronaldo