Presentation transcript:

Unsupervised Part-of-Speech Tagging with Bilingual Graph-Based Projections. Dipanjan Das (Carnegie Mellon University) and Slav Petrov (Google Research), ACL 2011, June 21. Good morning. I am Dipanjan Das, a PhD student at Carnegie Mellon University. Welcome to this talk about learning part-of-speech taggers in a bilingual setting. More precisely, we learn taggers from unannotated data for a target language, assuming a resource-rich source language with which the target language has parallel data. This is joint work with Slav Petrov, completed during an internship at Google Research.

Part-of-Speech Tagging. Portland has a thriving music scene . (NOUN VERB DET ADJ NOUN NOUN .) We are all familiar with the part-of-speech tagging task. Given a sentence such as "Portland has a thriving music scene", our goal is to tag each of its words with a disambiguated part-of-speech category. For example, here "thriving" is an adjective, because it modifies the noun phrase "music scene"; it is not a verb, another syntactic category that the word "thriving" can take.

Supervised POS Tagging. To confirm that these universal tags work well across languages, we trained supervised models for these 20-odd languages, and the average accuracy turns out to be around 96%. This indicates that the universal tags behave in the desired way for all of these languages. However, we want to build POS taggers for many more than 20 languages. Supervised setting: average accuracy is 96.2% with TnT (Brants, 2000).

Resource-Poor Languages. Several major languages have little or no annotated data, e.g. Punjabi, Vietnamese, Polish, Oriya, Indonesian-Malay, Azerbaijani, and Haitian, with native-speaker populations ranging from 7.7 million to 109 million (see http://www.ethnologue.org/ethno_docs/distribution.asp?by=size). The problem, then, is that these seven languages, with millions of speakers each, have little or no tagged data with which supervised NLP tools could be built. However, they do have large amounts of data translated into a resource-rich language like English, for which we have supervised taggers. Our goal is to build POS taggers for such languages, because basic NLP tools like POS taggers are essential for the development of language technologies.

(Nearly) Universal Part-of-Speech Tags: VERB, DET, NOUN, CONJ, PRON, NUM, ADJ, PRT, ADV, ., ADP, X. (Speaker note: talk about open-class vs. closed-class tags.)

(Nearly) Universal Part-of-Speech Tags. Example Penn Treebank tag maps: NN → NOUN, NNP → NOUN, NNPS → NOUN, NNS → NOUN; PRP → PRON, PRP$ → PRON, WP → PRON, WP$ → PRON. We now give a brief idea of how these tags were constructed. For example, the four fine-grained noun tags of the original Penn Treebank were collapsed to the coarse NOUN tag for English, and similarly the fine-grained pronoun categories map to the coarse PRON category. Example Spanish Treebank tag maps: np → NOUN, nc → NOUN; p0, pd, pe, pi, pn, pp, pr, pt, px → PRON. The Spanish treebank contains two fine-grained noun tags that collapse to the universal NOUN tag, and nine pronoun tags that map to PRON. Thus, for different languages, different numbers of fine-grained tags map to each coarse universal POS tag. The detailed design procedure and a set of experiments with these tags are described in a short report uploaded to arXiv; see Petrov, Das and McDonald (2011).
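To make the collapsing step concrete, here is a minimal Python sketch of applying such a tag map; the dictionary covers only the Penn Treebank entries quoted above, and the helper name is illustrative, not from the paper.

# Applying a universal tag map (illustrative sketch, not the authors' code).
PTB_TO_UNIVERSAL = {
    "NN": "NOUN", "NNP": "NOUN", "NNPS": "NOUN", "NNS": "NOUN",
    "PRP": "PRON", "PRP$": "PRON", "WP": "PRON", "WP$": "PRON",
    # ... the remaining fine-grained tags map to the other coarse tags
}

def to_universal(fine_tags):
    """Collapse a sequence of fine-grained treebank tags to universal tags."""
    return [PTB_TO_UNIVERSAL[t] for t in fine_tags]

print(to_universal(["NNP", "NNS", "PRP$"]))  # ['NOUN', 'NOUN', 'PRON']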

(Nearly) Universal Part-of-Speech Tags. The same coarse tags apply across languages, and others have also used universal tags:
Portland has a thriving music scene . (NOUN VERB DET ADJ NOUN NOUN .)
Portland hat eine prächtig gedeihende Musikszene . (tags shown include NOUN, VERB, DET, ADJ, .)
পোর্টল্যান্ড শহর এর সঙ্গীত পরিবেশ বেশ উন্নত | (tags shown include NOUN, ADP, ADJ, .)

State of the Art in Unsupervised POS Tagging. Before we go into the details of our proposed method, I will talk briefly about some state-of-the-art models in unsupervised part-of-speech tagging, i.e. learning taggers in a scenario where there is no annotated data.

Unsupervised Part-of-Speech Tagging. Hidden Markov Model (HMM) estimated with the Expectation-Maximization algorithm. The basic model used for POS tagging in various learning scenarios is the hidden Markov model. In this framework, we denote the observation sequence (the sentence) as bold x, and the state sequence that emits each token in the sentence as bold y. The HMM inference problem for POS tagging is to find the best state sequence for a given sentence. Example: Portland hat eine prächtig gedeihende Musikszene . (Merialdo, 1994)

Unsupervised Part-of-Speech Tagging. The question mark at each state position denotes one of our 12 coarse universal tags. (Merialdo, 1994)

Unsupervised Part-of-Speech Tagging. Transition multinomials: a bigram HMM consists of a set of transition multinomial distributions, which encode the probability of transitioning from one state to the next. (Merialdo, 1994)

Unsupervised Part-of-Speech Tagging. Emission multinomials: the HMM also consists of a set of emission multinomial distributions, which give the probability of emitting an observation type x_i given the state y_i at position i (see the sketch below). The simplest and one of the oldest unsupervised learning methods for training an HMM is the Expectation-Maximization algorithm; we trained unsupervised POS taggers for eight European languages with EM. (Merialdo, 1994)
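As a concrete reference, here is a minimal bigram-HMM sketch in Python; the toy parameter values are invented for illustration and are not the authors' estimates.

import math

# transition[y_prev][y] = p(y | y_prev); emission[y][x] = p(x | y).
transition = {"DET": {"NOUN": 0.7, "ADJ": 0.3},
              "ADJ": {"NOUN": 0.9, "ADJ": 0.1},
              "NOUN": {"NOUN": 0.2, "ADJ": 0.1, "DET": 0.7}}
emission = {"DET": {"eine": 0.5, "the": 0.5},
            "ADJ": {"gedeihende": 1.0},
            "NOUN": {"Musikszene": 0.3, "Portland": 0.7}}

def log_joint(words, tags, start="NOUN"):
    """log p(x, y) under the bigram HMM, with a fixed start state."""
    lp, prev = 0.0, start
    for x, y in zip(words, tags):
        lp += math.log(transition[prev][y]) + math.log(emission[y][x])
        prev = y
    return lp

print(log_joint(["eine", "gedeihende", "Musikszene"], ["DET", "ADJ", "NOUN"]))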

Unsupervised Part-of-Speech Tagging. However, results are quite poor across all these languages, a result previously noted (Johnson, 2007). We would definitely want to improve over these models.

           Danish  Dutch  German  Greek  Italian  Portuguese  Spanish  Swedish  Average
EM-HMM       68.7   57.0    75.9   65.8     63.7        62.9     71.5     68.4     66.7

Unsupervised Part-of-Speech Tagging. Hidden Markov Model (HMM) with locally normalized log-linear models. Recently, Berg-Kirkpatrick and colleagues proposed an HMM in which the emission multinomials are replaced by locally normalized log-linear models. (Berg-Kirkpatrick et al., 2010)

Unsupervised Part-of-Speech Tagging. These log-linear models look at features corresponding to various aspects of the observation x_i, along with the state y_i: whether the observation contains a hyphen, the nature of its capitalization, whether the token contains digits, its suffix, and so forth. (Berg-Kirkpatrick et al., 2010)

Unsupervised Part-of-Speech Tagging. Estimated using gradient-based methods: when the parameters are estimated with a gradient-based method, L-BFGS in particular, this model yields much better scores than vanilla EM for English. (Berg-Kirkpatrick et al., 2010)
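As a sketch of what "locally normalized log-linear emissions" means, the snippet below replaces an emission multinomial with a feature-based model normalized over the vocabulary for each tag; the feature set and weights are simplified illustrations, not the exact features of Berg-Kirkpatrick et al. (2010).

import math

def features(word, tag):
    # A few of the kinds of features the narration mentions.
    return [f"suffix2={word[-2:]}|{tag}",
            f"hyphen={'-' in word}|{tag}",
            f"cap={word[:1].isupper()}|{tag}",
            f"digit={any(c.isdigit() for c in word)}|{tag}"]

def emission_prob(word, tag, weights, vocab):
    """p(word | tag) = exp(w.f(word, tag)) / sum over w' in vocab of exp(w.f(w', tag))."""
    score = lambda w: sum(weights.get(f, 0.0) for f in features(w, tag))
    z = sum(math.exp(score(w)) for w in vocab)
    return math.exp(score(word)) / z

vocab = ["Portland", "hat", "eine", "Musikszene"]
weights = {"cap=True|NOUN": 1.5, "suffix2=at|VERB": 1.0}  # toy weights
print(emission_prob("Portland", "NOUN", weights, vocab))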

Unsupervised Part-of-Speech Tagging. We trained this feature-rich HMM for our eight languages to confirm the improvement. The table shows consistent improvements over the EM-based HMM across all languages, more than 6% absolute on average. We use this model in our work, as we will see in the following slides.

              Danish  Dutch  German  Greek  Italian  Portuguese  Spanish  Swedish  Average
EM-HMM          68.7   57.0    75.9   65.8     63.7        62.9     71.5     68.4     66.7
Feature-HMM     69.1   65.1    81.3   71.8     68.1        78.4     80.2     70.1     73.0

Unsupervised POS Tagging with Dictionaries. Hidden Markov Model (HMM) with locally normalized log-linear models, state space constrained by the possible gold tags. A POS tag dictionary contains the possible POS tags that a particular token can assume; for example, the third word "eine" in the sentence "Portland hat eine prächtig gedeihende Musikszene ." can be a pronoun, determiner, adjective, or numeral (PRON, DET, ADJ, NUM). Incorporating a tag dictionary into an HMM is straightforward: the HMM's state space is hard-constrained so that only states associated with the current token can emit it, as sketched below. We ran an experiment where we constructed tag dictionaries for each language from its treebank and trained a feature-rich HMM for each of our eight languages.
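A minimal sketch of the dictionary constraint (illustrative data, not the treebank-derived dictionaries used in the experiments): during decoding, a word may only be emitted by states its dictionary entry licenses, falling back to the full tag set for unknown words.

ALL_TAGS = ["VERB", "DET", "NOUN", "CONJ", "PRON", "NUM",
            "ADJ", "PRT", "ADV", ".", "ADP", "X"]
tag_dict = {"eine": {"PRON", "DET", "ADJ", "NUM"}, ".": {"."}}  # toy entries

def allowed_tags(word):
    """State space for one position: dictionary entry, or all tags if absent."""
    return tag_dict.get(word, set(ALL_TAGS))

for w in ["Portland", "eine", "."]:
    print(w, sorted(allowed_tags(w)))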

Unsupervised POS Tagging with Dictionaries. These feature-HMMs using the treebank-based dictionaries perform very well: more than 20% absolute improvement over the purely unsupervised feature-HMMs, with accuracies even quite close to their supervised counterparts.

                              Danish  Dutch  German  Greek  Italian  Portuguese  Spanish  Swedish  Average
EM-HMM                          68.7   57.0    75.9   65.8     63.7        62.9     71.5     68.4     66.7
Feature-HMM                     69.1   65.1    81.3   71.8     68.1        78.4     80.2     70.1     73.0
Feature-HMM w/ gold dictionary  93.1   94.7    93.5   96.6     96.4        94.0     95.8     85.5     93.7

For most languages without annotated data, however, access to high-quality tag dictionaries is not realistic. Although dictionaries may exist in hard-copy form, for morphologically rich languages only base forms are present in them, which causes problems: when we tried to use such dictionaries to build POS taggers, their deficiencies prevented us from building accurate ones. Instead, we resort to the following ideas for developing good unsupervised POS taggers: use supervision in a resource-rich language with annotated data, which we call the source; use parallel data, i.e. translations of the language of interest, which we call the target, into the source language; and construct projected tag lexicons from the source for the target.

Bilingual Projection. Portland has a thriving music scene . → NOUN VERB DET ADJ . (automatic labels from a supervised tagger, 97% accuracy). Given a sentence in English, we can use existing POS taggers to tag it automatically. For English, automatic POS taggers using the coarse tags reach around 97% accuracy, which is quite good for projection purposes.

Bilingual Projection. For many languages, translations from a resource-rich language like English exist. Given a large amount of such translations, automatic unsupervised alignments are easy to obtain; these alignments are available for more than 50 languages. Example: Portland has a thriving music scene . / Portland hat eine prächtig gedeihende Musikszene .

Bilingual Projection. Baseline 1: direct projection (Yarowsky and Ngai, 2001). Our first idea uses these alignments directly for projection. We use parallel data between a source language, always English in our experiments, and a target language. We run a word aligner and tag the English side with a supervised POS tagger for all sentences in a bitext corpus. Then, for the aligned words of each target sentence, we project the tags from the source side, using only high-confidence alignments and choosing at most one alignment per target word. Unaligned words receive the NOUN tag, the most frequent tag, as in the sketch below.
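A sketch of this token-level projection; the alignment encoding (target index to source index) is an assumption for illustration, not the paper's data format.

def project_tags(source_tags, alignment, target_len, default="NOUN"):
    """Copy the aligned source tag to each target token; unaligned -> NOUN."""
    return [source_tags[alignment[i]] if i in alignment else default
            for i in range(target_len)]

src_tags = ["NOUN", "VERB", "DET", "ADJ", "NOUN", "NOUN", "."]  # tagged English side
align = {0: 0, 1: 1, 2: 2, 4: 3, 5: 4, 6: 6}  # target word 3 left unaligned
print(project_tags(src_tags, align, 7))
# ['NOUN', 'VERB', 'DET', 'NOUN', 'ADJ', 'NOUN', '.']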

Bilingual Projection. Once projection is done for all the sentences, we use the projected data to perform supervised training, for example with the TnT tagger (Brants, 2000), yielding a POS tagger. (Yarowsky and Ngai, 2001)

Bilingual Projection. Baseline 1: direct projection (Yarowsky and Ngai, 2001). The results for this model are shown in the third row of the table: consistent improvements over the unsupervised models.

                   Danish  Dutch  German  Greek  Italian  Portuguese  Spanish  Swedish  Average
EM-HMM               68.7   57.0    75.9   65.8     63.7        62.9     71.5     68.4     66.7
Feature-HMM          69.1   65.1    81.3   71.8     68.1        78.4     80.2     70.1     73.0
Direct projection    73.6   77.0    83.2   79.3     79.7        82.6     80.1     74.7     78.8

Bilingual Projection. Baseline 2: lexicon projection. The second idea we explore in this work: instead of projecting across tokens in the parallel corpus, we project across types. We call this lexicon projection.

Bilingual Projection. Given such an alignment of the tagged source sentence (Portland/NOUN has/VERB a/DET thriving/ADJ music/NOUN scene/NOUN ./.) to the target sentence (Portland hat eine prächtig gedeihende Musikszene .), we consider each aligned source-target token pair separately.

Bilingual Projection. Target words that are unaligned are simply ignored.

Bilingual Projection. Thus, we get a bag of alignments for each sentence pair.

Bilingual Projection. In the next step, for each target word aligned to an auto-tagged source word, we project the tag to the target word. In other words, we get a tag distribution on each target word, initially skewed towards a single tag.

Bilingual Projection. As we scan the bitext, the target words get aligned to more tagged source words, and we update their projected distributions. For example, here "eine" is aligned to words tagged DET (a), PRON (one), and NUM (one), so its tag distribution has equal weight on each of these three tags.

Bilingual Projection. Similarly, the word "gedeihende", having aligned to two senses of the word "thriving" (thriving/ADJ and thriving/VERB), has a distribution with equal weights on the adjective and verb tags.

Bilingual Projection. After scanning all the parallel data, we get a POS tag distribution, i.e. the probability of a tag given a word, for each aligned word type in the bitext: Portland, gedeihende, hat, eine, Musikszene, ., and so on. We convert this set of tag distributions into a dictionary using a threshold τ: for a given word, every tag whose probability exceeds τ is entered into the dictionary, as sketched below.
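A minimal sketch of this type-level projection; the counts and the threshold value below are illustrative, and the paper's τ is a tuned parameter.

from collections import Counter, defaultdict

counts = defaultdict(Counter)  # target word type -> Counter over projected tags

def observe(target_word, source_tag):
    counts[target_word][source_tag] += 1

def build_dictionary(tau):
    """Keep, per word, every tag whose relative frequency exceeds tau."""
    lexicon = {}
    for word, c in counts.items():
        total = sum(c.values())
        lexicon[word] = {t for t, n in c.items() if n / total > tau}
    return lexicon

for tag in ["DET", "NUM", "PRON"]:   # 'eine' aligned to a/DET, one/NUM, one/PRON
    observe("eine", tag)
observe("gedeihende", "ADJ"); observe("gedeihende", "VERB")
print(build_dictionary(tau=0.2))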

Bilingual Projection. Baseline 2: a feature-HMM constrained with the projected dictionary. When this dictionary is used to constrain the state space of a feature-based HMM and the model is trained, we get quite good results: for 6 out of 8 languages we see improvements over the three previous models, and the improvement over the purely unsupervised feature-HMM is more than 8% absolute on average.

                   Danish  Dutch  German  Greek  Italian  Portuguese  Spanish  Swedish  Average
EM-HMM               68.7   57.0    75.9   65.8     63.7        62.9     71.5     68.4     66.7
Feature-HMM          69.1   65.1    81.3   71.8     68.1        78.4     80.2     70.1     73.0
Direct projection    73.6   77.0    83.2   79.3     79.7        82.6     80.1     74.7     78.8
Projected dictionary (row incomplete in transcript): 79.0, 82.4, 76.3, 84.8, 87.0, 82.8, 79.4

Idea: projected lexicon expansion and refinement using label propagation. One drawback of the approach just described is that it gives no information about unaligned words; we ignored them during the projection procedure. The question at this point is whether coverage can be improved. To address this problem, we explore whether the projected lexicon can be expanded and refined using large amounts of unlabeled data.

Our Model: Graph-Based Projections. How can label propagation help? For a target language, we first build a graph with 2M trigram types as vertices and compute a similarity matrix using co-occurrence statistics: distributional similarity between the trigram types, computed from a large amount of unlabeled data, gives the edge weight between a pair of trigrams. The graph is made sparse by keeping only the 5 nearest neighbors of each trigram type, as in the sketch below. The label distribution at each vertex approximates the tag distribution over the trigram's middle word; the trigram representation is chosen to reduce the ambiguity of the middle word and thus build a better graph. The graph construction over the target language follows Subramanya, Petrov and Pereira (2010), whose goal was POS domain adaptation for English.
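The following sketch shows the shape of this construction; the context features here are only the immediately neighboring words, a simplification of the richer feature set of Subramanya et al. (2010), and the corpus is assumed to be pre-tokenized.

from collections import Counter, defaultdict
import math

def trigram_vectors(sentences):
    """Sparse co-occurrence vectors for trigram types (simplified features)."""
    vec = defaultdict(Counter)
    for s in sentences:
        padded = ["<s>"] + s + ["</s>"]
        for i in range(1, len(padded) - 1):
            tri = tuple(padded[i - 1:i + 2])
            vec[tri][("left", padded[i - 1])] += 1
            vec[tri][("right", padded[i + 1])] += 1
    return vec

def cosine(a, b):
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_graph(vec, k=5):
    """Keep only each vertex's k most similar neighbors (sparse graph)."""
    return {u: sorted(((cosine(vec[u], vec[v]), v) for v in vec if v != u),
                      reverse=True)[:k]
            for u in vec}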

Example Graph in German. [Figure: trigram vertices gutem Essen zugetan; ist wichtig bei; zum Essen niederlassen; ist gut bei; fuers Essen drauf; ist fein bei; schlechtes Essen und; 1000 Essen pro; ist lebhafter bei; zu realisieren ,; zu erreichen ,; zu stecken ,; zu essen ,. The vertices with the noun reading of Essen ("food") cluster under NOUN, while "zu essen ," (the verb "to eat") clusters under VERB.]

Our Model: Graph-Based Projections. Now, to spread label distributions through this target graph over trigrams, we need to seed it. Since we assume that we have no labeled data at all for this language, we bring in parallel data: we plug in automatically tagged words from a resource-rich source language and connect these words to the target-language trigrams, where the edges between source words and target trigrams are simply word alignments.

Bilingual Graph. [Figure: the German trigram graph from the previous slide, connected by alignment edges to tagged English words: important/ADJ, good/ADJ, nicely/ADV, fine/ADJ, food/NOUN, eating/VERB, eat/VERB.]

Our Model: Graph-Based Projections. Next, we are ready to run label propagation. In the first stage, we propagate labels from the source-language words to the target-language trigrams.

First Stage of Label Propagation. [Figure: the tag distributions of the English words spread across the alignment edges into the German trigram vertices.]

Our Model: Graph-Based Projections. After this first stage of label propagation from the source-language words to the target-language trigrams, we have seed distributions over the target-language graph, so we run a second stage of label propagation within the target-language vertices. This second stage is simply the optimization of the graph objective function with squared penalties (see the label-propagation overview below).

Second Stage of Label Propagation. [Figure: distributions spread among the German trigram vertices themselves; this continues until convergence.]

Our Model: Graph-Based Projections. End result? The end result of this bilingual projection procedure, once we collapse and renormalize the tag distributions of all the target trigrams, is a large set of distributions over the unigrams of the target language: a larger set of tag distributions, and hence a better and larger dictionary. Examples shown: Portland, gedeihende, hat, eine, Musikszene, ., fein, lebhafter, realisieren.

Our Model: Graph-Based Projections. Lexicon expansion. [Chart: dictionary sizes in thousands of words.] This slide compares, for each language, the sizes of the dictionaries constructed with the projected-dictionary model and with our full model that uses graph-based learning. There is a significant expansion in lexicon size; although the lexicons are imperfect, this expansion helps explain the consistent improvements we get across languages.

Brief Overview: Graph-Based Learning with Labeled and Unlabeled Data. We adopt a graph-based semi-supervised learning technique to perform lexicon expansion, for two reasons. First, this framework lets us learn from similar data points that carry some bilingually projected knowledge, expanding our beliefs from a small set of items to a larger set. Second, it creates a probabilistic resource that is stationary, built offline, and usable for both training and inference. I will now describe the particular graph-based technique we use in this work.

[Figure: a five-vertex example graph with symmetric edge weights such as 0.9, 0.8, 0.1, 0.05, and 0.01.] Here we have five sample points. The two shaded data points are labeled; the three white ones are unlabeled. The geometry of the data is given, with edges between pairs of vertices. Each edge has a symmetric weight w_ij, computed using some similarity function; we assume this weight matrix for the entire graph is given. Note that the graph is not fully connected: the absent edges have weight 0. On the labeled vertices, the supervised label distributions are known; these are q_1 and q_5 for the given graph. Our goal is to learn q_2, q_3, and q_4, the distributions on the unlabeled vertices. (Zhu, Ghahramani and Lafferty, 2003)

Label Propagation. This transductive setup, where we want to learn the label distributions on the unlabeled vertices, is called label propagation, and was proposed by Zhu, Ghahramani and Lafferty (2003). In the current work, to perform label propagation, we minimize the objective function below.
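The objective itself was a slide graphic and did not survive in this transcript; as a reconstruction, the standard propagation objective with squared penalties (in the style of Zhu et al., 2003, and Subramanya et al., 2010, whose framework this work follows) has the form

    C(Q_U) = \sum_{i \in V_U} \sum_{j \in N(i)} w_{ij} \, \lVert q_i - q_j \rVert^2 + \nu \sum_{i \in V_U} \lVert q_i - U \rVert^2

where the q_i are the label distributions to be found, the w_ij are the symmetric edge weights, N(i) is the neighborhood of vertex i (which may include labeled vertices with fixed distributions), and U is the uniform distribution over the label set.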

Label Propagation. Here Q_U is the set of distributions over the unlabeled vertices, which we need to find.

Label Propagation. V_U is the set of unlabeled vertices.

Label Propagation. The first term in the objective function brings the distributions of similar vertices closer to each other.

Label Propagation. The second term can be thought of as a regularizer that brings the distributions on vertices in uncertain graph neighborhoods close to the uniform distribution U over the label set.

Label Propagation. To optimize this function, we use iterative updates, whose details I won't go into in this talk; they can be found in the paper.
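Since the paper defers the update equations, the following Python sketch shows one standard Jacobi-style update that minimizes an objective of the form above (a reconstruction under the stated assumptions, not the paper's exact procedure):

def propagate(graph, q, labeled, labels, nu=1e-2, iters=30):
    """graph[i] = list of (j, w_ij) neighbors; q[i] = {tag: prob};
    vertices in `labeled` keep their seed distributions clamped."""
    uniform = 1.0 / len(labels)
    for _ in range(iters):
        new_q = {}
        for i, nbrs in graph.items():
            if i in labeled:
                new_q[i] = q[i]  # seeds stay fixed
                continue
            # Closed-form minimizer of the local squared penalties:
            # q_i = (sum_j w_ij q_j + nu * U) / (sum_j w_ij + nu)
            denom = sum(w for _, w in nbrs) + nu
            new_q[i] = {t: (sum(w * q[j].get(t, 0.0) for j, w in nbrs)
                            + nu * uniform) / denom
                        for t in labels}
        q = new_q
    return q

# Tiny example: one labeled vertex spreads its distribution to two others.
q = {"v1": {"NOUN": 1.0}, "v2": {}, "v3": {}}
graph = {"v1": [("v2", 1.0)],
         "v2": [("v1", 1.0), ("v3", 0.5)],
         "v3": [("v2", 0.5)]}
print(propagate(graph, q, labeled={"v1"}, labels=["NOUN", "VERB"]))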

Final Results

Our Model: Graph-Based Projections. Feature-HMM constrained with the graph-based dictionary. Constraining the feature-HMM with the expanded, graph-based dictionary gives our best unsupervised results, 83.4% on average, improving over all of the preceding baselines.

                   Danish  Dutch  German  Greek  Italian  Portuguese  Spanish  Swedish  Average
EM-HMM               68.7   57.0    75.9   65.8     63.7        62.9     71.5     68.4     66.7
Feature-HMM          69.1   65.1    81.3   71.8     68.1        78.4     80.2     70.1     73.0
Direct projection    73.6   77.0    83.2   79.3     79.7        82.6     80.1     74.7     78.8
Projected dictionary (row incomplete in transcript): 79.0, 82.4, 76.3, 84.8, 87.0, 82.8, 79.4
Graph-based projections (row incomplete in transcript): 79.5, 82.5, 86.8, 87.9, 84.2, 80.5, 83.4

Our Model: Graph-Based Projections. We compare these results with two oracles: a feature-HMM that incorporates a gold dictionary constructed from the treebanks, and a fully supervised model. We fall short of the gold-dictionary model by quite a bit, which is not surprising because our lexicon is noisy. The supervised model gives an upper bound, and we are more than 13 points below it.

                     Danish  Dutch  German  Greek  Italian  Portuguese  Spanish  Swedish  Average
w/ gold dictionary     93.1   94.7    93.5   96.6     96.4        94.0     95.8     85.5     93.7
Supervised             96.9   94.9    98.2   97.8     95.8        97.2     96.8     94.8     96.6

Concluding Notes. We obtain reasonably accurate POS taggers without direct supervision, evaluated on major European languages, and take a step towards a standard set of universal POS tags. Traditional evaluation of unsupervised POS taggers is done with greedy metrics that make use of labeled data; the models presented here avoid these evaluation methods.

Future Directions. Scaling up the number of nodes in the graph from 2M to billions may help create larger lexicons. Penalties could be included in the graph objective that induce sparse tag distributions at each vertex. Including multiple languages in the graph may further improve results: label propagation in one huge multilingual graph.

Projected POS-tagged data available at: http://code.google.com/p/pos-projection/

Questions? [Closing slide: the running example with projected tags across languages: Portland has a thriving music scene . / Portland hat eine prächtig gedeihende Musikszene . / পোর্টল্যান্ড শহর এর সঙ্গীত পরিবেশ বেশ উন্নত | / Portland tiene una escena musical vibrante . / 波特兰 有 一个 生机勃勃的 音乐 场景 / Portland a une scène musicale florissante .]