
1 Unsupervised Part-of-Speech Tagging with Bilingual Graph-Based Projections
Dipanjan Das (Carnegie Mellon University) and Slav Petrov (Google Research). ACL 2011, June 21.
Good morning. I am Dipanjan Das, a PhD student at Carnegie Mellon University. Welcome to this talk, where I will talk about learning part-of-speech taggers in a bilingual setting. More precisely, we learn taggers from unannotated data for a target language, assuming that we have a resource-rich source language with which the target language has parallel data. This is joint work with Slav Petrov, completed during an internship at Google Research.

2 Part-of-Speech Tagging
Portland/NOUN has/VERB a/DET thriving/ADJ music/NOUN scene/NOUN ./. We are all familiar with the part-of-speech tagging task. Given a sentence such as "Portland has a thriving music scene", our goal is to tag each of its words with a disambiguated part-of-speech category. For example, here thriving is an adjective, because it modifies the noun phrase "music scene", and not a verb, another syntactic category the word thriving can take.

3 Supervised POS Tagging
Supervised setting: average accuracy is 96.2% with TnT (Brants, 2000). To confirm that these universal tags work well across languages, we trained supervised models for these 20-odd languages, and the average accuracy turns out to be around 96%. This says that the universal tags behave in the desired way for all of these languages. However, we want to build POS taggers for many more than 20 languages.

4 Resource-Poor Languages
Several major languages have little or no annotated data, e.g. Punjabi, Vietnamese, Polish, Oriya, Indonesian-Malay, Azerbaijani, and Haitian, with native-speaker populations ranging from 7.7 million to 109 million. However, there is lots of parallel and unannotated data! These seven languages, each with millions of speakers, have little or no tagged data with which supervised NLP tools could be built. However, they have plenty of data translated into a resource-rich language like English, for which we have supervised taggers. Our goal is to build POS taggers for such languages, because basic NLP tools like POS tagging are essential for the development of language technologies.

5 (Nearly) Universal Part-of-Speech Tags
The 12 tags: NOUN, VERB, ADJ, ADV, PRON, DET, ADP, NUM, CONJ, PRT, . (punctuation), X. These span both open-class and closed-class categories.

6 (Nearly) Universal Part-of-Speech Tags
Example Penn Treebank tag maps: NN → NOUN, NNP → NOUN, NNPS → NOUN, NNS → NOUN; PRP → PRON, PRP$ → PRON, WP → PRON, WP$ → PRON. Example Spanish Treebank tag maps: np → NOUN, nc → NOUN; p0 → PRON, pd → PRON, pe → PRON, pi → PRON, pn → PRON, pp → PRON, pr → PRON, pt → PRON, px → PRON. We now give a brief idea of how these tags were constructed. For example, the four fine-grained noun tags from the original Penn Treebank were collapsed to the coarse NOUN tag for English. Similarly, the fine-grained pronoun categories map to the coarse PRON category. For Spanish, the treebank contains two fine-grained noun tags, which collapse to the universal coarse NOUN tag, and nine different pronoun tags, which map to the PRON category. Thus, for different languages, different numbers of fine-grained tags map to each coarse universal POS tag. The procedure by which these tags were designed, and a set of experiments with them, are described in a short report uploaded to arXiv and available for download. See Petrov, Das and McDonald (2011).
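Concretely, each mapping is just a lookup table from fine-grained treebank tags to coarse universal tags. A minimal sketch in Python, covering only the example tags shown on this slide:

PENN_TO_UNIVERSAL = {
    "NN": "NOUN", "NNP": "NOUN", "NNPS": "NOUN", "NNS": "NOUN",
    "PRP": "PRON", "PRP$": "PRON", "WP": "PRON", "WP$": "PRON",
}

SPANISH_TO_UNIVERSAL = {
    "np": "NOUN", "nc": "NOUN",
    "p0": "PRON", "pd": "PRON", "pe": "PRON", "pi": "PRON",
    "pn": "PRON", "pp": "PRON", "pr": "PRON", "pt": "PRON", "px": "PRON",
}

def to_universal(fine_tag, table):
    # Map a fine-grained treebank tag to its coarse universal tag.
    return table[fine_tag]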

7 (Nearly) Universal Part-of-Speech Tags
Portland has a thriving music scene . — NOUN VERB DET ADJ . Portland hat eine prächtig gedeihende Musikszene . — NOUN VERB DET ADJ . পোর্টল্যান্ড শহর এর সঙ্গীত পরিবেশ বেশ উন্নত | — NOUN ADP ADJ . Universal tags have also been used by other researchers.

8 State of the Art in Unsupervised POS Tagging
Before we go into the details of our proposed method, I will talk briefly about some state-of-the-art models in unsupervised part-of-speech tagging, i.e., learning taggers in a scenario where there is no annotated data.

9 Unsupervised Part-of-Speech Tagging
Hidden Markov Model (HMM) estimated with the Expectation-Maximization algorithm. The basic model used for POS tagging in various learning scenarios is the hidden Markov model. In this framework, we denote the observation sequence, i.e., the sentence, as bold x, and the state sequence that emits each token in the sentence as bold y. The HMM inference problem for POS tagging is to find the best state sequence for a given sentence. Example: Portland hat eine prächtig gedeihende Musikszene . (x: observation sequence; y: state sequence). Merialdo (1994)

10 Unsupervised Part-of-Speech Tagging
Hidden Markov Model (HMM) estimated with the Expectation-Maximization algorithm. Each state is one of the 12 coarse tags: the question mark at each state position denotes one of our coarse universal tags. Portland hat eine prächtig gedeihende Musikszene . (x: observation sequence; y: state sequence). Merialdo (1994)

11 Unsupervised Part-of-Speech Tagging
Hidden Markov Model (HMM) estimated with the Expectation-Maximization algorithm: transition multinomials. A bigram HMM contains a set of transition multinomial distributions, which encode the probability of transitioning from one state to the next. Merialdo (1994)

12 Unsupervised Part-of-Speech Tagging
Hidden Markov Model (HMM) estimated with the Expectation-Maximization algorithm: emission multinomials. It also contains a set of emission multinomial distributions, which give the probability of emitting an observation type x_i given the state y_i at position i. The simplest, and one of the oldest, unsupervised learning methods for training an HMM is the Expectation-Maximization algorithm. We trained unsupervised POS taggers for eight European languages with the EM algorithm. Merialdo (1994)
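Putting the two sets of multinomials together, the bigram HMM factorizes the joint probability of a sentence and a tag sequence in the standard way (notation follows the slides):

p(\mathbf{x}, \mathbf{y}) = \prod_{i=1}^{n} p(y_i \mid y_{i-1}) \, p(x_i \mid y_i)

where the p(y_i | y_{i-1}) are the transition multinomials and the p(x_i | y_i) are the emission multinomials. EM alternates between computing expected counts for these distributions under the current model (E-step) and renormalizing those counts into new multinomials (M-step).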

13 Unsupervised Part-of-Speech Tagging
Hidden Markov Model (HMM) estimated with the Expectation-Maximization algorithm. However, results are quite poor across all of these languages, a finding previously noted. We would certainly want to improve over these models.
EM-HMM accuracies: Danish 68.7, Dutch 57.0, German 75.9, Greek 65.8, Italian 63.7, Portuguese 62.9, Spanish 71.5, Swedish 68.4; Average 66.7. Poor average result. Johnson (2007)

14 Unsupervised Part-of-Speech Tagging
Hidden Markov Model (HMM) with locally normalized log-linear models: emission multinomials. Recently, Berg-Kirkpatrick and colleagues proposed an HMM in which the emission multinomials are replaced by locally normalized log-linear models. Berg-Kirkpatrick et al. (2010)

15 Unsupervised Part-of-Speech Tagging
Hidden Markov Model (HMM) with locally normalized log-linear models: emission multinomials. These log-linear models look at features corresponding to various aspects of the observation x_i, along with the state y_i. Some of these features are the token's suffix, whether it contains a hyphen, the nature of its capitalization, whether it contains numbers, and so forth. Berg-Kirkpatrick et al. (2010)
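Concretely, each emission is locally normalized over the vocabulary. A standard way to write the Berg-Kirkpatrick et al. emission model, with f the feature vector and θ the parameters, is:

p(x_i \mid y_i; \theta) = \frac{\exp(\theta^\top f(x_i, y_i))}{\sum_{x'} \exp(\theta^\top f(x', y_i))}

On these slides, only the emissions are made feature-rich; the transitions remain plain multinomials.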

16 Unsupervised Part-of-Speech Tagging
Hidden Markov Model (HMM) with locally normalized log-linear models, estimated using gradient-based methods. When the parameters were estimated with a gradient-based method, L-BFGS in particular, this model achieved much better scores than vanilla EM for English. Berg-Kirkpatrick et al. (2010)

17 Unsupervised Part-of-Speech Tagging
Hidden Markov Model (HMM) with locally normalized log-linear models, estimated using gradient-based methods. We trained this feature-rich HMM for our eight languages to confirm the improvement. The table shows consistent improvements over the EM-based HMM, with a more than 6% absolute average improvement. We use this model in our work, as we will see in the following slides.
EM-HMM: Danish 68.7, Dutch 57.0, German 75.9, Greek 65.8, Italian 63.7, Portuguese 62.9, Spanish 71.5, Swedish 68.4; Average 66.7
Feature-HMM: Danish 69.1, Dutch 65.1, German 81.3, Greek 71.8, Italian 68.1, Portuguese 78.4, Spanish 80.2, Swedish 70.1; Average 73.0
Improvements across all languages.

18 Unsupervised POS Tagging with Dictionaries
Hidden Markov Model (HMM) with locally normalized log-linear models; state space constrained by the possible gold tags. A POS tag dictionary contains the possible POS tags that a particular token can take. For example, the third word in the sentence, eine, can be a pronoun, determiner, adjective, or numeral (PRON, DET, ADJ, NUM). Incorporating a tag dictionary into an HMM is straightforward: the HMM's state sequence is hard-constrained so that only states that associate with the current token can emit it. We ran an experiment in which we constructed tag dictionaries for each language from its treebank and trained a feature-rich HMM for each of our eight languages. Portland hat eine prächtig gedeihende Musikszene .
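A minimal sketch of how such a hard constraint might look, with hypothetical helper names; real implementations prune the lattice the same way during training and decoding:

UNIVERSAL_TAGS = ["NOUN", "VERB", "ADJ", "ADV", "PRON", "DET",
                  "ADP", "NUM", "CONJ", "PRT", ".", "X"]

def allowed_states(word, tag_dictionary):
    # States permitted to emit `word`: its dictionary entry if present,
    # otherwise all 12 coarse tags.
    return tag_dictionary.get(word, UNIVERSAL_TAGS)

# e.g. tag_dictionary["eine"] == ["PRON", "DET", "ADJ", "NUM"];
# any state outside allowed_states(word, tag_dictionary) is pruned
# from the lattice at that position.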

19 Unsupervised POS Tagging with Dictionaries
Hidden Markov Model (HMM) with locally normalized log-linear models; state space constrained by the possible gold tags. These feature-HMMs using treebank-based dictionaries perform very well: we find more than a 20% improvement over the purely unsupervised feature-HMMs, with accuracies quite close even to their supervised counterparts. Portland hat eine prächtig gedeihende Musikszene .
EM-HMM: Danish 68.7, Dutch 57.0, German 75.9, Greek 65.8, Italian 63.7, Portuguese 62.9, Spanish 71.5, Swedish 68.4; Average 66.7
Feature-HMM: 69.1, 65.1, 81.3, 71.8, 68.1, 78.4, 80.2, 70.1; Average 73.0
Feature-HMM w/ gold dictionary: 93.1, 94.7, 93.5, 96.6, 96.4, 94.0, 95.8, 85.5; Average 93.7

20 For most languages, access to high-quality tag dictionaries is not realistic
Morphologically rich languages only have base forms in dictionaries. However, for most languages without annotated data, access to high-quality tag dictionaries is not realistic. Although dictionaries may exist in hard copy, for morphologically rich languages only base forms are present in them, which causes problems. We tried to use such dictionaries to build POS taggers, and the deficient dictionaries prevented us from building accurate ones. Instead, we resort to the following ideas for developing good unsupervised POS taggers: use supervision in another, resource-rich language with annotated data, which we call the source; use translations between the language of interest, which we call the target language, and the source (parallel data); and project tag lexicons from the source language to the target language.

21 Bilingual Projection
Portland has a thriving music scene . — NOUN VERB DET ADJ . (automatic labels from a supervised tagger, 97% accuracy). Given a sentence in English, we can use existing POS taggers to tag it automatically. For English, automatic POS taggers using the coarse tags achieve around 97% accuracy, which is quite good for projection purposes.

22 Bilingual Projection
Portland has a thriving music scene . — NOUN VERB DET ADJ . Automatic unsupervised alignments from translation data (available for more than 50 languages). For many languages, translations from a resource-rich language like English exist. Given a large amount of such translations, automatic unsupervised alignments are easy to obtain; we can get these alignments for more than 50 languages. Portland hat eine prächtig gedeihende Musikszene .

23 Baseline 1: direct projection
Bilingual Projection. Portland has a thriving music scene . — NOUN VERB DET ADJ NOUN NOUN . Our first idea uses these alignments directly for projection. We use parallel data between a source language, which is always English in our experiments, and a target language. We run a word aligner and tag the English side of the bitext corpus with a supervised POS tagger. Next, for the aligned words of each target sentence, we project the tags from the source side. We use only high-confidence alignments and choose at most one alignment per target word. For unaligned words, we use NOUN, the most frequent tag. Portland hat eine prächtig gedeihende Musikszene . Yarowsky and Ngai (2001)
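A minimal sketch of this projection step for one sentence pair, with hypothetical data structures: `alignments` holds high-confidence (source index, target index) pairs, at most one per target word, and the alignment pairs below are illustrative:

def project_tags(source_tags, target_words, alignments):
    # Unaligned target words default to NOUN, the most frequent tag.
    projected = ["NOUN"] * len(target_words)
    for src_idx, tgt_idx in alignments:
        projected[tgt_idx] = source_tags[src_idx]
    return projected

src = ["NOUN", "VERB", "DET", "ADJ", "NOUN", "NOUN", "."]
tgt = ["Portland", "hat", "eine", "prächtig", "gedeihende", "Musikszene", "."]
align = [(0, 0), (1, 1), (2, 2), (3, 4), (4, 5), (6, 6)]
print(project_tags(src, tgt, align))
# ['NOUN', 'VERB', 'DET', 'NOUN', 'ADJ', 'NOUN', '.']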

24 Baseline 1: direct projection
Bilingual Projection. Portland hat eine prächtig gedeihende Musikszene . — NOUN VERB DET ADJ . (+ more projected tagged sentences). Once projection is done for all sentences, we use the projected data to perform supervised training, yielding a POS tagger (Brants, 2000). Yarowsky and Ngai (2001)

25 Baseline 1: direct projection
Bilingual Projection. The results for this model are shown in the third row of the table.
EM-HMM: Danish 68.7, Dutch 57.0, German 75.9, Greek 65.8, Italian 63.7, Portuguese 62.9, Spanish 71.5, Swedish 68.4; Average 66.7
Feature-HMM: 69.1, 65.1, 81.3, 71.8, 68.1, 78.4, 80.2, 70.1; Average 73.0
Direct projection: 73.6, 77.0, 83.2, 79.3, 79.7, 82.6, 80.1, 74.7; Average 78.8
Yarowsky and Ngai (2001)

26 Baseline 1: direct projection
Bilingual Projection. Direct projection yields consistent improvements over the unsupervised models (same table as on the previous slide). Yarowsky and Ngai (2001)

27 Baseline 2: lexicon projection
Bilingual Projection. The second idea we explore in this work is, instead of projecting across tokens in the parallel corpus, to project across types. We call this lexicon projection.

28 Baseline 2: lexicon projection
Bilingual Projection. Portland/NOUN has/VERB a/DET thriving/ADJ music/NOUN scene/NOUN ./. Given such an alignment, we consider each aligned source-target token pair separately. Portland hat eine prächtig gedeihende Musikszene .

29 Baseline 2: lexicon projection
Bilingual Projection. [Figure: the tagged source words paired with their aligned target words.] Target words that are unaligned are ignored.

30 Baseline 2: lexicon projection
Bilingual Projection. [Figure: the same word pairs, viewed as a bag of alignments.] Thus, we get a bag of alignments for each sentence pair.

31 Baseline 2: lexicon projection
Bilingual Projection. In the next step, for each target word aligned to an auto-tagged source word, we project the tag to the target word. In other words, we get a distribution over tags for each target word, initially skewed towards a single tag.

32 Baseline 2: lexicon projection
Bilingual Projection. As we scan the bitext, the target words get aligned to more tagged source words, and we update their projected distributions. For example, here eine is aligned to words that were tagged DET, NUM, and PRON, and its tag distribution has equal weight on each of these tags.

33 Baseline 2: lexicon projection
Bilingual Projection. Similarly, the word gedeihende, having aligned to two senses of the word thriving, has a distribution with equal weight on the adjective and verb tags.

34 Baseline 2: lexicon projection
Bilingual Projection. After scanning all the parallel data, we have a POS tag distribution for each aligned word type in the bitext, i.e., the probability of a tag given a word (examples on the slide: Portland, gedeihende, hat, eine, Musikszene, .). We convert this set of tag distributions into a dictionary by using a threshold τ: for a given word, all tags above τ are entered into the dictionary.
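A minimal sketch of this type-level accumulation and thresholding, with hypothetical names; tau stands for the threshold τ mentioned above:

from collections import Counter, defaultdict

def build_projected_dictionary(tagged_pairs, tau=0.2):
    # Accumulate projected tag counts per target word type over the bitext,
    # then keep every tag whose relative frequency clears the threshold tau.
    counts = defaultdict(Counter)
    for target_word, source_tag in tagged_pairs:  # the bag of alignments
        counts[target_word][source_tag] += 1
    dictionary = {}
    for word, tag_counts in counts.items():
        total = sum(tag_counts.values())
        dictionary[word] = [t for t, c in tag_counts.items() if c / total >= tau]
    return dictionary

pairs = [("eine", "DET"), ("eine", "NUM"), ("eine", "PRON"),
         ("gedeihende", "ADJ"), ("gedeihende", "VERB")]
print(build_projected_dictionary(pairs, tau=0.2))
# {'eine': ['DET', 'NUM', 'PRON'], 'gedeihende': ['ADJ', 'VERB']}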

35 Baseline 2: lexicon projection
Bilingual Projection. Feature-HMM constrained with the projected dictionary. When this dictionary is used to constrain the state space of a feature-based HMM and the model is trained, we get quite good results: for 6 of the 8 languages, we see improvements over the three previous models, and the improvement over the purely unsupervised feature-HMM is more than 8% absolute on average.
EM-HMM: Danish 68.7, Dutch 57.0, German 75.9, Greek 65.8, Italian 63.7, Portuguese 62.9, Spanish 71.5, Swedish 68.4; Average 66.7
Feature-HMM: 69.1, 65.1, 81.3, 71.8, 68.1, 78.4, 80.2, 70.1; Average 73.0
Direct projection: 73.6, 77.0, 83.2, 79.3, 79.7, 82.6, 80.1, 74.7; Average 78.8
Projected dictionary: 79.0, 82.4, 76.3, 84.8, 87.0, 82.8, 79.4
Improvements over simple projection for the majority of the languages.

36 Projected lexicon expansion and refinement using label propagation
No information about unaligned words. Portland hat eine prächtig gedeihende Musikszene . — Portland has a thriving music scene . (NOUN VERB DET ADJ .) One drawback of the approach I just described is that there is no information about unaligned words; we ignored them during the projection procedure. The question we ask at this point is: can coverage be improved? To address this problem, we explore whether the projected lexicon can be expanded and refined using a large amount of unlabeled data. Idea: projected lexicon expansion and refinement using label propagation.

37 Our Model: Graph-Based Projections
How can label propagation help? For a target language: build a graph with 2M trigram types as vertices, computing the similarity matrix from co-occurrence statistics; the label distribution at each vertex ≈ the tag distribution over the trigram's middle word. We first construct a graph over 2 million trigram types of the target language as vertices. Distributional similarity between trigram types, computed from a large amount of unlabeled data, gives the edge weight between a pair of trigrams. The graph is made sparse by keeping only the 5 nearest neighbors of each trigram type. For each vertex, the distribution over labels, which we call q, is chosen to be the tag distribution over the middle word of the trigram. The trigram representation is chosen to reduce the ambiguity of the middle word and to construct a better graph. The graph construction over the target language follows recent work by Subramanya and colleagues, whose goal was POS domain adaptation for English. Subramanya, Petrov and Pereira (2010)
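A minimal sketch of the 5-NN sparsification step, with a hypothetical similarity function standing in for the co-occurrence-based measure:

import heapq

def knn_graph(trigram_types, similarity, k=5):
    # Keep only the k nearest neighbors of each trigram-type vertex.
    edges = {}
    for u in trigram_types:
        scored = ((similarity(u, v), v) for v in trigram_types if v != u)
        edges[u] = heapq.nlargest(k, scored)  # top-k (weight, neighbor) pairs
    return edges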

38 Example Graph in German
Trigram vertices include: gutem Essen zugetan; zum Essen niederlassen; fuers Essen drauf; schlechtes Essen und; 1000 Essen pro; ist wichtig bei; ist gut bei; ist fein bei; ist lebhafter bei; zu realisieren ,; zu erreichen ,; zu stecken ,; zu essen ,

39 Example Graph in German
The same vertices, now with labels: the Essen cluster (gutem Essen zugetan; zum Essen niederlassen; fuers Essen drauf; schlechtes Essen und; 1000 Essen pro) carries NOUN, while the zu-infinitive cluster (zu realisieren ,; zu erreichen ,; zu stecken ,; zu essen ,) carries VERB.

40 Our Model: Graph-Based Projections
How can label propagation help? Now, to spread label distributions through this target-language graph over trigrams, we need to seed it. Since we assume we do not have any labeled data for this language, we bring in parallel data: we plug in automatically tagged words from a resource-rich source language and connect them to the target-language trigrams, where the edges between source words and target trigrams are simply word alignments.

41 Bilingual Graph
The target-language trigram graph, now connected to auto-tagged English source words: important/ADJ, good/ADJ, nicely/ADV, fine/ADJ, food/NOUN, eating/VERB, eat/VERB. [Figure: alignment edges link these English words to the German trigram vertices shown on the previous slides.]

42 Our Model: Graph-Based Projections
Next, we are ready to run label propagation. In the first stage, we propagate labels from the source words to the target trigrams: source language → target language.

43 First Stage of Label Propagation
[Figure: tag distributions flow from the auto-tagged English words across the alignment edges to the aligned target trigrams.]


45 Our Model: Graph-Based Projections
After this first stage of label propagation from the source-language words to the target-language trigrams, we run a second stage of label propagation within the target-language vertices, since we now have seed distributions over the target-language graph. This second stage is simply the optimization of the graph objective function with squared penalties (described on the label propagation slides below).

46 Second Stage of Label Propagation
[Figure: label distributions begin to spread among the target trigram vertices.]


49 Second Stage of Label Propagation
[Figure: label distributions continue to spread through the target-language graph.] Continues until convergence...

50 Our Model: Graph-Based Projections
End result? The end result of this bilingual projection procedure is a large set of tag distributions over all the word types (unigrams) of the target language, obtained by collapsing and renormalizing the tag distributions of the target trigrams. Examples on the slide: Portland, gedeihende, hat, eine, Musikszene, ., fein, lebhafter, realisieren.

51 Our Model: Graph-Based Projections
A larger set of tag distributions → a better and larger dictionary.

52 Our Model: Graph-Based Projections
Lexicon expansion: thousands of words. This slide compares, for each language, the sizes of the dictionaries constructed with the projected-dictionary model and with our full model that uses graph-based learning. There is significant expansion in lexicon size. Although the lexicons are imperfect, this expansion helps explain the consistent improvements we get across languages.

53 Brief Overview: Graph-Based Learning with Labeled and Unlabeled Data
We adopt a graph-based semi-supervised learning technique with which we perform lexicon expansion. We choose this particular way of learning from indirect supervision for two reasons. First, the framework lets us learn from similar datapoints that carry some bilingually projected knowledge, expanding our beliefs from a small set of items to a larger set. Second, it creates a probabilistic resource that is stationary, built offline, and usable for both training and inference. I will now describe the particular graph-based technique we use in this work.

54 Label Propagation
[Figure: five datapoints connected by edges with weights 0.9, 0.8, 0.1, 0.05, 0.01.] W is a symmetric weight matrix; the distributions on the unlabeled vertices are to be found. Here we have five sample points. The two shaded datapoints are labeled; the three white ones are unlabeled. The geometry of the data is given by edges between pairs of vertices. Each edge has a symmetric weight w_ij, computed using some similarity function; we assume the weight matrix for the entire graph is given. The graph is not fully connected, which means the absent edges have weight 0. On the labeled vertices, supervised label distributions are known: q_1 and q_5 in this graph. Our goal is to learn q_2, q_3, and q_4, the distributions on the unlabeled vertices. Zhu, Ghahramani and Lafferty (2003)

55 Label Propagation
This transductive setup, where we want to learn the label distributions on the unlabeled vertices, is called label propagation and was proposed by Zhu et al. In the current work, we perform label propagation by minimizing the objective function described on the following slides. Zhu, Ghahramani and Lafferty (2003)

56 Label Propagation
Here Q_U is the set of distributions over the unlabeled vertices, which we need to find.

57 Label Propagation
V_U is the set of unlabeled vertices.

58 Label Propagation
The first term in the objective function brings the distributions of similar vertices closer to each other.

59 Label Propagation
The second term can be thought of as a regularizer that brings the distributions on vertices in uncertain graph neighborhoods close to the uniform distribution over the label set.
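A graph objective with squared penalties consistent with this description (a reconstruction, with assumed notation: q_i are the vertex distributions, V_U the unlabeled vertices, w_ij the edge weights, U the uniform distribution over the label set, and ν the regularization weight) is:

C(Q_U) = \sum_{i \in V_U} \sum_{j} w_{ij} \, \| q_i - q_j \|^2 \;+\; \nu \sum_{i \in V_U} \| q_i - U \|^2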

60 Label Propagation
Iterative updates for optimization. To optimize this function, we use iterative updates, whose details I won't go into in this talk; they can be found in the paper.
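For intuition, a standard iterative scheme for objectives of this form updates each unlabeled vertex toward the weighted average of its neighbors' distributions, smoothed toward uniform. A sketch of the general idea (not necessarily the paper's exact updates):

def propagate(q, weights, labeled, num_labels, nu=0.01, iters=10):
    # q: vertex -> list of label probabilities; weights: vertex -> {neighbor: w_ij}.
    uniform = [1.0 / num_labels] * num_labels
    for _ in range(iters):
        for i in q:
            if i in labeled:
                continue  # labeled vertices keep their supervised distributions
            total = nu
            mixed = [nu * u for u in uniform]
            for j, w in weights.get(i, {}).items():
                total += w
                mixed = [m + w * qj for m, qj in zip(mixed, q[j])]
            q[i] = [m / total for m in mixed]  # convex combination, stays normalized
    return q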

61 Final Results

62 Our Model: Graph-Based Projections
Feature-HMM constrained with the graph-based dictionary.
EM-HMM: Danish 68.7, Dutch 57.0, German 75.9, Greek 65.8, Italian 63.7, Portuguese 62.9, Spanish 71.5, Swedish 68.4; Average 66.7
Feature-HMM: 69.1, 65.1, 81.3, 71.8, 68.1, 78.4, 80.2, 70.1; Average 73.0
Direct projection: 73.6, 77.0, 83.2, 79.3, 79.7, 82.6, 80.1, 74.7; Average 78.8
Projected dictionary: 79.0, 82.4, 76.3, 84.8, 87.0, 82.8, 79.4
Graph-based projections: 79.5, 82.5, 86.8, 87.9, 84.2, 80.5, 83.4

63 Our Model: Graph-Based Projections
We compare these results with two oracles: a feature-HMM that incorporates a gold dictionary constructed from the treebanks, and a fully supervised model. We fall short of the gold-dictionary model by quite a bit, but this is not surprising, because our lexicon is noisy. The supervised model gives an upper bound; we are more than 13 points below it.
Feature-HMM w/ gold dictionary: Danish 93.1, Dutch 94.7, German 93.5, Greek 96.6, Italian 96.4, Portuguese 94.0, Spanish 95.8, Swedish 85.5; Average 93.7
Supervised: 96.9, 94.9, 98.2, 97.8, 95.8, 97.2, 96.8, 94.8; Average 96.6

64 Concluding Notes
Reasonably accurate POS taggers without direct supervision, evaluated on major European languages. A step towards a standard of universal POS tags. Traditional evaluation of unsupervised POS taggers is done using greedy metrics that rely on labeled data; our models avoid these evaluation methods.

65 Future Directions
Scaling up the number of nodes in the graph from 2M to billions may help create larger lexicons. Penalties could be included in the graph objective to induce sparse tag distributions at each vertex. Including multiple languages in the graph may further improve results: label propagation in one huge multilingual graph.

66 Projected POS Tagged Data
Available at:

67 Questions?
Portland has a thriving music scene . — NOUN VERB DET ADJ NOUN NOUN .
Portland hat eine prächtig gedeihende Musikszene .
পোর্টল্যান্ড শহর এর সঙ্গীত পরিবেশ বেশ উন্নত |
Portland tiene una escena musical vibrante .
Portland a une scène musicale florissante .
波特兰 有 一个 生机勃勃的 音乐 场景

