
COOC Practicum / Software Project, SS 2000 Final Report Tanja von den Berg, Tilman Jäger, Kerstin Klöckner, Stephan Lesch, Holger Neis, Norbert Pfleger,


1 COOC Practicum / Software Project, SS 2000: Final Report
Tanja von den Berg, Tilman Jäger, Kerstin Klöckner, Stephan Lesch, Holger Neis, Norbert Pfleger, Diana Raileanu, Hubert Schlarb
Supervisors: Jan Alexandersson, Paul Buitelaar

2 Contents (COOC, 30.11.00)
- Intro
- Theoretical foundations
- At the outset
- Project aspects
  – Preprocessing
  – Training
  – Application
  – Evaluation
- Outlook

3 Intro
- Word Sense Disambiguation (WSD) as preparation for the semantic analysis of text documents
- Application areas: translation systems, information retrieval systems, document classification, etc.
- Machine learning approaches:
  – supervised (semantically tagged corpora)
  – unsupervised (untagged corpora)
- COOC: the first unsupervised, corpus-based approach for German

4 Theoretical Foundations
- WSD (Word Sense Disambiguation) in context, e.g. "bank": place to sit vs. financial institution
  – "I'm going to the bank to get some money."
- COOC: co-occurrence of words in a given context
- GermaNet: WordNet for German
- WordNet:
  – lexical and semantic database
  – semantic net, ontology
  – lexical and conceptual relations (antonymy, hyponymy)

5 Theoretical Foundations (II)
- Method:
  – knowledge sources (WordNet, thesaurus)
  – the possibility of finding relations between words and meanings
- supervised:
  – requires already disambiguated data
  – requires large amounts of data
- unsupervised:
  – requires even more data
  – data need not be disambiguated

6 Theoretical Foundations (III)
Examples of unsupervised methods:
- Lesk (1986): comparison among dictionary entries
- Yarowsky (1992):
  – Roget's Thesaurus, Grolier's Encyclopedia
  – collections of contexts for a thesaurus category
  – identification of characteristic words
- Resnik (1997):
  – Penn Treebank corpus, POS-tagged, syntactically annotated
  – selectional preference (predicate arguments)

7 At the outset
- Approach of Seligman (94):
  – Japanese dialogues (direction finding, hotel reservations in spontaneous speech)
  – thesaurus with 4 fixed abstraction levels
  – explicit semantic smoothing
- COOC project:
  – Tiger corpus (Frankfurter Rundschau)
  – GermaNet with a varying number of abstraction levels (up to 26)
  – implicit semantic smoothing

8 Flow diagram (figure not preserved in the transcript)

9 Preprocessing
- Conversion of the training corpus (plain text) into the COOC format
- Statistics on GermaNet categories

10 Resources
- Tiger corpus (1,051,446 tokens): German newspaper text from the Frankfurter Rundschau
- TnT tagger (Brants 2000): statistical part-of-speech tagger
- Mmorph (Petitpierre & Russell, 1995): morphological analysis tool
- GermaNet: lexico-semantic network for German (about 25,000 nouns, 6,000 verbs, 3,500 adjectives)

11 COOC Format
Philip Glass wurde auf seinen weltweiten Tourneen mit Kassetten und Tonbändern überschüttet.
(Philip Glass was showered with audio tapes and cassettes during his worldwide tours.)
...
166 seinen NA PPOSAT
167 weltweiten weltweit ADJA [ 113815 113669 111763 111559 ]
...
172 Tonbändern Tonband NN [ 75749... 1749365 ]... [ 75749... 144863 ]
173 überschüttet überschütten VVPP [ 353400... 226602 ] [ 353400... 2266023 ]
...
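The token lines above can be read mechanically. The following is a minimal sketch, assuming that the fields are whitespace-separated (running number, word form, lemma, POS tag) and that each bracket group holds a list of GermaNet IDs for one candidate meaning. The slide does not spell out the format, so the names `CoocToken` and `parse_cooc_line` and the field interpretation are illustrative assumptions, not the project's actual reader.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class CoocToken:
    position: int   # running token number (assumed meaning of the first field)
    surface: str    # word form as it appears in the text
    lemma: str      # base form ("NA" where the line gives none)
    pos: str        # part-of-speech tag
    paths: List[List[int]] = field(default_factory=list)  # one ID list per bracket group


def parse_cooc_line(line: str) -> CoocToken:
    """Parse one complete token line (no elisions) into a CoocToken."""
    head = line.split("[", 1)[0].split()
    position, surface, lemma, pos = head[:4]
    # Every "[ ... ]" group becomes one list of integer IDs.
    paths = [[int(n) for n in group.split()]
             for group in re.findall(r"\[([^\]]*)\]", line)]
    return CoocToken(int(position), surface, lemma, pos, paths)


token = parse_cooc_line("167 weltweiten weltweit ADJA [ 113815 113669 111763 111559 ]")
plain = parse_cooc_line("166 seinen NA PPOSAT")
print(token.lemma, token.paths)
print(plain.pos, plain.paths)
```

Lines without bracket groups, such as the possessive pronoun in line 166, simply end up with an empty `paths` list.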

12 GermaNet Hierarchy (figure not preserved in the transcript)

13 Statistics on GermaNet Categories
- Omission of higher-frequency categories
- Reduction of computational complexity
Format: Frequency  ID (Offset)  Synset
Example:
70725  1749365  Objekt_0
43450  369009   Situation_0
...
2      843903   Kofferraum_0
1      695036   Intellekt_0_Genius_0
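A frequency list of this kind could be produced and filtered roughly as follows. This is a sketch under assumptions: a category is counted once for every candidate path it appears in, and the cut-off values are free parameters. The function names and the toy IDs are invented for illustration and are not taken from the project.

```python
from collections import Counter


def category_frequencies(paths_per_token):
    """Count how often each category ID occurs across all candidate paths."""
    counts = Counter()
    for paths in paths_per_token:
        for path in paths:
            counts.update(path)
    return counts


def keep_in_interval(counts, min_freq, max_freq):
    """Return the category IDs whose frequency lies in [min_freq, max_freq]."""
    return {cat for cat, n in counts.items() if min_freq <= n <= max_freq}


# Toy data: three tokens with one or two candidate paths each (IDs are made up).
toy_corpus = [
    [[10, 2, 1]],
    [[11, 2, 1], [12, 3, 1]],
    [[13, 3, 1]],
]
counts = category_frequencies(toy_corpus)
kept = keep_in_interval(counts, min_freq=1, max_freq=2)
print(counts.most_common(3))
print(sorted(kept))
```

In the toy data the top-level category 1 occurs in every path and is dropped, which mirrors the idea of omitting very general, high-frequency categories such as Objekt_0.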

14 Segmentation
- ...at sentence boundaries
- ...or e.g. after every 3 significant words
Example text (shown on the slide once per segmentation method; the segment markers are not preserved in the transcript):
Landesbank schlägt Verträge zwischen Stadt und privaten Investoren vor
Überall wird gebuddelt und gemauert. Hamburg erlebt den größten Geschäftsbau-Boom. Jährlich hinzukommen rund 300 000 Quadratmeter an Büroräumen.
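Segmentation after every k significant words can be sketched as below. Which words count as significant is passed in as a predicate. Slide 22 describes significant words as those that have GermaNet categories; the hand-written set used here is only a stand-in for that lookup, and the function name is hypothetical.

```python
def segment_by_significant_words(tokens, is_significant, k=3):
    """Close a segment as soon as it contains k significant tokens."""
    segments, current, seen = [], [], 0
    for tok in tokens:
        current.append(tok)
        if is_significant(tok):
            seen += 1
            if seen == k:
                segments.append(current)
                current, seen = [], 0
    if current:  # trailing tokens form a final, possibly shorter segment
        segments.append(current)
    return segments


tokens = "Landesbank schlägt Verträge zwischen Stadt und privaten Investoren vor".split()
# Stand-in for "has GermaNet categories": nouns, verbs and adjectives picked by hand.
significant = {"Landesbank", "schlägt", "Verträge", "Stadt", "privaten", "Investoren"}
segments = segment_by_significant_words(tokens, lambda t: t in significant, k=3)
for seg in segments:
    print(seg)
```

Insignificant words stay attached to the segment in which they occur, so segment lengths vary even though each closed segment holds exactly k significant words.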

15 Windows
Text window: n segments with the current segment in the middle
- wider scope than n-grams
Segments: S(i) S(i+1) S(i+2) S(i+3) S(i+4)
Windows (n = 3): W(t) W(t+1) W(t+2)
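The window scheme on this slide (five segments, three windows for n = 3) can be reproduced with a short generator. This is a minimal sketch. The function name is hypothetical, and the handling of the first and last segments, which never become the middle segment here, is an assumption, since the slide does not say how the edges are treated.

```python
def windows(segments, n=3):
    """Yield (window, current_segment) for every full window of n segments."""
    if n % 2 == 0:
        raise ValueError("n must be odd so that a middle segment exists")
    for start in range(len(segments) - n + 1):
        window = segments[start:start + n]
        yield window, window[n // 2]


segments = ["S(i)", "S(i+1)", "S(i+2)", "S(i+3)", "S(i+4)"]
result = list(windows(segments, n=3))
for window, current in result:
    print(window, "current:", current)
```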

16 Training: unsupervised
Compare "Peter goes by train" with "Diana goes by bike": train and bike should both be VEHICLES, but the two words have different ambiguities.

17 Statistics
For a pair of categories (formulas not preserved in the transcript):
- conditional probability
- mutual information
Effect: correct category combinations emerge statistically
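The two statistics named on this slide can be estimated from co-occurrence counts. Because the slide's own formulas are not preserved, the sketch below uses the standard textbook definitions: P(a | b) = count(a, b) / count(b), and pointwise mutual information log2(P(a, b) / (P(a) P(b))). Counting each category once per window, as well as all function names and the toy categories, are assumptions for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def pair_statistics(windows):
    """Collect category and category-pair counts; each window is a set of categories."""
    single, pair = Counter(), Counter()
    for cats in windows:
        cats = set(cats)
        single.update(cats)
        pair.update(frozenset(p) for p in combinations(sorted(cats), 2))
    return single, pair, len(windows)


def conditional_probability(a, b, single, pair):
    """Estimate P(a | b) as count(a, b) / count(b)."""
    return pair[frozenset((a, b))] / single[b]


def mutual_information(a, b, single, pair, total):
    """Pointwise mutual information of a and b, in bits."""
    joint = pair[frozenset((a, b))] / total
    if joint == 0:
        return float("-inf")  # the pair was never observed together
    return math.log2(joint / ((single[a] / total) * (single[b] / total)))


toy = [{"VEHICLE", "MOTION"}, {"VEHICLE", "MOTION"},
       {"MONEY", "MOTION"}, {"MONEY", "BUILDING"}]
single, pair, total = pair_statistics(toy)
cp = conditional_probability("VEHICLE", "MOTION", single, pair)
mi = mutual_information("VEHICLE", "MOTION", single, pair, total)
print(cp, mi)
```

In the spirit of slide 16, a category such as VEHICLE that keeps co-occurring with MOTION accumulates counts across many different words, while the spurious readings of each individual word do not.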

18 Training: Parameters
- Segmentation methods
- Window width
- Limiting calculation time and space requirements:
  – exclusion of certain POS combinations
  – only categories in certain frequency intervals
  – only pairs with frequency > minimum

19 Application
Actual disambiguation process
– input: sentences/text in COOC format, containing ambiguous words
– output: disambiguated sentences/words
– requires training results

20 Procedure
- Connection to the training database
  – selection of parameters (window and segment size) of the training database
- Text processing
  – construction of the initial windows
  – disambiguation of the current segment
  – results are written to the output data

21 Procedure (II)
Window handling:
– the middle (current) segment is disambiguated word by word
– at the end of the segment, the window is moved one segment to the right
S(i) S(i+1) S(i+2) S(i+3) S(i+4)

22 Procedure (III)
Handling the words in the middle (current) segment:
– distinguish significant vs. insignificant words (with and without GermaNet categories)
– for significant words, the most probable meaning is computed and output
– insignificant words are written unchanged to the output data

23 Probability of the Appearance of a Category in Context
(formula not preserved in the transcript)
where:
- MI: mutual information
- PR: conditional probability
- c_0: current category
- c_i: context category

24 Calculation of the most probable meaning
(formula not preserved in the transcript)
where:
- PC: probability of the appearance of a category given a context
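Choosing the most probable meaning amounts to picking the candidate category with the best context score. Since the formulas for PC on slides 23 and 24 are not preserved, the sketch below leaves the pairwise score as a caller-supplied function and simply sums it over the context categories. Both the summation and the toy scores are assumptions, not the project's actual computation.

```python
def most_probable_sense(candidates, context, score):
    """Return the candidate category c0 with the highest summed score over the context."""
    def total(c0):
        return sum(score(c0, ci) for ci in context)
    return max(candidates, key=total)


# Made-up association scores between a candidate category and a context category.
toy_scores = {
    ("FINANCIAL_INSTITUTION", "MONEY"): 2.0,
    ("FURNITURE", "MONEY"): 0.1,
    ("FINANCIAL_INSTITUTION", "MOTION"): 0.5,
    ("FURNITURE", "MOTION"): 0.4,
}

best = most_probable_sense(
    candidates=["FURNITURE", "FINANCIAL_INSTITUTION"],
    context=["MONEY", "MOTION"],
    score=lambda c0, ci: toy_scores.get((c0, ci), 0.0),
)
print(best)
```

With the "bank" example from slide 4, a context containing MONEY pushes the choice towards the financial-institution reading.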

25 Example: Disambiguation
Folklore, Rock, Klassik und Jazz zu vermischen reicht ihnen nicht, sie nutzen die Elektronik und sind sogar dazu übergegangen, Instrumente selbst zu bauen.
(Not satisfied to merely mix up Folk, Rock, Classical, and Jazz, they make use of Electronic Music as well, and go so far as to build their own instruments.)
Input (COOC format):
3002 Rock Rock NN [ 39981... 3228 ] [ 39981... 3228 ]
3004 Klassik Klassik NN [ 221503... 221266 ]
3008 vermischen vermischen VVINF [ 643704 643048 ]
3009 reicht reichen VVFIN [ 21538 ] [ 339847 307402 ] [ 581324... 568361 ] [ 581324... 862674 ] [ 581324... 912753 ] [ 586102 585849 ] [ 588150... 586261 ]
3016 Elektronik Elektronik NN [ 405356... 383322 ]
3023 Instrumente Instrument NN [ 5357 3228 ] [ 142311... 3228 ]
3026 bauen bauen VVINF [ 650176 647379 ] [ 742021... 734399 ] [ 743571... 734399 ] [ 743710 735354 734399 ]
Output:
3002 Rock Rock NN 2 Rock_0
3004 Klassik Klassik NN 1 Klassik_0
3008 vermischen vermischen VVINF 1 vermengen_0_vermischen_0
3009 reicht reichen VVFIN 7 reichen_0
3014 nutzen nutzen VVFIN 2 nutzen_2_nützen_2
3016 Elektronik Elektronik NN 1 Elektronik_0
3023 Instrumente Instrument NN 2 Musikinstrument_0_Instrument_2
3026 bauen bauen VVINF 4 bauen_3

26 Evaluation: Comparison
Test corpus:
1017 Komponisten Komponist NN 1 Komponist_0_Komponistin_0
2010 Möglichkeiten Möglichkeit NN 2 Möglichkeit_2_Eventualität_0
14011 verfügbar verfügbar ADJD 0
14014 machen machen VVINF 6 betätigen_0_treiben_0_machen_0
24006 wirkt wirken VVFIN 6 wirken_2
Evaluation corpus (Negra/Lexsem corpus):
1017 Komponisten Komponist NN Komponist_0_Komponistin_0
2010 Möglichkeiten Möglichkeit NN Möglichkeit_2_Chance_0_Gelegenheit_0
14011 verfügbar verfügbar ADJD unknown
14014 machen machen VVINF unspec
24006 wirkt wirken VVFIN wirken_2

27 Meanings in the test corpus
- 2346 words annotated, with an average of 3.1 meanings per word
- 1366 of these are ambiguous, with an average of 4.6 meanings

28 Results (3 segments per window)

Segment size                 0 (sentence)      2       5       7      10      15
count                        1882
trivial                       773
hit count                     703            586     688     718     724     720
incorrect                     347            523     421     349     357     366
not disambiguated              59            210      52      42      28      23
Precision (all) [32.3%]     80.97          81.28   79.84   81.03   80.74   80.31
Precision (amb.) [21.7%]    66.95          52.84   62.04   67.30   66.98   66.30
Recall                      96.51          88.51   96.88   97.41   98.15   98.41
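The slide does not state how the precision figures are computed. As a plausibility check, the sketch below assumes that precision on ambiguous words is hits / (hits + incorrect) and that overall precision additionally counts the trivial (unambiguous) words as correct. Under these assumed definitions the two precision values of the sentence-segmentation column are reproduced. Recall is left out because its denominator cannot be derived from the slide.

```python
def precision_ambiguous(hits, incorrect):
    """Share of correctly disambiguated words among the attempted ambiguous words, in percent."""
    return 100 * hits / (hits + incorrect)


def precision_all(trivial, hits, incorrect):
    """As above, but trivial (unambiguous) words count as correct, in percent."""
    return 100 * (trivial + hits) / (trivial + hits + incorrect)


# Values from the sentence-segmentation column of the results table.
p_amb = round(precision_ambiguous(hits=703, incorrect=347), 2)
p_all = round(precision_all(trivial=773, hits=703, incorrect=347), 2)
print(p_amb, p_all)  # 66.95 80.97
```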

29 Summary
COOC:
- is the first unsupervised, corpus-based method of disambiguating semantically ambiguous words for German
- goes beyond n-gram statistics
- uses plain text, GermaNet, Mmorph and a POS tagger
- is a tool for unsupervised learning, semantic tagging, and evaluation
- first evaluation gives 67.3% precision on ambiguous words (81% over all words) and 97.4% recall

30 Outlook
- Use of GermaNet 2 (but a hand-labeled evaluation corpus is still needed)
- Repeat the experiment with WordNet and the Penn Treebank corpus
- Several experiments to determine optimal parameters
- Two theses:
  – lexical disambiguation
  – general predictions

