COOC Practicum / Software Project, SS 2000 Final Report Tanja von den Berg, Tilman Jäger, Kerstin Klöckner, Stephan Lesch, Holger Neis, Norbert Pfleger,

Slide 1: COOC Practicum / Software Project, SS 2000, Final Report
Tanja von den Berg, Tilman Jäger, Kerstin Klöckner, Stephan Lesch, Holger Neis, Norbert Pfleger, Diana Raileanu, Hubert Schlarb
Supervisors: Jan Alexandersson, Paul Buitelaar

Slide 2: Contents
- Intro
- Theoretical foundations
- At the outset
- Project aspects
  - Preprocessing
  - Training
  - Application
  - Evaluation
- Outlook

Slide 3: Intro
- Word Sense Disambiguation (WSD) as preparation for the semantic analysis of text documents
- Application areas: translation systems, information retrieval systems, document classification, etc.
- Machine learning approaches:
  - supervised (semantically tagged corpora)
  - unsupervised (untagged corpora)
- COOC: the first unsupervised, corpus-based approach for German

Slide 4: Theoretical Foundations
- WSD (Word Sense Disambiguation) in context, e.g. "bank": place to sit vs. financial institution
  - I'm going to the bank to get some money.
- COOC: cooccurrence of words in a given context
- GermaNet: WordNet for German
- WordNet:
  - lexical and semantic database
  - semantic net, ontology
  - lexical and conceptual relations (antonymy, hyponymy)

Slide 5: Theoretical Foundations (II)
- Method:
  - knowledge sources (WordNet, thesaurus)
  - the possibility of finding relations between words and meanings
- supervised:
  - requires already disambiguated data
  - requires large amounts of data
- unsupervised:
  - requires even more data
  - data need not be disambiguated

Slide 6: Theoretical Foundations (III)
Examples of unsupervised methods:
- Lesk (1986): comparison among dictionary entries
- Yarowsky (1992):
  - Roget's Thesaurus, Grolier's Encyclopedia
  - collections of contexts for a thesaurus category
  - identification of characteristic words
- Resnik (1997):
  - Penn Treebank corpus, POS-tagged, syntactically annotated
  - selectional preference (predicate arguments)

Slide 7: At the outset
- Approach of Seligman (94):
  - Japanese dialogues (direction finding, hotel reservations in spontaneous speech)
  - thesaurus with 4 fixed abstraction levels
  - explicit semantic smoothing
- COOC project:
  - Tiger corpus (Frankfurter Rundschau)
  - GermaNet with a varying number of abstraction levels (up to 26)
  - implicit semantic smoothing

Slide 8 (Training): Flow diagram
[Figure: flow diagram]

Slide 9: Preprocessing
- Conversion of the training corpus (plain text) into the COOC format
- Statistics on GermaNet categories

Slide 10 (Preprocessing): Resources
- Tiger corpus ([...] tokens): German newspaper text from the Frankfurter Rundschau
- TnT tagger (Brants 2000): statistical part-of-speech tagger
- Mmorph (Petitpierre & Russell, 1995): morphological analysis tool
- GermaNet: lexico-semantic network for German (about [...] nouns, 6000 verbs, 3500 adjectives)

Slide 11 (Preprocessing): COOC Format
Philip Glass wurde auf seinen weltweiten Tourneen mit Kassetten und Tonbändern überschüttet.
(Philip Glass was showered with audio tapes and cassettes during his worldwide tours.)
seinen NA PPOSAT
167 weltweiten weltweit ADJA [ ]
Tonbändern Tonband NN [ ]... [ ]
173 überschüttet überschütten VVPP [ ] [ ]...

Slide 12 (Preprocessing): GermaNet Hierarchy
[Figure: GermaNet hierarchy]

Slide 13 (Preprocessing): Statistics on GermaNet Categories
- Omission of higher-frequency categories
- Reduction of computational complexity
- Format: Frequency ID(Offset) Synset
- Example: Objekt_ Situation_ Kofferraum_ Intellekt_0_Genius_0

Slide 14 (Training): Segmentation
- ...at sentence boundaries
- ...or e.g. after every 3 significant words
Example text:
Landesbank schlägt Verträge zwischen Stadt und privaten Investoren vor
Überall wird gebuddelt und gemauert.
Hamburg erlebt den größten Geschäftsbau-Boom.
Jährlich hinzukommen rund Quadratmeter an Büroräumen.
(State bank proposes contracts between city and private investors. Digging and bricklaying everywhere. Hamburg is experiencing its biggest commercial construction boom. Each year around [...] square metres of office space are added.)
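
The two segmentation methods named above could be sketched as follows. This is a minimal illustration, not the COOC implementation: the function names are hypothetical, and a small stop-word list stands in for the real significance test (a word having GermaNet categories).

```python
import re

def segment_by_sentence(text):
    """Split text into segments after sentence-final punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.split() for p in parts if p]

def segment_by_significant(words, is_significant, k=3):
    """Close a segment after every k significant words."""
    segments, current, seen = [], [], 0
    for word in words:
        current.append(word)
        if is_significant(word):
            seen += 1
            if seen == k:
                segments.append(current)
                current, seen = [], 0
    if current:  # trailing words that did not fill a whole segment
        segments.append(current)
    return segments

text = "Überall wird gebuddelt und gemauert. Hamburg erlebt den größten Geschäftsbau-Boom."
sentences = segment_by_sentence(text)

# Assumption for the demo: words outside this stop list count as significant.
stop = {"wird", "und", "den"}
words = [w.strip(".") for w in text.split()]
chunks = segment_by_significant(words, lambda w: w.lower() not in stop, k=3)
```

With k = 3 the second method cuts across the sentence boundary only when a sentence holds fewer or more than three significant words, which is why the two methods generally yield different segments for the same text.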

Slide 15 (Training): Windows
- Text window: n segments with the current segment in the middle
- wider scope than n-grams
- Diagram (n = 3): segments S(i) S(i+1) S(i+2) S(i+3) S(i+4); windows W(t) W(t+1) W(t+2)
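
A text window of n segments with the current segment in the middle could be built as in the sketch below. The function name is hypothetical, and clipping the window at the start and end of the text is an assumption, since the slide does not say how borders are treated.

```python
def windows(segments, n=3):
    """Yield (current_index, window): n segments with the current one in the middle.

    Windows are clipped at the text borders (an assumption of this sketch).
    """
    if n % 2 == 0:
        raise ValueError("n must be odd so that a middle segment exists")
    half = n // 2
    for i in range(len(segments)):
        lo, hi = max(0, i - half), min(len(segments), i + half + 1)
        yield i, segments[lo:hi]

segments = ["S0", "S1", "S2", "S3", "S4"]
result = list(windows(segments, n=3))
```

For five segments and n = 3, the three full windows are centred on S1, S2 and S3, which matches the three windows drawn over five segments in the diagram.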

Slide 16 (Training): Unsupervised training
- Compare "Peter goes by train" with "Diana goes by bike"
- "train" and "bike" should both be VEHICLES, but each has different ambiguities

Slide 17 (Training): Statistics
- For a pair of categories:
  - conditional probability
  - mutual information
- Effect: correct category combinations emerge statistically
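
The formulas on this slide did not survive. As a rough sketch, the two statistics can be estimated from window counts using the standard definitions: relative-frequency conditional probability and pointwise mutual information. Whether COOC used exactly these estimators is an assumption, and the category names and data below are invented.

```python
import math
from collections import Counter
from itertools import combinations

def pair_statistics(windows):
    """Return (cond_prob, mutual_info) for category pairs co-occurring in windows.

    Each window is a collection of category IDs; a category counts once per window.
    """
    total = len(windows)
    single, pair = Counter(), Counter()
    for w in windows:
        cats = sorted(set(w))
        single.update(cats)
        pair.update(combinations(cats, 2))

    cond_prob, mutual_info = {}, {}
    for (a, b), n_ab in pair.items():
        cond_prob[(a, b)] = n_ab / single[b]   # P(a | b)
        cond_prob[(b, a)] = n_ab / single[a]   # P(b | a)
        p_ab = n_ab / total
        mi = math.log2(p_ab / ((single[a] / total) * (single[b] / total)))
        mutual_info[(a, b)] = mutual_info[(b, a)] = mi
    return cond_prob, mutual_info

# Hypothetical windows of category IDs.
data = [["VEHICLE", "MOTION"], ["VEHICLE", "MOTION"], ["VEHICLE", "FOOD"], ["FOOD", "PLANT"]]
cond_prob, mutual_info = pair_statistics(data)
```

Pairs that co-occur more often than chance get positive mutual information (here VEHICLE and MOTION), while pairs that co-occur less often than chance get negative values (VEHICLE and FOOD).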

Slide 18 (Training): Parameters
- Segmentation methods
- Window width
- Limiting calculation time and space requirements:
  - exclusion of certain POS combinations
  - only categories in certain frequency intervals
  - only pairs with frequency > minimum
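
The three restrictions listed above can be read as filters on the table of category-pair counts. The sketch below assumes hypothetical data structures (dictionaries keyed by category ID) and invented thresholds; it only illustrates the filtering logic.

```python
def filter_pairs(pair_counts, cat_counts, pos_of, excluded_pos, freq_range, min_pair_freq):
    """Return the pair counts that pass all three filters."""
    low, high = freq_range
    kept = {}
    for (a, b), n in pair_counts.items():
        if (pos_of[a], pos_of[b]) in excluded_pos:
            continue  # excluded POS combination
        if not (low <= cat_counts[a] <= high and low <= cat_counts[b] <= high):
            continue  # a category lies outside the frequency interval
        if n <= min_pair_freq:
            continue  # pair is not frequent enough
        kept[(a, b)] = n
    return kept

# Hypothetical counts; all IDs and numbers are made up.
pair_counts = {("c1", "c2"): 5, ("c1", "c3"): 1, ("c2", "c4"): 4, ("c3", "c5"): 7}
cat_counts = {"c1": 10, "c2": 8, "c3": 3, "c4": 500, "c5": 4}
pos_of = {"c1": "NN", "c2": "VV", "c3": "ADJ", "c4": "NN", "c5": "ADJ"}
kept = filter_pairs(pair_counts, cat_counts, pos_of,
                    excluded_pos={("ADJ", "ADJ")},
                    freq_range=(2, 100), min_pair_freq=2)
```

Only the pair ("c1", "c2") survives here: one pair is too rare, one involves an over-frequent category, and one has an excluded POS combination.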

Slide 19: Application
- Actual disambiguation process
  - input: sentences/text in COOC format, containing ambiguous words
  - output: disambiguated sentences/words
  - requires training results

Slide 20 (Application): Procedure
- Connection to the training database
  - selection of parameters (window and segment size) of the training database
- Text processing
  - construction of the initial windows
  - disambiguation of the current segment
  - results are written to the output data

Slide 21 (Application): Procedure (II)
- Window handling:
  - the middle (current) segment is disambiguated word by word
  - at the last segment, the window is moved one segment to the right
- Diagram (two rows of segments): S(i) S(i+1) S(i+2) S(i+3) S(i+4)

Slide 22 (Application): Procedure (III)
- Handling the words in the middle (current) segment
  - distinguish significant vs. insignificant words (with and without GermaNet categories)
  - for significant words, the most probable meaning is computed and output
  - insignificant words are written unchanged into the output data
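
The procedure on the last three slides can be summarised in one loop: slide a window over the segments, and within the current segment score significant words against the categories around them while passing insignificant words through. The sketch below is an illustration only. The data layout, the function names and the toy `most_shared` chooser (which stands in for the trained statistics) are all assumptions.

```python
def disambiguate(segments, choose, n=3):
    """Disambiguate every segment using a window of n segments around it.

    segments: list of segments; each segment is a list of (word, candidates).
    choose:   function (candidates, context_categories) -> chosen category.
    """
    half = n // 2
    output = []
    for i, segment in enumerate(segments):
        window = segments[max(0, i - half): i + half + 1]
        for j, (word, candidates) in enumerate(segment):
            if not candidates:
                output.append((word, None))  # insignificant: passed through unchanged
                continue
            # Context: every category in the window except those of the word itself.
            context = [c for seg in window for k, (_, cands) in enumerate(seg)
                       if not (seg is segment and k == j) for c in cands]
            output.append((word, choose(candidates, context)))
    return output

def most_shared(candidates, context):
    """Toy stand-in for the trained model: prefer the candidate seen most in context."""
    return max(candidates, key=context.count)

segments = [
    [("Peter", []), ("train", ["VEHICLE", "TEACH"])],
    [("Diana", []), ("bike", ["VEHICLE"])],
]
result = disambiguate(segments, most_shared)
```

In this toy run the ambiguous "train" is resolved to VEHICLE because the neighbouring segment contributes that category, which mirrors the train/bike example from the training slides.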

Slide 23 (Application): Probability of the Appearance of a Category in Context
[Formula]
where:
- MI: mutual information
- PR: conditional probability
- c_0: current category
- c_i: context category

Slide 24 (Application): Calculation of the most probable meaning
[Formula]
where:
- PC: probability of the appearance of a category given a context
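
Since the formulas themselves are not preserved, the following is only a sketch of the general idea: score each candidate category against the context categories and pick the best one. The particular combination used here, the sum over context categories of MI(c_0, c_i) * PR(c_0 | c_i), is an assumption made for illustration and should not be read as the COOC formula. The category names and numbers are invented.

```python
def category_score(c0, context, mi, pr):
    """Illustrative score: sum over context categories of MI(c0, ci) * PR(c0 | ci).

    The combination is an assumption of this sketch; pairs never seen in
    training contribute nothing.
    """
    return sum(mi.get((c0, ci), 0.0) * pr.get((c0, ci), 0.0) for ci in context)

def most_probable(candidates, context, mi, pr):
    """Return the candidate category with the highest score in this context."""
    return max(candidates, key=lambda c0: category_score(c0, context, mi, pr))

# Hypothetical training results for the German noun "Rock" (skirt vs. rock music).
mi = {("MUSIC", "INSTRUMENT"): 1.2, ("CLOTHING", "INSTRUMENT"): -0.4, ("MUSIC", "ART"): 0.8}
pr = {("MUSIC", "INSTRUMENT"): 0.5, ("CLOTHING", "INSTRUMENT"): 0.1, ("MUSIC", "ART"): 0.3}
best = most_probable(["CLOTHING", "MUSIC"], ["INSTRUMENT", "ART"], mi, pr)
```

Whatever the exact scoring function, the selection step is an argmax over the candidate categories of the current word.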

Slide 25 (Application): Example: Disambiguation
Folklore, Rock, Klassik und Jazz zu vermischen reicht ihnen nicht, sie nutzen die Elektronik und sind sogar dazu übergegangen, Instrumente selbst zu bauen.
(Not satisfied to merely mix up Folk, Rock, Classical, and Jazz, they make use of Electronic Music as well, and go so far as to build their own instruments.)

Input (COOC format, candidate categories in brackets):
Rock Rock NN [ ] [ ]
3004 Klassik Klassik NN [ ]
3008 vermischen vermischen VVINF [ ]
3009 reicht reichen VVFIN [ ] [ ] [ ] [ ] [ ] [ ] [ ]
3016 Elektronik Elektronik NN [ ]
3023 Instrumente Instrument NN [ ] [ ]
3026 bauen bauen VVINF [ ] [ ] [ ] [ ]

Output (disambiguated):
3002 Rock Rock NN 2 Rock_
Klassik Klassik NN 1 Klassik_
vermischen vermischen VVINF 1 vermengen_0_vermischen_
reicht reichen VVFIN 7 reichen_
nutzen nutzen VVFIN 2 nutzen_2_nützen_
Elektronik Elektronik NN 1 Elektronik_
Instrumente Instrument NN 2 Musikinstrument_0_Instrument_
bauen bauen VVINF 4 bauen_

Slide 26 (Evaluation): Comparison

Test corpus:
1017 Komponisten Komponist NN 1 Komponist_0_Komponistin_
Möglichkeiten Möglichkeit NN 2 Möglichkeit_2_Eventualität_
verfügbar verfügbar ADJD
machen machen VVINF 6 betätigen_0_treiben_0_machen_
wirkt wirken VVFIN 6 wirken_2

Evaluation corpus (Negra/Lexsem corpus):
1017 Komponisten Komponist NN Komponist_0_Komponistin_
Möglichkeiten Möglichkeit NN Möglichkeit_2_Chance_0_Gelegenheit_
verfügbar verfügbar ADJD unknown
14014 machen machen VVINF unspec
24006 wirkt wirken VVFIN wirken_2

Slide 27 (Evaluation): Meanings in the test corpus
- 2346 words annotated, with an average of 3.1 meanings per word
- 1366 of these ambiguous, with an average of 4.6 meanings

Slide 28 (Evaluation): Results (3 segments/window)
Columns: six segment sizes (0 = sentence); the other column labels are not preserved.
count: 1882
trivial: 773
hit count, incorrect, not disambiguated: values not preserved

Precision (all)   [32.3%]   80.97   81.28   79.84   81.03   80.74   80.31
Precision (amb.)  [21.7%]   66.95   52.84   62.04   67.30   66.98   66.30
Recall                      96.51   88.51   96.88   97.41   98.15   98.41

[Figure labels: segment size, sentences, not disambiguated]

Slide 29: Summary
COOC:
- is the first unsupervised, corpus-based method of disambiguating semantically ambiguous words for German
- goes beyond n-gram statistics
- uses plain text, GermaNet, Mmorph and a POS tagger
- is a tool for unsupervised learning, semantic tagging, and evaluation
- achieves 67.3% precision on ambiguous words (81% on all words) and 97.4% recall in a first evaluation

Slide 30: Outlook
- Use of GermaNet 2 (a hand-labeled evaluation corpus is still needed)
- Repeat the experiment with WordNet and the Penn Treebank corpus
- Several experiments to determine optimal parameters
- Two theses:
  - lexical disambiguation
  - general predictions