Learning semantic relations using very large corpora


Learning semantic relations using very large corpora Uwe Quasthoff Institut für Informatik Universität Leipzig quasthoff@informatik.uni-leipzig.de www.wortschatz.uni-leipzig.de

Contents
Part 1: Introduction to the Wortschatz lexicon
Part 2: Collocations and the collocation measure
Part 3: Applications of collocations
Part 4: Clustering Collocations
Part 5: The Vector Space Model
Part 6: Combining Simple Methods
Part 7: Temporal Analysis: Words of the Day
Part 8: Document Similarity

Language Data

Full lexicon entry for Leipzig
Count: 9967
Description: city in Germany (over 250,000 inhabitants); city in Saxony
Grammar: part of speech: proper noun; forms: Leipzig [9967], Leipzigs [276]; -er adjective / inhabitant of the city: Leipziger [3553]
Part of multi-words: VfB Leipzig [403], SC Leipzig [183], ..., Erste Baugesellschaft Leipzig AG [8], ...
Example: Auch Debütanten aus angrenzenden Sprachräumen, die sich über Leipzig den deutschen Buchmarkt erschließen möchten, bietet die Buchmesse ein geeignetes Forum. (Source: OTS newsticker)
Sentence collocations: Dresden (1488), Berlin (694), Halle (470), Universität (266), Sachsen (265), ..., DDR-Bürger (5), DDR-Innenminister (5), DTSB (5), ...
Left neighbors: Universität (392), Stadt (201), Reclam (102), Handelshochschule (51), Oper (50), Karl-Marx-Universität (48), Raum (36), ...

Empirical Analysis of Associations

Collocations for Schweine
On the right-hand side we find a collection of similar animals (all in the plural): Rinder, Hühner, Kühe, Schafe. On the left-hand side we find words describing the aspect of slaughtering.

Collocations for Stich
Two groups for different meanings: tennis (Michael Stich, Boris Becker etc.) and the card game Skat with the three players Vorhand (lead), Mittelhand, and Hinterhand. The thin connection between Becker and Vorhand represents Becker's strong forehand.

Funny collocation sets
Identifying English words in German text: collocations for the: of, and, to, The, on, for, is, from, you, with, that, it, world, are, be, not, We, at, World, we, have, this, by, they, when, You, can, When, into, what, your, or, But, time, And, like, over, Breaking, only, one, but, shall, which, has, What, road, as, On, same, people, out, our, This, It, way, best, who, no, my, more, his, up, their, ...
In the same way we find dialect words. The Berlin dialect is identified using the collocations for ick: det, nich, Ick, Det, hab, is, ne, ooch, keene, wat, weeß, uff, de, ma, nu, keen, dat, aba, och, jing, jut, Nee, meen, Jöre, een, mach, inne, watt, wa, jenuch, kieke, janze, kumm, janz, tau, Mutta, janzen, hätt, sag, wieda, kleene, ha, hör, imma, un, habense, kriejen, ejal, zwee, nischt, nee, Wetta, jedacht, hebb, heff, ...

Analysis of other languages
The above procedure was applied (without changes) to English, French, Dutch, Sorbian, and Italian.

Dutch and French

Sorbian and Italian


Part 2: Collocations and the collocation measure

The Common Birthday Problem
The common birthday problem: what is the probability p that (at least) two people in a group of randomly chosen people share a birthday?
Modification: what is the probability p that there are k couples with the same birthday (different birthdays are allowed for different couples), given a boys and b girls chosen at random?
Reformulation as a collocation problem:
Common Birthday Problem                      | Collocation Problem
Number a of boys                             | Number a of sentences containing A
Number b of girls                            | Number b of sentences containing B
Number of days in a year (i.e. n = 365)      | Total number n of sentences
Number k of couples with the same birthday   | Number k of sentences containing both A and B

Introduction to the Poisson Distribution
We can calculate the probability of multiple joint occurrences of independent events as follows. Given two independent events observed in an experiment with probabilities pa and pb, respectively, the probability of their joint occurrence is pa·pb. Next we repeat the experiment n times and are interested in k joint occurrences. Using λ = n·pa·pb we get the probability
P(X = k) = e^(−λ) λ^k / k!
For at least k joint occurrences we get the probability
P(X ≥ k) = 1 − Σ_{i=0}^{k−1} e^(−λ) λ^i / i!
To measure the surprise of a joint occurrence of non-independent events we simply calculate the probability as if they were independent; we are then surprised to see such a rare event.
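As a sketch, the two probabilities can be computed directly from the Poisson distribution (function names and example figures are mine, not from the slides):

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k joint occurrences, with lam = n * pa * pb."""
    return exp(-lam) * lam**k / factorial(k)

def poisson_tail(k, lam):
    """Probability of at least k joint occurrences."""
    return 1.0 - sum(poisson_pmf(i, lam) for i in range(k))

# example: pa = pb = 0.01 over n = 100,000 repetitions, so lam = 10 expected joint hits
lam = 100_000 * 0.01 * 0.01
print(poisson_tail(30, lam))  # observing 30 or more joint occurrences is very rare
```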

The Collocation Measure
The collocation measure of two words A and B is defined as the negative logarithm of the above probability, divided by log n:
sig(n, k, a, b) = −log P(X ≥ k) / log n
For λ = ab/n we get
sig(n, k, a, b) ≈ (λ − k log λ + log k!) / log n
Approximations: if (k+1)/λ > 10 (this is typically true) and, moreover, k > 10, we get
sig(n, k, a, b) ≈ k (log k − log λ − 1) / log n
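A minimal implementation of the measure under these definitions, using the exact Poisson tail for small k and the stated approximation otherwise (the function name and the numerical guard are my own):

```python
from math import exp, factorial, lgamma, log

def sig(n, k, a, b):
    """Collocation measure: -log P(X >= k) / log n, with lambda = a*b/n."""
    lam = a * b / n
    if k <= 10:
        # exact Poisson tail for small k
        tail = 1.0 - sum(exp(-lam) * lam**i / factorial(i) for i in range(k))
        return -log(max(tail, 1e-300)) / log(n)
    # stated approximation for k > 10 and (k+1)/lam > 10; log k! via lgamma
    return (lam - k * log(lam) + lgamma(k + 1)) / log(n)

# two words with 1,000 occurrences each in 100,000 sentences, seen together 30 times
print(sig(100_000, 30, 1000, 1000))
```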

Comparison to Log-Likelihood
Translating the log-likelihood formula into our notation and ignoring small terms, we get
LL ≈ 2k (log k − log λ)
compared to
sig ≈ k (log k − log λ − 1) / log n
Note: this may only apply in the typical case (k+1)/λ > 10.

Comparing Results: The Sources
                    | IDS Cosmas I (W-PUB)     | Wortschatz (German)
Corpus size         | 374 Mio                  | 255 Mio
Sources             | mainly newspapers        | mainly newspapers
Window size         | fixed (here: ±5 words)   | sentence
Collocation measure | log-likelihood           | Poisson distribution

Comparing Results: Collocations for Bier
Rank | IDS Cosmas I | rating | Wortschatz (German) | sig-rating
1    | Wein         | 4351   | trinken             | 1234
2    |              | 2745   |                     | 648
3    | getrunken    | 1715   |                     | 478
4    | kühles       | 1627   | Liter               | 460
5    | Glas         | 1379   | trinkt              | 428
6    |              | 1318   |                     | 348
7    | Faß          | 1236   | Schnaps             | 318
8    | Fass         | 1139   | Hektoliter          | 300
9    | Flasche      | 1071   | Flaschen            | 272
10   |              | 899    | gebraut             | 269
11   | Trinkt       | 881    |                     | 242
12   |              | 873    | Sekt                | 239
Kaffee

Properties of sig(n, k, a, b) I
Simple co-occurrence: A and B each occur only once, and they occur together: sig(n, 1, 1, 1) → 1. This ensures that the minimum significance threshold is independent of the corpus size.
Independence: A and B occur statistically independently with probabilities p and q: sig(n, npq, np, nq) → 0.
Enlarging the corpus by a factor m: sig(mn, mk, ma, mb) ≈ m·sig(n, k, a, b). This is useful for comparing corpora of different sizes.

Properties of sig(n, k, a, b) II
Additivity: unifying the words B and B' just adds the corresponding significances. For k/b ≈ k'/b' we have sig(n, k, a, b) + sig(n, k', a, b') ≈ sig(n, k+k', a, b+b'). This has applications for grouping words by various methods.
Maximum: max_B sig(A, B) ≈ a. It can be useful to know how strong a collocation is compared to the possible maximum.
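A rough numerical check of the scaling and additivity properties, using the k > 10 approximation of the measure (the test values are arbitrary, and both properties hold only approximately for finite corpora):

```python
from math import lgamma, log

def sig(n, k, a, b):
    # k > 10 approximation of the collocation measure
    lam = a * b / n
    return (lam - k * log(lam) + lgamma(k + 1)) / log(n)

base = sig(10**7, 50, 2000, 3000)

# scaling: enlarging the corpus by m = 10 multiplies the measure by roughly m
scaled = sig(10 * 10**7, 10 * 50, 10 * 2000, 10 * 3000)

# additivity: k/b = 50/3000 and k'/b' = 25/1500 are equal, so the
# significances of B and B' should add up when the two words are unified
lhs = sig(10**7, 50, 2000, 3000) + sig(10**7, 25, 2000, 1500)
rhs = sig(10**7, 75, 2000, 4500)

print(base, scaled, lhs, rhs)
```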

Part 3: Applications of collocations

Applications
Collocations of inflected forms or base forms?
A numeric measure for polysemy
Identification of proper names and phrases
Compound analysis

Collocations of inflected forms or base forms?
Collocations of base forms will give more results because of their higher frequency. But collocations of base forms and inflected forms may differ strongly.
Example: collocations for As: Karo (488), K (393), Pik (391), Treff (307), Coeur (296), D (258), Herz (190), Karte (189), Kreuz (178), As (166), Süd (145), Matchball (113), Hinterhand (110), West (110), Karo-Bube (101), Ärmel (95), Ost (94), Vorhand (86), …
Collocations for Asse: Ivanisevic (72), Becker (63), schlug (62), Aufschlag (56), servierte (47), Sampras (40), Goran Ivanisevic (32), Spiel (25), Stich (24), gewann (24), Wolfenbüttel (23), Kroate (22), schlagen (22), Asse (21), Match (21), Satz (21), …

Application of additivity
First calculate collocations for inflected forms; then, if desired, use additivity to calculate the measure for base forms.
Example: collocations for Bundeskanzler: ..., betonte (46), ..., betont (21), ..., betonten (7), ...
Additivity gives significance 74 for the pair Bundeskanzler, {betonen, betont, betonte, betonten}.

A numeric measure for polysemy: space
The collocations of space taken from our general-language corpus of English fall mainly into three classes: the subject areas computing, real estate, and outer space. The corresponding senses of space are denoted space1, space2, and space3. Assigning the top 30 collocations of space (disk, shuttle, square, station, NASA, feet, …) to these three senses, we get a quantitative estimate of the weight of each sense:
space1, 28.2%: disk (2629), memory (718), storage (479), program (308), RAM (307), free (300), hard (336)
space2, 53.2%: shuttle (2618), station (991), NASA (920), Space (602), launch (505), astronauts (473), Challenger (420), manned (406), NASA's (297), flight (293), Atlantis (291), Mir (335), rocket (329), orbit (326), Discovery (341), mission (385)
space3, 18.6%: square (1163), feet (822), leased (567), office (382), lessor (390)
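The sense weights can be recomputed from the significance values listed above by summing per sense (the grouping follows the slide; the code itself is my own sketch):

```python
# significance values of the top collocations of "space", grouped by sense
senses = {
    "space1 (computing)": {"disk": 2629, "memory": 718, "storage": 479,
                           "program": 308, "RAM": 307, "free": 300, "hard": 336},
    "space2 (outer space)": {"shuttle": 2618, "station": 991, "NASA": 920,
                             "Space": 602, "launch": 505, "astronauts": 473,
                             "Challenger": 420, "manned": 406, "NASA's": 297,
                             "flight": 293, "Atlantis": 291, "Mir": 335,
                             "rocket": 329, "orbit": 326, "Discovery": 341,
                             "mission": 385},
    "space3 (real estate)": {"square": 1163, "feet": 822, "leased": 567,
                             "office": 382, "lessor": 390},
}

total = sum(sum(s.values()) for s in senses.values())
shares = {name: round(100 * sum(s.values()) / total, 1) for name, s in senses.items()}
print(shares)  # roughly 28% / 53% / 19%, close to the figures on the slide
```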

Proper Names and Phrases
A large relative collocation measure sigC(A) indicates that a substantial part of all occurrences of the word C appear together with A. Hence, C might be the head with respect to A.
Left word     | Right word   | "head"
Alzheimersche | Krankheit    | left
AQA           | total        |
Anorexia      | nervosa      | left and right
Algighiero    | Boetti       |
30jährige     | US-Bond      | right
André         | Lussi        |

Compound analysis using multi-word collocations
Assume we know that Geschwindigkeitsüberschreitung has the parts Geschwindigkeit and Überschreitung. If a multi-word collocation (here: Überschreitung der Geschwindigkeit) matches some predefined pattern, we accept this collocation as a semantic description of the compound.
Pattern     | Word A   | Word B        | Compound
A aus B     | Orgie    | Farben        | Farbenorgie
A der B     | Bebauung | Insel         | Inselbebauung
A mit B     | Feld     | Getreide      | Getreidefeld
A in der B  | Feldbau  | Regenzeit     | Regenzeitfeldbau
A für B     | Übung    | Anfänger      | Anfängerübung
A für die B | Gebäude  | Flugsicherung | Flugsicherungsgebäude
A von B     | Anbau    | Kaffee        | Kaffeeanbau
A zur B     | Andrang  | Eröffnung     | Eröffnungsandrang
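A sketch of the acceptance test, with the table's connectives as the predefined patterns (the function name and the toy calls are mine):

```python
# connectives from the pattern table above (pattern "A <conn> B")
CONNECTIVES = ["aus", "der", "mit", "in der", "für", "für die", "von", "zur"]

def supports_split(multiword, head, modifier):
    """Accept 'head <conn> modifier' as evidence that the compound
    modifier + head was split correctly."""
    return any(multiword == f"{head} {conn} {modifier}" for conn in CONNECTIVES)

print(supports_split("Überschreitung der Geschwindigkeit",
                     "Überschreitung", "Geschwindigkeit"))  # True
print(supports_split("Woche auf dem Tisch", "Woche", "Tisch"))  # False: no pattern
```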

Part 4: Clustering Collocations

Clustering
So far we are able to find relations between words, but they are still of unknown type; moreover, different types are mixed.
Problem 1: how to construct sets of relations of a fixed type?
Problem 2: how to identify the type of a relation using background information?

Collocations of collocation sets
The production of collocations is now applied to sets of (next-neighbor or sentence) collocations instead of sentences.
The collection of 500,000 sentence collocations has the following 'sentence' for Hemd:
Hemd Krawatte Hose weißes Anzug weißem Jeans trägt trug bekleidet weißen Jacke schwarze Jackett schwarzen Weste kariertes Schlips Mann
The collection of 250,000 next-neighbor collocations has the following two 'sentences' for Hemd:
weißes weißem weißen blaues kariertes kariertem offenem aufs karierten gestreiftes letztes ...
näher bekleidet ausgezogen spannt trägt aufknöpft ausgeplündert auszieht wechseln aufgeknöpft ausziehen ...
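The idea of treating collocation sets as pseudo-sentences can be sketched with plain co-occurrence counts (toy data, not the real collections described above):

```python
from itertools import combinations
from collections import Counter

# toy 'sentences': each one is the collocation set of a word
colloc_sets = {
    "Hemd":  {"Krawatte", "Hose", "Anzug", "Jacke", "tragen"},
    "Hose":  {"Hemd", "Jacke", "Anzug", "Schuhe", "tragen"},
    "Jacke": {"Hemd", "Hose", "Anzug", "Mantel", "tragen"},
}

# count joint occurrences of word pairs across the pseudo-sentences
pair_counts = Counter()
for words in colloc_sets.values():
    for a, b in combinations(sorted(words), 2):
        pair_counts[(a, b)] += 1

# pairs that co-occur in several collocation sets behave like collocations
print(pair_counts[("Anzug", "tragen")])  # appears in all three sets
```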

Erklärte (declared): using sentence collocations
Sprecher (2581), werde (2302), gestern (1696), seien (1440), Wir (1187), bereit (929), wolle (839), Vorsitzende (807), Anfrage (775), Präsident (721)

Erklärte (declared): using next-neighbor collocations
sagte (137), betonte (59), sprach (55), kündigte (44), wies (37), nannte (36), warnte (27), bekräftigte (24), meinte (24), kritisierte (23)

Collocation set for Leipzig (other cities in black)
in, Dresden, Berlin, Halle, Leipzig, Leipzig, und, Universität, Sachsen, Erfurt, Chemnitz, UM, Frankfurt, Hamburg, Rostock, Magdeburg, München, Leipziger, Hannover, Messe, Zwickau, studierte, nach, aus, Stadt, Stuttgart, Jena, Düsseldorf, Nürnberg, Reclam, Messestadt, sächsischen, DDR, Kischko, am, Köln, Däbritz, Karl-Marx-Universität, In, Rische, ostdeutschen, geboren, sächsische, bewölkt, Völkerschlacht, Bredow, Taucha, VEB, Edmond, Verlag, Buchmesse, Gewandhausorchester, Städten, Strombörse, Deutschen, Institut, GmbH, Lindner, Wurzen, GV, Verbundnetz, Ampler, Frankfurt am Main, Potsdam, Reclam Verlag, Städte, Cottbus, Versandzentrum, Handelshochschule, Hinrich Lehmann-Grube, Gera, Kirchentag, Völkerschlachtdenkmal, Buchstadt, Thomanerchor, Unterhaching, Lübeck, Oper, Dessau, Meppen, Studium, MDR, Philosophie, eröffnet, wurde, Anke Huber, Jens Lehmann, Turowski, Uwe Ampler, Weimar, ostdeutsche, Hecking, IAT, Boomtown, Buchkunst, Engelmann, Freistaat, Liebers, Dortmund, Mai, Mannheim, Schwerin, neuen Bundesländern, Grischin, VNG, Wende, bei, AG, Auto Mobil International, Cindy Klemrath, Gewandhaus, Messegelände, Parteitag, Bremen, Montagsdemonstrationen, Neubrandenburg, Gustav Kiepenheuer Verlag, Karl-Marx-Stadt, Journalistik, Ostdeutschland, Thomas Liese, Essen, Heidenreich, Udo Zimmermann, Umweltforschungszentrum, DHFK, Hochschule, Mainz, Oktober, Wolfgang Engel, Deutschen Hochschule für Körperkultur, Frankfurt/Main, Heldenstadt, Trommer, Wolfsburg, EBL, Universitäten, Wien, Bautzen, ...

First Iteration for Leipzig (other cities in black)
Frankfurt, Berlin, München, Stuttgart, Köln, Dresden, Hamburg, Hannover, Düsseldorf, Bremen, Karlsruhe, Potsdam, Wien, Paris, Magdeburg, Halle, Tübingen, Bonn, Freiburg, New York, Chemnitz, Darmstadt, Augsburg, Erfurt, Mannheim, Schweiz, Ulm, Bochum, Wiesbaden, Hanau, Braunschweig, Schwerin, Münster, Frankfurt am Main, London, USA, Regensburg, Cottbus, Göttingen, Kassel, Moskau, Passau, Rostock, Straßburg, Deutschland, Konstanz, Ausland, Dortmund, Heidelberg, Mainz, Würzburg, Zürich, Aachen, Offenbach, Weimar, Gießen, Koblenz, Italien, Chicago, Mailand, Osnabrück, Prag, Rom, Saarbrücken, Wuppertal, Niederlanden, Gera, Basel, Lyon, Nürnberg, Holland, Marburg, St. Petersburg, Amerika, Genf, Kaiserslautern, Tel Aviv, Woche, September, Tiergarten, dort, eröffnet, Budapest, Essen, Jena, Jerusalem, Neubrandenburg, Athen, Frankreich, Vereinigten Staaten, Amsterdam, Baden-Württemberg, Februar, Tempelhof, Trier, Venedig, Bayreuth, England, Erlangen, Indien, Belgrad, Duisburg, Heilbronn, Kairo, Ludwigsburg, Oldenburg, Oxford, Stockholm, Washington, Großbritannien, Görlitz, Kreuzberg, Lausanne, Lübeck, Mitte, Wochenende, April, Australien, Griechenland, Singapur, Florenz, Kanada, Kiel, Kopenhagen, Madrid, Mai, Südafrika, Tegel, Türkei, soeben, Bad Homburg, Bundesrepublik, Göppingen, Heute, Hongkong, Ingolstadt, Japan, Lande, Miami, Mittwoch, Oder, Sarajewo, Afghanistan, Argentinien, Baden-Baden, Bayern, Deutschlands, Europa, Haus, Iran, Istanbul, Peking, Rußland, neu, ...

Second Iteration for Leipzig (other cities in black)
Stuttgart, München, Frankfurt, Hamburg, Hannover, Köln, Berlin, Dresden, Bremen, Darmstadt, Karlsruhe, Freiburg, Potsdam, Mannheim, Wiesbaden, Düsseldorf, Tübingen, Magdeburg, Gießen, Augsburg, Rostock, Kassel, Halle, Ulm, Hanau, Heidelberg, Ludwigsburg, Konstanz, Nürnberg, Bonn, Schwerin, Münster, Wien, Dortmund, Würzburg, Chemnitz, Passau, Göttingen, Erfurt, Mitte, Aachen, Mainz, Friedberg, Nord, Regensburg, Braunschweig, Cottbus, New York, Kreuzberg, Frankfurt am Main, Göppingen, Tiergarten, Esslingen, Ravensburg, II, Hessen, Ost, Lübeck, Charlottenburg, Böblingen, Offenbach, Oldenburg, Osnabrück, Traunstein, Paris, Bad Homburg, London, Prenzlauer Berg, Neukölln, Tempelhof, Hellersdorf, Koblenz, Essen, Fulda, Trier, Lüneburg, Prag, Chicago, Landshut, Reinickendorf, USA, Wilmersdorf, Kiel, Bochum, Deutschland, Mittelfranken, Schöneberg, Marzahn, Oberbayern, Eimsbüttel, Niederrhein, Unterfranken, Wuppertal, Friedrichshain, Spandau, Oberfranken, Lichtenberg, Moskau, Oberpfalz, Bielefeld, Schweiz, Kaiserslautern, Kempten, Bayreuth, Schwaben, Zürich, Bamberg, Ingolstadt, Mailand, Oder, Heilbronn, Altona, Sarajewo, Marburg, Ansbach, Harburg, Berlin-Mitte, Jena, Steglitz, Suhl, Görlitz, Baden-Württemberg, Hessen-Süd, dort, Italien, Weimar, West, Saarbrücken, Ausland, Bayern, Ostwestfalen-Lippe, Moabit, Offenburg, Main, Polen, Amsterdam, Westliches, Mittlerer, eröffnet, ...

Part 5: The Vector Space Model

Feature Vectors Given by Collocations
If two words A and B have similar contexts, that is, they are alike in their use, this indicates a semantic relation of some kind between A and B.
A kind of average context for a word A is formed by all collocations of A above a certain significance threshold. This average context is transferred into a feature vector of A of dimension n (the total number of words) in the usual way.
The feature vector of word A is a description of the meaning of A, because the most important words of the contexts of A are included.
Clustering of feature vectors can be used to investigate the relations within a group of similar words and to figure out whether or not all the relations are of the same kind.
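A minimal sketch of such sparse feature vectors and their comparison by cosine similarity (the toy significance values are mine):

```python
from math import sqrt

# feature vectors: word -> {collocate: significance above threshold} (toy values)
features = {
    "Montag":   {"Uhr": 9.0, "abend": 7.0, "Dienstag": 5.0, "teilte": 2.0},
    "Dienstag": {"Uhr": 8.0, "abend": 6.0, "Montag": 5.0, "sagte": 2.0},
    "Kaffee":   {"trinken": 7.0, "Tasse": 6.0, "schwarz": 2.0},
}

def cosine(u, v):
    """Cosine similarity of two sparse feature vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v)

print(cosine(features["Montag"], features["Dienstag"]))  # high: shared contexts
print(cosine(features["Montag"], features["Kaffee"]))    # 0.0: disjoint contexts
```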

Clustering Months and Days
Example (1):
Jahres _____________________ Uhr, Ende, abend, vergangenen, Anfang, Jahres, Samstag, Freitag, Mitte, Sonntag
Donnerstag _ | Uhr, abend, heutigen, Nacht, teilte, Mittwoch, Freitag, worden, mitteilte, sagte
Dienstag _|_ | Uhr, abend, heutigen, teilte, Freitag, worden, kommenden, sagte, mitteilte, Nacht
Montag _ | | Uhr, abend, heutigen, Dienstag, kommenden, teilte, Freitag, worden, sagte, morgen
Mittwoch _|_|_ | Uhr, abend, heutigen, Nacht, Samstag, Freitag, Sonntag, kommenden, nachmittag
Samstag ___ | | Uhr, abend, Samstag, Nacht, Sonntag, Freitag, Montag, nachmittag, heutigen
Sonntag _ | | | Uhr, abend, Samstag, Nacht, Montag, kommenden, morgen, nachmittag, vergangenen
Freitag _|_|_|_____________ | Uhr, abend, Ende, Jahres, Samstag, Anfang, Freitag, Sonntag, heutigen, worden
Januar _________________ | | Uhr, Ende, Jahres, Anfang, Mitte, Samstag, Mai, August, März, Januar
August _______________ | | | Uhr, Ende, Jahres, Anfang, Mitte, Samstag, Mai, August, Januar, März
Juli _____________ | | | | Uhr, Jahres, Ende, Anfang, Mitte, Mai, Samstag, August, Januar, März
März ___________ | | | | | Uhr, Ende, Jahres, Anfang, Mitte, Samstag, Mai, Januar, März, April
Mai _________ | | | | | | Uhr, Ende, Jahres, Anfang, Mitte, Samstag, März, Januar, Mai, vergangenen
September _______ | | | | | | | Uhr, Ende, Jahres, Anfang, Mitte, Mai, Januar, März, Samstag, vergangenen
Februar _ | | | | | | | | Uhr, Januar, Jahres, Anfang, Mitte, Ende, März, November, Samstag, vergangenen
Dezember _|___ | | | | | | | | Uhr, Jahres, Ende, Anfang, Mitte, Mai, Januar, März, Samstag, vergangenen
November _ | | | | | | | | | Uhr, Jahres, Ende, Anfang, Mitte, September, vergangenen, Dezember, Samstag
Oktober _|_ | | | | | | | | | Uhr, Ende, Jahres, Anfang, Mai, Mitte, Samstag, September, März, vergangenen
April _ | | | | | | | | | | Uhr, Ende, Jahres, Mai, Anfang, März, Mitte, Prozent, Samstag, Hauptversammlung
Juni _|_|_|_|_|_|_|_|_|_|_|_

Clustering Leaders and Verbs of Utterance
Example (2): Clustering Leaders
Präsident _________ sagte, Boris Jelzin, erklärte, stellvertretende, Bill Clinton, stellvertretender, Richter
Vorsitzender _______ | sagte, erklärte, stellvertretende, stellvertretender, Richter, Abteilung, bestätigte
Vorsitzende ___ | | sagte, erklärte, stellvertretende, Richter, bestätigte, Außenministeriums, teilte, gestern
Sprecher _ | | | sagte, erklärte, Außenministeriums, bestätigte, teilte, gestern, mitteilte, Anfrage
Sprecherin _|_|_ | | sagte, erklärte, stellvertretende, Richter, Abteilung, bestätigte, Außenministeriums, sagt
Chef _ | | | Abteilung, Instituts, sagte, sagt, stellvertretender, Professor, Staatskanzlei, Dr.
Leiter _|___|_|_|_
Example (3): Clustering Verbs of Utterance
verwies _____________ Sprecher, werde, gestern, Vorsitzende, Polizei, Sprecherin, Anfrage, Präsident, gebe
mitteilte ___________ | Sprecher, werde, gestern, Vorsitzende, Polizei, Sprecherin, Anfrage, Präsident, Montag
meinte _______ | | Sprecher, werde, gestern, Vorsitzende, Sprecherin, Anfrage, Präsident, gebe, Interview
bestätigte _____ | | | Sprecher, werde, gestern, Vorsitzende, Sprecherin, Anfrage, Präsident, gebe, Interview
betonte ___ | | | | Sprecher, werde, gestern, Vorsitzende, Sprecherin, Präsident, gebe, Interview, würden, Bonn
sagte _ | | | | | Sprecher, werde, gestern, Vorsitzende, Sprecherin, Präsident, gebe, Interview, würden
erklärte _|_|_|_|_ | | Sprecher, werde, gestern, Vorsitzende, Sprecherin, Präsident, Anfrage, gebe, Interview
warnte _ | | | Präsident, Vorsitzende, SPD, eindringlich, Ministerpräsident, CDU, Außenminister, Zugleich
sprach _|_______|_|_|_

The Clustering Algorithm
The single-link hierarchical agglomerative clustering method (HACM) works bottom-up as follows:
All words are treated as (basic) items. Each item has a description (feature vector).
In each step of the clustering algorithm, the two items A and B with the most similar descriptions are found and merged into a new complex item C combining the words in A and B. Each step of the clustering algorithm thus reduces the number of items by one.
The feature vector for C is constructed from the feature vectors of A and B by 'identifying' the words A and B and calculating their joint collocations.
The algorithm stops if only one item is left or if all remaining feature vectors are orthogonal.
This usually results in a very natural clustering if the threshold for constructing the feature vectors is suitably chosen.
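The steps above can be sketched as follows, with cosine similarity as the description similarity and vector addition standing in for the joint collocations of a merged item (toy data; a real implementation would recompute the collocations):

```python
from math import sqrt

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = sqrt(sum(w * w for w in u.values()))
    nv = sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def merge(u, v):
    # feature vector of the merged item: combined collocations of A and B
    out = dict(u)
    for t, w in v.items():
        out[t] = out.get(t, 0.0) + w
    return out

def hacm(items):
    """Bottom-up agglomerative clustering over feature vectors."""
    items = dict(items)
    history = []
    while len(items) > 1:
        pair, best = None, 0.0
        names = sorted(items)
        for i, x in enumerate(names):
            for y in names[i + 1:]:
                s = cosine(items[x], items[y])
                if s > best:
                    pair, best = (x, y), s
        if pair is None:  # all remaining vectors are orthogonal: stop
            break
        a, b = pair
        history.append(pair)
        items[f"({a}+{b})"] = merge(items.pop(a), items.pop(b))
    return history

history = hacm({
    "Montag":   {"Uhr": 9.0, "abend": 7.0},
    "Dienstag": {"Uhr": 8.0, "abend": 6.0},
    "Mai":      {"Ende": 5.0, "Anfang": 4.0},
    "Juni":     {"Ende": 6.0, "Anfang": 3.0},
})
print(history)  # the days merge first, then the months; the two groups stay apart
```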

Part 6: Combining Simple Methods

Combining Non-contradictory Partial Results
Combining partial results yields more and/or better results.
Identical results: two or more of the above algorithms may suggest a certain relation between two words, for instance cohyponymy.
Example: if both the second-order collocations and the clustering by feature vectors independently yield similar sets of words, this may be taken as an indication of cohyponymy between the words, e.g. sagte, betonte, kündigte, wies, nannte, warnte, bekräftigte, meinte [...] (German verbs of utterance).

Types of Relations
Symmetric relations: a relation r is called symmetric if r(A, B) always implies r(B, A). Examples of symmetric relations are synonymy, cohyponymy (or similarity), membership in a certain subject area, and relations of unknown type. Usually, sentence collocations express symmetric relations.
Anti-symmetric relations: let us call a relation r anti-symmetric if r(A, B) never implies r(B, A). Examples of anti-symmetric relations are hyponymy and relations between properties and their owners, such as action and actor, or class and instance. Usually, next-neighbor collocations of two words express anti-symmetric relations. In the case of next-neighbor collocations consisting of more than two words (like A prep/det/conj B, e.g. Samson and Delilah), the relation might be symmetric, for instance in the case of conjunctions like and or or.
Transitivity: transitivity of a relation means that r(A, B) and r(B, C) always imply r(A, C). In general, a relation found experimentally will not be transitive, of course, but there may be a part where transitivity holds. Some of the most prominent transitive relations are cohyponymy, hyponymy, and synonymy.
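These definitions can be checked mechanically on a relation represented as a set of pairs (a sketch; the function names and example pairs are mine):

```python
def is_symmetric(rel):
    """rel is a set of (A, B) pairs; symmetric iff r(A, B) implies r(B, A)."""
    return all((b, a) in rel for a, b in rel)

def transitivity_gaps(rel):
    """Pairs implied by transitivity, r(A, B) and r(B, C), but absent from rel."""
    return {(a, d) for a, b in rel for c, d in rel
            if b == c and a != d and (a, d) not in rel}

cohyponyms = {("Kuh", "Hund"), ("Hund", "Kuh")}      # symmetric
hyponymy   = {("Dackel", "Hund"), ("Hund", "Tier")}  # anti-symmetric
print(is_symmetric(cohyponyms))     # True
print(is_symmetric(hyponymy))       # False
print(transitivity_gaps(hyponymy))  # a Dackel is a Tier, too
```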

Supporting Second Results
In the second combination type, a known relation given by one extraction method is verified by an identical but unnamed second result, as follows:
Result 1: there is a certain relation r between A and B.
Result 2: there is some strong (but unknown) relation between A and B (e.g. given by a collocation set).
Conclusion: result 1 holds with more evidence.
Example for result 2: the German compound Entschädigungsgesetz can be divided into Gesetz and Entschädigung with an unknown relation. Result 1 is given by the four-word next-neighbor collocation Gesetz über die Entschädigung. Similarly, Stundenkilometer is analyzed as Kilometer pro Stunde.
In these examples, result 1 alone is not enough, because there are collocations like Woche auf dem Tisch which do not describe a meaningful semantic relation.

Combining Three Results
Result 1: There is a relation r between A and B.
Result 2: B is similar to B' (cohyponymy).
Result 3: There is some strong but unknown relation between A and B'.
Conclusion: There is a relation r between A and B'.
Example: As result 1 we might know that Schwanz (tail) is part of Pferd (horse). Terms similar to Pferd are both Kuh (cow) and Hund (dog) (result 2). Both of them have the term Schwanz in their set of significant collocations (result 3). Hence we may correctly conjecture that both Kuh and Hund have a tail (Schwanz) as part of their body. In contrast, Reiter (rider) is a strong collocation of Pferd and might (incorrectly) be conjectured to be another similar concept, but Reiter forms no collocation with Schwanz. Hence, the absence of result 3 prevents us from drawing an incorrect conclusion.
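The three-result inference can be sketched as a small filter: conjecture r(A, B') only when the collocation evidence (result 3) supports it. The data structures and the `conjecture` function are hypothetical; the example words follow the Schwanz/Pferd slide above, with Reiter as the false cohyponym that result 3 correctly rejects.

```python
# Result 1: known relation pairs (part, whole).
part_of = {("Schwanz", "Pferd")}

# Result 2: cohyponyms of each term; Reiter is an incorrect "similar" term.
cohyponyms = {"Pferd": {"Kuh", "Hund", "Reiter"}}

# Result 3: strong but unnamed collocation pairs.
collocations = {("Schwanz", "Pferd"), ("Schwanz", "Kuh"), ("Schwanz", "Hund")}

def conjecture(relation, cohyponyms, collocations):
    """Propose r(A, B') for cohyponyms B' of B, but only with collocation support."""
    new_pairs = set()
    for (a, b) in relation:
        for b_prime in cohyponyms.get(b, ()):
            if (a, b_prime) in collocations:  # result 3 must hold
                new_pairs.add((a, b_prime))
    return new_pairs
```

Here the conjectured pairs are (Schwanz, Kuh) and (Schwanz, Hund); (Schwanz, Reiter) is blocked because the collocation is absent.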

Similarity Used to Infer a Strong Property
Let us call a property p important if similarity respects this property. Such a property can be inferred as follows:
Result 1: A has a certain important property p.
Result 2: B is similar to A (i.e., B is a cohyponym of A).
Conclusion: B has the same property p.
Example: We consider A and B similar if they are in the set of right-neighbor collocations of Hafenstadt (port town) (result 2). If we know that being a Hafenstadt is a property of its typical right neighbors (result 1), we may infer this property for more than 200 cities such as Split, Sidon, Durban, Kismayo, Tyrus, Vlora, Karachi, Durres, [...].

Subject Area Inferred from Collocation Sets
Result 1: A, B, C, ... are collocates of a certain term.
Result 2: Some of them belong to a certain subject area.
Conclusion: All of them belong to this subject area.
Example: Consider the following top entries in the collocation set of carcinoma: patients, cell, squamous, radiotherapy, lung, thyroid, treated, hepatocellular, metastases, adenocarcinoma, cervix, irradiation, breast, treatment, CT, therapy, renal, cases, bladder, cervical, tumor, cancer, metastatic, radiation, uterine, ovarian, chemotherapy, [...]. If we know that some of them belong to the subject area Medicine, we can assign this subject area to the other members of the collocation set as well.
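The subject-area propagation can be sketched as a majority vote: if enough collocates already carry a subject-area label, assign that area to the whole collocation set. The function name, the `min_votes` threshold, and the seed labels are assumptions for illustration; the collocate words are from the carcinoma example above.

```python
from collections import Counter

def propagate_area(collocates, known_areas, min_votes=2):
    """Assign the majority subject area of the labeled collocates to all of them.

    Returns an empty dict if fewer than min_votes collocates are labeled
    with the winning area (hypothetical safeguard against sparse evidence).
    """
    votes = Counter(known_areas[w] for w in collocates if w in known_areas)
    if not votes:
        return {}
    area, count = votes.most_common(1)[0]
    if count < min_votes:
        return {}
    return {w: area for w in collocates}

collocates = ["patients", "cell", "squamous", "radiotherapy", "lung"]
known_areas = {"radiotherapy": "Medicine", "patients": "Medicine"}
```

With two seed labels from Medicine, all five collocates receive the subject area Medicine.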

Part 7 Part 1: Introduction to the Wortschatz lexicon Part 2: Collocations and the collocation measure Part 3: Applications of collocations Part 4: Clustering Collocations Part 5: The Vector Space Model Part 6: Combining Simple Methods Part 7: Temporal Analysis: Words of the Day Part 8: Document Similarity


Overview
Input: today's news text, ca. 20,000 sentences
Relative size compared to the large corpus: factor 1000
Number of sentence collocations: ca. 100,000
Number of next-neighbor collocations: ca. 7,000
Number of next-neighbor collocations with both words capitalized: ca. 300
Size on Sundays and Mondays: ca. 50% compared to weekdays
Problem: find the important terms, a total of 100-150 words.
Frequency data available: total frequency today; relative frequency compared to our large corpus; total frequency in our large corpus.
Morphosyntactic criteria: words and multiwords should be capitalized; no inflected forms.

Frequency Measures
Total frequency today: A minimum frequency is needed, otherwise there are too many words (cf. Zipf's law). Today: minimum frequency of 12 and maximum frequency of 100 for relevant words.
Relative frequency compared to our large corpus: A large factor implies importance; small variations appear by chance. Threshold for importance: factor > 6. May be lowered for larger daily corpora.
Total frequency in our large corpus: Words should be familiar. Today: Wortschatz frequency > 20. What about totally new words? Today: minimum frequency of 12, as above.
Question: Which measure is closest to importance as felt by humans? Answer: total frequency today.
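The thresholds above can be combined into a single filter. This is a simplified sketch: the function name and the handling of unfamiliar words (treating any word below the corpus-frequency threshold as a "totally new word" that only needs today's minimum frequency) are assumptions; the numeric defaults are the values stated on the slide, with the factor-1000 corpus-size ratio from the overview.

```python
def is_word_of_the_day(freq_today, freq_corpus,
                       corpus_factor=1000,  # large corpus is ~1000x the daily text
                       min_today=12, min_factor=6, min_corpus=20):
    """Decide whether a word's frequencies mark it as a Word of the Day."""
    if freq_today < min_today:
        return False  # too rare today; Zipf's law would admit too many words
    if freq_corpus > min_corpus:
        # Familiar word: compare today's frequency, scaled to corpus size,
        # against its frequency in the large corpus.
        return freq_today * corpus_factor / freq_corpus > min_factor
    # Totally new (or unfamiliar) word: today's minimum frequency suffices.
    return True
```

For example, a word seen 30 times today and 1,000 times in the large corpus has a relative factor of 30, well above the threshold of 6.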

Words of the Day (without human inspection)

Words of the Day (after 5 minutes of inspection)

Problem: Find the Message
We automatically find: Jürgen Hart is otherwise rarely mentioned. We notice the words gestorben (died) and tot (dead) and the phrase hörte sein Herz auf zu schlagen (his heart stopped beating). Conclusion: we have an obituary.

Relations between Words
Today's collocation graph: connected words represent a strong relation.

Temporal Relations
We see whether collocations have repeatedly appeared together during the last 30 days.

Part 8 Part 1: Introduction to the Wortschatz lexicon Part 2: Collocations and the collocation measure Part 3: Applications of collocations Part 4: Clustering Collocations Part 5: The Vector Space Model Part 6: Combining Simple Methods Part 7: Temporal Analysis: Words of the Day Part 8: Document Similarity

Document Similarity
The description of a document consists of all of its terms that have been a Word of the Day at any time. Hence we use only approx. 5,000 terms for indexing. Documents are compared simply by counting their common terms, weighted by their frequencies.
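The comparison described above reduces to a weighted overlap count. A minimal sketch, assuming a precomputed weight per index term (the slides do not specify the exact weighting scheme, so a simple per-term weight dictionary with a default of 1 is used here; the function name and example terms are illustrative):

```python
def doc_similarity(terms_a, terms_b, weight):
    """Similarity of two documents: sum of weights of their common index terms."""
    common = set(terms_a) & set(terms_b)
    return sum(weight.get(t, 1) for t in common)

# Hypothetical frequency-derived weights for a few index terms.
weights = {"Arafat": 3.0, "Israel": 2.0, "Ferrari": 5.0}
```

For two documents indexed by ["Arafat", "Israel", "Hebron"] and ["Arafat", "Israel", "Ferrari"], the common terms are Arafat and Israel, giving a similarity of 5.0.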

Sample Similar Documents
Doc. no. 1 | Doc. no. 2 | Similarity | Common terms
60910293 | 51923690 | 9558.00 | Weltmeister Südkorea Michael_Ballack Oliver_Kahn Brasilien Yokohama Ballack DFB Rudi_Völler
60910293 | 552389133 | 7946.00 | Weltmeister Südkorea Dietmar_Hamann Thomas_Linke Weltmeisterschaft Carsten_Ramelow Rudi_Völler
60910293 | 588749685 | 7278.00 | Südkorea Michael_Ballack Weltmeisterschaft Ballack Elf Seoul
734389933 | 1313082725 | 11073.00 | Israel Arafat Hebron Jericho Palästinenser Terror Bush Frieden US-Präsident_George_W._Bush
734389933 | 1598295465 | 7344.00 | Israel Arafat Palästinenser Terror Bush US-Präsident_George_W._Bush
242550748 | 1598295465 | 7344.00 |
242550748 | 734389933 | 12691.00 | Israel Autonomiebehörde Arafat Hebron Palästinenser Terror Bush Frieden Jassir_Arafat US-Präsident_George_W._Bush

Topics of the Day
If we have sets of similar documents, we can use clustering. The terms describing a cluster can be viewed as a Topic of the Day.
The clustering algorithm: Consider all documents (approx. 200 each day). For each pair of similar documents, consider their set of common Words of the Day. Next we cluster these word sets using hierarchical agglomerative clustering (HACM): in each step, the most similar sets are combined. As similarity measure we use sim(A, B) = |A ∩ B| / |A| (which part of A is contained in B?). If sim(A, B) > 0.4, then B is replaced by A ∪ B and A is dropped. The algorithm stops when there are no sets with similarity > 0.4.
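The merge loop can be sketched as follows. This is a simplification: it merges any pair of sets whose asymmetric overlap exceeds the threshold rather than always combining the most similar pair first, and it assumes all word sets are non-empty. The function name and the toy cluster data are illustrative.

```python
def merge_clusters(word_sets, threshold=0.4):
    """Greedy agglomerative merging: if |A & B| / |A| > threshold,
    replace B by A | B and drop A; repeat until no merge applies."""
    sets = [set(s) for s in word_sets]  # non-empty sets assumed
    changed = True
    while changed:
        changed = False
        for i, a in enumerate(sets):
            for j, b in enumerate(sets):
                if i == j:
                    continue
                if len(a & b) / len(a) > threshold:  # which part of A is in B?
                    sets[j] = a | b                   # B is replaced by A union B
                    del sets[i]                       # A is dropped
                    changed = True
                    break
            if changed:
                break
    return sets

clusters = [{"Arafat", "Israel", "Hebron"},
            {"Arafat", "Israel", "Bush"},
            {"Ferrari", "Schumacher"}]
```

On the toy data, the two overlapping Middle East sets merge into one topic set while the Formula 1 set stays separate.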

Clusters of 25/6/2002 (titles are assigned by hand)
1 NAHOST: Gaza-Streifen Arafat Außenminister_Schimon_Peres Jerusalem Terroristen Westjordanland Hebron Israel Panzer Attentäter Palästinenser Ramallah Selbstmordanschläge Israelis Ausgangssperre Autonomiebehörde Tulkarem Bethlehem Dschenin Gazastreifen Terror US-Präsident_George_W._Bush Anschlägen Nablus
2 FORMEL1: Grand_Prix Rubens_Barrichello Ferrari Ralf_Schumacher McLaren-Mercedes Coulthard Barrichello Montoya Großbritannien Schumacher Michael_Schumacher Nürburgring Großen_Preis Stallorder Weltmeister Brasilien Fußball-WM
3 STOLPE: Potsdam SPD-Generalsekretär_Franz_Müntefering Lothar_Späth Stolpe Bundestagswahlkampf Matthias_Platzeck Müntefering Schönbohm Bundesrat Platzeck Brandenburg Manfred_Stolpe Cottbus Zuwanderungsgesetz PDS Wittenberge Ministerpräsident_Manfred_Stolpe Jörg_Schönbohm Bundestagswahl Schröder
4 WM: Korea Südkorea Skibbe Seoul Oliver_Kahn Südkoreaner Michael_Ballack Koreaner Spanier Hitze Nationalmannschaft Elfmeterschießen Viertelfinale Paraguay WM-Halbfinale Miroslav_Klose Völler Jens_Jeremies Karl-Heinz_Rummenigge Klose Golden_Goal Weltmeister Türken Senegal Fußball Verlängerung Brasilien Elf Weltmeisterschaft Entschuldigung Rudi_Völler Portugal Ronaldo Rivaldo Achtelfinale Argentinien Fifa Dietmar_Hamann
5 PISA: Nordrhein-Westfalen Gymnasien Pisa-E Naturwissenschaften Brandenburg Rheinland-Pfalz Sachsen-Anhalt
6 BÖRSE: T-Aktie Allzeittief Neuen_Markt DAX France_Télécom Moody's Tarifrunde
7 HARTZ: Hartz SPD-Generalsekretär_Franz_Müntefering Bundeswirtschaftsminister_Werner_Müller Florian_Gerster Arbeitslosenzahl FDP-Chef_Guido_Westerwelle Hartz-Kommission

Clusters of 26/6/2002
1 WM: Fußball Ilhan_Mansiz Golden_Goal WM-Halbfinale Senegal Türken Schröder Weltmeister Bundesinnenminister_Otto_Schily Völler Bundespräsident_Johannes_Rau Brasilien Bundeskanzler_Gerhard_Schröder Ballack Frings Neuville Bierhoff Jeremies Klose Ramelow Korea Südkorea Michael_Ballack Oliver_Kahn Beckham Weltmeisterschaft Zidane Pelé Ronaldo Rivaldo Miroslav_Klose Viertelfinale Paraguay Jens_Jeremies FC_Liverpool Seoul Christian_Ziege Spanier Sebastian_Kehl Elf Saudi-Arabien Thomas_Linke Nationalmannschaft Rudi_Völler Seo Carsten_Ramelow Christoph_Metzelder Foul WM-Finale Koreaner Südkoreaner Oliver_Bierhoff Dietmar_Hamann Yokohama Schiedsrichter Franz_Beckenbauer Portugal Guus_Hiddink DFB Oliver_Neuville Marco_Bode Gelbe_Karte Fifa Franzosen Yoo
2 NAHOST: Ariel_Scharon Israel Arafat Palästinenser Bush Nahen_Osten Autonomiebehörde Hebron Terror Frieden Jassir_Arafat US-Präsident_George_W._Bush Jericho Ramallah Israelis Scharon Weiße_Haus George_Bush Palästina Westjordanland Jerusalem Ministerpräsident_Ariel_Scharon Panzer Großbritannien Gewalt Palästinenserpräsident_Jassir_Arafat US-Regierung Anschläge Waffen Intifada
3 ERFURT: Schule Massaker Erfurt Lehrer Steinhäuser Rainer_Heise Robert_Steinhäuser
4 STOLPE: Brandenburg Bundesrat Stolpe Bundespräsident_Johannes_Rau PDS Jörg_Schönbohm Schönbohm Lothar_Späth Platzeck Matthias_Platzeck Manfred_Stolpe
5 FORMEL1: Weltmeister Rubens_Barrichello Nürburgring Großen_Preis McLaren-Mercedes Barrichello Ralf_Schumacher Ferrari Michael_Schumacher Schumacher Jean_Todt
6 BÖRSE: Moody's Neuen_Markt DAX ABN_Amro Goldman_Sachs France_Télécom
7 BABCOCK: Babcock Nordrhein-Westfalen Oberhausen Bürgschaft Babcock_Borsig IG_Metall Stellenabbau Indien
8 HOLZMANN: Philipp_Holzmann_AG Ottmar_Hermann Baukonzern Philipp_Holzmann Holzmann Insolvenz Niederländer

Comparison
25.6.02: WM NAHOST FORMEL1 STOLPE BÖRSE PISA HARTZ
26.6.02: ERFURT BABCOCK HOLZMANN
Some topics appear repeatedly, either on consecutive days or after some interval. Once a topic has been introduced by hand, it is detected automatically later on.

Thank you.