Learning semantic relations using very large corpora Uwe Quasthoff Institut für Informatik Universität Leipzig


2 Contents
Part 1: Introduction to the Wortschatz lexicon
Part 2: Collocations and the collocation measure
Part 3: Applications of collocations
Part 4: Clustering Collocations
Part 5: The Vector Space Model
Part 6: Combining Simple Methods
Part 7: Temporal Analysis: Words of the Day
Part 8: Document Similarity

3 Language Data

4 Full query for Leipzig
Count: 9967
Description: city in Germany (over … inhabitants), city in Saxony
Grammar: part of speech: proper noun; form(s): Leipzig [9967], Leipzigs [276]
-er adjective / inhabitant of the city: Leipziger [3553]
Part of multiwords: VfB Leipzig [403], SC Leipzig [183], ..., Erste Baugesellschaft Leipzig AG [8], ...
Example: Auch Debütanten aus angrenzenden Sprachräumen, die sich über Leipzig den deutschen Buchmarkt erschließen möchten, bietet die Buchmesse ein geeignetes Forum. (Source: OTS-Newsticker)
Sentence collocations: Dresden (1488), Berlin (694), Halle (470), Universität (266), Sachsen (265), ..., DDR-Bürger (5), DDR-Innenminister (5), DTSB (5), ...
Left neighbors: Universität (392), Stadt (201), Reclam (102), Handelshochschule (51), Oper (50), Karl-Marx-Universität (48), Raum (36), ...

5 Empirical Analysis of Associations

6 Collocations for Schweine
On the right side we find a collection of similar animals (all in the plural): Rinder, Hühner, Kühe, Schafe. On the left side we find words describing the aspect of slaughtering.

7 Collocations for Stich
Two groups for two different meanings: tennis (Michael Stich, Boris Becker, etc.) and the card game Skat with the three players Vorhand (lead), Mittelhand, and Hinterhand. The thin connection between Becker and Vorhand represents Becker's strong forehand.

8 Funny collocation sets
Identifying English words in German text. Collocations for the: of, and, to, The, on, for, is, from, you, with, that, it, world, are, be, not, We, at, World, we, have, this, by, they, when, You, can, When, into, what, your, or, But, time, And, like, over, Breaking, only, one, but, shall, which, has, What, road, as, On, same, people, out, our, This, It, way, best, who, no, my, more, his, up, their, ...
In the same way we find dialect words. The Berlin dialect is identified using the collocations for ick: det, nich, Ick, Det, hab, is, ne, ooch, keene, wat, weeß, uff, de, ma, nu, keen, dat, aba, och, jing, jut, Nee, meen, Jöre, een, mach, inne, watt, wa, jenuch, kieke, janze, kumm, janz, tau, Mutta, janzen, hätt, sag, wieda, kleene, ha, hör, imma, un, habense, kriejen, ejal, zwee, nischt, nee, Wetta, jedacht, hebb, heff, ...

9 Analysis of other languages
The above procedure was applied, without changes, to English, French, Dutch, Sorbian, and Italian.

10 Dutch and French

11 Sorbian and Italian

14 Part 2: Collocations and the collocation measure

15 The Common Birthday Problem
The common birthday problem: What is the probability p that (at least) two people share a birthday, given n randomly chosen people?
Modification: What is the probability p that there are k boy/girl couples with the same birthday (different couples may have different birthdays), given a randomly chosen boys and b randomly chosen girls?
Reformulation as a collocation problem:
Common birthday problem                    → collocation problem
- number a of boys                         → number a of sentences containing A
- number b of girls                        → number b of sentences containing B
- number of days in a year (i.e. n = 365)  → total number n of sentences
- number k of couples with the same birthday → number k of sentences containing both A and B
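The analogy can be checked numerically. A small simulation sketch (`simulate_couples` is a hypothetical helper, not from the talk): the expected number of boy/girl couples sharing a birthday is exactly a·b/n, which is the Poisson parameter λ of the collocation model below.

```python
import random

def simulate_couples(a, b, n=365, trials=20000, seed=0):
    """Estimate the expected number of boy/girl couples sharing a birthday
    when a boys and b girls draw birthdays uniformly from n days."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        boy_days = [rng.randrange(n) for _ in range(a)]
        girl_days = [rng.randrange(n) for _ in range(b)]
        counts = {}
        for d in boy_days:
            counts[d] = counts.get(d, 0) + 1
        # each girl forms one couple with every boy sharing her birthday
        total += sum(counts.get(d, 0) for d in girl_days)
    return total / trials

a, b, n = 30, 30, 365
lam = a * b / n                    # Poisson parameter, about 2.47
estimate = simulate_couples(a, b, n)
```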

16 Introduction to the Poisson Distribution
We can calculate the probability of multiple joint occurrences of independent events as follows. Given two independent events observed in an experiment with probabilities p_a and p_b, respectively, the probability of their joint occurrence is p_a · p_b. Next we repeat the experiment n times and are interested in k joint occurrences. With λ = n · p_a · p_b we get the probability
P(k) = e^(−λ) λ^k / k!
For at least k joint occurrences we get the probability
P(X ≥ k) = 1 − Σ_{i=0}^{k−1} e^(−λ) λ^i / i!
To measure the surprise at the joint occurrence of non-independent events, we simply calculate the probability as if they were independent; then we are surprised to see such a rare event.
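The two probabilities above can be sketched directly (function names are illustrative):

```python
import math

def poisson_pmf(k, lam):
    """P(exactly k joint occurrences) = e^(-lam) * lam^k / k!"""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def poisson_tail(k, lam):
    """P(at least k joint occurrences) = 1 - sum over i < k of P(i)."""
    return 1.0 - sum(poisson_pmf(i, lam) for i in range(k))

# n = 10000 repetitions of two independent events with p_a = p_b = 0.01,
# so lam = n * p_a * p_b = 1.0
lam = 10000 * 0.01 * 0.01
p_two_or_more = poisson_tail(2, lam)
```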

17 The Collocation Measure
The collocation measure of two words A and B is defined as the negative logarithm of the above probability, divided by log n. For λ = ab/n we get
sig(A, B) = −log P(X ≥ k) / log n
Approximations: If (k+1)/λ > 10 (which is typically true), the tail is dominated by its first term, so
sig(A, B) ≈ (λ − k log λ + log k!) / log n
and if, moreover, k > 10, Stirling's formula log k! ≈ k log k − k gives
sig(A, B) ≈ (λ + k (log k − log λ − 1)) / log n
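A sketch of the measure and its approximation, assuming the Poisson form above (the tail is summed upward from k, since computing 1 minus the lower sum loses all precision for the tiny tails that occur here):

```python
import math

def poisson_tail(k, lam, terms=200):
    """P(X >= k), summed upward from k to avoid cancellation in 1 - sum."""
    t = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))  # P(X = k)
    total = 0.0
    for i in range(k, k + terms):
        total += t
        t *= lam / (i + 1)
    return total

def sig_exact(n, k, a, b):
    """sig(A,B) = -log P(X >= k) / log n, with lam = a*b/n."""
    lam = a * b / n
    return -math.log(poisson_tail(k, lam)) / math.log(n)

def sig_approx(n, k, a, b):
    """Approximation for (k+1)/lam > 10 and k > 10 (Stirling for log k!):
    sig ~ (lam + k*(log k - log lam - 1)) / log n."""
    lam = a * b / n
    return (lam + k * (math.log(k) - math.log(lam) - 1)) / math.log(n)

# Illustrative values: lam = 1, (k+1)/lam = 21 > 10, k = 20 > 10
n, k, a, b = 10**6, 20, 1000, 1000
```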

18 Comparison to Log-Likelihood
Comparison of the formulae: translating the log-likelihood formula into our notation and ignoring small terms, we get
log-likelihood ≈ 2 (k log(k/λ) − k + λ)
compared to
sig(A, B) · log n ≈ k log(k/λ) − k + λ
Note: this may only apply in the typical case (k+1)/λ > 10.

19 Comparing Results: The Sources
                      IDS Cosmas I (W-PUB)      Wortschatz (German)
Corpus size           374 million               255 million
Sources               mainly newspapers         mainly newspapers
Window size           fixed (here: ±5 words)    sentence
Collocation measure   log-likelihood            Poisson distribution

20 Comparing Results: Collocations for Bier
Rank  IDS Cosmas I  rating   Wortschatz (German)  sig rating
1     Wein          4351     trinken              1234
2     trinken       2745     Wein                 648
3     getrunken     1715     getrunken            478
4     kühles        1627     Liter                460
5     Glas          1379     trinkt               428
6     Liter         1318     Glas                 348
7     Faß           1236     Schnaps              318
8     Fass          1139     Hektoliter           300
9     Flasche       1071     Flaschen             272
10    Hektoliter    899      gebraut              269
11    Flaschen      873      Sekt                 239
(the lists continue: Trinkt, ..., Kaffee, ...)

21 Properties of sig(n,k,a,b) I
Simple co-occurrence: if A and B each occur only once, and they occur together, then sig(n,1,1,1) ≈ 1. This ensures that the minimum significance threshold is independent of the corpus size.
Independence: if A and B occur statistically independently with probabilities p and q, then sig(n, npq, np, nq) ≈ 0.
Enlarging the corpus by a factor m: sig(mn, mk, ma, mb) ≈ m · sig(n, k, a, b). This is useful for comparing corpora of different sizes.
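The independence and scaling properties can be checked numerically with the approximate formula from the previous slide (the form of `sig` is an assumption reconstructed from the properties themselves):

```python
import math

def sig(n, k, a, b):
    """Approximate Poisson collocation measure (assumed form):
    (lam + k*(log k - log lam - 1)) / log n, with lam = a*b/n."""
    lam = a * b / n
    return (lam + k * (math.log(k) - math.log(lam) - 1)) / math.log(n)

# Independence: k = n*p*q, a = n*p, b = n*q  =>  lam = k and the log term vanishes
n = 10**8
indep = sig(n, 100, 10**5, 10**5)       # lam = 10^10 / 10^8 = 100 = k

# Corpus scaling: sig(mn, mk, ma, mb) grows roughly m-fold -- exactly m-fold
# in the numerator; the log n denominator adds a mild correction
base = sig(10**6, 50, 2000, 3000)
scaled = sig(10**7, 500, 20000, 30000)  # m = 10
```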

22 Properties of sig(n,k,a,b) II
Additivity: the unification of the words B and B′ just adds the corresponding significances. For k/b ≈ k′/b′ we have
sig(n,k,a,b) + sig(n,k′,a,b′) ≈ sig(n,k+k′,a,b+b′)
This has applications for grouping words by various methods.
Maximum: max_B sig(A, B), taken over all words B for fixed a. It might be useful to know how strong a collocation is compared to the possible maximum.

23 Part 3: Applications of collocations

24 Applications
- Collocations of inflected forms or basic forms?
- A numeric measure for polysemy
- Identification of proper names and phrases
- Compound analysis

25 Collocations of inflected forms or basic forms?
Collocations of basic forms give more results because of their higher frequency. But collocations of basic forms and inflected forms may differ strongly. Example:
Collocations for As: Karo (488), K (393), Pik (391), Treff (307), Coeur (296), D (258), Herz (190), Karte (189), Kreuz (178), As (166), Süd (145), Matchball (113), Hinterhand (110), West (110), Karo-Bube (101), Ärmel (95), Ost (94), Vorhand (86), ...
Collocations for Asse: Ivanisevic (72), Becker (63), schlug (62), Aufschlag (56), servierte (47), Sampras (40), Goran Ivanisevic (32), Spiel (25), Stich (24), gewann (24), Wolfenbüttel (23), Kroate (22), schlagen (22), Asse (21), Match (21), Satz (21), ...

26 Application of additivity
First calculate collocations for inflected forms; then, if desired, use additivity to calculate the measure for basic forms. Example: collocations for Bundeskanzler: ..., betonte (46), ..., betont (21), ..., betonten (7), ... Additivity gives significance 74 for the pair (Bundeskanzler, {betonen, betont, betonte, betonten}).
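The merge is just a sum over the inflected forms (values taken from the slide):

```python
# Significances of Bundeskanzler with the inflected forms of "betonen";
# additivity lets us merge them into one value for the basic form.
inflected = {"betonte": 46, "betont": 21, "betonten": 7}
merged_sig = sum(inflected.values())
```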

27 A numeric measure for polysemy: space
The collocations of space, taken from our general-language corpus of English, fall mainly into three classes: the subject areas computer, real estate, and outer space. The corresponding senses of space are denoted space 1, space 2, and space 3. Assigning the top 30 collocations of space (disk, shuttle, square, station, NASA, feet, ...) to these three senses gives a quantitative estimate of the senses:
space 1 (…%): disk (2629), memory (718), storage (479), program (308), RAM (307), free (300), hard (336)
space 2 (…%): shuttle (2618), station (991), NASA (920), Space (602), launch (505), astronauts (473), Challenger (420), manned (406), NASA's (297), flight (293), Atlantis (291), Mir (335), rocket (329), orbit (326), Discovery (341), mission (385)
space 3 (18.6%): square (1163), feet (822), leased (567), office (382), lessor (390)
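The sense percentages are just each group's share of the total collocation counts; with the counts listed above, the 18.6% given on the slide for the real-estate sense is roughly reproduced:

```python
# Top collocations of "space" grouped by sense (counts from the slide)
senses = {
    "computer": {"disk": 2629, "memory": 718, "storage": 479, "program": 308,
                 "RAM": 307, "free": 300, "hard": 336},
    "outer_space": {"shuttle": 2618, "station": 991, "NASA": 920, "Space": 602,
                    "launch": 505, "astronauts": 473, "Challenger": 420,
                    "manned": 406, "NASA's": 297, "flight": 293, "Atlantis": 291,
                    "Mir": 335, "rocket": 329, "orbit": 326, "Discovery": 341,
                    "mission": 385},
    "real_estate": {"square": 1163, "feet": 822, "leased": 567, "office": 382,
                    "lessor": 390},
}
total = sum(sum(counts.values()) for counts in senses.values())
shares = {name: sum(counts.values()) / total for name, counts in senses.items()}
```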

28 Proper Names and Phrases
A large relative collocation measure sig_C(A) indicates that a considerable part of all occurrences of the word C is together with A. Hence C might be the head with respect to A.
Left word       Right word   Head
Alzheimersche   Krankheit    left
AQA             total        left
Anorexia        nervosa      left and right
Algighiero      Boetti       left and right
30jährige       US-Bond      right
André           Lussi        right

29 Compound analysis using multi-word collocations
Assume we know that Geschwindigkeitsüberschreitung has the parts Geschwindigkeit and Überschreitung. If a multi-word collocation (here: Überschreitung der Geschwindigkeit) is of some predefined form, we accept this collocation as a semantic description.
Pattern      Word A     Word B          Compound
A aus B      Orgie      Farben          Farbenorgie
A der B      Bebauung   Insel           Inselbebauung
A mit B      Feld       Getreide        Getreidefeld
A in der B   Feldbau    Regenzeit       Regenzeitfeldbau
A für B      Übung      Anfänger        Anfängerübung
A für die B  Gebäude    Flugsicherung   Flugsicherungsgebäude
A von B      Anbau      Kaffee          Kaffeeanbau
A zur B      Andrang    Eröffnung       Eröffnungsandrang
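The pattern check can be sketched as a simple lookup (the function and its pattern list are an illustration of the idea, not the talk's implementation):

```python
def supports_compound(part_a, part_b, multiword_collocations,
                      patterns=("{a} aus {b}", "{a} der {b}", "{a} mit {b}",
                                "{a} in der {b}", "{a} für {b}",
                                "{a} für die {b}", "{a} von {b}", "{a} zur {b}")):
    """Accept the split of a compound into part_a / part_b if some observed
    multi-word collocation instantiates one of the predefined patterns."""
    candidates = {p.format(a=part_a, b=part_b) for p in patterns}
    return any(c in candidates for c in multiword_collocations)

collocs = {"Überschreitung der Geschwindigkeit", "Woche auf dem Tisch"}
ok = supports_compound("Überschreitung", "Geschwindigkeit", collocs)
```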

30 Part 4: Clustering Collocations

31 Clustering
So far we are able to find relations between words, but they are still of unknown type; moreover, different types are mixed.
Problem 1: How do we construct sets of relations of a fixed type?
Problem 2: How do we identify the type of a relation using background information?

32 Collocations of collocation sets
The production of collocations is now applied to sets of (next-neighbor or sentence) collocations instead of sentences.
The collection of sentence collocations contains the following "sentence" for Hemd: Hemd Krawatte Hose weißes Anzug weißem Jeans trägt trug bekleidet weißen Jacke schwarze Jackett schwarzen Weste kariertes Schlips Mann
The collection of next-neighbor collocations contains the following two "sentences" for Hemd: weißes weißem weißen blaues kariertes kariertem offenem aufs karierten gestreiftes letztes ... / bekleidet ausgezogen spannt trägt aufknöpft ausgeplündert auszieht wechseln aufgeknöpft ausziehen ...

33 Erklärte (declared): using sentence collocations
Sprecher (2581), werde (2302), gestern (1696), seien (1440), Wir (1187), bereit (929), wolle (839), Vorsitzende (807), Anfrage (775), Präsident (721)

34 Erklärte (declared): using next-neighbor collocations
sagte (137), betonte (59), sprach (55), kündigte (44), wies (37), nannte (36), warnte (27), bekräftigte (24), meinte (24), kritisierte (23)

35 Collocation set for Leipzig (other cities in black)
in, Dresden, Berlin, Halle, Leipzig, Leipzig, und, Universität, Sachsen, Erfurt, Chemnitz, UM, Frankfurt, Hamburg, Rostock, Magdeburg, München, Leipziger, Hannover, Messe, Zwickau, studierte, nach, aus, Stadt, Stuttgart, Jena, Düsseldorf, Nürnberg, Reclam, Messestadt, sächsischen, DDR, Kischko, am, Köln, Däbritz, Karl-Marx-Universität, In, Rische, ostdeutschen, geboren, sächsische, bewölkt, Völkerschlacht, Bredow, Taucha, VEB, Edmond, Verlag, Buchmesse, Gewandhausorchester, Städten, Strombörse, Deutschen, Institut, GmbH, Lindner, Wurzen, GV, Verbundnetz, Ampler, Frankfurt am Main, Potsdam, Reclam Verlag, Städte, Cottbus, Versandzentrum, Handelshochschule, Hinrich Lehmann-Grube, Gera, Kirchentag, Völkerschlachtdenkmal, Buchstadt, Thomanerchor, Unterhaching, Lübeck, Oper, Dessau, Meppen, Studium, MDR, Philosophie, eröffnet, wurde, Anke Huber, Jens Lehmann, Turowski, Uwe Ampler, Weimar, ostdeutsche, Hecking, IAT, Boomtown, Buchkunst, Engelmann, Freistaat, Liebers, Dortmund, Mai, Mannheim, Schwerin, neuen Bundesländern, Grischin, VNG, Wende, bei, AG, Auto Mobil International, Cindy Klemrath, Gewandhaus, Messegelände, Parteitag, Bremen, Montagsdemonstrationen, Neubrandenburg, Gustav Kiepenheuer Verlag, Karl-Marx-Stadt, Journalistik, Ostdeutschland, Thomas Liese, Essen, Heidenreich, Udo Zimmermann, Umweltforschungszentrum, DHFK, Hochschule, Mainz, Oktober, Wolfgang Engel, Deutschen Hochschule für Körperkultur, Frankfurt/Main, Heldenstadt, Trommer, Wolfsburg, EBL, Universitäten, Wien, Bautzen, ...

36 First Iteration for Leipzig (other cities in black)
Frankfurt, Berlin, München, Stuttgart, Köln, Dresden, Hamburg, Hannover, Düsseldorf, Bremen, Karlsruhe, Potsdam, Wien, Paris, Magdeburg, Halle, Tübingen, Bonn, Freiburg, New York, Chemnitz, Darmstadt, Augsburg, Erfurt, Mannheim, Schweiz, Ulm, Bochum, Wiesbaden, Hanau, Braunschweig, Schwerin, Münster, Frankfurt am Main, London, USA, Regensburg, Cottbus, Göttingen, Kassel, Moskau, Passau, Rostock, Straßburg, Deutschland, Konstanz, Ausland, Dortmund, Heidelberg, Mainz, Würzburg, Zürich, Aachen, Offenbach, Weimar, Gießen, Koblenz, Italien, Chicago, Mailand, Osnabrück, Prag, Rom, Saarbrücken, Wuppertal, Niederlanden, Gera, Basel, Lyon, Nürnberg, Holland, Marburg, St. Petersburg, Amerika, Genf, Kaiserslautern, Tel Aviv, Woche, September, Tiergarten, dort, eröffnet, Budapest, Essen, Jena, Jerusalem, Neubrandenburg, Athen, Frankreich, Vereinigten Staaten, Amsterdam, Baden-Württemberg, Februar, Tempelhof, Trier, Venedig, Bayreuth, England, Erlangen, Indien, Belgrad, Duisburg, Heilbronn, Kairo, Ludwigsburg, Oldenburg, Oxford, Stockholm, Washington, Großbritannien, Görlitz, Kreuzberg, Lausanne, Lübeck, Mitte, Wochenende, April, Australien, Griechenland, Singapur, Florenz, Kanada, Kiel, Kopenhagen, Madrid, Mai, Südafrika, Tegel, Türkei, soeben, Bad Homburg, Bundesrepublik, Göppingen, Heute, Hongkong, Ingolstadt, Japan, Lande, Miami, Mittwoch, Oder, Sarajewo, Afghanistan, Argentinien, Baden-Baden, Bayern, Deutschlands, Europa, Haus, Iran, Istanbul, Peking, Rußland, neu, ...

37 Second Iteration for Leipzig (other cities in black)
Stuttgart, München, Frankfurt, Hamburg, Hannover, Köln, Berlin, Dresden, Bremen, Darmstadt, Karlsruhe, Freiburg, Potsdam, Mannheim, Wiesbaden, Düsseldorf, Tübingen, Magdeburg, Gießen, Augsburg, Rostock, Kassel, Halle, Ulm, Hanau, Heidelberg, Ludwigsburg, Konstanz, Nürnberg, Bonn, Schwerin, Münster, Wien, Dortmund, Würzburg, Chemnitz, Passau, Göttingen, Erfurt, Mitte, Aachen, Mainz, Friedberg, Nord, Regensburg, Braunschweig, Cottbus, New York, Kreuzberg, Frankfurt am Main, Göppingen, Tiergarten, Esslingen, Ravensburg, II, Hessen, Ost, Lübeck, Charlottenburg, Böblingen, Offenbach, Oldenburg, Osnabrück, Traunstein, Paris, Bad Homburg, London, Prenzlauer Berg, Neukölln, Tempelhof, Hellersdorf, Koblenz, Essen, Fulda, Trier, Lüneburg, Prag, Chicago, Landshut, Reinickendorf, USA, Wilmersdorf, Kiel, Bochum, Deutschland, Mittelfranken, Schöneberg, Marzahn, Oberbayern, Eimsbüttel, Niederrhein, Unterfranken, Wuppertal, Friedrichshain, Spandau, Oberfranken, Lichtenberg, Moskau, Oberpfalz, Bielefeld, Schweiz, Kaiserslautern, Kempten, Bayreuth, Schwaben, Zürich, Bamberg, Ingolstadt, Mailand, Oder, Heilbronn, Altona, Sarajewo, Marburg, Ansbach, Harburg, Berlin-Mitte, Jena, Steglitz, Suhl, Görlitz, Baden-Württemberg, Hessen-Süd, dort, Italien, Weimar, West, Saarbrücken, Ausland, Bayern, Ostwestfalen-Lippe, Moabit, Offenburg, Main, Polen, Amsterdam, Westliches, Mittlerer, eröffnet, ...

38 Part 5: The Vector Space Model

39 Feature Vectors Given by Collocations
If two words A and B have similar contexts, that is, if they are alike in their use, this indicates a semantic relation of some kind between A and B. A kind of average context for a word A is formed by all collocations for A above a certain significance threshold. This average context is turned into a feature vector for A of dimension n (the total number of words) in the usual way. The feature vector of A is a description of the meaning of A, because the most important words from the contexts of A are included. Clustering feature vectors can be used to investigate the relations within a group of similar words and to determine whether or not all the relations are of the same kind.
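A minimal sketch of thresholded feature vectors and their comparison; the words and significance values are invented for illustration:

```python
import math

def feature_vector(collocations, threshold):
    """Sparse feature vector of a word: its collocates above a
    significance threshold, weighted by significance."""
    return {w: s for w, s in collocations.items() if s >= threshold}

def cosine(u, v):
    """Cosine similarity of two sparse vectors."""
    num = sum(u[w] * v[w] for w in u.keys() & v.keys())
    den = math.sqrt(sum(x * x for x in u.values())) * \
          math.sqrt(sum(x * x for x in v.values()))
    return num / den if den else 0.0

montag = feature_vector({"Uhr": 50, "abend": 30, "teilte": 12, "Börse": 2}, 10)
dienstag = feature_vector({"Uhr": 45, "abend": 28, "sagte": 15, "Obst": 1}, 10)
sim = cosine(montag, dienstag)
```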

40 Clustering Months and Days — Example (1):
Jahres _____________________ Uhr, Ende, abend, vergangenen, Anfang, Jahres, Samstag, Freitag, Mitte, Sonntag
Donnerstag _ | Uhr, abend, heutigen, Nacht, teilte, Mittwoch, Freitag, worden, mitteilte, sagte
Dienstag _|_ | Uhr, abend, heutigen, teilte, Freitag, worden, kommenden, sagte, mitteilte, Nacht
Montag _ | | Uhr, abend, heutigen, Dienstag, kommenden, teilte, Freitag, worden, sagte, morgen
Mittwoch _|_|_ | Uhr, abend, heutigen, Nacht, Samstag, Freitag, Sonntag, kommenden, nachmittag
Samstag ___ | | Uhr, abend, Samstag, Nacht, Sonntag, Freitag, Montag, nachmittag, heutigen
Sonntag _ | | | Uhr, abend, Samstag, Nacht, Montag, kommenden, morgen, nachmittag, vergangenen
Freitag _|_|_|_____________ | Uhr, abend, Ende, Jahres, Samstag, Anfang, Freitag, Sonntag, heutigen, worden
Januar _________________ | | Uhr, Ende, Jahres, Anfang, Mitte, Samstag, Mai, August, März, Januar
August _______________ | | | Uhr, Ende, Jahres, Anfang, Mitte, Samstag, Mai, August, Januar, März
Juli _____________ | | | | Uhr, Jahres, Ende, Anfang, Mitte, Mai, Samstag, August, Januar, März
März ___________ | | | | | Uhr, Ende, Jahres, Anfang, Mitte, Samstag, Mai, Januar, März, April
Mai _________ | | | | | | Uhr, Ende, Jahres, Anfang, Mitte, Samstag, März, Januar, Mai, vergangenen
September _______ | | | | | | | Uhr, Ende, Jahres, Anfang, Mitte, Mai, Januar, März, Samstag, vergangenen
Februar _ | | | | | | | | Uhr, Januar, Jahres, Anfang, Mitte, Ende, März, November, Samstag, vergangenen
Dezember _|___ | | | | | | | | Uhr, Jahres, Ende, Anfang, Mitte, Mai, Januar, März, Samstag, vergangenen
November _ | | | | | | | | | Uhr, Jahres, Ende, Anfang, Mitte, September, vergangenen, Dezember, Samstag
Oktober _|_ | | | | | | | | | Uhr, Ende, Jahres, Anfang, Mai, Mitte, Samstag, September, März, vergangenen
April _ | | | | | | | | | | Uhr, Ende, Jahres, Mai, Anfang, März, Mitte, Prozent, Samstag, Hauptversammlung
Juni _|_|_|_|_|_|_|_|_|_|_|_

41 Clustering Leaders and Verbs of Utterance
Example (2): Clustering leaders
Präsident _________ sagte, Boris Jelzin, erklärte, stellvertretende, Bill Clinton, stellvertretender, Richter
Vorsitzender _______ | sagte, erklärte, stellvertretende, stellvertretender, Richter, Abteilung, bestätigte
Vorsitzende ___ | | sagte, erklärte, stellvertretende, Richter, bestätigte, Außenministeriums, teilte, gestern
Sprecher _ | | | sagte, erklärte, Außenministeriums, bestätigte, teilte, gestern, mitteilte, Anfrage
Sprecherin _|_|_ | | sagte, erklärte, stellvertretende, Richter, Abteilung, bestätigte, Außenministeriums, sagt
Chef _ | | | Abteilung, Instituts, sagte, sagt, stellvertretender, Professor, Staatskanzlei, Dr.
Leiter _|___|_|_|_
Example (3): Clustering verbs of utterance
verwies _____________ Sprecher, werde, gestern, Vorsitzende, Polizei, Sprecherin, Anfrage, Präsident, gebe
mitteilte ___________ | Sprecher, werde, gestern, Vorsitzende, Polizei, Sprecherin, Anfrage, Präsident, Montag
meinte _______ | | Sprecher, werde, gestern, Vorsitzende, Sprecherin, Anfrage, Präsident, gebe, Interview
bestätigte _____ | | | Sprecher, werde, gestern, Vorsitzende, Sprecherin, Anfrage, Präsident, gebe, Interview
betonte ___ | | | | Sprecher, werde, gestern, Vorsitzende, Sprecherin, Präsident, gebe, Interview, würden, Bonn
sagte _ | | | | | Sprecher, werde, gestern, Vorsitzende, Sprecherin, Präsident, gebe, Interview, würden
erklärte _|_|_|_|_ | | Sprecher, werde, gestern, Vorsitzende, Sprecherin, Präsident, Anfrage, gebe, Interview
warnte _ | | | Präsident, Vorsitzende, SPD, eindringlich, Ministerpräsident, CDU, Außenminister, Zugleich
sprach _|_______|_|_|_

42 The Clustering Algorithm
The single-link hierarchical agglomerative clustering method (HACM) works bottom-up as follows. All words are treated as (basic) items; each item has a description (feature vector). In each step of the clustering algorithm, the two items A and B with the most similar descriptions are found and merged into a new complex item C combining the words in A and B; each step thus reduces the number of items by one. The feature vector for C is constructed from the feature vectors of A and B by identifying the words A and B and calculating their joint collocations. The algorithm stops when only one item is left or when all remaining feature vectors are orthogonal. This usually results in a very natural clustering if the threshold for constructing the feature vectors is chosen suitably.
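The loop above can be sketched as follows. This is a simplification: where the talk recomputes joint collocations for the merged item, the sketch just adds the members' weights, and the toy vectors are invented:

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors."""
    num = sum(u[w] * v[w] for w in u.keys() & v.keys())
    den = math.sqrt(sum(x * x for x in u.values())) * \
          math.sqrt(sum(x * x for x in v.values()))
    return num / den if den else 0.0

def hacm(items, min_sim=0.0):
    """Single-link agglomerative clustering sketch: repeatedly merge the
    two most similar items into a complex item; stop when one item is
    left or all remaining vectors are (near-)orthogonal."""
    items = {name: dict(vec) for name, vec in items.items()}
    merges = []
    while len(items) > 1:
        names = list(items)
        best, best_sim = None, min_sim
        for i, x in enumerate(names):
            for y in names[i + 1:]:
                s = cosine(items[x], items[y])
                if s > best_sim:
                    best, best_sim = (x, y), s
        if best is None:
            break                          # remaining vectors orthogonal
        x, y = best
        merged = dict(items.pop(x))
        for w, v in items.pop(y).items():  # add the members' weights
            merged[w] = merged.get(w, 0) + v
        items[x + "+" + y] = merged
        merges.append((x, y, best_sim))
    return merges

days = {
    "Montag":   {"Uhr": 5, "abend": 3, "teilte": 2},
    "Dienstag": {"Uhr": 4, "abend": 3, "sagte": 2},
    "Januar":   {"Jahres": 5, "Ende": 4, "Mitte": 2},
}
result = hacm(days)
```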

43 Part 6: Combining Simple Methods

44 Combining Non-contradictory Partial Results
Such combinations yield more and/or better results.
Identical results: two or more of the above algorithms may suggest a certain relation between two words, for instance cohyponymy. Example: if both the second-order collocations and clustering by feature vectors independently yield similar sets of words, this may be taken as an indication of cohyponymy between the words, e.g. sagte, betonte, kündigte, wies, nannte, warnte, bekräftigte, meinte [...] (German verbs of utterance).

45 Types of Relations
Symmetric relations: a relation r is called symmetric if r(A, B) always implies r(B, A). Examples of symmetric relations are synonymy; cohyponymy (or similarity); membership in a certain subject area; and relations of unknown type. Usually, sentence collocations express symmetric relations.
Anti-symmetric relations: let us call a relation r anti-symmetric if r(A, B) never implies r(B, A). Examples of anti-symmetric relations are hyponymy, and relations between properties and their owners, such as action and actor, or class and instance. Usually, next-neighbor collocations of two words express anti-symmetric relations. In the case of next-neighbor collocations consisting of more than two words (like A prep/det/conj B, e.g. Samson and Delilah), the relation might be symmetric, for instance for conjunctions like and or or.
Transitivity: transitivity of a relation means that r(A, B) and r(B, C) always imply r(A, C). In general, a relation found experimentally will not be transitive, of course, but there may be a part where transitivity holds. Some of the most prominent transitive relations are cohyponymy, hyponymy, and synonymy.

46 Supporting Second Results
In the second combination type, a known relation given by one extraction method is verified by an identical but unnamed second result, as follows:
Result 1: there is a certain relation r between A and B.
Result 2: there is some strong (but unknown) relation between A and B (e.g. given by a collocation set).
Conclusion: Result 1 holds with more evidence.
Example of Result 2: the German compound Entschädigungsgesetz can be divided into Gesetz and Entschädigung with an unknown relation. Result 1 is given by the four-word next-neighbor collocation Gesetz über die Entschädigung. Similarly, Stundenkilometer is analyzed as Kilometer pro Stunde. In these examples, Result 1 alone is not enough, because there are collocations like Woche auf dem Tisch which do not describe a meaningful semantic relation.

47 Combining Three Results
Result 1: there is a relation r between A and B.
Result 2: B is similar to B′ (cohyponymy).
Result 3: there is some strong but unknown relation between A and B′.
Conclusion: there is a relation r between A and B′.
Example: as Result 1 we might know that Schwanz (tail) is part of Pferd (horse). Terms similar to Pferd are Kuh (cow) and Hund (dog) (Result 2). Both of them have the term Schwanz in their set of significant collocations (Result 3). Hence we might correctly conjecture that both Kuh and Hund have a tail (Schwanz) as part of their body. In contrast, Reiter (rider) is a strong collocation of Pferd and might (incorrectly) be conjectured to be another similar concept, but Reiter is not a collocation of Schwanz. Hence the absence of Result 3 prevents us from drawing an incorrect conclusion.

48 Similarity Used to Infer a Strong Property
Let us call a property p important if similarity respects this property. Such a property can be inferred as follows:
Result 1: A has a certain important property p.
Result 2: B is similar to A (i.e., B is a cohyponym of A).
Conclusion: B has the same property p.
Example: we consider A and B as similar if they are in the set of right-neighbor collocations of Hafenstadt (port town) (Result 2). If we know that being a Hafenstadt is a property of its typical right neighbors (Result 1), we may infer this property for more than 200 cities like Split, Sidon, Durban, Kismayo, Tyrus, Vlora, Karachi, Durres, [...].

49 Subject Area Inferred from Collocation Sets
Result 1: A, B, C, ... are collocates of a certain term.
Result 2: some of them belong to a certain subject area.
Conclusion: all of them belong to this subject area.
Example: consider the following top entries in the collocation set of carcinoma: patients, cell, squamous, radiotherapy, lung, thyroid, treated, hepatocellular, metastases, adenocarcinoma, cervix, irradiation, breast, treatment, CT, therapy, renal, cases, bladder, cervical, tumor, cancer, metastatic, radiation, uterine, ovarian, chemotherapy, [...] If we know that some of them belong to the subject area Medicine, we can add this subject area to the other members of the collocation set as well.
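The inference step can be sketched as a voting rule; the function, the minimum-vote threshold, and the toy dictionaries are illustrative assumptions, not the talk's implementation:

```python
from collections import Counter

def propagate_subject_area(collocation_set, known_areas, min_known=3):
    """If at least min_known collocates already carry the same subject
    area, assign that area to every member of the collocation set."""
    votes = Counter(known_areas[w] for w in collocation_set if w in known_areas)
    if not votes:
        return {}
    area, count = votes.most_common(1)[0]
    if count < min_known:
        return {}
    return {w: area for w in collocation_set}

known = {"patients": "Medicine", "radiotherapy": "Medicine", "tumor": "Medicine"}
carcinoma = ["patients", "cell", "squamous", "radiotherapy", "tumor", "cancer"]
labels = propagate_subject_area(carcinoma, known)
```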

50 Part 7: Temporal Analysis: Words of the Day

52 Overview
Input: today's news text, ca. … sentences; smaller than the large corpus by a factor of about 1000.
Number of sentence collocations: ca. …; number of next-neighbor collocations: ca. …; number of next-neighbor collocations with both words capitalized: ca. 300.
Corpus size on Sundays and Mondays: ca. 50% of weekdays.
Problem: find the important terms among the total number of words.
Frequency data available:
- total frequency today
- relative frequency compared to our large corpus
- total frequency in our large corpus
Morphosyntactic criteria:
- words and multiwords should be capitalized
- no inflected forms

53 Frequency Measures
Total frequency today:
- a minimum frequency is needed, otherwise too many words (cf. Zipf's law)
- today: minimum frequency of 12
- today: maximum frequency of 100 for relevant words
Relative frequency compared to our large corpus:
- a large factor implies importance
- small variation appears by chance
- threshold for importance: factor > 6; may be lowered for larger daily corpora
Total frequency in our large corpus:
- words should be familiar
- today: Wortschatz frequency > 20
- what about totally new words? Today: minimum frequency of 12, as above
Question: Which measure is closest to importance as felt by humans? Answer: total frequency today.
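The three criteria combine into one filter. A sketch under stated assumptions: the thresholds are the slide's values, the handling of "totally new words" (corpus frequency 0 passes the familiarity check) is one plausible reading of the slide, and all word frequencies are invented:

```python
def words_of_the_day(today_freq, corpus_freq, corpus_factor=1000,
                     min_today=12, max_today=100, min_factor=6,
                     min_corpus=20):
    """Apply the three frequency criteria: total frequency today,
    relative frequency vs. the large corpus, and familiarity."""
    chosen = []
    for word, f_today in today_freq.items():
        if not (min_today <= f_today <= max_today):
            continue                        # too rare, or too common to be specific
        f_corpus = corpus_freq.get(word, 0)
        expected = f_corpus / corpus_factor  # expected count at today's corpus size
        factor = f_today / expected if expected > 0 else float("inf")
        if factor <= min_factor:
            continue                        # not unusually frequent today
        if 0 < f_corpus < min_corpus:
            continue                        # neither familiar nor totally new
        chosen.append(word)
    return chosen

today = {"Hochwasser": 40, "Pisa-E": 15, "sagte": 90, "Komet": 8, "Hapax": 25}
corpus = {"Hochwasser": 5000, "sagte": 250000, "Hapax": 10}
picked = words_of_the_day(today, corpus)
```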

54 Words of the Day (without human inspection)

55 Words of the Day (after 5 minutes of inspection)

56 Problem: Find the Message
We automatically find the rarely mentioned name Jürgen Hart. We notice the words gestorben and tot and the phrase hörte sein Herz auf zu schlagen. Conclusion: we have an obituary.

57 Relations between Words
Today's collocation graph: connected words represent a strong relation.

58 Temporal Relations
We check whether collocations have repeatedly appeared together during the last 30 days.

59 Part 8: Document Similarity

60 Document Similarity
The description of a document consists of all of its terms that have been a Word of the Day at any time. Hence we use only approx. … terms for indexing. Documents are compared simply by counting their common terms, weighted by their frequencies.
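The comparison is a weighted overlap count. A minimal sketch with invented terms and weights:

```python
def doc_similarity(doc_a_terms, doc_b_terms, term_weights):
    """Similarity of two documents: count the index terms (former Words
    of the Day) they share, weighted by frequency (default weight 1)."""
    common = set(doc_a_terms) & set(doc_b_terms)
    return sum(term_weights.get(t, 1) for t in common)

weights = {"Arafat": 3, "Terror": 2}
s = doc_similarity(["Arafat", "Israel", "Terror", "Hebron"],
                   ["Arafat", "Bush", "Terror"], weights)
```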

61 Sample Similar Documents
(Document numbers and similarity values were not preserved; the shared terms of similar document pairs remain.)
Weltmeister Südkorea Michael_Ballack Oliver_Kahn Brasilien Yokohama Ballack DFB Rudi_Völler
Weltmeister Südkorea Dietmar_Hamann Thomas_Linke Weltmeisterschaft Carsten_Ramelow Rudi_Völler
Südkorea Michael_Ballack Weltmeisterschaft Ballack Elf Seoul
Israel Arafat Hebron Jericho Palästinenser Terror Bush Frieden US-Präsident_George_W._Bush
Israel Arafat Palästinenser Terror Bush US-Präsident_George_W._Bush
Israel Arafat Palästinenser Terror Bush US-Präsident_George_W._Bush
Israel Autonomiebehörde Arafat Hebron Palästinenser Terror Bush Frieden Jassir_Arafat US-Präsident_George_W._Bush

62 Topics of the Day
Given sets of similar documents, we can use clustering; the terms describing a cluster can be viewed as a Topic of the Day. The clustering algorithm: consider all documents (approx. 200 each day). For each pair of similar documents, consider their set of common Words of the Day. Then cluster these word sets using HACM: in each step, the most similar sets are combined. As similarity measure we use sim(A, B) = |A ∩ B| / |B| (the fraction of B that is contained in A). If sim(A, B) > 0.4, then B is replaced by A ∪ B and A is dropped. The algorithm stops when no two sets have similarity > 0.4.
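The set-merging step can be sketched as follows (a sketch of the rule as stated, with toy sets; not the talk's implementation):

```python
def cluster_word_sets(sets, threshold=0.4):
    """Merge sets of shared Words of the Day: sim(A, B) = |A & B| / |B|;
    if sim > threshold, B is replaced by A | B and A is dropped.
    Repeat until no pair exceeds the threshold."""
    sets = [set(s) for s in sets]
    merged = True
    while merged:
        merged = False
        for i in range(len(sets)):
            for j in range(len(sets)):
                if i == j:
                    continue
                a, b = sets[i], sets[j]
                if len(a & b) / len(b) > threshold:
                    sets[j] = a | b     # B absorbs A
                    del sets[i]         # A is dropped
                    merged = True
                    break
            if merged:
                break
    return sets

# Two overlapping topic sets and one unrelated set
result = cluster_word_sets([{1, 2, 3, 4}, {2, 3, 4, 5, 6}, {10, 11}])
```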

63 Clusters of 25/6/2002 (titles were assigned by hand)
NAHOST 1: Gaza-Streifen Arafat Außenminister_Schimon_Peres Jerusalem Terroristen Westjordanland Hebron Israel Panzer Attentäter Palästinenser Ramallah Selbstmordanschläge Israelis Ausgangssperre Autonomiebehörde Tulkarem Bethlehem Dschenin Gazastreifen Terror US-Präsident_George_W._Bush Anschlägen Nablus
FORMEL1 2: Grand_Prix Rubens_Barrichello Ferrari Ralf_Schumacher McLaren-Mercedes Coulthard Barrichello Montoya Großbritannien Schumacher Michael_Schumacher Nürburgring Großen_Preis Stallorder Weltmeister Brasilien Fußball-WM
STOLPE 3: Potsdam SPD-Generalsekretär_Franz_Müntefering Lothar_Späth Stolpe Bundestagswahlkampf Matthias_Platzeck Müntefering Schönbohm Bundesrat Platzeck Brandenburg Manfred_Stolpe Cottbus Zuwanderungsgesetz PDS Wittenberge Ministerpräsident_Manfred_Stolpe Jörg_Schönbohm Bundestagswahl Schröder
WM 4: Korea Südkorea Skibbe Seoul Oliver_Kahn Südkoreaner Michael_Ballack Koreaner Spanier Hitze Nationalmannschaft Elfmeterschießen Viertelfinale Paraguay WM-Halbfinale Miroslav_Klose Völler Jens_Jeremies Karl-Heinz_Rummenigge Klose Golden_Goal Weltmeister Türken Senegal Fußball Verlängerung Brasilien Elf Weltmeisterschaft Entschuldigung Rudi_Völler Portugal Ronaldo Rivaldo Achtelfinale Argentinien Fifa Dietmar_Hamann
PISA 5: Nordrhein-Westfalen Gymnasien Pisa-E Naturwissenschaften Brandenburg Rheinland-Pfalz Sachsen-Anhalt
BÖRSE 6: T-Aktie Allzeittief Neuen_Markt DAX France_Télécom Moody's Tarifrunde
HARTZ 7: Hartz SPD-Generalsekretär_Franz_Müntefering Bundeswirtschaftsminister_Werner_Müller Florian_Gerster Arbeitslosenzahl FDP-Chef_Guido_Westerwelle Hartz-Kommission

64 Clusters of 26/6/2002
WM 1: Fußball Ilhan_Mansiz Golden_Goal WM-Halbfinale Senegal Türken Schröder Weltmeister Bundesinnenminister_Otto_Schily Völler Bundespräsident_Johannes_Rau Brasilien Bundeskanzler_Gerhard_Schröder Ballack Frings Neuville Bierhoff Jeremies Klose Ramelow Korea Südkorea Michael_Ballack Oliver_Kahn Beckham Weltmeisterschaft Zidane Pelé Ronaldo Rivaldo Miroslav_Klose Viertelfinale Paraguay Jens_Jeremies FC_Liverpool Seoul Christian_Ziege Spanier Sebastian_Kehl Elf Saudi-Arabien Thomas_Linke Nationalmannschaft Rudi_Völler Seo Carsten_Ramelow Christoph_Metzelder Foul WM-Finale Koreaner Südkoreaner Oliver_Bierhoff Dietmar_Hamann Yokohama Schiedsrichter Franz_Beckenbauer Portugal Guus_Hiddink DFB Oliver_Neuville Marco_Bode Gelbe_Karte Fifa Franzosen Yoo
NAHOST 2: Ariel_Scharon Israel Arafat Palästinenser Bush Nahen_Osten Autonomiebehörde Hebron Terror Frieden Jassir_Arafat US-Präsident_George_W._Bush Jericho Ramallah Israelis Scharon Weiße_Haus George_Bush Palästina Westjordanland Jerusalem Ministerpräsident_Ariel_Scharon Panzer Großbritannien Gewalt Palästinenserpräsident_Jassir_Arafat US-Regierung Anschläge Waffen Intifada
ERFURT 3: Schule Massaker Erfurt Lehrer Steinhäuser Rainer_Heise Robert_Steinhäuser
STOLPE 4: Brandenburg Bundesrat Stolpe Bundespräsident_Johannes_Rau PDS Jörg_Schönbohm Schönbohm Lothar_Späth Platzeck Matthias_Platzeck Manfred_Stolpe
FORMEL1 5: Weltmeister Rubens_Barrichello Nürburgring Großen_Preis McLaren-Mercedes Barrichello Ralf_Schumacher Ferrari Michael_Schumacher Schumacher Jean_Todt
BÖRSE 6: Moody's Neuen_Markt DAX ABN_Amro Goldman_Sachs France_Télécom
BABCOCK 7: Babcock Nordrhein-Westfalen Oberhausen Bürgschaft Babcock_Borsig IG_Metall Stellenabbau Indien
HOLZMANN 8: Philipp_Holzmann_AG Ottmar_Hermann Baukonzern Philipp_Holzmann Holzmann Insolvenz Niederländer

65 Comparison
25/6: WM NAHOST FORMEL1 STOLPE BÖRSE PISA HARTZ
26/6: WM NAHOST FORMEL1 STOLPE BÖRSE ERFURT BABCOCK HOLZMANN
Some topics appear repeatedly, either on consecutive days or after some interval. Once a topic has been introduced by hand, it will be detected automatically later on.

66 Thank you.
