Learning semantic relations using very large corpora Uwe Quasthoff Institut für Informatik Universität Leipzig


2 Contents
Part 1: Introduction to the Wortschatz lexicon
Part 2: Collocations and the collocation measure
Part 3: Applications of collocations
Part 4: Clustering Collocations
Part 5: The Vector Space Model
Part 6: Combining Simple Methods
Part 7: Temporal Analysis: Words of the Day
Part 8: Document Similarity

3 Language Data

4 Full query for Leipzig
Count: 9967
Description: city in Germany (over … inhabitants), city in Saxony
Grammar: part of speech: proper noun; form(s): Leipzig [9967], Leipzigs [276]
-er adjective / inhabitant of the city: Leipziger [3553]
Part of multiwords: VfB Leipzig [403], SC Leipzig [183], ..., Erste Baugesellschaft Leipzig AG [8], ...
Example: Auch Debütanten aus angrenzenden Sprachräumen, die sich über Leipzig den deutschen Buchmarkt erschließen möchten, bietet die Buchmesse ein geeignetes Forum. (Source: OTS-Newsticker)
Sentence collocations: Dresden (1488), Berlin (694), Halle (470), Universität (266), Sachsen (265), ..., DDR-Bürger (5), DDR-Innenminister (5), DTSB (5), ...
Left neighbors: Universität (392), Stadt (201), Reclam (102), Handelshochschule (51), Oper (50), Karl-Marx-Universität (48), Raum (36), ...

5 Empirical Analysis of Associations

6 Collocations for Schweine
On the right side we find a collection of similar animals (all in the plural): Rinder, Hühner, Kühe, Schafe. On the left side we find words describing the aspect of slaughtering.

7 Collocations for Stich
Two groups for two different meanings: tennis (Michael Stich, Boris Becker, etc.) and the card game Skat with the three players Vorhand (lead), Mittelhand, and Hinterhand. The thin connection between Becker and Vorhand represents Becker's strong forehand.

8 Funny collocation sets
Identifying English words in German text. Collocations for the: of, and, to, The, on, for, is, from, you, with, that, it, world, are, be, not, We, at, World, we, have, this, by, they, when, You, can, When, into, what, your, or, But, time, And, like, over, Breaking, only, one, but, shall, which, has, What, road, as, On, same, people, out, our, This, It, way, best, who, no, my, more, his, up, their, ...
In the same way we find dialect words. The Berlin dialect is identified using the collocations for ick: det, nich, Ick, Det, hab, is, ne, ooch, keene, wat, weeß, uff, de, ma, nu, keen, dat, aba, och, jing, jut, Nee, meen, Jöre, een, mach, inne, watt, wa, jenuch, kieke, janze, kumm, janz, tau, Mutta, janzen, hätt, sag, wieda, kleene, ha, hör, imma, un, habense, kriejen, ejal, zwee, nischt, nee, Wetta, jedacht, hebb, heff, ...

9 Analysis of other languages
The above procedure was applied, without changes, to English, French, Dutch, Sorbian, and Italian.

10 Dutch and French

11 Sorbian and Italian

14 Part 2: Collocations and the collocation measure

15 The Common Birthday Problem
The common birthday problem: What is the probability p that (at least) two people share a birthday, given n randomly chosen people?
Modification: What is the probability p that there are k boy/girl couples with the same birthday (different couples may have different birthdays), given a randomly chosen boys and b randomly chosen girls?
Reformulation as a collocation problem:
Common birthday problem                    → collocation problem
- number a of boys                         → number a of sentences containing A
- number b of girls                        → number b of sentences containing B
- number of days in a year (i.e. n = 365)  → total number n of sentences
- number k of couples with the same birthday → number k of sentences containing both A and B
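The analogy can be checked numerically. A small simulation sketch (`simulate_couples` is a hypothetical helper, not from the talk): the expected number of boy/girl couples sharing a birthday is exactly a·b/n, which is the Poisson parameter λ of the collocation model below.

```python
import random

def simulate_couples(a, b, n=365, trials=20000, seed=0):
    """Estimate the expected number of boy/girl couples sharing a birthday
    when a boys and b girls draw birthdays uniformly from n days."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        boy_days = [rng.randrange(n) for _ in range(a)]
        girl_days = [rng.randrange(n) for _ in range(b)]
        counts = {}
        for d in boy_days:
            counts[d] = counts.get(d, 0) + 1
        # each girl forms one couple with every boy sharing her birthday
        total += sum(counts.get(d, 0) for d in girl_days)
    return total / trials

a, b, n = 30, 30, 365
lam = a * b / n                    # Poisson parameter, about 2.47
estimate = simulate_couples(a, b, n)
```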

16 Introduction to the Poisson Distribution
We can calculate the probability of multiple joint occurrences of independent events as follows. Given two independent events observed in an experiment with probabilities p_a and p_b, respectively, the probability of their joint occurrence is p_a · p_b. Next we repeat the experiment n times and are interested in k joint occurrences. With λ = n · p_a · p_b we get the probability
P(k) = e^(−λ) λ^k / k!
For at least k joint occurrences we get the probability
P(X ≥ k) = 1 − Σ_{i=0}^{k−1} e^(−λ) λ^i / i!
To measure the surprise at the joint occurrence of non-independent events, we simply calculate the probability as if they were independent; then we are surprised to see such a rare event.
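The two probabilities above can be sketched directly (function names are illustrative):

```python
import math

def poisson_pmf(k, lam):
    """P(exactly k joint occurrences) = e^(-lam) * lam^k / k!"""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def poisson_tail(k, lam):
    """P(at least k joint occurrences) = 1 - sum over i < k of P(i)."""
    return 1.0 - sum(poisson_pmf(i, lam) for i in range(k))

# n = 10000 repetitions of two independent events with p_a = p_b = 0.01,
# so lam = n * p_a * p_b = 1.0
lam = 10000 * 0.01 * 0.01
p_two_or_more = poisson_tail(2, lam)
```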

17 The Collocation Measure
The collocation measure of two words A and B is defined as the negative logarithm of the above probability, divided by log n. For λ = ab/n we get
sig(A, B) = −log P(X ≥ k) / log n
Approximations: If (k+1)/λ > 10 (which is typically true), the tail is dominated by its first term, so
sig(A, B) ≈ (λ − k log λ + log k!) / log n
and if, moreover, k > 10, Stirling's formula log k! ≈ k log k − k gives
sig(A, B) ≈ (λ + k (log k − log λ − 1)) / log n
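A sketch of the measure and its approximation, assuming the Poisson form above (the tail is summed upward from k, since computing 1 minus the lower sum loses all precision for the tiny tails that occur here):

```python
import math

def poisson_tail(k, lam, terms=200):
    """P(X >= k), summed upward from k to avoid cancellation in 1 - sum."""
    t = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))  # P(X = k)
    total = 0.0
    for i in range(k, k + terms):
        total += t
        t *= lam / (i + 1)
    return total

def sig_exact(n, k, a, b):
    """sig(A,B) = -log P(X >= k) / log n, with lam = a*b/n."""
    lam = a * b / n
    return -math.log(poisson_tail(k, lam)) / math.log(n)

def sig_approx(n, k, a, b):
    """Approximation for (k+1)/lam > 10 and k > 10 (Stirling for log k!):
    sig ~ (lam + k*(log k - log lam - 1)) / log n."""
    lam = a * b / n
    return (lam + k * (math.log(k) - math.log(lam) - 1)) / math.log(n)

# Illustrative values: lam = 1, (k+1)/lam = 21 > 10, k = 20 > 10
n, k, a, b = 10**6, 20, 1000, 1000
```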

18 Comparison to Log-Likelihood
Comparison of the formulae: translating the log-likelihood formula into our notation and ignoring small terms, we get
log-likelihood ≈ 2 (k log(k/λ) − k + λ)
compared to
sig(A, B) · log n ≈ k log(k/λ) − k + λ
Note: this may only apply in the typical case (k+1)/λ > 10.

19 Comparing Results: The Sources
                      IDS Cosmas I (W-PUB)      Wortschatz (German)
Corpus size           374 million               255 million
Sources               mainly newspapers         mainly newspapers
Window size           fixed (here: ±5 words)    sentence
Collocation measure   log-likelihood            Poisson distribution

20 Comparing Results: Collocations for Bier
Rank  IDS Cosmas I  rating   Wortschatz (German)  sig rating
1     Wein          4351     trinken              1234
2     trinken       2745     Wein                 648
3     getrunken     1715     getrunken            478
4     kühles        1627     Liter                460
5     Glas          1379     trinkt               428
6     Liter         1318     Glas                 348
7     Faß           1236     Schnaps              318
8     Fass          1139     Hektoliter           300
9     Flasche       1071     Flaschen             272
10    Hektoliter    899      gebraut              269
11    Flaschen      873      Sekt                 239
(the lists continue: Trinkt, ..., Kaffee, ...)

21 Properties of sig(n,k,a,b) I
Simple co-occurrence: if A and B each occur only once, and they occur together, then sig(n,1,1,1) ≈ 1. This ensures that the minimum significance threshold is independent of the corpus size.
Independence: if A and B occur statistically independently with probabilities p and q, then sig(n, npq, np, nq) ≈ 0.
Enlarging the corpus by a factor m: sig(mn, mk, ma, mb) ≈ m · sig(n, k, a, b). This is useful for comparing corpora of different sizes.
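The independence and scaling properties can be checked numerically with the approximate formula from the previous slide (the form of `sig` is an assumption reconstructed from the properties themselves):

```python
import math

def sig(n, k, a, b):
    """Approximate Poisson collocation measure (assumed form):
    (lam + k*(log k - log lam - 1)) / log n, with lam = a*b/n."""
    lam = a * b / n
    return (lam + k * (math.log(k) - math.log(lam) - 1)) / math.log(n)

# Independence: k = n*p*q, a = n*p, b = n*q  =>  lam = k and the log term vanishes
n = 10**8
indep = sig(n, 100, 10**5, 10**5)       # lam = 10^10 / 10^8 = 100 = k

# Corpus scaling: sig(mn, mk, ma, mb) grows roughly m-fold -- exactly m-fold
# in the numerator; the log n denominator adds a mild correction
base = sig(10**6, 50, 2000, 3000)
scaled = sig(10**7, 500, 20000, 30000)  # m = 10
```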

22 Properties of sig(n,k,a,b) II
Additivity: the unification of the words B and B′ just adds the corresponding significances. For k/b ≈ k′/b′ we have
sig(n,k,a,b) + sig(n,k′,a,b′) ≈ sig(n,k+k′,a,b+b′)
This has applications for grouping words by various methods.
Maximum: max_B sig(A, B), taken over all words B for fixed a. It might be useful to know how strong a collocation is compared to the possible maximum.

23 Part 3: Applications of collocations

24 Applications
- Collocations of inflected forms or basic forms?
- A numeric measure for polysemy
- Identification of proper names and phrases
- Compound analysis

25 Collocations of inflected forms or basic forms?
Collocations of basic forms give more results because of their higher frequency. But collocations of basic forms and inflected forms may differ strongly. Example:
Collocations for As: Karo (488), K (393), Pik (391), Treff (307), Coeur (296), D (258), Herz (190), Karte (189), Kreuz (178), As (166), Süd (145), Matchball (113), Hinterhand (110), West (110), Karo-Bube (101), Ärmel (95), Ost (94), Vorhand (86), ...
Collocations for Asse: Ivanisevic (72), Becker (63), schlug (62), Aufschlag (56), servierte (47), Sampras (40), Goran Ivanisevic (32), Spiel (25), Stich (24), gewann (24), Wolfenbüttel (23), Kroate (22), schlagen (22), Asse (21), Match (21), Satz (21), ...

26 Application of additivity
First calculate collocations for inflected forms; then, if desired, use additivity to calculate the measure for basic forms. Example: collocations for Bundeskanzler: ..., betonte (46), ..., betont (21), ..., betonten (7), ... Additivity gives significance 74 for the pair (Bundeskanzler, {betonen, betont, betonte, betonten}).
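The merge is just a sum over the inflected forms (values taken from the slide):

```python
# Significances of Bundeskanzler with the inflected forms of "betonen";
# additivity lets us merge them into one value for the basic form.
inflected = {"betonte": 46, "betont": 21, "betonten": 7}
merged_sig = sum(inflected.values())
```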

27 A numeric measure for polysemy: space
The collocations of space, taken from our general-language corpus of English, fall mainly into three classes: the subject areas computer, real estate, and outer space. The corresponding senses of space are denoted space 1, space 2, and space 3. Assigning the top 30 collocations of space (disk, shuttle, square, station, NASA, feet, ...) to these three senses gives a quantitative estimate of the senses:
space 1 (…%): disk (2629), memory (718), storage (479), program (308), RAM (307), free (300), hard (336)
space 2 (…%): shuttle (2618), station (991), NASA (920), Space (602), launch (505), astronauts (473), Challenger (420), manned (406), NASA's (297), flight (293), Atlantis (291), Mir (335), rocket (329), orbit (326), Discovery (341), mission (385)
space 3 (18.6%): square (1163), feet (822), leased (567), office (382), lessor (390)
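The sense percentages are just each group's share of the total collocation counts; with the counts listed above, the 18.6% given on the slide for the real-estate sense is roughly reproduced:

```python
# Top collocations of "space" grouped by sense (counts from the slide)
senses = {
    "computer": {"disk": 2629, "memory": 718, "storage": 479, "program": 308,
                 "RAM": 307, "free": 300, "hard": 336},
    "outer_space": {"shuttle": 2618, "station": 991, "NASA": 920, "Space": 602,
                    "launch": 505, "astronauts": 473, "Challenger": 420,
                    "manned": 406, "NASA's": 297, "flight": 293, "Atlantis": 291,
                    "Mir": 335, "rocket": 329, "orbit": 326, "Discovery": 341,
                    "mission": 385},
    "real_estate": {"square": 1163, "feet": 822, "leased": 567, "office": 382,
                    "lessor": 390},
}
total = sum(sum(counts.values()) for counts in senses.values())
shares = {name: sum(counts.values()) / total for name, counts in senses.items()}
```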

28 Proper Names and Phrases
A large relative collocation measure sig_C(A) indicates that a considerable part of all occurrences of the word C is together with A. Hence C might be the head with respect to A.
Left word       Right word   Head
Alzheimersche   Krankheit    left
AQA             total        left
Anorexia        nervosa      left and right
Algighiero      Boetti       left and right
30jährige       US-Bond      right
André           Lussi        right

29 Compound analysis using multi-word collocations
Assume we know that Geschwindigkeitsüberschreitung has the parts Geschwindigkeit and Überschreitung. If a multi-word collocation (here: Überschreitung der Geschwindigkeit) is of some predefined form, we accept this collocation as a semantic description.
Pattern      Word A     Word B          Compound
A aus B      Orgie      Farben          Farbenorgie
A der B      Bebauung   Insel           Inselbebauung
A mit B      Feld       Getreide        Getreidefeld
A in der B   Feldbau    Regenzeit       Regenzeitfeldbau
A für B      Übung      Anfänger        Anfängerübung
A für die B  Gebäude    Flugsicherung   Flugsicherungsgebäude
A von B      Anbau      Kaffee          Kaffeeanbau
A zur B      Andrang    Eröffnung       Eröffnungsandrang
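The pattern check can be sketched as a simple lookup (the function and its pattern list are an illustration of the idea, not the talk's implementation):

```python
def supports_compound(part_a, part_b, multiword_collocations,
                      patterns=("{a} aus {b}", "{a} der {b}", "{a} mit {b}",
                                "{a} in der {b}", "{a} für {b}",
                                "{a} für die {b}", "{a} von {b}", "{a} zur {b}")):
    """Accept the split of a compound into part_a / part_b if some observed
    multi-word collocation instantiates one of the predefined patterns."""
    candidates = {p.format(a=part_a, b=part_b) for p in patterns}
    return any(c in candidates for c in multiword_collocations)

collocs = {"Überschreitung der Geschwindigkeit", "Woche auf dem Tisch"}
ok = supports_compound("Überschreitung", "Geschwindigkeit", collocs)
```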

30 Part 4: Clustering Collocations

31 Clustering
So far we are able to find relations between words, but they are still of unknown type; moreover, different types are mixed.
Problem 1: How do we construct sets of relations of a fixed type?
Problem 2: How do we identify the type of a relation using background information?

32 Collocations of collocation sets
The production of collocations is now applied to sets of (next-neighbor or sentence) collocations instead of sentences.
The collection of sentence collocations contains the following "sentence" for Hemd: Hemd Krawatte Hose weißes Anzug weißem Jeans trägt trug bekleidet weißen Jacke schwarze Jackett schwarzen Weste kariertes Schlips Mann
The collection of next-neighbor collocations contains the following two "sentences" for Hemd: weißes weißem weißen blaues kariertes kariertem offenem aufs karierten gestreiftes letztes ... / bekleidet ausgezogen spannt trägt aufknöpft ausgeplündert auszieht wechseln aufgeknöpft ausziehen ...

33 Erklärte (declared): using sentence collocations
Sprecher (2581), werde (2302), gestern (1696), seien (1440), Wir (1187), bereit (929), wolle (839), Vorsitzende (807), Anfrage (775), Präsident (721)

34 Erklärte (declared): using next-neighbor collocations
sagte (137), betonte (59), sprach (55), kündigte (44), wies (37), nannte (36), warnte (27), bekräftigte (24), meinte (24), kritisierte (23)

35 Collocation set for Leipzig (other cities in black)
in, Dresden, Berlin, Halle, Leipzig, Leipzig, und, Universität, Sachsen, Erfurt, Chemnitz, UM, Frankfurt, Hamburg, Rostock, Magdeburg, München, Leipziger, Hannover, Messe, Zwickau, studierte, nach, aus, Stadt, Stuttgart, Jena, Düsseldorf, Nürnberg, Reclam, Messestadt, sächsischen, DDR, Kischko, am, Köln, Däbritz, Karl-Marx-Universität, In, Rische, ostdeutschen, geboren, sächsische, bewölkt, Völkerschlacht, Bredow, Taucha, VEB, Edmond, Verlag, Buchmesse, Gewandhausorchester, Städten, Strombörse, Deutschen, Institut, GmbH, Lindner, Wurzen, GV, Verbundnetz, Ampler, Frankfurt am Main, Potsdam, Reclam Verlag, Städte, Cottbus, Versandzentrum, Handelshochschule, Hinrich Lehmann-Grube, Gera, Kirchentag, Völkerschlachtdenkmal, Buchstadt, Thomanerchor, Unterhaching, Lübeck, Oper, Dessau, Meppen, Studium, MDR, Philosophie, eröffnet, wurde, Anke Huber, Jens Lehmann, Turowski, Uwe Ampler, Weimar, ostdeutsche, Hecking, IAT, Boomtown, Buchkunst, Engelmann, Freistaat, Liebers, Dortmund, Mai, Mannheim, Schwerin, neuen Bundesländern, Grischin, VNG, Wende, bei, AG, Auto Mobil International, Cindy Klemrath, Gewandhaus, Messegelände, Parteitag, Bremen, Montagsdemonstrationen, Neubrandenburg, Gustav Kiepenheuer Verlag, Karl-Marx-Stadt, Journalistik, Ostdeutschland, Thomas Liese, Essen, Heidenreich, Udo Zimmermann, Umweltforschungszentrum, DHFK, Hochschule, Mainz, Oktober, Wolfgang Engel, Deutschen Hochschule für Körperkultur, Frankfurt/Main, Heldenstadt, Trommer, Wolfsburg, EBL, Universitäten, Wien, Bautzen, ...

36 First Iteration for Leipzig (other cities in black)
Frankfurt, Berlin, München, Stuttgart, Köln, Dresden, Hamburg, Hannover, Düsseldorf, Bremen, Karlsruhe, Potsdam, Wien, Paris, Magdeburg, Halle, Tübingen, Bonn, Freiburg, New York, Chemnitz, Darmstadt, Augsburg, Erfurt, Mannheim, Schweiz, Ulm, Bochum, Wiesbaden, Hanau, Braunschweig, Schwerin, Münster, Frankfurt am Main, London, USA, Regensburg, Cottbus, Göttingen, Kassel, Moskau, Passau, Rostock, Straßburg, Deutschland, Konstanz, Ausland, Dortmund, Heidelberg, Mainz, Würzburg, Zürich, Aachen, Offenbach, Weimar, Gießen, Koblenz, Italien, Chicago, Mailand, Osnabrück, Prag, Rom, Saarbrücken, Wuppertal, Niederlanden, Gera, Basel, Lyon, Nürnberg, Holland, Marburg, St. Petersburg, Amerika, Genf, Kaiserslautern, Tel Aviv, Woche, September, Tiergarten, dort, eröffnet, Budapest, Essen, Jena, Jerusalem, Neubrandenburg, Athen, Frankreich, Vereinigten Staaten, Amsterdam, Baden-Württemberg, Februar, Tempelhof, Trier, Venedig, Bayreuth, England, Erlangen, Indien, Belgrad, Duisburg, Heilbronn, Kairo, Ludwigsburg, Oldenburg, Oxford, Stockholm, Washington, Großbritannien, Görlitz, Kreuzberg, Lausanne, Lübeck, Mitte, Wochenende, April, Australien, Griechenland, Singapur, Florenz, Kanada, Kiel, Kopenhagen, Madrid, Mai, Südafrika, Tegel, Türkei, soeben, Bad Homburg, Bundesrepublik, Göppingen, Heute, Hongkong, Ingolstadt, Japan, Lande, Miami, Mittwoch, Oder, Sarajewo, Afghanistan, Argentinien, Baden-Baden, Bayern, Deutschlands, Europa, Haus, Iran, Istanbul, Peking, Rußland, neu, ...

37 Second Iteration for Leipzig (other cities in black)
Stuttgart, München, Frankfurt, Hamburg, Hannover, Köln, Berlin, Dresden, Bremen, Darmstadt, Karlsruhe, Freiburg, Potsdam, Mannheim, Wiesbaden, Düsseldorf, Tübingen, Magdeburg, Gießen, Augsburg, Rostock, Kassel, Halle, Ulm, Hanau, Heidelberg, Ludwigsburg, Konstanz, Nürnberg, Bonn, Schwerin, Münster, Wien, Dortmund, Würzburg, Chemnitz, Passau, Göttingen, Erfurt, Mitte, Aachen, Mainz, Friedberg, Nord, Regensburg, Braunschweig, Cottbus, New York, Kreuzberg, Frankfurt am Main, Göppingen, Tiergarten, Esslingen, Ravensburg, II, Hessen, Ost, Lübeck, Charlottenburg, Böblingen, Offenbach, Oldenburg, Osnabrück, Traunstein, Paris, Bad Homburg, London, Prenzlauer Berg, Neukölln, Tempelhof, Hellersdorf, Koblenz, Essen, Fulda, Trier, Lüneburg, Prag, Chicago, Landshut, Reinickendorf, USA, Wilmersdorf, Kiel, Bochum, Deutschland, Mittelfranken, Schöneberg, Marzahn, Oberbayern, Eimsbüttel, Niederrhein, Unterfranken, Wuppertal, Friedrichshain, Spandau, Oberfranken, Lichtenberg, Moskau, Oberpfalz, Bielefeld, Schweiz, Kaiserslautern, Kempten, Bayreuth, Schwaben, Zürich, Bamberg, Ingolstadt, Mailand, Oder, Heilbronn, Altona, Sarajewo, Marburg, Ansbach, Harburg, Berlin-Mitte, Jena, Steglitz, Suhl, Görlitz, Baden-Württemberg, Hessen-Süd, dort, Italien, Weimar, West, Saarbrücken, Ausland, Bayern, Ostwestfalen-Lippe, Moabit, Offenburg, Main, Polen, Amsterdam, Westliches, Mittlerer, eröffnet, ...

38 Part 5: The Vector Space Model

39 Feature Vectors Given by Collocations
If two words A and B have similar contexts, that is, if they are alike in their use, this indicates a semantic relation of some kind between A and B. A kind of average context for a word A is formed by all collocations for A above a certain significance threshold. This average context is turned into a feature vector for A of dimension n (the total number of words) in the usual way. The feature vector of A is a description of the meaning of A, because the most important words from the contexts of A are included. Clustering feature vectors can be used to investigate the relations within a group of similar words and to determine whether or not all the relations are of the same kind.
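A minimal sketch of thresholded feature vectors and their comparison; the words and significance values are invented for illustration:

```python
import math

def feature_vector(collocations, threshold):
    """Sparse feature vector of a word: its collocates above a
    significance threshold, weighted by significance."""
    return {w: s for w, s in collocations.items() if s >= threshold}

def cosine(u, v):
    """Cosine similarity of two sparse vectors."""
    num = sum(u[w] * v[w] for w in u.keys() & v.keys())
    den = math.sqrt(sum(x * x for x in u.values())) * \
          math.sqrt(sum(x * x for x in v.values()))
    return num / den if den else 0.0

montag = feature_vector({"Uhr": 50, "abend": 30, "teilte": 12, "Börse": 2}, 10)
dienstag = feature_vector({"Uhr": 45, "abend": 28, "sagte": 15, "Obst": 1}, 10)
sim = cosine(montag, dienstag)
```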

40 Clustering Months and Days — Example (1):
Jahres _____________________ Uhr, Ende, abend, vergangenen, Anfang, Jahres, Samstag, Freitag, Mitte, Sonntag
Donnerstag _ | Uhr, abend, heutigen, Nacht, teilte, Mittwoch, Freitag, worden, mitteilte, sagte
Dienstag _|_ | Uhr, abend, heutigen, teilte, Freitag, worden, kommenden, sagte, mitteilte, Nacht
Montag _ | | Uhr, abend, heutigen, Dienstag, kommenden, teilte, Freitag, worden, sagte, morgen
Mittwoch _|_|_ | Uhr, abend, heutigen, Nacht, Samstag, Freitag, Sonntag, kommenden, nachmittag
Samstag ___ | | Uhr, abend, Samstag, Nacht, Sonntag, Freitag, Montag, nachmittag, heutigen
Sonntag _ | | | Uhr, abend, Samstag, Nacht, Montag, kommenden, morgen, nachmittag, vergangenen
Freitag _|_|_|_____________ | Uhr, abend, Ende, Jahres, Samstag, Anfang, Freitag, Sonntag, heutigen, worden
Januar _________________ | | Uhr, Ende, Jahres, Anfang, Mitte, Samstag, Mai, August, März, Januar
August _______________ | | | Uhr, Ende, Jahres, Anfang, Mitte, Samstag, Mai, August, Januar, März
Juli _____________ | | | | Uhr, Jahres, Ende, Anfang, Mitte, Mai, Samstag, August, Januar, März
März ___________ | | | | | Uhr, Ende, Jahres, Anfang, Mitte, Samstag, Mai, Januar, März, April
Mai _________ | | | | | | Uhr, Ende, Jahres, Anfang, Mitte, Samstag, März, Januar, Mai, vergangenen
September _______ | | | | | | | Uhr, Ende, Jahres, Anfang, Mitte, Mai, Januar, März, Samstag, vergangenen
Februar _ | | | | | | | | Uhr, Januar, Jahres, Anfang, Mitte, Ende, März, November, Samstag, vergangenen
Dezember _|___ | | | | | | | | Uhr, Jahres, Ende, Anfang, Mitte, Mai, Januar, März, Samstag, vergangenen
November _ | | | | | | | | | Uhr, Jahres, Ende, Anfang, Mitte, September, vergangenen, Dezember, Samstag
Oktober _|_ | | | | | | | | | Uhr, Ende, Jahres, Anfang, Mai, Mitte, Samstag, September, März, vergangenen
April _ | | | | | | | | | | Uhr, Ende, Jahres, Mai, Anfang, März, Mitte, Prozent, Samstag, Hauptversammlung
Juni _|_|_|_|_|_|_|_|_|_|_|_

41 Clustering Leaders and Verbs of Utterance
Example (2): Clustering leaders
Präsident _________ sagte, Boris Jelzin, erklärte, stellvertretende, Bill Clinton, stellvertretender, Richter
Vorsitzender _______ | sagte, erklärte, stellvertretende, stellvertretender, Richter, Abteilung, bestätigte
Vorsitzende ___ | | sagte, erklärte, stellvertretende, Richter, bestätigte, Außenministeriums, teilte, gestern
Sprecher _ | | | sagte, erklärte, Außenministeriums, bestätigte, teilte, gestern, mitteilte, Anfrage
Sprecherin _|_|_ | | sagte, erklärte, stellvertretende, Richter, Abteilung, bestätigte, Außenministeriums, sagt
Chef _ | | | Abteilung, Instituts, sagte, sagt, stellvertretender, Professor, Staatskanzlei, Dr.
Leiter _|___|_|_|_
Example (3): Clustering verbs of utterance
verwies _____________ Sprecher, werde, gestern, Vorsitzende, Polizei, Sprecherin, Anfrage, Präsident, gebe
mitteilte ___________ | Sprecher, werde, gestern, Vorsitzende, Polizei, Sprecherin, Anfrage, Präsident, Montag
meinte _______ | | Sprecher, werde, gestern, Vorsitzende, Sprecherin, Anfrage, Präsident, gebe, Interview
bestätigte _____ | | | Sprecher, werde, gestern, Vorsitzende, Sprecherin, Anfrage, Präsident, gebe, Interview
betonte ___ | | | | Sprecher, werde, gestern, Vorsitzende, Sprecherin, Präsident, gebe, Interview, würden, Bonn
sagte _ | | | | | Sprecher, werde, gestern, Vorsitzende, Sprecherin, Präsident, gebe, Interview, würden
erklärte _|_|_|_|_ | | Sprecher, werde, gestern, Vorsitzende, Sprecherin, Präsident, Anfrage, gebe, Interview
warnte _ | | | Präsident, Vorsitzende, SPD, eindringlich, Ministerpräsident, CDU, Außenminister, Zugleich
sprach _|_______|_|_|_

42 The Clustering Algorithm
The single-link hierarchical agglomerative clustering method (HACM) works bottom-up as follows. All words are treated as (basic) items; each item has a description (feature vector). In each step of the clustering algorithm, the two items A and B with the most similar descriptions are found and merged into a new complex item C combining the words in A and B; each step thus reduces the number of items by one. The feature vector for C is constructed from the feature vectors of A and B by identifying the words A and B and calculating their joint collocations. The algorithm stops when only one item is left or when all remaining feature vectors are orthogonal. This usually results in a very natural clustering if the threshold for constructing the feature vectors is chosen suitably.
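The loop above can be sketched as follows. This is a simplification: where the talk recomputes joint collocations for the merged item, the sketch just adds the members' weights, and the toy vectors are invented:

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors."""
    num = sum(u[w] * v[w] for w in u.keys() & v.keys())
    den = math.sqrt(sum(x * x for x in u.values())) * \
          math.sqrt(sum(x * x for x in v.values()))
    return num / den if den else 0.0

def hacm(items, min_sim=0.0):
    """Single-link agglomerative clustering sketch: repeatedly merge the
    two most similar items into a complex item; stop when one item is
    left or all remaining vectors are (near-)orthogonal."""
    items = {name: dict(vec) for name, vec in items.items()}
    merges = []
    while len(items) > 1:
        names = list(items)
        best, best_sim = None, min_sim
        for i, x in enumerate(names):
            for y in names[i + 1:]:
                s = cosine(items[x], items[y])
                if s > best_sim:
                    best, best_sim = (x, y), s
        if best is None:
            break                          # remaining vectors orthogonal
        x, y = best
        merged = dict(items.pop(x))
        for w, v in items.pop(y).items():  # add the members' weights
            merged[w] = merged.get(w, 0) + v
        items[x + "+" + y] = merged
        merges.append((x, y, best_sim))
    return merges

days = {
    "Montag":   {"Uhr": 5, "abend": 3, "teilte": 2},
    "Dienstag": {"Uhr": 4, "abend": 3, "sagte": 2},
    "Januar":   {"Jahres": 5, "Ende": 4, "Mitte": 2},
}
result = hacm(days)
```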

43 Part 6: Combining Simple Methods

44 Combining Non-contradictory Partial Results
Such combinations yield more and/or better results.
Identical results: two or more of the above algorithms may suggest a certain relation between two words, for instance cohyponymy. Example: if both the second-order collocations and clustering by feature vectors independently yield similar sets of words, this may be taken as an indication of cohyponymy between the words, e.g. sagte, betonte, kündigte, wies, nannte, warnte, bekräftigte, meinte [...] (German verbs of utterance).

45 Types of Relations
Symmetric relations: a relation r is called symmetric if r(A, B) always implies r(B, A). Examples of symmetric relations are synonymy; cohyponymy (or similarity); membership in a certain subject area; and relations of unknown type. Usually, sentence collocations express symmetric relations.
Anti-symmetric relations: let us call a relation r anti-symmetric if r(A, B) never implies r(B, A). Examples of anti-symmetric relations are hyponymy, and relations between properties and their owners, such as action and actor, or class and instance. Usually, next-neighbor collocations of two words express anti-symmetric relations. In the case of next-neighbor collocations consisting of more than two words (like A prep/det/conj B, e.g. Samson and Delilah), the relation might be symmetric, for instance for conjunctions like and or or.
Transitivity: transitivity of a relation means that r(A, B) and r(B, C) always imply r(A, C). In general, a relation found experimentally will not be transitive, of course, but there may be a part where transitivity holds. Some of the most prominent transitive relations are cohyponymy, hyponymy, and synonymy.

46 Supporting Second Results
In the second combination type, a known relation given by one extraction method is verified by an identical but unnamed second result, as follows:
Result 1: there is a certain relation r between A and B.
Result 2: there is some strong (but unknown) relation between A and B (e.g. given by a collocation set).
Conclusion: Result 1 holds with more evidence.
Example of Result 2: the German compound Entschädigungsgesetz can be divided into Gesetz and Entschädigung with an unknown relation. Result 1 is given by the four-word next-neighbor collocation Gesetz über die Entschädigung. Similarly, Stundenkilometer is analyzed as Kilometer pro Stunde. In these examples, Result 1 alone is not enough, because there are collocations like Woche auf dem Tisch which do not describe a meaningful semantic relation.

47 Combining Three Results
Result 1: there is a relation r between A and B.
Result 2: B is similar to B′ (cohyponymy).
Result 3: there is some strong but unknown relation between A and B′.
Conclusion: there is a relation r between A and B′.
Example: as Result 1 we might know that Schwanz (tail) is part of Pferd (horse). Terms similar to Pferd are Kuh (cow) and Hund (dog) (Result 2). Both of them have the term Schwanz in their set of significant collocations (Result 3). Hence we might correctly conjecture that both Kuh and Hund have a tail (Schwanz) as part of their body. In contrast, Reiter (rider) is a strong collocation of Pferd and might (incorrectly) be conjectured to be another similar concept, but Reiter is not a collocation of Schwanz. Hence the absence of Result 3 prevents us from drawing an incorrect conclusion.

48 Similarity Used to Infer a Strong Property
Let us call a property p important if similarity respects this property. Such a property can be inferred as follows:
Result 1: A has a certain important property p.
Result 2: B is similar to A (i.e., B is a cohyponym of A).
Conclusion: B has the same property p.
Example: we consider A and B as similar if they are in the set of right-neighbor collocations of Hafenstadt (port town) (Result 2). If we know that being a Hafenstadt is a property of its typical right neighbors (Result 1), we may infer this property for more than 200 cities like Split, Sidon, Durban, Kismayo, Tyrus, Vlora, Karachi, Durres, [...].

49 Subject Area Inferred from Collocation Sets
Result 1: A, B, C, ... are collocates of a certain term.
Result 2: some of them belong to a certain subject area.
Conclusion: all of them belong to this subject area.
Example: consider the following top entries in the collocation set of carcinoma: patients, cell, squamous, radiotherapy, lung, thyroid, treated, hepatocellular, metastases, adenocarcinoma, cervix, irradiation, breast, treatment, CT, therapy, renal, cases, bladder, cervical, tumor, cancer, metastatic, radiation, uterine, ovarian, chemotherapy, [...] If we know that some of them belong to the subject area Medicine, we can add this subject area to the other members of the collocation set as well.
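The inference step can be sketched as a voting rule; the function, the minimum-vote threshold, and the toy dictionaries are illustrative assumptions, not the talk's implementation:

```python
from collections import Counter

def propagate_subject_area(collocation_set, known_areas, min_known=3):
    """If at least min_known collocates already carry the same subject
    area, assign that area to every member of the collocation set."""
    votes = Counter(known_areas[w] for w in collocation_set if w in known_areas)
    if not votes:
        return {}
    area, count = votes.most_common(1)[0]
    if count < min_known:
        return {}
    return {w: area for w in collocation_set}

known = {"patients": "Medicine", "radiotherapy": "Medicine", "tumor": "Medicine"}
carcinoma = ["patients", "cell", "squamous", "radiotherapy", "tumor", "cancer"]
labels = propagate_subject_area(carcinoma, known)
```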

50 Part 7: Temporal Analysis: Words of the Day

52 Overview
Input: today's news text, ca. … sentences; smaller than the large corpus by a factor of about 1000.
Number of sentence collocations: ca. …; number of next-neighbor collocations: ca. …; number of next-neighbor collocations with both words capitalized: ca. 300.
Corpus size on Sundays and Mondays: ca. 50% of weekdays.
Problem: find the important terms among the total number of words.
Frequency data available:
- total frequency today
- relative frequency compared to our large corpus
- total frequency in our large corpus
Morphosyntactic criteria:
- words and multiwords should be capitalized
- no inflected forms

53 Frequency Measures
Total frequency today:
- a minimum frequency is needed, otherwise too many words (cf. Zipf's law)
- today: minimum frequency of 12
- today: maximum frequency of 100 for relevant words
Relative frequency compared to our large corpus:
- a large factor implies importance
- small variation appears by chance
- threshold for importance: factor > 6; may be lowered for larger daily corpora
Total frequency in our large corpus:
- words should be familiar
- today: Wortschatz frequency > 20
- what about totally new words? Today: minimum frequency of 12, as above
Question: Which measure is closest to importance as felt by humans? Answer: total frequency today.
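The three criteria combine into one filter. A sketch under stated assumptions: the thresholds are the slide's values, the handling of "totally new words" (corpus frequency 0 passes the familiarity check) is one plausible reading of the slide, and all word frequencies are invented:

```python
def words_of_the_day(today_freq, corpus_freq, corpus_factor=1000,
                     min_today=12, max_today=100, min_factor=6,
                     min_corpus=20):
    """Apply the three frequency criteria: total frequency today,
    relative frequency vs. the large corpus, and familiarity."""
    chosen = []
    for word, f_today in today_freq.items():
        if not (min_today <= f_today <= max_today):
            continue                        # too rare, or too common to be specific
        f_corpus = corpus_freq.get(word, 0)
        expected = f_corpus / corpus_factor  # expected count at today's corpus size
        factor = f_today / expected if expected > 0 else float("inf")
        if factor <= min_factor:
            continue                        # not unusually frequent today
        if 0 < f_corpus < min_corpus:
            continue                        # neither familiar nor totally new
        chosen.append(word)
    return chosen

today = {"Hochwasser": 40, "Pisa-E": 15, "sagte": 90, "Komet": 8, "Hapax": 25}
corpus = {"Hochwasser": 5000, "sagte": 250000, "Hapax": 10}
picked = words_of_the_day(today, corpus)
```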

54 Words of the Day (without human inspection)

55 Words of the Day (after 5 minutes of inspection)

56 Problem: Find the Message
We automatically find the rarely mentioned name Jürgen Hart. We notice the words gestorben and tot and the phrase hörte sein Herz auf zu schlagen. Conclusion: we have an obituary.

57 Relations between Words
Today's collocation graph: connected words represent a strong relation.

58 Temporal Relations
We check whether collocations have repeatedly appeared together during the last 30 days.

59 Part 8: Document Similarity

60 Document Similarity
The description of a document consists of all of its terms that have been a Word of the Day at any time. Hence we use only approx. … terms for indexing. Documents are compared simply by counting their common terms, weighted by their frequencies.
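The comparison is a weighted overlap count. A minimal sketch with invented terms and weights:

```python
def doc_similarity(doc_a_terms, doc_b_terms, term_weights):
    """Similarity of two documents: count the index terms (former Words
    of the Day) they share, weighted by frequency (default weight 1)."""
    common = set(doc_a_terms) & set(doc_b_terms)
    return sum(term_weights.get(t, 1) for t in common)

weights = {"Arafat": 3, "Terror": 2}
s = doc_similarity(["Arafat", "Israel", "Terror", "Hebron"],
                   ["Arafat", "Bush", "Terror"], weights)
```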

61 Sample Similar Documents
(Document numbers and similarity values were not preserved; the shared terms of similar document pairs remain.)
Weltmeister Südkorea Michael_Ballack Oliver_Kahn Brasilien Yokohama Ballack DFB Rudi_Völler
Weltmeister Südkorea Dietmar_Hamann Thomas_Linke Weltmeisterschaft Carsten_Ramelow Rudi_Völler
Südkorea Michael_Ballack Weltmeisterschaft Ballack Elf Seoul
Israel Arafat Hebron Jericho Palästinenser Terror Bush Frieden US-Präsident_George_W._Bush
Israel Arafat Palästinenser Terror Bush US-Präsident_George_W._Bush
Israel Arafat Palästinenser Terror Bush US-Präsident_George_W._Bush
Israel Autonomiebehörde Arafat Hebron Palästinenser Terror Bush Frieden Jassir_Arafat US-Präsident_George_W._Bush

62 Topics of the Day
Given sets of similar documents, we can use clustering; the terms describing a cluster can be viewed as a Topic of the Day. The clustering algorithm: consider all documents (approx. 200 each day). For each pair of similar documents, consider their set of common Words of the Day. Then cluster these word sets using HACM: in each step, the most similar sets are combined. As similarity measure we use sim(A, B) = |A ∩ B| / |B| (the fraction of B that is contained in A). If sim(A, B) > 0.4, then B is replaced by A ∪ B and A is dropped. The algorithm stops when no two sets have similarity > 0.4.
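The set-merging step can be sketched as follows (a sketch of the rule as stated, with toy sets; not the talk's implementation):

```python
def cluster_word_sets(sets, threshold=0.4):
    """Merge sets of shared Words of the Day: sim(A, B) = |A & B| / |B|;
    if sim > threshold, B is replaced by A | B and A is dropped.
    Repeat until no pair exceeds the threshold."""
    sets = [set(s) for s in sets]
    merged = True
    while merged:
        merged = False
        for i in range(len(sets)):
            for j in range(len(sets)):
                if i == j:
                    continue
                a, b = sets[i], sets[j]
                if len(a & b) / len(b) > threshold:
                    sets[j] = a | b     # B absorbs A
                    del sets[i]         # A is dropped
                    merged = True
                    break
            if merged:
                break
    return sets

# Two overlapping topic sets and one unrelated set
result = cluster_word_sets([{1, 2, 3, 4}, {2, 3, 4, 5, 6}, {10, 11}])
```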

63 Clusters of 25/6/2002 (titles were assigned by hand)
NAHOST 1: Gaza-Streifen Arafat Außenminister_Schimon_Peres Jerusalem Terroristen Westjordanland Hebron Israel Panzer Attentäter Palästinenser Ramallah Selbstmordanschläge Israelis Ausgangssperre Autonomiebehörde Tulkarem Bethlehem Dschenin Gazastreifen Terror US-Präsident_George_W._Bush Anschlägen Nablus
FORMEL1 2: Grand_Prix Rubens_Barrichello Ferrari Ralf_Schumacher McLaren-Mercedes Coulthard Barrichello Montoya Großbritannien Schumacher Michael_Schumacher Nürburgring Großen_Preis Stallorder Weltmeister Brasilien Fußball-WM
STOLPE 3: Potsdam SPD-Generalsekretär_Franz_Müntefering Lothar_Späth Stolpe Bundestagswahlkampf Matthias_Platzeck Müntefering Schönbohm Bundesrat Platzeck Brandenburg Manfred_Stolpe Cottbus Zuwanderungsgesetz PDS Wittenberge Ministerpräsident_Manfred_Stolpe Jörg_Schönbohm Bundestagswahl Schröder
WM 4: Korea Südkorea Skibbe Seoul Oliver_Kahn Südkoreaner Michael_Ballack Koreaner Spanier Hitze Nationalmannschaft Elfmeterschießen Viertelfinale Paraguay WM-Halbfinale Miroslav_Klose Völler Jens_Jeremies Karl-Heinz_Rummenigge Klose Golden_Goal Weltmeister Türken Senegal Fußball Verlängerung Brasilien Elf Weltmeisterschaft Entschuldigung Rudi_Völler Portugal Ronaldo Rivaldo Achtelfinale Argentinien Fifa Dietmar_Hamann
PISA 5: Nordrhein-Westfalen Gymnasien Pisa-E Naturwissenschaften Brandenburg Rheinland-Pfalz Sachsen-Anhalt
BÖRSE 6: T-Aktie Allzeittief Neuen_Markt DAX France_Télécom Moody's Tarifrunde
HARTZ 7: Hartz SPD-Generalsekretär_Franz_Müntefering Bundeswirtschaftsminister_Werner_Müller Florian_Gerster Arbeitslosenzahl FDP-Chef_Guido_Westerwelle Hartz-Kommission

64 Clusters of 26/6/2002
WM 1: Fußball Ilhan_Mansiz Golden_Goal WM-Halbfinale Senegal Türken Schröder Weltmeister Bundesinnenminister_Otto_Schily Völler Bundespräsident_Johannes_Rau Brasilien Bundeskanzler_Gerhard_Schröder Ballack Frings Neuville Bierhoff Jeremies Klose Ramelow Korea Südkorea Michael_Ballack Oliver_Kahn Beckham Weltmeisterschaft Zidane Pelé Ronaldo Rivaldo Miroslav_Klose Viertelfinale Paraguay Jens_Jeremies FC_Liverpool Seoul Christian_Ziege Spanier Sebastian_Kehl Elf Saudi-Arabien Thomas_Linke Nationalmannschaft Rudi_Völler Seo Carsten_Ramelow Christoph_Metzelder Foul WM-Finale Koreaner Südkoreaner Oliver_Bierhoff Dietmar_Hamann Yokohama Schiedsrichter Franz_Beckenbauer Portugal Guus_Hiddink DFB Oliver_Neuville Marco_Bode Gelbe_Karte Fifa Franzosen Yoo
NAHOST 2: Ariel_Scharon Israel Arafat Palästinenser Bush Nahen_Osten Autonomiebehörde Hebron Terror Frieden Jassir_Arafat US-Präsident_George_W._Bush Jericho Ramallah Israelis Scharon Weiße_Haus George_Bush Palästina Westjordanland Jerusalem Ministerpräsident_Ariel_Scharon Panzer Großbritannien Gewalt Palästinenserpräsident_Jassir_Arafat US-Regierung Anschläge Waffen Intifada
ERFURT 3: Schule Massaker Erfurt Lehrer Steinhäuser Rainer_Heise Robert_Steinhäuser
STOLPE 4: Brandenburg Bundesrat Stolpe Bundespräsident_Johannes_Rau PDS Jörg_Schönbohm Schönbohm Lothar_Späth Platzeck Matthias_Platzeck Manfred_Stolpe
FORMEL1 5: Weltmeister Rubens_Barrichello Nürburgring Großen_Preis McLaren-Mercedes Barrichello Ralf_Schumacher Ferrari Michael_Schumacher Schumacher Jean_Todt
BÖRSE 6: Moody's Neuen_Markt DAX ABN_Amro Goldman_Sachs France_Télécom
BABCOCK 7: Babcock Nordrhein-Westfalen Oberhausen Bürgschaft Babcock_Borsig IG_Metall Stellenabbau Indien
HOLZMANN 8: Philipp_Holzmann_AG Ottmar_Hermann Baukonzern Philipp_Holzmann Holzmann Insolvenz Niederländer

65 Comparison
25/6: WM NAHOST FORMEL1 STOLPE BÖRSE PISA HARTZ
26/6: WM NAHOST FORMEL1 STOLPE BÖRSE ERFURT BABCOCK HOLZMANN
Some topics appear repeatedly, either on consecutive days or after some interval. Once a topic has been introduced by hand, it will be detected automatically later on.

66 Thank you.
