The Wortschatz Project Language-independent Methods for Enriching Corpora Chris Biemann NLP Department, University of Leipzig email@example.com Universitetet i Oslo, 11/10/2005
2 Project Objectives Efficient processing of large amounts of plain-text data Availability for virtually any language - ASCII and UTF-8 encoding Applying language-independent methods to enrich the data Providing corpora of standard sizes for as many languages as possible via our website and web services Project leader since 1993: Uwe Quasthoff http://www.wortschatz.uni-leipzig.de http://corpora.informatik.uni-leipzig.de
4 Sources of Texts
- free online newspapers
- literature not protected by copyright
- website content
Ensuring monolinguality: sentence-based language detector - rejection rate ~5%, false prediction rate ~0.0x %
Avoiding copyright problems: the sentence is the largest unit in the database, so original documents cannot be reconstructed
5 Standard Processing Steps
Preprocessing
Language verification
Indexing words and multiterms
Co-occurrences
Biemann, Chr.; Bordag, S.; Heyer, G.; Quasthoff, U.; Wolff, Chr. (2004): Language-independent Methods for Compiling Monolingual Lexical Data, Proceedings of CICLing 2004, Seoul, Korea; Springer LNCS 2945, pp. 215-228, Springer-Verlag Berlin Heidelberg
10 Poisson Distribution for Measuring Associations
We can calculate the probability of multiple joint occurrences of independent events as follows: given two independent events with probabilities p_a and p_b, respectively, the probability of their joint occurrence in one experiment is p_a * p_b. We repeat the experiment n times and are interested in k joint occurrences. With λ = n * p_a * p_b, the probability of exactly k joint occurrences is
P(k) = e^(-λ) λ^k / k!
and the probability of at least k joint occurrences is
P(X ≥ k) = 1 - Σ_{i=0}^{k-1} e^(-λ) λ^i / i!
To measure the surprise at the joint occurrence of non-independent events, we simply calculate the probability as if they were independent; observing such a rare event is then surprising.
11 Significance Measure for Co-occurrences
The co-occurrence measure of two words A and B is defined as the negative logarithm of the above probability, divided by log n. With λ = ab/n (a, b the frequencies of A and B, k their joint frequency, n the number of sentences) we get
sig(A,B) = -log P(X ≥ k) / log n
Approximations: if (k+1)/λ > 10 (which is typically true), the tail is dominated by its first term, giving
sig(A,B) ≈ (λ - k log λ + log k!) / log n
and if moreover k > 10, Stirling's formula yields
sig(A,B) ≈ k (log k - log λ - 1) / log n
Quasthoff, U.; Wolff, Chr. (2002): The Poisson Collocation Measure and its Applications. In: Proc. Second International Workshop on Computational Approaches to Collocations, Vienna, July 2002
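A toy re-implementation of the measure may make this concrete (this is not the project's code; it computes the exact Poisson tail rather than the approximations above, and variable names follow the slide):

```python
import math

def poisson_sig(k, a, b, n):
    """Poisson co-occurrence significance: the negative logarithm of the
    tail probability P(X >= k), normalised by log n.
    k: joint frequency of words A and B; a, b: their corpus frequencies;
    n: number of sentences; lambda = a*b/n as on the slide."""
    lam = a * b / n
    # P(X >= k) = 1 - sum_{i=0}^{k-1} e^-lam * lam^i / i!
    # (terms computed in log space via lgamma to avoid overflow)
    head = sum(math.exp(-lam + i * math.log(lam) - math.lgamma(i + 1))
               for i in range(k))
    tail = max(1.0 - head, 1e-300)  # guard against underflow to zero
    return -math.log(tail) / math.log(n)
```

Higher values mean a more surprising, i.e. more significant, co-occurrence: 50 joint occurrences of two words with frequency 100 each in a 100,000-sentence corpus (λ = 0.1) score far above a single joint occurrence.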
12 Norwegian Politics significant left neighbours of Stoltenberg: Jens (1091), Thorvald (137), Regjeringen (64), statsminister (52), Fredrikke (26), regjeringen (24), regjeringa (24), Holmboe (24), Robert (22), Mathias (20), Nini (17), Vincent (15), ] (7), sier (6) significant right neighbours of Stoltenberg: har (25), og (15),, (13), snakker (12), mener (11), forsvarer (11), understreket (10), skriver (10), sa (9), skrev (8), g (8), vil (7), støtter (7),. (7), må (6), forteller (6), holder (5) significant left neighbours of Bondevik: Magne (1553), statsminister (167), regjeringen (165), Odd (137), Statsminister (128), Regjeringen (69), sier (37), Margreta (37), Kjell-Magne (31), sa (19), at (18), Biskop (18), Regjeringa (17), regjeringa (15), Stein (12), Jarle (9), Dagsavisen (9), Marianne (8), Hilde (8), ( (8), ] (6), Kanskje (5) significant right neighbours of Bondevik: II (60), har (44), II-regjeringen (34), og (26), Spre (24), vil (20), delte (19),, (19), sier (15), mener (15), ( (15), uttalte (14), overtok (13), orienterte (13), grep (13), Foreleser (13), & (13), Statsminister (12), senker (11), oppnevnte (11), gikk (11), varsler (10), dreper (10), undertegnet (9), Midt (9), var (7), tok (6), skrev (6), kjem (6), foreslår (6), eller (6),. (6), stiller (5), måtte (5), føler (5)
13 Visualisation of Co-occurrences
Simulated annealing is used to lay out the graph of significant co-occurrences; the resulting clusters make ambiguous words visible.
14 Word Sense Discrimination
Three (or more) words either belong to one topic or do not fit together at all
15 WSD algorithm
1. Input: word w
2. Take the first 15 co-occurrences of w, ordered by co-occurrence significance
3. Generate all combinations of w and two of the 15 co-occurrences (105 different triples of words)
4. Retrieve the intersections of their co-occurrence sets
5. Cluster the intersections
6. If there are still co-occurrences above the threshold left, go to step 2; else stop
S. Bordag (2003): Sentence Co-occurrences as Small-World Graphs: A Solution to Automatic Lexical Disambiguation. In: A. Gelbukh (Ed.): CICLing 2003, LNCS 2588, pp. 329-332, Springer-Verlag Berlin Heidelberg
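The triple-intersection core (steps 2-5) can be sketched as follows. This is a simplification, not Bordag's implementation: `cooc` (word -> {co-occurrent: significance}) is a toy stand-in for the co-occurrence database, and the clustering is a crude single-pass merge of overlapping intersections:

```python
from itertools import combinations

def sense_sets(word, cooc, top_n=15):
    """For each pair (a, b) among the word's top co-occurrents, build the
    triple (word, a, b) and intersect the three co-occurrence sets.
    Non-empty intersections are merged into clusters when they overlap;
    each resulting cluster approximates one sense of the word."""
    sig = cooc[word]
    top = sorted(sig, key=sig.get, reverse=True)[:top_n]
    clusters = []
    for a, b in combinations(top, 2):
        common = set(cooc[word]) & set(cooc.get(a, {})) & set(cooc.get(b, {}))
        if not common:
            continue  # the triple does not fit together (slide 14)
        for c in clusters:
            if c & common:  # overlaps an existing cluster: same topic
                c |= common
                break
        else:
            clusters.append(common)
    return clusters
```

On a toy database where "space" co-occurs with both {rocket, nasa, launch} and {office, rent, desk}, the function separates the two topics into two clusters.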
17 Small World Graphs
SW graphs as an optimum between two extremes:
Cluster coefficient: everywhere in the graph there are local clusters that scale - very much like a fractal
Short path lengths explain the small-world effect in conversation ("Oh, you know him too!") and why natural searching works so well.
Graph: regular | random | small-world
path length: long | short | short
clustering coefficient: high | low | high
D.J. Watts and S.H. Strogatz (1998): Collective dynamics of 'small-world' networks. Nature (393), pp. 440-442.
A.L. Barabasi et al. (2000): Scale-free characteristics of random networks: the topology of the World-Wide Web. Physica A (281), pp. 70-77.
18 Small World Graphs (ctd.)
SW graphs with the same properties and almost the same numbers can be found in a variety of applications:
- power line networks
- neural net structure of living creatures
- language, e.g. co-occurrence graphs
- the world wide web
M. Steyvers, J. B. Tenenbaum (2002): The large-scale structure of semantic networks: statistical analyses and a model of semantic growth. Cognitive Science.
19 Word of the day
Task: extraction of important keywords from newspapers on a daily basis
Method: compile a daily corpus and compare it to a large reference corpus
Needed: 5-10 downloadable online newspapers in the target language; alternatively use allTheWeb.com with enhanced options
Measures used: relative frequency ratio in combination with statistics of surprise
Classification of keywords into categories: 5 minutes of daily workload
Richter, M. (2005): Analysis and Visualization for Daily Newspaper Corpora, Proceedings of RANLP-05, Borovets, Bulgaria
20 Frequency bands
absolute frequency of today
- minimum frequency needed because of Zipf's law
- maximum frequency needed for eliminating stop words
- currently: lower threshold of 10, upper threshold of 60
relative frequency factor compared to reference corpus
- large factors indicate importance in today's news; small fluctuations are noise
- threshold value (empirically determined): 6
absolute frequency in reference corpus
- absolute frequency measures how well known a concept is; this is a binary decision - people either know a concept or not
- currently: threshold is a count of 20 in the reference corpus
- new words are excluded from this calculation
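The three frequency bands combine into a simple filter, sketched below. Two simplifications are assumed: corpus sizes are approximated by summing the word counts, and words below the reference threshold are simply skipped (the slide's handling of new words may differ); the threshold defaults follow the slide:

```python
def words_of_the_day(today, reference, min_f=10, max_f=60,
                     min_factor=6, min_ref=20):
    """Keyword filter over the slide's frequency bands: today's absolute
    frequency must lie in [min_f, max_f], the size-normalised relative
    frequency factor must reach min_factor, and the word needs at least
    min_ref reference occurrences to count as 'well known'.
    `today` and `reference` map words to absolute counts."""
    n_today = sum(today.values())
    n_ref = sum(reference.values())
    keywords = []
    for w, f in today.items():
        ref = reference.get(w, 0)
        if not (min_f <= f <= max_f) or ref < min_ref:
            continue
        factor = (f / n_today) / (ref / n_ref)  # relative frequency factor
        if factor >= min_factor:
            keywords.append(w)
    return keywords
```

A topical word like "Katrina" passes easily (huge factor), while a stop word like "the" stays inside its usual relative frequency and is filtered out.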
21 Word of the day: 24/09/2005 http://wortschatz.informatik.uni-leipzig.de/wort-des-tages
22 Word stock market for 1-hit-wonders: Katrina in New Orleans
23 Word Stock market for periodic events: Formula 1 11/09/2005: Alonso did not succeed in obtaining the world championship before the end of the season
24 German Parliament elections 19/09/2005: predicting the future? Analyzing text Polls Counting the votes
25 Co-occurrences of higher Orders (sentence-based)
co-occurrences of first order: words that co-occur significantly often together in sentences
co-occurrences of second order: words that co-occur significantly often in co-occurrence sets of first order
co-occurrences of n-th order: words that co-occur significantly often in co-occurrence sets of (n-1)-th order
When calculating a higher order, the significance values of the preceding order are not relevant. A co-occurrence set consists of the N highest-ranked co-occurrences of a word.
Biemann, C.; Bordag, S.; Quasthoff, U. (2004): Automatic Acquisition of Paradigmatic Relations using Iterated Co-occurrences, Proceedings of LREC 2004, Lisboa, Portugal
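One iteration step can be sketched by treating each co-occurrence set of order n-1 as a pseudo-sentence. The sketch only produces the raw joint counts; a real system would apply the significance test of slide 11 on top of them:

```python
from itertools import combinations
from collections import Counter

def higher_order(cooc_sets):
    """One iteration from order n-1 to order n: each word's co-occurrence
    set (its N highest-ranked co-occurrents) acts as a pseudo-sentence,
    and joint occurrences inside these sets are counted.
    `cooc_sets` maps each word to its order-(n-1) co-occurrence set."""
    joint = Counter()
    for members in cooc_sets.values():
        for a, b in combinations(sorted(members), 2):
            joint[(a, b)] += 1
    return joint
```

Note that, as the slide says, the significance values of the preceding order play no role here; only set membership matters.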
27 Intersection of Co-occurrence Sets: resolving ambiguity Herz-Bube Stich Becker Achtelfinale - Aufschlag - Boris Becker - Daviscup - Doppel - DTB - Edberg - Finale - Graf - Haas - Halbfinale - Match - Pilic - Runde - Sampras - Satz - Tennis - Turnier - Viertelfinale - Weltrangliste - Wimbledon Alleinspieler - Herz - Herz-Dame - Herz-König - Hinterhand - Karo - Karo-As - Karo-Bube - Kreuz-As - Kreuz-Bube - Pik-As - Pik-Bube - Pik-König - Vorhand - Becker - Courier - Einzel - Elmshorn - French Open - Herz-As - ins - Kafelnikow - Karbacher - Krajicek - Kreuz-As - Kreuz-Bube - Michael Stich - Mittelhand - Pik-As - Pik-Bube - Pik-König bedient - folgenden - gereizt - Karo-Buben - Karo-Dame - Karo-König - Karte - Karten - Kreuz-Ass - Kreuz-Dame - Kreuz-Hand - Kreuz-König - legt - Mittelhand - Null ouvert - Pik - Pik-Ass - Pik-Dame - schmiert - Skat - spielt - Spielverlauf - sticht - übernimmt - zieht - Agassi - Australian Open - Bindewald - Boris - Break - Chang - Dickhaut - gewann - Ivanisevic - Kafelnikow - Kiefer - Komljenovic - Leimen - Matchball - Michael Stich - Monte Carlo - Prinosil - Sieg - Spiel - spielen - Steeb - Teamchef Stich
28 Detection of X-onyms
synonyms, antonyms, (co-)hyponyms...
Idea: the intersection of the co-occurrence sets of two X-onyms used as reference words should contain further X-onyms; lexical ambiguity of one reference word does not degrade the result set
Method:
- detect the word class of the reference words
- calculate co-occurrences for the reference words
- filter the co-occurrences with respect to the word class of the reference words (by means of POS tags)
- intersect the co-occurrence sets
- output the result; ranking can be realized over the significance values of the co-occurrences
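The method above can be sketched in a few lines, with `cooc` (word -> {co-occurrent: significance}) and `pos` (word -> POS tag) as toy stand-ins for the co-occurrence database and the tagger:

```python
def xonym_candidates(ref1, ref2, cooc, pos):
    """X-onym detection sketch: intersect the co-occurrence sets of two
    reference words, keep only candidates sharing the reference words'
    word class, and rank by summed co-occurrence significance."""
    wc = pos[ref1]  # assumed equal for both reference words
    common = set(cooc[ref1]) & set(cooc[ref2])
    hits = [w for w in common if pos.get(w) == wc]
    return sorted(hits, key=lambda w: -(cooc[ref1][w] + cooc[ref2][w]))
```

For the reference pair ("big", "large"), a shared adjectival co-occurrent like "huge" survives both the intersection and the POS filter, while a shared noun such as "house" is filtered out.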
29 Mini-Evaluation
Experiments for different data sources, NB-collocations of 2nd and 3rd order
- fraction of X-onyms in TOP 5 higher than in TOP 10: the ranking method makes sense
- intersection of 2nd-order and 3rd-order co-occurrences almost always empty: different orders exhibit different relations
- satisfactory quantity, more through larger corpora
- quality: not precise enough for unsupervised extension
31 More Examples in English Intersection of N2-Order co-occurrence sets
32 Dictionary Acquisition Using Parallel Text and Co-occurrence Statistics
Given: certain amounts of sentence-aligned parallel text
Not available: morphology, grammar, semantic etc. information; string similarity for cognates; a bilingual dictionary
Wanted: bilingual dictionaries; alignment on the word level
Biemann, C. and Quasthoff, U. (2005): Dictionary Acquisition Using Parallel Text and Co-occurrence Statistics, Proceedings of NODALIDA-05, Joensuu, Finland (to appear)
33 Trans-co-occurrences
Translingual co-occurrences
'Normal' co-occurrences: calculation is performed on a sentence basis; co-occurrents are found frequently together in sentences
Trans-co-occurrences: calculation is performed on bilingual sentence pairs; co-occurrents are found frequently together in bilingual sentence pairs
Hypothesis: significant co-occurrences between words of different languages (= trans-co-occurrences) are translation equivalents
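Counting joint events over sentence pairs can be sketched as below; the counts would then be fed into the Poisson significance measure of slide 11, with n being the number of sentence pairs. Whitespace tokenisation is a simplification of this sketch:

```python
from collections import Counter

def trans_cooc(pairs):
    """Raw counts for trans-co-occurrences: every (source word,
    target word) combination drawn from the same aligned bilingual
    sentence pair counts as one joint occurrence.
    `pairs` is a list of (source sentence, target sentence) strings."""
    joint = Counter()
    for src_sent, tgt_sent in pairs:
        for s in set(src_sent.split()):   # each type once per sentence
            for t in set(tgt_sent.split()):
                joint[(s, t)] += 1
    return joint
```

Under the slide's hypothesis, pairs with significantly high counts, such as ("hund", "dog"), are translation-equivalent candidates.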
36 Evaluation for German-English on Europarl & dict.uni-chemnitz
37 Comparison with [Sahlgren 2004] Sahlgren, M. (2004): Automatic Bilingual Lexicon Acquisition Using Random Indexing of Aligned Bilingual Data, Proceedings of LREC-2004, Lisboa, Portugal
38 Alignment
Given: a bilingual sentence pair
Wanted: which word corresponds to which?
Method: scan sentence 1 word by word and link each word to the highest-ranked word among its trans-co-occurrences that can be found in sentence 2.
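This greedy procedure is easy to sketch, with `trans` (source word -> target candidates, ranked best-first) standing in for the trans-co-occurrence database:

```python
def align(src_words, tgt_words, trans):
    """Greedy word alignment per the slide: each word of sentence 1 is
    linked to the highest-ranked of its trans-co-occurrents that actually
    appears in sentence 2; words with no matching candidate stay
    unaligned (None), like the red words in the examples."""
    links = {}
    for s in src_words:
        for t in trans.get(s, []):  # candidates in rank order
            if t in tgt_words:
                links[s] = t
                break
        else:
            links[s] = None  # no alignment found
    return links
```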
39 Alignment: Example 1
Red words: no alignment. Blue arrows: errors. Arrow index: rank in the trans-co-occurrences.
Die Landwirtschaft stellt nur 5,5 % der Arbeitsplätze der Union.
Agriculture only provides 5.5 % of employment in (the Union).
40 Alignment: Example 2
Grey arrows: multiple alignments for frequent words.
Indem wir den Mitgliedstaaten für die Umsetzung der Richtlinie kein spezifisches Datum setzen, / By not setting a specific date (for the) Member States (to implement) the directive
sondern ihnen einen Zeitraum von drei Monaten nach Inkrafttreten der Richtlinie zugestehen, / and instead giving them a period of three months after its (entry into force),
führen wir eine Flexibilitätsklausel ein, / we are introducing a flexibility clause
die eine unverzügliche Umsetzung gewährleistet. / which ensures that the directive will be implemented without delay.
41 Manual Evaluation
On random samples of 1000 words. Better results:
- no domain-dependent deficiency of the dictionary
- no problems with inflection
42 Nextlinks and Findlinks
Applying co-occurrence statistics to co-citation web graphs
Co-occurrence of words: words that appear together in sentences more often than to be expected
Co-occurrence of (external) links on websites: links that appear together on websites more often than to be expected
This is called co-citation; in the co-citation web graph, websites are nodes, and an edge means that two websites are linked on another website.
Heyer, G. and Quasthoff, U. (2004): Calculating Communities by Link Analysis of URLs. Proceedings of IICS-04, Guadalajara, Mexico; Springer LNCS 3473, pp. 151-156, Springer-Verlag Berlin Heidelberg
43 External Links
External links point to a different web server; internal links serve navigation inside a web site.
External links are explicit associations: they refer to something external, are usually optional to follow, and are usually hand-made and elaborate.
In contrast, internal links are often structural links, usually machine-generated for navigation.
About 10% of web pages have more than one external link.
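Operationally, the external/internal distinction reduces to a host comparison; a minimal sketch using Python's standard urllib.parse (equating "web server" with the URL's network location is a simplification):

```python
from urllib.parse import urlparse

def external_links(page_url, links):
    """Keep only the links on a page that point to a different web
    server than the page itself - the external links in the slide's
    sense, which feed the co-citation graph."""
    host = urlparse(page_url).netloc
    return {l for l in links if urlparse(l).netloc != host}
```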
44 Application: NextLinks Surf Guide
Input: the current URL shown in the browser, e.g. www.uio.no or www.leipzig.de
Output: the top 10 co-occurrences of the input URL - often the most popular websites of similar interest.
Download it: http://wortschatz.uni-leipzig.de/nextlinks
45 The Crawling Problem
Crawling the Web: today, Google knows 8,168,684,336 web pages. Crawling them in 30 days means approx. 270 million pages/day (about 3,150 pages/sec).
Observation: crawling yields a lot of input data, but extracting only the links produces comparatively little output.
Hence, we try distributed crawling!
46 FindLinks: Distributed Client At the moment, we receive 500,000 pages/day per broadband client. Hence, we need about 600 clients for the whole web. Help us and download the client: http://wortschatz.uni-leipzig.de/nextlinks/findlinks.html
49 Abstract
The goal of the Wortschatz Project (University of Leipzig) is to process and provide large, unannotated corpora for a variety of languages. The focus is on language-independent methods that enrich these plain-text corpora with structure, without using manually developed resources or language-dependent preprocessing. Building mainly on an efficient implementation of co-occurrence statistics, the approaches for acquiring knowledge from text range from word sense discrimination through trend mining and time-series analysis to thesaurus expansion and bilingual dictionary acquisition. Finally, the framework is applied to web graph analysis.