Published by: Irmgard Worl. Modified over 9 years ago.
1
An Application-Oriented View of Automatic Tagging and Information Extraction
Katharina Morik
Technische Universität Dortmund, Fakultät für Informatik, LS 8
2
Overview
- Handling texts: overview
- Mark-up languages
  - Services based on annotated texts
- Automatic tagging
  - From lay-out information to tags
  - Named entity recognition
- Data-intensive approach
  - Counting in very large unlabeled corpus
  - Turning frequencies into features
  - Compiling sequences into features
3
Handling Texts
- Granularity: hypertext structure, text, paragraph, word, letters
- Learning mode: batch, incremental
- Learning goal: adapted organization, class or clustering, syntactic or semantic structures
- Application tasks: personalization, optimization of information access, integration in business processes, reporting

Table of work by task (rows) and granularity (columns: Hypertext, Text, Paragraph, Word); name groups are listed in slide order:
- Adaptation: Alesker, Joachims, Neifach; Veltmann; Hüppe, Mintert, Thomas; Helbig
- Extraction: Rössler
- Clustering: Schewe, Wurst
- Classification: Joachims, Klinkenberg
Legend: this talk
4
Intelligent Publishing Using Mark-Ups
- Search qualified by semantic category
- Self-contained parts of text (atoms) as search result
- Composition of one's own text
- Presentation according to semantic category
IP4W3 system by Stefan Mintert, 1999
[Diagram labels: User, Webserver, Text; Query: category + word; Result: list of atoms; Search, Selection, Composition, Presentation]
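A minimal sketch of a query of the form "category + word" that returns a list of atoms. It assumes the annotated text is XML in which each element is one atom and the element name is its semantic category. The sample document, the tag names, and the function `qualified_search` are hypothetical illustrations and not the IP4W3 implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotated text: each child element is one atom,
# and its tag name is the semantic category.
SAMPLE = """
<text>
  <definition>A classifier maps a feature vector to a label.</definition>
  <example>A decision tree is one kind of classifier.</example>
  <definition>A corpus is a collection of texts.</definition>
</text>
"""

def qualified_search(xml_source, category, word):
    """Return the atoms of the given category that contain the word."""
    root = ET.fromstring(xml_source)
    needle = word.lower()
    hits = []
    for atom in root.iter(category):
        content = "".join(atom.itertext()).strip()
        if needle in content.lower():
            hits.append(content)
    return hits

# Query: category "definition" + word "classifier".
results = qualified_search(SAMPLE, "definition", "classifier")
```

Restricting the match to one category is what separates this from plain full-text search: the word "classifier" also occurs in an `example` atom, which this query does not return.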
5
Qualified Search
6
Presentation of Results
7
Text Composition
Selected results from 2 queries combined
8
Applications: e-Learning, e-Publishing
- Intelligent publication on the web: users customize the material to their own needs. "IP4W3", Stefan Mintert 1999, Dortmund
- Course material for different groups: from the central repository of presentations or texts, courses are designed for special interests. "Slicing Books", Ingo Dahm 2001, Koblenz-Landau
- Additional sequence information allows courses to be tailored to learning types, e.g.
  - top-down: from definition to application
  - bottom-up: from application to definition
  Moritz Thomas 1999, Dortmund
9
Behind the Curtain
- Mark-up editor
- Editor for defining qualified search fits
- Text shown on the slide: "Characters are the atomic unit of texts according to ISO/IEC 10646. …"
[Diagram labels: Author, Administrator, DTD/Schema, Annotated text, Search patterns, Style sheets, Webserver]
Bottleneck!
10
Automatic Tagging
- WISDOM++ (Univ. Bari): from scanned texts to blocks to XML tags; classification of blocks by C4.5. Altamura, Esposito, Malerba 2000
- ADT (Univ. Dortmund): from RTF annotation to XML tags; classification by C4.5. Christian Hüppe 2003
11
ADT: Input Document
12
ADT: Manual Annotation of Examples
13
ADT: Attributes of Examples
- RTF control words
- Presence of a control word in the current and the preceding paragraph:
  - ff: neither in this nor in the preceding paragraph
  - ft: not in this but in the preceding paragraph
  - tf: in this but not in the preceding paragraph
  - tt: in both this and the preceding paragraph
- Value of indentation in the current and the preceding paragraph
- First and second word of the paragraph
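The two-letter presence attribute can be sketched as follows. The function names and the sample control-word sets are assumptions made for illustration; how ADT treats the first paragraph of a document is not stated on the slide, so comparing it against an empty predecessor is also an assumption.

```python
def presence_code(control_word, current_words, preceding_words):
    """Encode presence of an RTF control word as ff, ft, tf, or tt.

    First letter: presence in the current paragraph.
    Second letter: presence in the preceding paragraph.
    """
    in_current = "t" if control_word in current_words else "f"
    in_preceding = "t" if control_word in preceding_words else "f"
    return in_current + in_preceding

def paragraph_attributes(paragraphs, vocabulary):
    """Build one attribute dict per paragraph.

    `paragraphs` is a list of sets of control words, in document order.
    Assumption: the first paragraph is compared with an empty predecessor.
    """
    rows = []
    previous = set()
    for current in paragraphs:
        rows.append({cw: presence_code(cw, current, previous) for cw in vocabulary})
        previous = current
    return rows

# Hypothetical control-word sets for three consecutive paragraphs.
paragraphs = [{"\\b", "\\qc"}, {"\\b"}, set()]
rows = paragraph_attributes(paragraphs, ["\\b", "\\qc"])
```

Each row can then be extended with the indentation values and the first two words of the paragraph before it is handed to a decision-tree learner.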
14
ADT: Learning
15
ADT: Classification of Paragraphs
9 classes (tags), 159 paragraphs

No. of examples for each class   F-measure
1                                41 %
2                                94 %
3                                98.3 %
4                                99.68 %
16
Application Options
- Combining ADT and IP4W3 offers qualified search, tailored courseware, and enhanced e-learning without tedious annotation by the author or administrator.
- Semantic information within paragraphs cannot be captured.
- Named entity recognition necessary!
17
Named Entity Recognition
- Classification of single words into given semantic categories (e.g., person, location, date). A phrase of a category is a sequence of words with the same label.
- Features of a word:
  - Linguistic features (e.g., part of speech)
  - Letters (e.g., beginning with an upper-case letter)
  - Word length
  - N-grams
- Knowledge-intensive vs. data-intensive approaches:
  - Linguistic rules
  - Examples
  - Unlabeled text (corpus)
- Training time, classification time; size of training and test sets
18
The Task
- Biomedical task on 22 000 word forms (JNLPBA)
  - 472 000 labeled occurrences for training
  - 54 173 occurrences for testing
  - 100 million word forms from Medline as background
- German corpus, 33 000 word forms (CoNLL)
  - 220 189 labeled occurrences for training
  - 54 173 occurrences for testing
  - 40 million word forms from Frankfurter Rundschau as background
- Fast learning and classification necessary!
19
Data-intensive Approach: Marc Rössler
- Knowledge-poor:
  - No linguistic knowledge
  - No given word lists
  - No hand-written rules
- Use of very large given corpora:
  - Distribution of word occurrence in corpus
  - Frequencies of words
  - Frequencies of word sequences
- Bootstrapping of features:
  1. Learn classifiers from examples
  2. Apply classifiers to unlabeled corpus
  3. Extract features from now labeled corpus, enhance examples
  4. Learn classifiers from enhanced examples
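The four bootstrapping steps can be sketched as a loop. This is a toy under stated assumptions, not Rössler's system: the learner is a majority-vote lookup table standing in for the SVM, the only surface feature is capitalization, and the only corpus-derived feature is the label most often assigned to a word. All function names are hypothetical.

```python
from collections import Counter, defaultdict

def train(examples):
    """Toy stand-in for the learner: majority label per feature tuple."""
    votes = defaultdict(Counter)
    for features, label in examples:
        votes[features][label] += 1
    table = {f: c.most_common(1)[0][0] for f, c in votes.items()}
    default = Counter(label for _, label in examples).most_common(1)[0][0]
    return lambda features: table.get(features, default)

def featurize(word, corpus_stats):
    """Surface feature plus one corpus-derived feature (None if unknown)."""
    return (word[0].isupper(), corpus_stats.get(word))

def bootstrap(labeled, corpus, rounds=1):
    stats = {}
    # 1. Learn classifiers from examples.
    classify = train([(featurize(w, stats), y) for w, y in labeled])
    for _ in range(rounds):
        # 2. Apply classifiers to the unlabeled corpus.
        tagged = [(w, classify(featurize(w, stats))) for w in corpus]
        # 3. Extract features from the now labeled corpus.
        counts = defaultdict(Counter)
        for w, y in tagged:
            counts[w][y] += 1
        stats = {w: c.most_common(1)[0][0] for w, c in counts.items()}
        # 4. Learn classifiers from the enhanced examples.
        classify = train([(featurize(w, stats), y) for w, y in labeled])
    return classify, stats

labeled = [("Peter", "PER"), ("hat", "O"), ("schnell", "O")]
corpus = ["Peter", "Müller", "hat", "Konkurrenz"]
classify, stats = bootstrap(labeled, corpus)
label = classify(featurize("Müller", stats))
```

The toy learner tags every capitalized word as PER, including the common noun "Konkurrenz". That is the kind of error richer features are meant to reduce.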
20
The Base Classifier: Input
Features:
- 1 out of 30 word surface features (e.g., 4-digit number, uppercase only, starting with capital letter)
- Word length
- Positional substrings (at most 8), shown for "Konkurrenz":
  - Last character: z
  - Before last and last character: nz
  - Last 3 characters: enz
  - First trigram: Kon
  - Second trigram: onk
  - …
  - Fifth trigram: urr
- Window of 3 preceding and 2 succeeding words: Ebenso schnell hat Peter Müllers Konkurrenz
Vector of 60 features for each occurrence
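A sketch of the positional substrings for a single word. The slide shows only the first, second, and fifth trigram; treating the third and fourth as the character trigrams in between is an assumption, as are the function name and the feature names.

```python
def positional_substrings(word):
    """Return up to 8 positional substrings of a word.

    Three suffixes (last 1, 2, 3 characters) and the first five
    character trigrams, read left to right. Shorter words yield fewer.
    """
    features = {}
    for n, name in ((1, "last_1"), (2, "last_2"), (3, "last_3")):
        if len(word) >= n:
            features[name] = word[-n:]
    for i in range(5):
        trigram = word[i:i + 3]
        if len(trigram) == 3:
            features[f"trigram_{i + 1}"] = trigram
    return features

features = positional_substrings("Konkurrenz")
```

For "Konkurrenz" this yields the values listed on the slide: `z`, `nz`, `enz`, `Kon`, `onk`, and `urr` as the fifth trigram.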
21
The Base Classifier: Output
- A classifier f_c is trained for each category c against all others.
- A classifier f_NE is trained for "is a NE" vs. "is no NE".
- Tagging the focus of the sliding window according to: [formula not preserved in the transcript]
22
Corpus-based Features: Internal Evidence
- Applying the base classifier to the corpus results in new features.
- Membership frequencies (how often a word v was seen as a member of the category c), where v is the token described by [symbol not preserved in the transcript].
- All f_c > 0 become a feature with the ratio as value.
- Example: [not preserved in the transcript]
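A sketch of membership frequencies computed from a corpus that the base classifier has already tagged. Reading "the ratio" as the share of a word's occurrences tagged with a category is an assumption, since the slide's formula is not preserved. The label "O" for "no named entity" and the function name are also assumptions.

```python
from collections import Counter, defaultdict

def membership_ratios(tagged_corpus):
    """Map each word to {category: share of its occurrences tagged with it}.

    `tagged_corpus` is a list of (word, category) pairs produced by the
    base classifier; category "O" stands for "no named entity".
    Only categories that were actually assigned appear in the result.
    """
    per_word = defaultdict(Counter)
    for word, category in tagged_corpus:
        per_word[word][category] += 1
    ratios = {}
    for word, counts in per_word.items():
        total = sum(counts.values())
        ratios[word] = {c: n / total for c, n in counts.items() if c != "O"}
    return ratios

# Hypothetical tagging result for a tiny corpus.
tagged = [("Dortmund", "LOC"), ("Dortmund", "LOC"), ("Dortmund", "ORG"),
          ("Dortmund", "O"), ("hat", "O")]
ratios = membership_ratios(tagged)
```

Each entry of `ratios[word]` would become one additional feature of every occurrence of that word in the next training round.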
23
Sequences: Windows
The sequences of words with the same label are considered one token within the sliding window.
[Figure: the example "Ebenso schnell hat Peter Müllers Konkurrenz die" with window positions 3 2 1 before and 1 2 after the focus; label P]
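A sketch of how same-label runs can be merged before the sliding window is cut. The labels given to the example sentence are hypothetical (the transcript preserves only a "P" marker), and the function names are illustrative.

```python
from itertools import groupby

def collapse_sequences(tagged, outside="O"):
    """Merge runs of adjacent words with the same entity label into one token.

    Words labeled `outside` are never merged.
    """
    tokens = []
    for label, run in groupby(tagged, key=lambda pair: pair[1]):
        words = [w for w, _ in run]
        if label == outside:
            tokens.extend((w, label) for w in words)
        else:
            tokens.append((" ".join(words), label))
    return tokens

def window(tokens, focus, before=3, after=2):
    """Return (preceding tokens, focus token, succeeding tokens)."""
    left = tokens[max(0, focus - before):focus]
    right = tokens[focus + 1:focus + 1 + after]
    return left, tokens[focus], right

# Hypothetical labels: "Peter Müllers" tagged as a person sequence.
tagged = [("Ebenso", "O"), ("schnell", "O"), ("hat", "O"),
          ("Peter", "PER"), ("Müllers", "PER"),
          ("Konkurrenz", "O"), ("die", "O")]
tokens = collapse_sequences(tagged)
left, focus, right = window(tokens, 3)
```

With "Peter Müllers" merged into one focus token, the two succeeding positions hold "Konkurrenz" and "die" instead of "Müllers" and "Konkurrenz", so the window covers more context.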
24
Compiling Sequences into Features
- Membership frequencies (how often a word v was seen as the first (last) in a sequence labeled c): internal evidence
- Membership frequencies become new features.
- Example: [not preserved in the transcript]
25
Corpus-based Features: External Evidence
- Context frequencies (how often a sequence seq_c was preceded or succeeded by certain words)
- A sequence s preceding seq_c is written seq_preC
- Contexts with relative frequency > 0.01 become features of the preceding words in the sliding window
- Example: [not preserved in the transcript]
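A sketch of context frequencies for the word directly preceding a labeled sequence. Two simplifications are assumptions: only one preceding word is counted (the slides use up to three), and the relative frequency is taken over all preceding contexts of the category. The function name and the sample tokens are hypothetical.

```python
from collections import Counter

def preceding_context_features(tokens, category, threshold=0.01):
    """Relative frequency of each word that directly precedes a sequence
    labeled `category`, keeping only contexts above `threshold`.

    `tokens` is a list of (text, label) pairs in which each labeled
    sequence has already been merged into a single token.
    """
    contexts = Counter()
    for i in range(1, len(tokens)):
        if tokens[i][1] == category:
            contexts[tokens[i - 1][0]] += 1
    total = sum(contexts.values())
    return {w: n / total for w, n in contexts.items() if n / total > threshold}

# Hypothetical tagged corpus with merged person sequences.
tokens = [("Herr", "O"), ("Peter Müller", "PER"), ("und", "O"),
          ("Frau", "O"), ("Anna Schmidt", "PER"), ("sagte", "O"),
          ("Herr", "O"), ("Karl Weber", "PER")]
features = preceding_context_features(tokens, "PER")
```

On a large corpus the 0.01 threshold drops rare, accidental contexts and keeps frequent cue words such as titles.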
26
Enhanced Features
Based on the tagging of the unlabeled corpus by the base classifiers, features are extracted:
- Internal evidence:
  - f_c X intervals
  - First/last in seq_c
- External evidence:
  - First, second, third in seq_preC
  - First/second in seq_sucC
Training is again performed using the enriched feature set.
Tagging is enhanced by max(length(seq_i)) (read again)
27
Experiments
- Does the sequence in window focus enhance the learning result?
- Does the use of unlabeled background corpus enhance learning results?
- How is the enhancement per round? How many rounds are necessary?
- Is the knowledge-poor approach compatible with approaches using linguistic knowledge?
- Would a Hidden Markov Model be better?
28
Does the sequence in window focus enhance the learning result?

Window focus      Instances (training/test)   F LOC   F PER   F ORG   Overall P   Overall R   Overall F
Regular n-grams   101 810 / 25 909            50.6    42.67   45.38   69.82       34.18       45.9
Sequences         113 245 / 30 792            52.68   44.19   49.1    89.72       33.13       48.39

Yes.
29
Does the use of unlabeled background corpus enhance learning results?

Setting                                     F LOC   F PER   F ORG   Overall P   Overall R   Overall F
No use of corpus, sequences                 52.68   44.19   49.10   89.72       33.13       48.39
Corpus for internal and external evidence   75.04   91.09   65.36   83.69       73.82       78.44

Yes.
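The overall F-measures in the two result tables can be checked against the reported overall precision and recall. The sketch assumes the balanced F-measure (harmonic mean of precision and recall); the slides do not state which variant was used. Under that assumption the recomputed values agree with the reported ones to within about 0.01.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (overall precision, overall recall, reported overall F-measure)
reported = {
    "regular n-grams": (69.82, 34.18, 45.9),
    "sequences, no corpus": (89.72, 33.13, 48.39),
    "corpus, internal and external evidence": (83.69, 73.82, 78.44),
}

recomputed = {name: f_measure(p, r) for name, (p, r, _) in reported.items()}
for name, (p, r, f) in reported.items():
    print(f"{name}: reported {f}, recomputed {recomputed[name]:.2f}")
```

The figures also show where the gain comes from: adding corpus evidence lowers overall precision from 89.72 to 83.69 but more than doubles recall, from 33.13 to 73.82.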
30
How is the enhancement per round? How many rounds are necessary?
[Figure not preserved in the transcript]
31
Number of Support Vectors
[Figure not preserved in the transcript]
32
Is the knowledge-poor approach compatible with approaches using linguistic knowledge?

Author                               F LOC   F PER   F ORG
Volk, Clematide 2001                 85.7    88.9    78.4
Neumann, Piskorski 2002              81.1    88.0    79.4
Florian et al. 2003 (best CoNLL)     77.71   83.57   71.08
Rössler                              75.94   91.09   65.36

Hmm …
33
Would a Hidden Markov Model be better?
No, but turning its classification into a feature helps the SVM!
34
Summary of Rössler's Approach
System consists of 3 components:
- SVM training: fast in large, sparse vector space
- Feature extraction from large corpora: fast automatic adaptation to new domain
- The outer loop:
  - Splitting instances of an m-class learning problem into m-1 binary problems
  - Tagging using a voting mechanism
  - Enhancing examples by extracted features
The feature approach easily integrates linguistic knowledge or predictions of other learners, if given.
The data-driven approach is language independent.
Results are compatible with knowledge-based approaches.
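A sketch of the splitting and voting steps of the outer loop. Two points are assumptions, because the slides do not spell them out: the m-1 problems are read as one binary problem per entity category, with the "no entity" class (here "O") getting none of its own; and the vote is read as "highest positive score wins, otherwise no entity". The function names are hypothetical.

```python
def split_into_binary_problems(examples, outside="O"):
    """Build one binary problem per category except the outside class.

    Each problem relabels every example as +1 (this category) or -1 (rest).
    """
    categories = sorted({y for _, y in examples if y != outside})
    return {c: [(x, 1 if y == c else -1) for x, y in examples]
            for c in categories}

def vote(scores, outside="O"):
    """Pick the category with the highest positive score, else `outside`.

    `scores` maps each category to the output of its binary classifier.
    """
    best = max(scores, key=scores.get, default=None)
    if best is None or scores[best] <= 0:
        return outside
    return best

# Three classes (PER, LOC, O) give two binary problems.
examples = [("Peter", "PER"), ("Dortmund", "LOC"), ("hat", "O")]
problems = split_into_binary_problems(examples)
tag = vote({"PER": 0.7, "LOC": -0.2})
```

Because every binary problem uses all examples, each classifier can be trained independently of the others.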
35
Conclusion
- Tagged data allow for enhanced services.
- Automatic tagging of paragraphs or tables can easily be done using very few examples in an interactive, incremental way.
- Named entity recognition for automatic tagging remains a challenge.