An Application-Oriented View of Automatic Tagging and Information Extraction
Katharina Morik
Technische Universität Dortmund, Fakultät für Informatik, LS 8


Overview
- Handling texts: overview
- Mark-up languages
  - Services based on annotated texts
- Automatic tagging
  - From layout information to tags
  - Named entity recognition
- Data-intensive approach
  - Counting in very large unlabeled corpus
  - Turning frequencies into features
  - Compiling sequences into features

Handling Texts
- Granularity: hypertext structure, text, paragraph, word, letters
- Learning mode: batch, incremental
- Learning goal: adapted organization, class or clustering, syntactic or semantic structures
- Application tasks: personalization, optimization of information access, integration in business processes, reporting

Related work by learning goal and granularity (cells in slide order)
Granularity columns: Hypertext | Text | Paragraph | Word
Adaptation:     Alesker, Joachims, Neifach | Veltmann | Hüppe, Mintert, Thomas | Helbig
Extraction:     Rössler
Clustering:     Schewe, Wurst
Classification: Joachims, Klinkenberg
Callout: this talk

Intelligent Publishing Using Mark-Ups
(Section: Mark-up languages)
- Search qualified by semantic category
- Self-contained parts of text (atoms) as search result
- Composition of one's own text
- Presentation according to semantic category
IP4W3 system by Stefan Mintert, 1999
Diagram labels: Query: category + word; Webserver; Result: list of atoms; Text; Selection; Search; Composition; User; Presentation

Qualified Search [figure]

Presentation of Results [figure]

Text Composition [figure: selected results from 2 queries combined]

Applications: e-Learning, e-Publishing
- Intelligent publication on the web: users customize the material to their own needs. "IP4W3", Stefan Mintert 1999, Dortmund
- Course material for different groups: from the central repository of presentations or texts, courses are designed for special interests. "Slicing Books", Ingo Dahm 2001, Koblenz-Landau
- Additional sequence information allows courses to be tailored to learning types, e.g. top-down (from definition to application) or bottom-up (from application to definition). Moritz Thomas 1999, Dortmund

Behind the Curtain
- Mark-up editor
- Editor for defining qualified search
Sample text fragment on the slide: "fits Characters are the atomic unit of texts according to ISO/IEC 10646. …"
Diagram labels: DTD/Schema; Webserver; Search patterns; Style sheets; Administrator; Author; Annotated text
Bottleneck!

Automatic Tagging
- WISDOM++ (Univ. Bari): from scanned texts to blocks to XML tags; classification of blocks by C4.5. Altamura, Esposito, Malerba 2000
- ADT (Univ. Dortmund): from RTF annotation to XML tags; classification by C4.5. Christian Hüppe 2003

ADT: Input Document [figure]

ADT: Manual Annotation of Examples [figure]

ADT: Attributes of Examples
- RTF control words
  - Presence of control word in current and preceding paragraph
    - ff: neither in this nor in the preceding paragraph
    - ft: not in this but in the preceding paragraph
    - tf: in this but not in the preceding paragraph
    - tt: in both this and the preceding paragraph
- Value of indentation in current and preceding paragraph
- First and second word of paragraph
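
The two-letter presence codes can be illustrated with a minimal sketch. It is not ADT's actual implementation: the function names, the sample paragraphs, and the regular expression (a deliberate simplification of RTF control-word syntax that ignores escaped backslashes and numeric parameters) are assumptions made for illustration.

```python
import re

# Simplified pattern: a backslash followed by lowercase letters.
# Real RTF also has numeric parameters and escapes; this is an assumption.
CONTROL_WORD = re.compile(r"\\([a-z]+)")

def control_words(rtf_paragraph: str) -> set[str]:
    """Return the set of control-word names found in one RTF paragraph."""
    return set(CONTROL_WORD.findall(rtf_paragraph))

def presence_code(word: str, current: set[str], preceding: set[str]) -> str:
    """First letter: present in this paragraph; second: in the preceding one."""
    return ("t" if word in current else "f") + ("t" if word in preceding else "f")

def presence_features(current_par: str, preceding_par: str, vocabulary: list[str]) -> dict[str, str]:
    """Map every control word of interest to one of ff, ft, tf, tt."""
    cur = control_words(current_par)
    prev = control_words(preceding_par)
    return {w: presence_code(w, cur, prev) for w in vocabulary}

# Hypothetical paragraphs: a bold heading followed by an indented body paragraph.
preceding = r"\pard\b Introduction\par"
current = r"\pard\li720 Some text\par"
features = presence_features(current, preceding, ["b", "li", "pard", "i"])
print(features)
```

Each control word becomes one nominal attribute with four possible values, which suits a decision-tree learner such as C4.5.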

ADT: Learning [figure]

ADT: Classification of Paragraphs
9 classes (tags), 159 paragraphs

No. of examples for each class | F-measure
1                              | 41 %
2                              | 94 %
3                              | 98.3 %
4                              | 99.68 %

Application Options
- Combination of ADT and IP4W3 offers qualified search, tailored courseware, and enhanced e-learning without tedious annotation by the author or administrator.
- Semantic information within paragraphs cannot be captured.
Named entity recognition necessary!

Named Entity Recognition
- Classification of single words into given semantic categories (e.g., person, location, date).
- A phrase of the category is a sequence of words with the same label.
- Features of a word:
  - Linguistic features (e.g., part of speech)
  - Letters (e.g., beginning with upper case letter)
  - Word length
  - N-grams
- Knowledge-intensive vs. data-intensive approaches:
  - Linguistic rules
  - Examples
  - Unlabeled text (corpus)
- Training time, classification time: size of training and test sets

The Task
- Biomedical task on 22 000 word forms (JNLPBA)
  - 472 000 labeled occurrences for training
  - 54 173 occurrences for testing
  - 100 million word forms from Medline as background
- German corpus, 33 000 word forms (CoNLL)
  - 220 189 labeled occurrences for training
  - 54 173 occurrences for testing
  - 40 million word forms from Frankfurter Rundschau as background
- Fast learning and classification necessary!

Data-intensive Approach (Marc Rössler)
- Knowledge-poor:
  - No linguistic knowledge
  - No given word lists
  - No hand-written rules
- Use of very large given corpora:
  - Distribution of word occurrence in corpus
  - Frequencies of words
  - Frequencies of word sequences
- Bootstrapping of features:
  1. Learn classifiers from examples
  2. Apply classifiers to unlabeled corpus
  3. Extract features from now labeled corpus, enhance examples
  4. Learn classifiers from enhanced examples
Diagram label: Stop
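
The four bootstrapping steps can be sketched as a generic loop. This is only an outline under stated assumptions: the function `bootstrap`, its parameters, and all `toy_*` helpers are hypothetical stand-ins (the talk uses SVMs and much richer features), chosen so that the example runs on its own.

```python
from collections import Counter, defaultdict

def bootstrap(words, labels, corpus, base_features, learn, corpus_features, rounds=1):
    """Generic bootstrapping loop; all callables are supplied by the caller."""
    stats = {}  # word -> corpus-derived features (empty before the first round)

    def describe(word):
        # Base features plus whatever the corpus has contributed so far.
        return {**base_features(word), **stats.get(word, {})}

    classifier = learn([describe(w) for w in words], labels)        # step 1
    for _ in range(rounds):
        tagged = [(w, classifier(describe(w))) for w in corpus]      # step 2
        stats = corpus_features(tagged)                              # step 3
        classifier = learn([describe(w) for w in words], labels)     # step 4
    return classifier, stats

# --- Toy stand-ins (assumptions, not the components used in the talk) ---
def toy_base_features(word):
    return {"cap": word[0].isupper()}

def toy_learn(X, y):
    # Majority label per value of the "cap" feature; ignores all other features.
    votes = defaultdict(Counter)
    for x, label in zip(X, y):
        votes[x["cap"]][label] += 1
    table = {cap: c.most_common(1)[0][0] for cap, c in votes.items()}
    return lambda x: table.get(x["cap"], "O")

def toy_corpus_features(tagged):
    # Share of a word's corpus occurrences that received each label.
    total, per_label = Counter(), defaultdict(Counter)
    for word, label in tagged:
        total[word] += 1
        per_label[word][label] += 1
    return {w: {f"ratio_{c}": n / total[w] for c, n in cs.items()}
            for w, cs in per_label.items()}

classifier, stats = bootstrap(
    words=["Peter", "schnell"], labels=["PER", "O"],
    corpus=["Müller", "hat", "Müller"],
    base_features=toy_base_features, learn=toy_learn,
    corpus_features=toy_corpus_features)
print(stats)
```

Keeping the learner and the feature extractors as parameters mirrors the separation between classifier training and corpus-based feature extraction described on the slides.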

The Base Classifier: Input
- Features:
  - 1 out of 30 word surface features (e.g., 4-digit number, uppercase only, starting with capital letter)
  - Word length
  - Positional substrings (at most 8), shown for "Konkurrenz":
    - Last character: z
    - Last two characters: nz
    - Last three characters: enz
    - First trigram: Kon
    - Second trigram: onk
    - …
    - Fifth trigram: urr
- Window of 3 preceding and 2 succeeding words: Ebenso schnell hat Peter Müllers Konkurrenz
- Vector of 60 features for each occurrence
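
The per-word features can be sketched as follows. The function names are hypothetical, `surface_feature` covers only the three surface classes named on the slide (the remaining ones are not listed there), and the sketch assumes that the elided entries are the third and fourth trigram, which would give the stated maximum of 8 substrings.

```python
def surface_feature(word: str) -> str:
    """Pick one surface class; only the three classes named on the slide are modelled."""
    if len(word) == 4 and word.isdigit():
        return "four_digit_number"
    if word.isupper():
        return "uppercase_only"
    if word[:1].isupper():
        return "capitalized"
    return "other"  # placeholder for the classes not listed on the slide

def positional_substrings(word: str) -> dict[str, str]:
    """Suffixes of length 1 to 3 and the first five character trigrams."""
    feats = {}
    for n in (1, 2, 3):
        if len(word) >= n:
            feats[f"suffix{n}"] = word[-n:]
    for i in range(5):
        trigram = word[i:i + 3]
        if len(trigram) == 3:  # short words simply yield fewer features
            feats[f"trigram{i + 1}"] = trigram
    return feats

def word_features(word: str) -> dict[str, object]:
    """Surface class, word length, and positional substrings for one word."""
    return {"surface": surface_feature(word), "length": len(word),
            **positional_substrings(word)}

konkurrenz = word_features("Konkurrenz")
print(konkurrenz)
```

Under these assumptions a word has at most 10 features (1 surface class, 1 length, 8 substrings). With 6 window positions (3 preceding, the focus, 2 succeeding) that would be consistent with the 60 features per occurrence stated on the slide, although the slide does not spell out this breakdown.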

The Base Classifier: Output
- A classifier f_c is trained for each category c against all others
- A classifier f_NE is trained for "is a NE" vs. "is no NE"
- Tagging the focus of the sliding window according to: [formula not preserved in the transcript]
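
How the labeled occurrences might be split into one binary training set per classifier can be sketched as below. The function name, the label "O" for words outside any entity, and the +1/-1 encoding are assumptions; the rule that combines the classifier outputs into a tag is not preserved in the transcript and is therefore not modelled.

```python
def binary_tasks(labels, categories, outside="O"):
    """Build one-vs-rest targets per category plus an 'is a NE' target.

    +1 marks a positive example, -1 a negative one (a common SVM convention).
    """
    tasks = {c: [1 if y == c else -1 for y in labels] for c in categories}
    tasks["NE"] = [1 if y != outside else -1 for y in labels]
    return tasks

# Hypothetical label sequence for five word occurrences.
labels = ["O", "PER", "PER", "O", "LOC"]
tasks = binary_tasks(labels, categories=["PER", "LOC", "ORG"])
print(tasks["PER"], tasks["NE"])
```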

Corpus-based Features: Internal Evidence
Applying the base classifier to the corpus results in new features.
- Membership frequencies (how often a word v was seen as a member of the category c), where v is the token described by [formula not preserved in the transcript]
- All f_c > 0 become a feature with the ratio as value.
Example: [not preserved in the transcript]
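
A minimal sketch of membership frequencies, assuming the ratio is the number of times a word was tagged with category c divided by the number of its occurrences in the automatically tagged corpus. The exact definition is not preserved in the transcript; the function name, the "O" label, and the sample data are hypothetical.

```python
from collections import Counter, defaultdict

def membership_frequencies(tagged_corpus, outside="O"):
    """Map word -> {category: share of the word's occurrences tagged with it}.

    Only categories seen at least once for a word appear, i.e. ratios > 0.
    """
    total = Counter()
    as_member = defaultdict(Counter)
    for word, label in tagged_corpus:
        total[word] += 1
        if label != outside:
            as_member[word][label] += 1
    return {word: {c: n / total[word] for c, n in cats.items()}
            for word, cats in as_member.items()}

# Hypothetical output of the base classifier on a small corpus.
tagged = [("Essen", "LOC"), ("Essen", "O"), ("Essen", "LOC"),
          ("Essen", "ORG"), ("hat", "O")]
freq = membership_frequencies(tagged)
print(freq)
```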

Sequences: Windows
- The sequences of words with the same label are considered one token within the sliding window.
Example (window positions relative to the focus in brackets):
  Ebenso (3)  schnell (2)  hat (1)  [Peter Müllers]  Konkurrenz (1)  die (2)
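
Merging same-label runs into one window token can be sketched as follows. The function names, the "O" label for words outside any entity, and the PER label on "Peter Müllers" are assumptions for illustration. Words labeled "O" are left unmerged, and two adjacent entities of the same category would be merged into one, which a real system may need to handle.

```python
from itertools import groupby

def merge_sequences(tokens, labels, outside="O"):
    """Collapse each run of identically labeled entity words into one token."""
    merged = []
    for label, group in groupby(zip(tokens, labels), key=lambda pair: pair[1]):
        words = [word for word, _ in group]
        if label == outside:
            merged.extend((word, outside) for word in words)  # keep non-entities apart
        else:
            merged.append((" ".join(words), label))
    return merged

def window(merged, focus, before=3, after=2):
    """Return (preceding tokens, focus token, succeeding tokens)."""
    left = merged[max(0, focus - before):focus]
    right = merged[focus + 1:focus + 1 + after]
    return left, merged[focus], right

tokens = "Ebenso schnell hat Peter Müllers Konkurrenz die".split()
labels = ["O", "O", "O", "PER", "PER", "O", "O"]  # assumed labeling
merged = merge_sequences(tokens, labels)
left, focus, right = window(merged, focus=3)
print(left, focus, right)
```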

Compiling Sequences into Features
- Membership frequencies (how often a word v was seen as the first (last) in a sequence labeled c): internal evidence
- Membership frequencies become new features.
Example: [not preserved in the transcript]

Corpus-based Features: External Evidence
- Context frequencies (how often a sequence seq_c was preceded or succeeded by certain words)
- A sequence s preceding seq_c is written seq_preC
- Contexts with relative frequency > 0.01 become features of the preceding words in the sliding window
Example: [not preserved in the transcript]
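
A minimal sketch of context frequencies for preceding words. It assumes that the relative frequency is the count of a word directly before a sequence of category c divided by the number of such sequences, and it looks only at the immediately preceding word. Both are simplifying assumptions; the function name and sample data are hypothetical.

```python
from collections import Counter

def preceding_contexts(merged, category, threshold=0.01):
    """Relative frequency of the word directly before each sequence of `category`.

    `merged` is a list of (token, label) pairs in which every entity sequence
    has already been collapsed into a single token.
    """
    counts = Counter()
    n_sequences = 0
    for i, (_, label) in enumerate(merged):
        if label == category:
            n_sequences += 1
            if i > 0:
                counts[merged[i - 1][0]] += 1
    if n_sequences == 0:
        return {}
    return {word: n / n_sequences for word, n in counts.items()
            if n / n_sequences > threshold}

# Hypothetical tagged and merged corpus.
merged_corpus = [("Herr", "O"), ("Peter Müller", "PER"), ("und", "O"),
                 ("Frau", "O"), ("Anna Schmidt", "PER"),
                 ("Herr", "O"), ("Meier", "PER")]
contexts = preceding_contexts(merged_corpus, "PER")
frequent = preceding_contexts(merged_corpus, "PER", threshold=0.5)
print(contexts)
```

Succeeding contexts could be collected the same way by looking at the token after each sequence.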

Enhanced Features
- Based on the tagging of the unlabeled corpus by the base classifiers, features are extracted:
  - Internal evidence:
    - f_c × intervals
    - First/last in seq_c
  - External evidence:
    - First, second, third in seq_preC
    - First/second in seq_sucC
- Training is again performed using the enriched feature set.
- Tagging is enhanced by max(length(seq_i)) (read again)

Experiments
- Does the sequence in window focus enhance the learning result?
- Does the use of an unlabeled background corpus enhance learning results?
- How is the enhancement per round? How many rounds are necessary?
- Is the knowledge-poor approach compatible with approaches using linguistic knowledge?
- Would a Hidden Markov Model be better?

Does the sequence in window focus enhance the learning result?

                | Instances in training/test | F-measure LOC | F-measure PER | F-measure ORG | Overall precision | Overall recall | Overall F-measure
Regular N-grams | 101 810 / 25 909           | 50.6          | 42.67         | 45.38         | 69.82             | 34.18          | 45.9
Sequences       | 113 245 / 30 792           | 52.68         | 44.19         | 49.1          | 89.72             | 33.13          | 48.39

Yes.
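
The overall F-measure values are consistent with the harmonic mean of overall precision and recall, which the following short check illustrates. The function name is a hypothetical choice and the balanced F1 form is assumed; the input numbers are the ones reported on the slide.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Overall precision and recall as reported on the slide (in percent).
rows = {"Regular N-grams": (69.82, 34.18), "Sequences": (89.72, 33.13)}
scores = {name: round(f_measure(p, r), 2) for name, (p, r) in rows.items()}
print(scores)
```

The computed values are about 45.89 and 48.39, matching the reported 45.9 and 48.39. The sequence variant gains precision at nearly unchanged recall.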

Does the use of unlabeled background corpus enhance learning results?

                                          | F-measure LOC | F-measure PER | F-measure ORG | Overall precision | Overall recall | Overall F-measure
No use of corpus, sequences               | 52.68         | 44.19         | 49.10         | 89.72             | 33.13          | 48.39
Corpus for internal and external evidence | 75.04         | 91.09         | 65.36         | 83.69             | 73.82          | 78.44

Yes.

How is the enhancement per round? How many rounds are necessary? [figure]

Number of Support Vectors [figure]

Is the knowledge-poor approach compatible with approaches using linguistic knowledge?

Author                           | F-measure LOC | F-measure PER | F-measure ORG
Volk, Clematide 2001             | 85.7          | 88.9          | 78.4
Neumann, Piskorski 2002          | 81.1          | 88.0          | 79.4
Florian et al. 2003 (best CoNLL) | 77.71         | 83.57         | 71.08
Rössler                          | 75.94         | 91.09         | 65.36

Hmm …

Would a Hidden Markov Model be better?
No, but turning its classification into a feature helps the SVM!

Summary of Rössler's Approach
- System consists of 3 components
  - SVM training: fast in large, sparse vector space
  - Feature extraction from large corpora: fast automatic adaptation to new domain
  - The outer loop
    - Splitting instances of an m-class learning problem into m-1 binary problems
    - Tagging using a voting mechanism
    - Enhancing examples by extracted features
- The feature approach easily integrates linguistic knowledge or predictions of other learners, if given.
- The data-driven approach is language independent.
- Results are compatible with knowledge-based approaches.

Conclusion
- Tagged data allow for enhanced services.
- Automatic tagging of paragraphs or tables can easily be done using very few examples in an interactive, incremental way.
- Named entity recognition for automatic tagging remains a challenge.
