Technische universität dortmund Fakultät für Informatik LS 8 An Application-Oriented View of Automatic Tagging and Information Extraction Katharina Morik.


Presentation transcript:

1 technische universität dortmund Fakultät für Informatik LS 8 An Application-Oriented View of Automatic Tagging and Information Extraction Katharina Morik Name Autor | Ort und Datum

2 Overview
- Handling texts: overview
- Mark-up languages
- Services based on annotated texts
- Automatic tagging
- From lay-out information to tags
- Named entity recognition
- Data-intensive approach
- Counting in very large unlabeled corpus
- Turning frequencies into features
- Compiling sequences into features

3 Handling Texts
- Granularity: hypertext structure, text, paragraph, word, letters
- Learning mode: batch, incremental
- Learning goal: adapted organization, class or clustering, syntactic or semantic structures
- Application tasks: personalization, optimization of information access, integration in business processes, reporting
[Table: related work by task, with columns Hypertext, Text, Paragraph, Word; the column of each entry is not recoverable]
- Adaptation: Alesker, Joachims, Neifach; Veltmann; Hüppe, Mintert, Thomas; Helbig
- Extraction: Rössler
- Clustering: Schewe, Wurst
- Classification: Joachims, Klinkenberg
- One area is marked "this talk"

4 Intelligent Publishing Using Mark-Ups
- Search qualified by semantic category
- Self-contained parts of text (atoms) as search result
- Composition of one’s own text
- Presentation according to semantic category
IP4W3 System by Stefan Mintert, 1999
[Diagram labels: Query: category + word; Webserver; Result: list of atoms; Text; Selection; Search; Composition; User; Presentation]
Mark-up languages

5 Qualified search

6 Presentation of Results

7 Text Composition
Selected results from 2 queries combined

8 Applications e-Learning, e-Publishing
- Intelligent publication on the web: users customize the material to their own needs. “IP4W3”, Stefan Mintert 1999, Dortmund
- Course material for different groups: from the central repository of presentations or texts, courses are designed for special interests. “Slicing Books”, Ingo Dahm 2001, Koblenz-Landau
- Additional sequence information allows courses to be tailored to learning types, e.g. top-down from definition to application, or bottom-up from application to definition. Moritz Thomas 1999, Dortmund

9 Behind the Curtain
- Mark-up editor
- Editor for defining qualified search fits
Characters are the atomic unit of texts according to ISO/IEC …
[Diagram labels: DTD/Schema; Webserver; Search patterns; Style sheets; Administrator; Author; Annotated text; Bottleneck!]

10 Automatic tagging
- WISDOM++ (Univ. Bari): from scanned texts to blocks to XML tags; classification of blocks by C4.5. Altamura, Esposito, Malerba 2000
- ADT (Univ. Dortmund): from RTF annotation to XML tags; classification by C4.5. Christian Hüppe 2003

11 ADT: Input Document

12 ADT: Manual Annotation of Examples

13 ADT: Attributes of Examples
- RTF control words
- Presence of a control word in the current and the preceding paragraph:
  - ff: neither in this nor in the preceding paragraph
  - ft: not in this but in the preceding paragraph
  - tf: in this but not in the preceding paragraph
  - tt: both in this and in the preceding paragraph
- Value of indentation in the current and the preceding paragraph
- First and second word of the paragraph
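The attribute scheme on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions: the paragraph representation (a dict with `controls`, `indent`, and `text`) and the function names `presence_code` and `paragraph_attributes` are hypothetical and are not taken from ADT itself.

```python
def presence_code(control_word, current_controls, preceding_controls):
    """Encode presence as two letters: first = this paragraph, second = preceding one."""
    in_current = "t" if control_word in current_controls else "f"
    in_preceding = "t" if control_word in preceding_controls else "f"
    return in_current + in_preceding


def paragraph_attributes(control_words, current, preceding):
    """Build one attribute dict for a paragraph (hypothetical data layout)."""
    attrs = {
        cw: presence_code(cw, current["controls"], preceding["controls"])
        for cw in control_words
    }
    attrs["indent"] = current["indent"]
    attrs["preceding_indent"] = preceding["indent"]
    words = current["text"].split()
    attrs["first_word"] = words[0] if words else ""
    attrs["second_word"] = words[1] if len(words) > 1 else ""
    return attrs


# Toy paragraphs; "b" and "i" stand in for RTF control words such as \b and \i.
heading = {"controls": {"b"}, "indent": 0, "text": "Automatic Tagging"}
body = {"controls": {"i"}, "indent": 360, "text": "From layout to tags"}
attrs = paragraph_attributes(["b", "i"], body, heading)
```

Each resulting dict could then serve as one training example for a decision-tree learner such as C4.5, with the manually assigned tag as the class.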

14 ADT: Learning

15 ADT: Classification of Paragraphs
9 classes (tags), 159 paragraphs

No. of examples for each class | F-measure
1 | 41 %
2 | 94 %
3 | 98.3 %
4 | 99.68 %

16 Application Options
- Combination of ADT and IP4W3 offers qualified search, tailored courseware, and enhanced e-learning without tedious annotation by the author or administrator.
- Semantic information within paragraphs cannot be captured.
Named entity recognition necessary!

17 Named Entity Recognition
- Classification of single words into given semantic categories (e.g., person, location, date).
- A phrase of the category is a sequence of the same label.
- Features of a word:
  - Linguistic features (e.g., part of speech)
  - Letters (e.g., beginning with upper case letter)
  - Word length
  - N-grams
- Knowledge-intensive vs. data-intensive approaches:
  - Linguistic rules
  - Examples
  - Unlabeled text (corpus)
- Training time and classification time: size of training and test sets
Named Entity Recognition

18 The Task
- Biomedical task on word forms (JNLPBA)
  - labeled occurrences for training
  - occurrences for testing
  - 100 million word forms from Medline as background
- German corpus word forms (CoNLL)
  - labeled occurrences for training
  - occurrences for testing
  - 40 million word forms from Frankfurter Rundschau as background
- Fast learning and classification necessary!

19 Data-intensive Approach: Marc Rössler
- Knowledge-poor:
  - No linguistic knowledge
  - No given word lists
  - No hand-written rules
- Use of very large given corpora:
  - Distribution of word occurrence in corpus
  - Frequencies of words
  - Frequencies of word sequences
- Bootstrapping of features:
  1. Learn classifiers from examples
  2. Apply classifiers to unlabeled corpus
  3. Extract features from now labeled corpus, enhance examples
  4. Learn classifiers from enhanced examples
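The four bootstrapping steps can be read as a loop. The sketch below is a minimal illustration, not Rössler's implementation: the function `bootstrap` and the toy stand-ins `train`, `tag`, `extract`, and `enhance` are hypothetical, and the "classifier" is just a word-to-label lookup so that the example stays runnable without an SVM library.

```python
from collections import Counter


def bootstrap(examples, corpus, train, tag, extract, enhance, rounds=1):
    """Steps 1-4 of the slide; the callables are supplied by the caller."""
    classifier = train(examples)                                   # 1. learn from examples
    for _ in range(rounds):
        labeled = tag(classifier, corpus)                          # 2. apply to unlabeled corpus
        corpus_features = extract(labeled)                         # 3. extract features ...
        examples = [enhance(ex, corpus_features) for ex in examples]  # ... and enhance examples
        classifier = train(examples)                               # 4. learn from enhanced examples
    return classifier, examples


# Toy stand-ins (assumptions for illustration only).
def train(examples):
    return {word: label for word, _features, label in examples}


def tag(classifier, corpus):
    return [(word, classifier.get(word, "O")) for word in corpus]


def extract(labeled):
    return Counter(labeled)  # how often each (word, label) pair was produced


def enhance(example, corpus_features):
    word, features, label = example
    enriched = dict(features)
    for (w, category), n in corpus_features.items():
        if w == word:
            enriched["tagged_" + category] = n
    return (word, enriched, label)


examples = [("Peter", {}, "PER"), ("hat", {}, "O")]
corpus = ["Peter", "hat", "Peter", "Müller"]
classifier, enhanced = bootstrap(examples, corpus, train, tag, extract, enhance)
```

Passing the four steps in as callables keeps the loop independent of the learner, which matches the slide's point that the features, not the learner, are what gets bootstrapped.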

20 The Base Classifier: Input
- Features:
  - 1 out of 30 word surface features (e.g., 4-digit number, uppercase only, starting with capital letter)
  - Word length
  - Positional substrings (at most 8), shown for "Konkurrenz":
    - Last character: z
    - Before last and last character: nz
    - Last 3 characters: enz
    - First trigram: Kon
    - Second trigram: onk
    - …
    - Fifth trigram: urr
  - Window of 3 preceding and 2 succeeding words: Ebenso schnell hat Peter Müllers Konkurrenz
- Vector of 60 features for each occurrence
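A rough Python sketch of these feature groups follows. The function names, the three surface tests, and the padding token are assumptions made for illustration; the slide only names three of the 30 surface features and does not say how short words or sentence boundaries are handled.

```python
def surface_feature(word):
    """Return one surface class; only the three examples from the slide are covered."""
    if len(word) == 4 and word.isdigit():
        return "4-digit number"
    if word.isupper():
        return "uppercase only"
    if word[:1].isupper():
        return "starting with capital letter"
    return "other"


def positional_substrings(word):
    """Last 1-3 characters plus the first five trigrams (8 substrings at most)."""
    feats = {"last1": word[-1:], "last2": word[-2:], "last3": word[-3:]}
    for i in range(5):
        # Words that are too short simply get an empty trigram (an assumption).
        feats["trigram%d" % (i + 1)] = word[i:i + 3] if len(word) >= i + 3 else ""
    return feats


def window(tokens, i, before=3, after=2, pad="<pad>"):
    """Words around position i; positions outside the sentence are padded."""
    left = [tokens[j] if j >= 0 else pad for j in range(i - before, i)]
    right = [tokens[j] if j < len(tokens) else pad
             for j in range(i + 1, i + 1 + after)]
    return left, right


tokens = "Ebenso schnell hat Peter Müllers Konkurrenz".split()
subs = positional_substrings("Konkurrenz")
left, right = window(tokens, tokens.index("Peter"))
```

With "Peter" as the focus word, the window covers exactly the six words of the example sentence.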

21 The Base Classifier: Output
- A classifier f_c is trained for each category against all others.
- A classifier f_NE is trained for “is a NE” vs. “is no NE”.
- Tagging the focus of the sliding window according to [formula not preserved in the transcript]

22 Corpus-based Features: Internal Evidence
Applying the base classifier to the corpus results in new features.
- Membership frequencies: how often a word v was seen as a member of the category c, where v is the token described by [formula not preserved in the transcript]
- All f_c > 0 become a feature with the ratio as value.
Example: [not preserved in the transcript]
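Since the formula on this slide is lost, the following is only one plausible reading: the ratio is taken to be the number of times a word was tagged with category c divided by the word's total number of occurrences in the corpus. The function name `membership_ratios` and the `"O"` label for "no named entity" are assumptions.

```python
from collections import Counter, defaultdict


def membership_ratios(tagged_corpus, outside="O"):
    """Map each word to {category: share of its occurrences tagged with that category}."""
    total = Counter()
    per_category = defaultdict(Counter)
    for word, category in tagged_corpus:
        total[word] += 1
        if category != outside:
            per_category[word][category] += 1
    # Only categories that were actually assigned (count > 0) become features.
    return {
        word: {c: n / total[word] for c, n in cats.items()}
        for word, cats in per_category.items()
    }


tagged = [
    ("Dortmund", "LOC"), ("Dortmund", "LOC"),
    ("Dortmund", "ORG"), ("Dortmund", "O"),
    ("hat", "O"),
]
ratios = membership_ratios(tagged)
```

In this toy corpus "Dortmund" receives a LOC feature of 0.5 and an ORG feature of 0.25, while "hat" receives no membership feature at all.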

23 Sequences: Windows
- The sequences of words with the same label are considered one token within the sliding window.
Example: with single words as tokens, the window around "Peter" is "Ebenso schnell hat Peter Müllers Konkurrenz". With "Peter Müllers" (label P) as one token, the window becomes "Ebenso schnell hat [Peter Müllers] Konkurrenz die".
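Collapsing runs of equally labeled words into single tokens can be sketched with `itertools.groupby`. The function name `merge_sequences` and the `"O"` label for unlabeled words are assumptions; the slide does not say how two adjacent but distinct entities of the same category are separated, so this sketch merges them as well.

```python
from itertools import groupby


def merge_sequences(tagged, outside="O"):
    """Join consecutive words that share a label into one token; keep 'outside' words apart."""
    merged = []
    for label, group in groupby(tagged, key=lambda pair: pair[1]):
        words = [word for word, _ in group]
        if label == outside:
            merged.extend((word, label) for word in words)
        else:
            merged.append((" ".join(words), label))
    return merged


tagged = [
    ("Ebenso", "O"), ("schnell", "O"), ("hat", "O"),
    ("Peter", "P"), ("Müllers", "P"),
    ("Konkurrenz", "O"), ("die", "O"),
]
merged = merge_sequences(tagged)
```

After merging, a window of 3 preceding and 2 succeeding tokens around "Peter Müllers" reaches one word further to the right ("die") than the word-level window did.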

24 Compiling Sequences into Features
- Membership frequencies: how often a word v was seen as the first (last) in a sequence labeled c (internal evidence)
- Membership frequencies become new features.
Example: [not preserved in the transcript]
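Counting how often a word opens or closes a labeled sequence might look like the sketch below. It returns raw counts; how the counts are normalized into feature values is not stated on the slide, so that step is left out. The function name `boundary_counts` and the `"O"` label are assumptions.

```python
from collections import Counter
from itertools import groupby


def boundary_counts(tagged, outside="O"):
    """Count (word, label) pairs at the first and last position of each labeled run."""
    first, last = Counter(), Counter()
    for label, group in groupby(tagged, key=lambda pair: pair[1]):
        if label == outside:
            continue
        words = [word for word, _ in group]
        first[(words[0], label)] += 1
        last[(words[-1], label)] += 1
    return first, last


tagged = [
    ("Peter", "P"), ("Müller", "P"), ("sah", "O"),
    ("Peter", "P"), ("Schmidt", "P"),
]
first, last = boundary_counts(tagged)
```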

25 Corpus-based Features: External Evidence
- Context frequencies: how often a sequence seq_c was preceded or succeeded by certain words
- A sequence s preceding seq_c is written seq_preC.
- Contexts with relative frequency > 0.01 become features of the preceding words in the sliding window.
Example: [not preserved in the transcript]
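A simplified sketch of the context counting is given below. It looks only at the single token directly before a sequence of category c, whereas the talk also uses the second and third preceding word and the succeeding words. The relative frequency is assumed to be taken over all preceding contexts of that category; the function name `preceding_contexts` is hypothetical.

```python
from collections import Counter


def preceding_contexts(merged, category, threshold=0.01):
    """Relative frequency of tokens directly preceding a sequence labeled `category`."""
    counts = Counter()
    for i, (_token, label) in enumerate(merged):
        if label == category and i > 0:
            counts[merged[i - 1][0]] += 1
    total = sum(counts.values())
    # Keep only contexts whose relative frequency exceeds the threshold.
    return {word: n / total for word, n in counts.items() if n / total > threshold}


# Input is a corpus in which labeled sequences are already merged into one token.
merged = [
    ("Herr", "O"), ("Peter Müller", "P"), ("sagte", "O"),
    ("Herr", "O"), ("Schmidt", "P"), ("und", "O"), ("Anna", "P"),
]
contexts = preceding_contexts(merged, "P")
```

In the toy data "Herr" precedes two of the three person sequences and "und" one, so both pass the 0.01 threshold; in a corpus of millions of word forms the threshold would filter out rare contexts.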

26 Enhanced Features
- Based on the tagging of the unlabeled corpus by the base classifiers, features are extracted:
  - Internal evidence:
    - f_c X intervals
    - First/last in seq_c
  - External evidence:
    - First, second, third in seq_preC
    - First/second in seq_sucC
- Training is again performed using the enriched feature set.
- Tagging is enhanced by max(length(seq_i)) (read again)

27 Experiments
- Does the sequence in window focus enhance the learning result?
- Does the use of unlabeled background corpus enhance learning results?
- How is the enhancement per round? How many rounds are necessary?
- Is the knowledge-poor approach compatible with approaches using linguistic knowledge?
- Would a Hidden Markov Model be better?

28 Does the sequence in window focus enhance the learning result?
[Table: rows "Regular N-grams" and "Sequences"; columns: instances in training/test, F-measure LOC, F-measure PER, F-measure ORG, overall precision, overall recall, overall F-measure; values not preserved]
Yes.

29 Does the use of unlabeled background corpus enhance learning results?
[Table: rows "No use of corpus, sequences" and "Corpus for internal and external evidence"; columns: F-measure LOC, F-measure PER, F-measure ORG, overall precision, overall recall, overall F-measure; values not preserved]
Yes.

30 How is the enhancement per round? How many rounds are necessary?

31 Number of Support Vectors

32 Is the knowledge-poor approach compatible with approaches using linguistic knowledge?
[Table: columns: author, F-measure LOC, F-measure PER, F-measure ORG; rows: Volk, Clematide; Neumann, Piskorski; Florian et al. (best CoNLL); Rössler; values not preserved]
Hmm …

33 Would a Hidden Markov Model be better?
No, but turning its classification into a feature helps SVM!

34 Summary of Rössler’s Approach
- System consists of 3 components:
  - SVM training: fast in large, sparse vector space
  - Feature extraction from large corpora: fast automatic adaptation to new domain
  - The outer loop:
    - Splitting instances of an m-class learning problem into m-1 binary problems
    - Tagging using a voting mechanism
    - Enhancing examples by extracted features
- The feature approach easily integrates linguistic knowledge or predictions of other learners, if given.
- The data-driven approach is language independent.
- Results are compatible with knowledge-based approaches.

35 Conclusion
- Tagged data allow for enhanced services.
- Automatic tagging of paragraphs or tables can easily be done using very few examples in an interactive, incremental way.
- Named entity recognition for automatic tagging remains a challenge.

