Technische universität dortmund Fakultät für Informatik LS 8 An Application-Oriented View of Automatic Tagging and Information Extraction Katharina Morik.

Presentation transcript:

technische universität dortmund, Fakultät für Informatik LS 8
An Application-Oriented View of Automatic Tagging and Information Extraction
Katharina Morik

Overview
- Handling texts: overview
- Mark-up languages
- Services based on annotated texts
- Automatic tagging
- From lay-out information to tags
- Named entity recognition
- Data-intensive approach
- Counting in very large unlabeled corpus
- Turning frequencies into features
- Compiling sequences into features

Handling Texts
- Granularity: hypertext structure, text, paragraph, word, letters
- Learning mode: batch, incremental
- Learning goal: adapted organization, class or clustering, syntactic or semantic structures
- Application tasks: personalization, optimization of information access, integration in business processes, reporting

Table of related work (columns: Hypertext, Text, Paragraph, Word):
- Adaptation: Alesker, Joachims, Neifach; Veltmann; Hüppe, Mintert, Thomas; Helbig
- Extraction: Rössler
- Clustering: Schewe, Wurst
- Classification: Joachims, Klinkenberg
Marker on the slide: this talk

Mark-up languages: Intelligent Publishing Using Mark-Ups
- Search qualified by semantic category
- Self-contained parts of text (atoms) as search result
- Composition of one's own text
- Presentation according to semantic category
IP4W3 system by Stefan Mintert, 1999.
Diagram labels: Query: category + word; Webserver; Result: list of atoms; Text; Selection; Search; Composition; User; Presentation.

Mark-up languages: Qualified Search

Mark-up languages: Presentation of Results

Mark-up languages: Text Composition
Selected results from 2 queries combined.

Mark-up languages: Applications in e-Learning and e-Publishing
- Intelligent publication on the web: users customize the material to their own needs. "IP4W3", Stefan Mintert 1999, Dortmund.
- Course material for different groups: from the central repository of presentations or texts, courses are designed for special interests. "Slicing Books", Ingo Dahm 2001, Koblenz-Landau.
- Additional sequence information allows courses to be tailored to learning types, e.g. top-down (from definition to application) or bottom-up (from application to definition). Moritz Thomas 1999, Dortmund.

Mark-up languages: Behind the Curtain
- Mark-up editor
- Editor for defining qualified search
Text fragment on the slide: "fits Characters are the atomic unit of texts according to ISO/IEC …"
Diagram labels: DTD/Schema; Webserver; Search patterns; Style sheets; Administrator; Author; Annotated text. Bottleneck!

Automatic Tagging
- WISDOM++ (Univ. Bari): from scanned texts to blocks to XML tags; classification of blocks by C4.5. Altamura, Esposito, Malerba 2000.
- ADT (Univ. Dortmund): from RTF annotation to XML tags; classification by C4.5. Christian Hüppe 2003.

Automatic Tagging: ADT, Input Document

Automatic Tagging: ADT, Manual Annotation of Examples

Automatic Tagging: ADT, Attributes of Examples
- RTF control words
- Presence of a control word in the current and the preceding paragraph:
  - ff: neither in this nor in the preceding paragraph
  - ft: not in this but in the preceding paragraph
  - tf: in this but not in the preceding paragraph
  - tt: in this as well as in the preceding paragraph
- Value of indentation in the current and the preceding paragraph
- First and second word of the paragraph
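The attribute scheme above can be sketched in a few lines of Python. This is only an illustration, not ADT's actual code: the record layout (`controls`, `indent`, `text`) and the function names are assumptions made for the example, and the sample control words `\b`, `\i` and `\ul` are simply common RTF control words.

```python
def presence_code(in_current: bool, in_preceding: bool) -> str:
    """Encode presence as ff, ft, tf or tt (first letter: this paragraph, second: preceding)."""
    return ("t" if in_current else "f") + ("t" if in_preceding else "f")


def paragraph_attributes(control_words, current, preceding):
    """Build the attribute dictionary for one paragraph.

    `current` and `preceding` are hypothetical records of the form
    {"controls": set of RTF control words, "indent": int, "text": str}.
    """
    attrs = {
        cw: presence_code(cw in current["controls"], cw in preceding["controls"])
        for cw in control_words
    }
    attrs["indent_current"] = current["indent"]
    attrs["indent_preceding"] = preceding["indent"]
    words = current["text"].split()
    attrs["first_word"] = words[0] if len(words) > 0 else ""
    attrs["second_word"] = words[1] if len(words) > 1 else ""
    return attrs


preceding = {"controls": {"\\b"}, "indent": 0, "text": "Overview"}
current = {"controls": {"\\b", "\\i"}, "indent": 720, "text": "Named entity recognition"}
attrs = paragraph_attributes(["\\b", "\\i", "\\ul"], current, preceding)
```

Each paragraph becomes one row of nominal and numeric attributes, which is the kind of input a decision tree learner such as C4.5 expects.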

Automatic Tagging: ADT, Learning

Automatic Tagging: ADT, Classification of Paragraphs

No. of examples for each class | F-measure
1 | 41 %
2 | 94 %
3 | 98.3 %
4 | 99.68 %

9 classes (tags), 159 paragraphs.

Automatic Tagging: Application Options
- The combination of ADT and IP4W3 offers qualified search, tailored courseware, and enhanced e-learning without tedious annotation by the author or administrator.
- Semantic information within paragraphs cannot be captured.
Named entity recognition necessary!

Named Entity Recognition
- Classification of single words into given semantic categories (e.g., person, location, date).
- A phrase of the category is a sequence of words with the same label.
- Features of a word:
  - Linguistic features (e.g., part of speech)
  - Letters (e.g., beginning with an upper case letter)
  - Word length
  - N-grams
- Knowledge-intensive vs. data-intensive approaches:
  - Linguistic rules
  - Examples
  - Unlabeled text (corpus)
- Training time, classification time; size of training and test sets

Named Entity Recognition: The Task
- Biomedical task on word forms (JNLPBA)
  - labeled occurrences for training
  - occurrences for testing
  - 100 million word forms from Medline as background
- German corpus word forms (CoNLL)
  - labeled occurrences for training
  - occurrences for testing
  - 40 million word forms from Frankfurter Rundschau as background
- Fast learning and classification necessary!

Named Entity Recognition: Data-intensive Approach (Marc Rössler)
- Knowledge-poor:
  - No linguistic knowledge
  - No given word lists
  - No hand-written rules
- Use of very large given corpora:
  - Distribution of word occurrence in corpus
  - Frequencies of words
  - Frequencies of word sequences
- Bootstrapping of features:
  1. Learn classifiers from examples
  2. Apply classifiers to unlabeled corpus
  3. Extract features from the now labeled corpus, enhance examples
  4. Learn classifiers from enhanced examples
Diagram label: Stop
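The four bootstrapping steps can be read as a small loop. The sketch below shows only the control flow. The learner `train_lookup` and the feature extractor `tag_counts` are toy stand-ins invented for this example (the talk uses SVMs and corpus frequencies), and the example format `(feature dict, label)` with an `"O"` label for non-entities is an assumption.

```python
from collections import Counter, defaultdict


def bootstrap(examples, corpus, train, extract_features, rounds=1):
    """Run the bootstrapping loop; `train` and `extract_features` are supplied by the caller."""
    classifier = train(examples)                                  # step 1: learn from examples
    for _ in range(rounds):
        labeled = [(word, classifier(word)) for word in corpus]  # step 2: tag unlabeled corpus
        extra = extract_features(labeled)                         # step 3: extract features ...
        examples = [                                              # ... and enhance the examples
            ({**feats, **extra.get(feats["word"], {})}, label)
            for feats, label in examples
        ]
        classifier = train(examples)                              # step 4: learn again
    return classifier, examples


def train_lookup(examples):
    """Toy learner (assumption): remember the label seen for each training word."""
    table = {feats["word"]: label for feats, label in examples}
    return lambda word: table.get(word, "O")


def tag_counts(labeled_corpus):
    """Toy extractor (assumption): how often each word received each label."""
    counts = defaultdict(Counter)
    for word, label in labeled_corpus:
        counts[word][label] += 1
    return {
        word: {f"count_{label}": n for label, n in by_label.items()}
        for word, by_label in counts.items()
    }


examples = [({"word": "Peter"}, "PER"), ({"word": "hat"}, "O")]
corpus = ["Peter", "hat", "Peter", "schnell"]
classifier, enhanced = bootstrap(examples, corpus, train_lookup, tag_counts)
```

Passing the learner and the extractor in as functions keeps the loop independent of the concrete classifier, so a real SVM trainer could replace the toy lookup without touching `bootstrap`.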

Named Entity Recognition: The Base Classifier, Input
- Features:
  - 1 out of 30 word surface features (e.g., 4-digit number, uppercase only, starting with capital letter)
  - Word length
  - Positional substrings (at most 8):
    - Last character: z
    - Last two characters: nz
    - Last 3 characters: enz
    - First trigram: Kon
    - Second trigram: onk
    - …
    - Fifth trigram: urr
  - Window of 3 preceding and 2 succeeding words: Ebenso schnell hat Peter Müllers Konkurrenz
- Vector of 60 features for each occurrence
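A minimal sketch of how such input features might be computed. It covers only the three surface features and six positional substrings named on the slide (the full sets of 30 and 8 are not listed there), and the function names and the `<pad>` token are assumptions made for the example.

```python
def surface_features(word):
    """Three of the surface features named on the slide."""
    return {
        "four_digit_number": len(word) == 4 and word.isdigit(),
        "uppercase_only": word.isupper(),
        "starts_with_capital": word[:1].isupper(),
    }


def positional_substrings(word):
    """The six positional substrings shown on the slide for one word."""
    def trigram(k):
        # k-th trigram, counting from 1: characters k, k+1, k+2 of the word
        return word[k - 1:k + 2] if len(word) >= k + 2 else ""
    return {
        "last_1": word[-1:],
        "last_2": word[-2:],
        "last_3": word[-3:],
        "trigram_1": trigram(1),
        "trigram_2": trigram(2),
        "trigram_5": trigram(5),
    }


def window(tokens, focus, before=3, after=2, pad="<pad>"):
    """Return the focus token with `before` preceding and `after` succeeding tokens."""
    padded = [pad] * before + list(tokens) + [pad] * after
    return padded[focus:focus + before + 1 + after]


sentence = ["Ebenso", "schnell", "hat", "Peter", "Müllers", "Konkurrenz"]
substrings = positional_substrings("Konkurrenz")
context = window(sentence, focus=3)
```

For the word "Konkurrenz" this reproduces the values on the slide (z, nz, enz, Kon, onk, urr). With "Peter" as focus, the window covers the whole example sentence.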

Named Entity Recognition: The Base Classifier, Output
- A classifier f_c is trained for each category against all others
- A classifier f_NE is trained for "is a NE" vs. "is no NE"
- Tagging the focus of the sliding window according to …

Named Entity Recognition: Corpus-based Features, Internal Evidence
Applying the base classifier to the corpus results in new features.
- Membership frequencies (how often a word v was seen as a member of the category c), where v is the token described by …
- All f_c > 0 become a feature with the ratio as value.
Example: …
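The exact ratio is not preserved in this transcript. The sketch below therefore makes an assumption: the value for a word v and category c is the share of v's corpus occurrences that the base classifier tagged with c. The function name, the `(word, label)` input format and the `"O"` label for non-entities are likewise illustrative.

```python
from collections import Counter, defaultdict


def membership_ratios(tagged_corpus, outside="O"):
    """Map each word to {category: share of its occurrences tagged with that category}.

    Assumption: ratio = count(word tagged c) / count(word); only categories that
    were seen at least once for the word (frequency > 0) produce a feature.
    """
    counts = defaultdict(Counter)
    for word, label in tagged_corpus:
        counts[word][label] += 1
    ratios = {}
    for word, by_label in counts.items():
        total = sum(by_label.values())
        ratios[word] = {
            label: n / total
            for label, n in by_label.items()
            if label != outside and n > 0
        }
    return ratios


tagged = [("Essen", "LOC"), ("Essen", "O"), ("Essen", "LOC"), ("Essen", "O"), ("Peter", "PER")]
ratios = membership_ratios(tagged)
```

In this made-up sample "Essen" is tagged LOC in two of four occurrences, so it would receive a LOC feature with value 0.5.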

Sequences: Windows
- The sequences of words with the same label are considered one token within the sliding window.
Example (label P):
Ebenso schnell hat Peter Müllers Konkurrenz
Ebenso schnell hat Peter Müllers Konkurrenz die
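One way to read the example: once "Peter Müllers" is treated as a single token labeled P, a window of 3 preceding and 2 succeeding tokens reaches one word further, up to "die". The sketch below illustrates this reading. The labels assigned to the sentence and the `"O"` label for unlabeled words are assumptions.

```python
from itertools import groupby


def collapse(tokens, labels, outside="O"):
    """Merge each run of identically labeled tokens into one (text, label) token."""
    merged = []
    position = 0
    for label, run in groupby(labels):
        length = len(list(run))
        chunk = tokens[position:position + length]
        if label == outside:
            # unlabeled words stay separate tokens
            merged.extend((token, outside) for token in chunk)
        else:
            merged.append((" ".join(chunk), label))
        position += length
    return merged


tokens = ["Ebenso", "schnell", "hat", "Peter", "Müllers", "Konkurrenz", "die"]
labels = ["O", "O", "O", "P", "P", "O", "O"]
merged = collapse(tokens, labels)
# window of 3 preceding and 2 succeeding tokens around the merged sequence
focus = merged.index(("Peter Müllers", "P"))
window_tokens = [text for text, _ in merged[max(0, focus - 3):focus + 3]]
```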

Named Entity Recognition: Compiling Sequences into Features
- Membership frequencies (how often a word v was seen as the first (last) in a sequence labeled c): internal evidence
- Membership frequencies become new features.
Example: …
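Counting first and last positions is straightforward once the labeled sequences are available. A minimal sketch, assuming sequences are given as `(label, list of words)` pairs; the sample names are made up for illustration.

```python
from collections import Counter


def boundary_counts(sequences):
    """Count how often each word is the first / the last word of a sequence labeled c."""
    first, last = Counter(), Counter()
    for label, words in sequences:
        first[(words[0], label)] += 1
        last[(words[-1], label)] += 1
    return first, last


sequences = [
    ("PER", ["Peter", "Müller"]),
    ("PER", ["Peter", "Schmidt"]),
    ("LOC", ["Frankfurt"]),
]
first, last = boundary_counts(sequences)
```

A one-word sequence counts as both first and last, so "Frankfurt" appears in both counters here.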

Named Entity Recognition: Corpus-based Features, External Evidence
- Context frequencies (how often a sequence seq_c was preceded or succeeded by certain words)
- A sequence s preceding seq_c is written seq_preC
- Contexts with relative frequency > 0.01 become features of the preceding words in the sliding window
Example: …
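A sketch of the context-frequency idea for preceding words only. The normalization is an assumption, since the slide does not say what the relative frequency is taken over: here it is the number of times a word directly precedes a sequence of the category, divided by the number of sequences of that category. The input is assumed to be a token list in which each labeled sequence is already merged into one token.

```python
from collections import Counter


def frequent_preceding_contexts(tagged, category, threshold=0.01):
    """Return {word: relative frequency} for words that directly precede `category` sequences."""
    preceding = Counter()
    total = 0
    for i, (_, label) in enumerate(tagged):
        if label == category:
            total += 1
            if i > 0:
                preceding[tagged[i - 1][0]] += 1
    if total == 0:
        return {}
    return {word: n / total for word, n in preceding.items() if n / total > threshold}


tagged = [
    ("Herr", "O"), ("Peter Müller", "PER"), ("sagte", "O"),
    ("Herr", "O"), ("Hans Meier", "PER"),
]
contexts = frequent_preceding_contexts(tagged, "PER")
```

Succeeding contexts would be counted the same way with `tagged[i + 1]`.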

Named Entity Recognition: Enhanced Features
- Based on the tagging of the unlabeled corpus by the base classifiers, features are extracted:
  - Internal evidence:
    - f_c X intervals
    - First/last in seq_c
  - External evidence:
    - First, second, third in seq_preC
    - First/second in seq_sucC
- Training is performed again using the enriched feature set.
- Tagging is enhanced by max(length(seq_i)) (read again)

Named Entity Recognition: Experiments
- Does the sequence in window focus enhance the learning result?
- Does the use of an unlabeled background corpus enhance learning results?
- How is the enhancement per round? How many rounds are necessary?
- Is the knowledge-poor approach compatible with approaches using linguistic knowledge?
- Would a Hidden Markov Model be better?

Named Entity Recognition: Does the sequence in window focus enhance the learning result?
Table columns: Instances in Training/Test; F-measure LOC; F-measure PER; F-measure ORG; Overall Precision; Overall Recall; Overall F-measure.
Table rows: Regular N-grams; Sequences.
Yes.

Named Entity Recognition: Does the use of an unlabeled background corpus enhance learning results?
Table columns: F-measure LOC; F-measure PER; F-measure ORG; Overall Precision; Overall Recall; Overall F-measure.
Table rows: No use of corpus, sequences; Corpus for internal and external evidence.
Yes.

Named Entity Recognition: How is the enhancement per round? How many rounds are necessary?

Named Entity Recognition: Number of Support Vectors

Named Entity Recognition: Is the knowledge-poor approach compatible with approaches using linguistic knowledge?
Table columns: Author; F-measure LOC; F-measure PER; F-measure ORG.
Table rows: Volk, Clematide; Neumann, Piskorski; Florian et al. (best CoNLL); Rössler.
Hmm …

Named Entity Recognition: Would a Hidden Markov Model be better?
No, but turning its classification into a feature helps the SVM!

Named Entity Recognition: Summary of Rössler's Approach
- The system consists of 3 components:
  - SVM training: fast in a large, sparse vector space
  - Feature extraction from large corpora: fast automatic adaptation to a new domain
  - The outer loop:
    - Splitting instances of an m-class learning problem into m-1 binary problems
    - Tagging using a voting mechanism
    - Enhancing examples by extracted features
- The feature approach easily integrates linguistic knowledge or predictions of other learners, if given.
- The data-driven approach is language independent.
- Results are compatible with knowledge-based approaches.
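The splitting step can be illustrated as follows. The sketch assumes that the m classes are the entity categories plus one "no entity" class (written `"O"` here), so that one binary problem per entity category gives m-1 problems. That reading, the +1/-1 encoding and the function name are assumptions, and the voting mechanism itself is not shown because the transcript does not describe it.

```python
def binary_problems(examples, outside="O"):
    """Split (features, label) examples into one binary problem per entity category.

    Returns {category: [(features, +1 or -1), ...]}; the `outside` class gets no
    problem of its own, which gives m-1 problems for m classes.
    """
    categories = sorted({label for _, label in examples if label != outside})
    return {
        category: [(feats, 1 if label == category else -1) for feats, label in examples]
        for category in categories
    }


examples = [
    ({"word": "Peter"}, "PER"),
    ({"word": "Frankfurt"}, "LOC"),
    ({"word": "hat"}, "O"),
]
problems = binary_problems(examples)
```

With three classes (PER, LOC, O) this yields two binary training sets, each containing all examples with the target category as the positive class.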

Conclusion
- Tagged data allow for enhanced services.
- Automatic tagging of paragraphs or tables can easily be done using very few examples in an interactive, incremental way.
- Named entity recognition for automatic tagging remains a challenge.
