Multilingual Corpora Workshop, 27 March 2003
Corpora and Evaluation Tools for Multilingual Named Entity Grammar Development
Christian Bering, Witold Drożdżyński, Gregor Erbach, Clara Guasch, Petr Homola, Sabine Lehmann, Hong Li, Hans-Ulrich Krieger, Jakub Piskorski, Ulrich Schäfer, Atsuko Shimada, Melanie Siegel, Feiyu Xu, Dorothee Ziegler-Eisele
DFKI GmbH, LT Lab, Saarbrücken
Saarland University, Computational Linguistics Dept, Saarbrücken
Acrolinx GmbH, Berlin

Outline
- Motivation
- SProUT: shallow processing toolkit
- Multilingual NE grammar development
  - Shared output structures
  - Shared token classes
  - Shared grammars
- Multilingual NE corpora
- Evaluation tool

Motivation
- Named Entity Recognition is fundamental to a number of information management applications (search engines, question answering, text mining, …)
- Many of these applications deal with different languages
- Development of multilingual named entity grammars, supported by BMBF in the projects WHITEBOARD and COLLATE, and by the EU in the project AIRFORCE

Challenges in multilingual NER
- Different alphabets, character sets and character encodings
- Different tokenization conventions
- Different time and currency formats
- Different representations of proper names
  - Identical across languages (New York, George Bush, IBM)
  - Different for some languages (London vs. Londres, Firenze vs. Florence vs. Florenz, NATO vs. OTAN, München vs. Munich vs. Monaco)

SProUT: Objectives
- platform for the development of multilingual and domain-adaptive shallow text processing and information extraction systems
- trade-off between efficiency and expressiveness
- modularity (fine-grained modeling of linguistic components into clear-cut modules)
- portability and industrial standards

System Architecture
[Figure: system architecture with two parts, a grammar development environment and online processing. Labelled components: finite-state toolkit, regular compiler, shallow grammar interpreter, JTFS, shallow grammar, extended optimized FST representation, lexical resources, linguistic processing resources, input data (stream of text items), structured output data.]

System Components
- linguistic processing resources
  - tokenizer (easily adaptable for Indo-European languages)
  - gazetteer
  - morphology component (8 languages)
  - named entity recognition (6 languages)
- core tools
  - JTFS
  - FSM toolkit
  - regular compiler
  - shallow grammar interpreter
  - tries for NLP processing

TFS and TFS-XML
- TFS (typed feature structures) as data interchange format in SProUT
- unification and subsumption check as basic operations for evaluation
- compact XML encoding of typed feature structures (following TEI-SGML)
- exchange format for linguistic resources:
  - grammars
  - feature structure tree banks
- exchange format for visualization
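The two basic operations named on this slide can be sketched on untyped feature structures represented as nested Python dicts. This is a simplification under stated assumptions: SProUT's structures are typed and the JTFS implementation is not shown in the slides; `unify` and `subsumes` are hypothetical helper names, and the feature names in the example are illustrative.

```python
def unify(a, b):
    """Return the unification of a and b, or None if their values clash."""
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for feature, b_value in b.items():
            if feature in result:
                merged = unify(result[feature], b_value)
                if merged is None:
                    return None  # clash somewhere below this feature
                result[feature] = merged
            else:
                result[feature] = b_value
        return result
    # Atomic values (or atom vs. structure) unify only if they are equal.
    return a if a == b else None


def subsumes(general, specific):
    """True if every piece of information in `general` is also in `specific`."""
    if isinstance(general, dict):
        if not isinstance(specific, dict):
            return False
        return all(
            feature in specific and subsumes(value, specific[feature])
            for feature, value in general.items()
        )
    return general == specific


# Illustrative structures (assumed, not taken from a SProUT grammar).
rule = {"CAT": "person", "NAME": {"SURNAME": "Bush"}}
matched = {"NAME": {"GIVEN_NAME": "George", "SURNAME": "Bush"}}
unified = unify(rule, matched)
```

Here `unified` carries the information of both inputs, and both `rule` and `matched` subsume it; unifying two structures with different `SURNAME` values fails.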

TFS-XML: Example

Morphological Resources
Indo-European language resources
  English    200,000 entries (Mmorph (Multext))
  German     830,000 entries (Mmorph (Multext))
  French     225,000 entries (Mmorph (Multext))
  Spanish    570,000 entries (Mmorph (Multext))
  Italian    330,000 entries (Mmorph (Multext))
  Czech      600,000 entries (Institute of Formal and Applied Linguistics in Prague)
Asian language resources
  Chinese    Shanxi-Tokenizer
  Japanese   ChaSen

Architecture
- Mmorph full-form lexica are stored as tries
- external modules (Asian and Czech) are integrated via client/server
[Diagram labels: Parser, Tokeniser, Mmorph, ChaSen, Czech, Shanxi]

A Type-Driven Method for Compacting Mmorph
- Problem: Mmorph contains redundant and spuriously ambiguous readings (German Mmorph: 5.8 readings per wordform in DNF, i.e. disjunctive normal form)
- The method compacts Mmorph by
  - deleting redundant readings
  - replacing specific readings with more general ones, using type generalization and subsumption checking
  - generating a type hierarchy
- Result: the average number of readings per wordform in German is now 1.6

Compacting a Mmorph Entry
Before compaction (four readings):
  "evaluierten" = "evaluieren" Adjective[ gender=masc number=singular case=gen|dat|acc]
  "evaluierten" = "evaluieren" Adjective[ gender=fem|neutrum number=singular case=gen|dat]
  "evaluierten" = "evaluieren" Adjective[ gender=masc|fem|neutrum number=singular case=nom|gen|dat|acc]
  "evaluierten" = "evaluieren" Adjective[ gender=masc|fem|neutrum number=plural case=nom|gen|dat|acc]
After compaction (one reading):
  "evaluierten" = "evaluieren" Adjective[ gender=fem_masc_neutrum number=plural_singular case=acc_dat_gen_nom]
Generated type hierarchy (fragment):
  plural_singular > plural, singular
  fem_masc_neutrum > fem_masc, fem_neutrum, masc_neutrum
  fem_masc > fem, masc
  fem_neutrum > fem, neutrum
  masc_neutrum > masc, neutrum
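The compaction of this entry can be reproduced with a small sketch that treats each reading as a mapping from feature to a set of values. It is an illustration under assumptions, not the authors' implementation: it uses plain set inclusion in place of a type hierarchy, it assumes all readings share the same features, and the function names are hypothetical.

```python
from itertools import combinations


def reading(gender, number, case):
    """Build a reading from Mmorph-style 'a|b|c' value disjunctions."""
    return {
        "gender": frozenset(gender.split("|")),
        "number": frozenset(number.split("|")),
        "case": frozenset(case.split("|")),
    }


def subsumes(general, specific):
    """`general` covers `specific` if each value set is a superset."""
    return all(general[f] >= specific[f] for f in general)


def drop_subsumed(readings):
    """Delete readings covered by another reading (keep one of any duplicates)."""
    kept = []
    for i, r in enumerate(readings):
        covered = any(
            j != i and subsumes(other, r) and (other != r or j < i)
            for j, other in enumerate(readings)
        )
        if not covered:
            kept.append(r)
    return kept


def merge_pair(a, b):
    """Generalize two readings that differ in exactly one feature."""
    differing = [f for f in a if a[f] != b[f]]
    if len(differing) != 1:
        return None
    merged = dict(a)
    merged[differing[0]] = a[differing[0]] | b[differing[0]]
    return merged


def compact(readings):
    """Repeat deletion and pairwise generalization until nothing changes."""
    readings = drop_subsumed(readings)
    changed = True
    while changed:
        changed = False
        for a, b in combinations(readings, 2):
            merged = merge_pair(a, b)
            if merged is not None:
                readings = [r for r in readings if r is not a and r is not b]
                readings = drop_subsumed(readings + [merged])
                changed = True
                break
    return readings


def type_name(values):
    """Name a generalized type by joining its sorted atomic values."""
    return "_".join(sorted(values))


evaluierten = [
    reading("masc", "singular", "gen|dat|acc"),
    reading("fem|neutrum", "singular", "gen|dat"),
    reading("masc|fem|neutrum", "singular", "nom|gen|dat|acc"),
    reading("masc|fem|neutrum", "plural", "nom|gen|dat|acc"),
]
compacted = compact(evaluierten)
```

The first two readings are deleted because the third covers them, and the remaining two differ only in number, so they merge into a single reading. Merging is restricted to readings that differ in one feature so that the generalized reading describes exactly the same combinations as the two it replaces.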

A SProUT Grammar Rule (XTDL)

Unification
[Figure panels: Matched input structure; Extended Rule Structure After Match; Fully Unified Structure]

Grammar Development
- Loading
  - loading the type hierarchy and visualization
  - loading existing SProUT grammar files
- Editing
  - editing SProUT grammars in XTDL and XML format
- Compilation
  - compiling a SProUT grammar into XFSM
- Configuration
  - selection of linguistic resources for a given language, e.g. tokenizer, morphological component and grammar files

Multilingual Named Entity Grammars
- Languages
  - English, French, German, Spanish
  - Chinese, Japanese
- Grammar style
  - MUC-7/MET-2 named entity classes with some variations
    - ENAMEX: person, location, organisation
    - TIMEX: time point, time span (instead of date, time)
    - NUMEX: percentage, money
  - Named entity types with internal attribute-value structures, e.g.
      span := timex & [FROM point, TO point].

Future Work
- extension of NE grammars to other languages, e.g. Czech, Polish
- grammar evaluation with jTaCo
- efficiency issues
  - experiments with different search strategies
  - grammar processing optimization
- extension of XTDL expressiveness
  - functional operators
  - seek operator

Multilingual Named Entity Grammar Development
- Language coverage: German, French, English, Spanish, Czech, Japanese, Chinese
- Domain coverage: extended MUC-7 standard (person names, organizations, geographical locations, currencies, date and time expressions)

Multilingual NE grammar development: Our approach
- Shared output structures
- Shared token classes
- Shared grammars

Shared Output Structures
The grammars for all six languages produce the same, semantically oriented output structures, defined in TDL:
  ne_type := sign & [DESCRIPTOR string].
  enamex := ne_type.
  ne-person := enamex & [TITLE list-of-strings,
                         GIVEN_NAME list-of-strings,
                         SURNAME list-of-strings,
                         P-POSITION list-of-strings,
                         NAME-SUFFIX string].
  ne-location := enamex & [LOCTYPE loc-type,
                           LOCNAME string].
  loc-type :< atom.
  river := loc-type.
  continent := loc-type.
  country := loc-type.
  province := loc-type.
  city := loc-type.
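For readers unfamiliar with TDL, the type definitions on this slide can be read roughly like a class hierarchy. The following Python rendering is only an analogy under assumptions: the class and field names are hypothetical transliterations, the `sign` supertype is omitted, and Python inheritance does not capture unification-based typing.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class LocType(Enum):
    """Atomic subtypes of loc-type listed on the slide."""
    RIVER = "river"
    CONTINENT = "continent"
    COUNTRY = "country"
    PROVINCE = "province"
    CITY = "city"


@dataclass
class NeType:
    """ne_type: every named entity carries a DESCRIPTOR string."""
    descriptor: Optional[str] = None


@dataclass
class Enamex(NeType):
    """enamex: adds no features of its own."""


@dataclass
class NePerson(Enamex):
    """ne-person: name parts as lists of strings."""
    title: List[str] = field(default_factory=list)
    given_name: List[str] = field(default_factory=list)
    surname: List[str] = field(default_factory=list)
    p_position: List[str] = field(default_factory=list)
    name_suffix: Optional[str] = None


@dataclass
class NeLocation(Enamex):
    """ne-location: a location type plus the location name."""
    loctype: Optional[LocType] = None
    locname: Optional[str] = None


# The same structure regardless of which language's grammar produced it.
florence = NeLocation(loctype=LocType.CITY, locname="Firenze")
```

Because every grammar fills the same structure, downstream components only need to handle one output schema for all languages.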

Shared Token Classes
A single set of token classes is used for the European languages:
  NATURAL_NUMBER              12344
  FLOATING_POINT_NUMBER       123,43
  NUMBER_PERCENT_COMPOUND     34,4%
  NUMBER_DOT_COMPOUND         234.345.545.
  NUMBER_WORD_COMPOUND        2,4-fachen
  DIGIT_SLASH_COMPOUND        12/01/1998
  DIGIT_DASH_COMPOUND         12-01-1998
  DIGIT_COLON_COMPOUND        15:13
  ALL_CAPS_WORD               ABC
  LOWERCASE_WORD              tokenization
  FIRST_CAPITAL_WORD          Microsoft
  MIXED_WORD_FIRST_CAPITAL    GmbH
  MIXED_WORD_FIRST_LOWER      dKK
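A classifier for these token classes might look like the following sketch. The class names and example tokens come from the slide; the regular expressions and the `classify` function are assumptions inferred from the examples, not the definitions used by the SProUT tokenizer.

```python
import re

# Patterns for the numeric classes, inferred from the slide's examples.
NUMERIC_CLASSES = [
    ("NATURAL_NUMBER", re.compile(r"\d+")),
    ("FLOATING_POINT_NUMBER", re.compile(r"\d+,\d+")),
    ("NUMBER_PERCENT_COMPOUND", re.compile(r"\d+(?:,\d+)?%")),
    ("NUMBER_DOT_COMPOUND", re.compile(r"(?:\d+\.)+\d*")),
    ("NUMBER_WORD_COMPOUND", re.compile(r"\d+(?:,\d+)?-[^\W\d_]+")),
    ("DIGIT_SLASH_COMPOUND", re.compile(r"\d+(?:/\d+)+")),
    ("DIGIT_DASH_COMPOUND", re.compile(r"\d+(?:-\d+)+")),
    ("DIGIT_COLON_COMPOUND", re.compile(r"\d+(?::\d+)+")),
]


def classify(token):
    """Return the token class name, or None if no class applies."""
    for name, pattern in NUMERIC_CLASSES:
        if pattern.fullmatch(token):
            return name
    if token.isalpha():
        # str methods are Unicode-aware, so "München" is handled too.
        if token.isupper():
            return "ALL_CAPS_WORD"
        if token.islower():
            return "LOWERCASE_WORD"
        if token[0].isupper():
            if token[1:].islower():
                return "FIRST_CAPITAL_WORD"
            return "MIXED_WORD_FIRST_CAPITAL"
        return "MIXED_WORD_FIRST_LOWER"
    return None


examples = ["12344", "123,43", "34,4%", "2,4-fachen", "12-01-1998", "GmbH", "dKK"]
classes = [classify(tok) for tok in examples]
```

Because the classes depend only on character shape, the same classifier can serve every language that uses the Latin script, which is what makes sharing token classes across grammars possible.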

Shared Grammars
- SProUT supports re-use and extension of grammars
- This feature has been used for the development of multilingual parallel grammars for English, Spanish and French
- Common parts of the grammar for different languages (e.g. date formats like 20.10.2003) are stored in one file, and combined with the language-specific parts of the grammars (for structures like 20 de octubre del 2003)
- Common proper names such as Amsterdam are stored in a generic gazetteer, while language-specific names such as Brussels, Bruxelles, Bruselas are stored in language-specific lists
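The sharing scheme described on this slide can be illustrated with a minimal sketch in which a language's resources are the union of a common part and a language-specific part. Everything here is an assumption for illustration: SProUT grammars are written in XTDL rather than as regular expressions, and the names `COMMON_RULES`, `LANGUAGE_RULES`, `rules_for`, `find_dates` and `load_gazetteer` are hypothetical.

```python
import re

# Shared across languages: numeric dates such as 20.10.2003.
COMMON_RULES = [re.compile(r"\b\d{1,2}\.\d{1,2}\.\d{4}\b")]

# Language-specific additions: Spanish dates such as "20 de octubre del 2003".
LANGUAGE_RULES = {
    "es": [re.compile(r"\b\d{1,2} de [a-z]+ del? \d{4}\b")],
}

# Names written identically everywhere go into the generic gazetteer.
GENERIC_GAZETTEER = {"Amsterdam": "city"}
LANGUAGE_GAZETTEERS = {
    "en": {"Brussels": "city"},
    "fr": {"Bruxelles": "city"},
    "es": {"Bruselas": "city"},
}


def rules_for(lang):
    """Common rules first, then the rules specific to `lang`."""
    return COMMON_RULES + LANGUAGE_RULES.get(lang, [])


def find_dates(text, lang):
    """Collect all date strings matched by the rules for `lang`."""
    return [m.group(0) for rule in rules_for(lang) for m in rule.finditer(text)]


def load_gazetteer(lang):
    """Generic entries extended by the language-specific list."""
    merged = dict(GENERIC_GAZETTEER)
    merged.update(LANGUAGE_GAZETTEERS.get(lang, {}))
    return merged


spanish_dates = find_dates("El 20 de octubre del 2003 y el 20.10.2003", "es")
french_names = load_gazetteer("fr")
```

A change to the shared numeric date rule takes effect for every language at once, while the Spanish rule stays local to the Spanish grammar.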

Advantages of shared grammars
- Grammars are more easily re-usable and extensible
- Consistency is improved, as changes to shared structures need to be made in only one place
- Grammar development is more efficient, less time-consuming and less error-prone
- The same methodology has been applied for combining general-language grammars with domain-specific grammars

Re-use of corpora
- We use NE-annotated corpora for grammar development and evaluation of grammars
- Special-purpose annotation of corpora is only feasible for large-scale evaluations such as MUC, but exceeds the resources of most application-oriented projects
- Corpora from other projects are re-used in order to save labour and have larger evaluation resources
- There may be mismatches between corpus annotation and grammar output

Multilingual NE corpora
- English corpora from the MUC-7 evaluations
- Japanese and Chinese corpora annotated according to MUC-7 conventions
- German corpora annotated in the COLLATE project with a superset of MUC-7 annotations
- German, English, French and Spanish texts annotated with named entities, from the Joint Research Centre
- Spanish data from the CoNLL-2002 Language-Independent NER task
- English and French corpora from the business domain, annotated with named entities according to the MUC-7 guidelines within our project

Issues with re-use of corpora
- The corpora contain differences in
  - Annotation format
  - Types of named entities annotated
  - Attributes used to describe each NE
- Superficial differences in annotation format are handled by conversion to XML
- Differences in the content of the annotation are not handled by modification of the corpora, but rather by making our evaluation tool more flexible
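As an illustration of handling a superficial format difference, the sketch below converts column-style annotation, one `token tag` pair per line with `B-`/`I-`/`O` tags as in the CoNLL shared-task data, into inline XML. The target tag scheme (one element per entity class) and the function name `conll_to_xml` are assumptions; the slides do not show the converters actually used.

```python
from xml.sax.saxutils import escape


def conll_to_xml(lines):
    """Convert 'token tag' lines with B-/I-/O tags into inline XML markup."""
    pieces = []         # finished output: plain tokens and tagged entities
    current = []        # tokens of the entity currently being collected
    current_cls = None  # class of that entity, e.g. "LOC"

    def close():
        """Emit the entity collected so far, if any."""
        nonlocal current, current_cls
        if current:
            pieces.append(f"<{current_cls}>{' '.join(current)}</{current_cls}>")
        current, current_cls = [], None

    for line in lines:
        token, tag = line.split()
        token = escape(token)  # protect &, < and > in the text itself
        if tag.startswith("I-") and tag[2:] == current_cls:
            current.append(token)  # continuation of the open entity
            continue
        close()
        if tag == "O":
            pieces.append(token)
        else:
            # B-X, or a stray I-X with no matching open entity: start a new one.
            current, current_cls = [token], tag[2:]
    close()
    return " ".join(pieces)


sample = [
    "Melbourne B-LOC",
    "( O",
    "Australia B-LOC",
    ") O",
    "Nueva B-LOC",
    "York I-LOC",
]
converted = conll_to_xml(sample)
```

Once every corpus is available in one XML representation, the remaining differences are differences of content (entity classes and attributes), which the slide assigns to the evaluation tool.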

Outline of the Talk
- some facts and figures
- motivation for corpus annotation
- structure of annotated articles
- annotation of semantic relations
- annotation of named entities
- annotation of coreference
- cooperation

Some Facts and Figures
- annotation of business news
- up to now: 575 articles, i.e. 80502 tokens
- annotation + consistency check
- annotation of semantic relations, named entities and coreference
- stand-off markup in XML (XML Spy + MMAX)
- standards: MUC-7 (for named entities and coreference)
- annotation scheme + DTD

Motivation for Corpus Annotation
- data for testing and training statistically based methods in natural language processing
- basis for empirical language research

Structure of Annotated Articles
[Figure: structure of an annotated article, with one layer for semantic relations and one layer for named entities + coreference]

Annotation of Semantic Relations
Relation types: acquisition, company, corporateStructure, dividends, newBusiness, offer, occupation, premiumIncome, profit, relocation, revenue, turnover
Example text:
  Robert Bosch GmbH, Stuttgart: Der Kfz-Zulieferkonzern übernimmt zum 1. Januar die van Doorne's Transmissie b. v., Tilburg. Das niederländische Unternehmen, das im letzten Jahr mit 220 Mitarbeitern einen Umsatz von 45 Millionen DM erzielte, entwickelt stufenlose automatische Automobilgetriebe (CVT = Continuously Variable Transmission) und produziert Komponenten für CVT.
  (English: Robert Bosch GmbH, Stuttgart: The automotive supplier is taking over van Doorne's Transmissie b. v., Tilburg, on 1 January. The Dutch company, which last year achieved a turnover of 45 million DM with 220 employees, develops continuously variable automatic transmissions (CVT) and produces components for CVT.)
Annotation:
  <Firma Branche="Kfz-Zulieferkonzern" Firma="Robert Bosch" Rechtsform="GmbH" Sitz="Stuttgart"/>
  <Firma Firma="van Doorne's Transmissie" Land="NL" Rechtsform="b. v." Sitz="Tilburg"/>
  <Beschaeftigung Firma="van Doorne's Transmissie" Mitarbeiter="220"/>
  <Umsatz Betrag="45 Mill." Firma="van Doorne's Transmissie" Waehrung="DEM"/>
  <Uebernahme Kaeufer="Robert Bosch" Objekt="van Doorne's Transmissie"/>

Annotation of Named Entities
Named entity types: function, location, money, number, ordinalNumber, organization, percent, personName, productName, scaleUnit, time
Example text:
  Robert Bosch GmbH, Stuttgart: Der Kfz-Zulieferkonzern übernimmt zum 1. Januar die van Doorne's Transmissie b. v., Tilburg. Das niederländische Unternehmen, das im letzten Jahr mit 220 Mitarbeitern einen Umsatz von 45 Millionen DM erzielte, entwickelt stufenlose automatische Automobilgetriebe (CVT = Continuously Variable Transmission) und produziert Komponenten für CVT.
Annotated text (fragment):
  Robert Bosch GmbH, Stuttgart : Der Kfz-Zulieferkonzern übernimmt zum 1. Januar die <NE Organisation="van Doorne's Transmissie b. v.">van Doorne's Transmissie, Tilburg.

Annotation of Coreference
Annotated expression types:
- 3rd person personal pronouns
- 3rd person possessive pronouns and determiners
- demonstrative pronouns and determiners
- indefinite pronouns and determiners
- anaphoric and cataphoric adverbs
- elliptical nominal phrases
- anaphoric and cataphoric nominal phrases
Example text:
  LM Ericsson AB, Stockholm: Der schwedische Elektronikkonzern hat …
  (English: LM Ericsson AB, Stockholm: The Swedish electronics group has …)

Cooperation: Annotation of FR
Example sentence:
  SPD fordert Koalition zu Gespräch über Reform auf.
  (English: SPD calls on the coalition to hold talks on reform.)
XML fragment:
  <t id="s37_4" word="verkörpert" pos="VVFIN" morph="3.Sg.Pres.Ind" />......
Annotation layers:
- TIGER: syntactic annotation
- LLX: FrameNet annotation

Cooperation: Multi-layer Annotation
Annotation layers:
- TIGER: syntactic annotation
- LLX: FrameNet annotation
- COLLATE: semantic annotation
=> multi-layer annotated language resource
XML fragment:
  <t id="s37_4" word="verkörpert" pos="VVFIN" morph="3.Sg.Pres.Ind" />......
Example sentence:
  SPD fordert Koalition zu Gespräch über Reform auf.
  (English: SPD calls on the coalition to hold talks on reform.)
Example article:
  SZ 1995-03-31 vwd 22 Wirtschaft Firmenmeldungen
  Datev eG, Nürnberg: Der EDV-Dienstleister für Steuerberater hat 1994 den Umsatz laut vorläufigen Zahlen um 5% auf rund 980 Mill. DM gesteigert. Die Anzahl der Mitarbeiter ist auf 4605 (4474) Beschäftigte gestiegen, die Zahl der Genossenschaftsmitglieder zog auf 34246 (33551) an. Die Investitionen von 115 (93) Mill. DM haben sich in erster Linie auf die Modernisierung der Großrechner, den PC-Bereich sowie auf ein automatisches Versandlager konzentriert.
  (English: Datev eG, Nuremberg: According to preliminary figures, the IT service provider for tax consultants increased its turnover in 1994 by 5% to around 980 million DM. The number of employees rose to 4605 (4474), and the number of cooperative members rose to 34246 (33551). The investments of 115 (93) million DM were concentrated primarily on modernizing the mainframes, the PC area, and an automatic dispatch warehouse.)
Semantic relation annotation:
  <Firma Branche1="EDV-Dienstleister für Steuerberater" Firma="Datev eG" Sitz1="Nürnberg" Rechtsform="eG"/>
  <Umsatz Firma="Datev eG" Differenz="5%" Trend="plus" Betrag1="980 Mill." Waehrung1="DEM" Beschreibung1="rund" Zeit="1994"/>
  <Beschaeftigung Firma="Datev eG" Trend="plus" Mitarbeiter1_alt="4474" Mitarbeiter1_neu="4605" Zeit="1994"/>
Named entity annotation (fragment):
  Datev eG, Nürnberg : Der EDV-Dienstleister für Steuerberater hat 1994 den Umsatz laut vorläufigen Zahlen um 5% auf <NE Geld="rund 980 Mill. DEM">rund 980 Mill. DM gesteigert. Die Anzahl der Mitarbeiter ist auf 4605 ( 4474 ) Beschäftigte gestiegen, die Zahl der Genossenschaftsmitglieder zog auf 34246 ( 33551 ) an. Die Investitionen von <NE Geld="115 (93) Mill. DEM">115 (93) Mill. DM haben sich in erster Linie auf die Modernisierung der Großrechner, den PC-Bereich sowie auf ein automatisches Versandlager konzentriert.

Evaluation Tool: jTaCo
- Evaluates grammars with respect to an annotated corpus
- Removes annotations from the corpus, and feeds the unannotated text to the grammar
- Compares the grammar output with the original annotated texts
- Produces detailed statistics, evaluation scores, and diagnostic output
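The evaluation cycle listed on this slide can be sketched as follows. This is not jTaCo's code: the simplified inline tag format (`<class>text</class>`), the exact-match scoring, and the names `strip_annotations`, `toy_grammar` and `score` are assumptions made for illustration, and `toy_grammar` is a stand-in for a real NE grammar.

```python
import re

# Assumed simplified annotation format: <class>entity text</class>
TAG = re.compile(r"<(?P<cls>\w+)>(?P<text>.*?)</(?P=cls)>")


def strip_annotations(annotated):
    """Return (plain_text, gold), gold being a set of (start, end, class)."""
    parts, gold = [], set()
    pos = 0     # read position in the annotated text
    offset = 0  # length of the plain text produced so far
    for m in TAG.finditer(annotated):
        before = annotated[pos:m.start()]
        parts.append(before)
        offset += len(before)
        entity = m.group("text")
        gold.add((offset, offset + len(entity), m.group("cls")))
        parts.append(entity)
        offset += len(entity)
        pos = m.end()
    parts.append(annotated[pos:])
    return "".join(parts), gold


def toy_grammar(text):
    """Stand-in grammar: a two-entry lookup that gets one boundary wrong."""
    found = set()
    for name, cls in {"IBM": "organization", "York": "location"}.items():
        start = text.find(name)
        if start >= 0:
            found.add((start, start + len(name), cls))
    return found


def score(gold, predicted):
    """Precision, recall and F1 over exactly matching (start, end, class)."""
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


annotated = "<organization>IBM</organization> opened an office in <location>New York</location>."
plain, gold = strip_annotations(annotated)
precision, recall, f1 = score(gold, toy_grammar(plain))
```

With exact matching, the grammar's `York` does not count as a hit for the gold entity `New York`, so only one of two predictions is correct. Cases like this motivate the more tolerant matching that jTaCo can be configured to use.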

Configuration of jTaCo
jTaCo can be configured to deal with various problems in evaluating grammars with respect to a corpus:
- Use of different classes of NE, or different granularities (e.g. organization and subclasses company, university etc.)
  - Declaration of class equivalence and subclass relationships
- Extent of NE may be different (CEO Bill Gates vs. Bill Gates)
  - Left or right boundary may be mismatched
  - Size of allowable mismatch can be specified for each NE class
- Markup of corpus may be textually oriented (XML tags) while grammar output may be a different data structure (e.g. semantics encoded in a feature structure)
  - No general solution is possible
  - In the case of SProUT, feature structures are linked with input tokens, so that a correspondence can be established (under development)
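The first two configuration options can be pictured with a small matching sketch. It is a hypothetical model, not jTaCo's configuration format: the class `MatchConfig`, its field names, and the choice to measure boundary mismatch in characters are all assumptions, and the subclass table is assumed to be acyclic.

```python
from dataclasses import dataclass, field


@dataclass
class MatchConfig:
    # grammar class name -> equivalent corpus class name
    equivalent: dict = field(default_factory=dict)
    # subclass -> direct superclass (assumed acyclic)
    subclass_of: dict = field(default_factory=dict)
    # corpus class -> allowed boundary mismatch, here counted in characters
    tolerance: dict = field(default_factory=dict)

    def class_matches(self, gold_cls, pred_cls):
        """True if the predicted class equals the gold class or is a subclass of it."""
        cls = self.equivalent.get(pred_cls, pred_cls)
        while cls is not None:
            if cls == gold_cls:
                return True
            cls = self.subclass_of.get(cls)
        return False

    def spans_match(self, gold, pred):
        """Compare (start, end, class) triples under the configured tolerance."""
        g_start, g_end, g_cls = gold
        p_start, p_end, p_cls = pred
        if not self.class_matches(g_cls, p_cls):
            return False
        slack = self.tolerance.get(g_cls, 0)
        return abs(g_start - p_start) <= slack and abs(g_end - p_end) <= slack


config = MatchConfig(
    equivalent={"ne-person": "person"},
    subclass_of={"company": "organization"},
    tolerance={"person": 4},
)

# Text: "CEO Bill Gates". The corpus marks "Bill Gates", the grammar the whole phrase.
gold_span = (4, 14, "person")
grammar_span = (0, 14, "ne-person")
accepted = config.spans_match(gold_span, grammar_span)
```

With a tolerance of four characters for `person`, the grammar's longer span is accepted; with the default tolerance of zero it would be rejected. In this sketch a predicted subclass matches a gold superclass but not the other way round, which is one possible reading of the subclass option.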

Architecture of jTaCo

Conclusion
- We discussed a fundamental problem in re-using heterogeneously annotated corpora for multilingual grammar development
- With increasing availability of annotated corpora, re-use becomes attractive and cost-effective
- We described methods and tools for re-using annotated corpora for development and evaluation of NE grammars
