Foundations of Language Science and Technology - Corpus Linguistics - Silvia Hansen-Schirra.



Outline
- Why corpora, why interpreted corpora
- Many types of annotation
  - linguistic annotation
  - non-linguistic annotation
- New developments

Why corpora?
- Cognition: models of human language processing
- Engineering: language technology applications
- Linguistics: linguistic theory

Empirical linguistics
- corpus data
- experimental psycholinguistic data
- introspective data
=> DB of relevant data => research

Engineering motivation
- information extraction
- question answering
- statistical machine translation
- parser training and evaluation
=> increased need for deeply annotated corpora

Cognitive motivation
- experience-oriented frequency-based models
- models of gradient grammaticality
- metrics of complexity

Resource description metadata
- language: Spanish, English, German
- sublanguage/register: regional dialect, sociolect, vernacular, professional jargon, toddler speech
- text sort(s): newspaper articles, wire news, political speech, control commands
- subject domain: stock rates, flight reservations
- type of producers: professional journalist, student, radiologist
- mode of production: spoken, written, signed, morsed
- medium of production: pencil, PC with MS Word, dictaphone
- conditions of production: spontaneous, carefully composed, produced under time pressure
- transmission encoding: raw ASCII code, HTML, digitized phone signal, Unicode
- medium of transmission: telephone, WWW, CB radio
- storage encoding: raw ASCII code, HTML, AIFF
- medium of storage: DAT tape, CD-ROM, hard disk
- mode of presentation: spoken, written, signed
- medium of presentation: newspaper, radio, book, TV show, theater performance
- type of intended recipients: newspaper reader, booking agent, theater audience
- number of intended recipients: point-to-point, multicast, broadcast
- synchronicity of discourse: synchronous dialogue, asynchronous
- direction: one-way, two-way
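Metadata of this kind is easiest to use when it is stored as a structured record per resource. The following is a minimal sketch, assuming a small subset of the descriptors listed above; the class name `CorpusMetadata`, its field names, and the `select` helper are hypothetical and not part of any standard metadata scheme.

```python
from dataclasses import dataclass, field


@dataclass
class CorpusMetadata:
    """A few descriptors from the slide; field names are illustrative only."""
    language: str
    register: str
    text_sorts: list[str] = field(default_factory=list)
    mode_of_production: str = "written"
    storage_encoding: str = "raw ASCII code"


def select(corpora, **criteria):
    """Return the records whose attributes equal all of the given criteria."""
    return [
        c for c in corpora
        if all(getattr(c, key) == value for key, value in criteria.items())
    ]


# Two made-up resource descriptions built from values that appear on the slide.
records = [
    CorpusMetadata("German", "professional jargon", ["newspaper articles"]),
    CorpusMetadata("English", "vernacular", ["political speech"],
                   mode_of_production="spoken"),
]

spoken_english = select(records, language="English", mode_of_production="spoken")
```

With records like these, a resource can be selected by any combination of descriptors before the text itself is touched.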

Linguistic annotation
- part-of-speech tags
- word sense information
- morphosyntactic features of words
- constituent structures for phrases or sentences
- coreference markers
- dependency structures
- predicate-argument structures
- reference identifications for term phrases
- information structures within sentences
- intonation contours
- speech acts
- discourse relations, discourse structures

Other annotations
- judgements of native speakers on the acceptability or appropriateness of the utterance
- information on speaker(s)
- information on hearer(s) or intended audience
- information on the utterance situation (time, place, circumstances)
- information on the published source
- typographic information
- layout and document structure
- textual transcriptions of spoken utterances
- transcription of pauses
- error tagging

Raw vs. linguistically interpreted corpora
search term: word=form
  ...play a significant part in determining growth and form.
  ...each molecule can form four hydrogen bonds...
vs.
search term: word=form & pos=N
  ...play a significant part in determining growth and form.
search term: word=form & pos=V
  ...each molecule can form four hydrogen bonds...
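The contrast between the two kinds of query can be sketched in Python. This is a minimal illustration, not the query engine of any particular corpus tool: the token lists, the `search` helper, and the placeholder tag `X` are assumptions made for the example, while the tags `N` and `V` mirror the search terms on the slide.

```python
# Hypothetical mini-corpus: each sentence is a list of (word, pos) pairs.
# "N" and "V" follow the slide; "X" is a placeholder for all other tags.
corpus = [
    [("growth", "N"), ("and", "X"), ("form", "N")],
    [("molecule", "N"), ("can", "X"), ("form", "V"), ("bonds", "N")],
]


def search(sentences, word, pos=None):
    """Return indices of sentences containing `word`, optionally restricted to `pos`."""
    hits = []
    for i, sentence in enumerate(sentences):
        for token_word, token_pos in sentence:
            if token_word == word and (pos is None or token_pos == pos):
                hits.append(i)
                break
    return hits


raw_hits = search(corpus, "form")             # word=form: both readings match
noun_hits = search(corpus, "form", pos="N")   # word=form & pos=N
verb_hits = search(corpus, "form", pos="V")   # word=form & pos=V
```

Without the `pos` restriction the query cannot separate the noun from the verb; with it, each query returns only the intended reading.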

Raw vs. linguistically interpreted corpora
search term: is *ed
  Alpha interferon is produced by white blood cells...
search term: were *ed
  In the late 1970s interferons were hailed as "wonder drugs"...
vs.
search term: pos=VB {0,1} pos=VVN
  Gamma is not induced by viruses at all...
  So interferons could be described as the antibiotics of the virus...
  Only two of these have yet been identified...
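The same comparison can be tried on passive constructions. The sketch below approximates the surface queries with a regular expression and the tag query with a small matcher. Several things here are assumptions: reading `{0,1}` as "at most one intervening token", treating `pos=VB` as a prefix match on the tag, and the individual tags in the `tagged` list, which are illustrative rather than taken from a real corpus.

```python
import re

# Surface search on raw text: "is *ed" / "were *ed" approximated by a regex.
raw_pattern = re.compile(r"\b(?:is|were) \w+ed\b")

sentences = [
    "Alpha interferon is produced by white blood cells",
    "Gamma is not induced by viruses at all",
    "Only two of these have yet been identified",
]
raw_hits = [s for s in sentences if raw_pattern.search(s)]

# Tagged text: hypothetical (word, tag) pairs for fragments of the same sentences.
tagged = [
    [("interferon", "NN1"), ("is", "VBZ"), ("produced", "VVN")],
    [("Gamma", "NP0"), ("is", "VBZ"), ("not", "XX0"), ("induced", "VVN")],
    [("have", "VHB"), ("yet", "AV0"), ("been", "VBN"), ("identified", "VVN")],
]


def matches_passive(sentence, max_gap=1):
    """True if a VB* token is followed by a VVN token with at most `max_gap` tokens between."""
    tags = [tag for _, tag in sentence]
    for i, tag in enumerate(tags):
        if tag.startswith("VB"):
            # Candidate positions: directly after the VB* token, or after the gap.
            for j in range(i + 1, min(i + 2 + max_gap, len(tags))):
                if tags[j] == "VVN":
                    return True
    return False


pos_hits = [i for i, sentence in enumerate(tagged) if matches_passive(sentence)]
```

The regex only finds the first sentence, because "not" intervenes in the second and the auxiliary is "been" in the third. The tag-based matcher finds all three.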

Syntactically annotated corpora: treebanks
- German treebank project: TiGer Treebank
- English reference treebank: Penn Treebank
- Treebank + semantic information: Prague Dependency Bank

TiGer Treebank
Example sentence: Im nächsten Jahr will die Regierung ihre Reformpläne umsetzen.

  Token        POS      Morphology       Lemma
  Im           APPRART  Dat              in
  nächsten     ADJA     Sup.Dat.Sg.Neut  nahe
  Jahr         NN       Dat.Sg.Neut      Jahr
  will         VMFIN    3.Sg.Pres.Ind    wollen
  die          ART      Nom.Sg.Fem       die
  Regierung    NN       Nom.Sg.Fem       Regierung
  ihre         PPOSAT   Acc.Pl.Masc      ihr
  Reformpläne  NN       Acc.Pl.Masc      Plan
  umsetzen     VVINF    Inf              umsetzen
  .            $.

Annotation layers shown on the tree:
- annotation on word level: part-of-speech, morphology, lemmata
- node labels: phrase categories (S, VP, NP, PP)
- edge labels: syntactic functions (HD, SB, OC, OA, MO, AC, NK)
- crossing branches for discontinuous constituency types
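Crossing branches arise when the words of one constituent are not adjacent in the sentence. The sketch below shows one way to detect this in Python. The tree encoding and the attachment of each phrase are assumptions made for illustration: the PP "Im nächsten Jahr" is attached to the VP as a modifier (MO), and the VP to S as a clausal object (OC), which is one plausible reading of the node and edge labels on the slide.

```python
# Token positions for the example sentence (final punctuation omitted).
tokens = ["Im", "nächsten", "Jahr", "will", "die", "Regierung",
          "ihre", "Reformpläne", "umsetzen"]

# Assumed tree: (category, [(edge_label, child), ...]).
# A child is either another node or an integer token index.
pp = ("PP", [("AC", 0), ("NK", 1), ("NK", 2)])
np_subj = ("NP", [("NK", 4), ("NK", 5)])
np_obj = ("NP", [("NK", 6), ("NK", 7)])
vp = ("VP", [("MO", pp), ("OA", np_obj), ("HD", 8)])
s = ("S", [("HD", 3), ("SB", np_subj), ("OC", vp)])


def token_yield(node):
    """Collect the sorted token indices dominated by a node."""
    if isinstance(node, int):
        return [node]
    _, children = node
    return sorted(i for _, child in children for i in token_yield(child))


def is_discontinuous(node):
    """A constituent is discontinuous if its token yield has gaps."""
    span = token_yield(node)
    return span != list(range(span[0], span[-1] + 1))


vp_words = [tokens[i] for i in token_yield(vp)]
```

Under these assumptions the VP covers "Im nächsten Jahr" and "ihre Reformpläne umsetzen" but not the finite verb and subject in between, so `is_discontinuous(vp)` is true while S and the PP are contiguous.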

Penn Treebank

  ( (S
      (NP-SBJ
        (NP (NNP Pierre) (NNP Vinken) )
        (, ,)
        (ADJP
          (NP (CD 61) (NNS years) )
          (JJ old) )
        (, ,) )
      (VP (MD will)
        (VP (VB join)
          (NP (DT the) (NN board) )
          (PP-CLR (IN as)
            (NP (DT a) (JJ nonexecutive) (NN director) ))
          (NP-TMP (NNP Nov.) (CD 29) )))
      (. .) ))

Annotation layers shown in the bracketing:
- annotation on word level: part-of-speech (e.g., NNP, MD, VB)
- phrase categories (e.g., S, NP, VP, PP, ADJP)
- syntactic functions (function tags such as -SBJ, -CLR, -TMP)
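Because the bracketing is a plain nested-parenthesis format, the three annotation layers can be read off with a few lines of code. The following is a minimal sketch of a reader for this one example, not a full Penn Treebank parser; the helper names `parse`, `tagged_words`, and `function_tags` are hypothetical.

```python
import re

PTB = """( (S (NP-SBJ (NP (NNP Pierre) (NNP Vinken) ) (, ,)
 (ADJP (NP (CD 61) (NNS years) ) (JJ old) ) (, ,) )
 (VP (MD will) (VP (VB join) (NP (DT the) (NN board) )
 (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director) ))
 (NP-TMP (NNP Nov.) (CD 29) ))) (. .) ))"""


def parse(text):
    """Parse a bracketed tree into nested lists: [label, child, ...]; leaves are strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        node = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(read())
            else:
                node.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing ")"
        return node

    return read()


def tagged_words(node):
    """Yield (word, pos_tag) pairs from preterminal nodes such as ['NNP', 'Pierre']."""
    if len(node) == 2 and isinstance(node[0], str) and isinstance(node[1], str):
        yield node[1], node[0]
    else:
        for child in node:
            if isinstance(child, list):
                yield from tagged_words(child)


def function_tags(node):
    """Collect phrase labels that carry a function tag, e.g. NP-SBJ."""
    found = []
    if node and isinstance(node[0], str) and "-" in node[0] and isinstance(node[-1], list):
        found.append(node[0])
    for child in node:
        if isinstance(child, list):
            found.extend(function_tags(child))
    return found


tree = parse(PTB)
words = list(tagged_words(tree))
```

Here `words` holds the word-level layer as (word, tag) pairs, and `function_tags(tree)` returns the phrase labels that carry a syntactic function.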

Prague Dependency Bank
Example (word, English gloss, labels):

  chce        wants      Sb
  Kdo         who        Sb    ACT.T
  investovat  to-invest  Obj   ACT.VOL.T
  ste         hundred    Obj   RESTR.F
  korun       crowns     Atr   PAT.F
  do          to         AuxP
  automobilu  car        Adv   DIR.F

Annotation layers:
- annotation on word level: lemmata, morphology
- syntactic functions
- dependency structure
- semantic information on constituent roles, theme/rheme, etc.
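A dependency structure can be stored very compactly: every word points to the word that governs it. The sketch below encodes the example this way in Python. The head indices are an assumption made for illustration (the verb "chce" as root, the numeral governing "korun", the preposition governing "automobilu"); they are not copied from the treebank.

```python
# Each entry: (form, gloss, head), where head is the index of the governing
# word and None marks the root. Head indices are assumed for illustration.
nodes = [
    ("chce", "wants", None),
    ("Kdo", "who", 0),
    ("investovat", "to-invest", 0),
    ("ste", "hundred", 2),
    ("korun", "crowns", 3),
    ("do", "to", 2),
    ("automobilu", "car", 5),
]


def dependents(head):
    """Return the forms directly governed by the node at index `head`."""
    return [form for form, _, h in nodes if h == head]


def path_to_root(index):
    """Follow head links from a node up to the root and return the forms visited."""
    path = []
    while index is not None:
        form, _, head = nodes[index]
        path.append(form)
        index = head
    return path
```

Under these assumptions, `dependents(0)` lists the words governed by the main verb, and `path_to_root(6)` walks from "automobilu" through the preposition and the infinitive up to "chce".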

New developments
- historical dimension (e.g., Corpus of the History of German Language)
- multilayer stand-off linguistic markup
- multimodal markup/interpretation
- new types of treebanks:
  - CS treebanks with dependency links (NEGRA, TIGER)
  - machine-annotated corpora for statistical training (e.g., Redwoods Treebank)
  - Dependency (Tree)Banks (Prague, PARC)
  - Grammatical Relation (Tree)Banks (Briscoe & Carroll)