Foundations of Language Science and Technology - Corpus Linguistics - Silvia Hansen-Schirra.


1 Foundations of Language Science and Technology - Corpus Linguistics - Silvia Hansen-Schirra

2 Outline
- Why corpora, why interpreted corpora
- Many types of annotation: linguistic annotation, non-linguistic annotation
- New developments

3 Why corpora?
- Cognition: models of human language processing
- Engineering: language technology applications
- Linguistics: linguistic theory

4 Empirical linguistics
- corpus data
- experimental psycholinguistic data
- introspective data
- DB of relevant data
- research

5 Engineering motivation
- information extraction
- question-answering
- statistical machine translation
- parser training and evaluation
=> increased need for deeply annotated corpora

6 Cognitive motivation
- experience-oriented frequency-based models
- models of gradient grammaticality
- metrics of complexity

7 Resource description metadata
- language: Spanish, English, German
- sublanguage/register: regional dialect, sociolect, vernacular, professional jargon, toddler speech
- text sort(s): newspaper articles, wire news, political speech, control commands
- subject domain: stock rates, flight reservations
- type of producers: professional journalist, student, radiologist
- mode of production: spoken, written, signed, Morse-coded
- medium of production: pencil, PC with MS Word, dictaphone
- conditions of production: spontaneous, carefully composed, produced under time pressure
- transmission encoding: raw ASCII code, HTML, digitized phone signal, Unicode
- medium of transmission: telephone, WWW, CB radio
- storage encoding: raw ASCII code, HTML, AIFF
- medium of storage: DAT tape, CD-ROM, hard disk
- mode of presentation: spoken, written, signed
- medium of presentation: newspaper, radio, book, TV show, theater performance
- type of intended recipients: newspaper reader, booking agent, theater audience
- number of intended recipients: point-to-point, multicast, broadcast
- synchronicity of discourse: synchronous dialogue, asynchronous
- direction: one-way, two-way
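
The metadata categories on this slide could be held in a simple record type so that a collection of resources can be filtered by them. The following is only a minimal sketch: the class name `ResourceMetadata`, the helper `matches`, the choice of seven fields, and the two catalogue entries are all hypothetical and merely combine example values from the slide.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ResourceMetadata:
    """Hypothetical record covering a subset of the metadata fields on the slide."""
    language: str
    register: str
    text_sort: str
    subject_domain: str
    mode_of_production: str
    storage_encoding: str
    direction: str


def matches(record, **criteria):
    """Return True if every named field of the record equals the requested value."""
    fields = asdict(record)
    return all(fields.get(name) == value for name, value in criteria.items())


# Two made-up catalogue entries built from example values on the slide.
catalogue = [
    ResourceMetadata("German", "professional jargon", "newspaper articles",
                     "stock rates", "written", "HTML", "one-way"),
    ResourceMetadata("English", "vernacular", "control commands",
                     "flight reservations", "spoken", "AIFF", "two-way"),
]

spoken = [r for r in catalogue if matches(r, mode_of_production="spoken")]
print([r.language for r in spoken])
```

A flat record like this is enough for filtering; a real metadata scheme would likely allow several values per field.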

8 Linguistic annotation part-of-speech tags, word sense information, morphosyntactic features of words, constituent structures for phrases or sentences, coreference markers, dependency structures, predicate-argument structures, reference identifications for term phrases, information structures within sentences, intonation contours, speech acts, discourse relations - discourse structures.

9 Other annotations judgements of native speakers on the acceptability or appropriateness of the utterance, information on speaker(s), information on hearer(s) or intended audience, information on the utterance situation (time, place, circumstances), information on the published source, typographic information, layout and document structure, textual transcriptions of spoken utterances, transcription of pauses, error tagging.

10 Raw vs. linguistically interpreted corpora
search term: word=form
  ...play a significant part in determining growth and form.
  ...each molecule can form four hydrogen bonds...
vs.
search term: word=form & pos=N
  ...play a significant part in determining growth and form.
search term: word=form & pos=V
  ...each molecule can form four hydrogen bonds...
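
The difference between the two kinds of search can be illustrated with a few lines of Python. This is a sketch under stated assumptions: the function name `search` is hypothetical, and apart from `N` and `V`, which come from the slide's search terms, the coarse tags (`DET`, `ADJ`, `PREP`, `CONJ`, `NUM`) and the tagging of each word are made up for illustration.

```python
# Each sentence is a list of (word, pos) pairs.
# N and V follow the slide; all other tags are illustrative assumptions.
corpus = [
    [("play", "V"), ("a", "DET"), ("significant", "ADJ"), ("part", "N"),
     ("in", "PREP"), ("determining", "V"), ("growth", "N"), ("and", "CONJ"),
     ("form", "N")],
    [("each", "DET"), ("molecule", "N"), ("can", "V"), ("form", "V"),
     ("four", "NUM"), ("hydrogen", "N"), ("bonds", "N")],
]


def search(sentences, word, pos=None):
    """Return sentences containing `word`, optionally restricted to one POS tag."""
    hits = []
    for sentence in sentences:
        if any(w == word and (pos is None or p == pos) for w, p in sentence):
            hits.append(" ".join(w for w, _ in sentence))
    return hits


print(search(corpus, "form"))        # raw search: both sentences
print(search(corpus, "form", "N"))   # noun reading only
print(search(corpus, "form", "V"))   # verb reading only
```

Without the tag, the query cannot separate the noun from the verb; with it, each reading is retrieved on its own.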

11 Raw vs. linguistically interpreted corpora
search term: is *ed
  Alpha interferon is produced by white blood cells...
search term: were *ed
  In the late 1970s interferons were hailed as "wonder drugs"...
vs.
search term: pos=VB {0,1} pos=VVN
  Gamma is not induced by viruses at all...
  So interferons could be described as the antibiotics of the virus...
  Only two of these have yet been identified...
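
A small sketch can show why the tag-based query finds passives that the string-based query misses. It rests on two assumptions about the slide's notation: `pos=VB` is read as "any tag beginning with VB", and `{0,1}` as "at most one intervening token". The function names and the tags assigned to each word are illustrative, not taken from a particular corpus release.

```python
import re

raw = [
    "Alpha interferon is produced by white blood cells",
    "Gamma is not induced by viruses at all",
    "So interferons could be described as the antibiotics of the virus",
]


def raw_search(sentences, auxiliary):
    """String search in the spirit of 'is *ed': auxiliary followed directly by a word ending in -ed."""
    pattern = re.compile(r"\b" + re.escape(auxiliary) + r" \w+ed\b")
    return [s for s in sentences if pattern.search(s)]


def find_passives(tagged, max_gap=1):
    """Find a VB* token followed by a VVN token with at most `max_gap` tokens in between."""
    hits = []
    for i, (_, pos) in enumerate(tagged):
        if not pos.startswith("VB"):
            continue
        for gap in range(max_gap + 1):
            j = i + 1 + gap
            if j < len(tagged) and tagged[j][1] == "VVN":
                hits.append(" ".join(w for w, _ in tagged[i:j + 1]))
                break
    return hits


# Illustrative taggings of sentence fragments (tags are assumptions).
tagged = [
    [("Alpha", "NP0"), ("interferon", "NN1"), ("is", "VBZ"), ("produced", "VVN")],
    [("Gamma", "NP0"), ("is", "VBZ"), ("not", "XX0"), ("induced", "VVN")],
    [("interferons", "NN2"), ("could", "VM0"), ("be", "VBI"), ("described", "VVN")],
]

print(raw_search(raw, "is"))
print([find_passives(sentence) for sentence in tagged])
```

The string query only catches the first sentence, while the tag query also covers the negated and the modal passive.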

12 Syntactically annotated corpora: treebanks
- German treebank project: TiGer Treebank
- English reference treebank: Penn Treebank
- Treebank + semantic information: Prague Dependency Bank

13 TiGer Treebank
[Figure: syntax tree for "Im nächsten Jahr will die Regierung ihre Reformpläne umsetzen."]
token        POS      morphology       lemma
Im           APPRART  Dat              in
nächsten     ADJA     Sup.Dat.Sg.Neut  nahe
Jahr         NN       Dat.Pl.Neut      Jahr
will         VMFIN    3.Sg.Pres.Ind    wollen
die          ART      Nom.Sg.Fem       die
Regierung    NN       Nom.Sg.Fem       Regierung
ihre         PPOSAT   Acc.Pl.Masc      ihr
Reformpläne  NN       Acc.Pl.Masc      Plan
umsetzen     VVINF    Inf              umsetzen
.            $.
node labels: S, VP, NP, PP
edge labels: HD, SB, OC, HD, OA, MO, AC, NK

14 TiGer Treebank (same example tree as slide 13): annotation on word level: part-of-speech, morphology, lemmata

15 TiGer Treebank (same example tree as slide 13): node labels: phrase categories

16 TiGer Treebank (same example tree as slide 13): edge labels: syntactic functions

17 TiGer Treebank (same example tree as slide 13): crossing branches for discontinuous constituency types
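
Crossing branches arise when the tokens of one constituent do not form an unbroken stretch of the sentence. The sketch below checks this for the example sentence "Im nächsten Jahr will die Regierung ihre Reformpläne umsetzen". The grouping of tokens into constituents is an assumed reading of the tree, since the transcript only preserves the labels and not the tree itself, and the names `constituents` and `is_discontinuous` are hypothetical.

```python
tokens = ["Im", "nächsten", "Jahr", "will", "die", "Regierung",
          "ihre", "Reformpläne", "umsetzen"]

# Assumed constituents, each given as the set of token positions it covers.
constituents = {
    "PP": {0, 1, 2},                # Im nächsten Jahr
    "NP_subject": {4, 5},           # die Regierung
    "NP_object": {6, 7},            # ihre Reformpläne
    "VP": {0, 1, 2, 6, 7, 8},       # PP + object NP + umsetzen
    "S": set(range(len(tokens))),   # whole sentence
}


def is_discontinuous(indices):
    """A constituent is discontinuous if its positions leave a gap."""
    return max(indices) - min(indices) + 1 != len(indices)


for label, indices in constituents.items():
    words = " ".join(tokens[i] for i in sorted(indices))
    print(label, is_discontinuous(indices), words)
```

Under this reading only the VP is discontinuous, because "will die Regierung" intervenes between its parts.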

18 Penn Treebank
( (S
    (NP-SBJ
      (NP (NNP Pierre) (NNP Vinken) )
      (, ,)
      (ADJP
        (NP (CD 61) (NNS years) )
        (JJ old) )
      (, ,) )
    (VP (MD will)
      (VP (VB join)
        (NP (DT the) (NN board) )
        (PP-CLR (IN as)
          (NP (DT a) (JJ nonexecutive) (NN director) ))
        (NP-TMP (NNP Nov.) (CD 29) )))
    (. .) ))

19 Penn Treebank (same bracketed parse as slide 18): annotation on word level: part-of-speech

20 Penn Treebank (same bracketed parse as slide 18): phrase categories

21 Penn Treebank (same bracketed parse as slide 18): syntactic functions
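
The three annotation layers highlighted on the Penn Treebank slides (part-of-speech, phrase categories, syntactic functions) can all be read off the bracketed string. The following is a minimal sketch with a hand-written reader rather than any particular treebank library; the function names are hypothetical, and the string assumes the usual spacing `(, ,)` and `(. .)` for punctuation.

```python
import re

PARSE = ("( (S (NP-SBJ (NP (NNP Pierre) (NNP Vinken) ) (, ,) "
         "(ADJP (NP (CD 61) (NNS years) ) (JJ old) ) (, ,) ) "
         "(VP (MD will) (VP (VB join) (NP (DT the) (NN board) ) "
         "(PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director) )) "
         "(NP-TMP (NNP Nov.) (CD 29) ))) (. .) ))")


def parse(text):
    """Read a bracketed parse into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        pos += 1                      # skip "("
        label = ""
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return read()


def preterminals(node):
    """Word level: list of (word, part-of-speech tag) pairs."""
    label, children = node
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    return [pair for child in children for pair in preterminals(child)]


def phrase_labels(node):
    """Phrase level: labels of all non-terminal nodes, in pre-order."""
    label, children = node
    if all(isinstance(child, str) for child in children):
        return []
    found = [label] if label else []
    for child in children:
        found.extend(phrase_labels(child))
    return found


tree = parse(PARSE)
functions = [label.split("-", 1) for label in phrase_labels(tree) if "-" in label]
print(preterminals(tree))
print(phrase_labels(tree))
print(functions)
```

Function tags such as `SBJ`, `CLR` and `TMP` are simply suffixes on the phrase label, which is why a string split is enough to separate category from function here.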

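22 Prague Dependency Bank
[Figure: dependency tree; nodes listed as word (gloss) with their labels as extracted]
chce (wants): Sb
Kdo (who): Sb, ACT.T
investovat (to-invest): Obj, ACT.VOL.T
ste (hundred): Obj, RESTR.F
korun (crowns): Atr, PAT.F
do (to): AuxP
automobilu (car): Adv, DIR.F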

23 Prague Dependency Bank (same example tree as slide 22): annotation on word level: lemmata, morphology

24 Prague Dependency Bank (same example tree as slide 22): syntactic functions

25 Prague Dependency Bank (same example tree as slide 22): dependency structure

26 Prague Dependency Bank (same example tree as slide 22): semantic information on constituent roles, theme/rheme, etc.
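
In a dependency bank each word points to its head instead of being grouped into phrases. The sketch below encodes the example words that way. Word forms, glosses and function labels are taken from the slide as transcribed; the word order and the head attachments are an assumed reading, because the tree itself did not survive in the transcript. The root is left without a function label for the same reason, and the helper names are hypothetical.

```python
# (id, form, gloss, head id, function); head id 0 marks the root.
# Word order and head attachments are assumptions, not taken from the slide.
nodes = [
    (1, "Kdo", "who", 2, "Sb"),
    (2, "chce", "wants", 0, None),
    (3, "investovat", "to-invest", 2, "Obj"),
    (4, "ste", "hundred", 3, "Obj"),
    (5, "korun", "crowns", 4, "Atr"),
    (6, "do", "to", 3, "AuxP"),
    (7, "automobilu", "car", 6, "Adv"),
]


def dependents(nodes, head_id):
    """Forms of all words directly attached to the given head."""
    return [form for _, form, _, head, _ in nodes if head == head_id]


def path_to_root(nodes, node_id):
    """Forms on the chain of heads from a word up to the root."""
    heads = {i: head for i, _, _, head, _ in nodes}
    forms = {i: form for i, form, _, _, _ in nodes}
    path = []
    while node_id != 0:
        path.append(forms[node_id])
        node_id = heads[node_id]
    return path


print(dependents(nodes, 2))
print(path_to_root(nodes, 7))
```

Storing only a head pointer per word keeps the structure flat; semantic labels such as `ACT` or `DIR` could be added as a further field in the same tuples.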

27 New developments
- historical dimension (e.g., Corpus of the History of German Language)
- multilayer stand-off linguistic markup
- multimodal markup/interpretation
- new types of treebanks:
  - CS treebanks with dependency links (NEGRA, TIGER)
  - machine-annotated corpora for statistical training (e.g., Redwoods Treebank)
  - Dependency (Tree)Banks (Prague, PARC)
  - Grammatical Relation (Tree)Banks (Briscoe & Carroll)

