Published by: Carla Weit. Modified more than 10 years ago.
1
Foundations of Language Science and Technology - Corpus Linguistics - Silvia Hansen-Schirra
2
Outline
- Why corpora, why interpreted corpora
- Many types of annotation
  - linguistic annotation
  - non-linguistic annotation
- New developments
3
Why corpora?
- Cognition: models of human language processing
- Engineering: language technology applications
- Linguistics: linguistic theory
4
Empirical linguistics
- data sources: corpus data, experimental psycholinguistic data, introspective data
- DB of relevant data
- research
5
Engineering motivation
- information extraction
- question answering
- statistical machine translation
- parser training and evaluation
=> increased need for deeply annotated corpora
6
Cognitive motivation
- experience-oriented frequency-based models
- models of gradient grammaticality
- metrics of complexity
7
Resource description metadata
- language: Spanish, English, German
- sublanguage/register: regional dialect, sociolect, vernacular, professional jargon, toddler speech
- text sort(s): newspaper articles, wire news, political speech, control commands
- subject domain: stock rates, flight reservations
- type of producers: professional journalist, student, radiologist
- mode of production: spoken, written, signed, morsed
- medium of production: pencil, PC with MS Word, dictaphone
- conditions of production: spontaneous, carefully composed, produced under time pressure
- transmission encoding: raw ASCII code, HTML, digitized phone signal, Unicode
- medium of transmission: telephone, WWW, CB radio
- storage encoding: raw ASCII code, HTML, AIFF
- medium of storage: DAT tape, CD-ROM, hard disk
- mode of presentation: spoken, written, signed
- medium of presentation: newspaper, radio, book, TV show, theater performance
- type of intended recipients: newspaper reader, booking agent, theater audience
- number of intended recipients: point-to-point, multicast, broadcast
- synchronicity of discourse: synchronous dialogue, asynchronous
- direction: one-way, two-way
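The metadata categories above can be stored as a structured record so that corpus texts can be filtered before any linguistic query is run. The following is only a sketch: the class name `ResourceDescription`, its field names, the helper `matches`, and the sample values are assumptions chosen for illustration, and only a subset of the categories is modelled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ResourceDescription:
    """Hypothetical metadata record; fields paraphrase some slide categories."""
    language: str
    register: str
    text_sorts: List[str]
    subject_domain: str
    mode_of_production: str
    storage_encoding: str
    direction: str = "one-way"


def matches(record: ResourceDescription, **criteria: object) -> bool:
    """Return True if every given field has exactly the requested value."""
    return all(getattr(record, name) == value for name, value in criteria.items())


# Illustrative record; the values are taken from the example lists on the slide.
record = ResourceDescription(
    language="German",
    register="professional jargon",
    text_sorts=["newspaper articles"],
    subject_domain="stock rates",
    mode_of_production="written",
    storage_encoding="HTML",
)

print(matches(record, language="German", mode_of_production="written"))
```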
8
Linguistic annotation
- part-of-speech tags
- word sense information
- morphosyntactic features of words
- constituent structures for phrases or sentences
- coreference markers
- dependency structures
- predicate-argument structures
- reference identifications for term phrases
- information structures within sentences
- intonation contours
- speech acts
- discourse relations, discourse structures
9
Other annotations
- judgements of native speakers on the acceptability or appropriateness of the utterance
- information on speaker(s)
- information on hearer(s) or intended audience
- information on the utterance situation (time, place, circumstances)
- information on the published source
- typographic information
- layout and document structure
- textual transcriptions of spoken utterances
- transcription of pauses
- error tagging
10
Raw vs. linguistically interpreted corpora
search term: word=form
  ...play a significant part in determining growth and form.
  ...each molecule can form four hydrogen bonds...
vs.
search term: word=form & pos=N
  ...play a significant part in determining growth and form.
search term: word=form & pos=V
  ...each molecule can form four hydrogen bonds...
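The difference between the two kinds of search can be sketched in a few lines. This is a minimal illustration, not a real corpus query engine: the function `search`, the toy `corpus`, and every tag other than N and V are assumptions made up for the example.

```python
# Each sentence is a list of (word, pos) pairs. Only the tags N (noun) and
# V (verb) come from the slide; the other tags are placeholders.
corpus = [
    [("growth", "N"), ("and", "C"), ("form", "N")],
    [("each", "D"), ("molecule", "N"), ("can", "V"), ("form", "V"),
     ("four", "M"), ("hydrogen", "N"), ("bonds", "N")],
]


def search(sentences, word, pos=None):
    """Return sentences containing `word`, optionally restricted to one POS tag."""
    hits = []
    for sent in sentences:
        if any(w.lower() == word and (pos is None or p == pos) for w, p in sent):
            hits.append(" ".join(w for w, _ in sent))
    return hits


print(search(corpus, "form"))            # raw search: noun and verb readings mixed
print(search(corpus, "form", pos="N"))   # interpreted search: noun reading only
print(search(corpus, "form", pos="V"))   # interpreted search: verb reading only
```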
11
Raw vs. linguistically interpreted corpora
search term: is *ed
  Alpha interferon is produced by white blood cells...
search term: were *ed
  In the late 1970s interferons were hailed as "wonder drugs"...
vs.
search term: pos=VB {0,1} pos=VVN
  Gamma is not induced by viruses at all...
  So interferons could be described as the antibiotics of the virus...
  Only two of these have yet been identified...
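The passive example can be sketched the same way: a string pattern over raw text only finds "is/were" directly followed by a word ending in "-ed", while a pattern over tags finds any form of "be" followed, with at most one intervening token, by a past participle. The tags below are assumed to follow a CLAWS C5-style tagset (VBZ, VBI, VBN for forms of "be", VVN for past participles); the slide does not name its tagset, and the function names and shortened sentences are illustrative.

```python
import re

# Raw search: "is" or "were" directly followed by a word ending in "ed".
RAW_PATTERN = re.compile(r"\b(?:is|were) \w+ed\b")

# Tagged sentences (shortened from the slide); tags are an assumption.
tagged = [
    [("interferon", "NN1"), ("is", "VBZ"), ("produced", "VVN")],
    [("Gamma", "NP0"), ("is", "VBZ"), ("not", "XX0"), ("induced", "VVN")],
    [("interferons", "NN2"), ("could", "VM0"), ("be", "VBI"), ("described", "VVN")],
    [("two", "CRD"), ("have", "VHB"), ("yet", "AV0"), ("been", "VBN"),
     ("identified", "VVN")],
]
raw = [" ".join(word for word, _ in sent) for sent in tagged]


def raw_hits(sentences):
    """Sentences matched by the plain string pattern."""
    return [s for s in sentences if RAW_PATTERN.search(s)]


def tagged_hits(tagged_sentences):
    """Sentences matching pos=VB* {0,1} pos=VVN on the tag sequence."""
    hits = []
    for sent in tagged_sentences:
        tags = [tag for _, tag in sent]
        for i, tag in enumerate(tags):
            # A participle directly after the "be" form, or one token later.
            if tag.startswith("VB") and "VVN" in tags[i + 1 : i + 3]:
                hits.append(" ".join(word for word, _ in sent))
                break
    return hits


print(raw_hits(raw))
print(tagged_hits(tagged))
```

On this toy data the raw pattern finds only the first sentence, whereas the tag-based pattern finds all four passives.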
12
Syntactically annotated corpora: treebanks
- German treebank project: TiGer Treebank
- English reference treebank: Penn Treebank
- Treebank + semantic information: Prague Dependency Bank
13
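TiGer Treebank
Sentence: Im nächsten Jahr will die Regierung ihre Reformpläne umsetzen. ("Next year the government wants to implement its reform plans.")
Word          POS       Morphology        Lemma
Im            APPRART   Dat               in
nächsten      ADJA      Sup.Dat.Sg.Neut   nahe
Jahr          NN        Dat.Pl.Neut       Jahr
will          VMFIN     3.Sg.Pres.Ind     wollen
die           ART       Nom.Sg.Fem        die
Regierung     NN        Nom.Sg.Fem        Regierung
ihre          PPOSAT    Acc.Pl.Masc       ihr
Reformpläne   NN        Acc.Pl.Masc       Plan
umsetzen      VVINF     Inf               umsetzen
.             $.
Node labels: S, VP, NP, PP
Edge labels: HD, SB, OC, OA, MO, AC, NK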
14
TiGer Treebank
annotation on word level: part-of-speech, morphology, lemmata
(same annotated tree as on slide 13)
15
TiGer Treebank
node labels: phrase categories (S, VP, NP, PP)
(same annotated tree as on slide 13)
16
TiGer Treebank
edge labels: syntactic functions (HD, SB, OC, OA, MO, AC, NK)
(same annotated tree as on slide 13)
17
TiGer Treebank
crossing branches for discontinuous constituency types
(same annotated tree as on slide 13)
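Crossing branches arise when the words of a constituent are not adjacent in the sentence. The sketch below encodes the example sentence as a tree of (category, children) nodes with edge labels and checks which nodes have a non-contiguous yield. The attachments (which phrase hangs under which node) are a reconstruction from the node and edge labels listed on the slide, not a copy of the original figure, and the function names are illustrative.

```python
TOKENS = ["Im", "nächsten", "Jahr", "will", "die", "Regierung",
          "ihre", "Reformpläne", "umsetzen"]

# A node is (category, [(edge_label, child), ...]); a child is either a
# token index or another node. The structure is an assumed reconstruction.
PP = ("PP", [("AC", 0), ("NK", 1), ("NK", 2)])
NP_SUBJECT = ("NP", [("NK", 4), ("NK", 5)])
NP_OBJECT = ("NP", [("NK", 6), ("NK", 7)])
VP = ("VP", [("MO", PP), ("OA", NP_OBJECT), ("HD", 8)])
S = ("S", [("HD", 3), ("SB", NP_SUBJECT), ("OC", VP)])


def terminals(node):
    """Sorted token indices dominated by a node."""
    if isinstance(node, int):
        return [node]
    indices = []
    for _, child in node[1]:
        indices.extend(terminals(child))
    return sorted(indices)


def is_discontinuous(node):
    """True if the node's tokens do not form one contiguous span."""
    indices = terminals(node)
    return indices[-1] - indices[0] + 1 != len(indices)


def discontinuous_nodes(node):
    """Categories of all discontinuous nodes, in top-down order."""
    if isinstance(node, int):
        return []
    found = [node[0]] if is_discontinuous(node) else []
    for _, child in node[1]:
        found.extend(discontinuous_nodes(child))
    return found


print(discontinuous_nodes(S))
print([TOKENS[i] for i in terminals(VP)])
```

Under this reconstruction the VP covers "Im nächsten Jahr" and "ihre Reformpläne umsetzen" but not the intervening "will die Regierung", which is what a drawing of the tree shows as crossing branches.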
18
Penn Treebank
( (S
    (NP-SBJ
      (NP (NNP Pierre) (NNP Vinken) )
      (, ,)
      (ADJP
        (NP (CD 61) (NNS years) )
        (JJ old) )
      (, ,) )
    (VP (MD will)
      (VP (VB join)
        (NP (DT the) (NN board) )
        (PP-CLR (IN as)
          (NP (DT a) (JJ nonexecutive) (NN director) ))
        (NP-TMP (NNP Nov.) (CD 29) )))
    (. .) ))
19
Penn Treebank
annotation on word level: part-of-speech (NNP, CD, NNS, JJ, MD, VB, DT, NN, IN)
(same bracketed tree as on slide 18)
20
Penn Treebank
phrase categories (S, NP, ADJP, VP, PP)
(same bracketed tree as on slide 18)
21
Penn Treebank
syntactic functions (-SBJ, -CLR, -TMP)
(same bracketed tree as on slide 18)
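The three annotation layers of the bracketed format (part-of-speech tags, phrase categories, function tags) can be separated with a small reader. This is a minimal sketch written for the example sentence only; `parse` and `layers` are illustrative names, and the reader does not handle special Penn Treebank labels such as traces.

```python
import re

PTB = """( (S (NP-SBJ (NP (NNP Pierre) (NNP Vinken) ) (, ,)
  (ADJP (NP (CD 61) (NNS years) ) (JJ old) ) (, ,) )
  (VP (MD will) (VP (VB join) (NP (DT the) (NN board) )
  (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director) ))
  (NP-TMP (NNP Nov.) (CD 29) ))) (. .) ))"""


def parse(text):
    """Parse one bracketed tree into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip the opening "("
        label = ""
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])   # a word
                pos += 1
        pos += 1                      # skip the closing ")"
        return (label, children)

    return node()


def layers(tree):
    """Split a tree into (word, POS) pairs, phrase categories, function tags."""
    pos_tags, phrases, functions = [], [], []

    def walk(node):
        label, children = node
        if len(children) == 1 and isinstance(children[0], str):
            pos_tags.append((children[0], label))   # preterminal: POS over word
            return
        if label:
            category, _, function = label.partition("-")
            phrases.append(category)
            if function:
                functions.append(function)
        for child in children:
            walk(child)

    walk(tree)
    return pos_tags, phrases, functions


pos_tags, phrases, functions = layers(parse(PTB))
print(pos_tags[:2])
print(sorted(set(phrases)))
print(functions)
```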
22
Prague Dependency Bank
Dependency tree nodes (word, gloss, labels):
- chce (wants): Sb
- Kdo (who): Sb, ACT.T
- investovat (to-invest): Obj, ACT.VOL.T
- sto (hundred): Obj, RESTR.F
- korun (crowns): Atr, PAT.F
- do (to): AuxP
- automobilu (car): Adv, DIR.F
23
Prague Dependency Bank
annotation on word level: lemmata, morphology
(same dependency tree as on slide 22)
24
Prague Dependency Bank
syntactic functions
(same dependency tree as on slide 22)
25
Prague Dependency Bank
dependency structure
(same dependency tree as on slide 22)
26
Prague Dependency Bank
semantic information on constituent roles, theme/rheme, etc.
(same dependency tree as on slide 22)
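Labels such as PAT.F combine a semantic role with a theme/rheme marker. The sketch below splits three of the labels from the slide into these two parts. Reading the final T as topic (theme) and F as focus (rheme) is an assumption based on the slide caption, and the function name `split_label` is illustrative.

```python
# Labels copied from the slide for three of the words.
SLIDE_LABELS = {
    "Kdo": "ACT.T",
    "korun": "PAT.F",
    "automobilu": "DIR.F",
}

# Assumed reading of the final marker.
TOPIC_FOCUS = {"T": "topic", "F": "focus"}


def split_label(label):
    """Split 'ROLE.X' into the role and the assumed topic/focus value."""
    role, marker = label.rsplit(".", 1)
    return role, TOPIC_FOCUS[marker]


for word, label in SLIDE_LABELS.items():
    print(word, split_label(label))
```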
27
New developments
- historical dimension (e.g., Corpus of the History of German Language)
- multilayer stand-off linguistic markup
- multimodal markup/interpretation
- new types of treebanks:
  - CS treebanks with dependency links (NEGRA, TIGER)
  - machine-annotated corpora for statistical training (e.g., Redwoods Treebank)
  - Dependency (Tree)Banks (Prague, PARC)
  - Grammatical Relation (Tree)Banks (Briscoe & Carroll)
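In stand-off markup the text itself is left untouched and each annotation layer is stored separately, pointing back into the text or into a lower layer. The following sketch shows the idea with three layers for a fragment of the Penn Treebank example sentence; the variable names, the layer layout, and the helper functions are assumptions made for illustration, not the format of any particular annotation tool.

```python
TEXT = "Pierre Vinken will join the board"


def tokenize(text):
    """Layer 1: token spans as (start, end) character offsets into the text."""
    spans, position = [], 0
    for word in text.split():
        start = text.index(word, position)
        end = start + len(word)
        spans.append((start, end))
        position = end
    return spans


TOKENS = tokenize(TEXT)

# Layer 2: part-of-speech tags keyed by token id (tags as in the Penn example).
POS = {0: "NNP", 1: "NNP", 2: "MD", 3: "VB", 4: "DT", 5: "NN"}

# Layer 3: phrases as (category, first token id, last token id).
PHRASES = [("NP", 0, 1), ("NP", 4, 5)]


def phrase_text(category, first, last):
    """Recover the surface string of a phrase from the untouched text."""
    return TEXT[TOKENS[first][0]:TOKENS[last][1]]


for phrase in PHRASES:
    print(phrase[0], repr(phrase_text(*phrase)))
```

Because each layer only stores references, a further layer (for example coreference or discourse relations) can be added without changing the text or the existing layers.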