Corpus based Creation and Extension of Domain-Specific Resources Manuela Kunze, Dietmar Rösner University of Magdeburg.


1 Corpus based Creation and Extension of Domain-Specific Resources Manuela Kunze, Dietmar Rösner University of Magdeburg

2 Manuela Kunze2 Overview Background: Corpus Characteristics Experiment 1: Context-related Derivation of Concepts Experiment 2: Clustering of Values

3 Corpus: Forensic Autopsy Protocols. Different document parts: findings, histological findings, background, discussion, …

4 Autopsy Protocols: Findings
- short linguistic structures
- typical attribute-value structures, expressed by:
  - noun phrases: Unterblutung des Gewebes/Bleeding of tissue. Oberlippenbart/Upper lip beard.
  - noun phrase + verb/adjective/noun phrase: Mund geschlossen./Mouth closed. Nebennieren ohne Besonderheiten./Adrenal glands without anomalies.
- Usable for the extension of the resources in combination with GermaNet?

5 Corpus: 400 protocols, parsed with a context-free grammar (ca. 40 rules). Focus of the analyses: complex noun phrases (derivation of concepts) and attribute-value structures (clustering of values).

6 Overview: Corpus Characteristics; Experiment 1: Context-related Derivation of Concepts; Experiment 2: Clustering of Values

7 Approach
- analysis of high-frequency complex noun phrases
- example: Bruch des/der … (fracture of …): 749 occurrences, 93 types
- known (31): Rippe/rib (254), Brustbein/sternum (65), Wirbelsäule/spine (58), Schambein/pubic bone (30), Schulterblatt/shoulder blade (23), …
- unknown (62): Schädeldach/calvarium (43), Oberschenkelknochen/femur (37), Schädelbasis/base of the skull (34), Schlüsselbein/clavicle (33), Brustwirbelsäule/thoracic spine (28), Halswirbelsäule/cervical spine (26), …

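8 Idea: Analysis of Complex Noun Phrases
- keyword + complement: 'fracture of' + complement
- in GermaNet: class of a known complement of 'fracture of'
- in corpus: 'fracture of' + unknown complement
- deduce: class of the unknown complement == class of the known complements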

9 Approach
1. start from the known complement types of a keyword (31 complement types): Rippe/rib (254), Brustbein/sternum (65), Wirbelsäule/spine (58), Schambein/pubic bone (30), Schulterblatt/shoulder blade (23), …
2. collect all (GermaNet) senses (36 senses), e.g. Finger => Gliedmaße, Extremität; Finger => Computerprogramm, Programm; Rippe => Knochen, Gebein; …
3. determine the most frequent top level category T (here: noun.body)
4. remove senses which are not assigned the preferred top level category (27 senses remain)
5. collect all semantic classes from the hypernym graph for each sense (22 different semantic classes), e.g. Rippe => Knochen, Gebein => Hornsubstanz => Körpersubstanz => Stoff1, Substanz, Materie => Objekt
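
The sense-filtering steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the toy sense inventory, the shortened hypernym chains, the placeholder category `noun.other`, and all function names are assumptions made for the example, and no real GermaNet API is used.

```python
from collections import Counter

# Hypothetical toy sense inventory: (lemma, top-level category, hypernym chain).
# "noun.other" is a placeholder label; the chains are shortened for illustration.
SENSES = [
    ("Rippe", "noun.body", ["Knochen", "Körpersubstanz", "Objekt"]),
    ("Brustbein", "noun.body", ["Knochen", "Körpersubstanz", "Objekt"]),
    ("Finger", "noun.body", ["Gliedmaße", "Objekt"]),
    ("Finger", "noun.other", ["Computerprogramm", "Programm"]),
]


def preferred_category(senses):
    """Return the most frequent top-level category among all senses."""
    counts = Counter(category for _, category, _ in senses)
    return counts.most_common(1)[0][0]


def filter_senses(senses):
    """Keep only senses assigned to the preferred top-level category."""
    top = preferred_category(senses)
    return [sense for sense in senses if sense[1] == top]


def collect_classes(senses):
    """Collect every semantic class on the hypernym chains of the senses."""
    classes = set()
    for _, _, chain in senses:
        classes.update(chain)
    return classes


kept = filter_senses(SENSES)
print(preferred_category(SENSES), len(kept), sorted(collect_classes(kept)))
```

With this toy data the 'Computerprogramm' sense of Finger is discarded before the semantic classes are collected, which is the effect the filtering step is meant to have.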

10 Approach
- collect all semantic classes from the hypernym graph (22 different semantic classes), e.g. Rippe => Knochen, Gebein => Hornsubstanz => Körpersubstanz => Stoff1, Substanz, Materie => Objekt
- for each semantic class sc: determine the level in the hypernym tree (f_sc) and count occurrences (n_sc)
- select the maximum of (f_sc * n_sc) / N, where N is the number of all semantic classes
- result: most specific semantic class: Knochen
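
The selection criterion (f_sc * n_sc) / N can be worked through on toy data. The sketch below assumes that the level f_sc grows with depth (root = 1), so that more specific classes are favoured, and that N is the number of distinct semantic classes; the slide does not define either detail, and the hypernym chains are shortened, hypothetical examples.

```python
from collections import Counter

# Hypothetical hypernym chains: most specific class first, root last.
CHAINS = {
    "Rippe": ["Knochen", "Körpersubstanz", "Objekt"],
    "Brustbein": ["Knochen", "Körpersubstanz", "Objekt"],
    "Finger": ["Gliedmaße", "Körperteil", "Objekt"],
}


def best_class(chains):
    """Score each semantic class by (f_sc * n_sc) / N and return the best one."""
    level = {}         # f_sc: assumed depth below the root (root = 1)
    count = Counter()  # n_sc: number of chains that contain the class
    for chain in chains.values():
        depth = len(chain)
        for position, sc in enumerate(chain):
            level[sc] = max(level.get(sc, 0), depth - position)
            count[sc] += 1
    n_classes = len(count)  # N: assumed to be the number of distinct classes
    scores = {sc: level[sc] * count[sc] / n_classes for sc in count}
    return max(scores, key=scores.get), scores


best, scores = best_class(CHAINS)
print(best, scores[best])
```

On this data 'Knochen' scores 3 * 2 / 5 = 1.2, ahead of the more frequent but less specific 'Objekt' at 1 * 3 / 5 = 0.6.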

11 Results: 85 % correct assignments (types), 94 % correct assignments (tokens). Erroneous cases: correct assignments to wrong complements; wrong assignments to correct complements.

12 Results: Erroneous Cases
- correct assignments to wrong complements:
  - misspelling of tokens: Oberschenkelknorren
  - erroneous fragments from the treatment of German truncations: Bruch des Ober- und Unterarmes
  - erroneous syntactic analysis of the second NP: Bruch der Wandung der …
- wrong assignments to correct complements:
  - (complex) systems of bones, cartilages, connective tissues: elbow joint

13 Overview: Corpus Characteristics; Experiment 1: Context-related Derivation of Concepts; Experiment 2: Clustering of Values

14 Clustering of Values
conceptual analysis of linguistic structures:
- Mund geschlossen./Mouth closed.
- Rachenschleimhaut duesterrot./Mucosa of fauces dark red.
- Beckengeruest festgefuegt und unversehrt./Pelvis closely joined and entire.
- Herzohren frei, ovales Vorhoffenster geschlossen./Auricles of heart clear, oval atrium closed.
- Brustbein, Rippen und Wirbelsaeule intakt./Sternum, ribs and spine intact.
- Brustkorb sehr schmal und leicht eindrueckbar./Thorax very narrow and easy to press in.
- Nebennieren ohne Besonderheiten./Adrenal glands without anomalies.
- …
1908 concepts, e.g.: Mund/mouth; Rachenschleimhaut/mucosa of fauces; Beckengeruest/pelvis; Herzohren, Vorhoffenster/auricles of heart, atrium; Brustbein, Rippen, Wirbelsaeule/sternum, ribs, spine; Brustkorb/thorax; Nebennieren/adrenal glands
2098 different (linguistic) values, e.g.: geschlossen/closed; duesterrot/dark red; festgefuegt, unversehrt/closely joined, entire; frei, geschlossen/clear, closed; intakt/intact; sehr schmal, leicht eindrueckbar/very narrow, easy to press in; ohne Besonderheiten/without anomalies
Questions: Do similar concepts have the same attributes? What are the values for an attribute?

15 Relations Between Values. Do the values describe different attributes (color, shape etc.)? If not, are the values paraphrases/synonyms, antonyms, or values of an open range? Which lexical or conceptual relations exist between the values, e.g. synonyms, antonyms etc.? Approach: clustering of values.

16 Examples. Mund/mouth: deutlich geoeffnet; fischmaulartig geoeffnet; schlotartig geoeffnet; ruesselartig geoeffnet; froschmaulartig geoeffnet; ovalaer geoeffnet; geoeffnet; spaltfoermig geoeffnet; geschlossen. Observation: different kinds of 'opened' vs. closed.

17 Examples. Milzgewebe/spleen tissue: nicht sehr blutreich; fest; deutlich gelockert; stark gelockert; relativ gelockert; verhaertet; gelockert; leicht gelockert; blutreich; sehr blutarm; faeulnisbedingt gelockert; etwas faeulnisbedingt aufgelockert; sehr blutreich. Observation: the values describe two attributes, concentration of blood and consistency/form of tissue.

18 Examples. Wirbelsaeule/spine: ebenfalls unversehrt; ebenfalls intakt; intakt; unversehrt; ohne Besonderheiten; ohne Verletzungen. Observation: same findings.

19 Approach: comparison of the values of a concept, in several steps: 1. character-based: via bigrams; 2. lexical-conceptual relations: available information in GermaNet
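
A character-based comparison via bigrams could look like the sketch below. The slides do not say which similarity measure was used, so the Dice coefficient over sets of character bigrams is an assumption, as are the function names.

```python
def bigrams(text):
    """Return the set of character bigrams of a lower-cased string."""
    s = text.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice(a, b):
    """Dice coefficient over character bigram sets (assumed measure)."""
    x, y = bigrams(a), bigrams(b)
    if not x or not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


# Values of the concept Mund/mouth taken from the examples in the slides.
print(dice("geoeffnet", "leicht geoeffnet"))
print(dice("geoeffnet", "geschlossen"))
```

Values that differ only by a modifier share most of their bigrams and score high, while 'geoeffnet' and 'geschlossen' share almost none, which is why a second, lexical step is needed for antonyms.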

20 Approach
- input: values of a concept
- removing negations: 'kein', 'nicht', …
- removing modifiers: particles (sehr, sonst, ebenfalls); adjectives with suffixes (-artig, -lich, -ig)
- result: 'corrected' values, example: 'sonst unauffällig' becomes 'unauffällig'
- comparison of the corrected values: bigrams of values; lexical/conceptual relations in GermaNet? compound?
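
The removal of negations and particles can be sketched as a small normalisation function. The word lists come from the slide, except 'keine', which is added here as an assumption; returning a negation flag next to the corrected value is also an assumption, and the suffix-based removal of adjectives (-artig, -lich, -ig) is left out of this sketch.

```python
NEGATIONS = {"kein", "keine", "nicht"}      # 'keine' is an added assumption
PARTICLES = {"sehr", "sonst", "ebenfalls"}  # particles named on the slide


def normalize(value):
    """Strip negations and particles; return (corrected value, negated?)."""
    tokens = value.lower().split()
    negated = any(token in NEGATIONS for token in tokens)
    core = [t for t in tokens if t not in NEGATIONS and t not in PARTICLES]
    return " ".join(core), negated


print(normalize("sonst unauffaellig"))
print(normalize("nicht sehr muskelkraeftig"))
```

Keeping the negation flag separate means that 'nicht glaenzend' and 'glaenzend' can end up in the same cluster while remaining distinguishable.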

21 Results: Character-based Analysis
similar values with modifications (particles) and negations:
- selbst unauffaellig; sonst unauffaellig; unauffaellig
- glaenzend; nicht glaenzend
- geoeffnet; leicht geoeffnet; rundlich geoeffnet; spaltfoermig geoeffnet; spaltweit geoeffnet; froschmaulartig geoeffnet; … geoeffnet
- sehr muskelkraeftig; nicht sehr muskelstark; muskelkraeftig; nicht sehr muskelkraeftig; nicht muskelkraeftig
- blutreich; nicht-sehr-blutreich; sehr-blutreich
- blutarm; relativ-blutarm
- muskelschwach; sehr-muskelschwach
- geschlossen; spaltfoermig-geschlossen

22 Integration of GermaNet: search for relations between two tokens or between parts of tokens; queries about coordinate terms, synonyms, hypernyms, hyponyms, antonyms

23 Results with GermaNet
- sehr muskelkraeftig/very strong muscle vs. sehr muskelschwach/very weak muscle: bigrams: [value missing]; antonym: kraeftig vs. schwach
- blutarm/bloodless vs. blutreich/blood-rich: bigrams: [value missing]; GermaNet: antonym: arm vs. reich
- feucht/wet vs. sehr trocken/very dry: bigrams: [value missing]; GermaNet: coordinate terms, antonym
- sehr gross/very great vs. sehr weit/very broad: bigrams: [value missing]; GermaNet: hypernym
- frei/free vs. größtenteils vorhanden/mostly existent: bigrams: [value missing]; GermaNet: coordinate terms
- keine Schwellung/no swelling vs. keine Verletzung/no trauma: bigrams: 0.42, 0.4; GermaNet: hypernym

24 Results: Character-based + GermaNet
- selbst unauffaellig; sonst unauffaellig; unauffaellig
- glaenzend; nicht glaenzend
- blutreich; nicht-sehr-blutreich; sehr-blutreich; blutarm; relativ-blutarm
- sehr muskelkraeftig; nicht sehr muskelstark; muskelkraeftig; nicht sehr muskelkraeftig; nicht muskelkraeftig; muskelschwach; sehr-muskelschwach
- geoeffnet; leicht geoeffnet; rundlich geoeffnet; spaltfoermig geoeffnet; spaltweit geoeffnet; froschmaulartig geoeffnet; … geoeffnet; geschlossen; spaltfoermig-geschlossen

25 Problem: Paraphrases. Wirbelsaeule/spine: intakt; unversehrt; ohne Besonderheiten; ohne Verletzungen. Observation: same findings. Status: future work.

26 Idea: Detection of Paraphrases/Synonyms. Use document information + corpus information: analyse the value sets of a document; compare the value sets of a concept described in different documents. Assumption: values which are synonyms or antonyms don't occur in the same document. Example: 'Spine closely joined and entire.': closely joined and entire are different attributes.

27 Idea: Detection of Paraphrases/Synonyms
- collect all values for a concept: candidates
- values for the concept 'spine' in documents AP#1 … AP#n: entire; closely joined; broken; intact
- candidates: intact == broken? intact == entire/closely joined? intact == entire?

28 Idea: Detection of Paraphrases/Synonyms
- removing of candidates:
  - only one paraphrase: bleedings or without bleedings
  - antonyms
  - closely joined vs. entire occur in the same document (for a concept); prefer: entire (number of occurrences); assumption: closely joined is an 'additional' attribute
- selection of candidates (restrictions): only frequent values; similar number of occurrences?
- verification of results: obtain value sets of other concepts which have similar values

29 Problems: Detection of Paraphrases. A value can be expressed by more than one value: 'value 1' == 'value 2' + 'value 3'. The result (set of paraphrases for a value) can contain antonyms.

30 Detection of Paraphrases/Synonyms: solutions?
- integration of other resources: UMLS
- extension of GermaNet
GermaNet output:
1 sense of unversehrt
Sense 1: unverletzt, unversehrt => heil => gesund => ?krankheitsspezifisch => ?körperzustandsspezifisch => ?körperspezifisch
1 sense of intakt
Sense 1: intakt, ganz1, funktionstüchtig, funktionsfähig => ?funktionalitätsspezifisch => ?relationsspezifisch
same meaning?

31 Conclusion: experiments about corpus-based semiautomatic extension of GermaNet: analysis of complex noun phrases (detection and transfer of GermaNet classes); clustering of values (bigrams, using GermaNet information)

32 Improvement: wrong splitting based on wrong parsing results; GermaNet interface: treatment of umlauts, inflectional suffixes, selection of the relevant tokens, selection of the correct sense

33 Example
- Brustwirbelsaeule festgefuegt und unversehrt: {festgefuegt, unversehrt/entire}
- Brustwirbelsaeule intakt: {intakt/intact}
- Brustwirbelsaeule festgefuegt und unversehrt: {festgefuegt, unversehrt/entire}
- Brustwirbelsaeule intakt: {intakt/intact}
- Die Brustwirbelsaeule ist zweifach gebrochen: {gebrochen/broken}
- Brustwirbelsaeule unversehrt: {unversehrt/entire}
- …

34 Idea: Detection of Paraphrases/Synonyms. Restriction: candidates are frequent phrases in a value set. Assumptions: paraphrases don't occur in one set (document); paraphrases have a similar number of occurrences. Verification: obtain other value sets.
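
The restriction and assumptions on this slide can be turned into a simple candidate filter. The following is only a sketch under stated assumptions: the thresholds `min_count` and `max_ratio` are hypothetical parameters, and the value sets are modelled on the Brustwirbelsaeule example, with one set per document.

```python
from collections import Counter
from itertools import combinations

# One value set per document for a single concept (modelled on the example).
VALUE_SETS = [
    {"festgefuegt", "unversehrt"},
    {"intakt"},
    {"festgefuegt", "unversehrt"},
    {"intakt"},
    {"gebrochen"},
    {"unversehrt"},
]


def paraphrase_candidates(value_sets, min_count=2, max_ratio=2.0):
    """Pair frequent values that never share a document and occur similarly often."""
    freq = Counter(value for values in value_sets for value in values)
    frequent = sorted(v for v, n in freq.items() if n >= min_count)
    pairs = []
    for a, b in combinations(frequent, 2):
        # Assumption from the slide: paraphrases do not occur in one document.
        if any(a in values and b in values for values in value_sets):
            continue
        high, low = max(freq[a], freq[b]), min(freq[a], freq[b])
        if high / low <= max_ratio:
            pairs.append((a, b))
    return pairs


print(paraphrase_candidates(VALUE_SETS))
```

On this data the filter proposes intakt/unversehrt but also festgefuegt/intakt, and it cannot tell paraphrases from antonyms, so the verification step against other value sets remains necessary.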

35 Idea: Analysis of Complex Noun Phrases: high-frequency complex noun phrases. Patterns: NP + NP (genitive); NP + NP (genitive) + PP+; NP + PP+

