Applied Linguistics Sprach- und Literaturwissenschaften Fachbereich 10 A brief introduction to the GeM annotation schema for complex document layout John.

A brief introduction to the GeM annotation schema for complex document layout
John Bateman, Renate Henschel, Judy Delin
Talk given by: Guowen Yang
Taipei, September 2002

Overview of Talk
Orientation:
  – describing the approach to annotation of multimodal documents developed in the GeM project
What is the GeM project?
  – goals, methods, requirements
Summary of annotation problems raised
Annotation solutions adopted

The GeM project (‘Genre and Multimodality’)
– supported by the British Economic and Social Research Council (ESRC)
– Cooperation:
    University of Stirling
    University of Bremen
    Enterprise Information Design Unit
– Goal: to put the description of multimodal page-based documents on a sound empirical footing

The problem of data
– there is now much theorizing about how multimodal documents work
– but the empirical basis of this theorizing is often less than strong

Basic GeM hypotheses
– Documents belonging to different kinds of ‘genres’ will exhibit different kinds of multimodal patterns, just as text sorts exhibit different lexicogrammatical patterns
– It should be possible to map out these patterns for different genres
– There should be a regular relationship between genre type and the patterns found

Requirement
– An annotated corpus needed to be constructed containing the extra information that we know/expect to be most useful in establishing descriptions of multimodal documents
– The extra information is then to serve as the basis for generalizations about genre

The problem of data selection and description
– what kinds of documents are we talking about?
– what kinds of annotation do we need?

What kinds of documents are we talking about?
Any page-based medium which combines information from a variety of modalities in order to get its message across

Initial genres selected for the GeM corpus
– field guides (birds)
– instruction manuals (telephones)
– print newspapers
– electronic web-based versions of newspapers

[Example slides, one per genre: Field guides; Instruction manuals; Print newspapers; Web-based newspapers]

Motivations for selections
– all contain combinations of graphical, textual, photographic material
– all use the layout of these elements in complex ways
– for all the documents taken we were able to obtain feedback and discussion from their designers

Some Relations to Natural Language Processing
– it is our belief that we can approach the design and function of these documents using established linguistic techniques
– the ‘unit of analysis’ is scaled up from the sentence or the text to the page (at least)
– given a formal specification of the motivation and realization of such documents, we can consider their automatic generation

What kinds of annotation do we need?

The GeM annotation layers
– Content structure
– Rhetorical structure
– Layout structure
– Navigation structure
– Linguistic structure
[Diagram labels on the second version of this slide: genre; form]

Practical information required
– the GeM model also takes seriously the notion that the concrete, practical conditions of production (technology, material, time available, etc.) all contribute substantially to the properties of a genre
– Canvas constraints
– Production constraints
– Consumption constraints

Requirements
– From the GeM perspective, a page-based multimodal document requires analysis at these levels at least, taking into account the sources of constraint identified.
– Only then do we have enough information to consider:
    motivation of design
    critique of design and communicative effectiveness
    repurposing
    automatic generation

Pointers
– The assumptions made, and the particular layers of analysis adopted, are motivated and introduced at length in:
    Delin/Bateman/Allen: Information Design Journal
    Delin/Bateman: Document Design
– Details on the website

Summary of annotation problems raised
– form of annotation to select
– criteria for recognising units
– multiple non-isomorphic intersecting hierarchies
– non-linear information
– complex query requirements

Annotation solutions adopted
– form of annotation to select
    TEI: Text Encoding Initiative
    CES: Corpus Encoding Standard
    XCES: XML version of CES
    GeM annotation scheme

Annotation solutions adopted
– criteria for recognising units
    basic vocabulary of the page: images, signs, sentences, numbers, ...
    layout units: hierarchy determined visually and by considering the degree to which elements ‘belong together’
    rhetorical structure: traditional analysis according to Mann & Thompson’s Rhetorical Structure Theory (RST)
    navigation units: elements pointing elsewhere in the document

Annotation solutions adopted
– multiple non-isomorphic intersecting hierarchies
    stand-off annotation...

XML stand-off annotation for encoding the GeM layers
– a single ‘base’ element annotated file
– several ‘stand-off’ layers of annotation
– a Document Type Definition (DTD) for each layer of annotation
– each annotation layer corresponds to a GeM analysis layer
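The stand-off idea can be sketched in a few lines of Python: one base file holds the numbered units, and any number of separate layers point back at those units by id. This is only an illustration under stated assumptions. The element names `base`, `unit`, `layout`, `rst` and `segment`, and all ids, are hypothetical and not taken from the GeM DTDs; only `layout-unit` and the `xref` attribute appear on the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical base file: element names and ids are assumptions for illustration.
BASE_XML = """<base>
  <unit id="u-1">Huge (90cm) unmistakable seabird.</unit>
  <unit id="u-2">Plunges spectacularly for fish.</unit>
  <unit id="u-3">Sexes similar.</unit>
</base>"""

# Two independent stand-off layers that group the same base units differently.
LAYOUT_XML = '<layout><layout-unit id="lay-1" xref="u-1 u-2 u-3"/></layout>'
RST_XML = '<rst><segment id="s-1" xref="u-1"/><segment id="s-2" xref="u-2 u-3"/></rst>'


def index_base(base_xml):
    """Map each base unit id to its text content."""
    root = ET.fromstring(base_xml)
    return {u.get("id"): (u.text or "") for u in root.iter("unit")}


def resolve(layer_xml, base_index):
    """For every element carrying an xref, look up the base units it points to."""
    root = ET.fromstring(layer_xml)
    resolved = {}
    for el in root.iter():
        xref = el.get("xref")
        if xref:
            resolved[el.get("id")] = [base_index[ref] for ref in xref.split()]
    return resolved


base = index_base(BASE_XML)
layout_view = resolve(LAYOUT_XML, base)
rst_view = resolve(RST_XML, base)
```

Because each layer only stores references, the layout grouping and the rhetorical grouping can overlap or cross freely without either file having to nest inside the other.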

GeM layers: the base file
    Huge (90cm) unmistakable seabird. Watch for white, cigar-shaped body and long straight, slender, black-tipped wings. In summer, yellow head of adult inconspicuous. Plunges spectacularly for fish. Sexes similar.
– Basic ‘vocabulary’ of the page, segmented and numbered.
– Actual ordering and positioning on the page irrelevant at this stage.
– Predominantly ‘flat’ structure.

Distribution of information across layers
[Diagram labels: base units; Layout; Semantic Content; RST segments; navigational elements; layout units]

Example: Derivation of layout structure
– Working visually from the page, decompose the objects on the page in terms of their visual unity
– Transform the page decomposition into a hierarchical structure
– Specify presentation information for units: e.g., font size, type, colour, image type, resolution, etc.

Complete layout structure
– provides a place for assigning specific information about the layout units
– contents given by collections of the base units of the page

GeM layers: layout units (1)
– The content of a layout unit is defined by cross-references (xrefs) to base units
– The content shown inside the element here is not formally used and may be omitted

    <layout-unit id="lay-flegg-text" xref="u-21.7 u-21.8 u-21.9 u u-21.11">
      Huge (90cm) unmistakable seabird. Watch for white, cigar-shaped body and long straight, slender, black-tipped wings. In summer, yellow head of adult inconspicuous. Plunges spectacularly for fish. Sexes similar.
    </layout-unit>
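Since a layout unit's content exists only through its xrefs, a stand-off layer is usable only if every reference resolves to a base unit. A minimal consistency check might look like the following sketch; the function name `dangling_xrefs` and the sample ids are hypothetical.

```python
import xml.etree.ElementTree as ET


def dangling_xrefs(layer_xml, base_ids):
    """Return (element id, reference) pairs whose reference is not a known base unit id."""
    root = ET.fromstring(layer_xml)
    missing = []
    for el in root.iter():
        for ref in (el.get("xref") or "").split():
            if ref not in base_ids:
                missing.append((el.get("id"), ref))
    return missing


# Hypothetical layer with one reference that has no matching base unit.
layer = '<layout><layout-unit id="lay-1" xref="u-1 u-9"/></layout>'
problems = dangling_xrefs(layer, {"u-1", "u-2"})
```

Running such a check after each annotation pass catches typing slips in xref lists before any analysis is built on them.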

GeM layers: layout units (2)
– Layout units contain typographical details common over the unit and its children
– Layout units again identified via cross-references
– Typographical information modelled on CSS and XSL:FO

    <text xref="lay lay lay lay lay-21.20" font-family="sans-serif" font-size="10" font-style="normal" font-weight="bold" case="mixed" justification="right" color="black"/>
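Because the typographical attributes are modelled on CSS, they can be turned into a CSS declaration string almost directly. The sketch below is an illustration only: treating `font-size` as points and mapping `justification` to CSS `text-align` are assumptions, and the `case` attribute is left out because it has no one-to-one CSS counterpart.

```python
import xml.etree.ElementTree as ET

# Attributes whose names and values are assumed to carry over to CSS unchanged.
CSS_PASSTHROUGH = ("font-family", "font-style", "font-weight", "color")


def to_css(text_el, size_unit="pt"):
    """Build a CSS declaration string from a <text> element's attributes."""
    decls = [f"{name}: {text_el.get(name)}" for name in CSS_PASSTHROUGH if text_el.get(name)]
    if text_el.get("font-size"):
        # Assumption: the bare number is a size in points.
        decls.append(f"font-size: {text_el.get('font-size')}{size_unit}")
    if text_el.get("justification"):
        # Assumption: justification corresponds to CSS text-align.
        decls.append(f"text-align: {text_el.get('justification')}")
    return "; ".join(decls)


# The xref value "lay-1" is a placeholder id.
el = ET.fromstring(
    '<text xref="lay-1" font-family="sans-serif" font-size="10" font-style="normal" '
    'font-weight="bold" case="mixed" justification="right" color="black"/>'
)
css = to_css(el)
```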

GeM layers: layout units (3)
– Layout structure is recursive
[Tree diagram nodes: page-21; header-21; body-21; page-no-21; lay-21.2; lay-21.3]
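A recursive layout structure maps naturally onto nested XML elements and a recursive walk. In the sketch below the nesting (header, body and page number under the page, with lay-21.2 and lay-21.3 under the body) is an assumption, since the slide's tree diagram did not survive, and using `layout-unit` for every node is likewise a simplification.

```python
import xml.etree.ElementTree as ET

# Assumed nesting; only the node ids come from the slide.
LAYOUT_XML = """<layout-unit id="page-21">
  <layout-unit id="header-21"/>
  <layout-unit id="body-21">
    <layout-unit id="lay-21.2"/>
    <layout-unit id="lay-21.3"/>
  </layout-unit>
  <layout-unit id="page-no-21"/>
</layout-unit>"""


def outline(el, depth=0):
    """Return the ids of el and all its descendants, indented by depth."""
    lines = ["  " * depth + el.get("id")]
    for child in el:
        lines.extend(outline(child, depth + 1))
    return lines


tree_lines = outline(ET.fromstring(LAYOUT_XML))
print("\n".join(tree_lines))
```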

Annotation solutions adopted
– non-linear information
    positioning of layout units within a page is specified two-dimensionally with respect to a generalized page model
    the page model decomposes the page area into a hierarchy of grids
    specifying the grid for a page is part of the annotation task

Example: Derivation of layout structure
– Working visually from the page, decompose the objects on the page in terms of their visual unity
– Transform the page decomposition into a hierarchical structure
– Specify presentation information for units: e.g., font size, type, colour, image type, resolution, etc.
– Inspect the page for any local or global grid structure
– Relate layout units to grid positions

Complete layout structure + page model
– each sub-tree can additionally be assigned to a position in a hierarchically ordered page grid

GeM layers: area model
– Layout units are related to identified elements of a hierarchical grid specified in the area model
[Figure labels: 85%, 5%, 10%; 14cm; 16cm]

    <area-root id="page-frame" cols="1" rows="3" hspacing="100" vspacing=" " height="16cm" width="14cm">
    <sub-area id="body-frame" location="row-2" cols="2" rows="1" hspacing="50 50" vspacing="100"/>
    <layout-leaf xref="header-21" location="row-1" area-ref="page-frame"/>
    ...
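One way to read the area model is that `hspacing` and `vspacing` list the share of the parent's width and height given to each column and row. Under that assumption (the slides do not state the units), the geometry of every grid cell follows by simple accumulation. The row split 25/50/25 below is hypothetical; the slide's own `vspacing` value did not survive extraction.

```python
def grid_cells(width, height, hspacing, vspacing):
    """Return {(row, col): (x, y, w, h)} for a grid whose column and row
    sizes are given as percentages of the parent area (1-based indices)."""
    cells = {}
    y = 0.0
    for row, v_pct in enumerate(vspacing, start=1):
        h = height * v_pct / 100
        x = 0.0
        for col, h_pct in enumerate(hspacing, start=1):
            w = width * h_pct / 100
            cells[(row, col)] = (x, y, w, h)
            x += w
        y += h
    return cells


# Page frame: 14cm x 16cm, one column, three rows (hypothetical row percentages).
page = grid_cells(14.0, 16.0, [100], [25, 50, 25])

# A sub-area placed in row 2 is itself a grid: two equal columns, one row.
_, _, body_w, body_h = page[(2, 1)]
body = grid_cells(body_w, body_h, [50, 50], [100])
```

Nesting calls like this mirrors the hierarchy of grids: each sub-area is laid out relative to the cell it occupies in its parent.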

Annotation solutions adopted
– complex query requirements
    XPath
    Queries using standard tools
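As a small illustration of querying with standard tools, Python's standard library supports a subset of XPath through `xml.etree.ElementTree`. The sketch below selects all `text` elements marked bold; the wrapper element `layout` and the xref ids are hypothetical, and a full XPath engine would allow far richer queries than this subset.

```python
import xml.etree.ElementTree as ET

# Hypothetical typography layer; only the <text> attribute names follow the slides.
LAYOUT_XML = """<layout>
  <text xref="lay-1" font-weight="bold" font-size="10"/>
  <text xref="lay-2" font-weight="normal" font-size="8"/>
  <text xref="lay-3" font-weight="bold" font-size="12"/>
</layout>"""

root = ET.fromstring(LAYOUT_XML)

# XPath predicate on an attribute value: every bold <text> element at any depth.
bold = [t.get("xref") for t in root.findall(".//text[@font-weight='bold']")]
```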

Conclusions
– The annotation scheme allows detailed annotation of complex page-based documents
– Regularities can be sought using complex XPath queries
– The system is open-ended and extensible without any redefinition of existing resources

Ongoing Work
Further collection and ongoing annotation of corpus
Use of results for criticism of document design and for exploring the relation between layout and rhetorical structure
  – Delin/Bateman: Document Design, 2002
Use of XPath queries within sequences of extensible style sheet transformations for automatic document generation
  – Henschel/Bateman/Delin: Konvens 2002, Saarbrücken

Future Work
Extension of annotation schemes
  – violations of the grid area model are currently handled by relative offsets; a more flexible approach is needed
  – non-rectilinear grids for more complex design
  – consideration of dynamic elements, animation, etc.
Extension of genres considered
  – advertisements
  – scientific documents
Extension of languages considered

Thank you!