2 Applied Linguistics Sprach- und Literaturwissenschaften Fachbereich 10
A brief introduction to the GeM annotation schema for complex document layout
John Bateman, Renate Henschel, Judy Delin
Talk given by: Guowen Yang
Taipei, September 2002

3 Overview of Talk
Orientation:
– describing the approach to annotation of multimodal documents developed in the GeM project
What is the GeM project?
– goals, methods, requirements
Summary of annotation problems raised
Annotation solutions adopted

4 The GeM project (‘Genre and Multimodality’)
– supported by the British Economic and Social Research Council (ESRC)
– Cooperation: University of Stirling, University of Bremen, Enterprise Information Design Unit
– Goal: to put the description of multimodal page-based documents on a sound empirical footing

5 The problem of data
– there is now much theorizing about how multimodal documents work
– but the empirical basis of this theorizing is often less than strong

6 Basic GeM hypotheses
– Documents belonging to different ‘genres’ will exhibit different kinds of multimodal patterns, just as text sorts exhibit different lexicogrammatical patterns
– It should be possible to map out these patterns for different genres
– There should be a regular relationship between genre type and the patterns found

7 Requirement
– An annotated corpus needed to be constructed, containing the extra information that we know or expect to be most useful in establishing descriptions of multimodal documents
– The extra information then serves as the basis for generalizations about genre

8 The problem of data selection and description
– what kinds of documents are we talking about?
– what kinds of annotation do we need?

9 What kinds of documents are we talking about?
Any page-based medium which combines information from a variety of modalities in order to get its message across

10 Initial genres selected for the GeM corpus
– field guides (birds)
– instruction manuals (telephones)
– print newspapers
– electronic web-based versions of newspapers

11 Field guides

12 Instruction manuals

13 Print newspapers

14 Web-based newspapers

15 Motivations for selections
– all contain combinations of graphical, textual and photographic material
– all use the layout of these elements in complex ways
– for all the documents selected we were able to obtain feedback and discussion from their designers

16 Some Relations to Natural Language Processing
– it is our belief that we can approach the design and function of these documents using established linguistic techniques
– the ‘unit of analysis’ is scaled up from the sentence or the text to the page (at least)
– given a formal specification of the motivation and realization of such documents, we can consider their automatic generation

17 The problem of data selection and description
– what kinds of documents are we talking about?
– what kinds of annotation do we need?

18 What kinds of annotation do we need?

19 The GeM annotation layers
Content structure
Rhetorical structure
Layout structure
Navigation structure
Linguistic structure

20 The GeM annotation layers
Content structure
Rhetorical structure
Layout structure
Navigation structure
Linguistic structure
Diagram labels: genre, form

21 Practical information required
– the GeM model also takes seriously the notion that the concrete, practical conditions of production (technology, material, time available, etc.) all contribute substantially to the properties of a genre
Canvas constraints
Production constraints
Consumption constraints

22 Requirements
– From the GeM perspective, a page-based multimodal document requires analysis at these levels at least, taking into account the sources of constraint identified.
– Only then do we have enough information to consider:
motivation of design
critique of design and communicative effectiveness
repurposing
automatic generation

23 Pointers
– The assumptions made, and the particular layers of analysis adopted, are motivated and introduced at length in:
Delin/Bateman/Allen: Information Design Journal
Delin/Bateman: Document Design
– Details on the website http://purl.org/net/gem

24 Overview of Talk
Orientation:
– describing the approach to annotation of multimodal documents developed in the GeM project
What is the GeM project?
– goals, methods, requirements
Summary of annotation problems raised
Annotation solutions adopted

26 Summary of annotation problems raised
– form of annotation to select
– criteria for recognising units
– multiple non-isomorphic intersecting hierarchies
– non-linear information
– complex query requirements

27 Annotation solutions adopted
– form of annotation to select
TEI: Text Encoding Initiative
CES: Corpus Encoding Standard
XCES: XML version of CES
GeM annotation scheme

28 Annotation solutions adopted
– criteria for recognising units
basic vocabulary of the page: images, signs, sentences, numbers, ...
layout units: hierarchy determined visually and by considering the degree to which elements ‘belong together’
rhetorical structure: traditional analysis according to Mann & Thompson’s rhetorical structure theory (RST)
navigation units: elements pointing elsewhere in the document

29 Annotation solutions adopted
– multiple non-isomorphic intersecting hierarchies
stand-off annotation ...

30 XML stand-off annotation for encoding the GeM layers
a single annotated ‘base’ file of base units
several ‘stand-off’ layers of annotation
a Document Type Definition (DTD) for each layer of annotation
each annotation layer corresponds to a GeM analysis layer
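The stand-off arrangement on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the GeM tooling: the element names `unit` and `segment`, the wrapper elements `base`, `layout` and `rst`, and all ids are hypothetical; only the `xref` mechanism and the `layout-unit` element name appear in the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical base file: a flat list of numbered units.
# The element name <unit> and the ids are assumptions for this sketch.
BASE = """<base>
  <unit id="u-1">Huge (90cm) unmistakable seabird.</unit>
  <unit id="u-2">Plunges spectacularly for fish.</unit>
  <unit id="u-3">Sexes similar.</unit>
</base>"""

# Two hypothetical stand-off layers that group the same base units differently.
LAYOUT = """<layout><layout-unit id="lay-a" xref="u-1 u-2 u-3"/></layout>"""
RST = """<rst>
  <segment id="seg-a" xref="u-1"/>
  <segment id="seg-b" xref="u-2 u-3"/>
</rst>"""


def index_base(base_xml):
    """Map each base unit id to its text."""
    return {u.get("id"): (u.text or "") for u in ET.fromstring(base_xml).iter("unit")}


def resolve(layer_xml, base_index):
    """Resolve every element carrying an xref attribute to the base texts it points at."""
    resolved = {}
    for element in ET.fromstring(layer_xml).iter():
        xref = element.get("xref")
        if xref:
            resolved[element.get("id")] = [base_index[i] for i in xref.split()]
    return resolved


base = index_base(BASE)
layout_view = resolve(LAYOUT, base)
rst_view = resolve(RST, base)
```

Because neither layer contains the text itself, the layout grouping and the rhetorical grouping can cut the same base units in different ways without one hierarchy having to nest inside the other.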

31 GeM layers: the base file
Example text segmented into base units: Huge (90cm) unmistakable seabird. Watch for white, cigar-shaped body and long straight, slender, black-tipped wings. In summer, yellow head of adult inconspicuous. Plunges spectacularly for fish. Sexes similar.
Basic ‘vocabulary’ of the page, segmented and numbered.
Actual ordering and positioning on the page irrelevant at this stage.
Predominantly ‘flat’ structure.

32 Distribution of information across layers
Diagram labels: base units, Layout, Semantic Content, RST segments, navigational elements, layout units

34 Example: Derivation of layout structure
Working visually from the page, decompose the objects on the page in terms of their visual unity
Transform the page decomposition into a hierarchical structure
Specify presentation information for units: e.g., font size, type, colour, image type, resolution, etc.

38 Complete layout structure
provides a place for assigning specific information about the layout units
contents given by collections of the base units of the page

39 GeM layers: layout units (1)
Layout unit content is defined by cross-references (xrefs) to base units
The content shown here is not formally used and may be omitted
<layout-unit id="lay-flegg-text" xref="u-21.7 u-21.8 u-21.9 u-21.10 u-21.11">
Huge (90cm) unmistakable seabird. Watch for white, cigar-shaped body and long straight, slender, black-tipped wings. In summer, yellow head of adult inconspicuous. Plunges spectacularly for fish. Sexes similar.
</layout-unit>
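Since a layout unit carries only cross-references, its text has to be recovered from the base file, and a reference to a unit that does not exist should be caught. The sketch below shows one way to do both. It assumes a base file with `unit` elements inside a `base` wrapper (both names hypothetical) and assumes that the five sentences on the slide correspond one-to-one to the ids u-21.7 to u-21.11, which the slide does not state.

```python
import xml.etree.ElementTree as ET

# Hypothetical base file; the <unit> element name and the sentence-to-id
# assignment are assumptions made for this sketch.
BASE = """<base>
  <unit id="u-21.7">Huge (90cm) unmistakable seabird.</unit>
  <unit id="u-21.8">Watch for white, cigar-shaped body and long straight, slender, black-tipped wings.</unit>
  <unit id="u-21.9">In summer, yellow head of adult inconspicuous.</unit>
  <unit id="u-21.10">Plunges spectacularly for fish.</unit>
  <unit id="u-21.11">Sexes similar.</unit>
</base>"""

# The layout unit from the slide, written without its optional content.
LAYOUT_UNIT = '<layout-unit id="lay-flegg-text" xref="u-21.7 u-21.8 u-21.9 u-21.10 u-21.11"/>'


def unit_text(layout_unit_xml, base_xml):
    """Return the text of a layout unit by following its xrefs into the base file."""
    base = {u.get("id"): (u.text or "") for u in ET.fromstring(base_xml).iter("unit")}
    refs = ET.fromstring(layout_unit_xml).get("xref", "").split()
    missing = [ref for ref in refs if ref not in base]
    if missing:
        # A dangling reference means the two files are out of step.
        raise KeyError(f"xref points at unknown base units: {missing}")
    return " ".join(base[ref] for ref in refs)


text = unit_text(LAYOUT_UNIT, BASE)
```

Calling `unit_text` with an xref such as `u-21.99` raises `KeyError`, which is a cheap consistency check to run whenever the base file is re-segmented.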

40 GeM layers: layout units (2)
Layout units contain typographical details common to the unit and its children
Layout units are again identified via cross-references
Typographical information is modelled on CSS and XSL-FO
<text xref="lay-21.12 lay-21.14 lay-21.16 lay-21.18 lay-21.20" font-family="sans-serif" font-size="10" font-style="normal" font-weight="bold" case="mixed" justification="right" color="black"/>
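One `text` element states its typography once for every layout unit it references. A minimal sketch of expanding that shared description into a per-unit lookup table follows; the function name `typography_by_unit` is hypothetical, and the `text` element is copied from the slide.

```python
import xml.etree.ElementTree as ET

# The <text> element as shown on the slide.
TEXT = (
    '<text xref="lay-21.12 lay-21.14 lay-21.16 lay-21.18 lay-21.20" '
    'font-family="sans-serif" font-size="10" font-style="normal" '
    'font-weight="bold" case="mixed" justification="right" color="black"/>'
)


def typography_by_unit(text_xml):
    """Copy the shared typographic attributes to each referenced layout unit id."""
    element = ET.fromstring(text_xml)
    shared = {name: value for name, value in element.attrib.items() if name != "xref"}
    # Each unit gets its own copy so later per-unit overrides do not leak.
    return {unit_id: dict(shared) for unit_id in element.get("xref", "").split()}


styles = typography_by_unit(TEXT)
```

A table of this shape makes it easy to ask, for any layout unit id, which font family, weight or justification applies to it.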

41 GeM layers: layout units (3)
Layout structure is recursive
Tree diagram nodes: page-21, header-21, body-21, page-no-21, lay-21.2, lay-21.3
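A recursive layout structure maps naturally onto a tree that is walked depth-first. The sketch below uses the node ids from the slide; the class `LayoutNode`, the function `walk`, and in particular the placement of lay-21.2 and lay-21.3 under body-21 are assumptions, since the transcript preserves only the node labels and not the tree's edges.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class LayoutNode:
    """A layout unit that may contain further layout units (hypothetical class)."""
    id: str
    children: List["LayoutNode"] = field(default_factory=list)


def walk(node: LayoutNode, depth: int = 0) -> Iterator[Tuple[int, str]]:
    """Yield (depth, id) pairs in depth-first, document order."""
    yield depth, node.id
    for child in node.children:
        yield from walk(child, depth + 1)


# Assumed nesting: the slide lists these ids but not how they are connected.
page = LayoutNode("page-21", [
    LayoutNode("header-21"),
    LayoutNode("body-21", [LayoutNode("lay-21.2"), LayoutNode("lay-21.3")]),
    LayoutNode("page-no-21"),
])

outline = list(walk(page))
```

Printing each pair as `"  " * depth + id` gives an indented outline of the page, which is a convenient way to check an annotated layout hierarchy by eye.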

42 Annotation solutions adopted
– non-linear information
positioning of layout units within a page is specified two-dimensionally with respect to a generalized page model
the page model decomposes the page area into a hierarchy of grids
specifying the grid for a page is part of the annotation task

43 Example: Derivation of layout structure
Working visually from the page, decompose the objects on the page in terms of their visual unity
Transform the page decomposition into a hierarchical structure
Specify presentation information for units: e.g., font size, type, colour, image type, resolution, etc.
Inspect the page for any local or global grid structure
Relate layout units to grid positions

44 Complete layout structure + page model
each sub-tree can additionally be assigned to a position in a hierarchically ordered page grid

45 Complete layout structure + page model

46 GeM layers: area model
Layout units are related to identified elements of a hierarchical grid specified in the area model
<area-root id="page-frame" cols="1" rows="3" hspacing="100" vspacing="10 85 5" height="16cm" width="14cm">
<sub-area id="body-frame" location="row-2" cols="2" rows="1" hspacing="50 50" vspacing="100"/>
<layout-leaf xref="header-21" location="row-1" area-ref="page-frame"/>
...
Figure labels: 85%, 5%, 10%, 14cm, 16cm
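The grid attributes can be turned into concrete cell sizes with simple arithmetic. The sketch below assumes that `vspacing` lists row heights and `hspacing` lists column widths, both as percentages of the enclosing area; this reading fits the percentage labels in the figure but is not spelled out on the slide. The closing `</area-root>` tag is added only to make the fragment parseable, the slide's elided content is left out, and the helper names `cm` and `split` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Fragment from the slide, closed so that it parses; the elided part is omitted.
AREA_MODEL = """<area-root id="page-frame" cols="1" rows="3" hspacing="100"
    vspacing="10 85 5" height="16cm" width="14cm">
  <sub-area id="body-frame" location="row-2" cols="2" rows="1"
      hspacing="50 50" vspacing="100"/>
</area-root>"""


def cm(value):
    """Parse a length such as '16cm' into a float number of centimetres."""
    if not value.endswith("cm"):
        raise ValueError(f"expected a length in cm, got {value!r}")
    return float(value[:-2])


def split(total, percentages):
    """Divide a length according to a space-separated list of percentages (assumed meaning)."""
    return [total * float(p) / 100 for p in percentages.split()]


root = ET.fromstring(AREA_MODEL)
row_heights = split(cm(root.get("height")), root.get("vspacing"))  # three rows
col_widths = split(cm(root.get("width")), root.get("hspacing"))    # one column

# The sub-area sits in row 2 of the page frame and is itself split into two columns.
body = root.find("sub-area")
row_index = int(body.get("location").split("-")[1]) - 1
body_height = row_heights[row_index]
body_cols = split(col_widths[0], body.get("hspacing"))
```

Under these assumptions the 16cm page height divides into rows of 1.6cm, 13.6cm and 0.8cm, and the body frame in row 2 divides into two columns of 7cm each.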

47 Annotation solutions adopted
– complex query requirements
XPath
Queries using standard tools
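A small query in the spirit of this slide can be run with Python's standard library, whose `xml.etree.ElementTree` module supports a limited subset of XPath; a full XPath processor would be needed for more complex queries. The sample document below, including the `layout` wrapper and the second `text` element, is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout-layer fragment; only the attribute names come from the slides.
LAYOUT = """<layout>
  <text xref="lay-21.12 lay-21.14" font-weight="bold" justification="right"/>
  <text xref="lay-21.13" font-weight="normal" justification="left"/>
</layout>"""

root = ET.fromstring(LAYOUT)

# XPath-style query: every <text> element whose font-weight attribute is 'bold'.
bold_elements = root.findall(".//text[@font-weight='bold']")

# Collect the layout units those elements describe.
bold_units = [uid for el in bold_elements for uid in el.get("xref", "").split()]
```

The same pattern, a path with an attribute predicate followed by a walk over the `xref` values, covers questions such as which layout units are set in bold or right-justified.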

48 Conclusions
The annotation scheme allows detailed annotation of complex page-based documents
Regularities can be sought using complex XPath queries
The system is open-ended and extensible without any redefinition of existing resources

49 Ongoing Work
Further collection and ongoing annotation of corpus
– http://purl.org/net/gem
Use of results for criticism of document design and for exploring the relation between layout and rhetorical structure
– Delin/Bateman: Document Design, 2002
Use of XPath queries within sequences of Extensible Stylesheet Language Transformations (XSLT) for automatic document generation
– Henschel/Bateman/Delin: KONVENS 2002, Saarbrücken

50 Future Work
Extension of annotation schemes
– violations of the grid area model are currently handled by relative offsets; a more flexible approach is needed
– non-rectilinear grids for more complex design
– consideration of dynamic elements, animation, etc.
Extension of genres considered
– advertisements
– scientific documents
Extension of languages considered

51 Thank you!
