Content vs. Encoding

From SIGANN


The encoding format is the way we physically represent the information, e.g. as XML or as LISP-like structures as in the Penn Treebank. The encoding format contributes to the “meaning” of the annotation by establishing relationships between annotations and data (what applies to what) as well as between two data items (e.g. a coreference link), and may also provide information about the organization of the annotation (e.g. a list of constituents or a set of alternatives). It does not provide the content category labels per se, except insofar as they may be implicit in the structure of the format.

This kind of implicit information is where the annotation content and the format usually (and often unintentionally) overlap. If you take something like the Penn Treebank annotation format, it is clear that the items grouped with parentheses are intended to be constituents of the enclosing item. However, it is important to note that this is clear only because we, as humans, can look at the data and infer it from the category labels. The very same format has been used for other types of linguistic information, for example in NOMLEX. There, the groupings mean a number of different things (sometimes groups of related information, sometimes sets of alternatives), and without examining the data it is impossible to tell which. Making that judgment is easy for humans but hard for computers. So the golden rule concerning format vs. content is to make explicit the relations that the format is being used to express, so that machines can process the annotations without human interpretation. This sounds like a minor point, but formatting ambiguity has been known to cause problems when annotations are used with software other than that for which they were designed (and which therefore "knows" what the format signals in specific contexts).
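The point about ambiguity can be illustrated with a small Python sketch. Both bracketed strings below are invented for illustration (the second is not taken from NOMLEX), and the helper names parse, shape and make_explicit are hypothetical. A generic parser returns the same nested shape for both inputs, so nothing in the format says whether a grouping holds constituents or alternatives. Stating the relation in the data, as in the last step, is one possible way to remove that guesswork.

```python
def parse(text):
    """Parse a parenthesized string into nested Python lists."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token == "(":
            items = []
            while tokens[pos] != ")":
                items.append(read())
            pos += 1  # skip the closing parenthesis
            return items
        return token

    return read()


def shape(node):
    """Reduce a parsed structure to its bare nesting, ignoring all labels."""
    return [shape(child) for child in node] if isinstance(node, list) else "_"


def make_explicit(parsed, relation):
    """Attach an explicit relation name to a parsed group (illustrative only)."""
    label, *groups = parsed
    return {"label": label, "relation": relation, "items": groups}


# Invented examples: a constituent structure and a set of alternatives.
tree = parse("(NP (DT the) (NN dog))")
lexicon = parse("(COLOR (US color) (UK colour))")

# The format alone cannot tell the two apart: the nesting is identical.
same_shape = shape(tree) == shape(lexicon)

# With the relation stated, software no longer has to infer it from labels.
explicit_tree = make_explicit(tree, "constituents")
explicit_lexicon = make_explicit(lexicon, "alternatives")
```

Here same_shape evaluates to True, while the two explicit records differ in their relation field, which is the information a human reader would otherwise supply.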

The ISO approach in the Linguistic Annotation Framework (LAF) is to provide a very generic XML serialization format (called GrAF), consisting of a graph with feature structures attached to the nodes, which provide the annotation content. GrAF is accompanied by some very general rules for structuring annotations, but otherwise imposes few constraints.
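As a rough picture of the data model described above, the following Python sketch represents a graph whose nodes carry feature structures. It is an in-memory toy, not the GrAF XML schema: the class names, method names and feature names (cat, pos) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A graph node with a feature structure (here a plain dict)."""
    node_id: str
    features: dict = field(default_factory=dict)


@dataclass
class Graph:
    """A directed graph: nodes keyed by id, edges as (source, target) pairs."""
    nodes: dict = field(default_factory=dict)
    edges: set = field(default_factory=set)

    def add_node(self, node_id, **features):
        self.nodes[node_id] = Node(node_id, dict(features))

    def add_edge(self, source, target):
        self.edges.add((source, target))

    def children(self, node_id):
        """Return the ids of nodes reachable by one edge from node_id."""
        return sorted(t for s, t in self.edges if s == node_id)


# A hypothetical noun phrase: the structure lives in the edges,
# the annotation content lives in the feature structures.
g = Graph()
g.add_node("np1", cat="NP")
g.add_node("t1", pos="DT")
g.add_node("t2", pos="NN")
g.add_edge("np1", "t1")
g.add_edge("np1", "t2")
```

In this sketch the relationship between items is carried entirely by the edges, and the category labels entirely by the features, which mirrors the separation of format and content discussed on this page.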

The idea behind LAF is that any annotation format can be used as long as it can be mapped without information loss into GrAF. If several different annotations are independently mapped into GrAF, they can be easily combined (merged) into a single graph. From here, the graph may itself be used for analysis of annotation layers (the graph is increasingly seen as a useful data structure for representing and analyzing linguistic data). Alternatively, the graph can be transduced to another format. The ANC project provides a tool that transduces GrAF structures into several formats for input to software systems like GATE and UIMA and to programs like GraphViz, MonoConc, and the BNC XAIRA search engine; more formats are continually being added.
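The merge and transduce steps can be sketched in Python as follows. This is not the ANC tool and does not read or write GrAF XML; graphs are plain dictionaries, and the functions merge and to_dot, the node ids, and the feature names are hypothetical. The sketch assumes that two layers refer to the same token by the same node id and that they use different feature names, so that combining them loses nothing.

```python
def merge(*graphs):
    """Combine annotation graphs into one: union of nodes, features and edges."""
    merged = {"nodes": {}, "edges": set()}
    for graph in graphs:
        for node_id, features in graph["nodes"].items():
            # Nodes shared between layers keep the features from every layer.
            merged["nodes"].setdefault(node_id, {}).update(features)
        merged["edges"] |= graph["edges"]
    return merged


def to_dot(graph):
    """Serialize a graph as GraphViz DOT text (toy version, no escaping)."""
    lines = ["digraph annotations {"]
    for node_id, features in sorted(graph["nodes"].items()):
        label = ", ".join(f"{k}={v}" for k, v in sorted(features.items()))
        lines.append(f'  "{node_id}" [label="{label}"];')
    for source, target in sorted(graph["edges"]):
        lines.append(f'  "{source}" -> "{target}";')
    lines.append("}")
    return "\n".join(lines)


# Two invented layers over the same tokens t1 and t2.
syntax = {
    "nodes": {"np1": {"cat": "NP"}, "t1": {"pos": "DT"}, "t2": {"pos": "NN"}},
    "edges": {("np1", "t1"), ("np1", "t2")},
}
entities = {
    "nodes": {"ne1": {"entity": "ANIMAL"}, "t2": {}},
    "edges": {("ne1", "t2")},
}

merged = merge(syntax, entities)
dot_text = to_dot(merged)
```

Because both layers point at the shared node t2, the merged graph connects the syntactic and entity annotations without either layer knowing about the other. A real conversion would also have to handle feature name clashes and the links from nodes to the primary data, which this sketch leaves out.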
