Content of Linguistic Annotation: Standards and Practices (CLASP)
Preliminary Work on CLASP
The CLASP effort begins with a November 7, 2009 meeting at NYU, sponsored by the National Science Foundation (IIS 0948101, Content of Linguistic Annotation: Standards and Practices (CLASP)) and by the Association for Computational Linguistics Special Interest Group for Annotation (SIGANN).
CLASP seeks to improve the interoperability of annotation content by fostering standardization, mapping procedures, and best practices, among other methods. Specifically, we would like to encourage the NLP community to adopt a set of practices that will make it easier to integrate annotation created by different transducers or different manual annotation projects. CLASP is concerned only with annotation content, leaving issues concerning annotation encoding format to other efforts such as LAF/GrAF.
Apart from its focus on annotation content, there are a number of other differences between the CLASP effort and other annotation efforts:
1. Our focus on interoperability means that we are most interested in those aspects of linguistic content that are shared across different forms of annotation. For example, suppose an application requires that correlations be drawn between the output of a semantic role labeler and a coreference engine. This would work much better if both systems shared assumptions about tokenization and/or phrase structure.
2. We aim to develop recommendations that the community is likely to follow. We are considering issues such as the overhead of following our recommendations and the incentives that would lead researchers to pay attention to proposed guidelines.
Travel and Logistics
The first CLASP meeting will be held at New York University on Saturday, November 7, 2009. Breakfast is at 9:30 AM and the meeting runs from 10 AM until 6 PM. It is at Warren Weaver Hall, 251 Mercer Street, Room 101. Click on CLASPTravelLogistics for schedule information, travel directions and other logistical information.
Questions About Annotation Content Guidelines
The agenda of that meeting will be closely tied to seeking answers to the attached list of questions. We encourage participants in CLASP, ACL members (including SIGANN), and other consumers/producers of linguistic annotation to provide possible answers to these questions, additional questions, and comments as appropriate. Please help edit the attached pages.
Agenda for the CLASP Meeting
The goal of the CLASP meeting is to develop a workable plan for creating standards for linguistic content. We are primarily interested in standards that are likely to be accepted by the community and in identifying mechanisms for facilitating their acceptance. In some cases, this will mean sacrificing detail and only identifying parts of analyses that can be agreed upon. In other cases, this will mean allowing multiple analyses or mapping procedures. The purpose of the standards is to promote interoperability and mergeability of annotation frameworks.
The schedule (subject to change) will be as follows:
- 9:30-10:00 Breakfast
- 10:00-10:30 CLASP Goals, Approach and Methodology -- Adam Meyers
- 10:30-11:15 CLASP in Context of Other Standardization Efforts -- Nancy Ide
- 11:15-12:45 Working Groups
- 12:45-1:45 A Nice Lunch (Catered)
- 1:45-3:15 Working Group Presentations
- 3:15-3:45 Group Discussion
- 3:45-4:15 Coffee Break
- 4:15-4:45 Preparation of Summary Slides
- 4:45-5:30 Presentation of Summary Slides
- 5:30-6:00 Final Discussion
A large part of the CLASP meeting will be devoted to working group meetings and presentations. This section lists several groups, the composition of which needs to be decided in advance of the CLASP meeting. Ideally, each participant in the meeting will participate in one of the working groups. Please send Adam Meyers (meyers at cs dot nyu dot edu) your first, second and third choices of working group. Adam will take into account both preferences and the balance of members between groups to determine group membership. It is suggested that researchers from the same project select different working groups.
There will be the following working groups (minor changes are possible). The first two working groups deal with substantive issues regarding the future organization and scope of CLASP efforts. The last two working groups attempt to provide concrete examples of the standardization process by working through standards that are particularly crucial to the interoperability and mergeability of annotation.
These and related questions are also discussed in detail at CLASP_Questions.
- CLASP Policy:
- A. How should standards be instantiated (published, disseminated)?
- B. How can we motivate annotators, researchers and developers to follow standards?
- i. Should there be one or more SIGANN stamps of Approval?
- ii. During peer review, should consideration be given to adherence to content standards and justification for standard violation?
- C. Should Annotation Content Standards constrain derivative tasks? If yes, then when, why and how?
- While it is clear that part of speech standards may constrain automatic and manual part of speech annotation, it is not clear whether part of speech guidelines should, in any way, constrain manual/automatic systems that do predicate argument structure, reference resolution, question answering, machine translation, etc.
- CLASP Scope:
- A. Which Classes of Content Categories are Ready for Standardization (POS, NE, etc.)?
- B. Which prior standards can we build on? What connection, if any, should CLASP standards have with the standards in ISO TC37 SC4?
- C. What are some of the initial details of standards for some areas not covered by other working groups?
- D. While CLASP will initially only cover annotation standards for English, how should other languages influence these standards?
- Assuming separate standards for each language has advantages and disadvantages, both of which need to be taken into account.
- i. When should English standards be constrained by how well they could apply to other languages?
- ii. When should truly anglocentric standards be adopted?
- Tokenization Working Group:
- A. What is a token?
- B. To what degree are segments at the character level relevant to tokenization?
- i. Can the same character be part of two tokens?
- ii. Can a character be part of no token?
- iii. Can a character in a string be mapped to a different character in a token?
- C. Which regularizations are part of tokenization and which regularizations are part of some "higher" level of annotation?
- i. normalization of contracted forms?
- ii. correction of misspelled words?
- iii. normalization of alternative spellings?
- iv. aliases for named entities?
- v. identification of immutable multiword units?
- vi. morphological analysis (inflectional and/or derivational), including separating the base form from a representation of the features expressed by the morphology?
- Anchor Working Group -- For each sentence, how can we select a set of representative words/tokens called anchors? This task is essentially equivalent to identifying a dependency graph representation of a sentence or identifying all the head words, except that the task is extended to cover all the problem cases. Specifically, this group must grapple with the problem of selecting anchors even for constituents/constructions that do not seem to admit anchors.
- A. How can we balance linguistic accuracy and ease of implementation?
- There can be several alternatives that are incompatible with each other, such that: (i) the hardest to implement corresponds to the most predictive in terms of lexical co-occurrence properties; and (ii) the easiest to implement relies on simple rules like "if in doubt, mark the last (or first) constituent as the anchor". Such rules contribute little to the overall analysis other than keeping the graph connected. This raises further questions such as:
- i. Should we choose one linguistically adequate analysis and standardize via an annotation project?
- ii. Should we choose more than one analysis and allow variation depending on the amount of overhead assumed?
- iii. Should a compromise analysis be adopted?
- B. How do we approach anchor identification for non-headed structures?
- i. For some constructions (e.g., names), should multi-word anchors be considered?
- ii. Should two types of anchors, logical and surface, be identified to solve some such problems? How often can these same cases be handled using "transparent" links between constituents?
- iii. Are there some elements (e.g., false starts) which should be ignored for purposes of anchor selection, thus effectively removing them from a dependency-style analysis?
- iv. Is anchor identification enough, or must certain other information be identified in order to implement an anchor standard?
- C. Enumerate as many difficult cases as possible.
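As an illustration of the trade-off raised in question A of the Anchor Working Group, the following is a minimal Python sketch of anchor selection over a phrase-structure tree. It is not a CLASP proposal: the `Node` class, the `HEAD_PREFERENCES` table and the `select_anchor` function are hypothetical names invented for this example, and the head preferences are a toy assumption rather than any agreed standard. Where no preference applies (for example, a multi-word name), the sketch falls back to the simple "mark the last (or first) constituent" rule.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """A phrase-structure node; leaves carry a word, phrases carry children."""
    label: str
    word: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


# Toy head preferences (an assumption for illustration only):
# for each phrase label, child labels to try in order of preference.
HEAD_PREFERENCES: Dict[str, List[str]] = {
    "S": ["VP"],
    "VP": ["VBD", "VBZ", "VB", "VP"],
    "NP": ["NN", "NNS", "NNP", "NP"],
    "PP": ["IN"],
}


def select_anchor(node: Node, fallback: str = "last") -> str:
    """Return the anchor word of a node.

    Tries the preference table first; if nothing matches, applies the
    fallback rule ("last" or "first" child). Assumes every non-leaf
    node has at least one child.
    """
    if node.word is not None:
        return node.word
    for preferred in HEAD_PREFERENCES.get(node.label, []):
        for child in node.children:
            if child.label == preferred:
                return select_anchor(child, fallback)
    child = node.children[-1] if fallback == "last" else node.children[0]
    return select_anchor(child, fallback)


# "the dog chased a cat"
sentence = Node("S", children=[
    Node("NP", children=[Node("DT", "the"), Node("NN", "dog")]),
    Node("VP", children=[
        Node("VBD", "chased"),
        Node("NP", children=[Node("DT", "a"), Node("NN", "cat")]),
    ]),
])

# A non-headed structure with no entry in the preference table.
name = Node("NAME", children=[Node("NNP", "New"), Node("NNP", "York")])

sentence_anchor = select_anchor(sentence)
name_anchor_last = select_anchor(name, fallback="last")
name_anchor_first = select_anchor(name, fallback="first")
```

In this sketch the fallback keeps the graph connected but says nothing linguistically meaningful about the name, which is the concern raised above; a multi-word anchor (question B.i) would need a different return type.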
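Questions B.i through B.iii of the Tokenization Working Group can be made concrete with a small Python sketch that represents tokens as character spans over the original string. Everything here is an assumption made for illustration: the `Token` class, the `coverage` helper and the particular analysis of "can't" are hypothetical and are not a tokenization that CLASP has adopted.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Token:
    """A token as a half-open character span plus a (possibly regularized) form."""
    start: int
    end: int
    form: str


def coverage(text: str, tokens: List[Token]) -> List[int]:
    """Count, for each character position, how many tokens include it."""
    counts = [0] * len(text)
    for tok in tokens:
        for i in range(tok.start, tok.end):
            counts[i] += 1
    return counts


text = "I can't go"

# One possible (hypothetical) analysis in which "can" and "n't" share the "n".
tokens = [
    Token(0, 1, "I"),
    Token(2, 5, "can"),
    Token(4, 7, "not"),   # surface "n't" regularized to "not"
    Token(8, 10, "go"),
]

counts = coverage(text, tokens)

# B.i: characters that belong to more than one token.
shared = [i for i, c in enumerate(counts) if c > 1]
# B.ii: characters that belong to no token (here, the spaces).
uncovered = [i for i, c in enumerate(counts) if c == 0]
# B.iii: tokens whose form differs from the characters they span.
changed = [tok for tok in tokens if text[tok.start:tok.end] != tok.form]
```

Under this analysis the "n" at position 4 is shared by two tokens, the two spaces belong to no token, and only the "not" token differs from its surface characters. A standard that answers "no" to question B.i would rule this analysis out.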
- Goals, Approaches and Methodology: Adam Meyers, NYU
- Background and Context for CLASP: Nancy Ide, Vassar College
- Policy Working Group: Chris Cieri, Aravind Joshi, Adam Meyers, Martha Palmer
- WG2: Scope of Standards and Standardization: Nicoletta Calzolari, Nancy Ide, Rashmi Prasad, James Pustejovsky, Janyce Wiebe
- Tokenization Group: Collin Baker, Bran Boguraev, Catherine Macleod, Critina Mota, Nianwen Xue
- Anchor Working Group: Chuck Fillmore, Dan Flickinger, Jan Hajic, Owen Rambow, Zdenka Uresova