Obstacles, Limitations and Best Practices for Content Harmonization


Standardization can have a very high overhead.

If one is interested in studying some novel phenomenon N, it may take considerable time and effort to research: (1) who else has previously worked on N; (2) which other phenomena P does N depend on, and how they have been analyzed previously; and (3) how to handle cases where N seems to be incompatible with previous analyses of P. Arguably, every diligent annotation researcher should address each of these issues. However, in a large interdisciplinary field like computational linguistics, researchers may be overwhelmed by good practices along many different dimensions. A researcher in a project involving annotation most likely needs to be aware of good practices in inter-annotator agreement, in machine learning, and in some linguistic theory (here we break with tradition and count purely n-gram-based analysis among the plethora of linguistic theories). In addition, there is much disagreement among computational linguists about which of these concerns are important. One reason for this is that the backgrounds of computational linguists vary in several ways: (1) knowledge of linguistics, statistics, artificial intelligence, and other areas; (2) exposure to different frameworks within these disciplines; and (3) beliefs about the "right" path to creating a successful application, since many researchers are biased towards a particular type of solution to a computational problem.

While the need to use multiple resources should be a shared concern, it is important that we proceed slowly: we should identify a minimal level of standards (ways to harmonize data) that can be followed with little effort. Standards with low overhead are more likely to be adopted than high-overhead varieties.

This suggests that the burden of proof is on those who want standards, rather than on those who do not.

Content standardization should be limited

Arguing for limited standardization should prove easier than arguing for global standards.

Considerations include:

  1. We should choose standards that are important for combining and merging analyses. These areas include features that must be shared across analyses of different phenomena, such as the identification of minimal linguistic units (segmentation, tokenization, anchor-selection, etc.); a short sketch of why shared anchoring matters follows this list.
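
To make this concrete, the sketch below illustrates why shared minimal units matter when merging analyses: two annotation layers, a hypothetical part-of-speech layer and a hypothetical named-entity layer, can be combined mechanically only because both anchor to the same character offsets. The example sentence, layer names, and tags are illustrative assumptions rather than part of any particular standard.

 # Two annotation layers produced by different tools can be merged cleanly
 # only if they anchor to the same minimal units (here, character offsets).
 # The sentence, tags, and spans below are purely illustrative.
 
 text = "Pierre Vinken joined the board."
 
 # Layer A: tokens with part-of-speech tags, anchored as (start, end, tag).
 pos_layer = [
     (0, 6, "NNP"), (7, 13, "NNP"), (14, 20, "VBD"),
     (21, 24, "DT"), (25, 30, "NN"), (30, 31, "."),
 ]
 
 # Layer B: named-entity spans, anchored to the same offset scheme.
 ner_layer = [(0, 13, "PERSON")]
 
 def merge_layers(tokens, entities):
     """Attach an entity label to each token that falls inside an entity span."""
     merged = []
     for start, end, pos in tokens:
         label = next((tag for s, e, tag in entities if s <= start and end <= e), "O")
         merged.append((text[start:end], pos, label))
     return merged
 
 for token, pos, ent in merge_layers(pos_layer, ner_layer):
     print(token, pos, ent)

If the two layers disagreed even slightly on tokenization or offset counting, the merge would misalign silently; preventing exactly this failure mode is what a minimal shared standard for segmentation and anchoring buys.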

Best practices should be backed up in some tangible way

  1. Reviewers of papers and grant proposals could start considering how standards are addressed.
  2. Start-up funds could be provided for small (even speculative) student projects that are standard-compliant.
  3. SIGANN could start awarding a SIGANN-STANDARDS-COMPLIANT seal of approval.

