Tokenization Standards



A token is the smallest linguistic unit for some level of linguistic description. Tokenization standards can be developed according to (at least) two different strategies:

  1. a substring-preserving strategy, in which every token is a substring of the original text; or
  2. a substring mapping strategy, in which substrings of the text are mapped to sets of tokens.

These strategies tend to raise slightly different questions about standardized tokenization. Discussion of each follows.

Another consideration is how tokenization should be defined: more specifically, should tokenization include string regularization (mapping alternative spellings, identifying immutable idioms, etc.)? Consider the following derivational sequence:

  1. Big Blue's stock is going topsy turvy.
  2. Big + Blue + 's + stock + is + going + topsy + turvy
  3. IBM + 's + stock + is + going + topsy turvy
  4. IBM + 's stock + be + 3rd-sing-present + go + progressive + topsy turvy

1 is the original string, 2 and 3 are possible tokenizations, and 4 is an approximate next step (morphology). On one view, step 2 is a necessary intermediate step called tokenization, on which future analysis should be based; e.g., 3 is derived from 2. On another view, 3 is really the level of tokenization, and 2 is a useless and necessarily inconsistent level of representation. Of course, there are intermediary positions which would include aspects of 2 and 3.
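The two views above can be sketched in code. This is a minimal illustration, not a prescribed implementation: the clitic rule, alias table, and idiom table below are simplified assumptions chosen to reproduce the example sentence.

```python
def tokenize_preserving(text):
    """Step 2: split on whitespace, then split off the clitic 's (sketch only)."""
    tokens = []
    for word in text.rstrip(".").split():
        if word.endswith("'s"):
            tokens.extend([word[:-2], "'s"])
        else:
            tokens.append(word)
    return tokens

def regularize(tokens):
    """Step 3: map alternative spellings and join immutable idioms."""
    aliases = {("Big", "Blue"): "IBM"}            # hypothetical alias table
    idioms = {("topsy", "turvy"): "topsy turvy"}  # hypothetical idiom table
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in aliases:
            out.append(aliases[pair]); i += 2
        elif pair in idioms:
            out.append(idioms[pair]); i += 2
        else:
            out.append(tokens[i]); i += 1
    return out

step2 = tokenize_preserving("Big Blue's stock is going topsy turvy.")
step3 = regularize(step2)
```

On the first view, `step2` is the canonical tokenization and `step3` is derived from it; on the second view, only `step3` deserves the name tokenization.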

Substring Preserving Strategy

The String and Syntax Biased Definition

At largest a token is a word long. When syntactically necessary, a word can be broken up into multiple tokens. This definition is loosely based on the evolving practices of the Penn Treebank. We would be interested in any comments, particularly those that point out some of the differences between the current Penn Treebank (or other tokenization standards) and the ones suggested below.

Among other things, this assumes (for now) that two strings separated by white space are always considered separate tokens. So most of the clarifications of this definition govern tokenization at the sub-word level, when there is a reason to divide strings that do not contain spaces. We will assume the following two principles:

A. Splits should try to minimize the number of spelling variants for each word. This principle favors the split (a) can't --> can + 't over the split (b) can't --> ca + n't, because (a) prevents the word can from acquiring the spelling variant ca. By this same principle, however, (b) is preferred to (a) because the negative form n't occurs in other splits, e.g., doesn't --> does + n't. Thus assuming (b) would make it unnecessary to add 't as a possible variant of the negative adverb not; rather, n't could be assumed to be the only contracted form of not. Assuming the split can't --> can + n't would give us both advantages, allowing us to generalize the most.
B. Splits should avoid reusing, deleting, or adding characters as part of tokenization. The simplest tokenization rules are ones that simply split strings into substrings; changing the substrings means that there are special cases which have to be accounted for. This principle favors can't --> can + 't over the split can't --> can + n't, because the former does not entail that the n in the original text corresponds to parts of two tokens in the token annotation. For example, a simple rule for contraction, (i) “separate the final n't from the rest of the string”, would be insufficient by itself: if we were changing the string, we would need an additional rule like “(ii) double the n if the preceding letter is a vowel”. Note that this objection mainly holds for open-ended phenomena. It is actually possible to enumerate all the cases of English not-contraction: can't → can + n't, shouldn't → should + n't, etc. Of course, some rare cases could easily be missed (needn't → need + n't), and some, like won't → will + n't and shan't → shall + n't, require changing the string outright.
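The contrast between the two principles can be sketched as two splitting functions: a pure substring rule following (i) above, and an enumerated lookup table in the principle-A style. The table below is illustrative, not exhaustive.

```python
# Illustrative (incomplete) table of principle-A-style splits, which may
# duplicate or replace characters relative to the original string.
NOT_CONTRACTIONS = {
    "can't": ["can", "n't"],
    "won't": ["will", "n't"],
    "shan't": ["shall", "n't"],
}

def split_substring(word):
    """Principle B, rule (i): separate the final n't; no characters changed."""
    if word.endswith("n't"):
        return [word[:-3], "n't"]
    return [word]

def split_lookup(word):
    """Principle A: table lookup first; fall back to the substring rule."""
    return NOT_CONTRACTIONS.get(word, split_substring(word))
```

With these definitions, `split_substring("can't")` yields the substring-preserving ca + n't, while `split_lookup("can't")` yields can + n't, in which the n serves two tokens.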

[Note: These concerns are independent of whether one takes an inline or offset approach to annotation. They have to do with whether one is concerned with preserving substrings in the process of tokenization and possibly leaving further regularization to other levels of annotation.]

Obviously, these two principles are often in conflict. We will thus suggest that there are two possible variations on some standards, one favoring principle A, and the other favoring principle B. We will initially favor a Principle A-based standard, but this is, of course, subject to further discussion. The proposed standards for word-initial punctuation, word-final punctuation and contractions are:

1. Initial and final punctuation should be separated from words, with one exception. If a sentence-final period is shared with a word that includes a period, e.g., an abbreviation like etc., then the period is doubled for tokenization purposes, i.e., it is part of both the final word (etc.) and the sentence-final punctuation token.
i. An alternative, principle B biased standard is: always separate sentence-final periods, even if they occur at the end of abbreviations. Thus, we must assume that all words ending in a period, like etc., have period-less variants (etc) that occur sentence-finally. [[It would be nice if we had a solution that could be used by the Text-to-Speech system used in the Kindle, so that (per the New Yorker review) a sentence-final "miss." wouldn't get read as Mississippi! Kindle, too, needs to know the difference. It's likely, of course, that ONLY "etc." shows up in sentence-final position, and TTS systems should know that. cjf]]
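Proposal 1 can be sketched as follows; the abbreviation list is an illustrative assumption, and a real standard would need a fuller inventory.

```python
ABBREVIATIONS = {"etc.", "e.g.", "i.e."}   # illustrative abbreviation list

def split_final_punct(word, sentence_final=False):
    """Separate final punctuation; double a period shared with an abbreviation."""
    if word in ABBREVIATIONS:
        # The period stays on the abbreviation; if sentence-final, it is
        # doubled into a separate punctuation token as well.
        return [word, "."] if sentence_final else [word]
    if word[-1] in ".,!?;:":
        return [word[:-1], word[-1]]
    return [word]
```

So `split_final_punct("etc.", sentence_final=True)` yields the doubled-period tokenization, while the principle-B alternative (i) would instead always strip the period, yielding etc + period.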
2. Contractions divide words into recognizable units (by lookup). Given the finite number of contractions in English, this should be easy. Thus can't --> can + n't.
i. An alternative, principle B biased standard would imply that contraction should always occur in a standard place, e.g., before the n in negative contraction. This would imply that ca is a possible form of the modal can, as per the discussion above.
ii. Another alternative would be to regularize further and replace n't with the uncontracted form not. This has the advantage that there are fewer forms of “not” to consider. The disadvantage is that the tokenization loses the information that this is a contracted form and there is no other “level” that records that information, should one want to study distributional information of the contracted form. Similar issues may arise with alternative spellings or misspellings of words.
For hyphens and forward slashes, we propose that:
3. Hyphens and surrounding strings of text should be treated as separate words, provided that all the resulting tokens are words, numbers, or members of a list of prefixes (a per an co pre post un anti ante ex extra fore non over pro re super sub tri bi uni ultra). Under this view, the string "New York - based" consists of the 4 tokens: New, York, -, based. This is essentially the definition used in the 2008 and 2009 CoNLL shared tasks.
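One possible reading of proposal 3 is sketched below, using the prefix list quoted above. The test for what counts as a "word" is an assumption (a simple alphabetic check); a real implementation would consult a lexicon.

```python
PREFIXES = {"a", "per", "an", "co", "pre", "post", "un", "anti", "ante",
            "ex", "extra", "fore", "non", "over", "pro", "re", "super",
            "sub", "tri", "bi", "uni", "ultra"}

def split_hyphens(string):
    """Split at hyphens only if every piece is a word, number, or listed prefix."""
    parts = string.split("-")
    def ok(piece):
        # "word" approximated by an alphabetic check (an assumption)
        return piece.isalpha() or piece.isdigit() or piece.lower() in PREFIXES
    if len(parts) > 1 and all(ok(p) for p in parts):
        tokens = []
        for p in parts[:-1]:
            tokens.extend([p, "-"])   # the hyphen itself remains a token
        tokens.append(parts[-1])
        return tokens
    return [string]
```

Under this sketch, `split_hyphens("co-author")` yields the 3 tokens co, -, author, while a string with a non-word piece such as "R2-D2" is left whole.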
4. The forward slash / is assumed to always divide words unless it is part of a numeric fraction; e.g., The U.S. / Japan trade agreement includes the tokens U.S., /, and Japan. We assume that fractions like 3/4 are single tokens, but expressions like Distance/Time should be broken down into tokens (3 tokens in this case).
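A minimal sketch of proposal 4, assuming a numeric fraction is exactly digits-slash-digits:

```python
import re

FRACTION = re.compile(r"^\d+/\d+$")   # assumed definition of a numeric fraction

def split_slash(string):
    """Keep numeric fractions whole; otherwise split at every slash."""
    if FRACTION.match(string):
        return [string]                # 3/4 stays one token
    if "/" in string:
        parts = string.split("/")
        tokens = [parts[0]]
        for p in parts[1:]:
            tokens.extend(["/", p])    # the slash itself remains a token
        return tokens
    return [string]
```

Thus `split_slash("3/4")` is a single token, while `split_slash("Distance/Time")` gives the 3 tokens mentioned above.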
5. Finally, there are some additional issues regarding more symbolic types of texts: numbers, symbols, web addresses and other computer paths and the like. We will preliminarily suppose that:
A. Numbers consisting of digits, periods, commas, and slashes (fractions) constitute single tokens. However, unit symbols like $, %, #, etc. are separate tokens from the number they modify.
B. Web addresses, other computer paths (e.g., file paths), and other types of symbols not named above constitute single tokens. Except when embedded in file paths or web addresses, white space is assumed to separate symbols; e.g., the address 435 5th Avenue is assumed to consist of 3 tokens (435, 5th, Avenue).
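Points 5A and 5B can be sketched together. The number pattern and the handling of $, #, and % below are assumptions; the sketch ignores web addresses and paths, which would need their own recognizers.

```python
import re

NUMBER = re.compile(r"^[\d.,/]+$")   # digits, periods, commas, slashes (5A)

def split_symbolic(word):
    """Split unit symbols off numbers; leave everything else whole."""
    if word and word[0] in "$#" and NUMBER.match(word[1:]):
        return [word[0], word[1:]]           # $12.50 -> $ + 12.50
    if word.endswith("%") and NUMBER.match(word[:-1]):
        return [word[:-1], "%"]              # 75% -> 75 + %
    return [word]

def tokenize(text):
    """White space separates symbols (5B), then unit symbols are split off."""
    tokens = []
    for w in text.split():
        tokens.extend(split_symbolic(w))
    return tokens
```

For example, `tokenize("435 5th Avenue")` yields the 3 tokens given above, and a price like $12.50 becomes the 2 tokens $ and 12.50.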
6. Note that "/" is a common character in British text, so the slash-handling rules above will apply frequently.

Substring Mapping Strategy

The substring mapping strategy is a function f(s) = S, where s is a (possibly empty?) substring in a text, and S is a set of 0 or more linguistic tokens. Annotations are associated with the resulting token stream.
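One way to realize f(s) = S in code is a list of span-to-tokens records, from which the token stream is read off; this representation is an assumption for illustration, not a prescribed data model.

```python
from typing import NamedTuple, Tuple

class Mapping(NamedTuple):
    start: int            # substring start offset in the text
    end: int              # substring end offset (exclusive)
    tokens: Tuple[str, ...]  # S: a set of 0 or more linguistic tokens

text = "IBM's stock"
mappings = [
    Mapping(0, 5, ("IBM", "'s")),   # one substring maps to two tokens
    Mapping(5, 6, ()),              # the space maps to zero tokens
    Mapping(6, 11, ("stock",)),
]

# Annotations attach to the resulting token stream:
token_stream = [t for m in mappings for t in m.tokens]
```

Because each token records which substring it came from, annotations on the token stream can always be traced back to character offsets in the original text.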

The substring mapping strategy eliminates some considerations in the proposals listed above, and changes the nature of others:

1. Separation of initial and final punctuation is accomplished by mapping the same character to (parts of) two tokens. For example, the substring "etc." would map to the token "etc.", and the terminating period would map to a "period" or "punctuation" token.
2. Representation of contractions may be less complicated with substring mapping, for example, by allowing for association of two tokens with one construct (e.g., "can" and "not" associated with the substring "can't" in the text). For other languages, such as the Romance languages, enabling constructs where two or more "linguistic" tokens are associated with a single substring in the text is more critical, and trickier (e.g., French "du" --> "de" + "le").
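A minimal sketch of this point: the tables below are illustrative assumptions, and under this strategy the lexicon only needs the full forms ("not", "de", "le").

```python
# Illustrative (incomplete) mapping tables; each substring maps to its
# fully regularized tokens.
CONTRACTIONS = {
    "can't": ("can", "not"),
    "won't": ("will", "not"),
}
FRENCH_PORTMANTEAUX = {
    "du": ("de", "le"),
    "au": ("à", "le"),
}

def map_substring(s, table):
    """f(s) = S: return the tokens associated with substring s."""
    return list(table.get(s, (s,)))
```

Here `map_substring("can't", CONTRACTIONS)` yields ["can", "not"], and `map_substring("du", FRENCH_PORTMANTEAUX)` yields ["de", "le"]; an unmapped substring maps to itself.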

[[Could we even allow ourselves, for deep grammar-writing, to recognize some questionable decisions in English orthography---by which, for example what could be realized as "an" + "other" comes out as "another", or "who" + "'s" comes out as "whose"? Certain generalizations accounting for "a whopping 2000 bucks" and "an additional 3 years" could be made to fit "another 5 pages"; and generalizations we could make about the clitic genitive ("the king of England's hat") would include "who the hell's", "who else's" and the like. cjf]]

a. There is still a decision of whether to regard "not" or "n't" or "'t" as the token (and hence include in the lexicon), but in principle it would seem that with the mapping strategy there is little argument against representing the lexical unit in full (i.e., use "not" in all cases).
b. The issue of information loss is obviated, because the original form is one of the outputs of the function.
3. Hyphenation is not as simple as deciding where to split. For example, the French "pomme-de-terre" cannot simply be broken into separate tokens: although the words "pomme", "de", and "terre" each have independent meanings (lexical entries), "pomme-de-terre" is itself yet another entry. There are almost certainly examples for English as well. Some pre-processing strategy or reconstruction would have to be applied to handle this.
4. and 5. Tokenization involving slashes, numbers, and other special constructions is an issue regardless of the strategy. With substring mapping, you could effect both the single-token representation and the multiple-token representation with a recursive structure, i.e., a token containing several other tokens (a set containing a subset).
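The recursive structure suggested above can be sketched as a token that may contain subtokens, giving both views of a string at once; the class below is an illustrative assumption, not a prescribed format.

```python
class Token:
    """A token that may recursively contain subtokens."""
    def __init__(self, form, subtokens=()):
        self.form = form
        self.subtokens = list(subtokens)

    def leaves(self):
        """Flatten to the multiple-token view; a token with no subtokens
        yields just its own form."""
        if not self.subtokens:
            return [self.form]
        return [f for t in self.subtokens for f in t.leaves()]

ratio = Token("Distance/Time",
              [Token("Distance"), Token("/"), Token("Time")])
```

Here `ratio.form` gives the single-token view ("Distance/Time") and `ratio.leaves()` gives the multiple-token view, so downstream annotation can choose either.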
