TOKENIZATION AS THE INITIAL PHASE IN NLP

Jonathan J. Webster & Chunyu Kit
City Polytechnic of Hong Kong
83 Tat Chee Avenue, Kowloon, Hong Kong
E-mail: ctwebste@cphkvx.bitnet

ABSTRACT

In this paper, the authors address the significance and complexity of tokenization, the beginning step of NLP. Notions of word and token are discussed and defined from the viewpoints of lexicography and pragmatic implementation, respectively. Automatic segmentation of Chinese words is presented as an illustration of tokenization. Practical approaches to the identification of compound tokens in English, such as idioms, phrasal verbs and fixed expressions, are developed.

1. Introduction: Tokenization in NLP

In NLP studies, it is conventional to concentrate on pure analysis or generation while taking the basic units, namely words, for granted. It is an obvious truth, however, that without these basic units clearly segregated, it is impossible to carry out any analysis or generation. Yet too little attention has so far been paid to the process, a kind of preprocessing in a sense, of identifying the basic units to be processed. The simplicity of recognizing words in English, resulting from the existence of space marks as explicit delimiters, has most likely misled us into overlooking the complexity of distinguishing other units in English, such as idioms and fixed expressions, not to mention the difficulty of identifying words in other languages, like Chinese, which lack such delimiters altogether.

In this paper, we define this preprocessing as tokenization. The first step in NLP is to identify tokens, those basic units which need not be decomposed in subsequent processing. The entity word is one kind of token for NLP, the most basic one. Our concern, however, is with using the computer to recognize those tokens that have no distinct delimiters, such as Chinese words, English idioms and fixed expressions.

So far, there exists very little research adopting the notion of tokenization we put forward here. Santos (1990) explored a pragmatic way to transfer English idioms and fixed expressions in the domain of machine translation. Linden et al. (1990) focused on determining the idiomatic or non-idiomatic meaning of idioms. We believe that, by taking idioms and fixed expressions as a kind of basic unit at the same level as words, tokenization takes on a more generalized and realistic significance, making NLP and MT systems more robust and practical. Before the identification of such tokens can be achieved by computational means, many fundamental issues need to be resolved. Among these, the most important is clearly the definition of token.

2. Defining the entity word

There are a number of notions of what counts as a token in NLP. Different notions depend on different objectives (e.g. parsing, MT) and often on different language backgrounds. To arrive at a definition of token which is at once linguistically significant and methodologically useful, we propose first to address the issue of what constitutes a word from a lexicographer's point of view.

Speaking as a lexicographer, J. McH. Sinclair proposes to define a lexical item as "a formal item (at least one morpheme long) whose pattern of occurrence can be described in terms of a uniquely ordered series of other lexical items occurring in its environment" (1966:412). For the lexicographer, it is simply a question of finding significant collocations. Sinclair differentiates between what he calls 'casual' and 'significant' collocation.
Casual collocation includes items which have no bearing on the node; as Sinclair explains, they "may be accidental, reflecting the place, perhaps, where someone breaks into a committee meeting with the coffee; or they may include the magnificent images of some of our greatest poetry" (1966:418). The larger the corpus, the more casual collocates will be netted, but at the same time their significance will steadily decrease. On the other hand, "collocates typical of the item in question will impress their pattern more and more strongly until the pattern is broadly speaking complete and the evidence of further occurrence does not materially alter the pattern" (1966:418).

The lexicographer's approach to identifying words has significance for tokenization. By comparing observed collocation patterns of strings with stored patterns, we can proceed to segment the text into words. Finding significant tokens depends on the ability to recognize patterns displaying significant collocation. Rather than simply relying on whether a string is bounded by delimiters on either side, segmentation into significant tokens relies on a kind of pattern recognition involving collocational patterns.

While suggesting that the search for lexical items begin with those units "which we widely call morphemes", Sinclair acknowledges those problems which are likely to complicate matters for the lexicographer: (i) homographs; and (ii) compounds, or multi-morpheme items. Both problems are likely also to affect, and perhaps even frustrate, attempts at automatic segmentation of strings into meaningful units. Homographs, for instance, possess multiple collocational patterns. It becomes a question not simply of finding a match, but of evaluating between patterns to find the one with the best fit. Taking the complex homograph hand as an example, Sinclair writes, "Grammar is hardly any help at all, and the distinctions gained by a word-class division make little inroads on the complexity of this form. The lexicographer is forced to examine the internal consistency of the cluster" (1966:425). What one discovers is that some occurrences of hand are collocationally distinct from other occurrences of the same form. Sinclair cites two instances of hand: one having collocates like marriage, daughter, and engagement; the other with collocates which include words we associate with card games, like whist, rummy, ace, and flush. The two groupings are quite distinct. As Sinclair puts it, "the chances of whist and marriage co-occurring are as poor as those of archbishop and fish." This suggests we are dealing with different lexical items. On the other hand, "groupings which shade into each other, even though opposite ends do not intercollocate, suggest one item with a wide range."

Polymorphemic items further complicate the situation. Richard Hudson, in his Word Grammar, treats compounds as a single word whose composition consists of a string of two words (1984:50). Citing the example of furniture shop, he explains, "the word shop provides the link between furniture and the rest of the sentence - it is because shop is a noun that the whole can occur where nouns can occur; if the whole were plural, the suffix would be added to shop, not to furniture; and semantically, a furniture shop is a kind of shop, and not a kind of furniture" (1984:87).
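Whether the task is discriminating a homograph such as hand or deciding that a combination such as furniture shop behaves as a unit, the underlying operation is matching observed collocates against stored patterns. The following minimal Python sketch illustrates that matching for the homograph hand; the collocate clusters, test sentences and overlap scoring are our own invented illustrations rather than Sinclair's data or procedure, and a working system would derive such clusters from significance statistics over a large corpus.

```python
# Toy collocate clusters for two readings of the homograph "hand".
# These sets are invented for illustration only.
STORED_CLUSTERS = {
    "hand/marriage": {"marriage", "daughter", "engagement", "ask", "won"},
    "hand/cards": {"whist", "rummy", "ace", "flush", "deal", "play"},
}

def best_cluster(context_words):
    """Return the stored cluster whose collocates overlap most with the
    observed context, i.e. the pattern with the best fit."""
    scores = {
        name: len(collocates & set(context_words))
        for name, collocates in STORED_CLUSTERS.items()
    }
    name, score = max(scores.items(), key=lambda item: item[1])
    return name if score > 0 else None

print(best_cluster("he asked for her daughter's hand in marriage".split()))
# -> hand/marriage ("marriage" matches; the inflected "daughter's" does not)
print(best_cluster("he was dealt a winning flush hand at whist".split()))
# -> hand/cards ("flush" and "whist" match)
```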
Hudson goes so far as to float the idea that we treat expressions like London shop, expensive ship, or even soon left (as in John soon left) in the same manner as single words consisting of a modifier followed by its head (1984:89). While he admits this reanalysis "may seem perverse", he nevertheless believes there are arguments in its favour. For one thing, the word-order rules for English would be simplified (i.e. a modifier follows its head unless both modifier and head are part of a word, in which case the modifier comes first). Also, his reanalysis would help to explain why premodifiers cannot themselves have postmodifiers.

Sinclair, on the other hand, would not regard a particular combination of words as a separate polymorphemic item unless its cluster cannot be predicted from the clusters of its components (1966:423). Thus, while some occurrences of cold + feet are regarded as a separate polymorphemic item, cold + hands would not be treated as such. Sinclair fixes no limit on the size of a polymorphemic item. Moreover, contrary to a claim made by Hudson "that the parts of a word cannot be separated by other words which are not part of the same word" (1984:89), Sinclair argues that the components of a polymorphemic item may in fact be discontinuous. Sinclair cites examples like you must cut your coat, I'm afraid, according to your cloth, and, from a Sunday newspaper, put all his nuclear eggs in the West German basket.

The possibility of achieving word recognition through mapping collocations in the text to stored collocational patterns suggests a common-sense, practical approach to tokenization and disambiguation.

3. Automatic word segmentation in Chinese NLP - an example of tokenization

Identification of words is still a perplexing problem in Chinese NLP. As with English words, particularly idioms and compounds, the source of difficulty has to do with the absence of delimiters between tokens.

3.1 Background

As we know, a Chinese character is computationally represented by an internal code. Words, however, each of which may consist of one or more characters, have no obvious indicators to mark their boundaries. Tokenization of Chinese words, including idioms and fixed expressions (which are, of course, phrases containing words as their constituents but used as words), is generally regarded as another bottleneck following Chinese character coding. It is known formally as automatic word segmentation in mainland China and as word identification abroad. In recent years, it has become a very important topic in Chinese NLP. Without coding, it is impossible to input characters into the computer; without word identification, we cannot hope to achieve text processing.

This topic has been approached from two sides. On the theoretical side, researchers have sought an explicit specification of the entity word. The difficulty of word identification has resulted from a confusion of character, word and phrase in Chinese linguistics. Because the constructions of words, phrases and sentences are so similar, some scholars even believe they are identical. In an attempt to bring this debate to some conclusion, a standard for word segmentation was introduced by the Chinese State Bureau of Standardization. The term segmentation unit was employed to refer to words, idioms, fixed expressions, terminology as long as two or three dozen characters, and indeed any entity which can be treated as an undivided unit in a given processing task (Kit 1989).
This term, as a prototype of token, indicates the appearance of the notion of tokenization in Chinese computing.

3.2 Basic methods

On the practical side, studies have concentrated on two aspects: 1) the implementation of mechanical segmentation with its fundamental supports, such as the construction of a dictionary which permits quick and efficient access; 2) strategies for disambiguation.

At the outset, segmentation methods were invented one after another and seemed inexhaustible. But after systematic study, a structural model was finally built (Kit 1988; Kit et al. 1989). In essence, word segmentation involves table look-up and string matching, such that a character string of the input text is compared with entries in an existing word list, i.e., the dictionary. Every automatic segmenting method of this kind has been shown to be determined by three factors, as captured in the structural model ASM(d, a, m), in which ASM stands for Automatic Segmenting Method; d ∈ {+1, -1} indicates the scanning direction in matching, forward (left to right) or backward (right to left), respectively; a ∈ {+1, -1} indicates character addition or omission in each round of string matching that finds a word, respectively; and m ∈ {+1, -1} indicates the use of maximum or minimum matching, respectively. It is believed that all elemental methods are included in this model. Furthermore, it can be viewed as a general model for string-matching methods over any elements, including methods for finding English idioms.

The minimum match methods are not appropriate for Chinese word segmentation, because almost every Chinese character can be used as a token, i.e. a word or a single morpheme. By contrast, a maximum match method can obtain an identification rate as high as around 98 per cent, given an adequately large dictionary. The earliest and most influential implementation was the CDWS system (Liang 1984; Liu & Liang 1986), which processed a corpus of 200 million characters in practical use. A segmenting strategy may integrate more than one basic method to achieve a special task; e.g., forward and backward scanning methods are often employed together to check segmentation ambiguities. This has proven an efficient, though not perfect, approach.

3.3 Handling ambiguities

Most research today on Chinese word segmentation has shifted to handling ambiguities in order to achieve a higher identification rate. There are two types of ambiguities at the level of word segmentation:

Type I: In a sequence of Chinese characters S = a1...ai b1...bj, if a1...ai, b1...bj and S are each a word, then there is conjunctive ambiguity in S: the segment S, which is itself a word, contains other words. This is also known as multi-combinational ambiguity.

Type II: In a sequence of Chinese characters S = a1...ai b1...bj c1...ck, if a1...ai b1...bj and b1...bj c1...ck are each a word, then S is an overlapping ambiguous segment, or in other words the segment S displays disjunctive ambiguity. The segment b1...bj is known as an overlap, which is usually one character long.

3.3.1 Ambiguity checking

The first step toward resolving segmentation ambiguities is to find them. Bidirectional scanning is one simple and powerful method: differences between the segmentations produced by forward and backward scanning reveal the presence of ambiguities. But there still remain many ambiguities that this method does not find.
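By way of illustration, the following minimal Python sketch implements forward and backward maximum matching over a toy dictionary and flags a possible disjunctive ambiguity when the two scans disagree. The dictionary and example sentence are invented; a practical system such as CDWS relies on a much larger dictionary and further disambiguation knowledge, and the sketch leaves the addition/omission parameter a of ASM(d, a, m) unmodelled.

```python
# A minimal sketch of two members of the ASM(d, a, m) family: forward and
# backward maximum matching against a word list, plus bidirectional
# comparison to flag segmentation ambiguities.

DICTIONARY = {"结婚", "的", "和", "和尚", "尚未", "未", "结", "婚", "尚"}
MAX_WORD_LEN = max(len(word) for word in DICTIONARY)

def forward_max_match(text):
    """d = +1 (left to right), m = maximum matching."""
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in DICTIONARY:
                tokens.append(candidate)  # unknown single characters fall through as tokens
                i += length
                break
    return tokens

def backward_max_match(text):
    """d = -1 (right to left), m = maximum matching."""
    tokens, j = [], len(text)
    while j > 0:
        for length in range(min(MAX_WORD_LEN, j), 0, -1):
            candidate = text[j - length:j]
            if length == 1 or candidate in DICTIONARY:
                tokens.append(candidate)
                j -= length
                break
    return list(reversed(tokens))

sentence = "结婚的和尚未结婚的"
fwd, bwd = forward_max_match(sentence), backward_max_match(sentence)
print(fwd)  # ['结婚', '的', '和尚', '未', '结婚', '的']
print(bwd)  # ['结婚', '的', '和', '尚未', '结婚', '的']
if fwd != bwd:
    print("bidirectional scanning disagrees: possible disjunctive ambiguity")
```

Here the forward scan yields 和尚 where the backward scan yields 和 plus 尚未; the disagreement around the overlapping character 尚 is exactly the Type II situation described above.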
An integral approach to checking segmentation ambiguities was developed as follows: 1. Find all possible words from the beginning of the string and record their end positions; 2. Redo step 1 from those end positions rather than from the beginning; if any new end position equals or exceeds the previous greatest one, a Type I or Type II ambiguity, respectively, is found. This is a very simple and efficient strategy for finding any ambiguity while preventing unnecessary operations on false ambiguities (Kit 1988 & 1992).

3.3.2 Approaches to disambiguation

Normally, the disambiguation stage follows the mechanical segmentation and the ambiguity checking. Two distinct approaches to disambiguation are the knowledge-based and the statistics-based. The former discriminates all ambiguities by means of a built-in knowledge base, including rules, which are applied to series of similar ambiguities, and special-case knowledge for particular cases of ambiguity (Liang 1984 & 1990; He et al. 1991). A large number of uncertainties are settled in this way, but there is a side-effect: the rules may introduce mistakes on cases that even a mechanical segmenting method can handle properly (Kit 1988). This may be partially due to the complexity of language, but a more sophisticated approach to organizing and applying knowledge is still needed. As for the latter, deriving from corpus linguistics, general tagging techniques are employed and some advances have been reported (Lai et al. 1991), but the design of a comprehensive and efficient tagging system remains a big problem. Besides these, a relaxation approach, which skips the mechanical segmentation and relies entirely on probability calculation, is theoretically sound; in practice, however, its identification rate is only about 95% (Fan & Tsai 1988), lower than that of mechanical methods. An appropriate combination of relaxation and mechanical means is expected to achieve a better result.

4. English compound tokens in NLP

In the previous sections, we concentrated on words, in both English and Chinese. In fact, there are also a large number of compound tokens that take simple tokens, like words, as their constituents. They are critical to many processes in NLP and machine translation, so their identification is of great significance.

4.1 Rationale for the notion of token

Concepts such as word, collocation, and multi-morpheme item are important to lexicographers and linguists, whereas the concept of token is specific to certain processes in NLP and MT. A token will not be broken down into smaller parts; in other words, for the purpose of computational processing, it can be treated as an atom. There are many compound tokens, composed of a number of words, to be transferred as a whole in MT. In syntactic analysis, if it is decided to treat them as indivisible units, with no regard to their inner structure, then they become tokens for syntactic analysis. A token, then, is a terminal node in processing. This is the essence, and also the importance, of the concept of token.
4.2 Decomposition versus Identification

There are two main opposing views on how one should deal with English idioms, which in our framework are identified as compound tokens: one stresses the decompositionality of idioms into constituent parts (Wasow, Sag, and Nunberg 1983; Gazdar et al. 1985; Stock 1989); the other considers idioms to be units of language as basic as words and wholly non-compositional in meaning (Wood 1986; Linden et al. 1990; Santos 1990). The concept of token may offer a possible solution to this debate. The degree to which a linguistic unit requires decomposition depends on the nature of the task to be performed. In the case of lexical transfer in MT, there is no need to decompose an idiom into its constituent parts. However, as noted below in our discussion of idioms and fixed expressions in discontinuous co-occurrence, structural analysis is sometimes necessary. In either case, priority must be given to the recognition of compound tokens: the whole must first be ascertained before one can even consider what its constituents are.

5. Tokenization and lexical information retrieval

There are a number of approaches to recognizing compound tokens. In this section we discuss two in particular. In the first, recognition is achieved by means of accessing lexical knowledge represented as a network of associations. The second combines table-look-up matching with knowledge processing.

5.1 Lexical information retrieval as a basis for token recognition

As noted above, the lexicographer's notion of word corresponds closely to the notion of token we have adopted here. We saw that what the lexicographer takes to be a word is an entity for which there exists some distinctive and significant collocation pattern. This bidirectional association between a word and its companions is itself evidence of that word's integrity and offers insight into its interpretation. The lexicographer's discovery procedure offers a useful model for achieving token identification. We propose to train a neural network to recognize tokens on the basis of their companion relations. Once the training process is completed, the neural network will be able to perform the tasks of tokenization and disambiguation by matching input with learned patterns of companion relations. The network might also have to include information about other kinds of relations as well.

The basic premise of Richard Hudson's Word Grammar is that the entity word can be realized as part of a system or network of relations. Entities in the lexicon, he explains, include words and their parts, their models, their companions, their referents and their contexts of use. Lexemes are emic units joined systematically to one another along vertical and horizontal dimensions. Every entity in the lexicon is at once a whole and an instance: it is the composite realization of its parts as well as the realization of some model. Along this vertical dimension, information flows from the more general to the more specific. The horizontal dimension, on the other hand, includes the lexical constraints imposed by heads on modifiers as well as vice versa. These Hudson refers to as an entity's companion relations. Such are the relations between collocates. Hudson's network approach accounts for the various realizations of entities as they occur in context in terms of the connections drawn between an entity and its referent(s), utterance-event(s), and companion(s).
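To suggest how companion relations might be stored and consulted for token recognition, here is a minimal Python sketch using a toy frame-style lexicon whose slots loosely follow Hudson's relations. The entries, slot fillers and scoring are invented for illustration and are not drawn from any implemented lexicon; in the proposal above, such patterns would be learned by the network rather than listed by hand.

```python
# Toy frame-style lexical entries recording some of Hudson's relations
# (parts, model, companions, referent). All fillers are invented.
LEXICON = {
    "cold feet": {
        "parts": ["cold", "feet"],
        "model": "noun",
        "companions": {"get", "got", "have", "had", "about", "before"},
        "referent": "loss of nerve",
    },
    "cold": {
        "parts": ["cold"],
        "model": "adjective",
        "companions": {"weather", "water", "wind", "hands", "winter"},
        "referent": "low temperature",
    },
}

def companion_score(entry_name, context_words):
    """Fraction of an entry's stored companions observed in the context;
    a crude stand-in for matching input against learned companion patterns."""
    companions = LEXICON[entry_name]["companions"]
    return len(companions & set(context_words)) / len(companions)

sentence = "he got cold feet just before the wedding".split()
print(companion_score("cold feet", sentence))  # 2/6: 'got' and 'before' match
print(companion_score("cold", sentence))       # 0.0: no literal companions match
```

A markedly higher score for the longer entry is taken as evidence for tokenizing cold feet as a single compound token rather than as the free combination cold + feet.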
In a previous implementation of Hudson's network approach, we represented each lexical entry by means of a frame whose slots corresponded to Hudson's five relations (Webster 1987).

5.2 Table-look-up matching

The simplest approach to identification of compound tokens is obviously table-look-up matching. Admittedly, it presumes that a list of sample tokens in sufficient number already exists. The basic steps of this approach are, first, to tokenize each single word, and then to continue matching to find whether there are any compound tokens among these single words. Such an approach is very similar to the basic method of automatic segmentation of Chinese words. This method can recognize English idioms and other compound tokens whose constituents are continuous, but it has no ability to handle ambiguities or to catch variant forms of idioms and fixed expressions in discontinuous co-occurrence.

5.3 Generalized Table-look-up

This is an adjusted table-look-up method designed to deal with idioms and fixed expressions in discontinuous co-occurrence, e.g., keep [NP] in mind, in which keep plus in mind constitutes a fixed expression, and in mind, a prepositional phrase, is merely part of a bigger token. Between these two parts there is a noun phrase which is usually not too long. If it is long, we have a variant form, i.e., keep in mind [NP]. Of course, the [NP] can be substituted by a [Subclause]. In order to identify a compound token like this, we have to determine the NP or Subclause. Operations of this type need, in part, to make use of syntactic analysis. Thus, partial parsing needs to be incorporated into the table-look-up. Notice that the parsing may take every word as its token in order to find compound tokens. From this, one can see the importance of structural analysis to the identification of compound tokens. Besides structural analysis, knowledge about compound tokens, such as where the [NP] and [Subclause] should be placed in the discontinuous token keep ... in mind, is also required. Discontinuous idioms, phrasal verbs such as figure out [NP] and figure [it] out, and fixed expressions all have to be processed with the aid of such knowledge. By now it is clear that the generalized table-look-up is an approach combining parsing and knowledge processing. With adequate knowledge about discontinuous compound tokens, it may prove effective in their identification.

6. Conclusion

The notion of token must first be defined before computational processing can proceed. Obviously there is more to the issue than simply identifying strings delimited on both sides by spaces or punctuation. We have considered what constitutes a token from two perspectives: one from the lexicographer's experience with identifying words, the second from the experience of researchers in the area of Chinese NLP. From the work on automatic word segmentation in Chinese NLP, we have noted some valuable lessons which can be applied to the recognition of idioms and other fixed expressions in English. The lexicographer's discovery procedures, informed with knowledge of lexical relations implemented either as a neural network or in lexical frames, also provide a useful model for the construction of a practical, knowledge-based approach to tokenization and disambiguation.

Main references

[1] Fan, C. K., and Tsai, W. H. 1988.
Automatic word identification in Chinese sentences by the relaxation technique. Computer Processing of Chinese & Oriental Languages, Vol. 4, No. 1.

[2] Gazdar, G., Klein, E., Pullum, G., and Sag, I. 1985. Generalized Phrase Structure Grammar. Cambridge, Mass.: Harvard University Press.

[3] He, K., Xu, H., and Sun, B. 1991. Expert system for automatic word segmentation of written Chinese. Proceedings of the 1991 International Conference on Computer Processing of Chinese and Oriental Languages, Taiwan.

[4] Hudson, R. A. 1984. Word Grammar. Oxford: Basil Blackwell.

[5] Kit (= Jie), C. 1988. Methods of Chinese automatic word segmentation and the design and implementation of a practical system. Master's Thesis, Graduate School, Chinese Academy of Social Sciences, Beijing.

[6] Kit (= Jie), C., Liu, Y., and Liang, N. 1989. On methods of Chinese automatic word segmentation. Journal of Chinese Information Processing, Vol. 3, No. 1.

[7] Kit (= Jie), C. 1989. Some key issues on the contemporary Chinese language word segmentation standard used for information processing. Proceedings of the 1989 International Symposium on Standardization of Chinese Information Processing, Beijing.

[8] Kit, C. 1992. Practical techniques of Chinese automatic word segmentation in the applied system CASS. Proceedings of Pan-Asiatic Linguistics-92, Bangkok.

[9] Kramsky, Jiri. 1969. The Word as a Linguistic Unit. The Hague: Mouton.

[10] Lai, T., Lun, S., Sun, C., and Sun, M. 1991. A maximal match Chinese text segmentation algorithm using mainly tags for resolution of ambiguities. Proceedings of ROCLING IV, Taiwan.

[11] Liang, N. 1984. Chinese automatic word segmentation system CDWS. Journal of Beijing University of Aeronautics and Astronautics, 1984, No. 4.

[12] Liang, N. 1990. The knowledge for Chinese word segmentation. Journal of Chinese Information Processing, Vol. 4, No. 2.

[13] Linden, E. van der, and Kraaij, W. 1990. Ambiguity resolution and the retrieval of idioms: two approaches. Proceedings of COLING-90, Helsinki, Finland.

[14] Liu, Y., and Liang, N. 1986. Basic engineering for Chinese processing - modern Chinese word frequency count. Journal of Chinese Information Processing, Vol. 1, No. 1.

[15] Santos, D. 1990. Lexical gaps and idioms in machine translation. Proceedings of COLING-90, Helsinki, Finland.

[16] Sinclair, John McH. 1966. Beginning the study of lexis (pp. 410-429). In Bazell et al. (eds.), In Memory of J. R. Firth. London: Longmans.

[17] Stock, O. 1989. Parsing with flexibility, dynamic strategies and idioms in mind. Computational Linguistics, Vol. 15, No. 1.

[18] Wasow, T., Sag, I., and Nunberg, G. 1983. Idioms: an interim report. In S. Hattori and K. Inoue (eds.), Proceedings of the 13th International Congress of Linguists, Tokyo, Japan.

[19] Webster, Jonathan J. 1987. A computational model for representing WORD knowledge (pp. 432-442). In The Thirteenth LACUS Forum 1986. Chicago, Illinois: Linguistic Association of Canada and the United States.

[20] Wood, M. McGee. 1986. A Definition of Idiom. Master's Thesis, University of Manchester (1981). Reproduced by the Indiana University Linguistics Club.