Splitting-merging model of Chinese word tokenization and segmentation
Published online by Cambridge University Press: 01 December 1998
Currently, word tokenization and segmentation are still a hot topic in natural language processing, especially for languages like Chinese in which there is no blank space for word delimitation. Three major problems are faced: (1) tokenizing direction and efficiency; (2) insufficient tokenization dictionary and new words; and (3) ambiguity of tokenization and segmentation. Most existing tokenization and segmentation methods have not dealt with the above problems together. To tackle the three problems in one basket, this paper presents a novel dictionary-based method called the Splitting-Merging Model (SMM) for Chinese word tokenization and segmentation. It uses the mutual information of Chinese characters to find the boundaries and the non-boundaries of Chinese words, and finally leads to a word segmentation by resolving ambiguities and detecting new words.
- Research Article
- © 1998 Cambridge University Press