Word tokenization and segmentation remain an active topic in natural language
processing, especially for languages such as Chinese, in which no blank spaces delimit
words. Three major problems arise: (1) tokenizing direction and efficiency; (2)
incomplete tokenization dictionaries and new words; and (3) ambiguity in tokenization and
segmentation. Most existing tokenization and segmentation methods do not address these
problems together. To tackle all three problems at once, this paper presents a
novel dictionary-based method called the Splitting-Merging Model (SMM) for Chinese word
tokenization and segmentation. It uses the mutual information of Chinese characters to identify
the boundaries and non-boundaries of Chinese words, and arrives at a final word segmentation
by resolving ambiguities and detecting new words.
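To make the character-level statistic concrete, the sketch below (not the paper's actual SMM implementation) estimates pointwise mutual information between adjacent Chinese characters from a small corpus and proposes a word boundary wherever the score falls below a threshold, which loosely mirrors the splitting (boundary) versus merging (non-boundary) decision; the corpus, threshold, and function names are illustrative assumptions.

```python
import math
from collections import Counter

def character_stats(corpus):
    """Count single characters and adjacent character pairs in a corpus."""
    chars, pairs = Counter(), Counter()
    for sentence in corpus:
        chars.update(sentence)
        pairs.update(zip(sentence, sentence[1:]))
    return chars, pairs

def pmi(chars, pairs, a, b):
    """Pointwise mutual information of the adjacent character pair (a, b),
    with add-one smoothing so unseen pairs receive a low score."""
    n_chars = sum(chars.values())
    n_pairs = sum(pairs.values())
    p_ab = (pairs[(a, b)] + 1) / (n_pairs + len(pairs) + 1)
    p_a = (chars[a] + 1) / (n_chars + len(chars) + 1)
    p_b = (chars[b] + 1) / (n_chars + len(chars) + 1)
    return math.log(p_ab / (p_a * p_b))

def propose_boundaries(chars, pairs, text, threshold=0.0):
    """Propose a boundary between characters whose PMI is below the threshold
    (a likely split point); high-PMI pairs are kept together. The threshold
    value is an illustrative assumption, not a value from the paper."""
    return [i + 1 for i, (a, b) in enumerate(zip(text, text[1:]))
            if pmi(chars, pairs, a, b) < threshold]

if __name__ == "__main__":
    # Tiny illustrative corpus; a real system would estimate counts from a
    # large corpus and combine them with a dictionary, as the paper describes.
    corpus = ["北京天安门", "我住在北京", "天安门广场", "我在广场"]
    chars, pairs = character_stats(corpus)
    text = "我在北京"
    for a, b in zip(text, text[1:]):
        print(a, b, round(pmi(chars, pairs, a, b), 2))
    print(propose_boundaries(chars, pairs, text))
```

The dictionary lookup, ambiguity resolution, and new-word detection steps of SMM are outside the scope of this sketch, which only illustrates the underlying mutual-information signal.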