
A new PPM variant for Chinese text compression

  • PEILIANG WU and W. J. TEAHAN

Large-alphabet languages such as Chinese differ markedly from English and therefore pose different problems for text compression. In this article, we first examine the characteristics of Chinese, and then introduce a new variant of the Prediction by Partial Match (PPM) model designed specifically for Chinese characters. Traditional PPM coding schemes encode an escape probability whenever a novel character occurs in the context; the new scheme instead encodes the order first and then the symbol, so no escape probability has to be output. This scheme achieves excellent compression rates in comparison with other schemes on a variety of Chinese text files.
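The contrast between escape-based coding and order-first coding can be illustrated with a rough sketch. The code below is not the authors' algorithm: the abstract gives no details, so everything here is an assumption made for illustration, including the names `ContextCounts`, `escape_cost` and `order_first_cost`, the maximum order of 2, the 65,536-symbol fallback alphabet, the use of method C escapes without exclusions, and the add-one smoothed model over orders. It only estimates code lengths in bits; no arithmetic coder is involved.

```python
import math
from collections import Counter, defaultdict

MAX_ORDER = 2          # assumed maximum context length
ALPHABET_SIZE = 65536  # assumed size of the uniform order -1 fallback model


class ContextCounts:
    """Symbol counts for every context of length 0..MAX_ORDER."""

    def __init__(self):
        self.tables = [defaultdict(Counter) for _ in range(MAX_ORDER + 1)]

    def update(self, history, symbol):
        for order in range(min(MAX_ORDER, len(history)) + 1):
            ctx = history[len(history) - order:]
            self.tables[order][ctx][symbol] += 1

    def counts(self, history, order):
        ctx = history[len(history) - order:]
        return self.tables[order].get(ctx, Counter())


def escape_cost(model, history, symbol):
    """Bits for `symbol` under escape-based PPM (method C, no exclusions)."""
    bits = 0.0
    for order in range(min(MAX_ORDER, len(history)), -1, -1):
        c = model.counts(history, order)
        total, distinct = sum(c.values()), len(c)
        if total == 0:
            continue  # unseen context: nothing is coded at this order
        if symbol in c:
            return bits - math.log2(c[symbol] / (total + distinct))
        bits -= math.log2(distinct / (total + distinct))  # cost of one escape
    return bits + math.log2(ALPHABET_SIZE)  # fall back to order -1


def order_first_cost(model, order_counts, history, symbol):
    """Bits when the predicting order is coded first, then the symbol."""
    used = -1  # -1 means the symbol is novel in every context
    for order in range(min(MAX_ORDER, len(history)), -1, -1):
        if symbol in model.counts(history, order):
            used = order
            break
    # Hypothetical order model: adaptive counts over {-1, 0, ..., MAX_ORDER},
    # add-one smoothed so that every order keeps a nonzero probability.
    total = sum(order_counts.values()) + (MAX_ORDER + 2)
    bits = -math.log2((order_counts[used] + 1) / total)
    if used == -1:
        bits += math.log2(ALPHABET_SIZE)
    else:
        c = model.counts(history, used)
        bits -= math.log2(c[symbol] / sum(c.values()))  # no escape mass reserved
    return bits, used


def compare(text):
    """Return (escape-based bits, order-first bits) for `text`."""
    model, order_counts = ContextCounts(), Counter()
    esc_bits = ord_bits = 0.0
    for i, symbol in enumerate(text):
        history = text[max(0, i - MAX_ORDER):i]
        esc_bits += escape_cost(model, history, symbol)
        bits, used = order_first_cost(model, order_counts, history, symbol)
        ord_bits += bits
        order_counts[used] += 1
        model.update(history, symbol)
    return esc_bits, ord_bits
```

Calling `compare` on a string returns the two estimated totals side by side. In this sketch the order-first variant reserves no probability mass for escapes inside a context, but pays instead for transmitting the order; which of the two is cheaper depends on the text and on how the order is modelled, and the sketch makes no claim about the results reported in the article.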


Natural Language Engineering
  • ISSN: 1351-3249
  • EISSN: 1469-8110
  • URL: /core/journals/natural-language-engineering