A new PPM variant for Chinese text compression

PEILIANG WU; W. J. TEAHAN

doi:10.1017/S1351324907004597


Published online by Cambridge University Press: 01 July 2008


Affiliation (both authors): School of Informatics, University of Wales Bangor, Dean Street, Bangor, Gwynedd LL57 1UT, UK. Email: perry@informatics.bangor.ac.uk, wjt@informatics.bangor.ac.uk


Abstract

Large-alphabet languages such as Chinese differ greatly from English and therefore pose different problems for text compression. In this article, we first examine the characteristics of Chinese, then introduce a new variant of the Prediction by Partial Match (PPM) model designed specifically for Chinese characters. Traditional PPM coding schemes encode an escape probability whenever a novel character occurs in the current context. The new coding scheme instead encodes the order first and then the symbol, so no escape probability has to be output. This scheme achieves excellent compression rates in comparison with other schemes on a variety of Chinese text files.
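The order-first idea described in the abstract can be illustrated with a small sketch. The abstract states only that the order is encoded before the symbol and that no escape probability is output; everything else below is an assumption made for illustration, not the authors' algorithm. In particular, the class name `OrderFirstModel`, the adaptive frequency table over orders, the uniform "order -1" fallback, and the alphabet size of 65536 are all hypothetical choices. The sketch computes ideal code lengths in bits rather than running a real arithmetic coder.

```python
import math
from collections import Counter, defaultdict


class OrderFirstModel:
    """Toy order-first context model. Illustrative only; not the authors' algorithm."""

    def __init__(self, max_order=2, alphabet_size=65536):
        self.max_order = max_order
        self.alphabet_size = alphabet_size  # assumed size of a 16-bit character set
        # contexts[k] maps a k-character context tuple to counts of what followed it.
        self.contexts = [defaultdict(Counter) for _ in range(max_order + 1)]
        # Adaptive counts over the order choices 0..max_order, plus a final
        # slot for "order -1" (symbol not seen in any context; coded uniformly).
        self.order_counts = [1] * (max_order + 2)

    def _context(self, history, k):
        # Last k symbols of the history (empty tuple when k == 0).
        return tuple(history[len(history) - k:])

    def _find_order(self, history, symbol):
        """Highest order whose context has already seen `symbol`, else -1."""
        for k in range(min(self.max_order, len(history)), -1, -1):
            if self.contexts[k][self._context(history, k)][symbol] > 0:
                return k
        return -1

    def cost_bits(self, history, symbol):
        """Ideal code length: bits for the order, then bits for the symbol."""
        k = self._find_order(history, symbol)
        slot = k if k >= 0 else self.max_order + 1
        bits = -math.log2(self.order_counts[slot] / sum(self.order_counts))
        if k >= 0:
            counts = self.contexts[k][self._context(history, k)]
            bits -= math.log2(counts[symbol] / sum(counts.values()))
        else:
            bits += math.log2(self.alphabet_size)
        return bits

    def update(self, history, symbol):
        """Record which order was used, then count the symbol in every context."""
        k = self._find_order(history, symbol)
        self.order_counts[k if k >= 0 else self.max_order + 1] += 1
        for j in range(min(self.max_order, len(history)) + 1):
            self.contexts[j][self._context(history, j)][symbol] += 1


def total_bits(text, max_order=2):
    """Sum of ideal code lengths for `text` under the toy model."""
    model = OrderFirstModel(max_order)
    history = []
    bits = 0.0
    for ch in text:
        bits += model.cost_bits(history, ch)
        model.update(history, ch)
        history.append(ch)
    return bits
```

Calling `total_bits("中文压缩" * 4)` gives the estimated cost of a short repetitive string. Because the encoder names the order outright, a decoder holding the same counts can look up the matching context directly, which is why no escape symbol appears anywhere in the sketch.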

Information

Type: Papers
Information: Natural Language Engineering, Volume 14, Issue 3, July 2008, pp. 417-430

DOI: https://doi.org/10.1017/S1351324907004597
Copyright: Copyright © Cambridge University Press 2007


