Skip to main content

Universal Data Compression Algorithm Based on Approximate String Matching

  • Ilan Sadeh (a1)

A practical source coding scheme based on approximate string matching is proposed. It is an approximate fixed-length string matching data compression combined with a block-coder based on the empirical distribution. A lemma on approximate string matching, which is an extension of the Kac Lemma, is proved. It is shown, based on the lemma, that the deterministic algorithm converts the stationary and ergodic source, u, into an output process v, and under the assumption that v is a stationary process, after the scheme has run for an infinite time, the optimal compression ratio R(D) is achieved. This reduces the problem of the universal lossy coder to the proof of stationarity of the output process ν in the proposed algorithm. The main advantages of the proposed method are the asymptotic sequential behavior of the encoder and the simplicity of the decoder.

Hide All
1.Algoet, P.H. & Cover, T.M. (1988). A sandwich proof of the Shannon McMillan Breiman theorem. Annals of Probability 16(2): 899909.
2.Alon, N. & Spencer, J. (1992). The probabilistic method. New York: John Wiley.
3.Arratia, R.Gordon, L. & Waterman, M. (1990). The Erdos Renyi law in distribution for coin tossing and sequence matching. Annals of Statistics 18: 539570.
4.Arratia, R. & Waterman, M. (1989). The Erdos Renyi strong law for pattern matching with given proportion of mismatches. Annals of Probability 17: 11521169.
5.Blahut, R.E. (1987). Principles and practice of information theory. Reading, MA: Addison-Wesley.
6.Breiman, L. (1957). The individual ergodic theorem of information theory. Annals of Mathematical Statistics 28: 809811.
7.Elias, P. (1975). Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory IT-21: 194203.
8.Galil, Z. & Giancarlo, R. (1988). Data structures and algorithms for approximate string matching. Journal of Complexity 4: 3372.
9.Gilbert, E. & Kadota, T. (1992). The Lempel Ziv algorithm and the message complexity. IEEE Transactions on Information Theory IT-38: 18391842.
10.Gray, R.M. (1975). Sliding block source coding. IEEE Transactions on Information Theory IT-21: 357368.
11.Gray, R.M. (1990). Entropy and information theory. New York: Springer-Verlag.
12.Jacquet, P. & Szpankowski, W. (1996). Asymptotic behaviour of the Lempel Ziv parsing scheme and digital search trees. Theoretical Computer Science June.
13.Kac, M. (1947). On the notion of recurrence in discrete stochastic processes. Bulletin of the American Mathematical Society 53: 10021010.
14.Knuth, D.E. (1981). The art of computer science. Sorting and searching, Vol. 3. Reading, MA: Addison-Wesley.
15.Landau, G.M. & Vishkin, U. (1986). Introducing efficient parallelism into approximate string matching and a new serial algorithm. Proceedings of the 18th Annual ACM Symposium on the Theory of Computing, Berkeley, CA.
16.Le, Gall D. (1991). MPEG: A video compression standard for multimedia applications. Communications of the Association for Computing Machines 34: 4658.
17.Lempel, A. & Ziv, J. (1977). A universal algorithm for sequential data compression. IEEE Transactions on Information Theory 1T-23: 337343.
18.Lempel, A. & Ziv, J. (1978). Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory IT-24: 530536.
19.Luczak, T. & Szpankowski, W. (to appear). A lossy data compression based on an approximate pattern matching. Submitted for publication.
20.Manber, U. & Myers, E.W. (1993). Suffix trees: A new method for on-line string searchers. SIAM Journal on Computing 22(5): 935948.
21.McCreight, E.M. (1976). A space economical suffix tree construction algorithm. Journal of the Association for Computing Machines 32(2): 262272.
22.Ornstein, D.S. & Shields, P.C. (1990). Universal almost sure data compression. Annals of Probability 18: 441452.
23.Ornstein, D.S. & Weiss, B. (1993). Entropy and data compression schemes. IEEE Transactions on Information Theory 39(1).
24.Pittel, B. (1985). Asymptotic growth of a class of random trees. Annals of Probability 13: 414427.
25.Plotnik, E.Weinberger, M. & Ziv, J. (1992). Upper bounds on the probability of sequences emitted by finite state sources and on the redundancy of the Lempel Ziv algorithm. IEEE Transactions on Information Theory 38: 6672.
26.Rodeh, M.Pratt, V. & Even, S. (1981). Linear algorithm for data compression via string matching. Journal of the Association for Computing Machines 28: 1624.
27.Padmanabha Rao, R. & Pearlman, W.A. (1991). On entropy of pyramid structures. IEEE Transactions on Information Theory 37(2).
28.Pearlman, W. (1994). Personal communications.
29.Sadeh, I. (1993). Data compression in computer networks. Ph.D. dissertation, Tel-Aviv University.
30.Sadeh, I. (1993). On approximate string matching. Proceedings of the 1993 Data Compression Conference DCC '93, 03, Snowbird, Utah.
31.Sadeh, I. (1995). Operational rate distortion theory. Journal of Applied Mathematics and Computer Science 5(1): 139169.
32.Steinberg, Y. & Gutman, M. (1993). An algorithm for source coding subject to a fidelity criterion, based on string matching. IEEE Transactions on Information Theory 39: 877887.
33.Storer, J.A. (1990). Lossy on line dynamic data compression. New York: Springer-Verlag.
34.Szpankowski, W. (1993). A generalized suffix tree and its unexpected asymptotic behaviours. SIAM Journal of Computing 22: 11761198.
35.Welch, T.A. (1984). A technique for high performance data compression. IEEE Transactions on Computers 17(6): 819.
36.Wyner, A.D. & Ziv, J. (1989). Some asymptotic properties of the entropy of a stationary ergodic data source with applications to data compression. IEEE Transactions on Information Theory 35: 12501258.
37.Ziv, J. (1978). Coding theorem for individual sequences. IEEE Transactions on Information Theory IT-24: 405412.
Recommend this journal

Email your librarian or administrator to recommend adding this journal to your organisation's collection.

Probability in the Engineering and Informational Sciences
  • ISSN: 0269-9648
  • EISSN: 1469-8951
  • URL: /core/journals/probability-in-the-engineering-and-informational-sciences
Please enter your name
Please enter a valid email address
Who would you like to send this to? *


Altmetric attention score

Full text views

Total number of HTML views: 0
Total number of PDF views: 0 *
Loading metrics...

Abstract views

Total abstract views: 0 *
Loading metrics...

* Views captured on Cambridge Core between <date>. This data will be updated every 24 hours.

Usage data cannot currently be displayed