Universal Data Compression Algorithm Based on Approximate String Matching

Ilan Sadeh

doi:10.1017/S0269964800004502

Universal Data Compression Algorithm Based on Approximate String Matching

Published online by Cambridge University Press: 27 July 2009

Ilan Sadeh

Show author details

Ilan Sadeh: Affiliation:
Department of Communications Engineering, Center for Technological Education and Research, Holon affiliated with Tel Aviv University, P.O. Box 305, Holon, 58102, Israel

Article contents

Abstract
References

Get access

Rights & Permissions

Abstract

A practical source coding scheme based on approximate string matching is proposed. It is an approximate fixed-length string matching data compression combined with a block-coder based on the empirical distribution. A lemma on approximate string matching, which is an extension of the Kac Lemma, is proved. It is shown, based on the lemma, that the deterministic algorithm converts the stationary and ergodic source, u, into an output process v, and under the assumption that v is a stationary process, after the scheme has run for an infinite time, the optimal compression ratio R(D) is achieved. This reduces the problem of the universal lossy coder to the proof of stationarity of the output process ν in the proposed algorithm. The main advantages of the proposed method are the asymptotic sequential behavior of the encoder and the simplicity of the decoder.

Information

Type: Research Article
Information: Probability in the Engineering and Informational Sciences , Volume 10 , Issue 4 , October 1996 , pp. 465 - 486

DOI: https://doi.org/10.1017/S0269964800004502 [Opens in a new window]
Copyright: Copyright © Cambridge University Press 1996

Access options

Get access to the full version of this content by using one of the access options below. (Log in options will check for institutional or personal access. Content may require purchase if you do not have access.)

Article purchase

Temporarily unavailable

References

1.Algoet, P.H. & Cover, T.M. (1988). A sandwich proof of the Shannon McMillan Breiman theorem. Annals of Probability 16(2): 899–909.CrossRef Google Scholar

2.Alon, N. & Spencer, J. (1992). The probabilistic method. New York: John Wiley.Google Scholar

3.Arratia, R.Gordon, L. & Waterman, M. (1990). The Erdos Renyi law in distribution for coin tossing and sequence matching. Annals of Statistics 18: 539–570.CrossRef Google Scholar

4.Arratia, R. & Waterman, M. (1989). The Erdos Renyi strong law for pattern matching with given proportion of mismatches. Annals of Probability 17: 1152–1169.CrossRef Google Scholar

5.Blahut, R.E. (1987). Principles and practice of information theory. Reading, MA: Addison-Wesley.Google Scholar

6.Breiman, L. (1957). The individual ergodic theorem of information theory. Annals of Mathematical Statistics 28: 809–811.CrossRef Google Scholar

7.Elias, P. (1975). Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory IT-21: 194–203.CrossRef Google Scholar

8.Galil, Z. & Giancarlo, R. (1988). Data structures and algorithms for approximate string matching. Journal of Complexity 4: 33–72.CrossRef Google Scholar

9.Gilbert, E. & Kadota, T. (1992). The Lempel Ziv algorithm and the message complexity. IEEE Transactions on Information Theory IT-38: 1839–1842.CrossRef Google Scholar

10.Gray, R.M. (1975). Sliding block source coding. IEEE Transactions on Information Theory IT-21: 357–368.CrossRef Google Scholar

11.Gray, R.M. (1990). Entropy and information theory. New York: Springer-Verlag.CrossRef Google Scholar

12.Jacquet, P. & Szpankowski, W. (1996). Asymptotic behaviour of the Lempel Ziv parsing scheme and digital search trees. Theoretical Computer Science June.Google Scholar

13.Kac, M. (1947). On the notion of recurrence in discrete stochastic processes. Bulletin of the American Mathematical Society 53: 1002–1010.CrossRef Google Scholar

14.Knuth, D.E. (1981). The art of computer science. Sorting and searching, Vol. 3. Reading, MA: Addison-Wesley.Google Scholar

15.Landau, G.M. & Vishkin, U. (1986). Introducing efficient parallelism into approximate string matching and a new serial algorithm. Proceedings of the 18th Annual ACM Symposium on the Theory of Computing, Berkeley, CA.Google Scholar

16.Le, Gall D. (1991). MPEG: A video compression standard for multimedia applications. Communications of the Association for Computing Machines 34: 46–58.Google Scholar

17.Lempel, A. & Ziv, J. (1977). A universal algorithm for sequential data compression. IEEE Transactions on Information Theory 1T-23: 337–343.Google Scholar

18.Lempel, A. & Ziv, J. (1978). Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory IT-24: 530–536.Google Scholar

19.Luczak, T. & Szpankowski, W. (to appear). A lossy data compression based on an approximate pattern matching. Submitted for publication.Google Scholar

20.Manber, U. & Myers, E.W. (1993). Suffix trees: A new method for on-line string searchers. SIAM Journal on Computing 22(5): 935–948.CrossRef Google Scholar

21.McCreight, E.M. (1976). A space economical suffix tree construction algorithm. Journal of the Association for Computing Machines 32(2): 262–272.CrossRef Google Scholar

22.Ornstein, D.S. & Shields, P.C. (1990). Universal almost sure data compression. Annals of Probability 18: 441–452.CrossRef Google Scholar

23.Ornstein, D.S. & Weiss, B. (1993). Entropy and data compression schemes. IEEE Transactions on Information Theory 39(1).CrossRef Google Scholar

24.Pittel, B. (1985). Asymptotic growth of a class of random trees. Annals of Probability 13: 414–427.CrossRef Google Scholar

25.Plotnik, E.Weinberger, M. & Ziv, J. (1992). Upper bounds on the probability of sequences emitted by finite state sources and on the redundancy of the Lempel Ziv algorithm. IEEE Transactions on Information Theory 38: 66–72.CrossRef Google Scholar

26.Rodeh, M.Pratt, V. & Even, S. (1981). Linear algorithm for data compression via string matching. Journal of the Association for Computing Machines 28: 16–24.CrossRef Google Scholar

27.Padmanabha Rao, R. & Pearlman, W.A. (1991). On entropy of pyramid structures. IEEE Transactions on Information Theory 37(2).Google Scholar

28.Pearlman, W. (1994). Personal communications.Google Scholar

29.Sadeh, I. (1993). Data compression in computer networks. Ph.D. dissertation, Tel-Aviv University.Google Scholar

30.Sadeh, I. (1993). On approximate string matching. Proceedings of the 1993 Data Compression Conference DCC '93, 03, Snowbird, Utah.Google Scholar

31.Sadeh, I. (1995). Operational rate distortion theory. Journal of Applied Mathematics and Computer Science 5(1): 139–169.Google Scholar

32.Steinberg, Y. & Gutman, M. (1993). An algorithm for source coding subject to a fidelity criterion, based on string matching. IEEE Transactions on Information Theory 39: 877–887.CrossRef Google Scholar

33.Storer, J.A. (1990). Lossy on line dynamic data compression. New York: Springer-Verlag.CrossRef Google Scholar

34.Szpankowski, W. (1993). A generalized suffix tree and its unexpected asymptotic behaviours. SIAM Journal of Computing 22: 1176–1198.CrossRef Google Scholar

35.Welch, T.A. (1984). A technique for high performance data compression. IEEE Transactions on Computers 17(6): 8–19.Google Scholar

36.Wyner, A.D. & Ziv, J. (1989). Some asymptotic properties of the entropy of a stationary ergodic data source with applications to data compression. IEEE Transactions on Information Theory 35: 1250–1258.CrossRef Google Scholar

37.Ziv, J. (1978). Coding theorem for individual sequences. IEEE Transactions on Information Theory IT-24: 405–412.CrossRef Google Scholar

Article contents

Universal Data Compression Algorithm Based on Approximate String Matching

Abstract

Information

Access options

Article purchase

Temporarily unavailable

References

Save article to Kindle

Save article to Dropbox

Save article to Google Drive

Reply to: Submit a response

Your details

You have entered the maximum number of contributors

Conflicting interests