Ph.D. Thesis, Electrical Engineering Deprtment,
by Ioannis Kontoyiannis
Note that
most of the results in this thesis have been subsequently extended or improved
upon. In particular see the following three papers (in pdf) and the references
therein: [1999 Lempel-Ziv paper] [2002 review paper] [2004 conference paper on lossy MDL]
Abstract
Over the past 25 years, the
practical requirement for efficient data compression algorithms has generated a
large volume of research covering the whole spectrum from practically
implementable algorithms to deep theoretical results. One prominent example is
the Lempel-Ziv algorithm for lossless data compression: Not only is it
implemented on most computers used today, but also, attempts to analyze its
performance have provided new problems in probability, information theory and
ergodic theory, whose solutions reveal a series of interesting results about
the entropy and the recurrence structure of stationary processes.
The main problems
considered in this thesis are those of determining the asymptotic behavior of
waiting times and recurrence times in stationary processes. These questions are
motivated primarily by their important applications in data compression and the
analysis of string matching algorithms in DNA sequence analysis. In particular,
solving the waiting times problem also allowed us to solve a long-standing open
problem in data compression: That of finding a practical extension of the
Lempel-Ziv coding algorithm for lossy compression.
This thesis is divided into
three parts. In the first part we generalize one of the central theoretical
results in source coding theory: We prove a natural generalization of the
celebrated Shannon-McMillan-Breiman theorem (as well as its subsequent
refinements by Ibragimov and by Philipp and Stout) for real-valued processes
and for the case when distortion is allowed. These results are inspired by, and
provide the key technical ingredient in, our asymptotic analysis of recurrence
and waiting times, in the second part. The main probabilistic tools used in
establishing them are uniform almost-sure approximation, powerful techniques
from large deviations, and classical second-moment blocking arguments.
In the second part we
consider the problem of waiting times between stationary processes. We show
that waiting times grow exponentially with probability one and, that their rate
is given by the solution to an explicit variational problem in terms of the
entropies of the underlying processes. Moreover, we show that, properly scaled,
the deviations of the waiting times from their limiting exponent are
asymptotically Gaussian (with a limiting variance explicitly identified), and
we prove finer theorems (e.g., a law of the iterated logarithm and an almost
sure invariance principle) that provide the exact rate of convergence in the
above limit theorems. Corresponding results are proved for recurrence times,
and dual results are stated and proved for certain longest-match lengths
between stationary processes.
Finally, in the third part,
we use the insight gained by the waiting times results to find a practical
extension of the Lempel-Ziv scheme for the case of lossy data compression. We
propose a new lossy version of the so-called Fixed-Database Lempel-Ziv coding
algorithm, which is of complexity ``comparable'' to that of the corresponding
lossless scheme, and we prove that its compression performance is
(asymptotically) optimal.