
On the Ziv–Merhav theorem beyond Markovianity I

Published online by Cambridge University Press:  07 March 2024

Nicholas Barnfield
Affiliation:
Department of Mathematics and Statistics, McGill University, Montréal, QC, Canada e-mail: nicholas.barnfield@mail.mcgill.ca
Raphaël Grondin
Affiliation:
Department of Mathematics and Statistics, McGill University, Montréal, QC, Canada e-mail: raphael.grondin@mail.mcgill.ca
Gaia Pozzoli
Affiliation:
Department of Mathematics, CY Cergy Paris Université, CNRS UMR 8088, Cergy-Pontoise, France e-mail: gaia.pozzoli@cyu.fr
Renaud Raquépas
Affiliation:
Courant Institute of Mathematical Sciences, New York University, New York, NY, United States

Abstract

We generalize to a broader class of decoupled measures a result of Ziv and Merhav on universal estimation of the specific cross (or relative) entropy, originally for a pair of multilevel Markov measures. Our generalization focuses on abstract decoupling conditions and covers pairs of suitably regular g-measures and pairs of equilibrium measures arising from the “small space of interactions” in mathematical statistical mechanics.

This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press on behalf of Canadian Mathematical Society

1 Introduction

In 1993, Ziv and Merhav proposed a “new notion of empirical informational divergence,” or relative-entropy estimator, based on the celebrated Lempel–Ziv compression algorithm [ZM93]. While this estimator received – to the best of our knowledge – little attention in the mathematical literature, it (and its variants) has met with success in many practical applications across fields such as linguistics, medicine, and physics (see, e.g., [BBCDE08, BCL02, CF05, CFF10, LMDEC19, RGS+22, RP12], to cite only a few). In fact, our main motivation for a more extensive rigorous treatment of the convergence of this estimator is that the very limited Markovian class of sources covered by the original result of Ziv and Merhav pales in comparison with the breadth of apparent applicability.

Ziv and Merhav’s estimator is defined as follows. Given two strings $x_1^N$ and $y_1^N$, let $c_N(y|x)$ be the number of words in a sequential parsing of $y_1^N$ using the longest possible substrings of $x_1^N$; if there is no such substring of $x_1^N$, the parsed word is set to be one letter long. For example, if

$$ \begin{align*} x &= 01000101110100111001000122021\dots, \\ y &= 01100101000102011101001000210\dots, \end{align*} $$

and $N=24$, then the Ziv–Merhav parsing of $y_1^{24}=011001010001020111010010$ with respect to $x_1^{24}=010001011101001110010001$ is

$$\begin{align*}y_1^{24} = 011|00101|00010|2|011101001|0 \end{align*}$$

and $c_{24}(y|x)=6$. Ziv and Merhav show that the estimator

$$\begin{align*}\widehat{Q}_N(y,x) := \frac{c_N(y|x) \ln N}{N} \end{align*}$$

converges to the specific cross entropy $h_{\text{c}}({\mathbb Q}|{\mathbb P})$ – see (2.1) below – between the sources ${\mathbb P}$ and ${\mathbb Q}$ that have produced x and y, respectively, under the assumption that those measures come from irreducible multilevel Markov chains. We will refer to $(\widehat{Q}_N)_{N=1}^\infty$ as the ZM estimator. The relative entropy $h_{\text{r}}({\mathbb Q}|{\mathbb P})$ can then be estimated by combining the above with an estimation of the specific entropy $h({\mathbb Q})$, say à la Lempel–Ziv [ZL78]. Both quantities are defined in Section 2 for the reader’s convenience.
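To make the parsing and the estimator concrete, here is a minimal Python sketch (our own illustration; the helper names zm_parse and zm_estimator are ours). It follows the verbal description above: each word is the longest prefix of the remainder of y that occurs as a substring of $x_1^N$, and an unmatched symbol is parsed as a one-letter word.

```python
import math

def zm_parse(y, x):
    """Sequentially parse y into the longest possible substrings of x;
    a leading symbol with no match in x becomes a one-letter word."""
    words, i = [], 0
    while i < len(y):
        l = 1
        if y[i] in x:
            # Greedily extend the current word while it still occurs in x.
            while i + l < len(y) and y[i:i + l + 1] in x:
                l += 1
        words.append(y[i:i + l])
        i += l
    return words

def zm_estimator(y, x):
    """Ziv-Merhav estimator: c_N(y|x) ln(N) / N."""
    n = len(y)
    return len(zm_parse(y, x)) * math.log(n) / n

# The example from the text: the parsing has c_24(y|x) = 6 words.
x = "010001011101001110010001"
y = "011001010001020111010010"
print(zm_parse(y, x))  # ['011', '00101', '00010', '2', '011101001', '0']
print(zm_estimator(y, x))
```

One can check that this reproduces the six-word parsing displayed above.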

One may note that the behavior of $c_N$ is intimately related to the so-called Wyner–Ziv problem on waiting times. With

$$\begin{align*}W_\ell(y,x) := \inf\{r \in {\mathbb N} : x_{r}^{r+\ell-1} = y_1^\ell\}, \end{align*}$$

the Wyner–Ziv problem concerns the convergence

(1.1) $$ \begin{align} \frac{\ln W_\ell}{\ell} \to h_{\text{c}}({\mathbb Q}|{\mathbb P}) \end{align} $$

as $\ell \to \infty$ within sufficiently nice classes of measures [Kon98, Shi93, WZ89]. To see the relation, note that the length of the first word in the ZM parsing of $y_1^N$ with respect to $x_1^N$ is – save some edge cases – the largest possible $\ell$ such that $W_{\ell}(y,x) \leq N - \ell +1$. This dual quantity is known as the longest-match length

$$\begin{align*}\Lambda_N(y,x) := \max\left\{ 1, \sup\{\ell\in{\mathbb N} : W_\ell(y,x) \leq N-\ell+1 \}\right\}. \end{align*}$$

The length of the second word in this parsing is then – again save some edge cases handled in Section 3.4 – the longest-match length $\Lambda_{N}(T^{\Lambda_N(y,x)}y,x)$, and so on. Any attempt at a theory of the asymptotic behavior of waiting times and its derived quantities beyond Markovianity must take two important caveats into account. First, it is known that the specific cross entropy between two ergodic sources does not always exist (see, e.g., [vEFS93, Section A.5.2]). Second, it is known that there exists a mixing measure ${\mathbb P}$ such that (1.1) fails with ${\mathbb Q}={\mathbb P}$ (see [Shi93, Section 4]). While the precise breadth of the validity of (1.1) and its different refinements remains unknown, a focus on decoupling conditions in the spirit of [Pfi02] has recently proved effective for making significant progress [CDEJR23a, CR23]; the present contribution follows along those lines.
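The waiting time and the longest-match length admit equally direct transcriptions; the following sketch (again ours, with hypothetical helper names and 1-based positions to match the displays above) recovers the length of the first ZM word of the running example.

```python
def waiting_time(a, x):
    """W_l(a, x): the first (1-based) position r with x_r^{r+l-1} = a,
    or None if a does not occur in x."""
    l = len(a)
    for r in range(len(x) - l + 1):
        if x[r:r + l] == a:
            return r + 1
    return None

def longest_match(y, x, N):
    """Lambda_N(y, x): the largest l with W_l(y_1^l, x) <= N - l + 1, at least 1."""
    best = 1
    for l in range(1, N + 1):
        if waiting_time(y[:l], x[:N]) is None:
            break  # longer prefixes of y cannot occur in x_1^N either
        best = l
    return best

x = "010001011101001110010001"
y = "011001010001020111010010"
print(longest_match(y, x, 24))  # 3: the first ZM word of y w.r.t. x is "011"
```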

Broadly speaking, the present work is part of a research program [BCJP21, BJPP18, CDEJR23a, CDEJR23b, CJPS19, CR23] whose goals include demonstrating the effectiveness, in revisiting long-standing problems in dynamical systems and information theory, of this decoupling perspective originating in statistical mechanics. This effectiveness concerns both the reformulation of different existing proof strategies in a common language and the generation of nontrivial extensions. In the case presently at hand, revisiting Ziv and Merhav’s argument from this decoupling perspective allows us to replace the Markovianity assumption with a more permissive combination of three abstract assumptions, namely ID, KB, and FE below.

Organization of the paper.

The rest of the paper is organized as follows. In Section 2, we set the stage by properly introducing our notation, objects of interest, and assumptions. In Section 3, we state our main result, provide its proof, and make several comments. In Section 4, we discuss examples to which this result applies beyond Markovianity.

2 Setting

Let $\Omega := {\cal A}^{\mathbb N}$, where ${\cal A}$ is a finite alphabet, be equipped with the $\sigma$-algebra generated by cylinders of the form $[a] := \{x \in \Omega : x_1^n = a\}$ with $a \in {\cal A}^n$ and $n \in {\mathbb N}$. The shift map $T: \Omega \to \Omega$ defined by $(Tx)_i := x_{i+1}$ is then a measurable surjection. Let ${\mathbb P}$ and ${\mathbb Q}$ be stationary (i.e., T-invariant) probability measures on $\Omega$. We set

$$\begin{align*} \operatorname{supp} {\mathbb P}_n := \{a \in {\cal A}^n : {\mathbb P}[a]> 0\} \end{align*}$$

and

$$\begin{align*} \operatorname{supp} {\mathbb P} := \{x \in \Omega : x_1^n \in \operatorname{supp} {\mathbb P}_n \text{ for all } n \in {\mathbb N}\}, \end{align*}$$

and similarly for ${\mathbb Q}$. We consider samples $(x,y)$ from the product measure ${\mathbb P} \otimes {\mathbb Q}$ on $\Omega \times \Omega$, meaning that the two sources produce strings of symbols independently of each other. The (specific) entropy $h({\mathbb P})$ of a measure ${\mathbb P}$ is

$$\begin{align*}h({\mathbb P}):=\lim_{n\to\infty}-\frac{1}{n}\sum_{a\in \mathcal{A}^n}{\mathbb P}[a]\ln {\mathbb P}[a]. \end{align*}$$

Fekete’s lemma ensures that this limit always exists and lies in $[0, \ln (\#{\cal A})]$ . The (specific) cross entropy of ${\mathbb Q}$ with respect to ${\mathbb P}$ is

(2.1) $$ \begin{align} h_{\text{c}}({\mathbb Q}|{\mathbb P}):=\lim_{n\to\infty}-\frac{1}{n}\sum_{a\in\mathcal{A}^n}{\mathbb Q}[a]\ln {\mathbb P}[a], \end{align} $$

when the limit exists in $[0,\infty ]$ . In this case, the (specific) relative entropy $h_{\text {r}}({\mathbb Q}|{\mathbb P})$ of ${\mathbb Q}$ with respect to ${\mathbb P}$ is then defined as

$$\begin{align*}h_{\text{r}}({\mathbb Q}|{\mathbb P}):= h_{\text{c}}({\mathbb Q}|{\mathbb P}) - h({\mathbb Q}). \end{align*}$$
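For i.i.d. (product) sources, these limits reduce to single-letter formulas – $h_{\text{c}}({\mathbb Q}|{\mathbb P}) = -\sum_a q(a)\ln p(a)$ when ${\mathbb P}$ and ${\mathbb Q}$ have marginals p and q – which provides a convenient numerical sanity check for the definitions above. A minimal sketch (our own illustration, not part of the original text):

```python
import math

def entropy(q):
    """h(Q) for an i.i.d. source with marginal q."""
    return -sum(qa * math.log(qa) for qa in q.values() if qa > 0)

def cross_entropy(q, p):
    """h_c(Q|P) for i.i.d. sources; infinite if q charges a null letter of p."""
    if any(qa > 0 and p.get(a, 0) == 0 for a, qa in q.items()):
        return math.inf
    return -sum(qa * math.log(p[a]) for a, qa in q.items() if qa > 0)

p = {"0": 0.9, "1": 0.1}
q = {"0": 0.5, "1": 0.5}
h_c = cross_entropy(q, p)
h_r = h_c - entropy(q)  # specific relative entropy h_r(Q|P)
print(h_c, h_r)         # h_r >= 0, with equality iff q == p
```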

The abstract properties of stationary measures that we will work with are the following:

  • ID A measure ${\mathbb P}$ is said to be immediately decoupled on its support if there exists a nondecreasing, nonnegative $o(n)$-sequence $(k_n)_{n=1}^\infty$ such that, for every $n, m\in {\mathbb N}$, $a \in {\cal A}^n$, and $b \in {\cal A}^m$, both

    (2.2) $$ \begin{align} {\mathbb P}[ab] \leq \mathrm{e}^{k_n}\, {\mathbb P}[a]\, {\mathbb P}[b] \end{align} $$
    and
    (2.3) $$ \begin{align} {\mathbb P}[ab] \geq \mathrm{e}^{-k_n}\, {\mathbb P}[a]\, {\mathbb P}[b], \end{align} $$
    the latter whenever $ab \in \operatorname{supp} {\mathbb P}_{n+m}$.
  • FE The ${\mathbb P}$-measure of cylinders is said to decay fast enough if there exists $\gamma_+ < 0$ such that

    (2.4) $$ \begin{align} {\mathbb P}[a] \leq \mathrm{e}^{\gamma_+ n} \end{align} $$
    for all $n\in {\mathbb N}$ and $a \in {\cal A}^n$.
  • KB The measure ${\mathbb P}$ is said to satisfy Kontoyiannis’ bound on waiting times if there exist nonnegative $o(n)$-sequences $(k_n)_{n=1}^\infty$ and $(\tau_n)_{n=1}^\infty$ such that

    $$\begin{align*}{\mathbb P}\{x : W_\ell(a,x) \geq r\} \leq \exp \left(-\mathrm{e}^{-k_\ell}{\mathbb P}[a]\left\lfloor \frac{r-1}{\ell+\tau_\ell}\right\rfloor\right) \end{align*}$$
    for every $\ell \in {\mathbb N}$, $a \in {\cal A}^\ell$, and $r\in {\mathbb N}$.

Let us briefly discuss these abstract assumptions. First, it is straightforward to show that if ${\mathbb P}$ is the stationary measure for an irreducible multilevel Markov chain with positive entropy, then ${\mathbb P}$ satisfies ID, FE, and KB. Already for Markov chains, we see that only requiring the lower-decoupling bound (2.3) when $ab$ is in the support is significant: requiring the lower bound whenever a and b are in the support (separately) would be considerably more restrictive, as this would exclude all Markov measures for which some transition probability is null. Second, the bound KB was derived in [Kon98] under a $\psi$-mixing assumption, but the following implication seems more natural for the classes of examples we have in mind: KB will follow from the decoupling condition ID if one is willing to assume that the support of ${\mathbb P}$ satisfies – as a subshift of $\Omega$ – a suitable notion of specification. Still regarding $\psi$-mixing, in the notation of [Bra05, Section 2], Condition ID is implied by $0 <\psi'(0) \leq \psi^*(0) < \infty$, but $\psi$-mixing itself does not imply ID and ID does not imply $\psi$-mixing – or any form of mixing for that matter. Third, combining ID and FE yields Shields’s finite-energy condition [Shi96, Section II.5.a], but the converse implication is not true. It is also worth mentioning the following relations to the Doeblin-type condition discussed, e.g., in [KS94]: it implies FE but not ID, and it is not implied by our assumptions; we will come back to this point in Section 4. Finally, repeated uses of (2.3) in ID imply the following property, which naturally complements FE:

  • SE The ${\mathbb P}$-measure of cylinders is said to decay slow enough if there exists $\gamma_- < 0$ such that

    $$\begin{align*} {\mathbb P}[a] \geq \mathrm{e}^{\gamma_- n} \end{align*}$$
    for all $n\in {\mathbb N}$ and $a \in \operatorname{supp} {\mathbb P}_n$.

These assumptions are established and discussed in the context of important classes of examples in Section 4.
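To make these conditions tangible, the following sketch (ours; the chain, its stationary vector, and the helper names are our own choices) numerically estimates, for a small irreducible Markov chain with positive transition probabilities, a constant $k$ such that ID holds with $k_n \equiv k$, together with candidate rates $\gamma_\pm$ for FE and SE:

```python
import itertools, math

P = [[0.9, 0.1], [0.5, 0.5]]   # irreducible stochastic matrix, all entries positive
pi = [5 / 6, 1 / 6]            # stationary vector: pi P = pi

def prob(word):
    """Stationary Markov measure of the cylinder [word]."""
    p = pi[word[0]]
    for s, t in zip(word, word[1:]):
        p *= P[s][t]
    return p

def words(n):
    return itertools.product(range(2), repeat=n)

# ID: here ln(P[ab] / (P[a] P[b])) = ln(P_{a_n b_1} / pi_{b_1}), so |...| <= k.
k = max(abs(math.log(prob(a + b) / (prob(a) * prob(b))))
        for n in (1, 2, 3) for a in words(n) for b in words(n))
# FE and SE: (1/n) ln P[a] is sandwiched between gamma_- and gamma_+ < 0.
gamma_plus = max(math.log(max(prob(a) for a in words(n))) / n for n in (4, 8, 12))
gamma_minus = min(math.log(min(prob(a) for a in words(n))) / n for n in (4, 8, 12))
print(k, gamma_plus, gamma_minus)
```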

3 Main result

3.1 Statement and structure of the proof

Theorem 3.1 Suppose that the stationary measure ${\mathbb P}$ satisfies ID, FE, and KB and that the ergodic measure ${\mathbb Q}$ satisfies ID and FE. Then,

$$\begin{align*}\lim_{N\to\infty} \widehat{Q}_N(y,x) = h_{\text{c}}({\mathbb Q}|{\mathbb P}) \end{align*}$$

for $({\mathbb P}\otimes {\mathbb Q})$ -almost every $(x,y)$ .

Let us now provide the structure of the proof in the case where $\operatorname{supp} {\mathbb Q} \subseteq \operatorname{supp} {\mathbb P}$, postponing the more technical aspects to Sections 3.2 and 3.3. Throughout,

$$\begin{align*}{ \ell_{-,N}} := \frac{\ln N}{-2\gamma_-} \end{align*}$$

and

$$\begin{align*}{ \ell_{+,N}} := \frac{2\ln N}{-\gamma_+}. \end{align*}$$

These will serve as a priori bounds on the lengths of the words in different auxiliary parsings.

  • Upper bound. Let $\epsilon \in (0, \tfrac 12)$ and $y \in \operatorname{supp} {\mathbb Q}$ be arbitrary. We consider an auxiliary sequential parsing $y_1^N= y^{(1,N)} y^{(2,N)} \dotsc y^{(\widehat {c}_N,N)}$, where each word $y^{(j,N)}$ has length $\ell _{j,N}$ and is – except possibly for $y^{(\widehat c_N,N)}$ – the shortest prefix of $T^{\ell _{1,N}+\dots +\ell _{j-1,N}}y_1^N$ satisfying

    (3.1) $$ \begin{align} {\mathbb P}[y^{(j,N)}] \leq N^{-1+\epsilon}, \end{align} $$
    where we define $\ell_{0,N}:=0$ and $T^{k}y_1^N := y_{k+1}^N$ for any $k<N$. The power $-1+\epsilon$ is chosen in the hope that the words in this auxiliary parsing will be long enough, yet likely enough for ${\mathbb P}$ that the vast majority of them find a match in $x_1^N$; a computational sketch of such thresholded parsings is given after the present outline. To motivate this Ansatz, note that, by linearity of expectation, the expected number of times a given string of ${\mathbb P}$-probability $N^{-1+\epsilon }$ appears in a string of length N obtained from ${\mathbb P}$ grows as $N^{\epsilon }$.

    For N large enough, each length $\ell _{j,N}$ is between $\ell _{-,N}$ and $\ell _{+,N}$ , due to Properties FE and SE, except possibly for $\ell _{\widehat {c}_N,N}$ which need only satisfy the upper bound. In particular, $\widehat {c}_N = O(\frac {N}{\ln N})$ .

    Note that, for each $j = 1,2, \dotsc , \widehat c_N$ , the appearance of $y^{(j,N)}$ as a substring of $x_1^N$ – written $y^{(j,N)} \in x_1^N$ in what follows – implies the presence of at most one separator of the original ZM parsing within $y^{(j,N)}$ , that is,

    $$\begin{align*}{\mathbb P}\left\{ x : \#\{ j \leq \widehat{c}_N : y^{(j,N)} \in x_1^N\} = \widehat{c}_N \right\}\leq \mathbb{P}\{x\,:\, c_N(y|x) \leq \widehat c_N\}, \end{align*}$$
    which in turn implies
    $$\begin{align*}\mathbb{P}\{x\,:\, c_N(y|x)> \widehat c_N\}\leq {\mathbb P}\left\{ x : \#\{ j \leq \widehat{c}_N : y^{(j,N)} \notin x_1^N\} > 0 \right\}. \end{align*}$$
    We show in Lemma 3.3 that the probability on the right-hand side is summable in N and hence
    (3.2) $$ \begin{align} \sum_{N=1}^\infty{\mathbb P}\left\{ x : c_N(y|x)> \widehat c_N \right\}<\infty. \end{align} $$

    On the other hand, Lemma 3.10 below shows that

    $$ \begin{align*} (-1+\epsilon) (\widehat{c}_N-1) \ln N &= \sum_{j=1}^{\widehat{c}_N-1} \ln N^{-1+\epsilon} \\ &\geq \sum_{j=1}^{\widehat{c}_N} \ln {\mathbb P}[y^{(j,N)}] \\ &\geq \ln {\mathbb P}[y_1^N] - o(N). \end{align*} $$
    Hence,
    $$ \begin{align*} &\mathbb{P}\left\{x\,:\, c_N(y|x)\ln N+\ln \mathbb{P}[y_1^N]>-\frac\epsilon{1-\epsilon}\ln \mathbb{P}[y_1^N]+\ln N + \epsilon N \right\} \\ &\quad\qquad\qquad\quad\qquad\qquad\qquad\qquad \leq \mathbb{P}\left\{x\,:\, c_N(y|x)\ln N>\widehat c_N\ln N \right\} \end{align*} $$
    for all N large enough. Recall that, by Condition SE, $\ln {\mathbb P}[y_1^N]\geq \gamma _- N$ with $\gamma _- < 0$ . Combining this with (3.2), we obtain
    $$ \begin{align*} &\sum_{N=1}^\infty\mathbb{P}\left\{x\,:\, \frac{c_N(y|x)\ln N}N+\frac{\ln \mathbb{P}[y_1^N]}N> -\frac{\epsilon}{1-\epsilon}\gamma_- + 2\epsilon \right\}<\infty. \end{align*} $$
    Appealing to the Borel–Cantelli lemma, using the cross entropy analogue of the Shannon–McMillan–Breiman theorem in Lemma 3.12, and then taking $\epsilon \to 0$, we conclude that, for every $y \in \operatorname{supp} {\mathbb Q}$, we have
    $$ \begin{align*} \limsup_{N\to\infty} \widehat{Q}_N(y,x) \leq h_{\text{c}}({\mathbb Q}|{\mathbb P}) \end{align*} $$
    for almost every x sampled from ${\mathbb P}$ .
  • Lower bound I. Before we obtain the almost sure lower bound required for Theorem 3.1, let us summarize Ziv and Merhav’s argument for proving that the lower bound holds in probability. This argument will also help the reader understand the more complicated construction used for the almost sure version. Let $\epsilon \in (0, \tfrac 12)$ and $y \in \operatorname{supp} {\mathbb Q}$ be arbitrary. We consider an analogous auxiliary sequential parsing $y_1^N= y^{(1,N)} y^{(2,N)} \dotsc y^{(\bar {c}_N,N)}$, where each word $y^{(j,N)}$ has length $\ell _{j,N}$ and is – except possibly for $y^{(\bar c_N,N)}$ – the shortest prefix of $T^{\ell _{1,N}+\dots +\ell _{j-1,N}}y_1^N$ that has probability

    (3.3) $$ \begin{align} {\mathbb P}[y^{(j,N)}] \leq N^{-1-\epsilon}, \end{align} $$
    where we define $\ell _{0,N}:=0$. The power $-1-\epsilon$ is chosen in the hope that the words in this auxiliary parsing will be numerous enough, yet unlikely enough for ${\mathbb P}$ that the vast majority of them find no match in $x_1^N$. To motivate this Ansatz, note that the expected number of times a given string of ${\mathbb P}$-probability $N^{-1-\epsilon }$ appears in a string of length N obtained from ${\mathbb P}$ decays as $N^{-\epsilon }$.

    Again, for N large enough, each length $\ell _{j,N}$ in this parsing falls between $\ell _{-,N}$ and $\ell _{+,N}$ , due to Properties FE and SE, except possibly for the last one, which only satisfies the upper bound. In particular, $\bar {c}_N = O(\frac {N}{\ln N})$ .

    The correspondence between the parsing cardinalities $c_N(y|x)$ and $\bar c_N$ relies on the following observation: $c_N(y|x)$ must be at least equal to the number of words in the auxiliary parsing of $y_1^N$ that do not appear as strings in $x_1^N$ . Indeed, if a word $y^{(j,N)}$ does not appear as a substring of $x_1^N$ – written $y^{(j,N)} \notin x_1^N$ in what follows – then the ZM parsing has at least one separator within $y^{(j,N)}$ . That is,

    $$ \begin{align*} c_N(y|x) &\geq \#\{j\,:\,y^{(j,N)}\notin x_1^N\}\\ &\geq \#\{j\leq \bar c_N-1\,:\,y^{(j,N)}\notin x_1^N\} \end{align*} $$
    and so
    $$ \begin{align*} &{\mathbb P}\left\{x\,:\, c_N(y|x)\geq(\bar c_N-1)\left(1-\epsilon\right)\right\} \\ &\qquad \geq{\mathbb P}\left\{x\,:\, \#\{j\leq \bar c_N-1\,:\,y^{(j,N)}\in x_1^N\}\leq(\bar c_N-1)\epsilon\right\} \\ &\qquad \geq 1-{\mathbb P}\left\{x: \#\{j\leq \bar c_N-1\,:\,y^{(j,N)}\in x_1^N\}> (\bar{c}_N-1)\, \epsilon\right\}. \end{align*} $$
    One can easily show using a crude union bound and Markov’s inequality that the appearance in $x_1^N$ of more than an arbitrarily small proportion of all the words in the auxiliary parsing except for the last one has vanishing – but not necessarily summable – probability, and this enables us to conclude that
    (3.4) $$ \begin{align} \lim_{N\to\infty}{\mathbb P}\left\{x : c_N(y|x)\geq(\bar c_N-1)(1-\epsilon)\right\}=1. \end{align} $$

    Note that since, by construction, for any $j = 1,2, \dotsc , \bar c_N$ , the auxiliary word $y^{(j,N)}$ has no strict prefix with probability less than $N^{-1-\epsilon }$ , the lower bound in Condition ID implies that $ {\mathbb P}[y^{(j,N)}] \geq N^{-1-2 \epsilon } $ for N large enough. Therefore, Lemma 3.10 yields

    (3.5) $$ \begin{align} (-1-2\epsilon) \bar{c}_N \ln N-o(N) &\leq \sum_{j=1}^{\bar{c}_N} \ln {\mathbb P}[y^{(j,N)}]\nonumber\\ &\leq \ln {\mathbb P}[y_1^N] + o(N), \end{align} $$
    which together with (3.4) implies
    $$\begin{align*}\lim_{N\to \infty}{\mathbb P}\left\{x : \frac{c_N(y|x)\ln N}N +\frac{\ln{\mathbb P}[y_1^N]}N\geq -3\epsilon\frac{\bar{c}_N\ln N}{N}-\epsilon\right\}=1. \end{align*}$$

    Thus, using Lemma 3.12 and the fact that $\bar {c}_N=O(\frac {N}{\ln N})$, and taking $\epsilon \to 0$, we conclude that, for all $y \in \operatorname{supp} {\mathbb Q}$, we have

    (3.6) $$ \begin{align} h_{\text{c}}({\mathbb Q}|{\mathbb P}) \leq \liminf_{N\to\infty} \widehat{Q}_N(y,x) \end{align} $$
    in probability with respect to x sampled from ${\mathbb P}$ .
  • Lower bound II. Let $\epsilon \in (0, \tfrac 12)$ and $y \in \operatorname{supp} {\mathbb Q}$ be arbitrary, and fix $0 < \alpha <\frac {\gamma _+}{8\gamma _-} < 1$. In what follows and in the last part of Section 3.2, the number $N^{\alpha }$ is to be understood as its integer part $\lfloor N^{\alpha }\rfloor $. Following Ziv and Merhav’s original strategy for strengthening convergence in probability to almost sure convergence, we modify the auxiliary parsing of $y_1^{N}$ in “Lower Bound I” by applying the same algorithm separately to subsequent blocks of $N^\alpha $ symbols of $y_1^{N}$.

    First, let $y^{(1,1,N)}$ be the shortest prefix of $y_1^{N^{\alpha }}$ such that ${\mathbb P}[y^{(1,1,N)}] \leq N^{-1-\epsilon }$ ; it has length $\ell _{1,1,N}$ , between $\ell _{-,N}$ and $\ell _{+,N}$ for N large enough due to Properties FE and SE. Now, let $y^{(2,1,N)}$ be the shortest prefix of $y_{\ell _{1,1,N}+1}^{N^{\alpha }}$ such that ${\mathbb P}[y^{(2,1,N)}] \leq N^{-1-\epsilon }$ , and so on until not possible. We have parsed a first block of size $N^\alpha $ :

    $$ \begin{align*} y_1^{N^\alpha} = y^{(1,1,N)} y^{(2,1,N)} \dotsb y^{(d_{1,N},1,N)} \xi^{(1,N)}, \end{align*} $$
    where the (possibly empty) buffer $\xi ^{(1,N)}$ has probability at least $N^{-1-\epsilon }$ and length at most $\ell _{+,N}$ due to Property FE.

    We then repeat the procedure with $T^{N^\alpha }y_1^N$ to obtain the second block, and so on until

    (3.7) $$ \begin{align} &y_1^{N}= y^{(1,1,N)} y^{(2,1,N)} \dotsb y^{(d_{1,N},1,N)} \xi^{(1,N)} y^{(1,2,N)} y^{(2,2,N)} \dotsb y^{(d_{2,N},2,N)} \xi^{(2,N)} \nonumber\\ &\qquad\qquad\qquad\qquad\quad \dotsb y^{(1,M_N,N)} y^{(2,M_N,N)} \dotsb y^{(d_{M_N,N},M_N,N)} \xi^{(M_N,N)}. \end{align} $$
    The construction of $y^{(1,M_N,N)} y^{(2,M_N,N)} \dotsb y^{(d_{M_N,N},M_N,N)} \xi ^{(M_N,N)}$ may differ from that of the previous $y^{(1,s,N)} y^{(2,s,N)} \dotsb y^{(d_{s,N},s,N)} \xi ^{(s,N)}$ for $s < M_N$ in that it might be the parsing of a block of a length smaller than $N^\alpha $ if there is a remainder in the division of N by $N^\alpha $. Note that, for N large enough, $N^{1-\alpha }\leq M_N \leq 2N^{1-\alpha }$ and $d_{s,N} \leq d_{+,N} := \frac{N^{\alpha}}{\ell_{-,N}}$ for every s. The number of auxiliary parsed words to be considered is
    $$\begin{align*}\tilde{c}_N := d_{1,N} + d_{2,N} + \dotsb +d_{M_N,N}. \end{align*}$$
    It follows from the above that $\tilde {c}_N=O(\frac {N}{\ln N})$, since $\frac{N^{\alpha}}{2\ell_{+,N}} \leq d_{s,N} \leq d_{+,N}$ for any $s<M_N$. As explained in “Lower bound I,” $c_N(y|x)$ must be at least equal to the number of words in the auxiliary parsing of $y_1^N$ that do not appear as strings in $x_1^N$, that is,
    $$\begin{align*}c_N(y|x)\geq \sum_{s = 1}^{M_N} \#\left\{j : y^{(j,s,N)} \notin x_1^N\right\}. \end{align*}$$

    In order to control the latter, we prove below the two following technical estimates:

    • Proposition 3.6: For almost every y sampled from ${\mathbb Q}$ , there exists $N_\epsilon (y)$ such that, for $N \geq N_\epsilon (y)$ , the number of indices s such that the words $y^{(1,s,N)}$ , $y^{(2,s,N)}, \dotsc , y^{(d_{s,N},s,N)}$ are not distinct is smaller than $\epsilon M_N$ ; we comment on the reason for this technical consideration in Remark 3.8.

    • Proposition 3.9: Denoting by $\mathcal {S}_{\mathrm {g}}(y_1^N)$ the set of indices s whose block of $y_1^N$ does consist of distinct words, we have

      $$ \begin{align*} {\mathbb P}\{\#\{j : y^{(j,s,N)} \in x_1^N\}> \epsilon {d_{+,N}}\} \leq {\ell_{+,N}}^2 \mathrm{e}^{\frac{\epsilon^2 \gamma_+ }{8}\frac{d_{+,N}}{\ell_{+,N}}} \end{align*} $$
      for N large enough and all $s \in \mathcal {S}_{\mathrm {g}}(y_1^N)$ . This means that, with high probability, only a small fraction of the words in these “good blocks” can appear in $x_1^N$ (and fail to contribute to $c_N(y|x)$ ).

    Therefore, even considering the worst-case scenario where all $y^{(i,s,N)}$ with $s\notin \mathcal {S}_{\mathrm {g}}(y_1^N)$ do appear in $x_1^N$ , we find that, for almost every y sampled from ${\mathbb Q}$ ,

    $$ \begin{align*} &\sum_{N = N_\epsilon(y)}^\infty {\mathbb P}\left\{x\,:\, c_N(y|x) <\tilde{c}_N - 2\epsilon {d_{+,N}}M_N \right\}\\ & \quad\leq \sum_{N = N_\epsilon(y)}^\infty M_N \max_{s \in \mathcal{S}_{\mathrm{g}}(y_1^N)}{\mathbb P}\{\#\{j : y^{(j,s,N)} \in x_1^N\}> \epsilon {d_{+,N}}\}<\infty. \end{align*} $$
    In fact,
    $$ \begin{align*} & \left\{x\,:\, \sum_{s = 1}^{M_N} \#\left\{j : y^{(j,s,N)} \notin x_1^N\right\}< \tilde c_N-2\epsilon { d_{+,N}} M_N\right\} \\ &\hspace{1cm}= \left\{x\,:\, \sum_{s = 1}^{M_N} \#\left\{j : y^{(j,s,N)} \in x_1^N\right\}> 2\epsilon {d_{+,N} }M_N\right\} \\ &\hspace{1cm}\subseteq \left\{x\,:\, \sum_{s \in \mathcal{S}_{\mathrm{g}}(y_1^N)} \#\left\{j : y^{(j,s,N)} \in x_1^N\right\}> \epsilon {d_{+,N} }M_N\right\}, \end{align*} $$
    and we can perform a union bound after observing that for the number of words in the auxiliary parsing of $y_1^N$ that appear in $x_1^N$ and belong to “good blocks” to exceed $\epsilon {d_{+,N}}M_N$ , at least the number of such words in one of the “good blocks” must exceed $\epsilon d_{+,N}$ .

    Appealing to Lemma 3.10 and Remark 3.11, the relation (3.5) between $\tilde {c}_N$ and $\ln {\mathbb P}[y_1^N]$ remains valid and yields

    $$ \begin{align*} & {\mathbb P}\left\{x: \frac{c_N(y|x)\ln N}N +\frac{\ln {\mathbb P}[y_1^N]}N<-2\epsilon \frac{\tilde{c}_N\ln N}N-\epsilon-8\epsilon\frac{ \ln N}{\ell_{-,N}} \right\}\\ &\qquad\qquad\qquad\qquad \quad\leq {\mathbb P}\left\{x: \frac{c_N(y|x)\ln N}N <\frac{\tilde{c}_N\ln N}N - 2\epsilon {d_{+,N}}M_N\frac{\ln N}N \right\} \end{align*} $$
    for N large enough, which implies that there exists some constant $C=C(\gamma _-)>0$ such that
    $$\begin{align*}\sum_{N=1}^\infty{\mathbb P}\left\{x: \frac{c_N(y|x)\ln N}N +\frac{\ln {\mathbb P}[y_1^N]}N< -C\epsilon \right\}<\infty. \end{align*}$$
    By Lemma 3.12 and the Borel–Cantelli lemma, taking $\epsilon \to 0$ , we conclude that
    $$ \begin{align*} h_{\text{c}}({\mathbb Q}|{\mathbb P}) \leq \limsup_{N\to\infty} \widehat{Q}_N(y,x) \end{align*} $$
    for almost every $(x,y)$ sampled from ${\mathbb P}\otimes {\mathbb Q}$ .
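All three parts of the outline rest on the same primitive: parsing y into shortest prefixes whose ${\mathbb P}$-probability drops below a threshold ($N^{-1+\epsilon}$ for the upper bound, $N^{-1-\epsilon}$ for the lower bounds, applied blockwise in “Lower bound II”). Here is a minimal sketch of this thresholded parsing (ours; the i.i.d. measure ${\mathbb P}$ is an arbitrary choice, made so that cylinder probabilities are easy to evaluate):

```python
import math

def threshold_parse(y, log_prob, threshold):
    """Parse y into words, each the shortest prefix of the remainder whose
    P-probability is <= threshold; the last word may fail the requirement."""
    log_t = math.log(threshold)
    words, i = [], 0
    while i < len(y):
        j = i + 1
        while j < len(y) and log_prob(y[i:j]) > log_t:
            j += 1
        words.append(y[i:j])
        i = j
    return words

# Illustration with an i.i.d. measure P of our choosing.
p = {"0": 0.7, "1": 0.3}
log_prob = lambda w: sum(math.log(p[c]) for c in w)

y = "0110010100011101110100"
N, eps = len(y), 0.25
print(len(threshold_parse(y, log_prob, N ** (-1 + eps))))  # words a la (3.1)
print(len(threshold_parse(y, log_prob, N ** (-1 - eps))))  # words a la (3.3)
```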

The above strategy is essentially that of Ziv and Merhav, but the lemmas and propositions on which it relies need to be adapted beyond Markovianity. Before we do so, let us state and prove a proposition that justifies our focus on situations where $\operatorname{supp} {\mathbb Q} \subseteq \operatorname{supp} {\mathbb P}$.

Proposition 3.2 Suppose that ${\mathbb Q}$ is ergodic. If there exists $k\in {\mathbb N}$ such that $\operatorname{supp} {\mathbb Q}_k \not\subseteq \operatorname{supp} {\mathbb P}_k$, then $\widehat {Q}_N\to \infty $ almost surely as $N\to \infty $, in agreement with Theorem 3.1.

Proof Fix k as in the hypothesis and then $a \in \operatorname{supp} {\mathbb Q}_k \setminus \operatorname{supp} {\mathbb P}_k$. Because ${\mathbb P}_k[a] = 0$, a crude counting argument yields that the ZM parsing satisfies

$$\begin{align*}c_N(y|x) \geq \frac{\#\{j \leq N-k+1 : T^{j-1}y \in [a]\}}{k} \end{align*}$$

for all x in which the string a does not occur – a set of full ${\mathbb P}$-measure. Because ${\mathbb Q}[a]> 0$ and k is fixed, Birkhoff’s ergodic theorem applied to the function $\mathbf {1}_{[a]}$ yields

$$\begin{align*}\liminf_{N\to\infty} \frac{c_N(y|x)}{N}> 0, \end{align*}$$

for almost every $y\sim {\mathbb Q}$ . This allows us to conclude that, almost surely, the estimator diverges.

As for the claim that this is in agreement with Theorem 3.1, it is based on the observation that if $\operatorname{supp} {\mathbb Q}_k \not\subseteq \operatorname{supp} {\mathbb P}_k$, then $\operatorname{supp} {\mathbb Q}_n \not\subseteq \operatorname{supp} {\mathbb P}_n$ for all $n \geq k$. Since the existence of $a \in \operatorname{supp} {\mathbb Q}_n$ such that ${\mathbb P}_n[a] = 0$ causes at least one summand to be infinite on the right-hand side of (2.1), this allows us to conclude that $h_{\text {c}}({\mathbb Q}|{\mathbb P}) = \infty $.

3.2 Properties of the auxiliary parsings

Throughout this section, $\epsilon \in (0,1/2)$ is fixed but arbitrary. We assume that ${\mathbb P}$ and ${\mathbb Q}$ are stationary and satisfy $\operatorname{supp} {\mathbb Q} \subseteq \operatorname{supp} {\mathbb P}$. For readability, we will omit keeping track of the N-dependence in some of the notation introduced above. As foreshadowed in the introduction, our analysis of the cardinalities of the auxiliary parsings will use reformulations in terms of waiting times.

Lemma 3.3 Suppose that ${\mathbb P}$ satisfies ID, FE, and KB. Let $y \in \operatorname{supp} {\mathbb P}$ be arbitrary and consider the auxiliary parsing of $y_1^N$ built around the requirement (3.1). Then,

$$\begin{align*}{\mathbb P}\left\{ x : \#\{ j \leq \widehat{c}_N : y^{(j)} \notin x_1^N\}> 0 \right\} \leq N\mathrm{e}^{-\frac{N^{\frac{\epsilon}{4}}}{3\ell_+}} \end{align*}$$

for N large enough.

Proof Let $\underline {y}^{(j)}$ be the word that is obtained by removing the last letter from $y^{(j)}$ ; by construction, ${\mathbb P}[\underline {y}^{(j)}]> N^{-1+\epsilon }$ . So, in view of ID,

(3.8) $$ \begin{align} {\mathbb P}[y^{(j)}] \geq \mathrm{e}^{-k_{\ell_j-1}}\, {\mathbb P}[\underline{y}^{(j)}] \min_{b \in \operatorname{supp} {\mathbb P}_1} {\mathbb P}[b] \geq N^{-1+\frac{\epsilon}{2}} \end{align} $$

for N large enough. We have used the fact that $k_{\ell _j-1} \leq k_{\ell _+}$ with $k_\ell = o(\ell )$ and ${\ell _+=O(\ln N)}$ . Using KB and considering all N large enough, we have

(3.9) $$ \begin{align} {\mathbb P}\{ x\,:\,W_{\ell_j}(y^{(j)},x)>N-\ell_j+1 \} &\leq \exp\left(-\frac{N^{\frac{\epsilon}{4}}}{2\ell_++\tau_{\ell_+}}\right), \end{align} $$

where we used the defining properties of $k_{\ell _j}$ and $\ell _j$ . Then, using that for N large enough we have $\tau _{\ell _+}\leq \ell _+$ and taking a union bound over j,

$$\begin{align*}{\mathbb P}\left(\bigcup_{j=1}^{\widehat c_N}\{x : W_{\ell_j}(y^{(j)},x)>N-\ell_j+1\} \right) \leq N\exp\left(-\frac{N^{\frac{\epsilon}{4}}}{3\ell_+}\right). \end{align*}$$

To conclude, note that $W_{\ell _j}(y^{(j)},x)>N-\ell _j+1$ is a necessary and sufficient condition for $y^{(j)} \notin x_1^N$ .

While, on the one hand, the last lemma states that the words in the auxiliary parsing built around (3.1) tend to appear in $x_1^N$, one can show that, on the other hand, the words in the auxiliary parsing built around (3.3) tend not to appear in $x_1^N$. However, the probabilistic estimate obtained by pursuing this strategy only achieves convergence in probability of the ZM estimator; see “Lower Bound I.” As Ziv and Merhav showed in their original paper in the Markovian case, this estimate can actually be refined and made summable in N using some additional combinatorial and probabilistic arguments. Such a refinement is used to go from convergence in probability to almost sure convergence in Section 3.1. We recall the following basic facts about our modified auxiliary parsing (3.7) for N large enough:

  • there are $M_N \leq 2N^{1-\alpha }$ blocks, indexed by s, each of length $N^\alpha $ except for the last one $(s=M_N)$ which possibly has length less than $N^\alpha $ ;

  • the sth block contains $d_{s}$ words $y^{(i,s)}$ with

    $$\begin{align*} \frac{N^{\alpha}}{2\ell_{+}} \leq d_{s} \leq \frac{N^{\alpha}}{\ell_{-}} =: d_{+}, \end{align*}$$
    except for the last one $(s=M_N)$ for which the lower bound may not apply, and one (possibly empty) buffer $\xi ^{(s)}$;
  • each word $y^{(i,s)}$ has length $\ell _{i,s}$ , with

    $$\begin{align*}\ell_- := \frac{\ln N}{-2\gamma_-} \leq \ell_{i,s} \leq \frac{2\ln N}{-\gamma_+} =: \ell_+. \end{align*}$$

Most of the factors of 2 in these facts are suboptimal; they are only meant to avoid having to consider integer parts or superficial dependence on $\epsilon $ .

Definition 3.1 If the words $y^{(1,s)}$ , $y^{(2,s)}, \dotsc , y^{(d_{s},s)}$ in (3.7) are all distinct, we say that the sth block of $y_1^N$ is good and write $s \in \mathcal {S}_{\mathrm {g}}(y_1^N)$ . If that is not the case, we say that the block is bad and write $s \in \mathcal {S}_{\mathrm {b}}(y_1^N)$ .

Lemma 3.4 If ${\mathbb Q}$ satisfies ID and FE, then

$$\begin{align*}{\mathbb Q}\{y : s \in \mathcal{S}_{\mathrm{b}}(y_1^N)\} \leq \mathrm{e}^{k_{\ell_-}}N^{-2\alpha}, \end{align*}$$

for every s and every N large enough.

Proof Fix ${\mathbb Q}$ as in the statement. By shift invariance, ${\mathbb Q}\{y : s \in \mathcal {S}_{\mathrm {b}}(y_1^N) \} \leq {\mathbb Q}\{y : 1 \in \mathcal {S}_{\mathrm {b}}(y_1^N)\}$. For the first block to be bad, two words $y^{(i,1)}$ and $y^{(j,1)}$ need to coincide, and in particular, their $\ell _-$-prefixes need to coincide. Hence, considering all possible starting indices of these two words, and appealing to shift-invariance, ID and FE, we derive

$$\begin{align*} {\mathbb Q}\{y : 1 \in \mathcal{S}_{\mathrm{b}}(y_1^N)\} \leq N^{2\alpha}\, \mathrm{e}^{k_{\ell_-}} \mathrm{e}^{\gamma_+ \ell_-} = \mathrm{e}^{k_{\ell_-}}\, N^{2\alpha + \frac{\gamma_+ \ell_-}{\ln N}}. \end{align*}$$

To conclude, recall that we have chosen $\alpha <\frac {\gamma _+}{8\gamma _-} = - \frac {\gamma _+\ell _-}{4\ln N}$ and that $\gamma _\pm < 0$.

Lemma 3.5 If ${\mathbb Q}$ satisfies ID and FE, then

$$\begin{align*}{\mathbb Q}\{y : \#\mathcal{S}_{\mathrm{b}}(y_1^N) = m \}\leq \binom{M_N}{m}\mathrm{e}^{2mk_{\ell_-}}N^{-2m\alpha}, \end{align*}$$

for all $m\in {\mathbb N}$ and for all N large enough.

Proof Fix ${\mathbb Q}$ and m as in the statement. Let us first consider the probability that the blocks of $y_1^N$ labeled $s_m$, $s_{m-1}$ down to $s_1$ are bad. This event can be thought of as the mth in a sequence of events defined inductively by $E^{\prime }_{k+1}=T^{-N^{\alpha }(s_{k+1}-1)}\{1 \in \mathcal {S}_{\mathrm {b}}\} \cap E^{\prime }_k$ where $E^{\prime }_0=\Omega $. It follows, by a straightforward adaptation of the strategy of Lemma 3.4, that

$$\begin{align*} {\mathbb Q}(E^{\prime}_{k+1}) \leq \mathrm{e}^{2k_{\ell_-}} N^{-2\alpha}\, {\mathbb Q}(E^{\prime}_k). \end{align*}$$

Iterating and accounting for the different choices of $s_1, \dotsc , s_{m-1}, s_m$ (recall that $s \leq M_N$) gives the proposed bound.

Proposition 3.6 If ${\mathbb Q}$ satisfies ID and FE, then for almost every $y\sim {\mathbb Q}$ , there exists $N_\epsilon $ such that $\#\mathcal {S}_{\mathrm {b}}(y_1^N) < \epsilon M_N$ for all $N \geq N_\epsilon $ .

Proof Fixing ${\mathbb Q}$ as in the statement, using Markov’s inequality, the binomial theorem, and Lemma 3.5, for every $b>0$ , we have

$$ \begin{align*} {\mathbb Q}\left\{y: \#\mathcal{S}_{\mathrm{b}}(y_1^N) \geq \epsilon M_N\right\}&\leq \mathbb{E}\left(\mathrm{e}^{b(\#\mathcal{S}_{\mathrm{b}}(y_1^N)) }\right)\mathrm{e}^{-b\epsilon M_N}\\ &=\mathrm{e}^{-b\epsilon M_N}\sum_{m=0}^{M_N}\mathrm{e}^{bm}{\mathbb Q}\left\{y: \#\mathcal{S}_{\mathrm{b}}(y_1^N) = m \right\}\\ &\leq\mathrm{e}^{-b\epsilon M_N}\left(1+\mathrm{e}^{b+2k_{\ell_-}}N^{-2\alpha}\right)^{M_N}. \end{align*} $$

Choosing $b=2\alpha \ln N-2k_{\ell _-}$ , recalling that $M_N/N^{1-\alpha }\in (1,2)$ , and considering N large enough so that $b>0$ gives the bound

(3.10) $$ \begin{align} {\mathbb Q}\left\{y: \#\mathcal{S}_{\mathrm{b}}(y_1^N) \geq \epsilon M_N\right\}\leq \mathrm{e}^{-N^{1-\alpha}(2\alpha\epsilon\ln N-2\epsilon k_{\ell_-}-2\ln 2)}. \end{align} $$

The proposition thus follows from the Borel–Cantelli lemma.

Lemma 3.7 Suppose that ${\mathbb P}$ satisfies ID and that the sth block of $y_1^N$ is good. Given $\ell $ and $K\in \{1,2,\dotsc , \ell \}$ ,

$$ \begin{align*} & {\mathbb P}\left\{x:\#\left\{j : y^{(j,s)}=x_{K+r\ell}^{K+r\ell+(\ell-1)}\text{ for some } r \in \left\{0, 1, \dots,\left\lfloor\frac{N-K+1}{\ell}\right\rfloor-1\right\} \right\}=m\right\} \\ & \quad\leq \binom{d_+}{m}\mathrm{e}^{mk_{\ell_+}}N^{-m\epsilon}. \end{align*} $$

Proof By shift invariance, we can assume that $s=1$ . Consider a set $I=\{i_k\}_{k=1}^m$ of m distinct indices such that $y^{(i_k,1)}$ has length $\ell $ , and let $F(I)$ denote the event that all the words $\{y^{(i_k,1)}\}_{k=1}^m$ have a match in $x_1^N$ with a starting point equivalent to K mod $\ell $ . Since the words $\{y^{(i_k,1)}\}_{k=1}^m$ are distinct, the starting positions of the matches considered must be distinct. Moreover, by assumption, each such starting position is of the form $r\ell + K$ for some r at most $\lfloor \tfrac {N-K+1}{\ell }\rfloor - 1$ . Therefore, enumerating all possibilities, we find

$$ \begin{align*} F(I) \subseteq \bigcup_{r_1,\dotsc,r_{m}}\bigcap_{k=1}^{m}T^{-r_k\ell-K }[y^{(i_{k},\ell)}], \end{align*} $$

where the union is taken over distinct nonnegative integers $r_1, \dotsc , r_m$ all at most $\lfloor \tfrac {N-K+1}{\ell }\rfloor -1$ . Using ID, shift invariance, and subadditivity gives

$$ \begin{align*} {\mathbb P}(F(I)) &\leq m!\binom{\lfloor \frac{N-K+1}{\ell}\rfloor-1}{m}\left(\mathrm{e}^{k_{\ell_+}}\max_{i \in I}{\mathbb P}[y^{(i,\ell)}]\right)^m \\ & \leq N^m(\mathrm{e}^{k_{\ell_+}}N^{-1-\epsilon})^m \\ & \leq \mathrm{e}^{mk_{\ell_+}}N^{-m\epsilon}. \end{align*} $$

To conclude, we use a union bound, together with an upper bound on the number of sets I of this nature.

Remark 3.8 The separation into fixed values of $\ell $ and K is a technical device to avoid overlaps that would prevent the use of ID, and will be taken care of momentarily by a union bound. For fixed $\ell $ , and for the purpose of relating $c_N$ and $\tilde {c}_N$ , the important quantity is the number of j such that $y^{(j,s)}$ has size $\ell $ and appears in $x_1^N$ (this is the only way a separator could fail to appear within $y^{(j,s)}$ ), and not the number of substrings of size $\ell $ in $x_1^N$ that are matches for some $y^{(j,s)}$ . The probability of the latter is easier to control (this is what we control in the proof), and coincides with the former when $s \in \mathcal {S}_{\mathrm {g}}(y_1^N)$ .

Proposition 3.9 Let $y \in \operatorname{supp} {\mathbb P}$ be arbitrary and consider the modified auxiliary parsing of $y_1^N$ in (3.7). Suppose that ${\mathbb P}$ satisfies ID and that the sth block of $y_1^N$ is good. Then, for N large enough, the event that more than a fraction $\epsilon $ of the maximum number $d_+$ of words $y^{(i,s)}$ in the sth block appears in $x_1^N$ satisfies

(3.11) $$ \begin{align} {\mathbb P}\{x:\#\{j : y^{(j,s)} \in x_1^N\}> \epsilon d_+\}\leq \ell_+^2\mathrm{e}^{ \frac{\gamma_+\epsilon^2}8\frac{d_+}{\ell_+}}. \end{align} $$

Proof Fix $s\in \mathcal {S}_{\mathrm {g}}(y_1^N)$ . Given $\ell $ and $K \in \{1,\dotsc ,\ell \}$ , consider

(3.12) $$ \begin{align} \chi_{(K,\ell)} := \sum_{i : \ell_{i,s} = \ell} \mathbf{1}_{W_\ell(y^{(i,s)},\,\cdot\,)\leq N} \cdot \mathbf{1}_{W_\ell(y^{(i,s)},\,\cdot\,) \equiv_{\operatorname{mod} \ell} K}. \end{align} $$

Observe that, for any fixed x,

$$\begin{align*}\#\{j : y^{(j,s)} \in x_1^N\} \leq \sum_{i=1} ^{d_{s}}\mathbf{1}_{W_{\ell_{i,s}}(y^{(i,s)},x)\leq N}, \end{align*}$$

and so for the random variable in (3.11) to exceed $\epsilon d_+$ , at least one of the random variables $\chi _{(K,\ell )}$ defined by (3.12) must exceed $\tfrac {\epsilon d_+}{\ell _+^2}$ , that is,

(3.13) $$ \begin{align} {\mathbb P}\{x:\#\{j : y^{(j,s)} \in x_1^N\}> \epsilon d_+\} &\leq{\mathbb P}\left( \bigcup_{(K,\ell)} \left\{x: \chi_{(K,\ell)}(x)> \epsilon\frac{d_+}{\ell_+^2 } \right\} \right). \end{align} $$

Following the same strategy as in the proof of Proposition 3.6, we use Markov’s inequality, the binomial theorem, and Lemma 3.7 to derive that, for every $b>0$ ,

$$\begin{align*}{\mathbb P}\left\{x: \chi_{(K,\ell)}(x)> \epsilon\frac{d_+}{\ell_+^2} \right\}\leq \left(1+\frac{\mathrm{e}^{b+k_{\ell_+}}}{N^{\epsilon}}\right)^{d_+}\mathrm{e}^{-b\epsilon\frac{d_+}{\ell_+^2 }}. \end{align*}$$

Choosing $b=\frac {\epsilon }{2}\ln N$ yields

$$ \begin{align*} {\mathbb P}\left\{x: \chi_{(K,\ell)}(x)> \epsilon\frac{d_+}{\ell_+^2} \right\} &\leq \mathrm{e}^{ \frac{\gamma_+\epsilon^2}4 \frac{d_+}{\ell_+}(1-o(1))} \\ &\leq \mathrm{e}^{ \frac{\gamma_+\epsilon^2}8\frac{d_+}{\ell_+}} \end{align*} $$

for N large enough, recalling that $\gamma _+<0$ . Going back to our observation (3.13), we conclude the proof by performing a union bound over K and $\ell $ .

3.3 Cross entropy

Lemma 3.10 If ${\mathbb P}$ satisfies ID, $y \in \operatorname{supp} {\mathbb P}$, and $y_1^N$ is parsed as

$$\begin{align*}y_1^N = y^{(1,N)} y^{(2,N)} \dotsc y^{(c^{\prime}_N-1,N)} y^{(c^{\prime}_N,N)}, \end{align*}$$

with $\min_{1\leq j \leq c^{\prime}_N} \ell_{j,N} \geq \lambda_N$ for some properly diverging, nonnegative sequence $(\lambda _N)_{N=1}^\infty $, then

$$\begin{align*}\sum_{j=1}^{c^{\prime}_N} \ln {\mathbb P}[y^{(j,N)}] = \ln {\mathbb P}[y_1^N] + o(N). \end{align*}$$

Proof Suppose ${\mathbb P}$ satisfies ID, $y \in \operatorname{supp} {\mathbb P}$, and $y_1^N$ is parsed as in the statement. Both the upper and lower bounds are proved similarly, so we only provide the proof of the former. Let $\epsilon> 0$ be arbitrary and note that ID yields

$$ \begin{align*} \ln {\mathbb P}[y_1^N]&=\ln {\mathbb P}[y^{(1,N)} y^{(2,N)} \dotsc y^{(c^{\prime}_N-1,N)} y^{(c^{\prime}_N,N)}]\\ &\leq \ln \left(\mathrm{e}^{k_{\ell_1}+\dots+k_{\ell_{c^{\prime}_N-1}}}{\mathbb P}[y^{(1,N)}]{\mathbb P}[ y^{(2,N)}] \dotsc {\mathbb P}[ y^{(c^{\prime}_N,N)}]\right)\\ &=\sum_{j=1}^{c^{\prime}_N}\ln {\mathbb P}[y^{(j,N)}]+\sum_{j=1}^{c^{\prime}_N-1}k_{\ell_j}. \end{align*} $$

Now since $k_\ell = o(\ell )$ and $\lambda _N \to \infty $ , we have $k_{\ell _j} < \epsilon \ell _{j}$ for N large enough. Therefore,

$$ \begin{align*} \ln {\mathbb P}[y_1^N] &< \sum_{j=1}^{c^{\prime}_N}\ln {\mathbb P}[y^{(j,N)}]+\sum_{j=1}^{c^{\prime}_N-1}\epsilon {\ell_j} \\ &< \sum_{j=1}^{c^{\prime}_N}\ln {\mathbb P}[y^{(j,N)}] + \epsilon N \end{align*} $$

for N large enough.

Remark 3.11 Note that the contribution coming from the buffers $\xi ^{(s,N)}$ , with $s\in \{1,\dotsc ,M_N\}$ , in the modified auxiliary parsing (3.7) can be embedded in the correction term $o(N)$ in the statement of Lemma 3.10. This immediately follows by observing that $M_N=o(\tilde c_N)$ .

Lemma 3.12 If ${\mathbb P}$ satisfies ID and ${\mathbb Q}$ is ergodic, and if $\operatorname{supp} {\mathbb Q} \subseteq \operatorname{supp} {\mathbb P}$, then

$$\begin{align*}-\ln {\mathbb P}[y_1^N] = Nh_{\text{c}}({\mathbb Q}|{\mathbb P}) + o(N) \end{align*}$$

for ${\mathbb Q}$ -almost every y.

Proof Fix ${\mathbb P}$ and ${\mathbb Q}$ as in the statement. In view of the upper bound in ID, we can apply Kingman’s subadditive ergodic theorem to the sequence $(f_n)_{n=1}^{\infty }$ of measurable functions on the dynamical system $(\Omega, {\mathbb Q}, T)$ defined by $f_n(x) := \ln {\mathbb P}[x_1^n]$.

3.4 Comments

The following consequence of ID played an important role in the proof of the upper bound:

  • Ad For every $n \in {\mathbb N}$, the bound

    (3.14) $$ \begin{align} {\mathbb P}[a] \geq \mathrm{e}^{-k_n}\Big(\min_{b \in \operatorname{supp} {\mathbb P}_1} {\mathbb P}[b]\Big)\, {\mathbb P}[a_1^{n}] \end{align} $$
    holds for every $a \in \operatorname{supp} {\mathbb P}_{n+1}$.

Indeed, by construction of Ziv and Merhav’s auxiliary parsings, there is a lower bound on ${\mathbb P}[\underline {y}^{(j,N)}]$ and an upper bound on ${\mathbb P}[y^{(j,N)}]$, but both the bounds (3.2) and (3.5) require a lower bound on ${\mathbb P}[{y^{(j,N)}}]$ (see Lemma 3.3). Condition Ad serves as a way of going back and forth between the two. Unfortunately, Ad may fail upon relaxing the lower bound in ID to the more general lower-decoupling conditions that have met with success in tackling other related problems [BCJP21, CDEJR23a, CJPS19, CR23]. We will come back to this point in Section 4.4.

As for the arguments available in the literature to establish KB, we foresee no difficulty in adapting our argument to a set of hypotheses where the roles of a and b are exchanged in the decoupling inequalities. Indeed, this would affect neither SE nor Ad. While the Markov property can be equivalently written in terms of conditioning on the past or conditioning on the future, the class of g-measures discussed in Section 4 and its “reverse” counterpart do not coincide (see, e.g., [BFV19, Section 4.4]).

As mentioned in the introduction, the Ziv–Merhav estimator can be written in terms of longest-match lengths:

$$ \begin{align*} \frac{c_N \ln N}{N} &= \frac{\ln N}{\frac{1}{c_N}\sum_{i=1}^{c_N} \ell^{(i,N)}}, \end{align*} $$

where

$$\begin{align*}\ell^{(i,N)} = \min\{\Lambda_N(T^{L^{(i-1,N)}}y,x), N-L^{(i-1,N)}\}, \end{align*}$$

with

$$ \begin{align*} L^{(0,N)} = 0 \qquad&\text{and}\qquad L^{(i,N)} = L^{(i-1,N)} + \ell^{(i,N)} \end{align*} $$

for $i = 1,2,\dotsc , c_N$. It is known that the longest-match estimator $(\ell ^{(1,N)})^{-1}\ln N = \Lambda _N(y,x)^{-1} \ln N$ converges almost surely to the cross entropy, with good probability estimates, for a class of measures that is more general than that considered here (see [Kon98, Section 1.3] and [CDEJR23a, Section 3]). Hence, if each $T^{L^{(i-1,N)}}y$ were replaced by a new independent sample from ${\mathbb Q}$, or by $T^{\Delta (i - 1)}y$ for some fixed deterministic $\Delta \in {\mathbb N}$, then one would expect the convergence of the Ziv–Merhav estimator to also hold considerably more generally. However, the dependence structure of the starting indices seems to be posing a serious technical difficulty for the strategy of Ziv and Merhav.
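The rewriting above can be checked mechanically: iterating capped longest matches recovers the ZM word lengths, and the estimator equals $\ln N$ divided by the empirical mean word length. A minimal self-contained sketch (ours) on the running example of the introduction:

```python
import math

def longest_match(y, x, N):
    """Lambda_N(y, x), capped at len(y): largest l with y_1^l a substring of x_1^N."""
    l = 1
    while l < min(N, len(y)) and y[:l + 1] in x[:N]:
        l += 1
    return l

def zm_word_lengths(y, x, N):
    """Word lengths l^(i,N) of the ZM parsing, via iterated longest matches."""
    lengths, L = [], 0
    while L < N:
        l = min(longest_match(y[L:N], x, N), N - L)
        lengths.append(l)
        L += l
    return lengths

x = "010001011101001110010001"
y = "011001010001020111010010"
lengths = zm_word_lengths(y, x, 24)  # [3, 5, 5, 1, 9, 1]
c_N = len(lengths)
# c_N ln(N) / N coincides with ln(N) over the mean word length:
print(c_N * math.log(24) / 24, math.log(24) / (sum(lengths) / c_N))
```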

4 Examples

In this section, we discuss broad classes of measures to which our results apply. For this discussion, we need basic topological considerations that we had avoided so far. A one-sided (resp. two-sided) subshift is a closed and shift-invariant subset of ${\cal A}^{\mathbb N}$ (resp. ${\cal A}^{\mathbb Z}$) obtained by removing all sequences containing at least one string from some set of forbidden strings. Closure is understood in the product topology, and the subshift is equipped with the subspace topology inherited from that topology. A subshift is said to be of finite type if the list of forbidden strings that defines it can be chosen to be finite. A subshift of finite type is said to be topologically transitive if, for any two strings a and b with $[a]$ and $[b]$ intersecting the subshift, there exists a third string $\xi $ such that $[a\xi b]$ also intersects the subshift. We refer the reader to [DGS76, Section 7] or [KŁO16, Section 8] for a more thorough discussion.
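As a simple illustration (ours, not from the text): the golden-mean shift – the subshift of $\{0,1\}^{\mathbb N}$ obtained by forbidding the single string 11 – is of finite type, and for a subshift defined by forbidding words of length 2, topological transitivity amounts to irreducibility of the associated 0/1 adjacency matrix, which can be checked mechanically:

```python
def is_irreducible(A):
    """Check irreducibility of a 0/1 adjacency matrix by exploring, from every
    vertex, the set of vertices reachable along allowed transitions."""
    n = len(A)
    for s in range(n):
        seen, stack = {s}, [s]
        while stack:
            u = stack.pop()
            for v in range(n):
                if A[u][v] and v not in seen:
                    seen.add(v)
                    stack.append(v)
        if len(seen) < n:
            return False
    return True

# Golden-mean shift: forbid the word "11", i.e., no transition 1 -> 1.
A = [[1, 1],
     [1, 0]]
print(is_irreducible(A))  # True: this subshift is topologically transitive
```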

4.1 Markov measures

As mentioned in Section 2, if ${\mathbb P}$ is the stationary measure for an irreducible Markov chain with positive entropy, then ${\mathbb P}$ is ergodic and satisfies ID, FE, and KB. We use this setting to illustrate the role of some of our conditions.

First, it is worth noting that the stationary measure for an irreducible Markov chain need not satisfy any form of mixing, nor the Doeblin-type condition in [KS94], because it could be periodic.

In the case of a reducible Markov chain, a stationary measure can charge two disjoint communication classes; let us call those classes ${\cal A}'$ and ${\cal A}''$. Then, for $a \in {\cal A}'$, the probability ${\mathbb P}\{x : W_1(a,x) \geq r\} \geq {\mathbb P}\{x : x_1 \in {\cal A}''\}$ does not decay as $r \to \infty $. In terms of the language of subshifts, the failure of KB is due to the fact that $\operatorname{supp} {\mathbb P}$ does not satisfy any form of specification; it is a subshift of finite type that fails to be transitive. More concretely, if the sequence x starts in ${\cal A}''$, then it remains there forever and we do not expect to be able to probe any entropic quantity that also involves the behavior of ${\mathbb P}$ on ${\cal A}'$ using the information contained in x.

Also note that a stationary measure for an irreducible Markov chain could fail to have positive entropy if, for example, it is a convex combination of Dirac masses on periodic orbits. Such a behavior is at odds with FE and can cause the bounds on the lengths of the parsed words not to be controlled in terms of $\ell _{\pm ,N}$ , a fact which was used repeatedly throughout our proofs.

4.2 Regular g-measures

Let $\Omega '$ be a topologically transitive one-sided subshift of finite type. Choosing as a starting point one particular definition in the literature among others, we will say that a translation-invariant measure ${\mathbb P}$ on $\Omega $ is a regular g-measure on $\Omega '$ if $\operatorname{supp} {\mathbb P} = \Omega'$ and there exists a continuous function $g: \Omega ' \to (0,1]$ such that

(4.1) $$ \begin{align} \sum_{\substack{y \in \Omega'\\ Ty = x}} g(y) = 1 \end{align} $$

for all $x\in \Omega '$ and

(4.2) $$ \begin{align} \lim_{n\to\infty} \sup_{x \in \Omega'} \left|\frac{{\mathbb P}[x_1^{n}]}{{\mathbb P}[x_2^n]} - g(x)\right| = 0. \end{align} $$

The convergence (4.2) can be used to show that ${\mathbb P}$ satisfies the decoupling condition ID (see [CR23, Section B.3]). Our assumption on $\Omega '$ more than suffices for ID to yield KB (see [CR23, Sections 3.1 and B.2]).

Note that the ratio being compared to g is continuous in x at finite n, and the k-level Markov condition, once written in terms of conditioning on the future, implies that this ratio is eventually constant in n – starting with $n=k+1$ . Hence, regular g-measures do generalize stationary k-level Markov measures.
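Concretely, for a stationary Markov measure the ratio in (4.2) is already constant for $n \geq 2$, where it equals $g(x) = \pi_{x_1}P_{x_1,x_2}/\pi_{x_2}$, and the normalization (4.1) is precisely stationarity of $\pi$. A quick numerical check (ours; the chain is an arbitrary choice):

```python
P = [[0.9, 0.1], [0.5, 0.5]]   # irreducible stochastic matrix
pi = [5 / 6, 1 / 6]            # stationary vector: pi P = pi

def prob(word):
    """Stationary Markov measure of the cylinder [word]."""
    p = pi[word[0]]
    for s, t in zip(word, word[1:]):
        p *= P[s][t]
    return p

def g(x):
    """g-function of the stationary Markov measure: pi_{x1} P_{x1,x2} / pi_{x2}."""
    return pi[x[0]] * P[x[0]][x[1]] / pi[x[1]]

x = (0, 1, 1, 0, 0)
for n in range(2, 6):
    print(prob(x[:n]) / prob(x[1:n]), g(x))   # the ratio is already g for n >= 2
# Normalization (4.1): g sums to 1 over the preimages y of Tx.
print(sum(g((s,) + x[1:]) for s in range(2)))
```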

Finally, let us discuss Condition FE in the context of regular g-measures. To do so, we will use the fact that the convergence (4.2) can also be used to establish the following weak Gibbs condition of Yuri at vanishing topological pressure: there exists an $\mathrm {e}^{o(n)}$ -sequence $(K_n)_{n=1}^\infty $ such that

$$\begin{align*}K_n^{-1} \mathrm{e}^{\sum_{j=0}^{n-1} \ln g(T^jx) } \leq {\mathbb P}[x_1^n] \leq K_n \mathrm{e}^{\sum_{j=0}^{n-1} \ln g(T^jx)} \end{align*}$$

for every $x \in \Omega '$; again, see [CR23, Section B.3], but it should be noted that this can be seen as part of the “g-measure folklore” [BFV19, OST05, Wal05]. We are now ready to provide a necessary and sufficient condition on the subshift $\Omega '$ for FE to hold for all regular g-measures on $\Omega '$. One special case will be that regular g-measures on topologically mixing subshifts of finite type with more than one letter satisfy ID and FE, allowing for an application of our main result.

Lemma 4.1 Suppose that ${\mathbb P}$ is a regular g-measure on $\Omega '$ . Then, ${\mathbb P}$ satisfies FE if and only if there exists r with the following property: for every $y \in \Omega '$ , there exists $t\leq r$ such that $T^ty$ has more than one preimage in $\Omega '$ .

Proof Suppose that there exists r as above. Then, for every $y \in \Omega '$ , there exists $t \leq r$ such that

$$\begin{align*}g(T^{t-1} y) = 1 - \sum_{\substack{z \in \Omega'\setminus\{T^{t-1} y\} \\ Tz = T^{t}y}} g(z) \leq 1 - \delta, \end{align*}$$

where $\delta := \min g$ . This number is positive by continuity and compactness. Therefore,

$$ \begin{align*} \ln g(y) + \ln g(Ty) + \dotsb + \ln g(T^{t-1}y) \leq \ln(1-\delta) \end{align*} $$

and $\ln g(T^{t'}y) \leq \ln (1-\delta )$ for any $t'\geq t$ . But then, the weak Gibbs property yields

$$ \begin{align*} {\mathbb P}[y_1^n] &\leq K_n \mathrm{e}^{\sum_{i=0}^{\lfloor \frac nr\rfloor-1} \sum_{t=0}^{ r-1} \ln g(T^{ir+t}y)} \\ &\leq \exp\left(\ln K_n + \left\lfloor \frac nr \right\rfloor \ln(1-\delta)\right), \end{align*} $$

with $\ln K_n = o(n)$ . We conclude that Condition FE holds. Suppose now that no such r exists. Then, for every $n\in {\mathbb N}$ , there exists $y\in \Omega '$ such that $T^ty$ has only one preimage in $\Omega '$ for all $t\leq n$ . By the condition (4.1), this means that

$$\begin{align*}\ln g(y)+\ln g(Ty)+\dots+\ln g(T^{ n-1} y)=0, \end{align*}$$

which, together with the lower bound in the weak Gibbs property, implies

$$\begin{align*}{\mathbb P}[y_1^n]\geq K_n^{-1}=\mathrm{e}^{-o(n)}. \end{align*}$$

Since the right-hand side is eventually greater than $\mathrm {e}^{\gamma _+n}$ for any $\gamma _+<0$ , FE fails as well.

Remark 4.2 If ${\cal A}$ contains at least two symbols, then the hypotheses – and thus the conclusions – of Lemma 4.1 can be derived from a suitable specification property, but not any specification property from the literature.

4.3 Statistical mechanics

Let $\overline {\Omega }'$ be a topologically transitive, two-sided subshift of finite type, and let $\Omega '$ be its one-sided counterpart. Consider a family $(\Phi _X)_{X \Subset {\mathbb Z}}$ of interactions with

  • the continuity property that, for all $X \Subset {\mathbb Z}$ , the function $\Phi _X$ – although seen as a measurable function on $\overline {\Omega }'$ – depends on the symbols with indices in the finite subset X only,

  • the translation-invariance property that, for all $X \Subset {\mathbb Z}$ , $\Phi _{X+1} = \Phi _X \circ T$ ,

  • the absolute summability property that $\sum _{\substack {X \Subset {\mathbb Z} \\ X \ni 1}} \sup _{x\in \overline {\Omega }'} |\Phi _X(x)|< \infty .$

Such interactions are considered, e.g., in [Rue04, Sections 1.2 and 3.1] and are colloquially said to be in “the small space.” It is well known that any equilibrium measure ${\mathbb P}$ (in the sense of the variational principle) for the energy-per-site potential

$$\begin{align*}\phi := \sum_{\substack{X \Subset {\mathbb Z} \\ \min X = 1}} {\Phi_X} \end{align*}$$

coming from such a family of interactions is a translation-invariant Gibbs state in the sense of the Dobrushin–Lanford–Ruelle equations (see, e.g., [Rue04, Sections 3.2 and 4.2]). Because we are working with a sufficiently regular subshift $\overline {\Omega }'$, the Dobrushin–Lanford–Ruelle equations and absolute summability can be used to show that ${\mathbb P}$ satisfies ID by adapting the argument of [LPS95, Section 9] for the case $\overline {\Omega }' = {\cal A}^{\mathbb Z}$. Again, the subshift is sufficiently regular for ID to yield KB (see [CR23, Sections 3.1 and B.2]).

We now turn to Condition FE, assuming a certain familiarity with the thermodynamic formalism, physical equivalence, and the Griffiths–Ruelle theorem on the reader’s part (see, e.g., [Rue04, Section 4]).

Lemma 4.3 Suppose that $\overline {\Omega }'$ , $\Phi $ , and ${\mathbb P}$ are as above. If $\Omega '$ has positive topological entropy, then ${\mathbb P}$ satisfies FE.

Proof sketch.

We split the proof according to whether or not $\Phi $ is equivalent to a constant in the sense of Ruelle. Because we can always add or subtract a constant from each $\Phi _{\{i\}}$ , there is no loss of generality in assuming that $\phi $ has topological pressure $P_{\text {top}}(\phi )=0$ .

  • Case 1. On the one hand, if $\Phi $ is not equivalent to a constant in the sense of Ruelle, then the Griffiths–Ruelle theorem guarantees that $\alpha \mapsto P_{\text {top}}(\phi -\alpha \phi )$ is strictly convex (see, e.g., [Rue04, Section 4.6]). On the other hand, by the weak Gibbs property established, e.g., in [PS20, Section 2], we have

    $$\begin{align*} P_{\text{top}}(\phi - \alpha\phi) = \lim_{n\to\infty} \frac{1}{n} \ln \sum_{a \in \operatorname{supp} {\mathbb P}_n} {\mathbb P}[a]^{1-\alpha}. \end{align*}$$
    It is easy to see from this relation that the function $\alpha \mapsto P_{\text {top}}(\phi -\alpha \phi )$ is nondecreasing. Combining these properties, we deduce that $P_{\text {top}}(\phi -\alpha \phi )<0$ for all $\alpha < 0$. Assuming for the sake of contradiction that FE fails, one easily derives a contradiction.
  • Case 2. If, on the contrary, $\Phi $ is equivalent to a constant in the sense of Ruelle, then the weak Gibbs property reads

    $$\begin{align*}K_n^{-1} \mathrm{e}^{\nu n} \leq {\mathbb P}[x_1^n] \leq K_n \mathrm{e}^{\nu n} \end{align*}$$
    for all $x \in \Omega '$ and some constant $\nu $ . Given that $\Omega '$ has positive topological entropy, summing the left-most inequality over $x_1^n$ , the fact that $K_n = \mathrm {e}^{o(n)}$ can be used to show that $\nu < 0$ . Then, the right-most inequality yields that FE holds, again thanks to the fact that $K_n = \mathrm {e}^{o(n)}$ .

Every irreducible, stationary Markov measure with stochastic matrix $[P_{a,b}]_{a,b\in {\cal A}}$ can be obtained in this way by considering the following nearest-neighbor interactions on its support:

$$\begin{align*}\Phi_{\{i,i+1\}}(x) = \ln P_{x_i,x_{i+1}} \end{align*}$$

for $i\in {\mathbb Z}$ and $\Phi _X(x) = 0$ for X not of the form $\{i,i+1\}$. To see this, one can check by direct computation that, on its support, the Markov measure satisfies the Bowen–Gibbs condition for the corresponding $\phi $. For k-level Markov measures, consider instead

$$\begin{align*}\Phi_{\{i,\dotsc,i+k-1,i+k\}}(x) = \ln \frac{{\mathbb P}[x_i\dotsc x_{i+k-1}x_{i+k}]}{{\mathbb P}[x_i\dotsc x_{i+k-1}]}. \end{align*}$$

In this sense, equilibrium measures for potentials arising from interactions that are absolutely summable do generalize stationary k-level Markov measures; we refer the reader to [BGM+21, CHM+14] for recent thorough discussions of variants and converses to this observation. This generalization is far-reaching, as the theory of entropy, large deviations, and phase transitions is much richer in the small space of interactions than in the space of finite-range interactions.
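As a numerical illustration (ours) of the nearest-neighbor case, the energy-per-site potential associated with the interaction above is $\phi(x) = \ln P_{x_1,x_2}$, and one checks that the Bowen–Gibbs discrepancy $\ln {\mathbb P}[x_1^n] - \sum_{j=0}^{n-1}\phi(T^jx)$ stays bounded in n (here it equals $\ln \pi_{x_1} - \ln P_{x_n,x_{n+1}}$):

```python
import math

P = [[0.9, 0.1], [0.5, 0.5]]   # irreducible stochastic matrix
pi = [5 / 6, 1 / 6]            # stationary vector

def log_prob(word):
    """ln P[x_1^n] for the stationary Markov measure."""
    return math.log(pi[word[0]]) + sum(math.log(P[s][t])
                                       for s, t in zip(word, word[1:]))

def birkhoff_sum(x, n):
    """sum_{j=0}^{n-1} phi(T^j x) with phi(x) = ln P_{x_1, x_2}."""
    return sum(math.log(P[x[j]][x[j + 1]]) for j in range(n))

x = (0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1)   # a finite stretch of a sample path
for n in range(1, len(x)):
    print(n, log_prob(x[:n]) - birkhoff_sum(x, n))   # stays bounded in n
```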

In a similar vein, equilibrium measures (in the sense of the variational principle on $\Omega '$) for abstract potentials $\phi $ in the Bowen class also satisfy ID, thanks to the Bowen–Gibbs property (see [Wal01, Section 4]). We refer the reader to [Wal01, Section 1] for a definition of the Bowen class, which can be traced back to [Bow74]. This class includes potentials with summable variations, and thus Hölder-continuous potentials, and thus potentials naturally associated with stationary k-level Markov measures. A more complete discussion from the point of view of decoupling – including relaxation of the conditions on $\Omega '$ – can be found in [CR23, Section 2.3].

4.4 Hidden-Markov measures

While the above generalizations beyond Markovianity are often studied in the literature on mathematical physics and abstract dynamical systems, they might not be the most natural from an information-theoretic point of view; hidden-Markov models would most likely come to mind first for many practitioners. We recall that, among several equivalent representations, a stationary hidden-Markov measure ${\mathbb P}$ can be characterized by a tuple $(\pi , P, R)$ where $(\pi , P)$ characterizes in the usual way a stationary Markov process on a set $\mathcal {S}$ , called the hidden alphabet, and R is a $(\#\mathcal {S})$ -by- $(\#{\cal A})$ matrix whose rows each sum to 1:

$$ \begin{align*} {\mathbb P}[a_1^n] = \sum_{s_1^n \in \mathcal{S}^n} \pi_{s_1} R_{s_1, a_1} P_{s_1, s_2} R_{s_2, a_2}\cdots P_{s_{n-1}, s_{n}} R_{s_n, a_n} \end{align*} $$

for $n \in {\mathbb N}$ and $a_1^n \in {\cal A}^n$. We restrict our attention to the case where $\mathcal {S}$ is a finite set and $P$ is irreducible. We view the entry $R_{s,a}$ as the probability of observing $a \in {\cal A}$ at a given time step given the hidden state $s \in \mathcal {S}$ at that same time step – the dynamics of the latter being governed by the Markov chain $(\pi ,P)$. There exist only very singular examples of such measures for which FE fails. As exhibited by our next lemma, this can only happen if the process is eventually almost-surely deterministic.
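The marginal above can be evaluated by a standard transfer-matrix (forward) recursion, ${\mathbb P}[a_1^n] = \pi D_{a_1} P D_{a_2} \cdots P D_{a_n} \mathbf {1}$ with $D_a := \operatorname {diag}(R_{s,a})_{s \in \mathcal {S}}$. The following sketch – the model $(\pi , P, R)$ is a hypothetical choice of ours – implements this recursion and checks, by brute force, the criterion of the lemma below.

```python
import itertools

import numpy as np

# Hypothetical hidden-Markov model: two hidden states, three letters.
pi = np.array([0.5, 0.5])          # stationary for the P below
P = np.array([[0.5, 0.5],
              [0.5, 0.5]])
R = np.array([[0.8, 0.2, 0.0],     # rows sum to 1
              [0.1, 0.2, 0.7]])

def hmm_prob(word, init):
    """P[a_1^n] = init D_{a_1} P D_{a_2} ... P D_{a_n} 1, i.e., the
    sum over hidden trajectories in the displayed formula above."""
    v = init * R[:, word[0]]
    for a in word[1:]:
        v = (v @ P) * R[:, a]
    return v.sum()

def more_than_one_word(s, L):
    """Criterion of the lemma below: started deterministically from
    hidden state s, more than one word of length L has positive
    probability."""
    e_s = np.eye(len(pi))[s]
    words = itertools.product(range(R.shape[1]), repeat=L)
    return sum(hmm_prob(w, e_s) > 0 for w in words) > 1

assert all(more_than_one_word(s, L=1) for s in range(len(pi)))
print(hmm_prob((0, 2, 1), pi))     # a sample marginal probability
```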

Lemma 4.4 Let ${\mathbb P}$ be as above. Then, ${\mathbb P}$ satisfies FE if and only if, for each $s \in \mathcal {S}$, there exists $L \in {\mathbb N}$ such that

$$\begin{align*}\#\{a \in {\cal A}^L : {\mathbb P}[a | s_1 = s] > 0 \} > 1. \end{align*}$$

Proof Suppose that for each $s\in \mathcal S$ there exists $L$ as above. By inspection of the canonical form of $P$ provided by the Perron–Frobenius theorem, one deduces that there exists a finite set $\Sigma '$ of possible row vectors $\sigma $ that can arise as limit points for sequences of the form $([P^{m}]_{i,\cdot \,})_{m=1}^\infty $. Let $\Sigma := \Sigma ' \cup \{\pi \}$ with $\pi $ the unique invariant probability row vector for $P$. By stochasticity, each $\sigma \in \Sigma $ has nonnegative entries that sum to 1. In this context, by assumption, there exists $L \in {\mathbb N}$ such that

$$\begin{align*}\delta := \max_{\sigma \in \Sigma} \max_{a_1^L \in {\cal A}^L} \sum_{s_1^L \in \mathcal{S}^L} \sigma_{s_1} R_{s_1, a_1} P_{s_1, s_2} R_{s_2, a_2} \cdots P_{s_{L-1}, s_L} R_{s_L, a_L} \end{align*}$$

is strictly less than $1$. Given $\epsilon> 0$, by inspection of the same canonical form, there exists $m \in {\mathbb N}$ with the following property: for all $i$, there is $\sigma \in \Sigma $ such that

$$\begin{align*}[P^m]_{i,\,\cdot} < \sigma + \epsilon. \end{align*}$$

Then, for an arbitrary word $a_1^n \in {\cal A}^n$ with $n \geq L$,

$$ \begin{align*} {\mathbb P}[a_1^n] &\leq {\mathbb P}[a_1^{L + q(m+L)}] \end{align*} $$

for $q := \max \{k \in {\mathbb N}_0 : n \geq L + k(m+L) \}$ . We introduce the shorthands $\mathcal {R}_0(s) = R_{s_1, a_1} \cdots R_{s_L, a_L}$ ,

$$\begin{align*}\mathcal{R}_k(s) = R_{s_{(k-1)(m+L)+L+1}, a_{(k-1)(m+L)+L+1}} \cdots R_{s_{k(m+L)+L}, a_{k(m+L)+L}} \end{align*}$$

and

$$\begin{align*}\mathcal{R}^{\prime}_k(s) = R_{s_{k(m+L) +1}, a_{k(m+L) +1}} \cdots R_{s_{k(m+L)+L}, a_{k(m+L)+L}} \end{align*}$$

when $1 \leq k \leq q$. We also identify $s^1_0 \equiv s_L$, $s^2_0 \equiv s^1_{m+L}$, $s^3_0 \equiv s^2_{m+L}$, and so forth. One then obtains

$$ \begin{align*} &{\mathbb P}[a_1^{L + q(m+L)}]\\ & = \!\! \sum_{\substack{s_1, \dotsc, s_L \\ s_1^{k}, \dotsc, s_{m+L}^k \\ \text{for } 1 \leq k \leq q}} \pi_{s_1} P_{s_1, s_2} \cdots P_{s_{L-1}, s_{L}} \mathcal{R}_0(s) \prod_{k=1}^q P_{s^{k}_{0}, s^k_1}P_{s^{k}_{1}, s^{k}_2} \dotsb P_{s^{k}_{m+L-1}, s^{k}_{m+L}} \mathcal{R}_k(s) \\ & \leq \!\! \sum_{\substack{s_1, \dotsc, s_L \\ s_{m}^{k}, \dotsc, s_{m+L}^k \\ \text{for } 1 \leq k \leq q}} \pi_{s_1} P_{s_1, s_2} \cdots P_{s_{L-1}, s_{L}} \mathcal{R}_0(s) \prod_{k=1}^q (\sigma_{s_m^k}^{(s_0^k)} + \epsilon)P_{s^{k}_{m}, s^{k}_{m+1}} \dotsb P_{s^{k}_{m+L-1}, s^{k}_{m+L}} \mathcal{R}_k'(s) \end{align*} $$

for some appropriate choices of $\sigma ^{(s^{k}_0)} \in \Sigma $ that depend on $m$ and the index $s^{k}_0$ only. Therefore,

$$ \begin{align*} {\mathbb P}[a_1^{L + q(m+L)}] &\leq \delta \cdot (\delta + \epsilon (\#\mathcal{S}))^{q}. \end{align*} $$

By taking $\epsilon>0$ such that $\delta + \epsilon (\#\mathcal {S}) < 1$ and noting that $q$ scales linearly with $n$, we conclude that FE holds.

To see the converse implication, suppose that there exists $t\in \mathcal S$ such that there is no $L$ as above. Then, there exists $a \in \Omega $ such that ${\mathbb P}[a_1^n|s_1 = t] = 1$ for all $n \in {\mathbb N}$. Since

$$ \begin{align*}{\mathbb P}[a_1^n] \geq {\mathbb P}[a_1^n | s_1 = t] \cdot \pi_t = \pi_t\end{align*} $$

for all $n \in {\mathbb N}$, and since $\mathrm {e}^{\gamma _{+}n}$ is eventually smaller than $\pi _t> 0$ whenever $\gamma _{+} < 0$, FE fails.
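The failure mode in the converse direction can be made concrete with a hypothetical chain of our own design in which every hidden state emits the same letter, so the observed process is deterministic; the sketch reuses the transfer recursion from above.

```python
import numpy as np

# Hypothetical example where FE fails: both hidden states emit the
# letter 0 with probability 1, so P[0^n | s_1 = t] = 1 for every t,
# and P[0^n] = 1 >= pi_t for all n -- no bound e^{gamma n} with
# gamma < 0 can eventually dominate these cylinder probabilities.
P = np.array([[0.5, 0.5],
              [0.5, 0.5]])        # irreducible hidden chain
R = np.array([[1.0, 0.0],
              [1.0, 0.0]])        # deterministic emissions
pi = np.array([0.5, 0.5])         # stationary for P

def prob(word):
    # Transfer recursion for P[a_1^n], as in the earlier sketch.
    v = pi * R[:, word[0]]
    for a in word[1:]:
        v = (v @ P) * R[:, a]
    return v.sum()

assert all(np.isclose(prob((0,) * n), 1.0) for n in range(1, 12))
```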

Figure 1: An example that does not satisfy Ad: a direct computation shows that is too unlikely compared to .

One can show that every stationary hidden-Markov measure satisfies the upper bound in ID. But in general – even if $P$ is irreducible – only a weaker form of the lower bound, known as selective lower decoupling, holds (see [BCJP21, Section 2] and [CJPS19, Section 2]). The fact that selective lower decoupling implies KB but does not imply the condition called Ad in Section 3.4 seems to pose a genuine obstacle. Determining whether ZM estimation is generally valid in the class of irreducible hidden-Markov measures is – to the best of our knowledge – an important open problem.

In the further specialized case where the elements of $R$ are all in $\{0,1\}$ – this is sometimes called the function-Markov or lumped-Markov case – some conditions for the g-measure property (and thus ID) are discussed in [CU03, Ver11, Yoo10]. However, it is not difficult to find examples for which none of these known sufficient conditions hold.
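For concreteness, here is a minimal sketch of a lumped-Markov measure – the three-state chain and its entries are a hypothetical example of ours, unrelated to Figure 1 – which reuses the transfer recursion above and exhibits, by comparing the conditional probabilities ${\mathbb P}[0\,|\,0]$ and ${\mathbb P}[0\,|\,00]$, that the lumped process need not itself be a Markov measure.

```python
import numpy as np

# Hypothetical three-state chain lumped onto {0, 1}: hidden states 0
# and 1 both emit the letter 0, hidden state 2 emits the letter 1, so
# R has entries in {0, 1} as in the function-Markov case above.
P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.1, 0.9],
              [0.5, 0.0, 0.5]])
R = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0]])
pi = np.array([45.0, 5.0, 9.0]) / 59.0   # stationary: pi @ P == pi

def lumped_prob(word):
    """Transfer recursion for P[a_1^n], as in the previous sketch."""
    v = pi * R[:, word[0]]
    for a in word[1:]:
        v = (v @ P) * R[:, a]
    return v.sum()

# Here P[0 | 0] = 0.91 while P[0 | 00] = 41/45.5 ~ 0.901: the lumped
# process remembers more than one letter of its past, so it is not a
# (1-level) Markov measure, although it is a bona fide hidden-Markov
# measure.
assert not np.isclose(lumped_prob((0, 0)) / lumped_prob((0,)),
                      lumped_prob((0, 0, 0)) / lumped_prob((0, 0)))
```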

Example 4.5 The stationary measure built from the four-hidden-state chain depicted in Figure 1 satisfies the upper bound in ID, as well as FE and SE, but not Ad – and hence not ID. However, note that this example satisfies the Doeblin-type condition of [KS94] with $r=3$.

Acknowledgements

The authors would like to thank G. Cristadoro, N. Cuneo, and V. Jakšić for stimulating discussions on the topic of this note, as well as the referees for comments that helped improve the manuscript.

Footnotes

The research of N.B. and R.R. was partially funded by the Fonds de recherche du Québec – Nature et technologies (FRQNT) and by the Natural Sciences and Engineering Research Council of Canada (NSERC). The research of R.G. was partially funded by the Rubin Gruber Science Undergraduate Research Award and Axel W Hundemer. The research of G.P. was supported by the CY Initiative of Excellence through the grant Investissements d’Avenir ANR-16-IDEX-0008, and was done under the auspices of the Gruppo Nazionale di Fisica Matematica (GNFM) section of the Istituto Nazionale di Alta Matematica (INdAM) while G.P. was a postdoctoral researcher at the University of Milano-Bicocca (Milan, Italy). Part of this work was done during a stay of the four authors in Neuville-sur-Oise, funded by CY Initiative (grant Investissements d’avenir ANR-16-IDEX-0008).

1 Throughout this paper, we will refer to the partitioning symbol “ $|$ ” as a separator, and we will say that a separator falls within a given string if the separator lies after one of the letters that make up the string.

2 The requirement that $(k_n)_{n=1}^\infty $ be nondecreasing presents no loss of generality because one can always replace $k_n$ with $\max _{m \leq n} k_m$ and preserve the other desired properties.

3 For example, in the notation of [KŁO16, Section 8], Property (8) suffices. We refer the reader to [CR23] for finer specification properties.

4 In fact, by (2.3), we can set

5 We are using “ergodic” in the sense of dynamical systems, meaning that all shift-invariant subsets of ${\cal A}^{\mathbb Z}$ either have measure 0 or 1 according to ${\mathbb Q}$ , so the following common caveat is in order: ${\mathbb Q}$ could come from a Markov chain and be ergodic in this sense even though it is periodic, which is at odds with a terminology sometimes used in the literature on Markov chains. As far as the decoupling of ${\mathbb Q}$ is concerned, we in fact only use (2.2), and not (2.3).

6 In fact, as long as $s < M_N$ , the probabilities are equal.

7 The minimum over the two terms will be given by the former as long as $i<c_N$ . However, this formulation is necessary to take care of the “edge cases” alluded to in the Introduction.

8 For example, in the terminology of [KŁO16, Section 8], Property (6) suffices, but Property (8) does not, as it allows $\Omega ' = \{{0101010101010}\dotsc , {1010101010101}\dotsc \}$.

9 With a slight abuse of notation, we are using ${\mathbb P}$ for both the equilibrium measure on $\overline {\Omega }' \subseteq {\cal A}^{\mathbb Z}$ and its natural marginal on $\Omega ' \subseteq {\cal A}^{\mathbb N}$ . Note that, by construction, the potential $\phi $ only depends on symbols from $\Omega '$ .

References

Barbieri, S., Gómez, R., Marcus, B., Meyerovitch, T., and Taati, S., Gibbsian representations of continuous specifications: the theorems of Kozlov and Sullivan revisited. Commun. Math. Phys. 382(2021), 1111–1164.
Basile, C., Benedetto, D., Caglioti, E., and Degli Esposti, M., An example of mathematical authorship attribution. J. Math. Phys. 49(2008), no. 12, 125211.
Benedetto, D., Caglioti, E., and Loreto, V., Language trees and zipping. Phys. Rev. Lett. 88(2002), 048702.
Benoist, T., Cuneo, N., Jakšić, V., and Pillet, C.-A., On entropy production of repeated quantum measurements II: examples. J. Stat. Phys. 182(2021), no. 3, 1–71.
Benoist, T., Jakšić, V., Pautrat, Y., and Pillet, C.-A., On entropy production of repeated quantum measurements I: general theory. Commun. Math. Phys. 357(2018), no. 1, 77–123.
Berghout, S., Fernández, R., and Verbitskiy, E., On the relation between Gibbs and g-measures. Ergodic Theor. Dyn. Syst. 39(2019), no. 12, 3224–3249.
Bowen, R., Some systems with unique equilibrium states. Math. Syst. Theor. 8(1974), no. 3, 193–202.
Bradley, R. C., Basic properties of strong mixing conditions. A survey and some open questions. Probab. Surv. 2(2005), 107–144.
Chandgotia, N., Han, G., Marcus, B., Meyerovitch, T., and Pavlov, R., One-dimensional Markov random fields, Markov chains and topological Markov fields. Proc. Amer. Math. Soc. 142(2014), no. 1, 227–242.
Chazottes, J.-R. and Ugalde, E., Projection of Markov measures may be Gibbsian. J. Stat. Phys. 111(2003), no. 5/6, 1245–1272.
Coutinho, D. P. and Figueiredo, M. A., Information theoretic text classification using the Ziv–Merhav method. In: Marques, J. S., Pérez de la Blanca, N., and Pina, P. (eds.), Pattern recognition and image analysis, Lecture Notes in Computer Science, 3523, Springer, Berlin, 2005, pp. 355–362.
Coutinho, D. P., Fred, A. L., and Figueiredo, M. A., One-lead ECG-based personal identification using Ziv–Merhav cross parsing. In: 20th international conference on pattern recognition, IEEE, Los Alamitos, 2010, pp. 3858–3861.
Cristadoro, G., Degli Esposti, M., Jakšić, V., and Raquépas, R., On a waiting-time result of Kontoyiannis: mixing or decoupling? Stoch. Proc. Appl. 166(2023), 104222.
Cristadoro, G., Degli Esposti, M., Jakšić, V., and Raquépas, R., Recurrence times, waiting times and universal entropy production estimators. Lett. Math. Phys. 113(2023), no. 1, Article no. 19.
Cuneo, N., Jakšić, V., Pillet, C.-A., and Shirikyan, A., Large deviations and fluctuation theorem for selectively decoupled measures on shift spaces. Rev. Math. Phys. 31(2019), no. 10, 1950036.
Cuneo, N. and Raquépas, R., Large deviations of return times and related entropy estimators on shift spaces. Commun. Math. Phys., to appear. Preprint, 2023, arXiv:2306.05277 [math.PR].
Denker, M., Grillenberger, C., and Sigmund, K., Ergodic theory on compact spaces, Lecture Notes in Mathematics, 527, Springer, Berlin, 1976.
Kontoyiannis, I., Asymptotic recurrence and waiting times for stationary processes. J. Theor. Probab. 11(1998), no. 3, 795–811.
Kontoyiannis, I. and Suhov, Y. M., Prefixes and the entropy rate for long-range sources. In: Kelly, F. P. (ed.), Probability, statistics and optimization: a tribute to Peter Whittle, Wiley, New York, 1994.
Kwietniak, D., Łącka, M., and Oprocha, P., A panorama of specification-like properties and their consequences. In: Kolyada, S., Möller, M., Moree, P., and Ward, T. (eds.), Dynamics and numbers, Contemporary Mathematics, 669, American Mathematical Society, Providence, RI, 2016, pp. 155–186.
Lewis, J. T., Pfister, C.-É., and Sullivan, W. G., Entropy, concentration of probability and conditional limit theorems. Markov Proc. Relat. Fields 1(1995), no. 3, 319–386.
Lippi, M., Montemurro, M. A., Degli Esposti, M., and Cristadoro, G., Natural language statistical features of LSTM-generated texts. IEEE Trans. Neural Netw. Learn. Syst. 30(2019), no. 11, 3326–3337.
Olivier, E., Sidorov, N., and Thomas, A., On the Gibbs properties of Bernoulli convolutions related to $\beta$-numeration in multinacci bases. Monatshefte Math. 145(2005), no. 2, 145–174.
Pfister, C.-É., Thermodynamical aspects of classical lattice systems. In: Sidoravicius, V. (ed.), In and out of equilibrium: probability with a physics flavor, Progress in Probability, 51, Birkhäuser, 2002, pp. 393–472.
Pfister, C.-É. and Sullivan, W. G., Asymptotic decoupling and weak Gibbs measures for finite alphabet shift spaces. Nonlinearity 33(2020), no. 9, 4799–4817.
Ro, S., Guo, B., Shih, A., Phan, T. V., Austin, R. H., Levine, D., Chaikin, P. M., and Martiniani, S., Model-free measurement of local entropy production and extractable work in active matter. Phys. Rev. Lett. 129(2022), no. 22, 220601.
Roldán, É. and Parrondo, J. M. R., Entropy production and Kullback–Leibler divergence between stationary trajectories of discrete systems. Phys. Rev. E 85(2012), 031129.
Ruelle, D., Thermodynamic formalism, 2nd ed., Cambridge University Press, Cambridge, 2004.
Shields, P. C., Waiting times: positive and negative results on the Wyner–Ziv problem. J. Theor. Probab. 6(1993), no. 3, 499–519.
Shields, P. C., The ergodic theory of discrete sample paths, Graduate Studies in Mathematics, 13, American Mathematical Society, Providence, RI, 1996.
van Enter, A. C., Fernández, R., and Sokal, A. D., Regularity properties and pathologies of position-space renormalization-group transformations: scope and limitations of Gibbsian theory. J. Stat. Phys. 72(1993), 879–1167.
Verbitskiy, E., Thermodynamics of hidden Markov processes. In: Marcus, B., Petersen, K., and Weissman, T. (eds.), Entropy of hidden Markov processes and connections to dynamical systems, London Mathematical Society Lecture Note Series, Cambridge University Press, Cambridge, 2011, pp. 258–272.
Walters, P., Convergence of the Ruelle operator for a function satisfying Bowen’s condition. Trans. Amer. Math. Soc. 353(2001), no. 1, 327–347.
Walters, P., Regularity conditions and Bernoulli properties of equilibrium states and g-measures. J. Lond. Math. Soc. 71(2005), no. 2, 379–396.
Wyner, A. D. and Ziv, J., Some asymptotic properties of the entropy of a stationary ergodic data source with applications to data compression. IEEE Trans. Inf. Theory 35(1989), no. 6, 1250–1258.
Yoo, J., On factor maps that send Markov measures to Gibbs measures. J. Stat. Phys. 141(2010), no. 6, 1055–1070.
Ziv, J. and Lempel, A., Compression of individual sequences via variable-rate coding. IEEE Trans. Inf. Theory 24(1978), no. 5, 530–536.
Ziv, J. and Merhav, N., A measure of relative entropy between individual sequences with application to universal classification. IEEE Trans. Inf. Theory 39(1993), no. 4, 1270–1279.