Pointer chasing via triangular discrimination

Abstract. We prove an essentially sharp $\tilde\Omega(n/k)$ lower bound on the $k$-round distributional complexity of the $k$-step pointer chasing problem under the uniform distribution, when Bob speaks first. This is an improvement over Nisan and Wigderson's $\tilde\Omega(n/k^2)$ lower bound, and essentially matches the randomized lower bound proved by Klauck. The proof is information-theoretic, and a key part of it is using asymmetric triangular discrimination instead of total variation distance; this idea may be useful elsewhere.


Introduction
Pointer chasing is a natural and well-known problem that captures the importance of interaction. In its two-player bit version, Alice gets as input a map $f_A : A \to B$ and Bob gets as input a map $f_B : B \to A$; starting from a fixed vertex, the players must follow the pointers for $k$ steps and output one bit of the final vertex. At a high level, the lower bound proof shows that if the protocol is too short then, when it terminates, the inputs are still fairly random, which is impossible when the protocol achieves its goal. The proof uses a measure of distance between distributions that is new in this context: the triangular discrimination. Roughly speaking, triangular discrimination replaces total variation distance in a way that allows us to avoid the square-root loss that Pinsker's inequality yields.
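For concreteness, here is a minimal sketch of the $k$-step pointer chasing function. It is not code from the paper; the start vertex, the output bit, and the names (`chase`, `f_a`, `f_b`) are our assumptions, chosen to match the notation used later ($X_s = f_A(s)$, $Y_s = f_B(n+s)$, $Z_0 = 1$).

```python
# A minimal sketch of k-step pointer chasing (not taken from the paper); the
# start vertex, the output bit, and variable names are assumptions chosen to
# match the notation used later: Z_0 = 1, odd steps apply f_A, even steps f_B.
import random

def chase(f_a, f_b, k, start=1):
    """Follow the pointers for k steps, alternating between f_A and f_B."""
    z = start
    for t in range(1, k + 1):
        z = f_a[z] if t % 2 == 1 else f_b[z]
    return z

n, k = 8, 3
# Uniformly random inputs: f_A maps A = {1,...,n} into B = {n+1,...,2n},
# and f_B maps B back into A.
f_a = {s: random.randint(n + 1, 2 * n) for s in range(1, n + 1)}
f_b = {s: random.randint(1, n) for s in range(n + 1, 2 * n + 1)}
z_k = chase(f_a, f_b, k)
print(z_k, z_k % 2)  # e.g. the players may need to decide whether Z_k is even
```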
This square-root loss appears in many works, and is directly related to several fundamental questions. For example, it appears in the parallel repetition theorem, and is connected to the 'strong parallel repetition' conjecture, which is motivated by Khot's unique games conjecture [10]. The 'strong parallel repetition' conjecture was falsified by Raz [21], who showed that this square-root loss is necessary for parallel repetition. This loss also appears in direct sums and products in communication complexity [1,3], where it is related to the question of optimal compression of protocols. It is still unclear if the square-root loss is necessary for the direct sum question.
Coming back to pointer chasing, the square-root loss also appears in Nisan and Wigderson's lower bound [17]. This work shows that we can circumvent this loss by using triangular discrimination instead of Kullback-Leibler divergence. We are not aware of any other metric or divergence that can replace triangular discrimination in this respect. We believe that using triangular discrimination can yield better quantitative bounds in other cases as well. For this reason, in Section 2.1, we provide a clean example that demonstrates the main new technical idea.

Triangular discrimination
Measures of distance between probability distributions are extremely useful tools in many areas of research. A specific family of such measures is $f$-divergences (also known as Csiszár-Morimoto or Ali-Silvey divergences). These are measures of the form $D_f(p\|q) = \sum_{\omega} q(\omega)\, f\!\left(\tfrac{p(\omega)}{q(\omega)}\right)$ for a real convex function $f$ so that $f(1) = 0$ (where some conventions such as $0 f(0/0) = 0$ are used). For more background, see [5] and the references within. Each of these measures has unique properties, which make it useful in different contexts. For example, the $\ell_1$ distance is useful due to its statistical meaning, and the Kullback-Leibler divergence is useful due to its tight relation to information theory (and properties such as the chain rule).
Here we use the triangular discrimination [24], defined as $\Delta(p, q) = D_f(p\|q)$ with $f(x) = \frac{(x-1)^2}{x+1}$, that is, $\Delta(p, q) = \sum_{\omega} \frac{(p(\omega) - q(\omega))^2}{p(\omega) + q(\omega)}$, where by convention $0/0 = 0$. Since $\Delta$ is not so well known in this context, we briefly discuss its properties (for more details see [24,5]). Like all $f$-divergences, it is non-negative, it is convex in $(p, q)$, it satisfies a data processing inequality (also known as a lumping property), and more. It is also equivalent, up to constant factors, to the Jensen-Shannon divergence $\mathrm{JS}$. It is, however, sometimes easier to work with than $\mathrm{JS}$ since its formula is 'simpler'. It satisfies the following 'improvement' over Pinsker's inequality (which states that $|p - q|_1^2 \le 2 D(p\|q)$).
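The following numerical check (not code from the paper; the helper names are ours) instantiates the generic $f$-divergence with $f(x) = |x - 1|$, which recovers the $\ell_1$ distance, and with $f(x) = (x-1)^2/(x+1)$, which recovers $\Delta$, and verifies Pinsker's inequality together with the Cauchy-Schwarz bound $|p - q|_1^2 \le 2\Delta(p, q)$ on random examples.

```python
# Numerical sanity check of the definitions above (not from the paper).
# D_f is the generic f-divergence; f(x) = |x - 1| recovers the l1 distance,
# and f(x) = (x - 1)^2 / (x + 1) recovers the triangular discrimination.
import numpy as np

def f_divergence(p, q, f):
    mask = q > 0                      # convention 0 * f(0/0) = 0
    return float(np.sum(q[mask] * f(p[mask] / q[mask])))

def triangular(p, q):                 # direct formula for Delta(p, q)
    s = p + q
    return float(np.sum((p - q)[s > 0] ** 2 / s[s > 0]))

def kl(p, q):                         # Kullback-Leibler divergence, in nats
    return f_divergence(p, q, lambda x: x * np.log(x))

rng = np.random.default_rng(0)
for _ in range(5):
    p, q = rng.dirichlet(np.ones(8)), rng.dirichlet(np.ones(8))
    l1 = f_divergence(p, q, lambda x: np.abs(x - 1))
    tri = f_divergence(p, q, lambda x: (x - 1) ** 2 / (x + 1))
    assert abs(l1 - np.abs(p - q).sum()) < 1e-12
    assert abs(tri - triangular(p, q)) < 1e-12
    # Pinsker: |p - q|_1^2 <= 2 D(p||q);  Cauchy-Schwarz: |p - q|_1^2 <= 2 Delta(p, q).
    assert l1 ** 2 <= 2 * kl(p, q) + 1e-12 and l1 ** 2 <= 2 * tri + 1e-12
```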
Another interesting ('operational' or 'dual') interpretation of $\Delta$ is that '$\Delta$ is to $\ell_2$ what $\ell_1$ is to $\ell_\infty$' in the following sense. It is well known that $|p - q|_1 = \max_{g : \Omega \to [-1, 1]} \big(p \cdot g - q \cdot g\big)$, where $p \cdot g = \sum_{\omega \in \Omega} p(\omega) g(\omega)$. This property of $\ell_1$ is related to the fact that $\ell_1$ is equivalent to total variation distance. For $\Delta$ we have the following.
Proof. Equality is attained by choosing $g$ appropriately. On the other hand, for every $g$, the upper bound follows by Cauchy-Schwarz.

As a final remark, we mention that recently $\Delta$ was implicitly used in information-theoretic proofs in group theory; it was used to construct group homomorphisms [8], it was used to study harmonic functions on groups [2], and it was used in a functional analytic proof of Gromov's theorem on groups of polynomial growth [18]. It is therefore reasonable that $\Delta$ will find more applications in computer science as well.
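The precise statement of the lemma is not reproduced above; one natural way to read the duality, which we assume here only for illustration, is $\Delta(p, q) = \max_g \big((p - q)\cdot g\big)^2 / \big((p + q)\cdot g^2\big)$, with the maximum attained at $g = (p - q)/(p + q)$. The sketch below checks this formulation numerically.

```python
# Sketch (our reading, not code from the paper): check numerically that
# Delta(p, q) = max_g ((p - q).g)^2 / ((p + q).g^2), attained at g = (p - q)/(p + q).
import numpy as np

def triangular(p, q):
    s = p + q
    return float(np.sum((p - q)[s > 0] ** 2 / s[s > 0]))

def ratio(p, q, g):
    # ((p - q).g)^2 / ((p + q).g^2); by Cauchy-Schwarz this is at most Delta(p, q).
    return float(np.dot(p - q, g) ** 2 / np.dot(p + q, g ** 2))

rng = np.random.default_rng(1)
p, q = rng.dirichlet(np.ones(10)), rng.dirichlet(np.ones(10))
g_star = (p - q) / (p + q)
assert abs(ratio(p, q, g_star) - triangular(p, q)) < 1e-12
assert all(ratio(p, q, rng.normal(size=10)) <= triangular(p, q) + 1e-12
           for _ in range(1000))
print(triangular(p, q))
```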

An example
Before proving the lower bound for pointer chasing, we describe a clean example that demonstrates how one can use $\Delta$ instead of $\ell_1$ to get quantitatively better bounds. Let $X$ be a random vector in $\{0, 1\}^n$. Assume that it has high entropy: $D(p_X \| u_n) \le k$, where $u_n$ is the uniform distribution on $\{0, 1\}^n$ (equivalently, $H(X) \ge n - k$). Also assume that $I$ is chosen uniformly in $[n]$ and independently of $X$. Lemma 3.1 implies that $\mathbb{E}_I\, D(p_{X_I} \| u_1) \le k/n$. That is, on average over $I$, the marginal distribution of $X_I$ is close to uniform in Kullback-Leibler divergence, when $k \ll n$. Pinsker's inequality allows us to deduce that the distribution of $X_I$ is close to uniform in $\ell_1$ distance as well.
It is natural to ask what happens when $I$ is not uniform but only close to uniform. Let $J$ be a random element of $[n]$, chosen independently of $X, I$, with very high entropy: $D(p_J \| p_I) \le \varepsilon$. Pinsker's inequality implies that $|p_J - p_I|_1 \le \sqrt{2\varepsilon}$, which in turn allows us to prove that the distribution of $X_J$ is close to uniform as well, up to an additional additive loss of order $\sqrt{\varepsilon}$. This square-root dependence is often too expensive, especially when we apply such an argument several times, as discussed after the statement of Theorem 1.1. Triangular discrimination allows us to remove this square-root dependence.
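The first step of the example can be checked directly for small $n$ by exact enumeration; the sketch below (not from the paper, with helper names of our choosing) verifies that the average coordinate-wise divergence from uniform is at most $D(p_X \| u_n)/n$.

```python
# Illustration of the first step of the example, by exact enumeration for small n
# (not from the paper): E_I D(p_{X_I} || u_1) <= D(p_X || u_n) / n.
import itertools
import numpy as np

def kl(p, q):                                   # in bits, matching u_n below
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

n = 4
rng = np.random.default_rng(2)
p_x = rng.dirichlet(np.ones(2 ** n))            # distribution of X on {0,1}^n
u_n = np.full(2 ** n, 2.0 ** -n)                # uniform distribution u_n
points = np.array(list(itertools.product([0, 1], repeat=n)))

# Marginal distribution p_{X_i} of each coordinate, and its divergence from u_1.
u_1 = np.array([0.5, 0.5])
marginals = [np.array([p_x[points[:, i] == b].sum() for b in (0, 1)])
             for i in range(n)]
avg_coord_kl = float(np.mean([kl(m, u_1) for m in marginals]))

assert avg_coord_kl <= kl(p_x, u_n) / n + 1e-12
print(avg_coord_kl, kl(p_x, u_n) / n)
```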
For the rest of this subsection we prove Theorem 2.3. We start with the following simple claim.
Proof. Assume without loss of generality that $a > 0$. If $\xi^2 - a\xi - ab \le 0$ then the claim follows by solving the quadratic. Turning to Theorem 2.3, Lemma 2.1 and (2.1) allow us to bound the left term. It thus remains to upper-bound the right term; a short calculation gives $2\,\Delta(p_J, p_I)\,\xi + 2\, p_I \cdot g$.

Asymmetric triangular discrimination
To prove the lower bound for pointer chasing, we shall actually use an asymmetric variant of $\Delta$. The asymmetric variant is closely related to $\Delta$, but $\Delta$ is symmetric in $p, q$ while the asymmetric variant is not. The following lemma states important properties of the asymmetric variant; it relates it to $\ell_1$, and shows that it is at most one (whereas $\Delta$ may take the value two).
This difference between the bounds $1$ and $2$ is useful when we iteratively bound the 'distance' between two distributions, as in the proof of Theorem 1.1, since $1^k = 1$ but $2^k$ grows quickly with $k$.
Proof. The left inequality holds by Cauchy-Schwarz. The middle inequality holds by the first equality in the equation above. The right inequality holds since $|p - q|_1 \le |p|_1 + |q|_1 = 2$.
To explain the reason for using the asymmetric variant instead of $\Delta$, let us go back to Theorem 2.3. Although the theorem avoids the square-root loss, the coefficient of $\varepsilon$ on the right-hand side is $4$. When repeatedly applying this theorem, we get an exponential blowup, which is too costly to carry. The following theorem shows that the asymmetric variant allows us to avoid this blowup; the coefficient on the right-hand side can be $1$.
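To illustrate the difference numerically, consider the stylized recursion $e_t \le c\, e_{t-1} + \delta$; this recursion is only a toy model of the iteration, not the actual inequality from the proof. With coefficient $c = 2$ the accumulated error grows like $2^k\delta$, while with $c = 1$ it stays at $k\delta$.

```python
# Toy model of the iteration (not the actual inequality from the proof):
# iterating e_t <= c * e_{t-1} + delta for k rounds.
def iterate(c, delta, k):
    e = 0.0
    for _ in range(k):
        e = c * e + delta
    return e

delta, k = 1e-6, 30
print(iterate(2, delta, k))   # ~ (2**k) * delta, already larger than 1
print(iterate(1, delta, k))   # = k * delta, still tiny
```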

The lower bound
Proof of Theorem 1.1. Let $\ell$ denote the length of the protocol (which we assume to be deterministic). Let $M_1, \ldots, M_t$ denote the messages sent in the first $t$ rounds of the protocol. Recall that $Z_0, Z_1, \ldots$ are defined in (1.1). We shall show that if $\ell$ is small then $Z_k$ is close to being uniform, even conditioned on the transcript of the protocol. This implies that $\ell$ must be large, if the protocol achieves its goal.
We prove, by induction on $t = 0, 1, \ldots, k$, that the following holds. Let $R_t$ denote the random variable recording the first $t$ rounds of the protocol (where $R_0$ is empty and $R_1 = M_1$). We shall prove (4.1): on average over $R_t$, the asymmetric triangular discrimination between $p_{Z_t \mid r_t}$ and $p_{Z_t}$ is at most $t\delta$, where $\delta = \frac{2\ell + k \log n}{n}$.
Roughly speaking, the expectation over $R_t$ of the distance between $p_{Z_t \mid r_t}$ and $p_{Z_t}$ measures how much we have learned about the value of $Z_t$ from observing $r_t$; if this expression is small then we did not learn much about $Z_t$ from the first $t$ rounds of the protocol.
Before proving (4.1), we explain why it completes the proof. Since the fraction of even numbers in $[n]$ is at least $1/2 - 1/n$, the error of the protocol conditioned on $R_k = r_k$ is at least $1/2 - 1/n$ minus a term controlled by the distance of $p_{Z_k \mid r_k}$ from the uniform distribution. Hence, since the protocol has error at most $1/3$, this distance must on average be bounded away from zero, so $k\delta$ cannot be too small. The lower bound on $\ell$ thus follows (we may assume $n \ge 1000$). It thus remains to prove (4.1). When $t = 0$ it indeed holds ($R_0$ is empty). Suppose $t \ge 1$. There are two cases to consider, depending on the parity of $t$. We consider the case when $t$ is odd, in which Bob sends the message $M_t$. When $t$ is even, the argument is similar due to the symmetry between Alice and Bob.
By induction, the claim (4.1) holds for $t - 1$. We want to bound the expectation over $R_t$ of the distance between $p_{Z_t \mid r_t}$ and $p_{Z_t}$ from above. We start by simplifying it.
The following two independence properties are crucial. Let $X$ denote the vector that represents Alice's input ($X_s = f_A(s)$ for each $s$), and let $Y$ denote the vector that represents Bob's input ($Y_s = f_B(n + s)$ for each $s$).
(A) Conditioned on $R_{t-1} = r_{t-1}$, the vector $X$ is independent of $Y$, and therefore also of $M_t$, which is a function of $(Y, m_1, \ldots, m_{t-1})$. (B) Conditioned on $R_{t-1} = r_{t-1}$, we know that $X$ and $Z_{t-1}$ are independent (when $t = 1$ we have $Z_{t-1} = 1$, and when $t > 1$ we have $Z_{t-1} = Y_{z_{t-2}}$). This means that conditioned on $(R_{t-1}, Z_{t-1}) = (r_{t-1}, z_{t-1})$, the distribution $p_{X_{z_{t-1}} \mid r_{t-1}, z_{t-1}}$ is equal to $p_{X_{z_{t-1}} \mid r_{t-1}}$.
These properties hold since (i) the distribution of $(X, Y)$ conditioned on the values of $Z_0, Z_1, \ldots, Z_t$ is a product distribution, (ii) conditioning on the value of $M_1, \ldots, M_t$ means focusing on some rectangle (i.e. a product set) in the input space, and (iii) the conditional distribution of a product distribution on a rectangle is again a product distribution.

We are therefore interested in the resulting expression, in which $s = s(r_{t-1})$ is determined by $r_{t-1}$. Intuitively, by induction we know that $p_{Z_{t-1} \mid r_{t-1}}$ is close to uniform, so we start by checking what happens if we replace $Z_{t-1} \mid r_{t-1}$ with a truly uniform variable. Let $I$ be chosen uniformly at random in $[n]$, and independently of all other choices. Since the coordinates of $X$ are uniform and independent, the uniform case is easy to control.

Start by fixing $r_{t-1}$ and let $q = p_{Z_{t-1} \mid r_{t-1}}$. The difference inside the expectation on the right-hand side above can then be computed explicitly. Finally, by (4.3) and (4.4), the inductive claim is proved.
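Fact (iii) above, that conditioning a product distribution on a rectangle yields a product distribution, can be checked on a toy example; in the sketch below (not from the paper), the sets and sizes are arbitrary choices of ours.

```python
# Toy check of fact (iii): restricting a product distribution to a rectangle
# S x T and renormalizing gives a product distribution again (not from the paper).
import numpy as np

rng = np.random.default_rng(3)
p_x = rng.dirichlet(np.ones(5))                 # Alice's marginal
p_y = rng.dirichlet(np.ones(6))                 # Bob's marginal
joint = np.outer(p_x, p_y)                      # product distribution on X x Y

S, T = [0, 2, 3], [1, 4, 5]                     # an arbitrary rectangle S x T
cond = np.zeros_like(joint)
cond[np.ix_(S, T)] = joint[np.ix_(S, T)]
cond /= cond.sum()                              # condition on the rectangle

# The conditioned distribution equals the product of its own marginals.
assert np.allclose(cond, np.outer(cond.sum(axis=1), cond.sum(axis=0)))
print("product structure preserved")
```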