bounded independence plus noise fools space

There are many classes of functions on n bits that we know are fooled by bounded independence, including small-depth circuits, halfspaces, etc. (See this previous post.)

On the other hand the simple parity function is not fooled. It’s easy to see that you require independence at least n-1. However, if you just perturb the bits with a little noise N, then parity will be fooled. You can find other examples of functions that are not fooled by bounded independence alone, but are if you just perturb the bits a little.

In [3] we proved that any distribution with independence about n^{2/3} fools space-bounded algorithms, if you perturb it with noise. We asked, both in the paper and many people, if the independence could be lowered. Forbes and Kelley have recently proved [2] that the independence can be lowered all the way to O(\log n), which is tight [1]. Shockingly, their proof is nearly identical to [3]!

This exciting result has several interesting consequences. First, we now have almost the same generators for space-bounded computation in a fixed order as we do for any order. Moreover, the proof greatly simplifies a number of works in the literature. And finally, an approach in [4] to prove limitations for the sum of small-bias generators won’t work for space (possibly justifying some optimism in the power of the sum of small-bias generators).

My understanding of all this area is inseparable from the collaboration I have had with Chin Ho Lee, with whom I co-authored all the papers I have on this topic.

The proof

Let f:\{0,1\}^{n}\to \{0,1\} be a function. We want to show that it is fooled by D+E, where D has independence k, E is the noise vector of i.i.d. bits coming up 1 with probability say 1/4, and + is bit-wise XOR.

The approach in [3] is to decompose f as the sum of a function L with Fourier degree k, and a sum of t functions H_{i}=h_{i}\cdot g_{i} where h_{i} has no Fourier coefficient of degree less than k, and h_{i} and g_{i} are bounded. The function L is immediately fooled by D, and it is shown in [3] that each H_{i} is fooled as well.

To explain the decomposition it is best to think of f as the product of \ell :=n/k functions f_{i} on k bits, on disjoint inputs. The decomposition in [3] is as follows: repeatedly decompose each f_{i} in low-degree f_{L} and high-degree f_{H}. To illustrate:

\begin{aligned} f_{1}f_{2}f_{3} & =f_{1}f_{2}(f_{3H}+f_{3L})=f_{1}f_{2}f_{3H}+f_{1}(f_{2H}+f_{2L})f_{3L}=\ldots \\ = & f_{1H}f_{2L}f_{3L}+f_{1}f_{2H}f_{3L}+f_{1}f_{2}f_{3H}+f_{1L}f_{2L}f_{3L}\\ = & H_{1}+H_{2}+H_{3}+L. \end{aligned}

This works, but the problem is that even if each time f_{iL} has degree 1, the function L increases the degree by at least 1 per decomposition; and so we can afford at most k decompositions.

The decomposition in [2] is instead: pick L to be the degree k part of f, and H_{i} are all the Fourier coefficients which are non-zero in the inputs to f_{i} and whose degree in the inputs of f_{1},\ldots ,f_{i} is \ge k. The functions H_{i} can be written as h_{i}\cdot g_{i}, where h_{i} is the high-degree part of f_{1}\cdots f_{i} and h_{i} is f_{i+1}\cdots f_{\ell }.

Once you have this decomposition you can apply the same lemmas in [3] to get improved bounds. To handle space-bounded computation they extend this argument to matrix-valued functions.

What’s next

In [3] we asked for tight “bounded independence plus noise” results for any model, and the question remains. In particular, what about high-degree polynomials modulo 2?

References

[1]   Ravi Boppana, Johan Håstad, Chin Ho Lee, and Emanuele Viola. Bounded independence vs. moduli. In Workshop on Randomization and Computation (RANDOM), 2016.

[2]   Michael A. Forbes and Zander Kelley. Pseudorandom generators for read-once branching programs, in any order. In IEEE Symp. on Foundations of Computer Science (FOCS), 2018.

[3]   Elad Haramaty, Chin Ho Lee, and Emanuele Viola. Bounded independence plus noise fools products. SIAM J. on Computing, 47(2):295–615, 2018.

[4]   Chin Ho Lee and Emanuele Viola. Some limitations of the sum of small-bias distributions. Theory of Computing, 13, 2017.

 

 

Advertisements

Nonclassical polynomials and exact computation of Boolean functions

Guest post by Abhishek Bhrushundi.

I would like to thank Emanuele for giving me the opportunity to write a guest post here. I recently stumbled upon an old post on this blog which discussed two papers: Nonclassical polynomials as a barrier to polynomial lower bounds by Bhowmick and Lovett, and Anti-concentration for random polynomials by Nguyen and Vu. Towards the end of the post, Emanuele writes:

“Having discussed these two papers in a sequence, a natural question is whether non-classical polynomials help for exact computation as considered in the second paper. In fact, this question is asked in the paper by Bhowmick and Lovett, who conjecture that the answer is negative: for exact computation, non-classical polynomials should not do better than classical.”

In a joint work with Prahladh Harsha and Srikanth Srinivasan from last year, On polynomial approximations over \mathbb {Z}/2^k\mathbb {Z}, we study exact computation of Boolean functions by nonclassical polynomials. In particular, one of our results disproves the aforementioned conjecture of Bhowmick and Lovett by giving an example of a Boolean function for which low degree nonclassical polynomials end up doing better than classical polynomials of the same degree in the case of exact computation.

The counterexample we propose is the elementary symmetric polynomial of degree 16 in \mathbb {F}_2[x_1, \ldots , x_n]. (Such elementary symmetric polynomials also serve as counterexamples to the inverse conjecture for the Gowers norm [LMS11GT07], and this was indeed the reason why we picked these functions as candidate counterexamples),

\begin{aligned}S_{16}(x_1, \ldots , x_n) = \left (\sum _{S\subseteq [n],|S| = 16} \prod _{i \in S}x_i\right )\textrm { mod 2} = {|x| \choose 16} \textrm { mod 2},\end{aligned}

where |x| = \sum _{i=1}^n x_i is the Hamming weight of x. One can verify (using, for example, Lucas’s theorem) that S_{16}(x_1, \ldots , x_n) = 1 if and only if the 5^{th} least significant bit of |x| is 1.

We use that no polynomial of degree less than or equal to 15 can compute S_{16}(x) correctly on more than half of the points in \{0,1\}^n.

Theorem 1. Let P be a polynomial of degree at most 15 in \mathbb {F}_2[x_1, \ldots , x_n]. Then

\begin{aligned}\Pr _{x \sim \{0,1\}^n}[P(x) = S_{16}(x)] \le \frac {1}{2} + o(1).\end{aligned}

[Emanuele’s note. Let me take advantage of this for a historical remark. Green and Tao first claimed this fact and sent me and several others a complicated proof. Then I pointed out the paper by Alon and Beigel [AB01]. Soon after they and I independently discovered the short proof reported in [GT07].]

The constant functions (degree 0 polynomials) can compute any Boolean function on half of the points in \{0,1\}^n and this result shows that even polynomials of higher degree don’t do any better as far as S_{16}(x_1, \ldots , x_n) is concerned. What we prove is that there is a nonclassical polynomial of degree 14 that computes S_{16}(x_1, \ldots , x_n) on 9/16 \ge 1/2 + \Omega (1) of the points in \{0,1\}^n.

Theorem 2. There is a nonclassical polynomial P of degree 14 such that

\begin{aligned}\Pr _{x \sim \{0,1\}^n}[P(x) = S_{16}(x)] = \frac {9}{16} - o(1).\end{aligned}

A nonclassical polynomial takes values on the torus \mathbb {T} = \mathbb {R}/\mathbb {Z} and in order to compare the output of a Boolean function (i.e., a classical polynomial) to that of a nonclassical polynomial it is convenient to think of the range of Boolean functions to be \{0,1/2\} \subset \mathbb {T}. So, for example, S_{16}(x_1, \ldots , x_n) = \frac {1}{2} if |x|_4 = 1, and S_{16}(x_1, \ldots , x_n) = 0 otherwise. Here |x|_4 denotes the 5^{th} least significant bit of |x|.

We show that the nonclassical polynomial that computes S_{16}(x) on 9/16 of the points in \{0,1\}^n is

\begin{aligned}P(x_1, \ldots , x_n) = \frac {\sum _{S \subseteq [n], |S|=12} \prod _{i \in S}x_i}{8} \textrm { mod 1}= \frac {{|x| \choose 12}}{8} \textrm { mod 1} .\end{aligned}

The degree of this nonclassical polynomial is 14 but I wouldn’t get into much detail as to why this is case (See [BL15] for a primer on the notion of degree in the nonclassical world).

Understanding how P(x) behaves comes down to figuring out the largest power of two that divides |x| \choose 12 for a given x: if the largest power of two that divides |x| \choose 12 is 2 then P(x) = 1/2, otherwise if the largest power is at least 3 then P(x) = 0. Fortunately, there is a generalization of Lucas’s theorem, known as Kummer’s theorem, that helps characterize this:

Theorem 3.[Kummer’s theorem] The largest power of 2 dividing a \choose b for a,b \in \mathbb {N}, a \ge b, is equal to the number of borrows required when subtracting b from a in base 2.
Equipped with Kummer’s theorem, it doesn’t take much work to arrive at the following conclusion.

Lemma 4. P(x) = S_{16}(x) if either |x|_{2} = 0 or (|x|_2, |x|_3, |x|_4, |x|_5) = (1,0,0,0), where |x|_i denotes the (i+1)^{th} least significant bit of |x|.

If x = (x_1, \ldots , x_n) is uniformly distributed in \{0,1\}^n then it’s not hard to verify that the bits |x|_0, \ldots , |x|_5 are almost uniformly and independently distributed in \{0,1\}, and so the above lemma proves that P(x) computes S_{16}(x) on 9/16 of the points in \{0,1\}^n. It turns out that one can easily generalize the above argument to show that S_{2^\ell }(x) is a counterexample to Bhowmick and Lovett’s conjecture for every \ell \ge 4.

We also show in our paper that it is not the case that nonclassical polynomials always do better than classical polynomials in the case of exact computation — for the majority function, nonclassical polynomials do as badly as their classical counterparts (this was also conjectured by Bhowmick and Lovett in the same work), and the Razborov-Smolensky bound for classical polynomials extends to nonclassical polynomials.

We started out trying to prove that S_4(x_1, \ldots , x_n) is a counterexample but couldn’t. It would be interesting to check if it is one.

References

[AB01]    N. Alon and R. Beigel. Lower bounds for approximations by low degree polynomials over z m. In Proceedings 16th Annual IEEE Conference on Computational Complexity, pages 184–187, 2001.

[BL15]    Abhishek Bhowmick and Shachar Lovett. Nonclassical polynomials as a barrier to polynomial lower bounds. In Proceedings of the 30th Conference on Computational Complexity, pages 72–87, 2015.

[GT07]    B. Green and T. Tao. The distribution of polynomials over finite fields, with applications to the Gowers norms. ArXiv e-prints, November 2007.

[LMS11]   Shachar Lovett, Roy Meshulam, and Alex Samorodnitsky. Inverse conjecture for the gowers norm is false. Theory of Computing, 7(9):131–145, 2011.

Entropy polarization

Sometimes you see quantum popping up everywhere. I just did the opposite and gave a classical talk at a quantum workshop, part of an AMS meeting held at Northeastern University, which poured yet another avalanche of talks onto the Boston area. I spoke about the complexity of distributions, also featured in an earlier post, including a result I posted two weeks ago which gives a boolean function f:\{0,1\}^{n}\to \{0,1\} such that the output distribution of any AC^{0} circuit has statistical distance 1/2-1/n^{\omega (1)} from (Y,f(Y)) for uniform Y\in \{0,1\}^{n}. In particular, no AC^{0} circuit can compute f much better than guessing at random even if the circuit is allowed to sample the input itself. The slides for the talk are here.

The new technique that enables this result I’ve called entropy polarization. Basically, for every AC^{0} circuit mapping any number L of bits into n bits, there exists a small set S of restrictions such that:

(1) the restrictions preserve the output distribution, and

(2) for every restriction r\in S, the output distribution of the circuit restricted to r either has min-entropy 0 or n^{0.9}. Whence polarization: the entropy will become either very small or very large.

Such a result is useless and trivial to prove with |S|=2^{n}; the critical feature is that one can obtain a much smaller S of size 2^{n-n^{\Omega (1)}}.

Entropy polarization can be used in conjunction with a previous technique of mine that works for high min-entropy distributions to obtain the said sampling lower bound.

It would be interesting to see if any of this machinery can yield a separation between quantum and classical sampling for constant-depth circuits, which is probably a reason why I was invited to give this talk.

Hardness amplification proofs require majority… and 15 years

Aryeh Grinberg, Ronen Shaltiel, and myself have just posted a paper which proves conjectures I made 15 years ago (the historians want to consult the last paragraph of [2] and my Ph.D. thesis).

At that time, I was studying hardness amplification, a cool technique to take a function f:\{0,1\}^{k}\to \{0,1\} that is somewhat hard on average, and transform it into another function f':\{0,1\}^{n}\to \{0,1\} that is much harder on average. If you call a function \delta -hard if it cannot be computed on a \delta fraction of the inputs, you can start e.g. with f that is 0.1-hard and obtain f' that is 1/2-1/n^{100} hard, or more. This is very important because functions with the latter hardness imply pseudorandom generators with Nisan’s design technique, and also “additional” lower bounds using the “discriminator lemma.”

The simplest and most famous technique is Yao’s XOR lemma, where

\begin{aligned} f'(x_{1},x_{2},\ldots ,x_{t}):=f(x_{1})\oplus f(x_{2})\oplus \ldots \oplus f(x_{t}) \end{aligned}

and the hardness of f' decays exponentially with t. (So to achieve the parameters above it suffices to take t=O(\log k).)

At the same time I was also interested in circuit lower bounds, so it was natural to try to use this technique for classes for which we do have lower bounds. So I tried, and… oops, it does not work! In all known techniques, the reduction circuit cannot be implemented in a class smaller than TC^{0} – a class for which we don’t have lower bounds and for which we think it will be hard to get them, also because of the Natural proofs barrier.

Eventually, I conjectured that this is inherent, namely that you can take any hardness amplification reduction, or proof, and use it to compute majority. To be clear, this conjecture applied to black-box proofs: decoding arguments which take anything that computes f' too well and turn it into something which computes f too well. There were several partial results, but they all had to restrict the proof further, and did not capture all available techniques.

Should you have had any hope that black-box proofs might do the job, in this paper we prove the full conjecture (improving on a number of incomparable works in the literature, including a 10-year-anniversary work by Shaltiel and myself which proved the conjecture for non-adaptive proofs).

Indistinguishability

One thing that comes up in the proof is the following basic problem. You have a distribution X on n bits that has large entropy, very close to n. A classic result shows that most bits of X are close to uniform. We needed an adaptive version of this, showing that a decision tree making few queries cannot distinguish X from uniform, as long as the tree does not query a certain small forbidden set of variables. This also follows from recent and independent work of Or Meir and Avi Wigderson.

Turns out this natural extension is not enough for us. In a nutshell, it is difficult to understand what queries an arbitrary reduction is making, and so it is hard to guarantee that the reduction does not query the forbidden set. So we prove a variant, where the variables are not forbidden, but are fixed. Basically, you condition on some fixing X_{B}=v of few variables, and then the resulting distribution X|X_{B}=v is indistinguishable from the distribution U|U_{B}=v where U is uniform. Now the queries are not forbidden but have a fixed answer, and this makes things much easier. (Incidentally, you can’t get this simply by fixing the forbidden set.)

Fine, so what?

One great question remains. Can you think of a counter-example to the XOR lemma for a class such as constant-depth circuits with parity gates?

But there is something more why I am interested in this. Proving 1/2-1/n average-case hardness results for restricted classes “just” beyond AC^{0} is more than a long-standing open question in lower bounds: It is necessary even for worst-case lower bounds, both in circuit and communication complexity, as we discussed earlier. And here’s hardness amplification, which intuitively should provide such hardness results. It was given many different proofs, see e.g. [1]. However, none can be applied as we just saw. I don’t know, someone taking results at face value may even start thinking that such average-case hardness results are actually false.

References

[1]   Oded Goldreich, Noam Nisan, and Avi Wigderson. On Yao’s XOR lemma. Technical Report TR95–050, Electronic Colloquium on Computational Complexity, March 1995. http://www.eccc.uni-trier.de/.

[2]   Emanuele Viola. The complexity of constructing pseudorandom generators from hard functions. Computational Complexity, 13(3-4):147–188, 2004.

Matrix rigidity, and all that

The rigidity challenge asks to exhibit an n × n matrix M that cannot be written as M = A + B where A is “sparse” and B is “low-rank.” This challenge was raised by Valiant who showed in [Val77] that if it is met for any A with at most n1+ϵ non-zero entries and any B with rank O(n∕ log log n) then computing the linear transformation M requires either logarithmic depth or superlinear size for linear circuits. This connection relies on the following lemma.


Lemma 1. Let C : {0, 1}n →{0, 1}n be a circuit made of XOR gates. If you can remove e edges and reduce the depth to d then the linear transformation computed by C equals A + B where A has ≤ 2d non-zero entries per row (and so a total of ≤ n2d non-zero entries), and B has rank ≤ e.


Proof: After you remove the edges, each output bit is a linear combination of the removed edges and at most 2d input variables. The former can be done by B, the latter by A. QED

Valiant shows that in a log-depth, linear-size circuit one can remove O(n∕ log log n) edges to reduce the depth to nϵ – a proof can be found in [Vio09] – and this gives the above connection to lower bounds.

However, the best available tradeoff for explicit matrices give sparsity n2∕r log(n∕r) and rank r, for any parameter r; and this is not sufficient for application to lower bounds.

Error-correcting codes

It was asked whether generator matrixes of good linear codes are rigid. (A code is good if it has constant rate and constant relative distance. The dimensions of the corresponding matrixes are off by only a constant factor, and so we can treat them as identical.) Spielman [Spi95] shows that there exist good codes that can be encoded by linear-size logarithmic depth circuits. This immediately rules out the possibility of proving a lower bound, and it gives a non-trivial rigidity upper bound via the above connections.

Still, one can ask if these matrices at least are more rigid than the available tradeoffs. Goldreich reports a negative answer by Dvir, showing that there exist good codes whose generating matrix C equals A + B where A has at most O(n2∕d) non-zero entries and B has rank O(d log n∕d), for any d.

A similar negative answer follows by the paper [GHK+13]. There we show that there exist good linear codes whose generating matrix can be written as the product of few sparse matrixes. The corresponding circuits are very structured, and so perhaps it is not surprising that they give good rigidity upper bounds. More precisely, the paper shows that we can encode an n-bit message by a circuit made of XOR gates and with say n log *n wires and depth O(1) – with unbounded fan-in. Each gate in the circuit computes the XOR of some t gates, which can be written as a binary tree of depth log 2t + O(1). Such trees have poor rigidity:


Lemma 2.[Trees are not rigid] Let C be a binary tree of depth d. You can remove an O(1∕2b) fraction of edges to reduce the depth to b, for any b.


Proof: It suffices to remove all edges at depths d – b, d – 2b, …. The number of such edges is O(2d-b + 2d-2b + …) = O(2d-b). Note this includes the case d ≤ b, where we can remove 0 edges. QED

Applying Lemma 2 to a gate in our circuit, we reduce the depth of the binary tree computed at that gate to b. Applying this to every gate we obtain a circuit of depth O(b). In total we have removed an O(1∕2b) fraction of the n log *n edges.

Writing 2b = n∕d, by Lemma 1 we can write the generating matrixes of our code as C = A + B where A has at most O(n∕d) non-zero entries per row, and B has rank O(d log *n). These parameters are the same as in Dvir’s result, up to lower-order terms. The lower-order terms appear incomparable.

Walsh-Fourier transform

Another matrix that was considered is the n×n Inner Product matrix H, aka the Walsh-Hadamard matrix, where the x,y entry is the inner product of x and y modulo 2. Alman and Williams [AW16] recently give an interesting rigidity upper bound which prevents this machinery to establish a circuit lower bound. Specifically they show that H can be written as H = A + B where A has at most n1+ϵ non-zero entries, and B has rank n1-ϵ′, for any ϵ and an ϵ′ which goes to 0 when ϵ does.

Their upper bound works as follows. Let h = log 2n. Start with the univariate, real polynomial p(z1,z2,…,zh) which computes parity exactly on inputs of Hamming weight between 2ϵn and (1∕2 + ϵ)n. By interpolation such a polynomial exists with degree (1∕2 – ϵ)n. Replacing zi with xiyi you obtain a polynomial of degree n – ϵn which computes IP correctly on inputs x,y whose inner product is between 2ϵn and (1∕2 + ϵ)n.

This polynomial has 2(1-ϵ′)n monomials, where ϵ′ = Ω(ϵ2). The truth-table of a polynomial with m monomials is a matrix with rank m, and this gives a low-rank matrix B′.

The fact that sparse polynomials yield low-rank matrixes also appeared in the paper [SV12], which suggested to study the rigidity challenge for matrixes arising from polynomials.

Returning to the proof in [AW16], it remains to deal with inputs whose inner product does not lie in that range. The number of x whose weight is not between (1∕2 – ϵ)n and (1∕2 + ϵ)n is 2(1-ϵ′)n. For each such input x we modify a row of the matrix B′. Repeating the process for the y we obtain the matrix B, and the rank bound 2(1-ϵ′)n hasn’t changed.

Now a calculation shows that B differs from H in few entries. That is, there are few x and y with Hamming weight between (1∕2 – ϵ)n and (1∕2 + ϵ)n, but with inner product less than 2ϵn.

Boolean complexity

There exists a corresponding framework for boolean circuits (as opposed to circuits with XOR gates only). Rigid matrixes informally correspond to depth-3 Or-And-Or circuits. If this circuit has fan-in fo at the output gate and fan-in fi at each input gate, then the correspondence in parameters is

rank = log fo
sparsity = 2fi .

More precisely, we have the following lemma.


Lemma 3. Let C : {0, 1}n →{0, 1}n be a boolean circuit. If you can remove e edges and reduce the depth to d then you can write C as an Or-And-Or circuit with output fan-in 2e and input fan-in 2d.


Proof: After you remove the edges, each output bit and each removed edge depends on at most 2d input bits or removed edges. The output Or gate of the depth-3 circuit is a big Or over all 2e assignments of values for the removed edges. Then we need to check consistency. Each consistency check just depends on 2d inputs and so can be written as a depth-2 circuit with fan-in 2d. QED

The available bounds are of the form log fo = n∕fi. For example, for input fan-in fi = nα we have lower bounds exponential in n1-α but not more. Again it can be shown that breaking this tradeoff in certain regimes (namely, log 2fo = O(n∕ log log n)) yields lower bounds against linear-size log-depth circuits. (A proof appears in [Vio09].) It was also pointed out in [Vio13] that breaking this tradeoff in any regime yields lower bounds for branching programs. See also the previous post.

One may ask how pairwise independent hash functions relate to this challenge. Ishai, Kushilevitz, Ostrovsky, and Sahai showed [IKOS08] that they can be computed by linear-size log-depth circuits. Again this gives a non-trivial upper bound for depth-3 circuits via these connections, and one can ask for more. In [GHK+13] we give constructions of such circuits which in combination with Lemma 3 can again be used to almost match the available trade-offs.

The bottom line of this post is that we can’t prove lower bounds because they are false, and it is a puzzle to me why some people appear confident that P is different from NP.

References

[AW16]    Josh Alman and Ryan Williams. Probabilistic rank and matrix rigidity, 2016. https://arxiv.org/abs/1611.05558.

[GHK+13]   Anna Gál, Kristoffer Arnsfelt Hansen, Michal Koucký, Pavel Pudlák, and Emanuele Viola. Tight bounds on computing error-correcting codes by bounded-depth circuits with arbitrary gates. IEEE Transactions on Information Theory, 59(10):6611–6627, 2013.

[IKOS08]    Yuval Ishai, Eyal Kushilevitz, Rafail Ostrovsky, and Amit Sahai. Cryptography with constant computational overhead. In 40th ACM Symp. on the Theory of Computing (STOC), pages 433–442, 2008.

[Spi95]    Daniel Spielman. Computationally Efficient Error-Correcting Codes and Holographic Proofs. PhD thesis, Massachusetts Institute of Technology, 1995.

[SV12]    Rocco A. Servedio and Emanuele Viola. On a special case of rigidity. Available at http://www.ccs.neu.edu/home/viola/, 2012.

[Val77]    Leslie G. Valiant. Graph-theoretic arguments in low-level complexity. In 6th Symposium on Mathematical Foundations of Computer Science, volume 53 of Lecture Notes in Computer Science, pages 162–176. Springer, 1977.

[Vio09]    Emanuele Viola. On the power of small-depth computation. Foundations and Trends in Theoretical Computer Science, 5(1):1–72, 2009.

[Vio13]    Emanuele Viola. Challenges in computational lower bounds. Available at http://www.ccs.neu.edu/home/viola/, 2013.

Mixing in groups, II

In the previous post we have reduced the “three-step mixing” over SL(2,q), the group of 2×2 matrices over the field with q elements with determinant 1, to the following statement about mixing of conjugacy classes.


Theorem 1.[Mixing of conjugacy classes of SL(2,q)] Let G = SL(2,q). With probability ≥ 1 -|G|-Ω(1) over uniform a,b in G, the distribution C(a)C(b) is |G|-Ω(1) close in statistical distance to uniform.


Here and throughout this post, C(g) denotes a uniform element from the conjugacy class of g, and every occurrence of C corresponds to an independent draw.

In this post we sketch a proof of Theorem 1, following [GV15]. Similar theorems were proved already. For example Shalev [Sha08] proves a version of Theorem 1 without a quantitative bound on the statistical distance. It is possible to plug some more representation-theoretic information into Shalev’s proof and get the same quantitative bound as in Theorem 1, though I don’t have a good reference for this extra information. However the proof in [Sha08] builds on a number of other things, which also means that if I have to prove a similar but not identical statement, as we had to do in [GV15], it would be hard for me.

Instead, here is how you can proceed to prove the theorem. First, we remark that the distribution of C(a)C(b) is the same as that of

C(C(a)C(b)),

because for uniform x, y, and z in Fq we have the following equalities of distributions:

C(C(a)C(b)) = x-1(y-1ayz-1bz)x = x-1(y-1ayxx-1z-1bz)x = C(a)C(b)

where the last equality follows by replacing y with yx and z with zx.

That means that we get one conjugation “for free” and we just have to show that C(a)C(b) falls into various conjugacy classes with the right probability.

Now the great thing about SL(2,q) is that you can essentially think of it as made up of q conjugacy classes each of size q2 (the whole group has size q3 – q). This is of course not exactly correct, in particular the identity element obviously gives a conjugacy class of size 1. But except for a constant number of conjugacy classes, every conjugacy class has size q2 up to lower-order terms. This means that what we have to show is simply that the conjugacy class of C(a)C(b) is essentially uniform over conjugacy classes.

Next, the trace map Tr  : SL(2,q) → Fq is essentially a bijection between conjugacy classes and the field Fq. To see this recall that the trace map satisfies the cyclic property:

Tr xyz = Tr yzx.

This implies that

Tr u-1au = Tr auu-1 = Tr a,

and so conjugate elements have the same trace. On the other hand, the q matrixes

x  1

1  0

for x in Fq all have different traces, and by what we said above their conjugacy classes make up essentially all the group.

Putting altogether, what we are trying to show is that

Tr C(a)C(b)

is |G|-Ω(1) close to uniform over F q in statistical distance.

Furthermore, again by the cyclic property we can consider without loss of generality

Tr aC(b)

instead, and moreover we can let a have the form

0  1

1  w

and b have the form

v  1

1  0

(there is no important reason why w is at the bottom rather than at the top).

Writing a generic g in SL(2,q) as the matrix

u1   u2

u3   u4

you can now with some patience work out the expression

Tr au-1bu = vu 3u4 – u32 + u 42 – vu 1u2 + u12 – vwu 2u3 + wu1u3 – u22 – wu 2u4.

What we want to show is that for typical choices of w and v, the value of this polynomial is q-Ω(1) close to uniform over F q for a uniform choice of u subject to the determinant of u being 1, i.e, u1u4 – u2u3 = 1.

Maybe there is some machinery that immediately does that. Lacking the machinery, you can use the equation u1u4 – u2u3 = 1 to remove u4 by dividing by u1 (the cases where u1 = 0 are few and do not affect the final answer). Now you end up with a polynomial p in three variables, which we can rename x, y, and z. You want to show that p(x,y,z) is close to uniform, for uniform choices for x,y,z. The benefit of this substitution is that we removed the annoying condition that the determinant is one.

To argue about p(x,y,z), the DeMillo–Lipton–Schwartz-Zippel lemma comes to mind, but is not sufficient for our purposes. It is consistent with that lemma that the polynomial doesn’t take a constant fraction of the values of the field, which would give a constant statistical distance. One has to use more powerful results known as the Lang-Weil theorem. This theorem provides under suitable conditions on p a sharp bound on the probability that p(x,y,z) = a for a fixed a in Fq. The probability is 1∕q plus lower-order terms, and then by summing over all a in Fq one obtains the desired bound on the statistical distance.

I am curious if there is a way to get the statistical distance bound without first proving a point-wise bound.

To apply the Lang-Weil theorem you have to show that the polynomial is “absolutely irreducible,” i.e., irreducible over any algebraic extension of the field. This can be proven from first principles by a somewhat lengthy case analysis.

References

[GV15]   W. T. Gowers and Emanuele Viola. The communication complexity of interleaved group products. In ACM Symp. on the Theory of Computing (STOC), 2015.

[Sha08]   Aner Shalev. Mixing and generation in simple groups. J. Algebra, 319(7):3075–3086, 2008.

Mixing in groups

Non-abelian groups behave in ways that are useful in computer science. Barrington’s famous result [Bar89] shows that we can write efficiently an arbitrary low-depth computation as a group product over any non-solvable group. (Being non-solvable is a certain strengthening of being non-abelian which is not important now.) His result, which is false for abelian groups, has found myriad applications in computer science. And it is amusing to note that actually results about representing computation as group products were obtained twenty years before Barrington, see [KMR66]; but the time was not yet ripe.

This post is about a different property that certain non-abelian groups have and that is also useful. Basically, these groups ”mix” in the sense that if you have several distributions over the group, and the distributions have high entropy, then the product distribution (i.e., sample from each distribution and output the product) is very close to uniform.

First, let us quickly remark that this is completely false for abelian groups. To make an example that is familiar to computer scientists, consider the group of n-bit strings with bit-wise xor. Now let A be the uniform distribution over this group where the first bit is always 0. Then no matter how many independent copies of A you multiply together, the product is always A.

Remarkably, over other groups it is possible to show that the product distribution will become closer and closer to uniform. A group that works very well in this respect is SL(2,q), the group of 2×2 matrices over the field with q elements with determinant 1. This is a group that in some sense is very far from abelian. In particular, one can prove the following result.


Theorem 1.[Three-step mixing [Gow08BNP08]] Let G = SL(2,q), and let A, B, and C be three subsets of G of constant density. Let a, b, and c be picked independently and uniformly from A, B, and C respectively. Then for any g in G we have

| Pr[abc = g] – 1∕|G|| < 1∕|G|1+Ω(1).


Note that the conclusion of the theorem in particular implies that abc is supported over the entire group. This is remarkable, since the starting distributions are supported over only a small fraction of the group. Moreover, by summing over all elements g in the group we obtain that abc is polynomially close to uniform in statistical distance.

Theorem 1 can be proved using representation theory. This must be a great tool, but for some reason I always found it a little difficult to digest the barrage of definitions that usually anticipate the interesting stuff.

Luckily, there is another way to prove Theorem 1. I wouldn’t be surprised if this is in some sense the same way, and moreover this other way is not sometimes I would call elementary. But it is true that I will be able to sketch a proof of the theorem without using the word ”representation”. In this post we will prove some preliminary results that are valid for all groups, and the most complicated thing used is the Cauchy-Schwarz inequality. In the next post we will work specifically with the group SL(2,q), and use more machinery. This is all taken from this paper with Gowers [GV15] (whose main focus is the study of mixing in the presence of dependencies).

First, for convenience let us identify a set A with its characteristic function. So we write A(a) for a belongs to the set A. It is convenient to work with a slightly different statement:


Theorem 2. Let G = SL(2,q) and let A,B,C be three subsets of G of densities α,β,γ respectively. For any g in G,

|Eabc=gA(a)B(b)C(c) – αβγ|≤|G|-Ω(1)

where the expectation is over uniform elements a, b, and c from the group G such that their product is equal to g.


This Theorem 2 is equivalent to Theorem 1, because

Eabc=gA(a)B(b)C(c) = Pr[A(a),B(b),C(c)|abc = g]
= Pr[abc = g|A(a),B(b),C(c)]|G|αβγ

by Bayes’ rule. So we can get Theorem 1 by dividing by |G|αβγ.

Now we observe that to prove this ”mixing in three steps” it actually suffices to prove mixing in four steps.


Theorem 3.[Mixing in four steps] Let G = SL(2,q) and let A,B,C,D be four subsets of G of densities α,β,γ,δ respectively. For any g in G,

Eabcd=gA(a)B(b)C(c)D(d) – αβγδ ≤|G|-Ω(1),

where the expectation is over uniform elements a, b, c, and d from the group G such that their product is equal to g.



Lemma 4. Mixing in four steps implies mixing in three.


Proof: Rewrite

|Eabc=gA(a)B(b)C(c) – αβγ| = |Eabc=gf(a)B(b)C(c)|

where f(a) := A(a) – α.

In these proofs we will apply Cauchy-Schwarz several times. Each application ”loses a square,” but since we are aiming for an upper bound of the form 1∕|G|Ω(1) we can afford any constant number of applications. Our first one is now:

(Eabc=gf(a)B(b)C(c))2 ≤ (E cC(c)2)(E c(Eab=gc-1f(a)B(b))2)
= γEcEab=a′b′=gc-1f(a)B(b)f(a′)B(b′)
= γEab=a′b′(A(a) – α)B(b)B(b′)(A(a′) – α).

There are four terms that make up the expectation. The terms that involve at least one α sum to -α2β2. The remaining term is the expectation of A(a)B(b)B(b′)A(a′). Note that ab = a′b′ is equivalent to ab(1∕b′)(1∕a′) = 1G. Hence by Theorem 3 this expectation is at most |G|-Ω(1). QED

So what remains to see is how to prove mixing in four steps. We shall reduce the mixing problem to the following statement about the mixing of conjugacy classes of our group.


Definition 5. We denote by C(g) the conjugacy class {h-1gh : h in G} of an element g in G. We also denote by C(g) the uniform distribution over C(g) for a uniformly selected g in G.



Theorem 6.[Mixing of conjugacy classes of SL(2,q)] Let G = SL(2,q). With probability ≥ 1 -|G|-Ω(1) over uniform a,b in G, the distribution C(a)C(b) is |G|-Ω(1) close in statistical distance to uniform.


Theorem 6 is proved in the next blog post. Here we just show that is suffices for our needs.


Lemma 7. Theorem 6 implies Theorem 3.


Proof: We rewrite the quantity to bound as

Ea,c(A(a)C(c)Eb,d:abcd=gf(b,d))

for f(b,d) = B(b)D(d) – βδ.

Now by Cauchy-Schwarz we bound this above by

Ea,c,b,d,b′,d′f(b,d)f(b′,d′)

where the expectation is over variables such that abcd = g and ab′cd′ = g. As in the proof that mixing in four steps implies mixing in three, we can rewrite the last two equations as the single equation bcd = b′cd′.

The fact that the same variable c occurs on both sides of the equation is what gives rise to conjugacy classes. Indeed, this equation can be rewritten as

c-1(1∕b)b′c = d(1∕d′).

Performing the following substitutions: b = x,b′ = xh,d′ = y we can rewrite our equation as

d = c-1hcy.

Hence we have reduced our task to that of bounding

Ef(x,C(h)y)f(xh,y)

for uniform x,y,h.

We can further replace y with C(h)-1y, and rewrite the expression as

Ex,y(f(x,y)Ehf(xh,C(h-1)y)).

This is at most

(Ex,yf2(x,y))E x,y,h,h′f(xh,C(h-1)y)f(xh′,C(h′-1)y).

Recalling that f(b,d) = B(b)D(d) – βδ, and that E[f] = βδ, the first factor is at most 1. The second can be rewritten as

Ex,y,h,h′f(x,y)f(xh-1h′,C(h′-1)C(h)y)

replacing x with xh-1 and y with C(h-1)-1y = C(h)y.

Again using the definition of f this equals

Ex,y,h,h′B(x)D(y)B(xh-1h′)D(C(h′-1)C(h)y) – β2δ2.

Now Lemma 6 guarantees that the distribution (x,y,xh-1h′,C(h′-1)C(h)y) is 1∕|G|Ω(1)-close in statistical distance to the uniform distribution over G4, and this concludes the proof. QED

References

[Bar89]    David A. Mix Barrington. Bounded-width polynomial-size branching programs recognize exactly those languages in NC1. J. of Computer and System Sciences, 38(1):150–164, 1989.

[BNP08]    László Babai, Nikolay Nikolov, and László Pyber. Product growth and mixing in finite groups. In ACM-SIAM Symp. on Discrete Algorithms (SODA), pages 248–257, 2008.

[Gow08]    W. T. Gowers. Quasirandom groups. Combinatorics, Probability & Computing, 17(3):363–387, 2008.

[GV15]    W. T. Gowers and Emanuele Viola. The communication complexity of interleaved group products. In ACM Symp. on the Theory of Computing (STOC), 2015.

[KMR66]   Kenneth Krohn, W. D. Maurer, and John Rhodes. Realizing complex Boolean functions with simple groups. Information and Control, 9:190–195, 1966.

Bounded indistinguishability

Countless papers study the properties of k-wise independent distributions, which are distributions where any k bits are uniform and independent. One property of interest is which computational models are fooled by such distributions, in the sense that they cannot distinguish any such distribution from a uniformly random one. Recently, Bazzi’s breakthrough, discussed earlier on this blog, shows that k = polylog(n) independence fools any polynomial-size DNF on n bits.

Let us change the question. Let us say that instead of one distribution we have two, and we know that any k bits are distributed identically, but not necessarily uniformly. We call such distributions k-wise indistinguishable. (Bounded independence is the special case when one distribution is uniform.) Can a DNF distinguish the two distributions? In fact, what about a single Or gate?

This is the question that we address in a paper with Bogdanov, Ishai, and Williamson. A big thank you goes to my student Chin Ho Lee for connecting researchers who were working on the same problems on different continents. Here at NEU the question was asked to me by my neighbor Daniel Wichs.

The question turns out to be equivalent to threshold/approximate degree, an influential complexity measure that goes back to the works by Minsky and Papert and by Nisan and Szegedy. The equivalence is a good example of the usefulness of duality theory, and is as follows. For any boolean function f on n bits the following two are equivalent:

1. There exist two k-wise indistinguishable distributions that f tells apart with advantage e;

2. No degree-k real polynomial can approximate f to pointwise error at most e/2.

I have always liked this equivalence, but at times I felt slightly worried that could be considered too “simple.” But hey, I hope my co-authors don’t mind if I disclose that it’s been four different conferences, and not one reviewer filed a complaint about that.

From the body of works on approximate degree one readily sees that bounded indistinguishability behaves very differently from bounded independence. For example, one needs k = Ω(√ n) to fool an Or gate, and that is tight. Yes, to spell this out, there exist two distributions which are 0.001 √ n indistinguishable but Or tells them apart with probability 0.999. But obviously even constant independence fools Or.

The biggest gap is achieved by the Majority function: constant independence suffices, by this, while linear indistinguishability is required by Paturi’s lower bound.

In the paper we apply this equivalence in various settings, and here I am just going to mention the design of secret-sharing schemes. Previous schemes like Shamir’s required the computation of things like parity, while the new schemes use different types of functions, for example of constant depth. Here we also rely on the amazing ability of constant-depth circuits to sample distributions, also pointed out earlier on this blog, and apply some expander tricks to trade alphabet size for other parameters.

The birthday paradox

The birthday paradox is the fact that if you sample t independent variables each uniform in {1, 2,,n} then the probability that two will be equal is at least a constant independent from n when t √n. the The word ”paradox” refers to the fact that t can be as small as √n, as opposed to being closer to n. (Here I am not interested in the precise value of this constant as a function of t.)

The Wikipedia page lists several proofs of the birthday paradox where it is not hard to see why the √n bound arises. Nevertheless, I find the following two-stage approach more intuitive.

Divide the random variables in two sets of 0.5√n each. If there are two in the first set that are equal, then we are done. So we can condition on this event not happening, which means that the variables in the first set are all distinct. Now take any variable in the second set. The probability that it is equal to any variable in the first set is 0.5√n∕n = 0.5√n. Hence, the probability that all the variables in the second set are different from those in the first is

(1 0.5√n)0.5√n e0.25 < 1.

How difficult is it to prove new lower bounds?

Two recent papers address the challenge of proving new correlation bounds for low-degree polynomials, which as depicted in a previous post are also necessary for a number of other long-standing problems, such as lower bounds for number-on-forehead communication protocols and depth-3 Majority circuits.

Nonclassical polynomials as a barrier to polynomial lower bounds, by Bhowmick and Lovett brings non-classical polynomials into the picture and shows that those polynomials of degree only log(n) are capable of various things that classical polynomials we know or conjecture are not. Consider for example the correlation between the mod 3 function on n boolean variables, and polynomials of degree d modulo 2. It has been natural to conjecture that this correlation is say super-polynomially small (less than 1/nc for every c) for degrees d up to d = n0.1, but despite substantial effort we cannot even show correlation at most 1/n for degree log(n). However, we can show exponentially small correlation bounds for degrees at most 0.1 log(n), and correlation bounds of 1/n0.1 for degrees up to n0.1, see this survey.

Bhowmick and Lovett construct a non-classical polynomial of degree O(log n) that achieves correlation 99% with the mod 3 function. What is the trick? First, suppose that my polynomial was defined modulo 1, i.e., over the torus, and that I was allowed to divide by 3. Then I could consider the polynomial p(x1, x2, …, xn) = (x1 + x2 + … + xn)/3, which has degree 1, and obtain maximum correlation 1. You can’t quite do that, but close. Non-classical polynomials are indeed defined over the torus, and allow division by powers of 2 of their integer coefficients. Basically, you can arrange things so that with polynomials of degree d you can divide by 3(1+1/2d). Setting d = O(log n) gets you close enough to a division by 3 that you obtain correlation 99%.

They also exhibit degree-O(log n) non-classical polynomials that correlate well with the majority function, and that weak-represent Or. In all cases, the polynomial is (x1 + x2 + … + xn) divided by a suitable number — the square root of n in the case of majority.

It is not clear to me how serious an obstacle this is, since as mentioned above and in their paper we can still prove vanishing correlation bounds for classical polynomials of degree O(log n), so we do have techniques that separate classical from non-classical polynomials in certain regimes. But it is refreshing to have this new viewpoint.

You can see their title and abstract following the link above. Mine would have been something like this: The power of non-classical polynomials. We show that non-classical polynomials of logarithmic degree are capable of several feats that we know or conjecture classical polynomials are not.

Anti-concentration for random polynomials by Nguyen and Vu proves that real polynomials (as opposed to polynomials modulo 2) have correlation zero (not small, but exactly zero) with the parity function, up to degree log(n)/loglog(n), improving on a degree bound of loglog(n) in this paper with Razborov. Here the polynomial is supposed to compute the parity function exactly: any non-boolean output counts as a mistake, thus making proving correlation bounds supposedly easier. Again, we can’t prove bounds of 1/n on the correlation for degree log(n), a problem which as also depicted in the previous post appears even more fundamental than correlation bounds for polynomials modulo 2 (and formally is a necessary first step).

Nguyen and Vu obtain their exponential improvement by a corresponding improvement in the anti-concentration bound in our paper, which was in turn a slight improvement over a special case of a previous result by Costello, Tao, and Vu. (Our improvement was only for the probability of hitting one element, as opposed to landing in an interval, but that was sufficient for the application.) Nguyen and Vu simultaneously improve on both results.

Whereas our proof is more or less from first principles, Nguyen and Vu use a lot of machinery that was developed in theoretical computer science, including invariance principles and regularity lemmas, and it is very cool to see how everything fits together.

Having discussed these two papers in a sequence, a natural question is whether non-classical polynomials help for exact computation as considered in the second paper. In fact, this question is asked in the paper by Bhowmick and Lovett, who conjecture that the answer is negative: for exact computation, non-classical polynomials should not do better than classical.