Matrix rigidity, and all that

The rigidity challenge asks to exhibit an n × n matrix M that cannot be written as M = A + B where A is “sparse” and B is “low-rank.” This challenge was raised by Valiant, who showed in [Val77] that if it is met for any A with at most n^{1+ϵ} non-zero entries and any B with rank O(n/log log n), then computing the linear transformation M requires either super-logarithmic depth or superlinear size for linear circuits. This connection relies on the following lemma.

Lemma 1. Let C : {0,1}^n → {0,1}^n be a circuit made of XOR gates. If you can remove e edges and reduce the depth to d then the linear transformation computed by C equals A + B where A has ≤ 2^d non-zero entries per row (and so a total of ≤ n2^d non-zero entries), and B has rank ≤ e.

Proof: After you remove the edges, each output bit is a linear combination of the removed edges and at most 2^d input variables. The former can be done by B, the latter by A. QED

Valiant shows that in a log-depth, linear-size circuit one can remove O(n/log log n) edges to reduce the depth to ϵ log n, so that each output depends on at most n^ϵ inputs – a proof can be found in [Vio09] – and this gives the above connection to lower bounds.

However, the best available tradeoff for explicit matrices gives sparsity (n^2/r) log(n/r) and rank r, for any parameter r; and this is not sufficient for application to lower bounds.

Error-correcting codes

It was asked whether generator matrixes of good linear codes are rigid. (A code is good if it has constant rate and constant relative distance. The dimensions of the corresponding matrixes are off by only a constant factor, and so we can treat them as identical.) Spielman [Spi95] shows that there exist good codes that can be encoded by linear-size logarithmic depth circuits. This immediately rules out the possibility of proving a lower bound, and it gives a non-trivial rigidity upper bound via the above connections.

Still, one can ask if these matrices at least are more rigid than the available tradeoffs. Goldreich reports a negative answer by Dvir, showing that there exist good codes whose generating matrix C equals A + B where A has at most O(n^2/d) non-zero entries and B has rank O(d log(n/d)), for any d.

A similar negative answer follows by the paper [GHK+13]. There we show that there exist good linear codes whose generating matrix can be written as the product of few sparse matrixes. The corresponding circuits are very structured, and so perhaps it is not surprising that they give good rigidity upper bounds. More precisely, the paper shows that we can encode an n-bit message by a circuit made of XOR gates and with say n log^* n wires and depth O(1) – with unbounded fan-in. Each gate in the circuit computes the XOR of some t gates, which can be written as a binary tree of depth log_2 t + O(1). Such trees have poor rigidity:

Lemma 2.[Trees are not rigid] Let C be a binary tree of depth d. You can remove an O(1/2^b) fraction of edges to reduce the depth to b, for any b.

Proof: It suffices to remove all edges at depths d – b, d – 2b, …. The number of such edges is O(2^{d-b} + 2^{d-2b} + …) = O(2^{d-b}). Note this includes the case d ≤ b, where we can remove 0 edges. QED
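The counting in this proof can be checked on small cases. The following is a minimal sketch, assuming a complete binary tree in which the edge layer at depth j (counted from the root) holds 2^j edges; the function names are made up for the illustration.

```python
def cut_depths(d, b):
    """Depths (counted from the root) of the edge layers removed in Lemma 2."""
    return list(range(d - b, 0, -b))


def removed_fraction(d, b):
    """Fraction of the 2^(d+1) - 2 edges of a complete binary tree that are cut."""
    removed = sum(2 ** j for j in cut_depths(d, b))  # layer j holds 2^j edges
    return removed / (2 ** (d + 1) - 2)


def remaining_depth(d, b):
    """Longest run of consecutive surviving edge layers on a root-to-leaf path."""
    cuts = set(cut_depths(d, b))
    longest = run = 0
    for j in range(1, d + 1):
        run = 0 if j in cuts else run + 1
        longest = max(longest, run)
    return longest


for d, b in [(10, 3), (12, 4), (5, 8)]:
    print(d, b, cut_depths(d, b), removed_fraction(d, b), remaining_depth(d, b))
```

In each case the surviving pieces have depth at most b and the removed fraction is at most 2/2^b, which is the O(1/2^b) of the lemma with an explicit constant for this special case.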

Applying Lemma 2 to a gate in our circuit, we reduce the depth of the binary tree computed at that gate to b. Applying this to every gate we obtain a circuit of depth O(b). In total we have removed an O(1/2^b) fraction of the n log^* n edges.

Writing 2^b = n/d, by Lemma 1 we can write the generating matrixes of our code as C = A + B where A has at most O(n/d) non-zero entries per row, and B has rank O(d log^* n). These parameters are the same as in Dvir’s result, up to lower-order terms. The lower-order terms appear incomparable.

Walsh-Fourier transform

Another matrix that was considered is the n×n Inner Product matrix H, aka the Walsh-Hadamard matrix, where the x,y entry is the inner product of x and y modulo 2. Alman and Williams [AW16] recently gave an interesting rigidity upper bound which prevents this machinery from establishing a circuit lower bound. Specifically they show that H can be written as H = A + B where A has at most n^{1+ϵ} non-zero entries, and B has rank n^{1-ϵ′}, for any ϵ and an ϵ′ which goes to 0 when ϵ does.

Their upper bound works as follows. Let h = log_2 n, so that the rows and columns of H are indexed by h-bit strings x and y. Start with a real polynomial p(z_1,z_2,…,z_h), univariate in the sum z_1 + z_2 + … + z_h, which computes parity exactly on inputs of Hamming weight between 2ϵh and (1/2 + ϵ)h. By interpolation such a polynomial exists with degree (1/2 – ϵ)h. Replacing z_i with x_iy_i you obtain a polynomial of degree (1 – 2ϵ)h which computes IP correctly on inputs x,y whose inner product is between 2ϵh and (1/2 + ϵ)h.

This polynomial has 2^{(1-ϵ′)h} = n^{1-ϵ′} monomials, where ϵ′ = Ω(ϵ^2). The truth-table of a polynomial with m monomials is a matrix with rank at most m, and this gives a low-rank matrix B′.

The fact that sparse polynomials yield low-rank matrixes also appeared in the paper [SV12], which suggested to study the rigidity challenge for matrixes arising from polynomials.

Returning to the proof in [AW16], it remains to deal with inputs whose inner product does not lie in that range. The number of x whose weight is not between (1/2 – ϵ)h and (1/2 + ϵ)h is 2^{(1-ϵ′)h}. For each such input x we modify a row of the matrix B′. Repeating the process for the y we obtain the matrix B, and the rank bound 2^{(1-ϵ′)h} hasn’t changed.

Now a calculation shows that B differs from H in few entries. That is, there are few x and y with Hamming weight between (1/2 – ϵ)h and (1/2 + ϵ)h, but with inner product less than 2ϵh.
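The step from "few monomials" to "low rank" can be checked numerically on a tiny case. The sketch below is only an illustration and is not the construction of [AW16]: it takes h = 3 and the exact real polynomial for parity, substitutes z_i = x_i y_i, and compares the rank of the resulting truth table over the rationals with the number of monomials. The helper names are made up for the example.

```python
from fractions import Fraction
from itertools import combinations, product

h = 3  # number of bits; the truth table is a 2^h x 2^h matrix

# Exact real polynomial for parity:
# sum over nonempty S of (-2)^(|S|-1) * prod_{i in S} z_i.
monomials = [((-2) ** (len(S) - 1), S)
             for k in range(1, h + 1) for S in combinations(range(h), k)]


def evaluate(x, y):
    """Value of the polynomial after substituting z_i = x_i * y_i."""
    return sum(c * all(x[i] and y[i] for i in S) for c, S in monomials)


def rank(rows):
    """Rank over the rationals by Gaussian elimination."""
    rows = [[Fraction(v) for v in r] for r in rows]
    r = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i][col] != 0:
                factor = rows[i][col] / rows[r][col]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r


points = list(product([0, 1], repeat=h))
table = [[evaluate(x, y) for y in points] for x in points]
print(len(monomials), rank(table))
```

Each monomial contributes a rank-one matrix (the outer product of the vector of values of the monomial in x with the same vector in y), which is why the rank never exceeds the number of monomials.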

Boolean complexity

There exists a corresponding framework for boolean circuits (as opposed to circuits with XOR gates only). Rigid matrixes informally correspond to depth-3 Or-And-Or circuits. If this circuit has fan-in f_o at the output gate and fan-in f_i at each input gate, then the correspondence in parameters is

rank = log f_o
sparsity = 2fi .

More precisely, we have the following lemma.

Lemma 3. Let C : {0,1}^n → {0,1}^n be a boolean circuit. If you can remove e edges and reduce the depth to d then you can write C as an Or-And-Or circuit with output fan-in 2^e and input fan-in 2^d.

Proof: After you remove the edges, each output bit and each removed edge depends on at most 2^d input bits or removed edges. The output Or gate of the depth-3 circuit is a big Or over all 2^e assignments of values for the removed edges. Then we need to check consistency. Each consistency check just depends on 2^d inputs and so can be written as a depth-2 circuit with fan-in 2^d. QED

The available bounds are of the form log f_o = n/f_i. For example, for input fan-in f_i = n^α we have lower bounds exponential in n^{1-α} but not more. Again it can be shown that breaking this tradeoff in certain regimes (namely, log_2 f_o = O(n/log log n)) yields lower bounds against linear-size log-depth circuits. (A proof appears in [Vio09].) It was also pointed out in [Vio13] that breaking this tradeoff in any regime yields lower bounds for branching programs. See also the previous post.

One may ask how pairwise independent hash functions relate to this challenge. Ishai, Kushilevitz, Ostrovsky, and Sahai showed [IKOS08] that they can be computed by linear-size log-depth circuits. Again this gives a non-trivial upper bound for depth-3 circuits via these connections, and one can ask for more. In [GHK+13] we give constructions of such circuits which in combination with Lemma 3 can again be used to almost match the available trade-offs.

The bottom line of this post is that we can’t prove lower bounds because they are false, and it is a puzzle to me why some people appear confident that P is different from NP.


[AW16]    Josh Alman and Ryan Williams. Probabilistic rank and matrix rigidity, 2016.

[GHK+13]   Anna Gál, Kristoffer Arnsfelt Hansen, Michal Koucký, Pavel Pudlák, and Emanuele Viola. Tight bounds on computing error-correcting codes by bounded-depth circuits with arbitrary gates. IEEE Transactions on Information Theory, 59(10):6611–6627, 2013.

[IKOS08]    Yuval Ishai, Eyal Kushilevitz, Rafail Ostrovsky, and Amit Sahai. Cryptography with constant computational overhead. In 40th ACM Symp. on the Theory of Computing (STOC), pages 433–442, 2008.

[Spi95]    Daniel Spielman. Computationally Efficient Error-Correcting Codes and Holographic Proofs. PhD thesis, Massachusetts Institute of Technology, 1995.

[SV12]    Rocco A. Servedio and Emanuele Viola. On a special case of rigidity. Available at, 2012.

[Val77]    Leslie G. Valiant. Graph-theoretic arguments in low-level complexity. In 6th Symposium on Mathematical Foundations of Computer Science, volume 53 of Lecture Notes in Computer Science, pages 162–176. Springer, 1977.

[Vio09]    Emanuele Viola. On the power of small-depth computation. Foundations and Trends in Theoretical Computer Science, 5(1):1–72, 2009.

[Vio13]    Emanuele Viola. Challenges in computational lower bounds. Available at, 2013.

Mixing in groups, II

In the previous post we have reduced the “three-step mixing” over SL(2,q), the group of 2×2 matrices over the field with q elements with determinant 1, to the following statement about mixing of conjugacy classes.

Theorem 1.[Mixing of conjugacy classes of SL(2,q)] Let G = SL(2,q). With probability ≥ 1 – |G|^{-Ω(1)} over uniform a,b in G, the distribution C(a)C(b) is |G|^{-Ω(1)}-close in statistical distance to uniform.

Here and throughout this post, C(g) denotes a uniform element from the conjugacy class of g, and every occurrence of C corresponds to an independent draw.

In this post we sketch a proof of Theorem 1, following [GV15]. Similar theorems were proved already. For example Shalev [Sha08] proves a version of Theorem 1 without a quantitative bound on the statistical distance. It is possible to plug some more representation-theoretic information into Shalev’s proof and get the same quantitative bound as in Theorem 1, though I don’t have a good reference for this extra information. However the proof in [Sha08] builds on a number of other things, which also means that if I have to prove a similar but not identical statement, as we had to do in [GV15], it would be hard for me.

Instead, here is how you can proceed to prove the theorem. First, we remark that the distribution of C(a)C(b) is the same as that of

C(C(a)C(b))

because for uniform x, y, and z in G we have the following equalities of distributions:

C(C(a)C(b)) = x^{-1}(y^{-1}ayz^{-1}bz)x = x^{-1}(y^{-1}ayxx^{-1}z^{-1}bz)x = C(a)C(b)

where the last equality follows by replacing y with yx and z with zx.

That means that we get one conjugation “for free” and we just have to show that C(a)C(b) falls into various conjugacy classes with the right probability.
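The "free" conjugation can be checked by brute force in a tiny case. This is a minimal sketch, assuming q = 3 and two arbitrarily chosen elements a and b; matrices are stored as 4-tuples and all helper names are invented for the example.

```python
from collections import Counter
from itertools import product

q = 3  # a small prime, chosen only to keep the enumeration tiny


def mul(g, h):
    """Product of 2x2 matrices over F_q stored as (top-left, top-right, bottom-left, bottom-right)."""
    a, b, c, d = g
    e, f, s, t = h
    return ((a*e + b*s) % q, (a*f + b*t) % q, (c*e + d*s) % q, (c*f + d*t) % q)


def inv(g):
    """Inverse of a determinant-1 matrix (its adjugate)."""
    a, b, c, d = g
    return (d, -b % q, -c % q, a)


def conj(g, h):
    """The conjugate h^{-1} g h."""
    return mul(mul(inv(h), g), h)


G = [g for g in product(range(q), repeat=4) if (g[0]*g[3] - g[1]*g[2]) % q == 1]

a, b = (0, 1, 2, 1), (1, 1, 0, 1)  # two arbitrary elements of SL(2,3)

# Counts of y^{-1} a y z^{-1} b z over all (y, z): the distribution of C(a)C(b).
inner = Counter(mul(conj(a, y), conj(b, z)) for y in G for z in G)

# Counts of x^{-1} (y^{-1} a y z^{-1} b z) x over all (x, y, z): C(C(a)C(b)).
outer = Counter()
for g, count in inner.items():
    for x in G:
        outer[conj(g, x)] += count

print(all(outer[g] == len(G) * inner[g] for g in G))
```

The two count tables agree up to the factor |G| coming from the extra uniform x, which is the statement that the two distributions are identical.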

Now the great thing about SL(2,q) is that you can essentially think of it as made up of q conjugacy classes each of size q^2 (the whole group has size q^3 – q). This is of course not exactly correct, in particular the identity element obviously gives a conjugacy class of size 1. But except for a constant number of conjugacy classes, every conjugacy class has size q^2 up to lower-order terms. This means that what we have to show is simply that the conjugacy class of C(a)C(b) is essentially uniform over conjugacy classes.

Next, the trace map Tr : SL(2,q) → F_q is essentially a bijection between conjugacy classes and the field F_q. To see this recall that the trace map satisfies the cyclic property:

Tr xyz = Tr yzx.

This implies that

Tr u^{-1}au = Tr auu^{-1} = Tr a,

and so conjugate elements have the same trace. On the other hand, the q matrixes

\begin{pmatrix} x & 1 \\ 1 & 0 \end{pmatrix}

for x in F_q all have different traces, and by what we said above their conjugacy classes make up essentially all the group.
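For a small field one can simply enumerate the group and look at the conjugacy classes and their traces. The sketch below assumes q = 5 and computes classes by brute force; it is only meant to make the picture concrete, and the helper names are made up.

```python
from itertools import product

q = 5  # a small prime, chosen only so that the group can be enumerated


def mul(g, h):
    """Product of 2x2 matrices over F_q stored as (top-left, top-right, bottom-left, bottom-right)."""
    a, b, c, d = g
    e, f, s, t = h
    return ((a*e + b*s) % q, (a*f + b*t) % q, (c*e + d*s) % q, (c*f + d*t) % q)


def inv(g):
    """Inverse of a determinant-1 matrix (its adjugate)."""
    a, b, c, d = g
    return (d, -b % q, -c % q, a)


def trace(g):
    return (g[0] + g[3]) % q


G = [g for g in product(range(q), repeat=4) if (g[0]*g[3] - g[1]*g[2]) % q == 1]

# Conjugacy classes, computed by brute force as the sets {u^{-1} g u : u in G}.
classes = {frozenset(mul(mul(inv(u), g), u) for u in G) for g in G}

sizes = sorted(len(c) for c in classes)
traces_per_class = [{trace(g) for g in c} for c in classes]
print(len(G), sizes)
```

Every class carries a single trace value and every element of F_q occurs as a trace, while the printed sizes show how close the classes are to the idealized size q^2.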

Putting it all together, what we are trying to show is that

Tr C(a)C(b)

is |G|^{-Ω(1)}-close to uniform over F_q in statistical distance.

Furthermore, again by the cyclic property we can consider without loss of generality

Tr aC(b)

instead, and moreover we can let a have the form

\begin{pmatrix} 0 & 1 \\ 1 & w \end{pmatrix}

and b have the form

\begin{pmatrix} v & 1 \\ 1 & 0 \end{pmatrix}

(there is no important reason why w is at the bottom rather than at the top).

Writing a generic u in SL(2,q) as the matrix

\begin{pmatrix} u_1 & u_2 \\ u_3 & u_4 \end{pmatrix}

you can now with some patience work out the expression

Tr au^{-1}bu = vu_3u_4 – u_3^2 + u_4^2 – vu_1u_2 + u_1^2 – vwu_2u_3 + wu_1u_3 – u_2^2 – wu_2u_4.

What we want to show is that for typical choices of w and v, the value of this polynomial is q^{-Ω(1)}-close to uniform over F_q for a uniform choice of u subject to the determinant of u being 1, i.e., u_1u_4 – u_2u_3 = 1.

Maybe there is some machinery that immediately does that. Lacking the machinery, you can use the equation u_1u_4 – u_2u_3 = 1 to remove u_4 by dividing by u_1 (the cases where u_1 = 0 are few and do not affect the final answer). Now you end up with a polynomial p in three variables, which we can rename x, y, and z. You want to show that p(x,y,z) is close to uniform, for uniform choices for x,y,z. The benefit of this substitution is that we removed the annoying condition that the determinant is one.

To argue about p(x,y,z), the DeMillo–Lipton–Schwartz-Zippel lemma comes to mind, but is not sufficient for our purposes. It is consistent with that lemma that the polynomial doesn’t take a constant fraction of the values of the field, which would give a constant statistical distance. One has to use more powerful results known as the Lang-Weil theorem. This theorem provides under suitable conditions on p a sharp bound on the probability that p(x,y,z) = a for a fixed a in F_q. The probability is 1/q plus lower-order terms, and then by summing over all a in F_q one obtains the desired bound on the statistical distance.
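To see why a root-counting bound alone does not give closeness to uniform, here is a small sketch with two toy polynomials that are not the polynomial p from the proof. Over F_7, the polynomial x^2 hits each value at most twice yet misses about half of the field, while xy + z is exactly uniform. The function name and the choice q = 7 are assumptions made for the illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def distance_from_uniform(p, q, arity):
    """Statistical distance between p(uniform inputs) mod q and the uniform distribution on F_q."""
    counts = Counter(p(*xs) % q for xs in product(range(q), repeat=arity))
    total = q ** arity
    return sum(abs(Fraction(counts[a], total) - Fraction(1, q)) for a in range(q)) / 2


q = 7
print(distance_from_uniform(lambda x: x * x, q, 1))            # 3/7
print(distance_from_uniform(lambda x, y, z: x * y + z, q, 3))  # 0
```

The first distance equals (q – 1)/(2q), which stays close to 1/2 as q grows; a point-wise bound of the Lang-Weil type is what rules out this behavior for the polynomial of interest.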

I am curious if there is a way to get the statistical distance bound without first proving a point-wise bound.

To apply the Lang-Weil theorem you have to show that the polynomial is “absolutely irreducible,” i.e., irreducible over any algebraic extension of the field. This can be proven from first principles by a somewhat lengthy case analysis.


[GV15]   W. T. Gowers and Emanuele Viola. The communication complexity of interleaved group products. In ACM Symp. on the Theory of Computing (STOC), 2015.

[Sha08]   Aner Shalev. Mixing and generation in simple groups. J. Algebra, 319(7):3075–3086, 2008.

Mixing in groups

Non-abelian groups behave in ways that are useful in computer science. Barrington’s famous result [Bar89] shows that we can write efficiently an arbitrary low-depth computation as a group product over any non-solvable group. (Being non-solvable is a certain strengthening of being non-abelian which is not important now.) His result, which is false for abelian groups, has found myriad applications in computer science. And it is amusing to note that actually results about representing computation as group products were obtained twenty years before Barrington, see [KMR66]; but the time was not yet ripe.

This post is about a different property that certain non-abelian groups have and that is also useful. Basically, these groups “mix” in the sense that if you have several distributions over the group, and the distributions have high entropy, then the product distribution (i.e., sample from each distribution and output the product) is very close to uniform.

First, let us quickly remark that this is completely false for abelian groups. To make an example that is familiar to computer scientists, consider the group of n-bit strings with bit-wise xor. Now let A be the uniform distribution over this group where the first bit is always 0. Then no matter how many independent copies of A you multiply together, the product is always A.
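This example is easy to verify exhaustively for small n. A minimal sketch, assuming n = 3 and representing strings as integers whose top bit plays the role of the first bit:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

n = 3
A = [x for x in range(2 ** n) if x >> (n - 1) == 0]  # n-bit strings whose first bit is 0


def product_distribution(t):
    """Exact distribution of the xor of t independent uniform samples from A."""
    counts = Counter()
    for sample in product(A, repeat=t):
        value = 0
        for x in sample:
            value ^= x
        counts[value] += 1
    return {v: Fraction(c, len(A) ** t) for v, c in counts.items()}


for t in range(1, 5):
    print(t, product_distribution(t))
```

For every t the output is again the uniform distribution over A, so the product never spreads to the half of the group whose first bit is 1.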

Remarkably, over other groups it is possible to show that the product distribution will become closer and closer to uniform. A group that works very well in this respect is SL(2,q), the group of 2×2 matrices over the field with q elements with determinant 1. This is a group that in some sense is very far from abelian. In particular, one can prove the following result.

Theorem 1.[Three-step mixing [Gow08, BNP08]] Let G = SL(2,q), and let A, B, and C be three subsets of G of constant density. Let a, b, and c be picked independently and uniformly from A, B, and C respectively. Then for any g in G we have

|Pr[abc = g] – 1/|G|| < 1/|G|^{1+Ω(1)}.

Note that the conclusion of the theorem in particular implies that abc is supported over the entire group. This is remarkable, since the starting distributions are supported over only a small fraction of the group. Moreover, by summing over all elements g in the group we obtain that abc is polynomially close to uniform in statistical distance.

Theorem 1 can be proved using representation theory. This must be a great tool, but for some reason I always found it a little difficult to digest the barrage of definitions that usually anticipate the interesting stuff.

Luckily, there is another way to prove Theorem 1. I wouldn’t be surprised if this is in some sense the same way, and moreover this other way is not something I would call elementary. But it is true that I will be able to sketch a proof of the theorem without using the word “representation”. In this post we will prove some preliminary results that are valid for all groups, and the most complicated thing used is the Cauchy-Schwarz inequality. In the next post we will work specifically with the group SL(2,q), and use more machinery. This is all taken from this paper with Gowers [GV15] (whose main focus is the study of mixing in the presence of dependencies).

First, for convenience let us identify a set A with its characteristic function. So we write A(a) for the indicator of the event that a belongs to the set A. It is convenient to work with a slightly different statement:

Theorem 2. Let G = SL(2,q) and let A,B,C be three subsets of G of densities α,β,γ respectively. For any g in G,

|E_{abc=g} A(a)B(b)C(c) – αβγ| ≤ |G|^{-Ω(1)}

where the expectation is over uniform elements a, b, and c from the group G such that their product is equal to g.

This Theorem 2 is equivalent to Theorem 1, because

E_{abc=g} A(a)B(b)C(c) = Pr[A(a),B(b),C(c) | abc = g]
= Pr[abc = g | A(a),B(b),C(c)] |G|αβγ

by Bayes’ rule. So we can get Theorem 1 by dividing by |G|αβγ.
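The identity behind this equivalence holds in any finite group, so it can be sanity-checked on a toy example. The sketch below uses the symmetric group S_3 only as a small stand-in for SL(2,q), with three arbitrarily chosen subsets; all names are invented for the illustration.

```python
from fractions import Fraction
from itertools import permutations, product

G = list(permutations(range(3)))  # the symmetric group S_3, a small stand-in


def mul(g, h):
    """Composition of permutations stored as tuples."""
    return tuple(g[h[i]] for i in range(3))


A, B, C = set(G[:3]), set(G[1:5]), set(G[2:])  # arbitrary subsets
alpha, beta, gamma = (Fraction(len(S), len(G)) for S in (A, B, C))


def lhs(g):
    """E_{abc=g} A(a)B(b)C(c): average over all triples with product g."""
    triples = [(a, b, c) for a, b, c in product(G, repeat=3) if mul(mul(a, b), c) == g]
    good = sum(a in A and b in B and c in C for a, b, c in triples)
    return Fraction(good, len(triples))


def rhs(g):
    """Pr[abc = g | a in A, b in B, c in C] * |G| * alpha * beta * gamma."""
    hits = sum(mul(mul(a, b), c) == g for a, b, c in product(A, B, C))
    return Fraction(hits, len(A) * len(B) * len(C)) * len(G) * alpha * beta * gamma


print(all(lhs(g) == rhs(g) for g in G))
```

Both sides reduce to the number of triples in A × B × C with product g, divided by |G|^2.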

Now we observe that to prove this “mixing in three steps” it actually suffices to prove mixing in four steps.

Theorem 3.[Mixing in four steps] Let G = SL(2,q) and let A,B,C,D be four subsets of G of densities α,β,γ,δ respectively. For any g in G,

|E_{abcd=g} A(a)B(b)C(c)D(d) – αβγδ| ≤ |G|^{-Ω(1)},

where the expectation is over uniform elements a, b, c, and d from the group G such that their product is equal to g.

Lemma 4. Mixing in four steps implies mixing in three.

Proof: Rewrite

|E_{abc=g} A(a)B(b)C(c) – αβγ| = |E_{abc=g} f(a)B(b)C(c)|

where f(a) := A(a) – α.

In these proofs we will apply Cauchy-Schwarz several times. Each application “loses a square,” but since we are aiming for an upper bound of the form 1/|G|^{Ω(1)} we can afford any constant number of applications. Our first one is now:

(E_{abc=g} f(a)B(b)C(c))^2 ≤ (E_c C(c)^2)(E_c (E_{ab=gc^{-1}} f(a)B(b))^2)
= γ E_c E_{ab=a′b′=gc^{-1}} f(a)B(b)f(a′)B(b′)
= γ E_{ab=a′b′} (A(a) – α)B(b)B(b′)(A(a′) – α).

There are four terms that make up the expectation. The terms that involve at least one α sum to –α^2β^2. The remaining term is the expectation of A(a)B(b)B(b′)A(a′). Note that ab = a′b′ is equivalent to ab(1/b′)(1/a′) = 1_G. Hence by Theorem 3 this expectation is at most α^2β^2 + |G|^{-Ω(1)}, and so the whole expression is at most |G|^{-Ω(1)}. QED

So what remains to see is how to prove mixing in four steps. We shall reduce the mixing problem to the following statement about the mixing of conjugacy classes of our group.

Definition 5. We denote by C(g) the conjugacy class {h^{-1}gh : h in G} of an element g in G. We also denote by C(g) the uniform distribution over C(g) for a uniformly selected g in G.

Theorem 6.[Mixing of conjugacy classes of SL(2,q)] Let G = SL(2,q). With probability ≥ 1 – |G|^{-Ω(1)} over uniform a,b in G, the distribution C(a)C(b) is |G|^{-Ω(1)}-close in statistical distance to uniform.

Theorem 6 is proved in the next blog post. Here we just show that it suffices for our needs.

Lemma 7. Theorem 6 implies Theorem 3.

Proof: We rewrite the quantity to bound as

E_{abcd=g} A(a)C(c)f(b,d)

for f(b,d) = B(b)D(d) – βδ.

Now by Cauchy-Schwarz we bound the square of this above by

E f(b,d)f(b′,d′)

where the expectation is over variables such that abcd = g and ab′cd′ = g. As in the proof that mixing in four steps implies mixing in three, we can rewrite the last two equations as the single equation bcd = b′cd′.

The fact that the same variable c occurs on both sides of the equation is what gives rise to conjugacy classes. Indeed, this equation can be rewritten as

c^{-1}(1/b)b′c = d(1/d′).

Performing the following substitutions: b = x,b′ = xh,d′ = y we can rewrite our equation as

d = c^{-1}hcy.

Hence we have reduced our task to that of bounding

E f(x,C(h)y)f(xh,y)

for uniform x,y,h.

We can further replace y with C(h)^{-1}y, and rewrite the expression as

E_{x,y}(f(x,y) E_h f(xh,C(h^{-1})y)).

By Cauchy-Schwarz, the square of this is at most

(E_{x,y} f^2(x,y)) E_{x,y,h,h′} f(xh,C(h^{-1})y)f(xh′,C(h′^{-1})y).

Recalling that f(b,d) = B(b)D(d) – βδ, and that E[B(b)D(d)] = βδ, the first factor is at most 1. The second can be rewritten as

E_{x,y,h,h′} f(x,y)f(xh^{-1}h′,C(h′^{-1})C(h)y)

replacing x with xh^{-1} and y with C(h^{-1})^{-1}y = C(h)y.

Again using the definition of f this equals

E_{x,y,h,h′} B(x)D(y)B(xh^{-1}h′)D(C(h′^{-1})C(h)y) – β^2δ^2.

Now Theorem 6 guarantees that the distribution (x,y,xh^{-1}h′,C(h′^{-1})C(h)y) is 1/|G|^{Ω(1)}-close in statistical distance to the uniform distribution over G^4, and this concludes the proof. QED


[Bar89]    David A. Mix Barrington. Bounded-width polynomial-size branching programs recognize exactly those languages in NC1. J. of Computer and System Sciences, 38(1):150–164, 1989.

[BNP08]    László Babai, Nikolay Nikolov, and László Pyber. Product growth and mixing in finite groups. In ACM-SIAM Symp. on Discrete Algorithms (SODA), pages 248–257, 2008.

[Gow08]    W. T. Gowers. Quasirandom groups. Combinatorics, Probability & Computing, 17(3):363–387, 2008.

[GV15]    W. T. Gowers and Emanuele Viola. The communication complexity of interleaved group products. In ACM Symp. on the Theory of Computing (STOC), 2015.

[KMR66]   Kenneth Krohn, W. D. Maurer, and John Rhodes. Realizing complex Boolean functions with simple groups. Information and Control, 9:190–195, 1966.

Bounded indistinguishability

Countless papers study the properties of k-wise independent distributions, which are distributions where any k bits are uniform and independent. One property of interest is which computational models are fooled by such distributions, in the sense that they cannot distinguish any such distribution from a uniformly random one. Recently, Bazzi’s breakthrough, discussed earlier on this blog, shows that k = polylog(n) independence fools any polynomial-size DNF on n bits.

Let us change the question. Let us say that instead of one distribution we have two, and we know that any k bits are distributed identically, but not necessarily uniformly. We call such distributions k-wise indistinguishable. (Bounded independence is the special case when one distribution is uniform.) Can a DNF distinguish the two distributions? In fact, what about a single Or gate?

This is the question that we address in a paper with Bogdanov, Ishai, and Williamson. A big thank you goes to my student Chin Ho Lee for connecting researchers who were working on the same problems on different continents. Here at NEU the question was asked to me by my neighbor Daniel Wichs.

The question turns out to be equivalent to threshold/approximate degree, an influential complexity measure that goes back to the works by Minsky and Papert and by Nisan and Szegedy. The equivalence is a good example of the usefulness of duality theory, and is as follows. For any boolean function f on n bits the following two are equivalent:

1. There exist two k-wise indistinguishable distributions that f tells apart with advantage e;

2. No degree-k real polynomial can approximate f to pointwise error at most e/2.
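A standard textbook example of bounded indistinguishability, not taken from the paper, is the pair of uniform distributions over even-parity and odd-parity strings: they agree on any n – 1 bits, yet parity tells them apart perfectly. A minimal check, assuming n = 4:

```python
from collections import Counter
from itertools import combinations, product

n = 4
even = [x for x in product([0, 1], repeat=n) if sum(x) % 2 == 0]
odd = [x for x in product([0, 1], repeat=n) if sum(x) % 2 == 1]


def marginal(support, positions):
    """Counts of the bits at `positions` under the uniform distribution on `support`."""
    return Counter(tuple(x[i] for i in positions) for x in support)


def k_wise_indistinguishable(k):
    """True if every set of k positions has the same marginal under both distributions."""
    return all(marginal(even, S) == marginal(odd, S)
               for S in combinations(range(n), k))


for k in range(1, n + 1):
    print(k, k_wise_indistinguishable(k))
```

Since both supports have the same size, comparing counts is the same as comparing probabilities. The check succeeds for every k up to n – 1 and fails for k = n.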

I have always liked this equivalence, but at times I felt slightly worried that it could be considered too “simple.” But hey, I hope my co-authors don’t mind if I disclose that it’s been four different conferences, and not one reviewer filed a complaint about that.

From the body of works on approximate degree one readily sees that bounded indistinguishability behaves very differently from bounded independence. For example, one needs k = Ω(√n) to fool an Or gate, and that is tight. Yes, to spell this out, there exist two distributions which are 0.001√n-wise indistinguishable but Or tells them apart with probability 0.999. But obviously even constant independence fools Or.

The biggest gap is achieved by the Majority function: constant independence suffices, by this, while linear indistinguishability is required by Paturi’s lower bound.

In the paper we apply this equivalence in various settings, and here I am just going to mention the design of secret-sharing schemes. Previous schemes like Shamir’s required the computation of things like parity, while the new schemes use different types of functions, for example of constant depth. Here we also rely on the amazing ability of constant-depth circuits to sample distributions, also pointed out earlier on this blog, and apply some expander tricks to trade alphabet size for other parameters.

The birthday paradox

The birthday paradox is the fact that if you sample t independent variables each uniform in {1, 2, …, n} then the probability that two will be equal is at least a constant independent from n when t ≥ √n. The word “paradox” refers to the fact that t can be as small as √n, as opposed to being closer to n. (Here I am not interested in the precise value of this constant as a function of t.)

The Wikipedia page lists several proofs of the birthday paradox where it is not hard to see why the √n bound arises. Nevertheless, I find the following two-stage approach more intuitive.

Divide the random variables in two sets of 0.5√n each. If there are two in the first set that are equal, then we are done. So we can condition on this event not happening, which means that the variables in the first set are all distinct. Now take any variable in the second set. The probability that it is equal to any variable in the first set is 0.5√n/n = 0.5/√n. Hence, the probability that all the variables in the second set are different from those in the first is

(1 – 0.5/√n)^{0.5√n} ≤ e^{-0.25} < 1.
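The bound can be compared with the exact collision probability. A small sketch, assuming t is taken to be the integer square root of n; the function name is made up for the example.

```python
from fractions import Fraction
from math import exp, isqrt


def collision_probability(n, t):
    """Exact probability that t independent uniform samples from {1, ..., n} are not all distinct."""
    all_distinct = Fraction(1)
    for i in range(t):
        all_distinct *= Fraction(n - i, n)
    return 1 - all_distinct


for n in [100, 10_000, 1_000_000]:
    t = isqrt(n)
    print(n, t, float(collision_probability(n, t)), 1 - exp(-0.25))
```

The last column is the constant 1 – e^{-0.25} guaranteed by the two-stage argument; the exact probability is larger, since the argument ignores collisions inside each half.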

How difficult is it to prove new lower bounds?

Two recent papers address the challenge of proving new correlation bounds for low-degree polynomials, which as depicted in a previous post are also necessary for a number of other long-standing problems, such as lower bounds for number-on-forehead communication protocols and depth-3 Majority circuits.

Nonclassical polynomials as a barrier to polynomial lower bounds, by Bhowmick and Lovett, brings non-classical polynomials into the picture and shows that those polynomials of degree only log(n) are capable of various things that classical polynomials we know or conjecture are not. Consider for example the correlation between the mod 3 function on n boolean variables, and polynomials of degree d modulo 2. It has been natural to conjecture that this correlation is say super-polynomially small (less than 1/n^c for every c) for degrees d up to d = n^{0.1}, but despite substantial effort we cannot even show correlation at most 1/n for degree log(n). However, we can show exponentially small correlation bounds for degrees at most 0.1 log(n), and correlation bounds of 1/n^{0.1} for degrees up to n^{0.1}, see this survey.

Bhowmick and Lovett construct a non-classical polynomial of degree O(log n) that achieves correlation 99% with the mod 3 function. What is the trick? First, suppose that my polynomial was defined modulo 1, i.e., over the torus, and that I was allowed to divide by 3. Then I could consider the polynomial p(x_1, x_2, …, x_n) = (x_1 + x_2 + … + x_n)/3, which has degree 1, and obtain maximum correlation 1. You can’t quite do that, but close. Non-classical polynomials are indeed defined over the torus, and allow division by powers of 2 of their integer coefficients. Basically, you can arrange things so that with polynomials of degree d you can divide by 3(1 + 1/2^d). Setting d = O(log n) gets you close enough to a division by 3 that you obtain correlation 99%.

They also exhibit degree-O(log n) non-classical polynomials that correlate well with the majority function, and that weak-represent Or. In all cases, the polynomial is (x1 + x2 + … + xn) divided by a suitable number — the square root of n in the case of majority.

It is not clear to me how serious an obstacle this is, since as mentioned above and in their paper we can still prove vanishing correlation bounds for classical polynomials of degree O(log n), so we do have techniques that separate classical from non-classical polynomials in certain regimes. But it is refreshing to have this new viewpoint.

You can see their title and abstract following the link above. Mine would have been something like this: The power of non-classical polynomials. We show that non-classical polynomials of logarithmic degree are capable of several feats that we know or conjecture classical polynomials are not.

Anti-concentration for random polynomials by Nguyen and Vu proves that real polynomials (as opposed to polynomials modulo 2) have correlation zero (not small, but exactly zero) with the parity function, up to degree log(n)/loglog(n), improving on a degree bound of loglog(n) in this paper with Razborov. Here the polynomial is supposed to compute the parity function exactly: any non-boolean output counts as a mistake, thus making proving correlation bounds supposedly easier. Again, we can’t prove bounds of 1/n on the correlation for degree log(n), a problem which as also depicted in the previous post appears even more fundamental than correlation bounds for polynomials modulo 2 (and formally is a necessary first step).

Nguyen and Vu obtain their exponential improvement by a corresponding improvement in the anti-concentration bound in our paper, which was in turn a slight improvement over a special case of a previous result by Costello, Tao, and Vu. (Our improvement was only for the probability of hitting one element, as opposed to landing in an interval, but that was sufficient for the application.) Nguyen and Vu simultaneously improve on both results.

Whereas our proof is more or less from first principles, Nguyen and Vu use a lot of machinery that was developed in theoretical computer science, including invariance principles and regularity lemmas, and it is very cool to see how everything fits together.

Having discussed these two papers in a sequence, a natural question is whether non-classical polynomials help for exact computation as considered in the second paper. In fact, this question is asked in the paper by Bhowmick and Lovett, who conjecture that the answer is negative: for exact computation, non-classical polynomials should not do better than classical.

Restricted models


[Figure: Map 1]

[Figure: Map 2]


To understand Life, what should you study?

a. People’s dreams.

b. The AMPK gene of the fruit fly.

Studying restricted computational models corresponds to b. Just like microbes constitute a wealth of open problems whose solutions are sometimes far-reaching, so restricted computational models present a number of challenges whose study is significant. For one example, Valiant’s study of arithmetic lower bounds boosted the study of superconcentrators, an influential type of graphs closely related to expanders.

The maps above, taken from here, include a number of challenges together with their relationships. Arrows go towards special cases (which are presumably easier). As written in the manuscript, my main aim was to put these challenges in perspective, and to present some connections which do not seem widely known. Indeed, one specific reason why I drew the first map was the realization that an open problem that I spent some time working on can actually be solved immediately by combining known results. The problem was to show that multiparty (number-on-forehead) communication lower bounds imply correlation bounds for polynomials over GF(2). The classic work by Hastad and Goldman does show that k-party protocols can simulate polynomials of degree k-1, and so obviously correlation bounds for k-party protocols imply the same bounds for polynomials of degree k-1. But what I wanted was a connection with worst-case communication lower bounds, to show that correlation bounds for polynomials (survey) are a prerequisite even for that.

As it turns out, and as the arrows from (1.5) to (1.2) in the first map show, this is indeed true when k is polylogarithmic. So, if you have been trying to prove multiparty lower bounds for polylogarithmic k, you may want to try correlation bounds first. (This connection is for proving correlation bounds under some distribution, not necessarily uniform.)

Another reason why I drew the first map was to highlight a certain type of correlation bound (1.3), discussed in this paper with Razborov. It is a favorite example of mine of a seemingly very basic open problem that is, again, a roadblock for much of what we’d like to know. The problem is to obtain correlation bounds against polynomials that are real valued, with the convention that whenever the polynomial does not output a boolean value we count that as a mistake, thus making the goal of proving a lower bound presumably easier. Amazingly, the following is still open:

Prove that the correlation of the parity function on n bits is at most 1/n with any real polynomial of degree log(n).

To be precise, correlation is defined as the probability that the polynomial correctly computes parity, minus 1/2. For example, the constant polynomial 1 has correlation zero with parity: it gets it right half the time. The polynomial x1+x2+…+xn, on the other hand, does a lot worse: it has negative correlation with parity, or in fact with any boolean function, just because it is unlikely that its output is in {0,1}.
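To make the definition concrete, here is a minimal brute-force sketch in Python. The function name `correlation_with_parity` and the choice n = 4 are mine, just for illustration; the code enumerates all inputs and counts any non-boolean output as a mistake, as in the convention above.

```python
from fractions import Fraction
from itertools import product


def correlation_with_parity(poly, n):
    """Return Pr_x[poly(x) == parity(x)] - 1/2 for uniform x in {0,1}^n.

    A non-boolean output can never equal the parity bit, so it is
    automatically counted as a mistake.
    """
    correct = sum(
        1 for x in product((0, 1), repeat=n) if poly(x) == sum(x) % 2
    )
    return Fraction(correct, 2 ** n) - Fraction(1, 2)


n = 4  # illustrative size, small enough for exhaustive enumeration


def constant_one(x):
    """The constant polynomial 1."""
    return 1


def sum_of_bits(x):
    """The real polynomial x1 + x2 + ... + xn."""
    return sum(x)


print(correlation_with_parity(constant_one, n))  # zero correlation
print(correlation_with_parity(sum_of_bits, n))   # negative correlation
```

The sum of the bits agrees with parity only on inputs of Hamming weight 0 or 1, which is why its correlation is (n+1)/2^n - 1/2, negative for every n ≥ 2.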

What we do in the paper, in my opinion, is to begin to formalize the intuition that these polynomials cannot do much. We show that the correlation with parity is zero (not very small, but actually zero) as long as the polynomial has degree 0.001 loglog(n). This is different from the more familiar models of polynomials modulo m or sign polynomials, because those can achieve non-zero correlation even with constant degree.

On the other hand, with a simple construction, we can obtain non-zero correlation with polynomials of degree O(sqrt(n)). Note the huge gap with the 0.001 loglog(n) lower bound.

Question: what is the largest degree for which the correlation is zero?

The second map gives another slice of open problems. It highlights how superlinear-length lower bounds for branching programs are necessary for several notorious circuit lower bounds.

A third map was scheduled to include Valiant’s long-standing rigidity question and algebraic lower bounds. In the end it was dropped because it required a lot of definitions while I knew of very few arrows. But one problem that was meant to be there is a special case of the rigidity question from this work with Servedio. The question is basically a variant of the above question of real polynomials, where instead of considering low-degree polynomials we consider sparse polynomials. What may not be immediately evident, although in hindsight it is technically immediate, is that this problem is indeed a special case of the rigidity question. The question is to improve on the rigidity bounds in this special case.

In the paper we prove some variant that does not seem to be known in the rigidity world, but what I want to focus on right now is an application that such bounds would have, if established for the Inner Product function modulo 2 (IP). They would imply that IP cannot be computed by polynomial-size AC0-Parity circuits, i.e., AC0 circuits which take as input a layer of parity gates that’s connected to the input. It seems ridiculous that IP can be computed by such circuits, of course. It is easy to handle Or-And-Parity circuits, but circuits of higher depth have resisted attacks.

The question was reasked by Akavia, Bogdanov, Guo, Kamath, and Rosen.

Cheraghchi, Grigorescu, Juba, Wimmer, and Xie have just obtained some lower bounds for this problem. For And-Or-And-Parity circuits they obtain almost quadratic; the bounds degrade for larger depth but stay polynomial. Their proof of the quadratic lower bound looks nice to me. Their first moves are relatively standard: first they reduce to an approximation question for Or-And-Parity circuits; then they fix half the variables of IP so that IP becomes a parity that is “far” from the parities that are input to the DNF. The more interesting step of the argument, in my opinion, comes at this point. They consider the random variable N that counts the number of And-parity gates that evaluate to one, and they observe that the distribution of several moments of this variable is the same in the case where the parity that comes from IP is zero or one. From this, they use approximation theory to argue about the probability that N will be zero in the two cases. They get that these probabilities are also quite close, as long as the circuit is not too large, which shows that the circuit is not correctly computing IP.

Is Nature a low-complexity sampler?

“It is often said that we live in a computational universe. But if Nature “computes” in a classical, input-output fashion then our current prospect to leverage this viewpoint to gain fundamental insights may be scarce. This is due to the combination of two facts. First, our current understanding of fundamental questions such as “P=NP?” is limited to restricted computational models, for example the class AC0 of bounded-depth circuits. Second, those restricted models are incapable of modeling many processes which appear to be present in nature. For example, a series of works in complexity theory culminating in [Hås87] shows that AC0 cannot count.

But what if Nature, instead, “samples?” That is, what if Nature is better understood as a computational device that given some initial source of randomness, samples the observed distribution of the universe? Recent work by the Project Investigator (PI) gives two key insights in this direction. First, the PI has highlighted that, when it comes to sampling, restricted models are capable of surprising behavior. For example, AC0 can count, in the sense that it can sample a uniform bit string together with its Hamming weight [Vio12a]. Second, despite the growth in power given by sampling, for these restricted models the PI was still able to answer fundamental questions of the type of “P=NP?” [Vio14]”

Thus begins my application for the Turing Centenary Research Fellowship. After reading it, perhaps you too, like me, are not surprised that it was declined. But I was unprepared for the strange emails that accompanied its rejection. Here’s an excerpt:

“[…] A reviewing process can be thought of as a kind of Turing Test for fundability. There is a built-in fallibility; and just as there is as yet no intelligent machine or effective algorithm for recognising one (otherwise why would we bother with a Turing Test), there is no algorithm for either writing the perfect proposal, or for recognising the worth of one.

Of course, the feedback may well be useful, and will come. But we will be grateful for your understanding in the meantime.”

Well, I am still waiting for comments.

Even the rejection was sluggish: for months I and apparently others were told that our proposal didn’t make it, but was so good that they were looking for extra money to fund it anyway. After the money didn’t materialize, I was invited to try the regular call (of the sponsoring foundation). The first step of this was submitting a preliminary proposal, which I took: I re-sent them the abstract of my proposal. I was then invited to submit the full proposal. This is a rather painstaking process which requires you to address a seemingly endless series of minute questions referring to mysterious concepts such as the “Theory of Change.” Nevertheless, given that they had suggested I try the regular call, they had seen what I was planning to submit, and they had still invited me for the full proposal, I did answer all the questions and re-sent them what they already had, my Turing Research Fellowship application. Perhaps it only makes sense that the outcome was as it was.

The proposal was part of a research direction which started exactly five years ago, when the question was raised of proving computational lower bounds for sampling. Since then, there has been progress: [Vio12a, LV12, DW11, Vio14, Vio12b, BIL12, BCS14]. One thing I like about this area is that it is uncharted: wherever you point your finger, chances are you find an open problem. While this is true for much of Complexity Theory, questions regarding sampling haven’t been studied nearly as intensely. Here are three:

A most basic open question. Let D be the distribution on n-bit strings where each bit is independently 1 with probability 1/4. Now suppose you want to sample D given some random bits x1, x2, …. You can easily sample D exactly with the map

(x1 ∧ x2, x3 ∧ x4, …, x_{2n−1} ∧ x_{2n}).

This map is 2-local, i.e., each output bit depends on at most 2 input bits. However, we use 2n input bits, whereas the entropy of the distribution is H(1/4)n ≈ 0.81n. Can you show that any 2-local map using a number of bits closer to H(1/4)n will sample a distribution that is very far from D? Ideally, we want to show that the statistical distance between the distributions is very high, exponentially close to 1.
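As a sanity check on the numbers, here is a small Python sketch that computes the exact output distribution of the pairwise-AND map on 2n uniform bits, compares it with D, and evaluates the binary entropy H(1/4). The helper names (`and_pairs`, `output_distribution`, and so on) are my own, and n = 3 is just a size that is easy to enumerate.

```python
from itertools import product
from math import log2


def and_pairs(x):
    """2-local map: output bit i is x[2i] AND x[2i+1]."""
    return tuple(x[2 * i] & x[2 * i + 1] for i in range(len(x) // 2))


def output_distribution(n):
    """Exact output distribution of and_pairs on 2n uniform input bits."""
    dist = {}
    for x in product((0, 1), repeat=2 * n):
        y = and_pairs(x)
        dist[y] = dist.get(y, 0.0) + 1 / 4 ** n
    return dist


def target_probability(y):
    """Probability of y under D: each bit is independently 1 w.p. 1/4."""
    ones = sum(y)
    return (1 / 4) ** ones * (3 / 4) ** (len(y) - ones)


def binary_entropy(p):
    """Binary entropy H(p) in bits."""
    return -p * log2(p) - (1 - p) * log2(1 - p)


n = 3  # illustrative size
dist = output_distribution(n)
max_gap = max(
    abs(dist.get(y, 0.0) - target_probability(y))
    for y in product((0, 1), repeat=n)
)
print("largest pointwise gap to D:", max_gap)
print("input bits used:", 2 * n)
print("entropy of D:", binary_entropy(1 / 4) * n)
```

The gap is zero, so the map samples D exactly, but it spends 2n bits where the entropy is only about 0.81n. The open question is whether any 2-local map can do substantially better.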

Such strong statistical distance bounds also enable a connection to lower bounds for succinct dictionaries, a problem that Pǎtraşcu thinks important. A result for d-local maps corresponds to a result for data structures which answer membership queries with d non-adaptive bit probes. Adaptive bit probes correspond to decision trees, while d cell probes correspond to samplers whose input is divided into blocks of O(log n) bits, and each output bit depends on d cells, adaptively.

There are some results in [Vio12a] on a variant of the above question where you need to sample strings whose Hamming weight is exactly n∕4, but even there there are large gaps in our knowledge. And I think the above case of 2-local maps is still open, even though it really looks like you cannot do anything unless you use 2n random bits.

Stretch. With Lovett we suggested [LV12] to prove negative results for sampling (the uniform distribution over) a subset S ⊆ {0,1}^n by bounding from below the stretch of any map

f : {0,1}^r → S.

Stretch can be measured as the average Hamming distance between f(x) and f(y), where x and y are two uniform input strings at Hamming distance 1. If you prove a good lower bound on this quantity then some complexity lower bounds for f follow because local maps, AC0 maps, etc. have low stretch.
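Here is a minimal Python sketch of this average-stretch measure, computed by brute force. The two example maps are my own choices for illustration: the pairwise-AND map from the first question, which is 2-local, and a prefix-XOR map, in which a single input bit can affect many output bits.

```python
from itertools import product


def hamming_distance(a, b):
    """Number of positions where the tuples a and b differ."""
    return sum(u != v for u, v in zip(a, b))


def average_stretch(f, r):
    """Average of dist(f(x), f(y)) over all x in {0,1}^r and all y
    obtained from x by flipping exactly one bit."""
    total = 0
    for x in product((0, 1), repeat=r):
        fx = f(x)
        for i in range(r):
            y = x[:i] + (1 - x[i],) + x[i + 1:]
            total += hamming_distance(fx, f(y))
    return total / (r * 2 ** r)


def and_pairs(x):
    """2-local map: output bit i is x[2i] AND x[2i+1]."""
    return tuple(x[2 * i] & x[2 * i + 1] for i in range(len(x) // 2))


def prefix_xor(x):
    """Output bit j is the XOR of x[0..j]; not a local map."""
    out, acc = [], 0
    for bit in x:
        acc ^= bit
        out.append(acc)
    return tuple(out)


r = 6  # illustrative input length
print(average_stretch(and_pairs, r))   # 0.5
print(average_stretch(prefix_xor, r))  # 3.5
```

For the AND map, flipping an input bit changes the output only when its partner bit is 1, which happens half the time. For prefix XOR, flipping bit i changes every output from position i on, giving (r+1)/2 on average.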

We were able to apply this to prove that AC0 cannot sample good codes. Our bounds are only polynomially close to 1; but a nice follow-up by Beck, Impagliazzo, and Lovett, [BIL12], improves this to exponential. But can this method be applied to other sets that do not have error-correcting structure?

Consider in particular the distribution UP which is uniform over the upper-half of the hypercube, i.e., uniform over the n-bit strings whose majority is 1. What stretch is required to sample UP? At first sight, it seems the stretch must be quite high.

But a recent paper by Benjamini, Cohen, and Shinkar, [BCS14], shows that in fact it is possible with stretch 5. Moreover, the sampler has zero error, and uses the minimum possible number of input bits: n − 1!

I find their result quite surprising in light of the fact that constant-locality samplers cannot do the job: their output distribution has Ω(1) statistical distance from UP [Vio12a]. But local samplers looked very similar to low-stretch ones. Indeed, it is not hard to see that a local sampler has low average stretch, and the reverse direction follows from Friedgut’s theorem. However, the connections are only average-case. It is pretty cool that the picture changes completely when you go to worst-case computation.

What else can you sample with constant stretch?

AC0 vs. UP. Their results are also interesting in light of the fact that AC0 can sample UP with exponentially small error. This follows from a simple adaptation of the dart-throwing technique for parallel algorithms, known since the early 90’s [MV91, Hag91]; the details are in [Vio12a]. However, unlike their low-stretch map, this AC0 sampler uses superlinear randomness and has a non-zero probability of error.

Can AC0 sample UP with no error? Can AC0 sample UP using O(n) random bits?

Let’s see what the next five years bring.


[BCS14]   Itai Benjamini, Gil Cohen, and Igor Shinkar. Bi-Lipschitz bijection between the Boolean cube and the Hamming ball. In IEEE Symp. on Foundations of Computer Science (FOCS), 2014.

[BIL12]    Chris Beck, Russell Impagliazzo, and Shachar Lovett. Large deviation bounds for decision trees and sampling lower bounds for AC0-circuits. In IEEE Symp. on Foundations of Computer Science (FOCS), pages 101–110, 2012.

[DW11]    Anindya De and Thomas Watson. Extractors and lower bounds for locally samplable sources. In Workshop on Randomization and Computation (RANDOM), 2011.

[Hag91]    Torben Hagerup. Fast parallel generation of random permutations. In 18th Coll. on Automata, Languages and Programming (ICALP), pages 405–416. Springer, 1991.

[Hås87]    Johan Håstad. Computational limitations of small-depth circuits. MIT Press, 1987.

[LV12]    Shachar Lovett and Emanuele Viola. Bounded-depth circuits cannot sample good codes. Computational Complexity, 21(2):245–266, 2012.

[MV91]    Yossi Matias and Uzi Vishkin. Converting high probability into nearly-constant time-with applications to parallel hashing. In 23rd ACM Symp. on the Theory of Computing (STOC), pages 307–316, 1991.

[Vio12a]   Emanuele Viola. The complexity of distributions. SIAM J. on Computing, 41(1):191–218, 2012.

[Vio12b]   Emanuele Viola. Extractors for Turing-machine sources. In Workshop on Randomization and Computation (RANDOM), 2012.

[Vio14]    Emanuele Viola. Extractors for circuit sources. SIAM J. on Computing, 43(2):655–672, 2014.

The sandwich revolution: behind the paper

Louay Bazzi’s breakthrough paper “Polylogarithmic Independence Can Fool DNF Formulas” (2007) introduced the technique of sandwiching polynomials which is used in many subsequent works.  While some of these are about constant-depth circuits, for example those referenced in Bazzi’s text below, sandwiching polynomials have also been used to obtain results about sign polynomials, including for example a central limit theorem for k-wise independent random variables.  Rather than attempting an exhaustive list, I hope that the readers who are familiar with a paper using sandwiching polynomials can add a reference in the comments.

I view the technique of sandwiching polynomials as a good example of something simple — it follows immediately from LP duality — which is also extremely useful.

Bazzi has kindly provided  the following text for the second post of the series behind the paper.



Originally, my objective was to show that the quadratic residues PRG introduced by Alon, Goldreich, Håstad, and Peralta in 1992 looks random to something more powerful than parity functions, such as DNF formulas or small-width branching programs. My motivation was that the distribution of quadratic residues promises great derandomization capabilities. I was hoping to be able to use tools from number theory to achieve this goal. I worked on this problem for some time, but all the approaches I tried didn’t use more than what boils down to the small-bias property. In the beginning, my goal was to go beyond this property, but eventually I started to question how far we can go with this property alone in the context of DNF formulas. I turned to investigating the question of whether spaces with small bias fool DNF formulas, which led me to the dual question of whether one can construct sandwiching polynomials with low L1-norm in the Fourier domain for DNF formulas. I was not able to use the high frequencies in the Fourier spectrum in the context of DNF formulas. Thus I dropped the low L1-norm requirement and focused on the simpler special case of low-degree polynomials, which is equivalent to trying to show that limited independence fools DNF formulas. The approaches I tried in the beginning were based on lifting the k-wise independent probability distribution to the clauses and trying to reduce the problem to an LP with moment constraints. I started to believe that this approach wouldn’t work because I was ignoring the values of the moments which are specific to DNF formulas. While trying to understand the limitations of this approach and researching the related literature, I came across the 1990 paper of Linial and Nisan on approximate inclusion-exclusion, which excludes the approach I was having trouble with and conjectures the correctness of what I was trying to prove.

The attempts I tried later were all based on an L2-approximation of the formula by low-degree polynomials subject to the constraint that the polynomial is zero on all the zeros of the DNF formula. The difficulty was in the zeros constraint, which was needed to construct the sandwiching polynomials. Without the zeros constraint, the conjecture would follow from the Linial-Mansour-Nisan energy bound. I was not hoping that the LMN energy bound could be applied to the problem I was working on, since one can construct boolean functions which satisfy the LMN bound but violate the claim I was after. I was trying to construct the sandwiching polynomials by other methods …

Eventually, I was able to derive many DNF formulas from the original formula and apply the LMN energy bound to each of those formulas to prove the conjecture. Later on, the proof was simplified by Razborov and extended by Braverman to AC0.

Local reductions: Behind the paper

In the spirit of Reingold’s research-life stories, the series “behind the paper” collects snapshots of the generation of papers. For example, did you spend months proving an exciting bound, only to discover it was already known? Or what was the key insight which made everything fit together? Records of this baffling process are typically expunged from research publications. This is a place for them. The posts will have a technical component.




The classical Cook-Levin reduction of non-deterministic time to 3SAT can be optimized along two important axes.

Axis 1: The size of the 3SAT instance. The tableau proof reduces time T to a 3SAT instance of size O(T^2), but this has been improved to a quasilinear T · polylog(T) in the 70s and 80s, notably using the oblivious simulation by Pippenger and Fischer, and the non-deterministic, quasilinear equivalence between random-access and sequential machines by Gurevich and Shelah.

Axis 2: The complexity of computing the 3SAT instance. If you just want to write down the 3SAT instance in time poly(T), the complexity of doing so is almost trivial.  The vast majority of the clauses are fixed once you fix the algorithm and its running time, while each of the rest depends on a constant number of bits of the input to the algorithm.

However, things get more tricky if we want our 3SAT instance to enjoy what I’ll call clause-explicitness. Given an index i to a clause, it should be possible to compute that clause very efficiently, say in time polynomial in |i| = O(log T), which is logarithmic in the size of the formula. Still, it is another classical result that it is indeed possible to do so, yielding for example the NEXP completeness of succinct-3SAT (where your input is a circuit describing a 3SAT instance). More uses of clause explicitness can be found in a 2009 paper by Arora, Steurer, and Wigderson, where they show that interesting graph problems remain hard even on exponential-size graphs that are described by poly-size AC0 circuits.


I got more interested in the efficiency of reductions after Williams’ paper Improving exhaustive search implies superpolynomial lower bounds, because the inefficiency of available reductions was a bottleneck to the applicability of his connection to low-level circuit classes. Specifically, for a lower bound against a circuit class C, one needed a reduction to 3SAT that both has quasilinear blowup and is C-clause-explicit: computing the ith clause had to be done by a circuit from the class C on input i. For one thing, since previous reductions were at best NC1-clause-explicit, the technique wouldn’t apply to constant-depth classes.

I had some ideas how to obtain an AC0-clause-explicit reduction, when Williams’ sequel came out. This work did not employ more efficient reductions, instead it used the classical polynomial-size-clause-explicit reduction as a black-box together with an additional argument to more or less convert it to a constant-depth-clause-explicit one. This made my preliminary ideas a bit useless, since there was a bypass. However disappointing, a lot worse was to come.

I was then distracted by other things, then eventually returned to the topic. I still found it an interesting question whether a very clause-explicit reduction could be devised. First, it would remove Williams’ bypass, resulting in a possibly more direct proof. Second, the inefficiency of the reduction was still a bottleneck to obtain further lower bounds (more on this later).

The first step for me was to gain a deeper understanding of the classical quasilinear-size reduction, ignoring clause explicitness, so I ran a mini-polymath project in a Ph.D. class at NEU. The result is this survey, which presents a proof using sorting networks that may be conceptually simpler than the one based on Pippenger and Fischer’s oblivious simulation. The idea to use sorting is from the paper by Gurevich and Shelah, but if you follow the reductions without thinking you will make the sorting oblivious using the general, complicated simulation. About one hour after posting the fruit of months of collaborative work on ECCC, we were notified that this is Dieter van Melkebeek’s proof from Section 2.3 in his survey, and that this is the way he’s been teaching it for over a decade. This was a harder blow, yet worse was to come.

On the positive side, I am happy I have been exposed to this proof, which is strangely little-known.  Now I never miss an opportunity to teach my students



To try to stay positive I’ll add that our survey has reason to exist, perhaps, because it proves some technicalities that I cannot find elsewhere, and for completeness covers the required sorting network which has disappeared from standard algorithms textbooks.

Armed with this understanding, we went back to our original aim, and managed to show that reductions can be made constant-locality clause-explicit: each bit of the ith clause depends only on a constant number of bits of the index i. Note with constant locality you can’t even add 1 to the input in binary. This is a joint work with two NEU students: Hamid Jahanjou and Eric Miles. Eric will start a postdoc at UCLA in September.
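To see why adding 1 is out of reach for constant locality, here is a small brute-force Python sketch that measures the locality of a map on n bits: for each output bit, it counts the input bits that can influence it. The names `locality`, `increment`, and `rotate`, and the size n = 5, are mine and purely illustrative.

```python
from itertools import product


def locality(f, n):
    """Largest number of input bits that any single output bit of f
    depends on, found by exhaustive enumeration over {0,1}^n."""
    inputs = list(product((0, 1), repeat=n))
    outputs = {x: f(x) for x in inputs}
    num_outputs = len(outputs[inputs[0]])
    worst = 0
    for j in range(num_outputs):
        deps = 0
        for i in range(n):
            # Input bit i matters for output bit j if flipping it
            # changes output bit j on at least one input.
            if any(
                outputs[x][j]
                != outputs[x[:i] + (1 - x[i],) + x[i + 1:]][j]
                for x in inputs
            ):
                deps += 1
        worst = max(worst, deps)
    return worst


def increment(x):
    """Add 1 modulo 2^n; x is written most-significant bit first."""
    n = len(x)
    value = (int("".join(map(str, x)), 2) + 1) % (2 ** n)
    return tuple((value >> (n - 1 - k)) & 1 for k in range(n))


def rotate(x):
    """Cyclic shift: every output bit is a copy of one input bit."""
    return x[1:] + x[:1]


n = 5  # illustrative size
print(locality(rotate, n))     # 1
print(locality(increment, n))  # 5, i.e. all n bits
```

The most significant output bit of the increment flips exactly when all the lower bits are 1, so it depends on every input bit. A rotation, in contrast, has locality 1.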


The proof

Our first and natural attempt involved showing that the sorting network has the required level of explicitness, since that network is one of the things encoded in the SAT instance. We could make this network pretty explicit (in particular, DNF-clause-explicit). Kowalski and Van Melkebeek independently obtained similar results, leading to an AC0-clause-explicit reduction.

But we could not get constant locality, no matter how hard we dug in the bottomless pit of different sorting algorithms… on the bright side, when I gave the talk at Stanford and someone whom I hadn’t recognized asked “why can’t you just use the sorting algorithm in my thesis?” I knew immediately who this person was and what he was referring to.  Can you guess?

Then a conversation with Ben-Sasson made us realize that sorting was an overkill, and that we should instead switch to switching networks, as has long been done in the PCP literature, starting, to my knowledge, with the ’94 work of Polishchuk and Spielman. Both sorting and switching networks are made of nodes that take two inputs and output either the same two, or the two swapped. But whereas in sorting networks the node is a deterministic comparator, in switching networks there is an extra switch bit to select whether you should swap or not. Thanks to this relaxation the networks can be very simple. So this is the type of network that appears in our work.
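The difference between the two kinds of nodes fits in a few lines of Python. This is only a toy sketch on a hypothetical 3-wire layout of my own choosing (not the networks used in our paper): the same wiring acts as a sorting network when the nodes are comparators, and as a switching network when each node reads an extra switch bit.

```python
from itertools import permutations, product

# Hypothetical 3-wire layout: each pair names the two wires a node touches.
LAYOUT = [(0, 1), (1, 2), (0, 1)]


def comparator(a, b):
    """Sorting-network node: the data itself decides whether to swap."""
    return (a, b) if a <= b else (b, a)


def switch(a, b, bit):
    """Switching-network node: an external switch bit decides."""
    return (b, a) if bit else (a, b)


def run_sorting(values):
    """Run the layout with comparator nodes."""
    wires = list(values)
    for i, j in LAYOUT:
        wires[i], wires[j] = comparator(wires[i], wires[j])
    return tuple(wires)


def run_switching(values, bits):
    """Run the layout with switch nodes, one switch bit per node."""
    wires = list(values)
    for (i, j), bit in zip(LAYOUT, bits):
        wires[i], wires[j] = switch(wires[i], wires[j], bit)
    return tuple(wires)


# With comparators, the layout sorts every input order.
sorts_everything = all(
    run_sorting(p) == (0, 1, 2) for p in permutations(range(3))
)

# With switches, the layout realizes every permutation of the wires
# for a suitable choice of the three switch bits.
reachable = {
    run_switching((0, 1, 2), bits) for bits in product((0, 1), repeat=3)
}

print(sorts_everything)
print(len(reachable))
```

A switch does no comparison at all: it just routes according to its bit. Since the switch bits are supplied from outside rather than computed from the data, each node stays very simple.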

Sorting isn’t all that there is to it.  One more thing is that any log-space uniform circuit can be made constant-locality uniform, in the sense that given an index to a gate you can compute its children by a map where each output bit depends on a constant number of input bits.  The techniques to achieve this are similar to those used in various equivalences between uniformity conditions established by Ruzzo in the 1979-1981 paper On Uniform Circuit Complexity, which does not seem to be online.  Ruzzo’s goal probably was not constant locality, so that is not established in his paper.  This requires some more work; for one thing, with constant locality you can’t check if your input is a valid index to a gate or a junk string, so you have to deal with that.

Of course, in the 3rd millennium we should not reduce merely to SAT, but to GAP-SAT. In a more recent paper with Ben-Sasson we gave a variant of the BGHSV PCP reduction where each query is just a projection of the input index (and the post-process is a 3CNF). Along the way we also get a reduction to 3SAT that is not constant-locality clause-explicit, but after you fix few bits it becomes locality-1 clause-explicit.  In general, it is still an open problem to determine the minimum amount of locality, and it is not even clear to me how to rule out locality 1.


One thing that this line of works led to is the following. Let the complexity of 3SAT be c^n. The current (deterministic) record is

c < 1.34…

We obtain that if

c < 1.10…

then you get some circuit lower bounds that, however modest, we don’t know how to prove otherwise.