Myth creation: The switching lemma

The history of science is littered with anecdotes about misplaced credit. Because it does not matter if it was A or B who did it; it only matters if it was I or not I. In this spirit I am starting a series of posts about such misplaced credit, which I hesitated before calling more colorfully “myth creation.” Before starting, I want to make absolutely clear that I am in no way criticizing the works themselves or their authors. In fact, many are among my favorites. Moreover, at least in the examples I have in mind right now, the authors do place their work in the appropriate context with the use of citations etc. My only point is the credit that the work has received within and without our community (typically due to inertia and snowball effects rather than anything else).

Of course, at some level this doesn’t matter. You can call Chebyshev’s polynomials rainbow sprinkles and the math doesn’t change. And yet at some other level maybe it does matter a little, for science isn’t yet a purely robotic activity. With these posts I will advertise unpopular points of view that might be useful, for example to researchers who are junior or from different communities.

The switching lemma

I must admit I had a good run — Johan Hastad (privately to the blogger)

Random restrictions have been used in complexity theory since at least the 60’s [Sub61]. The first dramatic use in the context of AC0 is due to [FSS84, Ajt83]. These works proved a switching lemma: the amazing fact that a DNF gets simplified by a random restriction to the point that it can be written as a CNF, so you can collapse layers and induct. (An exposition is given below.) Using it, they proved super-polynomial lower bounds for AC0. The proof in [FSS84] is very nice, and if I want to get a quick intuition of why switching is at all possible, I often go back to it. [Ajt83] is also a brilliant paper, and long, unavailable online for free, filled with a logical notation which makes some people twitch. The first symbol of the title says it all, and may be the most obscene ever chosen:

\begin{aligned} \Sigma _{1}^{1}. \end{aligned}

Subsequently, [Yao85] proved exponential lower bounds of the form 2^{n^{c}}, with a refined analysis of the switching lemma. The bounds are tight, except for the constant c which depends on the depth of the circuit. Finally, the star of this post [Has86, Has87] obtained c=1/(depth-1).

Yao’s paper doesn’t quite state that a DNF can be written exactly as a CNF, but it states that it can be approximated. Hastad’s work is the first to prove that a DNF can be written as a CNF, and in this sense his statement is cleaner than Yao’s. However, Yao’s paper states explicitly that a small circuit, after being hit by a restriction, can be set to constant by fixing few more bits.

The modern formulation of the switching lemma says that a DNF can be written as a shallow decision tree (and hence a small CNF). This formulation in terms of decision trees is actually not explicit in Hastad’s work. Beame, in his primer [Bea94], credits Cai with this idea and mentions several researchers noted Hastad’s proof works in this way.

Another switching lemma trivia is that the proof in Hastad’s thesis is actually due to Boppana; Hastad’s original argument — of which apparently no written record exists — was closer to Razborov’s later proof.

So, let’s recap. Random restrictions are already in [Sub61]. The idea of switching is already in [FSS84, Ajt83]. You already had three analyses of these ideas, two giving superpolynomial lower bounds and one [Yao85] giving exponential. The formulation in terms of decision trees isn’t in [Has87], and the proof that appears in [Has87] is due to Boppana.

Still, I would guess [Has87] is more well known than all the other works above combined. [Yao85] did have a following at the time — I think it appeared in the pop news. But hey — have you ever heard of Yao’s switching lemma?

The current citation counts offer mixed support for my thesis:

FSS: 1351

Y: 732

H – paper “Almost optimal…”: 867

H – thesis: 582

But it is very hard to use citation information. The two H citations overlap, and papers are cited for various reasons. For example FSS got a ton of citations for the connection to oracles (which has nothing to do with switching lemmas).

Instead it’s instructive to note the type of citations that you can find in the literature:

Hastad’s switching lemma is a cornerstone of circuit complexity [No mention of FSS, A, Y]

Hastad‘s Switching Lemma is one of the gems of computational complexity [Notes below in passing it builds on FSS, A, Y]

The wikipedia entry is also telling:

In computational complexity theory, Hastad’s switching lemma is a key tool for proving lower bounds on the size of constant-depth Boolean circuits. Using the switching lemma, Johan Hastad (1987) showed that... [No mention of FSS,A,Y]

I think that 99% of the contribution of this line of research is the amazing idea that random restrictions simplify a DNF so that you can write it as a CNF and collapse. 90% of the rest is analyzing this to get superpolynomial lower bounds. And 90% of whatever is left is analyzing this to get exponential lower bounds.

Going back to something I mentioned at the beginning, I want to emphasize that Hastad during talks makes a point of reminding the audience that the idea of random restrictions is due to Sipser, and of Boppana’s contribution. And I also would like to thank him for his help with this post.

OK — so maybe this is so, but it must then be the case that [Has87] is the final word on this stuff, like the ultimate tightest analysis that kills the problem. Actually, it is not tight in some regimes of interest, and several cool works of past and recent times address that. In the end, I can only think of one reason why [Has87] entered the mythology in ways that other works did not, the reason that I carefully sidestepped while composing this post: å.

Perhaps one reason behind the aura of the switching lemma is that it’s hard to find examples. It would be nice to read: If you have this extreme DNF here’s what happens, on the other hand for this other extreme DNF here’s what happens, and in general this always works and here’s the switching lemma. Examples are forever – Erdos. Instead the switching lemma is typically presented as blam!: an example-free encoding argument which feels deus ex machina, as in this crisp presentation by Thapen. For a little more discussion, I liked Bogdanov’s lecture notes. Next I give a slightly different exposition of the encoding argument.

The simplest case: Or of n bits.

Here the circuit C is simply the Or of n bits x_{1},x_{2},\ldots ,x_{n}. This and the next case can be analyzed in more familiar ways, but the benefit of the encoding argument presented next is that it will extend to the general case more easily… arguably. Anyway, it’s also just fun to learn a different argument.

So, let’s take a random restriction \rho with exactly s stars. Some of the bits may become 0, others 1, and others yet may remain unfixed, i.e., assigned to stars. Those that become 0 you can ignore, while if some become 1 then the whole circuit becomes 1.

We will show that the number of restrictions for which the restricted circuit C|_{\rho } requires decision trees of depth \ge d is small. To accomplish this, we are going to encode/map such restrictions using/to a restriction… with no stars (that is, just a 0/1 assignment to the variables). The gain is clear: just think of a restriction with zero stars versus a restriction with one star. The latter are more numerous by a factor of about n, the number of variables.

A critical observation is that we only want to encode restrictions for which C|_{\rho } requires large depth. So \rho does not map any variable to 1, for else the Or is 1 which has decision trees of depth 0.

The way we are going to encode \rho is this: Simply replace the stars with ones. To go back, replace the ones with stars. We are using the ones in the encoding to “signal” where the stars are.

Hence, the number of bad restrictions is at most 2^{n}, which is tiny compared to the number \binom {n}{s}2^{n-s} of restrictions with s stars.
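For small n the encoding above can be checked by brute force. The following Python sketch is only an illustration and not part of the original argument: the names restrictions, or_depth, encode and decode are mine, and a restriction is represented as a tuple over 0, 1 and "*".

```python
from itertools import product
from math import comb

STAR = "*"

def restrictions(n, s):
    """All restrictions of n variables with exactly s stars."""
    return [r for r in product((0, 1, STAR), repeat=n) if r.count(STAR) == s]

def or_depth(rho):
    """Decision-tree depth of the Or of the variables after applying rho."""
    if 1 in rho:
        return 0               # the Or is the constant 1
    return rho.count(STAR)     # the Or of k free bits needs depth k (0 if k = 0)

def encode(rho):
    """Replace stars with ones."""
    return tuple(1 if v == STAR else v for v in rho)

def decode(code):
    """Replace ones with stars (valid only when rho had no ones)."""
    return tuple(STAR if v == 1 else v for v in code)

n, s, d = 5, 2, 1
all_rho = restrictions(n, s)
bad = [r for r in all_rho if or_depth(r) >= d]   # restrictions needing depth >= d
codes = {encode(r) for r in bad}                 # star-free encodings
```

Since decode inverts encode on the bad restrictions, the map is injective, so the bad restrictions number at most 2^{n}.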

The medium case: Or of functions on disjoint inputs.

Instead of working with DNFs, I will consider a circuit C which is the Or of arbitrary functions f_{i} each on w bits. You can immediately get this formulation from the usual one for DNFs, but I still find it a little useful since otherwise you might think there is something special about DNFs. What is special is that you take the Or of the functions, and we will exploit this again shortly.

In this warm-up case, we start with functions on disjoint inputs. So, again, let’s take a random restriction \rho with exactly s stars. Some of the functions may become 0, others 1, and others yet may remain unfixed. Those that become 0 you can ignore, while if some become 1 then the whole circuit becomes 1.

As before, we will show that the number of restrictions for which the restricted circuit C|_{\rho } requires decision trees of depth \ge d is small. To accomplish this, we are going to encode/map such restrictions using/to a restriction with just s-d stars, plus a little more information. As we saw already, the gain in reducing the number of stars is clear. In particular, standard calculations show that saving d stars reduces the number of restrictions by a factor O(s/n)^{d}. The auxiliary information will give us a factor of w^{d}, leading to the familiar bound O(ws/n)^{d}.
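The "standard calculations" can be spelled out: the number of restrictions with s stars is \binom {n}{s}2^{n-s}, so saving d stars gains a factor 2^{d}\prod _{i<d}(s-i)/(n-s+d-i), which is at most (2s/(n-s))^{d}. Here is a minimal Python sketch that checks this exactly for some parameters; the function names are mine and the specific bound (2s/(n-s))^{d} is one convenient way to make O(s/n)^{d} concrete, assuming s\le n/2.

```python
from fractions import Fraction
from math import comb

def num_restrictions(n, s):
    """Number of restrictions of n variables with exactly s stars."""
    return comb(n, s) * 2 ** (n - s)

def saving_ratio(n, s, d):
    """Exact factor gained by going from s stars to s - d stars."""
    return Fraction(num_restrictions(n, s - d), num_restrictions(n, s))

def ratio_bound(n, s, d):
    """The bound (2s / (n - s))^d, which is O(s/n)^d when s <= n/2."""
    return Fraction(2 * s, n - s) ** d
```

For example, with n=100, s=10 and d=1 the exact ratio is 20/91, just below the bound 20/90.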

As before, recall that we only want to encode restrictions for which C|_{\rho } requires large depth. So no function in C|_{\rho } is 1, for else the circuit is 1 and has decision trees of depth 0. Also, you have d stars among inputs to functions that are unfixed (i.e., not even fixed to 0), for else again you can compute the function reading fewer than d bits. Because the functions are unfixed, there is a setting for those d stars (and possibly a few more stars – that would only help the argument) that makes the corresponding functions 1. We are going to pick precisely that setting in our restriction \rho ' with s-d stars. This allows us to “signal” which functions had inputs with the stars we are saving (namely, those that are the constant 1). To completely recover \rho , we simply add extra information to indicate where the stars were. The saving here is that we only have to say where the stars are among w symbols, not n.

The general case: Or of functions on any subset of w bits.

First, the number of functions does not play a role, so you can think you have functions on any possible subset of w bits, where some functions may be constant. The idea is the same, except we have to be slightly more careful because when we set values for the stars in one function we may also affect other functions. The idea is simply to fix one function at a time. Specifically, starting with \rho , consider the first function f that’s not made constant by \rho . So the inputs to f have some stars. As before, let us replace the stars with constants that make the function f equal to the constant 1, and append the extra information that allows us to recover where these stars were in \rho .

We’d like to repeat the argument. Note however we only have guarantees about C|_{\rho }, not C|_{\rho } with some stars replaced with constants that make f equal to 1. We also can’t just jump to the 2nd function that’s not constant in C|_{\rho }, since the “signal” fixing for that might clash with the fixing for the first – this is where the overlap in inputs makes things slightly more involved. Instead, because C|_{\rho } required decision tree depth at least d, we note there has to be some assignment to the m stars in the input to f so that the resulting, further restricted circuit still requires decision tree depth \ge d-m (else C|_{\rho } has decision trees of depth <d). We append this assignment to the auxiliary information and we continue the argument using the further restricted circuit.
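The last step, that some assignment to the m stars of f leaves depth \ge d-m, can be checked on a toy circuit. The sketch below is my own brute-force illustration (exponential time, tiny inputs only); dt_depth, completions and the example circuit are hypothetical names chosen for this post, and a restriction is a dict from variable index to bit.

```python
from itertools import product

def completions(n, assignment):
    """All inputs in {0,1}^n consistent with a partial assignment (dict index -> bit)."""
    free = [i for i in range(n) if i not in assignment]
    for bits in product((0, 1), repeat=len(free)):
        x = [0] * n
        for i, v in assignment.items():
            x[i] = v
        for i, b in zip(free, bits):
            x[i] = b
        yield tuple(x)

def dt_depth(f, n, assignment=None):
    """Minimum decision-tree depth of f restricted by assignment (brute force)."""
    assignment = dict(assignment or {})
    if len({f(x) for x in completions(n, assignment)}) == 1:
        return 0                                   # the restricted function is constant
    free = [i for i in range(n) if i not in assignment]
    return 1 + min(
        max(dt_depth(f, n, {**assignment, i: b}) for b in (0, 1))
        for i in free
    )

def circuit(x):
    """C = (x0 and x1) or (x2 and x3): an Or of two functions on w = 2 bits."""
    return int((x[0] and x[1]) or (x[2] and x[3]))

d = dt_depth(circuit, 4)          # depth of C with all four variables free
stars_of_f = (0, 1)               # the stars in the input to the first function
m = len(stars_of_f)
best = max(
    dt_depth(circuit, 4, dict(zip(stars_of_f, bits)))
    for bits in product((0, 1), repeat=m)
)
```

Here d=4 and the best assignment to the two stars of the first function leaves depth 2=d-m, while the "signal" assignment x0=x1=1 makes the circuit constant.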


[Ajt83]    Miklós Ajtai. \Sigma _{1}^{1}-formulae on finite structures. Annals of Pure and Applied Logic, 24(1):1–48, 1983.

[Bea94]   Paul Beame. A switching lemma primer. Technical Report UW-CSE-95-07-01, Department of Computer Science and Engineering, University of Washington, November 1994. Available from

[FSS84]   Merrick L. Furst, James B. Saxe, and Michael Sipser. Parity, circuits, and the polynomial-time hierarchy. Mathematical Systems Theory, 17(1):13–27, 1984.

[Has86]   Johan Håstad. Almost optimal lower bounds for small depth circuits. In Juris Hartmanis, editor, Proceedings of the 18th Annual ACM Symposium on Theory of Computing, May 28-30, 1986, Berkeley, California, USA, pages 6–20. ACM, 1986.

[Has87]   Johan Håstad. Computational limitations of small-depth circuits. MIT Press, 1987.

[Sub61]   B. A. Subbotovskaya. Realizations of linear functions by formulas using +, *, -. Soviet Mathematics-Doklady, 2:110–112, 1961.

[Yao85]   Andrew Yao. Separating the polynomial-time hierarchy by oracles. In 26th IEEE Symp. on Foundations of Computer Science (FOCS), pages 1–10, 1985.

The ab-normal reach of norm-al proofs in non-abelian Fourier analysis

Fourier analysis over (not necessarily abelian) groups is a cool proof technique that yields many results of interest to theoretical computer science. Often the goal is to show “mixing” or “pseudo/quasi randomness” of appropriate distributions. This post isn’t about the formal statements or applications or proofs or even the credit of these results; for some of this you can see e.g. the references below, or a survey of mine [Vio19], or a survey [Gow17] by Gowers.

Instead this post is about an uncanny development of the proofs of some of these results. Whereas the original proofs were somewhat complicated, in some cases involving heavy mathematical machinery, later there emerged proofs that I propose to call norm-al (or normal for simplicity) because they only involve manipulations that can be cast as norm inequalities, such as Cauchy-Schwarz. Normal proofs I view as just a little up in complexity from proofs that are simply opening up definitions. They can involve the latter, or norm inequalities, and they need not be tight. An example of an ab-norm-al proof would be one that involves induction or iterative arguments, or probabilistic/double-counting/pigeon-hole methods. Making this a little more precise seems to require a lot of discussion, and may not even be possible, so let me stop here and move on with the examples of the proofs which became norm-al. They all involve quasirandom groups [Gow08], but even non-quasirandom groups mix in a certain sense [GV22] and the proofs there are again norm-al (it’s just that I don’t know of earlier proofs in this case).

The first example concerns the quintessential mixing result in this area: If you’ve got independent distributions X and Y, and if each distribution is uniform over say a constant fraction of the group, then the product XY (a.k.a. convolution, a.k.a. sample from each and output the product) is close to uniform over the entire group. A norm-al proof appears in [Gow17] which also contains pointers to previous proofs.

The second is mixing of three-term progressions. A norm-al proof appears in [BHR22]. From their abstract: “Surprisingly, unlike the proofs of Tao and Peluse, our proof is elementary and only uses basic facts from nonabelian Fourier analysis.”

The third is interleaved mixing, see a recent preprint with Derksen.

Moreover, in the second and third example the proofs are not only simpler, they are also more general in that they apply to any quasirandom group whereas previous proofs only applied to prominent special cases.

Why is all of this happening? One can only speculate that the reach of norm-al proofs in non-abelian Fourier analysis is still just emerging.


[BHR22]   Amey Bhangale, Prahladh Harsha, and Sourya Roy. Mixing of 3-term progressions in quasirandom groups. In Mark Braverman, editor, ACM Innovations in Theoretical Computer Science conf. (ITCS), volume 215 of LIPIcs, pages 20:1–20:9. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2022.

[Gow08]    W. T. Gowers. Quasirandom groups. Combinatorics, Probability & Computing, 17(3):363–387, 2008.

[Gow17]    W. T. Gowers. Generalizations of Fourier analysis, and how to apply them. Bull. Amer. Math. Soc. (N.S.), 54(1):1–44, 2017.

[GV22]    W. T. Gowers and Emanuele Viola. Mixing in non-quasirandom groups. In ACM Innovations in Theoretical Computer Science conf. (ITCS), 2022.

[Vio19]    Emanuele Viola. Non-abelian combinatorics and communication complexity. SIGACT News, Complexity Theory Column, 50(3), 2019.

Data-structure lower bounds without encoding arguments

I have recently posted the paper [Vio21] which does something that I have been trying to do for a long time, more than ten years, on and off. Consider the basic data-structure problem of storing m bits of data x\in \{0,1\}^{m} into m+r bits so that the prefix-sum queries

\begin{aligned} \textsc {Rank}(i):=\sum _{j\le i}x_{j} \end{aligned}

can be computed by probing q cells (or words) of w bits each. (You can think w=\log m throughout this post.) The paper [PV10] with Pǎtraşcu shows that r\ge m/w^{O(q)}, and this was recently shown to be tight by Yu [Yu19] (building on the breakthrough data structure [Pǎt08] which motivated the lower bound and is not far from it).
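To fix ideas about m, r, q and w, here is a naive baseline in Python. It is not the data structure of [Pǎt08] or [Yu19], and the class and method names are mine: it stores the data in cells of w bits plus one running count per cell, so a query makes q=2 probes at the price of a large redundancy r. Positions are 0-indexed and the counters are assumed to fit in \lceil \log _{2}(m+1)\rceil bits.

```python
from math import ceil, log2

class BlockRank:
    """Naive rank structure: data cells of w bits plus one running count per cell."""

    def __init__(self, bits, w):
        self.w = w
        self.cells = [bits[k:k + w] for k in range(0, len(bits), w)]
        self.counts = []          # counts[b] = number of ones before cell b
        total = 0
        for cell in self.cells:
            self.counts.append(total)
            total += sum(cell)

    def rank(self, i):
        """Number of ones among positions 0..i (inclusive), using q = 2 probes."""
        block, offset = divmod(i, self.w)
        # Probe 1: the counter cell. Probe 2: the data cell containing position i.
        return self.counts[block] + sum(self.cells[block][:offset + 1])

    def redundancy_bits(self, m):
        """Extra bits r beyond the m data bits, with counters of ceil(log2(m + 1)) bits."""
        return len(self.counts) * ceil(log2(m + 1))

bits = [1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1]
ds = BlockRank(bits, w=5)
```

With m=20 and w=5 this baseline spends r=20 extra bits, as many as the data itself, which is the kind of waste that the succinct structures discussed in this post avoid.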

As is common in data-structure lower bounds, the proof in [PV10] is an encoding argument. In the recently posted paper, an alternative proof is presented which avoids the encoding argument and is perhaps more in line with other proofs in complexity lower bounds. Of course, everything is an encoding argument, and nothing is an encoding argument, and this post won’t draw a line.

The new proof establishes an intrinsic property of efficient data structures, whereas typical proofs including [PV10] are somewhat tailored to the problem at hand. The property is called the separator and is a main technical contribution of the work. At the high level the separator shows that in any efficient data structure you can restrict the input space a little so that many queries are nearly pairwise independent.

Also, the new proof rules out a stronger object: a sampler (see previous post here on sampling lower bounds). Specifically, the distribution Rank(U) where U is the uniform distribution cannot be sampled, not even slightly close, by an efficient cell-probe algorithm. This implies the data-structure result, and it can be informally interpreted as saying that the “reason” why the lower bound holds is not that the data is compressed, but rather that one can’t generate the type of dependencies occurring in Rank via an efficient cell-probe algorithm, regardless of what the input is.

Building on this machinery, one can prove several results about sampling, like showing that cell-probe samplers are strictly weaker than AC0 samplers. While doing this, it occurred to me that one gets a corollary for data structures which I had not seen in the literature. The corollary is a probe hierarchy, showing that some problem can be solved with zero redundancy (r=0) with O(q) probes, while it requires almost linear r for q probes. For example I don’t know of a result yielding this for small q such as q=O(1); I would appreciate a reference. (As mentioned in the paper, the sampling viewpoint is not essential and just like for Rank one can prove the data-structure corollaries directly. Personally, and obviously, I find the sampling viewpoint useful.)

One of my favorite open problems in the area still is: can a uniform distribution over [m] be approximately sampled by an efficient cell-probe algorithm? I can’t even rule out samplers making two probes!


[Pǎt08]   Mihai Pǎtraşcu. Succincter. In 49th IEEE Symp. on Foundations of Computer Science (FOCS). IEEE, 2008.

[PV10]   Mihai Pǎtraşcu and Emanuele Viola. Cell-probe lower bounds for succinct partial sums. In 21st ACM-SIAM Symp. on Discrete Algorithms (SODA), pages 117–122, 2010.

[Vio21]   Emanuele Viola. Lower bounds for samplers and data structures via the cell-probe separator. Available at, 2021.

[Yu19]    Huacheng Yu. Optimal succinct rank data structure via approximate nonnegative tensor decomposition. In Moses Charikar and Edith Cohen, editors, ACM Symp. on the Theory of Computing (STOC), pages 955–966. ACM, 2019.

Non-abelian combinatorics and communication complexity

Below and here in pdf is a survey I am writing for SIGACT, due next week.  Comments would be very helpful.

Finite groups provide an amazing wealth of problems of interest to complexity theory. And complexity theory also provides a useful viewpoint of group-theoretic notions, such as what it means for a group to be “far from abelian.” The general problem that we consider in this survey is that of computing a group product g=x_{1}\cdot x_{2}\cdot \cdots \cdot x_{n} over a finite group G. Several variants of this problem are considered in this survey and in the literature, including in [KMR66, Bar89, BC92, IL95, BGKL03, PRS97, Amb96, AL00, Raz00, MV13, Mil14, GVa].

Some specific, natural computational problems related to g are, from hardest to easiest:

(1) Computing g,

(2) Deciding if g=1_{G}, where 1_{G} is the identity element of G, and

(3) Deciding if g=1_{G} under the promise that either g=1_{G} or g=h for a fixed h\ne 1_{G}.

Problem (3) is from [MV13]. The focus of this survey is on (2) and (3).

We work in the model of communication complexity [Yao79], with which we assume familiarity. For background see [KN97, RY19]. Briefly, the terms x_{i} in a product x_{1}\cdot x_{2}\cdot \cdots \cdot x_{n} will be partitioned among collaborating parties – in several ways – and we shall bound the number of bits that the parties need to exchange to solve the problem.


We begin in Section 2 with two-party communication complexity. In Section 3 we give a streamlined proof, except for a step that is only sketched, of a result of Gowers and the author [GV15, GVb] about interleaved group products. In particular we present an alternative proof, communicated to us by Will Sawin, of a lemma from [GVa]. We then consider two models of three-party communication. In Section 4 we consider number-in-hand protocols, and we relate the communication complexity to so-called quasirandom groups [Gow08, BNP08]. In Section 6 we consider number-on-forehead protocols, and specifically the problem of separating deterministic and randomized communication. In Section 7 we give an exposition of a result by Austin [Aus16], and show that it implies a separation that matches the state-of-the-art [BDPW10] but applies to a different problem.

Some of the sections follow closely a set of lectures by the author [Vio17]; related material can also be found in the blog posts [Vioa, Viob]. One of the goals of this survey is to present this material in a more organized manner, in addition to including new material.

2 Two parties

Let G be a group and let us start by considering the following basic communication task. Alice gets an element x\in G and Bob gets an element y\in G and their goal is to check if x\cdot y=1_{G}. How much communication do they need? Well, x\cdot y=1_{G} is equivalent to x=y^{-1}. Because Bob can compute y^{-1} without communication, this problem is just a rephrasing of the equality problem, which has a randomized protocol with constant communication. This holds for any group.

The same is true if Alice gets two elements x_{1} and x_{2} and they need to check if x_{1}\cdot y\cdot x_{2}=1_{G}. Indeed, it is just checking equality of y and x_{1}^{-1}\cdot x_{2}^{-1}, and again Alice can compute the latter without communication.

Things get more interesting if both Alice and Bob get two elements and they need to check if the interleaved product of the elements of Alice and Bob equals 1_{G}, that is, if

\begin{aligned} x_{1}\cdot y_{1}\cdot x_{2}\cdot y_{2}=1_{G}. \end{aligned}

Now the previous transformations don’t help anymore. In fact, the complexity depends on the group. If it is abelian then the elements can be reordered and the problem is equivalent to checking if (x_{1}\cdot x_{2})\cdot (y_{1}\cdot y_{2})=1_{G}. Again, Alice can compute x_{1}\cdot x_{2} without communication, and Bob can compute y_{1}\cdot y_{2} without communication. So this is the same problem as before and it has a constant communication protocol.
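The reordering step can be checked exhaustively on small groups. The sketch below is only an illustration with names of my choosing: it compares the interleaved product with the reordered one over the abelian group \mathbb {Z}_{6} and over the non-abelian symmetric group on 3 points, with permutations represented as tuples.

```python
from itertools import permutations, product

def compose(p, q):
    """Product p*q of permutations given as tuples: (p*q)(i) = p[q[i]]."""
    return tuple(p[q[i]] for i in range(len(p)))

def interleaved(mul, x1, y1, x2, y2):
    """x1 * y1 * x2 * y2."""
    return mul(mul(mul(x1, y1), x2), y2)

def reordered(mul, x1, y1, x2, y2):
    """(x1 * x2) * (y1 * y2): each factor is computable by one party alone."""
    return mul(mul(x1, x2), mul(y1, y2))

def reordering_always_works(elements, mul):
    """True if the two products agree on every quadruple of group elements."""
    return all(
        interleaved(mul, *quad) == reordered(mul, *quad)
        for quad in product(elements, repeat=4)
    )

def add_mod_6(a, b):
    return (a + b) % 6

z6 = list(range(6))                   # abelian: integers modulo 6 under addition
s3 = list(permutations(range(3)))     # non-abelian: all permutations of 3 points
```

Over z6 the reordering is always valid, while over s3 it already fails for some quadruple.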

For non-abelian groups this reordering cannot be done, and the problem seems hard. This can be formalized for a class of groups that are “far from abelian” – or we can take this result as a definition of being far from abelian. One of the groups that works best in this sense is the following, first constructed by Galois in the 1830’s.

Definition 1. The special linear group SL(2,q) is the group of 2\times 2 invertible matrices over the field \mathbb{F} _{q} with determinant 1.

The following result was asked in [MV13] and was proved in [GVa].

Theorem 1. Let G=SL(2,q) and let h\ne 1_{G}. Suppose Alice receives x_{1},x_{2}\in G and Bob receives y_{1},y_{2}\in G. They are promised that x_{1}\cdot y_{1}\cdot x_{2}\cdot y_{2} either equals 1_{G} or h. Deciding which case it is requires randomized communication \Omega (\log |G|).

This bound is tight as Alice can send her input, taking O(\log |G|) bits. We present the proof of this theorem in the next section.

Similar results are known for other groups as well, see [GVa] and [Sha16]. For example, one group that is “between” abelian groups and SL(2,q) is the following.

Definition 2. The alternating group A_{n} is the group of even permutations of 1,2,\ldots ,n.

If we work over A_{n} instead of SL(2,q) in Theorem 1 then the communication complexity is \Omega (\log \log |G|) [Sha16]. The latter bound is tight [MV13]: with knowledge of h, the parties can agree on an element a\in \{1,2,\ldots ,n\} such that h(a)\ne a. Hence they only need to keep track of the image of a. This takes communication O(\log n)=O(\log \log |A_{n}|) because |A_{n}|=n!/2. In more detail, the protocol is as follows. First Bob sends y_{2}(a). Then Alice sends x_{2}y_{2}(a). Then Bob sends y_{1}x_{2}y_{2}(a) and finally Alice can check if x_{1}y_{1}x_{2}y_{2}(a)=a.
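The protocol just described can be simulated in a few lines. This is a sketch under my own conventions: permutations are tuples acting on \{0,\ldots ,n-1\} rather than \{1,\ldots ,n\}, the product p\cdot q means "apply q first", and promise_instance is a hypothetical helper that generates inputs satisfying the promise.

```python
import random
from itertools import permutations

def compose(p, q):
    """Product p*q of permutations given as tuples: (p*q)(i) = p[q[i]]."""
    return tuple(p[q[i]] for i in range(len(p)))

def inverse(p):
    inv = [0] * len(p)
    for i, v in enumerate(p):
        inv[v] = i
    return tuple(inv)

def is_even(p):
    inversions = sum(p[i] > p[j] for i in range(len(p)) for j in range(i + 1, len(p)))
    return inversions % 2 == 0

def protocol(x1, x2, y1, y2, a):
    """Alice holds x1, x2; Bob holds y1, y2. Returns (accept, messages)."""
    m1 = y2[a]     # Bob sends y2(a)
    m2 = x2[m1]    # Alice sends x2 y2(a)
    m3 = y1[m2]    # Bob sends y1 x2 y2(a)
    return x1[m3] == a, (m1, m2, m3)   # Alice checks x1 y1 x2 y2(a) = a

def promise_instance(group, g, rng):
    """Uniform x1, y1, x2 and the unique y2 making x1*y1*x2*y2 = g."""
    x1, y1, x2 = (rng.choice(group) for _ in range(3))
    y2 = compose(inverse(compose(compose(x1, y1), x2)), g)
    return x1, x2, y1, y2

n = 5
a5 = [p for p in permutations(range(n)) if is_even(p)]   # the alternating group A_5
identity = tuple(range(n))
h = (1, 2, 0, 3, 4)   # a 3-cycle, so h is even and h(0) != 0
a = 0                 # the agreed point with h(a) != a
```

Each of the three messages is a point in \{0,\ldots ,n-1\}, so the total communication is O(\log n) bits.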

Interestingly, to decide if g=1_{G} without the promise a stronger lower bound can be proved for many groups, including A_{n}, see Corollary 3 below.

In general, it seems an interesting open problem to try to understand for which groups Theorem 1 applies. For example, is the communication large for every quasirandom group [Gow08]?

Theorem 1 and the corresponding results for other groups also scale with the length of the product: for example deciding if x_{1}\cdot y_{1}\cdot x_{2}\cdot y_{2}\cdots x_{n}\cdot y_{n}=1_{G} over G=SL(2,q) requires communication \Omega (n\log |G|) which is tight.

A strength of the above results is that they hold for any choice of h in the promise. This makes them equivalent to certain mixing results, discussed below in Section 5.0.1. Next we prove two other lower bounds that do not have this property and can be obtained by reduction from disjointness. First we show that for any non-abelian group G there exists an element h such that deciding if g=1_{G} or g=h requires communication linear in the length of the product. Interestingly, the proof works for any non-abelian group. The choice of h is critical, as for some G and h the problem is easy. For example: take any group G and consider H:=G\times \mathbb {Z}_{2} where \mathbb {Z}_{2} is the group of integers with addition modulo 2. Distinguishing between 1_{H}=(1_{G},0) and h=(1_{G},1) amounts to computing the parity of (the \mathbb {Z}_{2} components of) the input, which takes constant communication.

Theorem 2. Let G be a non-abelian group. There exists h\in G such that the following holds. Suppose Alice receives x_{1},x_{2},\ldots ,x_{n} and Bob receives y_{1},y_{2},\ldots ,y_{n}. They are promised that x_{1}\cdot y_{1}\cdot x_{2}\cdot y_{2}\cdot \cdots \cdot x_{n}\cdot y_{n} either equals 1_{G} or h. Deciding which case it is requires randomized communication \Omega (n).

Proof. We reduce from unique set-disjointness, defined below. For the reduction we encode the And of two bits s,t\in \{0,1\} as a group product. This encoding is similar to the famous puzzle that asks to hang a picture on a wall with two nails in such a way that the picture falls if either one of the nails is removed. Since G is non-abelian, there exist a,b\in G such that a\cdot b\neq b\cdot a, and in particular a\cdot b\cdot a^{-1}\cdot b^{-1}=h with h\neq 1. We can use this fact to encode the And of s and t as

\begin{aligned} a^{s}\cdot b^{t}\cdot a^{-s}\cdot b^{-t}=\begin {cases} 1 & \text {if And}(s,t)=0\\ h & \text {otherwise} \end {cases}. \end{aligned}

In the disjointness problem Alice and Bob get inputs x,y\in \{0,1\}^{n} respectively, and they wish to check if there exists an i\in [n] such that x_{i}\land y_{i}=1. If you think of x,y as characteristic vectors of sets, this problem is asking if the sets have a common element or not. The communication of this problem is \Omega (n) [KS92, Raz92]. Moreover, in the “unique” variant of this problem where the number of such i’s is 0 or 1, the same lower bound \Omega (n) still applies. This follows from [KS92, Raz92] – see also Proposition 3.3 in [AMS99]. For more on disjointness see the surveys [She14, CP10].

We will reduce unique disjointness to group products. For x,y\in \{0,1\}^{n} we produce inputs for the group problem as follows:

\begin{aligned} x & \rightarrow (a^{x_{1}},a^{-x_{1}},\ldots ,a^{x_{n}},a^{-x_{n}})\\ y & \rightarrow (b^{y_{1}},b^{-y_{1}},\ldots ,b^{y_{n}},b^{-y_{n}}). \end{aligned}

The group product becomes

\begin{aligned} \underbrace {a^{x_{1}}\cdot b^{y_{1}}\cdot a^{-x_{1}}\cdot b^{-y_{1}}}_{\text {1 bit}}\cdots a^{x_{n}}\cdot b^{y_{n}}\cdot a^{-x_{n}}\cdot b^{-y_{n}}. \end{aligned}

If there isn’t an i\in [n] such that x_{i}\land y_{i}=1, then for each i the term a^{x_{i}}\cdot b^{y_{i}}\cdot a^{-x_{i}}\cdot b^{-y_{i}} is 1_{G}, and thus the whole product is 1.

Otherwise, there exists a unique i such that x_{i}\land y_{i}=1 and thus the product will be 1\cdots 1\cdot h\cdot 1\cdots 1=h, with h being in the i-th position. If Alice and Bob can check if the above product is equal to 1, they can also solve the unique set disjointness problem, and thus the lower bound applies for the former. \square

We required the uniqueness property, because otherwise we might get a product h^{c} that could be equal to 1 in some groups.
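The reduction can be run concretely over the smallest non-abelian group. In the sketch below, which is an illustration under my own choices, G is the symmetric group on 3 points, A and B are two transpositions that do not commute, and H is their commutator; none of these specific elements come from the proof, which works for any non-commuting pair.

```python
from itertools import product

def compose(p, q):
    """Product p*q of permutations given as tuples: (p*q)(i) = p[q[i]]."""
    return tuple(p[q[i]] for i in range(len(p)))

def inverse(p):
    inv = [0] * len(p)
    for i, v in enumerate(p):
        inv[v] = i
    return tuple(inv)

IDENTITY = (0, 1, 2)
A = (1, 0, 2)    # transposition swapping 0 and 1
B = (0, 2, 1)    # transposition swapping 1 and 2
H = compose(compose(A, B), compose(inverse(A), inverse(B)))   # commutator of A and B

def power(g, bit, negative=False):
    """g^bit or g^(-bit) for bit in {0, 1}."""
    if bit == 0:
        return IDENTITY
    return inverse(g) if negative else g

def and_gadget(s, t):
    """a^s * b^t * a^(-s) * b^(-t)."""
    out = IDENTITY
    for factor in (power(A, s), power(B, t), power(A, s, True), power(B, t, True)):
        out = compose(out, factor)
    return out

def reduce_disjointness(x, y):
    """Map disjointness inputs x, y to Alice's and Bob's lists of group elements."""
    alice = [power(A, xi, neg) for xi in x for neg in (False, True)]
    bob = [power(B, yi, neg) for yi in y for neg in (False, True)]
    return alice, bob

def interleaved_product(alice, bob):
    out = IDENTITY
    for xa, yb in zip(alice, bob):
        out = compose(compose(out, xa), yb)
    return out
```

In this group H has order 3, so an instance with exactly three intersecting coordinates gives product H^{3}=1_{G}, a concrete case where dropping uniqueness would break the reduction.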

Next we prove a result for products of length just 4; it applies to groups of the form G=H^{n} with H non-abelian, and to the problem without the promise.

Theorem 3. Let H be a non-abelian group and consider G=H^{n}. Suppose Alice receives x_{1},x_{2} and Bob receives y_{1},y_{2}. Deciding if x_{1}\cdot y_{1}\cdot x_{2}\cdot y_{2}=1_{G} requires randomized communication \Omega (n).

Proof. The proof is similar to the proof of Theorem 2. We use coordinate i of G to encode bit i of the disjointness instance. If there is no intersection in the latter, the product will be 1_{G}. Otherwise, at least one coordinate of the product will be \ne 1_{H}, and so the product is \ne 1_{G}. \square

As a corollary we can prove a lower bound for A_{n}.

Corollary 3. Theorem 3 holds for G=A_{n}.

Proof. Note that A_{n} contains (A_{4})^{\lfloor n/4\rfloor } and that A_{4} is not abelian. Apply Theorem 3. \square

Theorem 3 is tight for constant-size H. We do not know if Corollary 3 is tight. The trivial upper bound is O(\log |A_{n}|)=O(n\log n).

3 Proof of Theorem 1

Several related proofs of this theorem exist, see [GV15, GVa, Sha16]. As in [GVa], the proof that we present can be broken down in three steps. First we reduce the problem to a statement about conjugacy classes. Second we reduce this to a statement about trace maps. Third we prove the latter. We present the first step in a way that is similar but slightly different from the presentation in [GVa]. The second step is only sketched, but relies on classical results about SL(2,q) and can be found in [GVa]. For the third we present a proof that was communicated to us by Will Sawin. We thank him for his permission to include it here.

3.1 Step 1

We would like to rule out randomized protocols, but it is hard to reason about them directly. Instead, we are going to rule out deterministic protocols on random inputs. First, for any group element g\in G we define the distribution on quadruples D_{g}:=(x_{1},y_{1},x_{2},(x_{1}\cdot y_{1}\cdot x_{2})^{-1}g), where x_{1},y_{1},x_{2}\in G are uniform and independent. Note that the product of the elements in D_{g} is always g.

Towards a contradiction, suppose we have a randomized protocol P such that

\begin{aligned} \mathbb{P} [P(D_{1})=1]\geq \mathbb{P} [P(D_{h})=1]+\frac {1}{10}. \end{aligned}

This implies a deterministic protocol with the same gap, by fixing the randomness.

We reach a contradiction by showing that for every deterministic protocol P using little communication, we have

\begin{aligned} |\Pr [P(D_{1})=1]-\Pr [P(D_{h})=1]|\leq \frac {1}{100}. \end{aligned}

We start with the following standard lemma, which describes a protocol using product sets.

Lemma 4. (The set of accepted inputs of) A deterministic c-bit protocol for a function f:X\times Y\to Z can be written as a disjoint union of 2^{c} rectangles, where a rectangle is a set of the form A\times B with A\subseteq X and B\subseteq Y and where f is constant.

Proof. (sketch) For every communication transcript t, let S_{t}\subseteq X\times Y be the set of inputs giving transcript t. The sets S_{t} are disjoint since an input gives only one transcript, and their number is at most 2^{c}: one for each communication transcript of the protocol. The rectangle property can be proven by induction on the protocol tree. \square
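To make Lemma 4 concrete, here is a small Python sketch that groups the inputs of a toy 2-bit protocol by transcript and checks that every group is a product set. The protocol `toy_transcript` and the input ranges are made up for illustration; they do not come from the text.

```python
from itertools import product

def toy_transcript(x, y):
    """Hypothetical 2-bit protocol: Alice sends the parity of x,
    then Bob replies with a bit that depends on y and on Alice's bit."""
    alice_bit = x % 2
    bob_bit = int((y + alice_bit) % 3 == 0)
    return (alice_bit, bob_bit)

def transcript_classes(X, Y, transcript):
    """Group the inputs (x, y) by the transcript they produce."""
    classes = {}
    for x, y in product(X, Y):
        classes.setdefault(transcript(x, y), set()).add((x, y))
    return classes

def is_rectangle(S):
    """S is a rectangle iff it equals the product of its two projections."""
    A = {x for x, _ in S}
    B = {y for _, y in S}
    return S == set(product(A, B))

X = range(6)
Y = range(6)
classes = transcript_classes(X, Y, toy_transcript)
all_rectangles = all(is_rectangle(S) for S in classes.values())
num_classes = len(classes)
```

Here `all_rectangles` should be true and `num_classes` is at most 2^{2}=4, matching the statement of the lemma with c=2.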

Next, we show that no rectangle A\times B can distinguish D_{1},D_{h}. The way we achieve this is by showing that the probability that (A\times B)(D_{g})=1 is roughly the same for every g, and is roughly the density of the rectangle. (Here we write A\times B for the characteristic function of the set A\times B.) Without loss of generality we set g=1_{G}. Let A have density \alpha and B have density \beta . We aim to bound above

\begin{aligned} \left |\mathbb{E} _{a_{1},b_{1},a_{2},b_{2}:a_{1}b_{1}a_{2}b_{2}=1}A(a_{1},a_{2})B(b_{1},b_{2})-\alpha \beta \right |, \end{aligned}

where we note that the distribution of a_{1},b_{1},a_{2},b_{2} is the same as D_{1}.

Because the distribution of (b_{1},b_{2}) is uniform in G^{2}, the above can be rewritten as

\begin{aligned} & \left |\mathbb{E} _{b_{1},b_{2}}B(b_{1},b_{2})\mathbb{E} _{a_{1},a_{2}:a_{1}b_{1}a_{2}b_{2}=1}(A(a_{1},a_{2})-\alpha )\right |\\ & \le \sqrt {\mathbb{E} _{b_{1},b_{2}}B(b_{1},b_{2})^{2}}\sqrt {\mathbb{E} _{b_{1},b_{2}}\mathbb{E} _{a_{1},a_{2}:a_{1}b_{1}a_{2}b_{2}=1}^{2}(A(a_{1},a_{2})-\alpha )}\\ & =\sqrt {\beta }\sqrt {\mathbb{E} _{b_{1},b_{2},a_{1},a_{2},a_{1}',a_{2}':a_{1}b_{1}a_{2}b_{2}=a_{1}'b_{1}a_{2}'b_{2}=1}A(a_{1},a_{2})A(a_{1}',a_{2}')-\alpha ^{2}}. \end{aligned}

The inequality is Cauchy-Schwarz, and the step after that is obtained by expanding the square and noting that (a_{1},a_{2}) is uniform in G^{2}, so that the expectation of the term A(a_{1},a_{2})\alpha is \alpha ^{2}.

Now we do several transformations to rewrite the distribution in the last expectation in a convenient form. First, right-multiplying by b_{2}^{-1} we can rewrite the distribution as the uniform distribution on tuples such that

\begin{aligned} a_{1}b_{1}a_{2}=a_{1}'b_{1}a_{2}'. \end{aligned}

The last equation is equivalent to b_{1}^{-1}(a_{1}')^{-1}a_{1}b_{1}a_{2}=a_{2}'.

We can now do a transformation setting a_{1}' to be a_{1}x^{-1} to rewrite the distribution of the four-tuple as

\begin{aligned} (a_{1},a_{2},a_{1}x^{-1},C(x)a_{2}) \end{aligned}

where we use C(x) to denote a uniform element from the conjugacy class of x, that is b^{-1}xb for a uniform b\in G.

Hence it is sufficient to bound

\begin{aligned} \left |\mathbb{E} A(a_{1},a_{2})A(a_{1}x^{-1},C(x)a_{2})-\alpha ^{2}\right |, \end{aligned}

where all the variables are uniform and independent.

With a similar derivation as above, this can be rewritten as

\begin{aligned} & \left |\mathbb{E} A(a_{1},a_{2})\mathbb{E} (A(a_{1}x^{-1},C(x)a_{2})-\alpha )\right |\\ & \le \sqrt {\mathbb{E} A(a_{1},a_{2})^{2}}\sqrt {\mathbb{E} _{a_{1},a_{2}}\mathbb{E} _{x}^{2}(A(a_{1}x^{-1},C(x)a_{2})-\alpha )}\\ & =\sqrt {\alpha }\sqrt {\mathbb{E} A(a_{1}x^{-1},C(x)a_{2})A(a_{1}x'^{-1},C(x')a_{2})-\alpha ^{2}}. \end{aligned}

Here each occurrence of C denotes a uniform and independent conjugate. Hence it is sufficient to bound

\begin{aligned} \left |\mathbb{E} A(a_{1}x^{-1},C(x)a_{2})A(a_{1}x'^{-1},C(x')a_{2})-\alpha ^{2}\right |. \end{aligned}

We can now replace a_{2} with C(x)^{-1}a_{2}. Because C(x)^{-1} has the same distribution as C(x^{-1}), it is sufficient to bound

\begin{aligned} \left |\mathbb{E} A(a_{1}x^{-1},a_{2})A(a_{1}x'^{-1},C(x')C(x^{-1})a_{2})-\alpha ^{2}\right |. \end{aligned}

For this, it is enough to show that with high probability 1-1/|G|^{\Omega (1)} over x' and x, the distribution of C(x')C(x^{-1}), over the choice of the two independent conjugates, has statistical distance \le 1/|G|^{\Omega (1)} from uniform.

3.2 Step 2

In this step we use information on the conjugacy classes of the group to reduce the latter task to one about the equidistribution of the trace map. Let Tr be the Trace map:

\begin{aligned} Tr\begin {pmatrix}a_{1} & a_{2}\\ a_{3} & a_{4} \end {pmatrix}=a_{1}+a_{4}. \end{aligned}

We state the lemma that we want to show.

Lemma 5. Let a:=\begin {pmatrix}0 & 1\\ 1 & w \end {pmatrix} and b:=\begin {pmatrix}v & 1\\ 1 & 0 \end {pmatrix}. For all but O(1) values of w\in \mathbb{F} _{q} and v\in \mathbb{F} _{q}, the distribution of

\begin{aligned} Tr\left (au^{-1}bu\right ) \end{aligned}

over a uniform u\in SL(2,q) is O(1/q) close to uniform over \mathbb{F} _{q} in statistical distance.

To give some context, in SL(2,q) the conjugacy class of an element is essentially determined by the trace. Moreover, we can think of a and b as generic elements in G. So the lemma can be interpreted as saying that for typical a,b\in G, taking a uniform element from the conjugacy class of b and multiplying it by a yields an element whose conjugacy class is uniform among the classes of G. Using that essentially all conjugacy classes have the same size, and some of the properties of the trace map, one can show that the above lemma implies that for typical x,x' the distribution of C(x')C(x^{-1}) is close to uniform. For more on how this fits we refer the reader to [GVa].

3.3 Step 3

We now present a proof of Lemma 5. The high-level argument of the proof is the same as in [GVa] (Lemma 5.5), but the details may be more accessible and in particular the use of the Lang-Weil theorem [LW54] from algebraic geometry is replaced by a more elementary argument. For simplicity we shall only cover the case where q is prime. We will show that for all but O(1) values of v,w,c\in \mathbb{F} _{q}, the probability over u that Tr(au^{-1}bu)=c is within O(1/q^{2}) of 1/q, and for the others it is at most O(1/q). Summing over c gives the result.

We shall consider elements b whose trace is unique to the conjugacy class of b. (This holds for all but O(1) conjugacy classes – see for example [GVa] for details.) This means that the distribution of u^{-1}bu is that of a uniform element in G conditioned on having trace Tr(b)=v. Hence, we can write the probability that Tr(au^{-1}bu)=c as the number of solutions in x to the following three equations, divided by the size of the conjugacy class of b, which is q^{2}+O(q):

\begin{aligned} x_{3}+x_{2}+wx_{4} & =c & \hspace {1cm}(Tr(ax)=c),\\ x_{1}+x_{4} & =v & \hspace {1cm}(Tr(x)=Tr(b)),\\ x_{1}x_{4}-x_{2}x_{3} & =1 & \hspace {1cm}(Det(x)=1). \end{aligned}

We use the second one to remove x_{1} and the first one to remove x_{2} from the last equation. This gives

\begin{aligned} (v-x_{4})x_{4}-(c-x_{3}-wx_{4})x_{3}=1. \end{aligned}

This is an equation in two variables. Write x=x_{3} and y=x_{4} and use distributivity to rewrite the equation as

\begin{aligned} -y^{2}+vy-cx+x^{2}+wxy=1. \end{aligned}
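The elimination above can be checked by brute force for a small prime. The following Python sketch counts solutions of the three equations (with the determinant read as x_{1}x_{4}-x_{2}x_{3}) and of the two-variable equation, and compares the two counts. The choice q=5 and the function names are ours, purely for illustration.

```python
from itertools import product

def count_three_equations(q, v, w, c):
    """Count (x1, x2, x3, x4) in F_q^4 with
    x3 + x2 + w*x4 = c,  x1 + x4 = v,  x1*x4 - x2*x3 = 1  (all mod q)."""
    count = 0
    for x1, x2, x3, x4 in product(range(q), repeat=4):
        if ((x3 + x2 + w * x4 - c) % q == 0
                and (x1 + x4 - v) % q == 0
                and (x1 * x4 - x2 * x3 - 1) % q == 0):
            count += 1
    return count

def count_two_variables(q, v, w, c):
    """Count (x, y) in F_q^2 with -y^2 + v*y - c*x + x^2 + w*x*y = 1 (mod q)."""
    return sum(
        1
        for x, y in product(range(q), repeat=2)
        if (-y * y + v * y - c * x + x * x + w * x * y - 1) % q == 0
    )

q = 5
counts_agree = all(
    count_three_equations(q, v, w, c) == count_two_variables(q, v, w, c)
    for v, w, c in product(range(q), repeat=3)
)
```

The counts agree because x_{1} and x_{2} are determined by x_{3},x_{4} through the two linear equations, so the substitution is a bijection between solution sets.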

At least since Lagrange it has been known how to reduce this to a Pell equation x^{2}+dy^{2}=e. This is done by applying an invertible affine transformation, which does not change the number of solutions. First set x=x-wy/2. Then the equation becomes

\begin{aligned} -y^{2}+vy-c(x-wy/2)+(x-wy/2)^{2}+w(x-wy/2)y=1. \end{aligned}

Equivalently, the cross-term has disappeared and we have

\begin{aligned} y^{2}(-1-w^{2}/4)+y(v+cw/2)+x^{2}-cx=1. \end{aligned}

Now one can add constants to x and y to remove the linear terms, changing the constant term. Specifically, write d:=(-1-w^{2}/4), which is non-zero for all but O(1) values of w, let h:=(v+cw/2)/2, and set y=y-h/d and x=x+c/2. The equation becomes

\begin{aligned} d(y-h/d)^{2}+2h(y-h/d)+(x+c/2)^{2}-c(x+c/2)=1. \end{aligned}

The linear terms disappear, the coefficients of x^{2} and y^{2} do not change and the equation can be rewritten as

\begin{aligned} dy^{2}-h^{2}/d+x^{2}-(c/2)^{2}=1. \end{aligned}

So this is now a Pell equation

\begin{aligned} x^{2}+dy^{2}=e \end{aligned}

where d=(-1-w^{2}/4) as above and

\begin{aligned} e:=1+h^{2}/d+(c/2)^{2}=1+(c/2)^{2}-\frac {(v+cw/2)^{2}}{4+w^{2}}. \end{aligned}

For all but O(1) values of w we have that d is non-zero. Moreover, whenever d is non-zero the term e is a non-zero polynomial in c: the coefficient of c^{2} is 1/4-(w^{2}/4)/(4+w^{2})=1/(4+w^{2}). So e=0 for at most two values of c, and we only consider the values of c that make it non-zero. Those where e=0 give O(q) solutions, which is fine. We conclude with the following lemma.

Lemma 6. For d and e non-zero, and prime q, the number of solutions over \mathbb{F} _{q} to the Pell equation

\begin{aligned} x^{2}+dy^{2}=e \end{aligned}

is within O(1) of q.

This is a basic result from algebraic geometry that can be proved from first principles.

Proof. If d=-f^{2} for some f\in \mathbb{F} _{q}, then we can replace y with fy and we can count instead the solutions to the equation

\begin{aligned} x^{2}-y^{2}=e. \end{aligned}

Because x^{2}-y^{2}=(x-y)(x+y) we can set x':=x-y and y':=x+y, which preserves the number of solutions, and rewrite the equation as

\begin{aligned} x'y'=e. \end{aligned}

Because e\ne 0, this has q-1 solutions: for every non-zero y' we have x'=e/y'.

So now we can assume that d\ne -f^{2} for any f\in \mathbb{F} _{q}. Because the number of squares is (q+1)/2, the range of x^{2} has size (q+1)/2. Similarly, the range of e-dy^{2} also has size (q+1)/2. Hence these two ranges intersect, and there is a solution (a,b).

We take a line passing through (a,b): for parameters s,t\in \mathbb{F} _{q} we consider pairs (a+t,b+st). There is a bijection between such pairs with t\ne 0 and the points (x,y) with x\ne a. Because the number of solutions with x=a is O(1), using that d\ne 0, it suffices to count the solutions with t\ne 0.

The intuition is that this line has two intersections with the curve x^{2}+dy^{2}=e. Because one of them, (a,b), lies in \mathbb{F} _{q}, the other has to lie there as well. Algebraically, we can plug the pair in the expression to obtain the equivalent equation

\begin{aligned} a^{2}+t^{2}+2at+d(b^{2}+s^{2}t^{2}+2bst)=e. \end{aligned}

Using that (a,b) is a solution this becomes

\begin{aligned} t^{2}+2at+ds^{2}t^{2}+2dbst=0. \end{aligned}

Dividing by t\ne 0 we obtain

\begin{aligned} t(1+ds^{2})+2a+2dbs=0. \end{aligned}

We can now divide by 1+ds^{2} which is non-zero by the assumption d\ne -f^{2}. This yields

\begin{aligned} t=(-2a-2dbs)/(1+ds^{2}). \end{aligned}

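Hence for every value of s there is a unique t giving a solution, and t\ne 0 for all but at most one value of s. This gives q-O(1) solutions with t\ne 0, and so the total is within O(1) of q. \square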
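Lemma 6 is easy to test numerically for small primes. The sketch below, in Python, counts solutions of x^{2}+dy^{2}=e by brute force over all non-zero d and e and records the largest deviation from q. The list of primes and the function names are our own choices for illustration.

```python
def count_pell_solutions(q, d, e):
    """Brute-force count of (x, y) in F_q^2 with x^2 + d*y^2 = e (mod q)."""
    return sum(
        1
        for x in range(q)
        for y in range(q)
        if (x * x + d * y * y - e) % q == 0
    )

def max_deviation(q):
    """Largest |#solutions - q| over all non-zero d and e, for a prime q."""
    return max(
        abs(count_pell_solutions(q, d, e) - q)
        for d in range(1, q)
        for e in range(1, q)
    )

deviations = {q: max_deviation(q) for q in (3, 5, 7, 11, 13)}
```

For these odd primes the deviation stays bounded by a constant independent of q, as the lemma predicts; the case d=-f^{2} treated first in the proof gives exactly q-1 solutions.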

4 Three parties, number-in-hand

In this section we consider the following three-party number-in-hand problem: Alice gets x, Bob gets y, Charlie gets z, and they want to know if x\cdot y\cdot z=1_{G}. The communication depends on the group G. We present next two efficient protocols for abelian groups, and then a communication lower bound for other groups.

4.1 A randomized protocol for the hypercube

We begin with the simplest setting. Let G=(\mathbb {Z}_{2})^{n}, that is n-bit strings with bit-wise addition modulo 2. The parties want to check if x+y+z=0^{n}. They can do so as follows. First, they pick a hash function h that is linear: h(x+y)=h(x)+h(y). Specifically, for a uniformly random a\in \{0,1\}^{n} define h_{a}(x):=\sum a_{i}x_{i}\mod 2. Then, the protocol is as follows.

  • Alice sends h_{a}(x),
  • Bob sends h_{a}(y),
  • Charlie accepts if and only if h_{a}(x)+h_{a}(y)+h_{a}(z)=0.

The hash function outputs 1 bit, so the communication is constant. By linearity, the protocol accepts iff h_{a}(x+y+z)=0. If x+y+z=0 this is always the case, otherwise it happens with probability 1/2.
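The protocol can be simulated in a few lines. In this Python sketch, n-bit strings are stored as integers, so that addition in (\mathbb {Z}_{2})^{n} is XOR and h_{a}(x) is the parity of a AND x. The values n=32 and 2000 trials, and all function names, are arbitrary choices for illustration.

```python
import random

def linear_hash(a, x):
    """h_a(x) = sum_i a_i x_i mod 2, with n-bit strings stored as Python ints."""
    return bin(a & x).count("1") % 2

def protocol_accepts(x, y, z, a):
    """Alice and Bob each send one hash bit; Charlie accepts iff the three bits sum to 0 mod 2."""
    alice_bit = linear_hash(a, x)
    bob_bit = linear_hash(a, y)
    return (alice_bit + bob_bit + linear_hash(a, z)) % 2 == 0

def acceptance_rate(x, y, z, n, trials, rng):
    """Fraction of uniformly random a in {0,1}^n on which the protocol accepts."""
    accepted = sum(protocol_accepts(x, y, z, rng.getrandbits(n)) for _ in range(trials))
    return accepted / trials

rng = random.Random(0)
n = 32
x, y = rng.getrandbits(n), rng.getrandbits(n)
rate_zero_sum = acceptance_rate(x, y, x ^ y, n, 2000, rng)         # x + y + z = 0
rate_nonzero_sum = acceptance_rate(x, y, x ^ y ^ 1, n, 2000, rng)  # x + y + z != 0
```

When x+y+z=0 the protocol accepts for every a, while in the other case the empirical acceptance rate should be close to 1/2. Repeating the protocol a constant number of times with independent a drives the error down.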

4.2 A randomized protocol for \mathbb {Z}_{N}

This protocol is from [Vio14]. For simplicity we only consider the case N=2^{n} here – the protocol for general N is in [Vio14]. Again, the parties want to check if x+y+z=0\mod N. For this group, there is no 100% linear hash function but there are almost linear hash functions h:\mathbb {Z}_{N}\rightarrow \mathbb {Z}_{2^{\ell }} that satisfy the following properties. Note that the inputs to h are interpreted modulo N and the outputs modulo 2^{\ell }.

  1. for all a,x,y there is c\in \{0,1\} such that h_{a}(x+y)=h_{a}(x)+h_{a}(y)+c,
  2. for all x\neq 0 we have \mathbb{P} _{a}[h_{a}(x)\in \{-2,-1,0,1,2\}]\leq O(1/2^{\ell }),
  3. h_{a}(0)=0.

Assuming some random hash function h that satisfies the above properties the protocol works similarly to the previous one:

  • Alice sends h_{a}(x),
  • Bob sends h_{a}(y),
  • Charlie accepts if and only if h_{a}(x)+h_{a}(y)+h_{a}(z)\in \{-2,-1,0\}.

We can set \ell =O(1) to achieve constant communication and constant error.

To prove correctness of the protocol, first note that h_{a}(x)+h_{a}(y)+h_{a}(z)=h_{a}(x+y+z)-c for some c\in \{0,1,2\}. Then consider the following two cases:

  • if x+y+z=0 then h_{a}(x+y+z)-c=h_{a}(0)-c=-c, and the protocol is always correct.
  • if x+y+z\neq 0 then the probability that h_{a}(x+y+z)-c\in \{-2,-1,0\} for some c\in \{0,1,2\} is at most the probability that h_{a}(x+y+z)\in \{-2,-1,0,1,2\} which is \leq 2^{-\Omega (\ell )}; so the protocol is correct with high probability.

The hash function.

For the hash function we can use a function analyzed in [DHKP97]. Let a be a random odd number modulo 2^{n}. Define

\begin{aligned} h_{a}(x):=(a\cdot x\gg n-\ell )\mod 2^{\ell } \end{aligned}

where the product a\cdot x is integer multiplication, and \gg is bit-shift. In other words we output the bits n-\ell +1,n-\ell +2,\ldots ,n of the integer product a\cdot x.

We now verify that the above hash function family satisfies the three properties we required above.

Property (3) is trivially satisfied.

For property (1) we have the following. Let s=a\cdot x and t=a\cdot y and u=n-\ell . To recap, by definition we have:

  • h_{a}(x+y)=((s+t)\gg u)\mod 2^{\ell },
  • h_{a}(x)=(s\gg u)\mod 2^{\ell },
  • h_{a}(y)=(t\gg u)\mod 2^{\ell }.

Notice that if in the addition s+t the carry into the u+1 bit is 0, then

\begin{aligned} (s\gg u)+(t\gg u)=(s+t)\gg u \end{aligned}

and otherwise

\begin{aligned} (s\gg u)+(t\gg u)+1=(s+t)\gg u \end{aligned}

which concludes the proof for property (1).

Finally, we prove property (2). We start by writing x=s\cdot 2^{c} where s is odd. So the binary representation of x looks like

\begin{aligned} (\cdots \cdots 1\underbrace {0\cdots 0}_{c~\textrm {bits}}). \end{aligned}

The binary representation of the product a\cdot x for a uniformly random odd a looks like

\begin{aligned} (\textit {uniform}~1\underbrace {0\cdots 0}_{c~\textrm {bits}}). \end{aligned}

We consider the two following cases for the product a\cdot x:

  1. If a\cdot x=(\underbrace {\textit {uniform}~1\overbrace {00}^{2~bits}}_{\ell ~bits}\cdots 0), or equivalently c\geq n-\ell +2, the output never lands in the bad set \{-2,-1,0,1,2\};
  2. Otherwise, the hash function output has \ell -O(1) uniform bits. For any set B, the probability that the output lands in B is at most |B|\cdot 2^{-\ell +O(1)}.
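The hash function and the protocol of this section can be sketched in Python as follows. The parameters n=32 and \ell =10, the number of trials, and the function names are assumptions made only for this illustration. The sketch checks property (1) and completeness on random inputs, and counts how often an input with x+y+z=1 is wrongly accepted.

```python
import random

def shift_hash(a, x, n, ell):
    """h_a(x): the top ell bits of (a * x mod 2^n), for odd a."""
    return ((a * x) % (1 << n)) >> (n - ell)

def charlie_accepts(x, y, z, a, n, ell):
    """Charlie accepts iff h_a(x) + h_a(y) + h_a(z) is in {-2, -1, 0} modulo 2^ell."""
    total = (shift_hash(a, x, n, ell) + shift_hash(a, y, n, ell)
             + shift_hash(a, z, n, ell)) % (1 << ell)
    return total in {0, (1 << ell) - 1, (1 << ell) - 2}

rng = random.Random(1)
n, ell, trials = 32, 10, 2000
N = 1 << n

almost_linear = True   # property (1): h(x+y) - h(x) - h(y) is 0 or 1 modulo 2^ell
always_accepts = True  # completeness when x + y + z = 0 mod N
false_accepts = 0      # accepted inputs with x + y + z = 1 mod N
for _ in range(trials):
    a = rng.randrange(N) | 1  # uniformly random odd multiplier
    x, y = rng.randrange(N), rng.randrange(N)
    carry = (shift_hash(a, (x + y) % N, n, ell)
             - shift_hash(a, x, n, ell) - shift_hash(a, y, n, ell)) % (1 << ell)
    almost_linear &= carry in (0, 1)
    always_accepts &= charlie_accepts(x, y, (-x - y) % N, a, n, ell)
    false_accepts += charlie_accepts(x, y, (1 - x - y) % N, a, n, ell)
```

With these parameters the first two flags should stay true, and the number of false accepts should be a small fraction of the trials, in line with property (2).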

4.3 Quasirandom groups

What happens in other groups? The hash function used in the previous result was fairly non-trivial. Do we have an almost linear hash function for 2\times 2 matrices? The answer is negative. For SL_{2}(q) and A_{n} the problem is hard, even under the promise. For a group G the complexity can be expressed in terms of a parameter d which comes from representation theory. We will not formally define this parameter here, but several qualitatively equivalent formulations can be found in [Gow08]. Instead the following table shows the d’s for the groups we’ve introduced.

G :   abelian   A_{n}                                         SL_{2}(q)
d :   1         \Omega (\frac {\log |G|}{\log \log |G|})      |G|^{\Omega (1)}


Theorem 1. Let G be a group, and let h\in G. Let d be the minimum dimension of any non-trivial irreducible representation of G. Suppose Alice, Bob, and Charlie receive x, y, and z respectively. They are promised that x\cdot y\cdot z either equals 1_{G} or h. Deciding which case it is requires randomized communication complexity \Omega (\log d).

This result is tight for the groups we have discussed so far. The arguments are the same as before. Specifically, for SL_{2}(q) the communication is \Omega (\log |G|). This is tight up to constants, because Alice and Bob can send their elements. For A_{n} the communication is \Omega (\log \log |G|). This is tight as well, as the parties can again just communicate the images of an element a such that h(a)\ne a, as discussed in Section 1. This also gives a computational proof that d cannot be too large for A_{n}, i.e., it is at most (\log |G|)^{O(1)}. For abelian groups we get nothing, matching the efficient protocols given above.

5 Proof of Theorem 1

First we discuss several “mixing” lemmas for groups, then we come back to protocols and see how to apply one of them there.

5.0.1 XY mixing

We want to consider “high entropy” distributions over G, and state a fact showing that the multiplication of two such distributions “mixes” or in other words increases the entropy. To define entropy we use the norms \lVert A\rVert _{c}=\left (\sum _{x}A(x)^{c}\right )^{\frac {1}{c}}. Our notion of (non-)entropy will be \lVert A\rVert _{2}. Note that \lVert A\rVert _{2}^{2} is exactly the collision probability \mathbb{P} [A=A'] where A' is independent and identically distributed to A. The smaller this quantity, the higher the entropy of A. For the uniform distribution U we have \lVert U\rVert _{2}^{2}=\frac {1}{|G|} and so we can think of 1/|G| as maximum entropy. If A is uniform over \Omega (|G|) elements, we have \lVert A\rVert _{2}^{2}=O(1/|G|) and we think of A as having “high” entropy.

Because \lVert U\rVert _{2}^{2} is small, we can think of the squared distance between A and U in the 2-norm as being essentially the (non-)entropy \lVert A\rVert _{2}^{2} of A:

\begin{aligned} \lVert A-U\rVert _{2}^{2} & =\sum _{x\in G}\left (A(x)-\frac {1}{|G|}\right )^{2}\\ & =\sum _{x\in G}\left (A(x)^{2}-2A(x)\frac {1}{|G|}+\frac {1}{|G|^{2}}\right )\\ & =\lVert A\rVert _{2}^{2}-\frac {1}{|G|}\\ & =\lVert A\rVert _{2}^{2}-\lVert U\rVert _{2}^{2}\\ & \approx \lVert A\rVert _{2}^{2}. \end{aligned}
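The identity \lVert A-U\rVert _{2}^{2}=\lVert A\rVert _{2}^{2}-1/|G| can be checked exactly with rational arithmetic. In this Python sketch the ground set has 12 elements and A is uniform over 4 of them; both numbers are arbitrary illustrative choices, and only the size of the ground set matters here, not the group structure.

```python
from fractions import Fraction

def squared_norm(dist):
    """||A||_2^2 = sum_x A(x)^2, which is the collision probability of A."""
    return sum(p * p for p in dist)

def squared_distance_to_uniform(dist):
    """||A - U||_2^2 where U is uniform on the same ground set."""
    u = Fraction(1, len(dist))
    return sum((p - u) ** 2 for p in dist)

group_size = 12
support = 4  # A is uniform over 4 of the 12 elements (hypothetical choice)
A = [Fraction(1, support)] * support + [Fraction(0)] * (group_size - support)

lhs = squared_distance_to_uniform(A)
rhs = squared_norm(A) - Fraction(1, group_size)
```

Both sides come out to 1/6 in this example, while the collision probability of A is 1/4.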

Lemma 7. [Gow08, BNP08] If X,Y are independent over G, then

\begin{aligned} \lVert X\cdot Y-U\rVert _{2}\leq \lVert X\rVert _{2}\lVert Y\rVert _{2}\sqrt {\frac {|G|}{d}}, \end{aligned}

where d is the minimum dimension of a non-trivial irreducible representation of G.

By this lemma, for high entropy distributions X and Y, we get \lVert X\cdot Y-U\rVert _{2}\leq \frac {O(1)}{\sqrt {|G|d}}. The factor 1/\sqrt {|G|} allows us to pass to statistical distance \lVert .\rVert _{1} using Cauchy-Schwarz:

\begin{aligned} \lVert X\cdot Y-U\rVert _{1}\leq \sqrt {|G|}\lVert X\cdot Y-U\rVert _{2}\leq \frac {O(1)}{\sqrt {d}}.~~~~(1) \end{aligned}

This is the way in which we will use the lemma.
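To make the quantities in Lemma 7 concrete, the following Python sketch evaluates both sides of the inequality on the alternating group A_{5}, whose smallest non-trivial irreducible representation has dimension 3. The sets supporting X and Y are random subsets of size 20, an arbitrary choice. A_{5} is far too small for the bound to be impressive; the point is only to show how the norms and the product distribution are computed.

```python
import math
import random
from itertools import permutations

def is_even(p):
    """A permutation is even iff it has an even number of inversions."""
    inversions = sum(p[i] > p[j] for i in range(len(p)) for j in range(i + 1, len(p)))
    return inversions % 2 == 0

def compose(p, r):
    """Product of two permutations: (p * r)(i) = p[r[i]]."""
    return tuple(p[i] for i in r)

def uniform_on(subset, group):
    """Distribution (as a dict) that is uniform on `subset` and zero elsewhere."""
    return {g: (1 / len(subset) if g in subset else 0.0) for g in group}

def product_distribution(X, Y):
    """Distribution of x * y for independent x ~ X and y ~ Y."""
    out = dict.fromkeys(X, 0.0)
    for x, px in X.items():
        for y, py in Y.items():
            out[compose(x, y)] += px * py
    return out

def norm2(D):
    return math.sqrt(sum(p * p for p in D.values()))

A5 = [p for p in permutations(range(5)) if is_even(p)]
d = 3  # smallest dimension of a non-trivial irreducible representation of A_5

rng = random.Random(0)
X = uniform_on(set(rng.sample(A5, 20)), A5)
Y = uniform_on(set(rng.sample(A5, 20)), A5)
XY = product_distribution(X, Y)

distance = math.sqrt(sum((p - 1 / len(A5)) ** 2 for p in XY.values()))
bound = norm2(X) * norm2(Y) * math.sqrt(len(A5) / d)
```

Here `distance` is the left-hand side \lVert X\cdot Y-U\rVert _{2} and `bound` is the right-hand side of the lemma.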

Another useful consequence of this lemma, which however we will not use directly, is this. Suppose now you have three independent, high-entropy variables X,Y,Z. Then for every g\in G we have

\begin{aligned} |\mathbb{P} [X\cdot Y\cdot Z=g]-1/|G||\le \lVert X\rVert _{2}\lVert Y\rVert _{2}\lVert Z\rVert _{2}\sqrt {\frac {|G|}{d}}.~~~~(2) \end{aligned}

To show this, set g=1_{G} without loss of generality and rewrite the left-hand-side as

\begin{aligned} |\sum _{h\in G}\mathbb{P} [X=h](\mathbb{P} [YZ=h^{-1}]-1/|G|)|. \end{aligned}

By Cauchy-Schwarz this is at most

\begin{aligned} \sqrt {\sum _{h}\mathbb{P} ^{2}[X=h]}\sqrt {\sum _{h}(\mathbb{P} [YZ=h^{-1}]-1/|G|)^{2}}=\lVert X\rVert _{2}\lVert YZ-U\rVert _{2} \end{aligned}

and we can conclude by Lemma 7. Hence the product of three high-entropy distributions is close to uniform in a point-wise sense: each group element is obtained with roughly probability 1/|G|.

At least over SL(2,q), there exists an alternative proof of this fact that does not mention representation theory (see [GVa] and [Vioa, Viob]).

With this notation in hand, we conclude by stating a “mixing” version of Theorem 2. For more on this perspective we refer the reader to [GVa].

Theorem 1. Let G=SL(2,q). Let X=(X_{1},X_{2}) and Y=(Y_{1},Y_{2}) be two distributions over G^{2}. Suppose X is independent from Y. Let g\in G. We have

\begin{aligned} |\mathbb{P} [X_{1}Y_{1}X_{2}Y_{2}=g]-1/|G||\le |G|^{1-\Omega (1)}\lVert X\rVert _{2}\lVert Y\rVert _{2}. \end{aligned}

For example, when X and Y have high entropy over G^{2} (that is, are uniform over \Omega (|G|^{2}) pairs), we have \lVert X\rVert _{2}\le \sqrt {O(1)/|G|^{2}}, and so |G|^{1-\Omega (1)}\lVert X\rVert _{2}\lVert Y\rVert _{2}\le 1/|G|^{1+\Omega (1)}. In particular, X_{1}Y_{1}X_{2}Y_{2} is 1/|G|^{\Omega (1)} close to uniform over G in statistical distance.

5.0.2 Back to protocols

As in the beginning of Section 3, for any group element g\in G we define the distribution on triples D_{g}:=(x,y,(x\cdot y)^{-1}g), where x,y\in G are uniform and independent. Note the product of the elements in D_{g} is always g. Again as in Section 3, it suffices to show that for every deterministic protocol P using little communication we have

\begin{aligned} |\Pr [P(D_{1})=1]-\Pr [P(D_{h})=1]|\leq \frac {1}{100}. \end{aligned}

Analogously to Lemma 4, the following lemma describes a protocol using rectangles. The proof is nearly identical and is omitted.

Lemma 8. (The set of accepted inputs of) A deterministic c-bit number-in-hand protocol with three parties can be written as a disjoint union of 2^{c} “rectangles,” that is sets of the form A\times B\times C.

Next we show that these product sets cannot distinguish the two distributions D_{1},D_{h}, via a straightforward application of Lemma 7.

Lemma 9. For all A,B,C\subseteq G we have |\mathbb{P} [(A\times B\times C)(D_{1})=1]-\mathbb{P} [(A\times B\times C)(D_{h})=1]|\leq 1/d^{\Omega (1)}.

Proof. Pick any h\in G and let x,y,z be the inputs of Alice, Bob, and Charlie respectively. Then

\begin{aligned} \mathbb{P} [(A\times B\times C)(D_{h})=1]=\mathbb{P} [(x,y)\in A\times B]\cdot \mathbb{P} [(x\cdot y)^{-1}\cdot h\in C|(x,y)\in A\times B],~~~~(3) \end{aligned}

where (x,y) is uniform in G^{2}. If either A or B is small, that is \mathbb{P} [x\in A]\leq \epsilon or \mathbb{P} [y\in B]\leq \epsilon , then also \mathbb{P} [(x,y)\in A\times B]\le \epsilon and hence (3) is at most \epsilon as well. This holds for every h, so we also have |\mathbb{P} [(A\times B\times C)(D_{1})=1]-\mathbb{P} [(A\times B\times C)(D_{h})=1]|\leq \epsilon . We will choose \epsilon later.

Otherwise, A and B are large: \mathbb{P} [x\in A]>\epsilon and \mathbb{P} [y\in B]>\epsilon . Let (x',y') be the distribution of (x,y) conditioned on (x,y)\in A\times B. We have that x' and y' are independent and each is uniform over at least \epsilon |G| elements. By Lemma 7 this implies \lVert x'\cdot y'-U\rVert _{2}\leq \lVert x'\rVert _{2}\cdot \lVert y'\rVert _{2}\cdot \sqrt {\frac {|G|}{d}}, where U is the uniform distribution. As mentioned after the lemma, by Cauchy–Schwarz we obtain

\begin{aligned} \lVert x'\cdot y'-U\rVert _{1}\leq |G|\cdot \lVert x'\rVert _{2}\cdot \lVert y'\rVert _{2}\cdot \sqrt {\frac {1}{d}}\leq \frac {1}{\epsilon }\cdot \frac {1}{\sqrt {d}}, \end{aligned}

where the last inequality follows from the fact that \lVert x'\rVert _{2},\lVert y'\rVert _{2}\leq \sqrt {\frac {1}{\epsilon |G|}}.

This implies that \lVert (x'\cdot y')^{-1}-U\rVert _{1}\leq \frac {1}{\epsilon }\cdot \frac {1}{\sqrt {d}} and \lVert (x'\cdot y')^{-1}\cdot h-U\rVert _{1}\leq \frac {1}{\epsilon }\cdot \frac {1}{\sqrt {d}}, because taking inverses and multiplying by h does not change the distance to uniform. These two last inequalities imply that

\begin{aligned} |\mathbb{P} [(x'\cdot y')^{-1}\in C]-\mathbb{P} [(x'\cdot y')^{-1}\cdot h\in C]|\le O(\frac {1}{\epsilon \sqrt {d}}); \end{aligned}

and thus we get that

\begin{aligned} |\mathbb{P} [(A\times B\times C)(D_{1})=1]-\mathbb{P} [(A\times B\times C)(D_{h})=1]|\le O(\frac {1}{\epsilon \sqrt {d}}). \end{aligned}

Picking \epsilon =1/d^{1/4} completes the proof. \square

Returning to arbitrary deterministic protocols P (as opposed to rectangles), write P as a union of 2^{c} disjoint rectangles by Lemma 8. Applying Lemma 9 and summing over all rectangles we get that the distinguishing advantage of P is at most 2^{c}/d^{1/4}. For c\leq (1/100)\log d the advantage is at most 1/100, concluding the proof.

6 Three parties, number-on-forehead

In number-on-forehead (NOF) communication complexity [CFL83] with k parties, the input is a k-tuple (x_{1},\dotsc ,x_{k}) and each party i sees all of it except x_{i}. For background, it is not known how to prove negative results for k\ge \log n parties.

We mention that Theorem 1 can be extended to the multiparty setting, see [GVa]. Several questions arise here, such as whether this problem remains hard for k\ge \log n, and what is the minimum length of an interleaved product that is hard for k=3 parties (the proof in 1 gives a large constant).

However in this survey we shall instead focus on the problem of separating deterministic and randomized communication. For k=2, we know the optimal separation: The equality function requires \Omega (n) communication for deterministic protocols, but can be solved using O(1) communication if we allow the protocols to use public coins. For k=3, the best known separation between deterministic and randomized protocols is \Omega (\log n) vs O(1) [BDPW10]. In the following we give a new proof of this result, for a different function: f(x,y,z)=1 if and only if x\cdot y\cdot z=1_{G} for x,y,z\in SL(2,q). As is true for some functions in [BDPW10], a stronger separation could hold for f. For context, let us state and prove the upper bound for randomized communication.

Claim 10. f has randomized communication complexity O(1).

Proof. In the number-on-forehead model, computing f reduces to two-party equality with no additional communication: Alice computes y\cdot z=:w privately, then Alice and Bob check if x=w^{-1}. \square

To prove the lower bound for deterministic protocols we reduce the communication problem to a combinatorial problem.

Definition 11. A corner in a group G is a set \{(x,y),(xz,y),(x,zy)\}\subseteq G^{2}, where x,y are arbitrary group elements and z\neq 1_{G}.

For intuition, if G is the abelian group of real numbers with addition, a corner becomes \{(x,y),(x+z,y),(x,y+z)\} for z\neq 0, which are the coordinates of an isosceles triangle. We now state the lemma that connects corners and lower bounds.
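Definition 11 translates directly into a brute-force search. The Python sketch below looks for a corner in a subset A\subseteq G^{2}, with the group given by its element list, multiplication and identity. The example group \mathbb {Z}_{5} under addition and the two test sets are hypothetical choices for illustration.

```python
from itertools import product

def find_corner(A, elements, mul, identity):
    """Return (x, y, z) with z != identity and (x,y), (xz,y), (x,zy) all in A, or None."""
    for (x, y), z in product(A, elements):
        if z != identity and (mul(x, z), y) in A and (x, mul(z, y)) in A:
            return x, y, z
    return None

# Hypothetical toy group for illustration: Z_5 under addition.
n = 5
elements = range(n)
add_mod_n = lambda a, b: (a + b) % n

with_corner = {(0, 0), (1, 0), (0, 1), (3, 4)}
diagonal = {(i, i) for i in range(n)}

corner = find_corner(with_corner, elements, add_mod_n, 0)
no_corner = find_corner(diagonal, elements, add_mod_n, 0)
```

The first set contains the corner with x=y=0 and z=1, while the diagonal contains none, since (x+z,y) lies on the diagonal only when z=0.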

Lemma 12. Let G be a group and \delta a real number. Suppose that every subset A\subseteq G^{2} with |A|/|G^{2}|\ge \delta contains a corner. Then the deterministic communication complexity of f (defined as f(x,y,z)=1\iff x\cdot y\cdot z=1_{G}) is \Omega (\log (1/\delta )).

It is known that \delta \ge 1/\mathrm {polyloglog}|G| implies a corner for certain abelian groups G, see [LM07] for the best bound and pointers to the history of the problem. For G=SL(2,q) a stronger result is known: \delta \ge 1/\mathrm {polylog}|G| implies a corner [Aus16]. This in turn implies communication \Omega (\log \log |G|)=\Omega (\log n).

Proof. We saw already twice that a number-in-hand c-bit protocol can be written as a disjoint union of 2^{c} rectangles (Lemmas 4, 8). Likewise, a number-on-forehead c-bit protocol P can be written as a disjoint union of 2^{c} cylinder intersections C_{i}:=\{(x,y,z):f_{i}(y,z)g_{i}(x,z)h_{i}(x,y)=1\} for some f_{i},g_{i},h_{i}\colon G^{2}\to \{0,1\}:

\begin{aligned} P(x,y,z)=\sum _{i=1}^{2^{c}}f_{i}(y,z)g_{i}(x,z)h_{i}(x,y). \end{aligned}

The idea behind this fact is to consider the 2^{c} transcripts of P: the inputs giving a fixed transcript form a cylinder intersection.

Let P be a c-bit protocol. Consider the inputs \{(x,y,(xy)^{-1})\} on which P accepts. Note that at least 2^{-c} fraction of them are accepted by some cylinder intersection C=f\cdot g\cdot h. Let A:=\{(x,y):(x,y,(xy)^{-1})\in C\}\subseteq G^{2}. Since the first two elements in the tuple determine the last, we have |A|/|G^{2}|\ge 2^{-c}.

Now suppose A contains a corner \{(x,y),(xz,y),(x,zy)\}. Then

\begin{aligned} (x,y)\in A & \implies (x,y,(xy)^{-1})\in C & & \implies h(x,y)=1,\\ (xz,y)\in A & \implies (xz,y,(xzy)^{-1})\in C & & \implies f(y,(xzy)^{-1})=1,\\ (x,zy)\in A & \implies (x,zy,(xzy)^{-1})\in C & & \implies g(x,(xzy)^{-1})=1. \end{aligned}

This implies (x,y,(xzy)^{-1})\in C, which is a contradiction because z\neq 1 and so x\cdot y\cdot (xzy)^{-1}\neq 1_{G}. \square

7 The corners theorem for quasirandom groups

In this section we prove the corners theorem for quasirandom groups, following Austin [Aus16]. Our exposition has several minor differences with that in [Aus16], which may make it more computer-science friendly. Possibly a proof can also be obtained via certain local modifications and simplifications of Green’s exposition [Gre05b, Gre05a] of an earlier proof for the abelian case. We focus on the case G=\textit {SL}(2,q) for simplicity, but the proof immediately extends to other quasirandom groups (with corresponding parameters).

Theorem 1. Let G=\textit {SL}(2,q). Every subset A\subseteq G^{2} of density |A|/|G|^{2}\geq 1/\log ^{a}|G| contains a corner \{(x,y),(xz,y),(x,zy)\} with z\neq 1.

7.1 Proof idea

For intuition, suppose A is a product set, i.e., A=B\times C for B,C\subseteq G. Let’s look at the quantity

\begin{aligned} \mathbb {E}_{x,y,z\leftarrow G}[A(x,y)A(xz,y)A(x,zy)] \end{aligned}

where A(x,y)=1 iff (x,y)\in A. Note that the random variable in the expectation is equal to 1 exactly when x,y,z form a corner in A. We’ll show that this quantity is greater than 1/|G|, which implies that A contains a corner (where z\neq 1). Since we are taking A=B\times C, we can rewrite the above quantity as

\begin{aligned} \mathbb {E}_{x,y,z\leftarrow G}[B(x)C(y)B(xz)C(y)B(x)C(zy)] & =\mathbb {E}_{x,y,z\leftarrow G}[B(x)C(y)B(xz)C(zy)]\\ & =\mathbb {E}_{x,y,z\leftarrow G}[B(x)C(y)B(z)C(x^{-1}zy)] \end{aligned}

where the last line follows by replacing z with x^{-1}z in the uniform distribution. If |A|/|G|^{2}\ge \delta , then both |B|/|G|\ge \delta and |C|/|G|\ge \delta . Condition on x\in B, y\in C, z\in B. Then the distribution x^{-1}zy is a product of three independent distributions, each uniform on a set of density \ge \delta . (In fact, two distributions would suffice for this.) By Lemma 7, x^{-1}zy is \delta ^{-1}/|G|^{\Omega (1)} close to uniform in statistical distance. This implies that the above expectation equals

\begin{aligned} \frac {|A|}{|G|^{2}}\cdot \frac {|B|}{|G|}\cdot \left (\frac {|C|}{|G|}\pm \frac {\delta ^{-1}}{|G|^{\Omega (1)}}\right ) & \geq \delta ^{2}\left (\delta -\frac {1}{|G|^{\Omega (1)}}\right )\geq \delta ^{3}/2>1/|G|, \end{aligned}

for \delta >1/|G|^{c} for a small enough constant c. Hence, product sets of density polynomial in 1/|G| contain corners.

Given the above, it is natural to try to decompose an arbitrary set A into product sets. We will make use of a more general result.

7.2 Weak Regularity Lemma

Let U be some universe (we will take U=G^{2}) and let f:U\rightarrow [-1,1] be a function (for us, f=1_{A}). Let D\subseteq \{d:U\rightarrow [-1,1]\} be some set of functions, which can be thought of as “easy functions” or “distinguishers” (these will be rectangles or closely related to them). The next theorem shows how to decompose f into a linear combination g of the d_{i} up to an error which is polynomial in the length of the combination. More specifically, f will be indistinguishable from g by the d_{i}.

Lemma 13. Let f:U\rightarrow [-1,1] be a function and D\subseteq \{d:U\rightarrow [-1,1]\} a set of functions. For all \epsilon >0, there exists a function g:=\sum _{i\le s}c_{i}\cdot d_{i} where d_{i}\in D, c_{i}\in \mathbb {R} and s=1/\epsilon ^{2} such that for all d\in D

\begin{aligned} \left |\mathbb {E}_{x\leftarrow U}[f(x)\cdot d(x)]-\mathbb {E}_{x\leftarrow U}[g(x)\cdot d(x)]\right |\le \epsilon . \end{aligned}

A different way to state the conclusion, which we will use, is to say that we can write f=g+h so that \mathbb{E} [h(x)\cdot d(x)] is small.

The lemma is due to Frieze and Kannan [FK96]. It is called “weak” because it came after Szemerédi’s regularity lemma, which has a stronger distinguishing conclusion. However, the lemma is also “strong” in the sense that Szemerédi’s regularity lemma has s as a tower of 1/\epsilon whereas here we have s polynomial in 1/\epsilon . The weak regularity lemma is also simpler. There also exists a proof [Tao17] of Szemerédi’s theorem (on arithmetic progressions), which uses weak regularity as opposed to the full regularity lemma used initially.

Proof. We will construct the approximation g through an iterative process producing functions g_{0},g_{1},\dots ,g. We will show that ||f-g_{i}||_{2}^{2} decreases by \ge \epsilon ^{2} each iteration.

Start: Define g_{0}=0 (which can be realized setting c_{0}=0).

Iterate: If not done, there exists d\in D such that |\mathbb {E}[(f-g)\cdot d]|>\epsilon . Assume without loss of generality \mathbb {E}[(f-g)\cdot d]>\epsilon .

Update: g':=g+\lambda d where \lambda \in \mathbb {R} shall be picked later.

Let us analyze the progress made by the algorithm.

\begin{aligned} ||f-g'||_{2}^{2} & =\mathbb {E}_{x}[(f-g')^{2}(x)]\\ & =\mathbb {E}_{x}[(f-g-\lambda d)^{2}(x)]\\ & =\mathbb {E}_{x}[(f-g)^{2}]+\mathbb {E}_{x}[\lambda ^{2}d^{2}(x)]-2\mathbb {E}_{x}[(f-g)\cdot \lambda d(x)]\\ & \leq ||f-g||_{2}^{2}+\lambda ^{2}-2\lambda \mathbb {E}_{x}[(f-g)d(x)]\\ & \leq ||f-g||_{2}^{2}+\lambda ^{2}-2\lambda \epsilon \\ & \leq ||f-g||_{2}^{2}-\epsilon ^{2} \end{aligned}

where the last line follows by taking \lambda =\epsilon . Therefore, there can only be 1/\epsilon ^{2} iterations because ||f-g_{0}||_{2}^{2}=||f||_{2}^{2}\leq 1. \square
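The iteration in the proof can be run directly on a small universe. The following is a minimal sketch, not part of the original notes: the helper names, the 4 by 4 grid, and the choice of all rectangles as distinguishers are assumptions made for illustration. Instead of the "without loss of generality" step, the sketch takes \lambda =-\epsilon when the correlation is below -\epsilon.

```python
import itertools
import random

def mean(values):
    return sum(values) / len(values)

def correlation(residual, d):
    """E_x[residual(x) * d(x)] for x uniform in the universe."""
    return mean([r * v for r, v in zip(residual, d)])

def weak_regularity(f, distinguishers, eps):
    """Greedy approximation of f by a linear combination of distinguishers.

    f and every distinguisher are lists of values in [-1, 1] indexed by the
    universe. Returns (g, terms) with terms a list of (coefficient, index).
    """
    g = [0.0] * len(f)  # Start: g_0 = 0
    terms = []
    while True:
        residual = [fx - gx for fx, gx in zip(f, g)]
        # Iterate: look for a distinguisher that still tells f and g apart.
        violated = None
        for i, d in enumerate(distinguishers):
            c = correlation(residual, d)
            if abs(c) > eps:
                violated = (i, c)
                break
        if violated is None:
            return g, terms
        i, c = violated
        # Update: g' = g + lambda * d with lambda = eps, or -eps when the
        # correlation is negative (an assumption of this sketch).
        lam = eps if c > 0 else -eps
        g = [gx + lam * dx for gx, dx in zip(g, distinguishers[i])]
        terms.append((lam, i))

# Hypothetical example: U = {0,1,2,3}^2 and D = all rectangles S x T.
n = 4
points = [(x, y) for x in range(n) for y in range(n)]
subsets = [set(s) for r in range(n + 1)
           for s in itertools.combinations(range(n), r)]
rectangles = [[1.0 if x in S and y in T else 0.0 for (x, y) in points]
              for S in subsets for T in subsets]
rng = random.Random(0)
f = [float(rng.random() < 0.5) for _ in points]
eps = 0.1
g, terms = weak_regularity(f, rectangles, eps)
```

The number of entries in terms is the length s of the combination. The analysis above guarantees that the loop stops within 1/\epsilon ^{2} updates, because each update lowers ||f-g||_{2}^{2} by at least \epsilon ^{2}.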

7.3 Getting more for rectangles

Returning to the main proof, we will use the weak regularity lemma to approximate the indicator function for arbitrary A by rectangles. That is, we take D to be the collection of indicator functions for all sets of the form S\times T for S,T\subseteq G. The weak regularity lemma shows how to decompose A into a linear combination of rectangles. These rectangles may overlap. However, we ideally want A to be a linear combination of non-overlapping rectangles. In other words, we want a partition of rectangles. It is possible to achieve this at the price of exponentiating the number of rectangles. Note that an exponential loss is necessary even if S=G in every S\times T rectangle; or in other words in the uni-dimensional setting. This is one step where the terminology “rectangle” may be misleading – the set T is not necessarily an interval. If it was, a polynomial rather than exponential blow-up would have sufficed to remove overlaps.

Claim 14. Given a decomposition of A into rectangles from the weak regularity lemma with s functions, there exists a decomposition with 2^{O(s)} rectangles which don’t overlap.

Proof. Exercise. \square
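One natural way to approach the exercise, sketched here under my own assumptions (the function name and the toy rectangles are hypothetical, and this need not be the intended solution), is to group the rows by which sets S_{i} contain them and the columns by which sets T_{i} contain them. With s rectangles there are at most 2^{s} row classes and 2^{s} column classes, hence at most 2^{2s} cells, and every original rectangle is a union of cells.

```python
def refine_to_partition(rectangles, rows, cols):
    """Refine overlapping rectangles (S_i, T_i) into disjoint cells.

    Two rows land in the same class when they belong to exactly the same
    sets S_i; columns are grouped in the same way using the sets T_i.
    """
    def classes(elements, sets):
        groups = {}
        for e in elements:
            signature = tuple(e in s for s in sets)
            groups.setdefault(signature, set()).add(e)
        return list(groups.values())

    row_classes = classes(rows, [S for S, _ in rectangles])
    col_classes = classes(cols, [T for _, T in rectangles])
    return [(R, C) for R in row_classes for C in col_classes]

# Hypothetical example with s = 3 overlapping rectangles on a 6 x 6 grid.
rows = list(range(6))
cols = list(range(6))
rects = [({0, 1, 2}, {0, 1}), ({1, 2, 3}, {1, 2, 3}), ({4}, {0, 5})]
cells = refine_to_partition(rects, rows, cols)
```

Since each cell lies entirely inside or entirely outside every S_{i}\times T_{i}, a combination \sum c_{i}\cdot 1_{S_{i}\times T_{i}} is constant on each cell, which gives the non-overlapping decomposition.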

In the above decomposition, note that it is natural to take the coefficients of rectangles to be the density of points in A that are in the rectangle. This gives rise to the following claim.

Claim 15. The weights of the rectangles in the above claim can be the average of f in the rectangle, at the cost of doubling the error.

Consequently, we have that f=g+h, where g is the sum of 2^{O(s)} non-overlapping rectangles S\times T with coefficients \mathbb{P} _{(x,y)\in S\times T}[f(x,y)=1].

Proof. Let g be a partition decomposition with arbitrary weights. Let g' be a partition decomposition with weights being the average of f. It is enough to show that for all rectangle distinguishers d\in D

\begin{aligned} |\mathbb {E}[(f-g')d]|\leq 2|\mathbb {E}[(f-g)d]|. \end{aligned}

By the triangle inequality, we have that

\begin{aligned} |\mathbb {E}[(f-g')d]|\leq |\mathbb {E}[(f-g)d]|+|\mathbb {E}[(g-g')d]|. \end{aligned}

To bound |\mathbb {E}[(g-g')d]|, note that the error is maximized for a d that respects the decomposition into non-overlapping rectangles, i.e., d is the union of some non-overlapping rectangles from the decomposition. This can be argued using that, unlike f, the values of g and g' on a rectangle S\times T from the decomposition are fixed. But, from the point of “view” of such d, g'=f! More formally, \mathbb {E}[(g-g')d]=\mathbb {E}[(g-f)d]. This gives

\begin{aligned} |\mathbb {E}[(f-g')d]|\leq 2|\mathbb {E}[(f-g)d]| \end{aligned}

and concludes the proof. \square

We need to get still a little more from this decomposition. In our application of the weak regularity lemma above, we took the set of distinguishers to be characteristic functions of rectangles. That is, distinguishers that can be written as U(x)\cdot V(y) where U and V map G\to \{0,1\}. We will use that the same guarantee holds for U and V with range [-1,1], up to a constant factor loss in the error. Indeed, let U and V have range [-1,1]. Write U=U_{+}-U_{-} where U_{+} and U_{-} have range [0,1], and the same for V. The error for distinguisher U\cdot V is at most the sum of the errors for distinguishers U_{+}\cdot V_{+}, U_{+}\cdot V_{-}, U_{-}\cdot V_{+}, and U_{-}\cdot V_{-}. So we can restrict our attention to distinguishers U(x)\cdot V(y) where U and V have range [0,1]. In turn, a function U(x) with range [0,1] can be written as an expectation \mathbb{E} _{a}U_{a}(x) for functions U_{a} with range \{0,1\}, and the same for V. We conclude by observing that

\begin{aligned} \mathbb{E} _{x,y}[(f-g)(x,y)\mathbb{E} _{a}U_{a}(x)\cdot \mathbb{E} _{b}V_{b}(y)]\le \max _{a,b}\mathbb{E} _{x,y}[(f-g)(x,y)U_{a}(x)\cdot V_{b}(y)]. \end{aligned}
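One concrete way to write a function U with range [0,1] as an expectation of \{0,1\}-valued functions is thresholding: take U_{a}(x)=1 when U(x)>a, with a uniform in [0,1). The sketch below checks this exactly on a grid; the resolution N and the sample values of U are assumptions made only for the illustration.

```python
from fractions import Fraction

N = 8  # hypothetical resolution: U takes values j/N
U = {
    "x0": Fraction(0, N),
    "x1": Fraction(3, N),
    "x2": Fraction(5, N),
    "x3": Fraction(8, N),
}
# Thresholds a = 0, 1/N, ..., (N-1)/N, each taken with probability 1/N.
thresholds = [Fraction(k, N) for k in range(N)]

def U_a(a, x):
    """The {0,1}-valued function obtained by thresholding U at level a."""
    return 1 if U[x] > a else 0

# E_a[U_a(x)] for every point x of the toy domain.
averaged = {x: sum(Fraction(U_a(a, x)) for a in thresholds) / N for x in U}
```

For U(x)=j/N exactly j of the N thresholds lie strictly below U(x), so the average recovers U(x).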

7.4 Proof

Let us now finish the proof by showing a corner exists for sufficiently dense sets A\subseteq G^{2}. We’ll use three types of decompositions for f:G^{2}\rightarrow \{0,1\}, with respect to the following three types of distinguishers, where U_{i} and V_{i} have range \{0,1\}:

  1. U_{1}(x)\cdot V_{1}(y),
  2. U_{2}(xy)\cdot V_{2}(y),
  3. U_{3}(x)\cdot V_{3}(xy).

The first type is just rectangles, which is what we have been discussing until now. The distinguishers in the last two classes can be visualized over \mathbb {R}^{2} as parallelograms with a 45-degree angle. The same extra properties we discussed for rectangles can be verified to hold for them too.

Recall that we want to show

\begin{aligned} \mathbb {E}_{x,y,g}[f(x,y)f(xg,y)f(x,gy)]>\frac {1}{|G|}. \end{aligned}

We’ll decompose the i-th occurrence of f via the i-th decomposition listed above. We’ll write this decomposition as f=g_{i}+h_{i}. We apply this in a certain order to produce sums of products of three functions. The inputs to the functions don’t change, so to avoid clutter we do not write them, and it is understood that in each product of three functions the inputs are, in order (x,y),(xg,y),(x,gy). The decomposition is:

\begin{aligned} & fff\\ = & ffg_{3}+ffh_{3}\\ = & fg_{2}g_{3}+fh_{2}g_{3}+ffh_{3}\\ = & g_{1}g_{2}g_{3}+h_{1}g_{2}g_{3}+fh_{2}g_{3}+ffh_{3}. \end{aligned}

We first show that the expectation of the first term is big. This takes the next two claims. Then we show that the expectations of the other terms are small.

Claim 16. For all g\in G, the expectations \mathbb {E}_{x,y}[g_{1}(x,y)g_{2}(xg,y)g_{3}(x,gy)] are the same up to an error of 2^{O(s)}/|G|^{\Omega (1)}.

Proof. We just need to get error 1/|G|^{\Omega (1)} for any product of three functions for the three decomposition types. We have:

\begin{aligned} & \mathbb {E}_{x,y}[c_{1}U_{1}(x)V_{1}(y)\cdot c_{2}U_{2}(xgy)V_{2}(y)\cdot c_{3}U_{3}(x)V_{3}(xgy)]\\ = & c_{1}c_{2}c_{3}\mathbb {E}_{x,y}[(U_{1}\cdot U_{3})(x)(V_{1}\cdot V_{2})(y)(U_{2}\cdot V_{3})(xgy)]\\ = & c_{1}c_{2}c_{3}\cdot \mathbb {E}_{x}[(U_{1}\cdot U_{3})(x)]\cdot \mathbb {E}_{y}[(V_{1}\cdot V_{2})(y)]\cdot \mathbb {E}_{z}[(U_{2}\cdot V_{3})(z)]\pm \frac {1}{|G|^{\Omega (1)}}. \end{aligned}

This is similar to what we discussed in the overview, and is where we use mixing. Specifically, if \mathbb {E}_{x}[(U_{1}\cdot U_{3})(x)] or \mathbb {E}_{y}[(V_{1}\cdot V_{2})(y)] is at most 1/|G|^{c} for a small enough constant c, then we are done. Otherwise, conditioned on (U_{1}\cdot U_{3})(x)=1, the distribution of x is uniform over a set of density at least 1/|G|^{c}, and the same holds for y, and the result follows by Lemma 7. \square

Recall that we start with a set of density \ge 1/\log ^{a}|G|.

Claim 17. \mathbb {E}_{x,y}[g_{1}(x,y)g_{2}(x,y)g_{3}(x,y)]>1/\log ^{4a}|G|.

Proof. We will relate the expectation over x,y to f using the Hölder inequality: For random variables X_{1},X_{2},\ldots ,X_{k},

\begin{aligned} \mathbb {E}[X_{1}\dots X_{k}]\leq \prod _{i=1}^{k}\mathbb {E}[X_{i}^{c_{i}}]^{1/c_{i}}\text { such that }\sum 1/c_{i}=1. \end{aligned}

To apply this inequality in our setting, write

\begin{aligned} f=(f\cdot g_{1}g_{2}g_{3})^{1/4}\cdot \left (\frac {f}{g_{1}}\right )^{1/4}\cdot \left (\frac {f}{g_{2}}\right )^{1/4}\cdot \left (\frac {f}{g_{3}}\right )^{1/4}. \end{aligned}

By the Hölder inequality the expectation of the right-hand side is

\begin{aligned} \leq \mathbb {E}[f\cdot g_{1}g_{2}g_{3}]^{1/4}\mathbb {E}\left [\frac {f}{g_{1}}\right ]^{1/4}\mathbb {E}\left [\frac {f}{g_{2}}\right ]^{1/4}\mathbb {E}\left [\frac {f}{g_{3}}\right ]^{1/4}. \end{aligned}

The last three terms are equal to 1 because

\begin{aligned} \mathbb {E}_{x,y}\frac {f(x,y)}{g_{i}(x,y)} & =\mathbb {E}_{x,y}\frac {f(x,y)}{\mathbb {E}_{x',y'\in \textit {Cell}(x,y)}[f(x',y')]}=\mathbb {E}_{x,y}\frac {\mathbb {E}_{x',y'\in \textit {Cell}(x,y)}[f(x',y')]}{\mathbb {E}_{x',y'\in \textit {Cell}(x,y)}[f(x',y')]}=1. \end{aligned}

where \textit {Cell}(x,y) is the set in the partition that contains (x,y). Putting the above together we obtain

\begin{aligned} \mathbb {E}[f]\leq \mathbb {E}[f\cdot g_{1}g_{2}g_{3}]^{1/4}. \end{aligned}

Finally, because the functions g_{i} are nonnegative and f\le 1, we have that \mathbb {E}[f\cdot g_{1}g_{2}g_{3}]^{1/4}\leq \mathbb {E}[g_{1}g_{2}g_{3}]^{1/4}. This concludes the proof. \square
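The chain \mathbb {E}[f]^{4}\le \mathbb {E}[f\cdot g_{1}g_{2}g_{3}]\le \mathbb {E}[g_{1}g_{2}g_{3}] can be checked numerically. The sketch below is only a sanity check under assumptions of mine: the three partitions are random partitions of a 12 by 12 grid, standing in for the partitions into rectangles and parallelograms, and each g_{i} is the cell average of f.

```python
import random

def cell_average(f, partition):
    """Replace f by its average on each cell of a partition of the domain."""
    g = {}
    for cell in partition:
        avg = sum(f[p] for p in cell) / len(cell)
        for p in cell:
            g[p] = avg
    return g

def random_partition(points, num_cells, rng):
    """Assign each point to one of num_cells cells; drop empty cells."""
    cells = [[] for _ in range(num_cells)]
    for p in points:
        cells[rng.randrange(num_cells)].append(p)
    return [c for c in cells if c]

rng = random.Random(1)
points = [(x, y) for x in range(12) for y in range(12)]
f = {p: 1 if rng.random() < 0.3 else 0 for p in points}
g1, g2, g3 = (cell_average(f, random_partition(points, 5, rng))
              for _ in range(3))

def expectation(h):
    return sum(h(p) for p in points) / len(points)

density = expectation(lambda p: f[p])                          # E[f]
middle = expectation(lambda p: f[p] * g1[p] * g2[p] * g3[p])   # E[f g1 g2 g3]
product = expectation(lambda p: g1[p] * g2[p] * g3[p])         # E[g1 g2 g3]
```

The division f/g_{i} in the proof is never by zero where it matters: if g_{i}(x,y)=0 then f vanishes on the whole cell, so those points contribute nothing.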

It remains to show the other terms are small. Let \epsilon be the error in the weak regularity lemma with respect to distinguishers with range \{0,1\} . Recall that this implies error O(\epsilon ) with respect to distinguishers with range [-1,1]. We give the proof for one of the terms and then we say little about the other two.

Claim 18. |\mathbb {E}[f(x,y)f(xg,y)h_{3}(x,gy)]|\leq O(\epsilon )^{1/4}.

The proof involves changing names of variables and doing Cauchy-Schwarz to remove the terms with f and bound the expectation above by \mathbb {E}[h_{3}(x,g)U(x)V(xg)], which is small by the regularity lemma.

Proof. Replace g with gy^{-1} in the uniform distribution to get

\begin{aligned} & \mathbb {E}_{x,y,g}^{4}[f(x,y)f(xg,y)h_{3}(x,gy)]\\ & =\mathbb {E}_{x,y,g}^{4}[f(x,y)f(xgy^{-1},y)h_{3}(x,g)]\\ & =\mathbb {E}_{x,y}^{4}[f(x,y)\mathbb {E}_{g}[f(xgy^{-1},y)h_{3}(x,g)]]\\ & \leq \mathbb {E}_{x,y}^{2}[f^{2}(x,y)]\mathbb {E}_{x,y}^{2}\mathbb {E}_{g}^{2}[f(xgy^{-1},y)h_{3}(x,g)]\\ & \leq \mathbb {E}_{x,y}^{2}\mathbb {E}_{g}^{2}[f(xgy^{-1},y)h_{3}(x,g)]\\ & =\mathbb {E}_{x,y,g,g'}^{2}[f(xgy^{-1},y)h_{3}(x,g)f(xg'y^{-1},y)h_{3}(x,g')], \end{aligned}

where the first inequality is by Cauchy-Schwarz.

Now replace g\rightarrow x^{-1}g,g'\rightarrow x^{-1}g' and reason in the same way:

\begin{aligned} & =\mathbb {E}_{x,y,g,g'}^{2}[f(gy^{-1},y)h_{3}(x,x^{-1}g)f(g'y^{-1},y)h_{3}(x,x^{-1}g')]\\ & =\mathbb {E}_{g,g',y}^{2}[f(gy^{-1},y)\cdot f(g'y^{-1},y)\mathbb {E}_{x}[h_{3}(x,x^{-1}g)\cdot h_{3}(x,x^{-1}g')]]\\ & \leq \mathbb {E}_{x,x',g,g'}[h_{3}(x,x^{-1}g)h_{3}(x,x^{-1}g')h_{3}(x',x'^{-1}g)h_{3}(x',x'^{-1}g')]. \end{aligned}

Replace g\rightarrow xg to rewrite the expectation as

\begin{aligned} \mathbb {E}[h_{3}(x,g)h_{3}(x,x^{-1}g')h_{3}(x',x'^{-1}xg)h_{3}(x',x'^{-1}g')]. \end{aligned}

We want to view the last three terms as a distinguisher U(x)\cdot V(xg). First, note that h_{3} has range [-1,1]. This is because h_{3}(x,y)=f(x,y)-\mathbb{E} _{x',y'\in \textit {Cell}(x,y)}f(x',y') and f has range \{0,1\}, where recall that Cell(x,y) is the set in the partition that contains (x,y). Fix x',g'. The last term in the expectation becomes a constant c\in [-1,1]. The second term only depends on x, and the third only on xg. Hence for appropriate functions U and V with range [-1,1] this expectation can be rewritten as

\begin{aligned} \mathbb {E}[h_{3}(x,g)U(x)V(xg)], \end{aligned}

which concludes the proof. \square

There are similar proofs to show the remaining terms are small. For fh_{2}g_{3}, we can perform simple manipulations and then reduce to the above case. For h_{1}g_{2}g_{3}, we have a slightly easier proof than above.

7.4.1 Parameters

Suppose our set has density \delta \ge 1/\log ^{a}|G|, and the error in the regularity lemma is \epsilon . By the above results we can bound

\begin{aligned} \mathbb {E}_{x,y,g}[f(x,y)f(xg,y)f(x,gy)]\ge 1/\log ^{4a}|G|-2^{O(1/\epsilon ^{2})}/|G|^{\Omega (1)}-\epsilon ^{\Omega (1)}, \end{aligned}

where the terms on the right-hand side come, left to right, from Claims 17, 16, and 18. Picking \epsilon =1/\log ^{1/3}|G|, the proof is completed for sufficiently small a.
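To see how the three terms compare for this choice of \epsilon , here is a short worked calculation. It is a sketch under one assumption of mine: the exponent 1/4 from Claim 18 is used for the whole \epsilon ^{\Omega (1)} term, although the text does not state the exponent for the other two error terms.

```latex
\begin{aligned}
\epsilon =\log ^{-1/3}|G| \quad &\Longrightarrow \quad 2^{O(1/\epsilon ^{2})}=2^{O(\log ^{2/3}|G|)}=|G|^{o(1)},
\quad \text{so}\quad \frac {2^{O(1/\epsilon ^{2})}}{|G|^{\Omega (1)}}\le \frac {1}{|G|^{\Omega (1)}},\\
O(\epsilon )^{1/4} &= O\left (\log ^{-1/12}|G|\right ).
\end{aligned}
```

Under that assumption, the main term 1/\log ^{4a}|G| dominates both error terms for large |G| as soon as 4a<1/12, and what remains is still much larger than 1/|G|.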


[AL00]    Andris Ambainis and Satyanarayana V. Lokam. Improved upper bounds on the simultaneous messages complexity of the generalized addressing function. In Latin American Symposium on Theoretical Informatics (LATIN), pages 207–216, 2000.

[Amb96]    Andris Ambainis. Upper bounds on multiparty communication complexity of shifts. In Symp. on Theoretical Aspects of Computer Science (STACS), pages 631–642, 1996.

[AMS99]    Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. J. of Computer and System Sciences, 58(1, part 2):137–147, 1999.

[Aus16]    Tim Austin. Ajtai-Szemerédi theorems over quasirandom groups. In Recent trends in combinatorics, volume 159 of IMA Vol. Math. Appl., pages 453–484. Springer, [Cham], 2016.

[Bar89]    David A. Mix Barrington. Bounded-width polynomial-size branching programs recognize exactly those languages in NC^1. J. of Computer and System Sciences, 38(1):150–164, 1989.

[BC92]    Michael Ben-Or and Richard Cleve. Computing algebraic formulas using a constant number of registers. SIAM J. on Computing, 21(1):54–58, 1992.

[BDPW10]   Paul Beame, Matei David, Toniann Pitassi, and Philipp Woelfel. Separating deterministic from randomized multiparty communication complexity. Theory of Computing, 6(1):201–225, 2010.

[BGKL03]    László Babai, Anna Gál, Peter G. Kimmel, and Satyanarayana V. Lokam. Communication complexity of simultaneous messages. SIAM J. on Computing, 33(1):137–166, 2003.

[BNP08]    László Babai, Nikolay Nikolov, and László Pyber. Product growth and mixing in finite groups. In ACM-SIAM Symp. on Discrete Algorithms (SODA), pages 248–257, 2008.

[CFL83]    Ashok K. Chandra, Merrick L. Furst, and Richard J. Lipton. Multi-party protocols. In 15th ACM Symp. on the Theory of Computing (STOC), pages 94–99, 1983.

[CP10]    Arkadev Chattopadhyay and Toniann Pitassi. The story of set disjointness. SIGACT News, 41(3):59–85, 2010.

[DHKP97]    Martin Dietzfelbinger, Torben Hagerup, Jyrki Katajainen, and Martti Penttonen. A reliable randomized algorithm for the closest-pair problem. J. Algorithms, 25(1):19–51, 1997.

[FK96]    Alan M. Frieze and Ravi Kannan. The regularity lemma and approximation schemes for dense problems. In IEEE Symp. on Foundations of Computer Science (FOCS), pages 12–20, 1996.

[Gow08]    W. T. Gowers. Quasirandom groups. Combinatorics, Probability & Computing, 17(3):363–387, 2008.

[Gre05a]    Ben Green. An argument of Shkredov in the finite field setting, 2005. Available at

[Gre05b]    Ben Green. Finite field models in additive combinatorics. Surveys in Combinatorics, London Math. Soc. Lecture Notes 327, 1-27, 2005.

[GVa]    W. T. Gowers and Emanuele Viola. Interleaved group products. SIAM J. on Computing.

[GVb]    W. T. Gowers and Emanuele Viola. The multiparty communication complexity of interleaved group products. SIAM J. on Computing.

[GV15]    W. T. Gowers and Emanuele Viola. The communication complexity of interleaved group products. In ACM Symp. on the Theory of Computing (STOC), 2015.

[IL95]    Neil Immerman and Susan Landau. The complexity of iterated multiplication. Inf. Comput., 116(1):103–116, 1995.

[KMR66]    Kenneth Krohn, W. D. Maurer, and John Rhodes. Realizing complex Boolean functions with simple groups. Information and Control, 9:190–195, 1966.

[KN97]    Eyal Kushilevitz and Noam Nisan. Communication complexity. Cambridge University Press, 1997.

[KS92]    Bala Kalyanasundaram and Georg Schnitger. The probabilistic communication complexity of set intersection. SIAM J. Discrete Math., 5(4):545–557, 1992.

[LM07]    Michael T. Lacey and William McClain. On an argument of Shkredov on two-dimensional corners. Online J. Anal. Comb., (2):Art. 2, 21, 2007.

[LW54]    Serge Lang and André Weil. Number of points of varieties in finite fields. American Journal of Mathematics, 76:819–827, 1954.

[Mil14]    Eric Miles. Iterated group products and leakage resilience against NC^1. In ACM Innovations in Theoretical Computer Science conf. (ITCS), 2014.

[MV13]    Eric Miles and Emanuele Viola. Shielding circuits with groups. In ACM Symp. on the Theory of Computing (STOC), 2013.

[PRS97]    Pavel Pudlák, Vojtěch Rödl, and Jiří Sgall. Boolean circuits, tensor ranks, and communication complexity. SIAM J. on Computing, 26(3):605–633, 1997.

[Raz92]    Alexander A. Razborov. On the distributional complexity of disjointness. Theor. Comput. Sci., 106(2):385–390, 1992.

[Raz00]    Ran Raz. The BNS-Chung criterion for multi-party communication complexity. Computational Complexity, 9(2):113–122, 2000.

[RY19]    Anup Rao and Amir Yehudayoff. Communication complexity. 2019. anuprao/pubs/book.pdf.

[Sha16]    Aner Shalev. Mixing, communication complexity and conjectures of Gowers and Viola. Combinatorics, Probability and Computing, pages 1–13, 6 2016. arXiv:1601.00795.

[She14]    Alexander A. Sherstov. Communication complexity theory: Thirty-five years of set disjointness. In Symp. on Math. Foundations of Computer Science (MFCS), pages 24–43, 2014.

[Tao17]    Terence Tao. Szemerédi’s proof of Szemerédi’s theorem, 2017.

[Vioa]    Emanuele Viola. Thoughts: Mixing in groups.

[Viob]    Emanuele Viola. Thoughts: Mixing in groups ii.

[Vio14]    Emanuele Viola. The communication complexity of addition. Combinatorica, pages 1–45, 2014.

[Vio17]    Emanuele Viola. Special topics in complexity theory. Lecture notes of the class taught at Northeastern University. Available at, 2017.

[Yao79]    Andrew Chi-Chih Yao. Some complexity questions related to distributive computing. In 11th ACM Symp. on the Theory of Computing (STOC), pages 209–213, 1979.

Bounded independence plus noise fools space

There are many classes of functions on n bits that we know are fooled by bounded independence, including small-depth circuits, halfspaces, etc. (See this previous post.)

On the other hand the simple parity function is not fooled. It’s easy to see that you require independence at least n-1. However, if you just perturb the bits with a little noise N, then parity will be fooled. You can find other examples of functions that are not fooled by bounded independence alone, but are if you just perturb the bits a little.
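For a concrete instance, take D uniform over the n-bit strings of even parity, which is (n-1)-wise independent yet has parity identically 0. The sketch below, with a function name of my own choosing, computes exactly the bias of parity on D+E, where the noise bits are 1 with probability 1/4 (the value used later in the proof). The brute-force enumeration is only feasible for small n.

```python
from fractions import Fraction
from itertools import product

def parity_bias_after_noise(n, noise=Fraction(1, 4)):
    """E[(-1)^parity(D + E)] computed exactly by enumeration.

    D is uniform over even-parity strings; E has i.i.d. bits equal to 1
    with probability `noise`; + is bit-wise XOR.
    """
    even_strings = [x for x in product((0, 1), repeat=n) if sum(x) % 2 == 0]
    bias = Fraction(0)
    for d in even_strings:
        for e in product((0, 1), repeat=n):
            weight = Fraction(1, len(even_strings))
            for bit in e:
                weight *= noise if bit else 1 - noise
            # parity(d XOR e) = parity(d) + parity(e) mod 2
            parity = (sum(d) + sum(e)) % 2
            bias += weight * (-1) ** parity
    return bias

bias_without_noise = parity_bias_after_noise(4, Fraction(0))
bias_with_noise = parity_bias_after_noise(4)
```

Without noise the bias is 1, while under the uniform distribution it is 0. With noise each bit contributes a factor 1-2\cdot 1/4=1/2, so the bias drops to 2^{-n}.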

In [3] we proved that any distribution with independence about n^{2/3} fools space-bounded algorithms, if you perturb it with noise. We asked, both in the paper and many people, if the independence could be lowered. Forbes and Kelley have recently proved [2] that the independence can be lowered all the way to O(\log n), which is tight [1]. Shockingly, their proof is nearly identical to [3]!

This exciting result has several interesting consequences. First, we now have almost the same generators for space-bounded computation in a fixed order as we do for any order. Moreover, the proof greatly simplifies a number of works in the literature. And finally, an approach in [4] to prove limitations for the sum of small-bias generators won’t work for space (possibly justifying some optimism in the power of the sum of small-bias generators).

My understanding of all this area is inseparable from the collaboration I have had with Chin Ho Lee, with whom I co-authored all the papers I have on this topic.

The proof

Let f:\{0,1\}^{n}\to \{0,1\} be a function. We want to show that it is fooled by D+E, where D has independence k, E is the noise vector of i.i.d. bits coming up 1 with probability say 1/4, and + is bit-wise XOR.

The approach in [3] is to decompose f as the sum of a function L with Fourier degree k, and a sum of t functions H_{i}=h_{i}\cdot g_{i} where h_{i} has no Fourier coefficient of degree less than k, and h_{i} and g_{i} are bounded. The function L is immediately fooled by D, and it is shown in [3] that each H_{i} is fooled as well.

To explain the decomposition it is best to think of f as the product of \ell :=n/k functions f_{i} on k bits, on disjoint inputs. The decomposition in [3] is as follows: repeatedly decompose each f_{i} into a low-degree part f_{iL} and a high-degree part f_{iH}. To illustrate:

\begin{aligned} f_{1}f_{2}f_{3} & =f_{1}f_{2}(f_{3H}+f_{3L})=f_{1}f_{2}f_{3H}+f_{1}(f_{2H}+f_{2L})f_{3L}=\ldots \\ = & f_{1H}f_{2L}f_{3L}+f_{1}f_{2H}f_{3L}+f_{1}f_{2}f_{3H}+f_{1L}f_{2L}f_{3L}\\ = & H_{1}+H_{2}+H_{3}+L. \end{aligned}

This works, but the problem is that even if each time f_{iL} has degree 1, the function L increases the degree by at least 1 per decomposition; and so we can afford at most k decompositions.
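The telescoping identity behind this decomposition holds pointwise, so it can be checked on plain numbers. In the sketch below (the function name and the random integers are mine, standing for the values of f_{iL} and f_{iH} at one fixed input), H_{i} multiplies the full f_{j} for j<i, the high part of f_{i}, and the low parts f_{jL} for j>i.

```python
import random

def decompose(f_low, f_high):
    """Return (H, L) with prod_i (f_low[i] + f_high[i]) = sum(H) + L."""
    count = len(f_low)
    f = [lo + hi for lo, hi in zip(f_low, f_high)]
    H = []
    for i in range(count):
        term = f_high[i]
        for j in range(i):             # full functions f_j for j < i
            term *= f[j]
        for j in range(i + 1, count):  # low parts f_jL for j > i
            term *= f_low[j]
        H.append(term)
    L = 1
    for lo in f_low:                   # product of all low parts
        L *= lo
    return H, L

rng = random.Random(0)
f_low = [rng.randint(-5, 5) for _ in range(4)]
f_high = [rng.randint(-5, 5) for _ in range(4)]
H, L = decompose(f_low, f_high)

full_product = 1
for lo, hi in zip(f_low, f_high):
    full_product *= lo + hi
```

With three factors this reproduces the four terms H_{1}+H_{2}+H_{3}+L displayed above.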

The decomposition in [2] is instead: pick L to be the degree k part of f, and H_{i} are all the Fourier coefficients which are non-zero in the inputs to f_{i} and whose degree in the inputs of f_{1},\ldots ,f_{i} is \ge k. The functions H_{i} can be written as h_{i}\cdot g_{i}, where h_{i} is the high-degree part of f_{1}\cdots f_{i} and g_{i} is f_{i+1}\cdots f_{\ell }.

Once you have this decomposition you can apply the same lemmas in [3] to get improved bounds. To handle space-bounded computation they extend this argument to matrix-valued functions.

What’s next

In [3] we asked for tight “bounded independence plus noise” results for any model, and the question remains. In particular, what about high-degree polynomials modulo 2?


[1]   Ravi Boppana, Johan Håstad, Chin Ho Lee, and Emanuele Viola. Bounded independence vs. moduli. In Workshop on Randomization and Computation (RANDOM), 2016.

[2]   Michael A. Forbes and Zander Kelley. Pseudorandom generators for read-once branching programs, in any order. In IEEE Symp. on Foundations of Computer Science (FOCS), 2018.

[3]   Elad Haramaty, Chin Ho Lee, and Emanuele Viola. Bounded independence plus noise fools products. SIAM J. on Computing, 47(2):295–615, 2018.

[4]   Chin Ho Lee and Emanuele Viola. Some limitations of the sum of small-bias distributions. Theory of Computing, 13, 2017.



Nonclassical polynomials and exact computation of Boolean functions

Guest post by Abhishek Bhrushundi.

I would like to thank Emanuele for giving me the opportunity to write a guest post here. I recently stumbled upon an old post on this blog which discussed two papers: Nonclassical polynomials as a barrier to polynomial lower bounds by Bhowmick and Lovett, and Anti-concentration for random polynomials by Nguyen and Vu. Towards the end of the post, Emanuele writes:

“Having discussed these two papers in a sequence, a natural question is whether non-classical polynomials help for exact computation as considered in the second paper. In fact, this question is asked in the paper by Bhowmick and Lovett, who conjecture that the answer is negative: for exact computation, non-classical polynomials should not do better than classical.”

In a joint work with Prahladh Harsha and Srikanth Srinivasan from last year, On polynomial approximations over \mathbb {Z}/2^k\mathbb {Z}, we study exact computation of Boolean functions by nonclassical polynomials. In particular, one of our results disproves the aforementioned conjecture of Bhowmick and Lovett by giving an example of a Boolean function for which low degree nonclassical polynomials end up doing better than classical polynomials of the same degree in the case of exact computation.

The counterexample we propose is the elementary symmetric polynomial of degree 16 in \mathbb {F}_2[x_1, \ldots , x_n]. (Such elementary symmetric polynomials also serve as counterexamples to the inverse conjecture for the Gowers norm [LMS11, GT07], and this was indeed the reason why we picked these functions as candidate counterexamples),

\begin{aligned}S_{16}(x_1, \ldots , x_n) = \left (\sum _{S\subseteq [n],|S| = 16} \prod _{i \in S}x_i\right )\textrm { mod 2} = {|x| \choose 16} \textrm { mod 2},\end{aligned}

where |x| = \sum _{i=1}^n x_i is the Hamming weight of x. One can verify (using, for example, Lucas’s theorem) that S_{16}(x_1, \ldots , x_n) = 1 if and only if the 5^{th} least significant bit of |x| is 1.
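The consequence of Lucas’s theorem can be confirmed by brute force over Hamming weights. This is a small sanity check with helper names of my own; it relies only on math.comb from the Python standard library.

```python
from math import comb

def s16_of_weight(w):
    """S_16 on any input of Hamming weight w, i.e. binom(w, 16) mod 2."""
    return comb(w, 16) % 2

def fifth_bit(w):
    """The 5th least significant bit of w (the bit of weight 16)."""
    return (w >> 4) & 1

# Weights on which the two functions disagree; expected to be empty.
mismatches = [w for w in range(2048) if s16_of_weight(w) != fifth_bit(w)]
```

Since 16 has a single 1 in binary, Lucas’s theorem reduces {|x| \choose 16} mod 2 to that one bit of |x|.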

We use that no polynomial of degree less than or equal to 15 can compute S_{16}(x) correctly on more than half of the points in \{0,1\}^n.

Theorem 1. Let P be a polynomial of degree at most 15 in \mathbb {F}_2[x_1, \ldots , x_n]. Then

\begin{aligned}\Pr _{x \sim \{0,1\}^n}[P(x) = S_{16}(x)] \le \frac {1}{2} + o(1).\end{aligned}

[Emanuele’s note. Let me take advantage of this for a historical remark. Green and Tao first claimed this fact and sent me and several others a complicated proof. Then I pointed out the paper by Alon and Beigel [AB01]. Soon after they and I independently discovered the short proof reported in [GT07].]

The constant functions (degree 0 polynomials) can compute any Boolean function on half of the points in \{0,1\}^n and this result shows that even polynomials of higher degree don’t do any better as far as S_{16}(x_1, \ldots , x_n) is concerned. What we prove is that there is a nonclassical polynomial of degree 14 that computes S_{16}(x_1, \ldots , x_n) on 9/16 \ge 1/2 + \Omega (1) of the points in \{0,1\}^n.

Theorem 2. There is a nonclassical polynomial P of degree 14 such that

\begin{aligned}\Pr _{x \sim \{0,1\}^n}[P(x) = S_{16}(x)] = \frac {9}{16} - o(1).\end{aligned}

A nonclassical polynomial takes values on the torus \mathbb {T} = \mathbb {R}/\mathbb {Z} and in order to compare the output of a Boolean function (i.e., a classical polynomial) to that of a nonclassical polynomial it is convenient to think of the range of Boolean functions to be \{0,1/2\} \subset \mathbb {T}. So, for example, S_{16}(x_1, \ldots , x_n) = \frac {1}{2} if |x|_4 = 1, and S_{16}(x_1, \ldots , x_n) = 0 otherwise. Here |x|_4 denotes the 5^{th} least significant bit of |x|.

We show that the nonclassical polynomial that computes S_{16}(x) on 9/16 of the points in \{0,1\}^n is

\begin{aligned}P(x_1, \ldots , x_n) = \frac {\sum _{S \subseteq [n], |S|=12} \prod _{i \in S}x_i}{8} \textrm { mod 1}= \frac {{|x| \choose 12}}{8} \textrm { mod 1} .\end{aligned}

The degree of this nonclassical polynomial is 14, but I won’t go into much detail as to why this is the case (see [BL15] for a primer on the notion of degree in the nonclassical world).

Understanding how P(x) behaves comes down to figuring out the largest power of two that divides {|x| \choose 12} for a given x: if the largest power of two that divides {|x| \choose 12} is exactly 2^{2} then P(x) = 1/2, and if it is at least 2^{3} then P(x) = 0. Fortunately, there is a generalization of Lucas’s theorem, known as Kummer’s theorem, that helps characterize this:

Theorem 3.[Kummer’s theorem] The exponent of the largest power of 2 dividing {a \choose b} for a,b \in \mathbb {N}, a \ge b, is equal to the number of borrows required when subtracting b from a in base 2.

Equipped with Kummer’s theorem, it doesn’t take much work to arrive at the following conclusion.

Lemma 4. P(x) = S_{16}(x) if either |x|_{2} = 0 or (|x|_2, |x|_3, |x|_4, |x|_5) = (1,0,0,0), where |x|_i denotes the (i+1)^{th} least significant bit of |x|.

If x = (x_1, \ldots , x_n) is uniformly distributed in \{0,1\}^n then it’s not hard to verify that the bits |x|_0, \ldots , |x|_5 are almost uniformly and independently distributed in \{0,1\}, and so the above lemma proves that P(x) computes S_{16}(x) on 9/16 of the points in \{0,1\}^n. It turns out that one can easily generalize the above argument to show that S_{2^\ell }(x) is a counterexample to Bhowmick and Lovett’s conjecture for every \ell \ge 4.
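Lemma 4 and the 9/16 count can be checked mechanically on Hamming weights. The sketch below uses helper names of my own and exact fractions. It counts agreement over the 64 possible values of the six low-order bits of |x| rather than sampling x, which matches the lemma only to the extent that those bits are close to uniform, as stated above.

```python
from fractions import Fraction
from math import comb

def bit(w, i):
    """The (i+1)-th least significant bit of w."""
    return (w >> i) & 1

def P(w):
    """binom(w, 12)/8 mod 1, as an exact element of [0, 1)."""
    return Fraction(comb(w, 12) % 8, 8)

def S16(w):
    """S_16 with range {0, 1/2}: equal to 1/2 exactly when bit 4 of w is 1."""
    return Fraction(bit(w, 4), 2)

def lemma4_condition(w):
    low_bits = (bit(w, 2), bit(w, 3), bit(w, 4), bit(w, 5))
    return bit(w, 2) == 0 or low_bits == (1, 0, 0, 0)

# Weights in one full period of the bits 0..5 on which P and S_16 agree.
agree = [w for w in range(64) if P(w) == S16(w)]
fraction_agree = Fraction(len(agree), 64)
```

The condition holds for 32 of the 64 patterns through |x|_{2}=0 and for 4 more through (1,0,0,0), which gives 36/64=9/16.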

We also show in our paper that it is not the case that nonclassical polynomials always do better than classical polynomials in the case of exact computation — for the majority function, nonclassical polynomials do as badly as their classical counterparts (this was also conjectured by Bhowmick and Lovett in the same work), and the Razborov-Smolensky bound for classical polynomials extends to nonclassical polynomials.

We started out trying to prove that S_4(x_1, \ldots , x_n) is a counterexample but couldn’t. It would be interesting to check if it is one.


[AB01]    N. Alon and R. Beigel. Lower bounds for approximations by low degree polynomials over Z_m. In Proceedings 16th Annual IEEE Conference on Computational Complexity, pages 184–187, 2001.

[BL15]    Abhishek Bhowmick and Shachar Lovett. Nonclassical polynomials as a barrier to polynomial lower bounds. In Proceedings of the 30th Conference on Computational Complexity, pages 72–87, 2015.

[GT07]    B. Green and T. Tao. The distribution of polynomials over finite fields, with applications to the Gowers norms. ArXiv e-prints, November 2007.

[LMS11]   Shachar Lovett, Roy Meshulam, and Alex Samorodnitsky. Inverse conjecture for the Gowers norm is false. Theory of Computing, 7(9):131–145, 2011.

Entropy polarization

Sometimes you see quantum popping up everywhere. I just did the opposite and gave a classical talk at a quantum workshop, part of an AMS meeting held at Northeastern University, which poured yet another avalanche of talks onto the Boston area. I spoke about the complexity of distributions, also featured in an earlier post, including a result I posted two weeks ago which gives a boolean function f:\{0,1\}^{n}\to \{0,1\} such that the output distribution of any AC^{0} circuit has statistical distance 1/2-1/n^{\omega (1)} from (Y,f(Y)) for uniform Y\in \{0,1\}^{n}. In particular, no AC^{0} circuit can compute f much better than guessing at random even if the circuit is allowed to sample the input itself. The slides for the talk are here.

The new technique that enables this result I’ve called entropy polarization. Basically, for every AC^{0} circuit mapping any number L of bits into n bits, there exists a small set S of restrictions such that:

(1) the restrictions preserve the output distribution, and

(2) for every restriction r\in S, the output distribution of the circuit restricted to r either has min-entropy 0 or n^{0.9}. Whence polarization: the entropy will become either very small or very large.

Such a result is useless and trivial to prove with |S|=2^{n}; the critical feature is that one can obtain a much smaller S of size 2^{n-n^{\Omega (1)}}.

Entropy polarization can be used in conjunction with a previous technique of mine that works for high min-entropy distributions to obtain the said sampling lower bound.

It would be interesting to see if any of this machinery can yield a separation between quantum and classical sampling for constant-depth circuits, which is probably a reason why I was invited to give this talk.

Hardness amplification proofs require majority… and 15 years

Aryeh Grinberg, Ronen Shaltiel, and I have just posted a paper which proves conjectures I made 15 years ago (historians may want to consult the last paragraph of [2] and my Ph.D. thesis).

At that time, I was studying hardness amplification, a cool technique to take a function f:\{0,1\}^{k}\to \{0,1\} that is somewhat hard on average, and transform it into another function f':\{0,1\}^{n}\to \{0,1\} that is much harder on average. If you call a function \delta -hard if it cannot be computed on a \delta fraction of the inputs, you can start e.g. with f that is 0.1-hard and obtain f' that is 1/2-1/n^{100} hard, or more. This is very important because functions with the latter hardness imply pseudorandom generators with Nisan’s design technique, and also “additional” lower bounds using the “discriminator lemma.”

The simplest and most famous technique is Yao’s XOR lemma, where

\begin{aligned} f'(x_{1},x_{2},\ldots ,x_{t}):=f(x_{1})\oplus f(x_{2})\oplus \ldots \oplus f(x_{t}) \end{aligned}

and the advantage over random guessing in computing f' decays exponentially with t, so the hardness of f' approaches 1/2 exponentially fast. (So to achieve the parameters above it suffices to take t=O(\log k).)
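To get a feeling for the exponential decay, here is a minimal sketch of the idealized calculation behind the XOR lemma. It assumes, purely for illustration, that each of the t copies is guessed wrong independently with probability \delta; the actual lemma has to deal with arbitrary circuits and does not reduce to this calculation. The function names are hypothetical.

```python
from itertools import product

def xor_success_exact(delta, t):
    """Probability that the XOR of t independent guesses is right,
    when each guess is wrong with probability delta (brute force)."""
    total = 0.0
    for errors in product([0, 1], repeat=t):
        p = 1.0
        for e in errors:
            p *= delta if e else (1 - delta)
        if sum(errors) % 2 == 0:  # an even number of errors cancels out
            total += p
    return total

def xor_success_formula(delta, t):
    """Closed form for the same quantity: 1/2 + (1 - 2*delta)^t / 2."""
    return 0.5 + (1 - 2 * delta) ** t / 2

if __name__ == "__main__":
    # Starting from a 0.1-hard function, the advantage over 1/2 shrinks like 0.8^t.
    for t in (1, 2, 5, 10):
        print(t, xor_success_exact(0.1, t), xor_success_formula(0.1, t))
```

In this toy model the advantage over guessing is (1-2\delta)^{t}/2, which is where the choice of a logarithmic t comes from.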

At the same time I was also interested in circuit lower bounds, so it was natural to try to use this technique for classes for which we do have lower bounds. So I tried, and… oops, it does not work! In all known techniques, the reduction circuit cannot be implemented in a class smaller than TC^{0} – a class for which we don’t have lower bounds and for which we think it will be hard to get them, also because of the Natural proofs barrier.

Eventually, I conjectured that this is inherent, namely that you can take any hardness amplification reduction, or proof, and use it to compute majority. To be clear, this conjecture applied to black-box proofs: decoding arguments which take anything that computes f' too well and turn it into something which computes f too well. There were several partial results, but they all had to restrict the proof further, and did not capture all available techniques.

Should you have had any hope that black-box proofs might do the job, in this paper we prove the full conjecture (improving on a number of incomparable works in the literature, including a 10-year-anniversary work by Shaltiel and myself which proved the conjecture for non-adaptive proofs).


One thing that comes up in the proof is the following basic problem. You have a distribution X on n bits that has large entropy, very close to n. A classic result shows that most bits of X are close to uniform. We needed an adaptive version of this, showing that a decision tree making few queries cannot distinguish X from uniform, as long as the tree does not query a certain small forbidden set of variables. This also follows from recent and independent work of Or Meir and Avi Wigderson.

Turns out this natural extension is not enough for us. In a nutshell, it is difficult to understand what queries an arbitrary reduction is making, and so it is hard to guarantee that the reduction does not query the forbidden set. So we prove a variant, where the variables are not forbidden, but are fixed. Basically, you condition on some fixing X_{B}=v of few variables, and then the resulting distribution X|X_{B}=v is indistinguishable from the distribution U|U_{B}=v where U is uniform. Now the queries are not forbidden but have a fixed answer, and this makes things much easier. (Incidentally, you can’t get this simply by fixing the forbidden set.)

Fine, so what?

One great question remains. Can you think of a counter-example to the XOR lemma for a class such as constant-depth circuits with parity gates?

But there is another reason why I am interested in this. Proving 1/2-1/n average-case hardness results for restricted classes “just” beyond AC^{0} is more than a long-standing open question in lower bounds: It is necessary even for worst-case lower bounds, both in circuit and communication complexity, as we discussed earlier. And here’s hardness amplification, which intuitively should provide such hardness results. It was given many different proofs, see e.g. [1]. However, none can be applied as we just saw. I don’t know, someone taking results at face value may even start thinking that such average-case hardness results are actually false.


[1]   Oded Goldreich, Noam Nisan, and Avi Wigderson. On Yao’s XOR lemma. Technical Report TR95–050, Electronic Colloquium on Computational Complexity, March 1995.

[2]   Emanuele Viola. The complexity of constructing pseudorandom generators from hard functions. Computational Complexity, 13(3-4):147–188, 2004.

Matrix rigidity, and all that

The rigidity challenge asks to exhibit an n × n matrix M that cannot be written as M = A + B where A is “sparse” and B is “low-rank.” This challenge was raised by Valiant who showed in [Val77] that if it is met for any A with at most n^{1+ϵ} non-zero entries and any B with rank O(n / log log n) then computing the linear transformation M requires either logarithmic depth or superlinear size for linear circuits. This connection relies on the following lemma.

Lemma 1. Let C : {0,1}^n → {0,1}^n be a circuit made of XOR gates. If you can remove e edges and reduce the depth to d then the linear transformation computed by C equals A + B where A has ≤ 2^d non-zero entries per row (and so a total of ≤ n 2^d non-zero entries), and B has rank ≤ e.

Proof: After you remove the edges, each output bit is a linear combination of the removed edges and at most 2^d input variables. The former can be done by B, the latter by A. QED
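The proof can be checked mechanically on a toy circuit. The sketch below is only an illustration under assumptions of my choosing: the circuit is a chain of fan-in-2 XOR gates computing prefix sums, linear forms are stored as bitmasks, and all names (evaluate, decompose, gf2_rank) are hypothetical. Each removed edge is replaced by a fresh variable; the part of an output that depends on the inputs gives a row of A, and the part that depends on the fresh variables gives a row of B.

```python
def evaluate(n, gates, removed=()):
    """Wires 0..n-1 are inputs; gate k is wire n+k and XORs the wires gates[k] = (u, v).
    A removed edge (u, g) is replaced by a fresh variable stored at bit n + its index.
    Returns the linear form (bitmask) and the depth of every wire."""
    removed = list(removed)
    form = [1 << i for i in range(n)]
    depth = [0] * n
    for k, (u, v) in enumerate(gates):
        g = n + k
        f, d = 0, 0
        for w in (u, v):
            if (w, g) in removed:
                f ^= 1 << (n + removed.index((w, g)))  # removed edge acts as a leaf
            else:
                f ^= form[w]
                d = max(d, depth[w])
        form.append(f)
        depth.append(d + 1)
    return form, depth

def gf2_rank(rows):
    """Rank over GF(2) of rows given as bitmasks."""
    rows, rank = list(rows), 0
    while rows:
        pivot = rows.pop()
        if pivot:
            rank += 1
            low = pivot & -pivot
            rows = [r ^ pivot if r & low else r for r in rows]
    return rank

def decompose(n, gates, outputs, removed):
    orig, _ = evaluate(n, gates)
    form, depth = evaluate(n, gates, removed)
    mask = (1 << n) - 1
    A = [form[w] & mask for w in outputs]          # dependence on the inputs
    B = []
    for w in outputs:                              # dependence on the removed edges
        row = 0
        for j, (u, _g) in enumerate(removed):
            if (form[w] >> (n + j)) & 1:
                row ^= orig[u]                     # value carried by that edge
        B.append(row)
    return [orig[w] for w in outputs], A, B, max(depth)

n = 8
# Prefix-XOR chain: gate k computes x_0 + ... + x_{k+1}, so the depth is n - 1.
gates = [(0, 1)] + [(n + k - 1, k + 1) for k in range(1, n - 1)]
outputs = [0] + [n + k for k in range(n - 1)]
removed = [(n + 1, n + 2), (n + 3, n + 4), (n + 5, n + 6)]
M, A, B, d = decompose(n, gates, outputs, removed)
```

On this example one can check that every row of M is the XOR of the corresponding rows of A and B, that the rows of A have at most 2^d ones, and that the rank of B over GF(2) is at most the number of removed edges.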

Valiant shows that in a log-depth, linear-size circuit one can remove O(n / log log n) edges to reduce the depth to ϵ log n, so that 2^d = n^ϵ in Lemma 1 (a proof can be found in [Vio09]), and this gives the above connection to lower bounds.

However, the best available tradeoffs for explicit matrices give sparsity (n^2/r) log(n/r) and rank r, for any parameter r; and this is not sufficient for application to lower bounds.

Error-correcting codes

It was asked whether generator matrixes of good linear codes are rigid. (A code is good if it has constant rate and constant relative distance. The dimensions of the corresponding matrixes are off by only a constant factor, and so we can treat them as identical.) Spielman [Spi95] shows that there exist good codes that can be encoded by linear-size logarithmic depth circuits. This immediately rules out the possibility of proving a lower bound, and it gives a non-trivial rigidity upper bound via the above connections.

Still, one can ask if these matrices at least are more rigid than the available tradeoffs. Goldreich reports a negative answer by Dvir, showing that there exist good codes whose generating matrix C equals A + B where A has at most O(n^2/d) non-zero entries and B has rank O(d log(n/d)), for any d.

A similar negative answer follows from the paper [GHK+13]. There we show that there exist good linear codes whose generating matrix can be written as the product of few sparse matrixes. The corresponding circuits are very structured, and so perhaps it is not surprising that they give good rigidity upper bounds. More precisely, the paper shows that we can encode an n-bit message by a circuit made of XOR gates and with, say, n log^* n wires and depth O(1) (with unbounded fan-in). Each gate in the circuit computes the XOR of some t gates, which can be written as a binary tree of depth log_2 t + O(1). Such trees have poor rigidity:

Lemma 2. [Trees are not rigid] Let C be a binary tree of depth d. You can remove an O(1/2^b) fraction of edges to reduce the depth to b, for any b.

Proof: It suffices to remove all edges at depths d - b, d - 2b, …. The number of such edges is O(2^{d-b} + 2^{d-2b} + …) = O(2^{d-b}). Note this includes the case d ≤ b, where we can remove 0 edges. QED
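The counting in this proof is easy to replay numerically. The following is a small sketch under my own conventions, not part of the original argument: the tree is the complete binary tree of depth d, an edge "at depth k" is one entering a node at level k (there are 2^k of them), and a removed edge counts as a leaf of the piece above it, as in Lemma 1. The name cut_tree is hypothetical.

```python
def cut_tree(d, b):
    """Remove all edges at depths d-b, d-2b, ... (>= 1) from the complete
    binary tree of depth d. Returns (removed edges, total edges, largest
    depth of a remaining piece, counting removed edges as leaves)."""
    cut_levels = set(range(d - b, 0, -b))
    removed = sum(2 ** k for k in cut_levels)
    total = 2 ** (d + 1) - 2
    max_depth, run = 0, 0
    for k in range(1, d + 1):          # walk down one level at a time
        if k in cut_levels:
            max_depth = max(max_depth, run + 1)  # piece ends at the removed edge
            run = 0                              # level-k nodes start new pieces
        else:
            run += 1
            max_depth = max(max_depth, run)
    return removed, total, max_depth

if __name__ == "__main__":
    for d, b in [(6, 2), (7, 3), (10, 4), (3, 5)]:
        removed, total, depth = cut_tree(d, b)
        print(d, b, removed / total, 1 / 2 ** b, depth)
```

With these conventions the removed fraction stays within a factor 2 of 1/2^b, and every remaining piece has depth at most b.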

Applying Lemma 2 to a gate in our circuit, we reduce the depth of the binary tree computed at that gate to b. Applying this to every gate we obtain a circuit of depth O(b). In total we have removed an O(1/2^b) fraction of the n log^* n edges.

Writing 2^b = n/d, by Lemma 1 we can write the generating matrixes of our code as C = A + B where A has at most O(n/d) non-zero entries per row, and B has rank O(d log^* n). These parameters are the same as in Dvir’s result, up to lower-order terms. The lower-order terms appear incomparable.

Walsh-Fourier transform

Another matrix that was considered is the n × n Inner Product matrix H, aka the Walsh-Hadamard matrix, where the x,y entry is the inner product of x and y modulo 2. Alman and Williams [AW16] recently gave an interesting rigidity upper bound which prevents this machinery from establishing a circuit lower bound. Specifically they show that H can be written as H = A + B where A has at most n^{1+ϵ} non-zero entries, and B has rank n^{1-ϵ′}, for any ϵ and an ϵ′ which goes to 0 when ϵ does.

Their upper bound works as follows. Let h = log_2 n, so that the rows and columns of H are indexed by x, y ∈ {0,1}^h. Start with a real polynomial p(z_1,z_2,…,z_h), obtained from a univariate polynomial in the Hamming weight z_1 + … + z_h, which computes parity exactly on inputs of Hamming weight between 2ϵh and (1/2 + ϵ)h. By interpolation such a polynomial exists with degree (1/2 - ϵ)h. Replacing z_i with x_i y_i you obtain a polynomial of degree (1 - 2ϵ)h which computes IP correctly on inputs x,y whose inner product over the integers is between 2ϵh and (1/2 + ϵ)h.

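This polynomial has 2^{(1-ϵ′)h} = n^{1-ϵ′} monomials, where ϵ′ = Ω(ϵ^2). The truth-table of a polynomial with m monomials is a matrix with rank at most m, because each monomial in the products x_i y_i factors as a function of x times a function of y and so contributes a rank-1 matrix. This gives a low-rank matrix B′.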
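Here is a small sanity check of the claim that few monomials give low rank. It is a sketch with made-up ingredients: the polynomial below is an arbitrary example with 4 monomials (not the parity-approximating polynomial from [AW16]), h = 4, the rank is computed over the rationals, and the names truth_table and rank are hypothetical.

```python
from fractions import Fraction
from itertools import product

def truth_table(h, poly):
    """poly maps a monomial (a frozenset S of indices) to its coefficient.
    Entry (x, y) is p(x_1*y_1, ..., x_h*y_h) = sum_S c_S * prod_{i in S} x_i*y_i."""
    points = list(product([0, 1], repeat=h))
    return [[sum(c for S, c in poly.items() if all(x[i] and y[i] for i in S))
             for y in points] for x in points]

def rank(matrix):
    """Rank over the rationals by Gaussian elimination."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    r = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][col] / rows[r][col]
            if factor:
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r

# Hypothetical example: p(z) = 1 - 2*z_0 + 3*z_1*z_2 + z_0*z_3, so m = 4 monomials.
poly = {frozenset(): 1, frozenset({0}): -2, frozenset({1, 2}): 3, frozenset({0, 3}): 1}
table = truth_table(4, poly)   # a 16 x 16 matrix
```

The matrix is 16 × 16, yet its rank is bounded by the number of monomials, not by the dimension.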

The fact that sparse polynomials yield low-rank matrixes also appeared in the paper [SV12], which suggested studying the rigidity challenge for matrixes arising from polynomials.

Returning to the proof in [AW16], it remains to deal with inputs whose inner product does not lie in that range. The number of x whose weight is not between (1/2 - ϵ)h and (1/2 + ϵ)h is at most 2^{(1-ϵ′)h}. For each such input x we modify a row of the matrix B′. Repeating the process for the y we obtain the matrix B, and the rank bound 2^{(1-ϵ′)h} is unchanged up to a constant factor, since modifying one row or column changes the rank by at most 1.

Now a calculation shows that B differs from H in few entries. That is, there are few x and y with Hamming weight between (1/2 - ϵ)h and (1/2 + ϵ)h, but with inner product less than 2ϵh.

Boolean complexity

There exists a corresponding framework for boolean circuits (as opposed to circuits with XOR gates only). Rigid matrixes informally correspond to depth-3 Or-And-Or circuits. If this circuit has fan-in fo at the output gate and fan-in fi at each input gate, then the correspondence in parameters is

rank = log fo
sparsity = 2fi .

More precisely, we have the following lemma.

Lemma 3. Let C : {0,1}^n → {0,1}^n be a boolean circuit. If you can remove e edges and reduce the depth to d then you can write C as an Or-And-Or circuit with output fan-in 2^e and input fan-in 2^d.

Proof: After you remove the edges, each output bit and each removed edge depends on at most 2^d input bits or removed edges. The output Or gate of the depth-3 circuit is a big Or over all 2^e assignments of values for the removed edges. Then we need to check consistency. Each consistency check just depends on 2^d inputs and so can be written as a depth-2 circuit with fan-in 2^d. QED

The available bounds are of the form log fo = n/fi. For example, for input fan-in fi = n^α we have lower bounds exponential in n^{1-α} but not more. Again it can be shown that breaking this tradeoff in certain regimes (namely, log_2 fo = O(n / log log n)) yields lower bounds against linear-size log-depth circuits. (A proof appears in [Vio09].) It was also pointed out in [Vio13] that breaking this tradeoff in any regime yields lower bounds for branching programs. See also the previous post.

One may ask how pairwise independent hash functions relate to this challenge. Ishai, Kushilevitz, Ostrovsky, and Sahai showed [IKOS08] that they can be computed by linear-size log-depth circuits. Again this gives a non-trivial upper bound for depth-3 circuits via these connections, and one can ask for more. In [GHK+13] we give constructions of such circuits which in combination with Lemma 3 can again be used to almost match the available trade-offs.

The bottom line of this post is that we can’t prove lower bounds because they are false, and it is a puzzle to me why some people appear confident that P is different from NP.


[AW16]    Josh Alman and Ryan Williams. Probabilistic rank and matrix rigidity, 2016.

[GHK+13]   Anna Gál, Kristoffer Arnsfelt Hansen, Michal Koucký, Pavel Pudlák, and Emanuele Viola. Tight bounds on computing error-correcting codes by bounded-depth circuits with arbitrary gates. IEEE Transactions on Information Theory, 59(10):6611–6627, 2013.

[IKOS08]    Yuval Ishai, Eyal Kushilevitz, Rafail Ostrovsky, and Amit Sahai. Cryptography with constant computational overhead. In 40th ACM Symp. on the Theory of Computing (STOC), pages 433–442, 2008.

[Spi95]    Daniel Spielman. Computationally Efficient Error-Correcting Codes and Holographic Proofs. PhD thesis, Massachusetts Institute of Technology, 1995.

[SV12]    Rocco A. Servedio and Emanuele Viola. On a special case of rigidity. Available at, 2012.

[Val77]    Leslie G. Valiant. Graph-theoretic arguments in low-level complexity. In 6th Symposium on Mathematical Foundations of Computer Science, volume 53 of Lecture Notes in Computer Science, pages 162–176. Springer, 1977.

[Vio09]    Emanuele Viola. On the power of small-depth computation. Foundations and Trends in Theoretical Computer Science, 5(1):1–72, 2009.

[Vio13]    Emanuele Viola. Challenges in computational lower bounds. Available at, 2013.