I have prepared this talk which is a little unusual and is in part historical and speculative. You can view the slides here. I am scheduled to give it in about three hours at Boston University. And because it’s just another day in the greater Boston area, while I’ll be talking my ex officemate Vitaly Feldman will be speaking at Harvard University. His talk looks quite interesting and attempts to explain why overfitting is actually necessary for good learning. As for mine, well you’ll have to come and see or take a peek at the slides.
Nonabelian combinatorics and communication complexity
Below and here in pdf is a survey I am writing for SIGACT, due next week. Comments would be very helpful.
Finite groups provide an amazing wealth of problems of interest to complexity theory. And complexity theory also provides a useful viewpoint of group-theoretic notions, such as what it means for a group to be “far from abelian.” The general problem that we consider in this survey is that of computing a group product $g = x_1 \cdot x_2 \cdots x_n$ over a finite group $G$. Several variants of this problem are considered in this survey and in the literature, including in [KMR66, Bar89, BC92, IL95, BGKL03, PRS97, Amb96, AL00, Raz00, MV13, Mil14, GVa].
Some specific, natural computational problems related to $g$ are, from hardest to easiest:
(1) Computing $g$,
(2) Deciding if $g = 1_G$, where $1_G$ is the identity element of $G$, and
(3) Deciding if $g = 1_G$ under the promise that either $g = 1_G$ or $g = h$ for a fixed $h \neq 1_G$.
Problem (3) is from [MV13]. The focus of this survey is on (2) and (3).
We work in the model of communication complexity [Yao79], with which we assume familiarity. For background see [KN97, RY19]. Briefly, the terms in a product will be partitioned among collaborating parties – in several ways – and we shall bound the number of bits that the parties need to exchange to solve the problem.
Organization.
We begin in Section 2 with two-party communication complexity. In Section 3 we give a streamlined proof, except for a step that is only sketched, of a result of Gowers and the author [GV15, GVb] about interleaved group products. In particular we present an alternative proof, communicated to us by Will Sawin, of a lemma from [GVa]. We then consider two models of three-party communication. In Section 4 we consider number-in-hand protocols, and we relate the communication complexity to so-called quasirandom groups [Gow08, BNP08]. In Section 6 we consider number-on-forehead protocols, and specifically the problem of separating deterministic and randomized communication. In Section 7 we give an exposition of a result by Austin [Aus16], and show that it implies a separation that matches the state-of-the-art [BDPW10] but applies to a different problem.
Some of the sections follow closely a set of lectures by the author [Vio17]; related material can also be found in the blog posts [Vioa, Viob]. One of the goals of this survey is to present this material in a more organized manner, in addition to including new material.
2 Two parties
Let $G$ be a group and let us start by considering the following basic communication task. Alice gets an element $x \in G$ and Bob gets an element $y \in G$ and their goal is to check if $x \cdot y = 1_G$. How much communication do they need? Well, $x \cdot y = 1_G$ is equivalent to $x = y^{-1}$. Because Bob can compute $y^{-1}$ without communication, this problem is just a rephrasing of the equality problem, which has a randomized protocol with constant communication. This holds for any group.
The same is true if Alice gets two elements $x_1$ and $x_2$ and they need to check if $x_1 \cdot y \cdot x_2 = 1_G$. Indeed, it is just checking equality of $y$ and $x_1^{-1} \cdot x_2^{-1}$, and again Alice can compute the latter without communication.
Things get more interesting if both Alice and Bob get two elements and they need to check if the interleaved product of the elements of Alice and Bob equals $1_G$, that is, if
$$x_1 \cdot y_1 \cdot x_2 \cdot y_2 = 1_G.$$
Now the previous transformations don’t help anymore. In fact, the complexity depends on the group. If it is abelian then the elements can be reordered and the problem is equivalent to checking if $(x_1 \cdot x_2) \cdot (y_1 \cdot y_2) = 1_G$. Again, Alice can compute $x_1 \cdot x_2$ without communication, and Bob can compute $y_1 \cdot y_2$ without communication. So this is the same problem as before and it has a constant communication protocol.
For nonabelian groups this reordering cannot be done, and the problem seems hard. This can be formalized for a class of groups that are “far from abelian” – or we can take this result as a definition of being far from abelian. One of the groups that works best in this sense is the following, first constructed by Galois in the 1830’s.
Definition 1. The special linear group $\mathrm{SL}_2(q)$ is the group of $2 \times 2$ invertible matrices over the field $\mathbb{F}_q$ with determinant $1$.
The following result was asked in [MV13] and was proved in [GVa].
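Theorem 1. Let $G = \mathrm{SL}_2(q)$ and let $h \neq 1_G$. Suppose Alice receives $x_1, x_2 \in G$ and Bob receives $y_1, y_2 \in G$. They are promised that $x_1 \cdot y_1 \cdot x_2 \cdot y_2$ either equals $1_G$ or $h$. Deciding which case it is requires randomized communication $\Omega(\log |G|)$.
This bound is tight as Alice can send her input, taking $O(\log |G|)$ bits. We present the proof of this theorem in the next section.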
Similar results are known for other groups as well, see [GVa] and [Sha16]. For example, one group that is “between” abelian groups and is the following.
If we work over instead of in Theorem 1 then the communication complexity is [Sha16]. The latter bound is tight [MV13]: with knowledge of , the parties can agree on an element such that . Hence they only need to keep track of the image . This takes communication because In more detail, the protocol is as follows. First Bob sends . Then Alice sends . Then Bob sends and finally Alice can check if .
Interestingly, to decide if without the promise a stronger lower bound can be proved for many groups, including , see Corollary 3 below.
In general, it seems an interesting open problem to try to understand for which groups Theorem 1 applies. For example, is the communication large for every quasirandom group [Gow08]?
Theorem 1 and the corresponding results for other groups also scale with the length of the product: for example deciding if over requires communication which is tight.
A strength of the above results is that they hold for any choice of in the promise. This makes them equivalent to certain results, discussed below in Section 5.0.1. Next we prove two other lower bounds that do not have this property and can be obtained by reduction from disjointness. First we show that for any nonabelian group there exists an element such that deciding if or requires communication linear in the length of the product. Interestingly, the proof works for any nonabelian group. The choice of is critical, as for some and the problem is easy. For example: take any group and consider where is the group of integers with addition modulo . Distinguishing between and amounts to computing the parity of (the components of) the input, which takes constant communication.
Theorem 2. Let be a nonabelian group. There exists such that the following holds. Suppose Alice receives and Bob receives . They are promised that either equals or . Deciding which case it is requires randomized communication .
Proof. We reduce from unique set-disjointness, defined below. For the reduction we encode the And of two bits as a group product. This encoding is similar to the famous puzzle that asks to hang a picture on a wall with two nails in such a way that the picture falls if either one of the nails is removed. Since is nonabelian, there exist such that , and in particular with . We can use this fact to encode the And of and as
In the disjointness problem Alice and Bob get inputs respectively, and they wish to check if there exists an such that . If you think of as characteristic vectors of sets, this problem is asking if the sets have a common element or not. The communication of this problem is [KS92, Raz92]. Moreover, in the “unique” variant of this problem where the number of such ’s is 0 or 1, the same lower bound still applies. This follows from [KS92, Raz92] – see also Proposition 3.3 in [AMS99]. For more on disjointness see the surveys [She14, CP10].
We will reduce unique disjointness to group products. For we produce inputs for the group problem as follows:
The group product becomes
If there isn’t an such that , then for each the term is , and thus the whole product is 1.
Otherwise, there exists a unique such that and thus the product will be , with being in the th position. If Alice and Bob can check if the above product is equal to 1, they can also solve the unique set disjointness problem, and thus the lower bound applies for the former.
We required the uniqueness property, because otherwise we might get a product that could be equal to 1 in some groups.
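The reduction above can be checked by brute force on a small example. The following is a minimal sketch, assuming the group is the symmetric group on three points and that the And of bits $x, y$ is encoded as the commutator-style product $a^x b^y a^{-x} b^{-y}$ for two non-commuting elements $a, b$; the particular choice of `A`, `B` and all function names are ours, for illustration only.

```python
# Permutations of {0, 1, 2} as tuples; p[i] is the image of i.
IDENTITY = (0, 1, 2)
A = (1, 0, 2)  # swaps 0 and 1 (illustrative choice)
B = (0, 2, 1)  # swaps 1 and 2; A and B do not commute

def mul(p, q):
    """Compose permutations: apply q first, then p."""
    return tuple(p[q[i]] for i in range(len(p)))

def inv(p):
    """Inverse permutation."""
    r = [0] * len(p)
    for i, v in enumerate(p):
        r[v] = i
    return tuple(r)

def power(p, bit):
    """p^bit for a bit in {0, 1}."""
    return p if bit else IDENTITY

def and_gadget(x, y):
    """a^x * b^y * a^(-x) * b^(-y): the identity unless x = y = 1."""
    out = IDENTITY
    # Alice's factors (powers of A) and Bob's factors (powers of B) interleave.
    for t in (power(A, x), power(B, y), power(inv(A), x), power(inv(B), y)):
        out = mul(out, t)
    return out

def interleaved_product(xs, ys):
    """Product of the gadgets for every coordinate of the disjointness instance."""
    out = IDENTITY
    for x, y in zip(xs, ys):
        out = mul(out, and_gadget(x, y))
    return out

COMMUTATOR = and_gadget(1, 1)  # a nontrivial 3-cycle for this choice of A, B
```

With this choice the commutator has order 3, so an instance with three common elements, such as `xs = ys = (1, 1, 1)`, multiplies back to the identity. This is the kind of collision that the uniqueness promise rules out.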
Next we prove a result for products of length just ; it applies to nonabelian groups of the form and not with the promise.
Theorem 3. Let be a nonabelian group and consider . Suppose Alice receives and Bob receives . Deciding if requires randomized communication .
Proof. The proof is similar to the proof of Theorem 2. We use coordinate of to encode bit of the disjointness instance. If there is no intersection in the latter, the product will be . Otherwise, at least some coordinate will be .
As a corollary we can prove a lower bound for .
Corollary 3. Theorem 3 holds for .
Proof. Note that contains and that is not abelian. Apply Theorem 3.
Theorem 3 is tight for constant-size . We do not know if Corollary 3 is tight. The trivial upper bound is .
3 Proof of Theorem 1
Several related proofs of this theorem exist, see [GV15, GVa, Sha16]. As in [GVa], the proof that we present can be broken down in three steps. First we reduce the problem to a statement about conjugacy classes. Second we reduce this to a statement about trace maps. Third we prove the latter. We present the first step in a way that is similar but slightly different from the presentation in [GVa]. The second step is only sketched, but relies on classical results about and can be found in [GVa]. For the third we present a proof that was communicated to us by Will Sawin. We thank him for his permission to include it here.
3.1 Step 1
We would like to rule out randomized protocols, but it is hard to reason about them directly. Instead, we are going to rule out deterministic protocols on random inputs. First, for any group element we define the distribution on quadruples , where are uniformly random elements. Note the product of the elements in is always .
Towards a contradiction, suppose we have a randomized protocol such that
This implies a deterministic protocol with the same gap, by fixing the randomness.
We reach a contradiction by showing that for every deterministic protocol using little communication, we have
We start with the following standard lemma, which describes a protocol using product sets.
Lemma 4. (The set of accepted inputs of) A deterministic bit protocol for a function can be written as a disjoint union of rectangles, where a rectangle is a set of the form with and and where is constant.
Proof. (sketch) For every communication transcript , let be the set of inputs giving transcript . The sets are disjoint since an input gives only one transcript, and their number is : one for each communication transcript of the protocol. The rectangle property can be proven by induction on the protocol tree.
Next, we show that any rectangle cannot distinguish . The way we achieve this is by showing that for every the probability that is roughly the same for every , and is roughly the density of the rectangle. (Here we write for the characteristic function of the set .) Without loss of generality we set . Let have density and have density . We aim to bound above
where note the distribution of is the same as .
Because the distribution of is uniform in , the above can be rewritten as
The inequality is Cauchy–Schwarz, and the step after that is obtained by expanding the square and noting that is uniform in , so that the expectation of the term is .
Now we do several transformations to rewrite the distribution in the last expectation in a convenient form. First, right-multiplying by we can rewrite the distribution as the uniform distribution on tuples such that
The last equation is equivalent to .
We can now do a transformation setting to be to rewrite the distribution of the four-tuple as
where we use to denote a uniform element from the conjugacy class of , that is for a uniform .
Hence it is sufficient to bound
where all the variables are uniform and independent.
With a similar derivation as above, this can be rewritten as
Here each occurrence of denotes a uniform and independent conjugate. Hence it is sufficient to bound
We can now replace with . Because has the same distribution as , it is sufficient to bound
For this, it is enough to show that with high probability over and , the distribution of , over the choice of the two independent conjugates, has statistical distance from uniform.
3.2 Step 2
In this step we use information on the conjugacy classes of the group to reduce the latter task to one about the equidistribution of the trace map. Let be the Trace map:
We state the lemma that we want to show.
Lemma 5. Let and . For all but values of and , the distribution of
is close to uniform over in statistical distance.
To give some context, in the conjugacy class of an element is essentially determined by the trace. Moreover, we can think of and as generic elements in . So the lemma can be interpreted as saying that for typical , taking a uniform element from the conjugacy class of and multiplying it by yields an element whose conjugacy class is uniform among the classes of . Using that essentially all conjugacy classes are equal, and some of the properties of the trace map, one can show that the above lemma implies that for typical the distribution of is close to uniform. For more on how this fits we refer the reader to [GVa].
3.3 Step 3
We now present a proof of Lemma 5. The high-level argument of the proof is the same as in [GVa] (Lemma 5.5), but the details may be more accessible and in particular the use of the Lang-Weil theorem [LW54] from algebraic geometry is replaced by a more elementary argument. For simplicity we shall only cover the case where is prime. We will show that for all but values of , the probability over that is within of , and for the others it is at most . Summing over gives the result.
We shall consider elements whose trace is unique to the conjugacy class of . (This holds for all but conjugacy classes – see for example [GVa] for details.) This means that the distribution of is that of a uniform element in conditioned on having trace . Hence, we can write the probability that as the number of solutions in to the following three equations (divided by the size of the group, which is ):
We use the second one to remove and the first one to remove from the last equation. This gives
This is an equation in two variables. Write and and use distributivity to rewrite the equation as
At least since Lagrange it has been known how to reduce this to a Pell equation . This is done by applying an invertible affine transformation, which does not change the number of solutions. First set . Then the equation becomes
Equivalently, the cross-term has disappeared and we have
Now one can add constants to and to remove the linear terms, changing the constant term. Specifically, let and set and . The equation becomes
The linear terms disappear, the coefficients of and do not change and the equation can be rewritten as
So this is now a Pell equation
where and
For all but values of we have that is nonzero. Moreover, for all but values of the term is a nonzero polynomial in . (Specifically, for any and any such that .) So we only consider the values of that make it nonzero. Those where give solutions, which is fine. We conclude with the following lemma.
Lemma 6. For and nonzero, and prime , the number of solutions over to the Pell equation
is within of .
This is a basic result from algebraic geometry that can be proved from first principles.
Proof. If for some , then we can replace with and we can count instead the solutions to the equation
Because we can set and , which preserves the number of solutions, and rewrite the equation as
Because , this has solutions: for every nonzero we have .
So now we can assume that for any . Because the number of squares is , the range of has size . Similarly, the range of also has size . Hence these two ranges intersect, and there is a solution .
We take a line passing through : for parameters we consider pairs . There is a bijection between such pairs with and the points with . Because the number of solutions with is , using that , it suffices to count the solutions with .
The intuition is that this line has two intersections with the curve . Because one of them, , lies in , the other has to lie there as well. Algebraically, we can plug the pair in the expression to obtain the equivalent equation
Using that is a solution this becomes
We can divide by , obtaining
We can now divide by which is nonzero by the assumption . This yields
Hence for every value of there is a unique giving a solution. This gives solutions.
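Lemma 6 is easy to check numerically for small primes. The sketch below assumes the Pell equation has the form $x^2 + d y^2 = e$ over $\mathbb{F}_q$ with $q$ an odd prime, and that the case split in the proof is on whether $d = -u^2$ for some $u$; both the form of the equation and the function names are our assumptions for illustration. Under these assumptions the two cases of the proof give $q - 1$ and $q + 1$ solutions respectively.

```python
def count_solutions(q, d, e):
    """Number of pairs (x, y) in F_q x F_q with x^2 + d*y^2 = e (mod q)."""
    return sum(1 for x in range(q) for y in range(q)
               if (x * x + d * y * y - e) % q == 0)

def is_minus_square(q, d):
    """True if d = -u^2 (mod q) for some u in F_q."""
    return any((d + u * u) % q == 0 for u in range(q))

def check_lemma(q):
    """Compare the brute-force count with q - 1 or q + 1 for all nonzero d, e."""
    for d in range(1, q):
        for e in range(1, q):
            expected = q - 1 if is_minus_square(q, d) else q + 1
            if count_solutions(q, d, e) != expected:
                return False
    return True
```

For example, `check_lemma(7)` runs over all 36 pairs of nonzero `d`, `e` modulo 7.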
4 Three parties, number-in-hand
In this section we consider the following three-party number-in-hand problem: Alice gets , Bob gets , Charlie gets , and they want to know if . The communication depends on the group . We present next two efficient protocols for abelian groups, and then a communication lower bound for other groups.
4.1 A randomized protocol for the hypercube
We begin with the simplest setting. Let , that is bit strings with bitwise addition modulo 2. The parties want to check if . They can do so as follows. First, they pick a hash function that is linear: . Specifically, for a uniformly random define . Then, the protocol is as follows.
 Alice sends ,
 Bob sends ,
 Charlie accepts if and only if .
The hash function outputs 1 bit, so the communication is constant. By linearity, the protocol accepts iff . If this is always the case, otherwise it happens with probability .
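The protocol can be simulated exhaustively for small $n$. This is a minimal sketch, assuming the linear hash is the inner product $h_a(x) = \langle a, x \rangle \bmod 2$ and that Charlie accepts when the three hash bits sum to zero modulo 2; the function names are ours.

```python
from itertools import product

def inner_product_hash(a, x):
    """h_a(x) = <a, x> mod 2, which is linear over bit strings with XOR."""
    return sum(ai & xi for ai, xi in zip(a, x)) % 2

def xor(u, v):
    """Bitwise addition modulo 2 of two bit tuples."""
    return tuple(ui ^ vi for ui, vi in zip(u, v))

def protocol_accepts(a, x, y, z):
    """Alice and Bob each send one hash bit; Charlie adds his own and compares to 0."""
    alice_bit = inner_product_hash(a, x)
    bob_bit = inner_product_hash(a, y)
    return (alice_bit ^ bob_bit ^ inner_product_hash(a, z)) == 0

def acceptance_probability(x, y, z):
    """Exact acceptance probability over a uniformly random seed a."""
    seeds = list(product((0, 1), repeat=len(x)))
    return sum(protocol_accepts(a, x, y, z) for a in seeds) / len(seeds)
```

Enumerating all seeds rather than sampling makes the two cases visible exactly: probability 1 when the inputs sum to zero, and 1/2 otherwise.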
4.2 A randomized protocol for
This protocol is from [Vio14]. For simplicity we only consider the case here – the protocol for general is in [Vio14]. Again, the parties want to check if . For this group, there is no 100% linear hash function but there are almost linear hash functions that satisfy the following properties. Note that the inputs to are interpreted modulo and the outputs modulo .
 for all there is such that ,
 for all we have ,
 .
Assuming some random hash function that satisfies the above properties the protocol works similarly to the previous one:
 Alice sends ,
 Bob sends ,
 Charlie accepts if and only if .
We can set to achieve constant communication and constant error.
To prove correctness of the protocol, first note that for some . Then consider the following two cases:
 if then and the protocol is always correct.
 if then the probability that for some is at most the probability that which is ; so the protocol is correct with high probability.
The hash function.
For the hash function we can use a function analyzed in [DHKP97]. Let be a random odd number modulo . Define
where the product is integer multiplication, and is bitshift. In other words we output the bits of the integer product .
We now verify that the above hash function family satisfies the three properties we required above.
Property (3) is trivially satisfied.
For property (1) we have the following. Let and and . To recap, by definition we have:
 ,
 .
Notice that if in the addition the carry into the bit is , then
otherwise
which concludes the proof for property (1).
Finally, we prove property (2). We start by writing where is odd. So the binary representation of looks like
The binary representation of the product for a uniformly random looks like
We consider the two following cases for the product :
 If , or equivalently , the output never lands in the bad set ;
 Otherwise, the hash function output has uniform bits. For any set , the probability that the output lands in is at most .
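Properties (1) and (3), and the fact that the protocol never rejects a zero sum, can be verified exhaustively for small parameters. The sketch below assumes the hash outputs the top $\ell$ bits of $a \cdot x \bmod 2^n$ for an odd multiplier $a$, and that Charlie accepts when the sum of the three hashes is $0$, $-1$ or $-2$ modulo $2^\ell$ (which is what two applications of property (1) suggest). The parameter name `ell` and the function names are ours.

```python
def multiply_shift_hash(a, x, n, ell):
    """Top `ell` bits of (a * x) mod 2^n, for an odd multiplier a."""
    return ((a * x) % (1 << n)) >> (n - ell)

def carry_term(a, x, y, n, ell):
    """The value c with h(x + y) = h(x) + h(y) + c (mod 2^ell)."""
    hx = multiply_shift_hash(a, x, n, ell)
    hy = multiply_shift_hash(a, y, n, ell)
    hxy = multiply_shift_hash(a, (x + y) % (1 << n), n, ell)
    return (hxy - hx - hy) % (1 << ell)

def charlie_accepts(a, x, y, z, n, ell):
    """Accept iff h(x) + h(y) + h(z) is 0, -1 or -2 modulo 2^ell."""
    m = 1 << ell
    total = sum(multiply_shift_hash(a, v, n, ell) for v in (x, y, z)) % m
    return total in {0, m - 1, m - 2}
```

The carry is 0 or 1 because adding two $n$-bit numbers produces at most one carry into the top $\ell$ bits, so a zero sum of three inputs shifts the hash sum by at most 2.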
4.3 Quasirandom groups
What happens in other groups? The hash function used in the previous result was fairly nontrivial. Do we have an almost linear hash function for matrices? The answer is negative. For and the problem is hard, even under the promise. For a group the complexity can be expressed in terms of a parameter which comes from representation theory. We will not formally define this parameter here, but several qualitatively equivalent formulations can be found in [Gow08]. Instead the following table shows the ’s for the groups we’ve introduced.





$G$ : abelian | $A_n$ | $\mathrm{SL}_2(q)$
$d(G)$ : $1$ | $\frac{\log |G|}{\log\log |G|}$ | $|G|^{1/3}$
Theorem 1. Let be a group, and let . Let be the minimum dimension of any irreducible representation of . Suppose Alice, Bob, and Charlie receive $x$, $y$, and $z$ respectively. They are promised that either equals or . Deciding which case it is requires randomized communication complexity .
This result is tight for the groups we have discussed so far. The arguments are the same as before. Specifically, for the communication is . This is tight up to constants, because Alice and Bob can send their elements. For the communication is . This is tight as well, as the parties can again just communicate the images of an element such that , as discussed in Section 2. This also gives a computational proof that cannot be too large for , i.e., it is at most . For abelian groups we get nothing, matching the efficient protocols given above.
5 Proof of Theorem 1
First we discuss several “mixing” lemmas for groups, then we come back to protocols and see how to apply one of them there.
5.0.1 mixing
We want to consider “high entropy” distributions over , and state a fact showing that the multiplication of two such distributions “mixes” or in other words increases the entropy. To define entropy we use the norms . Our notion of (non)entropy will be . Note that is exactly the collision probability where is independent and identically distributed to . The smaller this quantity, the higher the entropy of . For the uniform distribution we have and so we can think of as maximum entropy. If is uniform over elements, we have and we think of as having “high” entropy.
Because the entropy of is small, we can think of the distance between and in the 2-norm as being essentially the entropy of :
Lemma 7. [Gow08, BNP08] If are independent over , then
where is the minimum dimension of an irreducible representation of .
By this lemma, for high entropy distributions and , we get . The factor allows us to pass to statistical distance using Cauchy–Schwarz:
This is the way in which we will use the lemma.
Another useful consequence of this lemma, which however we will not use directly, is this. Suppose now you have independent, high-entropy variables . Then for every we have
To show this, set without loss of generality and rewrite the left-hand side as
By Cauchy–Schwarz this is at most
and we can conclude by Lemma 7. Hence the product of three high-entropy distributions is close to uniform in a pointwise sense: each group element is obtained with roughly probability .
At least over , there exists an alternative proof of this fact that does not mention representation theory (see [GVa] and [Vioa, Viob]).
With this notation in hand, we conclude by stating a “mixing” version of Theorem 2. For more on this perspective we refer the reader to [GVa].
Theorem 1. Let . Let and be two distributions over . Suppose is independent from . Let . We have
For example, when and have high entropy over (that is, are uniform over pairs), we have , and so . In particular, is close to uniform over in statistical distance.
5.0.2 Back to protocols
As in the beginning of Section 3, for any group element we define the distribution on triples , where are uniform and independent. Note the product of the elements in is always . Again as in Section 3, it suffices to show that for every deterministic protocol using little communication we have
Analogously to Lemma 4, the following lemma describes a protocol using rectangles. The proof is nearly identical and is omitted.
Lemma 8. (The set of accepted inputs of) A deterministic bit number-in-hand protocol with three parties can be written as a disjoint union of “rectangles,” that is sets of the form .
Next we show that these product sets cannot distinguish these two distributions , via a straightforward application of Lemma 7.
Proof. Pick any and let be the inputs of Alice, Bob, and Charlie respectively. Then
where is uniform in . If either or is small, that is or , then also and hence the quantity above is at most as well. This holds for every , so we also have . We will choose later.
Otherwise, and are large: and . Let be the distribution of conditioned on . We have that and are independent and each is uniform over at least elements. By Lemma 7 this implies , where is the uniform distribution. As mentioned after the lemma, by Cauchy–Schwarz we obtain
where the last inequality follows from the fact that .
This implies that and , because taking inverses and multiplying by does not change the distance to uniform. These two last inequalities imply that
and thus we get that
Picking completes the proof.
Returning to arbitrary deterministic protocols (as opposed to rectangles), write as a union of disjoint rectangles by Lemma 8. Applying Lemma 9 and summing over all rectangles we get that the distinguishing advantage of is at most . For the advantage is at most , concluding the proof.
6 Three parties, number-on-forehead
In number-on-forehead (NOF) communication complexity [CFL83] with parties, the input is a tuple and each party sees all of it except . For background, it is not known how to prove negative results for parties.
We mention that Theorem 1 can be extended to the multiparty setting, see [GVa]. Several questions arise here, such as whether this problem remains hard for , and what is the minimum length of an interleaved product that is hard for parties (the proof in 1 gives a large constant).
However in this survey we shall instead focus on the problem of separating deterministic and randomized communication. For , we know the optimal separation: The equality function requires communication for deterministic protocols, but can be solved using communication if we allow the protocols to use public coins. For , the best known separation between deterministic and randomized protocols is vs [BDPW10]. In the following we give a new proof of this result, for a different function: if and only if for . As is true for some functions in [BDPW10], a stronger separation could hold for . For context, let us state and prove the upper bound for randomized communication.
Proof. In the number-on-forehead model, computing reduces to two-party equality with no additional communication: Alice computes privately, then Alice and Bob check if .
To prove the lower bound for deterministic protocols we reduce the communication problem to a combinatorial problem.
For intuition, if is the abelian group of real numbers with addition, a corner becomes for , which are the coordinates of an isosceles triangle. We now state the theorem that connects corners and lower bounds.
Lemma 12. Let be a group and a real number. Suppose that every subset with contains a corner. Then the deterministic communication complexity of (defined as ) is .
It is known that implies a corner for certain abelian groups , see [LM07] for the best bound and pointers to the history of the problem. For a stronger result is known: implies a corner [Aus16]. This in turn implies communication .
Proof. We saw already twice that a number-in-hand bit protocol can be written as a disjoint union of rectangles (Lemmas 4, 8). Likewise, a number-on-forehead bit protocol can be written as a disjoint union of cylinder intersections for some :
The proof idea of the above fact is to consider the transcripts of , then one can see that the inputs giving a fixed transcript are a cylinder intersection.
Let be a bit protocol. Consider the inputs on which accepts. Note that at least fraction of them are accepted by some cylinder intersection . Let . Since the first two elements in the tuple determine the last, we have .
Now suppose contains a corner . Then
This implies , which is a contradiction because and so .
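To make the combinatorial object concrete, here is a small brute-force search for corners. It is only a sketch under assumptions of ours: the group is the cyclic group of integers modulo $n$ written additively, and a corner is a triple of points $(x, y), (x + z, y), (x, y + z)$ with $z \neq 0$, matching the isosceles-triangle picture given for the reals. The function name is ours.

```python
def find_corner(points, n):
    """Return (x, y, z) with z != 0 such that (x, y), (x + z, y), (x, y + z)
    all lie in `points` (coordinates taken modulo n), or None if there is none."""
    pts = set(points)
    for (x, y) in sorted(pts):
        for z in range(1, n):
            if ((x + z) % n, y) in pts and (x, (y + z) % n) in pts:
                return (x, y, z)
    return None
```

For instance, the diagonal `{(i, i)}` contains no corner, since $(x + z, x)$ lies on the diagonal only when $z = 0$. This is analogous to the way the last coordinate is determined by the first two in the set used in the proof above.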
7 The corners theorem for quasirandom groups
In this section we prove the corners theorem for quasirandom groups, following Austin [Aus16]. Our exposition has several minor differences with that in [Aus16], which may make it more computer-science friendly. Possibly a proof can also be obtained via certain local modifications and simplifications of Green’s exposition [Gre05b, Gre05a] of an earlier proof for the abelian case. We focus on the case for simplicity, but the proof immediately extends to other quasirandom groups (with corresponding parameters).
Theorem 1. Let . Every subset of density contains a corner .
7.1 Proof idea
For intuition, suppose is a product set, i.e., for . Let’s look at the quantity
where iff . Note that the random variable in the expectation is equal to exactly when form a corner in . We’ll show that this quantity is greater than , which implies that contains a corner (where ). Since we are taking , we can rewrite the above quantity as
where the last line follows by replacing with in the uniform distribution. If , then both $|A|/|G| \geq \delta$ and $|B|/|G| \geq \delta$. Condition on , , . Then the distribution is a product of three independent distributions, each uniform on a set of density . (In fact, two distributions would suffice for this.) By Lemma 7, is close to uniform in statistical distance. This implies that the above expectation equals
for for a small enough constant . Hence, product sets of density polynomial in contain corners.
Given the above, it is natural to try to decompose an arbitrary set into product sets. We will make use of a more general result.
7.2 Weak Regularity Lemma
Let be some universe (we will take ) and let be a function (for us, ). Let be some set of functions, which can be thought of as “easy functions” or “distinguishers” (these will be rectangles or closely related to them). The next theorem shows how to decompose into a linear combination of the up to an error which is polynomial in the length of the combination. More specifically, will be indistinguishable from by the .
Lemma 13. Let be a function and a set of functions. For all , there exists a function where , and such that for all
A different way to state the conclusion, which we will use, is to say that we can write so that is small.
The lemma is due to Frieze and Kannan [FK96]. It is called “weak” because it came after Szemerédi’s regularity lemma, which has a stronger distinguishing conclusion. However, the lemma is also “strong” in the sense that Szemerédi’s regularity lemma has as a tower of whereas here we have polynomial in . The weak regularity lemma is also simpler. There also exists a proof [Tao17] of Szemerédi’s theorem (on arithmetic progressions), which uses weak regularity as opposed to the full regularity lemma used initially.
Proof. We will construct the approximation through an iterative process producing functions . We will show that decreases by each iteration.
Start: Define (which can be realized setting ).
Iterate: If not done, there exists such that . Assume without loss of generality .
Update: where shall be picked later.
Let us analyze the progress made by the algorithm.
where the last line follows by taking . Therefore, there can only be iterations because .
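The iterative process in the proof can be run directly on a tiny universe. The following is a sketch under our own assumptions: the universe is the $n \times n$ grid, the distinguishers are indicator functions of rectangles $A \times B$, the error threshold is `eps`, and each update adds $\pm$`eps` times the violating rectangle. All names are ours, and the brute-force search over rectangles is only feasible for very small $n$.

```python
from itertools import combinations

def subsets(n):
    """All subsets of {0, ..., n-1} as frozensets."""
    return [frozenset(c) for r in range(n + 1) for c in combinations(range(n), r)]

def correlation(f, g, A, B, n):
    """E[(f - g) * 1_{A x B}] over the uniform distribution on the n x n grid."""
    return sum(f[(x, y)] - g[(x, y)] for x in A for y in B) / (n * n)

def weak_regularity(f, n, eps):
    """Greedily build g = sum of coefficient * rectangle until no rectangle
    distinguishes f from g with advantage at least eps.
    Returns (terms, g), where terms is a list of (coefficient, A, B)."""
    rectangles = [(A, B) for A in subsets(n) for B in subsets(n)]
    g = {(x, y): 0.0 for x in range(n) for y in range(n)}
    terms = []
    while True:
        # Find the rectangle with the largest distinguishing advantage.
        best, best_corr = None, 0.0
        for A, B in rectangles:
            corr = correlation(f, g, A, B, n)
            if abs(corr) > abs(best_corr):
                best, best_corr = (A, B), corr
        if abs(best_corr) < eps:
            return terms, g
        # Move g towards f on the violating rectangle by a step of size eps.
        lam = eps if best_corr > 0 else -eps
        A, B = best
        for x in A:
            for y in B:
                g[(x, y)] += lam
        terms.append((lam, A, B))

def max_rectangle_error(f, g, n):
    """Largest |E[(f - g) * 1_{A x B}]| over all rectangles."""
    return max(abs(correlation(f, g, A, B, n))
               for A in subsets(n) for B in subsets(n))
```

Each update decreases the average of $(f - g)^2$ by at least `eps ** 2`, so for $f$ with values in $[0, 1]$ the loop stops after at most `1 / eps ** 2` steps, which is the polynomial bound of the lemma.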
7.3 Getting more for rectangles
Returning to the main proof, we will use the weak regularity lemma to approximate the indicator function for arbitrary by rectangles. That is, we take to be the collection of indicator functions for all sets of the form for . The weak regularity lemma shows how to decompose into a linear combination of rectangles. These rectangles may overlap. However, we ideally want to be a linear combination of nonoverlapping rectangles. In other words, we want a partition of rectangles. It is possible to achieve this at the price of exponentiating the number of rectangles. Note that an exponential loss is necessary even if in every rectangle; or in other words in the unidimensional setting. This is one step where the terminology “rectangle” may be misleading – the set is not necessarily an interval. If it were, a polynomial rather than exponential blowup would have sufficed to remove overlaps.
Claim 14. Given a decomposition of into rectangles from the weak regularity lemma with functions, there exists a decomposition with rectangles which don’t overlap.
Proof. Exercise.
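One possible way to do the exercise, offered here as an assumption about the intended solution, is to refine by atoms: group the row elements by which of the sets on the row side they belong to, do the same for the columns, and take all products of a row atom with a column atom. With s overlapping rectangles this gives at most 4^s disjoint rectangles, and the linear combination is constant on each of them. The sketch below uses hypothetical names (`atoms`, `disjoint_rectangles`) and a small made-up instance.

```python
def atoms(sets, universe):
    """Group elements of the universe by their membership pattern in `sets`."""
    groups = {}
    for x in universe:
        key = tuple(x in s for s in sets)
        groups.setdefault(key, set()).add(x)
    return groups

def disjoint_rectangles(rects, coeffs, universe):
    """Turn sum_i c_i * 1[S_i x T_i] into a sum over non-overlapping rectangles."""
    rows = atoms([S for S, _ in rects], universe)
    cols = atoms([T for _, T in rects], universe)
    parts = []
    for rkey, R in rows.items():
        for ckey, C in cols.items():
            # Rectangle i contains R x C exactly when both patterns have bit i set.
            weight = sum(c for c, a, b in zip(coeffs, rkey, ckey) if a and b)
            parts.append((R, C, weight))
    return parts

# Made-up example with three overlapping rectangles over {0,...,5}.
universe = range(6)
rects = [({0, 1, 2}, {0, 1}), ({1, 2, 3}, {1, 2, 3}), ({0, 5}, {3, 4, 5})]
coeffs = [0.5, -0.25, 1.0]
parts = disjoint_rectangles(rects, coeffs, universe)

def g(x, y):
    """Value of the original overlapping combination at (x, y)."""
    return sum(c for c, (S, T) in zip(coeffs, rects) if x in S and y in T)

def covering(x, y):
    """Weights of all new rectangles that contain (x, y)."""
    return [w for R, C, w in parts if x in R and y in C]
```

Every point is covered by exactly one new rectangle, whose weight equals the value of the original combination at that point.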
In the above decomposition, note that it is natural to take the coefficients of rectangles to be the density of points in that are in the rectangle. This gives rise to the following claim.
Claim 15. The weights of the rectangles in the above claim can be the average of in the rectangle, at the cost of doubling the error.
Consequently, we have that , where is the sum of nonoverlapping rectangles with coefficients .
Proof. Let be a partition decomposition with arbitrary weights. Let be a partition decomposition with weights being the average of . It is enough to show that for all rectangle distinguishers
By the triangle inequality, we have that
To bound , note that the error is maximized for a that respects the decomposition in nonoverlapping rectangles, i.e., is the union of some nonoverlapping rectangles from the decomposition. This can be argued using that, unlike , the value of and on a rectangle from the decomposition is fixed. But, from the point of “view” of such , ! More formally, . This gives
and concludes the proof.
We need to get still a little more from this decomposition. In our application of the weak regularity lemma above, we took the set of distinguishers to be characteristic functions of rectangles. That is, distinguishers that can be written as where and map . We will use that the same guarantee holds for and with range , up to a constant factor loss in the error. Indeed, let and have range . Write where and have range , and the same for . The error for distinguisher is at most the sum of the errors for distinguishers , , , and . So we can restrict our attention to distinguishers where and have range . In turn, a function with range can be written as an expectation for functions with range , and the same for . We conclude by observing that
7.4 Proof
Let us now finish the proof by showing a corner exists for sufficiently dense sets . We’ll use three types of decompositions for , with respect to the following three types of distinguishers, where and have range :
 ,
 ,
 .
The first type is just rectangles, which we have been discussing until now. The distinguishers in the last two classes can be visualized over as parallelograms with a 45-degree angle. The same extra properties we discussed for rectangles can be verified to hold for them too.
Recall that we want to show
We’ll decompose the th occurrence of via the th decomposition listed above. We’ll write this decomposition as . We apply this in a certain order to produce sums of products of three functions. The inputs to the functions don’t change, so to avoid clutter we do not write them, and it is understood that in each product of three functions the inputs are, in order . The decomposition is:
We first show that the expectation of the first term is big. This takes the next two claims. Then we show that the expectations of the other terms are small.
Proof. We just need to get error for any product of three functions for the three decomposition types. We have:
This is similar to what we discussed in the overview, and is where we use mixing. Specifically, if or are at most for a small enough constant then we are done. Otherwise, conditioned on , the distribution on is uniform over a set of density , and the same holds for , and the result follows by Lemma 7.
Recall that we start with a set of density .
Proof. We will relate the expectation over to using the Hölder inequality: For random variables ,
To apply this inequality in our setting, write
By the Hölder inequality the expectation of the right-hand side is
The last three terms equal to because
where is the set in the partition that contains . Putting the above together we obtain
Finally, because the functions are positive, we have that . This concludes the proof.
It remains to show the other terms are small. Let be the error in the weak regularity lemma with respect to distinguishers with range . Recall that this implies error with respect to distinguishers with range . We give the proof for one of the terms and then we say little about the other two.
The proof involves changing names of variables and doing Cauchy-Schwarz to remove the terms with and bound the expectation above by , which is small by the regularity lemma.
Proof. Replace with in the uniform distribution to get
where the first inequality is by Cauchy-Schwarz.
Now replace and reason in the same way:
Replace to rewrite the expectation as
We want to view the last three terms as a distinguisher . First, note that has range . This is because and has range , where recall that is the set in the partition that contains . Fix . The last term in the expectation becomes a constant . The second term only depends on , and the third only on . Hence for appropriate functions and with range this expectation can be rewritten as
which concludes the proof.
There are similar proofs to show the remaining terms are small. For , we can perform simple manipulations and then reduce to the above case. For , we have a slightly easier proof than above.
7.4.1 Parameters
Suppose our set has density , and the error in the regularity lemma is . By the above results we can bound
where the terms in the right-hand side come, left to right, from Claims 17, 16, and 18. Picking the proof is completed for sufficiently small .
References
[AL00] Andris Ambainis and Satyanarayana V. Lokam. Improved upper bounds on the simultaneous messages complexity of the generalized addressing function. In Latin American Symposium on Theoretical Informatics (LATIN), pages 207–216, 2000.
[Amb96] Andris Ambainis. Upper bounds on multiparty communication complexity of shifts. In Symp. on Theoretical Aspects of Computer Science (STACS), pages 631–642, 1996.
[AMS99] Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. J. of Computer and System Sciences, 58(1, part 2):137–147, 1999.
[Aus16] Tim Austin. Ajtai-Szemerédi theorems over quasirandom groups. In Recent trends in combinatorics, volume 159 of IMA Vol. Math. Appl., pages 453–484. Springer, [Cham], 2016.
[Bar89] David A. Mix Barrington. Bounded-width polynomial-size branching programs recognize exactly those languages in NC^{1}. J. of Computer and System Sciences, 38(1):150–164, 1989.
[BC92] Michael Ben-Or and Richard Cleve. Computing algebraic formulas using a constant number of registers. SIAM J. on Computing, 21(1):54–58, 1992.
[BDPW10] Paul Beame, Matei David, Toniann Pitassi, and Philipp Woelfel. Separating deterministic from randomized multiparty communication complexity. Theory of Computing, 6(1):201–225, 2010.
[BGKL03] László Babai, Anna Gál, Peter G. Kimmel, and Satyanarayana V. Lokam. Communication complexity of simultaneous messages. SIAM J. on Computing, 33(1):137–166, 2003.
[BNP08] László Babai, Nikolay Nikolov, and László Pyber. Product growth and mixing in finite groups. In ACM-SIAM Symp. on Discrete Algorithms (SODA), pages 248–257, 2008.
[CFL83] Ashok K. Chandra, Merrick L. Furst, and Richard J. Lipton. Multiparty protocols. In 15th ACM Symp. on the Theory of Computing (STOC), pages 94–99, 1983.
[CP10] Arkadev Chattopadhyay and Toniann Pitassi. The story of set disjointness. SIGACT News, 41(3):59–85, 2010.
[DHKP97] Martin Dietzfelbinger, Torben Hagerup, Jyrki Katajainen, and Martti Penttonen. A reliable randomized algorithm for the closestpair problem. J. Algorithms, 25(1):19–51, 1997.
[FK96] Alan M. Frieze and Ravi Kannan. The regularity lemma and approximation schemes for dense problems. In IEEE Symp. on Foundations of Computer Science (FOCS), pages 12–20, 1996.
[Gow08] W. T. Gowers. Quasirandom groups. Combinatorics, Probability & Computing, 17(3):363–387, 2008.
[Gre05a] Ben Green. An argument of Shkredov in the finite field setting, 2005. Available at people.maths.ox.ac.uk/greenbj/papers/corners.pdf.
[Gre05b] Ben Green. Finite field models in additive combinatorics. Surveys in Combinatorics, London Math. Soc. Lecture Notes 327, 127, 2005.
[GVa] W. T. Gowers and Emanuele Viola. Interleaved group products. SIAM J. on Computing.
[GVb] W. T. Gowers and Emanuele Viola. The multiparty communication complexity of interleaved group products. SIAM J. on Computing.
[GV15] W. T. Gowers and Emanuele Viola. The communication complexity of interleaved group products. In ACM Symp. on the Theory of Computing (STOC), 2015.
[IL95] Neil Immerman and Susan Landau. The complexity of iterated multiplication. Inf. Comput., 116(1):103–116, 1995.
[KMR66] Kenneth Krohn, W. D. Maurer, and John Rhodes. Realizing complex Boolean functions with simple groups. Information and Control, 9:190–195, 1966.
[KN97] Eyal Kushilevitz and Noam Nisan. Communication complexity. Cambridge University Press, 1997.
[KS92] Bala Kalyanasundaram and Georg Schnitger. The probabilistic communication complexity of set intersection. SIAM J. Discrete Math., 5(4):545–557, 1992.
[LM07] Michael T. Lacey and William McClain. On an argument of Shkredov on two-dimensional corners. Online J. Anal. Comb., (2):Art. 2, 21, 2007.
[LW54] Serge Lang and André Weil. Number of points of varieties in finite fields. American Journal of Mathematics, 76:819–827, 1954.
[Mil14] Eric Miles. Iterated group products and leakage resilience against NC^{1}. In ACM Innovations in Theoretical Computer Science conf. (ITCS), 2014.
[MV13] Eric Miles and Emanuele Viola. Shielding circuits with groups. In ACM Symp. on the Theory of Computing (STOC), 2013.
[PRS97] Pavel Pudlák, Vojtěch Rödl, and Jiří Sgall. Boolean circuits, tensor ranks, and communication complexity. SIAM J. on Computing, 26(3):605–633, 1997.
[Raz92] Alexander A. Razborov. On the distributional complexity of disjointness. Theor. Comput. Sci., 106(2):385–390, 1992.
[Raz00] Ran Raz. The BNS-Chung criterion for multiparty communication complexity. Computational Complexity, 9(2):113–122, 2000.
[RY19] Anup Rao and Amir Yehudayoff. Communication complexity. 2019. https://homes.cs.washington.edu/ anuprao/pubs/book.pdf.
[Sha16] Aner Shalev. Mixing, communication complexity and conjectures of Gowers and Viola. Combinatorics, Probability and Computing, pages 1–13, 6 2016. arXiv:1601.00795.
[She14] Alexander A. Sherstov. Communication complexity theory: Thirty-five years of set disjointness. In Symp. on Math. Foundations of Computer Science (MFCS), pages 24–43, 2014.
[Tao17] Terence Tao. Szemerédi’s proof of Szemerédi’s theorem, 2017. https://terrytao.files.wordpress.com/2017/09/szemeredi-proof1.pdf.
[Vioa] Emanuele Viola. Thoughts: Mixing in groups. https://emanueleviola.wordpress.com/2016/10/21/mixing-in-groups/.
[Viob] Emanuele Viola. Thoughts: Mixing in groups ii. https://emanueleviola.wordpress.com/2016/11/15/mixing-in-groups-ii/.
[Vio14] Emanuele Viola. The communication complexity of addition. Combinatorica, pages 1–45, 2014.
[Vio17] Emanuele Viola. Special topics in complexity theory. Lecture notes of the class taught at Northeastern University. Available at http://www.ccs.neu.edu/home/viola/classes/spepf17.html, 2017.
[Yao79] Andrew Chi-Chih Yao. Some complexity questions related to distributive computing. In 11th ACM Symp. on the Theory of Computing (STOC), pages 209–213, 1979.
We knew the best threshold-circuit lower bounds long ago
For more than 20 years we’ve had lower bounds for threshold circuits of depth [IPS97], for a fixed . There have been several “explanations” for the lack of progress [AK10]. Recently Chen and Tell have given a better explanation showing that you can’t even improve the result to a better without proving “the whole thing.”
Say you have a finite group and you want to compute the iterated product of elements.
Warmup [AK10].
Suppose you can compute this with circuits of size and depth . Now we show how you can trade size for depth. Put a complete tree with fanin on top of the group product, where each node computes the product of its children (this is correct by associativity, in general this works for a monoid). This tree needs depth . If you stick your circuit of size and depth at each node, the depth of the overall circuit would be obviously and the overall size would be dominated by the input layer which is . If you are aiming for overall depth , you need . This gives size .
Hence we have shown that proving bounds for some depth suffices to prove lower bounds for depth .
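As a small sanity check of the tree construction, here is a Python sketch that multiplies a sequence of group elements with a complete tree of a given fan-in and compares the result with the left-to-right product. The group S_3, the fan-in 3, and the names `compose` and `tree_product` are arbitrary illustrative choices; each tree node is computed by a plain loop here rather than by a circuit, so only correctness and the number of levels are illustrated, not circuit size.

```python
import random
from functools import reduce
from itertools import permutations

def compose(p, q):
    """Product in the symmetric group: apply q first, then p."""
    return tuple(p[q[i]] for i in range(len(q)))

def tree_product(elems, fanin):
    """Multiply blocks of `fanin` consecutive elements, level by level."""
    level, depth = list(elems), 0
    while len(level) > 1:
        level = [reduce(compose, level[i:i + fanin])
                 for i in range(0, len(level), fanin)]
        depth += 1
    return level[0], depth

random.seed(1)
group = list(permutations(range(3)))          # the six elements of S_3
elems = [random.choice(group) for _ in range(81)]
sequential = reduce(compose, elems)
result, depth = tree_product(elems, fanin=3)
print(result == sequential, depth)
```

Only associativity is used, which is why the same construction works over any monoid.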
Chen and Tell.
The above is not the most efficient way to build a tree! I am writing this post following their paper to understand what they do. As they say, the idea is quite simple. While above the size will be dominated by the input layer, we want to balance things so that every layer has roughly the same contribution.
Let’s say we are aiming for size and let’s see what depth we can get. Let’s say now the size is . Let us denote by the number of nodes at level with being the root. The fanin at level is so that the cost is as desired. We have the recursion .
The solution to this recursion is , see below.
So that’s it. We need to get to nodes. So if you set you get say . Going back to , we have exhibited circuits of size and depth just . So proving stronger bounds than this would rule out circuits of size and depth .
Added later: About the recurrence.
Letting we have the following recurrence for the exponents of .
This gives
If it was obviously would already be . Instead for we need to get to .
My two cents.
I am not sure I need more evidence that making progress on long-standing bounds in complexity theory is hard, but I do find it interesting to prove these links; we have quite a few by now! The fact that we have been stuck forever just short of proving “the whole thing” makes me think that these long-sought bounds may in fact be false. Would love to be proved wrong, but it’s 2019, this connection is proved by balancing a tree better, and you feel confident that P ≠ NP?
References
Just coincidence?
Proving lower bounds is one of the greatest intellectual challenges of our time. Something that strikes me is when people reach the same bounds from seemingly different angles. Two recent examples:
 Static Data Structure Lower Bounds Imply Rigidity, by Golovnev, Dvir, Weinstein. They show that improving static data-structure lower bounds, for linear data structures, implies new lower bounds for matrix rigidity. My understanding (the paper isn’t out) is that the available weak but nontrivial data structure lower bounds imply the available weak but nontrivial rigidity lower bounds, and there is absolutely no room for improvement on the former without improving the latter.
 Toward the KRW Composition Conjecture: Cubic Formula Lower Bounds via Communication Complexity, by Dinur and Meir. They reprove the bound on formula size via seemingly different techniques.
What does this mean? Again, the only things that matter are those that you can prove. Still, here are some options:
 Lower bounds are true, and provable with the bag of tricks people are using. The above is just coincidence. Given the above examples (and others) I find this possibility quite bizarre. To illustrate the bizarre in a bizarre way, imagine a graph where one edge is a trick from the bag, and each node is a bound. Why should different paths lead to the same sink, over and over again?
 Lower bounds are true, but you need to use a different bag of tricks. My impression is that two types of results are available here. The first is for “infinitary” proof systems, and includes famous results like the Paris-Harrington theorem. The second is for “finitary” proof systems, and includes results like Razborov’s proof that superpolynomial lower bounds cannot be proved in Res(k). What I really would like is a survey that explains what these and all other relevant proof systems are and can do, and what it would mean to either strengthen the proof system or make the unprovable statement closer to the state of the art. (I don’t even have the excuse of not having a background in logic. I took classes both in Italy and in the USA. In Italy I went to a summer school in logic, and took the logic class in the math department. It was a rather tough class, one of the last offerings before the teacher was forced to water it down. If I remember correctly, it lasted an entire year (though now it seems a lot). As in the European tradition, at least of the time, instruction was mostly one-way: you’d sit there for hours each week and just swallow this avalanche of material. At the very end, there was an oral exam where you sit with the instructor, face-to-face, and they mostly ask you to repeat random bits of the lectures. But for the bright student some simple original problems are also asked, to be solved on the spot. So there is substantial focus on memorization, a word which has acquired a negative connotation, some of which I sympathize with. However a 30-minute oral exam does have its benefits, and on certain aspects I’d argue it can’t quite be replaced by written exams, let alone take-home. But I digress.)
 Lower bounds are false. That is, all “simple” functions have say formula size. You can prove this using computational checkpoints, a notion which in hindsight isn’t too complicated, but alas has not yet been invented. To me, this remains the more likely option.
What do you think?
Nonclassical polynomials and exact computation of Boolean functions
Guest post by Abhishek Bhrushundi.
I would like to thank Emanuele for giving me the opportunity to write a guest post here. I recently stumbled upon an old post on this blog which discussed two papers: Nonclassical polynomials as a barrier to polynomial lower bounds by Bhowmick and Lovett, and Anticoncentration for random polynomials by Nguyen and Vu. Towards the end of the post, Emanuele writes:
“Having discussed these two papers in a sequence, a natural question is whether nonclassical polynomials help for exact computation as considered in the second paper. In fact, this question is asked in the paper by Bhowmick and Lovett, who conjecture that the answer is negative: for exact computation, nonclassical polynomials should not do better than classical.”
In a joint work with Prahladh Harsha and Srikanth Srinivasan from last year, On polynomial approximations over Z/2^{k}Z, we study exact computation of Boolean functions by nonclassical polynomials. In particular, one of our results disproves the aforementioned conjecture of Bhowmick and Lovett by giving an example of a Boolean function for which low degree nonclassical polynomials end up doing better than classical polynomials of the same degree in the case of exact computation.
The counterexample we propose is the elementary symmetric polynomial of degree in . (Such elementary symmetric polynomials also serve as counterexamples to the inverse conjecture for the Gowers norm [LMS11, GT07], and this was indeed the reason why we picked these functions as candidate counterexamples),
where is the Hamming weight of . One can verify (using, for example, Lucas’s theorem) that if and only if the least significant bit of is .
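The two facts used here can be checked by brute force on small inputs. The Python sketch below verifies that the elementary symmetric polynomial of degree d, taken modulo 2, equals the binomial coefficient of the Hamming weight and d modulo 2, and that for any degree of the form 2^j this value is exactly bit j of the Hamming weight, as Lucas's theorem predicts. The function names and the small parameters are illustrative assumptions.

```python
from itertools import combinations, product
from math import comb

def elementary_symmetric_mod2(x, d):
    """Sum over all d-subsets T of the product of x_i for i in T, modulo 2."""
    return sum(all(x[i] for i in T) for T in combinations(range(len(x)), d)) % 2

def agrees_with_binomial(n, d):
    """Check S_d(x) = C(|x|, d) mod 2 for every x in {0,1}^n."""
    return all(elementary_symmetric_mod2(x, d) == comb(sum(x), d) % 2
               for x in product([0, 1], repeat=n))

def power_of_two_case(max_weight, max_j):
    """Check that C(w, 2^j) is odd exactly when bit j of w is 1."""
    return all(comb(w, 2 ** j) % 2 == (w >> j) & 1
               for j in range(max_j) for w in range(max_weight))

print(agrees_with_binomial(6, 2), agrees_with_binomial(6, 4), power_of_two_case(300, 6))
```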
We use that no polynomial of degree less than or equal to can compute correctly on more than half of the points in .
Theorem 1. Let be a polynomial of degree at most in . Then
[Emanuele’s note. Let me take advantage of this for a historical remark. Green and Tao first claimed this fact and sent me and several others a complicated proof. Then I pointed out the paper by Alon and Beigel [AB01]. Soon after they and I independently discovered the short proof reported in [GT07].]
The constant functions (degree polynomials) can compute any Boolean function on half of the points in and this result shows that even polynomials of higher degree don’t do any better as far as is concerned. What we prove is that there is a nonclassical polynomial of degree that computes on of the points in .
Theorem 2. There is a nonclassical polynomial of degree such that
A nonclassical polynomial takes values on the torus and in order to compare the output of a Boolean function (i.e., a classical polynomial) to that of a nonclassical polynomial it is convenient to think of the range of Boolean functions to be . So, for example, if , and otherwise. Here denotes the least significant bit of .
We show that the nonclassical polynomial that computes on of the points in is
The degree of this nonclassical polynomial is but I won’t go into much detail as to why this is the case (see [BL15] for a primer on the notion of degree in the nonclassical world).
Understanding how behaves comes down to figuring out the largest power of two that divides for a given : if the largest power of two that divides is then , otherwise if the largest power is at least then . Fortunately, there is a generalization of Lucas’s theorem, known as Kummer’s theorem, that helps characterize this:
Theorem 3.[Kummer’s theorem] The largest power of dividing for , , is equal to the number of borrows required when subtracting from in base .
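Kummer's theorem is easy to test numerically. The following Python sketch counts the borrows in a base-p subtraction and compares the count with the exponent of the largest power of p dividing the binomial coefficient. The helper names `valuation` and `borrows` are illustrative.

```python
from math import comb

def valuation(x, p):
    """Exponent of the largest power of p dividing the positive integer x."""
    v = 0
    while x % p == 0:
        x //= p
        v += 1
    return v

def borrows(n, m, p=2):
    """Number of borrows when subtracting m from n in base p (requires n >= m)."""
    count, borrow = 0, 0
    while n > 0 or m > 0:
        top, bottom = n % p, m % p + borrow
        if top < bottom:          # this digit needs to borrow from the next one
            borrow = 1
            count += 1
        else:
            borrow = 0
        n //= p
        m //= p
    return count

def kummer_holds(limit, p):
    return all(valuation(comb(n, m), p) == borrows(n, m, p)
               for n in range(1, limit) for m in range(n + 1))

print(kummer_holds(65, 2), kummer_holds(65, 3))
```

For instance, subtracting 1 from 4 in base 2 needs two borrows, and C(4,1) = 4 is divisible by 2^2 but not by 2^3.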
Equipped with Kummer’s theorem, it doesn’t take much work to arrive at the following conclusion.
Lemma 4. if either or , where denotes the least significant bit of .
If is uniformly distributed in then it’s not hard to verify that the bits are almost uniformly and independently distributed in , and so the above lemma proves that computes on of the points in . It turns out that one can easily generalize the above argument to show that is a counterexample to Bhowmick and Lovett’s conjecture for every .
We also show in our paper that it is not the case that nonclassical polynomials always do better than classical polynomials in the case of exact computation: for the majority function, nonclassical polynomials do as badly as their classical counterparts (this was also conjectured by Bhowmick and Lovett in the same work), and the Razborov-Smolensky bound for classical polynomials extends to nonclassical polynomials.
We started out trying to prove that is a counterexample but couldn’t. It would be interesting to check if it is one.
References
[AB01] N. Alon and R. Beigel. Lower bounds for approximations by low degree polynomials over Z_m. In Proceedings 16th Annual IEEE Conference on Computational Complexity, pages 184–187, 2001.
[BL15] Abhishek Bhowmick and Shachar Lovett. Nonclassical polynomials as a barrier to polynomial lower bounds. In Proceedings of the 30th Conference on Computational Complexity, pages 72–87, 2015.
[GT07] B. Green and T. Tao. The distribution of polynomials over finite fields, with applications to the Gowers norms. ArXiv e-prints, November 2007.
[LMS11] Shachar Lovett, Roy Meshulam, and Alex Samorodnitsky. Inverse conjecture for the Gowers norm is false. Theory of Computing, 7(9):131–145, 2011.
Entropy polarization
Sometimes you see quantum popping up everywhere. I just did the opposite and gave a classical talk at a quantum workshop, part of an AMS meeting held at Northeastern University, which poured yet another avalanche of talks onto the Boston area. I spoke about the complexity of distributions, also featured in an earlier post, including a result I posted two weeks ago which gives a boolean function such that the output distribution of any AC^0 circuit has statistical distance from for uniform . In particular, no AC^0 circuit can compute much better than guessing at random even if the circuit is allowed to sample the input itself. The slides for the talk are here.
The new technique that enables this result I’ve called entropy polarization. Basically, for every AC^0 circuit mapping any number of bits into bits, there exists a small set of restrictions such that:
(1) the restrictions preserve the output distribution, and
(2) for every restriction , the output distribution of the circuit restricted to either has min-entropy or . Whence polarization: the entropy will become either very small or very large.
Such a result is useless and trivial to prove with ; the critical feature is that one can obtain a much smaller of size .
Entropy polarization can be used in conjunction with a previous technique of mine that works for high min-entropy distributions to obtain the said sampling lower bound.
It would be interesting to see if any of this machinery can yield a separation between quantum and classical sampling for constant-depth circuits, which is probably a reason why I was invited to give this talk.
Hardness amplification proofs require majority… and 15 years
Aryeh Grinberg, Ronen Shaltiel, and I have just posted a paper which proves conjectures I made 15 years ago (historians may want to consult the last paragraph of [2] and my Ph.D. thesis).
At that time, I was studying hardness amplification, a cool technique to take a function that is somewhat hard on average, and transform it into another function that is much harder on average. If you call a function hard if it cannot be computed on a fraction of the inputs, you can start e.g. with that is hard and obtain that is hard, or more. This is very important because functions with the latter hardness imply pseudorandom generators with Nisan’s design technique, and also “additional” lower bounds using the “discriminator lemma.”
The simplest and most famous technique is Yao’s XOR lemma, where
and the hardness of decays exponentially with . (So to achieve the parameters above it suffices to take .)
At the same time I was also interested in circuit lower bounds, so it was natural to try to use this technique for classes for which we do have lower bounds. So I tried, and… oops, it does not work! In all known techniques, the reduction circuit cannot be implemented in a class smaller than TC^0 – a class for which we don’t have lower bounds and for which we think it will be hard to get them, also because of the Natural proofs barrier.
Eventually, I conjectured that this is inherent, namely that you can take any hardness amplification reduction, or proof, and use it to compute majority. To be clear, this conjecture applied to black-box proofs: decoding arguments which take anything that computes too well and turn it into something which computes too well. There were several partial results, but they all had to restrict the proof further, and did not capture all available techniques.
Should you have had any hope that black-box proofs might do the job, in this paper we prove the full conjecture (improving on a number of incomparable works in the literature, including a 10-year-anniversary work by Shaltiel and myself which proved the conjecture for non-adaptive proofs).
Indistinguishability
One thing that comes up in the proof is the following basic problem. You have a distribution on bits that has large entropy, very close to . A classic result shows that most bits of are close to uniform. We needed an adaptive version of this, showing that a decision tree making few queries cannot distinguish from uniform, as long as the tree does not query a certain small forbidden set of variables. This also follows from recent and independent work of Or Meir and Avi Wigderson.
Turns out this natural extension is not enough for us. In a nutshell, it is difficult to understand what queries an arbitrary reduction is making, and so it is hard to guarantee that the reduction does not query the forbidden set. So we prove a variant, where the variables are not forbidden, but are fixed. Basically, you condition on some fixing of few variables, and then the resulting distribution is indistinguishable from the distribution where is uniform. Now the queries are not forbidden but have a fixed answer, and this makes things much easier. (Incidentally, you can’t get this simply by fixing the forbidden set.)
Fine, so what?
One great question remains. Can you think of a counterexample to the XOR lemma for a class such as constant-depth circuits with parity gates?
But there is another reason why I am interested in this. Proving average-case hardness results for restricted classes “just” beyond AC^0 is more than a long-standing open question in lower bounds: It is necessary even for worst-case lower bounds, both in circuit and communication complexity, as we discussed earlier. And here’s hardness amplification, which intuitively should provide such hardness results. It was given many different proofs, see e.g. [1]. However, none can be applied as we just saw. I don’t know, someone taking results at face value may even start thinking that such average-case hardness results are actually false.
References
[1] Oded Goldreich, Noam Nisan, and Avi Wigderson. On Yao’s XOR lemma. Technical Report TR95–050, Electronic Colloquium on Computational Complexity, March 1995. http://www.eccc.uni-trier.de/.
[2] Emanuele Viola. The complexity of constructing pseudorandom generators from hard functions. Computational Complexity, 13(3-4):147–188, 2004.
Matrix rigidity, and all that
The rigidity challenge asks to exhibit an n × n matrix M that cannot be written as M = A + B where A is “sparse” and B is “low-rank.” This challenge was raised by Valiant who showed in [Val77] that if it is met for any A with at most n^{1+ϵ} nonzero entries and any B with rank O(n∕ log log n) then computing the linear transformation M requires either super-logarithmic depth or superlinear size for linear circuits. This connection relies on the following lemma.
Lemma 1. Let C : {0, 1}^{n} →{0, 1}^{n} be a circuit made of XOR gates. If you can remove e edges and reduce the depth to d then the linear transformation computed by C equals A + B where A has ≤ 2^{d} nonzero entries per row (and so a total of ≤ n2^{d} nonzero entries), and B has rank ≤ e.
Proof: After you remove the edges, each output bit is a linear combination of the removed edges and at most 2^{d} input variables. The former can be done by B, the latter by A. QED
Valiant shows that in a log-depth, linear-size circuit one can remove O(n∕ log log n) edges to reduce the depth to d = ϵ log n, so that 2^{d} = n^{ϵ} – a proof can be found in [Vio09] – and this gives the above connection to lower bounds.
However, the best available tradeoff for explicit matrices gives sparsity (n^{2}∕r) log(n∕r) and rank r, for any parameter r; and this is not sufficient for application to lower bounds.
Error-correcting codes
It was asked whether generator matrixes of good linear codes are rigid. (A code is good if it has constant rate and constant relative distance. The dimensions of the corresponding matrixes are off by only a constant factor, and so we can treat them as identical.) Spielman [Spi95] shows that there exist good codes that can be encoded by linear-size logarithmic-depth circuits. This immediately rules out the possibility of proving a lower bound, and it gives a nontrivial rigidity upper bound via the above connections.
Still, one can ask if these matrices at least are more rigid than the available tradeoffs. Goldreich reports a negative answer by Dvir, showing that there exist good codes whose generating matrix C equals A + B where A has at most O(n^{2}∕d) nonzero entries and B has rank O(d log(n∕d)), for any d.
A similar negative answer follows by the paper [GHK^{+}13]. There we show that there exist good linear codes whose generating matrix can be written as the product of few sparse matrixes. The corresponding circuits are very structured, and so perhaps it is not surprising that they give good rigidity upper bounds. More precisely, the paper shows that we can encode an n-bit message by a circuit made of XOR gates and with say n log ^{*}n wires and depth O(1) – with unbounded fan-in. Each gate in the circuit computes the XOR of some t gates, which can be written as a binary tree of depth log _{2}t + O(1). Such trees have poor rigidity:
Lemma 2.[Trees are not rigid] Let C be a binary tree of depth d. You can remove an O(1∕2^{b}) fraction of edges to reduce the depth to b, for any b.
Proof: It suffices to remove all edges at depths d – b, d – 2b, …. The number of such edges is O(2^{d−b} + 2^{d−2b} + …) = O(2^{d−b}). Note this includes the case d ≤ b, where we can remove 0 edges. QED
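The counting in the proof of Lemma 2 can be checked on explicit trees. The Python sketch below builds a complete binary tree of depth d in heap order, cuts the edges entering levels d – b, d – 2b, …, and reports the fraction of edges cut together with the largest depth of a remaining piece. Indexing edges by the level of their lower endpoint, the constant 2 in the tested bound, and the function names are assumptions made for this illustration.

```python
def removed_levels(d, b):
    """Levels d-b, d-2b, ... (at least 1) whose incoming edges are cut."""
    levels, j = [], d - b
    while j >= 1:
        levels.append(j)
        j -= b
    return levels

def cut_tree(d, b):
    """Return (fraction of edges removed, max depth of a remaining component)."""
    cut = set(removed_levels(d, b))
    n_nodes = 2 ** (d + 1) - 1            # heap order: node v has parent v // 2
    height = [0] * (n_nodes + 1)          # depth of v inside its own component
    removed = 0
    for v in range(2, n_nodes + 1):
        level = v.bit_length() - 1        # the root (v = 1) is at level 0
        if level in cut:                  # edge from v to its parent is removed
            removed += 1
            height[v] = 0
        else:
            height[v] = height[v // 2] + 1
    return removed / (n_nodes - 1), max(height)

print(cut_tree(10, 3))
```

For d = 10 and b = 3 the cut levels are 7, 4 and 1, which removes 146 of the 2046 edges and leaves pieces of depth at most 3.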
Applying Lemma 2 to a gate in our circuit, we reduce the depth of the binary tree computed at that gate to b. Applying this to every gate we obtain a circuit of depth O(b). In total we have removed an O(1∕2^{b}) fraction of the n log ^{*}n edges.
Writing 2^{b} = n∕d, by Lemma 1 we can write the generating matrixes of our code as C = A + B where A has at most O(n∕d) nonzero entries per row, and B has rank O(d log ^{*}n). These parameters are the same as in Dvir’s result, up to lower-order terms. The lower-order terms appear incomparable.
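As a sanity check on the rank parameter, the count of removed edges can be written out explicitly. This is only a sketch of the arithmetic, using the O(1∕2^{b}) fraction from Lemma 2 and the n log ^{*}n wire count stated above; how removed edges translate into rank is taken from Lemma 1 as given.

```latex
\[
  \underbrace{O\!\left(2^{-b}\right)}_{\text{fraction removed}}
  \cdot \underbrace{n \log^{*} n}_{\text{wires}}
  \;=\; O\!\left(\frac{d}{n}\right) \cdot n \log^{*} n
  \;=\; O\!\left(d \log^{*} n\right),
  \qquad \text{using } 2^{b} = \frac{n}{d}.
\]
```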
Walsh-Fourier transform
Another matrix that was considered is the n×n Inner Product matrix H, aka the Walsh-Hadamard matrix, where the x,y entry is the inner product of x and y modulo 2. Alman and Williams [AW16] recently gave an interesting rigidity upper bound which prevents this machinery from establishing a circuit lower bound. Specifically they show that H can be written as H = A + B where A has at most n^{1+ϵ} nonzero entries, and B has rank n^{1-ϵ′}, for any ϵ and an ϵ′ which goes to 0 when ϵ does.
Their upper bound works as follows. Let h = log _{2}n. Start with the univariate, real polynomial p(z_{1},z_{2},…,z_{h}) which computes parity exactly on inputs of Hamming weight between 2ϵn and (1∕2 + ϵ)n. By interpolation such a polynomial exists with degree (1∕2 – ϵ)n. Replacing z_{i} with x_{i}y_{i} you obtain a polynomial of degree n – ϵn which computes IP correctly on inputs x,y whose inner product is between 2ϵn and (1∕2 + ϵ)n.
This polynomial has 2^{(1-ϵ′)n} monomials, where ϵ′ = Ω(ϵ^{2}). The truth table of a polynomial with m monomials is a matrix with rank at most m, and this gives a low-rank matrix B′.
The fact that sparse polynomials yield low-rank matrixes also appeared in the paper [SV12], which suggested to study the rigidity challenge for matrixes arising from polynomials.
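The link between sparsity and rank can be seen on a toy example. The sketch below assumes a polynomial in the products x_{i}y_{i}, represented by a hypothetical dictionary from index sets to coefficients, and computes the rank of its truth table by exact Gaussian elimination over the rationals. Each monomial contributes a rank-one matrix, so the rank is at most the number of monomials.

```python
from fractions import Fraction
from itertools import product

def truth_table(h, monomials):
    """Matrix M[x][y] = sum over S of c_S * prod_{i in S} x_i * y_i.

    `monomials` maps a frozenset S of indices in range(h) to a coefficient c_S.
    """
    points = list(product([0, 1], repeat=h))
    return [
        [sum(c * all(x[i] and y[i] for i in S) for S, c in monomials.items())
         for y in points]
        for x in points
    ]

def rank(matrix):
    """Rank over the rationals, by exact Gaussian elimination."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    r = 0
    ncols = len(rows[0]) if rows else 0
    for col in range(ncols):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            if rows[i][col] != 0:
                factor = rows[i][col] / rows[r][col]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r

if __name__ == "__main__":
    # Illustrative polynomial: 1 - 2*x0*y0 + 3*x0*y0*x1*y1*x2*y2.
    poly = {frozenset(): 1, frozenset({0}): -2, frozenset({0, 1, 2}): 3}
    table = truth_table(3, poly)
    print(len(table), "x", len(table[0]), "matrix of rank", rank(table))
```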
Returning to the proof in [AW16], it remains to deal with inputs whose inner product does not lie in that range. The number of x whose weight is not between (1∕2 – ϵ)n and (1∕2 + ϵ)n is at most 2^{(1-ϵ′)n}. For each such input x we modify a row of the matrix B′. Repeating the process for the y we obtain the matrix B, and the rank bound 2^{(1-ϵ′)n} hasn’t changed.
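The count of unbalanced inputs can be compared against a standard tail bound. The sketch below assumes Hoeffding's inequality, which gives at most 2·2^{n}·exp(–2ϵ^{2}n) strings of weight outside the range, consistent with ϵ′ = Ω(ϵ^{2}). The choice of Hoeffding is an assumption made here for illustration; the post does not say which tail bound the paper uses.

```python
from math import comb, exp

def count_unbalanced(n, eps):
    """Number of x in {0,1}^n whose Hamming weight w has |w - n/2| > eps*n."""
    return sum(comb(n, w) for w in range(n + 1) if abs(w - n / 2) > eps * n)

def hoeffding_bound(n, eps):
    """Hoeffding's bound on the same count: 2 * 2^n * exp(-2 * eps^2 * n)."""
    return 2 * 2 ** n * exp(-2 * eps ** 2 * n)

if __name__ == "__main__":
    for n in (20, 40, 80):
        print(n, count_unbalanced(n, 0.25), hoeffding_bound(n, 0.25))
```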
Now a calculation shows that B differs from H in few entries. That is, there are few x and y with Hamming weight between (1∕2 – ϵ)n and (1∕2 + ϵ)n, but with inner product less than 2ϵn.
Boolean complexity
There exists a corresponding framework for boolean circuits (as opposed to circuits with XOR gates only). Rigid matrixes informally correspond to depth-3 Or-And-Or circuits. If this circuit has fan-in f_{o} at the output gate and fan-in f_{i} at each input gate, then the correspondence in parameters is
rank = log f_{o},
sparsity = 2^{f_{i}}.
More precisely, we have the following lemma.
Lemma 3. Let C : {0, 1}^{n} →{0, 1}^{n} be a boolean circuit. If you can remove e edges and reduce the depth to d then you can write C as an Or-And-Or circuit with output fan-in 2^{e} and input fan-in 2^{d}.
Proof: After you remove the edges, each output bit and each removed edge depends on at most 2^{d} input bits or removed edges. The output Or gate of the depth-3 circuit is a big Or over all 2^{e} assignments of values for the removed edges. Then we need to check consistency. Each consistency check just depends on 2^{d} inputs and so can be written as a depth-2 circuit with fan-in 2^{d}. QED
The available bounds are of the form log f_{o} = n∕f_{i}. For example, for input fan-in f_{i} = n^{α} we have lower bounds exponential in n^{1-α} but not more. Again it can be shown that breaking this tradeoff in certain regimes (namely, log _{2}f_{o} = O(n∕ log log n)) yields lower bounds against linear-size log-depth circuits. (A proof appears in [Vio09].) It was also pointed out in [Vio13] that breaking this tradeoff in any regime yields lower bounds for branching programs. See also the previous post.
One may ask how pairwise independent hash functions relate to this challenge. Ishai, Kushilevitz, Ostrovsky, and Sahai showed [IKOS08] that they can be computed by linear-size log-depth circuits. Again this gives a nontrivial upper bound for depth-3 circuits via these connections, and one can ask for more. In [GHK^{+}13] we give constructions of such circuits which in combination with Lemma 3 can again be used to almost match the available tradeoffs.
The bottom line of this post is that we can’t prove lower bounds because they are false, and it is a puzzle to me why some people appear confident that P is different from NP.
References
[AW16] Josh Alman and Ryan Williams. Probabilistic rank and matrix rigidity, 2016. https://arxiv.org/abs/1611.05558.
[GHK^{+}13] Anna Gál, Kristoffer Arnsfelt Hansen, Michal Koucký, Pavel Pudlák, and Emanuele Viola. Tight bounds on computing error-correcting codes by bounded-depth circuits with arbitrary gates. IEEE Transactions on Information Theory, 59(10):6611–6627, 2013.
[IKOS08] Yuval Ishai, Eyal Kushilevitz, Rafail Ostrovsky, and Amit Sahai. Cryptography with constant computational overhead. In 40th ACM Symp. on the Theory of Computing (STOC), pages 433–442, 2008.
[Spi95] Daniel Spielman. Computationally Efficient Error-Correcting Codes and Holographic Proofs. PhD thesis, Massachusetts Institute of Technology, 1995.
[SV12] Rocco A. Servedio and Emanuele Viola. On a special case of rigidity. Available at http://www.ccs.neu.edu/home/viola/, 2012.
[Val77] Leslie G. Valiant. Graph-theoretic arguments in low-level complexity. In 6th Symposium on Mathematical Foundations of Computer Science, volume 53 of Lecture Notes in Computer Science, pages 162–176. Springer, 1977.
[Vio09] Emanuele Viola. On the power of small-depth computation. Foundations and Trends in Theoretical Computer Science, 5(1):1–72, 2009.
[Vio13] Emanuele Viola. Challenges in computational lower bounds. Available at http://www.ccs.neu.edu/home/viola/, 2013.
Restricted models
Map 1
Map 2
To understand Life, what should you study?
a. People’s dreams.
b. The AMPK gene of the fruit fly.
Studying restricted computational models corresponds to b. Just like microbes constitute a wealth of open problems whose solutions are sometimes far-reaching, so restricted computational models present a number of challenges whose study is significant. For one example, Valiant’s study of arithmetic lower bounds boosted the study of superconcentrators, an influential type of graphs closely related to expanders.
The maps above, taken from here, include a number of challenges together with their relationships. Arrows go towards special cases (which are presumably easier). As written in the manuscript, my main aim was to put these challenges in perspective, and to present some connections which do not seem widely known. Indeed, one specific reason why I drew the first map was the realization that an open problem that I spent some time working on can actually be solved immediately by combining known results. The problem was to show that multiparty (number-on-forehead) communication lower bounds imply correlation bounds for polynomials over GF(2). The classic work by Håstad and Goldmann does show that k-party protocols can simulate polynomials of degree k-1, and so obviously correlation bounds for k-party protocols imply the same bounds for polynomials of degree k-1. But what I wanted was a connection with worst-case communication lower bounds, to show that correlation bounds for polynomials (survey) are a prerequisite even for that.
As it turns out, and as the arrows from (1.5) to (1.2) in the first map show, this is indeed true when k is polylogarithmic. So, if you have been trying to prove multiparty lower bounds for polylogarithmic k, you may want to try correlation bounds first. (This connection is for proving correlation bounds under some distribution, not necessarily uniform.)
Another reason why I drew the first map was to highlight a certain type of correlation bound (1.3), discussed in this paper with Razborov. It is a favorite example of mine of a seemingly very basic open problem that is, again, a roadblock for much of what we’d like to know. The problem is to obtain correlation bounds against polynomials that are real valued, with the convention that whenever the polynomial does not output a boolean value we count that as a mistake, thus making the goal of proving a lower bound presumably easier. Amazingly, the following is still open:
Prove that the correlation of the parity function on n bits is at most 1/n with any real polynomial of degree log(n).
To be precise, correlation is defined as the probability that the polynomial correctly computes parity, minus 1/2. For example, the constant polynomial 1 has correlation zero with parity: it gets it right half the time. Whereas the polynomial x1+x2+…+xn does a lot worse, it has negative correlation with parity or in fact any boolean function, just because it is unlikely that its output is in {0,1}.
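For small n this definition can be evaluated by brute force. The sketch below assumes the uniform distribution over inputs and represents a polynomial as an ordinary Python function on bit tuples; the helper name is made up for illustration. A non-boolean output never equals the parity bit, so it is automatically counted as a mistake.

```python
from fractions import Fraction
from itertools import product

def correlation_with_parity(poly, n):
    """Pr over uniform x in {0,1}^n that poly(x) equals parity(x), minus 1/2."""
    hits = sum(1 for x in product([0, 1], repeat=n) if poly(x) == sum(x) % 2)
    return Fraction(hits, 2 ** n) - Fraction(1, 2)

if __name__ == "__main__":
    n = 4
    print(correlation_with_parity(lambda x: 1, n))       # constant polynomial 1
    print(correlation_with_parity(lambda x: sum(x), n))  # x1 + x2 + ... + xn
```

The sum polynomial is correct only on inputs of weight 0 or 1, which is where its value is boolean, so its correlation is (n+1)∕2^{n} – 1∕2.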
What we do in the paper, in my opinion, is to begin to formalize the intuition that these polynomials cannot do much. We show that the correlation with parity is zero (not very small, but actually zero) as long as the polynomial has degree 0.001 loglog(n). This is different from the more familiar models of polynomials modulo m or sign polynomials, because those can achieve nonzero correlation even with constant degree.
On the other hand, with a simple construction, we can obtain nonzero correlation with polynomials of degree O(sqrt(n)). Note the huge gap with the 0.001 loglog(n) lower bound.
Question: what is the largest degree for which the correlation is zero?
The second map gives another slice of open problems. It highlights how superlinear-length lower bounds for branching programs are necessary for several notorious circuit lower bounds.
A third map was scheduled to include Valiant’s longstanding rigidity question and algebraic lower bounds. In the end it was dropped because it required a lot of definitions while I knew of very few arrows. But one problem that was meant to be there is a special case of the rigidity question from this work with Servedio. The question is basically a variant of the above question of real polynomials, where instead of considering low-degree polynomials we consider sparse polynomials. What may not be immediately evident, although in hindsight it is technically immediate, is that this problem is indeed a special case of the rigidity question. The question is to improve on the rigidity bounds in this special case.
In the paper we prove some variant that does not seem to be known in the rigidity world, but what I want to focus on right now is an application that such bounds would have, if established for the Inner Product function modulo 2 (IP). They would imply that IP cannot be computed by polynomial-size AC0-Parity circuits, i.e., AC0 circuits which take as input a layer of parity gates that’s connected to the input. It seems ridiculous that IP can be computed by such circuits, of course. It is easy to handle Or-And-Parity circuits, but circuits of higher depth have resisted attacks.
The question was re-asked by Akavia, Bogdanov, Guo, Kamath, and Rosen.
Cheraghchi, Grigorescu, Juba, Wimmer, and Xie have just obtained some lower bounds for this problem. For And-Or-And-Parity circuits they obtain almost quadratic lower bounds; the bounds degrade for larger depth but stay polynomial. Their proof of the quadratic lower bound looks nice to me. Their first moves are relatively standard: first they reduce to an approximation question for Or-And-Parity circuits; then they fix half the variables of IP so that IP becomes a parity that is “far” from the parities that are input to the DNF. The more interesting step of the argument, in my opinion, comes at this point. They consider the random variable N that counts the number of And-parity gates that evaluate to one, and they observe that several moments of this variable are the same in the case where the parity that comes from IP is zero or one. From this, they use approximation theory to argue about the probability that N will be zero in the two cases. They get that these probabilities are also quite close, as long as the circuit is not too large, which shows that the circuit is not correctly computing IP.
What is the P vs. NP problem? My two cents
I just had to convert a movie clip into a different format. The conversion took ten minutes. Given that the clip can be loaded into memory in one second, ten minutes is a long time. Could not this be done faster? In fact, why not one second?
Afterwards, I played video games. I have a PlayStation 3, but I heard that on the PlayStation 4 the games look better because with the faster processor the 3D scenes have more details. But do I really need a faster processor for those details? Could not my PlayStation 3 be programmed to run those games? In fact, could there be a way to have PlayStation 4 games run on a Commodore 64, or even… see the picture. There do exist hardware limitations of course, but these are not the bottleneck. A present-day 3D game engine on a Commodore 64 would be a stunning achievement, even if the resolution were a bit coarser and the cut scenes deleted.
These are two examples of a wide-open question that is central to theoretical computer science, and to a growing number of fields of science: Are there computational tasks that require a long time?
This may sound puzzling at first, since computers that take a long time are everyday experience, from the cases mentioned above to every time the mouse pointer turns into a sand clock for a ridiculous amount of time, including just now when I booted my computer so that I could continue this post. But the point is that nobody knows whether computers could be programmed to run much faster.
To be sure, there does exist one problem that is known to take a long time. I call this the program-enumeration problem, and it is as follows: consider a computer program that goes through every possible program of length n, runs each for n steps, and does something different from each. For example the program could output the first number that is not output by any program of size n within n steps. By construction, this strange task cannot be performed in time n. On the other hand, it can be done in time exponential in n. (Think n = 10000000000000000, which is roughly how many instructions a modern computer can do in a year.)
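The "do something different from each" step can be sketched in a toy model. Everything here is an assumption made for illustration: programs are Python generator functions, each `yield` counts as one step, and the return value is the program's output. A real version would enumerate all program texts of length n and run them on an actual interpreter.

```python
def run_with_budget(program, steps):
    """Run a toy program for at most `steps` steps.

    Returns its output if it halts within the budget, and None otherwise.
    """
    gen = program()
    try:
        for _ in range(steps):
            next(gen)
    except StopIteration as done:
        return done.value
    return None

def first_unused_output(programs, steps):
    """Smallest non-negative integer not output by any program within the budget."""
    seen = {run_with_budget(p, steps) for p in programs}
    candidate = 0
    while candidate in seen:
        candidate += 1
    return candidate

# Three toy programs (hypothetical stand-ins for "all programs of length n").
def outputs_zero():
    yield
    return 0

def outputs_one():
    yield
    yield
    return 1

def loops_forever():
    while True:
        yield

if __name__ == "__main__":
    toys = [outputs_zero, outputs_one, loops_forever]
    print(first_unused_output(toys, steps=5))
```

By construction the answer differs from the output of every program that halts within the budget.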
A crude, pessimistic summary of our understanding of the limitations of computation is that this is it: all we know is that the program-enumeration problem cannot be solved in time n. This single result does have many applications, for example it can be used to show that there is no algorithm for checking the correctness of programs, somewhat justifying the immense industry devoted to that. But the result is very unsatisfactory because most problems that we face, including those mentioned above, have nothing to do with program enumeration. It also feels unsatisfactory because it does not give any information on computation beyond the simple fact that programs can run other programs.
The wide-open question can now be reformulated more precisely as follows.
Grand question: Is there any computational task, besides program enumeration, that requires a long time?
A negative answer, meaning that computers are all-powerful, would have dramatic consequences, well beyond the above examples. What would you give for a computer that executes any task with no perceptible delay? What would you give to play today the game engines of the next fifty years? Certain companies would see a dramatic increase in profits. And, what I find the most interesting application, scientists would be able to solve very complex models, pushing way beyond current knowledge. Let me elaborate on this last point. Interestingly, the situation in several branches of science is somewhat similar to theoretical computer science. Specifically, scientists have identified a number of features which are desirable in a model of whatever it is that they are studying. However, nobody is able to solve models with these features, except in toy cases, because known programs are too slow.
That computers are all-powerful sounds too good to be true, and in fact the common belief is that there do exist many problems that take a long time. This also has many desirable applications: the security of many everyday electronic systems relies on this belief for specific problems, such as factoring numbers. But we cannot be completely confident of the security until the belief is proved.
The P vs. NP problem is a young, prominent special case of the grand challenge. I think it should be presented as such, which is not always done. The problem asks whether a specific class of problems requires at least a specific amount of time to solve.
P stands for the computational tasks that can be done somewhat efficiently. The letter P is short for “polynomial time”: the time required to solve the problem must scale with a polynomial in the size of the problem. For example, if you have a program that converts movies of size n in time quadratic in n, that would be efficient, and the conversion task would be in P. This is of course a theoretical approximation which does not necessarily guarantee efficiency in practice. But this approximation is convenient, and meaningful when compared to seemingly exponential-time tasks such as program enumeration.
I don’t like the terminology “polynomial time.” Calling something a polynomial emphasizes that that something is made of many monomials. However, the only monomial that is relevant in the definition of P is the one with the highest power. I thought of “power time.” I find it more to the point, and simpler, while preserving the initial.
NP is what you can solve in power time on a nondeterministic computer. This class captures many problems that people care about, but that are not known to be in P. To keep with the spirit of the post, I’ll mention that among these problems are appropriate generalizations of Tetris, Lemmings, and Super Mario.
The P vs. NP problem can also be presented as the difference between computing a solution (P) and checking it (NP). For example, if someone can solve a tough Tetris level, it is easy for you to check that. By contrast, if someone can solve the program-enumeration problem, it is not clear how you would check that. Indeed, that problem is not believed to be in NP.
I have mixed feelings about this presentation. I do find it catchy, but I don’t like that it suggests to me that to check solutions we will use the same variety of algorithms that we use in computing them. Actually, solutions can be encoded so that the verification is extremely simple, as is known since the original formulations. On the other hand, the length of the solution does increase for this encoding, and a variety of algorithms may be recovered from the encoding.
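The computing-versus-checking contrast can be illustrated on a small example. The sketch below swaps in subset sum, a standard NP problem, in place of Tetris, purely because it fits in a few lines; the function names are made up. Checking a claimed solution takes one pass over it, while the naive search tries every subset.

```python
from itertools import combinations

def check(numbers, target, indices):
    """Verify a claimed solution: distinct valid indices whose entries sum to target."""
    return (
        len(set(indices)) == len(indices)
        and all(0 <= i < len(numbers) for i in indices)
        and sum(numbers[i] for i in indices) == target
    )

def search(numbers, target):
    """Naive search over all subsets, exponential in len(numbers)."""
    for size in range(len(numbers) + 1):
        for indices in combinations(range(len(numbers)), size):
            if sum(numbers[i] for i in indices) == target:
                return list(indices)
    return None

if __name__ == "__main__":
    numbers = [3, 34, 4, 12, 5, 2]
    solution = search(numbers, 9)
    print(solution, check(numbers, 9, solution))
```

Whether the exponential search can always be replaced by something fast is exactly the kind of question that P vs. NP asks.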
Do I think that P is different from NP? I feel that I have no idea. In the few years that I have been following this research I have already seen several efficient algorithms for tasks that at first sight looked impossible, and in some cases were conjectured to be. Here are two:
1. Suppose your computer has no memory (i.e., it only has a constant number of registers and the program counter). Can you determine in power time if a.b > c, for integers a,b, and c with many digits? This is actually possible, by Barrington’s theorem.
2. Is there a pseudorandom generator where each output bit depends on only five input bits? This is also believed to be possible, by a result of Applebaum, Ishai, and Kushilevitz.
Both relate to restricted computational models, which is not that surprising, since we do not understand general computation. However, by analogy, should there not be similar surprises for P? Or will the only surprises in computational classes be of the type that something in P is shown to be doable even in a more restricted computational model?