# Myth creation: The switching lemma

The history of science is littered with anecdotes about misplaced credit. Because it does not matter if it was A or B who did it; it only matters if it was I or not I. In this spirit I am starting a series of posts about such misplaced credit, which I hesitated before calling more colorfully “myth creation.” Before starting, I want to make absolutely clear that I am in no way criticizing the works themselves or their authors. In fact, many are among my favorites. Moreover, at least in the examples I have in mind right now, the authors do place their work in the appropriate context with the use of citations etc. My only point is the credit that the work has received within and without our community (typically due to inertia and snowball effects rather than anything else).

Of course, at some level this doesn’t matter. You can call Chebyshev’s polynomials rainbow sprinkles and the math doesn’t change. And yet at some other level maybe it does matter a little, for science isn’t yet a purely robotic activity. With these posts I will advertise unpopular points of view that might be useful, for example to researchers who are junior or from different communities.

### The switching lemma

Random restrictions have been used in complexity theory since at least the 60’s [Sub61]. The first dramatic use in the context of AC0 is due to [FSS84, Ajt83]. These works proved a switching lemma: the amazing fact that a DNF gets simplified by a random restriction to the point that it can be written as a CNF, so you can collapse layers and induct. (An exposition is given below.) Using it, they proved super-polynomial lower bounds for AC0. The proof in [FSS84] is very nice, and if I want to get a quick intuition of why switching is at all possible, I often go back to it. [Ajt83] is also a brilliant paper, and long, unavailable online for free, filled with a logical notation which makes some people twitch. The first symbol of the title says it all, and may be the most obscene ever chosen:

\begin{aligned} \Sigma _{1}^{1}. \end{aligned}

Subsequently, [Yao85] proved exponential lower bounds of the form $2^{n^{c}}$, with a refined analysis of the switching lemma. The bounds are tight, except for the constant $c$ which depends on the depth of the circuit. Finally, the star of this post [Has86, Has87] obtained $c=1/(\text {depth}-1)$.

Yao’s paper doesn’t quite state that a DNF can be written exactly as a CNF, but it states that it can be approximated. Hastad’s work is the first to prove that a DNF can be written as a CNF, and in this sense his statement is cleaner than Yao’s. However, Yao’s paper states explicitly that a small circuit, after being hit by a restriction, can be set to constant by fixing few more bits.

The modern formulation of the switching lemma says that a DNF can be written as a shallow decision tree (and hence a small CNF). This formulation in terms of decision trees is actually not explicit in Hastad’s work. Beame, in his primer [Bea94], credits Cai with this idea and mentions several researchers noted Hastad’s proof works in this way.

Another piece of switching lemma trivia is that the proof in Hastad’s thesis is actually due to Boppana; Hastad’s original argument, of which apparently no written record exists, was closer to Razborov’s later proof.

So, let’s recap. Random restrictions are already in [Sub61]. The idea of switching is already in [FSS84, Ajt83]. You already had three analyses of these ideas, two giving superpolynomial lower bounds and one [Yao85] giving exponential. The formulation in terms of decision trees isn’t in [Has87], and the proof that appears in [Has87] is due to Boppana.

Still, I would guess [Has87] is more well known than all the other works above combined. [Yao85] did have a following at the time — I think it appeared in the pop news. But hey — have you ever heard of Yao’s switching lemma?

The current citation counts offer mixed support for my thesis:

FSS: 1351

Y: 732

H – paper “Almost optimal…:” 867

H – thesis: 582

But it is very hard to use citation information. The two H citations overlap, and papers are cited for various reasons. For example FSS got a ton of citations for the connection to oracles (which has nothing to do with switching lemmas).

Instead it’s instructive to note the type of citations that you can find in the literature:

Hastad’s switching lemma is a cornerstone of circuit complexity [No mention of FSS, A, Y]

Hastad’s Switching Lemma is one of the gems of computational complexity [Notes below in passing it builds on FSS, A, Y]

The wikipedia entry is also telling:

 In computational complexity theory, Hastad’s switching lemma is a key tool for proving lower bounds on the size of constant-depth Boolean circuits. Using the switching lemma, Johan Hastad (1987) showed that... [No mention of FSS,A,Y]

I think that 99% of the contribution of this line of research is the amazing idea that random restrictions simplify a DNF so that you can write it as a CNF and collapse. 90% of the rest is analyzing this to get superpolynomial lower bounds. And 90% of whatever is left is analyzing this to get exponential lower bounds.

Going back to something I mentioned at the beginning, I want to emphasize that Hastad during talks makes a point of reminding the audience that the idea of random restrictions is due to Sipser, and of Boppana’s contribution. And I also would like to thank him for his help with this post.

OK — so maybe this is so, but it must then be the case that [Has87] is the final word on this stuff, like the ultimate tightest analysis that kills the problem. Actually, it is not tight in some regimes of interest, and several cool works of past and recent times address that. In the end, I can only think of one reason why [Has87] entered the mythology in ways that other works did not, the reason that I carefully sidestepped while composing this post: å.

Perhaps one reason behind the aura of the switching lemma is that it’s hard to find examples. It would be nice to read: If you have this extreme DNF here’s what happens, on the other hand for this other extreme DNF here’s what happens, and in general this always works and here’s the switching lemma. Examples are forever – Erdos. Instead the switching lemma is typically presented as blam!: an example-free encoding argument which feels deus ex machina, as in this crisp presentation by Thapen. For a little more discussion, I liked Bogdanov’s lecture notes. Next I give a slightly different exposition of the encoding argument.

The simplest case: Or of $n$ bits.

Here the circuit $C$ is simply the Or of $n$ bits $x_{1},x_{2},\ldots ,x_{n}$. This and the next case can be analyzed in more familiar ways, but the benefit of the encoding argument presented next is that it will extend to the general case more easily… arguably. Anyway, it’s also just fun to learn a different argument.

So, let’s take a random restriction $\rho$ with exactly $s$ stars. Some of the bits may become $0$, others $1$, and others yet may remain unfixed, i.e., assigned to stars. Those that become $0$ you can ignore, while if some become $1$ then the whole circuit becomes $1$.

We will show that the number of restrictions for which the restricted circuit $C|_{\rho }$ requires decision trees of depth $\ge d$ is small. To accomplish this, we are going to encode/map such restrictions using/to a restriction… with no stars (that is, just a 0/1 assignment to the variables). The gain is clear: just think of a restriction with zero stars versus a restriction with one star. The latter are more numerous by a factor of about $n$, the number of variables.

A critical observation is that we only want to encode restrictions for which $C|_{\rho }$ requires large depth. So $\rho$ does not map any variable to $1$, for else the Or is $1$ which has decision trees of depth $0$.

The way we are going to encode $\rho$ is this: Simply replace the stars with ones. To go back, replace the ones with stars. We are using the ones in the encoding to “signal” where the stars are.

Hence, the number of bad restrictions is at most $2^{n}$, which is tiny compared to the number $\binom {n}{s}2^{n-s}$ of restrictions with $s$ stars.
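For the skeptical reader, the encoding for the Or can be checked by brute force on a small case. This is only a sketch: the parameters `n`, `s`, `d` and all function names are mine, and it uses the fact that an Or of $k$ unfixed variables needs decision-tree depth exactly $k$.

```python
from itertools import combinations, product
from math import comb

STAR = "*"

def restrictions(n, s):
    """Yield every restriction of n variables with exactly s stars."""
    for star_positions in combinations(range(n), s):
        free = set(star_positions)
        fixed = [i for i in range(n) if i not in free]
        for bits in product("01", repeat=n - s):
            rho = [STAR] * n
            for i, b in zip(fixed, bits):
                rho[i] = b
            yield tuple(rho)

def or_needs_depth(rho, d):
    """Does the Or restricted by rho need decision-tree depth >= d (d >= 1)?"""
    if "1" in rho:
        return False               # the Or is the constant 1: depth 0
    return rho.count(STAR) >= d    # an Or of k free variables needs depth k

def encode(rho):
    """Replace stars with ones, giving a plain 0/1 assignment."""
    return tuple("1" if c == STAR else c for c in rho)

def decode(assignment):
    """Replace ones with stars, recovering the bad restriction."""
    return tuple(STAR if c == "1" else c for c in assignment)

n, s, d = 8, 3, 1
bad = [rho for rho in restrictions(n, s) if or_needs_depth(rho, d)]
codes = {encode(rho) for rho in bad}   # distinct codes: the map is injective
total = comb(n, s) * 2 ** (n - s)
print(len(bad), "bad restrictions out of", total, "; bound 2^n =", 2 ** n)
```

The decoding only works because a bad restriction contains no ones to begin with, which is exactly the critical observation made earlier.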

The medium case: Or of functions on disjoint inputs.

Instead of working with DNFs, I will consider a circuit $C$ which is the Or of arbitrary functions $f_{i}$ each on $w$ bits. You can immediately get this formulation from the usual one for DNFs, but I still find it a little useful since otherwise you might think there is something special about DNFs. What is special is that you take the Or of the functions, and we will exploit this again shortly.

In this warm-up case, we start with functions on disjoint inputs. So, again, let’s take a random restriction $\rho$ with exactly $s$ stars. Some of the functions may become $0$, others $1$, and others yet may remain unfixed. Those that become $0$ you can ignore, while if some become $1$ then the whole circuit becomes $1$.

As before, we will show that the number of restrictions for which the restricted circuit $C|_{\rho }$ requires decision trees of depth $\ge d$ is small. To accomplish this, we are going to encode/map such restrictions using/to a restriction with just $s-d$ stars, plus a little more information. As we saw already, the gain in reducing the number of stars is clear. In particular, standard calculations show that saving $d$ stars reduces the number of restrictions by a factor $O(s/n)^{d}$. The auxiliary information will give us a factor of $w^{d}$, leading to the familiar bound $O(ws/n)^{d}$.
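The “standard calculation” can be made concrete. The number of restrictions with $k$ stars is $\binom {n}{k}2^{n-k}$, so the ratio between $s-d$ stars and $s$ stars is $2^{d}\prod _{i<d}(s-i)/(n-s+d-i)\le (2s/(n-s))^{d}$. The sketch below checks this numerically; the explicit constant $2$ and the denominator $n-s$ are my own bookkeeping for the $O(\cdot )$, and the function names are made up for the example.

```python
from fractions import Fraction
from math import comb

def num_restrictions(n, stars):
    """Number of restrictions on n variables with exactly `stars` stars."""
    return comb(n, stars) * 2 ** (n - stars)

def saving_ratio(n, s, d):
    """(# restrictions with s-d stars) / (# restrictions with s stars), exactly."""
    return Fraction(num_restrictions(n, s - d), num_restrictions(n, s))

n, s = 1000, 20
for d in (1, 2, 5):
    ratio = saving_ratio(n, s, d)
    bound = Fraction(2 * s, n - s) ** d
    print(f"d={d}: ratio={float(ratio):.3e}  (2s/(n-s))^d={float(bound):.3e}")
```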

As before, recall that we only want to encode restrictions for which $C|_{\rho }$ requires large depth. So no function in $C|_{\rho }$ is $1$, for else the circuit is $1$ and has decision trees of depth $0$. Also, you have $d$ stars among inputs to functions that are unfixed (i.e., not even fixed to $0$), for else again you can compute the function reading fewer than $d$ bits. Because the functions are unfixed, there is a setting for those $d$ stars (and possibly a few more stars, which would only help the argument) that makes the corresponding functions $1$. We are going to pick precisely that setting in our restriction $\rho '$ with $s-d$ stars. This allows us to “signal” which functions had inputs with the stars we are saving (namely, those that are the constant $1$). To completely recover $\rho$, we simply add extra information to indicate where the stars were. The saving here is that we only have to say where the stars are among $w$ symbols, not $n$.

The general case: Or of functions on any subset of $w$ bits.

First, the number of functions does not play a role, so you can think you have functions on any possible subset of $w$ bits, where some functions may be constant. The idea is the same, except we have to be slightly more careful because when we set values for the stars in one function we may also affect other functions. The idea is simply to fix one function at a time. Specifically, starting with $\rho$, consider the first function $f$ that’s not made constant by $\rho$. So the inputs to $f$ have some stars. As before, let us replace the stars with constants that make the function $f$ equal to the constant 1, and append the extra information that allows us to recover where these stars were in $\rho$.

We’d like to repeat the argument. Note however we only have guarantees about $C|_{\rho }$, not $C|_{\rho }$ with some stars replaced with constants that make $f$ equal to $1$. We also can’t just jump to the 2nd function that’s not constant in $C|_{\rho }$, since the “signal” fixing for that might clash with the fixing for the first; this is where the overlap in inputs makes things slightly more involved. Instead, because $C|_{\rho }$ required decision tree depth at least $d$, we note there has to be some assignment to the $m$ stars in the input to $f$ so that the resulting, further restricted circuit still requires decision tree depth $\ge d-m$ (else $C|_{\rho }$ has decision trees of depth $<d$: query those $m$ stars, then use a tree of depth $<d-m$ for each outcome). We append this assignment to the auxiliary information and we continue the argument using the further restricted circuit.

### References

[Ajt83]    Miklós Ajtai. $\Sigma _{1}^{1}$-formulae on finite structures. Annals of Pure and Applied Logic, 24(1):1–48, 1983.

[Bea94]   Paul Beame. A switching lemma primer. Technical Report UW-CSE-95-07-01, Department of Computer Science and Engineering, University of Washington, November 1994. Available from http://www.cs.washington.edu/homes/beame/.

[FSS84]   Merrick L. Furst, James B. Saxe, and Michael Sipser. Parity, circuits, and the polynomial-time hierarchy. Mathematical Systems Theory, 17(1):13–27, 1984.

[Has86]   Johan Håstad. Almost optimal lower bounds for small depth circuits. In Juris Hartmanis, editor, Proceedings of the 18th Annual ACM Symposium on Theory of Computing, May 28-30, 1986, Berkeley, California, USA, pages 6–20. ACM, 1986.

[Has87]   Johan Håstad. Computational limitations of small-depth circuits. MIT Press, 1987.

[Sub61]   B. A. Subbotovskaya. Realizations of linear functions by formulas using +, *, -. Soviet Mathematics-Doklady, 2:110–112, 1961.

[Yao85]   Andrew Yao. Separating the polynomial-time hierarchy by oracles. In 26th IEEE Symp. on Foundations of Computer Science (FOCS), pages 1–10, 1985.

# Fibonacci and I

The other day I couldn’t remember Fibonacci’s original motivation/presentation of the sequence now famously named after him. This had to be corrected immediately, because of the picture above and my first publication (1994) which includes a simple algorithm to decompress sounds. The compression algorithm works by storing rather than the sound data — think of it as the key — the difference between consecutive keys. The saving comes from not allowing every possible difference, but only those in… the Fibonacci sequence. Why those differences are the right ones is part of the mystique which makes studying the sequence fun. For further technical but not mystical details see the paper; an implementation of the decompressor is given in the Motorola 68000 assembly code.
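For readers who want to see the shape of such a decompressor, here is a minimal sketch. It is not the routine from the paper: the 16-entry table of signed Fibonacci numbers, the one-code-per-sample input format, and the 8-bit wraparound are all assumptions made for illustration.

```python
# Assumed table: signed Fibonacci numbers, indexed by a 4-bit code.
FIB_DELTAS = [-34, -21, -13, -8, -5, -3, -2, -1, 0, 1, 2, 3, 5, 8, 13, 21]

def decompress(initial, codes):
    """Rebuild the samples (the "keys") by accumulating Fibonacci deltas."""
    samples = []
    value = initial
    for code in codes:
        value += FIB_DELTAS[code]
        value = (value + 128) % 256 - 128   # assumed: wrap to signed 8 bits
        samples.append(value)
    return samples

print(decompress(0, [9, 10, 12, 8, 7]))   # deltas +1, +2, +5, 0, -1
```

Each sample costs only four bits here, since the decoder can only move by one of sixteen allowed differences.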

This is me on my way to Fibonacci from Rome, some years ago:

I actually find some presentations of the sequence a little hard to grasp, so I came up with a trivially different rendering which now will make it impossible for me to forget:

There are two types of trees: Young and old. You start with one young tree. In one period, a young tree produces another young tree and becomes old, and an old tree produces a young tree and dies. How many young trees are there after t periods?
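The puzzle is easy to simulate, which is one way to convince yourself that the answer is the familiar sequence. A small sketch (the function name is mine):

```python
def young_trees(t):
    """Number of young trees after t periods, starting from one young tree."""
    young, old = 1, 0
    for _ in range(t):
        # Every tree, young or old, produces one young tree;
        # young trees become old, and old trees die.
        young, old = young + old, young
    return young

print([young_trees(t) for t in range(10)])
# [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
```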

I also couldn’t exactly remember the spiral you can make with these numbers. But you can tile the plane with squares whose sides come from the sequence, if you arrange them in a spiral.

# Talk: Why do lower bounds stop “just before” proving major results?

I have prepared this talk which is a little unusual and is in part historical and speculative. You can view the slides here. I am scheduled to give it in about three hours at Boston University. And because it’s just another day in the greater Boston area, while I’ll be talking my ex office-mate Vitaly Feldman will be speaking at Harvard University.  His talk looks quite interesting and attempts to explain why overfitting is actually necessary for good learning. As for mine, well you’ll have to come and see or take a peek at the slides.

# Non-abelian combinatorics and communication complexity

Below and here in pdf is a survey I am writing for SIGACT, due next week.  Comments would be very helpful.

Finite groups provide an amazing wealth of problems of interest to complexity theory. And complexity theory also provides a useful viewpoint of group-theoretic notions, such as what it means for a group to be “far from abelian.” The general problem that we consider in this survey is that of computing a group product $g=x_{1}\cdot x_{2}\cdot \cdots \cdot x_{n}$ over a finite group $G$. Several variants of this problem are considered in this survey and in the literature, including in .

Some specific, natural computational problems related to $g$ are, from hardest to easiest:

(1) Computing $g$,

(2) Deciding if $g=1_{G}$, where $1_{G}$ is the identity element of $G$, and

(3) Deciding if $g=1_{G}$ under the promise that either $g=1_{G}$ or $g=h$ for a fixed $h\ne 1_{G}$.

Problem (3) is from [MV13]. The focus of this survey is on (2) and (3).

We work in the model of communication complexity [Yao79], with which we assume familiarity. For background see [KN97, RY19]. Briefly, the terms $x_{i}$ in a product $x_{1}\cdot x_{2}\cdot \cdots \cdot x_{n}$ will be partitioned among collaborating parties, in several ways, and we shall bound the number of bits that the parties need to exchange to solve the problem.

Organization.

We begin in Section 2 with two-party communication complexity. In Section 3 we give a streamlined proof, except for a step that is only sketched, of a result of Gowers and the author [GV15, GVb] about interleaved group products. In particular we present an alternative proof, communicated to us by Will Sawin, of a lemma from [GVa]. We then consider two models of three-party communication. In Section 4 we consider number-in-hand protocols, and we relate the communication complexity to so-called quasirandom groups [Gow08, BNP08]. In Section 6 we consider number-on-forehead protocols, and specifically the problem of separating deterministic and randomized communication. In Section 7 we give an exposition of a result by Austin [Aus16], and show that it implies a separation that matches the state-of-the-art [BDPW10] but applies to a different problem.

Some of the sections follow closely a set of lectures by the author [Vio17]; related material can also be found in the blog posts [Vioa, Viob]. One of the goals of this survey is to present this material in a more organized manner, in addition to including new material.

### 2 Two parties

Let $G$ be a group and let us start by considering the following basic communication task. Alice gets an element $x\in G$ and Bob gets an element $y\in G$ and their goal is to check if $x\cdot y=1_{G}$. How much communication do they need? Well, $x\cdot y=1_{G}$ is equivalent to $x=y^{-1}$. Because Bob can compute $y^{-1}$ without communication, this problem is just a rephrasing of the equality problem, which has a randomized protocol with constant communication. This holds for any group.

The same is true if Alice gets two elements $x_{1}$ and $x_{2}$ and they need to check if $x_{1}\cdot y\cdot x_{2}=1_{G}$. Indeed, it is just checking equality of $y$ and $x_{1}^{-1}\cdot x_{2}^{-1}$, and again Alice can compute the latter without communication.

Things get more interesting if both Alice and Bob get two elements and they need to check if the interleaved product of the elements of Alice and Bob equals $1_{G}$, that is, if

\begin{aligned} x_{1}\cdot y_{1}\cdot x_{2}\cdot y_{2}=1_{G}. \end{aligned}

Now the previous transformations don’t help anymore. In fact, the complexity depends on the group. If it is abelian then the elements can be reordered and the problem is equivalent to checking if $(x_{1}\cdot x_{2})\cdot (y_{1}\cdot y_{2})=1_{G}$. Again, Alice can compute $x_{1}\cdot x_{2}$ without communication, and Bob can compute $y_{1}\cdot y_{2}$ without communication. So this is the same problem as before and it has a constant communication protocol.

For non-abelian groups this reordering cannot be done, and the problem seems hard. This can be formalized for a class of groups that are “far from abelian” – or we can take this result as a definition of being far from abelian. One of the groups that works best in this sense is the following, first constructed by Galois in the 1830’s.

Definition 1. The special linear group $SL(2,q)$ is the group of $2\times 2$ invertible matrices over the field $\mathbb{F} _{q}$ with determinant $1$.

The following result was asked in [MV13] and was proved in [GVa].

Theorem 1. Let $G=SL(2,q)$ and let $h\ne 1_{G}$. Suppose Alice receives $x_{1},x_{2}\in G$ and Bob receives $y_{1},y_{2}\in G$. They are promised that $x_{1}\cdot y_{1}\cdot x_{2}\cdot y_{2}$ either equals $1_{G}$ or $h$. Deciding which case it is requires randomized communication $\Omega (\log |G|)$.

This bound is tight as Alice can send her input, taking $O(\log |G|)$ bits. We present the proof of this theorem in the next section.

Similar results are known for other groups as well, see [GVa] and [Sha16]. For example, one group that is “between” abelian groups and $SL(2,q)$ is the following.

Definition 2. The alternating group $A_{n}$ is the group of even permutations of $1,2,\ldots ,n$.

If we work over $A_{n}$ instead of $SL(2,q)$ in Theorem 1 then the communication complexity is $\Omega (\log \log |G|)$ [Sha16]. The latter bound is tight [MV13]: with knowledge of $h$, the parties can agree on an element $a\in \{1,2,\ldots ,n\}$ such that $h(a)\ne a$. Hence they only need to keep track of the image of $a$. This takes communication $O(\log n)=O(\log \log |A_{n}|)$ because $|A_{n}|=n!/2$. In more detail, the protocol is as follows. First Bob sends $y_{2}(a)$. Then Alice sends $x_{2}y_{2}(a)$. Then Bob sends $y_{1}x_{2}y_{2}(a)$ and finally Alice can check if $x_{1}y_{1}x_{2}y_{2}(a)=a$.
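The protocol is short enough to simulate. In the sketch below permutations of $\{0,\ldots ,n-1\}$ are stored as tuples and a product acts by applying the rightmost factor first, matching the order of the messages; the helper names, the choice $n=8$, and the particular 3-cycle $h$ are assumptions for the example.

```python
import random

def compose(p, q):
    """Product p*q as a map: apply q first, then p."""
    return tuple(p[q[i]] for i in range(len(q)))

def inverse(p):
    inv = [0] * len(p)
    for i, image in enumerate(p):
        inv[image] = i
    return tuple(inv)

def is_even(p):
    inversions = sum(1 for i in range(len(p))
                     for j in range(i + 1, len(p)) if p[i] > p[j])
    return inversions % 2 == 0

def random_even_permutation(n, rng):
    p = list(range(n))
    rng.shuffle(p)
    if not is_even(p):
        p[0], p[1] = p[1], p[0]   # swapping two images flips the parity
    return tuple(p)

def protocol(x1, x2, y1, y2, h):
    """Return (product_is_identity, bits_sent), assuming the promise holds."""
    n = len(h)
    a = next(i for i in range(n) if h[i] != i)  # agreed on in advance from h
    m1 = y2[a]       # Bob   -> Alice: y2(a)
    m2 = x2[m1]      # Alice -> Bob:   x2 y2(a)
    m3 = y1[m2]      # Bob   -> Alice: y1 x2 y2(a)
    final = x1[m3]   # Alice computes x1 y1 x2 y2(a) locally
    bits = 3 * max(1, (n - 1).bit_length())
    return final == a, bits

rng = random.Random(0)
n = 8
identity = tuple(range(n))
h = (1, 2, 0) + tuple(range(3, n))          # a 3-cycle, hence in A_n
x1, y1, x2 = (random_even_permutation(n, rng) for _ in range(3))
for g in (identity, h):
    y2 = compose(inverse(compose(compose(x1, y1), x2)), g)  # forces product g
    print(protocol(x1, x2, y1, y2, h))
```

Three messages of $\lceil \log _{2}n\rceil$ bits each are exchanged, which is the $O(\log n)$ in the text.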

Interestingly, to decide if $g=1_{G}$ without the promise a stronger lower bound can be proved for many groups, including $A_{n}$, see Corollary 3 below.

In general, it seems an interesting open problem to try to understand for which groups Theorem 1 applies. For example, is the communication large for every quasirandom group [Gow08]?

Theorem 1 and the corresponding results for other groups also scale with the length of the product: for example deciding if $x_{1}\cdot y_{1}\cdot x_{2}\cdot y_{2}\cdots x_{n}\cdot y_{n}=1_{G}$ over $G=SL(2,q)$ requires communication $\Omega (n\log |G|)$ which is tight.

A strength of the above results is that they hold for any choice of $h$ in the promise. This makes them equivalent to certain *mixing* results, discussed below in Section 5.0.1. Next we prove two other lower bounds that do not have this property and can be obtained by reduction from disjointness. First we show that for any non-abelian group $G$ there exists an element $h$ such that deciding if $g=1_{G}$ or $g=h$ requires communication linear in the length of the product. Interestingly, the proof works for any non-abelian group. The choice of $h$ is critical, as for some $G$ and $h$ the problem is easy. For example: take any group $G$ and consider $H:=G\times \mathbb {Z}_{2}$ where $\mathbb {Z}_{2}$ is the group of integers with addition modulo $2$. Distinguishing between $1_{H}=(1_{G},0)$ and $h=(1_{G},1)$ amounts to computing the parity of (the $\mathbb {Z}_{2}$ components of) the input, which takes constant communication.

Theorem 2. Let $G$ be a non-abelian group. There exists $h\in G$ such that the following holds. Suppose Alice receives $x_{1},x_{2},\ldots ,x_{n}$ and Bob receives $y_{1},y_{2},\ldots ,y_{n}$. They are promised that $x_{1}\cdot y_{1}\cdot x_{2}\cdot y_{2}\cdot \cdots \cdot x_{n}\cdot y_{n}$ either equals $1_{G}$ or $h$. Deciding which case it is requires randomized communication $\Omega (n)$.

Proof. We reduce from unique set-disjointness, defined below. For the reduction we encode the And of two bits $s,t\in \{0,1\}$ as a group product. This encoding is similar to the famous puzzle that asks to hang a picture on a wall with two nails in such a way that the picture falls if either one of the nails is removed. Since $G$ is non-abelian, there exist $a,b\in G$ such that $a\cdot b\neq b\cdot a$, and in particular $a\cdot b\cdot a^{-1}\cdot b^{-1}=h$ with $h\neq 1$. We can use this fact to encode the And of $s$ and $t$ as

\begin{aligned} a^{s}\cdot b^{t}\cdot a^{-s}\cdot b^{-t}=\begin {cases} 1 & \text {if } \mathrm {And}(s,t)=0\\ h & \text {otherwise.} \end {cases} \end{aligned}

In the disjointness problem Alice and Bob get inputs $x,y\in \{0,1\}^{n}$ respectively, and they wish to check if there exists an $i\in [n]$ such that $x_{i}\land y_{i}=1$. If you think of $x,y$ as characteristic vectors of sets, this problem is asking if the sets have a common element or not. The communication of this problem is $\Omega (n)$ [KS92, Raz92]. Moreover, in the “unique” variant of this problem where the number of such $i$’s is 0 or 1, the same lower bound $\Omega (n)$ still applies. This follows from [KS92, Raz92]; see also Proposition 3.3 in [AMS99]. For more on disjointness see the surveys [She14, CP10].

We will reduce unique disjointness to group products. For $x,y\in \{0,1\}^{n}$ we produce inputs for the group problem as follows:

\begin{aligned} x & \rightarrow (a^{x_{1}},a^{-x_{1}},\ldots ,a^{x_{n}},a^{-x_{n}})\\ y & \rightarrow (b^{y_{1}},b^{-y_{1}},\ldots ,b^{y_{n}},b^{-y_{n}}). \end{aligned}

The group product becomes

\begin{aligned} \underbrace {a^{x_{1}}\cdot b^{y_{1}}\cdot a^{-x_{1}}\cdot b^{-y_{1}}}_{\text {1 bit}}\cdots a^{x_{n}}\cdot b^{y_{n}}\cdot a^{-x_{n}}\cdot b^{-y_{n}}. \end{aligned}

If there isn’t an $i\in [n]$ such that $x_{i}\land y_{i}=1$, then for each $i$ the term $a^{x_{i}}\cdot b^{y_{i}}\cdot a^{-x_{i}}\cdot b^{-y_{i}}$ is $1_{G}$, and thus the whole product is 1.

Otherwise, there exists a unique $i$ such that $x_{i}\land y_{i}=1$ and thus the product will be $1\cdots 1\cdot h\cdot 1\cdots 1=h$, with $h$ being in the $i$-th position. If Alice and Bob can check if the above product is equal to 1, they can also solve the unique set disjointness problem, and thus the lower bound applies for the former. $\square$

We required the uniqueness property, because otherwise we might get a product $h^{c}$ that could be equal to 1 in some groups.
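Both the reduction and the role of uniqueness can be checked on a small case. The sketch below assumes $G=S_{3}$, with permutations stored as tuples; the particular transpositions chosen for $a$ and $b$ and all helper names are mine.

```python
from functools import reduce

IDENTITY = (0, 1, 2)
A = (1, 0, 2)   # transposition swapping 0 and 1
B = (0, 2, 1)   # transposition swapping 1 and 2; A and B do not commute

def compose(p, q):
    return tuple(p[q[i]] for i in range(len(q)))

def inverse(p):
    inv = [0] * len(p)
    for i, image in enumerate(p):
        inv[image] = i
    return tuple(inv)

def power(p, e):
    """p**e for e in {-1, 0, 1}, which is all the reduction needs."""
    return {0: IDENTITY, 1: p, -1: inverse(p)}[e]

def product(elements):
    return reduce(compose, elements, IDENTITY)

H = product([A, B, inverse(A), inverse(B)])   # the commutator h

def alice_input(x):
    """x -> (a^{x_1}, a^{-x_1}, ..., a^{x_n}, a^{-x_n})."""
    return [g for bit in x for g in (power(A, bit), power(A, -bit))]

def bob_input(y):
    """y -> (b^{y_1}, b^{-y_1}, ..., b^{y_n}, b^{-y_n})."""
    return [g for bit in y for g in (power(B, bit), power(B, -bit))]

def interleaved_product(xs, ys):
    return product([g for pair in zip(xs, ys) for g in pair])

def reduce_and_multiply(x, y):
    return interleaved_product(alice_input(x), bob_input(y))

print(reduce_and_multiply([1, 0, 1, 0], [0, 1, 0, 0]) == IDENTITY)  # disjoint
print(reduce_and_multiply([1, 0, 1, 0], [0, 0, 1, 1]) == H)         # one common element
print(reduce_and_multiply([1, 1, 1], [1, 1, 1]) == IDENTITY)        # three common elements
```

All three lines print `True`. The last one is the failure mode: here $h$ is a 3-cycle, so three intersections give $h^{3}=1$ and the product cannot tell the sets apart from disjoint ones.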

Next we prove a result for products of length just $4$; it applies to non-abelian groups of the form $G=H^{n}$ and holds without the promise.

Theorem 3. Let $H$ be a non-abelian group and consider $G=H^{n}$. Suppose Alice receives $x_{1},x_{2}$ and Bob receives $y_{1},y_{2}$. Deciding if $x_{1}\cdot y_{1}\cdot x_{2}\cdot y_{2}=1_{G}$ requires randomized communication $\Omega (n)$.

Proof. The proof is similar to the proof of Theorem 2. We use coordinate $i$ of $G$ to encode bit $i$ of the disjointness instance. If there is no intersection in the latter, the product will be $1_{G}$. Otherwise, at least one coordinate of the product will differ from $1_{H}$, so the product is not $1_{G}$. $\square$
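One natural way to fill in the details is to let Alice’s two elements carry $a^{x_{i}}$ and $a^{-x_{i}}$ in coordinate $i$, and Bob’s carry $b^{y_{i}}$ and $b^{-y_{i}}$. The sketch below checks this reading exhaustively for $H=S_{3}$ and $n=3$; the choice of $H$, $a$, $b$ and the helper names are assumptions for the example, not taken from the proof.

```python
from itertools import product as cartesian

IDENTITY = (0, 1, 2)
A = (1, 0, 2)
B = (0, 2, 1)   # A and B do not commute in S_3

def compose(p, q):
    return tuple(p[q[i]] for i in range(len(q)))

def inverse(p):
    inv = [0] * len(p)
    for i, image in enumerate(p):
        inv[image] = i
    return tuple(inv)

def power(p, e):
    return {0: IDENTITY, 1: p, -1: inverse(p)}[e]

def alice_elements(x):
    """Alice's x1, x2 in G = H^n, computed from her set x alone."""
    return (tuple(power(A, bit) for bit in x),
            tuple(power(A, -bit) for bit in x))

def bob_elements(y):
    """Bob's y1, y2 in G = H^n, computed from his set y alone."""
    return (tuple(power(B, bit) for bit in y),
            tuple(power(B, -bit) for bit in y))

def product_in_G(x1, y1, x2, y2):
    """Coordinate-wise product x1*y1*x2*y2 in G = H^n."""
    return tuple(compose(compose(compose(p, q), r), s)
                 for p, q, r, s in zip(x1, y1, x2, y2))

def is_identity(g):
    return all(coordinate == IDENTITY for coordinate in g)

n = 3
mismatches = 0
for x in cartesian((0, 1), repeat=n):
    for y in cartesian((0, 1), repeat=n):
        x1, x2 = alice_elements(x)
        y1, y2 = bob_elements(y)
        disjoint = not any(s and t for s, t in zip(x, y))
        if is_identity(product_in_G(x1, y1, x2, y2)) != disjoint:
            mismatches += 1
print("mismatches:", mismatches)
```

No uniqueness promise is needed here because each intersection lives in its own coordinate, so different copies of the commutator cannot cancel.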

As a corollary we can prove a lower bound for $A_{n}$.

Corollary 3. Theorem 3 holds for $G=A_{n}$.

Proof. Note that $A_{n}$ contains $(A_{4})^{\lfloor n/4\rfloor }$ and that $A_{4}$ is not abelian. Apply Theorem 3. $\square$

Theorem 3 is tight for constant-size $H$. We do not know if Corollary 3 is tight. The trivial upper bound is $O(\log |A_{n}|)=O(n\log n)$.

### 3 Proof of Theorem 1

Several related proofs of this theorem exist, see [GV15, GVa, Sha16]. As in [GVa], the proof that we present can be broken down into three steps. First we reduce the problem to a statement about conjugacy classes. Second we reduce this to a statement about trace maps. Third we prove the latter. We present the first step in a way that is similar but slightly different from the presentation in [GVa]. The second step is only sketched, but relies on classical results about $SL(2,q)$ and can be found in [GVa]. For the third we present a proof that was communicated to us by Will Sawin. We thank him for his permission to include it here.

#### 3.1 Step 1

We would like to rule out randomized protocols, but it is hard to reason about them directly. Instead, we are going to rule out deterministic protocols on random inputs. First, for any group element $g\in G$ we define the distribution on quadruples $D_{g}:=(x_{1},y_{1},x_{2},(x_{1}\cdot y_{1}\cdot x_{2})^{-1}g)$, where $x_{1},y_{1},x_{2}\in G$ are uniformly random elements. Note the product of the elements in $D_{g}$ is always $g$.
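The distribution $D_{g}$ is easy to sample, and doing so for a tiny instance makes the last remark concrete. The sketch below enumerates $SL(2,p)$ for the prime $p=5$; storing the matrix with rows $(a,b)$ and $(c,d)$ as the tuple `(a, b, c, d)` and all function names are my own conventions.

```python
import random
from itertools import product as cartesian

P = 5  # a small prime, so arithmetic mod P is arithmetic in the field F_P

def sl2(p):
    """All 2x2 matrices over F_p with determinant 1, as tuples (a, b, c, d)."""
    return [(a, b, c, d) for a, b, c, d in cartesian(range(p), repeat=4)
            if (a * d - b * c) % p == 1]

def mul(m, n, p=P):
    a, b, c, d = m
    e, f, g, h = n
    return ((a * e + b * g) % p, (a * f + b * h) % p,
            (c * e + d * g) % p, (c * f + d * h) % p)

def inv(m, p=P):
    """Inverse of a determinant-1 matrix: swap the diagonal, negate the rest."""
    a, b, c, d = m
    return (d % p, (-b) % p, (-c) % p, a % p)

def sample_D(g, group, rng):
    """One sample (x1, y1, x2, y2) from D_g."""
    x1, y1, x2 = (rng.choice(group) for _ in range(3))
    y2 = mul(inv(mul(mul(x1, y1), x2)), g)
    return x1, y1, x2, y2

group = sl2(P)
rng = random.Random(0)
g = rng.choice(group)
x1, y1, x2, y2 = sample_D(g, group, rng)
print(len(group), mul(mul(mul(x1, y1), x2), y2) == g)
```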

Towards a contradiction, suppose we have a randomized protocol $P$ such that

\begin{aligned} \mathbb{P} [P(D_{1})=1]\geq \mathbb{P} [P(D_{h})=1]+\frac {1}{10}. \end{aligned}

This implies a deterministic protocol with the same gap, by fixing the randomness.

We reach a contradiction by showing that for every deterministic protocol $P$ using little communication, we have

\begin{aligned} |\Pr [P(D_{1})=1]-\Pr [P(D_{h})=1]|\leq \frac {1}{100}. \end{aligned}

We start with the following standard lemma, which describes a protocol using product sets.

Lemma 4. (The set of accepted inputs of) A deterministic $c$-bit protocol for a function $f:X\times Y\to Z$ can be written as a disjoint union of $2^{c}$ rectangles, where a rectangle is a set of the form $A\times B$ with $A\subseteq X$ and $B\subseteq Y$ and where $f$ is constant.

Proof. (sketch) For every communication transcript $t$, let $S_{t}\subseteq X\times Y$ be the set of inputs giving transcript $t$. The sets $S_{t}$ are disjoint since an input gives only one transcript, and their number is at most $2^{c}$: one for each communication transcript of the protocol. The rectangle property can be proven by induction on the protocol tree. $\square$

Next, we show that no rectangle $A\times B$ can distinguish $D_{1}$ from $D_{h}$. The way we achieve this is by showing that the probability that $(A\times B)(D_{g})=1$ is roughly the same for every $g$, and is roughly the density of the rectangle. (Here we write $A\times B$ for the characteristic function of the set $A\times B$.) Without loss of generality we set $g=1_{G}$. Let $A$ have density $\alpha$ and $B$ have density $\beta$. We aim to bound above

\begin{aligned} \left |\mathbb{E} _{a_{1},b_{1},a_{2},b_{2}:a_{1}b_{1}a_{2}b_{2}=1}A(a_{1},a_{2})B(b_{1},b_{2})-\alpha \beta \right |, \end{aligned}

where note the distribution of $a_{1},b_{1},a_{2},b_{2}$ is the same as $D_{1}$.

Because the distribution of $(b_{1},b_{2})$ is uniform in $G^{2}$, the above can be rewritten as

\begin{aligned} & \left |\mathbb{E} _{b_{1},b_{2}}B(b_{1},b_{2})\mathbb{E} _{a_{1},a_{2}:a_{1}b_{1}a_{2}b_{2}=1}(A(a_{1},a_{2})-\alpha )\right |\\ & \le \sqrt {\mathbb{E} _{b_{1},b_{2}}B(b_{1},b_{2})^{2}}\sqrt {\mathbb{E} _{b_{1},b_{2}}\mathbb{E} _{a_{1},a_{2}:a_{1}b_{1}a_{2}b_{2}=1}^{2}(A(a_{1},a_{2})-\alpha )}\\ & =\sqrt {\beta }\sqrt {\mathbb{E} _{b_{1},b_{2},a_{1},a_{2},a_{1}',a_{2}':a_{1}b_{1}a_{2}b_{2}=a_{1}'b_{1}a_{2}'b_{2}=1}A(a_{1},a_{2})A(a_{1}',a_{2}')-\alpha ^{2}}. \end{aligned}

The inequality is Cauchy-Schwarz, and the step after that is obtained by expanding the square and noting that $(a_{1},a_{2})$ is uniform in $G^{2}$, so that the expectation of the term $A(a_{1},a_{2})\alpha$ is $\alpha ^{2}$.
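To spell out the expansion of the square, write $E_{b}$ for the expectation over $(b_{1},b_{2})$ and $E_{a|b}$ for the expectation over $(a_{1},a_{2})$ conditioned on $a_{1}b_{1}a_{2}b_{2}=1$ (a shorthand introduced only for this display):

\begin{aligned} E_{b}\left (E_{a|b}(A(a_{1},a_{2})-\alpha )\right )^{2} & =E_{b}\left (E_{a|b}A(a_{1},a_{2})\right )^{2}-2\alpha E_{b}E_{a|b}A(a_{1},a_{2})+\alpha ^{2}\\ & =E_{b}\left (E_{a|b}A(a_{1},a_{2})\right )^{2}-2\alpha ^{2}+\alpha ^{2}. \end{aligned}

The first term is the expectation of $A(a_{1},a_{2})A(a_{1}',a_{2}')$ for two pairs drawn independently given the same $(b_{1},b_{2})$, which is the distribution appearing under the last square root.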

Now we do several transformations to rewrite the distribution in the last expectation in a convenient form. First, right-multiplying by $b_{2}^{-1}$ we can rewrite the distribution as the uniform distribution on tuples such that

\begin{aligned} a_{1}b_{1}a_{2}=a_{1}'b_{1}a_{2}'. \end{aligned}

The last equation is equivalent to $b_{1}^{-1}(a_{1}')^{-1}a_{1}b_{1}a_{2}=a_{2}'$.

We can now do a transformation setting $a_{1}'$ to be $a_{1}x^{-1}$ to rewrite the distribution of the four-tuple as

\begin{aligned} (a_{1},a_{2},a_{1}x^{-1},C(x)a_{2}) \end{aligned}

where we use $C(x)$ to denote a uniform element from the conjugacy class of $x$, that is $b^{-1}xb$ for a uniform $b\in G$.
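The substitution can be checked directly; this is just the computation behind the display, plugging $a_{1}'=a_{1}x^{-1}$ into the formula for $a_{2}'$:

\begin{aligned} a_{2}'=b_{1}^{-1}(a_{1}x^{-1})^{-1}a_{1}b_{1}a_{2}=b_{1}^{-1}xa_{1}^{-1}a_{1}b_{1}a_{2}=(b_{1}^{-1}xb_{1})a_{2}. \end{aligned}

Since $a_{1}'$ is uniform and independent of $a_{1}$, so is $x=(a_{1}')^{-1}a_{1}$, and $b_{1}^{-1}xb_{1}$ for uniform $b_{1}$ is exactly $C(x)$.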

Hence it is sufficient to bound

\begin{aligned} \left |\mathbb{E} A(a_{1},a_{2})A(a_{1}x^{-1},C(x)a_{2})-\alpha ^{2}\right |, \end{aligned}

where all the variables are uniform and independent.

With a similar derivation as above, this can be rewritten as

\begin{aligned} & \left |\mathbb{E} A(a_{1},a_{2})\mathbb{E} (A(a_{1}x^{-1},C(x)a_{2})-\alpha )\right |\\ & \le \sqrt {\mathbb{E} A(a_{1},a_{2})^{2}}\sqrt {\mathbb{E} _{a_{1},a_{2}}\mathbb{E} _{x}^{2}(A(a_{1}x^{-1},C(x)a_{2})-\alpha )}\\ & =\sqrt {\alpha }\sqrt {\mathbb{E} A(a_{1}x^{-1},C(x)a_{2})A(a_{1}x'^{-1},C(x')a_{2})-\alpha ^{2}}. \end{aligned}

Here each occurrence of $C$ denotes a uniform and independent conjugate. Hence it is sufficient to bound

\begin{aligned} \left |\mathbb{E} A(a_{1}x^{-1},C(x)a_{2})A(a_{1}x'^{-1},C(x')a_{2})-\alpha ^{2}\right |. \end{aligned}

We can now replace $a_{2}$ with $C(x)^{-1}a_{2}.$ Because $C(x)^{-1}$ has the same distribution as $C(x^{-1})$, it is sufficient to bound

\begin{aligned} \left |\mathbb{E} A(a_{1}x^{-1},a_{2})A(a_{1}x'^{-1},C(x')C(x^{-1})a_{2})-\alpha ^{2}\right |. \end{aligned}

For this, it is enough to show that with high probability $1-1/|G|^{\Omega (1)}$ over $x'$ and $x$, the distribution of $C(x')C(x^{-1})$, over the choice of the two independent conjugates, has statistical distance $\le 1/|G|^{\Omega (1)}$ from uniform.

#### 3.2 Step 2

In this step we use information on the conjugacy classes of the group to reduce the latter task to one about the equidistribution of the trace map. Let $Tr$ be the Trace map:

\begin{aligned} Tr\begin {pmatrix}a_{1} & a_{2}\\ a_{3} & a_{4} \end {pmatrix}=a_{1}+a_{4}. \end{aligned}

We state the lemma that we want to show.

Lemma 5. Let $a:=\begin {pmatrix}0 & 1\\ 1 & w \end {pmatrix}$ and $b:=\begin {pmatrix}v & 1\\ 1 & 0 \end {pmatrix}$. For all but $O(1)$ values of $w\in \mathbb{F} _{q}$ and $v\in \mathbb{F} _{q}$, the distribution of

\begin{aligned} Tr\left (au^{-1}bu\right ) \end{aligned}

is $O(1/q)$ close to uniform over $\mathbb{F} _{q}$ in statistical distance, where $u$ is uniform in $G$.

To give some context, in $SL(2,q)$ the conjugacy class of an element is essentially determined by the trace. Moreover, we can think of $a$ and $b$ as generic elements in $G$. So the lemma can be interpreted as saying that for typical $a,b\in G$, taking a uniform element from the conjugacy class of $b$ and multiplying it by $a$ yields an element whose conjugacy class is uniform among the classes of $G$. Using that essentially all conjugacy classes have roughly the same size, and some of the properties of the trace map, one can show that the above lemma implies that for typical $x,x'$ the distribution of $C(x')C(x^{-1})$ is close to uniform. For more on how this fits we refer the reader to [GVa].

#### 3.3 Step 3

We now present a proof of Lemma 5. The high-level argument of the proof is the same as in [GVa] (Lemma 5.5), but the details may be more accessible and in particular the use of the Lang-Weil theorem [LW54] from algebraic geometry is replaced by a more elementary argument. For simplicity we shall only cover the case where $q$ is prime. We will show that for all but $O(1)$ values of $v,w,c\in \mathbb{F} _{q}$, the probability over $u$ that $Tr(au^{-1}bu)=c$ is within $O(1/q^{2})$ of $1/q$, and for the others it is at most $O(1/q)$. Summing over $c$ gives the result.

We shall consider elements $b$ whose trace is unique to the conjugacy class of $b$. (This holds for all but $O(1)$ conjugacy classes – see for example [GVa] for details.) This means that the distribution of $u^{-1}bu$ is that of a uniform element in $G$ conditioned on having trace $Tr(b)$. Hence, we can write the probability that $Tr(au^{-1}bu)=c$ as the number of solutions in $x$ to the following three equations (divided by the size of the conjugacy class of $b$, which is $q^{2}\pm q$):

\begin{aligned} x_{3}+x_{2}+wx_{4} & =c & \hspace {1cm}(Tr(ax)=c),\\ x_{1}+x_{4} & =v & \hspace {1cm}(Tr(x)=Tr(b)),\\ x_{1}x_{4}-x_{2}x_{3} & =1 & \hspace {1cm}(Det(x)=1). \end{aligned}

We use the second one to remove $x_{1}$ and the first one to remove $x_{2}$ from the last equation. This gives

\begin{aligned} (v-x_{4})x_{4}-(c-x_{3}-wx_{4})x_{3}=1. \end{aligned}

This is an equation in two variables. Write $x=x_{3}$ and $y=x_{4}$ and use distributivity to rewrite the equation as

\begin{aligned} -y^{2}+vy-cx+x^{2}+wxy=1. \end{aligned}

At least since Lagrange it has been known how to reduce this to a Pell equation $x^{2}+dy^{2}=e$. This is done by applying an invertible affine transformation, which does not change the number of solutions. First set $x=x-wy/2$. Then the equation becomes

\begin{aligned} -y^{2}+vy-c(x-wy/2)+(x-wy/2)^{2}+w(x-wy/2)y=1. \end{aligned}

Equivalently, the cross-term has disappeared and we have

\begin{aligned} y^{2}(-1-w^{2}/4)+y(v+cw/2)+x^{2}-cx=1. \end{aligned}
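The disappearance of the cross-term can be sanity-checked by brute force over a small field. The sketch below is only an illustration: the prime $q=7$ and the function names `before` and `after` are choices made here, not part of the original argument. It compares the quadratic before the substitution, evaluated at $x-wy/2$, with the displayed form after it, for every choice of $x,y,v,w,c\in \mathbb{F} _{7}$.

```python
# Brute-force check over F_q that substituting x -> x - w*y/2 removes the
# cross term w*x*y. The prime q = 7 is an arbitrary small choice.
q = 7
half = pow(2, -1, q)      # 1/2 in F_q (three-argument pow needs Python 3.8+)
quarter = pow(4, -1, q)   # 1/4 in F_q


def before(x, y, v, w, c):
    """Left-hand side -y^2 + vy - cx + x^2 + wxy, reduced mod q."""
    return (-y * y + v * y - c * x + x * x + w * x * y) % q


def after(x, y, v, w, c):
    """Left-hand side y^2(-1 - w^2/4) + y(v + cw/2) + x^2 - cx, reduced mod q."""
    return (y * y * (-1 - w * w * quarter) + y * (v + c * w * half) + x * x - c * x) % q


def substitution_is_consistent():
    """True if before(x - w*y/2, y) equals after(x, y) for all field elements."""
    field = range(q)
    return all(
        before((x - w * y * half) % q, y, v, w, c) == after(x, y, v, w, c)
        for x in field for y in field for v in field for w in field for c in field
    )


print(substitution_is_consistent())
```
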

Now one can add constants to $x$ and $y$ to remove the linear terms, changing the constant term. Specifically, let $d:=-1-w^{2}/4$ be the coefficient of $y^{2}$, which we assume to be non-zero, let $h:=(v+cw/2)/2$, and set $y=y-h/d$ and $x=x+c/2$. The equation becomes

\begin{aligned} (y-h/d)^{2}d+(y-h/d)2h+(x+c/2)^{2}-c(x+c/2)=1. \end{aligned}

The linear terms disappear, the coefficients of $x^{2}$ and $y^{2}$ do not change and the equation can be rewritten as

\begin{aligned} dy^{2}+h^{2}/d-2h^{2}/d+x^{2}+(c/2)^{2}-c^{2}/2=1. \end{aligned}

So this is now a Pell equation

\begin{aligned} x^{2}+dy^{2}=e \end{aligned}

where $d=-1-w^{2}/4$ as above and

\begin{aligned} e:=1+h^{2}/d+(c/2)^{2}=1+(v^{2}+(cw/2)^{2}+cvw)/(4d)+(c/2)^{2}. \end{aligned}

For all but $O(1)$ values of $w$ we have that $d$ is non-zero. Moreover, for every such $w$ and every $v$, the term $e$ is a non-zero polynomial in $c$: multiplying by $4d=-(4+w^{2})$ gives $4de=-c^{2}+cvw+v^{2}-w^{2}-4$, in which the coefficient of $c^{2}$ is $-1$. So $e$ vanishes for at most two values of $c$, and we only consider the values of $c$ that make it non-zero. Those where $e=0$ give $O(q)$ solutions, which is fine. We conclude with the following lemma.

Lemma 6. For $d$ and $e$ non-zero, and prime $q$, the number of solutions over $\mathbb{F} _{q}$ to the Pell equation

\begin{aligned} x^{2}+dy^{2}=e \end{aligned}

is within $O(1)$ of $q$.

This is a basic result from algebraic geometry that can be proved from first principles.

Proof. If $d=-f^{2}$ for some $f\in \mathbb{F} _{q}$, then we can replace $y$ with $fy$ and we can count instead the solutions to the equation

\begin{aligned} x^{2}-y^{2}=e. \end{aligned}

Because $x^{2}-y^{2}=(x-y)(x+y)$ we can set $x':=x-y$ and $y':=x+y$, which preserves the number of solutions, and rewrite the equation as

\begin{aligned} x'y'=e. \end{aligned}

Because $e\ne 0$, this has $q-1$ solutions: for every non-zero $y'$ we have $x'=e/y'$.

So now we can assume that $d\ne -f^{2}$ for any $f\in \mathbb{F} _{q}$. Because the number of squares is $(q+1)/2$, the range of $x^{2}$ has size $(q+1)/2$. Similarly, the range of $e-dy^{2}$ also has size $(q+1)/2$. Hence these two ranges intersect, and there is a solution $(a,b)$.

We take a line passing through $(a,b)$: for parameters $s,t\in \mathbb{F} _{q}$ we consider pairs $(a+t,b+st)$. There is a bijection between such pairs with $t\ne 0$ and the points $(x,y)$ with $x\ne a$. Because the number of solutions with $x=a$ is $O(1)$, using that $d\ne 0$, it suffices to count the solutions with $t\ne 0$.

The intuition is that this line has two intersections with the curve $x^{2}+dy^{2}=e$. Because one of them, $(a,b)$, lies in $\mathbb{F} _{q}$, the other has to lie there as well. Algebraically, we can plug the pair in the expression to obtain the equivalent equation

\begin{aligned} a^{2}+t^{2}+2at+d(b^{2}+s^{2}t^{2}+2bst)=e. \end{aligned}

Using that $(a,b)$ is a solution this becomes

\begin{aligned} t^{2}+2at+ds^{2}t^{2}+2dbst=0 \end{aligned}

Dividing by $t\ne 0$, we obtain

\begin{aligned} t(1+ds^{2})+2a+2dbs=0. \end{aligned}

We can now divide by $1+ds^{2}$ which is non-zero by the assumption $d\ne -f^{2}$. This yields

\begin{aligned} t=(-2a-2dbs)/(1+ds^{2}). \end{aligned}

Hence for every value of $s$ there is a unique $t$ giving a solution, and $t\ne 0$ for all but at most one value of $s$. This gives a number of solutions within $O(1)$ of $q$. $\square$
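Lemma 6 is easy to check numerically for small primes. The following is a brute-force sketch (the function names and the list of primes are illustrative choices, not from the text): it counts the solutions of $x^{2}+dy^{2}=e$ for every non-zero $d,e$ and reports the largest deviation from $q$.

```python
def count_pell_solutions(q, d, e):
    """Count pairs (x, y) in F_q^2 with x^2 + d*y^2 = e (mod q), by brute force."""
    return sum(
        1
        for x in range(q)
        for y in range(q)
        if (x * x + d * y * y - e) % q == 0
    )


def max_deviation(q):
    """Largest |#solutions - q| over all non-zero d and e, for an odd prime q."""
    return max(
        abs(count_pell_solutions(q, d, e) - q)
        for d in range(1, q)
        for e in range(1, q)
    )


for q in (5, 7, 11, 13):
    print(q, max_deviation(q))
```

On these small cases the two branches of the proof are both visible: when $-d$ is a square the count comes from $x'y'=e$, and otherwise from the line parametrization.
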

### 4 Three parties, number-in-hand

In this section we consider the following three-party number-in-hand problem: Alice gets $x$, Bob gets $y$, Charlie gets $z$, and they want to know if $x\cdot y\cdot z=1_{G}$. The communication depends on the group $G$. We present next two efficient protocols for abelian groups, and then a communication lower bound for other groups.

#### 4.1 A randomized protocol for the hypercube

We begin with the simplest setting. Let $G=(\mathbb {Z}_{2})^{n}$, that is $n$-bit strings with bit-wise addition modulo 2. The parties want to check if $x+y+z=0^{n}$. They can do so as follows. First, they pick a hash function $h$ that is linear: $h(x+y)=h(x)+h(y)$. Specifically, for a uniformly random $a\in \{0,1\}^{n}$ define $h_{a}(x):=\sum a_{i}x_{i}\mod 2$. Then, the protocol is as follows.

• Alice sends $h_{a}(x)$,
• Bob sends $h_{a}(y)$,
• Charlie accepts if and only if $h_{a}(x)+h_{a}(y)+h_{a}(z)=0$.

The hash function outputs 1 bit, so the communication is constant. By linearity, the protocol accepts iff $h_{a}(x+y+z)=0$. If $x+y+z=0$ this is always the case, otherwise it happens with probability $1/2$.
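The protocol above can be simulated directly. In this sketch, $n$-bit strings are stored as Python integers and addition in $(\mathbb {Z}_{2})^{n}$ is XOR; the helper names and the choice $n=8$ are assumptions made for illustration. Instead of sampling $a$, the code enumerates all $2^{n}$ values to get the exact acceptance probability.

```python
import random


def h(a, x):
    """Linear hash h_a(x) = sum_i a_i x_i mod 2, with bit strings stored as ints."""
    return bin(a & x).count("1") % 2


def protocol_accepts(a, x, y, z):
    """Alice and Bob each send one bit; Charlie accepts iff the hashes sum to 0 mod 2."""
    alice_bit = h(a, x)
    bob_bit = h(a, y)
    return (alice_bit + bob_bit + h(a, z)) % 2 == 0


def acceptance_fraction(n, x, y, z):
    """Exact fraction of the 2^n choices of a on which the protocol accepts."""
    return sum(protocol_accepts(a, x, y, z) for a in range(2 ** n)) / 2 ** n


n = 8
x, y = random.getrandbits(n), random.getrandbits(n)
print(acceptance_fraction(n, x, y, x ^ y))      # inputs with x + y + z = 0
print(acceptance_fraction(n, x, y, x ^ y ^ 1))  # inputs with x + y + z != 0
```
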

#### 4.2 A randomized protocol for $\mathbb {Z}_{N}$

This protocol is from [Vio14]. For simplicity we only consider the case $N=2^{n}$ here – the protocol for general $N$ is in [Vio14]. Again, the parties want to check if $x+y+z=0\mod N$. For this group, there is no 100% linear hash function but there are almost linear hash functions $h:\mathbb {Z}_{N}\rightarrow \mathbb {Z}_{2^{\ell }}$ that satisfy the following properties. Note that the inputs to $h$ are interpreted modulo $N$ and the outputs modulo $2^{\ell }$.

1. for all $a,x,y$ there is $c\in \{0,1\}$ such that $h_{a}(x+y)=h_{a}(x)+h_{a}(y)+c$,
2. for all $x\neq 0$ we have $\mathbb{P} _{a}[h_{a}(x)\in \{-2,-1,0,1,2\}]\leq O(1/2^{\ell })$,
3. $h_{a}(0)=0$.

Assuming some random hash function $h$ that satisfies the above properties, the protocol works similarly to the previous one:

• Alice sends $h_{a}(x)$,
• Bob sends $h_{a}(y)$,
• Charlie accepts if and only if $h_{a}(x)+h_{a}(y)+h_{a}(z)\in \{-2,-1,0\}$.

We can set $\ell =O(1)$ to achieve constant communication and constant error.

To prove correctness of the protocol, first note that $h_{a}(x)+h_{a}(y)+h_{a}(z)=h_{a}(x+y+z)-c$ for some $c\in \{0,1,2\}$. Then consider the following two cases:

• if $x+y+z=0$ then $h_{a}(x+y+z)-c=h_{a}(0)-c=-c,$ and the protocol is always correct.
• if $x+y+z\neq 0$ then the probability that $h_{a}(x+y+z)-c\in \{-2,-1,0\}$ for some $c\in \{0,1,2\}$ is at most the probability that $h_{a}(x+y+z)\in \{-2,-1,0,1,2\}$ which is $\leq 2^{-\Omega (\ell )}$; so the protocol is correct with high probability.

The hash function.

For the hash function we can use a function analyzed in [DHKP97]. Let $a$ be a random odd number modulo $2^{n}$. Define

\begin{aligned} h_{a}(x):=(a\cdot x\gg n-\ell )\mod 2^{\ell } \end{aligned}

where the product $a\cdot x$ is integer multiplication, and $\gg$ is bit-shift. In other words we output the bits $n-\ell +1,n-\ell +2,\ldots ,n$ of the integer product $a\cdot x$.

We now verify that the above hash function family satisfies the three properties we required above.

Property (3) is trivially satisfied.

For property (1) we have the following. Let $s=a\cdot x$ and $t=a\cdot y$ and $u=n-\ell$. To recap, by definition we have:

• $h_{a}(x+y)=((s+t)\gg u)\mod 2^{\ell },$
• $h_{a}(x)=(s\gg u)\mod 2^{\ell }$,
• $h_{a}(y)=(t\gg u)\mod 2^{\ell }$.

Notice that if in the addition $s+t$ the carry into the $u+1$ bit is $0$, then

\begin{aligned} (s\gg u)+(t\gg u)=(s+t)\gg u \end{aligned}

otherwise

\begin{aligned} (s\gg u)+(t\gg u)+1=(s+t)\gg u \end{aligned}

which concludes the proof for property (1).
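Properties (1) and (3) can also be confirmed exhaustively for small parameters. The sketch below assumes $n=6$ and $\ell =3$ purely for illustration, and the name `almost_linear` is made up here. It tries every odd multiplier $a$ and every pair $x,y$.

```python
def h(a, x, n, ell):
    """h_a(x) = ((a * x) >> (n - ell)) mod 2^ell, i.e. bits n-ell+1..n of a*x."""
    return ((a * x) >> (n - ell)) % (1 << ell)


def almost_linear(n, ell):
    """Check h_a(0) = 0 and h_a(x + y) = h_a(x) + h_a(y) + c with c in {0, 1}."""
    N, M = 1 << n, 1 << ell
    for a in range(1, N, 2):  # all odd multipliers modulo 2^n
        if h(a, 0, n, ell) != 0:
            return False
        for x in range(N):
            for y in range(N):
                # inputs are added modulo N, outputs compared modulo 2^ell
                c = (h(a, (x + y) % N, n, ell) - h(a, x, n, ell) - h(a, y, n, ell)) % M
                if c not in (0, 1):
                    return False
    return True


print(almost_linear(n=6, ell=3))
```
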

Finally, we prove property (2). We start by writing $x=s\cdot 2^{c}$ where $s$ is odd. So the binary representation of $x$ looks like

\begin{aligned} (\cdots \cdots 1\underbrace {0\cdots 0}_{c~\textrm {bits}}). \end{aligned}

The binary representation of the product $a\cdot x$ for a uniformly random $a$ looks like

\begin{aligned} (\textit {uniform}~1\underbrace {0\cdots 0}_{c~\textrm {bits}}). \end{aligned}

We consider the two following cases for the product $a\cdot x$:

1. If $a\cdot x=(\underbrace {\textit {uniform}~1\overbrace {00}^{2~bits}}_{\ell ~bits}\cdots 0)$, or equivalently $c\geq n-\ell +2$, the output never lands in the bad set $\{-2,-1,0,1,2\}$;
2. Otherwise, the hash function output has $\ell -O(1)$ uniform bits. For any set $B$, the probability that the output lands in $B$ is at most $|B|\cdot 2^{-\ell +O(1)}$.
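Putting the pieces together, here is a small simulation of the protocol for $\mathbb {Z}_{N}$ with the multiplicative hash. The parameters $n=8$, $\ell =5$ and the sample inputs are arbitrary choices for this sketch. As before, the code enumerates all odd $a$ rather than sampling, so the printed numbers are exact acceptance probabilities.

```python
def h(a, x, n, ell):
    """Bits n-ell+1..n of the integer product a*x (the hash defined above)."""
    return ((a * x) >> (n - ell)) % (1 << ell)


def charlie_accepts(a, x, y, z, n, ell):
    """Alice sends h_a(x), Bob sends h_a(y); Charlie adds h_a(z) and tests the sum."""
    M = 1 << ell
    total = (h(a, x, n, ell) + h(a, y, n, ell) + h(a, z, n, ell)) % M
    return total in (0, M - 1, M - 2)  # the set {0, -1, -2} modulo 2^ell


def acceptance_fraction(x, y, z, n, ell):
    """Exact fraction of odd multipliers a on which Charlie accepts."""
    odd_multipliers = range(1, 1 << n, 2)
    accepted = sum(charlie_accepts(a, x, y, z, n, ell) for a in odd_multipliers)
    return accepted / len(odd_multipliers)


n, ell = 8, 5
N = 1 << n
x, y = 100, 200
print(acceptance_fraction(x, y, (-x - y) % N, n, ell))     # x + y + z = 0 mod N
print(acceptance_fraction(x, y, (1 - x - y) % N, n, ell))  # x + y + z = 1 mod N
```

The first line illustrates the one-sided guarantee (inputs summing to zero are always accepted); the second gives the false-accept rate for one particular non-zero sum, which shrinks as $\ell$ grows.
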

#### 4.3 Quasirandom groups

What happens in other groups? The hash function used in the previous result was fairly non-trivial. Do we have an almost linear hash function for $2\times 2$ matrices? The answer is negative. For $SL_{2}(q)$ and $A_{n}$ the problem is hard, even under the promise. For a group $G$ the complexity can be expressed in terms of a parameter $d$ which comes from representation theory. We will not formally define this parameter here, but several qualitatively equivalent formulations can be found in [Gow08]. Instead the following table shows the $d$’s for the groups we’ve introduced.

| $G$ | abelian | $A_{n}$ | $SL_{2}(q)$ |
|---|---|---|---|
| $d$ | $1$ | $\Omega (\frac {\log \lvert G\rvert }{\log \log \lvert G\rvert })$ | $\lvert G\rvert ^{\Omega (1)}$ |

Theorem 1. Let $G$ be a group, and let $h\in G$. Let $d$ be the minimum dimension of any irreducible representation of $G$. Suppose Alice, Bob, and Charlie receive $x$, $y$, and $z$ respectively. They are promised that $x\cdot y\cdot z$ either equals $1_{G}$ or $h$. Deciding which case it is requires randomized communication complexity $\Omega (\log d)$.

This result is tight for the groups we have discussed so far. The arguments are the same as before. Specifically, for $SL_{2}(q)$ the communication is $\Omega (\log |G|)$. This is tight up to constants, because Alice and Bob can send their elements. For $A_{n}$ the communication is $\Omega (\log \log |G|)$. This is tight as well, as the parties can again just communicate the images of an element $a$ such that $h(a)\ne a$, as discussed in Section 1. This also gives a computational proof that $d$ cannot be too large for $A_{n}$, i.e., it is at most $(\log |G|)^{O(1)}$. For abelian groups we get nothing, matching the efficient protocols given above.

### 5 Proof of Theorem 1

First we discuss several “mixing” lemmas for groups, then we come back to protocols and see how to apply one of them there.

##### 5.0.1 $XY$ mixing

We want to consider “high entropy” distributions over $G$, and state a fact showing that the multiplication of two such distributions “mixes” or in other words increases the entropy. To define entropy we use the norms $\lVert A\rVert _{c}=\left (\sum _{x}A(x)^{c}\right )^{\frac {1}{c}}$. Our notion of (non-)entropy will be $\lVert A\rVert _{2}$. Note that $\lVert A\rVert _{2}^{2}$ is exactly the collision probability $\mathbb{P} [A=A']$ where $A'$ is independent and identically distributed to $A$. The smaller this quantity, the higher the entropy of $A$. For the uniform distribution $U$ we have $\lVert U\rVert _{2}^{2}=\frac {1}{|G|}$ and so we can think of $1/|G|$ as maximum entropy. If $A$ is uniform over $\Omega (|G|)$ elements, we have $\lVert A\rVert _{2}^{2}=O(1/|G|)$ and we think of $A$ as having “high” entropy.

Because $\lVert U\rVert _{2}^{2}$ is small, we can think of the distance between $A$ and $U$ in the 2-norm as being essentially the (non-)entropy of $A$:

\begin{aligned} \lVert A-U\rVert _{2}^{2} & =\sum _{x\in G}\left (A(x)-\frac {1}{|G|}\right )^{2}\\ & =\sum _{x\in G}\left (A(x)^{2}-2A(x)\frac {1}{|G|}+\frac {1}{|G|^{2}}\right )\\ & =\lVert A\rVert _{2}^{2}-\frac {1}{|G|}\\ & =\lVert A\rVert _{2}^{2}-\lVert U\rVert _{2}^{2}\\ & \approx \lVert A\rVert _{2}^{2}. \end{aligned}
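A quick numeric instance of this identity, using exact rational arithmetic. The group size 120 and the support size 30 are arbitrary illustrative values, and distributions are represented simply as lists of probabilities.

```python
from fractions import Fraction


def squared_norm(v):
    """Squared 2-norm: the sum over x of v(x)^2."""
    return sum(p * p for p in v)


def uniform_on(support_size, group_size):
    """Distribution uniform on the first `support_size` of `group_size` elements."""
    inside = [Fraction(1, support_size)] * support_size
    outside = [Fraction(0)] * (group_size - support_size)
    return inside + outside


group_size = 120
A = uniform_on(30, group_size)          # uniform over a quarter of the group
U = uniform_on(group_size, group_size)  # the uniform distribution
difference = [a - u for a, u in zip(A, U)]

print(squared_norm(A))                    # collision probability of A
print(squared_norm(difference))           # ||A - U||_2^2
print(squared_norm(A) - squared_norm(U))  # ||A||_2^2 - ||U||_2^2
```
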

Lemma 7. [Gow08, BNP08] If $X,Y$ are independent over $G$, then

\begin{aligned} \lVert X\cdot Y-U\rVert _{2}\leq \lVert X\rVert _{2}\lVert Y\rVert _{2}\sqrt {\frac {|G|}{d}}, \end{aligned}

where $d$ is the minimum dimension of an irreducible representation of $G$.

By this lemma, for high entropy distributions $X$ and $Y$, we get $\lVert X\cdot Y-U\rVert _{2}\leq \frac {O(1)}{\sqrt {|G|d}}$. The factor $1/\sqrt {|G|}$ allows us to pass to statistical distance $\lVert .\rVert _{1}$ using Cauchy-Schwarz:

\begin{aligned} \lVert X\cdot Y-U\rVert _{1}\leq \sqrt {|G|}\lVert X\cdot Y-U\rVert _{2}\leq \frac {O(1)}{\sqrt {d}}.~~~~(1) \end{aligned}

This is the way in which we will use the lemma.
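To see the quantities in Lemma 7 side by side, one can evaluate both sides on a small quasirandom group. The sketch below uses $SL(2,5)$, which has 120 elements, and takes $d=(q-1)/2=2$; that value of $d$ is a standard fact about $SL(2,q)$ for odd $q$ that is assumed here rather than derived in the text. The sets $X,Y$ are random subsets of size 80, chosen arbitrarily. For a group this small the bound is quite loose, so this is only a sanity check of the definitions, not evidence of strong mixing.

```python
import math
import random


def sl2(q):
    """All 2x2 matrices (a, b, c, d) over F_q with determinant 1."""
    return [
        (a, b, c, d)
        for a in range(q) for b in range(q) for c in range(q) for d in range(q)
        if (a * d - b * c) % q == 1
    ]


def mul(g, h, q):
    """Matrix product g*h modulo q, for g = [[a, b], [c, d]] and h = [[e, f], [k, l]]."""
    a, b, c, d = g
    e, f, k, l = h
    return ((a * e + b * k) % q, (a * f + b * l) % q,
            (c * e + d * k) % q, (c * f + d * l) % q)


def product_distance(X, Y, G, q):
    """2-norm distance from uniform of x*y, for x uniform in X and y uniform in Y."""
    prob = dict.fromkeys(G, 0.0)
    weight = 1.0 / (len(X) * len(Y))
    for x in X:
        for y in Y:
            prob[mul(x, y, q)] += weight
    return math.sqrt(sum((p - 1.0 / len(G)) ** 2 for p in prob.values()))


def lemma_7_bound(X, Y, group_size, d):
    """||X||_2 ||Y||_2 sqrt(|G|/d), using ||X||_2 = 1/sqrt(|X|) for uniform X."""
    return math.sqrt(group_size / d) / math.sqrt(len(X) * len(Y))


q = 5
G = sl2(q)
d = (q - 1) // 2  # assumed minimum dimension of a non-trivial representation
random.seed(1)
X = random.sample(G, 80)
Y = random.sample(G, 80)
print(product_distance(X, Y, G, q), lemma_7_bound(X, Y, len(G), d))
```
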

Another useful consequence of this lemma, which however we will not use directly, is this. Suppose now you have *three* independent, high-entropy variables $X,Y,Z$. Then for every $g\in G$ we have

\begin{aligned} |\mathbb{P} [X\cdot Y\cdot Z=g]-1/|G||\le \lVert X\rVert _{2}\lVert Y\rVert _{2}\lVert Z\rVert _{2}\sqrt {\frac {|G|}{d}}.~~~~(2) \end{aligned}

To show this, set $g=1_{G}$ without loss of generality and rewrite the left-hand-side as

\begin{aligned} |\sum _{h\in G}\mathbb{P} [X=h](\mathbb{P} [YZ=h^{-1}]-1/|G|)|. \end{aligned}

By Cauchy-Schwarz this is at most

\begin{aligned} \sqrt {\sum _{h}\mathbb{P} ^{2}[X=h]}\sqrt {\sum _{h}(\mathbb{P} [YZ=h^{-1}]-1/|G|)^{2}}=\lVert X\rVert _{2}\lVert YZ-U\rVert _{2} \end{aligned}

and we can conclude by Lemma 7. Hence the product of three high-entropy distributions is close to uniform in a point-wise sense: each group element is obtained with roughly probability $1/|G|$.

At least over $SL(2,q)$, there exists an alternative proof of this fact that does not mention representation theory (see [GVa] and [Vioa, Viob]).

With this notation in hand, we conclude by stating a “mixing” version of Theorem 2. For more on this perspective we refer the reader to [GVa].

Theorem 1. Let $G=SL(2,q)$. Let $X=(X_{1},X_{2})$ and $Y=(Y_{1},Y_{2})$ be two distributions over $G^{2}$. Suppose $X$ is independent from $Y$. Let $g\in G$. We have

\begin{aligned} |\mathbb{P} [X_{1}Y_{1}X_{2}Y_{2}=g]-1/|G||\le |G|^{1-\Omega (1)}\lVert X\rVert _{2}\lVert Y\rVert _{2}. \end{aligned}

For example, when $X$ and $Y$ have high entropy over $G^{2}$ (that is, are uniform over $\Omega (|G|^{2})$ pairs), we have $\lVert X\rVert _{2}\le \sqrt {O(1)/|G|^{2}}$, and so $|G|^{1-\Omega (1)}\lVert X\rVert _{2}\lVert Y\rVert _{2}\le 1/|G|^{1+\Omega (1)}$. In particular, $X_{1}Y_{1}X_{2}Y_{2}$ is $1/|G|^{\Omega (1)}$ close to uniform over $G$ in statistical distance.

##### 5.0.2 Back to protocols

As in the beginning of Section 3, for any group element $g\in G$ we define the distribution on triples $D_{g}:=(x,y,(x\cdot y)^{-1}g)$, where $x,y\in G$ are uniform and independent. Note the product of the elements in $D_{g}$ is always $g$. Again as in Section 3, it suffices to show that for every deterministic protocol $P$ using little communication we have

\begin{aligned} |\Pr [P(D_{1})=1]-\Pr [P(D_{h})=1]|\leq \frac {1}{100}. \end{aligned}

Analogously to Lemma 4, the following lemma describes a protocol using rectangles. The proof is nearly identical and is omitted.

Lemma 8. (The set of accepted inputs of) A deterministic $c$-bit number-in-hand protocol with three parties can be written as a disjoint union of $2^{c}$ “rectangles,” that is sets of the form $A\times B\times C$.

Next we show that these product sets cannot distinguish these two distributions $D_{1},D_{h}$, via a straightforward application of Lemma 7.

Lemma 9. For all $A,B,C\subseteq G$ we have $|\mathbb{P} [(A\times B\times C)(D_{1})=1]-\mathbb{P} [(A\times B\times C)(D_{h})=1]|\leq 1/d^{\Omega (1)}.$

Proof. Pick any $h\in G$ and let $x,y,z$ be the inputs of Alice, Bob, and Charlie respectively. Then

\begin{aligned} \mathbb{P} [(A\times B\times C)(D_{h})=1]=\mathbb{P} [(x,y)\in A\times B]\cdot \mathbb{P} [(x\cdot y)^{-1}\cdot h\in C|(x,y)\in A\times B],~~~~(3) \end{aligned}

where $(x,y)$ is uniform in $G^{2}$. If either $A$ or $B$ is small, that is $\mathbb{P} [x\in A]\leq \epsilon$ or $\mathbb{P} [y\in B]\leq \epsilon$, then also $\mathbb{P} [(x,y)\in A\times B]\le \epsilon$ and hence (3) is at most $\epsilon$ as well. This holds for every $h$, so we also have $|\mathbb{P} [(A\times B\times C)(D_{1})=1]-\mathbb{P} [(A\times B\times C)(D_{h})=1]|\leq \epsilon .$ We will choose $\epsilon$ later.

Otherwise, $A$ and $B$ are large: $\mathbb{P} [x\in A]>\epsilon$ and $\mathbb{P} [y\in B]>\epsilon$. Let $(x',y')$ be the distribution of $(x,y)$ conditioned on $(x,y)\in A\times B$. We have that $x'$ and $y'$ are independent and each is uniform over at least $\epsilon |G|$ elements. By Lemma 7 this implies $\lVert x'\cdot y'-U\rVert _{2}\leq \lVert x'\rVert _{2}\cdot \lVert y'\rVert _{2}\cdot \sqrt {\frac {|G|}{d}}$, where $U$ is the uniform distribution. As mentioned after the lemma, by Cauchy–Schwarz we obtain

\begin{aligned} \lVert x'\cdot y'-U\rVert _{1}\leq |G|\cdot \lVert x'\rVert _{2}\cdot \lVert y'\rVert _{2}\cdot \sqrt {\frac {1}{d}}\leq \frac {1}{\epsilon }\cdot \frac {1}{\sqrt {d}}, \end{aligned}

where the last inequality follows from the fact that $\lVert x'\rVert _{2},\lVert y'\rVert _{2}\leq \sqrt {\frac {1}{\epsilon |G|}}$.

This implies that $\lVert (x'\cdot y')^{-1}-U\rVert _{1}\leq \frac {1}{\epsilon }\cdot \frac {1}{\sqrt {d}}$ and $\lVert (x'\cdot y')^{-1}\cdot h-U\rVert _{1}\leq \frac {1}{\epsilon }\cdot \frac {1}{\sqrt {d}}$, because taking inverses and multiplying by $h$ does not change the distance to uniform. These two last inequalities imply that

\begin{aligned} |\mathbb{P} [(x'\cdot y')^{-1}\in C]-\mathbb{P} [(x'\cdot y')^{-1}\cdot h\in C]|\le O(\frac {1}{\epsilon \sqrt {d}}); \end{aligned}

and thus we get that

\begin{aligned} |\mathbb{P} [(A\times B\times C)(D_{1})=1]-\mathbb{P} [(A\times B\times C)(D_{h})=1]|\le O(\frac {1}{\epsilon \sqrt {d}}). \end{aligned}

Picking $\epsilon =1/d^{1/4}$ completes the proof. $\square$

Returning to arbitrary deterministic protocols $P$ (as opposed to rectangles), write $P$ as a union of $2^{c}$ disjoint rectangles by Lemma 8. Applying Lemma 9 and summing over all rectangles we get that the distinguishing advantage of $P$ is at most $2^{c}/d^{1/4}$. For $c\leq (1/100)\log d$ the advantage is at most $1/100$, concluding the proof.
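For concreteness, the final arithmetic can be spelled out as follows. This is a sketch that takes logarithms in base 2 and ignores the constant hidden in Lemma 9, both of which are assumptions made here:

\begin{aligned} c\le \frac {1}{100}\log d\implies 2^{c}\le d^{1/100}\implies \frac {2^{c}}{d^{1/4}}\le d^{1/100-1/4}=d^{-6/25}, \end{aligned}

which is at most $1/100$ once $d\ge 100^{25/6}$.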

In number-on-forehead (NOF) communication complexity [CFL83] with $k$ parties, the input is a $k$-tuple $(x_{1},\dotsc ,x_{k})$ and each party $i$ sees all of it except $x_{i}$. For background, it is not known how to prove negative results for $k\ge \log n$ parties.

We mention that Theorem 1 can be extended to the multiparty setting, see [GVa]. Several questions arise here, such as whether this problem remains hard for $k\ge \log n$, and what is the minimum length of an interleaved product that is hard for $k=3$ parties (the proof in 1 gives a large constant).

However in this survey we shall instead focus on the problem of separating deterministic and randomized communication. For $k=2$, we know the optimal separation: The equality function requires $\Omega (n)$ communication for deterministic protocols, but can be solved using $O(1)$ communication if we allow the protocols to use public coins. For $k=3$, the best known separation between deterministic and randomized protocols is $\Omega (\log n)$ vs $O(1)$ [BDPW10]. In the following we give a new proof of this result, for a different function: $f(x,y,z)=1$ if and only if $x\cdot y\cdot z=1_{G}$ for $x,y,z\in SL(2,q)$. As is true for some functions in [BDPW10], a stronger separation could hold for $f$. For context, let us state and prove the upper bound for randomized communication.

Claim 10. $f$ has randomized communication complexity $O(1)$.

Proof. In the number-on-forehead model, computing $f$ reduces to two-party equality with no additional communication: Alice computes $y\cdot z=:w$ privately, then Alice and Bob check if $x=w^{-1}$. $\square$

To prove the lower bound for deterministic protocols we reduce the communication problem to a combinatorial problem.

Definition 11. A corner in a group $G$ is a set $\{(x,y),(xz,y),(x,zy)\}\subseteq G^{2}$, where $x,y$ are arbitrary group elements and $z\neq 1_{G}$.
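To make the definition concrete in a non-abelian group, here is a brute-force corner finder. It is a sketch only: the group is taken to be the symmetric group $S_{3}$, with permutations stored as tuples, and the names `mul` and `find_corner` are made up for this illustration.

```python
from itertools import permutations

G = list(permutations(range(3)))  # the symmetric group S_3, as tuples
IDENTITY = tuple(range(3))


def mul(p, r):
    """Product p*r of two permutations: apply r first, then p."""
    return tuple(p[r[i]] for i in range(len(r)))


def find_corner(A):
    """Return (x, y, z) with z != 1 and (x, y), (xz, y), (x, zy) all in A, or None."""
    A = set(A)
    for (x, y) in A:
        for z in G:
            if z != IDENTITY and (mul(x, z), y) in A and (x, mul(z, y)) in A:
                return x, y, z
    return None


everything = [(x, y) for x in G for y in G]
inverse_pairs = [(x, y) for x in G for y in G if mul(x, y) == IDENTITY]

print(find_corner(everything) is not None)  # the full set G^2 contains corners
print(find_corner(inverse_pairs))           # pairs with xy = 1 contain none
```

The second set is corner-free because $(x,y)$ and $(xz,y)$ both having product $1_{G}$ forces $z=1_{G}$.
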

For intuition, if $G$ is the abelian group of real numbers with addition, a corner becomes $\{(x,y),(x+z,y),(x,y+z)\}$ for $z\neq 0$, which are the coordinates of an isosceles triangle. We now state the theorem that connects corners and lower bounds.

Lemma 12. Let $G$ be a group and $\delta$ a real number. Suppose that every subset $A\subseteq G^{2}$ with $|A|/|G^{2}|\ge \delta$ contains a corner. Then the deterministic communication complexity of $f$ (defined as $f(x,y,z)=1\iff x\cdot y\cdot z=1_{G}$) is $\Omega (\log (1/\delta ))$.

It is known that $\delta \ge 1/\mathrm {polyloglog}|G|$ implies a corner for certain abelian groups $G$, see [LM07] for the best bound and pointers to the history of the problem. For $G=SL(2,q)$ a stronger result is known: $\delta \ge 1/\mathrm {polylog}|G|$ implies a corner [Aus16]. This in turn implies communication $\Omega (\log \log |G|)=\Omega (\log n)$.

Proof. We saw already twice that a number-in-hand $c$-bit protocol can be written as a disjoint union of $2^{c}$ rectangles (Lemmas 4, 8). Likewise, a number-on-forehead $c$-bit protocol $P$ can be written as a disjoint union of $2^{c}$ cylinder intersections $C_{i}:=\{(x,y,z):f_{i}(y,z)g_{i}(x,z)h_{i}(x,y)=1\}$ for some $f_{i},g_{i},h_{i}\colon G^{2}\to \{0,1\}$:

\begin{aligned} P(x,y,z)=\sum _{i=1}^{2^{c}}f_{i}(y,z)g_{i}(x,z)h_{i}(x,y). \end{aligned}

The proof idea of the above fact is to consider the $2^{c}$ transcripts of $P$, then one can see that the inputs giving a fixed transcript are a cylinder intersection.

Let $P$ be a $c$-bit protocol. Consider the inputs $\{(x,y,(xy)^{-1})\}$ on which $P$ accepts. Note that at least $2^{-c}$ fraction of them are accepted by some cylinder intersection $C=f\cdot g\cdot h$. Let $A:=\{(x,y):(x,y,(xy)^{-1})\in C\}\subseteq G^{2}$. Since the first two elements in the tuple determine the last, we have $|A|/|G^{2}|\ge 2^{-c}$.

Now suppose $A$ contains a corner $\{(x,y),(xz,y),(x,zy)\}$. Then

\begin{aligned} (x,y)\in A & \implies (x,y,(xy)^{-1})\in C & & \implies h(x,y)=1,\\ (xz,y)\in A & \implies (xz,y,(xzy)^{-1})\in C & & \implies f(y,(xzy)^{-1})=1,\\ (x,zy)\in A & \implies (x,zy,(xzy)^{-1})\in C & & \implies g(x,(xzy)^{-1})=1. \end{aligned}

This implies $(x,y,(xzy)^{-1})\in C$, which is a contradiction because $z\neq 1$ and so $x\cdot y\cdot (xzy)^{-1}\neq 1_{G}$. $\square$

### 7 The corners theorem for quasirandom groups

In this section we prove the corners theorem for quasirandom groups, following Austin [Aus16]. Our exposition has several minor differences with that in [Aus16], which may make it more computer-science friendly. Possibly a proof can also be obtained via certain local modifications and simplifications of Green’s exposition [Gre05b, Gre05a] of an earlier proof for the abelian case. We focus on the case $G=\textit {SL}(2,q)$ for simplicity, but the proof immediately extends to other quasirandom groups (with corresponding parameters).

Theorem 1. Let $G=\textit {SL}(2,q)$. Every subset $A\subseteq G^{2}$ of density $|A|/|G|^{2}\geq 1/\log ^{a}|G|$ contains a corner $\{(x,y),(xz,y),(x,zy)~|~z\neq 1\}$.

#### 7.1 Proof idea

For intuition, suppose $A$ is a product set, i.e., $A=B\times C$ for $B,C\subseteq G$. Let’s look at the quantity

\begin{aligned} \mathbb {E}_{x,y,z\leftarrow G}[A(x,y)A(xz,y)A(x,zy)] \end{aligned}

where $A(x,y)=1$ iff $(x,y)\in A$. Note that the random variable in the expectation is equal to $1$ exactly when $x,y,z$ form a corner in $A$. We’ll show that this quantity is greater than $1/|G|$, which implies that $A$ contains a corner (where $z\neq 1$). Since we are taking $A=B\times C$, we can rewrite the above quantity as

\begin{aligned} \mathbb {E}_{x,y,z\leftarrow G}[B(x)C(y)B(xz)C(y)B(x)C(zy)] & =\mathbb {E}_{x,y,z\leftarrow G}[B(x)C(y)B(xz)C(zy)]\\ & =\mathbb {E}_{x,y,z\leftarrow G}[B(x)C(y)B(z)C(x^{-1}zy)] \end{aligned}

where the last line follows by replacing $z$ with $x^{-1}z$ in the uniform distribution. If $|A|/|G|^{2}\ge \delta$, then both $|B|/|G|\ge \delta$ and $|C|/|G|\ge \delta$. Condition on $x\in B$, $y\in C$, $z\in B$. Then the distribution $x^{-1}zy$ is a product of three independent distributions, each uniform on a set of density $\ge \delta$. (In fact, two distributions would suffice for this.) By Lemma 7, $x^{-1}zy$ is $\delta ^{-1}/|G|^{\Omega (1)}$ close to uniform in statistical distance. This implies that the above expectation equals

\begin{aligned} \frac {|A|}{|G|^{2}}\cdot \frac {|B|}{|G|}\cdot \left (\frac {|C|}{|G|}\pm \frac {\delta ^{-1}}{|G|^{\Omega (1)}}\right ) & \geq \delta ^{2}\left (\delta -\frac {1}{|G|^{\Omega (1)}}\right )\geq \delta ^{3}/2>1/|G|, \end{aligned}

for $\delta >1/|G|^{c}$ for a small enough constant $c$. Hence, product sets of density polynomial in $1/|G|$ contain corners.

Given the above, it is natural to try to decompose an arbitrary set $A$ into product sets. We will make use of a more general result.

#### 7.2 Weak Regularity Lemma

Let $U$ be some universe (we will take $U=G^{2}$) and let $f:U\rightarrow [-1,1]$ be a function (for us, $f=1_{A}$). Let $D\subseteq \{d:U\rightarrow [-1,1]\}$ be some set of functions, which can be thought of as “easy functions” or “distinguishers” (these will be rectangles or closely related to them). The next theorem shows how to decompose $f$ into a linear combination $g$ of the $d_{i}$ up to an error which is polynomial in the length of the combination. More specifically, $f$ will be indistinguishable from $g$ by the $d_{i}$.

Lemma 13. Let $f:U\rightarrow [-1,1]$ be a function and $D\subseteq \{d:U\rightarrow [-1,1]\}$ a set of functions. For all $\epsilon >0$, there exists a function $g:=\sum _{i\le s}c_{i}\cdot d_{i}$ where $d_{i}\in D$, $c_{i}\in \mathbb {R}$ and $s=1/\epsilon ^{2}$ such that for all $d\in D$

\begin{aligned} \left |\mathbb {E}_{x\leftarrow U}[f(x)\cdot d(x)]-\mathbb {E}_{x\leftarrow U}[g(x)\cdot d(x)]\right |\le \epsilon . \end{aligned}

A different way to state the conclusion, which we will use, is to say that we can write $f=g+h$ so that $\mathbb{E} [h(x)\cdot d(x)]$ is small.

The lemma is due to Frieze and Kannan [FK96]. It is called “weak” because it came after Szemerédi’s regularity lemma, which has a stronger distinguishing conclusion. However, the lemma is also “strong” in the sense that Szemerédi’s regularity lemma has $s$ as a tower of $1/\epsilon$ whereas here we have $s$ polynomial in $1/\epsilon$. The weak regularity lemma is also simpler. There also exists a proof [Tao17] of Szemerédi’s theorem (on arithmetic progressions), which uses weak regularity as opposed to the full regularity lemma used initially.

Proof. We will construct the approximation $g$ through an iterative process producing functions $g_{0},g_{1},\dots ,g$. We will show that $||f-g_{i}||_{2}^{2}$ decreases by $\ge \epsilon ^{2}$ each iteration.

Start: Define $g_{0}=0$ (which can be realized setting $c_{0}=0$).

Iterate: If not done, there exists $d\in D$ such that $|\mathbb {E}[(f-g)\cdot d]|>\epsilon$. Assume without loss of generality $\mathbb {E}[(f-g)\cdot d]>\epsilon$.

Update: $g':=g+\lambda d$ where $\lambda \in \mathbb {R}$ shall be picked later.

Let us analyze the progress made by the algorithm.

\begin{aligned} ||f-g'||_{2}^{2} & =\mathbb {E}_{x}[(f-g')^{2}(x)]\\ & =\mathbb {E}_{x}[(f-g-\lambda d)^{2}(x)]\\ & =\mathbb {E}_{x}[(f-g)^{2}]+\mathbb {E}_{x}[\lambda ^{2}d^{2}(x)]-2\mathbb {E}_{x}[(f-g)\cdot \lambda d(x)]\\ & \leq ||f-g||_{2}^{2}+\lambda ^{2}-2\lambda \mathbb {E}_{x}[(f-g)d(x)]\\ & \leq ||f-g||_{2}^{2}+\lambda ^{2}-2\lambda \epsilon \\ & \leq ||f-g||_{2}^{2}-\epsilon ^{2} \end{aligned}

where the last line follows by taking $\lambda =\epsilon$. Therefore, there can only be $1/\epsilon ^{2}$ iterations because $||f-g_{0}||_{2}^{2}=||f||_{2}^{2}\leq 1$. $\square$
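The iterative proof translates almost line by line into code. The sketch below runs it on a tiny universe $U=[4]\times [4]$ with $D$ the indicator functions of rectangles $S\times T$; the grid size, the example function $f$, and $\epsilon =1/4$ are arbitrary choices for illustration. The search for a violating distinguisher is exhaustive, so this does not scale beyond toy sizes. A negative coefficient plays the role of the "without loss of generality" step.

```python
from fractions import Fraction
from itertools import product


def subsets(n):
    """All subsets of range(n), as tuples of 0/1 indicator values."""
    return list(product((0, 1), repeat=n))


def correlation(residual, S, T):
    """E[residual(x, y) * 1_S(x) * 1_T(y)] for (x, y) uniform in the n x n grid."""
    n = len(residual)
    total = sum(residual[x][y] for x in range(n) if S[x] for y in range(n) if T[y])
    return total * Fraction(1, n * n)


def weak_regularity(f, eps):
    """Greedy decomposition of f into weighted rectangles, following the proof.

    Returns (terms, residual): terms is a list of (coefficient, S, T) whose
    weighted sum is g, and residual is f - g.
    """
    n = len(f)
    residual = [[Fraction(v) for v in row] for row in f]
    rectangles = [(S, T) for S in subsets(n) for T in subsets(n)]
    terms = []
    while True:
        S, T = max(rectangles, key=lambda r: abs(correlation(residual, *r)))
        corr = correlation(residual, S, T)
        if abs(corr) <= eps:
            return terms, residual  # no rectangle distinguishes f from g
        lam = eps if corr > 0 else -eps  # step size lambda = eps, with sign
        terms.append((lam, S, T))
        for x in range(n):
            for y in range(n):
                if S[x] and T[y]:
                    residual[x][y] -= lam


f = [[1 if x <= y else 0 for y in range(4)] for x in range(4)]
eps = Fraction(1, 4)
terms, residual = weak_regularity(f, eps)
print(len(terms))
```
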

#### 7.3 Getting more for rectangles

Returning to the main proof, we will use the weak regularity lemma to approximate the indicator function for arbitrary $A$ by rectangles. That is, we take $D$ to be the collection of indicator functions for all sets of the form $S\times T$ for $S,T\subseteq G$. The weak regularity lemma shows how to decompose $A$ into a linear combination of rectangles. These rectangles may overlap. However, we ideally want $A$ to be a linear combination of non-overlapping rectangles. In other words, we want a partition of rectangles. It is possible to achieve this at the price of exponentiating the number of rectangles. Note that an exponential loss is necessary even if $S=G$ in every $S\times T$ rectangle; or in other words in the uni-dimensional setting. This is one step where the terminology “rectangle” may be misleading – the set $T$ is not necessarily an interval. If it was, a polynomial rather than exponential blow-up would have sufficed to remove overlaps.

Claim 14. Given a decomposition of $A$ into rectangles from the weak regularity lemma with $s$ functions, there exists a decomposition with $2^{O(s)}$ rectangles which don’t overlap.

Proof. Exercise. $\square$
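As a hint for the exercise, one standard route (an assumption here, since the lecture leaves the proof out) is to group the elements of $G$ by which of the sets $S_{i}$ contain them, do the same for the sets $T_{i}$, and take products of the resulting groups. The sketch below uses hypothetical names `atoms` and `refine` and a toy universe of six elements in place of $G$.

```python
from collections import defaultdict

def atoms(universe, sets):
    """Partition the universe by membership pattern in the given sets."""
    groups = defaultdict(set)
    for x in universe:
        groups[tuple(x in s for s in sets)].add(x)
    return groups  # at most 2^len(sets) nonempty atoms

def refine(universe, rectangles):
    """rectangles: list of (S, T, c). Returns non-overlapping (S', T', c')
    that define the same function sum_i c_i * 1[x in S_i] * 1[y in T_i]."""
    row_atoms = atoms(universe, [S for S, _, _ in rectangles])
    col_atoms = atoms(universe, [T for _, T, _ in rectangles])
    coefficients = [c for _, _, c in rectangles]
    refined = []
    for in_S, S_atom in row_atoms.items():
        for in_T, T_atom in col_atoms.items():
            # An original rectangle covers this cell iff it contains both atoms.
            weight = sum(c for c, a, b in zip(coefficients, in_S, in_T) if a and b)
            refined.append((S_atom, T_atom, weight))
    return refined

universe = range(6)
overlapping = [({0, 1, 2}, {1, 2, 3}, 0.5), ({2, 3}, {0, 1}, -0.25), ({1, 2, 4}, {1, 5}, 1.0)]
partition = refine(universe, overlapping)
```

With $s$ rectangles there are at most $2^{s}$ atoms on each side, hence at most $2^{2s}$ cells in the partition.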

In the above decomposition, note that it is natural to take the coefficients of rectangles to be the density of points in $A$ that are in the rectangle. This gives rise to the following claim.

Claim 15. The weights of the rectangles in the above claim can be the average of $f$ in the rectangle, at the cost of doubling the error.

Consequently, we have that $f=g+h$, where $g$ is the sum of $2^{O(s)}$ non-overlapping rectangles $S\times T$ with coefficients $\mathbb{P} _{(x,y)\in S\times T}[f(x,y)=1]$.

Proof. Let $g$ be a partition decomposition with arbitrary weights. Let $g'$ be a partition decomposition with weights being the average of $f$. It is enough to show that for all rectangle distinguishers $d\in D$

\begin{aligned} |\mathbb {E}[(f-g')d]|\leq |\mathbb {E}[(f-g)d]|. \end{aligned}

By the triangle inequality, we have that

\begin{aligned} |\mathbb {E}[(f-g')d]|\leq |\mathbb {E}[(f-g)d]|+|\mathbb {E}[(g-g')d]|. \end{aligned}

To bound $|\mathbb {E}[(g-g')d]|$, note that the error is maximized for a $d$ that respects the decomposition in non-overlapping rectangles, i.e., $d$ is the union of some non-overlapping rectangles from the decomposition. This can be argued using that, unlike $f$, the value of $g$ and $g'$ on a rectangle $S\times T$ from the decomposition is fixed. But, from the point of “view” of such $d$, $g'=f$! More formally, $\mathbb {E}[(g-g')d]=\mathbb {E}[(g-f)d]$. This gives

\begin{aligned} |\mathbb {E}[(f-g')d]|\leq 2|\mathbb {E}[(f-g)d]| \end{aligned}

and concludes the proof. $\square$

We need to get still a little more from this decomposition. In our application of the weak regularity lemma above, we took the set of distinguishers to be characteristic functions of rectangles. That is, distinguishers that can be written as $U(x)\cdot V(y)$ where $U$ and $V$ map $G\to \{0,1\}$. We will use that the same guarantee holds for $U$ and $V$ with range $[-1,1]$, up to a constant factor loss in the error. Indeed, let $U$ and $V$ have range $[-1,1]$. Write $U=U_{+}-U_{-}$ where $U_{+}$ and $U_{-}$ have range $[0,1]$, and the same for $V$. The error for distinguisher $U\cdot V$ is at most the sum of the errors for distinguishers $U_{+}\cdot V_{+}$, $U_{+}\cdot V_{-}$, $U_{-}\cdot V_{+}$, and $U_{-}\cdot V_{-}$. So we can restrict our attention to distinguishers $U(x)\cdot V(y)$ where $U$ and $V$ have range $[0,1]$. In turn, a function $U(x)$ with range $[0,1]$ can be written as an expectation $\mathbb{E} _{a}U_{a}(x)$ for functions $U_{a}$ with range $\{0,1\}$, and the same for $V$. We conclude by observing that

\begin{aligned} \mathbb{E} _{x,y}[(f-g)(x,y)\mathbb{E} _{a}U_{a}(x)\cdot \mathbb{E} _{b}V_{b}(y)]\le \max _{a,b}\mathbb{E} _{x,y}[(f-g)(x,y)U_{a}(x)\cdot V_{b}(y)]. \end{aligned}

#### 7.4 Proof

Let us now finish the proof by showing a corner exists for sufficiently dense sets $A\subseteq G^{2}$. We’ll use three types of decompositions for $f:G^{2}\rightarrow \{0,1\}$, with respect to the following three types of distinguishers, where $U_{i}$ and $V_{i}$ have range $\{0,1\}$:

1. $U_{1}(x)\cdot V_{1}(y)$,
2. $U_{2}(xy)\cdot V_{2}(y)$,
3. $U_{3}(x)\cdot V_{3}(xy)$.

The first type is just rectangles, which is what we have been discussing until now. The distinguishers in the last two classes can be visualized over $\mathbb {R}^{2}$ as parallelograms with a 45-degree angle. The same extra properties we discussed for rectangles can be verified to hold for them too.

Recall that we want to show

\begin{aligned} \mathbb {E}_{x,y,g}[f(x,y)f(xg,y)f(x,gy)]>\frac {1}{|G|}. \end{aligned}

We’ll decompose the $i$-th occurrence of $f$ via the $i$-th decomposition listed above. We’ll write this decomposition as $f=g_{i}+h_{i}$. We apply this in a certain order to produce sums of products of three functions. The inputs to the functions don’t change, so to avoid clutter we do not write them, and it is understood that in each product of three functions the inputs are, in order $(x,y),(xg,y),(x,gy)$. The decomposition is:

\begin{aligned} & fff\\ = & ffg_{3}+ffh_{3}\\ = & fg_{2}g_{3}+fh_{2}g_{3}+ffh_{3}\\ = & g_{1}g_{2}g_{3}+h_{1}g_{2}g_{3}+fh_{2}g_{3}+ffh_{3}. \end{aligned}

We first show that the expectation of the first term is big. This takes the next two claims. Then we show that the expectations of the other terms are small.

Claim 16. For all $g\in G$, the expectations $\mathbb {E}_{x,y}[g_{1}(x,y)g_{2}(xg,y)g_{3}(x,gy)]$ are the same up to an error of $2^{O(s)}/|G|^{\Omega (1)}$.

Proof. We just need to get error $1/|G|^{\Omega (1)}$ for any product of three functions for the three decomposition types. We have:

\begin{aligned} & \mathbb {E}_{x,y}[c_{1}U_{1}(x)V_{1}(y)\cdot c_{2}U_{2}(xgy)V_{2}(y)\cdot c_{3}U_{3}(x)V_{3}(xgy)]\\ = & c_{1}c_{2}c_{3}\mathbb {E}_{x,y}[(U_{1}\cdot U_{3})(x)(V_{1}\cdot V_{2})(y)(U_{2}\cdot V_{3})(xgy)]\\ = & c_{1}c_{2}c_{3}\cdot \mathbb {E}_{x}[(U_{1}\cdot U_{3})(x)]\cdot \mathbb {E}_{y}[(V_{1}\cdot V_{2})(y)]\cdot \mathbb {E}_{z}[(U_{2}\cdot V_{3})(z)]\pm \frac {1}{|G|^{\Omega (1)}}. \end{aligned}

This is similar to what we discussed in the overview, and is where we use mixing. Specifically, if $\mathbb {E}_{x}[(U_{1}\cdot U_{3})(x)]$ or $\mathbb {E}_{y}[(V_{1}\cdot V_{2})(y)]$ is at most $1/|G|^{c}$ for a small enough constant $c$ then we are done. Otherwise, conditioned on $(U_{1}\cdot U_{3})(x)=1$, the distribution on $x$ is uniform over a set of density at least $1/|G|^{c}$, and the same holds for $y$, and the result follows by Lemma 7. $\square$

Recall that we start with a set of density $\ge 1/\log ^{a}|G|$.

Claim 17. $\mathbb {E}_{x,y}[g_{1}(x,y)g_{2}(x,y)g_{3}(x,y)]>1/\log ^{4a}|G|$.

Proof. We will relate the expectation over $x,y$ to $f$ using the Hölder inequality: For random variables $X_{1},X_{2},\ldots ,X_{k}$,

\begin{aligned} \mathbb {E}[X_{1}\dots X_{k}]\leq \prod _{i=1}^{k}\mathbb {E}[X_{i}^{c_{i}}]^{1/c_{i}}\text { such that }\sum 1/c_{i}=1. \end{aligned}

To apply this inequality in our setting, write

\begin{aligned} f=(f\cdot g_{1}g_{2}g_{3})^{1/4}\cdot \left (\frac {f}{g_{1}}\right )^{1/4}\cdot \left (\frac {f}{g_{2}}\right )^{1/4}\cdot \left (\frac {f}{g_{3}}\right )^{1/4}. \end{aligned}

By the Hölder inequality the expectation of the right-hand side is

\begin{aligned} \leq \mathbb {E}[f\cdot g_{1}g_{2}g_{3}]^{1/4}\mathbb {E}\left [\frac {f}{g_{1}}\right ]^{1/4}\mathbb {E}\left [\frac {f}{g_{2}}\right ]^{1/4}\mathbb {E}\left [\frac {f}{g_{3}}\right ]^{1/4}. \end{aligned}

The last three terms equal $1$ because

\begin{aligned} \mathbb {E}_{x,y}\frac {f(x,y)}{g_{i}(x,y)} & =\mathbb {E}_{x,y}\frac {f(x,y)}{\mathbb {E}_{x',y'\in \textit {Cell}(x,y)}[f(x',y')]}=\mathbb {E}_{x,y}\frac {\mathbb {E}_{x',y'\in \textit {Cell}(x,y)}[f(x',y')]}{\mathbb {E}_{x',y'\in \textit {Cell}(x,y)}[f(x',y')]}=1. \end{aligned}

where $\textit {Cell}(x,y)$ is the set in the partition that contains $(x,y)$. Putting the above together we obtain

\begin{aligned} \mathbb {E}[f]\leq \mathbb {E}[f\cdot g_{1}g_{2}g_{3}]^{1/4}. \end{aligned}

Finally, because the functions are positive, we have that $\mathbb {E}[f\cdot g_{1}g_{2}g_{3}]^{1/4}\leq \mathbb {E}[g_{1}g_{2}g_{3}]^{1/4}$. This concludes the proof. $\square$
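The inequality $\mathbb {E}[f]^{4}\leq \mathbb {E}[g_{1}g_{2}g_{3}]$ can be sanity-checked numerically. The sketch below is only an illustration under assumptions made here: a $6\times 6$ grid stands in for $G^{2}$, the three partitions are arbitrary stand-ins for the three decomposition types, and each $g_{i}$ is the cell average of $f$. The ratio $f/g_{i}$ is read as $0$ wherever $f=0$, which does not affect the argument.

```python
def cell_averages(f, points, cell_of):
    """Replace f by its average over each cell of the partition given by cell_of."""
    totals, counts = {}, {}
    for p in points:
        key = cell_of(*p)
        totals[key] = totals.get(key, 0) + f[p]
        counts[key] = counts.get(key, 0) + 1
    return {p: totals[cell_of(*p)] / counts[cell_of(*p)] for p in points}

n = 6
points = [(x, y) for x in range(n) for y in range(n)]
f = {(x, y): 1 if (x * x + 3 * y) % 4 < 2 else 0 for x, y in points}

# Three arbitrary partitions of the grid, one per occurrence of f.
g1 = cell_averages(f, points, lambda x, y: (x // 2, y // 3))
g2 = cell_averages(f, points, lambda x, y: ((x + y) % n // 3, y // 2))
g3 = cell_averages(f, points, lambda x, y: (x // 3, (x + y) % n // 2))

density = sum(f.values()) / len(points)                                   # E[f]
product_mean = sum(g1[p] * g2[p] * g3[p] for p in points) / len(points)   # E[g1 g2 g3]
```

Comparing `density ** 4` with `product_mean` on such toy inputs is a quick way to catch a misapplied exponent in the Hölder step.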

It remains to show the other terms are small. Let $\epsilon$ be the error in the weak regularity lemma with respect to distinguishers with range $\{0,1\}$. Recall that this implies error $O(\epsilon )$ with respect to distinguishers with range $[-1,1]$. We give the proof for one of the terms and then we say little about the other two.

Claim 18. $|\mathbb {E}[f(x,y)f(xg,y)h_{3}(x,gy)]|\leq O(\epsilon )^{1/4}$.

The proof involves changing names of variables and doing Cauchy-Schwarz to remove the terms with $f$ and bound the expectation above by $\mathbb {E}[h_{3}(x,g)U(x)V(xg)]$, which is small by the regularity lemma.

Proof. Replace $g$ with $gy^{-1}$ in the uniform distribution to get

\begin{aligned} & \mathbb {E}_{x,y,g}^{4}[f(x,y)f(xg,y)h_{3}(x,gy)]\\ & =\mathbb {E}_{x,y,g}^{4}[f(x,y)f(xgy^{-1},y)h_{3}(x,g)]\\ & =\mathbb {E}_{x,y}^{4}[f(x,y)\mathbb {E}_{g}[f(xgy^{-1},y)h_{3}(x,g)]]\\ & \leq \mathbb {E}_{x,y}^{2}[f^{2}(x,y)]\mathbb {E}_{x,y}^{2}\mathbb {E}_{g}^{2}[f(xgy^{-1},y)h_{3}(x,g)]\\ & \leq \mathbb {E}_{x,y}^{2}\mathbb {E}_{g}^{2}[f(xgy^{-1},y)h_{3}(x,g)]\\ & =\mathbb {E}_{x,y,g,g'}^{2}[f(xgy^{-1},y)h_{3}(x,g)f(xg'y^{-1},y)h_{3}(x,g')], \end{aligned}

where the first inequality is by Cauchy-Schwarz.

Now replace $g\rightarrow x^{-1}g,g'\rightarrow x^{-1}g'$ and reason in the same way:

\begin{aligned} & =\mathbb {E}_{x,y,g,g'}^{2}[f(gy^{-1},y)h_{3}(x,x^{-1}g)f(g'y^{-1},y)h_{3}(x,x^{-1}g')]\\ & =\mathbb {E}_{g,g',y}^{2}[f(gy^{-1},y)\cdot f(g'y^{-1},y)\mathbb {E}_{x}[h_{3}(x,x^{-1}g)\cdot h_{3}(x,x^{-1}g')]]\\ & \leq \mathbb {E}_{x,x',g,g'}[h_{3}(x,x^{-1}g)h_{3}(x,x^{-1}g')h_{3}(x',x'^{-1}g)h_{3}(x',x'^{-1}g')]. \end{aligned}

Replace $g\rightarrow xg$ to rewrite the expectation as

\begin{aligned} \mathbb {E}[h_{3}(x,g)h_{3}(x,x^{-1}g')h_{3}(x',x'^{-1}xg)h_{3}(x',x'^{-1}g')]. \end{aligned}

We want to view the last three terms as a distinguisher $U(x)\cdot V(xg)$. First, note that $h_{3}$ has range $[-1,1]$. This is because $h_{3}(x,y)=f(x,y)-\mathbb{E} _{x',y'\in \textit {Cell}(x,y)}f(x',y')$ and $f$ has range $\{0,1\}$, where, as before, $\textit {Cell}(x,y)$ is the set in the partition that contains $(x,y)$. Fix $x',g'$. The last term in the expectation becomes a constant $c\in [-1,1]$. The second term only depends on $x$, and the third only on $xg$. Hence for appropriate functions $U$ and $V$ with range $[-1,1]$ this expectation can be rewritten as

\begin{aligned} \mathbb {E}[h_{3}(x,g)U(x)V(xg)], \end{aligned}

which concludes the proof. $\square$

There are similar proofs to show the remaining terms are small. For $fh_{2}g_{3}$, we can perform simple manipulations and then reduce to the above case. For $h_{1}g_{2}g_{3}$, we have a slightly easier proof than above.

##### 7.4.1 Parameters

Suppose our set has density $\delta \ge 1/\log ^{a}|G|$, and the error in the regularity lemma is $\epsilon$. By the above results we can bound

\begin{aligned} \mathbb {E}_{x,y,g}[f(x,y)f(xg,y)f(x,gy)]\ge 1/\log ^{4a}|G|-2^{O(1/\epsilon ^{2})}/|G|^{\Omega (1)}-\epsilon ^{\Omega (1)}, \end{aligned}

where the terms in the right-hand side come, left-to-right, from Claims 17, 16, and 18. Picking $\epsilon =1/\log ^{1/3}|G|$ the proof is completed for sufficiently small $a$.
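To see why this choice of $\epsilon$ works, here is the arithmetic spelled out, with the hidden constants left implicit as in the claims above:

\begin{aligned} \epsilon =\frac {1}{\log ^{1/3}|G|}\quad \Longrightarrow \quad & 2^{O(1/\epsilon ^{2})}=2^{O(\log ^{2/3}|G|)}=|G|^{o(1)},\\ & \frac {2^{O(1/\epsilon ^{2})}}{|G|^{\Omega (1)}}=\frac {1}{|G|^{\Omega (1)}},\qquad \epsilon ^{\Omega (1)}=\frac {1}{\log ^{\Omega (1)}|G|}. \end{aligned}

So the right-hand side is at least $1/\log ^{4a}|G|-1/\log ^{\Omega (1)}|G|-1/|G|^{\Omega (1)}$, which exceeds $1/|G|$ once $a$ is a small enough constant and $|G|$ is large enough.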

### References

[AL00]    Andris Ambainis and Satyanarayana V. Lokam. Improved upper bounds on the simultaneous messages complexity of the generalized addressing function. In Latin American Symposium on Theoretical Informatics (LATIN), pages 207–216, 2000.

[Amb96]    Andris Ambainis. Upper bounds on multiparty communication complexity of shifts. In Symp. on Theoretical Aspects of Computer Science (STACS), pages 631–642, 1996.

[AMS99]    Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. J. of Computer and System Sciences, 58(1, part 2):137–147, 1999.

[Aus16]    Tim Austin. Ajtai-Szemerédi theorems over quasirandom groups. In Recent trends in combinatorics, volume 159 of IMA Vol. Math. Appl., pages 453–484. Springer, [Cham], 2016.

[Bar89]    David A. Mix Barrington. Bounded-width polynomial-size branching programs recognize exactly those languages in NC$^1$. J. of Computer and System Sciences, 38(1):150–164, 1989.

[BC92]    Michael Ben-Or and Richard Cleve. Computing algebraic formulas using a constant number of registers. SIAM J. on Computing, 21(1):54–58, 1992.

[BDPW10]   Paul Beame, Matei David, Toniann Pitassi, and Philipp Woelfel. Separating deterministic from randomized multiparty communication complexity. Theory of Computing, 6(1):201–225, 2010.

[BGKL03]    László Babai, Anna Gál, Peter G. Kimmel, and Satyanarayana V. Lokam. Communication complexity of simultaneous messages. SIAM J. on Computing, 33(1):137–166, 2003.

[BNP08]    László Babai, Nikolay Nikolov, and László Pyber. Product growth and mixing in finite groups. In ACM-SIAM Symp. on Discrete Algorithms (SODA), pages 248–257, 2008.

[CFL83]    Ashok K. Chandra, Merrick L. Furst, and Richard J. Lipton. Multi-party protocols. In 15th ACM Symp. on the Theory of Computing (STOC), pages 94–99, 1983.

[CP10]    Arkadev Chattopadhyay and Toniann Pitassi. The story of set disjointness. SIGACT News, 41(3):59–85, 2010.

[DHKP97]    Martin Dietzfelbinger, Torben Hagerup, Jyrki Katajainen, and Martti Penttonen. A reliable randomized algorithm for the closest-pair problem. J. Algorithms, 25(1):19–51, 1997.

[FK96]    Alan M. Frieze and Ravi Kannan. The regularity lemma and approximation schemes for dense problems. In IEEE Symp. on Foundations of Computer Science (FOCS), pages 12–20, 1996.

[Gow08]    W. T. Gowers. Quasirandom groups. Combinatorics, Probability & Computing, 17(3):363–387, 2008.

[Gre05a]    Ben Green. An argument of Shkredov in the finite field setting, 2005. Available at people.maths.ox.ac.uk/greenbj/papers/corners.pdf.

[Gre05b]    Ben Green. Finite field models in additive combinatorics. Surveys in Combinatorics, London Math. Soc. Lecture Notes 327, 1-27, 2005.

[GVa]    W. T. Gowers and Emanuele Viola. Interleaved group products. SIAM J. on Computing.

[GVb]    W. T. Gowers and Emanuele Viola. The multiparty communication complexity of interleaved group products. SIAM J. on Computing.

[GV15]    W. T. Gowers and Emanuele Viola. The communication complexity of interleaved group products. In ACM Symp. on the Theory of Computing (STOC), 2015.

[IL95]    Neil Immerman and Susan Landau. The complexity of iterated multiplication. Inf. Comput., 116(1):103–116, 1995.

[KMR66]    Kenneth Krohn, W. D. Maurer, and John Rhodes. Realizing complex Boolean functions with simple groups. Information and Control, 9:190–195, 1966.

[KN97]    Eyal Kushilevitz and Noam Nisan. Communication complexity. Cambridge University Press, 1997.

[KS92]    Bala Kalyanasundaram and Georg Schnitger. The probabilistic communication complexity of set intersection. SIAM J. Discrete Math., 5(4):545–557, 1992.

[LM07]    Michael T. Lacey and William McClain. On an argument of Shkredov on two-dimensional corners. Online J. Anal. Comb., (2):Art. 2, 21, 2007.

[LW54]    Serge Lang and André Weil. Number of points of varieties in finite fields. American Journal of Mathematics, 76:819–827, 1954.

[Mil14]    Eric Miles. Iterated group products and leakage resilience against $NC^1$. In ACM Innovations in Theoretical Computer Science conf. (ITCS), 2014.

[MV13]    Eric Miles and Emanuele Viola. Shielding circuits with groups. In ACM Symp. on the Theory of Computing (STOC), 2013.

[PRS97]    Pavel Pudlák, Vojtěch Rödl, and Jiří Sgall. Boolean circuits, tensor ranks, and communication complexity. SIAM J. on Computing, 26(3):605–633, 1997.

[Raz92]    Alexander A. Razborov. On the distributional complexity of disjointness. Theor. Comput. Sci., 106(2):385–390, 1992.

[Raz00]    Ran Raz. The BNS-Chung criterion for multi-party communication complexity. Computational Complexity, 9(2):113–122, 2000.

[RY19]    Anup Rao and Amir Yehudayoff. Communication complexity. 2019. https://homes.cs.washington.edu/~anuprao/pubs/book.pdf.

[Sha16]    Aner Shalev. Mixing, communication complexity and conjectures of Gowers and Viola. Combinatorics, Probability and Computing, pages 1–13, 6 2016. arXiv:1601.00795.

[She14]    Alexander A. Sherstov. Communication complexity theory: Thirty-five years of set disjointness. In Symp. on Math. Foundations of Computer Science (MFCS), pages 24–43, 2014.

[Tao17]    Terence Tao. Szemerédi’s proof of Szemerédi’s theorem, 2017. https://terrytao.files.wordpress.com/2017/09/szemeredi-proof1.pdf.

[Vioa]    Emanuele Viola. Thoughts: Mixing in groups. https://emanueleviola.wordpress.com/2016/10/21/mixing-in-groups/.

[Viob]    Emanuele Viola. Thoughts: Mixing in groups ii. https://emanueleviola.wordpress.com/2016/11/15/mixing-in-groups-ii/.

[Vio14]    Emanuele Viola. The communication complexity of addition. Combinatorica, pages 1–45, 2014.

[Vio17]    Emanuele Viola. Special topics in complexity theory. Lecture notes of the class taught at Northeastern University. Available at http://www.ccs.neu.edu/home/viola/classes/spepf17.html, 2017.

[Yao79]    Andrew Chi-Chih Yao. Some complexity questions related to distributive computing. In 11th ACM Symp. on the Theory of Computing (STOC), pages 209–213, 1979.

# Special Topics in Complexity Theory: class is over :-(

I put together in a single file all the lectures given by me. On the class webpage you can also find the scribes of the two guest lectures, and the students’ presentations. Many thanks to Matthew Dippel, Xuangui Huang, Chin Ho Lee, Biswaroop Maiti, Tanay Mehta, Willy Quach, and Giorgos Zirdelis for doing an excellent job scribing these lectures. (And for giving me perfect teaching evaluations. Though I am not sure if I biased the sample. It went like this. One day I said: “Please fill the student evaluations, we need 100%.” A student said: “100% what?  Participation or score?” I meant participation but couldn’t resist replying jokingly “both.”) Finally, thanks also to all the other students, postdocs, and faculty who attended the class and created a great atmosphere.

# Special Topics in Complexity Theory, Lecture 19

Special Topics in Complexity Theory, Fall 2017. Instructor: Emanuele Viola

### 1 Lecture 19, Guest lecture by Huacheng Yu, Scribe: Matthew Dippel

Guest lecture by Huacheng Yu on dynamic data structure lower bounds, for the 2D range query and 2D range parity problems. Thanks to Huacheng for giving this lecture and for feedback on the write-up.

What is covered.

• Overview of Larsen’s lower bound for 2D range counting.
• Extending these techniques to obtain an $\Omega (\log ^{1.5}n / \log \log ^3 n)$ lower bound for 2D range parity.

### 2 Problem definitions

Definition 1. 2D range counting

Give a data structure $D$ that maintains a weighted set of 2 dimensional points with integer coordinates, that supports the following operations:

1. UPDATE: Add a (point, weight) tuple to the set.
2. QUERY: Given a query point $(x, y)$, return the sum of weights of points $(x', y')$ in the set satisfying $x' \leq x$ and $y' \leq y$.

Definition 2. 2D range parity

Give a data structure $D$ that maintains an unweighted set of 2 dimensional points with integer coordinates, that supports the following operations:

1. UPDATE: Add a point to the set.
2. QUERY: Given a query point $(x, y)$, return the parity of the number of points $(x', y')$ in the set satisfying $x' \leq x$ and $y' \leq y$.

Both of these definitions extend easily to the $d$-dimensional case, but we state the 2D versions as we will mainly work with those.
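To make the two definitions concrete, here is a naive reference implementation. It is only a sketch of the interface: the class name `NaiveRangeCounter` and its method names are made up here, and each query scans all points, so it says nothing about efficient solutions.

```python
class NaiveRangeCounter:
    """Stores the points explicitly; every query scans all of them."""

    def __init__(self):
        self.points = []

    def update(self, x, y, weight=1):
        # UPDATE: add a (point, weight) tuple; weight 1 gives the unweighted case.
        self.points.append((x, y, weight))

    def query(self, x, y):
        # QUERY for range counting: total weight of points dominated by (x, y).
        return sum(w for px, py, w in self.points if px <= x and py <= y)

    def query_parity(self, x, y):
        # QUERY for range parity: meant for the unweighted case (all weights 1).
        return self.query(x, y) % 2

counter = NaiveRangeCounter()
for point in [(1, 1, 5), (2, 3, 2), (4, 0, 1)]:
    counter.update(*point)

parity = NaiveRangeCounter()
for x, y in [(1, 1), (2, 3), (4, 0)]:
    parity.update(x, y)
```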

#### 2.1 Known bounds

All upper bounds assume the RAM model with word size $\Theta (\log n)$.

Upper bounds. Using range trees, we can create a data structure for range counting in $d$ dimensions with all update and query operations taking time $O(\log ^d n)$; here $d = 2$. With extra tricks, we can make this work for range parity with operations running in time $O((\log n / \log \log n)^d)$.
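For intuition about the $O(\log ^2 n)$ bound in two dimensions, here is a sketch using a two-dimensional Fenwick (binary indexed) tree. This is a swapped-in technique and not the range tree mentioned above: it assumes all coordinates lie in a known grid $[0,n)\times [0,n)$ and uses $n^2$ space, but each operation touches $O(\log ^2 n)$ cells. The class name `Fenwick2D` is chosen here for illustration.

```python
class Fenwick2D:
    """Dominance sums over the grid [0, n) x [0, n) with O(log^2 n) cell accesses."""

    def __init__(self, n):
        self.n = n
        self.tree = [[0] * (n + 1) for _ in range(n + 1)]   # 1-indexed internally

    def update(self, x, y, weight=1):
        i = x + 1
        while i <= self.n:
            j = y + 1
            while j <= self.n:
                self.tree[i][j] += weight
                j += j & -j          # move to the next node responsible for column j
            i += i & -i

    def query(self, x, y):
        total = 0
        i = x + 1
        while i > 0:
            j = y + 1
            while j > 0:
                total += self.tree[i][j]
                j -= j & -j          # strip the lowest set bit
            i -= i & -i
        return total                 # total weight of points (x', y') with x' <= x, y' <= y

bit = Fenwick2D(8)
updates = [(1, 1, 5), (2, 3, 2), (4, 0, 1), (7, 7, 3), (2, 3, 4)]
for x, y, w in updates:
    bit.update(x, y, w)
```

Taking the result of `query` modulo 2 with unit weights gives range parity with the same cost.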

Lower bounds. There is a series of works on lower bounds:

• Fredman, Saks ’89 – 1D range parity requires $\Omega (\log n / \log \log n)$.
• Patrascu, Demaine ’04 – 1D range counting requires $\Omega (\log n)$.
• Larsen ’12 – 2D range counting requires $\Omega ((\log n / \log \log n)^2)$.
• Larsen, Weinstein, Yu ’17 – 2D range parity requires $\Omega (\log ^{1.5} n / \log \log ^3 n)$.

This lecture presents the recent results of [Larsen ’12] and [Larsen, Weinstein, Yu ’17]. They both use the same general approach:

1. Show that, for an efficient approach to exist, the problem must demonstrate some property.
2. Show that the problem doesn’t have that property.

### 3 Larsen’s technique

All lower bounds are in the cell probe model with word size $\Theta (\log n)$.

We consider a general data structure problem, where we require a structure $D$ that supports updates and queries of an unspecified nature. We further assume that there exists an efficient solution with update and query times $o((\log n / \log \log n)^2)$. We will restrict our attention to operation sequences of the form $u_1, u_2, \cdots , u_n, q$. That is, a sequence of $n$ updates followed by a single query $q$. We fix a distribution over such sequences, and show that the problem is still hard.

#### 3.1 Chronogram method [FS89]

We divide the updates into $r$ epochs, so that our sequence becomes:

\begin{aligned}U_r, U_{r-1}, \cdots , U_1, q\end{aligned}

where $|U_i| = \beta ^i$ and $\beta = \log ^5 n$. The epochs are multiplicatively shrinking. With this requirement, we have that $r = \Theta (\log n / \log \log n)$.
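As a quick illustration of how few epochs there are, the sketch below lists the epoch sizes for concrete numbers. The function name `epoch_sizes` is hypothetical, and for simplicity it ignores any leftover updates that do not fill a whole epoch.

```python
def epoch_sizes(n, log_n):
    """Sizes |U_1|, |U_2|, ... with |U_i| = beta^i and beta = log^5 n,
    for as many epochs as fit within n updates in total."""
    beta = log_n ** 5
    sizes, total, size = [], 0, beta
    while total + size <= n:
        sizes.append(size)
        total += size
        size *= beta
    return sizes          # sizes[i - 1] is |U_i|; len(sizes) plays the role of r

sizes = epoch_sizes(2 ** 64, 64)
```

For $n = 2^{64}$ we get $\beta = 2^{30}$ and only two full epochs, in line with $r = \Theta (\log n / \log \log n)$.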

Let $M$ be the set of all memory cells used by the data structure when run on the sequence of updates. Further, let $A_i$ be the set of memory cells which are accessed by the structure at least once in $U_i$, and never again in a further epoch.

Claim 1. The sets $A_r, A_{r-1}, \cdots , A_1$ are disjoint.
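The definition of the sets $A_i$ amounts to labeling every cell by the last epoch that accessed it, which also makes Claim 1 immediate. A minimal sketch, assuming the accesses are given as a chronological list of (epoch, cell) pairs; the name `last_access_sets` and the toy trace are made up here.

```python
def last_access_sets(trace):
    """trace: chronological list of (epoch, cell) accesses. Returns {epoch: A_epoch}."""
    last_epoch = {}
    for epoch, cell in trace:
        last_epoch[cell] = epoch               # later accesses overwrite earlier ones
    sets = {}
    for cell, epoch in last_epoch.items():
        sets.setdefault(epoch, set()).add(cell)
    return sets

# Epochs run in the order 3, 2, 1, as in U_r, ..., U_1.
trace = [(3, "a"), (3, "b"), (3, "c"), (2, "b"), (2, "d"), (1, "a"), (1, "e")]
A = last_access_sets(trace)
```

Each cell receives exactly one label, so the sets cannot overlap.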

Claim 2. There exists an epoch $i$ such that $D$ probes $o(\log n / \log \log n)$ cells from $A_i$ when answering the query at the end. Note that this is simply our query time divided by the number of epochs. In other words, $D$ can’t afford to read $\Omega (\log n / \log \log n)$ cells from each $A_i$ set without breaking its promise on the query run time.

Claim 2 implies that there is an epoch $i$ which has the smallest effect on the final answer. We will call this the “easy” epoch.

Idea. The set $A_i$ contains “most” of the information about $U_i$ among all memory cells in $M$. Also, $A_r, A_{r-1}, \cdots , A_{i+1}$ are not updated past epoch $i + 1$, and hence should contain no information about the updates in $U_i$. The epochs $U_{i-1}, U_{i-2}, \cdots , U_1$ are progressively shrinking, and so the total number of cells in $A_{i-1}, A_{i-2}, \cdots , A_1$ should be small:

\begin{aligned}\sum _{j < i}|A_j| \leq O(\beta ^{i - 1}) \cdot \log ^2 n\end{aligned}

#### 3.2 Communication game

Having set up the framework for how to analyze the data structure, we now introduce a communication game where two parties attempt to solve an identical problem. We will show that an efficient data structure implies an efficient solution to this communication game. If the message is smaller than the entropy of the updates of epoch $i$ (conditioned on preceding epochs), this gives an information theoretic contradiction. The trick is to find a way for the encoder to exploit the small number of probed cells to send a short message.

The game. The game consists of two players, Alice and Bob, who must jointly compute a single query after a series of updates. The model is as follows:

• Alice has all of the update epochs $U_r, U_{r-1}, ... U_1$. She also has an index $i$, which still corresponds to the “easy” epoch as defined above.
• Bob has all update epochs EXCEPT for $U_i$. He also has a random query $q$. He is aware of the index $i$.
• Communication can only occur in a single direction, from Alice to Bob.
• We assume some fixed input distribution $\mathcal {D}$.
• They win this game if Bob successfully computes the correct answer for the query $q$.

Then we will show the following generic theorem, relating this communication game to data structures for the corresponding problem:

Theorem 3. If there is a data structure that has update time $t_u$ and probes $t$ cells from $A_i$ in expectation when answering the final query $q$, then the communication game has an efficient solution, with $O(p|U_i|t_u\log n + \beta ^{i-1}t_u\log n )$ communication cost, and success probability at least $p^t$. This holds for any choice of $0 < p < 1$.

Before we prove the theorem, we consider specific parameters for our problem. If we pick

\begin{aligned} p &= 1 / \log ^5n, \\ t_u &= \log ^2 n, \\ t &= o(\log n / \log \log n), \end{aligned}

then, after plugging in the parameters, the communication cost is $O(|U_i| / \log ^2 n)$. Note that we could always trivially achieve $|U_i|$ by having Alice send Bob all of $U_i$, so that he can compute the solution of the problem with no uncertainty. The success probability is $(\log ^{-5} n)^{o(\log n / \log \log n)}$, which simplifies to $2^{-o(\log n)} = 1 / n^{o(1)}$. This is significantly better than $1 / n^{O(1)}$, which could be achieved trivially by having Bob output a random answer to the query, independent of the updates.
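The plug-in step can be checked with exact arithmetic. The sketch below evaluates both terms of the communication bound for a concrete value of $\log n$ and reports them as multiples of $|U_i| / \log ^2 n$; the function names are made up here and the constants hidden in the $O(\cdot )$ are ignored.

```python
from fractions import Fraction
from math import log2

def communication_terms(log_n, i):
    """Both terms of the bound in Theorem 3, as multiples of |U_i| / log^2 n."""
    beta = log_n ** 5
    p = Fraction(1, log_n ** 5)
    t_u = log_n ** 2
    U_i = beta ** i
    first = p * U_i * t_u * log_n               # p |U_i| t_u log n
    second = beta ** (i - 1) * t_u * log_n      # beta^(i-1) t_u log n
    target = Fraction(U_i, log_n ** 2)          # |U_i| / log^2 n
    return first / target, second / target

def log2_success_probability(log_n, t):
    """log_2 of p^t for p = 1 / log^5 n."""
    return -5 * t * log2(log_n)

ratios = communication_terms(64, 3)
```

Both ratios come out to exactly $1$, so each term equals $|U_i| / \log ^2 n$, and $\log _2 (p^t) = -5t\log _2\log n$ is $-o(\log n)$ when $t = o(\log n / \log \log n)$.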

Proof.

We assume we have a data structure $D$ for the update / query problem. Then Alice and Bob will proceed as follows:

Alice’s steps.

1. Simulate $D$ on $U_r, U_{r - 1}, ... U_1$. While doing so, keep track of memory cell accesses and compute $A_r, A_{r-1}, ... A_1$.
2. Sample a random subset $C \subset A_i$, such that $|C| = p|A_i|$.
3. Send $C \cup A_{i-1} \cup A_{i-2} \cup ... A_1$.

We note that in Alice’s Step 3, to send a cell, she sends a tuple holding the cell ID and the cell state before the query was executed. Also note that she does not tell Bob which cells belong to which sets of the union.

Bob’s steps.

1. Receive $C'$ from Alice.
2. Simulate $D$ on epochs $U_{r}, U_{r-1}, ... U_{i+1}$. Snapshot the current memory state of the data structure as $M$.
3. Simulate the query algorithm. Every time $q$ attempts to probe cell $c$, Bob checks if $c \in C'$. If it is, he lets $D$ probe from $C'$. Otherwise, he lets $D$ probe from $M$.
4. Bob returns the result from the query algorithm as his answer.

If the query algorithm does not probe any cell in $A_i - C$, then Bob succeeds, as he can exactly simulate the data structure query. Since the query probes $t$ cells in $A_i$, and Bob has a random subset of $A_i$ of size $p|A_i|$, the probability that every probed cell of $A_i$ lies in his subset is at least $p^t$. The communication cost is the cost of Alice sending the cells to Bob, at $O(\log n)$ bits per cell, which is

\begin{aligned} \left (p|A_i| + \sum _{j < i}|A_j|\right )\cdot O(\log n) \leq O\left (p|U_i|t_u + \beta ^{i-1}t_u\right )\log n.\end{aligned}

$\square$
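The protocol can be traced on a toy example. The sketch below is an illustration under strong simplifying assumptions made here: the problem is one-dimensional prefix parity, the “data structure” keeps one counter cell per coordinate, a query for $q$ probes cells $0,\ldots ,q$, and the random sample $C$ is replaced by a caller-supplied predicate `keep` so that the run is deterministic. All function names are hypothetical.

```python
def simulate(epochs, stop_after=None):
    """Run the toy structure: cell x counts the points with coordinate x.

    epochs maps an epoch index to its list of updates; epochs are processed in
    decreasing index order (U_r first, U_1 last). Returns (memory, last), where
    last[c] is the last epoch that touched cell c.
    """
    memory, last = {}, {}
    for j in sorted(epochs, reverse=True):
        if stop_after is not None and j < stop_after:
            break
        for x in epochs[j]:
            memory[x] = memory.get(x, 0) + 1
            last[x] = j
    return memory, last

def alice_message(epochs, i, keep):
    memory, last = simulate(epochs)
    A_i = {c for c, j in last.items() if j == i}
    later = {c for c, j in last.items() if j < i}     # A_{i-1}, ..., A_1
    C = {c for c in A_i if keep(c)}                    # stand-in for the random sample
    return {c: memory[c] for c in C | later}           # cell ID -> final content

def bob_answer(epochs, i, message, q):
    # Bob only replays U_r, ..., U_{i+1}; he never looks at epochs[i].
    snapshot, _ = simulate(epochs, stop_after=i + 1)
    def probe(c):
        return message[c] if c in message else snapshot.get(c, 0)
    return sum(probe(c) for c in range(q + 1)) % 2     # prefix parity query

def true_answer(epochs, q):
    memory, _ = simulate(epochs)
    return sum(memory.get(c, 0) for c in range(q + 1)) % 2

epochs = {3: [0, 1, 2, 5], 2: [1, 3], 1: [4]}
full = alice_message(epochs, 2, keep=lambda c: True)    # C = A_i
empty = alice_message(epochs, 2, keep=lambda c: False)  # C is empty
```

With `full`, Bob answers every query correctly. With `empty`, he is still correct on queries that avoid $A_i$ (here $q=0$) and can be wrong on queries that probe a cell of $A_i - C$ (here $q=1$).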

### 4 Extension to 2D Range Parity

The extension to 2D range parity proceeds in nearly identical fashion, with a similar theorem relating data structures to communication games.

Theorem 1. Consider an arbitrary data structure problem where queries have 1-bit outputs. If there exists a data structure having:

• update time $t_u$
• query time $t_q$
• Probes $t$ cells from $A_i$ when answering the last query $q$

Then there exists a protocol for the communication game with $O(p|U_i|t_u\log n + t_u\beta ^{i-1}\log n )$ bits of communication and success probability at least $1/2 + 2^{-O(\sqrt {t_q t \log ^3 (1 / p)})}$, for any choice of $0 < p < 1$. Again, we plug in the parameters from 2D range parity. If we set

\begin{aligned} t_u = t_q &= o(\log ^{1.5}n / (\log \log n)^2), \\ t = t_q / r &= o(\log ^{1/2} n / \log \log n), \\ p &= 1 / \log ^5 n, \end{aligned}

then the cost is $|U_i| / \log ^2 n$, and the probability simplifies to $1/2 + 1 / n^{o(1)}$.

We note that, if we had $Q = n^{O(1)}$ different queries, then randomly guessing on all of them, with constant probability we could be correct on as many as $Q/2 \pm O(\sqrt {Q})$. In this case, the probability of being correct on a single one, amortized, is $1/2 + 1/n^{\Theta (1)}$.

Proof. The communication protocol will be slightly adjusted. We assume an a priori distribution on the updates and queries. Bob will then compute the posterior distribution, based on what he knows and what Alice sends him. He then computes the maximum likelihood answer to the query $q$. We thus need to figure out what Alice can send, so that the answer to $q$ is often biased towards either $1$ or $0$.

We assume the existence of some public randomness available to both Alice and Bob. Then we adjust the communication protocol as follows:

Alice’s modified steps.

• Alice samples, using the public randomness, a subset of ALL memory cells $M_2$, such that each cell is sampled with probability $p$. Alice sends $M_2 \cap A_i$ to Bob. Since Bob can mimic the sampling, he gains additional information about which cells are and aren’t in $A_i$.

Bob’s modified steps.

• Denote by $S$ the set of memory cells probed by the data structure when Bob simulates the query algorithm. That is, $S$ is what Bob “thinks” $D$ will probe during the query, as the actual set of cells may be different if Bob had full knowledge of the updates, and the data structure may use that information to determine what to probe. Bob will use $S$ to compute the posterior distribution.

Define the function $f(z) : [2^w] \rightarrow \mathbb {R}$ to be the “bias” when $S$ takes on the value $z$. In particular, this function is conditioned on $C'$ that Bob receives from Alice. We can then clarify the definition of $f$ as

\begin{aligned} f_{C'}(z) &:= (\text {Pr}[\text {ans to q } = 1 | C', S \leftarrow z] - 1/2) \cdot \text {Pr}[S \leftarrow z | C'] \end{aligned}

In particular, $f$ has the following two properties:

1. $\sum _z |f(z)| \leq 1$
2. $\mathbb {E}_{C'}[\max _z |f(z)|] \geq 1/2 \cdot p^t$

In these statements, the expectation is over everything that Bob knows, and the probabilities are also conditioned on everything that Bob knows. The randomness comes from what he doesn’t know. We also note that when the query probes no cells in $A_i - C'$, then the bias is always $1/2$, since the posterior distribution will put all its weight on the correct answer of the query.

Finishing the proof requires the following lemma:

Lemma 2. For any $f$ with the above two properties, there exists a $Y \subseteq S$ such that $|Y| \leq O(\sqrt {|S| \log 1/p^t})$ and

\begin{aligned} \sum _{y \in Y} \left |\sum _{z | y} f(z) \right | &\geq 2^{-O(\sqrt {|S| \log 1 / p^t})}. \end{aligned}

Note that the sum inside the absolute values is the bias when $Y \leftarrow y$. $\square$

### References

[FS89]   Michael L. Fredman and Michael E. Saks. The cell probe complexity of dynamic data structures. In ACM Symp. on the Theory of Computing (STOC), pages 345–354, 1989.

# Special Topics in Complexity Theory, Lecture 18

Special Topics in Complexity Theory, Fall 2017. Instructor: Emanuele Viola

### 1 Lecture 18, Scribe: Giorgos Zirdelis

In this lecture we study lower bounds on data structures. First, we define the setting. We have $n$ bits of data, stored in $s$ bits of memory (the data structure) and want to answer $m$ queries about the data. Each query is answered with $d$ probes. There are two types of probes:

• bit-probes, each of which returns one bit from the memory, and
• cell-probes, in which the memory is divided into cells of $\log n$ bits, and each probe returns one cell.

The queries can be adaptive or non-adaptive. In the adaptive case, the data structure probes locations which may depend on the answer to previous probes. For bit-probes it means that we answer a query with depth-$d$ decision trees.

Finally, there are two types of data structure problems:

• The static case, in which we map the data to the memory arbitrarily and afterwards the memory remains unchanged.
• The dynamic case, in which we have update queries that change the memory and also run in bounded time.

In this lecture we focus on the non-adaptive, bit-probe, and static setting. Some trivial extremes for this setting are the following. Any problem (i.e., collection of queries) admits data structures with the following parameters:

• $s=m$ and $d=1$, i.e. you write down all the answers, and
• $s=n$ and $d=n$, i.e. you can always answer a query about the data if you read the entire data.
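The two extremes can be written down for a toy problem. In this sketch the data is $n = 4$ bits and the $m = 5$ queries are parities of subsets of the data; both the problem and the function names are chosen here purely for illustration.

```python
def parity_query(positions):
    """A query returning the parity of the data bits at the given positions."""
    return lambda data: sum(data[j] for j in positions) % 2

data = [1, 0, 1, 1]                                                   # n = 4 bits
queries = [parity_query(p) for p in ([0], [0, 1], [1, 2, 3], [0, 1, 2, 3], [2])]  # m = 5

# Extreme 1: s = m, d = 1. Store every answer; a query reads a single bit.
table = [q(data) for q in queries]

def answer_with_table(i):
    return table[i]

# Extreme 2: s = n, d = n. Store the data itself; a query may read all of it.
memory = list(data)

def answer_with_data(i):
    return queries[i](memory)
```

The interesting regime is in between: memory only slightly larger than $n$ and far fewer than $n$ probes per query.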

Next, we review the best current lower bound, a bound proved in the 80’s by Siegel [Sie04] and rediscovered later. We state and prove the lower bound in a different way. The lower bound is for the problem of $k$-wise independence.

Problem 1. The data is a seed of size $n=k \log m$ for a $k$-wise independent distribution over $\{0,1\}^m$. A query $i$ is defined to be the $i$-th bit of the sample.
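For concreteness, here is one standard way to realize such a distribution (an assumption here, as the lecture does not fix a construction): the seed is the coefficient vector of a polynomial of degree less than $k$ over a field with $m$ elements, and bit $i$ of the sample is the lowest bit of the polynomial evaluated at the $i$-th field element. The sketch takes $m = 8$, $k = 2$, and represents the field with 8 elements via the irreducible polynomial $x^3 + x + 1$.

```python
from itertools import product

def gf8_mul(a, b):
    """Multiply in GF(8) = GF(2)[x] / (x^3 + x + 1); elements are ints 0..7."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0b1000:           # degree reached 3: reduce modulo x^3 + x + 1
            a ^= 0b1011
    return result

def sample_bit(seed, i):
    """Query i: lowest bit of the polynomial with coefficients `seed` evaluated at i."""
    value = 0
    for coefficient in reversed(seed):   # Horner's rule
        value = gf8_mul(value, i) ^ coefficient
    return value & 1

k, m = 2, 8
seeds = list(product(range(8), repeat=k))   # each seed has k * log2(m) = 6 bits
```

Answering a query this way reads the whole seed, which is the $s=n$, $d=n$ extreme. The theorem below concerns how much $d$ can drop when $s$ is allowed to exceed $n$.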

The question is: if we allow a little more space than seed length, can we compute such distributions fast?

Theorem 2. For the above problem with $k=m^{1/3}$ it holds that

\begin{aligned} d \geq \Omega \left ( \frac {\lg m}{\lg (s/n)} \right ). \end{aligned}

It follows that if $s=O(n)$ then $d$ is $\Omega (\lg m)$. But if $s=n^{1+\Omega (1)}$ then nothing is known.

Proof. Let $p=1/m^{1/4d}$. We have the memory of $s$ bits and we are going to subsample it. Specifically, we will select each of the $s$ memory bits with probability $p$, independently.

The intuition is that we will shrink the memory but still answer a lot of queries, and derive a contradiction because of the seed length required to sample $k$-wise independence.

For the “shrinking” part we have the following. We expect to keep $p\cdot s$ memory bits. By a Chernoff bound, it follows that we keep $O(p\cdot s)$ bits except with probability $2^{-\Omega (p \cdot s)}$.

For the “answer a lot of queries” part, recall that each query probes $d$ bits from the memory. We keep one of the $m$ queries if it so happens that we keep all the $d$ bits that it probed in the memory. For a fixed query, the probability that we keep all its $d$ probes is $p^d = 1/m^{1/4}$.

We claim that with probability at least $1/m^{O(1)}$, we keep $\sqrt {m}$ queries. This follows by Markov’s inequality. We expect to not keep $m - m^{3/4}$ queries on average. We now apply Markov’s inequality to get that the probability that we don’t keep at least $m - \sqrt {m}$ queries is at most $(m - m^{3/4})/(m-\sqrt {m})$.
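As a quick sanity check of the Markov step, the following sketch evaluates the resulting lower bound $1 - (m - m^{3/4})/(m-\sqrt {m})$ on the probability of keeping at least $\sqrt {m}$ queries. The function name is ours and not part of the lecture; the point is only that the bound behaves like $m^{-1/4}$, which is $1/m^{O(1)}$.

```python
def keep_probability_lower_bound(m: int) -> float:
    """Markov lower bound on Pr[we keep at least sqrt(m) queries].

    Each query is kept with probability m**(-1/4), so the expected number
    of queries NOT kept is m - m**(3/4).  Failing to keep sqrt(m) queries
    means that at least m - sqrt(m) queries are not kept.
    """
    expected_not_kept = m - m ** 0.75
    threshold = m - m ** 0.5
    return 1.0 - expected_not_kept / threshold


if __name__ == "__main__":
    for exponent in (16, 32, 64):
        m = 2 ** exponent
        print(exponent, keep_probability_lower_bound(m), m ** -0.25)
```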

Thus, if $2^{-\Omega (p\cdot s)} \leq 1/m^{O(1)}$, then there exists a fixed choice of memory bits that we keep, to achieve both the “shrinking” part and the “answer a lot of queries” part as above. This inequality is true because $s \geq n > m^{1/3}$ and so $p \cdot s \ge m^{-1/4 + 1/3} = m^{\Omega (1)}$. But now we have $O(p \cdot s)$ bits of memory while still answering as many as $\sqrt {m}$ queries.

The minimum seed length to answer that many queries while maintaining $k$-wise independence is $k \log \sqrt {m} = \Omega (k \lg m) = \Omega (n)$. Therefore the memory has to be at least as big as the seed. This yields

\begin{aligned} O(ps) \ge \Omega (n) \end{aligned}

from which the result follows. $\square$
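To spell out the last step (a sketch, under the extra assumption $s \ge 2n$ so that $\lg (s/n) + O(1) = O(\lg (s/n))$): plugging $p = m^{-1/(4d)}$ into the inequality and taking logarithms gives

\begin{aligned} m^{-1/(4d)} \cdot s \ge \Omega (n) & \implies m^{1/(4d)} \le O(s/n) \\ & \implies \frac {\lg m}{4d} \le \lg (s/n) + O(1) \\ & \implies d \ge \Omega \left ( \frac {\lg m}{\lg (s/n)} \right ). \end{aligned}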

This lower bound holds even if the $s$ memory bits are filled arbitrarily (rather than having entropy at most $n$). It can also be extended to adaptive cell probes.

We will now show a conceptually simple data structure which nearly matches the lower bound. Pick a random bipartite graph with $s$ nodes on the left and $m$ nodes on the right. Every node on the right side has degree $d$. We answer each query with an XOR of its neighbor bits. By the Vazirani XOR lemma, it suffices to show that any subset $S \subseteq [m]$ of at most $k$ query bits has an XOR which is unbiased. Hence it suffices that every subset $S \subseteq [m]$ with $|S| \leq k$ has a unique neighbor, i.e., a memory bit adjacent to exactly one element of $S$. For that, in turn, it suffices that $S$ has a neighborhood of size greater than $\frac {d |S|}{2}$ (because if every element in the neighborhood of $S$ has at least two neighbors in $S$ then $S$ has a neighborhood of size $\leq d|S|/2$). We pick the graph at random and show by standard calculations that it has this property with non-zero probability.

\begin{aligned} & \Pr \left [ \exists S \subseteq [m], |S| \leq k, \textrm { s.t. } |\mathsf {neighborhood}(S)| \leq \frac {d |S|}{2} \right ] \\ & = \Pr \left [ \exists S \subseteq [m], |S| \leq k, \textrm { and } \exists T \subseteq [s], |T| \leq \frac {d|S|}{2} \textrm { s.t. all neighbors of } S \textrm { land in } T \right ] \\ & \leq \sum _{i=1}^k \binom {m}{i} \cdot \binom {s}{d \cdot i/2} \cdot \left (\frac {d \cdot i/2}{s}\right )^{d \cdot i} \\ & \leq \sum _{i=1}^k \left (\frac {e \cdot m}{i}\right )^i \cdot \left (\frac {e \cdot s} {d \cdot i/2}\right )^{d\cdot i/2} \cdot \left (\frac {d \cdot i/2}{s}\right )^{d \cdot i} \\ & = \sum _{i=1}^k \left (\frac {e \cdot m}{i}\right )^i \cdot \left (\frac {e \cdot d \cdot i/2}{s}\right )^{d \cdot i/2} \\ & = \sum _{i=1}^k \left [ \underbrace { \frac {e \cdot m}{i} \cdot \left (\frac {e \cdot d \cdot i/2}{s}\right )^{d/2} }_{C} \right ]^{i}. \end{aligned}

It suffices to have $C \leq 1/2$, so that the probability is strictly less than 1, because $\sum _{i=1}^{k} 1/2^i = 1-2^{-k}$. We can match the lower bound in two settings:

• if $s=m^{\epsilon }$ for some constant $\epsilon$, then $d=O(1)$ suffices,
• if $s=O(k \cdot \log m)$, then $d=O(\lg m)$ suffices.
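To experiment with concrete parameters, here is a minimal sketch that evaluates the quantity $C = \frac {e m}{i} \cdot \left (\frac {e d i/2}{s}\right )^{d/2}$ from the calculation above for every $i \le k$ and checks the sufficient condition $C \leq 1/2$. The function names are ours, and the computation is done with logarithms only to avoid floating-point overflow.

```python
import math


def max_log_C(m: int, k: int, s: int, d: int) -> float:
    """Largest value of ln(C) over i = 1..k, where
    C = (e*m/i) * (e*d*i / (2*s)) ** (d/2)."""
    worst = -math.inf
    for i in range(1, k + 1):
        log_c = math.log(math.e * m / i) + (d / 2) * math.log(math.e * d * i / (2 * s))
        worst = max(worst, log_c)
    return worst


def union_bound_succeeds(m: int, k: int, s: int, d: int) -> bool:
    """True if C <= 1/2 for all i <= k, so a good graph exists."""
    return max_log_C(m, k, s, d) <= math.log(0.5)


if __name__ == "__main__":
    m, k = 2 ** 20, 101              # k is roughly m ** (1/3)
    d = 48                           # d = O(lg m)
    s = math.ceil(math.e * d * k)    # s = O(d * k)
    print(union_bound_succeeds(m, k, s, d))
```

The condition is only sufficient: parameters that fail this test may still admit a good graph.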

Remark 3. It is enough if the memory is $(d\cdot k)$-wise independent as opposed to completely uniform, so one can have $n = d \cdot k \cdot \log s$. An open question is if you can improve the seed length to optimal.
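The construction described above can be sketched in a few lines. This is an illustration under our own assumptions, not the lecture's code: all names are hypothetical, each query picks $d$ distinct memory positions, and the unique-neighbor property is checked by brute force, which is feasible only for tiny parameters.

```python
import random
from itertools import combinations


def random_graph(m, s, d, rng):
    """For each of the m queries, pick d distinct memory positions out of s."""
    return [rng.sample(range(s), d) for _ in range(m)]


def answer(memory, neighbors):
    """A query is answered by XORing the memory bits it probes."""
    bit = 0
    for j in neighbors:
        bit ^= memory[j]
    return bit


def has_unique_neighbor(graph, S):
    """True if some memory position is probed by exactly one query in S."""
    counts = {}
    for i in S:
        for j in graph[i]:
            counts[j] = counts.get(j, 0) + 1
    return any(c == 1 for c in counts.values())


def all_small_sets_have_unique_neighbor(graph, k):
    """Brute-force check over all non-empty query sets of size at most k."""
    return all(
        has_unique_neighbor(graph, S)
        for size in range(1, k + 1)
        for S in combinations(range(len(graph)), size)
    )


if __name__ == "__main__":
    rng = random.Random(0)
    graph = random_graph(m=12, s=30, d=3, rng=rng)
    memory = [rng.randint(0, 1) for _ in range(30)]
    print([answer(memory, nb) for nb in graph])
    print(all_small_sets_have_unique_neighbor(graph, 3))
```

When the check returns true and the memory is uniform, every XOR of at most $k$ answers includes a memory bit that appears exactly once, and is therefore unbiased.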

As remarked earlier the lower bound does not give anything when $s$ is much larger than $n$. In particular it is not clear if it rules out $d=2$. Next we show a lower bound which applies to this case.

Problem 4. Take $n$ bits to be a seed for a $1/100$-biased distribution over $\{0,1\}^m$. The queries, like before, are the bits of that distribution. Recall that $n=O(\lg m)$.

Theorem 5. For the above problem with $d=2$, you need $s = \Omega (m)$.

Proof. Every query is answered by looking at $d=2$ bits. But $t = \Omega (m)$ queries are answered by the same 2-bit function $f$ of probes (because there is a constant number of functions on 2 bits). There are two cases for $f$:

1. $f$ is linear (or affine). Suppose for the sake of contradiction that $t>s$. Then you have a linear dependence, because the space of linear functions on $s$ bits has dimension $s$. This implies that if you XOR the answers to the dependent queries, you always get the same constant. This in turn contradicts the assumption that the distribution has small bias.
2. $f$ is AND (up to negating the input variables or the output). In this case, we keep collecting queries as long as they probe at least one new memory bit. If $t > s$, then when we stop we have a query left such that both of its probes query bits that have already been queried. This means that there exist two queries $q_1$ and $q_2$ whose probes cover the probes of a third query $q_3$. This in turn implies that the queries are not close to uniform. That is because there exist answers to $q_1$ and $q_2$ that fix the bits probed by them, and so also fix the bits probed by $q_3$. But this contradicts the small bias of the distribution.

$\square$
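The linear case of the proof can be illustrated with a small Gaussian elimination over $\mathbb {F}_2$. The sketch below (names are ours) represents a query that XORs memory positions $a \neq b$ as a bitmask, and returns a set of queries whose probe vectors sum to zero, if one exists. With more queries than memory bits such a set always exists.

```python
def find_xor_dependency(queries):
    """queries: list of (a, b) pairs of distinct memory positions.

    Returns a non-empty list of query indices whose probe vectors XOR to
    zero over GF(2), or None if the queries are linearly independent.
    """
    basis = {}  # pivot position -> (vector, bitmask of queries forming it)
    for idx, (a, b) in enumerate(queries):
        vec = (1 << a) ^ (1 << b)
        combo = 1 << idx
        while vec:
            pivot = vec.bit_length() - 1
            if pivot not in basis:
                basis[pivot] = (vec, combo)
                break
            basis_vec, basis_combo = basis[pivot]
            vec ^= basis_vec
            combo ^= basis_combo
        if vec == 0:
            # The queries recorded in combo cancel each other out.
            return [i for i in range(len(queries)) if (combo >> i) & 1]
    return None


if __name__ == "__main__":
    # Four queries on s = 3 memory bits: a dependency must exist.
    print(find_xor_dependency([(0, 1), (1, 2), (0, 2), (0, 1)]))
```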

### References

[Sie04]   Alan Siegel. On universal classes of extremely random constant-time hash functions. SIAM J. on Computing, 33(3):505–543, 2004.

# Special Topics in Complexity Theory, Lectures 16-17

Special Topics in Complexity Theory, Fall 2017. Instructor: Emanuele Viola

### 1 Lectures 16-17, Scribe: Tanay Mehta

In these lectures we prove the corners theorem for pseudorandom groups, following Austin [Aus16]. Our exposition has several minor differences with that in [Aus16], which may make it more computer-science friendly. The instructor suspects a proof can also be obtained via certain local modifications and simplifications of Green’s exposition [Gre05b, Gre05a] of an earlier proof for the abelian case. We focus on the case $G = \textit {SL}_2(q)$ for simplicity, but the proof immediately extends to other pseudorandom groups.

Theorem 1. Let $G = \textit {SL}_2(q)$. Every subset $A \subseteq G^2$ of density $\mu (A) \geq 1/\log ^a |G|$ contains a corner, i.e., a set of the form $\{(x, y), (xz, y), (x, zy) ~|~ z \neq 1\}$.

#### 1.1 Proof Overview

For intuition, suppose $A$ is a product set, i.e., $A = B \times C$ for $B, C \subseteq G$. Let’s look at the quantity

\begin{aligned}\mathbb {E}_{x, y, z \leftarrow G}[A(x, y) A(xz, y) A(x, zy)]\end{aligned}

where $A(x, y) = 1$ iff $(x, y) \in A$. Note that the random variable in the expectation is equal to $1$ exactly when $x, y, z$ form a corner in $A$. We’ll show that this quantity is greater than $1/|G|$, which implies that $A$ contains a corner (where $z \neq 1$). Since we are taking $A = B \times C$, we can rewrite the above quantity as

\begin{aligned} & \mathbb {E}_{x, y, z \leftarrow G}[B(x)C(y) B(xz)C(y) B(x)C(zy)] \\ & = \mathbb {E}_{x, y, z \leftarrow G}[B(x)C(y) B(xz)C(zy)] \\ & = \mathbb {E}_{x, y, z \leftarrow G}[B(x)C(y) B(z)C(x^{-1}zy)] \end{aligned}

where the last line follows by replacing $z$ with $x^{-1}z$ in the uniform distribution. If $\mu (A) \ge \delta$, then $\mu (B) \ge \delta$ and $\mu (C) \ge \delta$. Condition on $x \in B$, $y \in C$, $z \in B$. Then the distribution $x^{-1}zy$ is a product of three independent distributions, each uniform on a set of measure greater than $\delta$. By pseudorandomness $x^{-1}zy$ is $1/|G|^{\Omega (1)}$ close to uniform in statistical distance. This implies that the above quantity equals

\begin{aligned} & \mu (B) \cdot \mu (C) \cdot \mu (B) \cdot \left (\mu (C) \pm \frac {1}{|G|^{\Omega (1)}}\right )\\ & \geq \delta ^3 \left ( \delta - \frac {1}{|G|^{\Omega (1)}} \right ) \\ & \geq \delta ^4 /2 \\ & > 1/|G|. \end{aligned}

Given this, it is natural to try to write an arbitrary $A$ as a combination of product sets (with some error). We will make use of a more general result.
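The change of variables in the product-set calculation can be checked by brute force on a tiny group. The sketch below enumerates $\textit {SL}_2(q)$ for a small prime $q$ (an assumption of this illustration; all function names are ours) and computes both $\mathbb {E}[A(x,y)A(xz,y)A(x,zy)]$ for $A = B \times C$ and the rewritten form $\mathbb {E}[B(x)C(y)B(z)C(x^{-1}zy)]$, which must agree exactly.

```python
from itertools import product


def sl2(q):
    """All matrices (a, b, c, d) over Z_q with ad - bc = 1; q is assumed prime."""
    return [(a, b, c, d) for a, b, c, d in product(range(q), repeat=4)
            if (a * d - b * c) % q == 1]


def mul(g, h, q):
    a, b, c, d = g
    e, f, r, t = h
    return ((a * e + b * r) % q, (a * f + b * t) % q,
            (c * e + d * r) % q, (c * f + d * t) % q)


def inv(g, q):
    """Inverse of a determinant-1 matrix."""
    a, b, c, d = g
    return (d % q, (-b) % q, (-c) % q, a % q)


def corner_density(B, C, G, q):
    """E_{x,y,z}[A(x,y) A(xz,y) A(x,zy)] for the product set A = B x C."""
    hits = sum(1 for x in B for y in C for z in G
               if mul(x, z, q) in B and mul(z, y, q) in C)
    return hits / len(G) ** 3


def rewritten_density(B, C, G, q):
    """E_{x,y,z}[B(x) C(y) B(z) C(x^{-1} z y)], after replacing z by x^{-1} z."""
    hits = sum(1 for x in B for y in C for z in B
               if mul(mul(inv(x, q), z, q), y, q) in C)
    return hits / len(G) ** 3


if __name__ == "__main__":
    q = 3
    G = sl2(q)
    B, C = set(G[:10]), set(G[5:17])
    print(len(G), corner_density(B, C, G, q), rewritten_density(B, C, G, q))
```

For such a small group the pseudorandomness error term is not negligible, so this only illustrates the algebraic identity, not the final bound.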

#### 1.2 Weak Regularity Lemma

Let $U$ be some universe (we will take $U = G^2$). Let $f:~U \rightarrow [-1,1]$ be a function (for us, $f = 1_A$). Let $D \subseteq \{d: U \rightarrow [-1,1]\}$ be some set of functions, which can be thought of as “easy functions” or “distinguishers.”

Theorem 2. [Weak Regularity Lemma] For all $\epsilon > 0$, there exists a function $g := \sum _{i \le s} c_i \cdot d_i$ where $d_i \in D$, $c_i \in \mathbb {R}$ and $s = 1/\epsilon ^2$ such that for all $d \in D$

\begin{aligned}\mathbb {E}_{x \leftarrow U}[f(x) \cdot d(x)] = \mathbb {E}_{x \leftarrow U}[g(x) \cdot d(x)] \pm \epsilon .\end{aligned}

The lemma is called ‘weak’ because it came after Szemerédi’s regularity lemma, which has a stronger distinguishing conclusion. However, the lemma is also ‘strong’ in the sense that Szemerédi’s regularity lemma has $s$ as a tower of $1/\epsilon$ whereas here we have $s$ polynomial in $1/\epsilon$. The weak regularity lemma is also simpler. There also exists a proof of Szemerédi’s theorem (on arithmetic progressions), which uses weak regularity as opposed to the full regularity lemma used initially.

Proof. We will construct the approximation $g$ through an iterative process producing functions $g_0, g_1, \dots , g$. We will show that $||f - g_i||_2^2$ decreases by $\ge \epsilon ^2$ each iteration.

1. Start: Define $g_0 = 0$ (which can be realized setting $c_0 = 0$).
2. Iterate: If not done, there exists $d \in D$ such that $|\mathbb {E}[(f - g) \cdot d]| > \epsilon$. Assume without loss of generality $\mathbb {E}[(f - g) \cdot d] > \epsilon$.
3. Update: $g' := g + \lambda d$ where $\lambda \in \mathbb {R}$ shall be picked later.

Let us analyze the progress made by the algorithm.

\begin{aligned} ||f - g'||_2^2 &~ = \mathbb {E}_x[(f - g')^2(x)] \\ &~ = \mathbb {E}_x[(f - g - \lambda d)^2(x)] \\ &~ = \mathbb {E}_x[(f - g)^2] + \mathbb {E}_x[\lambda ^2 d^2 (x)] - 2\mathbb {E}_x[(f - g)\cdot \lambda d(x)] \\ &~ \leq ||f - g||_2^2 + \lambda ^2 - 2\lambda \mathbb {E}_x[(f-g)d(x)] \\ &~ \leq ||f - g||_2^2 + \lambda ^2 - 2\lambda \epsilon \\ &~ \leq ||f-g||_2^2 - \epsilon ^2 \end{aligned}

where the last line follows by taking $\lambda = \epsilon$. Therefore, there can only be $1/\epsilon ^2$ iterations because $||f - g_0||_2^2 = ||f||_2^2 \leq 1$. $\square$
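The iterative procedure in the proof is short enough to run directly on a small universe. The following is a minimal sketch, assuming a finite list of distinguishers given as value tables; the names are ours. A negative step $\lambda = -\epsilon$ is used when the correlation is negative, which plays the role of the "without loss of generality" in the proof.

```python
import random
from itertools import product


def mean_product(u, v):
    """E_x[u(x) * v(x)] over a finite universe indexed 0..len(u)-1."""
    return sum(a * b for a, b in zip(u, v)) / len(u)


def weak_regularity(f, distinguishers, eps):
    """Greedy procedure from the proof.

    Returns the approximation g and the list of (coefficient, index) terms,
    where index points into `distinguishers`.
    """
    g = [0.0] * len(f)
    terms = []
    while True:
        residual = [a - b for a, b in zip(f, g)]
        violated = None
        for idx, d in enumerate(distinguishers):
            corr = mean_product(residual, d)
            if abs(corr) > eps:
                violated = (idx, corr)
                break
        if violated is None:
            return g, terms
        idx, corr = violated
        lam = eps if corr > 0 else -eps
        g = [b + lam * dv for b, dv in zip(g, distinguishers[idx])]
        terms.append((lam, idx))


def rectangles(side):
    """Indicators of all S x T with S, T subsets of range(side)."""
    subsets = list(product([0, 1], repeat=side))
    return [[S[x] * T[y] for x in range(side) for y in range(side)]
            for S in subsets for T in subsets]


if __name__ == "__main__":
    rng = random.Random(1)
    f = [rng.randint(0, 1) for _ in range(16)]   # a random subset of a 4 x 4 grid
    g, terms = weak_regularity(f, rectangles(4), eps=0.25)
    print(len(terms), "terms, at most", int(1 / 0.25 ** 2))
```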

#### 1.3 Getting more for rectangles

Returning to the proof of the corners theorem, we will use the weak regularity lemma to approximate the indicator function for arbitrary $A$ by rectangles. That is, we take $D$ to be the collection of indicator functions for all sets of the form $S \times T$ for $S, T \subseteq G$. The weak regularity lemma gives us $A$ as a linear combination of rectangles. These rectangles may overlap. However, we ideally want $A$ to be a linear combination of non-overlapping rectangles.

Claim 3. Given a decomposition of $A$ into rectangles from the weak regularity lemma with $s$ functions, there exists a decomposition with $2^{O(s)}$ rectangles which don’t overlap.

Proof. Exercise. $\square$

In the above decomposition, note that it is natural to take the coefficients of rectangles to be the density of points in $A$ that are in the rectangle. This gives rise to the following claim.

Claim 4. The weights of the rectangles in the above claim can be the average of $f$ in the rectangle, at the cost of doubling the distinguisher error.

Consequently, we have that $f = g + h$, where $g$ is the sum of $2^{O(s)}$ non-overlapping rectangles $S \times T$ with coefficients $\Pr _{(x, y) \in S \times T}[f(x, y) = 1]$.

Proof. Let $g$ be a partition decomposition with arbitrary weights. Let $g'$ be a partition decomposition with weights being the average of $f$. It is enough to show that for all rectangle distinguishers $d \in D$

\begin{aligned}|\mathbb {E}[(f-g')d]| \leq |\mathbb {E}[(f-g)d]|.\end{aligned}

By the triangle inequality, we have that

\begin{aligned}|\mathbb {E}[(f-g')d]| \leq |\mathbb {E}[(f-g)d]| + |\mathbb {E}[(g-g')d]|.\end{aligned}

To bound $|\mathbb {E}[(g-g')d]|$, note that the error is maximized for a $d$ that respects the decomposition in non-overlapping rectangles, i.e., $d$ is the union of some non-overlapping rectangles from the decomposition. This can be argued using that, unlike $f$, the value of $g$ and $g'$ on a rectangle $S\times T$ from the decomposition is fixed. But, for such $d$, $g' = f$! More formally, $\mathbb {E}[(g-g')d] = \mathbb {E}[(g-f)d]$. $\square$

We need to get a little more from this decomposition. The conclusion of the regularity lemma holds with respect to distinguishers that can be written as $U(x) \cdot V(y)$ where $U$ and $V$ map $G \to \{0,1\}$. We need the same guarantee for $U$ and $V$ with range $[-1,1]$. This can be accomplished paying only a constant factor in the error, as follows. Let $U$ and $V$ have range $[-1,1]$. Write $U = U_+ - U_-$ where $U_+$ and $U_-$ have range $[0,1]$, and the same for $V$. The error for distinguisher $U \cdot V$ is at most the sum of the errors for distinguishers $U_+ \cdot V_+$, $U_+ \cdot V_-$, $U_- \cdot V_+$, and $U_- \cdot V_-$. So we can restrict our attention to distinguishers $U(x) \cdot V(y)$ where $U$ and $V$ have range $[0,1]$. In turn, a function $U(x)$ with range $[0,1]$ can be written as an expectation $\mathbb{E} _a U_a(x)$ for functions $U_a$ with range $\{0,1\}$, and the same for $V$. We conclude by observing that

\begin{aligned} \mathbb{E} _{x,y}[ (f-g)(x,y) \mathbb{E} _a U_a(x) \cdot \mathbb{E} _b V_b(y)] \le \max _{a,b} \mathbb{E} _{x,y}[ (f-g)(x,y) U_a(x) \cdot V_b(y)].\end{aligned}

#### 1.4 Proof

Let us now finish the proof by showing a corner exists for sufficiently dense sets $A \subseteq G^2$. We’ll use three types of decompositions for $f: G^2 \rightarrow \{0,1\}$, with respect to the following three types of distinguishers, where $U_i$ and $V_i$ have range $\{0,1\}$:

1. $U_1(x) \cdot V_1(y)$,
2. $U_2(xy) \cdot V_2(y)$,
3. $U_3(x) \cdot V_3(xy)$.

The last two distinguishers can be visualized as parallelograms with a 45-degree angle between two segments. The same extra properties we discussed for rectangles hold for them too.

Recall that we want to show

\begin{aligned}\mathbb {E}_{x, y, g}[f(x, y) f(xg, y) f(x, gy)] > \frac {1}{|G|}.\end{aligned}

We’ll decompose the $i$-th occurrence of $f$ via the $i$-th decomposition listed above. We’ll write this decomposition as $f = g_i + h_i$. We do this in the following order:

\begin{aligned} & ~f(x, y) \cdot f(xg, y) \cdot f(x, gy) \\ = & ~f(x, y) f(xg, y) g_3(x, gy) + f(x, y) f(xg, y) h_3(x, gy) \\ &~ \vdots \\ =&~ g_1 g_2 g_3 + h_1 g_2 g_3 + f h_2 g_3 + f f h_3 \end{aligned}

We first show that $\mathbb{E} [g_1 g_2 g_3]$ is big (i.e., inverse polylogarithmic in expectation) in the next two claims. Then we show that the expectations of the other terms are small.

Claim 5. For all $g \in G$, the values $\mathbb {E}_{x, y}[g_1(x, y) g_2(xg, y) g_3(x, gy)]$ are the same (over $g$) up to an error of $2^{O(s)} \cdot 1/|G|^{\Omega (1)}$.

Proof. We just need to get error $1/|G|^{\Omega (1)}$ for any product of three functions for the three decomposition types. By the standard pseudorandomness argument we saw in previous lectures,

\begin{aligned} \mathbb {E}_{x, y}[c_1 U_1(x)V_1(y) \cdot c_2 U_2(xgy)V_2(y) \cdot c_3 U_3(x)V_3(xgy)] \\ = c_1 c_2 c_3 \mathbb {E}_{x, y}[(U_1 \cdot U_3)(x) (V_1 \cdot V_2)(y) (U_2 \cdot V_3)(xgy)] \\ = c_1 c_2 c_3 \cdot \mu (U_1 \cdot U_3) \mu (V_1 \cdot V_2) \mu (U_2 \cdot V_3) \pm \frac {1}{|G|^{\Omega (1)}}. \end{aligned}

$\square$

Recall that we start with a set of density $\ge 1/\log ^{a} |G|$.

Claim 6. $\mathbb {E}_{g, x, y}[g_1 g_2 g_3] > \Omega (1/\log ^{4a} |G|)$.

Proof. By the previous claim, we can fix $g = 1_G$. We will relate the expectation over $x, y$ to $f$ by a trick using the Hölder inequality: For random variables $X_1, X_2, \ldots , X_k$,

\begin{aligned}\mathbb {E}[X_1 \dots X_k] \leq \prod _{i=1}^k \mathbb {E}[X_i^{c_i}]^{1/c_i} \text { such that } \sum 1/c_i = 1.\end{aligned}

To apply this inequality in our setting, write

\begin{aligned}\mathbb {E}[f] = \mathbb {E}\left [(f \cdot g_1 g_2 g_3)^{1/4} \cdot \left (\frac {f}{g_1}\right )^{1/4}\cdot \left (\frac {f}{g_2}\right )^{1/4}\cdot \left (\frac {f}{g_3}\right )^{1/4}\right ].\end{aligned}

By the Hölder inequality, we get that

\begin{aligned}\mathbb {E}[f] \leq \mathbb {E}[f \cdot g_1 g_2 g_3]^{1/4} \mathbb {E}\left [\frac {f}{g_1}\right ]^{1/4} \mathbb {E}\left [\frac {f}{g_2}\right ]^{1/4} \mathbb {E}\left [\frac {f}{g_3}\right ]^{1/4}.\end{aligned}

Note that

\begin{aligned} \mathbb {E}_{x, y} \frac {f(x,y)}{g_1(x, y)} & = \mathbb {E}_{x, y} \frac {f(x, y)}{\mathbb {E}_{x', y' \in \textit {Cell}(x,y)}[f(x', y')] } \\ & = \mathbb {E}_{x, y} \frac {\mathbb {E}_{x', y' \in \textit {Cell}(x, y)}[f(x',y')]}{\mathbb {E}_{x', y' \in \textit {Cell}(x,y)}[f(x', y')] }\\ & = 1 \end{aligned}

where $\textit {Cell}(x, y)$ is the set in the partition that contains $(x, y)$, and similarly for $g_2$ and $g_3$. Finally, since $f \leq 1$ and the $g_i$ are non-negative, we have that $\mathbb {E}[f \cdot g_1 g_2 g_3] \leq \mathbb {E}[g_1 g_2 g_3]$, and hence $\mathbb {E}[g_1 g_2 g_3] \geq \mathbb {E}[f]^4$. This concludes the proof. $\square$
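The inequality $\mathbb {E}[g_1 g_2 g_3] \geq \mathbb {E}[f]^4$ obtained from the Hölder argument holds for any three partitions, and this is easy to test numerically. The sketch below is an illustration with hypothetical names: it draws a random $0/1$ function $f$ on a finite universe, forms three cell-average functions from random partitions, and returns both sides of the inequality.

```python
import random


def cell_average(f, partition):
    """Replace f by its average on each cell of a partition."""
    g = [0.0] * len(f)
    for cell in partition:
        avg = sum(f[x] for x in cell) / len(cell)
        for x in cell:
            g[x] = avg
    return g


def random_partition(n, cells, rng):
    """Assign each point of range(n) to one of `cells` labels; drop empty cells."""
    labels = [rng.randrange(cells) for _ in range(n)]
    parts = [[x for x in range(n) if labels[x] == c] for c in range(cells)]
    return [p for p in parts if p]


def holder_check(n=200, cells=5, seed=0):
    """Returns (E[g1*g2*g3], E[f]**4) for a random instance."""
    rng = random.Random(seed)
    f = [rng.randint(0, 1) for _ in range(n)]
    g1, g2, g3 = (cell_average(f, random_partition(n, cells, rng)) for _ in range(3))
    lhs = sum(a * b * c for a, b, c in zip(g1, g2, g3)) / n
    rhs = (sum(f) / n) ** 4
    return lhs, rhs


if __name__ == "__main__":
    for seed in range(3):
        print(holder_check(seed=seed))
```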

We’ve shown that the $g_1 g_2 g_3$ term is big. It remains to show the other terms are small. Let $\epsilon$ be the error in the weak regularity lemma with respect to distinguishers with range $[-1,1]$.

Claim 7. $|\mathbb {E}[f f h_3]| \leq \epsilon ^{1/4}$.

Proof. Replace $g$ with $gy^{-1}$ in the uniform distribution to get

\begin{aligned} & \mathbb {E}^4_{x, y, g}[f(x,y) f(xg,y)h_3(x, gy)] \\ & = \mathbb {E}^4_{x, y, g}[f(x,y) f(xgy^{-1},y)h_3(x, g)] \\ & = \mathbb {E}^4_{x, y}[f(x,y) \mathbb {E}_g [f(xgy^{-1},y)h_3(x, g)]] \\ & \leq \mathbb {E}^2_{x, y} [f^2(x, y)] \mathbb {E}^2_{x, y} \mathbb {E}^2_g [f(xgy^{-1},y)h_3(x, g)]\\ & \leq \mathbb {E}^2_{x, y} \mathbb {E}^2_g [f(xgy^{-1},y)h_3(x, g)]\\ & = \mathbb {E}^2_{x, y, g, g'}[f(xgy^{-1}, y) h_3(x, g) f(xg'y^{-1}, y) h_3(x, g')], \end{aligned}

where the first inequality is by Cauchy-Schwarz.

Now replace $g \rightarrow x^{-1}g, g' \rightarrow x^{-1}g'$ and reason in the same way:

\begin{aligned} & = \mathbb {E}^2_{x, y, g, g'}[f(gy^{-1}, y) h_3(x, x^{-1}g) f(g'y^{-1}, y) h_3(x, x^{-1}g')] \\ & = \mathbb {E}^2_{g, g', y}[f(gy^{-1}, y) \cdot f(g'y^{-1}, y) \mathbb {E}_x [h_3(x, x^{-1}g) \cdot h_3(x, x^{-1}g')]] \\ & \leq \mathbb {E}_{x,x',g,g'}[h_3(x, x^{-1}g) h_3(x, x^{-1}g') h_3(x', x'^{-1}g) h_3(x', x'^{-1}g')]. \end{aligned}

Replace $g \rightarrow xg$ to rewrite the expectation as

\begin{aligned} \mathbb {E}[h_3(x, g) h_3(x, x^{-1}g') h_3(x', x'^{-1}xg) h_3(x', x'^{-1}g')].\end{aligned}

We want to view the last three terms as a distinguisher $U(x) \cdot V(xg)$. First, note that $h_3$ has range $[-1,1]$. This is because $h_3(x,y) = f(x,y) - \mathbb{E} _{x', y' \in \textit {Cell}(x,y)} f(x',y')$ and $f$ has range $\{0,1\}$.

Fix $x', g'$. The last term in the expectation becomes a constant $c \in [-1,1]$. The second term only depends on $x$, and the third only on $xg$. Hence for appropriate functions $U$ and $V$ with range $[-1,1]$ this expectation can be rewritten as

\begin{aligned} \mathbb {E}[h_3(x, g) U(x) V(xg)], \end{aligned}

which concludes the proof. $\square$

There are similar proofs to show the remaining terms are small. For $fh_2g_3$, we can perform simple manipulations and then reduce to the above case. For $h_1 g_2 g_3$, we have a slightly easier proof than above.

##### 1.4.1 Parameters

Suppose our set has density $\delta \ge 1/\log ^a |G|$. We apply the weak regularity lemma for error $\epsilon = 1/\log ^c |G|$. This yields the number of functions $s = 2^{O(1/\epsilon ^2)} = 2^{O(\log ^{2c} |G|)}$. For say $c = 1/3$, we can bound $\mathbb{E} _{x,y,g}[g_1 g_2 g_3]$ from below by the same expectation with $g$ fixed to $1$, up to an error $1/|G|^{\Omega (1)}$. Then, $\mathbb {E}_{x,y,g=1}[g_1g_2g_3] \geq \mathbb {E}[f]^4 \geq 1/\log ^{4a}|G|$. The expectation of terms with $h$ is less than $1/\log ^{c/4} |G|$. So the proof can be completed for all sufficiently small $a$.

### References

[Aus16]    Tim Austin. Ajtai-Szemerédi theorems over quasirandom groups. In Recent trends in combinatorics, volume 159 of IMA Vol. Math. Appl., pages 453–484. Springer, [Cham], 2016.

[Gre05a]   Ben Green. An argument of Shkredov in the finite field setting, 2005. Available at people.maths.ox.ac.uk/greenbj/papers/corners.pdf.

[Gre05b]   Ben Green. Finite field models in additive combinatorics. Surveys in Combinatorics, London Math. Soc. Lecture Notes 327, 1-27, 2005.

# Special Topics in Complexity Theory, Lecture 15

Special Topics in Complexity Theory, Fall 2017. Instructor: Emanuele Viola

### 1 Lecture 15, Scribe: Chin Ho Lee

In this lecture fragment we discuss multiparty communication complexity, especially the problem of separating deterministic and randomized communication, which we connect to a problem in combinatorics.

In number-on-forehead (NOH) communication complexity each party $i$ sees all of the input $(x_1, \dotsc , x_k)$ except its own input $x_i$. For background, it is not known how to prove negative results for $k \ge \log n$ parties. We shall focus on the problem of separating deterministic and randomized communication. For $k = 2$, we know the optimal separation: The equality function requires $\Omega (n)$ communication for deterministic protocols, but can be solved using $O(1)$ communication if we allow the protocols to use public coins. For $k = 3$, the best known separation between deterministic and randomized protocols is $\Omega (\log n)$ vs $O(1)$ [BDPW10]. In the following we give a new proof of this result, for a simpler function: $f(x, y, z) = 1$ if and only if $x \cdot y \cdot z = 1$ for $x, y, z \in SL_2(q)$.

For context, let us state and prove the upper bound for randomized communication.

Claim 1. $f$ has randomized communication complexity $O(1)$.

Proof. In the NOH model, computing $f$ reduces to $2$-party equality with no additional communication: Alice computes $y \cdot z =: w$ privately, then Alice and Bob check if $x = w^{-1}$. $\square$

To prove an $\Omega (\log n)$ lower bound for deterministic protocols, where $n = \log |G|$, we reduce the communication problem to a combinatorial problem.

Definition 2. A corner in a group $G$ is $\{ (x,y), (xz, y), (x,zy) \} \subseteq G^2$, where $x, y$ are arbitrary group elements and $z \neq 1_G$.

For intuition, consider the case when $G$ is Abelian, where one can replace multiplication by addition and a corner becomes $\{ (x, y), (x + z, y), (x, y + z)\}$ for $z \neq 0$.
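For the Abelian case just described, finding a corner is a simple search. The following sketch works in $\mathbb {Z}_n^2$ with addition modulo $n$; the choice of group and the function name are ours, for illustration only.

```python
def find_corner(A, n):
    """Search A, a set of pairs over Z_n, for a corner
    {(x, y), (x + z, y), (x, y + z)} with z != 0.

    Returns (x, y, z) for the first corner found, or None.
    """
    A = set(A)
    for (x, y) in sorted(A):
        for z in range(1, n):
            if ((x + z) % n, y) in A and (x, (y + z) % n) in A:
                return (x, y, z)
    return None


if __name__ == "__main__":
    print(find_corner({(0, 0), (1, 0), (0, 1)}, 5))  # contains a corner with z = 1
    print(find_corner({(0, 0), (1, 1)}, 5))          # contains no corner
```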

We now state the theorem that gives the lower bound.

Theorem 3. Suppose that every subset $A \subseteq G^2$ with $\mu (A) := |A|/|G^2| \ge \delta$ contains a corner. Then the deterministic communication complexity of $f(x, y, z) = 1 \iff x \cdot y \cdot z = 1_G$ is $\Omega (\log (1/\delta ))$.

It is known that when $G$ is Abelian, then $\delta \ge 1/\mathrm {polyloglog}|G|$ implies a corner. We shall prove that when $G = SL_2(q)$, then $\delta \ge 1/\mathrm {polylog}|G|$ implies a corner. This in turn implies communication $\Omega (\log \log |G|) = \Omega (\log n)$.

Proof. We saw that a number-in-hand (NIH) $c$-bit protocol can be written as a disjoint union of $2^c$ rectangles. Likewise, a number-on-forehead $c$-bit protocol $P$ can be written as a disjoint union of $2^c$ cylinder intersections $C_i := \{ (x, y, z) : f_i(y,z) g_i(x,z) h_i(x,y) = 1\}$ for some $f_i, g_i, h_i\colon G^2 \to \{0, 1\}$:

\begin{aligned} P(x,y,z) = \sum _{i=1}^{2^c} f_i(y,z) g_i(x,z) h_i(x,y). \end{aligned}

The proof idea of the above fact is to consider the $2^c$ transcripts of $P$, then one can see that the inputs giving a fixed transcript are a cylinder intersection.

Let $P$ be a $c$-bit protocol. Consider the inputs $\{(x, y, (xy)^{-1}) \}$ on which $P$ accepts. Note that at least $2^{-c}$ fraction of them are accepted by some cylinder intersection $C$. Let $A := \{ (x,y) : (x, y, (xy)^{-1}) \in C \} \subseteq G^2$. Since the first two elements in the tuple determine the last, we have $\mu (A) \ge 2^{-c}$.

Now suppose $A$ contains a corner $\{ (x, y), (xz, y), (x, zy) \}$. Then

\begin{aligned} (x,y) \in A &\implies (x, y, (xy)^{-1}) \in C &&\implies h(x, y) = 1 , \\ (xz,y) \in A &\implies (xz, y, (xzy)^{-1}) \in C &&\implies f(y,(xzy)^{-1}) = 1 , \\ (x,zy) \in A &\implies (x, zy, (xzy)^{-1}) \in C &&\implies g(x,(xzy)^{-1}) = 1 . \end{aligned}

This implies $(x,y,(xzy)^{-1}) \in C$, which is a contradiction because $z \neq 1$ and so $x \cdot y \cdot (xzy)^{-1} \neq 1_G$. $\square$

### References

[BDPW10]   Paul Beame, Matei David, Toniann Pitassi, and Philipp Woelfel. Separating deterministic from randomized multiparty communication complexity. Theory of Computing, 6(1):201–225, 2010.

# Special Topics in Complexity Theory, Lecture 10

Added Dec 27 2017: An updated version of these notes exists on the class page.

Special Topics in Complexity Theory, Fall 2017. Instructor: Emanuele Viola

### 1 Lecture 10, Guest lecture by Justin Thaler, Scribe: Biswaroop Maiti

This is a guest lecture by Justin Thaler regarding lower bounds on approximate degree [BKT17, BT15, BT17]. Thanks to Justin for giving this lecture and for his help with the write-up. We will sketch some details of the lower bound on the approximate degree of $\mathsf {AND} \circ \mathsf {OR}$, $\mathsf {SURJ}$ and some intuition about the techniques used. Recall the definition of $\mathsf {SURJ}$ from the previous lecture as below:

Definition 1. The surjectivity function $\mathsf {SURJ}\colon \left (\{-1,1\}^{\log R}\right )^N \to \{-1,1\}$, takes input $x=(x_1, \dots , x_N)$ where each $x_i \in \{-1, 1\}^{\log R}$ is interpreted as an element of $[R]$. $\mathsf {SURJ}(x)$ has value $-1$ if and only if $\forall j \in [R], \exists i\colon x_i = j$.

Recall from the last lecture that $\mathsf {AND}_R \circ \mathsf {OR}_N \colon \{-1,1\}^{R\times N} \rightarrow \{-1,1\}$ is the block-wise composition of the $\mathsf {AND}$ function on $R$ bits and the $\mathsf {OR}$ function on $N$ bits. In general, we will denote the block-wise composition of two functions $f$ and $g$, where $f$ is defined on $R$ bits and $g$ is defined on $N$ bits, by $f_R \circ g_N$. Here, the outputs of $R$ copies of $g$ are fed into $f$ (with the inputs to each copy of $g$ being pairwise disjoint). The total number of inputs to $f_R \circ g_N$ is $R \cdot N$.
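To make Definition 1 and block composition concrete, here is a small sketch that computes $\mathsf {SURJ}$ directly and also as an $\mathsf {AND}$ over range items $r$ of an $\mathsf {OR}$ over inputs $i$ of the predicate "is $x_i$ equal to $r$". For readability it uses Python booleans (True standing for $-1$) and range items $0, \dots , R-1$; these conventions and the function names are ours.

```python
from itertools import product


def surj(xs, R):
    """True iff every range item 0..R-1 appears among the inputs xs."""
    return set(xs) == set(range(R))


def equals(x_i, r):
    """The predicate 'is input x_i equal to range item r'."""
    return x_i == r


def and_or_form(xs, R):
    """AND over range items r of OR over inputs i of equals(x_i, r)."""
    return all(any(equals(x_i, r) for x_i in xs) for r in range(R))


if __name__ == "__main__":
    N, R = 4, 3
    agree = all(surj(xs, R) == and_or_form(xs, R)
                for xs in product(range(R), repeat=N))
    print(agree)
```

Note that for every input exactly $N$ of the $R \cdot N$ predicates are true, one for each $i$, which is why only inputs of low Hamming weight matter for the composed function.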

#### 1.1 Lower Bound of $d_{1/3}( \mathsf {SURJ} )$ via lower bound of $d_{1/3}($AND-OR$)$

Claim 2. $d_{1/3}( \mathsf {SURJ} ) = \widetilde {\Theta }(n^{3/4})$.

We will look at only the lower bound in the claim. We interpret the input as a list of $N$ numbers from $[R]:= \{1,2, \cdots R\}$. As presented in [BKT17], the proof for the lower bound proceeds in the following steps.

1. Show that to approximate $\mathsf {SURJ}$, it is necessary to approximate the block-composition $\mathsf {AND}_R \circ \mathsf {OR}_N$ on inputs of Hamming weight at most $N$, i.e., show that $d_{1/3}(\mathsf {SURJ}) \geq d_{1/3}^{\leq N}(\mathsf {AND}_R \circ \mathsf {OR}_N)$.

Step 1 was covered in the previous lecture, but we briefly recall a bit of intuition for why the claim in this step is reasonable. The intuition comes from the fact that the converse of the claim is easy to establish, i.e., it is easy to show that in order to approximate $\mathsf {SURJ}$, it is sufficient to approximate $\mathsf {AND}_R \circ \mathsf {OR}_N$ on inputs of Hamming weight exactly $N$.

This is because $\mathsf {SURJ}$ can be expressed as an $\mathsf {AND}_R$ (over all range items $r \in [R]$) of the $\mathsf {OR}_N$ (over all inputs $i \in [N]$) of “Is input $x_i$ equal to $r$”? Each predicate of the form in quotes is computed exactly by a polynomial of degree $\log R$, since it depends on only $\log R$ of the input bits, and exactly $N$ of the predicates (one for each $i \in [N]$) evaluate to TRUE.

Step 1 of the lower bound proof for $\mathsf {SURJ}$ in [BKT17] shows a converse, namely that the only way to approximate $\mathsf {SURJ}$ is to approximate $\mathsf {AND}_R \circ \mathsf {OR}_N$ on inputs of Hamming weight at most $N$.

2. Show that $d_{1/3}^{\leq N}(\mathsf {AND}_R \circ \mathsf {OR}_N) = \widetilde {\Omega }(n^{3/4})$, i.e., the degree required to approximate $\mathsf {AND} _R \circ \mathsf {OR}_N$ on inputs of Hamming weight at most $N$ is at least $D=\widetilde {\Omega }(n^{3/4})$.

In the previous lecture we also sketched this Step 2. In this lecture we give additional details of this step. As in the papers, we use the concept of a “dual witness.” The latter can be shown to be equivalent to bounded indistinguishability.

Step 2 itself proceeds via two substeps:

1. Give a dual witness $\Phi$ for $\mathsf {AND}_R \circ \mathsf {OR}_N$ that places little mass (namely, total mass less than $(R \cdot N \cdot D)^{-D}$) on inputs of Hamming weight $\geq N$.
2. By modifying $\Phi$, give a dual witness $\Phi '$ for $\mathsf {AND}_R \circ \mathsf {OR}_N$ that places zero mass on inputs of Hamming weight $\geq N$.

In [BKT17], both Substeps 2a and 2b proceed entirely in the dual world (i.e., they explicitly manipulate dual witnesses $\Phi$ and $\Phi '$). The main goal of this section of the lecture notes is to explain how to replace Step 2b of the argument of [BKT17] with a wholly “primal” argument.

The intuition of the primal version of Step 2b that we’ll cover is as follows. First, we will show that a polynomial $p \colon \{-1, 1\}^{R \cdot N} \to \mathbb {R}$ of degree $D$ that is bounded on the low Hamming weight inputs cannot be too big on the high Hamming weight inputs. In particular, we will prove the following claim.

Claim 3. If $p \colon \{-1, 1\}^{M} \to \mathbb {R}$ is a degree $D$ polynomial that satisfies $|p(x)| \leq 4/3$ on all inputs $x$ of Hamming weight at most $D$, then $|p(x)| \leq (4/3) \cdot D \cdot M^D$ for all inputs $x$.

Second, we will explain that the dual witness $\Phi$ constructed in Step 2a has the following “primal” implication:

Claim 4. For $D \approx N^{3/4}$, any polynomial $p$ of degree $D$ satisfying $|p(x) - \left (\mathsf {AND}_R \circ \mathsf {OR}_N\right )(x) | \leq 1/3$ for all inputs $x$ of Hamming weight at most $N$ must satisfy $|p(x)| > (4/3) \cdot D \cdot ( R \cdot N)^D$ for some input $x \in \{-1, 1\}^{R \cdot N}$.

Combining Claims 3 and 4, we conclude that no polynomial $p$ of degree $D \approx N^{3/4}$ can satisfy

\begin{aligned} |p(x) - (\mathsf {AND}_R \circ \mathsf {OR}_N)(x) | \leq 1/3 \text { for all inputs } x \text { of Hamming weight at most } N, \qquad (1)\end{aligned}

which is exactly the desired conclusion of Step 2. This is because any polynomial $p$ satisfying Equation (1) also satisfies $|p(x)| \leq 4/3$ for all $x$ of Hamming weight at most $N$, and hence Claim 3 implies that

\begin{aligned} |p(x)| \leq \frac {4}{3} \cdot D \cdot (R \cdot N)^D \text { for \emph {all} inputs } x \in \{-1, 1\}^{R \cdot N}. \qquad (2)\end{aligned}

But Claim 4 states that any polynomial satisfying both Equations (1) and (2) requires degree strictly larger than $D$.

In the remainder of this section, we prove Claims 3 and 4.

#### 1.2 Proof of Claim 3

Proof of Claim 3. For notational simplicity, let us prove this claim for polynomials on domain $\{0, 1\}^{M}$, rather than $\{-1, 1\}^M$.

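Proof in the case that $p$ is symmetric. Let us assume first that $p$ is symmetric, i.e., $p$ is only a function of the Hamming weight $|x|$ of its input $x$. Then $p(x) = g(|x|)$ for some degree $D$ univariate polynomial $g$ (this is a direct consequence of Minsky-Papert symmetrization, which we have seen in the lectures before). Since $g$ has degree at most $D$, it is determined by its values at the $D+1$ points $0, 1, \dots , D$, and Lagrange interpolation gives

\begin{aligned}g(t)= \sum _{k=0}^{D} g(k) \cdot \prod _{i \in \{0, \dots , D\} \setminus \{k\}} \frac {t-i}{k-i}. \end{aligned}

Here, each $g(k)$ is bounded in magnitude by $|g(k)| \leq 4/3$, because $k \leq D$. For $t \in \{0, 1, \dots , M\}$, each product has $D$ factors, its numerator has magnitude at most $M^D$, and its denominator has magnitude $k! \cdot (D-k)!$. Hence the magnitudes of the products sum to at most $M^D \cdot \sum _{k=0}^{D} \frac {1}{k! \cdot (D-k)!} = \frac {2^D}{D!} \cdot M^D \leq 2 \cdot M^D$. Therefore, for $D \geq 2$, we get the final bound:

\begin{aligned}|g(t)| \leq (4/3) \cdot D \cdot M^D.\end{aligned}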
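
As a quick sanity check of the symmetric case, the following Python sketch evaluates the Lagrange basis polynomials through the nodes $0, 1, \dots , D$ exactly, and compares the largest possible value of $|g(t)|$ with the bound $\frac 43 \cdot D \cdot M^D$ for a few small $D \geq 2$. The function names and the tested ranges are illustrative choices made here, not part of the notes.

```python
from fractions import Fraction

def lagrange_basis(D, k, t):
    """Value at t of the degree-D polynomial that is 1 at node k and 0 at the other nodes in {0, ..., D}."""
    value = Fraction(1)
    for i in range(D + 1):
        if i != k:
            value *= Fraction(t - i, k - i)
    return value

def interpolate(values, t):
    """Evaluate at t the polynomial of degree <= D that takes values[k] at k = 0, ..., D."""
    D = len(values) - 1
    return sum(Fraction(v) * lagrange_basis(D, k, t) for k, v in enumerate(values))

def worst_case(D, t):
    """Largest possible |g(t)| over polynomials g of degree <= D with |g(k)| <= 4/3 for k = 0, ..., D."""
    return Fraction(4, 3) * sum(abs(lagrange_basis(D, k, t)) for k in range(D + 1))

# Interpolating g(t) = t^2 from its values at 0, 1, 2 recovers g(5).
print(interpolate([0, 1, 4], 5))

# Compare the worst case at t = M with the bound (4/3) * D * M^D.
for D in range(2, 5):
    M = 3 * D
    print(D, M, float(worst_case(D, M)), float(Fraction(4, 3) * D * M**D))
```

The worst case is attained by choosing each $g(k) = \pm 4/3$ with the sign of the corresponding basis polynomial, which is why `worst_case` sums absolute values.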

Proof for general $p$. Let us now consider the case of general (not necessarily symmetric) polynomials $p$. Fix any input $x \in \{0, 1\}^M$. The goal is to show that $|p(x)| \leq \frac 43 D \cdot M^D$.

Let us consider a polynomial $\hat {p}_x \colon \{0,1\}^{|x|} \rightarrow \mathbb {R}$ of degree at most $D$ obtained from $p$ by restricting each input $i$ such that $x_i=0$ to have the value 0. For example, if $M=4$ and $x=(0, 1, 1, 0)$, then $\hat {p}_x(y_2, y_3)=p(0, y_2, y_3, 0)$. We will exploit three properties of $\hat {p}_x$:

• $\deg (\hat {p}_x) \leq \deg (p) \leq D$.
• Since $|p(z)| \leq 4/3$ for all inputs $z$ with $|z| \leq D$, $\hat {p}_x(y)$ satisfies the analogous property: $|\hat {p}_x(y)| \leq 4/3$ for all inputs $y$ with $|y| \leq D$.
• If $\mathbf {1}_{|x|}$ denotes the all-1s vector of length $|x|$, then $\hat {p}_x(\mathbf {1}_{|x|}) = p(x)$.

Property 3 means that our goal is to show that $|\hat {p}_x(\mathbf {1}_{|x|})| \leq \frac 43 \cdot D \cdot M^D$.

Let $p^{\text {symm}}_x \colon \{0, 1\}^{|x|} \to \mathbb {R}$ denote the symmetrized version of $\hat {p}_x$, i.e., $p^{\text {symm}}_x(y) = \mathbb {E}_{\sigma }[\hat {p}_x(\sigma (y))]$, where the expectation is over a random permutation $\sigma$ of $\{1, \dots , |x|\}$, and $\sigma (y)=(y_{\sigma (1)}, \dots , y_{\sigma (|x|)})$. Since $\sigma (\mathbf {1}_{|x|}) = \mathbf {1}_{|x|}$ for all permutations $\sigma$, $p^{\text {symm}}_x(\mathbf {1}_{|x|}) = \hat {p}_x(\mathbf {1}_{|x|}) = p(x)$. But $p^{\text {symm}}_x$ is symmetric, so Properties 1 and 2 together mean that the analysis from the first part of the proof (applied to a polynomial on $|x| \leq M$ variables) implies that $|p^{\text {symm}}_x(y)| \leq \frac 43 \cdot D \cdot M^D$ for all inputs $y$. In particular, letting $y = \mathbf {1}_{|x|}$, we conclude that $|p(x)| \leq \frac 43 \cdot D \cdot M^D$ as desired. $\square$
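
To make the restriction and symmetrization steps concrete, here is a small Python sketch that builds $\hat {p}_x$ and its symmetrization for the example $x = (0, 1, 1, 0)$ used above. The polynomial `p` is a made-up degree 2 example, and the helper names `restrict` and `symmetrize` are assumptions of this sketch.

```python
from fractions import Fraction
from itertools import permutations

def restrict(p, x):
    """Return (p_hat_x, |x|): every coordinate i with x[i] == 0 is fixed to 0, the rest stay free."""
    free = [i for i, xi in enumerate(x) if xi == 1]
    def p_hat(y):
        z = [0] * len(x)
        for pos, i in enumerate(free):
            z[i] = y[pos]
        return p(tuple(z))
    return p_hat, len(free)

def symmetrize(q, m):
    """Return y -> E_sigma[q(sigma(y))], averaging over all permutations of the m coordinates."""
    perms = list(permutations(range(m)))
    def q_symm(y):
        total = sum(Fraction(q(tuple(y[s] for s in sigma))) for sigma in perms)
        return total / len(perms)
    return q_symm

# Hypothetical example polynomial of degree 2 on {0,1}^4 (chosen only for illustration).
def p(x):
    return 3 * x[1] * x[2] - 5 * x[0] * x[3] + 2 * x[1] - x[2]

x = (0, 1, 1, 0)
p_hat, m = restrict(p, x)
p_symm = symmetrize(p_hat, m)

# The all-ones input is fixed by every permutation, so nothing is averaged away there.
print(p(x), p_hat((1, 1)), p_symm((1, 1)))
# On weight-1 inputs the symmetrization genuinely averages two different values.
print(p_hat((1, 0)), p_hat((0, 1)), p_symm((1, 0)))
```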

Discussion. One may try to simplify the analysis of the general case in the proof of Claim 3 by considering the polynomial $p^{\text {symm}} \colon \{0, 1\}^M \to \mathbb {R}$ defined via $p^{\text {symm}}(x)=\mathbb {E}_{\sigma }[p(\sigma (x))]$, where the expectation is over permutations $\sigma$ of $\{1, \dots , M\}$. The polynomial $p^{\text {symm}}$ is symmetric, so the analysis for symmetric polynomials immediately implies that $|p^{\text {symm}}(x)| \leq \frac 43 \cdot D \cdot M^D$. Unfortunately, this does not mean that $|p(x)| \leq \frac 43 \cdot D \cdot M^D$.

This is because the symmetrized polynomial $p^{\mathsf {symm}}$ is averaging the values of $p$ over all those inputs of a given Hamming weight. So, a bound on this averaging polynomial does not preclude the case where $p$ is massively positive on some inputs of a given Hamming weight, and massively negative on other inputs of the same Hamming weight, and these values cancel out to obtain a small average value. That is, it is not enough to conclude that on the average over inputs of any given Hamming weight, the magnitude of $p$ is not too big.
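
A tiny example of such cancellation, sketched in Python: the hypothetical polynomial $p(x_1, x_2) = 100 \cdot (x_1 - x_2)$ averages to zero on every Hamming weight, even though it takes the values $\pm 100$. The helper `average_by_weight` is a name introduced here for illustration.

```python
from fractions import Fraction
from itertools import product

def average_by_weight(p, m):
    """Map each Hamming weight t to the average of p over the inputs in {0,1}^m of weight t."""
    sums, counts = {}, {}
    for x in product((0, 1), repeat=m):
        t = sum(x)
        sums[t] = sums.get(t, 0) + p(x)
        counts[t] = counts.get(t, 0) + 1
    return {t: Fraction(sums[t], counts[t]) for t in sums}

# Hypothetical polynomial: large in magnitude on weight-1 inputs, with opposite signs.
def p(x):
    return 100 * (x[0] - x[1])

averages = average_by_weight(p, 2)
largest = max(abs(p(x)) for x in product((0, 1), repeat=2))
print(averages, largest)
```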

Thus, we needed to make sure that when we symmetrize $\hat {p}_x$ to $p^{\text {symm}}_x$, such large cancellations don’t happen, and a bound on the average value of $\hat {p}_x$ on a given Hamming weight really gives us a bound on $p$ on the input $x$ itself. We defined $\hat {p}_x$ so that $\hat {p}_x(\mathbf {1}_{|x|}) = p(x)$. Since there is only one input in $\{0, 1\}^{|x|}$ of Hamming weight $|x|$, $p^{\text {symm}}_x(\mathbf {1}_{|x|})$ does not average $\hat {p}_x$’s values on many inputs, meaning we don’t need to worry about massive cancellations.

A note on the history of Claim 3. Claim 3 was implicit in [RS10]. They explicitly showed a similar bound for symmetric polynomials using a primal view and (implicitly) gave a different (dual) proof of the case for general polynomials.

#### 1.3 Proof of Claim 4

##### 1.3.1 Interlude Part 1: Method of Dual Polynomials [BT17]

A dual polynomial is a dual solution to a certain linear program that captures the approximate degree of any given function $f \colon \{-1, 1\}^n \to \{-1, 1\}$. These polynomials act as certificates of the high approximate degree of $f$. Strong LP duality implies that the technique is lossless, in contrast to the symmetrization techniques which we saw before: for any function $f$ and any $\varepsilon$, there is always some dual polynomial $\Psi$ that witnesses a tight $\varepsilon$-approximate degree lower bound for $f$. A dual polynomial that witnesses the fact that $\mathsf {d}_\varepsilon (f) \geq d$ is a function $\Psi \colon \{-1, 1\}^n \rightarrow \mathbb {R}$ satisfying three properties:

• Correlation:
\begin{aligned}\sum _{x \in \{-1,1\}^n }{\Psi (x) \cdot f(x)} > \varepsilon .\end{aligned}

If $\Psi$ satisfies this condition, it is said to be well-correlated with $f$.

• Pure high degree: For all polynomials $p \colon \{-1, 1\}^n \rightarrow \mathbb {R}$ of degree less than $d$, we have
\begin{aligned}\sum _{x \in \{-1,1\}^n } { p(x) \cdot \Psi (x)} = 0.\end{aligned}

If $\Psi$ satisfies this condition, it is said to have pure high degree at least $d$.

• $\ell _1$ norm:
\begin{aligned}\sum _{x \in \{-1,1\}^n }|\Psi (x)| = 1.\end{aligned}
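
The three properties can be checked by brute force on small examples. The sketch below does so for a standard toy case that is not taken from the notes: $f$ is the parity of $n=3$ bits and $\Psi (x) = 2^{-n} \cdot f(x)$. Pure high degree is tested on monomials only, which suffices because every polynomial of degree less than $d$ is a linear combination of monomials of degree less than $d$. All function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations, product
from math import prod

def cube(n):
    """All inputs in {-1, 1}^n."""
    return list(product((-1, 1), repeat=n))

def correlation(psi, f, n):
    return sum(psi(x) * f(x) for x in cube(n))

def l1_norm(psi, n):
    return sum(abs(psi(x)) for x in cube(n))

def pure_high_degree(psi, n):
    """Smallest |S| with sum_x psi(x) * prod_{i in S} x_i != 0 (n + 1 if psi is identically 0)."""
    for size in range(n + 1):
        for S in combinations(range(n), size):
            if sum(psi(x) * prod(x[i] for i in S) for x in cube(n)) != 0:
                return size
    return n + 1

n = 3

def parity(x):
    return prod(x)

def psi(x):
    # Assumed toy witness: parity scaled to have l1 norm 1.
    return Fraction(parity(x), 2**n)

print(correlation(psi, parity, n), l1_norm(psi, n), pure_high_degree(psi, n))
```

In this toy case the witness has correlation 1 with $f$, so it certifies $\mathsf {d}_\varepsilon (f) \geq 3$ for every $\varepsilon < 1$.
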

##### 1.3.2 Interlude Part 2: Applying The Method of Dual Polynomials To Block-Composed Functions

For any function $f \colon \{-1, 1\}^n \to \{-1, 1\}$, we can write an LP capturing the approximate degree of $f$. We can prove lower bounds on the approximate degree of $f$ by proving lower bounds on the optimal value of this LP. One way to do this is by writing down the Dual of the LP, and exhibiting a feasible solution to the Dual. By weak LP duality, the objective value of any feasible solution to the Dual is a lower bound on the optimal value of the Primal LP. Therefore, exhibiting such a feasible solution, which we call a dual witness, suffices to prove an approximate degree lower bound for $f$.

However, for any given dual witness, some work will be required to verify that the witness indeed meets the criteria imposed by the Dual constraints.

When the function $f$ is a block-wise composition of two functions, say $h$ and $g$, then we can try to construct a good dual witness for $f$ by looking at dual witnesses for each of $h$ and $g$, and combining them carefully, to get the dual witness for $h \circ g$.

The dual witness $\Phi$ constructed in Step 2a for $\mathsf {AND} \circ \mathsf {OR}$ is expressed below in terms of the dual witness of the inner $\mathsf {OR}$ function viz. $\Psi _{\mathsf {OR}}$ and the dual witness of the outer $\mathsf {AND}$, viz. $\Psi _{ \mathsf {AND} }$.

\begin{aligned} ~~~~(3) \Phi (x_1 \dots x_R) = \Psi _{ \mathsf {AND} }\left ( \cdots , \mathsf {sgn}(\Psi _{\mathsf {OR}}(x_i)), \cdots \right ) \cdot \prod _{i=1}^R| \Psi _{\mathsf {OR}}(x_i)|. \end{aligned}

This method of combining dual witnesses $\Psi _{\mathsf {AND}}$ for the “outer” function $\mathsf {AND}$ and $\Psi _{\mathsf {OR}}$ for the “inner” function $\mathsf {OR}$ is referred to in [BKT17, BT17] as dual block composition.
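
The following Python sketch implements Equation (3) literally for a toy instance with $R = 2$ blocks of $n = 2$ bits each. Both witnesses are hypothetical stand-ins chosen here for illustration (they are not the witnesses from [BKT17]): the outer one is a scaled parity, and the inner one is a function summing to zero with $\ell _1$ norm 1. Equation (3) as written carries no normalizing constant, so the sketch reports the $\ell _1$ norm of $\Phi$ rather than assuming it equals 1.

```python
from fractions import Fraction
from itertools import combinations, product
from math import prod

def sgn(v):
    return 1 if v >= 0 else -1

def block_compose(psi_outer, psi_inner, R, n):
    """Equation (3): Phi(x_1..x_R) = psi_outer(..., sgn(psi_inner(x_i)), ...) * prod_i |psi_inner(x_i)|."""
    def phi(x):  # x is a flat tuple of R * n entries in {-1, 1}
        blocks = [x[i * n:(i + 1) * n] for i in range(R)]
        inner = [psi_inner(b) for b in blocks]
        return psi_outer(tuple(sgn(v) for v in inner)) * prod(abs(v) for v in inner)
    return phi

def l1_norm(psi, m):
    return sum(abs(psi(x)) for x in product((-1, 1), repeat=m))

def pure_high_degree(psi, m):
    """Smallest |S| with sum_x psi(x) * prod_{i in S} x_i != 0 (m + 1 if psi is identically 0)."""
    points = list(product((-1, 1), repeat=m))
    for size in range(m + 1):
        for S in combinations(range(m), size):
            if sum(psi(x) * prod(x[i] for i in S) for x in points) != 0:
                return size
    return m + 1

# Hypothetical inner witness on 2 bits: sums to zero, l1 norm 1.
PSI_INNER = {(1, 1): Fraction(1, 2), (1, -1): Fraction(-1, 6),
             (-1, 1): Fraction(-1, 6), (-1, -1): Fraction(-1, 6)}

def psi_inner(block):
    return PSI_INNER[tuple(block)]

def psi_outer(s):
    # Hypothetical outer witness on 2 bits: scaled parity.
    return Fraction(s[0] * s[1], 4)

phi = block_compose(psi_outer, psi_inner, R=2, n=2)
print(pure_high_degree(psi_outer, 2), pure_high_degree(psi_inner, 2),
      pure_high_degree(phi, 4), l1_norm(phi, 4))
```

In this toy instance the pure high degree of $\Phi$ equals the product of the pure high degrees of the two witnesses, and rescaling $\Phi$ by the reciprocal of its $\ell _1$ norm would restore norm 1.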

##### 1.3.3 Interlude Part 3: Hamming Weight Decay Conditions

Step 2a of the proof of the $\mathsf {SURJ}$ lower bound from [BKT17] gave a dual witness $\Phi$ for $\mathsf {AND}_R \circ \mathsf {OR}_N$ (with $R=\Theta (N)$) that has pure high degree $\tilde {\Omega }(N^{3/4})$, and also satisfies Equations (4) and (5) below.

\begin{aligned} ~~~~(4) \sum _{|x|>N} {|\Phi (x)|} \ll (R \cdot N \cdot D)^{-D} \end{aligned}
\begin{aligned} ~~~~(5) \text {For all } t=0, \dots , N, \sum _{|x|=t} {|\Phi (x)|} \leq \frac {1}{15 \cdot (1+t)^2}. \end{aligned}

Equation (4) is a very strong “Hamming weight decay” condition: it shows that the total mass that $\Phi$ places on inputs of high Hamming weight is very small. Hamming weight decay conditions play an essential role in the lower bound analysis for $\mathsf {SURJ}$ from [BKT17]. In addition to Equations (4) and (5) themselves being Hamming weight decay conditions, [BKT17]’s proof that $\Phi$ satisfies Equations (4) and (5) exploits the fact that the dual witness $\Psi _{\mathsf {OR}}$ for $\mathsf {OR}$ can be chosen to simultaneously have pure high degree $N^{1/4}$, and to satisfy the following weaker Hamming weight decay condition:

Claim 5. There exist constants $c_1, c_2 > 0$ such that for all $t=0, \dots , N$,

\begin{aligned} ~~~~(6) \sum _{|x|=t} { |\Psi _{\mathsf {OR}}(x)|} \leq c_1 \cdot \frac {1}{(1+t)^2} \cdot \exp (-c_2 \cdot t/N^{1/4}). \end{aligned}

(We will not prove Claim 5 in these notes; we simply state it to highlight the importance of dual decay to the analysis of $\mathsf {SURJ}$.)

Dual witnesses satisfying various notions of Hamming weight decay have a natural primal interpretation: they witness approximate degree lower bounds for the target function ($\mathsf {AND}_R \circ \mathsf {OR}_N$ in the case of Equation (4), and $\mathsf {OR}_N$ in the case of Equation (6)) even when the approximation is allowed to be exponentially large on inputs of high Hamming weight. This primal interpretation of dual decay is formalized in the following claim.

Claim 6. Let $L(t)$ be any function mapping $\{0, 1, \dots , N\}$ to $\mathbb {R}_+$. Suppose $\Psi$ is a dual witness for $f$ satisfying the following properties:

• (Correlation): $\sum _{x \in \{-1,1\}^N }{\Psi (x) \cdot f(x)} > 1/3$.
• (Pure high degree): $\Psi$ has pure high degree $D$.
• (Dual decay): $\sum _{|x|=t} |\Psi (x)| \leq \frac {1}{5 \cdot (1+t)^2 \cdot L(t)}$ for all $t = 0, 1, \dots , N$.

Then there is no degree $D$ polynomial $p$ such that

\begin{aligned} ~~~~(7) |p(x)-f(x)| \leq L(t) \text { for all inputs } x \text { of Hamming weight } t \text {, for all } t = 0, 1, \dots , N.\end{aligned}

Proof. Let $p$ be any degree $D$ polynomial. Since $\Psi$ has pure high degree $D$, $\sum _{x \in \{-1, 1\}^N} p(x) \cdot \Psi (x)=0$.

We will now show that if $p$ satisfies Equation (7), then the other two properties satisfied by $\Psi$ (correlation and dual decay) together imply that $\sum _{x \in \{-1, 1\}^N} p(x) \cdot \Psi (x) >0$, a contradiction.

\begin{aligned} \sum _{x \in \{-1, 1\}^N} \Psi (x) \cdot p(x) &= \sum _{x \in \{-1, 1\}^N} \Psi (x) \cdot f(x) + \sum _{x \in \{-1, 1\}^N} \Psi (x) \cdot (p(x) - f(x))\\ &\geq 1/3 - \sum _{x \in \{-1, 1\}^N} |\Psi (x)| \cdot |p(x) - f(x)|\\ &\geq 1/3 - \sum _{t=0}^N \sum _{|x|=t} |\Psi (x)| \cdot L(t)\\ &\geq 1/3 - \sum _{t=0}^N \frac {1}{5 \cdot (1+t)^2 \cdot L(t)} \cdot L(t)\\ &= 1/3 - \sum _{t=0}^N \frac {1}{5 \cdot (1+t)^2} > 0. \end{aligned}

Here, Line 2 exploited that $\Psi$ has correlation at least $1/3$ with $f$, Line 3 exploited the assumption that $p$ satisfies Equation (7), and Line 4 exploited the dual decay condition that $\Psi$ is assumed to satisfy. $\square$
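
The final strict inequality holds for every $N$, because the partial sums are bounded by the full series $\sum _{s \geq 1} 1/s^2 = \pi ^2/6$:

\begin{aligned} \sum _{t=0}^{N} \frac {1}{5 \cdot (1+t)^2} < \frac {1}{5} \sum _{s=1}^{\infty } \frac {1}{s^2} = \frac {\pi ^2}{30} \approx 0.329 < \frac {1}{3}. \end{aligned}

This also indicates why the constant 5 appears in the dual decay condition: with a constant of 4, the same series bound would only give $\pi ^2/24 \approx 0.411$, which exceeds $1/3$.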

##### 1.3.4 Proof of Claim 4

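Proof. Claim 4 follows from Equations (4) and (5), combined with Claim 6. Specifically, apply Claim 6 to inputs of length $R \cdot N$ (so that $t$ ranges over $0, 1, \dots , R \cdot N$) with $f=\mathsf {AND}_R \circ \mathsf {OR}_N$, and

\begin{aligned}L(t) = \begin {cases} 1/3 & \text { if } t \leq N \\ (R \cdot N \cdot D)^{D} & \text { if } t > N. \end {cases}\end{aligned}

Equation (5) supplies the dual decay condition for $t \leq N$, and Equation (4) supplies it for $t > N$.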

$\square$

### 2 Generalizing the analysis for $\mathsf {SURJ}$ to prove a nearly linear approximate degree lower bound for $\mathsf {AC}^0$

Now we take a look at how to extend this kind of analysis for $\mathsf {SURJ}$ to obtain even stronger approximate degree lower bounds for other functions in $\mathsf {AC}^0$. Recall that $\mathsf {SURJ}$ can be expressed as an $\mathsf {AND}_R$ (over all range items $r \in [R]$) of the $\mathsf {OR}_N$ (over all inputs $i \in [N]$) of “Is input $x_i$ equal to $r$”? That is, $\mathsf {SURJ}$ simply evaluates $\mathsf {AND}_R \circ \mathsf {OR}_N$ on the inputs $(\dots , y_{j, i}, \dots )$ where $y_{j, i}$ indicates whether or not input $x_i$ is equal to range item $j \in [R]$.
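
The reformulation of $\mathsf {SURJ}$ as $\mathsf {AND}_R \circ \mathsf {OR}_N$ can be sketched in a few lines of Python. Two assumptions of this sketch: range items are numbered $0, \dots , R-1$ rather than $1, \dots , R$, and Boolean values are Python `bool`s rather than $\pm 1$.

```python
from itertools import product

def surj_direct(xs, R):
    """SURJ is true iff every range item in {0, ..., R-1} appears among the inputs xs."""
    return set(range(R)) <= set(xs)

def surj_via_and_or(xs, R):
    """Evaluate AND_R o OR_N on the indicator bits y[j][i] = (xs[i] == j)."""
    y = [[x == j for x in xs] for j in range(R)]
    return all(any(row) for row in y)

# Both formulations agree on every input with N = 4 and R = 3.
N, R = 4, 3
agree = all(surj_direct(xs, R) == surj_via_and_or(xs, R)
            for xs in product(range(R), repeat=N))
print(agree, surj_via_and_or((0, 2, 1, 2), R), surj_via_and_or((0, 0, 1, 1), R))
```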

Our analysis for $\mathsf {SURJ}$ can be viewed as follows: It is a way to turn the $\mathsf {AND}$ function on $R$ bits (which has approximate degree $\Theta \left (\sqrt {R}\right )$) into a function on close to $R$ bits, with polynomially larger approximate degree. Concretely, $\mathsf {SURJ}$ is defined on $N \log R$ bits, so if, say, $N = 100R$, it is a function on $100 R \log R$ bits. So, this function is on not much more than $R$ bits, but has approximate degree $\tilde {\Omega }(R^{3/4})$, polynomially larger than the approximate degree of $\mathsf {AND}_R$.

Hence, the lower bound for $\mathsf {SURJ}$ can be seen as a hardness amplification result. We turn the $\mathsf {AND}$ function on $R$ bits to a function on slightly more bits, but the approximate degree of the new function is significantly larger.

From this perspective, the lower bound proof for $\mathsf {SURJ}$ showed that in order to approximate $\mathsf {SURJ}$, we need to not only approximate the $\mathsf {AND}_R$ function, but, additionally, instead of feeding the inputs directly to the $\mathsf {AND}$ gate itself, we are further driving up the degree by feeding the inputs through $\mathsf {OR}_N$ gates. The intuition is that we cannot do much better than merely approximating the $\mathsf {AND}$ function and then approximating the block-composed $\mathsf {OR}_N$ gates. This additional approximation of the $\mathsf {OR}$ gates gives us the extra exponent in the approximate degree expression.

We will see two issues that come in the way of naive attempts at generalizing our hardness amplification technique from $\mathsf {AND}_R$ to more general functions.

#### 2.1 Interlude: Grover’s Algorithm

Grover’s algorithm [Gro96] is a quantum algorithm that finds with high probability the unique input to a black box function that produces a given output, using $O({\sqrt {N}})$ queries to the function, where $N$ is the size of the domain of the function. It was originally devised as a database search algorithm that searches an unsorted database of size $N$ and determines whether or not there is a record in the database that satisfies a given property in $O(\sqrt {N})$ queries. This is strictly better than deterministic and randomized query algorithms, which take $\Omega (N)$ queries in the worst case and in expectation, respectively. Grover’s algorithm is optimal up to a constant factor among quantum query algorithms.
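
For intuition about the $O(\sqrt {N})$ query count, the following Python sketch simulates Grover’s algorithm classically by tracking the $N$ real amplitudes directly. It is a toy simulation under the assumption of a single marked item; the function name and the choice $N = 64$ are illustrative, and a classical simulation like this of course takes time proportional to $N$ per iteration.

```python
import math

def grover_success_probability(N, marked, iterations):
    """Simulate Grover search on N items with one marked index; return Pr[measuring the marked item]."""
    amp = [1 / math.sqrt(N)] * N            # start in the uniform superposition
    for _ in range(iterations):
        amp[marked] = -amp[marked]          # oracle query: flip the sign of the marked amplitude
        mean = sum(amp) / N
        amp = [2 * mean - a for a in amp]   # diffusion step: reflect every amplitude about the mean
    return amp[marked] ** 2

N = 64
k = math.floor(math.pi / 4 * math.sqrt(N))  # roughly (pi / 4) * sqrt(N) oracle queries
print(k, grover_success_probability(N, marked=5, iterations=k))
```

With no iterations the success probability is $1/N$, the same as a uniformly random guess; after about $\frac {\pi }{4} \sqrt {N}$ iterations it is close to 1.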

#### 2.2 Issues: Why a dummy range item is necessary

In general, let us consider the problem of taking any function $f$ that does not have maximal approximate degree (say, with approximate degree $n^{1-\Omega (1)}$), and turning it into a function on roughly the same number of bits, but with polynomially larger approximate degree.

In analogy with how $\mathsf {SURJ}(x_1, \dots , x_N)$ equals $\mathsf {AND}_R \circ \mathsf {OR}_N$ evaluated on inputs $(\dots , y_{ji}, \dots )$, where $y_{ji}$ indicates whether or not $x_i=j$, we can consider the block composition $f_R \circ \mathsf {OR}_N$ evaluated on $(\dots , y_{ji}, \dots )$, and hope that this function has polynomially larger approximate degree than $f_R$ itself.

Unfortunately, this does not work. Consider for example the case