Special Topics in Complexity Theory: class is over :-(

I put together in a single file all the lectures given by me. On the class webpage you can also find the scribes of the two guest lectures, and the students’ presentations. Many thanks to Matthew Dippel, Xuangui Huang, Chin Ho Lee, Biswaroop Maiti, Tanay Mehta, Willy Quach, and Giorgos Zirdelis for doing an excellent job scribing these lectures. (And for giving me perfect teaching evaluations. Though I am not sure if I biased the sample. It went like this. One day I said: “Please fill the student evaluations, we need 100%.” A student said: “100% what?  Participation or score?” I meant participation but couldn’t resist replying jokingly “both.”) Finally, thanks also to all the other students, postdocs, and faculty who attended the class and created a great atmosphere.


Special Topics in Complexity Theory, Lecture 19

Special Topics in Complexity Theory, Fall 2017. Instructor: Emanuele Viola

1 Lecture 19, Guest lecture by Huacheng Yu, Scribe: Matthew Dippel

Guest lecture by Huacheng Yu on dynamic data structure lower bounds, for the 2D range query and 2D range parity problems. Thanks to Huacheng for giving this lecture and for feedback on the write-up.

What is covered.

  • Overview of Larsen’s lower bound for 2D range counting.
  • Extending these techniques to prove an \Omega (\log ^{1.5}n / (\log \log n)^3) lower bound for 2D range parity.

2 Problem definitions

Definition 1. 2D range counting

Give a data structure D that maintains a weighted set of 2-dimensional points with integer coordinates and supports the following operations:

  1. UPDATE: Add a (point, weight) tuple to the set.
  2. QUERY: Given a query point (x, y), return the sum of weights of points (x', y') in the set satisfying x' \leq x and y' \leq y.

Definition 2. 2D range parity

Give a data structure D that maintains an unweighted set of 2-dimensional points with integer coordinates and supports the following operations:

  1. UPDATE: Add a point to the set.
  2. QUERY: Given a query point (x, y), return the parity of the number of points (x', y') in the set satisfying x' \leq x and y' \leq y.

Both of these definitions extend easily to the d-dimensional case, but we state the 2D versions as we will mainly work with those.

2.1 Known bounds

All upper bounds assume the RAM model with word size \Theta (\log n).

Upper bounds. Using range trees, we can create a data structure for d-dimensional range counting, with all update and query operations taking O(\log ^d n) time. With extra tricks, we can make this work for d-dimensional range parity with operations running in time O((\log n / \log \log n)^d).
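
For concreteness, here is a minimal Python sketch of the d = 2 case using a two-dimensional binary indexed (Fenwick) tree, a close relative of range trees. It assumes coordinates are already mapped into [1, n]; it uses O(n^2) space, whereas range trees achieve the same O(\log ^2 n) operation time with near-linear space. The class and variable names are ours.

    class Fenwick2D:
        # 2D binary indexed (Fenwick) tree over coordinates in [1, n]:
        # point updates and dominance (prefix-rectangle) queries,
        # each touching O(log^2 n) cells.
        def __init__(self, n):
            self.n = n
            self.tree = [[0] * (n + 1) for _ in range(n + 1)]

        def update(self, x, y, w):
            # Add a point (x, y) with weight w.
            i = x
            while i <= self.n:
                j = y
                while j <= self.n:
                    self.tree[i][j] += w
                    j += j & (-j)
                i += i & (-i)

        def query(self, x, y):
            # Sum of the weights of points (x', y') with x' <= x and y' <= y.
            total = 0
            i = x
            while i > 0:
                j = y
                while j > 0:
                    total += self.tree[i][j]
                    j -= j & (-j)
                i -= i & (-i)
            return total

    ds = Fenwick2D(8)
    ds.update(2, 3, 5)
    ds.update(4, 1, 2)
    print(ds.query(4, 3))      # 7: both points are dominated by (4, 3)
    print(ds.query(4, 3) % 2)  # 1: the corresponding range-parity answer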

Lower bounds. There are a series of works on lower bounds:

  • Fredman, Saks ’89 – 1D range parity requires \Omega (\log n / \log \log n).
  • Patrascu, Demaine ’04 – 1D range counting requires \Omega (\log n).
  • Larsen ’12 – 2D range counting requires \Omega ((\log n / \log \log n)^2).
  • Larsen, Weinstein, Yu ’17 – 2D range parity requires \Omega (\log ^{1.5} n / (\log \log n)^3).

This lecture presents the recent results of [Larsen ’12] and [Larsen, Weinstein, Yu ’17]. They both use the same general approach:

  1. Show that, for an efficient approach to exist, the problem must demonstrate some property.
  2. Show that the problem doesn’t have that property.

3 Larsen’s technique

All lower bounds are in the cell probe model with word size \Theta (\log n).

We consider a general data structure problem, where we require a structure D that supports updates and queries of an unspecified nature. We further assume that there exists an efficient solution with update and query times o((\log n / \log \log n)^2). We will restrict our attention to operation sequences of the form u_1, u_2, \cdots , u_n, q. That is, a sequence of n updates followed by a single query q. We fix a distribution over such sequences, and show that the problem is still hard.

3.1 Chronogram method [FS89]

We divide the updates into r epochs, so that our sequence becomes:

\begin{aligned}U_r, U_{r-1}, \cdots , U_1, q\end{aligned}

where |U_i| = \beta ^i and \beta = \log ^5 n. The epochs are multiplicatively shrinking. With this requirement, we have that r = \Theta (\log n / \log \log n).
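
To see where the value of r comes from, note that the epoch sizes must account for all n updates:

\begin{aligned}n = \sum _{i=1}^{r} \beta ^i = \Theta (\beta ^r) \quad \Longrightarrow \quad r = \Theta \left (\frac {\log n}{\log \beta }\right ) = \Theta \left (\frac {\log n}{5 \log \log n}\right ) = \Theta \left (\frac {\log n}{\log \log n}\right ).\end{aligned}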

Let M be the set of all memory cells used by the data structure when run on the sequence of updates. Further, let A_i be the set of memory cells which are accessed by the structure at least once during U_i, and never again in a later epoch (i.e., never during U_{i-1}, \cdots , U_1, which come after U_i in time).

Claim 1. The sets A_r, A_{r-1}, \cdots , A_1 are pairwise disjoint.

Claim 2. There exists an epoch i such that D probes o(\log n / \log \log n) cells from A_i when answering the query at the end. Note that this is simply our query time divided by the number of epochs. In other words, D can’t afford to read \Omega (\log n / \log \log n) cells from each A_i set without breaking its promise on the query run time.

Claim 2 implies that there is an epoch i which has the smallest effect on the final answer. We will call this the “easy” epoch.

Idea: The set A_i contains “most” of the information about U_i among all memory cells in M. Also, A_r, A_{r-1}, \cdots , A_{i+1} are not accessed after epoch i + 1, and hence should contain no information about the updates in U_i. The sets A_{i-1}, A_{i-2}, \cdots , A_1 correspond to progressively shrinking epochs, so their total size is small:

\begin{aligned}\sum _{j < i}|A_j| \leq O(\beta ^{i - 1}) \cdot \log ^2 n\end{aligned}

3.2 Communication game

Having set up the framework for how to analyze the data structure, we now introduce a communication game in which two parties attempt to solve an identical problem. We will show that an efficient data structure implies an efficient solution to this communication game. If the message is smaller than the entropy of the updates of epoch i (conditioned on preceding epochs), this gives an information-theoretic contradiction. The trick is to find a way for the encoder to exploit the small number of probed cells to send a short message.

The game. The game consists of two players, Alice and Bob, who must jointly compute a single query after a series of updates. The model is as follows:

  • Alice has all of the update epochs U_r, U_{r-1}, ... U_1. She also has an index i, which still corresponds to the “easy” epoch as defined above.
  • Bob has all update epochs EXCEPT for U_i. He also has a random query q. He is aware of the index i.
  • Communication can only occur in a single direction, from Alice to Bob.
  • We assume some fixed input distribution \mathcal {D}.
  • They win this game if Bob successfully computes the correct answer for the query q.

Then we will show the following generic theorem, relating this communication game to data structures for the corresponding problem:

Theorem 3. If there is a data structure with update time t_u and probes t cells from A_i in expectation when answering the final query q, then the communication game has an efficient solution, with O(p|U_i|t_u\log n + \beta ^{i-1}t_u\log n ) communication cost, and success probability at least p^t. This holds for any choice of 0 < p < 1.

Before we prove the theorem, we consider specific parameters for our problem. If we pick

\begin{aligned} p &= 1 / \log ^5n, \\ t_u &= \log ^2 n, \\ t &= o(\log n / \log \log n), \end{aligned}

then, after plugging in the parameters, the communication cost is |U_i| / \log ^2 n. Note that we could always trivially achieve O(|U_i| \log n) (each update takes O(\log n) bits to describe) by having Alice send Bob all of U_i, so that he can compute the answer with no uncertainty. The success probability is (\log ^{-5} n)^{o(\log n / \log \log n)}, which simplifies to 2^{-o(\log n)} = 1 / n^{o(1)}. This is significantly better than 1 / n^{O(1)}, which could be achieved trivially by having Bob output a random answer to the query, independent of the updates.
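
For concreteness, here is the plug-in calculation behind both statements, using |U_i| = \beta ^i and \beta = \log ^5 n:

\begin{aligned} p\,|U_i|\,t_u \log n &= \frac {1}{\log ^5 n} \cdot |U_i| \cdot \log ^2 n \cdot \log n = \frac {|U_i|}{\log ^2 n}, \\ \beta ^{i-1} t_u \log n &= \frac {|U_i|}{\log ^5 n} \cdot \log ^2 n \cdot \log n = \frac {|U_i|}{\log ^2 n}, \\ p^t &= \left (\log ^{-5} n\right )^{o(\log n / \log \log n)} = 2^{-o(\log n)} = 1/n^{o(1)}. \end{aligned}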

Proof.

We assume we have a data structure D for the update / query problem. Then Alice and Bob will proceed as follows:

Alice’s steps.

  1. Simulate D on U_r, U_{r - 1}, ... U_1. While doing so, keep track of memory cell accesses and compute A_r, A_{r-1}, ... A_1.
  2. Sample a random subset C \subset A_i, such that |C| = p|A_i|.
  3. Send C \cup A_{i-1} \cup A_{i-2} \cup ... A_1.

We note that in Alice’s Step 3, to send a cell, she sends a tuple holding the cell ID and the cell state before the query was executed. Also note that she doesn’t indicate to Bob which cells are in which sets of the union.

Bob’s steps.

  1. Receive C' from Alice.
  2. Simulate D on epochs U_{r}, U_{r-1}, ... U_{i+1}. Snapshot the current memory state of the data structure as M.
  3. Simulate the query algorithm. Every time q attempts to probe cell c, Bob checks if c \in C'. If it is, he lets D probe from C'. Otherwise, he lets D probe from M.
  4. Bob returns the result from the query algorithm as his answer.

If the query algorithm does not probe any cell in A_i - C, then Bob succeeds, as he can exactly simulate the data structure on the query. Since the query probes t cells in A_i, and C is a random subset of A_i of size p|A_i|, the probability that all of the probed cells of A_i land in C is at least p^t. The communication cost is the cost of Alice sending the cells to Bob, which is

\begin{aligned} (p|A_i| + \sum _{j < i}|A_j|) \cdot O(\log n) \leq (p\,|U_i|\,t_u + \beta ^{i-1}t_u) \cdot O(\log n),\end{aligned}

using that |A_i| \leq |U_i| \cdot t_u, that \sum _{j < i}|A_j| = O(\beta ^{i-1}) \cdot t_u, and that each cell is described with O(\log n) bits.

\square

4 Extension to 2D Range Parity

The extension to 2D range parity proceeds in nearly identical fashion, with a similar theorem relating data structures to communication games.

Theorem 1. Consider an arbitrary data structure problem where queries have 1-bit outputs. If there exists a data structure having:

  • update time t_u
  • query time t_q
  • Probes t cells from A_i when answering the last query q

Then there exists a protocol for the communication game with O(p|U_i|t_u\log n + t_u\beta ^{i-1}\log n ) bits of communication and success probability at least 1/2 + 2^{-O(\sqrt {t_q \, t \, \log ^3 (1/p)})}, for any choice of 0 < p < 1. Again, we plug in the parameters for 2D range parity. If we set

\begin{aligned} t_u = t_q &= o(\log ^{1.5}n / (\log \log n)^2), \\ t = t_q / r &= o(\log ^{1/2} n / \log \log n), \\ p &= 1 / \log ^5 n, \end{aligned}

then the cost is |U_i| / \log ^2 n, and the probability simplifies to 1/2 + 1 / n^{o(1)}.
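
The communication cost simplifies exactly as in the previous section, and the exponent in the success probability simplifies as

\begin{aligned}\sqrt {t_q \, t \, \log ^3 (1/p)} = \sqrt {o\left (\frac {\log ^{1.5} n}{(\log \log n)^2}\right ) \cdot o\left (\frac {\log ^{1/2} n}{\log \log n}\right ) \cdot \Theta \left ((\log \log n)^3\right )} = o(\log n),\end{aligned}

so that 2^{-O(\sqrt {t_q \, t \, \log ^3 (1/p)})} = 1/n^{o(1)}.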

We note that if we had Q = n^{O(1)} different queries, then by randomly guessing on all of them, with constant probability we would be correct on as many as Q/2 \pm O(\sqrt {Q}). In this case, the probability of being correct on a single one, amortized, is 1/2 + 1/n^{\Theta (1)}.

Proof. The communication protocol will be slightly adjusted. We assume an a priori distribution on the updates and queries. Bob will then compute the posterior distribution, based on what he knows and what Alice sends him. He then computes the maximum likelihood answer to the query q. We thus need to figure out what Alice can send, so that the answer to q is often biased towards either 1 or 0.

We assume the existence of some public randomness available to both Alice and Bob. Then we adjust the communication protocol as follows:

Alice’s modified steps.

  • Alice samples, using the public randomness, a subset of ALL memory cells M_2, such that each cell is sampled with probability p. Alice sends M_2 \cap A_i to Bob. Since Bob can mimic the sampling, he gains additional information about which cells are and aren’t in A_i.

Bob’s modified steps.

  • Denote by S the set of memory cells probed by the data structure when Bob simulates the query algorithm. That is, S is what Bob “thinks” D will probe during the query; the actual set of probed cells may differ, since the data structure has full knowledge of the updates (which Bob lacks) and may use that information to determine what to probe. Bob will use S to compute the posterior distribution.

Define the function f_{C'}(z) \colon ([2^w])^{|S|} \rightarrow \mathbb {R} to be the “bias” when the cells in S take on the values z. In particular, this function is conditioned on the C' that Bob receives from Alice. We can then write the definition of f as

\begin{aligned} f_{C'}(z) &:= (\text {Pr}[\text {ans to q } = 1 | C', S \leftarrow z] - 1/2) \cdot \text {Pr}[S \leftarrow z | C'] \end{aligned}

In particular, f has the following two properties:

  1. \sum _z |f(z)| \leq 1
  2. \mathbb {E}_{C'}[\max _z |f(z)|] \geq 1/2 \cdot p^t

In these statements, the expectation is over everything that Bob knows, and the probabilities are also conditioned on everything that Bob knows. The randomness comes from what he doesn’t know. We also note that when the query probes no cells in A_i - C', then the bias has magnitude 1/2, since the posterior distribution puts all its weight on the correct answer to the query.

Finishing the proof requires the following lemma:

Lemma 2. For any f with the above two properties, there exists a Y \subseteq S such that |Y| \leq O(\sqrt {|S| \log (1/p^t)}) and

\begin{aligned} \sum _{y \in Y} \left |\sum _{z | y} f(z) \right | &\geq 2^{-O(\sqrt {|S| \log (1 / p^t)})}. \end{aligned}

Note that the sum inside the absolute values is the bias when Y \leftarrow y. \square

References

[FS89]   Michael L. Fredman and Michael E. Saks. The cell probe complexity of dynamic data structures. In ACM Symp. on the Theory of Computing (STOC), pages 345–354, 1989.

Special Topics in Complexity Theory, Lecture 18

Special Topics in Complexity Theory, Fall 2017. Instructor: Emanuele Viola

1 Lecture 18, Scribe: Giorgos Zirdelis

In this lecture we study lower bounds on data structures. First, we define the setting. We have n bits of data, stored in s bits of memory (the data structure) and want to answer m queries about the data. Each query is answered with d probes. There are two types of probes:

  • bit-probes, which return one bit from the memory, and
  • cell-probes, in which the memory is divided into cells of \log n bits, and each probe returns one cell.

The queries can be adaptive or non-adaptive. In the adaptive case, the data structure probes locations which may depend on the answers to previous probes. For bit-probes this means that a query is answered by a depth-d decision tree.

Finally, there are two types of data structure problems:

  • The static case, in which we map the data to the memory arbitrarily and afterwards the memory remains unchanged.
  • The dynamic case, in which we have update queries that change the memory and also run in bounded time.

In this lecture we focus on the non-adaptive, bit-probe, and static setting. Some trivial extremes for this setting are the following. Any problem (i.e., collection of queries) admits data structures with the following parameters:

  • s=m and d=1, i.e. you write down all the answers, and
  • s=n and d=n, i.e. you can always answer a query about the data if you read the entire data.

Next, we review the best current lower bound, a bound proved in the 80’s by Siegel [Sie04] and rediscovered later. We state and prove the lower bound in a different way. The lower bound is for the problem of k-wise independence.

Problem 1. The data is a seed of size n=k \log m for a k-wise independent distribution over \{0,1\}^m. A query i is defined to be the i-th bit of the sample.

The question is: if we allow a little more space than seed length, can we compute such distributions fast?

Theorem 2. For the above problem with k=m^{1/3} it holds that

\begin{aligned} d \geq \Omega \left ( \frac {\lg m}{\lg (s/n)} \right ). \end{aligned}

It follows that if s=O(n) then d = \Omega (\lg m). But if s=n^{1+\Omega (1)} then nothing is known.

Proof. Let p=1/m^{1/(4d)}. We have a memory of s bits and we are going to subsample it. Specifically, we will keep each of the s memory bits independently with probability p.

The intuition is that we will shrink the memory but still answer a lot of queries, and derive a contradiction because of the seed length required to sample k-wise independence.

For the “shrinking” part we have the following. We expect to keep p\cdot s memory bits. By a Chernoff bound, it follows that we keep O(p\cdot s) bits except with probability 2^{-\Omega (p \cdot s)}.

For the “answer a lot of queries” part, recall that each query probes d bits from the memory. We keep one of the m queries if it so happens that we keep all the d bits that it probed in the memory. For a fixed query, the probability that we keep all its d probes is p^d = 1/m^{1/4}.

We claim that with probability at least 1/m^{O(1)}, we keep at least \sqrt {m} queries. This follows by Markov’s inequality. The expected number of queries that we do not keep is m - m^{3/4}. By Markov’s inequality, the probability that the number of queries we do not keep is at least m - \sqrt {m} is at most (m - m^{3/4})/(m-\sqrt {m}) = 1 - \Omega (1/m^{1/4}), and the claim follows.

Thus, if 2^{-\Omega (p\cdot s)} is smaller than the 1/m^{O(1)} probability above, then there exists a fixed choice of memory bits to keep that achieves both the “shrinking” part and the “answer a lot of queries” part. This is indeed the case because s \geq n > m^{1/3}, and so p \cdot s \ge m^{-1/4 + 1/3} = m^{\Omega (1)}. But now we have O(p \cdot s) bits of memory while still answering as many as \sqrt {m} queries.

The minimum seed length to answer that many queries while maintaining k-wise independence is k \log \sqrt {m} = \Omega (k \lg m) = \Omega (n). Therefore the memory has to be at least as big as the seed. This yields

\begin{aligned} O(ps) \ge \Omega (n) \end{aligned}

from which the result follows. \square

This lower bound holds even if the s memory bits are filled arbitrarily (rather than having entropy at most n). It can also be extended to adaptive cell probes.

We will now show a conceptually simple data structure which nearly matches the lower bound. Pick a random bipartite graph with s nodes on the left (the memory bits) and m nodes on the right (the queries). Every node on the right side has degree d. We answer each query with the XOR of its d neighboring memory bits. By the Vazirani XOR lemma, it suffices to show that for any subset S \subseteq [m] of at most k queries, the XOR of the corresponding answer bits is unbiased. Hence it suffices that every subset S \subseteq [m] with |S| \leq k has a unique neighbor, i.e., a memory bit adjacent to exactly one query in S. For that, in turn, it suffices that S has a neighborhood of size greater than \frac {d |S|}{2} (because if every element in the neighborhood of S had at least two neighbors in S, then S would have a neighborhood of size at most d|S|/2). We pick the graph at random and show by standard calculations that it has this property with non-zero probability.

\begin{aligned} & \Pr \left [ \exists S \subseteq [m], |S| \leq k, \textrm { s.t. } |\mathsf {neighborhood}(S)| \leq \frac {d |S|}{2} \right ] \\ & = \Pr \left [ \exists S \subseteq [m], |S| \leq k, \textrm { and } \exists T \subseteq [s], |T| \leq \frac {d|S|}{2} \textrm { s.t. all neighbors of S land in T} \right ] \\ & \leq \sum _{i=1}^k \binom {m}{i} \cdot \binom {s}{d \cdot i/2} \cdot \left (\frac {d \cdot i}{2s}\right )^{d \cdot i} \\ & \leq \sum _{i=1}^k \left (\frac {e \cdot m}{i}\right )^i \cdot \left (\frac {e \cdot s} {d \cdot i/2}\right )^{d\cdot i/2} \cdot \left (\frac {d \cdot i}{2s}\right )^{d \cdot i} \\ & = \sum _{i=1}^k \left (\frac {e \cdot m}{i}\right )^i \cdot \left (\frac {e \cdot d \cdot i/2}{s}\right )^{d \cdot i/2} \\ & = \sum _{i=1}^k \left [ \underbrace { \frac {e \cdot m}{i} \cdot \left (\frac {e \cdot d \cdot i/2}{s}\right )^{d/2} }_{C} \right ]^{i}. \end{aligned}

It suffices to have C \leq 1/2, so that the probability is strictly less than 1, because \sum _{i=1}^{k} 1/2^i = 1-2^{-k}. We can match the lower bound in two settings:

  • if s=m^{\epsilon } for some constant \epsilon , then d=O(1) suffices,
  • if s=O(k \cdot \log m), then d=O(\lg m) suffices.

Remark 3. It is enough if the memory is (d\cdot k)-wise independent as opposed to completely uniform, so one can have n = d \cdot k \cdot \log s. An open question is if you can improve the seed length to optimal.
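
Here is a minimal Python sketch of this construction (the function names are ours, and the code only illustrates the mechanics; it does not verify the unique-neighbor property, which is what the calculation above establishes with non-zero probability):

    import random

    def sample_graph(s, m, d, rng):
        # Random bipartite graph: each of the m queries (right nodes)
        # is connected to d distinct memory bits (left nodes).
        return [rng.sample(range(s), d) for _ in range(m)]

    def answer_query(memory, graph, i):
        # A query is answered non-adaptively with d bit-probes,
        # XORing the probed memory bits.
        result = 0
        for cell in graph[i]:
            result ^= memory[cell]
        return result

    rng = random.Random(0)
    s, m, d = 64, 1024, 8
    graph = sample_graph(s, m, d, rng)
    memory = [rng.randrange(2) for _ in range(s)]   # fill the memory with random bits
    outputs = [answer_query(memory, graph, i) for i in range(m)]
    print(sum(outputs), "ones among", m, "query answers")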

As remarked earlier the lower bound does not give anything when s is much larger than n. In particular it is not clear if it rules out d=2. Next we show a lower bound which applies to this case.

Problem 4. Take n bits to be a seed for a 1/100-biased distribution over \{0,1\}^m. The queries, like before, are the bits of that distribution. Recall that n=O(\lg m).

Theorem 5. You need s = \Omega (m).

Proof. Every query is answered by looking at d=2 bits. But t = \Omega (m) queries are answered by the same 2-bit function f of their probes (because there is only a constant number of functions on 2 bits). There are two cases for f:

  1. f is linear (or affine). Suppose for the sake of contradiction that t>s. Then there is a linear dependence among these queries, because the space of linear functions on s bits has dimension s. This means that the XOR of some nonempty subset of the queries is constant. This in turn contradicts the assumption that the distribution has small bias.
  2. f is AND (up to negating the input variables or the output). In this case, we keep collecting queries as long as each new query probes at least one new memory bit. If t > s, then when we stop there is a query left both of whose probes read bits that have already been probed. This means that there exist two queries q_1 and q_2 whose probes cover the probes of a third query q_3. This in turn implies that the queries are not close to uniform. That is because there exist answers to q_1 and q_2 that fix the bits probed by them, and so also fix the bits probed by q_3, and hence the answer to q_3. But this contradicts the small bias of the distribution.

\square

References

[Sie04]   Alan Siegel. On universal classes of extremely random constant-time hash functions. SIAM J. on Computing, 33(3):505–543, 2004.

Special Topics in Complexity Theory, Lectures 16-17

Special Topics in Complexity Theory, Fall 2017. Instructor: Emanuele Viola

1 Lectures 16-17, Scribe: Tanay Mehta

In these lectures we prove the corners theorem for pseudorandom groups, following Austin [Aus16]. Our exposition has several non-major differences with that in [Aus16], which may make it more computer-science friendly. The instructor suspects a proof can also be obtained via certain local modifications and simplifications of Green’s exposition [Gre05b, Gre05a] of an earlier proof for the abelian case. We focus on the case G = \textit {SL}_2(q) for simplicity, but the proof immediately extends to other pseudorandom groups.

Theorem 1. Let G = \textit {SL}_2(q). Every subset A \subseteq G^2 of density \mu (A) \geq 1/\log ^a |G| contains a corner, i.e., a set of the form \{(x, y), (xz, y), (x, zy)\} with z \neq 1.

1.1 Proof Overview

For intuition, suppose A is a product set, i.e., A = B \times C for B, C \subseteq G. Let’s look at the quantity

\begin{aligned}\mathbb {E}_{x, y, z \leftarrow G}[A(x, y) A(xz, y) A(x, zy)]\end{aligned}

where A(x, y) = 1 iff (x, y) \in A. Note that the random variable in the expectation is equal to 1 exactly when x, y, z form a corner in A. We’ll show that this quantity is greater than 1/|G|, which implies that A contains a corner with z \neq 1 (the terms with z = 1 contribute at most 1/|G| to the expectation). Since we are taking A = B \times C, we can rewrite the above quantity as

\begin{aligned} & \mathbb {E}_{x, y, z \leftarrow G}[B(x)C(y) B(xz)C(y) B(x)C(zy)] \\ & = \mathbb {E}_{x, y, z \leftarrow G}[B(x)C(y) B(xz)C(zy)] \\ & = \mathbb {E}_{x, y, z \leftarrow G}[B(x)C(y) B(z)C(x^{-1}zy)] \end{aligned}

where the last line follows by replacing z with x^{-1}z in the uniform distribution. If \mu (A) \ge \delta , then \mu (B) \ge \delta and \mu (C) \ge \delta . Condition on x \in B, y \in C, z \in B. Then the distribution x^{-1}zy is a product of three independent distributions, each uniform on a set of measure greater than \delta . By pseudorandomness x^{-1}zy is 1/|G|^{\Omega (1)} close to uniform in statistical distance. This implies that the above quantity equals

\begin{aligned} & \mu (B) \cdot \mu (C) \cdot \mu (B) \cdot \left (\mu (C) \pm \frac {1}{|G|^{\Omega (1)}}\right )\\ & \geq \delta ^3 \left ( \delta - \frac {1}{|G|^{\Omega (1)}} \right ) \\ & \geq \delta ^4 /2 \\ & > 1/|G|. \end{aligned}

Given this, it is natural to try to write an arbitrary A as a combination of product sets (with some error). We will make use of a more general result.

1.2 Weak Regularity Lemma

Let U be some universe (we will take U = G^2). Let f:~U \rightarrow [-1,1] be a function (for us, f = 1_A). Let D \subseteq \{d: U \rightarrow [-1,1]\} be some set of functions, which can be thought of as “easy functions” or “distinguishers.”

Theorem 2.[Weak Regularity Lemma] For all \epsilon > 0, there exists a function g := \sum _{i \le s} c_i \cdot d_i where d_i \in D, c_i \in \mathbb {R} and s = 1/\epsilon ^2 such that for all d \in D

\begin{aligned}\mathbb {E}_{x \leftarrow U}[f(x) \cdot d(x)] = \mathbb {E}_{x \leftarrow U}[g(x) \cdot d(x)] \pm \epsilon .\end{aligned}

The lemma is called ‘weak’ because it came after Szemerédi’s regularity lemma, which has a stronger distinguishing conclusion. However, the lemma is also ‘strong’ in the sense that Szemerédi’s regularity lemma has s as a tower of 1/\epsilon whereas here we have s polynomial in 1/\epsilon . The weak regularity lemma is also simpler. There also exists a proof of Szemerédi’s theorem (on arithmetic progressions), which uses weak regularity as opposed to the full regularity lemma used initially.

Proof. We will construct the approximation g through an iterative process producing functions g_0, g_1, \dots , g. We will show that ||f - g_i||_2^2 decreases by \ge \epsilon ^2 each iteration.

  1. Start: Define g_0 = 0 (which can be realized setting c_0 = 0).
  2. Iterate: If not done, there exists d \in D such that |\mathbb {E}[(f - g) \cdot d]| > \epsilon . Assume without loss of generality \mathbb {E}[(f - g) \cdot d] > \epsilon .
  3. Update: g' := g + \lambda d where \lambda \in \mathbb {R} shall be picked later.

Let us analyze the progress made by the algorithm.

\begin{aligned} ||f - g'||_2^2 &~ = \mathbb {E}_x[(f - g')^2(x)] \\ &~ = \mathbb {E}_x[(f - g - \lambda d)^2(x)] \\ &~ = \mathbb {E}_x[(f - g)^2] + \mathbb {E}_x[\lambda ^2 d^2 (x)] - 2\mathbb {E}_x[(f - g)\cdot \lambda d(x)] \\ &~ \leq ||f - g||_2^2 + \lambda ^2 - 2\lambda \mathbb {E}_x[(f-g)d(x)] \\ &~ \leq ||f - g||_2^2 + \lambda ^2 - 2\lambda \epsilon \\ &~ \leq ||f-g||_2^2 - \epsilon ^2 \end{aligned}

where the last line follows by taking \lambda = \epsilon . Therefore, there can only be 1/\epsilon ^2 iterations because ||f - g_0||_2^2 = ||f||_2^2 \leq 1. \square
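
The argument is constructive, and over a small universe with an explicit finite class of distinguishers it can be run as-is. Below is a minimal Python sketch of the iteration on a toy instance; the names and the toy choice of f and distinguishers are ours, not from [Aus16].

    def weak_regularity(f, dists, universe, eps):
        # Iteratively build g = sum_i coeffs[i] * dists[i] such that
        # |E[(f - g) d]| <= eps for every distinguisher d.  The potential
        # ||f - g||_2^2 drops by eps^2 per step, so at most 1/eps^2 steps occur.
        coeffs = [0.0] * len(dists)

        def g(x):
            return sum(c * d(x) for c, d in zip(coeffs, dists))

        def corr(d):
            # E_x[(f - g)(x) * d(x)] under the uniform distribution on the universe.
            return sum((f(x) - g(x)) * d(x) for x in universe) / len(universe)

        for _ in range(int(1 / eps ** 2) + 1):
            i, c = max(enumerate(corr(d) for d in dists), key=lambda t: abs(t[1]))
            if abs(c) <= eps:
                break                               # every distinguisher is eps-fooled
            coeffs[i] += eps if c > 0 else -eps     # the update g' = g + lambda*d, lambda = +/- eps
        return coeffs

    # Toy run: U = {0,...,15}, f(x) = +/-1 according to the lowest bit of x,
    # and the distinguishers are the four individual "bit" functions.
    U = range(16)
    f = lambda x: 1.0 if x & 1 else -1.0
    dists = [lambda x, b=b: 1.0 if (x >> b) & 1 else -1.0 for b in range(4)]
    print(weak_regularity(f, dists, U, 0.1))   # approximately [1.0, 0.0, 0.0, 0.0]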

1.3 Getting more for rectangles

Returning to the lower bound proof, we will use the weak regularity lemma to approximate the indicator function of an arbitrary A by rectangles. That is, we take D to be the collection of indicator functions of all sets of the form S \times T for S, T \subseteq G. The weak regularity lemma gives us an approximation of A by a linear combination of rectangles. These rectangles may overlap. However, we ideally want to approximate A by a linear combination of non-overlapping rectangles.

Claim 3. Given a decomposition of A into rectangles from the weak regularity lemma with s functions, there exists a decomposition with 2^{O(s)} rectangles which don’t overlap.

Proof. Exercise. \square

In the above decomposition, note that it is natural to take the coefficients of rectangles to be the density of points in A that are in the rectangle. This gives rise to the following claim.

Claim 4. The weights of the rectangles in the above claim can be the average of f in the rectangle, at the cost of doubling the distinguisher error.

Consequently, we have that f = g + h, where g is the sum of 2^{O(s)} non-overlapping rectangles S \times T with coefficients \Pr _{(x, y) \in S \times T}[f(x, y) = 1].

Proof. Let g be the partition decomposition with arbitrary weights coming from the previous claim; re-expressing the approximation does not change it as a function, so |\mathbb {E}[(f-g)d]| \leq \epsilon for every rectangle distinguisher d \in D. Let g' be the partition decomposition with weights equal to the average of f on each rectangle. It is enough to show that for all d \in D

\begin{aligned}|\mathbb {E}[(g-g')d]| \leq \epsilon ,\end{aligned}

since then, by the triangle inequality,

\begin{aligned}|\mathbb {E}[(f-g')d]| \leq |\mathbb {E}[(f-g)d]| + |\mathbb {E}[(g-g')d]| \leq 2\epsilon .\end{aligned}

To bound |\mathbb {E}[(g-g')d]|, note that this quantity is maximized by a d that respects the decomposition into non-overlapping rectangles, i.e., a d that is a union of some of the rectangles of the decomposition. This can be argued using that, unlike f, the values of g and g' are constant on each rectangle S\times T of the decomposition. But for such a d, the average of g' over its support equals the average of f there, so \mathbb {E}[(g-g')d] = \mathbb {E}[(g-f)d], whose magnitude is at most \epsilon . \square

We need to get a little more from this decomposition. The conclusion of the regularity lemma holds with respect to distinguishers that can be written as U(x) \cdot V(y) where U and V map G \to \{0,1\}. We need the same guarantee for U and V with range [-1,1]. This can be accomplished paying only a constant factor in the error, as follows. Let U and V have range [-1,1]. Write U = U_+ - U_- where U_+ and U_- have range [0,1], and the same for V. The error for distinguisher U \cdot V is at most the sum of the errors for distinguishers U_+ \cdot V_+, U_+ \cdot V_-, U_- \cdot V_+, and U_- \cdot V_-. So we can restrict our attention to distinguishers U(x) \cdot V(y) where U and V have range [0,1]. In turn, a function U(x) with range [0,1] can be written as an expectation \mathbb{E} _a U_a(x) for functions U_a with range \{0,1\}, and the same for V. We conclude by observing that

\begin{aligned} \mathbb{E} _{x,y}[ (f-g)(x,y) \mathbb{E} _a U_a(x) \cdot \mathbb{E} _b V_b(y)] \le \max _{a,b} \mathbb{E} _{x,y}[ (f-g)(x,y) U_a(x) \cdot V_b(y)].\end{aligned}

1.4 Proof

Let us now finish the proof by showing a corner exists for sufficiently dense sets A \subseteq G^2. We’ll use three types of decompositions for f: G^2 \rightarrow \{0,1\}, with respect to the following three types of distinguishers, where U_i and V_i have range \{0,1\}:

  1. U_1(x) \cdot V_1(y),
  2. U_2(xy) \cdot V_2(y),
  3. U_3(x) \cdot V_3(xy).

The last two distinguishers can be visualized as parallelograms with a 45-degree angle between two segments. The same extra properties we discussed for rectangles hold for them too.

Recall that we want to show

\begin{aligned}\mathbb {E}_{x, y, g}[f(x, y) f(xg, y) f(x, gy)] > \frac {1}{|G|}.\end{aligned}

We’ll decompose the i-th occurrence of f via the i-th decomposition listed above. We’ll write this decomposition as f = g_i + h_i. We do this in the following order:

\begin{aligned} & ~f(x, y) \cdot f(xg, y) \cdot f(x, gy) \\ = & ~f(x, y) f(xg, y) g_3(x, gy) + f(x, y) f(xg, y) h_3(x, gy) \\ &~ \vdots \\ =&~ g_1 g_2 g_3 + h_1 g_2 g_3 + f h_2 g_3 + f f h_3 \end{aligned}

We first show that \mathbb{E} [g_1 g_2 g_3] is big (i.e., inverse polylogarithmic in expectation) in the next two claims. Then we show that the expectations of the other terms are small.

Claim 5. For all g \in G, the values \mathbb {E}_{x, y}[g_1(x, y) g_2(xg, y) g_3(x, gy)] are the same (over g) up to an error of 2^{O(s)} \cdot 1/|G|^{\Omega (1)}.

Proof. We just need to get error 1/|G|^{\Omega (1)} for any product of three functions for the three decomposition types. By the standard pseudorandomness argument we saw in previous lectures,

\begin{aligned} \mathbb {E}_{x, y}[c_1 U_1(x)V_1(y) \cdot c_2 U_2(xgy)V_2(y) \cdot c_3 U_3(x)V_3(xgy)] \\ = c_1 c_2 c_3 \mathbb {E}_{x, y}[(U_1 \cdot U_3)(x) (V_1 \cdot V_2)(y) (U_2 \cdot V_3)(xgy)] \\ = c_1 c_2 c_3 \cdot \mu (U_1 \cdot U_3) \mu (V_1 \cdot V_2) \mu (U_2 \cdot V_3) \pm \frac {1}{|G|^{\Omega (1)}}. \end{aligned}

\square

Recall that we start with a set of density \ge 1/\log ^{a} |G|.

Claim 6. \mathbb {E}_{g, x, y}[g_1 g_2 g_3] > \Omega (1/\log ^{4a} |G|).

Proof. By the previous claim, we can fix g = 1_G. We will relate the expectation over x, y to \mathbb {E}[f] by a trick using Hölder’s inequality: for non-negative random variables X_1, X_2, \ldots , X_k,

\begin{aligned}\mathbb {E}[X_1 \dots X_k] \leq \prod _{i=1}^k \mathbb {E}[X_i^{c_i}]^{1/c_i} \text { such that } \sum 1/c_i = 1.\end{aligned}

To apply this inequality in our setting, write

\begin{aligned}\mathbb {E}[f] = \mathbb {E}\left [(f \cdot g_1 g_2 g_3)^{1/4} \cdot \left (\frac {f}{g_1}\right )^{1/4}\cdot \left (\frac {f}{g_2}\right )^{1/4}\cdot \left (\frac {f}{g_3}\right )^{1/4}\right ].\end{aligned}

By the Hölder inequality, we get that

\begin{aligned}\mathbb {E}[f] \leq \mathbb {E}[f \cdot g_1 g_2 g_3]^{1/4} \mathbb {E}\left [\frac {f}{g_1}\right ]^{1/4} \mathbb {E}\left [\frac {f}{g_2}\right ]^{1/4} \mathbb {E}\left [\frac {f}{g_3}\right ]^{1/4}.\end{aligned}

Note that

\begin{aligned} \mathbb {E}_{x, y} \frac {f(x,y)}{g_1(x, y)} & = \mathbb {E}_{x, y} \frac {f(x, y)}{\mathbb {E}_{x', y' \in \textit {Cell}(x,y)}[f(x', y')] } \\ & = \mathbb {E}_{x, y} \frac {\mathbb {E}_{x', y' \in \textit {Cell}(x, y)}[f(x',y')]}{\mathbb {E}_{x', y' \in \textit {Cell}(x,y)}[f(x', y')] }\\ & = 1 \end{aligned}

where \textit {Cell}(x, y) is the set in the partition that contains (x, y). Finally, since f \leq 1 and the functions g_i are non-negative, we have \mathbb {E}[f \cdot g_1 g_2 g_3] \leq \mathbb {E}[g_1 g_2 g_3]. Combining the above inequalities yields \mathbb {E}[g_1 g_2 g_3] \geq \mathbb {E}[f]^4, which concludes the proof. \square

We’ve shown that the g_1 g_2 g_3 term is big. It remains to show the other terms are small. Let \epsilon be the error in the weak regularity lemma with respect to distinguishers with range [-1,1].

Claim 7. |\mathbb {E}[f f h_3]| \leq \epsilon ^{1/4}.

Proof. Replace g with gy^{-1} in the uniform distribution to get

\begin{aligned} & \mathbb {E}^4_{x, y, g}[f(x,y) f(xg,y)h_3(x, gy)] \\ & = \mathbb {E}^4_{x, y, g}[f(x,y) f(xgy^{-1},y)h_3(x, g)] \\ & = \mathbb {E}^4_{x, y}[f(x,y) \mathbb {E}_g [f(xgy^{-1},y)h_3(x, g)]] \\ & \leq \mathbb {E}^2_{x, y} [f^2(x, y)] \mathbb {E}^2_{x, y} \mathbb {E}^2_g [f(xgy^{-1},y)h_3(x, g)]\\ & \leq \mathbb {E}^2_{x, y} \mathbb {E}^2_g [f(xgy^{-1},y)h_3(x, g)]\\ & = \mathbb {E}^2_{x, y, g, g'}[f(xgy^{-1}, y) h_3(x, g) f(xg'y^{-1}, y) h_3(x, g')], \end{aligned}

where the first inequality is by Cauchy-Schwarz.

Now replace g \rightarrow x^{-1}g, g' \rightarrow x^{-1}g' and reason in the same way:

\begin{aligned} & = \mathbb {E}^2_{x, y, g, g'}[f(gy^{-1}, y) h_3(x, x^{-1}g) f(g'y^{-1}, y) h_3(x, x^{-1}g')] \\ & = \mathbb {E}^2_{g, g', y}[f(gy^{-1}, y) \cdot f(g'y^{-1}, y) \mathbb {E}_x [h_3(x, x^{-1}g) \cdot h_3(x, x^{-1}g')]] \\ & \leq \mathbb {E}_{x,x',g,g'}[h_3(x, x^{-1}g) h_3(x, x^{-1}g') h_3(x', x'^{-1}g) h_3(x', x'^{-1}g')]. \end{aligned}

Replace g \rightarrow xg to rewrite the expectation as

\begin{aligned} \mathbb {E}[h_3(x, g) h_3(x, x^{-1}g') h_3(x', x'^{-1}xg) h_3(x', x'^{-1}g')].\end{aligned}

We want to view the last three terms as a distinguisher U(x) \cdot V(xg). First, note that h_3 has range [-1,1]. This is because h_3(x,y) = f(x,y) - \mathbb{E} _{x', y' \in \textit {Cell}(x,y)} f(x',y') and f has range \{0,1\}.

Fix x', g'. The last term in the expectation becomes a constant c \in [-1,1]. The second term only depends on x, and the third only on xg. Hence for appropriate functions U and V with range [-1,1] this expectation can be rewritten as

\begin{aligned} \mathbb {E}[h_3(x, g) U(x) V(xg)], \end{aligned}

which concludes the proof. \square

There are similar proofs to show the remaining terms are small. For fh_2g_3, we can perform simple manipulations and then reduce to the above case. For h_1 g_2 g_3, we have a slightly easier proof than above.

1.4.1 Parameters

Suppose our set has density \delta \ge 1/\log ^a |G|. We apply the weak regularity lemma with error \epsilon = 1/\log ^c |G|. This yields s = 1/\epsilon ^2 = \log ^{2c} |G| functions, and hence partition decompositions into 2^{O(s)} = 2^{O(\log ^{2c} |G|)} non-overlapping rectangles. For say c = 1/3, we can bound \mathbb{E} _{x,y,g}[g_1 g_2 g_3] from below by the same expectation with g fixed to 1, up to an error 1/|G|^{\Omega (1)}. Then, \mathbb {E}_{x,y,g=1}[g_1g_2g_3] \geq \mathbb {E}[f]^4 = 1/\log ^{4a}|G|. The expectation of each of the terms involving an h is at most \epsilon ^{1/4} = 1/\log ^{c/4} |G|. So the proof can be completed for all sufficiently small a.

References

[Aus16]    Tim Austin. Ajtai-Szemerédi theorems over quasirandom groups. In Recent trends in combinatorics, volume 159 of IMA Vol. Math. Appl., pages 453–484. Springer, [Cham], 2016.

[Gre05a]   Ben Green. An argument of Shkredov in the finite field setting, 2005. Available at people.maths.ox.ac.uk/greenbj/papers/corners.pdf.

[Gre05b]   Ben Green. Finite field models in additive combinatorics. Surveys in Combinatorics, London Math. Soc. Lecture Notes 327, 1-27, 2005.

Special Topics in Complexity Theory, Lecture 15

Special Topics in Complexity Theory, Fall 2017. Instructor: Emanuele Viola

1 Lecture 15, Scribe: Chin Ho Lee

In this lecture fragment we discuss multiparty communication complexity, especially the problem of separating deterministic and randomized communication, which we connect to a problem in combinatorics.

2 Number-on-forehead communication complexity

In number-on-forehead (NOF) communication complexity each party i sees all of the input (x_1, \dotsc , x_k) except its own input x_i. For background, it is not known how to prove negative results for k \ge \log n parties. We shall focus on the problem of separating deterministic and randomized communication. For k = 2, we know the optimal separation: the equality function requires \Omega (n) communication for deterministic protocols, but can be solved using O(1) communication if we allow the protocols to use public coins. For k = 3, the best known separation between deterministic and randomized protocols is \Omega (\log n) vs. O(1) [BDPW10]. In the following we give a new proof of this result, for a simpler function: f(x, y, z) = 1 if and only if x \cdot y \cdot z = 1 for x, y, z \in SL_2(q).

For context, let us state and prove the upper bound for randomized communication.

Claim 1. f has randomized communication complexity O(1).

Proof. In the NOF model, computing f reduces to 2-party equality with no additional communication: Alice (who sees y and z) computes y \cdot z =: w privately, then Alice and Bob check if x = w^{-1}. \square
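
For completeness, here is a minimal Python sketch of the standard public-coin equality protocol invoked by this reduction: the two parties compare inner products of their strings with shared random vectors modulo 2, so with O(1) repetitions the communication is O(1) bits and the error probability is a small constant. The function names are ours.

    import random

    def equality_protocol(alice_x, bob_x, repetitions=3, seed=2017):
        # Public-coin EQ: accept iff <alice_x, r> = <bob_x, r> (mod 2) for each
        # shared random vector r.  Equal inputs are always accepted; unequal
        # inputs are accepted with probability 2^(-repetitions).
        rng = random.Random(seed)          # models the shared public coins
        n = len(alice_x)
        for _ in range(repetitions):
            r = [rng.randrange(2) for _ in range(n)]
            alice_bit = sum(a & b for a, b in zip(alice_x, r)) % 2   # the 1 bit Alice sends
            bob_bit = sum(a & b for a, b in zip(bob_x, r)) % 2
            if alice_bit != bob_bit:
                return False
        return True

    print(equality_protocol([1, 0, 1, 1], [1, 0, 1, 1]))  # True
    print(equality_protocol([1, 0, 1, 1], [1, 1, 1, 1]))  # False unless every r misses the differing position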

To prove an \Omega (\log n) lower bound for deterministic protocols, where n = \log |G|, we reduce the communication problem to a combinatorial problem.

Definition 2. A corner in a group G is \{ (x,y), (xz, y), (x,zy) \} \subseteq G^2, where x, y are arbitrary group elements and z \neq 1_G.

For intuition, consider the case when G is Abelian, where one can replace multiplication by addition and a corner becomes \{ (x, y), (x + z, y), (x, y + z)\} for z \neq 0.

We now state the theorem that gives the lower bound.

Theorem 3. Suppose that every subset A \subseteq G^2 with \mu (A) := |A|/|G^2| \ge \delta contains a corner. Then the deterministic communication complexity of f(x, y, z) = 1 \iff x \cdot y \cdot z = 1_G is \Omega (\log (1/\delta )).

It is known that when G is Abelian, then \delta \ge 1/\mathrm {polyloglog}|G| implies a corner. We shall prove that when G = SL_2(q), then \delta \ge 1/\mathrm {polylog}|G| implies a corner. This in turn implies communication \Omega (\log \log |G|) = \Omega (\log n).

Proof. We saw that a number-in-hand (NIH) c-bit protocol can be written as a disjoint union of 2^c rectangles. Likewise, a number-on-forehead c-bit protocol P can be written as a disjoint union of 2^c cylinder intersections C_i := \{ (x, y, z) : f_i(y,z) g_i(x,z) h_i(x,y) = 1\} for some f_i, g_i, h_i\colon G^2 \to \{0, 1\}:

\begin{aligned} P(x,y,z) = \sum _{i=1}^{2^c} f_i(y,z) g_i(x,z) h_i(x,y). \end{aligned}

The proof idea of the above fact is to consider the 2^c transcripts of P; one can then see that the set of inputs giving rise to a fixed transcript is a cylinder intersection.

Let P be a correct c-bit protocol. Consider the inputs \{(x, y, (xy)^{-1}) \}, on all of which P accepts (f equals 1 on such inputs). Since P partitions them among 2^c cylinder intersections, at least a 2^{-c} fraction of them lies in some cylinder intersection C. Let A := \{ (x,y) : (x, y, (xy)^{-1}) \in C \} \subseteq G^2. Since the first two elements in the tuple determine the last, we have \mu (A) \ge 2^{-c}.

Now suppose A contains a corner \{ (x, y), (xz, y), (x, zy) \}. Then

\begin{aligned} (x,y) \in A &\implies (x, y, (xy)^{-1}) \in C &&\implies h(x, y) = 1 , \\ (xz,y) \in A &\implies (xz, y, (xzy)^{-1}) \in C &&\implies f(y,(xzy)^{-1}) = 1 , \\ (x,zy) \in A &\implies (x, zy, (xzy)^{-1}) \in C &&\implies g(x,(xzy)^{-1}) = 1 . \end{aligned}

This implies (x,y,(xzy)^{-1}) \in C, which is a contradiction because z \neq 1 and so x \cdot y \cdot (xzy)^{-1} \neq 1_G. \square

References

[BDPW10]   Paul Beame, Matei David, Toniann Pitassi, and Philipp Woelfel. Separating deterministic from randomized multiparty communication complexity. Theory of Computing, 6(1):201–225, 2010.

Special Topics in Complexity Theory, Lecture 10

Added Dec 27 2017: An updated version of these notes exists on the class page.

 

Special Topics in Complexity Theory, Fall 2017. Instructor: Emanuele Viola

1 Lecture 10, Guest lecture by Justin Thaler, Scribe: Biswaroop Maiti

This is a guest lecture by Justin Thaler regarding lower bounds on approximate degree [BKT17, BT15, BT17]. Thanks to Justin for giving this lecture and for his help with the write-up. We will sketch some details of the lower bounds on the approximate degree of \mathsf {AND} \circ \mathsf {OR} and \mathsf {SURJ}, and give some intuition about the techniques used. Recall the definition of \mathsf {SURJ} from the previous lecture:

Definition 1. The surjectivity function \mathsf {SURJ}\colon \left (\{-1,1\}^{\log R}\right )^N \to \{-1,1\}, takes input x=(x_1, \dots , x_N) where each x_i \in \{-1, 1\}^{\log R} is interpreted as an element of [R]. \mathsf {SURJ}(x) has value -1 if and only if \forall j \in [R], \exists i\colon x_i = j.

Recall from the last lecture that \mathsf {AND}_R \circ \mathsf {OR}_N \colon \{-1,1\}^{R\times N} \rightarrow \{-1,1\} is the block-wise composition of the \mathsf {AND} function on R bits and the \mathsf {OR} function on N bits. In general, we will denote the block-wise composition of two functions f, and g, where f is defined on R bits and g is defined on N bits, by f_R \circ g_N. Here, the outputs of R copies of g are fed into f (with the inputs to each copy of g being pairwise disjoint). The total number of inputs to f_R \circ g_N is R \cdot N.
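
As a quick illustration of the notation f_R \circ g_N, here is a minimal Python sketch (names ours; for readability it uses the \{0,1\} convention rather than the \{-1,1\} convention of the text):

    def block_compose(f, g, R, N):
        # Block-wise composition f_R o g_N on R*N bits: split the input into
        # R disjoint blocks of N bits, apply g to each block, and feed the
        # R results into f.
        def composed(bits):
            assert len(bits) == R * N
            return f([g(bits[i * N:(i + 1) * N]) for i in range(R)])
        return composed

    AND = lambda bits: int(all(bits))
    OR = lambda bits: int(any(bits))
    AND_OR = block_compose(AND, OR, R=3, N=4)
    print(AND_OR([0, 0, 1, 0,  1, 0, 0, 0,  0, 0, 0, 1]))  # 1: every block contains a 1
    print(AND_OR([0, 0, 0, 0,  1, 0, 0, 0,  0, 0, 0, 1]))  # 0: the first block is all zeros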

1.1 Lower Bound on d_{1/3}(\mathsf {SURJ}) via a Lower Bound on d_{1/3}(\mathsf {AND} \circ \mathsf {OR})

 

Claim 2. d_{1/3}( \mathsf {SURJ} ) = \widetilde {\Theta }(n^{3/4}) .

We will look at only the lower bound in the claim. We interpret the input as a list of N numbers from [R]:= \{1,2, \cdots R\}. As presented in [BKT17], the proof for the lower bound proceeds in the following steps.

  1. Show that to approximate \mathsf {SURJ}, it is necessary to approximate the block-composition \mathsf {AND}_R \circ \mathsf {OR}_N on inputs of Hamming weight at most N, i.e., show that d_{1/3}(\mathsf {SURJ}) \geq d_{1/3}^{\leq N}(\mathsf {AND}_R \circ \mathsf {OR}_N).

    Step 1 was covered in the previous lecture, but we briefly recall a bit of intuition for why the claim in this step is reasonable. The intuition comes from the fact that the converse of the claim is easy to establish, i.e., it is easy to show that in order to approximate \mathsf {SURJ}, it is sufficient to approximate \mathsf {AND}_R \circ \mathsf {OR}_N on inputs of Hamming weight exactly N.

    This is because \mathsf {SURJ} can be expressed as an \mathsf {AND}_R (over all range items r \in [R]) of the \mathsf {OR}_N (over all inputs i \in [N]) of “Is input x_i equal to r”? Each predicate of the form in quotes is computed exactly by a polynomial of degree \log R, since it depends on only \log R of the input bits, and exactly N of the predicates (one for each i \in [N]) evaluate to TRUE.

    Step 1 of the lower bound proof for \mathsf {SURJ} in [BKT17] shows a converse, namely that the only way to approximate \mathsf {SURJ} is to approximate \mathsf {AND}_R \circ \mathsf {OR}_N on inputs of Hamming weight at most N.

  2. Show that d_{1/3}^{\leq N}(\mathsf {AND}_R \circ \mathsf {OR}_N) = \widetilde {\Omega }(n^{3/4}), i.e., the degree required to approximate \mathsf {AND} _R \circ \mathsf {OR}_N on inputs of Hamming weight at most N is at least D=\widetilde {\Omega }(n^{3/4}).

    In the previous lecture we also sketched this Step 2. In this lecture we give additional details of this step. As in the papers, we use the concept of a “dual witness.” The latter can be shown to be equivalent to bounded indistinguishability.

    Step 2 itself proceeds via two substeps:

    1. Give a dual witness \Phi for \mathsf {AND}_R \circ \mathsf {OR}_N that places little mass (namely, total mass less than (R \cdot N \cdot D)^{-D}) on inputs of Hamming weight greater than N.
    2. By modifying \Phi , give a dual witness \Phi ' for \mathsf {AND}_R \circ \mathsf {OR}_N that places zero mass on inputs of Hamming weight greater than N.

In [BKT17], both Substeps 2a and 2b proceed entirely in the dual world (i.e., they explicitly manipulate dual witnesses \Phi and \Phi '). The main goal of this section of the lecture notes is to explain how to replace Step 2b of the argument of [BKT17] with a wholly “primal” argument.

The intuition of the primal version of Step 2b that we’ll cover is as follows. First, we will show that a polynomial p \colon \{-1, 1\}^{R \cdot N} \to \mathbb {R} of degree D that is bounded on the low Hamming weight inputs cannot be too big on the high Hamming weight inputs. In particular, we will prove the following claim.

Claim 3. If p \colon \{-1, 1\}^{M} \to \mathbb {R} is a degree D polynomial that satisfies |p(x)| \leq 4/3 on all inputs of x of Hamming weight at most D, then |p(x)| \leq (4/3) \cdot D \cdot M^D for all inputs x.

Second, we will explain that the dual witness \Phi constructed in Step 2a has the following “primal” implication:

Claim 4. For D \approx N^{3/4}, any polynomial p of degree D satisfying |p(x) - \left (\mathsf {AND}_R \circ \mathsf {OR}_N\right )(x) | \leq 1/3 for all inputs x of Hamming weight at most N must satisfy |p(x)| > (4/3) \cdot D \cdot ( R \cdot N)^D for some input x \in \{-1, 1\}^{R \cdot N}.

Combining Claims 3 and 4, we conclude that no polynomial p of degree D \approx N^{3/4} can satisfy

\begin{aligned} ~~~~(1) |p(x) - (\mathsf {AND}_R \circ \mathsf {OR}_N)(x) | \leq 1/3 \text { for all inputs } x \text { of Hamming weight at most } N,\end{aligned}

which is exactly the desired conclusion of Step 2. This is because any polynomial p satisfying Equation (1) also satisfies |p(x)| \leq 4/3 for all x of Hamming weight at most N, and hence Claim 3 implies that

\begin{aligned} ~~~~(2) |p(x)| \leq \frac {4}{3} \cdot D \cdot (R \cdot N)^D \text { for \emph {all} inputs } x \in \{-1, 1\}^{R \cdot N}.\end{aligned}

But Claim 4 states that any polynomial satisfying both Equations (1) and (2) requires degree strictly larger than D.

In the remainder of this section, we prove Claims 3 and 4.

1.2 Proof of Claim 3

Proof of Claim 3. For notational simplicity, let us prove this claim for polynomials on domain \{0, 1\}^{M}, rather than \{-1, 1\}^M.

Proof in the case that p is symmetric. Let us assume first that p is symmetric, i.e., p is only a function of the Hamming weight |x| of its input x. Then p(x) = g(|x|) for some univariate polynomial g of degree at most D (this is a direct consequence of Minsky-Papert symmetrization, which we have seen in earlier lectures). Since g is determined by its values at the D+1 points 0, 1, \dots , D, we can express it via Lagrange interpolation:

\begin{aligned}g(t)= \sum _{k=0}^{D} g(k) \cdot \prod _{i=0,\, i \neq k}^{D} \frac {t-i}{k-i}. \end{aligned}

Here each value g(k) is bounded in magnitude by |g(k)| \leq 4/3, and \left |\prod _{i \neq k} \frac {t-i}{k-i}\right | \leq \frac {M^D}{k!\,(D-k)!}: the numerator has D factors each of magnitude at most M, and the denominator equals k!\,(D-k)!. Summing over k and using \sum _{k=0}^{D} \frac {1}{k!\,(D-k)!} = \frac {2^D}{D!} \leq D (for D \geq 2), we get the final bound:

\begin{aligned}|g(t)| \leq (4/3) \cdot D \cdot M^D.\end{aligned}

Proof for general p. Let us now consider the case of general (not necessarily symmetric) polynomials p. Fix any input x \in \{0, 1\}^M. The goal is to show that |p(x)| \leq \frac 43 D \cdot M^D.

Let us consider the polynomial \hat {p}_x \colon \{0,1\}^{|x|} \rightarrow \mathbb {R} of degree D obtained from p by restricting each input i such that x_i=0 to have the value 0. For example, if M=4 and x=(0, 1, 1, 0), then \hat {p}_x(y_2, y_3)=p(0, y_2, y_3, 0). We will exploit three properties of \hat {p}_x:

  • \deg (\hat {p}_x) \leq \deg (p) \leq D.
  • Since |p(x)| \leq 4/3 for all inputs with |x| \leq D, \hat {p}_x(y) satisfies the analogous property: |\hat {p}_x(y)| \leq 4/3 for all inputs with |y| \leq D.
  • If \mathbf {1}_{|x|} denotes the all-1s vector of length |x|, then \hat {p}_x(\mathbf {1}_{|x|}) = p(x).

Property 3 means that our goal is to show that |\hat {p}_x(\mathbf {1}_{|x|})| \leq \frac 43 \cdot D \cdot M^D.

Let p^{\text {symm}}_x \colon \{0, 1\}^{|x|} \to \mathbb {R} denote the symmetrized version of \hat {p}_x, i.e., p^{\text {symm}}_x(y) = \mathbb {E}_{\sigma }[\hat {p}_x(\sigma (y))], where the expectation is over a random permutation \sigma of \{1, \dots , |x|\}, and \sigma (y)=(y_{\sigma (1)}, \dots , y_{\sigma (|x|)}). Since \sigma (\mathbf {1}_{|x|}) = \mathbf {1}_{|x|} for all permutations \sigma , p^{\text {symm}}_x(\mathbf {1}_{|x|}) = \hat {p}_x(\mathbf {1}_{|x|}) = p(x). But p^{\text {symm}}_x is symmetric, so Properties 1 and 2 together mean that the analysis from the first part of the proof implies that |p^{\text {symm}}_x(y)| \leq \frac 43 \cdot D \cdot M^D for all inputs y. In particular, letting y = \mathbf {1}_{|x|}, we conclude that |p(x)| \leq \frac 43 \cdot D \cdot M^D as desired. \square

Discussion. One may try to simplify the analysis of the general case in the proof of Claim 3 by considering the polynomial p^{\text {symm}} \colon \{0, 1\}^M \to \mathbb {R} defined via p^{\text {symm}}(x)=\mathbb {E}_{\sigma }[p(\sigma (x))], where the expectation is over permutations \sigma of \{1, \dots , M\}. p^{\text {symm}} is a symmetric polynomial, so the analysis for symmetric polynomials immediately implies that |p^{\text {symm}}(x)| \leq \frac 43 \cdot D \cdot M^D. Unfortunately, this does not mean that |p(x)| \leq \frac 43 \cdot D \cdot M^D.

This is because the symmetrized polynomial p^{\mathsf {symm}} is averaging the values of p over all those inputs of a given Hamming weight. So, a bound on this averaging polynomial does not preclude the case where p is massively positive on some inputs of a given Hamming weight, and massively negative on other inputs of the same Hamming weight, and these values cancel out to obtain a small average value. That is, it is not enough to conclude that on the average over inputs of any given Hamming weight, the magnitude of p is not too big.

Thus, we needed to make sure that when we symmetrize \hat {p}_x to p^{\mathsf {symm}}_x, such large cancellations don’t happen, and a bound on the average value of \hat {p}_x on a given Hamming weight really gives us a bound on p at the input x itself. We defined \hat {p}_x so that \hat {p}_x(\mathbf {1}_{|x|}) = p(x). Since there is only one input in \{0, 1\}^{|x|} of Hamming weight |x|, p^{\text {symm}}_x(\mathbf {1}_{|x|}) does not average \hat {p}_x’s values over many inputs, meaning we don’t need to worry about massive cancellations.

A note on the history of Claim 3. Claim 3 was implicit in [RS10]. They explicitly showed a similar bound for symmetric polynomials using a primal view, and (implicitly) gave a different (dual) proof of the case of general polynomials.

1.3 Proof of Claim 4

1.3.1 Interlude Part 1: Method of Dual Polynomials [BT17]

A dual polynomial is a dual solution to a certain linear program that captures the approximate degree of any given function f \colon \{-1, 1\}^n \to \{-1, 1\}. These polynomials act as certificates of the high approximate degree of f. The notion of strong LP duality implies that the technique is lossless, in comparison to symmetrization techniques which we saw before. For any function f and any \varepsilon , there is always some dual polynomial \Psi that witnesses a tight \varepsilon -approximate degree lower bound for f. A dual polynomial that witnesses the fact that \mathsf {d}_\varepsilon (f) \geq d is a function \Psi \colon \{-1, 1\}^n \rightarrow \mathbb {R} satisfying three properties:

  • Correlation analysis:
    \begin{aligned}\sum _{x \in \{-1,1\}^n }{\Psi (x) \cdot f(x)} > \varepsilon .\end{aligned}

    If \Psi satisfies this condition, it is said to be well-correlated with f.

  • Pure high degree: For all polynomials p \colon \{-1, 1\}^n \rightarrow \mathbb {R} of degree less than d, we have
    \begin{aligned}\sum _{x \in \{-1,1\}^n } { p(x) \cdot \Psi (x)} = 0.\end{aligned}

    If \Psi satisfies this condition, it is said to have pure high degree at least d.

  • \ell _1 norm:
    \begin{aligned}\sum _{x \in \{-1,1\}^n }|\Psi (x)| = 1.\end{aligned}
1.3.2 Interlude Part 2: Applying The Method of Dual Polynomials To Block-Composed Functions

For any function f \colon \{-1, 1\}^n \to \{-1, 1\}, we can write an LP capturing the approximate degree of f, i.e., whose optimum is the least error achievable by a polynomial of a given degree. We can prove lower bounds on the approximate degree of f by proving lower bounds on this optimum. One way to do this is by writing down the dual of the LP and exhibiting a feasible solution to it. By LP duality, the value of any feasible solution to the dual is a lower bound on the optimum of the primal LP. Therefore, exhibiting such a feasible solution, which we call a dual witness, suffices to prove an approximate degree lower bound for f.

However, for any given dual witness, some work will be required to verify that the witness indeed meets the criteria imposed by the Dual constraints.

When the function f is a block-wise composition of two functions, say h and g, then we can try to construct a good dual witness for f by looking at dual witnesses for each of h and g, and combining them carefully, to get the dual witness for h \circ g.

The dual witness \Phi constructed in Step 2a for \mathsf {AND} \circ \mathsf {OR} is expressed below in terms of the dual witness of the inner \mathsf {OR} function viz. \Psi _{\mathsf {OR}} and the dual witness of the outer \mathsf {AND}, viz. \Psi _{ \mathsf {AND} }.

\begin{aligned} ~~~~(3) \Phi (x_1 \dots x_R) = \Psi _{ \mathsf {AND} }\left ( \cdots , \mathsf {sgn}(\Psi _{\mathsf {OR}}(x_i)), \cdots \right ) \cdot \prod _{i=1}^R| \Psi _{\mathsf {OR}}(x_i)|. \end{aligned}

This method of combining a dual witness \Psi _{\mathsf {AND}} for the “outer” function \mathsf {AND} with a dual witness \Psi _{\mathsf {OR}} for the “inner” function \mathsf {OR} is referred to in [BKT17, BT17] as dual block composition.
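
To make Equation (3) concrete, here is a minimal Python sketch of the combining step of dual block composition, with dual witnesses represented as dictionaries from inputs (tuples over \{-1,1\}) to reals. The names and the toy witness values are ours; the code does not verify correlation, pure high degree, or the \ell _1 norm, omits any normalization constant, and is meant for toy sizes only.

    from itertools import product

    def sgn(v):
        return 1 if v >= 0 else -1

    def dual_block_compose(psi_outer, psi_inner, R):
        # Phi(x_1, ..., x_R) = psi_outer(sgn psi_inner(x_1), ..., sgn psi_inner(x_R))
        #                      * prod_i |psi_inner(x_i)|     (up to normalization).
        phi = {}
        inner_inputs = list(psi_inner.keys())
        for blocks in product(inner_inputs, repeat=R):
            outer_point = tuple(sgn(psi_inner[b]) for b in blocks)
            weight = 1.0
            for b in blocks:
                weight *= abs(psi_inner[b])
            phi[blocks] = psi_outer[outer_point] * weight
        return phi

    # Toy run with R = 2 blocks of a single bit each and made-up witness values.
    psi_inner = {(-1,): 0.5, (1,): -0.5}
    psi_outer = {(-1, -1): 0.25, (-1, 1): -0.25, (1, -1): -0.25, (1, 1): 0.25}
    phi = dual_block_compose(psi_outer, psi_inner, R=2)
    print(sum(abs(v) for v in phi.values()))   # total l_1 mass of the combined witness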

1.3.3 Interlude Part 3: Hamming Weight Decay Conditions

Step 2a of the proof of the \mathsf {SURJ} lower bound from [BKT17] gave a dual witness \Phi for \mathsf {AND}_R \circ \mathsf {OR}_N (with R=\Theta (N)) that had pure high degree \tilde {\Omega }(N^{3/4}), and also satisfies Equations (4) and (5) below.

\begin{aligned} ~~~~(4) \sum _{|x|>N} {|\Phi (x)|} \ll (R \cdot N \cdot D)^{-D} \end{aligned}
\begin{aligned} ~~~~(5) \text {For all } t=0, \dots , N, \sum _{|x|=t} {|\Phi (x)|} \leq \frac {1}{15 \cdot (1+t)^2}. \end{aligned}

Equation (4) is a very strong “Hamming weight decay” condition: it shows that the total mass that \Phi places on inputs of high Hamming weight is very small. Hamming weight decay conditions play an essential role in the lower bound analysis for \mathsf {SURJ} from [BKT17]. In addition to Equations (4) and (5) themselves being Hamming weight decay conditions, [BKT17]’s proof that \Phi satisfies Equations (4) and (5) exploits the fact that the dual witness \Psi _{\mathsf {OR}} for \mathsf {OR} can be chosen to simultaneously have pure high degree N^{1/4}, and to satisfy the following weaker Hamming weight decay condition:

Claim 5. There exist constants c_1, c_2 such that for all t=0, \cdots , N,

\begin{aligned} ~~~~(6) \sum _{|x|=t} { |\Psi _{\mathsf {OR}}(x)|} \leq c_1 \cdot \frac {1}{(1+t)^2} \cdot \exp (-c_2 \cdot t/N^{1/4}). \end{aligned}

(We will not prove Claim 5 in these notes, we simply state it to highlight the importance of dual decay to the analysis of \mathsf {SURJ}).

Dual witnesses satisfying various notions of Hamming weight decay have a natural primal interpretation: they witness approximate degree lower bounds for the target function (\mathsf {AND}_R \circ \mathsf {OR}_N in the case of Equation (4), and \mathsf {OR}_N in the case of Equation (6)) even when the approximation is allowed to be exponentially large on inputs of high Hamming weight. This primal interpretation of dual decay is formalized in the following claim.

Claim 6. Let L(t) be any function mapping \{0, 1, \dots , N\} to \mathbb {R}_+. Suppose \Psi is a dual witness for f satisfying the following properties:

  • (Correlation): \sum _{x \in \{-1,1\}^n }{\Psi (x) \cdot f(x)} > 1/3.
  • (Pure high degree): \Psi has pure high degree D.
  • (Dual decay): \sum _{|x|=t} |\Psi (x)| \leq \frac {1}{5 \cdot (1+t)^2 \cdot L(t)} for all t = 0, 1, \dots , N.

Then there is no degree D polynomial p such that

\begin{aligned} ~~~~(7) |p(x)-f(x)| \leq L(t) \text { for all inputs } x \text { of Hamming weight } t, \text { for every } t = 0, 1, \dots , N.\end{aligned}

Proof. Let p be any degree D polynomial. Since \Psi has pure high degree D, \sum _{x \in \{-1, 1\}^N} p(x) \cdot \Psi (x)=0.

We will now show that if p satisfies Equation (7), then the other two properties satisfied by \Psi (correlation and dual decay) together imply that \sum _{x \in \{-1, 1\}^N} p(x) \cdot \Psi (x) >0, a contradiction.

\begin{aligned} \sum _{x \in \{-1, 1\}^N} \Psi (x) \cdot p(x) = \sum _{x \in \{-1, 1\}^N} \Psi (x) \cdot f(x) - \sum _{x \in \{-1, 1\}^N} \Psi (x) \cdot (p(x) - f(x))\\ \geq 1/3 - \sum _{x \in \{-1, 1\}^N} |\Psi (x)| \cdot |p(x) - f(x)|\\ \geq 1/3 - \sum _{t=0}^N \sum _{|x|=t} |\Psi (x)| \cdot L(t)\\ \geq 1/3 - \sum _{t=0}^N \frac {1}{5 \cdot (1+t)^2 \cdot L(t)} \cdot L(t)\\ = 1/3 - \sum _{t=0}^N \frac {1}{5 \cdot (1+t)^2} > 0. \end{aligned}

Here, Line 2 exploited that \Psi has correlation at least 1/3 with f, Line 3 exploited the assumption that p satisfies Equation (7), and Line 4 exploited the dual decay condition that \Psi is assumed to satisfy. \square

1.3.4 Proof of Claim 4

Proof. Claim 4 follows from Equations (4) and (5), combined with Claim 6. Specifically, apply Claim 6 with f=\mathsf {AND}_R \circ \mathsf {OR}_N, and

\begin{aligned}L(t) = \begin {cases} 1/3 \text { if } t \leq N \\ (R \cdot N \cdot D)^{D} \text { if } t > N. \end {cases}\end{aligned}

\square

2 Generalizing the analysis for \mathsf {SURJ} to prove a nearly linear approximate degree lower bound for \mathsf {AC}^0

Now we take a look at how to extend this kind of analysis for \mathsf {SURJ} to obtain even stronger approximate degree lower bounds for other functions in \mathsf {AC}^0. Recall that \mathsf {SURJ} can be expressed as an \mathsf {AND}_R (over all range items r \in [R]) of the \mathsf {OR}_N (over all inputs i \in [N]) of “Is input x_i equal to r”? That is, \mathsf {SURJ} simply evaluates \mathsf {AND}_R \circ \mathsf {OR}_N on the inputs (\dots , y_{j, i}, \dots ) where y_{j, i} indicates whether or not input x_i is equal to range item j \in [R].
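To keep this indicator-variable view concrete, here is a minimal Python sketch (purely illustrative names and conventions, with the range taken to be \{1, \dots , R\}) computing \mathsf {SURJ} both directly and through the indicator bits y_{j,i}.

```python
def surj(xs, R):
    # SURJ = 1 iff every range item j in [R] occurs among the inputs.
    return all(any(x == j for x in xs) for j in range(1, R + 1))

def surj_via_and_or(xs, R):
    # Build the R x N indicator matrix y[j][i] = 1 iff x_i = j,
    # then evaluate AND_R of OR_N on it.
    y = [[int(x == j) for x in xs] for j in range(1, R + 1)]
    return all(any(row) for row in y)

assert surj([1, 2, 2, 3], 3) == surj_via_and_or([1, 2, 2, 3], 3) == True
assert surj([1, 1, 3, 3], 3) == surj_via_and_or([1, 1, 3, 3], 3) == False
```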

Our analysis for \mathsf {SURJ} can be viewed as follows: It is a way to turn the \mathsf {AND} function on R bits (which has approximate degree \Theta \left (\sqrt {R}\right )) into a function on not many more bits, with polynomially larger approximate degree. (Indeed, \mathsf {SURJ} is defined on N \log R bits where, say, N = 100R, so it is a function on 100 R \log R bits.) So, this function is on not much more than R bits, but has approximate degree \tilde {\Omega }(R^{3/4}), polynomially larger than the approximate degree of \mathsf {AND}_R.

Hence, the lower bound for \mathsf {SURJ} can be seen as a hardness amplification result. We turn the \mathsf {AND} function on R bits to a function on slightly more bits, but the approximate degree of the new function is significantly larger.

From this perspective, the lower bound proof for \mathsf {SURJ} showed that in order to approximate \mathsf {SURJ}, we not only need to approximate the \mathsf {AND}_R function; additionally, since the inputs are not fed directly into the \mathsf {AND} gate itself but rather through \mathsf {OR}_N gates, the degree is driven up further. The intuition is that we cannot do much better than approximating the \mathsf {AND} function and then approximating the block-composed \mathsf {OR}_N gates. This additional approximation of the \mathsf {OR} gates gives us the extra exponent in the approximate degree expression.

We will see two issues that get in the way of naive attempts at generalizing our hardness amplification technique from \mathsf {AND}_R to more general functions.

2.1 Interlude: Grover’s Algorithm

Grover’s algorithm [Gro96] is a quantum algorithm that, with high probability, finds the unique input to a black-box function that produces a given output, using O({\sqrt {N}}) queries to the function, where N is the size of the domain of the function. It was originally devised as a database search algorithm that searches an unsorted database of size N and determines whether or not there is a record in the database that satisfies a given property, using O(\sqrt {N}) queries. This is strictly better than deterministic and randomized query algorithms, which take \Omega (N) queries in the worst case and in expectation, respectively. Grover’s algorithm is optimal up to a constant factor among quantum algorithms.

2.2 Issues: Why a dummy range item is necessary

In general, let us consider the problem of taking any function f that does not have maximal approximate degree (say, with approximate degree n^{1-\Omega (1)}), and turning it into a function on roughly the same number of bits, but with polynomially larger approximate degree.

In analogy with how \mathsf {SURJ}(x_1, \dots , x_N) equals \mathsf {AND}_R \circ \mathsf {OR}_N evaluated on inputs (\dots , y_{ji}, \dots ), where y_{ji} indicates whether or not x_i=j, we can consider the block composition f_R \circ \mathsf {OR}_N evaluated on (\dots , y_{ji}, \dots ), and hope that this function has polynomially larger approximate degree than f_R itself.

Unfortunately, this does not work. Consider for example the case f_R = \mathsf {OR}_R. The function \mathsf {OR}_R \circ \mathsf {OR}_N = \mathsf {OR}_{R \cdot N} evaluates to 1 on all possible vectors (\dots , y_{ji}, \dots ), since all such vectors have Hamming weight exactly N > 0.

One way to try to address this is to introduce a dummy range item, all occurrences of which are simply ignored by the function. That is, we can define a (hopefully harder) function G that interprets its input as a list of N numbers from the range [R]_0 := \{0, 1, \dots , R\}, rather than [R], and set G=\left (f_R \circ \mathsf {OR}_N\right )(y_{1, 1}, \dots , y_{R, N}) (note that the variables y_{0, 1}, \dots , y_{0, N}, which indicate whether or not each input x_i equals range item 0, are simply ignored).

In fact, in the previous lecture we already used this technique of introducing a “dummy” range item, to ease the lower bound analysis for \mathsf {SURJ} itself. Last lecture we covered Step 1 of the lower bound proof for \mathsf {SURJ}, and we let z_0= \sum _{i = 1}^N y_{0, i} denote the frequency of the dummy range item, 0. The introduction of this dummy range item let us replace the condition \sum _{j=0}^R z_j = N (i.e., the sum of the frequencies of all the range items is exactly N) by the condition \sum _{j=1}^R z_j \leq N (i.e., the sum of the frequencies of all the range items is at most N).

2.3 A dummy range item is not sufficient on its own

Unfortunately, introducing a dummy range item is not sufficient on its own. That is, even when the range is [R]_0 rather than [R], the function G=f_R \circ \mathsf {OR}_N(y_{1, 1}, \dots , y_{R, N}) may have approximate degree that is not polynomially larger than that of f_R itself. An example of this is (once again) f_R = \mathsf {OR}_R. With a dummy range item, \mathsf {OR}_R \circ \mathsf {OR}_N(y_{1, 1}, \dots , y_{R, N}) evaluates to TRUE if and only if at least one of the N inputs is not equal to the dummy range item 0. This problem has approximate degree O(N^{1/2}) (it can be solved using Grover search).

Therefore, the most naive approach at general hardness amplification, even with a dummy range item, does not work.

2.4 The approach that works

The approach that succeeds is to consider the block composition f \circ \mathsf {AND}_{\log R} \circ \mathsf {OR}_N (i.e., apply the naive approach with a dummy range item not to f_R itself, but to f_R \circ \mathsf {AND}_{\log R}). As pointed out in Section 2.3, the \mathsf {AND}_{\log R} gates are crucial here for the analysis to go through.

It is instructive to look at where exactly the lower bound proof for \mathsf {SURJ} breaks down if we try to adapt it to the function \mathsf {OR}_R \circ \mathsf {OR}_N = \mathsf {OR}_{R \cdot N} (rather than the function \mathsf {AND}_R \circ \mathsf {OR}_N which we analyzed to prove the lower bound for \mathsf {SURJ}). Then we can see why the introduction of the \mathsf {AND}_{\log R} gates fixes the issue.

When analyzing the more naively defined function G= \left (\mathsf {OR}_R \circ \mathsf {OR}_N\right )(y_{1, 1}, \dots , y_{R, N}) (with a dummy range item), Step 1 of the lower bound analysis for \mathsf {SURJ} does work unmodified to imply that in order to approximate G, it is necessary to approximate block composition of \mathsf {OR}_R \circ \mathsf {OR}_N on inputs of Hamming weight at most N. But Step 2 of the analysis breaks down: one can approximate \mathsf {OR}_R \circ \mathsf {OR}_N on inputs of Hamming weight at most N using degree just O(N^{1/2}).

Why does the Step 2 analysis break down for \mathsf {OR}_R \circ \mathsf {OR}_N? If one tries to construct a dual witness \Phi for \mathsf {OR}_R \circ \mathsf {OR}_N by applying dual block composition (cf. Equation (3), but with the dual witness \Psi _{\mathsf {AND}} for \mathsf {AND}_R replaced by a dual witness for \mathsf {OR}_R), \Phi will not be well-correlated with \mathsf {OR}_R \circ \mathsf {OR}_N.

Roughly speaking, the correlation analysis thinks of each copy of the inner dual witness \Psi _{\mathsf {OR}}(x_i) as consisting of a sign, \mathsf {sgn}(\Psi _{\mathsf {OR}})(x_i), and a magnitude |\Psi _{\mathsf {OR}}(x_i)|, and the inner dual witness “makes an error” on x_i if it outputs the wrong sign, i.e., if \mathsf {sgn}(\Psi _{\mathsf {OR}})(x_i) \neq \mathsf {OR}(x_i). The correlation analysis winds up performing a union bound over the probability (under the product distribution \prod _{i=1}^{R}|\Psi _{\mathsf {OR}}(x_i)|) that any of the R copies of the inner dual witness makes an error. Unfortunately, each copy of the inner dual witness makes an error with constant probability under the distribution |\Psi _{\mathsf {OR}}|. So at least one of them makes an error under the product distribution with probability very close to 1. This means that the correlation of the dual-block-composed dual witness \Phi with \mathsf {OR}_R \circ \mathsf {OR}_N is poor.

But if we look at \mathsf {OR}_R \circ \left (\mathsf {AND}_{\log R} \circ \mathsf {OR}_N\right ), the correlation analysis does go through. That is, we can give a dual witness \Psi _{\mathsf {in}} for \mathsf {AND}_{\log R} \circ \mathsf {OR}_N and a dual witness \Psi _{\mathsf {out}} for \mathsf {OR}_R such that the dual-block-composition of \Psi _{\mathsf {out}} and \Psi _{\mathsf {in}} is well-correlated with \mathsf {OR}_R \circ \left (\mathsf {AND}_{\log R} \circ \mathsf {OR}_N\right ).

This is because [BT15] showed that for \epsilon =1-1/(3R), d_{\epsilon }\left (\mathsf {AND}_{\log R} \circ \mathsf {OR}_N\right ) = \Omega (N^{1/2}). This means that \left (\mathsf {AND}_{\log R} \circ \mathsf {OR}_N\right ) has a dual witness \Psi _{\mathsf {in}} that “makes an error” with probability just 1/(3R). This probability of making an error is so low that a union bound over all R copies of \Psi _{\mathsf {in}} appearing in the dual-block-composition of \Psi _{\mathsf {out}} and \Psi _{\mathsf {in}} implies that with probability at least 1/3, none of the copies of \Psi _{\mathsf {in}} make an error.

In summary, the key difference between \mathsf {OR}_N and \mathsf {AND}_{\log R} \circ \mathsf {OR}_N that allows the lower bound analysis to go through for the latter but not the former is that the latter has \epsilon -approximate degree \Omega (N^{1/2}) for \epsilon = 1-1/(3R), while the former only has \epsilon -approximate degree \Omega (N^{1/2}) if \epsilon is a constant bounded away from 1.

To summarize, the \mathsf {SURJ} lower bound can be seen as a way to turn the function f_R = \mathsf {AND}_R into a harder function G=\mathsf {SURJ}, meaning that G has polynomially larger approximate degree than f_R. The right approach to generalize the technique for arbitrary f_R is to (a) introduce a dummy range item, all occurrences of which are effectively ignored by the harder function G, and (b) rather than considering the “inner” function \mathsf {OR}_N, consider the inner function \mathsf {AND}_{\log R} \circ \mathsf {OR}_N, i.e., let G=f_R \circ \mathsf {AND}_{\log R} \circ \mathsf {OR}_N(y_{1, 1} \dots , y_{R \log R, N}). The \mathsf {AND}_{\log R} gates are essential to make sure that the error in the correlation of the inner dual witness is very small, and hence the correlation analysis for the dual-block-composed dual witness goes through. Note that G can be interpreted as follows: it breaks the range [R \log R]_0 up into R blocks, each of length \log R, (the dummy range item is excluded from all of the blocks), and for each block it computes a bit indicating whether or not every range item in the block has frequency at least 1. It then feeds these bits into f_R.
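The following Python sketch spells out this parsing of G’s input, under illustrative conventions (range items are 1, \dots , R\log R; item 0 is the dummy; f is assumed to take R bits). It is only meant to make the construction concrete.

```python
import math

def harder_function(f, xs, R):
    # Block length ~ log R; range item 0 is the dummy and is ignored.
    b = max(1, math.ceil(math.log2(R)))
    present = {x for x in xs if x != 0}
    # For each of the R blocks, AND over the block of "item occurs at least once".
    block_bits = [
        all(j in present for j in range(blk * b + 1, (blk + 1) * b + 1))
        for blk in range(R)
    ]
    return f(block_bits)

# Example with f = AND_R (so G is a variant of SURJ on the extended range).
assert harder_function(all, list(range(1, 9)) + [0, 0], R=4) == True
assert harder_function(all, [1, 2, 3, 4, 0, 0], R=4) == False
```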

By recursively applying this construction, starting with f_R = \mathsf {AND}_R, we get a function in AC^0 with approximate degree \Omega (n^{1-\delta }) for any desired constant \delta > 0.

2.5 k-distinctness

The very same issue also arises in [BKT17]’s proof of a lower bound on the approximate degree of the k-distinctness function. The analog of Step 1 of the lower bound analysis for \mathsf {SURJ} reduces analyzing k-distinctness to analyzing \mathsf {OR} \circ \mathsf {TH}^k_N (restricted to inputs of Hamming weight at most N), where \mathsf {TH}^k_N is the function that evaluates to TRUE if and only if its input has Hamming weight at least k. The lower bound proved in [BKT17] for k-distinctness is \Omega (n^{3/4-1/(2k)}). \mathsf {OR} is the \mathsf {TH}^1 function. So, \mathsf {OR}_R \circ \mathsf {TH}^k is “close” to \mathsf {OR}_R \circ \mathsf {OR}_N. And we’ve seen that the correlation analysis of the dual witness obtained via dual-block-composition breaks down for \mathsf {OR}_R \circ \mathsf {OR}_N.

To overcome this issue, we have to show that \mathsf {TH}^k_N is harder to approximate than \mathsf {OR}_N itself, but we have to give up some small factor in the process. We will lose some quantity compared to the \Omega (n^{3/4}) lower bound for \mathsf {SURJ}. It may seem that this loss factor is just a technical issue and not intrinsic, but this is not so. In fact, the bound is almost tight: there is an upper bound, via a complicated quantum algorithm [BL11, Bel12] for k-distinctness, that makes O(n^{3/4-1/(2^{k+2}-4)})= n^{3/4-\Omega (1)} queries; we won’t elaborate on it here.

References

[Bel12]    Aleksandrs Belovs. Learning-graph-based quantum algorithm for k-distinctness. In Foundations of Computer Science (FOCS), 2012 IEEE 53rd Annual Symposium on, pages 207–216. IEEE, 2012.

[BKT17]   Mark Bun, Robin Kothari, and Justin Thaler. The polynomial method strikes back: Tight quantum query bounds via dual polynomials. arXiv preprint arXiv:1710.09079, 2017.

[BL11]    Aleksandrs Belovs and Troy Lee. Quantum algorithm for k-distinctness with prior knowledge on the input. arXiv preprint arXiv:1108.3022, 2011.

[BT15]    Mark Bun and Justin Thaler. Hardness amplification and the approximate degree of constant-depth circuits. In International Colloquium on Automata, Languages, and Programming, pages 268–280. Springer, 2015.

[BT17]    Mark Bun and Justin Thaler. A nearly optimal lower bound on the approximate degree of \mathsf {AC}^0. arXiv preprint arXiv:1703.05784, 2017.

[Gro96]    Lov K Grover. A fast quantum mechanical algorithm for database search. In Proceedings of the twenty-eighth annual ACM symposium on Theory of computing, pages 212–219. ACM, 1996.

[RS10]    Alexander A Razborov and Alexander A Sherstov. The sign-rank of \mathsf {AC}^{0}. SIAM Journal on Computing, 39(5):1833–1855, 2010.

Special Topics in Complexity Theory, Lectures 12-13

Special Topics in Complexity Theory, Fall 2017. Instructor: Emanuele Viola

1 Lectures 12-13, Scribe: Giorgos Zirdelis

In these lectures we study the communication complexity of some problems on groups. We give the definition of a protocol when two parties are involved and generalize later to more parties.

Definition 1. A 2-party c-bit deterministic communication protocol is a depth-c binary tree such that:

  • the leaves are the output of the protocol
  • each internal node is labeled with a party and a function from that party’s input space to \{0,1\}

Computation is done by following a path on edges, corresponding to outputs of functions at the nodes.

A public-coin randomized protocol is a distribution on deterministic protocols.

2 2-party communication protocols

We start with a simple protocol for the following problem.

Let G be a group. Alice gets x \in G and Bob gets y \in G and their goal is to check if x \cdot y = 1_G, or equivalently if x = y^{-1}.

There is a simple deterministic protocol in which Alice simply sends her input to Bob who checks if x \cdot y = 1_G. This has communication complexity O(\log |G|).

We give a randomized protocol that does better in terms of communication. Alice picks a random hash function h: G \rightarrow \{0,1\}^{\ell }. We can think of Alice and Bob as sharing some common randomness, and thus they can agree on a common hash function to use in the protocol. Next, Alice sends h(x) to Bob, who then checks if h(x)=h(y^{-1}).

For \ell = O(1) we get constant error and constant communication.
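A minimal Python sketch of this protocol, modeling the shared random hash function explicitly (group_elements, inverse, and the parameter ell are illustrative assumptions, not part of any particular library):

```python
import random

def equality_with_inverse(x, y, group_elements, inverse, ell=3, rng=None):
    rng = rng or random.Random(0)
    # Shared randomness: a random function h: G -> {0,1}^ell.
    h = {g: rng.getrandbits(ell) for g in group_elements}
    alice_message = h[x]                    # Alice sends ell bits
    return alice_message == h[inverse(y)]   # Bob accepts iff h(x) = h(y^{-1})
```

A “yes” instance (x = y^{-1}) is always accepted; a “no” instance is accepted with probability about 2^{-\ell }, so \ell = O(1) already gives constant error.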

3 3-party communication protocols

There are two ways to extend 2-party communication protocols to more parties. We first focus on the Number-in-hand (NIH), where Alice gets x, Bob gets y, Charlie gets z, and they want to check if x \cdot y \cdot z = 1_G. In the NIH setting the communication depends on the group G.

3.1 A randomized protocol for the hypercube

Let G=\left ( \{0,1\}^n, + \right ) with addition modulo 2. We want to test if x+y+z=0^n. First, we pick a linear hash function h, i.e. satisfying h(x+y) = h(x) + h(y). For a uniformly random a \in \{0,1\}^n set h_a(x) = \sum a_i x_i \pmod 2. Then,

  • Alice sends h_a(x)
  • Bob send h_a(y)
  • Charlie accepts if and only if \underbrace {h_a(x) + h_a(y)}_{h_a(x+y)} = h_a(z)

The hash function outputs 1 bit. The error probability is 1/2 and the communication is O(1). For a better error, we can repeat.
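A minimal Python sketch of one round of this protocol (inputs are tuples of bits; a is the shared randomness; names are illustrative):

```python
import random

def hypercube_round(x, y, z, n, rng=random):
    # Shared random a in {0,1}^n defines the linear hash h_a(w) = <a, w> mod 2.
    a = [rng.randrange(2) for _ in range(n)]
    h = lambda w: sum(ai * wi for ai, wi in zip(a, w)) % 2
    # Alice sends h(x), Bob sends h(y), Charlie checks h(x) + h(y) = h(z).
    return (h(x) + h(y)) % 2 == h(z)
```

If x+y+z=0^n the check always passes; otherwise it fails with probability 1/2 over the choice of a, so repeating the round drives the error down.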

3.2 A randomized protocol for \mathbb {Z}_m

Let G=\left (\mathbb {Z}_m, + \right ) where m=2^n. Again, we want to test if x+y+z=0 \pmod m. For this group, there is no 100% linear hash function, but there are almost linear hash function families h: \mathbb {Z}_m \rightarrow \mathbb {Z}_{2^{\ell }} that satisfy the following properties:

  1. \forall a,x,y we have h_a(x) + h_a(y) = h_a(x+y) \pm 1
  2. \forall x \neq 0 we have \Pr _{a} [h_a(x) \in \{\pm 2, \pm 1, 0\}] \leq 2^{-\Omega (\ell )}
  3. h_a(0)=0

Assuming some random hash function h (from a family) that satisfies the above properties the protocol works similar to the previous one.

  • Alice sends h_a(x)
  • Bob sends h_a(y)
  • Charlie accepts if and only if h_a(x) + h_a(y) + h_a(z) \in \{\pm 2, \pm 1, 0\}

We can set \ell = O(1) to achieve constant communication and constant error.

Analysis

To prove correctness of the protocol, first note that h_a(x) + h_a(y) + h_a(z) = h_a(x+y+z) \pm 2, then consider the following two cases:

  • if x+y+z=0 then h_a(x+y+z) \pm 2 = h_a(0) \pm 2 = 0 \pm 2
  • if x+y+z \neq 0 then \Pr _{a} [h_a(x+y+z) \in \{\pm 2, \pm 1, 0\}] \leq 2^{-\Omega (\ell )}

It now remains to show that such hash function families exist.

Let a be a random odd number modulo 2^n. Define

\begin{aligned} h_a(x) := (a \cdot x \gg n-\ell ) \pmod {2^{\ell }} \end{aligned}

where the product a \cdot x is integer multiplication. In other words we output the bits n-\ell +1, n-\ell +2, \ldots , n of the integer product a\cdot x.
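A minimal Python sketch of this hash family (n and ell are illustrative parameters; the product is taken over the integers and we keep bits n-\ell +1, \dots , n):

```python
import random

def sample_almost_linear_hash(n, ell, rng=random):
    a = rng.randrange(1, 1 << n, 2)              # random odd a modulo 2^n
    def h(x):
        # Bits n-ell+1, ..., n of the integer product a*x.
        return ((a * x) >> (n - ell)) & ((1 << ell) - 1)
    return h
```

By construction h(0)=0, which is property (3); properties (1) and (2) can be checked empirically on random inputs.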

We now verify that the above hash function family satisfies the three properties we required above.

Property (3) is trivially satisfied.

For property (1) we have the following. Let s = a\cdot x, t = a \cdot y, and u=n-\ell . The key question is how (s \gg u) + (t \gg u) compares with (s+t) \gg u. In more detail we have that,

  • h_a(x+y) = ((s+t) \gg u) \pmod {2^{\ell }}
  • h_a(x) = (s \gg u) \pmod {2^{\ell }}
  • h_a(y) = (t \gg u) \pmod {2^{\ell }}

Notice, that if in the addition s+t the carry into the u+1 bit is 0, then

\begin{aligned} (s \gg u) + (t \gg u) = (s+t) \gg u \end{aligned}

otherwise

\begin{aligned} (s \gg u) + (t \gg u) + 1 = (s+t) \gg u \end{aligned}

which concludes the proof for property (1).

Finally, we prove property (2). We start by writing x=s \cdot 2^c where s is odd. Bitwise, this looks like (\cdots \cdots 1 \underbrace {0 \cdots 0}_{c~ \textrm {bits}}).

The product a \cdot x for a uniformly random a, bitwise looks like ( \textit {uniform} ~ 1 \underbrace {0 \cdots 0}_{c~\textrm {bits}}). We consider the two following cases for the product a \cdot x:

  1. If a \cdot x = (\underbrace {\textit {uniform} ~ 1 \overbrace {00}^{2~bits}}_{\ell ~bits} \cdots 0), or equivalently c \geq n-\ell + 2, the output never lands in the bad set \{\pm 2, \pm 1, 0\} (some thought should be given to the representation of negative numbers – we ignore that for simplicity).
  2. Otherwise, the hash function output has \ell - O(1) uniform bits. Again for simplicity, let B = \{0,1,2\}. Thus,
    \begin{aligned} \Pr [\textrm {output} \in B] \leq |B| \cdot 2^{-\ell + O(1)} \end{aligned}

    In other words, the probability of landing in any small set is small.

4 Other groups

What happens in other groups? Do we have an almost linear hash function for 2 \times 2 matrices? The answer is negative. For SL_2(q) and A_n the problem of testing equality with 1_G is hard.

We would like to rule out randomized protocols, but it is hard to reason about them directly. Instead, we are going to rule out deterministic protocols on random inputs. For concreteness our main focus will be SL_2(q).

First, for any group element g \in G we define the distribution on triples, D_g := (x,y, (x \cdot y)^{-1} g), where x,y \in G are uniformly random elements. Note the product of the elements in D_g is always g.

Towards a contradiction, suppose we have a randomized protocol P for the xyz=^? 1_G problem. In particular, we have

\begin{aligned} \Pr [P(D_1)=1] \geq \Pr [P(D_h)=1] + \frac {1}{10}. \end{aligned}

This implies a deterministic protocol with the same gap, by fixing the randomness.

We reach a contradiction by showing that for every deterministic protocol P using little communication (we will quantify this later), we have

\begin{aligned} | \Pr [P(D_1)=1] - \Pr [P(D_h)=1] | \leq \frac {1}{100}. \end{aligned}

We start with the following lemma, which describes a protocol using product sets.

Lemma 1. (The set of accepted inputs of) A deterministic c-bit protocol can be written as a disjoint union of 2^c “rectangles,” that is sets of the form A \times B \times C.

Proof. (sketch) For every communication transcript t, let S_t \subseteq G^3 be the set of inputs giving transcript t. The sets S_t are disjoint since an input gives only one transcript, and their number is 2^c, i.e. one for each communication transcript of the protocol. The rectangle property can be proven by induction on the protocol tree. \square

Next, we show that these product sets cannot distinguish these two distributions D_1,D_h, and for that we will use the pseudorandom properties of the group G.

Lemma 2. For all A,B,C \subseteq G and all h \in G we have

\begin{aligned} |\Pr [A \times B \times C(D_1)=1] - \Pr [A \times B \times C(D_h)=1]| \leq \frac {1}{d^{\Omega (1)}} .\end{aligned}

Recall the parameter d from the previous lectures and that when the group G is SL_2(q) then d=|G|^{\Omega (1)}.

Proof. Pick any h \in G and let x,y,z be the inputs of Alice, Bob, and Charlie respectively. Then

\begin{aligned} \Pr [A \times B \times C(D_h)=1] = \Pr [ (x,y) \in A \times B ] \cdot \Pr [(x \cdot y)^{-1} \cdot h \in C | (x,y) \in A \times B] \end{aligned}

If either A or B is small, that is \Pr [x \in A] \leq \epsilon or \Pr [y \in B] \leq \epsilon , then also \Pr [P(D_h)=1] \leq \epsilon because the term \Pr [ (x,y) \in A \times B ] will be small. We will choose \epsilon later.

Otherwise, A and B are large, which implies that x and y are uniform over at least \epsilon |G| elements. Recall from Lecture 9 that this implies \lVert x \cdot y - U \rVert _2 \leq \lVert x \rVert _2 \cdot \lVert y \rVert _2 \cdot \sqrt {\frac {|G|}{d}}, where U is the uniform distribution.

By Cauchy–Schwarz we obtain,

\begin{aligned} \lVert x \cdot y - U \rVert _1 \leq |G| \cdot \lVert x \rVert _2 \cdot \lVert y \rVert _2 \cdot \sqrt {\frac {1}{d}} \leq \frac {1}{\epsilon } \cdot \frac {1}{\sqrt {d}}. \end{aligned}

The last inequality follows from the fact that \lVert x \rVert _2, \lVert y \rVert _2 \leq \sqrt {\frac {1}{\epsilon |G|}}.

This implies that \lVert (x \cdot y)^{-1} - U \rVert _1 \leq \frac {1}{\epsilon } \cdot \frac {1}{\sqrt {d}} and \lVert (x \cdot y)^{-1} \cdot h - U \rVert _1 \leq \frac {1}{\epsilon } \cdot \frac {1}{\sqrt {d}}, because taking inverses and multiplying by h does not change anything. These two last inequalities imply that,

\begin{aligned} \Pr [(x \cdot y)^{-1} \in C | (x,y) \in A \times B] = \Pr [(x \cdot y)^{-1} \cdot h \in C | (x,y) \in A \times B] \pm \frac {2}{\epsilon } \frac {1}{\sqrt {d}} \end{aligned}

and thus we get that,

\begin{aligned} \Pr [P(D_1)=1] = \Pr [P(D_h)=1] \pm \frac {2}{\epsilon } \frac {1}{\sqrt {d}}. \end{aligned}

To conclude, based on all the above we have that for all \epsilon and independent of the choice of h, it is either the case that

\begin{aligned} | \Pr [P(D_1)=1] - \Pr [P(D_h)=1] | \leq 2 \epsilon \end{aligned}

or

\begin{aligned} | \Pr [P(D_1)=1] - \Pr [P(D_h)=1] | \leq \frac {2}{\epsilon } \frac {1}{\sqrt {d}} \end{aligned}

and we will now choose the \epsilon to balance these two cases and finish the proof:

\begin{aligned} \frac {2}{\epsilon } \frac {1}{\sqrt {d}} = 2 \epsilon \Leftrightarrow \frac {1}{\sqrt {d}} = \epsilon ^2 \Leftrightarrow \epsilon = \frac {1}{d^{1/4}}. \end{aligned}

\square

The above proves that the distribution D_h behaves like the uniform distribution for product sets, for all h \in G.

Returning to arbitrary deterministic protocols P, write P as a union of 2^c disjoint rectangles by the first lemma. Applying the second lemma and summing over all rectangles we get that the distinguishing advantage of P is at most 2^c/d^{1/4}. For c \leq (1/100) \log d the advantage is at most 1/100 and thus we get a contradiction on the existence of such a correct protocol. We have concluded the proof of this theorem.

Theorem 3. Let G be a group, and d be the minimum dimension of an irreducible representation of G. Consider the 3-party, number-in-hand communication protocol f : G^3 \to \{0,1\} where f(x,y,z) = 1 \Leftrightarrow x \cdot y \cdot z = 1_G. Its randomized communication complexity is \Omega (\log d).

For SL_2(q) the communication is \Omega (\log |G|). This is tight up to constants, because Alice can send her entire group element.

For the group A_n the known bounds on d yield communication \Omega ( \log \log |G|). This bound is tight for the problem of distinguishing D_1 from D_h for h\neq 1, as we show next. The identity element 1_G for the group A_n is the identity permutation. If h \neq 1_G then h is a permutation that maps some element a \in [n] to h(a)=b \neq a. The idea is that the parties just need to “follow” a, which takes only O(\log n) = O(\log \log |G|) bits to describe. Specifically, let x,y,z be the permutations that Alice, Bob and Charlie get. Alice sends x(a) \in [n]. Bob gets x(a) and sends y(x(a)) \in [n] to Charlie, who checks if z(y(x(a))) = a. The communication is O(\log n). Because the size of the group is |G|=\Theta (n!) = \Theta \left ( \left ( \frac {n}{e} \right )^n \right ), the communication is O(\log \log |G|).

This is also a proof that d cannot be too large for A_n, i.e. is at most (\log |G|)^{O(1)}.
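A sketch of this “follow a” protocol in Python, with permutations represented as lists mapping i to x[i] (illustrative conventions; the product is applied as x first, then y, then z):

```python
def follow_a(x, y, z, a):
    alice_message = x[a]             # Alice sends x(a): O(log n) bits
    bob_message = y[alice_message]   # Bob sends y(x(a))
    # Charlie accepts iff the product maps a back to itself.
    return z[bob_message] == a
```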

5 More on 2-party protocols

We move to another setting where a clean answer can be given. Here we only have two parties. Alice gets x_1,x_2,\ldots ,x_n, Bob gets y_1,y_2,\ldots ,y_n, and they want to know if x_1 \cdot y_1 \cdot x_2 \cdot y_2 \cdots x_n \cdot y_n = 1_G.

When G is abelian, the elements can be reordered as to check whether (x_1 \cdot x_2 \cdots x_n) \cdot (y_1 \cdot y_2 \cdots y_n) = 1_G. This requires constant communication (using randomness) as we saw in Lecture 12, since it is equivalent to the check x \cdot y = 1_G where x=x_1 \cdot x_2 \cdots x_n and y=y_1 \cdot y_2 \cdots y_n.

We will prove the next theorem for non-abelian groups.

Theorem 1. For every non-abelian group G the communication of deciding if x_1 \cdot y_1 \cdot x_2 \cdot y_2 \cdots x_n \cdot y_n = 1_G is \Omega (n).

Proof. We reduce from unique disjointness, defined below. For the reduction we will need to encode the And of two bits x,y \in \{0,1\} as a group product. (This question is similar to a puzzle that asks how to hang a picture on the wall with two nails, such that if either one of the nails is removed, the picture will fall. This is like computing the And function on two bits, where both bits (nails) have to be 1 in order for the function to be 1.) Since G is non-abelian, there exist a,b \in G such that a \cdot b \neq b\cdot a, and in particular a \cdot b \cdot a^{-1} \cdot b^{-1} = h with h \neq 1. We can use this fact to encode And as

\begin{aligned} a^x \cdot b^y \cdot a^{-x} \cdot b^{-y}= \begin {cases} 1,~~\text {if And(x,y)=0}\\ h,~~\text {otherwise} \end {cases}. \end{aligned}
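A small Python sanity check of this encoding, using two non-commuting transpositions in S_3 as the elements a and b (this particular group is just an illustration):

```python
def compose(p, q):
    # (p * q)(i) = p(q(i)): apply q first, then p.
    return tuple(p[q[i]] for i in range(len(p)))

def inverse(p):
    inv = [0] * len(p)
    for i, v in enumerate(p):
        inv[v] = i
    return tuple(inv)

identity = (0, 1, 2)
a = (1, 0, 2)   # the transposition (0 1)
b = (0, 2, 1)   # the transposition (1 2); a and b do not commute

def encode_and(x, y):
    ax = a if x else identity
    by = b if y else identity
    # a^x * b^y * a^{-x} * b^{-y}
    return compose(compose(compose(ax, by), inverse(ax)), inverse(by))

assert encode_and(0, 0) == encode_and(0, 1) == encode_and(1, 0) == identity
assert encode_and(1, 1) != identity   # this is the element h
```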

In the disjointness problem Alice and Bob get inputs x,y \in \{0,1\}^n respectively, and they wish to check if there exists an i \in [n] such that x_i \land y_i =1. If you think of them as characteristic vectors of sets, this problem is asking if the sets have a common element or not. The communication of this problem is \Omega (n). Moreover, in the variant of this problem where the number of such i’s is 0 or 1 (i.e. unique), the same lower bound \Omega (n) still applies. This is like giving Alice and Bob two sets that either are disjoint or intersect in exactly one element, and they need to distinguish these two cases.

Next, we will reduce the above variant of set disjointness to group products. For x,y \in \{0,1\}^n we produce inputs for the group problem as follows:

\begin{aligned} x & \rightarrow (a^{x_1} , a^{-x_1} , \ldots , a^{x_n}, a^{-x_n} ) \\ y & \rightarrow (b^{y_1} , b^{-y_1}, \ldots , b^{y_n}, b^{-y_n}). \end{aligned}

Now, the product x_1 \cdot y_1 \cdot x_2 \cdot y_2 \cdots x_n \cdot y_n we originally wanted to compute becomes

\begin{aligned} \underbrace {a^{x_1} \cdot b^{y_1} \cdot a^{-x_1} \cdot b^{-y_1}}_{\text {1 bit}} \cdots \cdots a^{x_n} \cdot b^{y_n} \cdot a^{-x_n} \cdot b^{-y_n}. \end{aligned}

If there isn’t an i \in [n] such that x_i \land y_i=1, then each product term a^{x_i} \cdot b^{y_i} \cdot a^{-x_i} \cdot b^{-y_i} is 1 for all i, and thus the whole product is 1.

Otherwise, there exists a unique i such that x_i \land y_i=1 and thus the product will be 1 \cdots 1 \cdot h \cdot 1 \cdots 1=h, with h being in the i-th position. If Alice and Bob can test if the above product is equal to 1, they can also solve the unique set disjointness problem, and thus the lower bound applies for the former. \square

We required the uniqueness property, because otherwise we might get a product h^c that could be equal to 1 in some groups.

Special Topics in Complexity Theory, Lectures 8-9

Special Topics in Complexity Theory, Fall 2017. Instructor: Emanuele Viola

1 Lecture 8-9, Scribe: Xuangui Huang

In these lectures, we finish the proof of the approximate degree lower bound for AND-OR function, then we move to the surjectivity function SURJ. Finally we discuss quasirandom groups.

1.1 Lower Bound of d_{1/3}(AND-OR)

Recall from the last lecture that AND-OR:\{0,1\}^{R\times N} \rightarrow \{0,1\} is the composition of the AND function on R bits and the OR function on N bits. We also proved the following lemma.

Lemma 1. Suppose that distributions A^0, A^1 over \{0,1\}^{n_A} are k_A-wise indistinguishable distributions; and distributions B^0, B^1 over \{0,1\}^{n_B} are k_B-wise indistinguishable distributions. Define C^0, C^1 over \{0,1\}^{n_A \cdot n_B} as follows:

C^b: draw a sample x \in \{0,1\}^{n_A} from A^b, and replace each bit x_i by a sample of B^{x_i} (independently).

Then C^0 and C^1 are k_A \cdot k_B-wise indistinguishable.
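A minimal Python sketch of the composed distribution C^b (sample_A and sample_B are assumed samplers for A^b and B^{bit}, respectively; names are illustrative):

```python
def sample_C(b, sample_A, sample_B):
    # Draw x from A^b, then replace each bit x_i by an independent
    # sample from B^{x_i}; the result lies in {0,1}^{n_A * n_B}.
    x = sample_A(b)
    return tuple(bit for xi in x for bit in sample_B(xi))
```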

To finish the proof of the lower bound on the approximate degree of the AND-OR function, it remains to show that AND-OR distinguishes the distributions C^0 and C^1 well. For this, we begin by observing that we can assume without loss of generality that the distributions have disjoint supports.

Claim 2. For any function f, and for any k-wise indistinguishable distributions A^0 and A^1, if f can distinguish A^0 and A^1 with probability \epsilon then there are distributions B^0 and B^1 with the same properties (k-wise indistinguishability yet distinguishable by f) and also with disjoint supports. (By disjoint support we mean for any x either \Pr [B^0 = x] = 0 or \Pr [B^1 = x] = 0.)

Proof. Let distribution C be the “common part” of A^0 and A^1. That is to say, we define C such that \Pr [C = x] := \min \{\Pr [A^0 = x], \Pr [A^1 = x]\}, multiplied by the constant that normalizes C into a distribution.

Then we can write A^0 and A^1 as

\begin{aligned} A^0 &= pC + (1-p) B^0 \,,\\ A^1 &= pC + (1-p) B^1 \,, \end{aligned}

where p \in [0,1], B^0 and B^1 are two distributions. Clearly B^0 and B^1 have disjoint supports.

Then we have

\begin{aligned} \mathbb{E} [f(A^0)] - \mathbb{E} [f(A^1)] =&~p \mathbb{E} [f(C)] + (1-p) \mathbb{E} [f(B^0)] \notag \\ &- p \mathbb{E} [f(C)] - (1-p) \mathbb{E} [f(B^1)] \\ =&~(1-p) \big ( \mathbb{E} [f(B^0)] - \mathbb{E} [f(B^1)] \big ) \\ \leq &~\mathbb{E} [f(B^0)] - \mathbb{E} [f(B^1)] \,. \end{aligned}

Therefore if f can distinguish A^0 and A^1 with probability \epsilon then it can also distinguish B^0 and B^1 with such probability.

Similarly, for all S \neq \varnothing such that |S| \leq k, we have

\begin{aligned} 0 = \mathbb{E} [\chi _S(A^0)] - \mathbb{E} [\chi _S(A^1)] = (1-p) \big ( \mathbb{E} [\chi _S(B^0)] - \mathbb{E} [\chi _S(B^1)] \big ) = 0 \,. \end{aligned}

Hence, B^0 and B^1 are k-wise indistinguishable. \square

Equipped with the above lemma and claim, we can finally prove the following lower bound on the approximate degree of AND-OR.

Theorem 3. d_{1/3}(AND-OR) = \Omega (\sqrt {RN}).

Proof. Let A^0, A^1 be \Omega (\sqrt {R})-wise indistinguishable distributions for AND with advantage 0.99, i.e. \Pr [\mathrm {AND}(A^1) = 1] > \Pr [\mathrm {AND}(A^0) = 1] + 0.99. Let B^0, B^1 be \Omega (\sqrt {N})-wise indistinguishable distributions for OR with advantage 0.99. By the above claim, we can assume that A^0, A^1 have disjoint supports, and the same for B^0, B^1. Compose them by the lemma, getting \Omega (\sqrt {RN})-wise indistinguishable distributions C^0,C^1. We now show that AND-OR can distinguish C^0, C^1:

  • C^0: First sample A^0. Since 1^R is the unique x with \mathrm {AND}(x)= 1 and \Pr [\mathrm {AND}(A^1)=1] > 0.99, we have \Pr [A^1 = 1^R] >0. Thus by disjointness of support \Pr [A^0 = 1^R] = 0. Therefore when sampling A^0 we always get a string with at least one “0”. But then that “0” is replaced with a sample from B^0. We have \Pr [B^0 = 0^N] \geq 0.99, and when B^0 = 0^N, AND-OR=0.
  • C^1: First sample A^1; we know that A^1 = 1^R with probability at least 0.99. Each bit “1” is replaced by a sample from B^1, and we know that \Pr [B^1 = 0^N] = 0 by disjointness of support. Then AND-OR=1.

Therefore we have d_{1/3}(AND-OR)= \Omega (\sqrt {RN}). \square

1.2 Lower Bound of d_{1/3}(SURJ)

In this subsection we discuss the approximate degree of the surjectivity function. This function is defined as follows.

Definition 4. The surjectivity function SURJ\colon \left (\{0,1\}^{\log R}\right )^N \to \{0,1\}, which takes input (x_1, \dots , x_N) where x_i \in [R] for all i, has value 1 if and only if \forall j \in [R], \exists i\colon x_i = j.

First, some history. Aaronson first proved that the approximate degree of SURJ and other functions on n bits including “the collision problem” is n^{\Omega (1)}. This was motivated by an application in quantum computing. Before this result, even a lower bound of \omega (1) had not been known. Later Shi improved the lower bound to n^{2/3}, see [AS04]. The instructor believes that the quantum framework may have blocked some people from studying this problem, though it may have very well attracted others. Recently Bun and Thaler [BT17] reproved the n^{2/3} lower bound, but in a quantum-free paper, and introducing some different intuition. Soon after, together with Kothari, they proved [BKT17] that the approximate degree of SURJ is \Theta (n^{3/4}).

We shall now prove the \Omega (n^{3/4}) lower bound, though one piece is only sketched. Again we present some things in a different way from the papers.

For the proof, we consider the AND-OR function under the promise that the Hamming weight of the RN input bits is at most N. Call the approximate degree of AND-OR under this promise d_{1/3}^{\leq N}(AND-OR). Then we can prove the following theorems.

Theorem 5. d_{1/3}(SURJ) \geq d_{1/3}^{\leq N}(AND-OR).

Theorem 6. d_{1/3}^{\leq N}(AND-OR) \geq \Omega (N^{3/4}) for some suitable R = \Theta (N).

In our setting, we consider R = \Theta (N). Theorem 5 shows, surprisingly, that we can somehow “shrink” \Theta (N^2) bits of input down to N\log N bits while maintaining the approximate degree of the function, under some promise. Without this promise, we just showed in the last subsection that the approximate degree of AND-OR is \Omega (N), instead of the \Omega (N^{3/4}) of Theorem 6.

Proof of Theorem 5. Define an N \times R matrix Y s.t. the 0/1 variable y_{ij} is the entry in the i-th row j-th column, and y_{ij} = 1 iff x_i = j. We can prove this theorem in following steps:

  1. d_{1/3}(SURJ(\overline {x})) \geq d_{1/3}(AND-OR(\overline {y})) under the promise that each row has weight 1;
  2. let z_j be the sum of the j-th column, then d_{1/3}(AND-OR(\overline {y})) under the promise that each row has weight 1, is at least d_{1/3}(AND-OR(\overline {z})) under the promise that \sum _j z_j = N;
  3. d_{1/3}(AND-OR(\overline {z})) under the promise that \sum _j z_j = N, is at least d_{1/3}^{=N}(AND-OR(\overline {y}));
  4. we can change “=N” into “\leq N”.

Now we prove this theorem step by step.

  1. Let P(x_1, \dots , x_N) be a polynomial for SURJ, where x_i = (x_i)_1, \dots , (x_i)_{\log R}. Then we have
    \begin{aligned} (x_i)_k = \sum _{j: k\text {-th bit of }j \text { is } 1} y_{ij}. \end{aligned}

    Then the polynomial P'(\overline {y}) for AND-OR(\overline {y}) is the polynomial P(\overline {x}) with (x_i)_k replaced as above, thus the degree won’t increase. Correctness follows by the promise.

  2. This is the most extraordinary step, due to Ambainis [Amb05]. In this notation, AND-OR becomes the indicator function of \forall j, z_j \neq 0. Define
    \begin{aligned} Q(z_1, \dots , z_R) := \mathop {\mathbb{E} }_{\substack {\overline {y}: \text { rows have weight } 1\\ \text {and are consistent with }\overline {z}}} P(\overline {y}). \end{aligned}

    Clearly it is a good approximation of AND-OR(\overline {z}). It remains to show that it’s a polynomial of degree k in z’s if P is a polynomial of degree k in y’s.

    Let’s look at one monomial of degree k in P: y_{i_1j_1}y_{i_2j_2}\cdots y_{i_kj_k}. Observe that all i_\ell ’s are distinct by the promise, and by u^2 = u over \{0,1\}. By chain rule we have

    \begin{aligned} \mathbb{E} [y_{i_1j_1}\cdots y_{i_kj_k}] = \mathbb{E} [y_{i_1j_1}]\mathbb{E} [y_{i_2j_2}|y_{i_1j_1} = 1] \cdots \mathbb{E} [y_{i_kj_k}|y_{i_1j_1}=\cdots =y_{i_{k-1}j_{k-1}} = 1]. \end{aligned}

    By symmetry we have \mathbb{E} [y_{i_1j_1}] = \frac {z_{j_1}}{N}, which is linear in z’s. To get \mathbb{E} [y_{i_2j_2}|y_{i_1j_1} = 1], we know that every other entry in row i_1 is 0, so we give away row i_1 and average over y’s such that \left \{\begin {array}{ll} y_{i_1j_1} = 1 &\\ y_{i_1 j} = 0 & j\neq j_1 \end {array}\right . under the promise and consistent with z’s. Therefore

    \begin{aligned} \mathbb{E} [y_{i_2j_2}|y_{i_1j_1} = 1] = \left \{ \begin {array}{ll} \frac {z_{j_2}}{N-1} & j_1 \neq j_2,\\ \frac {z_{j_2}-1}{N-1} & j_1 = j_2. \end {array}\right . \end{aligned}

    In general we have

    \begin{aligned} \mathbb{E} [y_{i_kj_k}|y_{i_1j_1}=\cdots =y_{i_{k-1}j_{k-1}} = 1] = \frac {z_{j_k} - \#\ell < k \colon j_\ell = j_k}{N-k + 1}, \end{aligned}

    which has degree 1 in z’s. Therefore the degree of Q is not larger than that of P.

  3. Note that \forall j, z_j = \sum _i y_{ij}. Hence by replacing z’s by y’s, the degree won’t increase.
  4. We can add a “slack” variable z_0, or equivalently y_{01}, \dots , y_{0N}; then the condition \sum _{j=0}^R z_j = N actually means \sum _{j=1}^R z_j \leq N.

\square

Proof idea for Theorem 6. First, by the duality argument we can verify that d_{1/3}^{\leq N}(f) \geq d if and only if there exist d-wise indistinguishable distributions A, B such that:

  • f can distinguish A, B;
  • A and B are supported on strings of weight \leq N.

Claim 7. d_{1/3}^{\leq \sqrt {N}}(OR_N) = \Omega (N^{1/4}).

The proof needs a little more information about the weight distribution of the indistinguishable distributions corresponding to this claim. Basically, their expected weight is very small.

Now we combine these distributions with the usual ones for And using the lemma mentioned at the beginning.

What remains to show is that the final distribution is supported on Hamming weight \le N. Because by construction the R copies of the distributions for Or are sampled independently, we can use concentration of measure to prove a tail bound. This gives that all but an exponentially small measure of the distribution is supported on strings of weight \le N. The final step of the proof consists of slightly tweaking the distributions to make that measure 0. \square

1.3 Groups

Groups have many applications in theoretical computer science. Barrington [Bar89] used the permutation group S_5 to prove a very surprising result, which states that the majority function can be computed efficiently using only a constant number of bits of memory (something which was conjectured to be false). More recently, catalytic computation [BCK^{+}14] shows that if we have a lot of memory, but it’s full of junk that cannot be erased, we can still compute more than if we had little memory. We will see some interesting properties of groups in the following.

Some famous groups used in computer science are:

  • \{0,1\}^n with bit-wise addition;
  • \mathbb {Z}_m with addition mod m ;
  • S_n, which are permutations of n elements;
  • Wreath product G:= (\mathbb {Z}_m \times \mathbb {Z}_m) \wr \mathbb {Z}_2\,, whose elements are of the form (a,b)z where z is a “flip bit”, with the following multiplication rules:
    • (a, b) 1 = 1 (b, a) ;
    • z\cdot z' := z+z' in \mathbb {Z}_2 ;
    • (a,b) \cdot (a',b') := (a+a', b+b') is the \mathbb {Z}_m\times \mathbb {Z}_m operation;

    An example is (5,7)1 \cdot (2,1) 1 = (5,7) 1 \cdot 1 (1, 2) = (6,9)0 . Generally we have

    \begin{aligned} (a, b) z \cdot (a', b') z' = \left \{ \begin {array}{ll} (a + a', b+b') (z+z') & z = 0\,,\\ (a+b', b + a') (z+z') & z = 1\,; \end {array}\right . \end{aligned}

    (A code sketch of this multiplication rule appears right after this list.)

  • SL_2(q) := \{2\times 2 matrices over \mathbb {F}_q with determinant 1\}, in other words, group of matrices \begin {pmatrix} a & b\\ c & d \end {pmatrix} such that ad - bc = 1.
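Here is a small Python sketch of the wreath product multiplication described above; it checks the worked example (m is an illustrative modulus large enough not to wrap):

```python
def wreath_mul(g, h, m):
    (a, b), z = g
    (a2, b2), z2 = h
    if z == 1:
        # Moving the flip bit past (a2, b2) swaps its coordinates.
        a2, b2 = b2, a2
    return ((a + a2) % m, (b + b2) % m), (z + z2) % 2

# The example from the notes: (5,7)1 * (2,1)1 = (6,9)0.
assert wreath_mul(((5, 7), 1), ((2, 1), 1), m=100) == ((6, 9), 0)
```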

The group SL_2(q) was invented by Galois. (If you haven’t, read his biography on wikipedia.)

Quiz. Among these groups, which is the “least abelian”? The latter can be defined in several ways. We focus on this: If we have two high-entropy distributions X, Y over G, does X \cdot Y have more entropy? For example, if X and Y are uniform over some \Omega (|G|) elements, is X\cdot Y close to uniform over G? By “close to” we mean that the statistical distance from the uniform distribution is less than a small constant. For G=(\{0,1\}^n, +), if Y=X is uniform over \{0\}\times \{0,1\}^{n-1}, then X\cdot Y is the same, so there is no entropy increase even though X and Y are uniform on half the elements.

Definition 8.[Measure of Entropy] For a distribution A on G, let \lVert A\rVert _2 = \left (\sum _xA(x)^2\right )^{\frac {1}{2}}. We think of A as having “high entropy” when \lVert A\rVert ^2_2 \leq 100 \frac {1}{|G|}.

Note that \lVert A\rVert ^2_2 is exactly the “collision probability” \Pr [A = A']. The uniform distribution U has the smallest possible such norm, \lVert U\rVert ^2_2 = \frac {1}{|G|}, which is negligible at this scale (essentially \lVert \overline {0}\rVert ^2_2). Then we have

\begin{aligned} \lVert A - U \rVert ^2_2 &= \sum _x \left (A(x) - \frac {1}{|G|}\right )^2\\ &= \sum _x A(x)^2 - 2A(x) \frac {1}{|G|} + \frac {1}{|G|^2} \\ &= \lVert A \rVert ^2_2 - \frac {1}{|G|} \\ &= \lVert A \rVert ^2_2 - \lVert U \rVert ^2_2\\ &\approx \lVert A \rVert ^2_2\,. \end{aligned}

Theorem 9. [Gow08, BNP08] If X, Y are independent over G, then

\begin{aligned} \lVert X\cdot Y - U \rVert _2 \leq \lVert X \rVert _2 \lVert Y \rVert _2 \sqrt {\frac {|G|}{d}}, \end{aligned}

where d is the minimum dimension of irreducible representation of G.

By this theorem, for high entropy distributions X and Y, we get \lVert X\cdot Y - U \rVert _2 \leq \frac {O(1)}{\sqrt {|G|d}}, thus we have

\begin{aligned} ~~~~(1) \lVert X\cdot Y - U \rVert _1 \leq \sqrt {|G|} \lVert X\cdot Y - U \rVert _2 \leq \frac {O(1)}{\sqrt {d}}. \end{aligned}

If d is large, then X \cdot Y is very close to uniform. The following table shows the d’s for the groups we’ve introduced.

G:  \{0,1\}^n  |  \mathbb {Z}_m  |  (\mathbb {Z}_m \times \mathbb {Z}_m) \wr \mathbb {Z}_2  |  A_n  |  SL_2(q)
d:  1  |  1  |  should be very small  |  \frac {\log |G|}{\log \log |G|}  |  |G|^{1/3}

Here A_n is the alternating group of even permutations. We can see that for the first three groups, Equation (1) doesn’t give non-trivial bounds.

But for A_n we get a non-trivial bound, and for SL_2(q) we get a strong bound: we have \lVert X\cdot Y - U \rVert _2 \leq \frac {1}{|G|^{\Omega (1)}}.

References

[Amb05]    Andris Ambainis. Polynomial degree and lower bounds in quantum complexity: Collision and element distinctness with small range. Theory of Computing, 1(1):37–46, 2005.

[AS04]    Scott Aaronson and Yaoyun Shi. Quantum lower bounds for the collision and the element distinctness problems. J. of the ACM, 51(4):595–605, 2004.

[Bar89]    David A. Mix Barrington. Bounded-width polynomial-size branching programs recognize exactly those languages in NC^1. J. of Computer and System Sciences, 38(1):150–164, 1989.

[BCK^{+}14]    Harry Buhrman, Richard Cleve, Michal Koucký, Bruno Loff, and Florian Speelman. Computing with a full memory: catalytic space. In ACM Symp. on the Theory of Computing (STOC), pages 857–866, 2014.

[BKT17]    Mark Bun, Robin Kothari, and Justin Thaler. The polynomial method strikes back: Tight quantum query bounds via dual polynomials. CoRR, arXiv:1710.09079, 2017.

[BNP08]    László Babai, Nikolay Nikolov, and László Pyber. Product growth and mixing in finite groups. In ACM-SIAM Symp. on Discrete Algorithms (SODA), pages 248–257, 2008.

[BT17]    Mark Bun and Justin Thaler. A nearly optimal lower bound on the approximate degree of AC0. CoRR, abs/1703.05784, 2017.

[Gow08]    W. T. Gowers. Quasirandom groups. Combinatorics, Probability & Computing, 17(3):363–387, 2008.

Special Topics in Complexity Theory, Lectures 6-7

Special Topics in Complexity Theory, Fall 2017. Instructor: Emanuele Viola

1 Lecture 6-7, Scribe: Willy Quach

In these lectures, we introduce k-wise indistinguishability and link this notion to the approximate degree of a function. Then, we study the approximate degree of some functions, namely, the AND function and the AND-OR function. For the latter function we begin to see a proof that is different (either in substance or language) from the proofs in the literature. We begin with some LaTeX tips.

1.1 Some LaTeX tips.

  • Mind the punctuation. Treat equations as part of the phrases; add commas, colons, etc accordingly.
  • In math mode, it is usually better to use \ell (\ell ) rather than regular l. The latter can be confused with 1.
  • Align equations with \begin{align} \cdots \end{align} with the alignment character &.
  • For set inclusion, use \subset (\subset ) only if you mean proper inclusion (which is uncommon). Otherwise use \subseteq (\subseteq ). (Not everybody agrees with this, but this seems the most natural convention by analogy with < and \le .)

1.2 Introducing k-wise indistinguishability.

We studied previously the following questions:

  • What is the minimum k such that any k-wise independent distribution P over \{0,1\}^n fools \mathrm {AC}^0 (i.e. \mathbb {E}C(P) \approx \mathbb {E}C(U) for all poly(n)-size circuits C with constant depth)?

    We saw that k = \log ^{\mathcal {O}(d)}(s/\epsilon ) is enough.

  • What is the minimum k such that P fools the AND function?

    Taking k=\mathcal {O}(1) for \epsilon =\mathcal {O}(1) suffices (more precisely we saw that k-wise independence fools the AND function with \epsilon = 2^{-\Omega (k)}).

Consider now P and Q two distributions over \{0,1\}^n that are k-wise indistinguishable, that is, any projection over k bits of P and Q have the same distribution. We can ask similar questions:

  • What is the minimum k such that \mathrm {AC}^0 cannot distinguish P and Q (i.e. \mathbb {E}C(P) \approx \mathbb {E}C(Q) for all poly(n)-size circuits C with constant depth)?

    It turns out this requires k \geq n^{1-o(1)}: there are some distributions that are almost always distinguishable in this regime. (Whether k=\Omega (n) is necessary or not is an open question.)

    Also, k = n\left (1- \frac {1}{polylog(n)}\right ) suffices to fool \mathrm {AC}^0 (in which case \epsilon is essentially exponentially small).

  • What is the minimum k such that the AND function (on n bits) cannot distinguish P and Q?

    It turns out that k=\Theta (\sqrt {n}) is necessary and sufficient. More precisely:

    • There exists some P,Q over \{0,1\}^n that are c\sqrt {n}-wise indistinguishable for some constant c, but such that:
      \begin{aligned} \left | \Pr _P [AND(P)=1] - \Pr _Q [AND(Q)=1] \right | \geq 0.99 \,;\end{aligned}
    • For all P, Q that are c'\sqrt {n}-wise indistinguishable for some bigger constant c', we have:
      \begin{aligned} \left | \Pr _P [AND(P)=1] - \Pr _Q [AND(Q)=1] \right | \leq 0.01 \,.\end{aligned}

1.3 Duality.

Those question are actually equivalent to ones related about approximation by real-valued polynomials:

Theorem 1. Let f:\{0,1\}^n \rightarrow \{0,1\} be a function, and k an integer. Then:

\begin{aligned} \max _{P,Q \, k\text {-wise indist.}} \left | \mathbb {E}f(P)-\mathbb {E}f(Q) \right | = \min \{ \, \epsilon \, | \, \exists g\in \mathbb {R}_k[X]: \forall x, \left |f(x)-g(x) \right | \leq \epsilon \}. \end{aligned}

Here \mathbb {R}_k[X] denotes degree-k real polynomials. We will denote the right-hand side \epsilon _k(f).

Some examples:

  • f=1: then \mathbb {E}f(P)=1 for every distribution P, so that both sides of the equality are 0.
  • f(x) = \sum _i x_i \bmod 2 the parity function on n bits.

    Then for k = n-1, the left-hand side is at least 1/2: take P to be uniform; and Q to be uniform on n-1 bits, defining the nth bit to be Q_n = \sum _{i<n} Q_i \bmod 2 to be the parity of the first n-1 bits. Then \mathbb {E}f(P)=1/2 but \mathbb {E}f(Q)=0.

    Furthermore, we have:

    Claim 2. \epsilon _{n-1}(\mathrm {Parity}) \geq 1/2.

    Proof. Suppose by contradiction that some polynomial g has degree k and approximates Parity by \epsilon < 1/2.

    The key ingredient is to symmetrize a polynomial p, by letting

    \begin{aligned} p^{sym}(x) := \frac {1}{n!} \sum _{\pi \in \mathfrak {S}_n} p(\pi x),\end{aligned}

    where \pi ranges over permutations. Note that p^{sym}(x) only depends on \|x\| = \sum _i x_i.

    Now we claim that there is a univariate polynomial p' also of degree k such that

    \begin{aligned} p'(\sum x_i) = p^{sym}(x_1, x_2, \ldots , x_n)\end{aligned}

    for every x.

    To illustrate, let M be a monomial of p. For instance if M = X_1, then p'(i) = i/n, where i is the Hamming weight of the input. (For this we think of the input as being in \{0,1\}^n. Similar calculations can be done for \{-1,1\}^n.)

    If M = X_1 X_2, then p'(i) = \frac {i}{n}\cdot \frac {i-1}{n-1}, which is quadratic in i (a small numerical check of this appears right after the proof).

    And so on.

    More generally p^{sym}(X_1,\dots ,X_n) is a symmetric polynomial. As \{(\sum _j X_j)^\ell \}_{\ell \leq k} span the symmetric polynomials of degree at most k, p^{sym} can be written as a linear combination in this basis. Now note that \{(\sum _j X_j)^{\ell } (x)\}_{\ell \leq k} only depends on \|x\|; substituting i = \sum _j X_j gives that p' is of degree \leq k in i.

    (Note that the degree of p' can be strictly less than the degree of p (e.g. for p(X_1,X_2) = X_1-X_2: we have p^{sym} = p' = 0).)

    Then, applying symmetrization on g, if g is a real polynomial \epsilon -close to Parity (in \ell _\infty norm), then g' is also \epsilon -close to Parity’ (as a convex combination of close values).

    Finally, remark that for every integer k \in \{0,\dots ,\lfloor n/2 \rfloor \}, we have: Parity'(2k)=0 and Parity'(2k+1)=1. In particular, as \epsilon < 1/2, the polynomial g'-1/2 changes sign between consecutive integers, so it has at least n zeroes. Since its degree is at most n-1, it must be identically zero; but then g' \equiv 1/2 is not \epsilon -close to Parity' for any \epsilon < 1/2, a contradiction. \square
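As a small sanity check of the symmetrization step, the following Python sketch verifies, for the monomial p = X_1 X_2 on 3 bits, that p^{sym} depends only on the Hamming weight and equals \frac {i}{n} \cdot \frac {i-1}{n-1} (the example monomial and parameters are illustrative):

```python
from itertools import permutations
from fractions import Fraction

def p(x):
    # The example monomial X_1 * X_2.
    return x[0] * x[1]

def p_sym(x):
    n = len(x)
    perms = list(permutations(range(n)))
    return Fraction(sum(p(tuple(x[i] for i in pi)) for pi in perms), len(perms))

n = 3
for weight in range(n + 1):
    x = (1,) * weight + (0,) * (n - weight)
    assert p_sym(x) == Fraction(weight, n) * Fraction(max(weight - 1, 0), n - 1)
```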

We will now focus on proving the theorem.

Note that one direction is easy: if a function f is closely approximated by a polynomial g of degree k, it cannot distinguish two k-wise indistinguishable distributions P and Q:

\begin{aligned} \mathbb {E}[f(P)]&= \mathbb {E}[g(P)] \pm \epsilon \\ &\stackrel {(*)}{=} \mathbb {E}[g(Q)] \pm \epsilon \\ &= \mathbb {E}[f(Q)] \pm 2\epsilon \, , \end{aligned}

where (*) comes from the fact that P and Q are k-wise indistinguishable.

The general proof goes by a Linear Programming Duality (aka finite-dimensional Hahn-Banach theorem, min-max, etc.). This states that:

If A \in \mathbb {R}^{n\times m}, x\in \mathbb {R}^m, b\in \mathbb {R}^n and c\in \mathbb {R}^m, then:

\left . \begin {array}{rrcl} &\min \langle c,x \rangle &=& \sum _{i \leq m} c_i x_i\\ &&\\ \text { subject to:} &Ax &=& b\\ &x &\geq & 0\\ \end {array} \right | \, = \, \left | \begin {array}{cc} &\max \langle b,y \rangle \\ &\\ \text { subject to:} &A^T y \leq c\\ &\\ \end {array} \right .

We can now prove the theorem:

Proof.

The proof will consist in rewriting the sides of the equality in the theorem as outputs of a Linear Program. Let us focus on the left side of the equality: \max _{P,Q \, k\text {-wise indist.}} \left | \mathbb {E}f(P)-\mathbb {E}f(Q) \right |.

We will introduce 2^{n+1} variables, namely P_x and Q_x for every x\in \{0,1\}^n, which will represent \Pr [D=x] for D=P,Q.

We will also use the following, which can be proved similarly to the Vazirani XOR Lemma:

Claim 3. Two distributions P and Q are k-wise indistinguishable if and only if: \forall S\subseteq \{1,\dots ,n\} with |S|\leq k, \sum _x P_x \chi _S(x) - \sum _x Q_x \chi _S(x)=0, where \chi _S(X) = \prod _S X_i is the Fourier basis of boolean functions.

The quantity \max _{P,Q \, k\text {-wise indist.}} \left | \mathbb {E}f(P)-\mathbb {E}f(Q) \right | can then be rewritten:

\begin {array}{rrl} &-\min \sum _x P_xf(x) - \sum _x Q_xf(x)\\ &&\\ \text { subject to:} &\sum _x P_x &= 1\\ &\sum _x Q_x &= 1\\ &\forall S \subseteq \{1,\dots ,n\} \text { s.t. } |S|\leq k,\sum _x (P_x - Q_x) \chi _S(x) &= 0 \\ \end {array}

Following the syntax of LP Duality stated above, we have:

c^T = \overbrace {\cdots f(x) \cdots }^{2^n}\overbrace {\cdots -f(x) \cdots }^{2^n} \in \mathbb {R}^{2 \cdot 2^n} (where x ranges over \{0,1\}^n),

x^T = \overbrace {\cdots P_x \cdots }^{2^n}\overbrace {\cdots Q_x \cdots }^{2^n} \in \mathbb {R}^{2 \cdot 2^n},

b^T = 1 1 \overbrace {0\cdots 0}^{\# S},

A = \left ( \begin {array}{cc} \overbrace {1\cdots \cdots 1}^{2^n} & \overbrace {0\cdots \cdots 0}^{2^n} \\ 0 \cdots \cdots 0 & 1 \cdots \cdots 1 \\ \cdots \cdots & \cdots \cdots \\ \vdots \cdots \cdots \vdots & \vdots \cdots \cdots \vdots \\ \cdots \chi _S(x) \cdots & \cdots -\chi _S(x) \cdots \\ \vdots \cdots \cdots \vdots & \vdots \cdots \cdots \vdots \\ \cdots \cdots & \cdots \cdots \\ \end {array} \right ) ,

where the rows of A except the first two correspond to some S \subseteq \{1,\dots ,n\} such that |S|\leq k.

We apply LP duality. We shall denote the new set of variables by

y^T = d \, d'\, \overbrace {\cdots d_S\cdots }^{\#S}.

We have the following program:

\begin {array}{rrl} &-\max d+d'\\ &&\\ \text { subject to:} &\forall x, d + \sum _S d_S \chi _S(x) &\leq f(x)\\ &\forall x, d' - \sum _S d_S \chi _S(x) &\leq -f(x)\\ \end {array}

Writing d' = -d-\epsilon , the objective becomes to minimize \epsilon , while the second set of constraints can be rewritten:

\begin{aligned} \forall x, d+\epsilon + \sum _S d_S\chi _S(x) \geq f(x) \, . \end{aligned}

The expression d + \sum _S d_S \chi _S(X) is an arbitrary degree-k polynomial which we denote by g(X). So our constraints become

\begin{aligned} g(x) &\leq f(x)\\ g(x) + \epsilon &\geq f(x). \end{aligned}

where g ranges over all degree-k polynomials, and we are trying to minimize \epsilon . Since g is always below f but g + \epsilon is always above f, the polynomial g is everywhere within \epsilon of f. \square
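To make the duality concrete, here is a small numerical sketch (not from the lecture) that computes the right-hand side of the theorem, the least error achievable by a degree-k polynomial, by feeding the linear program above to scipy. The choice of the AND function and of the parameters n, k is only illustrative.

# Sketch: compute min eps s.t. some degree-k polynomial g satisfies
# g(x) <= f(x) <= g(x) + eps on all of {0,1}^n, via linear programming.
import itertools
from scipy.optimize import linprog

n, k = 4, 1                       # example parameters: AND on 4 bits, degree-1 approximations
points = list(itertools.product([0, 1], repeat=n))
subsets = [S for r in range(k + 1) for S in itertools.combinations(range(n), r)]

def f(x):                         # the function to approximate (here: AND)
    return 1.0 if all(x) else 0.0

def chi(S, x):                    # monomial prod_{i in S} x_i (1 for the empty set)
    out = 1.0
    for i in S:
        out *= x[i]
    return out

# Variables: one coefficient d_S per subset with |S| <= k, plus eps (last variable).
A_ub, b_ub = [], []
for x in points:
    row = [chi(S, x) for S in subsets]
    A_ub.append(row + [-1.0])                 #  g(x) - eps <= f(x)
    b_ub.append(f(x))
    A_ub.append([-v for v in row] + [-1.0])   # -g(x) - eps <= -f(x)
    b_ub.append(-f(x))

obj = [0.0] * len(subsets) + [1.0]            # minimize eps
res = linprog(obj, A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * (len(subsets) + 1))
print("least error of a degree-%d approximation to AND on %d bits: %.3f" % (k, n, res.fun))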

1.4 Approximate Degree of AND.

Let us now study the AND function on n bits. Let d_{\epsilon }(f) denote the minimal degree of a polynomial approximating f pointwise within error \epsilon .

We will show that d_{1/3}(AND) = \Theta (\sqrt {n}).

Let us first show the upper bound:

Claim 4. We have:

\begin{aligned}d_{1/3}(\text {AND}) = \mathcal {O}(\sqrt {n}).\end{aligned}

To prove this claim, we will consider a special family of polynomials:

Definition 5. (Chebychev polynomials of the first kind.)

The Chebychev polynomials (of the first kind) are a family \{T_k\}_{k\in \mathbb {N}} of polynomials defined inductively as:

  • T_0(X) := 1,
  • T_1(X) := X,
  • \forall k \geq 1, T_{k+1}(X) := 2X\, T_k(X) - T_{k-1}(X).

Those polynomials satisfy some useful properties:

  1. \forall x \in [-1,1], T_k(x) = \cos (k \arccos (x))\, ,
  2. \forall x \in [-1,1], \forall k, |T_k(x)| \leq 1 \, ,
  3. \forall x such that |x| \geq 1, |T'_k(x)| \geq k^2 \, ,
  4. \forall k, T_k(1)=1 \, .

Property 2 follows from 1, and property 4 follows from a direct induction. For a nice picture of these polynomials you should have come to class (or I guess you can check wikipedia). We can now prove our upper bound:
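As a small aside, the properties are easy to check numerically from the recurrence alone; the following sketch (grid sizes, tolerances, and the range of k are arbitrary) does so.

# Sketch: evaluate T_k via the recurrence and spot-check Properties 2-4.
def T(k, x):
    prev, cur = 1.0, float(x)
    if k == 0:
        return prev
    for _ in range(k - 1):
        prev, cur = cur, 2 * x * cur - prev
    return cur

for k in range(8):
    # Property 2: |T_k(x)| <= 1 on [-1, 1] (checked on a grid)
    assert all(abs(T(k, -1 + 2 * t / 1000)) <= 1 + 1e-9 for t in range(1001))
    # Property 4: T_k(1) = 1
    assert abs(T(k, 1.0) - 1) <= 1e-9
for k in range(1, 8):
    # Property 3 at x = 1: forward difference approximates T'_k(1) = k^2
    assert (T(k, 1 + 1e-6) - T(k, 1.0)) / 1e-6 >= k * k - 1e-3
print("properties 2-4 check out numerically")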

Proof of Claim 4.

We construct a univariate polynomial p:\{0,1,\dots ,n\} \rightarrow \mathbb {R} such that:

  • \deg p = \mathcal {O}(\sqrt {n});
  • \forall i<n, |p(i)| \leq 1/3;
  • |p(n)-1| \leq 1/3.

In other words, p will be close to 0 on [0,n-1], and close to 1 on n. Then, we can naturally define the polynomial for the AND function on n bits to be q(X_1,\dots ,X_n) := p(\sum _i X_i), which also has degree \mathcal {O}(\sqrt {n}). Indeed, we want q to be close to 0 if X has Hamming weight less than n, while being close to 1 on X of Hamming weight n (by definition of AND). This will conclude the proof.

Let us define p as follows:

\begin{aligned} \forall i\leq n, \quad p(i):= \frac {T_k\left ( \frac {i}{n-1}\right )}{T_k\left ( \frac {n}{n-1}\right )}. \end{aligned}

Intuitively, this uses the fact that Chebychev polynomials are bounded in [-1,1] (Property 2.) and then increase very fast (Property 3.).

More precisely, we have:

  • p(n)=1 by construction;
  • for i<n, we have:

    \left |T_k\left ( \frac {i}{n-1}\right )\right | \leq 1 by Property 2.;

    T_k\left ( \frac {n}{n-1}\right ) = T_k\left (1 + \frac {1}{n-1}\right ) \geq 1 + \frac {k^2}{n-1} by Properties 3. and 4., and therefore for some k = \mathcal {O}(\sqrt {n}) we have T_k\left ( \frac {n}{n-1}\right ) \geq 3. Hence |p(i)| \leq 1/3 for every i < n, as required.

\square
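The construction in the proof is easy to run. The following sketch (with the illustrative choice n = 16) builds p from the recurrence, picks the smallest k with T_k(n/(n-1)) \geq 3, and verifies by brute force that q(x) = p(\sum _i x_i) is within 1/3 of AND on all of \{0,1\}^n.

# Sketch: the Chebychev-based 1/3-approximation of AND, checked by brute force.
import itertools

def T(k, x):                       # T_k via the recurrence
    prev, cur = 1.0, float(x)
    if k == 0:
        return prev
    for _ in range(k - 1):
        prev, cur = cur, 2 * x * cur - prev
    return cur

n = 16
k = next(d for d in range(1, n + 1) if T(d, n / (n - 1)) >= 3)   # k = O(sqrt(n)) suffices
denom = T(k, n / (n - 1))

def p(i):                          # univariate: |p(i)| <= 1/3 for i < n, and p(n) = 1
    return T(k, i / (n - 1)) / denom

def q(x):                          # degree-k multivariate approximator of AND
    return p(sum(x))

err = max(abs(q(x) - (1.0 if all(x) else 0.0))
          for x in itertools.product([0, 1], repeat=n))
print("n =", n, " degree =", k, " max error =", err)   # at most 1/3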

Let us now prove the corresponding lower bound:

Claim 6. We have:

\begin{aligned}d_{1/3}(\text {AND}) = \Omega (\sqrt {n}).\end{aligned}

Proof. Let p be a polynomial that approximates the AND function with error 1/3. Consider the univariate symmetrization p' of p, namely p'(i) := \mathbb {E}[p(x)] where x is uniform over the strings of Hamming weight i; recall that p' is a univariate polynomial with \deg p' \leq \deg p.

We have the following result from approximation theory:

Theorem 7. Let q be a real univariate polynomial such that:

  1. \forall i \in \{0,\dots ,n\}, |q(i)| \leq \mathcal {O}(1);
  2. q'(x) \geq \Omega (1) for some x \in [0,n].

Then \deg q = \Omega (\sqrt {n}).

To prove our claim, it is therefore sufficient to check that p' satisfies conditions 1. and 2., as we saw that \deg p \geq \deg p':

  1. We have: \forall i \in \{0,\dots ,n\}, |p'(i)| \leq 1 + 1/3 by assumption on p;
  2. We have p'(n-1) \leq 1/3 and p'(n) \geq 2/3 (by assumption on p), so the mean value theorem gives some x \in [n-1,n] at which the derivative of p' is \geq 1/3 = \Omega (1).

This concludes the proof. \square

1.5 Approximate Degree of AND-OR.

Consider the AND function on R bits and the OR function on N bits. Let AND-OR : \{0,1\}^{R\times N} \rightarrow \{0,1\} be their composition, which outputs the AND of the R outputs of the OR function applied to R disjoint blocks of N bits each.

It is known that d_{1/3}(AND-OR) = \Theta (\sqrt {RN}). To prove the upper bound, we will need a technique to compose approximating polynomials which we will discuss later.

Now we focus on the lower bound. This lower bound was recently proved independently by Sherstov and by Bun and Thaler. We present a proof that is different (either in substance or in language) and which we find more intuitive. Our proof replaces the “dual block method” with the following lemma.

Lemma 8. Suppose that

distributions A^0, A^1 over \{0,1\}^{n_A} are k_A-wise indistinguishable distributions; and

distributions B^0, B^1 over \{0,1\}^{n_B} are k_B-wise indistinguishable distributions.

Define C^0, C^1 over \{0,1\}^{n_A \cdot n_B} as follows: to sample from C^b, draw a sample x \in \{0,1\}^{n_A} from A^b, and replace each bit x_i by a sample of B^{x_i} (independently).

Then C^0 and C^1 are k_A \cdot k_B-wise indistinguishable.

Proof. Consider any set S \subseteq \{1,\dots , n_A\cdot n_B \} of k_A \cdot k_B bit positions; let us show that they have the same distribution in C^0 and C^1.

View the n_A \cdot n_B bit positions as n_A blocks of n_B bits each. Call a block K of n_B bits heavy if |S\cap K| \geq k_B; call the other blocks light.

There are at most k_A heavy blocks, since each heavy block accounts for at least k_B of the |S| \leq k_A \cdot k_B positions. The heavy blocks are generated from at most k_A bits of x (together with the corresponding B-samples), and those bits of x have the same joint distribution under A^0 and A^1 by k_A-wise indistinguishability. Hence the (entire) heavy blocks have the same distribution in C^0 and C^1.

Furthermore, each light block meets S in fewer than k_B positions, so by k_B-wise indistinguishability of B^0 and B^1 those positions have the same distribution whether the block is sampled from B^0 or from B^1, i.e., regardless of the corresponding bit of x; and the blocks are sampled independently.

Therefore C^0 and C^1 are k_A \cdot k_B-wise indistinguishable. \square
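Here is a toy check of the lemma (not from the lecture). As an illustrative instantiation we take A^0, A^1 uniform over even/odd parity strings on n_A bits, which are (n_A - 1)-wise indistinguishable, and similarly for B^0, B^1; the code computes the exact distributions of C^0 and C^1 and verifies that all projections on at most k_A \cdot k_B positions agree. The parameters are arbitrary small values.

# Sketch: exact check of the composition lemma on a tiny example.
import itertools
from collections import Counter

n_A, n_B = 3, 3
kA, kB = n_A - 1, n_B - 1

def parity_class(n, b):            # all n-bit strings of parity b
    return [x for x in itertools.product([0, 1], repeat=n) if sum(x) % 2 == b]

A = {b: parity_class(n_A, b) for b in (0, 1)}
B = {b: parity_class(n_B, b) for b in (0, 1)}

def C(b):                          # exact (unnormalized) distribution of C^b
    dist = Counter()
    for x in A[b]:
        for blocks in itertools.product(*[B[xi] for xi in x]):
            dist[sum(blocks, ())] += 1
    return dist

C0, C1 = C(0), C(1)                # both have the same total weight

def marginal(dist, S):             # joint distribution of the bits in positions S
    m = Counter()
    for z, w in dist.items():
        m[tuple(z[i] for i in S)] += w
    return m

ok = all(marginal(C0, S) == marginal(C1, S)
         for r in range(kA * kB + 1)
         for S in itertools.combinations(range(n_A * n_B), r))
print("all projections on <=", kA * kB, "positions agree:", ok)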

Special Topics in Complexity Theory, Lectures 4-5

Special Topics in Complexity Theory, Fall 2017. Instructor: Emanuele Viola

1 Lectures 4-5, Scribe: Matthew Dippel

These lectures cover some basics of small-bias distributions, and then a more recent pseudorandom generator for read-once CNF [GMR^{+}12].

2 Small bias distributions

Definition 1.[Small bias distributions] A distribution D over \{0, 1\}^n has bias \epsilon if no parity function can distinguish it from uniformly random strings with probability greater than \epsilon . More formally, we have:

\begin{aligned}\forall S \subseteq [n], S \neq \emptyset , \left \vert \mathbb {P}_{x \in D}\left [\bigoplus _{i \in S} x_i = 1 \right ] - 1/2\right \vert \leq \epsilon . \end{aligned}

In this definition, 1/2 is simply the probability that a parity test equals 1 (or 0) under the uniform distribution. We also note that it does not matter whether the definition uses the probability of the parity being 0 or being 1: if a test has probability 1/2 + \epsilon of being equal to 1, then it has probability 1 - (1/2 + \epsilon ) = 1/2 - \epsilon of being 0, so the bias is the same under either choice.

This can be viewed as a distribution which fools tests T that are restricted to computing parity functions on a subset of bits.
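As a sanity check of the definition, the bias of a distribution given explicitly by its support can be computed by brute force (exponential time, so only for tiny n); the example supports below are arbitrary.

# Sketch: compute the bias of the uniform distribution over a given support.
import itertools

def bias(support, n):
    """Max over non-empty S of |Pr_{x in support}[parity_S(x) = 1] - 1/2|."""
    worst = 0.0
    for r in range(1, n + 1):
        for S in itertools.combinations(range(n), r):
            p1 = sum(1 for x in support if sum(x[i] for i in S) % 2 == 1) / len(support)
            worst = max(worst, abs(p1 - 0.5))
    return worst

n = 4
even = [x for x in itertools.product([0, 1], repeat=n) if sum(x) % 2 == 0]
print(bias(even, n))                                        # 0.5: the full parity test fails
print(bias(list(itertools.product([0, 1], repeat=n)), n))   # 0.0: uniform has no bias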

Before we answer the important question of how to construct and efficiently sample from such a distribution, we will provide one interesting application of small bias sets to expander graphs.

Theorem 2.[Expander construction from a small bias set] Let D be a distribution over \{0, 1\}^n with bias \epsilon . Define G = (V, E) as the following graph:

\begin{aligned}V = \{0, 1\}^n, E = \{(x,y) \vert x \oplus y \in \text {support}(D)\}.\end{aligned}

Then, when we take the eigenvalues of the random walk matrix of G in descending order \lambda _1, \lambda _2, ... \lambda _{2^n}, we have that:

\begin{aligned}\max \{|\lambda _2|, |\lambda _{2^n}|\} \leq \epsilon .\end{aligned}

Thus, small-bias sets yield expander graphs. Small-bias sets also turn out to be equivalent to constructing good linear codes. Although all these questions had been studied well before the definition of small-bias sets [NN90], the computational perspective has been quite useful, even in answering old questions. For example, Ta-Shma used this perspective to construct better codes [Ta-17].

3 Constructions of small bias distributions

Just like our construction of k-wise independent distributions from the previous lecture, we will construct small-bias distributions using polynomials over finite fields.

Theorem 1.[Small bias construction] Let \mathcal {F} be a finite field of size 2^\ell , with elements represented as bit strings of length \ell . We define the generator G : \mathcal {F}^2 \rightarrow \{0, 1\}^n as the following:

\begin{aligned}G(a, b)_i = \left \langle a^i, b \right \rangle = \sum _{j \leq \ell } (a^i)_j b_j \mod 2.\end{aligned}

In this notation, a subscript of j indicates taking the jth bit of the representation. Then the output of G(a, b) over uniform a and b has bias n / 2^{\ell }.

Proof. Consider some parity test induced by a non-empty subset S \subseteq [n]. When applied to the output of G, it simplifies as:

\begin{aligned}\sum _{i \in S}G(a, b)_i = \sum _{i \in S}\left \langle a^i, b \right \rangle = \left \langle \sum _{i \in S} a^i, b \right \rangle .\end{aligned}

Note that \sum _{i \in S} a^i is the evaluation of the polynomial P_S (x) := \sum _{i \in S} x^i at the point a. We note that if P_S(a) \neq 0, then the value of \left \langle P_S(a), b \right \rangle is equally likely to be 0 or 1 over the probability of a uniformly random b. This follows from the fact that the inner product of any non-zero bit string with a uniformly random bit string is equally likely to be 0 or 1. Hence in this case, our generator has no bias.

In the case where P_S(a) = 0, then the inner product will always be 0, independent of the value of b. In these situations, the bias is 1/2, but this is conditioned on the event that P_S(a) = 0.

We claim that this event has probability \leq n / 2^\ell . Indeed, for non empty S, P_S(a) is a polynomial of degree \leq n. Hence it has at most n roots. But we are selecting a from a field of size 2^\ell . Hence the probability of picking one root is \le n / 2^\ell .

Hence overall the bias is at most n/2^\ell . \square

To make use of the generator, we need to pick a specific \ell . Note that the seed length will be |a| + |b| = 2\ell . If we want to achieve bias \epsilon , then we must have \ell = \log \left (\frac {n}{\epsilon } \right ). All the logarithms in this lecture are in base 2. This gives us a seed length of 2\log \left (\frac {n}{\epsilon } \right ).
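Here is a sketch of this generator in code, with the illustrative choice \ell = 8; the field GF(2^8) is realized with the irreducible polynomial x^8 + x^4 + x^3 + x + 1, and the output length n = 20 is arbitrary (so the bias is at most 20/256).

# Sketch: G(a, b)_i = <a^i, b> over GF(2^8), with field elements stored as 8-bit ints.
import random

ELL, MOD_POLY = 8, 0x11B            # GF(2^8) via a standard irreducible polynomial

def gf_mult(u, v):
    """Carry-less multiplication of u and v, reduced modulo the irreducible polynomial."""
    r = 0
    while v:
        if v & 1:
            r ^= u
        v >>= 1
        u <<= 1
        if u & (1 << ELL):
            u ^= MOD_POLY
    return r

def inner_product_mod2(u, v):       # <u, v> over GF(2), on the bit representations
    return bin(u & v).count("1") % 2

def G(a, b, n):
    """Output n bits; the i-th bit is <a^i, b>."""
    out, power = [], 1
    for _ in range(n):
        power = gf_mult(power, a)   # power = a^i for i = 1, 2, ...
        out.append(inner_product_mod2(power, b))
    return out

n = 20                              # bias at most n / 2^ELL = 20/256
a, b = random.randrange(2 ** ELL), random.randrange(2 ** ELL)
print(G(a, b, n))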

Small-bias distributions are so important that a lot of attention has been devoted to optimizing the constant “2” above. A lower bound of \log n + (2 - o(1))\log (1 / \epsilon ) on the seed length was known. Ta-Shma recently [Ta-17] gave a nearly matching construction with seed length \log n + (2 + o(1))\log (1 / \epsilon ).

We next give a sense of how to obtain different tradeoffs between n and \epsilon in the seed length. We specifically focus on getting a nearly optimal dependence on n, because the construction is a simple, interesting “derandomization” of the above one.

3.1 An improved small bias distribution via bootstrapping

We will show another construction of small bias distributions that achieves seed length (1 + o(1))\log n + O(\log (1/\epsilon )). It will make use of the previous construction and proof.

The intuition is the following: the only time we used that b was uniform was in asserting that if P_S(a) \neq 0, then \left \langle P_S(a), b \right \rangle is uniform. But we don’t need b to be uniform for that. What do we need from b? We need that it has small-bias!

Our new generator is G(a, G'(a', b')) where G and G' are as before but with different parameters. For G, we pick a of length \ell = \log (n/\epsilon ), whereas G' just needs to be an \epsilon -biased generator on \ell bits, which, as we just saw, can be done with a seed of O(\log (\ell /\epsilon )) bits. This gives a seed length of \log n + \log \log n + O(\log (1/\epsilon )), as promised.

We can of course repeat the argument but the returns diminish.

4 Connecting small bias to k-wise independence

We will show that using our small bias generators, we can create distributions which are almost k-wise independent. That is, they are very close to a k-wise independent distribution in statistical distance, while having a substantially shorter seed length than what is required for k-wise independence. In particular, we will show two results:

  • Small bias distributions are themselves close to k-wise independent.
  • We can improve the parameters of the above by feeding a small bias distribution to the generator for k-wise independence from the previous lectures. This will improve the seed length of simply using a small bias distribution.

Before we can show these, we’ll have to take a quick aside into some fundamental theorems of Fourier analysis of boolean functions.

4.1 Fourier analysis of boolean functions 101

Let f : \{-1, 1\}^n \rightarrow \{-1, 1\}. Here the switch between \{0, 1\} and \{-1, 1\} is common, but you can think of them as being isomorphic. One way to think of f is as a vector in \{-1 , 1\}^{2^n}, whose xth entry is the value f(x). If we let \bf {1}_y denote the indicator vector whose xth entry is 1 iff x = y, then any function f can be written over the basis of the \bf {1}_y vectors as:

\begin{aligned}f = \sum _y f(y) \bf {1}_y.\end{aligned}

This is the “standard” basis.

Fourier analysis is simply a different basis in which to write functions, which is sometimes more useful. The basis functions are \chi _S : \{-1, 1\}^n \rightarrow \{-1, 1\}, defined by \chi _S(x) := \prod _{i \in S} x_i for S \subseteq [n]. Then any boolean function f can be expressed as:

\begin{aligned}f(x) = \sum _{S \subseteq [n]}\hat {f}(S)\chi _S(x),\end{aligned}

where the \hat {f}(S), called the “Fourier coefficients,” can be derived as:

\begin{aligned}\hat {f}(S) = \mathbb{E} _{x \sim U_n} \left [f(x)\chi _S(x) \right ],\end{aligned}

where the expectation is over uniformly random x.

Claim 1. For any function f with range \{-1,1\}, its Fourier coefficients satisfy:

\begin{aligned}\sum _{S \subseteq [n]}\hat {f}(S)^2 = 1.\end{aligned}

Proof. We know that \mathbb{E} [f(x)^2] = 1, as squaring the function makes it 1. We can re-express this expectation as:

\begin{aligned}\mathbb{E} [f(x)f(x)] = \mathbb{E} \left [\sum _S \hat {f}(S)\chi _S(x) \cdot \sum _T \hat {f}(T)\chi _T(x)\right ] = \mathbb{E} \left [\sum _{S, T} \hat {f}(S)\chi _S(x) \hat {f}(T)\chi _T(x)\right ].\end{aligned}

We make use of the following fact: if S \neq T, then \mathbb{E} [\chi _S(x)\chi _T(x)] = \mathbb{E} [\chi _{S \oplus T}(x)] = 0, where S \oplus T denotes the symmetric difference. If S = T, then their symmetric difference is the empty set and this function is identically 1.

Overall, this implies that the above expectation can be simply rewritten as:

\begin{aligned}\sum _{S = T}\hat {f}(S)\hat {f}(T) = \sum _S \hat {f}(S)^2.\end{aligned}

Since we already established that the expectation is 1, the claim follows. \square
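A brute-force computation of the Fourier coefficients makes Claim 1 easy to verify on small examples; the choice of the majority function on 3 bits below is arbitrary.

# Sketch: brute-force Fourier coefficients of f : {-1,1}^n -> {-1,1}, and Parseval.
import itertools

n = 3
points = list(itertools.product([-1, 1], repeat=n))

def f(x):                            # example: majority in the +/-1 convention
    return 1 if sum(x) > 0 else -1

def chi(S, x):
    p = 1
    for i in S:
        p *= x[i]
    return p

def fourier_coefficient(S):          # hat f(S) = E_x[f(x) chi_S(x)]
    return sum(f(x) * chi(S, x) for x in points) / len(points)

coeffs = {S: fourier_coefficient(S)
          for r in range(n + 1) for S in itertools.combinations(range(n), r)}
print(sum(c * c for c in coeffs.values()))   # Parseval: equals 1 for +/-1-valued f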

5 Small bias distributions are close to k-wise independent

Before we can prove our claim, we formally introduce what we mean for two distributions to be close. We use the most common definition of statistical difference, which we repeat here:

Definition 1. Let D_1, D_2 be two distributions over the same domain H. Their statistical distance, denoted \text {SD}(D_1, D_2) and sometimes also written \Delta (D_1, D_2), is

\begin{aligned}\Delta (D_1, D_2) = \max _{T \subseteq H} \left | \mathcal {P}[D_1 \in T] - \mathcal {P}[D_2 \in T]\right |.\end{aligned}

Note that the probabilities are with respect to the individual distributions D_1 and D_2. We may also say that D_1 is \epsilon -close to D_2 if \Delta (D_1, D_2) \leq \epsilon .

We can now show our result, which is known as Vazirani’s XOR Lemma:

Theorem 2. If a distribution D over \{0, 1\}^n has bias \epsilon , then D is \epsilon 2^{n / 2}-close to the uniform distribution.

Proof. Let T be a test. To fit the above notation, we can think of T as being defined as the set of inputs for which T(x) = 1. Then we want to bound:

\begin{aligned}|\mathbb{E} [T(D)] - \mathbb{E} [T(U)]|.\end{aligned}

Expanding T in Fourier basis we rewrite this as

\begin{aligned}|\mathbb{E} [\sum _S \hat {T_S}\chi _S(D)] - \mathbb{E} [\sum _S \hat {T_S}\chi _S(U)]|= |\sum _S \hat {T_S}\left (\mathbb{E} [\chi _S(D)] - \mathbb{E} [\chi _S(U)]\right )|.\end{aligned}

We know that \mathbb{E} _U[\chi _S(x)] = 0 for all non-empty S, and 1 when S is the empty set. We also know that |\mathbb{E} _D[\chi _S(x)]| \leq \epsilon for all non-empty S, and the expectation is 1 when S is the empty set. So the above can be bounded as:

\begin{aligned}\leq \sum _{S \ne \emptyset } |\hat {T_S}| |\mathbb{E} _D[\chi _S(x)] - \mathbb{E} _U[\chi _S(x)]| \leq \sum _S |\hat {T_S}| \epsilon = \epsilon \sum _S |\hat {T_S}|.\end{aligned}

Lemma 3. \sum _S |\hat {T_S}| \leq 2^{n / 2}

Proof. By Cauchy–Schwarz:

\begin{aligned}\sum |\hat {T_S}| \leq 2^{n/2} \sqrt {\sum \hat {T_S} ^2} \leq 2^{n/2}\end{aligned}

where the last inequality follows from Claim 1. \square

Using the above lemma completes the upper bound and the proof of the theorem. \square

Corollary 4. Any k bits of an \epsilon -biased distribution are \epsilon 2^{k / 2}-close to uniform.

Using the corollary above, we see that we can get \epsilon -close to a k-wise independent distribution (in the sense of the corollary) by taking a small-bias distribution with \epsilon ' = \epsilon / 2^{k / 2}. This requires seed length \ell = O(\log (n / \epsilon ')) = O(\log (2^{k/2}n / \epsilon )) = O(\log n + k + \log (1 / \epsilon )). Recall that for exact k-wise independence we required seed length k \log n.

5.1 An improved construction

Theorem 5. Let G : \{0, 1\}^{k\log n} \rightarrow \{0, 1\}^n be the generator previously described that samples a k-wise independent distribution (or any linear G). If we replace the input to G with a small bias distribution of \epsilon ' = \epsilon / 2^k, then the output of G is \epsilon -close to being k-wise independent.

Proof. Consider any parity test S on k bits of the output of G. It can be shown that G is a linear map: G simply takes its seed and multiplies it by a matrix over the field GF(2) with two elements. Hence S corresponds to a parity test S' on the input of G, on possibly many bits. The test S' is not empty because G is k-wise independent. Since we fool S' with error \epsilon ' = \epsilon /2^k, every parity on k output bits is fooled with error \epsilon ', and the theorem follows by Vazirani’s XOR lemma, since \epsilon ' 2^{k/2} \leq \epsilon . \square

Using the seed lengths we saw we get the following.

Corollary 6. There is a generator for almost k-wise independent distributions with seed length O(\log \log n + \log (1 / \epsilon ) + k).

6 Tribes Functions and the GMRTV Generator

We now move to a more recent result. Consider the Tribes function, which is a read-once CNF on k \cdot w bits, given by the And of k terms, each on w bits. You should think of n = k \cdot w where w \approx \log n and k \approx n/\log n.

We’d like a generator for this class with seed length O(\log (n/\epsilon )). This is still open! (This is just a single function, for which a generator is trivial, but one can make this challenge precise, for example by asking to fool the Tribes function under any possible negation of the input variables. These are 2^n tests, and a generator with seed length O(\log (n/\epsilon )) is unknown.)

The result we saw earlier about fooling And gives a generator with seed length O(\log n); however, the dependence on \epsilon is poor. Achieving a good dependence on \epsilon has proved to be a challenge. We now describe a recent generator [GMR^{+}12] which gives seed length O(\log (n/\epsilon )) (\log \log n)^{O(1)}. This is incomparable with the previous O(\log n), and in particular the dependence on n is always suboptimal. However, when \epsilon = 1/n the generator [GMR^{+}12] gives seed length O(\log n) \log \log n, which is better than previously available constructions.

The high-level technique for doing this is based on iteratively restricting variables, and goes back about 30 years [AW89]. This technique seems to have been abandoned for a while, possibly due to the spectacular successes of Nisan [Nis91, Nis92]. It was revived in [GMR^{+}12] (see also [GLS12]) with an emphasis on a good dependence on \epsilon .

A main tool is this claim, showing that small-bias distributions fool products of functions with small variance. Critically, we work with non-boolean functions (which later will be certain averages of boolean functions).

Claim 1. Let f_1, f_2, ..., f_k : \{0, 1\}^w \rightarrow [0,1] be functions (not necessarily boolean-valued). Further, let D = (v_1, v_2, ..., v_k) be an \epsilon -biased distribution over wk bits, where each v_i is w bits long. Then for every integer d \geq 1:

\begin{aligned}\left |\mathbb{E} _D\left [\prod _i f_i(v_i)\right ] - \prod _i \mathbb{E} _U[f_i(U)]\right | \leq \left (\sum _i \text {var}(f_i) \right )^d + (k2^w)^d\epsilon ,\end{aligned}

where \text {var}(f) := \mathbb{E} [f^2] - \mathbb{E} ^2[f] is variance of f with respect to the uniform distribution.

This claim has emerged from a series of works, and this statement is from a work in progress with Chin Ho Lee. For intuition, note that constant functions have variance 0, in which case the claim gives good bounds (and indeed any distribution fools constant functions). By contrast, for balanced functions the variance is constant, and the sum of the variances is about k, and the claim gives nothing. Indeed, you can write Inner Product as a product of nearly balanced functions, and it is known that small-bias does not fool it. For this claim to kick in, we need each variance to be at most 1/k.

In the tribes function, the And functions have variance 2^{-w}, and the sum of the variances is about 1, so the claim gives nothing. However, if you perturb the Ands with a little noise, the variance drops polynomially, and the claim becomes useful.

Claim 2. Let f be the AND function on w bits. Rewrite it as f(x, y), where |x| = |y| = w / 2. That is, we partition the input into two sets. Define g(x) as:

\begin{aligned}g(x) = \mathbb{E} _y[f(x, y)],\end{aligned}

where y is uniform. Then \text {var}(g) = \Theta (2^{-3w/2}).

Proof.

\begin{aligned}\text {var}(g) = \mathbb{E} [g(x)^2] - \left (\mathbb{E} [g(x)]\right )^2 = \mathbb{E} _x[\mathbb{E} _y[f(x,y)]^2] - \left (\mathbb{E} _x[\mathbb{E} _y[f(x,y)]] \right )^2.\end{aligned}

We know that \left (\mathbb{E} _x[\mathbb{E} _y[f(x,y)]] \right ) is simply the expected value of f, and since f is the AND function, this is 2^{-w}, so the right term is 2^{-2w}.

We re-express the left term as \mathbb{E} _{x,y, y'}[f(x,y)f(x, y')]. But we note that this product is 1 iff x = y = y' = \mathbf {1} (the all-ones string). The probability of this happening is (2^{-w/2})^3 = 2^{-3w/2}.

Thus the final difference is 2^{-3w/2}(1 - 2^{-w/2}) = \Theta (2^{-3w/2}). \square

We’ll actually apply this claim to the Or function, which has the same variance as And by De Morgan’s laws.
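For small even w, the variance in Claim 2 can be computed exactly by brute force; the sketch below (with arbitrary values of w) compares it against 2^{-3w/2}.

# Sketch: exact variance of g(x) = E_y[AND(x, y)], |x| = |y| = w/2.
import itertools

def var_of_averaged_and(w):
    half = w // 2
    xs = list(itertools.product([0, 1], repeat=half))
    def g(x):
        return sum(all(x + y) for y in xs) / len(xs)    # E_y[AND(x, y)]
    values = [g(x) for x in xs]
    mean = sum(values) / len(values)
    return sum(v * v for v in values) / len(values) - mean * mean

for w in (4, 8, 12):
    print(w, var_of_averaged_and(w), 2 ** (-1.5 * w))    # same order of magnitude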

We now present the main inductive step to fool tribes.

Claim 3. Let f be the tribes function, where the first t \leq w bits of each of the terms are fixed. Let w' = w - t be the number of free bits per term, and k' \leq k the number of terms that are non-constant (a term becomes constant 1 as soon as one of its fixed bits is 1, and such terms can be dropped).

Reexpress f as f(x, y) = \bigwedge _{k'} \left (\bigvee (x_i, y_i) \right ), where each term’s input bits are split in half, so |x_i| = |y_i| = w' / 2.

Let D be a small bias distribution with bias \epsilon ^c (for a big enough c to be set later). Then

\begin{aligned}\left \vert \mathbb{E} _{(x, y) \in U^2}[f(x,y)] - \mathbb{E} _{(x, y) \in (D,U)}[f(x,y)] \right \vert \leq \epsilon .\end{aligned}

That is, if we replace half of the free bits with a small bias distribution, then the resulting expectation of the function only changes by a small amount.

To get the generator from this claim, we repeatedly apply Claim 3, replacing half of the bits of the input with another small bias distribution. We repeat this until we have a small enough remaining amount of free bits that replacing all of them with a small bias distribution causes an insignificant change in the expectation of the output.

At each step, w' is cut in half, so the required number of repetitions to reduce w' to constant is R = \log (w) = \log \log (n). Actually, as explained below, we’ll stop when w' = c' \log \log (1/\epsilon ) for a suitable constant c' (this arises from the error bound in the claim above).

After each replacement, we incur an error of \epsilon , and then we incur the final error from replacing all bits with a small bias distribution. This final error is negligible by a result which we haven’t seen, but which is close in spirit to the proof we saw that bounded independence fools AND.

The total accumulated error is then \epsilon ' = O(\epsilon \log \log n). If we wish to achieve a specific error \epsilon , we can run each step with a small-bias generator of error \epsilon / \log \log n.

At each iteration, our small bias distribution requires O(\log (n / \epsilon )) bits, so our final seed length is O(\log (n / \epsilon )) \text {poly}\log \log (n).

Proof of Claim 3. Define g_i(x_i) = \mathbb{E} _{y_i}[\bigvee (x_i, y_i)], and rewrite our target expression as:

\begin{aligned}\mathbb{E} _{x \in U}\left [\prod g_i(x_i)\right ] - \mathbb{E} _{x \in D}\left [\prod g_i(x_i)\right ].\end{aligned}

This is in the form of Claim 1. We also note from Claim 2 that \text {var}(g_i) = \Theta (2^{-3w'/2}).

We further assume that k' \leq 2^{w'} \log (1 / \epsilon ). For if this is not true, then the expectation over the first 2^{w'} \log (1 / \epsilon ) terms is \leq \epsilon , because of the calculation

\begin{aligned}(1 - 2^{-w'})^{2^{w'} \log (1 / \epsilon )} \leq \epsilon .\end{aligned}
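For completeness, here is the one-line verification of this inequality, using 1 - x \leq e^{-x} and the convention that logarithms are in base 2:

\begin{aligned} (1 - 2^{-w'})^{2^{w'} \log (1 / \epsilon )} = \left ( (1 - 2^{-w'})^{2^{w'}} \right )^{\log (1/\epsilon )} \leq e^{-\log (1/\epsilon )} \leq 2^{-\log (1/\epsilon )} = \epsilon \, . \end{aligned}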

Then we can reason as in the proof that bounded independence fools AND (i.e., we can run the argument just on the first 2^{w'} \log (1 / \epsilon ) terms to show that the products are close, and then use the fact that it is small under uniform, and the fact that adding terms only decreases the probability under any distribution).

Under the assumption, we can bound the sum of the variances of g as:

\begin{aligned}\sum \text {var}(g_i) \leq k' 2^{-3w' / 2} \leq 2^{-\Omega (w')}\log (1 / \epsilon ).\end{aligned}

If we assume that w' \ge c \log \log (1 / \epsilon ) then this sum is \leq 2^{-\Omega (w')}.

We can then plug this into the bound from Claim 1 (applied to the k' functions g_i) to get

\begin{aligned}(2^{-\Omega (w')})^d + (k'2^{w'})^d \epsilon ^c = 2^{-\Omega (dw')} + 2^{O(dw')}\epsilon ^c,\end{aligned}

where we used that k' \leq 2^{w'}\log (1/\epsilon ) \leq 2^{O(w')}.

Now we set d so that \Omega (dw') = \log (1 / \epsilon )+1, and the bound becomes:

\begin{aligned}\epsilon / 2 + (1 / \epsilon )^{O(1)}\epsilon ^{c} \leq \epsilon .\end{aligned}

By making c large enough the claim is proved. \square

In the original paper, they apply these ideas to read-once CNF formulas. Interestingly, this extension is more complicated and uses additional ideas. Roughly, the progress measure is the number of terms in the CNF (as opposed to the width). A CNF is broken up into a small number of Tribes functions, the above argument is applied to each Tribes function, and then they are put together using a general fact that they prove: if f and g are fooled by small-bias distributions, then so is f \wedge g on disjoint inputs.

References

[AW89]    Miklos Ajtai and Avi Wigderson. Deterministic simulation of probabilistic constant-depth circuits. Advances in Computing Research – Randomness and Computation, 5:199–223, 1989.

[GLS12]    Dmitry Gavinsky, Shachar Lovett, and Srikanth Srinivasan. Pseudorandom generators for read-once ACC^0. In Proceedings of the 27th Conference on Computational Complexity, CCC 2012, Porto, Portugal, June 26-29, 2012, pages 287–297, 2012.

[GMR^{+}12]    Parikshit Gopalan, Raghu Meka, Omer Reingold, Luca Trevisan, and Salil Vadhan. Better pseudorandom generators from milder pseudorandom restrictions. In IEEE Symp. on Foundations of Computer Science (FOCS), 2012.

[Nis91]    Noam Nisan. Pseudorandom bits for constant depth circuits. Combinatorica, 11(1):63–70, 1991.

[Nis92]    Noam Nisan. Pseudorandom generators for space-bounded computation. Combinatorica, 12(4):449–461, 1992.

[NN90]    J. Naor and M. Naor. Small-bias probability spaces: efficient constructions and applications. In 22nd ACM Symp. on the Theory of Computing (STOC), pages 213–223. ACM, 1990.

[Ta-17]    Amnon Ta-Shma. Explicit, almost optimal, epsilon-balanced codes. In ACM Symp. on the Theory of Computing (STOC), pages 238–251, 2017.