# Special Topics in Complexity Theory, Lectures 4-5

Special Topics in Complexity Theory, Fall 2017. Instructor: Emanuele Viola

### 1 Lectures 4-5, Scribe: Matthew Dippel

These lectures cover some basics of small-bias distributions, and then a more recent pseudorandom generator for read-once CNF .

### 2 Small bias distributions

Definition 1.[Small bias distributions] A distribution $D$ over $\{0, 1\}^n$ has bias $\epsilon$ if no parity function can distinguish it from uniformly random strings with probability greater than $\epsilon$. More formally, we have:

\begin{aligned}\forall S \subseteq [n], S \neq \emptyset , \left \vert \mathbb {P}_{x \in D}\left [\bigoplus _{i \in S} x_i = 1 \right ] - 1/2\right \vert \leq \epsilon . \end{aligned}

In this definition, the $1/2$ is simply the probability of a parity test being $1$ or $0$ over the uniform distribution. We also note that whether we change the definition to have the probability of the parity test being $0$ or $1$ doesn’t matter. If a test has probability $1/2 + \epsilon$ of being equal to $1$, then it has probability $1 - (1/2 + \epsilon ) = 1/2 - \epsilon$ of being $0$, so the bias is independent of this choice.

This can be viewed as a distribution which fools tests $T$ that are restricted to computing parity functions on a subset of bits.

Before we answer the important question of how to construct and efficiently sample from such a distribution, we will provide one interesting application of small bias sets to expander graphs.

Theorem 2.[Expander construction from a small bias set] Let $D$ be a distribution over $\{0, 1\}^n$ with bias $\epsilon$. Define $G = (V, E)$ as the following graph:

\begin{aligned}V = \{0, 1\}^n, E = \{(x,y) \vert x \oplus y \in \text {support}(D)\}.\end{aligned}

Then, when we take the eigenvalues of the random walk matrix of $G$ in descending order $\lambda _1, \lambda _2, ... \lambda _{2^n}$, we have that:

\begin{aligned}\max \{|\lambda _2|, |\lambda _{2^n}|\} \leq \epsilon .\end{aligned}

Thus, small-bias sets yields expander graphs. Small-bias sets also turn out to be equivalent to constructing good linear codes. Although all these questions have been studied much before the definition of small-bias sets [NN90], the computational perspective has been quite useful, even in answering old questions. For example Ta-Shma used this perspective to construct better codes [Ta-17].

### 3 Constructions of small bias distributions

Just like our construction of bounded-wise independent distributions from the previous lecture, we will construct small-bias distributions using polynomials over finite fields.

Theorem 1.[Small bias construction] Let $\mathcal {F}$ be a finite field of size $2^\ell$, with elements represented as bit strings of length $\ell$. We define the generator $G : \mathcal {F}^2 \rightarrow \{0, 1\}^n$ as the following:

\begin{aligned}G(a, b)_i = \left \langle a^i, b \right \rangle = \sum _{j \leq \ell } (a^i)_j b_j \mod 2.\end{aligned}

In this notation, a subscript of $j$ indicates taking the $j$th bit of the representation. Then the output of $G(a, b)$ over uniform $a$ and $b$ has bias $n / 2^{\ell }$.

Proof. Consider some parity test induced by a subset $S \subset [n]$. Then when applied to the output of $G$, it simplifies as:

\begin{aligned}\sum _{i \in S}G(a, b)_i = \sum _{i \in S}\left \langle a^i, b \right \rangle = \left \langle \sum _{i \in S} a^i, b \right \rangle .\end{aligned}

Note that $\sum _{i \in S} a^i$ is the evaluation of the polynomial $P_S (x) := \sum _{i \in S} x^i$ at the point $a$. We note that if $P_S(a) \neq 0$, then the value of $\left \langle P_S(a), b \right \rangle$ is equally likely to be $0$ or $1$ over the probability of a uniformly random $b$. This follows from the fact that the inner product of any non-zero bit string with a uniformly random bit string is equally likely to be $0$ or $1$. Hence in this case, our generator has no bias.

In the case where $P_S(a) = 0$, then the inner product will always be $0$, independent of the value of $b$. In these situations, the bias is $1/2$, but this is conditioned on the event that $P_S(a) = 0$.

We claim that this event has probability $\leq n / 2^\ell$. Indeed, for non empty $S$, $P_S(a)$ is a polynomial of degree $\leq n$. Hence it has at most $n$ roots. But we are selecting $a$ from a field of size $2^\ell$. Hence the probability of picking one root is $\le n / 2^\ell$.

Hence overall the bias is at most $n/2^\ell$. $\square$

To make use of the generator, we need to pick a specific $\ell$. Note that the seed length will be $|a| + |b| = 2\ell$. If we want to achieve bias $\epsilon$, then we must have $\ell = \log \left (\frac {n}{\epsilon } \right )$. Al the logarithms in this lecture are in base $2$. This gives us a seed length of $2\log \left (\frac {n}{\epsilon } \right )$.

Small-bias are so important that a lot of attention has been devote to optimizing the constant “2” above. A lower bound of $\log n + (2 - o(1))\log (1 / \epsilon )$ on the seed length was known. Ta-Shma recently [Ta-17] gave a nearly matching construction with seed length $\log n + (2 + o(1))\log (1 / \epsilon )$.

We next give a sense of how to obtain different tradeoffs between $n$ and $\epsilon$ in the seed length. We specifically focus on getting a nearly optimal dependence on $n$, because the construction is a simple, interesting “derandomization” of the above one.

#### 3.1 An improved small bias distribution via bootstrapping

We will show another construction of small bias distributions that achieves seed length $(1 + o(1))\log n + O(\log (1/\epsilon ))$. It will make use of the previous construction and proof.

The intuition is the following: the only time we used that $b$ was uniform was in asserting that if $P_S(a) \neq 0$, then $\left \langle P_S(a), b \right \rangle$ is uniform. But we don’t need $b$ to be uniform for that. What do we need from $b$? We need that it has small-bias!

Our new generator is $G(a, G'(a', b'))$ where $G$ and $G'$ are as before but with different parameters. For $G$, we pick $a$ of length $\ell = \log n/\epsilon$, whereas $G'$ just needs to be an $\epsilon$-biased generator on $\ell$ bits, which can be done as we just saw with $O(\log \ell /\epsilon )$ bits. This gives a seed length of $\log n + \log \log n + O(\log 1/\epsilon )$, as promised.

We can of course repeat the argument but the returns diminish.

### 4 Connecting small bias to k-wise independence

We will show that using our small bias generators, we can create distributions which are almost $k$-wise independent. That is, they are very close to a $k$-wise independent distribution in statistical distance, while having a substantially shorter seed length than what is required for $k$-wise independence. In particular, we will show two results:

• Small bias distributions are themselves close to $k$-wise independent.
• We can improve the parameters of the above by feeding a small bias distribution to the generator for $k$-wise independence from the previous lectures. This will improve the seed length of simply using a small bias distribution.

Before we can show these, we’ll have to take a quick aside into some fundamental theorems of Fourier analysis of boolean functions.

#### 4.1 Fourier analysis of boolean functions 101

Let $f : \{-1, 1\}^n \rightarrow \{-1, 1\}$. Here the switch between $\{0, 1\}$ and $\{-1, 1\}$ is common, but you can think of them as being isomorphic. One way to think of $f$ is as being a vector in $\{-1 , 1\}^{2^n}$. The $x$th entry of $f$ indicates the value of $f(x)$. If we let $\bf {1}_S$ be the indicator function returning $1$ iff $x = S$, but once again written as a vector like $f$ is, then any function $f$ can be written over the basis of the $\bf {1}_S$ vectors, as:

\begin{aligned}f = \sum _S f(s) \bf {1}_S.\end{aligned}

This is the “standard” basis.

Fourier analysis simply is a different basis in which to write functions, which is sometimes more useful. The basis functions are $\chi _S(x) : \{-1, 1\}^n \rightarrow \{-1, 1\} = \prod _{i \in S} x_i$. Then any boolean function $f$ can be expressed as:

\begin{aligned}f(x) = \sum _{S \subseteq [n]}\hat {f}(S)\chi _S(x),\end{aligned}

where the $\hat {f}(S)$, called the “Fourier coefficients,” can be derived as:

\begin{aligned}\hat {f}(S) = \mathbb{E} _{x ~ U_n} \left [f(x)\chi _S(x) \right ],\end{aligned}

where the expectation is over uniformly random $x$.

Claim 1. For any function $f$ with range $\{-1,1\}$, its Fourier coefficients satisfy:

\begin{aligned}\sum _{S \subseteq [n]}\hat {f}(S)^2 = 1.\end{aligned}

Proof. We know that $\mathbb{E} [f(x)^2] = 1$, as squaring the function makes it $1$. We can re-express this expectation as:

\begin{aligned}\mathbb{E} [f(x)f(x)] = \mathbb{E} \left [\sum _S \hat {f}(s)\chi _S(x) \cdot \sum _T \hat {f}(T)\chi _T(x)\right ] = \mathbb{E} \left [\sum _{S, T} \hat {f}(s)\chi _S(x) \hat {f}(T)\chi _T(x)\right ].\end{aligned}

We make use of the following fact: if $S \neq T$, then $\mathbb{E} [\chi _S(x)\chi _T(x)] = \mathbb{E} [\chi _{S \oplus T}(x)] = 0$. If they equal each other, then their difference is the empty set and this function is $1$.

Overall, this implies that the above expectation can be simply rewritten as:

\begin{aligned}\sum _{S = T}\hat {f}(S)\hat {f}(T) = \sum _S \hat {f}(S)^2.\end{aligned}

Since we already decided that the expectation is $1$, the claim follows. $\square$

### 5 Small bias distributions are close to $k$-wise independent

Before we can prove our claim, we formally introduce what we mean for two distributions to be close. We use the most common definition of statistical difference, which we repeat here:

Definition 1. Let $D_1, D_2$ be two distributions over the same domain $H$. Then we denote their statistical distance $\text {SD}(D_1, D_2)$, and sometimes written as $\Delta (D_1, D_2)$, as

\begin{aligned}\Delta (D_1, D_2) = \max _{T \subseteq H} \left | \mathcal {P}[D_1 \in T] - \mathcal {P}[D_2 \in T]\right |.\end{aligned}

Note that the probabilities are with respect to the individual distributions $D_1$ and $D_2$. We may also say that $D_1$ is $\epsilon$-close to $D_2$ if $\Delta (D_1, D_2) \leq \epsilon$.

We can now show our result, which is known as Vazirani’s XOR Lemma:

Theorem 2. If a distribution $D$ over $\{0, 1\}^n$ has bias $\epsilon$, then $D$ is $\epsilon 2^{n / 2}$ close to the uniform distribution.

Proof. Let $T$ be a test. To fit the above notation, we can think of $T$ as being defined as the set of inputs for which $T(x) = 1$. Then we want to bound:

\begin{aligned}|\mathbb{E} [T(D)] - \mathbb{E} [T(U)]|.\end{aligned}

Expanding $T$ in Fourier basis we rewrite this as

\begin{aligned}|\mathbb{E} [\sum _S \hat {T_S}\chi _S(D)] - \mathbb{E} [\sum _S \hat {T_S}\chi _S(U)]|= |\sum _S \hat {T_S}\left (\mathbb{E} [\chi _S(D)] - \mathbb{E} [\chi _S(U)]\right )|.\end{aligned}

We know that $\mathbb{E} _U[\chi _S(x)] = 0$ for all non empty $S$, and $1$ when $S$ is the empty set. We also know that $\mathbb{E} _D[\chi _S(x)] \leq \epsilon$ for all non empty $S$, and is $1$ when $S$ is the empty set. So the above can be bounded as:

\begin{aligned}\leq \sum _{S \ne \emptyset } |\hat {T_S}| |\mathbb{E} _D[\chi _S(x)] - \mathbb{E} _U[\chi _S(x)]| \leq \sum _S |\hat {T_S}| \epsilon = \epsilon \sum _S |\hat {T_S}|.\end{aligned}

Lemma 3. $\sum _S |\hat {T_S}| \leq 2^{n / 2}$

Proof. By Cauchy Schwartz:

\begin{aligned}\sum |\hat {T_S}| \leq 2^{n/2} \sqrt {\sum \hat {T_S} ^2} \leq 2^{n/2}\end{aligned}

Where the last simplification follows from Claim 1. $\square$

Using the above lemma completes the upper bound and the proof of the theorem. $\square$

Corollary 4. Any $k$ bits of an $\epsilon$ biased distribution are $\epsilon 2^{k / 2}$ close to uniform.

Using the corollary above, we see that we can get $\epsilon$ close to a $k$-wise independent distribution (in the sense of the corollary) by taking a small bias distribution with $\epsilon ' = \epsilon / 2^{k / 2}$. This requires seed length $\ell = O(\log (n / \epsilon ') = O(\log (2^{k/2}n / \epsilon ) = O(\log (n) + k + \log (1 / \epsilon )).$ Recall that for exact $k$-wise we required seed length $k \log n$.

#### 5.1 An improved construction

Theorem 5. Let $G : \{0, 1\}^{k\log n} \rightarrow \{0, 1\}^n$ be the generator previously described that samples a $k$-wise independent distribution (or any linear $G$). If we replace the input to $G$ with a small bias distribution of $\epsilon ' = \epsilon / 2^k$, then the output of $G$ is $\epsilon$-close to being $k$-wise independent.

Proof. Consider any parity test $S$ on $k$ bits on the output of $G$. It can be shown that $G$ is a linear map, that is, $G$ simply takes its seed and it multiplies it by a matrix over the field GF(2) with two elements. Hence, $S$ corresponds to a test $S'$ on the input of $G$, on possibly many bits. The test $S'$ is not empty because $G$ is $k$-wise independent. Since we fool $S'$ with error $\epsilon '$, we also fool $S$ with error $\epsilon$, and the theorem follows by Vazirani’s XOR lemma. $\square$

Using the seed lengths we saw we get the following.

Corollary 6. There is a generator for almost $k$-wise independent distributions with seed length $O(\log \log n + \log (1 / \epsilon ) + k)$.

### 6 Tribes Functions and the GMRTV Generator

We now move to a more recent result. Consider the Tribes function, which is a read-once CNF on $k \cdot w$ bits, given by the And of $k$ terms, each on $w$ bits. You should think of $n = k \cdot w$ where $w \approx \log n$ and $k \approx n/\log n$.

We’d like a generator for this class with seed length $O(\log n/\epsilon )$. This is still open! (This is just a single function, for which a generator is trivial, but one can make this challenge precise for example by asking to fool the Tribes function for any possible negation of the input variables. These are $2^n$ tests and a generator with seed length $O(\log n/\epsilon )$ is unknown.)

The result we saw earlier about fooling And gives a generator with seed length $O(\log n)$, however the dependence on $\epsilon$ is poor. Achieving a good dependence on $\epsilon$ has proved to be a challenge. We now describe a recent generator which gives seed length $O(\log n/\epsilon ) (\log \log n)^{O(1)}$. This is incomparable with the previous $O(\log n)$, and in particular the dependence on $n$ is always suboptimal. However, when $\epsilon = 1/n$ the generator gives seed length $O(\log n) \log \log n$ which is better than previously available constructions.

The high-level technique for doing this is based on iteratively restricting variables, and goes back about 30 years [AW89]. This technique seems to have been abandoned for a while, possibly due to the spectacular successes of Nisan [Nis91Nis92]. It was revived in (see also [GLS12]) with an emphasis on a good dependence on $\epsilon$.

A main tool is this claim, showing that small-bias distributions fool products of functions with small variance. Critically, we work with non-boolean functions (which later will be certain averages of boolean functions).

Claim 1. Let $f_1, f_2, ..., f_k : \{0, 1\}^w \rightarrow [0,1]$ be a series of boolean functions. Further, let $D = (v_1, v_2, ..., v_k)$ be an $\epsilon$-biased distribution over $wk$ bits, where each $v_i$ is $w$ bits long. Then

\begin{aligned}\mathbb{E} _D[\prod _i f_i(v_i)] - \prod _i \mathbb{E} _U[f_i(U)] \leq \left (\sum _i \text {var}(f_i) \right )^d + (k2^w)^d\epsilon ,\end{aligned}

where $\text {var}(f) := \mathbb{E} [f^2] - \mathbb{E} ^2[f]$ is variance of $f$ with respect to the uniform distribution.

This claim has emerged from a series of works, and this statement is from a work in progress with Chin Ho Lee. For intuition, note that constant functions have variance $0$, in which case the claim gives good bounds (and indeed any distribution fools constant functions). By contrast, for balanced functions the variance is constant, and the sum of the variances is about $k$, and the claim gives nothing. Indeed, you can write Inner Product as a product of nearly balanced functions, and it is known that small-bias does not fool it. For this claim to kick in, we need each variance to be at most $1/k$.

In the tribes function, the And fucntions have variance $2^{-w}$, and the sum of the variances is about $1$ and the claim gives nothing. However, if you perturb the Ands with a little noise, the variance drops polynomially, and the claim is useful.

Claim 2. Let $f$ be the AND function on $w$ bits. Rewrite it as $f(x, y)$, where $|x| = |y| = w / 2$. That is, we partition the input into two sets. Define $g(x)$ as:

\begin{aligned}g(x) = \mathbb{E} _y[f(x, y)],\end{aligned}

where $y$ is uniform. Then $\text {var}(g) = \Theta (2^{-3w/2}).$

Proof.

\begin{aligned}\text {var}(g) = \mathbb{E} [g(x)^2] - \left (\mathbb{E} [g(x)]\right )^2 = \mathbb{E} _x[\mathbb{E} _y[f(x,y)]^2] - \left (\mathbb{E} _x[\mathbb{E} _y[f(x,y)]] \right )^2.\end{aligned}

We know that $\left (\mathbb{E} _x[\mathbb{E} _y[f(x,y)]] \right )$ is simply the expected value of $f$, and since $f$ is the AND function, this is $2^{-w}$, so the right term is $2^{-2w}$.

We reexpress the left term as $\mathbb{E} _{x,y, y'}[f(x,y)f(x, y')]$. But we note that this product is $1$ iff $x = y = y' = \bf {1}$. The probability of this happening is $(2^{-w/2})^3 = 2^{-3w/2}$.

Thus the final difference is $2^{-3w/2}(1 - 2^{-w/2}) = \Theta (2^{-3w/2})$. $\square$

We’ll actually apply this claim to the Or function, which has the same variance as And by De Morgan’s laws.

We now present the main inductive step to fool tribes.

Claim 3. Let $f$ be the tribes function, where the first $t \leq w$ bits of each of the terms are fixed. Let $w' = w - t$ be the free bits per term, and $k' \leq k$ the number of terms that are non-constant (some term may have become $0$ after fixing the bits).

Reexpress $f$ as $f(x, y) = \bigwedge _{k'} \left (\bigvee (x_i, y_i) \right )$, where each term’s input bits are split in half, so $|x_i| = |y_i| = w' / 2$.

Let $D$ be a small bias distribution with bias $\epsilon ^c$ (for a big enough $c$ to be set later). Then

\begin{aligned}\left \vert \mathbb{E} _{(x, y) \in U^2}[f(x,y)] - \mathbb{E} _{(x, y) \in (D,U)}[f(x,y)] \right \vert \leq \epsilon .\end{aligned}

That is, if we replace half of the free bits with a small bias distribution, then the resulting expectation of the function only changes by a small amount.

To get the generator from this claim, we repeatedly apply Claim 3, replacing half of the bits of the input with another small bias distribution. We repeat this until we have a small enough remaining amount of free bits that replacing all of them with a small bias distribution causes an insignificant change in the expectation of the output.

At each step, $w$ is cut in half, so the required number of repetitions to reduce $w'$ to constant is $R = \log (w) = \log \log (n)$. Actually, as explained below, we’ll stop when $w = c' \log \log 1/\epsilon$ for a suitable constant $c'$ (this arises from the error bound in the claim above, and we).

After each replacement, we incur an error of $\epsilon$, and then we incur the final error from replacing all bits with a small bias distribution. This final error is negligible by a result which we haven’t seen, but which is close in spirit to the proof we saw that bounded independence fools AND.

The total accumulated error is then $\epsilon ' = \epsilon \log \log (n)$. If we wish to achieve a specific error $\epsilon$, we can run each small bias generator with $\epsilon / \log \log (n)$.

At each iteration, our small bias distribution requires $O(\log (n / \epsilon ))$ bits, so our final seed length is $O(\log (n / \epsilon )) \text {poly}\log \log (n)$.

Proof of Claim 3. Define $g_i(x) = \mathbb{E} _y[\bigvee _i(x_i, y_i)]$, and rewrite our target expression as:

\begin{aligned}\mathbb{E} _{x \in U}\left [\prod g_i(x_i)\right ] - \mathbb{E} _{x \in D}\left [\prod g_i(x_i)\right ].\end{aligned}

This is in the form of Claim 1. We also note that from Claim 2 that $\text {var}(g_i) = 2^{-3w'/2}$.

We further assume that $k' \leq 2^{w'} \log (1 / \epsilon )$. For if this is not true, then the expectation over the first $2^{w'} \log (1 / \epsilon )$ terms is $\leq \epsilon$, because of the calculation

\begin{aligned}(1 - 2^{-w'})^{2^{w'} \log (1 / \epsilon )} \leq \epsilon .\end{aligned}

Then we can reason as in the proof that bounded independence fools AND (i.e., we can run the argument just on the first $2^{w'} \log (1 / \epsilon )$ terms to show that the products are close, and then use the fact that it is small under uniform, and the fact that adding terms only decreases the probability under any distribution).

Under the assumption, we can bound the sum of the variances of $g$ as:

\begin{aligned}\sum \text {var}(g_i) \leq k' 2^{-3w' / 2} \leq 2^{-\Omega (w')}\log (1 / \epsilon ).\end{aligned}

If we assume that $w' \ge c \log \log (1 / \epsilon )$ then this sum is $\leq 2^{-\Omega (w')}$.

We can then plug this into the bound from Claim 1 to get

\begin{aligned}(2^{-\Omega (w')})^d + (k2^{w'})^d \epsilon ^c = 2^{-\Omega (dw')} + 2^{O(dw')}\epsilon ^c.\end{aligned}

Now we set $d$ so that $\Omega (dw') = \log (1 / \epsilon )+1$, and the bound becomes:

\begin{aligned}\epsilon / 2 + (1 / \epsilon )^{O(1)}\epsilon ^{c} \leq \epsilon .\end{aligned}

By making $c$ large enough the claim is proved. $\square$

In the original paper, they apply these ideas to read-once CNF formulas. Interestingly, this extension is more complicated and uses additional ideas. Roughly, the progress measure is going to be number of terms in the CNF (as opposed to the width). A CNF is broken up into a small number of Tribes functions, the above argument is applied to each Tribe, and then they are put together using a general fact that they prove, that if $f$ and $g$ are fooled by small-bias then also $f \wedge g$ on disjoint inputs is fooled by small-bias.

### References

[AW89]    Miklos Ajtai and Avi Wigderson. Deterministic simulation of probabilistic constant-depth circuits. Advances in Computing Research – Randomness and Computation, 5:199–223, 1989.

[GLS12]    Dmitry Gavinsky, Shachar Lovett, and Srikanth Srinivasan. Pseudorandom generators for read-once accˆ0. In Proceedings of the 27th Conference on Computational Complexity, CCC 2012, Porto, Portugal, June 26-29, 2012, pages 287–297, 2012.

[GMR$^{+}$12]    Parikshit Gopalan, Raghu Meka, Omer Reingold, Luca Trevisan, and Salil Vadhan. Better pseudorandom generators from milder pseudorandom restrictions. In IEEE Symp. on Foundations of Computer Science (FOCS), 2012.

[Nis91]    Noam Nisan. Pseudorandom bits for constant depth circuits. Combinatorica. An Journal on Combinatorics and the Theory of Computing, 11(1):63–70, 1991.

[Nis92]    Noam Nisan. Pseudorandom generators for space-bounded computation. Combinatorica, 12(4):449–461, 1992.

[NN90]    J. Naor and M. Naor. Small-bias probability spaces: efficient constructions and applications. In 22nd ACM Symp. on the Theory of Computing (STOC), pages 213–223. ACM, 1990.

[Ta-17]    Amnon Ta-Shma. Explicit, almost optimal, epsilon-balanced codes. In ACM Symp. on the Theory of Computing (STOC), pages 238–251, 2017.

# Special Topics in Complexity Theory, Lectures 2-3

Special Topics in Complexity Theory, Fall 2017. Instructor: Emanuele Viola

### 1 Lectures 2-3, Scribe: Tanay Mehta

In these lectures we conclude the proof that bounded independence fools the AND function, and look at the more recent result that bounded independence fools the circuit class $\mathrm {AC}^0$.

#### 1.1 Bounded Independence Fools AND

We state again the theorem from last time.

Theorem 1. Let $(X_1, \dots , X_n)$ be a distribution over $\{0,1\}^n$ such that any $k$ $X_i$ are independent (but not necessarily uniform). Then, we have that

\begin{aligned} \left |\Pr \left [\bigwedge _{i=1}^n X_i = 1\right ] - \prod _{i=1}^n \Pr [X_i = 1]\right | \leq 2^{- \Omega (k)}\end{aligned}

Proof. Let $D$ be the distribution of $(X_1, \dots , X_n)$. Let $B$ be the $n$-wise independent distribution $(Y_1, \dots , Y_n)$ such that $\Pr [Y_i = 1] = \Pr [X_i = 1]$ for all $i \in [n]$ and the $Y_i$ are independent. The theorem is equivalent to the following statement.

\begin{aligned}|\Pr _{X \leftarrow D}\left [\bigwedge _{i=1}^n X_i = 1 \right ] - \Pr _{X \leftarrow B}\left [\bigwedge _{i=1}^n X_i = 1 \right ]| \leq 2^{-\Omega (k)}\end{aligned}

We will prove the above statement by the following version of the Inclusion-Exclusion principle.

##### 1.1.1 Inclusion-Exclusion Principle

Let $V$ be any distribution over $\{0,1\}^n$. Note that by De Morgan’s laws, we have

\begin{aligned}\Pr \left [\bigwedge V_i = 1 \right ] = 1 - \Pr \left [\bigvee V_i = 0 \right ]\end{aligned}

Let $E_i$ be the event that $V_i = 0$. We want to bound the quantity $\Pr \left [\bigcup E_i\right ]$. By looking at the Venn diagram of the events $E_i$, we can see that

\begin{aligned} \Pr \left [\bigcup E_i\right ] & \leq \Pr [E_1] + \dots + \Pr [E_n] = \sum _i \Pr [E_i] \nonumber \\ \Pr \left [\bigcup E_i\right ] & \geq \sum _i \Pr [E_i] - \sum _{i, j} \Pr [E_i \cap E_j] \nonumber \\ \Pr \left [\bigcup E_i\right ] & \leq \sum _i \Pr [E_i] ~~ - \sum _{S \subseteq [n], |S| = 2} \Pr \left [\bigcap _{i \in S} E_i \right ]~~+ \sum _{S \subseteq [n], |S| = 3} \Pr \left [\bigcap _{i \in S} E_i\right ], \nonumber \end{aligned}

and so on. In general, we have the following. Define

\begin{aligned}T_j := \sum _{S \subseteq [n], |S| = j} \Pr \left [ \bigcap _{i \in S} E_i \right ]\end{aligned}
\begin{aligned}S_h := \sum _{i=1}^h (-1)^{i+1} T_i \end{aligned}

Then, we have the bounds $\Pr \left [\bigcup E_i\right ] \leq S_j$ for odd $j$, and $\Pr \left [\bigcup E_i\right ] \geq S_j$ for even $j$. This fact holds for any distribution.

Let us return to the proof. Note that the $S_h$ are the same for $D$ and $B$ up to $h = k$ because they only involve sums of ANDs of at most $k$ events. Hence, we have that

\begin{aligned}\left |\Pr _D \left [\bigwedge X_i = 1\right ] - \Pr _B\left [ \bigwedge X_i = 1\right ]\right | \leq |S_k - S_{k-1}| = |T_k|\end{aligned}

where the last equality comes from the definition of $S_k$. Therefore, we are done if $|T_k| \leq 2^{-\Omega (k)}$. We have that

\begin{aligned}T_k = \sum _{S \subseteq [n], |S| = k} \Pr \left [\bigcap _{i \in S} E_i \right ] ={ n \choose k} ~~ \mathbb {E}_{S \subseteq [n], |S| = k} \left [ \prod _{i \in S} P_i \right ] \end{aligned}

where $P_i := \Pr [E_i] = 1 - \Pr [X_i = 1]$. To bound the expectation we recall a useful inequality.

##### 1.1.2 A Useful Inequality

Let $Q_1, \dots , Q_n$ be non-negative real numbers. Then, by the AM-GM inequality, we have that

\begin{aligned}\frac {\sum _i Q_i}{n} \geq \left (\prod _i Q_i \right )^{1/n}.\end{aligned}

Consider the following more general statement,

\begin{aligned} \mathbb {E}_{S \subseteq [n], |S| = 1} \left [ \prod _{i \in S} Q_i \right ] \geq \mathbb {E}_{S \subseteq [n], |S| = 2} \left [ \prod _{i \in S} Q_i \right ]^{1/2} \geq \dots \hspace {5cm} \nonumber \\ \dots \geq \mathbb {E}_{S \subseteq [n], |S| = k} \left [ \prod _{i \in S}Q_i \right ]^{1/k} \geq \dots \geq \mathbb {E}_{S \subseteq [n], |S| = n} \left [ \prod _{i \in S} Q_i \right ]^{1/n} \nonumber \end{aligned}

and note that the left most term is equal to $\frac {\sum _i Q_i}{n}$, while the right most term is equal to $\left (\prod _i Q_i \right )^{1/n}$

Applying the above inequality to $T_k$ and a common approximation for the binomial coefficient, we have that

\begin{aligned}T_k = { n \choose k} ~~ \mathbb {E}_{S \subseteq [n], |S| = k} \left [ \prod _{i \in S} P_i \right ] \leq {n \choose k} \sum _{i = 1}^n \left (\frac {P_i}{n}\right )^k \leq \left (\frac {e n}{k}\right )^k \left (\frac {\sum P_i}{n}\right )^k = \left (\frac {e \sum P_i}{k}\right )^k.\end{aligned}

Therefore, we are done if $\sum P_i \leq \frac {k}{2 e}$. Recall that $P_i = \Pr [E_i] = 1 - \Pr [X_i = 1]$. So if $P_i$ is small then $\Pr [X_i = 1]$ is close to $1$.

It remains to handle the case that $\sum P_i \geq \frac {k}{2 e}$. Pick $n'$ such that

\begin{aligned}\sum _{i=1}^{n'} P_i = \frac {k}{2e} \pm 1.\end{aligned}

By the previous argument, the AND of the first $n'$ is the same up to $2^{-\Omega (k)}$ for $D$ and $B$. Also, for every distribution the probability of that the And of $n$ bits is 1 is at most the probability that the And of $n'$ bits is $1$. And also, for the $n$-wise independent distribution $B$ we have

\begin{aligned} \Pr _B\left [\bigwedge _{i=1}^{n'} X_i = 1\right ] & = \prod _{i=1}^{n'} \Pr [X_i = 1] \\ & = \prod _{i=1}^{n'} (1 - P_i)\\ & \leq \left (\frac {\sum _{i=1}^{n'}(1-P_i)}{n'}\right )^{n'} \textrm { by the AM-GM inequality} \\ & \leq \left (\frac {n' - k/2e}{n'}\right )^{n'} \leq (1 - k/2en')^{n'} \leq e^{-\Omega (k)}. \end{aligned}

The combination of these facts concludes this case. To summarize, in this case we showed that

\begin{aligned}\Pr _D[\bigwedge _{i=1}^n X_i = 1] \le \Pr _D[\bigwedge _{i=1}^{n'} X_i = 1].\end{aligned}

as well as

\begin{aligned}\Pr _B[\bigwedge _{i=1}^n X_i = 1] \le \Pr _B[\bigwedge _{i=1}^{n'} X_i = 1] \leq 2^{-\Omega (k)}.\end{aligned}

By the choice of $n'$ and the previous argument, we also know that $|\Pr _D[\bigwedge _{i=1}^{n'} X_i = 1]- \Pr _B[\bigwedge _{i=1}^{n'} X_i = 1]| \le 2^{-\Omega (k)}$ and so we are done, as all quantities above are at most $2^{-\Omega (k)}$ (and at least $0$). $\square$

Remark 2. The bound is tight up to $\Omega ( .)$

Proof. Let $D$ be the distribution over $\{0,1\}^{k+1}$ as follows: $D_{1, \dots , k} = U_k$ and $D_{k+1} = D_1 + \dots + D_k \mod 2$. Then, $D$ is $k$-wise independent. However, if $k$ is even, then

\begin{aligned}\Pr [\bigwedge _{i=1}^{k+1} D_i = 1] = 0.\end{aligned}

Yet, we have that

\begin{aligned}\Pr [\bigwedge _{i=1}^{k+1} U_i = 1] = 2^{-(k+1)}.\end{aligned}

$\square$

#### 1.2 Bounded Independence Fools $\textrm {AC}^0$

Acknowledgement. This section is based on Amnon Ta-Shma’s notes for the class 0368.4159 Expanders, Pseudorandomness and Derandomization CS dept, Tel-Aviv University, Fall 2016.

Note that a DNF on $n$ bits can be modeled as a depth two circuit where the top layer is an OR-gate whose inputs are AND-gates, which take inputs $X_1, \dots , X_n$ and their negations. The circuit class $\textrm {AC}^0$ can be viewed as a generalization of this to higher (but constant) depth circuits. That is, $\textrm {AC}^0$ consists of circuits using AND-gates, OR-gates, NOT-gates, and input registers. Each of the gates have unbounded fan-in (i.e. the number of input wires). The size of the circuit is defined to be the number of gates.

$\textrm {AC}^0$ is one of the most studied classes in complexity theory. $\textrm {AC}^0$ circuits of polynomial size can do many things, including adding and subtracting $n$-bit integers.

Conjecture 3.[Linial-Nisan[LN90]] $\log ^{O(d)} s$-wise independence fools $\textrm {AC}^0$ circuits of depth $d$ and size $s$.

The conjecture was open for a long time, even for in the special case $d=2$. In 2007 a breakthrough work by Bazzi [Baz09] proved it for $d=2$. Shortly afterwards, Razborov presented a simpler proof of Bazzi’s result [Raz09], and Braverman proved the conjecture for any $d$ with $\log ^{d^2} s$-wise independence [Bra10]. Tal improved the result to $\log ^{O(d)} s$ [Tal17].

Interestingly, the progress on the conjecture does not use ideas that were not around since the time of its formulation. Bottom line: if a problem is open for a long time, you should immediately attack it with existing tools.

The high-level intuition why such a result should be true is the following:

1. $\textrm {AC}^0$ is approximated by polylog degree polynomials.
2. $k$-wise independence fools degree-$k$ polynomials.

Proof of (2). Let $x = (x_1, \dots , x_n) \in \{0,1\}^n$. Let $p(x_1, \dots , x_n)$ be a degree $k$ polynomial over $\mathbb {R}$. Write $p$ as

\begin{aligned}p(x_1, \dots , x_n) = \sum _{M \subseteq [n], |M| \leq k} c_M \cdot x_M.\end{aligned}

If $D$ is a $k$-wise independent distribution on $\{0,1\}^n$, then by linearity of expectation

\begin{aligned} \mathbb {E}_D[P] = \sum _{M \subseteq [n], |M| \leq k} c_M \mathbb {E}_D[x_M] = \sum _{M \subseteq [n], |M| \leq k} c_M \mathbb {E}_U[x_M] = \mathbb {E}_U[P]. \end{aligned}

$\square$

There are several notions of approximating AC$^0$ by low-degree polynomials. We now review two of them, explaining why neither of them is sufficient. Braverman showed how to cleverly combine the two methods to prove a version of (1) that’s strong enough.

##### 1.2.1 Approximation 1

Theorem 4. For all $\textrm {AC}^0$ circuits $C(x_1, \dots , x_n)$ of size $s$ and depth $d$, for all distributions $D$ over $\{0,1\}^n$, for all $\epsilon$, there exists a polynomial $p(x_1, \dots , x_n)$ of degree $\log ^{O(d)} s/\epsilon$ such that

\begin{aligned}\Pr _{x \leftarrow D}[p(x) = C(x)] \geq 1 - \epsilon .\end{aligned}

The important features of this approximation are that it works under any distribution, and when the polynomial is correct it outputs a boolean value.

Similar approximations appear in many papers, going back to Razborov’s paper [Raz87] (who considers polynomials modulo 2) which uses ideas from earlier still work.

Note that the polynomial $p$ depends on the circuit $C$ chosen, and on the distribution. This theorem is not a good enough approximation because on the $\epsilon$ fraction of inputs where the polynomial and circuit are unequal, the value of the polynomial can (and does) explode to be much greater than $1/\epsilon$. This prevents us from bounding the average of the polynomial.

Nevertheless, let us prove the above theorem.

Proof. Consider one OR-gate of fan-in $s$. We construct a distribution of polynomials that compute any input with high probability. This implies that there is a fixed polynomial that computes the circuit on a large fraction of the inputs by an averaging argument.

For $i = 1, 2, \dots , \log s$. let $S_i$ be a random subset of $[s]$ where every element is included with probability $1/2^i$, independently.

Suppose $x$ has Hamming weight $2^j$. Then, $\mathbb {E}[\sum _{n \in S_j} x_n] = 1$. And the sum can be shown to equal $1$ with constant probability.

Define the approximation polynomial $p$ to be

\begin{aligned}p(x) := 1 - \prod _{i=1}^{\log s} (1 - \sum _{h \in S_i} x_h)\end{aligned}

Note that if $x$ has weight $w > 0$, then $p(x) = 0$ with constant probability. If $w =0$, then $p(x) = 1$ with probability $1$. We can adjust the error probability to $\epsilon$ by repeating each term in the product $\log (1/\epsilon )$ times.

Thus, we can approximate one gate with the above polynomial of degree $O(\log (s) \cdot \log (1/\epsilon ))$. Construct polynomials as $p$ above for each gate, with error parameter $\epsilon /s$. The probability that any of them is wrong is at most $\epsilon$ by a union bound. To obtain the approximating polynomial for the whole circuit compose all the polynomials together. Since the circuit is of depth $d$, the final degree of the approximating polynomial is $(\log (s) \cdot \log (s/\epsilon ))^{d}$, as desired.

As mentioned at the beginning, this is a distribution on polynomials that computes correctly any input with probability at least $1-\epsilon$. By averaging, there exists a fixed polynomial that computes correctly a $1-\epsilon$ fraction of inputs. $\square$

It can be verified that the value of the polynomial can be larger than $1/\epsilon$. The polynomial for the gates closest to the input can be as large as $s$. Then at the next level it can be as large as $s^{\log s/\epsilon }$, which is already much larger than $1/\epsilon$.

#### 1.3 Approximation 2

Theorem 5. For all circuits $C$ of size $s$ and depth $d$, for all error values $\epsilon$, there exists a polynomial $p(x_1, \dots , x_n)$ of degree $O(\log (s)^{d-1} \log (1/\epsilon ))$ such that

\begin{aligned} \mathbb {E}_{x \leftarrow U_n}[(C(x) - p(x))^2] \leq \epsilon . \end{aligned}

The important feature of this approximation is that it bounds the average, but only under the uniform distribution. Because it does not provide any guarantee on other distributions, including $k$-wise independent distributions, it cannot be used directly for our aims.

Remark 6. Approximation 2 is proved via the switching lemma, an influential lemma first proved in the early 80’s by Ajtai [Ajt83] and by Furst, Saxe, and Sipser [FSS84]. The idea is to randomly set a subset of the variables to simplify the circuit. You can do this repeatedly to simplify the circuit even further, but it only works on the uniform distribution. Hastad [Hås87] gave a much tighter analysis of the switching lemma, and the paper [LMN93] used it to prove a version of Approximation 2 with a slightly worse dependence on the error. Recently, a refinement of the switching lemma was proved in [Hås14IMP12]. Based on that, Tal [Tal17] obtained the corresponding refinement of Approximation 2 where the parameters are as stated above. (The polynomial is simply obtained from the Fourier expansion of the function computed by the circuit by removing all Fourier coefficients larger than a certain threshold. The bound on the Fourier decay in [Tal17] implies the desired approximation.)

#### 1.4 Bounded Independence Fools $\textrm {AC}^0$

Theorem 7. For all circuits $C$ with unbounded fan-in of size $s$ and depth $d$, for all error values $\epsilon$, for all $k$-wise independent distributions $D$ on $\{0,1\}^n$, we have that

\begin{aligned} |\mathbb {E}[C(D)] - \mathbb {E}[C(U_n)]| \leq \epsilon \end{aligned}

for $k = \log (s/\epsilon )^{O(d)}$.

Corollary 8. In particular, if $s = \textrm {poly}(n)$, $d = O(1)$, $s = 1/\textrm {poly}(n)$, then $k = \log ^{O(1)}(n)$ suffices.

The next claim is the ultimate polynomial approximation used to prove the theorem.

Claim 9. For all circuits $C$ with unbounded fan-in of size $s$ and depth $d$, for all error values $\epsilon$, for all $k$-wise independent distributions $D$ on $\{0,1\}^n$, there is a set $E$ of inputs, and a degree-$k$ polynomial $p$ such that:

1. $E$ is ’rare’ under both $D$ and $U_n$:
$\Pr _{x \leftarrow U_n}[E(x) = 1] \leq \epsilon$, and $\Pr _{x \leftarrow D}[E(x) =1] \leq \epsilon$. Here we write $E(x)$ for the indicator function of the event $x \in E$.
2. For all $x$, $p(x) \leq C(x) \vee E(x)$. Here $\vee$ is the logical Or.
3. $\mathbb {E}[p(U_n)] = \mathbb {E}[C(U_n)] \pm \epsilon$.

We only need (1) under $D$, but (1) under $U$ is used to prove (3).

Proof of Theorem 7 from Claim 9.

\begin{aligned} \mathbb {E}[C(D)] & = \mathbb {E}[C(D) \vee E(D)] \pm \epsilon \text {, by Claim.(1)} \nonumber \\ & \geq \mathbb {E}[p(D)] \pm \epsilon \text {, by Claim.(2)} \nonumber \\ & = \mathbb {E}[p(U_n)] \pm \epsilon \text {, because } p \text { has degree } k \text { and } D \text { is } k \text {-wise independent} \nonumber \\ & = \mathbb {E}[C(U_n)] \pm \epsilon \text {, by Claim.(3)} \nonumber \end{aligned}

For the other direction, repeat the argument for ‘not $C$’. $\square$

We can construct the polynomial approximation from Claim 9 by using a combination of Approximation 1 and 2. First we need a little more information about Approximation 1.

Claim 10. Two properties of approximation 1:

1. For all $x$, $p(x) \leq 2^{\log (s/\epsilon )^{O(d)}}$.
2. The ’bad’ set $E$ is computable by a circuit of size $\textrm {poly}(s)$, and depth $d + O(1)$.

Proof of Claim 10 part 2. Consider a single OR gate with input gates $g_1, \dots , g_s$. This is represented in the approximating polynomial by the term

\begin{aligned}1 - \prod _{i=1}^{\textrm {polylog}(s/\epsilon )}(1 - \sum _{j \in S_i} g_j).\end{aligned}

Note that the term is incorrect exactly when the input $g_1, \dots , g_s$ has weight $> 0$ but all the sets $S_i$ intersect $0$ or $\geq 2$ ones. This can be checked in AC$^0$, in parallel for all gates in the circuit. $\square$

Proof of Claim 9. Run approximation 1 for the distribution $\frac {D+U}{2}$, yielding the polynomial $p_c$ and the set $E$. This already proves the first part of the claim for both $D$ and $U$, because if $E$ has probability $\epsilon$ under $D$ it has probability $\ge \epsilon /2$ under $(D+U)/2$, and the same for $U$ . Use Claim 10 part 2, and run approximation 2 on $E$. Call the resulting polynomial $p_E$, which has degree $\log (s/\delta )^{O(d)}$ with error bound $\delta$.

The idea in the ultimate approximating polynomial is to “check if there is a mistake, and if so, output $0$. Otherwise, output $C$”. Formally:

\begin{aligned}p(x) := 1 - (1 - p_c(1 - p_E))^2\end{aligned}

Claim 9 part 2 can be shown as follows. $p(x) \leq 1$ by definition. So, if $C(x) \vee E(x) = 1$, then we are done. Otherwise, $C(x) \vee E(x) = 0$. So there is no mistake, and $C=0$. Hence, by the properties of Approximation 1, $p_c(x) = 0$. This implies $p(x)=0$.

It only remains to show Claim 9 part 3:

\begin{aligned} \mathbb {E}_U[p(x)] = \mathbb {E}_U[C(x)] \pm \epsilon . \end{aligned}

By part 1 of Claim 9,

\begin{aligned} \mathbb {E}_U[C(x) - p(x)] = \mathbb {E}_U[C(x) \vee E(x) - p(x)] \pm \epsilon . \end{aligned}

We can show that this equals

\begin{aligned} \mathbb {E}_U\left [\left (C(x) \vee E(x) - p_c(x)(1-p_E(x))\right )^2\right ] \pm \epsilon \end{aligned}

by the following argument: If $C(x) \vee E(x) =1$ then $1-p(x) = (1-p_c(x)(1-p_E(x)) )^2$ by definition. If $C(x) \vee E(x) =0$, then there is no mistake, and $C(x) = 0$. This implies that $p_c(x)(1 - p_E(x)) = p(x) = 0$.

Let us rewrite the above expression in terms of the expectation $\ell _2$ norm.

\begin{aligned} || C \vee E - p_c(1 - p_E)||_2^2. \end{aligned}

Recall the triangle inequality, which states: $||u - v||_2 \leq ||u - w||_2 + ||w - v||_2$. Therefore, letting $w = p_c(1 - E)$ we have that the above quantity is

\begin{aligned} \leq (|| p_c(1 - E) - p_c(1 - p_E)||_2 ~~ + ~~ ||p_c(1 - E) - C \vee E ||_2)^2 \end{aligned}
\begin{aligned} \leq O(||p_c(1-E) - p_c(1 - p_E)||_2^2 + ||p_c (1 - E) - C \vee E ||_2^2). \end{aligned}

To conclude, we will show that each of the above terms are $\leq \epsilon$. Note that

\begin{aligned} ||p_c(1-E) - p_c(1 - p_E)||_2^2 \leq \max _x |p_c(x)|^2 ||(1 - E) - (1 - p_E)||_2^2. \end{aligned}

By Claim 10 part 1 and Approximation 2, this is at most

\begin{aligned} 2^{\log (s/\epsilon )^{O(d)}} \cdot ||E - p_E||_2^2 \leq 2^{\log (s/\epsilon )^{O(d)}} \cdot \delta . \end{aligned}

For this quantity to be at most $\epsilon$ we set $\delta = \epsilon \cdot 2^{-\log (s/\epsilon )^{O(d)}}$. Here we critically set the error in Approximation 2 much lower, to cancel the large values arising from Approximation 1. By Theorem 5, the polynomial arising from approximation 2 has degree $O(\log (s)^{d-1} \log (1/\delta )) = \log (s/\epsilon )^{O(d)}$.

Finally, let us bound the other term, $||p_c (1 - E) - C \vee E ||_2^2$. If $E(x) = 0$, then the distance is $0$. If $E(x) =1$, then the distance is $\leq 1$. Therefore, this term is at most $\Pr _U[E(x) =1]$, which we know to be at most $\epsilon$. $\square$

### References

[Ajt83]    Miklós Ajtai. $\Sigma \sp {1}\sb {1}$-formulae on finite structures. Annals of Pure and Applied Logic, 24(1):1–48, 1983.

[Baz09]    Louay M. J. Bazzi. Polylogarithmic independence can fool DNF formulas. SIAM J. Comput., 38(6):2220–2272, 2009.

[Bra10]    Mark Braverman. Polylogarithmic independence fools AC$^{\mbox {0}}$ circuits. J. of the ACM, 57(5), 2010.

[FSS84]    Merrick L. Furst, James B. Saxe, and Michael Sipser. Parity, circuits, and the polynomial-time hierarchy. Mathematical Systems Theory, 17(1):13–27, 1984.

[Hås87]    Johan Håstad. Computational limitations of small-depth circuits. MIT Press, 1987.

[Hås14]    Johan Håstad. On the correlation of parity and small-depth circuits. SIAM J. on Computing, 43(5):1699–1708, 2014.

[IMP12]    Russell Impagliazzo, William Matthews, and Ramamohan Paturi. A satisfiability algorithm for AC$^{\mbox {0}}$. In ACM-SIAM Symp. on Discrete Algorithms (SODA), pages 961–972, 2012.

[LMN93]   Nathan Linial, Yishay Mansour, and Noam Nisan. Constant depth circuits, Fourier transform, and learnability. J. of the ACM, 40(3):607–620, 1993.

[LN90]     Nathan Linial and Noam Nisan. Approximate inclusion-exclusion. Combinatorica, 10(4):349–365, 1990.

[Raz87]    Alexander Razborov. Lower bounds on the dimension of schemes of bounded depth in a complete basis containing the logical addition function. Akademiya Nauk SSSR. Matematicheskie Zametki, 41(4):598–607, 1987. English translation in Mathematical Notes of the Academy of Sci. of the USSR, 41(4):333-338, 1987.

[Raz09]    Alexander A. Razborov. A simple proof of Bazzi’s theorem. ACM Transactions on Computation Theory (TOCT), 1(1), 2009.

[Tal17]    Avishay Tal. Tight bounds on the fourier spectrum of AC0. In Conf. on Computational Complexity (CCC), pages 15:1–15:31, 2017.

# Special Topics in Complexity Theory: Lecture 1

Special Topics in Complexity Theory, Fall 2017. Instructor: Emanuele Viola

### 1 Lecture 1, Scribe: Chin Ho Lee

In this first lecture we begin with some background on pseudorandomness and then we move on to the study of bounded independence, presenting in particular constructions and lower bounds.

#### 1.1 Background

Let us first give some background on randomness. There are 3 different theories:

(1) Classical probability theory. For example, if we toss a coin $12$ times then the probability of each outcome is the same, i.e., $\Pr [010101010101] = \Pr [011011100011]$. However, intuitively we feel that the first outcome is less random than the second.

(2) Kolmogorov complexity. Here the randomness is measured by the length of the shortest program outputting a string. In the previous example, the program for the second outcome could be “print 011011100011”, whereas the program for the first outcome can be “print 01 six times”, which is shorter than the first program.

(3) Pseudorandomness. This is similar to resource-bounded Kolmogorov complexity. Here random means the distribution “looks random” to “efficient observers.”

Let us now make the above intuition precise.

Definition 1.[Pseudorandom generator (PRG)] A function $f\colon \{0,1\}^s \to \{0,1\}^n$ is a pseudorandom generator (PRG) against a class of tests $T \subseteq \{ t\colon \{0,1\}^n \to \{0,1\}\}$ with error $\epsilon$, if it satisfies the following 3 conditions:

(1) the output of the generator must be longer than its input, i.e., $n > s$;

(2) it should fool $T$, that is, for every test $t \in T$, we have $\Pr [t(U_n) = 1] = \Pr [t(f(U_s)) = 1] \pm \epsilon$;

(3) the generator must be efficient.

To get a sense of the definition, note that a PRG is easy to obtain if we drop any one of the above 3 conditions. Dropping condition (1), then we can define our PRG as $f(x) := x$. Dropping condition (2), then we can define our PRG as $f(x) := 0$. Dropping condition (3), then the PRG is not as obvious to obtain as the previous two cases. We have the following claim.

Claim 2. For every class of tests $T$, there exists an inefficient PRG with error $\epsilon$ and seed length $s = \lg _2\lg _2(|T|) + 2 \lg _2(1/\epsilon ) + O(1)$.

Before proving the claim, consider the example where $T$ is the class of circuits of size $n^{100}$ over $n$-bit input, it is known that $|T| = 2^{n^{O(1)}}$. Hence, applying our claim above we see that there is an inefficient PRG that fools $T$ with error $\epsilon$ and seed length $s = O(\lg _2(n/\epsilon ))$.

We now prove the claim using the probabilistic method.

Proof. Consider picking $f$ at random. Then by the Chernoff bound, we have for every test $t \in T$,

\begin{aligned} \Pr _f[ | \Pr _{U_s} [ t(f(U_s)) = 1] - \Pr _{U_n}[ t(U_n) = 1] | \ge \epsilon ] \le 2^{-\Omega (\epsilon ^2 2^s)} < 1 / |T|, \end{aligned}

if $s = \lg _2\lg _2(|T|) + 2 \lg _2(1/\epsilon ) + O(1)$. Therefore, by a union bound over $t \in T$, there exists a fixed $f$ such that for every $t \in T$, the probabilities are within $\epsilon$. $\square$

#### 1.2 $k$-wise independent distribution

A major goal in research in pseudorandomness is to construct PRGs for (1) richer and richer class $T$, (2) smaller and smaller seed length $s$, and making the PRG explicit. For starters, let us consider a simple class of tests.

Definition 3.[$d$-local tests] The $d$-local tests are tests that depend only on $d$ bits.

We will show that for this class of tests we can actually achieve error $\epsilon = 0$. To warm up, consider what happens when $d = 1$, then we can have a PRG with seed length $s = 1$ by defining $f(0) := 0^n$ and $f(1) := 1^n$.

For $d = 2$, we have the following construction. Define

\begin{aligned} f(x)_y := \langle x, y \rangle = \sum _i x_i y_i \bmod 2 . \end{aligned}

Here the length of $x$ and $y$ is $|x| = |y| = \lg _2 n$, and we exclude $y = 0^{\lg _2 n}$. Note that the output has $n - 1$ bits, but we can append one uniform bit to the output of $f$. So the seed length would be $\lg _2 n + 1$.

Now we prove the correctness of this PRG.

Claim 4. The $f$ defined above is a PRG against $2$-local tests with error $\epsilon = 0$.

Proof. We need to show that for every $y \neq z$, the random variable $(f(x)_y, f(x)_z)$ over the choice of $x$ is identical to $U_2$, the uniform $2$-bit string. Since $y \neq z$, suppose without loss of generality that there exists an $i$ such that $y_i = 1$ and $z_i = 0$. Now $f(x)_z$ is uniform, and conditioned on $z$, $f(x)_y$ is also uniform, thanks to the index $y_i$. $\square$

The case for $d = 3$ becomes much more complicated and involves the use of finite fields. One can think of a finite field as a finite domain that behaves like $\mathbb {Q}$ in the sense that it allows you to perform arithmetic operations, including division, on the elements. We will use the following fact about finite fields.

Lemma 5. There exist finite fields of size $p^k$, for every prime $p$ and integer $k$. Moreover, they can be constructed and operated with in time $\mathrm {poly}(k, p)$.

Remark 6. Ideally one would like the dependence on $p$ to be $\lg _2 p$. However, such construction remains an open question and there have been many attempts to constructing finite fields in time $\mathrm {poly}(k, \lg _2 p)$. Here we only work with finite fields with $p = 2$, and there are a lot of explicit constructions for that.

One simple example of finite fields are integers modulo $p$.

Theorem 7. Let $D = \{0, 1\}^{\lg _2 n}$. For every $k$, there exists an explicit construction over $D^n$ such that

(1) elements in $D^n$ can be sampled with $s = k \lg _2 n$ bits, and

(2) every $k$ symbols are uniform in $D^k$.

For $d = 3$, we can use the above theorem with $k = 3$, and the PRG can output the first bit of every symbol.

Remark 8. There exist other constructions that are similar to the inner product construction for the case $d = 2$, with $y$ carefully chosen, but the way to choose $y$ involves the use of finite fields as well.

Note that we can also apply the theorem for larger $d$ to fool $d$-local tests with seed length $s = d \lg _2 n$.

We now prove the theorem.

Proof. Pick a finite field $\mathbb {F}$ of size $2^{\lg _2 n}$. Let $a_0, \ldots , a_{n-1} \in \mathbb {F}$ be uniform random elements in $\mathbb {F}$ which we think of as a polynomial $a(x)$ of degree $k-1$. We define the generator $f$ to be

\begin{aligned} f(a_0, \ldots , a_{n-1})_x = a(x) = \sum _{i=0}^{n-1} a_i x^i . \end{aligned}

(One should think of the outputs of $f$ as lines and curves in the real plane.)

The analysis of the PRG follows from the following useful fact: For every $k$ points $(x_0, y_0), (x_1, y_1), \ldots , (x_{k-1}, y_{k-1})$, there exists exactly one degree $k-1$ polynomial going through them. $\square$

Let us now introduce a terminology for PRGs that fool $d$-local tests.

Definition 9. We call distributions that look uniform (with error $0$) to $k$-local tests $k$-wise independent (also known as $k$-wise uniform). The latter terminology is more precise, but the former is more widespread.

We will soon see an example of a distribution where every $k$ elements are independent but not necessarily uniform.

#### 1.3 Lower bounds

We have just seen a construction of $k$-wise independent distributions with seed length $s = d \lg _2 n$. It is natural to ask, what is the minimum seed length of generating $k$-wise independent distributions?

Claim 10. For every $k \ge 2$, every PRG for $k$-local tests over $\{0, 1\}^n$ has seed length $s \ge \Omega (k \lg _2 (n/k))$.

Proof. We use the linear-algebraic method. See the book by Babai–Frankl [1] for more applications of this method.

To begin, we will switch from $\{0, 1\}$ to $\{-1, 1\}$, and write the PRG as a $2^s \times n$ matrix $M$, where the rows are all the possible outputs of the PRG. Since the PRG fools $k$-local tests and $k \ge 2$, one can verify that every $2$ columns of $M$ are orthogonal, i.e., $\langle M_i, M_j \rangle = 0$ for $i \neq j$. As shown below, this implies that the vectors are independent. And by linear algebra this gives a lower bound on $s$.

However so far we have not used $k$. Here’s how to use it. Consider all the column vectors $v$ obtained by taking the entry-wise products of any of the $k/2$ vectors in $M$. Because of $k$-wise independence, these $v$’s are again orthogonal, and this also implies that they are linearly independent.

Claim 11. If $v_1, v_2, \ldots , v_t$ are orthogonal, then they are linearly independent.

Proof. Suppose they are not and we can write $v_i = \sum _{j \in S, i \not \in S} v_j$ for some $S$. Taking inner product with $v_i$ on both sides, we have that the L.H.S. is nonzero, whereas the R.H.S. is zero because the vectors are orthogonal, a contradiction. $\square$

Therefore, the rank of $M$ must be at least the number of $v$’s, and so

\begin{aligned} 2^s \ge \text {number of v's} \ge \binom {n}{k/2} \ge (2n / k)^{k/2} . \end{aligned}

Rearranging gives $s \ge (k/2) \lg _2 (2n / k)$. $\square$

#### 1.4 Who is fooled by $k$-wise independence?

In the coming lectures we will see that $k$-wise independence fools $\mathrm {AC}^0$, the class of constant-depth circuits with unbounded fan-in. Today, let us see what else is fooled by $k$-independence in addition to $k$-local tests.

(1) Suppose we have $n$ independent variables $x_1, \ldots , X_n \in [0,1]$ and we want to understand the behavior of their sum $\sum _i X_i$. Then we can apply tools such as the Chernoff bound, tail bounds, Central Limit Theorem, and the Berry–Esseen theorem. The first two give bounds on large deviation from the mean. The latter two are somewhat more precise facts that show that the sum will approach a normal distribution (i.e., the probability of being larger than $t$ for any $t$ is about the same). One can show that similar results hold when the $X_i$’s are $k$-wise independent. The upshot is that the Chernoff bound gives error $2^{-\text {samples}}$, while under $k$-wise independence we can only get an error $(\text {samples})^{-k/2}$.

(2) We will see next time that $k$-wise independence fools $\mathrm {DNF}$ and $\mathrm {AC}^0$.

(3) $k$-wise independence is also used as hashing in load-balancing.

##### 1.4.1 $k$-wise independence fools AND

We now show that $k$-wise independent distributions fool the AND function.

Claim 12. Every $k$-wise uniform distribution fools the AND functions on bits with error $\epsilon = 2^{-\Omega (k)}$.

Proof. If the AND function is on at most $k$ bits, then by definition the error is $\epsilon = 0$. Otherwise the AND is over more than $k$ bits. Without loss of generality we can assume the AND is on the first $t > k$ bits. Observe that for any distribution $D$, we have

\begin{aligned} \Pr _D[\text {AND on t bits is 1}] \le \Pr _D[\text {AND on k bits is 1}] . \end{aligned}

The right-hand-side is the same under uniform and $k$-wise uniformity, and is $2^{-k}$. Hence,

\begin{aligned} | \Pr _{\text {uniform}}[AND = 1] - \Pr _{\text {k-wise ind.}}[\mathrm {AND} = 1] | \le 2^{-k} . \end{aligned}

$\square$

Instead of working over bits, let us now consider what happens over a general domain $D$. Given $n$ functions $f_1, \ldots , f_n \colon D \to \{0, 1\}$. Suppose $x_1, \ldots , x_n$ are $k$-wise uniform over $D^n$. What can you say about the AND of the outputs of the $f_i$’s, $f_1(x_1), f_2(x_2), \ldots , f_n(x_n)$?

This is similar to the previous example, except now that the variables are independent but not necessarily uniform. Nevertheless, we can show that a similar bound of $2^{-\Omega (k)}$ still holds.

Theorem 13.[[2]] Let $X_1, X_2, \ldots , X_n$ be random variables over $\{0, 1\}$, which are $k$-wise independent, but not necessarily uniform. Then

\begin{aligned} \Pr [ \prod _{i = 1}^n X_i = 1 ] = \prod _{i = 1}^n \Pr [X_i = 1] \pm 2^{-\Omega (k)} . \end{aligned}

This fundamental theorem appeared in the conference version of [2], but was removed in the journal version. One of a few cases where the journal version contains less results than the conference version.

Proof. Since each $X_i$ is in $\{0, 1\}$, by De Morgan’s law, we can write

\begin{aligned} \Pr [\prod _{i=1}^n X_i = 1] = \mathbb{E} [ \prod _{i=1}^n X_i ] = \mathbb{E} [ \mathrm {AND}_{i=1}^n X_i ] = \mathbb{E} [ 1 - \mathrm {OR}_{i=1}^n (1 - X_i)]. \end{aligned}

If we define the event $E_i$ to be $(1 - X_i)$, then $\mathrm {OR}_{i=1}^n (1 - X_i)$ is the same as $\Pr [\bigcup _{I=1}^n E_i]$. Now we apply the inclusion-exclusion principle, which says

\begin{aligned} \Pr [\bigcup _{i=1}^n E_i] = \sum _{i=1}^n \Pr [E_i] - \sum _{i \neq j} \Pr [E_i \cap E_j] + \cdots + (-1)^{J + 1} \sum _{S \subseteq [n], |S| = J} \Pr [\bigcap _{i \in S} E_i] + \cdots . \end{aligned}

we will finish the proof in the next lecture. $\square$

### References

[1]   László Babai and Peter Frankl. Linear algebra methods in combinatorics. 1992.

[2]   Guy Even, Oded Goldreich, Michael Luby, Noam Nisan, and Boban Velickovic. Approximations of general independent distributions. In ACM Symp. on the Theory of Computing (STOC), pages 10–16, 1992.

# Double dragon and labor supply

Another interesting report came out at the NBER. Basically, it argues that better and cheaper video games are a major factor in the decline of hours of work among young men.  You can read the report here and a popular exposition here.

I can definitely relate to this report: I spent countless hours playing videogames, starting very young until quite recently (now I work hard 100% of my time, as you can see from this post). When I was younger I maintained that playing video games is actually good for the brain.  My basic idea was that to play a game you have to learn a new set of rules, and that’s good for you.

What better opportunity to remember one of my favorite games.  A game that I wasted an enormous amount of time not even playing, but waiting for my turn to play at a bar next to my (landlord’s) house in Rome, with the pocket full of jingling, picturesque 500 lire (roughly half a euro). A game that I still think is the best side-scrolling beat’em up ever made. Back then, who could have imagined that one day you will be able to play this first with the MAME and then in a browser?

# taking a course online, instead of in-person, reduces student success and progress in college

The title is taken from the abstract of a recent AER paper:

Online college courses are a rapidly expanding feature of higher education, yet little research identifies their effects relative to traditional in-person classes. Using an instrumental variables approach, we find that taking a course online, instead of in-person, reduces student success and progress in college. Grades are lower both for the course taken online and in future courses. Students are less likely to remain enrolled at the university. These estimates are local average treatment effects for students with access to both online and in-person options; for other students, online classes may be the only option for accessing college-level courses.

This is truly shocking! Who could have possibly guessed? (I am not saying that the paper is worthless.)