Special Topics in Complexity Theory, Lectures 6-7

Special Topics in Complexity Theory, Fall 2017. Instructor: Emanuele Viola

1 Lecture 6-7, Scribe: Willy Quach

In these lectures, we introduce k-wise indistinguishability and link this notion to the approximate degree of a function. Then, we study the approximate degree of some functions, namely the AND function and the AND-OR function. For the latter function we begin to see a proof that is different (either in substance or in language) from the proofs in the literature. We begin with some LaTeX tips.

1.1 Some LaTeX tips.

  • Mind the punctuation. Treat equations as part of the sentence; add commas, colons, etc. accordingly.
  • In math mode, it is usually better to use \ell (\ell ) rather than regular l. The latter can be confused with 1.
  • Align equations with \begin{align} \cdots \end{align} with the alignment character &.
  • For set inclusion, use \subset (\subset ) only if you mean proper inclusion (which is uncommon). Otherwise use \subseteq (\subseteq ). (Not everybody agrees with this, but this seems the most natural convention by analogy with < and \le .)

1.2 Introducing k-wise indistinguishability.

We studied previously the following questions:

  • What is the minimum k such that any k-wise independent distribution P over \{0,1\}^n fools \mathrm {AC}^0 (i.e. \mathbb {E}C(P) \approx \mathbb {E}C(U) for all poly(n)-size circuits C with constant depth)?

    We saw that k = \log ^{\mathcal {O}(d)}(s/\epsilon ) is enough.

  • What is the minimum k such that P fools the AND function?

    Taking k=\mathcal {O}(1) for \epsilon =\mathcal {O}(1) suffices (more precisely we saw that k-wise independence fools the AND function with \epsilon = 2^{-\Omega (k)}).

Consider now two distributions P and Q over \{0,1\}^n that are k-wise indistinguishable, that is, the projections of P and Q on any k bits have the same distribution. We can ask similar questions:

  • What is the minimum k such that \mathrm {AC}^0 cannot distinguish P and Q (i.e. \mathbb {E}C(P) \approx \mathbb {E}C(Q) for all poly(n)-size circuits C with constant depth)?

    It turns out this requires k \geq n^{1-o(1)}: for smaller k there are k-wise indistinguishable distributions that \mathrm {AC}^0 can distinguish almost always. (Whether k=\Omega (n) is necessary or not is an open question.)

    Also, k = n\left (1- \frac {1}{polylog(n)}\right ) suffices to fool \mathrm {AC}^0 (in which case \epsilon is essentially exponentially small).

  • What is the minimum k such that the AND function (on n bits) cannot distinguish P and Q?

    It turns out that k=\Theta (\sqrt {n}) is necessary and sufficient. More precisely:

    • There exist P,Q over \{0,1\}^n that are c\sqrt {n}-wise indistinguishable for some constant c, but such that:
      \begin{aligned} \left | \Pr _P [AND(P)=1] - \Pr _Q [AND(Q)=1] \right | \geq 0.99 \,;\end{aligned}
    • For all P, Q that are c'\sqrt {n}-wise indistinguishable for some bigger constant c', we have:
      \begin{aligned} \left | \Pr _P [AND(P)=1] - \Pr _Q [AND(Q)=1] \right | \leq 0.01 \,.\end{aligned}

1.3 Duality.

These questions are actually equivalent to questions about approximation by real-valued polynomials:

Theorem 1. Let f:\{0,1\}^n \rightarrow \{0,1\} be a function, and k an integer. Then:

\begin{aligned} \max _{P,Q \, k\text {-wise indist.}} \left | \mathbb {E}f(P)-\mathbb {E}f(Q) \right | = \min \{ \, \epsilon \, | \, \exists g\in \mathbb {R}_k[X]: \forall x, \left |f(x)-g(x) \right | \leq \epsilon \}. \end{aligned}

Here \mathbb {R}_k[X] denotes degree-k real polynomials. We will denote the right-hand side \epsilon _k(f).

Some examples:

  • f=1: then \mathbb {E}f(P)=1 for all distributions P, so that both sides of the equality are 0.
  • f(x) = \sum _i x_i \bmod 2 the parity function on n bits.

    Then for k = n-1, the left-hand side is at least 1/2: take P to be uniform, and Q to be uniform on the first n-1 bits, with the nth bit Q_n = \sum _{i<n} Q_i \bmod 2 defined to be the parity of the first n-1 bits. Then \mathbb {E}f(P)=1/2 but \mathbb {E}f(Q)=0.

    Furthermore, we have:

    Claim 2. \epsilon _{n-1}(\mathrm {Parity}) \geq 1/2.

    Proof. Suppose by contradiction that some polynomial g of degree k \leq n-1 approximates Parity within \epsilon < 1/2.

    The key ingredient is to symmetrize a polynomial p, by letting

    \begin{aligned} p^{sym}(x) := \frac {1}{n!} \sum _{\pi \in \mathfrak {S}_n} p(\pi x),\end{aligned}

    where \pi ranges over permutations. Note that p^{sym}(x) only depends on \|x\| = \sum _i x_i.

    Now we claim that there is a univariate polynomial p', also of degree at most k, such that

    \begin{aligned} p'(\sum x_i) = p^{sym}(x_1, x_2, \ldots , x_n)\end{aligned}

    for every x.

    To illustrate, let M be a monomial of p. For instance if M = X_1, then p'(i) = i/n, where i is the Hamming weight of the input. (For this we think of the input as being \in \{0,1\}. Similar calculations can be done for \in \{-1,1\}.)

    If M = X_1 X_2, then p'(i) = \frac {i}{n}\cdot \frac {i-1}{n-1}, which is quadratic in i.

    And so on.

    More generally p^{sym}(X_1,\dots ,X_n) is a symmetric polynomial. As \{(\sum _j X_j)^\ell \}_{\ell \leq k} form a basis of symmetric polynomials of degree k, p^{sym} can be written as a linear combination in this basis. Now note that \{(\sum _j X_j)^{\ell } (x)\}_{\ell \leq k} only depends on \|x\|; substituting i = \sum _j X_j gives that p' is of degree \leq k in i.

    (Note that the degree of p' can be strictly less than the degree of p (e.g. for p(X_1,X_2) = X_1-X_2: we have p^{sym} = p' = 0).)

    Then, applying symmetrization on g, if g is a real polynomial \epsilon -close to Parity (in \ell _\infty norm), then g' is also \epsilon -close to Parity’ (as a convex combination of close values).

    Finally, remark that for every integer j \in \{0,\dots ,\lfloor n/2 \rfloor \}, we have Parity'(2j)=0 and Parity'(2j+1)=1. In particular, as \epsilon < 1/2, the polynomial g'-1/2 is negative at the even integers and positive at the odd integers of \{0,\dots ,n\}, so by the intermediate value theorem it has at least n zeroes. But g' has degree at most n-1 and g'-1/2 is not identically zero, which is a contradiction. \square
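To make the symmetrization step concrete, here is a small Python sketch (our own illustration, not from the lecture; the function names are ours) that symmetrizes a monomial by brute force over all permutations and checks that the result depends only on the Hamming weight, matching the univariate formulas i/n and i(i-1)/(n(n-1)) from the examples above.

```python
import itertools
from fractions import Fraction

def symmetrize(p, n):
    """Return p_sym(x) = (1/n!) * sum over permutations pi of p(pi x),
    for p given as a Python function on 0/1 tuples of length n."""
    perms = list(itertools.permutations(range(n)))
    def p_sym(x):
        total = sum(p(tuple(x[pi[i]] for i in range(n))) for pi in perms)
        return Fraction(total, len(perms))
    return p_sym

n = 4
mono1 = lambda x: x[0]            # the monomial X_1
mono2 = lambda x: x[0] * x[1]     # the monomial X_1 X_2

sym1, sym2 = symmetrize(mono1, n), symmetrize(mono2, n)

for x in itertools.product([0, 1], repeat=n):
    i = sum(x)  # Hamming weight of x
    assert sym1(x) == Fraction(i, n)
    assert sym2(x) == Fraction(i * (i - 1), n * (n - 1))
print("symmetrized monomials match the univariate formulas")
```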

We will now focus on proving the theorem.

Note that one direction is easy: if a function f is closely approximated by a polynomial g of degree k, it cannot distinguish two k-wise indistinguishable distributions P and Q:

\begin{aligned} \mathbb {E}[f(P)]&= \mathbb {E}[g(P)] \pm \epsilon \\ &\stackrel {(*)}{=} \mathbb {E}[g(Q)] \pm \epsilon \\ &= \mathbb {E}[f(Q)] \pm 2\epsilon \, , \end{aligned}

where (*) comes from the fact that P and Q are k-wise indistinguishable.

The general proof goes by a Linear Programming Duality (aka finite-dimensional Hahn-Banach theorem, min-max, etc.). This states that:

If A \in \mathbb {R}^{n\times m}, x\in \mathbb {R}^m, b\in \mathbb {R}^n and c\in \mathbb {R}^m, then:

\begin{aligned} \min \{ \langle c,x \rangle \,:\, Ax = b,\ x \geq 0 \} \;=\; \max \{ \langle b,y \rangle \,:\, A^T y \leq c \}, \qquad \text {where } \langle c,x \rangle = \sum _{i \leq m} c_i x_i. \end{aligned}

We can now prove the theorem:

Proof.

The proof will consist in rewriting the sides of the equality in the theorem as outputs of a Linear Program. Let us focus on the left side of the equality: \max _{P,Q \, k\text {-wise indist.}} \left | \mathbb {E}f(P)-\mathbb {E}f(Q) \right |.

We will introduce 2^{n+1} variables, namely P_x and Q_x for every x\in \{0,1\}^n, which will represent \Pr [D=x] for D=P,Q.

We will also use the following, which can be proved similarly to the Vazirani XOR Lemma:

Claim 3. Two distributions P and Q are k-wise indistinguishable if and only if: \forall S\subseteq \{1,\dots ,n\} with |S|\leq k, \sum _x P_x \chi _S(x) - \sum _x Q_x \chi _S(x)=0, where \chi _S(X) = \prod _{i \in S} X_i are the Fourier basis functions over the boolean cube.

The quantity \max _{P,Q \, k\text {-wise indist.}} \left | \mathbb {E}f(P)-\mathbb {E}f(Q) \right | can then be rewritten:

\begin {array}{rrl} &-\min \sum _x P_xf(x) - \sum _x Q_xf(x)\\ &&\\ \text { subject to:} &\sum _x P_x &= 1\\ &\sum _x Q_x &= 1\\ &\forall S \subseteq \{1,\dots ,n\} \text { s.t. } |S|\leq k,\sum _x (P_x - Q_x) \chi _S(x) &= 0 \\ \end {array}

Following the syntax of LP Duality stated above, we have:

c^T = \overbrace {\cdots f(x) \cdots }^{2^n}\overbrace {\cdots -f(x) \cdots }^{2^n} \in \mathbb {R}^{2^{n+1}} (where x goes over \{0,1\}^n),

x^T = \overbrace {\cdots P_x \cdots }^{2^n}\overbrace {\cdots Q_x \cdots }^{2^n} \in \mathbb {R}^{2^{n+1}},

b^T = 1 1 \overbrace {0\cdots 0}^{\# S},

A = \left ( \begin {array}{cc} \overbrace {1\cdots \cdots 1}^{2^n} & \overbrace {0\cdots \cdots 0}^{2^n} \\ 0 \cdots \cdots 0 & 1 \cdots \cdots 1 \\ \cdots \cdots & \cdots \cdots \\ \vdots \cdots \cdots \vdots & \vdots \cdots \cdots \vdots \\ \cdots \chi _S(x) \cdots & \cdots -\chi _S(x) \cdots \\ \vdots \cdots \cdots \vdots & \vdots \cdots \cdots \vdots \\ \cdots \cdots & \cdots \cdots \\ \end {array} \right ) ,

where the rows of A except the first two correspond to some S \subseteq \{1,\dots ,n\} such that |S|\leq k.

We apply LP duality. We shall denote the new set of variables by

y^T = d \, d'\, \overbrace {\cdots d_S\cdots }^{\#S}.

We have the following program:

\begin {array}{rrl} &-\max d+d'\\ &&\\ \text { subject to:} &\forall x, d + \sum _S d_S \chi _S(x) &\leq f(x)\\ &\forall x, d' - \sum _S d_S \chi _S(x) &\leq -f(x)\\ \end {array}

Writing d' = -d-\epsilon , the objective becomes to minimize \epsilon , while the second set of constraints can be rewritten:

\begin{aligned} \forall x, d+\epsilon + \sum _S d_S\chi _S(x) \geq f(x) \, . \end{aligned}

The expression d + \sum _S d_S \chi _S(X) is an arbitrary degree-k polynomial which we denote by g(X). So our constraints become

\begin{aligned} g(x) &\leq f(x)\\ g(x) + \epsilon &\geq f(x). \end{aligned}

where g ranges over all degree-k polynomials, and we are trying to minimize \epsilon . Because g is always below f, while g + \epsilon is always above f, the polynomial g is within \epsilon of f at every point. \square
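As a sanity check on the right-hand side of Theorem 1, one can compute \epsilon _k(f) directly as a linear program: minimize \epsilon over the coefficients of a degree-k multilinear polynomial g subject to |f(x)-g(x)| \leq \epsilon for all x. The Python sketch below (our own illustration, using scipy.optimize.linprog; not part of the notes) does this for the AND function on a few bits, and one can watch the error drop as k grows.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def approx_error(f, n, k):
    """epsilon_k(f): best l_infinity error of a degree-k (multilinear)
    polynomial approximating f on {0,1}^n, computed by a linear program."""
    monomials = [M for r in range(k + 1) for M in itertools.combinations(range(n), r)]
    points = list(itertools.product([0, 1], repeat=n))
    m = len(monomials)
    # variables: m monomial coefficients (free), then epsilon (>= 0)
    A_ub, b_ub = [], []
    for x in points:
        row = [np.prod([x[i] for i in M]) if M else 1.0 for M in monomials]
        fx = f(x)
        A_ub.append(row + [-1.0]);              b_ub.append(fx)    #  g(x) - eps <= f(x)
        A_ub.append([-v for v in row] + [-1.0]); b_ub.append(-fx)  # -g(x) - eps <= -f(x)
    c = [0.0] * m + [1.0]
    bounds = [(None, None)] * m + [(0, None)]
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub), bounds=bounds)
    return res.fun

AND = lambda x: float(all(x))
n = 6
for k in range(n + 1):
    print("degree", k, "-> error", round(approx_error(AND, n, k), 4))
```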

1.4 Approximate Degree of AND.

Let us now study the AND function on n bits. Let us denote by d_{\epsilon }(f) the minimal degree of a polynomial approximating f with error \epsilon .

We will show that d_{1/3}(AND) = \Theta (\sqrt {n}).

Let us first show the upper bound:

Claim 4. We have:

\begin{aligned}d_{1/3}(\text {AND}) = \mathcal {O}(\sqrt {n}).\end{aligned}

To prove this claim, we will consider a special family of polynomials:

Definition 5. (Chebychev polynomials of the first kind.)

The Chebychev polynomials (of the first kind) are a family \{T_k\}_{k\in \mathbb {N}} of polynomials defined inductively as:

  • T_0(X) := 1,
  • T_1(X) := X,
  • \forall k \geq 1, T_{k+1}(X) := 2X T_k - T_{k-1}.

Those polynomials satisfy some useful properties:

  1. \forall x \in [-1,1], T_k(x) = \cos (k \arccos (x))\, ,
  2. \forall x \in [-1,1], \forall k, |T_k(x)| \leq 1 \, ,
  3. \forall x such that |x| \geq 1, |T'_k(x)| \geq k^2 \, ,
  4. \forall k, T_k(1)=1 \, .

Property 2 follows from 1, and property 4 follows from a direct induction. For a nice picture of these polynomials you should have come to class (or I guess you can check wikipedia). We can now prove our upper bound:

Proof.

We construct a univariate polynomial p:\{0,1,\dots ,n\} \rightarrow \mathbb {R} such that:

  • \deg p = \mathcal {O}(\sqrt {n});
  • \forall i<n, |p(i)| \leq 1/3;
  • |p(n)-1| \leq 1/3.

In other words, p will be close to 0 on [0,n-1], and close to 1 on n. Then, we can naturally define the polynomial for the AND function on n bits to be q(X_1,\dots ,X_n) := p(\sum _i X_i), which also has degree \mathcal {O}(\sqrt {n}). Indeed, we want q to be close to 0 if X has Hamming weight less than n, while being close to 1 on X of Hamming weight n (by definition of AND). This will conclude the proof.

Let us define p as follows:

\begin{aligned} \forall i\leq n, \quad p(i):= \frac {T_k\left ( \frac {i}{n-1}\right )}{T_k\left ( \frac {n}{n-1}\right )}. \end{aligned}

Intuitively, this uses the fact that Chebychev polynomials are bounded in [-1,1] (Property 2.) and then increase very fast (Property 3.).

More precisely, we have:

  • p(n)=1 by construction;
  • for i<n, we have:

    \left | T_k\left ( \frac {i}{n-1}\right ) \right | \leq 1 by Property 2.;

    T_k\left ( \frac {n}{n-1}\right ) = T_k\left (1 + \frac {1}{n-1}\right ) \geq 1 + \frac {k^2}{n-1} by Property 3. and 4., and therefore for some k = \mathcal {O}(\sqrt {n}), we have: T_k\left ( \frac {n}{n-1}\right ) \geq 3.

\square
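The proof is constructive, so it is easy to check numerically. The sketch below (our own illustration, assuming the parameter choice k = \lceil \sqrt {2(n-1)} \rceil ) implements T_k by the recurrence of Definition 5 and verifies that p(i) = T_k(i/(n-1))/T_k(n/(n-1)) is at most 1/3 in absolute value for i < n and equals 1 at i = n.

```python
import math

def cheb(k, x):
    """T_k(x) via the recurrence T_0 = 1, T_1 = X, T_{k+1} = 2X T_k - T_{k-1}."""
    t_prev, t = 1.0, x
    if k == 0:
        return t_prev
    for _ in range(k - 1):
        t_prev, t = t, 2 * x * t - t_prev
    return t

n = 400
k = math.ceil(math.sqrt(2 * (n - 1)))    # degree O(sqrt(n)) suffices
denom = cheb(k, n / (n - 1))
p = lambda i: cheb(k, i / (n - 1)) / denom

assert abs(p(n) - 1) < 1e-9
assert all(abs(p(i)) <= 1 / 3 for i in range(n))
print(f"n={n}, degree k={k}, max |p(i)| for i<n:", max(abs(p(i)) for i in range(n)))
```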

Let us now prove the corresponding lower bound:

Claim 6. We have:

\begin{aligned}d_{1/3}(\text {AND}) = \Omega (\sqrt {n}).\end{aligned}

Proof. Let p be a polynomial that approximates the AND function with error 1/3. Consider the univariate symmetrization p' of p.

We have the following result from approximation theory:

Theorem 7. Let q be a real univariate polynomial such that:

  1. \forall i \in \{0,\dots ,n\}, |q(i)| \leq \mathcal {O}(1);
  2. q'(x) \geq \Omega (1) for some x \in [0,n].

    Then \deg q = \Omega (\sqrt {n}).

To prove our claim, it is therefore sufficient to check that p' satisfies conditions 1. and 2., as we saw that \deg p \geq \deg p':

  1. We have: \forall i \in \{0,\dots ,n\}, |p'(i)| \leq 1 + 1/3 by assumption on p;
  2. We have p'(n-1) \leq 1/3 and p'(n) \geq 2/3 (by assumption on p), so the mean value theorem gives some x \in [n-1,n] where the derivative of p' is at least 1/3 = \Omega (1).

This concludes the proof. \square

1.5 Approximate Degree of AND-OR.

Consider the AND function on R bits and the OR function on N bits. Let AND-OR:\{0,1\}^{R\times N} \rightarrow \{0,1\} be their composition (which outputs the AND of the R outputs of the OR function applied to N-bit disjoint blocks).

It is known that d_{1/3}(AND-OR) = \Theta (\sqrt {RN}). To prove the upper bound, we will need a technique to compose approximating polynomials which we will discuss later.

Now we focus on the lower bound. This lower bound was recently proved independently by Sherstov and by Bun and Thaler. We present a proof that is different (either in substance or in language) and which we find more intuitive. Our proof replaces the “dual block method” with the following lemma.

Lemma 8. Suppose that

distributions A^0, A^1 over \{0,1\}^{n_A} are k_A-wise indistinguishable distributions; and

distributions B^0, B^1 over \{0,1\}^{n_B} are k_B-wise indistinguishable distributions.

Define C^0, C^1 over \{0,1\}^{n_A \cdot n_B} as follows: C^b: draw a sample x \in \{0,1\}^{n_A} from A^b, and replace each bit x_i by a sample of B^{x_i} (independently).

Then C^0 and C^1 are k_A \cdot k_B-wise indistinguishable.

Proof. Consider any set S \subseteq \{1,\dots , n_A\cdot n_B \} of at most k_A \cdot k_B bit positions; let us show that these bits have the same distribution under C^0 and C^1.

View the n_A \cdot n_B bit positions as n_A blocks of n_B bits each. Call a block K of n_B bits heavy if |S\cap K| \geq k_B; call the other blocks light.

Since |S| \leq k_A \cdot k_B and each heavy block accounts for at least k_B positions of S, there are at most k_A heavy blocks. Hence the bits of the A^b sample feeding the heavy blocks have the same joint distribution under A^0 and A^1, by k_A-wise indistinguishability, and so the (entire) heavy blocks have the same distribution in C^0 and C^1.

Furthermore, conditioned on any outcome for the A^b sample in C^b, each light block contains fewer than k_B positions of S, so those positions have the same distribution whether the block is a sample of B^0 or of B^1, by k_B-wise indistinguishability of B^0 and B^1. In particular the distribution of the bits of S inside light blocks does not depend on the conditioning.

Therefore C^0 and C^1 are k_A \cdot k_B-wise indistinguishable. \square
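The lemma can be verified exactly on a tiny example (our own sketch). Below A^0, A^1 are uniform over even- and odd-parity strings on 3 bits (which are 2-wise indistinguishable), and likewise for B^0, B^1; the composed distributions C^0, C^1 on 9 bits then agree on every projection of 4 = k_A \cdot k_B bits, even though the parity of all 9 bits tells them apart completely.

```python
import itertools
from collections import Counter
from fractions import Fraction

def parity_dist(n, par):
    """Uniform distribution over n-bit strings of the given parity.
    The two distributions (par=0 vs par=1) are (n-1)-wise indistinguishable."""
    strs = [x for x in itertools.product([0, 1], repeat=n) if sum(x) % 2 == par]
    return {x: Fraction(1, len(strs)) for x in strs}

def compose(A, B0, B1):
    """C^b of Lemma 8: sample x from A, replace bit x_i by a sample of B^{x_i}."""
    Bs = (B0, B1)
    C = Counter()
    for x, px in A.items():
        for blocks in itertools.product(*(Bs[xi].items() for xi in x)):
            string = tuple(b for blk, _ in blocks for b in blk)
            prob = px
            for _, pb in blocks:
                prob *= pb
            C[string] += prob
    return C

def marginal(D, S):
    M = Counter()
    for x, px in D.items():
        M[tuple(x[i] for i in S)] += px
    return M

nA = nB = 3                      # A^0, A^1 and B^0, B^1 are 2-wise indistinguishable
A0, A1 = parity_dist(nA, 0), parity_dist(nA, 1)
B0, B1 = parity_dist(nB, 0), parity_dist(nB, 1)
C0, C1 = compose(A0, B0, B1), compose(A1, B0, B1)

k = 2 * 2                        # the lemma promises k_A * k_B = 4
ok = all(marginal(C0, S) == marginal(C1, S)
         for S in itertools.combinations(range(nA * nB), k))
print("all size-4 marginals agree:", ok)
# yet the distributions are far apart globally: the parity of all 9 bits equals b
print("Pr[parity of C^0 is odd] =", sum(p for x, p in C0.items() if sum(x) % 2 == 1))
print("Pr[parity of C^1 is odd] =", sum(p for x, p in C1.items() if sum(x) % 2 == 1))
```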


Special Topics in Complexity Theory, Lectures 4-5

Special Topics in Complexity Theory, Fall 2017. Instructor: Emanuele Viola

1 Lectures 4-5, Scribe: Matthew Dippel

These lectures cover some basics of small-bias distributions, and then a more recent pseudorandom generator for read-once CNF [GMR^{+}12].

2 Small bias distributions

Definition 1.[Small bias distributions] A distribution D over \{0, 1\}^n has bias \epsilon if no parity function can distinguish it from uniformly random strings with advantage greater than \epsilon . More formally, we have:

\begin{aligned}\forall S \subseteq [n], S \neq \emptyset , \left \vert \mathbb {P}_{x \in D}\left [\bigoplus _{i \in S} x_i = 1 \right ] - 1/2\right \vert \leq \epsilon . \end{aligned}

In this definition, the 1/2 is simply the probability that a parity test equals 1 (or 0) under the uniform distribution. We also note that it does not matter whether the definition uses the probability of the parity test being 0 or being 1: if a test has probability 1/2 + \epsilon of being equal to 1, then it has probability 1 - (1/2 + \epsilon ) = 1/2 - \epsilon of being 0, so the bias is independent of this choice.

This can be viewed as a distribution which fools tests T that are restricted to computing parity functions on a subset of bits.

Before we answer the important question of how to construct and efficiently sample from such a distribution, we will provide one interesting application of small bias sets to expander graphs.

Theorem 2.[Expander construction from a small bias set] Let D be a distribution over \{0, 1\}^n with bias \epsilon . Define G = (V, E) as the following graph:

\begin{aligned}V = \{0, 1\}^n, E = \{(x,y) \vert x \oplus y \in \text {support}(D)\}.\end{aligned}

Then, when we take the eigenvalues of the random walk matrix of G in descending order \lambda _1, \lambda _2, ... \lambda _{2^n}, we have that:

\begin{aligned}\max \{|\lambda _2|, |\lambda _{2^n}|\} \leq \epsilon .\end{aligned}

Thus, small-bias sets yield expander graphs. Small-bias sets also turn out to be equivalent to constructing good linear codes. Although all these objects had been studied well before the definition of small-bias sets [NN90], the computational perspective has been quite useful, even in answering old questions. For example Ta-Shma used this perspective to construct better codes [Ta-17].
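A quick numerical illustration of Theorem 2 (ours, not from the notes): for D uniform over a set S \subseteq \{0,1\}^n, the nontrivial eigenvalues of the random walk matrix of the graph above are exactly the character sums \mathbb {E}_{x \sim D}[(-1)^{\langle T, x\rangle }], so a small-bias set gives a small second eigenvalue. The sketch below uses a random set S for simplicity (so it is not a provably small-bias set) and just checks that the two quantities coincide.

```python
import itertools
import numpy as np

n = 8
rng = np.random.default_rng(0)
points = list(itertools.product([0, 1], repeat=n))
# an arbitrary set S of 56 strings; for an explicit small-bias construction
# see the generator in the next section
S = [points[i] for i in rng.choice(2 ** n, size=56, replace=False)]

# random walk matrix of the graph with x ~ y iff x XOR y is in S
N = 2 ** n
idx = {x: i for i, x in enumerate(points)}
M = np.zeros((N, N))
for x in points:
    for s in S:
        y = tuple(a ^ b for a, b in zip(x, s))
        M[idx[x], idx[y]] += 1.0 / len(S)

eigs = np.linalg.eigvalsh(M)                 # ascending; M is symmetric
second = max(abs(eigs[0]), abs(eigs[-2]))    # max(|lambda_{2^n}|, |lambda_2|)

# character sums E_{x in S}[(-1)^{<T,x>}] for nonempty T; their absolute values
# are exactly the nontrivial |eigenvalues|, which is the content of the theorem
char_sums = [abs(np.mean([(-1) ** sum(t * xi for t, xi in zip(T, x)) for x in S]))
             for T in points if any(T)]
print("max nontrivial |eigenvalue| :", round(second, 4))
print("max nontrivial |char sum|   :", round(max(char_sums), 4))
```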

3 Constructions of small bias distributions

Just like our construction of k-wise independent distributions from the previous lectures, we will construct small-bias distributions using polynomials over finite fields.

Theorem 1.[Small bias construction] Let \mathcal {F} be a finite field of size 2^\ell , with elements represented as bit strings of length \ell . We define the generator G : \mathcal {F}^2 \rightarrow \{0, 1\}^n as the following:

\begin{aligned}G(a, b)_i = \left \langle a^i, b \right \rangle = \sum _{j \leq \ell } (a^i)_j b_j \mod 2.\end{aligned}

In this notation, a subscript of j indicates taking the jth bit of the representation. Then the output of G(a, b) over uniform a and b has bias n / 2^{\ell }.

Proof. Consider some parity test induced by a non-empty subset S \subseteq [n]. Then when applied to the output of G, it simplifies as:

\begin{aligned}\sum _{i \in S}G(a, b)_i = \sum _{i \in S}\left \langle a^i, b \right \rangle = \left \langle \sum _{i \in S} a^i, b \right \rangle .\end{aligned}

Note that \sum _{i \in S} a^i is the evaluation of the polynomial P_S (x) := \sum _{i \in S} x^i at the point a. We note that if P_S(a) \neq 0, then the value of \left \langle P_S(a), b \right \rangle is equally likely to be 0 or 1 over the probability of a uniformly random b. This follows from the fact that the inner product of any non-zero bit string with a uniformly random bit string is equally likely to be 0 or 1. Hence in this case, our generator has no bias.

In the case where P_S(a) = 0, then the inner product will always be 0, independent of the value of b. In these situations, the bias is 1/2, but this is conditioned on the event that P_S(a) = 0.

We claim that this event has probability \leq n / 2^\ell . Indeed, for non empty S, P_S(a) is a polynomial of degree \leq n. Hence it has at most n roots. But we are selecting a from a field of size 2^\ell . Hence the probability of picking one root is \le n / 2^\ell .

Hence overall the bias is at most n/2^\ell . \square
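Here is a minimal Python sketch of the generator (our own illustration; the field size \ell = 6, the modulus x^6 + x + 1, and the output length n = 16 are our choices). It enumerates all seeds and estimates the bias on a few hundred random parity tests, to be compared with the bound n/2^\ell .

```python
import random

ELL = 6
MODULUS = 0b1000011          # x^6 + x + 1, irreducible over GF(2)

def gf_mul(a, b):
    """Multiplication in GF(2^ELL); elements are ELL-bit integers."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> ELL:
            a ^= MODULUS
    return r

def G(a, b, n):
    """The generator: output bit i is the inner product <a^i, b> over GF(2)."""
    out, power = [], 1
    for _ in range(n):
        power = gf_mul(power, a)
        out.append(bin(power & b).count("1") & 1)
    return tuple(out)

n = 16
outputs = [G(a, b, n) for a in range(2 ** ELL) for b in range(2 ** ELL)]

random.seed(0)
worst = 0.0
for _ in range(300):                        # a few hundred random parity tests
    S = [i for i in range(n) if random.random() < 0.5] or [0]
    p1 = sum(sum(x[i] for i in S) % 2 for x in outputs) / len(outputs)
    worst = max(worst, abs(p1 - 0.5))
print("observed bias:", worst, "  bound n/2^ell =", n / 2 ** ELL)
```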

To make use of the generator, we need to pick a specific \ell . Note that the seed length will be |a| + |b| = 2\ell . If we want to achieve bias \epsilon , then we must have \ell = \log \left (\frac {n}{\epsilon } \right ). All the logarithms in this lecture are in base 2. This gives us a seed length of 2\log \left (\frac {n}{\epsilon } \right ).

Small-bias distributions are so important that a lot of attention has been devoted to optimizing the constant “2” above. A lower bound of \log n + (2 - o(1))\log (1 / \epsilon ) on the seed length was known. Ta-Shma recently [Ta-17] gave a nearly matching construction with seed length \log n + (2 + o(1))\log (1 / \epsilon ).

We next give a sense of how to obtain different tradeoffs between n and \epsilon in the seed length. We specifically focus on getting a nearly optimal dependence on n, because the construction is a simple, interesting “derandomization” of the above one.

3.1 An improved small bias distribution via bootstrapping

We will show another construction of small bias distributions that achieves seed length (1 + o(1))\log n + O(\log (1/\epsilon )). It will make use of the previous construction and proof.

The intuition is the following: the only time we used that b was uniform was in asserting that if P_S(a) \neq 0, then \left \langle P_S(a), b \right \rangle is uniform. But we don’t need b to be uniform for that. What do we need from b? We need that it has small-bias!

Our new generator is G(a, G'(a', b')) where G and G' are as before but with different parameters. For G, we pick a of length \ell = \log (n/\epsilon ), whereas G' just needs to be an \epsilon -biased generator on \ell bits, which can be done as we just saw with O(\log (\ell /\epsilon )) bits. This gives a seed length of \log n + O(\log \log n) + O(\log (1/\epsilon )), as promised.

We can of course repeat the argument but the returns diminish.

4 Connecting small bias to k-wise independence

We will show that using our small bias generators, we can create distributions which are almost k-wise independent. That is, they are very close to a k-wise independent distribution in statistical distance, while having a substantially shorter seed length than what is required for k-wise independence. In particular, we will show two results:

  • Small bias distributions are themselves close to k-wise independent.
  • We can improve the parameters of the above by feeding a small bias distribution to the generator for k-wise independence from the previous lectures. This will improve the seed length of simply using a small bias distribution.

Before we can show these, we’ll have to take a quick aside into some fundamental theorems of Fourier analysis of boolean functions.

4.1 Fourier analysis of boolean functions 101

Let f : \{-1, 1\}^n \rightarrow \{-1, 1\}. Here the switch between \{0, 1\} and \{-1, 1\} is common, but you can think of them as being isomorphic. One way to think of f is as being a vector in \{-1 , 1\}^{2^n}. The xth entry of f indicates the value of f(x). If we let \bf {1}_y be the indicator function returning 1 iff x = y, but once again written as a vector like f is, then any function f can be written over the basis of the \bf {1}_y vectors, as:

\begin{aligned}f = \sum _y f(y) \bf {1}_y.\end{aligned}

This is the “standard” basis.

Fourier analysis simply is a different basis in which to write functions, which is sometimes more useful. The basis functions are \chi _S : \{-1, 1\}^n \rightarrow \{-1, 1\} defined by \chi _S(x) = \prod _{i \in S} x_i. Then any boolean function f can be expressed as:

\begin{aligned}f(x) = \sum _{S \subseteq [n]}\hat {f}(S)\chi _S(x),\end{aligned}

where the \hat {f}(S), called the “Fourier coefficients,” can be derived as:

\begin{aligned}\hat {f}(S) = \mathbb{E} _{x \sim U_n} \left [f(x)\chi _S(x) \right ],\end{aligned}

where the expectation is over uniformly random x.

Claim 1. For any function f with range \{-1,1\}, its Fourier coefficients satisfy:

\begin{aligned}\sum _{S \subseteq [n]}\hat {f}(S)^2 = 1.\end{aligned}

Proof. We know that \mathbb{E} [f(x)^2] = 1, as squaring the function makes it 1. We can re-express this expectation as:

\begin{aligned}\mathbb{E} [f(x)f(x)] = \mathbb{E} \left [\sum _S \hat {f}(S)\chi _S(x) \cdot \sum _T \hat {f}(T)\chi _T(x)\right ] = \mathbb{E} \left [\sum _{S, T} \hat {f}(S)\chi _S(x) \hat {f}(T)\chi _T(x)\right ].\end{aligned}

We make use of the following fact: if S \neq T, then \mathbb{E} [\chi _S(x)\chi _T(x)] = \mathbb{E} [\chi _{S \oplus T}(x)] = 0. If S = T, then their symmetric difference is the empty set and this function is identically 1.

Overall, this implies that the above expectation can be simply rewritten as:

\begin{aligned}\sum _{S = T}\hat {f}(S)\hat {f}(T) = \sum _S \hat {f}(S)^2.\end{aligned}

Since we already decided that the expectation is 1, the claim follows. \square
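A quick numerical check of Claim 1 (our own sketch): compute the Fourier coefficients of a random function f : \{-1,1\}^n \rightarrow \{-1,1\} directly from the definition and verify that the squares sum to 1.

```python
import itertools
import numpy as np

n = 8
rng = np.random.default_rng(1)
inputs = list(itertools.product([-1, 1], repeat=n))
f = {x: rng.choice([-1, 1]) for x in inputs}      # a random Boolean function

def fourier_coeff(S):
    """hat f(S) = E_x[ f(x) * prod_{i in S} x_i ] over uniform x."""
    return np.mean([f[x] * np.prod([x[i] for i in S]) for x in inputs])

coeffs = [fourier_coeff(S) for r in range(n + 1)
          for S in itertools.combinations(range(n), r)]
print("sum of squared Fourier coefficients:", round(sum(c * c for c in coeffs), 6))
```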

5 Small bias distributions are close to k-wise independent

Before we can prove our claim, we formally introduce what we mean for two distributions to be close. We use the most common definition of statistical difference, which we repeat here:

Definition 1. Let D_1, D_2 be two distributions over the same domain H. Then their statistical distance, denoted \text {SD}(D_1, D_2) and sometimes written \Delta (D_1, D_2), is

\begin{aligned}\Delta (D_1, D_2) = \max _{T \subseteq H} \left | \mathcal {P}[D_1 \in T] - \mathcal {P}[D_2 \in T]\right |.\end{aligned}

Note that the probabilities are with respect to the individual distributions D_1 and D_2. We may also say that D_1 is \epsilon -close to D_2 if \Delta (D_1, D_2) \leq \epsilon .

We can now show our result, which is known as Vazirani’s XOR Lemma:

Theorem 2. If a distribution D over \{0, 1\}^n has bias \epsilon , then D is \epsilon 2^{n / 2} close to the uniform distribution.

Proof. Let T be a test. To fit the above notation, we can think of T as being defined as the set of inputs for which T(x) = 1. Then we want to bound:

\begin{aligned}|\mathbb{E} [T(D)] - \mathbb{E} [T(U)]|.\end{aligned}

Expanding T in Fourier basis we rewrite this as

\begin{aligned}|\mathbb{E} [\sum _S \hat {T_S}\chi _S(D)] - \mathbb{E} [\sum _S \hat {T_S}\chi _S(U)]|= |\sum _S \hat {T_S}\left (\mathbb{E} [\chi _S(D)] - \mathbb{E} [\chi _S(U)]\right )|.\end{aligned}

We know that \mathbb{E} _U[\chi _S(x)] = 0 for all non empty S, and 1 when S is the empty set. We also know that |\mathbb{E} _D[\chi _S(x)]| \leq \epsilon for all non empty S, and \mathbb{E} _D[\chi _S(x)] = 1 when S is the empty set. So the above can be bounded as:

\begin{aligned}\leq \sum _{S \ne \emptyset } |\hat {T_S}| |\mathbb{E} _D[\chi _S(x)] - \mathbb{E} _U[\chi _S(x)]| \leq \sum _S |\hat {T_S}| \epsilon = \epsilon \sum _S |\hat {T_S}|.\end{aligned}

Lemma 3. \sum _S |\hat {T_S}| \leq 2^{n / 2}

Proof. By Cauchy-Schwarz:

\begin{aligned}\sum |\hat {T_S}| \leq 2^{n/2} \sqrt {\sum \hat {T_S} ^2} \leq 2^{n/2}\end{aligned}

where the last inequality follows from Claim 1 (for T with range \{0,1\} we still have \sum _S \hat {T_S}^2 = \mathbb{E} [T^2] \leq 1). \square

Using the above lemma completes the upper bound and the proof of the theorem. \square

Corollary 4. Any k bits of an \epsilon biased distribution are \epsilon 2^{k / 2} close to uniform.

Using the corollary above, we see that we can get \epsilon -close to a k-wise independent distribution (in the sense of the corollary) by taking a small bias distribution with \epsilon ' = \epsilon / 2^{k / 2}. This requires seed length \ell = O(\log (n / \epsilon ')) = O(\log (2^{k/2}n / \epsilon )) = O(\log n + k + \log (1 / \epsilon )). Recall that for exact k-wise independence we required seed length k \log n.

5.1 An improved construction

Theorem 5. Let G : \{0, 1\}^{k\log n} \rightarrow \{0, 1\}^n be the generator previously described that samples a k-wise independent distribution (or any linear G). If we replace the input to G with a small bias distribution of \epsilon ' = \epsilon / 2^k, then the output of G is \epsilon -close to being k-wise independent.

Proof. Consider any parity test S on k bits on the output of G. It can be shown that G is a linear map, that is, G simply takes its seed and it multiplies it by a matrix over the field GF(2) with two elements. Hence, S corresponds to a test S' on the input of G, on possibly many bits. The test S' is not empty because G is k-wise independent. Since we fool S' with error \epsilon ', we also fool S with error \epsilon , and the theorem follows by Vazirani’s XOR lemma. \square

Using the seed lengths we saw we get the following.

Corollary 6. There is a generator for almost k-wise independent distributions with seed length O(\log \log n + \log (1 / \epsilon ) + k).

6 Tribes Functions and the GMRTV Generator

We now move to a more recent result. Consider the Tribes function, which is a read-once CNF on k \cdot w bits, given by the And of k terms, each on w bits. You should think of n = k \cdot w where w \approx \log n and k \approx n/\log n.

We’d like a generator for this class with seed length O(\log (n/\epsilon )). This is still open! (This is just a single function, for which a generator is trivial, but one can make this challenge precise for example by asking to fool the Tribes function for any possible negation of the input variables. These are 2^n tests and a generator with seed length O(\log (n/\epsilon )) is unknown.)

The result we saw earlier about fooling And gives a generator with seed length O(\log n), however the dependence on \epsilon is poor. Achieving a good dependence on \epsilon has proved to be a challenge. We now describe a recent generator [GMR^{+}12] which gives seed length O(\log (n/\epsilon )) \cdot (\log \log n)^{O(1)}. This is incomparable with the previous O(\log n), and in particular the dependence on n is always suboptimal. However, when \epsilon = 1/n the generator [GMR^{+}12] gives seed length O(\log n) \log \log n which is better than previously available constructions.

The high-level technique for doing this is based on iteratively restricting variables, and goes back about 30 years [AW89]. This technique seems to have been abandoned for a while, possibly due to the spectacular successes of Nisan [Nis91, Nis92]. It was revived in [GMR^{+}12] (see also [GLS12]) with an emphasis on a good dependence on \epsilon .

A main tool is this claim, showing that small-bias distributions fool products of functions with small variance. Critically, we work with non-boolean functions (which later will be certain averages of boolean functions).

Claim 1. Let f_1, f_2, ..., f_k : \{0, 1\}^w \rightarrow [0,1] be bounded (not necessarily boolean) functions. Further, let D = (v_1, v_2, ..., v_k) be an \epsilon -biased distribution over wk bits, where each v_i is w bits long. Then for every integer d \geq 1,

\begin{aligned}\left \vert \mathbb{E} _D[\prod _i f_i(v_i)] - \prod _i \mathbb{E} _U[f_i(U)] \right \vert \leq \left (\sum _i \text {var}(f_i) \right )^d + (k2^w)^d\epsilon ,\end{aligned}

where \text {var}(f) := \mathbb{E} [f^2] - \mathbb{E} ^2[f] is the variance of f with respect to the uniform distribution.

This claim has emerged from a series of works, and this statement is from a work in progress with Chin Ho Lee. For intuition, note that constant functions have variance 0, in which case the claim gives good bounds (and indeed any distribution fools constant functions). By contrast, for balanced functions the variance is constant, and the sum of the variances is about k, and the claim gives nothing. Indeed, you can write Inner Product as a product of nearly balanced functions, and it is known that small-bias does not fool it. For this claim to kick in, we need each variance to be at most 1/k.

In the tribes function, the And functions have variance 2^{-w}, and the sum of the variances is about 1 and the claim gives nothing. However, if you perturb the Ands with a little noise, the variance drops polynomially, and the claim is useful.

Claim 2. Let f be the AND function on w bits. Rewrite it as f(x, y), where |x| = |y| = w / 2. That is, we partition the input into two sets. Define g(x) as:

\begin{aligned}g(x) = \mathbb{E} _y[f(x, y)],\end{aligned}

where y is uniform. Then \text {var}(g) = \Theta (2^{-3w/2}).

Proof.

\begin{aligned}\text {var}(g) = \mathbb{E} [g(x)^2] - \left (\mathbb{E} [g(x)]\right )^2 = \mathbb{E} _x[\mathbb{E} _y[f(x,y)]^2] - \left (\mathbb{E} _x[\mathbb{E} _y[f(x,y)]] \right )^2.\end{aligned}

We know that \left (\mathbb{E} _x[\mathbb{E} _y[f(x,y)]] \right ) is simply the expected value of f, and since f is the AND function, this is 2^{-w}, so the right term is 2^{-2w}.

We reexpress the left term as \mathbb{E} _{x,y, y'}[f(x,y)f(x, y')]. But we note that this product is 1 iff x = y = y' = \bf {1}. The probability of this happening is (2^{-w/2})^3 = 2^{-3w/2}.

Thus the final difference is 2^{-3w/2}(1 - 2^{-w/2}) = \Theta (2^{-3w/2}). \square
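Claim 2 is easy to verify exactly for small w (our own sketch): enumerate x, compute g(x) = \mathbb{E} _y[f(x,y)] for the AND function, and compare the variance with 2^{-3w/2}.

```python
import itertools
from fractions import Fraction

def var_of_averaged_and(w):
    """var(g) for g(x) = E_y[AND(x, y)], |x| = |y| = w/2, under uniform x."""
    half = w // 2
    xs = list(itertools.product([0, 1], repeat=half))
    # g(x) = Pr_y[AND(x, y) = 1] = 2^{-w/2} if x is all ones, and 0 otherwise
    g = {x: Fraction(1, 2 ** half) if all(x) else Fraction(0) for x in xs}
    Eg = sum(g[x] for x in xs) / len(xs)
    Eg2 = sum(g[x] ** 2 for x in xs) / len(xs)
    return Eg2 - Eg ** 2

for w in (4, 6, 8, 10):
    print(w, var_of_averaged_and(w), "   2^(-3w/2) =", Fraction(1, 2 ** (3 * w // 2)))
```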

We’ll actually apply this claim to the Or function, which has the same variance as And by De Morgan’s laws.

We now present the main inductive step to fool tribes.

Claim 3. Let f be the tribes function, where the first t \leq w bits of each of the terms are fixed. Let w' = w - t be the number of free bits per term, and k' \leq k the number of terms that are non-constant (some terms may have become constant after fixing the bits).

Reexpress f as f(x, y) = \bigwedge _{i=1}^{k'} \bigvee (x_i, y_i) , where each term’s input bits are split in half, so |x_i| = |y_i| = w' / 2.

Let D be a small bias distribution with bias \epsilon ^c (for a big enough c to be set later). Then

\begin{aligned}\left \vert \mathbb{E} _{(x, y) \in U^2}[f(x,y)] - \mathbb{E} _{(x, y) \in (D,U)}[f(x,y)] \right \vert \leq \epsilon .\end{aligned}

That is, if we replace half of the free bits with a small bias distribution, then the resulting expectation of the function only changes by a small amount.

To get the generator from this claim, we repeatedly apply Claim 3, replacing half of the bits of the input with another small bias distribution. We repeat this until we have a small enough remaining amount of free bits that replacing all of them with a small bias distribution causes an insignificant change in the expectation of the output.

At each step, w' is cut in half, so the required number of repetitions to reduce w' to constant is R = \log (w) = \log \log (n). Actually, as explained below, we’ll stop when w' = c' \log \log (1/\epsilon ) for a suitable constant c' (this arises from the error bound in the claim above).

After each replacement, we incur an error of \epsilon , and then we incur the final error from replacing all bits with a small bias distribution. This final error is negligible by a result which we haven’t seen, but which is close in spirit to the proof we saw that bounded independence fools AND.

The total accumulated error is then \epsilon ' = \epsilon \log \log (n). If we wish to achieve a specific error \epsilon , we can run each small-bias generator with error \epsilon / \log \log (n).

At each iteration, our small bias distribution requires O(\log (n / \epsilon )) bits, so our final seed length is O(\log (n / \epsilon )) \text {poly}\log \log (n).

Proof of Claim 3. Define g_i(x_i) = \mathbb{E} _{y_i}[\bigvee (x_i, y_i)], and rewrite our target expression as:

\begin{aligned}\mathbb{E} _{x \in U}\left [\prod g_i(x_i)\right ] - \mathbb{E} _{x \in D}\left [\prod g_i(x_i)\right ].\end{aligned}

This is in the form of Claim 1. We also note from Claim 2 that \text {var}(g_i) = \Theta (2^{-3w'/2}).

We further assume that k' \leq 2^{w'} \log (1 / \epsilon ). For if this is not true, then the expectation over the first 2^{w'} \log (1 / \epsilon ) terms is \leq \epsilon , because of the calculation

\begin{aligned}(1 - 2^{-w'})^{2^{w'} \log (1 / \epsilon )} \leq \epsilon .\end{aligned}

Then we can reason as in the proof that bounded independence fools AND (i.e., we can run the argument just on the first 2^{w'} \log (1 / \epsilon ) terms to show that the products are close, and then use the fact that it is small under uniform, and the fact that adding terms only decreases the probability under any distribution).

Under the assumption, we can bound the sum of the variances of g as:

\begin{aligned}\sum \text {var}(g_i) \leq k' 2^{-3w' / 2} \leq 2^{-\Omega (w')}\log (1 / \epsilon ).\end{aligned}

If we assume that w' \ge c' \log \log (1 / \epsilon ) for a suitable constant c', then this sum is \leq 2^{-\Omega (w')}.

We can then plug this into the bound from Claim 1 to get

\begin{aligned}(2^{-\Omega (w')})^d + (k2^{w'})^d \epsilon ^c = 2^{-\Omega (dw')} + 2^{O(dw')}\epsilon ^c.\end{aligned}

Now we set d so that \Omega (dw') = \log (1 / \epsilon )+1, and the bound becomes:

\begin{aligned}\epsilon / 2 + (1 / \epsilon )^{O(1)}\epsilon ^{c} \leq \epsilon .\end{aligned}

By making c large enough the claim is proved. \square

In the original paper, they apply these ideas to read-once CNF formulas. Interestingly, this extension is more complicated and uses additional ideas. Roughly, the progress measure is going to be the number of terms in the CNF (as opposed to the width). A CNF is broken up into a small number of Tribes functions, the above argument is applied to each Tribes function, and then they are put together using a general fact that they prove: if f and g are fooled by small-bias distributions then so is f \wedge g on disjoint inputs.

References

[AW89]    Miklos Ajtai and Avi Wigderson. Deterministic simulation of probabilistic constant-depth circuits. Advances in Computing Research – Randomness and Computation, 5:199–223, 1989.

[GLS12]    Dmitry Gavinsky, Shachar Lovett, and Srikanth Srinivasan. Pseudorandom generators for read-once ACC^0. In Proceedings of the 27th Conference on Computational Complexity, CCC 2012, Porto, Portugal, June 26-29, 2012, pages 287–297, 2012.

[GMR^{+}12]    Parikshit Gopalan, Raghu Meka, Omer Reingold, Luca Trevisan, and Salil Vadhan. Better pseudorandom generators from milder pseudorandom restrictions. In IEEE Symp. on Foundations of Computer Science (FOCS), 2012.

[Nis91]    Noam Nisan. Pseudorandom bits for constant depth circuits. Combinatorica, 11(1):63–70, 1991.

[Nis92]    Noam Nisan. Pseudorandom generators for space-bounded computation. Combinatorica, 12(4):449–461, 1992.

[NN90]    J. Naor and M. Naor. Small-bias probability spaces: efficient constructions and applications. In 22nd ACM Symp. on the Theory of Computing (STOC), pages 213–223. ACM, 1990.

[Ta-17]    Amnon Ta-Shma. Explicit, almost optimal, epsilon-balanced codes. In ACM Symp. on the Theory of Computing (STOC), pages 238–251, 2017.

Special Topics in Complexity Theory, Lectures 2-3

Special Topics in Complexity Theory, Fall 2017. Instructor: Emanuele Viola

1 Lectures 2-3, Scribe: Tanay Mehta

In these lectures we conclude the proof that bounded independence fools the AND function, and look at the more recent result that bounded independence fools the circuit class \mathrm {AC}^0.

1.1 Bounded Independence Fools AND

We state again the theorem from last time.

Theorem 1. Let (X_1, \dots , X_n) be a distribution over \{0,1\}^n such that any k X_i are independent (but not necessarily uniform). Then, we have that

\begin{aligned} \left |\Pr \left [\bigwedge _{i=1}^n X_i = 1\right ] - \prod _{i=1}^n \Pr [X_i = 1]\right | \leq 2^{- \Omega (k)}\end{aligned}

Proof. Let D be the distribution of (X_1, \dots , X_n). Let B be the n-wise independent distribution (Y_1, \dots , Y_n) such that \Pr [Y_i = 1] = \Pr [X_i = 1] for all i \in [n] and the Y_i are independent. The theorem is equivalent to the following statement.

\begin{aligned}|\Pr _{X \leftarrow D}\left [\bigwedge _{i=1}^n X_i = 1 \right ] - \Pr _{X \leftarrow B}\left [\bigwedge _{i=1}^n X_i = 1 \right ]| \leq 2^{-\Omega (k)}\end{aligned}

We will prove the above statement by the following version of the Inclusion-Exclusion principle.

1.1.1 Inclusion-Exclusion Principle

Let V be any distribution over \{0,1\}^n. Note that by De Morgan’s laws, we have

\begin{aligned}\Pr \left [\bigwedge V_i = 1 \right ] = 1 - \Pr \left [\bigvee V_i = 0 \right ]\end{aligned}

Let E_i be the event that V_i = 0. We want to bound the quantity \Pr \left [\bigcup E_i\right ]. By looking at the Venn diagram of the events E_i, we can see that

\begin{aligned} \Pr \left [\bigcup E_i\right ] & \leq \Pr [E_1] + \dots + \Pr [E_n] = \sum _i \Pr [E_i] \nonumber \\ \Pr \left [\bigcup E_i\right ] & \geq \sum _i \Pr [E_i] - \sum _{i < j} \Pr [E_i \cap E_j] \nonumber \\ \Pr \left [\bigcup E_i\right ] & \leq \sum _i \Pr [E_i] ~~ - \sum _{S \subseteq [n], |S| = 2} \Pr \left [\bigcap _{i \in S} E_i \right ]~~+ \sum _{S \subseteq [n], |S| = 3} \Pr \left [\bigcap _{i \in S} E_i\right ], \nonumber \end{aligned}

and so on. In general, we have the following. Define

\begin{aligned}T_j := \sum _{S \subseteq [n], |S| = j} \Pr \left [ \bigcap _{i \in S} E_i \right ]\end{aligned}
\begin{aligned}S_h := \sum _{i=1}^h (-1)^{i+1} T_i \end{aligned}

Then, we have the bounds \Pr \left [\bigcup E_i\right ] \leq S_j for odd j, and \Pr \left [\bigcup E_i\right ] \geq S_j for even j. This fact holds for any distribution.
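These alternating (Bonferroni) bounds are easy to see numerically (our own sketch, with the events E_i chosen arbitrarily): the partial sums S_h alternate around \Pr \left [\bigcup E_i\right ].

```python
import itertools
from fractions import Fraction

n = 6
space = list(itertools.product([0, 1], repeat=n))        # uniform over {0,1}^6
events = [set(x for x in space if x[i] == 0) for i in range(n)]   # E_i: bit i is 0

union = Fraction(len(set().union(*events)), len(space))  # Pr[union of the E_i]

def T(j):
    """T_j = sum over |S| = j of Pr[intersection of E_i for i in S]."""
    return sum(Fraction(len(set.intersection(*(events[i] for i in S))), len(space))
               for S in itertools.combinations(range(n), j))

for h in range(1, n + 1):
    S_h = sum((-1) ** (i + 1) * T(i) for i in range(1, h + 1))
    bound_holds = S_h >= union if h % 2 else S_h <= union
    print(f"S_{h} = {S_h}, union = {union}, bound holds: {bound_holds}")
```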

Let us return to the proof. Note that the S_h are the same for D and B up to h = k because they only involve sums of ANDs of at most k events. Hence, we have that

\begin{aligned}\left |\Pr _D \left [\bigwedge X_i = 1\right ] - \Pr _B\left [ \bigwedge X_i = 1\right ]\right | \leq |S_k - S_{k-1}| = |T_k|\end{aligned}

where the last equality comes from the definition of S_k. Therefore, we are done if |T_k| \leq 2^{-\Omega (k)}. We have that

\begin{aligned}T_k = \sum _{S \subseteq [n], |S| = k} \Pr \left [\bigcap _{i \in S} E_i \right ] ={ n \choose k} ~~ \mathbb {E}_{S \subseteq [n], |S| = k} \left [ \prod _{i \in S} P_i \right ] \end{aligned}

where P_i := \Pr [E_i] = 1 - \Pr [X_i = 1]. To bound the expectation we recall a useful inequality.

1.1.2 A Useful Inequality

Let Q_1, \dots , Q_n be non-negative real numbers. Then, by the AM-GM inequality, we have that

\begin{aligned}\frac {\sum _i Q_i}{n} \geq \left (\prod _i Q_i \right )^{1/n}.\end{aligned}

Consider the following more general statement,

\begin{aligned} \mathbb {E}_{S \subseteq [n], |S| = 1} \left [ \prod _{i \in S} Q_i \right ] \geq \mathbb {E}_{S \subseteq [n], |S| = 2} \left [ \prod _{i \in S} Q_i \right ]^{1/2} \geq \dots \hspace {5cm} \nonumber \\ \dots \geq \mathbb {E}_{S \subseteq [n], |S| = k} \left [ \prod _{i \in S}Q_i \right ]^{1/k} \geq \dots \geq \mathbb {E}_{S \subseteq [n], |S| = n} \left [ \prod _{i \in S} Q_i \right ]^{1/n} \nonumber \end{aligned}

and note that the left most term is equal to \frac {\sum _i Q_i}{n}, while the right most term is equal to \left (\prod _i Q_i \right )^{1/n}

Applying the above inequality to T_k and a common approximation for the binomial coefficient, we have that

\begin{aligned}T_k = { n \choose k} ~~ \mathbb {E}_{S \subseteq [n], |S| = k} \left [ \prod _{i \in S} P_i \right ] \leq {n \choose k} \left (\frac {\sum _i P_i}{n}\right )^k \leq \left (\frac {e n}{k}\right )^k \left (\frac {\sum P_i}{n}\right )^k = \left (\frac {e \sum P_i}{k}\right )^k.\end{aligned}

Therefore, we are done if \sum P_i \leq \frac {k}{2 e}. Recall that P_i = \Pr [E_i] = 1 - \Pr [X_i = 1]. So if P_i is small then \Pr [X_i = 1] is close to 1.

It remains to handle the case that \sum P_i \geq \frac {k}{2 e}. Pick n' such that

\begin{aligned}\sum _{i=1}^{n'} P_i = \frac {k}{2e} \pm 1.\end{aligned}

By the previous argument, the AND of the first n' bits is the same up to 2^{-\Omega (k)} for D and B. Also, for every distribution the probability that the And of n bits is 1 is at most the probability that the And of n' bits is 1. And also, for the n-wise independent distribution B we have

\begin{aligned} \Pr _B\left [\bigwedge _{i=1}^{n'} X_i = 1\right ] & = \prod _{i=1}^{n'} \Pr [X_i = 1] \\ & = \prod _{i=1}^{n'} (1 - P_i)\\ & \leq \left (\frac {\sum _{i=1}^{n'}(1-P_i)}{n'}\right )^{n'} \textrm { by the AM-GM inequality} \\ & \leq \left (\frac {n' - k/2e}{n'}\right )^{n'} \leq (1 - k/2en')^{n'} \leq e^{-\Omega (k)}. \end{aligned}

The combination of these facts concludes this case. To summarize, in this case we showed that

\begin{aligned}\Pr _D[\bigwedge _{i=1}^n X_i = 1] \le \Pr _D[\bigwedge _{i=1}^{n'} X_i = 1].\end{aligned}

as well as

\begin{aligned}\Pr _B[\bigwedge _{i=1}^n X_i = 1] \le \Pr _B[\bigwedge _{i=1}^{n'} X_i = 1] \leq 2^{-\Omega (k)}.\end{aligned}

By the choice of n' and the previous argument, we also know that |\Pr _D[\bigwedge _{i=1}^{n'} X_i = 1]- \Pr _B[\bigwedge _{i=1}^{n'} X_i = 1]| \le 2^{-\Omega (k)} and so we are done, as all quantities above are at most 2^{-\Omega (k)} (and at least 0). \square

Remark 2. The bound is tight up to the constant in the \Omega (\cdot ).

Proof. Let D be the distribution over \{0,1\}^{k+1} as follows: D_{1, \dots , k} = U_k and D_{k+1} = D_1 + \dots + D_k \mod 2. Then, D is k-wise independent. However, if k is even, then

\begin{aligned}\Pr [\bigwedge _{i=1}^{k+1} D_i = 1] = 0.\end{aligned}

Yet, we have that

\begin{aligned}\Pr [\bigwedge _{i=1}^{k+1} U_i = 1] = 2^{-(k+1)}.\end{aligned}

\square
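The tightness example can be checked exactly in a few lines (our own sketch): the distribution D below is k-wise independent, yet its AND probability is 0 while the uniform distribution gives 2^{-(k+1)}.

```python
import itertools
from fractions import Fraction

k = 4                                   # even, as in the remark
support = [x + (sum(x) % 2,) for x in itertools.product([0, 1], repeat=k)]
D = {x: Fraction(1, len(support)) for x in support}   # uniform first k bits, last = parity

def marginal(S):
    """Distribution of the coordinates in S under D."""
    m = {}
    for x, p in D.items():
        key = tuple(x[i] for i in S)
        m[key] = m.get(key, Fraction(0)) + p
    return m

# k-wise independence: every k coordinates are uniform over {0,1}^k
uniform_k = Fraction(1, 2 ** k)
assert all(len(marginal(S)) == 2 ** k and all(p == uniform_k for p in marginal(S).values())
           for S in itertools.combinations(range(k + 1), k))

and_D = sum(p for x, p in D.items() if all(x))
print("Pr_D[AND = 1] =", and_D, "   Pr_U[AND = 1] =", Fraction(1, 2 ** (k + 1)))
```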

1.2 Bounded Independence Fools \textrm {AC}^0

Acknowledgement. This section is based on Amnon Ta-Shma’s notes for the class 0368.4159 Expanders, Pseudorandomness and Derandomization CS dept, Tel-Aviv University, Fall 2016.

Note that a DNF on n bits can be modeled as a depth two circuit where the top layer is an OR-gate whose inputs are AND-gates, which take inputs X_1, \dots , X_n and their negations. The circuit class \textrm {AC}^0 can be viewed as a generalization of this to higher (but constant) depth circuits. That is, \textrm {AC}^0 consists of circuits using AND-gates, OR-gates, NOT-gates, and input registers. Each of the gates have unbounded fan-in (i.e. the number of input wires). The size of the circuit is defined to be the number of gates.

\textrm {AC}^0 is one of the most studied classes in complexity theory. \textrm {AC}^0 circuits of polynomial size can do many things, including adding and subtracting n-bit integers.

Conjecture 3.[Linial-Nisan[LN90]] \log ^{O(d)} s-wise independence fools \textrm {AC}^0 circuits of depth d and size s.

The conjecture was open for a long time, even in the special case d=2. In 2007 a breakthrough work by Bazzi [Baz09] proved it for d=2. Shortly afterwards, Razborov presented a simpler proof of Bazzi’s result [Raz09], and Braverman proved the conjecture for any d with \log ^{d^2} s-wise independence [Bra10]. Tal improved the result to \log ^{O(d)} s [Tal17].

Interestingly, the progress on the conjecture relies only on ideas that were already around at the time of its formulation. Bottom line: if a problem is open for a long time, you should immediately attack it with existing tools.

The high-level intuition why such a result should be true is the following:

  1. \textrm {AC}^0 is approximated by polylog degree polynomials.
  2. k-wise independence fools degree-k polynomials.

Proof of (2). Let x = (x_1, \dots , x_n) \in \{0,1\}^n. Let p(x_1, \dots , x_n) be a degree k polynomial over \mathbb {R}. Write p as

\begin{aligned}p(x_1, \dots , x_n) = \sum _{M \subseteq [n], |M| \leq k} c_M \cdot x_M.\end{aligned}

If D is a k-wise independent distribution on \{0,1\}^n, then by linearity of expectation

\begin{aligned} \mathbb {E}_D[p] = \sum _{M \subseteq [n], |M| \leq k} c_M \mathbb {E}_D[x_M] = \sum _{M \subseteq [n], |M| \leq k} c_M \mathbb {E}_U[x_M] = \mathbb {E}_U[p]. \end{aligned}

\square
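A tiny concrete instance (our own sketch): the 2-wise independent distribution (a, b, a \oplus b), with a and b uniform bits, matches the uniform distribution on the expectation of every monomial of degree at most 2 (and hence, by linearity, on every degree-2 polynomial), but not on the degree-3 monomial x_1 x_2 x_3.

```python
import itertools
from fractions import Fraction

# D = (a, b, a xor b): a classic 2-wise independent distribution on 3 bits
D = [(a, b, a ^ b) for a in (0, 1) for b in (0, 1)]
U = list(itertools.product([0, 1], repeat=3))

def E(dist, monomial):
    """E[prod_{i in monomial} x_i] under the uniform distribution over dist."""
    return Fraction(sum(all(x[i] for i in monomial) for x in dist), len(dist))

for r in range(4):
    for M in itertools.combinations(range(3), r):
        print(M, "E_D =", E(D, M), " E_U =", E(U, M))
```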

There are several notions of approximating AC^0 by low-degree polynomials. We now review two of them, explaining why neither of them is sufficient. Braverman showed how to cleverly combine the two methods to prove a version of (1) that’s strong enough.

1.2.1 Approximation 1

Theorem 4. For all \textrm {AC}^0 circuits C(x_1, \dots , x_n) of size s and depth d, for all distributions D over \{0,1\}^n, for all \epsilon , there exists a polynomial p(x_1, \dots , x_n) of degree \log ^{O(d)} (s/\epsilon ) such that

\begin{aligned}\Pr _{x \leftarrow D}[p(x) = C(x)] \geq 1 - \epsilon .\end{aligned}

The important features of this approximation are that it works under any distribution, and when the polynomial is correct it outputs a boolean value.

Similar approximations appear in many papers, going back to Razborov’s paper [Raz87] (who considers polynomials modulo 2) which uses ideas from earlier still work.

Note that the polynomial p depends on the circuit C chosen, and on the distribution. This theorem is not a good enough approximation because on the \epsilon fraction of inputs where the polynomial and circuit are unequal, the value of the polynomial can (and does) explode to be much greater than 1/\epsilon . This prevents us from bounding the average of the polynomial.

Nevertheless, let us prove the above theorem.

Proof. Consider one OR-gate of fan-in s. We construct a distribution over polynomials that computes the gate correctly on any fixed input with high probability. This implies that there is a fixed polynomial that computes the circuit on a large fraction of the inputs by an averaging argument.

For i = 1, 2, \dots , \log s, let S_i be a random subset of [s] where every element is included with probability 1/2^i, independently.

Suppose x has Hamming weight 2^j. Then, \mathbb {E}[\sum _{h \in S_j} x_h] = 1. And the sum can be shown to equal 1 with constant probability.

Define the approximation polynomial p to be

\begin{aligned}p(x) := 1 - \prod _{i=1}^{\log s} (1 - \sum _{h \in S_i} x_h)\end{aligned}

Note that if x has weight w > 0, then p(x) = 1 with constant probability (over the choice of the sets S_i). If w = 0, then p(x) = 0 with probability 1. We can adjust the error probability to \epsilon by repeating each term in the product \log (1/\epsilon ) times.

Thus, we can approximate one gate with the above polynomial of degree O(\log (s) \cdot \log (1/\epsilon )). Construct polynomials as p above for each gate, with error parameter \epsilon /s. The probability that any of them is wrong is at most \epsilon by a union bound. To obtain the approximating polynomial for the whole circuit compose all the polynomials together. Since the circuit is of depth d, the final degree of the approximating polynomial is (\log (s) \cdot \log (s/\epsilon ))^{d}, as desired.

As mentioned at the beginning, this is a distribution on polynomials that computes correctly any input with probability at least 1-\epsilon . By averaging, there exists a fixed polynomial that computes correctly a 1-\epsilon fraction of inputs. \square
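Here is a small empirical check of the single-OR-gate construction (our own sketch; the fan-in, the number of trials, and the omission of the \log (1/\epsilon ) repetition step are our choices): sample the random sets S_i, form p(x) = 1 - \prod _i (1 - \sum _{h \in S_i} x_h), and measure how often p agrees with the OR, as well as how large |p(x)| gets, which leads into the remark below.

```python
import math
import random

random.seed(0)
s = 64                                     # fan-in of the OR gate

def sample_poly(s):
    """One draw of the random sets S_1, ..., S_{log s}; returns p as a function."""
    sets = [[h for h in range(s) if random.random() < 2 ** -(i + 1)]
            for i in range(int(math.log2(s)))]
    def p(x):
        prod = 1
        for S in sets:
            prod *= 1 - sum(x[h] for h in S)
        return 1 - prod
    return p

trials, agree, biggest = 2000, 0, 0
for _ in range(trials):
    p = sample_poly(s)
    weight = random.choice([0, 1, 2, 4, 8, 16, 32, 64])
    x = [1] * weight + [0] * (s - weight)
    random.shuffle(x)
    val = p(x)
    agree += (val == (1 if weight > 0 else 0))
    biggest = max(biggest, abs(val))
print("agreement rate:", agree / trials, "   largest |p(x)| seen:", biggest)
```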

It can be verified that the value of the polynomial can be larger than 1/\epsilon . The polynomial for the gates closest to the input can be as large as s. Then at the next level it can be as large as s^{\log (s/\epsilon )}, which is already much larger than 1/\epsilon .

1.3 Approximation 2

Theorem 5. For all circuits C of size s and depth d, for all error values \epsilon , there exists a polynomial p(x_1, \dots , x_n) of degree O(\log (s)^{d-1} \log (1/\epsilon )) such that

\begin{aligned} \mathbb {E}_{x \leftarrow U_n}[(C(x) - p(x))^2] \leq \epsilon . \end{aligned}

The important feature of this approximation is that it bounds the average, but only under the uniform distribution. Because it does not provide any guarantee on other distributions, including k-wise independent distributions, it cannot be used directly for our aims.

Remark 6. Approximation 2 is proved via the switching lemma, an influential lemma first proved in the early 80’s by Ajtai [Ajt83] and by Furst, Saxe, and Sipser [FSS84]. The idea is to randomly set a subset of the variables to simplify the circuit. You can do this repeatedly to simplify the circuit even further, but it only works on the uniform distribution. Håstad [Hås87] gave a much tighter analysis of the switching lemma, and the paper [LMN93] used it to prove a version of Approximation 2 with a slightly worse dependence on the error. Recently, a refinement of the switching lemma was proved in [Hås14, IMP12]. Based on that, Tal [Tal17] obtained the corresponding refinement of Approximation 2 where the parameters are as stated above. (The polynomial is simply obtained from the Fourier expansion of the function computed by the circuit by removing all Fourier coefficients on sets larger than a certain threshold. The bound on the Fourier decay in [Tal17] implies the desired approximation.)

1.4 Bounded Independence Fools \textrm {AC}^0

Theorem 7. For all circuits C with unbounded fan-in of size s and depth d, for all error values \epsilon , for all k-wise independent distributions D on \{0,1\}^n, we have that

\begin{aligned} |\mathbb {E}[C(D)] - \mathbb {E}[C(U_n)]| \leq \epsilon \end{aligned}

for k = \log (s/\epsilon )^{O(d)}.

Corollary 8. In particular, if s = \textrm {poly}(n), d = O(1), and \epsilon = 1/\textrm {poly}(n), then k = \log ^{O(1)}(n) suffices.

The next claim is the ultimate polynomial approximation used to prove the theorem.

Claim 9. For all circuits C with unbounded fan-in of size s and depth d, for all error values \epsilon , for all k-wise independent distributions D on \{0,1\}^n, there is a set E of inputs, and a degree-k polynomial p such that:

  1. E is ’rare’ under both D and U_n:
    \Pr _{x \leftarrow U_n}[E(x) = 1] \leq \epsilon , and \Pr _{x \leftarrow D}[E(x) =1] \leq \epsilon . Here we write E(x) for the indicator function of the event x \in E.
  2. For all x, p(x) \leq C(x) \vee E(x). Here \vee is the logical Or.
  3. \mathbb {E}[p(U_n)] = \mathbb {E}[C(U_n)] \pm \epsilon .

We only need (1) under D, but (1) under U is used to prove (3).

Proof of Theorem 7 from Claim 9.

\begin{aligned} \mathbb {E}[C(D)] & = \mathbb {E}[C(D) \vee E(D)] \pm \epsilon \text {, by Claim.(1)} \nonumber \\ & \geq \mathbb {E}[p(D)] \pm \epsilon \text {, by Claim.(2)} \nonumber \\ & = \mathbb {E}[p(U_n)] \pm \epsilon \text {, because } p \text { has degree } k \text { and } D \text { is } k \text {-wise independent} \nonumber \\ & = \mathbb {E}[C(U_n)] \pm \epsilon \text {, by Claim.(3)} \nonumber \end{aligned}

For the other direction, repeat the argument for ‘not C’. \square

We can construct the polynomial approximation from Claim 9 by using a combination of Approximation 1 and 2. First we need a little more information about Approximation 1.

Claim 10. Two properties of approximation 1:

  1. For all x, p(x) \leq 2^{\log (s/\epsilon )^{O(d)}}.
  2. The ’bad’ set E is computable by a circuit of size \textrm {poly}(s), and depth d + O(1).

Proof of Claim 10 part 2. Consider a single OR gate with input gates g_1, \dots , g_s. This is represented in the approximating polynomial by the term

\begin{aligned}1 - \prod _{i=1}^{\textrm {polylog}(s/\epsilon )}(1 - \sum _{j \in S_i} g_j).\end{aligned}

Note that the term is incorrect exactly when the input g_1, \dots , g_s has weight > 0 but every set S_i contains either zero or at least two indices j with g_j = 1. This can be checked in AC^0, in parallel for all gates in the circuit. \square

Proof of Claim 9. Run approximation 1 for the distribution \frac {D+U}{2}, yielding the polynomial p_c and the set E. This already proves the first part of the claim for both D and U, because if E has probability \epsilon under D it has probability \ge \epsilon /2 under (D+U)/2, and the same for U . Use Claim 10 part 2, and run approximation 2 on E. Call the resulting polynomial p_E, which has degree \log (s/\delta )^{O(d)} with error bound \delta .

The idea in the ultimate approximating polynomial is to “check if there is a mistake, and if so, output 0. Otherwise, output C”. Formally:

\begin{aligned}p(x) := 1 - (1 - p_c(x)(1 - p_E(x)))^2\end{aligned}

Claim 9 part 2 can be shown as follows. p(x) \leq 1 by definition. So, if C(x) \vee E(x) = 1, then we are done. Otherwise, C(x) \vee E(x) = 0. So there is no mistake, and C=0. Hence, by the properties of Approximation 1, p_c(x) = 0. This implies p(x)=0.

It only remains to show Claim 9 part 3:

\begin{aligned} \mathbb {E}_U[p(x)] = \mathbb {E}_U[C(x)] \pm \epsilon . \end{aligned}

By part 1 of Claim 9,

\begin{aligned} \mathbb {E}_U[C(x) - p(x)] = \mathbb {E}_U[C(x) \vee E(x) - p(x)] \pm \epsilon . \end{aligned}

We can show that this equals

\begin{aligned} \mathbb {E}_U\left [\left (C(x) \vee E(x) - p_c(x)(1-p_E(x))\right )^2\right ] \pm \epsilon \end{aligned}

by the following argument: If C(x) \vee E(x) =1 then 1-p(x) = (1-p_c(x)(1-p_E(x)))^2 by definition. If C(x) \vee E(x) =0, then there is no mistake, and C(x) = 0. This implies that p_c(x)(1 - p_E(x)) = p(x) = 0.

Let us rewrite the above expression in terms of the expectation \ell _2 norm.

\begin{aligned} || C \vee E - p_c(1 - p_E)||_2^2. \end{aligned}

Recall the triangle inequality, which states: ||u - v||_2 \leq ||u - w||_2 + ||w - v||_2. Therefore, letting w = p_c(1 - E) we have that the above quantity is

\begin{aligned} \leq (|| p_c(1 - E) - p_c(1 - p_E)||_2 ~~ + ~~ ||p_c(1 - E) - C \vee E ||_2)^2 \end{aligned}
\begin{aligned} \leq O(||p_c(1-E) - p_c(1 - p_E)||_2^2 + ||p_c (1 - E) - C \vee E ||_2^2). \end{aligned}

To conclude, we will show that each of the above terms is \leq \epsilon . Note that

\begin{aligned} ||p_c(1-E) - p_c(1 - p_E)||_2^2 \leq \max _x |p_c(x)|^2 ||(1 - E) - (1 - p_E)||_2^2. \end{aligned}

By Claim 10 part 1 and Approximation 2, this is at most

\begin{aligned} 2^{\log (s/\epsilon )^{O(d)}} \cdot ||E - p_E||_2^2 \leq 2^{\log (s/\epsilon )^{O(d)}} \cdot \delta . \end{aligned}

For this quantity to be at most \epsilon we set \delta = \epsilon \cdot 2^{-\log (s/\epsilon )^{O(d)}}. Here we critically set the error in Approximation 2 much lower, to cancel the large values arising from Approximation 1. By Theorem 5, the polynomial arising from Approximation 2 has degree O(\log (s)^{d-1} \log (1/\delta )) = \log (s/\epsilon )^{O(d)}.
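To spell out the degree arithmetic with this choice of \delta (a routine check, included for completeness):

\begin{aligned} \log (1/\delta ) = \log (1/\epsilon ) + \log (s/\epsilon )^{O(d)} = \log (s/\epsilon )^{O(d)} ,\end{aligned}

so the degree of p_E is O(\log (s)^{d-1} \log (1/\delta )) = \log (s/\epsilon )^{O(d)}, and hence the final polynomial p = 1 - (1 - p_c(1-p_E))^2, whose degree is at most 2(\deg p_c + \deg p_E), also has degree \log (s/\epsilon )^{O(d)}.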

Finally, let us bound the other term, ||p_c (1 - E) - C \vee E ||_2^2. If E(x) = 0, then p_c(x) = C(x) = (C \vee E)(x), because Approximation 1 makes no mistake outside E, so the contribution at x is 0. If E(x) = 1, then (1 - E(x)) = 0 and (C \vee E)(x) = 1, so the contribution at x is \leq 1. Therefore, this term is at most \Pr _U[E(x) =1], which we know to be at most \epsilon . \square

References

[Ajt83]    Miklós Ajtai. \Sigma \sp {1}\sb {1}-formulae on finite structures. Annals of Pure and Applied Logic, 24(1):1–48, 1983.

[Baz09]    Louay M. J. Bazzi. Polylogarithmic independence can fool DNF formulas. SIAM J. Comput., 38(6):2220–2272, 2009.

[Bra10]    Mark Braverman. Polylogarithmic independence fools AC^0 circuits. J. of the ACM, 57(5), 2010.

[FSS84]    Merrick L. Furst, James B. Saxe, and Michael Sipser. Parity, circuits, and the polynomial-time hierarchy. Mathematical Systems Theory, 17(1):13–27, 1984.

[Hås87]    Johan Håstad. Computational limitations of small-depth circuits. MIT Press, 1987.

[Hås14]    Johan Håstad. On the correlation of parity and small-depth circuits. SIAM J. on Computing, 43(5):1699–1708, 2014.

[IMP12]    Russell Impagliazzo, William Matthews, and Ramamohan Paturi. A satisfiability algorithm for AC^0. In ACM-SIAM Symp. on Discrete Algorithms (SODA), pages 961–972, 2012.

[LMN93]   Nathan Linial, Yishay Mansour, and Noam Nisan. Constant depth circuits, Fourier transform, and learnability. J. of the ACM, 40(3):607–620, 1993.

[LN90]     Nathan Linial and Noam Nisan. Approximate inclusion-exclusion. Combinatorica, 10(4):349–365, 1990.

[Raz87]    Alexander Razborov. Lower bounds on the dimension of schemes of bounded depth in a complete basis containing the logical addition function. Akademiya Nauk SSSR. Matematicheskie Zametki, 41(4):598–607, 1987. English translation in Mathematical Notes of the Academy of Sci. of the USSR, 41(4):333-338, 1987.

[Raz09]    Alexander A. Razborov. A simple proof of Bazzi’s theorem. ACM Transactions on Computation Theory (TOCT), 1(1), 2009.

[Tal17]    Avishay Tal. Tight bounds on the Fourier spectrum of AC^0. In Conf. on Computational Complexity (CCC), pages 15:1–15:31, 2017.

Special Topics in Complexity Theory: Lecture 1

Special Topics in Complexity Theory, Fall 2017. Instructor: Emanuele Viola

1 Lecture 1, Scribe: Chin Ho Lee

In this first lecture we begin with some background on pseudorandomness and then we move on to the study of bounded independence, presenting in particular constructions and lower bounds.

1.1 Background

Let us first give some background on randomness. There are 3 different theories:

(1) Classical probability theory. For example, if we toss a coin 12 times then the probability of each outcome is the same, i.e., \Pr [010101010101] = \Pr [011011100011]. However, intuitively we feel that the first outcome is less random than the second.

(2) Kolmogorov complexity. Here the randomness is measured by the length of the shortest program outputting a string. In the previous example, a program for the second outcome could be “print 011011100011”, whereas a program for the first outcome can be “print 01 six times”, which is shorter; so the first outcome is indeed less random in this sense.

(3) Pseudorandomness. This is similar to resource-bounded Kolmogorov complexity. Here random means the distribution “looks random” to “efficient observers.”

Let us now make the above intuition precise.

Definition 1.[Pseudorandom generator (PRG)] A function f\colon \{0,1\}^s \to \{0,1\}^n is a pseudorandom generator (PRG) against a class of tests T \subseteq \{ t\colon \{0,1\}^n \to \{0,1\}\} with error \epsilon , if it satisfies the following 3 conditions:

(1) the output of the generator must be longer than its input, i.e., n > s;

(2) it should fool T, that is, for every test t \in T, we have \Pr [t(U_n) = 1] = \Pr [t(f(U_s)) = 1] \pm \epsilon ;

(3) the generator must be efficient.

To get a sense of the definition, note that a PRG is easy to obtain if we drop any one of the above 3 conditions. If we drop condition (1), then we can define our PRG as f(x) := x. If we drop condition (2), then we can define our PRG as f(x) := 0^n. If we drop condition (3), then the PRG is not as obvious to obtain as in the previous two cases. We have the following claim.

Claim 2. For every class of tests T, there exists an inefficient PRG with error \epsilon and seed length s = \lg _2\lg _2(|T|) + 2 \lg _2(1/\epsilon ) + O(1).

Before proving the claim, consider the example where T is the class of circuits of size n^{100} over n-bit inputs; it is known that |T| = 2^{n^{O(1)}}. Hence, applying the claim we see that there is an inefficient PRG that fools T with error \epsilon and seed length s = O(\lg _2(n/\epsilon )).

We now prove the claim using the probabilistic method.

Proof. Consider picking f at random. Then by the Chernoff bound, we have for every test t \in T,

\begin{aligned} \Pr _f[ | \Pr _{U_s} [ t(f(U_s)) = 1] - \Pr _{U_n}[ t(U_n) = 1] | \ge \epsilon ] \le 2^{-\Omega (\epsilon ^2 2^s)} < 1 / |T|, \end{aligned}

if s = \lg _2\lg _2(|T|) + 2 \lg _2(1/\epsilon ) + O(1). Therefore, by a union bound over t \in T, there exists a fixed f such that for every t \in T, the probabilities are within \epsilon . \square
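To make the choice of s explicit (this is just unwinding the bound above, nothing new): we need the failure probability 2^{-\Omega (\epsilon ^2 2^s)} to be smaller than 1/|T|, that is,

\begin{aligned} \epsilon ^2 2^s \ge c \lg _2 |T| \quad \Longleftrightarrow \quad 2^s \ge \frac {c \lg _2 |T|}{\epsilon ^2} \quad \Longleftrightarrow \quad s \ge \lg _2\lg _2(|T|) + 2 \lg _2(1/\epsilon ) + O(1) ,\end{aligned}

for a suitable constant c.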

1.2 k-wise independent distribution

A major goal in research in pseudorandomness is to construct PRGs (1) against richer and richer classes T, (2) with smaller and smaller seed length s, and (3) that are explicit. For starters, let us consider a simple class of tests.

Definition 3.[d-local tests] The d-local tests are tests that depend only on d bits.

We will show that for this class of tests we can actually achieve error \epsilon = 0. To warm up, consider what happens when d = 1: then we can have a PRG with seed length s = 1 by defining f(0) := 0^n and f(1) := 1^n.

For d = 2, we have the following construction. Define

\begin{aligned} f(x)_y := \langle x, y \rangle = \sum _i x_i y_i \bmod 2 . \end{aligned}

Here the length of x and y is |x| = |y| = \lg _2 n, and we exclude y = 0^{\lg _2 n}. Note that the output has n - 1 bits, but we can append one uniform bit to the output of f. So the seed length would be \lg _2 n + 1.

Now we prove the correctness of this PRG.

Claim 4. The f defined above is a PRG against 2-local tests with error \epsilon = 0.

Proof. We need to show that for every y \neq z, the random variable (f(x)_y, f(x)_z), over the choice of x, is identical to U_2, the uniform 2-bit string. Since y \neq z, suppose without loss of generality that there exists an i such that y_i = 1 and z_i = 0. Now f(x)_z is uniform (because z \neq 0), and conditioned on the value of f(x)_z, the bit f(x)_y is still uniform, because f(x)_y depends on the bit x_i while f(x)_z does not. \square
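To make the construction concrete, here is a minimal Python sketch (my own addition, not part of the lecture) that enumerates the outputs of this d = 2 generator and brute-force checks that every pair of output bits is uniformly distributed; all function and variable names are mine.

\begin{verbatim}
from itertools import product, combinations

def inner_product_prg(n_log):
    # Outputs of f(x)_y = <x, y> mod 2 for all seeds x in {0,1}^{n_log};
    # y ranges over the nonzero strings, so the output has 2^{n_log} - 1 bits
    # (the extra appended uniform bit is omitted here).
    ys = [y for y in product([0, 1], repeat=n_log) if any(y)]
    return [tuple(sum(xi * yi for xi, yi in zip(x, y)) % 2 for y in ys)
            for x in product([0, 1], repeat=n_log)]

def is_pairwise_uniform(outputs):
    # Every pair of output coordinates should take each value in {0,1}^2
    # equally often over a uniform seed.
    m = len(outputs[0])
    for i, j in combinations(range(m), 2):
        counts = {}
        for out in outputs:
            counts[(out[i], out[j])] = counts.get((out[i], out[j]), 0) + 1
        if len(counts) != 4 or len(set(counts.values())) != 1:
            return False
    return True

outs = inner_product_prg(3)       # seed length 3, output length 7
print(is_pairwise_uniform(outs))  # expected: True
\end{verbatim}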

The case for d = 3 becomes much more complicated and involves the use of finite fields. One can think of a finite field as a finite domain that behaves like \mathbb {Q} in the sense that it allows you to perform arithmetic operations, including division, on the elements. We will use the following fact about finite fields.

Lemma 5. There exist finite fields of size p^k, for every prime p and integer k. Moreover, they can be constructed and operated with in time \mathrm {poly}(k, p).

Remark 6. Ideally one would like the dependence on p to be \lg _2 p. However, such a construction remains an open question, and there have been many attempts at constructing finite fields in time \mathrm {poly}(k, \lg _2 p). Here we only work with finite fields with p = 2, and there are many explicit constructions for that.

One simple example of a finite field is the integers modulo p.

Theorem 7. Let D = \{0, 1\}^{\lg _2 n}. For every k, there exists an explicit distribution over D^n such that

(1) elements in D^n can be sampled with s = k \lg _2 n bits, and

(2) every k symbols are uniform in D^k.

For d = 3, we can use the above theorem with k = 3, and the PRG can output the first bit of every symbol.

Remark 8. There exist other constructions that are similar to the inner product construction for the case d = 2, with y carefully chosen, but the way to choose y involves the use of finite fields as well.

Note that we can also apply the theorem for larger d to fool d-local tests with seed length s = d \lg _2 n.

We now prove the theorem.

Proof. Pick a finite field \mathbb {F} of size 2^{\lg _2 n} = n, and identify it with D. Let a_0, \ldots , a_{k-1} \in \mathbb {F} be uniform random elements of \mathbb {F}, which we think of as the coefficients of a polynomial a(x) of degree k-1. We define the generator f to be

\begin{aligned} f(a_0, \ldots , a_{k-1})_x = a(x) = \sum _{i=0}^{k-1} a_i x^i . \end{aligned}

(One should think of the outputs of f as lines and curves in the real plane.)

The analysis of the PRG follows from the following useful fact: for every k points (x_0, y_0), (x_1, y_1), \ldots , (x_{k-1}, y_{k-1}) with distinct x_i's, there exists exactly one polynomial of degree at most k-1 going through them. Hence, for any k distinct positions, the map from seeds to the corresponding k output symbols is a bijection, so those symbols are uniform in D^k. \square
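Here is a minimal Python sketch of this polynomial-evaluation construction (again my own addition, not from the lecture). For simplicity it works over a prime field Z_p, rather than the field of size 2^{\lg _2 n} used above, and it brute-force checks k-wise uniformity for tiny parameters; all names are mine.

\begin{verbatim}
from itertools import product, combinations

def poly_generator(p, k):
    # k-wise uniform distribution over (Z_p)^p: for each seed (a_0, ..., a_{k-1}),
    # the output symbol at position x is a(x) = sum_i a_i x^i mod p.
    return [tuple(sum(c * pow(x, i, p) for i, c in enumerate(coeffs)) % p
                  for x in range(p))
            for coeffs in product(range(p), repeat=k)]

def is_k_wise_uniform(outputs, k, p):
    # Every k output positions should take each value in (Z_p)^k equally often.
    n = len(outputs[0])
    for positions in combinations(range(n), k):
        counts = {}
        for out in outputs:
            key = tuple(out[i] for i in positions)
            counts[key] = counts.get(key, 0) + 1
        if len(counts) != p ** k or len(set(counts.values())) != 1:
            return False
    return True

outs = poly_generator(p=5, k=3)           # seed: 3 field elements, output: 5 symbols
print(is_k_wise_uniform(outs, k=3, p=5))  # expected: True
\end{verbatim}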

Let us now introduce a terminology for PRGs that fool d-local tests.

Definition 9. We call distributions that look uniform (with error 0) to k-local tests k-wise independent (also known as k-wise uniform). The latter terminology is more precise, but the former is more widespread.

We will soon see an example of a distribution where every k elements are independent but not necessarily uniform.

1.3 Lower bounds

We have just seen a construction of k-wise independent distributions with seed length s = k \lg _2 n. It is natural to ask, what is the minimum seed length needed to generate k-wise independent distributions?

Claim 10. For every k \ge 2, every PRG for k-local tests over \{0, 1\}^n has seed length s \ge \Omega (k \lg _2 (n/k)).

Proof. We use the linear-algebraic method. See the book by Babai–Frankl [1] for more applications of this method.

To begin, we will switch from \{0, 1\} to \{-1, 1\}, and write the PRG as a 2^s \times n matrix M, where the rows are all the possible outputs of the PRG. Since the PRG fools k-local tests and k \ge 2, one can verify that every 2 columns of M are orthogonal, i.e., \langle M_i, M_j \rangle = 0 for i \neq j. As shown below, this implies that the columns are linearly independent, and by linear algebra this gives a lower bound on s.

However, so far we have not used k. Here is how to use it. Consider all the column vectors v obtained by taking the entry-wise product of any k/2 of the columns of M. Because of k-wise independence, these v's are again pairwise orthogonal, and this again implies that they are linearly independent.

Claim 11. If v_1, v_2, \ldots , v_t are nonzero and pairwise orthogonal, then they are linearly independent.

Proof. Suppose they are not; then we can write v_i = \sum _{j \in S} c_j v_j for some set S with i \not \in S and nonzero coefficients c_j. Taking the inner product with v_i on both sides, the left-hand side \langle v_i, v_i \rangle is nonzero, whereas the right-hand side is zero because the vectors are orthogonal, a contradiction. \square

Therefore, the rank of M must be at least the number of v’s, and so

\begin{aligned} 2^s \ge \text {number of v's} \ge \binom {n}{k/2} \ge (2n / k)^{k/2} . \end{aligned}

Rearranging gives s \ge (k/2) \lg _2 (2n / k). \square
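As a small concrete illustration of the linear-algebraic method (my own sketch, not from the notes), the following Python code builds the 2^s \times n matrix M of the d = 2 inner-product generator with s = 3 and n = 7, maps its entries to \{-1,1\}, and checks that the columns are pairwise orthogonal; since orthogonal nonzero vectors are linearly independent, this forces 2^s \ge n, as in the proof above.

\begin{verbatim}
from itertools import product

s = 3
ys = [y for y in product([0, 1], repeat=s) if any(y)]   # the 7 nonzero indices y
# Rows of M are the generator's outputs, with bits mapped from {0,1} to {-1,1}.
M = [[(-1) ** (sum(xi * yi for xi, yi in zip(x, y)) % 2) for y in ys]
     for x in product([0, 1], repeat=s)]

cols = list(zip(*M))
all_orthogonal = all(sum(a * b for a, b in zip(cols[i], cols[j])) == 0
                     for i in range(len(cols)) for j in range(i + 1, len(cols)))
print(all_orthogonal, len(M), ">=", len(cols))  # expected: True 8 >= 7
\end{verbatim}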

1.4 Who is fooled by k-wise independence?

In the coming lectures we will see that k-wise independence fools \mathrm {AC}^0, the class of constant-depth circuits with unbounded fan-in. Today, let us see what else is fooled by k-wise independence in addition to k-local tests.

(1) Suppose we have n independent variables X_1, \ldots , X_n \in [0,1] and we want to understand the behavior of their sum \sum _i X_i. Then we can apply tools such as the Chernoff bound, tail bounds, the Central Limit Theorem, and the Berry–Esseen theorem. The first two give bounds on large deviations from the mean. The latter two are somewhat more precise facts that show that the sum approaches a normal distribution (i.e., for every t, the probability that the sum exceeds t is about the same as for a normal distribution with the same mean and variance). One can show that similar results hold when the X_i's are only k-wise independent. The upshot is that the Chernoff bound gives error 2^{-\text {samples}}, while under k-wise independence we can only get error (\text {samples})^{-k/2}; a sketch of where this bound comes from appears after this list.

(2) We will see next time that k-wise independence fools \mathrm {DNF} and \mathrm {AC}^0.

(3) k-wise independence is also used for hashing in load balancing.
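For completeness, here is a rough sketch (my own, with constants suppressed) of where the (\text {samples})^{-k/2} bound in item (1) comes from. For even k, Markov's inequality applied to the k-th central moment of the sum gives

\begin{aligned} \Pr \left [ \left | \sum _i X_i - \mathbb {E}\sum _i X_i \right | \ge \epsilon n \right ] \le \frac {\mathbb {E}\left [ \left ( \sum _i X_i - \mathbb {E}\sum _i X_i \right )^k \right ]}{(\epsilon n)^k} .\end{aligned}

Expanding the k-th power involves products of at most k of the variables, so under k-wise independence this moment is the same as under full independence, where it is roughly at most (kn)^{k/2}. This gives a tail bound of roughly (k/\epsilon ^2 n)^{k/2}, i.e., polynomial rather than exponential in the number of samples.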

1.4.1 k-wise independence fools AND

We now show that k-wise independent distributions fool the AND function.

Claim 12. Every k-wise uniform distribution fools the AND function on bits with error \epsilon = 2^{-\Omega (k)}.

Proof. If the AND function is on at most k bits, then by definition the error is \epsilon = 0. Otherwise the AND is over more than k bits. Without loss of generality we can assume the AND is on the first t > k bits. Observe that for any distribution D, we have

\begin{aligned} \Pr _D[\text {AND on t bits is 1}] \le \Pr _D[\text {AND on k bits is 1}] . \end{aligned}

The right-hand side is the same under the uniform distribution and under any k-wise uniform distribution, and equals 2^{-k}. Hence both \Pr _{\text {uniform}}[\mathrm {AND} = 1] and \Pr _{\text {k-wise ind.}}[\mathrm {AND} = 1] lie in [0, 2^{-k}], and so

\begin{aligned} | \Pr _{\text {uniform}}[AND = 1] - \Pr _{\text {k-wise ind.}}[\mathrm {AND} = 1] | \le 2^{-k} . \end{aligned}

\square

Instead of working over bits, let us now consider what happens over a general domain D. We are given n functions f_1, \ldots , f_n \colon D \to \{0, 1\}, and suppose (x_1, \ldots , x_n) is k-wise uniform over D^n. What can we say about the AND of the outputs f_1(x_1), f_2(x_2), \ldots , f_n(x_n)?

This is similar to the previous setting, except that now the bits f_1(x_1), \ldots , f_n(x_n) are k-wise independent but not necessarily uniform. Nevertheless, we can show that a similar bound of 2^{-\Omega (k)} still holds.

Theorem 13. [2] Let X_1, X_2, \ldots , X_n be random variables over \{0, 1\}, which are k-wise independent, but not necessarily uniform. Then

\begin{aligned} \Pr [ \prod _{i = 1}^n X_i = 1 ] = \prod _{i = 1}^n \Pr [X_i = 1] \pm 2^{-\Omega (k)} . \end{aligned}

This fundamental theorem appeared in the conference version of [2], but was removed in the journal version. This is one of the few cases where the journal version contains fewer results than the conference version.

Proof. Since each X_i is in \{0, 1\}, by De Morgan’s law, we can write

\begin{aligned} \Pr [\prod _{i=1}^n X_i = 1] = \mathbb{E} [ \prod _{i=1}^n X_i ] = \mathbb{E} [ \mathrm {AND}_{i=1}^n X_i ] = \mathbb{E} [ 1 - \mathrm {OR}_{i=1}^n (1 - X_i)]. \end{aligned}

If we define E_i to be the event that X_i = 0, then \mathbb{E} [\mathrm {OR}_{i=1}^n (1 - X_i)] is the same as \Pr [\bigcup _{i=1}^n E_i]. Now we apply the inclusion-exclusion principle, which says

\begin{aligned} \Pr [\bigcup _{i=1}^n E_i] = \sum _{i=1}^n \Pr [E_i] - \sum _{i < j} \Pr [E_i \cap E_j] + \cdots + (-1)^{j + 1} \sum _{S \subseteq [n], |S| = j} \Pr [\bigcap _{i \in S} E_i] + \cdots . \end{aligned}

We will finish the proof in the next lecture. \square

References

[1]   László Babai and Peter Frankl. Linear algebra methods in combinatorics. 1992.

[2]   Guy Even, Oded Goldreich, Michael Luby, Noam Nisan, and Boban Velickovic. Approximations of general independent distributions. In ACM Symp. on the Theory of Computing (STOC), pages 10–16, 1992.

Double dragon and labor supply

Another interesting report came out at the NBER. Basically, it argues that better and cheaper video games are a major factor in the decline of hours of work among young men.  You can read the report here and a popular exposition here.

I can definitely relate to this report: I spent countless hours playing videogames, starting very young until quite recently (now I work hard 100% of my time, as you can see from this post). When I was younger I maintained that playing video games is actually good for the brain.  My basic idea was that to play a game you have to learn a new set of rules, and that’s good for you.

What better opportunity to remember one of my favorite games. A game that I wasted an enormous amount of time not even playing, but waiting for my turn to play, at a bar next to my (landlord’s) house in Rome, with my pockets full of jingling, picturesque 500-lire coins (roughly half a euro each). A game that I still think is the best side-scrolling beat’em up ever made. Back then, who could have imagined that one day you would be able to play this, first with MAME and then in a browser?

Double dragon: Click here and the day is gone.

taking a course online, instead of in-person, reduces student success and progress in college

The title is taken from the abstract of a recent AER paper:

Online college courses are a rapidly expanding feature of higher education, yet little research identifies their effects relative to traditional in-person classes. Using an instrumental variables approach, we find that taking a course online, instead of in-person, reduces student success and progress in college. Grades are lower both for the course taken online and in future courses. Students are less likely to remain enrolled at the university. These estimates are local average treatment effects for students with access to both online and in-person options; for other students, online classes may be the only option for accessing college-level courses.

This is truly shocking! Who could have possibly guessed? (I am not saying that the paper is worthless.)

Teaching special topics in complexity theory in Fall 2017

Next Fall I am teaching a special topics class in complexity theory. Below and here is the formal announcement, including a highly tentative list of topics. Some of the topics are somewhat ambitious, but that problem lies well in the future, concerning not me but my future self. He also appreciates suggestions and pointers to cool results to include.


Special topics in complexity theory

Instructor.

Emanuele Viola

Logistics.

Class 1:35 pm – 3:15 pm Tuesday Friday Ryder Hall 273.

First class Sep 08, 2017.

Running at the same time: the Program on Combinatorics and Complexity at Harvard.

Tentative syllabus

This class will present recent (amazing) progress in complexity theory and related areas. A highly tentative list of topics follows:

(1) Pseudorandom generators. Bounded independence, small-bias, sandwiching polynomials. Bounded independence fools And, bounded independence fools AC0 (Braverman’s result), the Gopalan-Kane-Meka inequality, the Gopalan-Meka-Reingold-Trevisan-Vadhan generator. Vadhan’s survey, Amnon Ta-Shma’s class

(2) Approximate degree. Bounded indistinguishability. Bounded indistinguishability of Or, And-Or, and Surjectivity (the Bun-Thaler proof)

(3) Circuit complexity. Williams’ lower bounds from satisfiability. Succinct and explicit NP-completeness. ACC0-SAT algorithms. Exposition in web appendix of Arora and Barak’s book.

(4) Quasi-random groups. Austin’s corner theorem in SL(2,q).

(5) Communication complexity and quasi-random groups. Determinism vs. randomization in number-on-forehead communication complexity. Number-in-hand lower bounds for quasi-random groups. Various notes.

(6) Data structures. Overview: static, dynamic, bit-probe, cell-probe. Siegel’s lower bound for hashing. The Larsen-Weinstein-Yu superlogarithmic lower bound.

(7) Arithmetic circuits. Overview. The chasm at depth 3. (The Gupta-Kamath-Kayal-Saptharishi result.) Shpilka and Yehudayoff’s survey.

(8) Fine-grained complexity reductions (SETH, 3SUM).

Deliverables

Each student will scribe #lectures/#students, and present a lecture. Grade is based on scribe, presentation, and class participation.

Scribes: due 72 hours from the time of the lecture. Feel free to send me a draft and ask for suggestions before the cutoff. Scribe templates: lyx, tex. Optionally, the lectures will be posted on my blog. Using this template minimizes the risk that my wordpress compiler won’t work.

Presentations: should convey both a high-level overview of the proof of the result, as well as a self-contained exposition of at least one step. Talk to me for suggestions. Discuss with me your presentation plan at least 1 week before your presentation.

Presentation papers:

Note: Always look if there is an updated version. Check the authors’ webpages as well as standard repositories (arxiv, ECCC, iacr, etc.)

Pseudorandomness:

Amnon Ta-Shma. Explicit, almost optimal, epsilon-balanced codes.

(Perhaps Avraham Ben-Aroya, Dean Doron and Amnon Ta-Shma. An efficient reduction from non-malleable extractors to two-source extractors, and explicit two-source extractors with near-logarithmic min-entropy.)

Prahladh Harsha, Srikanth Srinivasan: On Polynomial Approximations to AC0

SAT algorithms for depth-2 threshold circuits:

Here and here,

Quasirandom groups:

Higher-dimensional corners

Data structures:

Parts of the Larsen-Weinstein-Yu superlogarithmic lower bound that we left out.

Any of the many lower bounds we didn’t see.

Arithmetic circuit complexity:

Parts of the reduction we did not see.

Survey

Tavenas

Regularity lemmas:

TTV, Skorski

Fine-grained reductions:

Dynamic problems, LCS, edit distance.

There are many other reductions in this area.

Classes

  1. 2017-09-08 Fri
  2. 2017-09-12 Tue
  3. 2017-09-15 Fri
  4. 2017-09-19 Tue
  5. 2017-09-22 Fri
  6. 2017-09-26 Tue
  7. 2017-09-29 Fri
  8. 2017-10-03 Tue. Additive combinatorics workshop. NO CLASS
  9. 2017-10-06 Fri. Additive combinatorics workshop. NO CLASS
  10. 2017-10-10 Tue
  11. W-R Lectures by Gowers
  12. 2017-10-13 Fri
  13. 2017-10-17 Tue
  14. 2017-10-20 Fri
  15. 2017-10-24 Tue
  16. 2017-10-27 Fri
  17. 2017-10-31 Tue
  18. 2017-11-03 Fri
  19. 2017-11-07 Tue
  20. 2017-11-10 Fri
  21. 2017-11-14 Tue. Workshop. Class planned as usual, but check later
  22. 2017-11-17 Fri. Workshop. Class planned as usual, but check later
  23. 2017-11-21 Tue
  24. 2017-11-24 Fri. Thanksgiving, no classes.
  25. 2017-11-28 Tue
  26. 2017-12-01 Fri
  27. 2017-12-05 Tue
  28. 2017-12-08 Fri
  29. 2017-12-12 Tue
  30. 2017-12-15 Fri

STOC/FOCS PC meetings: Does nature of decisions justify cost?

I am just back from the FOCS 2017 program committee (PC) meeting. This is my second time attending a physical PC meeting, both times for FOCS. In the past I was sorry that I had to decline some invitations for personal reasons. Just like the previous instance, I was impressed with the depth and thoroughness of the discussion. I also had a great time sitting in the same room with so many esteemed colleagues, chatting about what’s cool in the new research papers, and cracking nerdy jokes. It is also a privilege for me to hear what everyone has to say about what is going on. I even managed to have some quick research meetings here and there. During the flight I even proved a great theorem (I mean, watched Trainspotting 2). And I had never been to Caltech: it is a great place with a moving history as told by the book in the drawer next to my bed in the Athenaeum.

However in the rest of the post I want to address the title. Throughout, by “PC meetings”, I mean “physical” meetings as opposed to “virtual.” Before I start, I want to stress two things. First, what I am about to say has absolutely nothing to do with this or any specific STOC/FOCS PC meeting, rather it is aimed to PC meetings in general. Second, I am not saying that the decisions made are “bad.” Indeed I don’t think a substantially better way to make decisions exists (here and in many other cases).

First, the experiment. As mentioned in a previous post, this time around I ran a little experiment: Right before the meeting started, I saved the ranking of the papers, based on the reviews and the offline discussion, weighted by confidence. Then I compared it with the final decisions at the end of the meeting (with the knowledge of the total number of papers accepted). There is a handful of papers that I have a conflict with and so I don’t know about. Those I know about number 88. The two lists of 88 papers have edit distance 19, meaning that you need to change 19 papers in one list to get the other list. This is a 0.216 fraction of the papers. I consider this number negligible. Note that it is based on information that was entered in expectation of resolving many things during the meeting: extra rounds of offline discussion would have probably calibrated this much better. Think also how this compares with the NIPS experiment. In that case, the two lists of 37 papers have edit distance 21 (plus or minus one), which is a lot larger than the above. However STOC/FOCS decisions may be more “stable” than NIPS (would be fun to have any data on this).

A common experience is also that a large fraction of papers is a clear reject, a small fraction is a clear accept, and the rest are more or less rated like novels or movies; and I am not aware of a better rating scheme.

Second, how the decisions are made. I am not sure that decisions made during the meeting are significantly better than decisions made offline (I will use the word “offline” to mean “not in a physical meeting”). A meeting can start at 8:30AM and go past 6PM, with total break time about 1 hour. Many people are also jet lagged or exhausted. Working offline may be better. Sure, in a physical meeting you can literally grab someone by the arm and ask: Do you really think X should be accepted given Y? But in general there is very little time for this. On the other hand, in an offline discussion you can say: OK, we need a little extra expertise on papers x, y, z: reviewer w, please dig into it and report (happened to me). This supposedly should also take place before the physical meeting, but I think it’s done less than it should be, because the feeling is that we are going to have a meeting anyway so we might as well resolve this then. Instead what happens at times during the meeting is that expertise is missing, and there’s just no time to remedy. Sometimes, the in-person decision process can be rather chaotic. I want to stress again that I am not saying that *any* PC is not run well. My point is that this is the nature of the hard decisions that we need to make. Of course, email conversations can get very tedious. And you can also say that if there’s no threat of a physical meeting people feel less pressure to write good reviews and make good decisions, though that’s not my experience from SODA/CCC program committees. A good model could be to have a virtual meeting, with all the participants present at the same time. This could be done early on, and in fact having two such meetings could be the best thing. At my institutions we have many, many meetings virtually. During such meetings, you can still stare at someone in the camera and ask: Do you really think X should be accepted given Y?

There are many venues where decisions are made offline, such as CCC/CRYPTO/SODA/JACM/SICOMP. One can say that STOC/FOCS are more important than CCC/CRYPTO/SODA, but it’s hard to say that they are more important than say JACM. An argument is then that journal decisions are different because they involve a lot more time and care. I tend to disagree with this last point. My impression is that journal and conference reviews are rather indistinguishable when it comes to computing the accept/reject bit. The difference is only in the second-order term: journals can spend more time on details and style.

Third, the location. This meeting was held at Caltech. The Theory festival at Montreal ended with a full day of amazing tutorials on Friday. The festival is promoted as a “must-attend” event. So what we are telling the program committee is this: “Look, you really gotta attend the theory festival. If you need to attend anything at all this year, that’s the meeting! Oh, and right after that (4PM) please catch a red-eye Montreal-LAX, and be ready to start at 8:30 AM on Saturday.” Personally, I wanted to attend the theory fest, but I just couldn’t do the above. So I skipped it in favor of the PC meeting.

If we really need to have physical meetings, I think we should combine them with some other popular event, or at least put them where the center of mass is. Typically, none of this happens, and the tradition seems to be that the meeting is held where the PC chair works, which may not even be where the conference is. I am not the only one who feels this way, by the way, about this point and the others. In particular a past program committee, in an act of rebellion, organized a conference for themselves and co-located it with the PC meeting. But that’s unusual, and it’s clearly better to co-locate with the existing events, since attendance is already an issue. One argument is that the PC chair can offer a nice room with lots of support, which is hard to find elsewhere. I think this argument does not stand. The “support” consists in everybody bringing their laptops (an example of “downfall of mankind” discussed earlier), and wireless. I don’t think it’s so hard to get a room with a table and wireless in a hotel or another university. Finally, I think these meetings favor the United States even more disproportionately than the conference itself. Indeed, STOC/FOCS are at least once in a while held elsewhere (one was in Rome, for example; I didn’t go). I’ve never heard of a PC meeting held abroad; I would be curious to know of one.

I am told that this year TCC has a physical meeting (not always the case) but that it is co-located with CRYPTO, a conference typically attended by most of the crypto community. This makes sense to me.

Look at some recent STOC/FOCS PC, and think of the cost of flying everybody to the meeting, and think of the result.

Fourth, the time. As discussed earlier, do we really have to hold such meetings during week-ends? Especially in the summer, I don’t know of a reason for this. Why not Tuesday to Thursday? People don’t teach, departments are empty, and flying is less chaotic. Plus maybe you want to spend your summer week-ends going to the beach instead of being shut in a windowless room with dozens of laptops? An argument is that international flights are sometimes cheaper if you stay during the week-end. I am not sure this really makes a difference, for a number of reasons (including: not so many people from abroad, and they typically stay extra days anyway).

Synthesis.

My impression is that things are done this way mostly because of inertia: This is how it’s been done, and now good luck changing it. Also, being a STOC/FOCS chair is (or is close to) a once-in-a-lifetime appointment, which clearly one doesn’t want to screw up. Another argument I heard is that physical meetings are a better way to preserve tradition. That is, a young member of the PC learns better the trade through physical interaction.

If you put all the above things together, my impression is that the answer to the title is “no.” The resources are probably enough for at least a fraction of a postdoc, or they could be used to allow more people to actually attend the conference. My concrete proposal is to have the next meeting offline, with a mixture of desynchronized discussion and virtual synchronized meetings. At least, we should try.

Bounded independence fools And

One of the earliest (and coolest) results in pseudorandomness is the following theorem that shows that k-wise independence “fools” the And function.


Theorem 1. [1] Let x_{1},x_{2},\ldots ,x_{n} be k-wise independent random variables over \{0,1\}. Then

\begin{aligned} |\mathbb {E}[\prod _{i\le n}x_{i}]-\prod _{i\le n}\mathbb{E} [x_{i}]|\le 2^{-\Omega (k)}. \end{aligned}

Proof. First recall that the inclusion-exclusion principle shows that for any random variables y_{1},y_{2},\ldots ,y_{n} jointly distributed over \{0,1\}^{n} we have (writing \mathrm {Or}_{i} for the Or over i = 1, \ldots , n):

\begin{aligned} \mathbb{E} \prod _{i\le n}y_{i} & =1-\mathbb{E}\, \mathrm {Or}_{i}(1-y_{i})\nonumber \\ & =1-\sum _{j=1}^{n}T_{j}~~~~(1) \end{aligned}

where

\begin{aligned} T_{j}=(-1)^{j+1}\sum _{S\subseteq [n],|S|=j}\mathbb{E} \prod _{i\in S}(1-y_{i}). \end{aligned}

Moreover, if we truncate the sum in Equation (1) to the first t terms, it gives either a lower bound or an upper bound depending on whether t is odd or even.

Because this holds under any distribution, if we show that the right-hand side of Equation (1) is approximately the same when the y_{i} are n-wise independent and when they are k-wise independent, then the left-hand side of that equation will also be approximately the same in the two scenarios, giving the result. (Note that \prod _{i\le n}\mathbb{E} [x_{i}] equals the expectation of the product of the x_{i} when the latter are n-wise independent.)

Now, since the terms T_{j} only involve expectations of products of at most j variables, they are the same under n-wise and k-wise independence for every j \le k. This would conclude the argument if we can show that |T_{k}| is 2^{-\Omega (k)} (since this is the quantity that one needs to add to go from a lower/upper bound to an upper/lower bound). This is indeed the case if \sum _{i\le n}\mathbb{E} [1-x_{i}]\le k/2e, because then by Maclaurin's inequality we have

\begin{aligned} |T_{k}|=\sum _{S\subseteq [n],|S|=k}\prod _{i\in S}\mathbb{E} [1-x_{i}]\le (e/k)^{k}(\sum _{i\le n}\mathbb{E} [1-x_{i}])^{k}\le 2^{-k} \end{aligned}

where the first equality holds because the x_{i} are k-wise independent.
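To spell out the middle inequality above (a standard counting step, included for completeness): every product over a set S with |S|=k appears k! times in the expansion of \left (\sum _{i\le n}\mathbb{E} [1-x_{i}]\right )^{k}, and all terms in that expansion are nonnegative, so

\begin{aligned} \sum _{S\subseteq [n],|S|=k}\prod _{i\in S}\mathbb{E} [1-x_{i}] \le \frac {1}{k!}\left (\sum _{i\le n}\mathbb{E} [1-x_{i}]\right )^{k} \le \left (\frac {e}{k}\right )^{k}\left (\sum _{i\le n}\mathbb{E} [1-x_{i}]\right )^{k}, \end{aligned}

using k! \ge (k/e)^{k}.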

It remains to handle the case where \sum _{i\le n}\mathbb{E} [1-x_{i}]>k/2e. In this case, the expectations are so small that even running the above argument over a subset of the random variables is enough. Specifically, let n' be such that \sum _{i\le n'}\mathbb{E} [1-x_{i}]=k/2e (more formally, one can find an n' such that this holds up to an additive 1, which is enough for the argument; but I will ignore this for simplicity). Then the above argument still applies to the first n' variables. Moreover, the expectation of the product of just these n' variables is already fairly small. Specifically, because the geometric mean is at most the arithmetic mean (a fact which is closely related to Maclaurin's inequality), we have:

\begin{aligned} \prod _{i\le n'}\mathbb{E} [x_{i}]\le \left (\frac {1}{n'}\sum _{i\le n'}\mathbb{E} [x_{i}]\right )^{n'} = \left (\frac {1}{n'}(n'-k/2e)\right )^{n'} = \left (1-\frac {k}{2en'}\right )^{n'}\le 2^{-\Omega (k)}. \end{aligned}

Since under any distribution the expectation of the product of the first n variables is at most the expectation of the product of the first n', both quantities in the statement of the theorem are at most 2^{-\Omega (k)}, and the result follows. \square

An interesting fact is that this theorem is completely false if the variables are over \{-1,1\} instead of \{0,1\}, even if the independence is n-1. The counterexample is well-known: parity. Specifically, take x_{1},\ldots ,x_{n-1} uniform and independent, and let x_{n}=\prod _{i<n}x_{i}. What may be slightly less known is that this parity counterexample also shows that the error term in the theorem is tight up to the constant in the \Omega . This is because you can write any function on t bits as the disjoint union of 2^{t} rectangles.
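A quick Python sanity check of the parity counterexample (my own addition, not part of the post): for small n, the distribution below is (n-1)-wise uniform over \{-1,1\}^n, yet the expectation of the product of all n variables is 1, while the product of the individual expectations is 0.

\begin{verbatim}
from itertools import product, combinations
from math import prod

n = 4
# Parity distribution over {-1,1}^n: x_1, ..., x_{n-1} uniform, x_n = product of the others.
points = [xs + (prod(xs),) for xs in product([-1, 1], repeat=n - 1)]

# Expectation of the product of all n variables vs. product of the expectations (which is 0).
print("E[prod] =", sum(prod(p) for p in points) / len(points))   # expected: 1.0

# Any n-1 coordinates are jointly uniform over {-1,1}^{n-1}.
ok = True
for coords in combinations(range(n), n - 1):
    counts = {}
    for p in points:
        key = tuple(p[i] for i in coords)
        counts[key] = counts.get(key, 0) + 1
    ok = ok and len(counts) == 2 ** (n - 1) and len(set(counts.values())) == 1
print("(n-1)-wise uniform:", ok)                                  # expected: True
\end{verbatim}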

A side aim of this post is to try a new system I put together to put math on wordpress. I’ll test it some more and then post about it later, comparing it with the alternatives. Hopefully, my thinking that it was the lack of this system that prevented me from writing math-heavier posts was not just an excuse to procrastinate.

References

[1]   Guy Even, Oded Goldreich, Michael Luby, Noam Nisan, and Boban Velickovic. Approximations of general independent distributions. In ACM Symp. on the Theory of Computing (STOC), pages 10–16, 1992.

ECCC as a zero-formatting “publisher” for CCC proceedings?

Background: After going solo, the CCC conference is using LIPIcs as a “publisher” for the papers accepted to the conference. This involves a non-trivial amount of formatting (to put the papers in their format) and also some monetary costs.

I would like to use the opportunity that CCC is going solo to move to a model where the “publishing” involves *zero* effort from authors. This could be a selling point for the conference, and maybe set an example for others.

Specifically, in the vein of previous posts, I propose that authors of accepted papers simply send the .pdf of their paper in whatever format they like. The CCC people take care of placing a stamp “CCC 20xx camera-ready” and putting the paper on the ECCC. Papers with indecent formatting are treated exactly as papers with indecent introductions.

Disclaimer: although I am on the reviewing board of ECCC I had no discussions with the ECCC people about this.

The main benefits of ECCC are:

– Submission is painless: just send the .pdf! Again, authors can write their paper in whatever format they like.

– Indexed by DBLP

– It’s run by “us”, it’s about computational complexity and in fact “Under the auspices of the Computational Complexity Foundation (CCF)”

– It has an ISSN number (1433-8092). I am told this is important for some institutions, though I don’t know if some insist on ISBN over ISSN. If they do, perhaps there’s a way to get that too?

– They do various nice things already, like archiving papers in CD’s etc. In fact, going back to the ISBN issue, couldn’t we simply assign an ISBN to the reports from each year?

– It has no cost (given that ECCC already exists).

Another option is to use arxiv or an arxiv overlay. This would also be better than using LIPIcs, I think, but it does not enjoy many of the benefits above.