A main justification for my belief is history:

- In the 1950’s Kolmogorov conjectured that multiplication of n-bit integers requires time Ω(n^2). That’s the time it takes to multiply using the method that mankind has used for at least six millennia. Presumably, if a better method existed it would have been found already. Kolmogorov subsequently started a seminar where he presented this conjecture again. Within one week of the start of the seminar, Karatsuba discovered his famous algorithm running in time O(n^{log_2 3}) = O(n^{1.585...}). He told Kolmogorov about it, who became agitated and terminated the seminar. Karatsuba’s algorithm unleashed a new age of fast algorithms, including the next one. I recommend Karatsuba’s own account [9] of this compelling story.
- In 1968 Strassen started working on proving that the standard O(n^3) algorithm for multiplying two n × n matrices is optimal. The next year his landmark O(n^{log_2 7}) algorithm appeared in his paper “Gaussian elimination is not optimal” [12].
- In the 1970s Valiant showed that the graphs of circuits computing certain linear transformations must be *super-concentrators*, graphs with certain strong connectivity properties. He conjectured that super-concentrators must have a super-linear number of wires, from which super-linear circuit lower bounds follow [13]. However, he later disproved the conjecture [14]: building on a result of Pinsker he constructed super-concentrators using a linear number of edges.
- At the same time Valiant also defined *rigid* matrices and showed that an explicit construction of such matrices yields new circuit lower bounds. A specific matrix that was conjectured to be sufficiently rigid is the Hadamard matrix. Alman and Williams recently showed that, in fact, the Hadamard matrix is not rigid [1].
- After finite automata, a natural step in lower bounds was to study slightly more general programs with constant memory. Consider a program that only maintains O(1) bits of memory, and reads the input bits in a fixed order, where bits may be read several times. It seems quite obvious that such a program cannot compute the majority function in polynomial time. This was explicitly conjectured by several people, including [5]. Barrington [4] famously disproved the conjecture by showing that those seemingly very restricted constant-memory programs are in fact equivalent to log-depth circuits, which can compute majority (and many other things).
- [Added 2/18] Mansour, Nisan, and Tiwari conjectured [10] in 1990 that computing universal hash functions on n bits requires circuit size Ω(n log n). Their conjecture was disproved in 2008 [8], where a circuit of size O(n) was given.
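Karatsuba’s algorithm from the first item above can be sketched in a few lines: reduce one n-bit product to three products of roughly half-size numbers, instead of the four that the schoolbook recurrence implicitly uses. The following Python is a simplified illustration, not Karatsuba’s original presentation:

```python
def karatsuba(x: int, y: int) -> int:
    """Multiply non-negative integers with 3 recursive half-size
    multiplications instead of 4, giving O(n^{log2 3}) bit operations."""
    if x < 10 or y < 10:                      # small base case
        return x * y
    half = max(x.bit_length(), y.bit_length()) // 2
    hi_x, lo_x = x >> half, x & ((1 << half) - 1)
    hi_y, lo_y = y >> half, y & ((1 << half) - 1)
    a = karatsuba(hi_x, hi_y)                 # high * high
    b = karatsuba(lo_x, lo_y)                 # low * low
    c = karatsuba(hi_x + lo_x, hi_y + lo_y)   # (hi+lo) * (hi+lo)
    # cross terms recovered as c - a - b, saving one multiplication
    return (a << (2 * half)) + ((c - a - b) << half) + b
```

The single subtraction `c - a - b` is the whole trick: it recovers the two cross terms from one multiplication.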
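Strassen’s saving from the second item is analogous: a 2 × 2 (block) product with 7 multiplications instead of 8, which, recursing on blocks, gives the O(n^{log_2 7}) bound. Below is a scalar sketch of the identities only; a full implementation would recurse on submatrices:

```python
def strassen_2x2(A, B):
    """Multiply two 2x2 matrices with Strassen's 7 products
    instead of the obvious 8 (entries here are plain numbers)."""
    (a, b), (c, d) = A
    (e, f), (g, h) = B
    m1 = (a + d) * (e + h)
    m2 = (c + d) * e
    m3 = a * (f - h)
    m4 = d * (g - e)
    m5 = (a + b) * h
    m6 = (c - a) * (e + f)
    m7 = (b - d) * (g + h)
    return [[m1 + m4 - m5 + m7, m3 + m5],
            [m2 + m4, m1 - m2 + m3 + m6]]
```

Applying these identities to half-size blocks instead of scalars yields the recursion T(n) = 7 T(n/2) + O(n^2).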

And these are just some of the more famous ones. The list goes on and on. In number-on-forehead communication complexity, the function Majority-of-Majorities was a candidate for being hard for more than logarithmically many parties. This was disproved in [3] and subsequent works, where many other counter-intuitive protocols are presented. In data structures, would you think it possible to switch between binary and ternary representation of a number using constant time per digit and *zero* space overhead? Turns out it is [11, 7]. Do you believe factoring is hard? Then you also believe there are pseudorandom generators where each output bit depends on only O(1) input bits [2]. Known algorithms for directed connectivity use either super-polynomial time or polynomial memory. But if you are given access to polynomial memory full of junk that you can’t delete, then you can solve directed connectivity using only logarithmic (clean) memory and polynomial time [6]. And I haven’t even touched on the many broken conjectures in cryptography, most recently related to obfuscation.

On the other hand, arguably the main thing that’s surprising in the lower bounds we have is that they can be proved at all. The bounds themselves are hardly surprising. Of course, the issue may be that we can prove so few lower bounds that we shouldn’t expect surprises. Some of the undecidability results I do consider surprising, for example Hilbert’s 10th problem. But what is actually surprising in those results are the *algorithms*, showing that even very restricted models can simulate more complicated ones (the same holds for the theory of NP-completeness). In terms of lower bounds they all build on diagonalization, that is, going through every program and flipping the answer, which is boring.

The evidence is clear: we have grossly underestimated the reach of efficient computation, in a variety of contexts. All signs indicate that we will continue to see bigger and bigger surprises in upper bounds, including P=NP. Do I really believe the formal inclusion P=NP? Maybe; let me not pick parameters. What I believe is that the idea that lower bounds are obviously true and we just can’t prove them is not only baseless but even clashes with the historical evidence. It’s the upper bounds that are missing.

[1] Josh Alman and R. Ryan Williams. Probabilistic rank and matrix rigidity. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, Montreal, QC, Canada, June 19-23, 2017, pages 641–652, 2017.

[2] Benny Applebaum, Yuval Ishai, and Eyal Kushilevitz. Cryptography in NC^0. SIAM J. on Computing, 36(4):845–888, 2006.

[3] László Babai, Anna Gál, Peter G. Kimmel, and Satyanarayana V. Lokam. Communication complexity of simultaneous messages. SIAM J. on Computing, 33(1):137–166, 2003.

[4] David A. Mix Barrington. Bounded-width polynomial-size branching programs recognize exactly those languages in NC^1. J. of Computer and System Sciences, 38(1):150–164, 1989.

[5] Allan Borodin, Danny Dolev, Faith E. Fich, and Wolfgang J. Paul. Bounds for width two branching programs. In Proceedings of the 15th Annual ACM Symposium on Theory of Computing, 25-27 April, 1983, Boston, Massachusetts, USA, pages 87–93, 1983.

[6] Harry Buhrman, Richard Cleve, Michal Koucký, Bruno Loff, and Florian Speelman. Computing with a full memory: catalytic space. In ACM Symp. on the Theory of Computing (STOC), pages 857–866, 2014.

[7] Yevgeniy Dodis, Mihai Pǎtraşcu, and Mikkel Thorup. Changing base without losing space. In 42nd ACM Symp. on the Theory of Computing (STOC), pages 593–602. ACM, 2010.

[8] Yuval Ishai, Eyal Kushilevitz, Rafail Ostrovsky, and Amit Sahai. Cryptography with constant computational overhead. In 40th ACM Symp. on the Theory of Computing (STOC), pages 433–442, 2008.

[9] A. A. Karatsuba. The complexity of computations. Trudy Mat. Inst. Steklov., 211(Optim. Upr. i Differ. Uravn.):186–202, 1995.

[10] Yishay Mansour, Noam Nisan, and Prasoon Tiwari. The computational complexity of universal hashing. Theoretical Computer Science, 107:121–133, 1993.

[11] Mihai Pǎtraşcu. Succincter. In 49th IEEE Symp. on Foundations of Computer Science (FOCS). IEEE, 2008.

[12] Volker Strassen. Gaussian elimination is not optimal. Numer. Math., 13:354–356, 1969.

[13] Leslie G. Valiant. On non-linear lower bounds in computational complexity. In ACM Symp. on the Theory of Computing (STOC), pages 45–53, 1975.

[14] Leslie G. Valiant. Graph-theoretic arguments in low-level complexity. In 6th Symposium on Mathematical Foundations of Computer Science, volume 53 of Lecture Notes in Computer Science, pages 162–176. Springer, 1977.


Barak for pseudorandom functions: (**e.g.**, see [MV12])

Wigderson for communication complexity: *(e.g., see […])*

I am not saying that these are not appropriate citation styles (I leave this determination to you). As for me, I am just happy that my work had an impact, **for example**, in three different areas.

My suggestion is to avoid et al. and instead spell out every name (as in Aaron and Zuck) or every initial (as in AZ). It isn’t perfect, but improvements like randomly permuting the order still aren’t easy to implement. The suggestion actually cannot be implemented in journals like computational complexity, which force the authors into an idiosyncratic style that uses et al. But it doesn’t matter too much; nobody reads papers in those formats anyway, as we discussed several times.

Special Topics in Complexity Theory, Fall 2017. Instructor: Emanuele Viola

Guest lecture by Huacheng Yu on dynamic data structure lower bounds, for the 2D range query and 2D range parity problems. Thanks to Huacheng for giving this lecture and for feedback on the write-up.

**What is covered**.

- Overview of Larsen’s lower bound for 2D range counting.
- Extending these techniques to 2D range parity.

**Definition 1.** 2D range counting

Give a data structure that maintains a weighted set of 2-dimensional points with integer coordinates, supporting the following operations:

**UPDATE**: Add a (point, weight) tuple to the set.
**QUERY**: Given a query point (x, y), return the sum of the weights of the points (x_i, y_i) in the set satisfying x_i ≤ x and y_i ≤ y.

**Definition 2.** 2D range parity

Give a data structure that maintains an unweighted set of 2-dimensional points with integer coordinates, supporting the following operations:

**UPDATE**: Add a point to the set.
**QUERY**: Given a query point (x, y), return the parity of the number of points (x_i, y_i) in the set satisfying x_i ≤ x and y_i ≤ y.

Both of these definitions extend easily to the d-dimensional case, but we state the 2D versions as we will mainly work with those.
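As a reference point for the two definitions, here is a deliberately naive Python implementation of both problems, assuming the dominance semantics (x_i ≤ x and y_i ≤ y). Queries scan the whole set, so it only pins down the intended behavior, not the efficiency:

```python
class NaiveRange2D:
    """Unoptimized reference for 2D range counting / parity:
    each query takes time linear in the number of points."""
    def __init__(self):
        self.points = []          # list of ((x, y), weight) pairs

    def update(self, point, weight=1):
        self.points.append((point, weight))

    def query_sum(self, x, y):
        # weighted version: sum of weights of dominated points
        return sum(w for (xi, yi), w in self.points if xi <= x and yi <= y)

    def query_parity(self, x, y):
        # unweighted version: parity of the number of dominated points
        return sum(1 for (xi, yi), _ in self.points if xi <= x and yi <= y) % 2
```

The lower bounds below say how far any structure can improve on this while keeping both operations fast.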

All upper bounds assume the RAM model with word size w = Θ(log n).

Upper bounds: Using range trees, we can create a data structure for 2D range counting, with all update and query operations taking time O(log^2 n). With extra tricks, we can make this work for 2D range parity with operations running in time O((log n / log log n)^2).
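One standard way to realize an O(log^2 N)-per-operation structure over a fixed N × N grid is a two-dimensional Fenwick (binary indexed) tree; this is a sketch of that folklore structure, not necessarily the exact range-tree construction meant above:

```python
class Fenwick2D:
    """2D binary indexed tree over an N x N grid (coordinates 1..N):
    point updates and dominance-sum queries in O(log^2 N) time."""
    def __init__(self, n):
        self.n = n
        self.t = [[0] * (n + 1) for _ in range(n + 1)]

    def update(self, x, y, w):
        i = x
        while i <= self.n:
            j = y
            while j <= self.n:
                self.t[i][j] += w
                j += j & (-j)         # move to the next covering interval
            i += i & (-i)

    def query(self, x, y):
        """Sum of weights of points (xi, yi) with xi <= x and yi <= y."""
        s, i = 0, x
        while i > 0:
            j = y
            while j > 0:
                s += self.t[i][j]
                j -= j & (-j)         # strip the lowest set bit
            i -= i & (-i)
        return s
```

Taking the answer mod 2 turns the same structure into a (non-optimal) solution for 2D range parity.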

Lower bounds. There are a series of works on lower bounds:

- Fredman, Saks ’89 – 1D range parity requires Ω(log n / log log n).
- Patrascu, Demaine ’04 – 1D range counting requires Ω(log n).
- Larsen ’12 – 2D range counting requires Ω((log n / log log n)^2).
- Larsen, Weinstein, Yu ’17 – 2D range parity requires Ω(log^{1.5} n), up to log log n factors.

This lecture presents the recent results of [Larsen ’12] and [Larsen, Weinstein, Yu ’17]. They both use the same general approach:

- Show that, for an efficient approach to exist, the problem must demonstrate some property.
- Show that the problem doesn’t have that property.

All lower bounds are in the cell-probe model with word size w = Θ(log n).

We consider a general data structure problem, where we require a structure that supports updates and queries of an unspecified nature. We further assume that there exists an efficient solution with update time t_u and query time t_q. We will restrict our attention to operation sequences of the form u_1, u_2, …, u_n, q. That is, a sequence of n updates followed by a single query q. We fix a distribution over such sequences, and show that the problem is still hard.

We divide the updates into k epochs, so that our sequence becomes:

U_k, U_{k-1}, …, U_1, q,

where |U_i| = β^i for a parameter β ≥ 2, and U_1 is the most recent epoch. The epochs are multiplicatively shrinking. With this requirement, we have that k = Θ(log n / log β).
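As a sanity check on the epoch structure, a few lines of Python confirm that geometrically shrinking epochs cover Θ(n) updates using only logarithmically many epochs (β = 2 here for concreteness; the lecture leaves the parameter free):

```python
import math

def epoch_sizes(n, beta=2):
    """Epochs of sizes beta^k, ..., beta^1 (most recent last):
    multiplicatively shrinking, with k = Theta(log n / log beta)."""
    k = int(math.log(n, beta))
    return [beta ** i for i in range(k, 0, -1)]

sizes = epoch_sizes(10**6)
# the largest epoch dominates the sum, so the total is Theta(n)
```

The point exploited later is that everything written after epoch i is small compared to |U_i| itself.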

Let S be the set of all memory cells used by the data structure when run on the sequence of updates. Further, let S_i ⊆ S be the set of memory cells which are accessed by the data structure at least once in epoch U_i, and never again in a later epoch.

**Claim 2.** There exists an epoch i such that the data structure probes at most t_q/k cells from S_i in expectation when answering the query at the end. Note that this is simply our query time divided by the number of epochs. In other words, the data structure can’t afford to read more cells than that from each of the k disjoint sets without breaking its promise on the query run time.

Claim 2 implies that there is an epoch which has the smallest effect on the final answer. We will call this the “easy” epoch.

**Idea**: The set S_i contains “most” of the information about U_i among all memory cells in S. The cells in S_j for j > i are not updated past epoch U_j, and hence should contain no information relative to the updates in U_i. Epochs are progressively shrinking, so the sets S_j for j < i are small, and the total number of such cells touched during the query operation should be small.

Having set up the framework for how to analyze the data structure, we now introduce a communication game where two parties attempt to solve an identical problem. We will show that an efficient data structure implies an efficient solution to this communication game. If the message is smaller than the entropy of the updates of epoch U_i (conditioned on preceding epochs), this gives an information-theoretic contradiction. The trick is to find a way for the encoder to exploit the small number of probed cells to send a short message.

**The game**. The game consists of two players, Alice and Bob, who must jointly compute a single query after a series of updates. The model is as follows:

- Alice has all of the update epochs U_k, …, U_1. She also has an index i, which corresponds to the “easy” epoch as defined above.
- Bob has all update epochs EXCEPT for U_i. He also has a random query q. He is aware of the index i.
- Communication can only occur in a single direction, from Alice to Bob.
- We assume some fixed input distribution over updates and queries.
- They win this game if Bob successfully computes the correct answer for the query q.

Then we will show the following generic theorem, relating this communication game to data structures for the corresponding problem:

**Theorem 3.** If there is a data structure with update time t_u that probes at most p cells from S_i in expectation when answering the final query q, then the communication game has an efficient solution, with communication cost O((δ|S_i| + Σ_{j<i} |S_j|) · w), and success probability at least δ^{O(p)}. This holds for any choice of δ ∈ (0, 1).

Before we prove the theorem, we consider specific parameters for our problem. If we pick

then, after plugging in the parameters, the communication cost is much smaller than the entropy of U_i. Note that we could always trivially achieve cost O(|U_i| · w) by having Alice send Bob all of U_i, so that he can compute the solution of the problem with no uncertainty. The success probability obtained is also significantly better than what could be achieved trivially by having Bob output a random answer to the query, independent of the updates.

Proof.

We assume we have a data structure for the update / query problem. Then Alice and Bob will proceed as follows:

**Alice’s steps**.

- Simulate the data structure on the updates U_k, …, U_1. While doing so, keep track of memory cell accesses and compute the sets S_k, …, S_1.
- Sample a random subset C ⊆ S_i, such that |C| = δ|S_i|.
- Send C ∪ S_{i−1} ∪ ⋯ ∪ S_1.

We note that in Alice’s last step, to send a cell, she sends a tuple holding the cell ID and the cell state before the query was executed. Also note that she doesn’t distinguish to Bob which cells are in which sets of the union.

**Bob’s steps**.

- Receive the set of cells C ∪ S_{i−1} ∪ ⋯ ∪ S_1 from Alice.
- Simulate the data structure on epochs U_k, …, U_{i+1}. Snapshot the current memory state of the data structure as M.
- Simulate the query algorithm for q. Every time the algorithm attempts to probe a cell c, Bob checks if c is among the cells received from Alice. If it is, he lets the algorithm probe the received copy. Otherwise, he lets it probe the cell from M.
- Bob returns the result from the query algorithm as his answer.

If the query algorithm does not probe any cell in S_i \ C, then Bob succeeds, as he can exactly simulate the data structure query. Since the query probes at most p cells in S_i (in expectation), and Bob has a random subset of them of size δ|S_i|, the probability that the query probes only cells from his subset is at least roughly δ^p. The communication cost is the cost of Alice sending the cells to Bob, which is O((δ|S_i| + Σ_{j<i} |S_j|) · w).

The extension to 2D range parity proceeds in nearly identical fashion, with a similar theorem relating data structures to communication games.

**Theorem 1.** Consider an arbitrary data structure problem where queries have 1-bit outputs. If there exists a data structure having:

- update time t_u,
- query time t_q,
- probes at most p cells from S_i in expectation when answering the last query.

Then there exists a protocol for the communication game with O((δ|S_i| + Σ_{j<i} |S_j|) · w) bits of communication and success probability at least 1/2 + δ^{O(p)}, for any choice of δ ∈ (0, 1). Again, we plug in the parameters from 2D range parity. If we set

then the cost is much smaller than the entropy of U_i, and the success probability simplifies to 1/2 plus a noticeable advantage.

We note that, if we had n different queries, then by randomly guessing on all of them, with constant probability we could be correct on as many as n/2 + Ω(√n) of them. In this case, the probability of being correct on a single one, amortized, is 1/2 + Ω(1/√n).

Proof. The communication protocol will be slightly adjusted. We assume an a priori distribution on the updates and queries. Bob will then compute the posterior distribution, based on what he knows and what Alice sends him. He then computes the maximum-likelihood answer to the query q. We thus need to figure out what Alice can send so that the answer to q is often biased towards either 0 or 1.

We assume the existence of some public randomness available to both Alice and Bob. Then we adjust the communication protocol as follows:

**Alice’s modified steps**.

- Alice samples, using the public randomness, a subset C of ALL memory cells, such that each cell is sampled independently with probability δ. Alice sends the contents of the sampled cells to Bob. Since Bob can mimic the sampling, he gains additional information from knowing which sampled cells were and weren’t written.

**Bob’s modified steps**.

- Denote by P the set of memory cells probed by the data structure when Bob simulates the query algorithm. That is, P is what Bob “thinks” the data structure will probe during the query; the actual set of cells may be different, since with full knowledge of the updates the data structure might use that information to determine what to probe. Bob will use P to compute the posterior distribution.

Define the function f(z) to be the “bias” when the answer takes on the value z. In particular, this function is conditioned on what Bob receives from Alice. We can then clarify the definition of f as

In particular, f has the following two properties:

In these statements, the expectation is over everything that Bob knows, and the probabilities are also conditioned on everything that Bob knows. The randomness comes from what he doesn’t know. We also note that when the query probes no cells unknown to Bob, then the bias is always 1, since the posterior distribution will put all its weight on the correct answer of the query.

Finishing the proof requires the following lemma:

**Lemma 2.** For any with the above two properties, there exists a such that and

Note that the sum inside the absolute values is the bias when .


I spent months fishing out, producing, and emailing back-and-forth documents. I found it a little strange that my being tenured did not affect their evaluation of my financial stability in the least. I thought I could provide a small but stable cash flow that they could reliably bleed white over the course of my remaining lifetime. The only logical explanation I have is that they benefit if I default. Instead, they were very curious about exactly why I wrote multiple checks for a few thousand dollars that were cashed in California.

The barrage of bureaucracy got to the point that I had to switch lenders, in favor of someone who was less demanding in that department. At long last, I got back into the market, only to find out that the document I had chased so hard was almost completely worthless. To explain it in one word: appraisal.

This buyer-ready commitment is still contingent on appraisal. This means that after the offer is accepted, the bank still has to go there and see the property, and decide if it is valued right. Only in that case do I get the mortgage. That means that the seller can’t be sure I have the dough, so why should they bother with me? Indeed, they don’t. The only slight advantage that this document provides is a little saving in time over someone who has to get a mortgage from scratch. But that has nothing to do with competing against cash buyers.

For the benefit of posterity, let me list the three main contingencies related to buying real-estate the old way.

MORTGAGE: This is whether the bank thinks that you (the buyer) are financially stable enough to be given a loan. This is the check that you can preprocess with the “buyer-ready commitment.”

APPRAISAL: As mentioned above, this is whether the bank thinks that the *property* is actually worth the money they put down. This can’t be done until after an offer is accepted, requires one or two appraisers, and guess who pays for them. In today’s crazy market, when properties are sold way over asking price, you can’t be sure at all that the appraiser will say the house is worth what you pay for it. At least, I can’t. And if they don’t, you are supposed to pay for the difference, which most likely you don’t have. For example, putting down all your savings of $200k, you can get a loan of $800k, for a purchase price of $1M. The house which you saw listed for $800k is sold for $1M, but the appraiser says the right price is $900k. Either you find another $100k quick, or you lose the 5% you gave at the purchase and sale (and the deal is over). Appraisal should not be confused with assessment, which is how much the town thinks the house is worth for tax purposes.
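The arithmetic in the example is just a shortfall computation; in code (a toy model of the scenario above, not financial advice):

```python
def appraisal_shortfall(offer, appraised_value):
    """If the appraisal comes in below the accepted offer, the bank
    sizes the loan off the appraised value, and the buyer must cover
    the difference in extra cash (or walk away and lose the deposit)."""
    return max(0, offer - appraised_value)

# the example above: $1M accepted offer, $900k appraisal
# leaves an extra $100k for the buyer to find quickly
```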

INSPECTION: OK, you can forget this. Moreover, from my experience a general inspection is nearly useless. If you are paying $1M for a house, why do you care if the boiler needs to be updated? Anything which interests me, like whether this house has lead/asbestos/mold/structural damage/pests etc., the inspector can’t answer on the spot. For each of those things you need a different specialist, whom you can’t get in time, and who can’t even do the job until the house is yours (because, for example, they can’t collect samples).

The running joke in the area where I am looking continues to be to list houses ridiculously below market price, and then have inexperienced families stress over their offers just to see them wiped out by yet another $1M cash. There are reasons slightly more subtle than my poverty why I think this is outrageous. Today’s house-buying protocol does nothing but force poor people into gambling desperate offers which could result in their financial ruin. Why don’t we also legalize Russian Roulette then? I think today’s protocol should be made illegal. That is, we should find a way so that someone with a mortgage has a fair shot at buying a house. There are several ways in which this could be realized. For example, the offers should not reveal the appraisal contingency. The fact that the buyer pays for the appraisal prevents them from making baseless offers. And the millionaires who offered less can wait one day for the appraisal to come back.

Nevertheless, after a 3-year ordeal, I am now a homeowner. Here’s how my offer went. First, it so happens that I was sick on that fateful Thursday. At around noon, a new listing pops up. The open house is scheduled for the week-end, so I might just wait for that, right? I instantly call and schedule a showing for the afternoon. At around 6PM, with effort I manage to get to the house. As usual, there are already 5 other interested parties, and the broker is busy scheduling more visits over the phone. At 9PM we put in an offer with a 16-hour deadline. The offer is completely “clean:” here’s the money, no contingencies, no questions asked. Moreover, it is over asking, though not by very much. My wife has not seen the property.

I then go to a pharmacy to buy medications. There I meet someone who was checking out the house at the same time as me! They say the house needs $0.5M in work, which I later take as a move to kick me out of the competition. They also ask me if I’d be interested in putting in an offer.

To my astonishment, our offer is accepted on Friday morning. For once, I was the annoying person who took the property out of the market before the open house! There is however a small caveat: you wouldn’t think that the above gets you a house where you can actually live, would you?

Special Topics in Complexity Theory, Fall 2017. Instructor: Emanuele Viola

In this lecture we study lower bounds on data structures. First, we define the setting. We have n bits of data, stored in m bits of memory (the data structure), and want to answer q queries about the data. Each query is answered with t probes. There are two types of probes:

- bit-probes, which return one bit from the memory, and
- cell-probes, in which the memory is divided into cells of w bits, and each probe returns one cell.

The queries can be adaptive or non-adaptive. In the adaptive case, the data structure probes locations which may depend on the answers to previous probes. For bit-probes this means that we answer a query with depth-t decision trees.

Finally, there are two types of data structure problems:

- The static case, in which we map the data to the memory arbitrarily and afterwards the memory remains unchanged.
- The dynamic case, in which we have update queries that change the memory and also run in bounded time.

In this lecture we focus on the non-adaptive, bit-probe, and static setting. Some trivial extremes for this setting are the following. Any problem (i.e., collection of queries) admits data structures with the following parameters:

- t = 1 and m = q, i.e. you write down all the answers, and
- t = n and m = n, i.e. you can always answer a query about the data if you read the entire data.

Next, we review the best current lower bound, a bound proved in the 80’s by Siegel [Sie04] and rediscovered later. We state and prove the lower bound in a different way. The lower bound is for the problem of -wise independence.

**Problem 1.** The data is a seed of size n for a k-wise independent distribution over {0,1}^q. A query i is answered by the i-th bit of the sample.

The question is: if we allow a little more space than seed length, can we compute such distributions fast?

**Theorem 2.** For the above problem with n = Θ(k log q) it holds that t ≥ Ω(log q / log(m/n)).

It follows that if m = O(n) then t is Ω(log q). But if m ≥ n^{1+Ω(1)} then nothing is known.

Proof. Let p ∈ (0, 1) be a subsampling probability to be chosen later. We have the memory of m bits and we are going to subsample it. Specifically, we will select each bit of the memory with probability p, independently.

The intuition is that we will shrink the memory but still answer a lot of queries, and derive a contradiction because of the seed length required to sample k-wise independence.

For the “shrinking” part we have the following. We expect to keep pm memory bits. By a Chernoff bound, it follows that we keep O(pm) bits except with probability 2^{−Ω(pm)}.

For the “answer a lot of queries” part, recall that each query probes t bits from the memory. We keep one of the q queries if it so happens that we keep all the bits that it probed in the memory. For a fixed query, the probability that we keep all its probes is p^t.

We claim that with probability at least p^t/2, we keep at least q·p^t/2 queries. This follows by Markov’s inequality. We expect to not keep q(1 − p^t) queries on average. We now apply Markov’s inequality to get that the probability that we don’t keep at least q·p^t/2 queries is at most (1 − p^t)/(1 − p^t/2) ≤ 1 − p^t/2.

Thus, if 2^{−Ω(pm)} < p^t/2, then there exists a fixed choice of memory bits that we keep which achieves both the “shrinking” part and the “answer a lot of queries” part as above. This inequality holds for our eventual choice of p, since pm will be large compared to t log(1/p). But now we have only O(pm) bits of memory while still answering as many as q·p^t/2 queries.

The minimum seed length to answer that many queries while maintaining k-wise independence is Ω(k log(q·p^t)). Therefore the memory has to be at least as big as the seed. This yields

O(pm) ≥ Ω(k log(q·p^t)),

from which the result follows by choosing p appropriately (for example so that p^t = 1/√q).

from which the result follows.

This lower bound holds even if the memory bits are filled arbitrarily (rather than having entropy at most n). It can also be extended to adaptive cell probes.

We will now show a conceptually simple data structure which nearly matches the lower bound. Pick a random bipartite graph with m nodes on the left and q nodes on the right. Every node on the right side has degree d. The memory is filled with m uniform bits, and we answer each query with an XOR of its d neighbor bits. By the Vazirani XOR lemma, it suffices to show that the XOR of any subset S of at most k queries is unbiased. Hence it suffices that every such subset S has a unique neighbor. For that, in turn, it suffices that S has a neighborhood of size greater than d|S|/2 (because if every element in the neighborhood of S has two neighbors in S then S has a neighborhood of size at most d|S|/2). We pick the graph at random and show by standard calculations that it has this property with non-zero probability.

It suffices to have the expected number of subsets violating the unique-neighbor property be strictly less than 1; a standard union bound over subset sizes shows this for a suitable degree d. We can match the lower bound in two settings:

- if k = q^{Ω(1)}, then d = O(log q) suffices,
- and in general d = O(k + log q) suffices.
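The construction itself is a few lines of Python: sample the bipartite graph, store the uniform memory bits, and answer query i by XOR-ing its d probed bits. (Verifying the unique-neighbor property is the separate probabilistic calculation discussed above; the names below are mine, not from the lecture.)

```python
import random

def build_xor_structure(m, q, d, rng=random.Random(0)):
    """Sample the bipartite graph: q queries (right nodes), each
    wired to d distinct memory bits out of m (left nodes)."""
    return [rng.sample(range(m), d) for _ in range(q)]

def answer(graph, memory, i):
    """Answer query i non-adaptively with t = d bit-probes:
    the XOR of its d neighbor bits in the memory."""
    bits = 0
    for j in graph[i]:
        bits ^= memory[j]
    return bits
```

With a good graph and uniform (or kd-wise independent) memory bits, the q answers form a k-wise independent distribution.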

**Remark 3.** It is enough if the memory is kd-wise independent as opposed to completely uniform, so one can in turn generate the memory itself from a short seed. An open question is if you can improve the seed length to optimal.

As remarked earlier, the lower bound does not give anything when m is much larger than n. In particular, it is not clear if it rules out data structures with t = 2 probes and m polynomial in n. Next we show a lower bound which applies to this case.

**Problem 4.** Take the n bits of data to be a seed for an ε-biased distribution over {0,1}^q. The queries, like before, are the bits of that distribution. Recall that n = O(log(q/ε)).

**Theorem 5.** You need m = Ω(q) for t = 2 (and ε a sufficiently small constant).

Proof. Every query is answered by looking at t = 2 bits. But at least q/16 queries are answered by the same 2-bit function f of their probes (because there is a constant number of functions on 2 bits). There are two cases for f:

- f is linear (or affine). Suppose for the sake of contradiction that q/16 > m. Then you have a linear dependence, because the space of linear functions on m bits has dimension m. This implies that if you XOR the bits of the corresponding queries, you always get 0 (or always 1, in the affine case). This in turn contradicts the assumption that the distribution has small bias.
- f is AND (up to negating the input variables or the output). In this case, we keep collecting queries as long as each probes at least one new memory bit. If q/16 > m, when we stop we have a query left such that both of its probes hit bits that have already been probed. This means that there exist two queries u and v whose probes cover the probes of a third query w. This in turn implies that the query answers are not close to uniform. That is because there exist answers to u and v that fix the bits probed by them, and so also fix the bits probed by w. But this contradicts the small bias of the distribution.


Special Topics in Complexity Theory, Fall 2017. Instructor: Emanuele Viola

In these lectures we prove the corners theorem for pseudorandom groups, following Austin [Aus16]. Our exposition has several non-major differences with that in [Aus16], which may make it more computer-science friendly. The instructor suspects a proof can also be obtained via certain local modifications and simplifications of Green’s exposition [Gre05b, Gre05a] of an earlier proof for the abelian case. We focus on the case of the group G = SL(2, q) for simplicity, but the proof immediately extends to other pseudorandom groups.

**Theorem 1.** Let G = SL(2, q). Every subset A ⊆ G × G of density δ contains a corner, i.e., a set of the form {(x, y), (xz, y), (x, zy)} with z ≠ 1 (for |G| large enough with respect to δ).

For intuition, suppose A is a product set, i.e., A = B × C for sets B, C ⊆ G. Let’s look at the quantity

E_{x,y,z}[A(x, y) · A(xz, y) · A(x, zy)],

where A(·, ·) is the indicator function of A, i.e., A(x, y) = 1 iff (x, y) ∈ A. Note that the random variable in the expectation is equal to 1 exactly when (x, y), (xz, y), (x, zy) form a corner in A (allowing z = 1). We’ll show that this quantity is greater than 1/|G|, which implies that A contains a corner with z ≠ 1 (the terms with z = 1 contribute at most 1/|G|). Since we are taking A = B × C, we can rewrite the above quantity as

E_{x,y,z}[B(x) C(y) · B(xz) C(y) · B(x) C(zy)] = E_{x,y,z}[B(x) C(y) B(z) C(x^{-1} z y)],

where the last line follows by replacing z with x^{-1} z in the uniform distribution. If A has density δ, then |B| ≥ δ|G| and |C| ≥ δ|G|. Condition on B(x) = 1, C(y) = 1, B(z) = 1. Then the distribution of (x, y, z) is a product of three independent distributions, each uniform on a set of measure at least δ. By pseudorandomness of G, the product x^{-1} z y is close to uniform in statistical distance, so C(x^{-1} z y) = 1 with probability close to |C|/|G| ≥ δ. This implies that the above quantity is about δ^4, which is greater than 1/|G| for |G| large enough.

Given this, it is natural to try to write an arbitrary as a combination of product sets (with some error). We will make use of a more general result.

Let U be some universe (we will take U = G × G). Let g : U → [-1, 1] be a function (for us, g will be the indicator function of A). Let D be some set of functions d : U → [-1, 1], which can be thought of as “easy functions” or “distinguishers.”

**Theorem 2.** [Weak Regularity Lemma] For all ε > 0, there exists a function h = Σ_{i ≤ O(1/ε²)} c_i · d_i, where the d_i are in D and the c_i are reals, such that for all d in D,

|E_{x ∈ U}[(g(x) − h(x)) · d(x)]| ≤ ε.

The lemma is called ‘weak’ because it came after Szemerédi’s regularity lemma, which has a stronger distinguishing conclusion. However, the lemma is also ‘strong’ in the sense that Szemerédi’s regularity lemma has complexity that is a tower of exponentials in 1/ε, whereas here the number of terms is polynomial in 1/ε. The weak regularity lemma is also simpler. There also exists a proof of Szemerédi’s theorem (on arithmetic progressions) which uses weak regularity as opposed to the full regularity lemma used initially.

Proof. We will construct the approximation through an iterative process producing functions h_0, h_1, …. We will show that E[(g − h_i)²] decreases by Ω(ε²) in each iteration.

**Start**: Define h_0 := 0 (which can be realized by setting all the c_i to 0).
**Iterate**: If not done, there exists d in D such that |E[(g − h) · d]| > ε. Assume without loss of generality E[(g − h) · d] > ε.
**Update**: h' := h + λd, where λ shall be picked later.

Let us analyze the progress made by the algorithm.

where the last line follows by taking . Therefore, there can only be iterations because .
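The iterative argument above is essentially a boosting procedure, and for a small finite universe one can run it directly. Below is a minimal sketch (the name `weak_regularity` is ours; functions are represented as lists of values over the universe): it repeatedly finds a distinguisher correlating with f - g and adds a multiple of it to g; the energy of f - g drops each step, bounding the number of iterations.

```python
def weak_regularity(f, distinguishers, eps):
    """Weak regularity via boosting: given f with values in [0, 1] and a
    family of 0/1 distinguishers (all lists over the same finite universe),
    build g as a weighted sum of distinguishers so that every distinguisher
    has correlation at most eps with f - g."""
    n = len(f)
    g = [0.0] * n
    steps = 0
    while True:
        # Find the distinguisher with the largest advantage on f - g.
        best_d, best_corr = None, 0.0
        for d in distinguishers:
            corr = sum((f[i] - g[i]) * d[i] for i in range(n)) / n
            if abs(corr) > abs(best_corr):
                best_d, best_corr = d, corr
        if best_d is None or abs(best_corr) <= eps:
            return g, steps
        # Step size = correlation makes E[(f-g)^2] drop by at least eps^2.
        g = [g[i] + best_corr * best_d[i] for i in range(n)]
        steps += 1
```

Since the energy starts at most 1 and decreases by at least eps^2 per iteration, the loop terminates within 1/eps^2 steps, matching the counting in the proof.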

Returning to the lower bound proof, we will use the weak regularity lemma to approximate the indicator function for arbitrary by rectangles. That is, we take to be the collection of indicator functions for all sets of the form for . The weak regularity lemma gives us as a linear combination of rectangles. These rectangles may overlap. However, we ideally want to be a linear combination of *non-overlapping* rectangles.

**Claim 3.** Given a decomposition of into rectangles from the weak regularity lemma with functions, there exists a decomposition with rectangles which don’t overlap.

Proof. Exercise.
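One way to do the exercise: group the rows by which of the sets S appearing in the rectangles contain them, group the columns likewise, and take all products of the groups. A sketch (our code, assuming each rectangle is given as a pair of Python sets):

```python
def refine(rectangles, X, Y):
    """Common refinement: turn k possibly-overlapping rectangles S x T into
    pairwise-disjoint rectangles covering X x Y (at most 2^k * 2^k of them),
    such that each original rectangle is a union of the new ones."""
    def atoms(points, sides):
        groups = {}
        for p in points:
            key = tuple(p in s for s in sides)  # membership pattern of p
            groups.setdefault(key, set()).add(p)
        return list(groups.values())
    row_atoms = atoms(X, [S for S, _ in rectangles])
    col_atoms = atoms(Y, [T for _, T in rectangles])
    return [(R, C) for R in row_atoms for C in col_atoms]
```

Every original rectangle is exactly the union of the atoms lying inside it, so any linear combination of the original rectangles can be rewritten as a combination of the disjoint ones.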

In the above decomposition, note that it is natural to take the coefficients of rectangles to be the density of points in that are in the rectangle. This gives rise to the following claim.

**Claim 4.** The weights of the rectangles in the above claim can be the average of in the rectangle, at the cost of doubling the distinguisher error.

Consequently, we have that , where is the sum of non-overlapping rectangles with coefficients .

Proof. Let be a partition decomposition with arbitrary weights. Let be a partition decomposition with weights being the average of . It is enough to show that for all rectangle distinguishers

By the triangle inequality, we have that

To bound , note that the error is maximized for a that respects the decomposition in non-overlapping rectangles, i.e., is the union of some non-overlapping rectangles from the decomposition. This can be argued using the fact that, unlike , the value of and on a rectangle from the decomposition is fixed. But, for such , ! More formally, .

We need to get a little more from this decomposition. The conclusion of the regularity lemma holds with respect to distinguishers that can be written as where and map . We need the same guarantee for and with range . This can be accomplished by paying only a constant factor in the error, as follows. Let and have range . Write where and have range , and the same for . The error for distinguisher is at most the sum of the errors for distinguishers , , , and . So we can restrict our attention to distinguishers where and have range . In turn, a function with range can be written as an expectation for functions with range , and the same for . We conclude by observing that

Let us now finish the proof by showing a corner exists for sufficiently dense sets . We’ll use three types of decompositions for , with respect to the following three types of distinguishers, where and have range :

- ,
- ,
- .

The last two distinguishers can be visualized as parallelograms with a 45-degree angle between two segments. The same extra properties we discussed for rectangles hold for them too.

Recall that we want to show

We’ll decompose the -th occurrence of via the -th decomposition listed above. We’ll write this decomposition as . We do this in the following order:

We first show that is big (i.e., inverse polylogarithmic in expectation) in the next two claims. Then we show that the expectations of the other terms are small.

**Claim 5.** For all , the values are the same (over ) up to an error of .

Proof. We just need to get error for any product of three functions for the three decomposition types. By the standard pseudorandomness argument we saw in previous lectures,

Recall that we start with a set of density .

**Claim 6.** .

Proof. By the previous claim, we can fix . We will relate the expectation over to by a trick using the Hölder inequality: For random variables ,

To apply this inequality in our setting, write

By the Hölder inequality, we get that

Note that

where is the set in the partition that contains . Finally, by non-negativity of , we have that . This concludes the proof.

We’ve shown that the term is big. It remains to show the other terms are small. Let be the error in the weak regularity lemma with respect to distinguishers with range .

**Claim 7.** .

Proof. Replace with in the uniform distribution to get

where the first inequality is by Cauchy-Schwarz.

Now replace and reason in the same way:

Replace to rewrite the expectation as

We want to view the last three terms as a distinguisher . First, note that has range . This is because and has range .

Fix . The last term in the expectation becomes a constant . The second term only depends on , and the third only on . Hence for appropriate functions and with range this expectation can be rewritten as

which concludes the proof.

There are similar proofs to show the remaining terms are small. For , we can perform simple manipulations and then reduce to the above case. For , we have a slightly easier proof than above.

Suppose our set has density . We apply the weak regularity lemma for error . This yields the number of functions . For say , we can bound from below by the same expectation with fixed to , up to an error . Then, . The expectation of terms with is less than . So the proof can be completed for all sufficiently small .


Special Topics in Complexity Theory, Fall 2017. Instructor: Emanuele Viola

In this lecture fragment we discuss multiparty communication complexity, especially the problem of separating deterministic and randomized communication, which we connect to a problem in combinatorics.

In number-on-forehead (NOH) communication complexity each party sees all of the input except its own input . For background, it is not known how to prove negative results for parties. We shall focus on the problem of separating deterministic and randomized communication. For , we know the optimal separation: The equality function requires communication for deterministic protocols, but can be solved using communication if we allow the protocols to use public coins. For , the best known separation between deterministic and randomized protocols is vs [BDPW10]. In the following we give a new proof of this result, for a simpler function: if and only if for .

For context, let us state and prove the upper bound for randomized communication.

**Claim 1.** has randomized communication complexity .

Proof. In the NOH model, computing reduces to -party equality with no additional communication: Alice computes privately, then Alice and Bob check if .
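The two-party public-coin equality subroutine invoked here can be sketched as follows (an illustration of the standard fingerprinting protocol, not the lecture's code): the parties compare inner products of their strings with shared random vectors, one bit of communication per vector, and each vector catches unequal strings with probability 1/2.

```python
def eq_protocol(a, b, shared_rs):
    """Public-coin equality test. a and b are 0/1 lists held by the two
    parties; shared_rs are the public random 0/1 vectors. Each round costs
    one bit; unequal inputs disagree on a round with probability 1/2."""
    for r in shared_rs:
        bit_a = sum(x & y for x, y in zip(a, r)) % 2  # Alice's message
        bit_b = sum(x & y for x, y in zip(b, r)) % 2  # Bob's check
        if bit_a != bit_b:
            return False  # definitely unequal
    return True  # equal, up to error 2^-len(shared_rs)
```

In practice a constant number of rounds (say 40, for error 2^-40) suffices, so the communication is independent of the input length.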

To prove a lower bound for deterministic protocols, where , we reduce the communication problem to a combinatorial problem.

**Definition 2.** A corner in a group is , where are arbitrary group elements and .

For intuition, consider the case when is Abelian, where one can replace multiplication by addition and a corner becomes for .

We now state the theorem that gives the lower bound.

**Theorem 3.** Suppose that every subset with contains a corner. Then the deterministic communication complexity of is .

It is known that when is Abelian, then implies a corner. We shall prove that when , then implies a corner. This in turn implies communication .

Proof. We saw that a number-in-hand (NIH) -bit protocol can be written as a disjoint union of rectangles. Likewise, a number-on-forehead -bit protocol can be written as a disjoint union of cylinder intersections for some :

The proof idea of the above fact is to consider the transcripts of ; one can then see that the inputs giving a fixed transcript form a cylinder intersection.

Let be a -bit protocol. Consider the inputs on which accepts. Note that at least fraction of them are accepted by some cylinder intersection . Let . Since the first two elements in the tuple determine the last, we have .

Now suppose contains a corner . Then

This implies , which is a contradiction because and so .



This is a guest lecture by Justin Thaler regarding lower bounds on approximate degree [BKT17, BT15, BT17]. Thanks to Justin for giving this lecture and for his help with the write-up. We will sketch some details of the lower bound on the approximate degree of , and some intuition about the techniques used. Recall the definition of from the previous lecture as below:

**Definition 1.** The surjectivity function takes input , where each is interpreted as an element of . has value if and only if .

Recall from the last lecture that is the block-wise composition of the function on bits and the function on bits. In general, we will denote the block-wise composition of two functions , and , where is defined on bits and is defined on bits, by . Here, the outputs of copies of are fed into (with the inputs to each copy of being pairwise disjoint). The total number of inputs to is .

**Claim 2.** .

We will look at only the lower bound in the claim. We interpret the input as a list of numbers from . As presented in [BKT17], the proof for the lower bound proceeds in the following steps.

- Show that to approximate , it is necessary to approximate the block-composition on inputs of Hamming weight at most , i.e., show that .
- Show that , i.e., the degree required to approximate on inputs of Hamming weight at most is at least .

Step 1 was covered in the previous lecture, but we briefly recall a bit of intuition for why the claim in this step is reasonable. The intuition comes from the fact that the *converse* of the claim is easy to establish, i.e., it is easy to show that in order to approximate , it is *sufficient* to approximate on inputs of Hamming weight exactly . This is because can be expressed as an (over all range items ) of the (over all inputs ) of “Is input equal to ?” Each predicate of the form in quotes is computed exactly by a polynomial of degree , since it depends on only of the input bits, and exactly of the predicates (one for each ) evaluates to TRUE.

Step 1 of the lower bound proof for in [BKT17] shows a converse, namely that the *only* way to approximate is to approximate on inputs of Hamming weight at most .

In the previous lecture we also sketched this Step 2. In this lecture we give additional details of this step. As in the papers, we use the concept of a “dual witness.” The latter can be shown to be equivalent to bounded indistinguishability.

Step 2 itself proceeds via two substeps:

- Give a dual witness for that places little mass (namely, total mass less than ) on inputs of Hamming weight .
- By modifying , give a dual witness for that places zero mass on inputs of Hamming weight .

In [BKT17], both Substeps 2a and 2b proceed entirely in the dual world (i.e., they explicitly manipulate dual witnesses and ). The main goal of this section of the lecture notes is to explain how to replace Step 2b of the argument of [BKT17] with a wholly “primal” argument.

The intuition of the primal version of Step 2b that we’ll cover is as follows. First, we will show that a polynomial of degree that is bounded on the low Hamming weight inputs cannot be too big on the high Hamming weight inputs. In particular, we will prove the following claim.

**Claim 3.** If is a degree polynomial that satisfies on all inputs of of Hamming weight at most , then for *all* inputs .

Second, we will explain that the dual witness constructed in Step 2a has the following “primal” implication:

**Claim 4.** For , any polynomial of degree satisfying for all inputs of Hamming weight at most must satisfy for some input .

Combining Claims 3 and 4, we conclude that no polynomial of degree can satisfy

which is exactly the desired conclusion of Step 2. This is because any polynomial satisfying Equation (1) also satisfies for all of Hamming weight at most , and hence Claim 3 implies that

But Claim 4 states that any polynomial satisfying both Equations (1) and (2) requires degree strictly larger than .

In the remainder of this section, we prove Claims 3 and 4.

Proof of Claim 3. For notational simplicity, let us prove this claim for polynomials on domain , rather than .

**Proof in the case that is symmetric.** Let us assume first that is symmetric, i.e., is only a function of the Hamming weight of its input . Then for some degree univariate polynomial (this is a direct consequence of Minsky-Papert symmetrization, which we have seen in previous lectures). We can express as below, in the spirit of Lagrange interpolation.

Here, the first term, , is bounded in magnitude by , and . Therefore, we get the final bound:

**Proof for general .** Let us now consider the case of general (not necessarily symmetric) polynomials . Fix any input . The goal is to show that .

Let us consider a polynomial of degree obtained from by restricting each input such that to have the value 0. For example, if and , then . We will exploit three properties of :

- .
- Since for all inputs with , satisfies the analogous property: for all inputs with .
- If denotes the all-1s vector of length , then .

Property 3 means that our goal is to show that .

Let denote the symmetrized version of , i.e., , where the expectation is over a random permutation of , and . Since for all permutations , . But is symmetric, so Properties 1 and 2 together mean that the analysis from the first part of the proof implies that for all inputs . In particular, letting , we conclude that as desired.

**Discussion.** One may try to simplify the analysis of the general case in the proof of Claim 3 by considering the polynomial defined via , where the expectation is over permutations of . is a symmetric polynomial, so the analysis for symmetric polynomials immediately implies that . Unfortunately, this does *not* mean that .

This is because the symmetrized polynomial averages the values of over all inputs of a given Hamming weight. So a bound on this averaging polynomial does not preclude the case where is massively positive on some inputs of a given Hamming weight and massively negative on other inputs of the same Hamming weight, with these values canceling out to a small average. That is, a bound on the average of over inputs of a given Hamming weight does not bound the magnitude of at any particular such input.

Thus, we needed to make sure that when we symmetrize to , such large cancellations don’t happen, and a bound of the average value of on a given Hamming weight really gives us a bound on on the input itself. We defined so that . Since there is only *one* input in of Hamming weight , does not average ’s values on many inputs, meaning we don’t need to worry about massive cancellations.
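The cancellation issue can be seen concretely in a toy example of our own (not from the notes): a two-variable polynomial that is huge in magnitude at each weight-1 input, yet whose Minsky-Papert symmetrization vanishes there.

```python
from fractions import Fraction
from itertools import permutations

def symmetrize(p, n):
    """Minsky-Papert symmetrization: p_sym(x) is the average of p over all
    permutations of the coordinates of x. The result depends only on the
    Hamming weight of x."""
    perms = list(permutations(range(n)))
    def p_sym(x):
        total = sum(p(tuple(x[pi[i]] for i in range(n))) for pi in perms)
        return Fraction(total, len(perms))
    return p_sym

# p is +M at (1,0) and -M at (0,1): enormous at Hamming weight 1, but the
# two values cancel, so the symmetrized polynomial is 0 at weight 1.
M = 10 ** 6
p = lambda x: M * (x[0] - x[1])
p_sym = symmetrize(p, 2)
```

This is exactly why the proof of Claim 3 first restricts to the support of t: after the restriction, the top Hamming weight level contains a single input, so symmetrizing cannot average away a large value.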

**A note on the history of Claim 3.** Claim 3 was implicit in [RS10]. They explicitly showed a similar bound for symmetric polynomials using a primal view, and (implicitly) gave a different (dual) proof of the case of general polynomials.

A dual polynomial is a dual solution to a certain linear program that captures the approximate degree of any given function . These polynomials act as certificates of the high approximate degree of . Strong LP duality implies that the technique is lossless, unlike the symmetrization techniques we saw before. For any function and any , there is always some dual polynomial that witnesses a tight -approximate degree lower bound for . A dual polynomial that witnesses the fact that is a function satisfying three properties:

**Correlation:** If satisfies this condition, it is said to be well-correlated with .

**Pure high degree:** For all polynomials of degree less than , we have . If satisfies this condition, it is said to have *pure high degree* at least .

**norm:**

For any function , we can write an LP capturing the approximate degree of . We can prove lower bounds on the approximate degree of by proving lower bounds on the optimal value of this LP. One way to do this is to write down the dual of the LP and exhibit a feasible solution to it. By weak LP duality, the value of any feasible solution of the dual is a lower bound on the optimal value of the primal LP. Therefore, exhibiting such a feasible solution, which we call a dual witness, suffices to prove an approximate degree lower bound for .
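As a concrete illustration of the three properties (our example, not one from the lecture), here is the classical dual witness for PARITY on n bits, psi(x) = (-1)^|x| / 2^n, together with a brute-force check that it has unit L1 norm, pure high degree n, and correlation 1 with PARITY (thus certifying approximate degree n for any error below 1):

```python
from fractions import Fraction
from itertools import combinations, product

def parity_dual_witness(n):
    """Check the three dual-witness properties for psi(x) = (-1)^|x| / 2^n,
    the classical certificate that PARITY on n bits has approximate
    degree n."""
    points = list(product([0, 1], repeat=n))
    psi = {x: Fraction((-1) ** sum(x), 2 ** n) for x in points}
    # L1 norm: the absolute values of psi should sum to exactly 1.
    l1 = sum(abs(v) for v in psi.values())
    # Pure high degree n: psi is orthogonal to every monomial of degree < n,
    # hence to every polynomial of degree < n.
    phd = all(
        sum(psi[x] * int(all(x[i] for i in S)) for x in points) == 0
        for d in range(n)
        for S in combinations(range(n), d)
    )
    # Correlation with PARITY in +-1 form, f(x) = (-1)^|x|.
    corr = sum(psi[x] * (-1) ** sum(x) for x in points)
    return l1, phd, corr
```

The orthogonality check works because any coordinate outside the monomial contributes a factor of (+1) + (-1) = 0 to the sum.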

However, for any given dual witness, some work will be required to verify that the witness indeed meets the criteria imposed by the Dual constraints.

When the function is a block-wise composition of two functions, say and , then we can try to construct a good dual witness for by looking at dual witnesses for each of and , and combining them carefully, to get the dual witness for .

The dual witness constructed in Step 2a for is expressed below in terms of the dual witness of the inner function viz. and the dual witness of the outer , viz. .

This method of combining dual witnesses for the “outer” function and for the “inner function” is referred to in [BKT17, BT17] as *dual block composition*.

Step 2a of the proof of the lower bound from [BKT17] gave a dual witness for (with ) that had pure high degree , and also satisfies Equations (4) and (5) below.

Equation (4) is a very strong “Hamming weight decay” condition: it shows that the total mass that places on inputs of high Hamming weight is very small. Hamming weight decay conditions play an essential role in the lower bound analysis for from [BKT17]. In addition to Equations (4) and (5) themselves being Hamming weight decay conditions, [BKT17]’s proof that satisfies Equations (4) and (5) exploits the fact that the dual witness for can be chosen to simultaneously have pure high degree , and to satisfy the following weaker Hamming weight decay condition:

**Claim 5.** There exist constants such that for all ,

(We will not prove Claim 5 in these notes; we simply state it to highlight the importance of dual decay to the analysis of .)

Dual witnesses satisfying various notions of Hamming weight decay have a natural primal interpretation: they witness approximate degree lower bounds for the target function ( in the case of Equation (4), and in the case of Equation (6)) *even when the approximation is allowed to be exponentially large on inputs of high Hamming weight*. This primal interpretation of dual decay is formalized in the following claim.

**Claim 6.** Let be any function mapping to . Suppose is a dual witness for satisfying the following properties:

- (Correlation): .
- (Pure high degree): has pure high degree .
- (Dual decay): for all .

Then there is no degree polynomial such that

Proof. Let be any degree polynomial. Since has pure high degree , .

We will now show that if satisfies Equation (7), then the other two properties satisfied by (correlation and dual decay) together imply that , a contradiction.

Here, Line 2 exploited that has correlation at least with , Line 3 exploited the assumption that satisfies Equation (7), and Line 4 exploited the dual decay condition that is assumed to satisfy.

Proof. Claim 4 follows from Equations (4) and (5), combined with Claim 6. Specifically, apply Claim 6 with , and

Now we take a look at how to extend this kind of analysis for to obtain even stronger approximate degree lower bounds for other functions in . Recall that can be expressed as an (over all range items ) of the (over all inputs ) of “Is input equal to ?” That is, simply evaluates on the inputs , where indicates whether or not input is equal to range item .

Our analysis for can be viewed as follows: It is a way to turn the function on bits (which has approximate degree ) into a function on close to bits, with polynomially larger approximate degree (i.e. is defined on bits where, say, the value of is , i.e., it is a function on bits). So, this function is on not much more than bits, but has approximate degree , polynomially larger than the approximate degree of .

Hence, the lower bound for can be seen as a hardness amplification result. We turn the function on bits to a function on slightly more bits, but the approximate degree of the new function is significantly larger.

From this perspective, the lower bound proof for showed that in order to approximate , we must not only approximate the function; in addition, rather than feeding the inputs directly to the gate itself, the degree is driven up further by feeding the input through gates. The intuition is that we cannot do much better than approximating the function and then approximating the block-composed gates. This additional approximation of the gates gives us the extra exponent in the approximate degree expression.

We will see two issues that come in the way of naive attempts at generalizing our hardness amplification technique from to more general functions.

**Grover’s algorithm** [Gro96] is a quantum algorithm that finds, with high probability, the unique input to a black-box function that produces a given output, using queries to the function, where is the size of the domain of the function. It was originally devised as a database search algorithm: it searches an unsorted database of size and determines whether there is a record in the database satisfying a given property in queries. This is strictly better than deterministic and randomized query algorithms, which take queries in the worst case and in expectation, respectively. Grover’s algorithm is optimal up to a constant factor among quantum algorithms.

In general, let us consider the problem of taking any function that does not have maximal approximate degree (say, with approximate degree ), and turning it into a function on roughly the same number of bits, but with polynomially larger approximate degree.

In analogy with how equals evaluated on inputs , where indicates whether or not , we can consider the block composition evaluated on , and hope that this function has polynomially larger approximate degree than itself.

Unfortunately, this does not work. Consider for example the case . The function evaluates to 1 on all possible vectors , since all such vectors have Hamming weight exactly .

One way to try to address this is to introduce a dummy range item, all occurrences of which are simply ignored by the function. That is, we can define the (hopefully harder) function to interpret its input as a list of numbers from the range , rather than the range , and define (note that the variables , which indicate whether or not each input equals range item , are simply ignored).

In fact, in the previous lecture we already used this technique of introducing a “dummy” range item, to ease the lower bound analysis for itself. Last lecture we covered Step 1 of the lower bound proof for , and we let denote the frequency of the dummy range item, 0. The introduction of this dummy range item let us replace the condition (i.e., the sum of the frequencies of all the range items is *exactly* ) by the condition (i.e., the sum of the frequencies of all the range items is *at most* ).

Unfortunately, introducing a dummy range item is not sufficient on its own. That is, even when the range is rather than , the function may have approximate degree that is *not* polynomially larger than that of itself. An example of this is (once again) . With a dummy range item, evaluates to TRUE if and only if at least one of the inputs is *not* equal to the dummy range item . This problem has approximate degree (it can be solved using Grover search).

Therefore, the most naive approach at general hardness amplification, even with a dummy range item, does not work.

The approach that succeeds is to consider the block composition (i.e., apply the naive approach with a dummy range item not to itself, but to ). As pointed out in Section 2.3, the gates are crucial here for the analysis to go through.

It is instructive to look at where exactly the lower bound proof for breaks down if we try to adapt it to the function (rather than the function which we analyzed to prove the lower bound for ). Then we can see why the introduction of the gates fixes the issue.

When analyzing the more naively defined function (with a dummy range item), Step 1 of the lower bound analysis for *does work* unmodified to imply that in order to approximate , it is necessary to approximate block composition of on inputs of Hamming weight at most . But Step 2 of the analysis breaks down: one can approximate on inputs of Hamming weight at most using degree just .

Why does the Step 2 analysis break down for ? If one tries to construct a dual witness for by applying dual block composition (cf. Equation (3), but with the dual witness for replaced by a dual witness for ), will not be well-correlated with .

Roughly speaking, the correlation analysis thinks of each copy of the inner dual witness as consisting of a sign, , and a magnitude , and the inner dual witness “makes an error” on if it outputs the wrong sign, i.e., if . The correlation analysis winds up performing a union bound over the probability (under the product distribution ) that *any* of the copies of the inner dual witness makes an error. Unfortunately, each copy of the inner dual witness makes an error with constant probability under the distribution . So at least one of them makes an error under the product distribution with probability very close to 1. This means that the correlation of the dual-block-composed dual witness with is poor.

But if we look at , the correlation analysis *does* go through. That is, we can give a dual witness for and a dual witness for such that the dual-block-composition of and is well-correlated with .

This is because [BT15] showed that for , . This means that has a dual witness that “makes an error” with probability just . This probability of making an error is so low that a union bound over all copies of appearing in the dual-block-composition of and implies that with probability at least , *none* of the copies of make an error.

In summary, the key difference between and that allows the lower bound analysis to go through for the latter but not the former is that the latter has -approximate degree for , while the former only has -approximate degree if is a constant bounded away from 1.

To summarize, the lower bound can be seen as a way to turn the function into a harder function , meaning that has polynomially larger approximate degree than . The right approach to generalize the technique for arbitrary is to (a) introduce a dummy range item, all occurrences of which are effectively ignored by the harder function , *and* (b) rather than considering the “inner” function , consider the inner function , i.e., let . The gates are essential to make sure that the error in the correlation of the inner dual witness is very small, and hence the correlation analysis for the dual-block-composed dual witness goes through. Note that can be interpreted as follows: it breaks the range up into blocks, each of length , (the dummy range item is excluded from all of the blocks), and for each block it computes a bit indicating whether or not every range item in the block has frequency at least 1. It then feeds these bits into .

By recursively applying this construction, starting with , we get a function in AC with approximate degree for any desired constant .

The very same issue also arises in [BKT17]’s proof of a lower bound on the approximate degree of the -distinctness function. Step 1 of the lower bound analysis reduced analyzing -distinctness to analyzing (restricted to inputs of Hamming weight at most ), where is the function that evaluates to TRUE if and only if its input has Hamming weight at least . The lower bound proved in [BKT17] for -distinctness is . is the function. So is “close” to . And we’ve seen that the correlation analysis of the dual witness obtained via dual-block-composition breaks down for .

To overcome this issue, we still show that is harder to approximate than itself, but we have to give up a small factor in the process: we will lose some quantity compared to the lower bound for . It may seem that this loss factor is just a technical issue and not intrinsic, but this is not so; in fact, the bound is almost tight. There is an upper bound from a complicated quantum algorithm [BL11, Bel12] for -distinctness that we won’t elaborate on here.

[Bel12] Aleksandrs Belovs. Learning-graph-based quantum algorithm for k-distinctness. In Foundations of Computer Science (FOCS), 2012 IEEE 53rd Annual Symposium on, pages 207–216. IEEE, 2012.

[BKT17] Mark Bun, Robin Kothari, and Justin Thaler. The polynomial method strikes back: Tight quantum query bounds via dual polynomials. arXiv preprint arXiv:1710.09079, 2017.

[BL11] Aleksandrs Belovs and Troy Lee. Quantum algorithm for k-distinctness with prior knowledge on the input. arXiv preprint arXiv:1108.3022, 2011.

[BT15] Mark Bun and Justin Thaler. Hardness amplification and the approximate degree of constant-depth circuits. In International Colloquium on Automata, Languages, and Programming, pages 268–280. Springer, 2015.

[BT17] Mark Bun and Justin Thaler. A nearly optimal lower bound on the approximate degree of . arXiv preprint arXiv:1703.05784, 2017.

[Gro96] Lov K. Grover. A fast quantum mechanical algorithm for database search. In Proceedings of the Twenty-Eighth Annual ACM Symposium on Theory of Computing, pages 212–219. ACM, 1996.

[RS10] Alexander A. Razborov and Alexander A. Sherstov. The sign-rank of . SIAM Journal on Computing, 39(5):1833–1855, 2010.