“It is often said that we live in a computational universe. But if Nature “computes” in a classical, input-output fashion, then our current prospects of leveraging this viewpoint to gain fundamental insights may be scarce. This is due to the combination of two facts. First, our current understanding of fundamental questions such as “P=NP?” is limited to restricted computational models, for example the class AC0 of bounded-depth circuits. Second, those restricted models are incapable of modeling many processes which appear to be present in nature. For example, a series of works in complexity theory culminating in [Hås87] shows that AC0 cannot count. But what if Nature, instead, “samples?” That is, what if Nature is better understood as a computational device that, given some initial source of randomness, samples the observed distribution of the universe? Recent work by the Project Investigator (PI) gives two key insights in this direction. First, the PI has highlighted that, when it comes to sampling, restricted models are capable of surprising behavior. For example, AC0 can count, in the sense that it can sample a uniform bit string together with its Hamming weight [Vio12a]. Second, despite the growth in power given by sampling, for these restricted models the PI was still able to answer fundamental questions of the type of “P=NP?” [Vio14]”
Thus begins my application for the Turing Centenary Research Fellowship. After reading it, perhaps you too, like me, are not surprised that it was declined. But I was unprepared for the strange emails that accompanied its rejection. Here’s an excerpt:
“[…] A reviewing process can be thought of as a kind of Turing Test for fundability. There is a built-in fallibility; and just as there is as yet no intelligent machine or effective algorithm for recognising one (otherwise why would we bother with a Turing Test), there is no algorithm for either writing the perfect proposal, or for recognising the worth of one. Of course, the feedback may well be useful, and will come. But we will be grateful for your understanding in the meantime.”
Well, I am still waiting for comments.
Even the rejection was sluggish: for months I, and apparently others, were told that our proposals didn’t make it but were so good that they were looking for extra money to fund them anyway. After the money didn’t materialize, I was invited to try the regular call (of the sponsoring foundation). The first step of this was submitting a preliminary proposal, which I took: I re-sent them the abstract of my proposal. I was then invited to submit the full proposal. This is a rather painstaking process which requires you to address a seemingly endless series of minute questions referring to mysterious concepts such as the “Theory of Change.” Nevertheless, given that they had suggested I try the regular call, that they had seen what I was planning to submit, and that they had still invited me to submit the full proposal, I did answer all the questions and re-sent them what they already had: my Turing Research Fellowship application. Perhaps it only makes sense that the outcome was as it was.
The proposal was part of a research direction which started exactly five years ago, when the question was raised of proving computational lower bounds for sampling. Since then, there has been progress: [Vio12a, LV12, DW11, Vio14, Vio12b, BIL12, BCS14]. One thing I like about this area is that it is uncharted: wherever you point your finger, chances are you find an open problem. While this is true for much of Complexity Theory, questions regarding sampling haven’t been studied nearly as intensely. Here are three:
A most basic open question. Let D be the distribution on n-bit strings where each bit is independently 1 with probability 1/4. Now suppose you want to sample D given some random bits x_1, x_2, …. You can easily sample D exactly with the map
(x_1 ∧ x_2, x_3 ∧ x_4, …, x_{2n−1} ∧ x_{2n}).
This map is 2-local, i.e., each output bit depends on at most 2 input bits. However, it uses 2n input bits, whereas the entropy of the distribution is H(1/4)n ≈ 0.81n. Can you show that any 2-local map using a number of bits closer to H(1/4)n will sample a distribution that is very far from D? Ideally, we want to show that the statistical distance between the two distributions is very high, exponentially close to 1.
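For concreteness, here is a tiny brute-force check of the setup above. It is only an illustrative sketch (the parameter n = 4 and the helper names are arbitrary choices of mine): it enumerates all inputs of the 2-local map, confirms that the map samples D exactly, and prints the entropy per output bit.

```python
# Sanity check: the 2-local map (x1 & x2, x3 & x4, ...) samples D exactly,
# while spending 2n random bits against an entropy of only H(1/4)*n.
from itertools import product
from math import log2
from collections import Counter

n = 4                                      # number of output bits, kept tiny to enumerate
counts = Counter()
for x in product([0, 1], repeat=2 * n):    # all 2^(2n) input strings
    y = tuple(x[2 * i] & x[2 * i + 1] for i in range(n))
    counts[y] += 1

def target(y):                             # D: each bit independently 1 with probability 1/4
    return (1 / 4) ** sum(y) * (3 / 4) ** (n - sum(y))

tv = 0.5 * sum(abs(counts[y] / 4 ** n - target(y)) for y in product([0, 1], repeat=n))
H = -(1 / 4) * log2(1 / 4) - (3 / 4) * log2(3 / 4)
print(f"statistical distance = {tv:.6f}")  # 0.000000: the map is exact
print(f"entropy per output bit = {H:.3f}") # about 0.811
```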
Such strong statistical-distance bounds also enable a connection to lower bounds for succinct dictionaries, a problem that Pătraşcu considered important. A result for d-local maps corresponds to a result for data structures which answer membership queries with d non-adaptive bit probes; adaptive bit probes correspond to decision trees. And d cell probes correspond to samplers whose input is divided into blocks (cells) of O(log n) bits, where each output bit depends on d cells, adaptively.
There are some results in [Vio12a] on a variant of the above question where you need to sample strings whose Hamming weight is exactly n/4, but even there, large gaps remain in our knowledge. And I think the above case of 2-local maps is still open, even though it really looks like you cannot do anything unless you use 2n random bits.
Stretch. With Lovett, we suggested in [LV12] proving negative results for sampling (the uniform distribution over) a subset S ⊆ {0,1}^n by bounding from below the stretch of any map
f : {0,1}^r → S.
Stretch can be measured as the average Hamming distance between f(x) and f(y), where x and y are two uniform input strings at Hamming distance 1. If you prove a good lower bound on this quantity then some complexity lower bounds for f follow because local maps, AC0 maps, etc. have low stretch.
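To illustrate the definition (a quick sketch of mine, not anything from [LV12]): one can estimate the average stretch of a concrete map by sampling pairs of inputs at Hamming distance 1. For the 2-local AND map from the previous question the estimate comes out around 1/2, consistent with the remark that local maps have low stretch.

```python
# Estimate average stretch: expected Hamming distance between f(x) and f(y),
# where x is uniform and y is x with one uniformly chosen coordinate flipped.
import random

def avg_stretch(f, r, trials=200000):
    total = 0
    for _ in range(trials):
        x = [random.randint(0, 1) for _ in range(r)]
        y = list(x)
        y[random.randrange(r)] ^= 1                    # Hamming distance 1 from x
        total += sum(a != b for a, b in zip(f(x), f(y)))
    return total / trials

n = 8
and_map = lambda x: [x[2 * i] & x[2 * i + 1] for i in range(n)]  # the 2-local sampler
print(avg_stretch(and_map, 2 * n))                     # about 0.5: low average stretch
```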
We were able to apply this to prove that AC0 cannot sample good codes. Our bounds are only polynomially close to 1; but a nice follow-up by Beck, Impagliazzo, and Lovett, [BIL12], improves this to exponential. But can this method be applied to other sets that do not have error-correcting structure?
Consider in particular the distribution UP which is uniform over the upper-half of the hypercube, i.e., uniform over the n-bit strings whose majority is 1. What stretch is required to sample UP? At first sight, it seems the stretch must be quite high.
But a recent paper by Benjamini, Cohen, and Shinkar, [BCS14], shows that in fact it is possible with stretch 5. Moreover, the sampler has zero error, and uses the minimum possible number of input bits: n – 1!
I find their result quite surprising in light of the fact that constant-locality samplers cannot do the job: their output distribution has Ω(1) statistical distance from UP [Vio12a]. But local samplers looked very similar to low-stretch ones. Indeed, it is not hard to see that a local sampler has low average stretch, and the reverse direction follows from Friedgut’s theorem. However, the connections are only average-case. It is pretty cool that the picture changes completely when you go to worst-case computation.
What else can you sample with constant stretch?
AC0 vs. UP. Their results are also interesting in light of the fact that AC0 can sample UP with exponentially small error. This follows from a simple adaptation of the dart-throwing technique for parallel algorithms, known since the early 90’s [MV91, Hag91] – the details are in [Vio12a]. However, unlike their low-stretch map, this AC0 sampler uses superlinear randomness and has a non-zero probability of error.
Can AC0 sample UP with no error? Can AC0 sample UP using O(n) random bits?
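As an aside of mine (not from the post): if you drop the AC0 requirement, zero error with n random bits is easy for odd n, since complementing the input whenever its majority is 0 maps the lower half of the cube onto UP. The catch is that every output bit of this map depends on the majority of all the inputs, which is exactly what AC0 cannot compute, so it says nothing about the two questions above. A toy check:

```python
# Baseline sampler for UP (uniform over n-bit strings with majority 1), n odd:
# output x if maj(x) = 1, otherwise output the complement of x.
# Zero error and only n random bits, but NOT AC0: it computes majority.
from itertools import product
from collections import Counter

n = 5                                      # odd and tiny enough to enumerate
counts = Counter()
for x in product([0, 1], repeat=n):
    y = x if sum(x) > n // 2 else tuple(1 - b for b in x)
    counts[y] += 1

print(len(counts) == 2 ** (n - 1))         # True: support is exactly the upper half
print(set(counts.values()) == {2})         # True: uniform, each string hit twice
```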
Let’s see what the next five years bring.
References
[BCS14] Itai Benjamini, Gil Cohen, and Igor Shinkar. Bi-Lipschitz bijection between the Boolean cube and the Hamming ball. In IEEE Symp. on Foundations of Computer Science (FOCS), 2014.
[BIL12] Chris Beck, Russell Impagliazzo, and Shachar Lovett. Large deviation bounds for decision trees and sampling lower bounds for AC0-circuits. In IEEE Symp. on Foundations of Computer Science (FOCS), pages 101–110, 2012.
[DW11] Anindya De and Thomas Watson. Extractors and lower bounds for locally samplable sources. In Workshop on Randomization and Computation (RANDOM), 2011.
[Hag91] Torben Hagerup. Fast parallel generation of random permutations. In 18th Coll. on Automata, Languages and Programming (ICALP), pages 405–416. Springer, 1991.
[Hås87] Johan Håstad. Computational limitations of small-depth circuits. MIT Press, 1987.
[LV12] Shachar Lovett and Emanuele Viola. Bounded-depth circuits cannot sample good codes. Computational Complexity, 21(2):245–266, 2012.
[MV91] Yossi Matias and Uzi Vishkin. Converting high probability into nearly-constant time, with applications to parallel hashing. In 23rd ACM Symp. on the Theory of Computing (STOC), pages 307–316, 1991.
[Vio12a] Emanuele Viola. The complexity of distributions. SIAM J. on Computing, 41(1):191–218, 2012.
[Vio12b] Emanuele Viola. Extractors for Turing-machine sources. In Workshop on Randomization and Computation (RANDOM), 2012.
[Vio14] Emanuele Viola. Extractors for circuit sources. SIAM J. on Computing, 43(2):355–972, 2014.
A question regarding “A most basic open question”: Is there an easy 2-local sampler that uses 1.99n bits and outputs a distribution that is close to D=D_{1/4}? What if you allow the sampler to be 10-local?
I think that is open. Let me elaborate. One approach to sampling is to break your input into blocks of d bits, and use each block to sample O(d) output bits. This allows you to trade input length for locality. However, if d is a constant you will still have a constant statistical distance in each block, which accumulates to a distance exponentially close to 1. Still, this is a non-trivial construction.
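To make the trade-off concrete, here is a rough sketch (toy code of mine; the largest-remainder rounding is just one simple choice, not necessarily the construction meant above). With d input bits feeding a block, every achievable probability is a multiple of 2^-d, while the target probabilities on a block of k output bits have denominator 4^k; so for d < 2k some rounding error is unavoidable, and the sketch computes the error of a simple rounding, which is a fixed constant per block when d is a constant.

```python
# Per-block statistical distance of a block sampler: a block of d input bits can
# only produce probabilities that are multiples of 2^-d, so approximate the target
# distribution on k output bits (each 1 w.p. 1/4) by largest-remainder rounding.
from itertools import product
from math import floor

def block_distance(k, d):
    target = {z: (1 / 4) ** sum(z) * (3 / 4) ** (k - sum(z))
              for z in product([0, 1], repeat=k)}
    scaled = {z: p * 2 ** d for z, p in target.items()}
    m = {z: floor(c) for z, c in scaled.items()}       # numerators of multiples of 2^-d
    leftover = 2 ** d - sum(m.values())
    # hand the remaining probability mass to the largest fractional parts
    for z in sorted(scaled, key=lambda z: scaled[z] - m[z], reverse=True)[:leftover]:
        m[z] += 1
    return 0.5 * sum(abs(target[z] - m[z] / 2 ** d) for z in target)

for k, d in [(4, 4), (4, 6), (8, 8), (8, 12)]:
    print(f"k={k} output bits from d={d} input bits: distance = {block_distance(k, d):.4f}")
```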
In terms of lower bounds, something that comes to mind is the following. Consider the set of n pairs of input bits that are the neighborhoods of the output bits. If this set has a small covering, you can fix a few bits and reduce to the 1-local case. If not, then you have lots of output bits that are independent, which means that you are wasting a lot of entropy, something you cannot afford. I have not thought much about this approach, but I think that when I tried it, it did not immediately work. But perhaps we can try some more?
Suppose you have approximately 1.414n random bits, so there are about n^2 pairs of bits. Pick n of those pairs (a_i, b_i) at random and output a_i ∧ b_i as the i-th output bit. So the output bits won’t have quite the right probabilities but will be pretty close, and they won’t be quite independent but will appear pretty close. Does that sound relevant?
The model is different. I am not allowing any extra randomness beyond what is in the input bits. In particular, the input-output connections are fixed.
If you allow the connections to be chosen at random, I think you can do something pretty good even with just four (!) input bits. Indeed, set each output bit to be one of those four, independently and uniformly. Now, if those 4 bits happen to have Hamming weight 1, you get exactly the right distribution. This happens with constant probability.
The difference between the models becomes less relevant the more powerful they are, because you can e.g. implement picking a random bit with a decision tree of depth O(log n) (though that costs input length). But with the lower bounds we are not there yet.
Sorry, I don’t mean make a new random selection for each sample (that’s perfectly fine, except it consumes more random bits that you’d have to account for). I mean choose a fixed mapping before doing any samples, but with the fixed mapping drawn at random from the possible ones, so it’s likely to be free of structure. There might alternatively be a way to deterministically choose such a mapping. The question is, how are you measuring the distance between the two distributions?
I place no restriction on how the connection graph is obtained, so we are free to choose it at random. Once the connection graph is fixed, and also we decide what each output bit is computing (presumably, And), we consider the distribution obtained by a uniform choice of the input bits. Let me know if this does not answer your question.
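In case it helps to experiment, here is a small tool I threw together (names and toy sizes are mine): it takes a fixed connection graph in exactly this model, i.e., a list of input-bit pairs with an AND at each output bit, enumerates the uniform input, and reports the exact statistical distance from D_{1/4}. One can feed it, for instance, the random-pairs suggestion from above at small sizes.

```python
# Exact statistical distance from D_{1/4} of a fixed 2-local AND sampler.
# The connection graph is a list of input-index pairs, one pair per output bit;
# the only randomness is the uniform m-bit input, as in the model described above.
from itertools import product, combinations
from collections import Counter
import random

def tv_from_quarter(pairs, m):
    n = len(pairs)
    counts = Counter()
    for x in product([0, 1], repeat=m):
        counts[tuple(x[i] & x[j] for i, j in pairs)] += 1
    target = lambda y: (1 / 4) ** sum(y) * (3 / 4) ** (n - sum(y))
    return 0.5 * sum(abs(counts[y] / 2 ** m - target(y))
                     for y in product([0, 1], repeat=n))

n, m = 8, 12                                            # about 1.414*n input bits
pairs = random.sample(list(combinations(range(m), 2)), n)
print(f"random pairs, n={n}, m={m}: distance = {tv_from_quarter(pairs, m):.4f}")
# For comparison, the exact sampler on disjoint pairs with m = 2n bits:
exact = [(2 * i, 2 * i + 1) for i in range(n)]
print(f"disjoint pairs, m={2 * n}: distance = {tv_from_quarter(exact, 2 * n):.4f}")
```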
I would not think that a random graph is a good construction. As I think about the reasons for this, I wonder if the following familiar strategy can be used to prove a general lower bound.
Find 2 output bits which together depend on only 3 input bits. Just looking at those, you get constant statistical distance. Now the idea is to find 2 more output bits which depend on 3 input bits disjoint from the previous ones, and boost the statistical distance. Intuitively, if you can’t repeat this process long enough to get exponential statistical distance, it should be possible to reduce the degree by fixing a few bits.
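For concreteness, the first step can be checked exactly (a throwaway sketch, taking (x ∧ y, x ∧ z) as the canonical example of two outputs sharing an input): the statistical distance from two independent bits that are each 1 with probability 1/4 comes out to 1/8.

```python
# Two output bits sharing an input: (x & y, x & z) over uniform x, y, z,
# versus two independent bits that are each 1 with probability 1/4.
from itertools import product
from collections import Counter

counts = Counter()
for x, y, z in product([0, 1], repeat=3):
    counts[(x & y, x & z)] += 1

target = {(a, b): (1 / 4 if a else 3 / 4) * (1 / 4 if b else 3 / 4)
          for a, b in product([0, 1], repeat=2)}
tv = 0.5 * sum(abs(counts[ab] / 8 - target[ab]) for ab in target)
print(tv)   # 0.125: a constant advantage from a single such pair
```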
So just to write your argument a bit more explicitly: Let G = (V, E) be the corresponding graph with n vertices and 1.9n edges.
If all degrees of G are bounded by d (think of d as small), then we can find many pairs of adjacent edges repeatedly. In each step we remove 3 vertices and at most 3d edges. Therefore, as long as the number of remaining edges is smaller than twice the number of the remaining vertices we can find another pair of edges and continue. So we get $\Omega(n/d)$ pairs of edges, which gives statistical distance of $1-\exp(-n/d)$.
Otherwise, if the graph has a vertex with high degree, then the output-edges that touch this vertex will give a large statistical distance. Specifically, if a graph has a vertex of degree d, then the statistical distance is at least $1 – 2^{-d}$.
Taking $d = \sqrt{n}$ we get distance $\geq 1 - \exp(-\sqrt{n})$. Right?
Right. Just to fix notation, I would prefer to use the terminology “input bits” rather than “edges,” as I can think of different interpretations for “edge.”
Perhaps we can improve the statistical-distance lower bound as follows. Let C be a large constant. Collect an input bit, if any, of degree > C. Note that the corresponding C output bits depend on <= C+1 input bits, using the fact that the map is 2-local. So intuitively we have gained in statistical distance, while leaving the ratio between input length and output length almost unchanged. It feels that if we can collect Omega(n) such input bits, we're done.
If not, it feels that we can use the previous argument and collect Omega(n) pairs of output bits where each pair depends on <= 3 input bits, and again be done.
Right, I suppose it can be improved by being more careful. I’ll think about it…
BTW, in my previous comment I confused the notation. The vertices of the graph correspond to inputs and the edges to outputs. That is, we have a graph with n edges and 1.9n vertices. And we continue this process as long as the number of vertices is smaller than twice the number of edges.
Sorry about this.
@asdf (or anyone else): if you are interested in pursuing this let me know, either here or via email. Igor and I are considering working together on these problems.
Here’s an attempt that shows 1-2^-Omega(n) distance. The idea is more or less the same as above.
Consider the input-output dependency graph and pick k input nodes by repeating the following: Pick an input node i with the largest degree in the current graph and remove its neighbors from the graph.
If k = Omega(n), then at some point there are Omega(n) input nodes left in the graph and each of them has constant degree >= 2. In this case we can pick Omega(n) independent blocks from the output bits and get distance 1-2^-Omega(n).
Now we assume k = o(n). It suffices to show that for each of the possible values at these k positions, there is a test T such that Pr[F(U) in T] is 1 - 2^-Omega(n) and Pr[Ber(n,1/4) in T] is 2^-Omega(n). There are 3 cases:
(1) There are many fixed output bits;
(2) There are many unfixed output bits but they depend on very few input nodes;
(3) There are many unfixed output bits but they depend on many input nodes.
For more details, see http://www.ccs.neu.edu/home/chlee/sampling.pdf
Sorry, the proof for the case k = Omega(n) is wrong. The issue is that the Omega(n) blocks are independent only after removing the large degree input nodes (and their adjacent output nodes). So there can be dependency between these blocks in the original dependency graph.
Doesn’t (source) coding theory have something to say about your question on generating a p = 1/4 distribution? In this case, H = 0.81n. For generating 1 bit of your sample you need at least 2 bits of entropy. To generate 2 bits you need more than 1.62, so at least 2 bits of entropy, which can also be ruled out by trying to construct a source code. Only starting at 6 bits are you not forbidden from using less than n bits. An easy bound comes from building a Huffman code over the distribution of the n output bits: in the limit you approach both the entropy and the ideal distribution, although the number of bits used is variable. You can improve this by lumping the codewords of the Huffman code and padding as necessary, to obtain fixed-length codewords that approach both the entropy (i.e., 0.81n) and the distribution. I’m sure there’s a paper somewhere that tells you exactly how fast this happens… This map is quite computationally intensive though, I presume, i.e., it gives you a good “locality” you desired but doesn’t say much about complexity.
Note that since this distribution is non-dyadic, you can’t actually simultaneously achieve entropy and a perfect distribution in finite size.
This “non-dyadicness” is one of the sources of difficulty in assembling an arbitrary finite discrete distribution from bits.
Interesting write-up; sorry if I didn’t make much sense, I don’t know much about either field…
Hi Gustav,
the classical theory of source coding does not take into account the computational complexity of the decoder. These questions are a step towards developing a version of the theory which does.
You can definitely use results such as the Huffman code to get *some* bounds on the problem. For example, you can divide the n output bits in blocks of b bits each, and use an optimal code for each block. This gives a tradeoff between locality, input length, and statistical distance. A general question is whether this is the best that you can do. The question about locality two is a seemingly basic question towards answering the general one.
I hope this makes sense. Let me know if I can say more.
I should probably add that, to be sure, there do exist papers in the coding-theory literature that address source-coding questions with additional constraints that have a computational flavor. But they do not answer the locality question, as far as I know, and in general they don’t consider models that are natural from a computer science point of view.
I have been thinking more about this today. I find it interesting how difficult it is to prove this thing. Here’s an approach that might work.
Recall that we want to rule out 2-local samplers with input length 1.9n and output length n that sample the D_{1/4} distribution.
Pick a large constant C, say C=100. Call an input node large-degree if its degree is at least C. The number of large-degree input nodes is at most 2n/C, since the total number of edges is 2n.
Let r be the number of output nodes that are only connected to large-degree input nodes.
If r is a suitable constant factor larger than 2n/C, then just looking at those r outputs you get a statistical distance exponentially close to 1, just by counting support sizes: those outputs are a function of at most 2n/C input bits, so their joint distribution is supported on at most 2^(2n/C) strings, while under the target distribution any set of 2^(2n/C) strings has probability at most 2^(2n/C) * (3/4)^r.
Otherwise r is at most about 2n/C. If C is large enough compared to 1.9 in the definition of input length, we can ignore those output bits. Specifically, remove those output bits and redefine the sampler by eliminating all adjacent edges. We are going to prove a statistical distance lower bound on this new sampler, which implies one on the original sampler.
What we are going to exploit of the new sampler is that large-degree nodes have disjoint neighborhoods.
Here is the strategy to prove a lower bound on such samplers. We are going to find sets of output bits of size d_1, d_2, … such that:
1. the marginal distributions of the output on the sets are independent; in particular, the sets are disjoint.
2. sum_i d_i = Omega(n).
3. sets of size d_i are connected to at most d_i + 1 input bits.
4. each set is of size at least 2.
Once we have these sets the proof is concluded as follows. By 3. and 4., each set gives a statistical distance advantage that is exponential in its cardinality. For example, if d_1 = 2, we have 2 outputs connected to at most 3 inputs.
By 1., the advantage multiplies across sets.
By 2., the total advantage is exponential in n.
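As a quick numerical sanity check of how the advantages combine (a toy computation of mine, using the two-outputs-sharing-an-input block from the earlier discussion as the repeated unit): take k independent copies of that block against k independent pairs of bits that are each 1 with probability 1/4, and watch one minus the statistical distance shrink as k grows.

```python
# The advantage of independent blocks multiplies: k copies of the (x & y, x & z)
# block versus k copies of two independent Ber(1/4) bits, computed exactly.
from itertools import product

block_p = {(0, 0): 5 / 8, (0, 1): 1 / 8, (1, 0): 1 / 8, (1, 1): 1 / 8}  # one block
block_q = {(a, b): (1 / 4 if a else 3 / 4) * (1 / 4 if b else 3 / 4)
           for a, b in product([0, 1], repeat=2)}

for k in range(1, 9):
    tv = 0.0
    for blocks in product(block_p, repeat=k):           # all 4^k joint outcomes
        p = q = 1.0
        for ab in blocks:
            p *= block_p[ab]
            q *= block_q[ab]
        tv += abs(p - q)
    tv *= 0.5
    print(f"k={k}: 1 - distance = {1 - tv:.6f}")         # non-increasing, tends to 0
```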
Next we explain how to obtain these sets. Let S = [n] be the set of output bits. (Actually the size of S is a bit smaller, due to the removal above, but I’ll ignore this.) We iteratively find a subset of S, and maintain the invariant that the outputs in S are independent from all the previous sets. For simplicity, eliminate all edges except those between S and NS. So all the degrees and neighborhoods are calculated w.r.t. this new graph.
As long as |S| = n - o(n), we can find two outputs in S which share an input. This is because the input length is 1.9n. So pick an input x of maximum degree d, which is at least 2.
Your next set is Nx. The set satisfies 4. by what we just said. This set satisfies 3. because the sampler is 2-local.
By the invariant, 1. is satisfied.
To maintain the invariant, remove NNNx from S. This ensures that future sets don’t have a neighbor in common with the set Nx.
It remains to guarantee 2. and it is here that making sure that large-degree inputs do not have common neighbors is helpful.
I claim that |NNNx| is at most C(|Nx|+1). This means that we will continue until 2. is satisfied. (This wouldn’t be true if we hadn’t made sure that the neighborhoods of large-degree nodes don’t intersect; there could be a quadratic gap between |NNNx| and |Nx|.)
To prove the claim, we do a case analysis. Suppose that the degree of x is more than C. In particular, x is of large degree in the original graph. Then no node in NNx has degree more than C, otherwise that node and x would have a common neighbor. Hence |NNNx| is at most C|NNx| which is at most C(|Nx|+1).
Otherwise, x has degree at most C. Since x was picked to have the maximum degree, no node has degree larger than C, and again the claim is proved.
If that works, an obvious difficulty in handling large locality is that the graph could be an expander, so you won’t find small sets of outputs that depend on few inputs to apply a hypothetical induction to.
In fact, could a construction like the following work? (This is similar to the approach proposed earlier by @ASDF.) Pick a random graph with locality 1000, and apply to every output node the same “crazy” Boolean function which is 1 on 1/4 of its inputs.
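One could at least poke at this experimentally. Below is a toy sketch of mine (locality 6 instead of 1000 and n = 8, so it proves nothing either way): pick a random d-local connection graph, fix a single Boolean function on d bits that is 1 on exactly a 1/4 fraction of its inputs, apply it at every output, and compute the exact distance from D_{1/4}.

```python
# Toy experiment for the proposal above: a random d-local graph with the same fixed
# Boolean function (1 on exactly 1/4 of its d-bit inputs) at every output node.
from itertools import product
from collections import Counter
import random

n, m, d = 8, 15, 6            # outputs, input bits (about 1.9n), locality (stand-in for 1000)

# a fixed "crazy" function on d bits: 1 on a random quarter of the 2^d inputs
ones = set(random.sample(range(2 ** d), 2 ** d // 4))
g = lambda bits: int(sum(b << i for i, b in enumerate(bits)) in ones)

# random connection graph: each output reads d distinct random input positions
nbrs = [random.sample(range(m), d) for _ in range(n)]

counts = Counter()
for x in product([0, 1], repeat=m):
    counts[tuple(g([x[i] for i in nb]) for nb in nbrs)] += 1

target = lambda y: (1 / 4) ** sum(y) * (3 / 4) ** (n - sum(y))
tv = 0.5 * sum(abs(counts[y] / 2 ** m - target(y)) for y in product([0, 1], repeat=n))
print(f"n={n}, m={m}, d={d}: exact distance from D_1/4 = {tv:.4f}")
```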