In this lecture we study lower bounds on data structures. First, we define the setting. We have bits of data, stored in bits of memory (the data structure) and want to answer queries about the data. Each query is answered with probes. There are two types of probes:
- bit-probe which return one bit from the memory, and
- cell-probe in which the memory is divided into cells of bits, and each probe returns one cell.
The queries can be adaptive or non-adaptive. In the adaptive case, the data structure probes locations which may depend on the answer to previous probes. For bit-probes it means that we answer a query with depth- decision trees.
Finally, there are two types of data structure problems:
- The static case, in which we map the data to the memory arbitrarily and afterwards the memory remains unchanged.
- The dynamic case, in which we have update queries that change the memory and also run in bounded time.
In this lecture we focus on the non-adaptive, bit-probe, and static setting. Some trivial extremes for this setting are the following. Any problem (i.e., collection of queries) admits data structures with the following parameters:
- and , i.e. you write down all the answers, and
- and , i.e. you can always answer a query about the data if you read the entire data.
Next, we review the best current lower bound, a bound proved in the 80’s by Siegel [Sie04] and rediscovered later. We state and prove the lower bound in a different way. The lower bound is for the problem of -wise independence.
Problem 1. The data is a seed of size for a -wise independent distribution over . A query is defined to be the -th bit of the sample.
Theorem 2. For the above problem with it holds that
It follows, that if then is . But if then nothing is known.
Proof. Let . We have the memory of bits and we are going to subsample it. Specifically, we will select a bit of with probability , independently.
The intuition is that we will shrink the memory but still answer a lot of queries, and derive a contradiction because of the seed length required to sample -wise independence.
For the “shrinking” part we have the following. We expect to keep memory bits. By a Chernoff bound, it follows that we keep bits except with probability .
For the “answer a lot of queries” part, recall that each query probes bits from the memory. We keep one of the queries if it so happens that we keep all the bits that it probed in the memory. For a fixed query, the probability that we keep all its probes is .
We claim that with probability at least , we keep queries. This follows by Markov’s inequality. We expect to not keep queries on average. We now apply Markov’s inequality to get that the probability that we don’t keep at least queries is at most .
Thus, if , then there exists a fixed choice of memory bits that we keep, to achieve both the “shrinking” part and the “answer a lot of queries” part as above. This inequality is true because and so . But now we have bits of memory while still answering as many as queries.
The minimum seed length to answer that many queries while maintaining -wise independence is . Therefore the memory has to be at least as big as the seed. This yields
from which the result follows.
This lower bound holds even if the memory bits are filled arbitrarily (rather than having entropy at most ). It can also be extended to adaptive cell probes.
We will now show a conceptually simple data structure which nearly matches the lower bound. Pick a random bipartite graph with nodes on the left and nodes on the right. Every node on the right side has degree . We answer each probe with an XOR of its neighbor bits. By the Vazirani XOR lemma, it suffices to show that any subset of at most memory bits has an XOR which is unbiased. Hence it suffices that every subset with has a unique neighbor. For that, in turn, it suffices that has a neighborhood of size greater than (because if every element in the neighborhood of has two neighbors in then has a neighborhood of size ). We pick the graph at random and show by standard calculations that it has this property with non-zero probability.
It suffices to have , so that the probability is strictly less than 1, because . We can match the lower bound in two settings:
- if for some constant , then suffices,
- and suffices.
Remark 3. It is enough if the memory is -wise independent as opposed to completely uniform, so one can have . An open question is if you can improve the seed length to optimal.
Theorem 5. You need .
Proof. Every query is answered by looking at bits. But queries are answered by the same 2-bit function of probes (because there is a constant number of functions on 2-bits). There are two cases for :
- is linear (or affine). Suppose for the sake of contradiction that . Then you have a linear dependence, because the space of linear functions on bits is . This implies that if you XOR those bits, you always get 0. This in turn contradicts the assumption that the distributions has small bias.
- is AND (up to negating the input variables or the output). In this case, we keep collecting queries as long as they probe at least one new memory bit. If when we stop we have a query left such that both their probes query bits that have already been queried. This means that there exist two queries and whose probes cover the probes of a third query . This in turn implies that the queries are not close to uniform. That is because there exist answers to and that fix bits probed by them, and so also fix the bits probed by . But this contradicts the small bias of the distribution.