Guest lecture by Huacheng Yu on dynamic data structure lower bounds, for the 2D range query and 2D range parity problems. Thanks to Huacheng for giving this lecture and for feedback on the write-up.
What is covered.
- Overview of Larsen’s lower bound for 2D range counting.
- Extending these techniques to 2D range parity.
Definition 1. 2D range counting
Give a data structure that maintains a weighted set of 2 dimensional points with integer coordinates, that supports the following operations:
- UPDATE: Add a (point, weight) tuple to the set.
- QUERY: Given a query point $(x, y)$, return the sum of weights of points $(x', y')$ in the set satisfying $x' \le x$ and $y' \le y$.
Definition 2. 2D range parity
Give a data structure that maintains an unweighted set of 2 dimensional points with integer coordinates, that supports the following operations:
- UPDATE: Add a point to the set.
- QUERY: Given a query point $(x, y)$, return the parity of the number of points $(x', y')$ in the set satisfying $x' \le x$ and $y' \le y$.
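As a reference point for the two definitions, here is a minimal brute-force sketch. It assumes the query is a dominance query (a stored point `(px, py)` counts when `px <= x` and `py <= y`); the class and method names are hypothetical and chosen only for illustration. Each query takes linear time, so this says nothing about the bounds discussed in the lecture.

```python
class NaiveRangeCounting:
    """Brute-force reference for Definitions 1 and 2 (dominance semantics assumed)."""

    def __init__(self):
        self.points = []  # list of (x, y, weight) tuples

    def update(self, x, y, weight=1):
        # UPDATE: add a (point, weight) tuple to the set.
        self.points.append((x, y, weight))

    def query(self, x, y):
        # QUERY for 2D range counting: sum of weights of dominated points.
        return sum(w for (px, py, w) in self.points if px <= x and py <= y)

    def query_parity(self, x, y):
        # QUERY for 2D range parity: parity of the number of dominated points.
        return sum(1 for (px, py, _) in self.points if px <= x and py <= y) % 2
```

Such a structure is mainly useful as an oracle when testing a faster implementation.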
Both of these definitions extend easily to the -dimensional case, but we state the 2D versions as we will mainly work with those.
All upper bounds assume the RAM model with word size .
Upper bounds. Using range trees, we can create a data structure for 2D range counting, with all update and query operations taking time . With extra tricks, we can make this work for 2D range parity with operations running in time .
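To make the flavor of such upper bounds concrete, the sketch below uses a two-dimensional Fenwick (binary indexed) tree. This is a swapped-in technique, not the range tree mentioned above, and it assumes that coordinates lie in a bounded grid `1..width` by `1..height`, which the lecture does not require. The class name `Fenwick2D` is hypothetical. Each operation touches at most about `log2(width) * log2(height)` cells.

```python
class Fenwick2D:
    """2D Fenwick tree over the grid [1, width] x [1, height] (bounded grid is an assumption)."""

    def __init__(self, width, height):
        self.width = width
        self.height = height
        self.tree = {}  # sparse storage: (i, j) -> partial sum of weights

    def update(self, x, y, weight=1):
        # Add `weight` at point (x, y); walks upward in both dimensions.
        i = x
        while i <= self.width:
            j = y
            while j <= self.height:
                self.tree[(i, j)] = self.tree.get((i, j), 0) + weight
                j += j & -j
            i += i & -i

    def query(self, x, y):
        # Sum of weights of points (px, py) with px <= x and py <= y.
        total = 0
        i = x
        while i > 0:
            j = y
            while j > 0:
                total += self.tree.get((i, j), 0)
                j -= j & -j
            i -= i & -i
        return total
```

Taking `query(x, y) % 2` with unit weights gives a range parity answer from the same structure.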
Lower bounds. There is a series of works on lower bounds:
- Fredman, Saks ’89 – 1D range parity requires .
- Patrascu, Demaine ’04 – 1D range counting requires .
- Larsen ’12 – 2D range counting requires .
- Larsen, Weinstein, Yu ’17 – 2D range parity requires .
This lecture presents the recent results of [Larsen ’12] and [Larsen, Weinstein, Yu ’17]. They both use the same general approach:
- Show that, for an efficient approach to exist, the problem must demonstrate some property.
- Show that the problem doesn’t have that property.
All lower bounds are in the cell probe model with word size .
We consider a general data structure problem, where we require a structure that supports updates and queries of an unspecified nature. We further assume that there exists an efficient solution with update and query times . We will restrict our attention to operation sequences of the form . That is, a sequence of updates followed by a single query . We fix a distribution over such sequences, and show that the problem is still hard.
3.1 Chronogram method [FS89]
We divide the updates into epochs, so that our sequence becomes:
where and . The epochs are multiplicatively shrinking. With this requirement, we have that .
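A small sketch of one way to cut an update sequence into multiplicatively shrinking epochs. The growth factor `beta`, the exact epoch sizes `beta**i`, and the function name are assumptions made for illustration; the write-up's own parameters are not reproduced here. Epochs are returned in chronological order, so the last epoch is the smallest one and sits closest to the query.

```python
def split_into_epochs(updates, beta):
    """Split `updates` into epochs whose sizes grow by a factor `beta` going back in time.

    The final (most recent) epoch has size beta, the one before it beta**2, and so on.
    The earliest epoch may be truncated if the sizes do not add up exactly.
    """
    epochs = []
    end = len(updates)
    size = beta
    while end > 0:
        start = max(0, end - size)
        epochs.append(updates[start:end])
        end = start
        size *= beta
    epochs.reverse()  # chronological order: largest (oldest) epoch first
    return epochs
```

With sizes of this form, the number of epochs is logarithmic in the number of updates, and all epochs after a given one are together smaller than that epoch by roughly a factor of `beta`.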
Let be the set of all memory cells used by the data structure when run on the sequence of updates. Further, let be the set of memory cells which are accessed by the structure at least once in , and never again in a further epoch.
Claim 2. There exists an epoch such that probes cells from when answering the query at the end. Note that this is simply our query time divided by the number of epochs. In other words, can’t afford to read cells from each set without breaking its promise on the query run time.
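The averaging step behind the claim can be written out as follows. The symbols are assumptions introduced for this sketch: $t_q$ is the query time, $k$ the number of epochs, $A_i$ the cells last accessed in epoch $i$, and $\mathrm{Probe}(q)$ the set of cells read when answering $q$. Since each cell has a unique last-accessed epoch, the sets $A_i$ are disjoint, so

$$
\sum_{i=1}^{k} \mathbb{E}\bigl[\,|\mathrm{Probe}(q) \cap A_i|\,\bigr] \;\le\; \mathbb{E}\bigl[\,|\mathrm{Probe}(q)|\,\bigr] \;\le\; t_q
\quad\Longrightarrow\quad
\min_{1 \le i \le k} \mathbb{E}\bigl[\,|\mathrm{Probe}(q) \cap A_i|\,\bigr] \;\le\; \frac{t_q}{k}.
$$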
Claim 2 implies that there is an epoch which has the smallest effect on the final answer. We will call this the “easy” epoch.
Idea. The set contains “most” information about among all memory cells in . Also, are not updated past epoch , and hence should contain no information relative to the updates in . Epochs are progressively shrinking, and so the total touched cells in during the query operation should be small.
Having set up the framework for how to analyze the data structure, we now introduce a communication game where two parties attempt to solve an identical problem. We will show that an efficient data structure implies an efficient solution to this communication game. If the message is smaller than the entropy of the updates of epoch (conditioned on preceding epochs), this gives an information theoretic contradiction. The trick is to find a way for the encoder to exploit the small number of probed cells to send a short message.
The game. The game consists of two players, Alice and Bob, who must jointly compute a single query after a series of updates. The model is as follows:
- Alice has all of the update epochs . She also has an index , which still corresponds to the “easy” epoch as defined above.
- Bob has all update epochs except for . He also has a random query . He is aware of the index .
- Communication can only occur in a single direction, from Alice to Bob.
- We assume some fixed input distribution .
- They win this game if Bob successfully computes the correct answer for the query .
Theorem 3. If there is a data structure with update time and probes cells from in expectation when answering the final query , then the communication game has an efficient solution, with communication cost, and success probability at least . This holds for any choice of .
Before we prove the theorem, we consider specific parameters for our problem. If we pick
then, after plugging in the parameters, the communication cost is . Note that we could always trivially achieve by having Alice send Bob all of , so that he can compute the solution of the problem with no uncertainty. The success probability is , which simplifies to . This is significantly better than , which could be achieved trivially by having Bob output a random answer to the query, independent of the updates.
We assume we have a data structure for the update / query problem. Then Alice and Bob will proceed as follows:
Alice’s steps.
- Simulate on . While doing so, keep track of memory cell accesses and compute .
- Sample a random subset , such that .
- Send .
We note that in Alice’s Step 3, to send a cell, she sends a tuple holding the cell ID and the cell state before the query was executed. Also note that she does not tell Bob which cells belong to which sets of the union.
Bob’s steps.
- Receive from Alice.
- Simulate on epochs . Snapshot the current memory state of the data structure as .
- Simulate the query algorithm. Every time attempts to probe cell , Bob checks if . If it is, he lets probe from . Otherwise, he lets probe from .
- Bob returns the result from the query algorithm as his answer.
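The steps above can be simulated on a toy instance. The sketch below makes several assumptions that are not in the write-up: the data structure is a 1D Fenwick tree storing parities (standing in for an arbitrary cell-probe structure), a cell belongs to epoch `j` when its last write happened in epoch `j`, Alice's message consists of all cells last written after epoch `i` together with a fixed-size random subset of the cells last written in epoch `i`, and epochs are numbered so that epoch `k` is the oldest and epoch `1` the most recent. All names are hypothetical.

```python
import random


class FenwickParity:
    """Toy cell-probe structure: Fenwick tree over [1, n] storing parities."""

    def __init__(self, n):
        self.n = n
        self.cells = {}       # cell id -> bit (absent cells hold 0)
        self.last_write = {}  # cell id -> epoch of the most recent write

    def update(self, x, epoch):
        while x <= self.n:
            self.cells[x] = self.cells.get(x, 0) ^ 1
            self.last_write[x] = epoch
            x += x & -x

    def probe_list(self, x):
        # Cells read when answering the prefix-parity query at x.
        cells = []
        while x > 0:
            cells.append(x)
            x -= x & -x
        return cells


def one_way_protocol(n, epochs, i, q, sample_size, rng):
    """epochs maps j in {1..k} to a list of update positions; returns (Bob's answer, truth, #cells sent)."""
    k = max(epochs)

    # Alice: simulate everything, then pick the cells to send.
    alice = FenwickParity(n)
    for j in range(k, 0, -1):
        for x in epochs[j]:
            alice.update(x, j)
    later = [c for c, e in alice.last_write.items() if e < i]
    a_i = sorted(c for c, e in alice.last_write.items() if e == i)
    sample = rng.sample(a_i, min(sample_size, len(a_i)))
    message = {c: alice.cells[c] for c in later + sample}

    # Bob: simulate only the epochs before epoch i and snapshot that memory.
    bob = FenwickParity(n)
    for j in range(k, i, -1):
        for x in epochs[j]:
            bob.update(x, j)

    # Bob answers the query, preferring Alice's cells over his snapshot.
    answer = 0
    for c in bob.probe_list(q):
        answer ^= message[c] if c in message else bob.cells.get(c, 0)

    truth = 0
    for c in alice.probe_list(q):
        truth ^= alice.cells.get(c, 0)
    return answer, truth, len(message)
```

When `sample_size` is large enough to cover every cell of epoch `i`, Bob's answer always matches the truth; shrinking it trades message length against the chance that the query probes a cell of epoch `i` that Bob never received.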
If the query algorithm does not query any cell in , then Bob succeeds, as he can exactly simulate the data structure query. Since the query will check cells in , and Bob has a random subset of them of size , then the probability that he got a subset the data structure will not probe is at least . The communication cost is the cost of Alice sending the cells to Bob, which is
Theorem 1. Consider an arbitrary data structure problem where queries have 1-bit outputs. If there exists a data structure having:
- update time
- query time
- Probes cells from when answering the last query
Then there exists a protocol for the communication game with bits of communication and success probability at least , for any choice of . Again, we plug in the parameters from 2D range parity. If we set
then the cost is , and the probability simplifies to .
We note that, if we had different queries, then randomly guessing on all of them, with constant probability we could be correct on as many as . In this case, the probability of being correct on a single one, amortized, is .
Proof. The communication protocol will be slightly adjusted. We assume an a priori distribution on the updates and queries. Bob will then compute the posterior distribution, based on what he knows and what Alice sends him. He then computes the maximum likelihood answer to the query . We thus need to figure out what Alice can send, so that the answer to is often biased towards either $0$ or $1$.
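Bob's maximum likelihood step can be illustrated with a small sketch. It assumes Bob has already enumerated the possible worlds consistent with what he knows, each with a posterior weight and the 1-bit answer the query would have in that world; how he obtains this list is not modeled here, and the function name is hypothetical.

```python
def maximum_likelihood_answer(worlds):
    """worlds: iterable of (posterior_weight, answer_bit) pairs; weights need not be normalized.

    Returns the more likely bit and its posterior probability.
    """
    mass = [0.0, 0.0]
    for weight, bit in worlds:
        mass[bit] += weight
    total = mass[0] + mass[1]
    guess = 1 if mass[1] > mass[0] else 0
    return guess, mass[guess] / total
```

The further the returned probability is from one half, the more Alice's message has biased the answer, which is the quantity the proof goes on to control.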
We assume the existence of some public randomness available to both Alice and Bob. Then we adjust the communication protocol as follows:
Alice’s modified steps.
- Alice samples, using the public randomness, a subset of ALL memory cells , such that each cell is sampled with probability . Alice sends to Bob. Since Bob can mimic the sampling, he gains additional information about which cells are and aren’t in .
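A minimal sketch of the public-randomness sampling, assuming that shared randomness is modeled by a seed known to both players and that memory cells are numbered `0..num_cells-1`. The function names and the dictionary `a_i_contents` (cell ID to contents, for the cells of the easy epoch) are hypothetical. Because Bob can regenerate the same sample, every sampled cell missing from the message is known to lie outside Alice's set.

```python
import random


def public_sample(num_cells, p, shared_seed):
    # Both players call this with the same seed and obtain the same subset.
    rng = random.Random(shared_seed)
    return {c for c in range(num_cells) if rng.random() < p}


def alice_message(a_i_contents, num_cells, p, shared_seed):
    # Alice sends the contents of the sampled cells that lie in her set.
    sample = public_sample(num_cells, p, shared_seed)
    return {c: a_i_contents[c] for c in sample if c in a_i_contents}


def bob_known_outside(message, num_cells, p, shared_seed):
    # Sampled cells that Alice did not send cannot belong to her set.
    sample = public_sample(num_cells, p, shared_seed)
    return sample - set(message)
```

The message carries only the contents of the intersection, yet Bob also learns negative information about the rest of the sample for free.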
Bob’s modified steps.
- Denote by the set of memory cells probed by the data structure when Bob simulates the query algorithm. That is, is what Bob “thinks” D will probe during the query, as the actual set of cells may be different if Bob had full knowledge of the updates, and the data structure may use that information to determine what to probe. Bob will use to compute the posterior distribution.
Define the function to be the “bias” when takes on the value . In particular, this function is conditioned on that Bob receives from Alice. We can then clarify the definition of as
In particular, has the following two properties:
In these statements, the expectation is over everything that Bob knows, and the probabilities are also conditioned on everything that Bob knows. The randomness comes from what he doesn’t know. We also note that when the query probes no cells in , then the bias is always , since the posterior distribution will put all its weight on the correct answer of the query.
Lemma 2. For any with the above two properties, there exists a such that and
Note that the sum inside the absolute values is the bias when .