Mixing in groups

Non-abelian groups behave in ways that are useful in computer science. Barrington’s famous result [Bar89] shows that we can efficiently write an arbitrary low-depth computation as a group product over any non-solvable group. (Being non-solvable is a certain strengthening of being non-abelian which is not important now.) His result, which is false for abelian groups, has found myriad applications in computer science. It is amusing to note that results about representing computation as group products were actually obtained twenty years before Barrington, see [KMR66]; but the time was not yet ripe.
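To make the group-product idea concrete, here is a toy sketch in Python (not from the sources above) of the commutator trick that underlies such constructions: an AND of two bits computed as a product of four permutations in S_5, each factor depending on a single input bit. The specific 5-cycles below are one choice that works; Barrington’s full result iterates this idea along a circuit.

    # Permutations on {0,...,4} as tuples: p[i] is the image of i.
    def compose(f, g):                    # apply g first, then f
        return tuple(f[g[i]] for i in range(5))

    def inverse(f):
        out = [0] * 5
        for i, fi in enumerate(f):
            out[fi] = i
        return tuple(out)

    identity = (0, 1, 2, 3, 4)
    sigma = (1, 2, 3, 4, 0)               # the 5-cycle (1 2 3 4 5)
    tau = (2, 0, 4, 1, 3)                 # the 5-cycle (1 3 5 4 2)

    def and_program(x1, x2):
        # sigma^x1 tau^x2 sigma^-x1 tau^-x2: the identity unless x1 = x2 = 1,
        # in which case it is the commutator of sigma and tau, a 5-cycle.
        a = sigma if x1 else identity
        b = tau if x2 else identity
        return compose(compose(a, b), compose(inverse(a), inverse(b)))

    for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        result = and_program(x1, x2)
        print(x1, x2, result, "identity" if result == identity else "5-cycle")

Each factor in the product depends on just one input bit; composing such gadgets along a circuit is what yields Barrington’s theorem.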

This post is about a different property that certain non-abelian groups have and that is also useful. Basically, these groups “mix” in the sense that if you have several distributions over the group, and the distributions have high entropy, then the product distribution (i.e., sample from each distribution and output the product) is very close to uniform.

First, let us quickly remark that this is completely false for abelian groups. For an example familiar to computer scientists, consider the group of n-bit strings under bit-wise xor. Let A be the uniform distribution over the strings whose first bit is 0. Then no matter how many independent copies of A you multiply together, the product distribution is always A.
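Here is a minimal sketch in Python of this non-mixing phenomenon (the choice n = 3 and the number of copies are arbitrary): the xor of any number of independent copies of A has exactly the distribution A.

    from itertools import product

    n = 3
    # A: uniform over the n-bit strings whose first bit is 0.
    A = {x: 1 / 2 ** (n - 1) for x in product((0, 1), repeat=n) if x[0] == 0}

    def xor_convolve(P, Q):
        # Distribution of x xor y for independent x ~ P, y ~ Q.
        R = {}
        for x, px in P.items():
            for y, qy in Q.items():
                z = tuple(a ^ b for a, b in zip(x, y))
                R[z] = R.get(z, 0.0) + px * qy
        return R

    P = dict(A)
    for _ in range(10):          # fold in ten more independent copies of A
        P = xor_convolve(P, A)

    # The product never leaves A, let alone approaches uniform:
    print(max(abs(P.get(x, 0.0) - A.get(x, 0.0))
              for x in product((0, 1), repeat=n)))   # ~0 up to float error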

Remarkably, over other groups it is possible to show that the product distribution will become closer and closer to uniform. A group that works very well in this respect is SL(2,q), the group of 2×2 matrices over the field with q elements with determinant 1. This is a group that in some sense is very far from abelian. In particular, one can prove the following result.

Theorem 1. [Three-step mixing, [Gow08, BNP08]] Let G = SL(2,q), and let A, B, and C be three subsets of G of constant density. Let a, b, and c be picked independently and uniformly from A, B, and C respectively. Then for any g in G we have

|Pr[abc = g] – 1∕|G|| < 1∕|G|^{1+Ω(1)}.

Note that the conclusion of the theorem in particular implies that abc is supported over the entire group. This is remarkable, since the starting distributions are supported over only a small fraction of the group. Moreover, by summing over all elements g in the group we obtain that abc is polynomially close to uniform in statistical distance.
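As a quick numerical illustration, the following Python sketch builds SL(2,q) for the toy case q = 3 (so |G| = 24, far too small for the asymptotic |G|^{-Ω(1)} bound to mean much, but enough to see the effect) and measures how close abc is to uniform for three arbitrary half-density subsets.

    from itertools import product

    q = 3  # a small prime; |SL(2,q)| = q(q^2 - 1) = 24
    G = [((a, b), (c, d))
         for a, b, c, d in product(range(q), repeat=4)
         if (a * d - b * c) % q == 1]

    def mul(M, N):
        (a, b), (c, d) = M
        (e, f), (g, h) = N
        return (((a * e + b * g) % q, (a * f + b * h) % q),
                ((c * e + d * g) % q, (c * f + d * h) % q))

    # Three arbitrary subsets of density 1/2.
    A, B, C = G[:12], G[6:18], G[12:]

    counts = {g: 0 for g in G}
    for a, b, c in product(A, B, C):
        counts[mul(mul(a, b), c)] += 1

    total = len(A) * len(B) * len(C)
    dev = max(abs(counts[g] / total - 1 / len(G)) for g in G)
    print(f"|G| = {len(G)}, 1/|G| = {1 / len(G):.4f}, max deviation = {dev:.4f}")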

Theorem 1 can be proved using representation theory. This must be a great tool, but for some reason I always found it a little difficult to digest the barrage of definitions that usually precedes the interesting stuff.

Luckily, there is another way to prove Theorem 1. I wouldn’t be surprised if this is in some sense the same way, and moreover this other way is not something I would call elementary. But it is true that I will be able to sketch a proof of the theorem without using the word “representation”. In this post we will prove some preliminary results that are valid for all groups, and the most complicated thing used is the Cauchy-Schwarz inequality. In the next post we will work specifically with the group SL(2,q), and use more machinery. This is all taken from this paper with Gowers [GV15] (whose main focus is the study of mixing in the presence of dependencies).

First, for convenience let us identify a set A with its characteristic function. So we write A(a) = 1 if a belongs to the set A, and A(a) = 0 otherwise. It is convenient to work with a slightly different statement:

Theorem 2. Let G = SL(2,q) and let A, B, C be three subsets of G of densities α, β, γ respectively. For any g in G,

|E_{abc=g} A(a)B(b)C(c) – αβγ| ≤ |G|^{-Ω(1)},

where the expectation is over uniform elements a, b, and c from the group G such that their product is equal to g.

This Theorem 2 is equivalent to Theorem 1, because

E_{abc=g} A(a)B(b)C(c) = Pr[a ∈ A, b ∈ B, c ∈ C | abc = g]
= Pr[abc = g | a ∈ A, b ∈ B, c ∈ C]·αβγ∕Pr[abc = g]
= Pr[abc = g | a ∈ A, b ∈ B, c ∈ C]·|G|·αβγ

by Bayes’ rule, using that Pr[abc = g] = 1∕|G| when a, b, and c are uniform in G. So we can get Theorem 1 by dividing by |G|αβγ (since α, β, γ are constants, the |G|^{-Ω(1)} error of Theorem 2 becomes 1∕|G|^{1+Ω(1)}).

Now we observe that to prove this “mixing in three steps” it actually suffices to prove mixing in four steps.

Theorem 3. [Mixing in four steps] Let G = SL(2,q) and let A, B, C, D be four subsets of G of densities α, β, γ, δ respectively. For any g in G,

|E_{abcd=g} A(a)B(b)C(c)D(d) – αβγδ| ≤ |G|^{-Ω(1)},

where the expectation is over uniform elements a, b, c, and d from the group G such that their product is equal to g.

Lemma 4. Mixing in four steps implies mixing in three.

Proof: Rewrite

|E_{abc=g} A(a)B(b)C(c) – αβγ| = |E_{abc=g} f(a)B(b)C(c)|

where f(a) := A(a) – α.

In these proofs we will apply Cauchy-Schwarz several times. Each application “loses a square,” but since we are aiming for an upper bound of the form 1∕|G|^{Ω(1)} we can afford any constant number of applications. Our first one is now:

(E_{abc=g} f(a)B(b)C(c))^2 ≤ (E_c C(c)^2)·(E_c (E_{ab=gc^{-1}} f(a)B(b))^2)
= γ·E_c E_{ab=a′b′=gc^{-1}} f(a)B(b)f(a′)B(b′)
= γ·E_{ab=a′b′} (A(a) – α)B(b)B(b′)(A(a′) – α).

There are four terms that make up the expectation. The terms that involve at least one α sum to –α^2β^2: in each such term the remaining variables are uniform and independent, so the three terms contribute –α^2β^2 – α^2β^2 + α^2β^2. The remaining term is the expectation of A(a)B(b)B(b′)A(a′). Note that ab = a′b′ is equivalent to ab·b′^{-1}·a′^{-1} = 1_G. Hence by Theorem 3, applied with g = 1_G (and with the last two sets replaced by their sets of inverses, which have the same densities), this expectation is α^2β^2 up to an error of |G|^{-Ω(1)}, and so the overall bound is |G|^{-Ω(1)}. QED
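Since the above chain of inequalities is valid in every finite group, it can be sanity-checked mechanically. Here is a small Python sketch (the group S_3 and the test sets are arbitrary choices) that verifies the inequality (E_{abc=g} f(a)B(b)C(c))^2 ≤ γ·E_{ab=a′b′} f(a)B(b)B(b′)f(a′) exactly, in rational arithmetic, for every g.

    from itertools import permutations, product
    from fractions import Fraction

    G = list(permutations(range(3)))      # the symmetric group S_3
    def mul(p, s): return tuple(p[s[i]] for i in range(3))
    def inv(p):
        out = [0] * 3
        for i, pi in enumerate(p):
            out[pi] = i
        return tuple(out)

    A, B, C = set(G[:3]), set(G[2:5]), set(G[1:5])   # arbitrary test sets
    alpha = Fraction(len(A), len(G))
    gamma = Fraction(len(C), len(G))
    def f(a): return Fraction(int(a in A)) - alpha

    # Right-hand side: gamma * E_{ab = a'b'} f(a) B(b) B(b') f(a'),
    # parametrized by uniform (a, b, b') with a' = a b b'^{-1}.
    rhs = gamma * sum(f(a) * (b in B) * (b2 in B) * f(mul(mul(a, b), inv(b2)))
                      for a, b, b2 in product(G, repeat=3)) / len(G) ** 3

    for g in G:
        # Left-hand side: (E_{abc = g} f(a) B(b) C(c))^2, with c = (ab)^{-1} g.
        lhs = (sum(f(a) * (b in B) * (mul(inv(mul(a, b)), g) in C)
                   for a, b in product(G, repeat=2)) / len(G) ** 2) ** 2
        assert lhs <= rhs
    print("Cauchy-Schwarz step verified on S_3")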

So what remains to see is how to prove mixing in four steps. We shall reduce the mixing problem to the following statement about the mixing of conjugacy classes of our group.

Definition 5. We denote by C(g) the conjugacy class {h^{-1}gh : h ∈ G} of an element g in G. We also denote by C(g) the uniform distribution over the set C(g); which of the two is meant will be clear from the context.

Theorem 6. [Mixing of conjugacy classes of SL(2,q)] Let G = SL(2,q). With probability ≥ 1 – |G|^{-Ω(1)} over uniform a, b in G, the distribution C(a)C(b) is |G|^{-Ω(1)}-close in statistical distance to uniform.
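This too can be eyeballed numerically. The Python sketch below (again with the toy choice q = 3, where the asymptotic bound is only suggestive) computes, for every pair (a,b), the statistical distance of C(a)C(b) from uniform. Pairs involving central elements have singleton conjugacy classes and stay far from uniform, which is consistent with the theorem excluding a small fraction of pairs.

    from itertools import product

    q = 3
    G = [((a, b), (c, d))
         for a, b, c, d in product(range(q), repeat=4)
         if (a * d - b * c) % q == 1]

    def mul(M, N):
        (a, b), (c, d) = M
        (e, f), (g, h) = N
        return (((a * e + b * g) % q, (a * f + b * h) % q),
                ((c * e + d * g) % q, (c * f + d * h) % q))

    def inv(M):  # for det = 1: ((a,b),(c,d))^{-1} = ((d,-b),(-c,a))
        (a, b), (c, d) = M
        return ((d % q, -b % q), (-c % q, a % q))

    cls = {g: sorted({mul(mul(inv(h), g), h) for h in G}) for g in G}

    def dist_from_uniform(a, b):
        counts = {g: 0 for g in G}
        for u, v in product(cls[a], cls[b]):
            counts[mul(u, v)] += 1
        total = len(cls[a]) * len(cls[b])
        return sum(abs(c / total - 1 / len(G)) for c in counts.values()) / 2

    dists = sorted(dist_from_uniform(a, b) for a, b in product(G, repeat=2))
    print("median:", round(dists[len(dists) // 2], 4),
          " max:", round(dists[-1], 4))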

Theorem 6 is proved in the next blog post. Here we just show that it suffices for our needs.

Lemma 7. Theorem 6 implies Theorem 3.

Proof: Note that a and c remain uniform and independent even after conditioning on abcd = g, so E_{abcd=g} A(a)C(c) = αγ. Hence we can rewrite the quantity to bound as

E_{abcd=g} A(a)C(c)f(b,d)

for f(b,d) := B(b)D(d) – βδ.

Now by Cauchy-Schwarz we bound the square of this above by

(E_{a,c} A(a)^2C(c)^2)·(E_{a,c} (E_{bd : abcd=g} f(b,d))^2) = αγ·E f(b,d)f(b′,d′),

where the last expectation is over variables such that abcd = g and ab′cd′ = g. As in the proof that mixing in four steps implies mixing in three, we can rewrite the last two equations as the single equation bcd = b′cd′ (both sides equal a^{-1}g).

The fact that the same variable c occurs on both sides of the equation is what gives rise to conjugacy classes. Indeed, this equation can be rewritten as

c^{-1}b^{-1}b′c = dd′^{-1}.

Performing the substitutions b = x, b′ = xh, d′ = y, we can rewrite our equation as

d = c^{-1}hcy.

Hence we have reduced our task to that of bounding

αγ·E_{x,y,h} f(x, C(h)y)·f(xh, y)

for uniform x, y, h, where C(h) denotes a uniform element of the conjugacy class of h (namely c^{-1}hc for the uniform element c).

We can further replace y with C(h)^{-1}y, and rewrite the expression as

αγ·E_{x,y,h} f(x, y)·f(xh, C(h^{-1})y).
By Cauchy-Schwarz (over x and y, and dropping the factor αγ ≤ 1), the square of this is at most

(E_{x,y} f^2(x,y))·E_{x,y,h,h′} f(xh, C(h^{-1})y)·f(xh′, C(h′^{-1})y).

Recalling that f(b,d) = B(b)D(d) – βδ, we have E_{x,y} f^2(x,y) = βδ(1 – βδ) ≤ 1, so the first factor is at most 1. The second factor can be rewritten as

E_{x,y,h,h′} f(x, y)·f(xh^{-1}h′, C(h′^{-1})C(h)y)

by replacing x with xh^{-1} and y with C(h^{-1})^{-1}y = C(h)y.

Again using the definition of f, this equals

E_{x,y,h,h′} B(x)D(y)B(xh^{-1}h′)D(C(h′^{-1})C(h)y) – β^2δ^2,

because in each of the two cross terms the remaining variables are uniform, so the cross terms contribute –2β^2δ^2 and the constant term adds β^2δ^2 back.

Now Theorem 6 guarantees that the distribution (x, y, xh^{-1}h′, C(h′^{-1})C(h)y) is |G|^{-Ω(1)}-close in statistical distance to the uniform distribution over G^4, and this concludes the proof. QED


[Bar89]    David A. Mix Barrington. Bounded-width polynomial-size branching programs recognize exactly those languages in NC^1. J. of Computer and System Sciences, 38(1):150–164, 1989.

[BNP08]    László Babai, Nikolay Nikolov, and László Pyber. Product growth and mixing in finite groups. In ACM-SIAM Symp. on Discrete Algorithms (SODA), pages 248–257, 2008.

[Gow08]    W. T. Gowers. Quasirandom groups. Combinatorics, Probability & Computing, 17(3):363–387, 2008.

[GV15]    W. T. Gowers and Emanuele Viola. The communication complexity of interleaved group products. In ACM Symp. on the Theory of Computing (STOC), 2015.

[KMR66]   Kenneth Krohn, W. D. Maurer, and John Rhodes. Realizing complex Boolean functions with simple groups. Information and Control, 9:190–195, 1966.

Ditch your family and come to FOCS

In a week I will be attending the FOCS conference. As usual, I find the program very interesting, and look forward to the talks. Hopefully not every one of them will be over my head; I hope to write a report about the talks later. It’s also a great fortune that I will be able to get there by train!

Workshops and the end of the celebration for Avi Wigderson’s birthday are on Saturday. The main conference starts on Sunday, and lasts until Tuesday. Monday is Columbus Day, a national holiday. To state the obvious, Saturday, Sunday, and Monday are days when schools, day cares, and other things that allow for work are not available. Couldn’t we hold conferences during week-days, like other regular events such as NSF panels? To be clear this has nothing specific to do with this FOCS, but is a general theme. Which may be part of the reason why certain groups of computer scientists are called minorities.

Sure, holding a conference during week-days means that you have to skip “work.” But isn’t attending the conference at least as important work? I have never met an attendee who wouldn’t jump for joy if they had a valid excuse to skip a lecture or a committee meeting, nor (though maybe I have been lucky) a dean or department chair who would obstruct attendance.

Exercise, diet, and sleep improve brain skills (and health)

This semester I am teaching 80 undergraduates Theory of Computation. I love the material and so every minute is precious, but I decided to sacrifice a few for a quick illustration of the title. After all, I thought to myself, it *is* my job to know, use, and disseminate teaching techniques that improve the students’ performance. So why shouldn’t I tell them the benefits of cardio exercise on learning? So this morning I scrambled together a few slides which you can see here.

I plan to add much more in future versions, but it is an understatement to say that I am not an expert in these areas. So I’d very much appreciate any pointers, especially to the landmark papers in these areas.

Paper X, on the ArXiv, keeps getting rejected

Paper X, on the ArXiv, keeps getting rejected. Years later, Paper Y comes along and makes progress on X, or does something closely related to X. Y may or may not cite X. Y gets published. Now X cannot get published, because the referees do not see what the contribution of X is: Y has already been published, and in light of Y, X is not new.

In my opinion the solution, following a series of earlier posts the last one of which is this, is to move the emphasis away from publication and towards ArXiv appearance. Citations should refer to the first available version, often the ArXiv one. Journals and conferences can still exist in this new paradigm: their main job would be to assign badges to ArXiv papers.

Obviously, this plan does not work for the entities behind current journals/conferences. So they enforce the status quo, and in the most degrading way: by forcing authors to fish out, maintain, and format useless references.

Hokuto No Ken and growing up in Italy

It seems that the Hokuto No Ken videogame that should have been made decades ago is finally being made. Thanks to Marco Genovesi for sending me this link. (More about Marco later on this blog.)

I consider watching the Hokuto No Ken series (excluding the more recent garbage) one of the most significant artistic experiences of my life, something that also makes me understand how some people can be so passionate about Dante or Homer. And if you grew up in Italy there is a special treat for you. You can watch a version where the words are dubbed, but the soundtrack and the screams are from the original Japanese. By contrast, the English-speaking audience can either watch the Japanese version with subtitles — and I always hate subtitles — or an English dubbed version. I once happened to get a glimpse of the latter and I was horrified: the masterful soundtrack has been replaced by a very cheap synth, not to mention the screams. Compare this to this.


provides an objective ranking of CS departments. It is the revamped version of a previous system which I followed also because it did not include NEU. The new one does. But there are reasons slightly more subtle than the fact that NEU ranks 9 in “theory” — or 11 if you remove “logic and verification”, in both cases beating institutions which are way above NEU in other rankings — why I think having this information around is very valuable. Unobjective rankings tend to be sticky and not reflect recent decay or growth. And if one still wants to ignore data at least now one knows exactly what data is being ignored.

One dimension where I would like the system to be more flexible is in the choice of venues to include in the count. For example, I think journals should be added. Among the conferences, CCC should be included, as the leading conference specialized in computational complexity. I think the user should be allowed to select any weighted subset of journals and conferences.


I strongly recommend playing Oiligarchy, a brilliant online game about American politics from the point of view of the oil industry. You should play to the end to see which ending you get: there are four, though one of them may be unattainable. But at the very least make sure to get to a presidential election. Its depiction is memorable. (Incidentally, many other games on the website are well worth a look.)

Turning to what is unfortunately not a videogame, this time we have a candidate, let us call them candidate X, who many think is exceptionally unqualified to be President. I disagree in an unimportant way. I am not sure X really is more unqualified, or more dangerous, than some of the other candidates in recent history, including at least one who actually became President for two terms. The one distinctive feature of X seems to be that X is more colorful and more openly arrogant than other candidates. But I don’t find this to be a substantial difference. I certainly wouldn’t mind it too much if what X said made any sense at all.

I also suspect that X and their entourage know well that the chance of X winning the election is negligible, but want to rake in and maximize publicity, according to the old dictum that there is no bad publicity.

Bounded indistinguishability

Countless papers study the properties of k-wise independent distributions, which are distributions where any k bits are uniform and independent. One property of interest is which computational models are fooled by such distributions, in the sense that they cannot distinguish any such distribution from a uniformly random one. Recently, Bazzi’s breakthrough, discussed earlier on this blog, shows that k = polylog(n) independence fools any polynomial-size DNF on n bits.

Let us change the question. Let us say that instead of one distribution we have two, and we know that any k bits are distributed identically, but not necessarily uniformly. We call such distributions k-wise indistinguishable. (Bounded independence is the special case when one distribution is uniform.) Can a DNF distinguish the two distributions? In fact, what about a single Or gate?

This is the question that we address in a paper with Bogdanov, Ishai, and Williamson. A big thank you goes to my student Chin Ho Lee for connecting researchers who were working on the same problems on different continents. Here at NEU the question was asked to me by my neighbor Daniel Wichs.

The question turns out to be equivalent to threshold/approximate degree, an influential complexity measure that goes back to the works of Minsky and Papert and of Nisan and Szegedy. The equivalence is a good example of the usefulness of duality theory, and is as follows (a small computational sketch appears after the list). For any boolean function f on n bits the following two are equivalent:

1. There exist two k-wise indistinguishable distributions that f tells apart with advantage ε;

2. No degree-k real polynomial can approximate f to pointwise error at most ε∕2.
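As promised above, here is a Python sketch of the first item of the equivalence (assuming numpy and scipy are available; the choices n = 4 and k = 2 are arbitrary). It sets up a linear program for the maximum advantage of the Or function against k-wise indistinguishable pairs of distributions, encoding indistinguishability as equality of all Fourier coefficients of degree 1 through k.

    from itertools import product, combinations
    import numpy as np
    from scipy.optimize import linprog

    n, k = 4, 2
    xs = list(product((0, 1), repeat=n))
    N = len(xs)                  # variables: p (first N entries) and q (last N)

    def chi(S, x):               # parity character over the coordinates in S
        return (-1) ** sum(x[i] for i in S)

    rows, rhs = [], []
    rows.append([1.0] * N + [0.0] * N); rhs.append(1.0)   # p is a distribution
    rows.append([0.0] * N + [1.0] * N); rhs.append(1.0)   # q is a distribution
    # k-wise indistinguishability: degree-1..k Fourier coefficients agree.
    for size in range(1, k + 1):
        for S in combinations(range(n), size):
            rows.append([chi(S, x) for x in xs] + [-chi(S, x) for x in xs])
            rhs.append(0.0)

    # Maximize sum_{x : Or(x)=1} (p(x) - q(x)), i.e. minimize its negation.
    orx = np.array([float(any(x)) for x in xs])
    c = np.concatenate([-orx, orx])

    res = linprog(c, A_eq=np.array(rows), b_eq=np.array(rhs), bounds=(0, 1))
    print(f"max advantage of Or on {n} bits vs {k}-wise indistinguishability:",
          round(-res.fun, 4))

If the duality holds as stated, the printed optimum should equal twice the least pointwise error achievable by degree-k polynomial approximations of Or.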

I have always liked this equivalence, but at times I felt slightly worried that it could be considered too “simple.” But hey, I hope my co-authors don’t mind if I disclose that it’s been to four different conferences, and not one reviewer filed a complaint about that.

From the body of works on approximate degree one readily sees that bounded indistinguishability behaves very differently from bounded independence. For example, one needs k = Ω(√n) to fool an Or gate, and that is tight. Yes, to spell this out, there exist two distributions which are 0.001·√n-wise indistinguishable but Or tells them apart with probability 0.999. But obviously even constant independence fools Or.

The biggest gap is achieved by the Majority function: constant independence suffices, by this, while linear indistinguishability is required by Paturi’s lower bound.

In the paper we apply this equivalence in various settings, and here I am just going to mention the design of secret-sharing schemes. Previous schemes like Shamir’s required the computation of things like parity, while the new schemes use different types of functions, for example of constant depth. Here we also rely on the amazing ability of constant-depth circuits to sample distributions, also pointed out earlier on this blog, and apply some expander tricks to trade alphabet size for other parameters.

The birthday paradox

The birthday paradox is the fact that if you sample t independent variables, each uniform in {1, 2, …, n}, then the probability that two of them are equal is at least a constant independent of n when t ≥ √n. The word “paradox” refers to the fact that t can be as small as √n, as opposed to being closer to n. (Here I am not interested in the precise value of this constant as a function of t.)

The Wikipedia page lists several proofs of the birthday paradox where it is not hard to see why the √n bound arises. Nevertheless, I find the following two-stage approach more intuitive.

Divide the random variables into two sets of 0.5√n each. If there are two in the first set that are equal, then we are done. So we can condition on this event not happening, which means that the variables in the first set are all distinct. Now take any variable in the second set. The probability that it is equal to some variable in the first set is 0.5√n∕n = 0.5∕√n. Hence, the probability that all the variables in the second set are different from those in the first is at most

(1 – 0.5∕√n)^{0.5√n} ≤ e^{-0.25} < 1.
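For what it’s worth, here is a quick Monte Carlo sketch in Python (the sample sizes are arbitrary) confirming that at t = √n the collision probability is bounded away from both 0 and 1, independently of n.

    import random

    def has_collision(t, n):
        seen = set()
        for _ in range(t):
            v = random.randrange(n)
            if v in seen:
                return True
            seen.add(v)
        return False

    trials = 1000
    for n in (10**4, 10**6):
        t = int(n ** 0.5)        # t = sqrt(n) samples
        hits = sum(has_collision(t, n) for _ in range(trials))
        print(f"n = {n}, t = {t}: collision in {hits / trials:.3f} of trials")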

You do not need to leave your room

You do not need to leave your room. Remain sitting at your table and listen. Do not even listen, simply wait. Do not even wait, be quiet still and solitary. The world will freely offer itself to you to be unmasked, it has no choice, it will roll in ecstasy at your feet.

In this time of emphasis on collaborative, interdisciplinary, cross-fertilizing research, I find these words by Kafka refreshing.