Mixing time of a pseudorandom number generator

Today I was due to speak at the Isaac Newton Institute’s workshop on “Interactions between group theory, number theory, combinatorics and geometry”, which has obviously been canceled. I was going to speak about my work on the Hall–Paige conjecture: see this earlier post. Meanwhile Péter Varjú was going to speak about some other joint work, which I’d like to share with you now.

Consider (if you can consider anything other than coronaviral apocalapyse right now) the random process {(X_n)} in {\mathbf{Z}/p\mathbf{Z}} defined in the following way: {X_0 = 1} (or any initial value), and for {n\geq 1}

\displaystyle   X_n = aX_{n-1} + b_n. \ \ \ \ \ (1)

Here we take {a} to be a fixed integer and {b_1, b_2, \dots} a sequence of independent draws from some finitely supported measure {\mu} on {\mathbf{Z}}. As a representative case, take {a = 2} and {\mu = u_{\{-1, 0, 1\}}} (uniform on {\{-1, 0, 1\}}).

Question 1 What is the mixing time of {(X_n \bmod p)} in {\mathbf{Z}/p\mathbf{Z}}? That is, how large must we take {n} before {X_n} is approximately uniformly distributed in {\mathbf{Z}/p\mathbf{Z}}?

This question was asked by Chung, Diaconis, and Graham (1987), who were motivated by pseudorandom number generation. The best-studied pseudorandom number generators are the linear congruential generators, which repeatedly apply an affine map {x\mapsto ax+b \bmod p}. These go back to Lehmer (1949), whose parameters {a} and {b} were also subject to some randomness (from MathSciNet, “The author’s proposal for generating a sequence of `random’ 8 decimal numbers {u_n} is to start with some number {u_0\neq 0}, multiply it by 23, and subtract the two digit overflow on the left from the two right hand digits to obtain a new number {u_1}.”) The process {(X_n)} defined above is an idealized model: we assume the increment {b} develops under some external perfect randomness, and we are concerned with the resulting randomness of the iterates.

My own interest arises differently. The distribution of {X_n} can be thought of as the image of {a} under the random polynomial

\displaystyle  P_n(X) = X^n + b_1 X^{n-1} + \cdots + b_n,.

In particular, {X_n \bmod p = 0} if and only if {a} is a root of {P_n}. Thus the distribution of {X_n \bmod p} is closely related to the roots of a random polynomial (of high degree) mod {p}. There are many basic and difficult questions about random polynomials: are they irreducible, what are their Galois groups, etc. But these are questions for another day (I hope!).

Returning to Question 1, in the representative case {a = 2} and {\mu = u_{\{-1, 0, 1\}}}, CDG proved the following results (with a lot of clever Fourier analysis):

  1. The mixing time is at least

    \displaystyle (1+o(1)) \log_2 p.

    This is obvious: {X_n} is supported on a set of size {O(2^n)}, and if {n \leq (1-\varepsilon) \log_2 p} then {2^n \leq p^{1-\varepsilon}}, so at best {X_n} is spread out over a set of size {p^{1-\varepsilon}}.

  2. The mixing time is never worse than

    \displaystyle C \log p \log \log p.

  3. Occasionally, e.g., if {p} is a Mersenne prime, or more generally if {a} has small order mod {p} (or even roughly small order, suitably defined), the mixing time really is as bad as that.
  4. For almost all odd {p} the mixing time is less than

    \displaystyle 1.02 \log_2 p.

You would be forgiven for guessing that {1.02} can be replaced with {1+o(1)} with more careful analysis, but you would be wrong! In 2009, Hildebrand proved that for all {p} the mixing time of {X_n \bmod p} is at least

\displaystyle  1.004 \log_2 p.

Therefore the mixing time of {X_n \bmod p} is typically slightly more than {\log_2 p}. What is going on here? What is the answer exactly?

In a word, the answer is entropy. Recall that entropy of a discrete random variable {X} is defined by

\displaystyle  H(X) = \sum_{x : \mu(x) > 0} \mathbf{P}(X = x) \log \mathbf{P}(X = x)^{-1}.

Here are some basic facts about entropy:

  1. If {X} is uniform on a set of size {k} then {H(X) = \log k}.
  2. Entropy is subadditive: the entropy of the joint variable {(X, Y)} is at most {H(X) + H(Y)}.
  3. Entropy cannot be increased by a (deterministic) function: {H(f(X)) \leq H(X)}.

By combining the second and third facts, we have the additive version (in the sense of adding the variables) of subadditivity:

\displaystyle  H(X + Y) \leq H(X) + H(Y).

In particular, it follows from (1) that

\displaystyle  H(X_{m+n}) \leq H(X_m) + H(X_n),

and hence by the subadditive lemma {H(X_n)/n} converges. (We are thinking now of {X_n} as a random variable in {\mathbf{Z}}, not reduced modulo anything.) Call the limit {H = H(a, \mu)}.

It is easy to see in our model case {a=2, \mu = u_{\{-1, 0, 1\}}} that {H \leq \log 2} (because {X_n} has support size {O(2^n)}). If {H < \log 2}, then it follows that the mixing time of {X_n \bmod p} is strictly greater than {\log_2 p} (as {X_n} cannot approach equidistribution mod {p} before its entropy is at least {(1+o(1))\log p}).

Indeed it turns out that {H < \log 2}. This is "just a computation": since {H = \inf_{n\geq 1} H(X_n) / n}, we just need to find some {n} such that {H(X_n) / n < \log 2}. Unfortunately, the convergence of {H(X_n) / n} is rather slow, as shown in Figure 1, but we can take advantage of another property of entropy: entropy satisfies not just subadditivity but submodularity

\displaystyle  H(X+Y+Z) + H(X) \leq H(X+Y) + H(X+Z),

and it follows by a short argument that {H(X_n) - H(X_{n-1})} is monotonically decreasing and hence also convergent to {H}; moreover, unlike the quotient {H(X_n)/n}, the difference {H(X_n) - H(X_{n-1})} appears to converge exponentially fast. The result is that

\displaystyle  H / \log 2 = 0.98876\dots,

so the mixing time of {X_n \bmod p} is not less than

\displaystyle  H^{-1} \log p = (1.01136\dots ) \log_2 p.

(We can also deduce, a posteriori, that we will have {H(X_n) / n < \log 2} for {n \geq 72}, though it is out of the question to directly compute {H(X_n)} for such large {n}.)

Figure 1: The entropy difference rapidly converges. Predicted values are dashed.

This observation was the starting point of a paper that Péter and I have just finished writing. The preprint is available at arxiv:2003.08117. What we prove in general is that the mixing time of {X_n \bmod p} is indeed just {(H^{-1} + o(1)) \log p} for almost all {p} (either almost all composite {p} coprime with {a}, or alternatively almost all prime {p}). In other words, entropy really is the whole story: as soon as the entropy of {X_n} is large enough, {X_n \bmod p} should be close to equidistributed (with a caveat: see below). The lower bound is more or less clear, as above. Most of the work of the paper is involved with the upper bound, for which we needed several nontrivial tools from probability theory and number theory, as well as a few arguments recycled from the original CDG paper.

However, just as one mystery is solved, another arises. Our argument depends on the large sieve, and it therefore comes with a square-root barrier. The result is that we have to assume {H(a, \mu) > \frac12 \log a}. This is certainly satisfied in our representative case {a = 2, \mu = u_{\{-1, 0, 1\}}} (as the entropy is very close to {\log 2}), but in general it need not be, and in that case the problem remains open. The following problem is representative.

Problem 2 Let {a \geq 2} and let {\mu = u_{\{0, 1\}}}. Then {X_n} is uniform on a (Cantor) set of size {2^n}, so {H(a, \mu) = \log 2}. Show that the mixing time of {X_n \bmod p} is {(1+o(1))\log_2 p} for almost all primes {p}.

You might call this the “2–3–4–5 problem”. The case {a=2} is trivial, as {X_n} is uniform on {\{1, \dots, 2^n\}}. The case {a=3} is covered by our main theorem, since {\log 2 > \frac12 \log 3}. The case {a=4} is exactly borderline, as {\log 2 = \frac12 \log 4}, and this case is not covered by our main theorem, but we sketch how to stretch the proof to include this case anyway. For {a \geq 5} we need a new idea.

The avoidance density of (k,l)-sum-free sets

A set {S \subset\mathbf{N} = \{1, 2, \dots\}} is called sum-free if it contains no solutions to {x+y=z}. For example, the set of odd integers is sum-free, as is the set {\{51, 52, \dots, 100\}}. Generalizing these two examples, for every {\theta \in \mathbf{R}/\mathbf{Z}} (real numbers mod {1}), the “middle third set”

\displaystyle S_\theta = \{x \in N: \theta x \in (1/3, 2/3) \pmod 1\}

is sum-free (the odd integers is precisely {S_{1/2}}, while the interval {\{51, \dots, 100\}} is a subset of {S_\theta} for {\theta = 1/150 - \epsilon}). As observed by Erdős, for any set {A \subset \mathbf{N}}, one of the sets {S_\theta \cap A} is large, because

\displaystyle  \int |S_\theta \cap A| \,d\theta = |A|/3.

Therefore every set of {n} integers contains a sum-free subset of size at least {n/3}.

Suppose more generally you have some type of configuration {\mathfrak{C}} (being deliberately vague about allowed generality), and you are interested in finding {\mathfrak{C}}-free subsets of arbitrary sets. The avoidance density of {\mathfrak{C}} is the largest constant {\delta} such that every set of {n} positive integers has a {\mathfrak{C}}-free subset of size at least {\delta n}. Here are some examples.

  1. The equation {x - y = 1} has avoidance density {1/2}: clearly any subset of a long interval contains at most half-sized subsets without adjacent elements, and conversely I can always take either the odds or the evens to get an adjacency-free subset of at least half the size. The equation {x - y = 2} also has avoidance density {1/2}, as does {x - y = c} for any constant {c}. (Partition into arithmetic progressions of common difference {c} and take every other element to find your large subset. For an example of a set with no large {\mathfrak{C}}-free subsets, take an arithmetic progression of common difference {c}.)
  2. For any rational {a \neq 1}, the equation {ax = y} has avoidance density {1/2} (this is a similar easy exercise).
  3. Suppose {\mathfrak{C}} represents {3}-term arithmetic progressions. Then a {\mathfrak{C}}-free subset is a set without {3}-term arithmetic progressions, and by Roth’s theorem the set {\{1, \dots, n\}} does not have any positive-density {\mathfrak{C}}-free subsets, so the avoidance density of {\mathfrak{C}} is zero. Similarly, the avoidance density of {k}-term progressions is zero, by Szemerédi’s theorem. More generally, any sort of finite configuration which is preserved under affine maps has avoidance density zero, because you can always find it inside a sufficiently long arithmetic progression.

Erdős’s argument shows that {x+y=z} has avoidance density at least {1/3}. A few years ago, Ben Green, Freddie Manners and I showed that the avoidance density of {x+y=z} is equal to {1/3}: that is, for every {\epsilon>0} we constructed a set of {n} integers with no sum-free subset of size larger than {(1/3 + \epsilon)n}.

The next step is to look at {\mathfrak{C}_{k,l} : x_1 + \cdots + x_k = y_1 + \cdots + y_l} for arbitrary {k > l}. A {\mathfrak{C}_{k,l}}-free set is sometimes called {(k,l)}-sum-free. The natural conjecture is that the avoidance density of {\mathfrak{C}_{k,l}} is {1/(k+l)}, as this is what the natural adaptation of Erdős’s argument shows. I was able to prove this in two big cases: in the case {k \leq 3l}, by generalizing the previous method, and in the case {l = 1}, using a new argument based on transference and a beautiful theorem of Łuczak and Schoen.

Today a paper appeared on the arxiv by Yifan Jing which settles this question for all {k, l}. Thus the avoidance density of {\mathfrak{C}_{k,l}} is indeed {1/(k+l)}! To do this Jing extended the Łuczak–Schoen theorem to {l > 1}, proving the following theorem (Lemma 5.4 in Jing’s paper).

Theorem 1 Every {(k,l)}-sum-free subset of {\mathbf{N}} of upper density greater than {1/(k+l)} is contained in a periodic {(k,l)}-sum-free set.

What this means is that {(k,l)}-sum-free sets of density greater than {1/(k+l)} generally have to be structured. On the other hand, if {\delta} is the avoidance density of {\mathfrak{C}_{k,l}}, then we can create random-looking {(k,l)}-sum-free subsets of density nearly {\delta}. In particular, using a transference tool (namely, Loeb measure), we can create random-looking infinite {(k,l)}-sum-free subsets of upper density nearly {\delta}. Therefore we must have {\delta \leq 1/(k+l)}.

Jing also looked the lower bound. Famously Bourgain did some work on this, showing that every set of {n} positive integers has a sum-free subset of size at least {(n+2)/3}, emphasis on the {2}. With a more interesting argument Bourgain showed that there is a {(3,1)}-sum-free set of size at least {n/4 + c \log n / \log \log n}. Jing adapted this argument for many pairs {(k,l)}, though doing so for {(k,l) = (2,1)} remains a stubborn open problem. It is interesting to see how Bourgain’s method adapts, and I look forward to thinking more about this.

It is all-too-easy to pose difficult open questions in this circle of ideas, but I will risk two.

  1. This question is posed by Jing. Let {\square^d} represent {d}-dimensional projective cubes. That is, {\square^2} represents the sum configuration {\{x, y, x+y\}}, {\square^3} represents {\{x, y, z, x+y, x+z, y+z, x+y+z\}}, etc. What is the avoidance density of {\square^d}?
  2. What is the avoidance density of {2x+y = z}? Or for that matter any equation without only {\pm1} coefficients?

One general theorem that comes out of the transference method is the following. Say that {\mathfrak{C}} is dilation-invariant if whenever {C \in \mathfrak{C}} we have also {\lambda C \in \mathfrak{C}} for every {\lambda \in \mathbf{N}}. (Everything we have mentioned is dilation-invariant, apart from {x-y = c}.)

Theorem 2 Let {\mathfrak{C}} be any dilation-invariant configuration. Then the avoidance density of {\mathfrak{C}} is the same as the largest multiplicative upper density of an infinite {\mathfrak{C}}-free subset of {\mathbf{N}}.

Here multiplicative upper density is defined as the asymptotic upper density with respect to your favourite multiplicative Følner sequence, such as

\displaystyle  A_n = \{p_1^{e_1} \cdots p_n^{e_n} : 0 \leq e_i < n\},

where {p_1, p_2, \dots} are the primes in any order. Therefore questions 1 and 2 ask for the largest multiplicative upper density of {\square^d}-free or {2x+y=z}-free subset of {\mathbf{N}}, respectively.

Here is my general conjecture.

Conjecture 3 Let {\mathfrak{C}} be any type of configuration described by a single equation {a_1 x_1 + \cdots + a_k x_k = 0}, with {a_1, \dots, a_k \in \mathbf{Z}\setminus\{0\}}. Then the avoidance density of {\mathfrak{C}} is the same as the measure of the largest {\mathfrak{C}}-free subset of {\mathbf{R}/\mathbf{Z}}.

One might also conjecture this for higher-complexity systems like {\square^3}, but I have no idea whether that is likely to be true! Certainly the lower bound holds: if {S \subset \mathbf{R}/\mathbf{Z}} is {\mathfrak{C}}-free then the avoidance density of {\mathfrak{C}} is at least {\mu(S)}. But even if this is true it’s not very useful! What is the largest {\square^3}-free subset of {\mathbf{R}/\mathbf{Z}}? What is the largest {2x+y=z}-free subset of {\mathbf{R}/\mathbf{Z}}? These questions are simpler, certainly, but still not at all obvious!

Edit: The largest {\mathfrak{C}}-free sets {S \subset \mathbf{R}/\mathbf{Z}} I know in these two cases are the following. I have made very little effort to look for other solutions. Can you spot a better solution? Is there a systematic way of solving this problem?

{\mathfrak{C}} {S} {\mu(S)}
{\square^3} {(1/4, 3/4)} {1/2}
{2x+y=z} {(1/8, 3/8)} {1/4}

Later edit: Here are some variants of the {\square^3} question, essentially due to Cosmin Pohoata:

  1. Suppose {A} is a {\square^3}-free subset of {\{1, \dots, n\}}, e.g., {\{x \equiv 1,2 \pmod 3\}}, or {(n/3, n]} (both of these are {S_\theta} for some {\theta}). Is it true that {|A| \leq (2/3 + o(1))n}?
  2. Suppose {A} is a {\square^3}-free subset of {\mathbf{Z}/2^k\mathbf{Z}}, e.g., {\{x \equiv 1, 4 \pmod 8\}}. Is it true that {|A| \leq (5/8 + o(1))2^k}?

The interesting point here is that shoehorning the set into {\mathbf{Z}_2} appears to come with a density hit (as with sum-free sets), and the extremal set appears not to be of the form {S_\theta}.