The characteristic polynomial of a random matrix is almost always irreducible

I just uploaded to the arxiv a paper which can be viewed as vindication for anybody who has ever struggled to compute an eigenvalue by hand: the paper gives a mathematical proof that eigenvalues are almost always as complicated as they can be. Intuitive as this may be, to prove it I needed two mathematical bazookas: both the extended Riemann hypothesis and the classification of finite simple groups!

An example will illustrate what I mean. Here is a random {10\times 10} matrix with {\pm1} entries:

\displaystyle  M = \left(\begin{array}{rrrrrrrrrr} -1 & 1 & 1 & -1 & -1 & 1 & -1 & 1 & -1 & 1 \\ -1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & -1 & -1 \\ -1 & -1 & 1 & 1 & 1 & -1 & -1 & -1 & 1 & 1 \\ 1 & -1 & -1 & 1 & -1 & -1 & 1 & 1 & 1 & -1 \\ -1 & -1 & -1 & 1 & 1 & 1 & 1 & -1 & -1 & -1 \\ -1 & -1 & -1 & -1 & -1 & 1 & 1 & 1 & 1 & 1 \\ -1 & -1 & -1 & 1 & -1 & 1 & 1 & 1 & -1 & 1 \\ 1 & -1 & -1 & -1 & 1 & 1 & 1 & 1 & -1 & -1 \\ -1 & -1 & 1 & -1 & 1 & -1 & -1 & -1 & -1 & -1 \\ -1 & -1 & -1 & -1 & -1 & 1 & -1 & 1 & -1 & -1 \end{array}\right) .

The eigenvalues of {M} as approximated by scipy.linalg.eig are shown below.
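If you want to play along at home, here is a minimal sketch of this experiment (note that the matrix {M} above is one particular draw, not the output of this code):

```python
import numpy as np
from scipy.linalg import eig

rng = np.random.default_rng()            # fresh randomness; M above was one particular draw
M = rng.choice([-1, 1], size=(10, 10))   # random 10x10 matrix with +-1 entries

w, v = eig(M)                            # w: floating-point approximations to the eigenvalues
print(np.sort_complex(w))
```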

But what if we want to know the eigenvalues exactly? Well, they are the roots of the characteristic polynomial of {M}, which is

\displaystyle  \varphi(x) = x^{10} - 4x^{9} + 6x^{8} - 80x^{6} + 144x^{5} - 544x^{4} + 1152x^{3} - 1024x^{2} + 1280x - 1024,

so we just have to solve {\varphi(x) = 0}. The Abel–Ruffini theorem states that, in general, this is not possible in the quadratic-formula-type sense involving radicals, etc., so there are certainly evil matrices that defy solution. But could there be something special about random matrices that makes {\varphi} typically nicer than the worst case?
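(Incidentally, if you want to check the polynomial above for yourself, here is a sketch using sympy — any computer algebra system would do — which should reproduce the {\varphi} displayed above.)

```python
import sympy as sp

x = sp.symbols('x')
M = sp.Matrix([
    [-1,  1,  1, -1, -1,  1, -1,  1, -1,  1],
    [-1,  1,  1,  1,  1,  1,  1,  1, -1, -1],
    [-1, -1,  1,  1,  1, -1, -1, -1,  1,  1],
    [ 1, -1, -1,  1, -1, -1,  1,  1,  1, -1],
    [-1, -1, -1,  1,  1,  1,  1, -1, -1, -1],
    [-1, -1, -1, -1, -1,  1,  1,  1,  1,  1],
    [-1, -1, -1,  1, -1,  1,  1,  1, -1,  1],
    [ 1, -1, -1, -1,  1,  1,  1,  1, -1, -1],
    [-1, -1,  1, -1,  1, -1, -1, -1, -1, -1],
    [-1, -1, -1, -1, -1,  1, -1,  1, -1, -1],
])

phi = M.charpoly(x)        # exact characteristic polynomial det(x*I - M)
print(phi.as_expr())
```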

It is not at all obvious that {\varphi} is “generic”. For one thing, if {M} is singular, then {\varphi} has an obvious root: zero! For another, if {M} happens to have a repeated eigenvalue, then {\varphi} and {\varphi'} share a factor, which is certainly a special algebraic property. It turns out that both events have low probability, but these are landmark results about random matrices (singularity: originally Komlós; repeated eigenvalues: Tao–Vu in the symmetric case and Ge in the nonsymmetric case).

Algebraically, the complexity of solving {\varphi(x) = 0} is measured by the Galois group {G}. In particular, the polynomial is irreducible if and only if {G} is transitive, and the equation is solvable by a formula involving radicals as in the quadratic formula if and only if {G} is solvable in the sense of group theory. Galois groups are not easy to calculate in general, but “typically” they are, by the following well-known trick. If we factorize {\varphi} mod {p} for some prime {p}, then, as long as we do not get a repeated factor, the degree sequence of the factorization is the same as the cycle type of some element of {G}. If we do this for enough small primes {p} we can often infer that {G = S_n}.

Let’s do this for the above example. If we factorize {\varphi} mod {p} for {p < 20}, discarding the “ramified” primes for which there is a repeated factor, we get the following degree sequences:

{p} degree sequence
{5} {\left(9, 1\right)}
{7} {\left(4, 4, 2\right)}
{13} {\left(5, 3, 2\right)}
{17} {\left(5, 3, 2\right)}
{19} {\left(7, 2, 1\right)}

Therefore {G} contains a {9}-cycle, an element of type {(4, 4, 2)}, etc. Just from the {9}-cycle and the {(4,4,2)} element it is clear that {G} must be transitive (in fact 2-transitive), so {\varphi} is irreducible. There are various other rules that help eliminate other possibilities for {G}. A common one is a theorem of Jordan which states that if {G} is primitive and contains a {p}-cycle for some prime {p \leq n - 3} then {G \geq A_n}. The element of cycle type {(5, 3, 2)} is odd and has a power which is a 5-cycle, so Jordan’s theorem implies {G = S_{10}}.
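These mod-{p} factorizations are easy to reproduce; here is a sketch using sympy (again, any computer algebra system would do), which should print the unramified primes below 20 together with the degree sequences in the table above.

```python
import sympy as sp

x = sp.symbols('x')
phi = (x**10 - 4*x**9 + 6*x**8 - 80*x**6 + 144*x**5 - 544*x**4
       + 1152*x**3 - 1024*x**2 + 1280*x - 1024)

for p in sp.primerange(2, 20):
    _, factors = sp.Poly(phi, x, modulus=p).factor_list()   # factorization over GF(p)
    if any(mult > 1 for _, mult in factors):
        continue                                             # ramified prime: repeated factor, discard
    degrees = sorted((f.degree() for f, _ in factors), reverse=True)
    print(p, degrees)
```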

Conclusion: for the matrix {M} above, solving {\varphi(x) = 0} is a certifiable pain. You should just give up and accept {\varphi} itself as the prettiest answer you are going to get.

What my paper shows in general is that, if you choose the entries of an {n\times n} matrix independently from a fixed distribution on the integers, then, with probability tending to {1} as {n \to \infty}, the characteristic polynomial {\varphi} is irreducible, and moreover its Galois group is at least {A_n}. The result is conditional on the extended Riemann hypothesis, as mentioned, except in some nice special cases, such as when the entries are uniformly distributed in {\{1, \dots, 210\}}.

Note I only said “at least {A_n}”. Presumably the event {G = A_n} also has negligible probability, so {G = S_n} with probability {1 - o(1)}, but this remains open, and seems difficult. To prove it (naively), you have to show that the discriminant of the characteristic polynomial is, with high probability, not a perfect square. That should certainly be a low-probability event, but it’s not clear how to prove it!
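For the specific matrix {M} above we already know {G = S_{10}}, so the discriminant of {\varphi} had better not be a square; that much, at least, is a finite check. A sketch (again with sympy):

```python
import sympy as sp
from math import isqrt

x = sp.symbols('x')
phi = (x**10 - 4*x**9 + 6*x**8 - 80*x**6 + 144*x**5 - 544*x**4
       + 1152*x**3 - 1024*x**2 + 1280*x - 1024)

disc = int(sp.discriminant(phi, x))                   # an enormous integer
is_square = disc >= 0 and isqrt(disc) ** 2 == disc    # square <=> Galois group inside A_10
print(disc, is_square)
```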

Why do I need the extended Riemann hypothesis (ERH) and the classification of finite simple groups (CFSG)?

  • ERH: The example above involving factorizing mod {p} is not just useful in practice but also in theory. The prime ideal theorem tells you the statistics of what happens with your polynomial mod {p} for large {p}. In particular, to show that {G} is transitive it suffices to show that {\varphi} has one root on average mod {p}, for large primes {p}. Exactly how large {p} must be is determined by the error term in the prime ideal theorem, and ERH really helps a lot.
  • CFSG: An elaboration of the above strategy is used to show that {G} is not just transitive but {m}-transitive for any constant {m}. By CFSG, {m = 4} and {n > 24} is enough to imply {G \geq A_n}.

Arguably, matrices are not playing a large role in the statement of the theorem: it is just a statement about a random polynomial {\varphi}, where the model happens to be “take a random matrix with independent entries and take its characteristic polynomial”. The theorem should be true for any good model of a random polynomial of high degree. For example, in the “independent (or restricted) coefficients model”, the coefficients of {\varphi} are independent with some fixed distribution. For this model the corresponding statements were proved recently by Bary-Soroker–Kozma (see also Bary-Soroker’s blog post) and Breuillard–Varjú, and I certainly borrow a lot from those papers. However, the common ground is essentially limited to the local–global principle that reduces the problem to a problem modulo {p}; after that the methods diverge. The independent coefficients case becomes a question about a certain kind of random walk mod {p} (see the Breuillard–Varjú paper), whereas the characteristic polynomial case becomes a problem about counting eigenvalues of a random matrix mod {p}.

There are many beautiful questions and mysteries about random polynomials. See for example the many high-definition pictures on John Baez’s website here showing the roots of random polynomials with independent coefficients. In many ways, taking the characteristic polynomial of a random matrix is actually a more natural model of a random polynomial, and some aspects become simpler. For example, I bet that the {A_n} vs {S_n} question will be settled sooner in the case of a random characteristic polynomial than in the case of independent coefficients.
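If you want to draw such pictures yourself, here is a minimal sketch (with numpy and matplotlib; the degree and the number of samples are arbitrary choices):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng()
degree, samples = 24, 2000

roots = []
for _ in range(samples):
    coeffs = rng.choice([-1, 1], size=degree + 1)   # independent +-1 coefficients
    roots.append(np.roots(coeffs))                  # all complex roots of this polynomial
roots = np.concatenate(roots)

plt.scatter(roots.real, roots.imag, s=0.2, alpha=0.3)
plt.gca().set_aspect('equal')
plt.title('Roots of random polynomials with independent coefficients')
plt.show()
```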

Mixing time of a pseudorandom number generator

Today I was due to speak at the Isaac Newton Institute’s workshop on “Interactions between group theory, number theory, combinatorics and geometry”, which has obviously been canceled. I was going to speak about my work on the Hall–Paige conjecture: see this earlier post. Meanwhile Péter Varjú was going to speak about some other joint work, which I’d like to share with you now.

Consider (if you can consider anything other than the coronaviral apocalypse right now) the random process {(X_n)} in {\mathbf{Z}/p\mathbf{Z}} defined in the following way: {X_0 = 1} (or any initial value), and for {n\geq 1}

\displaystyle   X_n = aX_{n-1} + b_n. \ \ \ \ \ (1)

Here we take {a} to be a fixed integer and {b_1, b_2, \dots} a sequence of independent draws from some finitely supported measure {\mu} on {\mathbf{Z}}. As a representative case, take {a = 2} and {\mu = u_{\{-1, 0, 1\}}} (uniform on {\{-1, 0, 1\}}).

Question 1 What is the mixing time of {(X_n \bmod p)} in {\mathbf{Z}/p\mathbf{Z}}? That is, how large must we take {n} before {X_n} is approximately uniformly distributed in {\mathbf{Z}/p\mathbf{Z}}?

This question was asked by Chung, Diaconis, and Graham (1987) (henceforth CDG), who were motivated by pseudorandom number generation. The best-studied pseudorandom number generators are the linear congruential generators, which repeatedly apply an affine map {x\mapsto ax+b \bmod p}. These go back to Lehmer (1949), whose parameters {a} and {b} were also subject to some randomness (from MathSciNet: “The author’s proposal for generating a sequence of ‘random’ 8 decimal numbers {u_n} is to start with some number {u_0\neq 0}, multiply it by 23, and subtract the two digit overflow on the left from the two right hand digits to obtain a new number {u_1}.”) The process {(X_n)} defined above is an idealized model: we assume the increments {b_n} are supplied by some external source of perfect randomness, and we are concerned with the resulting randomness of the iterates.
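To get a feel for Question 1, one can compute the law of {X_n \bmod p} exactly (for smallish {p}) by propagating the probability vector, and watch the total variation distance from uniform decay. A sketch (with numpy; the prime {p} here is just an illustrative choice):

```python
import numpy as np

def tv_from_uniform(a=2, p=10007, n_steps=40):
    """Exact law of X_n mod p for X_n = a*X_{n-1} + b_n, with b_n uniform on {-1, 0, 1}
    and X_0 = 1, together with its total variation distance from uniform on Z/pZ."""
    dist = np.zeros(p)
    dist[1 % p] = 1.0                       # X_0 = 1
    residues = np.arange(p)
    distances = []
    for _ in range(n_steps):
        new = np.zeros(p)
        for b in (-1, 0, 1):                # the increment measure mu = u_{-1,0,1}
            np.add.at(new, (a * residues + b) % p, dist / 3.0)
        dist = new
        distances.append(0.5 * np.abs(dist - 1.0 / p).sum())
    return distances

for n, d in enumerate(tv_from_uniform(), start=1):
    print(n, round(d, 4))
```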

My own interest arises differently. The distribution of {X_n} can be thought of as the image of {a} under the random polynomial

\displaystyle  P_n(X) = X^n + b_1 X^{n-1} + \cdots + b_n.

In particular, {X_n \bmod p = 0} if and only if {a} is a root of {P_n}. Thus the distribution of {X_n \bmod p} is closely related to the roots of a random polynomial (of high degree) mod {p}. There are many basic and difficult questions about random polynomials: are they irreducible, what are their Galois groups, etc. But these are questions for another day (I hope!).
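(The identity {X_n = P_n(a)} is easy to sanity-check numerically; a tiny sketch with the representative parameters:)

```python
import random

a, n = 2, 20
b = [random.choice([-1, 0, 1]) for _ in range(n)]   # b_1, ..., b_n drawn from mu

X = 1                                               # X_0 = 1
for b_k in b:
    X = a * X + b_k                                 # X_k = a*X_{k-1} + b_k

P_at_a = a**n + sum(b_k * a**(n - 1 - k) for k, b_k in enumerate(b))   # P_n(a)
assert X == P_at_a
```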

Returning to Question 1, in the representative case {a = 2} and {\mu = u_{\{-1, 0, 1\}}}, CDG proved the following results (with a lot of clever Fourier analysis):

  1. The mixing time is at least

    \displaystyle (1+o(1)) \log_2 p.

    This is obvious: {X_n} is supported on a set of size {O(2^n)}, and if {n \leq (1-\varepsilon) \log_2 p} then {2^n \leq p^{1-\varepsilon}}, so at best {X_n} is spread out over a set of size {p^{1-\varepsilon}}.

  2. The mixing time is never worse than

    \displaystyle C \log p \log \log p.

  3. Occasionally, e.g., if {p} is a Mersenne prime, or more generally if {a} has small order mod {p} (or even roughly small order, suitably defined), the mixing time really is as bad as that.
  4. For almost all odd {p} the mixing time is less than

    \displaystyle 1.02 \log_2 p.

You would be forgiven for guessing that {1.02} can be replaced with {1+o(1)} with more careful analysis, but you would be wrong! In 2009, Hildebrand proved that for all {p} the mixing time of {X_n \bmod p} is at least

\displaystyle  1.004 \log_2 p.

Therefore the mixing time of {X_n \bmod p} is typically slightly more than {\log_2 p}. What is going on here? What is the answer exactly?

In a word, the answer is entropy. Recall that the entropy of a discrete random variable {X} is defined by

\displaystyle  H(X) = \sum_{x \,:\, \mathbf{P}(X = x) > 0} \mathbf{P}(X = x) \log \mathbf{P}(X = x)^{-1}.

Here are some basic facts about entropy:

  1. If {X} is uniform on a set of size {k} then {H(X) = \log k}.
  2. Entropy is subadditive: the entropy of the joint variable {(X, Y)} is at most {H(X) + H(Y)}.
  3. Entropy cannot be increased by a (deterministic) function: {H(f(X)) \leq H(X)}.

By combining the second and third facts, we have the additive version (in the sense of adding the variables) of subadditivity:

\displaystyle  H(X + Y) \leq H(X) + H(Y).

In particular, it follows from (1) that

\displaystyle  H(X_{m+n}) \leq H(X_m) + H(X_n),

and hence by the subadditive lemma {H(X_n)/n} converges. (We are thinking now of {X_n} as a random variable in {\mathbf{Z}}, not reduced modulo anything.) Call the limit {H = H(a, \mu)}.

It is easy to see in our model case {a=2, \mu = u_{\{-1, 0, 1\}}} that {H \leq \log 2} (because {X_n} has support size {O(2^n)}). If {H < \log 2}, then it follows that the mixing time of {X_n \bmod p} is strictly greater than {\log_2 p} (as {X_n} cannot approach equidistribution mod {p} before its entropy is at least {(1+o(1))\log p}).

Indeed it turns out that {H < \log 2}. This is “just a computation”: since {H = \inf_{n\geq 1} H(X_n) / n}, we just need to find some {n} such that {H(X_n) / n < \log 2}. Unfortunately, the convergence of {H(X_n) / n} is rather slow, as shown in Figure 1, but we can take advantage of another property of entropy: entropy satisfies not just subadditivity but submodularity,

\displaystyle  H(X+Y+Z) + H(X) \leq H(X+Y) + H(X+Z),

and it follows by a short argument that {H(X_n) - H(X_{n-1})} is monotonically decreasing and hence also convergent to {H}; moreover, unlike the quotient {H(X_n)/n}, the difference {H(X_n) - H(X_{n-1})} appears to converge exponentially fast. The result is that

\displaystyle  H / \log 2 = 0.98876\dots,

so the mixing time of {X_n \bmod p} is not less than

\displaystyle  H^{-1} \log p = (1.01136\dots ) \log_2 p.

(We can also deduce, a posteriori, that we will have {H(X_n) / n < \log 2} for {n \geq 72}, though it is out of the question to directly compute {H(X_n)} for such large {n}.)
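The computation itself is elementary to set up: propagate the exact law of {X_n} (as a random variable in {\mathbf{Z}}) step by step, and compute the entropies. A sketch for the model case (pure Python; the support grows like {2^n}, so only modest {n} are feasible this way):

```python
from collections import defaultdict
from math import log

def entropy(dist):
    """Entropy (in nats) of a finitely supported law given as {value: probability}."""
    return -sum(p * log(p) for p in dist.values() if p > 0)

a = 2
mu = {-1: 1/3, 0: 1/3, 1: 1/3}           # the increment measure
dist = {1: 1.0}                           # law of X_0 = 1

prev = 0.0
for n in range(1, 19):
    new = defaultdict(float)
    for x, px in dist.items():
        for b, pb in mu.items():
            new[a * x + b] += px * pb     # X_n = a*X_{n-1} + b_n
    dist = dict(new)
    H_n = entropy(dist)
    # first column: H(X_n)/(n log 2), which converges slowly;
    # second column: (H(X_n) - H(X_{n-1}))/log 2, which converges quickly (cf. Figure 1)
    print(n, round(H_n / (n * log(2)), 5), round((H_n - prev) / log(2), 5))
    prev = H_n
```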


Figure 1: The entropy difference rapidly converges. Predicted values are dashed.

This observation was the starting point of a paper that Péter and I have just finished writing. The preprint is available at arxiv:2003.08117. What we prove in general is that the mixing time of {X_n \bmod p} is indeed just {(H^{-1} + o(1)) \log p} for almost all {p} (either almost all composite {p} coprime with {a}, or alternatively almost all prime {p}). In other words, entropy really is the whole story: as soon as the entropy of {X_n} is large enough, {X_n \bmod p} should be close to equidistributed (with a caveat: see below). The lower bound is more or less clear, as above. Most of the work of the paper is involved with the upper bound, for which we needed several nontrivial tools from probability theory and number theory, as well as a few arguments recycled from the original CDG paper.

However, just as one mystery is solved, another arises. Our argument depends on the large sieve, and it therefore comes with a square-root barrier. The result is that we have to assume {H(a, \mu) > \frac12 \log a}. This is certainly satisfied in our representative case {a = 2, \mu = u_{\{-1, 0, 1\}}} (as the entropy is very close to {\log 2}), but in general it need not be, and in that case the problem remains open. The following problem is representative.

Problem 2 Let {a \geq 2} and let {\mu = u_{\{0, 1\}}}. Then {X_n} is uniform on a (Cantor) set of size {2^n}, so {H(a, \mu) = \log 2}. Show that the mixing time of {X_n \bmod p} is {(1+o(1))\log_2 p} for almost all primes {p}.

You might call this the “2–3–4–5 problem”. The case {a=2} is trivial, as {X_n} is uniform on {\{1, \dots, 2^n\}}. The case {a=3} is covered by our main theorem, since {\log 2 > \frac12 \log 3}. The case {a=4} is exactly borderline, as {\log 2 = \frac12 \log 4}, and this case is not covered by our main theorem, but we sketch how to stretch the proof to include this case anyway. For {a \geq 5} we need a new idea.
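(The entropy claim in Problem 2 is at least easy to check by brute force for small {n}: the {2^n} increment strings give {2^n} distinct values of {X_n}, since they are just base-{a} expansions with digits {0, 1}. A quick sketch:)

```python
from itertools import product

def support_size(a, n):
    # X_n = a^n + b_1*a^(n-1) + ... + b_n with each b_i in {0, 1}
    values = {a**n + sum(b * a**(n - 1 - i) for i, b in enumerate(bits))
              for bits in product((0, 1), repeat=n)}
    return len(values)

for a in (2, 3, 4, 5):
    print(a, support_size(a, 12) == 2**12)   # expect True: X_n uniform on a set of size 2^n
```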