Sharp Bounds on Spectral Norm of a Random Matrix

Singular values of matrices show up in many different places, and studying them is of significant importance in mathematics. As per usual, the object has a fairly simple definition but requires quite difficult tools to analyze comprehensively. In this article I aim to demonstrate how the norm of a Gaussian random matrix behaves and to answer questions such as “How does its mean behave?” or “Is it concentrated around its mean?”. A crucial assumption on the entries of $A$ is independence. Indeed, the results can be extended easily to subgaussian entries as well; however, stating the theorems for Gaussians evaporates the annoying constants! Let’s set up some notation, shall we?

Definition. For any matrix $A \in \mathbb{R}^{n\times m}$ we define the spectral norm (i.e. the largest singular value) as: $$\|A\|_{op} = \max_{u \in S^{n-1},\ v \in S^{m-1}} \langle u, Av\rangle$$
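As a quick sanity check of this definition, here is a minimal numerical sketch (assuming numpy; the matrix size and seed are arbitrary choices of mine): the top singular vector pair from an SVD attains the maximum in the definition, while a random unit pair falls short.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 80
A = rng.standard_normal((n, m))

# Spectral norm via SVD.
U, s, Vt = np.linalg.svd(A)
op_norm = s[0]

# The top singular vector pair attains <u, Av> = ||A||_op ...
u_star, v_star = U[:, 0], Vt[0, :]
print(op_norm, u_star @ A @ v_star)   # the two numbers agree

# ... while a random unit pair gives something strictly smaller.
u = rng.standard_normal(n); u /= np.linalg.norm(u)
v = rng.standard_normal(m); v /= np.linalg.norm(v)
print(u @ A @ v, "<=", op_norm)
```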

It is worth mentioning that the reason I wrote a maximum instead of a supremum is simply that the supremum of a continuous function over a compact set is attained. This norm is a measure of how much a vector can be dilated by the linear transform $x \mapsto Ax$ in the worst case.

Definition. A centered random variable $X$ is called $\sigma^2$-subgaussian if one of the following holds;¹ moreover, all of the statements are equivalent up to constants (a quick numerical check of the first two bounds follows the list):

  • (Moment Generating Bound) $\mathbb{E}[e^{tX}] \le e^{\frac{t^2\sigma^2}{2}}$ for all $t \in \mathbb{R}$.
  • (Tail Bound) $\mathbb{P}[|X| \ge t] \le 2e^{-\frac{t^2}{2\sigma^2}}$ for all $t \ge 0$.
  • (Moment Bound) $\|X\|_{L^p} = (\mathbb{E}[|X|^p])^{\frac{1}{p}} \le \sigma\sqrt{p}$ for all $p \ge 1$.
  • (Another MGF Bound) $\mathbb{E}\big[e^{\frac{X^2}{2\sigma^2}}\big] \le 2$.
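To make these bounds concrete, here is a minimal Monte Carlo sketch (assuming numpy; the sample size, seed, and chosen thresholds are arbitrary) comparing the empirical MGF and tails of a standard normal against the bounds above with $\sigma^2 = 1$.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 1.0
X = sigma * rng.standard_normal(1_000_000)

# Moment generating bound: E[exp(tX)] vs exp(t^2 sigma^2 / 2).
for t in (0.5, 1.0, 2.0):
    print(t, np.mean(np.exp(t * X)), np.exp(t**2 * sigma**2 / 2))

# Tail bound: P(|X| >= s) vs 2 exp(-s^2 / (2 sigma^2)).
for s in (1.0, 2.0, 3.0):
    print(s, np.mean(np.abs(X) >= s), 2 * np.exp(-s**2 / (2 * sigma**2)))
```

For a standard normal the MGF bound holds with equality, so the first set of outputs should match up to Monte Carlo error, while the tail bound has visible slack.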

In a nutshell, this property roughly says that the random variable behaves like a normal one, and it is desirable since it is basically equivalent to concentration around the mean (or median). For fixed vectors $u \in S^{n-1}, v \in S^{m-1}$ one can easily check that $X_{u,v} = \langle u, Av\rangle$ is also sub-gaussian with the same parameter as the entries of $A$, and is thus concentrated around its mean. However, it is not clear what can be said about the supremum of this collection of random variables, which is the quantity of interest.² Note that these random variables are locally correlated, i.e. for a pair $(u,v)$ close to $(u',v')$ we expect $X_{u,v}$ to be close to $X_{u',v'}$. More precisely, one can write: $$|X_{u,v}-X_{u',v'}| \le |X_{u,v}-X_{u,v'}| + |X_{u,v'}-X_{u',v'}| \le \|A\|_{op}\big(\|u-u'\| + \|v-v'\|\big)$$ Surprisingly, the converse is also true, meaning that if one takes two pairs far apart from each other, the corresponding random variables will be nearly uncorrelated and thus behave as though they were independent, since uncorrelatedness is equivalent to independence in Gaussian fields. This sort of argument motivates proofs based on $\epsilon$-net methods, which I do not intend to go through here; the intuition, however, is very insightful. In order to make this statement more precise one may use the following comparison inequality for Gaussian random variables.

Theorem (Sudakov–Fernique Inequality). Suppose $X, Y \in \mathbb{R}^n$ are two independent centered Gaussian vectors such that $\mathbb{E}[|X_i - X_j|^2] \le \mathbb{E}[|Y_i - Y_j|^2]$ for all $i, j \in [n]$. Then: $$\mathbb{E}\Big[\max_{i \le n} X_i\Big] \le \mathbb{E}\Big[\max_{i \le n} Y_i\Big]$$ As a special case, if we assume that $X$ and $Y$ have the same variance coordinatewise, then the condition in the theorem verbally translates into “$X$ is more correlated than $Y$”, so $X$ should possess a smaller maximum, which aligns with intuition.³ In addition, it is easy to extend this to the infinite case using a simple truncation followed by the monotone convergence theorem, provided that the index set is separable. I am not going to provide a proof of this theorem here, since the ideas overlap with the theorems mentioned below, but it has a direct application to upper bounding the spectral norm of a matrix. Take $X_{u,v}$ as defined above and assume the entries of $A$ have a standard normal distribution:

$$\mathbb{E}\big[|X_{u,v} - X_{u',v'}|^2\big] = \mathbb{E}\Big[\Big|\sum_{i \in [n]}\sum_{j \in [m]} (u_i v_j - u'_i v'_j) A_{i,j}\Big|^2\Big] = \sum_{i \in [n]}\sum_{j \in [m]} (u_i v_j - u'_i v'_j)^2$$

$$= \sum_{i \in [n]}\sum_{j \in [m]} u_i^2 (v_j - v'_j)^2 + (u_i - u'_i)^2 {v'_j}^2 + 2 u_i v'_j (u_i - u'_i)(v_j - v'_j)$$

$$= \sum_{i \in [n]} (u_i - u'_i)^2 + \sum_{j \in [m]} (v_j - v'_j)^2 + 2\underbrace{\sum_{i \in [n]} u_i (u_i - u'_i)}_{\ge 0}\ \underbrace{\sum_{j \in [m]} v'_j (v_j - v'_j)}_{\le 0} \le \|u - u'\|^2 + \|v - v'\|^2$$

Now, one can define the following comparison process using independent random vectors $G \sim N(0, I_{n\times n})$ and $G' \sim N(0, I_{m\times m})$: $$Y_{u,v} = \langle u, G\rangle + \langle v, G'\rangle, \qquad u \in S^{n-1},\ v \in S^{m-1}$$ whose increments satisfy $\mathbb{E}\big[|Y_{u,v} - Y_{u',v'}|^2\big] = \|u - u'\|^2 + \|v - v'\|^2$. Therefore, by the Sudakov–Fernique inequality we obtain: $$\mathbb{E}[\|A\|_{op}] = \mathbb{E}\Big[\sup_{u \in S^{n-1},\ v \in S^{m-1}} X_{u,v}\Big] \le \mathbb{E}\Big[\sup_{u \in S^{n-1},\ v \in S^{m-1}} Y_{u,v}\Big] = \mathbb{E}[\|G\|] + \mathbb{E}[\|G'\|] \le \sqrt{n} + \sqrt{m}$$ This is a sharp upper bound when $m = \Omega(n)$, since we can lower bound the spectral norm by the norm of the first column: $$\mathbb{E}[\|A\|_{op}] \ge \mathbb{E}\Big[\sup_{u \in S^{n-1}} X_{u, e_1}\Big] = \mathbb{E}[\|A_{:,1}\|] \approx \sqrt{n}$$ Now that we have established the sharp behaviour of the largest singular value in expectation, we move on to the next question on the list: how well is this quantity concentrated around its mean?
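As a quick empirical check of the two-sided bound $\sqrt{n} \lesssim \mathbb{E}[\|A\|_{op}] \le \sqrt{n} + \sqrt{m}$, here is a minimal Monte Carlo sketch (assuming numpy; the matrix sizes, replication count, and seed are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(2)

for n, m in [(20, 20), (50, 100), (200, 200)]:
    ops = [np.linalg.norm(rng.standard_normal((n, m)), ord=2) for _ in range(200)]
    # lower bound ~ sqrt(n), empirical mean, upper bound sqrt(n) + sqrt(m)
    print(n, m, np.sqrt(n), np.mean(ops), np.sqrt(n) + np.sqrt(m))
```

For square matrices the empirical mean sits essentially at the upper bound $\sqrt{n} + \sqrt{m} = 2\sqrt{n}$, which is the familiar $2\sqrt{n}$ asymptotic.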

A Hammer (Lipschitz Concentration)

Before we continue, we need to introduce a powerful tool which is widely used in the measure concentration literature. Suppose $\gamma_n$ is the $n$-dimensional standard Gaussian measure, and denote expectation with respect to it by $\gamma_n f = \mathbb{E}_{\gamma_n}[f]$ for $f: \mathbb{R}^n \to \mathbb{R}$. Then Poincaré’s inequality says that functions with small local variations should have small variance. In other words, some sort of gradient of the function, as an indication of local fluctuation, can be translated into a bound on global fluctuation: $$\mathrm{Var}(f) \le \mathbb{E}\big[\|\nabla f\|_2^2\big]$$ This mainly suggests that one should expect dimension-independent concentration bounds for Lipschitz functions.
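Here is a minimal numerical illustration of Poincaré’s inequality (a sketch assuming numpy; the smooth test function $f(x) = \sum_i \sin(x_i)$, the dimension, and the sample size are arbitrary choices).

```python
import numpy as np

rng = np.random.default_rng(3)
n, N = 20, 200_000
G = rng.standard_normal((N, n))

f = np.sin(G).sum(axis=1)               # f(x) = sum_i sin(x_i)
grad_sq = (np.cos(G) ** 2).sum(axis=1)  # ||grad f(x)||^2 = sum_i cos(x_i)^2

# Var(f) should be bounded by E||grad f||^2.
print(np.var(f), np.mean(grad_sq))
```

For this particular $f$ both sides have closed forms, $n(1-e^{-2})/2$ and $n(1+e^{-2})/2$ respectively, so the inequality is visibly non-trivial but not tight.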

Theorem. Suppose $f: \mathbb{R}^n \to \mathbb{R}$ is a $\kappa$-Lipschitz function. Then we have the following concentration inequality: $$\gamma_n\big\{|f - \gamma_n f| \ge t\big\} \le 2e^{-\frac{t^2}{2\kappa^2}}$$ There are in fact many ways to approach this problem, namely the interpolation method due to Pisier–Maurey, the smart path method due to Talagrand, isoperimetry due to Borell and Sudakov–Tsirelson, the transportation method due to Marton, etc.⁴ However, I will provide a neat proof using stochastic calculus which is mathematically beautiful, but may seem like a magical juggling act performed by the very Brownian motion! Honestly, I get super excited whenever Brownian motion comes up, and it almost never disappoints.
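Before the proof, here is a minimal Monte Carlo sketch of the tail bound (assuming numpy), using the 1-Lipschitz function $f(x) = \|x\|_2$; the dimension, sample size, and thresholds are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
n, N = 50, 200_000
G = rng.standard_normal((N, n))

f = np.linalg.norm(G, axis=1)   # f(x) = ||x||_2 is 1-Lipschitz, so kappa = 1
dev = np.abs(f - f.mean())

# Empirical tail probability vs the Gaussian concentration bound 2 exp(-t^2 / 2).
for t in (0.5, 1.0, 2.0, 3.0):
    print(t, np.mean(dev >= t), 2 * np.exp(-t**2 / 2))
```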


Proof (requires some familiarity with SDEs). The idea is to think of our Gaussian vector as the value of an $n$-dimensional Brownian path $B_t$ at time $1$ and to define the martingale $M_t = \mathbb{E}[f(B_1)\,|\,\mathcal{F}_t]$, which interpolates continuously between $M_0 = \gamma_n f$ and $M_1 = f(B_1)$. Since Brownian motion is also a Markov process, we can rewrite $M_t$ as $\mathbb{E}[f(B_1)\,|\,B_t] = F(t, B_t)$, where $F$ is a smooth function which in fact inherits the smoothness properties of $f$.⁵ Assuming it is valid to exchange integration and differentiation, one can write:

$$\|\nabla_x F(t, B_t)\| = \big\|\nabla_x\, \mathbb{E}\big[f(B_1 - B_t + B_t)\,\big|\,B_t\big]\big\| = \big\|\mathbb{E}_G\big[\nabla_x f\big(\sqrt{1-t}\,G + B_t\big)\big]\big\| \le \mathbb{E}_G\big[\big\|\nabla_x f\big(\sqrt{1-t}\,G + B_t\big)\big\|\big] \le \kappa$$

where the first inequality is due to Jensen’s inequality and the last one uses that $f$ is $\kappa$-Lipschitz. Thus, we can now use the fundamental theorem of calculus and write the difference $M_1 - M_0$ as an integral of $dM_t$, and this is where Itô’s formula kicks in: $$dM_t = \partial_t F(t, B_t)\,dt + \sum_{i=1}^n \partial_{x_i} F(t, B_t)\,dB_{t,i} + \frac{1}{2}\sum_{i,j} \partial^2_{x_i x_j} F(t, B_t)\,d\langle B_{\cdot,i}, B_{\cdot,j}\rangle_t$$ where $\langle X, Y\rangle$ is the quadratic covariation of two $L^2$ martingales, i.e. the process such that $XY - \langle X, Y\rangle$ is again a martingale. The quadratic covariation between coordinates $i$ and $j$ is zero for distinct $i, j$ due to independence. However, by the uniqueness of the quadratic variation, for $i = j$ we have: $$\mathbb{E}\big[B_{t,i}^2 - t\,\big|\,\mathcal{F}_s\big] = \mathbb{E}\big[(B_{t,i} - B_{s,i} + B_{s,i})^2 - t\,\big|\,\mathcal{F}_s\big] = B_{s,i}^2 - s$$ Thus $\langle B_{\cdot,i}, B_{\cdot,i}\rangle_t = t$ and we obtain: $$dM_t = \sum_{i=1}^n \partial_{x_i} F(t, B_t)\,dB_{t,i} + \Big(\partial_t F(t, B_t) + \frac{1}{2}\sum_i \partial^2_{x_i} F(t, B_t)\Big)\,dt$$ Since $M$ is a martingale, the second term (which has finite variation) must vanish, which implies that $F$ satisfies a (backward) heat equation, and we can rewrite: $$M_t - M_0 = \sum_{i=1}^n \int_0^t \partial_{x_i} F(s, B_s)\,dB_{s,i}$$

Now, by bilinearity of the quadratic variation: $$\langle M\rangle_t = \sum_{i,j}\int_0^t \partial_{x_i} F(s, B_s)\,\partial_{x_j} F(s, B_s)\,d\langle B_{\cdot,i}, B_{\cdot,j}\rangle_s = \sum_{i=1}^n\int_0^t \big|\partial_{x_i} F(s, B_s)\big|^2\,ds = \int_0^t \|\nabla_x F(s, B_s)\|^2\,ds \le \kappa^2 t$$

This might not seem significant, but a lot of information is buried within it. Again using Itô’s formula, one can prove that the following process is a martingale: $$Z_t = e^{\lambda M_t - \frac{\lambda^2}{2}\langle M\rangle_t} \ge e^{\lambda M_t - \frac{\lambda^2\kappa^2}{2}t}$$ Therefore: $$e^{\lambda \gamma_n f} = e^{\lambda M_0} = \mathbb{E}[Z_0] = \mathbb{E}[Z_1] \ge \mathbb{E}\big[e^{\lambda f(B_1) - \frac{\lambda^2\kappa^2}{2}}\big]$$ Now, by rearranging terms: $$\mathbb{E}\big[e^{\lambda(f - \gamma_n f)}\big] \le e^{\frac{\lambda^2\kappa^2}{2}}$$ which proves that $f$ is $\kappa$-subgaussian.
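The conclusion can be sanity-checked numerically; below is a minimal Monte Carlo sketch (assuming numpy) with the 1-Lipschitz function $f(x) = \max_i x_i$, so $\kappa = 1$; the dimension, sample size, and values of $\lambda$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
n, N = 30, 200_000
G = rng.standard_normal((N, n))

f = G.max(axis=1)            # f(x) = max_i x_i is 1-Lipschitz
centered = f - f.mean()

# Empirical MGF of f - E[f] vs the subgaussian bound exp(lambda^2 / 2).
for lam in (0.5, 1.0, 2.0):
    print(lam, np.mean(np.exp(lam * centered)), np.exp(lam**2 / 2))
```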

Subsequently, this theorem allows us to derive dimension-independent concentration for the spectral norm, based on the fact that it is a 1-Lipschitz function of the entries.⁶ Note that this inequality is in fact sharp for the special case of linear functions, in the sense that the RHS is the asymptotic tail of $N(0, \kappa^2)$, though this sharpness may be lost for other classes of Lipschitz functions. An immediate implication of the theorem is that the variance of $f$ is bounded above by $\kappa^2$, but what happens as the dimension changes? One way of testing sharpness is to find the true order of the variance and see whether it stays bounded below by a constant or shrinks toward zero as $n$ increases. In most cases, the variance actually vanishes! This phenomenon is usually referred to as superconcentration,⁷ and it comes with novel methods for computing the true order of the variance with respect to $n$. All of this raises the question: how does the variance of the spectral norm behave? In what follows I state some results surrounding this question without providing proofs. As an illuminating example, we may first take a look at the maximum of a Gaussian vector, which is a special case of the problem at hand.

Finite Maximum Over a Gaussian Vector

Consider $X \in \mathbb{R}^n$ to be a centered Gaussian vector with covariance matrix $\Sigma$. It is a well-established result that the mean of $X_{(n)} = \max_i X_i$ behaves roughly like $\sqrt{2\sigma^2\log(n)}$ and that the variance is bounded by $\sigma^2 = \max_{i \le n}\mathrm{Var}(X_i)$. To obtain the former, one can simply use Jensen’s inequality and optimize over $\lambda$ afterwards: $$\mathbb{E}[X_{(n)}] = \frac{1}{\lambda}\,\mathbb{E}\big[\log\big(e^{\lambda X_{(n)}}\big)\big] \le \frac{1}{\lambda}\log\big(\mathbb{E}\big[e^{\lambda X_{(n)}}\big]\big) \le \frac{1}{\lambda}\log\Big(\mathbb{E}\Big[\sum_{i=1}^n e^{\lambda X_i}\Big]\Big) \le \frac{1}{\lambda}\log\big(n\,e^{\frac{\lambda^2\sigma^2}{2}}\big)$$ and the choice $\lambda = \sqrt{2\log(n)}/\sigma$ gives the bound $\sqrt{2\sigma^2\log(n)}$. The variance bound, however, requires extra work and can be proved via Poincaré’s inequality mentioned earlier. Write $X = \Sigma^{1/2} Y$ with $Y$ standard Gaussian and note that $f(Y) = \max_i X_i$ is almost surely differentiable. Therefore, if we define $i^* = \mathrm{argmax}_i X_i$, then one can show via the chain rule that $f$ is $\sigma$-Lipschitz and obtain the bound: $$\mathrm{Var}(f) \le \mathbb{E}\big[\|\nabla_Y f\|^2\big] = \mathbb{E}\big[\|\Sigma^{1/2}\nabla_X f\|^2\big] = \mathbb{E}\big[\|\Sigma^{1/2} e_{i^*}\|^2\big] = \mathbb{E}\big[e_{i^*}^\top \Sigma\, e_{i^*}\big] \le \max_i \Sigma_{i,i} = \sigma^2$$
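To illustrate, here is a minimal Monte Carlo sketch (assuming numpy) with an arbitrary equicorrelated covariance matrix; both the $\sqrt{2\sigma^2\log(n)}$ mean bound and the $\sigma^2$ variance bound can be read off from the output.

```python
import numpy as np

rng = np.random.default_rng(6)
n, N, rho = 100, 20_000, 0.5

# Equicorrelated covariance with unit variances, so sigma^2 = 1 (arbitrary choice).
Sigma = rho * np.ones((n, n)) + (1 - rho) * np.eye(n)
L = np.linalg.cholesky(Sigma)
X = rng.standard_normal((N, n)) @ L.T    # rows are N(0, Sigma)

M = X.max(axis=1)
print(M.mean(), np.sqrt(2 * np.log(n)))  # mean vs sqrt(2 sigma^2 log n)
print(M.var(), 1.0)                      # variance vs sigma^2 = 1
```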

Even though the concentration theorem suggests that $X_{(n)}$ is $\sigma$-subgaussian, the sharpness of this conclusion is contingent upon Poincaré’s inequality being unbeatable. In fact, Talagrand⁸ proved an extension of Poincaré’s inequality which provides a sufficient condition for beating it. In particular, the maximum of independent Gaussians satisfies this condition, which yields the following bound: $$\mathrm{Var}(X_{(n)}) \lesssim \frac{1}{\log(n)}$$ Below is a simple simulation study with 1000 replications for each $n$ (between 10 and 100) which verifies this theoretical result.
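A minimal sketch of this kind of simulation (assuming numpy; the seed is my own choice, and the estimated variance is printed next to the $1/\log(n)$ scaling rather than plotted):

```python
import numpy as np

rng = np.random.default_rng(7)
reps = 1000

for n in range(10, 101, 10):
    maxima = rng.standard_normal((reps, n)).max(axis=1)
    # Estimated variance of the maximum vs the 1/log(n) scaling.
    print(n, maxima.var(), 1 / np.log(n))
```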

Variance Bound for Spectral Norm

Thus far we have proved that $\|A\|_{op}$ is 1-subgaussian; moreover, its variance is bounded by one in the case of standard normal entries, due to the fact that it is 1-Lipschitz. However, the following simulation indicates that the spectral norm enjoys superconcentration: its variance actually decays, roughly on the order of $\big(\tfrac{1}{\sqrt{n}} + \tfrac{1}{\sqrt{m}}\big)^{2/3}$, so Lipschitz concentration has once again failed to capture the underlying truth!⁹ In this simulation $m = n$ and the variance was estimated over 1000 replications; note that the slope of the fitted line on a log-log scale should be close to $-1/3$.
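A minimal sketch of this kind of simulation (assuming numpy; I use fewer replications and my own grid of matrix sizes to keep the run time short):

```python
import numpy as np

rng = np.random.default_rng(8)
reps = 200                      # the simulation described above used 1000
ns = [20, 40, 80, 160, 320]

variances = []
for n in ns:
    ops = [np.linalg.norm(rng.standard_normal((n, n)), ord=2) for _ in range(reps)]
    variances.append(np.var(ops))

# Fit a line to log(variance) against log(n); the slope should be close to -1/3.
slope = np.polyfit(np.log(ns), np.log(variances), 1)[0]
print(slope)
```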

Proving this variance bound demands much more theory, and I personally believe this is one of the cases where the effort that must be invested does not pay off proportionally!


  1. One can look at High-Dimensional Probability by Vershynin for a more comprehensive treatment of these definitions.

  2. For those concerned about the measurability of the spectral norm: the supremum can be taken over a countable collection indexed by dense subsets of $S^{n-1}$ and $S^{m-1}$.

  3. One might be suspicious about how the maximum magically showed up, and wonder for which class of functions this theorem holds. It is in fact a well-established result that the comparison holds for smooth functions whose mixed partials satisfy $\partial^2_{ij} f \le 0$ for all $i \neq j$, and the maximum is a (limiting) special case of such functions.

  4. The first three methods can be found in Pollard. The latter can be found in Sec. 4 of the Probability in High Dimension lecture notes by Ramon van Handel.

  5. By running a simple approximation argument one may assume f is sufficiently smooth and then extend the results to the general Lipschitz case.

  6. The space $\mathbb{R}^{n\times m}$ can be equipped with either the Frobenius or the spectral norm distance for this to make sense; the Gaussian concentration theorem uses the Frobenius (Euclidean) one, with respect to which the spectral norm is 1-Lipschitz since $\big|\|A\|_{op} - \|B\|_{op}\big| \le \|A - B\|_{op} \le \|A - B\|_F$.

  7. There is a wonderful book by Chatterjee on this topic.

  8. For a detailed derivation, see Sec. 8.3 of the lecture notes by Ramon van Handel mentioned above.

  9. A proof of this can be found in chapter 9 of the book by Chatterjee mentioned earlier.