Lecture 17: Variance

\( \newcommand{\set}[1]{\{#1\}} \newcommand{\comprehension}[2]{\{#1\,\vert\,#2\}} \newcommand{\size}[1]{\left\vert#1\right\vert} \newcommand{\true}{\top} \newcommand{\false}{\bot} \newcommand{\limplies}{\rightarrow} \newcommand{\divides}{\mathbin{\vert}} \newcommand{\mult}{\mathbin{\cdot}} \newcommand{\xor}{\oplus} \newcommand{\union}{\cup} \newcommand{\intersect}{\cap} \newcommand{\complement}[1]{\overline#1} \newcommand{\powerset}{\mathcal{P}} \newcommand{\ixUnion}{\bigcup} \newcommand{\ixIntersect}{\bigcap} \newcommand{\Div}{\mathrm{div}} \newcommand{\gcd}{\mathrm{gcd}} \newcommand{\divmod}{\mathop{\mathbf{divmod}}} \newcommand{\div}{\mathop{\mathbf{div}}} \newcommand{\mod}{\mathop{\mathbf{mod}}} \newcommand{\ceiling}[1]{\lceil#1\rceil} \newcommand{\floor}[1]{\lfloor#1\rfloor} \DeclareMathOperator{\Pr}{Pr} \DeclareMathOperator{\rng}{rng} \DeclareMathOperator{\E}{E} \DeclareMathOperator{\V}{V} \newcommand{\abs}[1]{\left\lvert#1\right\rvert} \)

Expectation (continued)

Theorem 17.1 If $A$ and $B$ are independent random variables over a probability space $(\Omega,\Pr)$, then $\E(A \mult B) = \E(A) \mult \E(B)$.

Proof

\begin{align*} \E(A \mult B) &= \sum_{r \in \rng(A \mult B)} r \mult \Pr(A \mult B = r) \\ &= \sum_{\substack{r_1\in\rng(A)\\ r_2 \in\rng(B)}} r_1 r_2\mult\Pr(A = r_1 \land B = r_2) \\ &= \sum_{r_1\in\rng(A)}\sum_{r_2 \in\rng(B)} r_1 r_2 \mult\Pr(A=r_1) \mult \Pr(B=r_2) \\ &= \sum_{r_1\in\rng(A)}\left[r_1\mult\Pr(A=r_1)\mult\sum_{r_2 \in\rng(B)} r_2 \mult \Pr(B=r_2)\right] \\ &= \left[\sum_{r_1\in\rng(A)}r_1\mult\Pr(A=r_1)\right]\mult\left[\sum_{r_2 \in\rng(B)} r_2 \mult \Pr(B=r_2)\right] \\ &= \E(A) \mult \E(B) \end{align*}

$\Box$ Theorem 17.1
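
Theorem 17.1 is easy to check by brute force on a small example. The following Python sketch (the pair of independent fair dice is just an assumed illustration, not part of the theorem) enumerates the product space and confirms that $\E(A \mult B) = \E(A) \mult \E(B)$.

```python
from itertools import product
from fractions import Fraction

# Assumed example: A and B are two independent fair six-sided dice,
# so each pair (a, b) has probability 1/36.
outcomes = list(product(range(1, 7), repeat=2))
pr = Fraction(1, 36)

E_A  = sum(a * pr for a, b in outcomes)      # 7/2
E_B  = sum(b * pr for a, b in outcomes)      # 7/2
E_AB = sum(a * b * pr for a, b in outcomes)  # 49/4

assert E_AB == E_A * E_B                     # agrees with Theorem 17.1
print(E_A, E_B, E_AB)
```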

Example 17.2 A lottery offers prizes according to the following table:

Prize         Odds
$1,000,000    1/10,000,000
$100,000      1/1,000,000
$10,000       1/100,000

In addition, there's a “multiplier” feature that triggers with the following odds:

Multiplier    Odds
10            1/100
5             1/50

What is the expected prize for a ticket?

Because the multipliers and base prizes are independent, we can compute the expectations separately and then multiply.

\begin{align*} \E(\text{base}) &= \$1{,}000{,}000 \mult 1/10{,}000{,}000 + \$100{,}000 \mult 1/1{,}000{,}000 + \$10{,}000 \mult 1/100{,}000\\ &= \$0.10 + \$0.10 + \$0.10\\ &= \$0.30 \end{align*}

The expected multiplier is

\begin{align*} \E(\text{multiplier}) &= 0.01 \mult 10 + 0.02 \mult 5 + 0.97 \mult 1\\ &= 0.1 + 0.1 + 0.97\\ &= 1.17 \end{align*}

So

\begin{align*} \E(\text{prize}) &= \E(\text{base} \mult \text{multiplier})\\ &= \E(\text{base}) \mult \E(\text{multiplier})\\ &= \$0.30 \mult 1.17\\ &= \$0.351 \end{align*}
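
The arithmetic above is easy to double-check mechanically. Here is a minimal Python sketch of the same computation; it assumes (as the text implicitly does) that a ticket that triggers neither multiplier simply gets a multiplier of 1.

```python
from fractions import Fraction

# Base prizes and multipliers from the tables in Example 17.2.
base = [(1_000_000, Fraction(1, 10_000_000)),
        (100_000,   Fraction(1, 1_000_000)),
        (10_000,    Fraction(1, 100_000))]
mult = [(10, Fraction(1, 100)),
        (5,  Fraction(1, 50)),
        (1,  1 - Fraction(1, 100) - Fraction(1, 50))]  # assumed: no multiplier

E_base = sum(v * p for v, p in base)     # 3/10    ($0.30)
E_mult = sum(m * p for m, p in mult)     # 117/100 (1.17)
print(E_base * E_mult)                   # 351/1000, i.e. $0.351
```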

Variance

Definition 17.3 Let $X$ be a random variable on the probability space $(\Omega,\Pr)$. The variance of $X$ is

\begin{align*} \V(X) &= \sum_{\omega\in\Omega} (X(\omega) - \E(X))^2 \mult \Pr(\omega)\\ &= \E((X-\E(X))^2) \end{align*}

The standard deviation of $X$ is $\sigma(X) = \sqrt{\V(X)}$. [Note that, not illogically, you'll sometimes see the variance denoted by $\sigma^2(X)$.]

There are a few things to notice about this definition. The individual terms in the variance sum are all non-negative (each is a square times a non-negative number), so the variance is always non-negative, and the standard deviation is well-defined. [Important note: for infinite probability spaces the defining sum need not converge, so the variance is not necessarily well-defined there!] The definition of standard deviation has very much a “Euclidean distance” feel to it (the square root of a sum of squares), and that's a useful relationship to keep in mind. We're approaching these things through the lens of pure mathematics, but were this a course in an applied science, the random variables would be dimensioned (e.g., they would have units of feet per second, or pips, or some such), and while variance doesn't have the same dimension as expectation, the standard deviation does.
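
To make the definition concrete, here is a minimal Python sketch that computes $\V(X)$ and $\sigma(X)$ straight from Definition 17.3; the two-point distribution it uses is just an assumed toy example.

```python
from fractions import Fraction
from math import sqrt

# Assumed toy distribution: X is 0 with probability 1/2 and 2 with probability 1/2.
dist = {0: Fraction(1, 2), 2: Fraction(1, 2)}

E = sum(x * p for x, p in dist.items())              # E(X) = 1
V = sum((x - E) ** 2 * p for x, p in dist.items())   # V(X) = 1
sigma = sqrt(V)                                      # standard deviation = 1.0

print(E, V, sigma)
```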

Exercise 17.4 Show that $\V(X) = 0$ if and only if $X$ is constant.

Theorem 17.5 If $X$ is a random variable, then $\V(X) = \E(X^2) - \E(X)^2$.

Proof

\begin{align*} \V(X) &= \sum_{\omega\in\Omega} \left(X(\omega) - \E(X)\right)^2 \mult \Pr(\omega) \\ &= \sum_{\omega\in\Omega} \left(X(\omega)^2 - 2\mult X(\omega) \mult \E(X) + \E(X)^2\right) \mult \Pr(\omega) \\ &= \sum_{\omega\in\Omega} X(\omega)^2 \mult \Pr(\omega) - 2 \mult \E(X) \mult \sum_{\omega\in\Omega} X(\omega)\mult \Pr(\omega) + \E(X)^2 \mult \sum_{\omega\in\Omega} \Pr(\omega)\\ &= \E(X^2) - 2 \E(X)^2 + \E(X)^2\\ &= \E(X^2) - \E(X)^2 \end{align*}

$\Box$ Theorem 17.5

We will often use Theorem 17.5 to simplify variance calculations.

Example 17.6 Consider a single Bernoulli Trial $X$ with probability $p$ of success, and $q=1-p$ of failure. Because $X$ is 0-1 valued, $X^2=X$, so we can compute

\begin{align*} \V(X) &= \E(X^2) - \E(X)^2\\ &= \E(X) - \E(X)^2\\ &= p - p^2 \tag{$\E(X) = p$ for a Bernoulli Trial }\\ &= p \mult (1-p)\\ &= p \mult q \end{align*}

Example 17.7 Consider a fair die, and let $X$ be the result of a single roll. We've already shown $\E(X) = 7/2$. We can compute the variance via $\E(X^2) = \sum_{i=1}^6 i^2 / 6 = 91/6$, so

\begin{align*} \V(X) &= \E(X^2) - \E(X)^2\\ &= {91\over 6} - \left({7 \over 2}\right)^2\\ &= {35 \over 12} \end{align*}

and

\begin{align*} \sigma(X) &= \sqrt{\V(X)}\\ &= \sqrt{35\over 12}\\ &\approx 1.7078\ldots \end{align*}
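
The same numbers fall out of a few lines of Python; the sketch below also checks that the shortcut of Theorem 17.5 agrees with Definition 17.3 for this die.

```python
from fractions import Fraction
from math import sqrt

# A single roll of a fair six-sided die, as in Example 17.7.
faces = range(1, 7)
p = Fraction(1, 6)

E  = sum(i * p for i in faces)                       # 7/2
E2 = sum(i * i * p for i in faces)                   # 91/6
V_shortcut   = E2 - E ** 2                           # Theorem 17.5: 35/12
V_definition = sum((i - E) ** 2 * p for i in faces)  # Definition 17.3

assert V_shortcut == V_definition
print(V_shortcut, sqrt(V_shortcut))                  # 35/12, approx. 1.7078
```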

Theorem 17.8 (Bienaymé's Formula) If $X$ and $Y$ are independent random variables, then $\V(X+Y) = \V(X) + \V(Y)$.

Proof

\begin{align*} \V(X+Y) &= \E((X+Y)^2) - \E(X+Y)^2\\ &= \E(X^2 + 2XY + Y^2) - (\E(X) + \E(Y))^2\\ &= \E(X^2) + 2\E(XY) + \E(Y^2) - \E(X)^2 - 2\E(X)\E(Y) - \E(Y)^2\\ &= (\E(X^2) - \E(X)^2) + (\E(Y^2) - \E(Y)^2)\tag{*}\\ &= \V(X) + \V(Y) \end{align*}

Note that at (*), we use $\E(XY) = \E(X)\E(Y)$ by independence.

$\Box$ Theorem 17.8

Example 17.9 Let the random variable $X$ be the sum of the rolls of two independent fair dice. What are the expectation, variance, and standard deviation?

We've already computed the expectation of one die as $7/2$. By the additivity of expectation, $\E(X) = \E(\text{roll}) + \E(\text{roll}) = 7/2 + 7/2 = 7$. We've also already computed the variance $\V(\text{roll}) = 35/12$, so by Bienaymé's formula $\V(X) = \V(\text{roll}) + \V(\text{roll}) = 35/6$, and $\sigma(X) = \sqrt{\V(X)} \approx 2.4152\ldots$.
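
Since this probability space has only 36 equally likely outcomes, Bienaymé's formula can also be verified here by direct enumeration; the following Python sketch does exactly that.

```python
from itertools import product
from fractions import Fraction
from math import sqrt

# Sum of two independent fair dice, as in Example 17.9.
pr = Fraction(1, 36)
totals = [a + b for a, b in product(range(1, 7), repeat=2)]

E = sum(t * pr for t in totals)             # 7
V = sum((t - E) ** 2 * pr for t in totals)  # 35/6 = V(roll) + V(roll)

print(E, V, sqrt(V))                        # 7, 35/6, approx. 2.4152
```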

Example 17.10 What is the variance associated with $n$ independent Bernoulli Trials, each with probability $p$ of success, and $q = 1-p$ of failure? By our calculation above for a single Bernoulli Trial, the variance of each trial is $p q$, so by Bienaymé's formula the variance for $n$ independent trials is $n p q$. This gives rise to a fundamental intuition: the variance grows linearly with the number of trials, and therefore the standard deviation grows with the square root of the number of trials.
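
A short Python sketch (with an assumed $p = 1/2$) makes the intuition visible: quadrupling the number of trials quadruples the variance $npq$ but only doubles the standard deviation $\sqrt{npq}$.

```python
from math import sqrt

# n independent Bernoulli trials with assumed p = q = 1/2.
p = q = 0.5
for n in [100, 400, 1600]:
    variance = n * p * q
    print(n, variance, sqrt(variance))  # variance: 25 -> 100 -> 400; std dev: 5 -> 10 -> 20
```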

The “point” of variance (and standard deviation) is that they give us some way to quantify how “spread out” a random variable is.

Theorem 17.11 (Chebyshev's Inequality) Let $X$ be a random variable. For all $a > 0$,

\begin{equation*} \Pr(\abs{X - \E(X)} \geq a) \leq {\V(X) \over a^2}. \end{equation*}

Proof Let $Y$ be the random variable $(X-\E(X))^2$. Then, by definition, $\E(Y) = \V(X)$. Markov's inequality applies to $Y$, as $Y$ is non-negative, so

\begin{equation*} \Pr(Y \geq a^2) \leq {\E(Y) \over a^2} \end{equation*}

We next observe that

\begin{align*} Y \geq a^2 &\iff (X-\E(X))^2 \geq a^2\\ &\iff \abs{X-\E(X)} \geq a \end{align*}

and recall that $\E(Y) = \V(X)$, whereupon

\begin{equation*} \Pr(\abs{X-\E(X)} \geq a) \leq {\V(X) \over a^2} \end{equation*}

as required.

$\Box$ Theorem 17.11
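
As a quick numerical illustration, here is a Python sketch that compares both sides of Chebyshev's inequality for a single fair die roll with $a = 2$ (an assumed example): the actual tail probability is $1/3$, comfortably below the bound $35/48$.

```python
from fractions import Fraction

# Chebyshev's inequality checked on a fair die roll (assumed example).
p = Fraction(1, 6)
E = Fraction(7, 2)    # E(X) for a fair die
V = Fraction(35, 12)  # V(X) for a fair die

a = 2
actual = sum(p for x in range(1, 7) if abs(x - E) >= a)  # Pr(|X - E(X)| >= 2) = 1/3
bound  = V / a ** 2                                      # 35/48

assert actual <= bound
print(actual, bound)
```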

*Exercise 17.12 Prof. Babai notes, “Chebyshev’s inequality tells us that random variables don’t like to stray away from their expected value by more than a small multiple of their standard deviation.” As presented (both here and in his notes), it does not say so directly. But Babai's remark is true, nevertheless. Give an alternative statement of Chebyshev's inequality that captures his meaning.

Example 17.13 Consider flipping a fair coin $n$ times. We know that the expected number of heads is $n/2$. We know that events like “all heads” are unlikely, especially as $n$ increases, but we'd like to say more, i.e., that with high probability, we get something close to $n/2$ heads. We know from the Bernoulli trials computation that the variance in the number of heads obtained on flipping the coin $n$ times is $n \mult 1/2 \mult 1/2 = n/4$. We can then plug $c \sqrt{n}$ in for $a$ in Chebyshev's inequality, obtaining,

\begin{align*} \Pr\left(\abs{\text{heads} - {n\over 2}} \geq c \sqrt{n}\right) &\leq {\V(\text{heads})\over (c \sqrt{n})^2} \\ &= {n \over 4} \mult {1\over c^2 n} \\ &= {1 \over 4c^2} \end{align*}

Let's consider now the special case where $n=100$, so $\sqrt{n} = 10$, and try to bound the probability of flipping 60 or more heads. Using $c=1$ in the analysis gives

\begin{equation*} \Pr(\abs{\text{heads} - 50} \geq 10) \leq {1\over 4} \end{equation*}

but this counts both tails of the distribution, “$\text{heads} \geq 60$” and “$\text{heads} \leq 40$.” By symmetry (we can interchange the roles of heads and tails, as the coin is fair), we know $\Pr(\text{heads} \geq 60) = \Pr(\text{heads} \leq 40) = {1\over 2} \Pr(\abs{\text{heads} - 50} \geq 10)$, whence,

\begin{equation*} \Pr(\text{heads} \geq 60) \leq {1 \over 2 } \mult {1\over 4} = {1\over 8} \end{equation*}

An exact calculation shows the true probability is only about $2.845\%$, so the bound we got from Chebyshev's inequality is far from tight; but it isn't terrible, and it relied on very limited information about the distribution (specifically, just $\E(X)$, $\V(X)$, and symmetry). Note that this is a much better bound than we'd have gotten by Markov's inequality alone.
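
That exact calculation is itself just a finite sum over the binomial distribution; the Python sketch below computes it and prints it next to the $1/8$ bound from Chebyshev's inequality.

```python
from math import comb

# Exact tail probability Pr(heads >= 60) for 100 flips of a fair coin.
n = 100
exact = sum(comb(n, k) for k in range(60, n + 1)) / 2 ** n

print(exact)   # approx. 0.0284
print(1 / 8)   # the Chebyshev-based bound: 0.125
```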

Exercise 17.14 What bound would Markov's inequality have given us on $\Pr(\text{heads} \geq 60)$ in the example above?

*Exercise 17.15 Consider a biased coin, in which $\Pr(\text{heads}) = 0.6$. Assume $100$ independent flips of this coin, and use Chebyshev's inequality to bound $\Pr(\text{heads} \leq 50)$. Note that with a biased coin, the tails are not symmetric.