Lecture 16: Expectation
\( \newcommand{\set}[1]{\{#1\}} \newcommand{\comprehension}[2]{\{#1\,\vert\,#2\}} \newcommand{\size}[1]{\left\vert#1\right\vert} \newcommand{\true}{\top} \newcommand{\false}{\bot} \newcommand{\limplies}{\rightarrow} \newcommand{\divides}{\mathbin{\vert}} \newcommand{\mult}{\mathbin{\cdot}} \newcommand{\xor}{\oplus} \newcommand{\union}{\cup} \newcommand{\intersect}{\cap} \newcommand{\complement}[1]{\overline#1} \newcommand{\powerset}{\mathcal{P}} \newcommand{\ixUnion}{\bigcup} \newcommand{\ixIntersect}{\bigcap} \newcommand{\Div}{\mathrm{div}} \newcommand{\gcd}{\mathrm{gcd}} \newcommand{\divmod}{\mathop{\mathbf{divmod}}} \newcommand{\div}{\mathop{\mathbf{div}}} \newcommand{\mod}{\mathop{\mathbf{mod}}} \newcommand{\ceiling}[1]{\lceil#1\rceil} \newcommand{\floor}[1]{\lfloor#1\rfloor} \DeclareMathOperator{\Pr}{Pr} \DeclareMathOperator{\rng}{rng} \DeclareMathOperator{\E}{E} \)Definition 16.1 Let $(\Omega,\Pr)$ be a probability space, and let $X$ be a random variable. The expected value of $X$ is
\begin{equation*} \E(X) = \sum_{\omega\in\Omega} X(\omega)\,\Pr(\omega) \end{equation*}You may think of “average” as a synonym for “expectation,” but in my opinion, they are distinct notions. An expectation is a statistic associated with a random variable over a given probability space. We use averages (of samples) to estimate that statistic. Still, “average,” or more precisely, “weighted average,” is a good basis for gaining intuition about what expectation means.
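Definition 16.1 translates directly into code. Here is a minimal sketch in Python; the helper name expectation and the coin example are illustrative choices of mine, not anything defined above. It takes a finite probability space as a dictionary mapping outcomes to probabilities, and a random variable as a function on outcomes.
from fractions import Fraction

def expectation(pr, X):
    # E(X) = sum over omega of X(omega) * Pr(omega), per Definition 16.1.
    # pr maps each outcome omega to Pr(omega); X maps omega to a number.
    return sum(X(omega) * p for omega, p in pr.items())

# A tiny illustrative example: a fair coin, with X counting heads.
coin = {'H': Fraction(1, 2), 'T': Fraction(1, 2)}
X = lambda omega: 1 if omega == 'H' else 0
print(expectation(coin, X))  # prints 1/2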
Example 16.2 Consider a fair 6-sided die. Let the random variable $X$ denote the number rolled.
\begin{align*} \E(X) &= 1 \mult {1\over 6} + 2 \mult {1\over 6} + 3 \mult {1\over 6} + 4 \mult {1\over 6} + 5 \mult {1\over 6} + 6 \mult {1\over 6}\\ &= {21\over 6}\\ &= 3 {1\over 2} \end{align*}Example 16.3 Now consider a loaded 6-sided die, in which
\begin{equation*} \Pr(\text{roll}) = \begin{cases} {1/2}, &\text{if $\text{roll}=6$,}\\ {1/10}, &\text{otherwise} \end{cases} \end{equation*}Let the random variable $Y$ denote the number rolled with this loaded die. We can now compute
\begin{align*} \E(Y) &= 1 \mult {1\over 10} + 2 \mult {1\over 10} + 3 \mult {1\over 10} + 4 \mult {1\over 10} + 5 \mult {1\over 10} + 6 \mult {1\over 2}\\ &= {45 \over 10}\\ &= 4 {1\over 2} \end{align*}Example 16.4 Let's suppose we roll three loaded dice as above, and count the sum of the numbers shown. What is the expected value? A direct computation by hand involves a certain amount of tedium: there are $6^3=216$ cases to consider. We can use the multinomial theorem, which reduces the problem to managing ${3 + 6 - 1 \choose 3} = 56$ terms, which is an improvement, but still a bit of a pain. A natural alternative (this is a CS class after all) is to write a little program, e.g.,
#!/usr/bin/env python3
from fractions import Fraction

def roll_pr(n):
    # probability that the loaded die shows n
    if n == 6:
        return Fraction(1, 2)   # 1/2
    else:
        return Fraction(1, 10)  # 1/10

expected_value = Fraction()  # zero
for roll1 in range(1, 7):
    for roll2 in range(1, 7):
        for roll3 in range(1, 7):
            pips = roll1 + roll2 + roll3
            prob = roll_pr(roll1) * roll_pr(roll2) * roll_pr(roll3)
            expected_value += pips * prob
print(expected_value)
This is a pretty straightforward translation of the usual expectation calculation from the notation of mathematics to the programming language Python. About the only notable things are that the range(1,7) construct defines the half-open interval $[1,\dots,7) = \set{1,\ldots,6}$, and that the Fraction class from the fractions module is used to do exact rational arithmetic. Running the program results in the output 27/2, which, if we think about it for a minute, is $3 \times 9/2 = 3 \times \E(Y)$, suggesting that there might be an easier way. Patience.
Lemma 16.5 Let $\Pr(X=r) = \Pr(\set{\omega: X(\omega) = r})$ as before, and let $r_1,\ldots,r_k$ be the (distinct) elements of $\rng(X)$. Then
\begin{equation*} \E(X) = \sum_{i=1}^k r_i \, \Pr(X=r_i) \end{equation*}Exercise 16.6 Prove Lemma 16.5.
*Exercise 16.7 Prove $\min_{\omega\in\Omega} X(\omega) \leq \E(X) \leq \max_{\omega\in\Omega} X(\omega)$.
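Before moving on, it may help to see Lemma 16.5 in action numerically. The following sketch is my own illustration (it uses the sum of two of the loaded dice from Example 16.3 as the random variable): it computes the expectation both ways, directly over the sample space, and by grouping outcomes according to the value of the random variable.
from fractions import Fraction
from collections import defaultdict

def roll_pr(n):
    # the loaded die from Example 16.3
    return Fraction(1, 2) if n == 6 else Fraction(1, 10)

# Sample space: ordered pairs of rolls; random variable: their sum.
omega = [(a, b) for a in range(1, 7) for b in range(1, 7)]
pr = {w: roll_pr(w[0]) * roll_pr(w[1]) for w in omega}
X = lambda w: w[0] + w[1]

# Definition 16.1: sum over all 36 outcomes.
direct = sum(X(w) * pr[w] for w in omega)

# Lemma 16.5: group outcomes by the value of X, then sum r * Pr(X = r).
by_value = defaultdict(Fraction)
for w in omega:
    by_value[X(w)] += pr[w]
grouped = sum(r * p for r, p in by_value.items())

print(direct, grouped)  # both print 9 (= 2 * 9/2)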
The sample-space parameter $\omega$ of a random variable is often suppressed, and expressions are formed of random variables that look like combinations of ordinary values, but implicitly define functions, e.g., if $X$ and $Y$ are random variables, we might consider the random variable $Z = 2X + 3Y$, i.e., $Z(\omega) = 2X(\omega) + 3Y(\omega)$.
Theorem 16.8 (Additivity of Expected Value) Let $(\Omega,\Pr)$ be a probability space, and $X_1,X_2,\ldots,X_n$ be random variables over $\Omega$. Then
\begin{equation*} \E(X_1 + \cdots + X_n) = \sum_{i=1}^n \E(X_i) \end{equation*}This theorem is often expressed by the pithy saying, “The expectation of the sum is the sum of the expectations.”
Proof
\begin{align*} \E(X_1 + \cdots + X_n) &= \sum_{\omega\in\Omega} \left(\sum_{i=1}^n X_i(\omega)\right)\Pr(\omega) \\ &= \sum_{\omega\in\Omega} \sum_{i=1}^n X_i(\omega)\Pr(\omega) \\ &= \sum_{i=1}^n\sum_{\omega\in\Omega} X_i(\omega)\Pr(\omega) \\ &= \sum_{i=1}^n \E(X_i) \end{align*}$\Box$ Theorem 16.8
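Notice that the proof uses nothing about how the $X_i$ relate to one another. As a quick sanity check (an illustration of my own, not part of the lecture's calculations), here is a numerical verification of Theorem 16.8 on two deliberately dependent random variables over the loaded-die space: the roll itself, and an indicator of whether the roll is even.
from fractions import Fraction

def roll_pr(n):
    # the loaded die from Example 16.3
    return Fraction(1, 2) if n == 6 else Fraction(1, 10)

omega = range(1, 7)
X = lambda w: w                       # the number rolled
Y = lambda w: 1 if w % 2 == 0 else 0  # 1 if the roll is even -- clearly dependent on X

def E(Z):
    return sum(Z(w) * roll_pr(w) for w in omega)

print(E(lambda w: X(w) + Y(w)))  # E(X + Y)
print(E(X) + E(Y))               # E(X) + E(Y): the same value, 26/5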
Example 16.9 (Example 16.4 revisited) Consider again the problem of computing the expected value of the sum of three loaded dice. We can view this as the expectation of the sum of three random variables, and so apply Theorem 16.8 directly. Let $A$, $B$, and $C$ be random variables denoting each die. Then,
\begin{align*} \E(A+B+C) &= \E(A) + \E(B) + \E(C) \\ &= 9/2 + 9/2 + 9/2 \\ &= 27/2 \end{align*}One way in which this example is mildly misleading is that the random variables $A$, $B$, and $C$ are not only pairwise independent, they're mutually independent. Of course, we haven't yet defined independence for random variables, only for events, and so:
Definition 16.10 Random variables $X$ and $Y$ are independent if for all $r$ and $s$ in $\mathbb{R}$, $\Pr(X=r \land Y=s) = \Pr(X=r) \cdot \Pr(Y=s)$.
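Definition 16.10 is easy to check mechanically on a finite space. The sketch below is my own illustration: it tests the defining condition for random variables on the space of two loaded-die rolls. The two individual rolls pass the test, while a roll and the sum of the rolls do not.
from fractions import Fraction
from itertools import product

def roll_pr(n):
    return Fraction(1, 2) if n == 6 else Fraction(1, 10)

omega = list(product(range(1, 7), repeat=2))           # two loaded rolls
pr = {w: roll_pr(w[0]) * roll_pr(w[1]) for w in omega}

def independent(X, Y):
    # check Pr(X=r and Y=s) == Pr(X=r) * Pr(Y=s) for all r, s in the ranges
    def prob(pred):
        return sum(pr[w] for w in omega if pred(w))
    for r in set(map(X, omega)):
        for s in set(map(Y, omega)):
            joint = prob(lambda w: X(w) == r and Y(w) == s)
            if joint != prob(lambda w: X(w) == r) * prob(lambda w: Y(w) == s):
                return False
    return True

A = lambda w: w[0]
B = lambda w: w[1]
print(independent(A, B))                      # True
print(independent(A, lambda w: w[0] + w[1]))  # False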
But Theorem 16.8 has no independence hypotheses, and indeed, it is often most useful in situations where the random variables in question are not independent. Here are two examples:
Example 16.11 (The Hatcheck Problem) Assume $n$ people throw their hats in a bin, and that the hats are redistributed in a one-to-one fashion at random. What is the expected number of people who get their own hats back?
Let $X$ be a random variable that denotes the number of people who get their own hat back. We can introduce the random variable
\begin{equation*} X_i = \begin{cases} 1, &\text{if person $i$ gets their own hat back}\\ 0, &\text{otherwise} \end{cases} \end{equation*}Clearly $X = \sum_{i=1}^n X_i$. It's also clear that $\E(X_i) = 1/n$: person $i$ is equally likely to receive any of the $n$ hats, and so gets their own with probability $1/n$. Thus,
\begin{align*} \E(X) &= \E(\sum_{i=1}^n X_i)\\ &= \sum_{i=1}^n \E(X_i) \\ &= \sum_{i=1}^n 1/n \\ &= 1 \end{align*}Example 16.12 (Inversions in a Permutation) Consider the permutations of the numbers $\set{1,\ldots,n}$ with the uniform distribution. A pair of numbers $i \lt j$ are said to be inverted in a permutation $\pi$ if $j$ precedes $i$, i.e., they occur out of order. What is the expected number of inverted pairs in a permutation?
A glib answer would be “half of them.” It's also a correct answer: a permutation is exactly as likely as its reversal, and reversal exchanges the inverted and non-inverted pairs, so on average half the pairs are inverted. Still, we can strive for a more complete answer.
Let $I_{i,j}$ for $i \lt j$ be the random variable that is $1$ if $i$ occurs after $j$, i.e., the pair $(i,j)$ is inverted, and $0$ otherwise. If $X$ is the random variable denoting the number of inverted pairs in a permutation, we easily see that $X = \sum_{1\leq i \lt j \leq n} I_{i,j}$, and so $\E(X) = \sum_{1\leq i \lt j \leq n} \E(I_{i,j})$. Now, $i$ and $j$ will be ordered half the time, and out of order half the time (to see this, note that we can pair in-order and out-of-order permutations by swapping $i$ and $j$), so $\E(I_{i,j}) = 1/2$, and the number of terms in the summation is $n \choose 2$, so
\begin{align*} \E(X) &= {n \choose 2}\mult {1 \over 2} \\ &= {n (n-1) \over 2}\mult {1 \over 2} \\ &= {n (n-1) \over 4} \end{align*}One reason why this matters is that there's a standard, albeit quite poor, sorting algorithm known as “bubble-sort.” The idea behind bubble sort is to repeatedly make passes across an array of items to be sorted, swapping a[i] and a[i+1] whenever a[i] > a[i+1]. If bubble-sort is coded naïvely, it will make $n-1$ passes on an array of $n$ elements (the first pass is guaranteed to put the largest item into the correct position, the second pass the second largest element, etc.). This gives rise to a trivial $O(n^2)$ time analysis: $n-1$ sweeps through the array, times at most $n-1$ comparisons/swaps per sweep.
But from time to time, it occurs to people to try to “improve” bubble-sort, the practical effect of which is often to make a precise analysis more difficult to pin down. Still, we can see that each swap of adjacent out-of-order elements lowers the inversion count by exactly $1$, and so we expect it will take $n (n-1)/4$ swapping-of-adjacent steps to bring a random permutation into sorted order. Thus, all of these variants require $\Omega(n^2)$ swap steps in expectation (and the worst case is at least as bad as the expected case), even if you're given the optimal next swap for free, so we'll have to do something more clever than swaps-of-adjacents to beat ${n (n-1) \over 4}$.
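Both Example 16.11 and Example 16.12 are easy to sanity-check by simulation: a random permutation models the hatcheck redistribution, its fixed points are the people who get their own hats back, and its inversions are the out-of-order pairs (and hence the adjacent swaps bubble-sort must make). The sketch below is an illustration of my own, with arbitrary choices of $n$ and the number of trials; it estimates both expectations by averaging over random permutations.
import random

def fixed_points(perm):
    # people who get their own hat back: positions i with perm[i] == i
    return sum(1 for i, p in enumerate(perm) if p == i)

def inversions(perm):
    # pairs of positions (i, j), i < j, whose values appear out of order
    n = len(perm)
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if perm[i] > perm[j])

n, trials = 10, 50_000
fp_total = inv_total = 0
for _ in range(trials):
    perm = list(range(n))
    random.shuffle(perm)
    fp_total += fixed_points(perm)
    inv_total += inversions(perm)

print(fp_total / trials)   # close to 1 (Example 16.11)
print(inv_total / trials)  # close to n*(n-1)/4 = 22.5 (Example 16.12)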
Exercise 16.13 Prove the following useful generalization of Theorem 16.8:
Theorem (Linearity of Expected Value) Let $(\Omega,\Pr)$ be a probability space, $c_1,c_2,\ldots,c_n$ be real numbers, and $X_1,X_2,\ldots,X_n$ be random variables over $\Omega$. Then
\begin{equation*} \E(\sum_{i=1}^n c_i X_i) = \sum_{i=1}^n c_i \E(X_i) \end{equation*}Theorem 16.14 (Markov's Inequality) If $X$ is nonnegative, then $\forall a > 0,$
\begin{align*} \Pr(X \geq a) \leq {\E(X) \over a} \end{align*}Proof We begin with calculating a simple estimate for $\E(X)$:
\begin{align*} \E(X) &= \sum_{\omega\in\Omega} X(\omega)\,\Pr(\omega)\\ &\geq \sum_{\omega\in\Omega, X(\omega) \geq a} X(\omega)\,\Pr(\omega)\tag{summing over fewer non-negative items}\\ &\geq \sum_{\omega\in\Omega, X(\omega) \geq a} a\,\Pr(\omega)\tag{summing smaller things}\\ &= a \sum_{\omega\in\Omega, X(\omega) \geq a}\Pr(\omega)\tag{linearity of summation}\\ &= a \Pr(X \geq a) \end{align*}We complete the proof by dividing through by $a$.
$\Box$ Theorem 16.14
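As a concrete (if unexciting) illustration, here is a small computation of my own comparing Markov's bound to the actual tail probabilities for the loaded die of Example 16.3, whose expectation is $9/2$.
from fractions import Fraction

def roll_pr(n):
    return Fraction(1, 2) if n == 6 else Fraction(1, 10)

EY = sum(n * roll_pr(n) for n in range(1, 7))  # 9/2
for a in range(1, 7):
    actual = sum(roll_pr(n) for n in range(a, 7))  # Pr(Y >= a)
    print(a, actual, EY / a)                       # actual tail vs. Markov bound E(Y)/a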
Markov's inequality seems quite weak, and many of the “applications” look lame, usually because they involve circumstances where the full distribution is already known. What gives Markov's inequality power is that it gives us some information about the tail of the distribution of a random variable, even when very little (only the expectation) is known. If you know more, it's unsurprising that you can do better. Even much, much better.
It's worth noting that Markov's inequality is “tight,” given its hypotheses. Consider an arbitrary $a \geq 1$, the probability space $(\set{x,y},\Pr)$, where $\Pr(x) = 1/a$ and $\Pr(y) = 1-1/a$, and a random variable $X$ such that $X(x) = a$ and $X(y)=0$. In this case, it is easy to see that $\E(X)=1$, and $\Pr(X \geq a) = 1/a = \E(X)/a$, so the inequality holds with equality.
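Here is that construction checked numerically, with an arbitrary choice of $a$ (the value $a = 4$ below is just an example):
from fractions import Fraction

a = 4  # any a >= 1 works; 4 is an arbitrary example
pr = {'x': Fraction(1, a), 'y': 1 - Fraction(1, a)}
X = {'x': a, 'y': 0}

EX = sum(X[w] * pr[w] for w in pr)          # E(X) = 1
tail = sum(pr[w] for w in pr if X[w] >= a)  # Pr(X >= a) = 1/a
print(EX, tail, EX / a)                     # E(X), Pr(X >= a), Markov bound E(X)/a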
Exercise 16.15 Prove the following generalization of Markov's inequality: if $X$ is a random variable, $f$ is non-negative on the range of $X$, and $a > 0$, then $\Pr(f(X) \geq a) \leq {\E(f(X)) \over a}$.