Lecture 18: Bayes' Theorem and Asymptotics
\( \newcommand{\set}[1]{\{#1\}} \newcommand{\comprehension}[2]{\{#1\,\vert\,#2\}} \newcommand{\size}[1]{\left\vert#1\right\vert} \newcommand{\true}{\top} \newcommand{\false}{\bot} \newcommand{\limplies}{\rightarrow} \newcommand{\divides}{\mathbin{\vert}} \newcommand{\mult}{\mathbin{\cdot}} \newcommand{\xor}{\oplus} \newcommand{\union}{\cup} \newcommand{\intersect}{\cap} \newcommand{\complement}[1]{\overline#1} \newcommand{\powerset}{\mathcal{P}} \newcommand{\ixUnion}{\bigcup} \newcommand{\ixIntersect}{\bigcap} \newcommand{\Div}{\mathrm{div}} \newcommand{\gcd}{\mathrm{gcd}} \newcommand{\divmod}{\mathop{\mathbf{divmod}}} \newcommand{\div}{\mathop{\mathbf{div}}} \newcommand{\mod}{\mathop{\mathbf{mod}}} \newcommand{\ceiling}[1]{\lceil#1\rceil} \newcommand{\floor}[1]{\lfloor#1\rfloor} \DeclareMathOperator{\Pr}{Pr} \DeclareMathOperator{\rng}{rng} \DeclareMathOperator{\E}{E} \DeclareMathOperator{\V}{V} \newcommand{\abs}[1]{\left\lvert#1\right\rvert} \)Bayes' Theorem
This material is covered in Rosen, Ch. 7.3.
A common situation is that we are given a prior probability of a certain event together with some additional information, and are then asked to give a posterior probability of the same event, i.e., the probability that takes the additional information into account.
Example 18.1 A light bulb factory has three machines, $A$, $B$, and $C$. Machine $A$ produces $20,000$ bulbs per day, machine $B$ produces $30,000$ bulbs per day, and machine $C$ produces $50,000$ bulbs per day. Of the bulbs produced by machine $A$, $5\%$ are defective, by machine $B$, $3\%$, and by machine $C$, $1\%$. If an item chosen at random from the day's production is found to be defective, what is the probability that it was produced by machine $A$?
For this example, we'll compute the number of defective bulbs produced by each machine: $20,000 \mult 0.05 = 1,000$ for machine $A$, $30,000 \mult 0.03 = 900$ for machine $B$, and $50,000 \mult 0.01 = 500$ for machine $C$. Thus, there are $1,000 + 900 + 500 = 2,400$ defective bulbs produced by the factory overall, and the fraction of defective bulbs produced by $A$ is ${1000/2400} = {5/12}$. Likewise, the probability that the defective bulb was manufactured by machine $B$ is ${900/2400} = {3/8}$, and by machine $C$ is ${500/2400} = {5/24}$.
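By way of illustration, here's a minimal sketch of this counting computation in Python; the variable names are our own, and we use the standard fractions module to keep the arithmetic exact:

```python
from fractions import Fraction

# Daily production and defect rates from Example 18.1.
production = {'A': 20_000, 'B': 30_000, 'C': 50_000}
defect_rate = {'A': Fraction(5, 100), 'B': Fraction(3, 100), 'C': Fraction(1, 100)}

# Expected number of defective bulbs per machine, and the factory total.
defective = {m: production[m] * defect_rate[m] for m in production}
total = sum(defective.values())

for m, d in defective.items():
    print(m, d / total)  # A 5/12, B 3/8, C 5/24
```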
In this particular example, the prior probabilities are the relative rates of production of the three machines, expressed in terms of light bulbs produced by each machine per day. We're then given some additional information: the likelihood that a bulb produced by each particular machine is defective. Then we're asked to give the posterior probability distribution that a light bulb was produced by each machine, given that it was defective. In this case, the posterior distribution can be quite different from the prior distribution: $C$ accounts for half of the total production, but only a little more than a fifth of the production of defective bulbs.
Theorem 18.2 (Bayes' Theorem) If $A$ and $B$ are events with $\Pr(A),\Pr(B) \gt 0$, then
\begin{equation*} \Pr(A \vert B) = {\Pr(B \vert A) \mult \Pr(A) \over \Pr(B)} \end{equation*}Proof By the definition of conditional probability,
\begin{equation*} \Pr(A \vert B) = {\Pr(A \intersect B) \over \Pr(B)} \end{equation*}If we multiply through by $\Pr(B)$, we get
\begin{equation*} \Pr(A \vert B) \mult \Pr(B) = \Pr(A \intersect B) \end{equation*}By symmetry,
\begin{equation*} \Pr(B \vert A) \mult \Pr(A) = \Pr(A \intersect B) \end{equation*}By transitivity,
\begin{align*} \Pr(A \vert B) \mult \Pr(B) &= \Pr(B \vert A) \mult \Pr(A)\\ \Pr(A \vert B) &= {\Pr(B \vert A) \mult \Pr(A) \over \Pr(B)} \end{align*}as required.
$\Box$ Theorem 18.2
Note that in practical uses, we'll often compute $\Pr(B)$ via the law of total probability, and the version of Bayes' Theorem in Rosen anticipates this. I prefer the form above, not merely because it is more often cited, but because it is easier to remember.
Example 18.3 Let's revisit Example 18.1, in light of Bayes' Theorem. Let the $A$ of Bayes' Theorem be the event “manufactured by machine $A$,” and the $B$ of Bayes' Theorem be the event “the sampled light bulb is defective.”
We compute $\Pr(\text{machine $A$}) = 20,000 / (20,000 + 30,000 + 50,000) = 0.2$. We compute $\Pr(\text{defective})$ by the law of total probability:
\begin{align*} \Pr(\text{defective}) &= \sum_{M\in\set{A,B,C}} \Pr(\text{defective} \vert \text{machine $M$}) \mult \Pr(\text{machine $M$})\\ &= 0.05 \mult {20,000\over 100,000} + 0.03 \mult {30,000\over 100,000} + 0.01 \mult {50,000\over 100,000}\\ &= 0.024\\ \end{align*}Whence
\begin{align*} \Pr(\text{machine $A$} \vert \text{defective}) &= {\Pr(\text{defective} \vert \text{machine $A$}) \mult \Pr(\text{machine $A$}) \over \Pr(\text{defective})} \\ &= {0.05 \mult 0.2 \over 0.024}\\ &= {10 \over 24}\\ &= {5 \over 12}\\ \end{align*}as before.
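Here is the same calculation organized around Bayes' Theorem and the law of total probability, as a minimal Python sketch; the helper name `posterior` is our own invention, not a standard library function:

```python
from fractions import Fraction

def posterior(prior, likelihood):
    """Posterior distribution over hypotheses via Bayes' Theorem,
    with Pr(B) computed by the law of total probability."""
    total = sum(likelihood[h] * prior[h] for h in prior)  # Pr(B)
    return {h: likelihood[h] * prior[h] / total for h in prior}

prior = {'A': Fraction(2, 10), 'B': Fraction(3, 10), 'C': Fraction(5, 10)}
likelihood = {'A': Fraction(5, 100), 'B': Fraction(3, 100), 'C': Fraction(1, 100)}
for machine, p in posterior(prior, likelihood).items():
    print(machine, p)  # A 5/12, B 3/8, C 5/24
```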
Example 18.4 Let's consider the occasionally dishonest casino, in which $10\%$ of the dice are loaded so that a $6$ has probability $1/2$ and each of the other five numbers has probability $1/10$; the remaining dice are fair. Suppose you pick up a die, and roll two consecutive $6$'s. How likely is the die to be loaded?
We want to compute $\Pr(\text{loaded} \vert 66)$. To apply Bayes' Theorem we'll need $\Pr(\text{loaded})$, $\Pr(66 \vert \text{loaded})$ and $\Pr(66)$. We were given $\Pr(\text{loaded}) = 0.10$, and we can easily compute $\Pr(66 \vert \text{loaded}) = (1/2)^2 = 1/4$. To compute $\Pr(66)$, we need to do a total probability calculation:
\begin{align*} \Pr(66) &= \Pr(66 \vert \text{loaded}) \mult \Pr(\text{loaded}) + \Pr(66 \vert \text{fair}) \mult \Pr(\text{fair}) \\ &= {1 \over 2^2} \mult {1 \over 10} + {1\over 6^2} \mult {9\over 10}\\ &= {1 \over 20} \end{align*}Finally,
\begin{align*} \Pr(\text{loaded} \vert 66) &= {\Pr(66 \vert \text{loaded}) \mult \Pr(\text{loaded}) \over \Pr(66)} \\ &= {{1 \over 2^2} \mult {1 \over 10} \over {1\over 20}}\\ &= {1 \over 2} \end{align*}Surprising.
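Reusing the hypothetical `posterior` helper (and the `Fraction` import) from the sketch above:

```python
prior = {'loaded': Fraction(1, 10), 'fair': Fraction(9, 10)}
likelihood = {'loaded': Fraction(1, 2)**2,  # Pr(66 | loaded): two independent 6's
              'fair': Fraction(1, 6)**2}    # Pr(66 | fair)
for die, p in posterior(prior, likelihood).items():
    print(die, p)  # loaded 1/2, fair 1/2
```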
*Exercise 18.5 Two local industries occasionally discharge pollution illegally into the river. United Toxic Chemicals is responsible for only $10\%$ of all such discharges, but their pollution is very dangerous, and it is important to take immediate action if they are responsible. Amalgamated Organic Fertilizers is responsible for $90\%$ of all discharges, but their pollution, while obnoxious, is not dangerous.
The two discharges are difficult to distinguish, the most obvious difference being that the UTC discharges typically contain twice as many white plastic jugs as black plastic jugs, while this proportion is reversed for AOF discharges.
You notice a discharge, and observe the following sequence of jugs floating past: black, black, white, black, white, white, black, black, black, white, black. How likely is it that UTC is the polluter?
An interesting application of Bayes' Theorem not covered in the standard texts is Bayesian optimal search. There is a story.
In 1966, a B-52 bomber carrying four nuclear weapons collided with a KC-135 tanker during mid-air refueling off the coast of Spain. Both aircraft and most of their crews were lost, and the nuclear weapons dispersed. Three of the weapons were found on land, near the fishing village of Palomares. The fourth was more difficult to locate. An ad hoc team of scholars (some from the military, some from MIT) developed a technique based on Bayesian methods to organize the search. The basic idea was to develop a probabilistic decision tree of the various events that could have happened: did the parachute deploy? If so, did it deploy fully? What were the directions of the winds, currents, etc.? They then divided a map into small squares, and computed a probability that the weapon would be in each location. As the search progressed, failures to find the weapon in the a priori most probable locations were used to update the probabilities in the model, via Bayes' Theorem, and future search priorities were updated accordingly. The effort was a success.
In 1968, the USS Scorpion, a nuclear attack submarine, was lost transiting between the Mediterranean and its home port in Norfolk. The same team that had been involved in the Palomares search was reconstituted, and led another search effort, resulting in the discovery of the wreck of the Scorpion.
Curiously enough, during my last year at graduate school, I participated in a job fair at the AMS annual meeting, and was interviewed by a fellow who ran a consulting business for the US Navy, using Bayesian optimal search to find various things that the Navy was interested in finding. I declined, but it was clear that this place had hired a number of new Ph.D.'s in computability theory, and was interested in hiring more.
Let's consider a simple problem, which gives something of the flavor of this particular sort of application.
A young child is lost in a national park. There are 12 search teams to be partitioned between two different areas: the plains and the valleys. Past experience indicates that 71% of the time, lost children are ultimately found in the plains, and only 29% of the time are they found in the valleys. Moreover, if a child is lost in the plains, and $n$ teams search there, then the probability that they will be found alive is $1 - e^{-1.2n}$; but the corresponding probability for $n$ teams searching the valleys is only $1 - e^{-0.4n}$. How should you allocate your search teams between the plains and the valleys to maximize the probability that the child will be found?
In this case, we can compute the probability that the child will be found alive on the assumption that $k$ teams are assigned to the plains as follows:
\begin{equation*} \Pr(\text{found alive} \vert \text{$k$ teams assigned to the plains}) = 0.71 \mult (1 - e^{-1.2k}) + 0.29 \mult (1 - e^{-0.4(12-k)}) \end{equation*}Determining the optimal allocation from here is best done by writing a small program, and it turns out that the optimal allocation is 4 teams to the plains, and 8 to the valleys, which results in a 98.2% chance that the child will be found alive.
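Here is a minimal sketch of such a program in Python; the name `p_found` is our own:

```python
import math

PLAINS, VALLEYS = 0.71, 0.29  # prior: where lost children are ultimately found
TEAMS = 12

def p_found(k):
    """Pr(child is found alive) when k teams search the plains and
    the remaining TEAMS - k teams search the valleys."""
    return (PLAINS * (1 - math.exp(-1.2 * k))
            + VALLEYS * (1 - math.exp(-0.4 * (TEAMS - k))))

best = max(range(TEAMS + 1), key=p_found)
print(best, p_found(best))  # 4 0.98233...
```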
Asymptotics
This material is covered in Rosen, Ch. 3.2.
We often need a rough-and-ready system for describing the growth of functions for increasing $n$. It's clear enough, e.g., that $n^2$ doesn't grow nearly as fast as $2^n$, and we'd like a language for talking about this.
Definition 18.6 Let $f,g :\mathbb{N} \to \mathbb{R}$. We say $f(x)$ is $O(g(x))$ if there exist constants $c$ and $k$ such that
\begin{equation*} \forall x \geq k, \abs{f(x)} \leq c \mult \abs{g(x)} \end{equation*}This is read as “$f(x)$ is big-oh of $g(x)$.”
The idea here is that $g(x)$ grows as fast or faster than $f(x)$, past some initial finite portion, and up to a constant multiple.
Note that because this is a robust, rough-and-ready system, there's not a lot of interest in being economical in our choice of $k$ and $c$. We're more concerned with the form of $g$, that it be as simple as possible. Just win, baby, as Al Davis used to say.
Example 18.7 Show that $x^2 + 2x + 1$ is $O(x^2)$. Let $k=1$ and $c=4$. Then, for $x \geq 1$,
\begin{align*} x^2 + 2x + 1 &\leq x^2 + 2x^2 + x^2 \tag{$x^2 \geq x,1$ for $x \geq 1$}\\ &= 4x^2 \end{align*}Note here how we “round up” the low-order terms, resulting in an expression which involves only the high-order term, and then the fact that we don't care about simple constant multiples means that we (in effect) get to disregard those low-order terms.
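A quick numeric spot-check of these witnesses in Python (a sanity check over a finite range, not a proof):

```python
# Check that x^2 + 2x + 1 <= 4x^2 holds for x >= 1, over a finite range.
assert all(x**2 + 2*x + 1 <= 4 * x**2 for x in range(1, 10_000))
```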
Let's develop a little theoretical machinery...
Theorem 18.8 If $\abs{f(x)}$ is $O(\abs{g(x)})$, then $f(x)$ is $O(g(x))$.
To see this, it's enough to notice that $\abs{\abs{z}} = \abs{z}$. The significance of this theorem is that it enables us to assume that the functions we're working with are non-negative valued, which simplifies the definition of $O$, as we can throw away the absolute values, and reason about simple inequalities.
Theorem 18.9 Let $f_1(x)$ be $O(g_1(x))$, and $f_2(x)$ be $O(g_2(x))$. Then $f_1(x) + f_2(x)$ is $O(\max(g_1(x),g_2(x)))$.
Proof WLOG, all of the functions are non-negative valued. Let $k_1$ and $c_1$ witness that $f_1(x)$ is $O(g_1(x))$, i.e., $\forall x \geq k_1, f_1(x) \leq c_1 \mult g_1(x)$, and let $k_2$ and $c_2$ witness that $f_2(x)$ is $O(g_2(x))$. We claim that $k = \max(k_1,k_2)$ and $c = c_1 + c_2$ witness that $f_1(x) + f_2(x)$ is $O(\max(g_1(x),g_2(x)))$. To that end, let $x \geq k$:
\begin{align*} f_1(x) + f_2(x) &\leq c_1 \mult g_1(x) + c_2 \mult g_2(x) \tag{$x \geq k \geq k_1,k_2$}\\ &\leq c_1 \mult \max(g_1(x),g_2(x)) + c_2 \mult \max(g_1(x),g_2(x))\\ &= (c_1 + c_2) \mult \max(g_1(x),g_2(x))\\ &= c \mult \max(g_1(x),g_2(x)) \end{align*}$\Box$ Theorem 18.9
Corollary 18.10 If $f_1(x)$ and $f_2(x)$ are both $O(g(x))$, then $f_1(x) + f_2(x)$ is $O(g(x))$.
The following simple theorems can be proven in much the same way as Theorem 18.9:
Exercise 18.11 Show that if $f(x)$ is $O(g(x))$, then $c \mult f(x)$ is $O(g(x))$.
Exercise 18.12 Show that if $f_1(x)$ is $O(g_1(x))$, and $f_2(x)$ is $O(g_2(x))$, then $f_1(x) \mult f_2(x)$ is $O(g_1(x) \mult g_2(x))$.
Much as $O$-notation classifies the rate of growth of a function from above, we have the following similar definitions:
Definition 18.13 Let $f,g :\mathbb{N} \to \mathbb{R}$. We say $f(x)$ is $\Omega(g(x))$ if there exist constants $c$ and $k$ such that
\begin{equation*} \forall x \geq k, \abs{f(x)} \geq c \mult \abs{g(x)} \end{equation*}This is read as “$f(x)$ is big-Omega of $g(x)$.”
Definition 18.14 Let $f,g :\mathbb{N} \to \mathbb{R}$. We say $f(x)$ is $\Theta(g(x))$ if $f(x)$ is both $O(g(x))$ and $\Omega(g(x))$.
This is read as “$f(x)$ is big-Theta of $g(x)$.”
Exercise 18.15 Show that the relation “$f(x)$ is $\Theta(g(x))$” is an equivalence relation on functions of type $\mathbb{N} \to \mathbb{R}$.
If there are bigger hammers in your toolkit...
A common conceit of discrete math classes is that the only real connection between continuous and discrete methods is that mathematical maturity in one prepares one to learn the other, but there's not a lot of interplay between the two at the level of content. Asymptotic analysis, though, is one of those areas where the tools of calculus can give us quick and easy proofs. After all, many of these functions of type $\mathbb{N} \to \mathbb{R}$ are actually restrictions of functions of type $\mathbb{R} \to \mathbb{R}$ to the domain $\mathbb{N}$, and this has consequences. If you don't know calculus, you can ignore this, but if you do know calculus, let's swing that hammer...
In the most common case, we're interested in comparing functions $f$ and $g$ such that $\lim_{x \to \infty} f(x) = \lim_{x \to \infty} g(x) = \infty$. In this case, we can show without too much difficulty that $f(x)$ is $O(g(x))$ if and only if $\limsup_{x \to\infty} f(x) / g(x) \lt \infty$, and this is nice because very often $\lim_{x\to\infty} f(x)/g(x)$ exists, and is easy to compute. E.g., let's reconsider the example above, showing again that $x^2 + 2x + 1$ is $O(x^2)$.
We compute
\begin{align*} \lim_{x \to \infty} {x^2 + 2x + 1 \over x^2} &= \lim_{x\to\infty} \left(1 + {2\over x} + {1 \over x^2}\right) \\ &= 1 \\ &\lt \infty \end{align*}and we're done. Moreover, we get some additional nice information: we can't infer from this that $c=1$ will work as a witness to $x^2+2x+1$ being $O(x^2)$, but we do know that $1+\epsilon$ for any $\epsilon>0$ will work.
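For what it's worth, such limits can also be computed mechanically; here's a sketch assuming the sympy library is available:

```python
import sympy

x = sympy.symbols('x')
# The ratio tends to 1, which is finite, so x^2 + 2x + 1 is O(x^2).
print(sympy.limit((x**2 + 2*x + 1) / x**2, x, sympy.oo))  # 1
```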
Let's do another
Example 18.16 Let $p(x)$ be a polynomial. Then $p(x)$ is $O(2^x)$.
To see this, we want to compute $$\lim_{x\to\infty} {p(x) \over 2^x}.$$ Assuming $p$ is non-constant, this takes the indeterminate form $\infty / \infty$, so L'Hôpital's rule applies. If we apply it $\deg p$ times, the numerator will be constant, while the denominator will be $(\ln 2)^{\deg p} \mult 2^x \to \infty$, so the limit is zero, and in particular finite.
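Continuing the hypothetical sympy session above, we can check a particular instance, say $p(x) = x^5$:

```python
# p(x) = x^5: the ratio p(x) / 2^x tends to 0, so x^5 is O(2^x).
print(sympy.limit(x**5 / 2**x, x, sympy.oo))  # 0
```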
Of course, we can apply this hammer to theory and not just practice, obtaining alternative (and more direct) proofs of the various “combination” theorems we gave earlier.