Give a possible sample space $\Omega$ for random variables representing each of the following:
Let $X$ denote a random variable describing the outcome of the presidential election. Depending on how the “result” of the election is defined, there are several options for the sample space $\Omega_X$. If we are concerned with the person who is nominated president, then the sample space would be $\Omega_x = \{\text{Trump}, \text{Biden} \}$ if third-party candidates were not considered. If the result was whether a republican or democrat was nominated, then $\Omega_X = \{ \text{Republican}, \text{Democrat} \}$. Given where we currently are in the election cycle, these two sample spaces are almost synonymous. However, if this question was asked before the primary elections, there would be a more significant difference.
Let $Y$ denote a random variable describing the day of the week. Thus, the samples space of $Y$ is $\Omega_Y=\{Sun.,Mon.,Tue., Wed., Thu., Fri., Sat.\}$. Now consider a random variable $X$ denoting the year. No restrictions on the year were given, so the year could be any integer (including negative integers for year B.C.E.). The sample space of $X$ is therefore $\Omega_X = \mathbb{Z}$ and the sample space of the year and day of someones birthday is given by the cartesian product of $\Omega_Y$ and $\Omega_X$, i.e., $\Omega = \Omega_Y \times \Omega_X$. If we also assumed that the person was born in either 1980 or 1981, then $\Omega_X = \{1980, 1981\}$ and the Cartesian product of $\Omega_X$ an $\Omega_Y$ would be
\[\Omega=\{ (1981,Sun.), (1981, Mon.), \ldots, (1981, Sat.), (1982, Sun.), \ldots (1982, Sat.)\}\]This sample space can again be answered using a Cartesian product of one dimensional sample spaces. Let $I$ denote a random variable defining the days since Jan. 1, 2000 (an arbitrary point in time). If $I$ is negative, then it represents a date before Jan. 1., 2000. If $I$ is positive, then it represents a date after Jan 1., 2000. Without any additional stipulations on the year or month of the hurricane landfall, $I$ can thus be any integer and $\Omega_I=\mathbb{Z}$. Let $S$ denote the state of landfall. We are implicitly assuming that the hurricane hits land in a US State bordering the Atlantic ocean, so $S$ could be any US State touching the Atlantic Ocean or Gulf of Mexico. Thus,
\[\Omega_S = \{\text{TX}, \text{LA}, \text{MS}, \text{AL}, \text{FL}, \text{GA}, \text{SC}, \text{NC}, \text{VA}, \text{MD}, \text{DE}, \text{NJ}, \text{NY}, \text{CT}, \text{RI}, \text{MA}, \text{NH}, \text{ME}\}.\]The Cartesian product of $\Omega_I$ and $\Omega_S$ describes all possible dates and states of landfall. Of course, the probabilities for each of these options will be different.
Consider two random variables $X$ and $Y$ with samples spaces $\Omega_x = \{x_1,x_2\}$ and $\Omega_y=\{y_1,y_2\}$, respectively. The joint sample space is then $\Omega_{xy} = \Omega_x \times \Omega_y = \{ (x_1,y_1), (x_1,y_2), (x_2,y_1), (x_2,y_2)\}$. The joint probability mass function for $X$ and $Y$ is provided in the following table.
$(x,y)$ | $m(x,y)$ |
---|---|
$(x_1,y_1)$ | 0.12 |
$(x_1,y_2)$ | 0.08 |
$(x_2,y_1)$ | 0.24 |
$(x_2,y_2)$ | 0.56 |
Use the joint probabilities above to compute the marginal probability mass functions $m(x)$ and $m(y)$ and fill in the tables below. Recall that for a particular value $x_i$, the mass function $m(x_i)=P[X=x_i]$.
$x$ | $m(x)$ |
---|---|
$x_1$ | ? |
$x_2$ | ? |
$y$ | $m(y)$ |
---|---|
$y_1$ | ? |
$y_2$ | ? |
Compute the conditional probability $m(y|x)$ for each combination of $x$ and $y$ so that you can fill in the following table. Recall that $m(y_i | x_j) = P[Y=y_i | X=x_j]$.
$x$ | $y$ | $m(y|x)$ |
---|---|---|
$x_1$ | $y_1$ | ? |
$x_1$ | $y_2$ | ? |
$x_2$ | $y_1$ | ? |
$x_2$ | $y_2$ | ? |
c. Are the random variables $X$ and $Y$ independent? Why or why not?
The marginal probability $m(x_i)$ is given by
\(m(x_i) = m(x_i, y_1) + m(x_i, y_2)\)
Thus, the values of the marginal probabilities are
$x$ | $m(x)$ |
---|---|
$x_1$ | 0.2 |
$x_2$ | 0.8 |
As expected, these add to one. A similar procedure can be used for $m(y)$ to obtain
$y$ | $m(y)$ |
---|---|
$y_1$ | 0.36 |
$y_2$ | 0.64 |
Again, these add to one as we would expect for a valid probability mass function.
From the product rule of probability, we know that $m(x,y) = m(y|x)m(x)$. Thus $m(y|x) = m(x,y) / m(x)$. Using the results from part a. and the joint probabilities given in the problem description, we have
$x$ | $y$ | $m(y | x)$ |
---|---|---|---|
$x_1$ | $y_1$ | 0.6 | |
$x_1$ | $y_2$ | 0.4 | |
$x_2$ | $y_1$ | 0.3 | |
$x_2$ | $y_2$ | 0.7 |
If $X$ and $Y$ were independent, then by the definition of independence we would have the conditional mass function $m(y|x)$ would be equal to the marginal mass functions $m(x)$. However, comparing the table above, we see that $m(y_1) = 0.36$ but $m(y_1 | x_1) = 0.6$, which are not equal and imply that $X$ and $Y$ are not independent. Independence could also be checked by comparing $m(x,y)$ to $m(x)m(y)$. Again, you’d find that the values are different and thus $X$ and $Y$ are not independent.
This is a classic probability brain teaser.
Imagine a game show where you are given a choice of three doors. Behind one door is a brand new car but the other two doors conceal goats. Yes, goats. You pick door number 1, but then the host opens a different door, say door number 3, which has a goat. After seeing the goat, the host asks you if you’d like to switch to door number 2 or if you’d like to stay with door number 1. 1) Assume that initially every door is equally likely to conceal the car. Use Bayes’ rule to compute the posterior probabilities that the car is behind each door after you observe the goat in door 3. 2) Should you switch your choice to door number 2? (Assume for this exercise that you’d prefer a car to a goat.) 3) Now imagine that you have some additional information. After the goat is revealed in door 3, you are 50\% sure that you also heard a goat behind door 2 (they can be loud after all). Compute the posterior probabilities again. What should you do now?
Let $C$ be a random variable denoting the door concealing the car. The sample space of $C$ is thus $\Omega_C=\{1,2,3\}$. Let $D_1$ be the event that you choose door $1$ and let $H_3$ be the event that the host showed you a goat behind door $3$. We are interested in the three probabilities $\mathbb{P}[C=i | D_1,H_3]$ for $i\in\{1,2,3\}$. We will assume all of the doors are equally likely to hide the car apriori, thus the prior probabilities in Bayes’ rule are $\mathbb{P}[C=i] = \frac{1}{3}$. To apply Bayes’ rule, we also need the conditional probabilities $\mathbb{P}[D_1,H_3 | C=i]$. We assume that the host will never reveal the car at this stage of the gameshow, but will otherwise randomly choose one of the doors with equal probability. Under these assumptions, the likelihood probabilities are
\[\begin{eqnarray} \mathbb{P}[D_1, H_3 | C=1] &=& \frac{1}{2} \\ \mathbb{P}[D_1, H_3 | C=2] &=& 1\\ \mathbb{P}[D_1, H_3 | C=3] &=& 0 \end{eqnarray}\]In this setting, Bayes’ rule is given by
\[\mathbb{P}[C=i | D_1, H_3] = \frac{\mathbb{P}[D_1,H_3 | C=i]\,\mathbb{P[C=i]}}{\sum_{j=1}^3\mathbb{P}[D_1,H_3 | C=j]\,\mathbb{P[C=j]}}\]and therefore
\[\begin{eqnarray} \mathbb{P}[C=1 | D_1, H_3] & = & \frac{\frac{1}{2}\times\frac{1}{3}}{\frac{1}{6} + \frac{1}{3} + 0} = \frac{1}{3}\\ \mathbb{P}[C=2 | D_1, H_3] & = & \frac{1\times\frac{1}{3}}{\frac{1}{6} + \frac{1}{3} + 0}= \frac{2}{3}\\ \mathbb{P}[C=3 | D_1, H_3] & = & \frac{0\times\frac{1}{3}}{\frac{1}{6} + \frac{1}{3} + 0} = 0 \end{eqnarray}\]If you would prefer a car to a goat, then you should switch your choice to door two to maximize your probability of choosing the door.
Let $G$ be a random variable with a sample space
\[\Omega_G = \left\{\text{Heard a goat behind door 2}, \text{Did not hear anything.}\right\}\]For notational convenience, we will abbreviate this sample space as $\Omega_G = \{\text{Heard}, \text{Nothing}\}$ We can interpret the phrase “you are 50\% sure that you also heard a goat behind door 2” to imply that $\mathbb{P}[G=\text{Heard}] = \mathbb{P}[G=\text{Nothing}] = \frac{1}{2}$. Now we can compute the conditional probabilities $\mathbb{P}[C=i | D_1, H_3, G=\text{Heard}]$ and $\mathbb{P}[C=i | D_1, H_3, G=\text{Nothing}]$. Assuming there is no chance a goat can drive a car, if you hear a goat behind door two, then there can’t be a car there. Thus,
\[\begin{eqnarray} \mathbb{P}[C=1 | D_1, H_3, G=\text{Heard}] &=& 1\\ \mathbb{P}[C=2 | D_1, H_3, G=\text{Heard}] &=& 0\\ \mathbb{P}[C=3 | D_1, H_3, G=\text{Heard}] &=& 0 \end{eqnarray}\]If I didn’t hear a goat, then I’ve gained not new information and the probabilities stay the same
\[\begin{eqnarray} \mathbb{P}[C=1 | D_1, H_3, G=\text{Nothing}] &=& \frac{1}{3}\\ \mathbb{P}[C=2 | D_1, H_3, G=\text{Nothing}] &=& \frac{2}{3}\\ \mathbb{P}[C=3 | D_1, H_3, G=\text{Nothing}] &=& 0 \end{eqnarray}\]To combine these probabilities, we can marginalize over $G$ to obtain
\[\begin{eqnarray} \mathbb{P}[C=i | D_1, H_3, \text{50% Sure}] &=& \mathbb{P}[C=i | D_1, H_3, G=\text{Heard}] \mathbb{P}[G=\text{Heard}] \\&+& \mathbb{P}[C=i | D_1, H_3, G=\text{Nothing}] \mathbb{P}[G=\text{Nothing}] \end{eqnarray}\]and thus
\[\begin{eqnarray} \mathbb{P}[C=1 | D_1, H_3, \text{50% Sure}] & = & 1\times\frac{1}{2} + \frac{1}{3}\times\frac{1}{2} = \frac{2}{3}\\ \mathbb{P}[C=2 | D_1, H_3, \text{50% Sure}] & = & 0\times\frac{1}{2} + \frac{2}{3}\times\frac{1}{2} = \frac{1}{3}\\ \mathbb{P}[C=3 | D_1, H_3, \text{50% Sure}] & = & 0 \end{eqnarray}\]Probability distributions that place high probability on a single outcome express less uncertainty than distributions that spread probability more equally between outcomes in the sample space. One way to measure these levels of “uncertainty” is to compute the information entropy (sometimes called Shannon entropy) of its probability mass function. The information entropy of a discrete random variable with sample space $\Omega=\{x_1, x_2, \ldots, x_N\}$ and probability mass function $m(x)$ is given by \(H(X) = - \sum_{i=1}^N m(x_i) \log \left[ m(x_i)\right].\)
In this exercise, we will consider the information entropy of a simple random variable with sample space $\Omega=\{0,1\}$ and probability mass function $m(1) = p, m(0)=1-p$. This is called a Bernoulli random variable.
1) Compute the information entropy of a Bernoulli random variable with $p=\frac{1}{2}$. Let’s denote this by $H_{1/2}$.
2) Compute the information entropy for values $p=0$ and $p=1$.
2) Now imagine you have an arbitrary value of $p$, say $\tilde{p}$. How does the information entropy of a Bernoulli random variable with $\tilde{p}$ compare to $H_{1/2}$, is $H_\tilde{p}$ always greater than $H_{1/2}$?
In general, the information entropy of a Bernoulli random variable is given by \(H_p = -p\log p - (1-p)\log (1-p)\) The information entropy is defined for any base of the logarithm. Different logarithm bases simply measure the entropy in different units. For example, a base of $2$ measures the amount of information needed to describe $X$ in “bits.” A base of $e$ results in units of “nats” and a base of $10$ results in units known as “bans” because of the relationship with the Banburismus technique developed by Alan Turing in WWII.
\(\begin{eqnarray} H_{1/2} &=& -0.5\log 0.5 - 0.5\log 0.5 \\ & = & -\log 0.5 \end{eqnarray}\)
Logarithm Base | H(X) | Units |
---|---|---|
$2$ | 1.0 | bits |
$e$ | 0.69 | nats |
$10$ | 0.3 | bans |
For $p=0$ \(\begin{eqnarray} H_0 &=& -\lim_{p\rightarrow 0}\, p\log p \\ & = & 0 \end{eqnarray}\) Similarly, for $p=1$ \(\begin{eqnarray} H_1 &=& -\lim_{p\rightarrow 1}\, (1-p)\log (1-p) \\ & = & 0 \end{eqnarray}\)
Recall that local maximima and minimima of a smooth function have a zero derivative. Now consider the derivative of the entropy with respect to $p$, i.e., \(\begin{eqnarray} \frac{\partial H}{\partial p} &=& -\log p - 1 + \log(1-p) + 1\\ &=& \log(1-p)-\log(p) \end{eqnarray}\) Because the log is a monotone function, there is only one point where $\frac{\partial H}{\partial p}=\log(1-p)-\log(p)=0$, and that occurs when $p=\frac{1}{2}$. Thus, for $p\in(0,1)$, there is only one possible maxima or minima. We know that $H_{p=1/2} = 1$ (with base log2), while $H_{p=0}=H_{p=1}=0$. The extreme point at $p=1/2$ must therefore be a maximum. Thus, $H_{\tilde{p}}\leq H_{1/2}$ for any $\tilde{p}\in[0,1]$.
Given that the information energy is a measure of uncertainty, finding the probability mass function (or density in the continuous setting) that maximizes the information entropy is a common way to choose prior probability distributions that are “maximally uncertain” or “uninformative.” We will talk more about this later in the course.