## (4.1)
{:.question}
Verify that the entropy function satisfies the required properties of continuity, non-negativity, monotonicity, and independence.
I will prove these properties for the discrete case.
First, I want to show that $$\lim_{x \to 0^+} x \log_b x = 0$$ for any $$b > 1$$, since otherwise entropy is not defined for distributions that assign zero probability to any outcome. This can be done using [L'Hôpital's Rule](https://en.wikipedia.org/wiki/L%27H%C3%B4pital%27s_rule).
$$
\begin{align*}
\lim_{x \to 0^+} x \log_b x &= \lim_{x \to 0^+} \frac{\log_b x}{x^{-1}} \\
&= \lim_{x \to 0^+} \frac{\frac{d}{d x} \log_b x}{\frac{d}{d x} x^{-1}} \\
&= \lim_{x \to 0^+} \left( \frac{1}{x \ln b} \right) \left( \frac{-1}{x^{-2}} \right) \\
&= \lim_{x \to 0^+} \frac{-x}{\ln b} \\
&= 0
\end{align*}
$$
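
As a quick numerical sanity check (a sketch, not part of the proof), we can watch $$x \log_2 x$$ shrink toward zero as $$x$$ approaches zero from the right; base 2 is an arbitrary choice of $$b$$ here.

```python
import math

# Evaluate x * log_2(x) for x approaching 0 from the right.
# The products should shrink toward 0, consistent with the limit above.
for k in range(1, 8):
    x = 10.0 ** (-k)
    print(f"x = {x:.0e}  x*log2(x) = {x * math.log2(x):.6e}")
```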
### Continuity
To talk about the continuity of entropy, we need to define a topology for probability distributions.
Let $$\Omega$$ be a set of finite cardinality $$n$$. The space of probability distributions over $$\Omega$$ can be viewed as the set of vectors in $$\mathbb{R}_{\geq 0}^n$$ with $$L^1$$ norm 1. In this way entropy is a function that maps a subset of $$\mathbb{R}_{\geq 0}^n$$ to $$\mathbb{R}$$. So I will prove the continuity of entropy with respect to the topologies of $$\mathbb{R}_{\geq 0}^n$$ and $$\mathbb{R}$$.
First let's show that $$x \log x$$ is continuous. I take as given that $$\log(x)$$ is a continuous function on its domain. Then $$x \log(x)$$ is also continuous, since finite products of continuous functions are continuous. This suffices for $$x > 0$$. At zero, $$x \log x$$ is continuous because we have defined it to be equal to the limit we found above.

Thus each term of the entropy function is a continuous function from $$\mathbb{R}$$ to $$\mathbb{R}$$. This suffices to show that negative entropy is continuous, based on the lemma below. Thus entropy is continuous, since negation is a continuous function, and finite compositions of continuous functions are continuous.
The necessary lemma is easy to prove, if symbol heavy. I will use the [$$L^1$$ norm](https://en.wikipedia.org/wiki/Norm_(mathematics)#p-norm), but this is without loss of generality because all norms on a finite dimensional vector space induce the same topology. Let $$f : \mathbb{R}^n \to \mathbb{R}$$ and $$g : \mathbb{R} \to \mathbb{R}$$ be continuous functions, and define $$h : \mathbb{R}^{n + 1} \to \mathbb{R}$$ as

$$h(x_1, \ldots, x_{n + 1}) = f(x_1, \ldots, x_n) + g(x_{n + 1})$$
Fix any $$x = (x_1, \ldots, x_{n + 1}) \in \mathbb{R}^{n + 1}$$, and any $$\epsilon > 0$$. Since $$f$$ is continuous, there exists some positive $$\delta_f$$ such that for any $$y \in \mathbb{R}^n$$, $$\lVert (x_1, \ldots, x_n) - (y_1, \ldots, y_n) \rVert < \delta_f$$ implies $$\lVert f(x_1, \ldots, x_n) - f(y_1, \ldots, y_n) \rVert < \epsilon / 2$$. For the same reason there is a similar $$\delta_g$$ for $$g$$. Let $$\delta$$ be the smaller of $$\delta_f$$ and $$\delta_g$$. Now fix any $$y \in \mathbb{R}^{n + 1}$$ such that $$\lVert x - y \rVert < \delta$$.
Note that
$$
\begin{align*}
\lVert (x_1, \ldots, x_n) - (y_1, \ldots, y_n) \rVert &= \sum_{i = 1}^n \lVert x_i - y_i \rVert \\
&\leq \sum_{i = 1}^{n + 1} \lVert x_i - y_i \rVert \\
&= \lVert x - y \rVert \\
&< \delta \\
&\leq \delta_f
\end{align*}
$$
and similarly for the projections of $$x$$ and $$y$$ along the $$n + 1$$st dimension. Thus
$$
\begin{align*}
\lVert h(x) - h(y) \rVert
&= \lVert f(x_1, \ldots, x_n) + g(x_{n + 1}) - f(y_1, \ldots, y_n) - g(y_{n + 1}) \rVert \\
&\leq \lVert f(x_1, \ldots, x_n) - f(y_1, \ldots, y_n) \rVert + \lVert g(x_{n + 1}) - g(y_{n + 1}) \rVert \\
&< \frac{\epsilon}{2} + \frac{\epsilon}{2} \\
&= \epsilon
\end{align*}
$$
It follows that $$h$$ is continuous.
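
To illustrate (not prove) this numerically, here is a small sketch that perturbs a probability distribution by shrinking amounts and watches the change in entropy shrink too; the distribution, the perturbation, and the helper `entropy` function are my own choices for illustration.

```python
import math

def entropy(p):
    """Shannon entropy in bits, using the convention 0*log(0) = 0."""
    return sum(-x * math.log2(x) for x in p if x > 0)

p = [0.5, 0.25, 0.25]  # arbitrary example distribution
for k in range(1, 7):
    eps = 10.0 ** (-k)
    # Shift a little probability mass between two outcomes; q still sums to 1.
    q = [p[0] - eps, p[1] + eps, p[2]]
    print(f"eps = {eps:.0e}  |H(p) - H(q)| = {abs(entropy(p) - entropy(q)):.3e}")
```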
### Non-negativity
The probability of each individual outcome must be between zero and one. Thus $$-p_i \log p_i \geq 0$$ for all $$i$$. Since $$x \log x$$ is only equal to zero when $$x$$ is zero or one, the entropy can only be zero when a single outcome has probability one.
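
Here is a small numerical illustration (a sketch with example distributions I chose, not part of the argument): entropy is non-negative for every example, and only the degenerate distribution gives zero.

```python
import math

def entropy(p):
    """Shannon entropy in bits, using the convention 0*log(0) = 0."""
    return sum(-x * math.log2(x) for x in p if x > 0)

# Example distributions chosen for illustration.
examples = [
    [1.0, 0.0, 0.0],     # degenerate: a single outcome has probability one
    [0.5, 0.5, 0.0],
    [0.9, 0.05, 0.05],
    [1/3, 1/3, 1/3],
]
for p in examples:
    print(p, "->", entropy(p))
```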
### Monotonicity
Note that $$\partial/\partial p_i H(p) = -\log(p_i) - 1$$ for any $$i$$. This is a strictly decreasing function of $$p_i$$, so entropy is strictly concave on all of $$\mathbb{R}_{\geq 0}^n$$. The constraint that $$\sum p_i$$ is one is linear, so entropy is concave on this subset of $$\mathbb{R}_{\geq 0}^n$$ as well. Thus there is a unique global maximum. We can locate it using a [Lagrange multiplier](https://en.wikipedia.org/wiki/Lagrange_multiplier).
Our Lagrange function is
$$
-\sum_{i = 1}^n p_i \log p_i + \lambda \left( \sum_{i = 1}^n p_i - 1 \right)
$$
The partial derivative with respect to any $$p_i$$ is $$-\log p_i - 1 + \lambda$$. Setting this to zero gives $$p_i = e^{\lambda - 1}$$, which depends only on $$\lambda$$, so all the $$p_i$$ must be the same. Taking our constraint into account, this means there's only one possibility: $$p_i = 1/n$$ for all $$i$$.
Call this distribution $$p_*$$. Its entropy is $$-\sum_{i = 1}^n 1/n \log 1/n = -\log 1/n$$. Thus

$$
H(p) \leq H(p_*) = -\log \frac{1}{n}
$$

for all probability distributions $$p$$ over $$n$$ outcomes. Equality is only achieved for $$p_*$$ itself, by the strict concavity of entropy. Note that $$H(p_*)$$ grows without bound as $$n$$ increases.
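
As a numerical illustration of this maximum (a sketch only; the value of $$n$$ and the random distributions are arbitrary choices), entropy measured in bits never exceeds $$\log_2 n$$, and the uniform distribution attains it.

```python
import math
import random

def entropy(p):
    """Shannon entropy in bits, using the convention 0*log(0) = 0."""
    return sum(-x * math.log2(x) for x in p if x > 0)

n = 5
uniform = [1.0 / n] * n
print("H(uniform) =", entropy(uniform), " log2(n) =", math.log2(n))

# Random distributions over n outcomes never exceed log2(n) bits.
random.seed(0)
for _ in range(5):
    weights = [random.random() for _ in range(n)]
    total = sum(weights)
    p = [w / total for w in weights]
    print("H(random)  =", entropy(p))
```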
### Independence
For independent distributions $$p$$ over $$n$$ outcomes and $$q$$ over $$m$$ outcomes, the joint probability of outcome $$(i, j)$$ is $$p_i q_j$$, so

$$
\begin{align*}
H(p, q) &= -\sum_{i = 1}^n \sum_{j = 1}^m p_i q_j \log(p_i q_j) \\
&= -\sum_{i = 1}^n \sum_{j = 1}^m p_i q_j \left( \log p_i + \log q_j \right) \\
&= -\sum_{i = 1}^n \sum_{j = 1}^m p_i q_j \log p_i - \sum_{i = 1}^n \sum_{j = 1}^m p_i q_j \log q_j \\
&= -\sum_{i = 1}^n p_i \log p_i \sum_{j = 1}^m q_j - \sum_{j = 1}^m q_j \log q_j \sum_{i = 1}^n p_i \\
&= -\sum_{i = 1}^n p_i \log p_i - \sum_{j = 1}^m q_j \log q_j \\
&= H(p) + H(q)
\end{align*}
$$
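
A quick numerical check of this identity (a sketch with distributions chosen arbitrarily): the entropy of the independent joint distribution matches the sum of the marginal entropies up to floating point error.

```python
import math

def entropy(p):
    """Shannon entropy in bits, using the convention 0*log(0) = 0."""
    return sum(-x * math.log2(x) for x in p if x > 0)

p = [0.2, 0.3, 0.5]                        # distribution over n = 3 outcomes
q = [0.6, 0.4]                             # distribution over m = 2 outcomes
joint = [pi * qj for pi in p for qj in q]  # independent joint distribution

print("H(p) + H(q) =", entropy(p) + entropy(q))
print("H(p, q)     =", entropy(joint))
```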
## (4.2)
{:.question}
Prove the relationships in Equation (4.10).
$$
\begin{align*}
\end{align*}
$$
## (4.3)