title: Problem Set 3
(4.1)
{:.question} Verify that the entropy function satisfies the required properties of continuity, non-negativity, monotonicity, and independence.
I will prove these properties for the discrete case.
First, I want to show that \lim_{x \to 0^+} x \log_b x = 0 for any b > 1, since otherwise entropy would not be defined for distributions that assign zero probability to some outcome. This can be done using L'Hôpital's Rule.
\begin{align*}
\lim_{x \to 0^+} x \log_b x &= \lim_{x \to 0^+} \frac{\log_b x}{x^{-1}} \\
&= \lim_{x \to 0^+} \frac{\frac{d}{d x} \log_b x}{\frac{d}{d x} x^{-1}} \\
&= \lim_{x \to 0^+} \frac{1 / (x \ln b)}{-x^{-2}} \\
&= \lim_{x \to 0^+} \frac{-x}{\ln b} \\
&= 0
\end{align*}
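As a quick numerical sanity check of this limit (a sketch, not part of the proof), the snippet below evaluates x \log_2 x for shrinking x; base 2 is an arbitrary choice of b > 1.

```python
import numpy as np

# x * log_b(x) should shrink toward 0 as x -> 0+; base 2 is arbitrary.
for x in [1e-2, 1e-4, 1e-8, 1e-16]:
    print(f"x = {x:.0e}:  x*log2(x) = {x * np.log2(x):.3e}")
```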
Continuity
To talk about the continuity of entropy, we need to define a topology for probability distributions. Let \Omega be a set of finite cardinality n. The space of probability distributions over \Omega can be viewed as the set of vectors in \mathbb{R}_{\geq 0}^n with L^1 norm 1. In this way entropy is a function that maps a subset of \mathbb{R}_{\geq 0}^n to \mathbb{R}. So I will prove the continuity of entropy with respect to the topologies of \mathbb{R}_{\geq 0}^n and \mathbb{R}.
First let's show that x \log x is continuous. I take as given that \log(x) is a continuous function on its domain (after all it's the inverse of e^x, which is strictly monotonic and C^\infty). Then x \log(x) is also continuous, since finite products of continuous functions are continuous. This suffices for x > 0. At zero, x \log x is continuous because we have defined it to be equal to the limit we found above.
Thus each term of the entropy function is a continuous function from \mathbb{R}_{\geq 0} to \mathbb{R}. But we can also view each term as a function from \mathbb{R}_{\geq 0}^n to \mathbb{R}. Each one ignores most of its inputs, but this doesn't change its continuity. (The epsilon-delta proof follows easily from the triangle inequality, since the only part of the distance between inputs that matters is that along the active coordinate.) So entropy is a sum of continuous functions, and is thus continuous.
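To illustrate this continuity numerically (a rough sketch; the entropy helper and perturbation scheme are my own choices, using the natural log), the snippet below nudges a distribution by progressively smaller amounts in the L^1 norm and watches the change in entropy shrink accordingly.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats, with the convention 0*log(0) = 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])
for eps in [1e-1, 1e-3, 1e-6]:
    # Build a perturbation with zero sum (so p + d stays normalized)
    # and L1 norm eps; for these eps values p + d also stays positive.
    d = rng.normal(size=p.size)
    d -= d.mean()
    d *= eps / np.abs(d).sum()
    print(f"|p - q|_1 = {eps:.0e}:  |H(p) - H(q)| = {abs(entropy(p) - entropy(p + d)):.3e}")
```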
Non-negativity
The probability of each individual outcome lies between zero and one, so \log p_i \leq 0 and hence -p_i \log p_i \geq 0 for all i. Entropy is therefore a sum of non-negative terms, and is non-negative. Since x \log x is zero only when x is zero or one, and the probabilities must sum to one, the entropy can only be zero when a single outcome has probability one.
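A small numerical check of both claims (again a sketch with my own entropy helper, not part of the argument): randomly sampled distributions all have non-negative entropy, and only a point mass reaches zero.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats, with the convention 0*log(0) = 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

rng = np.random.default_rng(0)
# Randomly sampled distributions all have non-negative entropy ...
for _ in range(1000):
    assert entropy(rng.dirichlet(np.ones(4))) >= 0.0
# ... and only a point mass reaches zero.
assert entropy([1.0, 0.0, 0.0, 0.0]) == 0.0
assert entropy([0.5, 0.5, 0.0, 0.0]) > 0.0   # equals log(2)
print("non-negativity checks passed")
```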
Monotonicity
Note that \partial H / \partial p_i = -\log(p_i) - 1 for any i (using the natural log). This is strictly decreasing in p_i, so the Hessian of H is diagonal with negative entries -1/p_i, and entropy is strictly concave on the interior of \mathbb{R}_{\geq 0}^n. The constraint that \sum p_i equals one is linear, so entropy remains strictly concave when restricted to this subset. Thus there is a unique global maximum.
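The concavity claim can also be spot-checked numerically (a sketch with my own entropy helper; the sample sizes are arbitrary): for random pairs of distributions, the entropy of a convex combination is never below the corresponding combination of entropies.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats, with the convention 0*log(0) = 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

rng = np.random.default_rng(1)
# Concavity: H(t*p + (1-t)*q) >= t*H(p) + (1-t)*H(q) for any two
# distributions p, q over the same outcomes and any t in [0, 1].
for _ in range(1000):
    p, q = rng.dirichlet(np.ones(5)), rng.dirichlet(np.ones(5))
    t = rng.uniform()
    assert entropy(t * p + (1 - t) * q) >= t * entropy(p) + (1 - t) * entropy(q) - 1e-12
print("concavity inequality holds on all sampled pairs")
```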
We can locate it using a Lagrange multiplier. Our Lagrange function is
-\sum_{i = 1}^n p_i \log p_i + \lambda \left( \sum_{i = 1}^n p_i - 1 \right)
The partial derivative with respect to any p_i is -\log p_i - 1 + \lambda. Setting it to zero gives p_i = e^{\lambda - 1}, which depends only on \lambda and not on i, so all the p_i must be equal. Taking our constraint into account, this leaves only one possibility: p_i = 1/n for all i. This is the maximum entropy distribution that we seek.
Call this distribution p_*. Its entropy is -\sum_{i = 1}^n \frac{1}{n} \log \frac{1}{n} = -\log \frac{1}{n} = \log n. Thus
H(p) \leq H(p_*) = \log n
for all probability distributions p over n outcomes. Equality is achieved only by p_* itself, since the maximum is unique. Finally, \log n is strictly increasing in n, so the maximum entropy grows without bound as the number of outcomes increases; this is the monotonicity property.
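As a final numerical illustration (a sketch; the sample counts and entropy helper are my own choices), the snippet below samples many random distributions for several values of n, confirms none exceeds \log n, and shows the bound itself increasing with n.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats, with the convention 0*log(0) = 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

rng = np.random.default_rng(2)
for n in [2, 4, 8, 16, 32]:
    # No sampled distribution beats the uniform one, whose entropy is log(n),
    # and the bound log(n) itself increases with n.
    h_max = max(entropy(p) for p in rng.dirichlet(np.ones(n), size=1000))
    print(f"n = {n:2d}:  max sampled H = {h_max:.4f}  <=  log(n) = {np.log(n):.4f}")
```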