---
title: Problem Set 3
---
(4.1)
{:.question} Verify that the entropy function satisfies the required properties of continuity, non-negativity, monotonicity, and independence.
I will prove these properties for the discrete case.
First, I want to show that \lim_{x \to 0^+} x \log_b x = 0 for any b > 1; this limit is what justifies the convention 0 \log 0 = 0, so that entropy is well defined for distributions that assign zero probability to some outcome. This can be done using L'Hôpital's Rule.
\begin{align*}
\lim_{x \to 0^+} x \log_b x &= \lim_{x \to 0^+} \frac{\log_b x}{x^{-1}} \\
&= \lim_{x \to 0^+} \frac{\frac{d}{d x} \log_b x}{\frac{d}{d x} x^{-1}} \\
&= \lim_{x \to 0^+} \left( \frac{1}{x \ln b} \right) \left( \frac{-1}{x^{-2}} \right) \\
&= \lim_{x \to 0^+} \frac{-x}{\ln b} \\
&= 0
\end{align*}
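As a quick sanity check of this limit, here is a short Python sketch (the base b = 2 and the sample values of x are arbitrary choices) that evaluates x \log_2 x as x shrinks toward zero:

```python
import numpy as np

# x * log2(x) should approach 0 as x -> 0+ (here b = 2, chosen arbitrarily).
for x in [1e-1, 1e-3, 1e-6, 1e-9, 1e-12]:
    print(f"x = {x:.0e}:  x*log2(x) = {x * np.log2(x):.3e}")
```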
Continuity
To talk about the continuity of entropy, we need to define a topology for probability distributions. Let \Omega be a set of finite cardinality n. The space of probability distributions over \Omega can be viewed as the set of vectors in \mathbb{R}_{\geq 0}^n with L^1 norm 1. In this way entropy is a function that maps a subset of \mathbb{R}_{\geq 0}^n to \mathbb{R}. So I will prove the continuity of entropy with respect to the topologies of \mathbb{R}_{\geq 0}^n and \mathbb{R}.
First let's show that x \log x is continuous. I take as given that \log x is a continuous function on its domain (after all, it is the inverse of e^x, which is strictly monotonic and C^\infty). Then x \log x is also continuous, since finite products of continuous functions are continuous. This suffices for x > 0. At zero, x \log x is continuous because we have defined it to equal the limit we found above.
Thus each term of the entropy function is a continuous function from \mathbb{R}_{\geq 0} to \mathbb{R}. But we can also view each term as a function from \mathbb{R}_{\geq 0}^n to \mathbb{R}. Each one ignores most of its inputs, but this doesn't change its continuity. (The epsilon-delta proof follows easily from the triangle inequality, since the only part of the distance between inputs that matters is that along the active coordinate.) So entropy is a sum of continuous functions, and is thus continuous.
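To illustrate this numerically, here is a small Python sketch (the base distribution and the perturbation direction are arbitrary choices) showing that small perturbations of a distribution produce correspondingly small changes in its entropy:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits, with the convention 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    nonzero = p > 0
    return -np.sum(p[nonzero] * np.log2(p[nonzero]))

p = np.array([0.5, 0.3, 0.2])
direction = np.array([1.0, -0.5, -0.5])   # sums to zero, so p + eps*direction stays normalized
for eps in [1e-2, 1e-4, 1e-6]:
    q = p + eps * direction
    print(f"eps = {eps:.0e}:  |H(q) - H(p)| = {abs(entropy(q) - entropy(p)):.3e}")
```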
Non-negativity
The probability of each individual outcome must be between zero and one, so \log p_i \leq 0 and thus -p_i \log p_i \geq 0 for all i. Since x \log x is zero only when x is zero or one, the entropy can only be zero when a single outcome has probability one.
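A quick numerical illustration (a Python sketch; the distributions below are arbitrary examples): entropy is strictly positive for any spread-out distribution and zero exactly for a deterministic one.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits, with the convention 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    nonzero = p > 0
    return -np.sum(p[nonzero] * np.log2(p[nonzero]))

rng = np.random.default_rng(0)
# Randomly drawn distributions have strictly positive entropy...
print([round(entropy(p), 3) for p in rng.dirichlet(np.ones(4), size=5)])
# ...while a deterministic distribution has zero entropy.
print(entropy([0.0, 1.0, 0.0, 0.0]))   # zero
```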
Monotonicity
Note that \partial H(p) / \partial p_i = -\log p_i - 1 for any i. This is strictly decreasing in p_i (the Hessian of H is diagonal with strictly negative entries), so entropy is strictly concave on the interior of \mathbb{R}_{\geq 0}^n. The constraint that \sum_i p_i = 1 is linear, so entropy is strictly concave on this subset of \mathbb{R}_{\geq 0}^n as well. Since the probability simplex is compact and entropy is continuous on it, a global maximum exists, and by strict concavity it is unique.
We can locate it using a Lagrange multiplier. Our Lagrange function is
-\sum_{i = 1}^n p_i \log p_i + \lambda \left( \sum_{i = 1}^n p_i - 1 \right)
The partial derivative with respect to any p_i is -\log p_i - 1 + \lambda. Setting it to zero gives p_i = e^{\lambda - 1}, which depends only on \lambda, so all the p_i must be equal. Taking our constraint into account, this means there is only one possibility: p_i = 1/n for all i. This is the maximum entropy distribution that we seek.
Call this distribution p_*. Its entropy is -\sum_{i = 1}^n \frac{1}{n} \log \frac{1}{n} = -\log \frac{1}{n} = \log n. Thus
H(p) \leq H(p_*) = - \log \frac{1}{n}
for all probability distributions p over n outcomes. Equality is achieved only by p_* itself, since the maximum is unique. Note that H(p_*) = \log n increases monotonically (and without bound) as n grows, which is the monotonicity property we set out to verify.
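The same conclusion can be checked numerically. The sketch below (n = 8 and the number of random samples are arbitrary choices) draws random distributions and confirms that none exceeds the entropy of the uniform distribution:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits, with the convention 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    nonzero = p > 0
    return -np.sum(p[nonzero] * np.log2(p[nonzero]))

n = 8
rng = np.random.default_rng(1)
# No randomly drawn distribution exceeds the entropy of the uniform distribution...
samples = rng.dirichlet(np.ones(n), size=100_000)
print(max(entropy(p) for p in samples))   # a bit below log2(8) = 3
# ...which achieves the bound exactly.
print(entropy(np.full(n, 1.0 / n)))       # 3.0
```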
Independence
If p and q are independent, their joint probability distribution is the product of the individual distributions. Thus
\begin{align*}
H(p, q) &= -\sum_{i = 1}^n \sum_{j = 1}^m p_i q_j \log(p_i q_j) \\
&= -\sum_{i = 1}^n \sum_{j = 1}^m p_i q_j \left( \log p_i + \log q_j \right) \\
&= -\sum_{i = 1}^n \sum_{j = 1}^m p_i q_j \log p_i - \sum_{i = 1}^n \sum_{j = 1}^m p_i q_j \log q_j \\
&= -\sum_{i = 1}^n p_i \log p_i \sum_{j = 1}^m q_j - \sum_{j = 1}^m q_j \log q_j \sum_{i = 1}^n p_i \\
&= -\sum_{i = 1}^n p_i \log p_i - \sum_{j = 1}^m q_j \log q_j \\
&= H(p) + H(q)
\end{align*}
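Numerically, the joint distribution of two independent variables is the outer product of their marginals, and the joint entropy matches the sum of the individual entropies (a Python sketch; the two distributions are arbitrary examples):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits, with the convention 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    nonzero = p > 0
    return -np.sum(p[nonzero] * np.log2(p[nonzero]))

p = np.array([0.5, 0.25, 0.25])
q = np.array([0.7, 0.2, 0.1])
joint = np.outer(p, q)                   # joint distribution of independent variables
print(entropy(joint.ravel()))            # ~2.657 bits
print(entropy(p) + entropy(q))           # same value
```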
(4.2)
{:.question} Prove the relationships in Equation (4.10).
I take I(x, y) = H(x) + H(y) - H(x, y) as the definition of mutual information. By the definition of conditional entropy,
\begin{align*} H(y | x) &= H(x, y) - H(x) \\ H(x | y) &= H(x, y) - H(y) \end{align*}
Thus
\begin{align*} I(x, y) &= H(y) - H(y | x) \\ &= H(x) - H(x | y) \end{align*}
Finally, using the definition of marginal distributions we can show that
\begin{align*}
I(x, y) &= -\sum_x p(x) \log p(x) - \sum_y p(y) \log p(y) + \sum_{x, y} p(x, y) \log p(x, y) \\
&= -\sum_{x, y} p(x, y) \log p(x) - \sum_{x, y} p(x, y) \log p(y) + \sum_{x, y} p(x, y) \log p(x, y) \\
&= \sum_{x, y} p(x, y) \left( \log p(x, y) - \log p(x) - \log p(y) \right) \\
&= \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}
\end{align*}
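These expressions for the mutual information can be checked against each other on a small joint distribution (a Python sketch; the 2x2 joint distribution below is an arbitrary example with correlated variables):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits, with the convention 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    nonzero = p > 0
    return -np.sum(p[nonzero] * np.log2(p[nonzero]))

# Joint distribution p(x, y): rows index x, columns index y.
pxy = np.array([[0.30, 0.10],
                [0.05, 0.55]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

i_sum   = entropy(px) + entropy(py) - entropy(pxy.ravel())      # H(x) + H(y) - H(x,y)
i_cond  = entropy(py) - (entropy(pxy.ravel()) - entropy(px))    # H(y) - H(y|x)
i_ratio = np.sum(pxy * np.log2(pxy / np.outer(px, py)))         # sum p(x,y) log p(x,y)/(p(x)p(y))
print(i_sum, i_cond, i_ratio)   # all three agree
```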
(4.3)
{:.question} Consider a binary channel that has a small probability \epsilon of making a bit error.
For reasons that will become clear I will call the error probability \epsilon_0.
(a)
{:.question} What is the probability of an error if a bit is sent independently three times and the value determined by majority voting?
Majority voting recovers the bit as long as at most one of the three transmitted copies is flipped. So the probability of an error is the probability that two or three copies are flipped, which can be expressed using the binomial distribution. Let's call it \epsilon_1.
\begin{align*}
\epsilon_1 &= B(2; \epsilon_0, 3) + B(3; \epsilon_0, 3) \\
&= {3 \choose 2} \epsilon_0^2 (1 - \epsilon_0) + {3 \choose 3} \epsilon_0^3 \\
&= 3 \epsilon_0^2 (1 - \epsilon_0) + \epsilon_0^3 \\
&= 3 \epsilon_0^2 - 2 \epsilon_0^3
\end{align*}
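This result is easy to check by simulation. The sketch below (\epsilon_0 = 0.1 and the number of trials are arbitrary choices) flips each of the three transmitted copies independently and counts how often the majority vote is wrong:

```python
import numpy as np

eps0 = 0.1                                      # arbitrary channel error rate for the check
rng = np.random.default_rng(2)
flips = rng.random((1_000_000, 3)) < eps0       # True where a transmitted copy is flipped
print((flips.sum(axis=1) >= 2).mean())          # simulated majority-vote error rate, ~0.028
print(3 * eps0**2 - 2 * eps0**3)                # analytic result: 0.028
```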
(b)
{:.question} How about if that is done three times, and majority voting is done on the majority voting?