---
title: Problem Set 3
---
(4.1)
{:.question} Verify that the entropy function satisfies the required properties of continuity, non-negativity, monotonicity, and independence.
I will prove these properties for the discrete case.
First, I want to show that \lim_{x \to 0^+} x \log_b x = 0 for any b > 1, since otherwise entropy is not defined for distributions that assign zero probability to any outcome. This can be done using L'Hôpital's Rule.
\begin{align*} \lim_{x \to 0^+} x \log_b x &= \lim_{x \to 0^+} \frac{\log_b x}{x^{-1}} \\ &= \lim_{x \to 0^+} \frac{\frac{d}{d x} \log_b x}{\frac{d}{d x} x^{-1}} \\ &= \lim_{x \to 0^+} \left( \frac{1}{x \ln b} \right) \left( \frac{-1}{x^{-2}} \right) \\ &= \lim_{x \to 0^+} \frac{-x}{\ln b} \\ &= 0 \end{align*}
Continuity
To talk about the continuity of entropy, we need to define a topology for probability distributions. Let \Omega be a set of finite cardinality n. The space of probability distributions over \Omega can be viewed as the set of vectors in \mathbb{R}_{\geq 0}^n with L^1 norm 1. In this way entropy is a function that maps a subset of \mathbb{R}_{\geq 0}^n to \mathbb{R}. So I will prove the continuity of entropy with respect to the topologies of \mathbb{R}_{\geq 0}^n and \mathbb{R}.
First let's show that x \log x is continuous. I take as given that \log(x) is a continuous function on its domain (after all it's the inverse of e^x, which is strictly monotonic and C^\infty). Then x \log(x) is also continuous, since finite products of continuous functions are continuous. This suffices for x > 0. At zero, x \log x is continuous because we have defined it to be equal to the limit we found above.
Thus each term of the entropy function is a continuous function from \mathbb{R}_{\geq 0} to \mathbb{R}. But we can also view each term as a function from \mathbb{R}_{\geq 0}^n to \mathbb{R}. Each one ignores most of its inputs, but this doesn't change its continuity. (The epsilon-delta proof follows easily from the triangle inequality, since the only part of the distance between inputs that matters is that along the active coordinate.) So entropy is a sum of continuous functions, and is thus continuous.
Non-negativity
The probability of each individual outcome must be between zero and one, so \log p_i \leq 0 and thus -p_i \log p_i \geq 0 for all i. Since x \log x is zero only when x is zero or one, the entropy can only be zero when a single outcome has probability one.
Monotonicity
Note that \partial H(p) / \partial p_i = -\log(p_i) - 1 for any i. This is a strictly decreasing function of p_i, so entropy is strictly concave on all of \mathbb{R}_{\geq 0}^n. The constraint \sum_i p_i = 1 defines a convex subset of \mathbb{R}_{\geq 0}^n, and the restriction of a strictly concave function to a convex set is strictly concave, so entropy has a unique global maximum there.
We can locate it using a Lagrange multiplier. Our Lagrange function is
-\sum_{i = 1}^n p_i \log p_i + \lambda \left( \sum_{i = 1}^n p_i - 1 \right)
The partial derivative with respect to any p_i is -\log p_i - 1 + \lambda. Setting it to zero gives p_i = e^{\lambda - 1}, which depends only on \lambda, so all the p_i must be equal. Taking our constraint into account this means there's only one possibility: p_i = 1/n for all i. This is the maximum entropy distribution that we seek.
Call this distribution p_*. Its entropy is -\sum_{i = 1}^n \frac{1}{n} \log \frac{1}{n} = -\log \frac{1}{n} = \log n. Thus
H(p) \leq H(p_*) = \log n
for all probability distributions p over n outcomes. Equality is only achieved for p_* itself, since the maximum is unique. Note that H(p_*) = \log n is strictly increasing in n, which is exactly the monotonicity property: the entropy of the uniform distribution grows without bound as the number of outcomes increases.
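As a quick numerical sanity check (not part of the proof), here is a short Python sketch comparing the entropy of a few randomly drawn distributions against the uniform one. The helper function and the use of numpy are my own choices for illustration.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats, using the convention 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p[p > 0] * np.log(p[p > 0]))

n = 4
uniform = np.full(n, 1.0 / n)
print(entropy(uniform), np.log(n))      # both ~1.3863

rng = np.random.default_rng(0)
for _ in range(5):
    p = rng.dirichlet(np.ones(n))       # a random distribution over n outcomes
    assert entropy(p) <= entropy(uniform) + 1e-12
```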
Independence
If p and q are independent, their joint probability distribution is the product of the individual distributions. Thus
\begin{align*} H(p, q) &= -\sum_{i = 1}^n \sum_{j = 1}^m p_i q_j \log(p_i q_j) \\ &= -\sum_{i = 1}^n \sum_{j = 1}^m p_i q_j \left( \log p_i + \log q_j \right) \\ &= -\sum_{i = 1}^n \sum_{j = 1}^m p_i q_j \log p_i - \sum_{i = 1}^n \sum_{j = 1}^m p_i q_j \log q_j \\ &= -\sum_{i = 1}^n p_i \log p_i \sum_{j = 1}^m q_j - \sum_{j = 1}^m q_j \log q_j \sum_{i = 1}^n p_i \\ &= -\sum_{i = 1}^n p_i \log p_i - \sum_{j = 1}^m q_j \log q_j \\ &= H(p) + H(q) \end{align*}
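A quick numerical spot check of this additivity (again just an illustrative sketch, with my own helper names): build the joint distribution of two independent distributions as an outer product and compare entropies.

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p[p > 0] * np.log(p[p > 0]))

rng = np.random.default_rng(1)
p = rng.dirichlet(np.ones(3))
q = rng.dirichlet(np.ones(5))
joint = np.outer(p, q)                   # p_i q_j for independent p and q
print(np.isclose(entropy(joint), entropy(p) + entropy(q)))  # True
```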
(4.2)
{:.question} Prove the relationships in Equation (4.10).
I take I(x, y) = H(x) + H(y) - H(x, y) as the definition of mutual information. By the definition of conditional entropy,
\begin{align*} H(y | x) &= H(x, y) - H(x) \\ H(x | y) &= H(x, y) - H(y) \end{align*}
Thus
\begin{align*} I(x, y) &= H(y) - H(y | x) \\ &= H(x) - H(x | y) \end{align*}
Finally, using the definition of marginal distributions we can show that
\begin{align*} I(x, y) &= -\sum_x p(x) \log p(x) - \sum_y p(y) \log p(y) + \sum_{x, y} p(x, y) \log p(x, y) \\ &= -\sum_{x, y} p(x, y) \log p(x) - \sum_{x, y} p(x, y) \log p(y) + \sum_{x, y} p(x, y) \log p(x, y) \\ &= \sum_{x, y} p(x, y) \left( \log p(x, y) - \log p(x) - \log p(y) \right) \\ &= \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x) p(y)} \end{align*}
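Here is a small numerical check (a sketch with illustrative names) that these expressions for I(x, y) agree on an arbitrary joint distribution.

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p[p > 0] * np.log(p[p > 0]))

rng = np.random.default_rng(2)
pxy = rng.dirichlet(np.ones(12)).reshape(3, 4)       # arbitrary joint p(x, y)
px, py = pxy.sum(axis=1), pxy.sum(axis=0)            # marginals

i1 = entropy(px) + entropy(py) - entropy(pxy)        # definition used above
i2 = entropy(py) - (entropy(pxy) - entropy(px))      # H(y) - H(y | x)
i3 = np.sum(pxy * np.log(pxy / np.outer(px, py)))    # final form
print(np.allclose([i1, i2], i3))                     # True
```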
(4.3)
{:.question} Consider a binary channel that has a small probability \epsilon of making a bit error.
For reasons that will become clear I will call the error probability \epsilon_0.
(a)
{:.question} What is the probability of an error if a bit is sent independently three times and the value determined by majority voting?
Majority voting can recover the message if a single instance of the bit is flipped. So the probability of an error is the probability of having two or three bits flipped. This can be expressed using the binomial distribution. Let's call it \epsilon_1.
\begin{align*} \epsilon_1 &= B(2; \epsilon_0, 3) + B(3; \epsilon_0, 3) \\ &= {3 \choose 2} \epsilon_0^2 (1 - \epsilon_0) + {3 \choose 3} \epsilon_0^3 \\ &= 3 \epsilon_0^2 (1 - \epsilon_0) + \epsilon_0^3 \\ &= 3 \epsilon_0^2 - 2 \epsilon_0^3 \end{align*}
(b)
{:.question} How about if that is done three times, and majority voting is done on the majority voting?
The answer is the same as above, just using \epsilon_1 instead of \epsilon_0.
\begin{align*} \epsilon_2 &= B(2; \epsilon_1, 3) + B(3; \epsilon_1, 3) \\ &= 3 \epsilon_1^2 - 2 \epsilon_1^3 \end{align*}
(c)
{:.question} If majority voting on majority voting on … on majority voting is done N times, how many bits are needed, and what is the probability of an error? How does this probability depend on \epsilon?
Each round triples the total number of bits sent. So n rounds of voting requires 3^n bits.
The probability of an error can be expressed as a recurrence relation. Define
\begin{align*} f_1(x) &= 3 x^2 - 2 x^3 \\ f_{n+1}(x) &= f_1(f_n(x)) \end{align*}
Then with a base error rate (i.e. per individual bit) of \epsilon_0, the probability of an error after n rounds of voting is \epsilon_n = f_n(\epsilon_0). If there is a closed form solution to this recurrence, I don't know how to find it. But it's still possible to say a lot about its behavior.
Convergence to a step function
As n approaches infinity, f_n converges pointwise on [0, 1] to
f_\infty(x) = \begin{cases} 0 & \text{for } 0 \leq x < 1/2 \\ 1/2 & \text{for } x = 1/2 \\ 1 & \text{for } 1/2 < x \leq 1 \end{cases}
So as long as the base error rate is less than 1/2, with enough rounds of majority voting the error rate can be made arbitrarily small. (At exactly 1/2 no information gets through at all; above 1/2 the voted bit converges to being always flipped, which is just as informative once you invert it.)
Let's prove this. Since f_1 is a polynomial, it's continuous. By inspection it has exactly three fixed points: zero, one half, and one. This suffices to show that for any x, if the sequence f_n(x) converges it must converge to zero, one half, or one (I'll prove this below).
Now fix any x such that 0 < x < 1/2. Because f_1 is a cubic polynomial, f_1(x) - x has at most three roots, and we have already found all of them: the fixed points zero, one half, and one. So f_1(x) - x has a constant sign on (0, 1/2), and the fact that
3 \cdot \left( \frac{1}{4} \right)^2 - 2 \cdot \left( \frac{1}{4} \right)^3 = \frac{5}{32} < \frac{1}{4}
is sufficient to prove that f_1(x) < x on this interval. Furthermore, 3 x^2 is greater than 2 x^3 for 0 < x < 1, so 0 < f_1(x). Thus 0 < f_1(x) < x < 1/2. By induction this shows that
0 < \cdots < f_3(x) < f_2(x) < f_1(x) < x
Thus f_n(x) is a bounded monotonic sequence, and must converge. Since x < 1/2 the only fixed point it can converge to is zero.
All that remains is to show that all points in (1/2, 1) converge to one. Note that
\begin{align*} 1 - f_1(1 - x) &= 1 - 3(1 - x)^2 + 2(1 - x)^3 \\ &= 1 - (1 - x)^2 (3 - 2(1 - x)) \\ &= 1 - (1 - 2x + x^2)(1 + 2x) \\ &= 1 - (1 + 2x - 2x - 4x^2 + x^2 + 2x^3) \\ &= 3x^2 - 2x^3 \\ &= f_1(x) \end{align*}
This symmetry establishes the claim: by induction 1 - f_n(1 - x) = f_n(x) for all n, so if 1/2 < x < 1 then 0 < 1 - x < 1/2, hence f_n(1 - x) \to 0 and f_n(x) = 1 - f_n(1 - x) \to 1.
For completeness let's prove the claim that a sequence generated by recursive application of a continuous function f can only converge to a fixed point of f. I'll do this in two steps.
First, we must establish that if a sequence x_0, x_1, \ldots in the domain of f converges to a point x_\infty (also in the domain of f), then the sequence f(x_0), f(x_1), \ldots converges to f(x_\infty). Fix any \epsilon > 0. Since f is continuous, there is some \delta such that \lvert x - x_\infty \rvert < \delta implies \lvert f(x) - f(x_\infty) \rvert < \epsilon. Since x_0, x_1, \ldots converges to x_\infty, there is some N such that n > N implies \lvert x_n - x_\infty \rvert < \delta. Thus n > N implies \lvert f(x_n) - f(x_\infty) \rvert < \epsilon, which establishes the claim.
Second, we can show that any sequence x_0, x_1, \ldots generated by successive application of f that converges must converge to a fixed point of f. Say x_0, x_1, \ldots converges to x_\infty. Since x_{n + 1} = f(x_n) for all n, this means that f(x_0), f(x_1), \ldots converges to x_\infty. But by the result just proven, f(x_0), f(x_1), \ldots converges to f(x_\infty). Since limits are unique this means x_\infty = f(x_\infty).
Behavior of leading term
For small \epsilon_0, the cubic term in f_1(\epsilon_0) is negligible compared to the quadratic term. So let's consider the function g_1(x) = 3x^2, with g_n defined recursively in the same fashion as f_n (i.e. g_{n+1}(x) = g_1(g_n(x))).
For any x \in (0, 1), it's clear that f_1(x) < g_1(x), since 2x^3 > 0. Note also that f_1 maps (0, 1) into (0, 1), and that for any x and y with 0 < x < y,
0 < 3x^2 - 2x^3 < 3x^2 < 3y^2
so 0 < f_1(x) < g_1(y). Thus by induction f_n(x) < g_n(x) for any x in (0, 1) and for all n: if f_n(x) < g_n(x), then f_{n+1}(x) = f_1(f_n(x)) < g_1(f_n(x)) < g_1(g_n(x)) = g_{n+1}(x). So g_n(\epsilon_0) is an upper bound for the probability of an error after n rounds of majority voting with a base error rate of \epsilon_0.
The interesting thing about g_n is that it has a closed form solution that's easy to find.
g_n(x) = 3^{2^n - 1} x^{2^n}
So while each round of voting increases the number of bits by a factor of three, the exponent on the base error rate doubles. This is an astonishing win for majority voting.
Just keep in mind that g_n(\epsilon_0) is only a reasonable upper bound for very small \epsilon_0. Indeed for 1/3 < \epsilon_0 < 1/2, g_n(\epsilon_0) diverges to infinity while f_n(\epsilon_0) still goes to zero.
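Here is a short script (an illustrative sketch, not from the text) that checks the closed form for g_n against direct iteration and confirms that it bounds f_n from above for a small base error rate.

```python
def f1(x): return 3 * x**2 - 2 * x**3   # one round of majority voting
def g1(x): return 3 * x**2              # quadratic upper bound

def iterate(fn, x, n):
    for _ in range(n):
        x = fn(x)
    return x

def g_closed(x, n): return 3 ** (2**n - 1) * x ** (2**n)

x = 0.01
for n in range(1, 6):
    # closed form matches the recursion (up to round-off), and bounds f_n
    assert abs(iterate(g1, x, n) - g_closed(x, n)) < 1e-12 * g_closed(x, n)
    assert iterate(f1, x, n) <= iterate(g1, x, n)
```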
Examples
Here is a table showing error rates after various numbers of voting rounds for a variety of base rates. These are calculated exactly from f_n(\epsilon_0), not the approximate upper bound g_n(\epsilon_0).
voting rounds | 0 | 1 | 2 | 3 | 4 | 5 |
--- | --- | --- | --- | --- | --- | --- |
bits | 1 | 3 | 9 | 27 | 81 | 243 |
p_\text{error} | 0.25 | \num{1.6e-1} | \num{6.6e-2} | \num{1.2e-2} | \num{4.5e-4} | \num{6.2e-7} |
p_\text{error} | 0.1 | \num{2.8e-2} | \num{2.3e-3} | \num{1.6e-5} | \num{7.6e-10} | \num{1.8e-18} |
p_\text{error} | 0.01 | \num{3.0e-4} | \num{2.7e-7} | \num{2.1e-13} | \num{1.4e-25} | \num{5.5e-50} |
p_\text{error} | 0.001 | \num{3.0e-6} | \num{2.7e-11} | \num{2.2e-21} | \num{1.4e-41} | \num{6.1e-82} |
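The table can be reproduced by iterating f_1 directly. A minimal script (illustrative only):

```python
def f1(x): return 3 * x**2 - 2 * x**3

for eps0 in (0.25, 0.1, 0.01, 0.001):
    eps, row = eps0, [f"{eps0:g}"]
    for _ in range(5):               # five rounds of majority voting
        eps = f1(eps)
        row.append(f"{eps:.1e}")
    print(" | ".join(row))
```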
(4.4)
{:.question} Calculate the differential entropy of a Gaussian process.
Since we're integrating the Gaussian over the whole real line, translation is irrelevant. So without loss of generality I'll calculate the differential entropy of a Gaussian with zero mean.
\begin{align*} H(N(\mu, \sigma^2)) &= -\int_{-\infty}^\infty \frac{1}{\sqrt{2 \pi \sigma^2}} e^{\frac{-x^2}{2 \sigma^2}} \log \left( \frac{1}{\sqrt{2 \pi \sigma^2}} e^{\frac{-x^2}{2 \sigma^2}} \right) \mathrm{d} x \\ &= -\int_{-\infty}^\infty \frac{1}{\sqrt{2 \pi \sigma^2}} e^{\frac{-x^2}{2 \sigma^2}} \left( -\frac{1}{2} \log(2 \pi \sigma^2) - \frac{x^2}{2 \sigma^2} \right) \mathrm{d} x \\ &= \frac{1}{2} \log(2 \pi \sigma^2) \int_{-\infty}^\infty \frac{1}{\sqrt{2 \pi \sigma^2}} e^{\frac{-x^2}{2 \sigma^2}} \mathrm{d} x + \int_{-\infty}^\infty \frac{1}{\sqrt{2 \pi \sigma^2}} \frac{x^2}{2 \sigma^2} e^{\frac{-x^2}{2 \sigma^2}} \mathrm{d} x \\ &= \frac{1}{2} \log(2 \pi \sigma^2) + \frac{1}{2} \int_{-\infty}^\infty \left( \frac{x}{\sqrt{2 \pi \sigma^2}} \right) \left( \frac{x}{\sigma^2} e^{\frac{-x^2}{2 \sigma^2}} \right) \mathrm{d} x \\ &= \frac{1}{2} \log(2 \pi \sigma^2) + \frac{1}{2} \int_{-\infty}^\infty \frac{1}{\sqrt{2 \pi \sigma^2}} e^{\frac{-x^2}{2 \sigma^2}} \mathrm{d} x \\ &= \frac{1}{2} \log(2 \pi \sigma^2) + \frac{1}{2} \\ &= \frac{1}{2} \log(2 \pi e \sigma^2) \end{align*}
The second term is integrated by parts. Note that
\frac{d}{dx} e^{\frac{-x^2}{2 \sigma^2}} = -\frac{x}{\sigma^2} e^{\frac{-x^2}{2 \sigma^2}}
(Of course one could also just note that this integral is the distribution's variance, but now we know how to perform that calculation ourselves. Going forward I'll skip this derivation.)
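As a numerical cross-check of the result (a sketch, assuming scipy is available for the quadrature), we can integrate -p \log p directly and compare it with \frac{1}{2} \log(2 \pi e \sigma^2):

```python
import numpy as np
from scipy.integrate import quad   # assumes scipy is installed

sigma = 1.7
p = lambda x: np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

H, _ = quad(lambda x: -p(x) * np.log(p(x)), -np.inf, np.inf)
print(H, 0.5 * np.log(2 * np.pi * np.e * sigma**2))   # agree to ~1e-9
```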
(4.5)
{:.question} A standard telephone line is specified to have a bandwidth of 3300 Hz and an SNR of 20 dB.
(a)
{:.question} What is the capacity?
A signal to noise ratio of 20 dB means that 20 = 10 \log_{10}(S/N). So S/N = 10^2. Presumably the noise power in this figure is calculated over the relevant bandwidth, so we may infer that \Delta f N_0 = N. Then the channel capacity is given by
\begin{align*} C &= \Delta f \log_2 \left( 1 + \frac{S}{\Delta f N_0} \right) \\ &= 3300 \si{Hz} \log_2 \left( 1 + 10^2 \right) \\ &= 22 \si{kbits/s} \end{align*}
(b)
{:.question} What SNR would be necessary for the capacity to be 1 Gbit/s?
If the SNR in decibels is x, then S/N = 10^{x/10}. So we need to solve
C = \Delta f \log_2 \left( 1 + 10^{x/10} \right)
for x. This gives us
\begin{align*} x &= 10 \log_{10} \left( 2^{C / \Delta f} - 1 \right) \\ &= 10 \log_{10} \left( 2^{\frac{10^9 \si{bits/s}}{3300 \si{Hz}}} - 1 \right) \\ &= \num{9e5} \si{dB} \end{align*}
This is unrealistically high. A more reasonable option would be to increase the bandwidth to \num{1.5e8} \si{Hz}.
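The arithmetic for both parts, spelled out (an illustrative script; for part (b) the exponential is far too large to evaluate directly, so it uses \log_{10}(2^k - 1) \approx k \log_{10} 2):

```python
import math

df = 3300.0                           # bandwidth, Hz
snr = 10 ** (20 / 10)                 # 20 dB -> S/N = 100

C = df * math.log2(1 + snr)           # part (a)
print(C)                              # ~2.2e4 bits/s

C_target = 1e9                        # part (b): 1 Gbit/s
snr_db = 10 * (C_target / df) * math.log10(2)
print(snr_db)                         # ~9e5 dB

print(C_target / math.log2(1 + snr))  # bandwidth needed at 20 dB: ~1.5e8 Hz
```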
(4.6)
{:.question} Let (x_1, x_2, \ldots, x_n) be drawn from a Gaussian distribution with variance \sigma^2 and unknown mean value x_0. Show that f(x_1, \ldots, x_n) = n^{-1} \sum_{i = 1}^n x_i is an estimator for x_0 that is unbiased and achieves the Cramér–Rao lower bound.
The probability density of a particular sequence x_1, \ldots, x_n of independent samples is
p(x_1, \ldots, x_n) = \prod_{i = 1}^n \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x_i - x_0)^2}{2 \sigma^2}}
So the expected value of this estimator is
\begin{align*} \langle \hat{x_0} \rangle &= \left \langle \frac{1}{n} \sum_{i = 1}^n x_i \right \rangle \\ &= \int_{\mathbb{R}^n} \left( \prod_{i = 1}^n \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x_i - x_0)^2}{2 \sigma^2}} \right) \left( \frac{1}{n} \sum_{i = 1}^n x_i \right) \mathrm{d} x_1 \ldots \mathrm{d} x_n \\ &= \frac{1}{n} \sum_{i = 1}^n \prod_{j = 1}^n \int_{-\infty}^\infty \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x_j - x_0)^2}{2 \sigma^2}} \left( 1 + \delta_{ij} (x_j - 1) \right) \mathrm{d} x_j \\ &= \frac{1}{n} \sum_{i = 1}^n x_0 \\ &= x_0 \end{align*}
The Kronecker delta is used to indicate that only one term in the product contains x_i. This term integrates to x_0 (since it is the expected value of a single Gaussian), while the others integrate to one.
Now let's show that this estimator achieves the Cramér–Rao bound, i.e. that its variance is equal to the inverse of the Fisher information of the full set of samples. To do this it's helpful to know that (writing \alpha for the unknown parameter x_0)
\begin{align*} \frac{\partial}{\partial \alpha} p_\alpha(x) &= \frac{d}{d x_0} \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x - x_0)^2}{2 \sigma^2}} \\ &= \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x - x_0)^2}{2 \sigma^2}} \frac{x - x_0}{\sigma^2} \\ &= p_\alpha(x) \frac{x - x_0}{\sigma^2} \end{align*}
Then the Fisher information is
\begin{align*} J(\alpha) &= \int_{-\infty}^\infty \frac{1}{p_\alpha(x)} \left( \frac{\partial p_\alpha(x)}{\partial \alpha} \right)^2 \mathrm{d} x \\ &= \int_{-\infty}^\infty \frac{1}{p_\alpha(x)} \left( p_\alpha(x) \frac{x - x_0}{\sigma^2} \right)^2 \mathrm{d} x \\ &= \int_{-\infty}^\infty p_\alpha(x) \frac{(x - x_0)^2}{\sigma^4} \mathrm{d} x \\ &= \frac{1}{\sigma^4} \int_{-\infty}^\infty \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x - x_0)^2}{2 \sigma^2}} (x - x_0)^2 \mathrm{d} x \\ &= \frac{1}{\sigma^2} \end{align*}
The integral evaluates to \sigma^2 since it's the distribution's variance. We have n samples, so in this case the Cramér–Rao bound states that the variance of our estimator can be no smaller than \sigma^2/n.
We know that the expected value of our estimator is x_0, so to find its variance let's compute the expected value of our estimator squared.
\begin{align*} \left \langle \hat{x_0}^2 \right \rangle &= \left \langle \left( \frac{1}{n} \sum_{i = 1}^n x_i \right)^2 \right \rangle \\ &= \int_{\mathbb{R}^n} \left( \prod_{i = 1}^n \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x_i - x_0)^2}{2 \sigma^2}} \right) \left( \frac{1}{n} \sum_{i = 1}^n x_i \right)^2 \mathrm{d} x_1 \ldots \mathrm{d} x_n \\ &= \frac{1}{n^2} \int_{\mathbb{R}^n} \left( \prod_{i = 1}^n \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x_i - x_0)^2}{2 \sigma^2}} \right) \left( \sum_{i, j} x_i x_j \right) \mathrm{d} x_1 \ldots \mathrm{d} x_n \\ &= \frac{1}{n^2} \sum_{i, j} \int_{\mathbb{R}^n} \left( \prod_{k = 1}^n \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x_k - x_0)^2}{2 \sigma^2}} \right) x_i x_j \mathrm{d} x_1 \ldots \mathrm{d} x_n \\ &= \frac{1}{n^2} \left( \sum_{i=j} \int_{-\infty}^\infty \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x_i - x_0)^2}{2 \sigma^2}} x_i^2 \mathrm{d} x_i \right. \\ &\phantom{=\frac{1}{n^2}} + \left. \sum_{i \neq j} \int_{-\infty}^\infty \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x_i - x_0)^2}{2 \sigma^2}} x_i \mathrm{d} x_i \int_{-\infty}^\infty \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x_j - x_0)^2}{2 \sigma^2}} x_j \mathrm{d} x_j \right) \\ &= \frac{1}{n^2} \left( \sum_{i=j} (\sigma^2 + x_0^2) + \sum_{i \neq j} x_0^2 \right) \\ &= \frac{1}{n^2} \left( n (\sigma^2 + x_0^2) + (n^2 - n) x_0^2 \right) \\ &= \frac{\sigma^2}{n} + x_0^2 \end{align*}
Thus the variance of our estimator is
\begin{align*} \left \langle \hat{x_0}^2 \right \rangle - \left \langle \hat{x_0} \right \rangle^2 &= \frac{\sigma^2}{n} + x_0^2 - x_0^2 \\ &= \frac{\sigma^2}{n} \end{align*}
This is equal to the inverse of the Fisher information, so this estimator achieves the Cramér–Rao bound.
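A Monte Carlo spot check of both claims, unbiasedness and the \sigma^2 / n variance, with arbitrary illustrative parameters:

```python
import numpy as np

rng = np.random.default_rng(3)
x0, sigma, n, trials = 2.0, 1.5, 10, 200_000

samples = rng.normal(x0, sigma, size=(trials, n))
estimates = samples.mean(axis=1)        # the estimator, applied to each trial

print(estimates.mean())                 # ~2.0: unbiased
print(estimates.var())                  # ~0.225
print(sigma**2 / n)                     # Cramer-Rao lower bound: 0.225
```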