2 Shannon entropy inequalities
Random variables in this paper are measurable maps \(X : \Omega \to S\) from a probability space \(\Omega \) to a measurable space \(S\); we call these \(S\)-valued random variables. In many cases we will assume that singletons in \(S\) are measurable. Often we will restrict further to the case where \(S\) is finite with the discrete \(\sigma \)-algebra, which of course implies that \(S\) has measurable singletons.
If \(X\) is an \(S\)-valued random variable, the entropy \(\mathbb {H}[X]\) of \(X\) is defined as
\[ \mathbb {H}[X] := \sum _{s \in S} \mathbb {P}[X = s] \log \frac{1}{\mathbb {P}[X = s]}, \]
with the convention that \(0 \log \frac{1}{0} = 0\).
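For instance, if \(X\) takes each of two values with probability \(1/2\), then \(\mathbb {H}[X] = \frac{1}{2}\log 2 + \frac{1}{2}\log 2 = \log 2\), while an almost surely constant random variable has entropy \(0\).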
If \(X: \Omega \to S\) and \(Y: \Omega \to T\) are random variables, and \(Y = f(X)\) for some injection \(f: S \to T\), then \(\mathbb {H}[X] = \mathbb {H}[Y]\).
If \(X: \Omega \to S\) and \(Y: \Omega \to T\) are random variables, and \(Y = f(X)\) and \(X = g(Y)\) for some functions \(f: S \to T\), \(g: T \to S\), then \(\mathbb {H}[X] = \mathbb {H}[Y]\).
Expand out both entropies and rearrange.
If \(X\) is an \(S\)-valued random variable, then \(\mathbb {H}[X] \leq \log |S|\).
This is a direct consequence of Lemma 1.2.
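To indicate the estimate involved (a sketch, assuming Lemma 1.2 is the Jensen-type concavity bound for the logarithm from the previous section):
\[ \mathbb {H}[X] = \sum _{s :\, \mathbb {P}[X=s]>0} \mathbb {P}[X=s] \log \frac{1}{\mathbb {P}[X=s]} \leq \log \sum _{s :\, \mathbb {P}[X=s]>0} \mathbb {P}[X=s] \cdot \frac{1}{\mathbb {P}[X=s]} \leq \log |S|. \]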
If \(H\) is a subset of \(S\), an \(S\)-valued random variable \(X\) is said to be uniformly distributed on \(H\) if \(\mathbb {P}[X = s] = 1/|H|\) for \(s \in H\) and \(\mathbb {P}[X=s] = 0\) otherwise.
Given a finite non-empty subset \(H\) of a set \(S\), there exists a random variable \(X\) (on some probability space) that is uniformly distributed on \(H\).
Direct construction.
If \(X\) is an \(S\)-valued random variable, then \(\mathbb {H}[X] = \log |S|\) if and only if \(X\) is uniformly distributed on \(S\).
Direct computation in one direction. Converse direction needs Lemma 1.3.
If \(X\) is uniformly distributed on \(H\), then \(\mathbb {H}[X] = \log |H|\).
Direct computation.
If \(X\) is an \(S\)-valued random variable, then there exists \(s \in S\) such that \(\mathbb {P}[X=s] \geq \exp (-\mathbb {H}[X])\).
We have
\[ \mathbb {H}[X] = \sum _{s \in S} \mathbb {P}[X=s] \log \frac{1}{\mathbb {P}[X=s]} \geq \sum _{s \in S} \mathbb {P}[X=s] \log \frac{1}{\max _{s' \in S} \mathbb {P}[X=s']} = \log \frac{1}{\max _{s' \in S} \mathbb {P}[X=s']}, \]
and the claim follows by taking \(s\) to attain the maximum.
We use \(X,Y\) to denote the pair \(\omega \mapsto (X(\omega ),Y(\omega ))\).
If \(X: \Omega \to S\), \(Y: \Omega \to T\), and \(Z: \Omega \to U\) are random variables, then \(\mathbb {H}[X, Y] = \mathbb {H}[Y, X]\) and \(\mathbb {H}[X, (Y,Z)] = \mathbb {H}[(X,Y), Z]\).
Set up an injection from \((X,Y)\) to \((Y,X)\) and use Lemma 2.2 for the first claim. Similarly for the second claim.
If \(X: \Omega \to S\) is an \(S\)-valued random variable and \(E\) is an event in \(\Omega \), then the conditioned random variable \((X|E)\) is defined to be the same random variable as \(X\), but now the ambient probability measure has been conditioned to \(E\).
Note that \(E\) may have zero measure, in which case the ambient probability measure should be replaced by the zero measure. (In our formalization we achieve this by working with arbitrary measures, normalizing them to probability measures when possible and using the zero measure otherwise. Conditioning is also formalized using existing Mathlib definitions.)
If \(X: \Omega \to S\) and \(Y: \Omega \to T\) are random variables, the conditional entropy \(\mathbb {H}[X|Y]\) is defined as
\[ \mathbb {H}[X|Y] := \sum _{t \in T} \mathbb {P}[Y = t] \, \mathbb {H}[(X|Y=t)]. \]
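Unfolding the definition of entropy, this can be written more explicitly as
\[ \mathbb {H}[X|Y] = \sum _{t \in T} \mathbb {P}[Y=t] \sum _{s \in S} \mathbb {P}[X=s|Y=t] \log \frac{1}{\mathbb {P}[X=s|Y=t]}, \]
where conditioning on a null event is handled by the zero-measure convention above.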
If \(X: \Omega \to S\), \(Y: \Omega \to T\), and \(Z: \Omega \to U\) are random variables, and \(Y = f(X,Z)\) almost surely for some map \(f: S \times U \to T\) that is injective for each fixed \(z \in U\), then \(\mathbb {H}[X|Z] = \mathbb {H}[Y|Z]\).
Similarly, if \(g: T \to U\) is injective, then \(\mathbb {H}[X|g(Y)] = \mathbb {H}[X|Y]\).
For the first part, use Definition 2.11 and then Lemma 2.2. The second part is a direct computation.
If \(X: \Omega \to S\) and \(Y: \Omega \to T\) are random variables, then
\[ \mathbb {H}[X, Y] = \mathbb {H}[Y] + \mathbb {H}[X|Y]. \]
Direct computation.
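To indicate the computation (a sketch; terms with zero probability are handled by the convention \(0 \log \frac{1}{0} = 0\)): writing \(\mathbb {P}[X=s, Y=t] = \mathbb {P}[Y=t]\,\mathbb {P}[X=s|Y=t]\) and splitting the logarithm,
\[ \mathbb {H}[X,Y] = \sum _{s,t} \mathbb {P}[X=s,Y=t] \log \frac{1}{\mathbb {P}[X=s,Y=t]} = \sum _{t} \mathbb {P}[Y=t] \log \frac{1}{\mathbb {P}[Y=t]} + \sum _{t} \mathbb {P}[Y=t] \sum _{s} \mathbb {P}[X=s|Y=t] \log \frac{1}{\mathbb {P}[X=s|Y=t]} = \mathbb {H}[Y] + \mathbb {H}[X|Y]. \]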
If \(X: \Omega \to S\), \(Y: \Omega \to T\), \(Z: \Omega \to U\) are random variables, then
\[ \mathbb {H}[X, Y | Z] = \mathbb {H}[Y|Z] + \mathbb {H}[X|Y, Z]. \]
For each \(z \in U\), we can apply Lemma 2.13 to the random variables \((X|Z=z)\) and \((Y|Z=z)\) to obtain
\[ \mathbb {H}[X, Y | Z=z] = \mathbb {H}[Y|Z=z] + \mathbb {H}[(X|Z=z)\,|\,(Y|Z=z)]. \]
Now multiply by \(\mathbb {P}[Z=z]\) and sum. Some helper lemmas may be needed to get to the form above. This sort of “average over conditioning” argument to get conditional entropy inequalities from unconditional ones is commonly used in this paper.
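In the present instance the averaged identity reads
\[ \sum _{z \in U} \mathbb {P}[Z=z]\,\mathbb {H}[X,Y|Z=z] = \sum _{z \in U} \mathbb {P}[Z=z]\,\mathbb {H}[Y|Z=z] + \sum _{z \in U} \mathbb {P}[Z=z]\,\mathbb {H}[(X|Z=z)\,|\,(Y|Z=z)], \]
and the helper lemmas identify the three sums with \(\mathbb {H}[X,Y|Z]\), \(\mathbb {H}[Y|Z]\) and \(\mathbb {H}[X|Y,Z]\) respectively.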
If \(X: \Omega \to S\), \(Y: \Omega \to T\) are random variables, then the mutual information \(\mathbb {I}[X:Y]\) is defined by
\[ \mathbb {I}[X:Y] := \mathbb {H}[X] + \mathbb {H}[Y] - \mathbb {H}[X, Y]. \]
With notation as above, we have
\[ \mathbb {I}[X:Y] = \mathbb {H}[X] - \mathbb {H}[X|Y] = \mathbb {H}[Y] - \mathbb {H}[Y|X]. \]
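For instance, the first of these identities follows by combining the definition of \(\mathbb {I}[X:Y]\) with the chain rule (Lemma 2.13):
\[ \mathbb {I}[X:Y] = \mathbb {H}[X] + \mathbb {H}[Y] - \mathbb {H}[X,Y] = \mathbb {H}[X] - \big (\mathbb {H}[X,Y] - \mathbb {H}[Y]\big ) = \mathbb {H}[X] - \mathbb {H}[X|Y]; \]
the second follows symmetrically.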
We have \(\mathbb {I}[X:Y] \geq 0\).
An application of Lemma 1.2 and Lemma 2.16.
With notation as above, we have \(\mathbb {H}[X,Y] \leq \mathbb {H}[X] + \mathbb {H}[Y]\).
Use Lemma 2.17.
With notation as above, we have \(\mathbb {H}[X|Y] \leq \mathbb {H}[X]\).
Combine Lemma 2.17 with Lemma 2.16.
With three random variables \(X,Y,Z\), one has \(\mathbb {H}[X|Y,Z] \leq \mathbb {H}[X|Z]\).
Apply the “averaging over conditioning” argument to Corollary 2.19.
With three random variables \(X,Y,Z\), one has
\[ \mathbb {H}[X, Y, Z] + \mathbb {H}[Z] \leq \mathbb {H}[X, Z] + \mathbb {H}[Y, Z]. \]
Apply Corollary 2.20 and Lemma 2.13.
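Concretely, the two cited results combine as
\[ \mathbb {H}[X,Y,Z] = \mathbb {H}[Y,Z] + \mathbb {H}[X|Y,Z] \leq \mathbb {H}[Y,Z] + \mathbb {H}[X|Z] = \mathbb {H}[Y,Z] + \mathbb {H}[X,Z] - \mathbb {H}[Z], \]
using Lemma 2.13 (together with the associativity of the pair) for the two equalities and Corollary 2.20 for the inequality.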
Two random variables \(X: \Omega \to S\) and \(Y: \Omega \to T\) are independent if the law of \((X,Y)\) is the product of the law of \(X\) and the law of \(Y\). Similarly for more than two variables.
If \(X,Y\) are random variables, then \(\mathbb {I}[X:Y] = 0\) if and only if \(X,Y\) are independent.
An application of the equality case of Jensen’s inequality, Lemma 1.3.
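For orientation, expanding the three entropies in the definition of \(\mathbb {I}[X:Y]\) (and using \(\sum _t \mathbb {P}[X=s,Y=t] = \mathbb {P}[X=s]\) together with its analogue for \(Y\)) gives the equivalent expression
\[ \mathbb {I}[X:Y] = \sum _{s \in S,\, t \in T} \mathbb {P}[X=s, Y=t] \log \frac{\mathbb {P}[X=s, Y=t]}{\mathbb {P}[X=s]\,\mathbb {P}[Y=t]}, \]
with summands interpreted as \(0\) when \(\mathbb {P}[X=s,Y=t] = 0\); this is a form to which the equality case of Jensen's inequality can be applied.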
If \(X,Y\) are random variables, then \(\mathbb {H}[X,Y] = \mathbb {H}[X] + \mathbb {H}[Y]\) if and only if \(X,Y\) are independent.
Direct from Lemma 2.23.
If \(X,Y,Z\) are random variables, with \(Z\) \(U\)-valued, then the conditional mutual information \(\mathbb {I}[X:Y|Z]\) is defined as
\[ \mathbb {I}[X:Y|Z] := \sum _{z \in U} \mathbb {P}[Z=z]\, \mathbb {I}[(X|Z=z) : (Y|Z=z)]. \]
We have
\[ \mathbb {I}[X:Y|Z] = \mathbb {H}[X|Z] + \mathbb {H}[Y|Z] - \mathbb {H}[X,Y|Z] \]
and
\[ \mathbb {I}[X:Y|Z] = \mathbb {H}[X|Z] - \mathbb {H}[X|Y,Z]. \]
Routine computation.
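For instance, the first of these can be obtained by expanding each \(\mathbb {I}[(X|Z=z):(Y|Z=z)]\) via the definition of mutual information and averaging over \(z\):
\[ \mathbb {I}[X:Y|Z] = \sum _{z \in U} \mathbb {P}[Z=z]\Big (\mathbb {H}[X|Z=z] + \mathbb {H}[Y|Z=z] - \mathbb {H}[X,Y|Z=z]\Big ) = \mathbb {H}[X|Z] + \mathbb {H}[Y|Z] - \mathbb {H}[X,Y|Z]; \]
the second then follows from the conditional chain rule.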
If \(X,Y,Z\) are random variables, then \(\mathbb {I}[X:Y|Z] \ge 0\).
Use Definition 2.25 and Corollary 2.20.
Two random variables \(X: \Omega \to S\) and \(Y: \Omega \to T\) are conditionally independent relative to another random variable \(Z: \Omega \to U\) if \(\mathbb {P}[X = s \wedge Y = t| Z=u] = \mathbb {P}[X=s|Z=u] \mathbb {P}[Y=t|Z=u]\) for all \(s \in S, t \in T, u \in U\). (We won’t need conditional independence for more variables than this.)
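For instance (a toy example): if \(X = Y = Z\), then for every \(u\) with \(\mathbb {P}[Z=u] > 0\) both \(\mathbb {P}[X=s \wedge Y=t|Z=u]\) and \(\mathbb {P}[X=s|Z=u]\,\mathbb {P}[Y=t|Z=u]\) equal \(1\) when \(s=t=u\) and \(0\) otherwise (and both vanish when \(\mathbb {P}[Z=u]=0\), by the zero-measure convention above), so \(X\) and \(Y\) are conditionally independent relative to \(Z\) despite being identical.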
If \(X,Y,Z\) are random variables, then \(\mathbb {I}[X:Y|Z] = 0\) iff \(X,Y\) are conditionally independent over \(Z\).
Immediate from Lemma 2.23 and Definition 2.28.
If \(X, Y\) are conditionally independent over \(Z\), then
\[ \mathbb {H}[X, Y | Z] = \mathbb {H}[X|Z] + \mathbb {H}[Y|Z]. \]
Immediate from Lemma 2.29 and Lemma 2.26.
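Spelled out: conditional independence gives \(\mathbb {I}[X:Y|Z] = 0\) by Lemma 2.29, and substituting this into the first formula of Lemma 2.26 rearranges to \(\mathbb {H}[X,Y|Z] = \mathbb {H}[X|Z] + \mathbb {H}[Y|Z]\).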