
2 Shannon entropy inequalities

Random variables in this paper are measurable maps \(X : \Omega \to S\) from a probability space \(\Omega \) to a measurable space \(S\), and are called \(S\)-valued random variables. In many cases we will assume that singletons in \(S\) are measurable. Often we will restrict further to the case when \(S\) is finite with the discrete \(\sigma \)-algebra, which of course implies that \(S\) has measurable singletons.

Definition 2.1 Entropy

If \(X\) is an \(S\)-valued random variable, the entropy \(\mathbb {H}[X]\) of \(X\) is defined as

\[ \mathbb {H}[X] := \sum _{s \in S} \mathbb {P}[X=s] \log \frac{1}{\mathbb {P}[X=s]} \]

with the convention that \(0 \log \frac{1}{0} = 0\).
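For instance, if \(X\) takes values in a two-element set \(\{a,b\} \subseteq S\) with \(\mathbb{P}[X=a]=p\) and \(\mathbb{P}[X=b]=1-p\), the definition gives

\[ \mathbb{H}[X] = p \log \frac{1}{p} + (1-p) \log \frac{1}{1-p}, \]

which equals \(\log 2\) when \(p = 1/2\) and, by the convention above, equals \(0\) when \(p \in \{0,1\}\).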

Lemma 2.2 Entropy and relabeling
  • If \(X: \Omega \to S\) and \(Y: \Omega \to T\) are random variables, and \(Y = f(X)\) for some injection \(f: S \to T\), then \(\mathbb {H}[X] = \mathbb {H}[Y]\).

  • If \(X: \Omega \to S\) and \(Y: \Omega \to T\) are random variables, and \(Y = f(X)\) and \(X = g(Y)\) for some functions \(f: S \to T\), \(g: T \to S\), then \(\mathbb {H}[X] = \mathbb {H}[Y]\).

Proof

Expand out both entropies and rearrange.
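To spell out the rearrangement in the first case: since \(f\) is injective and \(Y = f(X)\), we have \(\mathbb{P}[Y = f(s)] = \mathbb{P}[X = s]\) for every \(s \in S\), while \(\mathbb{P}[Y = t] = 0\) for \(t\) outside the image of \(f\), so that

\[ \mathbb{H}[Y] = \sum_{t \in T} \mathbb{P}[Y=t] \log \frac{1}{\mathbb{P}[Y=t]} = \sum_{s \in S} \mathbb{P}[X=s] \log \frac{1}{\mathbb{P}[X=s]} = \mathbb{H}[X]. \]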

Lemma 2.3 Jensen bound

If \(X\) is an \(S\)-valued random variable, then \(\mathbb {H}[X] \leq \log |S|\).

Proof

This is a direct consequence of Lemma 1.2.
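As a sketch, assuming Lemma 1.2 supplies the Jensen-type bound for the concave function \(\log \): restricting the sum to those \(s\) with \(\mathbb{P}[X=s] > 0\),

\[ \mathbb{H}[X] = \sum_{s : \mathbb{P}[X=s] > 0} \mathbb{P}[X=s] \log \frac{1}{\mathbb{P}[X=s]} \leq \log \sum_{s : \mathbb{P}[X=s] > 0} \mathbb{P}[X=s] \cdot \frac{1}{\mathbb{P}[X=s]} \leq \log |S|. \]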

Definition 2.4 Uniform distribution

If \(H\) is a subset of \(S\), an \(S\)-valued random variable \(X\) is said to be uniformly distributed on \(H\) if \(\mathbb {P}[X = s] = 1/|H|\) for \(s \in H\) and \(\mathbb {P}[X=s] = 0\) otherwise.

Lemma 2.5 Uniform distributions exist

Given a finite non-empty subset \(H\) of a set \(S\), there exists a random variable \(X\) (on some probability space) that is uniformly distributed on \(H\).

Proof

Direct construction.

Lemma 2.6 Entropy of uniform random variable

If \(X\) is an \(S\)-valued random variable, then \(\mathbb {H}[X] = \log |S|\) if and only if \(X\) is uniformly distributed on \(S\).

Proof

Direct computation in one direction. Converse direction needs Lemma 1.3.

Lemma 2.7 Entropy of uniform random variable, II

If \(X\) is uniformly distributed on \(H\), then \(\mathbb {H}[X] = \log |H|\).

Proof

Direct computation.
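Explicitly, since \(\mathbb{P}[X=s] = 1/|H|\) for each of the \(|H|\) elements \(s \in H\) and \(\mathbb{P}[X=s] = 0\) otherwise,

\[ \mathbb{H}[X] = \sum_{s \in H} \frac{1}{|H|} \log |H| = \log |H|. \]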

Lemma 2.8 Bounded entropy implies concentration

If \(X\) is an \(S\)-valued random variable, then there exists \(s \in S\) such that \(\mathbb {P}[X=s] \geq \exp (-\mathbb {H}[X])\).

Proof

We have

\[ \mathbb {H}[X] = \sum _{s \in S} \mathbb {P}[X=s] \log \frac{1}{\mathbb {P}[X=s]} \geq \min _{s \in S} \log \frac{1}{\mathbb {P}[X=s]} \]

and the claim follows.
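In more detail: choosing \(s \in S\) with \(\mathbb{P}[X=s]\) maximal, the displayed bound gives

\[ \mathbb{H}[X] \geq \log \frac{1}{\mathbb{P}[X=s]}, \qquad \text{i.e.} \qquad \mathbb{P}[X=s] \geq \exp(-\mathbb{H}[X]). \]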

We use \(X,Y\) to denote the pair \(\omega \mapsto (X(\omega ),Y(\omega ))\).

Lemma 2.9 Commutativity and associativity of joint entropy

If \(X: \Omega \to S\), \(Y: \Omega \to T\), and \(Z: \Omega \to U\) are random variables, then \(\mathbb {H}[X, Y] = \mathbb {H}[Y, X]\) and \(\mathbb {H}[X, (Y,Z)] = \mathbb {H}[(X,Y), Z]\).

Proof

Apply Lemma 2.2 with the bijection \((x,y) \mapsto (y,x)\) between \(S \times T\) and \(T \times S\) for the first claim, and with the regrouping bijection \((x,(y,z)) \mapsto ((x,y),z)\) for the second.

Definition 2.10 Conditioned event

If \(X: \Omega \to S\) is an \(S\)-valued random variable and \(E\) is an event in \(\Omega \), then the conditioned event \((X|E)\) is defined to be the same random variable as \(X\), but now the ambient probability measure has been conditioned to \(E\).

Note that \(E\) may have zero measure, in which case the ambient probability measure is replaced by the zero measure. (In our formalization we achieve this by working with arbitrary measures, normalizing them to be probability measures when possible and using the zero measure otherwise. Conditioning is also formalized using existing Mathlib definitions.)

Definition 2.11 Conditional entropy

If \(X: \Omega \to S\) and \(Y: \Omega \to T\) are random variables, the conditional entropy \(\mathbb {H}[X|Y]\) is defined as

\[ \mathbb {H}[X|Y] := \sum _{y \in T} \mathbb {P}[Y = y] \mathbb {H}[(X | Y=y)]. \]
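Unpacking Definition 2.1 (with the same convention for vanishing probabilities as before), this can be written as the double sum

\[ \mathbb{H}[X|Y] = \sum_{y \in T} \sum_{x \in S} \mathbb{P}[Y = y]\, \mathbb{P}[X = x | Y = y] \log \frac{1}{\mathbb{P}[X = x | Y = y]}. \]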
Lemma 2.12 Conditional entropy and relabeling

If \(X: \Omega \to S\), \(Y: \Omega \to T\), and \(Z: \Omega \to U\) are random variables, and \(Y = f(X,Z)\) almost surely for some map \(f: S \times U \to T\) that is injective in the first argument for each fixed \(z \in U\), then \(\mathbb {H}[X|Z] = \mathbb {H}[Y|Z]\).

Similarly, if \(g: T \to U\) is injective, then \(\mathbb {H}[X|g(Y)] = \mathbb {H}[X|Y]\).

Proof

For the first part, use Definition 2.11 and then Lemma 2.2. The second part is a direct computation.

Lemma 2.13 Chain rule

If \(X: \Omega \to S\) and \(Y: \Omega \to T\) are random variables, then

\[ \mathbb {H}[X, Y] = \mathbb {H}[Y] + \mathbb {H}[X|Y]. \]
Proof

Direct computation.
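For instance, using \(\mathbb{P}[X=x, Y=y] = \mathbb{P}[Y=y]\, \mathbb{P}[X=x | Y=y]\) and splitting the logarithm,

\[ \mathbb{H}[X,Y] = \sum_{x \in S,\, y \in T} \mathbb{P}[X=x, Y=y] \left( \log \frac{1}{\mathbb{P}[Y=y]} + \log \frac{1}{\mathbb{P}[X=x | Y=y]} \right), \]

and summing the first term over \(x\) gives \(\mathbb{H}[Y]\), while the second term gives \(\mathbb{H}[X|Y]\) by Definition 2.11.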

Lemma 2.14 Conditional chain rule

If \(X: \Omega \to S\), \(Y: \Omega \to T\), \(Z: \Omega \to U\) are random variables, then

\[ \mathbb {H}[X, Y | Z] = \mathbb {H}[Y | Z] + \mathbb {H}[X|Y, Z]. \]
Proof

For each \(z \in U\), we can apply Lemma 2.13 to the random variables \((X|Z=z)\) and \((Y|Z=z)\) to obtain

\[ \mathbb {H}[(X|Z=z),(Y|Z=z)] = \mathbb {H}[Y|Z=z] + \mathbb {H}[(X|Z=z)|(Y|Z=z)]. \]

Now multiply by \(\mathbb {P}[Z=z]\) and sum. Some helper lemmas may be needed to get to the form above. This sort of “average over conditioning” argument to get conditional entropy inequalities from unconditional ones is commonly used in this paper.

Definition 2.15 Mutual information

If \(X: \Omega \to S\), \(Y: \Omega \to T\) are random variables, then

\[ \mathbb {I}[X : Y] := \mathbb {H}[X] + \mathbb {H}[Y] - \mathbb {H}[X,Y]. \]
Lemma 2.16 Alternative formulae for mutual information

With notation as above, we have

\[ \mathbb {I}[X : Y] = \mathbb {I}[Y:X] \]
\[ \mathbb {I}[X : Y] = \mathbb {H}[X] - \mathbb {H}[X|Y] \]
\[ \mathbb {I}[X : Y] = \mathbb {H}[Y] - \mathbb {H}[Y|X] \]
Proof

Immediate from Lemmas 2.9, 2.13.
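For example, the second identity follows by substituting the chain rule (Lemma 2.13) into Definition 2.15:

\[ \mathbb{I}[X:Y] = \mathbb{H}[X] + \mathbb{H}[Y] - \left( \mathbb{H}[Y] + \mathbb{H}[X|Y] \right) = \mathbb{H}[X] - \mathbb{H}[X|Y]. \]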

Lemma 2.17 Nonnegativity of mutual information

We have \(\mathbb {I}[X:Y] \geq 0\).

Proof

An application of Lemma 1.2 and Lemma 2.16.

Corollary 2.18 Subadditivity

With notation as above, we have \(\mathbb {H}[X,Y] \leq \mathbb {H}[X] + \mathbb {H}[Y]\).

Proof

Use Lemma 2.17.

Corollary 2.19 Conditioning reduces entropy

With notation as above, we have \(\mathbb {H}[X|Y] \leq \mathbb {H}[X]\).

Proof

Combine Lemma 2.17 with Lemma 2.16.

Corollary 2.20 Submodularity

With three random variables \(X,Y,Z\), one has \(\mathbb {H}[X|Y,Z] \leq \mathbb {H}[X|Z]\).

Proof

Apply the “averaging over conditioning” argument to Corollary 2.19.
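Spelled out: writing the sums over the values \(z\) of \(Z\), one has the identity \(\mathbb{H}[X|Y,Z] = \sum_{z} \mathbb{P}[Z=z]\, \mathbb{H}[(X|Z=z) \, | \, (Y|Z=z)]\) (one of the helper facts alluded to after Lemma 2.14), and applying Corollary 2.19 to each conditioned pair gives

\[ \mathbb{H}[X|Y,Z] = \sum_{z} \mathbb{P}[Z=z]\, \mathbb{H}[(X|Z=z) \, | \, (Y|Z=z)] \leq \sum_{z} \mathbb{P}[Z=z]\, \mathbb{H}[(X|Z=z)] = \mathbb{H}[X|Z]. \]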

Corollary 2.21 Alternate form of submodularity

With three random variables \(X,Y,Z\), one has

\[ \mathbb {H}[X,Y,Z] + \mathbb {H}[Z] \leq \mathbb {H}[X,Z] + \mathbb {H}[Y,Z]. \]
Proof

Apply Corollary 2.20 and Lemma 2.13.
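Explicitly, combining the chain rule (Lemma 2.13, with Lemma 2.9 to regroup) with Corollary 2.20,

\[ \mathbb{H}[X,Y,Z] = \mathbb{H}[X|Y,Z] + \mathbb{H}[Y,Z] \leq \mathbb{H}[X|Z] + \mathbb{H}[Y,Z] = \mathbb{H}[X,Z] - \mathbb{H}[Z] + \mathbb{H}[Y,Z], \]

which rearranges to the stated inequality.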

Definition 2.22 Independent random variables

Two random variables \(X: \Omega \to S\) and \(Y: \Omega \to T\) are independent if the law of \((X,Y)\) is the product of the law of \(X\) and the law of \(Y\). Similarly for more than two variables.

Lemma 2.23 Vanishing of mutual information

If \(X,Y\) are random variables, then \(\mathbb {I}[X:Y] = 0\) if and only if \(X,Y\) are independent.

Proof

An application of the equality case of Jensen’s inequality, Lemma 1.3.

Corollary 2.24 Additivity of entropy

If \(X,Y\) are random variables, then \(\mathbb {H}[X,Y] = \mathbb {H}[X] + \mathbb {H}[Y]\) if and only if \(X,Y\) are independent.

Proof

Direct from Lemma 2.23.

Definition 2.25 Conditional mutual information

If \(X,Y,Z\) are random variables, with \(Z\) \(U\)-valued, then

\[ \mathbb {I}[X:Y|Z] := \sum _{z \in U} \mathbb {P}[Z=z] \mathbb {I}[(X|Z=z): (Y|Z=z)]. \]
Lemma 2.26 Alternate formula for conditional mutual information

We have

\[ \mathbb {I}[X:Y|Z] = \mathbb {H}[X|Z] + \mathbb {H}[Y|Z] - \mathbb {H}[X,Y|Z] \]

and

\[ \mathbb {I}[X:Y|Z] = \mathbb {H}[X|Z] - \mathbb {H}[X|Y,Z]. \]
Proof

Routine computation.
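For instance, the first identity follows by expanding each summand in Definition 2.25 via Definition 2.15 and then averaging against \(\mathbb{P}[Z=z]\):

\[ \mathbb{I}[X:Y|Z] = \sum_{z \in U} \mathbb{P}[Z=z] \left( \mathbb{H}[X|Z=z] + \mathbb{H}[Y|Z=z] - \mathbb{H}[X,Y|Z=z] \right) = \mathbb{H}[X|Z] + \mathbb{H}[Y|Z] - \mathbb{H}[X,Y|Z]. \]

The second identity then follows from the first by the conditional chain rule (Lemma 2.14).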

Lemma 2.27 Nonnegativity of conditional mutual information

If \(X,Y,Z\) are random variables, then \(\mathbb {I}[X:Y|Z] \ge 0\).

Proof

Use Definition 2.25 and Lemma 2.17: each summand \(\mathbb {P}[Z=z]\, \mathbb {I}[(X|Z=z): (Y|Z=z)]\) is nonnegative.

Definition 2.28 Conditionally independent random variables

Two random variables \(X: \Omega \to S\) and \(Y: \Omega \to T\) are conditionally independent relative to another random variable \(Z: \Omega \to U\) if \(\mathbb {P}[X = s \wedge Y = t| Z=u] = \mathbb {P}[X=s|Z=u] \mathbb {P}[Y=t|Z=u]\) for all \(s \in S, t \in T, u \in U\). (We won’t need conditional independence for more variables than this.)

Lemma 2.29 Vanishing conditional mutual information

If \(X,Y,Z\) are random variables, then \(\mathbb {I}[X:Y|Z] = 0\) if and only if \(X,Y\) are conditionally independent over \(Z\).

Proof

Immediate from Lemma 2.23 and Definition 2.28.

Corollary 2.30 Entropy of conditionally independent variables

If \(X, Y\) are conditionally independent over \(Z\), then

\[ \mathbb {H}[X,Y,Z] =\mathbb {H}[X,Z] + \mathbb {H}[Y,Z] - \mathbb {H}[Z]. \]
Proof

Immediate from Lemma 2.29 and Lemma 2.26.
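Concretely: by Lemma 2.29 and the first identity of Lemma 2.26, conditional independence gives \(\mathbb{H}[X,Y|Z] = \mathbb{H}[X|Z] + \mathbb{H}[Y|Z]\), and hence, by the chain rule (Lemma 2.13),

\[ \mathbb{H}[X,Y,Z] = \mathbb{H}[X,Y|Z] + \mathbb{H}[Z] = (\mathbb{H}[X,Z] - \mathbb{H}[Z]) + (\mathbb{H}[Y,Z] - \mathbb{H}[Z]) + \mathbb{H}[Z] = \mathbb{H}[X,Z] + \mathbb{H}[Y,Z] - \mathbb{H}[Z]. \]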