Chapter 21 Information Theory Flashcards
Information theory is a subfield of mathematics concerned with ____.
P 206
transmitting data across a noisy channel
Calculating information and entropy is a useful tool in machine learning and is used as the basis for techniques such as ____ , ____ , and, more generally, ____.
P 206
feature selection, building decision trees, fitting classification models
Information theory is concerned with representing data in a compact fashion (a task known as ____ or ____ ), as well as with transmitting and storing it in a way that is robust to errors (a task known as ____ or ____ ). P 207
data compression, source coding, error correction, channel coding
What is the intuition behind quantifying information? P 207
The intuition behind quantifying information is the idea of measuring how much surprise there is in an event, that is, how unlikely it is. Events that are rare (low probability) are more surprising and therefore carry more information than events that are common (high probability).
Low Probability Event: High Information (surprising).
High Probability Event: Low Information (unsurprising).
The basic intuition behind information theory is that learning that an unlikely event has occurred is more informative than learning that a likely event has occurred.
Rare events are more uncertain or more surprising and require more information to represent them than common events.
We can calculate the amount of information there is in an event using the probability of the event. This is called ____, ____, or simply the ____, and can be calculated for a discrete event I(x) as follows: ____
P 208
Shannon information, self-information, information
I(x) = − log(p(x))
The negative sign ensures that the result is always positive or zero. Information will be zero when the probability of an event is 1.0 (a certainty), i.e. there is no surprise. Using the base-2 logarithm:
I(x) = −log2(p(x))
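As a minimal sketch (the probabilities below are arbitrary examples, not values from the book), the formula maps straight into Python:

```python
from math import log2

def information(p):
    """Shannon information (self-information) of an event with probability p, in bits."""
    return -log2(p)

# Arbitrary example probabilities: a certainty, a fair coin flip, and a rare event
for p in [1.0, 0.5, 0.1]:
    print(p, information(p))
# p = 1.0 -> 0 bits (printed as -0.0): a certainty carries no surprise
# p = 0.5 -> 1.0 bit
# p = 0.1 -> ~3.32 bits
```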
The choice of the base-2 logarithm means that ____. This can be directly interpreted in the information processing sense as ____. The calculation of information is often written as h() to contrast it with entropy H():
h(x) = − log2(p(x))
the units of the information measure is in bits (binary digits).
The number of bits required to represent an event on a noisy communication channel
Other logarithms can be used instead of the base-2. For example, it is also common to use the ____ logarithm that uses base ____ in calculating the information, in which case the units are referred to as ____. P 209
Natural, e (Euler’s number), nats
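A tiny sketch comparing the two unit conventions (p = 0.5 is just an example value):

```python
from math import log, log2

p = 0.5  # arbitrary example probability

info_bits = -log2(p)  # base-2 logarithm  -> measured in bits
info_nats = -log(p)   # natural logarithm -> measured in nats

print(info_bits)           # 1.0 bit
print(info_nats)           # ~0.693 nats
print(info_nats / log(2))  # dividing by ln(2) converts nats to bits: 1.0
```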
In effect, calculating the information for a random variable is the same as calculating the information for the probability distribution of the events for the random variable. Calculating the information for a random variable is called ____, ____, or simply ____.
P 210
information entropy, Shannon entropy, entropy.
What is the intuition behind choosing “entropy” for information gained from a random variable?
P 210
The intuition for entropy is that it is the average number of bits required to represent or transmit an event drawn from the probability distribution for the random variable.
The Shannon entropy of a distribution is the expected amount of information in an event drawn from that distribution.
Entropy can be calculated for a random variable X with K discrete states as follows:
P 211
H(X) = − ∑_{i=1}^{K} p(k_i) × log(p(k_i))
Like information, the log() function uses base-2 and the units are bits. The natural logarithm can be used instead and the units will be nats.
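As a minimal sketch (the four-state distribution below is made up purely for illustration), the entropy sum can be computed directly:

```python
from math import log2

def entropy(probabilities):
    """Entropy H(X) in bits of a discrete distribution given as a list of probabilities."""
    return sum(-p * log2(p) for p in probabilities if p > 0)

# A made-up distribution over K = 4 discrete states
dist = [0.5, 0.25, 0.125, 0.125]
print(entropy(dist))  # 1.75 bits
```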
The lowest entropy is calculated for a random variable that ____. The largest entropy for a random variable will occur when ____.
P 211
has a single event with a probability of 1.0, a certainty
all events are equally likely
In the case where one event dominates, such as with a skewed probability distribution, there is less surprise and the distribution will have a ____ (lower/higher) entropy. In the case where no event dominates another, such as with an equal or approximately equal probability distribution, we would expect a ____ (smaller/larger) entropy.
P 212
Lower
larger or maximum
Skewed Probability Distribution (unsurprising): Low entropy.
Balanced Probability Distribution (surprising): High entropy.
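A short sketch contrasting these cases (the distributions are invented examples over K = 4 states):

```python
from math import log2

def entropy(probabilities):
    """Entropy in bits; zero-probability states contribute nothing."""
    return sum(-p * log2(p) for p in probabilities if p > 0)

# Invented 4-state distributions
certain  = [1.0, 0.0, 0.0, 0.0]      # a single event with probability 1.0
skewed   = [0.85, 0.05, 0.05, 0.05]  # one event dominates
balanced = [0.25, 0.25, 0.25, 0.25]  # all events equally likely

print(entropy(certain))   # 0.0 bits   (lowest possible entropy)
print(entropy(skewed))    # ~0.85 bits (low entropy, unsurprising)
print(entropy(balanced))  # 2.0 bits   (maximum entropy, log2(4))
```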
Calculating the entropy for a random variable provides the basis for other measures such as mutual information (information gain). True/False
P 213
True
It also provides the basis for calculating the difference between two probability distributions with cross-entropy and the KL-divergence.
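As a sketch of how those follow-on measures look in code (the two three-state distributions P and Q are made up, and the functions use the standard definitions rather than anything specific to this chapter):

```python
from math import log2

def entropy(p):
    return sum(-pi * log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    # Average bits needed to encode events from P using a code built for Q
    return sum(-pi * log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    # Extra bits paid for using Q in place of the true distribution P
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Made-up 3-state distributions
P = [0.10, 0.40, 0.50]
Q = [0.80, 0.15, 0.05]

print(cross_entropy(P, Q))               # ~3.29 bits
print(entropy(P) + kl_divergence(P, Q))  # same value: H(P, Q) = H(P) + KL(P || Q)
```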
Information provides a way to quantify the amount of ____ for an event measured in bits.
Entropy provides a measure of the ____ needed to represent an event drawn from a probability distribution for a random variable.
P 214
Surprise, Average amount of information
The more unlikely the event, the more surprising it is, and the more information it has.
The more probable something is, the more information we already have about it, so it is the unlikelier events that offer more information. For example, a person in their 20s being healthy is likely and therefore not very informative, but if that person had a heart attack, which is unlikely, it would give us a lot of information about the state of their body.