Chapter 4: Information-based Learning Flashcards
How do you build predictive machine learning models?
Use the most informative features
In this context what is an informative feature?
A descriptive feature whose values split the instances in the dataset into the most homogeneous sets with respect to the target feature value
How do you calculate the average number of questions you have to ask per game?
Add the number of questions asked along the path to each person and divide by the number of people
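A minimal sketch of this calculation in Python, assuming a hypothetical guessing game where we already know how many questions it takes to reach each person (the names and counts are illustrative, not from the chapter):

```python
# Hypothetical question counts: how many yes/no questions it takes to
# identify each person in a small guessing game.
questions_per_person = {"Alice": 1, "Bob": 2, "Carol": 3, "Dan": 3}

# Average questions per game, assuming each person is equally likely to be
# the answer: sum the question counts and divide by the number of people.
average_questions = sum(questions_per_person.values()) / len(questions_per_person)
print(average_questions)  # (1 + 2 + 3 + 3) / 4 = 2.25
```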
What do we consider the effects of different answers in terms of?
- How the domain is split up after the answer is received
- The likelihood of each of the answers
What does a decision tree consist of?
- Root node (starting node)
- Interior nodes
- Leaf nodes (terminating nodes)
What is some important information about non-leaf nodes and leaf nodes?
- Each non-leaf node specifies a test to be carried out on one of the query’s descriptive features
- Each leaf node contains a class label that specifies a predicted classification for the query
What is the process of using a decision tree to make a prediction for a query instance?
- Start by testing the value of the descriptive feature at the root node of the tree
- The result of the test determines which of the root node’s children the process should descend to
- The two steps of testing the descriptive feature and descending a level are repeated until the process comes to a leaf node at which a prediction can be made
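A minimal Python sketch of this descent; the nested-dict tree structure and the feature names are assumptions for illustration, not the book's representation:

```python
def predict(node, query):
    """Descend from the root node until a leaf is reached, then return its label."""
    # Leaf node: it holds a class label, which is the prediction for the query.
    if "label" in node:
        return node["label"]
    # Non-leaf node: test the descriptive feature it specifies and descend to
    # the child corresponding to the query's value for that feature.
    value = query[node["feature"]]
    return predict(node["children"][value], query)

# Hypothetical tree: the root tests "outlook"; its children are leaves or further tests.
tree = {
    "feature": "outlook",
    "children": {
        "sunny": {"label": "no"},
        "overcast": {"label": "yes"},
        "rainy": {
            "feature": "windy",
            "children": {True: {"label": "no"}, False: {"label": "yes"}},
        },
    },
}

print(predict(tree, {"outlook": "rainy", "windy": False}))  # -> yes
```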
What is the preference for decision trees?
Shallower trees
How do we make shallow trees?
- Testing the most informative features early in the tree
- We do that using ENTROPY, a computational measure of the impurity of a set
What is Shannon’s entropy model?
- It defines a computational measure of the impurity of the elements of a set
How is entropy related to the probability of an outcome?
High probability -> Low entropy
Low probability -> High entropy
How do we map probability to entropy value?
Take the log of the probability and multiply it by -1
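In symbols, a sketch of that mapping (the notation I(x) for the information of an outcome x is an assumption; base 2 matches the later card on the base of the calculation):

```latex
% Information (surprise) of a single outcome x with probability P(x).
% Base-2 logs give a result in bits.
I(x) = -\log_{2}\big(P(x)\big)
% Example: P(x) = 0.5 gives 1 bit; P(x) = 1 gives 0 bits.
```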
What is Shannon’s entropy model?
- A weighted sum of the logs of the probabilities of each of the possible outcomes when we make a random selection from a set
- It is the cornerstone of modern information theory and is an excellent measure of the impurity (heterogeneity) of a set
What are the weights used in the sum?
The weights used in the sum are the probabilities of the outcomes themselves, so that outcomes with high probabilities contribute more to the overall entropy of a set than outcomes with low probabilities
Why is there a minus sign at the beginning of the equation?
It is added to convert the negative numbers returned by the log function (probabilities are at most 1, so their logs are zero or negative) to positive ones
What is the base of our calculation?
We always use base 2 so that entropy is calculated in bits
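Putting the last few cards together, a sketch of the full entropy formula in common notation (H, t for the target feature, D for the dataset, and levels(t) for its possible values are notational assumptions not defined in these cards):

```latex
% Shannon's entropy of dataset D with respect to target feature t:
% a sum over the possible target levels of each level's probability
% times its base-2 log, with a leading minus sign to make the result positive.
H(t, D) = - \sum_{l \in \mathrm{levels}(t)} P(t = l) \times \log_{2}\big(P(t = l)\big)
```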
What is the relationship between a measure of heterogeneity of a set and predictive analytics?
If we can construct a sequence of tests that splits the training data into pure sets with respect to the target feature values, then we can label a query by applying the same sequence of tests to it and labeling it with the target feature value of the instances in the set it ends up in
What is our intuition for information gain?
Our intuition is that the ideal discriminatory feature will partition the data into pure subsets where all the instances in each subset have the same classification
What is the information gain of a descriptive feature?
It is a measure of the reduction in the overall entropy of a prediction task achieved by testing on that feature
What is the first step of the three-step process for computing information gain?
- Compute the entropy of the original dataset with respect to the target feature.
- This gives a measure of how much information is required to organize the data into pure sets
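A minimal Python sketch of this first step, computing the entropy of a dataset with respect to its target feature from the column of target values (the function name and example values are illustrative assumptions):

```python
from collections import Counter
from math import log2

def entropy(target_values):
    """Entropy of a dataset with respect to its target feature,
    computed from the list of target values (base 2, so in bits)."""
    counts = Counter(target_values)
    total = len(target_values)
    return -sum((c / total) * log2(c / total) for c in counts.values())

# A perfectly mixed set has maximal entropy (1 bit for two equally likely levels);
# a pure set has entropy 0.
print(entropy(["spam", "ham", "spam", "ham"]))  # 1.0
print(entropy(["spam", "spam", "spam"]))        # -0.0 (pure set)
```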