Topic 3: Machine Learning: Regression, Support Vector Machine & Time Series Models Flashcards
Define information
A quantity which reduces uncertainty about something
Define prediction in the context of data science
A formula for estimating an unknown value of interest: the target
Compare and contrast predictive modeling with descriptive modeling.
Predictive modeling tries to estimate an unknown value of interest (the target), while descriptive modeling tries to gain insight into the underlying phenomenon or process.
Define attributes or features
Attributes or features are selected variables used as input to estimate the value of the target variable. In database terminology these are the columns (instances are the rows).
Describe model induction
the procedure that creates the model from the data is called the induction algorithm or learner.
Induction = generalizing from specific cases to general rules
Contrast induction with deduction
Deduction starts with general rules and specific facts and creates other specific facts from them.
Define the training data and labeled data.
Training data are the input data for the induction algorithm. They are called labeled data because the value of the target variable is known for each instance.
Describe supervised segmentation
Supervised segmentation determines which attributes (columns) are the most informative for predicting the value of the target.
List the complications arising from selecting informative attributes.
- Attributes rarely split a group perfectly
- Not all attributes are binary
- Some attributes take on numeric values
When is a segmented group considered pure?
If every member of the group has the same value for the target, then the group is pure.
What do you call the outcome of a formula that evaluates how well each attribute splits a set of examples into segments?
A purity measure or splitting criterion (the most common is information gain, which is based on entropy).
Define entropy
Entropy measures the general disorder of a single set and corresponds to how mixed (impure) the segment is with respect to properties of interest.
high mix = high impurity = high entropy
Calculate the value of entropy
Parent set = 10 instances: 7 non-write-off, 3 write-off
P(non-write-off) = 7/10 = 70%
P(write-off) = 3/10 = 30%
entropy = -[0.7 x log2(0.7) + 0.3 x log2(0.3)] ≈ 0.88
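A minimal sketch of this calculation in Python (the function and variable names are illustrative, not from the card):

```python
import math

def entropy(class_counts):
    """Entropy of a set, given the number of members in each class."""
    total = sum(class_counts)
    # Sum -p * log2(p) over the classes; a class with zero members contributes 0.
    return -sum((c / total) * math.log2(c / total) for c in class_counts if c > 0)

# Parent set from the card: 7 non-write-offs, 3 write-offs.
print(entropy([7, 3]))  # ~0.881
```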
Define information gain
A measure of how much an attribute improves (decreases) entropy over the whole segmentation it creates.
IG -> change in entropy due to any amount of new information added in
Formula information gain
Parent entropy - (weighted average of children’s entropy)
Calculate information gain for a set of children from a parent set
IG(parent, children) = entropy(parent) - [p(c1) x entropy(c1) + p(c2) x entropy(c2) + …]
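A hedged sketch of this formula, reusing the entropy function above; p(ci) is the fraction of the parent's instances that land in child i:

```python
def information_gain(parent_counts, children_counts):
    """entropy(parent) minus the weighted average of the children's entropies."""
    parent_total = sum(parent_counts)
    weighted_children = sum(
        (sum(child) / parent_total) * entropy(child) for child in children_counts
    )
    return entropy(parent_counts) - weighted_children

# The 10-instance parent [7, 3] split into two children (toy numbers).
print(information_gain([7, 3], [[6, 0], [1, 3]]))  # ~0.557
```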
How does entropy relate to information gain?
Entropy is a measure of disorder in the dataset; information gain is a measure of the decrease in disorder achieved by segmenting the original dataset.
Discuss the issues with the numerical variables for supervised segmentation
It rarely makes sense to create a segment for each distinct number, so numeric values are often discretized by choosing a split point (e.g. larger than or equal to 50%).
Define variance and discuss its application to numeric variables for supervised segmentation.
Variance measures the spread (impurity) of a set of numeric values. For a numeric target, you can measure information gain as the reduction in variance between the parent and the children.
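A small sketch of the same idea with a numeric target, where variance plays the role entropy plays for categorical targets (the example values are made up):

```python
from statistics import pvariance  # population variance

def variance_reduction(parent_values, children_values):
    """Parent variance minus the weighted average of the children's variances."""
    n = len(parent_values)
    weighted = sum((len(c) / n) * pvariance(c) for c in children_values)
    return pvariance(parent_values) - weighted

# A split that separates low from high target values reduces variance a lot.
print(variance_reduction([1, 2, 8, 9], [[1, 2], [8, 9]]))  # 12.25
```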
Define an entropy graph/chart
X-axis is the proportion of the dataset; Y-axis is the entropy.
The shaded area is the entropy after the dataset is divided by some chosen attribute.
The goal is to decrease the shaded area.
Describe how an entropy chart can be used to select an informative variable.
Select the attribute which decreases the shaded area the most and does so for most of the values
Define a classification tree and decision nodes.
A classification tree (supervised segmentation) starts with a root node, branches to interior nodes (decision nodes), and ultimately ends in terminal nodes or leaves.
Define a probability estimation tree, and tree induction.
Probability estimation tree -> the leaves contain class probabilities.
Tree induction -> at each step, select the attribute that partitions the current group into subgroups that are as pure as possible with respect to the target variable (e.g. Oval Body/Square Body).
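A hedged sketch of tree induction using scikit-learn (an assumption, since the cards name no library): criterion="entropy" selects splits by information gain, and predict_proba treats the fitted tree as a probability estimation tree.

```python
from sklearn.tree import DecisionTreeClassifier

# Toy labeled data: two numeric attributes per instance, binary target.
X = [[25, 50_000], [40, 30_000], [35, 80_000], [50, 20_000]]
y = [0, 1, 0, 1]  # e.g. 1 = write-off

# Tree induction: repeatedly pick the attribute that yields the purest subgroups.
tree = DecisionTreeClassifier(criterion="entropy")
tree.fit(X, y)

# Probability estimation tree: the matched leaf supplies class probabilities.
print(tree.predict_proba([[30, 40_000]]))
```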
Define a decision surface or decision boundaries.
Lines separating the regions in an instance space (scatterplot)
Describe the relationship between the decision surface and the number of variables.
With n variables (an n-dimensional instance space), the decision boundary is an (n-1)-dimensional hyperplane.
Define frequency-based estimation of class membership probability
At a leaf with n positives and m negatives, the frequency-based probability of the positive class is n/(n+m).
Describe how Laplace correction is used to modify the probability of a leaf node with few members.
With a single positive observation at a leaf, the frequency-based probability is 100%; the Laplace correction moderates such estimates:
p = (n + 1) / (n + m + 2)
The higher the number of instances at a leaf, the less effect the Laplace correction has.
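A minimal sketch contrasting the two estimates, with n positives and m negatives at a leaf:

```python
def frequency_estimate(n, m):
    return n / (n + m)

def laplace_estimate(n, m):
    return (n + 1) / (n + m + 2)

# A lone positive at a leaf: frequency says 100%, Laplace pulls it toward 50%.
print(frequency_estimate(1, 0), laplace_estimate(1, 0))      # 1.0  ~0.667
# With many instances the correction barely changes the estimate.
print(frequency_estimate(70, 30), laplace_estimate(70, 30))  # 0.7  ~0.696
```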
Define a linear classifier.
Weighted sum of the values for the various attributes
Define a linear discriminant.
A decision boundary used to classify instances x (e.g. as + or -).
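A minimal sketch of a two-attribute linear classifier; the weights and bias are invented for illustration. The sign of the score acts as the linear discriminant, and (as the cards below note) the score's magnitude also ranks instances by distance from the boundary.

```python
# Hypothetical weights and bias for a two-attribute linear classifier.
w = [1.0, -1.5]
b = 0.5

def score(x):
    """Weighted sum of the attribute values, plus a bias term."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(x):
    """The sign of the score places x on one side of the decision boundary."""
    return "+" if score(x) > 0 else "-"

print(classify([2.0, 1.0]), score([2.0, 1.0]))  # + 1.0
```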
Describe decision boundaries in 2-dimensions, 3-dimensions, and higher dimensions.
Decision boundaries:
2 dimensions = a line (instances fall above or below it)
3 dimensions = a plane
Higher dimensions = a hyperplane
Interpret the magnitude of a feature’s weight in a general linear model.
A heavier weight (larger magnitude) = the feature has more importance to the classification, assuming the features are on comparable scales.
Describe how linear discriminant functions can be used for scoring and ranking instances.
The output of the function itself gives a ranking: the further an instance lies from the decision boundary, the more certain it is that the instance belongs to the class.
Describe the objective function of the Support Vector Machine (SVM).
An SVM (a linear discriminant) fits the fattest bar between the classes (maximizing the margin); the linear discriminant is the center line of that bar.
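A hedged sketch using scikit-learn's SVC (again an assumption; the cards name no library). A linear kernel with a large C approximates the hard-margin objective of fitting the fattest separating bar, and decision_function returns the signed score usable for ranking:

```python
from sklearn.svm import SVC

# Two linearly separable toy classes.
X = [[0, 0], [0, 1], [1, 0], [3, 3], [3, 4], [4, 3]]
y = [0, 0, 0, 1, 1, 1]

# Linear kernel + large C approximates a hard-margin SVM: maximize the margin.
svm = SVC(kernel="linear", C=1e6)
svm.fit(X, y)

print(svm.coef_, svm.intercept_)        # the separating hyperplane (center line)
print(svm.decision_function([[2, 2]]))  # signed distance-style score
```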