Machine Learning Flashcards
What is supervised learning?
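A method of training a model on labelled data, where each observation pairs features with a known response, so the model's predictions can be compared against the true responses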
What is unsupervised learning?
A method of identifying structures within data without the aid of response variables
What is a regression problem?
A supervised learning task where the response variable is continuous
What is a classification problem?
A supervised learning task where the response variable is categorical
What is the equation we solve for when graphically fitting a logistic regression model?
logit(0.5) = B0 + B1x1 + B2x2 [ + … + Bpxp]
- Rearranging the equation to make x2 the subject of the formula gives a line along which the probability of “success” is equal to 0.5. Observations above and below the line will be predicted as different classes, so the line partitions the feature space
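The rearrangement above can be sketched in R for the two-feature case. Since logit(0.5) = log(0.5 / 0.5) = 0, the boundary works out to x2 = -(B0 + B1x1) / B2. The function name boundary_x2 and the coefficient values are made up for illustration, and the sketch assumes B2 is non-zero.

```r
# Decision boundary of a two-feature logistic regression model.
# Assumption: b2 is non-zero; the coefficients below are illustrative only.
boundary_x2 <- function(x1, b0, b1, b2) {
  # Solve 0 = b0 + b1*x1 + b2*x2 for x2, using logit(0.5) = 0
  -(b0 + b1 * x1) / b2
}

b0 <- -1; b1 <- 2; b2 <- 0.5
x1 <- c(0, 1, 2)
x2 <- boundary_x2(x1, b0, b1, b2)

# plogis() is the inverse logit, so points on the boundary map to 0.5
p_on_line <- plogis(b0 + b1 * x1 + b2 * x2)
```

Points with a linear predictor above 0 get a predicted probability above 0.5, and points below 0 get one under 0.5.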
At a high level, how do tree-based methods work?
They predict classes by partitioning the feature space along each of its dimensions. The partitioning results in regions, each of which is associated with a particular response.
What is the R code for fitting a classification tree to a dataset and printing some information about the tree?
library(tree)
# Y must be a factor for tree() to fit a classification tree
treeModel = tree(Y ~ X1 + X2)
summary(treeModel)
What is a terminal node in the context of classification trees?
The points at which the binary tree stops splitting. Graphically, these are the end points of the tree and represent where prediction is finalized. Also known as leaf nodes
What process do most software packages use in rigorously determining the partition points in the feature space?
Recursive binary splitting
What is the R code to fit a logistic regression model to a dataset?
# myData is a placeholder name for a data frame containing Y and the predictors
logisticModel = glm(Y ~ . , data = myData, family = binomial(link = "logit"))
summary(logisticModel)
What is a regression tree?
A tree model where the response is a continuous number (e.g. house valuations) rather than a class. Each region predicts the mean response of the training observations that fall in it
What are the two main advantages of using tree models?
- tree models can handle highly non-linear data
- tree models are relatively easy to interpret despite their non-linearity
What does RSS stand for?
Residual sum of squares
What is the learning algorithm or procedure for fitting a tree?
Initial step: Start with the entire unsplit feature space. For each dimension, find the split position which leads to the greatest reduction in RSS, then retain the best of these across all dimensions (provided its reduction exceeds the minimum RSS reduction threshold).
Iterate: Split the data according to the split in the initial step. In each of the two regions, perform the same procedure as in the initial step.
Stop: If no split can be found that exceeds the minimum RSS reduction threshold then stop and return the resulting tree.
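The three steps above can be sketched in base R for a continuous response. This is a minimal illustration, not how the tree package is implemented. The names rss, best_split, grow_tree and min_reduction, and the toy data, are all made up for the example.

```r
# RSS of a region when every observation is predicted by the region mean
rss <- function(y) sum((y - mean(y))^2)

# Initial step: search every feature and every candidate cut point for the
# single split giving the greatest reduction in RSS.
best_split <- function(X, y) {
  X <- as.data.frame(X)
  best <- list(feature = NA, cut = NA, reduction = 0)
  for (j in seq_len(ncol(X))) {
    xs <- sort(unique(X[[j]]))
    if (length(xs) < 2) next
    cuts <- (head(xs, -1) + tail(xs, -1)) / 2  # midpoints between neighbours
    for (cut_point in cuts) {
      left <- y[X[[j]] < cut_point]
      right <- y[X[[j]] >= cut_point]
      reduction <- rss(y) - (rss(left) + rss(right))
      if (reduction > best$reduction) {
        best <- list(feature = names(X)[j], cut = cut_point,
                     reduction = reduction)
      }
    }
  }
  best
}

# Iterate and stop: split recursively until no split exceeds min_reduction
grow_tree <- function(X, y, min_reduction = 1) {
  X <- as.data.frame(X)
  s <- best_split(X, y)
  if (is.na(s$feature) || s$reduction <= min_reduction) {
    return(list(leaf = TRUE, prediction = mean(y)))  # terminal node
  }
  go_left <- X[[s$feature]] < s$cut
  list(leaf = FALSE, feature = s$feature, cut = s$cut,
       left = grow_tree(X[go_left, , drop = FALSE], y[go_left], min_reduction),
       right = grow_tree(X[!go_left, , drop = FALSE], y[!go_left], min_reduction))
}

# Toy data: the response jumps when X1 crosses roughly 6.5
toy <- data.frame(X1 = c(1, 2, 3, 10, 11, 12), X2 = c(5, 3, 6, 4, 5, 3))
y <- c(1.0, 1.2, 0.8, 5.0, 5.2, 4.8)
fit <- grow_tree(toy, y, min_reduction = 1)
```

On the toy data the root splits on X1 and both children become terminal nodes, because no further split reduces RSS by more than the threshold.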
What are impurity measures and what are 3 common impurity measures?
Measures of how mixed the classes are within a region of a classification tree. They are used to choose splits in place of RSS, which is not appropriate for categorical responses such as binary data.
3 common impurity measures:
- misclassification error
- Gini-index
- Shannon Entropy
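For a binary tree, what is the prediction of a model for region R(j), also known as [pi hat]?
[1/N].[sum of I(y=1)], where the sum runs over the N observations in R(j). This is the proportion of observations in the region with y = 1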
What is the equation for misclassification error in the context of binary trees?
1 - max(pi hat, 1 - pi hat)
- we want to minimize this quantity
What is the equation for Gini-index in the context of binary trees?
2[pi hat].[1-(pi hat)]
What is the equation for Shannon Entropy in the context of binary trees?
- refer to notes
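The impurity measures above can be written as short R functions of pi hat. The misclassification error and Gini-index follow the equations given on these cards. The entropy function is an assumption: it uses the standard binary Shannon entropy with log base 2, which may differ from the notes in the choice of base. All function names and the toy region are made up for illustration.

```r
# Impurity measures for a binary region, as functions of p = pi hat
misclassification <- function(p) 1 - pmax(p, 1 - p)
gini <- function(p) 2 * p * (1 - p)

# Assumption: standard binary Shannon entropy, log base 2,
# with 0 * log(0) taken as 0. Not taken from the source notes.
entropy <- function(p) {
  term <- function(q) ifelse(q == 0, 0, -q * log2(q))
  term(p) + term(1 - p)
}

y_region <- c(1, 1, 1, 0)       # toy responses falling in one region
pi_hat <- mean(y_region == 1)   # proportion of y = 1 in the region

impurities <- c(
  misclassification = misclassification(pi_hat),
  gini = gini(pi_hat),
  entropy = entropy(pi_hat)
)
```

All three measures are 0 for a pure region (pi hat of 0 or 1) and largest when pi hat is 0.5.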
What is one of the downsides of using tree-based methods?
They are prone to overfitting
What is overfitting?
Overfitting is a situation where the model replicates feature patterns seen in a particular dataset instead of “learning” the true feature patterns of the underlying data
What is in-sample-error?
A measure of the performance of a model based on seen data
What is out-of-sample-error?
A measure of the performance of a model based on unseen data
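As a rough sketch of the difference between the two errors, a model can be fitted on one part of a dataset and evaluated on both the seen and the unseen part. The simulated data, the 150/50 split, the seed and the helper name error_rate are all assumptions made for the example.

```r
set.seed(1)  # arbitrary seed so the simulated data is reproducible

# Simulated binary data (illustrative only)
n <- 200
sim <- data.frame(X1 = rnorm(n), X2 = rnorm(n))
sim$Y <- rbinom(n, 1, plogis(1.5 * sim$X1 - sim$X2))

# Hold out 50 observations that the model never sees during fitting
train_idx <- sample(n, size = 150)
train <- sim[train_idx, ]
test <- sim[-train_idx, ]

model <- glm(Y ~ ., data = train, family = binomial(link = "logit"))

# Misclassification rate, predicting class 1 when the probability exceeds 0.5
error_rate <- function(model, data) {
  predicted <- as.numeric(predict(model, newdata = data, type = "response") > 0.5)
  mean(predicted != data$Y)
}

in_sample_error <- error_rate(model, train)      # based on seen data
out_of_sample_error <- error_rate(model, test)   # based on unseen data
```

A model that overfits will typically show a much lower in-sample error than out-of-sample error.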