Final Flashcards
What is a node in a decision tree?
A node, including the root, represents a single input feature and a split point on that feature.
What is a leaf node in a decision tree?
A leaf node contains the output variable y used for prediction.
How do we decide the best split in a DT?
A greedy algorithm is used, where every possible feature and split point is evaluated to minimize a cost function.
Which is the cost function commonly used for regression in DT? Which is commonly used for classification?
For regression we use the sum of squared errors.
For classification we use the Gini index.
What does Gini score measure?
How successful a given split is, i.e. how mixed the classes are between the two groups created by the split.
What is the best and worst case scenario for Gini score?
For a binary class problem:
- Perfect separation results in a Gini score of 0
- A worst-case 50/50 split results in a Gini score of 0.5
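As a minimal sketch (the groups and class labels below are illustrative, not from a real dataset), the weighted Gini score of a candidate split can be computed like this:

```python
# Gini index for the groups produced by a candidate split (a sketch).
def gini_index(groups, classes):
    """Weighted Gini impurity over the groups created by a split."""
    n_total = sum(len(group) for group in groups)
    gini = 0.0
    for group in groups:
        if not group:
            continue
        # Sum of squared class proportions within this group.
        score = sum((group.count(c) / len(group)) ** 2 for c in classes)
        # Weight the group's impurity by its relative size.
        gini += (1.0 - score) * (len(group) / n_total)
    return gini

# Perfect separation -> 0.0; worst-case 50/50 in both groups -> 0.5.
print(gini_index([[0, 0], [1, 1]], [0, 1]))  # 0.0
print(gini_index([[0, 1], [0, 1]], [0, 1]))  # 0.5
```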
What are two hyperparameters used to determine a node is terminal?
- Max tree depth: the maximum tree depth is reached
- Min size: the number of training points in the node is less than or equal to a given threshold
How to predict a classification problem using a DT?
Starting at the root, follow the branch whose condition evaluates true for the data point until a terminal node is reached.
How do we determine the final prediction at a leaf node?
We choose the majority class of that node.
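The two cards above can be sketched together: the toy tree below is a nested dict (an illustrative structure, not a standard format) whose internal nodes test one feature against a split value and whose leaves hold the training labels that reached them, so the prediction is the leaf's majority class.

```python
from collections import Counter

def predict(node, row):
    """Follow the branch that evaluates true until a leaf; vote at the leaf."""
    if isinstance(node, dict):  # internal node: test "row[feature] < value"
        branch = node["left"] if row[node["feature"]] < node["value"] else node["right"]
        return predict(branch, row)
    return Counter(node).most_common(1)[0][0]  # leaf: majority class

tree = {"feature": 0, "value": 2.5,
        "left": [0, 0, 1],   # labels in the left leaf -> majority class 0
        "right": [1, 1]}     # labels in the right leaf -> majority class 1

print(predict(tree, [1.0]))  # 0
print(predict(tree, [3.0]))  # 1
```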
What is cross-entropy and how is it used in DT?
Cross-entropy measures the impurity of a collection of samples. In a DT it is used to compute information gain, which is the difference between the cross-entropy before a split and after it.
We try to maximize information gain when determining the best split.
What is the best and worst case scenario for cross-entropy score?
Best is 0, when the node is pure (all samples belong to one class)
Worst is 1 (for a binary problem), when there is a 50/50 split
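A small sketch of entropy and information gain (the label lists are illustrative), matching the best/worst cases above:

```python
import math

def entropy(labels):
    """Cross-entropy of a collection of labels; in [0, 1] for binary labels."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return sum(-p * math.log2(p) for p in probs)

def information_gain(parent, groups):
    """Entropy before the split minus the size-weighted entropy after it."""
    n = len(parent)
    after = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(parent) - after

print(entropy([0, 0, 1, 1]))  # 1.0 (worst case: 50/50 split)
print(entropy([0, 0, 0, 0]))  # 0.0 (pure node)
print(information_gain([0, 0, 1, 1], [[0, 0], [1, 1]]))  # 1.0 (perfect split)
```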
What are two ways to reduce overfitting of a DT?
- Pruning leaves if doing so reduces the cost on a hold-out (validation) set
- Ensembling (random forests, bagging, boosting)
What do we mean by an ensemble technique?
An ensemble technique combines the results from multiple models to obtain better performance
What is a random forest?
An ensemble of decision trees generated by choosing each node's split from a random subset of the features.
How many features do we use when determining the random split in a random forest decision tree?
Generally a random subset of size sqrt(number of features) is considered at each split.
How do we make predictions in a random forest?
We take the majority vote from all decision trees in the ensemble.
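The majority vote can be sketched as below; the three "trees" are stand-in threshold functions (illustrative, not fitted models), since only the voting step is the point here:

```python
from collections import Counter

def forest_predict(trees, row):
    """Return the majority class over the ensemble's individual predictions."""
    votes = [tree(row) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

# Toy "trees": each just thresholds the first feature at a different value.
trees = [
    lambda row: 1 if row[0] > 2.0 else 0,
    lambda row: 1 if row[0] > 3.0 else 0,
    lambda row: 1 if row[0] > 2.5 else 0,
]

print(forest_predict(trees, [2.7]))  # votes 1, 0, 1 -> majority 1
print(forest_predict(trees, [1.0]))  # votes 0, 0, 0 -> majority 0
```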
What is boosting for decision trees?
An ensembling method where we create multiple decision trees sequentially and put more weight on misclassified samples on subsequent trees.
What is regression?
A method to predict continuous output values based on a set of observations.
What is linear regression at a high level?
A model that assumes a linear relationship between the input variables x and a single output variable y, and uses it to make predictions.
What equation defines linear regression?
The slope-and-intercept function h(x) = θ_0 + θ_1*x, which we call the hypothesis.
This can be extended to higher dimensions through a dot product: h(x) = <1, x_1, x_2, ..., x_N> * <θ_0, θ_1, ..., θ_N>
What are two equations that we use as cost functions for linear regression?
Squared error and mean squared error.
We want to minimize these cost functions when determining our linear function h(x)
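Both cost functions can be sketched for the one-variable hypothesis h(x) = θ_0 + θ_1*x (the sample data below is illustrative):

```python
def squared_error(theta0, theta1, xs, ys):
    """Sum of squared residuals of h(x) = theta0 + theta1 * x."""
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys))

def mean_squared_error(theta0, theta1, xs, ys):
    """Squared error averaged over the number of samples."""
    return squared_error(theta0, theta1, xs, ys) / len(xs)

xs, ys = [1, 2, 3], [2, 4, 6]                 # points exactly on y = 2x
print(mean_squared_error(0.0, 2.0, xs, ys))   # 0.0 (perfect fit)
print(mean_squared_error(0.0, 1.0, xs, ys))   # > 0 (worse parameters)
```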
What is gradient descent for linear regression?
Gradient descent is a means to automatically determine the best parameters θ_i for our linear regression model by iteratively moving the parameters in the direction opposite the gradient of the cost function.
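A minimal batch gradient descent for h(x) = θ_0 + θ_1*x minimizing MSE (the learning rate, step count, and data are illustrative choices):

```python
def gradient_descent(xs, ys, lr=0.05, steps=2000):
    """Fit theta0, theta1 by repeatedly stepping against the MSE gradient."""
    theta0, theta1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of the MSE cost w.r.t. each parameter.
        grad0 = 2 / n * sum(errors)
        grad1 = 2 / n * sum(e * x for e, x in zip(errors, xs))
        # Move against the gradient to reduce the cost.
        theta0 -= lr * grad0
        theta1 -= lr * grad1
    return theta0, theta1

theta0, theta1 = gradient_descent([1, 2, 3, 4], [3, 5, 7, 9])  # data on y = 2x + 1
print(round(theta0, 2), round(theta1, 2))  # 1.0 2.0
```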
What are two ways that we can speed up gradient descent for multi-variable linear regression?
- Normalization
- Standardization
What is normalization?
We scale values between 0 and 1 based on the maximum and minimum in the dataset.
X' = (X − min(X)) / (max(X) − min(X))
What is standardization?
Rescales a feature so that it has a mean of 0 and a standard deviation of 1.
X' = (X − μ(X)) / σ(X)
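The two rescalings above can be sketched side by side (the sample feature values are illustrative):

```python
import math

def normalize(xs):
    """Min-max scaling into [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standardize(xs):
    """Z-score scaling to mean 0 and standard deviation 1."""
    mu = sum(xs) / len(xs)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))
    return [(x - mu) / sigma for x in xs]

xs = [10.0, 20.0, 30.0]
print(normalize(xs))    # [0.0, 0.5, 1.0]
z = standardize(xs)
print(sum(z) / len(z))  # mean 0 (up to floating point)
```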
What is polynomial regression?
In polynomial regression we use different powers of x in our hypothesis. This can lead to better results when the data is not linear.
Can cause underfitting if the degree is too low and overfitting if the degree is too high
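A short sketch of fitting powers of x, assuming numpy is available (`numpy.polyfit` does the least-squares fit; the quadratic data is illustrative):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x + 3.0 * x ** 2       # data lying exactly on 3x^2 + 2x + 1

# Degree-2 fit recovers the generating coefficients (highest power first).
coeffs = np.polyfit(x, y, deg=2)
print(np.round(coeffs, 2))             # [3. 2. 1.]

# A degree that is too low underfits: a line leaves a nonzero residual.
_, residuals, *_ = np.polyfit(x, y, deg=1, full=True)
print(residuals[0] > 0)                # True: a line cannot fit a parabola
```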
What is feature selection for linear regression?
It is based on the fact that not all features are equally important. We can test the null hypothesis H_0: θ_i = 0 for each coefficient; the p-value is the probability of observing a coefficient at least this extreme if the feature truly had no effect.
What is the main difference between logistic and linear regression?
The output is a discrete set of classes rather than a continuous value.
What is the function used for logistic regression?
The sigmoid function
S(z) = 1/(1+e^-z)
where z=θ_0 + x_1θ_1 + ... x_nθ_n
How do we make predictions in logistic regression?
We define a threshold: the sigmoid produces values between 0 and 1, so we choose a cutoff (e.g. 0.5) at or above which we predict class 1, and below which we predict class 0.
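The sigmoid and the thresholded prediction can be sketched together (the parameter vector below is an illustrative choice, not a fitted model):

```python
import math

def sigmoid(z):
    """S(z) = 1 / (1 + e^-z), mapping any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(thetas, xs, threshold=0.5):
    """Predict class 1 when sigmoid(theta0 + theta1*x1 + ...) >= threshold."""
    z = thetas[0] + sum(t * x for t, x in zip(thetas[1:], xs))
    return 1 if sigmoid(z) >= threshold else 0

print(sigmoid(0.0))                  # 0.5, the midpoint of the S-curve
print(predict([-4.0, 2.0], [3.0]))   # z = 2  -> sigmoid ~ 0.88 -> class 1
print(predict([-4.0, 2.0], [1.0]))   # z = -2 -> sigmoid ~ 0.12 -> class 0
```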
What is cross-entropy or log loss?
- It is a loss function that measures the difference between the actual probability distribution and the predicted probability distribution.
- It is split into two cost functions, one per class label (y = 0 and y = 1)
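A sketch of binary log loss, with the two per-class branches made explicit (the sample predictions are illustrative):

```python
import math

def log_loss(y_true, y_pred):
    """Average binary cross-entropy between true labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        # -log(p) when the true class is 1, -log(1 - p) when it is 0.
        total += -math.log(p) if y == 1 else -math.log(1.0 - p)
    return total / len(y_true)

print(log_loss([1, 0], [0.9, 0.1]))  # low loss: confident, correct predictions
print(log_loss([1, 0], [0.1, 0.9]))  # high loss: confident, wrong predictions
```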
What do we generally use to measure the performance of a prediction model?
A confusion matrix
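A minimal confusion matrix for binary predictions (the label sequences are illustrative), counting true/false positives and negatives:

```python
def confusion_matrix(y_true, y_pred):
    """2x2 counts of true/false positives and negatives for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn}

cm = confusion_matrix([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(cm)  # {'tp': 2, 'tn': 1, 'fp': 1, 'fn': 1}

# Accuracy is one metric derived directly from the matrix.
accuracy = (cm["tp"] + cm["tn"]) / 5
print(accuracy)  # 0.6
```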