Final Flashcards

1
Q

What is a node in a decision tree?

A

A node, including the root, represents a single input feature and a split point on that feature.

2
Q

What is a leaf node in a decision tree?

A

A leaf node contains the output variable y used for prediction.

3
Q

How do we decide the best split in a DT?

A

A greedy algorithm is used, where every possible feature and split point is evaluated to minimize a cost function.

4
Q

Which cost function is commonly used for regression in a DT? Which is commonly used for classification?

A

For regression we use the sum of squared errors.

For classification we use the Gini index.

5
Q

What does Gini score measure?

A

How successful a given split is, i.e. how mixed the classes are between the two groups created by the split.

6
Q

What is the best and worst case scenario for Gini score?

Binary class problem

A

Perfect separation results in a Gini score of 0

Worst-case split of 50/50 results in a Gini score of 0.5
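
A minimal Python sketch (not from the deck; the helper name gini_index is my own) showing how the score is computed from the class labels in each group:

```python
def gini_index(groups, classes):
    """Weighted Gini impurity for the groups produced by a split."""
    n_total = sum(len(group) for group in groups)
    gini = 0.0
    for group in groups:
        if not group:
            continue
        # sum of squared class proportions within this group
        score = sum((group.count(c) / len(group)) ** 2 for c in classes)
        # weight the group's impurity by its relative size
        gini += (1.0 - score) * (len(group) / n_total)
    return gini

# groups are lists of class labels on each side of the split
print(gini_index([[0, 0], [1, 1]], classes=[0, 1]))  # perfect split: 0.0
print(gini_index([[0, 1], [0, 1]], classes=[0, 1]))  # 50/50 mix: 0.5
```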

7
Q

What are two hyperparameters used to determine a node is terminal?

A
  1. Max tree depth: the maximum tree depth is reached
  2. Min size: the number of training points in the node is less than or equal to a given threshold (see the sketch below)
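
In scikit-learn these two stopping criteria correspond to the max_depth and min_samples_split hyperparameters; a minimal sketch with illustrative values:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# A node becomes terminal once depth 3 is reached, or when it holds
# fewer than 10 training points (both values purely illustrative).
clf = DecisionTreeClassifier(max_depth=3, min_samples_split=10).fit(X, y)
```
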
8
Q

How do we make predictions for a classification problem using a DT?

A

Starting at the root, follow the branch whose condition evaluates to true for the data point, repeating until we reach a terminal node.

9
Q

How do we determine the final prediction at a leaf node?

A

We choose the majority class of that node.

10
Q

What is cross-entropy and how is it used in DT?

A

Cross-entropy is a measure of the purity of a collection of samples. In DTs it is used through information gain, which is the difference between the cross-entropy before a split and after it.

We try to maximize information gain when determining the best split.
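
A small Python sketch of both ideas (function names are my own); entropy is highest for a 50/50 mix, and information gain is the drop in entropy achieved by the split:

```python
import math

def entropy(labels):
    """Cross-entropy (Shannon entropy) of a collection of class labels."""
    n = len(labels)
    return -sum(labels.count(c) / n * math.log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(parent, children):
    """Entropy before the split minus the weighted entropy after it."""
    n = len(parent)
    after = sum(len(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - after

# splitting a 50/50 parent into two pure children gives the maximal gain
print(information_gain([0, 0, 1, 1], [[0, 0], [1, 1]]))  # 1.0
```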

11
Q

What is the best and worst case scenario for cross-entropy score?

A

Best is 0, when all points in a node belong to a single class

Worst is 1, when there is a 50/50 split (binary case)

12
Q

What are two ways to reduce overfitting of a DT?

A
  1. Pruning: remove leaves if doing so reduces the cost on the test set
  2. Ensembling (random forests, boosting, bagging)
13
Q

What do we mean by an ensemble technique?

A

An ensemble technique combines the results from multiple models to obtain better performance

14
Q

What is a random forest?

A

An ensemble of decision trees, generated by randomly selecting the features considered for the split at each node.

15
Q

How many features do we use when determining the random split in a random forest decision tree?

A

Generally we use a subset of size sqrt(features) for each DT.

16
Q

How do we make predictions in a random forest?

A

We take the majority vote from all decision trees in the ensemble.
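
A scikit-learn sketch tying the last few cards together; max_features='sqrt' gives each split a random subset of sqrt(n_features) candidates, and predict() aggregates the trees' votes:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# 100 trees, each split drawn from a sqrt-sized random feature subset
rf = RandomForestClassifier(n_estimators=100, max_features='sqrt').fit(X, y)

# the prediction is the aggregated vote across all trees
print(rf.predict(X[:3]))
```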

17
Q

What is boosting for decision trees?

A

An ensembling method where we create multiple decision trees sequentially and put more weight on misclassified samples on subsequent trees.

18
Q

What is regression?

A

A method to predict continuous output values based on a set of observations.

19
Q

What is linear regression at a high level?

A

A model that assumes a linear relationship between input variables x and a single output variable y to make predictions

20
Q

What equation defines linear regression?

A

The slope-and-intercept function h(x) = θ_0 + θ_1*x, which we call the hypothesis.

This can be expanded to higher dimensions through the dot product <1, x_1, x_2, ..., x_N> * <θ_0, θ_1, θ_2, ..., θ_N>
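
A NumPy sketch of the hypothesis as that dot product, with illustrative parameter values; note the leading 1 that pairs with the intercept θ_0:

```python
import numpy as np

theta = np.array([0.5, 2.0, -1.0])  # [theta_0, theta_1, theta_2]
x = np.array([3.0, 4.0])            # one data point with two features

# prepend 1 so theta_0 acts as the intercept in the dot product
h = np.dot(np.concatenate(([1.0], x)), theta)
print(h)  # 0.5 + 2.0*3.0 - 1.0*4.0 = 2.5
```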

21
Q

What are two equations that we use as cost functions for linear regression?

A

Squared error and mean squared error.

We want to minimize these cost functions when determining our linear function h(x)

22
Q

What is gradient descent for linear regression?

A

Gradient descent is a means to automatically determine the best parameters θ_i for our linear regression model, by repeatedly moving the parameters in the direction opposite the gradient of the error until the error is minimized.
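
A minimal NumPy sketch under the deck's setup (MSE cost, a bias column for θ_0); the learning rate and iteration count are illustrative:

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, n_iters=1000):
    """Fit theta for h(x) = X @ theta by minimizing mean squared error."""
    X = np.column_stack([np.ones(len(X)), X])  # bias column for theta_0
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        error = X @ theta - y
        grad = (2 / len(y)) * X.T @ error      # gradient of the MSE
        theta -= lr * grad                     # step against the gradient
    return theta

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([3.0, 5.0, 7.0])                  # generated by y = 1 + 2x
print(gradient_descent(X, y))                  # approx [1.0, 2.0]
```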

23
Q

What are two ways that we can speed up gradient descent for multi-variable linear regression?

A
  1. Normalization
  2. Standardization
24
Q

What is normalization?

A

We scale values between 0 and 1 based on the maximum and minimum in the dataset.

X' = (X − min(X)) / (max(X) − min(X))

25
Q

What is standardization?

A

Modifies a feature so that it has a mean of 0 and a standard deviation of 1.

X' = (X − μ(X)) / σ(X)
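
Both transforms in a few lines of NumPy (the data values are illustrative):

```python
import numpy as np

X = np.array([2.0, 4.0, 6.0, 8.0])

X_norm = (X - X.min()) / (X.max() - X.min())  # normalization: values in [0, 1]
X_std = (X - X.mean()) / X.std()              # standardization: mean 0, std 1

print(X_norm)                      # [0.    0.333 0.667 1.   ]
print(X_std.mean(), X_std.std())   # ~0.0 and 1.0
```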

26
Q

What is polynomial regression?

A

In polynomial regression we use different powers of x in our hypothesis. This can lead to better results when the data is not linear.

Can cause underfitting if the degree is too low and overfitting if the degree is too high

27
Q

What is feature selection for linear regression?

A

It is based on the fact that not all features are equally important. We can run a hypothesis test per coefficient with null hypothesis H_0: θ_i = 0; the p-value is the probability of observing an estimate at least as extreme as ours if θ_i really were 0, so a small p-value is evidence that the feature matters.

28
Q

What is the main difference between logistic and linear regression?

A

The output is a discrete class label rather than a continuous value.

29
Q

What is the function used for logistic regression?

A

The sigmoid function

S(z) = 1/(1 + e^(-z)), where z = θ_0 + θ_1*x_1 + ... + θ_n*x_n

30
Q

How do we make predictions in logistic regression?

A

The sigmoid produces values between 0 and 1, so we define a threshold (e.g. 0.5) above which we predict class 1 and below which we predict class 0.
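
A NumPy sketch of this thresholding (the parameter values are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(X, theta, threshold=0.5):
    """Predict class 1 where the sigmoid output reaches the threshold."""
    z = np.column_stack([np.ones(len(X)), X]) @ theta  # theta_0 is the bias
    return (sigmoid(z) >= threshold).astype(int)

theta = np.array([-1.0, 2.0])        # illustrative parameters
X = np.array([[0.0], [1.0], [2.0]])
print(predict(X, theta))             # [0 1 1]
```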

31
Q

What is cross-entropy or log loss?

A
  • It is a loss function that measures the difference between the actual and the predicted probability distributions.
  • It splits into two cost terms, one for the y = 1 case and one for the y = 0 case (see the sketch below)
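
A minimal sketch of the two cases (the function name is my own):

```python
import math

def log_loss(y, p):
    """Binary cross-entropy: one term for y = 1, another for y = 0."""
    return -math.log(p) if y == 1 else -math.log(1 - p)

print(log_loss(1, 0.9))  # ~0.105: confident and correct -> small loss
print(log_loss(1, 0.1))  # ~2.303: confident and wrong -> large loss
```
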
32
Q

What do we generally use to measure the performance of a prediction model?

A

A confusion matrix

33
Q

How do we do multiclass classification with logistic regression?

A

Use a one-vs-rest approach: we train k classifiers, one for each output class, and use the output of the classifier with the highest probability.

34
Q

What is the goal of regularization in logistic regression?

A

To penalize overfitting or model complexity

35
Q

What is Ridge (L2 regularization)?

A

Tries to keep the values of the model parameters (coefficients) small.

Penalization is the sum of the squares of the parameters

36
Q

What is LASSO (L1 regularization)?

A

Also tries to keep parameter values small, and drives many of them exactly to zero.

Penalization is the sum of the absolute values of the parameters
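
A scikit-learn sketch contrasting the two penalties (alpha sets the regularization strength; all values are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)  # L2: shrinks coefficients toward zero
lasso = Lasso(alpha=1.0).fit(X, y)  # L1: drives many coefficients to zero

print(ridge.coef_)  # small but nonzero values
print(lasso.coef_)  # typically contains exact zeros
```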

37
Q

What is an ROC curve?

A

The ROC (Receiver Operating Characteristic) curve measures the performance of a classification model across various threshold settings.

38
Q

Describe the parts of an ROC curve?

A
  • TPR (true positive rate) is on the y-axis
  • FPR (false positive rate) is on the x-axis
  • The ideal point is at the top-left of the plot: no false positives (FPR = 0) and every actual positive detected (TPR = 1)
  • The area under the curve (AUC) measures the capacity to distinguish between classes (see the sketch below)
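
A scikit-learn sketch computing the curve and its AUC on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

probs = LogisticRegression().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, probs)  # one (FPR, TPR) per threshold
print(roc_auc_score(y_te, probs))              # closer to 1.0 = better separation
```
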
39
Q

What is the purpose of an SVM?

A

A support vector machine is a class of supervised model used for regression, classification and outlier detection

40
Q

How do SVMs work?

A

They find a line, curve, or plane that best separates the classes

41
Q

Why is simply drawing a line between classes not the best strategy for dividing classes? Give an alternative.

A

On the training data, many lines may divide the classes perfectly, but some will generalize better than others to the test data.

The alternative is maximizing the margin: the maximum width of a band around the line before it reaches points from either class

42
Q

How do we predict classes in an SVM?

A

If the signed distance from the decision boundary is negative, the point is in one class; if positive, it is in the other.

43
Q

What are the support vectors of an SVM?

A

The data points in a dataset that lie closest to the decision boundary

44
Q

What is the difference between a soft and hard margin for an SVM?

A
  • Hard margin: the decision boundary cannot be violated (the data must be linearly separable)
  • Soft margin: the decision boundary can be violated; misclassification is minimized

A soft margin is characterized by a slack variable

45
Q

True or False: SVMs are sensitive to scaling

A

True

46
Q

What is a kernel for an SVM model?

A

A kernel can transform data into higher dimensions to make it linearly separable.

e.g. pushing points up along the z-axis when they are near the center and down when they are further from it

47
Q

What can we do in an SVM if our domain is not linearly seperable?

A

We may need to consider a different kernel to create linear separation in higher dimensions.

48
Q

What role does the C parameter play in determining the margin width of an SVM classifier?

A

It trades off maximizing the margin width against minimizing the classification error.

  • A large C results in a smaller margin width, enforcing strict classification (hard margin)
  • A small C allows a larger margin width, permitting some misclassifications (soft margin); see the sketch below
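
A scikit-learn sketch of the trade-off (the C values are illustrative extremes):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

strict = SVC(kernel='linear', C=100.0).fit(X, y)  # large C: narrow, strict margin
loose = SVC(kernel='linear', C=0.01).fit(X, y)    # small C: wide, forgiving margin
```
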
49
Q

True or False: SVM can be used for regression

A

True

Scikit-learn provides the SVR model.

50
Q

What idea gives rise to KNN?

A

Similar data points tend to belong to the same class

51
Q

What four properties should a distance function have?

A
  1. Non-negativity: dist(a, b) ≥ 0
  2. Triangle inequality: dist(a, b) + dist(b, c) ≥ dist(a, c)
  3. Identity: dist(a, b) = 0 if and only if a = b
  4. Symmetry: dist(a, b) = dist(b, a)
52
Q

What’s the most common distance function for KNN?

A

Euclidean distance
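
A one-function Python sketch; it satisfies all four properties from the previous card:

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean((0, 0), (3, 4)))  # 5.0
```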

54
Q

What is Voronoi tessellation?

A

Tessellation using polygons, where each polygon represents the area that is closer to one specific data point than to any other data point in the dataset.

55
Q

What is a Voronoi region or a decision boundary?

A

It is a contiguous section of the Voronoi tessellation in which the same target is predicted.

56
Q

How does the value of k affect the KNN algorithm and its performance?

A
  • We take the majority vote among the classes of the k nearest neighbours
  • A low k value can lead to overfitting, while a higher k can result in underfitting
57
Q

What design problem can happen when we have a k that is too high? What is a solution?

A

We consider too many neighbours, including some that are very far away! In the extreme it becomes majority rules over the entire dataset.

A solution is weighted KNN

58
Q

What is weighted KNN?

A

Take the distance into account when making predictions, e.g. by weighting each neighbour's vote by the inverse of its distance.
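
In scikit-learn this is the weights='distance' option, which weights each neighbour's vote by the inverse of its distance; a minimal sketch:

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# closer neighbours count more, softening the effect of a large k
knn = KNeighborsClassifier(n_neighbors=15, weights='distance').fit(X, y)
print(knn.predict(X[:3]))
```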

59
Q

What is KMeans?

A

KMeans is a clustering algorithm which attempts to create k clusters.

60
Q

What does KMeans seek to reduce?

A

Intra-cluster variance, i.e. the within-cluster sum of squares

61
Q

What are the steps for KMeans?

A
  1. Begin with k random centroids
  2. Assign every data point to the nearest centroid
  3. Calculate the new centroid from the assigned data points
  4. Repeat steps 2 and 3 until reaching a stopping criterion (see the sketch below)
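
A NumPy sketch that mirrors these four steps (it ignores the empty-cluster edge case, so it is illustrative rather than production code):

```python
import numpy as np

def kmeans(X, k, n_iters=100):
    # 1. begin with k random centroids sampled from the data
    centroids = X[np.random.choice(len(X), k, replace=False)]
    for _ in range(n_iters):
        # 2. assign every point to the nearest centroid
        dists = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. recompute each centroid from its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  for j in range(k)])
        # 4. stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```
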
62
Q

What is PCA?

A

A technique used for dimensionality reduction. This method transforms a large set of variables into a smaller one that still contains most of the information in the large set.

63
Q

Why dimensionality reduction?

A

In many real-world applications, data comes in the form of high-dimensional vectors, which can be difficult to analyze and visualize.

Dimensionality reduction helps us visualize and analyze the data while keeping as much information as possible.

64
Q

What is feature selection?

A

Feature selection means to select a subset containing the most relevant features to use in training a model

65
Q

True or False: Feature selection is a form of dimensionality reduction?

A

True

66
Q

What are three benefits of feature selection?

A
  1. Reduces training time
  2. Reduces overfitting
  3. Improves accuracy - less misleading data
67
Q

What is a filter based method for feature selection?

A

Use statistical techniques to gauge the relevance of the input variables to the target variable, without training a model.

e.g. chi-squared, correlation and mutual information

68
Q

What are wrapper methods for feature selection?

A

Create many models with different subsets of the input features, and select the subset that produces the best model.

e.g. Recursive Feature Elimination (RFE)

69
Q

What are intrinsic or embedded methods for feature selection?

A

Feature selection performed automatically by some machine learning algorithms as part of learning the model, e.g. regularization

Penalize the model for using irrelevant features

70
Q

What is the elbow method for finding k in KMeans?

A

Plot the cost (e.g. the mean squared distance from points to their centroid) against the number of clusters; at a certain k the curve kinks and the benefit of increasing k further drops off (see the sketch below).
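
A scikit-learn sketch; inertia_ is the within-cluster sum of squares, and plotting it against k reveals the kink:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# the cost drops sharply up to the true number of clusters, then flattens
for k in range(1, 9):
    cost = KMeans(n_clusters=k, n_init=10).fit(X).inertia_
    print(k, cost)
```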

71
Q

What are the two types of hierarchical clustering?

A
  1. Agglomerative (bottom-up): start with each point as its own cluster and merge nearby clusters
  2. Divisive (top-down): start with one big cluster and split it into subclusters
72
Q

What are the steps in agglomerative clustering?

A
  1. Initialize each data point as a cluster
  2. Find distances between all clusters
  3. Merge closest two clusters into one
  4. Repeat steps 2 and 3 until a stopping criterion is reached (e.g. a distance threshold)
73
Q

What is the name of the graph we can use to visualize agglomerative clustering? What are the axes?

A

Dendrogram
* x axis represents clusters
* y axis represents cluster distance (see the sketch below)
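
A SciPy sketch covering this card and the previous one: linkage performs the agglomerative merging and dendrogram draws the merge tree (requires matplotlib):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

X = np.random.randn(20, 2)

Z = linkage(X, method='ward')  # one row per merge, with the merge distance
dendrogram(Z)                  # clusters on the x-axis, distance on the y-axis
plt.show()
```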

74
Q

What are the two main differences between agglomerative and KMeans clustering?

A
  1. With KMeans we need to define the number of clusters beforehand
  2. With agglomerative clustering the clusters can take arbitrary shapes
75
Q

What does PCA try to minimize?

A

PCA tries to minimize the projection error when reducing the dimension from n down to a smaller number of components.
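
A scikit-learn sketch projecting 4-dimensional data down to 2 components:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

pca = PCA(n_components=2)             # reduce from n=4 dimensions to 2
X_2d = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # share of variance kept per component
```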

76
Q

Which Bayes formula describes the probability of A given B?

A

P(A|B) = (P(B|A) * P(A)) / P(B)
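
A tiny worked example with hypothetical numbers, purely to illustrate the formula:

```python
# hypothetical probabilities for illustration only
p_a = 0.01          # P(A): prior, e.g. P(disease)
p_b_given_a = 0.9   # P(B|A): likelihood, e.g. P(positive test | disease)
p_b = 0.05          # P(B): evidence, e.g. P(positive test)

p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)  # 0.18
```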

77
Q

How do you calculate P(headache, no fever, vomiting)

A

This is a joint probability, so we count the rows where there is Headache AND No Fever AND Vomiting, and divide by the total number of rows.

78
Q

How would you calculate P(headache, no fever, vomiting | meningitis)

A

By the chain rule this is equal to P(headache | meningitis) * P(no fever | headache, meningitis) * P(vomiting | headache, no fever, meningitis)

Start from the outermost condition and narrow down: in the cases where they have headache AND meningitis, is there also no fever?

79
Q

How do we answer a Bayes query M(q)?

A

Find P(t = l | q) for every level l (in the binary case just true or false) and return the level with the largest probability.

80
Q

Why "naive" in naive Bayes?

A

Because we assume conditional independence between features.

81
Q

What is bagging for a random forest model?

A

Bagging (bootstrap aggregating) means making new datasets of the same size as the original by randomly picking data points from the original dataset with replacement, i.e. allowing duplicates.