FinalExamReview-Yaseen Flashcards

1
Q

What is supervised learning?

A

Supervised Learning:

  • Goal is to make accurate predictions for new, never-before-seen data
  • We have input and output pairs to “learn” from

Examples:

  • k-Nearest Neighbors
  • Linear Models
  • Naive Bayes Classifiers
  • Decision Trees
  • Ensembles of Decision Trees
  • Kernelized Support Vector Machines
  • Neural Networks (Deep Learning)
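A minimal supervised-learning sketch with scikit-learn (the dataset and k=3 are illustrative choices, not from the card):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# input/output pairs to learn from, with held-out data to test generalization
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
accuracy = knn.score(X_test, y_test)  # accuracy on never-before-seen data
```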
2
Q

What is the primary challenge in unsupervised learning?

A

Challenges of Unsupervised Learning:

  • no outcome to compare to (how well did we do? nobody knows!)
  • We must manually inspect the results to see how we did.
3
Q

What is a common utilization of unsupervised algorithms?

A

Exploratory setting:

-useful for changing the representation of the data before applying a supervised learning method

4
Q

What are the two primary types of unsupervised learning?

A

Unsupervised Learning Types
  • Unsupervised transformations: create a new representation of the data that might be easier for humans or other machine learning algorithms to understand than the original representation.
    (dimensionality reduction, topic extraction)

  • Clustering algorithms: partition data into distinct groups of similar items.
    (like classification in supervised learning, but with no known outputs to compare against)
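A minimal sketch of both types on toy blob data (the dataset and component/cluster counts are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

X, _ = make_blobs(n_samples=60, centers=3, n_features=4, random_state=0)

# unsupervised transformation: re-represent the data in fewer dimensions
X_pca = PCA(n_components=2).fit_transform(X)

# clustering: partition the data into distinct groups of similar items
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
```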
5
Q

Define F1 score

A

The harmonic mean of precision and sensitivity (recall):

F1=(2TP)/(2TP+FP+FN)

Where:
TP=# True Positives
FP=# False Positives
FN=# False Negatives
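A quick check of the formula with hypothetical confusion-matrix counts, showing it equals the harmonic mean of precision and recall:

```python
# hypothetical counts for illustration
tp, fp, fn = 40, 10, 20  # true positives, false positives, false negatives

precision = tp / (tp + fp)          # PPV
recall = tp / (tp + fn)             # TPR / sensitivity
f1 = 2 * tp / (2 * tp + fp + fn)    # the card's formula
harmonic = 2 * precision * recall / (precision + recall)
```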

6
Q

Define True Positive Rate

A

TPR=TP/P=TP/(TP+FN)

Where:
TP=# True Positives
FN=# False Negatives
P=Total Actual Positives

Note: TPR (True Positive Rate)=Sensitivity=Recall=Hit Rate

7
Q

Define True Negative Rate

A

TNR=TN/N=TN/(TN+FP)

Where:
TN=# True Negatives
FP=# False Positives
N=Total Actual Negatives

Note: TNR (True Negative Rate)=Specificity=Selectivity

8
Q

Define Positive Predictive Value

A

PPV=TP/(TP+FP)

Where:
TP=# True Positives
FP=# False Positives

Positive Predictive Value = Precision

9
Q

Define Negative Predictive Value

A

NPV=TN/(TN+FN)

Where:
TN=# True Negatives
FN=# False Negatives
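The four rates (TPR, TNR, PPV, NPV) can be checked together on one hypothetical confusion matrix:

```python
# hypothetical confusion-matrix counts for illustration
tp, fn = 40, 10   # actual positives: P = 50
tn, fp = 80, 20   # actual negatives: N = 100

tpr = tp / (tp + fn)  # sensitivity / recall
tnr = tn / (tn + fp)  # specificity
ppv = tp / (tp + fp)  # precision
npv = tn / (tn + fn)
```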

10
Q

What are the benefits and drawbacks of k-fold cross-validation as an evaluation metric?

A

k-fold Cross Validation

Benefits:

  • Because there are multiple splits in the process, we have an idea of how the model might perform in best case and worst case scenarios
  • More effective use of the data

Disadvantage:

  • computational cost
  • we fit k models instead of a single model, so it is roughly k times slower
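A minimal k-fold cross-validation sketch (the dataset and model are illustrative); the per-fold scores give a sense of the best- and worst-case performance the card mentions:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
# one accuracy score per fold; the spread hints at best/worst-case behavior
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
```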
11
Q

What is the distinction between stratified k-fold cross-validation and k-fold cross-validation?

A

Data is split so that proportions between classes are the same in each fold as they are in the entire dataset, then k-fold cross-validation is performed.
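A sketch showing the preserved class proportions on a deliberately imbalanced toy dataset (90/10 split is an illustrative choice):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# imbalanced toy labels: 90 of class 0, 10 of class 1
X = np.zeros((100, 1))
y = np.array([0] * 90 + [1] * 10)

skf = StratifiedKFold(n_splits=5)
# every test fold keeps the 90/10 proportion: 18 zeros and 2 ones
test_class1_counts = [int((y[test] == 1).sum()) for _, test in skf.split(X, y)]
```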

12
Q

What is leave-one-out cross validation and what are the advantages and disadvantages of using it?

A

Leave-one-out Cross Validation:
Same as k-fold cross-validation, but each split has a single data point in the test set (so k equals the number of data points).

Advantage:
-better estimates in small datasets
Disadvantage:
-very time consuming in large datasets
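A tiny sketch with scikit-learn's LeaveOneOut (10 points is an illustrative size):

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

X = np.arange(10).reshape(-1, 1)
# one split per data point; each test set holds exactly one point
splits = list(LeaveOneOut().split(X))
```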

13
Q

What is shuffle-split cross validation and what are the advantages and disadvantages of using it?

A

shuffle-split cross validation:
Each split samples “train_size” many points for the training set and “test_size” many points for the test set; within a split these sets are disjoint. The splitting is repeated “n_splits” times, and the model is trained and evaluated on each split.

Advantage:

  • allows for control over the number of iterations independently of training and test sizes
  • allows for using part of the data for each iteration (subsampling)
  • subsampling is particularly useful for large datasets

Disadvantage:
- because the splits are sampled independently, some points may never appear in any test set while others are tested repeatedly
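A minimal ShuffleSplit sketch with subsampling (sizes and n_splits are illustrative): each split uses only 70 of the 100 points, and train/test are disjoint within a split.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(100).reshape(-1, 1)
# each split: 50 training points, 20 disjoint test points; 30 unused (subsampling)
ss = ShuffleSplit(n_splits=10, train_size=0.5, test_size=0.2, random_state=0)
splits = list(ss.split(X))
```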

14
Q

What is cross validation with groups and why do we use it?

A

Add a group identifier to each data point, and never split a group across the test/train sets.

We use it so that the same person/group does not appear in both the training and test sets, which could otherwise inflate our results.
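A minimal GroupKFold sketch (the "4 patients, 3 samples each" setup is illustrative): no group's samples are split across train and test.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(12).reshape(-1, 1)
y = np.zeros(12)
groups = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])  # e.g. 4 patients

gkf = GroupKFold(n_splits=4)
# verify that no group appears in both the training and test indices
no_leakage = all(set(groups[train]).isdisjoint(groups[test])
                 for train, test in gkf.split(X, y, groups=groups))
```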

15
Q

What is Grid Search and why do we use it?

A

Grid Search is a tool in scikit-learn that allows us to try all possible combinations of parameters of interest. We can then return the “best” result according to some defined evaluation criterion such as accuracy, F1 score, or AUC.
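A minimal GridSearchCV sketch (dataset, model, and parameter values are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# try every combination of C and gamma with 5-fold cross-validation
param_grid = {"C": [0.01, 0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
grid = GridSearchCV(SVC(), param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)
```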

16
Q

What is the danger of overfitting the parameters and the use of the validation set?

A

Overfitting and Validation set

  • We use the train dataset to fit our parameters and select the best set of parameters in our grid search so the train dataset cannot be used to evaluate the usefulness of our model.
  • When using grid search we are exploring many parameter settings, so we may overfit to the data used for selection
  • solution: split data three times
    (1) training set to fit model
    (2) validation set to select parameters
    (3) test set to evaluate the model
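The three-way split can be sketched as follows (dataset, model, and the C grid are illustrative): parameters are chosen on the validation set, and the test set is touched only once at the end.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
# (1)+(3): carve off the test set; (2): split the rest into train/validation
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_trainval, y_trainval, random_state=1)

# select the parameter on the validation set ...
best_score, best_C = max(
    (SVC(C=C).fit(X_train, y_train).score(X_valid, y_valid), C)
    for C in [0.01, 0.1, 1, 10, 100])
# ... then refit on train+validation and evaluate once on the test set
test_score = SVC(C=best_C).fit(X_trainval, y_trainval).score(X_test, y_test)
```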
17
Q

How do we evaluate regression performance?

A

R² (coefficient of determination) and Mean Squared Error (MSE)
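Both metrics on a hypothetical set of true/predicted values:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# hypothetical regression targets and predictions
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mse = mean_squared_error(y_true, y_pred)  # mean of squared errors
r2 = r2_score(y_true, y_pred)             # 1 is perfect, 0 is predicting the mean
```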

18
Q

When is accuracy not a good metric to use?

A

Accuracy is a poor measure of predictive performance when we have imbalanced data: always predicting the majority class can still score highly.

We must consider what are the consequences of our mistakes?
- Type I error vs Type II error (in our situation which is worse)

Consider using F1 score - trade-off between optimizing recall and optimizing precision

Always consider if you value precision or recall more

19
Q

What is the area under the precision-recall curve?

A

Average Precision

20
Q

What is the usefulness of the precision-recall curve?

A

Allows us to look at all possible thresholds, i.e. all trade-offs between precision and recall, at once
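A minimal sketch of the curve and its area, average precision (dataset and classifier are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.decision_function(X_test)

# one (precision, recall) point per threshold
precision, recall, thresholds = precision_recall_curve(y_test, scores)
ap = average_precision_score(y_test, scores)  # area under the PR curve
```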

21
Q

What is the usefulness of the Receiver operating characteristics (ROC) and AUC (area under the curve)?

A

Commonly used to analyze the behavior of classifiers at different thresholds
Relationship between false positive rate and true positive rate

AUC summarizes ROC curve into single number
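A minimal ROC/AUC sketch on the same kind of toy setup (dataset and classifier are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.decision_function(X_test)

# one (FPR, TPR) point per threshold
fpr, tpr, thresholds = roc_curve(y_test, scores)
auc = roc_auc_score(y_test, scores)  # summarizes the ROC curve as one number
```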

22
Q

What is the advantage of using a pipeline over the initial methods we used?

A

Pipeline

  • allows gluing multiple processing steps into a single scikit-learn estimator
  • chaining preprocessing steps with a supervised model
  • reduces the code needed for preprocessing and classification process
  • we can combine Pipelines and grid search
Pipeline Example:
	from sklearn.pipeline import Pipeline
	from sklearn.preprocessing import MinMaxScaler
	from sklearn.svm import SVC
	pipe = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])
Call fit, predict, and score on the pipeline as on any other estimator:
	pipe.fit(X_train, y_train)
	pipe.score(X_test, y_test)

Combo Example (grid search over the whole pipeline):
	grid = GridSearchCV(pipe, param_grid=param_grid, cv=5)

23
Q

What is the only requirement of a pipeline estimator?

A

All but the last step need to have a transform method, so they can produce a new representation of the data that can be used in the next step.

24
Q

What are two primary ways we have used GridSearch and/or Pipeline?

A

(1) Grid-Searching Preprocessing Steps and Model Parameters:
- encapsulate all the processing steps in our workflow in a single estimator
- adjust the preprocessing parameters using the outcome of a supervised task
- searching over preprocessing parameters together with model parameters is a very powerful strategy

(2) Grid-Searching Which Model To Use
- combine GridSearchCV and Pipeline: search over the actual steps being performed in the pipeline
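A sketch of (1), grid-searching preprocessing and model parameters together via the "step__parameter" naming convention (dataset, steps, and grid values are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = make_regression(n_samples=100, n_features=3, noise=10, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()),
                 ("poly", PolynomialFeatures()),
                 ("ridge", Ridge())])
# "step__parameter" lets one search tune preprocessing and model together
param_grid = {"poly__degree": [1, 2, 3], "ridge__alpha": [0.1, 1, 10]}
grid = GridSearchCV(pipe, param_grid=param_grid, cv=5)
grid.fit(X, y)
```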

25
Q

What are some ways to represent Categorical Data?

A

One-Hot Encoding:

  • most common way to represent categorical variables
  • replace categorical variable with new feature (one per category) and set each to 0 or 1

Numbers to encode Categorical Variables:

  • use integers for each categorical value
  • should not necessarily be treated as continuous or ordered
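A one-hot encoding sketch with pandas (the toy DataFrame is illustrative): each category becomes its own 0/1 column, and numeric columns pass through unchanged.

```python
import pandas as pd

df = pd.DataFrame({"city": ["NYC", "Paris", "NYC"], "age": [25, 31, 47]})
# one new 0/1 feature per category of "city"
dummies = pd.get_dummies(df)
```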
26
Q

What is Binning and when is it useful?

A

Binning (Discretization):

  • partition a feature’s range into a fixed number of “bins”
  • transform the continuous input feature into a categorical one based on which bin each data point falls in
  • can then use one-hot encoding on the binned feature

Notes:

  • linear models become more flexible
  • binning can increase power
  • decision trees can build a much more complex model of the data depending on its representation
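A binning sketch with scikit-learn's KBinsDiscretizer (toy values; 4 equal-width bins is an illustrative choice). Each continuous value becomes a one-hot row indicating its bin:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[-3.0], [-1.0], [0.5], [2.0], [2.9]])
# 4 equal-width bins over the feature's range, one-hot encoded
kb = KBinsDiscretizer(n_bins=4, encode="onehot-dense", strategy="uniform")
X_binned = kb.fit_transform(X)
```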
27
Q

Describe the use of interactions and polynomials in feature representation.

A

Interactions and Polynomials:

  • Use polynomial terms of features or interaction terms to build models
  • can be useful, but increases feature space
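A PolynomialFeatures sketch on a toy point, showing how the feature space grows: degree 2 on two features yields the bias, both originals, both squares, and the interaction term.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])
# degree-2 expansion: 1, x1, x2, x1^2, x1*x2, x2^2
X_poly = PolynomialFeatures(degree=2).fit_transform(X)
```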
28
Q

Univariate transformations

A

Goal

  • many models work better when features are approximately Gaussian-distributed
  • log, exponential, and trigonometric transformations can help achieve this
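A log-transform sketch on synthetic right-skewed data (the exponential distribution is an illustrative stand-in for a skewed feature):

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.exponential(scale=3, size=1000)  # strongly right-skewed feature
x_log = np.log(x + 1)  # log transform compresses the long right tail
```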
29
Q

Why do we perform feature selection and how can we do it?

A

Why feature selection?

  • more features make models more complex and increase the risk of overfitting
  • fewer features make simpler models that generalize more easily

How do we know what is a good feature?

  • Univariate statistics (compute if there is a significant relationship between feature and target)
  • model-based selection (supervised learning to select, tree based models for example)
  • iterative selection (build a series of models, start with none and add or build and remove with some stopping criteria)
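A univariate-statistics sketch with SelectKBest (dataset and k=10 are illustrative): each feature is scored against the target independently and only the top k are kept.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)
# score each of the 30 features against the target; keep the 10 best
select = SelectKBest(score_func=f_classif, k=10)
X_selected = select.fit_transform(X, y)
```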