FinalExamReview-Yaseen Flashcards

1
Q

What is supervised learning?

A

Supervised Learning:

  • Goal is to make accurate predictions for new, never-before-seen data
  • We have input and output pairs to “learn” from

Examples:

  • k-Nearest Neighbors
  • Linear Models
  • Naive Bayes Classifiers
  • Decision Trees
  • Ensembles of Decision Trees
  • Kernelized Support Vector Machines
  • Neural Networks (Deep Learning)
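A minimal supervised-learning sketch with scikit-learn (the dataset and k=3 are illustrative choices, not from the card):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# input/output pairs to learn from, with held-out data to test generalization
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
accuracy = knn.score(X_test, y_test)  # accuracy on never-before-seen data
```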
2
Q

What is the primary challenge in unsupervised learning?

A

Challenges of Unsupervised Learning:

  • no outcome to compare to (how well did we do? nobody knows!)
  • We must manually inspect the results to see how we did.
3
Q

What is a common utilization of unsupervised algorithms?

A

Exploratory setting:

-useful for changing the representation of the data before applying a supervised learning method

4
Q

What are the two primary types of unsupervised learning?

A

Unsupervised Learning Types
  • Unsupervised transformations: create a new representation of the data that might be easier for humans or other machine learning algorithms to understand than the original representation.
    (dimensionality reduction, topic extraction)

  • Clustering algorithms: partition data into distinct groups of similar items.
    (like classification in supervised learning, but with no known outputs to compare against)
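A minimal sketch of both types on toy blob data (the dataset and component/cluster counts are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

X, _ = make_blobs(n_samples=60, centers=3, n_features=4, random_state=0)

# unsupervised transformation: re-represent the data in fewer dimensions
X_pca = PCA(n_components=2).fit_transform(X)

# clustering: partition the data into distinct groups of similar items
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
```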
5
Q

Define F1 score

A

The harmonic mean of precision and sensitivity (recall):

F1=(2TP)/(2TP+FP+FN)

Where:
TP=# True Positives
FP=# False Positives
FN=# False Negatives
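A quick check of the formula with hypothetical confusion-matrix counts, showing it equals the harmonic mean of precision and recall:

```python
# hypothetical counts for illustration
tp, fp, fn = 40, 10, 20  # true positives, false positives, false negatives

precision = tp / (tp + fp)          # PPV
recall = tp / (tp + fn)             # TPR / sensitivity
f1 = 2 * tp / (2 * tp + fp + fn)    # the card's formula
harmonic = 2 * precision * recall / (precision + recall)
```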

6
Q

Define True Positive Rate

A

TPR=TP/P=TP/(TP+FN)

Where:
TP=# True Positives
FN=# False Negatives
P=Total Actual Positives

Note: TPR (True Positive Rate)=Sensitivity=Recall=Hit Rate

7
Q

Define True Negative Rate

A

TNR=TN/N=TN/(TN+FP)

Where:
TN=# True Negatives
FP=# False Positives
N=Total Actual Negatives

Note: TNR (True Negative Rate)=Specificity=Selectivity

8
Q

Define Positive Predictive Value

A

PPV=TP/(TP+FP)

Where:
TP=# True Positives
FP=# False Positives

Positive Predictive Value = Precision

9
Q

Define Negative Predictive Value

A

NPV=TN/(TN+FN)

Where:
TN=# True Negatives
FN=# False Negatives
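The four rates (TPR, TNR, PPV, NPV) can be checked together on one hypothetical confusion matrix:

```python
# hypothetical confusion-matrix counts for illustration
tp, fn = 40, 10   # actual positives: P = 50
tn, fp = 80, 20   # actual negatives: N = 100

tpr = tp / (tp + fn)  # sensitivity / recall
tnr = tn / (tn + fp)  # specificity
ppv = tp / (tp + fp)  # precision
npv = tn / (tn + fn)
```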

10
Q

What are the benefits and drawbacks of k-fold cross-validation as an evaluation metric?

A

k-fold Cross Validation

Benefits:

  • Because there are multiple splits in the process, we have an idea of how the model might perform in best case and worst case scenarios
  • More effective use of the data

Disadvantage:

  • computational cost
  • we fit k models instead of a single model, so it is roughly k times slower
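A minimal k-fold cross-validation sketch (the dataset and model are illustrative); the per-fold scores give a sense of the best- and worst-case performance the card mentions:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
# one accuracy score per fold; the spread hints at best/worst-case behavior
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
```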
11
Q

What is the distinction between stratified k-fold cross-validation and k-fold cross-validation?

A

Data is split so that proportions between classes are the same in each fold as they are in the entire dataset, then k-fold cross-validation is performed.
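A sketch showing the preserved class proportions on a deliberately imbalanced toy dataset (90/10 split is an illustrative choice):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# imbalanced toy labels: 90 of class 0, 10 of class 1
X = np.zeros((100, 1))
y = np.array([0] * 90 + [1] * 10)

skf = StratifiedKFold(n_splits=5)
# every test fold keeps the 90/10 proportion: 18 zeros and 2 ones
test_class1_counts = [int((y[test] == 1).sum()) for _, test in skf.split(X, y)]
```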

12
Q

What is leave-one-out cross validation and what are the advantages and disadvantages of using it?

A

Leave-one-out Cross Validation:
Same as k-fold cross-validation, but each split has a single data point in the test set (so k equals the number of data points).

Advantage:
-better estimates in small datasets
Disadvantage:
-very time consuming in large datasets
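A tiny sketch with scikit-learn's LeaveOneOut (10 points is an illustrative size):

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

X = np.arange(10).reshape(-1, 1)
# one split per data point; each test set holds exactly one point
splits = list(LeaveOneOut().split(X))
```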

13
Q

What is shuffle-split cross validation and what are the advantages and disadvantages of using it?

A

shuffle-split cross validation:
Each split samples “train_size” many points for the training set and “test_size” many points for the test set; within a split these sets are disjoint. The splitting is repeated “n_splits” times, and the model is trained and evaluated on each split.

Advantage:

  • allows for control over the number of iterations independently of training and test sizes
  • allows for using part of the data for each iteration (subsampling)
  • subsampling is particularly useful for large datasets

Disadvantage:
- because the splits are sampled independently, some points may never appear in any test set while others are tested repeatedly
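A minimal ShuffleSplit sketch with subsampling (sizes and n_splits are illustrative): each split uses only 70 of the 100 points, and train/test are disjoint within a split.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(100).reshape(-1, 1)
# each split: 50 training points, 20 disjoint test points; 30 unused (subsampling)
ss = ShuffleSplit(n_splits=10, train_size=0.5, test_size=0.2, random_state=0)
splits = list(ss.split(X))
```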

14
Q

What is cross validation with groups and why do we use it?

A

Add a group identifier to each data point, and never split a group across the test/train sets.

We use it so that the same person/group does not appear in both the training and test sets, which could otherwise inflate our results.
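A minimal GroupKFold sketch (the "4 patients, 3 samples each" setup is illustrative): no group's samples are split across train and test.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(12).reshape(-1, 1)
y = np.zeros(12)
groups = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])  # e.g. 4 patients

gkf = GroupKFold(n_splits=4)
# verify that no group appears in both the training and test indices
no_leakage = all(set(groups[train]).isdisjoint(groups[test])
                 for train, test in gkf.split(X, y, groups=groups))
```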

15
Q

What is Grid Search and why do we use it?

A

Grid Search is a tool in scikit-learn that allows us to try all possible combinations of parameters of interest. We can then return the “best” result according to some defined evaluation criterion such as accuracy, F1 score, or AUC.
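A minimal GridSearchCV sketch (dataset, model, and parameter values are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# try every combination of C and gamma with 5-fold cross-validation
param_grid = {"C": [0.01, 0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
grid = GridSearchCV(SVC(), param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)
```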

16
Q

What is the danger of overfitting the parameters and the use of the validation set?

A

Overfitting and Validation set

  • We use the train dataset to fit our parameters and select the best set of parameters in our grid search so the train dataset cannot be used to evaluate the usefulness of our model.
  • When using grid search we are exploring many parameter settings, so we may overfit to the data used for selection
  • solution: split data three times
    (1) training set to fit model
    (2) validation set to select parameters
    (3) test set to evaluate the model
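The three-way split can be sketched as follows (dataset, model, and the C grid are illustrative): parameters are chosen on the validation set, and the test set is touched only once at the end.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
# (1)+(3): carve off the test set; (2): split the rest into train/validation
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_trainval, y_trainval, random_state=1)

# select the parameter on the validation set ...
best_score, best_C = max(
    (SVC(C=C).fit(X_train, y_train).score(X_valid, y_valid), C)
    for C in [0.01, 0.1, 1, 10, 100])
# ... then refit on train+validation and evaluate once on the test set
test_score = SVC(C=best_C).fit(X_trainval, y_trainval).score(X_test, y_test)
```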
17
Q

How do we evaluate regression performance?

A

R² (coefficient of determination) and Mean Squared Error (MSE)
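Both metrics on a hypothetical set of true/predicted values:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# hypothetical regression targets and predictions
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mse = mean_squared_error(y_true, y_pred)  # mean of squared errors
r2 = r2_score(y_true, y_pred)             # 1 is perfect, 0 is predicting the mean
```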

18
Q

When is accuracy not a good metric to use?

A

Accuracy is a poor measure of predictive performance when we have imbalanced data: always predicting the majority class can still score highly.

We must consider what are the consequences of our mistakes?
- Type I error vs Type II error (in our situation which is worse)

Consider using F1 score - trade-off between optimizing recall and optimizing precision

Always consider if you value precision or recall more

19
Q

What is the area under the precision-recall curve?

A

Average Precision

20
Q

What is the usefulness of the precision-recall curve?

A

Allows us to look at all possible thresholds, i.e. all trade-offs between precision and recall, at once
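A minimal sketch of the curve and its area, average precision (dataset and classifier are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.decision_function(X_test)

# one (precision, recall) point per threshold
precision, recall, thresholds = precision_recall_curve(y_test, scores)
ap = average_precision_score(y_test, scores)  # area under the PR curve
```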

21
Q

What is the usefulness of the Receiver operating characteristics (ROC) and AUC (area under the curve)?

A

Commonly used to analyze the behavior of classifiers at different thresholds
Relationship between false positive rate and true positive rate

AUC summarizes ROC curve into single number
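A minimal ROC/AUC sketch on the same kind of toy setup (dataset and classifier are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.decision_function(X_test)

# one (FPR, TPR) point per threshold
fpr, tpr, thresholds = roc_curve(y_test, scores)
auc = roc_auc_score(y_test, scores)  # summarizes the ROC curve as one number
```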

22
Q

What is the advantage of using a pipeline over the initial methods we used?

A

Pipeline

  • allows gluing multiple processing steps into a single scikit-learn estimator
  • chaining preprocessing steps with a supervised model
  • reduces the code needed for preprocessing and classification process
  • we can combine Pipelines and grid search
Pipeline Example:
	from sklearn.pipeline import Pipeline
	from sklearn.preprocessing import MinMaxScaler
	from sklearn.svm import SVC
	pipe = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])
Call fit, predict, and score on the pipeline as on any other estimator:
	pipe.fit(X_train, y_train)
	pipe.score(X_test, y_test)

Combo Example (grid search over the whole pipeline):
	grid = GridSearchCV(pipe, param_grid=param_grid, cv=5)

23
Q

What is the only requirement of a pipeline estimator?

A

All but the last step need to have a transform method, so they can produce a new representation of the data that can be used in the next step.

24
Q

What are two primary ways we have used GridSearch and/or Pipeline?

A

(1) Grid-Searching Preprocessing Steps and Model Parameters:
- encapsulate all the processing steps in our workflow in a single estimator
- adjust the preprocessing parameters using the outcome of a supervised task
- searching over preprocessing parameters together with model parameters is a very powerful strategy

(2) Grid-Searching Which Model To Use
- combine GridSearchCV and Pipeline: search over the actual steps being performed in the pipeline
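A sketch of (1), grid-searching preprocessing and model parameters together via the "step__parameter" naming convention (dataset, steps, and grid values are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = make_regression(n_samples=100, n_features=3, noise=10, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()),
                 ("poly", PolynomialFeatures()),
                 ("ridge", Ridge())])
# "step__parameter" lets one search tune preprocessing and model together
param_grid = {"poly__degree": [1, 2, 3], "ridge__alpha": [0.1, 1, 10]}
grid = GridSearchCV(pipe, param_grid=param_grid, cv=5)
grid.fit(X, y)
```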

25
Q

What are some ways to represent Categorical Data?

A

One-Hot Encoding:

  • most common way to represent categorical variables
  • replace categorical variable with new feature (one per category) and set each to 0 or 1

Numbers to encode Categorical Variables:

  • use integers for each categorical value
  • should not necessarily be treated as continuous or ordered
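A one-hot encoding sketch with pandas (the toy DataFrame is illustrative): each category becomes its own 0/1 column, and numeric columns pass through unchanged.

```python
import pandas as pd

df = pd.DataFrame({"city": ["NYC", "Paris", "NYC"], "age": [25, 31, 47]})
# one new 0/1 feature per category of "city"
dummies = pd.get_dummies(df)
```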
26
Q

What is Binning and when is it useful?

A

Binning (Discretization):

  • partition a feature’s range into a fixed number of “bins”
  • transform the continuous input feature into a categorical one based on which bin each data point falls in
  • can then use one-hot encoding on the binned feature

Notes:

  • linear models become more flexible
  • binning can increase power
  • decision trees can build a much more complex model of the data depending on its representation
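A binning sketch with scikit-learn's KBinsDiscretizer (toy values; 4 equal-width bins is an illustrative choice). Each continuous value becomes a one-hot row indicating its bin:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[-3.0], [-1.0], [0.5], [2.0], [2.9]])
# 4 equal-width bins over the feature's range, one-hot encoded
kb = KBinsDiscretizer(n_bins=4, encode="onehot-dense", strategy="uniform")
X_binned = kb.fit_transform(X)
```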
27
Q

Describe the use of interactions and polynomials in feature representation.

A

Interactions and Polynomials:

  • Use polynomial terms of features or interaction terms to build models
  • can be useful, but increases feature space
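A PolynomialFeatures sketch on a toy point, showing how the feature space grows: degree 2 on two features yields the bias, both originals, both squares, and the interaction term.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])
# degree-2 expansion: 1, x1, x2, x1^2, x1*x2, x2^2
X_poly = PolynomialFeatures(degree=2).fit_transform(X)
```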
28
Q

Univariate transformations

A

Goal

  • many models work better when features are approximately Gaussian-distributed
  • log, exponential, and trigonometric transformations can help achieve this
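A log-transform sketch on synthetic right-skewed data (the exponential distribution is an illustrative stand-in for a skewed feature):

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.exponential(scale=3, size=1000)  # strongly right-skewed feature
x_log = np.log(x + 1)  # log transform compresses the long right tail
```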
29
Q

Why do we perform feature selection and how can we do it?

A

Why feature selection?

  • more features make models more complex and increase the risk of overfitting
  • fewer features make simpler models that generalize more easily

How do we know what is a good feature?

  • Univariate statistics (compute if there is a significant relationship between feature and target)
  • model-based selection (supervised learning to select, tree based models for example)
  • iterative selection (build a series of models, start with none and add or build and remove with some stopping criteria)
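A univariate-statistics sketch with SelectKBest (dataset and k=10 are illustrative): each feature is scored against the target independently and only the top k are kept.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)
# score each of the 30 features against the target; keep the 10 best
select = SelectKBest(score_func=f_classif, k=10)
X_selected = select.fit_transform(X, y)
```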