FinalExamReview-Yaseen Flashcards
What is supervised learning?
Supervised Learning:
- Goal is to make accurate predictions for new, never-before-seen data
- We have input and output pairs to “learn” from
Examples:
- k-Nearest Neighbors
- Linear Models
- Naive Bayes Classifiers
- Decision Trees
- Ensembles of Decision Trees
- Kernelized Support Vector Machines
- Neural Networks (Deep Learning)
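For illustration, a minimal sketch of the supervised workflow in scikit-learn; the iris dataset and k-nearest neighbors model are assumptions, not from the flashcards:

```python
# Supervised learning sketch: learn from labeled input/output pairs,
# then predict on never-before-seen data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)          # input/output pairs to "learn" from
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)                  # learn from the labeled examples
print(knn.score(X_test, y_test))           # accuracy on held-out data
```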
What is the primary challenge in unsupervised learning?
Challenges of Unsupervised Learning:
- No known output to compare against (how well did we do? nobody knows!)
- We must manually inspect the results to judge how we did.
What is a common utilization of unsupervised algorithms?
Exploratory setting:
- useful for changing the representation of the data, which can then be fed to a supervised learning method
What are the two primary types of unsupervised learning?
Unsupervised Learning Types
- Unsupervised transformations: create a new representation of the data that might be easier for humans or other machine learning algorithms to understand than the original representation.
(dimensionality reduction, topic extraction)
- Clustering algorithms: partition data into distinct groups of similar items.
(like classification in supervised learning, but there is no known output to compare to)
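A minimal sketch of both types on an assumed dataset (the iris features, with labels discarded); PCA and KMeans stand in for the two categories:

```python
# Unsupervised learning sketch: no labels are used.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X, _ = load_iris(return_X_y=True)          # ignore the labels

# Unsupervised transformation: reduce to 2 dimensions for easier inspection
X_2d = PCA(n_components=2).fit_transform(X)

# Clustering: partition the data into 3 groups of similar items
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_2d)
print(labels[:10])   # cluster assignments; no ground truth to compare against
```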
Define F1 score
The harmonic mean of precision and sensitivity (recall):
F1=(2TP)/(2TP+FP+FN)
Where:
TP=# True Positives
FP=# False Positives
FN=# False Negatives
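A minimal sketch of the formula with made-up counts:

```python
# F1 from raw counts (the counts are illustrative, not real data)
tp, fp, fn = 40, 10, 5
f1 = 2 * tp / (2 * tp + fp + fn)   # harmonic mean of precision and recall
print(f1)                          # 0.8421...
```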
Define True Positive Rate
TPR=TP/P=TP/(TP+FN)
Where:
TP=# True Positives
FN=# False Negatives
P=Total Actual Positives
Note: TPR (True Positive Rate)=Sensitivity=Recall=Hit Rate
Define True Negative Rate
TNR=TN/N=TN/(TN+FP)
Where:
TN=# True Negatives
FP=# False Positives
N=Total Actual Negatives
Note: TNR (True Negative Rate)=Specificity=Selectivity
Define Positive Predictive Value
PPV=TP/(TP+FP)
Where:
TP=# True Positives
FP=# False Positives
Positive Predictive Value = Precision
Define Negative Predictive Value
NPV=TN/(TN+FN)
Where:
TN=# True Negatives
FN=# False Negatives
What are the benefits and drawbacks of k-fold cross-validation as an evaluation metric?
k-fold Cross Validation
Benefits:
- Because there are multiple splits, we get a sense of how the model might perform in best-case and worst-case scenarios
- More effective use of the data
Disadvantages:
- computational cost
- we train k models instead of a single model, so it is roughly k times slower
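A minimal sketch of 5-fold cross-validation; the iris dataset and logistic regression model are assumptions for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
# cv=5 fits 5 models; for classifiers the folds are stratified by default
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores)          # one score per fold: the spread hints at best/worst case
print(scores.mean())   # overall performance estimate
```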
What is the distinction between stratified k-fold cross-validation and k-fold cross-validation?
Data is split so that proportions between classes are the same in each fold as they are in the entire dataset, then k-fold cross-validation is performed.
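A minimal sketch contrasting the two splitters on the iris dataset (an assumption); because iris is stored sorted by class, unshuffled plain k-fold fails badly while the stratified version does not:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)           # samples are sorted by class
model = LogisticRegression(max_iter=1000)

# Plain k-fold: with 3 folds, each fold is a single class, so scores are 0.0
print(cross_val_score(model, X, y, cv=KFold(n_splits=3)))
# Stratified k-fold: each fold mirrors the dataset's class proportions
print(cross_val_score(model, X, y, cv=StratifiedKFold(n_splits=3)))
```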
What is leave-one-out cross validation and what are the advantages and disadvantages of using it?
Leave-one-out Cross Validation:
The same as k-fold cross-validation, except each split puts a single data point in the test set (so k equals the number of data points).
Advantage:
- better estimates on small datasets
Disadvantage:
- very time-consuming on large datasets
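A minimal sketch using scikit-learn's LeaveOneOut splitter; the dataset and model are assumptions for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=LeaveOneOut())
print(len(scores), scores.mean())   # 150 fits: one model per data point
```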
What is shuffle-split cross validation and what are the advantages and disadvantages of using it?
shuffle-split cross validation:
Each split samples “train_size” points for the training set and a disjoint “test_size” points for the test set; the splitting is repeated “n_splits” times, and cross-validation is performed over these splits.
Advantage:
- allows for control over the number of iterations independently of training and test sizes
- allows for using part of the data for each iteration (subsampling)
- subsampling is particularly useful for large datasets
Disadvantage:
- because each split is sampled independently, the same point can appear in several test sets while another point is never tested
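A minimal sketch using scikit-learn's ShuffleSplit; the sizes and dataset are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = load_iris(return_X_y=True)
# 10 iterations, each subsampling 50% for training and 20% for testing
cv = ShuffleSplit(n_splits=10, train_size=0.5, test_size=0.2, random_state=0)
print(cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv))
```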
What is cross validation with groups and why do we use it?
Each data point gets a group identifier (e.g., which person it came from), and a group is never split across the training and test sets.
We use it so that the same person/group does not appear in both the training and test sets, which could otherwise inflate our results.
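A minimal sketch using GroupKFold; the dataset and groups array are made up, with the groups marking, e.g., which person each sample came from:

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

X, y = make_blobs(n_samples=12, random_state=0)
groups = [0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 3, 3]   # assumed group labels

# GroupKFold keeps each group entirely inside one fold
scores = cross_val_score(LogisticRegression(), X, y,
                         groups=groups, cv=GroupKFold(n_splits=3))
print(scores)
```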
What is Grid Search and why do we use it?
Grid Search is a tool in scikit-learn that lets us try all possible combinations of the parameters of interest. It then returns the “best” combination according to some defined evaluation criterion such as accuracy, F1 score, or AUC.
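A minimal sketch with GridSearchCV; the SVC model and grid values are assumptions for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Try every combination of the listed C and gamma values
param_grid = {"C": [0.01, 0.1, 1, 10, 100],
              "gamma": [0.01, 0.1, 1, 10]}
grid = GridSearchCV(SVC(), param_grid, cv=5)   # scoring defaults to accuracy
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)     # "best" by cross-validated score
print(grid.score(X_test, y_test))              # final check on held-out data
```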