Midterm-Yaseen Flashcards
What is supervised learning?
- Goal is to make accurate predictions for new, never-before-seen data
- we have input and output pairs to “learn” from
Examples:
- k-Nearest Neighbors
- Linear Models
- Naive Bayes Classifiers
- Decision Trees
- Ensembles of Decision Trees
- Kernelized Support Vector Machines
- Neural Networks (Deep Learning)
Describe the k-Nearest Neighbors algorithm.
- Using the training dataset, find the nearest data points to a new point and classify it according to those neighbors.
Note:
For KNeighborsClassifier, the prediction is a majority vote among the neighbors' classes.
For KNeighborsRegressor, the prediction is the mean of the neighbors' target values (the score method reports R^2).
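A minimal sketch of both k-NN estimators on toy data (the data points here are illustrative, not from the cards):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Toy 1-D data: a cluster near 0-2 and a cluster near 10-12.
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_class = np.array([0, 0, 0, 1, 1, 1])
y_reg = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])

# Classifier: majority vote among the 3 nearest training points.
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y_class)
print(clf.predict([[1.5]]))  # neighbors 1.0, 2.0, 0.0 all have class 0

# Regressor: mean of the 3 nearest targets (1.0, 2.0, 0.0).
reg = KNeighborsRegressor(n_neighbors=3).fit(X, y_reg)
print(reg.predict([[1.5]]))
```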
Explain what is meant by “underfitting.”
A model that cannot capture the variations present in the training data.
Explain the concept of “overfitting.”
A model that fits the training data too closely and is not able to generalize well to new data.
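One common way to spot both problems is to compare train and test accuracy; this sketch uses assumed synthetic data and k-NN, where n_neighbors=1 memorizes the training set:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Overfitting: n_neighbors=1 gives a perfect train score
# but typically a noticeably lower test score.
overfit = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)
print(overfit.score(X_tr, y_tr), overfit.score(X_te, y_te))

# Underfitting risk: a very large neighborhood smooths too much,
# so even the train score drops.
underfit = KNeighborsClassifier(n_neighbors=100).fit(X_tr, y_tr)
print(underfit.score(X_tr, y_tr), underfit.score(X_te, y_te))
```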
When should we use Nearest Neighbors methods?
- ideal for small datasets
- good as a baseline
- easy to explain
When should we use Linear Models?
- go-to as a first method
- good for very large datasets
- good for very high-dimensional data
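A hedged sketch of a linear classifier on assumed synthetic data; the regularization strength C is the main parameter to tune:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Lower C = stronger regularization; max_iter raised to ensure convergence.
lr = LogisticRegression(C=1.0, max_iter=1000).fit(X_tr, y_tr)
print(lr.score(X_te, y_te))
```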
When should we use Naive Bayes?
- Used only for classification
- faster than linear models
- good for very large datasets and high-dimensional data
Disadvantage: often less accurate than linear models
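A minimal sketch with GaussianNB on assumed synthetic data; it has essentially no parameters to tune and trains very quickly:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# GaussianNB models each feature as normally distributed per class.
nb = GaussianNB().fit(X_tr, y_tr)
print(nb.score(X_te, y_te))
```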
What are the advantages of Decision Tree methods?
- very fast
- don’t need scaling of the data
- can be visualized and easily explained
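A toy example (assumed data) showing both advantages: no scaling is needed, and the learned rules can be printed in human-readable form:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy 1-D data: class 0 below 1.5, class 1 above.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# export_text renders the tree as readable if/else rules.
print(export_text(tree, feature_names=["x"]))
```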
What are the advantages of Random Forest?
- Nearly always perform better than a single decision tree
- very robust and powerful
- Does NOT require scaling of data
What is a disadvantage of Random Forest? (When should they not be used)
Not good for high-dimensional sparse data
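A sketch on assumed synthetic data; n_estimators (number of trees) is the main parameter, and no feature scaling is applied:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Averaging many randomized trees reduces overfitting
# relative to a single deep tree.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print(forest.score(X_te, y_te))
```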
Compare Gradient boosted decision Trees and Random Forests in terms of advantages.
- Gradient boosted trees are often slightly more accurate than random forests
- Gradient boosted trees are slower to train than random forests
- Gradient boosted trees are faster to predict and smaller in memory than random forests
- Gradient boosted trees require more parameter tuning than random forests
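A sketch on assumed synthetic data showing the extra parameters that typically need tuning (learning_rate in addition to n_estimators, with shallow trees):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Boosting builds shallow trees sequentially, each correcting the last;
# learning_rate and n_estimators trade off against each other.
gbrt = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                  max_depth=3, random_state=0).fit(X_tr, y_tr)
print(gbrt.score(X_te, y_te))
```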
Describe the advantages and disadvantages of Support Vector Machines.
- Advantage: powerful for medium-sized datasets of features with similar meaning
- Disadvantage: requires scaling of the data
- Disadvantage: sensitive to parameter settings
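Because of the scaling requirement, a pipeline that standardizes the features first is the usual pattern; this sketch uses assumed synthetic data, with C and gamma as the parameters to tune:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# StandardScaler handles the scaling SVMs are sensitive to;
# C (regularization) and gamma (kernel width) usually need tuning.
svm = make_pipeline(StandardScaler(), SVC(C=1.0, gamma="scale")).fit(X_tr, y_tr)
print(svm.score(X_te, y_te))
```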
Describe the advantages and disadvantages of Neural Networks.
- Advantage: can build very complex models (particularly for large datasets)
- Disadvantage: sensitive to scaling of the data and choice of parameters
- Disadvantage: large models require a long time to train
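Like SVMs, neural networks are sensitive to feature scales, so the same pipeline pattern applies; this sketch uses assumed synthetic data and an illustrative hidden-layer size:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# hidden_layer_sizes, alpha, and the solver settings are the usual knobs;
# max_iter is raised so training converges on this small dataset.
mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(50,), max_iter=2000, random_state=0),
).fit(X_tr, y_tr)
print(mlp.score(X_te, y_te))
```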
What are the two primary types of unsupervised learning?
- Unsupervised transformations: create a new representation of the data that may be easier for humans or other machine learning algorithms to work with than the original representation (e.g., dimensionality reduction, topic extraction).
- Clustering algorithms: partition data into distinct groups of similar items (like classification in supervised learning, but with no known outputs to compare to).
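A sketch of one example of each type on assumed blob data: PCA as an unsupervised transformation, KMeans as a clustering algorithm:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# Unsupervised transformation: project to 1 principal component.
X_1d = PCA(n_components=1).fit_transform(X)

# Clustering: partition into 3 groups without using any labels.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(X_1d.shape, sorted(set(labels)))
```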
What is the primary challenge in unsupervised learning?
- no outcome to compare to (how well did we do? nobody knows!)
- We must manually inspect the results to see how we did.