Machine Learning Flashcards
Machine learning
Algorithms that can learn from observational data and make predictions from it
Unsupervised learning
An algorithm finds structure in a data set without labeled answers to learn from
Latent variable
A previously unknown (hidden) property of the data, which unsupervised learning can uncover
Supervised learning
An algorithm learns from a data set plus the correct “answers”
Training/testing sets
A model is trained on a training set of data, then evaluated on a similar but disjoint test set to measure its accuracy.
What are practical considerations for training/testing sets?
- Both sets must be large enough to be representative of the data, including its outliers and variations.
- Both sets must be randomly chosen from the source data pool.
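A minimal sketch of a random train/test split, using only the Python standard library. The function name `train_test_split`, the 20% test fraction, and the fixed seed are illustrative assumptions, not part of the flashcards.

```python
import random

def train_test_split(data, test_fraction=0.2, seed=0):
    """Shuffle a copy of the data and split it into disjoint train and test lists."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    shuffled = list(data)
    rng.shuffle(shuffled)              # random choice from the source data pool
    n_test = int(len(shuffled) * test_fraction)
    # First n_test shuffled items become the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]

train, test = train_test_split(range(100), test_fraction=0.2)
```

Because both lists come from one shuffled copy, no data point can appear in both sets.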
Why is train/test useful?
It can guard against overfitting.
K-fold cross-validation
- Split data randomly into K segments.
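- Hold out one segment as the test set.
- Train on the other K - 1 segments and score the model (for example, with r-squared) against the held-out segment.
- Repeat so that each segment serves as the test set once, then average the resulting r-squared values.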
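The steps above can be sketched as follows. This assumes a simple one-variable least-squares line as the model and synthetic data; the helper names `fit_line`, `r_squared`, and `k_fold_r_squared` are hypothetical and chosen only for illustration.

```python
import random

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(xs, ys, slope, intercept):
    """Coefficient of determination of the fitted line on the given points."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

def k_fold_r_squared(xs, ys, k=5, seed=0):
    """Average r-squared over k folds, each fold held out exactly once."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)          # split randomly...
    folds = [indices[i::k] for i in range(k)]     # ...into k segments
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        slope, intercept = fit_line([xs[i] for i in train_idx],
                                    [ys[i] for i in train_idx])
        scores.append(r_squared([xs[i] for i in test_idx],
                                [ys[i] for i in test_idx],
                                slope, intercept))
    return sum(scores) / k

# Synthetic example data (an assumption for this sketch): a noisy straight line.
noise = random.Random(1)
xs = [float(i) for i in range(50)]
ys = [2.0 * x + 1.0 + noise.gauss(0, 1) for x in xs]
mean_r2 = k_fold_r_squared(xs, ys, k=5)
```

Averaging over all K folds gives a steadier accuracy estimate than a single train/test split, because every data point is used for testing exactly once.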
K-means clustering
- Randomly pick K centroids.
- Assign each data point to the closest centroid.
- Recompute the centroids based on the average position of each centroid’s data points.
- Iterate until the centroids, and therefore the point assignments, stop changing.
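A minimal sketch of these four steps in plain Python, assuming points are tuples of numbers and distance is squared Euclidean distance. The function name `k_means`, the `max_iter` cap, and the sample points are assumptions made for illustration.

```python
import random

def k_means(points, k, seed=0, max_iter=100):
    """Cluster points (tuples of numbers) into k groups; returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)      # randomly pick K data points as initial centroids
    assignments = None
    for _ in range(max_iter):
        # Assign each point to the index of its closest centroid.
        new_assignments = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignments == assignments:  # nothing moved: converged
            break
        assignments = new_assignments
        # Recompute each centroid as the average position of its points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:                     # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments

# Two well-separated groups of points (illustrative data).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
centroids, assignments = k_means(points, k=2)
```

The result contains only cluster indices (0 and 1 here); what each cluster means is left for a person to interpret.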
What is a large caveat with K-means clustering?
The algorithm does not assign names or titles to clusters.
Entropy (data science)
A measure of the disorder in a data set
Zero if all data points are the same.
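One common way to put a number on this is Shannon entropy over class labels, measured in bits. The flashcard does not name a formula, so treating entropy as Shannon entropy is an assumption of this sketch, as is the function name `entropy`.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    # Sum of -p * log2(p) over the proportion p of each distinct label.
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())

all_same = entropy(["a", "a", "a", "a"])      # every data point identical
even_mix = entropy(["a", "a", "b", "b"])      # two classes, evenly mixed
```

Under this definition, identical labels give an entropy of zero, and the value grows as the labels become more evenly mixed.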