Lecture 1 - Supervised and Unsupervised Learning Flashcards
What is Machine Learning?
Machine Learning is the science (and art) of programming computers so they can learn from data. It is also known as inductive learning.
What are Machine Learning Algorithms?
ML algorithms are tools for the automatic acquisition of knowledge.
What is Inductive Learning?
A form of logical inference that allows you to obtain general conclusions from a particular set of examples.
What is Tom Mitchell's definition of Machine Learning?
“A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.” - Refer to the email example to explain it
What is Supervised Learning?
Uses labelled datasets to train algorithms to predict outcomes.
Examples: Classification and Regression
Define Attribute
Also called a feature, predictor, independent variable, or descriptive feature. An attribute describes a characteristic or aspect of an example.
Define Example
Also called an instance, record, or data point. It is a tuple of attribute values that describes an object of interest (e.g., a regular email, a patient, or a company’s customer history). In other words, one example holds one value for each attribute (descriptive feature).
Define Label
Also called the target or dependent variable. It is a special attribute that describes the phenomenon of interest: the value the model is trained to predict for each example.
Define Dataset
A dataset is composed of examples, each with its attribute values and its associated label.
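As a minimal illustration of the terms attribute, example, label, and dataset, a tiny dataset can be written in Python. The attribute names and values below are invented for illustration and are not taken from the lecture.

```python
# Hypothetical toy dataset: names and values are invented for illustration.
attribute_names = ["income", "years_employed"]

# Each example is a tuple of attribute values describing one applicant.
examples = [
    (52000, 4),
    (18000, 1),
    (75000, 10),
]

# One label per example: the value we want to predict.
labels = ["approve", "reject", "approve"]

# A dataset pairs every example with its label.
dataset = list(zip(examples, labels))

for attributes, label in dataset:
    described = dict(zip(attribute_names, attributes))
    print(described, "->", label)
```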
Supervised Learning Example - Credit Approval
REFER TO ONENOTE FOR MORE DETAILS
What are some examples of Supervised Learning Algorithms?
Linear Regression
Logistic Regression
k-Nearest Neighbors
Support Vector Machines (SVMs)
Decision Trees and Random Forests
What is Unsupervised Learning?
A type of machine learning in which the algorithm learns from unlabelled data.
What is Clustering in Unsupervised Learning?
An unsupervised machine learning technique designed to group unlabelled examples based on their similarity to each other
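A rough sketch of how grouping by similarity can work, using k-means on one-dimensional points. The data, the starting centroids, and the function name `k_means_1d` are assumptions chosen for illustration. Real implementations handle initialisation and convergence more carefully.

```python
def k_means_1d(points, centroids, iterations=10):
    """Tiny k-means sketch for 1-D points; the caller supplies the initial centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical unlabelled data with two visible groups.
points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centroids, clusters = k_means_1d(points, centroids=[0.0, 5.0])
print(centroids)
print(clusters)
```

No labels are used anywhere: the groups emerge only from the distances between points.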
What is Anomaly Detection in Unsupervised Learning?
An unsupervised ML technique that identifies data points, events, or observations that deviate significantly from the expected pattern within a dataset.
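One simple, label-free way to flag such deviations is a z-score rule: mark any value that lies more than a chosen number of standard deviations from the mean. This is only a sketch. The threshold of 2.0 and the sample readings are assumptions, and the lecture may have a different technique in mind.

```python
from statistics import mean, pstdev

def find_anomalies(values, threshold=2.0):
    """Return values whose distance from the mean exceeds `threshold` standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values are identical, so nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical sensor readings with one obvious outlier.
readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
print(find_anomalies(readings))
```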
What are some examples of Unsupervised Learning Algorithms?
Clustering: k-means clustering, Density-based spatial clustering of applications with noise (DBSCAN), Hierarchical Cluster Analysis (HCA)
Visualization and dimensionality reduction: Principal Component Analysis (PCA), Locally-Linear Embedding (LLE), t-distributed Stochastic Neighbor Embedding (t-SNE)
Association rule learning: Apriori algorithm, Eclat algorithm
What is Instance-Based Learning?
Instance-based Learning:
- Memorises the training data and uses it directly to make predictions
- Compares new examples to stored instances using a similarity measure
Refer to the example on the slides
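A minimal sketch of instance-based learning is k-nearest neighbours: the training data is simply stored, and a new example receives the majority label of its k most similar stored instances. Euclidean distance is assumed as the similarity measure, and the data points and the name `knn_predict` are hypothetical.

```python
from collections import Counter
import math

def knn_predict(training_data, new_example, k=3):
    """Predict a label by majority vote among the k closest stored examples."""
    # Similarity measure (assumed): Euclidean distance, smaller means more similar.
    by_distance = sorted(
        training_data,
        key=lambda pair: math.dist(pair[0], new_example),
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical stored instances: (attribute values, label).
training_data = [
    ((1.0, 1.0), "A"),
    ((1.5, 2.0), "A"),
    ((2.0, 1.5), "A"),
    ((8.0, 8.0), "B"),
    ((8.5, 9.0), "B"),
    ((9.0, 8.5), "B"),
]

print(knn_predict(training_data, (2.0, 2.0)))
print(knn_predict(training_data, (8.0, 9.0)))
```

There is no training step beyond storing the data. All the work happens at prediction time.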
What is Model-Based Learning?
Model-based Learning:
- Builds a model from the training data to make predictions
- The model learns a general rule that applies to all data, not just memorising individual instances
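As a sketch of model-based learning, the code below fits a straight line to training data with ordinary least squares and then predicts from the fitted rule alone, without consulting the stored examples. The data and the names `fit_line` and `predict` are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training data generated from y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)

def predict(x):
    """Apply the learned rule to any x, including values never seen in training."""
    return slope * x + intercept

print(slope, intercept, predict(10.0))
```

Once `slope` and `intercept` are learned, the training examples can be discarded. The two parameters are the model.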
What are some ways to classify Machine Learning Systems?
- Whether or not they can learn incrementally on the fly (online learning vs. batch learning)
- How they are supervised during training (supervised, semi-supervised, unsupervised, reinforcement learning).
- How they generalise (instance-based, model-based).
What are the main challenges of Machine Learning?
- Insufficient quantity of training data
- Non-representative or poor-quality data
- Irrelevant features
- Overfitting or underfitting
What is Overfitting and Underfitting?
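Overfitting: the model fits the training data too closely (including its noise), so it performs well on the training data but generalises poorly to new data.
Underfitting: the model is too simple to capture the underlying structure of the data, so it performs poorly even on the training data.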
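The difference can be seen by comparing error on the training data with error on unseen data. The sketch below uses invented data (true rule y = 2x, training labels with +/-1 noise) and three stand-in models: a constant predictor for underfitting, a memorising nearest-neighbour lookup for overfitting, and a least-squares line as a reasonable fit. These are illustrative assumptions, not the lecture's own example.

```python
def mse(predict, xs, ys):
    """Mean squared error of a prediction function on (xs, ys)."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data: true rule y = 2x; training labels carry +/-1 noise.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [1.0, 1.0, 5.0, 5.0, 9.0, 9.0]
test_x = [0.5, 1.5, 2.5, 3.5, 4.5]
test_y = [1.0, 3.0, 5.0, 7.0, 9.0]

mean_x = sum(train_x) / len(train_x)
mean_y = sum(train_y) / len(train_y)

def constant_model(x):
    """Underfit stand-in: always predict the mean training label."""
    return mean_y

def memorising_model(x):
    """Overfit stand-in: return the label of the nearest memorised training x."""
    nearest = min(zip(train_x, train_y), key=lambda pair: abs(pair[0] - x))
    return nearest[1]

# Reasonable fit: a least-squares straight line.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y)) / sum(
    (x - mean_x) ** 2 for x in train_x
)
intercept = mean_y - slope * mean_x

def line_model(x):
    return slope * x + intercept

for name, model in [
    ("constant", constant_model),
    ("memorising", memorising_model),
    ("line", line_model),
]:
    train_err = round(mse(model, train_x, train_y), 3)
    test_err = round(mse(model, test_x, test_y), 3)
    print(name, "train MSE:", train_err, "test MSE:", test_err)
```

Under these assumptions the memorising model has zero training error but a larger test error than the line, while the constant model has a high error on both sets.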
How can you test and validate the performance of Machine Learning?
- Split your data into two sets: the training set and the test set. Train your model using the training set, and test it using the test set.
- It is common to use 80% of the data for training and hold out 20% for testing.
- The error rate on the test set is called the generalisation error or out-of-sample error.
NOTE: Using the test set to pick the best hyperparameter values tends to produce a model that does not perform well on new data other than the test set, so we also need a separate validation set.
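A minimal sketch of such a split: hold out 20% of the data for testing, then set aside part of the remainder for validation. The 20% validation share, the fixed seed, and the name `split_dataset` are assumptions for illustration.

```python
import random

def split_dataset(data, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle the data, then split it into training, validation, and test sets."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)  # fixed seed for a repeatable split
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

# Hypothetical dataset of 100 examples, represented by their indices.
train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))
```

Hyperparameters are then tuned against the validation set, and the test set is used only once at the end to estimate the generalisation error.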