Lecture 2 - Introduction to Machine Learning Flashcards
What is Machine Learning?
The study of algorithms that can learn from data and can make predictions on new data.
What are the primary differences between traditional programming versus machine learning programming?
** Traditional Programming **
Can be seen as having one stage
Data -> Program -> Output (Output being the focus)
** Machine Learning **
Can be seen as having two stages
1. Training
Data -> Algorithm -> Output (Algorithm being the focus)
2. Deployment
Data -> Algorithm -> Output (Output being the focus)
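A minimal sketch of the contrast, assuming a toy spam task whose only feature is the number of exclamation marks in a message. The function names, the threshold of 3 and the training data are all hypothetical and only serve to illustrate the two stages.

```python
# Traditional programming: a human writes the rule.
def is_spam_rule(exclamation_marks: int) -> bool:
    return exclamation_marks > 3  # threshold chosen by the programmer


# Machine learning, stage 1 (training): the rule is fitted from labelled data.
def train_threshold(counts, labels):
    """Return the threshold that classifies the most training examples correctly."""
    best_threshold, best_correct = 0, -1
    for threshold in sorted(set(counts)):
        correct = sum((c > threshold) == y for c, y in zip(counts, labels))
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


# Hypothetical training data: exclamation-mark counts and spam labels.
train_counts = [0, 1, 2, 5, 7, 9]
train_labels = [False, False, False, True, True, True]
learned_threshold = train_threshold(train_counts, train_labels)


# Machine learning, stage 2 (deployment): the fitted rule predicts on new data.
def is_spam_learned(exclamation_marks: int) -> bool:
    return exclamation_marks > learned_threshold
```

In the first function the programmer supplies the rule; in the second half the rule (here a single threshold) comes out of the training stage and is then reused unchanged at deployment.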
What is an example of a timeline of machine learning?
Data Collection -> Preprocessing -> Exploratory Data Analysis -> Model Building -> Splitting the Data -> Training and Validation -> Testing the model -> Deploying the model
What is CRISP-DM?
CRISP-DM stands for Cross-Industry Standard Process for Data Mining
It is a commonly used methodology for data mining
It divides data mining into six phases starting from business understanding
What are the six stages of CRISP-DM?
The process is iterative and non-linear, but the general order of the stages is:
Business Understanding -> Data Understanding -> Data Preparation -> Modelling -> Evaluation -> Deployment
What is supervised machine learning?
Supervised machine learning is when data includes labels
What is the goal of supervised machine learning?
Use predictor variables to predict a target variable
What are some examples of supervised machine learning algorithms?
- linear regression, logistic regression
- decision tree, random forest
- support vector machines (SVM)
- neural network; generative models (GANs)
What is classification?
Classification is putting something into a category (Species of flowers, spam or not spam, will a customer click on an ad)
What is regression?
Regression is predicting a continuous value
stock prices, temperatures, sales volume, price of a house etc
What are some alternative names for features?
Features = Predictor Variables = Independent Variables
What are some alternative names for target variable?
Target Variable = Dependent Variable = Response Variable
What is unsupervised machine learning?
Unsupervised machine learning is when data does not include labels
What are some examples of unsupervised machine learning algorithms?
** Clustering **
- K-means clustering
- Hierarchical clustering
** Dimensionality Reduction **
- Principal Components Analysis (PCA)
- t-SNE
What is the primary goal of clustering?
To group the data points into clusters so that points in the same cluster are similar to each other, without using any labels
What is reinforcement learning? What are some example use cases?
- an agent observes the state of the environment, takes actions and gets rewards
- agent learns by itself to maximize reward
Examples of use cases: Self-driving cars, Gaming (AI), Robotics
What are variable types? What three variable types have we primarily worked with?
Categorical (Also nominal)
- Set of values without order
- Examples, Gender (M,F), hair color (black, brown, red)
Ordinal
- Set of ordered values
- Magnitude between successive values not known
- Examples, Clothing Size (XS, S, M, L, XL)
Continuous (Also numeric)
- Integer or real values
- Examples, Temperature, year
Out of Categorical, Ordinal and Continuous, which two variable types are discrete?
Categorical and Ordinal
What is an example of categorical features that we used in our research?
Browser and Device
What are some common issues that require data cleaning/preprocessing?
Missing values, faulty data, wrong data type or format, duplicates
What are some common preprocessing steps?
- Remove rows with missing data
- Fix rows with faulty datapoints
- Turn fields into numeric form
- Drop unnecessary variables (columns)
- Normalising data
- Renaming variables (remove capitalization, spaces etc)
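The steps above can be sketched in plain Python on a hypothetical list of records. The column names and values are made up for illustration, duplicate removal is included because duplicates are among the common issues, and in practice a dataframe library would usually do this work.

```python
# Hypothetical raw records: mixed capitalisation, a missing value, a duplicate,
# numbers stored as strings, and a column we do not need.
raw_rows = [
    {"User ID": "1", "Age": "25", "Browser": "Chrome", "Notes": "n/a"},
    {"User ID": "2", "Age": None, "Browser": "Firefox", "Notes": "n/a"},
    {"User ID": "1", "Age": "25", "Browser": "Chrome", "Notes": "n/a"},
    {"User ID": "3", "Age": "45", "Browser": "Safari", "Notes": "n/a"},
]


def preprocess(rows):
    cleaned, seen = [], set()
    for row in rows:
        # Rename variables: lower-case, spaces replaced by underscores.
        row = {key.lower().replace(" ", "_"): value for key, value in row.items()}
        # Drop an unnecessary column.
        row.pop("notes", None)
        # Remove rows with missing data.
        if any(value is None for value in row.values()):
            continue
        # Turn numeric fields into numeric form.
        row["user_id"] = int(row["user_id"])
        row["age"] = float(row["age"])
        # Remove duplicates.
        fingerprint = tuple(sorted(row.items()))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        cleaned.append(row)
    # Normalise age to the range [0, 1] (min-max scaling).
    ages = [row["age"] for row in cleaned]
    low, high = min(ages), max(ages)
    for row in cleaned:
        row["age"] = (row["age"] - low) / (high - low) if high > low else 0.0
    return cleaned


clean_rows = preprocess(raw_rows)
```

With this input, two rows survive: the row with the missing age and the duplicate of user 1 are dropped, and the remaining ages are scaled to 0.0 and 1.0.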
What is the median?
The middle value after sorting the data
What is the mode?
The most common value
What is variance?
A measure of dispersion of the data: the average squared deviation from the mean
What is standard deviation?
A measure of dispersion of the data, in the same unit as the data (the square root of the variance)
What is the range?
The difference between the minimum and maximum values in a dataset
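The summary statistics above can be computed with Python's standard `statistics` module. This is a small sketch on a hypothetical sample; it uses the population variance (dividing by n), whereas `statistics.variance` and `statistics.stdev` would give the sample versions (dividing by n - 1).

```python
import statistics

values = [2, 4, 4, 5, 7, 9, 11]  # hypothetical sample

median = statistics.median(values)        # middle value after sorting
mode = statistics.mode(values)            # most common value
variance = statistics.pvariance(values)   # mean squared deviation from the mean
std_dev = statistics.pstdev(values)       # square root of the variance
value_range = max(values) - min(values)   # maximum minus minimum
```

For this sample the median is 5, the mode is 4, the mean is 6, the population variance is 60/7 and the range is 9.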
Why should you visualise your data?
Summary statistics can be deceiving
Makes it easy to get an overview of outliers etc
Looks pretty
What is cross-validation?
Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample.
It generally gives a less biased (less optimistic) estimate of model performance than a single train/test split
Example: k-folds cross validation, where k is the number of splits (e.g., 10)
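A minimal sketch of how k-fold cross-validation partitions the data, assuming the rows are identified by their index. The helper name `k_fold_indices` is hypothetical; real implementations usually shuffle the data first, and the model is trained k times (once per fold) with the k validation scores averaged.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
```

Each sample ends up in the validation set exactly once and in the training set in the other k - 1 folds.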
What is meant by “splitting the data”?
Dataset is split into two or three subsets
- Training set (For training the model)
- Validation set (Optional: For tuning hyperparameters)
- Test set (For testing the model)
Typically done with train_test_split() (from scikit-learn's sklearn.model_selection)
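The idea behind the split can be sketched in plain Python. The function name `split_dataset`, its parameters and the fractions used below are assumptions for illustration, not a library API.

```python
import random


def split_dataset(rows, test_fraction=0.2, validation_fraction=0.0, seed=0):
    """Shuffle rows and split them into training, validation and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split reproducible
    n_test = round(len(shuffled) * test_fraction)
    n_validation = round(len(shuffled) * validation_fraction)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_validation]
    train = shuffled[n_test + n_validation:]
    return train, validation, test


train, validation, test = split_dataset(
    range(100), test_fraction=0.2, validation_fraction=0.1
)
```

Here 100 rows are split 70/10/20. The three subsets do not overlap, so the test set stays unseen until the final evaluation.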
What are performance metrics/measures used for?
Assessing how good your model is
What are some performance metrics used for classification?
- Confusion Matrix
- Precision
- Recall
- F1 score
- Accuracy
- ROC curve
- AUC (area under the ROC curve)
Describe a confusion matrix
Confusion matrix shows the performance of the classifier, by mapping the test data into True Positives, False Positives, True Negatives and False Negatives
How is Precision calculated? Can you conceptualise it?
Precision = True Positive / (True Positive + False Positive)
“Out of all predicted positives, what is the chance that it is actually a positive?”
How is Recall calculated? Can you conceptualise it?
Recall = True Positive / (True Positive + False Negative)
“Out of all actual positives, what is the chance that we predict correctly?”
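A short sketch that ties the confusion matrix to precision and recall, assuming binary labels where 1 means positive. The labels and predictions are hypothetical.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
    fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
    tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
    fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
    return tp, fp, tn, fn


# Hypothetical test labels and classifier predictions.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, tn, fn = confusion_counts(actual, predicted)
precision = tp / (tp + fp)          # of all predicted positives, how many are real
recall = tp / (tp + fn)             # of all actual positives, how many were found
accuracy = (tp + tn) / len(actual)  # share of all predictions that are correct
```

Here TP = 3, FP = 1, TN = 5 and FN = 1, so precision and recall are both 0.75 and accuracy is 0.8.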
What are some performance metrics used for regression?
- Mean squared error
- Root mean squared error
- Mean absolute error
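The three regression metrics above can be computed directly from the prediction errors. A small sketch on hypothetical values:

```python
import math

actual = [3.0, 5.0, 2.0, 7.0]      # hypothetical true values
predicted = [2.0, 5.0, 4.0, 8.0]   # hypothetical model estimates

errors = [p - a for a, p in zip(actual, predicted)]
mse = sum(e ** 2 for e in errors) / len(errors)   # mean squared error
rmse = math.sqrt(mse)                             # root mean squared error
mae = sum(abs(e) for e in errors) / len(errors)   # mean absolute error
```

The errors are -1, 0, 2 and 1, giving an MSE of 1.5 and an MAE of 1.0. RMSE is in the same unit as the target variable, which makes it easier to interpret than MSE.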
Which two types of parameters are there? When are they adjusted?
- Parameters
Adjusted automatically when training/fitting the ML model
- Hyperparameters
Control how an ML model learns and need to be set by the user before training
What is grid search?
Grid Search is a way to systematically tune hyperparameters
- if you had 2 hyperparameters Alpha and C
- we could use a grid search, where we iterate through a matrix with each hyperparameter combination
Give some examples of functions to visualize data and some examples of functions to analyze the mean, median, and so on
.boxplot
.pairplot
.displot
.describe
What is Mean Squared Error (MSE)?
MSE is the average squared difference between the estimates and the actual values
How exactly does Grid Search work?
It iterates through every combination of hyperparameter values in the grid, trains and evaluates the model for each combination, and keeps the combination that gives the best model performance
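A minimal sketch of the loop behind a grid search over two hyperparameters, Alpha and C. The `validation_score` function is a hypothetical stand-in for "train the model with these hyperparameters and score it on validation data", and the grid values are made up.

```python
from itertools import product


def validation_score(alpha, c):
    """Hypothetical stand-in for training a model and scoring it on validation data."""
    return -((alpha - 0.1) ** 2) - ((c - 10) ** 2)


grid = {"alpha": [0.01, 0.1, 1.0], "C": [1, 10, 100]}

best_score, best_params = float("-inf"), None
# product() enumerates every (alpha, C) combination: 3 x 3 = 9 candidates.
for alpha, c in product(grid["alpha"], grid["C"]):
    score = validation_score(alpha, c)
    if score > best_score:
        best_score, best_params = score, {"alpha": alpha, "C": c}
```

The number of combinations grows multiplicatively with each added hyperparameter, which is why grids are usually kept small.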
True or False? Mean Absolute Error (MAE) is less sensitive to outliers than RMSE or MSE
True
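The reason is that squaring amplifies large errors. A small sketch with hypothetical prediction errors, where one error is replaced by an outlier:

```python
import math


def mae(errors):
    return sum(abs(e) for e in errors) / len(errors)


def rmse(errors):
    return math.sqrt(sum(e ** 2 for e in errors) / len(errors))


typical_errors = [1, -1, 1, -1]   # hypothetical prediction errors
with_outlier = [1, -1, 1, -10]    # same, but with one large error

mae_increase = mae(with_outlier) - mae(typical_errors)
rmse_increase = rmse(with_outlier) - rmse(typical_errors)
```

Both metrics equal 1 on the typical errors. With the outlier, MAE rises to 3.25 while RMSE rises to about 5.07, so the single large error moves RMSE much more.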
True or False? One-hot encoding is used to transform ordinal values to continuous (having an inherent order)
False. One-hot encoding transforms categorical values into binary (0/1) indicator columns, one per category, which carry no inherent order.
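A minimal sketch of one-hot encoding in plain Python, using a hypothetical Browser column. The helper name `one_hot_encode` is an assumption for illustration.

```python
def one_hot_encode(values):
    """Map each categorical value to a list of 0/1 indicators, one per category."""
    categories = sorted(set(values))
    encoded = [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]
    return categories, encoded


browsers = ["Chrome", "Firefox", "Chrome", "Safari"]  # hypothetical feature column
categories, encoded = one_hot_encode(browsers)
```

Each row contains exactly one 1, so no category is treated as larger or smaller than another.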
Define categorical, ordinal, and continuous variables.
Categorical - set of values without inherent order
Ordinal - set of ordered values whose magnitude between successive values is unknown
Continuous - integer or real values whose magnitude between successive values is known
What is a perfect ROC AUC score and what is a random ROC AUC score?
1 is a perfect model
0.5 is a random model
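One way to see where these two values come from: the area under the ROC curve equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The sketch below computes it by comparing all pairs; the labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]
# Every positive outranks every negative.
perfect_auc = roc_auc(labels, [0.1, 0.2, 0.8, 0.9])
# Identical scores carry no information about the class.
uninformative_auc = roc_auc(labels, [0.5, 0.5, 0.5, 0.5])
```

The first call gives 1.0 and the second gives 0.5, matching the perfect and random cases.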
True or False? F1 score is useful when the classes are unevenly distributed
True
Define F1 Score
F1 score is the harmonic mean of precision and recall: F1 = 2 * (Precision * Recall) / (Precision + Recall). It measures the balance between precision and recall
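F1 is usually defined as the harmonic mean of precision and recall, which a short sketch makes concrete. The input values are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


balanced = f1_score(0.8, 0.8)   # precision and recall agree
lopsided = f1_score(1.0, 0.1)   # high precision, very low recall
```

The balanced case gives 0.8, while the lopsided case gives about 0.18, far below the arithmetic mean of 0.55. The harmonic mean is pulled towards the lower of the two values, so a model cannot score well on F1 by doing well on only one of them.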