Lecture 2 - Introduction to Machine Learning Flashcards

1
Q

What is Machine Learning?

A

The study of algorithms that can learn from data and can make predictions on new data.

2
Q

What are the primary differences between traditional programming and machine learning?

A

** Traditional Programming **
Can be seen as having one stage
Data -> Program -> Output (Output being the focus)

** Machine Learning **
Can be seen as having two stages
1. Training
   Data -> Algorithm -> Output (Algorithm being the focus)

2. Deployment
   Data -> Algorithm -> Output (Output being the focus)
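
A minimal sketch of the contrast, using a hypothetical spam example. In the traditional version the rule is written by hand; in the machine learning version a threshold is learned from labelled data (training) and then applied to new data (deployment). The function names and the toy data are assumptions made for illustration.

```python
# Traditional programming: the rule is written by hand.
def is_spam_rule(num_links):
    return num_links > 3  # threshold chosen by the programmer


# Machine learning, stage 1 (training): learn the threshold from labelled data.
def train_threshold(examples):
    """examples: list of (num_links, is_spam) pairs.

    Returns the candidate threshold that classifies the most
    training examples correctly.
    """
    candidates = sorted({x for x, _ in examples})

    def accuracy(threshold):
        return sum((x > threshold) == y for x, y in examples) / len(examples)

    return max(candidates, key=accuracy)


# Machine learning, stage 2 (deployment): apply the learned threshold to new data.
def predict(threshold, num_links):
    return num_links > threshold


training_data = [(0, False), (1, False), (2, False), (5, True), (7, True), (9, True)]
learned_threshold = train_threshold(training_data)
print(learned_threshold, predict(learned_threshold, 8), predict(learned_threshold, 1))
```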
3
Q

What is an example of a timeline of machine learning?

A

Data Collection -> Preprocessing -> Exploratory Data Analysis -> Model Building -> Splitting the Data -> Training and Validation -> Testing the model -> Deploying the model

4
Q

What is CRISP-DM?

A

CRISP-DM is an acronym for Cross-Industry Standard Process for Data Mining

It is a commonly used methodology for data mining

It divides data mining into six phases starting from business understanding

5
Q

What are the six stages of CRISP-DM?

A

The process is iterative and non-linear, but the general stage structure looks like this:

Business Understanding -> Data Understanding -> Data Preparation -> Modelling -> Evaluation -> Deployment

6
Q

What is supervised machine learning?

A

Supervised machine learning is when data includes labels

7
Q

What is the goal of supervised machine learning?

A

Use predictor variables to predict a target variable

8
Q

What are some examples of supervised machine learning algorithms?

A
  • linear regression, logistic regression
  • decision tree, random forest
  • support vector machines (SVM)
  • neural network; generative models (GANs)
9
Q

What is classification?

A

Classification is predicting which category something belongs to (e.g. species of flower, spam or not spam, whether a customer will click on an ad)

10
Q

What is regression?

A

Regression is predicting a continuous value

Examples: stock prices, temperatures, sales volume, the price of a house, etc.

11
Q

What are some alternative names for features?

A

Features = Predictor Variables = Independent Variables

12
Q

What are some alternative names for target variable?

A

Target Variable = Dependent Variable = Response Variable

13
Q

What is unsupervised machine learning?

A

Unsupervised machine learning is when data does not include labels

14
Q

What are some examples of unsupervised machine learning algorithms?

A
** Clustering **
  • K-means clustering
  • Hierarchical clustering

** Dimensionality Reduction **
  • Principal Component Analysis (PCA)
  • t-SNE
15
Q

What is the primary goal of clustering?

A

To group the data points in a data set into multiple groups (clusters), so that the points the algorithm places in the same group are the ones it considers similar
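
A minimal sketch of the idea, using k-means on one-dimensional data. The function name, the starting centroids and the toy points are assumptions chosen for illustration; real implementations work on multi-dimensional data and pick starting centroids more carefully.

```python
def k_means_1d(points, centroids, iterations=10):
    """Group 1-D points around the given starting centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


points = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
centroids, clusters = k_means_1d(points, centroids=[0.0, 5.0])
print(centroids, clusters)
```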

16
Q

What is reinforcement learning? What are some example use cases?

A
  • an agent observes the state of the environment, takes actions and gets rewards
  • agent learns by itself to maximize reward

Examples of use cases: Self-driving cars, Gaming (AI), Robotics

17
Q

What are variable types? What three variable types have we primarily worked with?

A

Categorical (Also nominal)

  • Set of values without order
  • Examples: gender (M, F), hair color (black, brown, red)

Ordinal

  • Set of ordered values
  • Magnitude between successive values not known
  • Examples: clothing size (XS, S, M, L, XL)

Continuous (Also numeric)

  • Integer or real values
  • Examples: temperature, year
18
Q

Out of Categorical, Ordinal and Continuous, which two variable types are discrete?

A

Categorical and Ordinal

19
Q

What are examples of categorical features that we used in our research?

A

Browser and Device

20
Q

What are some common issues that require data cleaning/preprocessing?

A

Missing values, faulty data, wrong data type or format, duplicates

21
Q

What are some common preprocessing steps?

A
  • Remove rows with missing data
  • Fix rows with faulty datapoints
  • Turn fields into numeric form
  • Drop unnecessary variables (columns)
  • Normalise data
  • Rename variables (remove capitalization, spaces etc.)
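
Several of these steps can be sketched in plain Python on a list of records. The column names and rows below are made up for illustration, and min-max scaling is only one possible way of normalising.

```python
raw_rows = [
    {"User Age": "25", "Browser": "Chrome"},
    {"User Age": None, "Browser": "Firefox"},  # missing value
    {"User Age": "40", "Browser": "Safari"},
    {"User Age": "25", "Browser": "Chrome"},   # duplicate of the first row
]


def rename(key):
    """Remove capitalization and spaces from a variable name."""
    return key.strip().lower().replace(" ", "_")


cleaned = []
seen = set()
for row in raw_rows:
    if any(value is None for value in row.values()):
        continue  # remove rows with missing data
    row = {rename(k): v for k, v in row.items()}  # rename variables
    row["user_age"] = int(row["user_age"])        # turn the field into numeric form
    fingerprint = tuple(sorted(row.items()))
    if fingerprint in seen:
        continue  # skip duplicate rows
    seen.add(fingerprint)
    cleaned.append(row)

# Min-max normalisation of user_age to the range [0, 1].
ages = [r["user_age"] for r in cleaned]
lowest, highest = min(ages), max(ages)
for r in cleaned:
    r["user_age_norm"] = (r["user_age"] - lowest) / (highest - lowest)

print(cleaned)
```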
22
Q

What is the median?

A

The middle value after sorting the data

23
Q

What is the mode?

A

The most common value

24
Q

What is variance?

A

A measure of dispersion of the data: the average squared deviation from the mean

25
What is standard deviation?
A measure of dispersion of the data, in the same unit as the data (the square root of the variance)
26
What is the range?
The difference between the minimum and maximum values in a dataset
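
The summary statistics from the cards above (median, mode, variance, standard deviation, range) can be computed with Python's standard `statistics` module. The data set is made up, and the population versions of variance and standard deviation are used here as an assumption; `statistics.variance` and `statistics.stdev` give the sample versions.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

median = statistics.median(data)       # middle value after sorting
mode = statistics.mode(data)           # most common value
variance = statistics.pvariance(data)  # average squared deviation from the mean
std_dev = statistics.pstdev(data)      # square root of the variance
value_range = max(data) - min(data)    # maximum minus minimum

print(median, mode, variance, std_dev, value_range)
```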
27
Why should you visualise your data?
  • Summary statistics can be deceiving
  • Makes it easy to get an overview of outliers etc.
  • Looks pretty
28
What is cross-validation?
Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample. It generally gives a less biased or less optimistic estimate of model performance than a single train/test split. Example: k-fold cross-validation, where k is the number of splits (e.g., 10)
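
A minimal sketch of how k-fold cross-validation divides the data, written by hand to show the mechanics. The helper name is hypothetical; in practice the data is usually shuffled first and a library routine is used instead.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
for train, test in folds:
    print("train:", train, "test:", test)
```

Each sample lands in the test fold exactly once, so the model is trained and evaluated k times and the k scores are averaged.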
29
What is meant by "splitting the data"?
The dataset is split into two or three subsets:
  • Training set (for training the model)
  • Validation set (optional: for tuning hyperparameters)
  • Test set (for testing the model)
In scikit-learn this is typically done with train_test_split()
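
A minimal sketch of a three-way split in plain Python. The function name, the 60/20/20 proportions and the fixed seed are assumptions for illustration, not values from the lecture.

```python
import random


def split_dataset(rows, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle the rows and split them into training, validation and test sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed keeps the split reproducible
    n_test = int(len(rows) * test_fraction)
    n_val = int(len(rows) * val_fraction)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test


train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))
```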
30
What are performance metrics/measures used for?
Assessing how good your model is
31
What are some performance metrics used for classification?
  • Confusion matrix
  • Precision
  • Recall
  • F1 score
  • Accuracy
  • ROC curve and AUC (area under the curve)
32
Describe a confusion matrix
Confusion matrix shows the performance of the classifier, by mapping the test data into True Positives, False Positives, True Negatives and False Negatives
33
How is Precision calculated? Can you conceptualise it?
Precision = True Positives / (True Positives + False Positives) | "Out of all predicted positives, what is the chance that a prediction is actually positive?"
34
How is Recall calculated? Can you conceptualise it?
Recall = True Positives / (True Positives + False Negatives) | "Out of all actual positives, what is the chance that we predict correctly?"
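
The confusion matrix counts and the precision and recall formulas can be sketched as follows. The labels and predictions are made up, with 1 meaning positive and 0 meaning negative.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)  # share of predicted positives that are actually positive
recall = tp / (tp + fn)     # share of actual positives that were predicted positive

print(tp, fp, tn, fn, precision, recall)
```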
35
What are some performance metrics used for regression?
  • Mean squared error (MSE)
  • Root mean squared error (RMSE)
  • Mean absolute error (MAE)
36
Which two types of parameters are there? When are they adjusted?
1. Parameters: adjusted automatically when training/fitting the ML model
2. Hyperparameters: control how an ML model learns and need to be set by the user
37
What is grid search?
Grid search is a way to systematically tune hyperparameters. For example, with two hyperparameters, alpha and C, a grid search iterates through a matrix containing every combination of their candidate values
38
Give some examples of functions to visualize data and some examples of functions to analyze the mean, median, and so on
Visualise: .boxplot, .pairplot, .displot (seaborn) | Summary statistics: .describe (pandas)
39
What is Mean Squared Error (MSE)?
MSE is the average squared difference between the estimates and the actual values
40
How exactly does Grid Search work?
It iterates through a matrix containing every hyperparameter combination, evaluates the model's performance for each combination, and keeps the combination that performs best
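
A minimal sketch of the loop, using the two hyperparameters alpha and C from the earlier card. The `evaluate` function is a hypothetical stand-in for "train a model with these settings and return a validation score"; the candidate values are also assumptions.

```python
from itertools import product


def evaluate(alpha, C):
    """Hypothetical stand-in for training a model and scoring it on validation data."""
    return -((alpha - 0.1) ** 2) - ((C - 10) ** 2)


param_grid = {"alpha": [0.01, 0.1, 1.0], "C": [1, 10, 100]}

best_score, best_params = float("-inf"), None
# product() yields every (alpha, C) combination, i.e. every cell of the grid.
for alpha, C in product(param_grid["alpha"], param_grid["C"]):
    score = evaluate(alpha, C)
    if score > best_score:
        best_score, best_params = score, {"alpha": alpha, "C": C}

print(best_params, best_score)
```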
41
True or False? Mean Absolute Error (MAE) is less sensitive to outliers than RMSE or MSE
True
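
A small sketch of the three regression metrics, with made-up values, showing why a single large error moves MSE and RMSE more than MAE: the error is squared before averaging.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error: average squared difference."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))


def mae(y_true, y_pred):
    """Mean absolute error: average absolute difference."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [10, 12, 14, 16]
y_pred = [11, 11, 15, 15]          # every prediction is off by 1
y_pred_outlier = [11, 11, 15, 26]  # the last prediction is off by 10

print(mse(y_true, y_pred), rmse(y_true, y_pred), mae(y_true, y_pred))
print(mse(y_true, y_pred_outlier), rmse(y_true, y_pred_outlier), mae(y_true, y_pred_outlier))
```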
42
True or False? One-hot encoding is used to transform ordinal values to continuous (having an inherent order)
False. One-hot encoding is used to transform categorical values into binary indicator columns (one per category), which have no inherent order.
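
A minimal sketch of one-hot encoding in plain Python, using browser names as the categorical feature. The helper name and values are assumptions for illustration.

```python
def one_hot_encode(values):
    """Return the sorted category list and one 0/1 indicator row per value."""
    categories = sorted(set(values))
    encoded = [[1 if value == category else 0 for category in categories]
               for value in values]
    return categories, encoded


categories, encoded = one_hot_encode(["Chrome", "Firefox", "Safari", "Chrome"])
print(categories)
for row in encoded:
    print(row)
```

Each row contains exactly one 1, so no category is treated as larger or smaller than another.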
43
Define categorical, ordinal, and continuous variables.
  • Categorical: set of values without inherent order
  • Ordinal: set of ordered values whose magnitude between successive values is unknown
  • Continuous: integer or real values whose magnitude between successive values is known
44
What is a perfect ROC AUC score and what is a random ROC AUC score?
1 is a perfect model | 0.5 is a random model
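
Assuming the score meant here is the area under the ROC curve (AUC), it can be read as the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The sketch below computes it that way on made-up scores; the function name is hypothetical.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


y_true = [0, 0, 1, 1]
perfect = roc_auc(y_true, [0.1, 0.2, 0.8, 0.9])        # every positive outranks every negative
uninformative = roc_auc(y_true, [0.5, 0.5, 0.5, 0.5])  # scores carry no information
print(perfect, uninformative)
```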
45
True or False? F1 score is useful when the classes are unevenly distributed
True
46
Define F1 Score
F1 score is the harmonic mean of precision and recall: F1 = 2 × (Precision × Recall) / (Precision + Recall). It measures the balance between precision and recall
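
F1 is usually computed as the harmonic mean of precision and recall, which the sketch below follows. The input values are made up; the second call shows how a low recall pulls F1 down even when precision is perfect.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


balanced = f1_score(0.6, 0.75)
lopsided = f1_score(1.0, 0.1)  # perfect precision, poor recall
print(balanced, lopsided)
```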