Lecture 2 - Introduction to Machine Learning Flashcards

1
Q

What is Machine Learning?

A

The study of algorithms that can learn from data and can make predictions on new data.

2
Q

What are the primary differences between traditional programming and machine learning?

A

** Traditional Programming **
Can be seen as having one stage
Data -> Program -> Output (Output being the focus)

** Machine Learning **
Can be seen as having two stages
  1. Training
    Data -> Algorithm -> Output (Algorithm being the focus: it is fitted to the data)
  2. Deployment
    Data -> Algorithm -> Output (Output being the focus)
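
A minimal sketch of the contrast, assuming a toy temperature-conversion task (the data and function names are hypothetical, chosen only for illustration). In the traditional version a person writes the rule; in the machine learning version the rule is estimated from labelled examples and then applied to new data.

```python
# Traditional programming: a human writes the rule by hand.
def celsius_to_fahrenheit_rule(celsius):
    return celsius * 1.8 + 32


# Machine learning, stage 1 (training): estimate the rule from examples.
def fit_line(xs, ys):
    """Ordinary least squares for a single feature; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / spread
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data: Celsius readings with known Fahrenheit labels.
train_celsius = [-10.0, 0.0, 15.0, 30.0]
train_fahrenheit = [14.0, 32.0, 59.0, 86.0]
slope, intercept = fit_line(train_celsius, train_fahrenheit)


# Machine learning, stage 2 (deployment): apply the learned rule to new data.
def predict(celsius):
    return slope * celsius + intercept


print(round(slope, 3), round(intercept, 3))
print(round(predict(100.0), 1))
```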
3
Q

What is an example of a timeline of machine learning?

A

Data Collection -> Preprocessing -> Exploratory Data Analysis -> Model Building -> Splitting the Data -> Training and Validation -> Testing the model -> Deploying the model

4
Q

What is CRISP-DM?

A

CRISP-DM stands for Cross-Industry Standard Process for Data Mining

It is a commonly used methodology for data mining

It divides data mining into six phases starting from business understanding

5
Q

What are the six stages of CRISP-DM?

A

The stages are iterative and non-linear (you can move back and forth between them), but the general structure looks like this:

Business Understanding -> Data Understanding -> Data Preparation -> Modelling -> Evaluation -> Deployment

6
Q

What is supervised machine learning?

A

Supervised machine learning is learning from data that includes labels (the known target values)

7
Q

What is the goal of supervised machine learning?

A

Use predictor variables to predict a target variable

8
Q

What are some examples of supervised machine learning algorithms?

A
  • linear regression, logistic regression
  • decision tree, random forest
  • support vector machines (SVM)
  • neural network; generative models (GANs)
9
Q

What is classification?

A

Classification is putting something into a category (Species of flowers, spam or not spam, will a customer click on an ad)

10
Q

What is regression?

A

Regression is predicting a continuous value

Examples: stock prices, temperatures, sales volume, the price of a house, etc.

11
Q

What are some alternative names for features?

A

Features = Predictor Variables = Independent Variables

12
Q

What are some alternative names for target variable?

A

Target Variable = Dependent Variable = Response Variable

13
Q

What is unsupervised machine learning?

A

Unsupervised machine learning is learning from data that does not include labels

14
Q

What are some examples of unsupervised machine learning algorithms?

A

** Clustering **
  • K-means clustering
  • Hierarchical clustering

** Dimensionality Reduction **
  • Principal Component Analysis (PCA)
  • t-SNE
15
Q

What is the primary goal of clustering?

A

Given a data set, the algorithm groups the data points into clusters so that points in the same cluster are similar to each other

16
Q

What is reinforcement learning? What are some example use cases?

A
  • an agent observes the state of the environment, takes actions and gets rewards
  • agent learns by itself to maximize reward

Examples of use cases: Self-driving cars, Gaming (AI), Robotics

17
Q

What are variable types? What three variable types have we primarily worked with?

A

Categorical (Also nominal)

  • Set of values without order
  • Examples: Gender (M, F), hair color (black, brown, red)

Ordinal

  • Set of ordered values
  • Magnitude between successive values not known
  • Examples: Clothing size (XS, S, M, L, XL)

Continuous (Also numeric)

  • Integer or real values
  • Examples: Temperature, year
18
Q

Out of Categorical, Ordinal and Continuous, which two variable types are discrete?

A

Categorical and Ordinal

19
Q

What is an example of categorical features that we used in our research?

A

Browser and Device

20
Q

What are some common issues that require data cleaning/preprocessing?

A

Missing values, faulty data, wrong data type or format, duplicates

21
Q

What are some common preprocessing steps?

A
  • Remove rows with missing data
  • Fix rows with faulty datapoints
  • Turn fields into numeric form
  • Drop unnecessary variables (columns)
  • Normalise data
  • Rename variables (remove capitalization, spaces etc)
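
The steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the records, the column names ("User Age", "Browser", "Session ID") and the helper names are hypothetical, and min-max scaling is just one possible way to normalise. In practice a data-frame library would usually do this work.

```python
# Hypothetical raw records; None marks a missing value.
raw_rows = [
    {"User Age": "34", "Browser": "Chrome", "Session ID": "a1"},
    {"User Age": None, "Browser": "Firefox", "Session ID": "b2"},
    {"User Age": "51", "Browser": "Safari", "Session ID": "c3"},
    {"User Age": "34", "Browser": "Chrome", "Session ID": "a1"},  # duplicate
]


def clean(rows):
    cleaned, seen = [], set()
    for row in rows:
        # Remove rows with missing data.
        if any(value is None for value in row.values()):
            continue
        # Rename variables: lower case, spaces replaced by underscores.
        row = {key.lower().replace(" ", "_"): value for key, value in row.items()}
        # Skip exact duplicates of a row we have already kept.
        fingerprint = tuple(sorted(row.items()))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        # Drop a column that is not needed for modelling.
        del row["session_id"]
        # Turn a text field into numeric form.
        row["user_age"] = int(row["user_age"])
        cleaned.append(row)
    return cleaned


def min_max_normalise(rows, column):
    """Rescale one numeric column to [0, 1]; assumes the values are not all equal."""
    values = [row[column] for row in rows]
    low, high = min(values), max(values)
    return [{**row, column: (row[column] - low) / (high - low)} for row in rows]


clean_rows = min_max_normalise(clean(raw_rows), "user_age")
print(clean_rows)
```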
22
Q

What is the median?

A

The middle value after sorting the data

23
Q

What is the mode?

A

The most common value

24
Q

What is variance?

A

A measure of the dispersion of the data: the average squared deviation from the mean

25
Q

What is standard deviation?

A

A measure of the dispersion of the data, in the same unit as the data: the square root of the variance

26
Q

What is the range?

A

The difference between the minimum and maximum values in a dataset
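
A short sketch of the summary statistics from the last few cards, using Python's standard `statistics` module on a hypothetical sample. It assumes the population versions of variance and standard deviation (dividing by n); `statistics.variance` and `statistics.stdev` give the sample versions (dividing by n - 1).

```python
import statistics

values = [2, 4, 4, 5, 7, 9, 11]  # hypothetical sample

median = statistics.median(values)       # middle value after sorting
mode = statistics.mode(values)           # most common value
variance = statistics.pvariance(values)  # average squared deviation from the mean
std_dev = statistics.pstdev(values)      # square root of the variance
value_range = max(values) - min(values)  # maximum minus minimum

print(median, mode, round(variance, 3), round(std_dev, 3), value_range)
```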

27
Q

Why should you visualise your data?

A

Summary statistics can be deceiving
Makes it easy to get an overview of outliers etc
Looks pretty

28
Q

What is cross-validation?

A

Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample.

It generally gives a less biased or less optimistic estimate of model performance than a single train/test split

Example: k-fold cross-validation, where k is the number of folds the data is split into (e.g., 10)
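
A minimal sketch of how k-fold cross-validation partitions the data, written in plain Python. The function name `k_fold_indices` is hypothetical, and the sketch leaves out shuffling, which is usually applied first. Each sample lands in the validation fold exactly once and in the training data for the other k - 1 rounds.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
for train, validation in folds:
    print("validate on", validation, "train on", train)
```

The model is trained and scored once per fold, and the k scores are averaged to get the final estimate.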

29
Q

What is meant by “splitting the data”?

A

Dataset is split into two or three subsets

  • Training set (For training the model)
  • Validation set (Optional: For tuning hyperparameters)
  • Test set (For testing the model)

e.g. train_test_split() in scikit-learn
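
A sketch of a three-way split in plain Python, to show the idea without any library. The function name `split_dataset`, the 60/20/20 proportions and the fixed seed are assumptions made for illustration.

```python
import random


def split_dataset(rows, validation_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle the rows, then cut them into train, validation and test subsets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed so the split is reproducible
    n_test = int(len(rows) * test_fraction)
    n_validation = int(len(rows) * validation_fraction)
    test = rows[:n_test]
    validation = rows[n_test:n_test + n_validation]
    train = rows[n_test + n_validation:]
    return train, validation, test


train_set, validation_set, test_set = split_dataset(range(100))
print(len(train_set), len(validation_set), len(test_set))
```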

30
Q

What are performance metrics/measures used for?

A

Assessing how good your model is

31
Q

What are some performance metrics used for classification?

A
  • Confusion Matrix
  • Precision
  • Recall
  • F1 score
  • Accuracy
  • ROC curve
  • AUC (area under the ROC curve)
32
Q

Describe a confusion matrix

A

A confusion matrix shows the performance of a classifier by sorting its predictions on the test data into True Positives, False Positives, True Negatives and False Negatives
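
A small sketch of how the four cells are counted for a binary classifier. The labels and predictions are hypothetical, and 1 is assumed to be the positive class.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count the four confusion-matrix cells for a binary classifier."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["TP" if a == positive else "FP"] += 1
        else:
            counts["FN" if a == positive else "TN"] += 1
    return counts


actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]     # true labels
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]  # classifier output
counts = confusion_counts(actual, predicted)
print(counts)
```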

33
Q

How is Precision calculated? Can you conceptualise it?

A

Precision = True Positives / (True Positives + False Positives)

“Out of all predicted positives, what fraction are actually positive?”

34
Q

How is Recall calculated? Can you conceptualise it?

A

Recall = True Positives / (True Positives + False Negatives)

“Out of all actual positives, what fraction did we predict correctly?”
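
A worked example of both formulas, assuming hypothetical counts of 8 true positives, 2 false positives and 4 false negatives.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are actually positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of actual positives that the model found."""
    return tp / (tp + fn)


tp, fp, fn = 8, 2, 4  # hypothetical counts
print(round(precision(tp, fp), 3))  # 8 of the 10 predicted positives are right
print(round(recall(tp, fn), 3))     # 8 of the 12 actual positives were found
```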

35
Q

What are some performance metrics used for regression?

A
  • Mean squared error
  • Root mean squared error
  • Mean absolute error
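
The three metrics can be sketched in a few lines of plain Python. The actual and predicted values below are hypothetical.

```python
import math


def mean_squared_error(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    # Same information as MSE, but back in the unit of the target variable.
    return math.sqrt(mean_squared_error(actual, predicted))


def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.0, 5.0, 4.0, 8.0]
print(mean_squared_error(actual, predicted))
print(round(root_mean_squared_error(actual, predicted), 4))
print(mean_absolute_error(actual, predicted))
```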
36
Q

Which two types of parameters are there? When are they adjusted?

A
  1. Parameters
    Adjusted automatically when training/fitting the ML model
  2. Hyperparameters
    Control how an ML model learns and need to be set by the user before training
37
Q

What is grid search?

A

Grid Search is a way to systematically tune hyperparameters

  • e.g. with 2 hyperparameters, Alpha and C
  • a grid search iterates through a grid containing every combination of their candidate values
38
Q

Give some examples of functions to visualize data and some examples of functions to analyze the mean, median, and so on

A

.boxplot
.pairplot
.displot

.describe

39
Q

What is Mean Squared Error?

A

MSE is the average squared difference between the estimates and the actual values

40
Q

How exactly does Grid Search work?

A

It iterates through a grid containing every hyperparameter combination, evaluates the model's performance for each one, and keeps the combination that performs best
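
A minimal sketch of the loop, assuming two hyperparameters named alpha and C with hypothetical candidate values. The `validation_score` function is a stand-in: in real use it would train the model with those settings and score it on validation data (scikit-learn's GridSearchCV combines this search with cross-validation).

```python
from itertools import product

# Hypothetical candidate values for each hyperparameter.
param_grid = {"alpha": [0.01, 0.1, 1.0], "C": [1, 10, 100]}


def validation_score(alpha, C):
    """Stand-in for: train with these settings, then score on validation data."""
    return -((alpha - 0.1) ** 2) - ((C - 10) / 100) ** 2


best_score, best_params = float("-inf"), None
# product() yields every (alpha, C) combination: 3 x 3 = 9 in total.
for alpha, C in product(param_grid["alpha"], param_grid["C"]):
    score = validation_score(alpha, C)
    if score > best_score:
        best_score, best_params = score, {"alpha": alpha, "C": C}

print(best_params)
```

The cost grows with the product of the number of candidate values, so large grids get expensive quickly.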

41
Q

True or False? Mean Absolute Error (MAE) is less sensitive to outliers than RMSE or MSE

A

True
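
A small illustration of why, using hypothetical prediction errors. Squaring magnifies large errors, so a single outlier moves RMSE much more than MAE.

```python
import math


def mae(errors):
    return sum(abs(e) for e in errors) / len(errors)


def rmse(errors):
    return math.sqrt(sum(e * e for e in errors) / len(errors))


typical_errors = [1.0, -1.0, 1.0, -1.0]
with_outlier = typical_errors + [20.0]  # one badly wrong prediction

print(mae(typical_errors), rmse(typical_errors))              # equal without the outlier
print(round(mae(with_outlier), 3), round(rmse(with_outlier), 3))  # RMSE grows far more
```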

42
Q

True or False? One-hot encoding is used to transform ordinal values to continuous (having an inherent order)

A

False. One-hot encoding transforms categorical values into binary (0/1) indicator columns, which imply no inherent order.
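
A sketch of one-hot encoding in plain Python, using hypothetical browser values. Each category gets its own 0/1 column, so no category is treated as larger or smaller than another.

```python
def one_hot_encode(values):
    """Return the category list and one 0/1 row per input value."""
    categories = sorted(set(values))  # sorted only to make column order stable
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


browsers = ["Chrome", "Firefox", "Safari", "Chrome"]
categories, encoded = one_hot_encode(browsers)
print(categories)
for row in encoded:
    print(row)
```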

43
Q

Define categorical, ordinal, and continuous variables.

A

Categorical - set of values without inherent order
Ordinal - set of ordered values whose magnitude between successive values is unknown
Continuous - integer or real values whose magnitude between successive values is known

44
Q

What is a perfect ROC AUC score and what is a random ROC AUC score?

A

An AUC of 1 is a perfect model

An AUC of 0.5 is a random model
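
One way to see where these two numbers come from: the area under the ROC curve equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way on hypothetical labels and scores; the function name `roc_auc` is an assumption.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]
perfect = roc_auc(labels, [0.1, 0.2, 0.8, 0.9])        # every positive outranks every negative
uninformative = roc_auc(labels, [0.5, 0.5, 0.5, 0.5])  # scores carry no information
print(perfect, uninformative)
```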

45
Q

True or False? F1 score is useful when the classes are unevenly distributed

A

True

46
Q

Define F1 Score

A

F1 score is the harmonic mean of precision and recall: F1 = 2 * (Precision * Recall) / (Precision + Recall). It measures the balance between precision and recall
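
A short sketch with hypothetical precision and recall values. It shows that the F1 score stays low when one of the two is low, unlike a plain arithmetic average.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


balanced = f1_score(0.8, 0.8)
lopsided = f1_score(0.9, 0.1)
arithmetic_mean = (0.9 + 0.1) / 2

print(round(balanced, 3))
print(round(lopsided, 3), "versus arithmetic mean", arithmetic_mean)
```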