Lecture 2 - Introduction to Machine Learning Flashcards by Simon Sardorf

What is Machine Learning?

The study of algorithms that can learn from data and can make predictions on new data.

What are the primary differences between traditional programming versus machine learning programming?

** Traditional Programming **
Can be seen as having one stage
Data -> Program -> Output (Output being the focus)

** Machine Learning **
Can be seen as having two stages
1. Training
Data -> Algorithm -> Output (Algorithm being the focus)

2. Deployment
Data -> Algorithm -> Output (Output being the focus)
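
As a rough illustration of the two stages, the sketch below contrasts a hand-written rule with a rule whose threshold is learned from labelled examples. The function names, the link-count feature, and the toy data are all hypothetical, chosen only for illustration.

```python
# Traditional programming: the programmer writes the rule by hand.
def is_spam_rule(n_links):
    return n_links > 3  # threshold chosen by the programmer


# Machine learning, stage 1 (training): the threshold is learned from labelled data.
def fit_threshold(examples):
    """examples: list of (n_links, is_spam) pairs.
    Returns the candidate threshold with the fewest training errors."""
    candidates = sorted({x for x, _ in examples})

    def errors(t):
        return sum((x > t) != label for x, label in examples)

    return min(candidates, key=errors)


# Toy training data (hypothetical): (number of links, spam label)
training_data = [(0, False), (1, False), (2, False), (5, True), (7, True), (9, True)]
threshold = fit_threshold(training_data)


# Machine learning, stage 2 (deployment): apply the learned rule to new data.
def predict(n_links):
    return n_links > threshold


print(threshold, predict(6), predict(1))
```

In training, the interesting result is the learned threshold; in deployment, the interesting result is the prediction for new data.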

What is an example of a timeline of machine learning?

Data Collection -> Preprocessing -> Exploratory Data Analysis -> Model Building -> Splitting the Data -> Training and Validation -> Testing the model -> Deploying the model

What is CRISP-DM?

CRISP-DM stands for Cross-Industry Standard Process for Data Mining.

It is a commonly used methodology for data mining.

It divides a data mining project into six phases, starting with business understanding.

What are the six stages of CRISP-DM?

The stages are iterative and non-linear, but the general structure looks like this:

Business Understanding -> Data Understanding -> Data Preparation -> Modelling -> Evaluation -> Deployment

What is supervised machine learning?

Supervised machine learning is when data includes labels

What is the goal of supervised machine learning?

Use predictor variables to predict a target variable

What are some examples of supervised machine learning algorithms?

linear regression, logistic regression
decision tree, random forest
support vector machines (SVM)
neural network; generative models (GANs)

What is classification?

Classification is predicting which category something belongs to (e.g., the species of a flower, spam or not spam, whether a customer will click on an ad)

What is regression?

Regression is predicting a continuous value

stock prices, temperatures, sales volume, price of a house etc
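
A minimal sketch of regression, assuming a single predictor and an ordinary least-squares line fit. The house sizes and prices below are made-up numbers, and `fit_line` is a hypothetical helper, not a library function.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: house size (predictor) and price (continuous target).
sizes = [50, 70, 90, 110]
prices = [150, 210, 270, 330]

slope, intercept = fit_line(sizes, prices)
predicted_price = slope * 100 + intercept  # prediction for an unseen size
print(slope, intercept, predicted_price)
```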

What are some alternative names for features?

Features = Predictor Variables = Independent Variables

What are some alternative names for target variable?

Target Variable = Dependent Variable = Response Variable

What is unsupervised machine learning?

Unsupervised machine learning is when data does not include labels

What are some examples of unsupervised machine learning algorithms?

** Clustering **
K-means clustering
Hierarchical clustering
** Dimensionality Reduction **
Principal Component Analysis (PCA)
t-SNE

What is the primary goal of clustering?

To group the data points in a dataset into clusters, so that points the algorithm considers similar end up in the same group
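
One way to picture clustering is a small k-means sketch on one-dimensional data. This is a simplified illustration with hand-picked starting centroids; `kmeans_1d` and the data are hypothetical, and real implementations handle initialisation and convergence more carefully.

```python
def kmeans_1d(points, centroids, n_iter=10):
    """Simplified k-means on 1-D data, starting from the given centroids."""
    clusters = []
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


points = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 5.0])
print(centroids, clusters)
```

No labels are given to the algorithm; the two groups emerge only from the distances between points.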

What is reinforcement learning? What are some example use cases?

An agent observes the state of the environment, takes actions, and receives rewards
The agent learns by itself to maximize its reward

Examples of use cases: self-driving cars, game-playing AI, robotics

What are variable types? What three variable types have we primarily worked with?

Categorical (also nominal)

Set of values without order
Examples: gender (M, F), hair color (black, brown, red)

Ordinal

Set of ordered values
Magnitude between successive values not known
Examples: clothing size (XS, S, M, L, XL)

Continuous (also numeric)

Integer or real values
Examples: temperature, year

Out of Categorical, Ordinal and Continuous, which two variable types are discrete?

Categorical and Ordinal

What is an example of categorical features that we used in our research?

Browser and Device

What are some common issues in data that require cleaning/preprocessing?

Missing values, faulty data, wrong data type or format, duplicates

What are some common preprocessing steps?

Remove rows with missing data
Fix rows with faulty data points
Turn fields into numeric form
Drop unnecessary variables (columns)
Normalise the data
Rename variables (remove capitalization, spaces etc)
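
The steps above can be sketched in plain Python on a list of records. The column names, the sample rows, and the `clean` helper are hypothetical; in practice a data-frame library would usually do this work.

```python
# Hypothetical raw data with a missing value and a duplicate row.
raw_rows = [
    {"User Age": "34", "Browser": "Chrome", "Session ID": "a1"},
    {"User Age": "", "Browser": "Firefox", "Session ID": "b2"},   # missing value
    {"User Age": "34", "Browser": "Chrome", "Session ID": "a1"},  # duplicate
    {"User Age": "51", "Browser": "Safari", "Session ID": "c3"},
]


def clean(rows):
    cleaned, seen = [], set()
    for row in rows:
        # Rename variables: lower-case, spaces replaced with underscores.
        row = {k.lower().replace(" ", "_"): v for k, v in row.items()}
        # Drop a column assumed to be unnecessary for modelling.
        row.pop("session_id", None)
        # Remove rows with missing data.
        if any(v is None or v == "" for v in row.values()):
            continue
        # Turn a text field into numeric form.
        row["user_age"] = int(row["user_age"])
        # Skip duplicates of rows already kept.
        key = tuple(sorted(row.items()))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


cleaned_rows = clean(raw_rows)

# Normalise the numeric column to the range [0, 1] (min-max scaling).
ages = [r["user_age"] for r in cleaned_rows]
lowest, highest = min(ages), max(ages)
normalised_ages = [(a - lowest) / (highest - lowest) for a in ages]
print(cleaned_rows, normalised_ages)
```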

What is the median?

The middle value after sorting the data

What is the mode?

The most common value

What is variance?

A measure of dispersion of the data: the average squared deviation from the mean

What is standard deviation?

A measure of dispersion of the data, in the same unit as the data (the square root of the variance)

What is the range?

The difference between the minimum and maximum values in a dataset
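
The summary statistics from the last few cards can be computed with Python's standard `statistics` module. The data values are made up for illustration. Note the assumption here: `pvariance` and `pstdev` treat the data as a whole population, whereas `statistics.variance` and `statistics.stdev` divide by n - 1 for a sample.

```python
import statistics

data = [2, 4, 4, 5, 7, 9, 11]  # hypothetical measurements

median = statistics.median(data)        # middle value after sorting
mode = statistics.mode(data)            # most common value
variance = statistics.pvariance(data)   # average squared deviation from the mean
std_dev = statistics.pstdev(data)       # square root of the variance
value_range = max(data) - min(data)     # maximum minus minimum

print(median, mode, variance, std_dev, value_range)
```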

Why should you visualise your data?

Summary statistics can be deceiving
Makes it easy to get an overview of outliers etc
Looks pretty

What is cross-validation?

Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample. It generally gives a less biased or less optimistic estimate of model performance than a single train/test split.
Example: k-fold cross-validation, where k is the number of splits (e.g., 10)
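
A minimal sketch of how k-fold cross-validation partitions the data, assuming no shuffling. `k_fold_indices` is a hypothetical helper written for this card; each sample lands in the validation set exactly once across the k folds.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

A model would be trained and scored once per fold, and the k scores averaged to estimate performance.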

What is meant by "splitting the data"?

The dataset is split into two or three subsets:
- Training set (for training the model)
- Validation set (optional: for tuning hyperparameters)
- Test set (for testing the model)
Function: train_test_split() (scikit-learn)
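
To show what such a split does, here is a small standard-library sketch. `split_dataset`, its defaults, and the seed are hypothetical choices for illustration; scikit-learn's `train_test_split` in `sklearn.model_selection` is the usual tool in practice.

```python
import random


def split_dataset(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows reproducibly and return (train_set, test_set)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = round(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]


train_set, test_set = split_dataset(range(10), test_fraction=0.2)
print(len(train_set), len(test_set))
```

Shuffling before splitting avoids a test set that only contains the last rows of an ordered dataset.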

What are performance metrics/measures used for?

Assessing how good your model is

What are some performance metrics used for classification?

- Confusion Matrix
- Precision
- Recall
- F1 score
- Accuracy
- ROC
- AUC

Describe a confusion matrix

Confusion matrix shows the performance of the classifier, by mapping the test data into True Positives, False Positives, True Negatives and False Negatives

How is Precision calculated? Can you conceptualise it?

Precision = True Positives / (True Positives + False Positives) | "Out of all predicted positives, what is the chance that one is actually a positive?"

How is Recall calculated? Can you conceptualise it?

Recall = True Positives / (True Positives + False Negatives) | "Out of all actual positives, what is the chance that we predict one correctly?"
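
A small worked example, assuming binary labels where 1 means positive and 0 means negative. The label lists are made up, and `confusion_counts` is a hypothetical helper.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return tp, fp, tn, fn


y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # actual labels (hypothetical)
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]  # model predictions (hypothetical)

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)   # of the predicted positives, how many are real
recall = tp / (tp + fn)      # of the real positives, how many were found
accuracy = (tp + tn) / len(y_true)

print(tp, fp, tn, fn, precision, recall, accuracy)
```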

What are some performance metrics used for regression?

- Mean squared error (MSE)
- Root mean squared error (RMSE)
- Mean absolute error (MAE)
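
The three metrics can be written out directly. The values below are made up; the last prediction is deliberately far off to show how squaring makes MSE and RMSE react more strongly to a large error than MAE does.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error: average of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: square root of MSE, in the unit of the target."""
    return math.sqrt(mse(y_true, y_pred))


def mae(y_true, y_pred):
    """Mean absolute error: average of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 13.0]  # last prediction is off by 4

print(mse(y_true, y_pred), rmse(y_true, y_pred), mae(y_true, y_pred))
```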

Which two types of parameters are there? When are they adjusted?

1. Parameters: adjusted automatically when training/fitting the ML model
2. Hyperparameters: control how an ML model learns and need to be set by the user

What is grid search?

Grid search is a way to systematically tune hyperparameters. For example, with two hyperparameters, Alpha and C, a grid search iterates through a matrix containing every combination of their candidate values.

Give some examples of functions to visualize data and some examples of functions to analyze the mean, median, and so on

Visualize: .boxplot(), .pairplot(), .displot()
Summary statistics (mean, median, and so on): .describe()

What is Mean Squared Error?

MSE is the average squared difference between the estimates and the actual values

How exactly does Grid Search work?

It iterates through a matrix with each hyperparameter combination, evaluates the model's performance for each combination, and picks the combination that performs best
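
A minimal sketch of the idea using `itertools.product`. The `validation_score` function is a hypothetical stand-in for "train the model with these hyperparameters and score it on validation data", and the grid values are made up. scikit-learn's `GridSearchCV` combines the same loop with cross-validation.

```python
from itertools import product


def validation_score(alpha, c):
    # Hypothetical stand-in for training a model and scoring it;
    # higher is better, with the peak at alpha=0.1, c=10.
    return -((alpha - 0.1) ** 2) - ((c - 10) ** 2)


param_grid = {"alpha": [0.01, 0.1, 1.0], "c": [1, 10, 100]}


def grid_search(param_grid, score_fn):
    """Try every combination in the grid and keep the best-scoring one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


best_params, best_score = grid_search(param_grid, validation_score)
print(best_params, best_score)
```

With 3 values per hyperparameter the grid has 3 x 3 = 9 combinations, so the cost grows quickly as hyperparameters are added.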

True or False? Mean Absolute Error (MAE) is less sensitive to outliers than RMSE or MSE

True

True or False? One-hot encoding is used to transform ordinal values to continuous (having an inherent order)

False. One-hot encoding is used to transform categorical values into binary indicator columns, one per category, which have no inherent order.
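
A minimal sketch of one-hot encoding for a categorical feature such as browser. `one_hot` is a hypothetical helper and the values are made up; each category gets its own 0/1 column, so no category is "larger" than another.

```python
def one_hot(values):
    """Return (categories, rows) where each row has a single 1 marking its category."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories] for value in values]
    return categories, rows


browsers = ["Chrome", "Firefox", "Chrome", "Safari"]
categories, encoded = one_hot(browsers)

print(categories)
for row in encoded:
    print(row)
```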

Define categorical, ordinal, and continuous variables.

Categorical - set of values without inherent order
Ordinal - set of ordered values whose magnitude between successive values is unknown
Continuous - integer or real values whose magnitude between successive values is known

What is a perfect ROC AUC score and what is a random ROC AUC score?

1 is a perfect model | 0.5 is a random model
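
These two reference values can be checked with a small sketch of the area under the ROC curve, using the fact that it equals the probability that a randomly chosen positive is scored higher than a randomly chosen negative (ties count as half). `roc_auc` and the scores are hypothetical and only practical for small inputs.

```python
def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


y_true = [0, 0, 1, 1]
perfect = roc_auc(y_true, [0.1, 0.2, 0.8, 0.9])        # positives always ranked higher
uninformative = roc_auc(y_true, [0.5, 0.5, 0.5, 0.5])  # scores carry no information

print(perfect, uninformative)
```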

True or False? F1 score is useful when the classes are unevenly distributed

True

Define F1 Score

F1 score is the harmonic mean of precision and recall: F1 = 2 * (Precision * Recall) / (Precision + Recall). It measures the balance between precision and recall
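
A short sketch of the F1 calculation with made-up precision and recall values. Returning 0.0 when both inputs are zero is a convention assumed here to avoid division by zero.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumed convention when both are zero
    return 2 * precision * recall / (precision + recall)


balanced = f1_score(0.75, 0.75)
lopsided = f1_score(1.0, 0.1)
arithmetic_mean = (1.0 + 0.1) / 2

print(balanced, lopsided, arithmetic_mean)
```

When precision and recall differ a lot, F1 stays close to the smaller of the two, unlike a plain average, so a model cannot score well by excelling at only one of them.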

(46 cards)