ML Module 1: Introduction to Machine Learning Flashcards
Data preparation: What are some typical problems and possible solutions?
- Missing data
  - Fill with some value (zero, mean, …), also known as imputation
  - Skip data points containing missing data
  - Skip the entire feature containing missing data
- Text attributes
  - Convert to categorical values
    Example:
    Ocean_proximity categorical
    “NEAR BAY” -> 0
    “INLAND” -> 1
    “NEAR OCEAN” -> 2
- Feature values have different scales
  - Normalise values (rescale to the range [0, 1])
  - Standardize values (shift and scale to have mean equal to 0 and variance equal to 1)
- Feature values are not normally distributed
  - Transform values (e.g. compute the logarithm)
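The encoding and scaling steps listed above can be sketched in plain Python. This is only an illustration: the helper names and the `incomes` values are made up, and a real project would more likely use a library for this.

```python
import math

# Category codes taken from the Ocean_proximity example above.
OCEAN_PROXIMITY_CODES = {"NEAR BAY": 0, "INLAND": 1, "NEAR OCEAN": 2}

def encode(values, codes):
    """Map each text value to its integer code."""
    return [codes[v] for v in values]

def normalise(values):
    """Rescale values linearly to the range [0, 1] (assumes max > min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardise(values):
    """Shift and scale values to mean 0 and variance 1 (assumes variance > 0)."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance)
    return [(v - mean) / std for v in values]

def log_transform(values):
    """Compress a skewed, strictly positive feature with the natural logarithm."""
    return [math.log(v) for v in values]

incomes = [1.0, 2.0, 3.0, 4.0, 10.0]  # made-up, right-skewed feature
encoded = encode(["INLAND", "NEAR BAY", "NEAR OCEAN"], OCEAN_PROXIMITY_CODES)
scaled = normalise(incomes)
standardised = standardise(incomes)
logged = log_transform(incomes)
```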
What is imputation?
Imputation is the process of replacing missing data with substituted values
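A minimal sketch of imputation, assuming missing values are represented as `None`. The function names and the `ages` data are hypothetical.

```python
def impute_constant(values, fill_value=0.0):
    """Replace every missing entry (None) with a fixed value."""
    return [fill_value if v is None else v for v in values]

def impute_mean(values):
    """Replace every missing entry (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

ages = [20.0, None, 40.0, None, 30.0]  # made-up feature with gaps
filled_zero = impute_constant(ages)
filled_mean = impute_mean(ages)        # observed mean is (20 + 40 + 30) / 3 = 30.0
```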
How do we represent a generic ML model as a function f?
ŷ = f(x,θ)
where
* ŷ is the prediction
* x is a data point
* θ are the parameters of the model
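As one concrete instance of ŷ = f(x, θ), the sketch below uses a linear model. The choice of a linear model and the parameter values are assumptions made for illustration only.

```python
def f(x, theta):
    """Linear model: y_hat = theta[0] + theta[1] * x[0] + theta[2] * x[1] + ..."""
    bias, weights = theta[0], theta[1:]
    return bias + sum(w * xi for w, xi in zip(weights, x))

theta = [1.0, 2.0, -1.0]  # hypothetical parameters (bias and two weights)
x = [3.0, 4.0]            # one data point with two features
y_hat = f(x, theta)       # 1 + 2*3 - 1*4 = 3.0
```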
What is Training in ML?
Training is the process of finding the best parameters θ so that the prediction ŷ is as close as possible to the known target value y.
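One possible way to find such parameters is gradient descent on the mean squared error. The sketch below assumes a one-feature linear model and made-up data. The learning rate and step count are arbitrary choices that happen to converge here.

```python
def f(x, theta):
    """One-feature linear model: y_hat = theta[0] + theta[1] * x."""
    return theta[0] + theta[1] * x

def mse(xs, ys, theta):
    """Mean squared error between predictions and targets."""
    return sum((f(x, theta) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def train(xs, ys, lr=0.05, steps=2000):
    """Adjust theta step by step in the direction that reduces the MSE."""
    theta = [0.0, 0.0]
    n = len(xs)
    for _ in range(steps):
        errors = [f(x, theta) - y for x, y in zip(xs, ys)]
        grad_bias = 2 * sum(errors) / n
        grad_weight = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        theta = [theta[0] - lr * grad_bias, theta[1] - lr * grad_weight]
    return theta

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 1 + 2x
theta = train(xs, ys)      # should end up close to [1.0, 2.0]
```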
What are some useful metrics for Classification?
Classification: How many did we label correctly?
* Accuracy
* Precision
* Recall
* ROC curve
What are some useful metrics for Regression?
Regression: How close did we get?
* Mean squared error (MSE)
* Root mean squared error (RMSE)
* Mean absolute error (MAE)
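The three regression metrics can be written out directly from their names. The sketch below uses made-up targets and predictions.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error: average of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the MSE, in the target's units."""
    return math.sqrt(mse(y_true, y_pred))

def mae(y_true, y_pred):
    """Mean absolute error: average of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 7.0]  # errors: 1, 0, -2, 0

mse_value = mse(y_true, y_pred)    # (1 + 0 + 4 + 0) / 4 = 1.25
rmse_value = rmse(y_true, y_pred)  # sqrt(1.25)
mae_value = mae(y_true, y_pred)    # (1 + 0 + 2 + 0) / 4 = 0.75
```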
What is accuracy and how do we calculate it?
Accuracy measures how many samples are classified correctly, relative to the total number of samples:
accuracy = correct classifications / all classifications
accuracy = (TP + TN) / (TP + TN + FP + FN)
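A short sketch of how the four counts and the accuracy could be computed for binary labels. It assumes 1 means positive and 0 means negative, and the label lists are made up.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """Share of all classifications that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)  # 3, 3, 1, 1
acc = accuracy(tp, tn, fp, fn)                     # 6 / 8 = 0.75
```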
What is precision and how do we calculate it?
Precision measures how many of the positive classifications are actually positive:
precision = correct positive classifications / all positive classifications
precision = TP / (TP + FP)
What is Recall and how do we calculate it?
Recall measures how many of the actual positives were classified as positive:
recall = correct positive classifications / all actual positives
recall = TP / (TP + FN)
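Precision and recall differ only in the denominator, which a small example with made-up counts makes visible.

```python
def precision(tp, fp):
    """Of everything classified as positive, the share that really is positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of everything that really is positive, the share classified as positive."""
    return tp / (tp + fn)

# Hypothetical classifier: finds 8 of 16 actual positives and raises 2 false alarms.
tp, fp, fn = 8, 2, 8

p = precision(tp, fp)  # 8 / 10 = 0.8
r = recall(tp, fn)     # 8 / 16 = 0.5
```

In this example the classifier is usually right when it says "positive" but misses half of the actual positives.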
What is the Receiver Operating Characteristic (ROC) and how do we calculate it?
The ROC curve is a common way to plot the true positive rate (TPR) as a function of the false positive rate (FPR) as the classification threshold varies.
TPR:
How many positives did I get right?
TPR = TP / (TP + FN)
FPR:
How many negatives did I get wrong?
FPR = FP / (FP + TN)
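The points of a ROC curve can be obtained by sweeping a decision threshold over the classifier's scores and computing TPR and FPR at each threshold. The sketch below assumes higher scores mean "more likely positive". The scores are made up.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, one per threshold, from strictest to most lenient."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    # Threshold above every score: nothing is classified as positive.
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / negatives, tp / positives))
    return points

y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(y_true, scores)
```

Plotting the second coordinate against the first gives the curve, which runs from (0, 0) to (1, 1).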
What is the Train-validation-test split used for?
When we want to compare different models, we need a third set:
The Validation set
The test set is still only for final evaluation.
<———————|———–|———>
Train Val Test
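A minimal sketch of such a three-way split. The 60/20/20 proportions, the fixed seed, and the function name are assumptions chosen for illustration.

```python
import random

def train_val_test_split(data, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle the data once, then cut it into train, validation and test parts."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_val = round(len(shuffled) * val_fraction)
    n_test = round(len(shuffled) * test_fraction)
    n_train = len(shuffled) - n_val - n_test
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

train, val, test = train_val_test_split(range(100))
```

Models are fitted on `train`, compared on `val`, and only the chosen model is evaluated once on `test`.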
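ML models are prone to overfitting, i.e. memorising the training data.
How do we know if (when) this happens?
We can compare performance on the training set with performance on the validation set in the train-validation-test split. A model that performs much better on the training set than on the validation set is likely overfitting.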
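To make the comparison concrete, the sketch below uses a deliberately extreme, hypothetical "model" that only memorises its training examples. The task (is x even?) and all names are made up.

```python
def fit_memoriser(train):
    """'Model' that stores every training input with its label and nothing else."""
    table = dict(train)

    def predict(x):
        # Inputs never seen during training fall back to a fixed guess of 0.
        return table.get(x, 0)

    return predict

def accuracy(model, data):
    """Share of examples for which the model predicts the correct label."""
    return sum(1 for x, y in data if model(x) == y) / len(data)

# Made-up task: the label is 1 when x is even, otherwise 0.
train = [(x, int(x % 2 == 0)) for x in range(0, 20)]
val = [(x, int(x % 2 == 0)) for x in range(20, 30)]

model = fit_memoriser(train)
train_acc = accuracy(model, train)  # 1.0: every training example was memorised
val_acc = accuracy(model, val)      # 0.5: no better than always guessing 0
gap = train_acc - val_acc
```

The large gap between training and validation accuracy is the warning sign.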
How does Cross-validation work?
What is Reinforcement learning?
Reinforcement learning is a branch of machine learning that trains agents (such as bots) to pick the actions that will maximize their rewards over time within a given environment.
Key Concepts:
* Agent: The learner or decision maker.
* Environment: Everything the agent interacts with.
* State: A specific situation in which the agent finds itself.
* Actions: All possible moves the agent can make.
* Reward: Feedback from the environment.
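The concepts above can be mapped onto a tiny interaction loop. This sketch contains no learning algorithm: it only shows how agent, environment, state, action and reward fit together. The corridor environment, its reward values, and the two fixed policies are made up.

```python
class Corridor:
    """Environment: positions 0..4 in a row; reaching position 4 ends the episode."""

    def __init__(self):
        self.state = 0  # the agent starts at the left end

    def step(self, action):
        """Apply an action (-1 = left, +1 = right) and return state, reward, done."""
        self.state = max(0, min(4, self.state + action))
        done = self.state == 4
        reward = 1.0 if done else -0.1  # small penalty per step, bonus at the goal
        return self.state, reward, done

def run_episode(policy, max_steps=20):
    """Let an agent (a policy mapping state -> action) interact with the environment."""
    env = Corridor()
    state, total_reward = env.state, 0.0
    for _ in range(max_steps):
        action = policy(state)
        state, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward

def always_right(state):
    return 1

def always_left(state):
    return -1

good_return = run_episode(always_right)  # roughly 0.7: three step penalties, then +1
bad_return = run_episode(always_left)    # roughly -2.0: twenty step penalties
```

A reinforcement learning algorithm would search for the policy with the higher total reward instead of having it hard-coded.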
What is Supervised learning?
Supervised learning is a type of machine learning in which a model learns from labeled data. Labeled data is data that has been tagged with a correct answer or classification.
Key Points:
* Supervised learning involves training a machine from labeled data.
* Labeled data consists of examples with the correct answer or classification.
* The machine learns the relationship between inputs (e.g. fruit images) and outputs (e.g. fruit labels).
* The trained machine can then make predictions on new, unlabeled data.
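The key points can be illustrated with a nearest-neighbour classifier, chosen here only because it is short. The fruit features (weight and redness) are hypothetical numeric stand-ins for images, and all values are made up.

```python
def train_nearest_neighbour(examples):
    """'Train' by storing labeled examples; predict the label of the closest one."""

    def predict(x):
        def squared_distance(example):
            features, _label = example
            return sum((a - b) ** 2 for a, b in zip(features, x))

        _features, label = min(examples, key=squared_distance)
        return label

    return predict

# Labeled data: (weight in units of 100 g, redness from 0 to 1) -> fruit label.
labeled = [
    ((1.5, 0.9), "apple"),
    ((1.7, 0.8), "apple"),
    ((1.2, 0.1), "banana"),
    ((1.1, 0.2), "banana"),
]

predict = train_nearest_neighbour(labeled)

# New, unlabeled data points.
first = predict((1.6, 0.85))   # closest to the apple examples
second = predict((1.15, 0.15)) # closest to the banana examples
```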