Machine Learning Flashcards
Supervised ML -
Training with data that includes input variables (x) as well as response variables (y). Supervised learning uses labelled datasets to train algorithms to predict outcomes and recognize patterns: the model is trained on input features and their corresponding output labels.
Two main types: classification and regression.
Regression vs classification -
Classification → Determining which group a new data point belongs to. Every data point is placed in one of the predefined classes or categories, e.g. classifying a patient as sick/healthy based on temperature and heart rate. y is discrete. (Decision tree, random forest, SVM.)
Regression → Predicts continuous numerical values, such as earnings, production orders, or stock prices. y is continuous. (Linear regression.)
Decision tree -
Flowchart-like classifier that makes decisions step by step.
Nodes = Tests on features/attributes.
Branches = Outcomes of tests.
Leaves = Final classification result.
Built from training data to make predictions. Simple but effective for many tasks.
Decision Tree Features: can handle different types of attributes:
* Discrete-valued (e.g., color: red, blue, green).
* Continuous-valued (e.g., temperature: 10°C).
* Binary (yes/no decisions).
Example: Predicting if a customer will buy a computer based on past data.
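A minimal sketch of that example in scikit-learn; the feature values and labels below are invented for illustration:

```python
# Toy "buys computer" example (all values invented for illustration).
from sklearn.tree import DecisionTreeClassifier

# Features: [age in years, income level (0=low, 1=medium, 2=high)]
X = [[25, 2], [35, 1], [45, 0], [30, 2], [50, 1], [22, 0]]
y = [1, 1, 0, 1, 0, 0]  # 1 = buys a computer, 0 = does not

clf = DecisionTreeClassifier(criterion="gini")  # "entropy" for information gain
clf.fit(X, y)
print(clf.predict([[28, 2]]))  # classify a new customer
```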
Attribute selection measure in DT -
What is Attribute Selection? A method to find the most important attribute for splitting data. Goal: Create pure partitions (groups with only one class).
How to Measure Impurity?
Gini Impurity: Measures how mixed the classes are in a group. Lower Gini = better split (purer groups).
Information Gain (IG): Measures how much uncertainty (entropy) is reduced after splitting. Higher IG = better split.
Choosing the Best Attribute:
Pick the attribute with the lowest Gini Impurity or highest Information Gain. This ensures the best first split for the decision tree.
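To make the two measures concrete, here is a small hand-rolled sketch of Gini impurity and information gain; the labels are invented toy data:

```python
# Impurity measures for a candidate split; labels are the class values
# of the examples in a node.
from collections import Counter
from math import log2

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    # Uncertainty of the parent minus the weighted uncertainty of the partitions.
    n = len(parent)
    return entropy(parent) - sum(len(ch) / n * entropy(ch) for ch in children)

parent = ["yes", "yes", "no", "no"]
split = [["yes", "yes"], ["no", "no"]]  # perfectly pure partitions
print(information_gain(parent, split))  # 1.0: maximal gain
print([gini(ch) for ch in split])       # [0.0, 0.0]: pure groups
```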
What is overfitting and tree pruning in decision trees? -
What is Overfitting? When a decision tree fits the training data too closely and does not generalize well to new data. Happens when the tree learns noise or outliers instead of real patterns. More attributes + less training data = higher risk of overfitting.
How to Fix Overfitting? → Pruning. Pruning removes unnecessary branches to make the tree simpler and more accurate on new data.
Types of Pruning:
* Prepruning (Early Stopping) Stop tree growth early based on rules like Gini index or Information Gain.
* Postpruning (More Common): First build the full tree, then remove unhelpful branches. Uses cost complexity (based on misclassification rate & number of leaves); see the sketch after this list.
* Goal: A small tree that balances simplicity and accuracy.
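A sketch of cost-complexity postpruning using scikit-learn's `ccp_alpha` parameter (available since version 0.22); the dataset and the choice of alpha are just for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

full = DecisionTreeClassifier(random_state=0).fit(X, y)  # unpruned tree

# Effective alphas at which branches would be pruned away, weakest first.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Refit with a nonzero alpha: unhelpful branches are removed from the full tree.
pruned = DecisionTreeClassifier(random_state=0,
                                ccp_alpha=path.ccp_alphas[-2]).fit(X, y)
print(full.get_n_leaves(), ">", pruned.get_n_leaves())

# Prepruning alternative: stop growth early, e.g.
# DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)
```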
DT pros and cons -
Pros:
* Transparency (easy to understand for humans)
* Does not require parameter setting
* Requires little to no preprocessing.
Cons:
* Scalability (might have trouble with large datasets, due to memory)
* Can be greedy (focus on local optima instead of global)
* Risk of overfitting.
Random forest and bagging -
A model-combination (ensemble) classifier built from decision trees.
* Instead of one tree, it includes multiple trees.
* Better at handling overfitting compared to DTs.
* RF is a type of bagging. Bagging is short for "bootstrap aggregating".
* An ensemble learning method based on majority voting.
* More robust to the effects of noisy data and overfitting.
* First step of bagging is bootstrap sampling.
Bootstrap sampling and attribute selection RF -
Bootstrap Sampling: Create k new training samples from the original dataset by sampling with replacement. Some data points are left out, while others appear multiple times. Each sample trains one decision tree.
Random Attribute Selection: At each split, a tree only considers a random subset of attributes. This reduces correlation between trees and makes the forest less sensitive to noise and overfitting.
Final Prediction: Each tree makes a prediction. Majority vote decides the final result.
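A minimal sketch of these three steps with scikit-learn's `RandomForestClassifier`; the dataset and parameter values are just for illustration:

```python
# Each tree is trained on a bootstrap sample, considers a random subset of
# features at each split, and the final class is a majority vote across trees.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(
    n_estimators=100,     # k trees, one per bootstrap sample
    max_features="sqrt",  # random attribute selection at each split
    bootstrap=True,       # sample the training set with replacement
    random_state=0,
)
rf.fit(X, y)
print(rf.predict(X[:3]))  # majority vote over the 100 trees
```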
Support vector machine (SVM) -
Creates a classifier by drawing a line (2D) or plane (3D) to separate classes. Good performance in many applications but slow training. Useful as a “first try” ML method when exploring new domains. In higher dimensions (N-dimensional space), the separator is called a hyperplane.
Maximal marginal hyperplane, kernel function, soft margin and hard margin -
MMH is the hyperplane that maximizes the separation (margin) between classes. It is defined by the support vectors: the data samples closest to the hyperplane.
SVM: Kernel Function maps data into a higher-dimensional space to make it separable. Choice of kernel depends on the data. Examples: Linear, Polynomial, Radial Basis Function (RBF).
SVM: Soft vs. Hard Margin
* Hard margin: Strict separation, but fails if data is noisy or non-separable.
* Soft margin: Allows some misclassifications, leading to better generalization.
* Trade-off: A larger margin with minor mistakes is often better than a perfect separation with a narrow margin.
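A small sketch of that trade-off using scikit-learn's `SVC`, where the `C` parameter approximates the hard/soft distinction (a very large `C` behaves close to a hard margin); the synthetic dataset is just for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, random_state=0)

hard_ish = SVC(kernel="linear", C=1e6).fit(X, y)  # near-hard margin
soft = SVC(kernel="rbf", C=1.0).fit(X, y)         # soft margin, RBF kernel

# The support vectors are the samples closest to (or violating) the margin.
print(len(hard_ish.support_vectors_), len(soft.support_vectors_))
```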
SVM pros and cons -
Pros:
* Good performance for a variety of problems.
* Less prone to overfitting compared to many other ML methods.
* Can often work well even with a small training set.
Cons:
* Sensitive to noise.
* A large dataset can lead to long training times.
* Needs parameter tuning to work properly.
Linear regression and logistic regression -
Linear Regression: Used for predictive analysis to find a trend line in data. Finds the best fit line by minimizing the error.
Logistic Regression: Classification algorithm used for binary outcomes (e.g., Yes/No). Returns a logistic (S-shaped) curve instead of a straight line.
Differences: Linear regression predicts continuous values (straight line); logistic regression predicts a probability between 0 and 1 (sigmoid curve).
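A side-by-side sketch in scikit-learn; the toy data is invented, and the point is the continuous prediction versus the probability output:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.arange(10).reshape(-1, 1)  # a single input feature, 0..9

# Linear regression: fit a straight trend line to a continuous target.
lin = LinearRegression().fit(X, 2.0 * X.ravel() + 1.0)
print(lin.predict([[12]]))  # 25.0, since y = 2x + 1 exactly

# Logistic regression: squash a linear score through the sigmoid
# to get a probability for a binary outcome.
log = LogisticRegression().fit(X, (X.ravel() > 4).astype(int))
print(log.predict_proba([[12]])[0, 1])  # P(class = 1), close to 1 here
```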
Validation -
Purpose: Fine-tune model parameters during training and assess performance.
How it works: Split data into training and validation sets. Train on training set, test on validation set.
Adjusting Model: Hyperparameters are adjusted based on validation set performance.
Preventing Overfitting: Detects overfitting by ensuring good performance on unseen data.
K-fold Cross-Validation
What it is: Split data into k subsets (folds). How it works: Train on k-1 folds and test on the remaining fold; repeat k times so each fold is used as the test set once. Average the results from all k rounds.
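A minimal k-fold sketch with scikit-learn's `cross_val_score`; the dataset and model choice are just for illustration:

```python
# Each of the k=5 rounds trains on 4/5 of the data and tests on the
# held-out 1/5; the fold scores are then averaged.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(scores, scores.mean())  # one accuracy per fold, then the average
```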
Evaluation -
Purpose: After training, evaluate the model’s performance on a separate test dataset to assess its ability to generalize to new data.
Data Usage: The test data was not seen during training or validation, providing an unbiased assessment.
Performance Metrics: Metrics like accuracy, precision, recall, and F1 score are used to quantify performance.
Decision Making: Evaluation results help decide whether to deploy the model in real-world scenarios.
Key Point: The goal is for the model to perform well on new, unseen data, not just the data it was trained on.
Classification performance measurement -
Accuracy: Percentage of correct predictions, i.e. true positives plus true negatives out of all predictions.
Recall (Sensitivity): Out of all the actual positive cases, how many did we correctly identify?
Precision: Out of all the cases we predicted as positive, how many were actually correct? (It tells us how accurate our positive predictions are.)
False Alarm Rate: Out of all the negative cases, how many did we wrongly predict as positive? (This shows how often we mistakenly raise an alarm when there’s no real issue.)
F1 Score: A single number that balances precision and recall (their harmonic mean). Useful when you want to be good at both catching real cases and avoiding false alarms.
Confusion matrix: A table to compare predictions vs actual outcomes: True Positive (TP): Correctly predicted positive. True Negative (TN): Correctly predicted negative. False Positive (FP): Negative predicted as positive. False Negative (FN): Positive predicted as negative.
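A sketch computing these measures from the confusion-matrix counts, cross-checked against scikit-learn; the prediction vectors are invented toy data:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

# For binary labels, ravel() yields the four counts in this order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("accuracy:", (tp + tn) / (tp + tn + fp + fn))
print("recall:", tp / (tp + fn), recall_score(y_true, y_pred))
print("precision:", tp / (tp + fp), precision_score(y_true, y_pred))
print("false alarm rate:", fp / (fp + tn))  # a.k.a. false positive rate
print("F1:", f1_score(y_true, y_pred))      # harmonic mean of precision/recall
```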
Regression performance measurement -
Mean squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE).
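A small numpy sketch of the three measures; the values are invented toy data:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mse = np.mean((y_true - y_pred) ** 2)   # penalizes large errors quadratically
rmse = np.sqrt(mse)                     # same units as the target variable
mae = np.mean(np.abs(y_true - y_pred))  # average absolute deviation
print(mse, rmse, mae)
```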
Generalization, overfitting and underfitting -
Generalization is when a model performs well on new, unseen data from the same distribution as the training data. Measures how well the model can apply learned knowledge to make correct predictions on new data after being trained.
Underfitting: What it is: Model is too simple. Signs: High error on both training and test data. Cause: Model can’t capture patterns.
Overfitting: What it is: Model is too complex. Signs: Low error on training data, high error on test data. Cause: Model is too specific to training data and doesn’t generalize well.
Interactive ML (IML) -
The combination of humans and machines is powerful. Solving real-world problems can often benefit from interaction with the end users, and is sometimes even impossible without end-user input, e.g. due to a lack of labelled instances. Labelling instances is often expensive; IML can reduce the need for labelling. It empowers the end user, who gets more control over the learning process, may increase the user's trust in the output of the ML system, and can make ML more accessible for people who are not ML experts.
Classic vs interactive ML -
Classical ML: Batch process (one-time pass). Long training times are acceptable. Requires large labeled datasets. Labels/classes are known beforehand. No user feedback during training.
Interactive ML: Iterative process. Sensitive to latency (fast responses needed). Often works with unlabeled datasets. Labels may not be known in advance. User feedback is crucial during training.
Triggering interaction in IML -
The aim is to request the labels that will be most useful for the ML algorithm, while bothering the user as little as possible. Often the ML system has a budget of requests per time unit.
Interactive learning strategies: the user provides a label when triggered by a specified event.
Interactive ML strategies -
Active learning (AL) strategies: AL triggered by uncertainty, AL triggered by time, AL triggered at random.
Machine teaching (MT) strategies: MT triggered by error, MT triggered by state change, MT triggered by time, MT triggered by user factors.
Active learning (AL) strategies -
Triggered by uncertainty: the system asks the user for labels when it is uncertain about its predictions. If the model’s certainty is below a set threshold and there is room for more queries, it will request the user’s input. The user is assumed to be always correct.
Triggered by time: Asking the user at certain points in time what the current status is, e.g. once every hour.
Triggered at random: Asking the user at random points in time what the current status is.
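A hypothetical sketch of the uncertainty-triggered strategy. The names `maybe_query_user` and `ask_user` and the threshold value are illustrative assumptions, not from any library; the model is assumed to expose a scikit-learn-style `predict_proba`:

```python
import numpy as np

CONFIDENCE_THRESHOLD = 0.7  # assumed value; tune per application

def maybe_query_user(model, x, ask_user, budget):
    """Return (label, remaining_budget); ask the user only when uncertain."""
    confidence = np.max(model.predict_proba([x]))  # certainty of top prediction
    if confidence < CONFIDENCE_THRESHOLD and budget > 0:
        # Below the threshold and queries remain: request the user's input.
        return ask_user(x), budget - 1  # the user is assumed always correct
    return model.predict([x])[0], budget  # otherwise trust the model
```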
Machine teaching (MT) strategies -
Triggered by error: The user notices that the ML systems estimation is not correct. The user provides the correct value.
Triggered by state change: The user notices that the activity has changed. The user provides the new value.
Triggered by time: The user reports the current activity at certain points in time. This could e.g. be a security guard continuously patrolling a building, or a member of the cleaning staff.
Triggered by user factors: The user reports the current activity based on internal factors, e.g. how busy/stressed the user is at the moment. How knowledgeable the user is about how to classify the current state.
Issues in interactive learning -
User Interaction: Should it be reactive (Active Learning), proactive (Mixed Initiative), or both?
Model Communication: Should the system show its current state to the user?
Model Evaluation: How can the user assess the model's quality?
User Feedback: What is the best way for the user to provide input?