Machine learning Flashcards

1
Q

Supervised ML -

A

training with data that includes input variables (x) as well as response variables (y). Supervised learning uses labelled datasets, where each example pairs input features with an output label, to train algorithms to predict outcomes and recognize patterns.

Two main types: classification and regression.

2
Q

Regression vs classification -

A

Classification → determines which group a new data point belongs to. Every data point is placed in one of the predefined classes or categories, e.g. classifying a patient as healthy or ill from temperature and heart rate. y is discrete. (Decision tree, random forest, SVM.)

Regression → predicts continuous numerical values, such as earnings, production orders, or stock prices. y is continuous. (Linear regression.)

3
Q

Decision tree -

A

Flowchart-like classifier that makes decisions step by step.
Nodes = Tests on features/attributes.
Branches = Outcomes of tests.
Leaves = Final classification result.

Built from training data to make predictions. Simple but effective for many tasks. Can handle different types of attributes:
* Discrete-valued (e.g., color: red, blue, green).
* Continuous-valued (e.g., temperature: 10°C).
* Binary (yes/no decisions).
Example: Predicting if a customer will buy a computer based on past data.
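
A minimal sketch of training a decision tree, assuming scikit-learn is available; the iris dataset and the max_depth value are illustrative choices, not part of the card:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)                            # features and labels
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0)   # each internal node tests one feature
tree.fit(X_train, y_train)                                   # build the flowchart from training data
print("test accuracy:", tree.score(X_test, y_test))          # leaves give the final class
```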

4
Q

Attribute selection measure in DT -

A

What is Attribute Selection? A method to find the most important attribute for splitting data. Goal: Create pure partitions (groups with only one class).

How to Measure Impurity?
* Gini Impurity: measures how mixed the classes are in a group. Lower Gini = better split (purer groups).
* Information Gain (IG): measures how much uncertainty is reduced after splitting. Higher IG = better split.

Choosing the Best Attribute:
Pick the attribute with the lowest Gini Impurity or highest Information Gain. This ensures the best first split for the decision tree.
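
A small NumPy sketch of both measures; the helper names and the toy labels are made up for illustration:

```python
import numpy as np

def gini(labels):
    """Gini impurity of one partition: 1 - sum(p_i^2). 0 means pure."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Entropy in bits; the uncertainty used by information gain."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, splits):
    """Entropy of the parent minus the weighted entropy of the child partitions."""
    n = len(parent)
    weighted = sum(len(s) / n * entropy(s) for s in splits)
    return entropy(parent) - weighted

parent = np.array(["yes", "yes", "no", "no", "no"])
split = [np.array(["yes", "yes"]), np.array(["no", "no", "no"])]   # a perfectly pure split
print(gini(parent), information_gain(parent, split))
```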

5
Q

What is overfitting and tree pruning in decision trees? -

A

What is Overfitting? When a decision tree fits the training data too closely and does not generalize well to new data. Happens when the tree learns noise or outliers instead of real patterns. More attributes + less training data = higher risk of overfitting.

How to Fix Overfitting? → Pruning. Pruning removes unnecessary branches to make the tree simpler and more accurate.

Types of Pruning:
* Prepruning (early stopping): stop tree growth early based on rules like the Gini index or Information Gain.
* Postpruning (more common): first build the full tree, then remove unhelpful branches. Uses cost complexity (based on misclassification rate & number of leaves).
* Goal: Small, accurate tree that balances size and accuracy.
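
A hedged postpruning sketch using scikit-learn's cost-complexity pruning (requires scikit-learn >= 0.22); the dataset and the sampling of candidate alphas are illustrative assumptions:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Path of candidate alphas: larger alpha prunes more branches, giving a smaller tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

for alpha in path.ccp_alphas[::10]:                       # try a few candidates
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    print(f"alpha={alpha:.4f} leaves={tree.get_n_leaves()} "
          f"test acc={tree.score(X_test, y_test):.3f}")
```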

6
Q

DT pros and cons -

A

Pros:
* Transparency (easy to understand for humans)
* Does not require parameter setting
* Requires little to no preprocessing.

Cons:
* Scalability (might have trouble with large datasets, due to memory)
* Can be greedy (focus on local optima instead of global)
* Risk of overfitting.

7
Q

Random forest and bagging -

A

An ensemble classifier that combines multiple decision trees.
* Instead of one tree, it includes multiple trees.
* Better at handling overfitting compared to a single DT.
* RF is a type of bagging. Bagging is short for ”bootstrap aggregation”.
* A type of ensemble learning method based on majority voting.
* More robust to the effects of noisy data and overfitting.
* First step of bagging is bootstrap sampling.
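
A minimal random-forest sketch with scikit-learn (dataset and parameter values are assumptions): each tree sees a bootstrap sample and a random subset of features, and the forest predicts by majority vote.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,      # number of bagged trees
    max_features="sqrt",   # random attribute selection at each split
    bootstrap=True,        # each tree trained on a bootstrap sample
    random_state=0,
)
forest.fit(X_train, y_train)
print("test accuracy (majority vote):", forest.score(X_test, y_test))
```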

8
Q

Bootstrap sampling and attribute selection in RF -

A

Bootstrap Sampling: Create k new training samples from the original dataset by sampling with replacement. Some data points are left out, while others appear multiple times. Each sample trains one decision tree.

Random Attribute Selection: Each tree only considers a random subset of attributes. Reduces correlation between trees. Less sensitive to noise and overfitting.

Final Prediction: Each tree makes a prediction. Majority vote decides the final result.
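
A tiny NumPy illustration of bootstrap sampling (the toy array is an assumption): drawing with replacement is what makes some points repeat while others are left out.

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)                              # indices stand in for training samples

n = len(data)
boot = rng.choice(data, size=n, replace=True)     # one bootstrap sample, same size as the original
out_of_bag = np.setdiff1d(data, boot)             # points never drawn for this tree

print("bootstrap sample:", boot)
print("out-of-bag points:", out_of_bag)
```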

9
Q

Support vector machine (SVM) -

A

Creates a classifier by drawing a line (2D) or plane (3D) to separate classes. Good performance in many applications but slow training. Useful as a “first try” ML method when exploring new domains. In higher dimensions (N-dimensional space), the separator is called a hyperplane.

10
Q

Maximum marginal hyperplane, kernel function, soft margin and hard margin -

A

MMH is the hyperplane that maximizes the separation between classes. It is defined by the support vectors (the data samples closest to the hyperplane).

SVM: Kernel Function maps data into a higher-dimensional space to make it separable. Choice of kernel depends on the data. Examples: Linear, Polynomial, Radial Basis Function (RBF).

SVM: Soft vs. Hard Margin
* Hard margin: Strict separation, but fails if data is noisy or non-separable.
* Soft margin: Allows some misclassifications, leading to better generalization.
* Trade-off: A larger margin with minor mistakes is often better than a perfect separation with a narrow margin.
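
A hedged SVM sketch with scikit-learn: the RBF kernel handles non-linearly separable data, and C controls how soft the margin is (smaller C tolerates more misclassifications). The dataset and values are illustrative assumptions:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Feature scaling matters for SVMs; C trades margin width against training errors (soft margin)
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
svm.fit(X_train, y_train)
print("test accuracy:", svm.score(X_test, y_test))
```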

11
Q

SVM pros and cons -

A

Pros:
* Good performance for a variety of problems.
* Less prone to overfitting compared to many other ML methods.
* Can often work well even with a small training set.

Cons:
* Sensitive to noise.
* A large dataset can lead to long training time.
* Needs parameter tuning to work properly.

12
Q

Linear regression and logistic regression -

A

Linear Regression: Used for predictive analysis to find a trend line in data. Finds the best fit line by minimizing the error.

Logistic Regression: Classification algorithm used for binary outcomes (e.g., Yes/No). Fits a logistic (S-shaped) curve instead of a straight line. Differences: Linear regression → predicts continuous values (straight line). Logistic regression → predicts a probability (0 to 1) via the sigmoid curve.
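
A side-by-side sketch, assuming scikit-learn; the toy data is made up for illustration. Linear regression fits a straight line to a continuous y; logistic regression returns a probability between 0 and 1.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Linear regression: continuous target (e.g. earnings vs. years of experience)
X = np.array([[1], [2], [3], [4], [5]])
y_cont = np.array([10.0, 19.5, 31.0, 39.0, 52.0])
lin = LinearRegression().fit(X, y_cont)
print("predicted value at x=6:", lin.predict([[6]]))

# Logistic regression: binary target (e.g. buys / does not buy)
y_bin = np.array([0, 0, 0, 1, 1])
log = LogisticRegression().fit(X, y_bin)
print("P(class=1) at x=6:", log.predict_proba([[6]])[0, 1])   # sigmoid output in [0, 1]
```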

13
Q

Validation -

A

Purpose: Fine-tune model parameters during training and assess performance.

How it works: Split data into training and validation sets. Train on training set, test on validation set.

Adjusting Model: Hyperparameters are adjusted based on validation set performance.

Preventing Overfitting: Detects overfitting by ensuring good performance on unseen data.

K-fold Cross-Validation
What it is: Split the data into k subsets (folds). How it works: Train on k-1 folds and test on the remaining fold; repeat k times and average the results.
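
A minimal k-fold cross-validation sketch with scikit-learn; the model choice and k=5 are assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

# 5-fold CV: train on 4 folds, validate on the held-out fold, repeat 5 times
scores = cross_val_score(model, X, y, cv=5)
print("per-fold accuracy:", scores)
print("mean accuracy:", scores.mean())
```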

14
Q

Evaluation -

A

Purpose: After training, evaluate the model’s performance on a separate test dataset to assess its ability to generalize to new data.

Data Usage: The test data was not seen during training or validation, providing an unbiased assessment.

Performance Metrics: Metrics like accuracy, precision, recall, and F1 score are used to quantify performance.

Decision Making: Evaluation results help decide whether to deploy the model in real-world scenarios.

Key Point: The goal is for the model to perform well on new, unseen data, not just the data it was trained on.

15
Q

Classification performance measurement -

A

Accuracy: Percentage of correct predictions (true positives plus true negatives out of all predictions).

Recall (Sensitivity): Out of all the actual positive cases, how many did we correctly identify?

Precision: Out of all the cases we predicted as positive, how many were actually correct? (It tells us how accurate our positive predictions are.)

False Alarm Rate: Out of all the negative cases, how many did we wrongly predict as positive? (This shows how often we mistakenly raise an alarm when there’s no real issue.)

F1 Score: A single number that balances precision and recall—useful when you want to be good at both catching real cases and avoiding false alarms.

Confusion matrix: A table to compare predictions vs actual outcomes: True Positive (TP): Correctly predicted positive. True Negative (TN): Correctly predicted negative. False Positive (FP): Negative predicted as positive. False Negative (FN): Positive predicted as negative.
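
A small sketch computing the measures from the confusion-matrix counts; the toy predictions are made up, and the same numbers can also be obtained from sklearn.metrics:

```python
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0, 1, 0])   # toy ground truth (1 = positive)
y_pred = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0])   # toy predictions

tp = np.sum((y_pred == 1) & (y_true == 1))
tn = np.sum((y_pred == 0) & (y_true == 0))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))

accuracy  = (tp + tn) / len(y_true)
recall    = tp / (tp + fn)            # sensitivity: share of real positives we found
precision = tp / (tp + fp)            # share of positive predictions that were right
false_alarm_rate = fp / (fp + tn)     # share of negatives wrongly flagged as positive
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, recall, precision, false_alarm_rate, f1)
```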

16
Q

Regression performance measurement -

A

Mean squared error (MSE), Root mean squared error (RMSE), Mean absolute error (MAE).
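
A minimal NumPy sketch of the three measures on made-up predictions:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

errors = y_true - y_pred
mse  = np.mean(errors ** 2)        # Mean Squared Error
rmse = np.sqrt(mse)                # Root Mean Squared Error (same unit as y)
mae  = np.mean(np.abs(errors))     # Mean Absolute Error

print(mse, rmse, mae)
```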

17
Q

Generalization, overfitting and underfitting -

A

Generalization is when a model performs well on new, unseen data from the same distribution as the training data. Measures how well the model can apply learned knowledge to make correct predictions on new data after being trained.

Underfitting: What it is: Model is too simple. Signs: High error on both training and test data. Cause: Model can’t capture patterns.

Overfitting: What it is: Model is too complex. Signs: Low error on training data, high error on test data. Cause: Model is too specific to training data and doesn’t generalize well.

18
Q

Interactive ML (IML) -

A

The combination of humans and machines is powerful. Solving real-world problems can often benefit from interaction with the end-users. Sometimes it is even impossible without end-user input, e.g. due to a lack of labelled instances. Labelling instances is often expensive; IML can reduce the need for labelling. Empowers the end-user, who gets more control of the learning process. May increase the user's trust in the output of the ML system. Can make ML more accessible for people who are not ML experts.

19
Q

Classic vs interactive ML -

A

Classical ML: Batch process (one-time pass). Long training times are acceptable. Requires large labeled datasets. Labels/classes are known beforehand. No user feedback during training.

Interactive ML: Iterative process. Sensitive to latency (fast responses needed). Often works with unlabeled datasets. Labels may not be known in advance. User feedback is crucial during training.

20
Q

Triggering interaction in IML -

A

The aim is to request the labels that will be most useful for the ML algorithm while bothering the user as little as possible. Often the ML system has a budget of requests per time unit.

Interactive learning strategies: the user provides a label when triggered by a specified event.

21
Q

Interactive ML strategies -

A

Active learning (AL) strategies: AL triggered by uncertainty, AL triggered by time, AL triggered at random.

Machine teaching (MT) strategies: MT triggered by error, MT triggered by state change, MT triggered by time, MT triggered by user factors.

22
Q

Active learning (AL) strategies -

A

Triggered by uncertainty: the system asks the user for labels when it is uncertain about its predictions. If the model’s certainty is below a set threshold and there is room for more queries, it will request the user’s input. The user is assumed to be always correct.

Triggered by time: Asking the user at certain points in time what the current status is, e.g. once every hour.

Triggered at random: Asking the user at random points in time what the current status is.

23
Q

Machine teaching (MT) strategies -

A

Triggered by error: The user notices that the ML system's estimate is not correct. The user provides the correct value.

Triggered by state change: The user notices that the activity has changed. The user provides the new value.

Triggered by time: The user reports the current activity at certain points in time. This could be, e.g., a security guard continuously patrolling a building or a member of the cleaning staff.

Triggered by user factors: The user reports the current activity based on internal factors, e.g. how busy/stressed the user is at the moment, or how knowledgeable the user is about how to classify the current state.

24
Q

Issues in interactive learning -

A

User Interaction: Should it be reactive (Active Learning), proactive (Mixed Initiative), or both?

Model Communication: Should the system show its current state to the user?

Model Evaluation: How can the user assess the model’s quality?

User Feedback: What is the best way for the user to provide input?

25
The Cold Start problem (bootstrapping) -
Happens when there is little or no labeled data available at the beginning of training. Ways to address it: Incremental Learning: gradually improve the model as new data comes in. Transfer Learning: use knowledge from a similar, pre-trained model. Data Augmentation: create synthetic data to increase the dataset size.
26
Incremental learning -
Batch Learning (Traditional ML): All data is available before training. Model is optimized using the full dataset. Assumes data remains the same over time.

Incremental Learning (Online ML): Model continuously updates with new data. Needs to make accurate predictions at any time. Limited memory (can't store all data). Uses a compact representation of past data (e.g., statistics, recent samples).

Challenges in Incremental Learning:
* Updating the Model: Fully Online: update after every new data point. Mini-Batch: update after collecting small batches of data. Batch Learning: store all data and retrain (not always feasible).
* Concept Drift (Changes in Data Over Time): Gradual Change: adapt smoothly by giving more weight to recent data. Sudden Change: detect shifts (e.g., accuracy drops) and adjust the model. New Data Categories: use clustering to detect new patterns.

Incremental learning helps models stay up-to-date without retraining from scratch.
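
A hedged incremental-learning sketch using scikit-learn's partial_fit; the streaming loop, data, and labelling rule are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
classes = np.array([0, 1])                     # all classes must be declared up front
model = SGDClassifier(random_state=0)          # linear model that supports online updates

# Simulated stream: update on each mini-batch instead of retraining from scratch
for _ in range(20):
    X_batch = rng.normal(size=(32, 5))
    y_batch = (X_batch[:, 0] > 0).astype(int)  # toy labelling rule
    model.partial_fit(X_batch, y_batch, classes=classes)

print("prediction for a new point:", model.predict(rng.normal(size=(1, 5))))
```
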
27
Transfer learning -
To deal with the cold start problem. A model learned for one task is reused as the starting point for learning a model for another task (reuse model learned to recognize activity in one room for another room). Activity Recognition: A model trained to recognize activities in one room can be reused to recognize activities in another room, instead of starting from scratch.
28
Data augmentation -
This means generating additional labeled data points from the data you already have. • For image data, well-developed methods already exist, such as rotation, flipping, brightness adjustments, or using neural networks to generate new images based on existing ones. • For other types of data (such as text, tabular data, or time series), there are not as many well-established methods, making it harder to generate realistic additional examples.
29
Black box vs white box, interactive ML -
Black Box: User feedback is based only on the input and output of the model. White Box: User provides feedback on the internal structure of the model, offering more transparency. Increases user trust by showing how the model works. Requires visualising the model and data, and some ML expertise. Used mainly in offline IML so far.
30
Unsupervised learning -
the algorithms are provided with data that does not contain any labels or explicit instructions on what to do with it. The goal is for the learning algorithm to find structure in the input data on its own. Large datasets are costly, especially with time-consuming and expensive labeling. It’s useful when the number or type of classes in the data is unknown. Applications: market basket analysis, medical diagnosis, marketing, social media. Divided into two problems: Clustering and dimension reduction.
31
Clustering -
Clustering algorithms find hidden patterns in the data based on their similarities or differences. These patterns can relate to the shape, size, or color and are used to group data items or create clusters. Clustering methods: Partitioning clustering (K-means and K-medoids) and Hierarchical clustering (agglomerative clustering and divisive clustering).
32
Partitioning clustering -
Partitioning clustering algorithms group data based on similarities and differences. Key characteristics: * Non-overlapping: each data point belongs to one cluster. * Predefined number of clusters (a hyperparameter). * Center-based: each cluster is described by its center. * Objective function: measures data similarity/dissimilarity. * Iterative optimization: process based on similarity criteria, often using Euclidean distance.
33
K-means -
K-means is a clustering algorithm that groups similar data points together. Each group (cluster) is defined by a centroid (a central point). Goal: Group data into K clusters based on similarity. How It Works: 1. Choose K – pick the number of clusters. 2. Initialize centroids – randomly select K starting points. 3. Assign points – each data point joins the nearest centroid. 4. Update centroids – recalculate the center of each cluster. 5. Repeat – until the centroids stop changing. Finding the Best K: Elbow Method – find the "bend" in a WCSS (Within-Cluster Sum of Squares) plot. Silhouette Score – measures how well points fit in their clusters (closer to 1 = better).
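
A minimal K-means sketch with scikit-learn, including the two ways of judging K mentioned above; the blob data and the range of K values are assumptions:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # toy data with 3 true clusters

for k in range(2, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the WCSS used by the elbow method; silhouette closer to 1 is better
    print(f"k={k}  WCSS={km.inertia_:.1f}  silhouette={silhouette_score(X, km.labels_):.2f}")
```
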
34
K-medoids -
Like K-means, but each cluster center is an actual data sample (a medoid) chosen from the cluster, rather than a computed mean. Result: greater resistance to outliers and noise.
35
Hierarchical clustering -
Groups data into a tree structure (dendrogram), showing the hierarchical relationship between clusters. Steps: Start with each sample as a separate cluster. Merge the closest clusters. Repeat merging until all samples are in one large cluster. Strategies: Agglomerative Clustering (Bottom-up): Starts with individual clusters and merges the closest ones until all data points form a single cluster. Uses a distance metric (e.g., Euclidean distance). Divisive Clustering (Top-down): Starts with all data in one cluster and recursively splits it into smaller clusters until each sample is its own cluster.
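
A brief agglomerative (bottom-up) clustering sketch with scikit-learn; the toy data and the cut at 3 clusters are assumptions:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# Bottom-up: start from single points and repeatedly merge the closest clusters (Ward linkage here)
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = agg.fit_predict(X)
print("cluster sizes:", [list(labels).count(c) for c in set(labels)])
```
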
36
Dimensionality reduction -
Dimensionality reduction is used to reduce the number of features in a dataset while retaining as much of the important information as possible. It is a process of transforming high-dimensional data into a lower-dimensional space that still preserves the essence of the original data. Reasons for using: simplifying complex data, removal of excesses, noise reduction. Feature selection and Feature extraction.
37
Feature extraction -
Creating new feature sets from the original features (finding a combination of new features). Different methods: 1. Linear: PCA (principal component analysis, converting the original features into uncorrelated features), LDA (linear discriminant analysis, a technique that uses the class labels to maximize between-class distance and minimize within-class distance). 2. Non-linear: LLE (locally linear embedding, a dimensionality reduction method that preserves the geometric structure of high-dimensional data. It transforms complex structures (e.g., a 3D Swiss roll) into a lower-dimensional space while keeping relationships intact).
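
A short PCA sketch with scikit-learn (iris data assumed) showing linear feature extraction down to two new, uncorrelated components:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)               # 4 original features
X_scaled = StandardScaler().fit_transform(X)    # PCA is sensitive to feature scale

pca = PCA(n_components=2)                       # keep 2 principal components
X_2d = pca.fit_transform(X_scaled)

print("new shape:", X_2d.shape)
print("variance explained per component:", pca.explained_variance_ratio_)
```
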
38
Feature selection -
selecting a subset of the main features, removing less important features.
39
ML strategies, Centralized and decentralized learning -
Centralized Learning: Data is collected and processed in one central server. Model training and inference happen on the central server. * Suitable when: Data is centralized. No strict privacy or transfer constraints. E.g. centralized patient data. Decentralized Learning: Data is generated across multiple devices (phones, IoT). Enables smarter models without centralizing data.
40
ML strategies, Distributed learning -
Distributed learning reduces the cost of training a model on a centralized server by using multiple computers (nodes) across a network, like in clusters or cloud systems. The training process is divided into smaller tasks, executed on different machines in parallel. Data is spread across these nodes. Nodes communicate to share information and update the model. Used for large datasets or complex models that need a lot of computing power. Tools like Apache Spark, TensorFlow, and PyTorch are used for distributed learning.
41
ML strategies, Federated learning -
Federated learning is a privacy-focused approach where model training happens on decentralized devices (edge devices or local servers) without sending raw data to a central location. Each device trains the model locally and shares only updates with a central server, which combines these updates to improve the global model. It’s used when data privacy is important, and data cannot be easily centralized. Commonly applied in mobile devices, healthcare, and IoT. When to use Federated Learning: When on-device data is more relevant than data stored on servers. When the data is sensitive or large (e.g., health data, IoT). When labels can be derived from user interactions. Federated Learning Strategies: Centralized Federated Learning: Needs a central server to coordinate clients and gather updates. Decentralized Federated Learning: No central server; devices share updates directly with each other, avoiding a single point of failure.
42
Advantages/disadvantages and how federated learning works -
How it works: The central server sends an untrained model to devices. Each device trains the model with local data. Devices send back trained models (not data) to the server. The server combines the models, often by averaging. This process repeats until the global model improves. The updated model is sent to devices for testing. Pros: Privacy: data remains on the user's device. Reduced latency: the updated model can make predictions on the user's device. Smarter models: collaborative training process. Cons: Implementation cost: higher than collecting the data and processing it centrally. Communication is expensive. Unreliable client availability.
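
A toy NumPy sketch of FedAvg-style rounds: clients train locally (here just a few gradient steps on a linear model) and the server averages their weights. Everything here is a simplified assumption for illustration, not a full federated-learning framework:

```python
import numpy as np

rng = np.random.default_rng(0)
global_w = np.zeros(3)                              # global linear-model weights

def local_update(w, X, y, lr=0.1, epochs=5):
    """Client-side training: a few gradient steps on local data only."""
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)       # MSE gradient
        w = w - lr * grad
    return w

# Each client keeps its own private data; only model weights travel to the server
true_w = np.array([1.0, -2.0, 0.5])
clients = []
for _ in range(4):
    X = rng.normal(size=(50, 3))
    y = X @ true_w + rng.normal(scale=0.1, size=50)
    clients.append((X, y))

for _ in range(10):                                 # repeated federated rounds
    local_weights = [local_update(global_w.copy(), X, y) for X, y in clients]
    global_w = np.mean(local_weights, axis=0)       # FedAvg: average the client models

print("learned global weights:", global_w)
```
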
43
Federated learning algorithms -
Federated Stochastic Gradient Descent (FedSGD). Federated Averaging (FedAvg): the common baseline algorithm. Federated Learning with Dynamic Regularization (FedDyn).
44
Distributed computing -
A method where multiple computers work together to solve a problem, appearing as a powerful single computer. It handles complex tasks like encrypting large data, solving equations, and rendering 3D animations. Examples: Cloud Computing. Edge Computing. Fog Computing: a hybrid approach that balances processing between the cloud and edge devices. Pros: - Scalability: can grow by adding more computing devices (nodes) when handling an increased workload. - Availability: the system remains functional even if a computer fails (fault tolerance). - Consistency: automatically manages data consistency across computers, ensuring reliable data without compromising fault tolerance. - Transparency: users interact with it as if it were a single system. - Efficiency: optimizes hardware use for faster performance, handling workloads without system failures.
45
Types of distributed systems -
- Client-Server Architecture: Client requests data from the server; server manages and synchronizes resources. Easy to maintain and secure, but communication can become a bottleneck. - Three-Tier Architecture: Client: same as client-server. Application Server: handles communication and application logic. Database Server: manages data storage. Reduces the communication bottleneck, but more complex than client-server. - N-Tier Architecture: Multiple client-server systems working together to solve a problem. Used in modern distributed systems with various enterprise applications. Increased complexity. - Peer-to-Peer Architecture: All networked computers share equal responsibilities. Popular for content sharing, file streaming, and blockchain networks. No central control; can be harder to manage.
46
How distributed systems work? -
Can use either loose or tight coupling. Loose Coupling: Components are weakly connected and can operate independently. Changes made to one component don’t affect the others. E.g. a client sends a message to a server and does other jobs until it gets a response. Tight Coupling: Components are strongly connected, often relying on each other to work efficiently. Changes in one component can directly impact the others.
47
Parallel computing -
Types of Parallelism: - Task Parallelism: Different tasks run on separate cores/processors. - Data Parallelism: The same task runs on different data chunks at the same time. Hardware: Multi-core Processor: A CPU with multiple cores that can perform tasks simultaneously. GPUs/TPUs: Specialized processors for fast, parallel execution (mainly used in AI). Parallel computing speeds up tasks by running them at the same time, using multi-core CPUs or specialized hardware like GPUs/TPUs.
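
A tiny data-parallelism sketch using Python's standard library: the same function is applied to different data chunks in separate worker processes. The workload and chunking scheme are made-up stand-ins:

```python
from concurrent.futures import ProcessPoolExecutor

def process_chunk(chunk):
    """Same task applied to one chunk of the data (data parallelism)."""
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i::4] for i in range(4)]           # split the data into 4 chunks

    with ProcessPoolExecutor(max_workers=4) as pool:  # one worker per core (illustrative)
        partials = list(pool.map(process_chunk, chunks))

    print("total:", sum(partials))
```
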
48
Distributed vs parallel computing -
Distributed System: Components are located in different places. Multiple computers in a network work together. Each computer has its own memory. Tasks are distributed across different computers. Parallel Computing: Multiple processes run simultaneously using multiple processors. A single computer is used. Processors may share or have separate memory. Tasks are performed within one system.
49
Parallel processing advantages -
* can process large datasets in a fraction of the time. * less memory and compute requirements are needed, as the set of instructions is distributed to smaller execution nodes. * more execution nodes can be added or removed from the processing network depending on the complexity of the problem.
50
Horizontal scaling, embarrassingly parallel problems, data locality and fault tolerance -
Horizontal Scaling: Increases capacity by adding more computers to a cluster. Expands storage and computing power by adding nodes. Embarrassingly Parallel: Tasks are easy to split and run independently. If one process fails, it can be re-run without affecting others. Data Locality: Data is processed where it's stored to improve efficiency. Computation and output happen on the same node to reduce data movement. Fault Tolerance: Allows the system to keep running even if some components fail. Ensures reliability and continuous operation.
51
K-means pros and cons -
Pros: * Scalability: K-means is a scalable algorithm that can handle large datasets with high dimensionality. * Speed: a relatively fast algorithm, making it suitable for real-time or near-real-time applications. * Simplicity: a simple algorithm to implement and understand. Cons: * User-defined K: requires the user to specify the number of clusters (K) beforehand. * Non-convex shaped clusters: assumes clusters are round (spherical); struggles with irregularly shaped clusters. * Can't handle noisy data: sensitive to noisy data or outliers, which can significantly affect the clustering results.