AI & ML Flashcards
Artificial Intelligence
Development of intelligent systems capable of performing tasks that typically require human intelligence
Examples include perception, reasoning, learning, problem-solving, decision-making
Used for technologies like computer vision, facial recognition, fraud detection, and intelligent document processing
Machine Learning
Subset of AI focused on building methods that allow machines to learn; a part of AI, not a synonym for it
Data is leveraged to improve computer performance on a set of tasks
Make predictions based on data used to train the model; no explicit programming of rules
Neural Network
Method in AI where nodes are connected together and organized in layers, talking to each other by passing data to the next layer
Creates an adaptive system that computers use to learn from their mistakes and improve continuously
Consists of Input Layer, Hidden Layers, and Output Layer
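A minimal sketch of data passing from layer to layer, assuming a sigmoid activation and made-up weights (the source names neither); the helper names and the 2-3-1 layout are hypothetical.

```python
import math

def sigmoid(x):
    # Squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def dense_layer(inputs, weights, biases):
    # Each node sums its weighted inputs, adds its bias, then applies sigmoid
    return [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(weights, biases)
    ]

def forward(inputs, layers):
    # The output of one layer becomes the input of the next layer
    activations = inputs
    for weights, biases in layers:
        activations = dense_layer(activations, weights, biases)
    return activations

# Hypothetical network: 2 inputs -> 3 hidden nodes -> 1 output node
hidden = ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1])
output = ([[0.7, -0.5, 0.2]], [0.05])
prediction = forward([1.0, 2.0], [hidden, output])
```

Training would adjust the weights and biases from the prediction error; that step is left out here.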
Deep Learning
Method in AI that teaches computers to process data in a way that is inspired by the human brain
Uses neurons and synapses to train a model; processes more complex patterns in the data than traditional ML
Used for Computer Vision and NLP; takes a large amount of input data and typically requires GPUs
Generative AI
Subset of Deep Learning used to generate new data similar to the data it was trained on, such as images, text, audio, video, and code
Unlabeled Data is used to pre-train a Foundation Model backed by a neural network; this model can then be adapted for more specific uses like text generation, info extraction, chatbots, and more
Training Data
Large dataset used to train ML models to process information and accurately predict outcomes; preparing it is the most critical stage in building a good model
Can be Structured or Unstructured; Labeled or Unlabeled
Labeled Data
ML data that includes both input features and corresponding output labels
Used for Supervised Learning, where the model is trained to map inputs to known outputs
For example, dataset with images of animals where each image is labeled with the corresponding animal type
Unlabeled Data
ML data that includes only input features without any output labels
For example, a collection of images without any associated labels
Used for Unsupervised Learning, where the model tries to find patterns or structures in the data
Structured Data
Data is organized in a structured format, often in rows and columns
Tabular Data is data arranged in a table with rows representing records and columns representing features
Time Series Data is a series of data points collected or recorded at successive points in time
Unstructured Data
Data that doesn’t follow a specific structure and is often text-heavy or multimedia content
Text Data is unstructured text such as articles, social media posts, or customer reviews
Image Data is data in the form of images, which can vary widely in format and content
Supervised Learning
ML method that learns a mapping function that can predict the output for new, unseen input data
Needs Labeled Data; very powerful, but labeling millions of datapoints is difficult
Regression
Supervised Learning technique used to predict a numeric value based on input data
The output variable is continuous, meaning it can take any value within a range
Used when the goal is to predict a quantity or a real value; predicting house prices, stock prices, weather forecasting, etc.
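A minimal sketch of regression, assuming ordinary least squares on one input feature; the house sizes and prices are made-up numbers.

```python
def fit_line(xs, ys):
    # Ordinary least squares for y = slope * x + intercept
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical labeled data: size in square meters -> price in thousands
sizes = [50, 80, 100, 120]
prices = [150, 240, 300, 360]

slope, intercept = fit_line(sizes, prices)
predicted_price = slope * 90 + intercept  # continuous output for an unseen size
```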
Classification
Supervised Learning technique used to predict the categorical label of input data
Output variable is discrete, which means it falls into a specific category or class
Used for scenarios where decisions or predictions need to be made between distinct categories; fraud, image types, diagnostics
Classify emails, animals; give labels to movies
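A minimal sketch of classification, assuming a 1-nearest-neighbor rule (one of many possible classifiers, not one the source names); the animal measurements are made up.

```python
def nearest_neighbor(train, point):
    # Return the label of the training example closest to `point`
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, label = min(train, key=lambda example: squared_distance(example[0], point))
    return label

# Hypothetical labeled data: (weight_kg, height_cm) -> animal type
train = [
    ((4.0, 25.0), "cat"),
    ((5.0, 23.0), "cat"),
    ((30.0, 60.0), "dog"),
    ((25.0, 55.0), "dog"),
]

small_animal = nearest_neighbor(train, (4.5, 24.0))
large_animal = nearest_neighbor(train, (28.0, 58.0))
```

The output is one of a fixed set of classes, never a value in between.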
Training Set
Data set used to train the model
Typically, 60-80% of the dataset
For example, 800 labeled images from a dataset of 1000 images
Validation Set
Data set used to tune hyperparameters and validate performance during training
Typically, 10-20% of the dataset
For example, 100 labeled images for hyperparameter tuning
Test Set
Data set used to evaluate the final model performance
Typically, 10-20% of the dataset
For example, 100 labeled images to test the model’s accuracy
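A minimal sketch of an 80/10/10 split for the 1000-image example, assuming a shuffle before splitting; the function name, file names, and seed are hypothetical.

```python
import random

def split_dataset(items, train_frac=0.8, val_frac=0.1, seed=0):
    # Shuffle a copy so the three sets are random samples of the data
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains
    return train, val, test

images = [f"image_{i}.jpg" for i in range(1000)]
train, val, test = split_dataset(images)
```

The three sets do not overlap, so the test set stays unseen until the final evaluation.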
Feature Engineering
Process of using domain knowledge to select and transform raw data into meaningful features
Helps enhance performance of ML models
Feature Extraction
Feature Engineering technique where you extract useful information from raw data, such as deriving age from date of birth
Feature Selection
Feature Engineering technique where you select a subset of relevant features, like choosing important predictors in a regression model
Feature Transformation
Feature Engineering technique where you transform data for better model performance, such as normalizing numerical data
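A minimal sketch of the three techniques above; the function names and sample values are hypothetical, and min-max scaling is just one common way to normalize.

```python
from datetime import date

def age_from_dob(dob, today):
    # Feature Extraction: derive age in whole years from date of birth
    had_birthday = (today.month, today.day) >= (dob.month, dob.day)
    return today.year - dob.year - (0 if had_birthday else 1)

def select_features(record, keep):
    # Feature Selection: keep only the named features
    return {name: record[name] for name in keep}

def min_max_normalize(values):
    # Feature Transformation: rescale to [0, 1]; assumes values are not all equal
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

age = age_from_dob(date(1990, 6, 15), date(2024, 6, 14))
selected = select_features({"id": 7, "age": 33, "income": 52000}, ["age", "income"])
scaled = min_max_normalize([10, 20, 30])
```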
Unsupervised Learning
ML method for discovering inherent patterns, structures, or relationships within the input data
Machine must uncover and create the groups itself, but humans still put labels on the output groups
Feature Engineering can help improve the quality of the training
Clustering use cases include customer segmentation, targeted marketing, recommender systems
Clustering
Unsupervised Learning technique used to group similar data points together into clusters based on their features
For example, segment customers to understand different purchasing behaviors; then, target each segment with tailored marketing strategies
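A minimal sketch of clustering, assuming k-means on a single feature with hand-picked starting centroids (the source does not name an algorithm); the spend values are made up.

```python
def k_means_1d(values, centroids, iterations=10):
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest centroid
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical monthly spend per customer
spend = [12, 15, 14, 95, 102, 98]
centroids, clusters = k_means_1d(spend, centroids=[0.0, 50.0])
```

The algorithm only returns groups; naming them "low spenders" and "high spenders" is still up to a human.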
Association Rule Learning
Unsupervised Learning technique used to discover relationships between items, such as which ones tend to occur together
For example, understand which products are frequently bought together; then, supermarket can place associated products together to boost sales
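A minimal sketch of how "frequently bought together" can be quantified, assuming the standard support and confidence measures (not named in the source); the baskets are made up.

```python
def support(baskets, items):
    # Fraction of baskets that contain every item in `items`
    items = set(items)
    return sum(1 for basket in baskets if items <= set(basket)) / len(baskets)

def confidence(baskets, antecedent, consequent):
    # Of the baskets containing the antecedent, the fraction that also contain the consequent
    both = set(antecedent) | set(consequent)
    return support(baskets, both) / support(baskets, antecedent)

baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk", "eggs"},
]

bread_butter_support = support(baskets, {"bread", "butter"})
bread_to_butter = confidence(baskets, {"bread"}, {"butter"})
```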
Anomaly Detection
Unsupervised Learning technique used to identify outliers and strange patterns in data
For example, fraud detection in credit card purchases
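A minimal sketch of anomaly detection, assuming a simple z-score rule with a hypothetical threshold of 2 standard deviations; real fraud systems use richer methods, and the purchase amounts are made up.

```python
import statistics

def find_outliers(values, threshold=2.0):
    # Flag values whose distance from the mean exceeds `threshold` standard deviations
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) / std > threshold]

# Hypothetical card purchases; one is far larger than the rest
amounts = [20, 25, 22, 19, 24, 21, 23, 500]
suspicious = find_outliers(amounts)
```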
Semi-Supervised Learning
ML learning method that uses a small amount of labeled data and a large amount of unlabeled data to train systems
The model is first trained on the labeled data; the partially trained model then labels the unlabeled data itself, which is called pseudo-labeling
The model is then re-trained on the resulting data mix without being explicitly programmed
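A minimal sketch of pseudo-labeling, assuming a nearest-centroid classifier on one feature as the model (a stand-in chosen for brevity); all names and values are hypothetical.

```python
def train_centroids(examples):
    # The "model" is the mean feature value of each label
    groups = {}
    for value, label in examples:
        groups.setdefault(label, []).append(value)
    return {label: sum(vs) / len(vs) for label, vs in groups.items()}

def predict(centroids, value):
    # Predict the label whose centroid is closest to the value
    return min(centroids, key=lambda label: abs(value - centroids[label]))

labeled = [(1.0, "low"), (9.0, "high")]  # small labeled set
unlabeled = [1.5, 2.0, 8.0, 8.5]         # larger unlabeled set

model = train_centroids(labeled)                              # 1. train on labeled data
pseudo_labeled = [(v, predict(model, v)) for v in unlabeled]  # 2. pseudo-label the rest
model = train_centroids(labeled + pseudo_labeled)             # 3. re-train on the mix
```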
Reinforcement Learning
Type of ML where an agent learns to make decisions by performing actions in an environment to maximize cumulative rewards
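A minimal sketch of reinforcement learning, assuming tabular Q-learning on a hypothetical 4-state corridor with a reward of 1 at the right end. To keep it deterministic, the agent tries every action in every state on each sweep instead of exploring randomly.

```python
def q_learning_corridor(n_states=4, gamma=0.9, alpha=0.5, sweeps=50):
    actions = (-1, +1)   # move left or move right
    goal = n_states - 1  # terminal state
    q = {(s, a): 0.0 for s in range(n_states) for a in actions}
    for _ in range(sweeps):
        for s in range(goal):  # no actions are taken from the goal
            for a in actions:
                nxt = min(max(s + a, 0), goal)  # walls at both ends
                reward = 1.0 if nxt == goal else 0.0
                future = 0.0 if nxt == goal else max(q[(nxt, b)] for b in actions)
                # Move the estimate toward reward + discounted future value
                q[(s, a)] += alpha * (reward + gamma * future - q[(s, a)])
    return q

q = q_learning_corridor()
# Greedy policy: in each state, pick the action with the highest learned value
policy = {s: max((-1, +1), key=lambda a: q[(s, a)]) for s in range(3)}
```

The learned policy moves right in every state, which maximizes the cumulative discounted reward.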
RLHF
Reinforcement Learning from Human Feedback
Uses human feedback to help ML models self-learn more efficiently; the feedback is incorporated into the reward function so the model is better aligned with human goals, wants, and needs
Used throughout GenAI applications, including LLMs; significantly enhances model performance
Steps: Data Collection, Supervised Fine-Tuning, Training Reward Model, Optimization
Model Fit
Measurement of how well a machine learning model adapts to data that is similar to the data on which it was trained
Overfitting: performs well on the training data, but doesn’t perform well on evaluation data
Underfitting: performs poorly on training data; could be a problem of having a model too simple or poor data features
Balanced: performs well on both training data and evaluation data
Bias
Difference, or error, between predicted and actual value
Occurs due to wrong choices or assumptions in the ML process
If high, model doesn’t closely match the training data; considered as underfitting
Reduce this by using a more complex model and increasing the number of features
Variance
How much the performance of a model changes if trained on a different dataset which has a similar distribution
If high, model is very sensitive to changes in the training data; considered as overfitting
Reduce this through feature selection (fewer, more important features) and by splitting into training and test data sets multiple times
Confusion Matrix
Matrix that summarizes the performance of a machine learning model on a set of test data
Best way to evaluate the performance of a classification model, e.g. binary classification
Precision best when false positives are costly; Recall best when false negatives are costly
F1 Score best for balance of Precision and Recall, especially for imbalanced datasets; Accuracy best for balanced datasets
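A minimal sketch of the four metrics computed from a binary confusion matrix; the labels and predictions are made up, with 1 as the positive class.

```python
def confusion_counts(actual, predicted, positive=1):
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # true positive
        elif p == positive:
            fp += 1  # false positive
        elif a == positive:
            fn += 1  # false negative
        else:
            tn += 1  # true negative
    return tp, fp, fn, tn

def classification_metrics(tp, fp, fn, tn):
    # Assumes at least one predicted positive and one actual positive
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
counts = confusion_counts(actual, predicted)
precision, recall, f1, accuracy = classification_metrics(*counts)
```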
AUC-ROC
Area Under the Curve of the Receiver Operating Characteristic
Value from 0 to 1; 1 represents a perfect model, and 0.5 is no better than random guessing
Shows how the true positive rate compares to the false positive rate at various thresholds, each of which yields its own confusion matrix
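A minimal sketch of computing AUC, using the equivalent pairwise view: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The labels and scores are made up.

```python
def auc_roc(labels, scores):
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    # Count pairs where the positive outranks the negative (ties count as half)
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.2]  # hypothetical model scores
auc = auc_roc(labels, scores)
```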
Regression Metrics
MAE: Mean Absolute Error; MAPE: Mean Absolute Percentage Error; RMSE: Root Mean Squared Error
MAE, MAPE, and RMSE measure the error between predicted and actual values; lower means a more accurate model
R² measures how much of the variance in the target the model explains; close to 1 means predictions are good
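A minimal sketch of the four metrics using their standard formulas; the actual and predicted values are made up.

```python
import math

def regression_metrics(actual, predicted):
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    # MAPE is a percentage; assumes no actual value is zero
    mape = 100 * sum(abs(e / a) for e, a in zip(errors, actual)) / n
    rmse = math.sqrt(sum(e ** 2 for e in errors) / n)
    mean_actual = sum(actual) / n
    ss_res = sum(e ** 2 for e in errors)                  # unexplained variation
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total variation
    r2 = 1 - ss_res / ss_tot
    return mae, mape, rmse, r2

actual = [100, 200, 300, 400]
predicted = [110, 190, 310, 390]
mae, mape, rmse, r2 = regression_metrics(actual, predicted)
```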
Inferencing
When a model makes predictions on new data
Real Time: models have to make decisions quickly as data arrives, and speed is preferred over perfect accuracy; e.g. chatbots
Batch: a large amount of data is analyzed all at once, and accuracy is preferred over speed; often used for data analysis
Edge Inferencing
When a model makes predictions on new data close to where the data is being generated
Uses edge devices to run your model; they have less computational power but are close to your data
Small Language Models on edge devices offer very low latency, low compute footprint, and offline capability
Large Language Models on remote servers are more powerful, but have higher latency and require a network connection
Hyperparameter Tuning
Finding the best hyperparameter values to optimize model performance
Hyperparameters are settings that define the model structure, learning algorithm, and process; set before training begins
Tuning improves model accuracy, reduces overfitting, and enhances generalization
Learning Rate
Hyperparameter for how large or small the steps are when updating the model’s weights during training
High rate can lead to faster convergence but risks overshooting the optimal solution
Low rate may result in more precise but slower convergence
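A minimal sketch of the effect of the learning rate, assuming plain gradient descent on the toy loss (w - 3)², whose optimal weight is 3; the function name and the three rates are hypothetical.

```python
def gradient_descent(learning_rate, steps=20, start=0.0):
    w = start
    for _ in range(steps):
        gradient = 2 * (w - 3)        # derivative of (w - 3) ** 2
        w -= learning_rate * gradient  # step size is scaled by the learning rate
    return w

well_tuned = gradient_descent(0.1)   # ends close to 3 within 20 steps
too_low = gradient_descent(0.01)     # still far from 3 after 20 steps
too_high = gradient_descent(1.1)     # overshoots further on every step
```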
Batch Size
Hyperparameter for number of training examples used to update the model weights in one iteration
Smaller sizes give noisier, less stable weight updates and take more time per epoch, though the noise can help generalization
Larger sizes are faster per epoch and give smoother updates, but need more memory and may generalize worse
Epochs
Hyperparameter that refers to how many times the model will iterate over the entire training dataset
Too few can lead to underfitting, while too many may cause overfitting
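A minimal sketch of how Batch Size and Epochs together determine the number of weight updates; the loop only counts updates, and the dataset size is hypothetical.

```python
def iterate_minibatches(data, batch_size):
    # Yield consecutive slices; the last batch may be smaller
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

def count_updates(data, batch_size, epochs):
    updates = 0
    for _ in range(epochs):  # one epoch = one full pass over the data
        for batch in iterate_minibatches(data, batch_size):
            updates += 1     # a real loop would update the weights from `batch` here
    return updates

data = list(range(1000))
small_batches = count_updates(data, batch_size=32, epochs=5)
large_batches = count_updates(data, batch_size=100, epochs=5)
```

With the same data and epochs, a smaller batch size means more weight updates per epoch.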