ML fundamentals Flashcards
What are the different types of ML?
- Types of Machine Learning:
  - Supervised Learning: Learning from labeled data (e.g., regression, classification).
  - Unsupervised Learning: Learning from unlabeled data (e.g., clustering, dimensionality reduction).
- Supervised Learning Tasks:
  - Regression: Predict continuous output (e.g., predicting house prices).
  - Classification: Predict discrete labels (e.g., image classification, spam detection).
- Unsupervised Learning Tasks:
  - Clustering: Group similar data points together (e.g., K-Means, DBSCAN).
  - Dimensionality Reduction: Reduce the number of input features (e.g., PCA, t-SNE).
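As a quick illustration, here is a minimal sketch contrasting the two types with scikit-learn; the tiny arrays are placeholders, not real data.

```python
# Minimal sketch contrasting supervised and unsupervised learning with scikit-learn.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Supervised (regression): labels y are known, so the model learns the mapping X -> y.
y = np.array([1.2, 1.9, 3.1, 4.0, 5.2])
reg = LinearRegression().fit(X, y)
print(reg.predict([[6.0]]))          # predict a continuous value for unseen input

# Unsupervised (clustering): no labels, the model groups similar points.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)                    # cluster assignment for each sample
```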
What are key algorithms and evaluation metrics?
- Key Algorithms:
  - Linear Models: Linear Regression, Logistic Regression, Ridge, and Lasso.
  - Tree-Based Models: Decision Trees, Random Forests, Gradient Boosting (XGBoost, LightGBM, CatBoost).
  - Neural Networks: Used for deep learning tasks (feedforward, convolutional, recurrent networks).
- Model Evaluation Metrics:
  - For Regression:
    - Mean Absolute Error (MAE)
    - Mean Squared Error (MSE) / Root Mean Squared Error (RMSE)
    - R-squared
  - For Classification:
    - Accuracy, Precision, Recall, F1-Score
    - Receiver Operating Characteristic (ROC) curve, Area Under Curve (AUC)
    - Confusion Matrix
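To make these metrics concrete, here is a minimal sketch of how they can be computed with scikit-learn; the y_true/y_pred arrays are illustrative placeholders, not real results.

```python
# Minimal sketch: computing common evaluation metrics with scikit-learn.
import numpy as np
from sklearn.metrics import (
    mean_absolute_error, mean_squared_error, r2_score,
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, confusion_matrix,
)

# Regression metrics
y_true_reg = np.array([3.0, 5.0, 2.5, 7.0])
y_pred_reg = np.array([2.8, 5.4, 2.9, 6.5])
mae = mean_absolute_error(y_true_reg, y_pred_reg)
mse = mean_squared_error(y_true_reg, y_pred_reg)
rmse = np.sqrt(mse)                      # RMSE is the square root of MSE
r2 = r2_score(y_true_reg, y_pred_reg)

# Classification metrics (binary labels; y_scores are predicted probabilities)
y_true_clf = np.array([0, 1, 1, 0, 1])
y_pred_clf = np.array([0, 1, 0, 0, 1])
y_scores   = np.array([0.2, 0.9, 0.4, 0.1, 0.8])
acc  = accuracy_score(y_true_clf, y_pred_clf)
prec = precision_score(y_true_clf, y_pred_clf)
rec  = recall_score(y_true_clf, y_pred_clf)
f1   = f1_score(y_true_clf, y_pred_clf)
auc  = roc_auc_score(y_true_clf, y_scores)
cm   = confusion_matrix(y_true_clf, y_pred_clf)
```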
Core Languages in Machine Learning
Python
NumPy: For numerical operations and matrix manipulations.
pandas: For data manipulation and preprocessing.
Matplotlib, Seaborn: For data visualization.
Scikit-learn: For traditional ML models and utilities like train-test split, cross-validation.
TensorFlow, PyTorch: For deep learning and neural networks.
SciPy: For scientific computations (e.g., optimization).
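A minimal sketch of how a few of these libraries fit together (pandas for the data; scikit-learn for the train-test split, cross-validation, and a simple model); the tiny DataFrame is a placeholder, not real project data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "feature_1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "feature_2": [0.5, 0.7, 0.2, 0.9, 0.4, 0.8],
    "target":    [1.1, 2.3, 2.8, 4.5, 4.9, 6.2],
})
X, y = df[["feature_1", "feature_2"]], df["target"]

# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))            # R-squared on held-out data
print(cross_val_score(model, X, y, cv=3))     # 3-fold cross-validation scores
```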
MLOps
Docker:
Containerization tool for ensuring reproducibility and portability across different environments. Widely used in model deployment to encapsulate the entire ML pipeline in containers.
Kubernetes:
Orchestration tool for managing containerized applications at scale. Common in production environments for managing and scaling ML services.
Natural Language Processing (NLP) Tools
spaCy:
High-performance NLP library for tasks like tokenization, POS tagging, and named entity recognition (NER).
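A minimal spaCy sketch covering tokenization, POS tagging, and NER; it assumes the small English model has been downloaded (e.g., `python -m spacy download en_core_web_sm`).

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

for token in doc:
    print(token.text, token.pos_)      # each token and its part-of-speech tag

for ent in doc.ents:
    print(ent.text, ent.label_)        # named entities and their labels
```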
Explain the type of ML you implemented
What is Supervised Learning?
In supervised learning, we train a model with input data and known output (labeled data). The goal is for the model to learn the relationship between the inputs and the target output so that it can make accurate predictions on new, unseen data.
Regression: This is a type of supervised learning where the output (target) is continuous (e.g., predicting the price of a house or tomorrow's temperature).
Classification: In contrast, if the output is categorical (e.g., classifying emails as spam or not spam), it's called classification.
Process of Supervised Learning (Regression)
Let’s break down the process into simple steps:
1. Collect and Prepare Data: We gather the input features (independent variables) and target labels (dependent variable).
2. Split Data: The data is usually split into a training set and a test set. The model learns from the training data and is evaluated on the test data.
3. Choose a Model: In regression, the model can be a simple one like Linear Regression or a more complex model like Random Forest Regressor.
4. Train the Model: The model "learns" from the training data by minimizing an error function (e.g., Mean Squared Error).
5. Evaluate the Model: After training, the model is evaluated on the test data to check its performance.
6. Make Predictions: Once trained, the model can predict the target value for new, unseen data.
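A minimal end-to-end sketch of these steps with scikit-learn, using synthetic placeholder data rather than a real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# 1. Collect and prepare data (synthetic features X and continuous target y)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

# 2. Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3-4. Choose and train a model
model = LinearRegression().fit(X_train, y_train)

# 5. Evaluate on the held-out test set
mse = mean_squared_error(y_test, model.predict(X_test))
print(f"Test MSE: {mse:.4f}")

# 6. Predict on new, unseen data
print(model.predict([[0.5, -0.2, 1.0]]))
```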
Explain ML data preparation
Purpose of Data Preparation:
In any machine learning project, data preparation is a critical step because the quality of the input data directly impacts the model’s performance. Poorly prepared data can lead to inaccurate predictions, overfitting, or underfitting.
For your project in 3D printing, you had various types of data (e.g., numeric sensor readings, categorical variables like adhesive and material type). The goal of this step is to convert raw data into a format that can be efficiently processed by machine learning algorithms.
Data Cleaning and Encoding: One-hot encoding turns a categorical variable (e.g., adhesive type) into numerical columns; normalization scales numeric features to a common range.
Missing data: Handled with forward filling and interpolation.
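A minimal sketch of these preparation steps with pandas and scikit-learn; column names such as adhesive_type and temperature are hypothetical examples, not the project's actual schema.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    "adhesive_type": ["A", "B", "A", "C"],
    "temperature":   [210.0, None, 215.0, 220.0],   # contains a missing value
})

# Missing data: forward fill, then interpolate anything still missing
df["temperature"] = df["temperature"].ffill().interpolate()

# Encoding: one-hot encode the categorical adhesive type
df = pd.get_dummies(df, columns=["adhesive_type"])

# Normalization: scale the numeric feature into [0, 1]
df["temperature"] = MinMaxScaler().fit_transform(df[["temperature"]]).ravel()
print(df)
```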
Feature selection is the process of choosing a subset of relevant, important, and non-redundant features (input variables) from your dataset that contribute the most to predicting the target variable (output). It is a crucial step in building machine learning models because selecting the right features can improve model performance, reduce complexity, and speed up training time.
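One simple way to do this is univariate selection; here is a minimal sketch with scikit-learn's SelectKBest on synthetic placeholder data.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                      # 5 candidate features
y = 3.0 * X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=100)

# Keep the k features most strongly related to the target
selector = SelectKBest(score_func=f_regression, k=2)
X_selected = selector.fit_transform(X, y)

print(selector.get_support())      # boolean mask of the selected features
print(X_selected.shape)            # (100, 2): only the 2 most relevant features kept
```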
Data Splitting (Train/Test Split):
Purpose: The goal is to separate your dataset into distinct subsets to ensure that the model is trained and evaluated effectively, avoiding common pitfalls like overfitting or biased performance estimates.

Why Split Data? "Data splitting is essential to evaluate how well a machine learning model generalizes to unseen data. By splitting the data into training and testing sets, we prevent overfitting and ensure that the model isn't just memorizing the training data but is learning underlying patterns that can be applied to new data."

What's the Role of a Validation Set? "A validation set is used during model development to tune hyperparameters and evaluate the model's performance before touching the test data. This helps avoid overfitting to the test set and gives a more reliable way to optimize the model."

What Happens if We Don't Split Data? "If we train and evaluate the model on the same data, we risk overfitting, where the model performs well on the data it has seen but fails to generalize to new, unseen data. This leads to poor real-world performance."

Step-by-Step Data Splitting Process
Step 1: Start with the Entire Dataset: You have a dataset containing your input features (X) and your target labels (y).
Step 2: Split Data into Training and Testing Sets: You randomly split the data into training and testing sets. For example, 80% of the data is used for training the model and 20% for testing its performance on unseen data.
Step 3: Optionally, Create a Validation Set: If you want to fine-tune hyperparameters or select models, you can split the training data further into a training set and a validation set. The validation set helps you tune the model without overfitting to the test data.
Step 4: Train the Model: You train your model using the training data. The model learns patterns from this data and optimizes its internal parameters.
Step 5: Validate the Model (Optional): After training, you can use the validation set to evaluate the model's performance and fine-tune its hyperparameters (like learning rate, regularization strength, etc.).
Step 6: Test the Model: Once the model is fully trained (and validated, if necessary), you test it on the testing data. The performance on the test set gives you an unbiased estimate of how well the model will perform on new, unseen data in the real world.
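A minimal sketch of the 80/20 split plus the optional validation split, using scikit-learn's train_test_split on synthetic placeholder data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.normal(size=100)

# Step 2: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3 (optional): carve a validation set out of the training data
# (25% of the 80% training portion gives roughly a 60/20/20 overall split)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))   # e.g., 60 20 20
```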
Explain ML Development
The purpose of model development is to train a machine learning model that can effectively learn patterns from the data and make predictions on unseen data. The goal is to create a model that generalizes well to new data and can be used in production to solve a real-world problem—in your case, predicting material and supply consumption in 3D printing.
Learning Patterns from Data: The model needs to learn relationships between the input features (e.g., temperature, humidity, material type) and the target variable (e.g., material consumption). This is done by minimizing the difference between the predicted and actual values.
Q1: Why did you choose specific algorithms (e.g., Linear Regression, Neural Networks) for model development?
Answer: I initially chose Linear Regression because it’s a simple and interpretable model that works well when the relationship between input features and the target variable is approximately linear.
Q4: What metrics did you use to evaluate model performance, and why?
Answer: Since the problem is a regression task, I primarily used Mean Squared Error (MSE) and Root Mean Squared Error (RMSE) as evaluation metrics. These metrics are easy to interpret and directly measure the average squared difference between the actual and predicted values.
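For reference, a minimal sketch of the underlying calculation; the arrays are placeholder values, not real predictions.

```python
import numpy as np

y_true = np.array([10.0, 12.0, 9.5, 11.0])
y_pred = np.array([9.6, 12.5, 9.0, 11.4])

mse = np.mean((y_true - y_pred) ** 2)   # average squared difference
rmse = np.sqrt(mse)                     # back in the original units of the target
print(mse, rmse)
```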
Steps in Model Development
1. Choose an Algorithm: You choose an appropriate machine learning algorithm based on the problem you are trying to solve. In your case, for predicting material consumption (a regression problem), algorithms like Linear Regression, Decision Trees, or more complex models like Neural Networks or Gradient Boosting could be suitable.
2. Train the Model: The training process involves feeding input data (features) into the algorithm so it can learn the relationships between features and the target. During training, the algorithm adjusts its internal parameters (weights, biases, etc.) to minimize the error between the model's predictions and the actual target values.
3. Validate and Tune the Model: After training, the model is evaluated on a validation set to check how well it generalizes to unseen data. Cross-validation is a common technique used to reduce bias and variance in the model's performance.
4. Hyperparameter Tuning: Algorithms have hyperparameters that control how the learning process works. For example, in a neural network, you may tune the learning rate or the number of layers/neurons. Finding the optimal set of hyperparameters is essential for maximizing model performance.
5. Evaluate the Model: Use various evaluation metrics to assess the model's performance. For regression problems, common metrics include Mean Squared Error (MSE), Mean Absolute Error (MAE), and R-squared. These metrics help you understand how close the model's predictions are to the actual values.
6. Model Selection: After testing multiple algorithms and tuning them, you choose the model that gives the best balance between accuracy, generalization, and simplicity.
7. Test the Model on Unseen Data: Finally, the selected model is tested on the test set, which it has never seen before. This gives a final estimate of the model's performance in a real-world setting.
8. Prepare for Deployment: Once you are satisfied with the model's performance, it is prepared for deployment. At this stage, the model is typically containerized using tools like Docker to ensure consistency across different environments.
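As an illustration of steps 3 to 5, a minimal sketch of cross-validated hyperparameter tuning with scikit-learn's GridSearchCV; the data and parameter grid are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 2.0 * X[:, 0] + rng.normal(scale=0.2, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Grid of candidate hyperparameters for a Random Forest regressor
param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5, None]}
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=3,
                      scoring="neg_mean_squared_error")
search.fit(X_train, y_train)

print(search.best_params_)                                     # best hyperparameters found
best_model = search.best_estimator_
print(mean_squared_error(y_test, best_model.predict(X_test)))  # final test-set MSE
```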
Explain ML Deployment and Monitoring
After developing a model that performs well in a controlled environment (e.g., testing/validation), the next step is to deploy the model in a production environment where it will make predictions on real-world data.
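One common first step is persisting the trained model as an artifact that a serving process can load; here is a minimal sketch with joblib. The model, data, and file name are placeholders, and a real deployment would typically wrap this in a web service and a Docker container.

```python
import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Train a stand-in for the fully developed model
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([1.1, 2.0, 3.2])
model = LinearRegression().fit(X, y)

# Save the trained model as an artifact for deployment
joblib.dump(model, "model.joblib")

# Inside the production service: load the artifact once and predict on incoming data
loaded = joblib.load("model.joblib")
print(loaded.predict([[4.0]]))
```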