Interviews Flashcards
Attention Mechanism
- The attention mechanism allows the model to “pay attention” to certain parts of the data and to give them more weight when making predictions.
- Essentially, in a given sentence, it compares each word’s vector with every other vector in the sentence and calculates an attention score.
- The attention mechanism solves a big problem that many deep learning models have: the inability to memorise long sequences because of a fixed-length context vector.
- The attention mechanism helps preserve the context of every word in a sentence by assigning an attention weight to it relative to all other words.
- This way, even if the sentence is large, the model can preserve the contextual importance of each word.
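A minimal sketch of the idea, assuming scaled dot-product self-attention (the standard Transformer formulation) and NumPy; the token vectors here are random placeholders:

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the max for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (seq_len, d_k) matrices of query/key/value vectors."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # compare every token with every other token
    weights = softmax(scores, axis=-1)  # attention weights: each row sums to 1
    return weights @ V, weights         # weighted sum of the value vectors

# toy "sentence" of 4 tokens with 8-dimensional embeddings
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out, attn = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V = x
print(attn.round(2))  # row i shows how much token i attends to every other token
```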
LLM - GPT & BERT
1) GPT (Generative Pre-trained Transformer): It’s a generative model trained to predict the next word in a sequence. It’s trained in an unsupervised manner using a massive amount of text and can be fine-tuned later for specific tasks.
2) BERT (Bidirectional Encoder Representations from Transformers): BERT is trained by predicting masked (or hidden) words in a sentence. It looks at the context from both the left and the right (hence, bidirectional). This pre-trained model can then be fine-tuned on a smaller dataset for specific tasks.
While GPT is often used for generative tasks like text generation, BERT shines in tasks that require understanding the context like question answering, sentiment analysis, etc.
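A hedged illustration of the two training objectives, assuming the Hugging Face transformers library and the public bert-base-uncased and gpt2 checkpoints (stand-ins chosen for this sketch, not named by the card):

```python
# pip install transformers torch
from transformers import pipeline

# BERT-style objective: fill in a masked token using context from both sides
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The capital of France is [MASK]."))

# GPT-style objective: generate text by repeatedly predicting the next token
generate = pipeline("text-generation", model="gpt2")
print(generate("The attention mechanism allows a model to", max_new_tokens=20))
```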
RNNs
- Recurrent Neural Networks (RNNs) are a class of artificial neural networks designed for sequential data processing and prediction.
- Unlike traditional feedforward neural networks, RNNs have connections that form directed cycles.
- This means that information can be recycled in the network, which makes these types of networks very effective for tasks where context or chronological order is important, such as time series prediction, natural language processing, and speech recognition.
- Common variants of RNNs include LSTMs and GRUs
- RNNs process data sequentially and have difficulty capturing long-range dependencies in sequences due to the vanishing gradient problem
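A small usage sketch, assuming PyTorch, showing the recurrent interface (a hidden state carried across time steps) and the common LSTM/GRU variants; all sizes are arbitrary:

```python
import torch
import torch.nn as nn

# a single-layer RNN: 10-dim inputs, 20-dim hidden state
rnn = nn.RNN(input_size=10, hidden_size=20, batch_first=True)

x = torch.randn(1, 5, 10)   # batch of 1 sequence, 5 time steps, 10 features per step
output, h_n = rnn(x)        # output: hidden state at every step; h_n: final hidden state

print(output.shape)         # torch.Size([1, 5, 20])
print(h_n.shape)            # torch.Size([1, 1, 20])

# drop-in variants that better preserve long-range information
lstm = nn.LSTM(input_size=10, hidden_size=20, batch_first=True)
gru = nn.GRU(input_size=10, hidden_size=20, batch_first=True)
```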
Transformers
- Transformers are a type of neural network (NN) architecture introduced in 2017
- They have become a staple, foundational architecture in many areas and fields, particularly NLP
1) Architecture
- Its architecture consists of an encoder and a decoder each having multiple layers
- And unlike RNNs or LSTMs, Transformers allow for parallel processing of sequences and can handle long-range dependencies in data.
2) Key Components
1. Multi-Head Attention
- Allows the model to focus on different words for a given input word and can attend to all positions in the input sequence simultaneously.
2. Positional Encoding
- Since Transformers lack recurrence (unlike RNNs), positional encodings are added to the input to give the model information about the position of a word in a sentence.
3) Advantages
- Parallelization: Allows faster computation as each position is processed simultaneously.
- Long-Range Dependencies: Capable of handling long sequences and maintaining long-range dependencies between inputs.
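An illustrative PyTorch sketch (dimensions are arbitrary): sinusoidal positional encodings are added to the embeddings, then one encoder layer (multi-head self-attention plus a feed-forward block) processes all positions in parallel:

```python
import math
import torch
import torch.nn as nn

def sinusoidal_positional_encoding(seq_len, d_model):
    """Classic sin/cos positional encoding from the original Transformer."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)        # (seq_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))                    # (d_model/2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

d_model, seq_len = 64, 10
tokens = torch.randn(1, seq_len, d_model)                            # embedded input sequence
tokens = tokens + sinusoidal_positional_encoding(seq_len, d_model)   # inject position info

# one encoder layer = multi-head self-attention + feed-forward network
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
print(layer(tokens).shape)   # torch.Size([1, 10, 64])
```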
Why not Transformers for Tabular Data
1) Overparameterization: Transformers have many parameters, which might be excessive for simple tabular data, leading to overfitting.
2) Inefficiency: The self-attention mechanism in transformers computes attention scores for every pair of data points, which is often overkill for tabular data where columns (features) have fixed semantics.
3) Lack of Inherent Sequential Nature: Unlike text or time series, tabular data doesn’t always have a sequential nature, so transformers might not leverage their full power.
Dropout Layers
- Dropout is a regularization technique where randomly selected neurons are ignored during training. They are “dropped out” randomly.
- This means that their contribution to the activation of downstream neurons is temporarily removed on the forward pass and any weight updates are not applied to the neuron on the backward pass.
- Purpose: It helps prevent overfitting by ensuring that the network does not rely too heavily on any specific neuron (reducing co-adaptation, so errors tied to individual neurons don’t propagate).
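A tiny PyTorch illustration of the train/eval behaviour (which positions get zeroed is random):

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)   # each neuron is zeroed with probability 0.5 during training
x = torch.ones(1, 8)

drop.train()   # training mode: random neurons dropped, survivors scaled by 1/(1-p)
print(drop(x)) # e.g. tensor([[2., 0., 2., 2., 0., 0., 2., 2.]])

drop.eval()    # evaluation mode: dropout is disabled, input passes through unchanged
print(drop(x)) # tensor([[1., 1., 1., 1., 1., 1., 1., 1.]])
```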
Pooling Layers
- Pooling layers are used to reduce the dimensions of the feature maps. Thus, it reduces the number of parameters to learn and the amount of computation performed in the network.
- The pooling layer summarises the features present in a region of the feature map generated by a convolution layer.
- So, further operations are performed on summarised features instead of precisely positioned features generated by the convolution layer.
- This makes the model more robust to variations in the position of the features in the input image.
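A toy NumPy example of 2×2 max pooling with stride 2, which keeps only the strongest activation in each region and halves each spatial dimension:

```python
import numpy as np

feature_map = np.array([[1, 3, 2, 1],
                        [4, 6, 5, 2],
                        [7, 9, 8, 3],
                        [1, 2, 4, 6]])

# 2x2 max pooling with stride 2: take the max of each non-overlapping 2x2 block
pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)
# [[6 5]
#  [9 8]]
```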
Convolutional Layers
- Convolutional layers in a CNN systematically apply learned filters to input images in order to create feature maps that summarize the presence of those features in the input.
- Convolutional layers prove very effective, and stacking convolutional layers in deep models allows layers close to the input to learn low-level features (e.g. lines) and layers deeper in the model to learn high-order or more abstract features, like shapes or specific objects.
- A limitation of the feature map output of convolutional layers is that they record the precise position of features in the input.
- This means that small movements in the position of the feature in the input image will result in a different feature map.
- This can happen with re-cropping, rotation, shifting, and other minor changes to the input image.
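A small PyTorch sketch with a hand-fixed vertical-edge filter (in a real CNN the filter weights are learned): the feature map responds where the edge is, and shifting the input shifts the response, which is exactly the position sensitivity noted above:

```python
import torch
import torch.nn.functional as F

# a fixed 3x3 vertical-edge filter, shape (out_channels, in_channels, H, W)
kernel = torch.tensor([[[[-1., 0., 1.],
                         [-1., 0., 1.],
                         [-1., 0., 1.]]]])

# 8x8 image with a vertical edge: dark left half, bright right half
img = torch.zeros(1, 1, 8, 8)
img[:, :, :, 4:] = 1.0
print(F.conv2d(img, kernel)[0, 0])   # the feature map lights up at the edge

# shifting the edge by one pixel shifts the response in the feature map too
img_shifted = torch.zeros(1, 1, 8, 8)
img_shifted[:, :, :, 5:] = 1.0
print(F.conv2d(img_shifted, kernel)[0, 0])
```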
Fully Connected Layers
- A fully connected layer is a layer in which each neuron applies a linear transformation to the entire input vector through a weight matrix.
- As a result, every input of the input vector influences every output of the output vector.
- In CNNs, FC layers are often used at the end to perform classification based on high-level features.
- In standard feedforward NNs, they can be used throughout
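A minimal PyTorch sketch of the usual layout: convolution and pooling extract features, and an FC layer acts as the classification head (all sizes are arbitrary):

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """Conv layer extracts features, pooling shrinks them, an FC layer classifies."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),   # 28x28 -> 28x28, 16 feature maps
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 28x28 -> 14x14
        )
        self.classifier = nn.Linear(16 * 14 * 14, num_classes)   # fully connected head

    def forward(self, x):
        x = self.features(x)
        x = x.flatten(start_dim=1)   # flatten the feature maps into one vector per sample
        return self.classifier(x)

model = TinyCNN()
logits = model(torch.randn(4, 1, 28, 28))   # batch of 4 single-channel 28x28 images
print(logits.shape)                         # torch.Size([4, 10])
```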
Activation Functions
- As a rule of thumb, you can begin with using the ReLU activation function and then move over to other activation functions if ReLU doesn’t provide optimum results.
Guidelines:
- ReLU activation function should only be used in the hidden layers.
- Sigmoid/Logistic and Tanh functions should not be used in hidden layers as they make the model more susceptible to problems during training (due to vanishing gradients).
The activation function should be decided based on the prediction problem:
- Regression: Linear Activation Function
- Binary Classification: Sigmoid/Logistic Activation Function
- Multiclass Classification: Softmax
- Multilabel Classification: Sigmoid
Activation Function based on type of NN:
- Convolutional Neural Network (CNN): ReLU activation function.
- Recurrent Neural Network: Tanh and/or Sigmoid activation function.
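A quick PyTorch illustration of these functions and of typical output heads for each prediction problem (layer sizes are placeholders):

```python
import torch
import torch.nn as nn

x = torch.tensor([-2.0, -0.5, 0.0, 1.5])

print(torch.relu(x))             # hidden layers: zeroes out negatives, keeps positives
print(torch.sigmoid(x))          # binary / multilabel outputs: squashes to (0, 1)
print(torch.tanh(x))             # common inside RNN cells: squashes to (-1, 1)
print(torch.softmax(x, dim=0))   # multiclass output: probabilities that sum to 1

# typical output layers per prediction problem
regression_head = nn.Linear(16, 1)                               # linear (no activation)
binary_head     = nn.Sequential(nn.Linear(16, 1), nn.Sigmoid())  # one probability
multiclass_head = nn.Sequential(nn.Linear(16, 5), nn.Softmax(dim=1))  # one class out of 5
multilabel_head = nn.Sequential(nn.Linear(16, 5), nn.Sigmoid())  # one probability per label
```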
Resnet VS Mobilenet
1) ResNet (Residual Networks):
- ResNet introduced a way to train very deep networks by using “skip connections” or “shortcuts” that allow the gradient to be directly backpropagated to earlier layers.
- This architecture alleviates the vanishing gradient problem, which is prevalent in deep networks.
- The fundamental building block of ResNet is the residual block.
- Instead of trying to learn an underlying function, the block learns the residual (or difference) between the input and the desired output.
- Deeper ResNet models can be quite large in terms of parameters and computational cost.
2) MobileNet:
- As the name suggests, MobileNet is designed to be used in mobile applications, where the amount of computational resources is constrained.
- MobileNet uses depthwise separable convolutions, which split a standard convolution into a depthwise convolution followed by a 1×1 convolution called a pointwise convolution (sketched below).
- This reduces the computational cost and model size.
- MobileNet allows for tunable performance and efficiency: by adjusting parameters like the input resolution or the width multiplier, you can create a smaller or larger model tailored specifically to your needs.
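A rough PyTorch sketch of the parameter saving from a depthwise separable convolution compared to a standard convolution (channel counts are arbitrary):

```python
import torch.nn as nn

in_ch, out_ch = 32, 64

# standard convolution: one 3x3 kernel spanning all input channels per output channel
standard = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

# depthwise separable: per-channel 3x3 (depthwise) + 1x1 channel mixing (pointwise)
depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch)
pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard))                      # 18,496 parameters
print(count(depthwise) + count(pointwise))  # 2,432 parameters, roughly 7-8x fewer
```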
===================
Similarities & Differences:
Purpose:
- ResNet was primarily designed to achieve high accuracy by going deeper
- MobileNet was designed with efficiency in mind for mobile and edge devices without compromising too much on accuracy.
Architecture:
- ResNet uses skip connections around every two layers
- MobileNet employs depthwise separable convolutions to reduce computation.
Model Size & Speed:
- MobileNet is generally smaller and quicker, making it more suitable for real-time applications on mobile devices.
- ResNet, especially its deeper variants, is heavier (has more parameters) and more computationally intensive.
Accuracy:
- In general, deeper ResNet architectures might achieve higher accuracy on most tasks
- MobileNet performs decently given its size and is often preferred when computational resources are at a premium.
Why CNN for Image
CNNs are designed to automatically and adaptively learn spatial hierarchies of features from images. They have three primary advantages for image processing:
1) Local Connectivity: Neurons in a layer are connected only to a small region of the layer before it, mimicking the receptive fields of the human visual system.
2) Weight Sharing: A feature detector (like an edge detector) that’s useful in one part of the image is probably useful across the entire image. This reduces the number of parameters.
3) Pooling Layers: These layers reduce spatial dimensions, leading to a hierarchy of features and invariance to small translations.
Vanishing and Exploding Gradients
1) Vanishing Gradient Problem:
- As the gradient is backpropagated through the layers of a deep network (especially in RNNs), it can become extremely small.
- This means the weight updates during training become negligible, making the network effectively stop learning or learn incredibly slowly.
==> How it Occurs:
- When using activation functions like the sigmoid or tanh, their derivatives can be small (close to 0 for values far from 0).
- In a deep network, as gradients are calculated using the chain rule, these small derivatives can be multiplied together multiple times.
- This causes the gradient to shrink exponentially as it’s propagated backward through layers.
==> Consequences:
- The earlier layers of the network (those closer to the input) learn very slowly or almost not at all, which can lead to sub-optimal or poor performance.
2) Exploding Gradient Problem:
- Opposite to the vanishing gradient problem, the gradient can become extremely large as it’s backpropagated, which can result in very large weight updates during training.
==> How it Occurs:
- This issue is often seen in recurrent neural networks (RNNs) where the accumulation of gradients across time steps can grow without bound.
- If the weights in a network are initialized with large values or the derivatives of the activation functions are significantly greater than 1, then the gradients can explode as they are backpropagated.
==> Consequences:
- Leads to oscillation in weight updates: the weights can drastically swing between large positive and negative values.
- Can cause numerical instability, with weights becoming NaN or Infinity, effectively breaking the training process.
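A common remedy for exploding gradients is gradient clipping; a minimal PyTorch sketch (the LSTM, loss, and data are placeholders for illustration):

```python
import torch
import torch.nn as nn

model = nn.LSTM(input_size=10, hidden_size=20, num_layers=2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(30, 1, 10)    # one long sequence: 30 time steps
output, _ = model(x)
loss = output.pow(2).mean()   # dummy loss, just to produce gradients
loss.backward()

# rescale the overall gradient norm before the weight update
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```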
Precision VS Recall VS F1 VS Accuracy
1) Precision (P):
Precision answers the question: Of all the anomalies detected by the model, how many were actual anomalies?
Precision = TP / (TP + FP)
2) Recall (R) or Sensitivity:
Recall answers the question: Of all the actual anomalies in the data, how many were detected by the model?
Recall = TP / (TP + FN)
3) F1-Score:
The F1-Score is the harmonic mean of precision and recall and provides a single metric that balances the two. It’s especially useful when the class distribution is imbalanced.
F1-Score = 2 x (Precision x Recall) / (Precision + Recall)
4) Accuracy:
Accuracy answers the question: Of all the predictions made by the model, how many were correct?
Accuracy = (TP + TN) / Total Predictions
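A worked example with made-up confusion-matrix counts, showing how accuracy can look strong on imbalanced data while recall exposes the missed positives:

```python
# counts from a hypothetical confusion matrix
TP, FP, FN, TN = 40, 10, 20, 930

precision = TP / (TP + FP)                                 # 0.80
recall    = TP / (TP + FN)                                 # ~0.67
f1        = 2 * precision * recall / (precision + recall)  # ~0.73
accuracy  = (TP + TN) / (TP + FP + FN + TN)                # 0.97

print(precision, recall, f1, accuracy)
```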
Backpropagation
- Backpropagation is the algorithm used (together with gradient descent) for minimizing the error in a neural network. It computes how the weights should be adjusted, working in reverse order - from the output layer to the input layer. The process involves:
1) Forward pass:
- In the forward pass, the input is fed through the network to produce a predicted output; when this output is incorrect, we get an output error.
- This error is the difference between the actual and predicted outputs.
- A cost function measures this error.
- The cost function indicates how accurately the model performs and tells us how far-off our predicted output values are from our actual values.
- Because the cost function quantifies the error, we aim to minimize the cost function.
- What we want is to reduce the output error. Since the weights affect the error, we will need to readjust the weights. We have to adjust the weights such that we have a combination of weights that minimizes the cost function.
2) Backpropagation:
- Backpropagation allows us to readjust our weights to reduce output error.
- Essentially, backpropagation calculates the gradient of the cost function with respect to each weight.
- Moving along the negative of this gradient is what adjusts the weights.
- It gives us an idea of how we need to change the weights so that we can reduce the cost function.
By propagating backwards, we know how much “error” each node or layer is responsible for, and we can then adjust (“optimise”) the corresponding weights.
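A minimal NumPy sketch of one forward pass and one backward pass through a tiny two-layer network (weights and data are made up); the resulting gradients dW1 and dW2 are exactly what gradient descent uses next:

```python
import numpy as np

# one training example for a tiny network: input -> hidden (sigmoid) -> output (linear)
x, y = np.array([[1.0, 2.0]]), np.array([[1.0]])
W1 = np.array([[0.1, 0.3], [0.2, 0.4]])
W2 = np.array([[0.5], [0.6]])

sigmoid = lambda z: 1 / (1 + np.exp(-z))

# forward pass: compute the prediction and the cost
h = sigmoid(x @ W1)                    # hidden activations
y_hat = h @ W2                         # predicted output
cost = 0.5 * ((y_hat - y) ** 2).sum()  # squared-error cost

# backward pass: apply the chain rule from the cost back to each weight matrix
d_yhat = y_hat - y                   # dCost/dy_hat
dW2 = h.T @ d_yhat                   # dCost/dW2
d_h = (d_yhat @ W2.T) * h * (1 - h)  # dCost/d(hidden pre-activation)
dW1 = x.T @ d_h                      # dCost/dW1

print(dW1, dW2)   # gradients: how much each weight is "responsible" for the error
```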
Gradient Descent
- The weights are adjusted using a process called gradient descent.
- Gradient descent is an optimization algorithm that is used to find the weights that minimize the cost function.
- Minimizing the cost function means getting to the minimum point of the cost function.
- So, gradient descent aims to find a weight corresponding to the cost function’s minimum point.
- To find this weight, we must navigate down the cost function until we find its minimum point.
- To know in which direction to navigate, gradient descent uses backpropagation.
- More specifically, it uses the gradients calculated through backpropagation.
- These gradients are used for determining the direction to navigate to find the minimum point.
- Specifically, we move along the negative gradient, because a negative gradient indicates a decreasing slope.
- A decreasing slope means that moving downward will lead us to the minimum point.
The step size of each update is determined by the learning rate.
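A bare-bones sketch of gradient descent on a one-dimensional cost function C(w) = (w - 3)^2, where the learning rate sets the step size:

```python
cost_grad = lambda w: 2 * (w - 3)   # derivative dC/dw of C(w) = (w - 3)^2

w = 0.0              # initial weight
learning_rate = 0.1  # step size

for step in range(50):
    w -= learning_rate * cost_grad(w)   # move along the negative gradient

print(round(w, 4))   # converges towards the minimum at w = 3
```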
Linear Regression
- Linear regression is a type of statistical analysis used to model the relationship between a dependent variable and one or more independent variables.
- It assumes a linear relationship between the independent and dependent variables and aims to find the best-fitting line that describes the relationship.
- This line is then determined by minimizing the sum of the squared differences between the predicted values and actual values
- Essentially, in linear regression MSE is often used as the cost function, which amounts to minimising the residual sum of squares (where a residual is the difference between the predicted and actual value).
- The cost function can be minimised analytically (the OLS closed-form solution) or iteratively with gradient descent.
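An illustrative sketch with scikit-learn on synthetic data (the library and the data are assumptions for this sketch, not part of the card):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# toy data: y ≈ 2x + 1 plus a little noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))
y = 2 * X[:, 0] + 1 + rng.normal(0, 0.5, size=100)

model = LinearRegression().fit(X, y)   # fits by minimizing the residual sum of squares
print(model.coef_, model.intercept_)   # slope ≈ 2, intercept ≈ 1

mse = np.mean((model.predict(X) - y) ** 2)   # the MSE cost on the training data
print(mse)
```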
Assumptions of Linear Regression
- Linearity (Linear Relationship)
- Should be Linear Relationship between Independent and Dependent Variable
- Normality
- For any fixed value of X, Y is normally distributed (the dependent variable is normally distributed around the regression line)
- Homoscedasticity (Constant Variance)
- Variance of the residuals should be constant across all levels of the independent variables.
- This means spread of residuals should be similar across entire range of independent variables.
- This is important because normally distributed errors with constant variance allow for valid hypothesis testing, confidence interval estimation, and accurate prediction of the response variable
- Independence of Residuals (No Autocorrelation)
- No correlation between the residuals (differences between observed and predicted values) of different data points
- Allows for valid hypothesis testing to be conducted
- No Multicollinearity (or low collinearity) and No Outliers
- Variables should not be highly correlated —> High Multicollinearity can lead to unstable and unreliable coefficient estimates
- Linear regression assumes that the errors (residuals) are normally distributed and have constant variance. Outliers can violate these assumptions and introduce heteroscedasticity or non-normality into the residuals. Violations of these assumptions can result in biased coefficient estimates, invalid hypothesis tests, and unreliable predictions.
OLS Regression
- Method for estimating the parameters of a Linear Regression model.
- It essentially tries to find the values of the linear regression model’s parameters that minimise the sum of the squared residuals
Assumptions:
1. Errors are normally distributed w/ 0 mean and constant variance
2. No multicollinearity among independent variables
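A small NumPy sketch of the OLS closed-form solution via the normal equations, beta = (X^T X)^(-1) X^T y, on made-up data:

```python
import numpy as np

# design matrix with an intercept column, and the response vector
X = np.array([[1, 1.0], [1, 2.0], [1, 3.0], [1, 4.0]])
y = np.array([3.1, 4.9, 7.2, 9.1])

# normal equations: the beta that minimises the sum of squared residuals
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)          # [intercept, slope]
print(y - X @ beta)  # the residuals whose squared sum OLS minimises
```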
Lasso and Ridge Regression
They are just simple regularized versions of linear regression.
Lasso Regression:
- L1 Regularization
- Adds the “absolute value of magnitude” of the coefficient as a penalty term to the loss function
Ridge Regression:
- L2 Regularization
- Adds the “squared magnitude” of the coefficient as a penalty term to the loss function
L1 Regularization:
- Can shrink some coefficients exactly to 0
- Can be used for dimension reduction and feature selection
L2 Regularization:
- Shrinks all coefficients towards 0 but does not set them exactly to 0
- Useful when we have collinear features
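A scikit-learn sketch on synthetic data with only two informative features, showing L1 driving coefficients to exactly 0 while L2 only shrinks them (the alpha penalties are arbitrary):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, size=100)  # only 2 informative features

lasso = Lasso(alpha=0.1).fit(X, y)  # L1 penalty: many coefficients become exactly 0
ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty: all coefficients shrunk, none exactly 0

print(lasso.coef_.round(2))   # sparse: mostly zeros
print(ridge.coef_.round(2))   # small but nonzero everywhere
```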
L1 & L2 Regularization
L1 Regularization:
- Adds the “absolute value of magnitude” of the coefficient as a penalty term to the loss function
- Can shrink some coefficients exactly to 0
- Can be used for dimension reduction and feature selection
L2 Regularization:
- Adds the “squared magnitude” of the coefficient as a penalty term to the loss function
- Shrinks all coefficients towards 0 but does not set them exactly to 0
- Useful when we have collinear features
Logistic Regression
Supervised machine learning algorithm mainly used for classification tasks, where the goal is to predict the probability that an instance belongs to a given class.
It essentially takes the output of a linear regression function as an input and applies a sigmoid function to estimate the probability of a given class. Hence it outputs a probability value between 0 and 1.
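A small scikit-learn sketch on made-up one-feature data, also recomputing the probability by hand as the sigmoid of the linear combination:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# toy binary problem: class 1 becomes likely as x grows
X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[2.2]]))   # [P(class 0), P(class 1)], each between 0 and 1

# the same class-1 probability by hand: sigmoid of the linear combination
z = clf.intercept_ + clf.coef_ @ np.array([2.2])
print(1 / (1 + np.exp(-z)))
```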
ASSUMPTIONS of Logistic Regression
1. Independent Observations:
- Each observation is independent of the others, meaning there is no correlation between observations (e.g., no repeated measurements of the same subject)
2. Binary Dependent variables:
- Assumes the dependent variable is binary (for more than 2 classes, the softmax function / multinomial logistic regression is used)
3. Linear relationship between the independent variables and the log-odds of the dependent variable
4. No outliers
5. Large Sample Size