Machine_Learning_Interview_Flashcards_Final_Update

Question

How do you evaluate a machine learning model’s performance?

Answer 1

Performance can be evaluated using metrics such as accuracy, precision, recall, F1 score, ROC-AUC, and confusion matrices.

Answer 2

Cross-validation repeatedly splits the data into training and validation folds, trains on one part and scores on the other, and averages the results to estimate how well the model generalizes to unseen data.
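
A minimal sketch of the splitting step, assuming contiguous folds and no shuffling (in practice the data is usually shuffled first). The function name `k_fold_indices` and the sample sizes are illustrative assumptions, not part of the original card.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size


# Each sample lands in exactly one validation fold.
folds = list(k_fold_indices(10, 3))
```

A model would be trained on each `train_idx` and scored on the matching `val_idx`, with the scores averaged at the end.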

Answer 3

Techniques for handling imbalanced data include resampling methods (e.g., SMOTE), adjusting class weights, or using algorithms that handle class imbalance natively.

Answer 4

Confusion matrices provide a summary of predicted vs actual classes, useful for calculating precision, recall, and other metrics.
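
As a minimal sketch of how these metrics fall out of a binary confusion matrix, the snippet below counts the four cells from two label lists and derives precision, recall, and F1. The function names and the sample labels are illustrative assumptions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Derive precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # made-up labels
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]  # made-up predictions
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
```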

Answer 5

Generative models learn the joint probability distribution and can generate new data, while discriminative models focus on the decision boundary.

Answer 6

Reinforcement learning learns through rewards from interacting with an environment, while supervised learning learns from labeled data.

Answer 7

GANs consist of a generator and a discriminator, where the generator creates data, and the discriminator differentiates between real and generated data.

Answer 8

Attention mechanisms help models focus on important parts of the input sequence, improving performance in tasks like NLP and translation.

Answer 9

PCA reduces dimensionality by projecting data onto principal components that capture the most variance, useful in visualization and noise reduction.

Answer 10

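LSTMs use three gates (input, forget, and output) and keep a separate cell state, while GRUs use two gates (update and reset) and merge the cell state into the hidden state. LSTMs are therefore more flexible but slower; GRUs are simpler and computationally cheaper.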

Answer 11

Dimensionality reduction techniques like PCA or selecting top features based on importance can help handle datasets with millions of features.

Answer 12

You can tune hyperparameters using grid search, random search, or more advanced techniques like Bayesian optimization.

Answer 13

Try diagnosing issues like overfitting or underfitting, adjust model complexity, and tweak hyperparameters or gather more data.

Answer 14

Parametric models assume a fixed number of parameters, while non-parametric models can grow in complexity with more data.

Answer 15

Batch normalization normalizes each layer's inputs over the current mini-batch to zero mean and unit variance, then applies a learned scale and shift. This speeds up training and improves model performance by stabilizing gradients.
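
A rough sketch of the training-time computation for a single feature across one mini-batch. The names `gamma`, `beta`, and `eps` follow common convention but are assumptions here, and the running statistics used at inference time are left out.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    # eps guards against division by zero when the batch variance is tiny.
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


normalized = batch_norm([2.0, 4.0, 6.0, 8.0])
```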

Answer 16

K-means clustering partitions data into k clusters by minimizing the sum of squared distances between each data point and the centroid of its assigned cluster.
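
The usual way to approach this objective is Lloyd's algorithm, which alternates an assignment step and a centroid update. Below is a minimal one-dimensional sketch; the function name, the fixed initial centroids, and the sample points are assumptions chosen for illustration.

```python
def kmeans_1d(points, centroids, n_iters=10):
    """Lloyd's algorithm on 1-D points with given initial centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(n_iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for x in points:
            nearest = min(range(len(centroids)), key=lambda j: (x - centroids[j]) ** 2)
            clusters[nearest].append(x)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            sum(c) / len(c) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments are stable, so the algorithm has converged
        centroids = new_centroids
    return centroids, clusters


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 5.0])
```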

Answer 17

Classification predicts discrete labels, while regression predicts continuous values.

Answer 18

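A kernel function in SVMs computes inner products as if the data had been mapped into a higher-dimensional space, without performing the mapping explicitly (the kernel trick). This makes it easier to separate classes using a hyperplane.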

Answer 19

Stochastic gradient descent updates parameters after each training example, while batch gradient descent updates after processing the entire dataset, making SGD faster but noisier.

Answer 20

An autoencoder is a type of neural network that learns to encode data and then reconstruct it, commonly used for tasks like dimensionality reduction and denoising.

Answer 21

Naive Bayes classifier assumes features are independent and uses Bayes' theorem to calculate the probability of different classes, making it simple and fast for classification.

Answer 22

Precision measures the accuracy of positive predictions, while sensitivity (recall) measures how well the model identifies actual positives.

Answer 23

Ensemble methods combine multiple models to improve prediction performance, such as Random Forests and Gradient Boosting.

Answer 24

Activation functions introduce non-linearity into neural networks, enabling them to learn complex patterns.

Answer 25

The exploding gradient problem occurs when gradients grow uncontrollably large during backpropagation, often addressed with gradient clipping or normalization techniques.
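
A minimal sketch of gradient clipping by global norm, assuming the gradients are held in a flat list. The function name and the sample values are illustrative.

```python
import math


def clip_by_norm(grads, max_norm):
    """Rescale grads so their L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)  # already small enough, leave unchanged
    scale = max_norm / norm
    return [g * scale for g in grads]


# The direction is preserved; only the length is reduced.
clipped = clip_by_norm([30.0, 40.0], max_norm=5.0)
```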

Answer 26

Early stopping halts training when the model's performance on validation data stops improving, preventing overfitting.
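
One common way to implement this is with a patience counter. The sketch below assumes a precomputed list of per-epoch validation losses; the function name, the `patience` parameter, and the loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch at which training would stop."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1  # never triggered: ran to the last epoch


# Loss improves through epoch 2, then fails to improve twice in a row.
stop = early_stopping_epoch([0.9, 0.7, 0.6, 0.65, 0.66, 0.5], patience=2)
```

In practice the weights from the best epoch are usually restored after stopping.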

Answer 27

Over-sampling increases the number of minority class samples, while under-sampling reduces the number of majority class samples to handle imbalanced data.

Answer 28

Bayesian inference updates the probability of a hypothesis as more evidence becomes available, used for probabilistic modeling.
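
A small worked example of one Bayesian update, using a hypothetical diagnostic test. The prior, sensitivity, and false positive rate below are made-up numbers chosen only to show the mechanics.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(hypothesis | positive evidence) via Bayes' theorem."""
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    return likelihood * prior / evidence


# The posterior after the first positive result becomes the prior for the second.
after_one = posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
after_two = posterior(prior=after_one, likelihood=0.95, false_positive_rate=0.05)
```

With these numbers a single positive result still leaves the hypothesis unlikely, and a second positive result raises the probability substantially.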

Answer 29

Logistic regression models the probability of a binary outcome using a logistic function, and is often used for classification tasks.

Answer 30

A/B testing is an experiment where two groups are compared, one with the current setup (A) and one with a variation (B), to evaluate performance.

Answer 31

Word embeddings are vector representations of words that capture their meanings based on context. They are learned by models such as Word2Vec and are widely used in NLP.

Answer 32

Gradient boosting builds an ensemble of weak learners sequentially, where each new learner focuses on correcting errors from previous ones.

Answer 33

Hyperparameters are set before training and control the model's learning process, while parameters are learned from the data during training.

Answer 34

The learning rate controls how large a step the optimizer takes when updating model parameters. Too high leads to overshooting; too low leads to slow convergence.
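
To make the trade-off concrete, here is a minimal sketch that runs plain gradient descent on f(x) = x**2 with three learning rates. The objective, the starting point, and the specific rates are assumptions chosen for illustration.

```python
def gradient_descent(lr, x0=1.0, steps=20):
    """Minimize f(x) = x**2 (gradient 2x) starting from x0."""
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x
    return x


too_small = gradient_descent(lr=0.001)  # barely moves away from x0
reasonable = gradient_descent(lr=0.1)   # approaches the minimum at 0
too_large = gradient_descent(lr=1.1)    # overshoots further on every step
```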

Answer 35

Data augmentation generates new training data by applying transformations like rotations, flips, or noise to improve model generalization.

Answer 36

The training set is used to train the model, and the test set evaluates its performance. Typically, data is split 70-80% for training and 20-30% for testing.

Answer 37

For small datasets, simpler models like Naive Bayes may work better, while larger datasets may benefit from more complex models like neural networks.

Answer 38

Machine learning focuses on algorithms that learn patterns from data, while deep learning is a subset that uses neural networks with multiple layers.

Answer 39

Supervised learning is used in business for applications like email spam filtering, fraud detection, and churn prediction.

Answer 40

Unsupervised learning is used for tasks like customer segmentation, anomaly detection, and market basket analysis in business.

Answer 41

Semi-supervised learning uses a small amount of labeled data and a large amount of unlabeled data, improving model performance without extensive labeling.

Answer 42

K-means is a clustering algorithm that groups data into clusters, while KNN is a classification algorithm that assigns labels based on the nearest neighbors.

Answer 43

Support vectors are the data points in SVMs that are closest to the decision boundary and help define the hyperplane.

Answer 44

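Data leakage occurs when information that would not be available at prediction time, such as test-set statistics or features derived from the target, is used to build the model, leading to overestimated performance. It can be prevented by splitting the data before any preprocessing, fitting transformations on the training data only, and checking features carefully.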

Answer 45

Batch gradient descent processes the entire dataset before updating parameters, while stochastic gradient descent updates parameters after each sample, offering faster but noisier convergence.

Answer 46

Common optimization algorithms include gradient descent, Adam, and RMSprop. Adam adjusts learning rates based on first and second moments of the gradients, while RMSprop scales each parameter's update by a running average of its squared gradients to improve convergence.
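
A sketch of a single Adam update for one scalar parameter, using the commonly cited default hyperparameters (treated here as assumptions). It shows the first and second moment estimates and their bias correction; the function name is illustrative.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return (new_param, new_m, new_v) after one Adam update at step t >= 1."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment (mean of squares)
    m_hat = m / (1 - beta1 ** t)                # bias correction for zero init
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# On the first step the update size is close to lr regardless of gradient scale.
p_small, _, _ = adam_step(1.0, grad=0.01, m=0.0, v=0.0, t=1)
p_large, _, _ = adam_step(1.0, grad=100.0, m=0.0, v=0.0, t=1)
```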

Answer 47

An epoch is one complete pass through the entire dataset. A batch is a subset of the dataset used to update the model, and an iteration refers to one update step based on a batch.
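
The relationship between the three terms is simple arithmetic. The sketch below assumes the last, smaller batch is kept rather than dropped; the dataset size, batch size, and epoch count are made-up values.

```python
import math


def iterations_per_epoch(n_samples, batch_size):
    """Number of parameter updates needed to see every sample once."""
    return math.ceil(n_samples / batch_size)


n_samples, batch_size, epochs = 1000, 32, 5
per_epoch = iterations_per_epoch(n_samples, batch_size)  # 31 full batches + 1 partial
total_updates = per_epoch * epochs
```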

Answer 48

Cross-entropy loss measures the difference between predicted probabilities and true labels, commonly used in classification tasks.
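
For a single example with a one-hot label, cross-entropy reduces to the negative log of the probability assigned to the true class. A minimal sketch, with made-up probability vectors:

```python
import math


def cross_entropy(probs, true_index):
    """Cross-entropy for one example whose true class is true_index."""
    return -math.log(probs[true_index])


# A confident correct prediction is cheap; a confident wrong one is expensive.
confident_right = cross_entropy([0.9, 0.05, 0.05], true_index=0)
confident_wrong = cross_entropy([0.05, 0.9, 0.05], true_index=0)
```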

Answer 49

Dropout randomly deactivates neurons during training to prevent overfitting, helping neural networks generalize better.

Answer 50

A validation set is used for tuning hyperparameters and evaluating model performance during training, while a test set evaluates the final model's performance.

Answer 51

L1 regularization (Lasso) encourages sparsity by shrinking some weights to zero, while L2 regularization (Ridge) penalizes large weights without making them zero.
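
One way to see the difference is to compare the shrinkage step each penalty induces on a weight vector. The sketch below uses soft-thresholding for L1 and the closed-form proximal step for a penalty of (strength / 2) * w**2 for L2; the function names and sample weights are assumptions.

```python
def l1_shrink(weights, strength):
    """Soft-thresholding: weights within +/- strength become exactly zero."""
    shrunk = []
    for w in weights:
        if w > strength:
            shrunk.append(w - strength)
        elif w < -strength:
            shrunk.append(w + strength)
        else:
            shrunk.append(0.0)
    return shrunk


def l2_shrink(weights, strength):
    """Proximal step for (strength / 2) * w**2: scale every weight down."""
    return [w / (1.0 + strength) for w in weights]


weights = [3.0, -0.2, 0.05]
after_l1 = l1_shrink(weights, strength=0.5)  # small weights are zeroed out
after_l2 = l2_shrink(weights, strength=0.5)  # all weights shrink, none hit zero
```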

Answer 52

A Gaussian Mixture Model (GMM) is a probabilistic model that represents a mixture of multiple Gaussian distributions, useful in clustering.

Answer 53

The Transformer architecture uses self-attention mechanisms and parallel processing to model sequential data more efficiently than RNNs.

Answer 54

BERT is a Transformer-based model pre-trained on large corpora using bidirectional encoding. It is used for NLP tasks like sentiment analysis, question answering, and text classification.

Answer 55

Attention mechanisms allow models to focus on specific parts of input data, improving performance in tasks like machine translation and text summarization.

Answer 56

Pooling layers reduce the spatial dimensions of input data in CNNs, helping to extract relevant features and reduce computational load.

Answer 57

Data drift occurs when the statistical properties of input data change over time, causing the model to perform poorly. Monitoring and retraining the model can mitigate its effects.

Answer 58

Collaborative filtering uses the preferences of many users to recommend items, commonly used in recommendation systems like Netflix and Amazon.

Answer 59

Monte Carlo methods use repeated random sampling to solve numerical problems, often used in probabilistic modeling and simulations.
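
A classic illustration is estimating pi by sampling points uniformly in the unit square and counting how many fall inside the quarter circle. The function name, sample count, and seed below are illustrative assumptions.

```python
import random


def estimate_pi(n_samples, seed=0):
    """Monte Carlo estimate of pi from uniform points in the unit square."""
    rng = random.Random(seed)  # seeded for reproducibility
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # The quarter circle covers pi / 4 of the square.
    return 4.0 * inside / n_samples


pi_estimate = estimate_pi(100_000)
```

The error shrinks roughly with the square root of the number of samples.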

Answer 60

The softmax function converts raw scores into probabilities that sum to 1, commonly used in the output layer of classification models.
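
A minimal sketch of softmax, assuming a plain list of scores. Subtracting the maximum score before exponentiating is a standard trick to avoid overflow and does not change the result.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(scores)  # for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
```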

Answer 61

Kullback-Leibler divergence measures how one probability distribution diverges from a reference distribution, often used to compare model predictions with true distributions.
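
For discrete distributions the divergence is a weighted sum of log ratios. The sketch below assumes both distributions are lists over the same outcomes and that `q` is nonzero wherever `p` is; the sample distributions are made up.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions p and q."""
    # Terms with p_i == 0 contribute nothing and are skipped.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.5, 0.5]
q = [0.9, 0.1]
forward = kl_divergence(p, q)
reverse = kl_divergence(q, p)
```

The two directions give different values, which is why KL divergence is not a true distance metric.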

Answer 62

ReLU mitigates the vanishing gradient problem by letting gradients pass through undiminished for positive inputs, unlike activation functions such as sigmoid and tanh, which squash values into a narrow range and cause gradients to shrink.
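
Taking this card to describe ReLU (an assumption, since the question is missing), the effect can be sketched by multiplying activation derivatives across layers. The example ignores weight matrices and uses a hypothetical depth of 10, so it only illustrates the activation's contribution.

```python
import math


def sigmoid_grad(x):
    """Derivative of the sigmoid; its maximum is 0.25 at x = 0."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


depth = 10
# Best case for sigmoid: every layer sits at x = 0, where the derivative peaks.
sigmoid_chain = sigmoid_grad(0.0) ** depth
# ReLU with positive pre-activations passes the gradient through unchanged.
relu_chain = relu_grad(1.0) ** depth
```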

Answer 63

Grid search, random search, Bayesian optimization, and cross-validation.

Answer 64

Skip connections add the output of one layer to the output of a deeper layer. This mitigates the vanishing gradient problem and allows deeper networks to be trained.
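
A toy sketch of a residual block on a plain list of values. The single scalar `weight` and the ReLU layer are simplifying assumptions; real blocks use weight matrices or convolutions.

```python
def layer(x, weight):
    """A toy layer: scale each value, then apply ReLU."""
    return [max(0.0, weight * v) for v in x]


def residual_block(x, weight):
    """Add the block's input to the layer output (the skip connection)."""
    fx = layer(x, weight)
    return [a + b for a, b in zip(x, fx)]


x = [1.0, -2.0]
out = residual_block(x, weight=0.5)
# With a zero weight the layer contributes nothing and the block is the identity.
identity_out = residual_block(x, weight=0.0)
```

Because the input is added back unchanged, the gradient always has a direct path around the layer.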

(88 cards)