Machine_Learning_Interview_Flashcards_Final_Update

1
Q

What is overfitting, and how can you prevent it?

A

Overfitting happens when a model performs well on training data but poorly on new data. It can be prevented using techniques like cross-validation, regularization, and simplifying the model.

2
Q

Explain the bias-variance tradeoff.

A

Bias: Error due to overly simplistic models, which leads to underfitting

Variance: Error due to overly complex models, which leads to overfitting

The goal is to find a balance where the model is complex enough to capture the data patterns (low bias) but not so complex that it overfits (low variance).

3
Q

What is precision?

A

Precision is the ratio of true positives to the total predicted positives, measuring how accurate the positive predictions are.

4
Q

What is recall?

A

Recall is the ratio of true positives to the total actual positives, measuring how well the model identifies positive cases.

5
Q

What is F1 score?

A

F1 score is the harmonic mean of precision and recall, providing a balanced measure when both are important.
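
Precision, recall, and F1 can all be computed from the same confusion counts. The sketch below is one minimal way to do it; the function name `precision_recall_f1` and the example counts are illustrative assumptions, not part of the original cards.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion counts."""
    # Precision: of everything predicted positive, how much was right.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything actually positive, how much was found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 8 true positives, 2 false positives, 8 false negatives.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(p, r, f1)
```

Because the harmonic mean is pulled toward the smaller value, F1 here sits closer to the recall than a simple average would.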

6
Q

What is AUC-ROC?

A

AUC-ROC is a metric for evaluating binary classifiers. The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at different classification thresholds, and AUC is the area under that curve: 1.0 means every positive is ranked above every negative, while 0.5 is no better than random guessing.
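
The area under the ROC curve equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one. The sketch below computes it that way by comparing every positive/negative pair; the function name `roc_auc` is an illustrative assumption, and the pairwise loop is chosen for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and model scores.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```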

7
Q

What is regularization?

A

Regularization is a technique that discourages overfitting by adding a penalty to large model weights (L1 or L2 regularization).

8
Q

How does a decision tree work?

A

Decision trees work by recursively splitting the dataset based on feature values that maximize the separation between different classes or outcomes. At each node, the algorithm chooses the best feature and threshold to split the data, creating branches until a stopping condition is met, such as reaching a maximum depth or a pure node.
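
A single split decision can be sketched as follows, assuming one numeric feature and Gini impurity as the separation criterion (other criteria such as entropy are also common). The names `gini` and `best_split` are hypothetical helpers for illustration.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best split 'value <= threshold'."""
    best = (None, float("inf"))
    xs = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        # Weight each child's impurity by its share of the samples.
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best


# Hypothetical data: the classes separate cleanly between 3 and 10.
print(best_split([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1]))
```

A full tree would apply `best_split` recursively to each child until a stopping condition such as maximum depth or a pure node is reached.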

9
Q

Explain the working of random forests and how they reduce overfitting.

A

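Random forests build many decision trees, each trained on a bootstrap sample of the data and considering a random subset of features at each split, then average their predictions (or take a majority vote for classification). Because the trees are decorrelated, combining them reduces variance and therefore overfitting.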

10
Q

What is gradient descent, and how does it work?

A

Gradient descent is an optimization algorithm that minimizes a loss function by iteratively updating the model parameters in the direction of the negative gradient.
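
As a minimal sketch, the update rule can be written in a few lines. The loss f(x) = (x - 3)^2, the starting point, and the step size are assumptions chosen only to make the example concrete.

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Repeatedly step against the gradient: x <- x - lr * grad(x)."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x


# Assumed example loss f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(x_min)  # approaches the minimizer x = 3
```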

11
Q

What are support vector machines (SVMs)?

A

SVMs are supervised learning models that classify data by finding the hyperplane that best separates classes with maximum margin.

12
Q

Explain the difference between bagging and boosting.

A

Bagging reduces variance by training multiple models independently on different bootstrap samples of the data and averaging their outputs (or voting), while boosting reduces bias by training models sequentially so that each one corrects the errors of the previous ones (e.g., by giving higher weights to misclassified data points).

13
Q

What is a neural network, and how does backpropagation work?

A

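A neural network is a set of connected layers that transform input data. Backpropagation applies the chain rule to compute the gradient of the loss with respect to every weight, working from the output layer backward; an optimizer such as gradient descent then uses those gradients to update the weights and reduce the error.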

14
Q

What are CNNs and RNNs, and when are they used?

A

A Convolutional Neural Network (CNN) is a type of neural network designed for processing grid-like data, such as images, using convolutional layers to capture spatial features.

A Recurrent Neural Network (RNN) is a neural network designed for sequential data, where outputs from previous steps are fed back into the model to capture temporal dependencies.

15
Q

What is transfer learning, and why is it useful?

A

Transfer learning leverages a pre-trained model and fine-tunes it on a new task, saving time and resources.

16
Q

What are vanishing and exploding gradients? How do you address them?

A

Vanishing gradients occur when gradients shrink toward zero as they are propagated back through many layers, and exploding gradients occur when they grow too large. Vanishing gradients can be addressed with techniques like ReLU activations, and exploding gradients with gradient clipping.

17
Q

What is dropout in neural networks?

A

Dropout randomly deactivates neurons during training to prevent overfitting by making the network more robust.

18
Q

How do you handle missing data?

A

Missing data can be handled by removing rows, imputing values, or using algorithms that handle missing values natively, such as some decision tree implementations.

19
Q

What is feature scaling, and why is it important?

A

Feature scaling ensures all features contribute equally to the model by normalizing or standardizing data. This is crucial for algorithms like SVMs or neural networks.

20
Q

What techniques do you use for feature selection?

A

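Techniques include removing highly correlated features and using feature importance from models like Random Forests. Dimensionality reduction methods like PCA are a related option, although they create new combined features rather than selecting existing ones.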

21
Q

What is the difference between L1 and L2 regularization?

A

L1 regularization (Lasso) encourages sparsity by shrinking some coefficients exactly to zero, while L2 regularization (Ridge) shrinks all coefficients smoothly toward zero without eliminating them.
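
The difference is easiest to see for a single coefficient. The sketch below assumes the simplified one-dimensional objectives stated in the comments (equivalent to an orthonormal design), where both penalties have closed-form solutions; `lasso_shrink` and `ridge_shrink` are hypothetical names for illustration.

```python
def lasso_shrink(w, lam):
    """Minimizer of 0.5 * (v - w)**2 + lam * abs(v): soft-thresholding.

    Coefficients with |w| <= lam are set exactly to zero.
    """
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0


def ridge_shrink(w, lam):
    """Minimizer of 0.5 * (v - w)**2 + 0.5 * lam * v**2: proportional shrinkage.

    The coefficient gets smaller but stays non-zero unless w itself is zero.
    """
    return w / (1.0 + lam)


# Assumed unregularized coefficients and penalty strength.
for w in [2.0, 0.3, -0.1]:
    print(w, lasso_shrink(w, 0.5), ridge_shrink(w, 0.5))
```

With these assumed values, the two small coefficients are zeroed by the L1 rule but only scaled down by the L2 rule, which is why L1 is often used for feature selection.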

22
Q

What is L1 regularization?

A

L1 regularization (Lasso) encourages sparsity by shrinking some feature weights to zero, useful for feature selection.

23
Q

What is L2 regularization?

A

L2 regularization (Ridge) penalizes large weights, leading to smaller but non-zero weights, which helps reduce overfitting.

24
Q

What are common methods for dimensionality reduction?

A

Common methods include PCA, t-SNE, and Autoencoders, which reduce the number of features while preserving important information.

25
Q

How do you evaluate a machine learning model’s performance?

A

Performance can be evaluated using metrics such as accuracy, precision, recall, F1 score, ROC-AUC, and confusion matrices.

26
Q

What is cross-validation, and why is it important?

A

Cross-validation splits the data into training and validation sets multiple times (for example, k folds) and averages the results, giving a more reliable estimate of how well the model generalizes to unseen data.
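
One common variant is k-fold cross-validation, where each sample is used for validation exactly once. The sketch below only generates the index splits, assuming contiguous, unshuffled folds; `k_fold_indices` is a hypothetical helper name.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, val_indices) for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size


for train_idx, val_idx in k_fold_indices(10, 3):
    print(len(train_idx), val_idx)
```

In practice the data is usually shuffled (or stratified by class) before the folds are formed.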

27
Q

How do you handle imbalanced datasets?

A

Techniques include resampling methods (e.g., SMOTE), adjusting class weights, or using algorithms that handle imbalanced data natively.

28
Q

What are confusion matrices, and how do you use them?

A

Confusion matrices provide a summary of predicted vs actual classes, useful for calculating precision, recall, and other metrics.
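
For the binary case, the four cells of a confusion matrix can be tallied as in the sketch below. The function name `confusion_counts` and the example labels are assumptions for illustration.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    counts = Counter()
    for t, p in zip(y_true, y_pred):
        if p == positive:
            counts["tp" if t == positive else "fp"] += 1
        else:
            counts["fn" if t == positive else "tn"] += 1
    return counts


# Hypothetical actual and predicted labels.
counts = confusion_counts([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
print(dict(counts))
```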

29
Q

When would you use a generative model vs. a discriminative model?

A

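Generative models learn the joint probability distribution P(x, y) and can generate new data, while discriminative models learn the conditional P(y | x) or the decision boundary directly. Use a generative model when you need to sample data or model how it was produced; use a discriminative model when only the prediction matters.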

30
Q

What is reinforcement learning, and how does it differ from supervised learning?

A

Reinforcement learning learns through rewards from interacting with an environment, while supervised learning learns from labeled data.

31
Q

What are GANs, and how do they work?

A

GANs consist of a generator and a discriminator trained against each other: the generator creates data, and the discriminator tries to distinguish real data from generated data, which pushes the generator to produce increasingly realistic samples.

32
Q

What is attention in deep learning models?

A

Attention mechanisms help models focus on important parts of the input sequence, improving performance in tasks like NLP and translation.

33
Q

Explain PCA (Principal Component Analysis) and its applications.

A

PCA reduces dimensionality by projecting data onto principal components that capture the most variance, useful in visualization and noise reduction.

34
Q

What is the difference between an LSTM and a GRU?

A

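LSTMs use three gates (input, forget, and output) and a separate cell state, while GRUs use two gates (update and reset) and merge the cell state into the hidden state. LSTMs are therefore more flexible but slower; GRUs are simpler and computationally more efficient.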

35
Q

How would you deal with a dataset containing millions of features?

A

Dimensionality reduction techniques like PCA or selecting top features based on importance can help handle datasets with millions of features.

36
Q

How would you tune hyperparameters for a machine learning model?

A

You can tune hyperparameters using grid search, random search, or more advanced techniques like Bayesian optimization.

37
Q

What steps would you take if your model isn’t performing well?

A

Try diagnosing issues like overfitting or underfitting, adjust model complexity, and tweak hyperparameters or gather more data.

38
Q

Explain the difference between parametric and non-parametric models.

A

Parametric models assume a fixed number of parameters, while non-parametric models can grow in complexity with more data.

39
Q

What is batch normalization, and why is it used?

A

Batch normalization normalizes the inputs to each layer using the mean and variance of the current mini-batch, speeding up training and improving model performance by stabilizing gradients.

40
Q

How does k-means clustering work?

A

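K-means clustering partitions data into k clusters by alternating two steps, assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points, which reduces the within-cluster sum of squared distances.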
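
A minimal sketch of the iterative procedure (often called Lloyd's algorithm) is shown below for one-dimensional data. The function name `kmeans_1d`, the fixed iteration count, and the hand-picked initial centroids are simplifying assumptions; real implementations typically use smarter initialization and a convergence check.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Alternate between assigning points and updating centroids."""
    centroids = list(centroids)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: (p - centroids[i]) ** 2)
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids


# Hypothetical data with two obvious groups.
print(kmeans_1d([1, 2, 3, 10, 11, 12], centroids=[0.0, 5.0]))
```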

41
Q

What is the difference between classification and regression?

A

Classification predicts discrete labels, while regression predicts continuous values.

42
Q

Explain what a kernel function is in SVMs.

A

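A kernel function in SVMs computes inner products as if the data had been mapped into a higher-dimensional space, without performing the mapping explicitly (the kernel trick). Classes that are not linearly separable in the original space can then be separated by a hyperplane.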

43
Q

What is the difference between stochastic gradient descent and batch gradient descent?

A

Stochastic gradient descent updates parameters after each training example, while batch gradient descent updates after processing the entire dataset, making SGD faster but noisier.

44
Q

What is an autoencoder, and where is it used?

A

An autoencoder is a type of neural network that learns to encode data and then reconstruct it, commonly used for tasks like dimensionality reduction and denoising.

45
Q

Explain how the Naive Bayes classifier works.

A

The Naive Bayes classifier assumes features are conditionally independent given the class and uses Bayes’ theorem to calculate the probability of each class, making it simple and fast for classification.

46
Q

What is the difference between precision and sensitivity?

A

Precision measures the accuracy of positive predictions, while sensitivity (recall) measures how well the model identifies actual positives.

47
Q

What are ensemble methods in machine learning?

A

Ensemble methods combine multiple models to improve prediction performance, such as Random Forests and Gradient Boosting.

48
Q

What is the purpose of activation functions in neural networks?

A

Activation functions introduce non-linearity into neural networks, enabling them to learn complex patterns.

49
Q

What is the exploding gradient problem?

A

The exploding gradient problem occurs when gradients grow uncontrollably large during backpropagation, often addressed with gradient clipping or normalization techniques.

50
Q

What is early stopping in machine learning?

A

Early stopping halts training when the model’s performance on validation data stops improving, preventing overfitting.
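
A common way to implement this is with a "patience" counter, as in the sketch below. The function name `early_stopping_epoch`, the patience value, and the list of validation losses are all assumptions for illustration; a real training loop would compute the loss each epoch and usually restore the best checkpoint.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the epoch index at which training would stop.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return len(val_losses) - 1  # never triggered: ran all epochs


# Hypothetical validation losses per epoch.
print(early_stopping_epoch([1.0, 0.8, 0.7, 0.75, 0.72, 0.9], patience=2))
```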

51
Q

Explain over-sampling and under-sampling techniques.

A

Over-sampling increases the number of minority class samples, while under-sampling reduces the number of majority class samples to handle imbalanced data.
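
Random over-sampling, the simplest variant, duplicates minority samples until the classes are balanced. The sketch below assumes in-memory lists and a fixed seed; `random_oversample` is a hypothetical helper, and methods such as SMOTE instead synthesize new points rather than copying existing ones.

```python
import random


def random_oversample(samples, labels, seed=0):
    """Duplicate random minority samples until every class matches the largest."""
    rng = random.Random(seed)
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    target = max(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for y, group in by_class.items():
        # Draw extra copies (with replacement) only for under-represented classes.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for s in group + extra:
            out_samples.append(s)
            out_labels.append(y)
    return out_samples, out_labels


# Hypothetical imbalanced data: four samples of class 0, one of class 1.
xs, ys = random_oversample(["a", "b", "c", "d", "e"], [0, 0, 0, 0, 1])
print(ys.count(0), ys.count(1))
```

Resampling should be applied to the training split only, so that duplicated samples do not end up in the evaluation data.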

52
Q

What is Bayesian inference in machine learning?

A

Bayesian inference updates the probability of a hypothesis as more evidence becomes available, used for probabilistic modeling.
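
A single update step is just Bayes' theorem. The sketch below works through a hypothetical diagnostic-test example; the prior, sensitivity, and false positive rate are made-up numbers chosen only to show the arithmetic.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(hypothesis | evidence) via Bayes' theorem for a binary hypothesis.

    prior:               P(hypothesis)
    likelihood:          P(evidence | hypothesis)
    false_positive_rate: P(evidence | not hypothesis)
    """
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# Assumed numbers: 1% prior, 90% sensitivity, 5% false positive rate.
p = posterior(prior=0.01, likelihood=0.9, false_positive_rate=0.05)
print(round(p, 3))
```

Even with a fairly accurate test, the posterior stays modest because the prior is low; feeding the posterior back in as the next prior is how the belief is updated as more evidence arrives.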

53
Q

How does logistic regression work?

A

Logistic regression models the probability of a binary outcome using a logistic function, and is often used for classification tasks.
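
The prediction step can be sketched as a weighted sum passed through the logistic (sigmoid) function. The weights and inputs below are arbitrary assumptions; in practice the weights are learned by minimizing cross-entropy loss.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, features):
    """Probability of the positive class for one example."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical model and input; classify as positive if probability >= 0.5.
prob = predict_proba(weights=[2.0, -1.0], bias=0.5, features=[1.0, 0.5])
print(prob, prob >= 0.5)
```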

54
Q

Explain the concept of A/B testing.

A

A/B testing is an experiment where two groups are compared, one with the current setup (A) and one with a variation (B), to evaluate performance.

55
Q

What are word embeddings?

A

Word embeddings are dense vector representations of words that capture their meanings based on context, learned by methods like Word2Vec and commonly used in NLP models.

56
Q

How does gradient boosting work?

A

Gradient boosting builds an ensemble of weak learners sequentially, where each new learner is fit to the negative gradient of the loss (for squared error, the residuals) of the current ensemble, correcting errors from previous ones.

57
Q

What is a hyperparameter, and how is it different from a parameter?

A

Hyperparameters are set before training and control the model’s learning process, while parameters are learned from the data during training.

58
Q

What is the purpose of the learning rate in optimization algorithms?

A

The learning rate controls how large a step the optimizer takes when updating model parameters. Too high leads to overshooting; too low leads to slow convergence.
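
The effect is easy to reproduce on a toy problem. The sketch below assumes the loss f(x) = x^2 (gradient 2x) and three arbitrary learning rates; `distance_after` is a hypothetical helper that reports how far from the minimum the optimizer ends up.

```python
def distance_after(lr, steps=50, x0=1.0):
    """Run gradient descent on f(x) = x**2 and return |x| after `steps` updates."""
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x  # gradient of x**2 is 2x
    return abs(x)


print(distance_after(0.1))    # reasonable step: ends very close to 0
print(distance_after(0.001))  # too small: barely moved from the start
print(distance_after(1.1))    # too large: each step overshoots and diverges
```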

59
Q

What is data augmentation in deep learning?

A

Data augmentation generates new training data by applying transformations like rotations, flips, or noise to improve model generalization.

60
Q

What are the training set and test set in machine learning? How do you split the data?

A

The training set is used to train the model, and the test set evaluates its performance. Typically, data is split 70-80% for training and 20-30% for testing.
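
A random split can be sketched with the standard library alone. The function name `train_test_split` here is a hypothetical local helper (not a library API), and the 20% test fraction and fixed seed are assumptions.

```python
import random


def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of the data and split it into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


train, test = train_test_split(range(10), test_fraction=0.2)
print(len(train), len(test))
```

For time-ordered data, a chronological split is usually more appropriate than a random shuffle.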

61
Q

How can you choose a classifier based on training set size?

A

For small datasets, simpler models like Naive Bayes may work better, while larger datasets may benefit from more complex models like neural networks.

62
Q

What are the differences between machine learning and deep learning?

A

Machine learning focuses on algorithms that learn patterns from data, while deep learning is a subset that uses neural networks with multiple layers.

63
Q

Name a few applications of supervised learning in modern business.

A

Supervised learning is used in business for applications like email spam filtering, fraud detection, and customer churn prediction.

64
Q

Name a few applications of unsupervised learning in modern business.

A

Unsupervised learning is used for tasks like customer segmentation, anomaly detection, and market basket analysis in business.

65
Q

What is semi-supervised machine learning?

A

Semi-supervised learning uses a small amount of labeled data and a large amount of unlabeled data, improving model performance without extensive labeling.

66
Q

What is the difference between the k-means and KNN algorithms?

A

K-means is an unsupervised clustering algorithm that groups data into k clusters, while KNN (k-nearest neighbors) is a supervised algorithm that predicts a label (or a value, in regression) from the nearest labeled neighbors.

67
Q

What are support vectors in SVMs?

A

Support vectors are the data points in SVMs that are closest to the decision boundary and help define the hyperplane.

68
Q

What is data leakage, and how can it be prevented?

A

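Data leakage occurs when information that would not be available at prediction time (for example, from the test set or from the target itself) is used to create the model, leading to overestimation of performance. It can be prevented by splitting the data before any preprocessing, fitting transformations on the training set only, and checking features for information derived from the target.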

69
Q

What is the difference between batch gradient descent and stochastic gradient descent?

A

Batch gradient descent processes the entire dataset before updating parameters, while stochastic gradient descent updates parameters after each sample, offering faster but noisier convergence.

70
Q

List common optimization algorithms and their differences.

A

Common optimization algorithms include gradient descent, Adam, and RMSprop. RMSprop scales each parameter's step by a running average of its squared gradients, while Adam combines that scaling with a running average of the gradients themselves (first and second moment estimates).

71
Q

What is the difference between epoch, batch, and iteration in machine learning?

A

An epoch is one complete pass through the entire dataset. A batch is a subset of the dataset used to update the model, and an iteration refers to one update step based on a batch.
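
The relationship between the three terms is simple arithmetic, sketched below. The dataset size, batch size, and epoch count are assumed values, and the calculation assumes the final partial batch is kept rather than dropped.

```python
import math


def iterations_per_epoch(n_samples, batch_size):
    """Number of update steps needed to see every sample once."""
    return math.ceil(n_samples / batch_size)


# Assumed setup: 1,000 samples, batches of 32, 5 epochs.
per_epoch = iterations_per_epoch(1000, 32)
total = per_epoch * 5
print(per_epoch, total)
```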

72
Q

What is cross-entropy loss, and where is it used?

A

Cross-entropy loss measures the difference between predicted probabilities and true labels, commonly used in classification tasks.
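
For a single example with a one-hot true label, cross-entropy reduces to the negative log of the probability assigned to the correct class. The sketch below assumes that setting; the `eps` floor is an added safeguard against log(0), and the example probabilities are made up.

```python
import math


def cross_entropy(true_class, probs, eps=1e-12):
    """Negative log-probability the model assigns to the true class."""
    return -math.log(max(probs[true_class], eps))


# Hypothetical predicted distribution over two classes.
print(cross_entropy(1, [0.1, 0.9]))  # confident and correct: small loss
print(cross_entropy(0, [0.1, 0.9]))  # confident and wrong: large loss
```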

73
Q

Explain the concept of dropout in neural networks.

A

Dropout randomly deactivates neurons during training to prevent overfitting, helping neural networks generalize better.

74
Q

What is the difference between validation set and test set?

A

A validation set is used for tuning hyperparameters and evaluating model performance during training, while a test set evaluates the final model’s performance.

75
Q

Explain L1 (Lasso) and L2 (Ridge) regularization and their differences.

A

L1 regularization (Lasso) encourages sparsity by shrinking some weights to zero, while L2 regularization (Ridge) penalizes large weights without making them zero.

76
Q

What is a Gaussian Mixture Model (GMM)?

A

A Gaussian Mixture Model (GMM) is a probabilistic model that represents a mixture of multiple Gaussian distributions, useful in clustering.

77
Q

What is the Transformer architecture in neural networks?

A

The Transformer architecture uses self-attention mechanisms and parallel processing to model sequential data more efficiently than RNNs.

78
Q

What is BERT, and what are its applications?

A

BERT is a Transformer-based model pre-trained on large corpora using bidirectional encoding. It is used for NLP tasks like sentiment analysis, question answering, and text classification.

79
Q

What are attention mechanisms, and why are they important in deep learning?

A

Attention mechanisms allow models to focus on specific parts of input data, improving performance in tasks like machine translation and text summarization.

80
Q

What is the purpose of pooling layers in CNNs?

A

Pooling layers reduce the spatial dimensions of input data in CNNs, helping to extract relevant features and reduce computational load.

81
Q

What is data drift, and how can it impact model performance?

A

Data drift occurs when the statistical properties of input data change over time, causing the model to perform poorly. Monitoring and retraining the model can mitigate its effects.

82
Q

Explain collaborative filtering in recommendation systems.

A

Collaborative filtering uses the preferences of many users to recommend items, commonly used in recommendation systems like Netflix and Amazon.

83
Q

What are Monte Carlo methods?

A

Monte Carlo methods use repeated random sampling to solve numerical problems, often used in probabilistic modeling and simulations.
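
A classic illustration is estimating pi by sampling random points in the unit square and counting how many fall inside the quarter circle. The sample count and seed below are arbitrary assumptions.

```python
import random


def estimate_pi(n_samples, seed=0):
    """Estimate pi from the fraction of random points inside the unit quarter circle."""
    rng = random.Random(seed)
    inside = sum(
        1 for _ in range(n_samples) if rng.random() ** 2 + rng.random() ** 2 <= 1.0
    )
    # Quarter-circle area / square area = pi / 4.
    return 4 * inside / n_samples


print(estimate_pi(100_000))
```

The error shrinks roughly with the square root of the number of samples, so each extra digit of accuracy costs about 100 times more samples.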

84
Q

What is the softmax function, and when is it used?

A

The softmax function converts raw scores into probabilities that sum to 1, commonly used in the output layer of classification models.
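
A minimal sketch of softmax is shown below. Subtracting the maximum score before exponentiating is a standard numerical-stability trick that does not change the result; the example scores are arbitrary.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)  # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 3.0])
print(probs, sum(probs))
```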

85
Q

Explain Kullback-Leibler divergence.

A

Kullback-Leibler divergence measures how one probability distribution diverges from a reference distribution, often used to compare model predictions with true distributions.
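
For discrete distributions, KL(P || Q) is the sum of p * log(p / q) over all outcomes. The sketch below assumes both inputs are valid probability lists of equal length and that q is non-zero wherever p is; the example distributions are made up.

```python
import math


def kl_divergence(p, q):
    """KL(P || Q) for two discrete distributions, in nats."""
    # Terms with p_i == 0 contribute nothing by convention.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, q), kl_divergence(q, p))
```

The two printed values differ, which illustrates that KL divergence is not symmetric and is therefore not a true distance metric.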

86
Q

Why do ReLU activation functions help with vanishing gradients?

A

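ReLU has a gradient of exactly 1 for positive inputs, so gradients pass through without diminishing. Saturating activation functions such as sigmoid and tanh squash values into a narrow range where their derivatives are small, which causes gradients to shrink as they are multiplied across layers.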

87
Q

What are the 3-4 main techniques for hyperparameter tuning?

A

Grid search, random search, and Bayesian optimization, each typically combined with cross-validation to score the candidate settings.

88
Q

What is a residual network (ResNet)?

A

A residual network (ResNet) uses skip connections, where the output of one layer is added to the output of a deeper layer. This addresses the vanishing gradient problem and allows for much deeper networks.
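
The core idea can be sketched without any framework: a residual block returns its input plus whatever the inner layers compute. In the sketch below, `residual_block` and the toy `layer` functions are hypothetical stand-ins for real network layers.

```python
def residual_block(x, layer):
    """Return x + layer(x), element by element (the skip connection)."""
    return [xi + fi for xi, fi in zip(x, layer(x))]


# Toy "layers" acting on plain lists of floats.
def zero_layer(x):
    return [0.0 for _ in x]


def halve_layer(x):
    return [0.5 * xi for xi in x]


print(residual_block([1.0, 2.0], zero_layer))   # block acts as the identity
print(residual_block([1.0, 2.0], halve_layer))  # input plus the learned change
```

Because the block passes its input through unchanged when the inner layer outputs zeros, the gradient always has a direct path back through the addition, which is what makes very deep stacks trainable.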