DS Entretien 3 Flashcards

1
Q

“You are given an unsorted list of integers. The list is too large to fit into memory all at once, but you need to sort it as efficiently as possible. What sorting algorithm would you choose, and why?”

A

Quick Sort

Its average-case time complexity is O(n log n), making it efficient for large datasets. It operates in-place, requiring only O(log n) additional space for the recursion stack, which is crucial when dealing with memory constraints.

You can mention that for even better performance in certain cases, you might use a randomized pivot or switch to a different algorithm (like merge sort) if the worst-case performance is a concern.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

What algorithm is recommended for merging two sorted lists of integers and why?

A

Merge Sort is recommended because:
* Guarantees a time complexity of O(n log n) in the worst case
* More predictable than Quick Sort
* Useful for large datasets that don’t fit into memory
* Works efficiently with external storage
* Higher space complexity of O(n) due to additional memory requirements
* Less space-efficient than in-place algorithms like Quick Sort

Merge Sort is particularly effective in scenarios where data is too large to handle in memory, making it suitable for external sorting tasks.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

what is regularization in one sentence

A

Regularization is a technique used to prevent overfitting in machine learning models by adding a penalty to the loss function, discouraging overly complex models.

Lasso (L1): adds |w| to the loss, encouraging some weights to become exactly zero

Ridge (L2): adds w^2 encouraging smaller weights without eliminating them

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

Machine Learning Model Deployment Steps

A
  1. Understand problem definition and expected output
  2. Data preparation (handle missing values, duplicates, formatting)
  3. Feature engineering (create new features, transform raw data into useful formats)
  4. Select model
  5. Train model
  6. Evaluate model (accuracy, precision, recall, etc)
  7. Save model (pickle or tensorflow)
  8. Develop API for inference (Flask, FastAPI)
  9. Containerize model (using Docker for portability)
  10. Deploy model (cloud / on-premises/ edge)
  11. Monitor performance (check for model drift)
  12. Scale and load balancing (using Kubernetes, cloud scaling)
  13. Model retraining (automate with CI/CD pipelines)
  14. Security & Governance (authentification, encryption, compliance)
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

What is problem definition in machine learning, and why is it important?

A

Problem definition is understanding the business or research question and what the model is expected to predict or classify. It helps in aligning the model’s objective with real-world goals, guiding data collection and model selection.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

What is data preparation in machine learning?

A

Data preparation involves cleaning the data, handling missing values, removing duplicates, and formatting the data correctly. This step ensures the model can learn effectively and the results are reliable.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

What is feature engineering, and why is it important?

A

Feature engineering is the process of creating new features from raw data or transforming existing features into more meaningful formats. It helps improve model performance by ensuring the model can capture the right patterns in the data.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

What is model selection in machine learning?

A

Model selection is the process of choosing an appropriate algorithm for the task at hand, whether it’s regression, classification, or clustering. It depends on the problem, data, and desired output.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

What is model training in machine learning?

A

Model training involves feeding the preprocessed data into the chosen algorithm so it can learn the underlying patterns. This step adjusts model parameters to minimize error and improve predictions.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

How do you evaluate a machine learning model?

A

Model evaluation involves measuring its performance using metrics such as accuracy, precision, recall, F1-score, or RMSE. These metrics indicate how well the model generalizes to new data.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

What does saving a model in machine learning mean?

A

Saving a model involves serializing it to a file so it can be used later for inference. Common formats are Pickle for Python or TensorFlow’s SavedModel for deep learning.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

What is an API, and what are some tools to develop it?

A

An API (Application Programming Interface) allows external systems to interact with a model. Tools like Flask and FastAPI are used to build APIs that serve machine learning models for real-time predictions.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

What does containerizing a machine learning model mean?

A

Containerizing a model involves packaging it and its dependencies into a portable container using tools like Docker. This ensures the model can run consistently across different environments.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

What does it mean to deploy a model, and where can you deploy it?

A

Deploying a model means making it accessible for use in production, either on cloud platforms (AWS, GCP, Azure), on-premises servers, or on edge devices like mobile phones or IoT devices.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

What is model monitoring in deployment?

A

Model monitoring tracks the model’s performance over time, ensuring it continues to perform well. It involves checking for issues like model drift, which can happen if data patterns change.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

What is scaling and load balancing in machine learning deployment?

A

Scaling ensures that the deployed model can handle varying amounts of traffic by increasing or decreasing resources. Load balancing distributes the traffic evenly across multiple instances to ensure smooth performance.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

What is model retraining in machine learning?

A

Model retraining involves periodically updating the model with new data to maintain its accuracy. This can be automated using CI/CD pipelines, ensuring the model adapts to changing data.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
18
Q

What are security and governance in machine learning deployment?

A

Security ensures that only authorized users can access the model, using methods like authentication and encryption. Governance involves ensuring compliance with data protection regulations and managing the model lifecycle.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
19
Q

What is precision in machine learning evaluation?

A

Precision is the ratio of true positive predictions to the total predicted positives. It measures how many of the positive predictions were actually correct. Formula: Precision = TP / (TP + FP)

TP = True Positives, FP = False Positives

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
20
Q

What is recall in machine learning evaluation?

A

Recall is the ratio of true positive predictions to the total actual positives. It measures how many of the actual positives were correctly identified. Formula: Recall = TP / (TP + FN)

FN = False Negatives

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
21
Q

What is a confusion matrix?

A

A confusion matrix is a table that shows the performance of a classification model. It compares the predicted labels with the true labels, showing the counts of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN).

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
22
Q

What is accuracy in machine learning evaluation?

A

Accuracy is the ratio of correct predictions (both true positives and true negatives) to the total predictions. It is the most common metric but can be misleading in imbalanced datasets. Formula: Accuracy = (TP + TN) / (TP + TN + FP + FN)

TN = True Negatives

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
23
Q

What is the F1 score in machine learning evaluation?

A

The F1 score is the harmonic mean of precision and recall. It balances the two metrics and is especially useful when the class distribution is imbalanced. Formula: F1 = 2 * (Precision * Recall) / (Precision + Recall).

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
24
Q

What is specificity in machine learning evaluation?

A

Specificity (or True Negative Rate) measures the proportion of actual negatives that were correctly identified. Formula: Specificity = TN / (TN + FP).

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
25
Q

What is area under the ROC curve (AUC-ROC)?

A

AUC-ROC is a metric used to evaluate the performance of a binary classification model. The ROC curve plots the true positive rate (recall) against the false positive rate. AUC represents the probability that the model ranks a randomly chosen positive instance higher than a randomly chosen negative instance.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
26
Q

What is the ROC curve in machine learning?

A

The ROC (Receiver Operating Characteristic) curve is a graphical plot that illustrates the diagnostic ability of a binary classifier as its discrimination threshold is varied. The curve plots the true positive rate (recall) against the false positive rate.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
27
Q

What is logarithmic loss (log loss) in machine learning evaluation?

A

Log loss measures the performance of a classification model where the prediction is a probability value between 0 and 1. It calculates the negative log-likelihood of the true labels given the predicted probabilities. Lower log loss indicates better model performance.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
28
Q

What is mean squared error (MSE) in machine learning evaluation?

A

Mean Squared Error (MSE) measures the average squared difference between predicted and actual values. It’s commonly used for regression models. A lower MSE indicates better model accuracy. Formula: MSE = (1/n) * Σ(actual - predicted)²

n = number of observations

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
29
Q

What is mean absolute error (MAE) in machine learning evaluation?

A

Mean Absolute Error (MAE) measures the average absolute difference between predicted and actual values. It’s another common metric for regression models. Unlike MSE, it doesn’t penalize larger errors as heavily. Formula: MAE = (1/n) * Σ|actual - predicted|.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
30
Q

What is R-squared (R²) in machine learning evaluation?

A

R-squared (also known as the coefficient of determination) measures how well the model explains the variability in the target variable. It ranges from 0 to 1, where 1 indicates a perfect fit. Formula: R² = 1 - (Σ(actual - predicted)² / Σ(actual - mean)²)

mean = average of actual values

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
31
Q

What is an index in SQL, and how does it improve query performance?

A

An index is a data structure that improves the speed of data retrieval operations on a database table at the cost of additional storage and maintenance overhead.

ex:
CREATE INDEX idx_user_email ON Users(email);
- this creates an index on the email column of the Users table.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
32
Q

Write an index to make this query faster

SELECT *
FROM Users
WHERE email = ‘doucheturd@email.com’;

A

CREATE INDEX idx_user_email ON Users(email);

this creates an index on the email column of the Users table.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
33
Q

Inference meaning in ML

A

The process of using a trained model to make predictions on new, unseen data (without labels)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
34
Q

Supervised ML algorithms

A

Linear Regression
Logistic Regression
k-NN (K nearest neighbors)
Decision Trees
Random Forests
SVM Linear Kernel
SVM RBF Kernel

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
35
Q

Unsupervised learning algorithms in ML

A

K-means
Hierarchical Clustering
PCA
t-SNE

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
36
Q

Deep Learning Algorithms

A

Fully Connected NN (Dense Layers)

CNN (Convolutional Neural Network)

RNN / LSTM

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
37
Q

What does Linear Regression do?

A

Fits a line to minimize the squared error between predictions and actual values.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
38
Q

What is the training complexity of Linear Regression?

A

O(n d²), where n = number of samples, d = number of features (due to matrix inversion).

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
39
Q

What is the inference complexity of Linear Regression?

A

O(d), since predictions only require multiplying weights by input features.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
40
Q

What does Logistic Regression do?

A

A classification model that applies a sigmoid function to predict probabilities.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
41
Q

What is the training complexity of Logistic Regression?

A

O(n d), as gradient descent updates weights iteratively.

where n = number of samples, d = number of features

42
Q

What is the inference complexity of Logistic Regression?

A

O(d), since predictions only require a dot product and activation function.

d = number of features

43
Q

What does k-NN (k-Nearest Neighbors) do?

A

Classifies a new point by voting among its k nearest neighbors.

44
Q

What is the training complexity of k-NN?

A

O(n d), as there is no real training phase, only storing data.

where n = number of samples, d = number of features

45
Q

What is the inference complexity of k-NN?

A

O(n d), since each prediction requires computing distances to all points.

where n = number of samples, d = number of features

46
Q

What does a Decision Tree do?

A

Splits data recursively based on feature conditions to create a tree structure.

47
Q

What is the training complexity of a Decision Tree?

A

O(n d log n), since it recursively partitions data.

where n = number of samples, d = number of features

48
Q

What is the inference complexity of a Decision Tree?

A

O(log n), as prediction follows a single path from root to leaf.

where n = number of samples

49
Q

What does Random Forest do?

A

An ensemble of decision trees that votes on predictions for robustness.

50
Q

What is the training complexity of Random Forest?

A

O(t n d log n), where t = number of trees and n = number of samples

51
Q

What is the inference complexity of Random Forest?

A

O(t log n), since predictions require passing through t trees and n samples.

52
Q

What does SVM with a Linear Kernel do?

A

Finds the optimal linear boundary that separates classes.

53
Q

What is the training complexity of SVM with a Linear Kernel?

A

O(n d), since it optimizes a convex function.

where n = number of samples, d = number of features

54
Q

What is the inference complexity of SVM with a Linear Kernel?

A

O(d), as classification is based on a simple dot product.

where d = number of features

55
Q

What does SVM with an RBF Kernel do?

A

Maps data to a higher-dimensional space for non-linear separation.

56
Q

What is the training complexity of SVM with an RBF Kernel?

A

O(n² d), as it requires computing pairwise kernel similarities.

where n = number of samples, d = number of features

57
Q

What is the inference complexity of SVM with an RBF Kernel?

A

O(n d), since predictions require summing over all support vectors.

where n = number of samples, d = number of features

58
Q

What does k-Means do?

A

Groups data into k clusters by minimizing intra-cluster distance.

59
Q

What is the training complexity of k-Means?

A

O(n k d t), where t = number of iterations until convergence. k clusters and d is number of features

60
Q

What is the inference complexity of k-Means?

A

O(k d), as new points are assigned to the nearest cluster center. k clusters and d features.

61
Q

What does Hierarchical Clustering do?

A

Builds a hierarchy of clusters via merging or splitting.

62
Q

What is the training complexity of Hierarchical Clustering?

A

O(n² log n), due to computing all pairwise distances. where n = number of samples

63
Q

What is the inference complexity of Hierarchical Clustering?

A

O(n²), as merging clusters requires distance updates. n is number of samples.

64
Q

What does PCA (Principal Component Analysis) do?

A

Reduces dimensionality by projecting data onto principal components.

65
Q

What is the training complexity of PCA?

A

O(n d² + d³), as it involves eigen decomposition.

where n = number of samples, d = number of features

66
Q

What is the inference complexity of PCA?

A

O(d²), as transformation only requires matrix multiplication.

where d = number of features

67
Q

What does a Fully Connected Neural Network (Dense Layers) do?

A

A deep learning model where each neuron is connected to all neurons in the next layer.

68
Q

What is the training complexity of a Fully Connected Neural Network?

A

O(n d² e), where n = samples, d = features, e = epochs (due to matrix multiplications in backpropagation).

69
Q

What is the inference complexity of a Fully Connected Neural Network?

A

O(d²), since predictions require forward propagation through layers. d is number of features.

70
Q

What does a Convolutional Neural Network (CNN) do?

A

A deep learning model designed for image processing using convolutional filters.

71
Q

What is the training complexity of a CNN?

A

O(n d² c e), where c = number of filters, d² = image size, e = epochs. n is number of samples

72
Q

What is the inference complexity of a CNN?

A

O(d² c), since convolution operations dominate the forward pass. d features and c filters.

73
Q

What does an RNN (Recurrent Neural Network) do?

A

A neural network designed for sequential data, where previous states influence the current state.

74
Q

What is the training complexity of an RNN?

A

O(n d² e), due to sequential backpropagation through time (BPTT). e = epochs, n = samples, d = features.

75
Q

What is the inference complexity of an RNN?

A

O(d²), since each time step requires a matrix multiplication. d= features

76
Q

What does an LSTM (Long Short-Term Memory) network do?

A

An advanced RNN that prevents vanishing gradients using memory cells and gates.

77
Q

What is the training complexity of an LSTM?

A

O(n d² e), since it extends RNN training with additional gate calculations. n = samples, e = epochs, d = features

78
Q

What is the inference complexity of an LSTM?

A

O(d²), as each time step processes multiple gating functions. d= features

79
Q

What does t-SNE (t-Distributed Stochastic Neighbor Embedding) do?

A

A dimensionality reduction technique that visualizes high-dimensional data by preserving local structure.

80
Q

What is the training complexity of t-SNE?

A

O(n²), since it requires pairwise distance calculations for all points. n = # of samples

81
Q

What is the inference complexity of t-SNE?

A

O(n²), making it impractical for large datasets. n = # of samples

82
Q

What does the meltTable function do, and what is its time and space complexity?

A

The function meltTable transforms a wide-format DataFrame into a long-format DataFrame by unpivoting columns into row values.

ex: df.melt(
id_vars=’column_i_want_to_be_index’,
value_vars=df.columns,
var_name=’name_of_column_that_stores_og_column_names’,
value_name=’store_values_of var_name_column_here’)

Time: O(NM), N rows & M columns being melted

Space: O(NM), output frame has NM rows so space usage increases accordingly

83
Q

Try replacing all the NaN’s in the sf_permits data with the one that comes directly after it and then replacing any remaining NaN’s with 0.

Set the result to a new DataFrame sf_permits_with_na_imputed.

A

sf_permits_with_na_imputed = sf_permits.fillna(method=’bfill’,axis=0).fillna(0)

84
Q

If you removed all of the rows of sf_permits with missing values, how many rows are left?

A

sf_permits.dropna()

85
Q

Create a new DataFrame called sf_permits_with_na_dropped that has all of the columns with empty values removed.

How many columns were removed from the original sf_permits DataFrame? Use this number to set the value of the dropped_columns variable below.

A

sf_permits_with_na_dropped = sf_permits.dropna(axis=1)

dropped_columns = sf_permits.shape[1] - sf_permits_with_na_dropped.shape[1]

86
Q

get the total number of missing values in a dataframe

A

df.isnull().sum().sum()

87
Q

get the total number of missing values (data points) in a dataframe per column

A

df.isnull().sum()

88
Q

when is multiplication possible with matrices in Linear Algebra

A

columns of first matrix must match the rows of the second matrix

ex: first matrix is (1x2) and second matrix is (2x2)

89
Q

A = [12]
B = [2 4
24 6]

Multiply A*B

A

C11 = (1x2) + (2x24) = 2 + 48 = 50

C12 = (1x4) + (2x6) = 4 + 12 = 16

C = (50 16)

90
Q

dimensions of C where

C = A * B, and A is size (mxn) and B is (nxp)

A

C is (mxp) size, it inherits row size of A and column size of B.

91
Q

Softmax Function

A

Softmax converts a vector of raw scores (logits) into probabilities by exponentiating and normalizing them. It is used in multi-class classification problems to ensure the output sums to 1, making it interpretable as probabilities.

Use it when assigning probabilities to multiple classes in models like neural networks.

92
Q

ReLU(x) =

A

ReLU(x) = max(0,x)

93
Q

Sigmoid activation function:

σ(x) =

A

σ(x) = 1 / (1 +( e^-x))

Input x is mapped to a value between 0 and 1.

As x -> ∞, σ(x) -> 1

94
Q

Tanh (Hyperbolic Tangent) activation function:

tanh(x) =

A

tanh(x) = (e^x - e^-x) / (e^x + e^-x)

Function maps any input x to a value between -1 and 1. A x tends to ∞, tanh(x) tends to 1.

95
Q

Sigmoid vs Softmax

A

Sigmoid: One output per class (label/target type), independent probabilities (good for binary or multi-label problems).

Softmax: One output per class, probabilities that sum to 1 (good for multi-class problems with exclusive classes).

96
Q

What makes a function convex

A

A function is convex if, for any two points on the function, the line segment connecting them lies above or on the function (one global minimum).

Should be shaped liked a bowl with single minimum

97
Q

Why should cost function be convex

A

If cost function is not convex, it can lead to optimization problems in machine learning algorithms as local minima can trap the algorithm, preventing it from finding the global minimum.

This prevents model from learning optimal parameters and therefore poor performance on the test data.

98
Q

What is gradient descent

A

Gradient descent is an iterative optimization algorithm used to find the minimum of a function.

It works by moving in the opposite direction of the gradient (as gradient points in direction of increase) and in optimization we are looking for the minimum.

x_new = x_old - (LR * gradient). Keep iterating until x_new is the global minimum. x represents the weight or parameter being optimized.

99
Q

Difference between gradient descent and backpropagation

A

Gradient Descent is an optimization algorithm that updates model parameters (weights) to minimize a loss function, while Backpropagation is a technique that computes the gradients of the loss function with respect to each weight using the chain rule. Backpropagation provides the gradients, and Gradient Descent uses them to update the weights.

Essentially, backpropagation technique requires doing backpropagation (chain rule to get gradient of loss function for each weight/parameter) AND THEN doing gradient descent to find the optimal parameter/weight by minimizing cost (loss) function.

100
Q

convert a column of values that should be dates but are objects into dates (ie 1/17/07)

A

df[‘date_parsed’] = pd.to_datetime(df[‘date’], format=”%m/%d/%y”)