DS Interview 3 Flashcards
“You are given an unsorted list of integers. The list is too large to fit into memory all at once, but you need to sort it as efficiently as possible. What sorting algorithm would you choose, and why?”
Quick Sort
Its average-case time complexity is O(n log n), making it efficient for large datasets, and it sorts in place, requiring only O(log n) additional space for the recursion stack, which is crucial under memory constraints.
You can mention that a randomized pivot helps avoid the O(n²) worst case, and that if the data truly cannot fit in memory at once, the standard fallback is an external merge sort: sort chunks that fit in RAM, then merge the sorted runs from disk.
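As a minimal in-memory sketch (not the external-sorting variant), a randomized-pivot quick sort in Python could look like this; the function name and test list are illustrative:
import random

def quick_sort(arr):
    # Lists of length 0 or 1 are already sorted
    if len(arr) <= 1:
        return arr
    pivot = random.choice(arr)  # random pivot makes the O(n^2) worst case unlikely
    smaller = [x for x in arr if x < pivot]
    equal = [x for x in arr if x == pivot]
    larger = [x for x in arr if x > pivot]
    return quick_sort(smaller) + equal + quick_sort(larger)

print(quick_sort([5, 2, 9, 1, 5, 6]))  # [1, 2, 5, 5, 6, 9]
Note this version builds new lists for clarity; the in-place partitioning variant is what gives the O(log n) space figure above.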
What algorithm is recommended for merging two sorted lists of integers and why?
Merge Sort is recommended because:
* Guarantees a time complexity of O(n log n) in the worst case
* More predictable than Quick Sort
* Useful for large datasets that don’t fit into memory
* Works efficiently with external storage
Trade-offs:
* Higher space complexity of O(n) due to the auxiliary arrays used during merging
* Less space-efficient than in-place algorithms like Quick Sort
Merge Sort is particularly effective in scenarios where data is too large to handle in memory, making it suitable for external sorting tasks.
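A minimal sketch of the merge step, which answers the question above and is also the core of external sorting, assuming two already-sorted Python lists:
import heapq

def merge_sorted(a, b):
    # heapq.merge lazily merges already-sorted inputs in O(len(a) + len(b)) time
    return list(heapq.merge(a, b))

print(merge_sorted([1, 4, 7], [2, 3, 9]))  # [1, 2, 3, 4, 7, 9]
In external sorting, the same idea merges sorted runs streamed from disk instead of in-memory lists.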
what is regularization in one sentence
Regularization is a technique used to prevent overfitting in machine learning models by adding a penalty to the loss function, discouraging overly complex models.
Lasso (L1): adds the sum of absolute weights (λ Σ|w|) to the loss, encouraging some weights to become exactly zero
Ridge (L2): adds the sum of squared weights (λ Σ w²), encouraging smaller weights without eliminating them
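A small sketch using scikit-learn's Lasso and Ridge (alpha is the penalty strength; the synthetic dataset is illustrative):
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=20, noise=10, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty: tends to drive some weights to exactly zero
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks weights but keeps them nonzero

print((lasso.coef_ == 0).sum(), "Lasso coefficients driven to exactly zero")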
Machine Learning Model Deployment Steps
- Understand problem definition and expected output
- Data preparation (handle missing values, duplicates, formatting)
- Feature engineering (create new features, transform raw data into useful formats)
- Select model
- Train model
- Evaluate model (accuracy, precision, recall, etc)
- Save model (pickle or tensorflow)
- Develop API for inference (Flask, FastAPI)
- Containerize model (using Docker for portability)
- Deploy model (cloud / on-premises/ edge)
- Monitor performance (check for model drift)
- Scale and load balancing (using Kubernetes, cloud scaling)
- Model retraining (automate with CI/CD pipelines)
- Security & Governance (authentication, encryption, compliance)
What is problem definition in machine learning, and why is it important?
Problem definition is understanding the business or research question and what the model is expected to predict or classify. It helps in aligning the model’s objective with real-world goals, guiding data collection and model selection.
What is data preparation in machine learning?
Data preparation involves cleaning the data, handling missing values, removing duplicates, and formatting the data correctly. This step ensures the model can learn effectively and the results are reliable.
What is feature engineering, and why is it important?
Feature engineering is the process of creating new features from raw data or transforming existing features into more meaningful formats. It helps improve model performance by ensuring the model can capture the right patterns in the data.
What is model selection in machine learning?
Model selection is the process of choosing an appropriate algorithm for the task at hand, whether it’s regression, classification, or clustering. It depends on the problem, data, and desired output.
What is model training in machine learning?
Model training involves feeding the preprocessed data into the chosen algorithm so it can learn the underlying patterns. This step adjusts model parameters to minimize error and improve predictions.
How do you evaluate a machine learning model?
Model evaluation involves measuring its performance using metrics such as accuracy, precision, recall, F1-score, or RMSE. These metrics indicate how well the model generalizes to new data.
What does saving a model in machine learning mean?
Saving a model involves serializing it to a file so it can be used later for inference. Common formats are Pickle for Python or TensorFlow’s SavedModel for deep learning.
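A minimal Pickle sketch (the stand-in model and file name are illustrative):
import pickle
from sklearn.linear_model import LogisticRegression

model = LogisticRegression().fit([[0.0], [1.0]], [0, 1])  # tiny stand-in for a trained model

with open("model.pkl", "wb") as f:   # serialize the trained model to disk
    pickle.dump(model, f)

with open("model.pkl", "rb") as f:   # load it back later for inference
    loaded_model = pickle.load(f)

print(loaded_model.predict([[0.2]]))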
What is an API, and what are some tools to develop it?
An API (Application Programming Interface) allows external systems to interact with a model. Tools like Flask and FastAPI are used to build APIs that serve machine learning models for real-time predictions.
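A minimal FastAPI sketch (the endpoint path, input schema, and the assumption that a model was pickled to model.pkl are all illustrative):
import pickle
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

with open("model.pkl", "rb") as f:   # assumes a model was saved earlier, e.g. with Pickle
    model = pickle.load(f)

class Features(BaseModel):
    values: list[float]

@app.post("/predict")
def predict(features: Features):
    prediction = model.predict([features.values])   # model expects a 2D array of samples
    return {"prediction": prediction.tolist()}
Run it with an ASGI server such as uvicorn (e.g. uvicorn main:app) and POST JSON like {"values": [0.2]} to /predict.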
What does containerizing a machine learning model mean?
Containerizing a model involves packaging it and its dependencies into a portable container using tools like Docker. This ensures the model can run consistently across different environments.
What does it mean to deploy a model, and where can you deploy it?
Deploying a model means making it accessible for use in production, either on cloud platforms (AWS, GCP, Azure), on-premises servers, or on edge devices like mobile phones or IoT devices.
What is model monitoring in deployment?
Model monitoring tracks the model’s performance over time, ensuring it continues to perform well. It involves checking for issues like model drift, which can happen if data patterns change.
What is scaling and load balancing in machine learning deployment?
Scaling ensures that the deployed model can handle varying amounts of traffic by increasing or decreasing resources. Load balancing distributes the traffic evenly across multiple instances to ensure smooth performance.
What is model retraining in machine learning?
Model retraining involves periodically updating the model with new data to maintain its accuracy. This can be automated using CI/CD pipelines, ensuring the model adapts to changing data.
What are security and governance in machine learning deployment?
Security ensures that only authorized users can access the model, using methods like authentication and encryption. Governance involves ensuring compliance with data protection regulations and managing the model lifecycle.
What is precision in machine learning evaluation?
Precision is the ratio of true positive predictions to the total predicted positives. It measures how many of the positive predictions were actually correct. Formula: Precision = TP / (TP + FP)
TP = True Positives, FP = False Positives
What is recall in machine learning evaluation?
Recall is the ratio of true positive predictions to the total actual positives. It measures how many of the actual positives were correctly identified. Formula: Recall = TP / (TP + FN)
FN = False Negatives
What is a confusion matrix?
A confusion matrix is a table that shows the performance of a classification model. It compares the predicted labels with the true labels, showing the counts of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN).
What is accuracy in machine learning evaluation?
Accuracy is the ratio of correct predictions (both true positives and true negatives) to the total predictions. It is the most common metric but can be misleading in imbalanced datasets. Formula: Accuracy = (TP + TN) / (TP + TN + FP + FN)
TN = True Negatives
What is the F1 score in machine learning evaluation?
The F1 score is the harmonic mean of precision and recall. It balances the two metrics and is especially useful when the class distribution is imbalanced. Formula: F1 = 2 * (Precision * Recall) / (Precision + Recall).
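A small worked example with hypothetical counts:
TP, FP, FN = 40, 10, 20                 # hypothetical confusion-matrix counts
precision = TP / (TP + FP)              # 40 / 50 = 0.8
recall = TP / (TP + FN)                 # 40 / 60 ≈ 0.667
f1 = 2 * precision * recall / (precision + recall)
print(round(precision, 3), round(recall, 3), round(f1, 3))   # 0.8 0.667 0.727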
What is specificity in machine learning evaluation?
Specificity (or True Negative Rate) measures the proportion of actual negatives that were correctly identified. Formula: Specificity = TN / (TN + FP).
What is area under the ROC curve (AUC-ROC)?
AUC-ROC is a metric used to evaluate the performance of a binary classification model. The ROC curve plots the true positive rate (recall) against the false positive rate. AUC represents the probability that the model ranks a randomly chosen positive instance higher than a randomly chosen negative instance.
What is the ROC curve in machine learning?
The ROC (Receiver Operating Characteristic) curve is a graphical plot that illustrates the diagnostic ability of a binary classifier as its discrimination threshold is varied. The curve plots the true positive rate (recall) against the false positive rate.
What is logarithmic loss (log loss) in machine learning evaluation?
Log loss measures the performance of a classification model where the prediction is a probability value between 0 and 1. It calculates the negative log-likelihood of the true labels given the predicted probabilities. Lower log loss indicates better model performance.
What is mean squared error (MSE) in machine learning evaluation?
Mean Squared Error (MSE) measures the average squared difference between predicted and actual values. It’s commonly used for regression models. A lower MSE indicates better model accuracy. Formula: MSE = (1/n) * Σ(actual - predicted)²
n = number of observations
What is mean absolute error (MAE) in machine learning evaluation?
Mean Absolute Error (MAE) measures the average absolute difference between predicted and actual values. It’s another common metric for regression models. Unlike MSE, it doesn’t penalize larger errors as heavily. Formula: MAE = (1/n) * Σ|actual - predicted|.
What is R-squared (R²) in machine learning evaluation?
R-squared (also known as the coefficient of determination) measures how well the model explains the variability in the target variable. It ranges from 0 to 1, where 1 indicates a perfect fit. Formula: R² = 1 - (Σ(actual - predicted)² / Σ(actual - mean)²)
mean = average of actual values
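A quick sketch computing all three regression metrics with scikit-learn (the actual/predicted values are made up):
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

actual = [3.0, 5.0, 2.5, 7.0]
predicted = [2.5, 5.0, 3.0, 8.0]

print(mean_squared_error(actual, predicted))    # MSE
print(mean_absolute_error(actual, predicted))   # MAE
print(r2_score(actual, predicted))              # R²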
What is an index in SQL, and how does it improve query performance?
An index is a data structure that improves the speed of data retrieval operations on a database table at the cost of additional storage and maintenance overhead.
ex:
CREATE INDEX idx_user_email ON Users(email);
- this creates an index on the email column of the Users table.
Write an index to make this query faster
SELECT *
FROM Users
WHERE email = 'doucheturd@email.com';
CREATE INDEX idx_user_email ON Users(email);
this creates an index on the email column of the Users table.
Inference meaning in ML
The process of using a trained model to make predictions on new, unseen data (without labels)
Supervised ML algorithms
Linear Regression
Logistic Regression
k-NN (K nearest neighbors)
Decision Trees
Random Forests
SVM Linear Kernel
SVM RBF Kernel
Unsupervised learning algorithms in ML
K-means
Hierarchical Clustering
PCA
t-SNE
Deep Learning Algorithms
Fully Connected NN (Dense Layers)
CNN (Convolutional Neural Network)
RNN / LSTM
What does Linear Regression do?
Fits a line to minimize the squared error between predictions and actual values.
What is the training complexity of Linear Regression?
O(n d²), where n = number of samples, d = number of features (due to matrix inversion).
What is the inference complexity of Linear Regression?
O(d), since predictions only require multiplying weights by input features.
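A minimal NumPy sketch of the closed-form (normal equation) fit, which is where the matrix-operation cost above comes from; the toy data is illustrative:
import numpy as np

# Normal equation: w = (XᵀX)⁻¹ Xᵀy
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])   # first column of 1s acts as the intercept
y = np.array([2.0, 3.0, 4.0])
w = np.linalg.inv(X.T @ X) @ X.T @ y
print(w)       # ≈ [1, 1]: intercept 1, slope 1
print(X @ w)   # inference is just a dot product per sample: O(d)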
What does Logistic Regression do?
A classification model that applies a sigmoid function to predict probabilities.
What is the training complexity of Logistic Regression?
O(n d), as gradient descent updates weights iteratively.
where n = number of samples, d = number of features
What is the inference complexity of Logistic Regression?
O(d), since predictions only require a dot product and activation function.
d = number of features
What does k-NN (k-Nearest Neighbors) do?
Classifies a new point by voting among its k nearest neighbors.
What is the training complexity of k-NN?
O(n d), as there is no real training phase, only storing data.
where n = number of samples, d = number of features
What is the inference complexity of k-NN?
O(n d), since each prediction requires computing distances to all points.
where n = number of samples, d = number of features
What does a Decision Tree do?
Splits data recursively based on feature conditions to create a tree structure.
What is the training complexity of a Decision Tree?
O(n d log n), since it recursively partitions data.
where n = number of samples, d = number of features
What is the inference complexity of a Decision Tree?
O(log n), as prediction follows a single path from root to leaf.
where n = number of samples
What does Random Forest do?
An ensemble of decision trees that votes on predictions for robustness.
What is the training complexity of Random Forest?
O(t n d log n), where t = number of trees and n = number of samples
What is the inference complexity of Random Forest?
O(t log n), since each prediction traverses t trees, each of depth roughly log n.
What does SVM with a Linear Kernel do?
Finds the optimal linear boundary that separates classes.
What is the training complexity of SVM with a Linear Kernel?
O(n d), since it optimizes a convex function.
where n = number of samples, d = number of features
What is the inference complexity of SVM with a Linear Kernel?
O(d), as classification is based on a simple dot product.
where d = number of features
What does SVM with an RBF Kernel do?
Maps data to a higher-dimensional space for non-linear separation.
What is the training complexity of SVM with an RBF Kernel?
O(n² d), as it requires computing pairwise kernel similarities.
where n = number of samples, d = number of features
What is the inference complexity of SVM with an RBF Kernel?
O(n d), since predictions require summing over all support vectors.
where n = number of samples, d = number of features
What does k-Means do?
Groups data into k clusters by minimizing intra-cluster distance.
What is the training complexity of k-Means?
O(n k d t), where n = number of samples, k = number of clusters, d = number of features, t = number of iterations until convergence.
What is the inference complexity of k-Means?
O(k d), as each new point is compared against the k cluster centers (d = number of features).
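A small scikit-learn sketch matching these costs (the toy points and cluster count are illustrative):
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.0, 2.0], [8.0, 8.0], [9.0, 8.0]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)   # training: O(n k d t)
print(km.predict([[0.0, 0.0]]))   # inference: distance to each of the k centers, O(k d)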
What does Hierarchical Clustering do?
Builds a hierarchy of clusters via merging or splitting.
What is the training complexity of Hierarchical Clustering?
O(n² log n), due to computing all pairwise distances. where n = number of samples
What is the inference complexity of Hierarchical Clustering?
O(n²), as merging clusters requires distance updates. n is number of samples.
What does PCA (Principal Component Analysis) do?
Reduces dimensionality by projecting data onto principal components.
What is the training complexity of PCA?
O(n d² + d³), as it involves eigen decomposition.
where n = number of samples, d = number of features
What is the inference complexity of PCA?
O(d²), as transformation only requires matrix multiplication.
where d = number of features
What does a Fully Connected Neural Network (Dense Layers) do?
A deep learning model where each neuron is connected to all neurons in the next layer.
What is the training complexity of a Fully Connected Neural Network?
O(n d² e), where n = samples, d = features, e = epochs (due to matrix multiplications in backpropagation).
What is the inference complexity of a Fully Connected Neural Network?
O(d²), since predictions require forward propagation through layers. d is number of features.
What does a Convolutional Neural Network (CNN) do?
A deep learning model designed for image processing using convolutional filters.
What is the training complexity of a CNN?
O(n d² c e), where c = number of filters, d² = image size, e = epochs. n is number of samples
What is the inference complexity of a CNN?
O(d² c), since convolution operations dominate the forward pass. where d² = image size, c = number of filters
What does an RNN (Recurrent Neural Network) do?
A neural network designed for sequential data, where previous states influence the current state.
What is the training complexity of an RNN?
O(n d² e), due to sequential backpropagation through time (BPTT). e = epochs, n = samples, d = features.
What is the inference complexity of an RNN?
O(d²), since each time step requires a matrix multiplication. d= features
What does an LSTM (Long Short-Term Memory) network do?
An advanced RNN that prevents vanishing gradients using memory cells and gates.
What is the training complexity of an LSTM?
O(n d² e), since it extends RNN training with additional gate calculations. n = samples, e = epochs, d = features
What is the inference complexity of an LSTM?
O(d²), as each time step processes multiple gating functions. d= features
What does t-SNE (t-Distributed Stochastic Neighbor Embedding) do?
A dimensionality reduction technique that visualizes high-dimensional data by preserving local structure.
What is the training complexity of t-SNE?
O(n²), since it requires pairwise distance calculations for all points. n = # of samples
What is the inference complexity of t-SNE?
O(n²), making it impractical for large datasets. n = # of samples
What does the meltTable function do, and what is its time and space complexity?
The function meltTable transforms a wide-format DataFrame into a long-format DataFrame by unpivoting columns into row values.
ex: df.melt(
    id_vars='column_i_want_to_be_index',
    var_name='name_of_column_that_stores_og_column_names',
    value_name='store_values_of_var_name_column_here')
(value_vars defaults to every column not listed in id_vars, which is usually what you want)
Time: O(NM), N rows & M columns being melted
Space: O(NM), output frame has NM rows so space usage increases accordingly
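A concrete usage sketch (the column names are made up):
import pandas as pd

df = pd.DataFrame({"student": ["a", "b"], "math": [90, 80], "bio": [70, 60]})
long_df = df.melt(id_vars="student", var_name="subject", value_name="grade")
print(long_df)   # 4 rows: one (student, subject, grade) row per original cell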
Try replacing all the NaN’s in the sf_permits data with the one that comes directly after it and then replacing any remaining NaN’s with 0.
Set the result to a new DataFrame sf_permits_with_na_imputed.
sf_permits_with_na_imputed = sf_permits.fillna(method='bfill', axis=0).fillna(0)
If you removed all of the rows of sf_permits with missing values, how many rows are left?
sf_permits.dropna().shape[0]
- dropna() removes every row containing at least one missing value; .shape[0] counts the rows that remain.
Create a new DataFrame called sf_permits_with_na_dropped that has all of the columns with empty values removed.
How many columns were removed from the original sf_permits DataFrame? Use this number to set the value of the dropped_columns variable below.
sf_permits_with_na_dropped = sf_permits.dropna(axis=1)
dropped_columns = sf_permits.shape[1] - sf_permits_with_na_dropped.shape[1]
get the total number of missing values in a dataframe
df.isnull().sum().sum()
get the total number of missing values (data points) in a dataframe per column
df.isnull().sum()
when is multiplication possible with matrices in Linear Algebra
columns of first matrix must match the rows of the second matrix
ex: first matrix is (1x2) and second matrix is (2x2)
A = [1 2]
B = [2  4
     24 6]
Multiply A*B
C11 = (1x2) + (2x24) = 2 + 48 = 50
C12 = (1x4) + (2x6) = 4 + 12 = 16
C = (50 16)
dimensions of C where
C = A * B, and A is size (mxn) and B is (nxp)
C is (mxp) size, it inherits row size of A and column size of B.
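A quick NumPy check of the example above (shapes shown in the comments):
import numpy as np

A = np.array([[1, 2]])            # shape (1, 2)
B = np.array([[2, 4], [24, 6]])   # shape (2, 2)
C = A @ B                         # inner dimensions match (2 == 2), so C has shape (1, 2)
print(C)                          # [[50 16]]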
Softmax Function
Softmax converts a vector of raw scores (logits) into probabilities by exponentiating and normalizing them. It is used in multi-class classification problems to ensure the output sums to 1, making it interpretable as probabilities.
Use it when assigning probabilities to multiple classes in models like neural networks.
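A minimal NumPy sketch of softmax (subtracting the max is a standard trick for numerical stability):
import numpy as np

def softmax(logits):
    shifted = logits - np.max(logits)   # subtract the max so exponentials don't overflow
    exps = np.exp(shifted)
    return exps / exps.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))   # probabilities that sum to 1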
ReLU(x) =
ReLU(x) = max(0,x)
Sigmoid activation function:
σ(x) =
σ(x) = 1 / (1 + e^-x)
Input x is mapped to a value between 0 and 1.
As x -> ∞, σ(x) -> 1
Tanh (Hyperbolic Tangent) activation function:
tanh(x) =
tanh(x) = (e^x - e^-x) / (e^x + e^-x)
The function maps any input x to a value between -1 and 1. As x tends to ∞, tanh(x) tends to 1.
Sigmoid vs Softmax
Sigmoid: One output per class (label/target type), independent probabilities (good for binary or multi-label problems).
Softmax: One output per class, probabilities that sum to 1 (good for multi-class problems with exclusive classes).
What makes a function convex
A function is convex if, for any two points on the function, the line segment connecting them lies above or on the function (one global minimum).
It should be shaped like a bowl, with a single minimum.
Why should cost function be convex
If cost function is not convex, it can lead to optimization problems in machine learning algorithms as local minima can trap the algorithm, preventing it from finding the global minimum.
This prevents model from learning optimal parameters and therefore poor performance on the test data.
What is gradient descent
Gradient descent is an iterative optimization algorithm used to find the minimum of a function.
It works by moving in the opposite direction of the gradient (as gradient points in direction of increase) and in optimization we are looking for the minimum.
x_new = x_old - (LR * gradient), where LR is the learning rate. Keep iterating until convergence (x stops changing meaningfully); x represents the weight or parameter being optimized.
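A minimal sketch of the update rule on a toy function, f(x) = (x - 3)², whose gradient is 2(x - 3):
lr = 0.1     # learning rate
x = 0.0      # initial guess for the parameter
for _ in range(100):
    gradient = 2 * (x - 3)   # derivative of (x - 3)^2
    x = x - lr * gradient    # step in the opposite direction of the gradient
print(x)     # converges to ~3, the minimum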
Difference between gradient descent and backpropagation
Gradient Descent is an optimization algorithm that updates model parameters (weights) to minimize a loss function, while Backpropagation is a technique that computes the gradients of the loss function with respect to each weight using the chain rule. Backpropagation provides the gradients, and Gradient Descent uses them to update the weights.
Essentially, backpropagation technique requires doing backpropagation (chain rule to get gradient of loss function for each weight/parameter) AND THEN doing gradient descent to find the optimal parameter/weight by minimizing cost (loss) function.
convert a column of values that should be dates but are objects into dates (ie 1/17/07)
df['date_parsed'] = pd.to_datetime(df['date'], format="%m/%d/%y")