interview_ML Flashcards
Compute a checksum for precise file deduplication, and compress the files for storage. What is meant by this?
A checksum gives you a numerical representation of the contents of a file and is used to verify the integrity of the data in that file.
We can remove duplicate files that have the same content but different names using this method.
In the context of the described data collection process, computing a checksum means generating a unique, fixed-size numerical value (often represented as a string of letters and numbers) that is calculated from the contents of a file or a piece of data. This checksum serves multiple purposes:
Integrity Verification: The checksum helps verify the integrity of the data or files over time. If the file is altered in any way, even a small change, recomputing the checksum will result in a different value. This allows for easy detection of corruption or unintended changes.
Deduplication: In the process described, the checksum is used for precise file deduplication. By computing and comparing checksums, it is possible to identify and eliminate duplicate files in the dataset, even if the files are named differently. This is because the checksum is based on the file content, not the file name or other metadata.
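A minimal Python sketch of how checksum-based deduplication could look (the hashing choice and file handling are illustrative):

import hashlib

def file_checksum(path, chunk_size=8192):
    # Stream the file through SHA-256 so large files don't need to fit in memory
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def deduplicate(paths):
    # Keep only the first file seen for each unique checksum
    seen, unique = set(), []
    for p in paths:
        digest = file_checksum(p)
        if digest not in seen:
            seen.add(digest)
            unique.append(p)
    return unique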
what data preprocessing steps were done at keysight?
You can compute checksums (Python code) and deduplicate, and you can weed out some files based on the version of the code.
Excessively large code files were filtered out, as they were probably auto-generated.
The file path was added at the beginning of each file's data.
Tokenisation was done to create chunks, and then a synthetic dataset was created by querying either Llama 2 or the GPT API, depending on data sensitivity.
Something I didn't do during the Keysight data work was dependency ordering; I read about it later in DeepSeek Coder. It makes a lot of sense, they gave pseudocode for it, and their code performance is state of the art, which probably has something to do with it.
Positional Encoding Calculation and code (formula) also the rationale behind it
import numpy as np
def positional_encoding(sentence_length, model_dim):
    # Sinusoidal positional encodings: sin on even indices, cos on odd indices
    pos_enc_matrix = np.zeros((sentence_length, model_dim))
    for pos in range(sentence_length):
        for i in range(0, model_dim, 2):
            pos_enc_matrix[pos, i] = np.sin(pos / (10000 ** (i / model_dim)))
            if i + 1 < model_dim:  # guard in case model_dim is odd
                pos_enc_matrix[pos, i + 1] = np.cos(pos / (10000 ** (i / model_dim)))
    return pos_enc_matrix
Rationale behind the Formula
The usage of a sine function for even indices and a cosine function for odd indices allows the model to capture different phase relationships between sequences of different frequencies.
This approach offers a "hackish" way to represent positional information at varying scales: the low embedding dimensions oscillate with high frequency and distinguish nearby positions, while the higher dimensions oscillate with low frequency and capture coarser, longer-range position information.
The specific constant 1/10000 is introduced to prevent the function from saturating.
Positional encoding used in llama?
Are mu and sigma for layer norm calculated during training or at runtime?
mu and sigma are calculated at runtime for each data sample (both in training and at inference); gamma and beta are trained parameters.
in the transformer architecture layer norm is done where?
After the linear layer and the addition of the skip connection, and before the activation function.
This means the workflow within each sub-block of a transformer is as follows:
- Input from the previous layer
- Addition of the skip connection output
- Layer normalization
- Activation function (if any, depending on the sub-block)
However, some implementations and variations might apply layer normalization before adding the skip connection output. Both methods have been explored in practice, with different effects on training dynamics and performance. The original approach (normalization after the residual connection) is more common and is often preferred because it allows the model to preserve the raw output of each layer before it is normalized, which can be beneficial for learning complex dependencies.
i. [E] What’s the geometric interpretation of the dot product of two vectors?
Multiplication of the length of one vector and the length of the projection of the other
vector onto the first one.
[E] Given a vector u, find vector v of unit length such that the dot product of u and v is
maximum
Given a vector u, the vector v of unit length that maximizes the dot product u · v is the vector that points
in the same direction as u. The vector v can be found by dividing u by its own magnitude, making it a
unit vector.
Give an example of how the outer product can be useful in ML.
The Covariance matrix is a commonly used quantity in Machine Learning algorithms
(e.g. PCA). Given a dataset X ∈ R^(n×d) with n samples and d features, we calculate the (empirical) covariance as follows:
Cov[X] = (1/n) ∑_{i=1}^{n} (x_i − x̄)(x_i − x̄)^T
where x̄ is the mean feature vector: x̄ = (1/n) ∑_{i=1}^{n} x_i. Each term (x_i − x̄)(x_i − x̄)^T is an outer product of a centered sample with itself.
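A small numpy sketch of the formula above, building the covariance from outer products of centered samples (the random data is just for illustration):

import numpy as np

n, d = 100, 5
X = np.random.randn(n, d)      # n samples, d features
x_bar = X.mean(axis=0)         # mean feature vector

# Covariance as the average of outer products (x_i - x_bar)(x_i - x_bar)^T
cov = sum(np.outer(x - x_bar, x - x_bar) for x in X) / n

assert np.allclose(cov, np.cov(X, rowvar=False, bias=True))  # matches numpy's 1/n estimate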
[E] What does it mean for two vectors to be linearly independent?
Two vectors are linearly independent if no scalar multiple of one vector equals the other. This means
they do not lie on the same line through the origin and neither can be formed by scaling the other.
[M] Given two sets of vectors A and B, how do you check that they share the same basis (i.e. span the same subspace)?
You first have to make sure A and B have the same rank (dimension of their span). This can be done by converting A and B into row echelon form, or via:
Singular Value Decomposition (SVD): Decompose A into the product of three matrices: A = U Σ V^T, where U and V are orthogonal matrices, and Σ is a diagonal matrix. The number of non-zero diagonal entries in Σ is the rank of A.
Once you have the rank, take the r vectors that span the row space of each set and show that every row of A is spanned by the rows of B and vice versa.
[M] Given n vectors, each of d dimensions. What is the dimension of their span?
Given n vectors, each of d dimensions, the dimension of their span is the number of linearly independent vectors among them, which is at most min(n, d). If all n vectors are linearly independent, the dimension of their span is exactly n (which requires n ≤ d).
You can think of the vectors as the columns of a d×n matrix; the dimension of the span is the rank of that matrix, which is at most min(n, d).
[E] What’s a norm? What are the L0, L1, L2, and L∞ norms?
A norm is a function that assigns a strictly positive length or size to each vector in a vector space, except for the zero vector, which is given a length of zero.
The L0 "norm" refers to the number of nonzero elements in a vector (not a true norm),
the L1 norm is the sum of the absolute values of the vector elements,
the L2 norm (also known as the Euclidean norm) is the square root of the sum of the squares of the vector elements,
and the L∞ norm is the maximum absolute value of the elements in the vector.
[M] How do norm and metric differ? Given a norm, make a metric. Given a metric, can we make a norm?
Ans. A metric measures distances between pairs of things while a norm measures the size of a single item. Metrics can be defined on pretty much anything, while the notion of a norm applies only to vector spaces: the very definition of a norm requires that the things measured by the norm could be added and scaled. If you have a norm, you can define a metric by saying that the distance between a and b is the size of a - b.
On the other hand, if you have a metric you can’t usually define a norm.
[E] Why do we say that matrices are linear transformations?
Matrices represent linear transformations because they map linear combinations of vectors to other linear
combinations.
Specifically, for a matrix M and vectors u and v, and a scalar c, the following properties hold true, which are
the properties of linear transformations:
M(u + v) = Mu + Mv
Applying M to the sum of two vectors is the same as summing the results of applying M to each
vector individually.
M(cu) = c(Mu)
Applying M to a scalar multiple of a vector is the same as multiplying the result of applying M by
that same scalar.
These properties demonstrate that matrix multiplication exhibits the key aspects of a linear transformation:
1. Additivity - Applying the transformation to vector sums gives the sum of individually transformed
vectors
2. Homogeneity - Scalars distribute across the transformation
Thus, matrices and matrix multiplication inherently represent and operate as linear transformations
between vector spaces.
What’s the inverse of a matrix? Do all matrices have an inverse? Is the inverse of a matrix
always unique?
The inverse of a matrix A is another matrix A^-1 such that AA^-1 = A^-1A = I, where I is the identity matrix.
Not all matrices have an inverse; only square matrices that are non-singular (with a non-zero determinant)
have an inverse. When a matrix has an inverse, it is always unique.
[E] What does the determinant of a matrix represent?
The determinant of a matrix can be interpreted as a scaling factor for the transformation that the matrix
represents. Geometrically, it represents the volume scaling factor of the linear transformation described by
the matrix, including whether the transformation preserves or reverses the orientation of the space.
[E] What happens to the determinant of a matrix if we multiply one of its rows by a scalar t×R?
Multiplying a row of a matrix by a scalar t scales the determinant of the matrix by that scalar. So if the
original determinant was d, the new determinant will be t×d.
[M] A 4×4 matrix has four eigenvalues 3,3,2,−1. What can we say about the trace and the
determinant of this matrix?
The trace of the matrix, which is the sum of its eigenvalues, would be 3+3+2−1=7. The determinant of the
matrix, which is the product of its eigenvalues, would be 3×3×2×(−1)=−18
what is the determinant of a matrix with linearly dependent rows or columns
0
[M] What’s the difference between the covariance matrix A^TA and the Gram matrix AA^T?
Suppose A ∈ R^(n×d), corresponding to n samples each having d features. Then the covariance matrix A^T A ∈ R^(d×d) captures the "similarity" between features, whereas the Gram matrix A A^T ∈ R^(n×n) captures the "similarity" between samples.
Given A∈Rn×m and b∈Rn
Find x such that: Ax=b
When does this have a unique solution?
Why is it that when A has more columns than rows, Ax=b has multiple solutions?
Given a matrix A with no inverse. How would you solve the equation Ax=b?
Unique Solution
Square and Full Rank: If A is square (n = m) and has full rank (rank(A) = m = n), then A is invertible. The equation Ax = b has a unique solution given by x = A^(-1)b.
Non-Square but Full Column Rank: If A is not square (m ≠ n), but rank(A) = m (full column rank) and m < n (tall matrix), a left inverse A_L of A exists such that A_L A = I. This configuration implies Ax = b has a unique solution when b is in the range (column space) of A. The solution can be given by x = A_L b. In this scenario, A_L = (A^T A)^(-1)A^T and there is no null space of A (nullity(A) = 0) because all columns are linearly independent.
Multiple or No Solutions
Underdetermined System (m > n): If A is a wide matrix (more columns than rows), the rank of A can be at most n, and the null space of A (nullity(A)) has dimension m - n (assuming rank(A) = n). This situation typically means:
No Solution: If b is not in the column space of A, then Ax = b has no solution.
Infinitely Many Solutions: If b is in the column space of A, there are infinitely many solutions because there are free variables associated with the non-trivial null space of A.
What is the pseudoinverse and how to calculate it?
Method 1: Singular Value Decomposition (SVD)
The most reliable method for computing the pseudoinverse of any matrix, whether square or rectangular, full rank or rank-deficient, is using its Singular Value Decomposition (SVD). The SVD of matrix A is given by:
A = U Σ V^T
where:
U is an n×n orthogonal matrix whose columns are the left singular vectors of A.
V is an m×m orthogonal matrix whose columns are the right singular vectors of A.
Σ is an n×m diagonal matrix with non-negative real numbers on the diagonal, known as the singular values.
The pseudoinverse A+ is then calculated as:
A+ = V Σ+ U^T
Here, Σ+ is obtained by taking the reciprocal of each non-zero singular value in Σ, and transposing the matrix.
Method 2: Using Matrix Transpose and Inversion
For matrices that are either full row rank or full column rank, the pseudoinverse can also be computed directly using matrix transposes and inversion:
Full Column Rank (m≤n and rank(A)=m):
A+ = (A^T A)^(-1) A^T
Full Row Rank (m≥n and rank(A)=n):
A+ = A^T (A A^T)^(-1)
Note: These formulas assume that A has full column or row rank, respectively.
Shortcut: use A A^T or A^T A, whichever is full rank (and therefore invertible), and plug it into the corresponding formula above to find the pseudoinverse.
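A quick numpy check of Method 1, building A+ from the SVD and comparing it against np.linalg.pinv (the matrix values are arbitrary):

import numpy as np

A = np.random.randn(4, 3)                         # rectangular matrix
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Build Sigma^+: reciprocal of each non-zero singular value, transposed shape
Sigma_plus = np.zeros((A.shape[1], A.shape[0]))
Sigma_plus[:len(s), :len(s)] = np.diag(1.0 / s)

A_plus = Vt.T @ Sigma_plus @ U.T                  # A+ = V Sigma+ U^T
assert np.allclose(A_plus, np.linalg.pinv(A))     # should match numpy's pseudoinverse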
What does the derivative represent?
Answer: The derivative of a function measures the sensitivity to change in the function output with
respect to a change in the input.
Moreover, when it exists, the derivative at a given point is the slope of the tangent line to the graph
of the function at that point. The tangent line is the best linear approximation of the function at
that input value. This is the reason why in gradient descent we (slowly) move in the (negative)
direction of the derivative
What’s the difference between derivative, gradient, and Jacobian?
When f: R → R, we calculate the derivative df/dx.
When f: R^n → R, we calculate the gradient:
∇f = [∂f/∂x1, ∂f/∂x2,…, ∂f/∂xn]
When f: R^n → R^m, we calculate the Jacobian (an mxn matrix):
Jac(f) = [
∂f1/∂x1 … ∂f1/∂xn
…
∂fm/∂x1 … ∂fm/∂xn
]
Say we have the weights w ∈ R^(d×m) and a mini-batch x of n elements, each element is of the shape 1 × d, so that x ∈ R^(n×d). We have the output y = f(x; w) = xw. What’s the dimension of the Jacobian ∂y/∂x?
First, notice that y ∈ R^(n×m). With that said, Jac_x(f) ∈ R^((n×m)×(n×d)), or equivalently Jac_x(f) ∈ R^((n·m)×(n·d)), given that we have reshaped the 4-dim tensor into a 2-dim tensor, i.e. a matrix.
In general, the Jacobian ∂y/∂x has dimension R^((dim of y) × (dim of x)).
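A small PyTorch check of the shape claim above (the values of n, d, m are arbitrary):

import torch

n, d, m = 4, 3, 2
w = torch.randn(d, m)
x = torch.randn(n, d)

f = lambda x: x @ w                              # y = xw, shape (n, m)
J = torch.autograd.functional.jacobian(f, x)

print(J.shape)                                   # torch.Size([4, 2, 4, 3]), i.e. (n, m, n, d)
print(J.reshape(n * m, n * d).shape)             # flattened into an (n·m) x (n·d) matrix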
Given a very large symmetric matrix A that doesn’t fit in memory, say A ∈ R^(1M×1M), and a function f that can quickly compute f(x) = x^T Ax. Find the unit vector x so that x^T Ax is minimal.
Hint: Can you frame it as an optimization problem and use gradient descent to find an approximate
solution?
To find the unit vector x that minimizes x^T Ax, we can frame this as an optimization problem and approach it using an iterative algorithm like gradient descent or the conjugate gradient method. These methods update x in a direction that decreases x^T Ax; in the case of gradient descent, this involves computing the gradient 2Ax at each step and stepping in its negative direction. We iterate this process until convergence, ensuring at each step that x remains a unit vector by normalizing it.
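A rough sketch of this projected gradient descent, assuming (for illustration) we also have a routine matvec(v) that returns Av without materializing A:

import numpy as np

def min_quadratic_unit_vector(matvec, d, lr=0.01, steps=1000, seed=0):
    # Projected gradient descent for min_x x^T A x subject to ||x|| = 1.
    # matvec(v) is assumed to return A @ v for a symmetric A.
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(d)
    x /= np.linalg.norm(x)
    for _ in range(steps):
        grad = 2 * matvec(x)          # gradient of x^T A x
        x = x - lr * grad             # gradient step
        x /= np.linalg.norm(x)        # project back onto the unit sphere
    return x

# Tiny sanity check with an explicit matrix (only for illustration)
A = np.diag([5.0, 2.0, -1.0])
x = min_quadratic_unit_vector(lambda v: A @ v, d=3)
print(x @ A @ x)                      # approaches the smallest eigenvalue of A, here -1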
Why do we need dimensionality reduction?
Dimensionality reduction is used to reduce the number of random variables under consideration and can
be divided into feature selection and feature extraction. It helps in reducing the time and storage space
required, removes multicollinearity, enhances the interpretation of the parameters, helps in visualizing data,
and most importantly, it can help in avoiding the curse of dimensionality.
Eigendecomposition is a common factorization technique used for dimensionality reduction. Is the eigendecomposition of a matrix always unique?
The decomposition is not always unique. Suppose A ∈ R^(2×2) has two equal eigenvalues λ1 = λ2 = λ, with corresponding eigenvectors u1, u2. Then:
Au1 = λ1u1 = λu1
Au2 = λ2u2 = λu2
Or written in matrix form:
A [u1 u2] = [u1 u2] [λ 0; 0 λ]
Notice that we can permute the matrix of eigenvectors (thus obtaining a different factorization):
A [u2 u1] = [u2 u1] [λ 0; 0 λ]
But we still end up with the same eigen-properties:
Au2 = λu2
Au1 = λu1
Name some applications of eigenvalues and eigenvectors
PCA
We want to do PCA on a dataset of multiple features in different ranges. For example, one is in the range 0-1 and one is in the range 10 - 1000. Will PCA work on this dataset?
In PCA we are interested in the components that maximize the variance. If one component (e.g. human height) varies less than another (e.g. weight) because of their respective scales (meters vs. kilos), PCA might determine that the direction of maximal variance more closely corresponds with the ‘weight’ axis, if those features are not scaled. Since a change in height of one meter should be considered much more important than the change in weight of one kilogram, the previous assumption would be incorrect. Therefore, it is important to standardize the features before applying PCA.
Under what conditions can one apply eigendecomposition? What about SVD?
Eigendecomposition is possible only for (square) diagonalizable matrices. On the other hand, the Singular Value Decomposition (SVD) always exists (even for non-square matrices).
What’s the relationship between PCA and SVD?
Suppose we have data X ∈ R^(n×d) with n samples and d features. Moreover, assume that the data has been centered so that the mean of each feature is 0. Then, we can perform PCA in two main ways:
First, we compute the covariance matrix C = 1/(n-1)X^T X ∈ R^(d×d), and perform eigendecomposition: C = V L V^T, with eigenvalues as the diagonal of L ∈ R^(d×d), and eigenvectors as the columns of V ∈ R^(d×d). Then, we stack the k eigenvectors of V corresponding to the top k eigenvalues into a matrix V˜ ∈ R^(d×k). Finally, we obtain the component values as follows: X˜ = X V˜ ∈ R^(n×k).
Alternatively, instead of first computing the covariance matrix and then performing eigendecomposition, notice that given the above formulation, we can directly compute SVD on the data matrix X, thus obtaining: X = U Σ V^T. By construction, the right singular vectors in V are the eigenvectors of X^T X. Similarly, we stack the k right singular vectors corresponding to the top k singular values into a matrix V˜ ∈ R^(d×k). Finally, we obtain the component values as follows: X˜ = X V˜ ∈ R^(n×k).
Even though SVD is slower, it is often considered to be the preferred method because of its higher numerical accuracy.
What is the relationship between SVD and eigendecomposition?
Consider A ∈ R^(m×n) of rank r. Then, we can factorize A as follows:
A = U Σ V^T
where U ∈ R^(m×m) is an orthogonal matrix of left singular vectors, V ∈ R^(n×n) is an orthogonal matrix of right singular vectors, and Σ ∈ R^(m×n) is a “diagonal” matrix of singular values such that exactly r of the values σ_i := Σ_ii are non-zero.
By construction:
The left singular vectors of A are the eigenvectors of A A^T. From the Spectral Theorem, the eigenvectors (and thus the left singular vectors) are orthonormal.
The right singular vectors of A are the eigenvectors of A^T A. From the Spectral Theorem, the eigenvectors (and thus the right singular vectors) are orthonormal.
If λ is an eigenvalue of A^T A (or A A^T), then √λ is a singular value of A. From the positive semidefiniteness of A^T A (or A A^T), the eigenvalues (and thus the singular values) are non-negative.
What does it mean when a function is differentiable?
A function f: U → R is said to be differentiable at a ∈ U if the derivative:
f’(a) = lim(h→0) [f(a + h) - f(a)]/h
exists. This implies that the function is continuous at a. Note that not every continuous function is differentiable: for example, |x| is continuous but not differentiable at 0; functions with sharp corners are continuous but not differentiable at those points.
[E] Give an example of when a function doesn’t have a derivative at a point.
An example is the function f(x) = |x|, which doesn’t have a derivative at x = 0. The graph of this function
has a sharp corner at x = 0, which means there is no single tangent line at that point.
Why you can still apply the formula right?
Give an example of non-differentiable functions that are frequently used in machine learning.
How do we do backpropagation if those functions aren’t differentiable?
An example is the ReLU (Rectified Linear Unit) function, which is non-differentiable at x = 0. In machine
learning, backpropagation with such functions often uses a concept called subgradient, which allows the
algorithm to bypass non-differentiability at certain points. For ReLU, the derivative is defined as 0 for x < 0
and 1 for x > 0, and at x = 0, any value between 0 and 1 can be used.
What does it mean for a function to be convex or concave?
A function is called convex if the line segment between any two points on the graph of
the function lies above the graph between the two points. More precisely, the function f : X → R is
convex if and only if for all 0 ≤ t ≤ 1 and all x1, x2 ∈ X:
f(tx1 + (1 − t)x2) ≤ tf(x1) + (1 − t)f(x2)
The function f is said to be concave if −f is convex.
Why is convexity desirable in an optimization problem?
Convexity is desirable because any local minimum of a convex function is also a global
minimum
Most ML algorithms we use nowadays use first-order derivatives (gradients) to construct the next
training iteration.
[E] How can we use second-order derivatives for training models?
Second-order derivatives can be used in optimization algorithms to better understand the curvature of
the loss function. This information can be used to adjust the learning rate and the direction of the
update steps, potentially leading to faster convergence.
Pros and cons of second-order optimization.
Pros: Can lead to faster convergence and more informed update steps.
Cons: Computationally expensive, as it requires calculating and inverting the Hessian matrix.
How can we use the Hessian (second derivative matrix) to test for critical points?
The Hessian matrix can be used to test the nature of critical points. If the Hessian at a point is positive
definite, the point is a local minimum; if it is negative definite, the point is a local maximum; and if it is
indefinite, the point is a saddle point
Jensen’s inequality forms the basis for many algorithms for probabilistic inference, including Expectation-Maximization and variational inference. Explain what Jensen’s inequality is.
As stated before, for a given convex function f, we had the following property:
f(tx1 + (1 - t)x2) ≤ tf(x1) + (1 - t)f(x2)
Let us generalize this property. Again, suppose we have a convex function f, variables x1, …, xn ∈ I, and non-negative real numbers α1, …, αn such that ∑i αi = 1. Then, by induction we have:
f(α1x1 + … + αnxn) ≤ α1f(x1) + … + αnf(xn)
Let’s formalize it one step further. Consider a convex function f, a discrete random variable X with n possible values x1, …, xn, and real non-negative values ai = p(X = xi). Then, we obtain the general form of Jensen’s inequality:
f(E[X]) ≤ E[f(X)]
Explain the chain rule.
The chain rule is a formula that expresses the derivative of the composition of two differentiable functions f and g in terms of the derivatives of f and g. More precisely, if h = f ∘ g is the function such that h(x) = f(g(x)) for every x, then the chain rule is:
dh/dx = df/dg · dg/dx
Given the function f(x, y) = 4x^2 - y with the constraint x^2 + y^2 = 1. Find the function’s maximum and minimum values.
In order to solve the constrained optimization problem, we form the Lagrangian:
L(x, y, λ) = 4x^2 - y + λ(x^2 + y^2 - 1)
Calculating the gradient and setting it to zero, we obtain:
∇_x,y,λ L = (∂L/∂x, ∂L/∂y, ∂L/∂λ) = (8x + 2λx, -1 + 2λy, x^2 + y^2 - 1) = 0
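Completing the algebra from the stationarity conditions:
From ∂L/∂x = 2x(4 + λ) = 0, either x = 0 or λ = −4.
- If x = 0, the constraint gives y = ±1, with f(0, 1) = −1 and f(0, −1) = 1.
- If λ = −4, then ∂L/∂y = 0 gives y = 1/(2λ) = −1/8, so x^2 = 1 − 1/64 = 63/64 and f = 4(63/64) + 1/8 = 65/16.
Hence the maximum is 65/16 at (±√63/8, −1/8) and the minimum is −1 at (0, 1).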
Let x ∈ R^n, L = crossentropy(softmax(x), y) in which y is a one-hot vector. Take the derivative of L with respect to x.
∂L/∂x = ∂/∂x[−y^T log(softmax(x))] = −(∂ softmax(x)/∂x)^T (y / softmax(x))
= −[diag(softmax(x)) − softmax(x) softmax(x)^T] (y / softmax(x))
= softmax(x)(1^T y) − y = softmax(x) − y (since y is one-hot, 1^T y = 1)
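A quick numpy sanity check of the softmax(x) − y result via finite differences (values arbitrary):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def loss(x, y):
    return -np.sum(y * np.log(softmax(x)))

x = np.random.randn(5)
y = np.eye(5)[2]                     # one-hot target

analytic = softmax(x) - y
numeric = np.zeros(5)
eps = 1e-5
for i in range(5):
    d = np.zeros(5); d[i] = eps
    numeric[i] = (loss(x + d, y) - loss(x - d, y)) / (2 * eps)  # central difference
print(np.allclose(analytic, numeric, atol=1e-4))                # True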
Given a uniform random variable X in the range of [0,1] inclusively. What’s the probability that
X=0.5?
For a continuous uniform distribution, the probability of X being exactly any specific value, including 0.5, is
0. This is because the probability for a continuous distribution is defined over intervals, not specific points.
Can the values of PDF be greater than 1? If so, how do we interpret PDF?
Yes, the values of a Probability Density Function (PDF) can be greater than 1. The key point is that the
area under the PDF curve over the entire range must integrate to 1. A high PDF value does not represent
probability but rather indicates a higher density of the variable at that point.
What’s the difference between multivariate distribution and multimodal distribution?
A multivariate distribution is a probability distribution with more than one random variable, each with its
range of values. A multimodal distribution is a probability distribution with more than one peak or mode, regardless of how many variables it has.
What does it mean for two variables to be independent?
In general, continuous random variables X1, …, Xn admitting a joint density are all independent from each other if and only if:
p_{X1,…,Xn}(x1, …, xn) = p_{X1}(x1) · · · p_{Xn}(xn)
This equation states that the joint probability density function (pdf) of the random variables X1, …, Xn factorizes into the product of their individual pdfs, which is a necessary and sufficient condition for independence.
It’s a common practice to assume an unknown variable to be of the normal distribution. Why is that?
The Central Limit Theorem (CLT) states that the distribution of the sum of a large number of
independent, identically distributed random variables is approximately normal, regardless of the underlying distribution. Because so many things in the universe can be modeled as the sum of a large number of
independent random variables, the normal distribution pops up a lot.
Central limit theorem and law of large numbers.
How would you turn a probabilistic model into a deterministic model?
To convert a probabilistic model into a deterministic one, you typically use expected values, mode, or
median of the probability distributions as fixed values instead of random variables. This approach ignores
the variability and uncertainty represented by the probability distributions.
Explain frequentist vs. Bayesian statistics.
The frequentist approach The goal is to use the sample data to build point estimates of the parameters
(potentially with standard error).
bayesian uses priors The goal is to build a posterior distribution of the parameters, given
the data at hand.
Code for merge sort
def merge_sort(arr):
    """
    Sorts an array using the merge sort algorithm.
    """
    if len(arr) > 1:
        mid = len(arr) // 2        # Finding the mid of the array
        left_half = arr[:mid]      # Dividing the elements into 2 halves
        right_half = arr[mid:]

        merge_sort(left_half)      # Sorting the first half
        merge_sort(right_half)     # Sorting the second half

        i = j = k = 0
        # Merge the two sorted halves back into arr
        while i < len(left_half) and j < len(right_half):
            if left_half[i] < right_half[j]:
                arr[k] = left_half[i]
                i += 1
            else:
                arr[k] = right_half[j]
                j += 1
            k += 1
        # Copy any remaining elements
        while i < len(left_half):
            arr[k] = left_half[i]
            i += 1
            k += 1
        while j < len(right_half):
            arr[k] = right_half[j]
            j += 1
            k += 1
    return arr
Example usage
if __name__ == "__main__":
    sample_array = [12, 11, 13, 5, 6, 7]
    print("Original array:", sample_array)
    sorted_array = merge_sort(sample_array)
    print("Sorted array:", sorted_array)
code to recursively read a json file
import json
def load_json_file(filename):
    """Load the JSON data from a file."""
    with open(filename, 'r') as file:
        return json.load(file)

def print_json_recursively(data, indent=0):
    """Recursively print JSON data with indentation for nested structures."""
    for key, value in data.items():
        print(' ' * indent + str(key) + ':', end=' ')
        if isinstance(value, dict):      # If value is a dictionary, recurse
            print()
            print_json_recursively(value, indent + 4)
        elif isinstance(value, list):    # If value is a list, iterate each item
            print()
            for i, item in enumerate(value):
                print(' ' * (indent + 4) + f'[{i}]:', end=' ')
                if isinstance(item, dict):
                    print()
                    print_json_recursively(item, indent + 8)
                else:
                    print(item)
        else:
            print(value)

def main():
    filename = 'path_to_your_json_file.json'
    try:
        json_data = load_json_file(filename)
        print_json_recursively(json_data)
    except Exception as e:
        print(f"An error occurred: {e}")

if __name__ == "__main__":
    main()
Find the longest increasing subsequence in a string.
def longest_increasing_subsequence(s):
    # Cache for memoization
    memo = {}

    def rec(i, prev):
        if i == len(s):
            return 0                      # Base case: end of string
        if (i, prev) in memo:             # Check memoized results
            return memo[(i, prev)]
        # Option 1: include the current character if it continues the sequence
        taken = 0
        if prev < s[i]:
            taken = 1 + rec(i + 1, s[i])
        # Option 2: skip the current character
        not_taken = rec(i + 1, prev)
        # Store result in memoization dictionary
        memo[(i, prev)] = max(taken, not_taken)
        return memo[(i, prev)]

    # Start the recursion with a character smaller than any possible as the initial 'previous'
    return rec(0, chr(0))
Example usage
s = "azbycxdwe"
print("Length of the longest increasing subsequence is:", longest_increasing_subsequence(s))
Traverse a tree in pre-order, in-order, and post-order.
class TreeNode:
    def __init__(self, val=0, left=None, right=None):
        self.val = val
        self.left = left
        self.right = right

def preorder_traversal(root):
    if root is None:
        return []
    return [root.val] + preorder_traversal(root.left) + preorder_traversal(root.right)

def inorder_traversal(root):
    if root is None:
        return []
    return inorder_traversal(root.left) + [root.val] + inorder_traversal(root.right)

def postorder_traversal(root):
    if root is None:
        return []
    return postorder_traversal(root.left) + postorder_traversal(root.right) + [root.val]
- Preorder Traversal (Iterative)
def preorder_traversal_iterative(root):
    if root is None:
        return []
    stack, output = [root], []
    while stack:
        node = stack.pop()
        if node:
            output.append(node.val)
            stack.append(node.right)  # Right child pushed first so that left is processed first
            stack.append(node.left)
    return output
In order traversal of BST
def inorder_traversal_iterative(root):
    stack, output = [], []
    current = root
    while current or stack:
        while current:
            stack.append(current)
            current = current.left
        current = stack.pop()
        output.append(current.val)
        current = current.right
    return output
Post order traversal of BST
def postorder_traversal_iterative(root):
    if root is None:
        return []
    stack, output = [root], []
    while stack:
        node = stack.pop()
        output.append(node.val)
        if node.left:
            stack.append(node.left)
        if node.right:
            stack.append(node.right)
    return output[::-1]  # Reverse the result because we want left-right-root
Do post order traversal question
Given an array of integers and an integer k, find the total number of continuous subarrays whose
sum equals k. The solution should have O(N) runtime
def subarraySum(nums, k):
    # Dictionary to store the frequency of cumulative sums
    cumulative_sum_count = {0: 1}  # Base case: a prefix sum of 0 exists once
    current_sum = 0
    count = 0
    for num in nums:
        current_sum += num
        # Check if there is a prefix sum we can subtract
        # so that the remaining subarray sums to k
        sum_needed = current_sum - k
        if sum_needed in cumulative_sum_count:
            count += cumulative_sum_count[sum_needed]
        # Update the count of the current_sum in the hashmap
        if current_sum in cumulative_sum_count:
            cumulative_sum_count[current_sum] += 1
        else:
            cumulative_sum_count[current_sum] = 1
    return count
You have three matrices: A ∈ R100×5, B ∈ R5×200, C ∈ R200×20, and you need to calculate the product ABC. In what order would you perform your multiplication and why?
Since matrix multiplication is associative, the answer is the same whether we multiply in the order of (AB)C or A(BC). However, let us observe the cost through the number of scalar multiplications we need to perform:
(AB)C = 100 · 5 · 200 + 100 · 200 · 20 = 500000
A(BC) = 5 · 200 · 20 + 100 · 5 · 20 = 30000
Obviously, the second approach is computationally cheaper.
What are some of the causes for numerical instability in deep learning?
Overflow, underflow, division by zero, log 0, NaN as input, etc.
In many machine learning techniques (e.g. batch norm), we often see a small term ϵ
added to the calculation. What’s the purpose of that term?
The purpose is to avoid operations that are undefined for 0, such as division by 0, log 0, etc
What made GPUs popular for deep learning? How are they compared to TPUs?
GPUs became popular for deep learning because matrix multiplications can be efficiently parallelized over hundreds of cores. TPUs (Tensor Processing Units) are specialized hardware for neural nets, with the key difference that they have lower precision for representing floating-point numbers, allowing for:
Higher memory throughput
Faster addition and multiplication operations
This design enables TPUs to accelerate neural network computations while reducing power consumption and increasing efficiency.
What are the time and space complexity for doing backpropagation on a recurrent neural network?
O(B · T · w) - time
O(w + B · T · a) - space
For the forward-pass of a single example in one timestep we need to evaluate all the weights, resulting in O(w) time complexity, where w is the number of weights. Due to the recurrence, we repeat the computation for T timesteps, resulting in O(T · w). Moreover, performing this un-rolled forward pass for an entire batch will amount the time complexity to O(B · T · w). Lastly, we note that the time complexity of the forward and the backward pass is the same.
As for the space complexity, note that we need to keep in memory both the network weights and the activations from the forward pass (required for the backprop computation). Given that storing the activations for a single timestep is O(a), the space complexity amounts to O(w + B · T · a)
Write a function find_bigrams to take a string and return a list of all bigrams.
def find_bigrams(text_list: list):
    result = []
    for ls in text_list:
        words = ls.lower().split()
        for bi in zip(words, words[1:]):
            result.append(bi)
    return result

text = ["Data drives everything", "Get the skills you need for the future of work"]
print(find_bigrams(text))
what are the 2 kinds of language modelling?
Masked and causal language modelling
cost to train a model in terms of P and D
C= 6PD
P is the number of parameters in the transformer model
D is the dataset size, in tokens
C is the compute required to train the transformer model, in total floating point operations
The available compute per second is # of GPUs x FLOPs/GPU, so training time ≈ C / (# of GPUs x achieved FLOPs per GPU).
total memory required for training a model
Total Memory = Model Memory + optimizer memory + activation memory + gradient memory
model memory (fp16) -> 2 bytes per parameter
optimizer memory (Adam states, often kept at 32-bit precision) -> 4 bytes for the fp32 copy of the parameters, 4 bytes for momentum, 4 bytes for variance
gradients (saved in fp16) -> 2 bytes per parameter
activations -> 2 bytes * batch size * # of tokens * hidden dimension * # of layers
total memory is roughly 16 bytes * # of parameters, plus activation memory
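A back-of-the-envelope check of the roughly 16 bytes/parameter rule for a hypothetical 7B-parameter model (activations excluded, numbers purely illustrative):

params = 7e9                           # hypothetical 7B-parameter model
model_mem     = 2 * params             # fp16 weights
gradient_mem  = 2 * params             # fp16 gradients
optimizer_mem = (4 + 4 + 4) * params   # fp32 copy + momentum + variance (Adam)
total = model_mem + gradient_mem + optimizer_mem
print(total / 1e9, "GB")               # ~112 GB, i.e. about 16 bytes per parameter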
Benefit of keeping optim at 32bits?
how do memory requirements change for Zero-1, 2, 3
For ZeRO-1,
Total Memory_Training ≈ Model Memory + ((Optimizer memory) / (No. GPUs)) + Activation Memory + Gradient Mem
For ZeRO-2,
Total Memory_Training ≈ Model Memory + Activation Memory + (Optimizer Memory + Gradient Memory)/ (No. GPUs)
For ZeRO-3,
Total Memory_Training ≈ Activation Memory + (Model Memory + Optimizer Memory + Gradient Memory) / (No. GPUs)
Why does the optimiser keep a copy of the gradients along with gradient memory?
optimum scaling stated by chinchilla
D ≈ 20 x # of parameters (roughly 20 training tokens per parameter)
what happens after attention layer output?
So, the sequence is:
- Multi-headed attention
- o_proj linear projection
- Skip connection by adding the original input
- Layer normalization
- First FCN layer
- Activation Function
- Second FCN layer
- skip connection
- another layer norm
what is BLEU score?
BLEU is a precision-based metric: for each n, the numerator is the clipped count of n-gram matches and the denominator is the count of n-grams present in the generated text.
This means that if the bigram "the apple" shows up 7 times in the generated text and 2 times in the reference text, then its numerator count is clipped to 2 while its denominator count is 7.
This gives you BLEU-n for different n.
One problem here is that very short generations shrink the denominator and make the model seem good, so a brevity penalty is multiplied in.
The total BLEU score is the brevity penalty times the exponential of the weighted average of the log precisions.
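A from-scratch sketch of the clipped n-gram precision and brevity penalty that BLEU combines; real implementations (e.g. sacrebleu) differ in details such as smoothing and multiple references:

from collections import Counter
import math

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    # Clip each candidate n-gram count by its count in the reference
    clipped = {g: min(c, ref_counts[g]) for g, c in cand_counts.items()}
    return sum(clipped.values()) / max(sum(cand_counts.values()), 1)

def bleu(candidate, reference, max_n=4):
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    brevity = min(1.0, math.exp(1 - len(reference) / len(candidate)))  # penalize short outputs
    return brevity * math.exp(log_avg)

print(bleu("the cat sat on the mat".split(), "the cat is on the mat".split(), max_n=2))  # ~0.707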
techniques for regularization in LLM training
- Dropout
- Weight Decay (L2 Regularization)
- Layer Normalization
- Gradient Clipping
- Label Smoothing
what is lasso and ridge regression?
Using the l1 norm as the regularization term gives lasso; using the l2 norm gives ridge.
what is gradient clipping? what does it help with?
gradient clipping is when you clip gradients during backprop. helps with exploding gradient problem. Helps maintain stability during training.
What value is gradient clipped at
what does l1 and l2 norm do to parameter values
The l1 norm pushes them to exactly 0, making it good for variable selection; the l2 norm pushes them close to 0.
l1 norm in the parameter space graphically is a diamond and l2 norm in the parameter space is a circle.
point where the contours of loss function first touch the diamond or circle is where loss function solution lies
what does lambda in regularization parameter do?
it determines the size of the regularisation circle in the parameter space. larger lambda smaller circle
what is bias in a model
In the context of machine learning,
bias refers to the error that is introduced by approximating a real-world problem, which may be complex, by a much simpler model. simpler models can exhibit high bias and complex models high variance
bias - underfitting
variance - overfitting
what is variance in a model
Variance, refers to the error that is introduced by the model’s sensitivity to fluctuations in the training set.
if you use a complex model with lot of parameters and the parameters are allowed to take large values then it will be able to model noise along with data
If a model has too many parameters or if those parameters are allowed to take on large values, it can become extremely flexible. This means it can capture not only the underlying relationships in the data but also the noise (random fluctuations) specific to the training set. As a result, it may perform very well on the training data (low bias), but poorly on new, unseen data (high variance) because it’s overfitted to the noise and specifics of the training set rather than to the underlying data distribution.
Effect of L2 Regularization? on variance as well
Penalizing Large Weights: prevent any single feature from having too much influence on the predictions, which is desirable when you suspect some features may be correlated with noise rather than with the signal in the training data.
Smoothness and Generalization: Smaller weights often result in smoother functions that change less drastically with input variations. This smoothness means the model is less likely to pick up on noise and will therefore generalize better to unseen data.
Shrinkage Effect: The regularization term shrinks the parameter values towards zero but not exactly to zero. This effect is akin to a “soft” form of feature selection that lowers the complexity of the model without completely eliminating the contribution of any single feature.
skip connections solve what problem?
skip connections solve for the problem of vanishing gradient
how is the problem of exploding gradient solved in LLMs
gradient clipping
Accuracy formula
(TP+TN)/(FP+FN+TP+TN)
precision formula
accuracy of positive samples
= TP/(TP+FP)
Also ask gpt and write about what these tell you
recall formula
coverage of actual positive samples
= TP/(TP+FN)
Is this right ??
specificity formula
coverage of actual negative samples
= TN/(TN+FP)
f1 score
harmonic mean of precision and recall, useful for imbalanced classes
= 2TP/(2TP+FP+FN)
what is the ROC curve
plot of TPR vs FPR
recall vs (1-specificity)
allows us to compare 2 methods like logistic vs boosting
what is sensitivity?
it is the same as recall
is BLEU score precision recall or f1 score based?
BLEU is precision based
is ROUGE score precision, recall, or f1 score based?
ROUGE is recall based
BERTScore what is?
BERTScore uses BERT to create an embedding for n-grams with BERT for both output and reference and then takes dot product for similarity in the form of precision and recall and then takes harmonic mean of them both
Do you use a double loop to summation over all n grams in the precision and recall components??
steps for evaluation of LLM output
- coherence
- compile - syntax
- gpt eval - semantics
- expert eval
sources of bias in gpt automated eval
Position bias: LLMs tend to favor the response in the first position.
Verbosity bias: LLMs tend to favor longer, wordier responses over more concise ones
Self-enhancement bias: LLMs have a slight bias towards their own answers.
what is vicuna?
Vicuna-13B is a LLaMA finetune on chat data from ShareGPT.
It is one of the first papers to use GPT-4 as a judge to evaluate model outputs against different benchmarks.
What are the different eval benchmarks popular
Accuracy. recall, precision, specificity, f1 score if you have classes
BLEU, ROUGE, BERTScore if you have text as reference and the task is something like summarisation
Automatic gpt4 based evaluation
what is soft prompt tuning
prepends a trainable tensor to the model’s input embeddings, essentially creating a soft prompt. Unlike discrete text prompts, soft prompts can be learned via backpropagation, meaning they can be fine-tuned to incorporate signals from any number of labeled examples.
what is soft prefix tuning
it prepends trainable parameters to the hidden states of all transformer blocks. During fine-tuning, the LM’s original parameters are kept frozen while the prefix parameters are updated.
what is the main idea behind LORA?
when adapting to a specific task, pre-trained language models have a low intrinsic dimension and can still learn efficiently despite a random projection into a smaller subspace. Thus, LoRA hypothesized that weight updates during adaption also have low intrinsic rank.
How do you decompose into lower rank??
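A minimal sketch of the low-rank update idea in PyTorch (dimensions, init, and scaling are illustrative; real implementations such as the peft library add dropout and weight merging):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        # Frozen pretrained weight W
        self.weight = nn.Parameter(torch.randn(out_features, in_features), requires_grad=False)
        # Trainable low-rank update: delta_W = B @ A, with rank r << min(in, out)
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, r))   # zero init so training starts at W
        self.scaling = alpha / r

    def forward(self, x):
        return x @ self.weight.T + self.scaling * (x @ self.A.T @ self.B.T)

layer = LoRALinear(768, 768)
out = layer(torch.randn(2, 768))   # only A and B receive gradients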
what are main ideas introduced by QLORA
QLoRA builds on the idea of LoRA. But instead of using the full 16-bit model during fine-tuning, it applies a 4-bit quantized model
- 4-bit NormalFloat (to quantize models),
- double quantization (for additional memory savings), and
- paged optimizers (that prevent OOM errors by transferring data to CPU RAM when the GPU runs out of memory).
Give qlora paper to chatgpt and talk to it about how quantisation works there like u and sigma and normal distribution assumption and binning
What benefits do we get from reducing the precision of our model? What problems might we run into?
How to solve these problems?
Benefits: less memory, faster computation.
Problems: numerical instability is possible during training because gradients can underflow or overflow due to the lack of precision, so learning is affected or stalls.
Mixed precision training is used to solve this issue.
Details of mixed precision
what is the difference between torch.cat() and torch.stack()
Given two tensors A and B, both with shape [2, 2]:
torch.cat([A, B], dim=0) will result in a tensor of shape [4, 2].
torch.stack([A, B], dim=0) will result in a tensor of shape [2, 2, 2].
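Verifying the shapes above:

import torch

A = torch.zeros(2, 2)
B = torch.ones(2, 2)

print(torch.cat([A, B], dim=0).shape)    # torch.Size([4, 2])  - concatenates along an existing dim
print(torch.stack([A, B], dim=0).shape)  # torch.Size([2, 2, 2]) - adds a new leading dim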
how are in place operations done in pytorch
In-place operations Operations that have a _ suffix are in-place
tensor.add_(5)
above tensor is modified in place
The use of in-place operations is discouraged because they cause an immediate loss of history, which can be a problem when computing gradients.
how to load resnet18 model
from torchvision.models import resnet18
model = resnet18()
what data can be passed into resnet18 and how would you create a sample data point for it?
data = torch.randn(1, 3, 64, 64)
what do the output labels of resnet18 look like? how would you create a sample output label?
label = torch.randn(1, 1000)
how would you generate the prediction for a model on data?
prediction = model(data)
how would you calculate loss given you had a data point and prediction and do a backward pass with it
loss = (prediction - labels).sum()
loss.backward()
how would you define an optimizer (say SGD) and run a optimizer step?
optim = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
optim.step()
steps in a training pipeline in pytorch
- define model
- get data
- get labels
- define optimizer and loss
- calculate loss and do loss.backward()
- do optimizer.zero_grad() and optimizer.step()
or you define everything and do trainer.train()
what is the Transformers library?
The Transformers library from Hugging Face has models and utilities for training and fine-tuning many machine learning models. It is an API built primarily on PyTorch.
what is huggingface trainer
Trainer is a class in the Hugging Face Transformers library that lets you fine-tune pre-trained models. It lets you abstract away a lot of settings and supports mixed precision training.
what are callbacks in PyTorch
A callback is a set of methods that you can override or utilize to customize the behaviour of the Trainer; for example, you can use it to save the model periodically, log metrics, or modify the learning rate. It is useful for extending the functionality of the training loop without changing the core training logic.
You can pass callbacks as arguments to the Trainer object.
how do you define and use a callback in pytorch?
Callbacks are defined as classes: you create a class that inherits from a callback base class in the Transformers library and define a method on it that does what you want, when you want it done, for example a method that runs at the end of every epoch and does something if a condition is met (see the sketch below).
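A sketch using the Hugging Face TrainerCallback API (the callback name, metric, and threshold are made up for illustration):

from transformers import TrainerCallback

class SaveOnLowLossCallback(TrainerCallback):
    """Example: at the end of every epoch, ask the Trainer to save a checkpoint
    if the last logged training loss is below a threshold."""
    def __init__(self, loss_threshold=0.5):
        self.loss_threshold = loss_threshold

    def on_epoch_end(self, args, state, control, **kwargs):
        # state.log_history holds the logged metrics so far (may or may not contain "loss")
        if state.log_history and state.log_history[-1].get("loss", float("inf")) < self.loss_threshold:
            control.should_save = True
        return control

# Passed to the Trainer via its callbacks argument, e.g.
# trainer = Trainer(model=model, args=training_args, callbacks=[SaveOnLowLossCallback()])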
what are the benefits of using an nn.Module subclass to define your model?
Automatic Parameter Registration: This is crucial for training the model because optimizers rely on the .parameters() method of the nn.Module to get a list of all parameters that need updates.
Model Serialization and deserialization: when you save and load model
Device Management: When moving the model to a device (e.g., GPU), all parameters of all contained layers are automatically moved as well.
what is the self attention operation?
Self-attention takes you from the x_i's to the y_i's.
Each y_i is calculated by taking x_i, computing its dot product with every other x_j, and applying a softmax; each dot product becomes the scaling factor (weight) with which the weighted sum of the x_j's is taken.
Every y_i is a weighted summation of all x_j's:
y_i = ∑_j w_ij x_j
w'_{ij} = x_i^T x_j
w_{ij} = exp(w'_{ij}) / ∑_j exp(w'_{ij})
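A bare-bones numpy sketch of exactly this computation (no scaling and no learned projections):

import numpy as np

def self_attention(X):
    # X: (sequence_length, embedding_dim); each row is one x_i
    scores = X @ X.T                               # w'_ij = x_i . x_j
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax -> w_ij
    return weights @ X                             # y_i = sum_j w_ij x_j

Y = self_attention(np.random.randn(4, 8))
print(Y.shape)                                     # (4, 8)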
in attention layers why do you scale the dot product
Large dot products saturate the softmax and kill the gradient, so we divide by sqrt(d).
Why sqrt(d)? Imagine a vector in ℝ^d with all values equal to c. Its Euclidean length is c·sqrt(d). Therefore, we are dividing out the amount by which the increase in dimension increases the length of the average vector.
layer norm is applied on what values in a layer?
The layer normalization is applied over the embedding dimension only.
what is temperature? how is it implemented?
temperature sampling is dividing logits by the temperature before feeding them into softmax and obtaining our sampling probabilities.
lower temp means less random
higher temp means more random
what is top-k sampling and what is the problem with it?
in top-k sampling you sample from the top k probabilities.
the problem is that if the distribution is fairly flat over many reasonable options, we will ignore some possibilities simply because we put a hard stop at k
what is nucleus sampling?
Nucleus sampling, aka top-p sampling: we sort tokens by probability, keep the smallest set whose cumulative probability reaches p, cut off there, and sample from that set after renormalizing.
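A small sketch of nucleus (top-p) filtering over a probability vector (the probabilities are arbitrary):

import numpy as np

def nucleus_sample(probs, p=0.9, rng=np.random.default_rng()):
    order = np.argsort(probs)[::-1]                            # sort tokens by probability, descending
    sorted_probs = probs[order]
    cutoff = np.searchsorted(np.cumsum(sorted_probs), p) + 1   # smallest set reaching cumulative prob p
    keep = order[:cutoff]
    renormalized = probs[keep] / probs[keep].sum()
    return rng.choice(keep, p=renormalized)

probs = np.array([0.45, 0.30, 0.15, 0.07, 0.03])
print(nucleus_sample(probs))   # samples only from the tokens making up the top ~90% of mass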
what are the benefits of layernorm?
Stability in Training: LayerNorm stabilizes the neural network training process by ensuring that the distribution of the inputs to the activation functions in a layer does not vary too much. This reduces the problem where the learning has to constantly adjust to a shifting input distribution across layers.
What’s the risk in empirical risk minimization?
The risk in empirical risk minimization is the expectation of the loss L(h(x), y) over the joint probability distribution P(x, y), where h is the hypothesis.
If the loss is the 0-1 loss, the risk is the integral of the joint probability over the region where the prediction is wrong (loss = 1).
R(h) = E[L(h(x), y)] = ∫ L(h(x), y) dP(x, y)
what is empirical risk minimization?
You take the average of the loss on the training data and call that the empirical risk, since the true risk cannot be computed because P(x, y) is not available.
Empirical risk minimization is finding the hypothesis h that minimizes this empirical risk.
If we have a wide NN and a deep NN with the same number of parameters, which one is more expressive
and why?
Deeper networks are more expressive, since they encode an inductive bias that complex functions
can be modeled as composition of simple functions. In turn, this allows the network to learn multiple levels
of an abstraction hierarchy. Empirically, it has been shown that deeper networks lead to more compact
models with better generalization performance.
Why does L1 regularization tend to lead to sparsity while L2 regularization pushes weights closer to 0?
Because the gradient of the L1 penalty has constant magnitude 1 even close to 0, it can push weights exactly to 0; the gradient of the L2 penalty, 2w, becomes very small close to 0, so weights shrink towards 0 but rarely reach it.
what is bagging and boosting
In bagging you sample with replacement to train n classifiers and combine their votes for the final prediction.
In boosting you first train a weak learner, look at which samples it fails to classify, weight those samples higher, train another weak learner, and repeat this iteratively.
What’s the motivation for RNN?
Derived from feedforward
neural networks, RNNs can use their internal state (memory) to process variable length sequences of
inputs
What’s the motivation for LSTM?
vanishing and exploding gradient in RNNs
formula for perplexity
Perplexity(P) = 2^(-(1/N) ∑_{i=1}^{N} log_2 P(x_i))
what is the key idea in relative position embeddings
A bias term is added inside the softmax along with qk^T to signify relative positions, but doing so complicates the use of a KV cache, which slows inference.
main idea in rotary position embedding
Rotary embeddings are applied only to the q and k matrices, not to v. We multiply by a rotation matrix that rotates each vector by m × θ, where m is the position of the token.
What is done is that blocks of 2 in the embedding dimension are rotated by θ1, θ2, etc. (a different frequency per block).
This makes the attention score q^T k depend only on the relative position between tokens.
what is KV cache
During inference you don't need to recompute q, k, and v for all the previous tokens.
Usually q is needed only for the most recent token (only this one q is used), while K and V for all tokens are needed, but only one new row of each is produced per step.
So you save the K and V from the previous iterations; at each step you compute one new q, k, and v from the latest token, append the new k and v to the cache, and then do the attention calculation.
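A shape-level numpy sketch of maintaining the cache across decoding steps (the projection matrices here are random stand-ins; single head, no masking):

import numpy as np

d = 8
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
K_cache, V_cache = np.zeros((0, d)), np.zeros((0, d))

def decode_step(x_new, K_cache, V_cache):
    # Only the newest token's q, k, v are computed; old K and V come from the cache
    q = x_new @ Wq
    K_cache = np.vstack([K_cache, x_new @ Wk])
    V_cache = np.vstack([V_cache, x_new @ Wv])
    scores = q @ K_cache.T / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V_cache, K_cache, V_cache

for _ in range(5):                        # five decoding steps, one new token each
    y, K_cache, V_cache = decode_step(np.random.randn(1, d), K_cache, V_cache)
print(K_cache.shape, V_cache.shape)       # (5, 8) (5, 8)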
How are your model’s accuracy and computational efficiency affected when you decrease or increase
CNN filter size?
Increasing the filter size results in decrease in computational efficiency (since the number of model parameters increases), and an increase in accuracy up to a certain point beyond which the network can overfit and imitate a fully-connected network.
Alternatively, decreasing the filter size results in increase of computational efficiency, and a decrease in accuracy when tending towards using extremely small kernels (e.g. 1x1) which do not capture the local structure of the inputs properly
Convolutional layers are also known as “locally connected”. Explain what it means
Each filter operates only on a small neighborhood (e.g. 3x3) around a given pixel and is applied across the entire spatial domain. This sort of weight sharing drastically reduces the number of parameters and injects a "spatial equivariance" bias.
What does a 1x1 convolutional layer do?
A 1x1 conv layer mixes information across channels at each spatial location; it is typically used to reduce (or otherwise change) the number of channels in the input activation volume.
What happens when pooling is removed completely?
The main purpose of the pooling operation is to increase the receptive field of the network
by non-parametric techniques. If pooling is completely removed, then we’d need an increasingly large
stack of convolutional layers to achieve large enough receptive fields for the neurons located in the
deep layers. Of course, this comes at the cost of drastically increasing the number of parameters and
computation requirements.
how to calculate cosine similarity in python
A = [5, 3, 4]
B = [4, 2, 4]
dot_product = sum(a * b for a, b in zip(A, B))
magnitude_A = sum(a * a for a in A) ** 0.5
magnitude_B = sum(b * b for b in B) ** 0.5
cosine_similarity = dot_product / (magnitude_A * magnitude_B)
print(f"Cosine Similarity using standard Python: {cosine_similarity}")
How would a finite or infinite horizon affect our algorithms?
in finite horizon when we’re thinking of reward for an action we’re only considering rewards in a finite set of future states.
for infinite horizon we consider future rewards of all future states
infinite horizons are more common
Why do we need the discount term for objective functions
using a discount rate (smaller than 1) is a mathematical trick
to make an infinite sum finite.
what is the q learning update rule
Q(S_t, A_t) ← Q(S_t, A_t) + α(R_{t+1} + γmax_a Q(S_{t+1}, a) - Q(S_t, A_t))
γ - > discount rate
alpha -> learning rate
experience tuple in RL
(s_t, a_t, r_t, s_{t+1})
how do variational autoencoders work
Put an image through the encoder, encode it to a mean and standard deviation (log standard deviation for numerical stability), sample a latent by scaling and shifting a standard normal sample with those parameters, and try to reconstruct the input image with the decoder.
The reconstruction of the input image trains both the encoder and the decoder, and the KL divergence between the encoder's output distribution and a standard normal prior acts as a regularizer.
what is the formula for vanilla gradient update?
w_{t+1} = w_t − α · (1/N) ∑_{i=1}^{N} ∇_w L_i(w_t, y_i, ŷ_i)
Implement vanilla dropout for the forward and backward pass in NumPy
Things to note here are how the input and the gradient are multiplied by the mask, how the mask is created, and how scaling is done post-dropout (inverted dropout).
import numpy as np

def forward_pass_with_dropout(X, dropout_rate):
    """
    Apply dropout to the input layer X during the forward pass.
    :param X: Input data for the layer, numpy array of shape (n_features, n_samples)
    :param dropout_rate: The probability of setting a neuron's output to 0
    :return: A tuple (output after applying dropout, dropout mask used)
    """
    # Create a mask using the dropout rate, setting to 0 for the dropped units
    dropout_mask = np.random.rand(*X.shape) > dropout_rate
    # Apply the mask to the input data
    dropped_out_X = np.multiply(X, dropout_mask)
    # During training, scale the data so the expected value is unchanged
    dropped_out_X /= (1 - dropout_rate)
    return dropped_out_X, dropout_mask
def backward_pass_with_dropout(dA, dropout_mask, dropout_rate):
    """
    Apply the stored mask to the gradient during the backward pass.
    :param dA: Gradient of the loss with respect to the activations, numpy array
    :param dropout_mask: The dropout mask that was used during the forward pass
    :param dropout_rate: The probability of setting a neuron's output to 0
    :return: Gradient after applying dropout mask
    """
    # Apply the dropout mask to the gradients
    dA_with_dropout = np.multiply(dA, dropout_mask)
    # Scale the gradients as we did during the forward pass
    dA_with_dropout /= (1 - dropout_rate)
    return dA_with_dropout

# Example usage:
np.random.seed(0)              # for reproducibility
X = np.random.randn(5, 3)      # 5 features, 3 samples
dropout_rate = 0.2             # 20% dropout rate

# Forward pass
X_dropped, mask = forward_pass_with_dropout(X, dropout_rate)

# Suppose we have some gradient dA from the backward pass
dA = np.random.randn(5, 3)

# Backward pass
dA_dropped = backward_pass_with_dropout(dA, mask, dropout_rate)
why are RNNs susceptible to vanishing and exploding gradients and how to solve?
Backprop through time (BPTT) is the reason for vanishing and exploding gradient
Gated Units: Using RNN variants with gating mechanisms, such as LSTM (Long Short-Term
Memory)
Gradient Clipping:
Using Skip Connections:
Proper Initialization and Activation Functions
When training a large neural network, say a language model with a billion parameters, you
evaluate your model on a validation set at the end of every epoch. You realize that your validation
loss is often lower than your train loss. What might be happening?
It might be that the training loss uses regularizers (e.g. L2 norm on the weights). Since at validation time we only evaluate the main loss function, it might happen that the validation loss is lower than
the (composite) train loss.
The model might have dropout layers, which impose heavy regularization during training. These are layers that behave differently during training (dropping out neurons) and inference time
The validation set might simply be easier compared to the train set.
What criteria would you use for early stopping?
We can stop training when a metric of interest (validation loss, accuracy, etc.) stops improving. It is a good idea to use a patience parameter (the number of evaluations to wait without improvement before stopping) in order to avoid reacting to noisy estimates.
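A minimal sketch of the patience logic in Python; the surrounding training loop and checkpointing are omitted, and the loss history below is made-up data for illustration:
def early_stopping_monitor(val_losses, patience=5):
    # Return True if the last `patience` evaluations showed no improvement
    # over the best validation loss seen before them.
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_before

# Example: the validation loss stops improving after the fifth evaluation
history = [1.0, 0.8, 0.7, 0.65, 0.64, 0.66, 0.65, 0.67, 0.66, 0.68]
print(early_stopping_monitor(history, patience=5))  # True -> stop training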
Gradient descent vs SGD vs mini-batch SGD.
* In gradient descent we first perform a forward pass and a backward pass for each sample in the
dataset, before we take a step in the direction of the cumulative gradient. This is extremely slow
to converge, as today’s datasets are extremely large in size, which implies that we perform gradient
updates too rarely.
* In SGD, after performing a forward and a backward pass for a single sample, we take a step in the
direction of the single gradient. Even though we perform gradient updates much more often, the
entire process is extremely noisy as we are optimizing with respect to a single sample at a time.
* Mini-batch SGD combines the best of both worlds: perform a forward/backward pass for a batch
(e.g. 32) of samples, and take a step in the direction of the gradient for the current mini-batch. On
one hand, we perform updates more often than pure gradient descent; on the other hand, we optimize
over an entire mini-batch, minimizing the noise in the estimated gradient.
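A minimal sketch of the mini-batch SGD loop described above, for a linear model in NumPy (the toy data, model and hyperparameters are illustrative assumptions):
import numpy as np

np.random.seed(0)
X = np.random.randn(1000, 10)   # 1000 samples, 10 features (toy data)
y = np.random.randn(1000)
w = np.zeros(10)
lr, batch_size, n_epochs = 0.01, 32, 5

for epoch in range(n_epochs):
    perm = np.random.permutation(len(X))      # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        X_b, y_b = X[idx], y[idx]
        preds = X_b @ w
        # Gradient of the mean squared error over this mini-batch only
        grad = 2 * X_b.T @ (preds - y_b) / len(idx)
        w -= lr * grad                        # one update per mini-batch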
Your model’s weights fluctuate a lot during training. How does that affect your model’s performance?
What to do about it?
The fluctuation can be attributed to two primary factors: large learning rate or exploding gradients. In either case, this can seriously destabilize the training process. In order to resolve this issue, we could:
1) lower the learning rate;
2) perform gradient clipping
Draw a graph of training error vs. number of training epochs when the learning rate is: i) too high;
ii) too low; iii) acceptable.
- if too high, the loss decreases fast at first but oscillates and stabilises at a high value (or even diverges)
- if just right, the loss decreases steadily and stabilises at a low value
- if too low, the loss decreases very slowly
What’s learning rate warmup? Why do we need it?
Warmup steps are just a few updates with low learning rate at the beginning of training.
After this warmup, you use the regular learning rate (schedule) to train your model to convergence.
For example, RMSProp computes a moving average of the squared gradients to get an estimate of the variance in the gradients for each parameter. For the first update, this estimate is based only on the squared gradients of the very first batch. Since, in general, this will not be a good estimate, your first update could push your network in a wrong direction. To avoid this problem, you give the optimiser a few steps to estimate the variance while making as few changes as possible (low learning rate), and only when the estimate is reasonable do you use the actual (high) learning rate.
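A minimal sketch of a linear warmup followed by a constant learning rate (base_lr and warmup_steps are illustrative values; in practice a decay schedule usually follows the warmup):
def learning_rate_at(step, base_lr=1e-3, warmup_steps=1000):
    # Linearly ramp the learning rate from ~0 up to base_lr over warmup_steps,
    # then keep it constant.
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr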
Compare batch norm and layer norm.
Suppose the output of the previous layer is X ∈ R^(B×D), where B is the batch size and D is the dimensionality of the embedding. Both techniques normalize the input as follows:
Y = (X − E[X]) / sqrt(Var[X] + ϵ) ∗ γ + β
where γ and β are learnt affine parameters, and all operations are treated as broadcasts. The main
difference stems from how the two techniques compute the statistics:
* Batch norm computes the mean and standard deviation over the batch, meaning that E[X] ∈ R^(1×D) and Var[X] ∈ R^(1×D).
* Layer norm computes the mean and standard deviation over the features, meaning that E[X] ∈ R^(B×1) and Var[X] ∈ R^(B×1).
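A minimal NumPy sketch of the difference in which axis the statistics are computed over (γ, β and the running statistics used at inference are omitted for brevity):
import numpy as np

X = np.random.randn(32, 64)   # B = 32 samples, D = 64 features
eps = 1e-5

# Batch norm: statistics over the batch axis -> one mean/var per feature, shape (1, D)
bn_mean = X.mean(axis=0, keepdims=True)
bn_var = X.var(axis=0, keepdims=True)
X_bn = (X - bn_mean) / np.sqrt(bn_var + eps)

# Layer norm: statistics over the feature axis -> one mean/var per sample, shape (B, 1)
ln_mean = X.mean(axis=1, keepdims=True)
ln_var = X.var(axis=1, keepdims=True)
X_ln = (X - ln_mean) / np.sqrt(ln_var + eps)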
Some models use weight decay: after each gradient update, the weights are multiplied by a factor slightly
less than 1. What is this useful for?
If you use L2 regularization, then in the gradient update rule you are effectively multiplying w by (1 − α·λ) before applying the gradient of the data loss; this is what is called weight decay, and it keeps the weights small, acting as a regularizer. PyTorch implements weight decay through the weight_decay argument of its optimizers.
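A tiny NumPy check that, for plain SGD, the explicit L2 update and the multiplicative weight-decay form are the same step (the values are illustrative):
import numpy as np

w = np.array([0.5, -1.2, 2.0])
grad_L = np.array([0.1, -0.3, 0.05])   # gradient of the data loss (illustrative)
alpha, lam = 0.1, 0.01                 # learning rate and weight decay coefficient

# Update with the L2 term included in the gradient
w_l2 = w - alpha * (grad_L + lam * w)

# Equivalent "weight decay" form: shrink w, then apply the data gradient
w_decay = (1 - alpha * lam) * w - alpha * grad_L

assert np.allclose(w_l2, w_decay)
For adaptive optimizers such as Adam the two forms are no longer equivalent, which is why decoupled weight decay (AdamW) treats the decay step separately.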
It’s a common practice for the learning rate to be reduced throughout training.
i. What’s the motivation?
ii. What might be the exceptions
The learning rate controls the size of the update steps along the gradient: large steps early on make fast progress towards a good region of the loss landscape, while smaller steps later allow the optimization to settle into a minimum instead of bouncing around it.
In Adam, the moving average of the squared gradients already adapts the effective per-parameter step size, so an explicit decay schedule matters somewhat less (although one is still commonly used).
An exception would be continual learning, where the model must keep adapting to new data, so the learning rate should not be decayed towards zero.
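A minimal sketch of a step-decay schedule of the kind described above (the base rate, decay factor and interval are illustrative values):
def step_decay_lr(epoch, base_lr=0.1, drop_factor=0.5, epochs_per_drop=10):
    # Halve the learning rate every epochs_per_drop epochs.
    return base_lr * (drop_factor ** (epoch // epochs_per_drop))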
How should we adjust the learning rate as we increase or decrease the batch size?
When using larger batches, we can afford larger learning rates, as the gradient estimated on the batch is closer to the true gradient. On the other hand, very small batches yield noisy estimates of the gradient, so it is advisable to also use small learning rates so that we don't diverge in our optimization procedure.
What can you say about the ability to converge and generalize of Adam vs. SGD?
solutions found by adaptive methods (e.g. Adam) generalize
worse than SGD, even when these solutions have better training performance. In other words,
adaptive methods converge faster, but have worse generalization performance than pure SGD.
SGD vs Adam optimizer formula
SGD is the plain gradient update scaled by the learning rate.
Adam additionally keeps a momentum term and a variance term: momentum is an exponential moving average of the gradient, and the variance term is an exponential moving average of the squared gradient; the step divides the (bias-corrected) momentum by the square root of the (bias-corrected) variance estimate.
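A minimal NumPy sketch of both update rules (the hyperparameter values shown are the commonly used defaults, included only for illustration):
import numpy as np

def sgd_step(w, grad, lr=0.01):
    return w - lr * grad

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # m: moving average of the gradient (momentum)
    # v: moving average of the squared gradient (variance term)
    # t: timestep, starting at 1
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)   # bias correction for the first steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v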
asynchronous SGD vs. synchronous SGD in distributed training
The choice between asynchronous and synchronous SGD in a distributed training environment depends on the specific requirements of the training process and the trade-off between speed and stability. Asynchronous SGD offers faster training because workers never wait for each other, but it can suffer from stale gradients (updates computed on outdated parameters), while synchronous SGD provides more stable convergence at the cost of potentially slower training, since every step waits for the slowest worker.
Why don’t we just initialize all weights in a neural network to zero?
If all weights are initialized to the same value, every neuron in a layer computes the same output and, during backpropagation, receives the same local derivatives. All neurons in the layer therefore perform the same update and never differentiate from one another, so the symmetry is never broken.
What are some sources of randomness in a neural network?
Random weight initialization, mini-batch shuffling, dropout, etc
what are some ways in which neural networks benefit from randomness during training?
1) the gradient noise in SGD helps escape poor local minima and saddle points
2) randomness in output sampling for LLMs, and in action sampling when RL agents play games such as chess, encourages exploration
3) dropout prevents the network from becoming too dependent on any individual weights
What’s a dead neuron?
A dead neuron in a neural network is a neuron that always outputs the same value, regardless of the
input. This typically happens when the neuron’s activation function is non-linear, like ReLU (Rectified
Linear Unit), and the input to the neuron is such that the activation function always outputs the
minimum value (which is 0 in the case of ReLU). For ReLU, this happens when the weights and biases
of the neuron are adjusted during training in such a way that the weighted sum of the inputs is always
negative, leading to an output of 0
how do we detect dead neurons and prevent them
We can monitor activations over a representative set of inputs; if a neuron's activation is 0 (or constant) for essentially all inputs, it is dead.
This can be addressed by using Leaky ReLU, layer norm, or proper weight initialization.
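A minimal NumPy sketch of that detection idea: record ReLU activations over a batch of inputs and flag units that never fire (the single random layer here is purely illustrative; with a healthy initialization the list will usually be empty, but it grows if units die during training):
import numpy as np

np.random.seed(0)
X = np.random.randn(256, 100)            # 256 inputs, 100 features
W = np.random.randn(100, 50)             # one illustrative layer with 50 units
b = np.random.randn(50)

activations = np.maximum(0, X @ W + b)   # ReLU activations, shape (256, 50)

# A unit is "dead" if it outputs 0 for every input in the batch
dead_units = np.where((activations == 0).all(axis=0))[0]
print("dead units:", dead_units)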
what is pruning and how do you decide what to prune?
- Magnitude. Remove weights whose magnitude is close to 0; remove neurons whose weight vector has an L2 norm close to 0.
- Activations. We can also use the training data to observe the activations of the neurons. We can remove neurons whose distribution of activations is extremely peaked (i.e. nearly invariant to the input); moreover, we can remove a neuron if its activation pattern is highly correlated with another neuron in the same layer.
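A minimal NumPy sketch of magnitude pruning: zero out the fraction of weights with the smallest absolute value (the 20% sparsity level is an illustrative choice):
import numpy as np

def magnitude_prune(W, sparsity=0.2):
    # Zero out the `sparsity` fraction of weights with the smallest |value|.
    threshold = np.quantile(np.abs(W), sparsity)
    mask = np.abs(W) >= threshold
    return W * mask, mask

W = np.random.randn(64, 64)
W_pruned, mask = magnitude_prune(W, sparsity=0.2)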