interview_ML Flashcards

1
Q

We compute a checksum for precise file deduplication, and compress the files for storage. What is meant by this?

A

A checksum gives you a numerical representation of the contents of a file, used to verify the integrity of the data in the file.
We can remove duplicate files that have the same content but different names using this method.

In the context of the described data collection process, computing a checksum means generating a unique, fixed-size numerical value (often represented as a string of letters and numbers) that is calculated from the contents of a file or a piece of data. This checksum serves multiple purposes:

Integrity Verification: The checksum helps verify the integrity of the data or files over time. If the file is altered in any way, even a small change, recomputing the checksum will result in a different value. This allows for easy detection of corruption or unintended changes.

Deduplication: In the process described, the checksum is used for precise file deduplication. By computing and comparing checksums, it is possible to identify and eliminate duplicate files in the dataset, even if the files are named differently. This is because the checksum is based on the file content, not the file name or other metadata.

2
Q

what data preprocessing steps were done at keysight?

A

You can compute checksums (in Python) for deduplication, and you can weed out some files based on the version of the code.
Excessively large code files were filtered out, as they were probably auto-generated.
The file path was added at the beginning of each file's data.

Tokenisation was done to create chunks, and then a synthetic dataset was created by querying either Llama 2 or the GPT API, depending on data sensitivity.

Something I didn't do during the Keysight data work was dependency ordering. I read about it later in DeepSeek Coder and it makes a lot of sense: they give pseudocode for it, their code performance is state of the art, and this probably has something to do with it.
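As a rough sketch of the checksum-based dedup and size filtering described above (illustrative only: the hashing algorithm, size threshold, and file pattern are assumptions, not the actual Keysight pipeline):

import hashlib
from pathlib import Path

MAX_BYTES = 1_000_000  # assumed cutoff for "excessively large" (likely auto-generated) files

def dedup_and_filter(root: str):
    """Keep one representative path per unique file content under root."""
    seen = {}  # checksum -> first path seen with that content
    for path in Path(root).rglob("*.py"):
        data = path.read_bytes()
        if len(data) > MAX_BYTES:
            continue  # filter excessively large files
        digest = hashlib.sha256(data).hexdigest()  # checksum of the content, not the name
        if digest not in seen:
            seen[digest] = path  # later files with the same digest are duplicates
    return list(seen.values())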

3
Q

Positional Encoding Calculation and code (formula) also the rationale behind it

A

import numpy as np

def positional_encoding(sentence_length, model_dim):
    # model_dim is assumed even: each sin/cos pair fills dimensions (i, i+1)
    pos_enc_matrix = np.zeros((sentence_length, model_dim))
    for pos in range(sentence_length):
        for i in range(0, model_dim, 2):
            pos_enc_matrix[pos, i] = np.sin(pos / (10000 ** (i / model_dim)))
            pos_enc_matrix[pos, i + 1] = np.cos(pos / (10000 ** (i / model_dim)))
    return pos_enc_matrix

Rationale behind the Formula
Using a sine function for even indices and a cosine function for odd indices lets each pair of dimensions encode position as a point on a circle rotating at its own frequency, which allows the model to capture phase relationships across sequences at different scales.

This approach offers a "hackish" way to represent positional information at varying scales: the high-frequency components resolve fine-grained differences between nearby positions, while the low-frequency components capture coarse, long-range position.

The specific constant 10000 (the base of the frequency progression, i.e. 1/10000^(i/d)) is introduced to prevent the function from saturating: it guarantees wavelengths long enough that distant positions remain distinguishable.

Which positional encoding is used in LLaMA? (Rotary position embeddings, RoPE; see the rotary position embedding card later in this deck.)

4
Q

mu and sigma for layer norm are calculated during training or runtime

A

mu and sigma are computed at runtime from each data sample (during both training and inference), while gamma and beta are learned parameters.

5
Q

in the transformer architecture layer norm is done where?

A

In the original (post-LN) Transformer: after each sub-layer's linear output and after the skip connection is added, before any subsequent activation.

This means the workflow within each sub-block of a transformer is as follows:

  1. Input from the previous layer
  2. Addition of the skip connection output
  3. Layer normalization
  4. Activation function (if any, depending on the sub-block)

However, some implementations and variations might apply layer normalization before adding the skip connection output. Both methods have been explored in practice, with different effects on training dynamics and performance. The original approach (normalization after the residual connection) is more common and is often preferred because it allows the model to preserve the raw output of each layer before it is normalized, which can be beneficial for learning complex dependencies.

6
Q

i. [E] What’s the geometric interpretation of the dot product of two vectors?

A

Multiplication of the length of one vector and the length of the projection of the other
vector onto the first one.

7
Q

[E] Given a vector u, find vector v of unit length such that the dot product of u and v is
maximum

A

Given a vector u, the vector v of unit length that maximizes the dot product u · v is the vector that points
in the same direction as u. The vector v can be found by dividing u by its own magnitude, making it a
unit vector.

8
Q

Give an example of how the outer product can be useful in ML.

A

The covariance matrix is a commonly used quantity in machine learning algorithms (e.g. PCA). Given a dataset X ∈ R^(n×d) with n samples and d features, we calculate the (empirical) covariance as follows:

Cov[X] = (1/n) ∑_{i=1}^{n} (x_i − x̄)(x_i − x̄)^T

where each term (x_i − x̄)(x_i − x̄)^T is an outer product of a feature vector with itself, and x̄ is the mean feature vector: x̄ = (1/n) ∑_{i=1}^{n} x_i
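A minimal NumPy sketch of this outer-product view (illustrative; np.cov differs only in the 1/n vs. 1/(n−1) convention):

import numpy as np

n, d = 100, 3
X = np.random.randn(n, d)      # n samples, d features
x_bar = X.mean(axis=0)         # mean feature vector

# Covariance as an average of outer products (x_i - x_bar)(x_i - x_bar)^T
cov = sum(np.outer(x - x_bar, x - x_bar) for x in X) / n

# The same computation as a single matrix product
assert np.allclose(cov, (X - x_bar).T @ (X - x_bar) / n)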

9
Q

[E] What does it mean for two vectors to be linearly independent?

A

Two vectors are linearly independent if no scalar multiple of one vector equals the other. This means
they do not lie on the same line through the origin and neither can be formed by scaling the other.

10
Q

[M] Given two sets of vectors A = {a_1, …, a_n} and B = {b_1, …, b_m}, how do you check that they span the same subspace (share the same basis)?

A

You first have to make sure that span(A) and span(B) have the same dimension, i.e. that the matrices with these vectors as rows have the same rank. This can be done by converting A and B into row echelon form, or via the Singular Value Decomposition (SVD): decompose A into the product of three matrices, A = U Σ V^T, where U and V are orthogonal and Σ is diagonal; the number of non-zero diagonal entries in Σ is the rank of A.

Once the ranks match, take the r vectors that span the row space of each and show that every row of A is spanned by the rows of B and vice versa (equivalently, check that stacking A and B together does not increase the rank).
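A compact rank-based version of this check (a sketch; numpy.linalg.matrix_rank is itself SVD-based):

import numpy as np

def same_span(A: np.ndarray, B: np.ndarray) -> bool:
    """A and B hold one vector per row; True if the rows span the same subspace."""
    rank_a = np.linalg.matrix_rank(A)
    rank_b = np.linalg.matrix_rank(B)
    rank_ab = np.linalg.matrix_rank(np.vstack([A, B]))  # needs equal ambient dimension
    # Stacking adds no new directions exactly when the spans coincide
    return rank_a == rank_b == rank_ab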

11
Q

[M] Given n vectors, each of d dimensions. What is the dimension of their span?

A

Given n vectors, each of d dimensions, the dimension of their span equals the number of linearly independent vectors among them. It is at most min(n, d); if the vectors are not all linearly independent, it will be strictly less than min(n, d).

You can think of this as a matrix whose n columns are the d-dimensional vectors: the dimension of the span is the rank of that matrix, which is at most min(n, d).

12
Q

[E] What's a norm? What are the L0, L1, L2, and L∞ norms?

A

A norm is a function that assigns a strictly positive length or size to each vector in a vector space, except for the zero vector, which is given a length of zero.
The L0 "norm" is the number of nonzero elements in a vector (not a true norm, since it is not homogeneous);
the L1 norm is the sum of the absolute values of the vector elements;
the L2 norm (also known as the Euclidean norm) is the square root of the sum of the squares of the vector elements;
and the L∞ norm is the maximum absolute value of the elements in the vector.

13
Q

[M] How do norm and metric differ? Given a norm, make a metric. Given a metric, can we make a norm?

A

Ans. A metric measures distances between pairs of things while a norm measures the size of a single item. Metrics can be defined on pretty much anything, while the notion of a norm applies only to vector spaces: the very definition of a norm requires that the things measured by the norm could be added and scaled. If you have a norm, you can define a metric by saying that the distance between a and b is the size of a - b.

On the other hand, if you have a metric you can't usually define a norm: the metric would have to be translation-invariant and homogeneous (d(ax, ay) = |a|·d(x, y)) for ||x|| := d(x, 0) to be a norm.

14
Q

[E] Why do we say that matrices are linear transformations?

A

Matrices represent linear transformations because they map linear combinations of vectors to other linear
combinations.

Specifically, for a matrix M and vectors u and v, and a scalar c, the following properties hold true, which are
the properties of linear transformations:
M(u + v) = Mu + Mv
Applying M to the sum of two vectors is the same as summing the results of applying M to each
vector individually.
M(cu) = c(Mu)
Applying M to a scalar multiple of a vector is the same as multiplying the result of applying M by
that same scalar.
These properties demonstrate that matrix multiplication exhibits the key aspects of a linear transformation:
1. Additivity - Applying the transformation to vector sums gives the sum of individually transformed
vectors
2. Homogeneity - Scalars distribute across the transformation
Thus, matrices and matrix multiplication inherently represent and operate as linear transformations
between vector spaces.

15
Q

What’s the inverse of a matrix? Do all matrices have an inverse? Is the inverse of a matrix
always unique?

A

The inverse of a matrix A is another matrix A^-1 such that AA^-1 = A^-1A = I, where I is the identity matrix.
Not all matrices have an inverse; only square matrices that are non-singular (with a non-zero determinant)
have an inverse. When a matrix has an inverse, it is always unique.

16
Q

[E] What does the determinant of a matrix represent?

A

The determinant of a matrix can be interpreted as a scaling factor for the transformation that the matrix
represents. Geometrically, it represents the volume scaling factor of the linear transformation described by
the matrix, including whether the transformation preserves or reverses the orientation of the space.

17
Q

[E] What happens to the determinant of a matrix if we multiply one of its rows by a scalar t ∈ R?

A

Multiplying a row of a matrix by a scalar t scales the determinant of the matrix by that scalar. So if the
original determinant was d, the new determinant will be t×d.

18
Q

[M] A 4×4 matrix has four eigenvalues 3,3,2,−1. What can we say about the trace and the
determinant of this matrix?

A

The trace of the matrix, which is the sum of its eigenvalues, would be 3+3+2−1=7. The determinant of the
matrix, which is the product of its eigenvalues, would be 3×3×2×(−1)=−18

19
Q

what is the determinant of a matrix with linearly dependent rows or columns

A

0

20
Q

[M] What’s the difference between the covariance matrix A^TA and the Gram matrix AA^T?

A

Suppose A ∈ R^(n×d), corresponding to n samples each having d features. Then the covariance matrix A^T A ∈ R^(d×d) captures the "similarity" between features, whereas the Gram matrix A A^T ∈ R^(n×n) captures the "similarity" between samples.

21
Q

Given A ∈ R^(n×m) and b ∈ R^n

Find x such that: Ax=b

When does this have a unique solution?

Why is it that when A has more columns than rows, Ax=b can have multiple solutions?

Given a matrix A with no inverse. How would you solve the equation Ax=b?

A

Unique Solution

Square and Full Rank: If A is square (n = m) and has full rank (rank(A) = m = n), then A is invertible. The equation Ax = b has a unique solution given by x = A^(-1)b.

Non-Square but Full Column Rank: If A is not square (m ≠ n), but rank(A) = m (full column rank) and m < n (tall matrix), a left inverse A_L of A exists such that A_L A = I. This configuration implies Ax = b has a unique solution when b is in the range (column space) of A. The solution can be given by x = A_L b. In this scenario, A_L = (A^T A)^(-1)A^T and there is no null space of A (nullity(A) = 0) because all columns are linearly independent.

Multiple or No Solutions

Underdetermined System (m > n): If A is a wide matrix (more columns than rows), the rank of A can be at most n, and the null space of A (nullity(A)) has dimension m - n (assuming rank(A) = n). This situation typically means:

No Solution: If b is not in the column space of A, then Ax = b has no solution.

Infinitely Many Solutions: If b is in the column space of A, there are infinitely many solutions because there are free variables associated with the non-trivial null space of A.

When A has no inverse, one typically solves Ax = b in the least-squares sense using the pseudoinverse: x = A⁺b (see the next card).

22
Q

What is the pseudoinverse and how to calculate it?

A

Method 1: Singular Value Decomposition (SVD)
The most reliable method for computing the pseudoinverse of any matrix, whether square or rectangular, full rank or rank-deficient, is using its Singular Value Decomposition (SVD). The SVD of matrix A is given by:
A = U Σ V^T
where:
U is an n×n orthogonal matrix whose columns are the left singular vectors of A.
V is an m×m orthogonal matrix whose columns are the right singular vectors of A.
Σ is an n×m diagonal matrix with non-negative real numbers on the diagonal, known as the singular values.
The pseudoinverse A+ is then calculated as:
A+ = V Σ+ U^T
Here, Σ+ is obtained by taking the reciprocal of each non-zero singular value in Σ, and transposing the matrix.

Method 2: Using Matrix Transpose and Inversion
For matrices that are either full row rank or full column rank, the pseudoinverse can also be computed directly using matrix transposes and inversion:
Full Column Rank (m≤n and rank(A)=m):
A+ = (A^T A)^(-1) A^T
Full Row Rank (m≥n and rank(A)=n):
A+ = A^T (A A^T)^(-1)
Note: These formulas assume that A has full column or row rank, respectively.

In practice: form A A^T or A^T A, whichever gives you a full-rank (and hence invertible) matrix, and use it in the corresponding formula above to find the pseudoinverse.
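A small NumPy sketch contrasting the two methods (np.linalg.pinv is the SVD route):

import numpy as np

A = np.random.randn(6, 3)                    # tall matrix, full column rank (a.s.)
pinv_svd = np.linalg.pinv(A)                 # Method 1: SVD-based
pinv_formula = np.linalg.inv(A.T @ A) @ A.T  # Method 2: full-column-rank formula

assert np.allclose(pinv_svd, pinv_formula)
assert np.allclose(pinv_svd @ A, np.eye(3))  # A+ acts as a left inverse here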

23
Q

What does the derivative represent?

A

Answer: The derivative of a function measures the sensitivity to change in the function output with
respect to a change in the input.
Moreover, when it exists, the derivative at a given point is the slope of the tangent line to the graph
of the function at that point. The tangent line is the best linear approximation of the function at
that input value. This is the reason why in gradient descent we (slowly) move in the (negative)
direction of the derivative

24
Q

What’s the difference between derivative, gradient, and Jacobian?

A

When f: R → R, we calculate the derivative df/dx.

When f: R^n → R, we calculate the gradient:
∇f = [∂f/∂x1, ∂f/∂x2,…, ∂f/∂xn]

When f: R^n → R^m, we calculate the Jacobian (an m×n matrix):
Jac(f) = [
∂f1/∂x1 … ∂f1/∂xn
⋮ ⋱ ⋮
∂fm/∂x1 … ∂fm/∂xn
]

25
Q

Say we have the weights w ∈ R^(d×m) and a mini-batch x of n elements, each element is of the shape 1 × d, so that x ∈ R^(n×d). We have the output y = f(x; w) = xw. What’s the dimension of the Jacobian ∂y/∂x?

A

First, notice that y ∈ R^(n×m). With that said, Jac_x(f) ∈ R^((n×m)×(n×d)), or equivalently Jac_x(f) ∈ R^((n·m)×(n·d)), given that we have reshaped the 4-dim tensor into a 2-dim tensor, i.e. a matrix.

In general, the Jacobian ∂y/∂x has dimension (dim y) × (dim x).

26
Q

Given a very large symmetric matrix A ∈ R^(n×n) that doesn't fit in memory, and a function f that can quickly compute the matrix-vector product f(x) = Ax, find the unit vector x so that x^T Ax is minimal.
Hint: Can you frame it as an optimization problem and use gradient descent to find an approximate solution?

A

To find the unit vector x that minimizes x^T Ax, we can frame this as an optimization problem and approach it using an iterative algorithm like (projected) gradient descent or the conjugate gradient method. The gradient of x^T Ax is 2Ax (for symmetric A), which we can evaluate through f without materializing A; at each step we move in the direction −2Ax and then re-normalize x to keep it a unit vector. We iterate this process until convergence. (The minimizer is the eigenvector corresponding to the smallest eigenvalue of A, and the minimum value is that eigenvalue.)
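A minimal sketch of this projected gradient descent; a small matrix stands in for the matrix-free oracle f(x) = Ax, and the step size and iteration count are arbitrary choices:

import numpy as np

def min_quadratic_form(matvec, n, lr=0.01, steps=5000):
    """Approximately minimize x^T A x over unit vectors, given matvec(x) = Ax."""
    x = np.random.randn(n)
    x /= np.linalg.norm(x)
    for _ in range(steps):
        x = x - lr * 2 * matvec(x)   # gradient of x^T A x is 2Ax (A symmetric)
        x /= np.linalg.norm(x)       # project back onto the unit sphere
    return x

A = np.random.randn(5, 5)
A = (A + A.T) / 2                    # symmetrize for the demo
x = min_quadratic_form(lambda v: A @ v, 5)
print(x @ A @ x, np.linalg.eigvalsh(A).min())  # the two values should be close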

27
Q

Why do we need dimensionality reduction?

A

Dimensionality reduction is used to reduce the number of random variables under consideration and can
be divided into feature selection and feature extraction. It helps in reducing the time and storage space
required, removes multicollinearity, enhances the interpretation of the parameters, helps in visualizing data,
and most importantly, it can help in avoiding the curse of dimensionality.

28
Q

Eigendecomposition is a common factorization technique used for dimensionality reduction. Is the eigendecomposition of a matrix always unique?

A

The decomposition is not always unique. Suppose A ∈ R^(2×2) has two equal eigenvalues λ1 = λ2 = λ, with corresponding eigenvectors u1, u2. Then:
Au1 = λ1u1 = λu1
Au2 = λ2u2 = λu2
Or written in matrix form:
A [u1 u2] = [u1 u2] [λ 0; 0 λ]
Notice that we can permute the matrix of eigenvectors (thus obtaining a different factorization):
A [u2 u1] = [u2 u1] [λ 0; 0 λ]
But we still end up with the same eigen-properties:
Au2 = λu2
Au1 = λu1

29
Q

Name some applications of eigenvalues and eigenvectors

A

PCA (the principal components are eigenvectors of the data's covariance matrix). Other examples: spectral clustering, PageRank (the ranking vector is the principal eigenvector of the link matrix), and classifying critical points via the eigenvalues of the Hessian.

30
Q

We want to do PCA on a dataset of multiple features in different ranges. For example, one is in the range 0-1 and one is in the range 10 - 1000. Will PCA work on this dataset?

A

In PCA we are interested in the components that maximize the variance. If one component (e.g. human height) varies less than another (e.g. weight) because of their respective scales (meters vs. kilos), PCA might determine that the direction of maximal variance more closely corresponds with the ‘weight’ axis, if those features are not scaled. Since a change in height of one meter should be considered much more important than the change in weight of one kilogram, the previous assumption would be incorrect. Therefore, it is important to standardize the features before applying PCA.

31
Q

Under what conditions can one apply eigendecomposition? What about SVD?

A

Eigendecomposition is possible only for (square) diagonalizable matrices. On the other hand, the Singular Value Decomposition (SVD) always exists (even for non-square matrices).

32
Q

What’s the relationship between PCA and SVD?

A

Suppose we have data X ∈ R^(n×d) with n samples and d features. Moreover, assume that the data has been centered so that the mean of each feature is 0. Then, we can perform PCA in two main ways:
First, we compute the covariance matrix C = 1/(n-1)X^T X ∈ R^(d×d), and perform eigendecomposition: C = V L V^T, with eigenvalues as the diagonal of L ∈ R^(d×d), and eigenvectors as the columns of V ∈ R^(d×d). Then, we stack the k eigenvectors of V corresponding to the top k eigenvalues into a matrix V˜ ∈ R^(d×k). Finally, we obtain the component values as follows: X˜ = X V˜ ∈ R^(n×k).
Alternatively, instead of first computing the covariance matrix and then performing eigendecomposition, notice that given the above formulation, we can directly compute SVD on the data matrix X, thus obtaining: X = U Σ V^T. By construction, the right singular vectors in V are the eigenvectors of X^T X. Similarly, we stack the k right singular vectors corresponding to the top k singular values into a matrix V˜ ∈ R^(d×k). Finally, we obtain the component values as follows: X˜ = X V˜ ∈ R^(n×k).
Even though SVD is slower, it is often considered to be the preferred method because of its higher numerical accuracy.
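A quick NumPy check of the PCA/SVD correspondence described above (a sketch; the component directions can differ in sign between the two routes):

import numpy as np

n, d, k = 200, 5, 2
X = np.random.randn(n, d)
X -= X.mean(axis=0)                       # center the data first

# Route 1: eigendecomposition of the covariance matrix
C = X.T @ X / (n - 1)
eigvals, V = np.linalg.eigh(C)            # eigenvalues in ascending order
V_top = V[:, ::-1][:, :k]                 # top-k eigenvectors

# Route 2: SVD of the data matrix directly
U, S, Vt = np.linalg.svd(X, full_matrices=False)
V_svd = Vt[:k].T                          # top-k right singular vectors

assert np.allclose(np.abs(V_top), np.abs(V_svd))  # agree up to sign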

33
Q

What is the relationship between SVD and eigendecomposition?

A

Consider A ∈ R^(m×n) of rank r. Then, we can factorize A as follows:
A = U Σ V^T
where U ∈ R^(m×m) is an orthogonal matrix of left singular vectors, V ∈ R^(n×n) is an orthogonal matrix of right singular vectors, and Σ ∈ R^(m×n) is a “diagonal” matrix of singular values such that exactly r of the values σ_i := Σ_ii are non-zero.
By construction:
The left singular vectors of A are the eigenvectors of A A^T. From the Spectral Theorem, the eigenvectors (and thus the left singular vectors) are orthonormal.
The right singular vectors of A are the eigenvectors of A^T A. From the Spectral Theorem, the eigenvectors (and thus the right singular vectors) are orthonormal.
If λ is an eigenvalue of A^T A (or A A^T), then √λ is a singular value of A. From the positive semidefiniteness of A^T A (or A A^T), the eigenvalues (and thus the singular values) are non-negative.

34
Q

What does it mean when a function is differentiable?

A

A function f: U → R is said to be differentiable at a ∈ U if the derivative:
f'(a) = lim_{h→0} [f(a + h) - f(a)]/h
exists. Differentiability at a implies continuity at a, but not every continuous function is differentiable: for example, |x| is continuous but not differentiable at 0. Functions with sharp corners are continuous but not differentiable at those points.

35
Q

[E] Give an example of when a function doesn’t have a derivative at a point.

A

An example is the function f(x) = |x|, which doesn’t have a derivative at x = 0. The graph of this function
has a sharp corner at x = 0, which means there is no single tangent line at that point.

Follow-up: why can't you still apply the limit formula at x = 0? (The one-sided limits disagree: −1 from the left, +1 from the right, so the limit does not exist.)

36
Q

Give an example of non-differentiable functions that are frequently used in machine learning.
How do we do backpropagation if those functions aren’t differentiable?

A

An example is the ReLU (Rectified Linear Unit) function, which is non-differentiable at x = 0. In machine
learning, backpropagation with such functions often uses a concept called subgradient, which allows the
algorithm to bypass non-differentiability at certain points. For ReLU, the derivative is defined as 0 for x < 0
and 1 for x > 0, and at x = 0, any value between 0 and 1 can be used.

37
Q

What does it mean for a function to be convex or concave?

A

A function is called convex if the line segment between any two points on the graph of
the function lies above the graph between the two points. More precisely, the function f : X → R is
convex if and only if for all 0 ≤ t ≤ 1 and all x1, x2 ∈ X:
f(tx1 + (1 − t)x2) ≤ tf(x1) + (1 − t)f(x2)
The function f is said to be concave if −f is convex.

38
Q

Why is convexity desirable in an optimization problem?

A

Convexity is desirable because any local minimum of a convex function is also a global
minimum

39
Q

Most ML algorithms we use nowadays use first-order derivatives (gradients) to construct the next
training iteration.
[E] How can we use second-order derivatives for training models?

A

Second-order derivatives can be used in optimization algorithms to better understand the curvature of
the loss function. This information can be used to adjust the learning rate and the direction of the
update steps, potentially leading to faster convergence.

40
Q

Pros and cons of second-order optimization.

A

Pros: Can lead to faster convergence and more informed update steps.
Cons: Computationally expensive, as it requires calculating and inverting the Hessian matrix.

41
Q

How can we use the Hessian (second derivative matrix) to test for critical points?

A

The Hessian matrix can be used to test the nature of critical points. If the Hessian at a point is positive
definite, the point is a local minimum; if it is negative definite, the point is a local maximum; and if it is
indefinite, the point is a saddle point

42
Q

Jensen’s inequality forms the basis for many algorithms for probabilistic inference, including Expectation-Maximization and variational inference. Explain what Jensen’s inequality is.

A

As stated before, for a given convex function f, we had the following property:
f(tx1 + (1 - t)x2) ≤ tf(x1) + (1 - t)f(x2)
Let us generalize this property. Again, suppose we have a convex function f, variables x1, …, xn ∈ I, and non-negative real numbers α1, …, αn such that ∑i αi = 1. Then, by induction we have:
f(α1x1 + … + αnxn) ≤ α1f(x1) + … + αnf(xn)
Let's formalize it one step further. Consider a convex function f, a discrete random variable X with n possible values x1, …, xn, and real non-negative values αi = p(X = xi). Then, we obtain the general form of Jensen's inequality:
f(E[X]) ≤ E[f(X)]

43
Q

Explain the chain rule.

A

The chain rule is a formula that expresses the derivative of the composition of two differentiable functions f and g in terms of the derivatives of f and g. More precisely, if h = f ∘ g is the function such that h(x) = f(g(x)) for every x, then the chain rule is:
dh/dx = df/dg · dg/dx

44
Q

Given the function f(x, y) = 4x^2 - y with the constraint x^2 + y^2 = 1. Find the function’s maximum and minimum values.

A

In order to solve the constrained optimization problem, we form the Lagrangian:
L(x, y, λ) = 4x^2 - y + λ(x^2 + y^2 - 1)
Calculating the gradient and setting it to zero, we obtain:
∇_{x,y,λ} L = (∂L/∂x, ∂L/∂y, ∂L/∂λ) = (8x + 2λx, -1 + 2λy, x^2 + y^2 - 1) = 0
From 2x(4 + λ) = 0, either x = 0 or λ = −4. If x = 0, the constraint gives y = ±1, with f(0, 1) = −1 and f(0, −1) = 1. If λ = −4, then y = 1/(2λ) = −1/8 and x^2 = 63/64, giving f = 4·(63/64) + 1/8 = 65/16. So the maximum value is 65/16 and the minimum value is −1.

45
Q

Let x ∈ R^n, L = crossentropy(softmax(x), y) in which y is a one-hot vector. Take the derivative of L with respect to x.

A

∂L/∂x = ∂/∂x [−y^T log(softmax(x))]
= −y^T diag(softmax(x))^{−1} [diag(softmax(x)) − softmax(x) softmax(x)^T]
= −y^T [I − 1·softmax(x)^T]
= softmax(x)^T − y^T (using y^T 1 = 1 for a one-hot y)
i.e., ∂L/∂x = softmax(x) − y
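A quick NumPy sanity check of this result against central finite differences (illustrative):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def loss(x, y):
    return -np.sum(y * np.log(softmax(x)))

x = np.random.randn(5)
y = np.eye(5)[2]                    # one-hot target

analytic = softmax(x) - y
numeric = np.zeros_like(x)
eps = 1e-6
for i in range(5):                  # central finite differences, coordinate by coordinate
    e_i = np.eye(5)[i] * eps
    numeric[i] = (loss(x + e_i, y) - loss(x - e_i, y)) / (2 * eps)

assert np.allclose(analytic, numeric, atol=1e-4)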

46
Q

Given a uniform random variable X in the range of [0,1] inclusively. What’s the probability that
X=0.5?

A

For a continuous uniform distribution, the probability of X being exactly any specific value, including 0.5, is
0. This is because the probability for a continuous distribution is defined over intervals, not specific points.

47
Q

Can the values of PDF be greater than 1? If so, how do we interpret PDF?

A

Yes, the values of a Probability Density Function (PDF) can be greater than 1. The key point is that the
area under the PDF curve over the entire range must integrate to 1. A high PDF value does not represent
probability but rather indicates a higher density of the variable at that point.

48
Q

What’s the difference between multivariate distribution and multimodal distribution?

A

A multivariate distribution is a probability distribution with more than one random variable, each with its
range of values. A multimodal distribution is a probability distribution with more than one peak or mode, regardless of how many variables it has.

49
Q

What does it mean for two variables to be independent?

A

In general, continuous random variables X1, …, Xn admitting a joint density are all independent from each other if and only if:
p_{X1,…,Xn}(x1, …, xn) = p_{X1}(x1) · · · p_{Xn}(xn)
This equation states that the joint probability density function (pdf) of the random variables X1, …, Xn factorizes into the product of their individual pdfs, which is a necessary and sufficient condition for independence.

50
Q

It’s a common practice to assume an unknown variable to be of the normal distribution. Why is that?

A

The Central Limit Theorem (CLT) states that the distribution of the sum of a large number of
independent, identically distributed random variables is approximately normal, regardless of the underlying distribution. Because so many things in the universe can be modeled as the sum of a large number of
independent random variables, the normal distribution pops up a lot.

(Related: the central limit theorem and the law of large numbers.)

51
Q

How would you turn a probabilistic model into a deterministic model?

A

To convert a probabilistic model into a deterministic one, you typically use expected values, mode, or
median of the probability distributions as fixed values instead of random variables. This approach ignores
the variability and uncertainty represented by the probability distributions.

52
Q

Explain frequentist vs. Bayesian statistics.

A

The frequentist approach: the goal is to use the sample data to build point estimates of the parameters (potentially with standard errors).

The Bayesian approach uses priors: the goal is to build a posterior distribution of the parameters, given the data at hand.

53
Q

Code for merge sort

A

def merge_sort(arr):
    """
    Sorts an array in place using the merge sort algorithm and returns it.
    """
    if len(arr) > 1:
        mid = len(arr) // 2       # Find the midpoint of the array
        left_half = arr[:mid]     # Divide the elements into 2 halves
        right_half = arr[mid:]

        merge_sort(left_half)     # Sort the first half
        merge_sort(right_half)    # Sort the second half

        i = j = k = 0

        # Merge the two sorted halves back into arr
        while i < len(left_half) and j < len(right_half):
            if left_half[i] < right_half[j]:
                arr[k] = left_half[i]
                i += 1
            else:
                arr[k] = right_half[j]
                j += 1
            k += 1

        # Copy any remaining elements of the left half
        while i < len(left_half):
            arr[k] = left_half[i]
            i += 1
            k += 1

        # Copy any remaining elements of the right half
        while j < len(right_half):
            arr[k] = right_half[j]
            j += 1
            k += 1

    return arr

# Example usage
if __name__ == "__main__":
    sample_array = [12, 11, 13, 5, 6, 7]
    print("Original array:", sample_array)
    sorted_array = merge_sort(sample_array)
    print("Sorted array:", sorted_array)

54
Q

code to recursively read a json file

A

import json

def load_json_file(filename):
    """Load the JSON data from a file."""
    with open(filename, 'r') as file:
        return json.load(file)

def print_json_recursively(data, indent=0):
    """Recursively print JSON data with indentation for nested structures."""
    for key, value in data.items():
        print(' ' * indent + str(key) + ':', end=' ')
        if isinstance(value, dict):       # If value is a dictionary, recurse
            print()
            print_json_recursively(value, indent + 4)
        elif isinstance(value, list):     # If value is a list, iterate over each item
            print()
            for i, item in enumerate(value):
                print(' ' * (indent + 4) + f'[{i}]:', end=' ')
                if isinstance(item, dict):
                    print()
                    print_json_recursively(item, indent + 8)
                else:
                    print(item)
        else:
            print(value)

def main():
    filename = 'path_to_your_json_file.json'
    try:
        json_data = load_json_file(filename)
        print_json_recursively(json_data)
    except Exception as e:
        print(f"An error occurred: {e}")

if __name__ == "__main__":
    main()

55
Q

Find the longest increasing subsequence in a string.

A

def longest_increasing_subsequence(s):
    # Cache for memoization
    memo = {}

    def rec(i, prev):
        if i == len(s):
            return 0  # Base case: end of string

        # Check memoized results
        if (i, prev) in memo:
            return memo[(i, prev)]

        # Option 1: Include the current character if it extends the sequence
        taken = 0
        if prev < s[i]:
            taken = 1 + rec(i + 1, s[i])

        # Option 2: Skip the current character
        not_taken = rec(i + 1, prev)

        # Store the result in the memoization dictionary
        memo[(i, prev)] = max(taken, not_taken)
        return memo[(i, prev)]

    # Start the recursion with a character smaller than any possible as the initial 'previous'
    return rec(0, chr(0))

# Example usage
s = "azbycxdwe"
print("Length of the longest increasing subsequence is:", longest_increasing_subsequence(s))

56
Q

Traverse a tree in pre-order, in-order, and post-order.

A

class TreeNode:
    def __init__(self, val=0, left=None, right=None):
        self.val = val
        self.left = left
        self.right = right

def preorder_traversal(root):
    if root is None:
        return []
    return [root.val] + preorder_traversal(root.left) + preorder_traversal(root.right)

def inorder_traversal(root):
    if root is None:
        return []
    return inorder_traversal(root.left) + [root.val] + inorder_traversal(root.right)

def postorder_traversal(root):
    if root is None:
        return []
    return postorder_traversal(root.left) + postorder_traversal(root.right) + [root.val]

57
Q
Preorder traversal (iterative)
A

def preorder_traversal_iterative(root):
    if root is None:
        return []
    stack, output = [root], []
    while stack:
        node = stack.pop()
        if node:
            output.append(node.val)
            stack.append(node.right)  # Right child pushed first so that left is processed first
            stack.append(node.left)
    return output

58
Q

In order traversal of BST

A

def inorder_traversal_iterative(root):
    stack, output = [], []
    current = root
    while current or stack:
        while current:               # walk down the left spine
            stack.append(current)
            current = current.left
        current = stack.pop()
        output.append(current.val)
        current = current.right
    return output

59
Q

Post order traversal of BST

A

def postorder_traversal_iterative(root):
    if root is None:
        return []
    stack, output = [root], []
    while stack:
        node = stack.pop()
        output.append(node.val)
        if node.left:
            stack.append(node.left)
        if node.right:
            stack.append(node.right)
    return output[::-1]  # Reverse the root-right-left order because we want left-right-root

Do post order traversal question

60
Q

Given an array of integers and an integer k, find the total number of continuous subarrays whose
sum equals k. The solution should have O(N) runtime

A

def subarraySum(nums, k):
    # Dictionary to store the frequency of cumulative sums
    cumulative_sum_count = {0: 1}  # Base case: a sum of 0 exists once
    current_sum = 0
    count = 0

    for num in nums:
        current_sum += num
        # Check if there is a prefix subarray we can subtract
        # that results in the current subarray summing to k
        sum_needed = current_sum - k
        if sum_needed in cumulative_sum_count:
            count += cumulative_sum_count[sum_needed]

        # Update the count of the current_sum in the hashmap
        if current_sum in cumulative_sum_count:
            cumulative_sum_count[current_sum] += 1
        else:
            cumulative_sum_count[current_sum] = 1

    return count
61
Q

You have three matrices: A ∈ R100×5, B ∈ R5×200, C ∈ R200×20, and you need to calculate the product ABC. In what order would you perform your multiplication and why?

A

Since matrix multiplication is associative, the answer is the same whether we multiply in the order of (AB)C or A(BC). However, let us observe the cost through the number of scalar multiplications we need to perform:
(AB)C: 100 · 5 · 200 + 100 · 200 · 20 = 100,000 + 400,000 = 500,000
A(BC): 5 · 200 · 20 + 100 · 5 · 20 = 20,000 + 10,000 = 30,000
Obviously, the second approach is computationally cheaper.

62
Q

What are some of the causes for numerical instability in deep learning?

A

Overflow, underflow, division by zero, log 0, NaN as input, etc.

63
Q

In many machine learning techniques (e.g. batch norm), we often see a small term ϵ
added to the calculation. What’s the purpose of that term?

A

The purpose is to avoid operations that are undefined for 0, such as division by 0, log 0, etc

64
Q

What made GPUs popular for deep learning? How are they compared to TPUs?

A

GPUs became popular for deep learning because matrix multiplications can be efficiently parallelized over hundreds of cores. TPUs (Tensor Processing Units) are specialized hardware for neural nets, with the key difference that they have lower precision for representing floating-point numbers, allowing for:
Higher memory throughput
Faster addition and multiplication operations
This design enables TPUs to accelerate neural network computations while reducing power consumption and increasing efficiency.

65
Q

What are the time and space complexity for doing backpropagation on a recurrent neural network?

A

O(B · T · w) - time

O(w + B · T · a) - space

For the forward-pass of a single example in one timestep we need to evaluate all the weights, resulting in O(w) time complexity, where w is the number of weights. Due to the recurrence, we repeat the computation for T timesteps, resulting in O(T · w). Moreover, performing this un-rolled forward pass for an entire batch will amount the time complexity to O(B · T · w). Lastly, we note that the time complexity of the forward and the backward pass is the same.

As for the space complexity, note that we need to keep in memory both the network weights and the activations from the forward pass (required for the backprop computation). Given that storing the activations for a single timestep is O(a), the space complexity amounts to O(w + B · T · a)

66
Q

Write a function find_bigrams to take a string and return a list of all bigrams.

A

def find_bigrams(text_list: list):
    result = []
    for ls in text_list:
        words = ls.lower().split()
        for bi in zip(words, words[1:]):  # pair each word with its successor
            result.append(bi)
    return result

text = ["Data drives everything", "Get the skills you need for the future of work"]
print(find_bigrams(text))

67
Q

what are the 2 kinds of language modelling?

A

Masked and causal language modelling

68
Q

cost to train a model in terms of P and D

A

C = 6PD
P is the number of parameters in the transformer model
D is the dataset size, in tokens
C is the compute required to train the transformer model, in total floating-point operations

Training time follows from throughput: time ≈ C / (# of GPUs × achieved FLOP/s per GPU).
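A quick worked example with assumed numbers (7B parameters, 2T tokens, 1,000 GPUs at an achieved 300 TFLOP/s each; all values hypothetical):

P = 7e9                        # parameters
D = 2e12                       # training tokens
C = 6 * P * D                  # ≈ 8.4e22 total FLOPs

n_gpus = 1000
flops_per_gpu = 300e12         # achieved (not peak) FLOP/s per GPU, assumed
seconds = C / (n_gpus * flops_per_gpu)
print(f"{C:.2e} FLOPs, {seconds / 86400:.1f} days")   # ≈ 3.2 days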

69
Q

total memory required for training a model

A

Total memory = model memory + optimizer memory + gradient memory + activation memory

Model weights (fp16) → 2 bytes/parameter

Optimizer state (Adam, usually kept at 32-bit precision) → 4 bytes/parameter for the fp32 parameter copy, 4 bytes for momentum, 4 bytes for variance

Gradients (saved in fp16) → 2 bytes/parameter

Activations → 2 bytes × batch size × # of tokens × hidden dimension × # of layers

Total memory is roughly 16 bytes × # of parameters, plus activations.

Benefit of keeping the optimizer state at 32 bits?
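A back-of-the-envelope calculator for the 16-bytes-per-parameter rule above (activations excluded; per-parameter byte counts as listed):

def training_memory_gb(n_params):
    """Rough training memory for mixed precision + Adam, excluding activations."""
    weights = 2 * n_params     # fp16 model weights
    grads = 2 * n_params       # fp16 gradients
    optimizer = 12 * n_params  # fp32 parameter copy + momentum + variance
    return (weights + grads + optimizer) / 1e9

print(training_memory_gb(7e9))  # ~112 GB for a 7B-parameter model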

70
Q

how do memory requirements change for Zero-1, 2, 3

A

For ZeRO-1,
Total Memory_Training ≈ Model Memory + ((Optimizer memory) / (No. GPUs)) + Activation Memory + Gradient Mem

For ZeRO-2,
Total Memory_Training ≈ Model Memory + Activation Memory + (Optimizer Memory + Gradient Memory)/ (No. GPUs)

For ZeRO-3,
Total Memory_Training ≈ Activation Memory + (Model Memory + Optimizer Memory + Gradient Memory) / (No. GPUs)

Why does the optimiser keep a copy of the gradients along with gradient memory?

71
Q

optimum scaling stated by chinchilla

A

Optimal training dataset size D ≈ 20 tokens per parameter, i.e. D = 20 × (# of parameters).

72
Q

what happens after attention layer output?

A

So, the sequence is:

  • Multi-headed attention
  • o_proj linear projection
  • Skip connection by adding the original input
  • Layer normalization
  • First FCN layer
  • Activation Function
  • Second FCN layer
  • skip connection
  • another layer norm
73
Q

what is BLEU score?

A

BLEU is a precision-based metric: the numerator is the (clipped) count of n-gram matches, and the denominator is the count of n-grams present in the generated text.

This means that if the bigram "the apple" shows up 7 times in the generated text but only 2 times in the reference text, then its numerator count is clipped to 2 while its denominator count stays at 7.

This gives you BLEU-n for different n.

One problem here is that very short generations shrink the denominator and make the model look good, so a brevity penalty (BP) is multiplied in.

The total BLEU score is the brevity penalty times the exponential of the weighted sum of log precisions: BLEU = BP · exp(∑_n w_n log p_n).
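A toy sketch of the clipped n-gram precision (illustrative only; real BLEU implementations such as sacrebleu handle multiple references, smoothing, and tokenization):

from collections import Counter

def clipped_precision(candidate, reference, n):
    """Modified n-gram precision for one candidate/reference pair."""
    ngrams = lambda words: Counter(zip(*[words[i:] for i in range(n)]))
    cand, ref = ngrams(candidate.split()), ngrams(reference.split())
    clipped = sum(min(c, ref[g]) for g, c in cand.items())  # clip by reference count
    return clipped / max(sum(cand.values()), 1)

print(clipped_precision("the apple the apple", "an apple on the tree", 1))  # 2/4 = 0.5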

74
Q

techniques for regularization in LLM training

A
  1. Dropout
  2. Weight Decay (L2 Regularization)
  3. Layer Normalization
  4. Gradient Clipping
  5. Label Smoothing
75
Q

what is lasso and ridge regression?

A

Using the L1 norm as the regularization term gives lasso; using the (squared) L2 norm gives ridge regression.

76
Q

what is gradient clipping? what does it help with?

A

gradient clipping is when you clip gradients during backprop. helps with exploding gradient problem. Helps maintain stability during training.

What value is gradient clipped at

77
Q

what does l1 and l2 norm do to parameter values

A

The L1 norm pushes parameters to exactly 0, making it good for variable selection; the L2 norm pushes them close to 0.

Graphically, the L1 constraint region in parameter space is a diamond and the L2 region is a circle. The point where the contours of the loss function first touch the diamond or circle is where the regularized solution lies; because the diamond's corners sit on the axes, that first touch often lands where some coordinates are exactly zero.

78
Q

what does lambda in regularization parameter do?

A

It determines the size of the constraint region (circle/diamond) in parameter space: the larger the lambda, the smaller the region, and the more the weights are shrunk.

79
Q

what is bias in a model

A

In the context of machine learning,
bias refers to the error that is introduced by approximating a real-world problem, which may be complex, by a much simpler model. simpler models can exhibit high bias and complex models high variance

bias - underfitting
variance - overfitting

80
Q

what is variance in a model

A

Variance refers to the error that is introduced by the model's sensitivity to fluctuations in the training set.

If a model has too many parameters, or if those parameters are allowed to take on large values, it can become extremely flexible. This means it can capture not only the underlying relationships in the data but also the noise (random fluctuations) specific to the training set. As a result, it may perform very well on the training data (low bias) but poorly on new, unseen data (high variance), because it has overfitted to the noise and specifics of the training set rather than to the underlying data distribution.

81
Q

Effect of L2 Regularization? on variance as well

A

Penalizing Large Weights: prevent any single feature from having too much influence on the predictions, which is desirable when you suspect some features may be correlated with noise rather than with the signal in the training data.

Smoothness and Generalization: Smaller weights often result in smoother functions that change less drastically with input variations. This smoothness means the model is less likely to pick up on noise and will therefore generalize better to unseen data.

Shrinkage Effect: The regularization term shrinks the parameter values towards zero but not exactly to zero. This effect is akin to a “soft” form of feature selection that lowers the complexity of the model without completely eliminating the contribution of any single feature.

82
Q

skip connections solve what problem?

A

skip connections solve for the problem of vanishing gradient

83
Q

how is the problem of exploding gradient solved in LLMs

A

gradient clipping

84
Q

Accuracy formula

A

(TP+TN)/(FP+FN+TP+TN)

85
Q

precision formula

A

of the samples predicted positive, the fraction that are actually positive
= TP/(TP+FP)

Also ask gpt and write about what these tell you

86
Q

recall formula

A

coverage of actual positive samples
= TP/(TP+FN)

Is this right? (Yes: recall = TP / (TP + FN), the fraction of actual positives that are retrieved.)

87
Q

specificity formula

A

coverage of actual negative samples
= TN/(TN+FP)

88
Q

f1 score

A

harmonic mean of precision and recall, useful for imbalanced classes
= 2TP/(2TP+FP+FN)

89
Q

what is the ROC curve

A

plot of TPR vs FPR
recall vs (1-specificity)
allows us to compare 2 methods like logistic vs boosting

90
Q

what is sensitivity?

A

it is the same as recall

91
Q

is BLEU score precision recall or f1 score based?

A

BLEU is precision based

92
Q

is ROUGE score precision, recall, or f1 score based?

A

recall based

93
Q

BERTScore what is?

A

BERTScore uses BERT to create contextual embeddings of the tokens in both the output and the reference, matches each token to its most similar counterpart by cosine similarity (greedy matching), computes precision and recall from those matches, and then takes their harmonic mean (F1).

Do you use a double loop to summation over all n grams in the precision and recall components??

94
Q

steps for evaluation of LLM output

A
  1. coherence
  2. compile - syntax
  3. gpt eval - semantics
  4. expert eval
95
Q

sources of bias in gpt automated eval

A

Position bias: LLMs tend to favor the response in the first position.

Verbosity bias: LLMs tend to favor longer, wordier responses over more concise ones

Self-enhancement bias: LLMs have a slight bias towards their own answers.

96
Q

what is vicuna?

A

Vicuna-13B is a LLaMA finetune on chat data from ShareGPT.
It is one of the first papers to use GPT-4 as a judge to evaluate model outputs against different benchmarks.

97
Q

What are the different eval benchmarks popular

A

Accuracy, recall, precision, specificity, F1 score if you have classes.

BLEU, ROUGE, BERTScore if you have text as a reference and the task is something like summarisation.

Automatic GPT-4-based evaluation.

98
Q

what is soft prompt tuning

A

prepends a trainable tensor to the model’s input embeddings, essentially creating a soft prompt. Unlike discrete text prompts, soft prompts can be learned via backpropagation, meaning they can be fine-tuned to incorporate signals from any number of labeled examples.

99
Q

what is soft prefix tuning

A

it prepends trainable parameters to the hidden states of all transformer blocks. During fine-tuning, the LM’s original parameters are kept frozen while the prefix parameters are updated.

100
Q

what is the main idea behind LORA?

A

When adapting to a specific task, pre-trained language models have a low intrinsic dimension and can still learn efficiently despite a random projection into a smaller subspace. Thus, LoRA hypothesized that weight updates during adaptation also have low intrinsic rank.

How do you decompose into a lower rank? (LoRA parameterizes the update as ΔW = BA with B ∈ R^(d×r) and A ∈ R^(r×k), r ≪ min(d, k); only A and B are trained, and B is initialized to zero so training starts from the pretrained weights.)
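A minimal PyTorch sketch of a LoRA-style linear layer (illustrative; real implementations such as the peft library add an alpha/r scaling factor, dropout, and weight merging):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d_in, d_out, r=8):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        for p in self.base.parameters():
            p.requires_grad_(False)                          # freeze the pretrained layer
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)   # low-rank factor A
        self.B = nn.Parameter(torch.zeros(d_out, r))         # B starts at 0: no-op at init

    def forward(self, x):
        return self.base(x) + x @ self.A.T @ self.B.T        # W x + (BA) x

layer = LoRALinear(768, 768, r=8)
out = layer(torch.randn(4, 768))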

101
Q

what are main ideas introduced by QLORA

A

QLoRA builds on the idea of LoRA. But instead of using the full 16-bit model during fine-tuning, it applies a 4-bit quantized model

  1. 4-bit NormalFloat (to quantize models),
  2. double quantization (for additional memory savings), and
  3. paged optimizers (that prevent OOM errors by transferring data to CPU RAM when the GPU runs out of memory).

Give the QLoRA paper to ChatGPT and talk to it about how quantisation works there, like μ and σ, the normal-distribution assumption, and binning.

102
Q

What benefits do we get from reducing the precision of our model? What problems might we run into?
How to solve these problems?

A

Benefits: less memory, faster computation.
Problems: during training, numerical instability is possible because small gradient values underflow to zero (and large ones overflow) at low precision, so learning can be affected or stall.

Mixed-precision training solves this: compute in fp16/bf16 while keeping an fp32 master copy of the weights, and use loss scaling to keep small gradients out of the underflow range.

Details of mixed precision?

103
Q

what is the difference between torch.cat() and torch.stack()

A

torch.cat() joins tensors along an existing dimension, while torch.stack() creates a new dimension. Given two tensors A and B, both with shape [2, 2]:
torch.cat([A, B], dim=0) will result in a tensor of shape [4, 2].
torch.stack([A, B], dim=0) will result in a tensor of shape [2, 2, 2].

104
Q

how are in place operations done in pytorch

A

In-place operations: operations that have a _ suffix are in-place.

tensor.add_(5)

The tensor above is modified in place.

The use of in-place operations is discouraged with autograd, because the immediate loss of the intermediate values can break or complicate gradient computation.

105
Q

how to load resnet18 model

A

from torchvision.models import resnet18
model = resnet18()

106
Q

what data can be passed into resnet18 and how would you create a sample data point for it?

A

data = torch.randn(1, 3, 64, 64)  # a batch of one 3-channel (RGB) 64x64 image

107
Q

what do the output labels of resnet18 look like? how would you create a sample output label?

A

label = torch.randn(1, 1000)  # one row of scores over the 1000 ImageNet classes

108
Q

how would you generate the prediction for a model on data?

A

prediction = model(data)

109
Q

how would you calculate loss given you had a data point and prediction and do a backward pass with it

A

loss = (prediction - labels).sum()
loss.backward()

110
Q

how would you define an optimizer (say SGD) and run a optimizer step?

A

optim = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

optim.step()

111
Q

steps in a training pipeline in pytorch

A
  1. define the model
  2. get data
  3. get labels
  4. define the optimizer and loss
  5. call optimizer.zero_grad(), compute the loss, and call loss.backward()
  6. call optimizer.step()

Or you define everything and call trainer.train(). A minimal loop is sketched below.
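A minimal sketch tying these steps together (one gradient step; shapes reuse the earlier resnet18 cards):

import torch
from torchvision.models import resnet18

model = resnet18()                              # 1. define the model
data = torch.randn(1, 3, 64, 64)                # 2. get data
labels = torch.randn(1, 1000)                   # 3. get labels
optim = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)  # 4. optimizer

optim.zero_grad()                               # clear stale gradients
loss = (model(data) - labels).sum()             # 5. compute the loss
loss.backward()                                 # backprop
optim.step()                                    # 6. update the weights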

112
Q

what is the Transformers library?

A

The Transformers library from Hugging Face provides models and utilities for training and fine-tuning many machine learning models. It exposes a high-level API on top of PyTorch (with TensorFlow and JAX support as well).

113
Q

what is huggingface trainer

A

Trainer is a class in the Transformers library from Hugging Face that lets you fine-tune pre-trained models. It lets you abstract away a lot of settings and supports mixed-precision training.

114
Q

what are callbacks in PyTorch

A

A callback is a set of methods that you can override or utilize to customize the behaviour of the Trainer; for example, you can use one to save the model periodically, log metrics, or modify the learning rate. Callbacks are useful for extending the functionality of the training loop without changing the core training logic.

You can pass callbacks as arguments to the Trainer object.

115
Q

how do you define and use a callback in pytorch?

A

Callbacks are defined as classes: you create a class that inherits from a base class in the Transformers library (TrainerCallback) and override the hook you want, e.g. a method that runs every epoch (or every log step) and does something when a condition is met.
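A small sketch of such a callback for the Hugging Face Trainer (the stopping condition is made up for illustration):

from transformers import TrainerCallback

class StopOnLowLoss(TrainerCallback):
    """Stop training once the logged loss drops below a threshold."""
    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs and logs.get("loss", float("inf")) < 0.1:  # assumed condition
            control.should_training_stop = True

# trainer = Trainer(model=model, args=training_args, callbacks=[StopOnLowLoss()])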

116
Q

what are teh benefits of using nn.Module subclass to define your model?

A

Automatic Parameter Registration: This is crucial for training the model because optimizers rely on the .parameters() method of the nn.Module to get a list of all parameters that need updates.

Model Serialization and deserialization: when you save and load model

Device Management: When moving the model to a device (e.g., GPU), all parameters of all contained layers are automatically moved as well.

117
Q

what is the self attention operation?

A

Self-attention takes you from inputs x_i to outputs y_i.
To compute y_i, you take x_i and dot-product it with every x_j, then take a softmax over those scores. Each normalized score becomes the weight with which the weighted sum of the x_j's is taken.

Every y_i is a weighted summation of all the x_j's:

y_i = ∑_j w_ij x_j

w′_ij = x_i^T x_j

w_ij = exp(w′_ij) / ∑_j exp(w′_ij)
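A NumPy sketch of this basic, unparameterized self-attention (before queries/keys/values and the 1/sqrt(d) scaling of the next card are introduced):

import numpy as np

def basic_self_attention(X):
    """X: (seq_len, d). Each output y_i is a softmax-weighted sum of the x_j."""
    scores = X @ X.T                              # w'_ij = x_i^T x_j
    scores -= scores.max(axis=1, keepdims=True)   # stabilize the softmax
    W = np.exp(scores)
    W /= W.sum(axis=1, keepdims=True)             # row-wise softmax
    return W @ X                                  # y_i = sum_j w_ij x_j

Y = basic_self_attention(np.random.randn(5, 8))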

118
Q

in attention layers why do you scale the dot product

A

Large inputs to the softmax saturate it and kill the gradient, so we divide by sqrt(d).

Why sqrt(d)? Imagine a vector in R^d with all values equal to c. Its Euclidean length is c·sqrt(d). Therefore, we are dividing out the amount by which the increase in dimension grows the length (and hence the dot products) of average vectors.

119
Q

layer norm is applied on what values in a layer?

A

The layer normalization is applied over the embedding dimension only.

120
Q

what is temperature? how is it implemented?

A

temperature sampling is dividing logits by the temperature before feeding them into softmax and obtaining our sampling probabilities.

lower temp means less random
higher temp means more random

121
Q

what is top-k sampling and what is the problem with it?

A

In top-k sampling you sample from the k highest-probability tokens.

The problem is that if the distribution is relatively flat over many reasonable options, we'll ignore some of them simply because we put a hard cutoff at k.

122
Q

what is nucleus sampling?

A

Nucleus sampling, aka top-p sampling: sort tokens by probability, keep the smallest set whose cumulative probability reaches p, renormalize over that set, and sample from it.
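A sketch combining temperature with top-p over a vector of logits (illustrative):

import numpy as np

def sample_top_p(logits, temperature=1.0, top_p=0.9):
    z = logits / temperature
    probs = np.exp(z - z.max())
    probs /= probs.sum()                        # temperature-scaled softmax
    order = np.argsort(probs)[::-1]             # tokens by descending probability
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, top_p) + 1    # smallest set reaching cumulative p
    nucleus = order[:cutoff]
    p = probs[nucleus] / probs[nucleus].sum()   # renormalize within the nucleus
    return np.random.choice(nucleus, p=p)

token = sample_top_p(np.random.randn(50), temperature=0.8, top_p=0.9)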

123
Q

what are the benefits of layernorm?

A

Stability in Training: LayerNorm stabilizes the neural network training process by ensuring that the distribution of the inputs to the activation functions in a layer does not vary too much. This reduces the problem where the learning has to constantly adjust to a shifting input distribution across layers.

124
Q

What’s the risk in empirical risk minimization?

A

The risk in empirical risk minimization is the expectation of the loss L(h(x), y) over the joint probability distribution P(x, y), for a hypothesis h.

With the 0-1 loss, the risk is the probability mass of the region where h misclassifies:

R(h) = E[L(h(x), y)] = ∫ L(h(x), y) dP(x, y)

125
Q

what is empirical risk minimization?

A

You take the average of the loss on the training data and call that the empirical risk, since you can't compute the true risk because P(x, y) is not available:

R_emp(h) = (1/n) ∑_{i=1}^{n} L(h(x_i), y_i)

Empirical risk minimization is when you find an h that minimizes this empirical risk.

126
Q

If we have a wide NN and a deep NN with the same number of parameters, which one is more expressive
and why?

A

Deeper networks are more expressive, since they encode an inductive bias that complex functions
can be modeled as composition of simple functions. In turn, this allows the network to learn multiple levels
of an abstraction hierarchy. Empirically, it has been shown that deeper networks lead to more compact
models with better generalization performance.

127
Q

Why does L1 regularization tend to lead to sparsity while L2 regularization pushes weights closer to 0?

A

Because the gradient (subgradient) of the L1 penalty has constant magnitude 1 even near 0, it keeps pushing weights all the way to exactly 0; the gradient of the L2 penalty, 2w, becomes very small close to 0, so weights shrink toward 0 but rarely reach it exactly.

128
Q

what is bagging and boosting

A

In bagging you sample the training set with replacement to train n classifiers and take a (majority) vote for the final prediction.

In boosting you first train a weak learner and look at which samples it fails to classify; you weight those samples higher, train another weak learner, and repeat this iteratively, combining the weak learners into a weighted ensemble.

129
Q

What’s the motivation for RNN?

A

Derived from feedforward
neural networks, RNNs can use their internal state (memory) to process variable length sequences of
inputs

130
Q

What’s the motivation for LSTM?

A

vanishing and exploding gradient in RNNs

131
Q

formula for perplexity

A

Perplexity(P) = 2^(−(1/N) ∑_x log₂ P(x))
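A tiny sketch of this computation from per-token model probabilities (natural-log form via exp, equivalent to the base-2 form above):

import math

def perplexity(token_probs):
    """token_probs: the model's probability of each observed token."""
    n = len(token_probs)
    avg_neg_log = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_neg_log)   # exp(-(1/N) sum log P(x))

print(perplexity([0.25, 0.5, 0.125]))  # = 4.0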

132
Q

what is the key idea in relative position embeddings

A

A bias term that depends on the relative distance between positions is added inside the softmax along with qk^T to signify positions; but doing so interferes with straightforward reuse of the KV cache (the position-dependent term changes as the sequence grows), which is bad for inference speed.

133
Q

main idea in rotary position embedding

A

Applied only to the q and k vectors, not v. We multiply by a rotation matrix that rotates a token's representation by m × θ, where m is the position of the token.

Concretely, blocks of 2 in the embedding dimension are rotated by different frequencies θ1, θ2, etc.

This makes the attention score depend only on relative positions (invariance to absolute position).

134
Q

what is KV cache

A

During inference you don't really need to recompute the attention inputs (q, k, v) for all the previous tokens.

Usually q is needed only for the most recent token, while K and V are needed for all tokens, but only one new row of each is added per step.

So you save the K and V from the previous iteration, compute one new q, k, and v from the latest token, append the new k and v to the cache, and then compute the attention qK^T followed by the weighted sum over V.

135
Q

How are your model’s accuracy and computational efficiency affected when you decrease or increase
CNN filter size?

A

Increasing the filter size results in a decrease in computational efficiency (since the number of model parameters increases), and an increase in accuracy up to a certain point, beyond which the network can overfit and imitate a fully-connected network.

Conversely, decreasing the filter size results in an increase in computational efficiency, and a decrease in accuracy when tending towards extremely small kernels (e.g. 1x1), which do not capture the local structure of the inputs properly.

136
Q

Convolutional layers are also known as “locally connected”. Explain what it means

A

Each filter operates only on a small neighborhood (e.g. 3x3) around a given pixel, and is
applied across the entire spatial domain. This sort of weight sharing drastically reduces the number of
parameters, and injects the "spatial equivariance" bias.

137
Q

What does a 1x1 convolutional layer do?

A

The purpose of a 1x1 conv layer is to change (typically reduce) the number of channels in the input activation volume, without touching the spatial dimensions.

138
Q

What happens when pooling is removed completely?

A

The main purpose of the pooling operation is to increase the receptive field of the network
by non-parametric techniques. If pooling is completely removed, then we’d need an increasingly large
stack of convolutional layers to achieve large enough receptive fields for the neurons located in the
deep layers. Of course, this comes at the cost of drastically increasing the number of parameters and
computation requirements.

139
Q

how to calculate cosine similarity in python

A
A = [5, 3, 4]
B = [4, 2, 4]

dot_product = sum(a*b for a, b in zip(A, B))

magnitude_A = sum(a*a for a in A)**0.5
magnitude_B = sum(b*b for b in B)**0.5

cosine_similarity = dot_product / (magnitude_A * magnitude_B)
print(f"Cosine Similarity using standard Python: {cosine_similarity}")
140
Q

How would a finite or infinite horizon affect our algorithms?

A

with a finite horizon, when estimating the value of an action we only consider rewards over a finite number of future steps.

with an infinite horizon we consider the rewards of all future states.

infinite horizons are more common

141
Q

Why do we need the discount term for objective functions

A

using a discount rate γ < 1 is a mathematical trick to make an infinite sum finite: if rewards are bounded by R_max, then ∑_t γᵗ R_max = R_max / (1 − γ). It also encodes a preference for earlier rewards.

142
Q

what is the q learning update rule

A

Q(S_t, A_t) ← Q(S_t, A_t) + α (R_{t+1} + γ max_a Q(S_{t+1}, a) − Q(S_t, A_t))

γ → discount rate
α → learning rate
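
The update as code, on a toy tabular problem (state/action counts and hyperparameters are arbitrary):

import numpy as np

n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99  # learning rate, discount rate

def q_update(s, a, r, s_next):
    # Q(S_t, A_t) <- Q(S_t, A_t) + alpha * (R_{t+1} + gamma * max_a Q(S_{t+1}, a) - Q(S_t, A_t))
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])

q_update(s=0, a=2, r=1.0, s_next=3)  # one update from a single experience tuple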

143
Q

experience tuple in RL

A

(s_t, a_t, r_t, s_{t+1})

144
Q

how do variational autoencoders work

A

put an image through the encoder, which outputs a mean and a standard deviation (a log standard deviation in practice, for numerical stability). Sample a latent vector by shifting and scaling a standard normal draw with those parameters (the reparameterization trick), and try to reconstruct the input image with the decoder.

the reconstruction loss on the input image trains both encoder and decoder, while the KL divergence between the encoder's output distribution and a standard normal acts as a regularizer
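
A sketch of the sampling step and the regularizer (the encoder outputs here are made-up numbers):

import numpy as np

# Hypothetical encoder outputs for one input: mean and log standard deviation
mu = np.array([0.3, -1.2])
log_sigma = np.array([-0.5, 0.1])

# Reparameterization trick: sample by shifting/scaling a standard normal draw,
# so gradients can flow back through mu and log_sigma
eps = np.random.randn(*mu.shape)
z = mu + np.exp(log_sigma) * eps  # latent vector fed to the decoder

# KL divergence of N(mu, sigma^2) from N(0, I), the regularizer term
kl = 0.5 * np.sum(np.exp(2 * log_sigma) + mu**2 - 1 - 2 * log_sigma)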

145
Q

what is the formula for vanilla gradient update?

A

w_{t+1} = w_t − α · (1/N) ∑_{i=1}^{N} ∇_w L_i(w_t, y_i, ŷ_i)

146
Q

Implement vanilla dropout for the forward and backward pass in NumPy

A

the things to note here are how the input and the gradient are multiplied by the mask, how the mask is created, and how the kept activations are scaled after dropout (inverted dropout)

import numpy as np

def forward_pass_with_dropout(X, dropout_rate):
    """
    Apply dropout to the input layer X during the forward pass.
    :param X: Input data for the layer, numpy array of shape (n_features, n_samples)
    :param dropout_rate: The probability of setting a neuron's output to 0
    :return: A tuple (output after applying dropout, dropout mask used)
    """
    # Create a mask using the dropout rate, setting to 0 for the dropped units
    dropout_mask = np.random.rand(*X.shape) > dropout_rate
    # Apply the mask to the input data
    dropped_out_X = np.multiply(X, dropout_mask)
    # During training, scale the kept activations so the expected value is unchanged
    # (inverted dropout: no rescaling is then needed at inference time)
    dropped_out_X /= (1 - dropout_rate)
    return dropped_out_X, dropout_mask

def backward_pass_with_dropout(dA, dropout_mask, dropout_rate):
    """
    Apply the stored mask to the gradient during the backward pass.
    :param dA: Gradient of the loss with respect to the activations, numpy array
    :param dropout_mask: The dropout mask that was used during the forward pass
    :param dropout_rate: The probability of setting a neuron's output to 0
    :return: Gradient after applying the dropout mask
    """
    # Apply the dropout mask to the gradients
    dA_with_dropout = np.multiply(dA, dropout_mask)
    # Scale the gradients as we did during the forward pass
    dA_with_dropout /= (1 - dropout_rate)
    return dA_with_dropout

# Example usage:
np.random.seed(0)  # for reproducibility
X = np.random.randn(5, 3)  # 5 features, 3 samples
dropout_rate = 0.2  # 20% dropout rate

# Forward pass
X_dropped, mask = forward_pass_with_dropout(X, dropout_rate)

# Suppose we have some gradient dA from the backward pass
dA = np.random.randn(5, 3)

# Backward pass
dA_dropped = backward_pass_with_dropout(dA, mask, dropout_rate)
147
Q

why are RNNs susceptible to vanishing and exploding gradients and how to solve?

A

Backpropagation through time (BPTT) is the cause: the gradient is multiplied by the recurrent weight matrix (and the activation derivatives) once per time step, so over long sequences it shrinks or grows exponentially, i.e. vanishes or explodes.

Remedies:
* Gated units: RNN variants with gating mechanisms, such as LSTM (Long Short-Term Memory) or GRU
* Gradient clipping (mainly against exploding gradients)
* Skip connections
* Proper initialization and activation functions

148
Q

When training a large neural network, say a language model with a billion parameters, you
evaluate your model on a validation set at the end of every epoch. You realize that your validation
loss is often lower than your train loss. What might be happening?

A

It might be that the training loss uses regularizers (e.g. L2 norm on the weights). Since at validation time we only evaluate the main loss function, it might happen that the validation loss is lower than
the (composite) train loss.

The model might have dropout layers, which impose heavy regularization during training. These are layers that behave differently during training (dropping out neurons) and inference time

The validation set might simply be easier compared to the train set.

149
Q

What criteria would you use for early stopping?

A

We can perform early stopping when a metric of interest (validation loss, accuracy, etc.) stops improving. It is a good idea to use a patience parameter in order to avoid reacting to noisy estimates.
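
A sketch of the stopping logic (train_one_epoch_and_validate is a hypothetical helper; patience and min_delta are arbitrary):

best_loss, patience, wait, min_delta = float("inf"), 5, 0, 1e-4

for epoch in range(100):
    val_loss = train_one_epoch_and_validate()  # hypothetical: trains, returns validation loss
    if val_loss < best_loss - min_delta:       # min_delta guards against noisy improvements
        best_loss, wait = val_loss, 0
    else:
        wait += 1
        if wait >= patience:                   # no improvement for `patience` epochs
            break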

150
Q

Gradient descent vs SGD vs mini-batch SGD.

A

* In (full-batch) gradient descent we first perform a forward pass and a backward pass for each sample in the dataset, before we take a step in the direction of the cumulative gradient. This is extremely slow to converge, as today's datasets are extremely large in size, which implies that we perform gradient updates too rarely.
* In SGD, after performing a forward and a backward pass for a single sample, we take a step in the
direction of the single gradient. Even though we perform gradient updates much more often, the
entire process is extremely noisy as we are optimizing with respect to a single sample at a time.
* Mini-batch SGD combines the best of both worlds: perform a forward/backward pass for a batch
(e.g. 32) of samples, and take a step in the direction of the gradient for the current mini-batch. On
one hand, we perform updates more often than pure gradient descent; on the other hand, we optimize
over an entire mini-batch, minimizing the noise in the estimated gradient.

151
Q

Your model’s weights fluctuate a lot during training. How does that affect your model’s performance?
What to do about it?

A

The fluctuation can be attributed to two primary factors: large learning rate or exploding gradients. In either case, this can seriously destabilize the training process. In order to resolve this issue, we could:
1) lower the learning rate;
2) perform gradient clipping

152
Q

Draw a graph number of training epochs vs training error for when the learning rate is: i) too high;
ii) too low; iii) acceptable.

A

  1. If too high: the loss decreases fast at first but stabilizes at a high value (or diverges).
  2. If too low: the loss decreases very slowly.
  3. If acceptable: the loss decreases steadily and stabilizes at a low value.

153
Q

What’s learning rate warmup? Why do we need it?

A

Warmup steps are just a few updates with low learning rate at the beginning of training.
After this warmup, you use the regular learning rate (schedule) to train your model to convergence.
For example, RMSProp computes a moving average of the squared gradients to get an estimate
of the variance in the gradients for each parameter. For the first update, this estimate is based
only on the first batch's squared gradients. Since, in general, this will not be a good estimate,
your first update could push your network in a wrong direction. To avoid this problem, you give
the optimiser a few steps to estimate the variance while making as few changes as possible (low
learning rate), and only when the estimate is reasonable do you use the actual (high) learning rate.
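
A sketch of a linear warmup schedule (the base rate and step counts are illustrative):

def lr_at(step, base_lr=3e-4, warmup_steps=1000):
    # Linear warmup from ~0 to base_lr, then constant for simplicity;
    # real schedules usually decay afterwards (cosine, inverse sqrt, ...)
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

print(lr_at(0), lr_at(500), lr_at(5000))  # 3e-07, ~1.5e-04, 3e-04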

154
Q

Compare batch norm and layer norm.

A

Suppose the output of the previous layer is X ∈ R^{B×D}, where B is the batch size and D is the dimensionality of the embedding. Both techniques normalize the input as follows:

Y = (X − E[X]) / sqrt(Var[X] + ϵ) ∗ γ + β

where γ and β are learnt affine parameters, and all operations are treated as broadcasts. The main
difference stems from how the two techniques compute the statistics:
* Batch norm computes the mean and standard deviation over the batch, meaning that E[X] ∈ R^{1×D} and Var[X] ∈ R^{1×D}.
* Layer norm computes the mean and standard deviation over the features, meaning that E[X] ∈ R^{B×1} and Var[X] ∈ R^{B×1}.
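
The axis difference in numpy (γ and β omitted; shapes are illustrative):

import numpy as np

X = np.random.randn(32, 64)  # (B, D)
eps = 1e-5

# Batch norm: statistics over the batch axis -> one mean/var per feature, shape (1, D)
bn = (X - X.mean(axis=0, keepdims=True)) / np.sqrt(X.var(axis=0, keepdims=True) + eps)

# Layer norm: statistics over the feature axis -> one mean/var per sample, shape (B, 1)
ln = (X - X.mean(axis=1, keepdims=True)) / np.sqrt(X.var(axis=1, keepdims=True) + eps)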

155
Q

Some models use weight decay: after each gradient update, the weights are multiplied by a factor slightly
less than 1. What is this useful for?

A

if you use L2 regularization, then in the gradient update rule you are effectively multiplying w by (1 − α·λ) before applying the gradient step; this is exactly weight decay. PyTorch implements it via the weight_decay argument of its optimizers.
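
The equivalence in a couple of lines (plain SGD with an L2 penalty of (λ/2)·||w||²; the numbers are arbitrary):

import numpy as np

w, grad = np.array([1.0, -2.0]), np.array([0.3, 0.1])
lr, lam = 0.1, 0.01

# L2-regularized step: the penalty (lam/2)*||w||^2 adds lam*w to the gradient
w_l2 = w - lr * (grad + lam * w)

# Weight decay: shrink the weights by (1 - lr*lam), then take the plain step
w_wd = (1 - lr * lam) * w - lr * grad

print(np.allclose(w_l2, w_wd))  # True; the equivalence breaks for Adam (hence AdamW)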

156
Q

It’s a common practice for the learning rate to be reduced throughout the training
i. What’s the motivation?
ii. What might be the exceptions

A

The learning rate controls the size of the update steps along the gradient: large steps early on to make quick progress, smaller steps later to settle into a minimum.

In Adam, the moving average of squared gradients already acts as a per-parameter adaptive learning rate, which makes the schedule less critical.

An exception would be continual learning, where the data distribution keeps changing and the learning rate should not decay to zero.

157
Q

How should we adjust the learning rate as we increase or decrease the batch size?

A

When utilizing larger batches, we can afford larger learning rates, as the approximated gradient of the batch is closer to the true gradient. On the other hand, very small batches yield noisy estimates of the gradient, so it is advisable to also use small learning rates to avoid diverging in the optimization procedure.

158
Q

What can you say about the ability to converge and generalize of Adam vs. SGD?

A

It has been observed empirically that solutions found by adaptive methods (e.g. Adam) generalize
worse than those found by SGD, even when these solutions have better training performance. In other words,
adaptive methods converge faster, but have worse generalization performance than pure SGD.

159
Q

SGD vs adam optimzer formula

A

SGD is the typical gradient update scaled by the learning rate: w ← w − α·g.

Adam additionally keeps a momentum term and a variance term: the momentum is a moving average of the gradient, and the variance is a moving average of the squared gradient; the update divides the (bias-corrected) momentum by the square root of the (bias-corrected) variance.
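
A single step of each in numpy (default hyperparameters shown; a sketch, not a full optimizer):

import numpy as np

def sgd_step(w, g, lr=1e-2):
    return w - lr * g

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g        # momentum: moving average of the gradient
    v = b2 * v + (1 - b2) * g**2     # variance: moving average of the squared gradient
    m_hat = m / (1 - b1**t)          # bias correction for zero initialization
    v_hat = v / (1 - b2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v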

160
Q

asynchronous SGD vs.
synchronous SGD in distributed training

A

The choice between asynchronous and synchronous SGD in a distributed training environment
depends on the specific requirements of the training process and the trade-offs between speed and
stability. Asynchronous SGD offers faster training but can suffer from stale gradients (workers update
shared weights that have changed since they computed their gradient), while synchronous SGD provides
more stable convergence at the cost of potentially slower training (all workers wait for the slowest
one before each update).

161
Q

Why don’t we just initialize all weights in a neural network to zero?

A

because all neurons start identical, during backpropagation they receive the same local derivatives, which in turn causes the weights of all neurons in a given layer to receive the same update; the symmetry is never broken and the layer behaves like a single neuron.

162
Q

What are some sources of randomness in a neural network?

A

Random weight initialization, mini-batch shuffling, dropout, etc

163
Q

what are some ways in which neural networks benefit from randomness during training?

A

1) the noise in SGD's gradient estimates helps escape local minima and saddle points
2) randomness in output sampling in LLMs, and in action sampling when RL plays games like chess, enables exploration
3) dropout helps the network not become too dependent on any individual neurons

164
Q

What’s a dead neuron?

A

A dead neuron in a neural network is a neuron that always outputs the same value, regardless of the
input. This typically happens when the neuron’s activation function is non-linear, like ReLU (Rectified
Linear Unit), and the input to the neuron is such that the activation function always outputs the
minimum value (which is 0 in the case of ReLU). For ReLU, this happens when the weights and biases
of the neuron are adjusted during training in such a way that the weighted sum of the inputs is always
negative, leading to an output of 0.

165
Q

how do we detect dead neurons and prevent them

A

we can monitor activations across many different inputs; if a neuron's activation is 0 (or constant) all the time, it is dead.

this can be addressed by using leaky ReLU, normalization layers (e.g. layer norm), or proper weight initialization

166
Q

what is pruning and how do you decide what to prune?

A
  • Magnitude. Remove weights that have magnitude close to 0; remove neurons whose weight vector's
    L2 norm is also close to 0 (see the sketch after this list).
  • Activations. We could also use the training data to observe the activations of the neurons. We
    can remove neurons whose distribution of activations is extremely peaked (invariant to the input);
    moreover, we can also remove a neuron if its activation pattern is highly correlated with another
    neuron in the same layer.
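
A sketch of magnitude pruning for a single weight matrix (the 50% sparsity target is arbitrary):

import numpy as np

def magnitude_prune(W, sparsity=0.5):
    # Zero out the `sparsity` fraction of weights with the smallest magnitudes
    k = int(W.size * sparsity)
    threshold = np.partition(np.abs(W).ravel(), k)[k]
    return np.where(np.abs(W) < threshold, 0.0, W)

W = np.random.randn(4, 4)
print(magnitude_prune(W))  # roughly half the entries are now exactly 0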