Interview Flashcards

1
Q

During the data collection process described, we compute a checksum for each file for
precise file deduplication, and compress the files for storage. What is meant by this?

A

A checksum gives you a numerical representation of the contents of a file, used to verify the integrity of the data in the file.
We can remove duplicate files that have the same content but different names using this method.

In the context of the described data collection process, computing a checksum means generating a unique, fixed-size numerical value (often represented as a string of letters and numbers) that is calculated from the contents of a file or a piece of data. This checksum serves multiple purposes:

Integrity Verification: The checksum helps verify the integrity of the data or files over time. If the file is altered in any way, even a small change, recomputing the checksum will result in a different value. This allows for easy detection of corruption or unintended changes.

Deduplication: In the process described, the checksum is used for precise file deduplication. By computing and comparing checksums, it is possible to identify and eliminate duplicate files in the dataset, even if the files are named differently. This is because the checksum is based on the file content, not the file name or other metadata.
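
A minimal sketch of checksum-based deduplication (assuming SHA-256 from Python's standard hashlib; paths and chunk size are illustrative):

import hashlib

def file_checksum(path, chunk_size=8192):
    # Compute the SHA-256 checksum of a file, reading in chunks to bound memory
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def deduplicate(paths):
    # Keep only the first file seen for each distinct checksum
    seen = {}
    for path in paths:
        digest = file_checksum(path)
        if digest not in seen:  # a duplicate has identical content, whatever its name
            seen[digest] = path
    return list(seen.values())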

2
Q

what data preprocessing steps were done at keysight?

A

You can compute checksums (in Python) for deduplication, and you can weed out some files based on the version of the code.
Excessively large code files were filtered out, as they were probably auto-generated.
The file path was added at the beginning of each file's data.

Tokenisation was done to create chunks, and then a synthetic dataset was created by querying either Llama 2 or the GPT API, depending on data sensitivity.

Something I didn't do during the Keysight data work was dependency ordering; I read about it later in DeepSeek Coder and it makes a lot of sense. They gave pseudocode for it, and their code performance is state of the art, which probably has something to do with it.

3
Q

Positional Encoding Calculation and code (formula) also the rationale behind it

A

import numpy as np

def positional_encoding(sentence_length, model_dim):
    # assumes model_dim is even, so the sin/cos pairs fill the matrix exactly
    pos_enc_matrix = np.zeros((sentence_length, model_dim))
    for pos in range(sentence_length):
        for i in range(0, model_dim, 2):
            pos_enc_matrix[pos, i] = np.sin(pos / (10000 ** (i / model_dim)))
            pos_enc_matrix[pos, i + 1] = np.cos(pos / (10000 ** (i / model_dim)))
    return pos_enc_matrix

Rationale behind the Formula
The usage of a sine function for even indices and a cosine function for odd indices allows the model to capture different phase relationships between positions.

This approach offers a "hackish" way to encode positional information at varying scales: dimensions with small i oscillate at high frequencies and distinguish nearby positions, while dimensions with large i oscillate at low frequencies and capture long-range positional relationships.

The specific constant 10000 is introduced to prevent the functions from saturating: it spreads the wavelengths over a wide geometric range, from 2π up to 10000 · 2π.

Which positional encoding is used in LLaMA? (Rotary position embeddings; see the RoPE card.)

4
Q

Are μ and σ for layer norm calculated during training or at runtime?

A

μ and σ are calculated at runtime for each data sample (during both training and inference), while γ and β are trained parameters.

5
Q

in the transformer architecture layer norm is done where?

A

After the linear layer and the skip connection, and before the activation function.

This means the workflow within each sub-block of a transformer is as follows:

  1. Input from the previous layer
  2. Addition of the skip connection output
  3. Layer normalization
  4. Activation function (if any, depending on the sub-block)

However, some implementations and variations apply layer normalization before each sub-layer instead (pre-LN). Both methods have been explored in practice, with different effects on training dynamics and performance. The original Transformer used normalization after the residual connection (post-LN), which preserves the raw output of each layer before it is normalized; many modern LLMs prefer pre-LN because it tends to stabilize training.
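
A minimal PyTorch-style sketch of the post-LN feed-forward sub-block described above (d_model and d_ff are illustrative names):

import torch
import torch.nn as nn

class PostLNFeedForward(nn.Module):
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.linear2(torch.relu(self.linear1(x)))  # sub-layer output
        return self.norm(x + h)  # add the skip connection, then layer norm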

6
Q

[E] What's the geometric interpretation of the dot product of two vectors?

A

Multiplication of the length of one vector and the length of the projection of the other
vector onto the first one.

7
Q

[E] Given a vector u, find vector v of unit length such that the dot product of u and v is
maximum

A

Given a vector u, the vector v of unit length that maximizes the dot product u · v is the vector that points
in the same direction as u. The vector v can be found by dividing u by its own magnitude, making it a
unit vector.

8
Q

Give an example of how the outer product can be useful in ML.

A

The covariance matrix is a commonly used quantity in machine learning algorithms (e.g. PCA). Given a dataset X ∈ R^(n×d) with n samples and d features, we calculate the (empirical) covariance as a sum of outer products:

Cov[X] = (1/n) ∑_{i=1}^n (x_i − x̄)(x_i − x̄)^T

where each term (x_i − x̄)(x_i − x̄)^T is an outer product, and x̄ is the mean feature vector: x̄ = (1/n) ∑_{i=1}^n x_i
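
A quick numpy check that the covariance is an average of outer products (random data; rows are samples):

import numpy as np

n, d = 100, 3
X = np.random.randn(n, d)
x_bar = X.mean(axis=0)

# Average of outer products (x_i - x_bar)(x_i - x_bar)^T over the samples
cov = sum(np.outer(x - x_bar, x - x_bar) for x in X) / n

# Same computation in matrix form
assert np.allclose(cov, (X - x_bar).T @ (X - x_bar) / n)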

9
Q

[E] What does it mean for two vectors to be linearly independent?

A

Two vectors are linearly independent if no scalar multiple of one vector equals the other. This means
they do not lie on the same line through the origin and neither can be formed by scaling the other.

10
Q

[M] Given two sets of vectors A and B, how do you check that they share the same basis (i.e., span the same subspace)?

A

You first have to make sure A and B have the same rank. This can be done by converting A and B into row echelon form, or via:

Singular Value Decomposition (SVD): decompose A into the product of three matrices, A = U Σ V^T, where U and V are orthogonal matrices and Σ is a diagonal matrix. The number of non-zero diagonal entries in Σ is the rank of A.

Once you have the rank r, you can take the r vectors that span the row space of each set and show that every row of A is spanned by the rows of B and vice versa.

11
Q

[M] Given n vectors, each of d dimensions. What is the dimension of their span?

A

Given n vectors, each of d dimensions, the dimension of their span depends on their linear
independence. If all vectors are linearly independent, the dimension of their span is min(n, d). If they are
not all linearly independent, the dimension will be less than min(n, d).

Can think of this as a matrix whose n columns are the d-dim vectors; the dimension of the span is the rank of this matrix, which is at most min(n, d).

12
Q

[E] What's a norm? What are the L0, L1, L2, and L∞ norms?

A

A norm is a function that assigns a strictly positive length or size to each vector in a vector space,
except for the zero vector, which is given a length of zero.
The L0 "norm" refers to the number of nonzero elements in a vector (not a true norm, since it is not homogeneous),
the L1 norm is the sum of the absolute values of the vector elements,
the L2 norm (also known as the Euclidean norm) is the square root of the sum of the squares of the vector elements,
and the L∞ norm is the maximum absolute value of the elements in the vector.

13
Q

[M] How do norm and metric differ? Given a norm, make a metric. Given a metric, can we make a norm?

A

Ans. A metric measures distances between pairs of things while a norm measures the size of a single item. Metrics can be defined on pretty much anything, while the notion of a norm applies only to vector spaces: the very definition of a norm requires that the things measured by the norm could be added and scaled. If you have a norm, you can define a metric by saying that the distance between a and b is the size of a - b.

On the other hand, if you have a metric you can't usually define a norm: the metric may be defined on a set that isn't a vector space, or it may fail the homogeneity and translation invariance that a norm-induced metric must satisfy (the discrete metric is a standard example).

14
Q

[E] Why do we say that matrices are linear transformations?

A

Matrices represent linear transformations because they map linear combinations of vectors to other linear
combinations.

Specifically, for a matrix M and vectors u and v, and a scalar c, the following properties hold true, which are
the properties of linear transformations:
M(u + v) = Mu + Mv
Applying M to the sum of two vectors is the same as summing the results of applying M to each
vector individually.
M(cu) = c(Mu)
Applying M to a scalar multiple of a vector is the same as multiplying the result of applying M by
that same scalar.
These properties demonstrate that matrix multiplication exhibits the key aspects of a linear transformation:
1. Additivity - Applying the transformation to vector sums gives the sum of individually transformed
vectors
2. Homogeneity - Scalars distribute across the transformation
Thus, matrices and matrix multiplication inherently represent and operate as linear transformations
between vector spaces.

15
Q

What’s the inverse of a matrix? Do all matrices have an inverse? Is the inverse of a matrix
always unique?

A

The inverse of a matrix A is another matrix A^-1 such that AA^-1 = A^-1A = I, where I is the identity matrix.
Not all matrices have an inverse; only square matrices that are non-singular (with a non-zero determinant)
have an inverse. When a matrix has an inverse, it is always unique.

16
Q

[E] What does the determinant of a matrix represent?

A

The determinant of a matrix can be interpreted as a scaling factor for the transformation that the matrix
represents. Geometrically, it represents the volume scaling factor of the linear transformation described by
the matrix, including whether the transformation preserves or reverses the orientation of the space.

17
Q

[E] What happens to the determinant of a matrix if we multiply one of its rows by a scalar t ∈ R?

A

Multiplying a row of a matrix by a scalar t scales the determinant of the matrix by that scalar. So if the
original determinant was d, the new determinant will be t×d.

18
Q

[M] A 4×4 matrix has four eigenvalues 3,3,2,−1. What can we say about the trace and the
determinant of this matrix?

A

The trace of the matrix, which is the sum of its eigenvalues, would be 3+3+2−1=7. The determinant of the
matrix, which is the product of its eigenvalues, would be 3×3×2×(−1)=−18

19
Q

what is the determinant of a matrix with linearly dependent rows or columns

A

0

20
Q

[M] What’s the difference between the covariance matrix A^TA and the Gram matrix AA^T?

A

Suppose A ∈ R^(n×d), corresponding to n samples each having d features (with mean-centered columns). Then the covariance matrix A^T A ∈ R^(d×d) captures the "similarity" between features, whereas the Gram matrix A A^T ∈ R^(n×n) captures the "similarity" between samples.

21
Q

Given A ∈ R^(n×m) and b ∈ R^n

Find x such that: Ax=b

When does this have a unique solution?

Why is it when A has more columns than rows, Ax=b has multiple solutions?

Given a matrix A with no inverse. How would you solve the equation Ax=b?

A

Unique Solution

Square and Full Rank: If A is square (n = m) and has full rank (rank(A) = m = n), then A is invertible. The equation Ax = b has a unique solution given by x = A^(-1)b.

Non-Square but Full Column Rank: If A is not square (m ≠ n), but rank(A) = m (full column rank) and m < n (tall matrix), a left inverse A_L of A exists such that A_L A = I. This configuration implies Ax = b has a unique solution when b is in the range (column space) of A. The solution can be given by x = A_L b. In this scenario, A_L = (A^T A)^(-1)A^T and there is no null space of A (nullity(A) = 0) because all columns are linearly independent.

Multiple or No Solutions

Underdetermined System (m > n): If A is a wide matrix (more columns than rows), the rank of A can be at most n, and the null space of A (nullity(A)) has dimension m - n (assuming rank(A) = n). This situation typically means:

No Solution: If b is not in the column space of A, then Ax = b has no solution.

Infinitely Many Solutions: If b is in the column space of A, there are infinitely many solutions because there are free variables associated with the non-trivial null space of A.

22
Q

What is the pseudoinverse and how to calculate it?

A

Method 1: Singular Value Decomposition (SVD)
The most reliable method for computing the pseudoinverse of any matrix, whether square or rectangular, full rank or rank-deficient, is using its Singular Value Decomposition (SVD). The SVD of matrix A is given by:
A = U Σ V^T
where:
U is an n×n orthogonal matrix whose columns are the left singular vectors of A.
V is an m×m orthogonal matrix whose columns are the right singular vectors of A.
Σ is an n×m diagonal matrix with non-negative real numbers on the diagonal, known as the singular values.
The pseudoinverse A+ is then calculated as:
A+ = V Σ+ U^T
Here, Σ+ is obtained by taking the reciprocal of each non-zero singular value in Σ, and transposing the matrix.

Method 2: Using Matrix Transpose and Inversion
For matrices that are either full row rank or full column rank, the pseudoinverse can also be computed directly using matrix transposes and inversion:
Full Column Rank (m≤n and rank(A)=m):
A+ = (A^T A)^(-1) A^T
Full Row Rank (m≥n and rank(A)=n):
A+ = A^T (A A^T)^(-1)
Note: These formulas assume that A has full column or row rank, respectively.

Use A A^T or A^T A, whichever gives you a full-rank (hence invertible) matrix, and use it in the corresponding formula above to find the pseudoinverse.
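
A short numpy sketch of Method 1, checked against numpy's built-in np.linalg.pinv (the random matrix is illustrative):

import numpy as np

A = np.random.randn(5, 3)  # tall matrix, full column rank almost surely

U, s, Vt = np.linalg.svd(A, full_matrices=False)
# Reciprocal of the non-zero singular values; the tolerance guards against rank deficiency
s_inv = np.where(s > 1e-12, 1.0 / s, 0.0)
A_pinv = Vt.T @ np.diag(s_inv) @ U.T

assert np.allclose(A_pinv, np.linalg.pinv(A))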

23
Q

What does the derivative represent?

A

Answer: The derivative of a function measures the sensitivity to change in the function output with
respect to a change in the input.
Moreover, when it exists, the derivative at a given point is the slope of the tangent line to the graph
of the function at that point. The tangent line is the best linear approximation of the function at
that input value. This is the reason why in gradient descent we (slowly) move in the (negative)
direction of the derivative

24
Q

What’s the difference between derivative, gradient, and Jacobian?

A

When f: R → R, we calculate the derivative df/dx.

When f: R^n → R, we calculate the gradient:
∇f = [∂f/∂x1, ∂f/∂x2,…, ∂f/∂xn]

When f: R^n → R^m, we calculate the Jacobian (an mxn matrix):
Jac(f) = [
∂f1/∂x1 … ∂f1/∂xn
⋮ ⋱ ⋮
∂fm/∂x1 … ∂fm/∂xn
]

25
Q

Say we have the weights w ∈ R^(d×m) and a mini-batch x of n elements, each element is of the shape 1 × d, so that x ∈ R^(n×d). We have the output y = f(x; w) = xw. What’s the dimension of the Jacobian ∂y/∂x?

A

First, notice that y ∈ R^(n×m). With that said, Jac_x(f) ∈ R^((n×m)×(n×d)), or equivalently Jac_x(f) ∈ R^((n·m)×(n·d)), given that we have reshaped the 4-dim tensor into a 2-dim tensor, i.e. a matrix.

In general, the Jacobian ∂y/∂x has dimension (dim y) × (dim x).

26
Q

Given a very large symmetric matrix A that doesn't fit in memory, say A ∈ R^(1M×1M), and a
function f that can quickly compute f(x) = Ax. Find the unit vector x so that x^T A x is
minimal.
Hint: Can you frame it as an optimization problem and use gradient descent to find an approximate
solution?

A

To find the unit vector x that minimizes x^T A x, we can frame this as an optimization problem and approach it using an iterative algorithm like gradient descent or the conjugate gradient method. These methods update x in a direction that decreases x^T A x; for gradient descent this means stepping along the negative gradient −2Ax at each step (since ∇(x^T A x) = 2Ax for symmetric A). We would iterate this process until convergence, ensuring at each step that x remains a unit vector by normalizing it.
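
A minimal sketch of this projected gradient descent, assuming f computes Ax (the small dense A here just stands in for the huge matrix):

import numpy as np

def min_unit_quadratic(f, dim, lr=0.01, steps=2000):
    # Minimize x^T A x over unit vectors, given f(x) = Ax
    x = np.random.randn(dim)
    x /= np.linalg.norm(x)
    for _ in range(steps):
        x = x - lr * 2 * f(x)     # gradient of x^T A x is 2Ax (A symmetric)
        x /= np.linalg.norm(x)    # project back onto the unit sphere
    return x

M = np.random.randn(4, 4)
A = (M + M.T) / 2                 # small symmetric stand-in for the real A
x = min_unit_quadratic(lambda v: A @ v, dim=4)
print(x @ A @ x, np.linalg.eigvalsh(A).min())  # should be close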

27
Q

Why do we need dimensionality reduction?

A

Dimensionality reduction is used to reduce the number of random variables under consideration and can
be divided into feature selection and feature extraction. It helps in reducing the time and storage space
required, removes multicollinearity, enhances the interpretation of the parameters, helps in visualizing data,
and most importantly, it can help in avoiding the curse of dimensionality.

28
Q

Eigendecomposition is a common factorization technique used for dimensionality reduction. Is the eigendecomposition of a matrix always unique?

A

The decomposition is not always unique. Suppose A ∈ R^(2×2) has two equal eigenvalues λ1 = λ2 = λ, with corresponding eigenvectors u1, u2. Then:
Au1 = λ1u1 = λu1
Au2 = λ2u2 = λu2
Or written in matrix form:
A [u1 u2] = [u1 u2] [λ 0; 0 λ]
Notice that we can permute the matrix of eigenvectors (thus obtaining a different factorization):
A [u2 u1] = [u2 u1] [λ 0; 0 λ]
But we still end up with the same eigen-properties:
Au2 = λu2
Au1 = λu1

29
Q

Name some applications of eigenvalues and eigenvectors

A

PCA (the principal components are eigenvectors of the covariance matrix); also spectral clustering, PageRank (the ranking vector is an eigenvector of the link matrix), and stability analysis of dynamical systems.

30
Q

We want to do PCA on a dataset of multiple features in different ranges. For example, one is in the range 0-1 and one is in the range 10 - 1000. Will PCA work on this dataset?

A

In PCA we are interested in the components that maximize the variance. If one component (e.g. human height) varies less than another (e.g. weight) because of their respective scales (meters vs. kilos), PCA might determine that the direction of maximal variance more closely corresponds with the ‘weight’ axis, if those features are not scaled. Since a change in height of one meter should be considered much more important than the change in weight of one kilogram, the previous assumption would be incorrect. Therefore, it is important to standardize the features before applying PCA.

31
Q

Under what conditions can one apply eigendecomposition? What about SVD?

A

Eigendecomposition is possible only for (square) diagonalizable matrices. On the other hand, the Singular Value Decomposition (SVD) always exists (even for non-square matrices).

32
Q

What’s the relationship between PCA and SVD?

A

Suppose we have data X ∈ R^(n×d) with n samples and d features. Moreover, assume that the data has been centered so that the mean of each feature is 0. Then, we can perform PCA in two main ways:
First, we compute the covariance matrix C = 1/(n-1)X^T X ∈ R^(d×d), and perform eigendecomposition: C = V L V^T, with eigenvalues as the diagonal of L ∈ R^(d×d), and eigenvectors as the columns of V ∈ R^(d×d). Then, we stack the k eigenvectors of V corresponding to the top k eigenvalues into a matrix V˜ ∈ R^(d×k). Finally, we obtain the component values as follows: X˜ = X V˜ ∈ R^(n×k).
Alternatively, instead of first computing the covariance matrix and then performing eigendecomposition, notice that given the above formulation, we can directly compute SVD on the data matrix X, thus obtaining: X = U Σ V^T. By construction, the right singular vectors in V are the eigenvectors of X^T X. Similarly, we stack the k right singular vectors corresponding to the top k singular values into a matrix V˜ ∈ R^(d×k). Finally, we obtain the component values as follows: X˜ = X V˜ ∈ R^(n×k).
Even though SVD is slower, it is often considered to be the preferred method because of its higher numerical accuracy.
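
A numpy sketch verifying that both routes give the same top-k components (up to sign):

import numpy as np

n, d, k = 200, 5, 2
X = np.random.randn(n, d)
X = X - X.mean(axis=0)            # center the data

# Route 1: eigendecomposition of the covariance matrix
C = X.T @ X / (n - 1)
eigvals, V = np.linalg.eigh(C)
V_top = V[:, np.argsort(eigvals)[::-1][:k]]

# Route 2: SVD of the data matrix directly
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V_svd = Vt[:k].T

# Components agree up to a sign flip per column
assert np.allclose(np.abs(V_top.T @ V_svd), np.eye(k), atol=1e-6)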

33
Q

What is the relationship between SVD and eigendecomposition?

A

Consider A ∈ R^(m×n) of rank r. Then, we can factorize A as follows:
A = U Σ V^T
where U ∈ R^(m×m) is an orthogonal matrix of left singular vectors, V ∈ R^(n×n) is an orthogonal matrix of right singular vectors, and Σ ∈ R^(m×n) is a “diagonal” matrix of singular values such that exactly r of the values σ_i := Σ_ii are non-zero.
By construction:
The left singular vectors of A are the eigenvectors of A A^T. From the Spectral Theorem, the eigenvectors (and thus the left singular vectors) are orthonormal.
The right singular vectors of A are the eigenvectors of A^T A. From the Spectral Theorem, the eigenvectors (and thus the right singular vectors) are orthonormal.
If λ is an eigenvalue of A^T A (or A A^T), then √λ is a singular value of A. From the positive semidefiniteness of A^T A (or A A^T), the eigenvalues (and thus the singular values) are non-negative.

34
Q

What does it mean when a function is differentiable?

A

A function f: U → R is said to be differentiable at a ∈ U if the derivative:
f’(a) = lim(h→0) [f(a + h) - f(a)]/h
exists. This implies that the function is continuous at a. Note that the converse does not hold: a continuous function is not necessarily differentiable. For example, |x| is continuous everywhere but not differentiable at 0; functions with sharp corners are continuous but not differentiable at the corner.

35
Q

[E] Give an example of when a function doesn’t have a derivative at a point.

A

An example is the function f(x) = |x|, which doesn’t have a derivative at x = 0. The graph of this function
has a sharp corner at x = 0, which means there is no single tangent line at that point.

Why can't you apply the limit formula here? (At x = 0 the one-sided limits of the difference quotient are −1 and +1, so the limit, and hence the derivative, does not exist.)

36
Q

Give an example of non-differentiable functions that are frequently used in machine learning.
How do we do backpropagation if those functions aren’t differentiable?

A

An example is the ReLU (Rectified Linear Unit) function, which is non-differentiable at x = 0. In machine
learning, backpropagation with such functions often uses a concept called subgradient, which allows the
algorithm to bypass non-differentiability at certain points. For ReLU, the derivative is defined as 0 for x < 0
and 1 for x > 0, and at x = 0, any value between 0 and 1 can be used.

37
Q

What does it mean for a function to be convex or concave?

A

A function is called convex if the line segment between any two points on the graph of
the function lies above the graph between the two points. More precisely, the function f : X → R is
convex if and only if for all 0 ≤ t ≤ 1 and all x1, x2 ∈ X:
f(tx1 + (1 − t)x2) ≤ tf(x1) + (1 − t)f(x2)
The function f is said to be concave if −f is convex.

38
Q

Why is convexity desirable in an optimization problem?

A

Convexity is desirable because any local minimum of a convex function is also a global
minimum

39
Q

Most ML algorithms we use nowadays use first-order derivatives (gradients) to construct the next
training iteration.
[E] How can we use second-order derivatives for training models?

A

Second-order derivatives can be used in optimization algorithms to better understand the curvature of
the loss function. This information can be used to adjust the learning rate and the direction of the
update steps, potentially leading to faster convergence.

40
Q

Pros and cons of second-order optimization.

A

Pros: Can lead to faster convergence and more informed update steps.
Cons: Computationally expensive, as it requires calculating and inverting the Hessian matrix.

41
Q

How can we use the Hessian (second derivative matrix) to test for critical points?

A

The Hessian matrix can be used to test the nature of critical points. If the Hessian at a point is positive
definite, the point is a local minimum; if it is negative definite, the point is a local maximum; and if it is
indefinite, the point is a saddle point

42
Q

Jensen’s inequality forms the basis for many algorithms for probabilistic inference, including Expectation-Maximization and variational inference. Explain what Jensen’s inequality is.

A

As stated before, for a given convex function f, we had the following property:
f(tx1 + (1 - t)x2) ≤ tf(x1) + (1 - t)f(x2)
Let us generalize this property. Again, suppose we have a convex function f, variables x1, …, xn ∈ I, and non-negative real numbers α1, …, αn such that ∑i αi = 1. Then, by induction we have:
f(α1x1 + … + αnxn) ≤ α1f(x1) + … + αnf(xn)
Let's formalize it one step further. Consider a convex function f, a discrete random variable X with n possible values x1, …, xn, and real non-negative values αi = p(X = xi). Then, we obtain the general form of Jensen's inequality:
f(E[X]) ≤ E[f(X)]

43
Q

Explain the chain rule.

A

The chain rule is a formula that expresses the derivative of the composition of two differentiable functions f and g in terms of the derivatives of f and g. More precisely, if h = f ∘ g is the function such that h(x) = f(g(x)) for every x, then the chain rule is:
dh/dx = df/dg · dg/dx

44
Q

Given the function f(x, y) = 4x^2 - y with the constraint x^2 + y^2 = 1. Find the function’s maximum and minimum values.

A

In order to solve the constrained optimization problem, we form the Lagrangian:
L(x, y, λ) = 4x^2 - y + λ(x^2 + y^2 - 1)
Calculating the gradient and setting it to zero, we obtain:
∇_{x,y,λ} L = (∂L/∂x, ∂L/∂y, ∂L/∂λ) = (8x + 2λx, -1 + 2λy, x^2 + y^2 - 1) = 0
From 8x + 2λx = 2x(4 + λ) = 0, either x = 0 or λ = -4.
If x = 0, the constraint gives y = ±1, so f(0, 1) = -1 and f(0, -1) = 1.
If λ = -4, then -1 + 2λy = 0 gives y = -1/8, so x^2 = 1 - 1/64 = 63/64 and f = 4 · 63/64 + 1/8 = 65/16.
Hence the maximum value is 65/16 (at y = -1/8, x = ±√(63/64)) and the minimum value is -1 (at (0, 1)).

45
Q

Let x ∈ R^n, L = crossentropy(softmax(x), y) in which y is a one-hot vector. Take the derivative of L with respect to x.

A

Let s = softmax(x), so L = -y^T log s. By the chain rule:
∂L/∂x = -y^T ∂[log s]/∂x = -y^T diag(s)^(-1) [diag(s) - s s^T] = -y^T [I - 1s^T] = s^T - y^T (using y^T 1 = 1)
i.e., ∂L/∂x = softmax(x) - y

46
Q

Given a uniform random variable X in the range of [0,1] inclusively. What’s the probability that
X=0.5?

A

For a continuous uniform distribution, the probability of X being exactly any specific value, including 0.5, is
0. This is because the probability for a continuous distribution is defined over intervals, not specific points.

47
Q

Can the values of PDF be greater than 1? If so, how do we interpret PDF?

A

Yes, the values of a Probability Density Function (PDF) can be greater than 1. The key point is that the
area under the PDF curve over the entire range must integrate to 1. A high PDF value does not represent
probability but rather indicates a higher density of the variable at that point.

48
Q

What’s the difference between multivariate distribution and multimodal distribution?

A

A multivariate distribution is a probability distribution with more than one random variable, each with its
range of values. A multimodal distribution is a probability distribution with more than one peak or mode, regardless of how many variables it has.

49
Q

What does it mean for two variables to be independent?

A

In general, continuous random variables X1, …, Xn admitting a joint density are all independent from each other if and only if:
p_{X1,…,Xn}(x1, …, xn) = p_{X1}(x1) · · · p_{Xn}(xn)
This equation states that the joint probability density function (pdf) of the random variables X1, …, Xn factorizes into the product of their individual pdfs, which is a necessary and sufficient condition for independence.

50
Q

It’s a common practice to assume an unknown variable to be of the normal distribution. Why is that?

A

The Central Limit Theorem (CLT) states that the distribution of the sum of a large number of
independent, identically distributed random variables is approximately normal, regardless of the underlying distribution. Because so many things in the universe can be modeled as the sum of a large number of
independent random variables, the normal distribution pops up a lot.

Central limit theorem and law of large numbers.

51
Q

How would you turn a probabilistic model into a deterministic model?

A

To convert a probabilistic model into a deterministic one, you typically use expected values, mode, or
median of the probability distributions as fixed values instead of random variables. This approach ignores
the variability and uncertainty represented by the probability distributions.

52
Q

Explain frequentist vs. Bayesian statistics.

A

Frequentist approach: treats parameters as fixed but unknown; the goal is to use the sample data to build point estimates of the parameters (potentially with standard errors).

Bayesian approach: uses priors; the goal is to build a posterior distribution of the parameters, given the data at hand.

53
Q

Code for merge sort

A

def merge_sort(arr):
    """
    Sorts an array in place using the merge sort algorithm.
    """
    if len(arr) > 1:
        mid = len(arr) // 2        # Finding the mid of the array
        left_half = arr[:mid]      # Dividing the elements into 2 halves
        right_half = arr[mid:]

        merge_sort(left_half)      # Sorting the first half
        merge_sort(right_half)     # Sorting the second half

        i = j = k = 0

        # Merge the two sorted halves back into arr
        while i < len(left_half) and j < len(right_half):
            if left_half[i] < right_half[j]:
                arr[k] = left_half[i]
                i += 1
            else:
                arr[k] = right_half[j]
                j += 1
            k += 1

        # Copy any remaining elements of either half
        while i < len(left_half):
            arr[k] = left_half[i]
            i += 1
            k += 1

        while j < len(right_half):
            arr[k] = right_half[j]
            j += 1
            k += 1

    return arr

Example usage
if __name__ == "__main__":
    sample_array = [12, 11, 13, 5, 6, 7]
    print("Original array:", sample_array)
    sorted_array = merge_sort(sample_array)
    print("Sorted array:", sorted_array)

54
Q

code to recursively read a json file

A

import json

def load_json_file(filename):
    """Load the JSON data from a file."""
    with open(filename, 'r') as file:
        return json.load(file)

def print_json_recursively(data, indent=0):
    """Recursively print JSON data with indentation for nested structures."""
    for key, value in data.items():
        print(' ' * indent + str(key) + ':', end=' ')
        if isinstance(value, dict):  # If value is a dictionary, recurse
            print()
            print_json_recursively(value, indent + 4)
        elif isinstance(value, list):  # If value is a list, iterate each item
            print()
            for i, item in enumerate(value):
                print(' ' * (indent + 4) + f'[{i}]:', end=' ')
                if isinstance(item, dict):
                    print()
                    print_json_recursively(item, indent + 8)
                else:
                    print(item)
        else:
            print(value)

def main():
    filename = 'path_to_your_json_file.json'
    try:
        json_data = load_json_file(filename)
        print_json_recursively(json_data)
    except Exception as e:
        print(f"An error occurred: {e}")

if __name__ == "__main__":
    main()

55
Q

Find the longest increasing subsequence in a string.

A

def longest_increasing_subsequence(s):
    # Cache for memoization
    memo = {}

    def rec(i, prev):
        if i == len(s):
            return 0  # Base case: end of string

        # Check memoized results
        if (i, prev) in memo:
            return memo[(i, prev)]

        # Option 1: Include the current character if it extends the sequence
        taken = 0
        if prev < s[i]:
            taken = 1 + rec(i + 1, s[i])

        # Option 2: Skip the current character
        not_taken = rec(i + 1, prev)

        # Store result in memoization dictionary
        memo[(i, prev)] = max(taken, not_taken)
        return memo[(i, prev)]

    # Start the recursion with a character smaller than any possible as the initial 'previous'
    return rec(0, chr(0))

Example usage
s = "azbycxdwe"
print("Length of the longest increasing subsequence is:", longest_increasing_subsequence(s))  # 5

56
Q

Traverse a tree in pre-order, in-order, and post-order.

A

class TreeNode:
    def __init__(self, val=0, left=None, right=None):
        self.val = val
        self.left = left
        self.right = right

def preorder_traversal(root):
    if root is None:
        return []
    return [root.val] + preorder_traversal(root.left) + preorder_traversal(root.right)

def inorder_traversal(root):
    if root is None:
        return []
    return inorder_traversal(root.left) + [root.val] + inorder_traversal(root.right)

def postorder_traversal(root):
    if root is None:
        return []
    return postorder_traversal(root.left) + postorder_traversal(root.right) + [root.val]

57
Q
  1. Preorder Traversal (Iterative)
A

def preorder_traversal_iterative(root):
    if root is None:
        return []
    stack, output = [root], []
    while stack:
        node = stack.pop()
        if node:
            output.append(node.val)
            stack.append(node.right)  # Right child pushed first so that left is processed first
            stack.append(node.left)
    return output

58
Q

In order traversal of BST

A

def inorder_traversal_iterative(root):
    stack, output = [], []
    current = root
    while current or stack:
        while current:
            stack.append(current)
            current = current.left
        current = stack.pop()
        output.append(current.val)
        current = current.right
    return output

59
Q

Post order traversal of BST

A

def postorder_traversal_iterative(root):
    if root is None:
        return []
    stack, output = [root], []
    while stack:
        node = stack.pop()
        output.append(node.val)
        if node.left:
            stack.append(node.left)
        if node.right:
            stack.append(node.right)
    return output[::-1]  # Reverse the root-right-left order to get left-right-root

Do post order traversal question

60
Q

Given an array of integers and an integer k, find the total number of continuous subarrays whose
sum equals k. The solution should have O(N) runtime

A

def subarraySum(nums, k):
    # Dictionary to store the frequency of cumulative sums
    cumulative_sum_count = {0: 1}  # Base case: sum of 0 exists once
    current_sum = 0
    count = 0

    for num in nums:
        current_sum += num
        # Check if there is a prefix subarray we can subtract
        # that results in the current subarray summing to k
        sum_needed = current_sum - k
        if sum_needed in cumulative_sum_count:
            count += cumulative_sum_count[sum_needed]

        # Update the count of the current_sum in the hashmap
        if current_sum in cumulative_sum_count:
            cumulative_sum_count[current_sum] += 1
        else:
            cumulative_sum_count[current_sum] = 1

    return count
61
Q

You have three matrices: A ∈ R100×5, B ∈ R5×200, C ∈ R200×20, and you need to calculate the product ABC. In what order would you perform your multiplication and why?

A

Since matrix multiplication is associative, the result is the same whether we multiply in the order (AB)C or A(BC). However, the cost in scalar multiplications differs:
(AB)C: 100 · 5 · 200 + 100 · 200 · 20 = 100000 + 400000 = 500000
A(BC): 5 · 200 · 20 + 100 · 5 · 20 = 20000 + 10000 = 30000
Obviously, the second approach is computationally much cheaper.

62
Q

What are some of the causes for numerical instability in deep learning?

A

Overflow, underflow, division by zero, log 0, NaN as input, etc.

63
Q

In many machine learning techniques (e.g. batch norm), we often see a small term ϵ
added to the calculation. What’s the purpose of that term?

A

The purpose is to avoid operations that are undefined for 0, such as division by 0, log 0, etc

64
Q

What made GPUs popular for deep learning? How are they compared to TPUs?

A

GPUs became popular for deep learning because matrix multiplications can be efficiently parallelized over hundreds of cores. TPUs (Tensor Processing Units) are specialized hardware for neural nets, with the key difference that they have lower precision for representing floating-point numbers, allowing for:
Higher memory throughput
Faster addition and multiplication operations
This design enables TPUs to accelerate neural network computations while reducing power consumption and increasing efficiency.

65
Q

What are the time and space complexity for doing backpropagation on a recurrent neural network?

A

O(B · T · w) - time

O(w + B · T · a) - space

For the forward pass of a single example in one timestep we need to evaluate all the weights, resulting in O(w) time complexity, where w is the number of weights. Due to the recurrence, we repeat the computation for T timesteps, resulting in O(T · w). Moreover, performing this unrolled forward pass for an entire batch brings the time complexity to O(B · T · w). Lastly, we note that the time complexity of the forward and the backward pass is the same.

As for the space complexity, note that we need to keep in memory both the network weights and the activations from the forward pass (required for the backprop computation). Given that storing the activations for a single timestep is O(a), the space complexity amounts to O(w + B · T · a)

66
Q

Write a function find_bigrams to take a string and return a list of all bigrams.

A

def find_bigrams(text_list: list):
    """Return the word bigrams of each string in text_list."""
    result = []
    for ls in text_list:
        words = ls.lower().split()
        for bi in zip(words, words[1:]):
            result.append(bi)
    return result

text = ["Data drives everything", "Get the skills you need for the future of work"]
print(find_bigrams(text))

67
Q

what are the 2 kinds of language modelling?

A

Masked and causal language modelling

68
Q

cost to train a model in terms of P and D

A

C= 6PD
P is the number of parameters in the transformer model
D is the dataset size, in tokens
C is the compute required to train the transformer model, in total floating point operations

Equivalently, C = (# of GPUs) × (FLOPs per GPU per second) × (training time in seconds), so dividing C by the available throughput gives the training time.

69
Q

total memory required for training a model

A

Total Memory = Model Memory + Optimizer Memory + Activation Memory + Gradient Memory

Model memory (fp16) -> 2 bytes per parameter

Optimizer memory -> often kept at 32-bit precision: 4 bytes for a master copy of the parameters, 4 bytes for momentum, 4 bytes for variance (12 bytes per parameter for Adam)

Gradients (saved in fp16) -> 2 bytes per parameter

Activations -> 2 bytes × batch size × # of tokens × hidden dimension × # of layers (roughly)

Total memory is roughly 16 bytes × # of parameters (excluding activations)

Benefit of keeping the optimizer states at 32 bits?

70
Q

how do memory requirements change for Zero-1, 2, 3

A

For ZeRO-1,
Total Memory_Training ≈ Model Memory + ((Optimizer memory) / (No. GPUs)) + Activation Memory + Gradient Mem

For ZeRO-2,
Total Memory_Training ≈ Model Memory + Activation Memory + (Optimizer Memory + Gradient Memory)/ (No. GPUs)

For ZeRO-3,
Total Memory_Training ≈ Activation Memory + (Model Memory + Optimizer Memory + Gradient Memory) / (No. GPUs)

Why does the optimiser keep a copy of the gradients along with gradient memory?

71
Q

optimum scaling stated by chinchilla

A

The compute-optimal number of training tokens is roughly D ≈ 20 × (# of parameters).

72
Q

what happens after attention layer output?

A

So, the sequence is:

  • Multi-headed attention
  • o_proj linear projection
  • Skip connection (adding the original input)
  • Layer normalization
  • First FCN layer
  • Activation function
  • Second FCN layer
  • Skip connection
  • Another layer normalization
73
Q

what is BLEU score?

A

BLEU is a precision-based metric: for each n, the numerator is the clipped count of n-gram matches and the denominator is the count of n-grams present in the generated text.

Clipping means that if the bigram "the apple" shows up 7 times in the generated text but only 2 times in the reference text, then its numerator count is 2 while its denominator count is 7.

This gives you BLEU-n for different n.

One problem here is that very short generations shrink the denominator and make the model seem good, so a brevity penalty is multiplied in.

The total BLEU score is the brevity penalty times the exponential of the weighted sum of log precisions: BLEU = BP · exp(∑_n w_n log p_n).
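
A tiny sketch of clipped n-gram precision, the core of BLEU-n (brevity penalty omitted; the example strings are illustrative):

from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, reference, n):
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    # Each candidate n-gram is credited at most as many times as it appears in the reference
    clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    return clipped / max(sum(cand_counts.values()), 1)

cand = "the apple the apple the apple".split()
ref = "the apple is red".split()
print(clipped_precision(cand, ref, 2))  # "the apple" credited once out of 5 bigrams -> 0.2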

74
Q

techniques for regularization in LLM training

A
  1. Dropout
  2. Weight Decay (L2 Regularization)
  3. Layer Normalization
  4. Gradient Clipping
  5. Label Smoothing
75
Q

what is lasso and ridge regression?

A

Using the L1 norm as the regularization term gives lasso; using the L2 norm gives ridge regression.

76
Q

what is gradient clipping? what does it help with?

A

Gradient clipping is when you cap gradients during backprop, typically by rescaling them when their norm exceeds a threshold. It helps with the exploding gradient problem and helps maintain stability during training.

What value is gradient clipped at

77
Q

what does l1 and l2 norm do to parameter values

A

The L1 norm pushes parameters to exactly 0, making it good for variable selection; the L2 norm pushes them close to 0.

Graphically, the L1 constraint in parameter space is a diamond and the L2 constraint is a circle.
The point where the contours of the loss function first touch the diamond or circle is where the regularized solution lies; the diamond is often touched at a corner, where some coordinates are exactly zero.

78
Q

what does lambda in regularization parameter do?

A

It determines the size of the regularization region (circle/diamond) in parameter space: the larger the lambda, the smaller the region, and hence the more the parameters are shrunk.

79
Q

what is bias in a model

A

In the context of machine learning,
bias refers to the error introduced by approximating a real-world problem, which may be complex, by a much simpler model. Simpler models can exhibit high bias, and complex models high variance.

bias - underfitting
variance - overfitting

80
Q

what is variance in a model

A

Variance refers to the error that is introduced by the model's sensitivity to fluctuations in the training set.

If you use a complex model with a lot of parameters, and the parameters are allowed to take large values, then it will be able to model noise along with the data.

If a model has too many parameters or if those parameters are allowed to take on large values, it can become extremely flexible. This means it can capture not only the underlying relationships in the data but also the noise (random fluctuations) specific to the training set. As a result, it may perform very well on the training data (low bias), but poorly on new, unseen data (high variance) because it’s overfitted to the noise and specifics of the training set rather than to the underlying data distribution.

81
Q

Effect of L2 Regularization? on variance as well

A

Penalizing Large Weights: prevent any single feature from having too much influence on the predictions, which is desirable when you suspect some features may be correlated with noise rather than with the signal in the training data.

Smoothness and Generalization: Smaller weights often result in smoother functions that change less drastically with input variations. This smoothness means the model is less likely to pick up on noise and will therefore generalize better to unseen data.

Shrinkage Effect: The regularization term shrinks the parameter values towards zero but not exactly to zero. This effect is akin to a “soft” form of feature selection that lowers the complexity of the model without completely eliminating the contribution of any single feature.

82
Q

skip connections solve what problem?

A

skip connections solve for the problem of vanishing gradient

83
Q

how is the problem of exploding gradient solved in LLMs

A

gradient clipping

84
Q

Accuracy formula

A

(TP+TN)/(FP+FN+TP+TN)

85
Q

precision formula

A

Accuracy of positive predictions: of all samples predicted positive, the fraction that are actually positive.
= TP/(TP+FP)

Also ask gpt and write about what these tell you

86
Q

recall formula

A

coverage of actual positive samples
= TP/(TP+FN)

Is this right ??

87
Q

specificity formula

A

coverage of actual negative samples
= TN/(TN+FP)

88
Q

f1 score

A

harmonic mean of precision and recall, useful for imbalanced classes
= 2TP/(2TP+FP+FN)

89
Q

what is the ROC curve

A

Plot of TPR vs FPR (i.e., recall vs 1 − specificity) as the classification threshold varies.
It allows us to compare two methods, like logistic regression vs boosting.

90
Q

what is sensitivity?

A

it is the same as recall

91
Q

is BLEU score precision recall or f1 score based?

A

BLEU is precision based

92
Q

is ROUGE score precision, recall, or f1 score based?

A

recall based

93
Q

What is BERTScore?

A

BERTScore uses BERT to create contextual embeddings of the tokens in both the output and the reference, matches tokens by cosine similarity (dot product of normalized embeddings) to form precision and recall components, and then takes the harmonic mean of the two.

Do you use a double loop to summation over all n grams in the precision and recall components??

94
Q

steps for evaluation of LLM output

A
  1. coherence
  2. compile - syntax
  3. gpt eval - semantics
  4. expert eval
95
Q

sources of bias in gpt automated eval

A

Position bias: LLMs tend to favor the response in the first position.

Verbosity bias: LLMs tend to favor longer, wordier responses over more concise ones

Self-enhancement bias: LLMs have a slight bias towards their own answers.

96
Q

what is vicuna?

A

Vicuna-13B is a LLaMA finetune on chat data from ShareGPT.
It is one of the first papers to use GPT-4 to evaluate its outputs against different benchmarks.

97
Q

What are the different popular eval benchmarks?

A

Accuracy. recall, precision, specificity, f1 score if you have classes

BLEU, ROUGE, BERTScore if you have text as reference and the task is something like summarisation

Automatic gpt4 based evaluation

98
Q

what is soft prompt tuning

A

prepends a trainable tensor to the model’s input embeddings, essentially creating a soft prompt. Unlike discrete text prompts, soft prompts can be learned via backpropagation, meaning they can be fine-tuned to incorporate signals from any number of labeled examples.

99
Q

what is soft prefix tuning

A

it prepends trainable parameters to the hidden states of all transformer blocks. During fine-tuning, the LM’s original parameters are kept frozen while the prefix parameters are updated.

100
Q

what is the main idea behind LORA?

A

When adapting to a specific task, pre-trained language models have a low intrinsic dimension and can still learn efficiently despite a random projection into a smaller subspace. Thus, LoRA hypothesizes that the weight updates during adaptation also have low intrinsic rank.

How do you decompose into lower rank??
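
A minimal PyTorch sketch of the low-rank idea (illustrative dimensions; not the actual peft implementation):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d_in, d_out, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)   # stands in for a pretrained weight
        for p in self.base.parameters():     # frozen during fine-tuning
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # trainable
        self.B = nn.Parameter(torch.zeros(d_out, r))        # trainable, init to 0
        self.scale = alpha / r

    def forward(self, x):
        # Delta W = B @ A has rank at most r; only A and B receive gradients
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale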

101
Q

what are main ideas introduced by QLORA

A

QLoRA builds on the idea of LoRA. But instead of using the full 16-bit model during fine-tuning, it applies a 4-bit quantized model

  1. 4-bit NormalFloat (to quantize models),
  2. double quantization (for additional memory savings), and
  3. paged optimizers (that prevent OOM errors by transferring data to CPU RAM when the GPU runs out of memory).

Give qlora paper to chatgpt and talk to it about how quantisation works there like u and sigma and normal distribution assumption and binning

102
Q

What benefits do we get from reducing the precision of our model? What problems might we run into?
How to solve these problems?

A

Benefits: less memory, faster computation.
Problems: during training, numerical instability is possible because small gradient values underflow to zero (and large values can overflow) due to the reduced precision, so learning can be affected or stalled.

Mixed precision training (fp32 master weights plus loss scaling) solves these issues.

Details of mixed precision

103
Q

what is the difference between torch.cat() and torch.stack()

A

Given two tensors A and B, both with shape [2, 2]:
torch.cat([A, B], dim=0) will result in a tensor of shape [4, 2].
torch.stack([A, B], dim=0) will result in a tensor of shape [2, 2, 2].

104
Q

how are in place operations done in pytorch

A

In-place operations: operations that have a _ suffix are in-place.

tensor.add_(5)

The above tensor is modified in place.

Use of in-place operations is discouraged with autograd because the original values are immediately overwritten, which can break gradient computation.

105
Q

how to load resnet18 model

A

from torchvision.models import resnet18
model = resnet18()

106
Q

what data can be passed into resnet18 and how would you create a sample data point for it?

A

data = torch.randn(1, 3, 64, 64)  # (batch, channels, height, width)

107
Q

what do the output labels of resnet18 look like? how would you create a sample output label?

A

label = torch.randn(1, 1000)  # one row of scores over the 1000 ImageNet classes

108
Q

how would you generate the prediction for a model on data?

A

prediction = model(data)

109
Q

how would you calculate loss given you had a data point and prediction and do a backward pass with it

A

loss = (prediction - labels).sum()
loss.backward()

110
Q

how would you define an optimizer (say SGD) and run a optimizer step?

A

optim = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

optim.step()

111
Q

steps in a training pipeline in pytorch

A
  1. define model
  2. get data
  3. get labels
  4. define optimizer and loss
  5. call optimizer.zero_grad(), then calculate loss and do loss.backward()
  6. do optimizer.step()

or you define everything and do trainer.train(); a minimal loop is sketched below
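
A minimal sketch of those steps as a plain PyTorch loop (model and data are placeholders):

import torch
import torch.nn as nn

model = nn.Linear(10, 2)                      # 1. define model
data = torch.randn(32, 10)                    # 2. get data
labels = torch.randint(0, 2, (32,))           # 3. get labels
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()               # 4. define optimizer and loss

for epoch in range(5):
    optimizer.zero_grad()                     # clear stale gradients
    loss = loss_fn(model(data), labels)       # 5. calculate loss
    loss.backward()                           # backprop
    optimizer.step()                          # 6. update parameters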

112
Q

what is the Transformers library?

A

The Transformers library from Hugging Face has models and utilities for training and fine-tuning many machine learning models. It is primarily a PyTorch API (with TensorFlow and JAX backends also supported).

113
Q

what is huggingface trainer

A

Trainer is a class in the Hugging Face Transformers library that lets you fine-tune pre-trained models. It lets you abstract away a lot of settings and supports mixed precision training.

114
Q

what are callbacks in PyTorch

A

A callback is a set of methods that you can override or utilize to customize the behaviour of the trainer. For example, you can use it to save the model periodically, log metrics, or modify the learning rate. It's useful for extending the functionality of the training loop without changing the core training logic.

You can pass callbacks as arguments to the trainer object.

115
Q

how do you define and use a callback in pytorch?

A

Callbacks are defined as classes: you create a class that inherits from a base class in the Transformers library (e.g. TrainerCallback) and override a method there to do what you want, when you want it done. For example, define a method that runs every epoch to do something if a condition is met.

116
Q

what are the benefits of using an nn.Module subclass to define your model?

A

Automatic Parameter Registration: This is crucial for training the model because optimizers rely on the .parameters() method of the nn.Module to get a list of all parameters that need updates.

Model Serialization and deserialization: when you save and load model

Device Management: When moving the model to a device (e.g., GPU), all parameters of all contained layers are automatically moved as well.

117
Q

what is the self attention operation?

A

Self-attention takes you from the x_i's to the y_i's.
Each y_i is calculated by taking the dot product of x_i with every x_j, then taking a softmax over these dot products. Each softmaxed dot product becomes the weight with which the weighted sum of the x_j's is taken.

Every y_i is a weighted summation of all x_j's:

y_i = ∑_j w_ij x_j

w'_ij = x_i^T x_j

w_ij = exp(w'_ij) / ∑_j exp(w'_ij)
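
A bare-bones numpy sketch of this basic self-attention (no learned query/key/value projections):

import numpy as np

def basic_self_attention(X):
    # X: (seq_len, d); returns Y where each y_i is a weighted sum of all x_j
    raw = X @ X.T                              # w'_ij = x_i . x_j
    raw -= raw.max(axis=1, keepdims=True)      # for numerical stability
    W = np.exp(raw)
    W /= W.sum(axis=1, keepdims=True)          # row-wise softmax
    return W @ X                               # y_i = sum_j w_ij x_j

X = np.random.randn(5, 4)
Y = basic_self_attention(X)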

118
Q

in attention layers why do you scale the dot product

A

Large dot products saturate the softmax and kill the gradient, so we divide by sqrt(d).

Why sqrt(d)? Imagine a vector in ℝ^d with values all c. Its Euclidean length is sqrt(d) · c. Therefore, we are dividing out the amount by which the increase in dimension increases the length of the average vectors.

119
Q

layer norm is applied on what values in a layer?

A

The layer normalization is applied over the embedding dimension only.

120
Q

what is temperature? how is it implemented?

A

temperature sampling is dividing logits by the temperature before feeding them into softmax and obtaining our sampling probabilities.

lower temp means less random
higher temp means more random
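
A small sketch of the implementation (logits are illustrative):

import numpy as np

def sample_with_temperature(logits, temperature=1.0):
    scaled = logits / temperature              # divide logits by T before softmax
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return np.random.choice(len(logits), p=probs)

logits = np.array([2.0, 1.0, 0.1])
print(sample_with_temperature(logits, temperature=0.5))  # peakier, less random
print(sample_with_temperature(logits, temperature=2.0))  # flatter, more random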

121
Q

what is top-k sampling and what is the problem with it?

A

In top-k sampling you sample from only the k highest-probability tokens.

The problem is that if the distribution has a lot of reasonable options with roughly uniform probability, we'll ignore some possibilities simply because we put a hard stop at k.

122
Q

what is nucleus sampling?

A

Nucleus sampling, aka top-p sampling: we sort tokens by probability, keep the smallest set whose cumulative probability reaches p, and cut off there, sampling only from that set.

123
Q

what are the benefits of layernorm?

A

Stability in Training: LayerNorm stabilizes the neural network training process by ensuring that the distribution of the inputs to the activation functions in a layer does not vary too much. This reduces the problem where the learning has to constantly adjust to a shifting input distribution across layers.

124
Q

What’s the risk in empirical risk minimization?

A

The risk in empirical risk minimization is the expectation of the loss over the joint probability distribution P(x, y), where h is the hypothesis and the loss is L(h(x), y).

With the 0-1 loss, the risk is the integral of the joint probability over the region where the loss is 1, i.e., the probability of misclassification.

R(h) = E[L(h(x), y)] = ∫ L(h(x), y) dP(x, y)

125
Q

what is empirical risk minimization?

A

You take the average of the loss on the training data and call that the empirical risk, since the true risk can't be computed because P(x, y) is not available.

Empirical risk minimization is when you find an h which minimizes the empirical risk.

126
Q

If we have a wide NN and a deep NN with the same number of parameters, which one is more expressive
and why?

A

Deeper networks are more expressive, since they encode an inductive bias that complex functions
can be modeled as composition of simple functions. In turn, this allows the network to learn multiple levels
of an abstraction hierarchy. Empirically, it has been shown that deeper networks lead to more compact
models with better generalization performance.

127
Q

Why does L1 regularization tend to lead to sparsity while L2 regularization pushes weights closer to 0?

A

Because the gradient of the L1 penalty has constant magnitude 1 (it is sign(w)) even close to 0, it keeps pushing weights all the way to exactly 0; the gradient of the L2 penalty is 2w, which becomes very small close to 0, so weights approach but rarely reach 0.

128
Q

what is bagging and boosting

A

In bagging you sample with replacement to train n classifiers and have them vote for the final prediction.

In boosting you first train a weak learner, look at which samples it fails to classify, weight those samples higher, and train another weak learner, iterating this process.

129
Q

What’s the motivation for RNN?

A

Derived from feedforward
neural networks, RNNs can use their internal state (memory) to process variable length sequences of
inputs

130
Q

What’s the motivation for LSTM?

A

vanishing and exploding gradient in RNNs

131
Q

formula for perplexity

A

Perplexity(P) = 2^(−(1/N) ∑ log₂ P(x))

132
Q

what is the key idea in relative position embeddings

A

A bias term is added inside the softmax along with qk^T to signify relative positions, but doing so makes the KV cache hard to reuse, which is bad for inference speed.

133
Q

main idea in rotary position embedding

A

Applied only to the q and k matrices, not v. We multiply by a rotation matrix that rotates a word's embedding by m × θ, where m is the position of the word.

What is done is that blocks of 2 in the embedding dimension are rotated by θ1, θ2, etc.

This makes the attention scores invariant to absolute position: rotating both q and k means their dot product depends only on the relative position.

134
Q

what is KV cache

A

During inference you don't really need to recompute the attention inputs softmax(qK^T)V for all the previous tokens.

Usually q is calculated only for the most recent token and only this one q is used. K and V for all tokens are needed, but only one new row of each is created per step.

So what you do is save the K and V from the previous iterations, generate one extra q, k, and v from the latest token, append the new k and v to the cached K and V, and then do the calculation.
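
A schematic numpy sketch of one decoding step with a cache (single head, projections omitted):

import numpy as np

def attend_with_cache(q_new, k_new, v_new, K_cache, V_cache):
    # Append the newest token's k and v to the cache; attend with its q only
    K = np.vstack([K_cache, k_new])            # old keys reused, one new row
    V = np.vstack([V_cache, v_new])            # old values reused, one new row
    scores = K @ q_new / np.sqrt(len(q_new))   # only the newest query is needed
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V, K, V                         # output plus the updated cache

d = 8
K_cache, V_cache = np.random.randn(4, d), np.random.randn(4, d)  # 4 past tokens
q, k, v = np.random.randn(d), np.random.randn(d), np.random.randn(d)
out, K_cache, V_cache = attend_with_cache(q, k, v, K_cache, V_cache)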

135
Q

How are your model’s accuracy and computational efficiency affected when you decrease or increase
CNN filter size?

A

Increasing the filter size results in decrease in computational efficiency (since the number of model parameters increases), and an increase in accuracy up to a certain point beyond which the network can overfit and imitate a fully-connected network.

Conversely, decreasing the filter size results in an increase in computational efficiency, and a decrease in accuracy when tending towards extremely small kernels (e.g. 1x1), which do not properly capture the local structure of the inputs.

136
Q

Convolutional layers are also known as “locally connected”. Explain what it means

A

Each filter operates only on a small neighborhood (e.g. 3x3) around a given pixel, and the same filter is applied across the entire spatial domain. This weight sharing drastically reduces the number of parameters and injects a "spatial equivariance" bias.

137
Q

What does a 1x1 convolutional layer do?

A

A 1x1 conv layer changes the number of channels in the activation volume (most often reducing it, to save computation) while mixing information across channels at each spatial location.

138
Q

What happens when pooling is removed completely?

A

The main purpose of the pooling operation is to increase the receptive field of the network
by non-parametric techniques. If pooling is completely removed, then we’d need an increasingly large
stack of convolutional layers to achieve large enough receptive fields for the neurons located in the
deep layers. Of course, this comes at the cost of drastically increasing the number of parameters and
computation requirements.

139
Q

how to calculate cosine similarity in python

A
A = [5, 3, 4]
B = [4, 2, 4]

dot_product = sum(a*b for a, b in zip(A, B))

magnitude_A = sum(a*a for a in A)**0.5
magnitude_B = sum(b*b for b in B)**0.5

cosine_similarity = dot_product / (magnitude_A * magnitude_B)
print(f"Cosine Similarity using standard Python: {cosine_similarity}")
140
Q

How would a finite or infinite horizon affect our algorithms?

A

With a finite horizon, when evaluating the reward of an action we only consider rewards over a finite number of future steps.

With an infinite horizon, we consider the (discounted) rewards of all future steps.

Infinite horizons are more common in practice.

141
Q

Why do we need the discount term for objective functions

A

Using a discount rate γ < 1 is a mathematical trick to make an infinite sum finite: the discounted rewards form a convergent geometric series. It also encodes a preference for earlier rewards over later ones.

142
Q

what is the q learning update rule

A

Q(S_t, A_t) ← Q(S_t, A_t) + α (R_{t+1} + γ max_a Q(S_{t+1}, a) − Q(S_t, A_t))

γ -> discount rate
α -> learning rate
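
A one-step tabular sketch of the update (toy state/action indices and reward):

import numpy as np

n_states, n_actions = 5, 3
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99

# One experience (s, a, r, s_next); the TD target bootstraps
# from the best action in the next state
s, a, r, s_next = 0, 1, 1.0, 2
Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])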

143
Q

experience tuple in RL

A

(s_t, a_t, r_t, s_{t+1})

144
Q

how do variational autoencoders work

A

Put an image through the encoder, which outputs a mean and a standard deviation (log standard deviation in practice, for numerical stability). Sample a latent vector by scaling and shifting a standard normal sample with those parameters (the reparameterization trick), then try to reconstruct the input image with the decoder.

The reconstruction loss against the input image trains both encoder and decoder, while the KL divergence between the encoder's output distribution and a standard normal acts as a regularizer.
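
A minimal sketch of the reparameterization step and the KL term for a diagonal Gaussian (toy latent size; encoder and decoder left abstract):

import numpy as np

# Pretend these are the encoder's outputs for one image
mu = np.random.randn(16)
log_sigma = 0.1 * np.random.randn(16)

# Reparameterization trick: scale and shift a standard normal sample,
# which keeps the sampling step differentiable w.r.t. mu and sigma
eps = np.random.randn(16)
z = mu + np.exp(log_sigma) * eps   # latent vector fed to the decoder

# KL(q(z|x) || N(0, I)) for a diagonal Gaussian: the regularizer term
kl = 0.5 * np.sum(np.exp(2 * log_sigma) + mu**2 - 1 - 2 * log_sigma)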

145
Q

what is the formula for vanilla gradient update?

A

w_{t+1} = w_t − α · (1/N) ∑_{i=1}^{N} ∇_w L_i(w_t, y_i, ŷ_i)

146
Q

Implement vanilla dropout for the forward and backward pass in NumPy

A

Things to note here: how both the input and the gradient are multiplied by the same mask, how the mask is created from the dropout rate, and how the kept activations are rescaled by 1/(1 − p) (inverted dropout) so the expected value is unchanged.

import numpy as np

def forward_pass_with_dropout(X, dropout_rate):
    """
    Apply dropout to the input layer X during the forward pass.
    :param X: Input data for the layer, numpy array of shape (n_features, n_samples)
    :param dropout_rate: The probability of setting a neuron's output to 0
    :return: A tuple (output after applying dropout, dropout mask used)
    """
    # Create a mask from the dropout rate; dropped units get 0
    dropout_mask = np.random.rand(*X.shape) > dropout_rate
    # Apply the mask to the input data
    dropped_out_X = np.multiply(X, dropout_mask)
    # During training, scale the kept activations so the expected value is unchanged
    dropped_out_X /= (1 - dropout_rate)
    return dropped_out_X, dropout_mask

def backward_pass_with_dropout(dA, dropout_mask, dropout_rate):
    """
    Apply the stored mask to the gradient during the backward pass.
    :param dA: Gradient of the loss with respect to the activations, numpy array
    :param dropout_mask: The dropout mask that was used during the forward pass
    :param dropout_rate: The probability of setting a neuron's output to 0
    :return: Gradient after applying the dropout mask
    """
    # Apply the dropout mask to the gradients
    dA_with_dropout = np.multiply(dA, dropout_mask)
    # Scale the gradients as we did during the forward pass
    dA_with_dropout /= (1 - dropout_rate)
    return dA_with_dropout

# Example usage:
np.random.seed(0)  # for reproducibility
X = np.random.randn(5, 3)  # 5 features, 3 samples
dropout_rate = 0.2  # 20% dropout rate

# Forward pass
X_dropped, mask = forward_pass_with_dropout(X, dropout_rate)

# Suppose we have some gradient dA from the backward pass
dA = np.random.randn(5, 3)

# Backward pass
dA_dropped = backward_pass_with_dropout(dA, mask, dropout_rate)
147
Q

why are RNNs susceptible to vanishing and exploding gradients and how to solve?

A

Backpropagation through time (BPTT) multiplies gradients by the same recurrent weight matrix at every time step; over long sequences this repeated multiplication makes gradients shrink towards zero (vanish) or blow up (explode).

Remedies:
* Gated units: RNN variants with gating mechanisms, such as LSTM (Long Short-Term Memory) or GRU
* Gradient clipping (against exploding gradients)
* Skip connections
* Proper initialization and activation functions

148
Q

When training a large neural network, say a language model with a billion parameters, you
evaluate your model on a validation set at the end of every epoch. You realize that your validation
loss is often lower than your train loss. What might be happening?

A

It might be that the training loss uses regularizers (e.g. L2 norm on the weights). Since at validation time we only evaluate the main loss function, it might happen that the validation loss is lower than
the (composite) train loss.

The model might have dropout layers, which impose heavy regularization during training. These layers behave differently at training time (dropping neurons) and at inference time (keeping all of them).

The validation set might simply be easier compared to the train set.

149
Q

What criteria would you use for early stopping?

A

We can perform early stopping when a metric of interest (validation loss, accuracy, etc.) stops improving. It is a good idea to use a patience parameter, so that training stops only after several epochs without improvement rather than reacting to noisy estimates.
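
A runnable toy sketch of such a patience loop (made-up validation losses):

val_losses = [1.0, 0.8, 0.7, 0.71, 0.69, 0.72, 0.73, 0.74]  # toy values
best, patience, patience_left = float("inf"), 2, 2

for epoch, val_loss in enumerate(val_losses):
    if val_loss < best:
        best, patience_left = val_loss, patience   # improvement: reset patience
    else:
        patience_left -= 1                         # no improvement: spend patience
        if patience_left == 0:
            print(f"early stop at epoch {epoch}")
            break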

150
Q

Gradient descent vs SGD vs mini-batch SGD.

A

In gradient descent we first perform a forward pass and a backward pass for each sample in the
dataset, before we take a step in the direction of the cumulative gradient. This is extremely slow
to converge, as today’s datasets are extremely large in size, which implies that we perform gradient
updates too rarely.
* In SGD, after performing a forward and a backward pass for a single sample, we take a step in the
direction of the single gradient. Even though we perform gradient updates much more often, the
entire process is extremely noisy as we are optimizing with respect to a single sample at a time.
* Mini-batch SGD combines the best of both worlds: perform a forward/backward pass for a batch
(e.g. 32) of samples, and take a step in the direction of the gradient for the current mini-batch. On
one hand, we perform updates more often than pure gradient descent; on the other hand, we optimize
over an entire mini-batch, minimizing the noise in the estimated gradient.

151
Q

Your model’s weights fluctuate a lot during training. How does that affect your model’s performance?
What to do about it?

A

The fluctuation can be attributed to two primary factors: large learning rate or exploding gradients. In either case, this can seriously destabilize the training process. In order to resolve this issue, we could:
1) lower the learning rate;
2) perform gradient clipping

152
Q

Draw a graph number of training epochs vs training error for when the learning rate is: i) too high;
ii) too low; iii) acceptable.

A
  i. Too high: the loss decreases fast at first but stabilises at a high value (or oscillates/diverges).
  ii. Too low: the loss decreases very slowly.
  iii. Acceptable: the loss decreases steadily and stabilises at a low value.
153
Q

What’s learning rate warmup? Why do we need it?

A

Warmup steps are just a few updates with low learning rate at the beginning of training.
After this warmup, you use the regular learning rate (schedule) to train your model to convergence.
For example, RMSProp computes a moving average of the squared gradients to get an estimate
of the variance in the gradients for each parameter. For the first update, that estimate is based
solely on the first batch's squared gradients. Since, in general, this will not be a good estimate,
your first update could push your network in a wrong direction. To avoid this problem, you give
the optimiser a few steps to estimate the variance while making as few changes as possible (low
learning rate), and only when the estimate is reasonable do you use the actual (higher) learning rate.

154
Q

Compare batch norm and layer norm.

A

Suppose the output of the previous layer is X ∈ R^{B×D}, where B is the batch size and D is the dimensionality of the embedding. Both techniques normalize the input as follows:

Y = (X − E[X]) / sqrt(Var[X] + ϵ) ∗ γ + β

where γ and β are learnt affine parameters, and all operations are treated as broadcasts. The main difference stems from how the two techniques compute the statistics:
* Batch norm computes the mean and standard deviation over the batch, meaning that E[X] ∈ R^{1×D} and Var[X] ∈ R^{1×D}.
* Layer norm computes the mean and standard deviation over the features, meaning that E[X] ∈ R^{B×1} and Var[X] ∈ R^{B×1}.
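
The difference is just the axis over which the statistics are taken; a NumPy sketch (γ and β omitted):

import numpy as np

B, D = 4, 8
X = np.random.randn(B, D)
eps = 1e-5

# Batch norm: per-feature statistics over the batch axis -> shape (1, D)
bn = (X - X.mean(axis=0, keepdims=True)) / np.sqrt(X.var(axis=0, keepdims=True) + eps)

# Layer norm: per-sample statistics over the feature axis -> shape (B, 1)
ln = (X - X.mean(axis=1, keepdims=True)) / np.sqrt(X.var(axis=1, keepdims=True) + eps)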

155
Q

Some models use weight decay: after each gradient update, the weights are multiplied by a factor slightly
less than 1. What is this useful for?

A

If you use L2 regularization, then in the gradient update rule you are effectively multiplying w by (1 − αλ) before applying the gradient of the data loss. This is called weight decay; it keeps the weights small, which helps generalization. PyTorch implements weight decay as an optimizer argument.
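
A quick NumPy check of that equivalence (toy weights; PyTorch exposes this as the optimizer's weight_decay argument):

import numpy as np

w = np.random.randn(10)
grad = np.random.randn(10)   # gradient of the data loss alone
lr, lam = 0.01, 0.001

# L2 regularization folded into the gradient step ...
w_l2 = w - lr * (grad + lam * w)

# ... equals shrinking w by (1 - lr * lam) before the plain gradient step
w_decay = w * (1 - lr * lam) - lr * grad

assert np.allclose(w_l2, w_decay)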

156
Q

It’s a common practice for the learning rate to be reduced throughout the training
i. What’s the motivation?
ii. What might be the exceptions

A

The learning rate controls the size of the update steps along the gradient: large steps early on make fast progress, and smaller steps later let the model settle into a minimum without overshooting.

In Adam, the moving average of squared gradients already adapts the effective step size per parameter, so an explicit decay schedule matters less.

An exception would be continual learning, where the model must keep adapting and the learning rate should not decay to zero.

157
Q

How should we adjust the learning rate as we increase or decrease the batch size?

A

When utilizing larger batches, we can afford larger learning rates, as the approximated gradient of the batch is closer to the true gradient. On the other hand, using very small batches yields noisy estimates of the gradient, so it is advisable to also use small learning rates in order not to diverge during optimization.

158
Q

What can you say about the ability to converge and generalize of Adam vs. SGD?

A

solutions found by adaptive methods (e.g. Adam) generalize
worse than SGD, even when these solutions have better training performance. In other words,
adaptive methods converge faster, but have worse generalization performance than pure SGD.

159
Q

SGD vs adam optimzer formula

A

SGD is the plain gradient update scaled by the learning rate: w ← w − α∇L.

Adam, in addition, keeps a momentum term and a variance term: the momentum m is an exponential moving average of the gradient, and the variance v is an exponential moving average of the squared gradient. After bias correction, the update is w ← w − α · m̂ / (sqrt(v̂) + ε).
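
A single Adam step as a NumPy sketch (standard default hyperparameters):

import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g        # momentum: moving average of the gradient
    v = b2 * v + (1 - b2) * g**2     # moving average of the squared gradient
    m_hat = m / (1 - b1**t)          # bias correction (t starts at 1)
    v_hat = v / (1 - b2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# SGD, for contrast, is just: w = w - lr * g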

160
Q

asynchronous SGD vs.
synchronous SGD in distributed training

A

The choice between asynchronous and synchronous SGD in a distributed training environment
depends on the specific requirements of the training process and the trade-offs between speed and
stability. Asynchronous SGD lets workers push updates independently, which is faster but can suffer from stale gradients; synchronous SGD aggregates all workers' gradients before each update, providing more stable convergence at the cost of potentially slower training.

161
Q

Why don’t we just initialize all weights in a neural network to zero?

A

All neurons in a layer would compute identical outputs, so during backpropagation they receive identical local derivatives, which in turn causes every neuron in the layer to perform the same update; the symmetry is never broken and the neurons remain identical.

162
Q

What are some sources of randomness in a neural network?

A

Random weight initialization, mini-batch shuffling, dropout, etc

163
Q

what are some ways in which neural networks benefit from randomness during training?

A

1) The noise in SGD's mini-batch gradients helps the optimizer escape local minima and saddle points.
2) Sampling-based randomness helps exploration: sampling outputs in LLMs yields more diverse generations, and sampling actions in RL (e.g. when playing chess) drives exploration of the state space.
3) Dropout prevents the network from becoming too dependent on any individual weights, acting as a regularizer.

164
Q

What’s a dead neuron?

A

A dead neuron in a neural network is a neuron that always outputs the same value, regardless of the
input. This typically happens when the neuron’s activation function is non-linear, like ReLU (Rectified
Linear Unit), and the input to the neuron is such that the activation function always outputs the
minimum value (which is 0 in the case of ReLU). For ReLU, this happens when the weights and biases
of the neuron are adjusted during training in such a way that the weighted sum of the inputs is always
negative, leading to an output of 0

165
Q

how do we detect dead neurons and prevent them

A

We can monitor activations over many different inputs; if a neuron's activation is 0 (or constant) all the time, it is dead.

Dead neurons can be prevented by using Leaky ReLU, layer normalization, or proper weight initialization.
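
A sketch of the monitoring idea (toy ReLU activations collected over many inputs):

import numpy as np

# activations: (n_inputs, n_neurons) ReLU outputs gathered during evaluation
activations = np.maximum(np.random.randn(1000, 64) - 3, 0)   # toy, mostly zero

# A neuron that never fires across many different inputs is likely dead
dead = (activations == 0).all(axis=0)
print(f"{dead.sum()} of {dead.size} neurons never activate")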

166
Q

what is pruning and how do you decide what to prune?

A
Pruning removes weights or whole neurons that contribute little to the model's output, yielding a smaller and faster network with (ideally) little loss in accuracy. Common criteria (a magnitude-pruning sketch follows the list):
  • Magnitude. Remove weights whose magnitude is close to 0; remove neurons whose weight vector has an L2 norm close to 0.
  • Activations. Use the training data to observe the activations of the neurons. We can remove neurons whose distribution of activations is extremely peaked (invariant to the input); moreover, we can remove a neuron whose activation pattern is highly correlated with another neuron in the same layer.
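
A minimal magnitude-pruning sketch (toy weight matrix, 50% sparsity):

import numpy as np

W = np.random.randn(64, 64)
threshold = np.quantile(np.abs(W), 0.5)   # prune the 50% smallest-magnitude weights

# Zero out small weights; keep the mask so fine-tuning doesn't revive them
mask = np.abs(W) >= threshold
W_pruned = W * mask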