interview_ML Flashcards
Compute a checksum for precise file deduplication, and compress the files for storage. What is meant by this?
A checksum gives you a numerical representation of the contents of a file and is used to verify the integrity of the data in that file.
We can remove duplicate files that have the same content but different names using this method.
In the context of the described data collection process, computing a checksum means generating a unique, fixed-size numerical value (often represented as a string of letters and numbers) that is calculated from the contents of a file or a piece of data. This checksum serves multiple purposes:
Integrity Verification: The checksum helps verify the integrity of the data or files over time. If the file is altered in any way, even a small change, recomputing the checksum will result in a different value. This allows for easy detection of corruption or unintended changes.
Deduplication: In the process described, the checksum is used for precise file deduplication. By computing and comparing checksums, it is possible to identify and eliminate duplicate files in the dataset, even if the files are named differently. This is because the checksum is based on the file content, not the file name or other metadata.
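A minimal Python sketch of how checksum-based deduplication could look (the hashing choice and file handling are illustrative):

import hashlib

def file_checksum(path, chunk_size=8192):
    # Stream the file through SHA-256 so large files don't need to fit in memory
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def deduplicate(paths):
    # Keep only the first file seen for each unique checksum
    seen, unique = set(), []
    for p in paths:
        digest = file_checksum(p)
        if digest not in seen:
            seen.add(digest)
            unique.append(p)
    return unique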
what data preprocessing steps were done at keysight?
You can compute checksums (Python code) and deduplicate, and you can weed out some files based on the version of the code.
Excessively large code files were filtered out, as they were probably auto-generated.
The file path was added at the beginning of each file's data.
Tokenisation was done to create chunks, and then a synthetic dataset was created by querying either Llama 2 or the GPT API, depending on data sensitivity.
Something I didn't do during the Keysight data work was dependency ordering; I read about it later in DeepSeek Coder. It makes a lot of sense, they gave pseudocode for it, and their code performance is state of the art, which probably has something to do with it.
Positional Encoding Calculation and code (formula) also the rationale behind it
import numpy as np
def positional_encoding(sentence_length, model_dim):
    # Sinusoidal positional encodings: sin on even indices, cos on odd indices
    pos_enc_matrix = np.zeros((sentence_length, model_dim))
    for pos in range(sentence_length):
        for i in range(0, model_dim, 2):
            pos_enc_matrix[pos, i] = np.sin(pos / (10000 ** (i / model_dim)))
            if i + 1 < model_dim:  # guard in case model_dim is odd
                pos_enc_matrix[pos, i + 1] = np.cos(pos / (10000 ** (i / model_dim)))
    return pos_enc_matrix
Rationale behind the Formula
The usage of a sine function for even indices and a cosine function for odd indices allows the model to capture different phase relationships between sequences of different frequencies.
This approach offers a "hackish" way to represent positional information at varying scales: the low embedding dimensions oscillate with high frequency and distinguish nearby positions, while the higher dimensions oscillate with low frequency and capture coarser, longer-range position information.
The specific constant 1/10000 is introduced to prevent the function from saturating.
Positional encoding used in llama?
Are mu and sigma for layer norm calculated during training or at runtime?
mu and sigma are calculated at runtime for each data sample (both in training and at inference); gamma and beta are trained parameters.
in the transformer architecture layer norm is done where?
After the linear layer and the addition of the skip connection, and before the activation function.
This means the workflow within each sub-block of a transformer is as follows:
- Input from the previous layer
- Addition of the skip connection output
- Layer normalization
- Activation function (if any, depending on the sub-block)
However, some implementations and variations might apply layer normalization before adding the skip connection output. Both methods have been explored in practice, with different effects on training dynamics and performance. The original approach (normalization after the residual connection) is more common and is often preferred because it allows the model to preserve the raw output of each layer before it is normalized, which can be beneficial for learning complex dependencies.
i. [E] What’s the geometric interpretation of the dot product of two vectors?
Multiplication of the length of one vector and the length of the projection of the other
vector onto the first one.
[E] Given a vector u, find vector v of unit length such that the dot product of u and v is
maximum
Given a vector u, the vector v of unit length that maximizes the dot product u · v is the vector that points
in the same direction as u. The vector v can be found by dividing u by its own magnitude, making it a
unit vector.
Give an example of how the outer product can be useful in ML.
The Covariance matrix is a commonly used quantity in Machine Learning algorithms
(e.g. PCA). Given a dataset X ∈ R^(n×d) with n samples and d features, we calculate the (empirical) covariance as follows:
Cov[X] = (1/n) ∑_{i=1}^{n} (x_i − x̄)(x_i − x̄)^T
where x̄ is the mean feature vector: x̄ = (1/n) ∑_{i=1}^{n} x_i. Each term (x_i − x̄)(x_i − x̄)^T is an outer product of a centered sample with itself.
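A small numpy sketch of the formula above, building the covariance from outer products of centered samples (the random data is just for illustration):

import numpy as np

n, d = 100, 5
X = np.random.randn(n, d)      # n samples, d features
x_bar = X.mean(axis=0)         # mean feature vector

# Covariance as the average of outer products (x_i - x_bar)(x_i - x_bar)^T
cov = sum(np.outer(x - x_bar, x - x_bar) for x in X) / n

assert np.allclose(cov, np.cov(X, rowvar=False, bias=True))  # matches numpy's 1/n estimate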
[E] What does it mean for two vectors to be linearly independent?
Two vectors are linearly independent if no scalar multiple of one vector equals the other. This means
they do not lie on the same line through the origin and neither can be formed by scaling the other.
[M] Given two sets of vectors A and B, how do you check that they share the same basis (i.e. span the same subspace)?
You first have to make sure A and B have the same rank (dimension of their span). This can be done by converting A and B into row echelon form, or via:
Singular Value Decomposition (SVD): Decompose A into the product of three matrices: A = U Σ V^T, where U and V are orthogonal matrices, and Σ is a diagonal matrix. The number of non-zero diagonal entries in Σ is the rank of A.
Once you have the rank, take the r vectors that span the row space of each set and show that every row of A is spanned by the rows of B and vice versa.
[M] Given n vectors, each of d dimensions. What is the dimension of their span?
Given n vectors, each of d dimensions, the dimension of their span is the number of linearly independent vectors among them, which is at most min(n, d). If all n vectors are linearly independent, the dimension of their span is exactly n (which requires n ≤ d).
You can think of the vectors as the columns of a d×n matrix; the dimension of the span is the rank of that matrix, which is at most min(n, d).
[E] What’s a norm? What are the L0, L1, L2, and L∞ norms?
A norm is a function that assigns a strictly positive length or size to each vector in a vector space, except for the zero vector, which is given a length of zero.
The L0 "norm" refers to the number of nonzero elements in a vector (not a true norm),
the L1 norm is the sum of the absolute values of the vector elements,
the L2 norm (also known as the Euclidean norm) is the square root of the sum of the squares of the vector elements,
and the L∞ norm is the maximum absolute value of the elements in the vector.
[M] How do norm and metric differ? Given a norm, make a metric. Given a metric, can we make a norm?
Ans. A metric measures distances between pairs of things while a norm measures the size of a single item. Metrics can be defined on pretty much anything, while the notion of a norm applies only to vector spaces: the very definition of a norm requires that the things measured by the norm could be added and scaled. If you have a norm, you can define a metric by saying that the distance between a and b is the size of a - b.
On the other hand, if you have a metric you can’t usually define a norm.
[E] Why do we say that matrices are linear transformations?
Matrices represent linear transformations because they map linear combinations of vectors to other linear
combinations.
Specifically, for a matrix M and vectors u and v, and a scalar c, the following properties hold true, which are
the properties of linear transformations:
M(u + v) = Mu + Mv
Applying M to the sum of two vectors is the same as summing the results of applying M to each
vector individually.
M(cu) = c(Mu)
Applying M to a scalar multiple of a vector is the same as multiplying the result of applying M by
that same scalar.
These properties demonstrate that matrix multiplication exhibits the key aspects of a linear transformation:
1. Additivity - Applying the transformation to vector sums gives the sum of individually transformed
vectors
2. Homogeneity - Scalars distribute across the transformation
Thus, matrices and matrix multiplication inherently represent and operate as linear transformations
between vector spaces.
What’s the inverse of a matrix? Do all matrices have an inverse? Is the inverse of a matrix
always unique?
The inverse of a matrix A is another matrix A^-1 such that AA^-1 = A^-1A = I, where I is the identity matrix.
Not all matrices have an inverse; only square matrices that are non-singular (with a non-zero determinant)
have an inverse. When a matrix has an inverse, it is always unique.
[E] What does the determinant of a matrix represent?
The determinant of a matrix can be interpreted as a scaling factor for the transformation that the matrix
represents. Geometrically, it represents the volume scaling factor of the linear transformation described by
the matrix, including whether the transformation preserves or reverses the orientation of the space.
[E] What happens to the determinant of a matrix if we multiply one of its rows by a scalar t×R?
Multiplying a row of a matrix by a scalar t scales the determinant of the matrix by that scalar. So if the
original determinant was d, the new determinant will be t×d.
[M] A 4×4 matrix has four eigenvalues 3,3,2,−1. What can we say about the trace and the
determinant of this matrix?
The trace of the matrix, which is the sum of its eigenvalues, would be 3+3+2−1=7. The determinant of the
matrix, which is the product of its eigenvalues, would be 3×3×2×(−1)=−18
what is the determinant of a matrix with linearly dependent rows or columns
0
[M] What’s the difference between the covariance matrix A^TA and the Gram matrix AA^T?
Suppose A ∈ R^(n×d), corresponding to n samples each having d features. Then the covariance matrix A^T A ∈ R^(d×d) captures the "similarity" between features, whereas the Gram matrix A A^T ∈ R^(n×n) captures the "similarity" between samples.
Given A∈Rn×m and b∈Rn
Find x such that: Ax=b
When does this have a unique solution?
Why is it that when A has more columns than rows, Ax=b has multiple solutions?
Given a matrix A with no inverse. How would you solve the equation Ax=b?
Unique Solution
Square and Full Rank: If A is square (n = m) and has full rank (rank(A) = m = n), then A is invertible. The equation Ax = b has a unique solution given by x = A^(-1)b.
Non-Square but Full Column Rank: If A is not square (m ≠ n), but rank(A) = m (full column rank) and m < n (tall matrix), a left inverse A_L of A exists such that A_L A = I. This configuration implies Ax = b has a unique solution when b is in the range (column space) of A. The solution can be given by x = A_L b. In this scenario, A_L = (A^T A)^(-1)A^T and there is no null space of A (nullity(A) = 0) because all columns are linearly independent.
Multiple or No Solutions
Underdetermined System (m > n): If A is a wide matrix (more columns than rows), the rank of A can be at most n, and the null space of A (nullity(A)) has dimension m - n (assuming rank(A) = n). This situation typically means:
No Solution: If b is not in the column space of A, then Ax = b has no solution.
Infinitely Many Solutions: If b is in the column space of A, there are infinitely many solutions because there are free variables associated with the non-trivial null space of A.
What is the pseudoinverse and how to calculate it?
Method 1: Singular Value Decomposition (SVD)
The most reliable method for computing the pseudoinverse of any matrix, whether square or rectangular, full rank or rank-deficient, is using its Singular Value Decomposition (SVD). The SVD of matrix A is given by:
A = U Σ V^T
where:
U is an n×n orthogonal matrix whose columns are the left singular vectors of A.
V is an m×m orthogonal matrix whose columns are the right singular vectors of A.
Σ is an n×m diagonal matrix with non-negative real numbers on the diagonal, known as the singular values.
The pseudoinverse A+ is then calculated as:
A+ = V Σ+ U^T
Here, Σ+ is obtained by taking the reciprocal of each non-zero singular value in Σ, and transposing the matrix.
Method 2: Using Matrix Transpose and Inversion
For matrices that are either full row rank or full column rank, the pseudoinverse can also be computed directly using matrix transposes and inversion:
Full Column Rank (m≤n and rank(A)=m):
A+ = (A^T A)^(-1) A^T
Full Row Rank (m≥n and rank(A)=n):
A+ = A^T (A A^T)^(-1)
Note: These formulas assume that A has full column or row rank, respectively.
Shortcut: use A A^T or A^T A, whichever is full rank (and therefore invertible), and plug it into the corresponding formula above to find the pseudoinverse.
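A quick numpy check of Method 1, building A+ from the SVD and comparing it against np.linalg.pinv (the matrix values are arbitrary):

import numpy as np

A = np.random.randn(4, 3)                         # rectangular matrix
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Build Sigma^+: reciprocal of each non-zero singular value, transposed shape
Sigma_plus = np.zeros((A.shape[1], A.shape[0]))
Sigma_plus[:len(s), :len(s)] = np.diag(1.0 / s)

A_plus = Vt.T @ Sigma_plus @ U.T                  # A+ = V Sigma+ U^T
assert np.allclose(A_plus, np.linalg.pinv(A))     # should match numpy's pseudoinverse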
What does the derivative represent?
Answer: The derivative of a function measures the sensitivity to change in the function output with
respect to a change in the input.
Moreover, when it exists, the derivative at a given point is the slope of the tangent line to the graph
of the function at that point. The tangent line is the best linear approximation of the function at
that input value. This is the reason why in gradient descent we (slowly) move in the (negative)
direction of the derivative
What’s the difference between derivative, gradient, and Jacobian?
When f: R → R, we calculate the derivative df/dx.
When f: R^n → R, we calculate the gradient:
∇f = [∂f/∂x1, ∂f/∂x2,…, ∂f/∂xn]
When f: R^n → R^m, we calculate the Jacobian (an mxn matrix):
Jac(f) = [
∂f1/∂x1 … ∂f1/∂xn
…
∂fm/∂x1 … ∂fm/∂xn
]
Say we have the weights w ∈ R^(d×m) and a mini-batch x of n elements, each element is of the shape 1 × d, so that x ∈ R^(n×d). We have the output y = f(x; w) = xw. What’s the dimension of the Jacobian ∂y/∂x?
First, notice that y ∈ R^(n×m). With that said, Jac_x(f) ∈ R^((n×m)×(n×d)), or equivalently Jac_x(f) ∈ R^((n·m)×(n·d)), given that we have reshaped the 4-dim tensor into a 2-dim tensor, i.e. a matrix.
In general, the Jacobian ∂y/∂x has dimension R^((dim of y) × (dim of x)).
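A small PyTorch check of the shape claim above (the values of n, d, m are arbitrary):

import torch

n, d, m = 4, 3, 2
w = torch.randn(d, m)
x = torch.randn(n, d)

f = lambda x: x @ w                              # y = xw, shape (n, m)
J = torch.autograd.functional.jacobian(f, x)

print(J.shape)                                   # torch.Size([4, 2, 4, 3]), i.e. (n, m, n, d)
print(J.reshape(n * m, n * d).shape)             # flattened into an (n·m) x (n·d) matrix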
Given a very large symmetric matrix A that doesn’t fit in memory, say A ∈ R^(1M×1M), and a function f that can quickly compute f(x) = x^T Ax. Find the unit vector x so that x^T Ax is minimal.
Hint: Can you frame it as an optimization problem and use gradient descent to find an approximate
solution?
To find the unit vector x that minimizes x^T Ax, we can frame this as an optimization problem and approach it using an iterative algorithm like gradient descent or the conjugate gradient method. These methods update x in a direction that decreases x^T Ax; in the case of gradient descent, this involves computing the gradient 2Ax at each step and stepping in its negative direction. We iterate this process until convergence, ensuring at each step that x remains a unit vector by normalizing it.
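A rough sketch of this projected gradient descent, assuming (for illustration) we also have a routine matvec(v) that returns Av without materializing A:

import numpy as np

def min_quadratic_unit_vector(matvec, d, lr=0.01, steps=1000, seed=0):
    # Projected gradient descent for min_x x^T A x subject to ||x|| = 1.
    # matvec(v) is assumed to return A @ v for a symmetric A.
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(d)
    x /= np.linalg.norm(x)
    for _ in range(steps):
        grad = 2 * matvec(x)          # gradient of x^T A x
        x = x - lr * grad             # gradient step
        x /= np.linalg.norm(x)        # project back onto the unit sphere
    return x

# Tiny sanity check with an explicit matrix (only for illustration)
A = np.diag([5.0, 2.0, -1.0])
x = min_quadratic_unit_vector(lambda v: A @ v, d=3)
print(x @ A @ x)                      # approaches the smallest eigenvalue of A, here -1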
Why do we need dimensionality reduction?
Dimensionality reduction is used to reduce the number of random variables under consideration and can
be divided into feature selection and feature extraction. It helps in reducing the time and storage space
required, removes multicollinearity, enhances the interpretation of the parameters, helps in visualizing data,
and most importantly, it can help in avoiding the curse of dimensionality.
Eigendecomposition is a common factorization technique used for dimensionality reduction. Is the eigendecomposition of a matrix always unique?
The decomposition is not always unique. Suppose A ∈ R^(2×2) has two equal eigenvalues λ1 = λ2 = λ, with corresponding eigenvectors u1, u2. Then:
Au1 = λ1u1 = λu1
Au2 = λ2u2 = λu2
Or written in matrix form:
A [u1 u2] = [u1 u2] [λ 0; 0 λ]
Notice that we can permute the matrix of eigenvectors (thus obtaining a different factorization):
A [u2 u1] = [u2 u1] [λ 0; 0 λ]
But we still end up with the same eigen-properties:
Au2 = λu2
Au1 = λu1
Name some applications of eigenvalues and eigenvectors
PCA
We want to do PCA on a dataset of multiple features in different ranges. For example, one is in the range 0-1 and one is in the range 10 - 1000. Will PCA work on this dataset?
In PCA we are interested in the components that maximize the variance. If one component (e.g. human height) varies less than another (e.g. weight) because of their respective scales (meters vs. kilos), PCA might determine that the direction of maximal variance more closely corresponds with the ‘weight’ axis, if those features are not scaled. Since a change in height of one meter should be considered much more important than the change in weight of one kilogram, the previous assumption would be incorrect. Therefore, it is important to standardize the features before applying PCA.
Under what conditions can one apply eigendecomposition? What about SVD?
Eigendecomposition is possible only for (square) diagonalizable matrices. On the other hand, the Singular Value Decomposition (SVD) always exists (even for non-square matrices).
What’s the relationship between PCA and SVD?
Suppose we have data X ∈ R^(n×d) with n samples and d features. Moreover, assume that the data has been centered so that the mean of each feature is 0. Then, we can perform PCA in two main ways:
First, we compute the covariance matrix C = 1/(n-1)X^T X ∈ R^(d×d), and perform eigendecomposition: C = V L V^T, with eigenvalues as the diagonal of L ∈ R^(d×d), and eigenvectors as the columns of V ∈ R^(d×d). Then, we stack the k eigenvectors of V corresponding to the top k eigenvalues into a matrix V˜ ∈ R^(d×k). Finally, we obtain the component values as follows: X˜ = X V˜ ∈ R^(n×k).
Alternatively, instead of first computing the covariance matrix and then performing eigendecomposition, notice that given the above formulation, we can directly compute SVD on the data matrix X, thus obtaining: X = U Σ V^T. By construction, the right singular vectors in V are the eigenvectors of X^T X. Similarly, we stack the k right singular vectors corresponding to the top k singular values into a matrix V˜ ∈ R^(d×k). Finally, we obtain the component values as follows: X˜ = X V˜ ∈ R^(n×k).
Even though SVD is slower, it is often considered to be the preferred method because of its higher numerical accuracy.
What is the relationship between SVD and eigendecomposition?
Consider A ∈ R^(m×n) of rank r. Then, we can factorize A as follows:
A = U Σ V^T
where U ∈ R^(m×m) is an orthogonal matrix of left singular vectors, V ∈ R^(n×n) is an orthogonal matrix of right singular vectors, and Σ ∈ R^(m×n) is a “diagonal” matrix of singular values such that exactly r of the values σ_i := Σ_ii are non-zero.
By construction:
The left singular vectors of A are the eigenvectors of A A^T. From the Spectral Theorem, the eigenvectors (and thus the left singular vectors) are orthonormal.
The right singular vectors of A are the eigenvectors of A^T A. From the Spectral Theorem, the eigenvectors (and thus the right singular vectors) are orthonormal.
If λ is an eigenvalue of A^T A (or A A^T), then √λ is a singular value of A. From the positive semidefiniteness of A^T A (or A A^T), the eigenvalues (and thus the singular values) are non-negative.
What does it mean when a function is differentiable?
A function f: U → R is said to be differentiable at a ∈ U if the derivative:
f’(a) = lim(h→0) [f(a + h) - f(a)]/h
exists. This implies that the function is continuous at a. Note that not every continuous function is differentiable: for example, |x| is continuous but not differentiable at 0; functions with sharp corners are continuous but not differentiable at those points.
[E] Give an example of when a function doesn’t have a derivative at a point.
An example is the function f(x) = |x|, which doesn’t have a derivative at x = 0. The graph of this function
has a sharp corner at x = 0, which means there is no single tangent line at that point.
Why you can still apply the formula right?
Give an example of non-differentiable functions that are frequently used in machine learning.
How do we do backpropagation if those functions aren’t differentiable?
An example is the ReLU (Rectified Linear Unit) function, which is non-differentiable at x = 0. In machine
learning, backpropagation with such functions often uses a concept called subgradient, which allows the
algorithm to bypass non-differentiability at certain points. For ReLU, the derivative is defined as 0 for x < 0
and 1 for x > 0, and at x = 0, any value between 0 and 1 can be used.
What does it mean for a function to be convex or concave?
A function is called convex if the line segment between any two points on the graph of
the function lies above the graph between the two points. More precisely, the function f : X → R is
convex if and only if for all 0 ≤ t ≤ 1 and all x1, x2 ∈ X:
f(tx1 + (1 − t)x2) ≤ tf(x1) + (1 − t)f(x2)
The function f is said to be concave if −f is convex.
Why is convexity desirable in an optimization problem?
Convexity is desirable because any local minimum of a convex function is also a global
minimum
Most ML algorithms we use nowadays use first-order derivatives (gradients) to construct the next
training iteration.
[E] How can we use second-order derivatives for training models?
Second-order derivatives can be used in optimization algorithms to better understand the curvature of
the loss function. This information can be used to adjust the learning rate and the direction of the
update steps, potentially leading to faster convergence.
Pros and cons of second-order optimization.
Pros: Can lead to faster convergence and more informed update steps.
Cons: Computationally expensive, as it requires calculating and inverting the Hessian matrix.
How can we use the Hessian (second derivative matrix) to test for critical points?
The Hessian matrix can be used to test the nature of critical points. If the Hessian at a point is positive
definite, the point is a local minimum; if it is negative definite, the point is a local maximum; and if it is
indefinite, the point is a saddle point
Jensen’s inequality forms the basis for many algorithms for probabilistic inference, including Expectation-Maximization and variational inference. Explain what Jensen’s inequality is.
As stated before, for a given convex function f, we had the following property:
f(tx1 + (1 - t)x2) ≤ tf(x1) + (1 - t)f(x2)
Let us generalize this property. Again, suppose we have a convex function f, variables x1, …, xn ∈ I, and non-negative real numbers α1, …, αn such that ∑i αi = 1. Then, by induction we have:
f(α1x1 + … + αnxn) ≤ α1f(x1) + … + αnf(xn)
Let’s formalize it one step further. Consider a convex function f, a discrete random variable X with n possible values x1, …, xn, and real non-negative values ai = p(X = xi). Then, we obtain the general form of Jensen’s inequality:
f(E[X]) ≤ E[f(X)]
Explain the chain rule.
The chain rule is a formula that expresses the derivative of the composition of two differentiable functions f and g in terms of the derivatives of f and g. More precisely, if h = f ∘ g is the function such that h(x) = f(g(x)) for every x, then the chain rule is:
dh/dx = df/dg · dg/dx
Given the function f(x, y) = 4x^2 - y with the constraint x^2 + y^2 = 1. Find the function’s maximum and minimum values.
In order to solve the constrained optimization problem, we form the Lagrangian:
L(x, y, λ) = 4x^2 - y + λ(x^2 + y^2 - 1)
Calculating the gradient and setting it to zero, we obtain:
∇_x,y,λ L = (∂L/∂x, ∂L/∂y, ∂L/∂λ) = (8x + 2λx, -1 + 2λy, x^2 + y^2 - 1) = 0
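Completing the algebra from the stationarity conditions:
From ∂L/∂x = 2x(4 + λ) = 0, either x = 0 or λ = −4.
- If x = 0, the constraint gives y = ±1, with f(0, 1) = −1 and f(0, −1) = 1.
- If λ = −4, then ∂L/∂y = 0 gives y = 1/(2λ) = −1/8, so x^2 = 1 − 1/64 = 63/64 and f = 4(63/64) + 1/8 = 65/16.
Hence the maximum is 65/16 at (±√63/8, −1/8) and the minimum is −1 at (0, 1).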
Let x ∈ R^n, L = crossentropy(softmax(x), y) in which y is a one-hot vector. Take the derivative of L with respect to x.
∂L/∂x = ∂/∂x[−y^T log(softmax(x))] = −(∂ softmax(x)/∂x)^T (y / softmax(x))
= −[diag(softmax(x)) − softmax(x) softmax(x)^T] (y / softmax(x))
= softmax(x)(1^T y) − y = softmax(x) − y (since y is one-hot, 1^T y = 1)
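A quick numpy sanity check of the softmax(x) − y result via finite differences (values arbitrary):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def loss(x, y):
    return -np.sum(y * np.log(softmax(x)))

x = np.random.randn(5)
y = np.eye(5)[2]                     # one-hot target

analytic = softmax(x) - y
numeric = np.zeros(5)
eps = 1e-5
for i in range(5):
    d = np.zeros(5); d[i] = eps
    numeric[i] = (loss(x + d, y) - loss(x - d, y)) / (2 * eps)  # central difference
print(np.allclose(analytic, numeric, atol=1e-4))                # True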
Given a uniform random variable X in the range of [0,1] inclusively. What’s the probability that
X=0.5?
For a continuous uniform distribution, the probability of X being exactly any specific value, including 0.5, is
0. This is because the probability for a continuous distribution is defined over intervals, not specific points.
Can the values of PDF be greater than 1? If so, how do we interpret PDF?
Yes, the values of a Probability Density Function (PDF) can be greater than 1. The key point is that the
area under the PDF curve over the entire range must integrate to 1. A high PDF value does not represent
probability but rather indicates a higher density of the variable at that point.
What’s the difference between multivariate distribution and multimodal distribution?
A multivariate distribution is a probability distribution with more than one random variable, each with its
range of values. A multimodal distribution is a probability distribution with more than one peak or mode, regardless of how many variables it has.
What does it mean for two variables to be independent?
In general, continuous random variables X1, …, Xn admitting a joint density are all independent from each other if and only if:
p_{X1,…,Xn}(x1, …, xn) = p_{X1}(x1) · · · p_{Xn}(xn)
This equation states that the joint probability density function (pdf) of the random variables X1, …, Xn factorizes into the product of their individual pdfs, which is a necessary and sufficient condition for independence.
It’s a common practice to assume an unknown variable to be of the normal distribution. Why is that?
The Central Limit Theorem (CLT) states that the distribution of the sum of a large number of
independent, identically distributed random variables is approximately normal, regardless of the underlying distribution. Because so many things in the universe can be modeled as the sum of a large number of
independent random variables, the normal distribution pops up a lot.
Central limit theorem and law of large numbers.
How would you turn a probabilistic model into a deterministic model?
To convert a probabilistic model into a deterministic one, you typically use expected values, mode, or
median of the probability distributions as fixed values instead of random variables. This approach ignores
the variability and uncertainty represented by the probability distributions.
Explain frequentist vs. Bayesian statistics.
The frequentist approach The goal is to use the sample data to build point estimates of the parameters
(potentially with standard error).
bayesian uses priors The goal is to build a posterior distribution of the parameters, given
the data at hand.
Code for merge sort
def merge_sort(arr):
    """
    Sorts an array using the merge sort algorithm.
    """
    if len(arr) > 1:
        mid = len(arr) // 2        # Finding the mid of the array
        left_half = arr[:mid]      # Dividing the elements into 2 halves
        right_half = arr[mid:]

        merge_sort(left_half)      # Sorting the first half
        merge_sort(right_half)     # Sorting the second half

        i = j = k = 0
        # Merge the two sorted halves back into arr
        while i < len(left_half) and j < len(right_half):
            if left_half[i] < right_half[j]:
                arr[k] = left_half[i]
                i += 1
            else:
                arr[k] = right_half[j]
                j += 1
            k += 1
        # Copy any remaining elements
        while i < len(left_half):
            arr[k] = left_half[i]
            i += 1
            k += 1
        while j < len(right_half):
            arr[k] = right_half[j]
            j += 1
            k += 1
    return arr
Example usage
if __name__ == "__main__":
    sample_array = [12, 11, 13, 5, 6, 7]
    print("Original array:", sample_array)
    sorted_array = merge_sort(sample_array)
    print("Sorted array:", sorted_array)
code to recursively read a json file
import json
def load_json_file(filename):
    """Load the JSON data from a file."""
    with open(filename, 'r') as file:
        return json.load(file)

def print_json_recursively(data, indent=0):
    """Recursively print JSON data with indentation for nested structures."""
    for key, value in data.items():
        print(' ' * indent + str(key) + ':', end=' ')
        if isinstance(value, dict):      # If value is a dictionary, recurse
            print()
            print_json_recursively(value, indent + 4)
        elif isinstance(value, list):    # If value is a list, iterate each item
            print()
            for i, item in enumerate(value):
                print(' ' * (indent + 4) + f'[{i}]:', end=' ')
                if isinstance(item, dict):
                    print()
                    print_json_recursively(item, indent + 8)
                else:
                    print(item)
        else:
            print(value)

def main():
    filename = 'path_to_your_json_file.json'
    try:
        json_data = load_json_file(filename)
        print_json_recursively(json_data)
    except Exception as e:
        print(f"An error occurred: {e}")

if __name__ == "__main__":
    main()
Find the longest increasing subsequence in a string.
def longest_increasing_subsequence(s):
    # Cache for memoization
    memo = {}

    def rec(i, prev):
        if i == len(s):
            return 0                      # Base case: end of string
        if (i, prev) in memo:             # Check memoized results
            return memo[(i, prev)]
        # Option 1: include the current character if it continues the sequence
        taken = 0
        if prev < s[i]:
            taken = 1 + rec(i + 1, s[i])
        # Option 2: skip the current character
        not_taken = rec(i + 1, prev)
        # Store result in memoization dictionary
        memo[(i, prev)] = max(taken, not_taken)
        return memo[(i, prev)]

    # Start the recursion with a character smaller than any possible as the initial 'previous'
    return rec(0, chr(0))
Example usage
s = "azbycxdwe"
print("Length of the longest increasing subsequence is:", longest_increasing_subsequence(s))
Traverse a tree in pre-order, in-order, and post-order.
class TreeNode:
    def __init__(self, val=0, left=None, right=None):
        self.val = val
        self.left = left
        self.right = right

def preorder_traversal(root):
    if root is None:
        return []
    return [root.val] + preorder_traversal(root.left) + preorder_traversal(root.right)

def inorder_traversal(root):
    if root is None:
        return []
    return inorder_traversal(root.left) + [root.val] + inorder_traversal(root.right)

def postorder_traversal(root):
    if root is None:
        return []
    return postorder_traversal(root.left) + postorder_traversal(root.right) + [root.val]
- Preorder Traversal (Iterative)
def preorder_traversal_iterative(root):
    if root is None:
        return []
    stack, output = [root], []
    while stack:
        node = stack.pop()
        if node:
            output.append(node.val)
            stack.append(node.right)  # Right child pushed first so that left is processed first
            stack.append(node.left)
    return output
In order traversal of BST
def inorder_traversal_iterative(root):
    stack, output = [], []
    current = root
    while current or stack:
        while current:
            stack.append(current)
            current = current.left
        current = stack.pop()
        output.append(current.val)
        current = current.right
    return output
Post order traversal of BST
def postorder_traversal_iterative(root):
    if root is None:
        return []
    stack, output = [root], []
    while stack:
        node = stack.pop()
        output.append(node.val)
        if node.left:
            stack.append(node.left)
        if node.right:
            stack.append(node.right)
    return output[::-1]  # Reverse the result because we want left-right-root
Do post order traversal question
Given an array of integers and an integer k, find the total number of continuous subarrays whose
sum equals k. The solution should have O(N) runtime
def subarraySum(nums, k):
    # Dictionary to store the frequency of cumulative sums
    cumulative_sum_count = {0: 1}  # Base case: a prefix sum of 0 exists once
    current_sum = 0
    count = 0
    for num in nums:
        current_sum += num
        # Check if there is a prefix sum we can subtract
        # so that the remaining subarray sums to k
        sum_needed = current_sum - k
        if sum_needed in cumulative_sum_count:
            count += cumulative_sum_count[sum_needed]
        # Update the count of the current_sum in the hashmap
        if current_sum in cumulative_sum_count:
            cumulative_sum_count[current_sum] += 1
        else:
            cumulative_sum_count[current_sum] = 1
    return count
You have three matrices: A ∈ R100×5, B ∈ R5×200, C ∈ R200×20, and you need to calculate the product ABC. In what order would you perform your multiplication and why?
Since matrix multiplication is associative, the answer is the same whether we multiply in the order of (AB)C or A(BC). However, let us observe the cost through the number of scalar multiplications we need to perform:
(AB)C = 100 · 5 · 200 + 100 · 200 · 20 = 500000
A(BC) = 5 · 200 · 20 + 100 · 5 · 20 = 30000
Obviously, the second approach is computationally cheaper.
What are some of the causes for numerical instability in deep learning?
Overflow, underflow, division by zero, log 0, NaN as input, etc.
In many machine learning techniques (e.g. batch norm), we often see a small term ϵ
added to the calculation. What’s the purpose of that term?
The purpose is to avoid operations that are undefined for 0, such as division by 0, log 0, etc
What made GPUs popular for deep learning? How are they compared to TPUs?
GPUs became popular for deep learning because matrix multiplications can be efficiently parallelized over hundreds of cores. TPUs (Tensor Processing Units) are specialized hardware for neural nets, with the key difference that they have lower precision for representing floating-point numbers, allowing for:
Higher memory throughput
Faster addition and multiplication operations
This design enables TPUs to accelerate neural network computations while reducing power consumption and increasing efficiency.
What are the time and space complexity for doing backpropagation on a recurrent neural network?
O(B · T · w) - time
O(w + B · T · a) - space
For the forward-pass of a single example in one timestep we need to evaluate all the weights, resulting in O(w) time complexity, where w is the number of weights. Due to the recurrence, we repeat the computation for T timesteps, resulting in O(T · w). Moreover, performing this un-rolled forward pass for an entire batch will amount the time complexity to O(B · T · w). Lastly, we note that the time complexity of the forward and the backward pass is the same.
As for the space complexity, note that we need to keep in memory both the network weights and the activations from the forward pass (required for the backprop computation). Given that storing the activations for a single timestep is O(a), the space complexity amounts to O(w + B · T · a)
Write a function find_bigrams to take a string and return a list of all bigrams.
def find_bigrams(text_list: list):
    result = []
    for ls in text_list:
        words = ls.lower().split()
        for bi in zip(words, words[1:]):
            result.append(bi)
    return result

text = ["Data drives everything", "Get the skills you need for the future of work"]
print(find_bigrams(text))
what are the 2 kinds of language modelling?
Masked and causal language modelling
cost to train a model in terms of P and D
C= 6PD
P is the number of parameters in the transformer model
D is the dataset size, in tokens
C is the compute required to train the transformer model, in total floating point operations
The available compute per second is # of GPUs x FLOPs/GPU, so training time ≈ C / (# of GPUs x achieved FLOPs per GPU).
total memory required for training a model
Total Memory = Model Memory + optimizer memory + activation memory + gradient memory
model memory (fp16) -> 2 bytes per parameter
optimizer memory (Adam states, often kept at 32-bit precision) -> 4 bytes for the fp32 copy of the parameters, 4 bytes for momentum, 4 bytes for variance
gradients (saved in fp16) -> 2 bytes per parameter
activations -> 2 bytes * batch size * # of tokens * hidden dimension * # of layers
total memory is roughly 16 bytes * # of parameters, plus activation memory
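A back-of-the-envelope check of the roughly 16 bytes/parameter rule for a hypothetical 7B-parameter model (activations excluded, numbers purely illustrative):

params = 7e9                           # hypothetical 7B-parameter model
model_mem     = 2 * params             # fp16 weights
gradient_mem  = 2 * params             # fp16 gradients
optimizer_mem = (4 + 4 + 4) * params   # fp32 copy + momentum + variance (Adam)
total = model_mem + gradient_mem + optimizer_mem
print(total / 1e9, "GB")               # ~112 GB, i.e. about 16 bytes per parameter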
Benefit of keeping optim at 32bits?
how do memory requirements change for Zero-1, 2, 3
For ZeRO-1,
Total Memory_Training ≈ Model Memory + ((Optimizer memory) / (No. GPUs)) + Activation Memory + Gradient Mem
For ZeRO-2,
Total Memory_Training ≈ Model Memory + Activation Memory + (Optimizer Memory + Gradient Memory)/ (No. GPUs)
For ZeRO-3,
Total Memory_Training ≈ Activation Memory + (Model Memory + Optimizer Memory + Gradient Memory) / (No. GPUs)
Why does the optimiser keep a copy of the gradients along with gradient memory?
optimum scaling stated by chinchilla
D ≈ 20 x # of parameters (roughly 20 training tokens per parameter)
what happens after attention layer output?
So, the sequence is:
- Multi-headed attention
- o_proj linear projection
- Skip connection by adding the original input
- Layer normalization
- First FCN layer
- Activation Function
- Second FCN layer
- skip connection
- another layer norm
what is BLEU score?
BLEU is a precision-based metric: for each n, the numerator is the clipped count of n-gram matches and the denominator is the count of n-grams present in the generated text.
This means that if the bigram "the apple" shows up 7 times in the generated text and 2 times in the reference text, then its numerator count is clipped to 2 while its denominator count is 7.
This gives you BLEU-n for different n.
One problem here is that very short generations shrink the denominator and make the model seem good, so a brevity penalty is multiplied in.
The total BLEU score is the brevity penalty times the exponential of the weighted average of the log precisions.
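A from-scratch sketch of the clipped n-gram precision and brevity penalty that BLEU combines; real implementations (e.g. sacrebleu) differ in details such as smoothing and multiple references:

from collections import Counter
import math

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    # Clip each candidate n-gram count by its count in the reference
    clipped = {g: min(c, ref_counts[g]) for g, c in cand_counts.items()}
    return sum(clipped.values()) / max(sum(cand_counts.values()), 1)

def bleu(candidate, reference, max_n=4):
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    brevity = min(1.0, math.exp(1 - len(reference) / len(candidate)))  # penalize short outputs
    return brevity * math.exp(log_avg)

print(bleu("the cat sat on the mat".split(), "the cat is on the mat".split(), max_n=2))  # ~0.707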
techniques for regularization in LLM training
- Dropout
- Weight Decay (L2 Regularization)
- Layer Normalization
- Gradient Clipping
- Label Smoothing
what is lasso and ridge regression?
Using the l1 norm as the regularization term gives lasso; using the l2 norm gives ridge.
what is gradient clipping? what does it help with?
gradient clipping is when you clip gradients during backprop. helps with exploding gradient problem. Helps maintain stability during training.
What value is gradient clipped at
what does l1 and l2 norm do to parameter values
The l1 norm pushes them to exactly 0, making it good for variable selection; the l2 norm pushes them close to 0.
l1 norm in the parameter space graphically is a diamond and l2 norm in the parameter space is a circle.
point where the contours of loss function first touch the diamond or circle is where loss function solution lies
what does lambda in regularization parameter do?
it determines the size of the regularisation circle in the parameter space. larger lambda smaller circle
what is bias in a model
In the context of machine learning,
bias refers to the error that is introduced by approximating a real-world problem, which may be complex, by a much simpler model. simpler models can exhibit high bias and complex models high variance
bias - underfitting
variance - overfitting
what is variance in a model
Variance, refers to the error that is introduced by the model’s sensitivity to fluctuations in the training set.
if you use a complex model with lot of parameters and the parameters are allowed to take large values then it will be able to model noise along with data
If a model has too many parameters or if those parameters are allowed to take on large values, it can become extremely flexible. This means it can capture not only the underlying relationships in the data but also the noise (random fluctuations) specific to the training set. As a result, it may perform very well on the training data (low bias), but poorly on new, unseen data (high variance) because it’s overfitted to the noise and specifics of the training set rather than to the underlying data distribution.
Effect of L2 Regularization? on variance as well
Penalizing Large Weights: prevent any single feature from having too much influence on the predictions, which is desirable when you suspect some features may be correlated with noise rather than with the signal in the training data.
Smoothness and Generalization: Smaller weights often result in smoother functions that change less drastically with input variations. This smoothness means the model is less likely to pick up on noise and will therefore generalize better to unseen data.
Shrinkage Effect: The regularization term shrinks the parameter values towards zero but not exactly to zero. This effect is akin to a “soft” form of feature selection that lowers the complexity of the model without completely eliminating the contribution of any single feature.
skip connections solve what problem?
skip connections solve for the problem of vanishing gradient
how is the problem of exploding gradient solved in LLMs
gradient clipping
Accuracy formula
(TP+TN)/(FP+FN+TP+TN)
precision formula
accuracy of positive samples
= TP/(TP+FP)
Also ask gpt and write about what these tell you
recall formula
coverage of actual positive samples
= TP/(TP+FN)
Is this right ??
specificity formula
coverage of actual negative samples
= TN/(TN+FP)
f1 score
harmonic mean of precision and recall, useful for imbalanced classes
= 2TP/(2TP+FP+FN)
what is the ROC curve
plot of TPR vs FPR
recall vs (1-specificity)
allows us to compare 2 methods like logistic vs boosting
what is sensitivity?
it is the same as recall
is BLEU score precision recall or f1 score based?
BLEU is precision based
is ROUGE score precision, recall, or f1 score based?
ROUGE is recall based
BERTScore what is?
BERTScore uses BERT to create an embedding for n-grams with BERT for both output and reference and then takes dot product for similarity in the form of precision and recall and then takes harmonic mean of them both
Do you use a double loop to summation over all n grams in the precision and recall components??
steps for evaluation of LLM output
- coherence
- compile - syntax
- gpt eval - semantics
- expert eval
sources of bias in gpt automated eval
Position bias: LLMs tend to favor the response in the first position.
Verbosity bias: LLMs tend to favor longer, wordier responses over more concise ones
Self-enhancement bias: LLMs have a slight bias towards their own answers.
what is vicuna?
Vicuna-13B is a LLaMA finetune on chat data from ShareGPT.
It is one of the first papers to use GPT-4 as a judge to evaluate model outputs against different benchmarks.
What are the different eval benchmarks popular
Accuracy. recall, precision, specificity, f1 score if you have classes
BLEU, ROUGE, BERTScore if you have text as reference and the task is something like summarisation
Automatic gpt4 based evaluation
what is soft prompt tuning
prepends a trainable tensor to the model’s input embeddings, essentially creating a soft prompt. Unlike discrete text prompts, soft prompts can be learned via backpropagation, meaning they can be fine-tuned to incorporate signals from any number of labeled examples.
what is soft prefix tuning
it prepends trainable parameters to the hidden states of all transformer blocks. During fine-tuning, the LM’s original parameters are kept frozen while the prefix parameters are updated.
what is the main idea behind LORA?
when adapting to a specific task, pre-trained language models have a low intrinsic dimension and can still learn efficiently despite a random projection into a smaller subspace. Thus, LoRA hypothesized that weight updates during adaption also have low intrinsic rank.
How do you decompose into lower rank??
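A minimal sketch of the low-rank update idea in PyTorch (dimensions, init, and scaling are illustrative; real implementations such as the peft library add dropout and weight merging):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        # Frozen pretrained weight W
        self.weight = nn.Parameter(torch.randn(out_features, in_features), requires_grad=False)
        # Trainable low-rank update: delta_W = B @ A, with rank r << min(in, out)
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, r))   # zero init so training starts at W
        self.scaling = alpha / r

    def forward(self, x):
        return x @ self.weight.T + self.scaling * (x @ self.A.T @ self.B.T)

layer = LoRALinear(768, 768)
out = layer(torch.randn(2, 768))   # only A and B receive gradients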
what are main ideas introduced by QLORA
QLoRA builds on the idea of LoRA. But instead of using the full 16-bit model during fine-tuning, it applies a 4-bit quantized model
- 4-bit NormalFloat (to quantize models),
- double quantization (for additional memory savings), and
- paged optimizers (that prevent OOM errors by transferring data to CPU RAM when the GPU runs out of memory).
Give qlora paper to chatgpt and talk to it about how quantisation works there like u and sigma and normal distribution assumption and binning
What benefits do we get from reducing the precision of our model? What problems might we run into?
How to solve these problems?
Benefits: less memory, faster computation.
Problems: numerical instability is possible during training because gradients can underflow or overflow due to the lack of precision, so learning is affected or stalls.
Mixed precision training is used to solve this issue.
Details of mixed precision
what is the difference between torch.cat() and torch.stack()
Given two tensors A and B, both with shape [2, 2]:
torch.cat([A, B], dim=0) will result in a tensor of shape [4, 2].
torch.stack([A, B], dim=0) will result in a tensor of shape [2, 2, 2].
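Verifying the shapes above:

import torch

A = torch.zeros(2, 2)
B = torch.ones(2, 2)

print(torch.cat([A, B], dim=0).shape)    # torch.Size([4, 2])  - concatenates along an existing dim
print(torch.stack([A, B], dim=0).shape)  # torch.Size([2, 2, 2]) - adds a new leading dim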
how are in place operations done in pytorch
In-place operations Operations that have a _ suffix are in-place
tensor.add_(5)
above tensor is modified in place
The use of in-place operations is discouraged because they cause an immediate loss of history, which can be a problem when computing gradients.
how to load resnet18 model
from torchvision.models import resnet18
model = resnet18()
what data can be passed into resnet18 and how would you create a sample data point for it?
data = torch.randn(1, 3, 64, 64)
what do the output labels of resnet18 look like? how would you create a sample output label?
label = torch.randn(1, 1000)
how would you generate the prediction for a model on data?
prediction = model(data)
how would you calculate loss given you had a data point and prediction and do a backward pass with it
loss = (prediction - labels).sum()
loss.backward()
how would you define an optimizer (say SGD) and run a optimizer step?
optim = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
optim.step()
steps in a training pipeline in pytorch
- define model
- get data
- get labels
- define optimizer and loss
- calculate loss and do loss.backward()
- do optimizer.zero_grad() and optimizer.step()
or you define everything and do trainer.train()
what is the Transformers library?
The Transformers library from Hugging Face has models and utilities for training and fine-tuning many machine learning models. It is an API built primarily on PyTorch.
what is huggingface trainer
Trainer is a class in the Hugging Face Transformers library that lets you fine-tune pre-trained models. It lets you abstract away a lot of settings and supports mixed precision training.
what are callbacks in PyTorch
A callback is a set of methods that you can override or utilize to customize the behaviour of the Trainer; for example, you can use it to save the model periodically, log metrics, or modify the learning rate. It is useful for extending the functionality of the training loop without changing the core training logic.
You can pass callbacks as arguments to the Trainer object.
how do you define and use a callback in pytorch?
Callbacks are defined as classes: you create a class that inherits from a callback base class in the Transformers library and define a method on it that does what you want, when you want it done, for example a method that runs at the end of every epoch and does something if a condition is met (see the sketch below).
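A sketch using the Hugging Face TrainerCallback API (the callback name, metric, and threshold are made up for illustration):

from transformers import TrainerCallback

class SaveOnLowLossCallback(TrainerCallback):
    """Example: at the end of every epoch, ask the Trainer to save a checkpoint
    if the last logged training loss is below a threshold."""
    def __init__(self, loss_threshold=0.5):
        self.loss_threshold = loss_threshold

    def on_epoch_end(self, args, state, control, **kwargs):
        # state.log_history holds the logged metrics so far (may or may not contain "loss")
        if state.log_history and state.log_history[-1].get("loss", float("inf")) < self.loss_threshold:
            control.should_save = True
        return control

# Passed to the Trainer via its callbacks argument, e.g.
# trainer = Trainer(model=model, args=training_args, callbacks=[SaveOnLowLossCallback()])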
what are the benefits of using an nn.Module subclass to define your model?
Automatic Parameter Registration: This is crucial for training the model because optimizers rely on the .parameters() method of the nn.Module to get a list of all parameters that need updates.
Model Serialization and deserialization: when you save and load model
Device Management: When moving the model to a device (e.g., GPU), all parameters of all contained layers are automatically moved as well.
what is the self attention operation?
Self-attention takes you from the x_i's to the y_i's.
Each y_i is calculated by taking x_i, computing its dot product with every other x_j, and applying a softmax; each dot product becomes the scaling factor (weight) with which the weighted sum of the x_j's is taken.
Every y_i is a weighted summation of all x_j's:
y_i = ∑_j w_ij x_j
w'_{ij} = x_i^T x_j
w_{ij} = exp(w'_{ij}) / ∑_j exp(w'_{ij})
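A bare-bones numpy sketch of exactly this computation (no scaling and no learned projections):

import numpy as np

def self_attention(X):
    # X: (sequence_length, embedding_dim); each row is one x_i
    scores = X @ X.T                               # w'_ij = x_i . x_j
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax -> w_ij
    return weights @ X                             # y_i = sum_j w_ij x_j

Y = self_attention(np.random.randn(4, 8))
print(Y.shape)                                     # (4, 8)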
in attention layers why do you scale the dot product
Large dot products saturate the softmax and kill the gradient, so we divide by sqrt(d).
Why sqrt(d)? Imagine a vector in ℝ^d with all values equal to c. Its Euclidean length is c·sqrt(d). Therefore, we are dividing out the amount by which the increase in dimension increases the length of the average vector.
layer norm is applied on what values in a layer?
The layer normalization is applied over the embedding dimension only.
what is temperature? how is it implemented?
temperature sampling is dividing logits by the temperature before feeding them into softmax and obtaining our sampling probabilities.
lower temp means less random
higher temp means more random
what is top-k sampling and what is the problem with it?
in top-k sampling you sample from the top k probabilities.
the problem is that if the distribution is fairly flat over many reasonable options, we will ignore some possibilities simply because we put a hard stop at k
what is nucleus sampling?
Nucleus sampling, aka top-p sampling: we sort tokens by probability, keep the smallest set whose cumulative probability reaches p, cut off there, and sample from that set after renormalizing.
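A small sketch of nucleus (top-p) filtering over a probability vector (the probabilities are arbitrary):

import numpy as np

def nucleus_sample(probs, p=0.9, rng=np.random.default_rng()):
    order = np.argsort(probs)[::-1]                            # sort tokens by probability, descending
    sorted_probs = probs[order]
    cutoff = np.searchsorted(np.cumsum(sorted_probs), p) + 1   # smallest set reaching cumulative prob p
    keep = order[:cutoff]
    renormalized = probs[keep] / probs[keep].sum()
    return rng.choice(keep, p=renormalized)

probs = np.array([0.45, 0.30, 0.15, 0.07, 0.03])
print(nucleus_sample(probs))   # samples only from the tokens making up the top ~90% of mass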
what are the benefits of layernorm?
Stability in Training: LayerNorm stabilizes the neural network training process by ensuring that the distribution of the inputs to the activation functions in a layer does not vary too much. This reduces the problem where the learning has to constantly adjust to a shifting input distribution across layers.
What’s the risk in empirical risk minimization?
The risk in empirical risk minimization is the expectation of the loss L(h(x), y) over the joint probability distribution P(x, y), where h is the hypothesis.
If the loss is the 0-1 loss, the risk is the integral of the joint probability over the region where the prediction is wrong (loss = 1).
R(h) = E[L(h(x), y)] = ∫ L(h(x), y) dP(x, y)
what is empirical risk minimization?
You take the average of the loss on the training data and call that the empirical risk, since the true risk cannot be computed because P(x, y) is not available.
Empirical risk minimization is finding the hypothesis h that minimizes this empirical risk.
If we have a wide NN and a deep NN with the same number of parameters, which one is more expressive
and why?
Deeper networks are more expressive, since they encode an inductive bias that complex functions
can be modeled as composition of simple functions. In turn, this allows the network to learn multiple levels
of an abstraction hierarchy. Empirically, it has been shown that deeper networks lead to more compact
models with better generalization performance.
Why does L1 regularization tend to lead to sparsity while L2 regularization pushes weights closer to 0?
Because the gradient of the L1 penalty has constant magnitude 1 even close to 0, it can push weights exactly to 0; the gradient of the L2 penalty, 2w, becomes very small close to 0, so weights shrink towards 0 but rarely reach it.
what is bagging and boosting
In bagging you sample with replacement to train n classifiers and combine their votes for the final prediction.
In boosting you first train a weak learner, look at which samples it fails to classify, weight those samples higher, train another weak learner, and repeat this iteratively.
What’s the motivation for RNN?
Derived from feedforward
neural networks, RNNs can use their internal state (memory) to process variable length sequences of
inputs
What’s the motivation for LSTM?
vanishing and exploding gradient in RNNs
formula for perplexity
Perplexity(P) = 2^(-(1/N) ∑_{i=1}^{N} log_2 P(x_i))
what is the key idea in relative position embeddings
A bias term is added inside the softmax along with qk^T to signify relative positions, but doing so complicates the use of a KV cache, which slows inference.
main idea in rotary position embedding
Rotary embeddings are applied only to the q and k matrices, not to v. We multiply by a rotation matrix that rotates each vector by m × θ, where m is the position of the token.
What is done is that blocks of 2 in the embedding dimension are rotated by θ1, θ2, etc. (a different frequency per block).
This makes the attention score q^T k depend only on the relative position between tokens.
what is KV cache
During inference you don't need to recompute q, k, and v for all the previous tokens.
Usually q is needed only for the most recent token (only this one q is used), while K and V for all tokens are needed, but only one new row of each is produced per step.
So you save the K and V from the previous iterations; at each step you compute one new q, k, and v from the latest token, append the new k and v to the cache, and then do the attention calculation.
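A shape-level numpy sketch of maintaining the cache across decoding steps (the projection matrices here are random stand-ins; single head, no masking):

import numpy as np

d = 8
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
K_cache, V_cache = np.zeros((0, d)), np.zeros((0, d))

def decode_step(x_new, K_cache, V_cache):
    # Only the newest token's q, k, v are computed; old K and V come from the cache
    q = x_new @ Wq
    K_cache = np.vstack([K_cache, x_new @ Wk])
    V_cache = np.vstack([V_cache, x_new @ Wv])
    scores = q @ K_cache.T / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V_cache, K_cache, V_cache

for _ in range(5):                        # five decoding steps, one new token each
    y, K_cache, V_cache = decode_step(np.random.randn(1, d), K_cache, V_cache)
print(K_cache.shape, V_cache.shape)       # (5, 8) (5, 8)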
How are your model’s accuracy and computational efficiency affected when you decrease or increase
CNN filter size?
Increasing the filter size results in decrease in computational efficiency (since the number of model parameters increases), and an increase in accuracy up to a certain point beyond which the network can overfit and imitate a fully-connected network.
Alternatively, decreasing the filter size results in increase of computational efficiency, and a decrease in accuracy when tending towards using extremely small kernels (e.g. 1x1) which do not capture the local structure of the inputs properly
Convolutional layers are also known as “locally connected”. Explain what it means
Each filter operates only on a small neighborhood (e.g. 3x3) around a given pixel and is applied across the entire spatial domain. This sort of weight sharing drastically reduces the number of parameters and injects a "spatial equivariance" bias.
What does a 1x1 convolutional layer do?
A 1x1 conv layer mixes information across channels at each spatial location; it is typically used to reduce (or otherwise change) the number of channels in the input activation volume.
What happens when pooling is removed completely?
The main purpose of the pooling operation is to increase the receptive field of the network
by non-parametric techniques. If pooling is completely removed, then we’d need an increasingly large
stack of convolutional layers to achieve large enough receptive fields for the neurons located in the
deep layers. Of course, this comes at the cost of drastically increasing the number of parameters and
computation requirements.
how to calculate cosine similarity in python
A = [5, 3, 4]
B = [4, 2, 4]
dot_product = sum(a * b for a, b in zip(A, B))
magnitude_A = sum(a * a for a in A) ** 0.5
magnitude_B = sum(b * b for b in B) ** 0.5
cosine_similarity = dot_product / (magnitude_A * magnitude_B)
print(f"Cosine Similarity using standard Python: {cosine_similarity}")
How would a finite or infinite horizon affect our algorithms?
in finite horizon when we’re thinking of reward for an action we’re only considering rewards in a finite set of future states.
for infinite horizon we consider future rewards of all future states
infinite horizons are more common
Why do we need the discount term for objective functions
using a discount rate (smaller than 1) is a mathematical trick
to make an infinite sum finite.
what is the q learning update rule
Q(S_t, A_t) ← Q(S_t, A_t) + α(R_{t+1} + γmax_a Q(S_{t+1}, a) - Q(S_t, A_t))
γ - > discount rate
alpha -> learning rate
experience tuple in RL
(s_t, a_t, r_t, s_{t+1})
how do variational autoencoders work
Put an image through the encoder, encode it to a mean and standard deviation (log standard deviation for numerical stability), sample a latent by scaling and shifting a standard normal sample with those parameters, and try to reconstruct the input image with the decoder.
The reconstruction of the input image trains both the encoder and the decoder, and the KL divergence between the encoder's output distribution and a standard normal prior acts as a regularizer.
what is the formula for vanilla gradient update?
w_{t+1} = w_t − α · (1/N) ∑_{i=1}^{N} ∇_w L_i(w_t, y_i, ŷ_i)
Implement vanilla dropout for the forward and backward pass in NumPy
Things to note here are how the input and the gradient are multiplied by the mask, how the mask is created, and how scaling is done post-dropout (inverted dropout).
import numpy as np

def forward_pass_with_dropout(X, dropout_rate):
    """
    Apply dropout to the input layer X during the forward pass.
    :param X: Input data for the layer, numpy array of shape (n_features, n_samples)
    :param dropout_rate: The probability of setting a neuron's output to 0
    :return: A tuple (output after applying dropout, dropout mask used)
    """
    # Create a mask using the dropout rate, setting to 0 for the dropped units
    dropout_mask = np.random.rand(*X.shape) > dropout_rate
    # Apply the mask to the input data
    dropped_out_X = np.multiply(X, dropout_mask)
    # During training, scale the data so the expected value is unchanged
    dropped_out_X /= (1 - dropout_rate)
    return dropped_out_X, dropout_mask
def backward_pass_with_dropout(dA, dropout_mask, dropout_rate):
    """
    Apply the stored mask to the gradient during the backward pass.
    :param dA: Gradient of the loss with respect to the activations, numpy array
    :param dropout_mask: The dropout mask that was used during the forward pass
    :param dropout_rate: The probability of setting a neuron's output to 0
    :return: Gradient after applying dropout mask
    """
    # Apply the dropout mask to the gradients
    dA_with_dropout = np.multiply(dA, dropout_mask)
    # Scale the gradients as we did during the forward pass
    dA_with_dropout /= (1 - dropout_rate)
    return dA_with_dropout

# Example usage:
np.random.seed(0)              # for reproducibility
X = np.random.randn(5, 3)      # 5 features, 3 samples
dropout_rate = 0.2             # 20% dropout rate

# Forward pass
X_dropped, mask = forward_pass_with_dropout(X, dropout_rate)

# Suppose we have some gradient dA from the backward pass
dA = np.random.randn(5, 3)

# Backward pass
dA_dropped = backward_pass_with_dropout(dA, mask, dropout_rate)
why are RNNs susceptible to vanishing and exploding gradients and how to solve?
Backprop through time (BPTT) is the reason for vanishing and exploding gradient
Gated Units: Using RNN variants with gating mechanisms, such as LSTM (Long Short-Term
Memory)
Gradient Clipping:
Using Skip Connections:
Proper Initialization and Activation Functions
When training a large neural network, say a language model with a billion parameters, you
evaluate your model on a validation set at the end of every epoch. You realize that your validation
loss is often lower than your train loss. What might be happening?
It might be that the training loss uses regularizers (e.g. L2 norm on the weights). Since at validation time we only evaluate the main loss function, it might happen that the validation loss is lower than
the (composite) train loss.
The model might have dropout layers, which impose heavy regularization during training. These are layers that behave differently during training (dropping out neurons) and inference time
The validation set might simply be easier compared to the train set.
What criteria would you use for early stopping?
We can stop training when a metric of interest (validation loss, accuracy, etc.) stops improving. It is a good idea to use a patience parameter (the number of evaluations to wait without improvement before stopping) in order to avoid reacting to noisy estimates.
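A minimal sketch of the patience logic in Python; the surrounding training loop and checkpointing are omitted, and the loss history below is made-up data for illustration:
def early_stopping_monitor(val_losses, patience=5):
    # Return True if the last `patience` evaluations showed no improvement
    # over the best validation loss seen before them.
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_before

# Example: the validation loss stops improving after the fifth evaluation
history = [1.0, 0.8, 0.7, 0.65, 0.64, 0.66, 0.65, 0.67, 0.66, 0.68]
print(early_stopping_monitor(history, patience=5))  # True -> stop training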
Gradient descent vs SGD vs mini-batch SGD.
* In gradient descent we first perform a forward pass and a backward pass for each sample in the
dataset, before we take a step in the direction of the cumulative gradient. This is extremely slow
to converge, as today’s datasets are extremely large in size, which implies that we perform gradient
updates too rarely.
* In SGD, after performing a forward and a backward pass for a single sample, we take a step in the
direction of the single gradient. Even though we perform gradient updates much more often, the
entire process is extremely noisy as we are optimizing with respect to a single sample at a time.
* Mini-batch SGD combines the best of both worlds: perform a forward/backward pass for a batch
(e.g. 32) of samples, and take a step in the direction of the gradient for the current mini-batch. On
one hand, we perform updates more often than pure gradient descent; on the other hand, we optimize
over an entire mini-batch, minimizing the noise in the estimated gradient.
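A minimal sketch of the mini-batch SGD loop described above, for a linear model in NumPy (the toy data, model and hyperparameters are illustrative assumptions):
import numpy as np

np.random.seed(0)
X = np.random.randn(1000, 10)   # 1000 samples, 10 features (toy data)
y = np.random.randn(1000)
w = np.zeros(10)
lr, batch_size, n_epochs = 0.01, 32, 5

for epoch in range(n_epochs):
    perm = np.random.permutation(len(X))      # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        X_b, y_b = X[idx], y[idx]
        preds = X_b @ w
        # Gradient of the mean squared error over this mini-batch only
        grad = 2 * X_b.T @ (preds - y_b) / len(idx)
        w -= lr * grad                        # one update per mini-batch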
Your model’s weights fluctuate a lot during training. How does that affect your model’s performance?
What to do about it?
The fluctuation can be attributed to two primary factors: large learning rate or exploding gradients. In either case, this can seriously destabilize the training process. In order to resolve this issue, we could:
1) lower the learning rate;
2) perform gradient clipping
Draw a graph of training error vs. number of training epochs when the learning rate is: i) too high;
ii) too low; iii) acceptable.
- if too high, the loss decreases fast at first but oscillates and stabilises at a high value (or even diverges)
- if just right, the loss decreases steadily and stabilises at a low value
- if too low, the loss decreases very slowly
What’s learning rate warmup? Why do we need it?
Warmup steps are just a few updates with low learning rate at the beginning of training.
After this warmup, you use the regular learning rate (schedule) to train your model to convergence.
For example, RMSProp computes a moving average of the squared gradients to get an estimate of the variance in the gradients for each parameter. For the first update, this estimate is based only on the squared gradients of the very first batch. Since, in general, this will not be a good estimate, your first update could push your network in a wrong direction. To avoid this problem, you give the optimiser a few steps to estimate the variance while making as few changes as possible (low learning rate), and only when the estimate is reasonable do you use the actual (high) learning rate.
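A minimal sketch of a linear warmup followed by a constant learning rate (base_lr and warmup_steps are illustrative values; in practice a decay schedule usually follows the warmup):
def learning_rate_at(step, base_lr=1e-3, warmup_steps=1000):
    # Linearly ramp the learning rate from ~0 up to base_lr over warmup_steps,
    # then keep it constant.
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr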
Compare batch norm and layer norm.
Suppose the output of the previous layer is X ∈ R^(B×D), where B is the batch size and D is the dimensionality of the embedding. Both techniques normalize the input as follows:
Y = (X − E[X]) / sqrt(Var[X] + ϵ) ∗ γ + β
where γ and β are learnt affine parameters, and all operations are treated as broadcasts. The main
difference stems from how the two techniques compute the statistics:
* Batch norm computes the mean and standard deviation over the batch, meaning that E[X] ∈ R^(1×D) and Var[X] ∈ R^(1×D).
* Layer norm computes the mean and standard deviation over the features, meaning that E[X] ∈ R^(B×1) and Var[X] ∈ R^(B×1).
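A minimal NumPy sketch of the difference in which axis the statistics are computed over (γ, β and the running statistics used at inference are omitted for brevity):
import numpy as np

X = np.random.randn(32, 64)   # B = 32 samples, D = 64 features
eps = 1e-5

# Batch norm: statistics over the batch axis -> one mean/var per feature, shape (1, D)
bn_mean = X.mean(axis=0, keepdims=True)
bn_var = X.var(axis=0, keepdims=True)
X_bn = (X - bn_mean) / np.sqrt(bn_var + eps)

# Layer norm: statistics over the feature axis -> one mean/var per sample, shape (B, 1)
ln_mean = X.mean(axis=1, keepdims=True)
ln_var = X.var(axis=1, keepdims=True)
X_ln = (X - ln_mean) / np.sqrt(ln_var + eps)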
Some models use weight decay: after each gradient update, the weights are multiplied by a factor slightly
less than 1. What is this useful for?
If you use L2 regularization, then in the gradient update rule you are effectively multiplying w by (1 − α·λ) before applying the gradient of the data loss; this is what is called weight decay, and it keeps the weights small, acting as a regularizer. PyTorch implements weight decay through the weight_decay argument of its optimizers.
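A tiny NumPy check that, for plain SGD, the explicit L2 update and the multiplicative weight-decay form are the same step (the values are illustrative):
import numpy as np

w = np.array([0.5, -1.2, 2.0])
grad_L = np.array([0.1, -0.3, 0.05])   # gradient of the data loss (illustrative)
alpha, lam = 0.1, 0.01                 # learning rate and weight decay coefficient

# Update with the L2 term included in the gradient
w_l2 = w - alpha * (grad_L + lam * w)

# Equivalent "weight decay" form: shrink w, then apply the data gradient
w_decay = (1 - alpha * lam) * w - alpha * grad_L

assert np.allclose(w_l2, w_decay)
For adaptive optimizers such as Adam the two forms are no longer equivalent, which is why decoupled weight decay (AdamW) treats the decay step separately.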
It’s a common practice for the learning rate to be reduced throughout training.
i. What’s the motivation?
ii. What might be the exceptions
The learning rate controls the size of the update steps along the gradient: large steps early on make fast progress towards a good region of the loss landscape, while smaller steps later allow the optimization to settle into a minimum instead of bouncing around it.
In Adam, the moving average of the squared gradients already adapts the effective per-parameter step size, so an explicit decay schedule matters somewhat less (although one is still commonly used).
An exception would be continual learning, where the model must keep adapting to new data, so the learning rate should not be decayed towards zero.
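A minimal sketch of a step-decay schedule of the kind described above (the base rate, decay factor and interval are illustrative values):
def step_decay_lr(epoch, base_lr=0.1, drop_factor=0.5, epochs_per_drop=10):
    # Halve the learning rate every epochs_per_drop epochs.
    return base_lr * (drop_factor ** (epoch // epochs_per_drop))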
How should we adjust the learning rate as we increase or decrease the batch size?
When using larger batches, we can afford larger learning rates, as the gradient estimated on the batch is closer to the true gradient. On the other hand, very small batches yield noisy estimates of the gradient, so it is advisable to also use small learning rates so that we don't diverge in our optimization procedure.
What can you say about the ability to converge and generalize of Adam vs. SGD?
solutions found by adaptive methods (e.g. Adam) generalize
worse than SGD, even when these solutions have better training performance. In other words,
adaptive methods converge faster, but have worse generalization performance than pure SGD.
SGD vs Adam optimizer formula
SGD is the plain gradient update scaled by the learning rate.
Adam additionally keeps a momentum term and a variance term: momentum is an exponential moving average of the gradient, and the variance term is an exponential moving average of the squared gradient; the step divides the (bias-corrected) momentum by the square root of the (bias-corrected) variance estimate.
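A minimal NumPy sketch of both update rules (the hyperparameter values shown are the commonly used defaults, included only for illustration):
import numpy as np

def sgd_step(w, grad, lr=0.01):
    return w - lr * grad

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # m: moving average of the gradient (momentum)
    # v: moving average of the squared gradient (variance term)
    # t: timestep, starting at 1
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)   # bias correction for the first steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v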
asynchronous SGD vs. synchronous SGD in distributed training
The choice between asynchronous and synchronous SGD in a distributed training environment depends on the specific requirements of the training process and the trade-off between speed and stability. Asynchronous SGD offers faster training because workers never wait for each other, but it can suffer from stale gradients (updates computed on outdated parameters), while synchronous SGD provides more stable convergence at the cost of potentially slower training, since every step waits for the slowest worker.
Why don’t we just initialize all weights in a neural network to zero?
If all weights are initialized to the same value, every neuron in a layer computes the same output and, during backpropagation, receives the same local derivatives. All neurons in the layer therefore perform the same update and never differentiate from one another, so the symmetry is never broken.
What are some sources of randomness in a neural network?
Random weight initialization, mini-batch shuffling, dropout, etc
what are some ways in which neural networks benefit from randomness during training?
1) the gradient noise in SGD helps escape poor local minima and saddle points
2) randomness in output sampling for LLMs, and in action sampling when RL agents play games such as chess, encourages exploration
3) dropout prevents the network from becoming too dependent on any individual weights
What’s a dead neuron?
A dead neuron in a neural network is a neuron that always outputs the same value, regardless of the
input. This typically happens when the neuron’s activation function is non-linear, like ReLU (Rectified
Linear Unit), and the input to the neuron is such that the activation function always outputs the
minimum value (which is 0 in the case of ReLU). For ReLU, this happens when the weights and biases
of the neuron are adjusted during training in such a way that the weighted sum of the inputs is always
negative, leading to an output of 0
how do we detect dead neurons and prevent them
We can monitor activations over a representative set of inputs; if a neuron's activation is 0 (or constant) for essentially all inputs, it is dead.
This can be addressed by using Leaky ReLU, layer norm, or proper weight initialization.
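A minimal NumPy sketch of that detection idea: record ReLU activations over a batch of inputs and flag units that never fire (the single random layer here is purely illustrative; with a healthy initialization the list will usually be empty, but it grows if units die during training):
import numpy as np

np.random.seed(0)
X = np.random.randn(256, 100)            # 256 inputs, 100 features
W = np.random.randn(100, 50)             # one illustrative layer with 50 units
b = np.random.randn(50)

activations = np.maximum(0, X @ W + b)   # ReLU activations, shape (256, 50)

# A unit is "dead" if it outputs 0 for every input in the batch
dead_units = np.where((activations == 0).all(axis=0))[0]
print("dead units:", dead_units)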
what is pruning and how do you decide what to prune?
- Magnitude. Remove weights whose magnitude is close to 0; remove neurons whose weight vector has an L2 norm close to 0.
- Activations. We can also use the training data to observe the activations of the neurons. We can remove neurons whose distribution of activations is extremely peaked (i.e. nearly invariant to the input); moreover, we can remove a neuron if its activation pattern is highly correlated with another neuron in the same layer.
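A minimal NumPy sketch of magnitude pruning: zero out the fraction of weights with the smallest absolute value (the 20% sparsity level is an illustrative choice):
import numpy as np

def magnitude_prune(W, sparsity=0.2):
    # Zero out the `sparsity` fraction of weights with the smallest |value|.
    threshold = np.quantile(np.abs(W), sparsity)
    mask = np.abs(W) >= threshold
    return W * mask, mask

W = np.random.randn(64, 64)
W_pruned, mask = magnitude_prune(W, sparsity=0.2)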