interview_ML Flashcards
In the described data collection process, we "compute a checksum for precise file deduplication, and compress the files for storage." What is meant by this?
A checksum gives you a numerical representation of the contents of a file and is used to verify the integrity of the data in that file.
We can remove duplicate files that have the same content but different names using this method.
In the context of the described data collection process, computing a checksum means generating a unique, fixed-size numerical value (often represented as a string of letters and numbers) that is calculated from the contents of a file or a piece of data. This checksum serves multiple purposes:
Integrity Verification: The checksum helps verify the integrity of the data or files over time. If the file is altered in any way, even a small change, recomputing the checksum will result in a different value. This allows for easy detection of corruption or unintended changes.
Deduplication: In the process described, the checksum is used for precise file deduplication. By computing and comparing checksums, it is possible to identify and eliminate duplicate files in the dataset, even if the files are named differently. This is because the checksum is based on the file content, not the file name or other metadata.
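A minimal Python sketch of checksum-based deduplication (the use of SHA-256, the directory walk, and the helper names are illustrative assumptions, not a specific pipeline):

import hashlib
import os

def file_checksum(path, chunk_size=8192):
    # Hash the file contents in chunks so large files don't need to fit in memory
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def deduplicate(root_dir):
    # Keep only the first file seen for each unique checksum
    seen = {}
    for dirpath, _, filenames in os.walk(root_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            digest = file_checksum(path)
            if digest not in seen:
                seen[digest] = path  # duplicates (same content, different name) are skipped
    return list(seen.values())

Because the checksum depends only on file content, files with identical content but different names map to the same digest and are dropped.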
What data preprocessing steps were done at Keysight?
Computed checksums (in Python) and deduplicated files; also weeded out some files based on the version of the code.
Excessively large code files were filtered out, as they were probably auto-generated.
Added the file path at the beginning of each file's data.
Tokenization was done to create chunks, and then a synthetic dataset was created by querying either Llama 2 or the GPT API, depending on data sensitivity.
Something I didn't do during the Keysight data work was dependency ordering; I read about it later in DeepSeek-Coder, where it makes a lot of sense. They give pseudocode for it, and their code performance is state of the art, which probably has something to do with it.
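A hedged sketch of the filtering and path-prepending steps described above (the size threshold and comment format are assumptions for illustration, not the actual Keysight settings):

import os

MAX_FILE_BYTES = 1_000_000  # assumed cutoff for "excessively large" (likely auto-generated) files

def preprocess_code_file(path):
    # Skip files above the size threshold
    if os.path.getsize(path) > MAX_FILE_BYTES:
        return None
    with open(path, "r", errors="ignore") as f:
        code = f.read()
    # Prepend the file path at the beginning of the file's data
    return f"# file: {path}\n{code}"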
Positional encoding: calculation and code (formula), and the rationale behind it.
import numpy as np

def positional_encoding(sentence_length, model_dim):
    pos_enc_matrix = np.zeros((sentence_length, model_dim))
    for pos in range(sentence_length):
        for i in range(0, model_dim, 2):
            pos_enc_matrix[pos, i] = np.sin(pos / (10000 ** (i / model_dim)))
            pos_enc_matrix[pos, i + 1] = np.cos(pos / (10000 ** (i / model_dim)))
    return pos_enc_matrix
Rationale behind the Formula
The usage of a sine function for even indices and a cosine function for odd indices allows the model to capture different phase relationships between sequences of different frequencies.
This approach offers a “hackish” way to prioritize positional information of varying scales. As the formula demonstrates, higher-frequency components influence words that are further along in the sentence, while lower-frequency components emphasize closer proximity.
The specific constant 1/10000 is introduced to prevent the function from saturating.
Positional encoding used in llama?
Are mu and sigma for layer norm calculated during training or at runtime?
Mu and sigma are calculated at runtime for each data sample (both during training and at inference), while gamma and beta are trained parameters.
Where is layer norm applied in the transformer architecture?
Before the activation function, after the linear layer, and after the skip connection.
This means the workflow within each sub-block of a transformer is as follows:
- Input from the previous layer
- Addition of the skip connection output
- Layer normalization
- Activation function (if any, depending on the sub-block)
However, some implementations and variations might apply layer normalization before adding the skip connection output. Both methods have been explored in practice, with different effects on training dynamics and performance. The original approach (normalization after the residual connection) is more common and is often preferred because it allows the model to preserve the raw output of each layer before it is normalized, which can be beneficial for learning complex dependencies.
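A PyTorch-style sketch contrasting the two placements (post-LN, as in the original Transformer, vs. the pre-LN variant); the feed-forward sub-layer and dimensions are illustrative:

import torch.nn as nn

class PostLNBlock(nn.Module):
    # Layer norm applied after the residual (skip) addition
    def __init__(self, dim=512, hidden=2048):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        return self.norm(x + self.ff(x))  # add skip connection, then normalize

class PreLNBlock(nn.Module):
    # Variant: normalize first, then apply the sub-layer and add the skip connection
    def __init__(self, dim=512, hidden=2048):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        return x + self.ff(self.norm(x))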
i. [E] What’s the geometric interpretation of the dot product of two vectors?
Multiplication of the length of one vector and the length of the projection of the other
vector onto the first one.
[E] Given a vector u, find vector v of unit length such that the dot product of u and v is
maximum
Given a vector u, the vector v of unit length that maximizes the dot product u · v is the vector that points
in the same direction as u. The vector v can be found by dividing u by its own magnitude, making it a
unit vector.
Give an example of how the outer product can be useful in ML.
The Covariance matrix is a commonly used quantity in Machine Learning algorithms
(e.g. PCA). Given a dataset X ∈ R^(n×d) with n samples and d features, we calculate the (empirical) covariance as follows:
Cov[X] = (1/n) ∑_{i=1}^{n} (x_i − x̄)(x_i − x̄)^T
where x̄ = (1/n) ∑_{i=1}^{n} x_i is the mean feature vector. Each term (x_i − x̄)(x_i − x̄)^T is an outer product of a centered sample with itself.
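A small numpy check (on synthetic data) that this covariance is an average of outer products:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))   # n = 100 samples, d = 3 features
x_bar = X.mean(axis=0)

# Average of outer products (x_i - x_bar)(x_i - x_bar)^T
cov_outer = sum(np.outer(x - x_bar, x - x_bar) for x in X) / len(X)

# Same thing in vectorized form: (1/n) (X - x_bar)^T (X - x_bar)
cov_vectorized = (X - x_bar).T @ (X - x_bar) / len(X)
assert np.allclose(cov_outer, cov_vectorized)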
[E] What does it mean for two vectors to be linearly independent?
Two vectors are linearly independent if no scalar multiple of one vector equals the other. This means
they do not lie on the same line through the origin and neither can be formed by scaling the other.
[M] Given two sets of vectors A = {a1, …, an} and B = {b1, …, bm}, how do you check that they share the same basis?
You first have to make sure A and B span spaces of the same dimension. This can be done by converting A and B into row echelon form and comparing their ranks.
Alternatively, use the Singular Value Decomposition (SVD): decompose A into the product of three matrices, A = U Σ V^T, where U and V are orthogonal matrices and Σ is a diagonal matrix. The number of non-zero diagonal entries in Σ is the rank of A.
Once you have the rank, take the r vectors that span the row space of A and of B, and show that every row of A is spanned by the rows of B and vice versa.
[M] Given n vectors, each of d dimensions. What is the dimension of their span?
Given n vectors, each of d dimensions, the dimension of their span depends on their linear
independence. If all vectors are linearly independent, the dimension of their span is min(n, d). If they are
not all linearly independent, the dimension will be less than min(n, d).
You can think of this as a matrix whose n columns are the d-dimensional vectors; the dimension of the span is the rank of that matrix, which is at most min(n, d).
[E] What's a norm? What are the L0, L1, L2, and L∞ norms?
A norm is a function that assigns a strictly positive length or size to each vector in a vector space, except for the zero vector, which is given a length of zero.
The L0 "norm" is the number of nonzero elements in a vector (not a true norm, since it is not homogeneous),
the L1 norm is the sum of the absolute values of the vector elements,
the L2 norm (also known as the Euclidean norm) is the square root of the sum of the squares of the vector elements,
and the L∞ norm is the maximum absolute value of the elements in the vector.
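A quick numpy check of these norms (for vectors, np.linalg.norm accepts ord = 0, 1, 2, and np.inf):

import numpy as np

v = np.array([3.0, -4.0, 0.0])
print(np.linalg.norm(v, ord=0))       # 2.0 -> number of non-zero elements (L0)
print(np.linalg.norm(v, ord=1))       # 7.0 -> sum of absolute values (L1)
print(np.linalg.norm(v, ord=2))       # 5.0 -> Euclidean length (L2)
print(np.linalg.norm(v, ord=np.inf))  # 4.0 -> maximum absolute value (L∞)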
[M] How do norm and metric differ? Given a norm, make a metric. Given a metric, can we make a norm?
Ans. A metric measures distances between pairs of things while a norm measures the size of a single item. Metrics can be defined on pretty much anything, while the notion of a norm applies only to vector spaces: the very definition of a norm requires that the things measured by the norm could be added and scaled. If you have a norm, you can define a metric by saying that the distance between a and b is the size of a - b.
On the other hand, if you have a metric you can't usually define a norm, because a general metric need not be translation-invariant or homogeneous (the discrete metric, for example).
[E] Why do we say that matrices are linear transformations?
Matrices represent linear transformations because they map linear combinations of vectors to other linear
combinations.
Specifically, for a matrix M and vectors u and v, and a scalar c, the following properties hold true, which are
the properties of linear transformations:
M(u + v) = Mu + Mv
Applying M to the sum of two vectors is the same as summing the results of applying M to each
vector individually.
M(cu) = c(Mu)
Applying M to a scalar multiple of a vector is the same as multiplying the result of applying M by
that same scalar.
These properties demonstrate that matrix multiplication exhibits the key aspects of a linear transformation:
1. Additivity - Applying the transformation to vector sums gives the sum of individually transformed
vectors
2. Homogeneity - Scalars distribute across the transformation
Thus, matrices and matrix multiplication inherently represent and operate as linear transformations
between vector spaces.
What’s the inverse of a matrix? Do all matrices have an inverse? Is the inverse of a matrix
always unique?
The inverse of a matrix A is another matrix A^-1 such that AA^-1 = A^-1A = I, where I is the identity matrix.
Not all matrices have an inverse; only square matrices that are non-singular (with a non-zero determinant)
have an inverse. When a matrix has an inverse, it is always unique.
[E] What does the determinant of a matrix represent?
The determinant of a matrix can be interpreted as a scaling factor for the transformation that the matrix
represents. Geometrically, it represents the volume scaling factor of the linear transformation described by
the matrix, including whether the transformation preserves or reverses the orientation of the space.
[E] What happens to the determinant of a matrix if we multiply one of its rows by a scalar t×R?
Multiplying a row of a matrix by a scalar t scales the determinant of the matrix by that scalar. So if the
original determinant was d, the new determinant will be t×d.
[M] A 4×4 matrix has four eigenvalues 3,3,2,−1. What can we say about the trace and the
determinant of this matrix?
The trace of the matrix, which is the sum of its eigenvalues, would be 3+3+2−1=7. The determinant of the
matrix, which is the product of its eigenvalues, would be 3×3×2×(−1)=−18
What is the determinant of a matrix with linearly dependent rows or columns?
0
[M] What’s the difference between the covariance matrix A^TA and the Gram matrix AA^T?
Suppose A ∈ R^(n×d), corresponding to n samples each having d features. Then the covariance matrix A^T A ∈ R^(d×d) captures the "similarity" between features, whereas the Gram matrix A A^T ∈ R^(n×n) captures the "similarity" between samples.
Given A ∈ R^(n×m) and b ∈ R^n, find x such that Ax = b.
When does this have a unique solution?
Why is it that when A has more columns than rows, Ax = b has multiple solutions?
Given a matrix A with no inverse, how would you solve the equation Ax = b?
Unique Solution
Square and Full Rank: If A is square (n = m) and has full rank (rank(A) = m = n), then A is invertible. The equation Ax = b has a unique solution given by x = A^(-1)b.
Non-Square but Full Column Rank: If A is not square (m ≠ n), but rank(A) = m (full column rank) and m < n (tall matrix), a left inverse A_L of A exists such that A_L A = I. This configuration implies Ax = b has a unique solution when b is in the range (column space) of A. The solution can be given by x = A_L b. In this scenario, A_L = (A^T A)^(-1)A^T and there is no null space of A (nullity(A) = 0) because all columns are linearly independent.
Multiple or No Solutions
Underdetermined System (m > n): If A is a wide matrix (more columns than rows, i.e. fewer equations than unknowns), the rank of A can be at most n, and the null space of A (nullity(A)) has dimension m - n (assuming rank(A) = n). This situation typically means:
No Solution: If b is not in the column space of A, then Ax = b has no solution.
Infinitely Many Solutions: If b is in the column space of A, there are infinitely many solutions because there are free variables associated with the non-trivial null space of A.
What is the pseudoinverse and how to calculate it?
Method 1: Singular Value Decomposition (SVD)
The most reliable method for computing the pseudoinverse of any matrix, whether square or rectangular, full rank or rank-deficient, is using its Singular Value Decomposition (SVD). The SVD of matrix A is given by:
A = U Σ V^T
where:
U is an n×n orthogonal matrix whose columns are the left singular vectors of A.
V is an m×m orthogonal matrix whose columns are the right singular vectors of A.
Σ is an n×m diagonal matrix with non-negative real numbers on the diagonal, known as the singular values.
The pseudoinverse A+ is then calculated as:
A+ = V Σ+ U^T
Here, Σ+ is obtained by taking the reciprocal of each non-zero singular value in Σ, and transposing the matrix.
Method 2: Using Matrix Transpose and Inversion
For matrices that are either full row rank or full column rank, the pseudoinverse can also be computed directly using matrix transposes and inversion:
Full Column Rank (m≤n and rank(A)=m):
A+ = (A^T A)^(-1) A^T
Full Row Rank (m≥n and rank(A)=n):
A+ = A^T (A A^T)^(-1)
Note: These formulas assume that A has full column or row rank, respectively.
In practice: form A^T A or A A^T, whichever is full rank (and hence invertible), and use the corresponding formula above to compute the pseudoinverse.
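A numpy sketch (random full-column-rank matrix for illustration) computing the pseudoinverse by both methods and comparing against np.linalg.pinv:

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))   # tall matrix, full column rank almost surely

# Method 1: SVD-based pseudoinverse (reciprocal of the non-zero singular values)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_pinv_svd = Vt.T @ np.diag(1.0 / s) @ U.T

# Method 2: full-column-rank formula (A^T A)^(-1) A^T
A_pinv_formula = np.linalg.inv(A.T @ A) @ A.T

assert np.allclose(A_pinv_svd, A_pinv_formula)
assert np.allclose(A_pinv_svd, np.linalg.pinv(A))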
What does the derivative represent?
Answer: The derivative of a function measures the sensitivity to change in the function output with
respect to a change in the input.
Moreover, when it exists, the derivative at a given point is the slope of the tangent line to the graph
of the function at that point. The tangent line is the best linear approximation of the function at
that input value. This is the reason why in gradient descent we (slowly) move in the (negative)
direction of the derivative
What’s the difference between derivative, gradient, and Jacobian?
When f: R → R, we calculate the derivative df/dx.
When f: R^n → R, we calculate the gradient:
∇f = [∂f/∂x1, ∂f/∂x2,…, ∂f/∂xn]
When f: R^n → R^m, we calculate the Jacobian (an mxn matrix):
Jac(f) = [
∂f1/∂x1 … ∂f1/∂xn
…
∂fm/∂x1 … ∂fm/∂xn
]
Say we have the weights w ∈ R^(d×m) and a mini-batch x of n elements, each element is of the shape 1 × d, so that x ∈ R^(n×d). We have the output y = f(x; w) = xw. What’s the dimension of the Jacobian ∂y/∂x?
First, notice that y ∈ R^(n×m). With that said, Jac_x(f) ∈ R^((n×m)×(n×d)), or equivalently Jac_x(f) ∈ R^((n·m)×(n·d)), given that we have reshaped the 4-dim tensor into a 2-dim tensor, i.e. a matrix.
In general, the Jacobian ∂y/∂x has dimension R^((dim of y) × (dim of x)).
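A small shape check using torch.autograd.functional.jacobian (the sizes n, d, m are arbitrary):

import torch
from torch.autograd.functional import jacobian

n, d, m = 4, 3, 2
w = torch.randn(d, m)
x = torch.randn(n, d)

J = jacobian(lambda inp: inp @ w, x)  # Jacobian of y = xw with respect to x
print(J.shape)  # torch.Size([4, 2, 4, 3]), i.e. (n, m, n, d); reshaped, this is (n*m) x (n*d)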
Given a very large symmetric matrix A that doesn't fit in memory, and a function f that can quickly compute the matrix-vector product f(x) = Ax, find the unit vector x so that x^T A x is minimal.
Hint: Can you frame it as an optimization problem and use gradient descent to find an approximate solution?
To find the unit vector x that minimizes x^T Ax, we can frame this as an optimization problem and approach it using an iterative algorithm like gradient descent or the conjugate gradient method. These methods update x in a direction that decreases x^T Ax; in the case of gradient descent, this involves computing the gradient 2Ax (via the function f) at each step and moving in its negative direction. We would iterate this process until convergence, ensuring at each step that x remains a unit vector by normalizing it.
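A hedged sketch of that procedure, where f is assumed to be the black-box routine returning Ax (a small explicit symmetric matrix stands in for the one that would not fit in memory):

import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(50, 50))
A = (M + M.T) / 2   # symmetric stand-in for the huge matrix

def f(x):
    # Assumed black-box: quickly computes the matrix-vector product Ax
    return A @ x

x = rng.normal(size=A.shape[0])
x /= np.linalg.norm(x)           # start from a random unit vector
lr = 0.01
for _ in range(2000):
    grad = 2 * f(x)              # gradient of x^T A x is 2Ax
    x = x - lr * grad            # gradient descent step
    x /= np.linalg.norm(x)       # project back onto the unit sphere

# x^T A x should be close to the smallest eigenvalue of A
print(x @ f(x), np.linalg.eigvalsh(A).min())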
Why do we need dimensionality reduction?
Dimensionality reduction is used to reduce the number of random variables under consideration and can
be divided into feature selection and feature extraction. It helps in reducing the time and storage space
required, removes multicollinearity, enhances the interpretation of the parameters, helps in visualizing data,
and most importantly, it can help in avoiding the curse of dimensionality.
Eigendecomposition is a common factorization technique used for dimensionality reduction. Is the eigendecomposition of a matrix always unique?
The decomposition is not always unique. Suppose A ∈ R^(2×2) has two equal eigenvalues λ1 = λ2 = λ, with corresponding eigenvectors u1, u2. Then:
Au1 = λ1u1 = λu1
Au2 = λ2u2 = λu2
Or written in matrix form:
A [u1 u2] = [u1 u2] [λ 0; 0 λ]
Notice that we can permute the matrix of eigenvectors (thus obtaining a different factorization):
A [u2 u1] = [u2 u1] [λ 0; 0 λ]
But we still end up with the same eigen-properties:
Au2 = λu2
Au1 = λu1
Name some applications of eigenvalues and eigenvectors
PCA (the principal components are eigenvectors of the data's covariance matrix).
We want to do PCA on a dataset of multiple features in different ranges. For example, one is in the range 0-1 and one is in the range 10 - 1000. Will PCA work on this dataset?
In PCA we are interested in the components that maximize the variance. If one component (e.g. human height) varies less than another (e.g. weight) because of their respective scales (meters vs. kilos), PCA might determine that the direction of maximal variance more closely corresponds with the ‘weight’ axis, if those features are not scaled. Since a change in height of one meter should be considered much more important than the change in weight of one kilogram, the previous assumption would be incorrect. Therefore, it is important to standardize the features before applying PCA.
Under what conditions can one apply eigendecomposition? What about SVD?
Eigendecomposition is possible only for (square) diagonalizable matrices. On the other hand, the Singular Value Decomposition (SVD) always exists (even for non-square matrices).
What’s the relationship between PCA and SVD?
Suppose we have data X ∈ R^(n×d) with n samples and d features. Moreover, assume that the data has been centered so that the mean of each feature is 0. Then, we can perform PCA in two main ways:
First, we compute the covariance matrix C = 1/(n-1)X^T X ∈ R^(d×d), and perform eigendecomposition: C = V L V^T, with eigenvalues as the diagonal of L ∈ R^(d×d), and eigenvectors as the columns of V ∈ R^(d×d). Then, we stack the k eigenvectors of V corresponding to the top k eigenvalues into a matrix V˜ ∈ R^(d×k). Finally, we obtain the component values as follows: X˜ = X V˜ ∈ R^(n×k).
Alternatively, instead of first computing the covariance matrix and then performing eigendecomposition, notice that given the above formulation, we can directly compute SVD on the data matrix X, thus obtaining: X = U Σ V^T. By construction, the right singular vectors in V are the eigenvectors of X^T X. Similarly, we stack the k right singular vectors corresponding to the top k singular values into a matrix V˜ ∈ R^(d×k). Finally, we obtain the component values as follows: X˜ = X V˜ ∈ R^(n×k).
Even though SVD is slower, it is often considered to be the preferred method because of its higher numerical accuracy.
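A numpy sketch checking that the two routes give the same top-k components (up to a sign flip per column), on random centered data:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X = X - X.mean(axis=0)            # center the data
k = 2

# Route 1: eigendecomposition of the covariance matrix
C = X.T @ X / (X.shape[0] - 1)
eigvals, eigvecs = np.linalg.eigh(C)      # eigenvalues in ascending order
V_eig = eigvecs[:, ::-1][:, :k]           # top-k eigenvectors

# Route 2: SVD of the centered data matrix
U, s, Vt = np.linalg.svd(X, full_matrices=False)
V_svd = Vt[:k].T                          # top-k right singular vectors

# The projected component values agree up to a per-column sign flip
assert np.allclose(np.abs(X @ V_eig), np.abs(X @ V_svd))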
What is the relationship between SVD and eigendecomposition?
Consider A ∈ R^(m×n) of rank r. Then, we can factorize A as follows:
A = U Σ V^T
where U ∈ R^(m×m) is an orthogonal matrix of left singular vectors, V ∈ R^(n×n) is an orthogonal matrix of right singular vectors, and Σ ∈ R^(m×n) is a “diagonal” matrix of singular values such that exactly r of the values σ_i := Σ_ii are non-zero.
By construction:
The left singular vectors of A are the eigenvectors of A A^T. From the Spectral Theorem, the eigenvectors (and thus the left singular vectors) are orthonormal.
The right singular vectors of A are the eigenvectors of A^T A. From the Spectral Theorem, the eigenvectors (and thus the right singular vectors) are orthonormal.
If λ is an eigenvalue of A^T A (or A A^T), then √λ is a singular value of A. From the positive semidefiniteness of A^T A (or A A^T), the eigenvalues (and thus the singular values) are non-negative.
What does it mean when a function is differentiable?
A function f: U → R is said to be differentiable at a ∈ U if the derivative:
f’(a) = lim(h→0) [f(a + h) - f(a)]/h
exists. This implies that the function is continuous at a. Note, however, that not every continuous function is differentiable: for example, |x| is continuous but not differentiable at x = 0; functions with sharp corners are continuous but not differentiable at those points.
[E] Give an example of when a function doesn’t have a derivative at a point.
An example is the function f(x) = |x|, which doesn’t have a derivative at x = 0. The graph of this function
has a sharp corner at x = 0, which means there is no single tangent line at that point.
Why can you still apply the derivative formula in practice? (See the next card on non-differentiable functions in ML.)
Give an example of non-differentiable functions that are frequently used in machine learning.
How do we do backpropagation if those functions aren’t differentiable?
An example is the ReLU (Rectified Linear Unit) function, which is non-differentiable at x = 0. In machine
learning, backpropagation with such functions often uses a concept called subgradient, which allows the
algorithm to bypass non-differentiability at certain points. For ReLU, the derivative is defined as 0 for x < 0
and 1 for x > 0, and at x = 0, any value between 0 and 1 can be used.
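A tiny numpy illustration of the convention commonly used in practice (taking the derivative to be 0 at x = 0):

import numpy as np

def relu(x):
    return np.maximum(0, x)

def relu_grad(x):
    # Subgradient convention: pick 0 at the non-differentiable point x = 0
    return (x > 0).astype(float)

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))       # [0. 0. 3.]
print(relu_grad(x))  # [0. 0. 1.]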
What does it mean for a function to be convex or concave?
A function is called convex if the line segment between any two points on the graph of
the function lies above the graph between the two points. More precisely, the function f : X → R is
convex if and only if for all 0 ≤ t ≤ 1 and all x1, x2 ∈ X:
f(tx1 + (1 − t)x2) ≤ tf(x1) + (1 − t)f(x2)
The function f is said to be concave if −f is convex.
Why is convexity desirable in an optimization problem?
Convexity is desirable because any local minimum of a convex function is also a global
minimum
Most ML algorithms we use nowadays use first-order derivatives (gradients) to construct the next
training iteration.
[E] How can we use second-order derivatives for training models?
Second-order derivatives can be used in optimization algorithms to better understand the curvature of
the loss function. This information can be used to adjust the learning rate and the direction of the
update steps, potentially leading to faster convergence.
Pros and cons of second-order optimization.
Pros: Can lead to faster convergence and more informed update steps.
Cons: Computationally expensive, as it requires calculating and inverting the Hessian matrix.
How can we use the Hessian (second derivative matrix) to test for critical points?
The Hessian matrix can be used to test the nature of critical points. If the Hessian at a point is positive
definite, the point is a local minimum; if it is negative definite, the point is a local maximum; and if it is
indefinite, the point is a saddle point
Jensen’s inequality forms the basis for many algorithms for probabilistic inference, including Expectation-Maximization and variational inference. Explain what Jensen’s inequality is.
As stated before, for a given convex function f, we had the following property:
f(tx1 + (1 - t)x2) ≤ tf(x1) + (1 - t)f(x2)
Let us generalize this property. Again, suppose we have a convex function f, variables x1, …, xn ∈ I, and non-negative real numbers α1, …, αn such that ∑i αi = 1. Then, by induction we have:
f(α1x1 + … + αnxn) ≤ α1f(x1) + … + αnf(xn)
Let's formalize it one step further. Consider a convex function f, a discrete random variable X with n possible values x1, …, xn, and real non-negative values ai = p(X = xi). Then, we obtain the general form of Jensen's inequality:
f(E[X]) ≤ E[f(X)]
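A quick numeric sanity check with the convex function f(x) = x^2 and a small discrete X (values and probabilities chosen arbitrarily):

import numpy as np

x = np.array([1.0, 2.0, 5.0])   # possible values of X
p = np.array([0.2, 0.5, 0.3])   # probabilities, summing to 1

f = lambda t: t ** 2            # a convex function
lhs = f(np.sum(p * x))          # f(E[X]) = 2.7^2 = 7.29
rhs = np.sum(p * f(x))          # E[f(X)] = 9.7
print(lhs, rhs, lhs <= rhs)     # Jensen: f(E[X]) <= E[f(X)] -> True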
Explain the chain rule.
The chain rule is a formula that expresses the derivative of the composition of two differentiable functions f and g in terms of the derivatives of f and g. More precisely, if h = f ∘ g is the function such that h(x) = f(g(x)) for every x, then the chain rule is:
dh/dx = df/dg · dg/dx
Given the function f(x, y) = 4x^2 - y with the constraint x^2 + y^2 = 1. Find the function’s maximum and minimum values.
Answer: In order to solve the constrained optimization problem, we form the Lagrangian:
L(x, y, λ) = 4x^2 - y + λ(x^2 + y^2 - 1)
Calculating the gradient and setting it to zero, we obtain:
∇_x,y,λ L = (∂L/∂x, ∂L/∂y, ∂L/∂λ) = (8x + 2λx, -1 + 2λy, x^2 + y^2 - 1) = 0
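Completing the computation (a worked continuation, not part of the original card): from ∂L/∂x = 2x(4 + λ) = 0, either x = 0 or λ = -4.
If x = 0, the constraint gives y = ±1, so f = -y takes the values -1 and 1.
If λ = -4, then ∂L/∂y = -1 + 2λy = 0 gives y = -1/8, and the constraint gives x^2 = 63/64, so f = 4·(63/64) + 1/8 = 65/16.
Hence the maximum value is 65/16 (at x = ±√63/8, y = -1/8) and the minimum value is -1 (at x = 0, y = 1).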
Let x ∈ R^n, L = crossentropy(softmax(x), y) in which y is a one-hot vector. Take the derivative of L with respect to x.
∂L/∂x = ∂/∂x [-y^T log(softmax(x))] = -y^T · diag(softmax(x))^(-1) · [diag(softmax(x)) - softmax(x) softmax(x)^T]
= -y^T [I - 1·softmax(x)^T] = softmax(x)^T - y^T (using y^T 1 = 1 for a one-hot y),
so ∂L/∂x = softmax(x) - y.
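A numpy sketch verifying the closed form softmax(x) - y against a central-difference numerical gradient:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def loss(x, y):
    return -np.sum(y * np.log(softmax(x)))

x = np.array([0.5, -1.0, 2.0])
y = np.array([0.0, 0.0, 1.0])   # one-hot target

analytic = softmax(x) - y       # closed-form gradient

eps = 1e-6
numeric = np.array([
    (loss(x + eps * np.eye(3)[i], y) - loss(x - eps * np.eye(3)[i], y)) / (2 * eps)
    for i in range(3)
])
assert np.allclose(analytic, numeric, atol=1e-5)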
Given a uniform random variable X in the range of [0,1] inclusively. What’s the probability that
X=0.5?
For a continuous uniform distribution, the probability of X being exactly any specific value, including 0.5, is
0. This is because the probability for a continuous distribution is defined over intervals, not specific points.
Can the values of PDF be greater than 1? If so, how do we interpret PDF?
Yes, the values of a Probability Density Function (PDF) can be greater than 1. The key point is that the
area under the PDF curve over the entire range must integrate to 1. A high PDF value does not represent
probability but rather indicates a higher density of the variable at that point.
What’s the difference between multivariate distribution and multimodal distribution?
A multivariate distribution is a probability distribution with more than one random variable, each with its
range of values. A multimodal distribution is a probability distribution with more than one peak or mode, regardless of how many variables it has.
What does it mean for two variables to be independent?
In general, continuous random variables X1, …, Xn admitting a joint density are all independent from each other if and only if:
p_{X1,…,Xn}(x1, …, xn) = p_{X1}(x1) · · · p_{Xn}(xn)
This equation states that the joint probability density function (pdf) of the random variables X1, …, Xn factorizes into the product of their individual pdfs, which is a necessary and sufficient condition for independence.
It’s a common practice to assume an unknown variable to be of the normal distribution. Why is that?
The Central Limit Theorem (CLT) states that the distribution of the sum of a large number of
independent, identically distributed random variables is approximately normal, regardless of the underlying distribution. Because so many things in the universe can be modeled as the sum of a large number of
independent random variables, the normal distribution pops up a lot.
Related: the central limit theorem and the law of large numbers.
How would you turn a probabilistic model into a deterministic model?
To convert a probabilistic model into a deterministic one, you typically use expected values, mode, or
median of the probability distributions as fixed values instead of random variables. This approach ignores
the variability and uncertainty represented by the probability distributions.
Explain frequentist vs. Bayesian statistics.
The frequentist approach: the goal is to use the sample data to build point estimates of the parameters (potentially with standard errors).
The Bayesian approach uses priors: the goal is to build a posterior distribution of the parameters, given the data at hand.
Code for merge sort
def merge_sort(arr):
    """
    Sorts an array using the merge sort algorithm.
    """
    if len(arr) > 1:
        mid = len(arr) // 2      # Finding the mid of the array
        left_half = arr[:mid]    # Dividing the elements into 2 halves
        right_half = arr[mid:]

        merge_sort(left_half)    # Sorting the first half
        merge_sort(right_half)   # Sorting the second half

        i = j = k = 0
        # Merge the two sorted halves back into arr
        while i < len(left_half) and j < len(right_half):
            if left_half[i] < right_half[j]:
                arr[k] = left_half[i]
                i += 1
            else:
                arr[k] = right_half[j]
                j += 1
            k += 1

        # Copy any remaining elements of either half
        while i < len(left_half):
            arr[k] = left_half[i]
            i += 1
            k += 1
        while j < len(right_half):
            arr[k] = right_half[j]
            j += 1
            k += 1
    return arr

Example usage
if __name__ == "__main__":
    sample_array = [12, 11, 13, 5, 6, 7]
    print("Original array:", sample_array)
    sorted_array = merge_sort(sample_array)
    print("Sorted array:", sorted_array)
Code to recursively read a JSON file
import json

def load_json_file(filename):
    """Load the JSON data from a file."""
    with open(filename, 'r') as file:
        return json.load(file)

def print_json_recursively(data, indent=0):
    """Recursively print JSON data with indentation for nested structures."""
    for key, value in data.items():
        print(' ' * indent + str(key) + ':', end=' ')
        if isinstance(value, dict):   # If value is a dictionary, recurse
            print()
            print_json_recursively(value, indent + 4)
        elif isinstance(value, list): # If value is a list, iterate each item
            print()
            for i, item in enumerate(value):
                print(' ' * (indent + 4) + f'[{i}]:', end=' ')
                if isinstance(item, dict):
                    print()
                    print_json_recursively(item, indent + 8)
                else:
                    print(item)
        else:
            print(value)

def main():
    filename = 'path_to_your_json_file.json'
    try:
        json_data = load_json_file(filename)
        print_json_recursively(json_data)
    except Exception as e:
        print(f"An error occurred: {e}")

if __name__ == "__main__":
    main()
Find the longest increasing subsequence in a string.
def longest_increasing_subsequence(s):
    # Cache for memoization
    memo = {}

    def rec(i, prev):
        if i == len(s):
            return 0  # Base case: end of string
        # Check memoized results
        if (i, prev) in memo:
            return memo[(i, prev)]
        # Option 1: Include the current character if it continues the sequence
        taken = 0
        if prev < s[i]:
            taken = 1 + rec(i + 1, s[i])
        # Option 2: Skip the current character
        not_taken = rec(i + 1, prev)
        # Store result in memoization dictionary
        memo[(i, prev)] = max(taken, not_taken)
        return memo[(i, prev)]

    # Start the recursion; use a character smaller than any possible as the initial 'previous'
    return rec(0, chr(0))

Example usage
s = "azbycxdwe"
print("Length of the longest increasing subsequence is:", longest_increasing_subsequence(s))
Traverse a tree in pre-order, in-order, and post-order.
class TreeNode:
    def __init__(self, val=0, left=None, right=None):
        self.val = val
        self.left = left
        self.right = right

def preorder_traversal(root):
    if root is None:
        return []
    return [root.val] + preorder_traversal(root.left) + preorder_traversal(root.right)

def inorder_traversal(root):
    if root is None:
        return []
    return inorder_traversal(root.left) + [root.val] + inorder_traversal(root.right)

def postorder_traversal(root):
    if root is None:
        return []
    return postorder_traversal(root.left) + postorder_traversal(root.right) + [root.val]
- Preorder Traversal (Iterative)
def preorder_traversal_iterative(root):
    if root is None:
        return []
    stack, output = [root], []
    while stack:
        node = stack.pop()
        if node:
            output.append(node.val)
            stack.append(node.right)  # Right child pushed first so that left is processed first
            stack.append(node.left)
    return output
In order traversal of BST
def inorder_traversal_iterative(root):
    stack, output = [], []
    current = root
    while current or stack:
        while current:
            stack.append(current)
            current = current.left
        current = stack.pop()
        output.append(current.val)
        current = current.right
    return output
Post order traversal of BST
def postorder_traversal_iterative(root):
    if root is None:
        return []
    stack, output = [root], []
    while stack:
        node = stack.pop()
        output.append(node.val)
        if node.left:
            stack.append(node.left)
        if node.right:
            stack.append(node.right)
    return output[::-1]  # Reverse the result because we want left-right-root
Do post order traversal question
Given an array of integers and an integer k, find the total number of continuous subarrays whose
sum equals k. The solution should have O(N) runtime
def subarraySum(nums, k):
    # Dictionary to store the frequency of cumulative sums
    cumulative_sum_count = {0: 1}  # Base case: sum of 0 exists once
    current_sum = 0
    count = 0
    for num in nums:
        current_sum += num
        # Check if there is a prefix subarray we can subtract
        # that results in the current subarray summing to k
        sum_needed = current_sum - k
        if sum_needed in cumulative_sum_count:
            count += cumulative_sum_count[sum_needed]
        # Update the count of the current_sum in the hashmap
        if current_sum in cumulative_sum_count:
            cumulative_sum_count[current_sum] += 1
        else:
            cumulative_sum_count[current_sum] = 1
    return count
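Example usage (expected outputs in the comments):

print(subarraySum([1, 1, 1], 2))                   # 2 -> [1, 1] appears twice
print(subarraySum([3, 4, 7, 2, -3, 1, 4, 2], 7))   # 4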
You have three matrices: A ∈ R100×5, B ∈ R5×200, C ∈ R200×20, and you need to calculate the product ABC. In what order would you perform your multiplication and why?
Since matrix multiplication is associative, the answer is the same whether we multiply in the order of (AB)C or A(BC). However, let us observe the cost through the number of scalar multiplications we need to perform:
(AB)C: 100 · 5 · 200 + 100 · 200 · 20 = 100000 + 400000 = 500000
A(BC): 5 · 200 · 20 + 100 · 5 · 20 = 20000 + 10000 = 30000
Obviously, the second approach is computationally cheaper.
What are some of the causes for numerical instability in deep learning?
Overflow, underflow, division by zero, log 0, NaN as input, etc.
In many machine learning techniques (e.g. batch norm), we often see a small term ϵ
added to the calculation. What’s the purpose of that term?
The purpose is to avoid operations that are undefined for 0, such as division by 0, log 0, etc
What made GPUs popular for deep learning? How are they compared to TPUs?
GPUs became popular for deep learning because matrix multiplications can be efficiently parallelized over hundreds of cores. TPUs (Tensor Processing Units) are specialized hardware for neural nets, with the key difference that they have lower precision for representing floating-point numbers, allowing for:
Higher memory throughput
Faster addition and multiplication operations
This design enables TPUs to accelerate neural network computations while reducing power consumption and increasing efficiency.
What are the time and space complexity for doing backpropagation on a recurrent neural network?
O(B · T · w) - time
O(w + B · T · a) - space
For the forward-pass of a single example in one timestep we need to evaluate all the weights, resulting in O(w) time complexity, where w is the number of weights. Due to the recurrence, we repeat the computation for T timesteps, resulting in O(T · w). Moreover, performing this un-rolled forward pass for an entire batch will amount the time complexity to O(B · T · w). Lastly, we note that the time complexity of the forward and the backward pass is the same.
As for the space complexity, note that we need to keep in memory both the network weights and the activations from the forward pass (required for the backprop computation). Given that storing the activations for a single timestep is O(a), the space complexity amounts to O(w + B · T · a)
Write a function find_bigrams to take a string and return a list of all bigrams.
def find_bigrams(text_list: list):
    result = []
    for ls in text_list:
        words = ls.lower().split()
        for bi in zip(words, words[1:]):
            result.append(bi)
    return result

text = ["Data drives everything", "Get the skills you need for the future of work"]
print(find_bigrams(text))
Note that this version takes a list of strings; for a single string s, call find_bigrams([s]).