Foundations Flashcards
What’s broadcasting?
It enables element-wise operations on arrays or tensors of different shapes efficiently and intuitively: the smaller operand is treated as if it were expanded to the shape of the larger one, without actually copying data. It plays a critical role in simplifying the implementation of mathematical operations in high-dimensional data processing and deep learning frameworks.
What’s the mathematical formula of the first part of a neural network?
out = Σᵢ xᵢ · wᵢ + b (the weighted sum of the inputs plus a bias)
x = input
w = weight
b = bias
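A minimal sketch of this formula in plain Python (the specific numbers are just for illustration):

# single neuron: weighted sum of the inputs plus a bias
x = [0.5, -1.0, 2.0]   # inputs
w = [0.1, 0.4, -0.2]   # weights
b = 0.3                # bias
out = sum(x_i * w_i for x_i, w_i in zip(x, w)) + b
print(out)             # 0.05 - 0.4 - 0.4 + 0.3, roughly -0.45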
What's a scalar?
A scalar is a single number, i.e. a rank-0 tensor. During broadcasting with a scalar, the scalar can be thought of as being expanded to a tensor of the
same shape as the input tensor
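A quick PyTorch illustration (a NumPy array behaves the same way):

import torch
t = torch.tensor([1., 2., 3.])
print(t * 2)   # tensor([2., 4., 6.]); the scalar 2 acts as if expanded to [2., 2., 2.]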
What are the two criteria that tell us whether two tensors are broadcastable or not?
- Two tensors of different ranks are broadcastable if the lengths of the axes of the lower-ranked tensor match the lengths of the trailing axes of the higher-ranked tensor
- Two axis lengths match in the sense of broadcasting if either one of them is 1 or they are equal
Trailing axes: the n last dimensions of the higher-ranked tensor, where n is the rank of the tensor you're broadcasting with (see the example below)
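For example, comparing shapes from the trailing end (a minimal sketch; the shapes are chosen for illustration):

import torch
a = torch.ones(5, 1, 4)   # shape (5, 1, 4)
b = torch.ones(3, 1)      # shape    (3, 1)
# trailing axes: 4 vs 1 -> one of them is 1, OK; 1 vs 3 -> one of them is 1, OK
print((a + b).shape)      # torch.Size([5, 3, 4])
# torch.ones(5, 2, 4) + torch.ones(3, 1) would fail: 2 vs 3, neither equal nor 1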
What does the “expand_as” method do?
called as c.expand_as(m), it virtually expands the tensor c to have the same shape as m without copying the underlying data (it returns a view)
What does the method “unsqueeze()” do?
it adds an axis of length one at the given position
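A small sketch of both methods on made-up tensors c and m:

import torch
m = torch.zeros(3, 4)
c = torch.tensor([1., 2., 3., 4.])
print(c.unsqueeze(0).shape)    # torch.Size([1, 4]): new axis at position 0
print(c.unsqueeze(1).shape)    # torch.Size([4, 1]): new axis at position 1
print(c.expand_as(m).shape)    # torch.Size([3, 4]): c viewed with m's shape, no data copied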
Why is it important to initialize the weights?
to keep the activations in a stable range: if the standard deviation grows from layer to layer the values may overflow (explode), and if the weights are too small the values may eventually vanish
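A minimal sketch of the problem: repeated multiplication by unscaled random weights makes the activations explode, while too-small weights make them vanish (layer count and sizes chosen for illustration):

import torch
x = torch.randn(512)
a, b = x.clone(), x.clone()
for _ in range(50):
    a = a @ torch.randn(512, 512)            # std 1: too large
    b = b @ (torch.randn(512, 512) * 0.01)   # std 0.01: too small
print(a.std(), b.std())   # a blows up to inf/nan, b collapses towards 0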
What are the different types of initialization techniques?
- Xavier (Glorot): scaling factor 1/√n, n = number of inputs
- Kaiming (He): scaling factor √(2/n), which accounts for ReLU roughly halving the variance (see the sketch below)
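A minimal sketch of both scalings applied by hand (PyTorch also provides torch.nn.init.xavier_normal_ and torch.nn.init.kaiming_normal_):

import math, torch
n_in, n_out = 784, 50
w_xavier  = torch.randn(n_in, n_out) / math.sqrt(n_in)       # Xavier / Glorot
w_kaiming = torch.randn(n_in, n_out) * math.sqrt(2 / n_in)   # Kaiming / He, for ReLU
print(w_xavier.std(), w_kaiming.std())   # roughly 0.036 and 0.051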
What disadvantage does Xavier initialization have compared to Kaiming?
With ReLU activations the Xavier method doesn't preserve the standard deviation of the activations: ReLU roughly halves the variance, which Kaiming's extra factor of 2 compensates for.
Which of the following statements are true about weight initialization? (Multiple Choice)
1. The ReLU activation function preserves the distribution of values by leaving the majority of them unchanged (the linear part) and mapping values below zero to zero, which is acceptable given the desired mean of zero.
2. It is satisfactory to have the mean and variance of the distribution of output values average out to zero and one, respectively, across multiple initializations. In individual cases, these values may deviate.
3. If a pre-trained model is used and no new weights are added, we do not need Xavier and Kaiming initialization at all.
4. In larger networks, the initialization process is relatively less critical due to the involvement of numerous random numbers. As a result, the likelihood of individual numbers impacting the overall outcome is mitigated.
5. Even with Xavier and Kaiming initialization, it can occur by chance that the weights of a neural network are initialized in such a way that the network is unable to learn anything useful.
2,3,5
Which of the following statements is true about ANNs? (Multiple Choice)
1. All standard weight operations can be expressed as matrix multiplications. This makes neural network operations so efficient when executed on GPUs.
2. A single neuron cannot be implemented in plain Python; PyTorch or a similar deep learning library is required.
3. It is not possible to express the weights of a layer in a single matrix because the biases have to be separated from the input weights.
4. If one could obtain a fast enough GPU, while using only plain Python code, one could beat PyTorch’s CPU execution time for matrix multiplication.
1
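A small sketch of statement 1: a fully connected layer is just a matrix multiplication plus a (broadcast) bias; the sizes here are illustrative:

import torch
x = torch.randn(64, 784)   # a batch of 64 inputs
w = torch.randn(784, 10)   # weights of a 784 -> 10 layer
b = torch.randn(10)        # biases
out = x @ w + b            # matrix multiplication + broadcast bias
print(out.shape)           # torch.Size([64, 10])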
How do we know if two tensors are broadcastable?
All their dimensions are compatible when compared from the last (trailing) dimension backwards. Two dimensions are compatible if they are equal, if one of them is 1 (it is then broadcast to match the other), or if one of them is missing.
Why is PyTorch generally faster than plain Python for deep learning tasks?
A) It has a more intuitive API
B) It uses functions implemented in C/C++
C) It has better visualization tools
D) It requires less memory
Answer: B) It uses functions implemented in C/C++
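A rough sketch of that difference (exact timings depend on the hardware; the helper function name is made up):

import time, torch

def py_matmul(a, b):
    # naive triple loop in plain Python
    n, k, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for l in range(k):
                out[i][j] += a[i][l] * b[l][j]
    return out

a, b = torch.randn(100, 100), torch.randn(100, 100)
t0 = time.time(); py_matmul(a.tolist(), b.tolist()); t_py = time.time() - t0
t0 = time.time(); _ = a @ b; t_torch = time.time() - t0
print(t_py, t_torch)   # the C/C++ backed matmul is typically orders of magnitude faster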
Which of the following is a common issue that can occur with improper weight initialization in a neural network?
A) Faster convergence
B) Overfitting
C) Vanishing or exploding gradients
D) Reduced model complexity
Answer: C) Vanishing or exploding gradients
If a neural network has 3 layers with weights initialized using a normal distribution with a mean of 0 and a variance of 1, what is the expected variance of the output for each layer?
Answer:
For a unit y = Σᵢ wᵢ · xᵢ with independent, zero-mean weights and inputs, Var(y) = n · Var(w) · Var(x). With Var(w) = 1 the output variance is therefore multiplied by n (the number of inputs) at every layer and grows quickly; it remains 1 only if the weights are scaled so that Var(w) = 1/n (i.e. by 1/√n, as in Xavier initialization), assuming no activation or normalization layers that affect the output variance.
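A quick empirical check of the Var(y) = n · Var(w) · Var(x) rule (the layer width is arbitrary):

import torch
n = 256
x = torch.randn(10000, n)         # unit-variance inputs
w1 = torch.randn(n, n)            # Var(w) = 1
w2 = torch.randn(n, n) / n**0.5   # Var(w) = 1/n, Xavier-style
print((x @ w1).var())             # roughly n (about 256)
print((x @ w2).var())             # roughly 1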
Explain the concept of broadcasting in the context of neural network operations. Provide an example of how it can be used in matrix operations.
Answer:
Broadcasting is a technique that allows NumPy or PyTorch to perform element-wise operations on arrays of different shapes by automatically expanding the smaller array to match the shape of the larger one. This is useful for efficient computation without the need for manual array resizing.
Example:
If A=np.array([[1,2],[3,4]]) and
B=np.array([1,2]), broadcasting allows you to perform C=A+B, resulting in C=np.array([[2,4],[4,6]]).
Discuss the importance of proper weight initialization in deep learning models. What issues might arise from poor initialization?
Answer:
Proper weight initialization is crucial because it affects the convergence speed and stability of a deep learning model during training. Poor initialization can lead to problems such as vanishing or exploding gradients, where gradients either become too small to propagate back effectively or grow too large, causing numerical instability. Good initialization techniques, like Xavier or Kaiming initialization, aim to maintain a stable variance of outputs and gradients throughout the network.
You implemented a neural network but noticed that the gradients of the weights are either too large or too small during training. What could be the possible reasons, and how would you address this issue?
Answer:
The issue could be due to improper weight initialization, leading to vanishing or exploding gradients. To address this, one could use initialization methods like Xavier (Glorot) or Kaiming (He) initialization, which take into account the number of input and output units to maintain gradient flow. Additionally, gradient clipping can be used to prevent gradients from becoming too large.
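A minimal sketch of both remedies in PyTorch (the model architecture is a placeholder):

import torch
from torch import nn

model = nn.Sequential(nn.Linear(784, 50), nn.ReLU(), nn.Linear(50, 10))
for layer in model:
    if isinstance(layer, nn.Linear):
        nn.init.kaiming_normal_(layer.weight)   # Kaiming (He) initialization for ReLU nets
        nn.init.zeros_(layer.bias)

# inside the training loop, after loss.backward():
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)   # gradient clipping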
Explain the difference between the forward pass and the backward pass in a neural network. Why is the backward pass critical for model training?
Answer:
The forward pass in a neural network involves computing the output predictions from the input data by passing it through the network layers sequentially. The backward pass, also known as backpropagation, involves computing the gradients of the loss function with respect to the model parameters, enabling the model to learn by updating its weights. The backward pass is critical for training because it allows the model to minimize the loss function by adjusting the weights in the direction that reduces the error.
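A tiny sketch of one forward and one backward pass (the data and model are made up for illustration):

import torch
from torch import nn
import torch.nn.functional as F

model = nn.Linear(3, 1)
x, y = torch.randn(8, 3), torch.randn(8, 1)

pred = model(x)                   # forward pass: compute predictions
loss = F.mse_loss(pred, y)        # compare predictions with the targets
loss.backward()                   # backward pass: gradients of the loss w.r.t. the parameters
print(model.weight.grad.shape)    # torch.Size([1, 3]); used by the optimizer to update the weights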