Foundations Flashcards
What’s broadcasting?
Broadcasting enables element-wise operations on arrays or tensors of different shapes, efficiently and without explicitly copying data. It greatly simplifies the implementation of mathematical operations on high-dimensional data in deep learning frameworks.
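A minimal PyTorch sketch (the shapes are chosen only for illustration): adding a column to a row yields a full matrix, because each tensor is virtually expanded along its length-1 axes.

```python
import torch

col = torch.arange(3.).reshape(3, 1)   # shape (3, 1)
row = torch.arange(4.).reshape(1, 4)   # shape (1, 4)
out = col + row                        # both broadcast to shape (3, 4)
print(out.shape)                       # torch.Size([3, 4])
```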
What’s the mathematical formula of the first part of a neural network?
out = Σᵢ xᵢ · wᵢ + b
x = input
w = weight
b = bias
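A short sketch of this formula for a single neuron (the input size of 5 is arbitrary):

```python
import torch

x = torch.randn(5)        # input vector
w = torch.randn(5)        # one weight per input
b = torch.randn(1)        # bias
out = (x * w).sum() + b   # out = Σᵢ xᵢ·wᵢ + b, equivalent to x @ w + b
```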
What's a scalar?
A single value (a rank-0 tensor). During broadcasting with a scalar, the scalar can be thought of as being expanded to a tensor of the same shape as the tensor it is combined with.
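For example (the values are arbitrary), multiplying a matrix by the scalar 2 behaves as if 2 had first been expanded to the matrix's shape:

```python
import torch

m = torch.ones(2, 3)
print(m * 2)   # the scalar acts like a (2, 3) tensor filled with 2s
```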
What are the two criteria that tell us whether two tensors are broadcastable?
- Two higher-dimensional tensors are broadcastable if the lengths of the axes of the lower-ranked tensor match the lengths of the trailing axes of the higher-ranked tensor.
- Two axis lengths match, in the sense of broadcasting, if either one of them is 1 or they are equal.
Trailing axes: the last n dimensions of the higher-ranked tensor, where n is the rank of the (smaller) tensor you're broadcasting with.
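A sketch of the rule (shapes are illustrative): the lower-ranked tensor's axes are compared against the trailing axes of the higher-ranked one.

```python
import torch

a = torch.randn(5, 3, 4)   # higher-ranked tensor
b = torch.randn(3, 1)      # lower-ranked tensor
# b's axes (3, 1) are compared with a's trailing axes (3, 4):
# 3 == 3, and 1 matches 4 because a length of 1 always matches.
print((a + b).shape)       # torch.Size([5, 3, 4])
```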
What does the “expand_as” method do?
It virtually expands the tensor c to have the same shape as the tensor m (e.g. c.expand_as(m)), without copying any data.
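A small sketch using tensors named c and m as in the answer (shapes are illustrative):

```python
import torch

m = torch.randn(3, 4)
c = torch.tensor([1., 2., 3., 4.])   # shape (4,)
print(c.expand_as(m).shape)          # torch.Size([3, 4]); a view, no data is copied
```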
What does the method “unsqueeze()” do?
It adds an axis of length one at the given position.
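For example (a vector of length 4 is used for illustration):

```python
import torch

c = torch.randn(4)
print(c.unsqueeze(0).shape)   # torch.Size([1, 4]): new axis at position 0
print(c.unsqueeze(1).shape)   # torch.Size([4, 1]): new axis at position 1
```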
Why is it important to initialize the weights?
To keep the activations in a sensible range: if the standard deviation grows from layer to layer the values may eventually overflow (explode), and if the weights are too small the values may eventually vanish.
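A quick sketch of the problem (the layer width of 512, the 50 layers and the scale factors are arbitrary): repeated matrix multiplications with badly scaled random weights make the activations explode or vanish.

```python
import torch

x = torch.randn(512)
for scale in (1.0, 0.01):
    a = x.clone()
    for _ in range(50):
        a = (torch.randn(512, 512) * scale) @ a   # one "layer" with random weights
    print(scale, a.std())   # overflows (inf/nan) for 1.0, nearly zero for 0.01
```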
What are the different types of initialization techniques?
- Xavier: scaling factor 1/√n, where n = number of inputs
- Kaiming: scaling factor √(2/n)
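A sketch of both scaling factors (layer sizes and batch size are arbitrary); the extra √2 in Kaiming compensates for a following ReLU zeroing half of the values.

```python
import math
import torch

n_in, n_out = 512, 512
x = torch.randn(1000, n_in)                                   # inputs with mean 0, std 1

w_xavier  = torch.randn(n_in, n_out) / math.sqrt(n_in)        # scale 1/√n
w_kaiming = torch.randn(n_in, n_out) * math.sqrt(2 / n_in)    # scale √(2/n)

print((x @ w_xavier).std())    # ≈ 1: std preserved through a plain linear layer
print((x @ w_kaiming).std())   # ≈ √2, which a following ReLU brings back down toward 1
```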
What disadvantage does Xavier initialization have compared to Kaiming?
The Xavier method does not preserve the standard deviation when the layer is followed by a ReLU (which zeroes the negative values); Kaiming initialization compensates for this with its extra factor of √2.
Which of the following statements is true about weight initialization? (Multiple Choice)
1. The ReLU activation function preserves the distribution of values by leaving the majority of them unchanged (the linear part) and mapping values below zero to zero, which is acceptable given the desired mean of zero.
2. It is satisfactory to have the mean and variance of the distribution of output values average out to zero and one, respectively, across multiple initializations. In individual cases, these values may deviate.
3. If a pre-trained model is used and no new weights are added, we do not need Xavier and Kaiming initialization at all.
4. In larger networks, the initialization process is relatively less critical due to the involvement of numerous random numbers. As a result, the likelihood of individual numbers impacting the overall outcome is mitigated.
5. Even with Xavier and Kaiming initialization, it can occur by chance that the weights of a neural network are initialized in such a way that the network is unable to learn anything useful.
Answer: 2, 3, 5
Which of the following statements is true about ANNs? (Multiple Choice)
1. All standard weight operations can be expressed as matrix multiplications. This makes neural network operations so efficient when executed on GPUs.
2. A single neuron cannot be implemented in plain Python; PyTorch or a similar deep learning library is required.
3. It is not possible to express the weights of a layer in a single matrix because the biases have to be separated from the input weights.
4. If one could obtain a fast enough GPU, while using only plain Python code, one could beat PyTorch’s CPU execution time for matrix multiplication.
Answer: 1
How do we know if two tensors are broadcastable?
All of their dimensions, compared position by position from the trailing end, are compatible: two dimensions are compatible if they are equal or if one of them is 1 (in which case it is broadcast to the other).
Why is PyTorch generally faster than plain Python for deep learning tasks?
A) It has a more intuitive API
B) It uses functions implemented in C/C++
C) It has better visualization tools
D) It requires less memory
Answer: B) It uses functions implemented in C/C++
Which of the following is a common issue that can occur with improper weight initialization in a neural network?
A) Faster convergence
B) Overfitting
C) Vanishing or exploding gradients
D) Reduced model complexity
Answer: C) Vanishing or exploding gradients
If a neural network has 3 layers with weights initialized using a normal distribution with a mean of 0 and a variance of 1, what is the expected variance of the output for each layer?
Answer:
With weights of variance 1, the variance of each layer's output is roughly n times the variance of its input (n = number of inputs to the layer), so it grows with every layer. The output variance only remains 1 if the weights are scaled for the input dimension (variance 1/n, i.e. the 1/√n factor), assuming no activation or normalization layers that affect the variance.