AI Flashcards
Given a convolution layer with 3 input channels, 64 output channels, kernel size 4x4, stride 2, dilation 3, and padding 1, what is the parameter size of this convolution layer?
3x64x4x4
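A minimal PyTorch sketch to double-check the count; note that stride, dilation, and padding do not change the number of weights (the bias, if enabled, would add another 64 parameters):

```python
import torch.nn as nn

# Conv layer from the card: 3 input channels, 64 output channels, 4x4 kernel.
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=4,
                 stride=2, dilation=3, padding=1)
print(conv.weight.shape)    # torch.Size([64, 3, 4, 4])
print(conv.weight.numel())  # 3 * 64 * 4 * 4 = 3072
```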
In PyTorch (import torch.nn as nn), which of the following layers downsamples the input size to half?
nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, stride=2, padding=1, dilation=1)
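A quick sketch verifying the halving on a hypothetical 32x32 input:

```python
import torch
import torch.nn as nn

# A stride-2 conv with kernel 3 and padding 1 halves the spatial dimensions.
layer = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3,
                  stride=2, padding=1, dilation=1)
x = torch.randn(1, 3, 32, 32)
print(layer(x).shape)  # torch.Size([1, 64, 16, 16])
```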
Which of the following statements is true about the convolution layer?
The convolution layer is linear, and it is often used together with an activation function.
In the design of an autoencoder, the encoder and decoder should follow the exact same structure.
FALSE
All regularizations (e.g., L1 norm, L2 norm) penalize larger parameters.
TRUE
When updating parameters using gradient descent, which way of calculating the loss works better (i.e., gives a better trade-off between efficiency and robustness)?
Calculate the loss for a mini-batch of data examples in every iteration
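A minimal sketch of this practice with a toy dataset and a hypothetical linear model; a DataLoader yields mini-batches, and each iteration computes the loss on one batch before updating the parameters:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy regression data and model (shapes are illustrative only).
dataset = TensorDataset(torch.randn(1000, 10), torch.randn(1000, 1))
loader = DataLoader(dataset, batch_size=32, shuffle=True)
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for xb, yb in loader:  # one parameter update per mini-batch
    loss = torch.nn.functional.mse_loss(model(xb), yb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```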
MaxPooling preserves detected features and downsamples the feature map (image).
TRUE
What is the size of the receptive field for two stacked dilated convolution layers with kernel size 3x3, stride 1, and dilation 2?
9x9
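A quick worked check, assuming the standard receptive-field recurrence (effective kernel size = dilation x (kernel - 1) + 1; with stride 1 the step between layers stays 1):

```python
# Two stacked 3x3 convs, stride 1, dilation 2.
k, dilation, stride = 3, 2, 1
effective_k = dilation * (k - 1) + 1   # 5

rf = 1
for _ in range(2):                     # two stacked layers
    rf += (effective_k - 1) * stride
print(rf)                              # 9 -> receptive field is 9x9
```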
In a CNN, two conv layers cannot be connected directly; we must use a pooling layer in between.
FALSE
In the design of a CNN, the fully connected layers usually contain many more parameters than the conv layers.
TRUE
What is the purpose of the ReLU activation function in a CNN?
To introduce non-linearity
What is the main advantage of using dropout in a CNN?
Preventing overfitting
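A small sketch showing dropout's behavior in training versus evaluation mode (p = 0.5 is just an illustrative value):

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()
print(drop(x))  # roughly half the entries zeroed, survivors scaled by 1/(1-p)

drop.eval()
print(drop(x))  # identity at evaluation time
```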
In mini-batch SGD training, an important practice is to shuffle the training data before every epoch. Why?
It helps the training converge faster and prevents bias.
Logistic regression is widely used to solve classification problems by predicting probabilities of discrete (or categorical) values.
TRUE
Which of the following statements is true about activation functions in the context of neural networks and backpropagation?
Activation functions like ReLU (Rectified Linear Unit) introduce non-linear properties to the model, allowing it to learn complex patterns.
Which case is overfitting?
Training error is low, but testing error is high
What approach could be used to handle overfitting?
Use regularization
Besides penalizing larger parameters, which regularization makes parameters more sparse?
L1 norm
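A minimal sketch of adding an L1 penalty to a toy model's loss; the model, data, and the penalty weight 1e-3 are hypothetical. The absolute-value penalty pushes weights toward exactly zero, which is what makes the parameters sparse:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)

mse = nn.functional.mse_loss(model(x), y)
l1_lambda = 1e-3  # hypothetical regularization strength
l1_penalty = sum(p.abs().sum() for p in model.parameters())
loss = mse + l1_lambda * l1_penalty
loss.backward()
```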
In backpropagation, which claim is true?
The backward pass uses the information preserved in the forward pass to calculate gradients
As an activation function, tanh avoids the vanishing gradient problem.
FALSE
As an activation function, ReLU solves the vanishing gradient problem.
TRUE
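A small sketch comparing the gradients of tanh and ReLU on a few sample inputs; tanh saturates for large |x| (gradient near zero), while ReLU keeps a gradient of 1 for all positive inputs:

```python
import torch

x = torch.tensor([-5.0, -1.0, 1.0, 5.0], requires_grad=True)
torch.tanh(x).sum().backward()
print(x.grad)  # gradients ~0.0002 at |x| = 5 (saturation)

x = torch.tensor([-5.0, -1.0, 1.0, 5.0], requires_grad=True)
torch.relu(x).sum().backward()
print(x.grad)  # 0 for negative inputs, 1 for positive inputs
```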
Which of the following is not correct about SGD optimization?
Randomly initializing the parameters will affect the performance.
In reinforcement learning, what is the benefit of using network instead of lookup table?
Generalization
Which way do we usually use to train an autoencoder model?
We usually train the encoder model and the decoder model together
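A minimal sketch of joint training: a single reconstruction loss backpropagates through both the decoder and the encoder (layer sizes and data are placeholders, and the decoder deliberately does not mirror the encoder):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Linear(784, 32), nn.ReLU())
decoder = nn.Linear(32, 784)
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

x = torch.randn(64, 784)       # stand-in batch
recon = decoder(encoder(x))
loss = F.mse_loss(recon, x)    # one loss trains both models together
optimizer.zero_grad()
loss.backward()
optimizer.step()
```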
Which claim is true about attention and self-attention?
Both of the following claims are true: in the sequence-to-sequence model, at different steps, “attention” lets the model “focus” on different parts of the input; and self-attention is usually used to model dependencies between different parts of one sequence (e.g., words in one sentence).
What’s the major purpose of multi-head attention?
Capturing multiple relationships between the words of the input sequence
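A small sketch using PyTorch's nn.MultiheadAttention in self-attention mode; the embedding size, number of heads, and shapes are illustrative. Each head applies its own learned projections, so different heads can capture different relationships between tokens:

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
x = torch.randn(2, 10, 64)   # (batch, sequence length, embedding)
out, attn = mha(x, x, x)     # self-attention: query = key = value
print(out.shape)             # torch.Size([2, 10, 64])
print(attn.shape)            # torch.Size([2, 10, 10]), averaged over heads
```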
In the transformer neural network architecture, the encoder blocks usually share an identical neural network structure.
TRUE
In the transformer neural network architecture, the output of the final encoder block will go to _____.
Every decoder block
In the autoregressive model, the output variable at the current step depends only on the hidden states at all previous steps.
FALSE
In Transformer, how does the decoder use the information (features) from the encoder?
Cross-attention (attention that mixes information from multiple inputs)
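A small sketch of cross-attention with nn.MultiheadAttention: the queries come from the decoder states, while the keys and values come from the encoder output, so every decoder position can read the encoder's features (all shapes are illustrative):

```python
import torch
import torch.nn as nn

cross_attn = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
encoder_out = torch.randn(2, 12, 64)   # encoder features (batch, src_len, dim)
decoder_in = torch.randn(2, 7, 64)     # decoder states  (batch, tgt_len, dim)

out, _ = cross_attn(decoder_in, encoder_out, encoder_out)
print(out.shape)                       # torch.Size([2, 7, 64])
```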
In the policy gradient approach for reinforcement learning, the reward R(τ^n) is computed based on
The cumulative reward over the entire trajectory
In the two major approaches of reinforcement learning, i.e., policy gradient and Q-learning, which is usually more sample-efficient?
Q-learning
In Q-learning, we can use either a Q-table or a neural network to predict the Q-value for a (state, action) pair. Which one is more scalable?
Neural network
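A minimal sketch of a Q-network: a small MLP maps a state to one Q-value per action, so it can generalize to states never stored in a table (the state and action dimensions here are hypothetical):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, state_dim=4, num_actions=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, num_actions),
        )

    def forward(self, state):
        return self.net(state)  # one Q-value per action

q_net = QNetwork()
state = torch.randn(1, 4)
action = q_net(state).argmax(dim=1)  # greedy action from predicted Q-values
```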
In the two major approaches of reinforcement learning, i.e., policy gradient and Q-learning, which one uses randomly sampled transitions instead of entire trajectories?
Q-learning
In the two major approaches of reinforcement learning, i.e., policy gradient and Q-learning, which one is on-policy training?
Policy gradient
For discrete-event modeling, which approach do we often use?
Q-learning