AI Flashcards
Given a convolution layer with 3 input channels, 64 output channels, kernel size 4x4, stride 2, dilation 3, and padding 1, what is the parameter size of this convolution layer?
3x64x4x4
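A minimal PyTorch sketch to double-check the count; note that stride, dilation, and padding do not change the number of weights (the bias, if enabled, would add another 64 parameters):

```python
import torch.nn as nn

# Conv layer from the card: 3 input channels, 64 output channels, 4x4 kernel.
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=4,
                 stride=2, dilation=3, padding=1)
print(conv.weight.shape)    # torch.Size([64, 3, 4, 4])
print(conv.weight.numel())  # 3 * 64 * 4 * 4 = 3072
```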
In PyTorch (import torch.nn as nn), which of the following layers downsamples the input size to half?
nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, stride=2, padding=1, dilation=1)
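A quick sketch verifying the halving on a hypothetical 32x32 input:

```python
import torch
import torch.nn as nn

# A stride-2 conv with kernel 3 and padding 1 halves the spatial dimensions.
layer = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3,
                  stride=2, padding=1, dilation=1)
x = torch.randn(1, 3, 32, 32)
print(layer(x).shape)  # torch.Size([1, 64, 16, 16])
```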
Which of the following statements is true about the convolution layer?
The convolution layer is linear, and it is often used together with an activation function.
In the design of an autoencoder, the encoder and decoder should follow the exact same structure.
FALSE
All regularizations (e.g., L1 norm, L2 norm) penalize larger parameters.
TRUE
When updating parameters using gradient descent, which way of calculating the loss works better (i.e., gives a better trade-off between efficiency and robustness)?
Calculate the loss for a mini-batch of data examples in every iteration
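A minimal sketch of this practice with a toy dataset and a hypothetical linear model; a DataLoader yields mini-batches, and each iteration computes the loss on one batch before updating the parameters:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy regression data and model (shapes are illustrative only).
dataset = TensorDataset(torch.randn(1000, 10), torch.randn(1000, 1))
loader = DataLoader(dataset, batch_size=32, shuffle=True)
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for xb, yb in loader:  # one parameter update per mini-batch
    loss = torch.nn.functional.mse_loss(model(xb), yb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```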
MaxPooling preserves detected features and downsamples the feature map (image).
TRUE
What is the size of the receptive field for two stacked dilated convolution layers with kernel size 3x3, stride 1, and dilation 2?
9x9
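A quick worked check, assuming the standard receptive-field recurrence (effective kernel size = dilation x (kernel - 1) + 1; with stride 1 the step between layers stays 1):

```python
# Two stacked 3x3 convs, stride 1, dilation 2.
k, dilation, stride = 3, 2, 1
effective_k = dilation * (k - 1) + 1   # 5

rf = 1
for _ in range(2):                     # two stacked layers
    rf += (effective_k - 1) * stride
print(rf)                              # 9 -> receptive field is 9x9
```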
In a CNN, two conv layers cannot be connected directly; we must use a pooling layer in between.
FALSE
In the design of a CNN, the fully connected layers usually contain many more parameters than the conv layers.
TRUE
What is the purpose of the ReLU activation function in a CNN?
To introduce non-linearity
What is the main advantage of using dropout in a CNN?
Preventing overfitting
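A small sketch showing dropout's behavior in training versus evaluation mode (p = 0.5 is just an illustrative value):

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()
print(drop(x))  # roughly half the entries zeroed, survivors scaled by 1/(1-p)

drop.eval()
print(drop(x))  # identity at evaluation time
```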
In mini-batch SGD training, an important practice is to shuffle the training data before every epoch. Why?
It helps the training converge faster and prevents bias.
Logistic regression is widely used to solve classification problems by predicting probabilities of discrete (or categorical) values.
TRUE
Which of the following statements is true about activation functions in the context of neural networks and backpropagation?
Activation functions like ReLU (Rectified Linear Unit) introduce non-linear properties to the model, allowing it to learn complex patterns.
Which case is overfitting?
Training error is low, but testing error is high
What approach could be used to handle overfitting?
Use regularization
Besides penalizing larger parameters, which regularization makes parameters more sparse?
L1 norm
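A minimal sketch of adding an L1 penalty to a toy model's loss; the model, data, and the penalty weight 1e-3 are hypothetical. The absolute-value penalty pushes weights toward exactly zero, which is what makes the parameters sparse:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)

mse = nn.functional.mse_loss(model(x), y)
l1_lambda = 1e-3  # hypothetical regularization strength
l1_penalty = sum(p.abs().sum() for p in model.parameters())
loss = mse + l1_lambda * l1_penalty
loss.backward()
```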
In backpropagation, which claim is true?
The backward pass uses the information preserved in the forward pass to calculate gradients
As an activation function, tanh avoids the vanishing gradient problem.
FALSE
As an activation function, ReLU solves the vanishing gradient problem.
TRUE
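A small sketch comparing the gradients of tanh and ReLU on a few sample inputs; tanh saturates for large |x| (gradient near zero), while ReLU keeps a gradient of 1 for all positive inputs:

```python
import torch

x = torch.tensor([-5.0, -1.0, 1.0, 5.0], requires_grad=True)
torch.tanh(x).sum().backward()
print(x.grad)  # gradients ~0.0002 at |x| = 5 (saturation)

x = torch.tensor([-5.0, -1.0, 1.0, 5.0], requires_grad=True)
torch.relu(x).sum().backward()
print(x.grad)  # 0 for negative inputs, 1 for positive inputs
```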
Which of the following is not correct about SGD optimization?
Randomly initializing the parameters will affect the performance.
In reinforcement learning, what is the benefit of using network instead of lookup table?
Generalization
Which way do we usually use to train an autoencoder model?
We usually train the encoder model and the decoder model together
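A minimal sketch of joint training: a single reconstruction loss backpropagates through both the decoder and the encoder (layer sizes and data are placeholders, and the decoder deliberately does not mirror the encoder):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Linear(784, 32), nn.ReLU())
decoder = nn.Linear(32, 784)
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

x = torch.randn(64, 784)       # stand-in batch
recon = decoder(encoder(x))
loss = F.mse_loss(recon, x)    # one loss trains both models together
optimizer.zero_grad()
loss.backward()
optimizer.step()
```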
Which claim is true about attention and self-attention?
Both of the following claims are true: in the sequence-to-sequence model, at different steps, “attention” lets the model “focus” on different parts of the input; and self-attention is usually used to model dependencies between different parts of one sequence (e.g., words in one sentence).
What’s the major purpose of multi-head attention?
Capturing multiple relationships between the words of the input sequence
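A small sketch using PyTorch's nn.MultiheadAttention in self-attention mode; the embedding size, number of heads, and shapes are illustrative. Each head applies its own learned projections, so different heads can capture different relationships between tokens:

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
x = torch.randn(2, 10, 64)   # (batch, sequence length, embedding)
out, attn = mha(x, x, x)     # self-attention: query = key = value
print(out.shape)             # torch.Size([2, 10, 64])
print(attn.shape)            # torch.Size([2, 10, 10]), averaged over heads
```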
In the transformer neural network architecture, the encoder blocks usually share an identical neural network structure.
TRUE
In the transformer neural network architecture, the output of the final encoder block will go to _____.
Every decoder block
In the autoregressive model, the output variable at the current step depends only on the hidden states at all previous steps.
FALSE
In Transformer, how does the decoder use the information (features) from the encoder?
Cross-attention (attention that mixes information from multiple inputs)
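A small sketch of cross-attention with nn.MultiheadAttention: the queries come from the decoder states, while the keys and values come from the encoder output, so every decoder position can read the encoder's features (all shapes are illustrative):

```python
import torch
import torch.nn as nn

cross_attn = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
encoder_out = torch.randn(2, 12, 64)   # encoder features (batch, src_len, dim)
decoder_in = torch.randn(2, 7, 64)     # decoder states  (batch, tgt_len, dim)

out, _ = cross_attn(decoder_in, encoder_out, encoder_out)
print(out.shape)                       # torch.Size([2, 7, 64])
```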
In the policy gradient approach for reinforcement learning, the reward R(τ^n) is computed based on
The cumulative reward over the entire trajectory
In the two major approaches of reinforcement learning, i.e., policy gradient and Q-learning, which is usually more sample-efficient?
Q-learning
In Q-learning, we can use either a Q-table or a neural network to predict the Q-value for a (state, action) pair. Which one is more scalable?
Neural network
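A minimal sketch of a Q-network: a small MLP maps a state to one Q-value per action, so it can generalize to states never stored in a table (the state and action dimensions here are hypothetical):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, state_dim=4, num_actions=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, num_actions),
        )

    def forward(self, state):
        return self.net(state)  # one Q-value per action

q_net = QNetwork()
state = torch.randn(1, 4)
action = q_net(state).argmax(dim=1)  # greedy action from predicted Q-values
```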
In the two major approaches of reinforcement learning, i.e., policy gradient and Q-learning, which one uses randomly sampled transitions instead of entire trajectories?
Q-learning
In the two major approaches of reinforcement learning, i.e., policy gradient and Q-learning, which one is on-policy training?
Policy gradient
For discrete-event modeling, which approach do we often use?
Q-learning