AI Flashcards

1
Q

Given a convolution layer with input channels 3, output channels 64, kernel size 4x4, stride 2, dilation 3, and padding 1, what is the parameter size of this convolution layer?

A

3x64x4x4
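
A quick way to verify the count in PyTorch (a minimal sketch; bias is disabled so only the 3x64x4x4 weight tensor is counted):

```python
import torch.nn as nn

# Layer from the question; stride, dilation, and padding do not add parameters.
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=4,
                 stride=2, dilation=3, padding=1, bias=False)

print(conv.weight.shape)                          # torch.Size([64, 3, 4, 4])
print(sum(p.numel() for p in conv.parameters()))  # 3 * 64 * 4 * 4 = 3072
```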

2
Q

In PyTorch (import torch.nn as nn), which of the following layers downsamples the input to half its size?

A

nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, stride=2, padding=1, dilation=1)
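
A minimal sketch checking the halving on an assumed 32x32 input: with kernel 3, stride 2, padding 1, the spatial size goes from 32 to 16.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3,
                 stride=2, padding=1, dilation=1)

x = torch.randn(1, 3, 32, 32)   # assumed example input: batch 1, 3 channels, 32x32
y = conv(x)
print(y.shape)                  # torch.Size([1, 64, 16, 16])
# Output size = floor((32 + 2*1 - 1*(3-1) - 1) / 2 + 1) = 16, i.e. half of 32.
```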

3
Q

Which of the following statements is true about convolution layers?

A

The convolution layer is a linear operation, and it is often used together with an activation function.

4
Q

In the design of an autoencoder, the encoder and decoder should follow exactly the same structure.

A

FALSE

5
Q

All regularizations (e.g., L1 norm, L2 norm) penalize larger parameters.

A

TRUE

6
Q

When updating parameters using gradient descent, which way of calculating the loss works better (i.e., gives a better trade-off between efficiency and robustness)?

A

Calculate the loss on a mini-batch of data examples in every iteration.
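
A minimal sketch of that pattern (the toy data, model, and batch size are assumed placeholders): each gradient step uses the loss of one mini-batch, not a single example and not the full dataset.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

X, y = torch.randn(1000, 10), torch.randint(0, 2, (1000,))   # assumed toy dataset
model = nn.Linear(10, 2)                                     # assumed placeholder model
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

for xb, yb in loader:                  # one mini-batch per iteration
    optimizer.zero_grad()
    loss = criterion(model(xb), yb)    # loss computed on the mini-batch only
    loss.backward()
    optimizer.step()
```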

7
Q

MaxPooling preserves detected features and downsamples the feature map (image).

A

TRUE
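
For reference, a minimal sketch of a typical 2x2 max-pooling layer halving an assumed 32x32 feature map while keeping the channel (feature) dimension:

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)   # common downsampling configuration
x = torch.randn(1, 64, 32, 32)                 # assumed example feature map
print(pool(x).shape)                           # torch.Size([1, 64, 16, 16])
```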

8
Q

What is the size of receptive field for two stacking dilated convolution layers with kernel size 3x3, stride 1, and dilation 2?

A

9x9
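
The arithmetic behind this answer as a small sketch: a 3x3 kernel with dilation 2 spans 5 pixels, and each additional stride-1 layer adds (span - 1) to the receptive field.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride, dilation) tuples, first layer first."""
    rf, jump = 1, 1
    for k, s, d in layers:
        span = d * (k - 1) + 1        # effective kernel span with dilation
        rf += (span - 1) * jump
        jump *= s
    return rf

# Two stacked 3x3 convs, stride 1, dilation 2:
print(receptive_field([(3, 1, 2), (3, 1, 2)]))   # 9  ->  9x9 receptive field
```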

9
Q

In a CNN, two conv layers cannot be connected directly; a pooling layer must be used between them.

A

FALSE

10
Q

In the design of a CNN, the fully connected layers usually contain many more parameters than the conv layers.

A

TRUE

11
Q

What is the purpose of the ReLU activation function in a CNN?

A

To introduce non-linearity

12
Q

What is the main advantage of using dropout in a CNN?

A

Preventing overfitting

13
Q

In the mini-batch SGD training, an important practice is to shuffle the training data before every epoch. Why?

A

It helps training converge faster and prevents bias from a fixed ordering of the data.
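
In PyTorch this is usually handled by the data loader; a minimal sketch with an assumed toy dataset:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(100, 10), torch.randint(0, 2, (100,)))  # assumed toy data

# shuffle=True re-shuffles the sample order at the start of every epoch,
# so mini-batches are drawn in a different order each time.
loader = DataLoader(dataset, batch_size=32, shuffle=True)
for epoch in range(3):
    for xb, yb in loader:   # ordering differs between epochs
        pass
```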

14
Q

Logistic regression is widely used to solve classification problems by predicting probabilities of discrete (or categorical) values.

A

TRUE
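
A minimal logistic-regression sketch in the same PyTorch style (the toy data is assumed): a linear layer whose output is passed through a sigmoid to give class probabilities.

```python
import torch
import torch.nn as nn

X = torch.randn(100, 5)                     # assumed toy features
y = torch.randint(0, 2, (100,)).float()     # binary labels

model = nn.Linear(5, 1)                     # logistic regression: linear scores + sigmoid
criterion = nn.BCEWithLogitsLoss()          # applies the sigmoid internally for stability
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for _ in range(100):
    optimizer.zero_grad()
    loss = criterion(model(X).squeeze(1), y)
    loss.backward()
    optimizer.step()

probs = torch.sigmoid(model(X)).squeeze(1)  # predicted probabilities of class 1
```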

15
Q

Which of the following statements is true about activation functions in the context of neural networks and backpropagation?

A

Activation functions like ReLU (Rectified Linear Unit) introduce non-linear properties to the model, allowing it to learn complex patterns.

16
Q

Which case is overfitting?

A

Training error is low, but testing error is high

17
Q

What approach could be used to handle overfitting?

A

Use regularization

18
Q

Besides penalizing larger parameters, which regularization makes parameters more sparse?

A

L1 norm
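
A minimal sketch of adding an L1 penalty to the loss by hand (the model, data, and strength lam are assumed placeholders); the absolute-value penalty drives many weights exactly to zero, whereas an L2 penalty only shrinks them.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                                 # assumed placeholder model
criterion = nn.CrossEntropyLoss()
x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))   # assumed toy batch

lam = 1e-3                                               # assumed regularization strength
l1_penalty = sum(p.abs().sum() for p in model.parameters())
loss = criterion(model(x), y) + lam * l1_penalty         # L1-regularized loss
loss.backward()
```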

19
Q

In Backpropagation, which claim is true?

A

The backward pass uses the information preserved during the forward pass to calculate gradients.

20
Q

As an activation function, tanh avoids the vanishing gradient problem.

A

FALSE

21
Q

As an activation function, ReLU solves the vanishing gradient problem.

A

TRUE

22
Q

Regarding SGD optimization, which statement is not correct?

A

Randomly initializing the parameters will affect the performance.

23
Q

In reinforcement learning, what is the benefit of using a neural network instead of a lookup table?

A

Generalization

24
Q

How do we usually train an autoencoder model?

A

We usually train the encoder model and the decoder model together

25
Q

Which claim is true about attention and self-attention?

In the sequence-to-sequence model, at different steps, “attention” lets the model “focus” on different parts of the input.

Self-attention is usually used to model dependencies between different parts of one sequence (e.g., words in one sentence).

Both the above claims.

None of the above claims.

A

Both the above claims.

26
Q

What’s the major purpose of multi-head attention?

A

Capturing multiple relationships between words of the input sequence.
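
A minimal self-attention sketch using PyTorch's built-in module (the sizes are assumed); each of the 8 heads can capture a different relationship within the same sequence. Cross-attention, as in the Transformer decoder, uses the same module but takes the query from the decoder and the key/value from the encoder output.

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)

x = torch.randn(2, 10, 64)                     # assumed: 2 sequences, 10 tokens, dim 64
out, weights = attn(query=x, key=x, value=x)   # self-attention: Q, K, V from one sequence
print(out.shape)                               # torch.Size([2, 10, 64])
print(weights.shape)                           # torch.Size([2, 10, 10]), averaged over heads
```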

27
Q

In the Transformer neural network architecture, the encoder blocks usually use an identical neural network structure.

A

TRUE

28
Q

In the transformer neural network architecture, the output of the final encoder block will go to _____.

A

Every decoder block

29
Q

In an autoregressive model, the output variable at the current step depends only on the hidden states at all previous steps.

A

FALSE

30
Q

In Transformer, how does the decoder use the information (features) from the encoder?

A

Cross-attention (attention that mixes information from multiple inputs).

31
Q

In the policy gradient approach for reinforcement learning, the reward R(τ^n) is computed based on

A

The cumulative reward over the entire trajectory.
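
A small sketch of how that trajectory reward is formed (the reward values and discount factor are assumed): the return of the whole trajectory, not a single step, weights the policy-gradient update.

```python
# Rewards collected along one trajectory tau^n (assumed example values).
rewards = [1.0, 0.0, 0.5, 2.0]
gamma = 0.99                                        # assumed discount factor

# Cumulative (discounted) reward of the entire trajectory.
R = sum(gamma ** t * r for t, r in enumerate(rewards))
print(R)
```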

32
Q

In the two major approaches of reinforcement learning, i.e., policy gradient and Q-learning, which is usually more sample-efficient?

A

Q-learning

33
Q

In Q-learning, we can use either a Q-table or a neural network to predict a Q-value for a pair of (state, action). Which one is more scalable?

A

Neural network
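
A minimal Q-network sketch (state and action dimensions are assumed): one forward pass returns a Q-value for every action, and the shared weights generalize across states that a Q-table would have to enumerate one by one.

```python
import torch
import torch.nn as nn

state_dim, n_actions = 4, 2            # assumed environment sizes

q_net = nn.Sequential(
    nn.Linear(state_dim, 64),
    nn.ReLU(),
    nn.Linear(64, n_actions),          # one Q-value per action
)

state = torch.randn(1, state_dim)
q_values = q_net(state)                # Q(s, a) for every action a
action = q_values.argmax(dim=1)        # greedy action selection
```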

34
Q

In the two major approaches of reinforcement learning, i.e., policy gradient and Q-learning, which one uses randomly sampled transitions instead of the entire trajectories?

A

Q-learning

35
Q

In the two major approaches of reinforcement learning, i.e., policy gradient and Q-learning, which one is on-policy training?

A

Policy gradient

36
Q

For Discrete-event modeling, what approach do we often use?

A

Q-learning

37
Q

What approach do we often use for Discrete-event modeling?

A

Q-learning

38
Q

In information theory, which event carries more information?

A

The occurrence of a low-probability event.
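
The self-information of an event with probability p is -log p, so a rarer event carries more bits; a quick sketch:

```python
import math

def self_information(p):
    return -math.log2(p)       # information content in bits

print(self_information(0.5))   # 1.0 bit
print(self_information(0.01))  # ~6.64 bits: the low-probability event carries more information
```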

39
Q

What is the correct statement about activation functions in neural networks and backpropagation?

A

Activation functions introduce non-linear properties to the model, allowing it to learn complex patterns.

40
Q

What is a common sign of overfitting?

A

Low training error, high testing error

41
Q

What technique can help to mitigate overfitting?

A

Regularization

42
Q

Do all regularizations penalize larger parameters?

A

TRUE

43
Q

What is the primary purpose of using dropout in a CNN?

A

Preventing overfitting

44
Q

What is the purpose of the ReLU activation function in a CNN?

A

To introduce non-linearity

45
Q

Which statement about convolution layers is true?

A

Convolution layers are linear and often used with activation functions.

46
Q

What is the parameter size of a convolution layer with input channels 3, output channels 64, kernel size 4x4, stride 2, dilation 3, and padding 1?

A

3x64x4x4

47
Q

Which layer in PyTorch downsamples the input to half its size?

A

nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, stride=2, padding=1, dilation=1)

48
Q

Does MaxPooling preserve detected features and downsample the feature map?

A

TRUE

49
Q

Do two convolution layers in a CNN always need a pooling layer in between them?

A

FALSE

50
Q

Does a fully connected layer in a CNN usually contain more parameters than the convolution layers?

A

TRUE
