11 Intro to NN Flashcards

1
Q

What is the difference between a batch and an epoch in neural‑network training?

A

A batch is a subset of training samples used for one gradient update; one epoch is completed after every sample in the full training set has been used once.

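A minimal NumPy sketch of the relationship (the sample count, batch size, and epoch count are made up for illustration): one epoch is one full pass over the shuffled set, carried out as a sequence of per-batch updates.

```python
import numpy as np

# Toy setup (made-up numbers): 1,000 samples with batch size 32 gives
# ceil(1000 / 32) = 32 gradient updates per epoch; one epoch is complete
# once every sample has been used exactly once.
X = np.random.rand(1000, 20)
batch_size = 32

for epoch in range(3):                           # 3 epochs = 3 full passes
    order = np.random.permutation(len(X))        # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = X[order[start:start + batch_size]]
        # ... forward pass, loss, and one gradient update per batch ...
    print(f"epoch {epoch}: {int(np.ceil(len(X) / batch_size))} updates")
```
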
2
Q

Why does ReLU help mitigate the vanishing‑gradient problem for deep nets?

A

Its derivative is exactly 1 for positive inputs, so gradients passing through active ReLU units are not repeatedly scaled down, unlike sigmoid or tanh, whose derivatives are always less than 1.

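A quick NumPy illustration with toy values: the ReLU derivative is 1 wherever the input is positive and 0 elsewhere.

```python
import numpy as np

# ReLU and its derivative: the slope is exactly 1 wherever the input is
# positive, so the gradient is passed through unchanged there.
def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.3, 4.0])
print(relu(x))       # [0.  0.  0.3 4. ]
print(relu_grad(x))  # [0. 0. 1. 1.]  -- 1 for positive inputs, 0 otherwise
```
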
3
Q

Give the formula for a residual block in ResNet.

A

Output = F(x) + x, where F(x) is the learned residual mapping (e.g., two Conv–BN–ReLU layers).

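One possible tf.keras sketch of such a block, assuming the input already has `filters` channels so the identity shortcut can be added without a 1×1 projection:

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters=64):
    """Two Conv-BN-ReLU stages form F(x); the identity shortcut is then
    added back, giving y = F(x) + x."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([y, shortcut])          # F(x) + x
    return layers.ReLU()(y)

inputs = tf.keras.Input(shape=(32, 32, 64))  # channel count matches `filters`
outputs = residual_block(inputs)             # same shape as the input
```
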
4
Q

In one sentence, why do skip connections improve gradient flow?

A

They provide a direct path with derivative 1, so gradients cannot vanish even if ∂F/∂x is small.

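A toy finite-difference check (assumed example, not from the deck) that the identity path contributes a constant 1 to the gradient:

```python
import numpy as np

# For y = F(x) + x, dy/dx = dF/dx + 1: even when the residual branch's own
# gradient is nearly zero, the identity path contributes a constant 1.
def F(x, w=1e-3):
    return w * x                     # a "weak" residual with a tiny gradient

x, eps = 2.0, 1e-6
dy_dx = ((F(x + eps) + (x + eps)) - (F(x) + x)) / eps   # finite difference
print(round(dy_dx, 3))               # ~1.001
```
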
5
Q

What is the intuition behind learning a residual instead of the full mapping?

A

If the desired mapping is close to identity, the network only needs to learn small differences (the residual), which is easier to optimize.

6
Q

List two common loss functions for classification tasks in neural nets.

A

Binary cross‑entropy (binary or multi‑label targets) and categorical cross‑entropy (mutually exclusive multiclass targets).

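Hand-rolled single-example versions of both losses as a NumPy sketch (the clipping constant is just a safeguard against log(0)):

```python
import numpy as np

def binary_cross_entropy(y_true, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)                 # avoid log(0)
    return -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def categorical_cross_entropy(y_onehot, probs, eps=1e-12):
    return -np.sum(y_onehot * np.log(np.clip(probs, eps, 1.0)))

print(binary_cross_entropy(1, 0.9))                           # ~0.105
print(categorical_cross_entropy(np.array([0, 1, 0]),
                                np.array([0.1, 0.7, 0.2])))   # ~0.357
```
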
7
Q

True/False: Dropout is typically inserted immediately after convolutional layers.

A

False – it is most often applied after dense (fully connected) layers; conv layers rely more on BatchNorm.

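An illustrative tf.keras layout (layer sizes are arbitrary) with BatchNorm in the convolutional stack and Dropout only after the dense layer:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.BatchNormalization(),      # normalization/regularization for conv
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),              # dropout after the fully connected layer
    layers.Dense(10, activation="softmax"),
])
```
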
8
Q

What does a 3×3 filter with stride 2 and ‘valid’ padding do to an input of size 32×32?

A

Produces a feature map of size 15×15: ⌊(32 − 3)/2⌋ + 1 = 15.

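The same arithmetic as a tiny helper (standard 'valid'-padding formula; the function name is made up):

```python
import math

# 'valid' padding output size: floor((n - k) / stride) + 1
def conv_output_size(n, k, stride):
    return math.floor((n - k) / stride) + 1

print(conv_output_size(32, 3, 2))   # 15 -> a 15x15 feature map
```
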
9
Q

Purpose of max pooling in CNNs?

A

Downsample feature maps while retaining the strongest activations, adding translation invariance and reducing computation.

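A toy 2×2, stride-2 max-pool on a 4×4 map, done with a NumPy reshape trick (made-up values):

```python
import numpy as np

# 2x2 max pooling with stride 2 on a 4x4 map: each output cell keeps only
# the strongest activation in its window, halving both dimensions.
x = np.array([[1, 3, 2, 0],
              [4, 2, 1, 5],
              [0, 1, 3, 2],
              [2, 2, 0, 1]], dtype=float)

pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)   # [[4. 5.]
                #  [2. 3.]]
```
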
10
Q

Why does data augmentation reduce overfitting?

A

It shows the model label‑preserving variations of the same data, forcing it to learn invariant features rather than memorizing exact examples.

11
Q

Softmax vs. sigmoid: when do you use each?

A

Use sigmoid for independent binary outputs; use softmax when classes are mutually exclusive and probabilities must sum to 1.

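A compact NumPy comparison on toy logits showing the behavioural difference:

```python
import numpy as np

def sigmoid(z):                        # independent per-output probabilities
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):                        # mutually exclusive classes
    e = np.exp(z - np.max(z))          # shift for numerical stability
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
print(sigmoid(z))                 # each value in (0, 1); the sum can exceed 1
print(softmax(z).sum())           # 1.0 -- probabilities sum to one
```
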
12
Q

Define an embedding layer in one sentence.

A

A trainable lookup table that maps discrete tokens (e.g., words or categories) to dense, low‑dimensional vectors.

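A bare-bones NumPy sketch of the lookup-table view (vocabulary size, embedding dimension, and token ids are all made up):

```python
import numpy as np

# The lookup-table view: a (vocab_size, embed_dim) weight matrix whose rows
# are the dense vectors; the "forward pass" is plain row indexing.
vocab_size, embed_dim = 10, 4
table = np.random.randn(vocab_size, embed_dim) * 0.01   # trainable weights

token_ids = np.array([3, 7, 3])      # made-up token indices
vectors = table[token_ids]
print(vectors.shape)                 # (3, 4)
```
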
13
Q

What tensor shape represents a batch of 64 RGB images at 128×128 resolution?

A

(64, 128, 128, 3) in channels‑last (NHWC) layout; in channels‑first (NCHW) layout it would be (64, 3, 128, 128).

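A quick NumPy check of that channels-last shape next to its channels-first counterpart:

```python
import numpy as np

batch_nhwc = np.zeros((64, 128, 128, 3))   # (batch, height, width, channels)
batch_nchw = np.zeros((64, 3, 128, 128))   # (batch, channels, height, width)
print(batch_nhwc.shape, batch_nchw.shape)
```
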
14
Q

Give two operations that constitute data augmentation for images.

A

Examples: random horizontal flip; random rotation; random zoom; random translation; brightness shift (any two).

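One way to wire two of these as tf.keras preprocessing layers (the parameter values are arbitrary):

```python
import tensorflow as tf
from tensorflow.keras import layers

augment = tf.keras.Sequential([
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.1),        # up to +/-10% of a full turn
])

images = tf.random.uniform((8, 128, 128, 3))   # made-up batch
augmented = augment(images, training=True)     # random only in training mode
print(augmented.shape)                         # (8, 128, 128, 3)
```
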
15
Q

Key advantage of Inception modules over plain VGG‑style stacking.

A

Parallel convolutions of multiple sizes let the network capture multi‑scale features without greatly increasing depth or parameters.

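A simplified Inception-style module as a tf.keras sketch (the real GoogLeNet block also puts 1×1 bottlenecks before the larger convolutions; the filter counts here are arbitrary):

```python
import tensorflow as tf
from tensorflow.keras import layers

def inception_like_block(x, filters=32):
    """Parallel 1x1, 3x3 and 5x5 convolutions plus pooling, concatenated
    along the channel axis so several receptive-field sizes coexist."""
    b1 = layers.Conv2D(filters, 1, padding="same", activation="relu")(x)
    b2 = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(filters, 5, padding="same", activation="relu")(x)
    b4 = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    return layers.Concatenate()([b1, b2, b3, b4])

inputs = tf.keras.Input(shape=(32, 32, 16))
outputs = inception_like_block(inputs)   # channels: 3*32 + 16 = 112
```
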
16
Q

What is the main reason residual networks can be trained to 100+ layers while vanilla CNNs struggle?

A

Skip connections keep gradients flowing through the identity paths, preventing them from vanishing and making very deep stacks practical to optimize.

17
Q

State the forward and backward steps of backpropagation in two bullet points.

A

Forward: compute each layer's outputs and the loss.
Backward: apply the chain rule to propagate gradients from the loss back through the layers, then update the weights.
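Both steps for a one-weight linear model with squared error, as a toy NumPy sketch (all numbers are made up):

```python
import numpy as np

x, y_true, w, lr = 1.5, 3.0, 0.5, 0.1

# Forward: compute the output and the loss
y_pred = w * x
loss = (y_pred - y_true) ** 2

# Backward: chain rule dL/dw = dL/dy_pred * dy_pred/dw, then one update
grad_w = 2 * (y_pred - y_true) * x
w -= lr * grad_w
print(loss, w)    # loss before the step, weight after the step
```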

18
Q

Which activation is symmetric around zero and often used in shallow regression nets?

A

Tanh (hyperbolic tangent).
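
A quick numeric check of the zero-centred symmetry (tanh is an odd function):

```python
import numpy as np

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(np.tanh(z))                              # roughly [-0.96 -0.46 0 0.46 0.96]
print(np.allclose(np.tanh(-z), -np.tanh(z)))   # True -- symmetric around zero
```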