11 Intro to NN Flashcards
What is the difference between a batch and an epoch in neural‑network training?
A batch is a subset of training samples used for one gradient update; one epoch is completed after every sample in the full training set has been used once.
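A minimal sketch of how batches nest inside epochs (NumPy only; the commented train_step is a hypothetical placeholder for the forward pass, loss, backprop, and update):

```python
import numpy as np

num_samples, batch_size, num_epochs = 1000, 32, 5
X = np.random.rand(num_samples, 20)             # toy feature matrix
y = np.random.randint(0, 2, size=num_samples)   # toy labels

for epoch in range(num_epochs):                  # one epoch = one full pass over the data
    indices = np.random.permutation(num_samples)
    for start in range(0, num_samples, batch_size):
        batch_idx = indices[start:start + batch_size]
        X_batch, y_batch = X[batch_idx], y[batch_idx]  # one batch = one gradient update
        # train_step(X_batch, y_batch)  # hypothetical: forward, loss, backward, weight update
```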
Why does ReLU help mitigate the vanishing‑gradient problem for deep nets?
Its derivative is exactly 1 for positive inputs, so gradients passing through active ReLU units are not scaled down; sigmoid and tanh have derivatives below 1 almost everywhere, so their gradients shrink multiplicatively across many layers.
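A quick NumPy illustration of the ReLU derivative (a sketch, not tied to any framework):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # derivative is 1 where x > 0 and 0 elsewhere, so positive activations pass gradients unchanged
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))       # [0.  0.  0.  0.5 3. ]
print(relu_grad(x))  # [0. 0. 0. 1. 1.]
```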
Give the formula for a residual block in ResNet.
Output = F(x) + x, where F(x) is the learned residual mapping (e.g., two Conv–BN–ReLU layers).
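A minimal Keras-style sketch of such a block (assumes tf.keras; the exact layer choices are illustrative):

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters=64):
    # F(x): two Conv-BN-ReLU layers; padding='same' keeps spatial shape so the add is valid.
    # Assumes x already has `filters` channels so shapes match for the skip connection.
    fx = layers.Conv2D(filters, 3, padding="same")(x)
    fx = layers.BatchNormalization()(fx)
    fx = layers.ReLU()(fx)
    fx = layers.Conv2D(filters, 3, padding="same")(fx)
    fx = layers.BatchNormalization()(fx)
    out = layers.Add()([fx, x])   # output = F(x) + x
    return layers.ReLU()(out)
```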
In one sentence, why do skip connections improve gradient flow?
They provide a direct path with derivative 1, so gradients cannot vanish even if ∂F/∂x is small.
What is the intuition behind learning a residual instead of the full mapping?
If the desired mapping is close to identity, the network only needs to learn small differences (the residual), which is easier to optimize.
List two common loss functions for classification tasks in neural nets.
Binary cross‑entropy (binary or multi‑label outputs) and categorical cross‑entropy (mutually exclusive multiclass outputs).
True/False: Dropout is typically inserted immediately after convolutional layers.
False – it is most often applied after dense (fully connected) layers; conv layers rely more on BatchNorm.
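A typical placement in a small Keras-style classifier (a sketch, assuming tf.keras; layer sizes are illustrative):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.BatchNormalization(),           # conv stage regularized mainly by BatchNorm
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),                   # dropout placed after the dense layer
    layers.Dense(10, activation="softmax"),
])
```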
What does a 3×3 filter with stride 2 and ‘valid’ padding do to an input of size 32×32?
Produces a feature map of size 15×15: ⌊(32 − 3)/2⌋ + 1 = 14 + 1 = 15.
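The general 'valid'-padding formula as a small helper (floor division mirrors how frameworks discard the incomplete final window):

```python
def conv_output_size(input_size, kernel_size, stride):
    # 'valid' padding: floor((input - kernel) / stride) + 1
    return (input_size - kernel_size) // stride + 1

print(conv_output_size(32, 3, 2))  # 15
```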
Purpose of max pooling in CNNs?
Downsample feature maps while retaining the strongest activations, adding translation invariance and reducing computation.
Why does data augmentation reduce overfitting?
It shows the model label‑preserving variations of the same data, forcing it to learn invariant features rather than memorizing exact examples.
Softmax vs. sigmoid: when do you use each?
Use sigmoid for independent binary outputs; use softmax when classes are mutually exclusive and probabilities must sum to 1.
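A NumPy sketch contrasting the two on the same logits:

```python
import numpy as np

logits = np.array([2.0, 1.0, 0.1])

sigmoid = 1.0 / (1.0 + np.exp(-logits))          # independent probabilities, need not sum to 1
softmax = np.exp(logits) / np.exp(logits).sum()  # mutually exclusive classes, sums to 1

print(sigmoid, sigmoid.sum())   # approx [0.88 0.73 0.53] -> sum ~2.14
print(softmax, softmax.sum())   # approx [0.66 0.24 0.10] -> sum = 1.0
```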
Define an embedding layer in one sentence.
A trainable lookup table that maps discrete tokens (e.g., words or categories) to dense, low‑dimensional vectors.
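A minimal Keras-style example (vocabulary size and embedding dimension are illustrative):

```python
import tensorflow as tf
from tensorflow.keras import layers

embedding = layers.Embedding(input_dim=10_000, output_dim=64)  # 10k token ids -> 64-d vectors
token_ids = tf.constant([[3, 17, 256, 0]])                     # a batch of one 4-token sequence
vectors = embedding(token_ids)
print(vectors.shape)  # (1, 4, 64): each token id is looked up as a trainable 64-d vector
```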
What tensor shape represents a batch of 64 RGB images at 128×128 resolution?
(64, 128, 128, 3) in the channels‑last (NHWC) convention used by TensorFlow/Keras; channels‑first frameworks such as PyTorch would use (64, 3, 128, 128).
Give two operations that constitute data augmentation for images.
Examples: random horizontal flip; random rotation; random zoom; random translation; brightness shift (any two).
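Two of these as Keras preprocessing layers (a sketch, assuming tf.keras ≥ 2.6 where these layers live under tf.keras.layers):

```python
import tensorflow as tf
from tensorflow.keras import layers

augment = tf.keras.Sequential([
    layers.RandomFlip("horizontal"),   # random horizontal flip
    layers.RandomRotation(0.1),        # random rotation up to ±10% of a full circle (~36°)
])

images = tf.random.uniform((8, 128, 128, 3))   # a toy batch
augmented = augment(images, training=True)     # augmentation is active only in training mode
```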
Key advantage of Inception modules over plain VGG‑style stacking.
Parallel convolutions of multiple sizes let the network capture multi‑scale features without greatly increasing depth or parameters.
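A stripped-down sketch of the parallel-branch idea (not the full Inception module, which also uses 1×1 reductions before the larger convolutions; assumes tf.keras):

```python
import tensorflow as tf
from tensorflow.keras import layers

def mini_inception_block(x):
    # parallel convolutions at multiple scales, concatenated along the channel axis
    b1 = layers.Conv2D(32, 1, padding="same", activation="relu")(x)   # 1x1 branch
    b3 = layers.Conv2D(32, 3, padding="same", activation="relu")(x)   # 3x3 branch
    b5 = layers.Conv2D(32, 5, padding="same", activation="relu")(x)   # 5x5 branch
    pool = layers.MaxPooling2D(3, strides=1, padding="same")(x)       # pooling branch
    return layers.Concatenate()([b1, b3, b5, pool])
```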
What is the main reason residual networks can be trained to 100+ layers while vanilla CNNs struggle?
Skip connections keep gradients alive, preventing vanishing and allowing very deep optimization.
State the forward and backward steps of backpropagation in two bullet points.
Forward: propagate the input through each layer to compute activations and the loss; Backward: apply the chain rule layer by layer to compute the loss gradient for every weight, which the optimizer then uses to update the weights.
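A minimal sketch of the two phases using tf.GradientTape (assumes tf.keras; the optimizer step is the weight update that follows the backward pass):

```python
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
loss_fn = tf.keras.losses.MeanSquaredError()

x = tf.random.normal((16, 4))
y = tf.random.normal((16, 1))

with tf.GradientTape() as tape:
    predictions = model(x)            # forward: layer outputs
    loss = loss_fn(y, predictions)    # forward: loss
grads = tape.gradient(loss, model.trainable_variables)             # backward: chain rule
optimizer.apply_gradients(zip(grads, model.trainable_variables))   # weight update
```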
Which activation is symmetric around zero and often used in shallow regression nets?
Tanh (hyperbolic tangent).