Slides with ChatGPT Flashcards

1
Q

What is regularization in machine learning?

A

Regularization consists of strategies explicitly designed to reduce test (generalization) error, potentially at the expense of increased training error. Its goal is to move a model out of the overfitting regime so that its capacity matches the complexity of the data.

2
Q

List some common regularization techniques used in deep learning.

A

Common techniques include parameter norm penalties, dataset augmentation, ensemble methods, increasing noise robustness, semi-supervised learning, multitask learning, early stopping, parameter tying and sharing, dropout, and adversarial learning.

3
Q

What is a parameter norm penalty, and how is it used in regularization?

A

Parameter norm penalties limit model capacity by adding a penalty term Ω(θ) to the objective function J, giving a regularized objective J(θ) + αΩ(θ), where α controls the penalty strength. The penalty is typically applied only to the weights w ⊂ θ (excluding biases) to reduce overfitting.

4
Q

Define L1 norm penalty and its application in regularization.

A

The L1 norm penalty is defined as Ω(θ) = ‖w‖₁ = ∑ᵢ |wᵢ|. This approach, also known as LASSO, encourages sparsity in the model parameters.

5
Q

Define L2 norm penalty and its application in regularization.

A

The L2 norm penalty is defined as Ω(θ) = ‖w‖₂² = wᵀw. Also known as weight decay or ridge regression, it shrinks weights toward small values without driving them exactly to zero.

6
Q

What is dataset augmentation, and why is it useful?

A

Dataset augmentation involves generating additional synthetic data to enhance the training set. Examples include image transformations (rotation, flips), noise addition in audio, and synonym replacement in text, helping improve model generalization.
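
A minimal sketch of two such augmentations, assuming NumPy arrays standing in for an image and an audio clip (the arrays and noise level are illustrative):

import numpy as np

image = np.random.rand(32, 32, 3)                        # hypothetical RGB image
audio = np.random.rand(16000)                            # hypothetical 1-second audio clip

flipped = np.fliplr(image)                               # horizontal flip of the image
noisy = audio + 0.005 * np.random.randn(audio.shape[0])  # additive Gaussian noise

Both transformed samples keep the original label, so they can be added to the training set.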

7
Q

Explain ensemble methods and their role in reducing model error.

A

Ensemble methods combine multiple learners to reduce error by averaging predictions. Techniques like bagging, boosting, and stacking are used to generate diverse models that reduce the variance in predictions.

8
Q

Describe dropout as a regularization technique.

A

Dropout is a technique where units in a neural network are randomly β€œdropped” (set to zero) during training, forcing the network to learn redundant representations. This can be thought of as training an ensemble of subnetworks.
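
A minimal sketch of (inverted) dropout on a hidden activation vector, assuming NumPy; dividing by 1 − p keeps the expected activation unchanged, so no rescaling is needed at test time:

import numpy as np

p = 0.5                                     # probability of dropping a unit
h = np.random.rand(128)                     # hypothetical hidden activations

mask = np.random.rand(h.shape[0]) >= p      # keep each unit with probability 1 - p
h_dropped = h * mask / (1.0 - p)            # dropped units become 0, kept units are rescaled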

9
Q

What is the early stopping technique in model training?

A

Early stopping involves monitoring validation error during training and halting training when validation error no longer improves, preventing overfitting to the training data.
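
A minimal sketch of patience-based early stopping; train_epoch and validation_error are hypothetical stand-ins for a real training pass and held-out evaluation:

import random

def train_epoch():            # stand-in: one pass over the training data
    pass

def validation_error():       # stand-in: error measured on a validation set
    return random.random()

best_error = float("inf")
patience, stale = 5, 0
for _ in range(100):
    train_epoch()
    error = validation_error()
    if error < best_error:
        best_error, stale = error, 0    # validation improved: reset the counter
    else:
        stale += 1
        if stale >= patience:           # no improvement for `patience` epochs
            break

In practice the parameters from the epoch with the lowest validation error are kept.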

10
Q

Define parameter tying and parameter sharing.

A

Parameter tying keeps parameters close by adding a penalty to the loss, while parameter sharing enforces identical parameters, common in CNNs, to reduce memory usage and improve efficiency.

11
Q

Explain the difference between deterministic and stochastic gradient descent.

A

Deterministic gradient descent uses the whole dataset for each update, while stochastic gradient descent (SGD) updates weights based on individual data points or small minibatches, adding noise to the learning process.

12
Q

What is momentum in optimization, and why is it used?

A

Momentum is a technique that accelerates gradient descent by maintaining a moving average of past gradients, helping the model navigate along consistent gradient directions and dampen oscillations.
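
A minimal sketch of the classical momentum update, assuming NumPy and a toy gradient of the quadratic loss ½‖w‖²:

import numpy as np

def grad(w):                        # stand-in gradient of ½‖w‖²
    return w

w = np.array([2.0, -3.0])
v = np.zeros_like(w)                # velocity: decaying average of past gradients
lr, beta = 0.1, 0.9                 # learning rate and momentum coefficient

for _ in range(100):
    v = beta * v - lr * grad(w)     # accumulate gradient history
    w = w + v                       # step along the accumulated direction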

13
Q

Describe the Adam optimizer.

A

Adam (Adaptive Moments) is an optimization algorithm that adjusts learning rates based on the moments (mean and uncentered variance) of the gradients, providing more efficient convergence.
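
A minimal sketch of the Adam update on the same toy quadratic problem as the momentum sketch above, assuming NumPy; the bias-corrected first and second moments give each parameter its own effective step size:

import numpy as np

def grad(w):                                     # stand-in gradient of ½‖w‖²
    return w

w = np.array([2.0, -3.0])
m = np.zeros_like(w)                             # first moment (mean of gradients)
v = np.zeros_like(w)                             # second moment (uncentered variance)
lr, b1, b2, eps = 0.01, 0.9, 0.999, 1e-8

for t in range(1, 1001):
    g = grad(w)
    m = b1 * m + (1 - b1) * g                    # update biased first moment
    v = b2 * v + (1 - b2) * g * g                # update biased second moment
    m_hat = m / (1 - b1 ** t)                    # bias correction for the early steps
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter adaptive step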

14
Q

What are Convolutional Neural Networks (CNNs) designed for?

A

CNNs are designed for data with grid-like structure, such as images (2D grids of pixels) and time-series data (1D grids of samples). They use convolution in place of general matrix multiplication in certain layers.

15
Q

What is a convolution operation in CNNs?

A

Convolution in CNNs involves sliding a filter (kernel) over the input to compute feature maps, capturing spatial hierarchies in data by leveraging sparse interactions, parameter sharing, and equivariance.

16
Q

Define pooling in the context of CNNs.

A

Pooling reduces the spatial dimensions of feature maps, typically by taking the max or average in each neighborhood. This introduces translation invariance and reduces the number of parameters in the network.

17
Q

What are Recurrent Neural Networks (RNNs) used for?

A

RNNs are designed for sequential data and allow parameter sharing over time, making them effective for tasks like language modeling and time-series forecasting.

18
Q

Explain the concept of unfolding in RNNs.

A

Unfolding is the process of representing the computational graph of an RNN across time steps, allowing the network to learn temporal dependencies through recurrent connections.

19
Q

Summarize the bias-variance trade-off in machine learning.

A

The bias-variance trade-off describes how increasing model complexity reduces bias but increases variance. A balanced model minimizes both to avoid underfitting and overfitting.

20
Q

How do you determine the number of trainable weights in a neural network, excluding biases?

A

Count the total connections (arrows) between nodes. Each connection represents a unique weight to be trained.
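
For example, a fully connected 2-3-1 network (2 inputs, 3 hidden units, 1 output) has 2×3 + 3×1 = 9 connections, so 9 trainable weights excluding biases.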

21
Q

How do you determine the total number of weights in a neural network, including biases?

A

Count each connection plus one bias per non-input node.
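
Continuing the 2-3-1 example, there are 9 connection weights plus 3 + 1 = 4 biases (one per hidden and output node), giving 13 trainable parameters in total.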

22
Q

In backpropagation, which paths in a computational graph are relevant?

A

Only paths that connect trainable parameters to the final loss (J). Paths that don’t lead to J are irrelevant.

23
Q

Compute the partial derivative of y-hat with respect to weight w^(2)_i if y-hat = h^T * w^(2).

A

The answer is h_i, which represents the activation from the previous layer at node i.

24
Q

Compute the derivative of h_i with respect to u_j if h_i = ReLU(u_i).

A
If i ≠ j, the derivative is 0.

If i = j:
The derivative is 1 if u_i > 0.
The derivative is 0 if u_i <= 0.

25
Q

What is the XOR problem, and why is it significant for neural networks?

A

XOR is a linearly non-separable problem, meaning it cannot be solved by a single-layer (linear) network. It demonstrates why hidden layers are needed to solve non-linear problems.

26
Q

In an XOR network with ReLU activation and hidden biases of -0.5, what should the hidden-output weights v1 and v2 be?

A

Set v1 = v2 = 2. This allows the network to output 1 when inputs differ and 0 when they are the same.
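
As a quick check, one construction consistent with these values (the input-to-hidden weights here are an assumption; the slides may use a different set) is h1 = ReLU(x1 − x2 − 0.5) and h2 = ReLU(x2 − x1 − 0.5). For inputs (1,0) or (0,1) exactly one hidden unit outputs 0.5, so y = 2·0.5 = 1; for (0,0) and (1,1) both hidden units output 0, so y = 0.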

27
Q

List two advantages of using ReLU activation over sigmoid activation.

A

1) ReLU is computationally efficient to calculate and backpropagate.
2) ReLU does not saturate when x > 0, preventing gradient vanishing for positive values.

28
Q

Name three regularization techniques for neural networks.

A

Examples include:

Parameter Norm Penalties
Dataset Augmentation
Noise Robustness
Semi-Supervised Learning
Early Stopping
Dropout

29
Q

Which regularization technique (L1 or L2) encourages sparsity in weights? Why?

A

L1 norm, as it drives small weights to zero, whereas L2 norm results in weights that are small but usually not zero.

30
Q

Dropout in neural networks is an approximation of what ensemble method?

A

Dropout approximates β€œbagging,” training a kind of ensemble by randomly omitting units, effectively training many subnetworks.

31
Q

Describe the difference between SMOTE and mixup data augmentation techniques.

A

SMOTE generates synthetic data within a single class, typically to upsample minority classes. Mixup interpolates between examples of different classes, creating soft class memberships.
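
Concretely, mixup forms x_new = λ·x_i + (1−λ)·x_j and y_new = λ·y_i + (1−λ)·y_j with λ drawn from a Beta distribution, so with λ = 0.7 a cat/dog pair yields a label that is 0.7 cat and 0.3 dog. SMOTE instead interpolates a minority-class sample with one of its nearest neighbours from the same class, so the label stays unchanged.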

32
Q

Which regularization technique is used by convolutional filters in CNNs?

A

Parameter sharing, where filters use the same weights across different input regions, saving memory and improving efficiency.

33
Q

What is the result of a 2D convolution on an image with a 3x3 filter and stride 1?

A

The output matrix is smaller than the input. Each entry is the sum of element-wise products between the filter and image patch it overlays.

34
Q

Describe the result of a 2x2 max pooling operation on a 3x3 matrix.

A

Max pooling with stride 1 slides a 2x2 window over the matrix and keeps the maximum value in each window, producing a 2x2 output from the 3x3 input.

35
Q

What are two key advantages of ReLU activation in neural networks?

A

1) Efficiency in computation, making ReLU fast to evaluate.
2) It does not saturate for positive inputs, avoiding gradient vanishing in the positive range.

36
Q

List three regularization techniques specific to deep neural networks.

A

Examples include Dropout, Early Stopping, and Adversarial Learning.

37
Q

Which norm (L1 or L2) is more likely to yield sparse weights, and why?

A

L1 norm, because it tends to drive small weights to zero, leading to fewer active features compared to L2.

38
Q

What ensemble method does dropout approximate in a practical way for deep networks?

A

Dropout approximates β€œbagging” by training many subnetworks, each with a random subset of active units, making it a practical ensemble method for large networks.

39
Q

What is the main difference between SMOTE and mixup for data augmentation?

A

SMOTE focuses on generating new samples within a single class for upsampling, while mixup interpolates between classes, creating data that can belong partially to multiple classes.

40
Q

Which regularization technique do convolutional filters in CNNs use to improve efficiency?

A

Parameter sharing, where the same weights are reused across different spatial regions, reducing the number of parameters.

41
Q

How do you perform a convolution operation on a binary input image with a binary filter?

A

Slide the filter over the image, multiplying and summing the overlapping values in each position to produce a new matrix (output image).
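
A minimal sketch of this sliding-window computation on a small binary example, assuming NumPy, valid padding, and stride 1 (the image and filter values are illustrative):

import numpy as np

image = np.array([[1, 0, 1, 0],
                  [0, 1, 1, 1],
                  [1, 1, 0, 0],
                  [0, 1, 0, 1]])
kernel = np.array([[1, 0],
                   [0, 1]])

out_h = image.shape[0] - kernel.shape[0] + 1     # output height (valid padding, stride 1)
out_w = image.shape[1] - kernel.shape[1] + 1
output = np.zeros((out_h, out_w))

for i in range(out_h):
    for j in range(out_w):
        patch = image[i:i + 2, j:j + 2]          # region the filter currently overlays
        output[i, j] = np.sum(patch * kernel)    # element-wise multiply, then sum

Strictly speaking this computes cross-correlation (the filter is not flipped), which is what most deep learning libraries implement under the name convolution.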

42
Q

After applying a 3x3 convolution filter with stride 1 on a 4x4 image, what is the size of the resulting image?

A

The resulting image will be 2x2, since the filter moves one step at a time, covering 3x3 sections of the original image.
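
In general, for an n×n input, an f×f filter, stride s, and no padding, the output side length is ⌊(n − f)/s⌋ + 1; here (4 − 3)/1 + 1 = 2.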

43
Q

What is max pooling, and how does it work in a CNN?

A

Max pooling reduces dimensionality by taking the maximum value in a defined window (e.g., 2x2), summarizing the most significant feature in that region.

44
Q

What is the purpose of max pooling after a convolution operation?

A

Max pooling introduces translation invariance and reduces the spatial size, retaining only the strongest feature responses in each region.

45
Q

How do you perform a 2x2 max pooling operation on a 3x3 matrix with stride 1?

A

Slide a 2x2 window across the matrix with one-step strides, and replace each region with its maximum value to create a reduced matrix.
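
For example, applying this to the matrix [[1, 2, 3], [4, 5, 6], [7, 8, 9]] gives the 2x2 result [[5, 6], [8, 9]], since each entry is the maximum of the corresponding 2x2 window.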

46
Q

Why is max pooling useful in convolutional neural networks?

A

It reduces computation by downsampling, and it introduces invariance to small translations, allowing the network to focus on more significant features.

47
Q

In a neural network, why might dropout be preferred over training a full ensemble?

A

Dropout is faster and less memory-intensive, as it approximates ensemble training by randomly deactivating units rather than training many separate networks.
