torch_modules Flashcards

1
Q

What is the input shape of Conv1D?

A
  • Input shape: (batch_size, channels, sequence_length)
2
Q

What is the output shape of Conv1D for a given stride and kernel size?

A
  • Formula for output length: ⌊(sequence_length - kernel_size + 2 * padding) / stride⌋ + 1
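A quick numeric check of the formula (a minimal sketch; the specific shapes are illustrative):

import torch
import torch.nn as nn

# sequence_length=100, kernel_size=5, padding=0, stride=2
# ⌊(100 - 5 + 0) / 2⌋ + 1 = 48
conv = nn.Conv1d(in_channels=3, out_channels=8, kernel_size=5, stride=2)
x = torch.randn(32, 3, 100)
print(conv(x).shape)  # torch.Size([32, 8, 48])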
3
Q

Give an example of Conv1D usage.

A
import torch
import torch.nn as nn

# Example input: batch_size=32, channels=3, sequence_length=100
x = torch.randn(32, 3, 100)
conv1d = nn.Conv1d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)
output = conv1d(x)  # Shape: (32, 16, 100)
4
Q

How does Conv1D transform dimensions?

A
  • Input: (batch_size, in_channels, sequence_length)
  • Output: (batch_size, out_channels, new_sequence_length)
  • Example: If input is (32, 3, 100) with out_channels=16:
    • Output becomes (32, 16, 100) with kernel_size=3, stride=1, padding=1 (length preserved)
    • Each output channel represents different learned features
5
Q

What is the input shape and purpose of Conv2D?

A
  • Purpose: Applies learned 2D filters across spatial dimensions (e.g., images) to extract feature maps
  • Input shape: (batch_size, channels, height, width)
  • Output size per spatial dimension:
    • Height_out = ⌊(height - kernel_size + 2 * padding) / stride⌋ + 1
    • Width_out = ⌊(width - kernel_size + 2 * padding) / stride⌋ + 1
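A minimal sketch verifying the formulas (the 224x224 image shape is illustrative):

import torch
import torch.nn as nn

# ⌊(224 - 3 + 2 * 1) / 1⌋ + 1 = 224 in each spatial dimension
conv2d = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, stride=1, padding=1)
x = torch.randn(32, 3, 224, 224)
print(conv2d(x).shape)  # torch.Size([32, 64, 224, 224])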
6
Q

How does Conv2D transform dimensions?

A
  • Input: (batch_size, in_channels, height, width)
  • Output: (batch_size, out_channels, new_height, new_width)
  • Example: If input is (32, 3, 224, 224) with out_channels=64:
    • Output becomes (32, 64, 224, 224) with kernel_size=3, stride=1, padding=1
    • Each output channel is a different learned feature map
7
Q

What is the purpose and operation of AvgPool1D?

A
  • Purpose: Reduces sequence length by averaging values in sliding windows
  • Operation: Takes the mean of each kernel-sized window
  • No learnable parameters
  • Reduces dimensionality while maintaining important features
  • Output length = ⌊(sequence_length - kernel_size) / stride⌋ + 1
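A minimal sketch (kernel_size=2, stride=2 halves the sequence length):

import torch
import torch.nn as nn

pool = nn.AvgPool1d(kernel_size=2, stride=2)
x = torch.randn(32, 16, 100)
print(pool(x).shape)  # torch.Size([32, 16, 50]); ⌊(100 - 2) / 2⌋ + 1 = 50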
8
Q

How does AvgPool1D transform dimensions?

A
  • Input: (batch_size, channels, sequence_length)
  • Output: (batch_size, channels, new_sequence_length)
  • Example: With kernel_size=2, stride=2
    • Input (32, 16, 100) becomes (32, 16, 50)
    • Sequence length is halved while channels remain unchanged
9
Q

What is the purpose of BatchNormalization?

A
  • Purpose: Normalizes layer outputs to have zero mean and unit variance
  • Helps with training stability and faster convergence
  • Different versions for different dimensional data:
    • BatchNorm1d: (batch_size, features) or (batch_size, channels, length)
    • BatchNorm2d: (batch_size, channels, height, width)
    • BatchNorm3d: (batch_size, channels, depth, height, width)
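A minimal sketch showing that each variant preserves its input shape (feature counts are illustrative):

import torch
import torch.nn as nn

bn1d = nn.BatchNorm1d(16)  # num_features=16
print(bn1d(torch.randn(32, 16)).shape)       # torch.Size([32, 16])
print(bn1d(torch.randn(32, 16, 100)).shape)  # torch.Size([32, 16, 100])

bn2d = nn.BatchNorm2d(64)
print(bn2d(torch.randn(32, 64, 28, 28)).shape)  # torch.Size([32, 64, 28, 28])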
10
Q

How does BatchNorm work during training vs. inference?

A
  • Training:
    • Calculates mean and variance from current batch
    • Updates running statistics for inference
    • Applies learned scale (gamma) and shift (beta)
  • Inference:
    • Uses stored running statistics instead of batch statistics
    • Still applies learned scale and shift
  • Maintains input shape in all cases
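A minimal sketch of the mode switch (shapes are illustrative):

import torch
import torch.nn as nn

bn = nn.BatchNorm2d(8)

bn.train()                  # uses batch statistics, updates running stats
y = bn(torch.randn(4, 8, 16, 16))
print(bn.running_mean[:3])  # running statistics were updated by the forward pass

bn.eval()                   # uses the stored running statistics instead
y = bn(torch.randn(4, 8, 16, 16))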
11
Q

What is the purpose and operation of Dropout?

A
  • Purpose: Prevents overfitting by randomly deactivating neurons
  • Training:
    • Randomly sets values to zero with probability p
    • Scales remaining values by 1/(1-p)
  • Inference:
    • No random dropping
    • No scaling needed
  • Maintains input shape
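A minimal sketch of both modes (p=0.5 is illustrative):

import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()
print(drop(x))  # roughly half the values are 0; survivors scaled to 2.0 = 1/(1-0.5)

drop.eval()
print(drop(x))  # identity: all ones, no dropping or scaling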
12
Q

How to use Dropout effectively?

A
  • Common dropout rates: 0.1 to 0.5
  • Higher rates = stronger regularization
  • Place after activation functions
  • Don’t use just before output layer
  • Different types available:
    • Dropout: Standard random dropping
    • Dropout2d: Drops entire channels (good for CNNs)
    • Dropout3d: Drops entire 3D features
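A minimal sketch of the placement advice above (layer sizes are illustrative):

import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Dropout(p=0.3),  # after the activation
    nn.Linear(64, 10),  # no dropout just before the output layer
)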
13
Q

What are the dimension transformations and parameters for each module (Conv1D, Conv2D, AvgPool1D, BatchNorm, Dropout)?

A

Module | Input Shape | Output Shape | Parameters
-------|-------------|--------------|------------
Conv1D | (B, C, L) | (B, C_out, L_new) | in_channels, out_channels, kernel_size, stride, padding
Conv2D | (B, C, H, W) | (B, C_out, H_new, W_new) | in_channels, out_channels, kernel_size, stride, padding
AvgPool1D | (B, C, L) | (B, C, L_new) | kernel_size, stride
BatchNorm | (B, *) | Same as input | num_features
Dropout | Any | Same as input | p (dropout probability)

14
Q

What are common mistakes to watch for with these modules?

A
  1. Conv1D/AvgPool1D:
    • Forgetting to permute channels for time series data
    • Not accounting for padding in output size calculations
  2. BatchNorm:
    • Using too small batch sizes (affects statistics)
    • Forgetting to set training mode correctly
  3. Dropout:
    • Not turning off during inference
    • Using too high dropout rate
    • Placing in wrong locations in network
  4. General:
    • Not checking output shapes
    • Incorrect dimension ordering
    • Not handling edge cases in sequence lengths
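A minimal sketch of the first pitfall: time series data often arrives as (batch, time, channels) and must be permuted before Conv1d (shapes are illustrative):

import torch

x = torch.randn(32, 100, 3)  # (batch_size, sequence_length, channels)
x = x.permute(0, 2, 1)       # -> (32, 3, 100), the layout Conv1d expects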
15
Q

What’s the key difference between depthwise and pointwise convolutions in terms of channel interaction?

A

Depthwise processes each channel independently while pointwise (1x1 conv) mixes information across all channels.

16
Q

For input shape [m,128,3000], what affects the number of parameters in depthwise convolution?

A

The weight count is kernel_size × in_channels (e.g., 3 × 128 = 384 weights for kernel size 3), plus in_channels bias terms if bias=True. The sequence length (3000) has no effect on the parameter count.
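A minimal sketch verifying the count (bias disabled to match the 384 figure):

import torch.nn as nn

dw = nn.Conv1d(in_channels=128, out_channels=128, kernel_size=3, groups=128, bias=False)
print(sum(p.numel() for p in dw.parameters()))  # 384; weight shape is (128, 1, 3)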

17
Q

How does pointwise convolution (kernel_size=1) process the temporal dimension of input?

A

Each position in time dimension is processed independently, applying learned channel mixing weights at each timestep.
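A minimal sketch of this independence: a kernel_size=1 convolution applies the same channel-mixing weights at every timestep, matching a Linear layer applied position-wise (shapes are illustrative):

import torch
import torch.nn as nn

pw = nn.Conv1d(in_channels=128, out_channels=256, kernel_size=1)
lin = nn.Linear(128, 256)
lin.weight.data = pw.weight.data.squeeze(-1)  # (256, 128, 1) -> (256, 128)
lin.bias.data = pw.bias.data

x = torch.randn(4, 128, 3000)
y_conv = pw(x)
y_lin = lin(x.transpose(1, 2)).transpose(1, 2)
print(torch.allclose(y_conv, y_lin, atol=1e-5))  # True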

18
Q

How do you implement a basic depthwise convolution layer in PyTorch for 1D input with 128 channels?

A
nn.Conv1d(in_channels=128, out_channels=128, kernel_size=3, groups=128, padding='same')
19
Q

How do you implement a pointwise convolution? Explain its parameters, input, and output.

A

nn.Conv1d(in_channels=128, out_channels=256, kernel_size=1)
This takes input [batch, 128, time] and outputs [batch, 256, time]; kernel_size=1 ensures each timestep is processed independently through a learned channel-mixing matrix.

20
Q

How do you initialize a basic MultiheadAttention module in PyTorch?

A

attention = nn.MultiheadAttention(
    embed_dim=512,      # input (embedding) dimension
    num_heads=8,        # number of attention heads
    batch_first=True    # batch dimension comes first
)

21
Q

How do you perform a forward pass with MultiheadAttention for self-attention?

A

# x shape: (batch_size, seq_length, embed_dim)
output, attention_weights = attention(
    query=x,
    key=x,
    value=x
)

22
Q

How do you implement a complete Transformer layer with MultiheadAttention?

A

class TransformerLayer(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        self.attention = nn.MultiheadAttention(
            embed_dim=embed_dim,
            num_heads=num_heads,
            batch_first=True
        )
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, 4 * embed_dim),
            nn.ReLU(),
            nn.Linear(4 * embed_dim, embed_dim)
        )

23
Q

How do you implement the forward pass of a Transformer layer with residual connections?

A

def forward(self, x):
    # Self-attention with residual connection
    attn_output, _ = self.attention(x, x, x)
    x = self.norm1(x + attn_output)

    # FFN with residual connection
    ffn_output = self.ffn(x)
    x = self.norm2(x + ffn_output)
    return x
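A minimal usage sketch for the TransformerLayer defined in the previous two cards (dimensions are illustrative):

import torch

layer = TransformerLayer(embed_dim=512, num_heads=8)
x = torch.randn(32, 20, 512)  # (batch_size, seq_length, embed_dim)
print(layer(x).shape)         # torch.Size([32, 20, 512]); shape is preserved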