torch_modules Flashcards
What is the input shape of Conv1D?
- Input shape: (batch_size, channels, sequence_length)
What is the output shape of Conv1D for a given stride and kernel size?
- Formula for output length: ⌊(sequence_length - kernel_size + 2 * padding) / stride⌋ + 1
Give an example of Conv1D usage.
# Example input: batch_size=32, channels=3, sequence_length=100
x = torch.randn(32, 3, 100)
conv1d = nn.Conv1d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)
output = conv1d(x)  # Shape: (32, 16, 100)
How does Conv1D transform dimensions?
- Input: (batch_size, in_channels, sequence_length)
- Output: (batch_size, out_channels, new_sequence_length)
- Example: If input is (32, 3, 100) with out_channels=16:
- Output becomes (32, 16, 100) with kernel_size=3, stride=1, padding=1
- Each output channel represents different learned features
What is the input shape of Conv2D and how is its output size computed?
- Input shape: (batch_size, channels, height, width)
- Height_out = ⌊(height - kernel_size + 2 * padding) / stride⌋ + 1
- Width_out = ⌊(width - kernel_size + 2 * padding) / stride⌋ + 1
How does Conv2D transform dimensions?
- Input: (batch_size, in_channels, height, width)
- Output: (batch_size, out_channels, new_height, new_width)
- Example: If input is (32, 3, 224, 224) with out_channels=64:
- Output becomes (32, 64, 224, 224) with kernel_size=3, stride=1, padding=1
- Each output channel is a different learned feature map
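A minimal sketch of the transformation above, assuming kernel_size=3, stride=1, padding=1 so the spatial size is preserved:
import torch
import torch.nn as nn

x = torch.randn(32, 3, 224, 224)  # (batch, channels, height, width)
conv2d = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, stride=1, padding=1)
out = conv2d(x)
print(out.shape)  # torch.Size([32, 64, 224, 224])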
What is the purpose and operation of AvgPool1D?
- Purpose: Reduces sequence length by averaging values in sliding windows
- Operation: Takes the mean of each kernel-sized window
- No learnable parameters
- Reduces dimensionality while maintaining important features
- Output length = ⌊(sequence_length - kernel_size) / stride⌋ + 1
How does AvgPool1D transform dimensions?
- Input: (batch_size, channels, sequence_length)
- Output: (batch_size, channels, new_sequence_length)
- Example: With kernel_size=2, stride=2
- Input (32, 16, 100) becomes (32, 16, 50)
- Sequence length is halved while channels remain unchanged
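A short example of the halving described above (kernel_size=2 and stride=2 are the only assumptions):
import torch
import torch.nn as nn

x = torch.randn(32, 16, 100)  # (batch, channels, sequence_length)
pool = nn.AvgPool1d(kernel_size=2, stride=2)
out = pool(x)
print(out.shape)  # torch.Size([32, 16, 50])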
What is the purpose of BatchNormalization?
- Purpose: Normalizes layer outputs to have zero mean and unit variance
- Helps with training stability and faster convergence
- Different versions for different dimensional data:
- BatchNorm1d: (batch_size, features) or (batch_size, channels, length)
- BatchNorm2d: (batch_size, channels, height, width)
- BatchNorm3d: (batch_size, channels, depth, height, width)
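A quick sketch of the 1D and 2D variants; the channel counts and sizes below are arbitrary examples:
import torch
import torch.nn as nn

bn1d = nn.BatchNorm1d(16)  # expects (N, 16) or (N, 16, L)
bn2d = nn.BatchNorm2d(64)  # expects (N, 64, H, W)
print(bn1d(torch.randn(32, 16, 100)).shape)    # torch.Size([32, 16, 100])
print(bn2d(torch.randn(32, 64, 28, 28)).shape) # torch.Size([32, 64, 28, 28])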
How does BatchNorm work during training vs. inference?
- Training:
- Calculates mean and variance from current batch
- Updates running statistics for inference
- Applies learned scale (gamma) and shift (beta)
- Inference:
- Uses stored running statistics instead of batch statistics
- Still applies learned scale and shift
- Maintains input shape in all cases
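A minimal illustration of the training/inference switch (shapes and values are arbitrary):
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(16)
x = torch.randn(32, 16)

bn.train()        # uses batch statistics, updates running_mean / running_var
y_train = bn(x)

bn.eval()         # uses the stored running statistics instead
y_eval = bn(x)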
What is the purpose and operation of Dropout?
- Purpose: Prevents overfitting by randomly deactivating neurons
- Training:
- Randomly sets values to zero with probability p
- Scales remaining values by 1/(1-p)
- Inference:
- No random dropping
- No scaling needed
- Maintains input shape
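A small sketch showing the behavior difference (p=0.5 is just an example rate):
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()
print(drop(x))  # roughly half the values are zeroed, the rest scaled by 1/(1-p) = 2

drop.eval()
print(drop(x))  # identity: no dropping, no scaling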
How to use Dropout effectively?
- Common dropout rates: 0.1 to 0.5
- Higher rates = stronger regularization
- Place after activation functions
- Don’t use just before output layer
- Different types available:
- Dropout: Standard random dropping
- Dropout2d: Drops entire channels (good for CNNs)
- Dropout3d: Drops entire 3D features
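A brief comparison of Dropout and Dropout2d on a CNN-style tensor (the 0.3 rate is illustrative):
import torch
import torch.nn as nn

x = torch.randn(8, 16, 32, 32)       # (batch, channels, height, width)

elementwise = nn.Dropout(p=0.3)      # zeroes individual elements
channelwise = nn.Dropout2d(p=0.3)    # zeroes entire channels (feature maps)

print(elementwise(x).shape, channelwise(x).shape)  # both keep (8, 16, 32, 32)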
What are the dimension transformations for each module?
Module | Input Shape | Output Shape | Parameters
-------|-------------|--------------|------------
Conv1D | (B, C, L) | (B, C_out, L_new) | in_channels, out_channels, kernel_size, stride, padding
Conv2D | (B, C, H, W) | (B, C_out, H_new, W_new) | in_channels, out_channels, kernel_size, stride, padding
AvgPool1D | (B, C, L) | (B, C, L_new) | kernel_size, stride
BatchNorm | (B, *) | Same as input | num_features
Dropout | Any | Same as input | p (dropout probability)
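One quick way to sanity-check the rows above (all sizes are arbitrary examples):
import torch
import torch.nn as nn

print(nn.Conv1d(3, 16, 3, padding=1)(torch.randn(2, 3, 100)).shape)     # (2, 16, 100)
print(nn.Conv2d(3, 64, 3, padding=1)(torch.randn(2, 3, 32, 32)).shape)  # (2, 64, 32, 32)
print(nn.AvgPool1d(2)(torch.randn(2, 16, 100)).shape)                   # (2, 16, 50)
print(nn.BatchNorm1d(16)(torch.randn(2, 16)).shape)                     # (2, 16)
print(nn.Dropout(0.5)(torch.randn(2, 16)).shape)                        # (2, 16)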
What are common mistakes to watch for with these modules?
- Conv1D/AvgPool1D:
- Forgetting to permute channels for time series data (see the sketch after this list)
- Not accounting for padding in output size calculations
- BatchNorm:
- Using too small batch sizes (affects statistics)
- Forgetting to set training mode correctly
- Dropout:
- Not turning off during inference
- Using too high dropout rate
- Placing in wrong locations in network
- General:
- Not checking output shapes
- Incorrect dimension ordering
- Not handling edge cases in sequence lengths
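A concrete fix for the permutation mistake above, assuming the data arrives in (batch, time, channels) layout:
import torch
import torch.nn as nn

x = torch.randn(32, 100, 3)   # (batch, time, channels), common for raw time series
x = x.permute(0, 2, 1)        # -> (batch, channels, time), as Conv1d expects
out = nn.Conv1d(in_channels=3, out_channels=16, kernel_size=3, padding=1)(x)
print(out.shape)              # torch.Size([32, 16, 100])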
What’s the key difference between depthwise and pointwise convolutions in terms of channel interaction?
Depthwise processes each channel independently while pointwise (1x1 conv) mixes information across all channels.
For input shape [m,128,3000], what affects the number of parameters in depthwise convolution?
Only kernel_size × in_channels determines the weight count (e.g., 3 × 128 = 384 weights for kernel_size=3), plus in_channels bias terms if bias=True.
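A quick check of that count (bias disabled so only the 384 weights remain):
import torch.nn as nn

dw = nn.Conv1d(in_channels=128, out_channels=128, kernel_size=3,
               groups=128, padding='same', bias=False)
print(sum(p.numel() for p in dw.parameters()))  # 384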
How does pointwise convolution (kernel_size=1) process the temporal dimension of input?
Each position in time dimension is processed independently, applying learned channel mixing weights at each timestep.
How do you implement a basic depthwise convolution layer in PyTorch for 1D input with 128 channels?
nn.Conv1d(in_channels=128, out_channels=128, kernel_size=3, groups=128, padding='same')
How do you implement a pointwise convolution, and what are its input and output shapes?
nn.Conv1d(in_channels=128, out_channels=256, kernel_size=1)
It takes input [batch, 128, time] and outputs [batch, 256, time]; kernel_size=1 ensures each timestep is processed independently through a learned channel-mixing matrix.
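A sketch combining the two cards above into a depthwise-separable block (the 256 output channels and input length 3000 are illustrative):
import torch
import torch.nn as nn

depthwise = nn.Conv1d(128, 128, kernel_size=3, groups=128, padding='same')
pointwise = nn.Conv1d(128, 256, kernel_size=1)

x = torch.randn(4, 128, 3000)  # (batch, channels, time)
out = pointwise(depthwise(x))
print(out.shape)               # torch.Size([4, 256, 3000])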
Q: How do you initialize a basic MultiheadAttention module in PyTorch?
attention = nn.MultiheadAttention(
    embed_dim=512,      # Input dimension
    num_heads=8,        # Number of attention heads
    batch_first=True    # Batch dimension first
)
Q: How do you perform a forward pass with MultiheadAttention for self-attention?
# x shape: (batch_size, seq_length, embed_dim)
output, attention_weights = attention(
    query=x,
    key=x,
    value=x
)
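A runnable version of the snippet above (512-dim embeddings and a length-10 sequence are just example sizes):
import torch
import torch.nn as nn

attention = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
x = torch.randn(32, 10, 512)                  # (batch, seq_length, embed_dim)
output, attention_weights = attention(query=x, key=x, value=x)
print(output.shape, attention_weights.shape)  # (32, 10, 512) and (32, 10, 10)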
Q: How do you implement a complete Transformer layer with MultiheadAttention?
class TransformerLayer(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        self.attention = nn.MultiheadAttention(
            embed_dim=embed_dim,
            num_heads=num_heads,
            batch_first=True
        )
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, 4 * embed_dim),
            nn.ReLU(),
            nn.Linear(4 * embed_dim, embed_dim)
        )
Q: How do you implement the forward pass of a Transformer layer with residual connections?
    def forward(self, x):
        # Self-attention with residual connection
        attn_output, _ = self.attention(x, x, x)
        x = self.norm1(x + attn_output)
        # Feed-forward network with residual connection
        ffn_output = self.ffn(x)
        x = self.norm2(x + ffn_output)
        return x
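A minimal usage sketch of the layer defined above, assuming the usual torch imports (embed_dim=512, num_heads=8, and the sequence length are illustrative):
import torch

layer = TransformerLayer(embed_dim=512, num_heads=8)
x = torch.randn(32, 10, 512)  # (batch, seq_length, embed_dim)
out = layer(x)
print(out.shape)              # torch.Size([32, 10, 512])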