torch_modules Flashcards
What is the input shape of Conv1D?
- Input shape: (batch_size, channels, sequence_length)
What is the output shape of Conv1D for a given stride and kernel size?
- Formula for output length: ⌊(sequence_length - kernel_size + 2 * padding) / stride⌋ + 1
Give an example of Conv1D usage.
# Example input: batch_size=32, channels=3, sequence_length=100
x = torch.randn(32, 3, 100)
conv1d = nn.Conv1d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)
output = conv1d(x)  # Shape: (32, 16, 100)
How does Conv1D transform dimensions?
- Input: (batch_size, in_channels, sequence_length)
- Output: (batch_size, out_channels, new_sequence_length)
- Example: If input is (32, 3, 100) with out_channels=16:
- Output becomes (32, 16, 100) with kernel_size=3, stride=1, padding=1
- Each output channel represents different learned features
What is the input shape of Conv2D and how is its output size computed?
- Input shape: (batch_size, channels, height, width)
- Height_out = ⌊(height - kernel_size + 2 * padding) / stride⌋ + 1
- Width_out = ⌊(width - kernel_size + 2 * padding) / stride⌋ + 1
How does Conv2D transform dimensions?
- Input: (batch_size, in_channels, height, width)
- Output: (batch_size, out_channels, new_height, new_width)
- Example: If input is (32, 3, 224, 224) with out_channels=64:
- Output becomes (32, 64, 224, 224) with kernel_size=3, stride=1, padding=1
- Each output channel is a different learned feature map
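A minimal sketch of the transformation above, assuming kernel_size=3, stride=1, padding=1 so the spatial size is preserved:
import torch
import torch.nn as nn

x = torch.randn(32, 3, 224, 224)  # (batch, channels, height, width)
conv2d = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, stride=1, padding=1)
out = conv2d(x)
print(out.shape)  # torch.Size([32, 64, 224, 224])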
What is the purpose and operation of AvgPool1D?
- Purpose: Reduces sequence length by averaging values in sliding windows
- Operation: Takes the mean of each kernel-sized window
- No learnable parameters
- Reduces dimensionality while maintaining important features
- Output length = ⌊(sequence_length - kernel_size) / stride⌋ + 1
How does AvgPool1D transform dimensions?
- Input: (batch_size, channels, sequence_length)
- Output: (batch_size, channels, new_sequence_length)
- Example: With kernel_size=2, stride=2
- Input (32, 16, 100) becomes (32, 16, 50)
- Sequence length is halved while channels remain unchanged
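A short example of the halving described above (kernel_size=2 and stride=2 are the only assumptions):
import torch
import torch.nn as nn

x = torch.randn(32, 16, 100)  # (batch, channels, sequence_length)
pool = nn.AvgPool1d(kernel_size=2, stride=2)
out = pool(x)
print(out.shape)  # torch.Size([32, 16, 50])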
What is the purpose of BatchNormalization?
- Purpose: Normalizes layer outputs to have zero mean and unit variance
- Helps with training stability and faster convergence
- Different versions for different dimensional data:
- BatchNorm1d: (batch_size, features) or (batch_size, channels, length)
- BatchNorm2d: (batch_size, channels, height, width)
- BatchNorm3d: (batch_size, channels, depth, height, width)
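A quick sketch of the 1D and 2D variants; the channel counts and sizes below are arbitrary examples:
import torch
import torch.nn as nn

bn1d = nn.BatchNorm1d(16)  # expects (N, 16) or (N, 16, L)
bn2d = nn.BatchNorm2d(64)  # expects (N, 64, H, W)
print(bn1d(torch.randn(32, 16, 100)).shape)    # torch.Size([32, 16, 100])
print(bn2d(torch.randn(32, 64, 28, 28)).shape) # torch.Size([32, 64, 28, 28])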
How does BatchNorm work during training vs. inference?
- Training:
- Calculates mean and variance from current batch
- Updates running statistics for inference
- Applies learned scale (gamma) and shift (beta)
- Inference:
- Uses stored running statistics instead of batch statistics
- Still applies learned scale and shift
- Maintains input shape in all cases
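A minimal illustration of the training/inference switch (shapes and values are arbitrary):
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(16)
x = torch.randn(32, 16)

bn.train()        # uses batch statistics, updates running_mean / running_var
y_train = bn(x)

bn.eval()         # uses the stored running statistics instead
y_eval = bn(x)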
What is the purpose and operation of Dropout?
- Purpose: Prevents overfitting by randomly deactivating neurons
- Training:
- Randomly sets values to zero with probability p
- Scales remaining values by 1/(1-p)
- Inference:
- No random dropping
- No scaling needed
- Maintains input shape
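A small sketch showing the behavior difference (p=0.5 is just an example rate):
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()
print(drop(x))  # roughly half the values are zeroed, the rest scaled by 1/(1-p) = 2

drop.eval()
print(drop(x))  # identity: no dropping, no scaling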
How to use Dropout effectively?
- Common dropout rates: 0.1 to 0.5
- Higher rates = stronger regularization
- Place after activation functions
- Don’t use just before output layer
- Different types available:
- Dropout: Standard random dropping
- Dropout2d: Drops entire channels (good for CNNs)
- Dropout3d: Drops entire 3D features
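A brief comparison of Dropout and Dropout2d on a CNN-style tensor (the 0.3 rate is illustrative):
import torch
import torch.nn as nn

x = torch.randn(8, 16, 32, 32)       # (batch, channels, height, width)

elementwise = nn.Dropout(p=0.3)      # zeroes individual elements
channelwise = nn.Dropout2d(p=0.3)    # zeroes entire channels (feature maps)

print(elementwise(x).shape, channelwise(x).shape)  # both keep (8, 16, 32, 32)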
What are the dimension transformations for each module?
Module | Input Shape | Output Shape | Parameters
-------|-------------|--------------|------------
Conv1D | (B, C, L) | (B, C_out, L_new) | in_channels, out_channels, kernel_size, stride, padding
Conv2D | (B, C, H, W) | (B, C_out, H_new, W_new) | in_channels, out_channels, kernel_size, stride, padding
AvgPool1D | (B, C, L) | (B, C, L_new) | kernel_size, stride
BatchNorm | (B, *) | Same as input | num_features
Dropout | Any | Same as input | p (dropout probability)
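One quick way to sanity-check the rows above (all sizes are arbitrary examples):
import torch
import torch.nn as nn

print(nn.Conv1d(3, 16, 3, padding=1)(torch.randn(2, 3, 100)).shape)     # (2, 16, 100)
print(nn.Conv2d(3, 64, 3, padding=1)(torch.randn(2, 3, 32, 32)).shape)  # (2, 64, 32, 32)
print(nn.AvgPool1d(2)(torch.randn(2, 16, 100)).shape)                   # (2, 16, 50)
print(nn.BatchNorm1d(16)(torch.randn(2, 16)).shape)                     # (2, 16)
print(nn.Dropout(0.5)(torch.randn(2, 16)).shape)                        # (2, 16)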
What are common mistakes to watch for with these modules?
- Conv1D/AvgPool1D:
- Forgetting to permute channels for time series data (see the sketch after this list)
- Not accounting for padding in output size calculations
- BatchNorm:
- Using too small batch sizes (affects statistics)
- Forgetting to set training mode correctly
- Dropout:
- Not turning off during inference
- Using too high dropout rate
- Placing in wrong locations in network
- General:
- Not checking output shapes
- Incorrect dimension ordering
- Not handling edge cases in sequence lengths
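A concrete fix for the permutation mistake above, assuming the data arrives in (batch, time, channels) layout:
import torch
import torch.nn as nn

x = torch.randn(32, 100, 3)   # (batch, time, channels), common for raw time series
x = x.permute(0, 2, 1)        # -> (batch, channels, time), as Conv1d expects
out = nn.Conv1d(in_channels=3, out_channels=16, kernel_size=3, padding=1)(x)
print(out.shape)              # torch.Size([32, 16, 100])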
What’s the key difference between depthwise and pointwise convolutions in terms of channel interaction?
Depthwise processes each channel independently while pointwise (1x1 conv) mixes information across all channels.
For input shape [m,128,3000], what affects the number of parameters in depthwise convolution?
Only kernel_size × in_channels determines the weight count (e.g., 3 × 128 = 384 weights for kernel_size=3), plus in_channels bias terms if bias=True.
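A quick check of that count (bias disabled so only the 384 weights remain):
import torch.nn as nn

dw = nn.Conv1d(in_channels=128, out_channels=128, kernel_size=3,
               groups=128, padding='same', bias=False)
print(sum(p.numel() for p in dw.parameters()))  # 384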
How does pointwise convolution (kernel_size=1) process the temporal dimension of input?
Each position in time dimension is processed independently, applying learned channel mixing weights at each timestep.
How do you implement a basic depthwise convolution layer in PyTorch for 1D input with 128 channels?
nn.Conv1d(in_channels=128, out_channels=128, kernel_size=3, groups=128, padding='same')
How do you implement a pointwise convolution, and what are its input and output shapes?
nn.Conv1d(in_channels=128, out_channels=256, kernel_size=1)
It takes input [batch, 128, time] and outputs [batch, 256, time]; kernel_size=1 ensures each timestep is processed independently through a learned channel-mixing matrix.
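A sketch combining the two cards above into a depthwise-separable block (the 256 output channels and input length 3000 are illustrative):
import torch
import torch.nn as nn

depthwise = nn.Conv1d(128, 128, kernel_size=3, groups=128, padding='same')
pointwise = nn.Conv1d(128, 256, kernel_size=1)

x = torch.randn(4, 128, 3000)  # (batch, channels, time)
out = pointwise(depthwise(x))
print(out.shape)               # torch.Size([4, 256, 3000])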
Q: How do you initialize a basic MultiheadAttention module in PyTorch?
attention = nn.MultiheadAttention(
    embed_dim=512,      # Input dimension
    num_heads=8,        # Number of attention heads
    batch_first=True    # Batch dimension first
)
Q: How do you perform a forward pass with MultiheadAttention for self-attention?
# x shape: (batch_size, seq_length, embed_dim)
output, attention_weights = attention(
    query=x,
    key=x,
    value=x
)
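A runnable version of the snippet above (512-dim embeddings and a length-10 sequence are just example sizes):
import torch
import torch.nn as nn

attention = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
x = torch.randn(32, 10, 512)                  # (batch, seq_length, embed_dim)
output, attention_weights = attention(query=x, key=x, value=x)
print(output.shape, attention_weights.shape)  # (32, 10, 512) and (32, 10, 10)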
Q: How do you implement a complete Transformer layer with MultiheadAttention?
class TransformerLayer(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        self.attention = nn.MultiheadAttention(
            embed_dim=embed_dim,
            num_heads=num_heads,
            batch_first=True
        )
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, 4 * embed_dim),
            nn.ReLU(),
            nn.Linear(4 * embed_dim, embed_dim)
        )
Q: How do you implement the forward pass of a Transformer layer with residual connections?
    def forward(self, x):
        # Self-attention with residual connection
        attn_output, _ = self.attention(x, x, x)
        x = self.norm1(x + attn_output)
        # Feed-forward network with residual connection
        ffn_output = self.ffn(x)
        x = self.norm2(x + ffn_output)
        return x
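A minimal usage sketch of the layer defined above, assuming the usual torch imports (embed_dim=512, num_heads=8, and the sequence length are illustrative):
import torch

layer = TransformerLayer(embed_dim=512, num_heads=8)
x = torch.randn(32, 10, 512)  # (batch, seq_length, embed_dim)
out = layer(x)
print(out.shape)              # torch.Size([32, 10, 512])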