March 2025 Flashcards

1
Q

Main benefit of using logarithms (mathematically speaking)

A

Multiplication, which can create very large and very small numbers, can be replaced by addition:
log(a*b) = log(a) + log(b)
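A quick numeric check of this identity (a minimal sketch using Python's standard library; the specific values are arbitrary):

```python
import math

a, b = 1e12, 3e-9  # an arbitrary very large and very small number

# log turns the product into a sum
lhs = math.log(a * b)
rhs = math.log(a) + math.log(b)
print(lhs, rhs)  # both ~8.006, equal up to floating-point error
```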

2
Q

Why use the negative log function as a loss function

A

Because the negative log gives very high loss values when the prediction (the probability assigned to the correct class) is close to zero.
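A small illustration of how -log(p) behaves at the extremes (a minimal sketch, not tied to any particular framework):

```python
import math

for p in (0.01, 0.5, 0.99):
    print(p, -math.log(p))
# 0.01 -> ~4.61  (confidently wrong: huge loss)
# 0.5  -> ~0.69
# 0.99 -> ~0.01  (confidently right: tiny loss)
```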

3
Q

Cross Entropy Loss

A

SoftMax followed by the negative log likelihood loss.
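In PyTorch this composition is exactly what F.cross_entropy computes (a minimal sketch; the tensor values are arbitrary):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 10)            # batch of 4, 10 classes
targets = torch.randint(0, 10, (4,))   # integer class labels

# cross entropy == log-softmax followed by negative log-likelihood
ce = F.cross_entropy(logits, targets)
nll = F.nll_loss(F.log_softmax(logits, dim=1), targets)
print(torch.allclose(ce, nll))  # True
```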

4
Q

Cross Entropy (etymological description)

A

Cross entropy is a comparison of two probability distributions. In classification, this is usually a comparison between the known probability distribution (the labels) and the model's current predicted probability distribution. You can see how that would lend itself to a loss function.

5
Q

Consequence of using a learning rate that is too low?

A

It will take too many epochs to converge, and too many epochs means overfitting (the model will start to memorize the dataset).

6
Q

What is the most common last activation layer in a Classification CNN?

A

SoftMax

7
Q

What is the most common last activation layer in a Binary Classification CNN?

A

Sigmoid

8
Q

What is the concept called that is used to protect the general image-classification capabilities of a pre-trained model?

A

Freezing pre-trained layers.
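A minimal sketch of what freezing looks like in plain PyTorch (assumes a recent torchvision; fastai handles this for you when fine-tuning):

```python
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1")

# freeze every pre-trained parameter so gradients are not computed for them
for param in model.parameters():
    param.requires_grad = False

# replace the head with a new, trainable classifier for our task
model.fc = nn.Linear(model.fc.in_features, 2)  # e.g. 2 classes
```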

9
Q

Does a learning rate have to be a single number?

A

No, learning rates can differ by layer. Because layers form a gradient of abstraction (early layers learn general features, later layers learn task-specific ones), a single one-size-fits-all learning rate is often inappropriate.
This is called Discriminative Learning Rates.
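In plain PyTorch, discriminative learning rates can be expressed as optimizer parameter groups (a minimal sketch; the layer split and the rates are illustrative assumptions):

```python
import torch
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1")

optimizer = torch.optim.Adam([
    # pre-trained body: small learning rate so general features change slowly
    {"params": [p for n, p in model.named_parameters() if not n.startswith("fc")],
     "lr": 1e-5},
    # newly added head: larger learning rate
    {"params": model.fc.parameters(), "lr": 1e-3},
])
```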

10
Q

What should you focus on when training to know if your model is overconfident or beginning to overfit? And what should you NOT focus on?

A

You should focus on your metrics. You should not focus on the loss.
The loss function is just something you give to the model because it can be differentiated, so the model can perform SGD.

11
Q

Downsides of deeper architectures?

A

More prone to overfitting (more parameters with which to overfit)
Out-of-memory errors, forcing smaller batch sizes
Much longer training times

12
Q

One way of speeding up the training of deep networks?

A

Mixed-Precision Training
Using half-precision floating point (fp16) where possible during training.
NVIDIA GPUs support this via CUDA, and PyTorch exposes it as automatic mixed precision (AMP).
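A minimal sketch of one training step with PyTorch's automatic mixed precision (assumes a CUDA device; the toy model and data are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(100, 10).cuda()                 # toy model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()

inputs = torch.randn(32, 100, device="cuda")
targets = torch.randint(0, 10, (32,), device="cuda")

optimizer.zero_grad()
with torch.cuda.amp.autocast():                   # run the forward pass in fp16 where safe
    loss = loss_fn(model(inputs), targets)
scaler.scale(loss).backward()                     # scale the loss to avoid fp16 gradient underflow
scaler.step(optimizer)
scaler.update()
```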

13
Q

/sys

A

It is a virtual filesystem on modern Linux distributions that exposes information about, and allows modification of, the devices connected to the system.

14
Q

F.relu

A

ReLU (Rectified Linear Unit): replaces every negative value with zero.
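A one-liner illustration of F.relu (a minimal sketch; the tensor values are arbitrary):

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 1.5])
print(F.relu(x))  # tensor([0.0000, 0.0000, 0.0000, 1.5000])
```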

15
Q

activation function

A

a nonlinear layer

16
Q

Precision

A

How many of the positive predictions were actually positive.
Precision = TP / (TP + FP)
TP = True Positives
FP = False Positives
Precision is about CUTTING down on false positives.

17
Q

Recall

A

How many of the actual positive instances were correctly identified.
Recall = TP / (TP + FN)
TP = True Positives
FN = False Negatives
Recall all the known positives at the risk of more false positives

18
Q

Harmonic Mean & reason it is used in F1 score

A

Harmonic mean = N / (1/x1 + 1/x2 + ... + 1/xN)
Lower values have a stronger influence.
(It's a "bad apples" measure.)

19
Q

F1 score

A

Ranges from 0 to 1.
Harmonic mean of Precision and Recall:
F1 = 2 * (Precision * Recall) / (Precision + Recall)
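A small worked example tying Precision, Recall, and F1 together (a minimal sketch; the TP/FP/FN counts are made up):

```python
tp, fp, fn = 80, 20, 40   # hypothetical counts from a confusion matrix

precision = tp / (tp + fp)            # 0.80
recall = tp / (tp + fn)               # ~0.667
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)          # F1 ~0.727, pulled toward the lower value
```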

20
Q

Way of digging into what your classification model got wrong

A

Confusion Matrix

21
Q

PyTorch method that changes the shape of a tensor without changing its contents.

A

view(-1, 28*28)
-1 is a special parameter to view that means "make this axis as big as necessary to fit all the data".
Here we are multiplying 28 * 28, the lengths of the two image dimensions, to get the length of the new (flattened) dimension.
This takes an image and vectorizes it.
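A minimal sketch with MNIST-sized tensors (the batch size of 64 is just an assumption for illustration):

```python
import torch

batch = torch.randn(64, 28, 28)    # 64 single-channel 28x28 images
flat = batch.view(-1, 28 * 28)     # -1: "make this axis as big as needed"
print(flat.shape)                  # torch.Size([64, 784])
```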

22
Q

Define a PyTorch Dataset

A

a collection that contains tuples of independent and dependent variables.
independent = inputs, dependent = targets
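A minimal sketch of a custom PyTorch Dataset that returns (independent, dependent) tuples; the data here is random and purely illustrative:

```python
import torch
from torch.utils.data import Dataset

class ToyDataset(Dataset):
    def __init__(self, n=100):
        self.x = torch.randn(n, 3)             # independent variables (inputs)
        self.y = torch.randint(0, 2, (n,))     # dependent variables (targets)

    def __len__(self):
        return len(self.x)

    def __getitem__(self, i):
        return self.x[i], self.y[i]            # each item is an (input, target) tuple

ds = ToyDataset()
print(ds[0])  # (tensor of shape [3], scalar label tensor)
```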

23
Q

What is the deal with “L” from FastAI?

A

L is a specialized list-like container provided by fastcore (a dependency of fastai).

It extends Python lists with additional functionality such as element-wise operations, filtering, and mapping.
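A short illustration of fastcore's L (a minimal sketch; assumes fastcore is installed):

```python
from fastcore.foundation import L

xs = L(1, 2, 3, 4, 5)
print(xs.map(lambda x: x * 10))      # -> (#5) [10,20,30,40,50]
print(xs.filter(lambda x: x % 2))    # -> (#3) [1,3,5]
print(xs[[0, 2]])                    # fancy indexing with a list of indices
```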

24
Q

What is the significance of a method in PyTorch that ends in an underscore?

A

The method modifies the tensor in place (like a mutator in the OO world).
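For example (a minimal sketch):

```python
import torch

t = torch.zeros(3)
t.add(1)      # returns a new tensor; t is unchanged
print(t)      # tensor([0., 0., 0.])
t.add_(1)     # trailing underscore: modifies t in place
print(t)      # tensor([1., 1., 1.])
```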

25
Why was there only one bias value in the MNIST example I did?
Because there is only one output of the model (essentially, "is it a 3?"). For an N-class classifier there will be N biases, where N is the number of classes.
26
Universal Approximation Theorem
You can approximate any "wiggly" function if you use enough piecewise linear segments.
27
What is special about an activation layer, and what role does it play
Special: an activation layer is non-linear. Role: a stack of linear layers can be collapsed into a single equivalent linear layer; putting a non-linear layer between each pair of linear layers lets each layer "do its own thing". This is a major tenet of neural networks: without nonlinearity, a NN cannot learn complex patterns.
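A minimal sketch of the collapsing argument: two stacked linear layers behave exactly like one linear layer unless a nonlinearity sits between them (toy sizes, random weights):

```python
import torch
import torch.nn as nn

lin1, lin2 = nn.Linear(4, 8), nn.Linear(8, 2)

# fold the two linear layers into one equivalent layer: W = W2 @ W1, b = W2 @ b1 + b2
combined = nn.Linear(4, 2)
with torch.no_grad():
    combined.weight.copy_(lin2.weight @ lin1.weight)
    combined.bias.copy_(lin2.weight @ lin1.bias + lin2.bias)

x = torch.randn(5, 4)
print(torch.allclose(lin2(lin1(x)), combined(x), atol=1e-6))   # True: stacking added nothing
print(torch.allclose(lin2(torch.relu(lin1(x))), combined(x)))  # False: ReLU breaks the collapse
```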
28
What does view do in PyTorch?
Changes the shape of a tensor without changing its contents
29
What is a Python Partial?
When you have a function, you can create a partial of that function with some arguments "pre-set". The new version of the function (almost like an alias) has a new signature without the arguments you pre-set.
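A minimal sketch using functools.partial:

```python
from functools import partial

def power(base, exponent):
    return base ** exponent

square = partial(power, exponent=2)   # "pre-set" the exponent argument
print(square(5))   # 25 -- square only needs the remaining argument
```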
30
Mapping between loss function and type of NN
nn.CrossEntropyLoss for single-label classification
nn.BCEWithLogitsLoss for multi-label classification
nn.MSELoss for regression
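A quick sketch of the target format each loss expects (the tensors are purely illustrative):

```python
import torch
import torch.nn as nn

logits = torch.randn(4, 3)  # batch of 4, 3 classes/outputs

# single-label: targets are integer class indices
print(nn.CrossEntropyLoss()(logits, torch.tensor([0, 2, 1, 1])))

# multi-label: targets are float 0/1 indicators, one per class
print(nn.BCEWithLogitsLoss()(logits, torch.tensor([[1., 0., 1.],
                                                   [0., 0., 1.],
                                                   [1., 1., 0.],
                                                   [0., 1., 0.]])))

# regression: targets are continuous values of the same shape as the output
print(nn.MSELoss()(logits, torch.randn(4, 3)))
```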
31
Describe Image Regression (in the context of ML/AI)
Predicting continuous numerical values from images. Examples: the locations of points in an image, or points & depth, etc.
32
When is normalization most important?
When doing transfer learning. When using a pre-trained model, your "new" data needs to match the old data (statistically speaking). This is why published models are also published with their statistics, so that anyone using them can match them.
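For example, torchvision models pre-trained on ImageNet are published with the ImageNet mean and standard deviation, which you apply to your own data (a minimal sketch):

```python
from torchvision import transforms

# the statistics published alongside ImageNet-pretrained models
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])

preprocess = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    normalize,   # your "new" data now matches the pre-training statistics
])
```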
33
What is Test Time Augmentation?
The practice of using augmentation during testing (or inference, or validation). Augmentation is usually applied only during preprocessing of the training data; that is why this is called "Test Time" Augmentation.
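A minimal sketch of the idea: average the model's predictions over several augmented views of the same input (here just a horizontal flip; the model and image are toy placeholders):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # toy classifier
image = torch.randn(1, 3, 32, 32)

views = [image, torch.flip(image, dims=[-1])]   # original + horizontal flip
with torch.no_grad():
    probs = torch.stack([model(v).softmax(dim=1) for v in views]).mean(dim=0)
print(probs.argmax(dim=1))   # prediction averaged over the augmented views
```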
34
What is Mixup?
A data augmentation technique that blends two inputs and their targets. This is introduced as a new data point.
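A minimal sketch of mixup on a single pair of examples (one-hot targets and a Beta-distributed mixing coefficient; the values are illustrative):

```python
import torch

x1, x2 = torch.randn(3, 32, 32), torch.randn(3, 32, 32)          # two images
y1, y2 = torch.tensor([1., 0., 0.]), torch.tensor([0., 0., 1.])  # one-hot targets

lam = torch.distributions.Beta(0.4, 0.4).sample()   # mixing coefficient in [0, 1]
x_mix = lam * x1 + (1 - lam) * x2                    # blended input
y_mix = lam * y1 + (1 - lam) * y2                    # blended target, e.g. [0.7, 0, 0.3]
```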
35
What is Label Smoothing?
A regularization technique applied to the targets that removes the "absolute" nature of target vectors. Instead of using a one-hot target like [0, 1, 0, 0], we use something like [0.01, 0.97, 0.01, 0.01].
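In PyTorch this is built into the cross-entropy loss (a minimal sketch; the smoothing value 0.1 is just an assumption):

```python
import torch
import torch.nn as nn

logits = torch.randn(4, 4)
targets = torch.tensor([1, 0, 3, 2])

# label_smoothing spreads a little probability mass onto the wrong classes
loss = nn.CrossEntropyLoss(label_smoothing=0.1)(logits, targets)
print(loss)
```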