Quiz #3 Flashcards
What three partial derivatives must we calculate for backpropagation in a convolutional layer?
- dL/dh_in = dL/dh_out * dh_out/dh_in (i.e. the partial derivative of the loss w.r.t. the input from the previous layer. This is what gets passed back to the previous layer.)
- dL/dK = dL/dh_out * dh_out/dK (i.e. the partial derivative of the loss w.r.t. the kernel values)
- dL/dh_out (i.e. the partial derivative of the loss w.r.t. the output of the current layer. Remember that this is given because it is the “upstream gradient”)
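To make these gradients concrete, here is a minimal NumPy sketch of a single-channel, stride-1, no-padding convolutional layer (the function and variable names are my own, not from the lectures). Note how dL/dK and dL/dx each accumulate a contribution from every output pixel:

```python
import numpy as np

def conv2d_forward(x, k):
    """Naive valid cross-correlation: y[r, c] = sum_{a, b} x[r + a, c + b] * k[a, b]."""
    kh, kw = k.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    y = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            y[r, c] = np.sum(x[r:r + kh, c:c + kw] * k)
    return y

def conv2d_backward(x, k, dL_dy):
    """Given the upstream gradient dL/dy, accumulate dL/dK and dL/dx."""
    dL_dk = np.zeros_like(k)
    dL_dx = np.zeros_like(x)
    kh, kw = k.shape
    for r in range(dL_dy.shape[0]):
        for c in range(dL_dy.shape[1]):
            # every output pixel contributes to every kernel weight...
            dL_dk += dL_dy[r, c] * x[r:r + kh, c:c + kw]
            # ...and to the input pixels the kernel touched at this output position
            dL_dx[r:r + kh, c:c + kw] += dL_dy[r, c] * k
    return dL_dk, dL_dx
```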
When calculating dL/dK, a kernel pixel does not affect all the values in the output? (True/False)
False, a kernel pixel impacts all the values of the output map. This is because we stride the kernel across the image and the weights are shared across every output position.
In a convolutional layer, when calculating the partial derivative of the loss w.r.t. the kernel (dL/dK), we must incorporate ALL the upstream gradients and apply the chain rule over all the output pixels? (True/False)
True. This is because a single kernel pixel impacts the entire output since the kernel is strided across the image and weights are shared.
If a node in a computation graph impacts multiple values in the output, what operation must be applied in the backward pass to ensure that information from each of those individual connections is incorporated in the backprop update?
We SUM the gradients from each of the upstream connections.
If we take the partial derivative of the output pixel located at (r, c) w.r.t. the kernel pixel located at (a’, b’), what expression represents the value of dY(r,c)/dK(a’,b’) if a’=b’=0?
dY(r,c)/dK(a’,b’) = x(r + a’, c + b’), so if a’=b’=0 then the derivative for this location is simply x(r, c)
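A quick finite-difference sanity check of this card (random data, valid cross-correlation as on the earlier sketch; the setup is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 5))   # input
k = rng.standard_normal((3, 3))   # kernel

def y_at(r, c, kernel):
    """Output pixel Y(r, c) of a valid cross-correlation: sum_{a, b} x(r + a, c + b) * K(a, b)."""
    return np.sum(x[r:r + 3, c:c + 3] * kernel)

r, c, a, b = 1, 2, 0, 0           # output location (r, c); kernel location (a', b') = (0, 0)
eps = 1e-6
k_pert = k.copy()
k_pert[a, b] += eps

finite_diff = (y_at(r, c, k_pert) - y_at(r, c, k)) / eps
print(finite_diff, x[r + a, c + b])   # both are (approximately) x(r, c), as the card states
```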
When calculating the partial derivatives for backpropagation in a convolutional layer, it is unnecessary to calculate the partial derivative of the loss L with respect to the input x (i.e. dL/dx) because that derivative does not impact the kernel weight value updates? (True/False).
False. While it’s true that dL/dx isn’t needed for updating the kernel values, this derivative is important because it is the gradient that gets passed back to the previous layer.
What gradient needs to be calculated in order to pass back to the previous layer?
dL/dx, i.e. the partial derivative of the loss w.r.t the input of the current layer.
For input pixel x(r’, c’), what impact does this pixel have on the output when calculating the gradient dL/dx?
It impacts only the output pixels in the neighborhood around it, i.e. the output positions where the kernel overlaps that input pixel.
When calculating the gradient of the loss w.r.t. the input x (dL/dx), every pixel in the output is impacted by a given input pixel? (True/False)
False. Since we stride the kernel across the input x, only the output pixels whose kernel window touches that input pixel are affected. When computing dL/dx we sum the gradients over those output positions, so each input pixel accumulates contributions only from its neighboring region.
When calculating the gradient for a max pooling layer, every input pixel into the max pool layer impacts the gradient? (True/False)
False. Max pooling performs dimensionality reduction by keeping only the maximum pixel within each kernel region. Since only that pixel contributes to the output, the upstream gradient is routed back to it alone; the gradients with respect to every other pixel in the region are zero.
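A minimal NumPy sketch of that gradient routing for a 2x2 max pool with stride 2 (single channel; names are my own):

```python
import numpy as np

def maxpool2x2_forward(x):
    """2x2 max pool, stride 2; also record which input pixel was the max in each window."""
    H, W = x.shape
    y = np.zeros((H // 2, W // 2))
    argmax = {}
    for r in range(0, H, 2):
        for c in range(0, W, 2):
            window = x[r:r + 2, c:c + 2]
            i, j = np.unravel_index(np.argmax(window), window.shape)
            y[r // 2, c // 2] = window[i, j]
            argmax[(r // 2, c // 2)] = (r + i, c + j)
    return y, argmax

def maxpool2x2_backward(dL_dy, argmax, x_shape):
    """Route each upstream gradient to the pixel that was the max; every other pixel gets zero."""
    dL_dx = np.zeros(x_shape)
    for (r, c), (i, j) in argmax.items():
        dL_dx[i, j] += dL_dy[r, c]
    return dL_dx
```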
A single pixel deep in a multi-layered CNN is only sensitive to the receptive field from the n-1 layer? (True/False)
False. A single pixel in the deeper layers is impacted by a larger receptive field from the previous layer, which in turn is influenced by a larger receptive field from the layer before it, and so on. This is what gives CNNs their representational power.
What was the first major 21st century CNN architecture and when was it introduced?
AlexNet in 2012
We tend to use fewer convolutional kernels (i.e. feature maps) as we go deeper into the network? (True/False)
False, generally speaking. The number of feature maps typically increases as we go deeper, while the spatial dimensions shrink.
What was the first modern CNN architecture to use ReLU instead of sigmoid or tanh?
AlexNet
What activation function is used in AlexNet?
ReLU (it was the first to do this)
What are the 5 key aspects of the AlexNet architecture (per the lectures)?
- ReLU instead of sigmoid or tanh
- Specialized normalization layers
- PCA-based data augmentation
- Dropout
- Ensembling (7 models were trained together)
As we go deeper into a CNN, the receptive field increases?
True
What layers use the most memory and why?
Convolutional layers. We have to store the activations from the forward pass because the gradient calculation requires them in the backward pass. Since the output of the forward pass is so large (we’re striding across the entire image, remember), this leads to a large memory footprint.
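A back-of-the-envelope illustration (a VGG-style first conv layer on a 224x224x3 input; the numbers are only there to show the activations-versus-parameters imbalance):

```python
# VGG-style first conv layer: 3x3 kernels, 3 input channels, 64 output channels,
# padding 1 so the 224x224 spatial size is preserved.
H, W, C_in, C_out, K = 224, 224, 3, 64, 3

activation_values = H * W * C_out       # output map kept around for the backward pass
weight_values = K * K * C_in * C_out    # kernel parameters

print(activation_values)   # 3211264 values per image
print(weight_values)       # 1728 parameters
```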
Convolutional layers tend to have more parameters than FC layers? (True/False)
False. Convolutional layers have a higher memory footprint, but FC layers have many more parameters since every input neuron is connected to every output neuron.
What layers tend to have the most parameters and why?
Fully connected layers. This is because (as implied by the name) every input is connected to every output, so the number of weights is the product of the input and output sizes.
For a fully connected layer with 12 input neurons, 10 output neurons and 3 channels, how many parameters are there (excluding bias terms)?
12 × 10 × 3 = 360
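A quick PyTorch sanity check (this assumes the 3 channels are flattened into the input, so the layer maps 12 × 3 = 36 inputs to 10 outputs):

```python
import torch.nn as nn

fc = nn.Linear(in_features=12 * 3, out_features=10, bias=False)
print(sum(p.numel() for p in fc.parameters()))   # 360 = 12 * 10 * 3
```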
What are the two key aspects of the VGG architecture?
- Repeated application of blocks:
- 3x3 conv (stride=1, padding=1)
- 2x2 max pool (stride=2)
- Very large number of parameters (mostly from big FC layers)
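A rough PyTorch sketch of that repeated block (the channel widths and conv counts below are illustrative, not the exact VGG configuration):

```python
import torch.nn as nn

def vgg_block(in_ch, out_ch, num_convs):
    """One VGG-style block: repeated 3x3 convs (stride 1, padding 1), then a 2x2 max pool (stride 2)."""
    layers = []
    for i in range(num_convs):
        layers.append(nn.Conv2d(in_ch if i == 0 else out_ch, out_ch,
                                kernel_size=3, stride=1, padding=1))
        layers.append(nn.ReLU(inplace=True))
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

# Stacking blocks: channels grow while the spatial size halves after each block
features = nn.Sequential(vgg_block(3, 64, 2), vgg_block(64, 128, 2), vgg_block(128, 256, 3))
```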
What are some of the main architectural differences between VGG and AlexNet?
- AlexNet used a large stride, but this loses information. VGG uses a much smaller stride (1 for conv layers, 2 for max pool) to preserve information.
Roughly how many trainable parameters are required for VGG architectures versus AlexNet?
Hundreds of millions for VGG compared to 60-70M for AlexNet
What are some of the key ideas used in the Inception architecture?
- Repeated blocks
- Multiscale features (i.e. concatenating convolutional features created using different kernel sizes and using the concatenated stack as the final output map).
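A simplified PyTorch sketch of the multiscale idea (the real Inception blocks also use 1x1 bottlenecks and a pooling branch, omitted here):

```python
import torch
import torch.nn as nn

class MultiScaleBlock(nn.Module):
    """Parallel convs at different kernel sizes, concatenated along the channel dimension."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.branch3 = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(in_ch, out_ch, kernel_size=5, padding=2)

    def forward(self, x):
        # Every branch keeps the same spatial size, so the feature maps can be stacked
        return torch.cat([self.branch1(x), self.branch3(x), self.branch5(x)], dim=1)
```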
What is one of the downsides of the Inception architecture?
The use of multiscale features means that if each block uses N different kernel sizes, we have to perform N convolutions instead of just one as in a normal architecture, which increases the computational cost.
What is the key idea of Residual Blocks?
- Help prevent issues with vanishing gradients
- Allow information from a layer to propagate to any future layer (forwards or backwards!)
They are useful because they improve gradient flow.
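A simplified PyTorch sketch of a residual block (no batch norm or downsampling, just to show the identity shortcut):

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two convs plus a skip connection that adds the input back to the output."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + x)   # the identity path gives gradients a direct route back
```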
What is optimization error?
It is the idea that even if your NN can theoretically perfectly model the world, there’s no guarantee that your optimization algorithm can find an optimal set of weights that will achieve that level of performance.
What are the three types of error that pose a challenge to generalization?
- Optimization error
- Estimation error
- Modeling error
What is estimation error?
It is the idea that even if we find a set of weights that works well on the training set, there isn’t a guarantee that it will generalize to the test data. This could be because of overfitting, learning features that are good for the training set but don’t generalize to the test set, etc.
What is modeling error?
It is the idea that there may be a disconnect between how the world actually works (reality) versus what the model is actually capable of representing. This could be because of insufficient capacity of the model, or using a model that isn’t suited to the task (for example, trying to use simple multi-class logistic regression for semantic segmentation; there’s no set of weights that could reasonably manage that complexity with such a simple model).
In the context of transfer learning, when performing fine tuning we only update the parameters in the last layer? (True/False)
False. When fine-tuning, all the parameters are updated by training the pre-trained model on our smaller, domain-specific dataset.
In the context of transfer learning, when freezing the feature layers, only the weights in the final layer are updated during training? (True/False)
True (this is often done when there isn’t enough data to train from scratch).
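A minimal PyTorch sketch of the two options (torchvision ResNet-18 with ImageNet weights and a 10-class target task are just placeholders):

```python
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1")

# Feature freezing: turn off gradients for all pre-trained layers...
for p in model.parameters():
    p.requires_grad = False

# ...and train only the newly added final layer on the target dataset.
model.fc = nn.Linear(model.fc.in_features, 10)

# Fine-tuning would instead leave requires_grad = True everywhere, so all
# parameters get updated on the smaller, domain-specific dataset.
```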
What are two reasons you might want to reconsider using transfer learning for some specific problem?
- If the source dataset you train on is very different from the target dataset
- If you have enough data for the target domain (if so, then probably the only benefit of using transfer learning will be faster convergence)
What are four visualization methods we can use to try to understand what a trained NN has learned?
- Weights
- Activations (output maps)
- Gradients
- Robustness to perturbation
Using dimensionality reduction, we can plot the activations of any layer (conv, linear, etc.) in 2D to try to understand the output space visually? (True/False)
True. PCA and t-SNE (most common) are frequently used to do this.
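A small sketch with scikit-learn (random numbers stand in for real layer activations; one row per example):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

activations = np.random.randn(500, 256)   # e.g. flattened outputs of some layer

coords_pca = PCA(n_components=2).fit_transform(activations)
coords_tsne = TSNE(n_components=2).fit_transform(activations)
# Both are (500, 2) arrays that can be scatter-plotted and colored by class label
```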
What is a Saliency Map and what is it useful for?
The idea behind a saliency map is that we can backprop through a network all the way back to the image (or any arbitrary point in the computation graph) and look at the sensitivity of the loss to individual pixel changes. Large sensitivity implies important pixels.
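A minimal PyTorch sketch of a saliency map (pre-trained ResNet-18 and a random tensor standing in for a preprocessed image):

```python
import torch
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1").eval()
image = torch.randn(1, 3, 224, 224, requires_grad=True)   # stand-in for a real image

scores = model(image)                 # pre-softmax class scores
idx = scores.argmax(dim=1).item()     # predicted class
scores[0, idx].backward()             # backprop that score all the way to the image pixels

saliency = image.grad.abs().max(dim=1)[0]   # per-pixel sensitivity, shape (1, 224, 224)
```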
When visualizing gradients of loss w.r.t. an input image, why do we use the gradient of the classifier scores BEFORE the softmax layer?
Because with the softmax, the loss can also be improved by “pushing down” the scores of the non-predicted classes, so the gradient w.r.t. the image mixes in effects unrelated to the class of interest; the pre-softmax score isolates it.
What is guided backprop used for?
Many areas of an input image might actually DECREASE the feature activations. This can make trying to visualize gradients difficult. Guided backprop zeros out the negative gradients so that we only see the POSITIVE contributions to the activation.
Why is optimizing the input image to GENERATE examples to increase class scores or activations useful, and how do we do this in practice?
It can be used to aid interpretability. Specifically, it can visually show us a great deal about what examples (not in the training set) are able to activate the network. We can do this by performing gradient ascent instead of descent.
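A rough sketch of that gradient ascent loop (the target class, step size, and L2 regularizer are arbitrary choices for illustration):

```python
import torch
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1").eval()
for p in model.parameters():
    p.requires_grad_(False)            # only the image is optimized, not the weights

target_class = 130                     # arbitrary class index
img = torch.zeros(1, 3, 224, 224, requires_grad=True)
optimizer = torch.optim.SGD([img], lr=1.0)

for _ in range(100):
    optimizer.zero_grad()
    score = model(img)[0, target_class]
    loss = -score + 1e-3 * img.norm()  # minimizing -score = gradient ASCENT on the class score
    loss.backward()
    optimizer.step()
```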