Quiz #3 Flashcards

1
Q

What three partial derivatives must we calculate for backpropagation in a convolutional layer?

A
  1. dL/dh_in = dL/dh_out * dh_out/dh_in (i.e. the partial derivative of the loss w.r.t. the input from the previous layer. This is what gets passed back to the previous layer.)
  2. dL/dk = dL/dh_out * dh_out/dk (i.e. the partial derivative of the loss w.r.t. the kernel values)
  3. dL/dh_out (i.e. the partial derivative of the loss w.r.t. the output of the current layer. Remember that this is given because it is the “upstream gradient”.)
2
Q

When calculating dL/dK, a kernel pixel does not affect all the values in the output? (True/False)

A

False; it impacts all the values of the output map. This is because we stride the kernel across the image and share the same weights at every location in the output map.

3
Q

In a convolutional layer, when calculating the partial derivative of the loss w.r.t. the kernel (dL/dK), we must incorporate ALL the upstream gradients and apply the chain rule over all the output pixels? (True/False)

A

True. This is because a single kernel pixel impacts the entire output since the kernel is strided across the image and weights are shared.

4
Q

If a node in a computation graph impacts multiple values in the output, what operation must be applied in the backward pass to ensure that information from each of those individual connections is incorporated in the backprop update?

A

We SUM the gradients from each of the upstream connections.
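A minimal numeric sketch (my own, not from the lectures) of this summation rule at a fan-out node, checked against a finite difference:

# Node w feeds TWO downstream outputs: y1 = 3*w and y2 = w**2, with L = y1 + y2.
# By the multivariable chain rule, dL/dw = dL/dy1 * dy1/dw + dL/dy2 * dy2/dw,
# i.e. we SUM the gradient contributions from each upstream connection.
w = 2.0
analytic = 3.0 + 2.0 * w                  # 3 (from y1) + 2w (from y2) = 7

L = lambda w: 3.0 * w + w ** 2            # scalar loss through both paths
eps = 1e-6
numeric = (L(w + eps) - L(w - eps)) / (2 * eps)
print(analytic, round(numeric, 4))        # 7.0 7.0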

5
Q

If we take the partial derivative of the output pixel located at (r, c) w.r.t the kernel pixel located at (a’, b’), what expression represents the value of dY(r,c)/dK(a’,b’) if a’=b’=0?

A

dY(r,c)/dK(a’,b’) = x(r + a’, c + b’), so if a’=b’=0 then the derivative for this location is simply x(r, c).
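A quick numeric check of this (my own sketch; xcorr2d is a hypothetical helper implementing the stride-1, no-padding cross-correlation forward pass):

import numpy as np

def xcorr2d(x, k):
    # Y(r,c) = sum_{a,b} x(r+a, c+b) * K(a,b)
    kh, kw = k.shape
    H, W = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    y = np.zeros((H, W))
    for r in range(H):
        for c in range(W):
            y[r, c] = np.sum(x[r:r+kh, c:c+kw] * k)
    return y

rng = np.random.default_rng(0)
x, k = rng.standard_normal((5, 5)), rng.standard_normal((3, 3))
r, c, eps = 1, 2, 1e-6

k_plus = k.copy()
k_plus[0, 0] += eps                                 # perturb K(0, 0)
dY = (xcorr2d(x, k_plus)[r, c] - xcorr2d(x, k)[r, c]) / eps
print(np.isclose(dY, x[r, c]))                      # True: dY(r,c)/dK(0,0) = x(r,c)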

6
Q

When calculating the partial derivatives for backpropagation in a convolutional layer, it is unnecessary to calculate the partial derivative of the loss L with respect to the input x (i.e. dL/dx) because that derivative does not impact the kernel weight value updates? (True/False).

A

False. While it’s true that dL/dx isn’t needed for updating the kernel values, this derivative is important because it is the gradient that gets passed back to the previous layer.

7
Q

What gradient needs to be calculated in order to pass back to the previous layer?

A

dL/dx, i.e. the partial derivative of the loss w.r.t the input of the current layer.

8
Q

For input pixel x(r’, c’), what impact does this pixel have on the output when calculating the gradient dL/dx?

A

It impacts only the neighborhood of output pixels around it (those whose kernel window touches it).

9
Q

When calculating the loss w.r.t. the input x (dL/dx), each pixel in the output is impacted by the input pixel? (True/False)

A

False. Since we’re striding the kernel across the input x, only the output pixels whose kernel window touches that input pixel are impacted. For those pixels, we sum the gradients, which accounts for the input pixel’s impact on that whole neighboring region.

10
Q

When calculating the gradient for a max pooling layer, every input pixel into the max pool layer impacts the gradient? (True/False)

A

False. Max pooling performs dimensionality reduction by passing forward only the max pixel within each kernel region. Since only that one pixel affects the output, the gradients with respect to every other pixel in the region are zero.
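A rough sketch of the resulting backward pass (my own, assuming 2x2 windows with stride 2): the upstream gradient is routed only to the argmax position in each window.

import numpy as np

def maxpool_backward(x, dY):
    dX = np.zeros_like(x)
    for r in range(dY.shape[0]):
        for c in range(dY.shape[1]):
            patch = x[2*r:2*r+2, 2*c:2*c+2]
            i, j = np.unravel_index(np.argmax(patch), patch.shape)
            dX[2*r + i, 2*c + j] = dY[r, c]   # all non-max pixels stay zero
    return dX

x = np.array([[1., 3., 2., 0.],
              [4., 2., 1., 1.],
              [0., 1., 5., 2.],
              [2., 0., 1., 3.]])
print(maxpool_backward(x, np.ones((2, 2))))   # one nonzero entry per window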

11
Q

A single pixel deep in a multi-layered CNN is only sensitive to the receptive field from the n-1 layer? (True/False)

A

False. A single pixel in the deeper layers is impacted by a larger receptive field from the previous layer, which in turn is influenced by a larger receptive field from the layer before it, and so on. This is what gives CNNs their representational power.

12
Q

What was the first major 21st century CNN architecture and when was it introduced?

A

AlexNet in 2012

13
Q

We tend to use fewer convolutional kernels (i.e. feature maps) as we go deeper into the network? (True/False)

A

False, generally speaking. The number of kernels (and thus feature maps) typically increases with depth, while the spatial dimensions shrink.

14
Q

What was the first modern CNN architecture to use ReLU instead of sigmoid or tanh?

A

AlexNet

15
Q

What activation function is used in AlexNet?

A

ReLU (it was the first to do this)

16
Q

What are the 5 key aspects of the AlexNet architecture (per the lectures)?

A
  1. ReLU instead of sigmoid or tanh
  2. Specialized normalization layers
  3. PCA-based data augmentation
  4. Dropout
  5. Ensembling (7 models were trained and their predictions combined)
17
Q

As we go deeper into a CNN, the receptive field increases? (True/False)

A

True

18
Q

What layers use the most memory and why?

A

Convolutional layers. We have to store the activations from the forward pass because the gradient calculation requires them in the backward pass. Since the output of the forward pass is so large (remember, we’re striding across the entire image), this leads to a large memory footprint.

19
Q

Convolutional layers tend to have more parameters than FC layers? (True/False)

A

False. Convolutional layers have a higher memory footprint, but FC layers have many more parameters since every weight is connected.

20
Q

What layers tend to have the most parameters and why?

A

Fully connected layers. This is because (as implied by the name) every output neuron is connected to every input neuron.

21
Q

For a fully connected layer with 12 input neurons, 10 output neurons and 3 channels, how many parameters are there (excluding bias terms)?

A

12 * 10 * 3 = 360

22
Q

What are the two key aspects of the VGG architecture?

A
  1. Repeated application of blocks:
    • 3x3 conv (stride=1, padding=1)
    • 2x2 max pool (stride=2)
  2. Very large number of parameters (mostly from big FC layers)
23
Q

What are some of the main architectural differences between VGG and AlexNet?

A
  1. AlexNet used a large stride, but this loses information. VGG uses a much smaller stride (1 for conv layers, 2 for max pool) to preserve information.
24
Q

Roughly how many trainable parameters are required for VGG architectures versus Alexnet?

A

Hundreds of millions for VGG compared to 60-70M for AlexNet

25
Q

What are some of the key ideas used in the Inception architecture?

A
  1. Repeated blocks
  2. Multiscale features (i.e. concatenating convolutional features created using different kernel sizes and using the concatenated stack as the final output map).
26
Q

What is one of the downsides of the Inception architecture?

A

The use of multiscale features means that if each layer uses N multiscale convolutional features, we have to perform N convolutions per layer instead of just one, as in a typical architecture.

27
Q

What is the key idea of Residual Blocks?

A
  1. Help prevent issues with vanishing gradients
  2. Allow information from a layer to propagate to any future layer (forwards or backwards!)

They are useful because they improve gradient flow.
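A minimal PyTorch-style sketch of the idea (my own simplified, channel-preserving block, not the exact ResNet configuration):

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + x)   # identity skip: y = F(x) + x

y = ResidualBlock(16)(torch.randn(1, 16, 8, 8))
print(y.shape)   # torch.Size([1, 16, 8, 8]); gradients flow directly through "+ x"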

28
Q

What is optimization error?

A

It is the idea that even if your NN can theoretically perfectly model the world, there’s no guarantee that your optimization algorithm can find an optimal set of weights that will achieve that level of performance.

29
Q

What are the three types of error that pose a challenge to generalization?

A
  1. Optimization error
  2. Estimation error
  3. Modeling error
30
Q

What is estimation error?

A

It is the idea that even if we find a set of weights that works well on the training set, there isn’t a guarantee that it will generalize to the test data. This could be because of overfitting, learning features that are good for the training set but don’t generalize to the test set, etc.

31
Q

What is modeling error?

A

It is the idea that there may be a disconnect between how the world actually works (reality) and what the model is capable of representing. This could be because of insufficient model capacity, or using a model that isn’t suited to the task (for example, trying to use simple multi-class logistic regression for semantic segmentation; no set of weights for such a simple model could reasonably capture that complexity).

32
Q

In the context of transfer learning, when performing fine tuning we only update the parameters in the last layer? (True/False)

A

False. When fine-tuning, all the parameters are updated by training the pre-trained model on our smaller, domain-specific dataset.

33
Q

In the context of transfer learning, when freezing the feature layers, only the weights in the final layer are updated during training? (True/False)

A

True (this is often done when there isn’t enough data to train from scratch).

34
Q

What are two reasons you might want to reconsider using transfer learning for some specific problem?

A
  1. If the source dataset you train on is very different from the target dataset
  2. If you have enough data for the target domain (if so, then probably the only benefit of using transfer learning will be faster convergence)
35
Q

What are four visualization methods we can use to try to understand what a trained NN has learned?

A
  1. Weights
  2. Activations (output maps)
  3. Gradients
  4. Robustness to perturbation
36
Q

Using dimensionality reduction, we can plot the activations of any layer (conv, linear, etc.) in 2D to try to understand the output space visually? (True/False)

A

True. PCA and t-SNE (most common) are frequently used to do this.

37
Q

What is a Saliency Map and what is it useful for?

A

The idea behind a saliency map is that we can backprop through a network all the way back to the image (or any arbitrary point in the computation graph) and look at the sensitivity of the loss to individual pixel changes. Large sensitivity implies important pixels.

38
Q

When visualizing gradients of loss w.r.t. an input image, why do we use the gradient of the classifier scores BEFORE the softmax layer?

A

Because the softmax layer can also improve the loss by “pushing down” the scores of the non-predicted classes to try to improve separability.

39
Q

What is guided backprop used for?

A

Many areas of an input image might actually DECREASE the feature activations. This can make trying to visualize gradients difficult. Guided backprop zeros out the negative gradients so that we only see the POSITIVE contributions to the activation.

40
Q

Why is optimizing the input image to GENERATE examples to increase class scores or activations useful, and how do we do this in practice?

A

It can be used to aid interpretability. Specifically, it can visually show us a great deal about what examples (not in the training set) are able to activate the network. We can do this by performing gradient ascent instead of descent.

41
Q

What is a “Gram Matrix”?

A

It represents feature correlations, and can be used when performing style transfer to represent style/texture (as opposed to content).

42
Q

You have an input volume of 32×32×3. What are the dimensions of the resulting volume after convolving a 5×5 kernel with zero padding, stride of 1, and 2 filters?

A

28 x 28 x 2

Output Size: [(H - K_h + 2*P) / S] + 1
H: Input height
K_h: Kernel height
P: Padding
S: Stride

[(32 - 5 + 2*0) / 1] + 1 = 28

Since we have a square kernel and two feature maps (F=2), the final output is 28 x 28 x 2
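A small helper implementing the formula above (names are my own):

def conv_output_size(H, K_h, P=0, S=1):
    """One spatial dimension: [(H - K_h + 2*P) / S] + 1."""
    return (H - K_h + 2 * P) // S + 1

print(conv_output_size(32, 5))   # 28 -> with F=2 filters: 28 x 28 x 2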

43
Q

You have an input volume of 32×32×3 and convolve it with a 5×5 kernel with zero padding, stride of 1, and 2 filters. How many total weights and biases will be in this layer?

A

F(K_h * K_w * C + 1) = 2 * (5*5*3 + 1) = 152

F: Number of feature maps
K_h: Kernel height
K_w: Kernel width
C: Number of input channels

44
Q

Suppose you have an input volume of dimension 64x64x16. How many parameters would a single 1x1 convolutional filter have, including the bias?

A

F(K_h * K_w * C + 1) = 1 * (1*1*16 + 1) = 17

F: Number of feature maps
K_h: Kernel height
K_w: Kernel width
C: Number of input channels

45
Q

Suppose your input is a 300 by 300 color (RGB) image, and you use a convolutional layer with 100 filters that are each 5x5. How many parameters does this layer have including the bias parameters?

A

F(K_h * K_w * C + 1) = 100 * (5*5*3 + 1) = 7600

F: Number of feature maps
K_h: Kernel height
K_w: Kernel width
C: Number of input channels
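The same parameter-count formula as a helper, checked against the last three cards (names are my own):

def conv_params(F, K_h, K_w, C):
    """F filters, each with K_h * K_w * C weights plus one bias."""
    return F * (K_h * K_w * C + 1)

print(conv_params(2, 5, 5, 3))     # 152
print(conv_params(1, 1, 1, 16))    # 17
print(conv_params(100, 5, 5, 3))   # 7600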

46
Q

You have an input volume that is 63x63x16 and convolve it with 32 filters that are each 7x7, and stride of 1. You want to use a same convolution. What is the padding?

A

Answer: 3

Use the formula [(H - K_h + 2*P) / S] + 1 and solve for P:
P = (S(X_h - 1) - H + K_h) / 2

P = (1*(63 - 1) - 63 + 7) / 2 = 3

X_h: Output height
H: Input height
K_h: Kernel height
P: Padding
S: Stride
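The padding solve as a short helper (my own naming, assuming the same-convolution case X_h = H):

def same_padding(H, K_h, S=1):
    """P = (S*(H - 1) - H + K_h) / 2 for a same convolution."""
    return (S * (H - 1) - H + K_h) // 2

print(same_padding(63, 7))   # 3
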
47
Q

What is the resulting volume of padding a 15x15x8 input volume using pad=2?

A

19x19x8

48
Q

What is the output volume of a 32x32x16 input data after applying max pooling with a filter of size 2 and stride = 2? (Assume no padding)

A

[(H - K_h) / S] + 1 = [(32 - 2) / 2] + 1 = 16, so the output volume is 16 x 16 x 16 (pooling preserves the channel depth)

49
Q

If a feature (such as a bird’s beak) were translated a little bit, the location of the output values from the convolutional layer would remain unchanged? (True/False)

A

False. Convolution has the property of ‘Equivariance’. A translation of the feature results in the output being shifted by the same amount.

50
Q

What are two important properties of convolution?

A
  1. Invariance (features with small transformations/deformations should still activate the output)
  2. Equivariance (no matter where the feature occurs in the image, the feature map will be activated, with the output values moving by the same translation)
51
Q

Explain the four different types of computer vision tasks.

A
  1. Classification: Class distribution per image
  2. Semantic Segmentation: Class distribution per pixel
  3. Object Detection: List of bounding boxes with class distribution per box
  4. Instance Segmentation: Class distribution per pixel with unique ID
52
Q

What is the output shape of a network used for semantic segmentation?

A

H x W x number of classes

53
Q

Fully connected layers are good for retaining spatial information? (True/False)

A

False. FC layers flatten their input into a vector, so they don’t explicitly retain spatial information.

54
Q

In an encoder/decoder type network, how do we perform the de-convolution required in the decoder side of the network?

A

Transposed convolution, which reverses the spatial down-sampling of the usual convolution operation (it is a transpose, not a true mathematical inverse, of convolution).

55
Q

How does a max unpooling layer work in an encoder/decoder type network?

A

We simply cache the location of the max elements in the encoder, then on the decoder side, we set the same location to the max value, and zero the rest of the elements in that patch. It effectively upsamples the input.
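A rough sketch of the operation (my own; assumes 2x2 windows with stride 2 and an indices dict cached by the encoder’s max pooling layer):

import numpy as np

def max_unpool(y, indices, out_shape):
    out = np.zeros(out_shape)
    for (r, c), (i, j) in indices.items():
        out[2*r + i, 2*c + j] = y[r, c]   # restore max at cached location
    return out                            # rest of each patch stays zero

indices = {(0, 0): (1, 0), (0, 1): (0, 1)}   # argmax offsets from the encoder
y = np.array([[5., 7.]])
print(max_unpool(y, indices, (2, 4)))        # upsampled 1x2 -> 2x4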

56
Q

In an encoder/decoder type network, in a max unpooling layer, contributions from multiple windows must be multiplied together? (True/False)

A

False. They should be summed, not multiplied.

57
Q

When is an encoder/decoder type network considered to be symmetric?

A

This occurs when the decoder side of the network is an exact 1-to-1 inverse of the encoder network. So if the encoder network was [conv2d, maxpool, conv2d, maxpool], the symmetric decoder would be [deconv2d, maxunpool, deconv2d, maxunpool], and each of the conv/deconv/pool/unpool layers would have the exact same kernel sizes, strides, etc.

58
Q

There are two learnable parameters in a max unpooling layer? (True/False)

A

False. There are no learnable parameters. This is actually one of the challenges of unpooling layers: we’re not actually learning how to upsample, we’re simply using the indices of the max from the encoder stage.

59
Q

It is not possible to learn a kernel to upsample an image with a convolution type layer? (True/False)

A

False. This is where we use Transposed Convolution (a.k.a fractionally strided convolution).

60
Q

What is Transposed Convolution used for, and how do we perform it in practice?

A

It’s used to create a learnable kernel for upsampling an image (useful for encoder/decoder style networks). It works by taking each input pixel, multiplying it by a learnable kernel, and “stamping” it on the output.
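A minimal sketch of the “stamping” view (my own; stride-1, single-channel case with a fixed kernel standing in for the learnable one):

import numpy as np

def transposed_conv2d(x, k):
    kh, kw = k.shape
    out = np.zeros((x.shape[0] + kh - 1, x.shape[1] + kw - 1))
    for r in range(x.shape[0]):
        for c in range(x.shape[1]):
            out[r:r+kh, c:c+kw] += x[r, c] * k   # overlapping stamps are summed
    return out

x = np.array([[1., 2.],
              [3., 4.]])
print(transposed_conv2d(x, np.ones((2, 2))))   # 3x3 output from a 2x2 input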

61
Q

When performing Transpose Convolution, contributions from multiple windows are summed together? (True/False)

A

True.

62
Q

In an encoder/decoder style network, if we don’t want to use learnable deconvolutional kernels in the decoder side of the network, what variant on this approach can we take to accommodate that desire?

A

We can take the corresponding encoder kernel and rotate it 180 degrees. (This is less common than the learnable kernel in practice, but could be useful in situations where we want to reduce the number of parameters our network has to learn.)

63
Q

It is not possible to use a pre-trained backbone network trained on a basic classification task (i.e. transfer learning) for use in a semantic segmentation task? (True/False)

A

False. We would simply have to chop off the FC layers on the pre-trained network, then add in the corresponding symmetric decoder network.

64
Q

If we use a pre-trained network for an instance segmentation task, how many losses will there be in the network when training on our segmentation targets?

A

One per pixel. This is because for segmentation we get a probability distribution over all the classes at each pixel, so we simply apply Cross Entropy Loss on a per-pixel basis and backpropagate the per-pixel losses.
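A short PyTorch sketch of the per-pixel loss (my own; the 21-class shape is an arbitrary example):

import torch
import torch.nn.functional as F

logits = torch.randn(1, 21, 4, 4, requires_grad=True)  # (N, classes, H, W)
target = torch.randint(0, 21, (1, 4, 4))               # (N, H, W) class ids
loss = F.cross_entropy(logits, target)   # reduces over the 16 per-pixel losses
loss.backward()
print(logits.grad.shape)                 # torch.Size([1, 21, 4, 4])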

65
Q

What type of network architecture has been quite successful for image segmentation tasks, and how does it work?

A

U-Net. It uses a symmetric encoder/decoder network, but instead of decoding only from the bottleneck that sits between the encoder and decoder sides of the network, skip connections pass the features from each “scale” of the encoder to the corresponding stage of the decoder.

66
Q

Convolutional layers cannot accept arbitrary input sizes? (True/False)

A

False. This is one big advantage of conv layers over FC layers: since we stride the kernel over the entire input, it doesn’t matter what the input size is.

67
Q

What type of network is popular for image-to-image type tasks?

A

Encoder/decoder style networks.

68
Q

What task is being performed in object detection?

A

Given an image, we want to output a list of bounding boxes with a probability distribution over classes per box.

69
Q

What two challenges arise in object detection tasks?

A
  1. Variable number of bounding boxes possible
  2. Need to determine candidate regions (position and scale) first

70
Q

What four numbers are important in object detection networks?

A

The bounding box, which can be defined as the location of the upper-left hand corner at (x, y), and the width W and height H of the box.

71
Q

Describe the multi-headed architecture used in an object detection network?

A
  • One head that predicts distribution over class labels (classification problem)
  • One head that predicts the location of the bounding box for each image region (regression problem)
72
Q

The two heads used in an object detection network use SEPARATE features? (True/False)

A

False. This is one advantage of deep learning. Both heads can SHARE features, and then be jointly optimized (by summing the gradients)

73
Q

For object detection networks, we feed in multiple gridded images (with different grid sizes for each image) to facilitate detection of objects at multiple scales? (True/False)

A

True. Redundant boxes are then combined using Non-Maximal Suppression (NMS).

74
Q

What is one advantage of the YOLO architecture for object detection?

A

It’s faster than many other alternatives because it looks for objects at a SINGLE SCALE.

75
Q

What is one type of metric that can be used for evaluating object detection tasks?

A

Mean Average Precision (mAP)

See 10:00 mark in Module 2 Lesson 9: Single Stage Object Detection for calculation details

76
Q

How does a two-stage object detection network work?

A

Instead of making dense predictions, we decompose the problem into two steps:

  1. Find regions of interest with object-like things
  2. Classify those regions (and refine bounding boxes)
77
Q

What is the key idea used in the Faster R-CNN algorithm?

A

The main idea is to also use the NN to generate the region proposals. It outputs an objectness score, and then the Top K regions are selected for classification.

78
Q

One challenge with the Faster R-CNN algorithm is that some parts (gradient w.r.t. bounding box coordinates) are not differentiable? (True/False)

A

True.

79
Q

In two-stage object detection, what are some ways we can generate region of interest (ROI) proposals?

A
  1. Unsupervised learning (inefficient)
  2. Map each ROI in the image to the corresponding feature maps (this way we don’t repeat conv for overlapping regions). The challenge with this is we end up with variable input sizes into the FC layers (this can be remedied by using pooling to convert to a fixed size)
  3. Use the network to predict region proposals with an “objectness score”, and take the Top K regions.
80
Q

What is the key idea used in the Fast R-CNN algorithm?

A

To REUSE computation by finding regions in FEATURE MAPS. One of the main challenges with this though is that you end up with variable input sizes to the FC layers. This can be remedied by using pooling to convert to a fixed grid size.

81
Q

At both steps in two-stage object detection networks, the bounding boxes are refined through regression? (True/False)

A

True.

82
Q

It is not possible to use a pre-trained backbone network trained on a basic classification task (i.e. transfer learning) for use in an object detection task? (True/False)

A

False.

83
Q

What are some of the main concepts to consider in the context of Machine Learning Bias and Fairness?

A
  1. Fairness
  2. Accountability
  3. Transparency
  4. Ethics
  5. Safety/Security
84
Q

Consideration of the ethics of AI is separable from the pure science/engineering side of the field? (True/False)

A

False.

85
Q

Why can technology be considered political?

A

For many reasons, but one of the most salient being that technology can be used to exert power and control over other people.

86
Q

What is Goodhart’s Law?

A

“Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.”

In other words, if you incentivize optimization, you’ll get overfitting.

87
Q

What are two ways that Goodhart’s Law is applicable to ML in the context of fairness?

A
  1. Use of benchmarks incentivizes the creation of algorithms that are well suited to those problems but might not generalize well outside that narrow domain (overfitting)
  2. Incompatible (i.e. can’t fulfill one measure while also fulfilling another) and incommensurable (not even able to compare) fairness measures.
88
Q

What is the definition of a WELL-CALIBRATED classifier?

A

A classifier is well-calibrated if, among the observations assigned a given probability score, the proportion that actually have the label equals that score.

Example: If a binary classifier gives a score of 0.8 to 100 observations, then 80 of them should be in the positive class.

89
Q

How is Group Calibration performed?

A

The scores for subgroups of interest are calibrated (or at least, equally mis-calibrated). This can be shown on a Reliability Plot (aka a Calibration Plot)

90
Q

Some models tend (empirically speaking) to have well-calibrated predictions, while other models tend to be overconfident? (True/False)

A

True. Logistic regression tends to be well-calibrated. DL models (e.g. ResNet) tend to be overconfident.

91
Q

What is Platt (Temperature) Scaling?

A

It’s a way of fitting overconfident models to be more well-calibrated. It’s done by taking a validation dataset and learning a transform that would make it well-calibrated.

92
Q

When performing Platt Scaling to improve model calibration, we use the same dataset as was used for original training? (True/False)

A

False. It’s very important to use a DIFFERENT dataset from the one used for training, although in practice it can be the same validation dataset that was used for other purposes (e.g. during training).

93
Q

For a binary classifier, how many parameters must be learned for Platt Scaling?

A

Two, ‘a’ and ‘b’. During the Platt Process, these parameters are learned such that the output of our model through a sigmoid function parameterized by ‘a’ and ‘b’ matches a well-calibrated output (as determined by, for example, our reliability diagram).

q_hat = sigmoid(a*z_i + b)
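A sketch of fitting ‘a’ and ‘b’ (my own; the synthetic “overconfident” logits are made up for illustration, and a one-feature logistic regression is one standard way to fit the Platt parameters):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
z_val = rng.standard_normal(1000) * 4.0   # held-out validation logits
# True label probability is sigmoid(z/4), so raw sigmoid(z) is overconfident.
y_val = (rng.random(1000) < 1 / (1 + np.exp(-z_val / 4))).astype(int)

platt = LogisticRegression().fit(z_val.reshape(-1, 1), y_val)
a, b = platt.coef_[0, 0], platt.intercept_[0]
q_hat = 1 / (1 + np.exp(-(a * z_val + b)))   # calibrated scores
print(a, b)                                  # a should come out near 1/4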

94
Q

What parameter must be learned for multi-class Platt Scaling?

A

The Temperature ‘T’: q_hat = max_k[softmax(z_i/T)]
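A sketch of learning T on held-out logits by minimizing the negative log-likelihood of softmax(z/T) (my own; the random logits and labels are placeholders):

import numpy as np
from scipy.optimize import minimize_scalar

def nll(T, logits, labels):
    z = logits / T
    logp = z - np.log(np.sum(np.exp(z), axis=1, keepdims=True))
    return -np.mean(logp[np.arange(len(labels)), labels])

rng = np.random.default_rng(0)
logits = rng.standard_normal((500, 10)) * 5.0   # overconfident validation logits
labels = rng.integers(0, 10, 500)

res = minimize_scalar(nll, bounds=(0.1, 20.0), args=(logits, labels),
                      method="bounded")
print(res.x)   # learned temperature T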

95
Q

What are the philosophical problems that arise from fairness techniques like Platt Scaling?

A

Platt scaling relies on segmenting people/things into groups. Ideally, we would like to treat people as individuals. Practically speaking, we have to consider groups to make the problem tractable, but how we select those groups (and what characteristics we will use to define them) will have inherent tradeoffs.

96
Q

What measure is used for “test fairness” for a binary classifier?

A

Positive Predictive Value (PPV): TP / (TP + FP)

97
Q

If we seek to equalize the FPR and FNR between two groups with DIFFERENT prevalences, the positive predictive value (PPV) will always be different? (True/False)

A

True. The PPV, FPR, FNR, etc. are all inter-related quantities based on the classification matrix. This is what gives rise to the “Fairness Impossibility Theorem(s)”.

98
Q

When does an instance of the “Fairness Impossibility Theorem” exist?

A

The impossibility theorem exists for ANY THREE (or more) measures of model performance derived (non-degenerately) from the confusion matrix.

In a system of equations with three or more equations, the prevalence ‘p’ is determined uniquely. If groups have different prevalences, these quantities CANNOT be equal.

99
Q

What are the two central insights that are incorporated into The Markkula Center Framework for Ethical Decision-Making?

A
  1. There are different ethical perspectives.
  2. Process matters: do the moral math.

100
Q

What five steps comprise The Markkula Center Framework for Ethical Decision-Making?

A
  1. Recognize an Ethical Issue
  2. Get the facts
  3. Evaluate options following different approaches/moral frameworks/value systems
  4. Make a decision and test it
  5. Act and reflect on the outcome
101
Q

What operations are performed to calculate the gradient of the loss w.r.t. the kernel values in a convolutional layer (dL/dK)?

A

It is the CROSS-CORRELATION between the UPSTREAM GRADIENT and the INPUT (computed out to a k1 x k2 output, i.e. the size of the kernel).

See 6:40/15:10 mark in Lesson 6.

102
Q

What operations are performed to calculate the gradient of the loss w.r.t. the input X in a convolutional layer (dL/dX)?

A

It is the CONVOLUTION between the UPSTREAM GRADIENT and the KERNEL (this can be implemented by simply flipping the kernel 180 degrees and performing cross-correlation)
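A numeric sanity check of these last two cards (my own sketch, assuming the stride-1, no-padding, cross-correlation forward pass used in the lectures):

import numpy as np
from scipy.signal import correlate2d, convolve2d

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 5))
K = rng.standard_normal((3, 3))
dY = rng.standard_normal((3, 3))       # stand-in for the upstream gradient dL/dY

dK = correlate2d(x, dY, mode="valid")  # cross-correlation -> k1 x k2 (3x3)
dX = convolve2d(dY, K, mode="full")    # convolution (flip K 180 deg) -> 5x5

# Finite-difference check of dK(0,0), with L defined as sum(dY * Y):
eps = 1e-6
Kp = K.copy()
Kp[0, 0] += eps
dL = np.sum(dY * (correlate2d(x, Kp, mode="valid")
                  - correlate2d(x, K, mode="valid"))) / eps
print(np.isclose(dL, dK[0, 0]))        # True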