Quiz #2 Flashcards
What steps and primary questions comprise the data wrangling process?
- What is the population of interest?
- What sample S are we evaluating?
- Is sample S representative of the population?
- How do we cross-validate to evaluate our model? How do we avoid overfitting and data mining?
- What prediction task (classification vs. regression) do we care about? What is the meaningful evaluation criteria?
- How do we create a reproducible pipeline?
What are some examples of the definition of a population (in data terms)?
All users on Facebook.
All US users on Facebook.
All US users on Facebook in the last month.
All the watermelons in the back of the truck.
All the watermelons greater than 5lbs in the back of the truck.
Etc…
How do we obtain data from a population?
Sampling
What are two simple probability-based methods for sampling?
- Random Sampling
- Stratified Random Sampling
What is simple random sampling of a population?
Every observation from the population has the same chance of being sampled
What is stratified random sampling of a population?
Population is partitioned into groups and then a simple random sampling approach is applied within each group.
Example: In the watermelons in the back of the truck example, we could partition into 3 groups: (1) less than 5lbs, (2) greater than 5lbs but less than 10lbs, and (3) greater than 10lbs. We could then randomly sample within each group. This is stratified random sampling.
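A minimal sketch of stratified random sampling, assuming pandas; the DataFrame, column name, and weights below are hypothetical, for illustration only:

```python
# Hedged sketch: stratified random sampling of the watermelon example with pandas.
import pandas as pd

melons = pd.DataFrame({"weight_lbs": [3.2, 6.1, 11.5, 4.8, 7.9, 12.3, 2.5, 9.0]})

# Partition the population into the three strata from the example.
melons["stratum"] = pd.cut(
    melons["weight_lbs"], bins=[0, 5, 10, float("inf")],
    labels=["<5lbs", "5-10lbs", ">10lbs"]
)

# Simple random sampling within each stratum (here, 50% of each group).
sample = melons.groupby("stratum", observed=True).sample(frac=0.5, random_state=0)
print(sample)
```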
What are some best practices for data wrangling?
- Clearly define your population and sample
- Understand the representativeness of your sample
- Cross-validation can go wrong in many ways; understand the relevant problem and prediction task that will be done in practice
- Know the prediction task of interest (regression vs. classification)
- Incorporate model checks and evaluate multiple predictive performance metrics
What is Cross Validation (CV)?
A method for estimating prediction error.
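For example, a minimal k-fold cross-validation sketch with scikit-learn; the estimator and synthetic data are placeholders:

```python
# Sketch of 5-fold cross-validation; model and data are illustrative only.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, random_state=0)

# Train on 4 folds, estimate prediction error on the held-out fold, repeat 5 times.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())
```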
Grid search is always better than random search when trying to optimize hyperparameters? (True/False)
False. A 2012 paper by Bergstra and Bengio found that random search is often just as good as, if not better than, grid search.
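A hedged sketch of random search with scikit-learn's RandomizedSearchCV; the model and parameter distributions are illustrative choices, not taken from the paper:

```python
# Sketch of random hyperparameter search over an SVM (illustrative only).
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)
search = RandomizedSearchCV(
    SVC(),
    param_distributions={"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-4, 1e0)},
    n_iter=20,  # sample 20 random configurations instead of an exhaustive grid
    cv=5,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```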
What are two methods for handling class imbalances?
- Sampling-based, e.g. SMOTE (Synthetic Minority Oversampling Technique), etc. (see the sketch after this list)
- Cost-based, e.g. Focal Loss for object detection
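As an illustration of the sampling-based approach, a hedged sketch using SMOTE from the imbalanced-learn package (assumed installed as `imblearn`; the data is synthetic):

```python
# Sketch: oversample the minority class with SMOTE (imbalanced-learn package).
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic data with a 9:1 class imbalance.
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))  # minority class synthetically oversampled
```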
What is one type of plot we can use to gauge the confidence a model has in its prediction?
Calibration plot
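A minimal sketch of a calibration (reliability) plot with scikit-learn and matplotlib; the model and data are placeholders:

```python
# Sketch of a calibration plot: binned predicted probabilities vs. the observed
# fraction of positives in each bin.
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

frac_pos, mean_pred = calibration_curve(y_te, probs, n_bins=10)
plt.plot(mean_pred, frac_pos, marker="o", label="model")
plt.plot([0, 1], [0, 1], linestyle="--", label="perfectly calibrated")
plt.xlabel("Mean predicted probability")
plt.ylabel("Fraction of positives")
plt.legend()
plt.show()
```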
It isn’t necessary to include a datasheet when creating a new dataset? (True/False)
False. It can be very helpful to future researchers (including yourself!) to understand how the dataset was constructed.
What things should be included in a datasheet for a dataset?
- Motivation (why is the dataset needed?)
- Composition
- Collection process
- Recommended uses
…etc.
What are the three steps in the Data Cleaning process for ML?
- Clean
- Transform
- Preprocess
What are three mechanisms that can cause missing data?
- Missing completely at random.
- Missing at random: likelihood of any observation to be missing depends on OBSERVED data features (ex: men are less likely to fill out surveys about depression)
- Missing not at random: likelihood of any observation to be missing depends on UNOBSERVED outcome (ex: a person might be less likely to complete a survey if they are depressed)
What are some ways we can fix missing data?
- Remove (easy, but wasteful)
- Imputation (mean/median, using a learned model to predict, etc.)
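A minimal sketch of mean imputation with scikit-learn's SimpleImputer; the toy array is illustrative:

```python
# Sketch: replace missing values with the per-column mean using SimpleImputer.
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

X_imputed = SimpleImputer(strategy="mean").fit_transform(X)
print(X_imputed)  # NaNs replaced by the column means (4.0 and 2.5)
```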
What are some examples of the data transformation step in the data cleaning process?
- Converting categorical features to indices (ordinal numbering, one-hot encoding, etc.)
- Bag-of-words
- TF-IDF
- Embeddings
…etc
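Two of the transformations above, sketched with scikit-learn (toy data; assumes a recent scikit-learn for the `sparse_output` argument):

```python
# Sketch: one-hot encoding a categorical feature and TF-IDF on a toy corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

colors = [["red"], ["green"], ["red"], ["blue"]]
onehot = OneHotEncoder(sparse_output=False).fit_transform(colors)
print(onehot)  # one column per category, a single 1 per row

corpus = ["the cat sat", "the dog sat", "the cat and the dog"]
tfidf = TfidfVectorizer().fit_transform(corpus)
print(tfidf.shape)  # (number of documents, vocabulary size)
```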
What are some examples of the data preprocessing step in the data cleaning process?
Zero-center data, normalization, etc
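For example, zero-centering and scaling to unit variance with StandardScaler (toy data, illustrative only):

```python
# Sketch: zero-center each feature and scale it to unit variance.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

X_std = StandardScaler().fit_transform(X)  # subtract per-feature mean, divide by std
print(X_std.mean(axis=0), X_std.std(axis=0))  # ~0 and ~1 per feature
```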
What are three important components of fairness in ML?
- Anti-classification: verifying that protected attributes like race, gender, etc. (and their proxies!) are not used to make predictions.
- Classification Parity: common measures of predictive performances are equal across groups defined by protected attributes.
- Calibration: conditional on risk estimates, outcomes are independent of protected attributes.
What is an example of how a proxy to a protected attribute might result in an unfair ML model?
One example might be using features like zip code in areas with high racial segregation. If the model learns that zip code is an important discriminatory feature, there’s a good chance that it has learned a subtle proxy for racial discrimination.
Layers in a NN must always be fully connected? (True/False)
False. Other connectivity structures are possible, and in many cases (like images) desirable.
Why does it make sense to consider small patches of inputs when building a NN for image data? What are these small patches called?
They are called receptive fields, modeled after similar structure in the human visual cortex. They make sense to use because while structure exists in image data, it’s often localized, such as edges and lines, and collections of those lines and edges forming higher level motifs.
Why does using linear layers not make sense for some applications?
Consider the case of image data. If we connect each pixel to every weight in a hidden linear layer, there could be hundreds of millions of parameters to learn for just one layer. Furthermore, patterns in images tend to be SPATIALLY LOCAL. A pixel in the upper right corner in all likelihood will have very little to do with a pixel in the lower left.
As the number of parameters to learn in a model increase, more data is needed to ensure a robust model that generalizes to new data? (True/False)
True.
If we have a receptive field that DOES NOT share weights and is 3x3 pixels connected to 5 output nodes, how many parameters will there be to learn?
((K1 * K2) + 1) * N –> ((3 * 3) + 1) * 5 = 50
For image data, it is necessary to learn location specific features? (True/False)
False. There’s no reason to assume that a pattern that occurs in the center of one image won’t also appear at some other arbitrary location in another.
What are shared weights in a CNN, and why do we use them?
Output nodes at different locations share the same weights across the input space. For example, W11 for the leftmost node would be the same as W11 for the rightmost node. We use shared weights so that we can learn spatial features that are invariant to simple affine transformations, e.g. translation.
If we have a receptive field that DOES share weights and is 3x3 pixels connected to 5 output nodes, how many parameters will there be to learn? (assume that this calculation is only considering a single feature extractor)
(K1 * K2) + 1 –> (3 * 3) + 1 = 10
In a CNN, weights are shared across output node locations as well as across different feature extractors? (True/False)
False. Weights are shared for the SAME feature extractor across the spatial input, but they are NOT shared between DIFFERENT feature extractors, i.e. each feature extractor has its own independent set of weights.
If we have a receptive field that DOES share weights and is 3x3 pixels connected to 5 output nodes, and there are 4 individual features we want to learn, how many parameters will there be to learn?
(K1 * K2 + 1) * M, where M is the number of features we want to learn –> (3 * 3 + 1) * 4 = 40
It is extremely important to remember to flip the kernel when implementing cross-correlation for a “convolutional” layer in a NN? (True/False)
False. Mathematically it’s useful to flip the kernel so the math works out more elegantly, but since we’re actually learning the kernel values in a CNN, the weights are initialized randomly, making the flipping operation superfluous in practice.
How do we implement the convolutional operation in a neural network in practice?
Simply take the dot product of the input with the kernel (i.e. element-wise multiply and sum).
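A minimal NumPy sketch of this operation in its “valid” form, looping over output positions (purely illustrative, no framework):

```python
# Sketch of the CNN "convolution" (cross-correlation) forward pass in NumPy:
# slide the kernel over the image and take an element-wise multiply-and-sum.
import numpy as np

def conv2d_valid(image, kernel):
    H, W = image.shape
    K1, K2 = kernel.shape
    out = np.zeros((H - K1 + 1, W - K2 + 1))  # "valid" output size
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + K1, j:j + K2]
            out[i, j] = np.sum(patch * kernel)  # dot product of patch and kernel
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
print(conv2d_valid(image, np.ones((3, 3))).shape)  # (3, 3) = (5-3+1) x (5-3+1)
```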
If we implement the forward pass in a convolutional layer as cross-correlation, what operation will the backpropagation be?
Convolution. (this is the duality principle that arises in the forward/backward pass of a convolutional layer)
Convolution is a complex non-linear operation? (True/False)
False. It is a simple linear operation.
If we implement the forward pass in a convolutional layer as convolution, what operation will the backpropagation be?
Cross-Correlation. (this is the duality principle that arises in the forward/backward pass of a convolutional layer)
What is the “valid” form of convolution and what is the output size?
It only applies the kernel when it is fully within the image. The output size is: (H - K1 + 1) x (W - K2 + 1). The output dimensions will be SMALLER than the input.
What is the “padded” form of convolution and what is the output size?
This form of convolution can be used to force the output size to be the same as the input by adding padding to the input image (zeros, mirrored, etc.). In general the output size is: (H - K1 + P1 + 1) x (W - K2 + P2 + 1), where P1 and P2 are the total padding added along each dimension.
If we apply valid convolution on a 5x5 input image using a 3x3 kernel, what will the output size be?
(H - K1 + 1) x (W - K2 + 1) = (5 - 3 + 1) x (5 - 3 + 1) = 3x3
If we apply padded convolution on a 5x5 image with a 3x3 kernel and a padding size of 1, what will the output size be?
(H - K1 + P1 + 1) x (W - K2 + P2 + 1) = (5 - 3 + 1 + 1) x (5 - 3 + 1 + 1) = 4x4
Using a stride greater than one is a good way of performing dimensionality reduction on the input? (True/False)
Typically false. Using a stride greater than one results in losing information because we’re skipping pixels.
For a multi-channel input, what is the shape of the kernel we use?
C x K1 x K2, where C is the number of channels in the input. For example, an RGB image would be 3 x K1 x K2.
Why do we (in general) not end up simply learning the exact same feature maps when using multiple kernels per layer?
Because we initialize the weights randomly: as gradient descent is applied, the weights in each map tend to converge to different values as a result of their different starting states. However, it is still possible to learn redundant feature maps, so random initialization is more of a heuristic than a guarantee.
If we apply convolution using M kernels to learn different features, what will be the number of channels in the output?
Since we concatenate the feature maps in the output, we’ll also have M output channels.
For an RGB (i.e. 3 channel) input image with N=4 filters and kernel size K1=K2=3, how many parameters will have to be learned for that layer? How many channels would there be in the output feature maps?
N * (K1 * K2 * 3 + 1) = 4 * (3 * 3 * 3 + 1) = 112. The number of channels in the output is simply equal to the number of filters, so 4.
What are pooling layers used for?
Dimensionality reduction.
Describe how a max pooling layer works and why it is useful?
It is useful for dimensionality reduction of the input. It is performed by striding a window across the image, but instead of applying convolution, the max-operation is applied to every window. This gives us a scalar output from a matrix input. For example, a 3x3 window would have 9 elements, but by applying the max operation, this becomes a single value.
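A minimal NumPy sketch of 2x2 max pooling with stride 2 (illustrative, framework-free):

```python
# Sketch of max pooling: each window of the input collapses to its maximum value.
import numpy as np

def max_pool2d(x, size=2, stride=2):
    H, W = x.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = window.max()
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool2d(x))  # 4x4 input reduced to a 2x2 output
```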
How many parameters need to be learned in a max pooling layer?
None. The max operation takes no parameters other than the input, so nothing needs to be learned.
The only operation that can be performed in a pooling layer is taking the maximum? (True/False)
False. Any differentiable function can be used (e.g. average, etc.). In practice though, it’s uncommon to use something other than max pooling.
Why is the combination of a convolutional layer with a pooling layer particularly powerful?
This combination allows learned features to exhibit some degree of INVARIANCE to simple affine transformations like translation. If the translation of some feature in the image is within the bounds of the pooling layer, it should still be recognized by the feature map.
If a feature (such as a bird’s beak) were translated a little bit, the location of the output values from convolutional layer would remain unchanged? (True/False)
False. Convolution has the property of ‘Equivariance’. A translation of the feature results in the output being shifted by the same amount.
What are two important properties of convolution?
- Invariance (features with small transformations/deformations should still activate the output)
- Equivariance (no matter where the feature occurs in the image, the feature map will be activated, with the output values moving by the same translation)
What are four important design decisions that must be made when developing a DNN?
- Architecture
- Data considerations
- Training and Optimization
- Machine Learning considerations
What are some important architectural considerations we should make when designing a DNN?
- What modules (layers) should we use?
- How should they be connected together?
- Can we use domain knowledge to add architectural biases?
A FC neural net can accept inputs with dynamic shapes? (True/False)
False. This is one of the downsides of a FC NN. Since every input is connected to all the weights, it can only accept inputs of a fixed shape.
What are some important optimization considerations we should make when designing a DNN?
- What optimizer should we use? Different optimizers make different weight updates depending on the gradients.
- How should we initialize the weights? If we initialize far away from the minima, can our optimizer actually get us there?
- What regularizers should we use? DNN often have more parameters than data. Regularization is often a must to avoid overfitting.
- What loss function is appropriate? Many different options available.