Questions Flashcards
Which of the following statements is/are true about word embeddings?
A) Pre-learned embeddings exist.
B) Embeddings can potentially capture additional information compared to a one-hot encoded representation.
C) Embeddings are useful for large dictionary/vocabulary sizes.
D) An embedding word vector has the same size as the corresponding one-hot encoded word vector, given a fixed-size dictionary/vocabulary.
A, B, C
k-means …
A) … does not need a fixed number of cluster centers as input.
B) … needs a fixed number of cluster centers as input.
C) … is a dimensionality reduction method.
D) … is a clustering method.
B, D
Given an input of size 15x15 and a kernel size of 5x5 with a stride of 1, what is the output size after the convolution operation?
A) 14x14
B) 11x11
C) 10x10
D) 13x13
B
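The answer follows from the standard output-size formula for a convolution along one axis. A minimal sketch, assuming no padding unless stated (the function name and the `padding` parameter are illustrative, not part of the card):

```python
def conv_output_size(input_size: int, kernel_size: int,
                     stride: int = 1, padding: int = 0) -> int:
    """Output length along one spatial axis of a convolution."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


# 15x15 input, 5x5 kernel, stride 1, no padding
side = conv_output_size(15, 5, stride=1)
print(f"{side}x{side}")  # 11x11
```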
Which of the following statements is/are true about an 8-bit grayscale image?
A) Can be converted into an RGB image without additional information.
B) It has only a single channel (brightness).
C) Every pixel is represented by 8 channels.
D) The channel information size is 8 bits, which means that 8 values can be stored.
B
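Why option D is wrong can be checked with a line of arithmetic: 8 bits encode 2^8 distinct values, not 8. A small sketch (variable names are illustrative):

```python
bits_per_pixel = 8

# Number of distinct brightness levels a single 8-bit channel can hold
num_levels = 2 ** bits_per_pixel
print(num_levels)         # 256
print(0, num_levels - 1)  # 0 255
```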
Which of the following statements is/are true about the term ‘hyperparameters’?
A) There are models without any hyperparameters.
B) Hyperparameters are user-specifiable settings that control the model complexity or the training.
C) Hyperparameters can strongly influence the final model performance.
D) Hyperparameters are those model parameters that are adjusted during training.
B, C
Which of the following statements is/are true about loss functions?
A) Loss functions are used to obtain the final model prediction.
B) The output of loss functions is in the range [0, 1].
C) Loss functions can have an impact on the training process.
D) Loss functions are used to measure the difference between a model prediction and the true target.
C, D
Which of the following is/are useful loss functions for regression problems?
A) Cross entropy
B) Softmax
C) Sigmoid
D) Mean-squared error
D
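For reference, mean-squared error averages the squared differences between predictions and targets. A minimal pure-Python sketch (the function name and sample values are illustrative):

```python
def mean_squared_error(predictions, targets):
    """Average of squared differences between predictions and targets."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return sum(squared_errors) / len(squared_errors)


# Two exact predictions and one that is off by 2: (0 + 0 + 4) / 3
print(mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```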
Standard gradient descent performs an update step based on some step size/learning rate η. Which of the following statements is/are true?
A) If η is negative, we would move in the opposite direction (gradient ascent).
B) If η is too small, the update progress can be very slow.
C) If η is too large, the algorithm might not properly converge to some minimum.
D) If η is 0, no update is performed at all.
A, B, C, D
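All four cases can be observed on a toy problem. The sketch below assumes the objective f(w) = w^2 (gradient 2w) with a starting point of w = 1; the function name, objective, and step count are illustrative choices, not part of the card:

```python
def gradient_descent(eta: float, w: float = 1.0, steps: int = 10) -> float:
    """Run plain gradient descent on f(w) = w**2 and return the final w."""
    for _ in range(steps):
        grad = 2.0 * w       # derivative of w**2
        w = w - eta * grad   # standard update step
    return w


print(gradient_descent(0.1))    # moves towards the minimum at w = 0
print(gradient_descent(0.001))  # barely moves: progress is very slow
print(gradient_descent(1.5))    # overshoots and diverges
print(gradient_descent(0.0))    # no update at all, w stays at 1.0
print(gradient_descent(-0.1))   # gradient ascent: w moves away from 0
```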
Which of the following is/are typically used activation functions?
A) Cross entropy
B) Sigmoid
C) Tanh
D) ReLU
B, C, D
Logistic regression …
A) … has an output in the range [0, 1].
B) … is a regression model.
C) … is never a good model choice.
D) … is a classification model.
A, D
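Both correct options show up in a small sketch: the sigmoid keeps the output in the range [0, 1], and thresholding that output yields a class. The weights, bias, and the 0.5 threshold below are assumptions chosen for illustration:

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function; maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, weights, bias):
    """Probability of the positive class for one feature vector."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(z)


def predict_class(x, weights, bias, threshold=0.5):
    """Turn the probability into a class label (0 or 1)."""
    return int(predict_proba(x, weights, bias) >= threshold)


weights, bias = [0.5, -0.25], 0.0  # hypothetical, already-trained parameters
print(predict_proba([4.0, 2.0], weights, bias))  # a value between 0 and 1
print(predict_class([4.0, 2.0], weights, bias))  # 1
```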
Which of the following statements is/are true about pretrained models?
A) Using pretrained models might improve the prediction performance.
B) Pretrained models can be directly used for every task without having to adjust their architecture.
C) Using pretrained models always improves the prediction performance.
D) Pretrained models might be biased.
A, D
Which aspects have to be taken into consideration when dealing with high-dimensional input data?
A) Often difficult to visualize.
B) More features take up more space in memory.
C) Dimensionality reduction techniques might be useful.
D) More features might lead to longer model training times.
A, B, C, D
Consider the following vocabulary in the fixed order: cat dog wolf cow. Which of the following one-hot encodings is the correct one for the word ‘wolf’?
A) (1, 1, 0, 1)
B) (3)
C) (1, 2, 3, 4)
D) (0, 0, 1, 0)
D
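The encoding can be reproduced with a few lines of Python. A minimal sketch using the vocabulary from the card (the helper name `one_hot` is illustrative):

```python
def one_hot(word, vocabulary):
    """Return a vector with a single 1 at the word's vocabulary position."""
    index = vocabulary.index(word)  # raises ValueError for unknown words
    return [1 if i == index else 0 for i in range(len(vocabulary))]


vocabulary = ["cat", "dog", "wolf", "cow"]
print(one_hot("wolf", vocabulary))  # [0, 0, 1, 0]
```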
Assume you have the following input text that you want to encode with one-hot encoding: ‘a cat and a dog and a wolf’. What is the dictionary/vocabulary size?
A) 6
B) 7
C) 5
D) 8
C
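The vocabulary size is the number of distinct words, not the number of tokens. A short sketch, assuming simple whitespace tokenization:

```python
text = "a cat and a dog and a wolf"

tokens = text.split()             # every word occurrence
vocabulary = sorted(set(tokens))  # each distinct word once

print(len(tokens))      # 8
print(vocabulary)       # ['a', 'and', 'cat', 'dog', 'wolf']
print(len(vocabulary))  # 5
```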
The bias-variance tradeoff …
A) … is about finding the best ratio of training set size vs. test set size.
B) … is about finding the most underfitting and most overfitting model.
C) … is about finding the best loss functions.
D) … is about finding a compromise between model underfitting and overfitting.
D
Which of the following statements is/are true about convolutional neural networks (CNNs)?
A) Because of 2D input data, CNNs cannot be trained using gradient descent.
B) CNNs are the same as fully-connected neural networks, just for 2D data.
C) Weight sharing is an essential part of CNNs.
D) CNNs take advantage of the ‘local structure’ in image data (neighboring pixels are often highly correlated).
C, D
In the forward pass of a neural network, the input vector is …
A) … passed through an element-wise non-linearity, added to bias weights and multiplied by a weight matrix.
B) … added to bias weights, multiplied by a weight matrix and passed through an element-wise non-linearity.
C) … multiplied by a weight matrix, added to bias weights and passed through an element-wise non-linearity.
D) … passed through an element-wise non-linearity, multiplied by a weight matrix and added to bias weights.
C
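The order in option C (multiply by weights, add bias, apply the non-linearity) can be written out for a single layer. A minimal pure-Python sketch; the function names, the ReLU choice, and the sample numbers are assumptions for illustration:

```python
def relu(z: float) -> float:
    """Element-wise non-linearity used in this example."""
    return max(0.0, z)


def dense_forward(x, weights, bias, activation=relu):
    """One layer of a forward pass: activation(W x + b)."""
    # 1) multiply by the weight matrix, 2) add the bias
    pre_activation = [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(weights, bias)
    ]
    # 3) apply the element-wise non-linearity
    return [activation(z) for z in pre_activation]


weights = [[1.0, -1.0], [0.5, 0.5]]  # 2 output nodes, 2 inputs
bias = [0.0, 1.0]
print(dense_forward([1.0, 2.0], weights, bias))  # [0.0, 2.5]
```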
Which of the following statements is/are true about the logistic function (sigmoid)?
A) It is a common loss function.
B) It is used in logistic regression.
C) It introduces non-linearity.
D) It is used in linear regression.
B, C
Assume a multi-class classification problem with four classes (1, 2, 3, 4). Further assume that you have a model with a softmax function at the end which produced (0.3, 0.32, 0.35, 0.03). Which class should be chosen as the final classification prediction?
A) Class 4
B) Class 3
C) Class 2
D) Class 1
B
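Picking the class amounts to taking the position of the largest probability. A short sketch using the values from the card, with classes numbered from 1 as in the question:

```python
probabilities = [0.3, 0.32, 0.35, 0.03]  # softmax output for classes 1..4

# list indices start at 0, class labels in the card start at 1
predicted_class = probabilities.index(max(probabilities)) + 1
print(predicted_class)  # 3
```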
Which of the following statements is/are true about the softmax function?
A) The sum of all outputs equals 1.
B) It is suitable for multi-class classification problems.
C) It is a generalization of the sigmoid function.
D) The output is always 1 for the predicted class and 0 for all others.
A, B, C
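The properties in A and C can be checked numerically. A minimal sketch of softmax in pure Python; subtracting the maximum is a common numerical-stability trick added here, and the input values are illustrative:

```python
import math


def softmax(logits):
    """Map real-valued scores to probabilities that sum to 1."""
    largest = max(logits)  # subtracted for numerical stability
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


probs = softmax([1.0, 2.0, 3.0])
print(sum(probs))  # 1 up to floating-point rounding

# With two classes and the second score fixed at 0,
# softmax reduces to the sigmoid of the first score.
print(softmax([2.0, 0.0])[0], sigmoid(2.0))
```

Note that the outputs are graded probabilities rather than a hard 1/0 vector, which is why option D does not hold.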
Which of the following statements is/are true about padding in convolutional neural networks?
A) Padding is optional.
B) Padding can only be applied to the original input data, i.e., before the first network layer.
C) Padding can be used to keep the input size and output size the same.
D) Padding of size n is the same as using a kernel that is smaller by n compared to a bigger kernel.
A, C
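Option C can be made concrete: for stride 1 and an odd kernel size k, padding of (k - 1) / 2 on each side keeps the output the same size as the input. A sketch under exactly those assumptions (function names are illustrative):

```python
def same_padding(kernel_size: int) -> int:
    """Padding per side that preserves size for stride 1 and an odd kernel."""
    if kernel_size % 2 == 0:
        raise ValueError("this sketch only handles odd kernel sizes")
    return (kernel_size - 1) // 2


def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    """Output length along one spatial axis of a convolution."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


padding = same_padding(5)
print(padding)                                   # 2
print(conv_output_size(15, 5, padding=padding))  # 15
```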
In a fully-connected neural network …
A) … activation functions should be used between layers to prevent multiple linear transformations from collapsing into a single one.
B) … all inputs are connected to all nodes of the following layer.
C) … the output layer is used for the final model prediction.
D) … each hidden layer can have arbitrarily many nodes.
A, B, C, D
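The collapse mentioned in option A is easy to verify: two stacked linear layers without an activation give the same result as one layer whose weight matrix is the product of the two. A small sketch with made-up 2x2 weights and no bias terms:

```python
def matvec(matrix, vector):
    """Matrix-vector product."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def matmul(a, b):
    """Matrix-matrix product."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


w1 = [[1.0, 2.0], [0.0, 1.0]]  # first layer weights (illustrative)
w2 = [[2.0, 0.0], [1.0, 1.0]]  # second layer weights (illustrative)
x = [3.0, 4.0]

two_layers = matvec(w2, matvec(w1, x))  # layer by layer, no activation
one_layer = matvec(matmul(w2, w1), x)   # single combined layer

print(two_layers)  # [22.0, 15.0]
print(one_layer)   # [22.0, 15.0]
```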
Batch normalization …
A) … is not applicable in convolutional neural networks.
B) … is only used in the last network layer.
C) … is performed for each mini-batch of training samples.
D) … is performed once for the dataset before training the network.
C
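The per-mini-batch nature of the operation can be sketched for a single feature. This simplified version only normalizes with the batch mean and variance; the learnable scale and shift parameters and the running statistics used at inference time are deliberately left out, and `eps` is an assumed small constant:

```python
import math


def batch_norm(batch, eps=1e-5):
    """Normalize one feature across a mini-batch (no scale/shift)."""
    mean = sum(batch) / len(batch)
    variance = sum((v - mean) ** 2 for v in batch) / len(batch)
    return [(v - mean) / math.sqrt(variance + eps) for v in batch]


# Statistics are recomputed for every mini-batch, so the same raw value
# can be normalized differently depending on the batch it appears in.
print(batch_norm([1.0, 2.0, 3.0, 4.0]))
print(batch_norm([4.0, 8.0, 12.0, 16.0]))
```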
Which problems might arise when data augmentation is not done carefully?
A) The input data might no longer correlate with/represent the original target values.
B) The model performance might be worse than without augmentation.
C) There are no problems; data augmentation is always safe.
D) The target values might change too much.
A, B, D
What is meant by the term ‘underfitting’?
A) A model fits the training data (too) well but not the test data.
B) A model neither fits the training nor the test data well.
C) A model with too few hyperparameters was selected.
D) A model fits the training and the test data (too) well.
B
Which techniques can be used to potentially improve a neural network model in terms of prediction performance?
A) Loss function schedules
B) Deep networks
C) Hyperparameter augmentation
D) Batch normalization
B, D
Which of the following statements is/are true regarding the receptive field in convolutional neural networks?
A) The receptive field always remains constant throughout the depth of the network.
B) The receptive field is the (part of the) input that is connected to a node/neuron.
C) The receptive field is closely related to the terms ‘kernel’ or ‘filter’.
D) The receptive field is often bigger than the original input size.
B, C
Assume you have grayscale images with width=20 and height=20. What is the dimensionality when you want to train a model with such input data?
A) 400
B) 1200
C) 20
D) 40
A
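The dimensionality is the number of values per image after flattening: one brightness value per pixel. A short sketch with a placeholder all-zero image:

```python
width, height = 20, 20

# Grayscale image: one value per pixel (a single channel)
image = [[0] * width for _ in range(height)]

# Flatten the 2D grid into one feature vector
features = [pixel for row in image for pixel in row]
print(len(features))  # 400
```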
Which of the following statements is/are true about empirical risk minimization (ERM)?
A) ERM is typically performed on a dedicated test set.
B) ERM is a method of hyperparameter optimization.
C) ERM is typically performed on a dedicated training set.
D) ERM is a method of estimating the generalization error/risk.
C
Considering labeled tabular data, assume you have a feature vector x and a target y for each table entry. Which of the following statements is/are true?
A) y can be numerical.
B) The x of one table entry might be identical to another x table entry.
C) y can be a class label.
D) x and y together form a sample.
A, B, C, D
t-distributed stochastic neighbor embedding (t-SNE) …
A) … is a dimensionality reduction method.
B) … is a data augmentation method.
C) … enables visualization of high-dimensional data.
D) … is a clustering method.
A, C
Which of the following statements is/are true about the result of loss functions (the ‘loss’)?
A) Typically, the higher the loss, the better the prediction.
B) When comparing the loss of two different loss functions, one should choose the function that yielded the lower loss.
C) Different loss functions might have different loss value ranges.
D) Typically, the lower the loss, the better the prediction.
C, D
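Option C can be illustrated by evaluating the same predictions with two different loss functions. The sketch below uses mean absolute error and mean-squared error on made-up numbers; the two results sit on different scales even though the predictions are identical:

```python
def mean_absolute_error(predictions, targets):
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)


def mean_squared_error(predictions, targets):
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


targets = [0.0, 10.0]
predictions = [2.0, 7.0]  # errors of 2 and 3

print(mean_absolute_error(predictions, targets))  # 2.5
print(mean_squared_error(predictions, targets))   # 6.5
```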