Previous Questions Flashcards

1
Q

What is a pooling layer? What does it do?

A
  • Usually placed after a convolutional layer: a downsampling operation that reduces the spatial dimensions of the input feature map
  • Use
    o Extracts the most important features
    o Reduces computational complexity
    o Reduces overfitting
    o Neighboring pixels are strongly correlated, so it makes sense to combine them (see the sketch below)
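
A minimal sketch of what a pooling layer does, assuming TensorFlow/Keras (the shapes are illustrative):

```python
# Minimal sketch, assuming TensorFlow/Keras; shapes are illustrative.
import tensorflow as tf

x = tf.random.normal((1, 28, 28, 8))                # one 28x28 feature map with 8 channels
pool = tf.keras.layers.MaxPooling2D(pool_size=2)    # 2x2 max pooling, stride 2
y = pool(x)
print(y.shape)  # (1, 14, 14, 8): spatial dimensions halved, channels unchanged
```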
2
Q

Would it be good practice to multiply the number of features by 2 after every pooling layer?

A
  • Images lose their spatial resolution when going through a regular CNN, which can be an issue for semantic segmentation / object detection. Solution: recover the spatial information that was lost in earlier pooling layers by upsampling the output image by a factor of 2 (see the sketch below).
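
A minimal sketch of upsampling a feature map by a factor of 2, assuming TensorFlow/Keras (the shapes are illustrative):

```python
# Minimal sketch, assuming TensorFlow/Keras; shapes are illustrative.
import tensorflow as tf

x = tf.random.normal((1, 14, 14, 16))        # a downsampled feature map
up = tf.keras.layers.UpSampling2D(size=2)    # repeats rows/columns -> factor 2
# A learnable alternative: tf.keras.layers.Conv2DTranspose(16, 3, strides=2, padding="same")
y = up(x)
print(y.shape)  # (1, 28, 28, 16): spatial resolution doubled
```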
3
Q

For a neural network, what works best against overfitting: Ridge regularization, dropout, or early stopping?

A

* Answer: Dropout (a Keras sketch follows the explanation below)
* Explanation:
o CNN: weights are organized in filter kernels that slide across the input image to extract features -> the filters are shared across different spatial locations of the input.
o Ridge regularization: adds a penalty term to each individual weight in the network.
o Ridge regularization: disrupts the sharing of weights and the spatial relationships captured by the filters.
o Adding a penalty to each weight independently potentially alters the balance and importance of the shared weights; the regularization penalty may affect the weights differently at different spatial locations -> it undermines the shared knowledge encoded in the weights; it also encourages all weights to be small.
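
A minimal sketch of dropout in a network, assuming TensorFlow/Keras (the architecture and rate are illustrative):

```python
# Minimal sketch, assuming TensorFlow/Keras; architecture and dropout rate are illustrative.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.3),                    # randomly drops 30% of activations, training only
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```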

4
Q

Why is dropout preferred over early stopping in a neural network?

A

o “Neurons trained with dropout cannot co-adapt with their neighboring neurons; they have to be as useful as possible on their own. They also cannot rely excessively on just a few input neurons; they must pay attention to each of their input neurons. They end up being less sensitive to slight changes in the inputs. In the end, you get a more robust network that generalizes better.”
o “A unique neural network is generated at each training step. Since each neuron can be either present or absent, there are a total of 2^N possible networks (where N is the total number of droppable neurons). This is such a huge number that it is virtually impossible for the same neural network to be sampled twice. Once you have run 10,000 training steps, you have essentially trained 10,000 different neural networks (each with just one training instance). These neural networks are obviously not independent because they share many of their weights, but they are nevertheless all different. The resulting neural network can be seen as an averaging ensemble of all these smaller neural networks.”
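
A tiny NumPy illustration (my own sketch, not code from the quoted book) of the point that a different sub-network is sampled at every training step:

```python
# Minimal sketch, assuming NumPy; shows that every step drops a different random subset of neurons.
import numpy as np

rng = np.random.default_rng(0)
activations = np.ones((1, 8))                    # toy layer output with 8 neurons
rate = 0.5
for step in range(3):
    mask = rng.random(activations.shape) > rate  # each neuron kept with probability 1 - rate
    dropped = activations * mask / (1 - rate)    # inverted dropout: rescale the kept activations
    print(step, mask.astype(int))                # a different "sub-network" at every step
```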

5
Q

How would you handle underfitting when using dropout? (i.e. what should you do if you use dropout and the model underfits?)

A
  • Decrease the dropout rate
  • Related question:
    o If the model underfits: decrease or increase dropout? -> decrease
  • “If you observe that the model is overfitting, you can increase the dropout rate. Conversely, you should try decreasing the dropout rate if the model underfits the training set. It can also help to increase the dropout rate for large layers, and reduce it for small ones. Moreover, many state-of-the-art architectures only use dropout after the last hidden layer, so you may want to try this if full dropout is too strong. Dropout does tend to significantly slow down convergence, but it usually results in a much better model when tuned properly. So, it is generally well worth the extra time and effort.”
6
Q

Why use both early stopping & dropout? Is it good to use both at the same time?

A
  • GPT: using both can be effective because they address different aspects of overfitting
  • Early stopping helps control the capacity of the model by stopping training before it starts overfitting
  • Dropout introduces randomness during training, preventing the network from relying too heavily on any specific set of features or neurons (see the sketch below)
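
A minimal sketch of combining both, assuming TensorFlow/Keras (X_train/y_train are placeholder names):

```python
# Minimal sketch, assuming TensorFlow/Keras; layer sizes are illustrative.
import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True   # stop once validation loss stops improving
)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dropout(0.2),                                # dropout regularizes every training step
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
# model.fit(X_train, y_train, validation_split=0.2, epochs=100, callbacks=[early_stop])
```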
7
Q

What would you choose for image dimensionality reduction: PCA or SVD? (a deep question about the underlying reasons)

A
  • SVD for images, PCA for text classification -> rule of thumb: PCA for numeric data and SVD for image data
  • SVD takes the matrix directly as input (computationally less expensive), without requiring the calculation of the covariance matrix as PCA does
  • PCA focuses on linear relationships between variables, which can be derived for images but is a bit counterintuitive; it is more natural for numerical inputs with actual variables and observations (see the sketch below)
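
A minimal sketch contrasting the two, assuming scikit-learn (the "image" matrix is random, for illustration only):

```python
# Minimal sketch, assuming scikit-learn; the image matrix is illustrative random data.
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD

X = np.random.rand(100, 64 * 64)      # 100 flattened 64x64 "images"

pca = PCA(n_components=50)            # centers the data (equivalent to eigendecomposition of the covariance matrix)
X_pca = pca.fit_transform(X)

svd = TruncatedSVD(n_components=50)   # factorizes X directly, no covariance matrix needed
X_svd = svd.fit_transform(X)

print(X_pca.shape, X_svd.shape)       # (100, 50) (100, 50)
```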
8
Q

PCA: how are the PCs calculated? How does it work?

A
  • Calculate the covariance matrix
  • Calculate & order the eigenvalues & eigenvectors of the covariance matrix
    o Eigenvalues (lambda) = amount of variance along each PC
    o Eigenvectors = directions of the axes along which most of the variance lies
  • Choose the number of dimensions
    o Calculate the proportion of variance explained: eigenvalue / sum of all eigenvalues
    o Decide on a threshold
  • Create the feature vector (a matrix with the chosen eigenvectors as columns) & rotate the dataset (in practice done via SVD); see the sketch below
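
A minimal sketch of these steps, assuming NumPy (the data is random and only for illustration):

```python
# Minimal sketch, assuming NumPy; data is illustrative.
import numpy as np

X = np.random.rand(200, 5)                           # 200 observations, 5 variables
Xc = X - X.mean(axis=0)                              # center the data

cov = np.cov(Xc, rowvar=False)                       # 1) covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)               # 2) eigenvalues & eigenvectors
order = np.argsort(eigvals)[::-1]                    #    order by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()                  # 3) proportion of variance explained
k = np.searchsorted(np.cumsum(explained), 0.95) + 1  #    keep enough PCs for e.g. 95% variance

W = eigvecs[:, :k]                                   # 4) feature vector: chosen eigenvectors as columns
X_reduced = Xc @ W                                   #    rotate/project the dataset
print(k, X_reduced.shape)
```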
9
Q

How to find optimal K in KMeans

A
  • Answer: Silhouette Score
  • Explanation
    o Silhouette score measures the compactness & separation of clusters
    o Calculation: average of the silhouette scores of all data points
      ▪ Silhouette score for one instance: (b - a) / max(a, b)
        • a: intra-cluster distance = mean distance to the other instances in the same cluster
        • b: inter-cluster distance = mean distance to the instances in the nearest other cluster
    o Range: -1 to 1 (closer to 1 = better-defined cluster)
    o Try different k values & choose the one that maximizes the score (see the sketch below)
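
A minimal sketch, assuming scikit-learn (the data is synthetic, for illustration only):

```python
# Minimal sketch, assuming scikit-learn; the data is illustrative synthetic blobs.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)   # mean of (b - a) / max(a, b) over all points

best_k = max(scores, key=scores.get)          # k that maximizes the silhouette score
print(scores, "-> best k:", best_k)
```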
10
Q

Why is RNN better than ARIMA for time series data?

A
  • RNNs are specifically designed to handle sequential data
  • Good for capturing long-term dependencies and patterns across time steps (LSTM)
  • ARIMA requires preprocessing to remove the trend and seasonality and you need to find the right parameters for the components (AR, I, MA). In contrast, RNNs can automatically learn relevant features and representations directly from the data.
  • Advantage of ARIMA: interpretability -> insights into the underlying process generating the data (an RNN sketch follows below)
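
A minimal sketch of an RNN for a univariate series, assuming TensorFlow/Keras (X_windows/y_next are placeholder names):

```python
# Minimal sketch, assuming TensorFlow/Keras; window length and units are illustrative.
import tensorflow as tf

# Input: windows of the last 30 time steps with 1 feature each; output: the next value.
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(30, 1)),  # learns temporal dependencies directly from the raw series
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
# model.fit(X_windows, y_next, epochs=20)  # X_windows/y_next are placeholders for the windowed series
```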
11
Q

What is a TLU?

A
  • Answer: Threshold Logic Unit = a type of neuron or unit used in neural networks -> computes a weighted sum of its input values & compares it to a predefined threshold
    o sum > threshold -> output 1
    o sum < threshold -> output 0
  • Background: used especially for binary classification (see the sketch below)
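
A minimal sketch of a TLU, assuming NumPy (weights and threshold are illustrative):

```python
# Minimal sketch, assuming NumPy; weights and threshold are illustrative.
import numpy as np

def tlu(x, w, threshold=0.0):
    """Threshold Logic Unit: weighted sum of inputs compared to a threshold -> 0 or 1."""
    return int(np.dot(w, x) > threshold)

x = np.array([0.5, -1.0, 2.0])   # inputs
w = np.array([1.0, 0.5, 0.25])   # weights
print(tlu(x, w))                 # 1, because 0.5 - 0.5 + 0.5 = 0.5 > 0
```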
12
Q

AdaBoost: Adaptive Boosting

A

o Focuses on difficult-to-classify samples by assigning higher weights to them during training
o Goal: by iteratively adjusting the weights and training weak learners -> improve the overall model performance and handle complex classification tasks
o Able to handle imbalanced data
o Steps (see the sketch after this list)
 ▪ Initialize the weights (the same weight for all training samples)
 ▪ Train a weak learner, e.g. a shallow decision tree
 ▪ Compute the error: sum of the weights of the misclassified samples
 ▪ Update the weights (increase the weights of misclassified samples)
 ▪ Repeat until the desired performance is achieved or a fixed number of iterations is reached
 ▪ Aggregate predictions: combine the predictions of all weak learners by assigning a weight to their predictions based on their performance during training
 ▪ Final prediction: the weighted vote or average of the weak learners' predictions
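A minimal sketch of AdaBoost in practice, assuming scikit-learn (data and hyperparameters are illustrative; the default weak learner is a depth-1 decision tree):

```python
# Minimal sketch, assuming scikit-learn; data and hyperparameters are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Default weak learner is a decision stump; samples are reweighted between the 100 boosting rounds.
clf = AdaBoostClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))   # weighted vote of all weak learners
```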

13
Q

Can you do multi-class classification with an SVM?

A
  • Answer: an SVM is inherently binary, but you can combine several binary classifiers to handle multiple classes
  • Background
    o One-vs-All approach: train multiple binary SVM classifiers, where each classifier is trained to distinguish one class from the rest. For a problem with N classes, N binary SVM classifiers are trained; each classifier treats instances of one class as positive and instances of the other N-1 classes as negative. During prediction, all classifiers are applied to the test instance, and the class associated with the classifier that produces the highest confidence or decision score is assigned to the test instance (see the sketch below).
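
A minimal sketch of the One-vs-All approach, assuming scikit-learn (iris gives a 3-class problem):

```python
# Minimal sketch, assuming scikit-learn; iris is a 3-class dataset.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# One-vs-All: one binary SVM per class; prediction picks the class with the highest decision score.
ovr_svm = OneVsRestClassifier(SVC(kernel="rbf"))
ovr_svm.fit(X, y)
print(ovr_svm.predict(X[:5]))
```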
14
Q

What are autoencoders? How do they work and what types are there?

A
  • Unsupervised models: they aim to learn efficient representations of the input data by encoding it into a lower-dimensional latent space and then decoding it back to the original input format.
  • Some autoencoders (e.g. variational autoencoders) are generative: capable of randomly generating new data that looks very similar to the training data.
  • Consist of two parts: an encoder (converts the inputs into a latent representation) and a decoder (converts the internal representation back to the outputs)
  • Can be used for dimensionality reduction, feature extraction and visualization
  • Types: Convolutional Autoencoder (for images), Recurrent Autoencoder (for sequences), Denoising Autoencoder (learns useful features by adding noise to the inputs and reconstructing the clean inputs); see the sketch below
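
A minimal sketch of a plain autoencoder, assuming TensorFlow/Keras (layer sizes are illustrative; X is a placeholder):

```python
# Minimal sketch, assuming TensorFlow/Keras; layer sizes are illustrative.
import tensorflow as tf

inputs = tf.keras.Input(shape=(784,))                                 # e.g. a flattened 28x28 image
encoded = tf.keras.layers.Dense(32, activation="relu")(inputs)        # encoder -> latent representation
decoded = tf.keras.layers.Dense(784, activation="sigmoid")(encoded)   # decoder -> reconstruction

autoencoder = tf.keras.Model(inputs, decoded)
encoder = tf.keras.Model(inputs, encoded)          # reusable on its own for dimensionality reduction
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(X, X, epochs=20)  # trained to reconstruct its own input; X is a placeholder
```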
15
Q

Why did we use Colab and not UCloud?

A
  • Use of GPUs in Google Colab (no GPUs available in UCloud)
  • UCloud sometimes had performance issues; processing was very slow at times
  • UCloud had access issues when many people were using it (the session would not start for a few hours)
16
Q

Cross-Validation: Effects & How to pick number of folds

A
  • More reliable evaluation of a model's performance compared to a single train-test split.
  • Allows you to get an idea of how precise the estimate is (standard deviation) by looking at the performance in each fold.
  • How to decide on the number of folds:
    o The computational cost of training the model several times is high with a large number of folds.
    o A large number of folds is only possible if you have enough data, as the validation set would otherwise get very small -> both sets should contain sufficient variation such that the underlying distribution is represented.
  • We used 3 folds (cv=3) to reduce the CPU time needed (we could have used 5); see the sketch below.
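
A minimal sketch, assuming scikit-learn (the model and data are illustrative; only cv=3 matches the card):

```python
# Minimal sketch, assuming scikit-learn; model and data are illustrative.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, random_state=42)

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=3)   # 3 folds, as in the project
print(scores, "mean:", scores.mean(), "std:", scores.std())               # std gives an idea of the precision
```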