Deep Learning Knowledge Flashcards

CNNs, LSTMs, embeddings, transformers

1
Q

CNN and its most important parts

A

Also known as shift-invariant or space-invariant artificial neural networks (SIANN), based on the shared-weight architecture of the convolution kernels that shift over input features and provide translation-equivariant responses.

The “full connectivity” of these networks makes them prone to overfitting data. Typical ways of regularization, or preventing overfitting, include penalizing parameters during training (such as weight decay) or trimming connectivity (skipped connections, dropout, etc.). CNNs take a different approach towards regularization: they take advantage of the hierarchical pattern in data and assemble patterns of increasing complexity using smaller and simpler patterns embossed in their filters.

Each convolutional neuron processes data only for its receptive field. Although fully connected feedforward neural networks can be used to learn features and classify data, this architecture is generally impractical for larger inputs such as high resolution images. It would require a very high number of neurons, even in a shallow architecture, due to the large input size of images, where each pixel is a relevant input feature. For instance, a fully connected layer for a (small) image of size 100 x 100 has 10,000 weights for each neuron in the second layer. Instead, convolution reduces the number of free parameters, allowing the network to be deeper.[15] For example, regardless of image size, using a 5 x 5 tiling region, each with the same shared weights, requires only 25 learnable parameters. Using regularized weights over fewer parameters avoids the vanishing gradients and exploding gradients problems seen during backpropagation in traditional neural networks.[16][17] Furthermore, convolutional neural networks are ideal for data with a grid-like topology (such as images) as spatial relations between separate features are taken into account during convolution and/or pooling.
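A minimal sketch of the parameter savings described above (PyTorch; the hidden size of 256 is a hypothetical choice, the 100 x 100 and 5 x 5 numbers mirror the example):

```python
import torch.nn as nn

# Fully connected: a 100x100 image flattened to 10,000 inputs feeding
# 256 hidden neurons needs 10,000 weights *per* hidden neuron.
fc = nn.Linear(100 * 100, 256)
print(sum(p.numel() for p in fc.parameters()))    # 2,560,256 (weights + biases)

# Convolutional: a single 5x5 filter shared across the whole image
# needs only 25 weights (+1 bias), regardless of image size.
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=5)
print(sum(p.numel() for p in conv.parameters()))  # 26
```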

2
Q

CNN distinguishing features

A

Distinguishing features

3D volumes of neurons

Exploits local spatial relations

Shared weights: In CNNs, each filter is replicated across the entire visual field. These replicated units share the same parameterization (weight vector and bias) and form a feature map. This means that all the neurons in a given convolutional layer respond to the same feature within their specific response field. Replicating units in this way allows for the resulting activation map to be equivariant under shifts of the locations of input features in the visual field, i.e. they grant translational equivariance, given that the layer has a stride of one.

Pooling: In a CNN’s pooling layers, feature maps are divided into rectangular sub-regions, and the features in each rectangle are independently down-sampled to a single value, commonly by taking their average or maximum value. In addition to reducing the sizes of feature maps, the pooling operation grants a degree of local translational invariance to the features contained therein, allowing the CNN to be more robust to variations in their positions.

Convolutional layer
The layer’s parameters consist of a set of learnable filters (or kernels), which have a small receptive field, but extend through the full depth of the input volume. During the forward pass, each filter is convolved across the width and height of the input volume, computing the dot product between the filter entries and the input, producing a 2-dimensional activation map of that filter. As a result, the network learns filters that activate when it detects some specific type of feature at some spatial position in the input.[6]
Stacking the activation maps for all filters along the depth dimension forms the full output volume of the convolution layer. Every entry in the output volume can thus also be interpreted as an output of a neuron that looks at a small region in the input and shares parameters with neurons in the same activation map.
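A minimal sketch of this (PyTorch; the image and filter sizes are hypothetical) showing how the stacked activation maps form the output volume:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)   # batch of one RGB image, 32x32

# 16 learnable 5x5 filters; each extends through the full input depth (3)
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5, padding=2)

out = conv(x)
print(out.shape)                # torch.Size([1, 16, 32, 32])
# Each of the 16 depth slices is one filter's 2-D activation map.
```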

Sometimes, the parameter sharing assumption may not make sense. This is especially the case when the input images to a CNN have some specific centered structure, for which we expect completely different features to be learned at different spatial locations. One practical example is when the inputs are faces that have been centered in the image: we might expect different eye-specific or hair-specific features to be learned in different parts of the image. In that case it is common to relax the parameter sharing scheme, and instead simply call the layer a “locally connected layer”.

Pooling layer
Intuitively, the exact location of a feature is less important than its rough location relative to other features. This is the idea behind the use of pooling in convolutional neural networks. The pooling layer serves to progressively reduce the spatial size of the representation, to reduce the number of parameters, memory footprint and amount of computation in the network, and hence to also control overfitting.
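For example (PyTorch sketch, hypothetical sizes), 2x2 max pooling halves the spatial dimensions while keeping the depth:

```python
import torch
import torch.nn as nn

feature_maps = torch.randn(1, 16, 32, 32)     # output volume from a conv layer

pool = nn.MaxPool2d(kernel_size=2, stride=2)  # take the max of each 2x2 region
print(pool(feature_maps).shape)               # torch.Size([1, 16, 16, 16])
```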

Translation equivariance
It is commonly assumed that CNNs are invariant to shifts of the input. However, convolution or pooling layers within a CNN that do not have a stride greater than one are equivariant, as opposed to invariant, to translations of the input.[61] Layers with a stride greater than one ignore the Nyquist–Shannon sampling theorem and lead to aliasing of the input signal, which breaks the equivariance (also referred to as covariance) property.[61] Furthermore, if a CNN makes use of fully connected layers, translation equivariance does not imply translation invariance, as the fully connected layers are not invariant to shifts of the input.[73][4] One solution for complete translation invariance is avoiding any down-sampling throughout the network and applying global average pooling at the last layer.
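A small numerical check of the equivariance property (PyTorch sketch; circular padding is assumed so the shift wraps around and boundary effects do not spoil the equality):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(1, 4, kernel_size=3, stride=1, padding=1, padding_mode="circular")
x = torch.randn(1, 1, 28, 28)

shift = lambda t: torch.roll(t, shifts=(3, 5), dims=(-2, -1))

# Stride-1 convolution commutes with translations: conv(shift(x)) == shift(conv(x))
print(torch.allclose(conv(shift(x)), shift(conv(x)), atol=1e-5))  # True
```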

Dropout
Dropout reduces overfitting in fully connected layers. The technique seems to reduce node interactions, leading them to learn more robust features that generalize better to new data.
A major drawback of dropout is that it does not have the same benefits for convolutional layers, where the neurons are not fully connected. In stochastic pooling,[80] the conventional deterministic pooling operations are replaced with a stochastic procedure, where the activation within each pooling region is picked randomly according to a multinomial distribution, given by the activities within the pooling region.
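A minimal dropout sketch (PyTorch); note the layer is only active in training mode:

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)  # each unit is zeroed with probability 0.5
x = torch.ones(1, 8)

drop.train()
print(drop(x))            # roughly half the entries zeroed, the rest scaled by 1/(1-p)

drop.eval()
print(drop(x))            # identity at inference time
```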

Input images can also be deformed to make the model more robust (a form of data augmentation).

3
Q

L1, L2, and ElasticNet regularization practices

A

L2 regularization is the most common form of regularization. It can be implemented by penalizing the squared magnitude of all parameters directly in the objective. L2 regularization has the intuitive interpretation of heavily penalizing peaky weight vectors and preferring diffuse weight vectors. Due to multiplicative interactions between weights and inputs, this has the useful property of encouraging the network to use all of its inputs a little rather than some of its inputs a lot.
L1 regularization is also common. It makes the weight vectors sparse during optimization. In other words, neurons with L1 regularization end up using only a sparse subset of their most important inputs and become nearly invariant to the noisy inputs. L1 and L2 regularization can be combined; this is called Elastic net regularization.
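A quick scikit-learn sketch of the three penalties on a linear model (toy data and hypothetical penalty strengths):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] * 3.0 + rng.normal(scale=0.1, size=200)    # only feature 0 matters

ridge = Ridge(alpha=1.0).fit(X, y)                     # L2: diffuse, small weights
lasso = Lasso(alpha=0.1).fit(X, y)                     # L1: sparse weights
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)   # L1 + L2 combined

print(np.round(ridge.coef_, 2))
print(np.round(lasso.coef_, 2))   # most coefficients driven exactly to zero
print(np.round(enet.coef_, 2))
```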

4
Q

Transfer learning

A

Convolutional neural networks usually require a large amount of training data in order to avoid overfitting. A common technique is to train the network on a larger data set from a related domain. Once the network parameters have converged, an additional training step is performed using the in-domain data to fine-tune the network weights; this is known as transfer learning. Furthermore, this technique allows convolutional network architectures to successfully be applied to problems with tiny training sets.
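A common fine-tuning recipe as a sketch (PyTorch/torchvision with its newer weights API assumed; the 5 target classes are a placeholder):

```python
import torch.nn as nn
from torchvision import models

# Start from weights trained on a large related dataset (ImageNet here)
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pretrained feature extractor
for param in model.parameters():
    param.requires_grad = False

# Replace the classification head for the small in-domain task (e.g. 5 classes)
model.fc = nn.Linear(model.fc.in_features, 5)

# Only the new head's parameters are updated during fine-tuning
trainable = [p for p in model.parameters() if p.requires_grad]
```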

5
Q

Multicollinearity

A

Can be detected with the variance inflation factor (VIF), R², a chi-squared test, and possibly pairplots if the features are numeric.

The variance of the parameter estimates will be large. Parameter estimates are no longer reliable because you cannot hold all other predictors constant to estimate the parameter of one predictor. In tree-based models, feature importances are affected: a random forest will dilute the importance across the correlated features, while XGBoost will tend to pick one of them.
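A VIF check sketch with statsmodels (toy data in which x2 is nearly a copy of x1):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
df = pd.DataFrame({"x1": x1,
                   "x2": x1 + rng.normal(scale=0.05, size=500),  # collinear with x1
                   "x3": rng.normal(size=500)})

X = sm.add_constant(df)  # include an intercept when computing VIF
vif = pd.DataFrame({
    "feature": X.columns,
    "VIF": [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
})
print(vif)  # x1 and x2 get large VIFs; a common rule of thumb flags VIF > 5-10
```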

6
Q

Deal with missing data

A
Interpolation
Drop samples
Iterative imputer (see the sketch below)
MICE
Nearest-neighbours imputation
Denoising autoencoder in high dimensions
Fancy imputation
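A scikit-learn sketch of a few of the options above (toy array; IterativeImputer is the usual building block for MICE-style imputation):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer

X = np.array([[1.0, 2.0], [np.nan, 4.0], [5.0, np.nan], [7.0, 8.0]])

print(SimpleImputer(strategy="mean").fit_transform(X))    # column-mean fill
print(KNNImputer(n_neighbors=2).fit_transform(X))         # nearest-neighbours imputation
print(IterativeImputer(random_state=0).fit_transform(X))  # iterative / MICE-style
```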
7
Q

Metric for classification and regression

A
Classification: F1 score, ROC metrics, AUC.
Log loss, categorical cross-entropy, Brier score — use log loss if you want to take class probabilities into account in classification.

Regression: MSE, MAPE, RMSE, MAE, R² (sketch below)
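A scikit-learn sketch computing several of these metrics on toy labels and predictions:

```python
from sklearn.metrics import (f1_score, roc_auc_score, log_loss, brier_score_loss,
                             mean_squared_error, mean_absolute_error, r2_score)

# Classification: hard labels vs. predicted probabilities of the positive class
y_true = [0, 1, 1, 0, 1]
y_prob = [0.2, 0.8, 0.6, 0.4, 0.9]
y_pred = [int(p >= 0.5) for p in y_prob]

print(f1_score(y_true, y_pred))
print(roc_auc_score(y_true, y_prob))
print(log_loss(y_true, y_prob))          # uses the probabilities, not the hard labels
print(brier_score_loss(y_true, y_prob))

# Regression
y_reg_true = [3.0, -0.5, 2.0, 7.0]
y_reg_pred = [2.5, 0.0, 2.1, 7.8]
print(mean_squared_error(y_reg_true, y_reg_pred))
print(mean_absolute_error(y_reg_true, y_reg_pred))
print(r2_score(y_reg_true, y_reg_pred))
```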

8
Q

Class imbalance

A
Random oversampling/undersampling, SMOTE
Passing class weights to the loss function (sketch below)
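Two quick sketches: class weights in scikit-learn and SMOTE via imbalanced-learn (assumed installed), on a synthetic imbalanced dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE  # imbalanced-learn, assumed installed

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

# Option 1: reweight the loss instead of resampling
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# Option 2: synthesize minority-class samples with SMOTE
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(y.mean(), y_res.mean())  # the resampled set is balanced (~0.5)
```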
9
Q

Categorical feature encoding

A
One-hot encoding --> multicollinearity problem
Label encoding
Entity embeddings
CatBoost encoder
Mean encoding and other target encodings (sketch below)
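A sketch of one-hot, label, and naive mean/target encoding with pandas and scikit-learn (toy column; CatBoost and entity-embedding encoders live in separate libraries):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

df = pd.DataFrame({"city": ["a", "b", "a", "c", "b"], "target": [1, 0, 1, 0, 0]})

# One-hot (drop one column to avoid the dummy-variable / multicollinearity trap)
print(OneHotEncoder(drop="first").fit_transform(df[["city"]]).toarray())

# Label encoding (imposes an arbitrary ordering)
print(LabelEncoder().fit_transform(df["city"]))

# Naive mean/target encoding (in practice use out-of-fold means to avoid leakage)
print(df["city"].map(df.groupby("city")["target"].mean()))
```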
10
Q

Feature selection

A

Hi

11
Q

PU learning

A

Lol

12
Q

Cluster mixed data

A

K-modes
ROCK
K-prototypes

Convert to numerical, then apply traditional methods (sketch below)
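A sketch of the "convert to numerical, then cluster" route with scikit-learn (toy data; the kmodes package provides K-modes/K-prototypes directly):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

df = pd.DataFrame({"income": [30, 80, 45, 90, 35],
                   "city": ["a", "b", "a", "b", "a"]})

# Scale the numeric column, one-hot encode the categorical one
prep = ColumnTransformer([
    ("num", StandardScaler(), ["income"]),
    ("cat", OneHotEncoder(), ["city"]),
])

pipe = make_pipeline(prep, KMeans(n_clusters=2, n_init=10, random_state=0))
print(pipe.fit_predict(df))  # cluster label per row
```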

13
Q

Dimension reduction

A

PCA
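Minimal scikit-learn PCA sketch (toy dataset):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)  # PCA is scale-sensitive

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(X_2d.shape)                     # (150, 2)
print(pca.explained_variance_ratio_)  # variance captured by each component
```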

14
Q

Biclustering

A

Hi

15
Q

LSTM and RNN

A

Ho

16
Q

Target leakage

A

Hi

17
Q

Loss curves

A

Train/validation/test accuracy or loss curves
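A plotting sketch (matplotlib; the history dict of per-epoch losses is an assumed placeholder collected during training):

```python
import matplotlib.pyplot as plt

# history: per-epoch metrics collected during training (assumed toy values)
history = {"train_loss": [1.2, 0.8, 0.6, 0.5, 0.45],
           "val_loss":   [1.1, 0.9, 0.8, 0.85, 0.95]}

epochs = range(1, len(history["train_loss"]) + 1)
plt.plot(epochs, history["train_loss"], label="train")
plt.plot(epochs, history["val_loss"], label="validation")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.show()  # a widening gap between the curves is a typical sign of overfitting
```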

18
Q

Preprocessing steps

A

Listed in order

19
Q

Causal learners

A

T-learners

S-learners (toy sketch below)
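A toy sketch of the two meta-learners for treatment-effect estimation using scikit-learn base models (synthetic data with a known effect of 2; dedicated packages such as EconML or CausalML provide polished versions):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))
t = rng.integers(0, 2, size=2000)                          # treatment indicator
y = X[:, 0] + 2.0 * t + rng.normal(scale=0.1, size=2000)   # true effect = 2

# S-learner: one model with the treatment as an extra feature
s_model = GradientBoostingRegressor().fit(np.column_stack([X, t]), y)
tau_s = (s_model.predict(np.column_stack([X, np.ones(len(X))]))
         - s_model.predict(np.column_stack([X, np.zeros(len(X))])))

# T-learner: separate models for treated and control, then take the difference
m1 = GradientBoostingRegressor().fit(X[t == 1], y[t == 1])
m0 = GradientBoostingRegressor().fit(X[t == 0], y[t == 0])
tau_t = m1.predict(X) - m0.predict(X)

print(tau_s.mean(), tau_t.mean())  # both should be close to the true effect of 2
```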

20
Q

MERF and Linear Fixed Effects

A

Stats

21
Q

Biases and paradoxes

A

Simpson's paradox

Selection bias

22
Q

Confidence level

A

Confidence interval (CI)

23
Q

Prediction interval

A

Hu

24
Q

Query builder

A

FAISS
ScaNN
Fuzzy search
Levenshtein distance (see sketch below)
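A minimal FAISS nearest-neighbour search sketch (assumes the faiss package is installed; the vectors are random placeholders):

```python
import numpy as np
import faiss  # faiss-cpu, assumed installed

d = 64                                             # vector dimensionality
xb = np.random.rand(10000, d).astype("float32")    # database vectors
xq = np.random.rand(5, d).astype("float32")        # query vectors

index = faiss.IndexFlatL2(d)  # exact L2 search (approximate indexes trade accuracy for speed)
index.add(xb)
distances, ids = index.search(xq, 4)  # 4 nearest neighbours per query
print(ids)
```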

25
Q

Changepoint detection algorithms

A

ruptures (sketch below)

Facebook Prophet
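A changepoint-detection sketch with the ruptures package (assumed installed; synthetic signal with two mean shifts):

```python
import numpy as np
import ruptures as rpt  # assumed installed

# Synthetic signal with mean shifts at indices 200 and 350
signal = np.concatenate([np.random.normal(0, 1, 200),
                         np.random.normal(5, 1, 150),
                         np.random.normal(1, 1, 150)])

algo = rpt.Pelt(model="rbf").fit(signal)
breakpoints = algo.predict(pen=10)  # penalty controls how many changepoints are found
print(breakpoints)                  # e.g. [200, 350, 500]
```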

26
Q

Kolmogorov–Smirnov test

A

Tests the goodness of fit between the distribution of one random variable and a reference distribution.
H0: both distributions are identical
Ha: two-sided, less, or greater

The two-sample test checks whether two independent samples are drawn from the same continuous distribution.
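With scipy (toy samples):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(0, 1, size=500)
b = rng.normal(0.5, 1, size=500)

# One-sample: compare a sample against a reference distribution (standard normal)
print(stats.kstest(a, "norm"))

# Two-sample: are the two independent samples drawn from the same distribution?
print(stats.ks_2samp(a, b, alternative="two-sided"))  # small p-value -> reject H0
```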

27
Q

Embedding methods

A

Hi