Deep Learning Knowledge Flashcards
CNNs, LSTMs, embeddings, transformers
CNN and its most important parts
CNNs are also known as shift-invariant or space-invariant artificial neural networks (SIANN), based on the shared-weight architecture of the convolution kernels that shift over input features and provide translation-equivariant responses.
The “full connectivity” of these networks makes them prone to overfitting data. Typical ways of regularization, or preventing overfitting, include penalizing parameters during training (such as weight decay) or trimming connectivity (skipped connections, dropout, etc.). CNNs take a different approach towards regularization: they take advantage of the hierarchical pattern in data and assemble patterns of increasing complexity using smaller and simpler patterns embossed in their filters.
Each convolutional neuron processes data only for its receptive field. Although fully connected feedforward neural networks can be used to learn features and classify data, this architecture is generally impractical for larger inputs such as high-resolution images. It would require a very high number of neurons, even in a shallow architecture, due to the large input size of images, where each pixel is a relevant input feature. For instance, a fully connected layer for a (small) image of size 100 x 100 has 10,000 weights for each neuron in the second layer. Instead, convolution reduces the number of free parameters, allowing the network to be deeper.[15] For example, regardless of image size, using a 5 x 5 tiling region, each with the same shared weights, requires only 25 learnable parameters. Using regularized weights over fewer parameters avoids the vanishing gradient and exploding gradient problems seen during backpropagation in traditional neural networks.[16][17] Furthermore, convolutional neural networks are ideal for data with a grid-like topology (such as images), as spatial relations between separate features are taken into account during convolution and/or pooling.
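A quick sketch of the parameter-count comparison above in PyTorch, using the 100 x 100 image and 5 x 5 kernel from the paragraph (the single output neuron/filter is an assumption for illustration):

```python
import torch.nn as nn

# Fully connected: every pixel of a 100 x 100 image feeds each output neuron.
fc = nn.Linear(100 * 100, 1)  # one output neuron for comparison
print(sum(p.numel() for p in fc.parameters()))  # 10001 = 10,000 weights + 1 bias

# Convolutional: a single 5 x 5 filter is shared across the whole image.
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=5)
print(sum(p.numel() for p in conv.parameters()))  # 26 = 25 weights + 1 bias
```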
CNN distinguishing features
3D volumes of neurons
Exploits local spatial relations (local connectivity)
Shared weights: In CNNs, each filter is replicated across the entire visual field. These replicated units share the same parameterization (weight vector and bias) and form a feature map. This means that all the neurons in a given convolutional layer respond to the same feature within their specific response field. Replicating units in this way allows the resulting activation map to be equivariant under shifts of the locations of input features in the visual field, i.e. they grant translational equivariance, given that the layer has a stride of one.
Pooling: In a CNN’s pooling layers, feature maps are divided into rectangular sub-regions, and the features in each rectangle are independently down-sampled to a single value, commonly by taking their average or maximum value. In addition to reducing the sizes of feature maps, the pooling operation grants a degree of local translational invariance to the features contained therein, allowing the CNN to be more robust to variations in their positions. (A small sketch of both ideas follows below.)
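A tiny numpy sketch of both ideas, under assumptions chosen for illustration (a 6 x 6 toy image, one hand-picked 2 x 2 filter shared across all positions, and 2 x 2 max pooling):

```python
import numpy as np

image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])  # one shared 2 x 2 filter

# Shared weights: the SAME kernel is applied at every position to build the feature map.
fmap = np.zeros((5, 5))
for i in range(5):
    for j in range(5):
        fmap[i, j] = np.sum(image[i:i + 2, j:j + 2] * kernel)

# Pooling: each non-overlapping 2 x 2 region of the (cropped) feature map
# is reduced to its maximum, halving the spatial resolution.
cropped = fmap[:4, :4]
pooled = cropped.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(fmap.shape, pooled.shape)  # (5, 5) (2, 2)
```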
Convolutional layer
The layer’s parameters consist of a set of learnable filters (or kernels), which have a small receptive field, but extend through the full depth of the input volume. During the forward pass, each filter is convolved across the width and height of the input volume, computing the dot product between the filter entries and the input, producing a 2-dimensional activation map of that filter. As a result, the network learns filters that activate when it detects some specific type of feature at some spatial position in the input.[6]
Stacking the activation maps for all filters along the depth dimension forms the full output volume of the convolution layer. Every entry in the output volume can thus also be interpreted as an output of a neuron that looks at a small region in the input and shares parameters with neurons in the same activation map.
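A minimal sketch of the output volume in PyTorch, assuming a toy RGB input and 16 filters (sizes chosen only for illustration):

```python
import torch
import torch.nn as nn

# A batch of one RGB image (3 input channels), 32 x 32 pixels.
x = torch.randn(1, 3, 32, 32)

# 16 filters, each 5 x 5 and spanning the full input depth of 3 channels.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5, padding=2)

out = conv(x)
print(out.shape)  # torch.Size([1, 16, 32, 32]): one activation map per filter,
                  # stacked along the depth dimension to form the output volume.
```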
Sometimes, the parameter sharing assumption may not make sense. This is especially the case when the input images to a CNN have some specific centered structure, for which we expect completely different features to be learned at different spatial locations. One practical example is when the inputs are faces that have been centered in the image: we might expect different eye-specific or hair-specific features to be learned in different parts of the image. In that case it is common to relax the parameter sharing scheme, and instead simply call the layer a “locally connected layer”.
Pooling layer
Intuitively, the exact location of a feature is less important than its rough location relative to other features. This is the idea behind the use of pooling in convolutional neural networks. The pooling layer serves to progressively reduce the spatial size of the representation, to reduce the number of parameters, memory footprint and amount of computation in the network, and hence to also control overfitting.
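A short sketch of the size reduction with a 2 x 2 max pool (the input tensor shape is an assumption for illustration):

```python
import torch
import torch.nn as nn

feature_maps = torch.randn(1, 16, 32, 32)     # (batch, channels, height, width)
pool = nn.MaxPool2d(kernel_size=2, stride=2)  # keep the max of each 2 x 2 region

pooled = pool(feature_maps)
print(pooled.shape)  # torch.Size([1, 16, 16, 16]): spatial size halved,
                     # so later layers see 4x fewer activations per map.
```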
Translation equivariance
CNNs are often assumed to be invariant to shifts of the input. However, convolution or pooling layers within a CNN that do not have a stride greater than one are equivariant, as opposed to invariant, to translations of the input.[61] Layers with a stride greater than one ignore the Nyquist-Shannon sampling theorem and can lead to aliasing of the input signal, which breaks the equivariance (also referred to as covariance) property.[61] Furthermore, if a CNN makes use of fully connected layers, translation equivariance does not imply translation invariance, as the fully connected layers are not invariant to shifts of the input.[73][4] One solution for complete translation invariance is avoiding any down-sampling throughout the network and applying global average pooling at the last layer.
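A numerical sketch of equivariance for a stride-1 convolution; circular padding and the 2-pixel shift are assumptions chosen so the check is exact and free of border effects:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, padding_mode='circular', bias=False)

x = torch.randn(1, 1, 8, 8)
shifted_x = torch.roll(x, shifts=2, dims=-1)  # shift the input 2 pixels to the right

out_then_shift = torch.roll(conv(x), shifts=2, dims=-1)
shift_then_out = conv(shifted_x)

# With stride 1, the two orders agree: the feature map shifts along with the input.
print(torch.allclose(out_then_shift, shift_then_out, atol=1e-6))  # True
```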
Dropout
It reduces overfitting in fully connected layers. The technique seems to reduce node interactions, leading them to learn more robust features that better generalize to new data.
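A minimal dropout sketch in PyTorch, assuming hypothetical layer sizes and a drop probability of 0.5:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),  # each hidden unit is zeroed with probability 0.5 during training
    nn.Linear(256, 10),
)

x = torch.randn(4, 1, 28, 28)
model.train()          # dropout active: a different random subset is dropped each forward pass
print(model(x).shape)  # torch.Size([4, 10])
model.eval()           # dropout disabled at inference (inverted dropout, so no extra rescaling here)
```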
A major drawback of dropout is that it does not have the same benefits for convolutional layers, where the neurons are not fully connected. In stochastic pooling,[80] the conventional deterministic pooling operations are replaced with a stochastic procedure, where the activation within each pooling region is picked randomly according to a multinomial distribution given by the activities within the pooling region.
Input images can also be deformed (data augmentation) to make the model more robust.
L1, L2, and Elastic Net regularization practices
L2 regularization is the most common form of regularization. It can be implemented by penalizing the squared magnitude of all parameters directly in the objective. The L2 regularization has the intuitive interpretation of heavily penalizing peaky weight vectors and preferring diffuse weight vectors. Due to multiplicative interactions between weights and inputs this has the useful property of encouraging the network to use all of its inputs a little rather than some of its inputs a lot.
L1 regularization is also common. It makes the weight vectors sparse during optimization. In other words, neurons with L1 regularization end up using only a sparse subset of their most important inputs and become nearly invariant to the noisy inputs. L1 can be combined with L2 regularization; this is called Elastic net regularization.
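A sketch of adding both penalties to a loss in PyTorch; the model, data, and coefficients are assumptions for illustration (plain L2 is more often applied through the optimizer's weight_decay argument):

```python
import torch
import torch.nn as nn

model = nn.Linear(20, 1)
criterion = nn.MSELoss()
l1_lambda, l2_lambda = 1e-4, 1e-3

x, y = torch.randn(8, 20), torch.randn(8, 1)
data_loss = criterion(model(x), y)

l1_penalty = sum(p.abs().sum() for p in model.parameters())   # pushes weights toward exact zeros (sparsity)
l2_penalty = sum(p.pow(2).sum() for p in model.parameters())  # prefers small, diffuse weight vectors

# Elastic-net-style objective: data loss plus both penalties.
loss = data_loss + l1_lambda * l1_penalty + l2_lambda * l2_penalty
loss.backward()
```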
Transfer learning
Convolutional neural networks usually require a large amount of training data in order to avoid overfitting. A common technique is to train the network on a larger data set from a related domain. Once the network parameters have converged, an additional training step is performed using the in-domain data to fine-tune the network weights; this is known as transfer learning. Furthermore, this technique allows convolutional network architectures to successfully be applied to problems with tiny training sets.
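A minimal fine-tuning sketch using a pretrained torchvision ResNet-18 (recent torchvision API assumed; the 10-class head and learning rate are assumptions for illustration):

```python
import torch.nn as nn
import torch.optim as optim
from torchvision import models

# Start from weights learned on a large related dataset (ImageNet).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pretrained feature extractor.
for param in model.parameters():
    param.requires_grad = False

# Replace the classifier head for the small in-domain task (10 classes assumed).
model.fc = nn.Linear(model.fc.in_features, 10)

# Only the new head's parameters are updated during fine-tuning.
optimizer = optim.Adam(model.fc.parameters(), lr=1e-3)
```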
Multicollinearity
Can be detected with the variance inflation factor (VIF).
Also R², chi-squared tests, and possibly pairplots if the features are numeric.
The variance of the parameter estimates will be large, and the estimates are no longer reliable because you cannot hold all other predictors constant while estimating the effect of one predictor. In tree-based models, feature importances are affected: a random forest will dilute the importance across the correlated features, while XGBoost will tend to randomly pick one of them.
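A short VIF check with statsmodels; the DataFrame below is a hypothetical example with one nearly collinear pair, and the usual rule of thumb flags VIF values above roughly 5-10:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=200)})
df["x2"] = 0.9 * df["x1"] + rng.normal(scale=0.1, size=200)  # nearly collinear with x1
df["x3"] = rng.normal(size=200)                              # independent predictor

X = sm.add_constant(df)
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)  # x1 and x2 show inflated VIFs; x3 stays close to 1
```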
Deal with missing data
Interpolation; dropping samples; iterative imputation (MICE); nearest-neighbour imputation; denoising autoencoders in high dimensions; fancy imputation.
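A minimal sketch of two of these in scikit-learn (the toy array is an assumption; IterativeImputer is scikit-learn's MICE-style imputer and still needs the experimental import):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, required to unlock IterativeImputer
from sklearn.impute import IterativeImputer, KNNImputer

X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [np.nan, 8.0]])

# MICE-style iterative imputation: each feature is modeled from the others.
print(IterativeImputer(random_state=0).fit_transform(X))

# Nearest-neighbour imputation: missing values come from the most similar rows.
print(KNNImputer(n_neighbors=2).fit_transform(X))
```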
Metrics for classification and regression
Classification: F1 score, ROC curves, AUC, log loss, categorical cross-entropy, Brier score. Log loss is the one to use if you want to take predicted class probabilities into account.
Regression: MSE, RMSE, MAE, MAPE, R².
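A short sketch computing a few of these with scikit-learn; the toy labels and predictions are assumptions for illustration:

```python
from sklearn.metrics import f1_score, roc_auc_score, log_loss, mean_squared_error, r2_score

# Classification: hard labels for F1, predicted probabilities for AUC and log loss.
y_true = [0, 1, 1, 0, 1]
y_prob = [0.2, 0.8, 0.6, 0.3, 0.9]
y_pred = [int(p >= 0.5) for p in y_prob]
print(f1_score(y_true, y_pred), roc_auc_score(y_true, y_prob), log_loss(y_true, y_prob))

# Regression: squared error plus explained variance.
y_true_r = [3.0, 2.5, 4.0, 5.5]
y_pred_r = [2.8, 2.9, 4.2, 5.0]
print(mean_squared_error(y_true_r, y_pred_r), r2_score(y_true_r, y_pred_r))
```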
Class imbalance
Random oversampling or undersampling, SMOTE, or passing class weights to the model.
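A sketch of two of these options; SMOTE comes from the third-party imbalanced-learn package (assumed installed), while class weights are built into many scikit-learn estimators:

```python
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print(Counter(y))  # roughly 900 vs 100

# Option 1: synthesize minority-class samples with SMOTE.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y_res))  # classes balanced

# Option 2: keep the data as-is and reweight the loss instead.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
```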
Categorical feature encoding
One-hot encoding (multicollinearity problem if all dummy columns are kept), label encoding, entity embeddings, CatBoost encoder, mean encoding, and other target encodings.
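A short sketch of two of these in pandas; the tiny DataFrame is hypothetical, and drop_first avoids the redundant dummy column behind the multicollinearity issue mentioned above:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["paris", "tokyo", "paris", "oslo"],
    "target": [1, 0, 1, 0],
})

# One-hot encoding; drop_first=True removes one redundant dummy column.
print(pd.get_dummies(df["city"], prefix="city", drop_first=True))

# Simple mean (target) encoding: replace each category with the target mean.
# In practice, fit this on training folds only to avoid target leakage.
df["city_mean_enc"] = df["city"].map(df.groupby("city")["target"].mean())
print(df)
```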
Feature selection
Hi
PU learning
Lol
Cluster mixed data
k-modes
ROCK
k-prototypes (sketched below)
Convert categorical features to numerical, then apply traditional methods
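A minimal k-prototypes sketch using the third-party kmodes package (assumed installed); the toy array and column choices are assumptions for illustration:

```python
import numpy as np
from kmodes.kprototypes import KPrototypes  # pip install kmodes

# Mixed data: column 0 is numeric, columns 1-2 are categorical.
X = np.array([
    [25.0, "red",   "small"],
    [27.0, "red",   "small"],
    [61.0, "blue",  "large"],
    [58.0, "green", "large"],
], dtype=object)

kproto = KPrototypes(n_clusters=2, init="Cao", n_init=1, random_state=0)
labels = kproto.fit_predict(X, categorical=[1, 2])  # tell it which columns are categorical
print(labels)
```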
Dimension reduction
PCA
Biclustering
Hi
LSTM and RNN
Ho