Deep Learning Knowledge Flashcards
CNNs, LSTMs, embeddings, transformers
CNN and its most important parts
CNNs are also known as shift-invariant or space-invariant artificial neural networks (SIANN), based on the shared-weight architecture of the convolution kernels that shift over input features and provide translation-equivariant responses.
The “full connectivity” of these networks makes them prone to overfitting data. Typical ways of regularization, or preventing overfitting, include penalizing parameters during training (such as weight decay) or trimming connectivity (skipped connections, dropout, etc.). CNNs take a different approach towards regularization: they take advantage of the hierarchical pattern in data and assemble patterns of increasing complexity using smaller and simpler patterns embossed in their filters.
Each convolutional neuron processes data only for its receptive field. Although fully connected feedforward neural networks can be used to learn features and classify data, this architecture is generally impractical for larger inputs such as high-resolution images. It would require a very high number of neurons, even in a shallow architecture, due to the large input size of images, where each pixel is a relevant input feature. For instance, a fully connected layer for a (small) image of size 100 x 100 has 10,000 weights for each neuron in the second layer. Instead, convolution reduces the number of free parameters, allowing the network to be deeper.[15] For example, regardless of image size, using a 5 x 5 tiling region, each with the same shared weights, requires only 25 learnable parameters. Using regularized weights over fewer parameters avoids the vanishing gradient and exploding gradient problems seen during backpropagation in traditional neural networks.[16][17] Furthermore, convolutional neural networks are ideal for data with a grid-like topology (such as images), as spatial relations between separate features are taken into account during convolution and/or pooling.
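A quick sketch of the parameter-count comparison above in PyTorch, using the 100 x 100 image and 5 x 5 kernel from the paragraph (the single output neuron/filter is an assumption for illustration):

```python
import torch.nn as nn

# Fully connected: every pixel of a 100 x 100 image feeds each output neuron.
fc = nn.Linear(100 * 100, 1)  # one output neuron for comparison
print(sum(p.numel() for p in fc.parameters()))  # 10001 = 10,000 weights + 1 bias

# Convolutional: a single 5 x 5 filter is shared across the whole image.
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=5)
print(sum(p.numel() for p in conv.parameters()))  # 26 = 25 weights + 1 bias
```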
CNN distinguishing features
3D volumes of neurons
Exploits local spatial relations (local connectivity)
Shared weights: In CNNs, each filter is replicated across the entire visual field. These replicated units share the same parameterization (weight vector and bias) and form a feature map. This means that all the neurons in a given convolutional layer respond to the same feature within their specific response field. Replicating units in this way allows the resulting activation map to be equivariant under shifts of the locations of input features in the visual field, i.e. they grant translational equivariance, given that the layer has a stride of one.
Pooling: In a CNN’s pooling layers, feature maps are divided into rectangular sub-regions, and the features in each rectangle are independently down-sampled to a single value, commonly by taking their average or maximum value. In addition to reducing the sizes of feature maps, the pooling operation grants a degree of local translational invariance to the features contained therein, allowing the CNN to be more robust to variations in their positions. (A small sketch of both ideas follows below.)
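A tiny numpy sketch of both ideas, under assumptions chosen for illustration (a 6 x 6 toy image, one hand-picked 2 x 2 filter shared across all positions, and 2 x 2 max pooling):

```python
import numpy as np

image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])  # one shared 2 x 2 filter

# Shared weights: the SAME kernel is applied at every position to build the feature map.
fmap = np.zeros((5, 5))
for i in range(5):
    for j in range(5):
        fmap[i, j] = np.sum(image[i:i + 2, j:j + 2] * kernel)

# Pooling: each non-overlapping 2 x 2 region of the (cropped) feature map
# is reduced to its maximum, halving the spatial resolution.
cropped = fmap[:4, :4]
pooled = cropped.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(fmap.shape, pooled.shape)  # (5, 5) (2, 2)
```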
Convolutional layer
The layer’s parameters consist of a set of learnable filters (or kernels), which have a small receptive field, but extend through the full depth of the input volume. During the forward pass, each filter is convolved across the width and height of the input volume, computing the dot product between the filter entries and the input, producing a 2-dimensional activation map of that filter. As a result, the network learns filters that activate when it detects some specific type of feature at some spatial position in the input.[6]
Stacking the activation maps for all filters along the depth dimension forms the full output volume of the convolution layer. Every entry in the output volume can thus also be interpreted as an output of a neuron that looks at a small region in the input and shares parameters with neurons in the same activation map.
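A minimal sketch of the output volume in PyTorch, assuming a toy RGB input and 16 filters (sizes chosen only for illustration):

```python
import torch
import torch.nn as nn

# A batch of one RGB image (3 input channels), 32 x 32 pixels.
x = torch.randn(1, 3, 32, 32)

# 16 filters, each 5 x 5 and spanning the full input depth of 3 channels.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5, padding=2)

out = conv(x)
print(out.shape)  # torch.Size([1, 16, 32, 32]): one activation map per filter,
                  # stacked along the depth dimension to form the output volume.
```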
Sometimes, the parameter sharing assumption may not make sense. This is especially the case when the input images to a CNN have some specific centered structure, for which we expect completely different features to be learned at different spatial locations. One practical example is when the inputs are faces that have been centered in the image: we might expect different eye-specific or hair-specific features to be learned in different parts of the image. In that case it is common to relax the parameter sharing scheme, and instead simply call the layer a “locally connected layer”.
Pooling layer
Intuitively, the exact location of a feature is less important than its rough location relative to other features. This is the idea behind the use of pooling in convolutional neural networks. The pooling layer serves to progressively reduce the spatial size of the representation, to reduce the number of parameters, memory footprint and amount of computation in the network, and hence to also control overfitting.
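A short sketch of the size reduction with a 2 x 2 max pool (the input tensor shape is an assumption for illustration):

```python
import torch
import torch.nn as nn

feature_maps = torch.randn(1, 16, 32, 32)     # (batch, channels, height, width)
pool = nn.MaxPool2d(kernel_size=2, stride=2)  # keep the max of each 2 x 2 region

pooled = pool(feature_maps)
print(pooled.shape)  # torch.Size([1, 16, 16, 16]): spatial size halved,
                     # so later layers see 4x fewer activations per map.
```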
Translation equivariance
CNNs are often assumed to be invariant to shifts of the input. However, convolution or pooling layers within a CNN that do not have a stride greater than one are equivariant, as opposed to invariant, to translations of the input.[61] Layers with a stride greater than one ignore the Nyquist-Shannon sampling theorem and can lead to aliasing of the input signal, which breaks the equivariance (also referred to as covariance) property.[61] Furthermore, if a CNN makes use of fully connected layers, translation equivariance does not imply translation invariance, as the fully connected layers are not invariant to shifts of the input.[73][4] One solution for complete translation invariance is avoiding any down-sampling throughout the network and applying global average pooling at the last layer.
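A numerical sketch of equivariance for a stride-1 convolution; circular padding and the 2-pixel shift are assumptions chosen so the check is exact and free of border effects:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, padding_mode='circular', bias=False)

x = torch.randn(1, 1, 8, 8)
shifted_x = torch.roll(x, shifts=2, dims=-1)  # shift the input 2 pixels to the right

out_then_shift = torch.roll(conv(x), shifts=2, dims=-1)
shift_then_out = conv(shifted_x)

# With stride 1, the two orders agree: the feature map shifts along with the input.
print(torch.allclose(out_then_shift, shift_then_out, atol=1e-6))  # True
```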
Dropout
It reduces overfitting in fully connected layers. The technique seems to reduce node interactions, leading them to learn more robust features that better generalize to new data.
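A minimal dropout sketch in PyTorch, assuming hypothetical layer sizes and a drop probability of 0.5:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),  # each hidden unit is zeroed with probability 0.5 during training
    nn.Linear(256, 10),
)

x = torch.randn(4, 1, 28, 28)
model.train()          # dropout active: a different random subset is dropped each forward pass
print(model(x).shape)  # torch.Size([4, 10])
model.eval()           # dropout disabled at inference (inverted dropout, so no extra rescaling here)
```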
A major drawback of dropout is that it does not have the same benefits for convolutional layers, where the neurons are not fully connected. In stochastic pooling,[80] the conventional deterministic pooling operations are replaced with a stochastic procedure, where the activation within each pooling region is picked randomly according to a multinomial distribution given by the activities within the pooling region.
Input images can also be deformed (data augmentation) to make the model more robust.
L1, L2, and Elastic Net regularization practices
L2 regularization is the most common form of regularization. It can be implemented by penalizing the squared magnitude of all parameters directly in the objective. The L2 regularization has the intuitive interpretation of heavily penalizing peaky weight vectors and preferring diffuse weight vectors. Due to multiplicative interactions between weights and inputs this has the useful property of encouraging the network to use all of its inputs a little rather than some of its inputs a lot.
L1 regularization is also common. It makes the weight vectors sparse during optimization. In other words, neurons with L1 regularization end up using only a sparse subset of their most important inputs and become nearly invariant to the noisy inputs. L1 can be combined with L2 regularization; this is called Elastic net regularization.
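A sketch of adding both penalties to a loss in PyTorch; the model, data, and coefficients are assumptions for illustration (plain L2 is more often applied through the optimizer's weight_decay argument):

```python
import torch
import torch.nn as nn

model = nn.Linear(20, 1)
criterion = nn.MSELoss()
l1_lambda, l2_lambda = 1e-4, 1e-3

x, y = torch.randn(8, 20), torch.randn(8, 1)
data_loss = criterion(model(x), y)

l1_penalty = sum(p.abs().sum() for p in model.parameters())   # pushes weights toward exact zeros (sparsity)
l2_penalty = sum(p.pow(2).sum() for p in model.parameters())  # prefers small, diffuse weight vectors

# Elastic-net-style objective: data loss plus both penalties.
loss = data_loss + l1_lambda * l1_penalty + l2_lambda * l2_penalty
loss.backward()
```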
Transfer learning
Convolutional neural networks usually require a large amount of training data in order to avoid overfitting. A common technique is to train the network on a larger data set from a related domain. Once the network parameters have converged, an additional training step is performed using the in-domain data to fine-tune the network weights; this is known as transfer learning. Furthermore, this technique allows convolutional network architectures to successfully be applied to problems with tiny training sets.
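A minimal fine-tuning sketch using a pretrained torchvision ResNet-18 (recent torchvision API assumed; the 10-class head and learning rate are assumptions for illustration):

```python
import torch.nn as nn
import torch.optim as optim
from torchvision import models

# Start from weights learned on a large related dataset (ImageNet).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pretrained feature extractor.
for param in model.parameters():
    param.requires_grad = False

# Replace the classifier head for the small in-domain task (10 classes assumed).
model.fc = nn.Linear(model.fc.in_features, 10)

# Only the new head's parameters are updated during fine-tuning.
optimizer = optim.Adam(model.fc.parameters(), lr=1e-3)
```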
Multicollinearity
Can be detected with the variance inflation factor (VIF).
Also R², chi-squared tests, and possibly pairplots if the features are numeric.
The variance of the parameter estimates will be large, and the estimates are no longer reliable because you cannot hold all other predictors constant while estimating the effect of one predictor. In tree-based models, feature importances are affected: a random forest will dilute the importance across the correlated features, while XGBoost will tend to randomly pick one of them.
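A short VIF check with statsmodels; the DataFrame below is a hypothetical example with one nearly collinear pair, and the usual rule of thumb flags VIF values above roughly 5-10:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=200)})
df["x2"] = 0.9 * df["x1"] + rng.normal(scale=0.1, size=200)  # nearly collinear with x1
df["x3"] = rng.normal(size=200)                              # independent predictor

X = sm.add_constant(df)
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)  # x1 and x2 show inflated VIFs; x3 stays close to 1
```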
Deal with missing data
Interpolation; dropping samples; iterative imputation (MICE); nearest-neighbour imputation; denoising autoencoders in high dimensions; fancy imputation.
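A minimal sketch of two of these in scikit-learn (the toy array is an assumption; IterativeImputer is scikit-learn's MICE-style imputer and still needs the experimental import):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, required to unlock IterativeImputer
from sklearn.impute import IterativeImputer, KNNImputer

X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [np.nan, 8.0]])

# MICE-style iterative imputation: each feature is modeled from the others.
print(IterativeImputer(random_state=0).fit_transform(X))

# Nearest-neighbour imputation: missing values come from the most similar rows.
print(KNNImputer(n_neighbors=2).fit_transform(X))
```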
Metrics for classification and regression
Classification: F1 score, ROC curves, AUC, log loss, categorical cross-entropy, Brier score. Log loss is the one to use if you want to take predicted class probabilities into account.
Regression: MSE, RMSE, MAE, MAPE, R².
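A short sketch computing a few of these with scikit-learn; the toy labels and predictions are assumptions for illustration:

```python
from sklearn.metrics import f1_score, roc_auc_score, log_loss, mean_squared_error, r2_score

# Classification: hard labels for F1, predicted probabilities for AUC and log loss.
y_true = [0, 1, 1, 0, 1]
y_prob = [0.2, 0.8, 0.6, 0.3, 0.9]
y_pred = [int(p >= 0.5) for p in y_prob]
print(f1_score(y_true, y_pred), roc_auc_score(y_true, y_prob), log_loss(y_true, y_prob))

# Regression: squared error plus explained variance.
y_true_r = [3.0, 2.5, 4.0, 5.5]
y_pred_r = [2.8, 2.9, 4.2, 5.0]
print(mean_squared_error(y_true_r, y_pred_r), r2_score(y_true_r, y_pred_r))
```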
Class imbalance
Random oversampling or undersampling, SMOTE, or passing class weights to the model.
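A sketch of two of these options; SMOTE comes from the third-party imbalanced-learn package (assumed installed), while class weights are built into many scikit-learn estimators:

```python
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print(Counter(y))  # roughly 900 vs 100

# Option 1: synthesize minority-class samples with SMOTE.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y_res))  # classes balanced

# Option 2: keep the data as-is and reweight the loss instead.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
```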
Categorical feature encoding
One-hot encoding (multicollinearity problem if all dummy columns are kept), label encoding, entity embeddings, CatBoost encoder, mean encoding, and other target encodings.
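A short sketch of two of these in pandas; the tiny DataFrame is hypothetical, and drop_first avoids the redundant dummy column behind the multicollinearity issue mentioned above:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["paris", "tokyo", "paris", "oslo"],
    "target": [1, 0, 1, 0],
})

# One-hot encoding; drop_first=True removes one redundant dummy column.
print(pd.get_dummies(df["city"], prefix="city", drop_first=True))

# Simple mean (target) encoding: replace each category with the target mean.
# In practice, fit this on training folds only to avoid target leakage.
df["city_mean_enc"] = df["city"].map(df.groupby("city")["target"].mean())
print(df)
```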
Feature selection
Hi
PU learning
Lol
Cluster mixed data
k-modes
ROCK
k-prototypes (sketched below)
Convert categorical features to numerical, then apply traditional methods
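A minimal k-prototypes sketch using the third-party kmodes package (assumed installed); the toy array and column choices are assumptions for illustration:

```python
import numpy as np
from kmodes.kprototypes import KPrototypes  # pip install kmodes

# Mixed data: column 0 is numeric, columns 1-2 are categorical.
X = np.array([
    [25.0, "red",   "small"],
    [27.0, "red",   "small"],
    [61.0, "blue",  "large"],
    [58.0, "green", "large"],
], dtype=object)

kproto = KPrototypes(n_clusters=2, init="Cao", n_init=1, random_state=0)
labels = kproto.fit_predict(X, categorical=[1, 2])  # tell it which columns are categorical
print(labels)
```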
Dimension reduction
PCA
Biclustering
Hi
LSTM and RNN
Ho