All Flashcards
What types of data are there?
Unstructured data, such as images, videos, audio, and text, is often categorized as qualitative data. It cannot be processed or analyzed using conventional data tools and methods.
Structured data, e.g. tensors or tables, stores n-dimensional data.
Semi-structured data, such as JSON files.
Metadata: data that describes other data, such as variable names.
What is data augmentation?
A technique to generate more samples from existing ones, e.g. by flipping, rescaling, rotation, or applying filters (such as a thermal filter).
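A minimal sketch of common augmentations, assuming torchvision is available; the transform names are standard torchvision APIs, while the parameter values are illustrative choices:

```python
from torchvision import transforms

# Each transform randomly perturbs the image, so applying the pipeline
# to the same image repeatedly yields different training samples.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),    # flipping
    transforms.RandomResizedCrop(224),         # rescaling / cropping
    transforms.RandomRotation(degrees=15),     # rotation
    transforms.ColorJitter(brightness=0.2),    # filter-like color change
    transforms.ToTensor(),
])
```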
What is data reduction?
Removing noisy data to improve the model and get better accuracy. It reduces the complexity of the model, so it trains faster.
Feature selection: remove unimportant features (sketched below).
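A minimal sketch of feature selection with scikit-learn; SelectKBest and f_classif are real scikit-learn APIs, while the synthetic dataset and k=10 are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 50 features, of which only 5 are actually informative.
X, y = make_classification(n_samples=200, n_features=50, n_informative=5)

selector = SelectKBest(score_func=f_classif, k=10)  # keep the 10 best features
X_reduced = selector.fit_transform(X, y)            # shape (200, 10)
```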
What is an autoencoder?
Purpose:
Autoencoders are primarily used for unsupervised learning tasks, such as data compression, dimensionality reduction, anomaly detection, and feature learning.
The main goal of an autoencoder is to learn a compact, efficient representation (encoding) of input data and then reconstruct the input data from this encoding as accurately as possible.
Use Cases:
Dimensionality reduction (similar to PCA, but nonlinear).
Data denoising (denoising autoencoders).
Anomaly detection (reconstruction error highlights anomalies).
Generative modeling (e.g., variational autoencoders).
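A minimal autoencoder sketch in PyTorch; the layer sizes (784 → 32 → 784, e.g. for flattened 28×28 images) are illustrative choices:

```python
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder compresses the input to a 32-dimensional code.
        self.encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(),
                                     nn.Linear(128, 32))
        # Decoder reconstructs the input from the code.
        self.decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(),
                                     nn.Linear(128, 784), nn.Sigmoid())

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Training minimizes a reconstruction loss, e.g. nn.MSELoss()(model(x), x).
```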
What is a transformer?
Purpose:
Transformers are designed for handling sequential data and are particularly powerful for tasks involving natural language processing (NLP), such as machine translation, text summarization, and language modeling.
They use self-attention mechanisms to model relationships between all elements in a sequence simultaneously, rather than relying on sequential processing like RNNs.
Use Cases:
Language translation (e.g., Google Translate).
Text generation (e.g., GPT models).
What are the core things in a transformer?
- Encoder-decoder stack
- Self-attention mechanism (sketched after this list)
- Positional Encoding
- Feed-forward network
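A minimal sketch of the scaled dot-product self-attention at the heart of the transformer; the tensor shapes (batch, sequence length, model dimension) are illustrative:

```python
import math
import torch

def self_attention(q, k, v):
    # attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (batch, seq, seq)
    weights = torch.softmax(scores, dim=-1)            # each token weighs all others
    return weights @ v

x = torch.randn(2, 10, 64)     # 2 sequences, 10 tokens, d_model = 64
out = self_attention(x, x, x)  # self-attention: Q, K, V all derived from x
```

(In a real transformer, Q, K, and V are linear projections of x and multiple attention heads run in parallel.)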
What are the criteria (needed for a working model) to model sequences?
- Handle variable length sequences.
- Track long term dependencies.
- Maintain information about order.
- Share parameters across the sequence.
These criteria are needed for the model to generate new possible outcomes.
What are benefits of base models in computer vision?
A base model improves the overall accuracy of the model, because it has been trained on thousands or even millions of images, which means the feature-extraction part of the CNN is already well trained. All you have to do afterwards is retrain the model on the new subset of images you want and fine-tune the last layer (or perhaps the last two). If you do not have much data and trained only on the subset you have, there is a big chance the model would overfit to your data. With this come some obvious benefits: reduced training time (it takes a lot of time to train a new model from scratch), it works well with fewer samples, you can add classes to an already existing classifier, and it consumes less power. A sketch of the setup follows.
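A minimal sketch of this fine-tuning setup with torchvision; resnet18 and the weights enum are real torchvision APIs, while num_classes = 5 is an illustrative value for your own subset:

```python
import torch.nn as nn
from torchvision import models

# Load a base model pretrained on ImageNet.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the well-trained feature-extraction layers.
for param in model.parameters():
    param.requires_grad = False

# Replace and retrain only the final classification layer.
num_classes = 5
model.fc = nn.Linear(model.fc.in_features, num_classes)
```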
Difference between discriminative and generative models?
Generative: model the data distribution itself (e.g. the joint distribution of inputs and labels) and capture the correlations in it.
Discriminative: learn the differences between classes and ignore the correlations, i.e. divide the data with a line/plane (a decision boundary).
What are GRUs and LSTMs?
Both of these are variants of RNNs, with the difference that they keep more of the context through gating. They are good at forecasting time series and at sequence modeling in general.
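A minimal sketch of time-series forecasting with an LSTM in PyTorch; the sizes (univariate series, hidden size 32) are illustrative:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=1, hidden_size=32, batch_first=True)
head = nn.Linear(32, 1)           # predict the next value

x = torch.randn(8, 20, 1)         # 8 series, 20 time steps, 1 feature
out, (h_n, c_n) = lstm(x)         # hidden/cell states carry the context forward
next_value = head(out[:, -1, :])  # forecast from the last hidden state
```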
Compare LSTMs and transformers.
LSTMs process input sequentially, carrying forward hidden states (long-term/short-term memory) to capture information from previous steps. They have a recurrent structure in which the output is fed back into the network. They also have special gates to mitigate the problem of vanishing/exploding gradients.
Transformers use parallel processing rather than sequential processing. The self-attention mechanism allows each input (e.g. a word) to attend to the other words in the sequence, regardless of its position. Self-attention is what gives a transformer its ability to capture long-range relationships: each token in a sequence computes attention with every other token, allowing the model to weigh their importance. Since transformers process inputs in parallel, positional encoding is added to the input embeddings to give the model a sense of position within the sequence (sketched below).
Transformers are often better because they are more effective at handling long-range dependencies, more efficient thanks to parallelism, and yield richer representations thanks to attention.
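A minimal sketch of the sinusoidal positional encoding used in the original transformer paper; seq_len and d_model are illustrative values:

```python
import math
import torch

def positional_encoding(seq_len, d_model):
    pos = torch.arange(seq_len).unsqueeze(1).float()   # (seq_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2).float()
                    * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)  # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)  # odd dimensions
    return pe                           # added to the input embeddings

pe = positional_encoding(seq_len=10, d_model=64)
```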
Describe vanishing and exploding gradient.
Vanishing gradient is when the gradient of the loss function with respect to the model parameters becomes very small. It causes the weights to update very little at a time, resulting in slow learning. Because backpropagation multiplies derivatives via the chain rule, the updates become smaller and smaller towards the earlier layers. It usually becomes a problem when the network uses activation functions like sigmoid or tanh: since they squash their input into a narrow range (sigmoid into 0-1), their derivatives are small and the gradient shrinks exponentially. A solution could be to use ReLU instead.
Exploding gradient is instead when the gradient of the loss function is large relative to the parameters, resulting in instability in the learning process. Poor initialization of the weights can lead to large values being passed through the layers, and if the gradients are large, their product quickly escalates when propagated backwards. Solutions could be better weight initialization and gradient clipping.
A solution for both could be batch normalization. It normalizes the output of a previous activation layer by subtracting the batch mean and dividing by the batch standard deviation; after normalization, the data is scaled and shifted using learnable parameters. By normalizing the input to each layer, it helps maintain a stable gradient flow, especially in deep networks. Vanishing/exploding gradients are mostly a problem for RNNs. A sketch of gradient clipping follows.
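A minimal sketch of gradient clipping inside a PyTorch training step; clip_grad_norm_ is a real torch utility, while max_norm=1.0 is an illustrative threshold:

```python
import torch

def training_step(model, loss_fn, optimizer, x, y):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    # Rescale gradients whose overall norm exceeds 1.0 to prevent explosion.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item()
```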
DL vs ML?
Traditional ML involves feature engineering: the better the extraction of relevant features, the better the algorithm performs. Deep learning, on the other hand, can have hundreds of successive layers of representation; thus, deep learning completely automates feature engineering.
Denoising autoencoder vs a normal one?
A Denoising Autoencoder (DAE) is a type of autoencoder designed specifically to handle noisy data by learning to reconstruct a clean output from a corrupted input. It extends the basic concept of a regular autoencoder by introducing noise to the input data during the training process and requiring the model to learn how to remove or “denoise” this noise.
Normal autoencoder:
It tries to learn, as accurately and efficiently as possible, a representation (encoding) of the input data. An example is a translator: not translating word for word, but accurately capturing the meaning (representation) of the sentence and decoding it, in this example into the target language.
Denoising autoencoder:
A denoising autoencoder learns a robust representation of the original, clean data from a corrupted version of the input. This makes the model more resilient to noise and can help it generalize: it learns to remove the noise, in contrast to the normal autoencoder, which is more affected by noise in the data. A training sketch follows.
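A minimal sketch of the denoising training step, reusing the Autoencoder class sketched earlier; the noise level 0.2 is an illustrative choice:

```python
import torch

def denoising_step(model, loss_fn, optimizer, x):
    noisy = x + 0.2 * torch.randn_like(x)  # corrupt the input with Gaussian noise
    optimizer.zero_grad()
    loss = loss_fn(model(noisy), x)        # but reconstruct the CLEAN target
    loss.backward()
    optimizer.step()
    return loss.item()
```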
What is the bottleneck layer of an autoencoder?
The bottleneck layer forces the model to compress the input to a lower-dimensional representation. This compression reduces the amount of information the network can use to reconstruct the input, compelling the model to learn the most important and informative features. It basically acts as a feature extractor. It keeps the model from simply memorizing the input as-is and encourages it to generalize by focusing on patterns and features.