Machine Learning Flashcards

To cover ML models and vocabulary I need to know

1
Q

Mamba

A
A selective state-space model (SSM) architecture for sequence modeling, in which the SSM parameters are functions of the input (the "selection" mechanism).

2
Q

State-space models

A

Models that describe the relationship between some hidden (unknown) variables and their observed measurements. They help us analyze time series problems that involve dynamical systems, and are widely used in statistics, econometrics, engineering, computer science, and finance.

Ex: GPS measures time-of-arrival of signals from satellites and uses it to infer two hidden variables: position and velocity

3
Q

State equation and measurement equation

A

Two important equations in state-space models. State equation describes the development of the hidden variable over time. Measurement equation describes the relationship between the measurement (observed variable) and the state (hidden variable). The variables in these equations can be scalars or vectors.
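
For concreteness, a minimal sketch of the linear-Gaussian special case (the notation x_t, y_t, A, C, Q, R is illustrative, not from the card):

    % State equation: evolution of the hidden state x_t
    x_{t+1} = A x_t + w_t, \qquad w_t \sim \mathcal{N}(0, Q)
    % Measurement equation: observation y_t generated from the hidden state
    y_t = C x_t + v_t, \qquad v_t \sim \mathcal{N}(0, R)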

4
Q

Kalman gain

A

In state-space modeling, the Kalman gain is the weight given to the measurements vs. the model’s current-state estimate. It can be “tuned” to optimize performance.

5
Q

Kalman Filter

A

The Kalman Filter is an algorithm that merges noisy measurements with a predictive model to estimate the state of a system over time. It involves two primary steps: prediction, using the state transition matrix (F) and process noise covariance (Q) to forecast the next state and its uncertainty (P); and update, where the prediction is refined using new measurements and their uncertainty (R), adjusted by the Kalman Gain (K). This process iteratively refines the state estimate (x) and its uncertainty (P), making it essential for real-time estimation in systems with uncertainty, like navigation and tracking.
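
As a rough sketch (a 1-D example with arbitrarily chosen F, Q, H, R, not a production implementation), the predict/update loop might look like:

    import numpy as np

    # Illustrative 1-D Kalman filter: scalar state, noisy readings of a constant value
    F, Q = 1.0, 1e-3        # state transition and process-noise variance
    H, R = 1.0, 0.5 ** 2    # measurement model and measurement-noise variance
    x, P = 0.0, 1.0         # initial state estimate and its uncertainty

    rng = np.random.default_rng(0)
    measurements = 5.0 + 0.5 * rng.standard_normal(50)   # true value is 5.0

    for z in measurements:
        # Predict: propagate the state and its uncertainty through the model
        x_pred = F * x
        P_pred = F * P * F + Q
        # Update: blend prediction and measurement via the Kalman gain K
        K = P_pred * H / (H * P_pred * H + R)
        x = x_pred + K * (z - H * x_pred)
        P = (1 - K * H) * P_pred

    print(round(x, 2))   # converges toward 5.0 as measurements accumulate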

6
Q

RNN (Recurrent Neural Network): 4 principal structures

A

one-to-one; one-to-many; many-to-one; many-to-many (with matching or non-matching input/output lengths)

7
Q

Geometric Deep Learning

A

Umbrella term for approaches that consider a broad class of ML problems from the perspectives of symmetry and invariance. It provides a common blueprint from which neural network architectures as diverse as CNNs, GNNs, and Transformers can be derived from first principles. Physical measurements can have low-dimensional geometries (e.g. grids in images, sequences in time series, position and momentum in molecules) and associated symmetries (e.g. translation, rotation).

8
Q

Representation learning/feature learning

A
9
Q

SE(3)

A

Special Euclidean Group in 3 dimensions; the group of all possible simultaneous rotations and translations of a vector. SE(3) is often used as a mathematical framework to model the complex spatial arrangements of proteins. SE(3) invariance (where features remain unchanged under transformations) and equivariance (where features change in a predictable way under transformations) are important properties.

10
Q

Invariance and Equivariance

A

Invariance: I want the output to stay constant no matter how I transform the input

Equivariance: I want the output to undergo exactly the same transformation as applied to the input
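
A tiny numeric illustration (shifting a sequence is the transformation; the functions are made up for the example):

    import numpy as np

    x = np.array([1.0, -2.0, 3.0, 4.0])
    shift = lambda v: np.roll(v, 1)            # the input transformation

    # Invariant: the sum does not care how the elements are arranged
    assert np.isclose(np.sum(x), np.sum(shift(x)))

    # Equivariant: an elementwise op commutes with the shift, so the output
    # is transformed in exactly the same way as the input
    relu = lambda v: np.maximum(v, 0.0)
    assert np.allclose(relu(shift(x)), shift(relu(x)))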

11
Q

Zero-shot learning

A

Zero-shot learning involves training a model in such a way that it can perform tasks or make predictions on data it has never seen during training. It learns abstract representations that can generalize to new, unseen tasks. This is typically achieved by training the model on a diverse set of tasks or data and using techniques that encourage the learning of generalizable features.

Image classification example: in traditional machine learning, a model is trained to classify images of animals it has seen during training like cats, dogs, and birds. It can accurately identify these animals in the test set, but fails to recognize an animal like a zebra that wasn’t in the training set. In zero-shot design, the model could classify even animals not seen in training like a zebra. This is possible because it learns higher-level features (e.g. stripes, four legs) that can generalize beyond the training set.

Translation example: in traditional ML, you may train a model for English-to-French and French-to-English translation. This model cannot do English-to-German translation. A zero-shot model could do cross-lingual translation: if trained on English-French and English-German, it may be able to translate between French and German by generalizing the abstract linguistic concepts it has learned.

12
Q

Autoencoders (AEs)

A

A type of feedforward neural network designed to reconstruct the input data through an encoder-decoder mechanism, with a bottleneck layer in between that captures the essential features of the data
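
A minimal PyTorch sketch (layer sizes and names are arbitrary) showing the encoder, bottleneck, decoder shape and the reconstruction objective:

    import torch
    import torch.nn as nn

    class Autoencoder(nn.Module):
        def __init__(self, in_dim=784, bottleneck=32):
            super().__init__()
            # Encoder compresses the input down to the bottleneck representation
            self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                         nn.Linear(128, bottleneck))
            # Decoder reconstructs the input from the bottleneck
            self.decoder = nn.Sequential(nn.Linear(bottleneck, 128), nn.ReLU(),
                                         nn.Linear(128, in_dim))

        def forward(self, x):
            return self.decoder(self.encoder(x))

    model = Autoencoder()
    x = torch.rand(16, 784)                       # a dummy batch
    loss = nn.functional.mse_loss(model(x), x)    # reconstruct the input itself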

13
Q

Denoising Autoencoders (DAEs)

A

Autoencoders that add noise to the input data during training (before it is passed through the encoder). The decoder attempts to reconstruct the original data from the noisy representation, minimizing the reconstruction error between the original (clean) input and the output.

The primary objective is to learn a robust data representation by forcing the network to reconstruct the original, clean data from noisy versions of it. This encourages the autoencoder to capture meaningful and salient features while filtering out the noise, resulting in a more generalizable and informative representation.
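
Relative to the plain autoencoder sketch above, only the training step changes (the noise level sigma is arbitrary): the input is corrupted, but the loss targets the clean original.

    # Reusing model and x from the autoencoder sketch above
    sigma = 0.3
    x_noisy = x + sigma * torch.randn_like(x)           # corrupt the input
    loss = nn.functional.mse_loss(model(x_noisy), x)    # target is the clean input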

14
Q

Variational Autoencoder (VAE)

A

A VAE is an autoencoder whose encoding distribution is regularized during training to ensure that its latent space has good properties, allowing new data to be generated by sampling from it. The term “variational” comes from the close relation between the regularization and the variational inference method in statistics.
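
A sketch of the two ingredients that distinguish a VAE from a plain autoencoder (assuming an encoder that outputs a mean and log-variance; the names are illustrative): reparameterized sampling of the latent, and a KL term that regularizes the latent distribution toward a standard normal.

    import torch

    def reparameterize(mu, logvar):
        # Sample z = mu + sigma * eps so gradients flow through mu and logvar
        eps = torch.randn_like(mu)
        return mu + torch.exp(0.5 * logvar) * eps

    def vae_loss(x, x_recon, mu, logvar):
        # Reconstruction term: how well the decoder reproduces the input
        recon = torch.nn.functional.mse_loss(x_recon, x, reduction="sum")
        # Regularization term: KL divergence from N(mu, sigma^2) to N(0, I)
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return recon + kl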

15
Q

Dimensionality reduction

A

The process of reducing the number of features that describe some data. Can be done by:
- selection (only some existing features are conserved)
- extraction (a reduced number of new features are created based on old features)

Useful for data visualization, reducing storage requirements, speeding up heavy computation…
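
For example, with scikit-learn (the dataset and k are arbitrary), selection keeps a subset of the original columns while extraction builds new ones:

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.feature_selection import SelectKBest, f_classif

    X, y = load_iris(return_X_y=True)      # 4 original features

    X_sel = SelectKBest(f_classif, k=2).fit_transform(X, y)   # selection: keep 2 existing features
    X_ext = PCA(n_components=2).fit_transform(X)               # extraction: 2 new combined features
    print(X_sel.shape, X_ext.shape)        # both (150, 2)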

16
Q

“Lossy” compression

A

When the encoder compresses the data from the initial space into the encoded (latent) space, part of the information is LOST, and this information cannot be recovered when decoding.

17
Q

The manifold hypothesis

A

The hypothesis that natural data forms lower-dimensional manifolds in its embedding space

18
Q

Manifold learning

A

An approach to non-linear dimensionality reduction. Algorithms for this task are based on the idea that the dimensionality of many datasets is only artificially high. Manifold learning aims to capture non-linear structure in data, which is often missed by PCA, Independent Component Analysis, Linear Discriminant Analysis, and others.

Manifold learning methods include: Isomap, Locally Linear Embedding (LLE), Modified LLE (MLLE), Hessian Eigenmapping (HLLE), Spectral Embedding, Local Tangent Space Alignment, Multi-Dimensional Scaling (MDS), t-distributed Stochastic Neighbor Embedding (t-SNE)
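
A quick sketch with scikit-learn's built-in S-curve (parameters are arbitrary): Isomap unrolls a 3-D manifold into 2 dimensions.

    from sklearn.datasets import make_s_curve
    from sklearn.manifold import Isomap

    X, color = make_s_curve(n_samples=1000, random_state=0)    # points lying on a 3-D "S" surface
    X_2d = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
    print(X.shape, "->", X_2d.shape)                            # (1000, 3) -> (1000, 2)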

19
Q

Autoencoders vs. Diffusers

A

Similarities:
- Learning paradigm: given some data as input, reproduce it as output, and learn the data manifold in the process
- Architectures: both use bottleneck layers
- Denoising AEs corrupt the input and learn to denoise it, like diffusion models

Differences:
- Diffusion conditions on the timestep (t) as input. This allows a single diffusion model - and a single set of parameters - to handle different noise levels. As a result, a single diffusion model can generate (blurry) images from noise at high t and then sharpen them at lower t.

20
Q

Gaussian DSM (Denoising Score Matching)

A

Used in standard diffusion models: the data is corrupted with Gaussian noise, and the network is trained to match the score of the Gaussian corruption kernel.
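
A sketch of the standard objective (the notation s_theta, sigma, q is assumed, not from the card): corrupt x with Gaussian noise to get x-tilde, then regress the model's score onto the known score of the corruption kernel.

    % Denoising score matching with Gaussian corruption \tilde{x} = x + \sigma \epsilon
    \mathcal{L}(\theta) = \mathbb{E}_{x,\epsilon}
        \big\| s_\theta(\tilde{x}) - \nabla_{\tilde{x}} \log q(\tilde{x} \mid x) \big\|^2,
    \qquad \nabla_{\tilde{x}} \log q(\tilde{x} \mid x) = -\frac{\tilde{x} - x}{\sigma^2}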

21
Q

Score Matching

A

An approach to unsupervised learning that focuses on density estimation. It estimates the underlying probability density function of the data through its score function, i.e. the gradient of the log-density with respect to the data, which the model learns to predict directly.

Score Matching minimizes the discrepancy (the Fisher divergence) between the score of the model distribution and the score of the data distribution. By minimizing this divergence, it learns model parameters that approximate the true data distribution. It does not require assumptions about the form of the data distribution, making it powerful and flexible.

22
Q

Dilated convolutional neural network

A

Feed-forward deep neural network that aggregates long-range dependencies in sequences over an exponentially large receptive field
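
A PyTorch sketch (channel counts and dilations are arbitrary): stacking 1-D convolutions whose dilation doubles at each layer makes the receptive field grow exponentially with depth.

    import torch
    import torch.nn as nn

    # Dilation doubles per layer, so 4 layers of kernel size 3 already
    # cover a receptive field of 1 + 2*(1+2+4+8) = 31 timesteps
    layers = []
    for d in (1, 2, 4, 8):
        layers += [nn.Conv1d(16, 16, kernel_size=3, dilation=d, padding=d), nn.ReLU()]
    net = nn.Sequential(*layers)

    x = torch.randn(1, 16, 128)    # (batch, channels, sequence length)
    print(net(x).shape)            # torch.Size([1, 16, 128]); padding keeps the length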

23
Q

Logit

A

Unnormalized log-probability: the raw score a model outputs before softmax (or sigmoid) normalization.
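
For instance (made-up scores), a softmax turns logits into normalized probabilities:

    import numpy as np

    logits = np.array([2.0, 1.0, -1.0])              # raw, unnormalized scores
    probs = np.exp(logits) / np.exp(logits).sum()    # softmax normalization
    print(probs.round(3))                            # [0.705 0.259 0.035], sums to 1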

24
Q

Orthogonal Finetuning (OFT)

A

Neurons in the same layer are transformed by the same orthogonal matrix such that pairwise angles between neurons are provably preserved throughout the fine-tuning process
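
A quick numeric check of the angle-preservation claim (the orthogonal matrix here is random rather than learned as in OFT):

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.standard_normal((4, 8))                     # 4 "neurons" (rows) of a layer
    R, _ = np.linalg.qr(rng.standard_normal((8, 8)))    # a random orthogonal matrix
    W_ft = W @ R                                        # rotate every neuron the same way

    def cosine_angles(M):
        Mn = M / np.linalg.norm(M, axis=1, keepdims=True)
        return Mn @ Mn.T                                # pairwise cosine similarities

    print(np.allclose(cosine_angles(W), cosine_angles(W_ft)))   # True: angles preserved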

25
Q

Model finetuning

A

The model architecture remains unchanged and a subset of model parameters get fine-tuned to improve performance on a specific task

26
Q

Adapter tuning

A

Additional trainable parameters are added to the original model and these are trained to improve model performance on a specific task

27
Q

Prompt tuning

A

Additional trainable prefix tokens are attached to the input and these are trained to improve model performance on a specific task

28
Q

Block-diagonal matrix

A

A matrix with square blocks along the diagonal and zeros everywhere else, i.e. blocks rather than single elements along the diagonal.

  • Application: Orthogonal Finetuning uses block-diagonal orthogonal matrices (each block R satisfies R·Rᵀ = I) to reduce the number of parameters, but there is a price: this matrix has a fixed sparsity pattern, which may introduce inductive biases.
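
For example (block sizes chosen arbitrarily), SciPy can assemble one directly:

    import numpy as np
    from scipy.linalg import block_diag

    A = np.ones((2, 2))
    B = 2 * np.ones((3, 3))
    M = block_diag(A, B)    # 5x5: the two blocks sit on the diagonal, zeros elsewhere
    print(M)
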
29
Q

Sparse matrix

A

A matrix in which most elements are zero

30
Q

Dense matrix

A

A matrix in which most elements are nonzero
