Deep Learning: Introduction and Mathematical Foundations Flashcards
Conceptual, analogy, and reference flashcards for Deep Learning
What is the key idea of Artificial Intelligence (AI)?
Introduction
“Machines that think”. It refers to the idea of creating machines that can perform tasks requiring intelligence, like learning, reasoning, or problem-solving.
Why are abstract and formal tasks easy for machines but hard for humans?
Introduction
Abstract tasks, like playing chess, involve well-defined rules and patterns that machines can follow easily, whereas humans struggle with the computational aspects.
What is machine learning?
Introduction
Machine learning is the acquisition of knowledge from raw data by identifying patterns and learning from them.
Why is data representation important in machine learning?
Introduction
Data is formatted into representations and features to make pattern recognition easier, though identifying the most beneficial features can be challenging.
What is representation learning?
Introduction
Representation learning uses machine learning to map raw data into meaningful representations, separating factors of variation. An example is an autoencoder (Encoder + Decoder).
What is the role of an encoder and decoder in an autoencoder?
Introduction
The encoder converts input data into a representation, and the decoder reconstructs the input from that representation.
What is deep learning, and how does it differ from other types of machine learning?
Introduction
Deep learning is a subfield of machine learning that extracts abstract features through multiple layers, including visible (simple) and hidden (abstract) layers.
What historical concepts influenced deep learning architectures?
Introduction
Concepts like cybernetics, connectionism, and artificial neural networks were inspired by neuroscience.
What does “model depth” mean in deep learning?
Introduction
Model depth is the number of sequential instructions or computations required to evaluate an output.
Why can’t linear models learn the XOR function?
Introduction
Linear models cannot capture the non-linearity required to separate XOR inputs, as they rely on straight-line decision boundaries.
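To make this concrete, here is a minimal numpy sketch (my own illustration, not from the source): the best least-squares linear fit to XOR predicts 0.5 for every input, so no straight-line boundary separates the classes.
```python
import numpy as np

# XOR inputs and targets
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0., 1., 1., 0.])

# Augment with a bias column and fit a linear model by least squares
A = np.hstack([X, np.ones((4, 1))])
w, *_ = np.linalg.lstsq(A, y, rcond=None)

print(A @ w)  # -> [0.5 0.5 0.5 0.5]: the best line predicts 0.5 everywhere
```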
How has the understanding of biological neurons influenced modern neural networks?
Introduction
Modern neural networks adopt the idea that increased interconnections between neurons lead to more intelligent systems, though they diverge from actual biological neurons.
What is the modern neuron architecture used in neural networks?
Introduction
The Rectified Linear Unit (ReLU) is the standard modern neuron architecture for neural networks.
What is the significance of deep belief networks?
Introduction
The Deep Belief Network (DBN), introduced by Geoffrey Hinton in 2006, revolutionized deep learning. Its greedy layer-wise pretraining with Restricted Boltzmann Machines (RBMs) addressed vanishing gradients, enabled unsupervised feature learning, and made deep networks trainable, sparking the modern era of deep learning.
How did big data impact machine learning?
Introduction
Big data provided vast amounts of information, making it easier for machine learning models to learn and improve performance.
How are model size and performance related to computer capabilities?
Introduction
Model size and model performance have grown in step with the computational power of modern computers; faster hardware makes larger, better-performing models practical.
What is a logical inference machine?
Introduction
A system that reasons automatically about formal statements using logical inference rules.
What is a knowledge base in AI?
Introduction
A database of formally defined facts and rules used for reasoning in AI systems.
What is logistic regression?
Introduction
A simple machine learning algorithm used for binary classification problems.
What is Naive Bayes?
Introduction
A probabilistic classifier based on Bayes’ theorem, assuming independence between features.
What is a multilayer perceptron (MLP)?
Introduction
A feedforward neural network consisting of multiple layers of neurons with activation functions, enabling hierarchical learning.
How is model depth measured in neural networks?
Introduction
By counting the number of sequential computations or layers in the network.
What is the McCulloch-Pitts neuron?
Introduction
A simple mathematical model of a biological neuron used in early neural network research.
What is ADALINE?
Introduction
Adaptive Linear Neuron, an early machine learning model using linear activation and adaptive weights.
What is stochastic gradient descent (SGD)?
An optimization algorithm that updates model parameters using the gradient of a randomly selected subset of data points.
What is the Neocognitron?
A hierarchical, convolutional neural network model inspired by the human visual system.
What is parallel distributed processing (connectionism)?
A perspective in AI emphasizing the distributed representation of knowledge across interconnected units.
What are distributed representations in deep learning?
Representations where information is encoded across multiple neurons rather than a single unit.
What is an LSTM model?
Long Short-Term Memory model, a type of recurrent neural network designed to capture long-term dependencies in sequential data.
What is reinforcement learning?
A machine learning paradigm where an agent learns to maximize rewards through interactions with its environment.
What are the different components of a tensor space?
A scalar is a single numerical value.
A vector is a one-dimensional collection of values, typically representing magnitude and direction in space.
A matrix is a two-dimensional collection of values organized in rows and columns (equivalently, a one-dimensional collection of vectors).
A tensor is an $n$-dimensional collection of values, generalizing scalars, vectors, and matrices to higher dimensions.
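A quick numpy illustration (an assumed sketch, not from the source): these objects differ only in their number of dimensions.
```python
import numpy as np

scalar = np.array(3.14)                      # 0-D: a single value
vector = np.array([1.0, 2.0, 3.0])           # 1-D: a list of values
matrix = np.array([[1.0, 2.0], [3.0, 4.0]])  # 2-D: rows and columns
tensor = np.zeros((2, 3, 4))                 # 3-D: a general n-D array

for t in (scalar, vector, matrix, tensor):
    print(t.ndim, t.shape)
```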
What are some important matrix properties? (Part 1)
Transpose: Flips rows and columns.
Inverse: The matrix $A^{-1}$ such that $AA^{-1} = I$, where $I$ is the identity matrix.
Square Matrix: Equal number of rows and columns.
Singular: A matrix without an inverse (determinant = 0).
Diagonal Matrix: Non-diagonal elements are zero.
What are some important matrix properties? (Part 2)
Symmetric: $A = A^T$
Orthogonal: $AA^T = I$
Orthonormal: The vectors in $A$ are unit length and mutually orthogonal
Trace: Sum of diagonal elements
Unit Vector: A vector of magnitude 1
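These properties are easy to verify numerically; a small numpy sketch (matrix values are my own illustration):
```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

print(A.T)                                           # transpose: rows <-> columns
print(np.allclose(A @ np.linalg.inv(A), np.eye(2)))  # A A^{-1} = I
print(np.allclose(A, A.T))                           # symmetric: A = A^T
print(np.trace(A))                                   # trace: sum of diagonal elements

# A rotation matrix is orthogonal: Q Q^T = I
theta = np.pi / 4
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
print(np.allclose(Q @ Q.T, np.eye(2)))
```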
What are positive and negative definite matrices?
A matrix 𝐴 is positive definite if:
1. All its eigenvalues are strictly positive ($\lambda_i > 0$).
2. For any non-zero vector $x$, the quadratic form $x^T A x > 0$.
A matrix $A$ is negative definite when the reverse holds ($\lambda_i < 0$ and $x^T A x < 0$ for all non-zero $x$).
Positive definite matrices often appear in optimization, where they ensure that a function has a unique minimum.
Negative definite matrices indicate that a quadratic function is concave and has a unique maximum.
What are positive and negative semi-definite matrices?
A matrix 𝐴 is positive semi-definite if:
1. All its eigenvalues are non-negative ($\lambda_i \ge 0$).
2. For any vector $x$, the quadratic form $x^T A x \ge 0$.
A matrix $A$ is negative semi-definite when the reverse holds ($\lambda_i \le 0$ and $x^T A x \le 0$).
Positive semi-definite matrices represent systems where the quadratic form cannot be negative but might equal zero for some $x$.
Negative semi-definite matrices indicate that a quadratic function is concave, though its maximum may not be unique.
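For symmetric matrices, definiteness can be read off from the signs of the eigenvalues; a minimal sketch (the helper name and tolerance are my own assumptions):
```python
import numpy as np

def classify_definiteness(A, tol=1e-10):
    """Classify a symmetric matrix by the signs of its eigenvalues."""
    eigvals = np.linalg.eigvalsh(A)  # eigenvalues of a symmetric matrix
    if np.all(eigvals > tol):
        return "positive definite"
    if np.all(eigvals >= -tol):
        return "positive semi-definite"
    if np.all(eigvals < -tol):
        return "negative definite"
    if np.all(eigvals <= tol):
        return "negative semi-definite"
    return "indefinite"

print(classify_definiteness(np.array([[2.0, 0.0], [0.0, 3.0]])))   # positive definite
print(classify_definiteness(np.array([[1.0, 0.0], [0.0, 0.0]])))   # positive semi-definite
print(classify_definiteness(np.array([[1.0, 0.0], [0.0, -1.0]])))  # indefinite
```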
How does a linear combination operator relate to solving systems of equations?
The span of a set of vectors is the set of all values obtainable through linear combinations of those vectors. Whether $b$ lies in the span of the columns of $A$ determines whether the system $Ax = b$ has one, infinitely many, or no solutions. For two vectors, the combinations $z = \alpha x + (1 - \alpha) y$ trace the line through $x$ and $y$.
How is a linear system represented mathematically?
A linear system can be written as $Ax = b$, where $A \in \mathbb{R}^{m \times n}$, $x \in \mathbb{R}^n$, and $b \in \mathbb{R}^m$. A unique solution exists when the columns of $A$ are linearly independent and $b$ lies in their span.
What is a norm in linear algebra?
A norm measures the “size” or “span” of a vector in space. Common norms include:
1. L2 (Euclidean Norm): $\|x\|_2 = \sqrt{\sum_i x_i^2}$
2. L1 (Manhattan Norm): $\|x\|_1 = \sum_i |x_i|$
3. L∞ (Max Norm): $\|x\|_\infty = \max_i |x_i|$
4. Frobenius Norm: The Euclidean Norm for a matrix
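All four norms are available through np.linalg.norm; a brief sketch (example values assumed):
```python
import numpy as np

x = np.array([3.0, -4.0])
print(np.linalg.norm(x, 2))       # L2 (Euclidean): 5.0
print(np.linalg.norm(x, 1))       # L1 (Manhattan): 7.0
print(np.linalg.norm(x, np.inf))  # L-infinity (max): 4.0

M = np.array([[1.0, 2.0], [3.0, 4.0]])
print(np.linalg.norm(M, 'fro'))   # Frobenius norm of a matrix
```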
What conditions must a norm function satisfy?
A norm function must:
1. Satisfy $f(x) = 0 \Rightarrow x = 0$ (definiteness),
2. Obey the triangle inequality $f(x + y) \le f(x) + f(y)$, and
3. Be absolutely homogeneous under scaling: $f(\alpha x) = |\alpha| f(x)$.
Why are orthogonal matrices important?
Orthogonal matrices are computationally convenient: their determinant is ±1, and their inverse is simply their transpose ($A^{-1} = A^T$), so no expensive inversion is needed.
What are eigenpairs, and how are they represented?
Eigenpairs consist of:
1. Eigenvalue ($\lambda$): Indicates the stretching or scaling factor.
2. Eigenvector ($v$): Specifies the direction.
The relationship is expressed as $Av = \lambda v$.
What does matrix diagonalizability mean?
A matrix $A$ is diagonalizable if it can be written as $A = V \,\mathrm{diag}(\lambda)\, V^{-1}$, where $V$ is the matrix of eigenvectors and $\lambda$ is the vector of eigenvalues. This representation helps assess system stability.
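A short numpy check (illustrative matrix of my own choosing) that verifies $Av = \lambda v$ and reconstructs $A$ from its eigenpairs:
```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

# Eigenpairs: A v = lambda v
eigvals, V = np.linalg.eig(A)
v = V[:, 0]
print(np.allclose(A @ v, eigvals[0] * v))  # verifies A v = lambda v

# Diagonalization: A = V diag(lambda) V^{-1}
reconstructed = V @ np.diag(eigvals) @ np.linalg.inv(V)
print(np.allclose(A, reconstructed))
```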
What is Singular Value Decomposition (SVD)?
SVD factorizes any matrix as $A = U \Sigma V^T$, where:
U: Left singular vectors,
Σ: Diagonal matrix of singular values,
$V^T$: Right singular vectors.
It generalizes matrix inversion and applies even when eigenvalue decomposition fails.
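A minimal numpy sketch showing the factorization on a non-square matrix (example values assumed):
```python
import numpy as np

# SVD applies even to non-square matrices
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.allclose(A, U @ np.diag(s) @ Vt))  # A = U Sigma V^T
```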
What is the Moore-Penrose Pseudoinverse?
The Moore-Penrose Pseudoinverse is used for matrices that are:
1. Non-square,
2. Singular, or
3. Represent systems with no or infinite solutions.
It provides a generalized inverse to solve 𝐴𝑥=𝑏.
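A hedged sketch of solving an overdetermined system with np.linalg.pinv (the example system is my own):
```python
import numpy as np

# Overdetermined system (3 equations, 2 unknowns): no exact solution
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
b = np.array([1.0, 1.0, 0.0])

x = np.linalg.pinv(A) @ b   # least-squares solution via the pseudoinverse
print(x)
# Same answer as the explicit least-squares solver
print(np.linalg.lstsq(A, b, rcond=None)[0])
```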
What is PCA, and why is it important?
Principal Component Analysis (PCA) is a dimensionality reduction technique that:
1. Extracts features by maximizing variance,
2. Reduces data dimensions while preserving variance, and
3. Simplifies data visualization and computation.
How does PCA work?
PCA involves:
1. Preparing the data,
2. Computing the covariance matrix,
3. Finding eigenpairs (eigenvalues and eigenvectors),
4. Sorting and selecting principal components, and
5. Transforming the data into the new basis.
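A minimal numpy sketch of these five steps (the toy data and the choice of $k$ are my own assumptions):
```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))          # toy data: 200 samples, 3 features

# 1. Prepare the data: center each feature
Xc = X - X.mean(axis=0)

# 2. Compute the covariance matrix
C = np.cov(Xc, rowvar=False)

# 3. Find eigenpairs (eigh suits the symmetric covariance matrix)
eigvals, eigvecs = np.linalg.eigh(C)

# 4. Sort by decreasing eigenvalue and keep the top k components
order = np.argsort(eigvals)[::-1]
k = 2
W = eigvecs[:, order[:k]]

# 5. Transform the data into the new basis
Z = Xc @ W
print(Z.shape)  # (200, 2): dimensionality reduced from 3 to 2
```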
What are the three main sources of uncertainty in probability?
Inherent Stochasticity – Events have randomness built into them.
Incomplete Observability – We lack knowledge of all variables in the system.
Incomplete Modeling – Our models either lack information or deliberately discard it.
What is the “degree of belief” in probability?
The degree of belief is a fuzzy-logic-style truth value between 0 (certainty that the event will not occur) and 1 (certainty that it will occur), representing confidence in an event occurring.
How does frequentist probability differ from Bayesian probability?
Frequentist Probability focuses on event occurrence rates over many trials.
Bayesian Probability quantifies certainty in an event occurring, incorporating prior knowledge.
Deep learning typically uses the former, while use of the latter is growing.
What are random variables, and how are they classified?
Discrete Random Variables take finite or countable values.
Continuous Random Variables take values in an uncountable real range.
What is the Probability Mass Function (PMF), and what properties does it satisfy?
The PMF gives the probability distribution of discrete random variables.
It must satisfy:
1. $0 \le P(X = x) \le 1$ for all $x$.
2. $\sum_x P(X = x) = 1$ (it normalizes to 1).
It can also describe joint probability distributions across multiple variables.
What is a uniform distribution?
A distribution where all events have equal probability.
What is the Probability Density Function (PDF), and how does it differ from the PMF?
The PDF describes the probability distribution of continuous random variables.
Properties:
1. It integrates to 1 over the domain: $\int_{-\infty}^{\infty} p(x)\,dx = 1$.
2. It satisfies $p(x) \ge 0$ for all $x$.
Unlike PMFs, PDF values are not probabilities themselves but density values.
What is marginal probability, and how is it computed?
Marginal probability is the probability distribution of a subset of variables when the joint probability over all variables is known. It is computed by summing (for discrete variables) or integrating (for continuous variables) over the unwanted variables.
For discrete variables (summing over the unwanted variable $y$): $P(x) = \sum_y P(x, y)$
For continuous variables (integrating over the unwanted variable $y$): $p(x) = \int p(x, y)\,dy$
This process effectively eliminates the variables that are not of interest.
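A small numpy sketch (the joint table values are invented for illustration): marginalizing is just summing the joint table along the unwanted axis.
```python
import numpy as np

# Joint PMF P(x, y) over x in {0, 1} (rows) and y in {0, 1, 2} (columns)
P_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.15, 0.20]])

P_x = P_xy.sum(axis=1)  # marginalize out y: P(x) = sum_y P(x, y)
P_y = P_xy.sum(axis=0)  # marginalize out x: P(y) = sum_x P(x, y)
print(P_x, P_x.sum())   # [0.4 0.6] 1.0
print(P_y, P_y.sum())
```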
What is expectation?
The expected value (average) of a function over a probability distribution.
1. Discrete case: $E[f(X)] = \sum_x f(x) P(X = x)$
2. Continuous case: $E[f(X)] = \int f(x)\, p(x)\,dx$
Expectation is linear: $E[aX + b] = aE[X] + b$.
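A quick numeric check of both formulas (values assumed for illustration):
```python
import numpy as np

# A discrete random variable: values and their probabilities
x = np.array([1, 2, 3, 4])
p = np.array([0.1, 0.2, 0.3, 0.4])

E_x = np.sum(x * p)              # E[X] = 3.0
E_fx = np.sum((x ** 2) * p)      # E[f(X)] with f(x) = x^2: 10.0
print(E_x, E_fx)

# Linearity: E[aX + b] = a E[X] + b
a, b = 2.0, 5.0
print(np.sum((a * x + b) * p), a * E_x + b)  # both equal 11.0
```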
What is conditional probability?
The probability of an event given that another event has occurred:
$P(A \mid B) = P(A \cap B) / P(B)$, where $P(B) \ne 0$, since conditioning on an impossible event is undefined.
What is variance, and how is it computed?
Variance measures the spread of a random variable:
$\mathrm{Var}(X) = E[(X - E[X])^2]$
A low variance means values cluster around $E[X]$, while high variance means values are widely spread.
How is standard deviation related to variance?
The standard deviation is the square root of variance:
$\sigma = \sqrt{\mathrm{Var}(X)}$
What is covariance?
A measure of how two variables are linearly related:
$\mathrm{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])]$
High covariance means variables vary together.
The diagonal of the covariance matrix is a vector of variance values.
What is correlation, and how is it different from covariance?
Correlation normalizes covariance:
$\rho(X, Y) = \mathrm{Cov}(X, Y) / (\sigma_X \sigma_Y)$
It is bounded between -1 and 1, making it easier to compare relationships across different scales.
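A brief numpy sketch contrasting the two (synthetic data of my own construction):
```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=10_000)
y = 2.0 * x + rng.normal(scale=0.5, size=10_000)  # linearly related to x

C = np.cov(x, y)          # 2x2 covariance matrix; the diagonal holds variances
print(C)
print(np.corrcoef(x, y))  # correlation, bounded in [-1, 1]
```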
What is a Bernoulli distribution?
A binary probability distribution:
$P(X = x) = \phi^x (1 - \phi)^{1 - x}$
where $x \in \{0, 1\}$.
Expectation: $E[X] = \phi$
Variance: $\mathrm{Var}(X) = \phi(1 - \phi)$.
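A quick sampling check of these formulas (a sketch assuming $\phi = 0.3$):
```python
import numpy as np

phi = 0.3
rng = np.random.default_rng(2)
samples = rng.binomial(n=1, p=phi, size=100_000)  # Bernoulli draws

print(samples.mean())  # approx E[X] = phi = 0.3
print(samples.var())   # approx Var(X) = phi(1 - phi) = 0.21
```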
What is the precision of a distribution?
Precision is the inverse of variance:
$\lambda = 1/\sigma^2$
It simplifies calculations in Gaussian distributions.
What is a Standard Normal Distribution?
A normal distribution with expectation $E[X] = 0$ and standard deviation $\sigma = 1$.
What is the Central Limit Theorem?
The sum of many independent random variables is approximately normally distributed, regardless of their original distribution.
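A minimal simulation sketch (the sample sizes are my own choices): sums of uniform draws, which are far from normal individually, concentrate around a normal shape.
```python
import numpy as np

rng = np.random.default_rng(3)
# Each row sums 50 draws from a (non-normal) uniform distribution
sums = rng.uniform(size=(100_000, 50)).sum(axis=1)

# The sums are approximately normal: mean 25, variance 50/12
print(sums.mean(), sums.var())
# Standardized sums land in (-1, 1) with probability close to 0.683
z = (sums - sums.mean()) / sums.std()
print(np.mean(np.abs(z) < 1))
```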
What is an indicator function?
A Boolean function representing whether a condition is satisfied:
$\mathbf{1}\{\text{condition}\}$. It takes the value 1 if the condition is true and 0 otherwise.
What is a Dirac distribution?
A distribution concentrated at a single point:
p(x)=δ(x−μ), where δ is the Dirac delta function.
What is the Dirac Delta Function?
A generalized function:
$\delta(x) = 0$ everywhere except at $x = 0$, yet it integrates to 1: $\int_{-\infty}^{\infty} \delta(x)\,dx = 1$.
It is used in empirical distributions.
What is a Mixture Distribution?
A distribution formed from multiple component distributions, often using latent variables (unobservable random variables). It allows for more flexible modeling of complex data.
Mathematically, a mixture distribution is expressed as:
$P(x) = \sum_{k=1}^{K} \pi_k \, P(x \mid z = k)$, where:
$K$ is the number of component distributions.
$\pi_k$ is the mixing coefficient, representing the probability of selecting the $k$-th component.
$P(x \mid z = k)$ is the component probability distribution given that the latent variable $z$ is in state $k$.
The mixing coefficients satisfy $\sum_{k=1}^{K} \pi_k = 1$.
What is the Gaussian Mixture Model (GMM)?
A mixture model whose components are Gaussians; with enough components it can act as a universal density approximator. Its mixing coefficients are prior probabilities $P(z = k)$, and posterior probabilities $P(z = k \mid x)$ can be estimated after observing $x$.
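A minimal ancestral-sampling sketch of a two-component 1-D GMM (all parameter values are assumed for illustration):
```python
import numpy as np

rng = np.random.default_rng(4)

# Mixing coefficients (priors over components) and component parameters
pi = np.array([0.3, 0.7])
mu = np.array([-2.0, 3.0])
sigma = np.array([0.5, 1.0])

# Ancestral sampling: draw the latent z, then x | z
n = 100_000
z = rng.choice(len(pi), size=n, p=pi)  # latent component indices
x = rng.normal(mu[z], sigma[z])        # component-conditional draws

print(x.mean())  # approx sum_k pi_k mu_k = 0.3*(-2) + 0.7*3 = 1.5
```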
What is the logistic sigmoid function?
A function that transforms values into probabilities:
$\sigma(x) = 1/(1 + e^{-x})$
It is used to generate $\phi$ values for Bernoulli distributions.
It saturates at extreme values, becoming flat and insensitive to small changes in its input.
Its inverse is the logit function, which is also commonly used.
What is the Softplus function?
A smooth approximation of max(0, x):
$\mathrm{softplus}(x) = \log(1 + e^x)$
It is used to generate $\sigma$ values for Gaussian distributions.
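A short numpy sketch of both functions and the logit inverse (the function names are my own):
```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softplus(x):
    return np.log1p(np.exp(x))  # log(1 + e^x)

def logit(p):
    return np.log(p / (1.0 - p))  # inverse of the sigmoid

x = np.array([-4.0, 0.0, 4.0])
print(sigmoid(x))                          # saturates toward 0 and 1
print(np.allclose(logit(sigmoid(x)), x))   # logit inverts sigmoid
print(softplus(x), np.maximum(0, x))       # softplus ~ smooth max(0, x)
```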
What are independent variables in probability?
Two random variables 𝑋 and 𝑌 are independent if knowing one does not change the probability distribution of the other. This is mathematically defined as:
P(X,Y)=P(X)P(Y)
which means their joint probability equals the product of their individual probabilities.
What are dependent variables in probability?
Two variables are dependent if knowledge of one changes the probability of the other, meaning:
$P(X = x, Y = y) = P(X = x)\, P(Y = y \mid X = x)$
In this case, their probabilities influence each other.
What is conditional independence, and how is it defined mathematically?
Two events A and B are conditionally independent given a third event C if knowing C makes A and B independent.
Mathematically, this is written as:
P(A,B∣C)=P(A∣C)P(B∣C)
This means that once we account for C, the occurrence of A does not affect the probability of B.
What is an isotropic Gaussian distribution?
An isotropic Gaussian distribution is a multivariate normal distribution where the covariance matrix is a scaled identity matrix, meaning the variables have equal variance and are uncorrelated.
Mathematically, it is given by:
$p(x) = \dfrac{1}{(2\pi\sigma^2)^{n/2}} \exp\!\left(-\dfrac{\|x - \mu\|^2}{2\sigma^2}\right)$
where:
$\mu$ is the mean vector.
$\sigma^2 I$ is the covariance matrix (where $I$ is the identity matrix).
The distribution is “isotropic” because it has the same variance in all directions.
What is an empirical distribution?
The empirical distribution represents observed data points and provides an estimate of the true underlying distribution. It is defined as:
$P_n(x) = \frac{1}{n} \sum_{i=1}^{n} \delta(x - x_i)$
where:
$x_i$ are the observed data points.
$\delta(x - x_i)$ is the Dirac delta function, which ensures that each observation contributes a probability mass of $1/n$.
The empirical distribution converges to the true distribution as the number of observations increases.
What is Bayes’ Rule, and how is it applied?
Bayes’ Rule is a fundamental theorem in probability that describes how to update beliefs in light of new evidence. It is given by:
$P(A \mid B) = \dfrac{P(B \mid A)\, P(A)}{P(B)}$
where:
𝑃(𝐴∣𝐵) is the posterior probability (updated belief after observing B).
𝑃(𝐵∣𝐴) is the likelihood (probability of observing B given A).
𝑃(𝐴) is the prior probability (initial belief about A).
𝑃(𝐵) is the evidence (normalization factor).
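A worked numeric sketch (the test-accuracy numbers are invented for illustration): updating belief in a rare disease after a positive test.
```python
# A classic Bayes' Rule example with illustrative numbers:
# a test with 99% sensitivity and a 5% false-positive rate
# for a disease with 1% prevalence.
p_disease = 0.01             # prior P(A)
p_pos_given_disease = 0.99   # likelihood P(B | A)
p_pos_given_healthy = 0.05   # false-positive rate P(B | not A)

# Evidence P(B) by marginalizing over both hypotheses
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior P(A | B) via Bayes' Rule
posterior = p_pos_given_disease * p_disease / p_pos
print(posterior)  # ~0.167: even a positive test leaves the disease unlikely
```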