Deep Learning: Introduction and Mathematical Foundations Flashcards
Conceptual, analogy, and reference flashcards for Deep Learning
What is the key idea of Artificial Intelligence (AI)?
Introduction
“Machines that think”. It refers to the idea of creating machines that can perform tasks requiring intelligence, like learning, reasoning, or problem-solving.
Why are abstract and formal tasks easy for machines but hard for humans?
Introduction
Abstract tasks, like playing chess, involve well-defined rules and patterns that machines can follow easily, whereas humans struggle with the computational aspects.
What is machine learning?
Introduction
Machine learning is the acquisition of knowledge from raw data by identifying patterns and learning from them.
Why is data representation important in machine learning?
Introduction
Data is formatted into representations and features to make pattern recognition easier, though identifying the most beneficial features can be challenging.
What is representation learning?
Introduction
Representation learning uses machine learning to map raw data into meaningful representations, separating factors of variation. An example is an autoencoder (Encoder + Decoder).
What is the role of an encoder and decoder in an autoencoder?
Introduction
The encoder converts input data into a representation, and the decoder reconstructs the input from that representation.
What is deep learning, and how does it differ from other types of machine learning?
Introduction
Deep learning is a subfield of machine learning that extracts abstract features through multiple layers, including visible (simple) and hidden (abstract) layers.
What historical concepts influenced deep learning architectures?
Introduction
Concepts like cybernetics, connectionism, and artificial neural networks were inspired by neuroscience.
What does “model depth” mean in deep learning?
Introduction
Model depth is the number of sequential instructions or computations required to evaluate an output.
Why can’t linear models learn the XOR function?
Introduction
Linear models cannot capture the non-linearity required to separate XOR inputs, as they rely on straight-line decision boundaries.
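To make this concrete, here is a minimal numpy sketch (my own illustration, not from the source): the best least-squares linear fit to XOR predicts 0.5 for every input, so no straight-line boundary separates the classes.
```python
import numpy as np

# XOR inputs and targets
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0., 1., 1., 0.])

# Augment with a bias column and fit a linear model by least squares
A = np.hstack([X, np.ones((4, 1))])
w, *_ = np.linalg.lstsq(A, y, rcond=None)

print(A @ w)  # -> [0.5 0.5 0.5 0.5]: the best line predicts 0.5 everywhere
```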
How has the understanding of biological neurons influenced modern neural networks?
Introduction
Modern neural networks adopt the idea that increased interconnections between neurons lead to more intelligent systems, though they diverge from actual biological neurons.
What is the modern neuron architecture used in neural networks?
Introduction
The Rectified Linear Unit (ReLU) is the standard modern neuron architecture for neural networks.
What is the significance of deep belief networks?
Introduction
The Deep Belief Network (DBN), introduced by Geoffrey Hinton in 2006, revolutionized deep learning. Its greedy layer-wise pretraining with Restricted Boltzmann Machines (RBMs) addressed vanishing gradients, enabled unsupervised feature learning, and made deep networks trainable, sparking the modern era of deep learning.
How did big data impact machine learning?
Introduction
Big data provided vast amounts of information, making it easier for machine learning models to learn and improve performance.
How are model size and performance related to computer capabilities?
Introduction
Model size and model performance have grown in step with the computational power of modern computers; faster hardware makes larger, better-performing models practical.
What is a logical inference machine?
Introduction
A system that reasons automatically about formal statements using logical inference rules.
What is a knowledge base in AI?
Introduction
A database of formally defined facts and rules used for reasoning in AI systems.
What is logistic regression?
Introduction
A simple machine learning algorithm used for binary classification problems.
What is Naive Bayes?
Introduction
A probabilistic classifier based on Bayes’ theorem, assuming independence between features.
What is a multilayer perceptron (MLP)?
Introduction
A feedforward neural network consisting of multiple layers of neurons with activation functions, enabling hierarchical learning.
How is model depth measured in neural networks?
Introduction
By counting the number of sequential computations or layers in the network.
What is the McCulloch-Pitts neuron?
Introduction
A simple mathematical model of a biological neuron used in early neural network research.
What is ADALINE?
Introduction
Adaptive Linear Neuron, an early machine learning model using linear activation and adaptive weights.
What is stochastic gradient descent (SGD)?
An optimization algorithm that updates model parameters using the gradient of a randomly selected subset of data points.
What is the Neocognitron?
A hierarchical, convolutional neural network model inspired by the human visual system.
What is parallel distributed processing (connectionism)?
A perspective in AI emphasizing the distributed representation of knowledge across interconnected units.
What are distributed representations in deep learning?
Representations where information is encoded across multiple neurons rather than a single unit.
What is an LSTM model?
Long Short-Term Memory model, a type of recurrent neural network designed to capture long-term dependencies in sequential data.
What is reinforcement learning?
A machine learning paradigm where an agent learns to maximize rewards through interactions with its environment.
What are the different components of a tensor space?
A scalar is a single numerical value.
A vector is a one-dimensional collection of values, typically representing magnitude and direction in space.
A matrix is a two-dimensional collection of values organized in rows and columns (equivalently, a one-dimensional collection of vectors).
A tensor is an $n$-dimensional collection of values, generalizing scalars, vectors, and matrices to higher dimensions.
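A quick numpy illustration (an assumed sketch, not from the source): these objects differ only in their number of dimensions.
```python
import numpy as np

scalar = np.array(3.14)                      # 0-D: a single value
vector = np.array([1.0, 2.0, 3.0])           # 1-D: a list of values
matrix = np.array([[1.0, 2.0], [3.0, 4.0]])  # 2-D: rows and columns
tensor = np.zeros((2, 3, 4))                 # 3-D: a general n-D array

for t in (scalar, vector, matrix, tensor):
    print(t.ndim, t.shape)
```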
What are some important matrix properties? (Part 1)
Transpose: Flips rows and columns.
Inverse: The matrix $A^{-1}$ such that $AA^{-1} = I$, where $I$ is the identity matrix.
Square Matrix: Equal number of rows and columns.
Singular: A matrix without an inverse (determinant = 0).
Diagonal Matrix: Non-diagonal elements are zero.
What are some important matrix properties? (Part 2)
Symmetric: $A = A^T$
Orthogonal: $AA^T = I$
Orthonormal: The vectors in $A$ are unit length and mutually orthogonal
Trace: Sum of diagonal elements
Unit Vector: A vector of magnitude 1
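These properties are easy to verify numerically; a small numpy sketch (matrix values are my own illustration):
```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

print(A.T)                                           # transpose: rows <-> columns
print(np.allclose(A @ np.linalg.inv(A), np.eye(2)))  # A A^{-1} = I
print(np.allclose(A, A.T))                           # symmetric: A = A^T
print(np.trace(A))                                   # trace: sum of diagonal elements

# A rotation matrix is orthogonal: Q Q^T = I
theta = np.pi / 4
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
print(np.allclose(Q @ Q.T, np.eye(2)))
```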
What are positive and negative definite matrices?
A matrix 𝐴 is positive definite if:
1. All its eigenvalues are strictly positive ($\lambda_i > 0$).
2. For any non-zero vector $x$, the quadratic form $x^T A x > 0$.
A matrix $A$ is negative definite when the reverse holds ($\lambda_i < 0$ and $x^T A x < 0$ for all non-zero $x$).
Positive definite matrices often appear in optimization, where they ensure that a function has a unique minimum.
Negative definite matrices indicate that a quadratic function is concave and has a unique maximum.
What are positive and negative semi-definite matrices?
A matrix 𝐴 is positive semi-definite if:
1. All its eigenvalues are non-negative ($\lambda_i \ge 0$).
2. For any vector $x$, the quadratic form $x^T A x \ge 0$.
A matrix $A$ is negative semi-definite when the reverse holds ($\lambda_i \le 0$ and $x^T A x \le 0$).
Positive semi-definite matrices represent systems where the quadratic form cannot be negative but might equal zero for some $x$.
Negative semi-definite matrices indicate that a quadratic function is concave, though its maximum may not be unique.
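For symmetric matrices, definiteness can be read off from the signs of the eigenvalues; a minimal sketch (the helper name and tolerance are my own assumptions):
```python
import numpy as np

def classify_definiteness(A, tol=1e-10):
    """Classify a symmetric matrix by the signs of its eigenvalues."""
    eigvals = np.linalg.eigvalsh(A)  # eigenvalues of a symmetric matrix
    if np.all(eigvals > tol):
        return "positive definite"
    if np.all(eigvals >= -tol):
        return "positive semi-definite"
    if np.all(eigvals < -tol):
        return "negative definite"
    if np.all(eigvals <= tol):
        return "negative semi-definite"
    return "indefinite"

print(classify_definiteness(np.array([[2.0, 0.0], [0.0, 3.0]])))   # positive definite
print(classify_definiteness(np.array([[1.0, 0.0], [0.0, 0.0]])))   # positive semi-definite
print(classify_definiteness(np.array([[1.0, 0.0], [0.0, -1.0]])))  # indefinite
```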
How does a linear combination operator relate to solving systems of equations?
The span of a set of vectors is the set of all values obtainable through linear combinations of those vectors. Whether $b$ lies in the span of the columns of $A$ determines whether the system $Ax = b$ has one, infinitely many, or no solutions. For two vectors, the combinations $z = \alpha x + (1 - \alpha) y$ trace the line through $x$ and $y$.
How is a linear system represented mathematically?
A linear system can be written as $Ax = b$, where $A \in \mathbb{R}^{m \times n}$, $x \in \mathbb{R}^n$, and $b \in \mathbb{R}^m$. A unique solution exists when the columns of $A$ are linearly independent and $b$ lies in their span.
What is a norm in linear algebra?
A norm measures the “size” or “span” of a vector in space. Common norms include:
1. L2 (Euclidean Norm): $\|x\|_2 = \sqrt{\sum_i x_i^2}$
2. L1 (Manhattan Norm): $\|x\|_1 = \sum_i |x_i|$
3. L∞ (Max Norm): $\|x\|_\infty = \max_i |x_i|$
4. Frobenius Norm: The Euclidean Norm for a matrix
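All four norms are available through np.linalg.norm; a brief sketch (example values assumed):
```python
import numpy as np

x = np.array([3.0, -4.0])
print(np.linalg.norm(x, 2))       # L2 (Euclidean): 5.0
print(np.linalg.norm(x, 1))       # L1 (Manhattan): 7.0
print(np.linalg.norm(x, np.inf))  # L-infinity (max): 4.0

M = np.array([[1.0, 2.0], [3.0, 4.0]])
print(np.linalg.norm(M, 'fro'))   # Frobenius norm of a matrix
```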
What conditions must a norm function satisfy?
A norm function must:
1. Satisfy $f(x) = 0 \Rightarrow x = 0$ (definiteness),
2. Obey the triangle inequality $f(x + y) \le f(x) + f(y)$, and
3. Be absolutely homogeneous under scaling: $f(\alpha x) = |\alpha| f(x)$.
Why are orthogonal matrices important?
Orthogonal matrices are computationally convenient: their determinant is ±1, and their inverse is simply their transpose ($A^{-1} = A^T$), so no expensive inversion is needed.
What are eigenpairs, and how are they represented?
Eigenpairs consist of:
1. Eigenvalue ($\lambda$): Indicates the stretching or scaling factor.
2. Eigenvector ($v$): Specifies the direction.
The relationship is expressed as $Av = \lambda v$.
What does matrix diagonalizability mean?
A matrix $A$ is diagonalizable if it can be written as $A = V \,\mathrm{diag}(\lambda)\, V^{-1}$, where $V$ is the matrix of eigenvectors and $\lambda$ is the vector of eigenvalues. This representation helps assess system stability.
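A short numpy check (illustrative matrix of my own choosing) that verifies $Av = \lambda v$ and reconstructs $A$ from its eigenpairs:
```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

# Eigenpairs: A v = lambda v
eigvals, V = np.linalg.eig(A)
v = V[:, 0]
print(np.allclose(A @ v, eigvals[0] * v))  # verifies A v = lambda v

# Diagonalization: A = V diag(lambda) V^{-1}
reconstructed = V @ np.diag(eigvals) @ np.linalg.inv(V)
print(np.allclose(A, reconstructed))
```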
What is Singular Value Decomposition (SVD)?
SVD factorizes any matrix as $A = U \Sigma V^T$, where:
U: Left singular vectors,
Σ: Diagonal matrix of singular values,
$V^T$: Right singular vectors.
It generalizes matrix inversion and applies even when eigenvalue decomposition fails.
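A minimal numpy sketch showing the factorization on a non-square matrix (example values assumed):
```python
import numpy as np

# SVD applies even to non-square matrices
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.allclose(A, U @ np.diag(s) @ Vt))  # A = U Sigma V^T
```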
What is the Moore-Penrose Pseudoinverse?
The Moore-Penrose Pseudoinverse is used for matrices that are:
1. Non-square,
2. Singular, or
3. Represent systems with no or infinite solutions.
It provides a generalized inverse to solve 𝐴𝑥=𝑏.
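A hedged sketch of solving an overdetermined system with np.linalg.pinv (the example system is my own):
```python
import numpy as np

# Overdetermined system (3 equations, 2 unknowns): no exact solution
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
b = np.array([1.0, 1.0, 0.0])

x = np.linalg.pinv(A) @ b   # least-squares solution via the pseudoinverse
print(x)
# Same answer as the explicit least-squares solver
print(np.linalg.lstsq(A, b, rcond=None)[0])
```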
What is PCA, and why is it important?
Principal Component Analysis (PCA) is a dimensionality reduction technique that:
1. Extracts features by maximizing variance,
2. Reduces data dimensions while preserving variance, and
3. Simplifies data visualization and computation.
How does PCA work?
PCA involves:
1. Preparing the data,
2. Computing the covariance matrix,
3. Finding eigenpairs (eigenvalues and eigenvectors),
4. Sorting and selecting principal components, and
5. Transforming the data into the new basis.
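A minimal numpy sketch of these five steps (the toy data and the choice of $k$ are my own assumptions):
```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))          # toy data: 200 samples, 3 features

# 1. Prepare the data: center each feature
Xc = X - X.mean(axis=0)

# 2. Compute the covariance matrix
C = np.cov(Xc, rowvar=False)

# 3. Find eigenpairs (eigh suits the symmetric covariance matrix)
eigvals, eigvecs = np.linalg.eigh(C)

# 4. Sort by decreasing eigenvalue and keep the top k components
order = np.argsort(eigvals)[::-1]
k = 2
W = eigvecs[:, order[:k]]

# 5. Transform the data into the new basis
Z = Xc @ W
print(Z.shape)  # (200, 2): dimensionality reduced from 3 to 2
```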
What are the three main sources of uncertainty in probability?
Inherent Stochasticity – Events have randomness built into them.
Incomplete Observability – We lack knowledge of all variables in the system.
Incomplete Modeling – Our models either lack information or deliberately discard it.
What is the “degree of belief” in probability?
The degree of belief is a fuzzy-logic-style truth value between 0 (certainty that the event will not occur) and 1 (certainty that it will occur), representing confidence in an event occurring.
How does frequentist probability differ from Bayesian probability?
Frequentist Probability focuses on event occurrence rates over many trials.
Bayesian Probability quantifies certainty in an event occurring, incorporating prior knowledge.
Deep learning typically uses the former, while use of the latter is growing.
What are random variables, and how are they classified?
Discrete Random Variables take finite or countable values.
Continuous Random Variables take values in an uncountable real range.
What is the Probability Mass Function (PMF), and what properties does it satisfy?
The PMF gives the probability distribution of discrete random variables.
It must satisfy:
1. $0 \le P(X = x) \le 1$ for all $x$.
2. $\sum_x P(X = x) = 1$ (it normalizes to 1).
It can also describe joint probability distributions across multiple variables.
What is a uniform distribution?
A distribution where all events have equal probability.
What is the Probability Density Function (PDF), and how does it differ from the PMF?
The PDF describes the probability distribution of continuous random variables.
Properties:
1. It integrates to 1 over the domain: $\int_{-\infty}^{\infty} p(x)\,dx = 1$.
2. It satisfies $p(x) \ge 0$ for all $x$.
Unlike PMFs, PDF values are not probabilities themselves but density values.
What is marginal probability, and how is it computed?
Marginal probability is the probability distribution of a subset of variables when the joint probability over all variables is known. It is computed by summing (for discrete variables) or integrating (for continuous variables) over the unwanted variables.
For discrete variables (summing over the unwanted variable $y$): $P(x) = \sum_y P(x, y)$
For continuous variables (integrating over the unwanted variable $y$): $p(x) = \int p(x, y)\,dy$
This process effectively eliminates the variables that are not of interest.
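A small numpy sketch (the joint table values are invented for illustration): marginalizing is just summing the joint table along the unwanted axis.
```python
import numpy as np

# Joint PMF P(x, y) over x in {0, 1} (rows) and y in {0, 1, 2} (columns)
P_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.15, 0.20]])

P_x = P_xy.sum(axis=1)  # marginalize out y: P(x) = sum_y P(x, y)
P_y = P_xy.sum(axis=0)  # marginalize out x: P(y) = sum_x P(x, y)
print(P_x, P_x.sum())   # [0.4 0.6] 1.0
print(P_y, P_y.sum())
```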
What is expectation?
The expected value (average) of a function over a probability distribution.
1. Discrete case: $E[f(X)] = \sum_x f(x) P(X = x)$
2. Continuous case: $E[f(X)] = \int f(x)\, p(x)\,dx$
Expectation is linear: $E[aX + b] = aE[X] + b$.
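A quick numeric check of both formulas (values assumed for illustration):
```python
import numpy as np

# A discrete random variable: values and their probabilities
x = np.array([1, 2, 3, 4])
p = np.array([0.1, 0.2, 0.3, 0.4])

E_x = np.sum(x * p)              # E[X] = 3.0
E_fx = np.sum((x ** 2) * p)      # E[f(X)] with f(x) = x^2: 10.0
print(E_x, E_fx)

# Linearity: E[aX + b] = a E[X] + b
a, b = 2.0, 5.0
print(np.sum((a * x + b) * p), a * E_x + b)  # both equal 11.0
```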
What is conditional probability?
The probability of an event given that another event has occurred:
$P(A \mid B) = P(A \cap B) / P(B)$, where $P(B) \ne 0$, since conditioning on an impossible event is undefined.
What is variance, and how is it computed?
Variance measures the spread of a random variable:
$\mathrm{Var}(X) = E[(X - E[X])^2]$
A low variance means values cluster around $E[X]$, while high variance means values are widely spread.
How is standard deviation related to variance?
The standard deviation is the square root of variance:
$\sigma = \sqrt{\mathrm{Var}(X)}$
What is covariance?
A measure of how two variables are linearly related:
$\mathrm{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])]$
High covariance means variables vary together.
The diagonal of the covariance matrix is a vector of variance values.
What is correlation, and how is it different from covariance?
Correlation normalizes covariance:
$\rho(X, Y) = \mathrm{Cov}(X, Y) / (\sigma_X \sigma_Y)$
It is bounded between -1 and 1, making it easier to compare relationships across different scales.
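A brief numpy sketch contrasting the two (synthetic data of my own construction):
```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=10_000)
y = 2.0 * x + rng.normal(scale=0.5, size=10_000)  # linearly related to x

C = np.cov(x, y)          # 2x2 covariance matrix; the diagonal holds variances
print(C)
print(np.corrcoef(x, y))  # correlation, bounded in [-1, 1]
```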
What is a Bernoulli distribution?
A binary probability distribution:
$P(X = x) = \phi^x (1 - \phi)^{1 - x}$
where $x \in \{0, 1\}$.
Expectation: $E[X] = \phi$
Variance: $\mathrm{Var}(X) = \phi(1 - \phi)$.
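A quick sampling check of these formulas (a sketch assuming $\phi = 0.3$):
```python
import numpy as np

phi = 0.3
rng = np.random.default_rng(2)
samples = rng.binomial(n=1, p=phi, size=100_000)  # Bernoulli draws

print(samples.mean())  # approx E[X] = phi = 0.3
print(samples.var())   # approx Var(X) = phi(1 - phi) = 0.21
```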
What is the precision of a distribution?
Precision is the inverse of variance:
$\lambda = 1/\sigma^2$
It simplifies calculations in Gaussian distributions.
What is a Standard Normal Distribution?
A normal distribution with expectation $E[X] = 0$ and standard deviation $\sigma = 1$.
What is the Central Limit Theorem?
The sum of many independent random variables is approximately normally distributed, regardless of their original distribution.
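A minimal simulation sketch (the sample sizes are my own choices): sums of uniform draws, which are far from normal individually, concentrate around a normal shape.
```python
import numpy as np

rng = np.random.default_rng(3)
# Each row sums 50 draws from a (non-normal) uniform distribution
sums = rng.uniform(size=(100_000, 50)).sum(axis=1)

# The sums are approximately normal: mean 25, variance 50/12
print(sums.mean(), sums.var())
# Standardized sums land in (-1, 1) with probability close to 0.683
z = (sums - sums.mean()) / sums.std()
print(np.mean(np.abs(z) < 1))
```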
What is an indicator function?
A Boolean function representing whether a condition is satisfied:
$\mathbf{1}\{\text{condition}\}$. It takes the value 1 if the condition is true and 0 otherwise.
What is a Dirac distribution?
A distribution concentrated at a single point:
p(x)=δ(x−μ), where δ is the Dirac delta function.
What is the Dirac Delta Function?
A generalized function:
$\delta(x) = 0$ everywhere except at $x = 0$, yet it integrates to 1: $\int_{-\infty}^{\infty} \delta(x)\,dx = 1$.
It is used in empirical distributions.
What is a Mixture Distribution?
A distribution formed from multiple component distributions, often using latent variables (unobservable random variables). It allows for more flexible modeling of complex data.
Mathematically, a mixture distribution is expressed as:
$P(x) = \sum_{k=1}^{K} \pi_k \, P(x \mid z = k)$, where:
$K$ is the number of component distributions.
$\pi_k$ is the mixing coefficient, representing the probability of selecting the $k$-th component.
$P(x \mid z = k)$ is the component probability distribution given that the latent variable $z$ is in state $k$.
The mixing coefficients satisfy $\sum_{k=1}^{K} \pi_k = 1$.
What is the Gaussian Mixture Model (GMM)?
A mixture model whose components are Gaussians; with enough components it can act as a universal density approximator. Its mixing coefficients are prior probabilities $P(z = k)$, and posterior probabilities $P(z = k \mid x)$ can be estimated after observing $x$.
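A minimal ancestral-sampling sketch of a two-component 1-D GMM (all parameter values are assumed for illustration):
```python
import numpy as np

rng = np.random.default_rng(4)

# Mixing coefficients (priors over components) and component parameters
pi = np.array([0.3, 0.7])
mu = np.array([-2.0, 3.0])
sigma = np.array([0.5, 1.0])

# Ancestral sampling: draw the latent z, then x | z
n = 100_000
z = rng.choice(len(pi), size=n, p=pi)  # latent component indices
x = rng.normal(mu[z], sigma[z])        # component-conditional draws

print(x.mean())  # approx sum_k pi_k mu_k = 0.3*(-2) + 0.7*3 = 1.5
```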
What is the logistic sigmoid function?
A function that transforms values into probabilities:
$\sigma(x) = 1/(1 + e^{-x})$
It is used to generate $\phi$ values for Bernoulli distributions.
It saturates at extreme values, becoming flat and insensitive to small changes in its input.
Its inverse is the logit function, which is also commonly used.
What is the Softplus function?
A smooth approximation of max(0, x):
$\mathrm{softplus}(x) = \log(1 + e^x)$
It is used to generate $\sigma$ values for Gaussian distributions.
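A short numpy sketch of both functions and the logit inverse (the function names are my own):
```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softplus(x):
    return np.log1p(np.exp(x))  # log(1 + e^x)

def logit(p):
    return np.log(p / (1.0 - p))  # inverse of the sigmoid

x = np.array([-4.0, 0.0, 4.0])
print(sigmoid(x))                          # saturates toward 0 and 1
print(np.allclose(logit(sigmoid(x)), x))   # logit inverts sigmoid
print(softplus(x), np.maximum(0, x))       # softplus ~ smooth max(0, x)
```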
What are independent variables in probability?
Two random variables 𝑋 and 𝑌 are independent if knowing one does not change the probability distribution of the other. This is mathematically defined as:
P(X,Y)=P(X)P(Y)
which means their joint probability equals the product of their individual probabilities.
What are dependent variables in probability?
Two variables are dependent if knowledge of one changes the probability of the other, meaning:
$P(X = x, Y = y) = P(X = x)\, P(Y = y \mid X = x)$
In this case, their probabilities influence each other.
What is conditional independence, and how is it defined mathematically?
Two events A and B are conditionally independent given a third event C if knowing C makes A and B independent.
Mathematically, this is written as:
P(A,B∣C)=P(A∣C)P(B∣C)
This means that once we account for C, the occurrence of A does not affect the probability of B.
What is an isotropic Gaussian distribution?
An isotropic Gaussian distribution is a multivariate normal distribution where the covariance matrix is a scaled identity matrix, meaning the variables have equal variance and are uncorrelated.
Mathematically, it is given by:
$p(x) = \dfrac{1}{(2\pi\sigma^2)^{n/2}} \exp\!\left(-\dfrac{\|x - \mu\|^2}{2\sigma^2}\right)$
where:
$\mu$ is the mean vector.
$\sigma^2 I$ is the covariance matrix (where $I$ is the identity matrix).
The distribution is “isotropic” because it has the same variance in all directions.
What is an empirical distribution?
The empirical distribution represents observed data points and provides an estimate of the true underlying distribution. It is defined as:
$P_n(x) = \frac{1}{n} \sum_{i=1}^{n} \delta(x - x_i)$
where:
$x_i$ are the observed data points.
$\delta(x - x_i)$ is the Dirac delta function, which ensures that each observation contributes a probability mass of $1/n$.
The empirical distribution converges to the true distribution as the number of observations increases.
What is Bayes’ Rule, and how is it applied?
Bayes’ Rule is a fundamental theorem in probability that describes how to update beliefs in light of new evidence. It is given by:
$P(A \mid B) = \dfrac{P(B \mid A)\, P(A)}{P(B)}$
where:
𝑃(𝐴∣𝐵) is the posterior probability (updated belief after observing B).
𝑃(𝐵∣𝐴) is the likelihood (probability of observing B given A).
𝑃(𝐴) is the prior probability (initial belief about A).
𝑃(𝐵) is the evidence (normalization factor).
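A worked numeric sketch (the test-accuracy numbers are invented for illustration): updating belief in a rare disease after a positive test.
```python
# A classic Bayes' Rule example with illustrative numbers:
# a test with 99% sensitivity and a 5% false-positive rate
# for a disease with 1% prevalence.
p_disease = 0.01             # prior P(A)
p_pos_given_disease = 0.99   # likelihood P(B | A)
p_pos_given_healthy = 0.05   # false-positive rate P(B | not A)

# Evidence P(B) by marginalizing over both hypotheses
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior P(A | B) via Bayes' Rule
posterior = p_pos_given_disease * p_disease / p_pos
print(posterior)  # ~0.167: even a positive test leaves the disease unlikely
```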