Model Identification and Data Analysis Flashcards

Question

[Model Selection] What are the two options to perform regularization?

Answer 1

1. Constrained optimization problem with a budget C: min_w Ein(w), subject to ||w||₂² <= C. 2. Unconstrained optimization problem with regularization penalty λ: min_w Eaug(w), where Eaug(w) = Ein(w) + (λ/N) ||w||₂².

Answer 2

w_hat_reg = (X^T X + λI)^-1 X^T Y
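
A minimal NumPy sketch of the closed-form regularized solution w_hat_reg = (X^T X + λI)^-1 X^T Y. The function name `ridge_fit` and the toy data are illustrative assumptions, not part of the original cards.

```python
import numpy as np

def ridge_fit(X, Y, lam):
    """Solve (X^T X + lam * I) w = X^T Y for w."""
    d = X.shape[1]
    # Solving the linear system avoids forming the inverse explicitly.
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

# Toy data (assumed): y = 2*x1 - x2, no noise.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
Y = X @ np.array([2.0, -1.0])

w_ols = ridge_fit(X, Y, lam=0.0)   # no penalty: plain least squares
w_reg = ridge_fit(X, Y, lam=10.0)  # the penalty shrinks the weights
```

With λ = 0 the weights match ordinary least squares; increasing λ pulls ||w|| toward zero.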

Answer 3

Try to estimate Eout directly, changing not the cost to minimize but the way we use the data: essentially, divide the data set into a Training Set and a Validation Set.

Answer 4

1. Divide the data set into a Training set (size N-k) and a Validation set (size k). 2. Learn the model using the Training set: Ein(w) = 1/(N-k) sum_{i=1}^{N-k} (yi - w^T xi)^2 => this produces a w_hat. 3. Validate the model using the Validation set and the w_hat produced by training: Eval(w_hat) = 1/k sum_{i=N-k+1}^{N} (yi - w_hat^T xi)^2 => this can be taken as an estimate of the Eout of the model.
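
The split-train-validate procedure can be sketched as follows, assuming a one-parameter model y = w x so that the least-squares solution fits on one line. The helper names `fit_1d` and `mse` and the data values are hypothetical.

```python
def fit_1d(xs, ys):
    """Least-squares w for the model y = w * x."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def mse(w, xs, ys):
    """Mean squared error of the model y = w * x on the given points."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Toy data (assumed): roughly y = 2x with small perturbations.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]

k = 2                                   # validation set size
x_train, y_train = xs[:-k], ys[:-k]     # first N-k points
x_val, y_val = xs[-k:], ys[-k:]         # last k points

w_hat = fit_1d(x_train, y_train)        # step 2: learn on the training set
e_in = mse(w_hat, x_train, y_train)     # in-sample error
e_val = mse(w_hat, x_val, y_val)        # step 3: estimate of Eout
```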

Answer 5

A small k gives a large N-k, which means a better w_hat, but a bad estimate of Eout(w_hat). A large k gives a good estimate of Eout(w_hat), but a small N-k, which means a bad w_hat.

Answer 6

Eaug(w) = Ein(w) + (λ/N) ||w||₂². 1st, we train the model on the training set with a fixed λ to obtain w-hat, repeating this for each candidate λ. 2nd, using the validation set, we choose the optimal λ-hat. 3rd, we use the test set to estimate Eout, using the w-hat obtained with λ-hat.

Answer 7

Using Gradient Descent: w(t+1) = w(t) - η ∇Ein(w(t)), where η is the learning rate (step size).
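
A small sketch of the gradient descent update, assuming Ein is the mean squared error of a one-parameter model y = w x. The step size, iteration count, and data are illustrative choices, not values from the cards.

```python
def grad_e_in(w, xs, ys):
    """Gradient of Ein(w) = 1/N * sum (y - w*x)^2 with respect to w."""
    n = len(xs)
    return -2.0 / n * sum(x * (y - w * x) for x, y in zip(xs, ys))

def gradient_descent(xs, ys, eta=0.05, steps=200, w0=0.0):
    """Repeat w <- w - eta * grad Ein(w) for a fixed number of steps."""
    w = w0
    for _ in range(steps):
        w = w - eta * grad_e_in(w, xs, ys)
    return w

# Toy data (assumed): exactly y = 3x, so the minimizer is w = 3.
xs = [1.0, 2.0, 3.0]
ys = [3.0, 6.0, 9.0]
w_hat = gradient_descent(xs, ys)
```

If the step size is too large for the data, the iteration diverges instead of converging, which is why it usually needs tuning.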

Answer 8

Because training needs the gradient, and sign(w^T x) is not differentiable at 0 and has zero gradient everywhere else.

Answer 9

It is a computational model inspired by the structure of the human brain, in which each neuron receives inputs, processes them, and sends an output to the neurons in the next layer.

Answer 10

Also called Cybenko's Theorem. It states that a neural network with at least one hidden layer containing a sufficient number of neurons can approximate any continuous function on a closed and bounded interval, given appropriate weights and biases.

Answer 11

While Parametric Modeling assumes h(x) is defined by some parameters w, Nonparametric Modeling does not assume any structure for h(x). Classification rules are found from other criteria, which is feasible only if we just want y-hat = h(x) and do not need to know h itself.

Answer 12

The objective of K-NN is to classify a new point using its K nearest neighbours. The new point is assigned the class held by the majority of those neighbours.

Answer 13

y-hat = sign(SUM_{i=1}^{k} y(x-bar⁽ⁱ⁾)), where y-hat is the predicted sign (label) of the new point, k is the number of nearest neighbours (usually given), and y(x-bar⁽ⁱ⁾) is the sign (label) of the i-th nearest point.
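
A minimal sketch of this rule for labels in {+1, -1}, assuming the Euclidean distance. The function name `knn_predict` and the toy points are hypothetical.

```python
import math

def knn_predict(x_new, points, labels, k):
    """Return the sign of the summed labels (+1/-1) of the k nearest points."""
    # Sort point indices by Euclidean distance to the new point.
    order = sorted(range(len(points)), key=lambda i: math.dist(x_new, points[i]))
    vote = sum(labels[i] for i in order[:k])
    # A zero vote (possible for even k) falls to -1 here; that is an arbitrary choice.
    return 1 if vote > 0 else -1

# Toy data (assumed): one group near the origin, one near (5, 5).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
labels = [-1, -1, -1, 1, 1, 1]

y_near_origin = knn_predict((0.5, 0.5), points, labels, k=3)
y_near_five = knn_predict((5.5, 5.5), points, labels, k=3)
```

Choosing an odd k avoids tied votes in the two-class case.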

Answer 14

1. Euclidean Distance: d(x,y) = ||x-y|| 2. Weighted Euclidean Distance: d(x,y) = sqrt((x-y)^T Q (x-y)) 3. Cosine Similarity: cos(x,y) = (x . y)/(||x|| ||y||), where larger values mean more similar points (unlike a distance).

Answer 15

K = floor (sqrt(N))

Answer 16

1. Misclassification Error (Ein or Eout) 2. Confusion Matrix C = [a b; c d], where a = # of True Positives, b = # of False Positives, c = # of False Negatives, d = # of True Negatives. 3. Derived from 2: Positive Predictive Value (Precision): PPV = a/(a + b); True Positive Rate (Recall): TPR = a/(a + c); F1-Score: F1 = (2 PPV TPR)/(PPV + TPR)
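
The confusion-matrix entries and the derived scores can be sketched as below, assuming labels in {+1, -1} with +1 as the positive class. The function name and the example labels are illustrative.

```python
def classification_metrics(y_true, y_pred):
    """Return (a, b, c, d), PPV, TPR and F1 for labels in {+1, -1}."""
    pairs = list(zip(y_true, y_pred))
    a = sum(1 for t, p in pairs if t == 1 and p == 1)    # true positives
    b = sum(1 for t, p in pairs if t == -1 and p == 1)   # false positives
    c = sum(1 for t, p in pairs if t == 1 and p == -1)   # false negatives
    d = sum(1 for t, p in pairs if t == -1 and p == -1)  # true negatives
    # Note: this sketch does not guard against a + b == 0 or a + c == 0.
    ppv = a / (a + b)
    tpr = a / (a + c)
    f1 = 2 * ppv * tpr / (ppv + tpr)
    return (a, b, c, d), ppv, tpr, f1

y_true = [1, 1, 1, 1, -1, -1, -1, -1]
y_pred = [1, 1, 1, -1, 1, -1, -1, -1]
counts, ppv, tpr, f1 = classification_metrics(y_true, y_pred)
```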

Answer 17

The goal is to find a partition of the space so as to divide the N points into K sets with corresponding Centroids. So basically group the points into K groups, with each group being represented by a single point.

Answer 18

The centroid of a group is representative of the data in the group if every point in the group is close to the centroid.

Answer 19

Given a K, we use Lloyd's Algorithm: 1. Initialize the centroids u1, u2, ..., uK. 2. Construct each set Sj as the set of all points closest to the centroid uj (for each point, find which centroid is closest). 3. Update each centroid uj to be the actual center of its set: uj = 1/Nj SUM_{i in Sj} x⁽ⁱ⁾, where Nj is the number of points in Sj. 4. Repeat 2 and 3 until Ein stops decreasing.
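
A compact sketch of the loop above in plain Python. As a simplification, it stops when the centroids no longer move rather than tracking Ein explicitly; the function name `lloyd` and the toy points are assumptions.

```python
import math

def lloyd(points, centroids, max_iter=100):
    """Alternate the assignment and update steps until the centroids stop moving."""
    clusters = []
    for _ in range(max_iter):
        # Step 2: assign each point to its closest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            j = min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            clusters[j].append(p)
        # Step 3: move each centroid to the mean of its set (unchanged if the set is empty).
        new_centroids = [
            tuple(sum(coord) / len(s) for coord in zip(*s)) if s else mu
            for s, mu in zip(clusters, centroids)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Toy data (assumed): two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centroids, clusters = lloyd(points, [(0.0, 0.0), (10.0, 0.0)])
```

The result depends on the initial centroids, so different starts can give different partitions.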

Answer 20

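1. Take one point of the dataset at random and place the first centroid u₁ there. 2. Make the second centroid u₂ the point in the dataset furthest from u₁. 3. For each remaining centroid, repeat step 2 with respect to the centroids already chosen: pick the point furthest from all of them (i.e. with the largest distance to its nearest chosen centroid).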
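
A sketch of this initialization, under the assumption that "furthest from the chosen centroids" means the point whose distance to its nearest chosen centroid is largest. The function name and the `rng` parameter are hypothetical.

```python
import math
import random

def farthest_first_init(points, k, rng=None):
    """Pick k initial centroids: one at random, then greedily the furthest points."""
    rng = rng or random.Random()
    centroids = [rng.choice(points)]            # step 1: random first centroid
    while len(centroids) < k:
        # Steps 2-3: the point whose nearest chosen centroid is furthest away.
        nxt = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(nxt)
    return centroids

points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (5.0, 8.0)]
init = farthest_first_init(points, k=3, rng=random.Random(0))
```

Passing a seeded `random.Random` makes the first pick reproducible, which helps when comparing runs.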

Answer 21

Plotting Ein against K gives a curve with a "knee": Ein drops quickly at first and then flattens. K should be chosen close to the knee.

Answer 22

Hierarchical Clustering. We start from N clusters (N being the number of points) and merge clusters step by step until a single cluster remains, recording the error (distance) between the merged clusters at each iteration, and we use a dendrogram to visualize the result. Normally there is a step at which the distance jumps sharply from one level to the next, which indicates that natural clusters have been found, and we can stop the algorithm there.

Answer 23

1. Input Centering 2. Input Normalization 3. Input Whitening 4. Data Cleaning 5. Dimensionality Reduction

Answer 24

The goal is to remove the bias in X by translating the origin of the dataset to lie exactly in the middle (the mean) of the datapoints.

Answer 25

1st, find the input bias: x-bar = 1/N SUM_{i=1}^{N} x⁽ⁱ⁾ 2nd, calculate the new points: z⁽ⁱ⁾ = x⁽ⁱ⁾ - x-bar

Answer 26

To rescale the input X (having centered X beforehand) so that every dimension has the same spread (unit variance).

Answer 27

1st, we obtain the variance along each dimension j: σ_j² = 1/N SUM_{i=1}^{N} (x_j⁽ⁱ⁾)² 2nd, we update the values of the datapoints: z_j⁽ⁱ⁾ = x_j⁽ⁱ⁾ / σ_j
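
Centering followed by normalization can be sketched with NumPy as below, assuming the rows of X are the datapoints. The data values are made up for illustration.

```python
import numpy as np

# Toy data (assumed): 3 datapoints, 2 dimensions with very different scales.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])

x_bar = X.mean(axis=0)                    # input bias (per-dimension mean)
Xc = X - x_bar                            # centered data
sigma = np.sqrt((Xc ** 2).mean(axis=0))   # per-dimension standard deviation
Z = Xc / sigma                            # normalized data
```

After this step each column of Z has zero mean and unit variance.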

Answer 28

The goal is to decorrelate the input, removing any shape in the data that correlates one dimension with another.

Answer 29

1st, calculate the empirical covariance matrix: A = 1/N SUM_{i=1}^{N} x⁽ⁱ⁾ x⁽ⁱ⁾^T = 1/N X^T X 2nd, update the data: z⁽ⁱ⁾ = A^(-1/2) x⁽ⁱ⁾
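
A NumPy sketch of whitening, assuming the rows of X are datapoints and that A is invertible. Computing A^(-1/2) through an eigendecomposition is one possible choice, not necessarily the one used in the course.

```python
import numpy as np

# Toy data (assumed): 5 datapoints in 2 dimensions.
X = np.array([[2.0, 1.0], [0.0, 1.0], [-1.0, -3.0], [3.0, 2.0], [1.0, -1.0]])

Xc = X - X.mean(axis=0)                  # center first
A = Xc.T @ Xc / len(Xc)                  # empirical covariance, 1/N X^T X

# A is symmetric positive definite here, so A^(-1/2) = V diag(1/sqrt(e)) V^T.
eigvals, eigvecs = np.linalg.eigh(A)
A_inv_sqrt = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T

Z = Xc @ A_inv_sqrt                      # row i is A^(-1/2) x_i (A^(-1/2) is symmetric)
cov_Z = Z.T @ Z / len(Z)                 # identity matrix up to rounding
```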

Answer 30

The goal is to remove outliers in the original data.

Answer 31

1. Symmetry and Intensity (physical insight) 2. Data-driven tools

Answer 32

PCA (Principal Component Analysis): 1. Input a data matrix X and the desired number of dimensions k. 2. Compute the Singular Value Decomposition (SVD) of the matrix X: [U, T, V] = svd(X) 3. Let V_k be the first k columns of V. 4. PCA feature matrix: Z = X V_k
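
The four steps can be sketched with NumPy as follows. Note that `np.linalg.svd` returns V transposed, so the first k columns of V are the first k rows of the returned matrix. The toy data is assumed to be already centered.

```python
import numpy as np

# Toy data (assumed): centered points lying exactly on the line y = 2x.
X = np.array([[-2.0, -4.0], [-1.0, -2.0], [0.0, 0.0], [1.0, 2.0], [2.0, 4.0]])
k = 1

U, s, Vt = np.linalg.svd(X, full_matrices=False)  # X = U diag(s) Vt
V_k = Vt[:k].T                                    # first k columns of V
Z = X @ V_k                                       # PCA features, shape (N, k)

X_back = Z @ V_k.T                                # map the features back to 2D
```

Because these points lie on a single line, one component reproduces them exactly; with real data `X_back` would only approximate X.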

Answer 33

The goal of PCA is to represent the input features in a lower-dimensional space while keeping the important information (it is unsupervised). In the 2D case, this means representing each datapoint (x, y) with a single coordinate.

Answer 34

An infinite sequence of random variables, all defined on the same probabilistic space. Its value depends not only on x_t, but also on past values of x and y.

Answer 35

Different values of the y function at a single time (different experiments that get different values while using the same initial parameters).

Answer 36

Also called PDF. It describes how likely each value of y is. It is normally a Gaussian distribution, where the value of y is most likely at the mean, but not necessarily.

Answer 37

It is a simplified representation of y(x,t), in which the process is described using only its mean and covariance functions.

Answer 38

m(t) = E_s[x(s,t)] = ∫ x(s,t) pdf(s) ds, with the integral taken over the whole probabilistic space.

Answer 39

gamma(t1,t2) = E_s{[x(s,t1) - m(t1)][x(s,t2) - m(t2)]}. If t1 = t2, we get the variance of x at t1. If t1 != t2, gamma tells us how much x(t1) and x(t2) are linearly correlated.

Answer 40

Also SSP. A stochastic process (S.P.) is stationary if: 1. the mean is constant over all time; 2. gamma(t1,t2) depends only on tau = t2-t1, so the covariance can be defined with a single variable tau: gamma(t1,t2) = gamma(tau). The correlation between instants separated by the same tau is always the same, and so the variance is the same for any t.

Answer 41

The covariance is given by gamma(tau) = E{(x(s,t) - m)(x(s,t-tau) - m)}. 1. Non-negativity: gamma(0) = E{(x(s,t) - m)^2} >= 0 2. Variance prevalence: |gamma(tau)| <= gamma(0) for all tau 3. Symmetry: gamma(tau) = gamma(-tau) for all tau

Answer 42

e(s,t) ~ White Noise WN(mu, lambda^2) if: 1. Mean = mu: E[e(s,t)] = mu 2. Variance = lambda^2: E[(e(s,t) - mu)^2] = lambda^2 3. Covariance = 0 for any tau except 0: gamma(tau) = E[(e(s,t) - mu)(e(s,t-tau) - mu)] = 0 for tau != 0

Answer 43

It represents the range of frequencies that make up a signal, showing how much of each frequency is present (amplitude and phase spectra). The Fourier Series and the Fourier Transform are used to find these in deterministic settings.

Answer 44

By using the Spectral Density of the process, also called the Spectrum. It is defined as the Fourier transform of the covariance function: GAMMA(w) = SUM_{tau=-∞}^{+∞} gamma(tau) e^(-j w tau)

Answer 45

lambda^2, because gamma(tau) takes the value lambda^2 only at tau = 0 and is 0 everywhere else, therefore: GAMMA(w) = SUM_{tau=-∞}^{+∞} gamma(tau) e^(-j w tau) = lambda^2 e^0 = lambda^2

Answer 46

1. Realness: the imaginary part is 0 for any real w. 2. Non-negativity: GAMMA(w) >= 0 for any real w. 3. Symmetry: GAMMA(w) = GAMMA(-w) for any real w. 4. Periodicity (of period 2pi): GAMMA(w) = GAMMA(w + 2 pi k) for any real w and any integer k.

Answer 47

gamma(tau) = 1/(2pi) integral_{-pi}^{pi} GAMMA(w) e^(j w tau) dw. For tau = 0 (the variance), gamma(0) is the area under GAMMA's curve over [-pi, pi] divided by 2pi.

Answer 48

1. Mean and covariance function. 2. Mean and spectral density function.

Answer 49

m-hat = 1/N SUM_{t=1}^{N} x(s-bar, t)

Answer 50

Given N datapoints with zero mean: gamma-hat(tau) = 1/(N-|tau|) SUM_{t=1}^{N-|tau|} x(s-bar, t) x(s-bar, t+|tau|), for |tau| <= N-1
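
A sketch of the sample mean and sample covariance estimators for a single realization, using the 1/(N-|tau|) normalization and 0-based indexing. The function names and the alternating toy sequence are assumptions.

```python
def sample_mean(x):
    """m_hat = 1/N * sum of x(t)."""
    return sum(x) / len(x)

def sample_covariance(x, tau):
    """gamma_hat(tau) for zero-mean data, normalized by N - |tau|."""
    n, h = len(x), abs(tau)
    return sum(x[t] * x[t + h] for t in range(n - h)) / (n - h)

# Toy realization (assumed): zero-mean sequence alternating between +1 and -1.
x = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

m_hat = sample_mean(x)
gamma_0 = sample_covariance(x, 0)    # sample variance
gamma_1 = sample_covariance(x, 1)    # neighbouring samples are opposite
gamma_2 = sample_covariance(x, -2)   # symmetric in tau
```

For data with a non-zero mean, m_hat would be subtracted from x before estimating the covariance.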

Answer 51

Given N datapoints with zero mean: GAMMA-hat_N(w) = SUM_{tau=-(N-1)}^{N-1} gamma-hat(tau) e^(-j w tau)

Answer 52

Using the FFT (Fast Fourier Transform): GAMMA-hat-prime_N(w) = 1/N |SUM_{t=1}^{N} x(t) e^(-j w t)|^2
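
A NumPy sketch of this estimator evaluated on the FFT frequency grid w_k = 2 pi k / N. `np.fft.fft` indexes time from 0 rather than 1, which only changes the phase and not the squared magnitude. The data values are made up.

```python
import numpy as np

def periodogram(x):
    """Return the frequencies w_k = 2*pi*k/N and the values |FFT(x)|^2 / N."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    w = 2 * np.pi * np.arange(n) / n
    return w, np.abs(np.fft.fft(x)) ** 2 / n

# Toy realization (assumed).
x = np.array([1.0, -2.0, 0.5, 3.0, -1.5, 0.0, 2.0, -3.0])
w, gamma_prime = periodogram(x)
```

By Parseval's relation, the average of these values over the grid equals the mean of x(t)^2, which is a quick sanity check on the scaling.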

Answer 53

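By using averaging: GAMMA-hat-star_N(w) = 1/3 SUM_{i=1}^{3} GAMMA-hat_{N/3}⁽ⁱ⁾(w), where GAMMA-hat_{N/3}⁽ⁱ⁾ is the estimate computed on the i-th of 3 segments of N/3 datapoints.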
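
The averaging idea can be sketched as follows, assuming the data is cut into 3 non-overlapping segments of equal length and each segment's estimate is computed with the FFT. The function names and data are hypothetical.

```python
import numpy as np

def periodogram(x):
    """|FFT(x)|^2 / N on the grid w_k = 2*pi*k/N."""
    return np.abs(np.fft.fft(x)) ** 2 / len(x)

def averaged_periodogram(x, segments=3):
    """Average the periodograms of equal-length, non-overlapping segments."""
    x = np.asarray(x, dtype=float)
    n_seg = len(x) // segments
    pieces = np.split(x[: n_seg * segments], segments)  # drop leftover samples
    return np.mean([periodogram(p) for p in pieces], axis=0)

# Toy realization (assumed): N = 9, so 3 segments of 3 samples each.
x = np.array([1.0, -2.0, 0.5, 3.0, -1.5, 0.0, 2.0, -3.0, 1.0])
gamma_star = averaged_periodogram(x)
```

Averaging reduces the variance of the estimate, at the cost of a coarser frequency grid (N/3 points instead of N).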

Answer 54

1. Divide x(s,t) into x_SSP(s,t) + v(t). 2. Estimate the nonstationary component v-hat(t) from the data. 3. Remove the deterministic part: x-hat_SSP = x - v-hat 4. Work with x-hat_SSP, getting m-hat and gamma-hat(tau). 5. Work with x: m-hat_x(t) = m-hat + v-hat(t)

Answer 55

Given N datapoints: 1. x(t) = x_SSP(t) + linear trend v(t) 2. v(t) = k t + q, where q includes any possible non-zero mean of the process. 3. Choose v-hat(t) by using OLS: [k-hat q-hat]^T = [SUM(t^2) SUM(t); SUM(t) N]^-1 [SUM(t x(t)); SUM(x(t))], with all sums over t = 1, ..., N
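
The OLS formula for the linear trend can be sketched in plain Python by inverting the 2x2 matrix by hand, with t running from 1 to N. The function name `fit_linear_trend` and the noise-free data are assumptions.

```python
def fit_linear_trend(x):
    """Return (k_hat, q_hat) for v(t) = k*t + q, with t = 1, ..., N."""
    n = len(x)
    ts = range(1, n + 1)
    s_tt = sum(t * t for t in ts)
    s_t = sum(ts)
    s_tx = sum(t * v for t, v in zip(ts, x))
    s_x = sum(x)
    # Inverse of [[s_tt, s_t], [s_t, n]] applied to [s_tx, s_x].
    det = s_tt * n - s_t * s_t
    k_hat = (n * s_tx - s_t * s_x) / det
    q_hat = (s_tt * s_x - s_t * s_tx) / det
    return k_hat, q_hat

# Toy data (assumed): x(t) = 0.5*t + 2 exactly, for t = 1..5.
x = [2.5, 3.0, 3.5, 4.0, 4.5]
k_hat, q_hat = fit_linear_trend(x)
detrended = [v - (k_hat * t + q_hat) for t, v in zip(range(1, len(x) + 1), x)]
```

Subtracting the fitted trend leaves the estimate of the stationary part, which is all zeros for this noise-free example.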

Answer 56

Given N datapoints: 1. x(t) = x_SSP(t) + seasonal trend v(t) 2. v(t) = v(t + kT), where T is the period of the seasonality and is known. 3. To choose v-hat(t): v-hat(t) = 1/M SUM_{h=0}^{M-1} x(t + hT), where M is the number of periods, t = 1, 2, ..., T, and M T <= N
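
A sketch of the seasonal estimate, averaging the samples that share the same position within the period. It uses 0-based indexing and takes M as the largest number of full periods that fit in the data; the function name and data are assumptions.

```python
def seasonal_profile(x, T):
    """v_hat(t) = 1/M * sum_{h=0}^{M-1} x(t + h*T) for t = 0, ..., T-1."""
    M = len(x) // T  # number of complete periods, so M*T <= N
    return [sum(x[t + h * T] for h in range(M)) / M for t in range(T)]

# Toy data (assumed): a pattern of period 3, plus one leftover sample.
x = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0]
v_hat = seasonal_profile(x, T=3)

# Remove the seasonal component from the complete periods.
deseasonalized = [x[i] - v_hat[i % 3] for i in range(9)]
```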