Model Identification and Data Analysis Flashcards

1
Q

[Linear Classification] What’s a Perceptron?

A

A perceptron is a linear classifier: it tells which of two classes a given point belongs to. It is given by the expression:
h(x, w) = sign(w^T x)
that is, the sign of the dot product between the weight vector w and the input vector x.

In a 2D case, where each point is given by x1 and x2, the perceptron is:
y_hat = sign(w0 + w1 x1 + w2 x2)

where w0 is the bias (it multiplies a constant input x0 = 1).
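
A minimal sketch of the 2D perceptron above, in plain Python. The weight values are made up for illustration, and a sum of exactly 0 is mapped to -1 here, which is a convention chosen for this sketch.

```python
def perceptron(w, x):
    """Return sign(w0 + w1*x1 + ... + wn*xn) as +1 or -1."""
    # w = [w0, w1, ..., wn]; x = [x1, ..., xn] (the constant input x0 = 1 is implicit)
    s = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if s > 0 else -1

w = [-1.0, 1.0, 1.0]               # hypothetical weights: boundary is x1 + x2 = 1
print(perceptron(w, [2.0, 2.0]))   # a point on the positive side of the line
print(perceptron(w, [0.0, 0.0]))   # a point on the negative side of the line
```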

2
Q

[Linear Classification] Given a graph with the separating line, how do you get the expression for the perceptron?

A

Considering a 2D case, the perceptron has the form:
h(x1, x2) = 1, if w0 + w1 x1 + w2 x2 > 0
h(x1, x2) = -1, otherwise.

To choose values for w0, w1 and w2, substitute the points where the line intersects the axes (x2 = 0 on the x1 axis, x1 = 0 on the x2 axis):
w0 + w1 (x1) = 0
w0 + w2 (x2) = 0
This gives a relation between the weights, for example:
w0 = w1 = -2 w2.
Then take any point off the line whose class is known, and choose the overall sign of the weights so that the perceptron assigns that point its correct class.

3
Q

[Linear Classification] What’s the Perceptron Learning Algorithm (PLA) Update Rule?

A

The PLA Update Rule is given by:
w(t+1) = w(t) + y(t) x(t)
where:
w(t) is the vector containing the current weights of the perceptron (w0, w1, … wn).
x(t) is the vector containing the coordinates of the misclassified point (x0, x1, … xn), with x0 = 1.
y(t) is the true class of that point (+1 or -1).
w(t+1) is the updated weight vector.

4
Q

[Linear Classification] How does one use the PLA?

A

First, evaluate a point with the current perceptron, using the point’s coordinates. If the point is correctly classified, the PLA doesn’t change w.
If the point is misclassified, apply the Update Rule to get a new weight vector w. Repeat until no point is misclassified, which only happens if the data are linearly separable.
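
The procedure can be sketched as a short loop. This is an illustrative version, not a reference implementation: the data set, the zero initial weights, the choice of always updating on the first misclassified point and the `max_iters` cap are all assumptions.

```python
def predict(w, x):
    # sign(w^T x), with a sum of exactly 0 mapped to -1
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1

def pla(points, labels, max_iters=1000):
    # Each point is (x0, x1, ..., xn) with x0 = 1, so w0 acts as the bias.
    w = [0.0] * len(points[0])
    for _ in range(max_iters):
        wrong = [(x, y) for x, y in zip(points, labels) if predict(w, x) != y]
        if not wrong:
            break                                   # nothing misclassified: stop
        x, y = wrong[0]
        w = [wi + y * xi for wi, xi in zip(w, x)]   # w(t+1) = w(t) + y(t) x(t)
    return w

# Made-up, linearly separable data
points = [(1, 2.0, 2.0), (1, 3.0, 1.0), (1, 0.0, 0.0), (1, -1.0, 0.5)]
labels = [1, 1, -1, -1]
w = pla(points, labels)
print(w, [predict(w, x) for x in points])
```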

5
Q

[Linear Classification] Why might Generalization be an issue with the PLA?

A

One doesn’t really know how well the algorithm will classify new points. The Generalization Theorem doesn’t really help with identifying just how general the algorithm is.

6
Q

[Linear Regression] What happens to the training and testing errors when the training set is increased? Why?

A

Training Error increases, because as more examples have to be fitted, it becomes harder for the model to stay close to all of the points.
Testing Error decreases, because there is more information, and therefore we can develop a better model.
More training examples lead to better Generalization.

7
Q

[Linear Regression] How is the Ordinary Least Squares formula derived?

A

1. Define the problem: find a model y_hat = X w that minimizes the error between the predicted and actual values of y.
2. Squared loss (cost function):
L(w) = sum_{i=1}^{n} (y_i - x_i^T w)^2
or, in matrix form,
L(w) = (y - X w)^T (y - X w)
3. Expand the expression:
L(w) = y^T y - 2 w^T X^T y + w^T X^T X w
4. Minimize with respect to w by setting the gradient to zero:
dL(w)/dw = -2 X^T y + 2 X^T X w = 0
5. Solve for w:
X^T X w = X^T y
w_hat = (X^T X)^{-1} X^T y
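
As a worked instance of step 5, the sketch below solves the normal equations X^T X w = X^T y by hand for the special case of a straight line, where each row of X is (1, x_i) and X^T X is a 2x2 matrix. The function name and the data are made up for illustration.

```python
def ols_line(xs, ys):
    """Solve X^T X w = X^T y when each row of X is (1, x_i)."""
    n = len(xs)
    sx, sxx = sum(xs), sum(x * x for x in xs)
    sy, sxy = sum(ys), sum(x * y for x, y in zip(xs, ys))
    # X^T X = [[n, sx], [sx, sxx]]  and  X^T y = [sy, sxy]
    det = n * sxx - sx * sx            # must be non-zero for the inverse to exist
    w0 = (sxx * sy - sx * sxy) / det   # first row of (X^T X)^{-1} X^T y
    w1 = (n * sxy - sx * sy) / det     # second row of (X^T X)^{-1} X^T y
    return w0, w1

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]              # made-up data lying exactly on y = 1 + 2x
w0, w1 = ols_line(xs, ys)
print(w0, w1)
```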

8
Q

[Linear Regression] What’s the Ordinary Least Squares (OLS) Formula?

A

w_hat = (X^T X)^{-1} X^T y

9
Q

[Linear Regression] When using OLS, what are the conditions necessary for w_hat to be a minimum point?

A

1. The gradient of Ein at w_hat is 0.
2. The Hessian of Ein (its second derivative with respect to w) is positive definite. For OLS the Hessian is proportional to X^T X, so X^T X must be positive definite.

10
Q

[Linear Regression] How would you describe the Generalization of the OLS algorithm?

A

Bias-Variance Trade-off: OLS tends to have low variance, reducing the risk of overfitting. However, for complex data, linear regression may not generalize well to unseen data.
OLS generally generalizes better with more data.
The Mean Square Error on the validation data can be used as a measure of the generalization error.

11
Q

[Logistic Classification] Why shouldn’t the Gradient Descent Algorithm Step size be too big or too small?

A

If the step size is too small, it may take too long to train the model. If it is too big, each step can overshoot the minimum and go “up the valley” on the other side, so the algorithm may oscillate or diverge and never reach a useful model.

12
Q

[Logistic Classification] What is the purpose of using the Gradient Descent Technique?

A

The goal is to find the minimum of a convex loss function. We want to get to a point where the error is the minimum, therefore we need to roll-down the valley and find the minimum.

13
Q

[Logistic Classification] What is the output of a logistic regression model?

A

The output can be used as a probability, as it is a number between 0 and 1.

14
Q

What is the relationship between a Logistic Regression and Neural Networks?

A

Both are widely used in binary classification.

Logistic regression can be seen as a simplified neural network consisting of a single layer.

Neural Networks incorporate multiple layers with non-linear activation functions, allowing them to learn complex representations of the inputs.

Both can use Gradient Descent to minimize the loss function, but neural networks need more resources.

15
Q

[Logistic Classification] What is the formula for the output of the Logistic Regression?

A

h(s) = e^s / (1+e^s)
equivalent to:
h(s) = 1/(1+e^-s)

Where s is the linear combination of the input features, usually given by s = w^T x.
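
A small sketch of the logistic function in plain Python. Switching between the two equivalent forms depending on the sign of s is an implementation choice made here to avoid overflow in `math.exp`; it is not something the formula itself requires.

```python
import math

def logistic(s):
    """h(s) = 1 / (1 + e^-s), evaluated without overflow for large |s|."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)          # for s < 0 use the equivalent form e^s / (1 + e^s)
    return e / (1.0 + e)

print(logistic(0.0))         # value at s = 0
print(logistic(2.0), logistic(-2.0))
```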

16
Q

[Logistic Classification] What’s the classification boundary of a Logistic Classifier?

A

It is where the predicted probability switches from below 0.5 to above 0.5. This happens at s = w^T x = 0, since h(0) = 0.5.

17
Q

[Gradient Descent] How do we know when to stop the Gradient Descent algorithm?

A

We can choose one of the following options:
1. Set a threshold for ||∇Ein(w(t))||
2. Set an upper bound on the number of iterations
3. Set Threshold on Ein
4. A combination of the above

18
Q

[Logistic Optimization] What’s the formula for Gradient Descent?

A

w(t+1) = w(t) - n ∇Ein(w(t))

Where:
n (eta) is the step size of each update (the learning rate).
∇Ein(w(t)) is the gradient of the cost function at the current weights.
w(t) is the current weight vector of the model.
w(t+1) is the updated weight vector of the model.
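
The update can be sketched as a loop with two stopping rules (a gradient-norm threshold and an iteration cap). The error function below is a made-up convex example with a known minimum, and `eta`, `tol` and `max_iters` are arbitrary illustrative values.

```python
def gradient_descent(grad, w, eta=0.1, max_iters=1000, tol=1e-8):
    for _ in range(max_iters):
        g = grad(w)
        if sum(gi * gi for gi in g) ** 0.5 < tol:     # stop when ||grad Ein|| is small
            break
        w = [wi - eta * gi for wi, gi in zip(w, g)]   # w(t+1) = w(t) - eta * grad Ein(w(t))
    return w

# Hypothetical convex error: Ein(w) = (w0 - 3)^2 + (w1 + 1)^2, minimum at (3, -1)
def grad_Ein(w):
    return [2 * (w[0] - 3), 2 * (w[1] + 1)]

w_min = gradient_descent(grad_Ein, [0.0, 0.0])
print(w_min)
```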

19
Q

[Gradient Descent] Explain Stochastic Gradient Descent. How is it Different from Normal Gradient Descent?

A

w(t+1) = w(t) - n ∇Ein(w(t); xi, yi)
Similar to normal gradient descent, but at each step the gradient is computed on a randomly selected training sample (xi, yi) instead of the whole dataset. Each update is much cheaper, so in general it makes progress much faster than normal GD, but the updates are noisy and the error function might not be minimized as well as with GD.
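
A minimal sketch of SGD on a squared-error cost for a line y_hat = w0 + w1 x. The model, the data, the step size and the fixed random seed are assumptions made for the example.

```python
import random

def sgd_line(xs, ys, eta=0.05, steps=2000, seed=0):
    rng = random.Random(seed)
    w = [0.0, 0.0]                                # model: y_hat = w0 + w1 * x
    for _ in range(steps):
        i = rng.randrange(len(xs))                # pick ONE training sample at random
        err = (w[0] + w[1] * xs[i]) - ys[i]
        # gradient of the single-sample error (y_hat - y_i)^2 with respect to w
        grad = [2 * err, 2 * err * xs[i]]
        w = [w[0] - eta * grad[0], w[1] - eta * grad[1]]
    return w

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]                         # made-up noise-free data, y = 1 + 2x
w_sgd = sgd_line(xs, ys)
print(w_sgd)
```

With noisy data the weights would keep jittering around the minimum unless the step size is reduced over time.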

20
Q

[Bias-Variance] What is the Bias Error?

A

bias = E_x[ (h_bar(x) - f(x))^2 ]
Where:
h_bar(x) is the average model over all possible datasets (not computable in practice).
f(x) is the target function.

Bias is constant with respect to the dataset. It represents how close a model h within the selected class H can get to the target function.

21
Q

[Bias-Variance] What is the Variance Error?

A

var = E_x{ E_D[ (h_D(x) - h_bar(x))^2 ] }
It is constant with respect to f. It represents the deviation of the computed model h_D(x) with respect to the average model h_bar(x).

22
Q

[Bias-Variance] What happens when variance = 0? why?

A

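The Bias typically gets very high. Zero variance means the learned model does not change with the dataset (for example, when the hypothesis set contains a single fixed model), so the model cannot adapt to the target function.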

23
Q

[Model Selection] What are the 2 different approaches to choose the complexity of the model?

A
  1. Regularization
  2. Cross-Validation
24
Q

[Model Selection] Explain the objective of Regularization.

A

Model the mismatch between Ein and Eout and define a more appropriate cost.

25
Q

[Model Selection] What are the two options to perform regularization?

A
  1. Constrained optimization problem with a budget C:
    min_w Ein(w), subject to: ||w||_2^2 <= C
  2. Unconstrained optimization problem with regularization penalty λ:
    min_w Eaug(w), where:
    Eaug(w) = Ein(w) + (λ/N) ||w||_2^2
26
Q

[Model Selection] What is the formula for Regularized Least Squares (RLS)?

A

w_hat_reg = (X^T X + λ I)^{-1} X^T y
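
To see the effect of λ, the sketch below evaluates the formula in the scalar case (one feature, no bias term), where X^T X and X^T y are plain sums. The function name and the data are made up.

```python
def ridge_1d(xs, ys, lam):
    """Scalar case of (X^T X + lam I)^{-1} X^T y for the model y_hat = w * x."""
    sxx = sum(x * x for x in xs)                  # X^T X
    sxy = sum(x * y for x, y in zip(xs, ys))      # X^T y
    return sxy / (sxx + lam)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]                 # made-up data on y = 2x
print(ridge_1d(xs, ys, 0.0))         # lam = 0 gives back the OLS solution
print(ridge_1d(xs, ys, 14.0))        # a larger lam shrinks the weight towards 0
```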

27
Q

[Model Selection] What is the Objective of Cross-Validation?

A

To estimate Eout directly, without changing the cost being minimized, by changing the way we use the data instead.
Essentially, the dataset is divided into a Training Set and a Validation Set.

28
Q

[Model Selection] What are the steps in order to do Cross-Validation?

A
  1. Divide the dataset into a Training set (size N-k) and a Validation set (size k).
  2. Learn the model using the Training set: Ein(w) = 1/(N-k) sum_{i=1}^{N-k} (y_i - w^T x_i)^2 => This produces a w_hat
  3. Validate the model using the Validation set and the w_hat produced by the training:
    Eval(w_hat) = 1/k sum_{i=N-k+1}^{N} (y_i - w_hat^T x_i)^2 => This can be taken as an estimate of the Eout of the model
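
The three steps can be sketched for a one-parameter model y_hat = w x. The model, the helper names and the data are assumptions chosen to keep the example short; the last k points are used for validation, as in the formulas above.

```python
def fit_slope(xs, ys):
    # least-squares slope for the hypothetical model y_hat = w * x
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def mse(w, xs, ys):
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

def holdout(xs, ys, k):
    n = len(xs)
    w_hat = fit_slope(xs[:n - k], ys[:n - k])     # step 2: train on the first N-k points
    e_in = mse(w_hat, xs[:n - k], ys[:n - k])
    e_val = mse(w_hat, xs[n - k:], ys[n - k:])    # step 3: validate on the last k points
    return w_hat, e_in, e_val

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]                   # made-up data, roughly y = 2x
w_hat, e_in, e_val = holdout(xs, ys, k=1)
print(w_hat, e_in, e_val)
```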
29
Q

[Model Selection] In Cross-Validation, what is the trade-off in the selection of k?

A

A small k: large N-k, which means a better w_hat, but a bad estimate of Eout(w_hat).

A large k: a good estimate of Eout(w_hat), but a small N-k, which means a bad w_hat.

30
Q

[Model Selection] In Cross-Validation, what is the rule of thumb to select k?

A

k = N/5

31
Q

[Model Selection] How do we combine Regularization and Cross-Validation to tune λ?

A

Eaug(w) = Ein(w) + (λ/N) ||w||_2^2
1st: for each candidate λ, train the model on the training set with that λ fixed to obtain a w_hat.
2nd: using the validation set, choose as λ_hat the λ whose w_hat gives the lowest validation error.
3rd: use the Test set to estimate Eout, using the w_hat obtained with λ_hat.

32
Q

[Neural Networks] How is a Neural Network of non-linear combinations optimized?

A

Using Gradient Descent.
w(t+1)=w(t) - n ∇Ein(w(t))

33
Q

[Neural Networks] Why can’t a linear classification Neural Network (sign(wTX)) be trained using Gradient Descent?

A

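Because Gradient Descent needs the gradient, and sign(w^T x) is not differentiable at 0 and has zero derivative everywhere else, so the gradient gives no direction in which to update the weights.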

34
Q

[Neural Networks] What is the definition of a Neural Network?

A

It is a computational model inspired by the structure of the human brain, in which each neuron receives inputs, processes them, and sends an output to the neurons in the next layer.

35
Q

[Neural Networks] What is the Universal Approximation Theorem?

A

Also called Cybenko’s Theorem. It states that a neural network with at least one hidden layer containing a sufficient number of neurons can approximate any continuous function on a closed and bounded interval, given appropriate weights and biases.

36
Q

[Non-Parametric Modeling] What is Nonparametric Modeling?

A

While Parametric Modeling assumes h(x) is defined by some parameters w, Nonparametric Modeling does not assume any structure for h(x).
The classification rule is found from other criteria (for example, directly from the data points). This is only viable if we just need the prediction y_hat = h(x) and don’t need to know h itself.

37
Q

[Nonparametric Modeling] What is K-Nearest Neighbours?

A

The objective of K-NN is to classify a new point using its K nearest neighbours in the dataset. The new point is assigned the class of the majority of the points in that set of nearest neighbours.

38
Q

[Nonparametric Modeling] What is the formula to implement K- Nearest Neighbours?

A

y_hat = sign( sum_{i=1}^{k} y(x_bar_i) )
where y_hat is the predicted class (+1 or -1) of the new point,
k is the number of nearest neighbours (usually given),
x_bar_i is the i-th nearest neighbour of the new point and y(x_bar_i) is its class (+1 or -1).
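
A short sketch of this rule for labels +1 / -1, using the Euclidean distance to rank neighbours. The training points are made up, and a tied vote (possible when k is even) is mapped to -1, which is an arbitrary choice for this example.

```python
def knn_classify(train_x, train_y, x_new, k):
    # squared Euclidean distance gives the same neighbour ranking as the distance itself
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    ranked = sorted(zip(train_x, train_y), key=lambda p: dist2(p[0], x_new))
    vote = sum(y for _, y in ranked[:k])      # sum of the classes of the k nearest points
    return 1 if vote > 0 else -1

train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = [-1, -1, -1, 1, 1, 1]
print(knn_classify(train_x, train_y, (4.5, 5.0), k=3))
print(knn_classify(train_x, train_y, (0.2, 0.1), k=3))
```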

39
Q

[Nonparametric Modeling] What are the options to calculate the distance between two points?

A
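  1. Euclidean Distance:
    d(x, y) = ||x - y||
  2. Weighted Euclidean Distance (Q is a positive definite weighting matrix):
    d(x, y) = sqrt((x - y)^T Q (x - y))
  3. Cosine Similarity (a similarity rather than a distance: a larger value means the points are closer in direction):
    cos(x, y) = (x . y) / (||x|| ||y||)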
40
Q

[Nonparametric Modeling] What is the rule of thumb to find K in K-NN?

A

K = floor (sqrt(N))

41
Q

[General Classification] How to assess the quality of a classifier?

A
  1. Misclassification Error (Ein or Eout)
  2. Confusion Matrix C = [a b; c d] where a = # of True Positives, b = # of False Positives, c = # of False Negatives, d = # of True Negatives.
  3. Derived from 2:
    Positive Predicted Value (Precision):
    PPV = a / (a + b)
    True Positive Rate (Recall):
    TPR = a / (a + c)
    F1-Score:
    F1 = (2 PPV TPR) / (PPV + TPR)
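
These quantities can be computed directly from the four entries of the confusion matrix. The counts below are made up, and the sketch assumes none of the denominators is zero.

```python
def classifier_metrics(a, b, c, d):
    """a = true positives, b = false positives, c = false negatives, d = true negatives."""
    ppv = a / (a + b)                      # precision
    tpr = a / (a + c)                      # recall
    f1 = 2 * ppv * tpr / (ppv + tpr)
    error = (b + c) / (a + b + c + d)      # misclassification error
    return ppv, tpr, f1, error

ppv, tpr, f1, error = classifier_metrics(a=40, b=10, c=20, d=30)
print(ppv, tpr, f1, error)
```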
42
Q

[Clustering] What is the goal of Clustering?

A

The goal is to find a partition of the space so as to divide the N points into K sets with corresponding Centroids. So basically group the points into K groups, with each group being represented by a single point.

43
Q

[Clustering] What is the goal of K-Means?

A

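To find K groups whose centroids are representative of their data. The centroid of a group is representative of the data in the group if every point in the group is close to the centroid, so K-Means looks for the partition and centroids that keep each point close to its own centroid.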

44
Q

[Clustering] How does one apply K-Means?

A

Given a K, we use Lloyd’s Algorithm:
1. Initialize the centroids u_1, u_2, … u_K.
2. Construct each set S_j to be the set of all points closest to the centroid u_j. (For each point, find which centroid is closest.)
3. Update the centroids u_j so that they are the real centers of each set:
u_j = (1/N_j) sum_{x_i in S_j} x_i
4. Repeat 2 and 3 until Ein stops decreasing.
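
The steps above can be sketched as follows. Two things here are assumptions of the sketch rather than part of the card: the loop stops when the centroids no longer move (used in place of checking Ein), and an empty cluster simply keeps its old centroid. The data and the initial centroids are made up.

```python
def lloyd(points, centroids, max_iters=100):
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    clusters = []
    for _ in range(max_iters):
        # Step 2: assign every point to its closest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            j = min(range(len(centroids)), key=lambda idx: dist2(p, centroids[idx]))
            clusters[j].append(p)
        # Step 3: move each centroid to the mean of its set
        new = [tuple(sum(c) / len(S) for c in zip(*S)) if S else mu
               for S, mu in zip(clusters, centroids)]
        if new == centroids:               # nothing moved: stop
            break
        centroids = new
    return centroids, clusters

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, clusters = lloyd(points, [(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```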

45
Q

[Clustering] How does one initialize the centroids for K-Means?

A
  1. Take one point at random in the dataset and place the first centroid u1 there.
  2. Make the second centroid u2 the point in the dataset furthest from u1.
  3. Repeat for the rest of the centroids: each new centroid is the point in the dataset furthest from the centroids already chosen.
46
Q

[Clustering] How does one make sure the selection of K is the right one in K-Means?

A

Plotting Ein against K gives a curve with a “knee”: Ein drops quickly at first and then flattens out. The selection of K should be close to the “kneecap”.

47
Q

[Clustering] What’s the name of the method used to cluster without the need to select a K? How does it work?

A

Hierarchical Clustering. We start from N clusters (N being the number of points) and merge clusters step by step until a single cluster remains, recording the error (distance) between the merged clusters after each iteration. A dendrogram is used to visualize how this error changes with the number of clusters. Normally there is a step at which a large jump in distance can be seen from one level to the next, which indicates that natural clusters have been found, and we can stop the algorithm there.

48
Q

[Data Preprocessing] What are the 5 main data pre-processing techniques?

A
  1. Input Centering
  2. Input Normalization
  3. Input Whitening
  4. Data Cleaning
  5. Dimensionality Reduction
49
Q

[Data Preprocessing] What’s the goal of Input Centering?

A

The goal is to remove the bias in X by translating the origin of the dataset to be exactly in the middle of the datapoints.

50
Q

[Data Preprocessing] How is Input Centering Done?

A

1st: find the Input Bias:
x_bar = (1/N) sum_{i=1}^{N} x(i)
2nd: calculate the new points:
z(i) = x(i) - x_bar
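
A minimal sketch of the two steps, with the dataset stored as a list of rows (one row per data point). The function name and the data are made up.

```python
def center(X):
    n = len(X)
    x_bar = [sum(col) / n for col in zip(*X)]                   # input bias, one value per dimension
    Z = [[xj - mj for xj, mj in zip(row, x_bar)] for row in X]  # z(i) = x(i) - x_bar
    return Z, x_bar

X = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
Z, x_bar = center(X)
print(x_bar)
print(Z)
```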

51
Q

[Data Preprocessing] What is the goal of Input Normalization?

A

To scale the input X (having centered X beforehand) so that every dimension has the same spread. Basically, it compresses or stretches the dataset along each dimension.

52
Q

[Data Preprocessing] How is Input Normalization done?

A

1st: obtain the variance along each dimension j (x is already centered):
sigma_j^2 = (1/N) sum_{i=1}^{N} x_j(i)^2
2nd: update the values of the datapoints:
z_j(i) = x_j(i) / sigma_j

53
Q

[Data Preprocessing] What’s the goal of Input Whitening?

A

The goal is to decorrelate the inputs, that is, to remove any shape in the data cloud that makes one input dimension correlated with another.

54
Q

[Data Preprocessing] How is Input Whitening done?

A

1st: calculate the Empirical Covariance Matrix:
A = (1/N) sum_{i=1}^{N} x(i) x(i)^T = (1/N) X^T X
2nd: update the data:
z(i) = A^{-1/2} x(i)

55
Q

[Data Preprocessing] What is the goal of Data Cleaning?

A

The goal is to remove outliers in the original data.

56
Q

[Data Preprocessing] What are the two different categories of Dimensionality Reduction methods?

A

Symmetry and Intensity (Physical Insight)

Data-Driven Tools

57
Q

[Data Preprocessing] What are the general steps to use the PCA algorithm? and what does PCA stand for?

A

PCA-Principal Component Analysis
1. Input a data matrix X and the desired number of dimensions k.
2. Compute the Singular Value Decomposition (SVD) of matrix X: [U, T, V] = svd(X)
3. Let V_k be the first k columns of V.
4. PCA feature matrix: Z = X V_k

58
Q

[Data Preprocessing] What is the Goal of Principal Component Analysis?

A

The goal of PCA is to represent the input features in a lower-dimensional space while keeping the important information. (It’s unsupervised.) In a 2D case, this means representing each datapoint (x, y) with a single coordinate.

59
Q

[Stochastic Processes] What’s a Stochastic Process?

A

An infinite sequence of random variables, all defined on the same probability space. In general, the value at a given time depends not only on the current input xt, but also on past values of x and y.

60
Q

[Stochastic Processes] What’s a Realization of a stochastic process?

A

The sequence of values obtained from one particular run of the experiment. Different experiments give different realizations, and therefore different values at any single time, even when the same initial parameters are used.

61
Q

[Stochastic Processes] What’s the probabilistic distribution of a stochastic process at a fixed time?

A

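The Probability Density Function (PDF). At a fixed time the process is a single random variable, and the PDF describes how likely each value of y is. It is normally a Gaussian distribution, where the value of y is most likely near the mean, but not necessarily.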

62
Q

[Stochastic Processes] What’s the Wide-Sense Characterization of a S.P.?

A

It’s a simplified representation of the process, in which it is described using only its mean and covariance functions.

63
Q

[Stochastic Processes] How does one get the Mean of a S.P.?

A

m(t) = E_s[x(s,t)] = integral over the whole probability space of x(s,t) pdf(s) ds

64
Q

[Stochastic Processes] How does one get the Covariance of a S.P.?

A

gamma(t1,t2) = E_s{ [x(s,t1) - m(t1)] [x(s,t2) - m(t2)] }
If t1 = t2, we get the variance of x at t1.
If t1 != t2, gamma tells us how much x(t1) and x(t2) are linearly correlated.

65
Q

[Stochastic Processes] What’s a Stationary Stochastic Process?

A

Also SSP. A S.P. is stationary if:
1. The mean is constant over all time.
2. gamma(t1,t2) depends only on tau = t2 - t1, and therefore the covariance can be defined with a single variable tau: gamma(t1,t2) = gamma(tau). The correlation between values tau apart is always the same, and so the variance gamma(0) is the same for any t.

66
Q

[Stochastic Processes] What are the properties of the covariance for a SSP?

A

The covariance is given by:
gamma(tau) = E{ (x(s,t) - m) (x(s,t-tau) - m) }
1. Non-Negativity: gamma(0) = E{ (x(s,t) - m)^2 } >= 0
2. Variance prevalence: |gamma(tau)| <= gamma(0) for all tau
3. Symmetry: gamma(tau) = gamma(-tau) for all tau

67
Q

[Stochastic Processes] When is SSP considered White Noise?

A

e(s,t) ~ White Noise WN(mu, lambda^2) if:
1. Mean = mu: E[e(s,t)] = mu
2. Variance = lambda^2: E[(e(s,t) - mu)^2] = lambda^2
3. Covariance = 0 for any tau except 0:
gamma(tau) = E[(e(s,t) - mu)(e(s,t-tau) - mu)] = 0 for tau != 0

68
Q

[Frequency-Domain Representations] What is the Spectrum of a signal?

A

It represents the range of frequencies that make up a signal, showing how much of each frequency is present (amplitude and phase spectra). The Fourier Series and the Fourier Transform are used to find these in deterministic settings.

69
Q

[Frequency-Domain Representations] How does one find the spectrum of a stochastic process?

A

By using the Spectral Density of the process, also called its Spectrum. It is defined as the Fourier transform of the covariance function:
GAMMA(w) = sum_{tau=-∞}^{+∞} gamma(tau) e^{-j w tau}

70
Q

[Frequency-Domain Representations] What is the Spectrum of a White Noise?

A

lambda^2, because gamma(tau) only has the value lambda^2 at tau = 0 and is 0 everywhere else, therefore:
GAMMA(w) = sum_{tau=-∞}^{+∞} gamma(tau) e^{-j w tau} = lambda^2 e^0 = lambda^2

71
Q

[Frequency-Domain Representations] Describe some properties of the Spectrum GAMMA(w)

A
  1. Realness: The Imaginary part is 0 for any real w
  2. Non-Negativity: GAMMA(w) >= 0 for any real w
  3. Symmetry: GAMMA(w) = GAMMA(-w) for any real w.
  4. Periodicity (of period 2pi): GAMMA(w) = GAMMA(w + 2 pi k), for any real w and any integer k.
72
Q

[Frequency-Domain Representations] What’s the formula for getting the covariance function from a Spectral density function?

A

gamma(tau) = 1/(2 pi) integral_{-pi}^{pi} GAMMA(w) e^{j w tau} dw
For tau = 0 (the variance), gamma(0) is the area under GAMMA’s curve between -pi and pi, divided by 2 pi.

73
Q

[Frequency-Domain Representations] What are the two options to describe a SSP?

A
  1. mean and covariance function
  2. mean and spectral density function.
74
Q

[Stochastic Processes] How is the Sample mean calculated for a Stationary Stochastic Process?

A

m_hat = (1/N) sum_{t=1}^{N} x(s_bar, t)

75
Q

[Stochastic Processes] How is the Sample Covariance Function calculated for a Stationary Stochastic Process?

A

Given N Datapoints with zero mean:
gamma_hat(tau) = 1/(N - |tau|) sum_{t=1}^{N-|tau|} x(s_bar, t) x(s_bar, t + |tau|)
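
A direct translation of the estimator into plain Python, for a realization stored as a list (so index 0 corresponds to t = 1). The data are a made-up zero-mean sequence, and the sketch assumes |tau| < N.

```python
def sample_covariance(x, tau):
    """gamma_hat(tau) for a zero-mean realization x of length N, with |tau| < N."""
    n, tau = len(x), abs(tau)
    return sum(x[t] * x[t + tau] for t in range(n - tau)) / (n - tau)

x = [1.0, -1.0, 1.0, -1.0]            # made-up zero-mean realization
print(sample_covariance(x, 0))        # sample variance
print(sample_covariance(x, 1))
print(sample_covariance(x, -1))       # same as tau = 1, by symmetry
```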

76
Q

[Stochastic Processes] How is the Sample Spectral Density calculated for a Stationary Stochastic Process?

A

Given N Datapoints with zero mean:
GAMMA_hat_N(w) = sum_{tau=-(N-1)}^{N-1} gamma_hat(tau) e^{-j w tau}
where gamma_hat(tau) is the sample covariance function.

77
Q

[Stochastic Processes] How is the Sample Spectral Density calculated for a Stationary Stochastic Process Directly?

A

Using FFT (Fast Fourier Transform):
GAMMA_hat'_N(w) = (1/N) | sum_{t=1}^{N} x(t) e^{-j w t} |^2

78
Q

[Stochastic Processes] How is the Sample Spectral Density calculated to make it more consistent?

A

By using Averaging:
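By using Averaging: the data are split into segments (3 segments of N/3 points each in this example), the sample spectral density is computed on each segment, and the results are averaged:
GAMMA_hat*_N(w) = (1/3) sum_{i=1}^{3} GAMMA_hat^{(i)}_{N/3}(w)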

79
Q

[Stochastic Processes] What is the standard procedure to analyse a non-stationary SP x(s,t)?

A

1. Divide x(s,t) into x_SSP(s,t) + v(t).
2. Estimate the nonstationary component v_hat(t) from the data.
3. Remove the deterministic part: x_hat_SSP = x - v_hat
4. Work with x_hat_SSP, getting m_hat and gamma_hat(tau).
5. Work with x: m_hat_x(t) = m_hat + v_hat(t)

80
Q

[Stochastic Processes] How does one analyse a non-stationary linear trend?

A

Given N Data-points:
1. x(t) = xSSP(t) + linear trend v(t)
2. v(t) = k t + q, where q includes any possible non-zero mean of the process.
3. Choose v-hat(t) by using OLS:
[k_hat q_hat]^T = [sum(t^2) sum(t); sum(t) N]^{-1} [sum(t x(t)); sum(x(t))], where every sum runs over t = 1, …, N
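
The 2x2 system in step 3 can be solved in closed form. The sketch below does that in plain Python, with t running from 1 to N; the function name and the data are made up.

```python
def fit_linear_trend(x):
    """Return (k_hat, q_hat) for v(t) = k t + q, with t = 1..N."""
    n = len(x)
    ts = range(1, n + 1)
    st, stt = sum(ts), sum(t * t for t in ts)
    sx, stx = sum(x), sum(t * xt for t, xt in zip(ts, x))
    det = stt * n - st * st                 # determinant of [sum(t^2) sum(t); sum(t) N]
    k_hat = (n * stx - st * sx) / det
    q_hat = (stt * sx - st * stx) / det
    return k_hat, q_hat

x = [2.5, 3.0, 3.5, 4.0]                    # made-up data on the line 0.5 t + 2
k_hat, q_hat = fit_linear_trend(x)
x_ssp = [xt - (k_hat * t + q_hat) for t, xt in enumerate(x, start=1)]   # detrended data
print(k_hat, q_hat, x_ssp)
```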

81
Q

[Stochastic Processes] How does one analyse a non-stationary seasonal trend?

A

Given N Data-points:
1. x(t) = xSSP(t) + seasonal trend v(t)
2. v(t) = v(t + kT) where T is the period of seasonality and is known.
3. To choose v-hat(t):
v_hat(t) = (1/M) sum_{h=0}^{M-1} x(t + hT), where M is the number of periods, t = 1, 2, …, T, and M T <= N.
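
A short sketch of this average in plain Python. The list index 0 corresponds to t = 1, M is taken as the number of complete periods in the data, and the data are made up.

```python
def seasonal_profile(x, T):
    """v_hat(t) for t = 1..T: the average of x(t + hT) over the M complete periods."""
    M = len(x) // T                          # number of complete periods, so M*T <= N
    return [sum(x[t + h * T] for h in range(M)) / M for t in range(T)]

x = [1.0, 2.0, 3.0, 1.2, 2.2, 3.2, 0.8, 1.8, 2.8]   # made-up data with period T = 3
v_hat = seasonal_profile(x, T=3)
x_ssp = [xt - v_hat[i % 3] for i, xt in enumerate(x)]   # data with the seasonal part removed
print(v_hat)
```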