Midterm Flashcards

1
Q

When would boosting overfit?

A

If the weak learners overfit! A combination of learners that all overfit can only overfit as well.

The other case is when there is pink noise (uniform noise).

2
Q

How is SVM like KNN?

A

SVM is like KNN, except you've already done the work of figuring out which points matter: only the local points (the support vectors) matter, so you don't need all of the points.
KNN -> points that are close together ought to have similar labels

3
Q

What is a notion of similarity in SVM?

A

In the quadratic programming formulation there is a term x_i^T x_j, the dot product of the two vectors. It tells you whether they are orthogonal (0), point in the same direction (large positive), or point in opposite directions (large negative). The dot product is a form of similarity.

4
Q

What is the kernel trick?

A

K(x_i, x_j) is your similarity metric, and it's where you inject domain knowledge. It acts as a dot product in a higher-dimensional space, where your data may become linearly separable.

You don't actually have to transform your data into that higher-dimensional space, though; that is the kernel trick.

The kernel should really capture your domain knowledge.

The datapoints (x_i, x_j) can be continuous or discrete!
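
A minimal numeric sketch of this (my own Python example; the polynomial kernel K(x, y) = (x . y)^2 and its feature map are illustrative choices, not from the notes):

import numpy as np

def phi(v):
    # explicit feature map R^2 -> R^3 whose dot product equals (x . y)^2
    return np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])
print(phi(x) @ phi(y))   # dot product in the projected space: 16.0 (up to rounding)
print((x @ y) ** 2)      # same value, computed in the original 2-D space: 16.0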

5
Q

What is an SVM kernel?

A

The kernel function is a function that can be represented as the dot product of a mapping function applied to the two inputs: K(x_i, x_j) = phi(x_i) . phi(x_j) (the kernel method).

6
Q

What is the Mercer Condition?

A

"It acts like a distance or a similarity," meaning the kernel is a well-behaved similarity function (equivalently, its kernel matrix is positive semi-definite).

7
Q

How is boosting related to SVM?

A

As you add more learners, your error doesn’t necessarily change but you increase your confidence. This is like the margin in SVM.

8
Q

Why do you want large margins in SVM?

A

Large margins mean the ability to generalize better and avoid overfitting.

9
Q

What is Version space in ML?

A

Set of all hypotheses that are consistent with the data.

10
Q

What are mistake bounds?

A

How many misclassifications can a learner make over an infinite run.

1) Assume each variable appears both positive AND negated (A && NOT A && B && NOT B && ...)
2) Given an input, compute the output: you guess FALSE
3) The first time you're wrong, keep the bit pattern in your positive/negated table
4) Every time you're wrong thereafter, set all positive variables that were 0 to absent and all negated variables that were 1 to absent; go to 2

The idea is that whenever you see a true response, any bits that are flipped relative to the previous true response must be absent, because they don't contribute to the output.

You will make at most k + 1 mistakes (k being the number of variables).
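
A runnable sketch of the procedure above (my own code; the target concept and examples are made up). The learner keeps positive/negated literal tables and only ever errs by guessing FALSE:

def learn(examples, k):
    positive = set(range(k))   # variables still assumed to appear as x_i
    negated = set(range(k))    # variables still assumed to appear as NOT x_i
    mistakes = 0
    for x, label in examples:
        guess = all(x[i] for i in positive) and all(not x[i] for i in negated)
        if guess != label:
            mistakes += 1
            positive -= {i for i in range(k) if not x[i]}  # positive literals that were 0 -> absent
            negated -= {i for i in range(k) if x[i]}       # negated literals that were 1 -> absent
    return positive, negated, mistakes

# target concept: x0 AND NOT x2 (k = 3)
data = [((0, 0, 1), False), ((1, 1, 0), True), ((1, 0, 0), True), ((0, 1, 0), False)]
print(learn(data, 3))   # -> ({0}, {2}, 2): at most k + 1 mistakes on any run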

11
Q

What is computational complexity?

A

How much computational effort is needed for a learner to converge.

12
Q

Why is boosting resistant to overfitting?

A

1) It keeps reweighting hard examples, pushing low-confidence predictions toward high confidence
2) It maximizes the margin, and large margins mean better generalization, hence less overfitting

14
Q

What is the training error vs True error of a hypothesis?

A

Training error -> the fraction of training examples misclassified by h
True error -> the fraction of examples that would be misclassified on a sample drawn from the distribution D

18
Q

What is Haussler’s theorem?

A

A way of bounding the true error as a function of the number of training examples:
m >= (1/epsilon) * (ln |H| + ln(1/delta))
(delta is the failure probability)
(epsilon is the error parameter)
One concern: the number of samples depends on the size of the hypothesis space (what happens if you have a large, or infinite, hypothesis space?).
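
Plugging in some illustrative numbers (my own example) to see the bound in action:

import math
epsilon, delta, H_size = 0.1, 0.05, 1000
m = (1 / epsilon) * (math.log(H_size) + math.log(1 / delta))
print(math.ceil(m))   # 100 samples suffice for these settings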

19
Q

What are VC dimensions?

A

“What kind of data can this classifier classify?”. This fundamental question can be answered using a classifier’s VC dimension.
“the cardinality (size) of the largest set of points that the classification algorithm can shatter”

20
Q

How do you find the VC dimension?

A

You put some number of points down and then no matter how you label those points, you must be able to separate them.

VC dimension is often the number of parameters!
For a d-dimensional hyperplane, VC = d + 1:
1-D threshold: VC = 1 (theta)
interval: VC = 2 (a, b)
2-D line: VC = 3 (w in R^2, theta)
3-D plane: VC = 4
21
Q

How do VC dimensions relate to PAC?

A

H PAC-learnable if and only if VC dimension is finite. You can’t PAC learn it if the VC dimension is infinite.

22
Q

What is the VC of finite H space?

A
There is a log relation between the size of a finite hypothesis class and its VC dimension:
d = VC(H) <= log_2 |H|
23
Q

How does the VC dimension relate to sample complexity?

A

m >= (1/epsilon) * (8 * VC(H) * log_2(13/epsilon) + 4 * log_2(2/delta)) -> infinite case

Like Haussler's theorem, except instead of ln |H| you use VC(H), with different constants.

The point is that you can calculate a minimum number of samples to be epsilon-exhausted if VC(H) is finite.
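
The same kind of plug-in computation with the VC-based bound (my own example numbers; a 2-D linear separator has VC = 3):

import math
epsilon, delta, vc = 0.1, 0.05, 3
m = (1 / epsilon) * (8 * vc * math.log2(13 / epsilon) + 4 * math.log2(2 / delta))
print(math.ceil(m))   # 1899: much larger than the finite-|H| bound above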

24
Q

What’s Bayes Rule?

A

Pr(h|D) = Pr(D|h) * Pr(h) / Pr(D)

Pr(D) -> prior on the data (a normalizing term)
Pr(D|h) -> the likelihood (easy to think about)
Pr(h) -> prior on h (our domain knowledge)

It allows you to swap what's on either side of the conditioning bar.
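
A small worked example (standard disease-test numbers, chosen by me for illustration):

p_h = 0.01              # Pr(h): prior probability of the disease
p_d_given_h = 0.9       # Pr(D|h): test sensitivity
p_d_given_not_h = 0.05  # false-positive rate
p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)  # Pr(D), by marginalization
print(p_d_given_h * p_h / p_d)  # Pr(h|D) ~= 0.154: a positive test is still probably a false alarm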

25
Q

Describe Bayesian learning

A
Learn the best (most probable) hypothesis given the data and some domain knowledge.
For each h in H:
  calculate Pr(h|D)
Output:
  h = argmax_h Pr(h|D)
26
Q

What’s the MAP hypothesis?

A

Maximum a posteriori: h_MAP = argmax_h Pr(h|D)

27
Q

What is the ML hypothesis?

A

h_ML = argmax_h Pr(D|h); we drop Pr(h) because all hypotheses are assumed equally likely (a uniform prior).

28
Q

How does Bayesian learning and information theory fit together?

A

Transform h_MAP = argmax_h P(D|h) * P(h) using lg:

argmin_h [-lg P(D|h) - lg P(h)]

where each of these terms represents a length, a la information theory.

So -lg P(h) is like the length of your hypothesis ('size' of h), and -lg P(D|h) is a notion of misclassification or 'error'.

That is, it minimizes error using the simplest hypothesis. There is usually a tradeoff, though.

This is called minimum description length.

29
Q

How does Bayesian classification work?

A

A weighted vote over every h in H, weighted by Pr(h|D).

Finding the value (label) is what we care about, not the hypothesis itself.

30
Q

What is conditional independence?

A

X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z; that is, if

P(X=x | Y=y, Z=z) = P(X=x | Z=z)

31
Q

What is normal independence?

A

Pr(x, y) = Pr(x) * Pr(y)

“The joint distribution between two variables is equal to the product of their marginals” - CL

32
Q

What are belief networks?

A

Represents the conditional independence relationships between all the variables in the joint distribution graphically. Nodes -> variables. Edges -> dependencies.

33
Q

How do you recover the joint distribution from a belief network?

A

You multiply the conditional probabilities of the entire graph back together:
Pr(A,B,C,D,E) = Pr(A) * Pr(B) * Pr(C|A,B) * Pr(D|B,C) * Pr(E|C,D)

34
Q

Why sample in Bayesian Learning?

A

Why have a distribution?

1) to get the probability of a value
2) to generate values according to that distribution

Why sample in Bayesian learning?

1) simulation of a complex process
2) approximate inference (exact inference is hard)
3) visualization

35
Q

What is marginalization (stats)?

A

P(x) = sum_y P(x, y)

36
Q

What is Naïve Bayes?

A

Naïve Bayes classifiers are probabilistic classifiers based on applying Bayes' theorem with strong independence assumptions between the features (e.g., spam email classification).

A special case of a belief network where you've got a root node (the label) producing many different attribute values. Each attribute has no children and has the root node as its single parent.

P(V | a1, a2, ..., an) = prod_i P(a_i | V) * P(V) / Z

MAP class = argmax_V P(V) * prod_i P(a_i | V) -> the most likely class given the data that we've seen.

Naive Bayes can handle missing attributes.
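
A toy sketch of the computation (my own made-up spam numbers; since we only compare classes, Z is dropped):

p_attr_given_spam = {"free": 0.8, "meeting": 0.1}  # P(attribute present | spam)
p_attr_given_ham = {"free": 0.1, "meeting": 0.7}   # P(attribute present | ham)
p_spam = 0.4                                       # class prior P(V = spam)

def score(email, p_attr, prior):
    s = prior
    for attr, p in p_attr.items():
        s *= p if email[attr] else (1 - p)         # prod_i P(a_i | V)
    return s

email = {"free": True, "meeting": False}
spam = score(email, p_attr_given_spam, p_spam)
ham = score(email, p_attr_given_ham, 1 - p_spam)
print("spam" if spam > ham else "ham")             # prints "spam" (0.288 vs 0.018)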

37
Q

Why is Naïve Bayes powerful?

A

1) With the naïve structure, inference is cheap
2) Few parameters (the number of parameters you need to estimate is linear in the number of attributes)
3) Easy to estimate the parameters from labeled data
4) Connects inference and classification
5) Empirically successful
6) Can do inference in both directions

38
Q

Why is Naïve Bayes… naïve?

A

No free lunch.
The network doesn't represent the world most of the time, because it says all the attributes are conditionally independent given that you know the label (root node -> attributes).

It doesn't model interrelationships between attributes.

It's OK because we don't care about the absolute probabilities that come out of it, only that we get the correct label (so the ordering of the probabilities is what matters).

39
Q

Describe random restart hill climbing

A
  • Guess some value, x
  • Check both sides of your current position
  • Move to the highest of your neighboring points
  • Repeat until both of your neighbors are lower than you
  • Once you’ve reached your local maximum, restart in some other random position and repeat
  • Do this some number of iterations and output the best result
  • In general this is an exploitation algorithm because it’s always going uphill. Maybe we could call the random restarts the exploration.

It's a way of dealing with local optima: once a local optimum is reached, try again starting from a randomly chosen x (see the sketch below).
1) Not much more expensive.
2) Multiple tries to find a good starting place.
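
A minimal sketch (my own code; the function and restart count are arbitrary choices):

import random

def random_restart_hill_climb(f, xs, restarts=20):
    best = None
    for _ in range(restarts):
        x = random.randrange(len(xs))                    # random starting position
        while True:
            neighbors = [n for n in (x - 1, x + 1) if 0 <= n < len(xs)]
            up = max(neighbors, key=lambda n: f(xs[n]))  # check both sides
            if f(xs[up]) <= f(xs[x]):
                break                                    # local maximum reached
            x = up                                       # move to the higher neighbor
        if best is None or f(xs[x]) > f(xs[best]):
            best = x                                     # keep the best of all restarts
    return xs[best]

xs = [i / 100 for i in range(-300, 301)]
f = lambda x: -x**4 + 3 * x**2 + x                       # two unequal peaks
print(random_restart_hill_climb(f, xs))                  # ~1.26, the taller peak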

40
Q

When might random hill climbing have trouble finding the solution?

A

If you've got a wide basin of attraction around a local optimum (that is not the global one), then the majority of the time you won't reach the global optimum, especially if the global optimum has a narrow basin of attraction. It may take a very large number of iterations to find it.

41
Q

Describe simulated annealing

A

Don’t always improve - sometimes you need to search.

A random search algorithm that trades off exploration and exploitation by starting off with more exploration and moving toward exploitation over time.

Algorithm:
• Sample a new point, x_t, in the neighborhood N(x)
• Compute the probability of moving to that point as:
P(x, x_t, T):
• 1 if f(x_t) >= f(x) (the neighbor is >= you, so make the move)
• e^((f(x_t) - f(x))/T) otherwise
• Decrease the temperature, T

In the downhill case f(x_t) - f(x) is negative, so you've really got e^(-|difference|/T).

  • When T is large, the exponent goes toward 0 and e^0 = 1, so you will still move downhill.
  • As the temperature decreases, dividing by T magnifies the negative exponent, making the acceptance probability very small.

So you start out with more exploration, accepting many downhill neighbors when T is large, and accept fewer downhill neighbors (choosing mostly to move uphill) as T gets smaller. See the sketch below.
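
A compact sketch (my own code; the neighborhood, cooling schedule, and constants are arbitrary):

import math, random

def anneal(f, x, t=10.0, cooling=0.99, steps=5000):
    for _ in range(steps):
        xt = x + random.uniform(-0.1, 0.1)   # sample x_t in N(x)
        # accept uphill moves always; downhill moves with probability e^((f(xt) - f(x)) / T)
        if f(xt) >= f(x) or random.random() < math.exp((f(xt) - f(x)) / t):
            x = xt
        t *= cooling                         # decrease the temperature
    return x

f = lambda x: -x**4 + 3 * x**2 + x
print(anneal(f, x=-2.0))                     # usually escapes to the taller peak near 1.26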

42
Q

What are the properties of simulated annealing?

A

T -> 0: acts like ordinary hill climbing
T -> infinity: acts like a random walk
Decrease T slowly.
The probability of being in a given state follows the Boltzmann distribution.

43
Q

Describe Genetic Algorithms

A

The population is like a parallel random restart.
Start with an initial population of size k.
Repeat until converged:
* compute the fitness of all x in the population
* select the most fit individuals
* pair up individuals, replacing the least fit via crossover/mutation
(See the sketch below.)
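
A bare-bones sketch (my own code) on the "one-max" problem, where fitness is simply the number of 1-bits:

import random

def ga(bits=20, pop_size=30, gens=50, p_mut=0.01):
    fitness = sum                                  # one-max: count the 1s
    pop = [[random.randint(0, 1) for _ in range(bits)] for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[:pop_size // 2]            # the most fit half survives
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)     # pair up individuals
            cut = random.randrange(1, bits)        # one-point crossover
            child = a[:cut] + b[cut:]
            child = [bit ^ (random.random() < p_mut) for bit in child]  # mutation
            children.append(child)
        pop = survivors + children                 # the least fit are replaced
    return max(pop, key=fitness)

print(ga())   # typically all (or nearly all) 1s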

44
Q

What is one point crossover (GA’s)?

A

1) The locality of the bits matters.
2) It assumes there are subparts of the space that can be independently optimized.

A method of generating 'children' from 'parents' where one child takes the first n bits from one parent and the remaining (length - n) bits from the other parent; the second child takes the opposite pieces.

This does imply some kind of bias about the locality of the bits: that some bit groups should stay together.

45
Q

What is Uniform Crossover (GA’s)?

A

Randomize at each bit position: each bit of the child is drawn independently from one parent or the other.

46
Q

Describe the MIMIC algorithm

A
  • Directly model the distribution
  • Successively refine the model
  • Generate samples from P^theta_t(x) (starts out uniform)
  • Set theta_{t+1} to the n-th percentile (fittest)
  • Retain only those samples such that f(x) >= theta_{t+1}
  • Estimate the new distribution, P^theta_{t+1}(x)
  • Repeat

You’re taking theta_min toward theta_max over time.

It’s like the water rising around a bunch of mountainous islands. The peaks remain visible and the water keeps rising until the peak(s) become single point(s) (if equal height peaks).
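
A deliberately simplified sketch of the loop (my own code). Real MIMIC estimates P^theta(x) with a dependency tree; to keep this short I model each bit independently, which preserves the rising-percentile structure but not the tree part:

import random

def mimic_like(bits=20, n=200, keep=0.5, iters=30):
    f = sum                                       # one-max fitness again
    p = [0.5] * bits                              # theta_0: start from a uniform model
    for _ in range(iters):
        samples = [[int(random.random() < pi) for pi in p] for _ in range(n)]
        samples.sort(key=f, reverse=True)
        fittest = samples[:int(n * keep)]         # retain samples above the percentile
        p = [sum(s[i] for s in fittest) / len(fittest) for i in range(bits)]  # refit the model
    return p

print([round(pi, 2) for pi in mimic_like()])      # bit probabilities drift toward 1.0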

47
Q

How is MIMIC like genetic algorithms?

A
  • generating samples from P^theta_t(x): this is like your population
  • keeping the n-th percentile of samples with f(x) >= theta_{t+1}: this is like culling the population in a GA
48
Q

What are dependency trees and how are they used in MIMIC?

A

A belief network where each node in the tree has exactly one parent (except for the root). Each variable then depends on exactly one other variable.

They are used to estimate the distribution P^theta_t(x). By using dependency trees we are assuming each variable depends on at most one other variable.

Typically we would have:
• P(x) = P(x1 | x2, ..., xn) * P(x2 | x3, ..., xn) * ... * P(xn)

With dependency trees, it becomes:
• P_pi(x) = prod_i P(x_i | pi(x_i))

where pi(x_i) is the parent of x_i.

49
Q

Does MIMIC require the use of dependency trees?

A

No - using dependency trees is just the easiest thing to do while still allowing you to represent dependent relationships. You could use bayesian networks or unconditional probability distributions, etc.

50
Q

What is joint entropy?

A

H(X, Y) = -sum over x,y of P(x, y) lg P(x, y): the entropy of the joint distribution of X and Y, i.e., the expected number of bits needed to describe the pair (X, Y).

52
Q

What is the conditional and joint entropy if the variables are independent?

A

If X and Y are independent, knowing one tells you nothing about the other, so the conditional entropy H(Y|X) = H(Y) and the joint entropy H(X, Y) = H(X) + H(Y).

56
Q

What is conditional entropy?

A

The conditional entropy quantifies the amount of information needed to describe the outcome of a random variable given that the value of another random variable is known: H(Y|X) = -sum over x,y of P(x, y) lg P(y|x).

57
Q

Which hypothesis spaces are infinite?

A

1) Linear separators (there are infinitely many lines)
2) ANNs (infinitely many weight settings)
3) Decision trees with continuous inputs

Not infinite: decision trees with discrete inputs.

58
Q

What is entropy?

A

Entropy is a measure of disorder or uncertainty.

59
Q

What is the VC dimension of convex polygons? (One edge counts as inside.)

A

Infinite, because the number of parameters (vertices) is unbounded: put the points on a circle and, for any labeling, take the positively labeled points as the vertices of the polygon.

61
Q

Give the VC dimension of these hypothesis spaces, briefly explaining your answers:
1. An origin-centered circle (2D)

A

Let us start by considering two points. Two points can be labelled in 4 possible ways, and we can always find a circle that shatters them. So the VC dimension is at least 2.

Now consider three points. If the nearest and furthest points have the same label and the remaining point has the opposite label, you can't draw a circle to shatter this instance space. So the VC dimension is 2.

62
Q

Give the VC dimension of these hypothesis spaces, briefly explaining your answers:
1. An origin-centered sphere (3D)

A

We use the same argument as above to prove that the VC dimension is 2.
(Will our answers change if the circle and sphere was not origin centered? Hint: Yes.)

64
Q

Minimum Description Length

A

Tradeoff between size of h and error

65
Q

Optimization Approach

A

Generate & test -> small input space, complex function
Calculus -> function has a derivative and setting it = 0 is solvable
Newton's method -> function has a derivative, iteratively improve; works if you have a single optimum

Randomized optimization -> big input space, complex function, no derivative (or hard to find), many local optima

66
Q

How is boosting like other algorithms?

A

Like KNN, because it takes simple hypotheses and makes a more complex one (locally weighted regression in KNN).

A weighted combination of things is like neural networks.

67
Q

Why is Boosting non-linear?

A

Because you pass it through the sgn function at the end which is non-linear.

68
Q

Why does boosting tend to do well? What’s the intuition?

A

You're using weak learners that must do better than chance (get more than 50% of the classifications correct). At each step you re-distribute the weights so that samples the previous learner got wrong are more likely to be drawn. So the next weak learner focuses more on the problems the previous learner got wrong, and it must still get more than 50% of them correct. This means you should always be improving.

69
Q

How do SVM’s work?

A

The goal is to maximize the margin between the data subject to being able to properly classify the data on either side of that decision boundary.

Maximizing the margin 2/||w|| is equivalent to minimizing (1/2) * ||w||^2, where w is a vector containing your hyperplane parameters.

Ultimately this turns into a quadratic programming problem, and once you solve it you can recover w as:

w = sum(alpha_i * y_i * x_i), but many alphas are 0 so most of your data doesn’t matter, only a few of the x_i’s do. The data that IS left are your support vectors.

70
Q

How is SVM like KNN?

A

Only the local points matter (near the decision boundary). Also like KNN except you’ve already done the work of figuring out which points matter and you can throw away the rest.

Like instance based learning except you’ve already put some energy into figuring out which points matter.

71
Q

What’s the big O notation of XOR vs OR?

A

XOR is exponential: O(2^n), while OR is linear.

72
Q

What is restriction bias?

A

When the set of possible hypotheses considered becomes restricted to a smaller subset.

73
Q

What is instance based learning?

A

Lazy learner - store all of the training data. Don’t fit or do anything like that. Computation is done during prediction instead of fitting.

74
Q

How does regression work for KNN?

A

Take the k nearest neighbors of the query point and output the (possibly distance-weighted) average of their target values, instead of a majority vote.

75
Q

How do you get a ‘not’ behavior from a perceptron?

A

Set the weight to negative (or the threshold)!

76
Q

K-means algorithm

A

Randomly initialize K cluster centroids
Repeat until convergence {
1) Cluster assignment step -> for each point x, set c = argmin_j || x - mu_j ||^2
2) Move centroids -> for j = 1..K, set mu_j to the average of the points assigned to cluster j
}
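
The same loop in numpy (my own sketch; random init, a fixed iteration count, and no handling of empty clusters):

import numpy as np

def kmeans(X, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]  # random initial centroids
    for _ in range(iters):
        # cluster assignment: index of the nearest centroid for each point
        c = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        # move centroids: mean of the points assigned to each cluster
        mu = np.array([X[c == j].mean(axis=0) for j in range(k)])
    return c, mu

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
c, mu = kmeans(X, k=2)
print(mu)   # roughly (0, 0) and (5, 5)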

77
Q

Choosing the value of k in K-means

A
Elbow method (x-axis -> number of clusters, y-axis -> cost function).
In practice, k is often chosen by hand based on the business domain.
78
Q

PCA and linear regression

A

PCA is not linear regression.
LR -> minimizes the vertical distances between the points and the straight line.
PCA -> minimizes the magnitudes of the shortest (orthogonal) projection distances.
PCA is trying to find a lower-dimensional surface onto which to project the data so as to minimize this squared projection error.

79
Q

PCA Algorithm

A

After mean normalization and optionally feature scaling:

1) Goal: reduce the data from n dimensions to k dimensions
2) Compute the covariance matrix:
Sigma = (1/m) * sum_i (x^(i))(x^(i))^T -> (1/m) * X^T * X
3) Compute the eigenvectors of Sigma:
[U, S, V] = svd(Sigma);
4) U_reduce = U(:, 1:k)
5) Z = U_reduce^T * x
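
The same recipe in numpy (my own translation; note numpy's row-vector convention makes step 5 X @ U_reduce rather than U_reduce^T * x):

import numpy as np

X = np.random.randn(100, 5)         # m = 100 examples, n = 5 features
X = X - X.mean(axis=0)              # 1) mean normalization
sigma = (X.T @ X) / len(X)          # 2) covariance matrix, n x n
U, S, Vt = np.linalg.svd(sigma)     # 3) columns of U are the eigenvectors
k = 2
U_reduce = U[:, :k]                 # 4) keep the top-k directions
Z = X @ U_reduce                    # 5) project: m x k
print(Z.shape)                      # (100, 2)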

80
Q

Choosing k (Number of principal components)

A

1) Average squared projection error: (1/m) * sum || x - x_approx ||^2
2) Total variation in the data: (1/m) * sum || x ||^2

Typically, choose k to be the smallest value such that
(average squared projection error) / (total variation in the data) <= 0.01 (1%),
i.e., 99% of the variance is retained.

81
Q

Algorithm for choosing k in PCA

A
Try PCA with k = 1.
Compute U_reduce, Z, X_approx.
Check whether (average squared projection error) / (total variation) <= 0.01 (1%).
If not, increase k and repeat.

That is inefficient, but fortunately calling
[U, S, V] = svd(Sigma)
also gives us the diagonal matrix S (entries S11, S22, ..., Snn), and the check can be computed directly from it:
1 - (sum_{i=1..k} S_ii) / (sum_{i=1..n} S_ii) <= 0.01

82
Q

Reconstruction from the compressed representation

A

Z = U_reduce^T * x

x_approx = U_reduce * Z

83
Q

Application of PCA

A

1) Compression: choose k so that 99% of the variance is retained
* reduces the memory/disk needed to store the data
* speeds up learning algorithms
2) Visualization (k = 2 or k = 3)

84
Q

Bad use of PCA (Not recommended)

A

Using z instead of x to reduce the number of features to k in order to prevent overfitting.

Reducing dimensionality does cause some information loss (just like compressing an image to JPEG can degrade its quality), so even though it will speed up training, it may make your system perform slightly worse. It also makes your pipelines a bit more complex and thus harder to maintain. So, if training is too slow, you should first try to train your system with the original data before considering using dimensionality reduction. In some cases, reducing the dimensionality of the training data may filter out some noise and unnecessary details and thus result in higher performance, but in general it won’t; it will just speed up training.

85
Q

PCA is sometimes used where it shouldn't be

A

Before starting, you should first ask: how about doing the whole thing without PCA, on the raw data?

86
Q

Anomaly detection

A

p(X) < epsilon -> flag anomaly

p(X) >= epsilon -> ok

87
Q

Anomaly detection example

A
1) Fraud detection
    x = features of user i's activities
    Model p(x) from the data
    Identify unusual users by checking which have p(x) < epsilon
2) Manufacturing
3) Monitoring computers in a data center
88
Q

Gaussian (Normal) distribution

A

Say x ∈ R; x is distributed Gaussian with mean mu and variance sigma^2: x ~ N(mu, sigma^2).

89
Q

Parameter estimation problem

A

We have a dataset whose examples come from a Gaussian distribution with parameters mu and sigma, but we don't know what those parameters are, so we want to estimate them.

mu = (1/m) * sum_i x^(i) (the average of the examples)
sigma^2 = (1/m) * sum_i (x^(i) - mu)^2

These are the maximum likelihood estimates of the parameters mu and sigma^2.

90
Q

Density estimation

A

** The aircraft-engines-coming-off-the-assembly-line example.

Can we model the density of x, p(x)?

We assume these data come from a Gaussian distribution:

p(x) = p(x1; mu_1, sigma_1^2) * p(x2; mu_2, sigma_2^2) * ... * p(xn; mu_n, sigma_n^2)

This is the same as an independence assumption on x1..xn:

p(x) = prod_j p(x_j; mu_j, sigma_j^2)

(prod = product)

91
Q

Anomaly detection Algorithm

A

1) Choose features x_j that you think might be indicative of anomalous examples.
2) Fit the parameters mu_1..mu_n, sigma_1^2..sigma_n^2:

mu_j = (1/m) * sum_i x_j^(i) (the average per feature)
sigma_j^2 = (1/m) * sum_i (x_j^(i) - mu_j)^2

3) Given a new example x, compute p(x) = prod_j p(x_j; mu_j, sigma_j^2);
flag an anomaly if p(x) < epsilon.
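
A sketch of the whole recipe (my own code; the training data, test point, and epsilon are made up):

import numpy as np

X = np.random.randn(500, 3)                       # assumed "normal" training examples
mu = X.mean(axis=0)                               # fit mu_j per feature
sigma = X.std(axis=0)                             # fit sigma_j per feature

def p(x):
    # product of independent per-feature Gaussian densities
    d = np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
    return d.prod()

epsilon = 1e-6
x_new = np.array([4.0, -4.0, 4.0])                # far from the training data
print("anomaly" if p(x_new) < epsilon else "ok")  # prints "anomaly"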

92
Q

Anomaly detection vs supervised learning

A

Anomaly detection -> a very small number of positive examples and a large number of negative ones.
Anomaly detection -> many different types of anomalies; it's hard for an algorithm to learn from positive examples what anomalies look like, and future anomalies may look nothing like any of the anomalous examples we've seen so far.

Supervised learning -> a large number of positive and negative examples.
Supervised learning -> enough positive examples for the algorithm to get a sense of what positive examples are like; future positive examples are likely to be similar to the ones in the training set.

93
Q

Anomaly Detection | Choosing What Features To Use

A
If a feature is non-Gaussian, try transforming it:
log(x)
log(x + 1)
log(x + c)
x^0.5
x^0.33
...until the transformed feature looks Gaussian.
94
Q

Error Analysis for anomaly detection

A
Want p(x) large for normal examples. 
Want p(x) small for anomalous examples. 

Most common problem:
p(x) is comparable (say, both large) for normal and anomalous examples.

95
Q

Anomaly Detection | Multivariate Gaussian Distribution

A
x ∈ R^n: don't model p(x1), p(x2), ..., separately.
Model p(x) all in one go.
Parameters: mu ∈ R^n, Sigma ∈ R^(n x n)
(the covariance matrix)
96
Q

Cost function for K-means

A

J(c, mu) = (1/m) * sum_i || x^(i) - mu_{c^(i)} ||^2
(the mean squared distance from each point to the centroid of its assigned cluster)

97
Q

Does K-means get stuck in local minima?

A

Yes, but you can mitigate that by running it many times (e.g., 100) with different random initializations and keeping the best clustering.

98
Q

EM algorithm

A

1) Start with two randomly placed Gaussians (mu1, sigma1), (mu2, sigma2)
2) E-step: for each point, compute P(b | x) -> how likely is it that the point came from component b?
3) M-step: adjust (mu1, sigma1) and (mu2, sigma2) to fit the points softly assigned to them (the same as K-means, but using probabilities instead of hard assignments)
4) Iterate until convergence (see the sketch below)
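
A 1-D, two-component sketch (my own code; the mixing weights are fixed at 0.5 to keep it short):

import numpy as np

def em(x, iters=50):
    m1, m2 = x.min(), x.max()                # crude initialization
    s1 = s2 = x.std()
    for _ in range(iters):
        # E-step: responsibility that each point came from component 2
        p1 = np.exp(-((x - m1) ** 2) / (2 * s1 ** 2)) / s1
        p2 = np.exp(-((x - m2) ** 2) / (2 * s2 ** 2)) / s2
        w = p2 / (p1 + p2)                   # P(component 2 | x); shared constants cancel
        # M-step: refit each Gaussian to its softly assigned points
        m1, m2 = np.average(x, weights=1 - w), np.average(x, weights=w)
        s1 = np.sqrt(np.average((x - m1) ** 2, weights=1 - w))
        s2 = np.sqrt(np.average((x - m2) ** 2, weights=w))
    return (m1, s1), (m2, s2)

x = np.concatenate([np.random.normal(0, 1, 300), np.random.normal(6, 1, 300)])
print(em(x))   # means close to 0 and 6, stds close to 1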

99
Q

Mixture of Gaussians model

A

What if we don't know the sources, i.e., which point belongs to which source?
We can guess whether each point is more likely to have come from a or b.

100
Q

Expectation Maximization: the chicken-and-egg problem

A

You need mu and sigma to guess the sources of the points;

you need to know the sources to estimate mu and sigma.

101
Q

Properties of EM

A

1) Monotonically non-decreasing likelihood
2) Does not have to converge in theory (in practice it does)
3) Will not diverge (blow up); unlike K-means, there are infinitely many possible configurations
4) Can get stuck in local optima (random restarts to fix)
5) Works with any distribution (if the expectation and maximization steps are solvable)

102
Q

Clustering properties

A

1) Richness
For any assignment of objects to clusters, there is some distance matrix D such that P_D returns that clustering.

2) Scale invariance
Scaling the distances by a positive value does not change the clustering. (The NASA problem.)

3) Consistency
Shrinking intra-cluster distances and expanding inter-cluster distances does not change the clustering.

103
Q

Impossibility theorem

A

No clustering scheme can achieve all three of richness, scale invariance, and consistency.

104
Q

Does the K-Means algorithm behave well when the blobs have very different diameters?

A

No, because all it cares about when assigning an instance to a cluster is the distance to the centroid (hard clustering).

It can be useful to give each instance a score per cluster instead, which is called soft clustering.

105
Q

The computational complexity of K-Means

A

Is generally linear with regard to the number of instances m, the number of clusters k, and the number of dimensions n. However, this is only true when the data has a clustering structure. If it does not, then in the worst-case scenario the complexity can increase exponentially with the number of instances. In practice, this rarely happens, and K-Means is generally one of the fastest clustering algorithms.

106
Q

Is K-means guaranteed to converge?

A

Yes, but it may not converge to the right solution (i.e., it may converge to a local optimum); whether it does or not depends on the centroid initialization.

107
Q

How do you mitigate the risk of a local optimum by improving the centroid initialization for K-means?

A

Run the algorithm multiple times with different random initializations and keep the best solution.

But how exactly does it know which solution is best? It uses a performance metric called the model's inertia: the mean squared distance between each instance and its closest centroid.

108
Q

Is inertia a good performance metric when trying to choose k for K-means?

A

No, because it keeps getting lower as we increase k. Indeed, the more clusters there are, the closer each instance will be to its closest centroid, and therefore the lower the inertia will be.

109
Q

What techniques are there for choosing the best number of clusters?

A

The elbow method, but it is coarse. A more precise approach (but also more computationally expensive) is to use the silhouette score.

An even more informative visualization is obtained when you plot every instance's silhouette coefficient, sorted by the cluster it is assigned to and by the value of the coefficient. This is called a silhouette diagram.

110
Q

Limits of K-means

A

1) It is necessary to run the algorithm several times to avoid suboptimal solutions
2) You need to specify the number of clusters
3) K-Means does not behave very well when the clusters have varying sizes, different densities, or nonspherical shapes
4) K-Means prefers clusters of similar sizes (the ladybug example, until you use 8 clusters)

111
Q

What should you do before running K-means?

A

It is important to scale the input features before you run K-Means, or the clusters may be very stretched and K-Means will perform poorly. Scaling the features does not guarantee that all the clusters will be nice and spherical, but it generally improves things.

112
Q

What are the two types of feature selection algorithms?

A

Filtering and wrapping

113
Q

What is the difference between filtering and wrapping?

A

Filtering:
+ speed
- looks at features in isolation (a feature might only become important in combination with another)
- ignores the learner's feedback

Wrapping:
- very slow, but useful
+ takes the model's (inductive) bias into account
+ takes feedback from the learner

114
Q

Which supervised algorithm can be used as a filter in feature selection?

A

Decision trees: you run the DT as the search algorithm, take the features it selects, and pass them to whatever learner you like.

115
Q

In feature selection, if you use a DT as the filter, why might you not use it as the learner?

A

Maybe the decision tree doesn't do a good job on noisy data, for example.

DTs are very good at knowing which features are important (information gain), which is separate from being the best predictor.

116
Q

What is the difference between strongly relevant and weakly relevant?

A

1) x_i is strongly relevant if removing it degrades the B.O.C.
2) x_i is weakly relevant if:
- it is not strongly relevant, and
- there is a subset of features S such that adding x_i to S improves the B.O.C.
3) Otherwise x_i is irrelevant.

117
Q

Relevance vs. usefulness ?

A

1) Relevance measures the effect on the B.O.C.

2) Usefulness measures the effect on a particular learner.

118
Q

B.O.C ?

A

The Bayes Optimal Classifier: compute the best label, given all the probabilities that you could compute over the hypothesis space.