Midterm Flashcards
When would boosting overfit?
If the weak learners overfit! A bunch of things that overfit, when combined, can only overfit as well.
The other case is when there is pink noise (i.e., uniform noise).
How is SVM like KNN?
SVM is like KNN, except you've already done the work of figuring out which points matter: only the local points (the support vectors) matter, so you don't need all of the points.
KNN -> points that are close together ought to have similar labels
What is a notion of similarity in SVM?
In the quadratic programming formulation, there is a term x_i^T * x_j, the dot product of the two vectors. This tells you whether they are orthogonal (0), pointing in the same direction (large positive), or pointing in opposite directions (large negative). So it is a form of similarity.
What is the kernel trick?
K(x_i, x_j): your similarity metric. It's where you inject domain knowledge. It is a dot product that implicitly projects your data into a higher dimensional space where it can (could?) be linearly separable.
You don't actually have to transform your data into that higher dimensional space, though, because of the kernel trick.
The kernel should really capture your domain knowledge.
The datapoints (x_i, x_j) can be continuous or discrete!
What is an SVM kernel?
The kernel function is a function that can be represented as the dot product of a mapping function applied to the two inputs, K(x_i, x_j) = phi(x_i) · phi(x_j) (the kernel method).
What is the Mercer Condition?
“It acts like a distance or a similarity,” which means it's a well-behaved distance function.
How is boosting related to SVM?
As you add more learners, your error doesn’t necessarily change but you increase your confidence. This is like the margin in SVM.
Why do you want large margins in SVM?
Large margins mean the ability to generalize better and avoid overfitting.
What is Version space in ML?
Set of all hypotheses that are consistent with the data.
What are mistake bounds?
The number of misclassifications a learner can make over an infinite run.
Example: learning a conjunction of literals.
1) Assume each variable appears both positive and negated (A AND NOT A AND B AND NOT B AND …)
2) Given an input, compute the output; guess FALSE
3) The first time you're wrong, keep the bit pattern in your positive/negated table
4) Every time you're wrong thereafter, set all positive variables that were 0 to absent and all negated variables that were 1 to absent; go to 2
The idea is that whenever you see a true response, any bits that are flipped compared to the previous true response must be absent, because they don't contribute to the output.
You will make at most k + 1 mistakes (k being the number of variables).
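A minimal Python sketch of this learner (function and variable names are mine; only the algorithm above comes from the notes):

```python
def learn_conjunction(examples):
    """Mistake-bounded learner for a conjunction of literals.

    examples: iterable of (bits, label) pairs, bits a tuple of 0/1.
    Guess FALSE until the first positive example; keep that bit pattern
    as literals; on each later mistake, drop literals that flipped.
    """
    literals = None   # None = "always guess FALSE" (all 2k literals present)
    mistakes = 0
    for bits, label in examples:
        # Predict TRUE iff every remaining literal is satisfied.
        guess = literals is not None and all(bits[i] == v for i, v in literals.items())
        if guess == label:
            continue
        mistakes += 1
        if literals is None:
            # First mistake: keep the observed bit pattern (x_i if 1, NOT x_i if 0).
            literals = dict(enumerate(bits))
        else:
            # Literals that disagree with this positive example must be absent.
            literals = {i: v for i, v in literals.items() if bits[i] == v}
    return literals, mistakes
```

For k variables this makes at most k + 1 mistakes: one to fix the initial bit pattern, and each later mistake removes at least one literal.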
What is computational complexity?
How much computational effort is needed for a learner to converge.
Why is boosting resistant to overfitting?
1) It keeps re-weighting hard examples, driving low-confidence predictions toward high confidence even after training error stops improving
2) It maximizes the margin, and a large margin means it will be able to generalize well rather than overfit
What is the training error vs True error of a hypothesis?
Training error -> the fraction of training examples misclassified by h.
True error -> the fraction of examples that would be misclassified on a sample drawn from the distribution D.
What is Haussler’s theorem?
A way of bounding the true error as a function of the number of training examples:
m >= (1/epsilon) * (ln |H| + ln(1/delta))
(delta is the failure probability)
(epsilon is the error parameter)
There is a concern: the number of samples depends on the size of the hypothesis space. (What happens if you have a large, or infinite, hypothesis space?)
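A quick worked instance of the bound (the |H|, epsilon, and delta values are invented for illustration):

```python
from math import ceil, log

def haussler_samples(h_size, epsilon, delta):
    """Smallest m with m >= (1/epsilon) * (ln|H| + ln(1/delta))."""
    return ceil((1 / epsilon) * (log(h_size) + log(1 / delta)))

# |H| = 1000, epsilon = 0.1, delta = 0.05:
# (1/0.1) * (ln 1000 + ln 20) ≈ 10 * 9.9 -> 100 samples suffice.
print(haussler_samples(1000, 0.1, 0.05))  # 100
```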
What are VC dimensions?
“What kind of data can this classifier classify?”. This fundamental question can be answered using a classifier’s VC dimension.
“the cardinality (size) of the largest set of points that the classification algorithm can shatter”
How do you find the VC dimension?
You put some number of points down, and then no matter how you label those points, you must be able to separate (shatter) them.
The VC dimension is often the number of parameters! For a d-dimensional hyperplane, VC = d + 1:
- a 1-D threshold (theta) has VC = 1
- an interval (a, b) has VC = 2
- two dimensions (w in R^2, theta) has VC = 3
- three dimensions has VC = 4
How do VC dimensions relate to PAC?
H is PAC-learnable if and only if its VC dimension is finite. You can't PAC-learn it if the VC dimension is infinite.
What is the VC of finite H space?
There is a log relation between the size of a finite hypothesis class and the VC dimension: d = VC(H) <= log_2 |H|
How does the VC dimension relate to sample complexity?
m >= (1/epsilon) * (8 * VC(H) * log_2(13/epsilon) + 4 * log_2(2/delta)) -> the infinite-H case
Like Haussler's theorem, except instead of ln |H| you use VC(H)… and there are several other changes.
The point is that you can calculate a minimum number of samples to be epsilon-exhausted if VC(H) is finite.
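The same kind of computation for the infinite-H case, transcribing the bound above (log base 2; VC(H) = 3 is just the 2-D halfplane example):

```python
from math import ceil, log2

def vc_samples(vc_dim, epsilon, delta):
    """Smallest m with m >= (1/epsilon) * (8*VC(H)*log2(13/epsilon) + 4*log2(2/delta))."""
    return ceil((1 / epsilon) * (8 * vc_dim * log2(13 / epsilon) + 4 * log2(2 / delta)))

print(vc_samples(3, 0.1, 0.05))  # about 1.9k samples for halfplanes in R^2
```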
What’s Bayes Rule?
Pr(h|D) = Pr(D|h) · Pr(h) / Pr(D)
Pr(D) -> prior on the data (a normalizing term)
Pr(D|h) -> easy to think about (how likely the data is under h)
Pr(h) -> prior on h (our domain knowledge)
It allows you to swap what's on the two sides of the bar.
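A toy numeric check of the rule (all numbers invented for illustration):

```python
# Pr(h|D) = Pr(D|h) * Pr(h) / Pr(D), expanding Pr(D) over h and not-h.
p_h = 0.01             # prior on the hypothesis
p_d_given_h = 0.9      # how likely the data is if h is true
p_d_given_not_h = 0.1  # how likely the data is otherwise
p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)
print(p_d_given_h * p_h / p_d)  # ≈ 0.083: strong data, weak prior, modest posterior
```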
Describe Bayesian learning
Learn the best (most probable) hypothesis given the data and some domain knowledge.
For each h ∈ H: calculate Pr(h|D)
Output: h = argmax_h Pr(h|D)
What’s the MAP hypothesis?
Maximum a posteriori: h_MAP = argmax_h Pr(h|D)
What is the ML hypothesis?
h_ML = argmax_h Pr(D|h). We don't care about Pr(h) because all hypotheses are assumed equally likely (a uniform prior).
How does Bayesian learning and information theory fit together?
Transform h_MAP = argmax_h P(D|h) * P(h) using lg:
argmin_h [-lg P(D|h) - lg P(h)]
where each of these terms represents a length, à la information theory.
So -lg P(h) is like the length of your hypothesis (the 'size' of h), and -lg P(D|h) is a notion of misclassification, or 'error'.
I.e., it minimizes error using the simplest hypothesis. There is usually a tradeoff, though.
This is called minimum description length.
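Written out as a math sketch, the same steps (lg is monotone, so it preserves the argmax; negating flips argmax to argmin):

```latex
h_{\mathrm{MAP}} = \arg\max_h P(D \mid h)\,P(h)
                 = \arg\max_h \left[ \lg P(D \mid h) + \lg P(h) \right]
                 = \arg\min_h \left[ -\lg P(D \mid h) - \lg P(h) \right]
```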
How does Bayesian classification work?
A weighted vote over every h ∈ H, weighted by Pr(h|D).
Finding the most likely value (label), rather than the most likely hypothesis, is what we care about.
What is conditional independence?
X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z; that is, if
P(X = x | Y = y, Z = z) = P(X = x | Z = z)
What is normal independence?
Pr(x, y) = Pr(x) · Pr(y)
“The joint distribution between two variables is equal to the product of their marginals” - CL
What are belief networks?
They graphically represent the conditional independence relationships between all the variables in the joint distribution. Nodes -> variables. Edges -> dependencies.
How do you recover the joint distribution from a belief network?
You multiply the probabilities of the entire graph back together:
Pr(A,B,C,D,E) = Pr(A) · Pr(B) · Pr(C|A,B) · Pr(D|B,C) · Pr(E|C,D)
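A minimal sketch of that factorization with made-up conditional probability tables (every number is hypothetical):

```python
# Each node contributes P(node | parents), following
# Pr(A,B,C,D,E) = Pr(A) Pr(B) Pr(C|A,B) Pr(D|B,C) Pr(E|C,D).
p_a = {True: 0.3, False: 0.7}
p_b = {True: 0.6, False: 0.4}
p_c = lambda a, b: 0.9 if (a and b) else 0.2           # Pr(C=True | A, B)
p_d = lambda b, c: 0.5 if b else (0.8 if c else 0.1)   # Pr(D=True | B, C)
p_e = lambda c, d: 0.7 if (c or d) else 0.05           # Pr(E=True | C, D)

def bern(p_true, value):
    """Turn Pr(X=True | parents) into Pr(X=value | parents)."""
    return p_true if value else 1 - p_true

def joint(a, b, c, d, e):
    return (p_a[a] * p_b[b] * bern(p_c(a, b), c)
            * bern(p_d(b, c), d) * bern(p_e(c, d), e))

print(joint(True, True, True, False, True))  # 0.3*0.6*0.9*0.5*0.7 ≈ 0.0567
```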
Why sample in Bayesian Learning?
Why have a distribution at all?
1) You can compute the probability of a value
2) You can generate values according to that distribution
Why sample in Bayesian Learning?
1) Simulation of a complex process
2) Approximate inference (not exact, because exact inference is hard)
3) Visualization
What is marginalization (stats)?
p(x) = sum_y p(x, y)
What is Naïve Bayes?
Naïve Bayes classifiers are probabilistic classifiers based on applying Bayes' theorem with strong independence assumptions between the features:
P(V | a_1, a_2, …, a_n) = (1/Z) · P(V) · prod_i P(a_i | V)
e.g., spam email classification.
A special case of a belief network where you've got a root node (the label) producing many different attribute values. Each attribute has no children and only has the root node as its single parent.
MAP class = argmax_V P(V) · prod_i P(a_i | V), i.e. the most likely class given the data that we've seen.
Naïve Bayes can handle missing attributes.
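A minimal counting-based Naïve Bayes sketch (toy data, no smoothing; all names are mine):

```python
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (attributes, label). Returns class counts and
    per-attribute counts for estimating P(a_i | V)."""
    label_counts = Counter(label for _, label in examples)
    cond = defaultdict(Counter)  # (attr index, label) -> counts of values
    for attrs, label in examples:
        for i, a in enumerate(attrs):
            cond[(i, label)][a] += 1
    return label_counts, cond

def map_class(attrs, label_counts, cond):
    """argmax_V P(V) * prod_i P(a_i | V); Z is shared, so we drop it."""
    total = sum(label_counts.values())
    def score(label):
        n = label_counts[label]
        s = n / total
        for i, a in enumerate(attrs):
            s *= cond[(i, label)][a] / n
        return s
    return max(label_counts, key=score)

data = [(("sunny", "hot"), "no"), (("rainy", "cool"), "yes"),
        (("sunny", "cool"), "yes"), (("rainy", "hot"), "no")]
print(map_class(("sunny", "cool"), *train(data)))  # -> "yes"
```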
Why is Naïve Bayes powerful?
1) When you have the naïve structure, inference is cheap
2) Few parameters (the number of parameters you need is linear in the number of attributes)
3) Easy to estimate the parameters with labeled data
4) Connects inference and classification
5) Empirically successful
6) Can do inference in both directions
Why is Naïve Bayes… naïve?
No free lunch.
The network doesn't represent the world most of the time, because it says all the attributes are conditionally independent given that you know the label (root node -> attributes). It doesn't model interrelationships between attributes.
It's OK because we don't care about the absolute probabilities that come out of it, only that we get the correct label (so the ordering of the probabilities is what matters).
Describe random restart hill climbing
- Guess some value, x
- Check both sides of your current position
- Move to the highest of your neighboring points
- Repeat until both of your neighbors are lower than you
- Once you’ve reached your local maximum, restart in some other random position and repeat
- Do this some number of iterations and output the best result
- In general this is an exploitation algorithm because it’s always going uphill. Maybe we could call the random restarts the exploration.
It's a way of dealing with local optima: once a local optimum is reached, try again starting from a randomly chosen x (see the sketch below).
1) Not much more expensive
2) Multiple tries to find a good starting place
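A minimal 1-D sketch over integers (the objective and the restart count are placeholders):

```python
import math
import random

def hill_climb(f, x, lo, hi):
    """Move to the better neighbor until both neighbors are no better."""
    while True:
        best = max((n for n in (x - 1, x + 1) if lo <= n <= hi), key=f)
        if f(best) <= f(x):
            return x            # local maximum reached
        x = best

def random_restart_hill_climb(f, lo, hi, restarts=20):
    """Climb from several random starting points; keep the best peak found."""
    peaks = [hill_climb(f, random.randint(lo, hi), lo, hi) for _ in range(restarts)]
    return max(peaks, key=f)

# A bumpy placeholder objective with several local maxima.
f = lambda x: 40 * math.sin(x / 7.0) - 0.5 * abs(x - 70)
print(random_restart_hill_climb(f, 0, 100))
```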
When might random hill climbing have trouble finding the solution?
If you've got a wide basin of attraction around a local optimum (one that is not the global optimum), then the majority of the time you won't reach the global optimum, especially if the global optimum has a narrow basin of attraction. It may take a very large number of restarts to find it.
Describe simulated annealing
Don’t always improve - sometimes you need to search.
A random search algorithm that trades off exploration and exploitation by starting off with more exploration and moving toward exploitation over time.
Algorithm:
• Sample a new point, x_t in neighborhood N(x)
• Compute probability of moving to that point as:
P(x, x_t, T):
• 1 if f(x_t) >= f(x) (neighbor is >= to you, make the move)
• e^((f(x_t) - f(x))/T) otherwise (neighbor is worse; make the move with this probability)
• Decrease Temperature, T
In that branch f(x_t) - f(x) is negative, so you've really got 1/e^(|difference|/T).
- When T is large, the exponent goes toward 0 and e^0 = 1, so it will move downhill.
- As the temperature decreases, the exponent becomes large in magnitude, making the probability very small.
So you start out with more exploration, accepting many downhill neighbors when T is large, and accept fewer downhill neighbors, choosing mostly to move uphill, as T gets smaller (see the sketch below).
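A compact sketch of that acceptance rule (neighborhood, cooling schedule, and objective are placeholders):

```python
import math
import random

def simulated_annealing(f, x, steps=10_000, t0=10.0, cooling=0.999):
    t = t0
    for _ in range(steps):
        x_t = x + random.choice((-1, 1))                  # sample x_t in N(x)
        if f(x_t) >= f(x):
            x = x_t                                       # uphill: always move
        elif random.random() < math.exp((f(x_t) - f(x)) / t):
            x = x_t                                       # downhill: move w.p. e^(diff/T)
        t = max(t * cooling, 1e-6)                        # decrease temperature T
    return x

f = lambda x: 40 * math.sin(x / 7.0) - 0.5 * abs(x - 70)
print(simulated_annealing(f, 0))
```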
What are the properties of simulated annealing?
T ->0 : like hill climbing
T ->infinity : random walk
Decrease T slowly
Related to the Boltzmann distribution
Describe Genetic Algorithms
A population is like parallel random restarts.
Start with an initial population of size K.
Repeat until converged (a sketch follows below):
* compute the fitness of all x ∈ P_t
* select the most fit individuals
* pair up individuals, replacing the least fit individuals via crossover/mutation
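A minimal bit-string GA following those steps (fitness, rates, and sizes are placeholders; crossover here is one-point, described on the next card):

```python
import random

def ga(fitness, n_bits=20, pop_size=30, generations=100, mut_rate=0.01):
    pop = [[random.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)          # most fit first
        survivors = pop[: pop_size // 2]             # cull the least fit half
        children = []
        while len(survivors) + len(children) < pop_size:
            p1, p2 = random.sample(survivors, 2)     # pair up individuals
            cut = random.randrange(1, n_bits)        # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [b ^ (random.random() < mut_rate) for b in child]  # mutate
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

print(ga(sum))  # placeholder fitness: count of 1 bits ("one-max")
```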
What is one point crossover (GA’s)?
1) The locality of the bits matters
2) It assumes there are subparts of the space that can be independently optimized
A method of generating 'children' from 'parents' where one child has the first n bits from one parent and the remaining (length - n) bits from the other parent. Each child has the opposite set of bits from each parent.
This does imply some kind of bias about the locality of the bits: that some groups of bits should stay together.
What is Uniform Crossover (GA’s)?
Randomize at each bit position: each child independently takes that bit from one parent or the other (contrast with one-point crossover in the sketch below).
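The two crossover operators side by side, assuming parents are equal-length bit lists:

```python
import random

def one_point_crossover(p1, p2):
    """Split at one cut point; preserves locality of bit groups."""
    cut = random.randrange(1, len(p1))
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

def uniform_crossover(p1, p2):
    """Each position independently picks a parent; no locality bias."""
    picks = [random.random() < 0.5 for _ in p1]
    c1 = [a if take else b for a, b, take in zip(p1, p2, picks)]
    c2 = [b if take else a for a, b, take in zip(p1, p2, picks)]
    return c1, c2
```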
Describe the MIMIC algorithm
- Directly model the probability distribution
- Successively refine the model over time
- Generate samples from the current distribution P_theta_t(x) (starts out uniform)
- Set theta_{t+1} to the n-th percentile (the fittest samples)
- Retain only those samples such that f(x) >= theta_{t+1}
- Estimate the new distribution P_theta_{t+1}(x)
- Repeat
You're taking theta_min toward theta_max over time.
It's like the water rising around a bunch of mountainous islands: the peaks remain visible, and the water keeps rising until the peak(s) become single point(s) (if there are equally high peaks).
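A heavily simplified sketch of that loop. Real MIMIC estimates a dependency-tree distribution; this uses an independent-bit model instead (closer to PBIL), so it only shows the rising-threshold idea:

```python
import random

def mimic_like(fitness, n_bits=20, n_samples=200, keep_frac=0.5, iters=30):
    probs = [0.5] * n_bits                # theta_0: uniform distribution
    for _ in range(iters):
        # Generate samples from the current model P_theta_t(x).
        samples = [[int(random.random() < p) for p in probs]
                   for _ in range(n_samples)]
        samples.sort(key=fitness, reverse=True)
        fittest = samples[: int(n_samples * keep_frac)]   # f(x) >= theta_{t+1}
        # Re-estimate the distribution from the retained samples.
        probs = [sum(s[i] for s in fittest) / len(fittest) for i in range(n_bits)]
    return max(samples, key=fitness)

print(mimic_like(sum))  # placeholder fitness: one-max
```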
How is MIMIC like genetic algorithms?
- Generating samples from P_theta_t(x): this is like your population
- Keeping the n-th percentile of samples with f(x) >= theta_{t+1}: this is like culling the population in a GA