Test 1 W1-4 Flashcards

Question

How to solve overfitting? How to pick k?

Answer 1

We can use validation or cross-validation to select the model (for example, the value of k)
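
A minimal sketch of picking k for KNN by cross-validation. The toy data, the helper names (`knn_predict`, `cv_accuracy`) and the candidate values of k are all assumptions made for illustration, not part of the original card.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k training points closest to `query`."""
    neighbours = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

def cv_accuracy(data, k, n_folds):
    """Fraction of points classified correctly when each fold is held out once."""
    correct = 0
    for fold in range(n_folds):
        held_out = data[fold::n_folds]  # every n_folds-th point, starting at `fold`
        train = [p for i, p in enumerate(data) if i % n_folds != fold]
        correct += sum(knn_predict(train, x, k) == y for x, y in held_out)
    return correct / len(data)

# Hypothetical toy data: two well-separated 2-D clusters.
data = [((0.0, 0.1), "a"), ((0.2, 0.0), "a"), ((0.1, 0.3), "a"), ((0.3, 0.2), "a"),
        ((2.0, 2.1), "b"), ((2.2, 1.9), "b"), ((1.9, 2.3), "b"), ((2.1, 2.0), "b")]

# Score each candidate k on held-out folds and keep the best one.
scores = {k: cv_accuracy(data, k, n_folds=4) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)
```

Setting `n_folds=len(data)` holds out one point at a time.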

Answer 2

Adv: Learning is cheap (just need to remember all data points) | Dis: - Prediction is expensive (need to retrieve the k nearest neighbors from a large set of N points, for each prediction made) - In high dimensions, points are far away from each other (poor performance)

Answer 3

aka instance-based/ case-based/ memory-based method. Number of model parameters grows with number of training cases/data points (example is KNN)

Answer 4

aka model-based methods | Number of parameters is fixed

Answer 5

1. Linear classifiers | 2. Decision Trees

Answer 6

To find the line (or hyperplane) which can "best" (under some criterion/ objective) separate two classes

Answer 7

Minimize an error function known as the perceptron criterion, which associates zero error with any correctly classified data point. It seeks a weight vector w such that a pattern x_n in class C1 has w^T φ(x_n) > 0 and a pattern in class C2 has w^T φ(x_n) < 0. The true class label t_n takes the value +1 for C1 and -1 for C2, so every correctly classified point satisfies w^T φ(x_n) t_n > 0. When a data point is misclassified, its feature vector is added to the current weight vector (for C1) or subtracted from it (for C2), i.e. w <- w + φ(x_n) t_n, giving a new decision boundary.
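
The update rule above can be sketched roughly as follows. The function name, the `max_epochs` cap and both toy data sets are assumptions for illustration. Each feature vector carries a leading constant 1 so that the bias is part of w.

```python
def train_perceptron(points, targets, max_epochs=100):
    """points: feature vectors phi(x_n); targets: +1 for C1, -1 for C2.
    Returns (weights, converged)."""
    w = [0.0] * len(points[0])
    for _ in range(max_epochs):
        mistakes = 0
        for phi, t in zip(points, targets):
            activation = sum(wi * pi for wi, pi in zip(w, phi))
            if activation * t <= 0:  # misclassified (or exactly on the boundary)
                w = [wi + t * pi for wi, pi in zip(w, phi)]  # w <- w + t * phi
                mistakes += 1
        if mistakes == 0:  # a full pass with no errors: exact solution found
            return w, True
    return w, False  # gave up after max_epochs (assumed cap)

# Hypothetical linearly separable data (first component is the bias feature).
points = [(1, 2, 2), (1, 1, 3), (1, -1, -2), (1, -2, -1)]
targets = [1, 1, -1, -1]
weights, converged = train_perceptron(points, targets)

# XOR-style data is not linearly separable, so no pass is ever error-free.
xor_points = [(1, 0, 0), (1, 1, 1), (1, 0, 1), (1, 1, 0)]
xor_targets = [-1, -1, 1, 1]
_, xor_converged = train_perceptron(xor_points, xor_targets)
```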

Answer 8

stochastic gradient descent (SGD) (aka online algorithm)

Answer 9

If there exists an exact solution (i.e. the training data is linearly separable), then the perceptron learning algorithm is guaranteed to find an exact solution in a finite number of steps. If the data is not linearly separable, the perceptron will not converge

Answer 10

Parametric Model - a classification tree, where (1) each internal node represents a "test" on a variable/feature; (2) each branch represents the outcome of the test; and (3) each leaf node has a class label

Answer 11

(Greedily) search features and choose the one whose split reduces impurity the most, where impurity is measured with: - Gini - Entropy - Misclassification error
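
A small sketch of the three impurity measures and of the impurity reduction for one candidate split. The function names and the example labels are illustrative assumptions. Both children of the split are assumed to be non-empty.

```python
import math
from collections import Counter

def class_probs(labels):
    """Empirical class frequencies at a node."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def gini(labels):
    return 1.0 - sum(p * p for p in class_probs(labels))

def entropy(labels):
    return -sum(p * math.log2(p) for p in class_probs(labels))

def misclassification(labels):
    return 1.0 - max(class_probs(labels))

def impurity_reduction(parent, left, right, impurity=gini):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = len(left) / n * impurity(left) + len(right) / n * impurity(right)
    return impurity(parent) - weighted

# Hypothetical node with a perfectly separating split.
parent = ["a", "a", "b", "b"]
gain = impurity_reduction(parent, ["a", "a"], ["b", "b"])
```

A greedy tree builder would evaluate `impurity_reduction` for every candidate split and keep the largest.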

Answer 12

Generative: Model the joint distribution p(x, Ck), then use Bayes' theorem to find the posterior class probability p(Ck|x). (By sampling from the joint distribution it is possible to generate synthetic/unseen data points in the input space.) | Discriminative: Learn p(Ck|x) directly, or just use a discriminant function that maps each input x directly onto a class label, in which case probabilities play no role

Answer 13

Used for both Generative/ Discriminative models to choose a class label. Tells us how to make optimal decisions given the appropriate probabilities (minimize the chance of assigning x to the wrong class)

Answer 14

When data is scarce, consider setting the number of folds equal to the number of data points (leave-one-out cross-validation)

Answer 15

Prediction is expensive. You need to retrieve k nearest neighbours from a large set of N points.

Answer 16

Minimize expected sum of impurity at leaves.

Answer 17

A generative model (with continuous input). Closed-form solution in which you obtain the parameters using maximum likelihood estimation. Need to manipulate mean vectors and covariance matrices when extending from univariate to multivariate Gaussians
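
A rough univariate sketch of the closed-form fit, assuming one Gaussian per class and made-up 1-D data. The multivariate case would replace the scalar mean and variance with a mean vector and a covariance matrix. Function names are illustrative.

```python
import math

def fit_gaussian_classes(xs, labels):
    """Maximum likelihood estimates per class: (prior, mean, variance)."""
    params = {}
    for c in set(labels):
        xc = [x for x, y in zip(xs, labels) if y == c]
        mean = sum(xc) / len(xc)
        var = sum((x - mean) ** 2 for x in xc) / len(xc)  # MLE divides by N, not N - 1
        params[c] = (len(xc) / len(xs), mean, var)
    return params

def posterior(x, params):
    """p(C_k | x) via Bayes' theorem: prior * likelihood, normalised over classes."""
    joint = {}
    for c, (prior, mean, var) in params.items():
        likelihood = math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
        joint[c] = prior * likelihood
    total = sum(joint.values())
    return {c: j / total for c, j in joint.items()}

# Hypothetical 1-D training data for two classes.
xs = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
labels = [0, 0, 0, 1, 1, 1]
params = fit_gaussian_classes(xs, labels)
post = posterior(2.5, params)  # should favour class 0, whose mean is 2
```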

Answer 18

A generative model (with discrete input)

Answer 19

conditioned on class, features/ variables are independent

Answer 20

Gaussian Bayes Classifier | Has a separate, diagonal covariance matrix for each class

Answer 21

Discriminative | - has the same number of adjustable parameters as dimensions of the feature space

Answer 22

Logistic regression (over Gaussian classifiers)

Answer 23

use the "kernel trick"

Answer 24

reduce an algorithm to one which depends only on dot products between data vectors. Then replace the dot product with a kernel function k(x, z)
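
As an illustration of why the replacement works, the sketch below checks that one particular kernel, k(x, z) = (x^T z)^2 on 2-D inputs, equals an ordinary dot product after an explicit feature map. The choice of kernel, the map `phi` and the sample vectors are assumptions for the example.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def phi(x):
    """Explicit degree-2 feature map for a 2-D input."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(x, z):
    """k(x, z) = (x^T z)^2, computed without ever forming phi."""
    return dot(x, z) ** 2

x, z = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x), phi(z))  # dot product in the 3-D feature space
implicit = poly_kernel(x, z)    # same value from the 2-D inputs alone
```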

Answer 25

Contains two modules: the algorithm and the kernel function. Take a standard algorithm and massage it so that all references to the original data vectors x appear only in dot products x^T z

Answer 26

The "Gram Matrix" | once an algorithm has been successfully "kernelized", you can build the Gram matrix and throw away the original data
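
A hedged sketch of training from the Gram matrix alone. It uses a kernelized perceptron (not an SVM) because it is short. The kernel (1 + x^T z)^2, the XOR-style data and all function names are assumptions for illustration.

```python
def quad_kernel(x, z):
    """k(x, z) = (1 + x^T z)^2, an illustrative choice of kernel."""
    return (1 + sum(a * b for a, b in zip(x, z))) ** 2

def gram_matrix(points, kernel):
    """G[i][j] = k(x_i, x_j) for every pair of training points."""
    return [[kernel(a, b) for b in points] for a in points]

def train_kernel_perceptron(gram, targets, max_epochs=100):
    """Dual perceptron: training touches the data only through the Gram matrix."""
    n = len(targets)
    alpha = [0] * n  # alpha[i] counts the mistakes made on point i
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            activation = sum(alpha[j] * targets[j] * gram[j][i] for j in range(n))
            if targets[i] * activation <= 0:
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:
            break
    return alpha

def predict(x_new, points, targets, alpha, kernel):
    """Scoring a new point uses kernel values between it and the training points."""
    score = sum(a * t * kernel(p, x_new) for a, t, p in zip(alpha, targets, points))
    return 1 if score > 0 else -1

# XOR-style data: not linearly separable in the input space.
points = [(0, 0), (1, 1), (0, 1), (1, 0)]
targets = [-1, -1, 1, 1]
gram = gram_matrix(points, quad_kernel)
alpha = train_kernel_perceptron(gram, targets)
predictions = [predict(p, points, targets, alpha, quad_kernel) for p in points]
```

In this sketch only the training loop is free of the original vectors. `predict` still evaluates the kernel between the new point and the stored training points.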

Answer 27

kernelized maximum-margin hyperplane classifier

Answer 28

maximum margin principle

Answer 29

same covariance matrix (Gaussian classifier)

Answer 30

regression

Answer 31

Depends on your error function - For squared error, the mean is best - For absolute error, we get the median
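
A quick numerical check of this claim, assuming a small made-up target set with one outlier and a coarse grid of candidate constant predictions.

```python
import statistics

def total_squared_error(c, ys):
    return sum((y - c) ** 2 for y in ys)

def total_absolute_error(c, ys):
    return sum(abs(y - c) for y in ys)

ys = [1.0, 2.0, 3.0, 4.0, 100.0]           # hypothetical targets with one outlier
candidates = [i / 10 for i in range(1001)]  # constant predictions 0.0, 0.1, ..., 100.0

# Best constant prediction under each error function, found by brute force.
best_sq = min(candidates, key=lambda c: total_squared_error(c, ys))
best_abs = min(candidates, key=lambda c: total_absolute_error(c, ys))
```

Here `best_sq` lands on the mean of `ys` and `best_abs` on its median, so the outlier pulls the squared-error choice far more than the absolute-error one.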

Answer 32

(regression) says y = t, independent of x -> biggest problem for regression

Answer 33

Logistic regression, Perceptron

Answer 34

a. A validation set can be used to prevent overfitting | b. In some situations you should use validation but not cross-validation, e.g., when training takes too long (months)

Answer 35

d. Naive Bayes can work on continuous features directly

Answer 36

d. p(c1 | x) < 0.5, so predict x to be c2

Answer 37

c. During test phase, SVM needs to compute the dot product between a test data point and all training data points

Tuesday, October 2 (62 cards)