Supervised Classifiers Flashcards

1
Q

What is knowledge representation?

A

Viewing what the machine learning algorithm has learned. It may take the form of a set of rules or a probability distribution.

2
Q

Steps in developing a machine learning application

A

1) Collect Data
2) Prepare the input data
3) Analyze the input data
4) Train the algorithm
5) Test the algorithm

3
Q

What are k-nearest neighbor pros and cons?

A

Pros: High accuracy, insensitive to outliers, no assumptions about the data.
Cons: Computationally expensive, requires lots of memory

4
Q

Describe k-nearest-neighbors.

A

We have an existing set of example data, our training set, and we have labels for all of it – we know what class each piece of data falls into. When we’re given a new piece of data without a label, we compare it against every piece of the existing data and take the k most similar pieces (the nearest neighbors). We then look at the labels of those top k pieces, and the majority vote wins.
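
A minimal sketch of this procedure in Python (NumPy and the helper name classify_knn are illustrative assumptions, not from the deck):

import numpy as np
from collections import Counter

def classify_knn(x, train_X, train_labels, k=3):
    # Euclidean distance from the unlabeled point x to every training example
    dists = np.sqrt(((train_X - x) ** 2).sum(axis=1))
    # Indices of the k nearest neighbors
    nearest = np.argsort(dists)[:k]
    # Majority vote among their labels
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]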

5
Q

What’s the formula to normalize and scale everything to the range 0 to 1?

A

newValue = (oldValue - min)/(max-min)
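
A quick NumPy sketch of applying this column-wise (the function name normalize and the use of NumPy are assumptions for illustration):

import numpy as np

def normalize(data):
    # newValue = (oldValue - min) / (max - min), applied per column
    mins = data.min(axis=0)
    maxs = data.max(axis=0)
    return (data - mins) / (maxs - mins)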

6
Q

What is the Euclidean distance between two vectors?

A

sqrt( (Xa0 - Xb0)^2 + (Xa1 - Xb1)^2 + … )
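
For example, in NumPy (the vectors a and b are illustrative):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])
dist = np.sqrt(((a - b) ** 2).sum())   # same as np.linalg.norm(a - b); here 5.0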

7
Q

What are the pros and cons of decision trees?

A

Pros: Computationally cheap to use, easy for humans to understand learned results, missing values OK, can deal with irrelevant features.
Cons: Prone to overfitting.

8
Q

Describe decision tree creation.

A

1) Decide which feature best splits the data first.
2) Split the dataset into subsets; the subsets then traverse down the branches of the first decision node.
a) If the data on a branch is all the same class, you’ve classified it properly and don’t need to keep splitting.
b) If not, repeat the splitting process on that subset (see the recursive sketch below).
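
A minimal recursive sketch in Python, assuming rows of nominal feature values and an information-gain split based on Shannon entropy (next card); all names here are illustrative:

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_feature(rows, labels):
    # 1) Choose the feature index whose split yields the largest information gain
    base = entropy(labels)
    gains = []
    for f in range(len(rows[0])):
        remainder = 0.0
        for v in set(r[f] for r in rows):
            sub = [labels[i] for i, r in enumerate(rows) if r[f] == v]
            remainder += len(sub) / len(rows) * entropy(sub)
        gains.append(base - remainder)
    return gains.index(max(gains))

def build_tree(rows, labels):
    if len(set(labels)) == 1:                 # a) branch is pure: stop splitting
        return labels[0]
    if not rows[0]:                           # no features left: majority vote
        return Counter(labels).most_common(1)[0][0]
    f = best_feature(rows, labels)
    tree = {f: {}}
    for v in set(r[f] for r in rows):         # 2) split into subsets and recurse
        idx = [i for i, r in enumerate(rows) if r[f] == v]
        sub_rows = [rows[i][:f] + rows[i][f + 1:] for i in idx]   # drop the used feature
        tree[f][v] = build_tree(sub_rows, [labels[i] for i in idx])
    return tree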

9
Q

What is Shannon entropy or just entropy for short?

A

The expected value of the information.
Information is defined as follows: if you’re classifying something that can take on multiple values, the information for x_i is l(x_i) = -log2(p(x_i)), where p(x_i) is the probability of choosing that class.
Entropy is the expected value of the information over all possible values of the class. This gives us:
H = -SUM(i = 1 to n) p(x_i)*log2(p(x_i)), where n is the number of classes.
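
A short Python sketch of the calculation (the function name entropy and the example labels are illustrative):

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    # H = -sum over classes of p(x_i) * log2(p(x_i))
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

print(entropy(['yes', 'yes', 'no', 'no']))     # 1.0 bit: two equally likely classes
print(entropy(['yes', 'yes', 'yes', 'yes']))   # 0.0: no uncertainty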

10
Q

Give an example of shannon entropy.

A

A coin toss. The entropy of equally likely outcomes is the log-base-2 of the number of possible outcomes: a single fair coin toss has two outcomes, so its entropy is 1 shannon (1 bit); tossing two fair coins gives four outcomes, so the entropy is 2 bits.

11
Q

Naive Bayes Pros and Cons

A

Pros: Works with a small amount of data, handles multiple classes.
Cons: Sensitive to how input data is prepared.

12
Q

What is underflow?

A

Multiplying many small numbers (such as probabilities) until the product becomes too small to represent in floating point and rounds to zero.

13
Q

How do we get around underflow?

A

Take the natural logarithm of the product. Recall from algebra that ln(a*b) = ln(a) + ln(b), so you can add logarithms instead of multiplying the small numbers.
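
A small Python illustration (the probability values are made up to force the underflow):

import math

probs = [1e-200, 1e-180, 1e-150]

product = 1.0
for p in probs:
    product *= p                              # underflows to 0.0 in 64-bit floats

log_sum = sum(math.log(p) for p in probs)     # stays representable, about -1220.4

print(product, log_sum)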

14
Q

What’s the difference between a set-of-words model vs bag-of-words model?

A

A set-of-words model treats only the presence or absence of each word as a feature, while a bag-of-words model can record multiple occurrences of each word (counts).
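
A small Python illustration (the vocabulary and document are made-up examples):

from collections import Counter

vocab = ['the', 'dog', 'cat', 'barks']
doc = ['the', 'dog', 'barks', 'at', 'the', 'cat']

set_of_words = [1 if w in doc else 0 for w in vocab]   # presence/absence only
bag_of_words = [Counter(doc)[w] for w in vocab]        # occurrence counts

print(set_of_words)   # [1, 1, 1, 1]
print(bag_of_words)   # [2, 1, 1, 1]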

15
Q

What is hold-out cross validation?

A

When you randomly select a portion of the data for the training set and reserve the remaining portion for the test set.
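
A minimal NumPy sketch (the function name holdout_split and the 80/20 split are illustrative choices):

import numpy as np

def holdout_split(X, y, test_fraction=0.2):
    # Shuffle the indices, carve off a test portion, keep the rest for training
    idx = np.random.permutation(len(X))
    n_test = int(len(X) * test_fraction)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]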

16
Q

What are two assumptions that make naive bayes naive?

A

1) Independence: one feature or word is just as likely on its own as it is next to other words.
2) Every feature is equally important.

17
Q

Pros and Cons of Logistic Regression?

A

Pros: Computationally inexpensive, easy to implement, knowledge representation easy to interpret.
Cons: Prone to underfitting, may have low accuracy.

18
Q

What is the sigmoid function?

A

The sigmoid function is given by the following equation: sigma(z) = 1 / (1 + e^-z)
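
In Python with NumPy this is simply (an illustrative sketch):

import numpy as np

def sigmoid(z):
    # Maps any real-valued z into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))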

19
Q

What is Gradient Descent?

A

To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point. The gradient always points in the direction of greatest increase, so we take the negative. The step size is given by the parameter alpha. We repeat the step until we reach a stopping condition: either a specified number of steps has been taken or the algorithm is within a certain tolerance margin.
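
A tiny sketch on a one-dimensional function, f(w) = (w - 3)^2 with gradient 2*(w - 3); the function, alpha, and tolerance are illustrative choices:

def gradient_descent(grad, w0, alpha=0.1, tol=1e-6, max_steps=1000):
    w = w0
    for _ in range(max_steps):
        step = alpha * grad(w)
        w -= step                 # move against the gradient (direction of greatest decrease)
        if abs(step) < tol:       # stop once the steps become tiny
            break
    return w

print(gradient_descent(lambda w: 2 * (w - 3), w0=0.0))   # converges near 3.0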

20
Q

How do you classify an instance with logistic regression?

A

You calculate the sigmoid of the vector under test multiplied by the weights optimized earlier with gradient descent. If the sigmoid gives you a value greater than .5, the class is 1, and it’s 0 otherwise.
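
A brief sketch (NumPy and the names classify and weights are illustrative):

import numpy as np

def classify(x, weights):
    # weights were previously optimized, e.g. with gradient descent/ascent
    prob = 1.0 / (1.0 + np.exp(-np.dot(x, weights)))   # sigmoid of w·x
    return 1 if prob > 0.5 else 0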

21
Q

Describe SVM

A

A support vector machine constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training-data point of any class (so-called functional margin), since in general the larger the margin the lower the generalization error of the classifier.

22
Q

What is generalization error?

A

The generalization error of a machine learning model is a function that measures how well a learning machine generalizes to unseen data. It is measured as the distance between the error on the training set and the test set and is averaged over the entire set of possible training data that can be generated after each iteration of the learning process.

23
Q

What is the use of a Kernel?

A

Mapping from one feature space to another. Think of the kernel as a wrapper or interface for the data that translates it from a difficult format into an easier one.

24
Q

What combines multiple classifiers?

A

Ensemble methods or meta-algorithms

25
Q

Describe bootstrap aggregating or bagging.

A

It is a technique where data is sampled from the original dataset S times to make S new datasets, each the same size as the original. Each new dataset is built by randomly selecting examples from the original with replacement, so a given example can appear more than once.
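
A minimal NumPy sketch of drawing the S bootstrap datasets (assuming data is a NumPy array; the function name is illustrative):

import numpy as np

def bootstrap_datasets(data, S):
    # Each new dataset has the same size as the original and is sampled with replacement
    n = len(data)
    return [data[np.random.randint(0, n, size=n)] for _ in range(S)]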

26
Q

What is a random forest?

A

Random forests are an ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random forests correct for decision trees’ habit of overfitting to their training set. In particular, trees that are grown very deep tend to learn highly irregular patterns: they overfit their training sets because they have low bias but very high variance. Random forests are a way of averaging multiple deep decision trees, trained on different parts of the same training set, with the goal of reducing the variance. This comes at the expense of a small increase in bias and some loss of interpretability, but generally greatly boosts the performance of the final model.

27
Q

What is tree bagging?

A

Using bagging with decision trees. This means that while the predictions of a single tree are highly sensitive to noise in its training set, the average of many trees is not, as long as the trees are not correlated. Simply training many trees on a single training set would give strongly correlated trees (or even the same tree many times, if the training algorithm is deterministic); bootstrap sampling is a way of de-correlating the trees by showing them different training sets.

28
Q

Describe the difference between tree bagging and random forests.

A

Tree bagging, as described above, is the original bagging algorithm for trees. Random forests differ in only one way from this general scheme: they use a modified tree learning algorithm that selects, at each candidate split in the learning process, a random subset of the features. This process is sometimes called “feature bagging”. The reason for doing this is the correlation of the trees in an ordinary bootstrap sample: if one or a few features are very strong predictors for the response variable (target output), these features will be selected in many of the bagged trees, causing the trees to become correlated.
Typically, for a classification problem with p features, √p features are used in each split.

29
Q

What is boosting?

A

Boosting is a technique similar to bagging. In boosting and bagging, you always use the same type of classifier, but in boosting the classifiers are trained sequentially: each new classifier is trained based on the performance of those already trained, so it focuses on the data that previous classifiers misclassified. Boosting also differs from bagging in how the output is calculated: it is a weighted sum of all classifiers, and the weights aren’t equal as in bagging but are based on how successful each classifier was in the previous iteration.

30
Q

What is AdaBoost?

A

Short for adaptive boosting. It works by combining decision stumps with boosting.
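
For reference, one way to try this with an off-the-shelf implementation (scikit-learn is an assumption here; the deck does not name a library). Its default base estimator is a depth-1 decision tree, i.e. a decision stump:

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
clf = AdaBoostClassifier(n_estimators=50, random_state=0)   # 50 boosted stumps
clf.fit(X, y)
print(clf.score(X, y))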

31
Q

What is precision?

A

TP / (TP + FP)

Tells us the fraction of records that are actually positive among the group the classifier predicted to be positive.

32
Q

What is recall?

A

TP / (TP + FN)
Measures the fraction of positive examples the classifier got right. Classifiers with large recall don’t have many positive examples classified incorrectly.

33
Q

What is the ROC curve?

A

Receiver operating characteristic.
The ROC curve is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied. The curve is created by plotting the true positive rate (TPR, or recall) against the false positive rate (FPR) at various threshold settings.
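
A brief sketch of computing the curve’s points (scikit-learn is an assumed choice; y_true and y_scores are illustrative names):

from sklearn.metrics import roc_curve

y_true = [0, 0, 1, 1]               # actual labels
y_scores = [0.1, 0.4, 0.35, 0.8]    # classifier scores or probabilities
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
# Each (fpr[i], tpr[i]) pair is one point on the ROC curve at threshold thresholds[i]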

34
Q

What is the false positive rate or fallout?

A

FP / (FP + TN)
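
A small Python sketch tying together precision, recall, and fallout (the confusion-matrix counts are made-up examples):

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def fallout(fp, tn):    # false positive rate
    return fp / (fp + tn)

print(precision(tp=8, fp=2))    # 0.8
print(recall(tp=8, fn=4))       # 0.666...
print(fallout(fp=2, tn=16))     # 0.111...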