Supervised Classifiers Flashcards
What is knowledge representation?
Viewing what the machine learning algorithm has learned. It may take the form of a set of rules or a probability distribution.
Steps in developing a machine learning application
1) Collect Data
2) Prepare the input data
3) Analyze the input data
4) Train the algorithm
5) Test the algorithm
What are k-nearest neighbor pros and cons?
Pros: High accuracy, insensitive to outliers, no assumptions about data
Cons: Computationally expensive, requires lots of memory
Describe k-nearest-neighbors.
We have an existing set of example data, our training set. We have labels for all of this data – we know what class each piece of the data should fall into. When we’re given a new piece of data without a label, we compare that new piece of data to the existing data, every piece of existing data. We then take the most similar pieces of data (the nearest neighbors) and look at their labels. We look at the top k most similar pieces of data from our known dataset. Majority vote wins.
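The procedure above can be sketched in a few lines. This is a minimal illustration, not a production implementation; the function and parameter names (`knn_classify`, `data`, `labels`) are hypothetical:

```python
from collections import Counter
import math

def knn_classify(query, data, labels, k=3):
    """Classify `query` by majority vote among its k nearest neighbors.
    `data` is a list of feature vectors; `labels` holds the matching classes."""
    # Euclidean distance from the query to every known example
    dists = [math.dist(query, x) for x in data]
    # indices of the k smallest distances
    nearest = sorted(range(len(data)), key=lambda i: dists[i])[:k]
    # majority vote among the neighbors' labels
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

print(knn_classify([0.9, 0.9],
                   [[1, 1], [1, 0.9], [0, 0], [0, 0.1]],
                   ["A", "A", "B", "B"], k=3))  # → A
```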
What’s the formula to normalize and scale everything to the range 0 to 1?
newValue = (oldValue - min)/(max-min)
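A quick sketch of this min-max scaling (the helper name `normalize` is hypothetical):

```python
def normalize(values):
    """Min-max scale a list of numbers to [0, 1] using
    newValue = (oldValue - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(normalize([10, 20, 50]))  # → [0.0, 0.25, 1.0]
```

Note this divides by zero if every value is identical; real code should guard against `max == min`.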
What is the Euclidean distance between two vectors?
sqrt( (Xa0 - Xb0)^2 + (Xa1 - Xb1) ^2 + … )
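The same formula written out as a small Python function (a sketch; `euclidean` is a hypothetical name):

```python
import math

def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((xa - xb) ** 2 for xa, xb in zip(a, b)))

print(euclidean([0, 0], [3, 4]))  # → 5.0
```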
What are the pros and cons of decision trees?
Pros: Computationally cheap to use, easy for humans to understand learned results, missing values OK, can deal with irrelevant features.
Cons: Prone to overfitting.
Describe decision tree creation.
1) Decide which feature best splits the data first.
2) Split the dataset into subsets. The subsets will then traverse down the branches of the first decision node.
a) If the data on the branches is the same class, then you’ve properly classified it and don’t need to continue splitting it.
b) If not the same, then you need to repeat the splitting process on this subset.
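Step 1 above, picking the best feature to split on, is usually done by information gain (the drop in entropy after the split). A minimal sketch, with hypothetical names (`best_feature_to_split`, `rows`):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_feature_to_split(rows, labels):
    """Return the feature index whose split gives the largest drop
    in entropy (information gain), or None if no split helps."""
    base = entropy(labels)
    best_gain, best_f = 0.0, None
    for f in range(len(rows[0])):
        # partition the labels by this feature's values
        parts = {}
        for row, lab in zip(rows, labels):
            parts.setdefault(row[f], []).append(lab)
        # weighted entropy remaining after the split
        remainder = sum(len(p) / len(labels) * entropy(p)
                        for p in parts.values())
        gain = base - remainder
        if gain > best_gain:
            best_gain, best_f = gain, f
    return best_f

rows = [[1, "x"], [1, "y"], [0, "x"], [0, "y"]]
labels = ["yes", "yes", "no", "no"]
print(best_feature_to_split(rows, labels))  # → 0 (feature 0 separates the classes perfectly)
```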
What is Shannon entropy or just entropy for short?
The expected value of the information.
Information is defined as follows: if you’re classifying something that can take on multiple values, the information for xi is defined as l(xi) = -log2(p(xi)), where p(xi) is the probability of choosing this class. The negative sign means rare, surprising outcomes carry more information.
To calculate entropy, you take the expected value of the information over all possible values of the class. This gives us:
H = -SUM(i = 1 to n) p(xi) * log2(p(xi)), where n is the number of classes.
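The formula can be computed directly from a list of class labels (a sketch; `shannon_entropy` is a hypothetical name):

```python
import math
from collections import Counter

def shannon_entropy(labels):
    """H = -sum over classes of p_i * log2(p_i), in bits (shannons)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

print(shannon_entropy(["heads", "tails"]))                # → 1.0 (fair coin)
print(shannon_entropy(["a", "a", "b", "b", "c", "c", "d", "d"]))  # → 2.0
```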
Give an example of Shannon entropy.
A coin toss. A single fair coin toss has 1 shannon (bit) of entropy: two equally likely outcomes, so H = log2(2) = 1 bit. Two fair coin tosses have 2 shannons of entropy: there are four equally likely outcomes, and H = log2(4) = 2 bits.
Naive Bayes Pros and Cons
Pros: Works with a small amount of data, handles multiple classes.
Cons: Sensitive to how input data is prepared.
What is underflow?
Multiplying together many very small numbers (such as probabilities), so the product becomes too small for floating-point representation and rounds to zero.
How do we get around underflow?
Take the natural logarithm of the product. If you recall from algebra, ln(a*b) = ln(a) + ln(b), so a product of many small probabilities becomes a sum of moderate negative numbers, which does not underflow.
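A small demonstration of both the underflow and the log-sum fix:

```python
import math

# Multiplying many small probabilities underflows to 0.0:
probs = [1e-20] * 20
product = 1.0
for p in probs:
    product *= p
print(product)  # → 0.0 (the true value, 1e-400, is below double-precision range)

# Summing logs instead keeps the result representable:
log_product = sum(math.log(p) for p in probs)
print(log_product)  # ≈ -921.03, i.e. ln(1e-400)
```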
What’s the difference between a set-of-words model vs bag-of-words model?
Set-of-words is when you treat the presence or absence of a word as a feature. A bag of words can have multiple occurrences of each word.
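The difference is easy to see in code. A sketch with hypothetical helper names (`set_of_words`, `bag_of_words`, `vocab`):

```python
def set_of_words(vocab, doc):
    """1 if the word appears at all in the document, else 0."""
    return [1 if w in doc else 0 for w in vocab]

def bag_of_words(vocab, doc):
    """Count every occurrence of each vocabulary word."""
    return [doc.count(w) for w in vocab]

vocab = ["spam", "ham", "hello"]
doc = ["spam", "spam", "hello"]
print(set_of_words(vocab, doc))  # → [1, 0, 1]
print(bag_of_words(vocab, doc))  # → [2, 0, 1]
```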
What is hold-out cross validation?
When you randomly select a portion of our data for the training set and a portion for the test set.
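A minimal sketch of such a random split (the names `holdout_split` and `test_fraction` are hypothetical):

```python
import random

def holdout_split(data, test_fraction=0.2, seed=0):
    """Shuffle a copy of the data, then split it into
    a training portion and a held-out test portion."""
    rng = random.Random(seed)       # seeded for reproducibility
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

train, test = holdout_split(list(range(10)))
print(len(train), len(test))  # → 8 2
```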
What are two assumptions that make naive bayes naive?
1) We assume independence. That one feature or word is just as likely by itself as it is next to other words.
2) That every feature is equally important.
Pros and Cons of Logistic Regression?
Pros: Computationally inexpensive, easy to implement, knowledge representation easy to interpret.
Cons: Prone to underfitting, may have low accuracy.
What is the sigmoid function?
The sigmoid function is given by the following equation: sigma(z) = 1 / (1 + e^-z)
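The same equation in code; the sigmoid squashes any real input into (0, 1), which is why it works as a probability-like classifier output:

```python
import math

def sigmoid(z):
    """sigma(z) = 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))   # → 0.5
print(sigmoid(6))   # ≈ 0.9975
```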
What is Gradient Descent?
To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point. The gradient operator will always point in the direction of the greatest increase so we take the negative. The step size is given by parameter alpha. We repeat the step until we reach a stopping condition: either a specified number of steps or the algorithm is within a certain tolerance margin.
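A one-dimensional sketch of the update loop described above, with both stopping conditions (a step budget and a tolerance). The names `gradient_descent`, `alpha`, and `tol` are hypothetical:

```python
def gradient_descent(grad, x0, alpha=0.1, steps=100, tol=1e-8):
    """Repeatedly step against the gradient until the step budget
    runs out or the move falls below the tolerance."""
    x = x0
    for _ in range(steps):
        step = alpha * grad(x)  # step size proportional to the gradient
        x -= step               # move opposite the direction of increase
        if abs(step) < tol:
            break
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2(x - 3):
print(gradient_descent(lambda x: 2 * (x - 3), x0=0.0))  # ≈ 3.0
```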
How do you classify an instance with logistic regression?
You calculate the sigmoid of the dot product of the vector under test with the weights optimized earlier by gradient descent. If the sigmoid gives you a value greater than 0.5, the class is 1; otherwise it’s 0.
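This classification rule can be sketched as follows; the weight values here are made up for illustration, not the output of an actual fit:

```python
import math

def classify_logistic(x, weights):
    """Class 1 if sigmoid(w . x) > 0.5, else class 0.
    `weights` is assumed to come from an earlier gradient-descent fit."""
    z = sum(wi * xi for wi, xi in zip(weights, x))
    prob = 1.0 / (1.0 + math.exp(-z))
    return 1 if prob > 0.5 else 0

print(classify_logistic([1.0, 2.0], [0.5, 0.5]))    # z = 1.5  → 1
print(classify_logistic([1.0, 2.0], [-1.0, -0.5]))  # z = -2.0 → 0
```

Since sigmoid(z) > 0.5 exactly when z > 0, the rule is equivalent to just checking the sign of the dot product.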
Describe SVM
A support vector machine constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training-data point of any class (so-called functional margin), since in general the larger the margin the lower the generalization error of the classifier.
What is generalization error?
The generalization error of a machine learning model is a function that measures how well a learning machine generalizes to unseen data. It is measured as the distance between the error on the training set and the test set and is averaged over the entire set of possible training data that can be generated after each iteration of the learning process.
What is the use of a Kernel?
Mapping from one feature space to another. Think of the kernel as a wrapper or interface for the data to translate it from a difficult formatting to an easier formatting.
What combines multiple classifiers?
Ensemble methods or meta-algorithms