classification Flashcards
what is classification
Classification is to determine P(H|X) (i.e., the posterior probability): the probability that the hypothesis H holds given the observed data sample X
determine the different kinds of probabilities involved
● P(H) (prior probability): the initial probability
○ E.g., X will buy a computer, regardless of age, income
● P(X): the probability that sample data is observed
● P(X|H) (likelihood): the probability of observing the sample X, given that the hypothesis
holds
○ E.g., given that X will buy a computer, the probability that X is aged 31 to 40 with medium income
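These combine through Bayes' theorem, which gives the posterior probability:
P(H|X) = P(X|H) · P(H) / P(X)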
define briefly the naive bayes
it assumes (class-conditional) independence between attributes, which eases up the calculation,
if an attribute is continuous then a Gaussian distribution is used,
if an attribute is categorical then its probabilities are estimated from the relative frequencies (counts) observed in the training data
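A minimal sketch of a Gaussian naive Bayes classifier with scikit-learn; the feature values and class labels below are made up for illustration:

# Gaussian naive Bayes: one Gaussian is fit per (attribute, class) pair
from sklearn.naive_bayes import GaussianNB

X_train = [[25, 30000], [45, 60000], [35, 40000], [50, 80000]]  # e.g. age, income (made-up)
y_train = ["buys", "no", "buys", "no"]                          # made-up class labels

model = GaussianNB()
model.fit(X_train, y_train)
print(model.predict([[30, 35000]]))        # predicted class label
print(model.predict_proba([[30, 35000]]))  # posterior P(H|X) for each class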
how to deal with zero probabilities in naive bayes
Laplacian correction (Laplace smoothing): add 1 to each count so that no conditional probability estimate is zero
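For example (with made-up counts): if an attribute's three values occur 0, 990 and 10 times among 1000 training tuples of a class, adding 1 to each count gives probabilities 1/1003, 991/1003 and 11/1003, which stay close to the uncorrected estimates but are never zero.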
how to measure accuracy
We can rely on:
● precision (exactness): what % of tuples that the classifier labelled as positive are actually positive
● recall (completeness): what % of positive tuples did the classifier label as positive
● F-measure (F1 score): the harmonic mean of precision and recall
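Written in terms of true positives (TP), false positives (FP) and false negatives (FN):
precision = TP / (TP + FP)
recall = TP / (TP + FN)
F1 = 2 · precision · recall / (precision + recall)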
what is an ensemble method
An ensemble for classification is a composite model, made up of a combination of classifiers.
● The individual classifiers vote, and a class label prediction is returned by the ensemble based on the collection of votes
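A minimal sketch of the voting step, assuming a list of already-trained scikit-learn-style classifiers (the names are hypothetical):

from collections import Counter

def ensemble_predict(classifiers, x):
    # each base classifier casts one vote for a class label
    votes = [clf.predict([x])[0] for clf in classifiers]
    # the label with the most votes is the ensemble's prediction
    return Counter(votes).most_common(1)[0][0]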
what is the difference between bagging and boosting
bagging is an ensemble method in which each classifier is trained on a bootstrap sample of the data and every classifier's prediction is given equal weight (majority vote),
while in boosting each training tuple is assigned a weight; after classifier Mi is learned, the weights are updated so that the subsequent classifiers pay more attention to the tuples that were misclassified by Mi.
Boosting is usually more accurate than bagging, but it is more prone to overfitting.
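A minimal sketch of both methods with scikit-learn (parameter values are arbitrary):

from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# bagging: each tree is trained on a bootstrap sample; predictions get equal weight
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50)

# boosting: tuple weights are updated after each classifier is learned,
# so later classifiers focus on the previously misclassified tuples
boosting = AdaBoostClassifier(n_estimators=50)

# both are then used like any classifier, e.g. bagging.fit(X_train, y_train)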
why use random forests
bagging alone does not work very well on decision trees because the generated trees are highly correlated; the idea is that at each split we choose only L out of the D attributes at random as candidate split attributes.
When L is too small, new attributes can be formed as random linear combinations of the existing attributes (Forest-RC).
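A minimal sketch with scikit-learn, where max_features plays the role of L (the number of attributes considered at each split):

from sklearn.ensemble import RandomForestClassifier

# each tree sees a bootstrap sample and, at every split,
# only a random subset of the attributes (here sqrt(D) of them)
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt")
# usage: forest.fit(X_train, y_train); forest.predict(X_test)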
what is the disadvantage of the naive Bayes classifier
it assumes class-conditional independence, which rarely holds in practice, therefore a loss of accuracy
what is the holdout method
the idea consists of splitting the data into 2 parts, a training set and a test set; the issue here is that the split can be unrepresentative (e.g., the class distribution becomes imbalanced between the two parts), which biases the accuracy estimate
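A minimal sketch of a stratified holdout split with scikit-learn (the data below is made up); stratification keeps the class proportions similar in both parts, which limits the imbalance problem:

from sklearn.model_selection import train_test_split

X = [[0], [1], [2], [3], [4], [5]]  # made-up feature values
y = [0, 0, 0, 1, 1, 1]              # made-up class labels

# hold out 1/3 of the tuples for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=0)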
what is cross validation
when we cannot afford to simply split our data into 2 parts, we split the dataset into k chunks (folds): take one chunk, use it for testing and the others for training, and keep iterating until all the chunks have been used for testing. This includes leave-one-out (k equals the number of tuples) and k-fold; when class imbalance occurs we may want to stratify, so that every fold has approximately the same class distribution
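A minimal sketch of stratified k-fold cross-validation with scikit-learn (data and classifier choice are arbitrary):

from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

X = [[i] for i in range(10)]  # made-up data
y = [0, 1] * 5                # made-up labels, two balanced classes

# 5 folds: each fold is used for testing exactly once
cv = StratifiedKFold(n_splits=5)
scores = cross_val_score(GaussianNB(), X, y, cv=cv)
print(scores, scores.mean())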
what is the idea of bootstrapping
perform n samplings with replacement to build the training set; with small datasets, use the bootstrapped data for training and the original tuples that were never selected for testing
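A minimal sketch of one bootstrap split in plain Python (the helper name is hypothetical):

import random

def bootstrap_split(data):
    # sample n tuples with replacement for training;
    # tuples that were never picked form the test set (about 36.8% on average)
    n = len(data)
    train = [random.choice(data) for _ in range(n)]
    test = [t for t in data if t not in train]
    return train, test

train, test = bootstrap_split([(1, "a"), (2, "b"), (3, "a"), (4, "b")])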