classification Flashcards
what is classification
Classification is to determine P(H|X) (i.e., the posterior probability): the probability that the hypothesis H holds given the observed data sample X
determine the different kinds of probabilities involved
● P(H) (prior probability): the initial probability
○ E.g., X will buy a computer, regardless of age, income
● P(X): the probability that sample data is observed
● P(X|H) (likelihood): the probability of observing the sample X, given that the hypothesis
holds
○ E.g., given that X will buy a computer, the probability that X is aged 31 to 40 with medium income
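These combine through Bayes' theorem, which gives the posterior probability:
P(H|X) = P(X|H) · P(H) / P(X)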
define briefly the naive bayes
it assumes (class-conditional) independence between attributes, which eases up the calculation,
if an attribute is continuous then a Gaussian distribution is used,
if an attribute is categorical then its probabilities are estimated from the relative frequencies (counts) observed in the training data
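A minimal sketch of a Gaussian naive Bayes classifier with scikit-learn; the feature values and class labels below are made up for illustration:

# Gaussian naive Bayes: one Gaussian is fit per (attribute, class) pair
from sklearn.naive_bayes import GaussianNB

X_train = [[25, 30000], [45, 60000], [35, 40000], [50, 80000]]  # e.g. age, income (made-up)
y_train = ["buys", "no", "buys", "no"]                          # made-up class labels

model = GaussianNB()
model.fit(X_train, y_train)
print(model.predict([[30, 35000]]))        # predicted class label
print(model.predict_proba([[30, 35000]]))  # posterior P(H|X) for each class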
how to deal with zero probabilities in naive bayes
Laplacian correction (Laplace smoothing): add 1 to each count so that no conditional probability estimate is zero
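For example (with made-up counts): if an attribute's three values occur 0, 990 and 10 times among 1000 training tuples of a class, adding 1 to each count gives probabilities 1/1003, 991/1003 and 11/1003, which stay close to the uncorrected estimates but are never zero.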
how to measure accuracy
We can rely on:
● precision (exactness): what % of tuples that the classifier labelled as positive are actually positive
● recall (completeness): what % of positive tuples did the classifier label as positive
● F-measure (F1 score): the harmonic mean of precision and recall
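Written in terms of true positives (TP), false positives (FP) and false negatives (FN):
precision = TP / (TP + FP)
recall = TP / (TP + FN)
F1 = 2 · precision · recall / (precision + recall)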
what is an ensemble method
An ensemble for classification is a composite model, made up of a combination of classifiers.
● The individual classifiers vote, and a class label prediction is returned by the ensemble based on the collection of votes
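A minimal sketch of the voting step, assuming a list of already-trained scikit-learn-style classifiers (the names are hypothetical):

from collections import Counter

def ensemble_predict(classifiers, x):
    # each base classifier casts one vote for a class label
    votes = [clf.predict([x])[0] for clf in classifiers]
    # the label with the most votes is the ensemble's prediction
    return Counter(votes).most_common(1)[0][0]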
what is the difference between bagging and boosting
bagging is an ensemble method in which each classifier is trained on a bootstrap sample of the data and every classifier's prediction is given equal weight (majority vote),
while in boosting each training tuple is assigned a weight; after classifier Mi is learned, the weights are updated so that the subsequent classifiers pay more attention to the tuples that were misclassified by Mi.
Boosting is usually more accurate than bagging, but it is more prone to overfitting.
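A minimal sketch of both methods with scikit-learn (parameter values are arbitrary):

from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# bagging: each tree is trained on a bootstrap sample; predictions get equal weight
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50)

# boosting: tuple weights are updated after each classifier is learned,
# so later classifiers focus on the previously misclassified tuples
boosting = AdaBoostClassifier(n_estimators=50)

# both are then used like any classifier, e.g. bagging.fit(X_train, y_train)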
why use random forests
bagging alone does not work very well on decision trees because the generated trees are highly correlated; the idea is that at each split we choose only L out of the D attributes at random as candidate split attributes.
When L is too small, new attributes can be formed as random linear combinations of the existing attributes (Forest-RC).
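A minimal sketch with scikit-learn, where max_features plays the role of L (the number of attributes considered at each split):

from sklearn.ensemble import RandomForestClassifier

# each tree sees a bootstrap sample and, at every split,
# only a random subset of the attributes (here sqrt(D) of them)
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt")
# usage: forest.fit(X_train, y_train); forest.predict(X_test)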
what is the disadvantage of the naive Bayes classifier
it assumes class-conditional independence, which rarely holds in practice, therefore a loss of accuracy
what is the holdout method
the idea consists of splitting the data into 2 parts, a training set and a test set; the issue here is that the split can be unrepresentative (e.g., the class distribution becomes imbalanced between the two parts), which biases the accuracy estimate
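A minimal sketch of a stratified holdout split with scikit-learn (the data below is made up); stratification keeps the class proportions similar in both parts, which limits the imbalance problem:

from sklearn.model_selection import train_test_split

X = [[0], [1], [2], [3], [4], [5]]  # made-up feature values
y = [0, 0, 0, 1, 1, 1]              # made-up class labels

# hold out 1/3 of the tuples for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=0)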
what is cross validation
when we cannot afford to simply split our data into 2 parts, we split the dataset into k chunks (folds): take one chunk, use it for testing and the others for training, and keep iterating until all the chunks have been used for testing. This includes leave-one-out (k equals the number of tuples) and k-fold; when class imbalance occurs we may want to stratify, so that every fold has approximately the same class distribution
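A minimal sketch of stratified k-fold cross-validation with scikit-learn (data and classifier choice are arbitrary):

from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

X = [[i] for i in range(10)]  # made-up data
y = [0, 1] * 5                # made-up labels, two balanced classes

# 5 folds: each fold is used for testing exactly once
cv = StratifiedKFold(n_splits=5)
scores = cross_val_score(GaussianNB(), X, y, cv=cv)
print(scores, scores.mean())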
what is the idea of bootstrapping
perform n samplings with replacement to build the training set; with small datasets, use the bootstrapped data for training and the original tuples that were never selected for testing
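A minimal sketch of one bootstrap split in plain Python (the helper name is hypothetical):

import random

def bootstrap_split(data):
    # sample n tuples with replacement for training;
    # tuples that were never picked form the test set (about 36.8% on average)
    n = len(data)
    train = [random.choice(data) for _ in range(n)]
    test = [t for t in data if t not in train]
    return train, test

train, test = bootstrap_split([(1, "a"), (2, "b"), (3, "a"), (4, "b")])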