Week 9 DSE Flashcards
What does likelihood mean?
The probability of observing the data x given that y equals some particular value.
What can Bayes' rule be used for?
Classifying a binary or other discrete variable.
What does f_k(X) mean?
The conditional density of X (or conditional probability mass function, if X is discrete) for an observation from class k.
What is π_k?
The prior probability that a random observation comes from class k.
It is easily estimated under random sampling: just the fraction of the training observations that belong to class k.
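A minimal sketch of estimating π_k as a class fraction, using hypothetical training labels:

```python
from collections import Counter

# Hypothetical training labels; pi_k is simply the fraction of
# training observations in each class.
y_train = ["A", "A", "B", "A", "B", "B", "B", "A", "A", "A"]
counts = Counter(y_train)
n = len(y_train)
priors = {k: c / n for k, c in counts.items()}
print(priors)  # {'A': 0.6, 'B': 0.4}
```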
What is the assumption underlying the naive Bayes classifier?
It assumes that within class k, all p predictors are independent.
This eliminates the need to estimate their joint distribution:
f_k(x) = f_k1(x_1) × f_k2(x_2) × … × f_kp(x_p)
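A sketch of this factorization, assuming Gaussian marginals with hypothetical per-predictor means and standard deviations for class k:

```python
import math

def normal_pdf(x, mu, sigma):
    """Univariate normal density."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical (mu, sigma) estimates for class k, one pair per predictor (p = 3).
params_k = [(0.0, 1.0), (5.0, 2.0), (-1.0, 0.5)]
x = [0.2, 4.5, -0.8]

# Naive Bayes: f_k(x) is the product of the marginal densities.
f_k = 1.0
for xj, (mu, sigma) in zip(x, params_k):
    f_k *= normal_pdf(xj, mu, sigma)
print(f_k)
```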
What to do when the joint distribution is multivariate normal?
Estimate the covariance matrix (tough in high dimensions).
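An illustrative pure-Python sample covariance calculation (made-up data, p = 2). With p predictors there are p(p+1)/2 covariance parameters, which is why estimation gets hard as p grows:

```python
# Hypothetical n = 4 observations of p = 2 predictors.
X = [[1.0, 2.0], [2.0, 4.1], [3.0, 6.2], [4.0, 7.9]]
n, p = len(X), len(X[0])

# Column means, then the unbiased sample covariance matrix.
means = [sum(row[j] for row in X) / n for j in range(p)]
cov = [[sum((row[i] - means[i]) * (row[j] - means[j]) for row in X) / (n - 1)
        for j in range(p)] for i in range(p)]
print(cov)
```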
What is another option for a quantitative/continuous X besides assuming it is drawn from a normal distribution?
Use non-parametric density estimation:
make a histogram of the observations of the j-th predictor within each class,
or use a kernel density estimator.
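A sketch of a histogram density estimate for one predictor within one class (illustrative values; the bin count and range are arbitrary choices):

```python
# Hypothetical values of the j-th predictor for observations in class k.
values = [0.3, 0.5, 0.7, 1.2, 1.4, 1.5, 1.6, 2.8]
lo, hi, n_bins = 0.0, 3.0, 3
width = (hi - lo) / n_bins

# Count observations falling in each bin.
counts = [0] * n_bins
for v in values:
    b = min(int((v - lo) / width), n_bins - 1)
    counts[b] += 1

# Density per bin: fraction of observations divided by bin width,
# so the histogram integrates to 1.
density = [c / (len(values) * width) for c in counts]
print(density)
```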
How to estimate a qualitative X?
What does it mean to be qualitative?
Qualitative means X is discrete (categorical).
Count, within each class, the proportion of training observations of the j-th predictor taking each level.
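A minimal sketch of these class-conditional proportions, using hypothetical categorical values:

```python
from collections import Counter

# Hypothetical values of the j-th (qualitative) predictor, restricted to
# training observations from one class k.
xj_in_class_k = ["red", "blue", "red", "red", "green", "blue"]
counts = Counter(xj_in_class_k)
n = len(xj_in_class_k)

# Estimated P(X_j = level | class k) is just the within-class proportion.
f_kj = {level: c / n for level, c in counts.items()}
print(f_kj)
```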
What is it called when π_1 and π_2 both equal 0.5 (when there are two classes)?
It is called a flat (uniform) prior.
What assumption does the naive Bayes function in R rely on?
The assumption of a normal distribution for quantitative predictors.
What do we ask ourselves when applying statistical methods?
Are the assumptions valid?
What is the purpose of assuming the predictors are independent conditional on class?
It mainly ensures computational tractability.
What is a potential source of high variance? How does naive Bayes address it?
Estimating the joint distribution of the predictors becomes very hard as p grows, requiring a lot of data.
Naive Bayes simplifies this problem, introducing some bias but reducing variance.
Naive bayes is a relatively simple method with little tendency to overfit.
When does naive Bayes reduce variance drastically?
When n is not large relative to p.
When does naive Bayes work best? (slide 29)
It works best when the data have relatively many predictors, so that reliable estimation of the joint densities of the predictors within each class is hard to achieve.