KNN and Bayes Classifier Flashcards

1
Q
  What is the main difference between regression and classification in supervised learning, as introduced in this lecture?
A

In regression, the target (response) is a continuous value (e.g., winning time in seconds). In classification, the target (label) is categorical (e.g., class A vs. class B). The lecture focuses on classification methods that assign discrete labels.

2
Q
  Suppose you have a binary classification problem (labels 0 or 1) with attribute vectors xₙ. What does a probabilistic classifier output, and how does that differ from a non-probabilistic classifier?
A

A probabilistic classifier outputs P(tₙ=1|xₙ) (and hence P(tₙ=0|xₙ) = 1 - P(tₙ=1|xₙ)), giving a probability of belonging to each class. A non-probabilistic classifier directly outputs a hard class label (like 0 or 1) without providing the probability of that choice.

3
Q
  Describe the basic steps of the K-Nearest Neighbors (KNN) classification algorithm and why it is considered ‘non-probabilistic.’
A

(1) Choose a value of K. (2) For a new test point x_new, find the K closest training points to x_new. (3) Assign the majority class among those neighbors to x_new. KNN is non-probabilistic because it simply votes on a class rather than producing a probability of class membership.
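
A minimal NumPy sketch of these three steps, assuming Euclidean distance (the names X_train, t_train, x_new are placeholders, not from the lecture):

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, t_train, x_new, k=3):
        # Step 2: Euclidean distance from x_new to every training point.
        dists = np.linalg.norm(X_train - x_new, axis=1)
        # Indices of the k closest training points.
        nearest = np.argsort(dists)[:k]
        # Step 3: majority vote among the k neighbours' labels.
        return Counter(t_train[nearest]).most_common(1)[0][0]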

4
Q
  Give an example problem: Suppose we have two classes (red, blue) and a new point x_new. Illustrate how you would classify x_new using 3-NN (K=3).
A

First, find the 3 training points nearest to x_new in terms of distance (e.g., Euclidean). Next, see how many of those 3 neighbors are labeled red vs. blue. If 2 or more are red, classify x_new as red; otherwise, classify x_new as blue.

5
Q
  What is the effect of choosing K=1 versus a large K in KNN, and what common pitfall can result from too large a K for imbalanced classes?
A

K=1 can lead to overfitting (excessively complex, jagged decision boundaries), while a large K oversmooths the boundary. With imbalanced classes, too large a K lets the majority class dominate the vote, so a minority class can practically disappear from the predictions once K is large relative to the number of minority-class points.

6
Q
  How would you use cross-validation to pick the best K for KNN?
A

(1) Choose a set of candidate K values. (2) For each K, run cross-validation – in each fold, train KNN using the other folds, then measure misclassifications on the validation fold. (3) Average the error rates across folds for each K. (4) Choose the K with the lowest average error.
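
A hedged sketch of this procedure with scikit-learn (the candidate list and fold count below are arbitrary choices, not from the lecture):

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import cross_val_score

    def choose_k(X, t, candidates=(1, 3, 5, 7, 9, 11), folds=10):
        # Mean cross-validated accuracy for each candidate K.
        scores = [cross_val_score(KNeighborsClassifier(n_neighbors=k), X, t, cv=folds).mean()
                  for k in candidates]
        # Lowest average error rate = highest average accuracy.
        return candidates[int(np.argmax(scores))]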

7
Q
  Name two typical problems that arise in KNN when dealing with real-world datasets, and propose short suggestions to mitigate them.
A

(1) Imbalanced classes – if a class is underrepresented, large K can bury it. One fix is weighting neighbors by distance or adjusting class priors. (2) High dimensionality – distances become less meaningful. One fix is dimensionality reduction or feature selection before applying KNN.

8
Q
  Summarize the essence of KNN’s ‘decision boundary’ behavior using a 2D example.
A

In 2D, the training points of each class carve the plane into decision regions. With K=1 the boundary is highly jagged, allowing tiny ‘islands’ of a class around isolated points. As K increases the boundary becomes smoother: local anomalies are averaged out, but smaller classes risk being overshadowed by larger ones.

9
Q
  Why might one want a probabilistic classifier instead of KNN’s ‘hard’ label assignment, especially in situations like medical diagnosis?
A

Probabilistic classifiers provide confidence estimates (e.g., P(disease=1|x)). This is vital in high-stakes decisions, such as medical diagnoses, where knowing the probability informs how to handle false positives vs. false negatives. A single label alone can’t convey uncertainty.

10
Q
  State Bayes’ rule for classification and explain the meaning of each term.
A

P(t_new=k|x_new,X,t) = p(x_new|t_new=k,X,t) P(t_new=k) / Σₘ p(x_new|t_new=m,X,t) P(t_new=m).

Bayes’ rule says the posterior probability of the new label k, given x_new and training data (X,t), is proportional to the likelihood (how likely x_new is if it belongs to class k) times the prior probability of class k, all normalized by the sum over all possible classes.
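
The normalization can be written out directly; a small sketch, assuming the likelihoods and priors have already been computed (the numbers below are illustrative only):

    import numpy as np

    def posterior(likelihoods, priors):
        # likelihoods[k] = p(x_new | t_new = k), priors[k] = P(t_new = k)
        numerators = np.asarray(likelihoods) * np.asarray(priors)
        return numerators / numerators.sum()   # normalize over all classes

    print(posterior([0.02, 0.01, 0.005], [1/3, 1/3, 1/3]))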

11
Q
  In the Bayes classifier, how do we model the likelihood term p(x_new|t_new=k,X,t) in practice?
A

We pick a parametric form for each class (e.g., Gaussian, Binomial, etc.), estimate the parameters for that class using training points belonging to class k, then evaluate p(x_new|class k) under that fitted distribution. That’s our class-conditional likelihood.

12
Q
  Give an example scenario: You have two coins (coin A, coin B), each tossed 20 times. The data are how many heads were observed. How would you do a Bayes classification for a new observation x_new heads in 20 tosses?
A

(1) Fit the coin A distribution by r_A = (Total heads for coin A) / (20*#A-observations). Similarly for coin B. (2) Each class conditional is Binomial(20, r_A) or Binomial(20, r_B). (3) Evaluate P(x_new|A) and P(x_new|B), then multiply each by the prior P(A) and P(B). Finally, use Bayes’ rule to get the posterior P(A|x_new) and P(B|x_new).
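
A small sketch of these steps using scipy.stats (the head counts and priors below are made up for illustration):

    import numpy as np
    from scipy.stats import binom

    heads_A = np.array([14, 15, 13])   # hypothetical heads per 20-toss run of coin A
    heads_B = np.array([8, 9, 10])     # hypothetical heads per 20-toss run of coin B

    # Step 1: estimate each coin's head probability.
    r_A = heads_A.sum() / (20 * len(heads_A))
    r_B = heads_B.sum() / (20 * len(heads_B))

    # Steps 2-3: Binomial class-conditionals, priors, and Bayes' rule.
    x_new, prior_A, prior_B = 12, 0.5, 0.5
    num_A = binom.pmf(x_new, 20, r_A) * prior_A
    num_B = binom.pmf(x_new, 20, r_B) * prior_B
    print(num_A / (num_A + num_B), num_B / (num_A + num_B))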

13
Q
  What role do priors play in the Bayes classifier, and why might someone choose P(t_new=k) ≠ 1/K?
A

Priors encode how likely each class is before seeing x_new. If classes are not equally common (e.g., a rare disease), one might give the rarer classes appropriately smaller priors. This lets the posterior reflect real-world prevalence or domain knowledge.

14
Q
  Explain the Naive-Bayes assumption and why it is used for high-dimensional features x.
A

Naive-Bayes assumes that within a class, the features x₁, x₂, …, x_D are conditionally independent. This simplifies the joint likelihood p(x|class) to a product of marginal distributions, drastically reducing complexity when D is large, at the cost of ignoring potential feature correlations.

15
Q
  Show a small example problem using Naive-Bayes with a 2D Gaussian, each dimension assumed independent. How do we fit its parameters for class k?
A

We treat each dimension d as N(µₖd, σₖd²). For class k:
(1) µₖd = (1/Nₖ) Σ xₙd over the points where tₙ=k.
(2) σₖd² = (1/Nₖ) Σ (xₙd − µₖd)².
The joint for class k is p(x|k) = Π_d N(x_d|µₖd, σₖd²).
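
A minimal NumPy/scipy sketch of this fit, assuming X is an N×2 array and t the label vector (the names are placeholders):

    import numpy as np
    from scipy.stats import norm

    def fit_naive_gaussian(X, t, k):
        Xk = X[t == k]                  # training points with label k
        mu = Xk.mean(axis=0)            # (1) per-dimension means
        var = Xk.var(axis=0)            # (2) per-dimension variances (1/Nk normalization)
        return mu, var

    def naive_likelihood(x_new, mu, var):
        # p(x|k) = product over dimensions of univariate Gaussian densities
        return np.prod(norm.pdf(x_new, loc=mu, scale=np.sqrt(var)))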

16
Q
  Compare the final prediction formulas between a standard (fully correlated) Gaussian Bayes classifier and a Naive-Bayes Gaussian classifier in 2D.
A

A fully correlated Gaussian uses p(x|k) = (1/(2π|Σₖ|^(1/2))) exp(−½ (x−µₖ)ᵀ Σₖ⁻¹ (x−µₖ)), while the Naive-Bayes version factorizes as p(x|k) = Π_d N(x_d|µₖd, σₖd²). In 2D, the former has a covariance matrix coupling x₁ and x₂; the latter effectively uses a diagonal covariance.
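
A short sketch contrasting the two densities with scipy (mu, Sigma and x below are illustrative values, not from the lecture):

    import numpy as np
    from scipy.stats import multivariate_normal, norm

    mu = np.array([0.0, 1.0])
    Sigma = np.array([[1.0, 0.6],
                      [0.6, 2.0]])      # full covariance: x1 and x2 are coupled
    x = np.array([0.5, 1.5])

    # Fully correlated Gaussian likelihood.
    full = multivariate_normal.pdf(x, mean=mu, cov=Sigma)

    # Naive-Bayes version: keep only the diagonal (per-dimension variances).
    naive = np.prod(norm.pdf(x, loc=mu, scale=np.sqrt(np.diag(Sigma))))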

17
Q
  In the 2D example with three classes, how do we combine the likelihood and prior to form the posterior probability for each class at a new point x_new?
A

We use P(t_new=k|x_new) = [p(x_new|k)P(k)] / [Σ_j p(x_new|j)P(j)]. Here, p(x_new|k) is derived from the class-conditional distribution (like the product of 2 Gaussians under Naive-Bayes), and P(k) is the prior. We then normalize over all classes j.

18
Q
  Could we interpret KNN in a Bayesian sense? Why or why not?
A

While there are probabilistic extensions of KNN (e.g., weighting votes by distance and deriving approximate posteriors), standard KNN doesn’t explicitly define a parametric likelihood or produce continuous posterior probabilities. It’s typically seen as a direct, non-parametric, vote-based method.

19
Q
  Show a short challenge question: Suppose your dataset has 5 classes, each dimension is assumed independent within a class, and you have 1000-dimensional data. Why might a fully Gaussian Bayes classifier be more challenging here than a Naive-Bayes approach?
A

A fully Gaussian approach needs to estimate a 1000×1000 covariance matrix per class, requiring enormous data. Naive-Bayes drastically reduces parameters by assuming diagonal covariance (or separated 1D distributions per dimension), making parameter estimation far more tractable with limited data.
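
The parameter counts behind this argument can be checked in a couple of lines (assuming one mean vector plus one covariance, or per-dimension variances, per class):

    D, K = 1000, 5
    full_per_class = D + D * (D + 1) // 2    # mean + free entries of a symmetric covariance
    naive_per_class = D + D                  # mean + one variance per dimension
    print(full_per_class * K)                # 2507500 parameters
    print(naive_per_class * K)               # 10000 parameters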

20
Q
  Summarize how a ‘Bayes classifier’ makes its decisions, then contrast it briefly with how a ‘KNN classifier’ makes decisions.
A

A Bayes classifier estimates class-conditional likelihoods and uses Bayes’ rule to compute posterior probabilities for each class, picking the one with the highest posterior. A KNN classifier finds the nearest K training points and uses a majority vote among them, without explicitly modeling probabilities.

21
Q
  Propose an example of combining the Bayes classifier with cross-validation. Why might you do that, and what parameter might you tune?
A

We might use cross-validation to choose the best type of likelihood model (e.g., univariate Gaussian vs. multivariate Gaussian vs. naive factorization) or to select hyperparameters like covariance regularization. By testing different model assumptions via CV, we see which yields the lowest misclassification rate on the validation folds.

22
Q
  Why does Naive-Bayes often work well in high-dimensional text classification problems (e.g., spam vs. not-spam)?
A

In text classification, features (the presence or frequency of particular words) can be assumed roughly independent given the class. This assumption is not fully correct, but it greatly simplifies the computations and often yields surprisingly good performance when data is abundant.
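
A hedged scikit-learn sketch of this spam example (the toy messages and labels are invented):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    docs = ["win a free prize now", "meeting moved to 3pm",
            "free offer click now", "lunch tomorrow?"]
    labels = [1, 0, 1, 0]                    # 1 = spam, 0 = not spam

    # Word counts as features; Naive-Bayes treats them as independent given the class.
    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(docs, labels)
    print(model.predict(["free prize offer"]))   # likely -> [1]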

23
Q
  What are the main pros and cons of the Bayes classifier approach vs. the KNN approach?
A

(1) Bayes classifier – Pros: gives posterior probabilities, allows domain knowledge through priors, and can handle small datasets well if the assumed model is correct. Cons: must assume a distribution (which can be wrong) and is more complex to implement for certain data.
(2) KNN – Pros: very simple, no explicit training. Cons: no probability outputs, must store all of the training data, and is sensitive to the distance metric and to high dimensionality.

24
Q
  Offer a short example problem: You have two classes and a prior favoring class 1 with P(t=1)=0.8, P(t=0)=0.2. After fitting class-conditionals, for x_new you find p(x_new|1)=0.05, p(x_new|0)=0.07. Which class is chosen by Bayes?
A

Compute numerator(1) = 0.05 × 0.8 = 0.04 and numerator(0) = 0.07 × 0.2 = 0.014. Since 0.04 > 0.014, the posterior for class 1 is larger (0.04 / 0.054 ≈ 0.74), so the classifier picks class 1. This demonstrates how a larger prior can outweigh even a smaller likelihood.

25
Q
  Conclude: What is Naive-Bayes in a nutshell, and why is it still popular despite its simplicity?
A

Naive-Bayes is a Bayesian classifier that assumes feature independence within each class, drastically simplifying likelihood estimation to the product of univariate distributions. It’s popular because it’s easy to implement, fast, and often performs surprisingly well in practice, especially for high-dimensional classification tasks like text.