Discrete & Continuous Data Flashcards

1
Q

Types of Attribute

A

Continuous
Ordinal
Nominal

2
Q

Instances aren’t labelled

A

Unsupervised ML

3
Q

Not enough instances are labelled

A

Semi-supervised ML

4
Q

instances are all labelled

A

Supervised ML

5
Q

instances are ordered

A

sequence learning

6
Q

nominal learners

A

NB
1-R
DT

7
Q

continuous learners

A

KNN
NP
SVM

8
Q

Nominal Attributes, but Numeric Learner

A

(1) For k-NN and NP: Hamming distance
(2) randomly assign numbers to attribute values
• If scale is constant between attributes, this is not as bad an idea as it sounds! (But still undesirable)
• Worse with higher-arity attributes (more attribute values)
• Imposes an attribute ordering which may not exist
(3) one-hot encoding

9
Q

Hamming distances

A

the minimum number of substitutions required to change one string into the other, or the minimum number of errors that could have transformed one string into the other
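
A minimal sketch of computing Hamming distance between two equal-length attribute vectors (the function name and example values are illustrative, not from the cards):

    def hamming_distance(a, b):
        # number of positions at which two equal-length sequences differ
        if len(a) != len(b):
            raise ValueError("sequences must have equal length")
        return sum(x != y for x, y in zip(a, b))

    # e.g. two instances described by three nominal attributes
    print(hamming_distance(["sunny", "hot", "high"], ["sunny", "mild", "normal"]))  # 2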

10
Q

one-hot encoding

A

If a nominal attribute takes m values, replace it with m Boolean attributes

Example:
hot = [1, 0, 0]
mild = [0, 1, 0]
cool = [0, 0, 1]
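
A small Python sketch of one-hot encoding a nominal attribute by hand (function and variable names are illustrative):

    def one_hot(value, values):
        # map a nominal value onto m Boolean (0/1) attributes, one per possible value
        return [1 if value == v else 0 for v in values]

    temps = ["hot", "mild", "cool"]
    print(one_hot("hot", temps))   # [1, 0, 0]
    print(one_hot("mild", temps))  # [0, 1, 0]
    print(one_hot("cool", temps))  # [0, 0, 1]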

11
Q

Pros & Cons of one-hot encoding

A

Pro: solves the problem of using nominal attributes with a continuous learner

Con: massively increases the feature space

12
Q

Numeric Attributes, but Nominal Learner

A

(1) NB
(2) DT
(3) 1-R

Discretisation

13
Q

Types of Naive Bayes

A

• Multivariate NB: attributes are nominal, and can take any (fixed) number of values

• Binomial (or Bernoulli) NB: attributes are binary (a special case of multivariate NB)

• Multinomial NB: attributes are natural numbers, corresponding to frequencies

• Gaussian NB: attributes are real numbers; use a Probability Density Function
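
For reference, scikit-learn provides classifiers that roughly correspond to these variants; a minimal sketch assuming scikit-learn is installed (the toy data is illustrative):

    from sklearn.naive_bayes import CategoricalNB, BernoulliNB, MultinomialNB, GaussianNB

    # CategoricalNB  ~ multivariate NB (nominal attributes, integer-encoded)
    # BernoulliNB    ~ binomial/Bernoulli NB (binary attributes)
    # MultinomialNB  ~ multinomial NB (count/frequency attributes)
    # GaussianNB     ~ Gaussian NB (real-valued attributes, per-class normal PDFs)

    X = [[1.2, 0.5], [0.8, 0.3], [3.1, 2.2], [2.9, 1.8]]  # toy real-valued features
    y = [0, 0, 1, 1]
    print(GaussianNB().fit(X, y).predict([[1.0, 0.4]]))   # -> [0]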

14
Q

numeric attributes for DT

A

(1) Binarisation

(2) Range

15
Q

Binarisation

A

Each node is labelled with a_k and has two branches: one branch is a_k ≤ m, one branch is a_k > m.

Info Gain/Gain Ratio must be calculated for each non-trivial “split point” for each attribute
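
A rough sketch of enumerating candidate split points for one numeric attribute and scoring each by information gain (the entropy helper, function name and toy data are illustrative):

    from collections import Counter
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def best_binary_split(values, labels):
        # try a_k <= m vs a_k > m at each midpoint between adjacent distinct sorted values
        pairs = sorted(zip(values, labels))
        base = entropy(labels)
        best = None
        for i in range(1, len(pairs)):
            if pairs[i - 1][0] == pairs[i][0]:
                continue  # not a valid split point
            m = (pairs[i - 1][0] + pairs[i][0]) / 2
            left = [l for v, l in pairs if v <= m]
            right = [l for v, l in pairs if v > m]
            gain = base - (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
            if best is None or gain > best[1]:
                best = (m, gain)
        return best  # (split point m, info gain)

    print(best_binary_split([64, 65, 68, 70, 71], ["no", "no", "yes", "yes", "yes"]))  # (66.5, ~0.97)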

16
Q

Con of Binarisation

A

leads to arbitrarily large trees

17
Q

Discretisation

A

the translation of continuous attributes into nominal attributes

Steps:

  1. decide on the interval (out-of-scope)
  2. map each continuous value onto a discrete value

Types:

  1. Unsupervised (does not know/use the class label)
  2. Supervised (knows/uses the class label)

18
Q

Unsupervised Discretisation

A

(1) Naive
(2) Equal Size
(3) Equal Frequency
(4) K-Means Clustering

19
Q

Naive Unsupervised Discretisation

A

treat each unique value as a discrete nominal value

20
Q

Pros & Cons of Naive Unsupervised Discretisation

A

Advantages:
• simple to implement

Disadvantages:
• loss of generality
• no sense of ordering
• describes the training data, but nothing more (overfitting)

21
Q

Equal Size Unsupervised Discretisation

A

Identify the upper and lower bounds and partition the overall space into n equal intervals = equal width

min = 64
max = 83
intervals (n = 3): 64-70, 71-76, 77-83
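
A quick sketch of equal-width binning (the function name, bin count and values are illustrative):

    def equal_width_bins(values, n):
        # assign each value a bin index 0..n-1 over n equal-width intervals
        lo, hi = min(values), max(values)
        width = (hi - lo) / n
        return [min(int((v - lo) / width), n - 1) for v in values]  # clamp the max into the last bin

    temps = [64, 65, 68, 70, 71, 75, 80, 83]
    print(equal_width_bins(temps, 4))  # [0, 0, 0, 1, 1, 2, 3, 3]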

22
Q

Pros & Cons of Equal Size Unsupervised Discretisation

A

Advantages:
• simple

Disadvantages:
• badly affected by outliers
• arbitrary n

23
Q

Equal Frequency Unsupervised Discretisation

A

Sort the values, and identify breakpoints which produce n (roughly) equal-sized partitions = equal frequency

1st bin: 1st-4th instances
2nd bin: 5th-8th instances
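
A rough sketch of equal-frequency binning (the function name and values are illustrative):

    def equal_frequency_bins(values, n):
        # sort the values and cut them into n (roughly) equal-sized groups
        order = sorted(range(len(values)), key=lambda i: values[i])
        size = len(values) / n
        bins = [0] * len(values)
        for rank, i in enumerate(order):
            bins[i] = min(int(rank / size), n - 1)
        return bins

    temps = [83, 64, 70, 71, 68, 75, 80, 65]
    print(equal_frequency_bins(temps, 2))  # the 4 smallest values land in bin 0, the rest in bin 1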

24
Q

Pros & Cons of Equal Frequency Unsupervised Discretisation

A

Advantages:
• simple

Disadvantages:
• arbitrary n

25
Q

K-Means Clustering

A

(1) Select k points at random (or otherwise) to act as seed clusters
(2) Assign each instance to the cluster with the nearest centroid
(3) Recompute the seed points as the centroids of the clusters of the current partition (the centroid is the centre, i.e. mean point, of the cluster)
(4) Go back to (2); stop when there are no reassignments (convergence)

It may or may not converge, but fast convergence is fairly typical. One typical improvement runs k-means multiple times (with random seeds), looking for a common clustering, and simply ignores runs which don’t converge within τ iterations.
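
A compact sketch of this loop on 1-D data (pure Python; names are illustrative, and a real implementation would also handle empty clusters and multiple restarts):

    import random

    def k_means_1d(xs, k, max_iter=100):
        centroids = random.sample(xs, k)                        # (1) pick k seed points
        for _ in range(max_iter):
            clusters = [[] for _ in range(k)]
            for x in xs:                                        # (2) assign to nearest centroid
                clusters[min(range(k), key=lambda j: abs(x - centroids[j]))].append(x)
            new_centroids = [sum(c) / len(c) if c else centroids[j]
                             for j, c in enumerate(clusters)]   # (3) recompute centroids (means)
            if new_centroids == centroids:                      # (4) stop when nothing moves
                break
            centroids = new_centroids
        return centroids, clusters

    print(k_means_1d([64, 65, 68, 70, 71, 80, 81, 83], k=2))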

26
Q

Pros & Cons of K-Means Clustering

A

Strengths:
• relatively efficient: O(tkn), where n is # instances, k is # clusters, and t is # iterations; normally k, t ≪ n

Weaknesses:
• tends to converge to a local minimum; sensitive to seed instances
• need to specify k in advance
• not able to handle non-convex clusters
• “mean” is ill-defined for nominal attributes

27
Q

Supervised Discretisation

A

(1) Naive
(2) v1 improvement
(3) v2 improvement

28
Q

Naive Supervised Discretisation

A

“Group” values into class-contiguous intervals

Steps:
1. Sort the values, and identify breakpoints in class membership
2. Reposition any breakpoints where there is no change in numeric value
3. Set the breakpoints midway between the neighbouring values
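
A rough sketch of steps 1–3 on a toy value/label list (the function name and data are illustrative; breakpoints with no change in numeric value are simply skipped here rather than repositioned):

    def class_breakpoints(values, labels):
        # sort by value, find class-membership changes, place breakpoints midway between neighbours
        pairs = sorted(zip(values, labels))
        breaks = []
        for (v1, c1), (v2, c2) in zip(pairs, pairs[1:]):
            if c1 != c2 and v1 != v2:
                breaks.append((v1 + v2) / 2)
        return breaks

    print(class_breakpoints([64, 65, 68, 70, 71], ["no", "no", "yes", "yes", "no"]))  # [66.5, 70.5]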

29
Q

Pros & Cons of Naive Supervised Discretisation

A

Advantages:
• simple to implement

Disadvantages:
• no sense of ordering
• usually creates too many categories (overfitting)

30
Q

Improvement on Naive Supervised Discretisation

A

v1: delay inserting a breakpoint until each “cluster” contains at least n instances of the majority class

v2: merge neighbouring clusters until they reach a certain size / contain at least n instances of the majority class

31
Q

Probability Mass Function (PMF)

A

For a discrete random variable X that takes on a finite or countably infinite number of possible values, we determine P(X = x) for all of the possible values of X, and call this the probability mass function.

32
Q

Probability Density Function (PDF)

A

For continuous random variables, the probability that X takes on any particular value x is 0. That is, finding P(X = x) for a continuous random variable X is not going to work. Instead, we'll need to find the probability that X falls in some interval (a, b), that is, we'll need to find P(a < X < b). We'll do that using a probability density function.

33
Q

a popular PDF

A

Gaussian/normal distribution

34
Q

Gaussian distribution

A

• symmetric about the mean
• area under the curve = 1
• to estimate the probability, we need the mean µ and standard deviation σ of a distribution X
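
A minimal sketch of the normal density f(x) = 1/(σ√(2π)) · exp(-(x - µ)² / (2σ²)), i.e. the PDF a Gaussian NB would evaluate (the example values are illustrative):

    from math import sqrt, pi, exp

    def gaussian_pdf(x, mu, sigma):
        # normal density with mean mu and standard deviation sigma
        return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

    print(gaussian_pdf(73, mu=73, sigma=6.2))  # density is highest at the mean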

35
Q

Why Gaussians?

A

• In practice, a normal distribution is a reasonable approximation for many events
• This is a consequence of the Central Limit Theorem
• More careful analysis shows that the mean is almost always normally distributed, but outliers can wreak havoc on our probability estimates