Midsemester Exam Flashcards

To revise content for the midsemester exam

1
Q

What’s the difference between Active and Passive learning?

A

Passive learning is where an expert tells you all the features that help you classify the data and you memorise them. Active learning is where an expert classifies a dataset for you and you discover the features for yourself.

2
Q

What is the process of Classification?

A
  1. Get a training set of data.
  2. Apply a learning algorithm to the training set to create a model.
  3. Apply the model to new, unseen data to classify it.
3
Q

What is one way you can classify data?

A

Using a decision tree

4
Q

Why don’t you explicitly tell the model the characteristics of data?

A

This can be difficult to do. For example, how do you determine that an email is spam? Can you list all the features of a spam email? It is much easier to just tell the model which emails are spam and which are not.

5
Q

What is the difference between Classification and Clustering?

A

Classification is supervised learning: you are the expert who supervises the model with labelled data. Clustering is unsupervised learning: there are no class labels.

6
Q

What are the two steps in the classification process?

A
  1. Model construction
  2. Model usage
7
Q

Explain the model construction step in the classification process.

A

To construct a model we need to:

  • Get a training set.
  • Determine class labels for the training set, e.g. salmon / not salmon.
  • Determine how to represent the model, e.g. using decision trees, mathematical formulas, or rules.
8
Q

What do we do in the model usage step?

A

Check the accuracy of the model. How? By checking that the test samples' class labels are correctly predicted by the model.

9
Q

What’s the difference between training data and testing data?

A

Training data and testing data are both partitions of the dataset. The testing data's class labels are not known by the model; the model's job is to accurately predict them.

10
Q

What is one way of determining the accuracy of a model?

A

Use a confusion matrix to determine the accuracy of the model

11
Q

How do you use a confusion matrix to determine the accuracy of a model?

A

General idea: the number of correctly classified tuples divided by all the tuples.

Accuracy = (True Positive + True Negative) / (True Positive + False Positive + True Negative + False Negative)
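The accuracy calculation can be sketched in a few lines of Python (the confusion-matrix counts here are made up for illustration):

```python
# Hypothetical confusion-matrix counts for a binary classifier
tp, fp, tn, fn = 50, 10, 30, 10

# Accuracy = correctly classified tuples / all tuples
accuracy = (tp + tn) / (tp + fp + tn + fn)  # 0.8
```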

12
Q

What is a disadvantage of the Confusion Matrix method of testing model accuracy?

A

Consider an example of 10,000 fish to be classified where only 10 are salmon. The accuracy calculation will be very high (99.9%) even if the model misses every salmon, because the salmon are such a small fraction of the dataset. The high accuracy can hide the fact that the model isn't detecting the rare class correctly.

13
Q

What is precision and recall?

A

Precision: exactness – what % of tuples that the classifier labeled as positive are actually positive?

Recall: completeness – what % of positive tuples did the classifier correctly label as positive?

14
Q

How do we calculate precision?

A

Precision = True Positive / (True Positive + False Positive)

15
Q

How do we calculate recall?

A

Recall = True Positive / (True Positive + False Negative)
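Both formulas can be sketched together in Python (hypothetical counts):

```python
# Hypothetical confusion-matrix counts
tp, fp, fn = 40, 10, 10

precision = tp / (tp + fp)  # what % of predicted positives are truly positive -> 0.8
recall = tp / (tp + fn)     # what % of actual positives were found -> 0.8
```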

16
Q

What is the holdout method of estimation?

A

Take e.g. 70% of the data as the training set and the remaining 30% as the testing set. Do this a couple of times with different random splits.
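A minimal sketch of the holdout split in Python (the 70/30 ratio and the helper name are just for illustration):

```python
import random

def holdout_split(data, train_frac=0.7, seed=0):
    """Shuffle the data, then cut it into training and testing partitions."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

train, test = holdout_split(list(range(10)))  # 7 training records, 3 testing records
```

Repeating the call with different seeds gives the "do this a couple of times" part.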

17
Q

What is the cross validation method of estimation?

A

Divide the dataset into subsets. Use each subset as the test set once, with the remaining subsets as the training set. Do this for all subsets.

Example with 10 subsets:
Step 1: Subset 1 -> test set; Subsets 2-10 -> training set.
Step 2: Subset 2 -> test set; Subsets 1, 3-10 -> training set.
...and so on for all 10 subsets.
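The subset rotation can be sketched as follows (the helper name is illustrative):

```python
def k_fold_splits(data, k):
    """Yield (test_subset, training_records) pairs; every record is tested exactly once."""
    folds = [data[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield test, train

splits = list(k_fold_splits(list(range(10)), 5))  # 5 rounds of test/train partitions
```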

18
Q

What are the three main algorithms for classifying data?

A

Nearest Neighbour, Bayes, and Decision Tree

19
Q

What’s an instance-based classifier?

A

A rote-learner model of classification: it memorises the training data, and a record has to match a stored example exactly for the classification to work.

20
Q

What are the three things you need to get K-nearest neighbour to work?

A

You need three things:
  1. The stored training records.
  2. A way to compute the distance to neighbours.
  3. A choice of how many neighbours (k) to compare to.

21
Q

What are some things to consider for classification algorithms?

A

Accuracy - does it correctly classify class labels?

Speed - time to build the model, time to use the model.

Robustness - handling signal and noise.

Scalability - handling large amounts of data.

Interpretability - understanding the insight from the model.

22
Q

What are similarity/dissimilarity in the context of classification algorithms?

A

Similarity is a measure of how alike two data objects are; 1 means identical. Dissimilarity is a measure of how different they are; 0 means identical.

23
Q

What measure do we use to compute similarity between data objects?

A

Use Euclidean distance: the square root of the sum of the squared attribute differences.

Example: John (45 yo, $10,000), Kelly (34 yo, $15,000)

Distance = sqrt[(45 - 34)^2 + (10,000 - 15,000)^2]
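The John/Kelly example, computed directly:

```python
import math

john = (45, 10_000)
kelly = (34, 15_000)

# Square root of the sum of squared attribute differences
distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(john, kelly)))  # ~5000.01
```

Note how the salary difference dwarfs the age difference, which is why normalizing the attributes matters.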

24
Q

What are the steps of the K-nearest algorithm/classification?

A
  1. Compute the Euclidean distance from the data point to the other data points.
  2. Get the class label from the nearest neighbours. How? A majority vote, weighted by the weight factor.
  3. What's the weight factor? Weight = 1/distance^2
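The steps above can be sketched as follows (the dataset and helper name are hypothetical):

```python
import math
from collections import defaultdict

def knn_predict(query, records, k=3):
    """records: list of (point, class_label). Weighted majority vote, weight = 1/distance^2."""
    nearest = sorted((math.dist(query, p), label) for p, label in records)[:k]
    votes = defaultdict(float)
    for d, label in nearest:
        votes[label] += 1.0 / (d * d) if d else float("inf")
    return max(votes, key=votes.get)

records = [((0, 0), "A"), ((1, 0), "A"), ((5, 5), "B"), ((6, 5), "B")]
knn_predict((0.5, 0.2), records, k=3)  # "A"
```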
25
Q

Why is k-nearest neighbour considered lazy?

A

It doesn’t build a model. It remembers all the training data and has to compute distance to neighbours every time it is run.

26
Q

What are the issues with choosing a k-value for nearest neighbour?

A

If k is too small, the classification is sensitive to noise points. If k is too large, the neighbourhood may include points from other classes.

27
Q

How do we stop one attribute from dominating the k-nearest neighbour distance calculation?

A

Normalize the data so one attribute with a much larger range doesn't dramatically affect the distance calculation.
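A min-max normalization sketch, rescaling an attribute to [0, 1]:

```python
def min_max_normalize(values):
    """Rescale a list of numbers to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

min_max_normalize([10_000, 15_000, 20_000])  # [0.0, 0.5, 1.0]
```

After normalization, salary and age contribute on the same scale to the distance calculation.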

28
Q

What's an obvious disadvantage of the k-nearest neighbour algorithm?

A

Because it doesn't build a model, it has to compute the distance to the stored records every time it classifies a new record. This is a relatively expensive operation.

29
Q

What’s a Naive Bayes Classifier?

A

A Naive Bayes Classifier uses Bayes theorem to compute the probability of a class label.

30
Q

What is an advantage of using a Naive Bayes Classifier?

A

It’s easy to implement and generally gets good results

31
Q

What’s a disadvantage of using a Naive Bayes Classifier?

A

It assumes attributes are independent, which is not likely in the real world for some cases; usually there are dependencies in the data.

32
Q

What is general idea for a decision tree classification?

A

Make a tree and split it on attribute values. Each internal node tests an attribute; decisions are made at these points. Each leaf node holds a class label.

33
Q

What are some issues that you need to consider when developing a decision tree?

A

How to split the data for decision-making. How to optimize the decision-making part.

34
Q

What’s the difference between a binary split and a multi-way split?

A

Binary split is like a yes/no split. Two possible outcomes. Multiway split is where there is more than 2 possible outcomes.

35
Q

How do you determine the best split when making a decision tree?

A

Pick the split with the lowest impurity, e.g. the lowest Gini index.

36
Q

What is the Gini index?

A

The Gini index is a measure of the impurity of data, used to determine splits in a decision tree.
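The Gini index for a set of class labels can be sketched as:

```python
def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

gini(["salmon", "salmon", "other", "other"])  # 0.5 (maximally impure for two classes)
gini(["salmon"] * 4)                          # 0.0 (pure node)
```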

37
Q

What is the issue of overfitting?

A

Overfitting, in the context of a decision tree, is where your decision tree matches the training data too precisely. The decision tree may be capturing noise.

38
Q

What’s splitting based on classification error?

A

TODO

39
Q

What is underfitting?

A

Underfitting is when a model is too simple; both training and test errors are large.

40
Q

How do we fix the issue of overfitting?

A

Stop the algorithm before it makes a fully grown tree:

  1. Stop if all instances belong to the same class.
  2. Stop if all the attribute values are the same.
  3. Stop if splitting doesn't improve the Gini index.

41
Q

What is an advantage of decision tree based classification?

A
  1. It's inexpensive to construct.
  2. Extremely fast at classifying unknown records.
  3. Easy to interpret for small-sized trees.
  4. Accuracy is comparable to other classification techniques for many simple data sets.
42
Q

What is cluster analysis?

A

Cluster analysis is finding objects that belong to a particular group. You want the distance between objects within a group to be small, and the distance between groups to be large. In other words, you want groups of data to look distinct from the other groups.

43
Q

What are some real-world examples of when we use cluster analysis?

A

Grouping of stocks, websites, genes and proteins. When we want to summarize large datasets.

44
Q

Clustering is used as a ______________ tool.

A

Clustering is often used as a pre-processing tool before performing a classification or recommendation.

45
Q

How do you define good clustering?

A

High similarity within clusters: you want each cluster to be cohesive. Low similarity between clusters: you want clusters to be distinct from one another.

46
Q

Why is clustering ambiguous?

A

Clustering is ambiguous because determining a cluster can be subjective. Where to divide? How many clusters?

47
Q

What’s the difference between hierarchical and partition clustering?

A

Hierarchical clustering produces nested clusters organized as a tree. Partition clustering organizes the data into separate groups with no overlaps.

48
Q

What’s the general idea for K-means clustering?

A

Each cluster has a centroid. It’s a center point which you measure data points to. The number of clusters is determined at the beginning.

49
Q

What is the process of K-means clustering algorithm?

A
  1. Set initial centroid points.
  2. Assign data points to the nearest centroid.
  3. Recompute each centroid as the average of all the points in its cluster, i.e. the centroid moves.
  4. Repeat steps 2-3 until the centroids don't move, i.e. the "right" clustering has happened.

Note: sometimes we relax step 4 and stop once the centroids move only very little.
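Steps 1-4 can be sketched as a plain k-means loop (toy data; random sampling stands in for centroid initialization):

```python
import math
import random

def k_means(points, k, iterations=100, seed=0):
    """Assign points to the nearest centroid, recompute centroids, repeat until stable."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # step 1: initial centroids
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                       # step 2: assign to nearest centroid
            i = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[i].append(p)
        new = [                                # step 3: recompute centroids
            tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new == centroids:                   # step 4: stop when centroids don't move
            break
        centroids = new
    return centroids, clusters

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids, clusters = k_means(points, 2)  # centroids converge to (0, 0.5) and (10, 10.5)
```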

50
Q

How do you validate the K-means cluster?

A

Use the Sum of Squared Error (SSE) formula. The error is the distance from a point to its nearest cluster centroid; square each error and sum over all points.
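SSE can be computed directly from the clusters and their centroids:

```python
import math

def sse(clusters, centroids):
    """Sum of squared distances from each point to its own cluster centroid."""
    return sum(
        math.dist(p, c) ** 2
        for cluster, c in zip(clusters, centroids)
        for p in cluster
    )

sse([[(0, 0), (0, 1)]], [(0, 0.5)])  # 0.25 + 0.25 = 0.5
```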

51
Q

Why is picking initial centroids important?

A

Depending on the data spread, badly chosen initial centroids may cause the algorithm to converge to clusters that don't actually represent the data.

52
Q

What is the solution to the initial centroid problem?

A

Do multiple runs with different initial centroids and see if there's a difference.

53
Q

What is post-processing for k-means clustering?

A

  • Eliminate ‘small’ clusters that may represent outliers.
  • Split ‘loose’ clusters (clusters with relatively high SSE).
  • Merge ‘close’ clusters (clusters with relatively low SSE).

54
Q

What are some limitations of K-means clustering?

A
  • Different sizes
  • Different densities
  • Different shapes (non-circular)
55
Q

How do we measure distance in the k-means algorithm for complex data types?

A

Use Minkowski distance:

distance = [sum over attributes of |P attribute - Q attribute|^r]^(1/r)

r = 1 gives Manhattan distance

r = 2 gives Euclidean distance

r = infinity gives the Max (supremum) distance
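A quick sketch of Minkowski distance for the different r values:

```python
def minkowski(p, q, r):
    """(sum of |p_i - q_i|^r) ** (1/r); r=1 Manhattan, r=2 Euclidean."""
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1 / r)

d1 = minkowski((1, 2), (4, 6), 1)                       # 7.0 (Manhattan)
d2 = minkowski((1, 2), (4, 6), 2)                       # 5.0 (Euclidean)
dinf = max(abs(a - b) for a, b in zip((1, 2), (4, 6)))  # 4 (L-infinity: the largest term)
```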

56
Q

What's the difference between L1 (1-norm), L2 (2-norm), and L∞(X, Y)?

A

L1 (1-norm) is Manhattan distance.

L2 (2-norm) is Euclidean distance.

L∞(X, Y) is the largest absolute difference on any single attribute.

57
Q

What’s the formula for L1 (1-norm)?

A

L1(X, Y) = sum over all attributes of |x_i - y_i|, i.e. the sum of the absolute attribute differences.
58
Q

What’s the formula for L2(2-norm)?

A

L2(X, Y) = sqrt[sum over all attributes of (x_i - y_i)^2], i.e. the square root of the sum of squared differences.
59
Q

What's the formula for L∞(X, Y)?

A

L∞(X, Y) = max over all attributes of |x_i - y_i|, i.e. the largest single-attribute difference.
60
Q

Can you calculate the L1, L2 and Linfinity?

A

Example: X = (1, 2), Y = (4, 6).

L1 = |1 - 4| + |2 - 6| = 7

L2 = sqrt(3^2 + 4^2) = 5

L∞ = max(3, 4) = 4
61
Q

What’s Jaccard Distance?

A

Jaccard distance is 1 minus the Jaccard Coefficient

62
Q

What’s the Jaccard Coefficient?

A

The Jaccard Coefficient of two sets is the size of their intersection divided by the size of their union: J(A, B) = |A ∩ B| / |A ∪ B|.
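A minimal sketch of the Jaccard Coefficient, treating the objects as sets (J(A, B) = |A ∩ B| / |A ∪ B|):

```python
def jaccard_coefficient(a, b):
    """Intersection size over union size: 1 means identical sets, 0 means disjoint."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

j = jaccard_coefficient({1, 2, 3}, {2, 3, 4})  # 2/4 = 0.5
# Jaccard distance = 1 - j = 0.5
```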
63
Q

What’s cosine distance/similarity?

A

Cosine similarity is the dot product of two vectors divided by the product of their norms. For non-negative data it's between 0 and 1.

A = (1, 2, 3)

B = (4, 5, 6)

A . B = 1*4 + 2*5 + 3*6 = 32

||A|| = sqrt(1*1 + 2*2 + 3*3) = sqrt(14)

||B|| = sqrt(4*4 + 5*5 + 6*6) = sqrt(77)

cos(A, B) = (A . B) / (||A|| * ||B||) = 32 / (sqrt(14) * sqrt(77)) ≈ 0.97
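The A and B example, computed in Python:

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim = cosine_similarity((1, 2, 3), (4, 5, 6))  # ~0.9746
```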

64
Q

What is Edit distance? How do we measure similarity?

A

The number of changes that need to be made to transform one string to another.

“Ben” -> “Jen” = 1 edit distance

“Good” -> “Evil” = 4 edit distance.

Similarity = 1 / (1 + edit distance)
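Edit distance can be sketched with the classic dynamic-programming approach:

```python
def edit_distance(s, t):
    """Levenshtein distance: minimum number of edits to turn s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(
                prev[j] + 1,               # delete a character
                curr[j - 1] + 1,           # insert a character
                prev[j - 1] + (cs != ct),  # substitute (free if characters match)
            ))
        prev = curr
    return prev[-1]

edit_distance("Ben", "Jen")    # 1
edit_distance("Good", "Evil")  # 4
```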

65
Q

What is hierarchical clustering?

A

Hierarchical clustering is when clusters are nested inside parent clusters, forming a tree.

66
Q

What are the strengths of hierarchical clustering?

A

No assumption about the number of clusters.

Can be useful for taxonomy. Monkeys -> primates

67
Q

What is agglomerative and divisive hierarchical clustering?

A

Agglomerative, you start with each point as a single cluster and start to merge them together.

Divisive, you start with one cluster and divide as you go.

68
Q

What are the four ways we can cluster data in agglomerative clustering?

A
  1. Min (single linkage)
  2. Max (complete linkage)
  3. Group Average
  4. Centroid distance
69
Q

What is min/single linkage?

A

Cluster similarity is measured between the two closest points, one in each cluster.

70
Q

What is max/complete linkage?

A

Cluster similarity is measured between the two furthest points, one in each cluster.

71
Q

What is a limitation of min/single linkage?

A

It is sensitive to noise and outliers.

72
Q

What is a limitation of max/complete linkage?

A

It can break large clusters

73
Q

What is group average cluster similarity?

A

Sum the distances between every point in one cluster and every point in the other cluster, then divide by the product of the two cluster sizes.

74
Q

What is centroid-distance inter-cluster similarity in agglomerative clustering?

A

Calculate the similarity between clusters using the centroids of each cluster

75
Q

How do you calculate F-Measure?

A

F = 2 * (Precision * Recall) / (Precision + Recall)
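A quick sanity check of the formula (the input values are made up):

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f_measure(0.8, 0.8)  # 0.8 (equal precision and recall give that same value)
f_measure(0.9, 0.5)  # ~0.64 (pulled toward the weaker of the two)
```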

76
Q

If you have the precision and recall of two different models. How do you determine which one is better?

A

Use the F-measure:

F-Measure = 2 * (Precision * Recall) / (Precision + Recall)

The closer to 1, the better.
