Midsemester Exam Flashcards

To revise content for the midsemester exam

1
Q

What’s the difference between Active and Passive learning?

A

Passive learning is where an expert tells you all the features that help you classify the data and you memorise them. Active learning is where an expert classifies a dataset for you and you discover the features for yourself.

2
Q

What is the process of Classification?

A
  1. Get a training set of data.
  2. Apply a learning algorithm to the training set to create a model.
  3. Apply the model to new, unseen data to classify it.
3
Q

What is one way you can classify data?

A

Using a decision tree

4
Q

Why don’t you explicitly tell the model the characteristics of data?

A

This can be difficult to do. For example, how do you determine that an email is spam? Can you list all the features of a spam email? It is much easier to just tell the model which emails are spam and which are not.

5
Q

What is the difference between Classification and Clustering?

A

Classification is supervised learning: you are the expert who supervises the model with labelled data. Clustering is unsupervised learning: there are no class labels.

6
Q

What are the two steps in the classification process?

A
  1. Model construction
  2. Model usage
7
Q

Explain the model construction step in the classification process.

A

To construct a model we need to:

  • Get a training set.
  • Determine class labels for the training set, e.g. salmon / not salmon.
  • Determine how to represent the model, e.g. using decision trees, mathematical formulas, or rules.
8
Q

What do we do in the model usage step?

A

Check the accuracy of the model. How? By checking that the test samples' class labels are correctly predicted by the model.

9
Q

What’s the difference between training data and testing data?

A

Training data and testing data are both partitions of the dataset. The testing data's class labels are not known by the model; the model's job is to accurately predict them.

10
Q

What is one way of determining the accuracy of a model?

A

Use a confusion matrix to determine the accuracy of the model

11
Q

How do you use a confusion matrix to determine the accuracy of a model?

A

General idea: the number of correctly classified tuples divided by all the tuples.

Accuracy = (True Positive + True Negative) / (True Positive + False Positive + True Negative + False Negative)
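The accuracy calculation can be sketched in a few lines of Python (the confusion-matrix counts here are made up for illustration):

```python
# Hypothetical confusion-matrix counts for a binary classifier
tp, fp, tn, fn = 50, 10, 30, 10

# Accuracy = correctly classified tuples / all tuples
accuracy = (tp + tn) / (tp + fp + tn + fn)  # 0.8
```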

12
Q

What is a disadvantage of the Confusion Matrix method of testing model accuracy?

A

Consider an example of 10,000 fish to be classified where only 10 are salmon. The accuracy calculation will be very high (99.9%) even if the model misses every salmon, because the salmon are such a small fraction of the dataset. The high accuracy can hide the fact that the model isn't detecting the rare class correctly.

13
Q

What is precision and recall?

A

Precision: exactness – what % of tuples that the classifier labeled as positive are actually positive?

Recall: completeness – what % of positive tuples did the classifier correctly label as positive?

14
Q

How do we calculate precision?

A

Precision = True Positive / (True Positive + False Positive)

15
Q

How do we calculate recall?

A

Recall = True Positive / (True Positive + False Negative)
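Both formulas can be sketched together in Python (hypothetical counts):

```python
# Hypothetical confusion-matrix counts
tp, fp, fn = 40, 10, 10

precision = tp / (tp + fp)  # what % of predicted positives are truly positive -> 0.8
recall = tp / (tp + fn)     # what % of actual positives were found -> 0.8
```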

16
Q

What is the holdout method of estimation?

A

Take e.g. 70% of the data as the training set and the remaining 30% as the testing set. Do this a couple of times with different random splits.
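A minimal sketch of the holdout split in Python (the 70/30 ratio and the helper name are just for illustration):

```python
import random

def holdout_split(data, train_frac=0.7, seed=0):
    """Shuffle the data, then cut it into training and testing partitions."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

train, test = holdout_split(list(range(10)))  # 7 training records, 3 testing records
```

Repeating the call with different seeds gives the "do this a couple of times" part.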

17
Q

What is the cross validation method of estimation?

A

Divide the dataset into subsets. Use each subset as the test set once, with the remaining subsets as the training set. Do this for all subsets.

Example with 10 subsets:
Step 1: Subset 1 -> test set; Subsets 2-10 -> training set.
Step 2: Subset 2 -> test set; Subsets 1, 3-10 -> training set.
...and so on for all 10 subsets.
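The subset rotation can be sketched as follows (the helper name is illustrative):

```python
def k_fold_splits(data, k):
    """Yield (test_subset, training_records) pairs; every record is tested exactly once."""
    folds = [data[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield test, train

splits = list(k_fold_splits(list(range(10)), 5))  # 5 rounds of test/train partitions
```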

18
Q

What are the three main algorithms for classifying data?

A

Nearest Neighbour, Bayes, and Decision Tree

19
Q

What’s an instance-based classifier?

A

A rote-learner model of classification: it memorises the training data, and a record has to match a stored example exactly for the classification to work.

20
Q

What are the three things you need to get K-nearest neighbour to work?

A

You need three things:
  1. The stored training records.
  2. A way to compute the distance to neighbours.
  3. A choice of how many neighbours (k) to compare to.

21
Q

What are some things to consider for classification algorithms?

A

Accuracy - does it correctly classify class labels?

Speed - time to build the model, time to use the model.

Robustness - handling signal and noise.

Scalability - handling large amounts of data.

Interpretability - understanding the insight from the model.

22
Q

What are similarity/dissimilarity in the context of classification algorithms?

A

Similarity is a measure of how alike two data objects are; 1 means identical. Dissimilarity is a measure of how different they are; 0 means identical.

23
Q

What measure do we use to compute similarity between data objects?

A

Use Euclidean distance: the square root of the sum of the squared attribute differences.

Example: John (45 yo, $10,000), Kelly (34 yo, $15,000)

Distance = sqrt[(45 - 34)^2 + (10,000 - 15,000)^2]
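The John/Kelly example, computed directly:

```python
import math

john = (45, 10_000)
kelly = (34, 15_000)

# Square root of the sum of squared attribute differences
distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(john, kelly)))  # ~5000.01
```

Note how the salary difference dwarfs the age difference, which is why normalizing the attributes matters.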

24
Q

What are the steps of the K-nearest algorithm/classification?

A
  1. Compute the Euclidean distance from the data point to the other data points.
  2. Get the class label from the nearest neighbours. How? A majority vote, weighted by the weight factor.
  3. What's the weight factor? Weight = 1/distance^2
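The steps above can be sketched as follows (the dataset and helper name are hypothetical):

```python
import math
from collections import defaultdict

def knn_predict(query, records, k=3):
    """records: list of (point, class_label). Weighted majority vote, weight = 1/distance^2."""
    nearest = sorted((math.dist(query, p), label) for p, label in records)[:k]
    votes = defaultdict(float)
    for d, label in nearest:
        votes[label] += 1.0 / (d * d) if d else float("inf")
    return max(votes, key=votes.get)

records = [((0, 0), "A"), ((1, 0), "A"), ((5, 5), "B"), ((6, 5), "B")]
knn_predict((0.5, 0.2), records, k=3)  # "A"
```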
25
Q

Why is k-nearest neighbour considered lazy?

A

It doesn’t build a model. It remembers all the training data and has to compute distance to neighbours every time it is run.

26
Q

What are the issues with choosing a k-value for nearest neighbour?

A

If k is too small, the classification is sensitive to noise points. If k is too large, the neighbourhood may include points from other classes.

27
Q

How do we stop one attribute from dominating the k-nearest neighbour distance calculation?

A

Normalize the data so one attribute with a much larger range doesn't dramatically affect the distance calculation.
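A min-max normalization sketch, rescaling an attribute to [0, 1]:

```python
def min_max_normalize(values):
    """Rescale a list of numbers to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

min_max_normalize([10_000, 15_000, 20_000])  # [0.0, 0.5, 1.0]
```

After normalization, salary and age contribute on the same scale to the distance calculation.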

28
Q

What's an obvious disadvantage of the k-nearest neighbour algorithm?

A

Because it doesn't build a model, it has to compute the distance to the stored records every time it classifies a new record. This is a relatively expensive operation.

29
Q

What’s a Naive Bayes Classifier?

A

A Naive Bayes Classifier uses Bayes theorem to compute the probability of a class label.

30
Q

What is an advantage of using a Naive Bayes Classifier?

A

It’s easy to implement and generally gets good results

31
Q

What’s a disadvantage of using a Naive Bayes Classifier?

A

It assumes attributes are independent, which is not likely in the real world for some cases; usually there are dependencies in the data.

32
Q

What is general idea for a decision tree classification?

A

Make a tree and split it on attribute values. Each internal node tests an attribute; decisions are made at these points. Each leaf node holds a class label.

33
Q

What are some issues that you need to consider when developing a decision tree?

A

How to split the data for decision-making. How to optimize the decision-making part.

34
Q

What’s the difference between a binary split and a multi-way split?

A

Binary split is like a yes/no split. Two possible outcomes. Multiway split is where there is more than 2 possible outcomes.

35
Q

How do you determine the best split when making a decision tree?

A

Pick the split with the lowest impurity, e.g. the lowest Gini index.

36
Q

What is the Gini index?

A

The Gini index is a measure of the impurity of data, used to determine splits in a decision tree.
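The Gini index for a set of class labels can be sketched as:

```python
def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

gini(["salmon", "salmon", "other", "other"])  # 0.5 (maximally impure for two classes)
gini(["salmon"] * 4)                          # 0.0 (pure node)
```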

37
Q

What is the issue of overfitting?

A

Overfitting, in the context of a decision tree, is where your decision tree matches the training data too precisely. The decision tree may be capturing noise.

38
Q

What’s splitting based on classification error?

A

TODO

39
Q

What is underfitting?

A

Underfitting is when a model is too simple; both training and test errors are large.

40
Q

How do we fix the issue of overfitting?

A

Stop the algorithm before it makes a fully grown tree:

  1. Stop if all instances belong to the same class.
  2. Stop if all the attribute values are the same.
  3. Stop if splitting doesn't improve the Gini index.

41
Q

What is an advantage of decision tree based classification?

A
  1. It's inexpensive to construct.
  2. Extremely fast at classifying unknown records.
  3. Easy to interpret for small-sized trees.
  4. Accuracy is comparable to other classification techniques for many simple data sets.
42
Q

What is cluster analysis?

A

Cluster analysis is finding objects that belong to a particular group. You want the distance between objects within a group to be small, and the distance between groups to be large. In other words, you want groups of data to look distinct from the other groups.

43
Q

What are some real-world examples of when we use cluster analysis?

A

Grouping of stocks, websites, genes and proteins. When we want to summarize large datasets.

44
Q

Clustering is used as a ______________ tool.

A

Clustering is often used as a pre-processing tool before performing a classification or recommendation.

45
Q

How do you define good clustering?

A

High similarity within clusters: you want each cluster to be cohesive. Low similarity between clusters: you want clusters to be distinct from one another.

46
Q

Why is clustering ambiguous?

A

Clustering is ambiguous because determining a cluster can be subjective. Where to divide? How many clusters?

47
Q

What’s the difference between hierarchical and partition clustering?

A

Hierarchical clustering produces nested clusters organized as a tree. Partition clustering organizes the data into separate groups with no overlaps.

48
Q

What’s the general idea for K-means clustering?

A

Each cluster has a centroid. It’s a center point which you measure data points to. The number of clusters is determined at the beginning.

49
Q

What is the process of K-means clustering algorithm?

A
  1. Set initial centroid points.
  2. Assign data points to the nearest centroid.
  3. Recompute each centroid as the average of all the points in its cluster, i.e. the centroid moves.
  4. Repeat steps 2-3 until the centroids don't move, i.e. the "right" clustering has happened.

Note: sometimes we relax step 4 and stop once the centroids move only very little.
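Steps 1-4 can be sketched as a plain k-means loop (toy data; random sampling stands in for centroid initialization):

```python
import math
import random

def k_means(points, k, iterations=100, seed=0):
    """Assign points to the nearest centroid, recompute centroids, repeat until stable."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # step 1: initial centroids
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                       # step 2: assign to nearest centroid
            i = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[i].append(p)
        new = [                                # step 3: recompute centroids
            tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new == centroids:                   # step 4: stop when centroids don't move
            break
        centroids = new
    return centroids, clusters

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids, clusters = k_means(points, 2)  # centroids converge to (0, 0.5) and (10, 10.5)
```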

50
Q

How do you validate the K-means cluster?

A

Use the Sum of Squared Error (SSE) formula. The error is the distance from a point to its nearest cluster centroid; square each error and sum over all points.
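SSE can be computed directly from the clusters and their centroids:

```python
import math

def sse(clusters, centroids):
    """Sum of squared distances from each point to its own cluster centroid."""
    return sum(
        math.dist(p, c) ** 2
        for cluster, c in zip(clusters, centroids)
        for p in cluster
    )

sse([[(0, 0), (0, 1)]], [(0, 0.5)])  # 0.25 + 0.25 = 0.5
```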

51
Q

Why is picking initial centroids important?

A

Depending on the data spread, badly chosen initial centroids may cause the algorithm to converge to clusters that don't actually represent the data.

52
Q

What is the solution to the initial centroid problem?

A

Do multiple runs with different initial centroids and see if there's a difference.

53
Q

What is post-processing for k-means clustering?

A

  • Eliminate ‘small’ clusters that may represent outliers.
  • Split ‘loose’ clusters (clusters with relatively high SSE).
  • Merge ‘close’ clusters (clusters with relatively low SSE).

54
Q

What are some limitations of K-means clustering?

A
  • Different sizes
  • Different densities
  • Different shapes (non-circular)
55
Q

How do we measure distance in the k-means algorithm for complex data types?

A

Use Minkowski distance:

distance = [sum over attributes of |P attribute - Q attribute|^r]^(1/r)

r = 1 gives Manhattan distance

r = 2 gives Euclidean distance

r = infinity gives the Max (supremum) distance
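A quick sketch of Minkowski distance for the different r values:

```python
def minkowski(p, q, r):
    """(sum of |p_i - q_i|^r) ** (1/r); r=1 Manhattan, r=2 Euclidean."""
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1 / r)

d1 = minkowski((1, 2), (4, 6), 1)                       # 7.0 (Manhattan)
d2 = minkowski((1, 2), (4, 6), 2)                       # 5.0 (Euclidean)
dinf = max(abs(a - b) for a, b in zip((1, 2), (4, 6)))  # 4 (L-infinity: the largest term)
```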

56
Q

What's the difference between L1 (1-norm), L2 (2-norm), and L∞(X, Y)?

A

L1 (1-norm) is Manhattan distance.

L2 (2-norm) is Euclidean distance.

L∞(X, Y) is the largest absolute difference on any single attribute.

57
Q

What’s the formula for L1 (1-norm)?

A

L1(X, Y) = sum over all attributes of |x_i - y_i|, i.e. the sum of the absolute attribute differences.
58
Q

What’s the formula for L2(2-norm)?

A

L2(X, Y) = sqrt[sum over all attributes of (x_i - y_i)^2], i.e. the square root of the sum of squared differences.
59
Q

What's the formula for L∞(X, Y)?

A

L∞(X, Y) = max over all attributes of |x_i - y_i|, i.e. the largest single-attribute difference.
60
Q

Can you calculate the L1, L2 and Linfinity?

A

Example: X = (1, 2), Y = (4, 6).

L1 = |1 - 4| + |2 - 6| = 7

L2 = sqrt(3^2 + 4^2) = 5

L∞ = max(3, 4) = 4
61
Q

What’s Jaccard Distance?

A

Jaccard distance is 1 minus the Jaccard Coefficient

62
Q

What’s the Jaccard Coefficient?

A

The Jaccard Coefficient of two sets is the size of their intersection divided by the size of their union: J(A, B) = |A ∩ B| / |A ∪ B|.
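A minimal sketch of the Jaccard Coefficient, treating the objects as sets (J(A, B) = |A ∩ B| / |A ∪ B|):

```python
def jaccard_coefficient(a, b):
    """Intersection size over union size: 1 means identical sets, 0 means disjoint."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

j = jaccard_coefficient({1, 2, 3}, {2, 3, 4})  # 2/4 = 0.5
# Jaccard distance = 1 - j = 0.5
```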
63
Q

What’s cosine distance/similarity?

A

Cosine similarity is the dot product of two vectors divided by the product of their norms. For non-negative data it's between 0 and 1.

A = (1, 2, 3)

B = (4, 5, 6)

A . B = 1*4 + 2*5 + 3*6 = 32

||A|| = sqrt(1*1 + 2*2 + 3*3) = sqrt(14)

||B|| = sqrt(4*4 + 5*5 + 6*6) = sqrt(77)

cos(A, B) = (A . B) / (||A|| * ||B||) = 32 / (sqrt(14) * sqrt(77)) ≈ 0.97
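The A and B example, computed in Python:

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim = cosine_similarity((1, 2, 3), (4, 5, 6))  # ~0.9746
```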

64
Q

What is Edit distance? How do we measure similarity?

A

The number of changes that need to be made to transform one string to another.

“Ben” -> “Jen” = 1 edit distance

“Good” -> “Evil” = 4 edit distance.

Similarity = 1 / (1 + edit distance)
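Edit distance can be sketched with the classic dynamic-programming approach:

```python
def edit_distance(s, t):
    """Levenshtein distance: minimum number of edits to turn s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(
                prev[j] + 1,               # delete a character
                curr[j - 1] + 1,           # insert a character
                prev[j - 1] + (cs != ct),  # substitute (free if characters match)
            ))
        prev = curr
    return prev[-1]

edit_distance("Ben", "Jen")    # 1
edit_distance("Good", "Evil")  # 4
```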

65
Q

What is hierarchical clustering?

A

Hierarchical clustering is when clusters are nested inside parent clusters, forming a tree.

66
Q

What are the strengths of hierarchical clustering?

A

No assumption about the number of clusters.

Can be useful for taxonomy. Monkeys -> primates

67
Q

What is agglomerative and divisive hierarchical clustering?

A

Agglomerative, you start with each point as a single cluster and start to merge them together.

Divisive, you start with one cluster and divide as you go.

68
Q

What are the four ways we can cluster data in agglomerative clustering?

A
  1. Min (single linkage)
  2. Max (complete linkage)
  3. Group Average
  4. Centroid distance
69
Q

What is min/single linkage?

A

Cluster similarity is measured between the two closest points, one in each cluster.

70
Q

What is max/complete linkage?

A

Cluster similarity is measured between the two furthest points, one in each cluster.

71
Q

What is a limitation of min/single linkage?

A

It is sensitive to noise and outliers.

72
Q

What is a limitation of max/complete linkage?

A

It can break large clusters

73
Q

What is group average cluster similarity?

A

Sum the distances between every point in one cluster and every point in the other cluster, then divide by the product of the two cluster sizes.

74
Q

What is centroid-distance inter-cluster similarity in agglomerative clustering?

A

Calculate the similarity between clusters using the centroids of each cluster

75
Q

How do you calculate F-Measure?

A

F = 2 * (Precision * Recall) / (Precision + Recall)
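A quick sanity check of the formula (the input values are made up):

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f_measure(0.8, 0.8)  # 0.8 (equal precision and recall give that same value)
f_measure(0.9, 0.5)  # ~0.64 (pulled toward the weaker of the two)
```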

76
Q

If you have the precision and recall of two different models. How do you determine which one is better?

A

Use the F-measure:

F-Measure = 2 * (Precision * Recall) / (Precision + Recall)

The closer to 1, the better.
