Pre-exam Flashcards

1
Q

What is model underfitting?

A

When the model has yet to learn the true structure of the data. Performs poorly on both training and test sets.

2
Q

What is model overfitting?

A

The model is too complex and fits the training data so closely (including its noise) that it is no longer able to generalize to unseen data.

3
Q

When a model is overfit, what happens to the test and training error rate?

A

Test error rate increases; training error rate continues to decrease.

4
Q

What are some reasons overfitting can occur?

A

Fitting noise points; not enough representative data.

5
Q

What is MDL?

A

The minimum description length (MDL) principle is a formalization of Occam’s razor in which the best hypothesis for a given set of data is the one that leads to the best compression of the data. - wikipedia

6
Q

What is a validation set used for when building decision trees?

A

The data set is divided into two smaller subsets: one for training (typically 2/3) and the other, the validation set, for estimating the generalization error (typically 1/3).

7
Q

What is prepruning?

A

A method used on decision trees to stop growing the tree before it is fully grown.

8
Q

What is post pruning?

A

Pruning a decision tree after it has been constructed.

9
Q

What are 4 methods used for evaluating the performance of a classifier?

A
  1. Holdout (part of the data for training, the rest for testing)
  2. Random subsampling (repeating holdout several times)
  3. Cross-validation (each record is used the same # of times for training and exactly once for testing)
  4. Bootstrap (sampling with replacement)
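
As a concrete illustration, here is a minimal sketch of the holdout and cross-validation options using scikit-learn; the iris data set and the decision-tree classifier are arbitrary choices for the example, not part of the card.

```python
# Minimal sketch: holdout (2/3 train, 1/3 test) and 10-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Holdout: train on 2/3 of the data, estimate error on the held-out 1/3.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("holdout accuracy:", clf.score(X_test, y_test))

# Cross-validation: every record is used exactly once for testing.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print("10-fold CV accuracy:", scores.mean())
```
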
10
Q

What is a rule antecedent?

A

The condition for a rule.

11
Q

What is a rule consequent?

A

The end result for a rule (a class).

12
Q

In a rule set, what is coverage?

A

The fraction of records in a data set that trigger the rule (satisfy the antecedent).

e.g. a rule triggered by 5 records in a data set of size 10 yields a coverage of 0.5.

13
Q

In a rule set, what is accuracy?

A

The fraction of records triggered by the rule whose class label equals the rule's consequent y.

That is, the fraction of records that satisfy both the antecedent and the consequent, out of the records that satisfy the antecedent.

14
Q

What is a mutually exclusive rule?

A

The rules are independent of each other.

Every record is covered by no more than 1 rule (a record is not covered by 2 or more rules).

15
Q

What are exhaustive rules?

A

Accounts for every possible combination of attribute values; each record is covered by at least one rule.

16
Q

What is a default rule?

A

It has an empty antecedent and a default class. It is triggered when all other rules have failed (if used).

17
Q

What are ordered rules?

A

Rules that are ordered in decreasing priority (by whatever criterion is defined, such as accuracy, coverage, etc…). Also known as a decision list.

18
Q

What are the two methods for extracting classification rules?

A
  1. Direct methods, which extract rules directly from the data
  2. Indirect methods, which extract rules from other classification models such as decision trees.
19
Q

Explain the sequential covering algorithm.

A
  1. Start from an empty rule
  2. Grow a rule using the Learn-One-Rule function
  3. Remove the training records covered by the rule
  4. Repeat steps 2 and 3 until a stopping criterion is met.
20
Q

What are the two rule growing strategies?

A

General to specific

Specific to general

21
Q

What are some advantages of rule based classifiers?

A

As highly expressive as decision trees

Easy to interpret

Can classify new instances rapidly

Performance is comparable to decision trees

22
Q

What is a rote-learner?

A

Memorizes the entire training data and performs classification only if attributes of a record match one of the training examples exactly

23
Q

What is an instance-based classifier?

A

Stores the training records.

Uses the training records to predict the class label of unseen cases.

24
Q

What is a Voronoi diagram?

A

It's basically a model for KNN. It splits the solution space into cells: points on a line are at equal distance to the two parent points, and points at an intersection are at equal distance to 3 parent points.

25
Q

What is the algorithm for k-nearest neighbour?

A

1. Compute the distance between the test record and each training record (e.g. Euclidean distance)
2. Identify the k nearest neighbours
3. Determine the class from the nearest-neighbour list:
- take the majority vote of class labels among the k-nearest neighbours, or
- weight the vote according to distance

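A minimal sketch of this algorithm in Python with a plain majority vote; the toy training records and the choice of k = 3 are invented for illustration.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train, test_point, k=3):
    # 1. Compute the distance from the test point to every training record.
    neighbours = sorted(train, key=lambda rec: euclidean(rec[0], test_point))[:k]
    # 2. Take the majority vote of the class labels among the k nearest.
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((5.0, 5.0), "B"), ((4.8, 5.2), "B")]
print(knn_predict(train, (1.1, 0.9)))  # -> "A"
```
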
26
Q

What 3 things does KNN classifiers require?

A

- The set of stored records
- A distance metric to compute the distance between records
- The value of k, the number of nearest neighbours to retrieve

27
Q

What can be the issue with choosing the value of K for KNN?

A
  • If k is too small, the model is sensitive to noise points
  • If k is too large, the neighbourhood may include points from other classes
28
Q

Explain scaling issues with distance based classifiers.

A

Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes. E.g. height may vary from 1.5m to 1.8m while weight may vary from 90lb to 300lb.

29
Q

What are some of the characteristics of a KNN classifier?

A
  • It's a lazy learner
  • It does not build a model explicitly
  • Classifying unknown records can be relatively expensive
30
Q

What is a Bayes classifier?

A

A probabilistic framework for solving classification problems. It uses conditional probability.

31
Q

What is Bayes theorem?

A

P( Y | X ) = P( X | Y ) P( Y ) / P( X )

It relates the posterior probability P(Y|X) to the class-conditional probability P(X|Y), the prior P(Y), and the evidence P(X).
32
Q

What is a Naive Bayes Classifier?

A

It assumes independence among attributes, given the class.

33
Q

Why is a large data set required for Bayes classifiers?

A

Smaller data sets mean individual rows have more influence.

Need repeated passes over the data

34
Q

How is the probability for continuous attributes used in Bayes theorem?

A

Two options:

  1. Discretize the range of values into bins
  • can use a two-way split (A < v) or (A > v)

  2. Probability density estimation
  • assume the values follow a normal distribution
  • use the data to estimate the parameters of the distribution
  • use the distribution to estimate the conditional probability

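A minimal sketch of the density-estimation option, assuming a normal distribution; the attribute values below are invented toy numbers.

```python
import math

def gaussian_density(x, mean, var):
    # Normal pdf, used as an estimate of P(attribute = x | class).
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

# Toy continuous attribute values observed for one class:
values = [125.0, 100.0, 70.0, 120.0, 60.0, 220.0, 75.0]
mean = sum(values) / len(values)
var = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
print(gaussian_density(90.0, mean, var))  # estimate of P(attribute = 90 | class)
```
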
35
Q

What are some of the characteristics of Naive Bayes classifiers?

A

Robust to noise points

Can handle missing values by ignoring the instance during probability estimate calculations

Robust to irrelevant attributes

Independence assumption may not hold for some attributes (but other techniques such as Bayesian belief networks can be used instead)

36
Q

What is a Bayesian Belief Network?

A

A classifier similar to Naive Bayes, but it models attributes as dependent on one another.

This approach is more sophisticated, but more realistic.

37
Q

How is a Bayesian Belief Network represented?

A
  1. Create a directed acyclic graph (DAG) encoding the dependence relationships among a set of variables
  2. Create a probability table associating each node with its immediate parent nodes.
38
Q

What is the issue with an ANN that has many nodes?

What is the solution?

A

The more nodes you have, the more likely you are going to get stuck in a local minimum.

Solutions:

  1. Reduce the number of nodes
  2. Or, repeat the model building process several times
39
Q

What are the properties of a feedforward ANN?

A

Each layer only connects to the next layer.

There are no connections between nodes within a layer and no connections back to previous layers.

40
Q

What is an Acyclic Network (ANN)?

A

The connections between nodes in the same layer are cut out.

A given layer can still be connected to many other layers (unlike feed forward).

41
Q

What is the typical setup for an ANN with respect to classes and attributes?

A

The number of input nodes is equal to the # of attributes.

The number of output nodes is equal to the # of classes (with the exception of 2 classes – just uses 1 output node).

42
Q

Why is it important to remove redundant or irrelevant attributes in an ANN?

A

The # of attributes determines the # of input nodes, so it's important to remove redundant or irrelevant attributes to keep down the # of connections and avoid getting stuck in a local minimum.

43
Q

How is the output of an ANN interpreted?

A

A high value at an output node suggests a high probability of the input belonging to the class associated with the output node.

44
Q

Describe the back propagation algorithm used for ANNs.

A
  1. Present a sample to the input nodes
  2. Propagate the data through the layers
  3. Calculate the results at the output nodes
  4. Determine the error at the output nodes
  5. Propagate the error backwards to adjust the weights

Repeat until a stopping criterion is satisfied

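A minimal numpy sketch of these steps for a single sample on a 2-3-1 network; the architecture, learning rate, and data are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 2))  # input layer (2 nodes) -> hidden layer (3 nodes)
W2 = rng.normal(size=(1, 3))  # hidden layer -> output layer (1 node)
x = np.array([0.5, -0.2])     # sample presented to the input nodes
t = np.array([1.0])           # target value at the output node
lr = 0.1
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(1000):  # stopping criterion: fixed number of iterations
    h = sigmoid(W1 @ x)                            # propagate through the hidden layer
    y = sigmoid(W2 @ h)                            # calculate the result at the output node
    delta_out = (y - t) * y * (1 - y)              # error at the output node
    delta_hid = (W2.T @ delta_out) * h * (1 - h)   # error propagated backwards
    W2 -= lr * np.outer(delta_out, h)              # adjust the weights
    W1 -= lr * np.outer(delta_hid, x)

print(y[0])  # moves toward the target of 1.0
```
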
45
Q

What are the 3 types of ANN learning?

A

Correlation

Competitive learning

Feedback-based

46
Q

Describe the Correlation learning method for ANNs.

A

Correlation (similar things should be more similar and less similar things should be made less similar)

  • When different neurons fire at the same time, the weight representing the connection between them should be increased (associative learning).
47
Q

Describe the competitive learning, learning method for ANNs.

A

Competitive learning

  • For each sample (set of input values), one node will be the best match. The weights between this winner and the input nodes should be increased.
48
Q

Describe the Feedback-based learning method for ANNs.

A

Correct behavior should be rewarded by increasing weights

49
Q

What are some applications of neural networks?

A

pattern association

character recognition

image compression

classification

forecasting

optimization

etc…

50
Q

What are some characteristics of ANNs?

A

Slow

Poor interpretability of results

Able to approximate any target function

Can ignore irrelevant or redundant attributes

Easy to parallelize

May converge to a local minimum (can be mitigated with simulated annealing)

51
Q

What is an epoch with respect to ANNs?

A

A full application of a complete data set to a neural network.

52
Q

What is simulated annealing?

A

A momentum term to avoid falling into a local minimum.

A noise term used over multiple iterations: at first the term is very noisy, and it gets less noisy as the model becomes more accurate. Tries to prevent getting stuck in a local minimum.

53
Q

How many training sets are required in an ANN?

A

The number of training examples has to be larger than the number of weights divided by (1 - accuracy).

Training examples > weights / (1 - accuracy)

54
Q

How is an ANN pruned?

A

Train a neural network.

While the performance doesn’t degrade by more than a threshold value {

  • delete a node or connection and retrain the network

}

Connections with small weights (irrelevant or redundant attributes) are pruned.

55
Q

What preprocessing techniques are required for ANNs?

A
  • remove redundant or irrelevant features
  • transform to numerical values
  • normalize data to the range of 0 to 1 or -1 to 1
56
Q

How do support vector machines represent the decision boundary?

A

Using a subset of the training examples, known as support vectors.

57
Q

Describe Maximum Margin Hyperplanes for SVMs.

A

There are infinitely many hyperplanes (ways to split classes in a model).

The SVM must choose one of the hyperplanes to represent its decision boundary, based on how well they are expected to perform on test examples.

Trying to maximize the width of the “road” / margin.

58
Q

How is the decision boundary selected in SVMs?

A

The goal is to maximize the width of the margin “road”.

59
Q

Why is it important for an SVM to maximize the margin of their decision boundary?

A

To ensure that their worst-case generalization errors are minimized.

60
Q

What are support vectors?

A

The training points that lie on (or closest to) the maximized margins. SVMs represent the decision boundary using this subset of the training examples, called support vectors.

61
Q

What approach can be used for SVMs when decision boundaries aren’t as clear?

A

Allow for some contamination by using slack variables.

62
Q

What approach can be used if a support vector machine is not linear?

A

Math can be used to map the data to a new space and the new data can be used to classify.

In this case, the data needs to be numerical.

63
Q

Why are support vector machines a popular choice of classifier?

A

They don't get stuck in a local minimum the way ANNs can, especially when the data is transformed to a new space.

64
Q

What is required for data to be transformed to a new space when using SVMs?

A

The data must be linearly separable.

65
Q

What is the difference between a global model and a local model?

A

A global model is 1 model (SVMs)

A local model combines smaller models (KNN and ANNs)

66
Q

Does SVM try to find a global minimum or employ a greedy-based strategy to search the hypothesis space?

A

SVMs try to find the global minimum, unlike neural networks, which employ a greedy strategy to search the hypothesis space.

67
Q

What is an ensemble method for classification?

A

Improves classification accuracy by aggregating the predictions of multiple classifiers.

68
Q

Describe a high-level overview of how ensemble methods work.

A

An ensemble method constructs a set of base classifiers from training data and performs classification by taking a vote on the predictions made by each base classifier.

69
Q

Using ensemble methods, what can be done if one classifier performs better than the others?

A

If one classifier is better than others, you can weight the vote of the classifiers and make the better performing classifier have a higher weight.

70
Q

What is bagging with respect to data mining and ensemble methods?

A

Bagging (bootstrap aggregating): sample repeatedly with replacement from the original data to create new training sets, and train one base classifier per set (the records left out of each sample allow out-of-bag estimation).

Reduces variance and helps to avoid overfitting.

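A minimal sketch of bagging with decision trees as the base classifiers; scikit-learn, the iris data, and 25 rounds are assumptions made for the example.

```python
import numpy as np
from collections import Counter
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

models = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))  # sample WITH replacement
    models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Aggregate by majority vote over the 25 bootstrap models.
votes = np.array([m.predict(X) for m in models])
y_pred = np.array([Counter(col).most_common(1)[0][0] for col in votes.T])
print("training accuracy of the vote:", (y_pred == y).mean())
```
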
71
Q

What is boosting with respect to data mining and ensemble methods?

A

Give more emphasis to specific examples that are difficult to classify: assign them a higher weight (a greater probability of being selected).

Records that are wrongly classified will have their weights increased.

Records that are classified correctly will have their weights decreased.

72
Q

Describe the AdaBoost method used in ensembles.

A

AdaBoost creates many classifiers / models by repeatedly drawing samples.

Samples that are easy to classify get a lower weight, and ones that are harder to classify get a higher weight.

If any intermediate round produces an error rate higher than 50%, the weights are reverted and the resampling procedure is repeated.

The classifier also gets a weight.

73
Q

What conditions must be satisfied for ensemble methods to improve the overall classification accuracy?

A

1) the different classifiers make different mistakes in the data
2) the different classifiers perform better than random guessing

74
Q

Exam question: think about how to work with noisy data and what this means with respect to bias and variance.

A

A model with high variance and low bias tends to generalize new test instances well, but is susceptible to overfitting noisy data.

If data is noisy, perhaps it’s better to have high bias, and lower variance.

Choice of classifier is important.

Bagging and Boosting can help.

75
Q

What is bias?

A

Bias is a systematic shift from the ground truth.

The stronger the assumptions made by a classifier about the nature of its decision boundary, the larger the classifier's bias will be.

Design choices such as choice of algorithm can introduce bias too.

76
Q

What is variance?

A

Variance is a measure of spread of data

77
Q

Describe a model with high bias, but low variance.

A

A model constructed with high bias and low variance tends to underfit training data.

78
Q

Describe a model with high variance and low bias.

A

A model with high variance and low bias tends to generalize new test instances well, but is susceptible to overfitting noisy data.

79
Q

Describe a model with high bias and low variance.

A

A model constructed with low variance and high bias tends to underfit training data. Both underfitting and overfitting can lead to a model that performs poorly.

80
Q

What happens to the bias and variance as the complexity of a model grows (for example, the number of nodes in a decision tree).

A

If you increase the # of nodes (the complexity of the tree) the variance will increase and the bias will decrease.

81
Q

Describe overfitting with respect to the bias-variance decomposition.

A

Overfitting is modelling a random noise component in the data (model is too complex).

Increasing the complexity of the model means you have to estimate more parameters, and there is a greater probability of error.

Simpler models tend to have low variance and potentially higher bias.

Visualization can help you to pick a good model.

82
Q

What is the class imbalance problem?

A

Data sets with imbalanced class distributions.

E.g. Credit card fraud. It’s very rare that fraud exists, but when it does, it’s important and should be given a higher weight.

83
Q

What is accuracy measure?

A

It’s a metric used to compare the performance of classifiers.

Accuracy = ( TP + TN ) / ( TP + TN + FP + FN )

84
Q

What is a confusion matrix?

A

It tells you how accurate your model is by showing you the TP, FN, FP, and TN instances for a given classifier in a matrix format.

85
Q

What is a cost matrix?

A

A cost matrix assigns a cost to the TP, FN, FP, and TN instances. It's good for imbalanced classes. You can assign a high cost to instances that are classified incorrectly.

The goal is to have high accuracy and low cost.

86
Q

What is precision?

A

Determines the fraction of records that actually turn out to be positive in the group the classifier has declared as the positive class.

TP / ( TP + FP )

87
Q

What is recall?

A

Recall measures the fraction of positive examples correctly predicted by the classifier.

r = TP / ( TP + FN )

88
Q

What is true negative rate (specificity)?

A

The fraction of negative examples correctly predicted by the model.

TNR = TN / (TN + FP)

89
Q

What is true positive rate?

A

The fraction of positive examples predicted correctly by the model.

TPR = TP / ( TP + FN )

90
Q

What is F measure?

A

A measure that tries to maximize both precision and recall (it is their harmonic mean).

F1 = 2 x TP / ( 2 x TP + FP + FN )

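A minimal sketch computing this and the preceding metrics from confusion-matrix counts; the example counts are made up.

```python
def metrics(tp, fp, fn, tn):
    accuracy  = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)            # also the true positive rate
    tnr       = tn / (tn + fp)            # true negative rate (specificity)
    f1        = 2 * tp / (2 * tp + fp + fn)
    return accuracy, precision, recall, tnr, f1

print(metrics(tp=40, fp=10, fn=5, tn=45))
```
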
91
Q

What is a learning curve?

A

It shows how accuracy changes with varying sample size.

Requires a sampling schedule.

In the graph there is a horizontal bar near the top, which is an upper limit on accuracy. You can never surpass this bar due to noise, etc… in the data.

92
Q

* on exam

What is a ROC curve?

A

Receiver Operating Characteristic (ROC)

It can be used to visualize the TP rate (y-axis) vs the FP rate (x-axis) of a given model.

It also allows for relative comparison across different models (each model is represented by a curve).

93
Q

Using a ROC curve, how is one model compared to another?

A

The area under the curve tells you how the model performed. The ideal area is one. You want the curve to be as close to the upper left corner as possible (high TP rate and low FP rate).

94
Q

What would the model of a classifier that makes random guesses look like on a ROC curve?

A

A line along the main diagonal connecting (0,0) and (1,1).

95
Q

What is association rule mining?

A

Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

96
Q

Define itemset with respect to association rule mining.

A

A collection of one or more items.

E.g: { Milk, Bread, Diaper }

97
Q

Define support count with respect to association rule mining.

A

The frequency of occurrence of an itemset.

Denoted by σ

e.g. σ ( { milk, bread, diaper } ) = 2

98
Q

Define support with respect to association rule mining.

A

The fraction of transactions that contain an itemset.

s ( { milk, bread, diaper } ) = 2/5

99
Q

Define frequent itemset with respect to association rule mining.

A

An itemset whose support is greater than or equal to a minsup threshold.

e.g. at least 10 occurrences together in the database.

100
Q

What is an association rule with respect to association rule mining?

A

An implication expression of the form X → Y where X and Y are itemsets.

e.g. { milk, diaper } → { beer }

101
Q

What are two rule evaluation metrics with respect to association rule analysis?

A

Support ( s )

Confidence ( c )

102
Q

What is support ( s ) with respect to association rule analysis?

A

The fraction of transactions that contain both X and Y.

Support determines how often a rule is applicable to a given data set.

{ X } → { Y }

103
Q

What is confidence with respect to association rule analysis?

A

Measures how often items in Y appear in transactions that contain X.

{ X } → { Y }

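For reference, confidence can be written in terms of support counts as c( X → Y ) = σ( X ∪ Y ) / σ( X ): the support count of the combined itemset divided by the support count of the antecedent.
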
104
Q

Why is support count important?

A

Rules that have low support may only occur by chance.

A low support rule is likely to be uninteresting from a business perspective.

105
Q

Why is confidence important with respect to association rule analysis?

A

Confidence measures the reliability of the inference made by a rule.

The higher the confidence, the more likely it is for Y to be present in transactions that contain X.

Confidence also provides an estimate of the conditional probability of Y given X.

106
Q

What is the 2 step approach to association rule mining?

A
  1. Generate all itemsets whose support >= minsup
  2. Generate high confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset.
107
Q

What is the goal of association rule mining?

A

Given a set of transactions T, find rules having:

support >= minsup threshold

confidence >= minconf threshold

108
Q

Is a brute-force approach feasible with respect to association rule mining?

A

It's computationally prohibitive.

It would involve:

  • listing all possible association rules
  • computing the support and confidence for each rule
  • pruning rules that fail the minsup and minconf thresholds

109
Q

What is an itemset lattice?

A

A series of connected nodes that show all possible combinations of an itemset, and sometimes a subset of combinations.

110
Q

What is the objective of frequent itemset generation with respect to association rule analysis.

A

Find all the itemsets that satisfy the minsup threshold. These itemsets are called frequent itemsets.

111
Q

What is the objective of rule generation with respect to association rule analysis?

A

Extract all the high-confidence rules from the frequent itemsets found in the previous step (frequent itemset generation). These rules are called strong rules.

112
Q

What are the 3 strategies for Frequent Itemset Generation?

A
  1. Reduce the number of candidates (M) using pruning
  2. Reduce the number of transactions by reducing the size of N as the size of the itemset increases
  3. Reduce the number of comparisons by using efficient data structures to store candidates or transactions. No need to match every candidate against every transaction.
113
Q

What is the Apriori Principle with respect to association rule analysis?

A

If an itemset is frequent, then all of its subsets must also be frequent.

{A B C} = frequent, then {A B}, {A C}, {B C} = frequent

Conversely, if an itemset is infrequent, then all of its supersets must be infrequent too.

{A B} = infrequent, then all supersets of {A B} = infrequent

Infrequent itemsets can be pruned

114
Q

What is support-based pruning with respect to association rule analysis?

A

Trimming the exponential search space based on the support measure.

115
Q

What is the anti-monotone property with respect to association rule analysis and support-based pruning?

A

The support for an itemset never exceeds the support for its subsets.

116
Q

Describe the Apriori algorithm used for association rule mining.

A

Let k = 1

Generate frequent itemsets of length 1

Repeat until no new frequent itemsets are identified:

  • Generate length (k+1) candidate itemsets from the length-k frequent itemsets
  • Prune candidate itemsets containing subsets of length k that are infrequent
  • Count the support of each candidate by scanning the DB
  • Eliminate candidates that are infrequent, leaving only those that are frequent

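A minimal pure-Python sketch of this loop; the toy transactions and the minsup value of 3 are assumptions made for the example.

```python
from itertools import combinations

transactions = [{"bread", "milk"}, {"bread", "diaper", "beer"},
                {"milk", "diaper", "beer"}, {"bread", "milk", "diaper"},
                {"bread", "milk", "beer"}]
minsup = 3  # minimum support count

def support(itemset):
    return sum(itemset <= t for t in transactions)

items = {i for t in transactions for i in t}
levels = [{frozenset([i]) for i in items if support(frozenset([i])) >= minsup}]
k = 1
while levels[-1]:
    # Generate (k+1)-candidates from the frequent k-itemsets ...
    candidates = {a | b for a in levels[-1] for b in levels[-1] if len(a | b) == k + 1}
    # ... prune candidates with an infrequent k-subset (Apriori principle) ...
    candidates = {c for c in candidates
                  if all(frozenset(s) in levels[-1] for s in combinations(c, k))}
    # ... then count support and keep only the frequent ones.
    levels.append({c for c in candidates if support(c) >= minsup})
    k += 1

print([set(s) for level in levels for s in level])  # all frequent itemsets
```
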
117
Q

Describe how the Apriori algorithm works.

A

Every item is considered as a candidate 1-itemset, e.g. {cola}. Discard itemsets that don't meet the support threshold.

Do the same with 2-itemsets using only the frequent 1-itemsets (because the Apriori principle holds).

Repeat for the maximum number of items in an itemset.

118
Q

Describe the Apriori pruning algorithm for association rule analysis

A

Initially, make a pass over the data to determine the support of each item and find the frequent 1-itemsets.

Iteratively generate new candidate k-itemsets using the frequent (k-1)-itemsets found in the previous iteration.

Make an additional pass over the data set to count the support of the candidates and eliminate candidates whose support is less than minsup.

Terminate when there are no new frequent itemsets generated.

119
Q

In the Apriori frequent itemset generation algorithm, what is a hash tree used for?

A

Determining whether each enumerated k-itemset corresponds to an existing candidate itemset.

120
Q

What factors affect the complexity of the Apriori algorithm?

A

Support threshold (a lower threshold results in more itemsets being declared frequent)

Number of items (dimensionality - more space will be required to store the support counts of items)

The number of transactions (Apriori makes repeated passes over the data set, so runtime increases with a larger # of transactions)

Average transaction width (in dense data sets the average transaction width can be large)

121
Q

What is a closed itemset with respect to association analysis?

A

An itemset is closed if none of its immediate supersets has the same support as the itemset.

122
Q

What are two ways of representing frequent itemsets in a compact way?

A
  1. A binary table
  2. A lattice with a boundary where we found frequent items (all subsets will be frequent).
123
Q

What is a maximal frequent itemset?

A

A frequent itemset for which none of its immediate supersets is frequent.

* When a border is drawn in a lattice to distinguish frequent from infrequent itemsets, the maximal frequent itemsets are the ones residing near the border on the frequent side: their immediate supersets (on the other side of the border) are infrequent.

124
Q

What are maximal frequent itemsets useful for?

A

Providing a compact representation of frequent itemsets. They form the smallest set of itemsets from which all frequent itemsets can be derived.

125
Q

What are closed frequent itemsets?

A

Provide a minimal representation of itemsets without losing their support information.

An itemset is a closed frequent itemset if it is closed and its support is greater than or equal to minsup.

126
Q

What is the relationship between closed frequent itemsets, maximal frequent itemsets and frequent itemsets?

A

Maximal frequent itemsets are a subset of closed frequent itemsets. Closed frequent itemsets are a subset of frequent itemsets.

127
Q

What are some of the disadvantages of the Apriori Algorithm for generating frequent itemsets?

A

Still incurs considerable I/O overhead since it requires making several passes over the transaction data set.

May degrade significantly for dense data sets because of the increasing width of transactions.

128
Q

What are alternative methods to the Apriori algorithm for generating frequent itemsets?

A
  1. Traversal of itemset lattice
  2. FP-Growth Algorithm
129
Q

Describe the FP Growth Algorithm

A

An alternative method for discovering frequent itemsets

It encodes the data using a compact structure called an FP-tree

It extracts frequent itemsets directly from the structure.

130
Q

Describe an FP-tree used by the FP-growth algorithm for frequent itemset generation

A

Reads data set one transaction at a time

Maps each transaction onto a path in the FP-tree

Paths may overlap as transactions are similar

The more paths overlap, the more compression

Sometimes makes tree small enough to fit into main memory

131
Q

How does the FP growth algorithm mine frequent itemsets?

A

It uses a recursive divide-and-conquer approach.

It uses pointers to assist frequent itemset generation.

It requires preprocessing.

132
Q

How can the Apriori and FP-growth algorithm be compared to heapsort and merge sort?

A

Work before vs work after

Lazy learner vs eager learner

133
Q

How are rules generated in association rule mining?

A

Find all non-empty subsets of a given frequent itemset such that the subset satisfies the minimum confidence requirement

134
Q

What is an anti-monotone property with respect to association rule mining?

A

The support for an itemset never exceeds the support for its subsets.

135
Q

Does the confidence of rules have an anti-monotone property?

A

No, but confidence of rules generated from the same itemset do.

136
Q

What happens when items from the left side of the rule are moved to the right side of the rule?

A

Confidence can only decrease.

If a parent doesn't meet the confidence threshold, the children won't meet the threshold either.

137
Q

What happens as the number of items increases in association rule analysis?

A

The support of individual items decreases.

It can be difficult to find an appropriate support threshold because most items in a store will have a low support count.

138
Q

What are the consequences if the minsup threshold is too high?

A

itemsets involving interesting rare items (e.g. expensive products) could be missed.

139
Q

What is the consequence if the minsup threshold is too low?

A

It can become computationally expensive to process, and the number of itemsets will be very large.

140
Q

What is multiple minimum support?

A

You can use different minimum support thresholds for different items.

141
Q

What are interestingness measures?

A

Used to prune/rank the derived patterns, because association rule mining tends to produce too many rules, many of which are uninteresting or redundant.

142
Q

How can interestingness measures be computed?

A

Interestingness can be computed using a contingency table.

143
Q

What is a contingency table for computing interestingness in association rule discovery?

A

A matrix of items that were bought and items that were not bought.

144
Q

What are some measures used to describe confidence level?

A

There are over 20…

Gini index

Jaccard

Cosine

Intersect

Laplace…

145
Q

What is the drawback of using confidence level for association rule analysis?

A

The rules can be misleading even though confidence may be high.

e.g.

Confidence(Coffee | Tea) = 0.75, but P(Coffee) = 0.9

Although confidence is high, the rule is misleading: P(Coffee | !Tea) = 0.9375

146
Q

What is lift?

A

A measure used to evaluate interestingness in association rule analysis that takes prior probabilities into account.

It's much more reliable than other techniques such as confidence level.

If the lift value is around 1, we can assume that the rule is statistically independent.

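For reference, lift is computed as Lift( X → Y ) = c( X → Y ) / s( Y ) = s( X ∪ Y ) / ( s( X ) × s( Y ) ). In the Coffee/Tea example from the previous card, lift = 0.75 / 0.9 ≈ 0.83 < 1, flagging the rule as misleading.
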
147
Q

What is the challenge of applying association analysis to non-asymmetric binary variables?

A

They can cause computational issues.

Continuous attributes need to be binned (discretized) and categorical attributes mapped to binary items.

148
Q

What is equal width binning?

A

Divides the range into N intervals of equal size (e.g. values from 0 to 100 with N = 5 gives bins of width 20)

149
Q

What is equal depth binning?

A

It divides the range into N intervals, each containing approximately the same number of samples

150
Q

What sort of rules can be created with continuous attributes?

A

Age(21,35) ^ Salary(70k,120k) -> buy

151
Q

What are 2 problems with discretization of continuous attributes in association rule mining?

A

Execution time (many possible ranges of values)

Too many rules, e.g. many overlapping variants of the same rule:

{Refund = No, Income > 90K} -> {Cheat = No}
{Refund = No, 50K < Income < 52K} -> {Cheat = No}

152
Q

What are multi-level association rules?

A

Items are grouped into categories which cuts down on the # of items. There are then hierarchical levels.

For example, Electronics as the root, then computers… etc..

153
Q

Why should a concept hierarchy be used in multi-level association rules?

A

Cuts down on the number of rules by grouping.

154
Q

What is sequence data with respect to association rule analysis?

A

Examines events that occur over time for customers.

Events can be grouped together.

Becomes computationally expensive.

155
Q

What can be done to reduce the amount of sequence data with respect to association rule mining?

A

preprocessing

156
Q

What is sequential pattern mining?

A

Given a database of sequences and a user-specified minimum support threshold, minsup, find all subsequences with support >= minsup

157
Q

What is the commonality between subgraph mining, sequential pattern mining, multi-level association rules, etc…

A

They all come down to preprocessing the data so that the APRIORI algorithm can be applied to them.

158
Q

What is frequent subgraph mining?

A

Extends association rule mining to find frequent subgraphs.

Graphs are decomposed into subgraphs; we look for common subgraphs.

159
Q

What is frequent subgraph mining useful for?

A

computational chemistry, bioinformatics, spatial data sets, etc.

160
Q

How are graphs represented as transactions?

A

Transform to a matrix using adjacency lists.

You can add edge weights, but this blows up the matrix even more.

161
Q

Compare APRIORI and FP Growth

A

Apriori pruning strategy: If an itemset is infrequent, then all of its supersets must also be infrequent. All of these infrequent itemsets can be pruned. Multiple scans of the DB.

FP-Growth: encodes the data set using a compact data structure called an FP-tree and extracts frequent itemsets directly from this structure. Divide and conquer approach. Outperforms the standard Apriori algorithm by several orders of magnitude. Requires less memory. Scan DB only twice.

162
Q

How are non asymmetric binary attributes handled in association rule analysis?

A

Binning and discretization.

163
Q

What are the advantages of grouping categories of items together into concept hierarchies for association analysis?

A

The items at the lower levels of the hierarchy may not have enough support to appear in frequent itemsets (this is good because rules at the bottom tend to be overly specific and may not be as interesting).

164
Q

What are some of the disadvantages of using a concept hierarchy for association rule analysis?

A

Items residing at the higher levels tend to have higher support counts than those residing at the lower levels, so if the threshold is set too high, only the patterns at the higher levels are likely to be extracted.

Conversely, if the threshold is set too low, then you get too many rules.

Increases computation time.

May lead to redundant rules.

165
Q

How can patterns be extracted from sequence data?

A

The Apriori principle holds for sequential data, because any data sequence that contains a particular k-sequence must also contain all of its (k-1)-subsequences. An Apriori-like algorithm can be used to extract sequential patterns from a sequence data set.

166
Q

How can frequent subgraphs be found from graphs in association analysis?

A

Transform each graph into a transaction-like format so that existing algorithms such as Apriori can be applied.

167
Q

How are candidate k-subgraphs pruned after they have been generated in association analysis?

A

Repeatedly remove an edge from the candidate k-subgraph and check whether the corresponding (k-1)-subgraph is connected and frequent.

168
Q

What are 2 main challenges for mining infrequent patterns in association rule analysis?

A
  1. how to identify interesting infrequent patterns
  2. how to efficiently discover them in large data sets
169
Q

What is cluster analysis?

A

Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) objects in other groups.

170
Q

What is one of the most common clustering algorithms?

A

K-means

171
Q

Is clustering supervised or unsupervised?

A

Clustering can be considered a form of unsupervised classification.

172
Q

What are two applications of cluster analysis?

A

Understanding: group related items that have similarities

Summarization: reduce the size of large data sets

173
Q

What is a clustering?

A

A set of clusters

174
Q

What are the two main types of clustering?

A

Partitional

Hierarchical

175
Q

What is partitional clustering?

A

A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.

Simply put, a division of the data into a group or subset.

176
Q

What is hierarchical clustering?

A

A set of nested clusters organized as a hierarchical tree

Simply put, each cluster can have a sub-cluster

177
Q

What is the difference between hierarchical and partitional clustering?

A

The most common distinction is whether the set of clusters is nested (hierarchical) or un-nested (partitional).

178
Q

What two factors introduce bias into cluster analysis?

A

The distance measure

The choice of clustering algorithm

179
Q

What is the difference between exclusive and non-exclusive clusterings?

A

In non-exclusive clusterings, points may belong to multiple clusters.

180
Q

What is the difference between fuzzy and non fuzzy clusterings?

A

In fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1.

The weights must sum to 1.

181
Q

What is the difference between partial and complete clustering?

A

In some cases, we only want to cluster some of the data.

182
Q

What is the difference between heterogeneous and homogeneous clustering?

A

Clustering of the same type or of different types.

For example, clusters can vary widely in size, shape, and density.

183
Q

How are hierarchical clusters represented?

A

Using a dendrogram. The heights of the linkages represent the order in which clusters are formed.

184
Q

What is a well-separated cluster?

A

Clusters where points are closer to points in the same cluster than to points in every other cluster.

185
Q

What are center-based clusters?

A

An object in a cluster is closer (more similar) to the center of a cluster, than to the center of any other cluster.

These types of clusters tend to be globular.

186
Q

What is a contiguity-based cluster?

A

Two objects are connected only if they are within a specified distance of each other.

187
Q

What is a density based cluster?

A

A cluster where a dense region of objects is surrounded by a region of low density.

188
Q

When are density-based clusters used?

A

when the clusters are irregular or intertwined, and when noise and outliers are present

189
Q

What statistical measure is often used for clustering and why?

A

The median is often more appropriate because outliers affect the mean greatly.

190
Q

What are conceptual clusters?

A

Clusters that share some common property or represent a particular concept.

(for example, two overlapping rings).

191
Q

What are clusters defined by an objective function?

A

The goal of finding clusters that minimize or maximize an objective function.

Enumerate all possible ways of dividing the points into clusters and evaluate the ‘goodness’ of each potential set of clusters by using the given objective function.

192
Q

What variables will influence how good a clustering is?

A

Type of proximity or density measure

Sparseness

Attribute type

Dimensionality

Noise & Outliers

Type of distribution

193
Q

What are the 3 main clustering algorithms?

A

K-means (and its variants)

Hierarchical clustering

Density-based clustering

194
Q

Is the k-means clustering approach partitional or hierarchical?

A

partitional

195
Q

Recite the algorithm for k-means

A

Select k points as the initial centroids (usually random points)

Repeat:

  • form k clusters by assigning every point to its closest centroid
  • recompute the centroid of each cluster

Until the centroids don't change or a stopping criterion is met

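A minimal numpy sketch of this loop (it also reports the SSE discussed in the later cards); the toy two-blob data set and k = 2 are assumptions made for the example, and empty clusters are not handled.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # initial centroids
    for _ in range(iters):
        # Assign every point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute the centroid of each cluster.
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):  # stop when centroids don't change
            break
        centroids = new
    sse = ((X - centroids[labels]) ** 2).sum()  # sum of squared error
    return labels, centroids, sse

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
labels, centroids, sse = kmeans(X, k=2)
print(sse)
```
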
196
Q

How are the number of clusters chosen for k-means?

A

user specified number of clusters

197
Q

How is the closeness of k-means clusters evaluated?

A

Using a distance measure such as Euclidean distance, cosine similarity, correlation, etc…

198
Q

What is one of the issues with the k-means clustering algorithm?

A

The cluster centres are usually picked randomly (but you can pick them deliberately to speed things up).

This increases the likelihood of the model getting stuck in a local minimum.

199
Q

What is the significance of picking the initial centroids in k-means clustering?

A

One of the disadvantages is that, what should be one cluster, is often split into two.

200
Q

Why should a data miner be skeptical of an algorithm that just runs repeatedly over and over (for example k-means that can fall into a local minimum)

A

This introduces bias – a data miner could simply pick the iteration that produced the best result if they wanted.

It’s better to do some form of smart sampling or preprocessing instead.

201
Q

What can produce considerably different results in the k-means algorithm?

A

the choice of initial centroids

202
Q

What is the issue with using Euclidean distance as a distance measure in algorithms such as k-means clustering?

A

K-means tends to find round / globular clusters because of the use of Euclidean distance as a distance measure.

203
Q

What is the most common evaluation measure for k-means clustering?

A

The sum of Squared Error (SSE)

204
Q

Describe how the sum of squared error (SSE) is computed as an evaluation measure for k-means clustering

A

For each point, the error is the distance to the nearest cluster centroid.

To get the SSE, we square the errors and then sum them.

205
Q

What are some solutions to the k-means initial centroid problem?

A
  1. multiple runs (not the best approach)
  2. sample and use hierarchical clustering to determine initial centroids
  3. select more than k initial centroids and then select among these initial centroids
  4. postprocessing
  5. Bisecting k-means (split clusters into multiple clusters)
206
Q

K-means can yield empty clusters. How are they handled?

A

Just get rid of them.

207
Q

Describe the approach for updating centers incrementally for k-means clustering.

A

In the basic k-means algorithm, centroids are updated after all points are assigned to a centroid.

An alternative is to update the centroids after each assignment (incremental approach).

208
Q

What are the disadvantages and advantages of updating centers incrementally for the k-means clustering algorithm?

A

More expensive

Introduces an order dependency

Never get an empty cluster

(this approach is not used often)

209
Q

Describe 2 preprocessing steps commonly used for k-means clustering

A

Normalize the data (it’s distance based)

Eliminate outliers

210
Q

Describe 3 post processing steps commonly used for k-means

A

Eliminate small clusters that may represent outliers

Split ‘loose’ clusters with relatively high SSE

Merge clusters that are ‘close’ and have relatively low SSE

211
Q

What is the basic idea of the bisecting k-means algorithm?

A

to obtain k clusters, split the set of all points into two clusters, select one of these clusters to split and so on, until k clusters have been produced.

212
Q

Why is post processing important in k-means clustering algorithms and clustering algorithms in general?

A

Unsupervised approaches don't give us feedback about how good our model is. This is why post-processing is important.

213
Q

What are 2 problems of the k-means clustering algorithm?

A

K-means has problems when clusters differ in size or density, or have non-globular shapes.

K-means has problems when the data contains outliers.

214
Q

What is hierarchical clustering?

A

produces a set of nested clusters organized as a hierarchical tree

these clusters can be visualized as a dendrogram (a tree-like structure that records the sequences of merges and splits)

215
Q

What are the two strengths of hierarchical clustering?

A

They do not have to assume any particular number of clusters

They correspond to meaningful taxonomies (e.g. biological sciences - the animal kingdom)

216
Q

What are the two main types of hierarchical clustering?

A
  1. Agglomerative
  2. Divisive
217
Q

What is agglomerative hierarchical clustering?

A

start with the points as individual clusters

at each step, merge the closest pair of clusters until only one cluster (or k clusters) are left

218
Q

What is divisive hierarchical clustering?

A

Start with one, all inclusive cluster

At each step, split the cluster until each cluster contains a point (or there are k clusters)

219
Q

What is the most popular hierarchical clustering technique?

A

Agglomerative clustering algorithm

220
Q

Explain the agglomerative clustering algorithm

A
  1. Compute the proximity matrix
  2. Let each data point be a cluster
  3. Repeat:
  • merge the two closest clusters
  • update the proximity matrix
  4. Until only a single cluster remains
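
A minimal sketch of agglomerative clustering with SciPy; the toy points, the 'single' (MIN) linkage, and the cut into 3 clusters are assumptions made for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9], [9.0, 0.0]])
Z = linkage(X, method="single")                  # repeatedly merges the two closest clusters
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the dendrogram into 3 clusters
print(labels)
```
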
221
Q

In what ways can the similarity between clusters be defined?

A
  1. Min
  2. Max
  3. Group average
  4. Distance between centroids
  5. Other methods with a defined proximity measure (e.g. Ward's method)
222
Q

What is the MIN method of defining inter cluster similarity?

A

The similarity of two clusters is based on the two closest points in the different clusters (single link).

223
Q

What is the MAX method of defining similarity between clusters (inter-cluster similarity)?

A

Minimize the maximum distance between points in different clusters (complete link; tends to produce "round" clusters).

224
Q

What are the strengths of hierarchical clustering using MIN for inter-cluster similarity?

A

Can handle non-elliptical shapes.

225
Q

What are the limitations of MIN for hierarchical clustering?

A

sensitive to noise and outliers

226
Q

What are the strengths of MAX for hierarchical clustering?

A

it's less susceptible to noise and outliers

227
Q

What are the limitations of MAX for hierarchical clustering?

A

tends to break large clusters

biased toward globular clusters

228
Q

What is the group average method used for cluster similarity?

A

proximity of two clusters is the average pairwise proximity between points in the two clusters

229
Q

What is Ward's method for cluster similarity?

A

The similarity of two clusters is based on the increase in squared error when two clusters are merged

230
Q

What are some of the characteristics of Ward's method for hierarchical clustering?

A

Less susceptible to noise and outliers

biased towards globular clusters

can be used to initialize k-means

231
Q

What are some of the problems and limitations of hierarchical clustering?

A

cannot undo a decision once two clusters have been combined

no objective function is directly minimized

have problems with one or more of the following:

sensitivity to noise

difficulty handling different sized clusters and convex shapes

breaking large clusters

232
Q

Explain the minimum spanning tree (MST) method for divisive hierarchical clustering.

A
  1. Start with a tree that consists of any point
  2. In successive steps, look for the closest pair of points such that one point is in the current tree and the other is not.
  3. Add the point that is not in the tree to the tree, and put an edge between these two points

(this is a top down approach; it is uncommon and rarely used)

233
Q

What is DBSCAN for clustering?

A

DBSCAN is a density-based algorithm used to find clusters

234
Q

What are the 3 components in the DBSCAN algorithm?

A
  1. core point
  2. border point
  3. noise point
235
Q

What is a core point in DBSCAN?

A

A point is a core point if it has more than a specified number of points (MinPts) within Eps (a specified radius)

236
Q

What is a border point in DBSCAN?

A

A border point has fewer than MinPts (user specified threshold) within Eps (a user specified radius) but within the neighbourhood of a core point

237
Q

What is a noise point in DBSCAN?

A

A noise point is any point that is not a core point or a border point

238
Q

Describe the algorithm for DBSCAN

A
  1. Label all points as core, border, or noise points
  2. Eliminate noise points
  3. Put an edge between all core points that are within the user-specified radius (Eps) of each other
  4. Make each group of connected core points into a separate cluster
  5. Assign each border point to one of the clusters of its associated core points
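
A minimal sketch using scikit-learn's DBSCAN implementation; the toy data and the Eps / MinPts values are assumptions made for the example (noise points get the label -1).

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (30, 2)),   # dense cluster 1
               rng.normal(4, 0.3, (30, 2)),   # dense cluster 2
               [[10.0, 10.0]]])               # an obvious noise point
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)  # eps = Eps, min_samples = MinPts
print(set(labels))  # e.g. {0, 1, -1}: two clusters plus noise
```
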
239
Q

When does the DBSCAN clustering algorithm run into issues?

A

DBSCAN does not work well for clusters that have varying densities

240
Q

What are the 2 parameters that need to be set for DBSCAN?

A
  1. Threshold (density) MinPts
  2. Radius (how far out) Eps
241
Q

How can the best threshold for Eps (radius) and MinPts (density) in DBSCAN be found?

A

Plot the distance of every point to its kth nearest neighbour, then find the biggest change or “knee” in the curve.

242
Q

What are the characteristics of the K-means clustering algorithm?

A

Uses a user specified number of clusters

Partition-based

Complete clustering

Non-deterministic (can exhibit different behaviors on different runs)

noise stays in space and contributes to SSE

243
Q

What are the characteristics of the hierarchical clustering algorithm?

A

works if the # of clusters is unknown (a horizontal cut in the dendrogram specifies the # of clusters)

can set a max distance for threshold

complete clustering

noise depends on linkage criteria

deterministic

244
Q

What are characteristics of the DBSCAN clustering algorithm?

A

can handle unknown # of clusters

requires two user-defined parameters to be specified

partition-based (fixed # of clusters)

partial clustering (doesn’t get rid of all points)

deals best with noise because it gets rid of it

worst of 3 approaches for clustering data with varying densities

deterministic

245
Q

What is cluster evaluation?

A

proposes ways of evaluating unsupervised classifiers (clustering)

evaluating unsupervised classifiers is much more difficult (you can’t rely on measures such as accuracy, which require class labels)

246
Q

What is the issue with evaluating clusters?

A

Clusters are in the eye of the beholder (they are subjective)

247
Q

What is the purpose of evaluating clusters if they are subjective?

A

To avoid finding patterns in noise

To compare clustering algorithms

To compare two sets of clusters

To compare two clusters

248
Q

In clustering, what is the technique of distinguishing whether non-random structure actually exists in the data called?

A

Clustering tendency

249
Q

What are the different aspects of cluster validation?

A

Determining whether non-random structures exist in the data

Comparing results to externally known results (eg. class label)

Determining which clustering technique is better

Determining the “correct” number of clusters

250
Q

What are 3 measures of cluster validity?

A

External index

Internal index

Relative index

251
Q

Explain the purpose of the external index measure of cluster validity

A

used to measure the extent to which cluster labels match externally supplied class labels

252
Q

Explain the purpose of the internal index measure of cluster validity

A

Used to measure the goodness of a clustering structure without respect to external information

253
Q

Explain the purpose of the relative index measure of cluster validity

A

used to compare two different clusterings or clusters

254
Q

Explain how to measure cluster validity via correlation

A

Generate two matrices: a proximity matrix and an incidence matrix.

The incidence matrix has 1 row and 1 column for each data point. An entry is 1 if the associated pair of points belongs to the same cluster and 0 if they belong to different clusters.

Compute the correlation between the two matrices.

High correlation indicates that points that belong to the same cluster are close together.
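
A minimal sketch of this procedure on toy data; since the proximity matrix here holds distances rather than similarities, a good clustering shows up as a strongly negative correlation (equivalently, correlate the incidence matrix with similarities):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import KMeans

X = np.random.rand(100, 2)
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)

proximity = squareform(pdist(X))                 # pairwise distance matrix
incidence = (labels[:, None] == labels[None, :]).astype(float)

iu = np.triu_indices_from(proximity, k=1)        # upper triangle, no diagonal
corr = np.corrcoef(proximity[iu], incidence[iu])[0, 1]
print(corr)  # strongly negative: same-cluster pairs have small distances
```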

255
Q

What is the issue of measuring cluster validity via correlation?

A

it’s not a good measure for some density-based or contiguity-based (connected) clusters

256
Q

What does a high correlation and a low correlation mean with respect to cluster evaluation?

A

A high correlation means a good clustering, a low correlation means a poor clustering.

257
Q

Can the sum of squared error be used for cluster evaluation?

A

SSE can be used to compare two clusterings or two clusters (average SSE).

It can also be used to estimate the # of clusters.

258
Q

How can the SSE be used to determine the # of clusters?

A

Plot the SSE vs. k (the # of clusters) and look for the knee in the curve.

* The knee method works best for spherical clusters; the knee may not be as pronounced otherwise.
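
A minimal sketch of this knee/elbow plot on toy data, using K-means inertia as the within-cluster SSE:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.rand(300, 2)
ks = range(1, 11)
sse = [KMeans(n_clusters=k, n_init=10).fit(X).inertia_ for k in ks]

plt.plot(ks, sse, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("SSE")
plt.show()   # pick k at the knee, where the SSE curve flattens out
```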

259
Q

Explain the statistical framework for SSE when evaluating clusters

A

Compare the SSE of the clustering to the SSE obtained on random data:

cluster random data and record its SSE

repeat many times to build a distribution of SSE values

check whether the actual SSE is unusually low relative to that distribution (the same framework can also be applied to correlation instead of SSE)

260
Q

What are 2 internal measures for cluster evaluation?

A

cluster cohesion

cluster separation

261
Q

what is the cluster cohesion internal measure for cluster evaluation?

A

measures how closely related objects in a cluster are

measured by the within-cluster sum of squares (SSE)

262
Q

define the cluster separation measure for cluster evaluation

A

measures how distinct or well-separated a cluster is from other clusters

measured by the between-cluster sum of squares

263
Q

Explain how a proximity graph based approach can be used for cohesion and separation in cluster evaluation

A

cluster cohesion is the sum of the weights of all the links within the cluster

cluster separation is the sum of the weights between nodes in the cluster and nodes outside the cluster

264
Q

What is the silhouette coefficient for internal measures of cluster evaluation?

A

combines ideas of both cohesion and separation, but for individual points, as well as clusters and clusterings

looks at closeness within a cluster and closeness between clusters

for a point: s = (b - a) / max(a, b), where a is the average distance to the other points in its own cluster and b is the lowest average distance to the points of another cluster
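
A minimal sketch using scikit-learn’s silhouette utilities on toy data (the choice of K-means and 3 clusters is an illustrative assumption):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

X = np.random.rand(200, 2)
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)

print(silhouette_score(X, labels))     # average over the whole clustering
print(silhouette_samples(X, labels))   # one coefficient per point, in [-1, 1]
```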

265
Q

What is the great challenge with cluster validity?

A

The validation of clustering is tricky.

A clustering can look good under one measure and still not be meaningful.

266
Q

Define similarity

A

a numerical measure of how alike two data objects are

it is higher when objects are more alike

often falls in the range [0, 1]

267
Q

define dissimilarity

A

a numerical measure of how different two data objects are

lower when objects are more alike

minimum dissimilarity is often 0

upper limit varies

268
Q

define proximity

A

refers to either a similarity or a dissimilarity

269
Q

are similarity and dissimilarity opposites?

A

No; one is not automatically the opposite of the other, although when the range is [0, 1] they can be related by s = 1 - d.

270
Q

How are similarity and dissimilarity defined for nominal attributes if p and q are the attribute values for two data objects?

A

similar (s = 1, d = 0) if p = q and dissimilar (s = 0, d = 1) if p ≠ q

271
Q

How are similarity and dissimilarity defined for ordinal attributes if p and q are the attribute values for two data objects?

A

values are mapped to integers 0 to n-1, where n is the # of values

dissimilarity = |p - q| / (n - 1)

similarity = 1 - |p - q| / (n - 1)

e.g. with quality values {poor, fair, OK, good, wonderful} mapped to 0–4 (n = 5), d(good, fair) = |3 - 1| / 4 = 0.5

272
Q

How are similarity and dissimilarity defined for interval or ratio attributes if p and q are the attribute values for two data objects?

A

dissimilarity = |p - q|

similarity = -|p - q| (the negative of the dissimilarity)

273
Q

What is the formula for Euclidean distance?

A

d(p, q) = sqrt( Σ_k (p_k - q_k)² ), where the sum runs over the n dimensions (attributes) and p_k and q_k are the kth attributes of p and q

274
Q

Is standardization necessary for Euclidean distance?

A

If the scales differ, then yes standardization is necessary.

275
Q

What is the Minkowski distance?

A

a generalization of the Euclidean distance

it’s the same form as the Euclidean distance, but with a parameter r in place of 2:

dist(p, q) = ( Σ_k |p_k - q_k|^r )^(1/r), summing over the n dimensions

the r parameter gives more flexibility; n is the # of dimensions (attributes)

276
Q

In the Minkowski distance, what does the formula reduce to for r = 1, r = 2, and r → ∞?

A

r = 1 is the city block (Manhattan, L1 norm) distance; the Hamming distance is the special case for binary vectors

r = 2 is the Euclidean distance

r → ∞ is the supremum (L_max norm, Chebyshev) distance
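
A minimal sketch of the three special cases using SciPy’s distance functions (toy vectors):

```python
from scipy.spatial.distance import cityblock, euclidean, chebyshev, minkowski

u, v = [0, 2], [4, 0]
print(cityblock(u, v))       # r = 1: 6   (|0-4| + |2-0|)
print(euclidean(u, v))       # r = 2: ~4.47 (sqrt(16 + 4))
print(chebyshev(u, v))       # r -> infinity: 4 (largest coordinate difference)
print(minkowski(u, v, p=3))  # any other r via the general formula
```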

277
Q

What is the Mahalanobis distance?

A

measures the distance between two points while scaling by the covariance of the data:

mahalanobis(p, q) = sqrt( (p - q) Σ⁻¹ (p - q)ᵀ ), where Σ is the covariance matrix of the data

an alternative to normalization (scaling is built in)

  • takes into account the spread of the data in each direction
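
A minimal sketch with SciPy, which expects the inverse covariance matrix (toy data):

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

X = np.random.rand(100, 3)                   # toy data set
VI = np.linalg.inv(np.cov(X, rowvar=False))  # inverse covariance of the data

print(mahalanobis(X[0], X[1], VI))           # distance between two points
```
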
278
Q

What is an outlier / anomaly?

A

A point, pattern, or set of patterns which does not conform to what we define as normal within the data

279
Q

What are some applications of anomaly detection?

A

intrusion detection, fraud detection, medical diagnosis, image processing

280
Q

what are 3 things that can cause outliers?

A

measurement error

collection error

natural variation

281
Q

What are the 3 types of outliers?

A
  1. point
  2. contextual
  3. collective
282
Q

what is a point outlier?

A

an individual data point that is anomalous with respect to the rest of the data

283
Q

what is a contextual outlier?

A

the anomaly depends entirely on the context in which it is viewed (e.g. a temperature that is normal in summer may be anomalous in winter)

284
Q

what is a collective outlier?

A

the outlier exists as a sequence: a collection of related points is anomalous together, even if each point alone is normal, e.g.

AAABBBAAABBBABABABAAABBBAAABBB

285
Q

What are 2 example applications of spatial anomaly?

A

Brain imaging

Weather

286
Q

What is an example of time series anomaly application?

A

ECG patterns

287
Q

How can outliers be visualized in 1 or 2 dimensions?

A

Box plots, histograms, and scatter plots

288
Q

How can outliers be detected in 3 dimensions?

A

3 dimensional scatter plots

289
Q

Can outliers be detected in more than 3 dimensions?

A

They can’t be directly visualized, which makes this very difficult.

290
Q

Distinguish between outliers created by noise and anomalies

A

noise is often created by measurement error or extreme values

noise is typically uninteresting and generally meaningless

Anomalies are data points created by different mechanisms - usually very interesting

291
Q

What are 2 approaches to outlier detection?

A

Approach depends on the format of the data

  1. generalized approach: build a model of the normal data
  2. graphical: visualize the data using various means - typically subjective
292
Q

What are the 3 common responses to outliers?

A
  1. removal
  2. accommodation - keep the outlier but use accommodation methods while processing
  3. explanation - why does it exist?
293
Q

What are the 2 approaches to evaluating an outlier?

A
  1. outlier label: a point or group of points is labeled as an anomaly or normal
  2. outlier score: assign an outlier score to a data point or group of data points

represents degree of outlierness

can create a ranked list or use a threshold

294
Q

What are the 3 common approaches to outlier detection?

A
  1. supervised

class labels

  2. unsupervised

no class labels

  3. semisupervised

some class labels

295
Q

What is a statistical approach for outlier detection?

A
  1. use a parameter-based model which describes the distribution of the data (e.g. Gaussian distribution)
  2. apply statistical tests which depend on the:

distribution of the data

population parameters or sample statistics

number of expected outliers

296
Q

What is Grubbs’ test?

A

A statistical test used to detect outliers in univariate data assumed to come from a normally distributed population.

297
Q

What are the steps in Grubbs’ test?

A

apply the test to the most extreme value

remove the detected outlier

repeat until no further outliers are found
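
A minimal sketch of one apply-remove-repeat loop; the two-sided critical value below is the commonly quoted form, but treat it as an assumption and verify against a statistics reference:

```python
import numpy as np
from scipy import stats

def grubbs_once(x, alpha=0.05):
    """One pass of the two-sided Grubbs' test; returns outlier index or None."""
    n = len(x)
    if n < 3:
        return None
    mean, sd = x.mean(), x.std(ddof=1)
    idx = np.argmax(np.abs(x - mean))            # most extreme value
    G = abs(x[idx] - mean) / sd                  # Grubbs' statistic
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)  # t critical value
    G_crit = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
    return idx if G > G_crit else None

x = np.array([2.1, 2.3, 1.9, 2.2, 9.0, 2.0])
while (idx := grubbs_once(x)) is not None:       # apply, remove, repeat
    x = np.delete(x, idx)
print(x)                                         # data with outliers removed
```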

298
Q

Explain the linear regression approach for removing outliers

A
  1. perform a linear regression on the data set
  2. determine the value of the residual for each point belonging to the data set
  3. use univariate outlier detection on the residuals to detect whether an outlier is present
  4. remove the outlier
  5. repeat until no outliers remain
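
A minimal sketch of the first pass of this loop for a single predictor, using a simple |z| > 3 rule as the univariate detector (an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(20, dtype=float)
y = 2 * x + 1 + rng.normal(0, 0.5, 20)
y[7] += 10                                  # inject an outlier

slope, intercept = np.polyfit(x, y, 1)      # step 1: fit the regression
residuals = y - (slope * x + intercept)     # step 2: residual per point

z = (residuals - residuals.mean()) / residuals.std(ddof=1)
print(np.where(np.abs(z) > 3)[0])           # step 3: univariate detection
# steps 4-5: remove the flagged point, refit, and repeat
```
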
299
Q

What is the assumption of the linear regression approach to outlier detection?

A

The error term is normally distributed

300
Q

What is the nearest neighbour approach for outlier detection?

A

compute the distance between all pairs of points

outlier score can be defined as the distance from the kth nearest neighbour or the average distance from the k-nearest neighbours
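
A minimal sketch of both score variants on toy data (k = 5 is an illustrative choice):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.rand(100, 2)
k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1 skips the point itself
dist, _ = nn.kneighbors(X)

score_kth = dist[:, -1]                # distance to the kth nearest neighbour
score_avg = dist[:, 1:].mean(axis=1)   # average distance to the k nearest
print(np.argsort(score_kth)[-5:])      # indices of the 5 strongest outliers
```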

301
Q

what is the advantage of using the nearest neighbour approach for outlier detection?

A

highly granular (detailed)

302
Q

what are the disadvantages of using the k-nearest neighbour approach for outlier detection?

A

computationally expensive

sensitive to the chosen value of k

meaningless in high-dimensional space

303
Q

Explain the density based method for outlier detection

A

assumes that outliers are points in regions of low density
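
The card doesn’t name a specific algorithm; Local Outlier Factor (LOF) is one widely used density-based detector, so here is a minimal scikit-learn sketch on toy data:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.random.rand(100, 2)
labels = LocalOutlierFactor(n_neighbors=20).fit_predict(X)  # -1 = outlier
print(np.where(labels == -1)[0])        # points in regions of low density
```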

304
Q

Explain cluster based methods for anomaly detection

A

points that do not strongly belong to any cluster can be considered outliers

cluster, remove outlier, repeat

requires a stopping criterion
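
A minimal sketch of one such scheme, scoring points by distance to their nearest K-means centroid; the 3-sigma threshold stands in for the stopping criterion and is an illustrative assumption:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)
km = KMeans(n_clusters=3, n_init=10).fit(X)
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

threshold = dist.mean() + 3 * dist.std()       # illustrative cutoff
print(np.where(dist > threshold)[0])           # weakly related points
# in the full scheme: remove these, re-cluster, repeat until the
# stopping criterion is met
```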

305
Q

Explain the rare class detection method for anomaly detection

A

consider the problem to be the same as an imbalanced class problem

two approaches:

cost sensitive

re-sampling

306
Q

Explain the SMOTE method for anomaly detection

A

automatically determines the minority class

finds the k nearest neighbours for each minority-class point

randomly chooses among the k nearest neighbours, depending on the level of oversampling required

creates a synthetic point along the line segment connecting the point to its chosen nearest neighbour
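
A minimal sketch of the synthetic-point step on toy minority-class data (in practice a library such as imbalanced-learn provides SMOTE directly):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X_min = rng.random((20, 2))                  # minority-class points only
k = 5

nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
_, idx = nn.kneighbors(X_min)                # column 0 is the point itself

synthetic = []
for i in range(len(X_min)):                  # one synthetic point each
    j = rng.choice(idx[i, 1:])               # random one of the k neighbours
    gap = rng.random()                       # position along the segment
    synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
print(np.array(synthetic).shape)
```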

307
Q

What is spatial outlier detection?

A

looks at finding abrupt changes in behavioural attributes which violate spatial autocorrelation and heteroscedasticity

308
Q

what is novelty detection?

A

related to anomaly detection but is a field of its own

the outlier hasn’t occurred before; it is only an outlier for a brief period of time, and then it becomes part of the normal model

309
Q

What are some challenges in outlier detection?

A

defining normal regions

evolving definition of normal

differing notion of anomalies

number of attributes used to define an anomaly

noise

310
Q

What is a linear regression?

A

a simple linear regression is basically the equation of a line (y = mx + b) fitted to the data

311
Q

What is a residual?

A

a residual is the vertical distance between an observed point and the fitted regression line (the observed value minus the predicted value)

312
Q

What is a model?

A

a model is an abstract representation of a real system

313
Q

What is heteroscedasticity?

A

the variance you would expect to see in the data

Specifically, it refers to how the spread of values for one variable changes in relation to the values of another variable.

for outlier detection, find points that defy this expected pattern

314
Q

When are statistical approaches to outlier detection effective?

A

When there is sufficient knowledge of the data and the type of test that should be applied.

Fewer options are available for multivariate data.

These statistical tests perform poorly on high-dimensional data.

315
Q

Define density with respect to density-based outlier detection.

A

The density around an object is equal to the # of objects that are within a specified distance d of the object.

316
Q

What happens if the value of d is too large or too small with respect to density-based outlier detection?

A

If d is too small, then many normal points may have low density and thus a high outlier score. If d is too large, then many outliers may have densities (and outlier scores) similar to those of normal points.