Pre-exam Flashcards
What is model underfitting?
When the model has yet to learn the true structure of the data. Performs poorly on both training and test sets.
What is model overfitting?
The model is too complex and fits the training data so closely that it is no longer able to generalize to unseen data.
When a model is overfit, what happens to the test and training error rate?
The test error rate increases while the training error rate continues to decrease.
What are some reasons overfitting can occur?
Fitting noise points; not enough representative data.
What is MDL?
The minimum description length (MDL) principle is a formalization of Occam’s razor in which the best hypothesis for a given set of data is the one that leads to the best compression of the data. (Wikipedia)
What is a validation set used for when building decision trees?
The data set is divided into two smaller subsets: one for training (2/3) and the other for estimating the generalization error (1/3).
What is prepruning?
A method used on decision trees to stop growing the tree before it is fully grown.
What is post pruning?
Pruning a decision tree after it has been constructed.
What are 4 methods used for evaluating the performance of a classifier?
1. Holdout (part of the data for training, the rest for testing)
2. Random subsampling (repeating holdout several times)
3. Cross-validation (each record is used the same number of times for training and exactly once for testing)
4. Bootstrap (sampling with replacement)
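A minimal Python sketch of how holdout and k-fold cross-validation splits can be produced; the record list, the 2/3 fraction, and k = 10 are illustrative assumptions:

import random

def holdout_split(records, train_frac=2/3, seed=0):
    # holdout: shuffle once, then cut into a training and a test portion
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

def kfold_splits(records, k=10):
    # cross-validation: each record is tested exactly once and
    # used for training in the other k-1 folds
    folds = [records[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, test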
What is a rule antecedent?
The condition for a rule.
What is a rule consequent?
The end result for a rule (a class).
In a rule set, what is coverage?
The fraction of records in a data set that trigger the rule (satisfy the antecedent).
e.g. rule is triggered 5 times out of a data set of size 10 would yield a coverage of 0.5.
In a rule set, what is accuracy?
The fraction of records triggered by the rule whose class labels equal the rule's consequent y.
In other words, of the records that satisfy the antecedent, the fraction that also satisfy the consequent.
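A minimal sketch of both metrics, assuming a rule is an (antecedent, consequent) pair where the antecedent is a boolean predicate over a record:

def coverage(rule, records):
    antecedent, _ = rule
    return sum(1 for r in records if antecedent(r)) / len(records)

def accuracy(rule, records):
    antecedent, consequent = rule
    triggered = [r for r in records if antecedent(r)]
    if not triggered:
        return 0.0
    return sum(1 for r in triggered if r["class"] == consequent) / len(triggered)

# hypothetical example: a rule that fires when income > 50 and predicts "yes"
# rule = (lambda r: r["income"] > 50, "yes")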
What is a mutually exclusive rule?
The rules are independent of each other.
Every record is covered by at most one rule (no record is covered by two or more rules).
What are exhaustive rules?
Accounts for every possible combination of attribute values. Each record is covered by at least one rule.
What is a default rule?
It has an empty antecedent and a default class. It is triggered when all other rules have failed (if used).
What are ordered rules?
Rules that are ordered in decreasing priority (by whatever criterion is defined, such as accuracy, coverage, etc…). Also known as a decision list.
What are the two methods for extracting classification rules?
1. Direct methods, which extract rules directly from the data. 2. Indirect methods, which extract rules from other classification models, such as decision trees.
Explain the sequential covering algorithm.
1. Start from an empty rule. 2. Grow a rule using the Learn-One-Rule function. 3. Remove the training records covered by the rule. 4. Repeat steps 2 and 3 until a stopping criterion is met.
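A minimal sketch of that loop, assuming a learn_one_rule function is supplied (its greedy search is not shown here) and rules follow the (antecedent, consequent) form above:

def sequential_covering(records, target_class, learn_one_rule):
    rules = []
    remaining = list(records)
    while remaining:
        rule = learn_one_rule(remaining, target_class)   # grow one rule
        if rule is None:                                  # stopping criterion
            break
        antecedent, _ = rule
        if not any(antecedent(r) for r in remaining):
            break
        rules.append(rule)
        # remove the training records covered by the rule, then repeat
        remaining = [r for r in remaining if not antecedent(r)]
    return rules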
What are the two rule growing strategies?
- General to specific
- Specific to general
What are some advantages of rule based classifiers?
- As highly expressive as decision trees
- Easy to interpret
- Can classify new instances rapidly
- Performance is comparable to decision trees
What is a rote-learner?
Memorizes the entire training data and performs classification only if attributes of a record match one of the training examples exactly
What is an instance-based classifier?
Stores the training records and uses them to predict the class label of unseen cases.
What is a Voronoi diagram?
It’s essentially a model of nearest-neighbour classification: it partitions the space into regions around each training point. Points on an edge are equidistant from two parent points; points at an intersection are equidistant from three (or more) parents.
What is the algorithm for k-nearest neighbour?
1. Compute the distance between the test point and the training records (e.g. Euclidean distance). 2. Identify the k nearest neighbours. 3. Determine the class from the nearest-neighbour list: take the majority vote of class labels among the k nearest neighbours, optionally weighting each vote according to distance.
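A minimal KNN sketch with Euclidean distance and an unweighted majority vote; the (vector, label) record format is an assumption:

import math
from collections import Counter

def knn_classify(train, x, k=3):
    # sort training records by Euclidean distance to the query point x
    nearest = sorted(train, key=lambda rec: math.dist(rec[0], x))[:k]
    # majority vote among the k nearest neighbours
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# e.g. knn_classify([([1.0, 2.0], "A"), ([5.0, 6.0], "B")], [1.5, 2.5], k=1)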
What 3 things do KNN classifiers require?
-The set of stored records -Distance metric to compute the distance between records -The value of k, the number of nearest neighbours to retrieve.
What can be the issue with choosing the value of K for KNN?
- If k is too small, the model is sensitive to noise points - If k is too large, the neighbourhood may include points from other classes
Explain scaling issues with distance based classifiers.
Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes. E.g. height may vary from 1.5m to 1.8m while weight may vary from 90lb to 300lb.
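A minimal min-max scaling sketch for one attribute column (a constant column is mapped to 0 to avoid division by zero):

def min_max_scale(column):
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

# e.g. min_max_scale([90, 150, 300]) maps the weight range onto [0, 1],
# so weight no longer dominates height in a Euclidean distance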
What are some of the characteristics of a KNN classifier?
- It’s a lazy learner - It does not build a model explicitly - Classifying unknown records can be relatively expensive
What is a Bayes classifier?
A probabilistic framework for solving classification problems. It uses conditional probability.
What is Bayes theorem?
P(Y | X) = P(X | Y) P(Y) / P(X): the posterior probability of class Y given attributes X equals the class-conditional probability times the prior, divided by the evidence.
What is a Naive Bayes Classifier?
It assumes independence among attributes.
Why is a large data set required for Bayes classifiers?
Smaller data sets mean individual rows have more influence on the probability estimates.
Repeated passes over the data are also needed.
How is the probability for continuous attributes estimated in Bayes theorem?
Discretize the range of values into bins
- Can use a two-way split (A < v) or (A > v)
- Or use probability density estimation:
assume the values follow a normal distribution
use the data to estimate the parameters of the distribution
use it to estimate the conditional probability
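A minimal sketch of the normal-density estimate, using the class-conditional sample mean and variance:

import math

def gaussian_density(v, values):
    # estimate the distribution's parameters from one class's values
    mu = sum(values) / len(values)
    var = sum((x - mu) ** 2 for x in values) / (len(values) - 1)
    # evaluate the normal density at v as the conditional probability estimate
    return math.exp(-((v - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

# hypothetical use: gaussian_density(120, incomes_of_class_yes)
# estimates P(income = 120 | class = yes)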
What are some of the characteristics of Naive Bayes classifiers?
Robust to noise points
Can handle missing values by ignoring the instance during probability estimate calculations
Robust to irrelevant attributes
Independence assumption may not hold for some attributes (but other techniques, such as Bayesian belief networks, can be used instead)
What is a Bayesian Belief Network?
A classifier similar to Naive Bayes, but it allows attributes to be dependent on one another.
This approach is more sophisticated, but more realistic.
How is a Bayesian Belief Network represented?
- Create a directed acyclic graph (DAG) encoding the dependence relationships among a set of variables
- Create a probability table associating each node with its immediate parent nodes.
What is the issue with an ANN that has many nodes?
What is the solution?
The more nodes you have, the more likely you are going to get stuck in a local minimum.
Solutions:
- Reduce the number of nodes
- Or, repeat the model building process several times
What are the properties of a feedforward ANN?
Each layer only connects to the next layer.
There are no connections between nodes within the same layer and no backward connections.
What is an Acyclic Network (ANN)?
The connections between nodes in the same layer are cut out.
A given layer can still be connected to many other layers (unlike feedforward), as long as no cycles are formed.
What is the typical setup for an ANN with respect to classes and attributes?
The number of input nodes is equal to the # of attributes.
The number of output nodes is equal to the # of classes (with the exception of 2 classes – just uses 1 output node).
Why is it important to remove redundant or irrelevant attributes in an ANN?
The # of attributes determines the # of input nodes, so it's important to remove redundant or irrelevant attributes to keep down the # of connections and help avoid local minima.
How is the output of an ANN interpreted?
A high value at an output node suggests a high probability of the input belonging to the class associated with the output node.
Describe the back propagation algorithm used for ANNs.
1. Present a sample to the input nodes.
2. Propagate the data through the layers.
3. Calculate the results at the output nodes.
4. Determine the error at the output nodes.
5. Propagate the error backwards to adjust the weights.
6. Repeat until a stopping criterion is satisfied.
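A minimal numpy sketch of one backpropagation step for a single-hidden-layer network with sigmoid activations; the weight shapes and learning rate are illustrative assumptions:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, target, W1, W2, lr=0.1):
    h = sigmoid(W1 @ x)                         # propagate through hidden layer
    y = sigmoid(W2 @ h)                         # results at the output nodes
    err_out = (y - target) * y * (1 - y)        # error at the output nodes
    err_hid = (W2.T @ err_out) * h * (1 - h)    # error propagated backwards
    W2 -= lr * np.outer(err_out, h)             # adjust the weights
    W1 -= lr * np.outer(err_hid, x)
    return W1, W2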
What are the 3 types of ANN learning?
Correlation
Competitive learning
Feedback-based
Describe the Correlation learning method for ANNs.
Correlation (similar things should be more similar and less similar things should be made less similar)
- When different neurons fire at the same time, the weight representing the connection between them should be increased (associative learning).
Describe the competitive learning, learning method for ANNs.
Competitive learning
- For each sample (set of input values), one node will be the best match. The weights between this winner and the input nodes should be increased.
Describe the Feedback-based learning method for ANNs.
Correct behaviour should be rewarded by increasing weights.
What are some applications of neural networks?
pattern association
character recognition
image compression
classification
forecasting
optimization
etc…
What are some characteristics of ANNs?
Slow
Poor interpretability of results
Able to approximate any target function
Can ignore irrelevant or redundant attributes
Easy to parallelize
May converge to a local minimum (can be mitigated with simulated annealing)
What is an epoch with respect to ANNs?
A full application of a complete data set to a neural network.
What is simulated annealing?
A noise term (acting like a momentum term) used over multiple iterations to avoid falling into a local minimum. At first the term is very noisy, and it gets less noisy as the network becomes more accurate.
How many training examples are required for an ANN?
The number of training examples should be larger than the # of weights divided by (1 - accuracy):
training examples > weights / (1 - accuracy)
e.g. a network with 100 weights and a target accuracy of 0.9 needs more than 100 / 0.1 = 1000 training examples.
How is an ANN pruned?
Train a neural network.
While the performance doesn’t degrade by more than a threshold value {
- delete a node or connection and retrain the network
}
Connections with small weights (irrelevant or redundant attributes) are pruned.
What preprocessing techniques are required for ANNs?
- remove redundant or irrelevant features
- transform to numerical values
- normalize data to the range of 0 to 1 or -1 to 1
How do support vector machines represent the decision boundary?
Using a subset of the training examples, known as support vectors.
Describe Maximum Margin Hyperplanes for SVMs.
There are infinitely many hyperplanes (ways to split the classes in a model).
The SVM must choose one of the hyperplanes to represent its decision boundary, based on how well they are expected to perform on test examples.
Trying to maximize the width of the “road” / margin.
How is the decision boundary selected in SVMs?
The goal is to maximize the width of the margin “road”.
Why is it important for an SVM to maximize the margin of their decision boundary?
To ensure that their worst-case generalization errors are minimized.
What are support vectors?
Training points that lie closest to the decision boundary, on the edge of the maximized margin. SVMs represent the boundary using this subset of training examples.
What approach can be used for SVMs when decision boundaries aren’t as clear?
Allow for some contamination by using slack variables.
What approach can be used if a support vector machine is not linear?
Math can be used to map the data to a new space and the new data can be used to classify.
In this case, the data needs to be numerical.
Why are support vector machines a popular choice of classifer?
Unlike ANNs, they don’t get stuck in a local minimum, especially when the data is transformed to a new space.
What is required for data to be transformed to a new space when using SVMs?
The data must be linearly separable in the new space.
What is the difference between a global model and a local model?
A global model fits a single model over the whole input space (e.g. SVMs).
A local model combines smaller models fitted to regions of the space (e.g. KNN and ANNs).
Does an SVM try to find a global minimum or employ a greedy strategy to search the hypothesis space?
SVMs try to find the global minimum, unlike neural networks, which employ a greedy strategy to search the hypothesis space.
What is an ensemble method for classification?
Improves classification accuracy by aggregating the predictions of multiple classifiers.
Describe a high-level overview of how ensemble methods work.
An ensemble method constructs a set of base classifiers from training data and performs classification by taking a vote on the predictions made by each base classifier.
Using ensemble methods, what can be done if one classifier performs better than the others?
Weight the votes of the classifiers so that the better-performing classifier has a higher weight.
What is bagging with respect to data mining and ensemble methods?
Bagging (bootstrap aggregating): sample repeatedly with replacement from the original data to create new training sets, train a classifier on each, and combine their predictions by voting.
Reduces variance and helps to avoid overfitting.
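A minimal bagging sketch; the `train` function (builds a model from records) and the callable-model interface are assumptions:

import random
from collections import Counter

def bagging(records, train, n_models=10, seed=0):
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # bootstrap sample: same size as the original data, with replacement
        sample = [rng.choice(records) for _ in range(len(records))]
        models.append(train(sample))
    def predict(x):
        # aggregate by taking a majority vote over the base classifiers
        return Counter(m(x) for m in models).most_common(1)[0][0]
    return predict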
What is boosting with respect to data mining and ensemble methods?
Place more emphasis on specific examples that are difficult to classify: assign them a higher weight, i.e. a greater probability of being selected.
Records that are wrongly classified will have their weights increased.
Records that are classified correctly will have their weights decreased.
Describe the AdaBoost method used in ensembles.
AdaBoost creates many classifiers / models by repeatedly drawing samples.
Samples that are easy to classify get a lower weight, and ones that are harder to classify get a higher weight.
If any intermediate round produces an error rate higher than 50%, the weights are reverted and the resampling procedure is repeated.
Each classifier also gets a weight (based on its error rate).
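A minimal sketch of one round's weight update, assuming `correct` flags whether each record was classified correctly and `eps` is the classifier's weighted error:

import math

def adaboost_update(weights, correct, eps):
    if eps > 0.5:
        # revert to uniform weights and resample, as described above
        return [1.0 / len(weights)] * len(weights)
    eps = max(eps, 1e-10)                       # guard against log(1/0)
    alpha = 0.5 * math.log((1 - eps) / eps)     # this classifier's vote weight
    # wrongly classified records gain weight, correct ones lose weight
    new = [w * math.exp(alpha if not ok else -alpha)
           for w, ok in zip(weights, correct)]
    total = sum(new)
    return [w / total for w in new]             # renormalize to sum to 1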
What conditions must be satisfied for ensemble methods to improve the overall classification accuracy?
1) the different classifiers make different mistakes in the data
2) the different classifiers perform better than random guessing
Exam question: think about how to work with noisy data and what this means with respect to bias and variance.
A model with high variance and low bias tends to generalize new test instances well, but is susceptible to overfitting noisy data.
If data is noisy, perhaps it’s better to have high bias, and lower variance.
Choice of classifier is important.
Bagging and Boosting can help.
What is bias?
Bias is the systematic error of a model: how far its predictions deviate, on average, from the ground truth.
The stronger the assumptions made by a classifier about the nature of its decision boundary, the larger the classifier's bias will be.
Design choices, such as the choice of algorithm, can introduce bias too.
What is variance?
Variance is a measure of how much a model's predictions spread, i.e. how much they change when the model is trained on different samples of the data.
Describe a model with high bias, but low variance.
A model constructed with high bias and low variance tends to underfit the training data. Both underfitting and overfitting can lead to a model that performs poorly.
Describe a model with high variance and low bias.
A model with high variance and low bias tends to generalize new test instances well, but is susceptible to overfitting noisy data.
What happens to the bias and variance as the complexity of a model grows (for example, the number of nodes in a decision tree).
If you increase the # of nodes (the complexity of the tree) the variance will increase and the bias will decrease.
Describe overfitting with respect to the bias-variance decomposition.
Overfitting is modelling a random noise component in the data (model is too complex).
Increasing the complexity of the model means you have to estimate more parameters, and there is a greater probability of error.
Simpler models tend to have low variance and potentially higher bias.
Visualization can help you to pick a good model.
What is the class imbalance problem?
Data sets with imbalanced class distributions.
E.g. Credit card fraud. It’s very rare that fraud exists, but when it does, it’s important and should be given a higher weight.
What is accuracy measure?
It’s a metric used to compare the performance of classifiers.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
What is a confusion matrix?
It shows how accurate your model is by presenting the TP, FN, FP, and TN counts for a given classifier in matrix form.
What is a cost matrix?
A cost matrix assigns a cost to the TP, FN, FP, and TN outcomes. It's useful for imbalanced classes: you can assign a high cost to instances that are classified incorrectly.
The goal is to have high accuracy and low cost.
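A minimal sketch computing accuracy and total cost from the four counts; the cost values in the example are made up:

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def total_cost(tp, tn, fp, fn, cost):
    return (tp * cost["TP"] + tn * cost["TN"]
            + fp * cost["FP"] + fn * cost["FN"])

# e.g. fraud detection might use {"TP": 0, "TN": 0, "FP": 1, "FN": 100}
# so that missing a fraud case (FN) costs far more than a false alarm (FP)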
What is precision?
The fraction of records that actually turn out to be positive among those the classifier has declared the positive class.
TP / ( TP + FP )
What is recall?
Recall measures the fraction of positive examples correctly predicted by the classifier.
r = TP / ( TP + FN )
What is true negative rate (specificity)?
The fraction of negative examples correctly predicted by the model.
TNR = TN / (TN + FP)
What is true positive rate?
The fraction of positive examples predicted correctly by the model.
TPR = TP / ( TP + FN )
What is F measure?
A measure that balances precision and recall (their harmonic mean); it is high only when both are high.
F1 = 2 x TP / ( 2 x TP + FP + FN )
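A minimal sketch of the three metrics from the raw counts:

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_measure(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn)

# e.g. tp=8, fp=2, fn=4 gives precision 0.8, recall 0.67, F1 0.73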
What is a learning curve?
It shows how accuracy changes with varying sample size.
Requires a sampling schedule.
In the graph there is a horizontal bar near the top, which is an upper limit on accuracy. You can never surpass this bar due to noise, etc… in the data.
* on exam
What is a ROC curve?
Receiver Operating Characteristic (ROC)
It plots the TP rate (y-axis) against the FP rate (x-axis) of a given model.
It also allows for relative comparison across different models (each model is represented by a curve).
Using a ROC curve, how is one model compared to another?
The area under the curve tells you how the model performed. The ideal area is one. You want the curve to be as close to the upper left corner as possible (high TP rate and low FP rate).
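A minimal sketch that sweeps a score threshold over (score, label) pairs to produce the ROC points, then estimates the area under the curve with the trapezoid rule; the "+" / "-" label convention is an assumption:

def roc_points(scored):
    ranked = sorted(scored, key=lambda s: -s[0])    # highest score first
    n_pos = sum(1 for _, y in ranked if y == "+")
    n_neg = len(ranked) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, y in ranked:             # lower the threshold one record at a time
        if y == "+":
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))     # (FP rate, TP rate)
    return points

def auc(points):
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))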
What would the model of a classifier that makes random guesses look like on a ROC curve?
A line along the main diagonal connecting (0,0) and (1,1).
What is association rule mining?
Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.
Define itemset with respect to association rule mining.
A collection of one or more items.
E.g: { Milk, Bread, Diaper }
Define support count with respect to association rule mining.
The frequency of occurrence of an itemset.
Denoted by σ
e.g. σ ( { milk, bread, diaper } ) = 2
Define support with respect to association rule mining.
The fraction of transactions that contain an itemset.
s ( { milk, bread, diaper } ) = 2/5
Define frequent itemset with respect to association rule mining.
An itemset whose support is greater than or equal to a minsup threshold.
e.g. the items occur together at least 10 times in the database.
What is an association rule with respect to association rule mining?
An implication expression of the form X → Y where X and Y are itemsets.
e.g. { milk, diaper } → { beer }
What are two rule evaluation metrics with respect to association rule analysis?
Support ( s )
Confidence ( c )
What is support ( s ) with respect to association rule analysis?
The fraction of transactions that contain both X and Y.
Support determines how often a rule is applicable to a given data set.
{ X } → { Y }
What is confidence with respect to association rule analysis?
Measures how often items in Y appear in transactions that contain X.
{ X } → { Y }
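A minimal sketch of both metrics over a list of transactions, each represented as a Python set of items:

def support(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(X, Y, transactions):
    return support(X | Y, transactions) / support(X, transactions)

# e.g. confidence({"milk", "diaper"}, {"beer"}, transactions) measures how
# often beer appears in transactions that contain milk and diaper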
Why is support important with respect to association rule analysis?
Rules that have low support may only occur by chance.
A low support rule is likely to be uninteresting from a business perspective.
Why is confidence important with respect to association rule analysis?
Confidence measures the reliability of the inference made by a rule.
The higher the confidence, the more likely it is for Y to be present in transactions that contain X.
Confidence also provides an estimate of the conditional probability of Y given X.
What is the 2-step approach to association rule mining?
1. Frequent itemset generation: generate all itemsets whose support >= minsup.
2. Rule generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset.
What is the goal of association rule mining?
Given a set of transactions T, find rules having:
support >= minsup threshold
confidence >= minconf threshold
Is a brute-force approach feasible with respect to association rule mining?
No, it's computationally prohibitive. It would involve:
- listing all possible association rules
- computing the support and confidence for each rule
- pruning rules that fail the minsup and minconf thresholds
What is an itemset lattice?
A series of connected nodes that show all possible combinations of an itemset, and sometimes a subset of combinations.
What is the objective of frequent itemset generation with respect to association rule analysis.
Find all the itemsets that satisfy the minsup threshold. These itemsets are called frequent itemsets.
What is the objective of rule generation with respect to association rule analysis?
Extract all the high-confidence rules from the frequent itemsets found in the previous step (frequent itemset generation). These rules are called strong rules.
What are the 3 strategies for Frequent Itemset Generation?
- Reduce the number of candidates (M) using pruning
- Reduce the number of transactions by reducing the size of N as the size of the itemset increases
- Reduce the number of comparisons by using efficient data structures to store candidates or transactions, so there is no need to match every candidate against every transaction.
What is the Apriori Principle with respect to association rule analysis?
If an itemset is frequent, then all of its subsets must also be frequent.
If {A, B, C} is frequent, then {A, B}, {A, C}, and {B, C} must also be frequent.
Conversely, if an itemset is infrequent, then all of its supersets must be infrequent too.
e.g. if {A, B} is infrequent, then {A, B, C} must be infrequent as well.
Infrequent itemsets can therefore be pruned.
What is support based pruning with respect to association rule analysis?
Trimming the exponential search space based on the support measure.
What is the anti-monotone property with respect to association rule analysis and support-based pruning?
The support for an itemset never exceeds the support for its subsets.
Describe the Apriori algorithm used for association rule mining.
1. Let k = 1.
2. Generate frequent itemsets of length 1.
3. Repeat until no new frequent itemsets are identified:
- Generate length (k+1) candidate itemsets from the length-k frequent itemsets.
- Prune candidate itemsets containing subsets of length k that are infrequent.
- Count the support of each candidate by scanning the DB.
- Eliminate candidates that are infrequent, leaving only those that are frequent.
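A minimal Apriori sketch over set-valued transactions with an absolute minsup count; candidate generation uses a simple join-then-prune scheme:

from itertools import combinations

def apriori(transactions, minsup):
    def support_count(c):
        return sum(1 for t in transactions if c <= t)

    items = {i for t in transactions for i in t}
    level = {frozenset([i]) for i in items
             if support_count(frozenset([i])) >= minsup}
    frequent = set(level)
    k = 1
    while level:
        # join: combine frequent k-itemsets into (k+1)-itemset candidates
        candidates = {a | b for a in level for b in level
                      if len(a | b) == k + 1}
        # prune: drop candidates with an infrequent k-subset (Apriori principle)
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k))}
        # count support by scanning the DB; keep only the frequent candidates
        level = {c for c in candidates if support_count(c) >= minsup}
        frequent |= level
        k += 1
    return frequent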
Describe how the Apriori algorithm works.
Every item is considered as a candidate 1-itemset, e.g. {cola}. Discard itemsets that don't meet the support threshold.
Do the same with 2-itemsets using only the frequent 1-itemsets (because the Apriori principle holds).
Repeat up to the maximum number of items in an itemset.
Describe the Apriori pruning algorithm for association rule analysis.
Initially make a pass over the data to determine the support of each item and find the frequent 1-itemsets.
Iteratively generate new candidate k-itemsets using the (k-1)-itemsets found in the previous iteration.
Make an additional pass over the data set to count the support of the candidates, and eliminate candidates whose support is less than minsup.
Terminate when no new frequent itemsets are generated.
In the Apriori frequent itemset generation algorithm, what is a hash tree used for?
Determining whether each enumerated k-itemset corresponds to an existing candidate itemset.
What factors affect the complexity of the Apriori algorithm?
Support threshold (a lower threshold results in more itemsets being declared frequent).
Number of items (dimensionality: more space is required to store the support counts of items).
Number of transactions (Apriori makes repeated passes over the data set, so run time increases with the # of transactions).
Average transaction width (in dense data sets the average transaction width can be large).
What is a closed itemset with respect to association analysis?
An itemset is closed if none of its immediate supersets has the same support as the itemset.
What are two ways of representing frequent itemsets in a compact way?
- A binary table
- A lattice with a boundary where we found frequent items (all subsets will be frequent).
What is a maximal frequent itemset?
a frequent itemset for which none of its immediate supersets are frequent.
* When a border is drawn in a lattice to distinguish frequent from infrequent itemsets, the itemsets on the frequent side of the border are the maximal frequent itemsets: their immediate supersets (on the other side of the border) are infrequent.
What are maximal frequent itemsets useful for?
Providing a compact representation of frequent itemsets. They form the smallest set of itemsets from which all frequent itemsets can be derived.
What are closed frequent itemsets?
Provide a minimal representation of itemsets without losing their support information.
An itemset is a closed frequent itemset if it is closed and its support is greater than or equal to minsup.
What is the relationship between closed frequent itemsets, maximal frequent itemsets and frequent itemsets?
Maximal frequent itemsets are a subset of closed frequent itemsets. Closed frequent itemsets are a subset of frequent itemsets.