Final Exam - Theory Flashcards

Question

What methods does Text Mining use?

Answer 1

Information Retrieval. Pre-processing of text documents

Answer 2

Text Classification, Text Clustering or Text Summarization

Answer 3

Traditional data mining is structured. Text often has no real structure.

Answer 4

A document is represented as a "bag" of words.

Answer 5

There are many words in the English language.

Answer 6

Removing the stop words ("a", "the", "this", "that", ...)
Stemming (e.g. combining similar verb forms such as past and present tense)

Answer 7

Use TF-IDF
Weight = TF * IDF
TF = Term Frequency (how many times the term occurs)
IDF = Inverse Document Frequency = log(total documents / document frequency)

Answer 8

1. Get the text
2. Remove the stop words
3. Convert all the words to lowercase (optional step)
4. Stem the commonly associated words (e.g. interesting, interested -> interest)
5. Count the term frequency
6. Create an index file that lists all the terms and their frequencies. Sort it alphabetically.
7. Create the Vector Space Model: for each term, put its number of occurrences in the document's vector (1 if it occurs once, 3 if it occurs three times).
8. Compute the IDF: log(number of documents / number of documents the word appears in).
9. Compute the weight (TF * IDF)
10. Normalize so that no weight exceeds 1: divide each term's weight by the square root of the sum of all the squared weights.
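
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the stop-word list and the three toy documents are made up, stemming is skipped, and the helper names (tokenize, tfidf_vectors) are hypothetical.

```python
import math
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "the", "this", "that", "is", "on"}

def tokenize(text):
    """Lowercase, split on whitespace and drop stop words (stemming is omitted)."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def tfidf_vectors(docs):
    """Return (alphabetical vocabulary, one length-normalised TF-IDF vector per document)."""
    counts = [Counter(tokenize(d)) for d in docs]   # term frequency per document
    vocab = sorted(set().union(*counts))            # alphabetical index of all terms
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = {t: sum(1 for c in counts if t in c) for t in vocab}
    idf = {t: math.log(n_docs / df[t]) for t in vocab}
    vectors = []
    for c in counts:
        weights = [c[t] * idf[t] for t in vocab]    # weight = TF * IDF
        norm = math.sqrt(sum(w * w for w in weights))
        vectors.append([w / norm if norm else 0.0 for w in weights])
    return vocab, vectors

# Toy corpus (made up).
docs = ["the cat sat on the mat", "the dog sat on the log", "the cat chased the dog"]
vocab, vectors = tfidf_vectors(docs)
```

In this toy corpus "mat" appears in only one document, so it receives a larger weight in the first vector than "cat" or "sat", which each appear in two.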

Answer 9

Use cosine distance.
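
A minimal sketch of cosine distance between two term-weight vectors, taking cosine distance to mean 1 minus cosine similarity (one common convention). The function names are illustrative.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty (all-zero) vectors
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """0 for vectors pointing the same way, 1 for orthogonal vectors."""
    return 1.0 - cosine_similarity(u, v)
```

Two documents with no terms in common give orthogonal vectors and a distance of 1; a document compared with a longer copy of itself gives a distance of about 0, because only the direction of the vector matters.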

Answer 10

The set of data points that are very different than the remainder of the data

Answer 11

Find all the data points with anomaly scores greater than threshold that you have defined.

Answer 12

Fraud detection

Answer 13

Yes. Noise is random error.
Noise should be removed before outlier detection.
Outliers are interesting.

Answer 14

Novelty is eventually

Answer 15

Anomaly detection is unsupervised (like Clustering).

Answer 16

Build a profile of what is normal and then detect anything that is different

Answer 17

Global Outliers
Contextual Outliers
Collective Outliers

Answer 18

A point that deviates significantly from the rest of the data set. Issue: you need a suitable measure of deviation.

Answer 19

An outlier that deviates significantly within a selected context.
E.g. is 40 degrees Celsius an outlier? In winter, yes. In summer, no.

Answer 20

No single object looks like an outlier on its own, but a group of objects taken together does. Example (sports teams): a good player such as Neymar is comparable to Messi or Ronaldo individually, but put together in one good team they become an anomaly.

Answer 21

The objects are generated by a model.
Identify objects in low-probability regions of the model as outliers.
Two types: Parametric / Non-parametric

Answer 22

A model that describes the distribution of the data.
If something has low probability under the model, it is an outlier.
Find the mean and the standard deviation. Check each point's difference from the mean. If it is greater than a threshold, the point is an anomaly.
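
A minimal sketch of the mean and standard deviation check, assuming the data is roughly normally distributed. The readings and the threshold of 2 standard deviations are made up for illustration.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Flag values whose distance from the mean exceeds `threshold` standard deviations."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values identical: nothing deviates
    return [x for x in values if abs(x - mean) / std > threshold]

# Toy readings (made up); 50 is far from the rest.
readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
outliers = zscore_outliers(readings, threshold=2.0)
```

Here only the value 50 is flagged. The extreme value itself inflates the mean and the standard deviation, which is one reason the threshold needs care on small samples.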

Answer 23

Not always a normal distribution
Can be problematic for high-dimensional data

Answer 24

A histogram

Answer 25

The 'long tail' part of the histogram is considered the anomaly area of the model.
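
A minimal sketch of histogram-based detection: values that fall in sparsely populated bins are treated as anomalies. The bin width, the minimum count and the data are assumptions chosen for illustration.

```python
from collections import Counter

def histogram_outliers(values, bin_width, min_count):
    """Flag values that land in a bin holding fewer than `min_count` values."""
    bins = Counter(int(x // bin_width) for x in values)
    return [x for x in values if bins[int(x // bin_width)] < min_count]

# Toy data (made up): most values lie between 1 and 4.
values = [1, 2, 2, 3, 3, 3, 4, 4, 12, 25]
coarse = histogram_outliers(values, bin_width=5, min_count=2)
fine = histogram_outliers(values, bin_width=1, min_count=2)
```

With a bin width of 5 only 12 and 25 are flagged. With a bin width of 1 the normal value 1 is flagged as well, which shows how sensitive the result is to the choice of buckets.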

Answer 26

How to set the number of buckets (x-axis) to effectively capture the data

Answer 27

An anomaly that is in fact not an anomaly (our histogram is too detailed).

Answer 28

1. Distance-based
2. Density-based

Answer 29

An object is considered a distance-based outlier if its neighbourhood does not contain enough other points.
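
A minimal sketch of the distance-based definition. The neighbourhood radius and the minimum number of neighbours are parameters you have to choose; the values and points below are made up.

```python
import math

def distance_based_outliers(points, radius, min_neighbours):
    """Flag points with fewer than `min_neighbours` other points within `radius`."""
    outliers = []
    for i, p in enumerate(points):
        neighbours = sum(
            1 for j, q in enumerate(points)
            if i != j and math.dist(p, q) <= radius
        )
        if neighbours < min_neighbours:
            outliers.append(p)
    return outliers

# Toy data (made up): four points in a tight square and one far away.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8)]
outliers = distance_based_outliers(points, radius=2.0, min_neighbours=2)
```

Each corner of the square has three neighbours within the radius, while (8, 8) has none, so it is the only point flagged. This brute-force version compares every pair of points.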

Answer 30

An object is considered a density-based outlier if its density is much lower than that of its neighbours.

Answer 31

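General idea: for each point, calculate the density of its neighbourhood.
Compute the Local Outlier Factor (LOF): the average of the ratio of the density of the sample p to the density of its nearest neighbours.
Outliers are the points with a low LOF under this definition (density of p divided by the density of its neighbours). Note that the standard LOF definition inverts the ratio, so there outliers have a high LOF (well above 1).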

Answer 32

Density = k / distance to the k-nearest neighbours, or compare with the set of N - nearest neighbours
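
A minimal sketch of a relative-density score. It assumes one reading of the formula above: density is k divided by the sum of the distances to the k nearest neighbours. The score divides a point's density by the average density of its neighbours, so a low score suggests an outlier. All function names are illustrative.

```python
import math

def knn(points, i, k):
    """Indices of the k nearest neighbours of points[i], excluding itself."""
    others = [j for j in range(len(points)) if j != i]
    return sorted(others, key=lambda j: math.dist(points[i], points[j]))[:k]

def density(points, i, k):
    """Assumed definition: k / (sum of distances to the k nearest neighbours)."""
    total = sum(math.dist(points[i], points[j]) for j in knn(points, i, k))
    return k / total if total else float("inf")

def relative_density(points, i, k):
    """Density of points[i] divided by the average density of its neighbours."""
    neighbours = knn(points, i, k)
    avg = sum(density(points, j, k) for j in neighbours) / k
    return density(points, i, k) / avg

# Toy data (made up): four points in a tight square and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = [relative_density(points, i, k=2) for i in range(len(points))]
```

The four clustered points score about 1 because they are as dense as their neighbours. The isolated point scores far below 1. The sketch does not handle duplicate points, which would give an infinite density.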

Answer 33

It doesn't belong to a cluster.
There is a large distance between an object and its cluster.
It belongs to a very small or sparse cluster.

Answer 34

Use k-means to build clusters, then measure each point's distance to its closest centre. If the distance is higher than average, the point is likely an outlier.
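
A minimal sketch of the k-means approach in plain Python. The initial centres are supplied by hand to keep the run deterministic, and the `factor` parameter (how far above the average distance a point must be) is an assumption added for illustration.

```python
import math

def kmeans(points, centres, iterations=10):
    """Plain Lloyd's algorithm starting from the given initial centres."""
    for _ in range(iterations):
        clusters = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)), key=lambda c: math.dist(p, centres[c]))
            clusters[nearest].append(p)
        # Move each centre to the mean of its cluster (keep it if the cluster is empty).
        centres = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centre
            for cluster, centre in zip(clusters, centres)
        ]
    return centres

def kmeans_outliers(points, centres, factor=2.0):
    """Flag points whose distance to the closest centre exceeds factor * average distance."""
    dists = [min(math.dist(p, c) for c in centres) for p in points]
    avg = sum(dists) / len(dists)
    return [p for p, d in zip(points, dists) if d > factor * avg]

# Toy data (made up): two tight groups and one stray point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11), (5, 20)]
centres = kmeans(points, centres=[(0, 0), (10, 10)])
outliers = kmeans_outliers(points, centres)
```

The stray point (5, 20) is assigned to the second cluster but lies far from its centre, so it is the only point flagged. With factor=1.0 (strictly "higher than average") some ordinary members of the second cluster are flagged too, so the cut-off is a judgment call.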

Answer 35

Assign a cluster-based local outlier factor (CBLOF).
If p belongs to a large cluster: CBLOF = cluster size * similarity between p and its cluster
If p belongs to a small cluster: CBLOF = cluster size * similarity between p and the closest large cluster
Points with low CBLOF scores are suspected outliers.

Answer 36

High computational cost

Answer 37

To look for interesting relationships between objects in large datasets.

Answer 38

Find all rules that correlate the presence of one set of items with another set of items.
E.g., 80% of customers who buy {diapers} tend to buy {beer, milk}.

Answer 39

- An item: an item in a basket
- An itemset: a set of items, e.g. X = {milk, bread, cereal}
- A k-itemset: an itemset with k items
- A transaction: the items purchased in a basket; it may have a TID (transaction ID)
- A transactional dataset: a set of transactions

Answer 40

If they buy X, they will buy Y.

Answer 41

Support is a measure of how frequently an item appears in the set of transactions. E.g. half of the people at Woolworths have milk in their basket, so the support of milk is 0.5 or 50%.
Confidence is a measure of how likely an item is to be bought if another item is also bought (X -> Y). E.g. of the people who buy milk, 80% buy bread as well, so the confidence of milk -> bread is 0.8.
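
A minimal sketch of both measures on a made-up set of four baskets. The function names are illustrative.

```python
# Toy transactions (made up): each basket is a set of items.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "cereal"},
    {"milk", "cereal"},
    {"bread"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, y, transactions):
    """Of the transactions containing X, the fraction that also contain Y."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

milk_support = support({"milk"}, transactions)
milk_to_bread = confidence({"milk"}, {"bread"}, transactions)
```

Milk is in three of the four baskets (support 0.75). Two of those three also contain bread, so the confidence of milk -> bread is about 0.67.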

Answer 42

These are Strong Association Rules.

Answer 43

The minimum frequency we care about.
If minimum support equals 3, any item that occurs only 2 times is not important for our analysis.

Answer 44

Confidence(X -> Y) = P(Y | X) = P(X ∪ Y) / P(X)

Answer 45

The goal of association rule mining is to find all rules having
1. support ≥ min_sup threshold
2. confidence ≥ min_conf threshold

Answer 46

1. Apriori Algorithm
2. Frequent Pattern (FP) Growth Algorithm

Answer 47

1. Frequent Itemset Generation: get all itemsets whose support ≥ minsup
2. Rule Generation: generate high-confidence rules from each frequent itemset
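
A minimal sketch of the two steps. Step 1 is done by brute force over every possible itemset, which is only workable for tiny examples. The transactions and thresholds are made up.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def strong_rules(transactions, min_sup, min_conf):
    """Return (X, Y, confidence) for every rule X -> Y that meets both thresholds."""
    items = sorted({i for t in transactions for i in t})
    # Step 1: frequent itemset generation (brute force over all combinations).
    frequent = [
        frozenset(c)
        for k in range(1, len(items) + 1)
        for c in combinations(items, k)
        if support(frozenset(c), transactions) >= min_sup
    ]
    # Step 2: split each frequent itemset into X -> Y and keep high-confidence rules.
    rules = []
    for itemset in frequent:
        for k in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), k):
                lhs = frozenset(lhs)
                conf = support(itemset, transactions) / support(lhs, transactions)
                if conf >= min_conf:
                    rules.append((set(lhs), set(itemset - lhs), conf))
    return rules

# Toy transactions (made up).
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "cereal"},
    {"milk", "cereal"},
    {"bread", "eggs"},
]
rules = strong_rules(transactions, min_sup=0.5, min_conf=0.9)
```

With these thresholds the only rule that survives is cereal -> milk: every basket containing cereal also contains milk. The reverse rule, milk -> cereal, has a confidence of only about 0.67.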

Answer 48

If an itemset is frequent, then all of its subsets must also be frequent.
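
A minimal sketch of how this principle is used for pruning: a candidate k-itemset is only counted if all of its (k-1)-item subsets were found frequent. Support is expressed as an absolute count here, and the transactions are made up.

```python
from itertools import combinations

def apriori(transactions, min_support_count):
    """Return {itemset: support count} for every frequent itemset."""
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while candidates:
        # Count the support of each candidate and keep the frequent ones.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support_count}
        frequent.update(level)
        k += 1
        # Join step: combine frequent itemsets into candidates of size k.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: drop any candidate that has an infrequent (k-1)-subset.
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent

# Toy transactions (made up).
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "cereal"},
    {"milk", "cereal"},
    {"bread", "eggs"},
]
frequent = apriori(transactions, min_support_count=2)
```

Because {bread, cereal} occurs only once, the candidate {milk, bread, cereal} is pruned without its support ever being counted.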

Answer 49

- The choice of minimum support threshold
- Dimensionality (number of items) in the data set
- Size of database
- Average transaction width

Answer 50

Discover interesting relations between objects in large databases

Answer 51

The benefit of applying the Apriori Algorithm is that you can eliminate patterns that do not meet the minimum support threshold. This saves computation time, because calculating the support for different patterns in a large dataset is costly.

Answer 52

It's hard to choose an appropriate bin size for histograms. Too small: normal objects end up in outlier bins. Too large: outliers end up in frequent bins.

(77 cards)