Final Exam - Theory Flashcards

1
Q

What is web mining?

A

Data mining applied to the web: discovering patterns in web content, structure and usage data.

2
Q

Is web mining the same as web searching?

A

No, web mining is about finding patterns.

3
Q

What are the various Web-Data Types?

A

- Web pages
- Intra-page structures
- Inter-page structures
- Usage data
- Supplemental data
  - Profiles
  - Registration information

4
Q

What concepts and technologies are used by search engines?

A

- Crawlers / Index
- Profiles / Personalisation

5
Q

What’s a web crawler?

A

A spider: a program that traverses hyperlinks to discover and download web pages, which can then be indexed and ranked (e.g. by popularity).

6
Q

What is the process of a web crawler?

A
  1. Start with a seed
  2. Send crawlers
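
The seed-then-crawl process can be sketched as a breadth-first traversal. This is only an illustration: `crawl`, `get_links` and the in-memory `toy_web` dictionary are hypothetical names standing in for real HTTP fetching and link extraction, which the card does not describe.

```python
from collections import deque

def crawl(seeds, get_links, max_pages=100):
    """Breadth-first crawl starting from the seed URLs."""
    frontier = deque(seeds)   # URLs waiting to be fetched
    seen = set(seeds)         # URLs already queued, so nothing is fetched twice
    order = []                # URLs in the order they were visited
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        order.append(url)
        # get_links is a hypothetical fetch-and-parse step supplied by the caller.
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

# Toy in-memory "web" used instead of real network requests.
toy_web = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}
visited = crawl(["a"], lambda url: toy_web.get(url, []))
print(visited)
```

The `seen` set is what stops the crawler from looping forever on pages that link back to each other.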
7
Q

What is an issue with Web Crawling?

A

Time is wasted waiting for responses to requests.
To reduce this inefficiency, web crawlers use threads to fetch several pages at once.
Web crawlers use politeness policies to stop them from flooding sites with requests.

8
Q

What’s freshness?

A

Freshness is how up to date the crawler's stored copy of a page is. Web crawlers need to revisit pages to keep their copies fresh because web pages are always being added, deleted and modified.

9
Q

What is focused crawling?

A

Attempt to download only those pages that are about a particular topic. Popular pages tend to have links to other pages on the same topic. Crawlers use text classifiers to decide whether a page is on topic. Examples: Google Scholar (search by citations), Google Patents.

10
Q

What is the challenge for focused crawling?

A

Finding relevant links.

11
Q

How do we do relevance prediction? What strategies do we use?

A

Define a score as the conditional probability that a page is relevant given its text content.
- Parent-based: score a fetched page and extend that score to all URLs in that page
- Anchor-based: score each URL based on the anchor text of the link to that URL (“semantic linkage”)

12
Q

What is the purpose of personalization in web content mining?

A

Web access or content is tuned to fit the preferences of the user.
- Editorial and hand-curated: “editor’s pick”
- Simple aggregates (top 10, most popular): the same for all users
- Tailored to individual users: “we recommend for you”

13
Q

What is a utility matrix?

A

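A matrix of users against items whose entries are the known ratings. It is used for assigning recommendations: the recommender tries to assign a score to the blank entries, i.e. the things you haven’t seen.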

14
Q

What is a challenge of a recommendation system?

A

Sparsity. It is hard to recommend when you haven’t watched much and when people don’t give reviews.

15
Q

How do you overcome the challenge of sparsity in a recommendation system?

A

Be explicit:
- Ask people to rate items
- Doesn’t work well in practice
- How? Give a reward
Be implicit:
- Learn ratings from user actions
- E.g. a purchase implies a high rating

16
Q

What is a cold start?

A

The utility matrix is sparse. Cold start: new users have no ratings or history, so there is nothing to base a recommendation on.

17
Q

What approaches are there to recommender systems?

A

1) Content-based
2) Collaborative

18
Q

What is content-based recommendation?

A

Look at the preferences of the user (the features of items they have liked) and recommend items with similar content.

19
Q

What is an advantage of content-based approach?

A

- No need for data on other users
- Able to recommend to users with unique tastes
- Able to recommend new & unpopular items
- Able to provide explanations

20
Q

What is a con of content-based approach?

A
  - Finding the appropriate features is hard
  - Recommendations for new users (there is no profile to build on yet)
  - Overspecialization (too specific: you watch 1 documentary, you get many documentaries)
21
Q

What is collaborative filtering?

A

Find users similar to the target user, look at what they have rated highly, and recommend those items to the target user.

22
Q

How do you find similar users?

A

A couple of methods:
- Jaccard Similarity
- Cosine Similarity
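
A minimal sketch of both measures, assuming each user is stored as a dictionary from item to rating (the users `alice` and `bob` and their ratings are made up for illustration). Jaccard only looks at which items were rated; cosine also uses the rating values and treats unrated items as 0.

```python
import math

def jaccard_similarity(items_a, items_b):
    """Size of the intersection divided by size of the union of the rated items."""
    a, b = set(items_a), set(items_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def cosine_similarity(ratings_a, ratings_b):
    """Cosine of the angle between two rating vectors (missing ratings count as 0)."""
    common = set(ratings_a) & set(ratings_b)
    dot = sum(ratings_a[i] * ratings_b[i] for i in common)
    norm_a = math.sqrt(sum(v * v for v in ratings_a.values()))
    norm_b = math.sqrt(sum(v * v for v in ratings_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

alice = {"m1": 4, "m2": 5, "m3": 1}
bob = {"m1": 5, "m2": 5, "m4": 2}

print(jaccard_similarity(alice, bob))   # 2 shared items out of 4 distinct items
print(cosine_similarity(alice, bob))
```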

23
Q

What is a pro/con of collaborative filtering?

A

Pro
- Works for any kind of item
Con
- Need enough users in the system to find a match
- Hard to find users that have rated the same items
- First rater: hard to recommend an item that hasn’t been previously rated
- Popularity bias: popular items tend to dominate the recommendations

24
Q

How does PageRank work?

A

Rank pages based on backlinks: the number of pages that point to a webpage.
Weighting: each backlink is weighted by the importance of the page that links to it.
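
The backlink weighting can be sketched as a repeated redistribution of rank along links. This is an illustrative version, not the production algorithm: the damping factor of 0.85, the fixed iteration count, the handling of pages with no outgoing links and the `toy_links` graph are all assumptions that the card does not mention.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to.

    Every linked-to page must also appear as a key of `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # start with equal importance
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page shares its rank equally among the pages it links to.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outlinks spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank

toy_links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_links)
print(sorted(ranks, key=ranks.get, reverse=True))
```

In this toy graph "c" has the most backlinks, and "a" outranks "b" and "d" because its single backlink comes from the important page "c".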

25
Q

What methods does text mining use?

A

Information retrieval. Pre-processing of text documents.

26
Q

What tasks does text mining perform?

A

Text Classification, Text Clustering or Text Summarization

27
Q

What is an issue with text mining vs traditional data mining?

A

Traditional data mining works on structured data. Text often has no real structure.

28
Q

What is a Vector Space Model?

A

A document is represented as a “bag” of words.

29
Q

What is a problem with Vector Space Model?

A

There are many words in the English language, so the document vectors become very high-dimensional and sparse.

30
Q

How do you fix the limitations of the Vector Space Model?

A

- Removing the stop words (“a, the, this, that …”)
- Stemming (e.g. combining similar verb forms such as past and present tense)

31
Q

How do you assign the weight (importance) of a term in text-mining?

A

Use TF-IDF.
Weight = TF * IDF
TF = Term Frequency (how many times the term occurs in the document)
IDF = Inverse Document Frequency = log(total documents / document frequency)

32
Q

What are the steps involved in text mining?

A
  1. Get the text
  2. Remove the stop words
  3. Convert all the words to lowercase (optional step)
  4. Stem related words to a common form (e.g. interesting, interested -> interest)
  5. Count the term frequency
  6. Create an index file, which has all the terms and all their frequencies. Sort it alphabetically.
  7. Create the Vector Space Model: for each term, put its count in the document's vector (occurs once, put 1; occurs 3 times, put 3).
  8. Compute the IDF: log(how many documents there are / how many documents this word appears in).
  9. Compute the weight (tf * idf)
  10. Normalize so that each weight is at most 1. For each term, the weight is divided by the square root of the sum of all the weights squared.
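
The steps above can be sketched end to end as follows. Several things here are assumptions for illustration only: the tiny `STOP_WORDS` set, whitespace tokenization, the base-10 logarithm, the three sample sentences, and the omission of stemming (step 4), which would normally need a stemming library.

```python
import math
from collections import Counter

STOP_WORDS = {"a", "the", "this", "that", "is", "on"}  # tiny illustrative list

def tokenize(text):
    """Lowercase, split on whitespace and drop stop words (no stemming here)."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def tfidf_vectors(documents):
    counts = [Counter(tokenize(doc)) for doc in documents]   # term frequencies
    vocabulary = sorted(set().union(*counts))                # alphabetical index
    n_docs = len(documents)
    # IDF = log(total documents / documents containing the term)
    idf = {t: math.log10(n_docs / sum(1 for c in counts if t in c))
           for t in vocabulary}
    vectors = []
    for c in counts:
        weights = [c[t] * idf[t] for t in vocabulary]        # tf * idf
        length = math.sqrt(sum(w * w for w in weights))
        # Normalize: divide by the square root of the sum of squared weights.
        vectors.append([w / length if length else 0.0 for w in weights])
    return vocabulary, vectors

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "this cat chased that dog",
]
vocab, vecs = tfidf_vectors(docs)
print(vocab)
print([round(w, 3) for w in vecs[0]])
```

A term that appears in every document gets an IDF of log(1) = 0, so it contributes nothing to the weight, which is the intended effect.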
33
Q

How do you measure the similarity between two documents?

A

Use cosine similarity: the cosine of the angle between the two document vectors.

34
Q

What are anomalies/outliers?

A

The set of data points that are very different than the remainder of the data

35
Q

What is the task when looking for anomalies/outliers?

A

Find all the data points with anomaly scores greater than a threshold that you have defined.

36
Q

What are some applications for Anomaly/Outlier detection?

A

Fraud detection

37
Q

Are outliers different from noise data?

A

Yes. Noise is random error.
Noise should be removed before outlier detection.
Outliers are interesting.

38
Q

What’s the difference between outlier detection vs novelty detection?

A

Novelty is eventually

39
Q

What is a challenge of anomaly detection?

A

Anomaly detection is unsupervised (like Clustering).

40
Q

How do you build an anomaly detector?

A

Build a profile of what is normal and then detect anything that is different

41
Q

What kind of outliers are there?

A

- Global Outliers
- Contextual Outliers
- Collective Outliers

42
Q

What is a global outlier?

A

A point that significantly deviates from the rest of the data set. Issue: you need an appropriate way to measure the deviation.

43
Q

What is a contextual outlier?

A

An outlier that deviates significantly based on a selected context.
E.g. is 40 degrees Celsius an outlier? In winter, yes. In summer, no.

44
Q

What are collective outliers?

A

No single object looks like an outlier on its own, but when you bring many objects together, the group starts to look like an outlier. Example (sports teams): a good player such as Neymar is comparable to Messi or Ronaldo individually, but when you put them together in one good team, the team becomes an anomaly.

45
Q

What is a statistical scheme?

A

The objects are assumed to be generated by a model.
Identify objects in low-probability regions of the model as outliers.
Two types: parametric and non-parametric.

46
Q

What is a parametric model?

A

A model that describes the distribution of the data.
If something has low probability under the model, then it is an outlier.
Find the mean and the standard deviation. Check each point's difference from the mean. If it is greater than a threshold, then it is an anomaly.
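
A minimal sketch of the mean and standard deviation check, assuming the data is roughly normal. The function name, the sample `readings` and the threshold of 2 standard deviations are illustrative choices, not values given on the card.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return the values whose distance from the mean exceeds
    `threshold` standard deviations."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)   # population standard deviation
    if std == 0:
        return []                     # all values identical, nothing stands out
    return [v for v in values if abs(v - mean) / std > threshold]

readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
outliers = zscore_outliers(readings, threshold=2.0)
print(outliers)
```

With so few points, the extreme value itself inflates the standard deviation, so a threshold of 3 would miss it here. That sensitivity is one reason the threshold has to be chosen with care.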

47
Q

What is a limitation of a parametric scheme?

A

- The data is not always normally distributed
- Can be problematic for high-dimensional data

48
Q

What do you use to model a non-parametric scheme?

A

A histogram

49
Q

How do we interpret a histogram for anomaly detection?

A

The ‘long tail’ part of the histogram is considered the anomaly area of the model.
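
One way to sketch the histogram idea is to flag values that land in rarely populated bins. The equal-width bins, the `n_bins` and `min_count` parameters and the sample `amounts` are assumptions made for this example.

```python
def histogram_outliers(values, n_bins=5, min_count=2):
    """Flag values that fall in bins holding fewer than `min_count` values."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0     # avoid zero width if all values are equal

    def bin_index(v):
        # The maximum value is clamped into the last bin.
        return min(int((v - lo) / width), n_bins - 1)

    counts = [0] * n_bins
    for v in values:
        counts[bin_index(v)] += 1
    return [v for v in values if counts[bin_index(v)] < min_count]

amounts = [12, 15, 14, 13, 16, 15, 14, 95]
flagged = histogram_outliers(amounts)
print(flagged)
```

Changing `n_bins` changes which values are flagged, which is why choosing the number of buckets matters.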

50
Q

What is a problem with using a histogram for analysis of anomaly detection?

A

How to set the number of buckets (x-axis) to effectively capture the data

51
Q

What is a false positive in anomaly detection?

A

A point flagged as an anomaly that is in fact not an anomaly. (Our histogram is too detailed.)

52
Q

What are the two methods for detecting proximity based outliers?

A
  1. Distance-based
  2. Density-based
53
Q

What is a distance-based outlier?

A

An object is considered a distance-based outlier if its neighbourhood doesn’t have enough other points.
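
A small sketch of this test, assuming Euclidean distance. The `radius` and `min_neighbours` parameters and the sample `points` are illustrative; the card does not say how the neighbourhood is defined.

```python
import math

def distance_based_outliers(points, radius, min_neighbours):
    """Return points with fewer than `min_neighbours` other points within `radius`."""
    outliers = []
    for i, p in enumerate(points):
        neighbours = sum(
            1 for j, q in enumerate(points)
            if j != i and math.dist(p, q) <= radius
        )
        if neighbours < min_neighbours:
            outliers.append(p)
    return outliers

points = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8)]
result = distance_based_outliers(points, radius=2.0, min_neighbours=2)
print(result)
```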

54
Q

What is a density-based outlier?

A

An object is considered a density-based outlier if its density is much lower than that of its neighbours.

55
Q

What is the LOF method for finding density-based outliers?

A

General idea: for each point, calculate the density of its neighbourhood.
Compute the Local Outlier Factor (LOF): the average of the ratios of the density of the sample p to the density of each of its nearest neighbours.
With the ratio taken this way, outliers are the points with low LOF. (Many references invert the ratio, in which case outliers are the points with high LOF.)

56
Q

How do we measure density?

A

Density = k / distance to the k-nearest neighbours, or compare with the set of N - nearest neighbours
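
A rough sketch of the density measure and the ratio described in the LOF card. It assumes that "distance to the k nearest neighbours" means the sum of those k distances, uses Euclidean distance, and assumes there are no duplicate points. The ratio follows the orientation on the card (density of p over density of its neighbours), so outliers score low; the standard LOF definition uses reachability distances and the inverse ratio.

```python
import math

def knn(points, i, k):
    """Indices of the k nearest neighbours of point i."""
    others = sorted(
        (j for j in range(len(points)) if j != i),
        key=lambda j: math.dist(points[i], points[j]),
    )
    return others[:k]

def density(points, i, k):
    """k divided by the summed distance to the k nearest neighbours (assumption)."""
    total = sum(math.dist(points[i], points[j]) for j in knn(points, i, k))
    return k / total if total else float("inf")

def density_ratio(points, i, k):
    """Average of density(p) / density(neighbour) over p's k nearest neighbours."""
    d_i = density(points, i, k)
    return sum(d_i / density(points, j, k) for j in knn(points, i, k)) / k

points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = [density_ratio(points, i, k=2) for i in range(len(points))]
print([round(s, 3) for s in scores])
```

Points inside the square have a ratio close to 1 because their neighbours are just as dense as they are, while the isolated point (5, 5) scores far below 1.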

57
Q

Can you get a different result for density/distance based outliers?

A

Yes

58
Q

If you have clustering, how do you determine if there are outliers?

A

- It doesn’t belong to a cluster.
- There is a large distance between an object and its cluster.
- It belongs to a very small or sparse cluster.

59
Q

What is the Case 1 (far from the closest cluster) way of testing for outliers?

A

Use k-means to build clusters, then for each point measure the distance to its closest centre. If its distance is higher than average, then it is likely an outlier.

60
Q

What is the Case 2 (outliers in small clusters) way of testing for outliers?

A

Assign a cluster-based local outlier factor (CBLOF).
If p belongs to a large cluster: CBLOF = cluster size * similarity between p and its cluster
If p belongs to a small cluster: CBLOF = cluster size * similarity between p and the closest large cluster
Low CBLOF scores are suspected outliers.

61
Q

What is a limitation of a cluster-based method?

A

High computational cost

62
Q

What’s the motivation for studying Mining Association Rules?

A

To look for interesting relationships between objects in large datasets.

63
Q

What are we trying to do when studying Mining Association Rules?

A

Find all rules that correlate the presence of one set of items with another set of items E.g., 80% of customers who buy {diapers} tend to buy {beer, milk}.

64
Q

Provide formal notations for the following: item, itemset, k-itemset, transaction, transaction dataset

A

  - An item: an item in a basket
  - An itemset: a set of items, e.g. X = {milk, bread, cereal}
  - A k-itemset: an itemset with k items
  - A transaction: the items purchased in a basket; it may have a TID (transaction ID)
  - A transactional dataset: a set of transactions
65
Q

What do we mean when we say X-> Y in Mining Association Rules?

A

If customers buy X, they tend to buy Y as well.

66
Q

What is support and confidence in Association Rule Mining?

A

Support is a measure of how frequently an item appears in the set of transactions. E.g. half of the people at Woolworths have milk in their basket. The support is 0.5 or 50%.
Confidence is a measure of how likely an item is to be bought if another item is also bought (X -> Y). E.g. of the people who buy milk, 80% buy bread as well. The confidence is 0.8.
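
Both measures can be computed directly from a list of baskets. The function names and the four sample `baskets` below are made up for illustration.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(x, y, transactions):
    """Of the transactions containing X, the fraction that also contain Y."""
    support_x = support(x, transactions)
    if support_x == 0:
        return 0.0
    return support(set(x) | set(y), transactions) / support_x

baskets = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"milk"},
    {"eggs"},
]
print(support({"milk"}, baskets))               # 3 of 4 baskets contain milk
print(confidence({"milk"}, {"bread"}, baskets)) # 2 of the 3 milk baskets have bread
```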

67
Q

What do we call association rules that satisfy both the Min_Support and Min_Confidence?

A

These are Strong Association Rules.

68
Q

What does minimum support mean?

A

The minimum frequency we care about.
If minimum support equals 3, any item that occurs only 2 times is not important for our analysis.

69
Q

What is the conditional probability formula for confidence?

A

Confidence(X -> Y) = P(Y | X) = P(X ∪ Y) / P(X), where P(X ∪ Y) is the fraction of transactions that contain every item in both X and Y.

70
Q

What is the goal of association rule mining? What do we minimally want for a rule?

A

The goal of association rule mining is to find all rules having
1. support ≥ min_sup threshold
2. confidence ≥ min_conf threshold

71
Q

What algorithms do we use for Mining Association Rules?

A
  1. Apriori Algorithm
  2. Frequent Pattern (FP) Growth Algorithm
72
Q

What are the two steps in Mining Association Rules?

A
  1. Frequent Itemset Generation: get all itemsets whose support ≥ minsup
  2. Rule Generation: generate high-confidence rules from each frequent itemset
73
Q

What is the principle of the Apriori Algorithm?

A

If an itemset is frequent, then all of its subsets must also be frequent.
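
A compact sketch of frequent itemset generation that uses this principle for pruning. It is a simplified illustration rather than an optimized implementation: support is given as an absolute count, candidates are generated by joining all pairs of frequent k-itemsets, and the sample `baskets` are made up.

```python
from itertools import combinations

def apriori(transactions, min_support_count):
    """Return a dict mapping each frequent itemset to its support count."""
    transactions = [frozenset(t) for t in transactions]

    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = sorted({item for t in transactions for item in t})
    current = [frozenset([i]) for i in items
               if count(frozenset([i])) >= min_support_count]
    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = count(itemset)
        current_set = set(current)
        # Join frequent k-itemsets into candidate (k+1)-itemsets.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k + 1}
        # Apriori pruning: drop a candidate if any k-subset is not frequent,
        # without counting its support at all.
        candidates = [c for c in candidates
                      if all(frozenset(s) in current_set
                             for s in combinations(c, k))]
        current = [c for c in candidates if count(c) >= min_support_count]
        k += 1
    return frequent

baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
result = apriori(baskets, min_support_count=3)
for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

Here {bread, diapers, beer} is never counted, because its subset {bread, beer} was already found to be infrequent.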

74
Q

What are some factors that affect the complexity of the Apriori Algorithm?

A
  - The choice of minimum support threshold
  - Dimensionality (number of items) in the data set
  - Size of database
  - Average transaction width
75
Q

Briefly describe the general objective of Association Rule Mining

A

Discover interesting relations between objects in large databases

76
Q

Explain the benefits of applying the Apriori Principle in the context of the Apriori Algorithm for Association Rules mining.

A

The benefit of applying the Apriori Principle is that you can eliminate candidate patterns that cannot meet the minimum support threshold (any superset of an infrequent itemset is also infrequent) without counting them. This saves computation time, as calculating the support for different patterns in a large dataset is costly.

77
Q

What are the challenges of using histograms for detecting outliers?

A

It’s hard to choose an appropriate bin size for histograms.
Too small: normal objects fall into empty or rare bins and look like outliers.
Too large: outliers fall into frequent bins and look normal.