Final Exam - Past Questions Flashcards
What is the difference between lazy and eager learners?
There are two types of learners: lazy and eager. Eager learners build an explicit model from the training data (e.g., a decision tree); lazy learners defer the work until classification time (e.g., kNN).
Briefly describe the general objective of Association Rules mining.
The objective of association rules mining is to discover interesting relations between objects in large databases.
Explain the benefits of applying the Apriori Principle in the context of the Apriori Algorithm for Association Rules mining.
Apriori is an algorithm used to determine association rules in a database. It works by identifying frequent individual items and extending them to larger itemsets whose support stays above a support threshold.
The Apriori principle:
- If an itemset is frequent, its subsets are also frequent.
- If an itemset isn't frequent, its supersets are also not frequent; they can be pruned without counting (see the sketch below).
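A minimal sketch of this pruning in plain Python; the toy transactions and threshold are made up for illustration:

```python
from itertools import combinations

# Toy transactions and threshold (made up for illustration).
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
min_support = 3  # absolute count

def count(itemset):
    # number of transactions containing every item of the itemset
    return sum(1 for t in transactions if itemset <= t)

# frequent 1-itemsets
items = {i for t in transactions for i in t}
frequent = {frozenset([i]) for i in items if count(frozenset([i])) >= min_support}
k = 1
while frequent:
    print(k, sorted(sorted(s) for s in frequent))
    # candidate (k+1)-itemsets built from frequent k-itemsets
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
    # Apriori principle: a candidate with any infrequent k-subset cannot be frequent
    candidates = {c for c in candidates
                  if all(frozenset(sub) in frequent for sub in combinations(c, k))}
    frequent = {c for c in candidates if count(c) >= min_support}
    k += 1
```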
Compare the Precision and Recall metrics for classifier evaluation. Illustrate your answer using the Confusion Matrix.
Precision (exactness): what percent of tuples that the classifier labeled as positive are actually positive?
Precision = True Positives / (True Positives + False Positives)
Recall (completeness): what percent of positive tuples did the classifier correctly label as positive?
Recall = True Positives / (True Positives + False Negatives)
Both are read straight off the confusion matrix:

|          | Predicted + | Predicted − |
|----------|-------------|-------------|
| Actual + | TP          | FN          |
| Actual − | FP          | TN          |
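A tiny sketch computing both metrics from hypothetical confusion-matrix counts (the numbers are made up):

```python
# Hypothetical confusion-matrix counts (made-up values).
TP, FP, FN, TN = 40, 10, 5, 45

precision = TP / (TP + FP)  # of everything predicted positive, the truly positive share
recall = TP / (TP + FN)     # of everything actually positive, the share found

print(f"precision = {precision:.3f}")  # 0.800
print(f"recall    = {recall:.3f}")     # 0.889
```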
What is the main limitation of the Naïve Bayesian Classifier?
It assumes attributes are conditionally independent given the class. This is unlikely to hold in the real world: there are usually dependencies in the data.
In the context of Decision Tree induction, what does overfitting mean? And how to avoid overfitting?
Overfitting in the context of a decision tree is when the tree matches the training data too precisely and ends up capturing noise rather than the underlying pattern.
To avoid it, stop the algorithm before it makes a fully grown tree (pre-pruning):
- Stop if all instances belong to the same class.
- Stop if all the attribute values are the same.
- Stop if a split doesn't improve the Gini index.
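For illustration, these stopping rules correspond to pre-pruning parameters in scikit-learn's decision tree (assumed available here; the parameter values below are arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning: stop growth early instead of building a fully grown tree.
tree = DecisionTreeClassifier(
    max_depth=3,                 # hard cap on tree depth
    min_samples_leaf=5,          # each leaf must keep at least 5 instances
    min_impurity_decrease=0.01,  # skip splits that barely improve the Gini index
)
tree.fit(X, y)
print(tree.get_depth(), tree.score(X, y))
```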
Describe the problems involved in selecting the initial points of the K-means clustering method. Describe post-processing techniques to solve such problems.
K-means can return sub-optimal clusters because the centroids sometimes fail to readjust in the 'right' way from poor initial points.
Post-processing techniques to overcome this problem include (see also the sketch after this list):
- Eliminating ‘small’ clusters that may represent outliers,
- Splitting ‘loose’ clusters with relatively high sum of squared error and
- Merging ‘close’ clusters with relatively low sum of squared error.
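Another common remedy for bad initial points is simply to rerun k-means from several random initializations and keep the run with the lowest sum of squared errors. A minimal sketch, assuming scikit-learn is available (it reports the SSE as `inertia_`):

```python
import numpy as np
from sklearn.cluster import KMeans

# Three well-separated blobs as toy data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0, 3, 6)])

# n_init=10: run 10 random initializations, keep the one with the lowest SSE.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.inertia_)  # SSE of the best run
```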
What is the edit distance of
university -> unverstiy?
How do you calculate it?
The edit distance is three. Starting from university:
university
- unversity (delete first i)
- unversty (delete second i)
- unverstiy (insert i before the final y)
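A standard dynamic-programming sketch (Levenshtein distance) reproduces the count:

```python
def edit_distance(s: str, t: str) -> int:
    # dp[i][j] = minimum edits to turn s[:i] into t[:j]
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete everything in s[:i]
    for j in range(n + 1):
        dp[0][j] = j  # insert everything in t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution (free on a match)
            )
    return dp[m][n]

print(edit_distance("university", "unverstiy"))  # 3
```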
What are the candidate rules to test for confidence for the itemset (a, b, c)?
The candidate rules are:
- a -> b
- a -> c
- b -> a
- b -> c
- a, b -> c
- c -> a, b
- a, c -> b
- b -> a, c
- b, c -> a
- a -> b, c
How do you calculate support in a dataset?
Count the number of transactions in which the itemset appears and divide by the total number of transactions: support(X) = count(X) / N.
How do you calculate the confidence for a rule (a, b) -> c?
confidence((a, b) -> c) = P(a, b, c) / P(a, b) = support(a, b, c) / support(a, b)
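A small sketch of both measures in plain Python; the toy transactions are made up for illustration:

```python
# Toy transactions (made up); itemsets are plain Python sets.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "cereal"},
    {"milk", "cereal"},
    {"bread"},
]

def support(itemset, transactions):
    # fraction of transactions containing every item of the itemset
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs, transactions):
    # confidence(lhs -> rhs) = support(lhs | rhs) / support(lhs)
    return support(lhs | rhs, transactions) / support(lhs, transactions)

print(support({"milk"}, transactions))                          # 0.75
print(confidence({"milk", "bread"}, {"cereal"}, transactions))  # 0.5
```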
What are the steps for doing a VSM?
1. Extract text
2. Remove stop words
3. Make lower case
4. Stemming
5. Count term frequencies
6. Create index file
7. Create VSM
8. Calculate the IDF: log(number of documents / document frequency of the term)
9. Calculate the TF-IDF
10. Normalize the weights
11. Compute the cosine distance with a query
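A minimal end-to-end sketch in plain Python, assuming steps 1-4 have already produced the token lists below (corpus and query are made up):

```python
import math
from collections import Counter

# Toy corpus and query, assumed already extracted, stop-word-filtered,
# lower-cased and stemmed (steps 1-4).
docs = [["mine", "rule", "data"], ["rule", "data", "data"], ["cluster", "data"]]
query = ["rule", "data"]

N = len(docs)
vocab = sorted({t for d in docs for t in d})
df = {t: sum(t in d for d in docs) for t in vocab}   # document frequency
idf = {t: math.log(N / df[t]) for t in vocab}        # step 8: IDF = log(N / df)

def tfidf_vector(tokens):
    tf = Counter(tokens)                             # step 5: term frequencies
    v = [tf[t] * idf[t] for t in vocab]              # step 9: TF-IDF weights
    norm = math.sqrt(sum(w * w for w in v)) or 1.0
    return [w / norm for w in v]                     # step 10: normalize

def cosine(u, v):
    return sum(a * b for a, b in zip(u, v))          # unit vectors: dot = cosine similarity

vecs = [tfidf_vector(d) for d in docs]               # step 7: the VSM
q = tfidf_vector(query)
print([round(cosine(v, q), 3) for v in vecs])        # step 11: score docs against the query
```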
What’s the motivation for studying Mining Association Rules?
To look for interesting relationships between objects in large datasets.
Provide Formal Notations of the following
item
itemset
k-itemset
transaction
transaction dataset
- An item: an item in a basket.
- An itemset: a set of items. E.g., X = {milk, bread, cereal} is an itemset.
- A k-itemset: an itemset with k items.
- A transaction: the items purchased in one basket; it may have a TID (transaction ID).
- A transactional dataset: a set of transactions.
What do we mean when we say X -> Y in Mining Association Rules?
If a customer buys X, they are likely to also buy Y.
What is support and confidence in Association Rule Mining?
Support is a measure of how frequently an itemset appears in the dataset.
E.g., half of the people at Woolworths have milk in their basket; the support of milk is 0.5 or 50%.
Confidence is a measure of how likely an item is bought given that another item is also bought (X -> Y). E.g., of the people who buy milk, 80% buy bread as well; the confidence of milk -> bread is 0.8.
What do we call association rules that satisfy both the Min_Support and Min_Confidence?
These are Strong Association Rules.
What does minimum support mean?
The minimum frequency we care about.
E.g., if the minimum support equals 3, any item that occurs only 2 times is not important for our analysis.
What is the goal of association rule mining? What do we minimally want for a rule?
The goal of association rule mining is to find all rules having
- support ≥ min_sup threshold
- confidence ≥ min_conf threshold
What algorithms do we use for Mining Association Rules?
- Apriori Algorithm
- Frequent Pattern (FP) Growth Algorithm
What are the two steps in Mining Association Rules?
- Frequent Itemset Generation
– Get all itemsets whose support ≥ minsup
- Generate high confidence rules from each frequent itemset
What is the principle of the Apriori Algorithm?
If an itemset is frequent, then all of its subsets must also be frequent.
What are some factors that affect the complexity of the Apriori Algorithm?
- The choice of minimum support threshold
- Dimensionality (number of items) in the data set
- Size of database
- Average transaction width
Explain the challenges when using histograms for detecting outliers.
It's hard to choose an appropriate bin size for histograms.
- Too small: normal objects fall into empty/rare bins, so you get false positives.
- Too big: outliers hide inside frequent bins, so you get false negatives.