Final Exam - Past Questions Flashcards

1
Q

What is the difference between lazy and eager learners?

A

There are two types of learners: lazy and eager. Eager learners build an explicit model from the training data before any query arrives (e.g., decision trees); lazy learners store the training data and defer generalisation until classification time (e.g., kNN).

2
Q

Briefly describe the general objective of Association Rules mining.

A

The objective of association rules mining is to discover interesting relations between objects in large databases.

3
Q

Explain the benefits of applying the Apriori Principle in the context of the Apriori Algorithm for Association Rules mining.

A

Apriori is an algorithm used to mine association rules from a database. It works by first identifying frequent individual items and then extending them to larger itemsets whose support remains above a threshold.

The Apriori principle:

  • If an itemset is frequent, its subsets are also frequent.
  • If an itemset is not frequent, its supersets cannot be frequent either, so they can be pruned (see the sketch below).
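A minimal Python sketch of the pruning step, assuming the frequent (k-1)-itemsets from the previous pass are kept in a set (names and data are invented for illustration):

```python
from itertools import combinations

def candidate_is_viable(candidate, frequent_prev):
    # Apriori pruning: a k-itemset can only be frequent if every
    # one of its (k-1)-subsets is already known to be frequent.
    return all(frozenset(sub) in frequent_prev
               for sub in combinations(candidate, len(candidate) - 1))

# Toy example: {a,b} and {b,c} are frequent 2-itemsets but {a,c} is not,
# so the candidate {a,b,c} is pruned without ever counting its support.
frequent_2 = {frozenset('ab'), frozenset('bc')}
print(candidate_is_viable(frozenset('abc'), frequent_2))  # False
```

The benefit: pruned candidates never reach the support-counting pass, which is the expensive scan over the database.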
4
Q

Compare the Precision and Recall metrics for classifier evaluation. Illustrate your answer using the Confusion Matrix.

A

Precision: exactness - what percent of tuples that the classifier labeled as positive are actually positive?

Precision = True Positive / (True Positive + False Positive)

Recall: completeness - what percent of positive tuples did the classifier correctly label as positive?

Recall = True Positive / (True Positive + False Negative)
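
A small sketch of both metrics computed from hypothetical confusion-matrix counts (the numbers are made up for illustration):

```python
def precision_recall(tp, fp, fn):
    # Precision: of everything predicted positive, the fraction that is.
    # Recall: of everything actually positive, the fraction we found.
    return tp / (tp + fp), tp / (tp + fn)

# Hypothetical confusion matrix:
#                predicted +   predicted -
#   actual +        90 (TP)       10 (FN)
#   actual -        30 (FP)      870 (TN)
print(precision_recall(tp=90, fp=30, fn=10))  # (0.75, 0.9)
```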

5
Q

What is the main limitation of the Naïve Bayesian Classifier?

A

It assumes attributes are conditionally independent of each other given the class. This rarely holds in the real world; attributes in real data usually have dependencies (e.g., age and income).

6
Q

In the context of Decision Tree induction, what does overfitting mean? And how can overfitting be avoided?

A

Overfitting in the context of a decision tree means the tree matches the training data too precisely, capturing noise rather than the underlying pattern.

To avoid it, stop the algorithm before it grows a full tree (pre-pruning):

  1. Stop if all instances belong to the same class.
  2. Stop if all the attribute values are the same.

Other notes: stop splitting if it doesn't improve an impurity measure such as the Gini index (see the sketch below).
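
A hedged pre-pruning sketch using scikit-learn's decision tree; the parameter values here are arbitrary examples, not recommendations:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Capping the depth and requiring a minimum number of samples per leaf
# stops the tree before it grows deep enough to memorise noise.
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)
tree.fit(X, y)
print(tree.get_depth())  # never exceeds 3
```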

7
Q

Describe the problems involved in selecting the initial points of the K-means clustering method. Describe post-processing techniques to solve such problems.

A

K-means can return sub-optimal clusters because sometimes the centroids do not readjust in the ‘right’ way.

Post-processing techniques to overcome this problem consist of

  1. Eliminating ‘small’ clusters that may represent outliers,
  2. Splitting ‘loose’ clusters with relatively high sum of squared error and
  3. Merging ‘close’ clusters with relatively low sum of squared error.
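
A minimal sketch of the diagnostics behind these techniques, assuming scikit-learn's KMeans on synthetic stand-in data: per-cluster size exposes 'small' clusters and per-cluster SSE exposes 'loose' ones.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))  # stand-in for real data

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

# Report each cluster's size and sum of squared error (SSE).
for k, centre in enumerate(km.cluster_centers_):
    members = X[km.labels_ == k]
    sse = ((members - centre) ** 2).sum()
    print(k, len(members), round(sse, 2))
# Post-processing would drop very small clusters (likely outliers),
# split clusters with relatively high SSE, and merge close clusters
# with relatively low SSE.
```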
8
Q

What is the edit distance of

university -> unverstiy?

How do you calculate it?

A

The edit distance is three:

university

  1. unversity (delete the first i)
  2. unversty (delete the second i)
  3. unverstiy (insert an i before the final y)
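
A standard dynamic-programming sketch of this (Levenshtein) distance:

```python
def edit_distance(s, t):
    # d[i][j] = minimum edits needed to turn s[:i] into t[:j]
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all of s[:i]
    for j in range(n + 1):
        d[0][j] = j  # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution/match
    return d[m][n]

print(edit_distance("university", "unverstiy"))  # 3
```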
9
Q

What are the confidence rules to test for (a,b,c)?

A

The candidate rules come from splitting each frequent itemset into a non-empty antecedent and consequent.

From the 3-itemset {a, b, c}:

  • a -> b, c
  • b -> a, c
  • c -> a, b
  • a, b -> c
  • a, c -> b
  • b, c -> a

From its frequent 2-subsets:

  • a -> b and b -> a
  • a -> c and c -> a
  • b -> c and c -> b
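
A small sketch that enumerates the rules for the full itemset; applying the same function to each frequent 2-subset gives the two-item rules above:

```python
from itertools import combinations

def candidate_rules(itemset):
    # Every split of the itemset into non-empty antecedent and consequent.
    items = set(itemset)
    for r in range(1, len(items)):
        for lhs in combinations(sorted(items), r):
            yield set(lhs), items - set(lhs)

for lhs, rhs in candidate_rules({'a', 'b', 'c'}):
    print(lhs, '->', rhs)
# 6 rules: {a}->{b,c}, {b}->{a,c}, {c}->{a,b},
#          {a,b}->{c}, {a,c}->{b}, {b,c}->{a}
```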

10
Q

How do you calculate support in a dataset?

A

Count the number of transactions (rows) that contain the itemset and divide by the total number of transactions.
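
A minimal sketch over an invented toy transaction list:

```python
transactions = [
    {'milk', 'bread'},
    {'milk', 'cereal'},
    {'bread'},
    {'milk', 'bread', 'cereal'},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

print(support({'milk'}, transactions))           # 3/4 = 0.75
print(support({'milk', 'bread'}, transactions))  # 2/4 = 0.5
```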

11
Q

How do you calculate the confidence for a rule (a, b) -> c?

A

confidence = P(a, b, c) / P(a, b), i.e., support(a, b, c) / support(a, b)
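
A worked toy example as a sketch (the four transactions are invented):

```python
transactions = [
    {'a', 'b', 'c'},
    {'a', 'b'},
    {'a', 'c'},
    {'b', 'c'},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

# confidence((a, b) -> c) = support({a, b, c}) / support({a, b})
print(support({'a', 'b', 'c'}) / support({'a', 'b'}))  # 0.25 / 0.5 = 0.5
```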

12
Q

What are the steps for building a VSM (Vector Space Model)?

A
  1. Extract the text
  2. Remove stop words
  3. Make lower case
  4. Apply stemming
  5. Count term frequencies
  6. Create the index file
  7. Create the VSM
  8. Calculate the IDF: log(number of documents / number of documents containing the term)
  9. Calculate the TF-IDF weights
  10. Normalize the weights
  11. Compute the cosine similarity between each document and a query
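
A compact sketch of steps 3, 5, and 8-11 on an invented three-document corpus (stop-word removal, stemming and the index file are elided for brevity):

```python
import math
from collections import Counter

docs = ["data mining finds patterns",
        "mining association rules",
        "patterns in data"]

tokenised = [d.lower().split() for d in docs]  # lower-case and tokenise
vocab = sorted({w for doc in tokenised for w in doc})
N = len(docs)

# IDF = log(number of documents / number of documents containing the term)
df = {w: sum(w in doc for doc in tokenised) for w in vocab}
idf = {w: math.log(N / df[w]) for w in vocab}

def tfidf_vector(tokens):
    tf = Counter(tokens)                   # term frequencies
    vec = [tf[w] * idf[w] for w in vocab]  # TF-IDF weights
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]         # normalised weights

vectors = [tfidf_vector(doc) for doc in tokenised]

# Cosine similarity reduces to a dot product on unit-length vectors.
query = tfidf_vector("data patterns".split())
for v in vectors:
    print(round(sum(a * b for a, b in zip(query, v)), 3))
```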
13
Q

What’s the motivation for studying Mining Association Rules?

A

To look for interesting relationships between objects in large datasets.

14
Q

Provide Formal Notations of the following

item

itemset

k-itemset

transaction

transaction dataset

A
  • An item: a single object in a basket (e.g., milk).
  • An itemset: a set of items, e.g., X = {milk, bread, cereal}.
  • A k-itemset: an itemset with k items.
  • A transaction: the items purchased in one basket; it may have a TID (transaction ID).
  • A transactional dataset: a set of transactions.
15
Q

What do we mean when we say X-> Y in Mining Association Rules?

A

If a customer buys X, they are likely to also buy Y.

16
Q

What is support and confidence in Association Rule Mining?

A

Support is a measure of how frequently an itemset appears in the dataset.

E.g., half of the people at Woolworths have milk in their basket: the support of {milk} is 0.5, or 50%.

Confidence is a measure of how likely an item is bought given that another item is also bought (X -> Y). E.g., of the people who buy milk, 80% buy bread as well: the confidence of milk -> bread is 0.8.

17
Q

What do we call association rules that satisfy both the Min_Support and Min_Confidence?

A

These are Strong Association Rules.

18
Q

What does minimum support mean?

A

The minimum frequency we care about.

If the minimum support count equals 3, any itemset that occurs only 2 times is not important for our analysis.

19
Q

What is the goal of association rule mining? What do we minimally want for a rule?

A

The goal of association rule mining is to find all rules having

  1. support ≥ min_sup threshold
  2. confidence ≥ min_conf threshold
20
Q

What algorithms do we use for Mining Association Rules?

A
  1. Apriori Algorithm
  2. Frequent Pattern (FP) Growth Algorithm
21
Q

What are the two steps in Mining Association Rules?

A
  1. Frequent Itemset Generation: generate all itemsets whose support ≥ minsup
  2. Rule Generation: generate high-confidence rules from each frequent itemset
22
Q

What is the principle of the Apriori Algorithm?

A

If an itemset is frequent, then all of its subsets must also be frequent.

24
Q

What are some factors that affect the complexity of the Apriori Algorithm?

A
  • The choice of minimum support threshold
  • Dimensionality (number of items) in the data set
  • Size of database
  • Average transaction width
25
Q

Explain the challenges when using histograms for detecting outliers.

A

It's hard to choose an appropriate bin size for histograms.

Too small: normal objects fall into empty or rare bins, so you get false positives.

Too big: outliers end up inside frequent bins, so you get false negatives (see the sketch below).
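
A minimal numpy sketch of the trade-off, flagging points that fall in bins with fewer than min_count members (the data and thresholds are invented):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 500), [8.0, -7.5]])  # two injected outliers

def histogram_outliers(x, bins, min_count=2):
    counts, edges = np.histogram(x, bins=bins)
    bin_idx = np.digitize(x, edges[1:-1])   # map each point to its bin
    return x[counts[bin_idx] < min_count]   # points that land in rare bins

print(histogram_outliers(x, bins=20))        # flags the injected outliers (and maybe a few tail points)
print(histogram_outliers(x, bins=3))         # too big: outliers tend to hide (false negatives)
print(len(histogram_outliers(x, bins=500)))  # too small: many normal points flagged (false positives)
```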