Final Exam - Past Questions Flashcards
What is the difference between lazy and eager learners?
There are two types of learners: lazy and eager. Eager learners build an explicit model from the training data (e.g., a decision tree); lazy learners defer the work until classification time (e.g., kNN).
Briefly describe the general objective of Association Rules mining.
The objective of association rules mining is to discover interesting relations between objects in large databases.
Explain the benefits of applying the Apriori Principle in the context of the Apriori Algorithm for Association Rules mining.
Apriori is an algorithm used to determine association rules in a database. It works by identifying frequent individual items and extending them to larger itemsets whose support stays above a support threshold.
The Apriori principle:
- If an itemset is frequent, its subsets are also frequent.
- If an itemset isn't frequent, its supersets are also not frequent; they can be pruned without counting (see the sketch below).
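A minimal sketch of this pruning in plain Python; the toy transactions and threshold are made up for illustration:

```python
from itertools import combinations

# Toy transactions and threshold (made up for illustration).
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
min_support = 3  # absolute count

def count(itemset):
    # number of transactions containing every item of the itemset
    return sum(1 for t in transactions if itemset <= t)

# frequent 1-itemsets
items = {i for t in transactions for i in t}
frequent = {frozenset([i]) for i in items if count(frozenset([i])) >= min_support}
k = 1
while frequent:
    print(k, sorted(sorted(s) for s in frequent))
    # candidate (k+1)-itemsets built from frequent k-itemsets
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
    # Apriori principle: a candidate with any infrequent k-subset cannot be frequent
    candidates = {c for c in candidates
                  if all(frozenset(sub) in frequent for sub in combinations(c, k))}
    frequent = {c for c in candidates if count(c) >= min_support}
    k += 1
```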
Compare the Precision and Recall metrics for classifier evaluation. Illustrate your answer using the Confusion Matrix.
Precision (exactness): what percent of tuples that the classifier labeled as positive are actually positive?
Precision = True Positives / (True Positives + False Positives)
Recall (completeness): what percent of positive tuples did the classifier correctly label as positive?
Recall = True Positives / (True Positives + False Negatives)
Both are read straight off the confusion matrix:

|          | Predicted + | Predicted − |
|----------|-------------|-------------|
| Actual + | TP          | FN          |
| Actual − | FP          | TN          |
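A tiny sketch computing both metrics from hypothetical confusion-matrix counts (the numbers are made up):

```python
# Hypothetical confusion-matrix counts (made-up values).
TP, FP, FN, TN = 40, 10, 5, 45

precision = TP / (TP + FP)  # of everything predicted positive, the truly positive share
recall = TP / (TP + FN)     # of everything actually positive, the share found

print(f"precision = {precision:.3f}")  # 0.800
print(f"recall    = {recall:.3f}")     # 0.889
```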
What is the main limitation of the Naïve Bayesian Classifier?
It assumes attributes are conditionally independent given the class. This is unlikely to hold in the real world: there are usually dependencies in the data.
In the context of Decision Tree induction, what does overfitting mean? And how to avoid overfitting?
Overfitting in the context of a decision tree is when the tree matches the training data too precisely and ends up capturing noise rather than the underlying pattern.
To avoid it, stop the algorithm before it makes a fully grown tree (pre-pruning):
- Stop if all instances belong to the same class.
- Stop if all the attribute values are the same.
- Stop if a split doesn't improve the Gini index.
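For illustration, these stopping rules correspond to pre-pruning parameters in scikit-learn's decision tree (assumed available here; the parameter values below are arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning: stop growth early instead of building a fully grown tree.
tree = DecisionTreeClassifier(
    max_depth=3,                 # hard cap on tree depth
    min_samples_leaf=5,          # each leaf must keep at least 5 instances
    min_impurity_decrease=0.01,  # skip splits that barely improve the Gini index
)
tree.fit(X, y)
print(tree.get_depth(), tree.score(X, y))
```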
Describe the problems involved in selecting the initial points of the K-means clustering method. Describe post-processing techniques to solve such problems.
K-means can return sub-optimal clusters because the centroids sometimes fail to readjust in the 'right' way from poor initial points.
Post-processing techniques to overcome this problem include (see also the sketch after this list):
- Eliminating ‘small’ clusters that may represent outliers,
- Splitting ‘loose’ clusters with relatively high sum of squared error and
- Merging ‘close’ clusters with relatively low sum of squared error.
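Another common remedy for bad initial points is simply to rerun k-means from several random initializations and keep the run with the lowest sum of squared errors. A minimal sketch, assuming scikit-learn is available (it reports the SSE as `inertia_`):

```python
import numpy as np
from sklearn.cluster import KMeans

# Three well-separated blobs as toy data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0, 3, 6)])

# n_init=10: run 10 random initializations, keep the one with the lowest SSE.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.inertia_)  # SSE of the best run
```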
What is the edit distance of
university -> unverstiy?
How do you calculate it?
The edit distance is three. Starting from university:
university
- unversity (delete first i)
- unversty (delete second i)
- unverstiy (insert i before the final y)
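A standard dynamic-programming sketch (Levenshtein distance) reproduces the count:

```python
def edit_distance(s: str, t: str) -> int:
    # dp[i][j] = minimum edits to turn s[:i] into t[:j]
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete everything in s[:i]
    for j in range(n + 1):
        dp[0][j] = j  # insert everything in t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution (free on a match)
            )
    return dp[m][n]

print(edit_distance("university", "unverstiy"))  # 3
```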
What are the candidate rules to test for confidence for the itemset (a, b, c)?
The candidate rules are:
- a -> b
- a -> c
- b -> a
- b -> c
- a, b -> c
- c -> a, b
- a, c -> b
- b -> a, c
- b, c -> a
- a -> b, c
How do you calculate support in a dataset?
Count the number of transactions in which the itemset appears and divide by the total number of transactions: support(X) = count(X) / N.
How do you calculate the confidence for a rule (a, b) -> c?
confidence((a, b) -> c) = P(a, b, c) / P(a, b) = support(a, b, c) / support(a, b)
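A small sketch of both measures in plain Python; the toy transactions are made up for illustration:

```python
# Toy transactions (made up); itemsets are plain Python sets.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "cereal"},
    {"milk", "cereal"},
    {"bread"},
]

def support(itemset, transactions):
    # fraction of transactions containing every item of the itemset
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs, transactions):
    # confidence(lhs -> rhs) = support(lhs | rhs) / support(lhs)
    return support(lhs | rhs, transactions) / support(lhs, transactions)

print(support({"milk"}, transactions))                          # 0.75
print(confidence({"milk", "bread"}, {"cereal"}, transactions))  # 0.5
```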
What are the steps for doing a VSM?
1. Extract text
2. Remove stop words
3. Make lower case
4. Stemming
5. Count term frequencies
6. Create index file
7. Create VSM
8. Calculate the IDF: log(number of documents / document frequency of the term)
9. Calculate the TF-IDF
10. Normalize the weights
11. Compute the cosine distance with a query
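A minimal end-to-end sketch in plain Python, assuming steps 1-4 have already produced the token lists below (corpus and query are made up):

```python
import math
from collections import Counter

# Toy corpus and query, assumed already extracted, stop-word-filtered,
# lower-cased and stemmed (steps 1-4).
docs = [["mine", "rule", "data"], ["rule", "data", "data"], ["cluster", "data"]]
query = ["rule", "data"]

N = len(docs)
vocab = sorted({t for d in docs for t in d})
df = {t: sum(t in d for d in docs) for t in vocab}   # document frequency
idf = {t: math.log(N / df[t]) for t in vocab}        # step 8: IDF = log(N / df)

def tfidf_vector(tokens):
    tf = Counter(tokens)                             # step 5: term frequencies
    v = [tf[t] * idf[t] for t in vocab]              # step 9: TF-IDF weights
    norm = math.sqrt(sum(w * w for w in v)) or 1.0
    return [w / norm for w in v]                     # step 10: normalize

def cosine(u, v):
    return sum(a * b for a, b in zip(u, v))          # unit vectors: dot = cosine similarity

vecs = [tfidf_vector(d) for d in docs]               # step 7: the VSM
q = tfidf_vector(query)
print([round(cosine(v, q), 3) for v in vecs])        # step 11: score docs against the query
```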
What’s the motivation for studying Mining Association Rules?
To look for interesting relationships between objects in large datasets.
Provide Formal Notations of the following
item
itemset
k-itemset
transaction
transaction dataset
- An item: an item in a basket.
- An itemset: a set of items. E.g., X = {milk, bread, cereal} is an itemset.
- A k-itemset: an itemset with k items.
- A transaction: the items purchased in one basket; it may have a TID (transaction ID).
- A transactional dataset: a set of transactions.
What do we mean when we say X -> Y in Mining Association Rules?
If a customer buys X, they are likely to also buy Y.
What is support and confidence in Association Rule Mining?
Support is a measure of how frequently an itemset appears in the dataset.
E.g., half of the people at Woolworths have milk in their basket; the support of milk is 0.5 or 50%.
Confidence is a measure of how likely an item is bought given that another item is also bought (X -> Y). E.g., of the people who buy milk, 80% buy bread as well; the confidence of milk -> bread is 0.8.
What do we call association rules that satisfy both the Min_Support and Min_Confidence?
These are Strong Association Rules.
What does minimum support mean?
The minimum frequency we care about.
E.g., if the minimum support equals 3, any item that occurs only 2 times is not important for our analysis.
What is the goal of association rule mining? What do we minimally want for a rule?
The goal of association rule mining is to find all rules having
- support ≥ min_sup threshold
- confidence ≥ min_conf threshold
What algorithms do we use for Mining Association Rules?
- Apriori Algorithm
- Frequent Pattern (FP) Growth Algorithm
What are the two steps in Mining Association Rules?
- Frequent Itemset Generation
– Get all itemsets whose support ≥ minsup
- Generate high confidence rules from each frequent itemset
What is the principle of the Apriori Algorithm?
If an itemset is frequent, then all of its subsets must also be frequent.
What are some factors that affect the complexity of the Apriori Algorithm?
- The choice of minimum support threshold
- Dimensionality (number of items) in the data set
- Size of database
- Average transaction width
Explain the challenges when using histograms for detecting outliers.
It's hard to choose an appropriate bin size for histograms.
- Too small: normal objects fall into empty/rare bins, so you get false positives.
- Too big: outliers hide inside frequent bins, so you get false negatives.