Final Exam - Past Questions Flashcards
What is the difference between lazy and eager learners?
There are two types: lazy and eager. Eager learners build an explicit model from the training data before classifying new tuples (e.g., decision trees); lazy learners build no model and defer work until a query arrives (e.g., kNN).
Briefly describe the general objective of Association Rules mining.
The objective of association rules mining is to discover interesting relations between objects in large databases.
Explain the benefits of applying the Apriori Principle in the context of the Apriori Algorithm for Association Rules mining.
Apriori is an algorithm for mining association rules from a transaction database. It first identifies frequent individual items and then extends them to larger itemsets, keeping only those whose support meets a minimum threshold.
The Apriori principle:
- If an itemset is frequent, all of its subsets are also frequent.
- If an itemset is not frequent, none of its supersets can be frequent, so they can be pruned from the search without counting their support.
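As a minimal sketch of the principle in action (pure Python, hypothetical transactions, not the full algorithm), a candidate k-itemset is pruned before support counting unless every one of its (k-1)-subsets is already frequent:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Find all frequent itemsets, pruning candidates via the Apriori principle."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    freq = {}
    current = []
    # frequent 1-itemsets
    for i in items:
        s = sum(1 for t in transactions if i in t) / n
        if s >= min_support:
            freq[(i,)] = s
            current.append((i,))
    k = 2
    while current:
        # candidate generation: join frequent (k-1)-itemsets
        candidates = set()
        for a in current:
            for b in current:
                u = tuple(sorted(set(a) | set(b)))
                # Apriori pruning: every (k-1)-subset must be frequent
                if len(u) == k and all(sub in freq for sub in combinations(u, k - 1)):
                    candidates.add(u)
        current = []
        for c in candidates:
            s = sum(1 for t in transactions if set(c) <= set(t)) / n
            if s >= min_support:
                freq[c] = s
                current.append(c)
        k += 1
    return freq
```

The benefit is that infrequent supersets are never counted at all, which cuts the number of database scans per candidate.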
Compare the Precision and Recall metrics for classifier evaluation. Illustrate your answer using the Confusion Matrix.
Precision: exactness - of the tuples the classifier labelled positive, what percent are actually positive?
Precision = True Positive / (True Positive + False Positive)
Recall: completeness - of the actually positive tuples, what percent did the classifier correctly label as positive?
Recall = True Positive / (True Positive + False Negative)
Confusion Matrix:
- Actual positive: True Positive (predicted positive), False Negative (predicted negative)
- Actual negative: False Positive (predicted positive), True Negative (predicted negative)
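A minimal sketch of both formulas, using hypothetical confusion-matrix counts:

```python
def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP): exactness. Recall = TP/(TP+FN): completeness."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives.
p, r = precision_recall(tp=8, fp=2, fn=4)
```

A classifier can trade one for the other: labelling everything positive maximizes recall but hurts precision, and vice versa.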
What is the main limitation of the Naïve Bayesian Classifier?
It assumes attributes are conditionally independent given the class. This rarely holds in the real world: real data usually contains dependencies between attributes, which degrades the classifier's probability estimates.
In the context of Decision Tree induction, what does overfitting mean? And how to avoid overfitting?
Overfitting in the context of a decision tree means the tree matches the training data too precisely, capturing noise rather than the underlying pattern, so it generalizes poorly to unseen data.
To avoid it, stop the algorithm before it grows a fully developed tree (pre-pruning):
- Stop if all instances belong to the same class.
- Stop if all the attribute values are the same.
- Stop if splitting does not improve an impurity measure such as the Gini index.
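The Gini index used in the last stopping criterion is 1 minus the sum of squared class proportions; a minimal sketch over a hypothetical label list:

```python
def gini(labels):
    """Gini index of a node: 1 - sum of squared class proportions (0 = pure)."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))
```

A split is only worth making if the weighted Gini of the children is lower than the Gini of the parent node.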
Describe the problems involved in selecting the initial points of the K-means clustering method. Describe post-processing techniques to solve such problems.
K-means can return sub-optimal clusters because, depending on the initial centroids, the centroids may not readjust in the 'right' way; the algorithm converges to a local minimum.
Post-processing techniques to overcome this problem consist of
- Eliminating ‘small’ clusters that may represent outliers,
- Splitting ‘loose’ clusters with relatively high sum of squared error and
- Merging ‘close’ clusters with relatively low sum of squared error.
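The split/merge decisions above compare clusters by their sum of squared errors (SSE); a minimal sketch over hypothetical 2-D points:

```python
def centroid(cluster):
    """Mean point of a cluster (list of equal-length coordinate tuples)."""
    dim = len(cluster[0])
    return [sum(p[i] for p in cluster) / len(cluster) for i in range(dim)]

def sse(cluster):
    """Sum of squared distances from each point to the cluster centroid."""
    c = centroid(cluster)
    return sum(sum((x - ci) ** 2 for x, ci in zip(p, c)) for p in cluster)
```

A 'loose' cluster has a relatively high SSE and is a candidate for splitting; two clusters whose merged SSE stays low are candidates for merging.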
What is the edit distance of
university -> unverstiy?
How do you calculate it?
The edit distance is three:
university
- unversity (delete the first i)
- unversty (delete the second i)
- unverstiy (insert i before the final y)
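The distance can be computed with the standard dynamic-programming recurrence over insertions, deletions, and substitutions; a minimal sketch:

```python
def edit_distance(s, t):
    """Levenshtein distance between strings s and t via dynamic programming."""
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # i deletions to reach the empty string
    for j in range(n + 1):
        dp[0][j] = j  # j insertions from the empty string
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # delete s[i-1]
                           dp[i][j - 1] + 1,        # insert t[j-1]
                           dp[i - 1][j - 1] + cost) # substitute (or match)
    return dp[m][n]

# edit_distance("university", "unverstiy") returns 3
```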
What are the confidence rules to test for (a,b,c)?
The candidate confidence rules are:
- a -> b, a -> c, b -> a, b -> c
- a,b -> c, a,c -> b, b,c -> a
- a -> b,c, b -> a,c, c -> a,b
How do you calculate support in a dataset?
Count the number of transactions that contain the itemset and divide by the total number of transactions.
How do you calculate the confidence for a rule (a,b) -> c?
confidence = support(a,b,c) / support(a,b), i.e., the conditional probability P(c | a,b).
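A minimal sketch of both calculations over a hypothetical transaction list (sets of items):

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """conf(X -> Y) = support(X union Y) / support(X)."""
    return (support(transactions, antecedent | consequent)
            / support(transactions, antecedent))
```

For example, with four transactions where {a,b} appears twice and {a,b,c} once, support({a,b}) is 0.5 and confidence(a,b -> c) is 0.25 / 0.5 = 0.5.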
What are the steps for building a VSM (Vector Space Model)?
1. Extract text
2. Remove stop words
3. Make lower case
4. Stemming
5. Count term frequencies
6. Create index file
7. Create VSM
8. Calculate the IDF: log(Number of documents / number of documents containing the term)
9. Calculate the TF-IDF
10. Normalize the weights
11. Compute the cosine similarity with a query
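Steps 5-11 can be sketched in pure Python over a tiny hypothetical corpus (tokenized lists; stop-word removal and stemming omitted):

```python
import math

def tf_idf_vectors(docs):
    """Build normalized TF-IDF vectors for a list of tokenized documents."""
    n = len(docs)
    vocab = sorted({w for d in docs for w in d})
    # document frequency: number of documents containing the term
    df = {w: sum(1 for d in docs if w in d) for w in vocab}
    idf = {w: math.log(n / df[w]) for w in vocab}
    vectors = []
    for d in docs:
        vec = [d.count(w) * idf[w] for w in vocab]  # TF * IDF per term
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])     # normalize the weights
    return vocab, vectors

def cosine(u, v):
    """Cosine similarity; for unit vectors this is just the dot product."""
    return sum(a * b for a, b in zip(u, v))
```

Because the vectors are normalized, documents sharing rare terms score higher than documents sharing common ones.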
What’s the motivation for studying Mining Association Rules?
To look for interesting relationships between objects in large datasets.
Provide Formal Notations of the following
item
itemset
k-itemset
transaction
transaction dataset
- An item: a single object, e.g., one product in a market basket.
- An itemset: a set of items. E.g., X = {milk, bread, cereal} is an itemset.
- A k-itemset: an itemset containing k items.
- A transaction: the items purchased in one basket; it may have a TID (transaction ID).
- A transactional dataset: a set of transactions.
What do we mean when we say X-> Y in Mining Association Rules?
If a transaction contains the itemset X, it is likely to also contain the itemset Y; i.e., customers who buy X tend to also buy Y.