Final Flashcards
Given an input dataset and the Apriori algorithm, how do you trace the algorithm's intermediate results? (Review)
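A minimal Python sketch of such a trace, printing the frequent itemsets found at each level; the 9-transaction dataset and the absolute support count min_sup = 2 are made up for illustration:

```python
from itertools import combinations

def apriori_trace(transactions, min_sup):
    """Run Apriori level by level; return the frequent itemsets (with counts) per level."""
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items]   # C1: candidate 1-itemsets
    levels = []
    k = 1
    while level:
        # count support of each candidate
        counts = {c: sum(1 for t in transactions if c.issubset(t)) for c in level}
        frequent = {c: n for c, n in counts.items() if n >= min_sup}
        if not frequent:
            break
        levels.append(frequent)
        # candidate generation: join L_k with itself, prune by the apriori property
        freq_sets = list(frequent)
        cands = set()
        for a in freq_sets:
            for b in freq_sets:
                u = a | b
                if len(u) == k + 1 and all(frozenset(s) in frequent for s in combinations(u, k)):
                    cands.add(u)
        level = list(cands)
        k += 1
    return levels

# Toy dataset; each transaction is a set of item ids
D = [{1, 2, 5}, {2, 4}, {2, 3}, {1, 2, 4}, {1, 3}, {2, 3}, {1, 3}, {1, 2, 3, 5}, {1, 2, 3}]
for k, Lk in enumerate(apriori_trace(D, min_sup=2), start=1):
    print(f"L{k}:", {tuple(sorted(s)): n for s, n in Lk.items()})
```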

How to derive strong rules from the given frequent itemsets L and a conf_rate?
Generate candidate rules from each frequent itemset in L, then keep only the strong rules by filtering out any rule with conf < min_conf.
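A small sketch of this filtering step, assuming the support counts of the frequent itemsets are already available (the counts below are made up for illustration):

```python
from itertools import combinations

def strong_rules(support, min_conf):
    """support: dict mapping frozenset itemset -> support count.
    Returns (antecedent, consequent, conf) for every rule with conf >= min_conf."""
    rules = []
    for itemset, sup in support.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for ante in map(frozenset, combinations(itemset, r)):
                conf = sup / support[ante]       # conf(X -> Y) = sup(X u Y) / sup(X)
                if conf >= min_conf:
                    rules.append((set(ante), set(itemset - ante), conf))
    return rules

# Hypothetical support counts over 9 transactions
support = {
    frozenset({1}): 6, frozenset({2}): 7, frozenset({5}): 2,
    frozenset({1, 2}): 4, frozenset({1, 5}): 2, frozenset({2, 5}): 2,
    frozenset({1, 2, 5}): 2,
}
for a, c, conf in strong_rules(support, min_conf=0.7):
    print(f"{a} -> {c}  (conf={conf:.2f})")
```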

How to improve the efficiency of the rule generation procedure by applying the apriori property?
Prune while generating rules: if a rule from a frequent itemset fails the minimum-confidence test, every rule from that itemset whose consequent is a superset of the failing rule's consequent must also fail, so those rules need not be generated or tested.

What are the two general purposes of DM, use some examples of mined association patterns to explain for each purpose?
- Find frequent itemsets
- Find all frequent itemsets from D
- Generate association rules
- Derive rules from each frequent itemset
How can the association mining process be mapped to the empirical cycle model of scientific research?
- Observation: observe all the data
- Analysis: generate all associations
- Theory: apply a priori knowledge (support / confidence)
- Prediction: predict X given the a priori knowledge

Why is classification mining a supervised learning process? How about association mining?
- The training data carries known class labels; it is partitioned based on a divide-and-conquer strategy.
- The model trains on one portion of the data and then tests on the other portion. If the accuracy is acceptable, it can be used to predict.
- Association mining, by contrast, is unsupervised: it searches for patterns without predefined class labels.
What are the major phases of conducting a classification mining application?
- Training
- Each tuple is assumed to belong to a predefined class, as determined by the class label attribute
- The set of tuples used for model construction is the training set
- Testing
- The known label of each test sample is compared with the classified result from the model
- Accuracy is the percentage of test-set samples that are correctly classified by the model
- Predicting
- If the accuracy is acceptable, use the model to classify unseen data
Can you describe a mapping between a classification application process and the empirical cycle?
- Observation -> Training data
- Analysis -> Classification algorithm
- Theory -> Classification model
- Prediction -> Testing & prediction

What is the general idea/strategy/method/algorithm of DT induction for classification mining?
- Supervised learning
- Derive a model from a training data set
- Inductive learning process
- Construct a tree using a top-down, divide-and-conquer strategy
- Testing -> Choose -> Split
- Tree construction/induction by greedy search
- Depth-first search
- Heuristic function

What is the general strategy of Inductive Learning (via observing examples)?
- Divide and conquer strategy
- Continue dividing D into subsets, based on a search method, until each subset has only one label, i.e. all examples in the subset share the same class label.
What are the major technical issues of DT Induction approach for classification mining?
- Preparing datasets: (training & testing)
- A training dataset for learning a model
- A test dataset for evaluating the learned model
- Classification model discovery: (constructing a DT)
- Stopping criteria for testing at each node
- How to choose which attribute to split, and how to split (method)
- Control structure for tree construction (recursive process)
- Pruning method
What is the heuristic function used in ID3 algorithm for evaluating search directions?
Entropy calculation: information gain (the expected reduction in entropy) is used to evaluate the candidate attribute splits at each node.
What is the notion of Information Gain, and how is it applied in the ID3 algorithm?
- Expected reduction in entropy
- Define a preferred sequence of attributes to investigate to most rapidly narrow down the state of X
- ID3 uses information gain to select among the candidate attributes at each step while growing the tree
How to convert the ID3 algorithm into an implementation code structure?
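One possible code structure, sketched in Python: a recursive function mirroring the test -> choose -> split loop, returning the tree as nested dicts. The dataset, attribute names, and helper names below are hypothetical:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def id3(rows, labels, attrs):
    """rows: list of dicts attr -> value; returns a nested-dict decision tree."""
    if len(set(labels)) == 1:            # stopping: node is pure
        return labels[0]
    if not attrs:                        # stopping: no attributes left -> majority label
        return Counter(labels).most_common(1)[0][0]
    def gain(a):                         # choose: highest information gain
        rem = 0.0
        for v in {r[a] for r in rows}:
            sub = [l for r, l in zip(rows, labels) if r[a] == v]
            rem += len(sub) / len(labels) * entropy(sub)
        return entropy(labels) - rem
    best = max(attrs, key=gain)
    tree = {best: {}}
    for v in {r[best] for r in rows}:    # split: one branch per attribute value, recurse
        idx = [i for i, r in enumerate(rows) if r[best] == v]
        tree[best][v] = id3([rows[i] for i in idx], [labels[i] for i in idx],
                            [a for a in attrs if a != best])
    return tree

rows = [
    {"outlook": "sunny", "windy": False},
    {"outlook": "sunny", "windy": True},
    {"outlook": "rain",  "windy": False},
    {"outlook": "rain",  "windy": True},
]
labels = ["no", "no", "yes", "no"]
tree = id3(rows, labels, ["outlook", "windy"])
print(tree)
```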

How to quantify information contained in a message?
Q(message) = P2(outcome_after) - P1(outcome_before)
Q, the quantity of information contained in a message.
P1, the probability of outcome before receiving a message.
P2, the probability of outcome after receiving the message.
- Information in general always associates with a question and a message for answering the question
- Information measure is to quantify the outcome of some expectation from the message for answering the question
Suppose a missing cow has strayed into a pasture represented as an
8 x 8 array of “cells”.
Question: Where is the cow?
Outcome: the probability of finding the cow.
Answer 1: Nobody knows.
Answer 2: The cow is in cell (4, 7).
What is the information received?
Outcome1 (cow before) = 1/64
Outcome2 (cow after) = 1
Information received = log2 P2 - log2 P1
= log2 (P2/P1)
= log2 (1 / (1/64))
= log2 (64)
= 6 bits
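The same arithmetic, checked in Python:

```python
import math

p_before = 1 / 64    # cow equally likely to be in any of the 64 cells
p_after = 1.0        # the message pins down the exact cell
info = math.log2(p_after / p_before)
print(info)  # 6.0 bits
```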
What are the message and information-received formulas?
- Q(message) = P2(after) - P1(before)
- Information received = log2 P2 - log2 P1 = log2 (P2/P1)
How can the concept of information measure be applied to a classification method, such as the ID3 algorithm?
Each attribute test acts as a "message" about the class label: ID3 measures, via entropy, how much information each candidate attribute's answer is expected to provide, and splits on the most informative one.
What is entropy and information gain? How to use information gain for choosing an attribute?
- The purpose is to select the best attribute, i.e. the one with the highest information gain
- Entropy measures the information required to classify an arbitrary tuple
- Information gain, the expected reduction in entropy after a split, is used to determine the best split
What is ID3’s induction bias?
There is a natural bias in the information gain measure that favors attributes with many distinct values over those with few distinct values
What is overfitting?
Overfitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations.
What are the technical options of overcoming the bias problem for possible improvement?
- Use the gain ratio instead (The measure Gain Ratio has been used in C4.5 algorithm)
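A small sketch contrasting the two measures on made-up data: an ID-like attribute with a unique value per row achieves maximal information gain, but its large split information pulls its gain ratio below that of a genuinely predictive attribute:

```python
import math
from collections import Counter

def H(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain_and_ratio(values, labels):
    """Information gain and gain ratio of splitting `labels` by `values`."""
    n = len(labels)
    rem = split_info = 0.0
    for v in set(values):
        sub = [l for x, l in zip(values, labels) if x == v]
        w = len(sub) / n
        rem += w * H(sub)              # remaining entropy after the split
        split_info -= w * math.log2(w) # penalty for many small partitions
    gain = H(labels) - rem
    return gain, gain / split_info if split_info else 0.0

labels = ["yes", "yes", "no", "no"]
ids    = ["a", "b", "c", "d"]       # unique per row, like a record ID
group  = ["g1", "g1", "g2", "g2"]   # genuinely predictive attribute

print(gain_and_ratio(ids, labels))    # same gain, lower ratio
print(gain_and_ratio(group, labels))
```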
How differently is a classification task done by DT induction and by a Naïve Bayes classifier? (*Give 3 differences.)
- NB uses a probabilistic model vs. DT's inductive (tree) model
- NB applies Bayes' theorem to weight the evidence
- NB calculates prior probabilities from the training data
[Need review]
What are the two assumptions for using NB classifier?
- The quantities of interest are governed by the distribution of prior probabilities in that optimal decisions can be made by reasoning about these probabilities together with observed data
- Attributes are independent of each other
Why is the Naïve Bayes algorithm more suitable for high-dimensional data?
Because of its strong conditional-independence assumption: the number of probabilities to estimate grows linearly with the number of attributes, rather than exponentially with their combinations.
What is text classification? What is the basic idea to convert unstructured text data into structured for classification?
Consider a classification problem in which the instances are text documents:
- Spam filtering
- Text document categorization
- Classify survey comments
Make text data be structured:
- Treat each text document as an object
- Treat word position as attribute
- Treat words as domain values
How to estimate the number of prior probabilities which need to be calculated for text classification?
Estimation = C * I * K, where:
- C = number of target values (classes)
- I = number of text positions
- K = number of distinct words
With the position-independence assumption, this reduces to:
Estimation = C * K
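A minimal sketch of both ideas, using a made-up three-document corpus: each document is converted into a count vector over a shared vocabulary, and the simplified parameter estimate C * K is computed from the result:

```python
# Hypothetical toy corpus (documents and labels are made up for illustration)
docs = ["win money now", "meeting schedule today", "win prize money"]
labels = ["spam", "ham", "spam"]

# Structure the text: one attribute per distinct word in the vocabulary
vocab = sorted({w for d in docs for w in d.split()})
# One row per document: word counts over the shared vocabulary
rows = [[d.split().count(w) for w in vocab] for d in docs]

C = len(set(labels))   # number of target values (classes)
K = len(vocab)         # number of distinct words
print(vocab)
print(rows)
print("probabilities to estimate (C * K):", C * K)
```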

Why is the Naïve Bayes algorithm more suitable for text data mining? What is its limitation?
Pros
- Very good in domains with many equally important features
- A good dependable baseline for text classification
Cons
- The independence assumption states that a word's contribution to the classification evidence has no association with its location in the text (i.e. words are independent of each other)
- This assumption is clearly inaccurate. In practice, however, the Naïve Bayes algorithm performs remarkably well when classifying text by topic, despite the incorrectness of the independence assumption.
What are the main differences between topic oriented and sentiment oriented text classification?
- Topic-oriented classification relies on word distribution and the assumption that words contribute independently to the classification evidence
- Sentiment analysis requires knowledge of individual word meanings and the relationships between the words
Based on what principle all classification methods can be commonly divided into two general categories (name each)? Provide two example classification algorithms to illustrate each category.
- Model-based classification
- e.g. DT and ANN
- Instance-based classification
- e.g. Naïve Bayes and k-nearest neighbor
Comment on the ANN classification approach in terms of its principle and trade-offs.
- Borrows from the human brain: neurons and synapses
- Has an input layer, hidden layers and an output layer
- Weights are assigned and refined using training data
Trade-offs
- Expensive to train
- Unpredictable (the learned model is hard to interpret)
What are the main differences between Classification and Clustering DM (list 3 from different perspectives)?
- Clustering doesn't use training data
- There is no predefined class; rather, the classes are discovered
- Clustering mining is less precise (by design) - it looks at the big picture
- Gain insights with hidden data concepts and data distribution
Provide two application examples of clustering DM, explain how the DM result may be used for supporting business decision making.
- Data clustering for customer group analysis
- Generated concepts for supporting information retrieval (clustering content for search engines)
What are the general criteria for judging quality of clustering DM results?
- high intra-class similarity
- low inter-class similarity
- scalability
- deal with different types of attributes
- deal with noise and outliers
- high dimensionality
- interpretability and usability
What is the basic idea to convert non-numerical data into numerical ones and how to prepare your data with various attribute types for clustering?
You need to calculate the Euclidean distance between objects, therefore the attribute values need to be quantified. You prepare your data using the following:
- Binary variables {symmetric, asymmetric}
- Nominal variables (e.g. {red, green, blue})
- Ordinal variables (e.g. medals)
What are the basic data structures for clustering mining?
- Data matrix
- Dissimilarity matrix

How to calculate dissimilarity of object pairs for a dataset with mixed attribute types for clustering?
Map each attribute type's contribution onto a comparable dissimilarity scale, then combine the per-attribute contributions (e.g. as a weighted average). For numeric attributes, use Euclidean distance: let x1 = (1, 2) and x2 = (3, 5) represent two objects in the data matrix, then
d(x1, x2) = sqrt((3-1)^2 + (5-2)^2) = sqrt(2^2 + 3^2) = sqrt(13) ≈ 3.61
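The same calculation as a small Python helper:

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

d = euclidean((1, 2), (3, 5))
print(round(d, 2))  # 3.61
```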
What are the main categories of clustering mining methods?
- Partition-based clustering
- Hierarchical clustering
How to trace K-means algorithm on a given dataset?
Given
1) a 1D data set: {2, 4, 10, 12, 3, 20, 30, 11, 25}
2) K = 2.
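A minimal k-means sketch tracing this dataset. The initial centers 2 and 4 are an assumption (the exercise does not fix the seeds), and the sketch assumes no cluster ever becomes empty:

```python
def kmeans_1d(data, centers, iters=20):
    """Plain k-means on 1-D points; assumes no cluster ever empties."""
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for x in data:
            # assign each point to its nearest center
            clusters[min(range(len(centers)), key=lambda j: abs(x - centers[j]))].append(x)
        new = [sum(c) / len(c) for c in clusters]  # recompute cluster means
        if new == centers:                         # converged: means unchanged
            break
        centers = new
    return centers, clusters

data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
centers, clusters = kmeans_1d(data, centers=[2.0, 4.0])  # assumed initial seeds
print(clusters, centers)
```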

Explain why clustering DM is said to discover concepts hidden in large datasets?
It provides a high-level view of the dataset, that is, the bigger picture. Clustering doesn't use a predetermined class but can be used to discover classes (concepts) hidden in the data.
[Review]
What are outliers? How to handle them in applications?
- Outliers are data points whose values differ greatly from those of the remaining set of data
- Handle them as noise or as targets (through algorithms)
What are the main differences between the two partition based methods: k-means and k-medoids?
- Instead of taking the mean value of the objects in a cluster as a reference point, a medoid, i.e. the most centrally located object in the cluster, is used
- A medoid's average dissimilarity to all objects in the cluster is minimal, i.e. it is the most centrally located point in the cluster
- K-medoids handles outliers better than k-means
- K-medoids is efficient on small data sets
What are the main strength and limitation of k-medoids algorithm?
- Strengths
- Efficient on small data sets
- Handles outliers better than k-means
- Limitations
- Inefficient on large data sets
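A minimal alternating k-medoids sketch (a simplification, not full PAM) on made-up data with one outlier, showing that the medoid, unlike a mean, is not dragged toward the outlier:

```python
def kmedoids_1d(data, medoids, iters=20):
    """Alternating k-medoids on 1-D points: assign, then re-pick medoids."""
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in medoids]
        for x in data:
            # assign each point to its nearest medoid
            clusters[min(range(len(medoids)), key=lambda j: abs(x - medoids[j]))].append(x)
        # new medoid = the member with minimal total dissimilarity to its cluster
        new = [min(c, key=lambda m: sum(abs(m - y) for y in c)) for c in clusters]
        if new == medoids:   # converged
            break
        medoids = new
    return medoids, clusters

data = [1, 2, 3, 4, 100]    # 100 is an outlier
medoids, clusters = kmedoids_1d(data, medoids=[1, 4])
print(medoids, clusters)    # the medoid stays at 4; the cluster mean would be ~35.7
```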
Use a small dataset to trace each algorithm.
See Wikipedia article.
What are the differences between partition based clustering and hierarchical clustering?
Partition-based clustering produces a single flat set of k clusters, whereas hierarchical clustering forms clusters in levels, creating a set of clusters at each level.
How are the mined clusters stored in a dendrogram data structure in a hierarchical clustering result?
- The root is a single cluster containing all objects
- Each leaf is an individual object
- A cluster at level i is the union of its children clusters at level i+1
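A single-link agglomerative sketch on made-up 1-D data, recording the set of clusters at each level, from all-singletons up to the single root cluster, so that each cluster is visibly the union of clusters from the level below:

```python
def agglomerative_1d(data):
    """Single-link agglomerative clustering; records the clusters at every level."""
    clusters = [(x,) for x in sorted(data)]
    levels = [list(clusters)]
    while len(clusters) > 1:
        # find the pair of clusters with the smallest single-link distance
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda p: min(abs(a - b) for a in clusters[p[0]] for b in clusters[p[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        levels.append(list(clusters))
    return levels

for lvl in agglomerative_1d([1, 2, 11, 12, 30]):
    print(lvl)
```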

What kind of constraints may be applied to cluster analysis?
- A single target concept (i.e. single-dimension clustering), such as salary
- A selected set of attributes such as salary, female, age (20-39)
- All attributes