Final Flashcards
Given an input dataset and the Apriori algorithm, how do you trace the algorithm's intermediate results? (Review)
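A minimal Python sketch of such a trace, printing the frequent itemsets found at each level; the 9-transaction dataset and the absolute support count min_sup = 2 are made up for illustration:

```python
from itertools import combinations

def apriori_trace(transactions, min_sup):
    """Run Apriori level by level; return the frequent itemsets (with counts) per level."""
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items]   # C1: candidate 1-itemsets
    levels = []
    k = 1
    while level:
        # count support of each candidate
        counts = {c: sum(1 for t in transactions if c.issubset(t)) for c in level}
        frequent = {c: n for c, n in counts.items() if n >= min_sup}
        if not frequent:
            break
        levels.append(frequent)
        # candidate generation: join L_k with itself, prune by the apriori property
        freq_sets = list(frequent)
        cands = set()
        for a in freq_sets:
            for b in freq_sets:
                u = a | b
                if len(u) == k + 1 and all(frozenset(s) in frequent for s in combinations(u, k)):
                    cands.add(u)
        level = list(cands)
        k += 1
    return levels

# Toy dataset; each transaction is a set of item ids
D = [{1, 2, 5}, {2, 4}, {2, 3}, {1, 2, 4}, {1, 3}, {2, 3}, {1, 3}, {1, 2, 3, 5}, {1, 2, 3}]
for k, Lk in enumerate(apriori_trace(D, min_sup=2), start=1):
    print(f"L{k}:", {tuple(sorted(s)): n for s, n in Lk.items()})
```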

How to derive strong rules from the given frequent itemsets L and a conf_rate?
Generate candidate rules from each frequent itemset in L, then keep only the strong rules by filtering out any rule with conf < min_conf.
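A small sketch of this filtering step, assuming the support counts of the frequent itemsets are already available (the counts below are made up for illustration):

```python
from itertools import combinations

def strong_rules(support, min_conf):
    """support: dict mapping frozenset itemset -> support count.
    Returns (antecedent, consequent, conf) for every rule with conf >= min_conf."""
    rules = []
    for itemset, sup in support.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for ante in map(frozenset, combinations(itemset, r)):
                conf = sup / support[ante]       # conf(X -> Y) = sup(X u Y) / sup(X)
                if conf >= min_conf:
                    rules.append((set(ante), set(itemset - ante), conf))
    return rules

# Hypothetical support counts over 9 transactions
support = {
    frozenset({1}): 6, frozenset({2}): 7, frozenset({5}): 2,
    frozenset({1, 2}): 4, frozenset({1, 5}): 2, frozenset({2, 5}): 2,
    frozenset({1, 2, 5}): 2,
}
for a, c, conf in strong_rules(support, min_conf=0.7):
    print(f"{a} -> {c}  (conf={conf:.2f})")
```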

How to improve the efficiency of the rule generation procedure by applying the apriori property?
Prune while generating rules: if a rule from a frequent itemset fails the minimum-confidence test, every rule from that itemset whose consequent is a superset of the failing rule's consequent must also fail, so those rules need not be generated or tested.

What are the two general purposes of DM, use some examples of mined association patterns to explain for each purpose?
- Find frequent itemsets
- Find all frequent itemsets from D
- Generate association rules
- Derive rules from each frequent itemset
How can the association mining process be mapped to the empirical cycle model of scientific research?
- Observation: observe all the data
- Analysis: generate all associations
- Theory: apply a priori knowledge (support / confidence)
- Prediction: predict X given the a priori knowledge

Why is classification mining a supervised learning process? How about association mining?
- The training data carries known class labels; it is partitioned based on a divide-and-conquer strategy.
- The model trains on one portion of the data and then tests on the other portion. If the accuracy is acceptable, it can be used to predict.
- Association mining, by contrast, is unsupervised: it searches for patterns without predefined class labels.
What are the major phases of conducting a classification mining application?
- Training
- Each tuple is assumed to belong to a predefined class, as determined by the class label attribute
- The set of tuples used for model construction is the training set
- Testing
- The known label of each test sample is compared with the classified result from the model
- Accuracy is the percentage of test-set samples that are correctly classified by the model
- Predicting
- If the accuracy is acceptable, use the model to classify unseen data
Can you describe a mapping between a classification application process and the empirical cycle?
- Observation -> Training data
- Analysis -> Classification algorithm
- Theory -> Classification model
- Prediction -> Testing & prediction

What is the general idea/strategy/method/algorithm of DT induction for classification mining?
- Supervised learning
- Derive a model from a training data set
- Inductive learning process
- Construct a tree using a top-down, divide-and-conquer strategy
- Testing -> Choose -> Split
- Tree construction/induction by greedy search
- Depth-first search
- Heuristic function

What is the general strategy of Inductive Learning (via observing examples)?
- Divide and conquer strategy
- Continue dividing D into subsets, based on a search method, until each subset has only one label, i.e. all examples in the subset share the same class label.
What are the major technical issues of DT Induction approach for classification mining?
- Preparing datasets: (training & testing)
- A training dataset for learning a model
- A test dataset for evaluating the learned model
- Classification model discovery: (constructing a DT)
- Stopping criteria for testing at each node
- How to choose which attribute to split, and how to split (method)
- Control structure for tree construction (recursive process)
- Pruning method
What is the heuristic function used in ID3 algorithm for evaluating search directions?
Entropy calculation: information gain (the expected reduction in entropy) is used to evaluate the candidate attribute splits at each node.
What is the notion of Information Gain, and how is it applied in the ID3 algorithm?
- Expected reduction in entropy
- Define a preferred sequence of attributes to investigate to most rapidly narrow down the state of X
- ID3 uses information gain to select among the candidate attributes at each step while growing the tree
How to convert the ID3 algorithm into an implementation code structure?
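One possible code structure, sketched in Python: a recursive function mirroring the test -> choose -> split loop, returning the tree as nested dicts. The dataset, attribute names, and helper names below are hypothetical:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def id3(rows, labels, attrs):
    """rows: list of dicts attr -> value; returns a nested-dict decision tree."""
    if len(set(labels)) == 1:            # stopping: node is pure
        return labels[0]
    if not attrs:                        # stopping: no attributes left -> majority label
        return Counter(labels).most_common(1)[0][0]
    def gain(a):                         # choose: highest information gain
        rem = 0.0
        for v in {r[a] for r in rows}:
            sub = [l for r, l in zip(rows, labels) if r[a] == v]
            rem += len(sub) / len(labels) * entropy(sub)
        return entropy(labels) - rem
    best = max(attrs, key=gain)
    tree = {best: {}}
    for v in {r[best] for r in rows}:    # split: one branch per attribute value, recurse
        idx = [i for i, r in enumerate(rows) if r[best] == v]
        tree[best][v] = id3([rows[i] for i in idx], [labels[i] for i in idx],
                            [a for a in attrs if a != best])
    return tree

rows = [
    {"outlook": "sunny", "windy": False},
    {"outlook": "sunny", "windy": True},
    {"outlook": "rain",  "windy": False},
    {"outlook": "rain",  "windy": True},
]
labels = ["no", "no", "yes", "no"]
tree = id3(rows, labels, ["outlook", "windy"])
print(tree)
```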

How to quantify information contained in a message?
Q(message) = P2(outcome_after) - P1(outcome_before)
Q, the quantity of information contained in a message.
P1, the probability of outcome before receiving a message.
P2, the probability of outcome after receiving the message.
- Information in general always associates with a question and a message for answering the question
- Information measure is to quantify the outcome of some expectation from the message for answering the question
Suppose a missing cow has strayed into a pasture represented as an
8 x 8 array of “cells”.
Question: Where is the cow?
Outcome: the probability of finding the cow.
Answer 1: Nobody knows.
Answer 2: The cow is in cell (4, 7).
What is the information received?
Outcome1 (cow before) = 1/64
Outcome2 (cow after) = 1
Information received = log2 P2 - log2 P1
= log2 (P2/P1)
= log2 (1 / (1/64))
= log2 (64)
= 6 bits
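The same arithmetic, checked in Python:

```python
import math

p_before = 1 / 64    # cow equally likely to be in any of the 64 cells
p_after = 1.0        # the message pins down the exact cell
info = math.log2(p_after / p_before)
print(info)  # 6.0 bits
```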
What are the message and information-received formulas?
- Q(message) = P2(after) - P1(before)
- Information received = log2 P2 - log2 P1 = log2 (P2/P1)
How can the concept of information measure be applied to a classification method, such as the ID3 algorithm?
Each attribute test acts as a "message" about the class label: ID3 measures, via entropy, how much information each candidate attribute's answer is expected to provide, and splits on the most informative one.
What is entropy and information gain? How to use information gain for choosing an attribute?
- The purpose is to select the best attribute, i.e. the one with the highest information gain
- Entropy measures the information required to classify an arbitrary tuple
- Information gain, the expected reduction in entropy after a split, is used to determine the best split
What is ID3’s induction bias?
There is a natural bias in the information gain measure that favors attributes with many distinct values over those with few distinct values
What is overfitting?
Overfitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations.
What are the technical options of overcoming the bias problem for possible improvement?
- Use the gain ratio instead (The measure Gain Ratio has been used in C4.5 algorithm)
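A small sketch contrasting the two measures on made-up data: an ID-like attribute with a unique value per row achieves maximal information gain, but its large split information pulls its gain ratio below that of a genuinely predictive attribute:

```python
import math
from collections import Counter

def H(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain_and_ratio(values, labels):
    """Information gain and gain ratio of splitting `labels` by `values`."""
    n = len(labels)
    rem = split_info = 0.0
    for v in set(values):
        sub = [l for x, l in zip(values, labels) if x == v]
        w = len(sub) / n
        rem += w * H(sub)              # remaining entropy after the split
        split_info -= w * math.log2(w) # penalty for many small partitions
    gain = H(labels) - rem
    return gain, gain / split_info if split_info else 0.0

labels = ["yes", "yes", "no", "no"]
ids    = ["a", "b", "c", "d"]       # unique per row, like a record ID
group  = ["g1", "g1", "g2", "g2"]   # genuinely predictive attribute

print(gain_and_ratio(ids, labels))    # same gain, lower ratio
print(gain_and_ratio(group, labels))
```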
How differently is a classification task done by DT induction and by a Naïve Bayes classifier? (*Give 3 differences.)
- NB uses a probabilistic model vs. DT's inductive (tree) model
- NB applies Bayes' theorem to weight the evidence
- NB calculates prior probabilities from the training data
[Need review]
What are the two assumptions for using NB classifier?
- The quantities of interest are governed by the distribution of prior probabilities in that optimal decisions can be made by reasoning about these probabilities together with observed data
- Attributes are independent of each other
Why is the Naïve Bayes algorithm more suitable for high-dimensional data?
Because of its strong conditional-independence assumption: the number of probabilities to estimate grows linearly with the number of attributes, rather than exponentially with their combinations.
What is text classification? What is the basic idea to convert unstructured text data into structured for classification?
Consider a classification problem in which the instances are text documents:
- Spam filtering
- Text document categorization
- Classify survey comments
Make text data be structured:
- Treat each text document as an object
- Treat word position as attribute
- Treat words as domain values
How to estimate the number of prior probabilities which need to be calculated for text classification?
Estimation = C * I * K, where:
- C = number of target values (classes)
- I = number of text positions
- K = number of distinct words
With the position-independence assumption, this reduces to:
Estimation = C * K
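A minimal sketch of both ideas, using a made-up three-document corpus: each document is converted into a count vector over a shared vocabulary, and the simplified parameter estimate C * K is computed from the result:

```python
# Hypothetical toy corpus (documents and labels are made up for illustration)
docs = ["win money now", "meeting schedule today", "win prize money"]
labels = ["spam", "ham", "spam"]

# Structure the text: one attribute per distinct word in the vocabulary
vocab = sorted({w for d in docs for w in d.split()})
# One row per document: word counts over the shared vocabulary
rows = [[d.split().count(w) for w in vocab] for d in docs]

C = len(set(labels))   # number of target values (classes)
K = len(vocab)         # number of distinct words
print(vocab)
print(rows)
print("probabilities to estimate (C * K):", C * K)
```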

Why is the Naïve Bayes algorithm more suitable for text data mining? What is its limitation?
Pros
- Very good in domains with many equally important features
- A good dependable baseline for text classification
Cons
- The independence assumption states that a word's contribution to the classification evidence has no association with its location in the text (i.e. words are independent of each other)
- This assumption is clearly inaccurate. In practice, however, the Naïve Bayes algorithm performs remarkably well when classifying text by topic, despite the incorrectness of the independence assumption.
What are the main differences between topic oriented and sentiment oriented text classification?
- Topic-oriented classification relies on word distribution and the assumption that words contribute independently to the classification evidence
- Sentiment analysis requires knowledge of individual word meanings and the relationships between the words
Based on what principle all classification methods can be commonly divided into two general categories (name each)? Provide two example classification algorithms to illustrate each category.
- Model-based classification
- e.g. DT and ANN
- Instance-based classification
- e.g. Naïve Bayes and k-nearest neighbor
Comment on the ANN classification approach in terms of its principle and trade-offs.
- Borrows from the human brain: neurons and synapses
- Has an input layer, hidden layers and an output layer
- Weights are assigned and refined using training data
Trade-offs
- Expensive to train
- Unpredictable (the learned model is hard to interpret)
What are the main differences between Classification and Clustering DM (list 3 from different perspectives)?
- Clustering doesn't use training data
- There is no predefined class; rather, the classes are discovered
- Clustering mining is less precise (by design) - it looks at the big picture
- Gain insights with hidden data concepts and data distribution
Provide two application examples of clustering DM, explain how the DM result may be used for supporting business decision making.
- Data clustering for customer group analysis
- Generated concepts for supporting information retrieval (clustering content for search engines)
What are the general criteria for judging quality of clustering DM results?
- high intra-class similarity
- low inter-class similarity
- scalability
- deal with different types of attributes
- deal with noise and outliers
- high dimensionality
- interpretability and usability
What is the basic idea to convert non-numerical data into numerical ones and how to prepare your data with various attribute types for clustering?
You need to calculate the Euclidean distance between objects, therefore the attribute values need to be quantified. You prepare your data using the following:
- Binary variables {symmetric, asymmetric}
- Nominal variables (e.g. {red, green, blue})
- Ordinal variables (e.g. medals)
What are the basic data structures for clustering mining?
- Data matrix
- Dissimilarity matrix

How to calculate dissimilarity of object pairs for a dataset with mixed attribute types for clustering?
Map each attribute type's contribution onto a comparable dissimilarity scale, then combine the per-attribute contributions (e.g. as a weighted average). For numeric attributes, use Euclidean distance: let x1 = (1, 2) and x2 = (3, 5) represent two objects in the data matrix, then
d(x1, x2) = sqrt((3-1)^2 + (5-2)^2) = sqrt(2^2 + 3^2) = sqrt(13) ≈ 3.61
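The same calculation as a small Python helper:

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

d = euclidean((1, 2), (3, 5))
print(round(d, 2))  # 3.61
```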
What are the main categories of clustering mining methods?
- Partition-based clustering
- Hierarchical clustering
How to trace K-means algorithm on a given dataset?
Given
1) a 1D data set: {2, 4, 10, 12, 3, 20, 30, 11, 25}
2) K = 2.
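A minimal k-means sketch tracing this dataset. The initial centers 2 and 4 are an assumption (the exercise does not fix the seeds), and the sketch assumes no cluster ever becomes empty:

```python
def kmeans_1d(data, centers, iters=20):
    """Plain k-means on 1-D points; assumes no cluster ever empties."""
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for x in data:
            # assign each point to its nearest center
            clusters[min(range(len(centers)), key=lambda j: abs(x - centers[j]))].append(x)
        new = [sum(c) / len(c) for c in clusters]  # recompute cluster means
        if new == centers:                         # converged: means unchanged
            break
        centers = new
    return centers, clusters

data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
centers, clusters = kmeans_1d(data, centers=[2.0, 4.0])  # assumed initial seeds
print(clusters, centers)
```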

Explain why clustering DM is said to discover concepts hidden in large datasets?
It provides a high-level view of the dataset, that is, the bigger picture. Clustering doesn't use a predetermined class but can be used to discover classes (concepts) hidden in the data.
[Review]
What are outliers? How to handle them in applications?
- Outliers are data points whose values differ greatly from those of the remaining set of data
- Handle them as noise or as targets (through algorithms)
What are the main differences between the two partition based methods: k-means and k-medoids?
- Instead of taking the mean value of the objects in a cluster as a reference point, a medoid, i.e. the most centrally located object in the cluster, is used
- A medoid's average dissimilarity to all objects in the cluster is minimal, i.e. it is the most centrally located point in the cluster
- K-medoids handles outliers better than k-means
- K-medoids is efficient on small data sets
What are the main strength and limitation of k-medoids algorithm?
- Strengths
- Efficient on small data sets
- Handles outliers better than k-means
- Limitations
- Inefficient on large data sets
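A minimal alternating k-medoids sketch (a simplification, not full PAM) on made-up data with one outlier, showing that the medoid, unlike a mean, is not dragged toward the outlier:

```python
def kmedoids_1d(data, medoids, iters=20):
    """Alternating k-medoids on 1-D points: assign, then re-pick medoids."""
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in medoids]
        for x in data:
            # assign each point to its nearest medoid
            clusters[min(range(len(medoids)), key=lambda j: abs(x - medoids[j]))].append(x)
        # new medoid = the member with minimal total dissimilarity to its cluster
        new = [min(c, key=lambda m: sum(abs(m - y) for y in c)) for c in clusters]
        if new == medoids:   # converged
            break
        medoids = new
    return medoids, clusters

data = [1, 2, 3, 4, 100]    # 100 is an outlier
medoids, clusters = kmedoids_1d(data, medoids=[1, 4])
print(medoids, clusters)    # the medoid stays at 4; the cluster mean would be ~35.7
```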
Use a small dataset to trace each algorithm.
See Wikipedia article.
What are the differences between partition based clustering and hierarchical clustering?
Partition-based clustering produces a single flat set of k clusters, whereas hierarchical clustering forms clusters in levels, creating a set of clusters at each level.
How are the mined clusters stored in a dendrogram data structure in a hierarchical clustering result?
- The root is a single cluster containing all objects
- Each leaf is an individual object
- A cluster at level i is the union of its children clusters at level i+1
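A single-link agglomerative sketch on made-up 1-D data, recording the set of clusters at each level, from all-singletons up to the single root cluster, so that each cluster is visibly the union of clusters from the level below:

```python
def agglomerative_1d(data):
    """Single-link agglomerative clustering; records the clusters at every level."""
    clusters = [(x,) for x in sorted(data)]
    levels = [list(clusters)]
    while len(clusters) > 1:
        # find the pair of clusters with the smallest single-link distance
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda p: min(abs(a - b) for a in clusters[p[0]] for b in clusters[p[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        levels.append(list(clusters))
    return levels

for lvl in agglomerative_1d([1, 2, 11, 12, 30]):
    print(lvl)
```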

What kind of constraints may be applied to cluster analysis?
- A single target concept (i.e. single-dimension clustering), such as salary
- A selected set of attributes such as salary, female, age (20-39)
- All attributes