Pre-exam Flashcards
What is model underfitting?
When the model has yet to learn the true structure of the data. Performs poorly on both training and test sets.
What is model overfitting?
The model is too complex and fits the training data so closely that it is no longer able to generalize to unseen data.
When a model is overfit, what happens to the test and training error rate?
The test error rate increases while the training error rate continues to decrease.
What are some reasons overfitting can occur?
Fitting noise points; not enough representative data.
What is MDL?
The minimum description length (MDL) principle is a formalization of Occam’s razor in which the best hypothesis for a given set of data is the one that leads to the best compression of the data. (Wikipedia)
What is a validation set used for when building decision trees?
The data set is divided into two smaller subsets: one for training (2/3) and the other for estimating the generalization error (1/3).
What is prepruning?
A method used on decision trees to stop growing the tree before it is fully grown.
What is post pruning?
Pruning a decision tree after it has been constructed.
What are 4 methods used for evaluating the performance of a classifier?
1. Holdout (part of the data for training, the rest for testing)
2. Random subsampling (repeating holdout several times)
3. Cross-validation (each record is used the same number of times for training and exactly once for testing)
4. Bootstrap (sampling with replacement)
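A minimal Python sketch of how holdout and k-fold cross-validation splits can be produced; the record list, the 2/3 fraction, and k = 10 are illustrative assumptions:

import random

def holdout_split(records, train_frac=2/3, seed=0):
    # holdout: shuffle once, then cut into a training and a test portion
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

def kfold_splits(records, k=10):
    # cross-validation: each record is tested exactly once and
    # used for training in the other k-1 folds
    folds = [records[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, test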
What is a rule antecedent?
The condition for a rule.
What is a rule consequent?
The end result for a rule (a class).
In a rule set, what is coverage?
The fraction of records in a data set that trigger the rule (satisfy the antecedent).
e.g. rule is triggered 5 times out of a data set of size 10 would yield a coverage of 0.5.
In a rule set, what is accuracy?
The fraction of records triggered by the rule whose class labels equal the rule's consequent y.
In other words, of the records that satisfy the antecedent, the fraction that also satisfy the consequent.
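A minimal sketch of both metrics, assuming a rule is an (antecedent, consequent) pair where the antecedent is a boolean predicate over a record:

def coverage(rule, records):
    antecedent, _ = rule
    return sum(1 for r in records if antecedent(r)) / len(records)

def accuracy(rule, records):
    antecedent, consequent = rule
    triggered = [r for r in records if antecedent(r)]
    if not triggered:
        return 0.0
    return sum(1 for r in triggered if r["class"] == consequent) / len(triggered)

# hypothetical example: a rule that fires when income > 50 and predicts "yes"
# rule = (lambda r: r["income"] > 50, "yes")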
What is a mutually exclusive rule?
The rules are independent of each other.
Every record is covered by at most one rule (no record is covered by two or more rules).
What are exhaustive rules?
Accounts for every possible combination of attribute values. Each record is covered by at least one rule.
What is a default rule?
It has an empty antecedent and a default class. It is triggered when all other rules have failed (if used).
What are ordered rules?
Rules that are ordered in decreasing priority (by whatever criterion is defined, such as accuracy, coverage, etc…). Also known as a decision list.
What are the two methods for extracting classification rules?
1. Direct methods, which extract rules directly from the data. 2. Indirect methods, which extract rules from other classification models, such as decision trees.
Explain the sequential covering algorithm.
1. Start from an empty rule. 2. Grow a rule using the Learn-One-Rule function. 3. Remove the training records covered by the rule. 4. Repeat steps 2 and 3 until a stopping criterion is met.
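A minimal sketch of that loop, assuming a learn_one_rule function is supplied (its greedy search is not shown here) and rules follow the (antecedent, consequent) form above:

def sequential_covering(records, target_class, learn_one_rule):
    rules = []
    remaining = list(records)
    while remaining:
        rule = learn_one_rule(remaining, target_class)   # grow one rule
        if rule is None:                                  # stopping criterion
            break
        antecedent, _ = rule
        if not any(antecedent(r) for r in remaining):
            break
        rules.append(rule)
        # remove the training records covered by the rule, then repeat
        remaining = [r for r in remaining if not antecedent(r)]
    return rules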
What are the two rule growing strategies?
- General to specific
- Specific to general
What are some advantages of rule based classifiers?
- As highly expressive as decision trees
- Easy to interpret
- Can classify new instances rapidly
- Performance is comparable to decision trees
What is a rote-learner?
Memorizes the entire training data and performs classification only if attributes of a record match one of the training examples exactly
What is an instance-based classifier?
Stores the training records and uses them to predict the class label of unseen cases.
What is a Voronoi diagram?
It’s essentially a model of nearest-neighbour classification: it partitions the space into regions around each training point. Points on an edge are equidistant from two parent points; points at an intersection are equidistant from three (or more) parents.
What is the algorithm for k-nearest neighbour?
1. Compute the distance between the test point and the training records (e.g. Euclidean distance). 2. Identify the k nearest neighbours. 3. Determine the class from the nearest-neighbour list: take the majority vote of class labels among the k nearest neighbours, optionally weighting each vote according to distance.
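A minimal KNN sketch with Euclidean distance and an unweighted majority vote; the (vector, label) record format is an assumption:

import math
from collections import Counter

def knn_classify(train, x, k=3):
    # sort training records by Euclidean distance to the query point x
    nearest = sorted(train, key=lambda rec: math.dist(rec[0], x))[:k]
    # majority vote among the k nearest neighbours
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# e.g. knn_classify([([1.0, 2.0], "A"), ([5.0, 6.0], "B")], [1.5, 2.5], k=1)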
What 3 things do KNN classifiers require?
-The set of stored records -Distance metric to compute the distance between records -The value of k, the number of nearest neighbours to retrieve.
What can be the issue with choosing the value of K for KNN?
- If k is too small, the model is sensitive to noise points - If k is too large, the neighbourhood may include points from other classes
Explain scaling issues with distance based classifiers.
Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes. E.g. height may vary from 1.5m to 1.8m while weight may vary from 90lb to 300lb.
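A minimal min-max scaling sketch for one attribute column (a constant column is mapped to 0 to avoid division by zero):

def min_max_scale(column):
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

# e.g. min_max_scale([90, 150, 300]) maps the weight range onto [0, 1],
# so weight no longer dominates height in a Euclidean distance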
What are some of the characteristics of a KNN classifier?
- It’s a lazy learner - It does not build a model explicitly - Classifying unknown records can be relatively expensive
What is a Bayes classifier?
A probabilistic framework for solving classification problems. It uses conditional probability.
What is Bayes theorem?
P(Y | X) = P(X | Y) P(Y) / P(X): the posterior probability of class Y given attributes X equals the class-conditional probability times the prior, divided by the evidence.
What is a Naive Bayes Classifier?
It assumes independence among attributes.
Why is a large data set required for Bayes classifiers?
Smaller data sets mean individual rows have more influence on the probability estimates.
Repeated passes over the data are also needed.
How is the probability for continuous attributes estimated in Bayes theorem?
Discretize the range of values into bins
- Can use a two-way split (A < v) or (A > v)
- Or use probability density estimation:
assume the values follow a normal distribution
use the data to estimate the parameters of the distribution
use it to estimate the conditional probability
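A minimal sketch of the normal-density estimate, using the class-conditional sample mean and variance:

import math

def gaussian_density(v, values):
    # estimate the distribution's parameters from one class's values
    mu = sum(values) / len(values)
    var = sum((x - mu) ** 2 for x in values) / (len(values) - 1)
    # evaluate the normal density at v as the conditional probability estimate
    return math.exp(-((v - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

# hypothetical use: gaussian_density(120, incomes_of_class_yes)
# estimates P(income = 120 | class = yes)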
What are some of the characteristics of Naive Bayes classifiers?
Robust to noise points
Can handle missing values by ignoring the instance during probability estimate calculations
Robust to irrelevant attributes
Independence assumption may not hold for some attributes (but other techniques, such as Bayesian belief networks, can be used instead)
What is a Bayesian Belief Network?
A classifier similar to Naive Bayes, but it allows attributes to be dependent on one another.
This approach is more sophisticated, but more realistic.
How is a Bayesian Belief Network represented?
- Create a directed acyclic graph (DAG) encoding the dependence relationships among a set of variables
- Create a probability table associating each node with its immediate parent nodes.
What is the issue with an ANN that has many nodes?
What is the solution?
The more nodes you have, the more likely you are going to get stuck in a local minimum.
Solutions:
- Reduce the number of nodes
- Or, repeat the model building process several times
What are the properties of a feedforward ANN?
Each layer only connects to the next layer.
There are no connections between nodes within the same layer and no backward connections.
What is an Acyclic Network (ANN)?
The connections between nodes in the same layer are cut out.
A given layer can still be connected to many other layers (unlike feedforward), as long as no cycles are formed.
What is the typical setup for an ANN with respect to classes and attributes?
The number of input nodes is equal to the # of attributes.
The number of output nodes is equal to the # of classes (with the exception of 2 classes – just uses 1 output node).
Why is it important to remove redundant or irrelevant attributes in an ANN?
The # of attributes determines the # of input nodes, so it's important to remove redundant or irrelevant attributes to keep down the # of connections and help avoid local minima.
How is the output of an ANN interpreted?
A high value at an output node suggests a high probability of the input belonging to the class associated with the output node.
Describe the back propagation algorithm used for ANNs.
1. Present a sample to the input nodes.
2. Propagate the data through the layers.
3. Calculate the results at the output nodes.
4. Determine the error at the output nodes.
5. Propagate the error backwards to adjust the weights.
6. Repeat until a stopping criterion is satisfied.
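A minimal numpy sketch of one backpropagation step for a single-hidden-layer network with sigmoid activations; the weight shapes and learning rate are illustrative assumptions:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, target, W1, W2, lr=0.1):
    h = sigmoid(W1 @ x)                         # propagate through hidden layer
    y = sigmoid(W2 @ h)                         # results at the output nodes
    err_out = (y - target) * y * (1 - y)        # error at the output nodes
    err_hid = (W2.T @ err_out) * h * (1 - h)    # error propagated backwards
    W2 -= lr * np.outer(err_out, h)             # adjust the weights
    W1 -= lr * np.outer(err_hid, x)
    return W1, W2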
What are the 3 types of ANN learning?
Correlation
Competitive learning
Feedback-based
Describe the Correlation learning method for ANNs.
Correlation (similar things should be more similar and less similar things should be made less similar)
- When different neurons fire at the same time, the weight representing the connection between them should be increased (associative learning).
Describe the competitive learning, learning method for ANNs.
Competitive learning
- For each sample (set of input values), one node will be the best match. The weights between this winner and the input nodes should be increased.
Describe the Feedback-based learning method for ANNs.
Correct behaviour should be rewarded by increasing weights.
What are some applications of neural networks?
pattern association
character recognition
image compression
classification
forecasting
optimization
etc…
What are some characteristics of ANNs?
Slow
Poor interpretability of results
Able to approximate any target function
Can ignore irrelevant or redundant attributes
Easy to parallelize
May converge to a local minimum (can be mitigated with simulated annealing)
What is an epoch with respect to ANNs?
A full application of a complete data set to a neural network.
What is simulated annealing?
A noise term (acting like a momentum term) used over multiple iterations to avoid falling into a local minimum. At first the term is very noisy, and it gets less noisy as the network becomes more accurate.
How many training examples are required for an ANN?
The number of training examples should be larger than the # of weights divided by (1 - accuracy):
training examples > weights / (1 - accuracy)
e.g. a network with 100 weights and a target accuracy of 0.9 needs more than 100 / 0.1 = 1000 training examples.
How is an ANN pruned?
Train a neural network.
While the performance doesn’t degrade by more than a threshold value {
- delete a node or connection and retrain the network
}
Connections with small weights (irrelevant or redundant attributes) are pruned.
What preprocessing techniques are required for ANNs?
- remove redundant or irrelevant features
- transform to numerical values
- normalize data to the range of 0 to 1 or -1 to 1
How do support vector machines represent the decision boundary?
Using a subset of the training examples, known as support vectors.
Describe Maximum Margin Hyperplanes for SVMs.
There are infinitely many hyperplanes (ways to split the classes in a model).
The SVM must choose one of the hyperplanes to represent its decision boundary, based on how well they are expected to perform on test examples.
Trying to maximize the width of the “road” / margin.
How is the decision boundary selected in SVMs?
The goal is to maximize the width of the margin “road”.
Why is it important for an SVM to maximize the margin of their decision boundary?
To ensure that their worst-case generalization errors are minimized.
What are support vectors?
Training points that lie closest to the decision boundary, on the edge of the maximized margin. SVMs represent the boundary using this subset of training examples.
What approach can be used for SVMs when decision boundaries aren’t as clear?
Allow for some contamination by using slack variables.
What approach can be used if a support vector machine is not linear?
Math can be used to map the data to a new space and the new data can be used to classify.
In this case, the data needs to be numerical.
Why are support vector machines a popular choice of classifer?
Unlike ANNs, they don’t get stuck in a local minimum, especially when the data is transformed to a new space.
What is required for data to be transformed to a new space when using SVMs?
The data must be linearly separable in the new space.
What is the difference between a global model and a local model?
A global model fits a single model over the whole input space (e.g. SVMs).
A local model combines smaller models fitted to regions of the space (e.g. KNN and ANNs).
Does an SVM try to find a global minimum or employ a greedy strategy to search the hypothesis space?
SVMs try to find the global minimum, unlike neural networks, which employ a greedy strategy to search the hypothesis space.
What is an ensemble method for classification?
Improves classification accuracy by aggregating the predictions of multiple classifiers.
Describe a high-level overview of how ensemble methods work.
An ensemble method constructs a set of base classifiers from training data and performs classification by taking a vote on the predictions made by each base classifier.
Using ensemble methods, what can be done if one classifier performs better than the others?
Weight the votes of the classifiers so that the better-performing classifier has a higher weight.
What is bagging with respect to data mining and ensemble methods?
Bagging (bootstrap aggregating): sample repeatedly with replacement from the original data to create new training sets, train a classifier on each, and combine their predictions by voting.
Reduces variance and helps to avoid overfitting.
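A minimal bagging sketch; the `train` function (builds a model from records) and the callable-model interface are assumptions:

import random
from collections import Counter

def bagging(records, train, n_models=10, seed=0):
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # bootstrap sample: same size as the original data, with replacement
        sample = [rng.choice(records) for _ in range(len(records))]
        models.append(train(sample))
    def predict(x):
        # aggregate by taking a majority vote over the base classifiers
        return Counter(m(x) for m in models).most_common(1)[0][0]
    return predict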
What is boosting with respect to data mining and ensemble methods?
Place more emphasis on specific examples that are difficult to classify: assign them a higher weight, i.e. a greater probability of being selected.
Records that are wrongly classified will have their weights increased.
Records that are classified correctly will have their weights decreased.
Describe the AdaBoost method used in ensembles.
AdaBoost creates many classifiers / models by repeatedly drawing samples.
Samples that are easy to classify get a lower weight, and ones that are harder to classify get a higher weight.
If any intermediate round produces an error rate higher than 50%, the weights are reverted and the resampling procedure is repeated.
Each classifier also gets a weight (based on its error rate).
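A minimal sketch of one round's weight update, assuming `correct` flags whether each record was classified correctly and `eps` is the classifier's weighted error:

import math

def adaboost_update(weights, correct, eps):
    if eps > 0.5:
        # revert to uniform weights and resample, as described above
        return [1.0 / len(weights)] * len(weights)
    eps = max(eps, 1e-10)                       # guard against log(1/0)
    alpha = 0.5 * math.log((1 - eps) / eps)     # this classifier's vote weight
    # wrongly classified records gain weight, correct ones lose weight
    new = [w * math.exp(alpha if not ok else -alpha)
           for w, ok in zip(weights, correct)]
    total = sum(new)
    return [w / total for w in new]             # renormalize to sum to 1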
What conditions must be satisfied for ensemble methods to improve the overall classification accuracy?
1) the different classifiers make different mistakes in the data
2) the different classifiers perform better than random guessing
Exam question: think about how to work with noisy data and what this means with respect to bias and variance.
A model with high variance and low bias tends to generalize new test instances well, but is susceptible to overfitting noisy data.
If data is noisy, perhaps it’s better to have high bias, and lower variance.
Choice of classifier is important.
Bagging and Boosting can help.
What is bias?
Bias is the systematic error of a model: how far its predictions deviate, on average, from the ground truth.
The stronger the assumptions made by a classifier about the nature of its decision boundary, the larger the classifier's bias will be.
Design choices, such as the choice of algorithm, can introduce bias too.
What is variance?
Variance is a measure of how much a model's predictions spread, i.e. how much they change when the model is trained on different samples of the data.
Describe a model with high bias, but low variance.
A model constructed with high bias and low variance tends to underfit the training data. Both underfitting and overfitting can lead to a model that performs poorly.
Describe a model with high variance and low bias.
A model with high variance and low bias tends to generalize new test instances well, but is susceptible to overfitting noisy data.
What happens to the bias and variance as the complexity of a model grows (for example, the number of nodes in a decision tree).
If you increase the # of nodes (the complexity of the tree) the variance will increase and the bias will decrease.
Describe overfitting with respect to the bias-variance decomposition.
Overfitting is modelling a random noise component in the data (model is too complex).
Increasing the complexity of the model means you have to estimate more parameters, and there is a greater probability of error.
Simpler models tend to have low variance and potentially higher bias.
Visualization can help you to pick a good model.
What is the class imbalance problem?
Data sets with imbalanced class distributions.
E.g. Credit card fraud. It’s very rare that fraud exists, but when it does, it’s important and should be given a higher weight.
What is accuracy measure?
It’s a metric used to compare the performance of classifiers.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
What is a confusion matrix?
It shows how accurate your model is by presenting the TP, FN, FP, and TN counts for a given classifier in matrix form.
What is a cost matrix?
A cost matrix assigns a cost to the TP, FN, FP, and TN outcomes. It's useful for imbalanced classes: you can assign a high cost to instances that are classified incorrectly.
The goal is to have high accuracy and low cost.
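A minimal sketch computing accuracy and total cost from the four counts; the cost values in the example are made up:

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def total_cost(tp, tn, fp, fn, cost):
    return (tp * cost["TP"] + tn * cost["TN"]
            + fp * cost["FP"] + fn * cost["FN"])

# e.g. fraud detection might use {"TP": 0, "TN": 0, "FP": 1, "FN": 100}
# so that missing a fraud case (FN) costs far more than a false alarm (FP)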
What is precision?
The fraction of records that actually turn out to be positive among those the classifier has declared the positive class.
TP / ( TP + FP )
What is recall?
Recall measures the fraction of positive examples correctly predicted by the classifier.
r = TP / ( TP + FN )
What is true negative rate (specificity)?
The fraction of negative examples correctly predicted by the model.
TNR = TN / (TN + FP)
What is true positive rate?
The fraction of positive examples predicted correctly by the model.
TPR = TP / ( TP + FN )
What is F measure?
A measure that balances precision and recall (their harmonic mean); it is high only when both are high.
F1 = 2 x TP / ( 2 x TP + FP + FN )
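A minimal sketch of the three metrics from the raw counts:

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_measure(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn)

# e.g. tp=8, fp=2, fn=4 gives precision 0.8, recall 0.67, F1 0.73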
What is a learning curve?
It shows how accuracy changes with varying sample size.
Requires a sampling schedule.
In the graph there is a horizontal bar near the top, which is an upper limit on accuracy. You can never surpass this bar due to noise, etc… in the data.
* on exam
What is a ROC curve?
Receiver Operating Characteristic (ROC)
It plots the TP rate (y-axis) against the FP rate (x-axis) of a given model.
It also allows for relative comparison across different models (each model is represented by a curve).
Using a ROC curve, how is one model compared to another?
The area under the curve tells you how the model performed. The ideal area is one. You want the curve to be as close to the upper left corner as possible (high TP rate and low FP rate).
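A minimal sketch that sweeps a score threshold over (score, label) pairs to produce the ROC points, then estimates the area under the curve with the trapezoid rule; the "+" / "-" label convention is an assumption:

def roc_points(scored):
    ranked = sorted(scored, key=lambda s: -s[0])    # highest score first
    n_pos = sum(1 for _, y in ranked if y == "+")
    n_neg = len(ranked) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, y in ranked:             # lower the threshold one record at a time
        if y == "+":
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))     # (FP rate, TP rate)
    return points

def auc(points):
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))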
What would the model of a classifier that makes random guesses look like on a ROC curve?
A line along the main diagonal connecting (0,0) and (1,1).
What is association rule mining?
Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.
Define itemset with respect to association rule mining.
A collection of one or more items.
E.g: { Milk, Bread, Diaper }
Define support count with respect to association rule mining.
The frequency of occurrence of an itemset.
Denoted by σ
e.g. σ ( { milk, bread, diaper } ) = 2
Define support with respect to association rule mining.
The fraction of transactions that contain an itemset.
s ( { milk, bread, diaper } ) = 2/5
Define frequent itemset with respect to association rule mining.
An itemset whose support is greater than or equal to a minsup threshold.
e.g. the items occur together at least 10 times in the database.
What is an association rule with respect to association rule mining?
An implication expression of the form X → Y where X and Y are itemsets.
e.g. { milk, diaper } → { beer }
What are two rule evaluation metrics with respect to association rule analysis?
Support ( s )
Confidence ( c )
What is support ( s ) with respect to association rule analysis?
The fraction of transactions that contain both X and Y.
Support determines how often a rule is applicable to a given data set.
{ X } → { Y }
What is confidence with respect to association rule analysis?
Measures how often items in Y appear in transactions that contain X.
{ X } → { Y }
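A minimal sketch of both metrics over a list of transactions, each represented as a Python set of items:

def support(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(X, Y, transactions):
    return support(X | Y, transactions) / support(X, transactions)

# e.g. confidence({"milk", "diaper"}, {"beer"}, transactions) measures how
# often beer appears in transactions that contain milk and diaper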
Why is support important with respect to association rule analysis?
Rules that have low support may only occur by chance.
A low support rule is likely to be uninteresting from a business perspective.
Why is confidence important with respect to association rule analysis?
Confidence measures the reliability of the inference made by a rule.
The higher the confidence, the more likely it is for Y to be present in transactions that contain X.
Confidence also provides an estimate of the conditional probability of Y given X.
What is the 2-step approach to association rule mining?
1. Frequent itemset generation: generate all itemsets whose support >= minsup.
2. Rule generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset.
What is the goal of association rule mining?
Given a set of transactions T, find rules having:
support >= minsup threshold
confidence >= minconf threshold
Is a brute-force approach feasible with respect to association rule mining?
No, it's computationally prohibitive. It would involve:
- listing all possible association rules
- computing the support and confidence for each rule
- pruning rules that fail the minsup and minconf thresholds
What is an itemset lattice?
A series of connected nodes that show all possible combinations of an itemset, and sometimes a subset of combinations.
What is the objective of frequent itemset generation with respect to association rule analysis.
Find all the itemsets that satisfy the minsup threshold. These itemsets are called frequent itemsets.
What is the objective of rule generation with respect to association rule analysis?
Extract all the high-confidence rules from the frequent itemsets found in the previous step (frequent itemset generation). These rules are called strong rules.
What are the 3 strategies for Frequent Itemset Generation?
- Reduce the number of candidates (M) using pruning
- Reduce the number of transactions by reducing the size of N as the size of the itemset increases
- Reduce the number of comparisons by using efficient data structures to store candidates or transactions, so there is no need to match every candidate against every transaction.
What is the Apriori Principle with respect to association rule analysis?
If an itemset is frequent, then all of its subsets must also be frequent.
If {A, B, C} is frequent, then {A, B}, {A, C}, and {B, C} must also be frequent.
Conversely, if an itemset is infrequent, then all of its supersets must be infrequent too.
e.g. if {A, B} is infrequent, then {A, B, C} must be infrequent as well.
Infrequent itemsets can therefore be pruned.
What is support based pruning with respect to association rule analysis?
Trimming the exponential search space based on the support measure.
What is the anti-monotone property with respect to association rule analysis and support-based pruning?
The support for an itemset never exceeds the support for its subsets.
Describe the Apriori algorithm used for association rule mining.
1. Let k = 1.
2. Generate frequent itemsets of length 1.
3. Repeat until no new frequent itemsets are identified:
- Generate length (k+1) candidate itemsets from the length-k frequent itemsets.
- Prune candidate itemsets containing subsets of length k that are infrequent.
- Count the support of each candidate by scanning the DB.
- Eliminate candidates that are infrequent, leaving only those that are frequent.
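A minimal Apriori sketch over set-valued transactions with an absolute minsup count; candidate generation uses a simple join-then-prune scheme:

from itertools import combinations

def apriori(transactions, minsup):
    def support_count(c):
        return sum(1 for t in transactions if c <= t)

    items = {i for t in transactions for i in t}
    level = {frozenset([i]) for i in items
             if support_count(frozenset([i])) >= minsup}
    frequent = set(level)
    k = 1
    while level:
        # join: combine frequent k-itemsets into (k+1)-itemset candidates
        candidates = {a | b for a in level for b in level
                      if len(a | b) == k + 1}
        # prune: drop candidates with an infrequent k-subset (Apriori principle)
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k))}
        # count support by scanning the DB; keep only the frequent candidates
        level = {c for c in candidates if support_count(c) >= minsup}
        frequent |= level
        k += 1
    return frequent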
Describe how the Apriori algorithm works.
Every item is considered as a candidate 1-itemset, e.g. {cola}. Discard itemsets that don't meet the support threshold.
Do the same with 2-itemsets using only the frequent 1-itemsets (because the Apriori principle holds).
Repeat up to the maximum number of items in an itemset.
Describe the Apriori pruning algorithm for association rule analysis.
Initially make a pass over the data to determine the support of each item and find the frequent 1-itemsets.
Iteratively generate new candidate k-itemsets using the (k-1)-itemsets found in the previous iteration.
Make an additional pass over the data set to count the support of the candidates, and eliminate candidates whose support is less than minsup.
Terminate when no new frequent itemsets are generated.
In the Apriori frequent itemset generation algorithm, what is a hash tree used for?
Determining whether each enumerated k-itemset corresponds to an existing candidate itemset.
What factors affect the complexity of the Apriori algorithm?
Support threshold (a lower threshold results in more itemsets being declared frequent).
Number of items (dimensionality: more space is required to store the support counts of items).
Number of transactions (Apriori makes repeated passes over the data set, so run time increases with the # of transactions).
Average transaction width (in dense data sets the average transaction width can be large).
What is a closed itemset with respect to association analysis?
An itemset is closed if none of its immediate supersets has the same support as the itemset.
What are two ways of representing frequent itemsets in a compact way?
- A binary table
- A lattice with a boundary where we found frequent items (all subsets will be frequent).
What is a maximal frequent itemset?
a frequent itemset for which none of its immediate supersets are frequent.
* When a border is drawn in a lattice to distinguish frequent from infrequent itemsets, the itemsets on the frequent side of the border are the maximal frequent itemsets: their immediate supersets (on the other side of the border) are infrequent.
What are maximal frequent itemsets useful for?
Providing a compact representation of frequent itemsets. They form the smallest set of itemsets from which all frequent itemsets can be derived.
What are closed frequent itemsets?
Provide a minimal representation of itemsets without losing their support information.
An itemset is a closed frequent itemset if it is closed and its support is greater than or equal to minsup.
What is the relationship between closed frequent itemsets, maximal frequent itemsets and frequent itemsets?
Maximal frequent itemsets are a subset of closed frequent itemsets. Closed frequent itemsets are a subset of frequent itemsets.