Final Exam Flashcards
What are the 3 normalization schemes used in class?
min-max
z-score
decimal scaling
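A minimal Python/NumPy sketch of the three schemes (the sample values and the [0, 1] target range for min-max are illustrative assumptions, not from class):

```python
import numpy as np

def min_max(x, new_min=0.0, new_max=1.0):
    # rescale values linearly into [new_min, new_max]
    return (x - x.min()) / (x.max() - x.min()) * (new_max - new_min) + new_min

def z_score(x):
    # center on the mean and scale by the standard deviation
    return (x - x.mean()) / x.std()

def decimal_scaling(x):
    # divide by 10^j, where j is the smallest power such that max |value| < 1
    j = int(np.ceil(np.log10(np.abs(x).max())))
    return x / (10 ** j)

x = np.array([120.0, 55.0, 760.0, 310.0])
print(min_max(x), z_score(x), decimal_scaling(x), sep="\n")
```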
What are the 3 preprocessing techniques?
- Feature selection
- Dimensionality reduction
- Normalization
What is spatial autocorrelation?
objects that are physically close tend to be similar
What are the 3 types of data sets?
Record
- data matrix
- document data
- transaction data
Graph
- web data
- molecular structures
Ordered
- spatial
- temporal
- sequence
- sequential
What are the two different types of mappings?
1 of n
n of m
explain the 1 of n mapping:
create 1 new binary attribute for every ordinal value
explain the n of m mapping:
create new attributes that together form a unique representation for each ordinal value
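A small sketch of the 1-of-n mapping (one-hot style); the ordinal attribute and its values are made up for illustration:

```python
def one_of_n(values):
    # 1-of-n: one new binary attribute per distinct ordinal value,
    # exactly one of them set to 1 for each record
    levels = sorted(set(values))
    return [[1 if v == level else 0 for level in levels] for v in values]

sizes = ["small", "medium", "large", "medium"]
print(one_of_n(sizes))   # 3 new attributes (large, medium, small), one 1 per row
```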
What are the 3 types of missing values?
Missing completely at random
- can simply remove (scratch out) the affected items
- can substitute the mean, but this affects the variance
Missing at Random
- the value is missing due to the value of another variable
Non-Ignorable Data
- the value is missing due to limitations of the measuring device
What are 4 sampling schemes?
- Simple Random Sampling: each object is selected with equal probability
- Sampling with replacement: do not remove the object when sample is selected
- Sampling without replacement: remove the object when sample is selected
- Stratified sampling: split the data into several partitions and sample randomly from each one
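A sketch of the four schemes using Python's standard `random` module; the toy data and the strata are assumptions:

```python
import random

data = list(range(20))   # toy data set
k = 6

# simple random sampling (here without replacement): every object equally likely
simple = random.sample(data, k)

# sampling with replacement: the same object can be drawn more than once
with_repl = [random.choice(data) for _ in range(k)]

# sampling without replacement: a drawn object is removed from the pool
without_repl = random.sample(data, k)

# stratified sampling: partition the data, then sample randomly from each partition
strata = {"low": data[:10], "high": data[10:]}
stratified = {name: random.sample(part, k // 2) for name, part in strata.items()}

print(simple, with_repl, without_repl, stratified, sep="\n")
```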
What is the difference between a training and test set?
Training set is used to build a model.
Test set is used to test a model.
Why is sampling important with respect to a training and test set?
We want to reduce the amount of bias.
What is accuracy?
Used to compare performance of models
# of correct predictions / # of predictions
(TP + TN) / (TP + TN + FP + FN)
What is True Positive Rate / sensitivity?
fraction of positive examples predicted correctly by the classifier
TPR = TP / (TP + FN)
What is the True Negative Rate / specificity?
fraction of negative examples predicted correctly by the classifier
TNR = TN / (TN + FP)
What is False Positive Rate?
fraction of negative examples predicted as positive class
FPR = FP / (FP + TN)
What is False Negative Rate?
fraction of positive examples predicted as negative class
FNR = FN / (FN + TP)
What is precision?
The fraction of positive examples out of examples declared as positive
p = TP / (TP + FP)
What is recall?
the fraction of positive examples correctly predicted by the classifier
r = TP / (TP + FN)
What is F-measure?
summarizes precision and recall
2TP / ( 2TP + FP + FN)
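All of the above metrics computed from one confusion matrix, as a quick sanity-check sketch (the counts are made up):

```python
def confusion_metrics(tp, tn, fp, fn):
    # every flashcard metric follows directly from the four confusion-matrix counts
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "TPR (sensitivity / recall)": tp / (tp + fn),
        "TNR (specificity)": tn / (tn + fp),
        "FPR": fp / (fp + tn),
        "FNR": fn / (fn + tp),
        "precision": tp / (tp + fp),
        "F-measure": 2 * tp / (2 * tp + fp + fn),
    }

print(confusion_metrics(tp=40, tn=45, fp=5, fn=10))
```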
What is the Apriori principle?
if an itemset is frequent, then all of its subsets are frequent. Conversely, if an itemset is infrequent, then all of its supersets must be infrequent too.
Recite the Apriori algorithm?
Let k = 1
generate frequent itemsets of length 1
Repeat:
- generate (k+1) candidate itemsets from k frequent itemsets
- prune candidate itemsets containing subsets of length k that are infrequent
- count support for each candidate by scanning the DB
- eliminate candidates that are infrequent
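A compact Python sketch of the loop above (not an efficient implementation: no hash tree, just a linear scan of the transactions; the toy database and minsup are assumptions):

```python
from itertools import combinations

def apriori(transactions, minsup):
    # frequent itemsets of length 1
    items = {i for t in transactions for i in t}
    freq = [frozenset([i]) for i in items
            if sum(i in t for t in transactions) >= minsup]
    all_frequent = list(freq)
    k = 1
    while freq:
        # generate (k+1)-candidates by joining frequent k-itemsets
        candidates = {a | b for a in freq for b in freq if len(a | b) == k + 1}
        # prune candidates containing an infrequent length-k subset
        candidates = [c for c in candidates
                      if all(frozenset(s) in freq for s in combinations(c, k))]
        # count support by scanning the DB, eliminate infrequent candidates
        freq = [c for c in candidates
                if sum(c <= set(t) for t in transactions) >= minsup]
        all_frequent.extend(freq)
        k += 1
    return all_frequent

db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
print(apriori(db, minsup=3))
```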
What is a frequent itemset?
An itemset that meets the minsup threshold.
What is a maximal frequent itemset?
An itemset is maximal frequent if none of its immediate supersets is frequent
What is a closed itemset?
An itemset is closed if none of its immediate supersets has the same support as the itemset.
It’s a compressed representation of support
How is a ROC curve depicted?
A graphical chart with TP rate on y axis and FP rate on x axis.
Each point corresponds to a model induced by the classifier.
What 2 things is a ROC curve useful for?
- Picking the best model after repeatedly adjusting a specific parameter (t)
- Comparing the relative performance among classifiers
In the Apriori frequent itemset generation algorithm, what is a hash tree used for?
Determining if each enumerated k-itemset corresponds to an existing candidate itemset.
Describe an FP tree used by the FP growth algorithm for frequent itemset generation
Reads data set one transaction at a time
Maps each transaction onto a path in the FP-tree
Paths may overlap as transactions are similar
The more paths overlap, the more compression
Sometimes makes tree small enough to fit into main memory
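A bare-bones sketch of the tree structure (insertion only; no header table or mining step, and the transactions are assumed to be already sorted by item support):

```python
class FPNode:
    # one node of an FP-tree: an item, a count, and child nodes
    def __init__(self, item=None):
        self.item, self.count, self.children = item, 0, {}

    def insert(self, transaction):
        # map one (already sorted) transaction onto a path in the tree;
        # shared prefixes reuse existing nodes, which gives the compression
        if not transaction:
            return
        first, rest = transaction[0], transaction[1:]
        child = self.children.setdefault(first, FPNode(first))
        child.count += 1
        child.insert(rest)

root = FPNode()
for t in [["a", "b"], ["a", "b", "c"], ["a", "c"]]:   # items pre-sorted by support
    root.insert(t)
print(root.children["a"].count)   # 3: all three paths share the "a" prefix
```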
What are the two data structures used in the Apriori Algorithm and the FP Growth Algorithm?
FP tree (FP Growth)
Hash tree (Apriori)
How does the FP growth algorithm mine frequent itemsets?
It uses a recursive divide-and-conquer approach
It uses pointers to assist frequent itemset generation.
It requires preprocessing.
What is the Mahalanobis distance?
- measures distance using the inverse covariance matrix of the attributes
- an alternative to normalization (scaling is built in)
- takes into account the spread of the data in each direction
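A sketch of the distance from a point to the mean of a data set, using NumPy and the inverse covariance matrix (the toy data is made up):

```python
import numpy as np

def mahalanobis(x, data):
    # distance from x to the data mean, scaled by the inverse covariance matrix,
    # so attributes with large spread (or strong correlation) count for less
    mu = data.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(data, rowvar=False))
    diff = x - mu
    return float(np.sqrt(diff @ cov_inv @ diff))

data = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0],
                 [4.0, 3.0], [5.0, 5.0], [2.0, 4.0]])
print(mahalanobis(np.array([3.0, 3.0]), data))
```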
What is the Minkowski Distance?
- a generalization of the Euclidean distance
- it’s the same as the Euclidean distance, but with a parameter r in place of the exponent 2
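A one-function sketch; r = 1 and r = 2 are shown as the familiar special cases:

```python
def minkowski(x, y, r):
    # r = 1 gives the Manhattan distance, r = 2 gives the Euclidean distance
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1 / r)

print(minkowski([0, 0], [3, 4], r=2))   # 5.0, the Euclidean case
print(minkowski([0, 0], [3, 4], r=1))   # 7.0, the Manhattan case
```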
What measures can be used to deal with attributes that have different ranges?
- Normalization techniques
- scaling
- min-max
- decimal scaling
- Can choose an algorithm that is not affected by different ranges (decision trees)
- Mahalanobis distance
- PCA
- captures largest variation and reduces dimensions
What are the 3 types of outliers?
- Point: a single point that is anomalous on its own
- Contextual: depends on the context that you are looking at
- Collective: a group of points (e.g., a sequence) that is anomalous together
What are the 3 techniques for outlier detection?
- Statistical (grubbs test & linear regression)
- Density based (DBscan)
- Proximity based (KNN)
What are 2 methods to evaluate a model?
Out of bag estimation
Cross validation
What is bagging with respect to data mining and ensemble methods?
Bagging (bootstrap aggregating) is what out-of-bag estimation is built on.
Sample repeatedly with replacement from the original data to create new training sets.
Reduces variance and helps to avoid overfitting.
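A sketch of only the resampling step of bagging (each bootstrap sample would then train one base model; model training and voting are left out):

```python
import random

def bootstrap_samples(data, n_models):
    # each base model gets a sample drawn with replacement, the same size as the data
    return [[random.choice(data) for _ in data] for _ in range(n_models)]

data = list(range(10))
for sample in bootstrap_samples(data, n_models=3):
    print(sorted(sample))   # repeated values show the "with replacement" part
```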
What is boosting with respect to data mining and ensemble methods?
- Put more emphasis on specific examples that are difficult to classify: assign them a higher weight, and thus a greater probability of being selected.
- Records that are wrongly classified will have their weights increased.
- Records that are classified correctly will have their weights decreased.
Describe the AdaBoost method used in ensembles.
- AdaBoost creates many classifiers / models by repeatedly drawing bootstrap samples.
- Samples that are easy to classify get a lower weight, and ones that are harder to classify get a higher weight.
- If any intermediate round produces an error rate higher than 50%, the weights are reverted and the resampling procedure is repeated.
- The classifier also gets a weight.
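A sketch of one round of the weight update described above, in one common AdaBoost form (assumes the error rate is strictly positive; the base classifier itself is not shown):

```python
import math

def adaboost_update(weights, misclassified):
    # misclassified[i] is True when record i was wrongly classified this round
    eps = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    if eps > 0.5:
        # error rate above 50%: revert to uniform weights and resample
        n = len(weights)
        return [1.0 / n] * n, None
    alpha = 0.5 * math.log((1 - eps) / eps)   # the weight given to this classifier
    new = [w * math.exp(alpha if wrong else -alpha)
           for w, wrong in zip(weights, misclassified)]
    total = sum(new)                          # normalize so the weights sum to 1
    return [w / total for w in new], alpha

print(adaboost_update([0.25, 0.25, 0.25, 0.25], [True, False, False, False]))
```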
What conditions must be satisfied for ensemble methods to improve the overall classification accuracy?
1) the different classifiers make different mistakes in the data
2) the different classifiers perform better than random guessing
What are the 3 components of the curse of dimensionality?
- Runtime: if the algorithm does not scale at most linearly with the # of attributes, the runtime increases too quickly
- Amount of Data: the number of samples needed to cover the space with equal density grows exponentially with the number of dimensions
- Distances: distances between the data become meaningless. The maximum distance between data points does not grow linearly with the # of dimensions.
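A small NumPy demo of the distances point (uniform random data is an assumption, not a class example): as the number of dimensions grows, the ratio of the farthest to the nearest neighbour shrinks toward 1, so distances stop discriminating.

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    points = rng.random((500, d))                            # 500 uniform random points
    dists = np.linalg.norm(points[1:] - points[0], axis=1)   # distances to the first point
    print(d, round(dists.max() / dists.min(), 2))            # ratio approaches 1 as d grows
```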
What is the formula for the bias / variance tradeoff?
Mean Squared Error = bias² + variance
How is a ROC curve constructed?
Sort the threshold values (classifier scores) from lowest to highest for all of the samples.
Plot the TPR vs the FPR at each threshold.
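A sketch of the construction from raw classifier scores (the scores and labels are made up; positives are labelled 1):

```python
def roc_points(scores, labels):
    # each distinct score is used as a threshold; predict positive when score >= threshold
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))   # (FPR on x, TPR on y)
    return [(0.0, 0.0)] + points

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.54]
labels = [1, 1, 0, 1, 0, 0]
print(roc_points(scores, labels))
```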
What are the 5 common clustering algorithms?
K-means
Hierarchical
DBscan
Expectation Maximization
Self Organizing Maps
What is expectation maximization (EM)?
Represents the data as distributions (usually normal distributions)
Randomly initialize the parameters (mean and standard deviation)
- Expectation: calculate the likelihood of each point falling into each distribution based on the parameters & assign the data to the distributions
- Maximization: find the parameters that best represent the assignments
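A compact NumPy sketch of EM for a two-component 1D Gaussian mixture (the synthetic data and the fixed iteration count are assumptions; no convergence check):

```python
import numpy as np

def em_1d(x, n_iter=50):
    # two-component 1D Gaussian mixture; parameters are initialized randomly
    rng = np.random.default_rng(0)
    mu = rng.choice(x, 2)                 # means
    sigma = np.array([x.std(), x.std()])  # standard deviations
    pi = np.array([0.5, 0.5])             # mixing weights
    for _ in range(n_iter):
        # E-step: likelihood of each point under each distribution -> soft assignment
        dens = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        resp = dens * pi
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters that best represent the assignments
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
        pi = nk / len(x)
    return mu, sigma, pi

x = np.concatenate([np.random.default_rng(1).normal(0, 1, 200),
                    np.random.default_rng(2).normal(5, 1, 200)])
print(em_1d(x))
```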
Describe the characteristics of the K-means algorithm
- deterministic? no (different results each time)
- susceptible to falling into a local minimum? yes
- need to specify # of clusters? yes
- is noise removed? no
- partition based? yes
- can handle varying densities? yes
Describe the characteristics of hierarchical clustering
deterministic? yes
susceptible to falling into a local minimum? no
need to specify # of clusters? no
is noise removed? yes
partition based? no
can handle varying densities? yes
Describe the characteristics of the DBscan clustering algorithm
deterministic? yes
susceptible to falling into a local minimum? no
need to specify # of clusters? no
is noise removed? yes
partition based? yes
can handle varying densities? no
Describe the characteristics of expectation maximization
deterministic? no (random component at start)
susceptible to falling into a local minimum? yes
need to specify # of clusters? no (run multiple times)
is noise removed? no
partition based? yes
can handle varying densities? not sure…
Describe the characteristics of self organizing maps
deterministic? no (the map is randomly initialized)
susceptible to falling into a local minimum? yes
need to specify # of clusters?
is noise removed? no
partition based? not sure…
can handle varying densities? not sure…
What are self organizing maps?
An effective clustering algorithm
An abstraction of the input data
Describe how self organizing maps work
Randomly initialize the map
repeat:
compare a sample to every prototype in the map
select the prototype that is most similar (the winner)
update the winner and its neighbourhood to be more similar to the sample
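A minimal sketch of a 1-D map with a fixed learning rate and neighbourhood radius (real SOMs usually decay both over time; the data here is random and just for illustration):

```python
import numpy as np

def train_som(data, n_units=5, n_iter=200, lr=0.3, radius=1):
    # prototypes are randomly initialized; data samples are drawn one at a time
    rng = np.random.default_rng(0)
    protos = rng.random((n_units, data.shape[1]))
    for _ in range(n_iter):
        x = data[rng.integers(len(data))]
        # winner: the prototype most similar to the sample
        winner = np.argmin(np.linalg.norm(protos - x, axis=1))
        # move the winner and its map neighbours toward the sample
        for j in range(n_units):
            if abs(j - winner) <= radius:
                protos[j] += lr * (x - protos[j])
    return protos

data = np.random.default_rng(1).random((100, 2))
print(train_som(data))
```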
What is Principal Component Analysis?
It’s a preprocessing technique that maps the data to a lower dimensional space
It captures the directions of largest variation
reduces noise and dimensions
choose the number of components at the “knee in the curve” of explained variance
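A sketch of PCA with NumPy via the eigenvectors of the covariance matrix (the random data is just for illustration); the explained-variance values are where you would look for the knee:

```python
import numpy as np

def pca(data, n_components=2):
    # center the data, then project onto the directions of largest variance
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]           # largest variance first
    components = eigvecs[:, order[:n_components]]
    explained = eigvals[order] / eigvals.sum()  # look for the "knee" in this curve
    return centered @ components, explained

data = np.random.default_rng(0).random((50, 5))
projected, explained = pca(data, n_components=2)
print(projected.shape, explained)
```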