[1] Data Mining Flashcards
What is KDD?
Knowledge Discovery in Databases; the process of extracting knowledge from data
What are the steps for KDD?
[1] Selection - choose which data to use
[2] Pre-processing - handle missing data and errors
[3] Transformation - change the data format; consider feature reduction or extraction
[4] Data mining - apply algorithms to the transformed data to generate results
[5] Interpretation and evaluation - e.g. visualization of the results
Why is data mining important?
The exponential growth of data exceeds humans' ability to understand it unaided
What factors influence the selection of DM algorithms?
- The kind of data, e.g. numeric or symbolic
- Whether the task is supervised or unsupervised
- The desired output, e.g. a black box or an interpretable tree
What are the ways to approach KDD?
- Top-down discovery - analyst suggests hypotheses; results are analysed to see if they support this
- Bottom-up discovery - system automatically suggests patterns, analyst considers if they are relevant
- Mixed - analyst provides an area of search, and the system refines this based on the hypotheses generated
How are database queries different from DM?
Database queries return a subset of the data or an aggregate of it.
DM outputs a KDD object, e.g. a rule or tree, which didn't exist before
What is a data warehouse?
A collection of data on a single subject where the number of datasets can change
How is KDD applied to websites?
Web content mining - collect and analyse the content of pages
Web structure mining - use graph theory to study how pages are connected with hyperlinks
What is IR?
Information Retrieval i.e. search
At a high level, what do classification algorithms do?
Find separating curves. Note that SVMs provide the equations for these curves, while neural networks only give the resulting regions
What defines AI systems?
They are machines that mimic cognitive functions such as learning and problem solving
What are fuzzy sets?
In contrast to crisp sets, they allow partial membership based on a membership function
How do set operators work for fuzzy sets?
AND(x,y) -> m(x ∩ y) = min(m(x), m(y))
OR(x,y) -> m(x ∪ y) = max(m(x), m(y))
NOT(x) -> m(¬x) = 1 - m(x)
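The operators above translate directly into code; a minimal sketch in Python (function names are my own):

```python
# Standard (Zadeh) fuzzy set operators; membership degrees lie in [0, 1].

def fuzzy_and(mx, my):
    # m(x AND y) = min(m(x), m(y))
    return min(mx, my)

def fuzzy_or(mx, my):
    # m(x OR y) = max(m(x), m(y))
    return max(mx, my)

def fuzzy_not(mx):
    # m(NOT x) = 1 - m(x)
    return 1 - mx

print(fuzzy_and(0.3, 0.8))  # 0.3
print(fuzzy_or(0.3, 0.8))   # 0.8
```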
What is the process for fuzzy logic?
~ Fuzzify the input values into membership functions
~ Use rules to compute fuzzy outputs
~ De-fuzzify the output functions to get crisp outputs
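The three steps can be sketched end-to-end for a toy fan-speed controller; the membership functions, rule set, and numbers here are invented for illustration:

```python
# A minimal fuzzify -> rules -> de-fuzzify loop for an invented fan
# controller: two fuzzy sets ("cold", "hot"), two rules, and a
# weighted-average de-fuzzification.

def tri(x, a, b, c):
    """Triangular membership function rising from a, peaking at b, falling to c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def fan_speed(temp):
    # 1. Fuzzify: degree to which temp is "cold" or "hot".
    cold = tri(temp, 0, 10, 25)
    hot = tri(temp, 15, 30, 40)
    # 2. Rules: IF cold THEN slow (speed 20); IF hot THEN fast (speed 90).
    #    Each rule fires with the strength of its antecedent.
    # 3. De-fuzzify: weighted average of the rule outputs -> crisp speed.
    total = cold + hot
    return (cold * 20 + hot * 90) / total if total else 0.0

print(fan_speed(10))  # fully "cold" -> 20.0
print(fan_speed(30))  # fully "hot"  -> 90.0
```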
What are the main DM tasks?
Classification - map data into #pre-defined# classes
Regression - map data to real-valued output
Prediction - predict a future state #in time# based on current state
Time Series Analysis - special case of classification, regression and prediction when the attributes vary over time
[C.R.P.T]
Clustering - the groups are #not# pre-defined, so it is unsupervised
Segmentation - a special case of clustering where the DB is partitioned into #disjoint# sets
Association Rules - find relationships between instances i.e. items that frequently occur together
Sequence Discovery - find relationships based on time i.e. if X then Y will happen
[C.S.A.S]
What are the assumptions used for Bayes classifiers?
- Only one hypothesis (class) can occur at a time
- Conditional independence - P(X1, X2 | Y) = P(X1|Y) * P(X2|Y)
What is a key practical consideration for Bayes classifiers?
When the conditional (and prior) probabilities are estimated from counts, start the counts at 1 (Laplace smoothing) to avoid multiplying by 0
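A minimal naive Bayes sketch over binary features, showing the smoothing from this card: every conditional count gets +1 so no probability is ever zero. The toy data and feature names are invented:

```python
# Count-based naive Bayes with Laplace smoothing. Invented toy data:
# features (windy, raining) -> class "yes"/"no" for playing outside.

from collections import defaultdict

def train(rows, labels):
    """rows: list of feature tuples; labels: class per row."""
    class_counts = defaultdict(int)
    cond_counts = defaultdict(int)  # keyed by (class, feature_index, value)
    for row, y in zip(rows, labels):
        class_counts[y] += 1
        for i, v in enumerate(row):
            cond_counts[(y, i, v)] += 1
    return class_counts, cond_counts

def predict(row, class_counts, cond_counts, n_values=2):
    total = sum(class_counts.values())
    best, best_p = None, -1.0
    for y, cy in class_counts.items():
        p = cy / total  # prior P(h_j)
        for i, v in enumerate(row):
            # Laplace smoothing: +1 on the count, +n_values on the total,
            # so P(x_i | h_j) is never zero.
            p *= (cond_counts[(y, i, v)] + 1) / (cy + n_values)
        if p > best_p:
            best, best_p = y, p
    return best

rows = [(0, 0), (0, 1), (1, 1), (0, 0)]
labels = ["yes", "yes", "no", "yes"]
model = train(rows, labels)
print(predict((1, 1), *model))  # "no"
```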
What are the names of the probabilities in Bayes classifiers?
Posterior - p(hj|xi) - [this is the target]
Prior - p(hj)
Unconditional probability - p(x)
Conditional probability - p(xi|hj)
What do SVMs do?
Find a hyperplane which correctly classifies the data points and #separates them as far as possible#
In the context of SVMs, what are margins and what types are there?
The margin measures how well the hyperplane separates the data: it is the distance to the nearest points.
A hard margin allows no points within it; a soft margin allows but penalizes points within it
What is a key limitation of SVMs, and how can it be addressed?
They can only classify data that is linearly separable.
However, kernels transform the data into a higher-dimensional space where it becomes separable (the hyperplane itself is always linear)
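The kernel idea can be illustrated without an SVM at all: 1-D points labelled by distance from zero are not separable by a single threshold on x, but the lifted feature x² makes them separable. The data and the helper below are invented; a real SVM would learn the hyperplane itself:

```python
# Invented 1-D data: class 1 = "far from zero". Not threshold-separable
# on x, but separable after the feature map x -> x**2.

points = [-3, -2, -0.5, 0.5, 2, 3]
labels = [1, 1, 0, 0, 1, 1]

def separable_by_threshold(xs, ys):
    """True if some threshold splits the points into pure-label halves."""
    for t in sorted(xs):
        left = {y for x, y in zip(xs, ys) if x <= t}
        right = {y for x, y in zip(xs, ys) if x > t}
        if len(left) <= 1 and len(right) <= 1:
            return True
    return False

print(separable_by_threshold(points, labels))  # False: not separable on x
lifted = [x * x for x in points]               # the "kernel trick" lift
print(separable_by_threshold(lifted, labels))  # True: separable on x**2
```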
What are the limitations of SVM?
- They require selecting a good kernel
- They use a lot of memory and CPU
- Special consideration is needed for multiple classes i.e. non-binary problems
How do SVMs handle multiple classes?
Transform the problem into a set of binary problems, and either:
- one-vs-rest - train a classifier for each class to determine whether an instance is or isn't that class
- one-vs-one - train a classifier for each pair of classes, and combine their outputs with a voting protocol
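The two decompositions can be made concrete by building the binary sub-problems themselves; the class names and data below are invented, and any binary classifier could be plugged into each sub-problem:

```python
# How a 3-class problem decomposes for an SVM (or any binary learner).

from itertools import combinations

classes = ["a", "b", "c"]
labels = ["a", "b", "c", "a", "b"]

# one-vs-rest: one binary problem per class (is / isn't that class).
ovr = {c: [y == c for y in labels] for c in classes}
print(ovr["a"])  # [True, False, False, True, False]

# one-vs-one: one binary problem per pair of classes, using only the
# rows belonging to either class; prediction combines them by voting.
ovo = {(c1, c2): [y for y in labels if y in (c1, c2)]
       for c1, c2 in combinations(classes, 2)}
print(ovo[("a", "b")])  # ['a', 'b', 'a', 'b']

def vote(pair_predictions):
    """Majority vote over the one-vs-one predictions for one instance."""
    return max(set(pair_predictions), key=pair_predictions.count)

print(vote(["a", "b", "a"]))  # 'a'
```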
In the context of DM, what is segmentation?
A special case of clustering where the resulting sets are disjoint, i.e. each tuple belongs to exactly one group
What is the capacity of a function?
A measure of its expressive power and flexibility
What does shattering mean?
A function f(theta) shatters a set of points X if, for every assignment of classes to those points, there exists a theta such that all the points are correctly classified
What is the VC Dimension?
A measure of #capacity# for a function based on the largest number of points it can shatter.
Note that it might only work for a particular set of locations for the points
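Shattering can be checked by brute force for a simple hypothesis class. The sketch below uses 1-D threshold classifiers (f(x) = [x > t] and their negations), a class whose VC dimension is 2; the point sets are invented:

```python
# Brute-force shattering check: enumerate every labelling and search
# for a threshold (in either orientation) that realises it.

from itertools import product

def shatters(points):
    """True iff some threshold classifier realises every labelling."""
    xs = sorted(points)
    # candidate thresholds: below, between, and above the points
    ts = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1]
    for labelling in product([0, 1], repeat=len(points)):
        realised = any(
            all((x > t if up else x < t) == bool(y)
                for x, y in zip(points, labelling))
            for t in ts for up in (True, False))
        if not realised:
            return False
    return True

print(shatters([1, 2]))     # True: 2 points can be shattered
print(shatters([1, 2, 3]))  # False (e.g. labelling 1,0,1 is impossible),
                            # so this class has VC dimension 2
```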
What is Structural Risk Minimization?
The process of finding models with the lowest risk
Risk is the chance of prediction error on unseen data points drawn i.i.d. from some unknown distribution
How is risk found?
The expected (true) risk can't be determined directly.
However, it is bounded by the empirical risk (the error on the points already classified) plus the VC confidence (a term based on the number of points and the model's VC dimension)
How can the risk of a model be reduced?
- Train a better model to reduce the empirical risk
- Minimize the VC dimension, i.e. use a simpler model
- Use more data points, as this lowers the VC confidence term
Why is Structural Risk Minimization often infeasible?
The VC dimension can only be calculated for a limited number of models
What are some key ways to measure the error of an algorithm?
- TSS (total sum of squared errors)
- MSE (mean squared error)
- RMSE (root mean squared error)
- R-squared - the proportion of the response's variance explained by the model
What is the advantage of R-squared, and how is it calculated?
It is easily interpreted: 1 is a perfect fit, 0 is no better than predicting the mean.
R² = 1 - TSS / SS_tot, where SS_tot = Σ(yᵢ - ȳ)² is the total variation of the response
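The error measures from the cards above, computed for an invented set of predictions. Note the naming follows the cards: "TSS" here means the total sum of squared *errors*:

```python
# TSS, MSE, RMSE, and R-squared for a regression, per the cards'
# conventions. The example values are invented.

def metrics(y_true, y_pred):
    n = len(y_true)
    tss = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # sum of squared errors
    mse = tss / n                                            # mean squared error
    rmse = mse ** 0.5                                        # root mean squared error
    mean_y = sum(y_true) / n
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)          # variation about the mean
    r2 = 1 - tss / ss_tot
    return tss, mse, rmse, r2

tss, mse, rmse, r2 = metrics([1, 2, 3, 4], [1, 2, 3, 4])
print(r2)  # 1.0 for a perfect fit
```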
What is accuracy? What is its opposite?
Accuracy is the proportion of instances correctly classified.
Error rate is the proportion of instances incorrectly classified
What is TPR?
The proportion of all actually positive instances which are classified as positive.
It is also known as sensitivity or recall
TPR = TP / (TP + FN)
What is sensitivity?
The TPR
What is recall?
The TPR
What is TNR?
The proportion of actually negative instances which are classified as negative
It is also known as specificity
What is specificity?
TNR
How are FPR and FNR related to TPR and TNR?
They are complements: FPR + TNR = 1 and FNR + TPR = 1
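All the rates from these cards can be computed from the four confusion-matrix counts; the counts below are invented:

```python
# Confusion-matrix rates: TPR (sensitivity/recall), TNR (specificity),
# their complements FPR and FNR, and accuracy.

def rates(tp, fp, tn, fn):
    tpr = tp / (tp + fn)  # sensitivity / recall
    tnr = tn / (tn + fp)  # specificity
    fpr = fp / (fp + tn)  # = 1 - TNR
    fnr = fn / (fn + tp)  # = 1 - TPR
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return tpr, tnr, fpr, fnr, accuracy

tpr, tnr, fpr, fnr, acc = rates(tp=8, fp=2, tn=5, fn=5)
print(acc)  # 0.65
```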
What is an ROC curve?
It plots the TPR (y-axis) against the FPR (x-axis) as the classifier's decision threshold varies