[1] Data Mining Flashcards
What is KDD?
Knowledge Discovery in Databases; the process of extracting knowledge from data
What are the steps for KDD?
[1] Selection - choose which data to use
[2] Pre-processing - handle missing data and errors
[3] Transformation - change the data format; consider feature reduction or extraction
[4] Data mining - apply algorithms to the transformed data to generate results
[5] Interpretation and evaluation - e.g. visualization
Why is data mining important?
The exponential growth of data exceeds humans' ability to understand it unaided
What factors influence the selection of DM algorithms?
- The kind of data e.g. numeric or symbolic
- Whether the task is supervised or unsupervised
- What the desired output is e.g. a black box or a tree
What are the ways to approach KDD?
- Top-down discovery - analyst suggests hypotheses; results are analysed to see if they support this
- Bottom-up discovery - system automatically suggests patterns, analyst considers if they are relevant
- Mixed - the analyst provides an area of search and refines it based on the hypotheses generated
How are database queries different from DM?
Database queries return a subset of the data or an aggregate of it
DM outputs a KDD object e.g. a rule or tree which didn't exist before
What is a data warehouse?
A collection of data organised around a single subject, where the number of datasets it holds can change
How is KDD applied to websites?
Web content mining - collect and analyse the content of pages
Web structure mining - use graph theory to study how pages are connected with hyperlinks
What is IR?
Information Retrieval i.e. search
At a high-level, what do classification algorithms do?
Find separating curves. Note that SVM provides the equations for these, while neural networks only give the regions
What defines AI systems?
They are machines that mimic cognitive functions such as learning and problem solving
What are fuzzy sets?
In contrast to crisp sets, they allow partial membership based on a membership function
How do set operators work for fuzzy sets?
AND(x,y) -> m(x ∩ y) = min(m(x), m(y))
OR(x,y) -> m(x ∪ y) = max(m(x), m(y))
NOT(x) -> m(~x) = 1 - m(x)
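A minimal Python sketch of these operators, assuming the membership values are already in [0, 1]; the example values are arbitrary:

```python
# Fuzzy set operators on membership values in [0, 1].

def fuzzy_and(mx, my):
    # m(x AND y) = min(m(x), m(y))
    return min(mx, my)

def fuzzy_or(mx, my):
    # m(x OR y) = max(m(x), m(y))
    return max(mx, my)

def fuzzy_not(mx):
    # m(NOT x) = 1 - m(x)
    return 1.0 - mx

# Example: memberships 0.7 and 0.4 -> AND 0.4, OR 0.7, NOT 0.3
print(fuzzy_and(0.7, 0.4), fuzzy_or(0.7, 0.4), fuzzy_not(0.7))
```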
What is the process for fuzzy logic?
~ Fuzzify the input values into membership functions
~ Use rules to compute fuzzy outputs
~ De-fuzzify the output functions to get crisp outputs
What are the main DM tasks?
Classification - map data into #pre-defined# classes
Regression - map data to real-valued output
Prediction - predict a future state #in time# based on current state
Time Series Analysis - special case of classification, regression and prediction when the attributes vary over time
[C.R.P.T]
Clustering - the groups are not pre-defined, so it is unsupervised
Segmentation - special cases of clustering where the DB is partitioned into #disjoint# sets
Association Rules - find relationships between instances i.e. items that frequently occur together
Sequence Discovery - find relationships based on time i.e. if X then Y will happen
[C.S.A.S]
What are the assumptions used for Bayes classifiers?
- Only one hypothesis (class) can occur at a time
- Conditional independence - P(X1, X2 | Y) = P(X1|Y) * P(X2|Y)
What is a key practical consideration for Bayes classifiers?
When the conditional (and prior) probabilities are estimated from counts, start the counts at 1 (Laplace smoothing) so that unseen values do not force a probability of 0
What are the names of the probabilities in Bayes classifiers?
Posterior - p(hj|xi) - [this is the target]
Prior - p(hj)
Unconditional probability - p(x)
Conditional probability - p(xi|hj)
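A toy naive Bayes sketch tying these pieces together: the posterior score for each hypothesis is the prior times the (smoothed) conditional probabilities. The weather-style data and feature values are entirely made up, and counts start at 1 as in the practical note above:

```python
from collections import Counter, defaultdict

# Toy training data: (feature tuple, class label). Entirely illustrative.
data = [(("sunny", "hot"), "no"), (("sunny", "mild"), "no"),
        (("rainy", "mild"), "yes"), (("rainy", "cool"), "yes"),
        (("sunny", "cool"), "yes")]

priors = Counter(label for _, label in data)              # counts for the prior p(hj)
cond = defaultdict(Counter)                               # counts for the conditionals p(xi|hj)
vocab = defaultdict(set)                                  # possible values of each feature
for features, label in data:
    for i, value in enumerate(features):
        cond[(label, i)][value] += 1
        vocab[i].add(value)

def posterior_score(features, label):
    # Unnormalised posterior: p(hj) * prod_i p(xi|hj), with add-one (Laplace) smoothing.
    score = priors[label] / len(data)
    for i, value in enumerate(features):
        counts = cond[(label, i)]
        score *= (counts[value] + 1) / (priors[label] + len(vocab[i]))
    return score

x = ("sunny", "cool")
print(max(priors, key=lambda h: posterior_score(x, h)))   # hypothesis with the largest posterior
```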
What do SVMs do?
Find a hyperplane which correctly classifies the data points and #separates them as far as possible#
In the context of SVMs, what are margins and what types are there?
This is a measure of how well they separate the data
A hard margin is the distance to the nearest point; a soft margin penalizes close points
What is a key limitation of SVMs, and how can it be addressed?
They can only classify data if it is linearly separable.
However, kernels transform the data into higher dimensions so that it becomes separable (the hyperplane itself is always linear)
What are the limitations of SVM?
- They require selecting a good kernel
- They use a lot of memory and CPU
- Special consideration is needed for multiple classes i.e. non-binary problems
How do SVMs handle multiple classes?
Transform the problem into a set of binary problems, and either:
- one-vs-rest - train a classifier for each class to determine if it is or isn’t that class
- one-vs-one - train a classifier for each pair of classes, and use a voting protocol
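A hedged sketch of both decompositions, assuming scikit-learn is available; the iris dataset is just a convenient 3-class stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # 3-class problem

# one-vs-rest: one binary SVM per class (class vs everything else)
ovr = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)

# one-vs-one: one binary SVM per pair of classes, combined by voting
ovo = OneVsOneClassifier(SVC(kernel="rbf")).fit(X, y)

print(ovr.predict(X[:3]), ovo.predict(X[:3]))
```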
In the context of DM, what is segmentation?
A special case of clustering where the tuples are disjoint i.e. aren't neighbors
What is the capacity of a function?
A measure of its expressive power and flexibility etc.
What does shattering mean?
A function f(theta) shatters a set of points X if, for every assignment of classes to those points, there exists a theta such that all the points are correctly classified
What is the VC Dimension?
A measure of #capacity# for a function based on the largest number of points it can shatter.
Note that it might only work for a particular set of locations for the points
What is Structural Risk Minimization?
The process of finding models with the lowest risk
Risk is the chance of prediction error on unseen data points which are i.i.d. from some unknown distribution
How is risk found?
The expected risk is the true risk, but it can’t be determined
However, it is bounded by the empirical risk (based on points already classified) and the VC dimension confidence (based on the number of points and the model’s VC)
How can the risk of a model be reduced?
- Train a better model to reduce the empirical risk
- Minimize the VC dimension i.e. use a simpler model
- Use more data points, as this lowers the VC dimension confidence
Why is Structural Risk Minimization often infeasible?
The VC dimension can only be calculated for a limited number of models
What are some key ways to measure the error of an algorithm?
- TSS (total sum of squared error)
- MSE
- RMSE
- R-squared - one minus the ratio of the error to the variance of the response
What is the advantage of R-squared, and how is it calculated?
It is easily interpreted.
R2 = 1 - TSS/Var(y)
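A small numpy sketch of these metrics on made-up values; the R-squared line takes the ratio with both terms averaged per observation, which matches the usual R² = 1 − SS_res/SS_tot reading of the formula above:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])      # made-up response values
y_pred = np.array([2.8, 5.1, 6.5, 9.6])      # made-up model predictions

residuals = y_true - y_pred
tss = np.sum(residuals ** 2)                 # total sum of squared error, as defined above
mse = np.mean(residuals ** 2)
rmse = np.sqrt(mse)
r2 = 1 - mse / np.var(y_true)                # 1 - (error / variance of the response)

print(tss, mse, rmse, r2)
```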
What is accuracy? What is its opposite?
Accuracy is the proportion of instances correctly classified.
Error rate is the proportion of instances incorrectly classified
What is TPR?
The proportion of actually positive instances which are classified as positive.
It is also known as sensitivity or recall
TPR = TP / (TP + FN)
What is sensitivity?
The TPR
What is recall?
The TPR
What is TNR?
The proportion of truly negative instances which are classified as negative
It is also known as specificity
What is specificity?
TNR
How are FPR and FNR related to TPR and TNR?
Each complementary pair sums to 1, i.e. FPR + TNR = 1 and FNR + TPR = 1
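A minimal sketch computing these rates from arbitrary confusion-matrix counts:

```python
# Arbitrary confusion-matrix counts for illustration.
TP, FN, FP, TN = 40, 10, 5, 45

tpr = TP / (TP + FN)     # sensitivity / recall
tnr = TN / (TN + FP)     # specificity
fpr = FP / (FP + TN)     # = 1 - TNR
fnr = FN / (FN + TP)     # = 1 - TPR
accuracy = (TP + TN) / (TP + TN + FP + FN)
error_rate = 1 - accuracy

print(tpr, tnr, fpr, fnr, accuracy, error_rate)
```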
What is an ROC curve?
It plots how the FPR on the x-axis affects the TPR on the y-axis
What is the worst possible result on an ROC curve?
A diagonal line (y=x), if it were below this, predictions could be flipped
What is AUC and what are its limits?
AUC is the Area Under the ROC Curve. It is bounded between 0.5 and 1, with higher values being better
However:
- it is difficult to calculate when the points are not well spread across the ROC space
- it depends a lot on the non-interesting part of the ROC curve
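A short sketch of computing the curve and AUC, assuming scikit-learn is available; the labels and scores are invented:

```python
from sklearn.metrics import roc_curve, auc

# Invented ground-truth labels and classifier scores.
y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.3]

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # x-axis: FPR, y-axis: TPR
print(auc(fpr, tpr))                                # higher is better; 0.5 ~ the diagonal
```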
What is the distance measure?
The distance between the centre of the ROC space and the point where the ROC curve intersects the minor diagonal
What metrics are used for object detection?
- Detection rate (DR)
- False alarm / objects (FA/O)
- Extended ROC
What is detection rate?
objects found / total objects
What is false alarm / objects?
non-objects reported / total objects
It can be greater than 100% for difficult problems
What is the extended ROC curve?
It has FA/O on the x-axis and DR on the y-axis
Unlike ROC, models can perform worse than the diagonal
What evaluation metrics are used for information retrieval?
Precision and recall
In the context of IR, what is precision?
The quality of records retrieved:
relevant documents retrieved / number of documents retrieved
In the context of IR, what is recall?
How complete the retrieval was:
number of relevant documents retrieved / total relevant documents in DB
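A tiny set-based sketch of the two ratios; the document IDs are arbitrary:

```python
retrieved = {"d1", "d2", "d3", "d5"}           # documents the system returned
relevant = {"d1", "d3", "d4", "d6", "d7"}      # relevant documents in the DB

hits = retrieved & relevant
precision = len(hits) / len(retrieved)         # 2 / 4 = 0.5
recall = len(hits) / len(relevant)             # 2 / 5 = 0.4
print(precision, recall)
```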
What are the disciplines within computer imaging?
- computer vision - the outputs are for machines
- image processing - the outputs are for people
What are two tasks to do with improving the quality of images?
- image restoration - return the image to its original appearance
- image enhancement - make the image look better than the original
What is dynamic range?
The ratio between brightest and darkest values that can be shown i.e. the contrast ratio
How many colors does 8 bit color give?
256 i.e. 0 to 255
What are images outside humans' perceptual range called?
Multi-spectral
Describe JPEG images.
They have 24 bit color (16 million colors); compress well but have artifacts; don’t support transparency or animation
Describe PNG images.
They use lossless compression. They support 24-bit color but can't be animated
Describe GIF images.
Support transparency and animation.
They use a palette of 256 colors. They are lossless if the image uses no more than this many colors
Describe TIFF images.
They support a range of color formats i.e. indexed color, 24-bit RGB, and 32-bit RGBA
They can have a range of compression methods and support layers
Describe BMP images.
They support 8-bit, 16-bit or 24-bit RGB color, but don’t scale well
What is the image format taught in this course?
PNM (Portable aNyMap). It supports:
- binary (.pbm)
- grayscale (.pgm)
- pixmap (.ppm)
What is the format of PNM images?
- A magic number
- The width of the image
- The height of the image
- The maximum grayscale value e.g. 15 or 255
- Optional: comments starting with #
- The actual image data: pixels separated by whitespace
What are the magic numbers for PNM images?
P2 for PGM in ASCII, P5 for PGM in RAWBITS
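A hedged sketch of writing and reading an ASCII (P2) PGM with the header fields listed above; comment handling is simplified and the helper names are made up:

```python
def write_pgm_ascii(path, pixels, max_val=255):
    """Write a 2D list of grey values as a P2 (ASCII) PGM file."""
    height, width = len(pixels), len(pixels[0])
    with open(path, "w") as f:
        f.write("P2\n# written by a toy example\n")
        f.write(f"{width} {height}\n{max_val}\n")
        for row in pixels:
            f.write(" ".join(str(v) for v in row) + "\n")

def read_pgm_ascii(path):
    """Read a P2 PGM file, ignoring # comments."""
    with open(path) as f:
        tokens = []
        for line in f:
            line = line.split("#", 1)[0]   # strip comments
            tokens.extend(line.split())
    assert tokens[0] == "P2", "only ASCII PGM handled here"
    width, height, max_val = map(int, tokens[1:4])
    values = list(map(int, tokens[4:4 + width * height]))
    return [values[r * width:(r + 1) * width] for r in range(height)]

write_pgm_ascii("tiny.pgm", [[0, 128], [255, 64]])
print(read_pgm_ascii("tiny.pgm"))
```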
What are the basic types of image pre-processing?
- Region of Interest Extraction
- Image Algebra
- Spatial filters
What is ROI extraction?
Use cropping, zooming, shrinking, translation or rotation to focus on part of the image
What is the opposite of zooming?
Shrinking
What are the techniques for zooming?
Zero-order hold repeats pixel values; first-order hold uses averaging or convolution
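A small numpy sketch of zero-order hold zooming by pixel replication (the 2x factor is arbitrary):

```python
import numpy as np

img = np.array([[10, 20],
                [30, 40]])

# Zero-order hold: zoom by repeating each pixel along both axes.
factor = 2
zoomed = np.repeat(np.repeat(img, factor, axis=0), factor, axis=1)
print(zoomed)
# [[10 10 20 20]
#  [10 10 20 20]
#  [30 30 40 40]
#  [30 30 40 40]]
```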
What is image algebra used for?
- Addition: superimpose image
- Subtraction: remove additive noise or detect motion
- Multiplication: adjust brightness (generally with a constant)
- Logical operators: masking
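A brief numpy sketch of two of these operations on made-up frames (subtraction for change detection, multiplication by a constant for brightness):

```python
import numpy as np

rng = np.random.default_rng(2)
frame1 = rng.integers(0, 200, size=(4, 4))
frame2 = frame1.copy()
frame2[1, 1] += 50                                     # simulate something changing between frames

diff = np.abs(frame2 - frame1)                         # subtraction: detect motion / remove additive parts
brighter = np.clip(frame1 * 1.5, 0, 255).astype(int)   # multiplication by a constant: adjust brightness
print(diff)
print(brighter)
```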
What are spatial filters?
Image operations which change each pixel based on its neighborhood. They include:
- mean filters
- median filters
- enhancement filters
- image quantization
What is special about mean filters?
They are a form of linear filter, which means that:
- If the coefficients sum to 1, the average brightness is maintained
- If the coefficients sum to 0, the average brightness is lost
- If the coefficients alternate positive and negative, edge information will be returned
- If the coefficients are all positive, the image will be blurred
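A sketch of a 3x3 mean filter whose coefficients are all positive and sum to 1 (so the average brightness is preserved and the image is blurred), assuming scipy is available:

```python
import numpy as np
from scipy.ndimage import convolve

img = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 "image"

# 3x3 mean filter: coefficients all positive and summing to 1 -> blur,
# average brightness preserved.
mean_kernel = np.ones((3, 3)) / 9.0
blurred = convolve(img, mean_kernel, mode="nearest")
print(blurred)
```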
What are enhancement filters? What are the main types?
They enhance features in an image
Laplacian filters enhance the image in all directions equally:
- -1 on edges with 8 in the centre
- -1 in NESW, 4 in the centre, 0 on the diagonals
Difference filters enhance in a particular direction. Note they have two 1s and one -1
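The two Laplacian kernels described above written out as arrays; they would be applied with the same convolution as in the mean-filter sketch:

```python
import numpy as np

# Laplacian enhancing in all directions: -1 on every edge cell, 8 in the centre.
laplacian_8 = np.array([[-1, -1, -1],
                        [-1,  8, -1],
                        [-1, -1, -1]])

# Laplacian using only the N/E/S/W neighbours: 4 in the centre, 0 on the diagonals.
laplacian_4 = np.array([[ 0, -1,  0],
                        [-1,  4, -1],
                        [ 0, -1,  0]])

print(laplacian_8.sum(), laplacian_4.sum())  # both sum to 0 -> edge information
```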
What is image quantization?
The process of reducing the amount of image data. It includes:
- Grey-level reduction
- Spatial reduction
What is grey-level reduction?
A form of image quantization in which the number of grey levels is reduced.
Thresholding makes it binary based on whether the value is above or below the threshold
What is spatial reduction?
Image quantization where groups of pixels that are #spatially adjacent# are mapped to a single pixel
This can be done with averaging, median or decimation in which some of the pixels are simply eliminated
What is edge detection?
The process of identifying sharp changes in brightness over short spatial distances
This is particularly difficult in noisy images
What are the methods for edge detection?
- Roberts operator
- Sobel operator
- Canny edge detector
What is the Roberts operator?
It only marks edge points. It does so by checking the change of intensity in the diagonal directions.
It is generally used for binary images
What is the Sobel operator?
It detects edges in the horizontal and vertical directions.
The kernels are: [-1 -2 -1; 0 0 0; 1 2 1] for horizontal (S1); the vertical kernel (S2) is its transpose
The edge magnitude is sqrt(S1^2 + S2^2). The direction is atan(S1/S2)
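A sketch of the Sobel computation with the corrected kernels, assuming scipy is available; arctan2 is used so the direction stays defined when S2 is zero:

```python
import numpy as np
from scipy.ndimage import convolve

img = np.zeros((6, 6))
img[:, 3:] = 255.0                      # toy image with a single sharp edge

S1_kernel = np.array([[-1, -2, -1],
                      [ 0,  0,  0],
                      [ 1,  2,  1]])    # S1 kernel (last row corrected to 1 2 1)
S2_kernel = S1_kernel.T                 # S2 is the transpose

S1 = convolve(img, S1_kernel, mode="nearest")
S2 = convolve(img, S2_kernel, mode="nearest")

magnitude = np.sqrt(S1 ** 2 + S2 ** 2)
direction = np.arctan2(S1, S2)          # atan(S1 / S2), quadrant-safe

print(magnitude.max())
```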
What is the Canny edge detector?
A general process for edge detection:
- Smooth the image using a Gaussian filter to minimize the impact of noise
- Estimate the gradients using partial derivatives or Sobel
- Thin edges using non-maxima suppression (suppress pixels that aren't larger than their neighbors in the positive and negative #gradient direction#)
What is image segmentation?
It divides images into segments (#connected# pixels)
How is connectivity of pixels defined?
Four-connectivity, eight-connectivity and six-connectivity (includes top-left and bottom-right)
What is grey level segmentation? What are its limitations?
Uses thresholding to segment based on brightness.
It is very vulnerable to noise.
The threshold can be chosen manually, or automatically using K-means clustering or by minimizing group variance
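A hedged sketch of choosing the threshold automatically with an iterative two-means scheme (essentially k-means with two clusters on the grey levels); the pixel values are invented:

```python
import numpy as np

pixels = np.array([12, 15, 18, 20, 200, 210, 220, 205, 22, 198], dtype=float)

# Iterative two-means thresholding: split at t, recompute t as the midpoint
# of the two group means, repeat until it stabilises.
t = pixels.mean()
while True:
    low, high = pixels[pixels <= t], pixels[pixels > t]
    new_t = (low.mean() + high.mean()) / 2.0
    if abs(new_t - t) < 1e-6:
        break
    t = new_t

binary = (pixels > t).astype(int)   # threshold to a binary image
print(t, binary)
```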
What is a general algorithm for image segmentation?
The splitting and merging algorithm
[1] Define a homogeneity test to measure how similar regions are
[2] Split the image into equally sized regions
[3] Calculate the homogeneity measure for each region. If the test is passed, attempt to merge with neighbors; Otherwise, split
[4] Repeat [3] until all regions pass the homogeneity test
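A simplified recursive sketch of the splitting phase only (so closer to the "shrinking" variant in the next card), using "intensity range below a tolerance" as one possible homogeneity test:

```python
import numpy as np

def split_segment(img, r0, c0, r1, c1, labels, next_label, tol=20):
    """Recursively split the region [r0:r1, c0:c1] until it passes a
    simple homogeneity test (max - min intensity <= tol)."""
    region = img[r0:r1, c0:c1]
    if region.max() - region.min() <= tol or (r1 - r0 <= 1 and c1 - c0 <= 1):
        labels[r0:r1, c0:c1] = next_label[0]   # homogeneous: assign a segment label
        next_label[0] += 1
        return
    rm, cm = (r0 + r1) // 2, (c0 + c1) // 2    # otherwise split into quadrants
    for rr0, rr1, cc0, cc1 in [(r0, rm, c0, cm), (r0, rm, cm, c1),
                               (rm, r1, c0, cm), (rm, r1, cm, c1)]:
        if rr1 > rr0 and cc1 > cc0:
            split_segment(img, rr0, cc0, rr1, cc1, labels, next_label, tol)

img = np.zeros((8, 8), dtype=float)
img[:, 4:] = 255.0                             # two flat halves
labels = np.zeros(img.shape, dtype=int)
split_segment(img, 0, 0, 8, 8, labels, next_label=[1])
print(labels)
```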
What are the variations of the splitting and merging algorithm?
Growing - no splitting is applied
Shrinking - no merging is applied
How can changes in brightness over space be analysed?
By transformation from the spatial domain to the frequency domain (spectral domain)
What are the effects of transformation to the frequency domain?
The image will have the same size
Areas of rapid brightness change correspond to high frequency.
How are images transformed to the frequency domain?
A set of basis images is used, where each basis image is the same size as the original image
T(u,v) = Σ_r Σ_c I(r,c) B(r,c; u,v)
Fourier transforms are generally used; cosine transforms don’t use sine functions
How is filtering applied to the frequency domain?
Low-pass filters eliminate high-frequencies, blurring the image
High-pass filters can be used for edge detection
What are non-ideal filters?
Those that let some of the signal through if it is close to the threshold
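A numpy sketch of an ideal low-pass filter applied in the frequency domain (the cutoff radius is arbitrary):

```python
import numpy as np

img = np.random.default_rng(0).random((64, 64))    # stand-in image

# Transform to the frequency domain (same size as the image).
F = np.fft.fftshift(np.fft.fft2(img))

# Ideal low-pass filter: keep frequencies within `cutoff` of the centre.
rows, cols = img.shape
r, c = np.ogrid[:rows, :cols]
dist = np.sqrt((r - rows // 2) ** 2 + (c - cols // 2) ** 2)
mask = dist <= 10                                  # arbitrary cutoff radius

blurred = np.real(np.fft.ifft2(np.fft.ifftshift(F * mask)))
print(blurred.shape)                               # same size as the original
```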
What are wavelet transforms?
Transforms that include both frequency and spatial information
They break the image into four sub-sampled images by high-pass and low-pass filtering the image in both directions
They are commonly used for image compression
What is feature extraction for images? What are some key considerations?
Feature extraction is the general process of turning image data into feature vectors
The features may be domain specific or domain independent
Good features are RST invariant i.e. unaffected by rotation, scale or translation
What are some key ways of doing feature extraction in images?
- Using binary objects of interest
- Histogram features
- Global features
- LBP
- SIFT
- Pixel Statistics
How are features extracted from a binary object of interest?
Features can be extracted based on its:
- area
- center of area
- axis of the least second moment (gives an indication of its orientation)
- thinness
- the Euler number
What is thinness?
A measure of how thin an object is based on the ratio of its area to perimeter:
T = 4 pi A/P^2
Note that a circle has a thinness of 1, a very thin object approaches 0
What is the Euler number of an object?
The number of objects in an image minus the number of holes
What are histogram features?
Global features based on the image’s brightness or contrast
How do global and local features compare semantically?
Global features generally assume that there is only one object in the image; local features can be more difficult because you have to handle the case where there are multiple objects
Local features generally perform better than global if there is significant clutter or occlusion
What is LBP?
Local Binary Pattern is a #global# feature based on:
- For each central pixel, identify p points at distance r from the central pixel
- Calculate the difference between the central pixel and each of these points
- Encode these differences in binary - 1 for positive values and 0 for negative
- Create a histogram of these binary codes across the image
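A basic sketch of LBP with 8 neighbours at radius 1, one common configuration; real implementations interpolate sample points for arbitrary p and r:

```python
import numpy as np

def lbp_histogram(img):
    """8-neighbour, radius-1 LBP: threshold each neighbour against the
    centre pixel, pack the 8 bits into a code, then histogram the codes."""
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    rows, cols = img.shape
    codes = []
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            centre = img[r, c]
            code = 0
            for bit, (dr, dc) in enumerate(offsets):
                if img[r + dr, c + dc] >= centre:   # 1 for a non-negative difference
                    code |= 1 << bit
            codes.append(code)
    return np.bincount(codes, minlength=256)

img = np.random.default_rng(1).integers(0, 256, size=(16, 16))
print(lbp_histogram(img).sum())   # one code per interior pixel: 14 * 14 = 196
```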
What is SIFT?
An algorithm for matching features between images.
What should be considered when evaluating image features?
- Resolution - what is the minimum size of object that can be distinguished?
- Number of features - algorithms take longer if more features are used
- Invariances - are the features affected by rotation or lighting changes etc.?
- Redundancy - could fewer features encode the same amount of information?
- Completeness - can the original image be reconstructed from the features?
What are pixel statistics?
The image is divided into regions and aggregates of pixel values are computed for each region
What are the basic approaches to image classification?
Template matching - the template is swept across the image to find matches
Nearest neighbor - find labelled images most similar to the current image
How are detected objects classified?
With a cutout square. It should be large enough to contain any object, but not so large that it contains multiple objects