Week 5: Data Preparation Flashcards
REVERSED
from py_stringmatching import similarity_measure as sm
What is the python library for computing similarity measures?
REVERSED
- Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the dataset
- The attribute with the most distinct values is placed at the lowest level of the hierarchy
- E.g. Country (highest level) -> state -> city -> street (lowest level)
- This is also a type of data smoothing
What is concept hierarchy generation?
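A minimal sketch of the distinct-value heuristic from the card above. The records and attribute names are made up for illustration; the function simply orders attributes from fewest to most distinct values.

```python
# Hypothetical address records; attribute names and values are illustrative only.
records = [
    {"country": "AU", "state": "VIC", "city": "Melbourne", "street": "Swanston St"},
    {"country": "AU", "state": "VIC", "city": "Melbourne", "street": "Lygon St"},
    {"country": "AU", "state": "VIC", "city": "Geelong", "street": "Moorabool St"},
    {"country": "AU", "state": "NSW", "city": "Sydney", "street": "George St"},
    {"country": "AU", "state": "NSW", "city": "Sydney", "street": "Pitt St"},
]


def hierarchy_by_distinct_values(rows):
    """Order attributes from highest level (fewest distinct values) to lowest."""
    attributes = list(rows[0].keys())
    distinct = {a: len({row[a] for row in rows}) for a in attributes}
    return sorted(attributes, key=lambda a: distinct[a])


hierarchy = hierarchy_by_distinct_values(records)
print(" -> ".join(hierarchy))
```

The heuristic can fail when a higher-level attribute happens to have more distinct values than a lower-level one, so the generated order may still need a manual check.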
REVERSED
Effective if data is clustered but not if data is “smeared”
When is data reduction through clustering useful and when is it not useful?
REVERSED
Random error or variance in a measured variable
What is noise in data?
REVERSED
- Stepwise forward selection: starts with an empty set of attributes. At each step, the best of the remaining original attributes is determined and added to the set
- Stepwise backward elimination: starts with the full set of attributes. At each step, removes the worst of the remaining attributes
- Combination of forward selection and backward elimination: at each step, the procedure selects the best attribute and removes the worst from among the remaining attributes
- Decision tree induction: tree is constructed from given data. All attributes that do not appear in the tree are considered irrelevant
What are the 4 heuristic methods for selecting the subset in attribute subset selection?
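A rough sketch of stepwise forward selection, the first heuristic on the card above. The `score` function and the `usefulness` values are hypothetical stand-ins for whatever evaluation measure is used in practice (for example, validation accuracy of a model trained on the subset).

```python
def forward_selection(attributes, score):
    """Greedily add the attribute that improves the score most; stop when none helps."""
    selected = []
    remaining = list(attributes)
    best_score = score(selected)
    while remaining:
        candidate = max(remaining, key=lambda a: score(selected + [a]))
        candidate_score = score(selected + [candidate])
        if candidate_score <= best_score:
            break  # no remaining attribute improves the subset
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected


# Hypothetical usefulness of each attribute, with a small cost per attribute kept.
usefulness = {"age": 0.5, "income": 0.3, "postcode": 0.05, "id": 0.0}


def score(subset):
    return sum(usefulness[a] for a in subset) - 0.1 * len(subset)


selected = forward_selection(usefulness.keys(), score)
print(selected)
```

Backward elimination follows the same pattern in reverse: start from the full set and repeatedly drop the attribute whose removal improves the score most.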
REVERSED
Quantifies the local density of a data point using a neighbourhood of size k
- Introduces a smoothing parameter: reachability distance RD
RD_k(x, y) = max{k-dist(y), dist(x, y)}, where k-dist(y) is the distance between y and its k-th nearest neighbour
- The local reachability density of point x is:
LRD_k(x) = k / [sum over y in KNN(x) of RD_k(x, y)]
- The local outlier factor LOF is:
LOF_k(x) = [sum over y in KNN(x) of LRD_k(y) / LRD_k(x)] / k
- Generally, LOF > 1 means x has a lower density than its neighbours
What is the local outlier factor for outlier detection?
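The LOF definitions can be traced with a small brute-force sketch. It assumes Euclidean distance, no duplicate points (which would make a density infinite), and the common form of the reachability distance that uses the k-distance of the neighbour y. The sample points are made up.

```python
import math


def k_nearest(points, i, k):
    """Indices of the k nearest neighbours of points[i], excluding itself."""
    others = [j for j in range(len(points)) if j != i]
    return sorted(others, key=lambda j: math.dist(points[i], points[j]))[:k]


def k_distance(points, i, k):
    """Distance from points[i] to its k-th nearest neighbour."""
    return math.dist(points[i], points[k_nearest(points, i, k)[-1]])


def reach_dist(points, i, j, k):
    """RD_k(x, y) = max{k-dist(y), dist(x, y)} with x = points[i], y = points[j]."""
    return max(k_distance(points, j, k), math.dist(points[i], points[j]))


def lrd(points, i, k):
    """Local reachability density: k over the summed reachability distances."""
    return k / sum(reach_dist(points, i, j, k) for j in k_nearest(points, i, k))


def lof(points, i, k):
    """Average ratio of the neighbours' density to the point's own density."""
    neighbours = k_nearest(points, i, k)
    return sum(lrd(points, j, k) for j in neighbours) / (k * lrd(points, i, k))


# Four points in a tight square plus one far-away point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = [lof(points, i, k=2) for i in range(len(points))]
for point, s in zip(points, scores):
    print(point, round(s, 2))
```

The square's corners score close to 1 (same density as their neighbours), while the isolated point scores well above 1. This version recomputes neighbours on every call, so it is only suitable for tiny examples.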
REVERSED
lev_sim = sm.levenshtein.Levenshtein()
lev_sim.get_sim_score(s1, s2)
How do you compute the levenshtein similarity between strings s1 and s2 in python?
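To see what a Levenshtein similarity score measures, here is a dependency-free sketch. It assumes the score is the edit distance normalised by the longer string's length, which is a common convention; check the library documentation before relying on the two matching exactly.

```python
def levenshtein_distance(s1, s2):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def levenshtein_similarity(s1, s2):
    """Assumed normalisation: 1 - distance / length of the longer string."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(s1, s2) / longest


print(levenshtein_distance("kitten", "sitting"))
print(levenshtein_similarity("kitten", "sitting"))
```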
REVERSED
- Conformance to schema: evaluate constraints on a snapshot
- Conformance to business rules: evaluate constraints on changes in the database
- Accuracy: perform inventory (expensive), or use proxy (track complaints)
- Glitches in analysis
- Successful completion of end-to-end process
What are examples of data quality metrics? (5)
REVERSED
Outlier detection looks for unusual observations within the existing data, whereas novelty detection checks whether new data fits the existing data or would be considered an outlier
What is the difference between outlier detection and novelty detection?
REVERSED
Attributes that duplicate much or all of the information contained in one or more other attributes
What are redundant attributes?
REVERSED
Transform the multivariate outlier detection task into a univariate outlier detection problem
What is the general approach for outlier detection with multivariate data?
REVERSED
- Supervised: use class information
- Bottom-up merge: find the best neighbouring intervals to merge
- Initially each distinct value is an interval, Chi squared tests are performed on every adjacent interval and those with the least chi squared values are merged together. Merge performed recursively until a predefined stopping condition is satisfied
What is correlation analysis for discretisation?
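The bottom-up merge on the card above can be sketched as follows. The interval representation (bounds plus per-class counts), the toy data, and the use of a maximum interval count as the stopping condition are all assumptions for illustration; a chi-squared threshold is another common stopping rule.

```python
def chi_squared(counts_a, counts_b):
    """Chi-squared statistic for two adjacent intervals given per-class counts."""
    total = sum(counts_a) + sum(counts_b)
    chi2 = 0.0
    for row in (counts_a, counts_b):
        row_total = sum(row)
        for j, observed in enumerate(row):
            column_total = counts_a[j] + counts_b[j]
            expected = row_total * column_total / total
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    return chi2


def chi_merge(intervals, max_intervals):
    """Repeatedly merge the adjacent pair with the smallest chi-squared value."""
    intervals = list(intervals)
    while len(intervals) > max_intervals:
        scores = [chi_squared(intervals[i][1], intervals[i + 1][1])
                  for i in range(len(intervals) - 1)]
        i = scores.index(min(scores))
        (low, counts_a), (high, counts_b) = intervals[i], intervals[i + 1]
        merged = ((low[0], high[1]), [a + b for a, b in zip(counts_a, counts_b)])
        intervals[i:i + 2] = [merged]
    return intervals


# Each interval: ((lower bound, upper bound), [count of class A, count of class B]).
initial = [((1, 1), [2, 0]), ((2, 2), [3, 0]), ((10, 10), [0, 2]), ((11, 11), [0, 3])]
result = chi_merge(initial, max_intervals=2)
print(result)
```

Adjacent intervals with similar class distributions have a low chi-squared value, so they are merged first; intervals that separate the classes survive.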
REVERSED
Fit a model to the data and save the model instead
What is model based data reduction?
REVERSED
Problem of identifying and linking/grouping different representations of the same real-world object
What is entity resolution?
REVERSED
df.corr()
How do you find the correlation matrix for a dataframe in python?
REVERSED
global, contextual, collective
What are the three kinds of outliers?
REVERSED
Do not assume an a-priori statistical model; instead, determine the model from the input data
e.g. histogram and kernel density estimation
What are non-parametric methods for outlier detection?
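A minimal histogram-based sketch of the non-parametric idea: build an equal-width histogram from the data and flag values that fall in sparsely populated bins. The bin count and the `min_fraction` threshold are arbitrary choices for this example, not values from the course material.

```python
def histogram_outliers(values, n_bins=5, min_fraction=0.1):
    """Flag values whose histogram bin holds less than min_fraction of the data."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    if width == 0:
        return []  # all values identical, nothing stands out

    def bin_of(v):
        # The maximum value is placed in the last bin.
        return min(int((v - low) / width), n_bins - 1)

    counts = [0] * n_bins
    for v in values:
        counts[bin_of(v)] += 1
    return [v for v in values if counts[bin_of(v)] / len(values) < min_fraction]


values = [10, 11, 12, 11, 10, 12, 11, 13, 12, 11, 50]
print(histogram_outliers(values))
```

The result is sensitive to the bin width: bins that are too narrow flag normal values, and bins that are too wide hide outliers.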
REVERSED
- Supervised methods: domain experts examine and label a sample of the underlying data and the sample is used for testing and training. Outlier detection modelled as a classification problem
- Unsupervised methods: assume that normal objects are somewhat clustered. Outliers are expected to occur far away from any of the groups of normal objects
- Semi-supervised methods: only a small set of the normal or outlier objects are labelled, but most of the data are unlabelled. The labelled normal objects together with unlabelled objects that are close by, can be used to train a model for normal objects
What are the three types of outlier detection methods?
REVERSED
Simple random sampling may have poor performance in the presence of skew
When does simple random sampling have poor performance?
REVERSED
checking permitted characters
finding type-mismatched data
What is data validation?
REVERSED
- Reflects the use of the data
- Leads to improvements in processes
- Measurable (we can define metrics)
What do we need in a definition of data quality? (3)
REVERSED
Assumes that the normal data is generated by a parametric distribution with the parameter theta
- The probability density function of the parametric distribution f(x, theta) gives the probability that x is generated by the distribution
- The smaller this value, the more likely x is an outlier
What are parametric methods for outlier detection?
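As a sketch of the parametric approach, assume the normal data follows a Gaussian, estimate its parameters (mean and standard deviation) from the sample, and rank values by their density. The Gaussian assumption and the sample values are illustrative only.

```python
from statistics import NormalDist

# Made-up measurements with one suspicious value.
values = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 14.0]

# Estimate theta = (mean, standard deviation) from the data.
model = NormalDist.from_samples(values)

# The smaller the density f(x, theta), the more likely x is an outlier.
ranked = sorted(values, key=model.pdf)
least_likely = ranked[0]
print(least_likely, model.pdf(least_likely))
```

Note that the suspicious value is included when estimating the parameters, which inflates the standard deviation; robust estimates or an iterative fit would reduce that effect.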
REVERSED
# fill each NA with the value before it
data.fillna(method='pad')  # or method='ffill'
# fill each NA with the value after it
data.fillna(method='bfill')  # or method='backfill'
# set a limit on the number of forward or backward fills
data.fillna(method='pad', limit=1)
What are the 2 different methods for filling nas in python?
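To make the fill behaviour concrete, here is a plain-Python sketch of forward and backward filling with an optional limit. It uses `None` as the missing-value marker rather than NaN and is meant only to illustrate the logic, not to mirror the library implementation.

```python
def forward_fill(values, limit=None):
    """Replace each None with the last seen value, at most `limit` times in a row."""
    filled = []
    last_seen = None
    run = 0  # length of the current run of missing values
    for v in values:
        if v is None:
            run += 1
            if last_seen is not None and (limit is None or run <= limit):
                filled.append(last_seen)
            else:
                filled.append(None)
        else:
            last_seen = v
            run = 0
            filled.append(v)
    return filled


def backward_fill(values, limit=None):
    """Replace each None with the next value by forward filling the reversed list."""
    return forward_fill(values[::-1], limit)[::-1]


series = [1, None, None, 4, None]
print(forward_fill(series))
print(forward_fill(series, limit=1))
print(backward_fill([None, 2, None, None, 5]))
```

A leading missing value has nothing before it, so a forward fill leaves it missing; the same applies to a trailing missing value under a backward fill.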
REVERSED
- Inconsistent: containing discrepancies in codes or names
- Intentional: e.g. disguised missing data such as Jan 1st for all birthdays
What makes data “dirty”? (2)
REVERSED
capitalisation, white space normalisation, correcting typos, replacing abbreviations, variations, nick names
What is data normalisation in text?
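A small sketch of several of these normalisation steps. The abbreviation and nickname tables are hypothetical and deliberately tiny; real tables depend on the domain, and typo correction is left out.

```python
import re

# Hypothetical lookup tables for illustration only.
ABBREVIATIONS = {"st": "street", "rd": "road"}
NICKNAMES = {"bob": "robert", "liz": "elizabeth"}


def normalise(text):
    """Lower-case, collapse white space, and expand abbreviations and nicknames."""
    text = text.lower()
    text = re.sub(r"\s+", " ", text).strip()
    tokens = [token.rstrip(".") for token in text.split(" ")]
    tokens = [ABBREVIATIONS.get(t, NICKNAMES.get(t, t)) for t in tokens]
    return " ".join(tokens)


print(normalise("  Bob   SMITH, 12 Main St. "))
```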
REVERSED
- Binning
- Histograms
- Clustering
- Classification (e.g. decision trees)
- Correlation
What are discretisation methods? (5)
REVERSED
O(n^2)
What is the time complexity of computing pairwise similarity?
REVERSED
- Divides range into N intervals, each containing approximately the same number of samples
- Managing categorical attributes can be tricky
What is equal-depth partitioning for discretisation? What is a problem with it?
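Equal-depth partitioning can be sketched by sorting the values and slicing them into N runs of roughly equal length. The data is made up, and the slicing rule below is one of several reasonable ways to handle a count that does not divide evenly.

```python
def equal_depth_bins(values, n_bins):
    """Split the sorted values into n_bins groups of approximately equal size."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // n_bins:(i + 1) * n // n_bins] for i in range(n_bins)]


values = [13, 5, 90, 7, 41, 8, 10, 40, 7, 12]
bins = equal_depth_bins(values, n_bins=3)
print(bins)
```

Because the cut points follow the data rather than the value range, the extreme value 90 simply ends up in the last bin instead of stretching every interval.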
REVERSED
Transform the data by moving the decimal points of values of attribute A
v’ = v / 10^j where j is the smallest integer such that max(|v’|) < 1
e.g. if the maximum absolute value of A is 986, divide each value by 1000 (j=3)
How do you normalise data by decimal scaling?
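The decimal scaling rule can be checked with a short sketch that searches for the smallest j and then divides. The sample values extend the 986 example with two made-up numbers.

```python
def decimal_scale(values):
    """Divide by 10^j, where j is the smallest integer giving max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j


scaled, j = decimal_scale([-986, 917, 45])
print(j, scaled)
```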
REVERSED
Global approaches: the reference set contains all other data objects
Local approaches: the reference set contains a small subset of data objects and there is no assumption on the number of normal mechanisms
What is the difference between global and local approaches to outlier detection?
REVERSED
- Accuracy: the data was recorded correctly
- Completeness: all relevant data was recorded
- Uniqueness: entities are recorded once
- Timeliness: the data is kept up to date
- Consistency: the data agrees with itself
- Believability: how much the data is trusted by users
- Interpretability: how easy the data is understood
What is the definition of data quality? (7 parts)
REVERSED
- Divides the range into N intervals of equal size: uniform grid
- If A and B are the smallest and largest values of the attribute, the width of the intervals will be W = (B-A)/N
- The most straightforward, but outliers may dominate presentation
- Skewed data is not handled well
What is equal width partitioning for discretisation? What are the 2 problems with it?
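A short sketch of equal-width partitioning using W = (B - A) / N. The data is made up to show how a single outlier stretches the range; the function assumes the values are not all identical.

```python
def equal_width_bins(values, n_bins):
    """Return the interval edges and the number of values in each interval."""
    a, b = min(values), max(values)
    width = (b - a) / n_bins  # W = (B - A) / N
    edges = [a + i * width for i in range(n_bins + 1)]
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last interval.
        index = min(int((v - a) / width), n_bins - 1)
        counts[index] += 1
    return edges, counts


edges, counts = equal_width_bins([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], n_bins=3)
print(edges)
print(counts)
```

Nine of the ten values land in the first interval and the middle interval is empty, which is the outlier and skew problem the card describes.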
REVERSED
- similarity measures have different scales
- pairwise similarity between records is expensive
What are issues with computing similarity measures? (2)
REVERSED
Given two records, compute a vector of similarity scores for corresponding features
-Score can be Boolean (match/mismatch) or a continuous value based on specific similarity measure (distance function)
What is matching features?
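A sketch of building a similarity vector for two records. The field names and records are hypothetical, and `difflib.SequenceMatcher.ratio()` from the standard library stands in for whichever continuous similarity measure is actually chosen.

```python
from difflib import SequenceMatcher


def similarity_vector(record_a, record_b, string_fields, exact_fields):
    """One score per feature: continuous for string fields, Boolean for exact fields."""
    vector = []
    for field in string_fields:
        vector.append(SequenceMatcher(None, record_a[field], record_b[field]).ratio())
    for field in exact_fields:
        vector.append(1.0 if record_a[field] == record_b[field] else 0.0)
    return vector


a = {"name": "Jon Smith", "city": "Melbourne", "postcode": "3000"}
b = {"name": "John Smith", "city": "Melbourne", "postcode": "3001"}
vector = similarity_vector(a, b, string_fields=["name", "city"], exact_fields=["postcode"])
print(vector)
```

The resulting vector can then be thresholded or passed to a classifier to decide whether the two records refer to the same entity.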
REVERSED
- Binning: first sort data and partition into equal frequency (equidepth) bins, then one can smooth by bin means, smooth by bin median, smooth by bin boundaries etc
- Regression: smooth by fitting the data into regression functions
- Clustering: detect and remove outliers that do not belong to any of the clusters
- Combined computer and human inspection: detect suspicious values and check by human
What are 4 ways to handle noisy data?
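Smoothing by bin means, the first option on the card above, can be sketched as follows: sort the data, split it into equal-frequency bins, and replace every value with the mean of its bin. The values are illustrative.

```python
def smooth_by_bin_means(values, n_bins):
    """Sort, partition into equal-frequency bins, and replace values by bin means."""
    ordered = sorted(values)
    n = len(ordered)
    smoothed = []
    for i in range(n_bins):
        bin_values = ordered[i * n // n_bins:(i + 1) * n // n_bins]
        if bin_values:
            mean = sum(bin_values) / len(bin_values)
            smoothed.extend([mean] * len(bin_values))
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed = smooth_by_bin_means(prices, n_bins=3)
print(smoothed)
```

Smoothing by bin medians or bin boundaries only changes the value each bin member is replaced with.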
REVERSED
data.fillna(value)  # inplace=True replaces the values in the original dataframe
What is the python code for filling in missing values?
REVERSED
unsupervised, top down splitting method
What type of discretisation method is binning?
REVERSED
Attributes that contain no information that is useful for the data mining task at hand
What are irrelevant attributes?
REVERSED
An object is a contextual outlier O_c (or conditional outlier) if it deviates significantly based on a selected context
Issue: how to define or formulate a meaningful context
What is a contextual outlier and what is the issue with detecting them?
REVERSED
- Ignore the tuple: usually done when class label is missing - not effective when the % of missing values is large
- Fill in the missing value manually: tedious + infeasible
- Fill in the missing value automatically (data imputation) with: a global constant e.g. “unknown” or a new class, the attribute mean, the attribute mean for all samples belonging to the same class, the most probable value found through regression, inference or decision tree
What are 3 ways of handling missing data? (3)
REVERSED
Transform the data from a given range [minA, maxA] to a new interval [new_minA, new_maxA] for a given attribute A
v’ = (v - minA)/(maxA - minA) * (new_maxA - new_minA) + new_minA
where v is the current value
What is min-max normalisation?
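The min-max formula translates directly into a short sketch. The income values and the default target interval [0, 1] are illustrative, and the function assumes the attribute is not constant (maxA > minA).

```python
def min_max_normalise(values, new_min=0.0, new_max=1.0):
    """Map values linearly from [min, max] of the data to [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    span = old_max - old_min  # assumed non-zero
    return [(v - old_min) / span * (new_max - new_min) + new_min for v in values]


incomes = [12000, 73600, 98000]
scaled = min_max_normalise(incomes)
print(scaled)
```

A later value outside the original [minA, maxA] would map outside the new interval, so the minimum and maximum should be taken from data that covers the expected range.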
REVERSED
remove unimportant attributes
What is dimensionality reduction?