Week 5: Data Preparation Flashcards
REVERSED
from py_stringmatching import similarity_measure as sm
What is the Python library for computing similarity measures?
REVERSED
- Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the dataset
- The attribute with the most distinct values is placed at the lowest level of the hierarchy
- E.g. Country (highest level) -> state -> city -> street (lowest level)
- This is also a type of data smoothing
What is concept hierarchy generation?
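Example (a minimal sketch, not from the course material): the distinct-value heuristic on this card can be tried on a handful of made-up address records. The function name `generate_hierarchy` and the sample rows are hypothetical and only illustrate the ordering rule.

```python
def generate_hierarchy(rows, attributes):
    """Order attributes from highest level (fewest distinct values)
    to lowest level (most distinct values)."""
    distinct_counts = {a: len({row[a] for row in rows}) for a in attributes}
    return sorted(attributes, key=lambda a: distinct_counts[a])


# Made-up sample records, for illustration only.
rows = [
    {"country": "AU", "state": "NSW", "city": "Sydney", "street": "George St"},
    {"country": "AU", "state": "NSW", "city": "Sydney", "street": "Pitt St"},
    {"country": "AU", "state": "NSW", "city": "Newcastle", "street": "Hunter St"},
    {"country": "AU", "state": "VIC", "city": "Melbourne", "street": "Collins St"},
    {"country": "AU", "state": "VIC", "city": "Geelong", "street": "Moorabool St"},
]

hierarchy = generate_hierarchy(rows, ["street", "city", "country", "state"])
print(" -> ".join(hierarchy))
```

The heuristic can fail when a high-level attribute happens to have many distinct values, so the generated order usually deserves a manual check.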
REVERSED
Effective if data is clustered but not if data is “smeared”
When is data reduction through clustering useful and when is it not useful?
REVERSED
Random error or variance in a measured variable
What is noise in data?
REVERSED
- Stepwise forward selection: starts with an empty set of attributes. At each step, the best of the remaining original attributes is determined and added to the set
- Stepwise backward elimination: starts with the full set of attributes. At each step, removes the worst of the remaining attributes
- Combination of forward selection and backward elimination: starts with an empty set and combines the two methods so that, at each step, the procedure adds the best attribute to the reduced set and removes the worst attribute from the remaining initial set
- Decision tree induction: a tree is constructed from the given data. All attributes that do not appear in the tree are considered irrelevant
What are the 4 heuristic methods for selecting the subset in attribute subset selection?
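Example (a minimal sketch of stepwise forward selection only): the scoring function is an assumption, since the card does not say how "best" is measured. Here `toy_score` and the `usefulness` weights are hypothetical stand-ins for something like cross-validated model accuracy.

```python
def forward_selection(attributes, score):
    """Greedy stepwise forward selection.

    Starts from the empty set and repeatedly adds the attribute that
    improves score(subset) the most, stopping when nothing helps.
    """
    selected = []
    remaining = list(attributes)
    best_score = score(selected)
    while remaining:
        candidate = max(remaining, key=lambda a: score(selected + [a]))
        candidate_score = score(selected + [candidate])
        if candidate_score <= best_score:
            break  # no remaining attribute improves the score
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected


# Hypothetical per-attribute usefulness, with a small penalty per attribute.
usefulness = {"age": 0.5, "income": 0.3, "id": 0.0, "postcode": 0.05}


def toy_score(subset):
    return sum(usefulness[a] for a in subset) - 0.1 * len(subset)


chosen = forward_selection(usefulness, toy_score)
print(chosen)
```

Backward elimination follows the same pattern in reverse: start from the full set and drop the attribute whose removal hurts the score least.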
REVERSED
Quantifies the local density of a data point using a neighbourhood of size k
- Introduces a smoothing parameter: the reachability distance RD
RD_k(x, y) = max{k-dist(y), dist(x, y)}, where k-dist(y) is the distance between y and its k-th nearest neighbour
- The local reachability density of point x is:
LRD_k(x) = k / [sum over y in KNN(x) of RD_k(x, y)]
- The local outlier factor LOF is:
LOF_k(x) = [sum over y in KNN(x) of LRD_k(y) / LRD_k(x)] / k
- Generally, LOF > 1 means x has a lower density than its neighbours
What is the local outlier factor for outlier detection?
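Example (a brute-force sketch for small datasets, not a production implementation): it uses the standard definition in which the reachability distance of x from a neighbour y takes the k-distance of y. It assumes Euclidean distance and no exact duplicate points (duplicates can make the density sum zero). The function name `lof_scores` is hypothetical.

```python
import math


def lof_scores(points, k):
    """Local outlier factor of every point, computed by brute force."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Indices of the k nearest neighbours of each point (excluding itself).
    knn = [
        sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
        for i in range(n)
    ]
    # Distance from each point to its k-th nearest neighbour.
    k_dist = [dist[i][knn[i][-1]] for i in range(n)]

    def reach_dist(i, j):
        # Reachability distance of point i from its neighbour j.
        return max(k_dist[j], dist[i][j])

    # Local reachability density.
    lrd = [k / sum(reach_dist(i, j) for j in knn[i]) for i in range(n)]

    # LOF: average ratio of the neighbours' density to the point's own.
    return [sum(lrd[j] for j in knn[i]) / (k * lrd[i]) for i in range(n)]


points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = lof_scores(points, k=2)
for point, score in zip(points, scores):
    print(point, round(score, 2))
```

Points inside the square get a score of about 1, while the isolated point gets a score well above 1.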
REVERSED
lev_sim = sm.levenshtein.Levenshtein()
lev_sim.get_sim_score(s1, s2)
How do you compute the Levenshtein similarity between strings s1 and s2 in Python?
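Example (a pure-Python sketch of what a Levenshtein similarity computes, for when the library is not installed): normalising the edit distance by the length of the longer string is an assumption made here for illustration; check the library documentation for its exact normalisation. Both function names are hypothetical.

```python
def levenshtein_distance(s1, s2):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn s1 into s2."""
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(s1, s2):
    """Similarity in [0, 1]; 1 means the strings are identical."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(s1, s2) / longest


print(levenshtein_distance("kitten", "sitting"))
print(levenshtein_similarity("kitten", "sitting"))
```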
REVERSED
- Conformance to schema: evaluate constraints on a snapshot
- Conformance to business rules: evaluate constraints on changes in the database
- Accuracy: perform inventory (expensive), or use proxy (track complaints)
- Glitches in analysis
- Successful completion of end-to-end process
What are examples of data quality metrics? (5)
REVERSED
Outlier detection looks for unusual points within the existing data. Novelty detection checks whether new data fits the existing data or would be considered an outlier
What is the difference between outlier detection and novelty detection?
REVERSED
Attributes that duplicate much or all of the information contained in one or more other attributes
What are redundant attributes?
REVERSED
Transform the multivariate outlier detection task into a univariate outlier detection problem
What is the general approach for outlier detection with multivariate data?
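Example (one possible reduction, chosen here as an assumption): each multivariate point is replaced by its Euclidean distance to the centroid, and a simple z-score test is then applied to those distances. Other reductions, such as the Mahalanobis distance, follow the same idea. The function names and threshold are hypothetical.

```python
import math
import statistics


def centroid_distances(points):
    """Reduce each multivariate point to one number: distance to the centroid."""
    dims = len(points[0])
    centroid = [statistics.fmean(p[d] for p in points) for d in range(dims)]
    return [math.dist(p, centroid) for p in points]


def flag_outliers(points, threshold=2.0):
    """Univariate z-score test on the distances to the centroid."""
    distances = centroid_distances(points)
    mean = statistics.fmean(distances)
    spread = statistics.pstdev(distances)
    if spread == 0:
        return [False] * len(points)
    return [(d - mean) / spread > threshold for d in distances]


# A 3x3 grid of ordinary points plus one far-away point.
points = [(x, y) for x in range(3) for y in range(3)] + [(20, 20)]
flags = flag_outliers(points)
print([p for p, is_outlier in zip(points, flags) if is_outlier])
```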
REVERSED
- Supervised: use class information
- Bottom-up merge: find the best neighbouring intervals to merge
- Initially, each distinct value is its own interval. Chi-squared tests are performed on every pair of adjacent intervals, and the pair with the lowest chi-squared value is merged. Merging is repeated recursively until a predefined stopping condition is satisfied
What is correlation analysis for discretisation?
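Example (a simplified sketch of the bottom-up chi-squared merge on this card): the stopping condition used here, a maximum number of intervals, is an assumption; a chi-squared significance threshold is another common choice. The function names `chi2` and `chimerge` are hypothetical.

```python
from collections import Counter


def chi2(counts_a, counts_b):
    """Chi-squared statistic for the class counts of two adjacent intervals."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    total = total_a + total_b
    stat = 0.0
    for cls in set(counts_a) | set(counts_b):
        class_total = counts_a.get(cls, 0) + counts_b.get(cls, 0)
        for counts, row_total in ((counts_a, total_a), (counts_b, total_b)):
            expected = row_total * class_total / total
            if expected > 0:
                stat += (counts.get(cls, 0) - expected) ** 2 / expected
    return stat


def chimerge(values, labels, max_intervals):
    """Merge adjacent intervals with the lowest chi-squared value
    until only max_intervals remain."""
    # Start with one interval per distinct value.
    intervals = []
    for v in sorted(set(values)):
        counts = Counter(l for x, l in zip(values, labels) if x == v)
        intervals.append(((v, v), counts))

    while len(intervals) > max_intervals:
        scores = [
            chi2(intervals[i][1], intervals[i + 1][1])
            for i in range(len(intervals) - 1)
        ]
        i = scores.index(min(scores))  # most similar adjacent pair
        (low, _), counts_a = intervals[i]
        (_, high), counts_b = intervals[i + 1]
        intervals[i:i + 2] = [((low, high), counts_a + counts_b)]

    return [bounds for bounds, _ in intervals]


values = [1, 2, 3, 10, 11, 12]
labels = ["a", "a", "a", "b", "b", "b"]
merged = chimerge(values, labels, max_intervals=2)
print(merged)
```

Adjacent intervals with the same class distribution have a chi-squared value of 0, so they are merged first and the class boundary survives longest.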
REVERSED
Fit a model to the data and store the model (its parameters) instead of the data itself
What is model based data reduction?
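Example (a minimal sketch, assuming a straight-line model is appropriate for the data): 100 (x, y) pairs are reduced to two stored numbers, the slope and intercept of a least-squares line. The function name `fit_line` and the synthetic data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic data lying exactly on a line, for illustration only.
xs = list(range(100))
ys = [3.0 * x + 2.0 for x in xs]

# Keep only the two parameters instead of the 100 y-values.
slope, intercept = fit_line(xs, ys)

# The y-values can be approximately regenerated from the model on demand.
reconstructed = [slope * x + intercept for x in xs]
print(slope, intercept)
```

With real data the reconstruction is only approximate, so this kind of reduction is lossy unless the residuals are stored as well.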
REVERSED
Problem of identifying and linking/grouping different representations of the same real-world object
What is entity resolution?
REVERSED
df.corr()
How do you find the correlation matrix for a DataFrame in Python?
REVERSED
global, contextual, collective
What are the three kinds of outliers?
REVERSED
Do not assume an a priori statistical model; instead, determine the model from the input data
e.g. histogram and kernel density estimation
What are non-parametric methods for outlier detection?
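Example (a minimal histogram-based sketch): values that fall into sparsely populated bins are flagged. The bin width and minimum count are assumptions that would need tuning on real data, and the function name `histogram_outliers` is hypothetical.

```python
from collections import Counter


def histogram_outliers(values, bin_width, min_count):
    """Flag values whose histogram bin holds fewer than min_count values."""
    def bin_of(v):
        return int(v // bin_width)

    bin_counts = Counter(bin_of(v) for v in values)
    return [v for v in values if bin_counts[bin_of(v)] < min_count]


values = [1.1, 1.4, 1.8, 2.2, 2.5, 2.9, 2.1, 1.6, 9.7]
outliers = histogram_outliers(values, bin_width=1.0, min_count=2)
print(outliers)
```

No distribution is assumed here: the histogram itself, built from the input data, acts as the model of what "normal" looks like.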
REVERSED
- Supervised methods: domain experts examine and label a sample of the underlying data and the sample is used for testing and training. Outlier detection modelled as a classification problem
- Unsupervised methods: assume that normal objects are somewhat clustered. Outliers are expected to occur far away from any of the groups of normal objects
- Semi-supervised methods: only a small set of the normal or outlier objects are labelled, but most of the data are unlabelled. The labelled normal objects together with unlabelled objects that are close by, can be used to train a model for normal objects
What are the three types of outlier detection methods?
REVERSED
Simple random sampling may have poor performance in the presence of skew
When does simple random sampling have poor performance?
REVERSED
checking permitted characters
finding type-mismatched data
What is data validation?
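Example (a small sketch of the two checks on this card): the record layout, the set of permitted characters, and the function name `validate_record` are all hypothetical.

```python
import re

# Assumed rule: names may contain letters, spaces, apostrophes and hyphens.
PERMITTED_NAME = re.compile(r"[A-Za-z' -]+")


def validate_record(record):
    """Return a list of validation problems found in one record."""
    problems = []

    # Check permitted characters.
    name = record.get("name")
    if not isinstance(name, str) or not PERMITTED_NAME.fullmatch(name):
        problems.append("name: unexpected characters or type")

    # Find type-mismatched data (bool is excluded because it subclasses int).
    age = record.get("age")
    if not isinstance(age, int) or isinstance(age, bool):
        problems.append("age: expected an integer")

    return problems


print(validate_record({"name": "Ada Lovelace", "age": 36}))
print(validate_record({"name": "R2-D2!", "age": "36"}))
```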
REVERSED
- Reflects the use of the data
- Leads to improvements in processes
- Measurable (we can define metrics)
What do we need in a definition of data quality? (3)
REVERSED
Assumes that the normal data is generated by a parametric distribution with parameter theta
- The probability density function of the parametric distribution, f(x, theta), gives the likelihood that x is generated by the distribution
- The smaller this value, the more likely x is an outlier
What are parametric methods for outlier detection?
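Example (a minimal sketch, assuming the normal data follows a Gaussian distribution): the parameters (mean and standard deviation) are estimated from the data, and values with a very small density are flagged. The density threshold and the function name `gaussian_outliers` are hypothetical.

```python
from statistics import NormalDist, fmean, pstdev


def gaussian_outliers(values, density_threshold):
    """Fit a Gaussian to the values and flag those with low density."""
    model = NormalDist(mu=fmean(values), sigma=pstdev(values))
    return [x for x in values if model.pdf(x) < density_threshold]


values = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 9.9, 25.0]
outliers = gaussian_outliers(values, density_threshold=0.01)
print(outliers)
```

Note that the outlier itself inflates the estimated mean and standard deviation, which is one reason robust estimates are often preferred in practice.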
REVERSED
# Fill each NA with the value before it
data.fillna(method='pad')  # or method='ffill'
# Fill each NA with the value after it
data.fillna(method='bfill')  # or method='backfill'
# Set a limit on the number of forward or backward fills
data.fillna(method='pad', limit=1)
# Note: pandas 2.1+ deprecates fillna(method=...); data.ffill() and data.bfill() are the replacements
What are the 2 different methods for filling NAs in Python?
REVERSED
- Inconsistent: containing discrepancies in codes or names
- Intentional: e.g. disguised missing data such as Jan 1st for all birthdays
What makes data “dirty”? (2)