Week 5: Data Preparation Flashcards
What are the problems in the definition of data quality? (4)
- Unmeasurable: accuracy and completeness are extremely difficult, perhaps impossible to measure
- Context independent: no accounting for what is important
- Incomplete: what about interpretability, accessibility, metadata, analysis, etc.?
- Vague: the previous definition provides no guidance towards practical improvements of the data
How are correlation and covariance related?
corr(A,B) = cov(A,B) / (sd(A) * sd(B))
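A quick NumPy check of this relationship (a minimal sketch; the arrays are illustrative):

import numpy as np

A = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
B = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

cov_AB = np.cov(A, B)[0, 1]  # sample covariance
corr_AB = cov_AB / (np.std(A, ddof=1) * np.std(B, ddof=1))
print(np.isclose(corr_AB, np.corrcoef(A, B)[0, 1]))  # True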
How do you compute the Levenshtein similarity between strings s1 and s2 in python?
from py_stringmatching import similarity_measure as sm
lev_sim = sm.levenshtein.Levenshtein()
lev_sim.get_sim_score(s1, s2)  # normalised similarity in [0, 1]
What are data quality issues? (7)
- Noise
- Duplicate data
- Outliers
- Unreliable sources
- Inconsistent values
- Outdated values
- Missing values
How do you normalise data by decimal scaling?
Transform the data by moving the decimal points of values of attribute A
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
e.g. if the maximum absolute value of A is 986, divide each value by 1000 (j=3)
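A minimal NumPy sketch of decimal scaling (the array A is illustrative):

import numpy as np

A = np.array([12.0, -986.0, 310.0])  # max absolute value is 986

# smallest integer j such that max(|v / 10^j|) < 1
j = int(np.floor(np.log10(np.abs(A).max()))) + 1
A_scaled = A / 10**j  # j = 3 here, so -986 -> -0.986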
What is the python library for computing similarity measures?
from py_stringmatching import similarity_measure as sm
What is data validation?
- checking permitted characters
- finding type-mismatched data
What are irrelevant attributes?
Attributes that contain no information that is useful for the data mining task at hand
What are 3 ways of handling missing data? (3)
- Ignore the tuple: usually done when class label is missing - not effective when the % of missing values is large
- Fill in the missing value manually: tedious + infeasible
- Fill in the missing value automatically (data imputation) with: a global constant, e.g. “unknown” or a new class; the attribute mean; the attribute mean for all samples belonging to the same class; or the most probable value, found through regression, inference or a decision tree
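A minimal pandas sketch of three of the automatic options (column names are illustrative):

import pandas as pd

df = pd.DataFrame({"income": [50.0, None, 70.0, None],
                   "class": ["a", "a", "b", "b"]})

const_fill = df["income"].fillna(-1)                   # global constant
mean_fill = df["income"].fillna(df["income"].mean())   # attribute mean
class_fill = df["income"].fillna(                      # per-class attribute mean
    df.groupby("class")["income"].transform("mean"))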
How do you reduce data with histograms?
- Divide data into buckets and store the average (or sum) for each bucket
- Partitioning rules: equal-width (equal bucket range) and equal-frequency (equal depth) (each bucket contains same number of data points)
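A minimal NumPy sketch of both partitioning rules (the data array is illustrative):

import numpy as np

data = np.array([1, 1, 5, 5, 5, 8, 10, 10, 14, 15])

counts, edges = np.histogram(data, bins=3)         # equal-width: 3 buckets of equal range
q_edges = np.quantile(data, np.linspace(0, 1, 4))  # equal-frequency: edges at quantiles

# store one average per bucket instead of the raw values
idx = np.digitize(data, edges[1:-1])
bucket_means = [data[idx == b].mean() for b in range(3)]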
What type of discretisation method is binning?
an unsupervised, top-down splitting method
What are the three types of outlier detection methods?
- Supervised methods: domain experts examine and label a sample of the underlying data; the sample is then used for training and testing. Outlier detection is modelled as a classification problem
- Unsupervised methods: assume that normal objects are somewhat clustered. Outliers are expected to occur far away from any of the groups of normal objects
- Semi-supervised methods: only a small set of the normal or outlier objects are labelled, but most of the data are unlabelled. The labelled normal objects together with unlabelled objects that are close by, can be used to train a model for normal objects
What is the code in python to fill NAs in column 1 with mean values of column 1 grouped by column 2?
data["column1"] = data["column1"].fillna(data.groupby("column2")["column1"].transform("mean"))
What is the python code for removing missing values?
data.dropna()
What is univariate data?
data set involving only one attribute or variable
How do you reduce data using clustering?
Partition data set into clusters based on similarity and store cluster representation (e.g. centroid and diameter) only
How do you normalise data by z-score in python?
from sklearn.preprocessing import StandardScaler
StandardScaler().fit_transform(df)  # column-wise (v - mean) / sd; returns a NumPy array
What are proximity based methods for outlier detection?
- Assume that an object is an outlier if the nearest neighbours of the object are far away
- Two types of proximity based methods: distance-based and density-based
What is an outlier?
- An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism
- Outliers are data or model glitches
What is data discretisation?
dividing the range of a continuous attribute into intervals
What is the difference between labelling and scoring for outlier detection?
Considering the output of an outlier detection algorithm
Labelling approaches: binary output - data objects are labelled either normal or outlier
Scoring approaches: continuous output - for each object an outlier score is computed, e.g. the probability of it being an outlier
What are the steps of CRISP-DM (Cross-Industry Standard Process for Data Mining)? (6)
Business understanding
Data understanding
Data preparation
Modelling
Evaluation
Deployment
What is Mahalanobis distance for outlier detection?
Let o* be the mean vector for a multivariate dataset. Mahalanobis distance for an object o to o* is:
MDist(o, o*) = (o - o*)^T S^-1 (o - o*), where S is the covariance matrix
Then apply the Grubbs' test outlier detection technique to the MDist values to detect outliers
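A minimal NumPy sketch of the MDist computation (the data matrix X is illustrative):

import numpy as np

X = np.array([[2.0, 1.0], [2.5, 1.2], [3.0, 0.8], [10.0, 9.0]])  # rows = objects

o_star = X.mean(axis=0)             # mean vector o*
S_inv = np.linalg.inv(np.cov(X.T))  # S^-1, inverse of the covariance matrix

diff = X - o_star
mdist = np.einsum("ij,jk,ik->i", diff, S_inv, diff)  # (o - o*)^T S^-1 (o - o*) per row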
What is the time complexity of computing pairwise similarity?
O(n^2)
What is the time complexity of doing pairwise similarity in blocks with k blocks and block size n/k?
O(k(n/k)^2) = O(n^2/k), a k-fold reduction over naive pairwise comparison
What similarity measures can be used for matching features? (6)
- Difference between numerical values
- Jaro for comparing names
- Edit distance for typos
- Phonetic-based
- Jaccard for sets
- Cosine for vectors
What is multivariate data?
data set involving two or more attributes or variables
What is data reduction?
Obtain a reduced representation of the dataset that is much smaller in volume yet produces the same (or almost the same) analytical results
What are the names of 2 techniques for turning categorical data into numerical data?
label encoding, one-hot encoding
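A minimal sketch of both encodings with pandas and scikit-learn (the colour column is illustrative):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"colour": ["red", "green", "red", "blue"]})

labels = LabelEncoder().fit_transform(df["colour"])      # label encoding: one integer per category
one_hot = pd.get_dummies(df["colour"], prefix="colour")  # one-hot: one binary column per category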
What are the three kinds of outliers?
global, contextual, collective
What are examples of data quality metrics? (5)
- Conformance to schema: evaluate constraints on a snapshot
- Conformance to business rules: evaluate constraints on changes in the database
- Accuracy: perform inventory (expensive), or use proxy (track complaints)
- Glitches in analysis
- Successful completion of end-to-end process
What are collective outliers?
A subset of data objects collectively deviates significantly from the whole data set, even if the individual data objects may not be outliers
Need to have the background knowledge on the relationship among the data objects, such as distance or similarity measure on objects
What is the definition of data quality? (7 parts)
- Accuracy: the data was recorded correctly
- Completeness: all relevant data was recorded
- Uniqueness: entities are recorded once
- Timeliness: the data is kept up to date
- Consistency: the data agrees with itself
- Believability: how much the data is trusted by users
- Interpretability: how easy the data is understood
What is z-score normalisation?
Transform the data by converting the values to a common scale with an average of zero and a standard deviation of one
v’ = (v - mean(A))/sd(A)
What ways can you handle noisy data through binning? (3)
- Smoothing by bin means: each value in a bin is replaced by the mean value of the bin
- Smoothing by bin medians: each value in a bin is replaced by the median value of the bin
- Smoothing by bin boundary: the minimum and maximum values in a given bin are identified as the bin boundaries. Each bin value is then replaced by the closest boundary value
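A minimal pandas sketch of smoothing by bin means over equal-depth bins (values are illustrative):

import pandas as pd

prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

bins = pd.qcut(prices, q=3, labels=False)          # 3 equal-frequency bins
smoothed = prices.groupby(bins).transform("mean")  # replace each value by its bin mean
print(smoothed.tolist())  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]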
What is correlation analysis for discretisation?
- Supervised: use class information
- Bottom-up merge: find the best neighbouring intervals to merge
- Initially, each distinct value is an interval; chi-squared tests are performed on every pair of adjacent intervals, and those with the smallest chi-squared values are merged. Merging is performed recursively until a predefined stopping condition is satisfied
What is the python code for filling in missing values?
data.fillna(value)  # inplace=True replaces the values in the original DataFrame
What is the maximum likelihood method for outlier detection?
Assume that the data are normally distributed and learn the parameters (mean, sd) from the input data. An object is an outlier if it is more than 3 standard deviations from the mean, i.e. if its z-score (x - mean)/sd has an absolute value greater than 3
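A minimal NumPy sketch of this rule (the data are synthetic):

import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(10, 1, 100), [25.0]])  # 100 normal points plus one glitch

mu, sd = x.mean(), x.std()  # parameters learned from the input data
z = (x - mu) / sd
print(x[np.abs(z) > 3])     # flags 25.0 as an outlier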
What are the disadvantages of too many or too few bins when smoothing data?
- Too many bins: the data is not smoothed, the noise is kept, and a lot of computation is required
- Too few bins: a lot of detail in the data is hidden
How can you reduce the time complexity of pairwise similarity?
Blocking: divide the records into blocks, perform pairwise comparison between records in the same block only
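A minimal sketch of blocking (the blocking key, the first letter of each record, is illustrative):

from collections import defaultdict
from itertools import combinations

records = ["smith", "smyth", "jones", "johns", "brown"]

blocks = defaultdict(list)
for r in records:
    blocks[r[0]].append(r)  # block on the first letter

# compare pairs within the same block only
pairs = [p for block in blocks.values() for p in combinations(block, 2)]
print(pairs)  # [('smith', 'smyth'), ('jones', 'johns')]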
What is equal width partitioning for discretisation? What are the 2 problems with it?
- Divides the range into N intervals of equal size: uniform grid
- If A and B are the smallest and largest values of the attribute, the width of the intervals will be W = (B-A)/N
- The most straightforward, but outliers may dominate presentation
- Skewed data is not handled well
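A minimal pandas sketch of equal-width partitioning (values are illustrative; note how the outlier 100 pushes most values into the first interval):

import pandas as pd

values = pd.Series([1, 3, 5, 7, 9, 100])

bins = pd.cut(values, bins=3)  # N = 3, so W = (100 - 1) / 3 = 33
print(bins.value_counts())     # five values land in the first interval, the outlier sits alone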
What is equal-depth partitioning for discretisation? What is a problem with it?
- Divides range into N intervals, each containing approximately the same number of samples
- Managing categorical attributes can be tricky
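The equal-depth counterpart with pandas (same illustrative values):

import pandas as pd

values = pd.Series([1, 3, 5, 7, 9, 100])

bins = pd.qcut(values, q=3)  # 3 intervals, each with roughly the same number of samples
print(bins.value_counts())   # two values per interval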