Week 5: Data Preparation Flashcards
What are the problems in the definition of data quality? (4)
- Unmeasurable: accuracy and completeness are extremely difficult, perhaps impossible to measure
- Context independent: no accounting for what is important
- Incomplete: what about interpretability, accessibility, metadata, analysis, etc.?
- Vague: the previous definition provides no guidance towards practical improvements of the data
How are correlation and covariance related?
corr(A,B) = cov(A,B) / (sd(A) * sd(B))
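A quick NumPy check of this relationship (a minimal sketch; the arrays are illustrative):

import numpy as np

A = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
B = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

cov_AB = np.cov(A, B)[0, 1]  # sample covariance
corr_AB = cov_AB / (np.std(A, ddof=1) * np.std(B, ddof=1))
print(np.isclose(corr_AB, np.corrcoef(A, B)[0, 1]))  # True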
How do you compute the Levenshtein similarity between strings s1 and s2 in python?
from py_stringmatching import similarity_measure as sm
lev_sim = sm.levenshtein.Levenshtein()
lev_sim.get_sim_score(s1, s2)  # normalised similarity in [0, 1]
What are data quality issues? (7)
- Noise
- Duplicate data
- Outliers
- Unreliable sources
- Inconsistent values
- Outdated values
- Missing values
How do you normalise data by decimal scaling?
Transform the data by moving the decimal points of values of attribute A
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
e.g. if the maximum absolute value of A is 986, divide each value by 1000 (j=3)
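A minimal NumPy sketch of decimal scaling (the array A is illustrative):

import numpy as np

A = np.array([12.0, -986.0, 310.0])  # max absolute value is 986

# smallest integer j such that max(|v / 10^j|) < 1
j = int(np.floor(np.log10(np.abs(A).max()))) + 1
A_scaled = A / 10**j  # j = 3 here, so -986 -> -0.986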
What is the python library for computing similarity measures?
from py_stringmatching import similarity_measure as sm
What is data validation?
- checking permitted characters
- finding type-mismatched data
What are irrelevant attributes?
Attributes that contain no information that is useful for the data mining task at hand
What are 3 ways of handling missing data? (3)
- Ignore the tuple: usually done when class label is missing - not effective when the % of missing values is large
- Fill in the missing value manually: tedious + infeasible
- Fill in the missing value automatically (data imputation) with: a global constant, e.g. “unknown” or a new class; the attribute mean; the attribute mean for all samples belonging to the same class; or the most probable value, found through regression, inference or a decision tree
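A minimal pandas sketch of three of the automatic options (column names are illustrative):

import pandas as pd

df = pd.DataFrame({"income": [50.0, None, 70.0, None],
                   "class": ["a", "a", "b", "b"]})

const_fill = df["income"].fillna(-1)                   # global constant
mean_fill = df["income"].fillna(df["income"].mean())   # attribute mean
class_fill = df["income"].fillna(                      # per-class attribute mean
    df.groupby("class")["income"].transform("mean"))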
How do you reduce data with histograms?
- Divide data into buckets and store the average (or sum) for each bucket
- Partitioning rules: equal-width (equal bucket range) and equal-frequency (equal depth) (each bucket contains same number of data points)
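A minimal NumPy sketch of both partitioning rules (the data array is illustrative):

import numpy as np

data = np.array([1, 1, 5, 5, 5, 8, 10, 10, 14, 15])

counts, edges = np.histogram(data, bins=3)         # equal-width: 3 buckets of equal range
q_edges = np.quantile(data, np.linspace(0, 1, 4))  # equal-frequency: edges at quantiles

# store one average per bucket instead of the raw values
idx = np.digitize(data, edges[1:-1])
bucket_means = [data[idx == b].mean() for b in range(3)]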
What type of discretisation method is binning?
an unsupervised, top-down splitting method
What are the three types of outlier detection methods?
- Supervised methods: domain experts examine and label a sample of the underlying data; the sample is then used for training and testing. Outlier detection is modelled as a classification problem
- Unsupervised methods: assume that normal objects are somewhat clustered. Outliers are expected to occur far away from any of the groups of normal objects
- Semi-supervised methods: only a small set of the normal or outlier objects are labelled, but most of the data are unlabelled. The labelled normal objects together with unlabelled objects that are close by, can be used to train a model for normal objects
What is the code in python to fill NAs in column 1 with mean values of column 1 grouped by column 2?
data["column1"] = data["column1"].fillna(data.groupby("column2")["column1"].transform("mean"))
What is the python code for removing missing values?
data.dropna()
What is univariate data?
data set involving only one attribute or variable
How do you reduce data using clustering?
Partition data set into clusters based on similarity and store cluster representation (e.g. centroid and diameter) only
How do you normalise data by z-score in python?
from sklearn.preprocessing import StandardScaler
StandardScaler().fit_transform(df)  # column-wise (v - mean) / sd; returns a NumPy array
What are proximity based methods for outlier detection?
- Assume that an object is an outlier if the nearest neighbours of the object are far away
- Two types of proximity based methods: distance-based and density-based
What is an outlier?
- An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism
- Outliers are data or model glitches
What is data discretisation?
dividing the range of a continuous attribute into intervals
What is the difference between labelling and scoring for outlier detection?
Considering the output of an outlier detection algorithm
Labelling approaches: binary output - data objects are labelled either normal or outlier
Scoring approaches: continuous output - for each object an outlier score is computed, e.g. the probability of it being an outlier
What are the steps of CRISP-DM (Cross-Industry Standard Process for Data Mining)? (6)
Business understanding
Data understanding
Data preparation
Modelling
Evaluation
Deployment
What is Mahalanobis distance for outlier detection?
Let o* be the mean vector for a multivariate dataset. Mahalanobis distance for an object o to o* is:
MDist(o, o*) = (o - o*)^T S^-1 (o - o*), where S is the covariance matrix
Then apply the Grubbs' test outlier detection technique to the MDist values to detect outliers
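A minimal NumPy sketch of the MDist computation (the data matrix X is illustrative):

import numpy as np

X = np.array([[2.0, 1.0], [2.5, 1.2], [3.0, 0.8], [10.0, 9.0]])  # rows = objects

o_star = X.mean(axis=0)             # mean vector o*
S_inv = np.linalg.inv(np.cov(X.T))  # S^-1, inverse of the covariance matrix

diff = X - o_star
mdist = np.einsum("ij,jk,ik->i", diff, S_inv, diff)  # (o - o*)^T S^-1 (o - o*) per row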
What is the time complexity of computing pairwise similarity?
O(n^2)
What is the time complexity of doing pairwise similarity in blocks with k blocks and block size n/k?
O(k(n/k)^2) = O(n^2/k), a k-fold reduction over naive pairwise comparison
What similarity measures can be used for matching features? (6)
- Difference between numerical values
- Jaro for comparing names
- Edit distance for typos
- Phonetic-based
- Jaccard for sets
- Cosine for vectors
What is multivariate data?
data set involving two or more attributes or variables
What is data reduction?
Obtain a reduced representation of the dataset that is much smaller in volume yet produces the same (or almost the same) analytical results
What are the names of 2 techniques for turning categorical data into numerical data?
label encoding, one-hot encoding
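A minimal sketch of both encodings with pandas and scikit-learn (the colour column is illustrative):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"colour": ["red", "green", "red", "blue"]})

labels = LabelEncoder().fit_transform(df["colour"])      # label encoding: one integer per category
one_hot = pd.get_dummies(df["colour"], prefix="colour")  # one-hot: one binary column per category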
What are the three kinds of outliers?
global, contextual, collective
What are examples of data quality metrics? (5)
- Conformance to schema: evaluate constraints on a snapshot
- Conformance to business rules: evaluate constraints on changes in the database
- Accuracy: perform inventory (expensive), or use proxy (track complaints)
- Glitches in analysis
- Successful completion of end-to-end process
What are collective outliers?
A subset of data objects collectively deviates significantly from the whole data set, even if the individual data objects may not be outliers
Need to have the background knowledge on the relationship among the data objects, such as distance or similarity measure on objects
What is the definition of data quality? (7 parts)
- Accuracy: the data was recorded correctly
- Completeness: all relevant data was recorded
- Uniqueness: entities are recorded once
- Timeliness: the data is kept up to date
- Consistency: the data agrees with itself
- Believability: how much the data is trusted by users
- Interpretability: how easy the data is understood
What is z-score normalisation?
Transform the data by converting the values to a common scale with an average of zero and a standard deviation of one
v’ = (v - mean(A))/sd(A)
What ways can you handle noisy data through binning? (3)
- Smoothing by bin means: each value in a bin is replaced by the mean value of the bin
- Smoothing by bin medians: each value in a bin is replaced by the median value of the bin
- Smoothing by bin boundary: the minimum and maximum values in a given bin are identified as the bin boundaries. Each bin value is then replaced by the closest boundary value
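A minimal pandas sketch of smoothing by bin means over equal-depth bins (values are illustrative):

import pandas as pd

prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

bins = pd.qcut(prices, q=3, labels=False)          # 3 equal-frequency bins
smoothed = prices.groupby(bins).transform("mean")  # replace each value by its bin mean
print(smoothed.tolist())  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]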
What is correlation analysis for discretisation?
- Supervised: use class information
- Bottom-up merge: find the best neighbouring intervals to merge
- Initially, each distinct value is an interval; chi-squared tests are performed on every pair of adjacent intervals, and those with the smallest chi-squared values are merged. Merging is performed recursively until a predefined stopping condition is satisfied
What is the python code for filling in missing values?
data.fillna(value)  # inplace=True replaces the values in the original DataFrame
What is the maximum likelihood method for outlier detection?
Assume that the data are normally distributed and learn the parameters (mean, sd) from the input data. An object is an outlier if it is more than 3 standard deviations from the mean, i.e. if its z-score (x - mean)/sd has an absolute value greater than 3
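A minimal NumPy sketch of this rule (the data are synthetic):

import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(10, 1, 100), [25.0]])  # 100 normal points plus one glitch

mu, sd = x.mean(), x.std()  # parameters learned from the input data
z = (x - mu) / sd
print(x[np.abs(z) > 3])     # flags 25.0 as an outlier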
What are the disadvantages of too many or too few bins when smoothing data?
- Too many bins: the data is not smoothed, the noise is kept, and a lot of computation is required
- Too few bins: a lot of detail in the data is hidden
How can you reduce the time complexity of pairwise similarity?
Blocking: divide the records into blocks, perform pairwise comparison between records in the same block only
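A minimal sketch of blocking (the blocking key, the first letter of each record, is illustrative):

from collections import defaultdict
from itertools import combinations

records = ["smith", "smyth", "jones", "johns", "brown"]

blocks = defaultdict(list)
for r in records:
    blocks[r[0]].append(r)  # block on the first letter

# compare pairs within the same block only
pairs = [p for block in blocks.values() for p in combinations(block, 2)]
print(pairs)  # [('smith', 'smyth'), ('jones', 'johns')]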
What is equal width partitioning for discretisation? What are the 2 problems with it?
- Divides the range into N intervals of equal size: uniform grid
- If A and B are the smallest and largest values of the attribute, the width of the intervals will be W = (B-A)/N
- The most straightforward, but outliers may dominate presentation
- Skewed data is not handled well
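A minimal pandas sketch of equal-width partitioning (values are illustrative; note how the outlier 100 pushes most values into the first interval):

import pandas as pd

values = pd.Series([1, 3, 5, 7, 9, 100])

bins = pd.cut(values, bins=3)  # N = 3, so W = (100 - 1) / 3 = 33
print(bins.value_counts())     # five values land in the first interval, the outlier sits alone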
What is equal-depth partitioning for discretisation? What is a problem with it?
- Divides range into N intervals, each containing approximately the same number of samples
- Managing categorical attributes can be tricky
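The equal-depth counterpart with pandas (same illustrative values):

import pandas as pd

values = pd.Series([1, 3, 5, 7, 9, 100])

bins = pd.qcut(values, q=3)  # 3 intervals, each with roughly the same number of samples
print(bins.value_counts())   # two values per interval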