Week 5: Data Preparation Flashcards

Question

What is the time complexity of doing pairwise similarity in blocks with k blocks and block size n/k?

Answer 1

O(k(n/k)^2)

Answer 2

- Difference between numerical values - Jaro for comparing names - Edit distance for typos - Phonetic-based - Jaccard for sets - Cosine for vectors

Answer 3

data set involving two or more attributes or variables

Answer 4

Obtain a reduced representation of the dataset that is much smaller in volume but yet produces the same (or almost the same) analytical results

Answer 5

label encoding, one-hot encoding

Answer 6

global, contextual, collective

Answer 7

- Conformance to schema: evaluate constraints on a snapshot - Conformance to business rules: evaluate constraints on changes in the database - Accuracy: perform inventory (expensive), or use proxy (track complaints) - Glitches in analysis - Successful completion of end-to-end process

Answer 8

A subset of data objects collectively deviate significantly from the whole data set, even if the individual data object may not be outliers Need to have the background knowledge on the relationship among the data objects, such as distance or similarity measure on objects

Answer 9

- Accuracy: the data was recorded correctly - Completeness: all relevant data was recorded - Uniqueness: entities are recorded once - Timeliness: the data is kept up to date - Consistency: the data agrees with itself - Believability: how much the data is trusted by users - Interpretability: how easy the data is understood

Answer 10

Transform the data by converting the values to a common scale with an average of zero and a standard deviation of one v’ = (v - mean(A))/sd(A)

Answer 11

- Smoothing by bin means: each value in a bin is replaced by the mean value of the bin - Smoothing by bin medians: each value in a bin is replaced by the median value of the bin - Smoothing by bin boundary: the minimum and maximum values in a given bin are identified as the bin boundaries. Each bin value is then replaced by the closest boundary value

Answer 12

- Supervised: use class information - Bottom-up merge: find the best neighbouring intervals to merge - Initially each distinct value is an interval, Chi squared tests are performed on every adjacent interval and those with the least chi squared values are merged together. Merge performed recursively until a predefined stopping condition is satisfied

Answer 13

``` data.fillna() #inplace=TRUE replaces the values in the original dataframe ```

Answer 14

Assume that the data are normally distributed, learn the parameters from the input data. An object is an outlier if it is more than 3sd from the mean. Ie the z-score (x-mean/sd) has absolute value more than 3

Answer 15

- Too many bins, won’t smooth data, will keep the noise, lot of computation required - Too little bins, hide a lot of details in the data

Answer 16

Blocking: divide the records into blocks, perform pairwise comparison between records in the same block only

Answer 17

- Divides the range into N intervals of equal size: uniform grid - If A and B are the smallest and largest values of the attribute, the width of the intervals will be W = (B-A)/N - The most straightforward, but outliers may dominate presentation - Skewed data is not handled well

Answer 18

- Divides range into N intervals, each containing approximately the same number of samples - Managing categorical attributes can be tricky

Answer 19

Transformations are applied to obtain a reduced or compressed representation of the original data

Answer 20

Tests the hypothesis that attributes A and B are independent based on the chi-squared statistic

Answer 21

Assumes that the normal data is generated by a parametric distribution with the parameter theta - The probability density function of the parametric distribution f(x, gamma) gives the probability that x is generated by the distribution - The smaller this value, the more likely x is an outlier

Answer 22

The closest cluster is far from x

Answer 23

Using a histogram: use histogram to graph results as a percentage, a number is in outlier if it falls with a very small percentage of the data Or use a kernel density estimation the probability density distribution of the data. For an object o, the density function f(o) gives the estimated probability that the object is generated by the stochastic process. If f(o) is low the object is likely an outlier

Answer 24

Transform the multi aria text outlier detection task into a univariate outlier detection problem

Answer 25

Transform the data from a given range with [minA, maxA] to a new interval [new\_maxA, new\_minA] for a given attribute A v' = (v - minA)/(maxA - minA) \* (newmaxA - newminA) + newminA where v is the current value

Answer 26

- Smoothing: removing noise from the data. includes binning, regression, clustering - Attribute/feature construction: new attributes constructed from the given ones - Aggregation: summary or aggregation operations applied, data cube construction - Normalisation: scaled to fall within a smaller, specified range. Includes min-max normalisation, Z-score normalisation, normalisation by decimal scaling - Data reformatting: e.g. Jack Wilsher -\> Wilsher, J. - Using the same unit: e.g. inches and cm - Discretisation: raw values of numeric data attributes by interval labels or conceptual labels - Concept hierarchy generation: attributes such as street generalised to higher level concepts like city

Answer 27

df = pd.DataFrame(np.arange(20).reshape(5, 4))

Answer 28

Removing irrelevant or redundant attributes

Answer 29

- similarity measures have different scales - pairwise similarity between records is expensive?

Answer 30

Effective if data is clustered but not if data is “smeared”

Answer 31

- Schema matching: e.g. contact number vs phone - Compound attributes: e.g. address vs street, city, zip

Answer 32

obtaining a small sample s to represent the whole data set N. choose a representative subset of the data

Answer 33

Assume that the normal data objects are generated by a stochastic process (a generative model) and that data not following the model are outliers. Learn a generative model fitting the given data set, and then identify the objects in low probability regions of the model as outliers

Answer 34

- Inconsistent: containing discrepancies in codes or names - Intentional: e.g. disguised missing data such as Jan 1st for all birthdays

Answer 35

Global approaches: the reference set contains all other data objects Local approaches: the reference contains a small subset of data objects and there is no assumption on the number of normal mechanisms

Answer 36

numeric data only

Answer 37

- Reflects the use of the data - Leads to improvements in processes - Measurable (we can define metrics)

Answer 38

``` #take sample of 3 rows without replacement: df.sample(3) ``` ``` #take sample of 3 rows with replacement: df.sample(3, replace=True) ```

Answer 39

integrate metadata from different sources

Answer 40

- Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the dataset - The attribute with the most distinct values is placed at the lowest level of the hierarchy - E.g. Country (highest level) -\> state -\> city -\> street (lowest level) - This is also a type of data smoothing

Answer 41

assume the normal data is generated by a mixture of normal distributions For any object o in the dataset, the probability that o is generated by a mixture of distributions is the sum of the probability density functions at o Use the EM algorithm to learn the parameters of the data and an object is an outlier if it does not belong to any of the main groups of the data

Answer 42

contextual attributes define the context, behavioural attributes define the characteristics of the object used in outlier evaluation

Answer 43

- Find a projection that captures the largest amount of variation in data - We find the eigenvectors of the covariance matrix, and these eigenvectors define the new space

Answer 44

Attributes that duplicate much or all of the information contained in one or more other attributes

Answer 45

Given two records, compute a vector of similarity scores for corresponding features -Score can be Boolean (match/mismatch) or a continuous value based on specific similarity measure (distance function)

Answer 46

A function that maps the entires set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new value

Answer 47

Quantifies the local density of a data point with the use of a neighbourhood of size k -Introduces a smoothing parameter: reachability distance RD RDk(x,y) = max{K dist(x), dist(x,y)}, where K dist(x) is the distance between x and its K-nearest neighbour -the local reachability distance of point x is: LRDk(x) = k/[sum of y in KNN(x) \* RDk(x,y)] -the local outlier factor LOF is: LOFk(x) = sum of y in [KNN(x)\*LRDk(y)/LRDk(x)] / k -Generally, LOF \>1 means x has a lower density than its neighbours

Answer 48

``` split = top down method merge = bottom up method ```

Answer 49

Random error or variance in a measured variable

Answer 50

- Supervised: given class labels, top down recursive split - Using entropy to determine split point (discretisation point)

Answer 51

Must use density. distance based can't detect local outliers

Answer 52

jaro\_sim = sm.jaro.Jaro() jaro\_sim.get\_raw\_score(s1, s2)

Answer 53

remove unimportant attributes

Answer 54

``` #fill each na with the value before it data.fillna(method=‘pad') or method=‘ffill’ ``` ``` #fill each na with the value after it data.fillna(method=‘bfill’) or method=‘backfill’ ``` ``` #set a limit on the number of forward or backward fills data.fillna(method=‘pad’, limit=1) ```

Answer 55

np.mean(data)

Answer 56

- Binning: first sort data and partition into equal frequency (equidepth) bins, then one can smooth by bin means, smooth by bin median, smooth by bin boundaries etc - Regression: smooth by fitting the data into regression functions - Clustering: detect and remove outliers that do not belong to any of the clusters - Combined computer and human inspection: detect suspicious values and check by human

Answer 57

Redundant attributes can be detected by correlation and covariance analysis

Answer 58

Combining data from multiple sources into a coherent data store

Answer 59

Novelty detection involves seeing if new data fits with an existing data or would be considered an outlier

Answer 60

Don't assume an a-priori statistical model and determine the model from the input data e.g. histogram and kernel density estimation

Answer 61

Simple random sampling may have poor performance in the presence of skew

Answer 62

- Modelling normal objects and outliers properly - Application-specific outlier detection - Handling noise in outlier detection - Understandability - A data set may have multiple types of outlier - One object may belong to more than one type of outlier

Answer 63

Given N data vectors from d-dimensions, find k \<= d principal components that can accurately represent the data. Steps: - Normalise the input data: so that each attribute falls within the same range - Compute k orthonormal (unit) vectors i.e. principal components. These are unit vectors that each point in a direction perpendicular to the others. Each input data (vector) is a linear combination of the k principal components - The principal components are sorted in order of decreasing significance or strength. The principal components serve as new set of axes for the data. The first axis (first ranked principal component) shows the most variance among the data - The components are sorted. Reduce the data dimensionality by eliminating the weak components. Weak components have low variance

Answer 64

Object is a global outlier (Og) (or point anomaly) if it significantly deviates from the rest of the data set Issue: find an appropriate measure of deviation

Answer 65

The cost of obtaining a sample is proportional to the size of the sample s, not the size of the dataset N. Therefore sampling complexity is potentially sublinear to the size of the data

Answer 66

- Replace the original data volume by alternative, smaller forms of data representation - Includes modelling, histograms, clustering, sampling and data cube aggregation

Answer 67

Object is Oc (or conditional outlier) if it deviates significantly based on a selected context Issue: how to define or formulae meaningful context

Answer 68

- Data can be aggregated for example if you have the sales for each quarter, create a new variable with yearly sales. the resulting dataset is smaller - Data cubes store multidimensional aggregated information

Answer 69

Judge a point based on its distance to its neighbours Given a radius (r) and a percentage (pi), a datapoint x is considered to be an outlier if the ratio of all other data points that have a distance less than r to x to the total size of the dataset is less than pi

Answer 70

Assume that the normal data objects belong to large and dense clusters, whereas outliers belong to small or sparse clusters

Answer 71

- Stepwise forward selection: starts with empty set of attributes. Best of original attributes are determined and added to the set at each step - Stepwise backward elimination: starts with full set of attributes. At each step, removes worst of remaining attributes - Combination of forward selection and backward elimination: start with empty set, combine methods so that at each step the procedure adds the best attribute to reduced set and removes the worst attribute from initial set - Decision tree induction: tree is constructed from given data. All attributes that do not appear in the tree are considered irrelevant

Answer 72

- Principal component analysis (PCA) - Singular value decomposition (SVD) - Feature subset selection, feature creation

Answer 73

aff = sm.affine.Affine(...) aff.get\_raw\_score(s1, s2)

Answer 74

np.add(A, B)

Answer 75

Use a model to summarise the data e.g. linear regression. data points that do not conform to the model are potential outliers

Answer 76

Fit a model to the data and save the model instead

Answer 77

- Binning - Histograms - Clustering - Classification (e.g. decision trees) - Correlation

Answer 78

Problem of identifying and linking/grouping different representations of the same real-world object

Answer 79

capitalisation, white space normalisation, correcting typos, replacing abbreviations, variations, nick names

Answer 80

when dimensionality increases, data becomes increasingly spare and density and distance between points becomes less meaningful

Answer 81

- Simple random sampling: there is an equal probability of selecting any particular item - Simple random sampling without replacement: once an object is selected, it is removed from the population - Simple random sampling with replacement: a selected object is not removed from the population - Cluster sampling: random sampling of clusters - Stratified sampling: partition data set and draw samples from each partition proportionally, i.e. approximately the same percentage of the data. Used in conjunction with skewed data

Week 5: Data Preparation Flashcards

(106 cards)