Week 5: Data Preparation Flashcards
What are the problems in the definition of data quality? (4)
- Unmeasurable: accuracy and completeness are extremely difficult, perhaps impossible to measure
- Context independent: no accounting for what is important
- Incomplete: what about interpretability, accessibility, metadata, analysis, etc.?
- Vague: the previous definition provides no guidance towards practical improvements of the data
How are correlation and covariance related?
corr(A,B) = cov(A,B) / (sd(A) * sd(B))
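A minimal NumPy sketch checking the relationship numerically (the arrays a and b are made up for illustration):

import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

cov_ab = np.cov(a, b)[0, 1]                                  # sample covariance of a and b
corr_ab = cov_ab / (np.std(a, ddof=1) * np.std(b, ddof=1))   # cov / (sd(a) * sd(b))
print(np.isclose(corr_ab, np.corrcoef(a, b)[0, 1]))          # True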
How do you compute the levenshtein similarity between strings s1 and s2 in python?
from py_stringmatching import similarity_measure as sm
lev_sim = sm.levenshtein.Levenshtein()
lev_sim.get_sim_score(s1, s2)
What are data quality issues? (7)
- Noise
- Duplicate data
- Outliers
- Unreliable sources
- Inconsistent values
- Outdated values
- Missing values
How do you normalise data by decimal scaling?
Transform the data by moving the decimal points of values of attribute A
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
e.g. if the maximum absolute value of A is 986, divide each value by 1000 (j=3)
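A minimal NumPy sketch of decimal scaling (the example values are made up for illustration):

import numpy as np

values = np.array([-986.0, 120.0, 450.0, 72.0])           # attribute A
j = int(np.floor(np.log10(np.max(np.abs(values))))) + 1   # smallest j with max(|v'|) < 1
scaled = values / (10 ** j)                               # here j = 3, so divide by 1000
print(j, scaled)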
What is the python library for computing similarity measures?
from py_stringmatching import similarity_measure as sm
What is data validation?
checking permitted characters
finding type-mismatched data
What are irrelevant attributes?
Attributes that contain no information that is useful for the data mining task at hand
What are 3 ways of handling missing data? (3)
- Ignore the tuple: usually done when class label is missing - not effective when the % of missing values is large
- Fill in the missing value manually: tedious + infeasible
- Fill in the missing value automatically (data imputation) with: a global constant (e.g. “unknown” or a new class), the attribute mean, the attribute mean for all samples belonging to the same class, or the most probable value found through regression, inference or a decision tree (see the sketch below)
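A minimal pandas sketch of the automatic options (the DataFrame, the class column cls and the value column x are assumptions for illustration):

import pandas as pd

data = pd.DataFrame({"cls": ["a", "a", "b", "b"], "x": [1.0, None, 3.0, None]})

filled_constant = data["x"].fillna(-1)                                              # global constant
filled_mean = data["x"].fillna(data["x"].mean())                                    # attribute mean
filled_class_mean = data["x"].fillna(data.groupby("cls")["x"].transform("mean"))    # mean per class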
How do you reduce data with histograms?
- Divide data into buckets and store the average (or sum) for each bucket
- Partitioning rules: equal-width (equal bucket range) and equal-frequency/equal-depth (each bucket contains the same number of data points); see the sketch below
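A minimal pandas sketch of the two partitioning rules (the values and the number of buckets are made up for illustration):

import pandas as pd

values = pd.Series([2, 3, 5, 7, 11, 13, 17, 19, 23, 29])

equal_width = values.groupby(pd.cut(values, bins=4)).mean()   # equal-width buckets, keep only the mean
equal_freq = values.groupby(pd.qcut(values, q=4)).mean()      # equal-frequency buckets, keep only the mean
print(equal_width)
print(equal_freq)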
What type of discretisation method is binning?
an unsupervised, top-down splitting method
What are the three types of outlier detection methods?
- Supervised methods: domain experts examine and label a sample of the underlying data and the sample is used for testing and training. Outlier detection modelled as a classification problem
- Unsupervised methods: assume that normal objects are somewhat clustered. Outliers are expected to occur far away from any of the groups of normal objects
- Semi-supervised methods: only a small set of the normal or outlier objects are labelled, but most of the data are unlabelled. The labelled normal objects together with unlabelled objects that are close by, can be used to train a model for normal objects
What is the code in python to: fill nas in column 1 with mean values of column 1 grouped by column 2
data["column1"] = data["column1"].fillna(data.groupby("column2")["column1"].transform("mean"))
What is the python code for removing missing values?
data.dropna()
What is univariate data?
data set involving only one attribute or variable
How do you reduce data using clustering?
Partition data set into clusters based on similarity and store cluster representation (e.g. centroid and diameter) only
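A minimal scikit-learn sketch (the data X, the number of clusters and the diameter definition are assumptions for illustration):

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(1000, 3)                       # original data
km = KMeans(n_clusters=10, n_init=10).fit(X)      # partition into clusters by similarity
centroids = km.cluster_centers_                   # store only the cluster representations
diameters = np.array([                            # e.g. max distance of a member to its centroid
    np.linalg.norm(X[km.labels_ == k] - centroids[k], axis=1).max()
    for k in range(10)
])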
How do you normalise data by z-score in python?
from sklearn.preprocessing import StandardScaler
StandardScaler().fit_transform(df)
What are proximity based methods for outlier detection?
- Assume that an object is an outlier if the nearest neighbours of the object are far away
- Two types of proximity based methods: distance-based and density-based
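A minimal sketch of a distance-based variant that uses the k-nearest-neighbour distance as the outlier score (X, k and the 95% cut-off are assumptions for illustration):

import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.rand(200, 2)
nn = NearestNeighbors(n_neighbors=6).fit(X)                      # 6 = the point itself + 5 neighbours
dists, _ = nn.kneighbors(X)
knn_dist = dists[:, -1]                                          # distance to the 5th nearest neighbour
outliers = np.where(knn_dist > np.percentile(knn_dist, 95))[0]   # far from everything -> flagged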
What is an outlier?
- An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism
- Outliers are data or model glitches
What is data discretisation?
dividing the range of a continuous attribute into intervals
What is the difference between labelling and scoring for outlier detection?
Considering the output of an outlier detection algorithm:
Labelling approaches: binary output - data objects are labelled either normal or outlier
Scoring approaches: continuous output - for each object an outlier score is computed, e.g. the probability of it being an outlier
What are the steps of CRISP-DM (Cross-Industry Standard Process for Data Mining)? (6)
Business understanding
Data understanding
Data preparation
Modelling
Evaluation
Deployment
What is mahalanobis distance for outlier detection?
Let o* be the mean vector for a multivariate dataset. Mahalanobis distance for an object o to o* is:
MDist(o, o*) = (o - o*)^T S^-1 (o - o*), where S is the covariance matrix of the data
Then apply an outlier detection technique such as Grubbs' test to the MDist values to detect outliers
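A minimal NumPy sketch (X is an assumed data matrix; for simplicity a 95th-percentile cut-off stands in for the Grubbs test):

import numpy as np

X = np.random.rand(500, 3)
mean = X.mean(axis=0)                                  # o*, the mean vector
S_inv = np.linalg.inv(np.cov(X, rowvar=False))         # inverse covariance matrix
diff = X - mean
mdist = np.einsum("ij,jk,ik->i", diff, S_inv, diff)    # (o - o*)^T S^-1 (o - o*) for every object o
outliers = np.where(mdist > np.percentile(mdist, 95))[0]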
What is the time complexity of computing pairwise similarity?
O(n^2)
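A minimal sketch showing where the quadratic cost comes from: every string is compared with every other string, i.e. n(n-1)/2 comparisons (the example strings are made up; the Levenshtein measure from py_stringmatching is reused):

from py_stringmatching import similarity_measure as sm

strings = ["data", "date", "mining", "miner"]
lev_sim = sm.levenshtein.Levenshtein()
scores = {}
for i in range(len(strings)):
    for j in range(i + 1, len(strings)):               # nested loop over all pairs -> O(n^2)
        scores[(i, j)] = lev_sim.get_sim_score(strings[i], strings[j])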