lecture 2: data engineering Flashcards
types of data
continuous, discrete, ordinal, categorical, missing, censored
what is continuous data
measured on a quantitative scale; can take any value, including fractions
what is discrete data
data that can take only a countable number of values between any 2 points
what is ordinal data
hint ordinal = ordered
has a fixed number of possible values (<100), called levels, that are ranked/ordered
what is categorical data
multiple categories that are not ordered
what is missing data and how do we deal with it
a data point whose value is missing and whose missingness mechanism we do not know
we should use a non-numeric code to denote such data, eg. NA
what is censored data and how do we deal with it
a missing value where we know the mechanism on some level
coded as NA as well; add a column recording censored as TRUE or FALSE
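A minimal sketch of the two codings above, assuming `None` stands in for NA and using a made-up `months` feature with a `censored` flag column (all names and values here are hypothetical):

```python
# Hypothetical records: None plays the role of NA.
# "censored" is True only when the value is missing for a known reason.
records = [
    {"patient": "a", "months": 14.0, "censored": False},  # observed value
    {"patient": "b", "months": None, "censored": True},   # censored: mechanism known
    {"patient": "c", "months": None, "censored": False},  # missing: mechanism unknown
]


def is_missing(record):
    """Missing (not censored): the value is NA and no censoring flag is set."""
    return record["months"] is None and not record["censored"]


missing = [r["patient"] for r in records if is_missing(r)]
censored = [r["patient"] for r in records if r["censored"]]
```

Both kinds of gap share the same NA code; only the extra column tells them apart.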
what is a top down vs bottom up approach
top down: starting from a problem/question and then finding the data to solve that problem
bottom up: starting from the data set, then studying and analysing it to see what problem/question it can solve
what is data wrangling
process of transforming/mapping raw data into another format to make it more appropriate for downstream analytics
*downstream analytics: just means the data is used after some processing step, for a specific purpose
types of data wrangling
scaling (eg. min-max), clipping (eg. feature clipping), z-score standardisation
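A rough sketch of min-max scaling and feature clipping in plain Python. The function names, the `heights` values and the clipping bounds are assumptions for illustration, not from the lecture:

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def clip(values, lower, upper):
    """Cap every value to the interval [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]


heights = [150.0, 160.0, 170.0, 250.0]   # 250.0 is an outlier
scaled = min_max_scale(heights)
clipped = clip(heights, 140.0, 200.0)    # the outlier is capped at 200.0
```

Note that min-max scaling is sensitive to outliers (one extreme value squeezes the rest towards 0), which is one reason to clip first.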
when to use z-score standardisation
when the values in each independent dimension (feature) of the data are normally distributed
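A minimal sketch of z-score standardisation, z = (x - mean) / standard deviation, applied to one dimension. The function name and sample values are made up, and the population standard deviation is an assumed choice:

```python
import statistics


def z_scores(values):
    """Standardise values to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumed choice)
    if std == 0:
        # Constant feature: every value sits exactly at the mean.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # mean 5, standard deviation 2
standardised = z_scores(data)
```

With several dimensions, the same function would be applied to each dimension separately.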
what is data cleaning/cleansing
process of detecting and correcting/removing corrupt or inaccurate records from a data set
what are the methods of dealing with missing features
- removing the examples with missing features (only if dataset is big enough)
- using a learning algorithm that can handle missing feature values
- using data imputation
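The first option in the list above (removing examples with missing features) could look like this sketch, assuming `None` marks a missing feature; the `age` and `income` fields are hypothetical:

```python
# Hypothetical dataset: None marks a missing feature value.
examples = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},
    {"age": 29, "income": None},
    {"age": 41, "income": 48000},
]


def drop_incomplete(rows):
    """Keep only the examples in which every feature has a value."""
    return [row for row in rows if all(v is not None for v in row.values())]


complete = drop_incomplete(examples)
```

Here half of the examples are lost, which is why this option only makes sense when the dataset is big enough.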
what is data imputation
replacing the missing value of a feature with the average value of this feature in the dataset
OR
replacing the missing value with a value outside the normal range of values
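Both imputation strategies, sketched for a single feature. `None` is assumed to mark a missing value, and the `ages` list and the sentinel `-1` are made-up examples:

```python
from statistics import mean


def impute_mean(values):
    """Replace each missing value with the average of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def impute_out_of_range(values, sentinel):
    """Replace each missing value with a sentinel outside the normal range."""
    return [sentinel if v is None else v for v in values]


ages = [20, None, 40, None, 30]
mean_filled = impute_mean(ages)                  # gaps become the mean of 20, 40, 30
sentinel_filled = impute_out_of_range(ages, -1)  # -1 can never be a real age
```

The out-of-range value lets a learning algorithm treat "was missing" as its own signal, while the average keeps the feature within its usual range.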
what is data integrity
maintenance and assurance of the accuracy and consistency of data over its entire life cycle, eg. credit card numbers
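The card does not say how integrity is checked. As one illustration for the credit card example, card numbers carry a check digit that can be verified with the Luhn checksum, which catches most single-digit typos. A sketch (the function name is made up, and the number used is a standard test value, not a real card):

```python
def luhn_valid(number):
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as adding the two digits of the product
        total += d
    return total % 10 == 0


ok = luhn_valid("79927398713")       # well-known Luhn test number
corrupt = luhn_valid("79927398710")  # last digit changed
```

Running such a check whenever the number is stored or copied is one way to assure consistency over the data's life cycle.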