lecture 2: data engineering Flashcards

Question 1

Q

types of data

Answer

A

continuous, discrete, ordinal, categorical, missing, censored

Question 2

Q

what is continuous data

Answer

A

measured on a quantitative scale; can take any real value, including fractions

Question 3

Q

what is discrete data

Answer

A

data that can take only a countable number of values between any two points (eg. counts)

Question 4

Q

what is ordinal data

hint: ordinal = ordered

Answer

A

data with a fixed number of possible values (<100), called levels, that are ranked/ordered

Question 5

Q

what is categorical data

Answer

A

data that falls into multiple categories that are not ordered

Question 6

Q

what is missing data and how do we deal with it

Answer

A

a data point that is missing and for which we do not know the mechanism behind the missingness

we should use a non-numeric code to denote such data, eg. NA

Question 7

Q

what is censored data and how do we deal with it

Answer

A

a missing value for which we know the mechanism on some level

also coded as NA, with an extra column that marks whether the value is censored (TRUE or FALSE)
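The encoding above can be sketched in plain Python. This is only an illustration under stated assumptions: `None` stands in for NA, the raw column is a hypothetical list of strings, and a leading `>` is an assumed convention for "value exceeded the measurable limit" (the known mechanism).

```python
NA = None  # stand-in for a non-numeric NA code

def encode(raw):
    """Return (value, censored) for one raw cell (hypothetical format)."""
    raw = raw.strip()
    if raw == "":
        return (NA, False)   # missing, mechanism unknown
    if raw.startswith(">"):
        return (NA, True)    # censored: we know why the value is absent
    return (float(raw), False)

# Hypothetical raw column: one blank cell and one reading above the limit.
raw_column = ["5.0", "12.5", "", ">24", "18.0"]
values, censored = zip(*(encode(cell) for cell in raw_column))
```

Both the blank cell and the `>24` cell end up as NA in `values`; only the `censored` column tells them apart.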

Question 8

Q

what is a top down vs bottom up approach

Answer

A

top down: starting from a problem/question and then finding the data to solve that problem
bottom up: starting from the data set, then studying and analysing it to see what problem/question it can solve

Question 9

Q

what is data wrangling

Answer

A

the process of transforming/mapping raw data into another format to make it more appropriate for downstream analytics
*downstream analytics: the data is used, after some processing step, for a specific purpose

Question 10

Q

types of data wrangling

Answer

A

scaling (eg. min-max), clipping (eg. feature clipping), z-score standardisation
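A minimal sketch of min-max scaling and feature clipping, assuming a single numeric feature held in a Python list. The function names, the sample ages, and the clipping bounds are illustrative choices, not part of the lecture.

```python
def min_max_scale(xs):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return [0.0 for _ in xs]  # constant feature: avoid division by zero
    return [(x - lo) / (hi - lo) for x in xs]

def clip(xs, lower, upper):
    """Cap every value to the interval [lower, upper]."""
    return [min(max(x, lower), upper) for x in xs]

ages = [18, 25, 40, 62, 120]       # hypothetical feature with one extreme value
clipped = clip(ages, 18, 90)       # the extreme value 120 is capped at 90
scaled = min_max_scale(clipped)    # now lies in [0, 1]
```

Clipping before scaling keeps one extreme value from squeezing every other point towards 0.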

Question 11

Q

when to use z-score standardisation

Answer

A

when the values in each independent dimension (feature) of the data are approximately normally distributed
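For reference, z-score standardisation replaces each value x with (x - mean) / standard deviation, so the feature ends up with mean 0 and standard deviation 1. A small sketch using the standard library, assuming the population standard deviation is wanted and using made-up sample values:

```python
import statistics

def z_scores(xs):
    """Standardise values to mean 0 and standard deviation 1."""
    mu = statistics.mean(xs)
    sigma = statistics.pstdev(xs)  # population standard deviation (assumed choice)
    if sigma == 0:
        return [0.0 for _ in xs]   # constant feature: nothing to standardise
    return [(x - mu) / sigma for x in xs]

sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # mean 5, std dev 2
standardised = z_scores(sample)
```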

Question 12

Q

what is data cleaning/cleansing

Answer

A

process of detecting and correcting/removing corrupt or inaccurate records from a data set

Question 13

Q

what are the methods of dealing with missing features

Answer

A

removing the examples with missing features (only if the dataset is big enough)
using a learning algorithm that can handle missing feature values
using data imputation
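The first option, removing examples with missing features, can be sketched as below. The rows and field names are hypothetical, and `None` is assumed to mark a missing feature.

```python
def drop_incomplete(rows):
    """Keep only the examples in which every feature is present."""
    return [row for row in rows if all(v is not None for v in row.values())]

# Hypothetical dataset: two of the four examples have a missing feature.
rows = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},
    {"age": 29, "income": None},
    {"age": 45, "income": 58000},
]
complete = drop_incomplete(rows)
```

Half of this toy dataset is discarded, which is why the approach only suits datasets that are big enough.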

Question 14

Q

what is data imputation

Answer

A

replacing the missing value of a feature with the average value of that feature in the dataset
OR
replacing the missing value with a value outside the normal range of values
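Both imputation strategies can be sketched in a few lines. This assumes `None` marks a missing value and uses -1 as an illustrative out-of-range value for a feature that is never negative. The function names are hypothetical.

```python
import statistics

def impute_mean(xs):
    """Replace missing values with the mean of the observed values."""
    observed = [x for x in xs if x is not None]
    fill = statistics.mean(observed)
    return [fill if x is None else x for x in xs]

def impute_out_of_range(xs, sentinel):
    """Replace missing values with a value outside the feature's normal range."""
    return [sentinel if x is None else x for x in xs]

ages = [20, None, 30, 40, None]                 # hypothetical feature
mean_filled = impute_mean(ages)                 # gaps become 30
sentinel_filled = impute_out_of_range(ages, -1) # gaps become -1
```

The out-of-range value keeps the imputed examples distinguishable from real observations, whereas mean-filled examples blend in with them.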

Question 15

Q

what is data integrity

Answer

A

maintenance and assurance of the accuracy and consistency of data over its entire life cycle, eg. credit card numbers

(15 cards)