lecture 2: data engineering Flashcards

1
Q

types of data

A

continuous, discrete, ordinal, categorical, missing, censored

2
Q

what is continuous data

A

measured on a quantitative scale; can take any value in a range, including fractional numbers

3
Q

what is discrete data

A

data points can take only a countable number of values between any 2 points, eg. whole-number counts

4
Q

what is ordinal data

hint ordinal = ordered

A

has a fixed number of possible values (<100), called levels, that are ranked/ordered

5
Q

what is categorical data

A

multiple categories that are not ordered

6
Q

what is missing data and how do we deal with it

A

a data point that is missing and whose mechanism (the reason it is missing) we do not know

we should use a non-numeric code to denote such data, eg. NA

7
Q

what is censored data and how do we deal with it

A

a missing value whose mechanism we know, at least on some level

also coded as NA, with an added column marking whether the value is censored (TRUE or FALSE)
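A minimal sketch of how missing and censored values could be coded, assuming Python's `None` stands in for NA. The records, the field names (`days_to_event`, `censored`) and the helper `split_by_status` are hypothetical, chosen only for illustration:

```python
# Hypothetical records; None plays the role of NA.
records = [
    {"patient": "a", "days_to_event": 120, "censored": False},
    # Value unknown, but we know why (eg. the study ended first): censored.
    {"patient": "b", "days_to_event": None, "censored": True},
    # Value unknown and we do not know why: plain missing data.
    {"patient": "c", "days_to_event": None, "censored": False},
]


def split_by_status(rows):
    """Separate observed, censored, and plain-missing records."""
    observed = [r for r in rows if r["days_to_event"] is not None]
    censored = [r for r in rows if r["days_to_event"] is None and r["censored"]]
    missing = [r for r in rows if r["days_to_event"] is None and not r["censored"]]
    return observed, censored, missing


observed, censored, missing = split_by_status(records)
```

Keeping the censored flag in its own column means the NA code itself never has to carry two meanings.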

8
Q

what is a top down vs bottom up approach

A

top down: starting from a problem/question and then finding the data to solve that problem
bottom up: starting from the data set, study and analyse it to see what problem/question it can solve

9
Q

what is data wrangling

A

process of transforming/mapping raw data into another format to make it more appropriate for downstream analytics
*downstream analytics: just means the data is used after some processing step, for a specific purpose

10
Q

types of data wrangling

A

scaling (eg. min-max), clipping (eg. feature clipping), z-score standardisation
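The three techniques can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the lecture's own code: the function names are hypothetical, and the z-score here uses the population standard deviation, which is one common convention among several.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant feature: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def clip(values, lower, upper):
    """Feature clipping: cap every value to the interval [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]


def z_score(values):
    """Standardise to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:  # constant feature: avoid division by zero
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]
```

For example, `min_max_scale([2, 4, 6])` gives `[0.0, 0.5, 1.0]`, and `clip([1, 50, 200], 0, 100)` gives `[1, 50, 100]`.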

11
Q

when to use z-score standardisation

A

when the population of each independent dimension of data is normally distributed

12
Q

what is data cleaning/cleansing

A

process of detecting and correcting/removing corrupt or inaccurate records from a data set

13
Q

what are the methods of dealing with missing features

A
  1. removing the examples with missing features (only if the dataset is big enough)
  2. using a learning algorithm that can deal with missing feature values
  3. using data imputation
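The first method can be sketched as below, assuming each example is a dict and `None` marks a missing feature. The function name `drop_incomplete` and the `min_remaining` threshold are hypothetical; what counts as "big enough" depends on the problem.

```python
def drop_incomplete(rows, min_remaining=100):
    """Remove examples that have any missing (None) feature.

    min_remaining is a hypothetical guard: refuse to drop rows if too few
    complete examples would be left to work with.
    """
    complete = [r for r in rows if all(v is not None for v in r.values())]
    if len(complete) < min_remaining:
        raise ValueError(
            f"only {len(complete)} complete examples left; "
            "consider imputation instead of removal"
        )
    return complete
```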
14
Q

what is data imputation

A

replacing the missing value of a feature with the average value of this feature in the dataset
OR
replacing the missing value with a value outside the normal range of values
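Both imputation strategies can be sketched as follows, assuming `None` marks a missing value. The function names and the default sentinel of `-1` are hypothetical; the sentinel only makes sense if `-1` really lies outside the feature's normal range.

```python
from statistics import mean


def impute_mean(values):
    """Replace each missing value with the mean of the observed values."""
    known = [v for v in values if v is not None]
    fill = mean(known)
    return [fill if v is None else v for v in values]


def impute_out_of_range(values, sentinel=-1):
    """Replace each missing value with a sentinel outside the normal range.

    The default of -1 is an assumption that suits features known to be
    non-negative; pick a sentinel appropriate to the actual range.
    """
    return [sentinel if v is None else v for v in values]
```

For example, `impute_mean([1, None, 3])` fills the gap with `2`, while `impute_out_of_range([0.2, None, 0.9])` fills it with `-1`.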

15
Q

what is data integrity

A

maintenance and assurance of the accuracy and consistency of data over its entire life cycle, eg. credit card numbers
