intro/eda Flashcards

1
Q

ways to classify analytics

A
  • descriptive
  • prescriptive
  • predictive
2
Q

what is descriptive analytics

A

gather, organise
- tells you what is happening

3
Q

what is predictive analytics

A
  • uses data to predict the future
  • uses associations among variables to predict the likelihood of a phenomenon based on the relationships identified
4
Q

what is prescriptive analytics/decision analytics

A
  • looks at multiple options and strategies, then decides on the best course of action
  • recommends course of action
5
Q

steps to EDA

A
  1. define problem
  2. gather data
  3. analyse data
  4. act on analysis
6
Q

population

A

includes all entities of interest in a study

7
Q

sample

A
  • subset of population
  • often randomly chosen and preferably representative of the population as a whole
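
A minimal sketch of drawing a simple random sample, assuming a made-up population of 1,000 numeric IDs (the data, the seed, and the sample size of 50 are illustrative, not from the cards):

```python
import random

# Hypothetical population: the IDs 1..1000 stand in for all entities of interest.
population = list(range(1, 1001))

rng = random.Random(42)                # fixed seed so the draw is repeatable
sample = rng.sample(population, k=50)  # simple random sample, without replacement

population_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)
print(f"population mean = {population_mean}, sample mean = {sample_mean:.1f}")
```

The sample mean will usually land near the population mean but not exactly on it, which is the gap inferential statistics tries to quantify.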
8
Q

what is descriptive statistics

A

techniques to describe data
- when the data cover the whole population, the summary measures are called parameters

9
Q

what is inferential statistics
- steps (3)?

A
  • generalise findings of a sample to population
  • statistics
  • model, estimate parameters, estimate errors via testing
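
One possible reading of the estimation steps, sketched with made-up sample values: take the sample mean as the estimate of the population mean, and use the standard error as a rough measure of how far off that estimate may be. Choosing the mean as the parameter is an assumption made for illustration.

```python
import math
import statistics

# Made-up sample drawn from some larger population.
sample = [12.1, 9.8, 11.4, 10.6, 10.9, 11.7, 9.5, 10.2]
n = len(sample)

estimate = statistics.mean(sample)                   # sample statistic estimating the population mean
std_error = statistics.stdev(sample) / math.sqrt(n)  # typical size of the estimation error

print(f"estimated mean = {estimate:.3f} (standard error {std_error:.3f})")
```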
10
Q

types of data

A
  1. numerical data: continuous, discrete / interval, ratio
  2. categorical data: nominal, ordinal
  3. text
  4. geolocation data
11
Q

what is numerical data
- types?

A
  • discrete
  • continuous
  • cross-sectional: data on cross section of a population at a distinct point in time
  • time series: data collected over time
  • pooled data: time series of cross sections; observations in each cross section not the same
  • panel data: samples of SAME cross-sectional data observed at multiple points in time
12
Q

descriptive measures for numerical variables

A
  • mean: average value of an interval or ratio variable; affected by outliers
  • median: middle value for data arranged in either ascending or descending order; good for ordinal data; less affected by outliers
  • mode: most frequent value for a variable; good for nominal data
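
A quick sketch of the three measures using Python's standard `statistics` module. The values are made up, with 40 included as a deliberate outlier to show how the mean moves while the median barely does:

```python
import statistics

values = [2, 3, 3, 4, 5, 6, 40]  # made-up data; 40 is a deliberate outlier

mean = statistics.mean(values)      # pulled upward by the outlier
median = statistics.median(values)  # middle value of the sorted data
mode = statistics.mode(values)      # most frequent value

# Same measures with the outlier removed, for comparison.
mean_without = statistics.mean(values[:-1])
median_without = statistics.median(values[:-1])

print(mean, median, mode)
print(round(mean_without, 2), median_without)
```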
13
Q

measuring spread for numerical data

A
  • range: max-min
  • IQR: Q3 - Q1
  • variance
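
The same three spread measures, sketched with the standard `statistics` module on made-up data. Quartile conventions differ between tools; the `"inclusive"` method used here is one choice, not the only one, and `statistics.variance` is the sample variance (dividing by n - 1).

```python
import statistics

values = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # made-up data

value_range = max(values) - min(values)

# Three cut points that split the sorted data into quarters.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

variance = statistics.variance(values)  # sample variance (n - 1 in the denominator)

print(value_range, iqr, variance)
```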
14
Q

measures of symmetry of distribution

A
  • skewness: direction and degree of asymmetry
  • kurtosis: heaviness of the tails relative to a normal distribution
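
A minimal sketch of both measures using moment-based (population) formulas, which is one common definition; statistical packages may apply bias corrections and give slightly different numbers. The helper names and data are made up for illustration.

```python
def central_moment(data, k):
    """k-th central moment: the average of (x - mean) ** k."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** k for x in data) / len(data)

def skewness(data):
    # Positive: longer right tail. Negative: longer left tail. Zero: symmetric.
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data):
    # Measured relative to the normal distribution, whose kurtosis is 3.
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]

print(skewness(symmetric), skewness(right_skewed))
print(excess_kurtosis(symmetric))
```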
15
Q

charts for numerical variables
- for cross-sectional variables?
- for time series variables?

A

CR: histogram, boxplot
TS: time series graphs

16
Q

histogram/boxplot
- purpose?
- difference?

A
  • shows distribution/skewness of a distribution
  • histogram for values, bar chart for categorical
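
A histogram is built by grouping numeric values into bins and counting each bin. A rough text-only sketch, assuming made-up values and an arbitrary bin width of 2:

```python
from collections import Counter

values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]  # made-up data
bin_width = 2                            # arbitrary choice for illustration

# Map each value to the lower edge of its bin, e.g. 3 falls in the bin starting at 2.
bins = Counter((v // bin_width) * bin_width for v in values)

# Print one row per bin, including empty bins, so gaps in the data stay visible.
for start in range(0, max(values) + 1, bin_width):
    count = bins.get(start, 0)
    print(f"[{start:>2}, {start + bin_width:>2}) {'#' * count}")
```

The empty bin between 6 and 8 shows that 9 sits apart from the rest of the data, which is the kind of shape information a bar chart of categories does not carry.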
17
Q

time series graphs
- purpose?

A
  • how variables change over time
18
Q

analysis for categorical data
- types of summarization
- data processing?

A

bar graph, pie chart, two-way frequency table

categorical variables can be coded numerically or left uncoded
- nominal data: dummy encoding, one-hot encoding
- ordinal data: one-hot encoding, label encoding
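
A two-way frequency table counts how often each pair of categories occurs together. A small sketch with made-up (region, product) records:

```python
from collections import Counter

# Made-up records: each one is a (region, product) pair.
records = [
    ("north", "A"), ("north", "B"), ("south", "A"),
    ("south", "A"), ("north", "A"),
]

table = Counter(records)  # counts per (row category, column category) pair
rows = sorted({region for region, _ in records})
cols = sorted({product for _, product in records})

# Print the table with regions as rows and products as columns.
print("region".ljust(8) + "".join(c.rjust(4) for c in cols))
for r in rows:
    print(r.ljust(8) + "".join(str(table[(r, c)]).rjust(4) for c in cols))
```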

19
Q

what is dummy encoding

A
  • use (m-1) variables
  • binary or 2-value variable can be coded as 0-1 variable for a specific category
  • for more than 2, drop a category
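
A minimal sketch of dummy encoding in plain Python. The function name `dummy_encode` and the choice of dropping the alphabetically first category as the baseline are assumptions made for illustration:

```python
def dummy_encode(values, drop=None):
    """Encode m categories as m - 1 columns of 0/1; the dropped category is the baseline."""
    categories = sorted(set(values))
    baseline = drop if drop is not None else categories[0]
    kept = [c for c in categories if c != baseline]
    rows = [[1 if v == c else 0 for c in kept] for v in values]
    return kept, rows

colours = ["red", "green", "blue", "green"]  # made-up data, m = 3 categories
columns, encoded = dummy_encode(colours)

print(columns)  # the 2 remaining columns
for value, row in zip(colours, encoded):
    print(value, row)  # the baseline category is encoded as all zeros
```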
20
Q

what is one-hot encoding

A
  • use m variables, where m is the number of possible values in the categorical variable
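
One-hot encoding is usually described as one 0/1 column per category, so every row has exactly one 1. A plain-Python sketch (the function name `one_hot_encode` and the data are made up):

```python
def one_hot_encode(values):
    """Encode m categories as m columns of 0/1, one column per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

colours = ["red", "green", "blue", "green"]  # made-up data, m = 3 categories
columns, encoded = one_hot_encode(colours)

print(columns)
for value, row in zip(colours, encoded):
    print(value, row)
```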
21
Q

what is label encoding

A
  • use a variable with an ordering label
  • used if order is required for comparison analysis
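
A short sketch of label encoding for an ordinal variable. The levels and their order are made up; in practice the order has to come from domain knowledge, not from alphabetical sorting:

```python
# Hypothetical ordinal scale, listed from lowest to highest.
order = ["low", "medium", "high"]
label_of = {level: rank for rank, level in enumerate(order)}

ratings = ["medium", "low", "high", "medium"]  # made-up data
encoded = [label_of[r] for r in ratings]

print(encoded)
# The integer labels preserve the ordering, so comparisons are meaningful.
print(label_of["high"] > label_of["medium"] > label_of["low"])
```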
22
Q

data errors
- sources?

A
  • data entry/cleaning
  • transmission
  • integration
  • incorrect manipulation
23
Q

data errors
- mitigation?

A
  • check spread, missing values, and invalid values
  • use summary tables, one/two-way frequency tables, etc.
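
A rough sketch of these checks on a single column. The ages are made up, and the valid range of 0 to 120 is an assumed domain rule, not something stated in the cards:

```python
from collections import Counter

# Made-up records: None marks a missing age; -5 and 230 are entry errors.
ages = [34, 29, None, 41, -5, 230, 29]

def is_valid(age):
    return 0 <= age <= 120  # assumed plausible range for a human age

missing = sum(1 for a in ages if a is None)
invalid = [a for a in ages if a is not None and not is_valid(a)]
valid = [a for a in ages if a is not None and is_valid(a)]

frequency = Counter(valid)  # one-way frequency table of the clean values

print(f"missing: {missing}, invalid: {invalid}")
print(f"spread of valid values: {min(valid)} to {max(valid)}")
print(frequency.most_common())
```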
24
Q

what are outliers

A
  • a value or observation that lies well outside of the norm
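
The card does not say how far "well outside" is. One widely used convention, assumed here, flags values more than 1.5 × IQR below Q1 or above Q3. A sketch with made-up data:

```python
import statistics

values = [10, 12, 12, 13, 14, 15, 16, 18, 95]  # made-up data

q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# 1.5 * IQR fences: a common rule of thumb, not the only possible definition.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = [v for v in values if v < lower or v > upper]
print(f"fences: [{lower}, {upper}], outliers: {outliers}")
```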
25
Q

missing values
- what to do?

A
  • ignore rows with missing data
  • clean data by going back to the source and getting the actual value
  • impute missing data with substitute data
26
Q

how to impute missing values

A
  • use central tendency
  • use constant value meaningful to domain
  • use a randomly selected value
  • predict by inference
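
A minimal sketch of the first option, imputing with a measure of central tendency. The data are made up, and the median is chosen here only because it is less affected by outliers; the mean or mode would follow the same pattern:

```python
import statistics

incomes = [42, None, 55, 48, None, 61]  # made-up data; None marks a missing value

observed = [x for x in incomes if x is not None]
fill = statistics.median(observed)  # central tendency of the observed values

imputed = [fill if x is None else x for x in incomes]
print(imputed)
```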
27
Q

what is univariate analysis

A

analysis of 1 variable
- simplest form of data analysis
- main purpose is to describe the data and find patterns that exist within it

28
Q

bivariate analysis

A

analysis of 2 variables
- find out if there is a relationship between 2 different variables
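
One common way to check for a linear relationship between two numerical variables is the Pearson correlation coefficient. A sketch written out by hand, with a hypothetical helper name and made-up data:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

hours = [1, 2, 3, 4, 5]         # made-up: hours studied
scores = [52, 55, 61, 64, 70]   # made-up: test scores

r = pearson_r(hours, scores)
print(f"r = {r:.3f}")  # close to +1 suggests a strong positive linear relationship
```

A value near 0 would suggest no linear relationship, though the variables could still be related in a non-linear way.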

29
Q

multivariate analysis

A

analysis of 3 or more variables
- eg. regression analysis