Data Types Flashcards

1
Q

3 types of data

A

quantitative, qualitative, textual

2
Q

Quantitative data (1 word)

A

numerical

3
Q

Qualitative data (1 word)

A

categorical

4
Q

2 types of quantitative data

A

continuous and discrete

5
Q

numerical qualitative data

A

numbers used as category labels (e.g., ZIP codes), not quantities for arithmetic

6
Q

2 types of qualitative data

A

ordinal (ordered), nominal (unordered)

7
Q

1st step of investigating data

A

examine the univariate statistics

8
Q

purposes (3) of starting with univariate stats

A

1) detect distribution anomalies 2) get an idea of some orders of magnitude 3) see how to discretize the continuous variables, if needed

9
Q

type of summary for univariate review of discrete or qualitative data

A

frequency table
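
As a minimal sketch of a frequency table, the Python below counts each category and reports its share of the total. The helper name `frequency_table` and the sample values are illustrative assumptions, not part of the original cards.

```python
from collections import Counter


def frequency_table(values):
    """Return (value, count, share) rows, most frequent value first."""
    counts = Counter(values)
    total = len(values)
    return [(value, count, count / total) for value, count in counts.most_common()]


# Made-up qualitative sample, for illustration only.
colors = ["red", "blue", "red", "green", "red", "blue"]
table = frequency_table(colors)

for value, count, share in table:
    print(f"{value:<6} {count:>3} {share:6.1%}")
```

A category that appears only once or twice in such a table is a candidate rare value.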

10
Q

type of plot for univariate review of continuous data

A

box plot, taking note of extreme percentiles
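
The numbers behind a box plot, plus the extreme percentiles, can be sketched with the standard library alone. The function name `box_summary`, the choice of the "inclusive" quantile method, and the sample data are assumptions made for illustration.

```python
import statistics


def box_summary(values):
    """Quartiles plus the 1st and 99th percentiles of a continuous variable."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    percentiles = statistics.quantiles(values, n=100, method="inclusive")
    return {
        "min": min(values),
        "p1": percentiles[0],    # 1st percentile
        "q1": q1,
        "median": median,
        "q3": q3,
        "p99": percentiles[98],  # 99th percentile
        "max": max(values),
    }


# Made-up sample: the integers 1 through 101.
sample = list(range(1, 102))
summary = box_summary(sample)
```

A large gap between `p99` and `max` (or between `min` and `p1`) hints at extreme values worth a closer look.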

11
Q

purposes (3) of bivariate analysis

A

1) detect incompatible variables 2) find links between the dependent (target) variable and the independent variables 3) find links between independent variables

12
Q

simple table for bivariate analysis

A

contingency table
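
A contingency table cross-counts two qualitative variables. The sketch below builds one as a nested dictionary; `contingency_table` and the two sample variables are hypothetical names chosen for this example.

```python
from collections import Counter


def contingency_table(rows, cols):
    """Count how often each (row value, column value) pair occurs."""
    counts = Counter(zip(rows, cols))
    row_labels = sorted(set(rows))
    col_labels = sorted(set(cols))
    # Counter returns 0 for pairs that never occur, so empty cells are filled in.
    return {r: {c: counts[(r, c)] for c in col_labels} for r in row_labels}


# Made-up paired observations, one pair per record.
smoker = ["yes", "no", "yes", "no", "no"]
status = ["left", "stayed", "left", "left", "stayed"]
crosstab = contingency_table(smoker, status)
```

Cells that are always zero can point to incompatible value combinations, and uneven rows suggest a link between the two variables.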

13
Q

2nd step of investigating data

A

Rare or Missing values

14
Q

problem with rare values

A

can create bias in factor analysis or skew in measures of center

15
Q

dealing with rare values

A

remove them or replace them with a more frequent value

16
Q

problem with missing values

A

1) may not be missing at random, which skews the data 2) accumulate across variables: the more variables, the more records have at least one missing value

17
Q

dealing with missing values (4 options)

A

1) remove records 2) remove/replace variable 3) replace value 4) treat ‘missing’ qualitative data as its own value

18
Q

when missing values >= 15-20% of values

A

cannot replace values or treat missing data as its own value
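
To apply the 15-20% rule of thumb, the share of missing values has to be measured per variable first. This is a rough sketch in which missing entries are assumed to be stored as `None`; the function names, the 0.15 default threshold, and the sample table are illustrative assumptions.

```python
def missing_share(column):
    """Fraction of entries in one variable that are missing (None)."""
    return sum(value is None for value in column) / len(column)


def flag_heavy_missing(table, threshold=0.15):
    """Names of variables whose missing share is at or above the threshold."""
    return [name for name, column in table.items()
            if missing_share(column) >= threshold]


# Made-up table: each key is a variable, each list holds one value per record.
records = {
    "age": [34, None, 51, 29, 40, 38, 45, 60, 27, 33],
    "income": [None, None, 52000, None, 41000, 39000, None, 58000, 61000, 47000],
}
flagged = flag_heavy_missing(records)
```

Here `age` is missing in 10% of records and stays below the threshold, while `income` is missing in 40% and is flagged.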

19
Q

Statistical replacement of the missing values uses a process called

A

imputation

20
Q

simplest method of imputation

A

replace missing value with most frequent value or mean/median
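
The simplest imputation can be sketched as follows, assuming missing entries are stored as `None`. The function name `impute` and its `strategy` parameter are hypothetical; "mode" covers the most-frequent-value case for qualitative data.

```python
import statistics


def impute(values, strategy="median"):
    """Fill None entries with the mean, median, or most frequent observed value."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "mode":
        fill = statistics.mode(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


# Made-up examples: one quantitative variable, one qualitative variable.
filled_numbers = impute([1, None, 3, 8], strategy="median")
filled_labels = impute(["a", None, "b", "a"], strategy="mode")
```

Every missing entry receives the same fill value, which shrinks the variable's variance; that is one reason the approach is only suitable when few values are missing.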

21
Q

most widespread imputation model (simple imputation)

A

each missing value is replaced with a single assumed value

22
Q

multiple imputation

A

missing values are replaced with multiple plausible values, creating several complete data tables
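
The "several complete data tables" idea can be illustrated with a deliberately simplified sketch. It assumes missing entries are `None` and fills each one with a random draw from the observed values; real multiple imputation usually draws from a model fitted on the other variables and then pools the analyses. The name `multiple_impute` and its parameters are hypothetical.

```python
import random


def multiple_impute(values, m=3, seed=0):
    """Return m completed copies, each filling None with a random observed value."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = [v for v in values if v is not None]
    return [
        [rng.choice(observed) if v is None else v for v in values]
        for _ in range(m)
    ]


# Made-up variable with two missing entries.
incomplete = [4.0, None, 5.5, None, 6.1]
completed_tables = multiple_impute(incomplete)
```

Each completed table is analysed separately, and the spread of results across tables reflects the uncertainty introduced by the missing values.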

23
Q

3rd step of investigating data

A

Aberrant Values

24
Q

define aberrant value

A

erroneous value: can be caused by incorrect measurement, calculation error, input error, false declaration.

25
extreme values and aberrant values relationship
extreme values are not always aberrant, and aberrant values are not always extreme.
26
tools for detecting aberrant values
frequency tables, univariate statistics
27
dealing with aberrant values (4)
1) delete records 2) delete/replace variable 3) replace value 4) tolerate a small margin of error and keep the values as they are.
28
4th step of investigating data
Extreme values
29
what situations are tolerant of extreme values
1) decision trees 2) where rare profiles are the subject of study (fraud prediction, etc)
30
what situations are especially intolerant of extreme values
continuous variables used in logistic regression, PCA, and variance calculations.
31
dealing with extreme values (#1)
exclude outliers from model learning sample (ensuring not more than 1-2% are excluded)
32
dealing with extreme values (#2)
divide continuous variable into classes
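
Dividing a continuous variable into classes can be sketched with `bisect`. The function name `discretize`, the cut points, and the sample values are assumptions for illustration only.

```python
from bisect import bisect_right


def discretize(values, edges):
    """Map each value to a class index: 0 is below edges[0], len(edges) is the top class."""
    return [bisect_right(edges, value) for value in values]


# Made-up cut points and values; 1000 is an extreme value.
edges = [18, 35, 65]
classes = discretize([5, 18, 40, 1000], edges)
```

The extreme value 1000 simply falls into the top class, so its size no longer influences the model.
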
33
dealing with extreme values (#3)
Winsorize the variable: values below the 1st percentile are set to the 1st percentile value, and values above the 99th percentile are set to the 99th percentile value.
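
A minimal sketch of winsorizing at the 1st and 99th percentiles, using only the standard library. The function name `winsorize`, the "inclusive" quantile method, and the sample data are assumptions; other percentile definitions give slightly different cut-offs.

```python
import statistics


def winsorize(values, lower_pct=1, upper_pct=99):
    """Clamp values to the lower and upper percentile cut-offs."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(value, low), high) for value in values]


# Made-up sample: 1 through 100 plus one extreme value.
raw = list(range(1, 101)) + [10_000]
clamped = winsorize(raw)
```

Unlike excluding outliers, this keeps every record in the sample and only pulls the tails inward.
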
34
Step 5 of investigating data
tests of normality
35
why normality matters
many parametric methods (e.g., t-tests, ANOVA, inference in linear regression) assume normally distributed data or residuals.
36
Tests of normality: when to use Shapiro–Wilk
small samples (under 2000). Its statistic measures how closely the ordered data follow the straight line of a normal probability plot (Q-Q plot).
37
Tests of normality: when to use Kolmogorov–Smirnov
very general, typically will use variants
38
Tests of normality: when to use Anderson–Darling test
Variant of Kolmogorov–Smirnov that gives more weight to the tails, where Kolmogorov–Smirnov is less sensitive
39
Tests of normality: when to use Lilliefors test
Variant of Kolmogorov–Smirnov for cases when the mean and variance are estimated from the sample data
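
The statistic shared by Kolmogorov–Smirnov and Lilliefors is the largest gap between the empirical distribution and a normal distribution. The sketch below computes that gap with the mean and standard deviation estimated from the sample, as in Lilliefors. It returns only the statistic, not a p-value (critical values come from dedicated tables or a statistics library), and the function name and samples are illustrative assumptions.

```python
import math
from statistics import NormalDist, mean, stdev


def lilliefors_statistic(sample):
    """Largest gap between the empirical CDF and a normal CDF fitted to the sample."""
    xs = sorted(sample)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    gap = 0.0
    for i, x in enumerate(xs):
        cdf = fitted.cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        gap = max(gap, (i + 1) / n - cdf, cdf - i / n)
    return gap


# Made-up samples: evenly spaced normal quantiles and exponential quantiles.
normal_like = [NormalDist().inv_cdf((i + 0.5) / 200) for i in range(200)]
skewed = [-math.log(1 - (i + 0.5) / 200) for i in range(200)]

d_normal = lilliefors_statistic(normal_like)
d_skewed = lilliefors_statistic(skewed)
```

A small gap is consistent with normality; the skewed sample produces a much larger one.
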
40
define homoscedasticity
the property of having equal statistical variances
41
homoscedasticity with a single independent variable in a discrimination model
equality of the variances of the variable in a number of samples, for example in the different groups of a population
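
An informal way to look at equality of variances across groups is to compare the sample variance of each group. The sketch below reports the per-group variances and the ratio of the largest to the smallest; it is not a formal test, and the function name and the sample groups are assumptions.

```python
import statistics


def variance_by_group(groups):
    """Sample variance of the variable within each group."""
    return {name: statistics.variance(values) for name, values in groups.items()}


def variance_ratio(groups):
    """Largest group variance divided by the smallest; near 1 suggests homoscedasticity."""
    variances = variance_by_group(groups).values()
    return max(variances) / min(variances)


# Made-up groups of the same variable.
groups = {"a": [1, 2, 3, 4, 5], "b": [2, 4, 6, 8, 10]}
per_group = variance_by_group(groups)
ratio = variance_ratio(groups)
```

Here group `b` has four times the variance of group `a`, so the equal-variance assumption looks doubtful for this made-up data.
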
42
homoscedasticity with more than one independent variable in a discrimination model
equality of the covariance matrices of the variables in a number of samples
43
homoscedasticity in linear regression
variance of the residuals does not depend on the value of the predictors