Data Types Flashcards

1
Q

3 types of data

A

quantitative, qualitative, textual

2
Q

Quantitative data (1 word)

A

numerical

3
Q

Qualitative data (1 word)

A

categorical

4
Q

2 types of quantitative data

A

continuous and discrete

5
Q

numerical qualitative data

A

numbers used as category labels, not as quantities for arithmetic

6
Q

2 types of qualitative data

A

ordinal (ordered), nominal (unordered)

7
Q

1st step of investigating data

A

examine the univariate statistics

8
Q

purposes (3) of starting with univariate stats

A

1) detect distribution anomalies 2) get an idea of some orders of magnitude 3) see how to discretize the continuous variables, if needed

9
Q

tool for univariate review of discrete or qualitative data

A

frequency table
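
A frequency table can be sketched with the standard library alone. The function name `frequency_table` and the sample values are illustrative assumptions, not part of the deck.

```python
from collections import Counter

def frequency_table(values):
    """Return (value, count, share) rows sorted by descending count."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(v, n, n / total) for v, n in counts.most_common()]

# Hypothetical sample of a qualitative variable.
colors = ["red", "blue", "red", "green", "red", "blue"]
table = frequency_table(colors)
for value, count, share in table:
    print(f"{value:<6} {count:>3} {share:6.1%}")
```

Very small shares in such a table are one way to spot rare values early.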

10
Q

type of plot for univariate review of continuous

A

box plot, taking note of extreme percentiles
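
The numbers behind a box plot, plus the extreme percentiles, can be computed roughly as follows. This is a sketch: `box_plot_summary` is a hypothetical helper, and the choice of the 1st and 99th percentiles as "extreme" is an assumption.

```python
import statistics

def box_plot_summary(values):
    """Quartiles plus the 1st and 99th percentiles of a continuous variable."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    pct = statistics.quantiles(values, n=100, method="inclusive")  # 99 cut points
    return {
        "min": min(values), "p01": pct[0], "q1": q1, "median": median,
        "q3": q3, "p99": pct[98], "max": max(values),
    }

# Hypothetical data: the integers 0..100 plus one extreme value.
data = list(range(101)) + [1000]
summary = box_plot_summary(data)
for name, value in summary.items():
    print(f"{name:<6} {value}")
```

A large gap between `p99` and `max` (or between `min` and `p01`) suggests values worth a closer look.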

11
Q

purposes (3) of bivariate analysis

A

1) detect incompatibilities between variables 2) find links between the dependent (target) variable and the independent variables 3) find links between independent variables

12
Q

simple table for bivariate analysis

A

contingency table
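
A contingency table simply counts how often each pair of values occurs. A minimal sketch, assuming the two variables arrive as `(row_value, column_value)` pairs; the function name and the records are made up for illustration.

```python
from collections import Counter

def contingency_table(pairs):
    """Cross-tabulate (row_value, column_value) pairs into nested dicts of counts."""
    counts = Counter(pairs)
    rows = sorted({r for r, _ in counts})
    cols = sorted({c for _, c in counts})
    return {r: {c: counts[(r, c)] for c in cols} for r in rows}

# Hypothetical records: (contract type, churned?)
records = [("monthly", "yes"), ("monthly", "no"), ("annual", "no"),
           ("monthly", "yes"), ("annual", "no"), ("annual", "yes")]
table = contingency_table(records)
for row, cells in table.items():
    print(row, cells)
```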

13
Q

2nd step of investigating data

A

Rare or Missing values

14
Q

problem with rare values

A

can create bias in factor analysis or skew in measures of center

15
Q

dealing with rare values

A

remove or replace with more frequent value
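
One possible reading of "replace with more frequent value" is sketched here. The 5% cutoff and the name `replace_rare` are assumptions for illustration; the deck does not give a threshold.

```python
from collections import Counter

def replace_rare(values, min_share=0.05):
    """Replace categories rarer than min_share with the most frequent category."""
    counts = Counter(values)
    total = len(values)
    most_common_value = counts.most_common(1)[0][0]
    return [v if counts[v] / total >= min_share else most_common_value
            for v in values]

# Hypothetical variable where category "c" covers only 2% of records.
values = ["a"] * 50 + ["b"] * 48 + ["c"] * 2
cleaned = replace_rare(values)
print(Counter(cleaned))
```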

16
Q

problem with missing values

A

1) may not be random, skewing the data 2) accumulate across variables, so excluding every record with any missing value can discard a large share of the data

17
Q

dealing with missing values (4 options)

A

1) remove records 2) remove/replace variable 3) replace value 4) for qualitative data, treat ‘missing’ as its own value

18
Q

when missing values >= 15-20% of values

A

cannot replace the values or treat ‘missing’ as its own value

19
Q

Statistical replacement of the missing values uses a process called

A

imputation

20
Q

simplest method of imputation

A

replace missing value with most frequent value or mean/median
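
A minimal sketch of this simplest form of imputation, assuming missing entries are represented as `None`. The `impute` helper and its `strategy` argument are hypothetical names.

```python
import statistics

def impute(values, strategy="median"):
    """Fill None entries with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    fill = {"mean": statistics.mean, "median": statistics.median,
            "mode": statistics.mode}[strategy](observed)
    return [fill if v is None else v for v in values]

ages = [25, None, 31, 40, None, 31]         # hypothetical continuous variable
print(impute(ages, "median"))
print(impute(ages, "mean"))
print(impute(["a", None, "b", "a"], "mode"))  # most frequent value for qualitative data
```

Every gap receives the same value, which shrinks the variable's variance; that is one reason the deck caps the share of missing values this approach can handle.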

21
Q

most widespread imputation model (simple imputation)

A

each missing value is replaced with an assumed value

22
Q

multiple imputation

A

missing values are replaced with multiple plausible values, creating several complete data tables
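
The idea can be illustrated with a deliberately simplified sketch. Here each "plausible value" is drawn at random from the observed values, which is an assumption made for brevity; real multiple imputation usually draws from a fitted model of the missing data. All names are hypothetical.

```python
import random
import statistics

def multiple_impute(values, m=5, seed=0):
    """Build m completed copies, filling each None with a random observed value."""
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    return [[rng.choice(observed) if v is None else v for v in values]
            for _ in range(m)]

incomes = [30, None, 45, 52, None, 38]   # hypothetical variable with gaps
tables = multiple_impute(incomes)

# The analysis is run on every completed table and the results are combined,
# here by averaging the per-table means.
pooled_mean = statistics.mean(statistics.mean(t) for t in tables)
print(len(tables), pooled_mean)
```

The spread of results across the completed tables reflects the uncertainty introduced by the missing values.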

23
Q

3rd step of investigating data

A

Aberrant Values

24
Q

define aberrant value

A

erroneous value: can be caused by incorrect measurement, calculation error, input error, false declaration.

25
Q

extreme values and aberrant values relationship

A

extreme values not always aberrant, aberrant values not always extreme.

26
Q

tools for detecting aberrant values

A

frequency tables, univariate statistics

27
Q

dealing with aberrant values (4)

A

1) delete records 2) delete/replace variable 3) replace value 4) tolerate small margin of error and keep values as are.

28
Q

4th step of investigating data

A

Extreme Values

29
Q

what situations are tolerant of extreme values

A

1) decision trees 2) where rare profiles are the subject of study (fraud prediction, etc)

30
Q

what situations are especially intolerant of extreme values

A

continuous variables used in logistic regression, PCA, and variance.

31
Q

dealing with extreme values (#1)

A

exclude outliers from model learning sample (ensuring not more than 1-2% are excluded)

32
Q

dealing with extreme values (#2)

A

divide continuous variable into classes
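
One common way to form the classes is by quantiles, sketched below. Using quantile cut points and four classes is an assumption; the deck does not say how the classes should be chosen.

```python
import bisect
import statistics

def to_classes(values, n_classes=4):
    """Assign each value a class index 0..n_classes-1 using quantile cut points."""
    cuts = statistics.quantiles(values, n=n_classes, method="inclusive")
    return [bisect.bisect_right(cuts, v) for v in values]

# Hypothetical incomes with one extreme value.
incomes = [18, 22, 25, 31, 40, 47, 55, 5000]
classes = to_classes(incomes)
print(classes)
```

After discretization the extreme value falls in the same top class as its nearest neighbour, so its magnitude no longer influences the model.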

33
Q

dealing with extreme values (#3)

A

Winsorize the variable: values below the 1st percentile are set to the 1st percentile value, and values above the 99th percentile are set to the 99th percentile value.
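
Winsorizing at the 1st and 99th percentiles might look like the following sketch. The `winsorize` helper is hypothetical, and the percentile interpolation method is an assumption, since several conventions exist.

```python
import statistics

def winsorize(values, lower_pct=1, upper_pct=99):
    """Clamp values to the [lower_pct, upper_pct] percentile range."""
    pct = statistics.quantiles(values, n=100, method="inclusive")
    low, high = pct[lower_pct - 1], pct[upper_pct - 1]
    return [min(max(v, low), high) for v in values]

data = list(range(101))
data[0], data[100] = -500, 900   # two hypothetical extreme values
clipped = winsorize(data)
print(min(clipped), max(clipped))
```

Unlike exclusion, this keeps every record; only the extreme values themselves are pulled in.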

34
Q

5th step of investigating data

A

tests of normality

35
Q

why normality matters

A

many parametric methods assume normally distributed data; in linear regression the assumption applies to the residuals.

36
Q

Tests of normality: when to use Shapiro–Wilk

A

small samples (under 2000); its statistic measures how closely the points of a normal probability plot follow the diagonal line that represents normality

37
Q

Tests of normality: when to use Kolmogorov–Smirnov

A

very general test; in practice its variants are typically used instead

38
Q

Tests of normality: when to use Anderson–Darling test

A

Variant of Kolmogorov–Smirnov that gives more weight to the tails, where Kolmogorov–Smirnov is less sensitive

39
Q

Tests of normality: when to use Lilliefors test

A

Variant of Kolmogorov–Smirnov for cases when the mean and variance are estimated from the sample data
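
The distance that these Kolmogorov–Smirnov-style tests are built on can be computed by hand. The sketch below estimates the mean and standard deviation from the sample, as in the Lilliefors setting, and returns only the statistic; turning it into a p-value requires the test's own critical values, which are not reproduced here. Function and variable names are hypothetical.

```python
import math
import statistics

def ks_distance_to_normal(values):
    """Largest gap between the empirical CDF and a normal CDF whose mean and
    standard deviation are estimated from the sample (the Lilliefors setting)."""
    data = sorted(values)
    n = len(data)
    normal = statistics.NormalDist(statistics.mean(data), statistics.stdev(data))
    d = 0.0
    for i, x in enumerate(data):
        cdf = normal.cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d

# Hypothetical deterministic samples: evenly spaced quantiles of a normal
# distribution and of a strongly skewed (exponential) distribution.
probs = [(i + 0.5) / 50 for i in range(50)]
bell = [statistics.NormalDist().inv_cdf(p) for p in probs]
skewed = [-math.log(1 - p) for p in probs]

d_bell = ks_distance_to_normal(bell)
d_skewed = ks_distance_to_normal(skewed)
print(d_bell, d_skewed)
```

The skewed sample sits farther from its fitted normal curve, so its distance is larger.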

40
Q

define homoscedasticity

A

the property of having equal statistical variances

41
Q

homoscedasticity with a single independent variable in a discrimination model

A

equality of the variances of the variable in a number of samples, for example in the different groups of a population
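
A first informal check is to compute the variance within each group and compare them. The sketch below is not a formal test (Levene's and Bartlett's tests exist for that purpose); the group data and function names are made up.

```python
import statistics

def group_variances(groups):
    """Sample variance of the variable within each group."""
    return {name: statistics.variance(vals) for name, vals in groups.items()}

def variance_ratio(groups):
    """Largest group variance divided by the smallest (1.0 means equal)."""
    variances = group_variances(groups).values()
    return max(variances) / min(variances)

# Hypothetical groups: A and B have the same spread, C is far more dispersed.
groups = {"A": [10, 12, 14, 16], "B": [20, 22, 24, 26], "C": [5, 15, 25, 35]}
variances = group_variances(groups)
ratio = variance_ratio(groups)
print(variances, ratio)
```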

42
Q

homoscedasticity with more than one independent variable in a discrimination model

A

equality of the covariance matrices of the variables in a number of samples

43
Q

homoscedasticity in linear regression

A

variance of the residuals does not depend on the value of the predictors
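
An informal way to look for a violation is to fit the regression and compare the size of the residuals at low and high predictor values. This is only a sketch with a single predictor and made-up data, not a formal test; `residual_spread_by_half` is a hypothetical helper.

```python
import statistics

def residual_spread_by_half(x, y):
    """Fit y = a + b*x by least squares, then return the mean squared residual
    for the lower half and the upper half of the x values."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    a = my - b * mx
    # Residuals ordered by increasing x.
    residuals = [yi - (a + b * xi) for xi, yi in sorted(zip(x, y))]
    half = len(residuals) // 2
    low = statistics.mean(r * r for r in residuals[:half])
    high = statistics.mean(r * r for r in residuals[half:])
    return low, high

# Hypothetical data: noise of constant size vs. noise that grows with x.
x = list(range(1, 21))
y_constant = [2 * xi + (-1) ** xi for xi in x]
y_growing = [2 * xi + (-1) ** xi * 0.1 * xi for xi in x]

low_c, high_c = residual_spread_by_half(x, y_constant)
low_g, high_g = residual_spread_by_half(x, y_growing)
print(high_c / low_c, high_g / low_g)
```

A ratio near 1 is consistent with homoscedasticity; a ratio far from 1 suggests the residual variance depends on the predictor.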