Exploratory Data Analysis (Weeks 1-2) Flashcards

1
Q

types of data analysis

A

descriptive, inferential, predictive

2
Q

what is descriptive analysis

A

summarises the data and highlights any patterns, using measures of central tendency, dispersion, and the shape of the distribution

3
Q

what is inferential analysis

A

collects a sample to represent the wider population, then uses it to estimate parameters and test hypotheses

4
Q

what is predictive analysis

A

uses past data to make predictions; the data is divided into a training set and a testing set
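
The training/testing split can be sketched as below. This is a minimal illustration, not a prescribed method: the function name `train_test_split`, the 20% test fraction, and the fixed seed are all assumptions made for the example.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    shuffled = list(rows)              # copy, so the caller's data is untouched
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(10))                 # hypothetical data set of 10 records
train, test = train_test_split(data, test_fraction=0.2)
print(len(train), len(test))
```

The model would be fitted on `train` only and its predictions checked against `test`.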

5
Q

process of data analysis

A

a. develop clear objectives for the analysis
b. identify data required
c. collect data (external and internal)
d. process/format/clean data
e. perform exploratory or preliminary data analysis (basic)
f. fit the model to the data
g. communicate the result
h. monitor ongoing experience
i. comply with professional guidance and legal requirements

6
Q

4 data collection concepts (2 sampling methods, 2 data types)?

A

a. simple random sampling
b. stratified sampling
c. cross-sectional data
d. longitudinal data

7
Q

what is simple random sampling?

A

selection is random, so every member of the population has an equal chance of being selected
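
A small sketch of simple random sampling using Python's standard library. The population of 100 IDs, the sample size of 10, and the seed are hypothetical values chosen for illustration.

```python
import random

population = list(range(1, 101))       # hypothetical population: IDs 1..100
rng = random.Random(42)                # fixed seed so the draw is repeatable

# Sampling without replacement: each member is equally likely to be chosen.
sample = rng.sample(population, k=10)
print(sorted(sample))
```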

8
Q

what is stratified sampling?

A

split the population into groups (strata) according to specific criteria, then pick randomly within each group
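
A minimal sketch of stratified sampling. It assumes proportional allocation (the same fraction is drawn from every stratum), which is only one possible design; the helper `stratified_sample` and the records are made up for the example.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, seed=0):
    """Group records by key(record), then sample randomly within each group."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[key(rec)].append(rec)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))   # at least one per stratum
        sample.extend(rng.sample(group, k))
    return sample

# Hypothetical population: 60 records in stratum "A", 40 in stratum "B".
records = [("A", i) for i in range(60)] + [("B", i) for i in range(40)]
picked = stratified_sample(records, key=lambda r: r[0], fraction=0.1)
count_a = sum(1 for r in picked if r[0] == "A")
count_b = sum(1 for r in picked if r[0] == "B")
print(count_a, count_b)
```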

9
Q

what is cross-sectional data?

A

the variables of interest are recorded for every object in the sample at a single point in time

10
Q

what is longitudinal data?

A

the variables of interest for the same objects are recorded repeatedly, at intervals over time

11
Q

types of data issues to watch for in the data collected (3)

A

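a. censored data (a value is only partially known, e.g. only known to exceed some limit)
b. truncated data (values outside a range are missing from the data set altogether)
c. big data (very large data sets, which need techniques such as machine learning)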
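
The difference between censoring and truncation can be illustrated with a toy example. The lifetimes and the limit of 10 are made-up values, and right-censoring/right-truncation at a fixed limit is just one assumed scenario.

```python
true_lifetimes = [2.0, 5.5, 8.0, 12.5, 15.0]   # hypothetical true values
limit = 10.0                                    # hypothetical observation limit

# Censored: every case stays in the data set, but a value above the limit
# is only known to exceed it. Each entry is (recorded value, was_censored).
censored = [(min(t, limit), t > limit) for t in true_lifetimes]

# Truncated: cases above the limit never enter the data set at all.
truncated = [t for t in true_lifetimes if t <= limit]

print(censored)
print(truncated)
```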

12
Q

reproducibility

A

enough information (data, method, code) is provided for every step of the research that an independent third party can repeat the analysis and arrive at the same result

13
Q

replication

A

a third party repeats the study from scratch, usually collecting new data, and arrives at the same result

14
Q

what does exploratory data analysis do

A

it analyses the data to identify basic patterns and relationships. It also finds the most important variables and detects data errors.

15
Q

exploratory data analysis on a single (univariate) variable

A

mean, median, quantiles, sd, skewness, etc.
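
These summaries can be computed with Python's `statistics` module, as in the sketch below. The data values are made up, and the skewness shown is the moment-based version (third central moment divided by the 1.5 power of the second), which is one of several conventions.

```python
import statistics

values = [2.0, 3.0, 3.0, 4.0, 5.0, 7.0, 11.0]    # hypothetical sample

mean = statistics.mean(values)
median = statistics.median(values)
quartiles = statistics.quantiles(values, n=4)     # three cut points: Q1, Q2, Q3
sd = statistics.stdev(values)                     # sample sd (divides by n - 1)

# Moment-based skewness: positive means a longer right tail.
n = len(values)
m2 = sum((v - mean) ** 2 for v in values) / n
m3 = sum((v - mean) ** 3 for v in values) / n
skewness = m3 / m2 ** 1.5

print(mean, median, quartiles, sd, skewness)
```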

16
Q

exploratory data analysis on several (multivariate) variables

A

scatterplot, association measures, PCA

17
Q

what is the difference between correlation and association/relationship

A

correlation measures the linear relationship between a pair of variables, while association/relationship is the broader term and also covers relationships that are not linear

18
Q

what is bivariate analysis

A

assesses the strength and shape of the relationship between 2 random variables
(if the points lie on a straight line, the correlation is perfect)

19
Q

covariance

A

E[XY] - E[X]E[Y]
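
A direct translation of this formula into Python, as a sketch. It treats the data as the whole population (dividing by n rather than n - 1), which is an assumption of the example; the data values are made up.

```python
def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Cov(X, Y) = E[XY] - E[X]E[Y], with expectations taken as averages."""
    e_xy = mean([x * y for x, y in zip(xs, ys)])
    return e_xy - mean(xs) * mean(ys)

xs = [1.0, 2.0, 3.0, 4.0]      # hypothetical data
ys = [2.0, 4.0, 6.0, 8.0]
cov = covariance(xs, ys)       # E[XY] = 15, E[X] = 2.5, E[Y] = 5
print(cov)
```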

20
Q

correlation

A

Cov(X, Y) / (SD[X] SD[Y])
between -1 and 1
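
A sketch of the correlation as covariance divided by both standard deviations, using population (divide-by-n) moments throughout. The function name and the data are illustrative only.

```python
import math

def pearson_correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y
    sd_x = math.sqrt(sum(x * x for x in xs) / n - mean_x ** 2)
    sd_y = math.sqrt(sum(y * y for y in ys) / n - mean_y ** 2)
    return cov / (sd_x * sd_y)

# Points on a rising straight line, then on a falling straight line.
up = pearson_correlation([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
down = pearson_correlation([1.0, 2.0, 3.0, 4.0], [8.0, 6.0, 4.0, 2.0])
print(up, down)
```

The two calls sit at the two ends of the range: a perfect increasing line and a perfect decreasing line.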

21
Q

type of correlations (2)

A

Pearson and Kendall

22
Q

how to answer a question on Pearson correlation?

A
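  1. find the sample correlation r (the estimate of rho)
  2. calculate the test statistic, t = r sqrt(n - 2) / sqrt(1 - r^2)
  3. compare it with the t value from the table (n - 2 degrees of freedom)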
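
The three steps can be sketched as follows for a test of H0: rho = 0. The data are made up, and the critical value 3.182 is the standard two-sided 5% point of the t distribution with 3 degrees of freedom, typed in by hand as it would be read from a table.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation from sums of squares and cross-products."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [1.0, 2.0, 3.0, 4.0, 5.0]     # hypothetical paired observations
ys = [2.0, 1.0, 4.0, 3.0, 5.0]
n = len(xs)

r = pearson_r(xs, ys)                                   # step 1
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)   # step 2
t_crit = 3.182            # step 3: table value, 3 df, two-sided 5% level
reject_h0 = abs(t_stat) > t_crit
print(r, t_stat, reject_h0)
```
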
23
Q

what is Pearson correlation used for?

A

quantitative data; a measure of linear dependence between the variables of a multivariate distribution

24
Q

dependence vs correlation

A

if X and Y are independent, the correlation is 0, but not vice versa: zero correlation does not imply independence
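
A small counterexample for the "not vice versa" part, using made-up values: Y is completely determined by X, yet the covariance (and so the correlation) is zero because the relationship is symmetric rather than linear.

```python
def mean(values):
    return sum(values) / len(values)

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]   # symmetric around zero
ys = [x * x for x in xs]           # Y = X^2, so Y depends entirely on X

# Cov(X, Y) = E[XY] - E[X]E[Y]
cov = mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * mean(ys)
print(cov)
```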

25
Q

what is Kendall's tau used for?

A

measures an ordinal relationship (the order matters, the size of the differences does not), based on concordant vs discordant pairs; less affected by extreme values; can be used as a measure of overall dependence

26
Q

how do we calculate Kendall's tau?

A

a. count the number of discordant and concordant pairs
b. plug them into the formula for the sample tau
c. calculate the test statistic
d. compare it with the N(0,1) value from the table
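
A sketch of steps a to c, assuming there are no tied values. It uses tau = (C - D) / (n(n - 1)/2) and the usual large-sample result that, under no association, tau is approximately normal with mean 0 and variance 2(2n + 5) / (9n(n - 1)). The data are made up.

```python
import math
from itertools import combinations

def kendall_tau(xs, ys):
    """Return (tau, concordant, discordant); assumes no tied values."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1      # the pair is ordered the same way in x and y
        elif sign < 0:
            discordant += 1      # the pair is ordered in opposite ways
    n = len(xs)
    tau = (concordant - discordant) / (n * (n - 1) / 2)
    return tau, concordant, discordant

xs = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical paired observations
ys = [2.0, 1.0, 4.0, 3.0, 5.0]
n = len(xs)

tau, n_con, n_dis = kendall_tau(xs, ys)                        # steps a and b
z_stat = tau / math.sqrt(2 * (2 * n + 5) / (9 * n * (n - 1)))  # step c
print(n_con, n_dis, tau, z_stat)
```

For step d, `z_stat` would be compared with the N(0,1) table value at the chosen significance level.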

27
Q

Multivariate analysis

A

same as bivariate analysis (it assesses the strength of the relationship), but between several random variables

28
Q

PCA (Principal Component Analysis)

A

because correlated multivariate random variables carry overlapping information, we construct new uncorrelated variables and keep only the first few of them, which explain most of the variability in the data.

29
Q

each new variable in PCA is

A

a linear combination of the original variables

30
Q

PCA can be used for

A

summarising high-dimensional data, identifying major patterns, analysing trends, performing regression

31
Q

how to perform PCA?

A

a. centre the matrix X
b. compute X^tX
d. subtract lambda from the diagonal entries
e. multiply the main diagonal entries and subtract the product of the other diagonal (the determinant)
f. solve for lambda
g. take the matrix from b and subtract lambda from its diagonal (if there are 2 lambdas, do steps g-i 2 times)
h. multiply that matrix by the column matrix of a and b
i. find the eigenvector by solving ..b = ..a, then write it as a column matrix (b)
(a)
j. square each entry of the matrix in h, add the results to get k, then take 1/sqrt(k)
k. put the result of j in front of the result of i
l. multiply the result of k by the result of a; each value of the product is a z, e.g. the top left entry is z(11) and the top right is z(12)
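
The hand calculation above can be sketched in Python for the two-variable case, where the eigenvalues of the 2x2 matrix X^tX have a closed form. This is an illustration under assumptions, not the course's prescribed method: the data are made up, the helper names are hypothetical, and the sign of each eigenvector is arbitrary.

```python
import math

def centre(rows):
    """Subtract each column mean (two columns assumed)."""
    n = len(rows)
    means = [sum(r[j] for r in rows) / n for j in range(2)]
    return [[r[0] - means[0], r[1] - means[1]] for r in rows]

def eig_sym_2x2(a, b, d):
    """Eigenvalues (largest first) and unit eigenvectors of [[a, b], [b, d]]."""
    # det([[a - lam, b], [b, d - lam]]) = 0 is a quadratic in lam.
    disc = math.sqrt((a - d) ** 2 + 4 * b * b)
    lams = [(a + d + disc) / 2, (a + d - disc) / 2]
    if abs(b) < 1e-12:
        # Already diagonal: the eigenvectors are the coordinate axes.
        vecs = [(1.0, 0.0), (0.0, 1.0)] if a >= d else [(0.0, 1.0), (1.0, 0.0)]
    else:
        vecs = []
        for lam in lams:
            v = (b, lam - a)                   # solves (a - lam) v1 + b v2 = 0
            k = v[0] ** 2 + v[1] ** 2
            vecs.append((v[0] / math.sqrt(k), v[1] / math.sqrt(k)))
    return lams, vecs

raw = [[1.0, 1.0], [2.0, 3.0], [3.0, 2.0], [4.0, 4.0]]   # hypothetical data
X = centre(raw)
a = sum(r[0] * r[0] for r in X)        # entries of X^tX
b = sum(r[0] * r[1] for r in X)
d = sum(r[1] * r[1] for r in X)
lams, vecs = eig_sym_2x2(a, b, d)

# Scores: each centred row projected onto each unit eigenvector.
scores = [[r[0] * v[0] + r[1] * v[1] for v in vecs] for r in X]
print(lams, vecs, scores)
```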

32
Q

Singular Value Decomposition

A

decompose X = U D V^t, where U holds the PCs, D is diagonal and contains sqrt(eigenvalues of X^tX), and V contains the loadings

33
Q

How to choose PC?

A

Keep enough PCs to explain at least 90% of the data variability
Or perform a scree test (plot variance against PC number) and keep the PCs up to the point where the scree plot levels off
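
The 90% rule can be sketched as a cumulative sum over the eigenvalues, since each eigenvalue is the variance explained by its PC. The eigenvalues below and the function name are hypothetical.

```python
def n_components_for(eigenvalues, threshold=0.90):
    """Smallest number of PCs whose cumulative variance share reaches threshold."""
    total = sum(eigenvalues)
    cumulative = 0.0
    for k, lam in enumerate(sorted(eigenvalues, reverse=True), start=1):
        cumulative += lam
        if cumulative / total >= threshold:
            return k
    return len(eigenvalues)

eigenvalues = [5.0, 3.0, 1.5, 0.5]     # hypothetical variances of four PCs
n_keep = n_components_for(eigenvalues, threshold=0.90)
print(n_keep)
```

Here the cumulative shares are 50%, 80%, 95% and 100%, so three PCs are needed to pass 90%.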