exploratory data analysis (week 1-2) Flashcards by Cindy Patricia

types of data analysis

descriptive, inferential, predictive

what is descriptive data analysis

summarises the data and highlights any patterns, using measures of central tendency, dispersion, and the shape of the distribution

what is inferential data analysis

uses a sample to represent the wider population, in order to estimate parameters and test hypotheses

what is predictive analysis

uses past data to make predictions; the data is divided into a training set and a testing set
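
A minimal sketch of the training/testing split, assuming a simple shuffled split; the helper name `train_test_split`, the 30% test fraction, and the data are made up for illustration.

```python
import random

def train_test_split(records, test_fraction=0.2, seed=0):
    """Shuffle a copy of `records` and split it into (train, test) lists."""
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    shuffled = list(records)           # copy, so the input is left untouched
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

past_data = list(range(10))            # hypothetical stand-in for past observations
train, test = train_test_split(past_data, test_fraction=0.3)
```

The model would be fitted on `train` only and then checked against `test`, which it has not seen.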

process of data analysis

a. develop clear objectives for the analysis
b. identify the data required
c. collect the data (external and internal)
d. process/format/clean the data
e. perform exploratory or preliminary data analysis (basic)
f. fit the model to the data
g. communicate the results
h. monitor ongoing experience
i. comply with professional guidance and legal requirements

4 data sources (sampling schemes and data types)?

a. simple random sampling
b. stratified sampling
c. cross-sectional data
d. longitudinal data

what is simple random sampling?

items are picked at random, so each one has an equal chance of being selected

what is stratified sampling?

split the population into groups (strata) according to specific criteria, then pick at random within each group
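
A rough sketch contrasting the two sampling schemes, assuming equal-sized draws from each stratum; the function names and the `policyholders` data are hypothetical.

```python
import random
from collections import defaultdict

def simple_random_sample(population, k, seed=0):
    """Every item has the same chance of being selected."""
    return random.Random(seed).sample(population, k)

def stratified_sample(population, key, k_per_stratum, seed=0):
    """Group items by `key`, then sample at random within each group."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in population:
        strata[key(item)].append(item)
    sample = []
    for label in sorted(strata):       # fixed order keeps the result repeatable
        sample.extend(rng.sample(strata[label], k_per_stratum))
    return sample

# made-up population: 8 "young" and 4 "old" records
policyholders = [("young", i) for i in range(8)] + [("old", i) for i in range(4)]
srs = simple_random_sample(policyholders, 4)
strat = stratified_sample(policyholders, key=lambda p: p[0], k_per_stratum=2)
```

The stratified sample always contains 2 records from each group, while the simple random sample may over- or under-represent a group by chance.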

what is cross-sectional data?

the variables of interest are recorded for each object at a single point in time

what is longitudinal data?

the variables of interest for particular objects are recorded repeatedly over time

types of issue in the data collected (3)

a. censored data (the value is only partially known)
b. truncated data (values outside a certain range are not recorded at all)
c. big data (too large or complex for traditional methods, may need machine learning)

reproducibility

all the information needed to repeat each step of the research is provided and leads to the same result, so a third party can start from scratch

replication

the data needed to generate the same result is provided

what does exploratory data analysis do

it analyses the data and identifies any basic patterns or relationships. It also finds the most important variables and detects any data errors.

exploratory data analysis on a univariate variable

mean, median, quantiles, standard deviation, skewness, etc.
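
A small sketch of these univariate summaries using Python's standard `statistics` module. The `claims` data is made up, and the skewness here is the population version (third central moment divided by the cubed standard deviation), which is one of several conventions.

```python
import statistics

def skewness(xs):
    """Population skewness: third central moment / (second central moment)^1.5."""
    n = len(xs)
    m = statistics.fmean(xs)
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

claims = [2, 3, 3, 4, 5, 5, 6, 20]     # hypothetical data with one large value

summary = {
    "mean": statistics.fmean(claims),
    "median": statistics.median(claims),
    "quartiles": statistics.quantiles(claims, n=4),   # three cut points
    "sd": statistics.stdev(claims),                    # sample standard deviation
    "skewness": skewness(claims),
}
```

The single large value pulls the mean above the median and gives a positive skewness.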

exploratory data analysis on multivariate var

scatterplot, association measures, PCA

what is the difference between correlation and association/relationship

correlation measures a linear relationship between variables, while association/relationship is the broader term and also covers non-linear relationships

what is bivariate analysis

assess the strength / shape of the relationship between 2 random variables
(if the points lie on a straight line, the correlation is perfect)

covariance

E[XY] - E[X]E[Y]

correlation

Cov(X, Y) / (SD[X] SD[Y])
always between -1 and 1
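
A minimal sketch of the covariance and correlation formulas, assuming population (divide-by-n) moments so that E[.] is a plain average; the function names and the data are made up for illustration.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def covariance(xs, ys):
    """E[XY] - E[X]E[Y], with expectations taken as plain averages."""
    return mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * mean(ys)

def correlation(xs, ys):
    """Cov(X, Y) / (SD[X] SD[Y]); SD is the square root of Cov(X, X)."""
    sd_x = math.sqrt(covariance(xs, xs))
    sd_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sd_x * sd_y)

x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]      # hypothetical: exactly y = 2x, a straight line
y_down = [10, 8, 6, 4, 2]    # hypothetical: a straight line with negative slope
```

Here `correlation(x, y_up)` sits at the upper limit of 1 and `correlation(x, y_down)` at the lower limit of -1, because both sets of points lie exactly on a straight line.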

type of correlations (2)

pearson and kendall

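how to answer a question on pearson correlation?

find the sample correlation (rho hat)
calculate the test statistic, for H0: rho = 0 this is t = rho hat * sqrt(n - 2) / sqrt(1 - rho hat^2)
compare it with the critical t value from the table (n - 2 degrees of freedom)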
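
A rough sketch of these steps for testing H0: rho = 0, assuming the usual t statistic with n - 2 degrees of freedom. The data is made up, and the critical value is typed in by hand from a t table rather than computed.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation from sums of squared deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """Test statistic for H0: rho = 0, compared against t with n - 2 df."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

x = [1, 2, 3, 4, 5, 6]       # hypothetical paired observations
y = [2, 1, 4, 3, 6, 5]

r = pearson_r(x, y)
t = t_statistic(r, len(x))
t_critical = 2.776           # table value: two-sided 5% level, 4 degrees of freedom
reject_h0 = abs(t) > t_critical
```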

pearson correlation used for?

Study These Flashcards

quantitative data, as a measure of linear dependence for a multivariate distribution

dependence vs correlation

Study These Flashcards

if X and Y are independent, the correlation is 0, but a correlation of 0 does not imply independence

kendall tau used for?

measures an ordinal relationship (the order matters, but not by how much), based on discordant vs concordant pairs; less affected by extreme values; can be used as a measure of overall dependence

how do we calculate kendall tau?

a. count the number of discordant and concordant pairs
b. plug them into the formula for the sample tau
c. calculate the test statistic
d. compare it with the N(0,1) table
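
A minimal sketch of the Kendall tau steps, assuming no tied values and the usual large-sample result that under independence tau has mean 0 and variance 2(2n + 5) / (9n(n - 1)). The data and function names are made up.

```python
import math
from itertools import combinations

def kendall_tau(xs, ys):
    """Return (concordant, discordant, tau) for paired data without ties."""
    con = dis = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            con += 1         # both variables move in the same direction
        elif s < 0:
            dis += 1         # the variables move in opposite directions
    n = len(xs)
    tau = (con - dis) / (n * (n - 1) / 2)
    return con, dis, tau

def z_statistic(tau, n):
    """Approximate N(0,1) test statistic under H0: no association."""
    return tau / math.sqrt(2 * (2 * n + 5) / (9 * n * (n - 1)))

x = [1, 2, 3, 4, 5]          # hypothetical ranks
y = [1, 3, 2, 5, 4]

con, dis, tau = kendall_tau(x, y)
z = z_statistic(tau, len(x))
```

With only 5 observations the normal approximation is crude; the example is only meant to show the mechanics.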

Multivariate analysis

same as bivariate analysis, it assesses the strength of the relationship, but between several random variables

PCA (Principal Component Analysis)

correlated multivariate random variables carry overlapping information, so we construct new uncorrelated variables and use only the first few of them, which explain most of the data variability.

each new variable in PCA is

a linear combination of the original variables

PCA can be used for

summarising high dimensional data, identifying major patterns, analysing trends, performing regression

how to perform PCA?

a. centre the matrix X
b. compute X^T X
d. subtract lambda from each diagonal entry
e. take the product of the main diagonal minus the product of the off-diagonal entries (the determinant)
f. solve for lambda (set the determinant from e to zero)
g. take the matrix from b and subtract lambda from its diagonal (if there are 2 lambdas, do steps g-i twice)
h. multiply that matrix by the vector (a, b)
i. find the eigenvector from the resulting relation ..b = ..a, then write it as a vector (b) (a)
j. square each entry of the vector from h and add them up to get k, then take 1/sqrt(k)
k. put the result of j in front of the result of i
l. multiply the result of k by the result of a; each value of the product is a z, e.g. the top left entry is z(11) and the top right is z(12)
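
A sketch of the same calculation with NumPy, assuming `numpy.linalg.eigh` is allowed to do the eigenvalue and unit-eigenvector steps in one call instead of solving the determinant equation by hand. The 4 x 2 data matrix is made up.

```python
import numpy as np

# hypothetical data: 4 observations of 2 variables
X = np.array([[2.0, 1.0],
              [3.0, 4.0],
              [5.0, 4.0],
              [6.0, 7.0]])

Xc = X - X.mean(axis=0)            # centre each column of X
S = Xc.T @ Xc                      # X^T X for the centred matrix

# eigh solves det(S - lambda I) = 0 and returns unit-length eigenvectors
eigvals, eigvecs = np.linalg.eigh(S)

# eigh returns ascending eigenvalues; put the largest first
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

Z = Xc @ eigvecs                   # scores: column j holds the j-th principal component
```

The columns of `Z` are uncorrelated, and their sums of squares equal the eigenvalues, so the first column carries the most variability.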

Singular Value Decomposition

decompose X = U D V^T into U (the PCs, scaled to unit length), D (diagonal, contains sqrt(eigenvalues) of X^T X), V (contains the loadings)
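
A short sketch of the decomposition using `numpy.linalg.svd` on a centred data matrix; the data is made up, and centring first is an assumption carried over from the PCA steps.

```python
import numpy as np

# hypothetical data: 4 observations of 2 variables
X = np.array([[2.0, 1.0],
              [3.0, 4.0],
              [5.0, 4.0],
              [6.0, 7.0]])
Xc = X - X.mean(axis=0)            # centre the columns first

# Xc = U @ diag(d) @ Vt, with singular values d in descending order
U, d, Vt = np.linalg.svd(Xc, full_matrices=False)

eigenvalues = d ** 2               # eigenvalues of Xc^T Xc
loadings = Vt.T                    # columns are the loading vectors
scores = U * d                     # principal component scores, same as Xc @ loadings
```

This gives the same principal components as the eigenvalue route, up to the sign of each column.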

How to choose PC?

Keep enough PCs to explain at least 90% of the data variability, or perform a scree test (plot variance against PC number) and keep the PCs up to the point where the scree plot levels off
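
A minimal sketch of the 90% rule, assuming the eigenvalues (variances of the PCs) are already known; the function name and the eigenvalues are made up.

```python
def n_components_for(eigenvalues, threshold=0.90):
    """Smallest number of PCs whose cumulative share of variance reaches `threshold`."""
    ev = sorted(eigenvalues, reverse=True)   # largest variance first
    total = sum(ev)
    running = 0.0
    for k, value in enumerate(ev, start=1):
        running += value
        if running / total >= threshold:
            return k
    return len(ev)

eigenvalues = [5.0, 3.0, 1.5, 0.5]   # hypothetical PC variances
k = n_components_for(eigenvalues)
```

Here the cumulative shares are 50%, 80%, 95% and 100%, so the first 3 PCs are kept.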

(33 cards)