exploratory data analysis (week 1-2) Flashcards
types of data analysis
descriptive, inferential, predictive
what is descriptive analysis
summarises the data and highlights any patterns, using measures of central tendency, dispersion, and the shape of the distribution
what is inferential analysis
uses a sample to represent the wider population, in order to estimate parameters and test hypotheses
what is predictive analysis
use past data to make predictions, divide data into training and testing set
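A minimal sketch of the training/testing split, using only the Python standard library. The function name `train_test_split`, the 25% test fraction, and the made-up records are assumptions for illustration, not part of the notes.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    rng = random.Random(seed)       # fixed seed so the split is repeatable
    shuffled = list(rows)           # copy, so the caller's data is untouched
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(20))              # made-up records
train, test = train_test_split(data)
print(len(train), len(test))
```

The model would be fitted on `train` only and then judged on `test`, which it has never seen.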
process of data analysis
a. develop clear objectives for the analysis
b. identify data required
c. collect data (external and internal)
d. process/format/clean data
e. perform exploratory or preliminary data analysis (basic)
f. fit the model to the data
g. communicate the results
h. monitor ongoing experience
i. comply with professional guidance and legal requirements
4 data resources?
a. simple random sampling
b. stratified sampling
c. cross-sectional data
d. longitudinal data
what is simple random sampling?
every member of the population has an equal chance of being selected
what is stratified sampling?
split the population into groups (strata) by specific criteria, then sample randomly within each group
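A small sketch contrasting the two sampling schemes. The population of (id, age band) records, the 10% sampling fraction, and both function names are hypothetical.

```python
import random
from collections import defaultdict

def simple_random_sample(population, k, seed=0):
    """Every member has the same chance of being picked."""
    return random.Random(seed).sample(population, k)

def stratified_sample(population, key, fraction, seed=0):
    """Group by key(item), then sample the same fraction from each group."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in population:
        strata[key(item)].append(item)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))   # at least one per stratum
        sample.extend(rng.sample(group, k))
    return sample

# hypothetical population: 30 "young" and 70 "old" records
population = [(i, "young" if i < 30 else "old") for i in range(100)]

srs = simple_random_sample(population, 10)
strat = stratified_sample(population, key=lambda p: p[1], fraction=0.1)
n_young = sum(1 for _, band in strat if band == "young")
n_old = sum(1 for _, band in strat if band == "old")
print(len(srs), n_young, n_old)
```

The stratified sample always keeps the 30/70 split of the population, while the simple random sample only matches it on average.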
what is cross-sectional data?
the variables of interest are recorded for every object in the sample at a single point in time
what is longitudinal data?
the variables of interest for a particular object are recorded repeatedly over time, usually at regular intervals
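types of issues with the data collected (3)
a. censored data (the value is only partially known, e.g. only known to be above some limit)
b. truncated data (values outside a certain range are not recorded at all, so they are completely unknown)
c. big data (too large or complex for traditional methods, often needs machine learning)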
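A toy illustration of censoring versus truncation. The loss amounts, the deductible of 100, and the policy limit of 1000 are all made up for the example.

```python
losses = [50, 120, 300, 800, 2500]   # hypothetical true loss amounts
deductible = 100                     # losses at or below this are never reported
limit = 1000                         # losses above this are only recorded as "limit"

# Truncated: the small loss disappears from the data entirely.
truncated = [x for x in losses if x > deductible]

# Censored: every loss is still in the data, but a large one is only
# partially known (we know it is at least the limit, not its true size).
censored = [(min(x, limit), x > limit) for x in losses]  # (recorded value, was censored?)

print(truncated)
print(censored)
```

With truncation the record count itself changes; with censoring the count stays the same but some values are capped.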
reproducibility
enough information (data, code, steps) is given for an independent third party to repeat the analysis and arrive at the same results
replication
a third party repeats the whole study from scratch, including collecting new data, and obtains the same (or consistent) results
what does exploratory data analysis do
it analyses the data to identify basic patterns or relationships, find the most important variables, and detect data errors.
exploratory data analysis on univariate var
mean, median, quantiles, standard deviation, skewness, etc.
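A quick sketch of these univariate summaries with the standard `statistics` module. The sample is made up, and the skewness here is the simple moment version (third central moment over the 1.5 power of the second), which is one of several conventions.

```python
import statistics

data = [2, 4, 4, 5, 7, 9, 18]                  # made-up sample

mean = statistics.mean(data)
median = statistics.median(data)
quartiles = statistics.quantiles(data, n=4)    # three cut points: Q1, Q2, Q3
sd = statistics.stdev(data)                    # sample SD (divides by n - 1)

# Moment-based skewness: m3 / m2^(3/2), using divisor n for both moments.
n = len(data)
m2 = sum((x - mean) ** 2 for x in data) / n
m3 = sum((x - mean) ** 3 for x in data) / n
skewness = m3 / m2 ** 1.5

print(mean, median, quartiles, sd, skewness)
```

The single large value (18) pulls the mean above the median and makes the skewness positive.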
exploratory data analysis on multivariate var
scatterplot, association measures, PCA
what is the diff between correlation and association/relationship
correlation measures the strength of a linear relationship, while association/relationship is broader and also covers non-linear relationships
what is bivariate analysis
assess the strength / shape of the relationship between 2 RVs
(if the points lie on a straight line then the corr is perfect)
covariance
E[XY] - E[X]E[Y]
correlation
Cov(X, Y) / (SD[X] SD[Y])
between -1 and 1
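A worked example of the two formulas, treating a small made-up data set as the whole distribution (so expectations are plain averages with divisor n).

```python
import math

x = [1, 2, 3, 4, 5]     # made-up values
y = [2, 4, 5, 4, 5]

def mean(values):
    return sum(values) / len(values)

e_x, e_y = mean(x), mean(y)
e_xy = mean([a * b for a, b in zip(x, y)])

cov = e_xy - e_x * e_y                              # E[XY] - E[X]E[Y]
sd_x = math.sqrt(mean([a * a for a in x]) - e_x ** 2)
sd_y = math.sqrt(mean([b * b for b in y]) - e_y ** 2)
corr = cov / (sd_x * sd_y)                          # always between -1 and 1

print(cov, corr)
```

Here E[X] = 3, E[Y] = 4 and E[XY] = 13.2, so the covariance is 1.2 and the correlation is about 0.77.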
type of correlations (2)
pearson and kendall
how to answer question for pearson corr?
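- find the sample corr r (rho hat)
- calc the test statistic t = r sqrt(n - 2) / sqrt(1 - r^2), for testing H0: rho = 0
- compare it with the t(n - 2) critical value from the table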
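A sketch of these steps on made-up data. It assumes the usual test of H0: rho = 0, where t = r sqrt(n - 2) / sqrt(1 - r^2) has a t distribution with n - 2 degrees of freedom. The critical value 3.182 is the two-sided 5% point for 3 degrees of freedom, typed in from tables rather than computed.

```python
import math

x = [1, 2, 3, 4, 5]     # made-up values
y = [2, 4, 5, 4, 5]

def pearson_r(x, y):
    """Sample correlation: Sxy / sqrt(Sxx * Syy)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """Test statistic for H0: rho = 0."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

n = len(x)
r = pearson_r(x, y)
t = t_statistic(r, n)

critical = 3.182                    # t table, 3 df, 2.5% in each tail
reject_h0 = abs(t) > critical
print(r, t, reject_h0)
```

With only 5 points, r of about 0.77 gives t of about 2.12, which is below 3.182, so H0 is not rejected at the 5% level.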
pearson correlation used for?
quantitative data; a measure of linear dependence in a multivariate distribution
dependence vs corr
if X and Y are independent then corr is 0, but corr of 0 does not imply independence
kendall tau used for?
measures ordinal association (only the ordering matters, not the size of the differences), based on concordant vs discordant pairs; less affected by extreme values; can be used as a measure of overall dependence
how do we calculate kendall tau?
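a. count the number of concordant (n_c) and discordant (n_d) pairs
b. plug them into the sample formula tau = (n_c - n_d) / (n(n - 1)/2)
c. calc the test statistic: under H0 (no association) tau is approx N(0, 2(2n + 5) / (9n(n - 1))), so divide tau by the sqrt of that variance
d. compare the result with the N(0, 1) table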
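A sketch of the Kendall calculation on made-up data with no ties. It assumes the usual normal approximation under H0, with variance 2(2n + 5) / (9n(n - 1)); the function names are just for illustration.

```python
import math
from itertools import combinations

def kendall_tau(x, y):
    """Return (tau, n_concordant, n_discordant); assumes no tied values."""
    n_c = n_d = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            n_c += 1        # both variables move in the same direction
        elif s < 0:
            n_d += 1        # they move in opposite directions
    n = len(x)
    tau = (n_c - n_d) / (n * (n - 1) / 2)
    return tau, n_c, n_d

def z_statistic(tau, n):
    """Standardise tau using its approximate variance under H0."""
    variance = 2 * (2 * n + 5) / (9 * n * (n - 1))
    return tau / math.sqrt(variance)

x = [1, 2, 3, 4, 5]     # made-up values
y = [3, 1, 4, 2, 5]

tau, n_c, n_d = kendall_tau(x, y)
z = z_statistic(tau, len(x))
print(n_c, n_d, tau, z)
```

Of the 10 pairs, 7 are concordant and 3 discordant, so tau = 0.4 and z is about 0.98, well inside +/- 1.96.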
Multivariate analysis
same as bivariate analysis: assess the strength of the relationships, but between several RVs
PCA (Principal Component Analysis)
correlated multivariate RVs carry overlapping information, so we create new uncorrelated variables and use only the first few of them, which explain most of the data variability.
each new var on PCA is
linear combination of original var
PCA can be used for
summarising high-dimensional data, identifying major patterns, analysing trends, performing regression
how to perform PCA?
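a. centre the matrix X (subtract each column's mean)
b. compute X^tX
c. subtract lambda from each diagonal entry (this gives X^tX - lambda I)
d. take the determinant: product of the diagonal entries minus product of the off-diagonal entries (2x2 case)
e. set the determinant to 0 and solve for lambda (the eigenvalues)
f. take the matrix from b and subtract lambda from its diagonal (if there are 2 lambdas, do steps f-h twice, once for each)
g. multiply that matrix by the column vector (a, b) and set the result to 0
h. this gives an equation of the form ..b = ..a; read off the eigenvector as a column: the coefficient of b on top, the coefficient of a below
i. square each entry of the eigenvector and add them up to get k, then take 1/sqrt(k)
j. put the result of i in front of the eigenvector from h (this scales it to length 1)
k. multiply the centred matrix from a by the normalised eigenvectors from j; each value of the result is a z (score), e.g. the top-left entry is z(11) and the top-right entry is z(12)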
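The hand calculation above can be sketched for two variables in plain Python. The data are made up, and the eigenvector shortcut (a = q, b = lambda - p, taken from the first row of the eigenvector equation) assumes the off-diagonal entry q is not zero.

```python
import math

X = [[2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [5.0, 6.0], [6.0, 5.0]]   # made-up data

# Centre each column.
n = len(X)
means = [sum(row[j] for row in X) / n for j in range(2)]
Xc = [[row[j] - means[j] for j in range(2)] for row in X]

# X^T X is the symmetric 2x2 matrix [[p, q], [q, s]].
p = sum(r[0] * r[0] for r in Xc)
q = sum(r[0] * r[1] for r in Xc)
s = sum(r[1] * r[1] for r in Xc)

# det([[p - lam, q], [q, s - lam]]) = 0  <=>  lam^2 - (p + s) lam + (p s - q^2) = 0
disc = math.sqrt((p + s) ** 2 - 4 * (p * s - q * q))
eigenvalues = [((p + s) + disc) / 2, ((p + s) - disc) / 2]   # largest first

def unit_eigenvector(lam):
    # First row of (X^T X - lam I)(a, b)^T = 0 is (p - lam) a + q b = 0,
    # which a = q, b = lam - p satisfies (needs q != 0).
    a, b = q, lam - p
    k = a * a + b * b
    return [a / math.sqrt(k), b / math.sqrt(k)]             # scaled to length 1

W = [unit_eigenvector(lam) for lam in eigenvalues]          # W[i] = i-th eigenvector

# Scores: row i of Z holds z(i1), z(i2), the projections onto each eigenvector.
Z = [[sum(row[j] * w[j] for j in range(2)) for w in W] for row in Xc]

print(eigenvalues)
print(Z[0])
```

As a sanity check, the two score columns should be uncorrelated and the sum of squares of the first column should equal the largest eigenvalue.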
Singular Value Decomposition
decompose X = U D V^t, with U (the PCs, scaled to unit length), D (diagonal, contains sqrt(eigenvalues) of X^tX), V (contains the loadings)
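A short derivation of why D holds the square roots of the eigenvalues, assuming X is already centred and that U and V have orthonormal columns. The symbol Z for the matrix of PC scores is introduced here for illustration.

```latex
\begin{align*}
X &= U D V^{\top}, \qquad U^{\top}U = I, \quad V^{\top}V = I \\
X^{\top}X &= V D U^{\top} U D V^{\top} = V D^{2} V^{\top} \\
Z &= X V = U D V^{\top} V = U D
\end{align*}
```

So the columns of V are the eigenvectors of X^tX, each squared diagonal entry of D is an eigenvalue, and the scores are the columns of U scaled by D.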
How to choose PC?
Keep enough PCs to explain at least 90% of the data variability
Or perform a scree test (variance vs PC number) and keep the PCs up to the point where the scree plot levels off
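A small sketch of the 90% rule. The eigenvalues are hypothetical, and `components_needed` is a made-up helper name; each eigenvalue is the variance explained by its PC.

```python
def components_needed(eigenvalues, threshold=0.90):
    """Smallest number of PCs whose cumulative share of variance reaches threshold."""
    total = sum(eigenvalues)
    cumulative = 0.0
    for k, lam in enumerate(sorted(eigenvalues, reverse=True), start=1):
        cumulative += lam
        if cumulative / total >= threshold:
            return k
    return len(eigenvalues)

eigenvalues = [5.0, 3.0, 1.5, 0.3, 0.2]         # hypothetical PCA output
proportions = [lam / sum(eigenvalues) for lam in eigenvalues]

k90 = components_needed(eigenvalues)            # default 90% threshold
k75 = components_needed(eigenvalues, threshold=0.75)
print(proportions, k90, k75)
```

Here the cumulative shares are roughly 50%, 80% and 95%, so three PCs are needed for 90%. Plotting `proportions` against the PC number gives the scree plot.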