Chp2 Stats Flashcards
What is a random variable
It is a variable whose possible values are the outcomes of a random phenomenon
Random variable examples
Tossing a coin, Tossing a die
Two types of R.V.
Discrete, Continuous
What do we assume about the observed data
That it is a random sample drawn from X: each observation xi is independent and identically distributed (i.i.d.)
Discrete
Takes on a countable number of possible values
Continuous
Takes on an uncountably infinite number of possible values within a given range
Probability Mass Function
For discrete variables, gives the probability that the variable takes each specific value
Probability Density Function
For continuous variables, gives the density at each value; probabilities come from the area under the curve over an interval
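A minimal sketch contrasting the two, assuming a fair six-sided die for the PMF and a uniform variable on [0, 0.5] for the PDF (both chosen only for illustration). It shows that PMF values are probabilities that sum to 1, while a PDF value is a density that can exceed 1.

```python
from fractions import Fraction

# PMF of a fair six-sided die: each face has probability 1/6.
die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}
total_probability = sum(die_pmf.values())  # PMF values sum to 1

# PDF of a uniform variable on [low, high]: constant density inside the interval.
def uniform_pdf(x, low=0.0, high=0.5):
    return 1.0 / (high - low) if low <= x <= high else 0.0

# For a constant density, P(0.1 <= X <= 0.3) is density times interval width.
prob_interval = uniform_pdf(0.2) * (0.3 - 0.1)

print(total_probability)   # sum of the PMF
print(uniform_pdf(0.2))    # a density value, not a probability
print(prob_interval)
```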
Kernel density estimation
A nonparametric technique that estimates the PDF from data by placing a smooth kernel on each data point and averaging them
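A rough sketch of the idea, assuming a Gaussian kernel and a hand-picked bandwidth of 2.0. The helper name `make_kde` and the sample values are hypothetical; in practice a library routine with automatic bandwidth selection would normally be used.

```python
import math

def make_kde(samples, bandwidth):
    """Return a function that estimates the density at x from the samples."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Add up one Gaussian bump per sample, each centered on that sample.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density

samples = [2.0, 7.0, 8.0, 9.0, 10.0, 15.0, 16.0, 20.0]  # illustrative data
density = make_kde(samples, bandwidth=2.0)              # assumed bandwidth

# The estimate is higher near the cluster around 8-9 than far away at 30.
print(density(8.5), density(30.0))
```

A larger bandwidth gives a smoother, flatter curve; a smaller one follows the individual points more closely.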
Measures of central tendency
Mean
Median
Mode
Mean
Average of all data points
Robustness
The tendency to not be affected by extreme values
Is the mean robust?
No
How to obtain a robust mean
Use a trimmed mean: the mean computed after a fixed fraction of the extreme values on either side is discarded
Median
The middle value when the data points are arranged in order
Is the median robust?
Yes
Mode
The most frequently occurring value in the dataset
Is mode a useful measure of central tendency?
Not always: a dataset can have several modes, or no meaningful mode if no value repeats
When is robustness important?
When your data might contain anomalies or extreme values that could distort the overall analysis
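The robustness cards above can be checked with a small sketch. The data and the `trimmed_mean` helper are made up for illustration, and the 10% trim fraction is an assumed choice.

```python
import statistics

def trimmed_mean(values, trim_fraction):
    """Mean after dropping trim_fraction of the points from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)  # number of points dropped per side
    kept = ordered[k:len(ordered) - k]
    return statistics.mean(kept)

clean = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
with_outlier = clean[:-1] + [1000]  # replace 19 with an extreme value

# The mean shifts a lot; the median and the trimmed mean do not.
print(statistics.mean(clean), statistics.mean(with_outlier))
print(statistics.median(clean), statistics.median(with_outlier))
print(trimmed_mean(clean, 0.1), trimmed_mean(with_outlier, 0.1))
```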
Measures of dispersion
Variance
Standard Deviation
Variance
A measure of dispersion: how much the values of X deviate from the expected (mean) value of X, computed as the average squared deviation
Sample standard deviation
The square root of the sample variance
What does standard deviation tell you
Roughly how far a typical data point lies from the mean, in the same units as the data (variance is in squared units, so its numbers are harder to read)
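A short sketch of sample variance and sample standard deviation. It assumes the common n - 1 denominator for the sample variance (some texts divide by n instead), and the data values are illustrative.

```python
import math

def sample_variance(values):
    """Sample variance using the n - 1 denominator (an assumed convention)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def sample_std(values):
    """Square root of the sample variance, in the same units as the data."""
    return math.sqrt(sample_variance(values))

data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean is 5, squared deviations sum to 32

print(sample_variance(data))  # 32 / 7
print(sample_std(data))
```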
Bi-variate/multi-variate analysis
Considers multiple attributes (vectors) at once, as opposed to the single one summarized by variance/std
What does bi-variate analysis try to understand
The association or dependence between X1 and X2
How to calculate mean and variance (first and second moment) in multivariate?
Same as in the univariate case, but computed per attribute, so the result is a vector instead of a single value
How to get total variance for multivariates
Sum all individual variances in the output vector
Covariance
Measure of the association or linear dependence between two variables
How to summarize covariance information for n attributes
An n x n covariance matrix
Main diagonal of the matrix
Holds the variance of each attribute (its covariance with itself)
Is covariance matrix symmetric?
Yes
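A sketch tying the multivariate cards together: mean vector, covariance matrix, symmetry, the diagonal, and total variance. It assumes the n - 1 denominator, and the two-attribute data set is hypothetical.

```python
def mean_vector(rows):
    """Per-attribute mean; returns one value per column."""
    num_obs, num_attrs = len(rows), len(rows[0])
    return [sum(row[j] for row in rows) / num_obs for j in range(num_attrs)]

def covariance_matrix(rows):
    """Sample covariance matrix (n - 1 denominator, an assumed convention)."""
    num_obs, num_attrs = len(rows), len(rows[0])
    mu = mean_vector(rows)
    cov = [[0.0] * num_attrs for _ in range(num_attrs)]
    for a in range(num_attrs):
        for b in range(num_attrs):
            cov[a][b] = sum(
                (row[a] - mu[a]) * (row[b] - mu[b]) for row in rows
            ) / (num_obs - 1)
    return cov

# Hypothetical data: each row is one observation of (X1, X2).
rows = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2), (5.0, 9.8)]
cov = covariance_matrix(rows)

# Main diagonal holds the per-attribute variances; their sum is the total variance.
total_variance = sum(cov[i][i] for i in range(len(cov)))

print(mean_vector(rows))
print(cov)
print(total_variance)
```

With two attributes the result is a 2 x 2 matrix whose off-diagonal entries are equal, which is the symmetry the card refers to.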
Correlation between two variables
The standardized covariance, obtained by dividing the covariance by the product of the two variables' standard deviations
Covariance vs. correlation: which is dimensionless, and which has units equal to the product of the two variables' units?
Correlation is dimensionless
Covariance is in units obtained by multiplying the units of the two variables
Range of covariance
(-inf, +inf)
Range of correlation
[-1, 1]
What does a correlation of 1 mean?
A perfect positive linear relationship: as one variable increases, the other increases proportionally
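A small sketch of the covariance and correlation cards. The height and weight values are invented for illustration, and the n - 1 denominator is an assumption. It shows that changing units rescales the covariance but leaves the correlation unchanged, and that an exact linear relationship gives a correlation of 1.

```python
import math

def covariance(xs, ys):
    """Sample covariance (n - 1 denominator, an assumed convention)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

heights_m = [1.50, 1.60, 1.70, 1.80, 1.90]   # hypothetical data
weights_kg = [52.0, 58.0, 69.0, 77.0, 88.0]  # hypothetical data
heights_cm = [h * 100 for h in heights_m]    # same heights, different units

# Covariance depends on the units; correlation does not.
print(covariance(heights_m, weights_kg), covariance(heights_cm, weights_kg))
print(correlation(heights_m, weights_kg), correlation(heights_cm, weights_kg))

# An exact increasing linear relationship has correlation 1 (up to rounding).
linear = [2 * h + 1 for h in heights_m]
print(correlation(heights_m, linear))
```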
Collinearity
Occurs when two variables are so highly correlated that one can be used to predict the other; in the extreme case, one variable is a linear function of the other
Normal/Gaussian Distribution
Parameterized by mean and std
mean = median = mode
What happens to the normal/Gaussian distribution as std decreases?
The curve becomes taller and narrower, more sharply peaked around the mean
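A quick numeric check of the effect of std, using the standard normal density formula. The std values are arbitrary. As std shrinks, the peak height at the mean grows, and because the total area stays 1 the curve must also get narrower.

```python
import math

def normal_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

# Peak height at the mean for decreasing std.
for std in (2.0, 1.0, 0.5):
    peak = normal_pdf(0.0, mean=0.0, std=std)
    print(std, peak)
```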
Binomial distribution
Parameterized by n (number of trials) and p (probability of success in each trial)
mean: np
Median: ⌊np⌋ or ⌈np⌉
Variance: np(1-p)
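The formulas above can be checked against the exact PMF. This sketch uses n = 10 and p = 0.3 purely as illustrative parameters, and takes the median to be the smallest k whose cumulative probability reaches 0.5.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 10, 0.3  # illustrative parameters
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

mean = sum(k * pk for k, pk in enumerate(pmf))
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))

# Median: smallest k with cumulative probability of at least 0.5.
cumulative = 0.0
for k, pk in enumerate(pmf):
    cumulative += pk
    if cumulative >= 0.5:
        median = k
        break

print(mean, n * p)                 # both close to 3.0
print(variance, n * p * (1 - p))   # both close to 2.1
print(median)
```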
Power-law distribution
Long-tailed distributions; relationships where one quantity varies as a power of another
Hard to define
Power law distribution example
Area of a square: it quadruples when the side length is doubled
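The square example is a power-law relationship with exponent 2. A tiny sketch, with hypothetical helper names, showing that multiplying the input by a factor multiplies the output by that factor raised to the exponent:

```python
def square_area(side):
    """Area as a power of the side length (exponent 2)."""
    return side ** 2

def scaling_factor(power, multiplier):
    """Factor by which y = c * x**power grows when x is multiplied."""
    return multiplier ** power

ratio = square_area(6.0) / square_area(3.0)  # side doubled from 3 to 6

print(ratio)
print(scaling_factor(power=2, multiplier=2))
```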
Visualization is
Important
XY plots
Scatter plots: a bird's-eye view of how your data is distributed
Boxplots
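Also called box-and-whisker plots. Show the maximum, third quartile, median, first quartile, and minimum
Whiskers extend toward the max and min; points far beyond the quartiles are often drawn separately as outliers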
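Short rectangle in box plot means
The middle 50% of the data is tightly clustered (small interquartile range)
Long whiskers
Data outside the middle 50% is widely spread, suggesting high std and variance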
Empirical cumulative distribution function
CDF(y) of a dataset X at a value y is the ratio of samples that are less than or equal to the value y.
What is CDF(X, 15) for X = [2, 7, 8, 9, 10, 15, 16, 20]?
CDF(X, 15) = 6/8 = 0.75
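The worked answer can be reproduced with a few lines. The function name `ecdf` is just a label for this sketch; it counts samples less than or equal to y, which is what makes the answer 6/8.

```python
def ecdf(samples, y):
    """Fraction of samples that are less than or equal to y."""
    return sum(1 for x in samples if x <= y) / len(samples)

X = [2, 7, 8, 9, 10, 15, 16, 20]

print(ecdf(X, 15))  # 6 of 8 samples are <= 15, so 0.75
```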
CDF PDF relation
The PDF is the derivative of the CDF (equivalently, the CDF is the integral of the PDF)