Basic Data Analysis Flashcards
Data Types
- Quantitative (numerical)
- Discrete: can only assume a finite or countable set of values, typically represented by non-negative integers (e.g. counts).
- Continuous: can assume any value within an interval.
- Qualitative (categorical)
- Nominal
- Ordinal
Statistical Methods
- Descriptive Statistics
- Inferential statistics
Probability is used to go from Descriptive to Inferential.
Measures of Location
- Mean: average value.
- Mode: value that occurs most frequently. Represents highest peak of distribution.
- Median: middle value when data is arranged in ascending or descending order. 50th percentile.
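The three measures of location above can be computed directly with Python's standard-library `statistics` module; the data here is made up for illustration.

```python
import statistics

data = [2, 3, 3, 5, 7, 9, 13]

mean = statistics.mean(data)      # sum of values / number of values
median = statistics.median(data)  # middle value of the sorted data (50th percentile)
mode = statistics.mode(data)      # most frequent value

print(mean, median, mode)  # → 6 5 3
```

Note that for a symmetric distribution the three measures coincide; here the data is right-skewed, so mean > median > mode.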
Measures of Variability
- Range
- Interquartile range: difference between the 75th and 25th percentile. pth percentile is the value that has p% of the data points below it and (100-p)% above it.
- Variance: mean squared deviation from the mean.
- Standard deviation: square root of the variance.
- Coefficient of variation: ratio of the standard deviation to the mean expressed as percentage.
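A minimal sketch of the variability measures, again with made-up data and the standard-library `statistics` module (population versions of variance and standard deviation):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

rng = max(data) - min(data)                   # range
q1, q2, q3 = statistics.quantiles(data, n=4)  # quartiles (default "exclusive" method)
iqr = q3 - q1                                 # interquartile range: 75th - 25th percentile
var = statistics.pvariance(data)              # population variance: mean squared deviation
sd = statistics.pstdev(data)                  # population standard deviation
cv = 100 * sd / statistics.mean(data)         # coefficient of variation, as a percentage
```

For this sample the mean is 5 and the standard deviation is 2, so the coefficient of variation is 40%.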
Measures of Shape
Skewness: tendency of the deviations from the mean to be larger in one direction than the other.
Kurtosis: a measure of the relative peakedness or flatness of the curve defined by the frequency distribution. The excess kurtosis of the normal distribution is 0 (its raw kurtosis is 3).
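As a sketch, the two shape measures can be computed as standardised moments; these are the population-moment versions (convention assumed, since the notes do not specify an estimator), with 3 subtracted from kurtosis so the normal distribution scores 0:

```python
import statistics

def skewness(data):
    # Third standardised moment: mean cubed deviation divided by sd^3.
    m = statistics.fmean(data)
    s = statistics.pstdev(data)
    n = len(data)
    return sum((x - m) ** 3 for x in data) / (n * s ** 3)

def excess_kurtosis(data):
    # Fourth standardised moment minus 3, so a normal distribution scores 0.
    m = statistics.fmean(data)
    s = statistics.pstdev(data)
    n = len(data)
    return sum((x - m) ** 4 for x in data) / (n * s ** 4) - 3
```

Symmetric data gives skewness 0; positive skewness means a longer right tail.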
Steps of Hypothesis Testing
- Formulate H0 and H1
- Select appropriate test.
- Choose level of significance (risk).
- Collect data and calculate the test statistic.
- Determine the p-value.
- Compare the p-value with the significance level.
- Reject or do not reject H0.
- Draw conclusions.
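The steps above can be sketched end to end with a simple exact binomial test; the coin-flip scenario and the numbers are hypothetical, chosen only to exercise each step:

```python
from math import comb

# H0: the coin is fair (p = 0.5); H1: it is not (two-sided).
# Hypothetical data: 58 heads observed in 80 flips; significance level 0.05.
n, k, alpha = 80, 58, 0.05

def pmf(i):
    # P(X = i) under H0, X ~ Binomial(n, 0.5)
    return comb(n, i) * 0.5 ** n

# Two-sided p-value: total probability of outcomes at least as unlikely as k.
p_obs = pmf(k)
p_value = sum(pmf(i) for i in range(n + 1) if pmf(i) <= p_obs)

if p_value < alpha:
    print(f"p = {p_value:.5f} < {alpha}: reject H0")
else:
    print(f"p = {p_value:.5f} >= {alpha}: do not reject H0")
```

Here 58 heads is far from the expected 40, so the p-value is well below 0.05 and H0 is rejected.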
Tools for bivariate analysis with continuous/categorical variables
Categorical/Categorical -> Contingency Tables
Quantitative/Quantitative -> Linear Correlation
Categorical/Quantitative -> ANOVA
Statistical independence in contingency tables
Two variables are independent if, in the column-conditional and row-conditional tables, all columns (respectively all rows) are identical to each other and equal to the overall sample distribution.
Chi Square Index
The independence case serves as a reference for quantifying the degree of association between the variables.
The chi-squared index (χ²) compares the observed frequencies with the frequencies that would be expected if the null hypothesis of statistical independence were true.
If χ² = 0, X and Y are independent in the sample data.
For the population?
H0: the variables are independent in the population. Under H0, the test statistic follows a χ² distribution with (nrow − 1) × (ncol − 1) degrees of freedom.
H1: the variables are dependent in the population.
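A minimal sketch of the χ² computation on a 2×2 contingency table with made-up counts, building the expected frequencies from the row and column totals under the independence hypothesis:

```python
# Observed counts for two binary variables (hypothetical data).
observed = [[30, 10],
            [20, 40]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: (row total * column total) / n.
chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        exp = row_totals[i] * col_totals[j] / n
        chi2 += (obs - exp) ** 2 / exp

print(chi2)
```

If all observed counts matched the expected ones exactly, every term would be 0 and χ² = 0, the sample-independence case above.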
Cramer’s V
If we reject H0 and conclude there is dependence, we can assess the strength of the relation with Cramér's V:
V = sqrt(χ² / (N × (min(nrow, ncol) − 1)))
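The formula translates directly into code; `cramers_v` is a hypothetical helper name, not from the notes:

```python
from math import sqrt

def cramers_v(chi2, n, n_rows, n_cols):
    # V = sqrt(chi2 / (N * (min(rows, cols) - 1))); ranges from 0 to 1,
    # where 0 means no association and 1 means perfect association.
    return sqrt(chi2 / (n * (min(n_rows, n_cols) - 1)))
```

For a 2×2 table, min(nrow, ncol) − 1 = 1, so V reduces to sqrt(χ²/N).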
Covariance and Correlation
Covariance: tendency of two measures to vary in the same direction (positive) or not (negative).
Correlation: standardised covariance; the covariance divided by the product of the two standard deviations.
ANOVA
Analyses relationship between numerical and categorical variable.
One can understand how the numerical variable changes across the categories of the categorical variable by comparing its within-category means.
One-way ANOVA F Test
(one-way = one categorical variable)
Is the difference in the sample means significant at the population level?
H0: the population means are equal across all c categories.
H1: not all the population means are equal (at least two differ).
Under H0, the F statistic follows an F distribution:
F = between group variability/within group variability = [BSS/(c-1)] / [WSS/(n-c)]
Assumptions:
- Populations are normally distributed
- Populations have equal variance
- Degrees of freedom (c − 1 between groups, n − c within groups) depend on the number of categories and the sample size.
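The F statistic above can be computed by hand from the between- and within-group sums of squares; the three groups below are hypothetical:

```python
import statistics

# Hypothetical numerical observations split by a 3-level categorical variable.
groups = [[5, 7, 6, 8], [9, 10, 11, 10], [4, 5, 6, 5]]

n = sum(len(g) for g in groups)   # total sample size
c = len(groups)                   # number of categories
grand_mean = statistics.fmean(x for g in groups for x in g)
means = [statistics.fmean(g) for g in groups]

# BSS: spread of the group means around the grand mean (between-group variability).
bss = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
# WSS: spread of the observations around their own group mean (within-group variability).
wss = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

F = (bss / (c - 1)) / (wss / (n - c))
print(F)
```

A large F means the group means differ by much more than the within-group noise would suggest, which is evidence against H0.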