Basic Data Analysis Flashcards

1
Q

Data Types

A
  • Quantitative (numerical)
    • Discrete: takes countable values, typically counts represented by non-negative integers.
    • Continuous: can take any value within an interval.
  • Qualitative (categorical)
    • Nominal: categories with no natural order.
    • Ordinal: categories with a natural order.
2
Q

Statistical Methods

A
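  • Descriptive statistics: summarise the data in the sample.
  • Inferential statistics: draw conclusions about the population from the sample.
    Probability provides the link from descriptive to inferential statistics.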
3
Q

Measures of Location

A
  • Mean: arithmetic average of the values.
  • Mode: value that occurs most frequently; it corresponds to the highest peak of the distribution.
  • Median: middle value when the data are arranged in ascending or descending order; the 50th percentile.
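
As a rough illustration, the three measures can be computed with Python's standard `statistics` module. The sample values below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import statistics

# Hypothetical sample, used only for illustration.
data = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(data)      # sum of the values divided by their count
median = statistics.median(data)  # with an even count, the average of the two middle values
mode = statistics.mode(data)      # the most frequent value

print(mean, median, mode)
```

The mean is pulled towards the large value 10, while the median and mode are not, which is why the median is often preferred for skewed data.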
4
Q

Measures of Variability

A
  • Range: difference between the maximum and minimum values.
  • Interquartile range: difference between the 75th and 25th percentiles. The pth percentile is the value that has p% of the data points below it and (100-p)% above it.
  • Variance: mean squared deviation from the mean.
  • Standard deviation: square root of the variance.
  • Coefficient of variation: ratio of the standard deviation to the mean, expressed as a percentage.
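
A minimal sketch of these measures using Python's standard `statistics` module, on a hypothetical sample. It assumes the population form of the variance (dividing by n), which matches "mean squared deviation from the mean"; the sample form (`statistics.variance`) divides by n-1 instead. Quartile conventions also differ between tools, so the interquartile range may not match other software exactly.

```python
import statistics

# Hypothetical sample, used only for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

value_range = max(data) - min(data)

# quantiles(..., n=4) returns the three quartile cut points (Q1, Q2, Q3).
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

variance = statistics.pvariance(data)  # population variance: mean squared deviation
std_dev = statistics.pstdev(data)      # square root of the population variance
cv_percent = 100 * std_dev / statistics.mean(data)

print(value_range, iqr, variance, std_dev, cv_percent)
```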
5
Q

Measures of Shape

A

Skewness: tendency of the deviations from the mean to be larger in one direction than the other.

Kurtosis: a measure of the relative peakedness or flatness of the curve defined by the frequency distribution. The excess kurtosis (kurtosis minus 3) of a normal distribution is 0.
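
Both measures can be sketched as standardised moments. The helper names `skewness` and `excess_kurtosis` are hypothetical, the samples are made up, and the population standard deviation is assumed; statistical packages often apply small-sample corrections that this sketch omits.

```python
import statistics

def skewness(xs):
    """Third standardised moment: positive when the right tail is longer."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)

def excess_kurtosis(xs):
    """Fourth standardised moment minus 3, so a normal distribution scores 0."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 4 for x in xs) / len(xs) - 3

symmetric = [1, 2, 3, 4, 5]          # deviations balance out
right_skewed = [1, 1, 2, 2, 3, 10]   # one large value stretches the right tail

print(skewness(symmetric), skewness(right_skewed))
print(excess_kurtosis(symmetric))
```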

6
Q

Steps of Hypothesis Testing

A
  • Formulate H0 and H1.
  • Select an appropriate test.
  • Choose the level of significance (the risk of rejecting a true H0).
  • Collect data and calculate the test statistic.
  • Determine the p-value.
  • Compare the p-value with the significance level.
  • Reject or do not reject H0.
  • Draw conclusions.
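
The steps above can be walked through with a two-sided one-sample z-test, chosen here only because it needs nothing beyond the standard library. The hypothesised mean, the "known" population standard deviation and the sample values are all assumptions made for illustration.

```python
import math
from statistics import NormalDist, fmean

# Step 1: H0: mu = 100, H1: mu != 100 (hypothetical values).
mu0 = 100
sigma = 15  # assumed known population standard deviation

# Step 2: the test chosen is a two-sided one-sample z-test.
# Step 3: level of significance.
alpha = 0.05

# Step 4: collect data and compute the test statistic.
sample = [112, 105, 98, 120, 109, 101, 115, 108, 111]
n = len(sample)
z = (fmean(sample) - mu0) / (sigma / math.sqrt(n))

# Step 5: two-sided p-value from the standard normal distribution.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# Steps 6-7: compare with the significance level and decide.
reject_h0 = p_value < alpha

# Step 8: state the conclusion.
print("reject H0" if reject_h0 else "do not reject H0", round(z, 3), round(p_value, 3))
```

With these made-up numbers |z| is below 1.96, so H0 is not rejected at the 5% level.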
7
Q

Tools for bivariate analysis with continuous/categorical variables

A

Categorical / Categorical -> Contingency tables
Quantitative / Quantitative -> Linear correlation
Categorical / Quantitative -> ANOVA

8
Q

Statistical independence in contingency tables

A

Two variables are independent if, in the table of column percentages, all columns are identical and, in the table of row percentages, all rows are identical (each equal to the corresponding overall sample distribution).
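
A small sketch of this check on a table of counts. The function name `is_independent` and the example tables are hypothetical. It compares each row's percentage profile with the overall column distribution, which is enough because identical row profiles imply identical column profiles; every row is assumed to have a non-zero total.

```python
import math

def is_independent(table, tol=1e-9):
    """Return True if every row profile equals the overall column distribution."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    overall = [ct / n for ct in col_totals]  # marginal distribution of the column variable
    for row, rt in zip(table, row_totals):
        profile = [cell / rt for cell in row]  # row-wise proportions
        if not all(math.isclose(p, o, abs_tol=tol) for p, o in zip(profile, overall)):
            return False
    return True

proportional = [[10, 20, 30],
                [20, 40, 60]]   # second row is exactly twice the first
associated = [[30, 10],
              [10, 30]]         # counts concentrate on the diagonal

print(is_independent(proportional), is_independent(associated))
```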

9
Q

Chi Square Index

A

The reference case of independence is useful for measuring the degree of association between the variables.

The chi-squared index (χ²) compares the observed frequencies with the frequencies that would be expected if the null hypothesis of statistical independence were true.

If χ² = 0, X and Y are independent in the sample data.

What about the population?

H0: the variables are independent in the population. Under H0, χ² approximately follows a chi-squared distribution with (nrow-1)(ncol-1) degrees of freedom.

H1: the variables are dependent in the population.
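
The statistic can be sketched directly from its definition: for each cell, the expected count under independence is (row total × column total) / N, and the statistic sums (observed - expected)² / expected over all cells. The function name `chi_squared` and the tables are hypothetical. The p-value is left out because the standard library has no chi-squared distribution; a routine such as `scipy.stats.chi2_contingency` would return it, though it applies a continuity correction to 2×2 tables by default.

```python
def chi_squared(table):
    """Return the chi-squared statistic and its degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # count expected under independence
            chi2 += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, dof

independent_table = [[10, 20, 30],
                     [20, 40, 60]]
associated_table = [[30, 10],
                    [10, 30]]

print(chi_squared(independent_table))
print(chi_squared(associated_table))
```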

10
Q

Cramer’s V

A

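If we reject H0 and conclude that there is dependence, we can assess the strength of the relationship with Cramer’s V, which ranges from 0 (no association) to 1 (perfect association):

V = sqrt(χ² / (N × (min(nrow, ncol) - 1)))

where N is the total number of observations.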
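
A minimal sketch of the formula, with the chi-squared statistic computed inline so the example is self-contained. The function name `cramers_v` and the example tables are hypothetical.

```python
import math

def cramers_v(table):
    """Cramer's V for a table of counts: sqrt(chi2 / (N * (min(nrow, ncol) - 1)))."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = sum(
        (table[i][j] - row_totals[i] * col_totals[j] / n) ** 2
        / (row_totals[i] * col_totals[j] / n)
        for i in range(len(row_totals))
        for j in range(len(col_totals))
    )
    k = min(len(row_totals), len(col_totals))
    return math.sqrt(chi2 / (n * (k - 1)))

print(cramers_v([[30, 10], [10, 30]]))          # moderate association
print(cramers_v([[10, 20, 30], [20, 40, 60]]))  # proportional rows, no association
```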

11
Q

Covariance and Correlation

A

Covariance: tendency of two measures to vary in the same direction (positive) or not (negative).

Correlation: standardised covariance, i.e. the covariance divided by the product of the two standard deviations; it ranges from -1 to 1.
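
Both quantities can be sketched in a few lines. The helper names and data are hypothetical, and the population forms (dividing by n) are assumed; the correlation is the covariance divided by the product of the two standard deviations, so the choice of n or n-1 cancels out there.

```python
import statistics

def covariance(xs, ys):
    """Population covariance: mean product of the paired deviations from the means."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(xs, ys) / (statistics.pstdev(xs) * statistics.pstdev(ys))

xs = [1, 2, 3, 4, 5]
ys_up = [2, 4, 6, 8, 10]    # rises with xs
ys_down = [10, 8, 6, 4, 2]  # falls as xs rises

print(covariance(xs, ys_up), correlation(xs, ys_up))
print(covariance(xs, ys_down), correlation(xs, ys_down))
```

The covariance depends on the units of the two variables, while the correlation does not, which is what makes it comparable across pairs of variables.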

12
Q

ANOVA

A

Analyses the relationship between a numerical variable and a categorical variable.

One can understand how the numerical variable changes across the categories of the categorical variable by comparing its within-category means.
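
As a small illustration, the within-category means can be obtained by grouping the numerical values by category. The records below are hypothetical (category label, numerical value) pairs.

```python
from collections import defaultdict
from statistics import fmean

# Hypothetical (category, value) observations.
records = [("A", 5.0), ("A", 7.0), ("B", 10.0), ("B", 12.0), ("C", 20.0), ("C", 22.0)]

# Collect the numerical values observed in each category.
groups = defaultdict(list)
for category, value in records:
    groups[category].append(value)

# One mean per category; large gaps between them suggest an association.
group_means = {category: fmean(values) for category, values in groups.items()}

print(group_means)
```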

13
Q

One-way ANOVA F Test

A

(one-way = one categorical variable)

Is the difference in the sample means significant at the population level?

H0: the population means are equal across all c categories.
H1: not all the population means are equal (at least two differ).

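F statistic (follows an F distribution under H0):

F = between-group variability / within-group variability = [BSS/(c-1)] / [WSS/(n-c)]

where BSS is the between-group sum of squares, WSS is the within-group sum of squares, c is the number of categories and n is the total sample size.

Assumptions:

  • Populations are normally distributed.
  • Populations have equal variance.

The degrees of freedom, c-1 and n-c, depend on the number of categories and the sample size.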
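
The F statistic can be sketched directly from the formula above. The function name `one_way_anova_f` and the groups are hypothetical; the p-value is omitted because the standard library has no F distribution (a routine such as `scipy.stats.f_oneway` would provide it).

```python
from statistics import fmean

def one_way_anova_f(groups):
    """Return (F, c - 1, n - c) for a list of groups of numerical values."""
    all_values = [v for g in groups for v in g]
    n, c = len(all_values), len(groups)
    grand_mean = fmean(all_values)
    means = [fmean(g) for g in groups]

    # BSS: squared distance of each group mean from the grand mean, weighted by group size.
    bss = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # WSS: squared distance of each value from its own group mean.
    wss = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)

    f = (bss / (c - 1)) / (wss / (n - c))
    return f, c - 1, n - c

# Hypothetical values of the numerical variable in three categories.
groups = [[5.0, 7.0], [10.0, 12.0], [20.0, 22.0]]
f_stat, df_between, df_within = one_way_anova_f(groups)

print(f_stat, df_between, df_within)
```

A large F means the group means differ by more than the within-group spread would explain, which is evidence against H0.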