Basic Data Analysis Flashcards
Data Types
- Quantitative (numerical)
- Discrete: can only assume a finite or countable set of values, typically represented by non-negative integers (e.g. counts).
- Continuous: can assume any value within an interval.
- Qualitative (categorical)
- Nominal
- Ordinal
Statistical Methods
- Descriptive Statistics
- Inferential statistics
Probability is used to go from Descriptive to Inferential.
Measures of Location
- Mean: average value.
- Mode: value that occurs most frequently. Represents highest peak of distribution.
- Median: middle value when data is arranged in ascending or descending order. 50th percentile.
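The three measures of location above can be computed directly with Python's standard-library `statistics` module; the data here is made up for illustration.

```python
import statistics

data = [2, 3, 3, 5, 7, 9, 13]

mean = statistics.mean(data)      # sum of values / number of values
median = statistics.median(data)  # middle value of the sorted data (50th percentile)
mode = statistics.mode(data)      # most frequent value

print(mean, median, mode)  # → 6 5 3
```

Note that for a symmetric distribution the three measures coincide; here the data is right-skewed, so mean > median > mode.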
Measures of Variability
- Range
- Interquartile range: difference between the 75th and 25th percentile. pth percentile is the value that has p% of the data points below it and (100-p)% above it.
- Variance: mean squared deviation from the mean.
- Standard deviation: square root of the variance.
- Coefficient of variation: ratio of the standard deviation to the mean expressed as percentage.
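A minimal sketch of the variability measures, again with made-up data and the standard-library `statistics` module (population versions of variance and standard deviation):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

rng = max(data) - min(data)                   # range
q1, q2, q3 = statistics.quantiles(data, n=4)  # quartiles (default "exclusive" method)
iqr = q3 - q1                                 # interquartile range: 75th - 25th percentile
var = statistics.pvariance(data)              # population variance: mean squared deviation
sd = statistics.pstdev(data)                  # population standard deviation
cv = 100 * sd / statistics.mean(data)         # coefficient of variation, as a percentage
```

For this sample the mean is 5 and the standard deviation is 2, so the coefficient of variation is 40%.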
Measures of Shape
Skewness: tendency of the deviations from the mean to be larger in one direction than the other.
Kurtosis: a measure of the relative peakedness or flatness of the curve defined by the frequency distribution. The excess kurtosis of the normal distribution is 0 (its raw kurtosis is 3).
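As a sketch, the two shape measures can be computed as standardised moments; these are the population-moment versions (convention assumed, since the notes do not specify an estimator), with 3 subtracted from kurtosis so the normal distribution scores 0:

```python
import statistics

def skewness(data):
    # Third standardised moment: mean cubed deviation divided by sd^3.
    m = statistics.fmean(data)
    s = statistics.pstdev(data)
    n = len(data)
    return sum((x - m) ** 3 for x in data) / (n * s ** 3)

def excess_kurtosis(data):
    # Fourth standardised moment minus 3, so a normal distribution scores 0.
    m = statistics.fmean(data)
    s = statistics.pstdev(data)
    n = len(data)
    return sum((x - m) ** 4 for x in data) / (n * s ** 4) - 3
```

Symmetric data gives skewness 0; positive skewness means a longer right tail.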
Steps of Hypothesis Testing
- Formulate H0 and H1
- Select appropriate test.
- Choose level of significance (risk).
- Collect data and calculate the test statistic.
- Determine the p-value.
- Compare the p-value with the significance level.
- Reject or do not reject H0.
- Draw conclusions.
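The steps above can be sketched end to end with a simple exact binomial test; the coin-flip scenario and the numbers are hypothetical, chosen only to exercise each step:

```python
from math import comb

# H0: the coin is fair (p = 0.5); H1: it is not (two-sided).
# Hypothetical data: 58 heads observed in 80 flips; significance level 0.05.
n, k, alpha = 80, 58, 0.05

def pmf(i):
    # P(X = i) under H0, X ~ Binomial(n, 0.5)
    return comb(n, i) * 0.5 ** n

# Two-sided p-value: total probability of outcomes at least as unlikely as k.
p_obs = pmf(k)
p_value = sum(pmf(i) for i in range(n + 1) if pmf(i) <= p_obs)

if p_value < alpha:
    print(f"p = {p_value:.5f} < {alpha}: reject H0")
else:
    print(f"p = {p_value:.5f} >= {alpha}: do not reject H0")
```

Here 58 heads is far from the expected 40, so the p-value is well below 0.05 and H0 is rejected.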
Tools for bivariate analysis with continuous/categorical variables
Categorical/Categorical -> Contingency Tables
Quantitative/Quantitative -> Linear Correlation
Categorical/Quantitative -> ANOVA
Statistical independence in contingency tables
Two variables are independent if, in the column-conditional and row-conditional tables, all columns (respectively all rows) are identical to each other and equal to the overall sample distribution.
Chi Square Index
The independence case serves as a reference for quantifying the degree of association between the variables.
The chi-squared index (χ²) compares the observed frequencies with the frequencies that would be expected if the null hypothesis of statistical independence were true.
If χ² = 0, X and Y are independent in the sample data.
For the population?
H0: the variables are independent in the population. Under H0, the test statistic follows a χ² distribution with (nrow − 1) × (ncol − 1) degrees of freedom.
H1: the variables are dependent in the population.
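A minimal sketch of the χ² computation on a 2×2 contingency table with made-up counts, building the expected frequencies from the row and column totals under the independence hypothesis:

```python
# Observed counts for two binary variables (hypothetical data).
observed = [[30, 10],
            [20, 40]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: (row total * column total) / n.
chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        exp = row_totals[i] * col_totals[j] / n
        chi2 += (obs - exp) ** 2 / exp

print(chi2)
```

If all observed counts matched the expected ones exactly, every term would be 0 and χ² = 0, the sample-independence case above.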
Cramer’s V
If we reject H0 and conclude there is dependence, we can assess the strength of the relation with Cramér's V:
V = sqrt(χ² / (N × (min(nrow, ncol) − 1)))
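The formula translates directly into code; `cramers_v` is a hypothetical helper name, not from the notes:

```python
from math import sqrt

def cramers_v(chi2, n, n_rows, n_cols):
    # V = sqrt(chi2 / (N * (min(rows, cols) - 1))); ranges from 0 to 1,
    # where 0 means no association and 1 means perfect association.
    return sqrt(chi2 / (n * (min(n_rows, n_cols) - 1)))
```

For a 2×2 table, min(nrow, ncol) − 1 = 1, so V reduces to sqrt(χ²/N).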
Covariance and Correlation
Covariance: tendency of two measures to vary in the same direction (positive) or not (negative).
Correlation: standardised covariance; the covariance divided by the product of the two standard deviations.
ANOVA
Analyses relationship between numerical and categorical variable.
One can understand how the numerical variable changes across the categories of the categorical variable by comparing its within-category means.
One-way ANOVA F Test
(one-way = one categorical variable)
Is the difference in the sample means significant at the population level?
H0: the population means are equal across all c categories.
H1: not all the population means are equal (at least two differ).
Under H0, the F statistic follows an F distribution:
F = between group variability/within group variability = [BSS/(c-1)] / [WSS/(n-c)]
Assumptions:
- Populations are normally distributed
- Populations have equal variance
- Degrees of freedom (c − 1 between groups, n − c within groups) depend on the number of categories and the sample size.
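The F statistic above can be computed by hand from the between- and within-group sums of squares; the three groups below are hypothetical:

```python
import statistics

# Hypothetical numerical observations split by a 3-level categorical variable.
groups = [[5, 7, 6, 8], [9, 10, 11, 10], [4, 5, 6, 5]]

n = sum(len(g) for g in groups)   # total sample size
c = len(groups)                   # number of categories
grand_mean = statistics.fmean(x for g in groups for x in g)
means = [statistics.fmean(g) for g in groups]

# BSS: spread of the group means around the grand mean (between-group variability).
bss = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
# WSS: spread of the observations around their own group mean (within-group variability).
wss = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

F = (bss / (c - 1)) / (wss / (n - c))
print(F)
```

A large F means the group means differ by much more than the within-group noise would suggest, which is evidence against H0.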