Chapter 10 Flashcards
Exploratory Data Analysis
- Used to analyze data for insights, patterns, and relationships before applying formal statistical techniques.
- The approach avoids preconceived assumptions, allowing patterns to emerge naturally before models are fitted.
EDA helps to:
* Detect errors (outliers or anomalies) in the data.
* Check assumptions made by models or statistical tests.
* Identify the most important/influential variables.
* Develop parsimonious models—models that explain the data using the minimum necessary variables.
For a single variable, EDA involves:
- Calculating summary statistics such as the mean, median, quartiles, standard deviation, IQR, and skewness.
- Drawing suitable diagrams such as histograms, boxplots, quantile-quantile (Q-Q) plots, and line charts for time series/ordered data.
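These univariate summaries can be sketched with Python's standard library alone (the data values are hypothetical and the diagrams are omitted):

```python
import statistics

data = [12, 15, 11, 14, 13, 15, 40]  # hypothetical sample with one outlier

mean = statistics.mean(data)
median = statistics.median(data)
sd = statistics.stdev(data)                   # sample standard deviation
q1, q2, q3 = statistics.quantiles(data, n=4)  # quartiles
iqr = q3 - q1                                 # interquartile range

# Method-of-moments estimate of sample skewness
n = len(data)
skew = (sum((x - mean) ** 3 for x in data) / n) / sd ** 3
```

With the outlier at 40, the skewness comes out positive, which a histogram or boxplot of the same data would show as a long right tail.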
For bivariate or multivariate data, EDA involves:
- Calculating summary statistics for each variable and correlation coefficients between pairs of variables.
- Visualising the data using scatterplots for each pair of variables.
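As a sketch, Pearson's correlation coefficient between a pair of variables can be computed directly from its definition (the helper name `pearson` and the data are illustrative):

```python
def pearson(x, y):
    """Pearson's correlation coefficient r for paired samples x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))  # covariance numerator
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# A perfectly linear relationship gives r = 1
r = pearson([1, 2, 3], [2, 4, 6])
```

Pearson's r measures only linear association, which is why the rank-based coefficients below are preferred when the relationship is monotonic but not linear.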
Data visualisation for bivariate data
The starting point for analysing bivariate data is always data visualisation.
The simplest way to visualise bivariate data is through a scatterplot, which helps reveal any potential relationships between the variables.
The key focus is on whether there is a linear relationship between:
Y – the response (dependent) variable.
X – the explanatory (independent or regressor) variable.
Specifically, we are interested in whether the expected value of Y for a given X = x follows a linear function:
E[Y|x] = α + βx
Spearman’s rank correlation coefficient
- Spearman’s rank correlation coefficient rₛ measures the strength of a monotonic (but not necessarily linear) relationship between two variables.
- It is the Pearson correlation coefficient applied to the ranks, r(Xᵢ) and r(Yᵢ), rather than the raw values, (Xᵢ, Yᵢ), of the bivariate data.
- It only considers relative ordering, so we usually rank from smallest to largest.
- If there are no ties, the calculation simplifies to:
rₛ = 1 - (6 Σ dᵢ²) / (n(n² - 1)), where dᵢ = r(Xᵢ) - r(Yᵢ).
- Since it only considers ranks rather than actual values, rₛ is less affected by outliers than Pearson’s correlation coefficient, making it more robust.
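The no-ties formula above can be sketched directly (the `ranks` helper and sample data are illustrative; the code assumes no tied values):

```python
def ranks(values):
    """Rank values from smallest (1) to largest (n); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    """Spearman's r_s via the no-ties shortcut 1 - 6*sum(d_i^2)/(n(n^2-1))."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Any strictly increasing relationship gives r_s = 1, even a non-linear one
rs = spearman([1, 5, 3], [2, 10, 6])
```

Because only the ranks enter the formula, moving the largest X or Y value further out would not change rₛ at all, which is the robustness property noted above.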
Kendall’s rank correlation coefficient
- Kendall’s rank correlation coefficient (τ) measures the strength of a monotonic relationship between two variables, based only on ranks.
- Like Spearman’s rank correlation coefficient, it considers only the relative values of the bivariate data, not their actual values.
- Kendall’s τ is more computationally intensive as it evaluates all possible pairs of bivariate data, not just individual ranks.
- Despite its complexity, it has better statistical properties than Spearman’s coefficient, especially for small datasets with many tied ranks.
- A pair (Xᵢ, Yᵢ) and (Xⱼ, Yⱼ) is:
Concordant if Xᵢ < Xⱼ and Yᵢ < Yⱼ, or Xᵢ > Xⱼ and Yᵢ > Yⱼ.
Discordant if Xᵢ < Xⱼ and Yᵢ > Yⱼ, or Xᵢ > Xⱼ and Yᵢ < Yⱼ.
- Let n꜀ be the number of concordant pairs and n_d the number of discordant pairs. If there are no ties, Kendall's τ is:
τ = (n꜀ - n_d) / (n(n - 1) / 2).
The numerator is the difference between concordant and discordant pairs, while the denominator is the total number of possible pairs.
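This pair-counting definition can be sketched directly (assuming no ties; the function name and data are illustrative):

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau via concordant/discordant pair counts; assumes no ties."""
    nc = nd = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            nc += 1   # concordant: x and y move in the same direction
        elif s < 0:
            nd += 1   # discordant: x and y move in opposite directions
    n = len(x)
    return (nc - nd) / (n * (n - 1) / 2)

tau = kendall_tau([1, 2, 3, 4], [1, 2, 3, 4])  # perfect agreement
```

The loop over all pairs is what makes τ O(n²) to compute naively, compared with the O(n log n) ranking step behind Spearman's rₛ.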
Generalised Linear Model
Generalised Linear Models (GLMs) relate the response variable (output) to explanatory variables (predictors, covariates, or independent variables).
A GLM consists of three components:
1. A distribution for the response variable/data
Extends beyond the normal distribution to the exponential family (e.g., normal, gamma, Poisson, binomial).
Example: Gamma for insurance claim sizes, Poisson for claim counts, binomial for disease probabilities.
2. A linear predictor (η)
η is a function of the covariates.
In simple regression: η = β₀ + β₁x.
In multiple regression: η = β₀ + β₁x₁ + β₂x₂ + … + βₖxₖ.
The linear predictor is linear in parameters (β’s) but not necessarily in covariates (e.g., η = β₀ + β₁ log(x)).
3. A link function (g(μ))
Connects the mean response (μ) to the linear predictor: g(μ) = η.
In a standard linear model, g(μ) = μ (identity function).
The link function allows the model to express non-linear relationships while keeping the predictor linear in parameters.
If g is invertible, we can express μ as μ = g⁻¹(η).
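As an illustration of how the three components fit together, here is a hypothetical Poisson GLM with a log link, where the coefficients β₀ and β₁ are assumed rather than fitted from data:

```python
import math

# Assumed (not fitted) coefficients for a Poisson GLM with log link
beta0, beta1 = 0.5, 0.3

def linear_predictor(x):
    return beta0 + beta1 * x   # η = β₀ + β₁x, linear in the parameters

def inverse_link(eta):
    return math.exp(eta)       # log link g(μ) = log(μ), so μ = g⁻¹(η) = e^η

# Expected response (e.g., a claim count) at a given covariate value
x = 2.0
mu = inverse_link(linear_predictor(x))
```

The log link guarantees μ > 0 for any η, which is why it is a natural choice for count responses such as claim numbers, while η itself remains an unrestricted linear function of the parameters.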