Chapter 10 Flashcards

1
Q

Exploratory Data Analysis

A
  1. Used to analyze data for insights, patterns, and relationships before applying formal statistical techniques.
  2. The approach is free of preconceived assumptions, allowing patterns to emerge naturally before fitting models.
    EDA helps to:
    * Detect errors (outliers or anomalies) in the data.
    * Check assumptions made by models or statistical tests.
    * Identify the most important/influential variables.
    * Develop parsimonious models—models that explain the data using the minimum necessary variables.

For a single variable, EDA involves:
  • Calculating summary statistics such as the mean, median, quartiles, standard deviation, IQR, and skewness.
  • Drawing suitable diagrams such as histograms, boxplots, quantile-quantile (Q-Q) plots, and line charts for time series/ordered data.
For bivariate or multivariate data, EDA involves:
  • Calculating summary statistics for each variable and correlation coefficients between variable pairs.
  • Visualizing the data using scatterplots between each pair of variables.
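A minimal sketch of the univariate summary statistics above, using Python's standard library (the sample values are made up for illustration; the deliberate outlier shows the mean being pulled above the median):

```python
import statistics

# Hypothetical sample (illustrative only); 48.7 is a deliberate outlier
x = [12.1, 9.8, 15.3, 11.0, 48.7, 10.5, 13.2, 12.9]

mean = statistics.mean(x)
median = statistics.median(x)
sd = statistics.stdev(x)                  # sample standard deviation
q1, q2, q3 = statistics.quantiles(x, n=4)  # quartiles
iqr = q3 - q1                              # interquartile range

# The outlier drags the mean above the median — a sign of right skew,
# exactly the kind of feature EDA is meant to surface before modelling.
```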

2
Q

Data visualisation for bivariate data

A

The starting point for analyzing bivariate data is always data visualization.
The simplest way to visualize bivariate data is through a scatterplot, which helps reveal any potential relationships between the variables.
The key focus is on whether there is a linear relationship between:
Y – the response (dependent) variable.
X – the explanatory (independent or regressor) variable.
Specifically, we are interested in whether the expected value of Y for a given X = x follows a linear function:
E[Y | X = x] = α + βx
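The line E[Y | X = x] = α + βx can be estimated by least squares. A self-contained sketch with toy data chosen to lie exactly on y = 1 + 2x (the data and the names `alpha`/`beta` are ours, not from the cards):

```python
# Toy bivariate data lying exactly on y = 1 + 2x
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n

# Least-squares slope and intercept: beta = Sxy / Sxx, alpha = ybar - beta*xbar
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
sxx = sum((x - xbar) ** 2 for x in xs)
beta = sxy / sxx
alpha = ybar - beta * xbar
```

With exact linear data the fit recovers α = 1 and β = 2; a scatterplot of (xs, ys) would show the points on a straight line.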

3
Q

Spearman’s rank correlation coefficient

A
  • Spearman’s rank correlation coefficient rₛ measures the strength of a monotonic (but not necessarily linear) relationship between two variables.
  • It is the Pearson correlation coefficient applied to the ranks, r(Xᵢ) and r(Yᵢ), rather than the raw values, (Xᵢ, Yᵢ), of the bivariate data.
  • Since only the relative ordering matters, values are conventionally ranked from smallest to largest.
  • If there are no ties, the calculation simplifies to:

rₛ = 1 - (6 Σ dᵢ²) / (n(n² - 1)), where dᵢ = r(Xᵢ) - r(Yᵢ).

  • Since it only considers ranks rather than actual values, rₛ is less affected by outliers than Pearson’s correlation coefficient, making it more robust.
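The no-ties formula above can be checked directly. A minimal sketch (function names are ours): rank each variable, take the rank differences dᵢ, and apply rₛ = 1 − 6Σdᵢ² / (n(n² − 1)). A monotonic but non-linear relationship, such as y = x², still gives rₛ = 1:

```python
def ranks(values):
    # Rank from smallest (1) to largest (n); assumes no ties
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(xs, ys):
    # No-ties formula: r_s = 1 - 6 * sum(d_i^2) / (n (n^2 - 1))
    n = len(xs)
    rx, ry = ranks(xs), ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# y = x^2 is monotonic but not linear: ranks agree perfectly, so r_s = 1
rs = spearman([1, 2, 3, 4, 5], [1, 4, 9, 16, 25])
```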
4
Q

Kendall’s rank correlation coefficient

A
  • Kendall’s rank correlation coefficient (τ) measures the strength of a monotonic relationship between two variables, based only on ranks.
  • Like Spearman’s rank correlation coefficient, it considers only the relative values of the bivariate data, not their actual values.
  • Kendall’s τ is more computationally intensive as it evaluates all possible pairs of bivariate data, not just individual ranks.
  • Despite its complexity, it has better statistical properties than Spearman’s coefficient, especially for small datasets with many tied ranks.
  • A pair (Xᵢ, Yᵢ) and (Xⱼ, Yⱼ) is:
    Concordant if Xᵢ < Xⱼ and Yᵢ < Yⱼ, or Xᵢ > Xⱼ and Yᵢ > Yⱼ.
    Discordant if Xᵢ < Xⱼ and Yᵢ > Yⱼ, or Xᵢ > Xⱼ and Yᵢ < Yⱼ.
    Let n꜀ be the number of concordant pairs and n_d the number of discordant pairs. If there are no ties, Kendall’s τ is:

τ = (n꜀ - n_d) / (n(n - 1) / 2).

The numerator is the difference between concordant and discordant pairs, while the denominator is the total number of possible pairs.
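The pairwise counting above can be sketched directly (function name is ours; assumes no ties). The sign of (Xᵢ − Xⱼ)(Yᵢ − Yⱼ) is positive for a concordant pair and negative for a discordant one:

```python
from itertools import combinations

def kendall_tau(xs, ys):
    # Count concordant and discordant pairs over all n(n-1)/2 pairs; assumes no ties
    nc = nd = 0
    for i, j in combinations(range(len(xs)), 2):
        s = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if s > 0:       # concordant: both differences have the same sign
            nc += 1
        elif s < 0:     # discordant: differences have opposite signs
            nd += 1
    n = len(xs)
    return (nc - nd) / (n * (n - 1) / 2)

# Swapping one adjacent pair of y-values turns exactly one of the 6 pairs
# discordant: tau = (5 - 1) / 6
tau = kendall_tau([1, 2, 3, 4], [1, 2, 4, 3])
```

Iterating over all pairs is what makes τ more computationally intensive than rₛ, which only needs the n individual ranks.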

5
Q

Generalised Linear Model

A

Generalised Linear Models (GLMs) relate the response variable (output) to explanatory variables (predictors, covariates, or independent variables).
A GLM consists of three components:
1. A distribution for the response variable/data
Extends beyond the normal distribution to the exponential family (e.g., normal, gamma, Poisson, binomial).
Example: Gamma for insurance claim sizes, Poisson for claim counts, binomial for disease probabilities.
2. A linear predictor (η)
η is a function of the covariates.
In simple regression: η = β₀ + β₁x.
In multiple regression: η = β₀ + β₁x₁ + β₂x₂ + … + βₖxₖ.
The linear predictor is linear in parameters (β’s) but not necessarily in covariates (e.g., η = β₀ + β₁ log(x)).
3. A link function (g(μ))
Connects the mean response (μ) to the linear predictor: g(μ) = η.
In a standard linear model, g(μ) = μ (identity function).
The link function allows the model to express non-linear relationships while keeping the predictor linear in parameters.
If g is invertible, we can express μ as μ = g⁻¹(η).
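A minimal sketch of the three components for a Poisson-style GLM with a log link, g(μ) = log(μ), so that μ = g⁻¹(η) = exp(η). The coefficient values are made up for illustration:

```python
import math

# Hypothetical coefficients for the linear predictor eta = beta0 + beta1 * x
beta0, beta1 = 0.5, 0.3

def eta(x):
    # Linear predictor: linear in the parameters beta0, beta1
    return beta0 + beta1 * x

def mu(x):
    # Mean response via the inverse link: mu = g^{-1}(eta) = exp(eta)
    return math.exp(eta(x))

# Check the defining relation g(mu) = eta, i.e. log(mu(x)) == eta(x)
check = math.log(mu(2.0)) - eta(2.0)
```

Note how the mean response μ is a non-linear function of x even though η itself is linear in the β's; this is the flexibility the link function provides.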
