Lecture 2 - Descriptive Statistics & Exploratory Data Analysis Flashcards

1
Q

What is Descriptive Statistics?

A

Describes data using numerical and graphical methods to summarize and analyze them in a clear and understandable way

2
Q

What is Inferential Statistics?

A

Draws inferences about a larger population from a sample

3
Q

What is EDA?

A

An approach to data analysis to gain useful insight into the data

Inspect data without any assumptions

1st step of data analysis

Use of both numerical and graphical methods to describe data distribution

4
Q

What are the 4 functions/benefits/aims/purposes of EDA?

A

Provide descriptive statistics by

  • Giving an overall view of data
  • Helping to visualize distributions and relationships

Detect anomalies, including outliers

Assess the assumptions for confirmatory analysis (i.e. inferential statistics/tests)

Help to decide on the appropriate confirmatory analysis

5
Q

What are the 2 methods of EDA? Compare them in terms of benefits/usefulness/aims.

A

Numerical methods are more precise and objective

Graphical methods are better for identifying patterns in the data

6
Q

What are the 5 key characteristics of a data distribution to explore? Describe them (simply)

A

Centre: middle of the range of values, where the most frequently occurring data values are often found

Spread: amount of variability in the values away from the centre

Shape: includes number of peaks and symmetry of distribution

Gaps: segments within the range from minimum to maximum data values without data value(s)

Outliers: data values that are very different from all other values (extremes)

7
Q

What is Numerical/Summary Statistics?

A

Numerical values that describe

  • Central tendency of data
  • Dispersion/variability of data
8
Q

What is Central tendency of data?

A

Mean: arithmetic average = sum of all values/number of values (observations)

Median: middle observation (50th percentile)

Mode: most frequent observation

Alone not enough to describe data distribution
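
As a minimal sketch of the three measures above, the snippet below uses Python's standard `statistics` module. The `visits` data are hypothetical and only serve to show that the three measures can disagree when a distribution is skewed.

```python
from statistics import mean, median, multimode

# Hypothetical sample: number of clinic visits for 9 patients (right-skewed)
visits = [2, 3, 3, 4, 5, 5, 5, 7, 20]

centre = {
    "mean": mean(visits),       # sum of all values / number of values
    "median": median(visits),   # middle observation (50th percentile)
    "mode": multimode(visits),  # most frequent value(s); a list, since ties are possible
}
print(centre)
```

Here the single large value pulls the mean above the median and mode, which is one reason a measure of centre alone does not describe the distribution.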

9
Q

When to use mean, median and mode?

A

Mean

  • preferred and widely used
  • quantitative (interval & ratio) data
  • considers both number of values and values themselves
  • sensitive to extreme values (outliers)

Median

  • considers number of values and rank order of values
  • robust to outliers
  • ordinal, quantitative data that are highly skewed

Mode

  • seldom used
  • nominal data
  • considers only frequency
  • robust to outliers, but ignores the number and rank order of values and all values other than the most frequent one
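
The contrast between "sensitive to extreme values" and "robust to outliers" can be illustrated with a small sketch. The income figures are hypothetical; the point is only how far each measure moves when one extreme value is added.

```python
from statistics import mean, median

# Hypothetical incomes (in thousands)
incomes = [30, 32, 35, 38, 40]
with_outlier = incomes + [500]  # add one extreme value

shift_in_mean = mean(with_outlier) - mean(incomes)
shift_in_median = median(with_outlier) - median(incomes)
print(shift_in_mean, shift_in_median)
```

The mean shifts by tens of thousands while the median barely moves, which is why the median is preferred for highly skewed quantitative data.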
10
Q

2 misuses of descriptive statistics

A

Inappropriate choice of measure of central tendency

Reporting absolute values when percentages (relative values) are more informative, or vice versa; need to consider the context and other information

11
Q

What are the measures of variability/dispersion of data? Give some of the respective equations/formulas too

A

Range: difference between the maximum and minimum values

Interquartile range: difference between the values at the 75th and 25th percentile points

Variance: average of squared deviations from the mean

  • [sum (X-mean)^2]/(N - 1) = SS/(N - 1), where SS is the sum of squares, and N - 1 because of the reduced degree of freedom for a sample
  • Why sum of squared deviations and not sum of deviations? Ans: The mean is the balance point of the data values, so the +ve deviations and the -ve deviations cancel out exactly. Therefore, the sum of deviations would always be zero

Standard deviation: Square root of variance or Square root of average of squared deviations from the mean i.e. {[sum (X-mean)^2]/(N - 1)}^0.5 = [SS/(N - 1)]^0.5

Coefficient of variation (CV): measure of relative dispersion

  • If 2 or more distributions to be compared are expressed in the same units and have similar means, then their variability can be compared directly by comparing their SD
  • But if means are very different or are expressed in different units, then need another method
  • e.g. mean of 1 (SD 1) vs mean of 100 (SD 1)
  • Normalize SD to mean to compare 2 or more distributions
  • CV = (SD/mean) x 100%
  • No unit and expressed as a percentage
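
The formulas above can be worked through as a short sketch. The sample `x` is hypothetical, and the quartile convention used by `statistics.quantiles` is an assumption here: statistical packages use slightly different rules, so the IQR may differ a little between tools.

```python
from statistics import mean, quantiles, stdev, variance

# Hypothetical sample of 8 measurements
x = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(x)
m = mean(x)

deviations = [xi - m for xi in x]
ss = sum(d ** 2 for d in deviations)   # SS: sum of squared deviations

data_range = max(x) - min(x)           # maximum - minimum
q1, _, q3 = quantiles(x, n=4)          # 25th and 75th percentile points
iqr = q3 - q1
var = ss / (n - 1)                     # sample variance = SS / (N - 1)
sd = var ** 0.5                        # same unit as the mean
cv = sd / m * 100                      # unitless, in percent

print(sum(deviations))                 # raw deviations cancel out
print(data_range, iqr, var, sd, cv)
print(variance(x), stdev(x))           # library cross-check of var and sd

# The example from the card: equal SD, very different means
cv_small_mean = 1 / 1 * 100
cv_large_mean = 1 / 100 * 100
print(cv_small_mean, cv_large_mean)
```

The last two values show why CV is needed: the same SD of 1 is a large relative spread around a mean of 1 but a small one around a mean of 100.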
12
Q

When to use range, interquartile range, variance and standard deviation?

A

Range

  • Only considers 2 points of data but easy to calculate
  • Not often provided, but sensitive to extreme outliers; therefore gives an indication of the presence of outliers when used with mean and SD
  • Reporting the max and min instead is possible and preferred

IQR

  • Considers only 2 points of data (the 25th and 75th percentiles) and describes the middle 50% of data
  • Used when data are ordinal, or when quantitative data are highly skewed; analogous to median

Variance

  • Accurate and detailed estimate of variability
  • most commonly used measure of “spread” for statistical calculations but usually not reported as a statistic for variability

Standard Deviation

  • Accurate and detailed estimate of variability
  • Preferred and commonly reported because same unit as mean unlike that of variance
13
Q

For quantitative, ordinal and nominal variables, which measures of central tendency and measures of variability should we use?

A

Quantitative: mean & SD or median (when data is highly skewed) & IQR

Ordinal: median & IQR

Nominal: mode (no measure of variability)

14
Q

What are graphical methods?

A

Help to visualize the 5 key characteristics of data distribution

Include

  • histogram
  • box plot
  • scatter plot
  • stem and leaf plot

Very important though often not reported in literature

Not the same as graphs presented in results and discussion in inferential statistics

15
Q

(What are) histograms?

A

Commonly used for quantitative variables (the analogous plot for qualitative variables is a bar chart)

Constructed from simple or grouped frequency distribution
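
A grouped frequency distribution, the raw material of a histogram, can be sketched in a few lines. The `ages` data and the bin width of 10 are hypothetical choices for illustration; a different bin width would change the picture.

```python
from collections import Counter

# Hypothetical ages; the bin width of 10 is an arbitrary illustrative choice
ages = [21, 23, 25, 31, 34, 35, 36, 38, 42, 47, 55, 58]
bin_width = 10

# Grouped frequency distribution: count values per bin, keyed by the bin's lower bound
freq = Counter((age // bin_width) * bin_width for age in ages)

# Text histogram: one asterisk per observation in the bin
for lower in sorted(freq):
    print(f"{lower}-{lower + bin_width - 1}: {'*' * freq[lower]}")
```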

16
Q

When to use histograms?

A

Useful for showing a frequency distribution
- good for presentation as it is familiar to most people

Part of EDA to assess normality and outliers
- can display all characteristics of a data distribution if a suitable bin size is selected
- particularly useful when there are more than 100 data values
- not reliable if the sample size is too small, as it may show spurious patterns
(too few data points cannot accurately show the distribution of the data, so the histogram may be distorted)

17
Q

(What are) Stem and leaf plots?

A

Shows shape and distribution

Shows all the data (no information lost; allows for inspection of individual values)
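
A minimal sketch of how such a plot is built, assuming non-negative whole numbers with the tens (and above) as the stem and the units digit as the leaf. The function name `stem_and_leaf` and the `scores` data are hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (tens and above) to its sorted leaves (units digit)."""
    plot = defaultdict(list)
    for v in sorted(values):
        plot[v // 10].append(v % 10)
    return dict(plot)

scores = [12, 15, 21, 23, 23, 28, 34, 37, 41]
plot = stem_and_leaf(scores)

# Print every stem in range so that gaps in the data stay visible
for stem in range(min(plot), max(plot) + 1):
    leaves = " ".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem} | {leaves}")
```

Each original value can be read back from its stem and leaf, which is what distinguishes this plot from a histogram.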

18
Q

When to use Stem and leaf plot?

A

Alternative to histograms but allows for inspection of individual values

Easy to produce by hand and retains all data so useful for quick check

Less “professional” looking and display of all digits can be distracting or confusing so less commonly used for presentations

More useful when there are fewer than 100 data values

19
Q

(What is) Box plot?

A

Median

25% or first (lower) and 75% or third (upper) quartiles

Lower whisker: smallest non-outlier = larger of min or (first quartile - 1.5 x IQR)

Upper whisker: largest non-outlier = smaller of max or (third quartile + 1.5 x IQR)

Outliers (anything beyond whiskers)
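
The components above can be computed numerically as a rough sketch. Two assumptions are worth flagging: the quartiles follow the default convention of `statistics.quantiles` (other packages may give slightly different values), and the whiskers are drawn to the most extreme observations that lie within 1.5 x IQR of the quartiles, which is one common convention. The function name `box_plot_summary` and the data are hypothetical.

```python
from statistics import quantiles

def box_plot_summary(values):
    """Return the quantities a box plot displays, using 1.5 x IQR fences."""
    data = sorted(values)
    q1, med, q3 = quantiles(data, n=4)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Observations within the fences; whiskers end at the extremes of these
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    return {
        "median": med,
        "q1": q1,
        "q3": q3,
        "lower_whisker": min(inside),
        "upper_whisker": max(inside),
        "outliers": [v for v in data if v < lower_fence or v > upper_fence],
    }

summary = box_plot_summary([1, 3, 4, 5, 5, 6, 7, 8, 9, 30])
print(summary)
```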

20
Q

When to use box plots?

A

useful for both large and small data set

excellent tool for conveying the centre, spread and shape of data distributions, as well as outliers, especially when comparing different groups of data

part of EDA to determine normality and outliers

disadvantage: does not convey gaps in data distribution

21
Q

Scatter plot (What is and when to use?)

A

Display relationship/association between 2 continuous variables

Show features of the relationship

  • strength
  • shape (linear/curve)
  • direction
  • outliers
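
As a numerical companion to the plot (an addition not covered on the card), Pearson's correlation coefficient summarizes the strength and direction of a linear relationship. The sketch below computes it from first principles; the `hours` and `score` data are hypothetical. It does not capture curved shapes or reveal outliers, so it complements the scatter plot rather than replacing it.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: sign gives direction, magnitude gives linear strength."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: hours studied vs test score
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 70]
r = pearson_r(hours, score)
print(round(r, 3))
```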
22
Q

3 points to mention when describing shape of distribution

A

symmetry (compare mean and median) vs skewness

kurtosis

number of modes

23
Q

3 types of kurtosis (pointiness/peakedness and tailedness)

A

leptokurtic: slender; +ve kurtosis; higher peak and heavier tails (many scores at tails)
mesokurtic: bell-shaped
platykurtic: broad; -ve kurtosis; flatter peak, lighter tails
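
The sign convention above can be checked with a small sketch based on moments. It assumes the simple population (biased) moment estimators and reports excess kurtosis (kurtosis minus 3, so a normal distribution scores about 0); statistical packages often apply sample-size corrections, so their values may differ slightly. The function name `shape_moments` and both data sets are hypothetical.

```python
def shape_moments(values):
    """Return (skewness, excess kurtosis) from population central moments."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n   # 2nd central moment (variance)
    m3 = sum((v - m) ** 3 for v in values) / n   # 3rd central moment
    m4 = sum((v - m) ** 4 for v in values) / n   # 4th central moment
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3
    return skewness, excess_kurtosis

flat = [1, 2, 3, 4, 5, 6, 7, 8, 9]       # evenly spread values
peaked = [5, 5, 5, 5, 5, 5, 5, 0, 10]    # tall peak with two extreme values

flat_skew, flat_kurt = shape_moments(flat)
peaked_skew, peaked_kurt = shape_moments(peaked)
print(flat_kurt, peaked_kurt)
```

Both samples are symmetric (skewness of zero); the evenly spread one gives negative excess kurtosis (platykurtic) and the peaked one gives positive excess kurtosis (leptokurtic).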