Lecture 2 - Descriptive Statistics & Exploratory Data Analysis Flashcards

Question 1

Q

What is Descriptive Statistics?

Answer

A

Describes data using numerical and graphical methods to summarize and analyze them in a clear and understandable way

Question 2

Q

What is Inferential Statistics?

Answer

A

Draws inferences about a larger population from a sample

Question 3

Q

What is EDA?

Answer

A

An approach to data analysis to gain useful insight into the data

Inspect data without any assumptions

1st step of data analysis

Use of both numerical and graphical methods to describe data distribution

Question 4

Q

What are the 4 functions/benefits/aims/purposes of EDA?

Answer

A

Provide descriptive statistics by

Giving an overall view of data
Help to visualize distributions and relationships

Detect anomalies including outliers

Assess the assumptions for confirmatory analysis (i.e. inferential statistics/tests)

Helps to decide appropriate confirmatory analysis

Question 5

Q

What are the 2 methods of EDA, and how do they compare in terms of benefits/usefulness/aims?

Answer

A

Numerical methods are more precise and objective

Graphical methods are better for identifying patterns in the data

Question 6

Q

What are the 5 key characteristics of a data distribution to explore? Describe them (simply)

Answer

A

Centre: middle of the range of values, where the most frequently occurring data values are often found

Spread: amount of variability in the values away from the centre

Shape: includes number of peaks and symmetry of distribution

Gaps: segments within the range from minimum to maximum data values without data value(s)

Outliers: data values that are very different from all other values (extremes)

Question 7

Q

What is Numerical/Summary Statistics?

Answer

A

Numerical values that describe

Central tendency of data
Dispersion/variability of data

Question 8

Q

What is Central tendency of data?

Answer

A

Mean: arithmetic average = sum of all values/number of values (observations)

Median: middle observation (50th percentile)

Mode: most frequent observation

Alone not enough to describe data distribution
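
The three measures above can be checked on a small sample. This is a minimal sketch using Python's standard `statistics` module; the sample values are hypothetical and chosen only for illustration.

```python
import statistics

# Hypothetical sample of 7 observations (illustrative values only)
values = [2, 3, 3, 5, 7, 8, 14]

mean = statistics.mean(values)      # sum of all values / number of values
median = statistics.median(values)  # middle observation after sorting
mode = statistics.mode(values)      # most frequent observation

print(mean, median, mode)
```

Here the mean (6), median (5) and mode (3) all differ, which is one reason a single measure of centre is not enough to describe a distribution.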

Question 9

Q

When to use mean, median and mode?

Answer

A

Mean

preferred and widely used
quantitative (interval & ratio) data
considers both number of values and values themselves
sensitive to extreme values (outliers)

Median

considers number of values and rank order of values
robust to outliers
ordinal, quantitative data that are highly skewed

Mode

seldom used
nominal data
considers only frequency
robust to outliers but ignores the number of values, their rank order and every value other than the most frequent one
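
The sensitivity of the mean, and the robustness of the median, can be seen by adding one extreme value to a small sample. A minimal sketch with hypothetical numbers:

```python
import statistics

# Hypothetical sample, then the same sample with one extreme value added
data = [10, 12, 13, 15, 20]
with_outlier = data + [200]

mean_before = statistics.mean(data)              # 70 / 5
median_before = statistics.median(data)          # middle of 5 sorted values
mean_after = statistics.mean(with_outlier)       # 270 / 6
median_after = statistics.median(with_outlier)   # average of the 2 middle values

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```

The single outlier moves the mean from 14 to 45, while the median only moves from 13 to 14.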

Question 10

Q

2 misuses of descriptive statistics

Answer

A

Inappropriate choice of measure of central tendency

Reporting absolute values without percentages or relative measures (or the reverse); need to consider the context and other information

Question 11

Q

What are the measures of variability/dispersion of data? Give some of the respective equations/formulas too

Answer

A

Range: difference between the maximum and minimum values

Interquartile range: difference between the values at the 75th and 25th percentile points

Variance: average of squared deviations from the mean

[sum (X-mean)^2]/(N - 1) = SS/(N - 1), where SS is the sum of squares, and N - 1 because of the reduced degree of freedom for a sample
Why sum of squared deviations and not sum of deviations? Ans: The mean is the balance point of the data values, so the +ve deviations (values above the mean) and the -ve deviations (values below the mean) cancel out exactly. Therefore, the sum of deviations is always zero

Standard deviation: Square root of variance or Square root of average of squared deviations from the mean i.e. {[sum (X-mean)^2]/(N - 1)}^0.5 = [SS/(N - 1)]^0.5

Coefficient of variation (CV): measure of relative dispersion

If 2 or more distributions to be compared are expressed in the same units and have similar means, then their variability can be compared directly by comparing their SD
But if the means are very different or the distributions are expressed in different units, then another method is needed
e.g. mean of 1 (SD 1) vs mean of 100 (SD 1): same SD, but the CV is 100% vs 1%
Normalize SD to mean to compare 2 or more distributions
CV = (SD/mean) x 100%
No unit and expressed as a percentage
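
The formulas above can be traced step by step on a small sample. This is a sketch only: the sample values and the helper name `coefficient_of_variation` are hypothetical, and the quartiles come from `statistics.quantiles` with its default method, which may differ slightly from the convention used in the lecture.

```python
import statistics

# Hypothetical sample (illustrative values only)
values = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(values)
mean = sum(values) / n
deviations = [x - mean for x in values]
sum_of_deviations = sum(deviations)       # +ve and -ve deviations cancel out
ss = sum(d ** 2 for d in deviations)      # SS, the sum of squares
variance = ss / (n - 1)                   # sample variance = SS/(N - 1)
sd = variance ** 0.5                      # standard deviation
data_range = max(values) - min(values)    # range = max - min

# Quartile conventions differ between textbooks and software packages;
# the default ("exclusive") method of statistics.quantiles is assumed here.
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

def coefficient_of_variation(sd, mean):
    """CV = (SD/mean) x 100, expressed as a percentage."""
    return sd / mean * 100

print("sum of deviations:", sum_of_deviations)
print("SS:", ss, "variance:", variance, "SD:", sd)
print("range:", data_range, "IQR:", iqr)
print("CV (%):", coefficient_of_variation(sd, mean))
# The example from the card: same SD, very different means
print(coefficient_of_variation(1, 1), coefficient_of_variation(1, 100))
```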

Question 12

Q

When to use range, interquartile range, variance and standard deviation?

Answer

A

Range

Only considers 2 data points (max and min) but easy to calculate
Not often reported, but sensitive to extreme values, so it gives an idea of the presence of outliers when read alongside the mean and SD
Reporting the max and min themselves is an alternative and is preferred

IQR

Considers only 2 data points (the 25th and 75th percentiles), which bound the middle 50% of the data
Used when data are ordinal, or when quantitative data are highly skewed; analogous to median

Variance

Accurate and detailed estimate of variability
most commonly used measure of “spread” for statistical calculations but usually not reported as a statistic for variability

Standard Deviation

Accurate and detailed estimate of variability
Preferred and commonly reported because same unit as mean unlike that of variance

Question 13

Q

For quantitative, ordinal and nominal variables, which measures of central tendency and measures of variability should we use?

Answer

A

Quantitative: mean & SD or median (when data is highly skewed) & IQR

Ordinal: median & IQR

Nominal: mode (no measure of variability)

Question 14

Q

What are graphical methods?

Answer

A

Help to visualize the 5 key characteristics of data distribution

Include

histogram
box plot
scatter plot
stem and leaf plot

Very important though often not reported in literature

Not the same as graphs presented in results and discussion in inferential statistics

Question 15

Q

(What are) histograms?

Answer

A

Commonly used for quantitative variables (the analogous plot for qualitative variables is the bar chart)

Constructed from simple or grouped frequency distribution
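
A grouped frequency distribution, which is what a histogram draws, can be sketched as below. The function name `grouped_frequencies`, the bin width and the sample values are all hypothetical choices for illustration.

```python
def grouped_frequencies(values, bin_width, start):
    """Count how many values fall in each bin [lower, lower + bin_width)."""
    n_bins = int((max(values) - start) // bin_width) + 1
    counts = [0] * n_bins
    for x in values:
        counts[int((x - start) // bin_width)] += 1
    # Empty bins are kept so that gaps in the distribution stay visible
    return [(start + i * bin_width, c) for i, c in enumerate(counts)]

# Hypothetical data (illustrative values only)
values = [12, 15, 21, 22, 25, 28, 31, 33, 34, 57]
table = grouped_frequencies(values, bin_width=10, start=10)

# Text histogram: one '*' per observation in the bin
for lower, count in table:
    print(f"{lower:>3}-{lower + 9:<3} {'*' * count}")
```

With a bin width of 10 the empty 40-49 bin shows a gap and 57 stands apart from the rest; a different bin width would change how these features appear.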

Question 16

Q

When to use histograms?

Answer

A

Useful for frequency distribution
- good for presentation as it is familiar to most people

Part of EDA to determine normality and outliers
- can display all characteristics of data distributions if a suitable bin size is selected
- particularly useful when there are more than 100 data values
- not reliable if the sample size is too small, as it may show spurious patterns
(too few data points to show the distribution accurately, so the histogram may be distorted)

Question 17

Q

(What are) Stem and leaf plots?

Answer

A

Shows shape and distribution

Shows all the data (no information is lost; allows for inspection of individual values)
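
A stem and leaf plot can be built by splitting each value into a stem (tens) and a leaf (units). This is a minimal sketch for non-negative whole numbers; the function name `stem_and_leaf` and the sample values are hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (tens digit) to its sorted leaves (units digits)."""
    plot = defaultdict(list)
    for x in sorted(values):
        stem, leaf = divmod(x, 10)
        plot[stem].append(leaf)
    # Keep empty stems so that gaps in the data remain visible
    return {stem: plot.get(stem, []) for stem in range(min(plot), max(plot) + 1)}

# Hypothetical data (illustrative values only)
values = [12, 15, 21, 22, 25, 28, 31, 33, 34, 57]
plot = stem_and_leaf(values)

for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

Every original value can be read back from the plot (stem 2 with leaf 8 is 28), which a histogram of the same data would not allow.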

Question 18

Q

When to use Stem and leaf plot?

Answer

A

Alternative to histograms but allows for inspection of individual values

Easy to produce by hand and retains all data so useful for quick check

Less “professional” looking, and the display of all digits can be distracting or confusing, so less commonly used for presentations

More useful when there are fewer than 100 data values

Question 19

Q

(What is) Box plot?

Answer

A

Median

25% or first (lower) and 75% or third (upper) quartiles

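Lower whisker: smallest non-outlier = smallest data value that is not below (first quartile - 1.5 x IQR); equals the min when there are no low outliers

Upper whisker: largest non-outlier = largest data value that is not above (third quartile + 1.5 x IQR); equals the max when there are no high outliers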

Outliers (anything beyond whiskers)
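
The components of a box plot can be computed directly from the 1.5 x IQR rule. This is a sketch under assumptions: the function name `box_plot_summary` and the sample values are hypothetical, and the quartiles use the default method of `statistics.quantiles`, so other software may give slightly different quartiles.

```python
import statistics

def box_plot_summary(values):
    """Return the median, quartiles, whiskers and outliers of a sample."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme data values still inside the fences
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_whisker": min(inside),
        "upper_whisker": max(inside),
        "outliers": outliers,
    }

# Hypothetical data with one extreme value (illustrative values only)
summary = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 50])
print(summary)
```

In this sample the lower whisker is the minimum (1), the upper whisker stops at 10, and 50 is flagged as an outlier.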

Question 20

Q

When to use box plots?

Answer

A

useful for both large and small data set

excellent tool for conveying the centre, spread, shape of and outliers in data distributions, especially when comparing different groups of data

part of EDA to determine normality and outliers

disadvantage: does not convey gaps in data distribution

Question 21

Q

Scatter plot (What is and when to use?)

Answer

A

Display relationship/association between 2 continuous variables

Show features of the relationship

strength
shape (linear/curve)
direction
outliers

Question 22

Q

3 points to mention when describing shape of distribution

Answer

A

symmetry (compare mean and median) vs skewness

kurtosis

number of modes

Question 23

Q

3 types of kurtosis (pointiness/peakedness and tailedness)

Answer

A

leptokurtic: slender; +ve kurtosis; higher peak and heavier tails (many scores at tails)
mesokurtic: bell-shaped
platykurtic: broad; -ve kurtosis; flatter peak, lighter tails
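
The sign of the kurtosis can be checked numerically. The sketch below uses the simple moment-based formulas (excess kurtosis = fourth moment / variance squared - 3, so a normal distribution scores about 0); this choice of formula is an assumption, since statistical packages often apply sample-size corrections and may report slightly different values. The function name and the data are hypothetical.

```python
def shape_statistics(values):
    """Return (skewness, excess kurtosis) from central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n   # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n   # third central moment
    m4 = sum((x - mean) ** 4 for x in values) / n   # fourth central moment
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3
    return skewness, excess_kurtosis

# Hypothetical samples (illustrative values only)
flat_skew, flat_kurt = shape_statistics([1, 2, 3, 4, 5])      # symmetric, flat
right_skew, right_kurt = shape_statistics([1, 1, 1, 2, 10])   # long right tail

print(flat_skew, flat_kurt)
print(right_skew, right_kurt)
```

The evenly spread sample has zero skewness and negative excess kurtosis (platykurtic), while the sample with a long right tail has positive skewness.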

(23 cards)