L03 Descriptive Stats Flashcards

1
Q

Descriptive statistics

A

Describe data through tables and graphs

Summarize through measures of central tendency and measures of spread

2
Q

Two types of data

A

Discrete - set of fixed values (ordinal)
Continuous - any fractional value within a given range
Interval and ratio data: can be either type

3
Q

Represent frequencies of occurrence - nominal data

A

Frequency table or graph (bar graph; y axis n or %)

4
Q

Represent frequencies of occurrence - discrete data

A

n or %; cumulative n or cumulative %
Frequency/Cumulative frequency table
Graph - bar graph
Use frequency ranges when there are too many distinct values
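
As an illustration of the table described on this card, the following minimal sketch builds a frequency and cumulative frequency table for a small set of discrete scores. The data and the function name `frequency_table` are made up for the example.

```python
from collections import Counter

# Hypothetical discrete scores (illustrative data, not from the flashcards).
scores = [2, 3, 3, 4, 4, 4, 5, 5, 2, 3]

def frequency_table(values):
    """Return rows of (value, n, %, cumulative n, cumulative %)."""
    counts = Counter(values)
    total = len(values)
    rows = []
    cumulative_n = 0
    for value in sorted(counts):
        n = counts[value]
        cumulative_n += n
        rows.append(
            (value, n, 100 * n / total, cumulative_n, 100 * cumulative_n / total)
        )
    return rows

table = frequency_table(scores)
for row in table:
    print(row)
```

The last row's cumulative n equals the number of scores and its cumulative % is 100, which is a quick check that the table is complete.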

5
Q

Represent frequencies of occurrence - continuous data

A

Frequency table/ Cumulative frequency table
Graph - histogram
- frequency diagram/line chart/frequency polygon
Frequency ranges

6
Q

Frequency ranges

A

Used when listing the frequency of every possible score is not feasible
Ranges or intervals depending on number of samples
More ranges: better visualisation
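
A minimal sketch of grouping continuous values into frequency ranges, assuming equal-width, half-open ranges `[lo, lo + width)`. The heights, the range width, and the function name `bin_frequencies` are assumptions chosen for illustration.

```python
from collections import Counter

def bin_frequencies(values, width, start):
    """Count how many values fall in each half-open range [lo, lo + width)."""
    counts = Counter()
    for v in values:
        index = int((v - start) // width)   # which range the value belongs to
        lo = start + index * width
        counts[(lo, lo + width)] += 1
    return dict(sorted(counts.items()))

# Hypothetical heights in cm (illustrative data only).
heights = [151.2, 158.9, 160.0, 163.4, 167.8, 169.9, 171.5, 178.0]
binned = bin_frequencies(heights, width=10, start=150)
print(binned)
```

Changing `width` changes how many ranges appear, which is the trade-off the card describes: narrower ranges show more detail, but each range holds fewer scores.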

7
Q

Central tendency

A

Summary of data through a single value that reflects the centre of distribution of data
3 measures: mean, median, mode
Important in comparing two populations

8
Q

Mode

A

Most common category or score - that occurs most frequently

Generally used only for nominal data

9
Q

Median

A

Middle score/value/category when all values are placed in ascending order
Best for ordinal data
Also used for skewed interval/ratio data (insensitive to outliers)

10
Q

Mean

A

Sum of all scores divided by the number of scores
Influenced heavily by outliers/extreme scores
Best for normally distributed data
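
The three measures of central tendency can be compared with Python's standard `statistics` module. This is a sketch on made-up data; the second list replaces one score with an extreme value to show how the mean reacts while the median does not.

```python
import statistics

# Hypothetical scores (illustrative data only).
base = [2, 3, 3, 4, 5, 6, 5]
with_outlier = [2, 3, 3, 4, 5, 6, 40]   # last score replaced by an extreme value

mean_base = statistics.mean(base)
median_base = statistics.median(base)
modes_base = statistics.multimode(base)   # list of all most-frequent values

mean_outlier = statistics.mean(with_outlier)
median_outlier = statistics.median(with_outlier)

print(mean_base, median_base, modes_base)
print(mean_outlier, median_outlier)
```

In this example the mean moves from 4 to 9 when the outlier is introduced, while the median stays at 4. `multimode` returns two values for `base`, an example of a bimodal result.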

11
Q

Mode pros and cons

A

+ can be used for categorical data
+ Always gives a real data value
+ Not affected by extremes
- can be more than one value (bimodal, multimodal)
- varies depending on bin size
- can be affected by a small number of cases

12
Q

Median pros and cons

A

+ Insensitive to outliers
+ Less affected by skew than the mean
+ Often gives real data value
- Ignores a lot of data
- Not easy to calculate without a computer
- Cannot be used in further calculations
- more affected by sampling fluctuations

13
Q

Mean pros and cons

A

+ Uses all the data
+ tends to be stable in different samples
- Very sensitive to outliers and skews
- Doesn’t always give a meaningful value

14
Q

Measures of spread or Dispersion

A

How much the values in a dataset vary around the measure of central tendency

15
Q

Measure of spread of Mode

A

None

16
Q

Measure of spread of Median

A

Distance-based measures of spread

Range, Interquartile range

17
Q

Measures of spread of Mean

A

Centre-based measures of spread

Variance, Standard deviation

18
Q

Distance-based measures of spread

A

Report these with median
Range
Interquartile range

19
Q

Range

A

Highest value - Lowest value

Very sensitive to outliers

20
Q

Interquartile range

A

Range of middle 50% of scores

Q3 - Q1

21
Q

Quartile

A

Lowest score needed to be included in a given quarter of the population
(Cut the set down the median Q2 and find medians of values to left and right - median not included)

22
Q

Semi-quartile range

Mid-quartile range

A
Semi-QR = IQR/2 = (Q3 - Q1)/2
Mid-QR =  (Q3 + Q1)/2
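
The distance-based measures from the preceding cards can be sketched as follows. The quartiles use the split-at-the-median method described on the Quartile card (median excluded from both halves when n is odd); other quartile conventions exist and can give slightly different values. The data and the function name `quartiles` are illustrative assumptions.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) by splitting the ordered data at the median.

    For an odd number of values the median itself is left out of both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)

# Hypothetical scores (illustrative data only).
scores = [1, 3, 4, 6, 7, 8, 10, 12, 15]

q1, q2, q3 = quartiles(scores)
value_range = max(scores) - min(scores)   # highest value - lowest value
iqr = q3 - q1                             # range of the middle 50% of scores
semi_qr = iqr / 2                         # (Q3 - Q1) / 2
mid_qr = (q3 + q1) / 2                    # (Q3 + Q1) / 2

print(q1, q2, q3, value_range, iqr, semi_qr, mid_qr)
```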
23
Q

IQR pros and cons

A
(Like median)
+ Less sensitive to outliers
+ Few assumptions
- Hard to calculate by hand for large datasets
- Doesn't use all the data
24
Q

Centre-based measures of dispersion

A

Variance σ^2

Standard deviation σ

25
Deviance
Also known as error
Distance of each score from the mean
$\text{deviance} = x_i - \bar{x}$
26
Total deviance
Sum of deviances
Positive and negative deviances cancel each other out (the sum is always 0), so it cannot be used as a measure of spread
Use SS instead
27
Sum of Squares SS
Sum of squared errors for the mean
Total amount of error in the mean, or an indication of total dispersion
$SS = \sum_i (x_i - \bar{x})^2$ (sum of squared deviances)
28
Variance
s^2 or σ^2 = SS/N
Average dispersion, or average squared distance of scores from the mean
Units are the original units of measurement, squared
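
The chain from deviance to variance can be worked through on a small example. This sketch uses made-up scores and follows the card's definition of variance as SS/N.

```python
# Hypothetical scores chosen so the arithmetic is easy to follow.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(scores)
mean = sum(scores) / n                    # 40 / 8

# Deviance (error) of each score from the mean: x_i - mean
deviances = [x - mean for x in scores]

# Total deviance: positive and negative deviances cancel out
total_deviance = sum(deviances)

# Sum of squares: squaring removes the sign before summing
ss = sum(d ** 2 for d in deviances)

# Variance as defined on the card: SS / N (in squared units)
variance = ss / n

# Standard deviation: square root of variance, back in the original units
sd = variance ** 0.5

print(total_deviance, ss, variance, sd)
```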
29
Variance pros and cons
+ Uses all the data
+ Forms basis of several tests like t-test
- More sensitive to outliers
- Requires a normal distribution
- Doesn't have a sensible unit
30
Standard deviation
Square root of variance
Converted back to the original units of measurement of the scores
Shows how widely spread the data are around the mean
31
Standard deviation symbols
s for sample
σ for population
$\hat{\sigma}$ (sigma-hat) for population estimate
32
Estimated Standard Deviation of the population
$\hat{\sigma} = \sqrt{SS / (N - 1)}$, where N - 1 is the degrees of freedom
A sample tends to be less variable than the population it comes from, so dividing by N - 1 instead of N avoids a downward bias
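
A short sketch contrasting division by N with division by N - 1, using made-up scores. The standard library's `statistics.pstdev` and `statistics.stdev` use the N and N - 1 divisors respectively, so they serve as a cross-check here.

```python
import math
import statistics

# Hypothetical sample (illustrative data only).
scores = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(scores)
mean = sum(scores) / n
ss = sum((x - mean) ** 2 for x in scores)

s = math.sqrt(ss / n)                  # sample SD: divide SS by N
sigma_hat = math.sqrt(ss / (n - 1))    # population estimate: divide SS by N - 1

print(s, sigma_hat)
print(statistics.pstdev(scores), statistics.stdev(scores))
```

The estimate with N - 1 is always slightly larger than the N version, and the gap shrinks as the sample grows.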
33
Mean symbols
For sample: x̄ (x bar)
For population: μ (mu)
34
Population
The set of all statistical units sharing at least one common property that is of interest to a statistical analysis
35
Sample
A smaller subset of the population from which we collect data, which we then use to infer things about the whole population
36
Degrees of freedom
"Freedom to vary"
The number of independent values that can vary in an analysis without breaking any constraints
Related to N; the exact relationship depends on the test used
Generally, with one group/sample, df = N - 1
With two samples, df = N - 1 - 1 = N - 2
37
Standard Error of a Statistic
Standard deviation (or estimate of the standard deviation) of its sampling distribution
38
Standard Error of the Mean - definition
Standard deviation of the sampling distribution of the sample mean
How representative x̅ is likely to be of the population, or how far it is likely to be from the true μ
39
Standard Error of Mean - calculate
Standard deviation divided by the square root of n
$SE_{\bar{x}} = s / \sqrt{n}$
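
A minimal sketch of the calculation on made-up scores. One assumption to flag: the card writes s, and this example uses the N - 1 version of the standard deviation (`statistics.stdev`), which is a common choice when estimating from a sample; swap in `statistics.pstdev` to divide by N instead. The function name is hypothetical.

```python
import math
import statistics

def standard_error_of_mean(values):
    """Standard error of the mean: s / sqrt(n).

    Assumption: s is the N - 1 standard deviation (statistics.stdev).
    """
    s = statistics.stdev(values)
    return s / math.sqrt(len(values))

# Hypothetical sample (illustrative data only).
scores = [2, 4, 4, 4, 5, 5, 7, 9]
se = standard_error_of_mean(scores)
print(se)
```

Because n sits under a square root in the denominator, quadrupling the sample size only halves the standard error for the same s.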
40
Interpret $SE_{\bar{x}}$
Large SE relative to x̅ - a lot of variability between means of different samples, so x̅ may not be representative of μ
41
Normal distribution
Frequency distribution is a symmetrical bell curve
Majority of scores around centre; decreased frequency with deviation from centre
Mean = Median = Mode for a perfectly normal distribution
We assume most data approaches normal distribution with a large enough sample size
Many statistical tests have assumptions of normality
42
Types of deviation from normality
Skew (lack of symmetry)
Kurtosis (pointiness of the graph)
43
Skew
Lack of symmetry
Scores cluster at one end of the distribution
Positive skew - more scores at lower end
Negative skew - more scores at higher end
44
Kurtosis
Degree to which scores cluster in the tails of the distribution
Leptokurtic or positive kurtosis - pointier graph, heavy tailed
Platykurtic or negative kurtosis - flat graph, light tailed
45
z-score - what?
How many standard deviations a specific score lies from the mean
Converting every score in a data set to a z-score gives a data set with a mean of 0 and a standard deviation of 1
Used to compare scores from different samples by standardizing them
46
Interpret z-score
Sign: whether the score is above or below the mean
Magnitude: how many standard deviations the score is from the mean
47
z-score formula
$z_i = \dfrac{x_i - \bar{x}}{s}$
Deviance divided by the sample standard deviation
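
A sketch of standardizing a small set of made-up scores. It assumes s is computed as the square root of SS/N, as on the Variance card, and that the scores are not all identical (otherwise s is 0 and the division fails). The function name `z_scores` is hypothetical.

```python
def z_scores(values):
    """Convert each score to (x_i - mean) / s, with s = sqrt(SS / N)."""
    n = len(values)
    mean = sum(values) / n
    s = (sum((x - mean) ** 2 for x in values) / n) ** 0.5
    return [(x - mean) / s for x in values]

# Hypothetical scores (illustrative data only).
scores = [2, 4, 4, 4, 5, 5, 7, 9]
z = z_scores(scores)
print(z)

# The standardized scores have mean 0 and standard deviation 1.
z_mean = sum(z) / len(z)
z_sd = (sum((v - z_mean) ** 2 for v in z) / len(z)) ** 0.5
print(z_mean, z_sd)
```

A score of 9 becomes z = 2.0 (two standard deviations above the mean) and a score of 2 becomes z = -1.5 (one and a half below).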
48
Distribution of z-scores around mean
In a normal distribution:
± 1 SD - 68% of data (falls within)
± 2 SD - 95% of data
± 3 SD - 99.7% of data
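
These percentages can be checked against the standard normal distribution using `statistics.NormalDist` from the Python standard library. This is a sketch for a perfectly normal distribution; real data will only approximate these proportions. The helper name `proportion_within` is hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def proportion_within(k):
    """Proportion of a normal distribution lying within ± k SD of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(100 * proportion_within(k), 1))
```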
49
Positive skew
more scores at lower end
tail on higher end
mean > median
50
Negative skew
tail on lower end
more scores at higher end
mean < median
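
The mean-versus-median relationship on the two skew cards can be seen on a small made-up example. The second list is a mirror image of the first, so its tail points the other way. This illustrates the typical pattern on simple data rather than a rule that holds for every possible distribution.

```python
from statistics import mean, median

# Hypothetical data with a long tail on the higher end (positive skew).
positively_skewed = [1, 2, 2, 3, 3, 3, 4, 20]

# Mirror image of the same data: long tail on the lower end (negative skew).
negatively_skewed = [21 - x for x in positively_skewed]

pos_mean, pos_median = mean(positively_skewed), median(positively_skewed)
neg_mean, neg_median = mean(negatively_skewed), median(negatively_skewed)

print(pos_mean, pos_median)   # mean pulled above the median by the high tail
print(neg_mean, neg_median)   # mean pulled below the median by the low tail
```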
51
Leptokurtic
Pointy
Positive kurtosis
Heavy tailed
52
Platykurtic
Flat
Negative kurtosis
Light tailed