L03 Descriptive Stats Flashcards

1
Q

Descriptive statistics

A

Describe data through tables and graphs

Summarize through measures of central tendency and measures of spread

2
Q

Two types of data

A

Discrete - set of fixed values (ordinal)
Continuous - any fractional value within a given range
Interval and ratio: either type

3
Q

Represent frequencies of occurrence - nominal data

A

Frequency table or graph (bar graph; y axis n or %)

4
Q

Represent frequencies of occurrence - discrete data

A

n or %; cumulative n or cumulative %
Frequency/Cumulative frequency table
Graph - bar graph
Frequency ranges for too many values
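
A minimal sketch of a frequency / cumulative frequency table for discrete scores. The function name `frequency_table` and the scores are made up for illustration.

```python
from collections import Counter


def frequency_table(scores):
    """Return rows of (value, n, %, cumulative n, cumulative %)."""
    counts = Counter(scores)
    total = len(scores)
    rows = []
    cum_n = 0
    for value in sorted(counts):
        n = counts[value]
        cum_n += n  # running total up to and including this value
        rows.append((value, n, 100 * n / total, cum_n, 100 * cum_n / total))
    return rows


scores = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]  # made-up discrete scores
for row in frequency_table(scores):
    print(row)
```

The last row always ends at cumulative n = N and cumulative % = 100.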

5
Q

Represent frequencies of occurrence - continuous data

A

Frequency table/ Cumulative frequency table
Graph - histogram
- frequency diagram/line chart/frequency polygon
Frequency ranges

6
Q

Frequency ranges

A

Used when listing the frequency of every possible score is not feasible
Ranges or intervals depending on number of samples
More ranges: better visualisation
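
A rough sketch of grouping continuous values into equal-width frequency ranges. The helper `binned_frequencies`, the heights, and the choice to put out-of-range values into the nearest range are all assumptions made for this example.

```python
from collections import Counter


def binned_frequencies(values, low, width, n_bins):
    """Count values in ranges [low + k*width, low + (k+1)*width)."""
    counts = Counter()
    for v in values:
        k = int((v - low) // width)
        # Assumption: values outside the ranges go into the nearest range,
        # so a value equal to the top edge lands in the last range.
        k = min(max(k, 0), n_bins - 1)
        counts[k] += 1
    return [((low + k * width, low + (k + 1) * width), counts[k])
            for k in range(n_bins)]


heights = [150.2, 151.0, 155.5, 159.9, 160.0, 164.3, 170.0]  # made-up data
print(binned_frequencies(heights, low=150, width=10, n_bins=2))
print(binned_frequencies(heights, low=150, width=5, n_bins=4))
```

Running it with a smaller `width` shows how more, narrower ranges change the picture of the same data.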

7
Q

Central tendency

A

Summary of data through a single value that reflects the centre of distribution of data
3 measures: mean, median, mode
Important in comparing two populations

8
Q

Mode

A

Most common category or score - that occurs most frequently

Generally used only for nominal data

9
Q

Median

A

Middle score/value/category when all values are placed in ascending order
Best for ordinal data
Also used for skewed interval/ratio data (insensitive to outliers)

10
Q

Mean

A

Sum of all scores divided by the number of scores
Influenced heavily by outliers/extreme scores
Best for normally distributed data
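
A small illustration of the three measures using Python's `statistics` module, with made-up scores. Adding one extreme score moves the mean a lot, the median a little, and the mode not at all.

```python
import statistics

scores = [2, 3, 3, 4, 5]        # made-up scores
with_outlier = scores + [40]    # same scores plus one extreme value

for label, data in [("original", scores), ("with outlier", with_outlier)]:
    print(label,
          "mean =", statistics.mean(data),
          "median =", statistics.median(data),
          "mode =", statistics.mode(data))
```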

11
Q

Mode pros and cons

A

+ can be used for categorical data
+ Always gives a real data value
+ Not affected by extremes
- can be more than one value (bimodal, multimodal)
- varies depending on bin size
- can be affected by a small number of cases

12
Q

Median pros and cons

A

+ Insensitive to outliers
+ Less affected by skew than the mean
+ Often gives real data value
- Ignores a lot of data
- Not easy to calculate without a computer
- Not suited to further mathematical calculations
- more affected by sampling fluctuations

13
Q

Mean pros and cons

A

+ Uses all the data
+ tends to be stable in different samples
- Very sensitive to outliers and skews
- Doesn’t always give a meaningful value

14
Q

Measures of spread or Dispersion

A

Variations in a dataset from the measure of central tendency

15
Q

Measure of spread of Mode

A

None

16
Q

Measure of spread of Median

A

Distance-based measures of spread

Range, Interquartile range

17
Q

Measures of spread of Mean

A

Centre-based measures of spread

Variance, Standard deviation

18
Q

Distance-based measures of spread

A

Report these with median
Range
Interquartile range

19
Q

Range

A

Highest value - Lowest value

Very sensitive to outliers

20
Q

Interquartile range

A

Range of middle 50% of scores

Q3 - Q1

21
Q

Quartile

A

Cut-off scores that divide the ordered data into four quarters (Q1, Q2, Q3)
(Split the ordered set at the median Q2, then find the medians of the values to the left (Q1) and to the right (Q3); the median itself is not included in either half)
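
A sketch of the median-split method described above, with made-up scores. The function name is hypothetical, and other quartile conventions (including those used by statistics software) can give slightly different values.

```python
import statistics


def quartiles_by_median_split(scores):
    """Return (Q1, Q2, Q3) by splitting the ordered scores at the median."""
    ordered = sorted(scores)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # With an odd number of scores the median itself is left out of both halves.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))


q1, q2, q3 = quartiles_by_median_split([1, 3, 5, 7, 9, 11, 13])
print("Q1 =", q1, "Q2 =", q2, "Q3 =", q3, "IQR =", q3 - q1)
```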

22
Q

Semi-quartile range

Mid-quartile range

A
Semi-QR = IQR/2 = (Q3 - Q1)/2
Mid-QR =  (Q3 + Q1)/2
23
Q

IQR pros and cons

A
(Like median)
+ Less sensitive to outliers
+ Few assumptions
- Hard to calculate by hand for large datasets
- Doesn't use all the data
24
Q

Centre-based measures of dispersion

A

Variance σ^2

Standard deviation σ

25
Q

Deviance

A

Aka Error
Distance of each score from mean
deviance = x_i - x̄

26
Q

Total deviance

A

Sum of deviances; positive and negative values cancel each other out (the sum is zero), so it cannot be used
Use SS instead

27
Q

Sum of Squares SS

A

Sum of Squared Errors for Mean
Total amount of error in mean or indication of total dispersion

SS = sum of (deviance^2) = Σ (x_i - x̄)^2
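
A short worked example with made-up scores, showing that the deviances sum to zero while the sum of squares does not.

```python
scores = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up scores

mean = sum(scores) / len(scores)
deviances = [x - mean for x in scores]    # x_i minus the mean
total_deviance = sum(deviances)           # positives and negatives cancel
ss = sum(d ** 2 for d in deviances)       # squaring removes the signs

print("mean =", mean)
print("deviances =", deviances)
print("total deviance =", total_deviance)
print("SS =", ss)
```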

28
Q

Variance

A

s^2 or σ^2 = SS/N
Average dispersion or distance of scores from mean
- Units are the original units of measurement squared

29
Q

Variance pros and cons

A
+ Uses all the data
+ Forms basis of several tests like t-test
- More sensitive to outliers
- Most meaningful when the distribution is roughly NORMAL
- Doesn't have a sensible unit
30
Q

Standard deviation

A

Square root of variance
Converted back to original units of measurement of the scores
How widely spread data are around mean

31
Q

Standard deviation symbols

A

s for sample
σ for population
σ̂ (sigma-hat) for population estimate

32
Q

Estimated Standard Deviation of the population

A

σ̂ = sqrt(SS / (N - 1)), where N - 1 is the degrees of freedom

Because a sample tends to be less variable than the population; dividing by N - 1 instead of N avoids a downwards bias
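
A sketch comparing the SS/N and SS/(N - 1) versions on made-up scores. The helper `spread_summary` is hypothetical; the standard library's `statistics.pstdev` (divides by N) and `statistics.stdev` (divides by N - 1) are used only as a cross-check.

```python
import math
import statistics


def spread_summary(scores):
    """Return (SS, variance SS/N, SD, estimated population SD using N - 1)."""
    n = len(scores)
    mean = sum(scores) / n
    ss = sum((x - mean) ** 2 for x in scores)
    variance = ss / n
    sd = math.sqrt(variance)
    sd_hat = math.sqrt(ss / (n - 1))  # larger than sd, correcting the downward bias
    return ss, variance, sd, sd_hat


scores = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up scores
ss, variance, sd, sd_hat = spread_summary(scores)
print("SS =", ss, "variance =", variance, "SD =", sd, "estimated SD =", sd_hat)
print("library check:", statistics.pstdev(scores), statistics.stdev(scores))
```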

33
Q

Mean symbols

A

For sample, x̄ x bar

For population μ mu

34
Q

Population

A

The set of all statistical units sharing at least one common property that is of interest to the statistical analysis

35
Q

Sample

A

A smaller subset of the population from which we collect data, used to infer things about the whole population

36
Q

Degrees of freedom

A

“freedom to vary”
The number of independent values that can vary in an analysis without breaking any constraints

Related to N, relationship depends on test used
Generally, with one group/sample, df = N-1
Two samples, df = N - 2 (one degree of freedom lost per sample)

37
Q

Standard Error of a Statistic

A

Standard deviation (or estimate of the standard deviation) of its sampling distribution

38
Q

Standard Error of the Mean - definition

A

Standard deviation of the sampling distribution of the sample mean

How likely x̄ is to be representative of the population
or how far it is likely to be from the true μ

39
Q

Standard Error of Mean - calculate

A

Standard deviation divided by square root of n

SE_x̄ = s / sqrt(n)
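
A minimal sketch of the calculation on made-up scores. As an assumption, `s` here is the N - 1 estimate from `statistics.stdev`; swap in `statistics.pstdev` if the SS/N version is intended.

```python
import math
import statistics


def standard_error_of_mean(scores):
    """Standard deviation divided by the square root of n."""
    s = statistics.stdev(scores)  # assumption: N - 1 estimate of the SD
    return s / math.sqrt(len(scores))


scores = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up scores
se = standard_error_of_mean(scores)
print("mean =", statistics.mean(scores), "SE =", round(se, 3))
```

Because n is in the denominator, the same spread of scores gives a smaller SE in a larger sample.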

40
Q

Interpret SE_x̄

A

Large SE relative to x̅ - lot of variability between means of different samples, so x̅ may not be representative of μ

41
Q

Normal distribution

A

Frequency distribution is a bell curve, symmetrical
Majority of scores around centre; decreased frequency with deviation from centre
Mean = Median = Mode for a perfectly normal distribution
With a large enough sample size, the sampling distribution of the mean is assumed to approach a normal distribution
Many statistical tests: assumptions of normality

42
Q

Types of deviation from normality

A

Skew (lack of symmetry)

Kurtosis (pointy graph)

43
Q

Skew

A

Lack of symmetry
Scores cluster at one end of the distribution

Positive skew - more scores at lower end

Negative skew - more scores at higher end

44
Q

Kurtosis

A

Degree to which scores cluster in the tails of the distribution

Leptokurtic or positive kurtosis: pointier graph - heavy tailed

Platykurtic or negative kurtosis - flat graph, light tailed
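
A rough sketch of one common way to put a number on this, moment-based excess kurtosis (fourth moment divided by squared variance, minus 3, so a normal distribution scores 0). The function name and both data sets are made up, and this is only one of several kurtosis formulas in use.

```python
def excess_kurtosis(scores):
    """Moment-based excess kurtosis: m4 / m2**2 - 3."""
    n = len(scores)
    mean = sum(scores) / n
    m2 = sum((x - mean) ** 2 for x in scores) / n
    m4 = sum((x - mean) ** 4 for x in scores) / n
    return m4 / m2 ** 2 - 3


flat = list(range(1, 10))                      # evenly spread scores
heavy_tailed = [-10, 0, 0, 0, 0, 0, 0, 0, 10]  # peak at centre, two extreme scores

print("flat:", excess_kurtosis(flat))                  # negative: platykurtic
print("heavy tailed:", excess_kurtosis(heavy_tailed))  # positive: leptokurtic
```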

45
Q

z-score - what?

A

How many standard deviations a specific score differs from the mean
Converting every score in a data set gives a new set with a mean of 0 and a standard deviation of 1; the converted scores are the z-scores
Used to compare scores from different samples by standardizing scores

46
Q

Interpret z-score

A

Sign: whether the score is above (+) or below (-) the mean

Value: how many standard deviations the score is away from the mean

47
Q

z-score formula

A

z_i = (x_i - x̄) / s

deviance divided by sample standard deviation
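
A small sketch of the conversion on made-up scores. As an assumption, `s` is computed as sqrt(SS/N), matching the variance card; use N - 1 in the denominator if the estimated SD is wanted.

```python
import math


def z_scores(scores):
    """Convert each score to (score - mean) / standard deviation."""
    n = len(scores)
    mean = sum(scores) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in scores) / n)  # assumption: SS/N
    return [(x - mean) / s for x in scores]


scores = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up scores, mean 5 and SD 2
z = z_scores(scores)
print(z)
print("mean of z =", sum(z) / len(z))
```

The score 9 becomes z = 2.0 (two standard deviations above the mean) and the score 2 becomes z = -1.5.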

48
Q

Distribution of z-scores around mean

A

± 1 SD - about 68% of data falls within (for a normal distribution)
± 2 SD - 95% of data
± 3 SD - 99.7% of data
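
These percentages can be checked against the standard normal distribution with `statistics.NormalDist` (Python 3.8+). The helper name `share_within` is made up.

```python
from statistics import NormalDist


def share_within(k):
    """Proportion of a normal distribution within k SDs of the mean."""
    standard_normal = NormalDist()  # mean 0, SD 1
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within ±{k} SD: {100 * share_within(k):.2f}%")
```

The ± 2 SD figure is slightly above 95%; exactly 95% corresponds to about ± 1.96 SD.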

49
Q

Positive skew

A

more scores at lower end
tail on higher end
mean > median

50
Q

Negative skew

A

tail on lower end
more scores at higher end
mean < median
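
A sketch checking the mean/median relationships from the two skew cards on made-up data. The `skewness` helper uses the moment-based coefficient (third moment divided by variance to the power 1.5), which is one of several definitions.

```python
import statistics


def skewness(scores):
    """Moment-based skewness: m3 / m2**1.5 (sign gives the direction of the tail)."""
    n = len(scores)
    mean = sum(scores) / n
    m2 = sum((x - mean) ** 2 for x in scores) / n
    m3 = sum((x - mean) ** 3 for x in scores) / n
    return m3 / m2 ** 1.5


positive = [1, 1, 2, 2, 2, 3, 3, 4, 10]   # tail on the higher end
negative = [11 - x for x in positive]     # mirror image: tail on the lower end

for label, data in [("positive", positive), ("negative", negative)]:
    print(label,
          "skew =", round(skewness(data), 2),
          "mean =", round(statistics.mean(data), 2),
          "median =", statistics.median(data))
```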

51
Q

Leptokurtic

A

Pointy
Positive kurtosis
Heavy tailed

52
Q

Platykurtic

A

Flat
Negative kurtosis
Light tailed