Statistics Theory L7 = Statistics Basics Flashcards
Basic principles of statistics? (4)
- PSDI diagram has statistical population parameters (μ, σ, β, δ) & the sample statistics that estimate them (x̄, s, b, cursive s).
- Not only do we want the estimate, but we also want a measure of how good that estimate is.
- We’re forced to look at the subset of the population, because the population is large & sometimes vaguely defined.
- What is the consequence of this “subsettedness” on our ability to get a clear view of the population of interest?
Scales of measurement? (4)
- Ratio.
- Interval.
- Ordinal.
- Nominal.
Ratio attributes? (3)
- Ratios are meaningful.
- Zero means absence.
- Can be continuous data, or discrete/integer.
Eg of how ratios are meaningful?
20 elephants = 0.5 x 40 elephants.
Interval attributes? (3)
- Differences are meaningful.
- Zero is arbitrary (compare degrees C & degrees F).
- Can be continuous data, or discrete/integer.
Eg of how differences can be meaningful?
30 degrees - 10 degrees = 20 degrees.
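A small sketch of why differences, but not ratios, are meaningful on an interval scale, using the 30 and 10 degree values from the card above. The helper name `c_to_f` is an illustrative assumption.

```python
def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32

warm_c, cool_c = 30.0, 10.0
warm_f, cool_f = c_to_f(warm_c), c_to_f(cool_c)

# Ratios change with the unit, because zero is arbitrary on both scales.
ratio_c = warm_c / cool_c   # 3.0
ratio_f = warm_f / cool_f   # 86 / 50 = 1.72

# Differences describe the same gap in either unit: 20 C degrees = 36 F degrees.
diff_c = warm_c - cool_c    # 20.0
diff_f = warm_f - cool_f    # 36.0
```

"30 degrees is three times as hot as 10 degrees" depends on the unit chosen, so it is not a meaningful statement on an interval scale.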
Ordinal attributes? (2)
- Are categorical data.
- Order of categories is meaningful.
Eg of Ordinal?
Education level: primary, secondary, undergraduate, postgraduate.
Nominal attributes? (2)
- Are also categorical data.
- There’s no inherent order.
Egs of Nominal scale of measurement? (2)
- Colour.
- Sex.
Why does the type of data matter? (3)
- Affects the type of analysis we can do.
- Affects interpretation of the analysis.
- As we go down in scale (ratio -> nominal), the data contain less information. Therefore, where possible, we want to stick to higher scales.
Population?
= the entire collection of entities about which we want to make an inference or draw a conclusion.
Sample?
= subset of the population, drawn because we can’t measure all entities in the population.
Simple random sample?
= a sample of n entities drawn from a population so that each entity has the same chance of selection.
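A minimal sketch of drawing a simple random sample with Python's standard library. The population of 100 labelled entities and the seed are made up for illustration.

```python
import random

population = list(range(1, 101))   # hypothetical population of 100 labelled entities
rng = random.Random(42)            # fixed seed so the draw is repeatable

# sample() draws without replacement; every entity is equally likely to be chosen.
sample = rng.sample(population, k=10)
```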
Parameter?
= the true value of something in the population that we want to know about (usually Greek letters).
Parametric statistics?
= statistical methods/models that focus on estimating parameters.
Non-parametric statistics?
= statistical methods/models that don’t focus on estimating parameters.
Statistic?
= any value calculated from sample data (usually Latin letters).
Statistics?
= tools used to make conclusions/inferences about an unknown population from a known sample.
Probability?
= tools used to make conclusions/inferences about an unknown sample from a known population.
Model?
= an approximation/simplification of reality for the purpose of improving understanding.
Statistical model?
= mathematical expression to summarise the relationship between a response (Y) variable & one or more explanatory (X) variables.
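As an illustration only, one common statistical model is a straight line, Y = b0 + b1 X, fitted by least squares. The card does not name a specific model, so the linear form, the data and the names `b0` and `b1` below are all assumptions.

```python
x = [1, 2, 3, 4, 5]                 # hypothetical explanatory variable
y = [2.1, 3.9, 6.2, 7.8, 10.1]      # hypothetical response variable

x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)

# Least-squares slope = sum of cross-products / sum of squares of x.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)

b1 = s_xy / s_xx            # estimated slope
b0 = y_bar - b1 * x_bar     # estimated intercept
```

Here `b0` and `b1` are statistics calculated from the sample; they estimate the unknown population parameters of the line.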
Kinds of measurement? (2)
- Measures of central tendency/location.
- Measures of dispersion.
Measures of central tendency/location? (7)
- Mean.
- Median.
- Quantile (percentile).
- Mode.
- Symmetrical, unimodal distribution.
- Skew.
- Kurtosis.
Types of Mean? (2)
- Population mean.
- Sample mean/sample average.
Population mean (μ) equation?
$\mu = \frac{1}{N} \sum_{i=1}^{N} Y_i$
Sample mean (ȳ) equation?
$\bar{y} = \frac{1}{n} \sum_{i=1}^{n} Y_i$
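A short sketch of the two mean equations, using made-up numbers. The sample mean is calculated the same way as the population mean, but only over the n sampled values.

```python
population = [4, 8, 15, 16, 23, 42]     # hypothetical population, N = 6
sample = [8, 16, 42]                    # hypothetical sample from it, n = 3

mu = sum(population) / len(population)  # population mean (a parameter)
y_bar = sum(sample) / len(sample)       # sample mean (a statistic that estimates mu)
```

The two values differ (18.0 vs 22.0) because the sample is only a subset of the population.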
Median?
= middle value in an ordered/ranked/sorted sample.
Median attribute?
The 50th percentile.
Quantile (percentile)?
= a division of sorted data by the percentage of observations occurring below it.
Quantile (percentile) attributes? (2)
- Quartile: <25%, 25-50%, 50-75%, 75-100%.
- Interquartile range (IQR): 25-75%.
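A sketch of the median, quartiles and IQR using Python's `statistics` module, on made-up data. Software packages use slightly different conventions for interpolating quantiles, so other tools may give slightly different quartiles for the same data.

```python
import statistics

data = [2, 4, 4, 5, 7, 8, 9, 11, 12, 15, 20]   # hypothetical observations

median = statistics.median(data)               # middle value of the sorted data

# n=4 cuts the sorted data into quarters and returns the three cut points.
q1, q2, q3 = statistics.quantiles(data, n=4)

iqr = q3 - q1   # width of the middle 50% of the data
```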
Mode?
= most frequently occurring observation or grouping of observations in a sample.
Symmetrical, unimodal distribution?
= when the mean=median=mode.
Skew?
= when there is asymmetry in the distribution.
Types of Skew? (2)
- Positive skew.
- Negative skew.
Positive skew attributes? (2)
- Long tail to the right.
- Mode < Median < Mean.
Negative skew attributes? (2)
- Long tail to the left.
- Mode > Median > Mean.
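A quick check of the two orderings above on made-up data. The left-skewed sample is just a mirror image of the right-skewed one; the helper name `centre` is an illustrative assumption.

```python
import statistics

right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 19]   # long tail to the right
left_skewed = [20 - y for y in right_skewed]     # mirror image: long tail to the left

def centre(data):
    """Return (mode, median, mean) for a sample."""
    return statistics.mode(data), statistics.median(data), statistics.mean(data)

mode_r, median_r, mean_r = centre(right_skewed)  # mode < median < mean
mode_l, median_l, mean_l = centre(left_skewed)   # mode > median > mean
```

The long tail pulls the mean furthest, the median less, and leaves the mode where it is. This ordering is a rule of thumb and can fail for some irregular distributions.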
Types of Kurtosis? (3)
- Leptokurtic.
- Mesokurtic.
- Platykurtic.
Leptokurtic?
= narrow, sharply peaked middle on the distribution graph/curve (with heavier tails than the normal distribution).
Mesokurtic?
= normal distribution/in the middle.
Platykurtic?
= flat-ish, broad middle (with lighter tails than the normal distribution), but not perfectly evenly spread out.
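A rough sketch of putting a number on kurtosis. The cards give no formula, so the moment-based "excess kurtosis" below (fourth moment divided by the squared second moment, minus 3) is an assumption; it is one common definition, scaled so that a normal distribution scores about 0.

```python
def excess_kurtosis(data):
    """Moment-based excess kurtosis: > 0 leptokurtic, about 0 mesokurtic, < 0 platykurtic."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((y - mean) ** 2 for y in data) / n   # second central moment
    m4 = sum((y - mean) ** 4 for y in data) / n   # fourth central moment
    return m4 / m2 ** 2 - 3

flat = [1, 2, 3, 4, 5, 6, 7, 8, 9]            # evenly spread, no real peak
peaked = [-10, -1, 0, 0, 0, 0, 0, 1, 10]      # sharp peak plus extreme tail values

flat_score = excess_kurtosis(flat)       # negative: platykurtic
peaked_score = excess_kurtosis(peaked)   # positive: leptokurtic
```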
Measures of dispersion? (5)
- Variance.
- Sum of squared deviations/errors.
- Sample variance.
- Standard deviation.
- Coefficient of variation.
Types of variance? (2)
- Population variance.
- Sample variance.
Population variance (σ²) equation?
$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (Y_i - \mu)^2$
Sample variance (s²) equation?
$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar{y})^2$
Why the -1 in the sample variance equation?
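Dividing by n - 1 instead of n reduces bias: deviations are measured from the sample mean rather than the true population mean, so dividing by n would underestimate the population variance.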
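A small simulation sketch of that bias. The population (normal, true variance 1), the sample size and the number of repeats are all assumptions chosen for illustration.

```python
import random

rng = random.Random(0)          # fixed seed so the run is repeatable
n, repeats = 5, 10_000
biased_total = unbiased_total = 0.0

for _ in range(repeats):
    # Hypothetical population: normal with mean 0 and true variance 1.
    sample = [rng.gauss(0, 1) for _ in range(n)]
    y_bar = sum(sample) / n
    ss = sum((y - y_bar) ** 2 for y in sample)
    biased_total += ss / n            # divide by n
    unbiased_total += ss / (n - 1)    # divide by n - 1

mean_biased = biased_total / repeats      # tends towards (n - 1) / n = 0.8
mean_unbiased = unbiased_total / repeats  # tends towards the true variance, 1.0
```

Averaged over many samples, the n version sits consistently below the true variance while the n - 1 version centres on it.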
Sum of squared deviations/errors equation?
$\sum_{i=1}^{n} (Y_i - \bar{y})^2$
Why the squaring in the Sum of squared deviations/errors equation?
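Squaring makes every deviation positive, so negative & positive deviations don’t cancel out (the raw deviations from the mean always sum to zero).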
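A worked sketch of the sum of squares and the two variance equations on made-up data, checked against Python's `statistics` module.

```python
import statistics

sample = [2, 4, 4, 4, 5, 5, 7, 9]            # hypothetical observations, n = 8
n = len(sample)
y_bar = sum(sample) / n                       # 5.0

deviations = [y - y_bar for y in sample]
raw_sum = sum(deviations)                     # 0.0: raw deviations cancel out
ss = sum(d ** 2 for d in deviations)          # 32.0: sum of squared deviations

s2 = ss / (n - 1)        # sample variance (divide by n - 1)
sigma2 = ss / n          # what the population equation gives if these 8 values were the whole population

# The standard library uses the same two definitions.
s2_check = statistics.variance(sample)
sigma2_check = statistics.pvariance(sample)
```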
Sample variance AKA? (3)
- Mean squared error.
- Mean squared residual.
- Mean squared deviation.
Standard deviation?
= the square root of the variance; roughly the typical deviation or difference between each observation & the mean, in the same units as the data.
Population standard deviation?
σ = square root of σ².
Sample standard deviation?
s = square root of s².
Coefficient of variation (CV)?
= the standard deviation expressed as a percentage of the mean; used for comparing variability between samples when their means differ a lot.
Coefficient of variation (CV) equation?
$CV = \frac{s}{\bar{y}} \times 100$
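A sketch of why the CV is useful when means differ a lot. The body masses below are made up, and the function name `cv` is an illustrative assumption.

```python
import statistics

def cv(sample):
    """Coefficient of variation: sample SD as a percentage of the sample mean."""
    return statistics.stdev(sample) / statistics.mean(sample) * 100

elephants_kg = [4800, 5200, 5000, 5400, 4600]   # hypothetical masses, mean 5000
mice_g = [18, 22, 20, 24, 16]                   # hypothetical masses, mean 20

sd_elephants = statistics.stdev(elephants_kg)   # about 316
sd_mice = statistics.stdev(mice_g)              # about 3.2

cv_elephants = cv(elephants_kg)                 # about 6.3 %
cv_mice = cv(mice_g)                            # about 15.8 %
```

The elephants have the far larger standard deviation, but relative to their mean the mice are the more variable group. The CV only makes sense for ratio-scale data with a positive mean.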
NB!! of the Measures of dispersion?
There’s a difference between standard deviation & standard error.
Graphing attributes? (2)
- Help us to assess our data sets & their distributions.
- Reveal mistakes/problems in data entry, or interesting patterns that aren’t apparent in a numerical analysis.
Graphical methods? (3)
- Relative frequency histogram.
- Stem-and-leaf plot.
- Box-and-whisker plot.
Relative frequency histogram attributes? (3)
- X-axis = data values broken into bins or categories.
- Y-axis = relative frequency (proportion) of observations in each bin.
- Area (height) of each bar is proportional to the number of observations.
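A text-only sketch of a relative frequency histogram, assuming non-negative integer data and equal-width bins. The function name `relative_frequencies` and the data are illustrative assumptions.

```python
from collections import Counter

def relative_frequencies(data, bin_width):
    """Map each bin's lower edge to the proportion of observations in that bin."""
    counts = Counter((value // bin_width) * bin_width for value in data)
    n = len(data)
    return {edge: counts[edge] / n for edge in sorted(counts)}

data = [3, 7, 12, 14, 15, 18, 21, 22, 25, 38]   # hypothetical observations
bin_width = 10
freqs = relative_frequencies(data, bin_width)

# One row per bin; bar length is proportional to the relative frequency.
for edge, freq in freqs.items():
    print(f"{edge:>2}-{edge + bin_width - 1:<2} {'#' * round(freq * 20)} {freq:.2f}")
```

Bins with no observations are simply left out of this sketch; a plotting library would normally show them as gaps.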
Stem-and-leaf plot attributes? (4)
- Useful for getting a distribution & comparing two distributions.
- Identify outliers.
- Can see every observation.
- Useful for quick assessment of distribution in the field.
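A minimal sketch of a stem-and-leaf plot for non-negative two-digit integers, with the tens digit as the stem and the units digit as the leaf. The function name `stem_and_leaf` and the data are illustrative assumptions.

```python
from collections import defaultdict

def stem_and_leaf(data):
    """Group values by tens digit (stem) and collect the units digits (leaves)."""
    stems = defaultdict(list)
    for value in sorted(data):
        stems[value // 10].append(value % 10)
    return dict(stems)

data = [12, 15, 21, 23, 23, 28, 34, 35, 47, 61]   # hypothetical observations
plot = stem_and_leaf(data)

# Print every stem in range, including empty ones, so gaps in the data stay visible.
for stem in range(min(plot), max(plot) + 1):
    leaves = "".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem} | {leaves}")
```

Every observation can be read back from the plot, and the empty stem 5 makes 61 stand out as a possible outlier.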
Standard deviation VS Standard error?
- Standard deviation = width (spread) of the data in the sample.
- Standard error = width (spread) of the sampling distribution of the mean.
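A sketch of the two quantities side by side on made-up data. The cards do not give the standard error formula; the usual estimate for the standard error of the mean, SE = s / √n, is assumed here.

```python
import math
import statistics

sample = [2, 4, 4, 4, 5, 5, 7, 9]     # hypothetical observations, n = 8
n = len(sample)

s = statistics.stdev(sample)          # standard deviation: spread of the observations
se = s / math.sqrt(n)                 # standard error: spread expected in the sample mean
```

The standard deviation stays roughly the same as more data are collected, while the standard error shrinks as n grows.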
Box-and-whisker plot attributes? (5)
- Help us assess spread, skewness, outliers & to compare groups.
- Whole box = IQR (25-75%).
- Dark line in box = Median.
- Outlier = extreme observation.
- Positive skew = box sits at the bottom (low values), with the longer whisker & any outliers stretching towards high values.
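A sketch of the numbers behind a box-and-whisker plot, on made-up data. The cards do not define "extreme"; the common 1.5 × IQR fence rule used below is an assumption, and plotting software may use a different convention.

```python
import statistics

data = [2, 4, 4, 5, 7, 8, 9, 11, 12, 15, 40]    # hypothetical observations

q1, median, q3 = statistics.quantiles(data, n=4)  # box edges and the line inside the box
iqr = q3 - q1                                     # height of the box

# Assumed convention: points beyond 1.5 x IQR from the box are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [y for y in data if y < lower_fence or y > upper_fence]
```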
Things to note on Statistics in terms of histograms? (2)
- Large sample = what we want: narrower sampling distribution of the mean, low SE, high precision.
- Small sample = what we don’t want: broader sampling distribution of the mean, high SE, low precision.
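A simulation sketch of the two points above. The population (normal, mean 50, SD 10), the sample sizes and the number of repeats are all assumptions chosen for illustration.

```python
import random
import statistics

rng = random.Random(1)   # fixed seed so the run is repeatable

# Hypothetical population: 10,000 values from a normal distribution (mean 50, SD 10).
population = [rng.gauss(50, 10) for _ in range(10_000)]

def spread_of_sample_means(n, repeats=2_000):
    """SD of the means of many simple random samples of size n (an empirical SE)."""
    means = [statistics.fmean(rng.sample(population, n)) for _ in range(repeats)]
    return statistics.stdev(means)

se_small = spread_of_sample_means(5)     # expected to be near 10 / sqrt(5)
se_large = spread_of_sample_means(100)   # expected to be near 10 / sqrt(100)
```

A histogram of the means from the n = 100 samples would be much narrower than one from the n = 5 samples, which is what low SE and high precision look like.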
Main lesson under Statistics basics?
Statistical inference deals with uncertainty or variability, in the data & in the things we try to estimate with data.
What is the purpose of SE?
To show how much the sample mean is likely to differ from the true population mean (precision).