1. Data and Models Flashcards
Summarising numerical data, summarising attribute data, fitting a model
Population
Definition
-a collection of individuals/items of interest
Sample
Definition
-the subset of the population for which observations are available
Variable/Variate
Definition
-a quantity or attribute whose value varies between individuals
Observation
Definition
-a recorded value of a variate for an individual
Data
Definition
-a collection of observations
Statistic
Definition
-a function of the data
Summarising Numerical Data
min and max
-the minimum and maximum values of the data
Summarising Numerical Data
Measures of Location
- summary statistics which try to capture the location of the centre of the sample
1) sample mean
2) mode
3) median
Summarising Numerical Data
Sample Mean
-the sample mean or sample average of x1,…,xn∈R is given by:
x̄ = (1/n) Σ xi, where the sum runs over i = 1,…,n
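A minimal sketch in R of the sample mean formula, assuming a small made-up sample x (the values are illustrative only). It computes the mean from the formula and with the built-in mean() for comparison.

```r
# Illustrative data (hypothetical values, not from the notes)
x <- c(2, 4, 4, 5, 10)
n <- length(x)

# Sample mean from the formula: (1/n) * sum of the observations
xbar <- sum(x) / n

# Built-in equivalent
xbar_builtin <- mean(x)

print(xbar)
print(xbar_builtin)
```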
Summarising Numerical Data
Mode
- the mode of a sample x1,…,xn is the value of the variate which occurs most frequently
- in cases where different values occur with the same frequency the mode may not be unique
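A possible way to find the mode in R, sketched on made-up data. Base R has no statistical mode function (the built-in mode() reports the storage type of an object instead), so this sketch counts values with table() and keeps every value that reaches the highest count. The data are chosen so that the mode is not unique.

```r
# Illustrative data (hypothetical): 2 and 3 both occur twice
x <- c(1, 2, 2, 3, 3, 5)

# Frequency of each distinct value
counts <- table(x)

# All values whose frequency equals the maximum frequency
modes <- as.numeric(names(counts)[counts == max(counts)])

print(counts)
print(modes)
```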
Summarising Numerical Data
Median
-a median of x1,…,xn∈R is any number m∈R such that:
a) at least half of the observations are less than or equal to m
AND
b) at least half of the observations are greater than or equal to m
-if the number of observations is odd, there is a unique median
-if the number of observations is even, the median can fall anywhere in the interval between the middle two values; we usually choose the midpoint
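The two conditions a) and b) can be checked directly in R. The sketch below uses made-up data and a helper is_median(), a hypothetical name introduced only for this illustration. R's built-in median() returns the midpoint of the two middle values when the number of observations is even.

```r
# Hypothetical helper: TRUE if m satisfies conditions a) and b) for the data x
is_median <- function(m, x) {
  mean(x <= m) >= 0.5 && mean(x >= m) >= 0.5
}

x_odd  <- c(7, 1, 3)       # odd number of observations
x_even <- c(7, 1, 3, 5)    # even number of observations

m_odd  <- median(x_odd)    # the middle value
m_even <- median(x_even)   # midpoint of the two middle values 3 and 5

# Any number between the two middle values is a median of x_even
inside  <- is_median(3.5, x_even)
# A number outside that interval is not
outside <- is_median(6, x_even)

print(c(m_odd, m_even))
print(c(inside, outside))
```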
Summarising Numerical Data
Measures of Spread
- statistics which characterise the spread of the sample
1) range
2) sample variance
3) sample standard deviation
4) interquartile and semi-interquartile range
Summarising Numerical Data
Range
-the range of a sample of numeric observations x1,…,xn∈R is the interval:
[min xi , max xi]
-i.e. the smallest interval which contains all the data
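In R, the end points of this interval are returned by range(), which gives the minimum and maximum as a vector of length two. A short sketch on made-up data:

```r
# Illustrative data (hypothetical values)
x <- c(4, 9, 1, 7)

# End points of the range: c(min, max)
r <- range(x)

# The same end points computed separately
lo <- min(x)
hi <- max(x)

print(r)
```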
Summarising Numerical Data
Sample Variance
-the sample variance of x1,…,xn∈R is given by:
sx² = 1/(n-1) Σ (xi - x̄)²
- where x̄ is the sample mean and the sum runs over i = 1,…,n
- the sample variance is nearly the average squared distance between the observations and the sample mean; the only difference is that the denominator is n-1 instead of n
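A minimal sketch of the sample variance formula in R, on made-up data. It evaluates the formula step by step and compares the result with the built-in var(), which also uses the denominator n-1.

```r
# Illustrative data (hypothetical values)
x <- c(2, 4, 4, 5, 10)
n <- length(x)
xbar <- mean(x)

# Sample variance from the formula: squared deviations summed, divided by n - 1
s2 <- sum((x - xbar)^2) / (n - 1)

# Built-in equivalent
s2_builtin <- var(x)

# For comparison: dividing by n gives the (slightly smaller) average squared deviation
avg_sq_dev <- sum((x - xbar)^2) / n

print(c(s2, s2_builtin, avg_sq_dev))
```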
Summarising Numerical Data
Sample Standard Deviation
- sample standard deviation is the square root of the sample variance
- large values of sx indicate that the observations are spread out, while small values of sx indicate that the observations are concentrated around the sample mean
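To illustrate, the sketch below compares two made-up samples with the same sample mean, one concentrated and one spread out, using R's built-in sd().

```r
# Two hypothetical samples, both with sample mean 10
tight <- c(9, 10, 10, 11)   # concentrated around the mean
wide  <- c(0, 5, 15, 20)    # spread out

sd_tight <- sd(tight)
sd_wide  <- sd(wide)

# sd() is the square root of var()
check <- sqrt(var(wide))

print(c(sd_tight, sd_wide))
```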
Summarising Numerical Data
α-quantiles
- the idea of α-quantiles is to split the observations into two groups such that about αn observations are smaller than qα and about (1-α)n observations are larger than qα
- a value qα that produces such a split is an α-quantile; depending on n, α and the data x1,…,xn, the α-quantile may or may not be unique
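A rough sketch of the split in R, on made-up data. Because the α-quantile need not be unique, R's quantile() has to pick one convention; by default it interpolates between observations (its type 7, one of nine available types), which may differ from the convention used in a given course.

```r
# Illustrative data (hypothetical): the numbers 1 to 9
x <- 1:9
alpha <- 0.25

# One possible alpha-quantile, using R's default convention
q_alpha <- quantile(x, probs = alpha, names = FALSE)

# How the observations split around q_alpha
n_below <- sum(x < q_alpha)   # roughly alpha * n
n_above <- sum(x > q_alpha)   # roughly (1 - alpha) * n

print(q_alpha)
print(c(n_below, n_above))
```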
Summarising Numerical Data
first and third quartiles
- using the definition of the α-quantile, qα:
- the value q1/4 is called the first quartile
- q3/4 is called the third quartile
Summarising Numerical Data
interquartile and semi-interquartile range
-the difference q3/4-q1/4 is called the interquartile range
-and:
(q3/4-q1/4)/2 is called the semi-interquartile range
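A short sketch of these quantities in R, on made-up data. The quartiles come from quantile() with its default convention, and IQR() returns the interquartile range directly; the semi-interquartile range is half of that.

```r
# Illustrative data (hypothetical values)
x <- c(1, 3, 5, 7, 9, 11, 13, 15, 17)

# First and third quartiles under R's default quantile convention
q1 <- quantile(x, 0.25, names = FALSE)
q3 <- quantile(x, 0.75, names = FALSE)

# Interquartile range, by hand and with the built-in function
iqr_manual  <- q3 - q1
iqr_builtin <- IQR(x)

# Semi-interquartile range
siqr <- iqr_manual / 2

print(c(q1, q3, iqr_manual, siqr))
```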
Semi-Interquartile Range vs Sample Standard Deviation
- the semi-interquartile range can be used as an alternative to the sample standard deviation
- its definition is slightly more complicated but the semi-interquartile range is less affected by outliers than the sample standard deviation
- i.e. the semi-interquartile range is a robust measure of the spread of a sample
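The robustness claim can be seen in a small experiment, sketched below on made-up data: replacing the largest observation with an extreme outlier leaves the semi-interquartile range unchanged but inflates the sample standard deviation. The helper siqr() is a hypothetical name introduced for this illustration.

```r
# Hypothetical helper: semi-interquartile range (R's default quantile convention)
siqr <- function(v) IQR(v) / 2

x     <- c(1, 3, 5, 7, 9, 11, 13, 15, 17)
x_out <- c(1, 3, 5, 7, 9, 11, 13, 15, 1000)   # largest value replaced by an outlier

siqr_x   <- siqr(x)
siqr_out <- siqr(x_out)
sd_x     <- sd(x)
sd_out   <- sd(x_out)

print(c(siqr_x, siqr_out))   # unaffected by the outlier
print(c(sd_x, sd_out))       # strongly affected by the outlier
```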
Summarising Attribute Data
- since the observations of attribute data do not consist of numbers, the mode is the only one of the summary statistics from the previous section which can be computed for attribute data
- often the best way to summarise attribute data is to consider tables which show how often each of the possible values occurs
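A minimal sketch of such a frequency table in R, using made-up attribute data. table() counts how often each value occurs, and the mode can be read off as the value with the highest count.

```r
# Illustrative attribute data (hypothetical values)
colours <- c("red", "blue", "red", "green", "red", "blue")

# Frequency table of the observed values
freq <- table(colours)

# Mode: the value(s) with the highest frequency
colour_mode <- names(freq)[freq == max(freq)]

print(freq)
print(colour_mode)
```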
Statistical Model
Definition
-a statistical model for a sample x1,…,xn consists of random variables X1,…,Xn chosen such that the data x1,…,xn ‘look like’ a random sample of X1,…,Xn
Fitting a Model
- one of the main concerns in statistics is to ‘fit a model’ to given data
- i.e. to find a distribution for the random variables X1,…,Xn such that the data could plausibly be a random sample from the model
Questions about the relation between data and models
1) what are the best parameter values to use in the model -> parameter estimation
2) which parameter values in the model are compatible with the data -> confidence intervals
3) could the data have been produced by a given model with given parameter values -> hypothesis tests
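The three questions can be sketched in R under strong assumptions that are not part of these notes: the data are simulated rather than real, the model is assumed to be normal, and the one-sample t-test from base R (t.test()) stands in for the confidence interval and hypothesis test. It is only an illustration of how the three tasks differ.

```r
set.seed(123)                         # makes the simulated data repeatable

# Simulated "data" (hypothetical): 50 draws from a normal model
x <- rnorm(50, mean = 10, sd = 2)

# 1) parameter estimation: estimates of the mean and standard deviation
mu_hat    <- mean(x)
sigma_hat <- sd(x)

# 2) confidence interval: range of mean values compatible with the data (95%)
ci <- t.test(x)$conf.int

# 3) hypothesis test: could the data come from a model with mean 10?
p_value <- t.test(x, mu = 10)$p.value

print(c(mu_hat, sigma_hat))
print(ci)
print(p_value)
```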
Models in R
r
-generates n random numbers from the distribution
Models in R
d
-densities (weights for the discrete case)
Models in R
p
-cumulative distribution functions
Models in R
q
-quantiles (the inverse of the cumulative distribution function)
Models in R
Distributions
-the distribution name is appended to the prefix r, d, p or q, e.g. rnorm or dbinom
binomial: binom
chi-squared: chisq
exponential: exp
gamma: gamma
normal: norm
poisson: pois
uniform: unif
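A short sketch of the four prefixes combined with distribution names, here the standard normal and a binomial with parameters chosen purely for illustration:

```r
set.seed(1)                 # makes the random numbers repeatable

# r: random numbers, here 5 draws from the standard normal distribution
z <- rnorm(5)

# d: density at 0 (continuous case), equal to 1/sqrt(2*pi)
dens0 <- dnorm(0)

# d: weight P(X = 2) for a binomial with 4 trials and success probability 0.5
w2 <- dbinom(2, size = 4, prob = 0.5)

# p: cumulative distribution function, P(X <= 0) for the standard normal
cdf0 <- pnorm(0)

# q: quantile function, the inverse of p
q_half <- qnorm(0.5)
round_trip <- qnorm(pnorm(1.3))

print(z)
print(c(dens0, w2, cdf0, q_half, round_trip))
```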
Sampling Attribute Data
-to generate independent, random samples from a model for an attribute value, the command:
sample(values,n,replace=TRUE,prob=p)
-can be used
-values must be a vector of the possible values of the attribute, and p must be a vector of the same length as values giving the corresponding probabilities of each value
-if all possible values have the same probability, the argument prob=… can be omitted
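A usage sketch of the command, with made-up values and probabilities. The names values and p follow the card above; the colours and the probabilities are assumptions for illustration.

```r
set.seed(42)                          # makes the draws repeatable

values <- c("red", "green", "blue")   # possible values of the attribute
p      <- c(0.5, 0.3, 0.2)            # corresponding probabilities (sum to 1)

# 10 independent draws with the given probabilities
draws <- sample(values, 10, replace = TRUE, prob = p)

# Equal probabilities: the prob argument can be left out
uniform_draws <- sample(values, 10, replace = TRUE)

print(table(draws))
print(table(uniform_draws))
```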