Lecture 6 - chi-square statistic Flashcards
Categorical data
Entities are divided into distinct categories
Binary variable: there are only 2 categories
(e.g. dead or alive; yes or no)
Nominal variable: there are more than two categories
(e.g. vegan, omnivore, vegetarian, fruitarian)
Ordinal variable: a nominal variable whose categories have a logical order
(e.g. H1, H2A, H2B, H3, Pass)
Continuous data
Entities receive a distinct score on a measurement scale
Interval variable: equal intervals on the variable represent equal differences in the property being measured
(e.g. difference between 2 and 4 is the same as the difference between 20 and 22)
Ratio variable: Same as interval variable but the ratios are meaningful, with a true zero point
(e.g. response times to the appearance of a target)
This distinction can be blurry
We can measure continuous data as categories
Age (years)
We can treat categorical variables as if they were continuous
Average number of boyfriends that women in their 20s have is 4.6 (.6 of a boyfriend?)
Analysing categorical data
We want to quantify the relationship between two categorical variables
(We can’t use the mean because a mean of categorical data is meaningless)
We analyse the number of things that fall into each category,
i.e. the count
Also known as the frequency
Frequency perspective
Frequency perspective: take a population and measure each person’s height.
Graph this data on a histogram (or frequency distribution).
Height follows a normal (bell-shaped) (Gaussian) curve/distribution
Probability perspective
Probability perspective: take a person at random and measure their height.
What is the probability that they will be ~170cm tall?
Another way of asking this question is “How big is the ~170cm bar’s area compared with the total area of all the bars?”
Total count: 53,298 people
~170cm people: 8,700
P(~170cm) = 8,700 / 53,298 ≈ 0.16 = 16%
The size of each bar relates directly to the probability of that event occurring
Probability of an event occurring ranges from 0 to 1
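The bar-area calculation above can be sketched in a few lines of Python; the counts are the example figures from the lecture.

```python
# Probability of an event from frequency counts:
# P(event) = count in the category / total count.
total_people = 53298   # total count from the lecture example
people_170cm = 8700    # people ~170cm tall
probability = people_170cm / total_people
print(round(probability, 2))  # prints 0.16, i.e. ~16%
```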
Z-scores
Distributions of data will have different means and SDs
We can make use of the already calculated probabilities associated with the normal distribution (phew!)
To do this, we need to convert our data so it has a mean of 0 and a SD of 1
Z = (each score - group mean) / group standard deviation
Our data is now fitted onto the normal curve
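The conversion can be sketched as a small Python function; the mean and SD below are hypothetical illustration values, not the lecture's data.

```python
def z_score(score, group_mean, group_sd):
    """Convert a raw score to a z-score: z = (score - mean) / SD."""
    return (score - group_mean) / group_sd

# Hypothetical example: heights with mean 170cm and SD 10cm.
# A score of 185cm sits 1.5 standard deviations above the mean.
print(z_score(185, 170, 10))  # prints 1.5
```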
Null hypothesis testing
We assume the null hypothesis is true (i.e. there is no effect)
We fit a statistical model to the data that represents the alternative hypothesis and see how well the model fits the data (in terms of variance)
To determine the fit, we calculate the probability of getting that ‘model’ if the null hypothesis were true
If that probability is really small (.05 or less) then we conclude that the model fits the data well and we find support for the alternative/experimental hypothesis
Chi-square
The chi-squared distribution is one of the most widely used probability distributions in inferential statistics
This distribution can be used to calculate precisely the probability of obtaining a given score
Chi-square formula
Χ² = Σ [(observedij – modelij)² / modelij]
Χ is the Greek letter chi; Χ² means chi-squared
Σ (sigma) means sum all of the information in the bracket afterwards
Where i represents the rows in the contingency table and j represents the columns
The observed frequencies are our counts of what happened (in our contingency table)
The model (expected) frequencies are what we would expect if things happened by chance (see next slide for how to calculate this)
Expected frequencies
To calculate the expected frequencies for each cell in the table we use the column and row totals for a particular cell…
Modelij = Eij = (row totali × column totalj) / n
Where n is the total number of observations (fish) (e.g. 100)
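As a sketch, here is the expected frequency for one cell under independence; the row total, column total, and n below are hypothetical stand-ins, not the lecture's goldfish counts.

```python
# Expected frequency for cell (i, j): row_total_i * column_total_j / n
row_total, col_total, n = 38, 42, 77  # hypothetical 2x2 table totals
expected = row_total * col_total / n
print(round(expected, 2))  # prints 20.73
```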
Cross tabulations
Differences may represent chance – there will most likely be a difference between observed and expected counts just by chance, even if the variables are independent
Are these differences large enough to be confident about an association?
We need to know what happens at the population level and a statistic will help us to know this. Which one?
Chi-square! It estimates the difference between the observed data and what would be expected if the two variables were independent
If the chi-square is large enough, then we can say that the two variables are associated
Degrees of freedom
We need to know the degrees of freedom
df = (number of rows – 1)(number of columns – 1) = (2-1)(2-1) = 1
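Putting the formula, the expected frequencies, and the degrees of freedom together, a by-hand chi-square for a 2x2 table might look like this; the observed counts are hypothetical, standing in for the lecture's goldfish data.

```python
# Observed counts for a hypothetical 2x2 contingency table,
# e.g. rows = trained / not trained, columns = food A / food B.
observed = [[28, 10],
            [14, 25]]

row_totals = [sum(row) for row in observed]        # [38, 39]
col_totals = [sum(col) for col in zip(*observed)]  # [42, 35]
n = sum(row_totals)                                # 77

# Expected frequency for cell (i, j): row_total_i * col_total_j / n
expected = [[row_totals[i] * col_totals[j] / n for j in range(2)]
            for i in range(2)]

# Chi-square: sum over all cells of (observed - expected)^2 / expected
chi_sq = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
             for i in range(2) for j in range(2))

# Degrees of freedom: (rows - 1) * (columns - 1)
df = (len(observed) - 1) * (len(observed[0]) - 1)

print(round(chi_sq, 2), df)  # prints 11.08 1
```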
Residuals
We can conclude that there is an association between training of goldfish and food used, but which food was driving the association?
We need to calculate the “adjusted, standardised residuals” to be confident about this
Observed – expected is called the “residual” for each cell
Adjusted, standardised residuals are residuals that are standardised so they are equivalent to a z-score in a normal distribution
Z-score distribution
We need the Adjusted Standardised Residuals, as there could be a difference between the observed and expected values just by chance (!)
By placing the residuals onto the z-distribution, we can take chance into account, allow a certain amount of error, and agree that a residual larger than 1.96 in absolute value (positive or negative) is a significant effect (p < .05)
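A sketch of one common way to compute an adjusted standardised residual (Haberman's formula, which the slides do not spell out); the cell counts reuse the hypothetical 2x2 totals rather than the lecture's goldfish data.

```python
import math

def adjusted_residual(observed, row_total, col_total, n):
    """Adjusted standardised residual for one contingency-table cell
    (Haberman's formula); values beyond +/-1.96 are significant at p < .05."""
    expected = row_total * col_total / n
    variance = expected * (1 - row_total / n) * (1 - col_total / n)
    return (observed - expected) / math.sqrt(variance)

# Hypothetical cell: observed 28, row total 38, column total 42, n = 77.
r = adjusted_residual(28, 38, 42, 77)
print(round(r, 2))    # prints 3.33
print(abs(r) > 1.96)  # prints True -> this cell drives the association
```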