Lecture 6: chi-square statistic Flashcards

1
Q

Categorical data

A

Entities are divided into distinct categories
Binary variable: there are only 2 categories
(e.g. dead or alive; yes or no)
Nominal variable: there are more than two categories
(e.g. vegan, omnivore, vegetarian, fruitarian)
Ordinal variable: a nominal variable whose categories have a logical order
(e.g. H1, H2A, H2B, H3, Pass)

2
Q

Continuous data

A

Entities receive a distinct score on a measurement scale
Interval variable: equal intervals on the variable represent equal differences in the property being measured
(e.g. difference between 2 and 4 is the same as the difference between 20 and 22)
Ratio variable: Same as interval variable but the ratios are meaningful, with a true zero point
(e.g. response times to the appearance of a target)

3
Q

The distinction between categorical and continuous data can be blurry

A

We can measure continuous data as categories
(e.g. age recorded in whole years)
We can treat categorical variables as if they were continuous
(e.g. the average number of boyfriends that women in their 20s have is 4.6, but what is .6 of a boyfriend?)

4
Q

Analysing categorical data

A

We want to quantify the relationship between two categorical variables
(We can’t use the mean because a mean of categorical data is meaningless)
We analyse the number of things that fall into each category, i.e. the count (also known as the frequency)
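
As a rough illustration of analysing counts rather than means, the sketch below tallies frequencies with Python's standard `collections.Counter`. The records and variable names are hypothetical and exist only for this example.

```python
from collections import Counter

# Hypothetical records: (diet, owns_a_pet) for six people, for illustration only.
records = [
    ("vegan", "yes"), ("omnivore", "no"), ("omnivore", "yes"),
    ("vegan", "yes"), ("vegetarian", "no"), ("omnivore", "no"),
]

# Frequency of each category of a single variable.
diet_counts = Counter(diet for diet, _ in records)

# Frequency of each combination of the two variables
# (these are the cells of a contingency table).
cell_counts = Counter(records)

print(diet_counts["omnivore"])           # 3
print(cell_counts[("omnivore", "no")])   # 2
```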

5
Q

Frequency perspective

A

Frequency perspective: take a population and measure each person’s height.
Graph these data on a histogram (or frequency distribution).
Height follows a normal (bell-shaped, Gaussian) distribution

6
Q

Probability perspective

A

Probability perspective: take a person at random and measure their height.
What is the probability that they will be ~170cm tall?
Another way of asking this question is “How big is the blue area compared with all the values of the bars?”
Total count: 53,298 people
People ~170cm tall: 8,700
$P(\sim 170\,\text{cm}) = \dfrac{8{,}700}{53{,}298} \approx 0.16 = 16\%$

The size of the bars relates directly to the probability of an event occurring
Probability of an event occurring ranges from 0 to 1
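
The arithmetic on this card can be checked with a few lines of Python. The counts are the ones given on the card; the variable names are only illustrative.

```python
# Counts taken from the card above.
total_count = 53_298   # everyone in the histogram
count_170cm = 8_700    # people in the ~170cm bar

# Relative frequency of the bar = probability of picking such a person at random.
probability = count_170cm / total_count

print(f"P(~170cm) = {probability:.2f} ({probability:.0%})")  # P(~170cm) = 0.16 (16%)
```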

7
Q

Z-scores

A

Distributions of data will have different means and SDs
We can make use of the already calculated probabilities associated with the normal distribution (phew!)
To do this, we need to convert our data so they have a mean of 0 and an SD of 1
$z = \dfrac{\text{each score} - \text{group mean}}{\text{group standard deviation}}$
Our data are now on the same scale as the standard normal curve
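
A minimal sketch of the conversion, using only Python's standard `statistics` module. The heights are hypothetical, and the choice of the population SD (`pstdev`) is an assumption, since the card does not say which SD to use.

```python
from statistics import mean, pstdev

def z_scores(scores):
    """Convert raw scores to z-scores: (score - group mean) / group SD."""
    group_mean = mean(scores)
    group_sd = pstdev(scores)  # assumption: population SD; statistics.stdev gives the sample SD
    return [(x - group_mean) / group_sd for x in scores]

heights_cm = [160, 165, 170, 175, 180]  # hypothetical heights, for illustration only
z = z_scores(heights_cm)
print([round(v, 2) for v in z])
```

After the conversion the scores have a mean of 0 and an SD of 1, whatever the original units were.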

8
Q

Null hypothesis testing

A

We assume the null hypothesis is true (i.e. there is no effect)
We fit a statistical model to the data that represents the alternative hypothesis and see how well the model fits the data (in terms of variance)
To determine the fit, we calculate the probability of getting that ‘model’ if the null hypothesis were true
If that probability is really small (.05 or less) then we conclude that the model fits the data well and we find support for the alternative/experimental hypothesis

9
Q

Chi-square

A

The chi-squared distribution is one of the most widely used probability distributions in inferential statistics
This distribution can be used to calculate the probability of obtaining a test statistic at least as large as a given value

10
Q

Chi-square formula

A

$\chi^2 = \sum_{i,j} \dfrac{(\text{observed}_{ij} - \text{model}_{ij})^2}{\text{model}_{ij}}$
χ means chi
Σ (sigma) means sum everything in the expression that follows it
Where i represents the rows in the contingency table and j represents the columns
The observed frequencies are our counts of what happened (in our contingency table)
The model (expected) frequencies are what we would expect if things happened by chance (see the next card for how to calculate this)
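
The formula can be sketched in a few lines of Python. The function name and the 2 x 2 counts are hypothetical, and the model frequencies are supplied directly here rather than derived.

```python
def chi_square(observed, model):
    """Sum (observed_ij - model_ij)^2 / model_ij over every cell of the table."""
    return sum(
        (o - m) ** 2 / m
        for observed_row, model_row in zip(observed, model)
        for o, m in zip(observed_row, model_row)
    )

# Hypothetical 2 x 2 counts, for illustration only.
observed = [[30, 20], [10, 40]]
model = [[20.0, 30.0], [20.0, 30.0]]

print(round(chi_square(observed, model), 2))  # 16.67
```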

11
Q

Expected frequencies

A

To calculate the expected frequency for each cell in the table we use the row and column totals for that cell:

$\text{model}_{ij} = E_{ij} = \dfrac{\text{row total}_i \times \text{column total}_j}{n}$

Where n is the total number of observations (e.g. 100 fish)
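
A short sketch of this calculation for a whole table at once. The function name and the counts are hypothetical; only the row total x column total / n rule comes from the card.

```python
def expected_frequencies(observed):
    """Expected count for each cell: row total * column total / n."""
    row_totals = [sum(row) for row in observed]
    column_totals = [sum(column) for column in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in column_totals] for r in row_totals]

observed = [[30, 20], [10, 40]]  # hypothetical counts, n = 100
print(expected_frequencies(observed))  # [[20.0, 30.0], [20.0, 30.0]]
```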

12
Q

Cross tabulations

A

Differences may reflect chance: there will most likely be some difference between observed and expected counts just by chance, even if the variables are independent
Are these differences large enough to be confident about an association?
We need to know what happens at the population level and a statistic will help us to know this. Which one?
Chi-square! It estimates the difference between the observed data and what would be expected if the two variables were independent
If the chi-square is large enough, then we can say that the two variables are associated

13
Q

Degrees of freedom

A

We need to know the degrees of freedom

df 	= (number of rows – 1)(number of columns – 1)
	= (2-1)(2-1)
	= 1
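
The same calculation as a small helper, assuming the contingency table is stored as a list of rows (a hypothetical representation chosen for this sketch).

```python
def degrees_of_freedom(table):
    """df = (number of rows - 1) * (number of columns - 1)."""
    number_of_rows = len(table)
    number_of_columns = len(table[0])
    return (number_of_rows - 1) * (number_of_columns - 1)

print(degrees_of_freedom([[30, 20], [10, 40]]))  # 1, as for the 2 x 2 table above
```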
14
Q

Residuals

A

We can conclude that there is an association between training of goldfish and food used, but which food was driving the association?
We need to calculate the “adjusted, standardised residuals” to be confident about this
Observed – expected is called the “residual” for each cell
Adjusted, standardised residuals are residuals that are standardised so they are equivalent to a z-score in a normal distribution
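
The card does not give a formula, so the sketch below assumes the commonly used definition of the adjusted standardised residual: (observed - expected) / sqrt(expected x (1 - row total / n) x (1 - column total / n)). The function name and counts are hypothetical.

```python
from math import sqrt

def adjusted_residuals(observed):
    """Adjusted standardised residual for every cell of a contingency table."""
    row_totals = [sum(row) for row in observed]
    column_totals = [sum(column) for column in zip(*observed)]
    n = sum(row_totals)
    result = []
    for row, r in zip(observed, row_totals):
        result_row = []
        for o, c in zip(row, column_totals):
            expected = r * c / n
            # Assumed formula: residual scaled by its estimated standard error.
            standard_error = sqrt(expected * (1 - r / n) * (1 - c / n))
            result_row.append((o - expected) / standard_error)
        result.append(result_row)
    return result

observed = [[30, 20], [10, 40]]  # hypothetical counts
for row in adjusted_residuals(observed):
    print([round(value, 2) for value in row])
```

Cells with a large positive value hold more observations than independence predicts, and cells with a large negative value hold fewer.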

15
Q

Z-score distribution

A

We need the adjusted, standardised residuals because there could be a difference between the observed and expected values just by chance
By placing the residuals onto the z-distribution, we can take chance into account, allow a certain amount of error, and agree that a residual with an absolute value greater than 1.96 is a significant effect at the .05 level
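
The 1.96 cut-off corresponds to a two-tailed probability of .05 on the standard normal distribution. A quick check with Python's standard `statistics.NormalDist` (the helper name `two_tailed_p` is just for this sketch):

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, SD 1

# z value that leaves 2.5% in each tail, i.e. 5% in total.
critical_z = standard_normal.inv_cdf(1 - 0.05 / 2)

def two_tailed_p(z):
    """Probability of a z-score at least this far from 0 in either direction."""
    return 2 * (1 - standard_normal.cdf(abs(z)))

print(round(critical_z, 2))          # 1.96
print(round(two_tailed_p(1.96), 3))  # 0.05
```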

16
Q

Handy summary of how to perform a chi-square

A

Note the observed values in a contingency table
Calculate the expected values
Calculate the chi-square
Calculate the degrees of freedom
Look up the chi-square, with the appropriate df, on the chi-square distribution
Calculate the adjusted, standardised residuals
Draw your conclusion
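
The steps above can be sketched end to end for a 2 x 2 table using only the standard library. This is an illustration under stated assumptions, not a reference implementation: the counts are hypothetical, the residual formula is the commonly used one rather than something stated on these cards, and the table look-up is replaced by the identity that holds for df = 1, where the upper-tail probability equals `erfc(sqrt(chi2 / 2))`.

```python
from math import erfc, sqrt

def chi_square_test_2x2(observed):
    """Run the listed steps on a 2 x 2 table of observed counts."""
    row_totals = [sum(row) for row in observed]
    column_totals = [sum(column) for column in zip(*observed)]
    n = sum(row_totals)

    # Expected counts if the two variables were independent.
    expected = [[r * c / n for c in column_totals] for r in row_totals]

    # Chi-square statistic.
    chi2 = sum(
        (o - e) ** 2 / e
        for observed_row, expected_row in zip(observed, expected)
        for o, e in zip(observed_row, expected_row)
    )

    # Degrees of freedom.
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    if df != 1:
        raise ValueError("this sketch only handles 2 x 2 tables (df = 1)")

    # Probability of a statistic this large under the null (valid for df = 1 only).
    p_value = erfc(sqrt(chi2 / 2))

    # Adjusted, standardised residuals (assumed formula, see lead-in).
    residuals = [
        [
            (o - e) / sqrt(e * (1 - r / n) * (1 - c / n))
            for o, e, c in zip(observed_row, expected_row, column_totals)
        ]
        for observed_row, expected_row, r in zip(observed, expected, row_totals)
    ]
    return chi2, df, p_value, residuals

observed = [[30, 20], [10, 40]]  # hypothetical counts
chi2, df, p_value, residuals = chi_square_test_2x2(observed)
print(f"chi-square = {chi2:.2f}, df = {df}, significant at .05: {p_value < 0.05}")
```

For larger tables a statistics library would normally be used for the look-up step instead of the df = 1 shortcut.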

17
Q

Assumptions of chi-square

A

-The sample is drawn randomly from the population
-The sample (whole contingency table) is sufficiently large
-Within each cell, the expected count is large enough (typically greater than 5)
-The observations are independent of each other

18
Q

Fisher’s exact test

A

There is one problem with Pearson’s chi-square test: the sampling distribution of the test statistic is only approximately a chi-square distribution
The larger the sample is, the better this approximation becomes. In large samples we don’t need to worry about this approximation
In small samples, the approximation may be too poor to trust
To use the chi-square test, the expected frequencies in each cell of the contingency table need to be greater than 5
Fisher’s exact test allows for small sample sizes
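
As a rough sketch of how an exact test avoids the approximation, the code below computes a two-sided Fisher's exact p-value for a 2 x 2 table from hypergeometric probabilities. The function name and counts are hypothetical, and the two-sided rule used here (sum the probabilities of all tables no more likely than the observed one) is one common convention, assumed for this example.

```python
from math import comb

def fisher_exact_two_sided(table):
    """Two-sided Fisher's exact p-value for a 2 x 2 table of counts."""
    (a, b), (c, d) = table
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def probability(x):
        # Probability that the top-left cell equals x when all margins are fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = probability(a)
    smallest = max(0, col1 - row2)
    largest = min(row1, col1)
    # Sum over every possible table that is no more likely than the observed one
    # (small tolerance guards against floating-point ties).
    return sum(
        p
        for p in (probability(x) for x in range(smallest, largest + 1))
        if p <= p_observed * (1 + 1e-7)
    )

small_table = [[3, 1], [1, 3]]  # hypothetical counts, far too few for Pearson's test
print(round(fisher_exact_two_sided(small_table), 4))  # 0.4857
```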

19
Q

Summary

A
  • The chi-square statistic estimates the difference between the observed data and what would be expected if the two variables were independent
  • If a chi-square statistic is large enough (and hence improbable, assuming independence), then there is evidence that the null hypothesis may not hold
  • More formally, if the probability of obtaining a chi-square statistic at least this large (assuming independence) is less than .05, reject the null hypothesis