Statistics Flashcards
Define Population:
Full set of units that we are interested in
Define Sample:
A subset of units that we experiment on or observe
Why do we use a sample?
To draw inferences about the population
Why won’t a sample give us exactly the right answer about the population?
The role of chance (which units happen to be sampled)
What is hypothesis testing?
Showing that something is unlikely to be true is easier than proving it is true, so we test whether the data are unlikely under a null hypothesis
What are the steps of hypothesis testing?
- Formulate a hypothesis
- Formulate a null hypothesis
- Calculate the chance that you might see your data if the null hypothesis is true (p value)
What is a p-value?
Probability of seeing data as extreme as, or more extreme than, what was observed, if the null hypothesis is true
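Worked example: a minimal Python sketch of calculating a p-value, assuming a made-up experiment (a coin lands heads 9 times in 10 flips) and a null hypothesis that the coin is fair. The function name `one_sided_p_value` is illustrative, and only the upper tail is counted, so this is a one-sided p-value.

```python
from math import comb


def one_sided_p_value(n_flips, n_heads, p_heads=0.5):
    """Chance of seeing n_heads or more heads in n_flips if the null is true.

    Assumes the number of heads follows Binomial(n_flips, p_heads).
    """
    return sum(
        comb(n_flips, k) * p_heads**k * (1 - p_heads) ** (n_flips - k)
        for k in range(n_heads, n_flips + 1)
    )


# Made-up data: 9 heads in 10 flips, null hypothesis "the coin is fair".
p = one_sided_p_value(10, 9)
print(f"p = {p:.4f}")  # 11/1024, about 0.0107
```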
What do you do if p<0.05 in the old school approach?
- Significant result
- Reject null hypothesis
- Accept alternative hypothesis
How do we interpret a p-value in the modern "continuum of evidence" approach?
- 0.1 = Weak evidence
- 0.05 = Moderate evidence
- 0.01 = Strong evidence
- 0.001 = Very strong evidence
What is wrong with the old school approach?
- Using a strict cut-off effectively treats p<0.05 as statistical proof
- Doesn’t represent how strong the evidence is
What are the basic types of data?
- Numerical
- Categorical
What is numerical data?
Any data that can be expressed with numbers
What are the two main subtypes of numerical data?
- Continuous
- Count
What is continuous data?
Can take any value within a range
What is an example of continuous data?
Height
Blood pressure
Time
What is count data?
Takes only integer values and represents a count of discrete things
What is an example of count data?
Number of visits to A&E
Number of children
What is categorical data?
Things that do not have an inherent numerical value
What are the main subtypes of categorical data?
- Nominal
- Ordinal
What is nominal data?
Things with no inherent order
What are examples of nominal data?
Eye colour
Blood type
What is ordinal data?
Things with an inherent order
What are examples of ordinal data?
Large/Small
Education level
Age group
What is descriptive statistics used for?
To describe the data in your sample
What is inferential statistics used for?
To draw inferences about the population from the sample
Summarising categorical data:
- Data can take on 1 of a number of categories
- Number of categories is small
- Use a frequency table
What do frequency tables allow?
To see which category is most common, which is least common, and which categories occur more frequently than others
What is a problem with frequency tables?
Cannot immediately see what share of the sample is contained in each category
What can you do to see what share of the sample is contained in each category?
Percentages
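Worked example: a minimal Python sketch of a frequency table with a percentage column, assuming a made-up sample of eye colours. The variable names are illustrative.

```python
from collections import Counter

# Made-up categorical sample (nominal data).
eye_colours = ["brown", "blue", "brown", "green", "brown", "blue", "brown", "blue"]

counts = Counter(eye_colours)
total = sum(counts.values())

# Map each category to (count, percentage of the sample), most common first.
table = {category: (n, 100 * n / total) for category, n in counts.most_common()}

for category, (n, pct) in table.items():
    print(f"{category:<6} {n:>3} {pct:5.1f}%")
```

The count column shows which category is most common; the percentage column shows each category's share of the sample.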
What are types of graphical summary of categorical data?
- Bar charts
- Pie charts
What does the height of a bar chart represent?
Number of occurrences
What does grouping data turn continuous data into?
Categorical data
What can you do instead of grouping data?
Plot histograms
What is the total area of histogram?
1
What equation is used to calculate the density of histograms?
Density = proportion in bin/bin width
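Worked example: a minimal Python sketch of the density equation, assuming made-up heights and hand-picked bin edges of unequal width. The function name `histogram_densities` is illustrative, and the choice to include the right-hand edge in the last bin is an assumption.

```python
def histogram_densities(values, bin_edges):
    """Return the density of each bin: (proportion in bin) / (bin width)."""
    n = len(values)
    last = len(bin_edges) - 2
    densities = []
    for i, (left, right) in enumerate(zip(bin_edges, bin_edges[1:])):
        # Bins are [left, right); the last bin also includes its right edge.
        count = sum(
            1 for v in values if left <= v < right or (i == last and v == right)
        )
        densities.append((count / n) / (right - left))
    return densities


# Made-up heights in cm, with bins of width 10, 5, 5, 10 and 10.
heights_cm = [152, 158, 161, 163, 166, 168, 171, 174, 179, 188]
edges = [150, 160, 165, 170, 180, 190]

densities = histogram_densities(heights_cm, edges)

# Area of each bar = density * width; the areas sum to 1.
total_area = sum(d * (r - l) for d, l, r in zip(densities, edges, edges[1:]))
print(densities, total_area)
```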
Are there gaps between bins in histograms?
No
What does the height of a bin in a histogram indicate?
Relative frequency of observations per unit of bin width (the density)
What does using density allow histograms to compare?
Bins of different widths
If a histogram has a heavy tail, does it have high or low kurtosis?
High
If a histogram has a light tail, does it have high or low kurtosis?
Low
What does location mean?
Where the data are located within the range of possible values
What are the three common measures of average?
- Mean
- Mode
- Median
What is a mean?
Equal to the sum of the values divided by the number of values
What is a median?
- Rank data in order
- Median is the middle number
- If there is an even number of data points, there is no single middle point, so take the mean of the 2 middle values
What is a mode?
Most commonly occurring value
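Worked example: a minimal Python sketch of the three averages using the standard-library `statistics` module, assuming a made-up sample of count data (number of children per family) with an even number of data points.

```python
import statistics

# Made-up count data, already ranked in order (8 data points).
children = [0, 1, 1, 1, 2, 3, 4, 8]

mean = statistics.mean(children)      # sum 20 divided by 8 values
median = statistics.median(children)  # mean of the 2 middle values, 1 and 2
mode = statistics.mode(children)      # most commonly occurring value

print(mean, median, mode)
```

The single large value (8) pulls the mean above the median, which is why the three measures can disagree.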
What is dispersion?
Technical name for the spread or variability of the data
What are the three common measures of spread?
- Standard deviation
- Interquartile range
- Range
What is standard deviation?
Equal to the square root of the mean of the squared differences between each value and the mean
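Worked example: a minimal Python sketch of the definition above, assuming made-up values. It divides by the number of values (the population form, matching `statistics.pstdev`); the sample form, which divides by n - 1, is not shown.

```python
import statistics
from math import sqrt

# Made-up values.
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(values) / len(values)                    # 40 / 8 = 5
squared_diffs = [(v - mean) ** 2 for v in values]   # 9, 1, 1, 1, 0, 0, 4, 16
sd = sqrt(sum(squared_diffs) / len(squared_diffs))  # sqrt(32 / 8) = 2

# Cross-check against the standard library's population standard deviation.
print(sd, statistics.pstdev(values))
```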