Statistics Flashcards
What are the two main types of data
quantitative
qualitative
What is ordinal data
the data can be given a meaningful order
What is nominal data
the categories have no meaningful order, i.e. they are just names, e.g. Atkins diet and paleo diet
What is binomial data
there are only two options e.g. yes or no
What is a random sample
one in which each member of the population has an equal, non-zero chance of being included
what is a stratified sample
one in which certain categories of the population must be represented, e.g. if we know the library is 50 percent history books, 30 percent science and 20 percent others, then in a sample of 20 we must select 10 history books, 6 science and 4 others
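Worked example (a minimal sketch, not part of the original cards): the library allocation above can be reproduced by multiplying each stratum's share by the sample size. The function name `proportional_allocation` and the dictionary of percentages are illustrative assumptions.

```python
def proportional_allocation(strata_percent, sample_size):
    """Return how many items to draw from each stratum.

    strata_percent maps a stratum name to its percentage of the population.
    Rounding is used, so for awkward percentages the counts may not add up
    exactly to sample_size and would need adjusting by hand.
    """
    return {
        name: round(percent * sample_size / 100)
        for name, percent in strata_percent.items()
    }


# Library example from the card: 50% history, 30% science, 20% others.
library = {"history": 50, "science": 30, "others": 20}
allocation = proportional_allocation(library, 20)
print(allocation)
```

Within each stratum the items would then be chosen at random, for instance with `random.sample`.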
what is a convenience sample
one that is not chosen randomly but is all that is available, e.g. all patients at an outpatient dermatology clinic
when would you use a bar or pie chart
categorical data
when would you use histograms, stem and leaf plots and box and whisker plots
to visualise continuous data
what does a scatter plot show
the relationship between two variables and how one changes in relation to the other
when would you use the mean and when would you use the median to describe the centrality of data
mean - normally distributed data that is not skewed
median - if the data is skewed or has significant outliers
mode - used for qualitative data
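Worked example (a small sketch with made-up numbers): adding one extreme value pulls the mean a long way but leaves the median where it was, which is why the median is preferred for skewed data or data with outliers.

```python
from statistics import mean, median

# Hypothetical, roughly symmetric data.
symmetric = [4, 5, 5, 6, 6, 6, 7, 7, 8]

# The same data with one large outlier added.
with_outlier = symmetric + [60]

print("symmetric:    mean =", mean(symmetric), " median =", median(symmetric))
print("with outlier: mean =", mean(with_outlier), " median =", median(with_outlier))
```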
what do you do differently when calculating the sample variance/sd as opposed to the population
use n-1 as the denominator instead of n
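Worked example (a sketch with made-up data): Python's standard `statistics` module exposes both versions, `pvariance` (denominator n) and `variance` (denominator n-1), so the difference can be seen directly.

```python
from statistics import pvariance, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical measurements, mean = 5
n = len(data)
sum_of_squares = sum((x - 5) ** 2 for x in data)  # squared deviations from the mean

population_var = pvariance(data)  # sum_of_squares / n
sample_var = variance(data)       # sum_of_squares / (n - 1)

print(population_var, sample_var)
```

The sample version is always slightly larger, and the gap shrinks as n grows.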
what does the standard deviation show
the spread of the data
what does positively skewed mean
that more of the values are clustered towards the bottom of the scale, with a tail towards the higher values - such as alcohol intake
what is negatively skewed
most of the values are clustered at the higher range of the scale - rare in clinical data
what is the coefficient of skewness
a value which shows how skewed the data is - the closer to 0, the more symmetrical the data
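Worked example (a sketch, assuming the moment-based coefficient of skewness, i.e. the third central moment divided by the second central moment to the power 1.5; other definitions exist and the cards do not say which one is meant). The data are made up.

```python
def skewness(values):
    """Moment-based coefficient of skewness (assumed definition)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


symmetric = [1, 2, 3, 4, 5]
positively_skewed = [0, 0, 1, 1, 2, 10]      # clustered low, long upper tail
negatively_skewed = [-v for v in positively_skewed]

print(skewness(symmetric), skewness(positively_skewed), skewness(negatively_skewed))
```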
what does a value of 0 for the kurtosis mean
indicates that the shape of the data is close to the normal distribution
what is inference
making predictions about a population based on the data collected from a smaller sample or series of smaller samples
what are the characteristics of a normal distribution
continuous
symmetrical
bell shaped curve
mean, median and mode are equal
single central peak
values between -infinity and +infinity
what is the binomial distribution
for binary data e.g. dead/alive, male/female
what is the poisson distribution
for events which occur at random intervals of time or space e.g. deaths per year.
rare events
what is the mean and sd of a standard normal distribution
mean = 0
sd = 1
we write Z ~ N(0, 1)
where would you expect 95 percent of values to lie in normally distributed data
mean +/- 1.96 x SD
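Worked example (a sketch with a hypothetical mean of 120 and SD of 10): the 1.96 multiplier is the 97.5th percentile of the standard normal distribution, which can be checked with `statistics.NormalDist` from the standard library.

```python
from statistics import NormalDist

mean, sd = 120.0, 10.0  # hypothetical values, chosen only for illustration

# 97.5th percentile of the standard normal: leaves 2.5% in each tail.
z = NormalDist().inv_cdf(0.975)

lower = mean - 1.96 * sd
upper = mean + 1.96 * sd

# Proportion of a N(mean, sd) distribution that falls between the limits.
dist = NormalDist(mean, sd)
coverage = dist.cdf(upper) - dist.cdf(lower)

print(round(z, 2), round(lower, 1), round(upper, 1), round(coverage, 3))
```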
how can you assess the normality of data
Informal review of properties of normal distribution
Inspection of a normal plot
Shapiro-Wilk test
Name ways in which you can transform data to make it plausibly normal and when you would use each one
Logarithmic - fairly skewed data in which the variances are proportional to the mean
Square root - counts
Reciprocal - highly skewed data
Cube - volumes
Logit - proportions
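Worked example (a sketch with made-up values): three of the transformations listed above applied with the standard `math` module. The helper `logit` is defined here for illustration and assumes the usual definition log(p / (1 - p)).

```python
from math import log, log10, sqrt


def logit(p):
    """Logit of a proportion p, with 0 < p < 1 (assumed usual definition)."""
    return log(p / (1 - p))


# Logarithmic: values spread over orders of magnitude become evenly spaced.
skewed = [1, 10, 100, 1000]
logged = [log10(v) for v in skewed]

# Square root: applied to counts.
counts = [0, 1, 4, 9, 16]
rooted = [sqrt(c) for c in counts]

# Logit: maps proportions in (0, 1) onto the whole real line, with 0.5 -> 0.
proportions = [0.2, 0.5, 0.8]
logits = [logit(p) for p in proportions]

print(logged, rooted, logits)
```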
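what is the variance of the expected number of events in a binomial distribution
n x p x (1 - p), where n is the number of trials and p is the probability of the event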
when can the binomial distribution be approximated to normal
if np > 5 and n(1-p) >5
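Worked example (a sketch with hypothetical n and p): the expected number of events, its variance, and the rule of thumb for the normal approximation from the two cards above. The function name `binomial_summary` is an illustrative assumption.

```python
def binomial_summary(n, p):
    """Return (mean, variance, normal_approximation_ok) for n trials with event probability p."""
    mean = n * p
    variance = n * p * (1 - p)
    # Rule of thumb from the card: both np and n(1-p) must exceed 5.
    normal_ok = n * p > 5 and n * (1 - p) > 5
    return mean, variance, normal_ok


print(binomial_summary(100, 0.2))  # large sample, approximation is reasonable
print(binomial_summary(20, 0.1))   # np = 2, so the approximation is not advised
```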
what is the standard error
the standard deviation of the sampling distribution of the mean, i.e. SD / sqrt(n)
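Worked example (a sketch with made-up data): the standard error of the mean is estimated as the sample SD divided by the square root of the sample size.

```python
from math import sqrt
from statistics import mean, stdev

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical measurements

sample_mean = mean(sample)
sample_sd = stdev(sample)               # uses n - 1 in the denominator
standard_error = sample_sd / sqrt(len(sample))

print(sample_mean, round(sample_sd, 3), round(standard_error, 3))
```

The standard error is smaller than the SD and falls as the sample size grows.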
When can you make inferences about sample means based on the normal distribution
- the sample is selected from a normal population with known SD, or the sample size is large
- observations in the sample are independent
when should the hypothesis be defined
before data is collected
what is a type 1 error
rejecting a true null hypothesis
what is a type 2 error
failing to reject (accepting) a false null hypothesis
What does the level of significance of a test mean
the probability of making a type one error
what is the generally accepted risk of making a type 2 error
20 percent
if your significance level is 5 percent what is your confidence level
95 percent
When is Student's t distribution used
When the population standard deviation is not known - for normally distributed data
What is the degrees of freedom in t distribution
one less than the sample size
What is the difference between an independent and a dependent sample
independent = different people, dependent = same people
however, if samples from two different groups are matched, e.g. for age and gender, the samples could then be viewed as dependent
What are the steps that can be done to compare the means of two samples with incomparable sample variances
- investigate the relationship between the means and variances
- use Welch’s modified t test
- do non parametric tests
- do not proceed with the test of the means
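Worked example (a sketch with made-up data): the Welch statistic does not pool the two variances, and its degrees of freedom come from the Welch-Satterthwaite formula. The function name `welch_t` is an illustrative assumption, and the p-value step (looking t up in a t distribution with the computed degrees of freedom) is left out.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, degrees_of_freedom) for Welch's unequal-variance t test."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


group_a = [5.1, 4.9, 5.0, 5.2, 4.8]       # small spread
group_b = [6.0, 7.5, 5.5, 8.0, 6.5, 7.0]  # much larger spread

t, df = welch_t(group_a, group_b)
print(round(t, 2), round(df, 1))
```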
what does it mean if the F statistic is not significant
the variances of the two samples are comparable
when can you use the normal approximation for a binomial trial
if both np and n(1-p) are greater than 5
what is regression
provides information about the nature of the relationship between variables, e.g. linear
what is correlation
assesses the extent of the association between two variables
when is a logistic regression used
when the outcome (dependent) variable is categorical, e.g. binary
how do we measure the linear relationship between two variables
correlation coefficient
what is the most commonly used measure of correlation
Pearson’s product moment correlation coefficient (r)
What are the main points to remember about r
the significance of r depends on sample size
at least one variable should be normally distributed
random sample
the pairs of variables are independent
correlation can be mathematically significant but not clinically significant
what is r squared
measure of the proportion of the variation in the dependent variable which is attributable to its linear relationship with the independent variable
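Worked example (a sketch with made-up data): Pearson's r computed from the sums of squares and cross-products, with r squared read as the proportion of variation in y attributable to its linear relationship with x. The function name `pearson_r` is an illustrative assumption.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's product moment correlation coefficient for paired lists x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
r_squared = r ** 2

print(round(r, 3), round(r_squared, 3))
```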
what assumptions are made when using regression methods
the correlation between x and y is significant
for each value of the x variable, the values of the y variable have a normal distribution
variances of these normal distributions are equal
up to what sample size can the Shapiro-Wilk test provide a test for normality
up to 2000
what do the results of the Shapiro-Wilk test mean
the closer the test statistic is to 1, the more normal the data is
what is a cohort study
a group of disease free subjects are followed up over time
what is a case control study
retrospective study of people with a disease. compares factors they have been exposed to with those of controls
advantages and disadvantages of a cohort study
less likely to be biased
expensive
not suitable for rare diseases
what are the advantages/disadvantages of a case control study
cheap and easy to do
could be biased
what factors influence the sample size needed in a study
significance level
power of the test
size of effect to be identified
standard deviation of the measurements - the greater the SD, the greater the sample size needed
study design
practical issues
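Worked example (a sketch, assuming the common normal-approximation formula for comparing two means with equal group sizes: n per group = 2 x (z for significance + z for power) squared x SD squared / difference squared; this formula is not given in the cards). It shows how the significance level, power, effect size and SD each feed into the sample size. All input values are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(sd, difference, alpha=0.05, power=0.80):
    """Approximate n per group to detect `difference` between two means (assumed formula)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)           # power = 1 - type 2 error risk
    return ceil(2 * (z_alpha + z_beta) ** 2 * sd ** 2 / difference ** 2)


baseline = sample_size_per_group(sd=10, difference=5)
larger_sd = sample_size_per_group(sd=20, difference=5)
higher_power = sample_size_per_group(sd=10, difference=5, power=0.90)

print(baseline, larger_sd, higher_power)
```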
when are parametric tests used
for normally distributed data
when are non parametric tests used and name examples
for skewed i.e. not normally distributed data
e.g. chi squared, Wilcoxon, sign
what is another name for non parametric tests
distribution free
what is the disadvantage of non parametric techniques
they are less powerful than parametric techniques. as such, all efforts to transform the data to approximate a normal distribution should be made first
what do non parametric techniques use as a representative of centre
the median
when should a large sample Wilcoxon statistic which approximates the normal be used
when n>25
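Worked example (a sketch, assuming the card refers to the Wilcoxon signed-rank test, which the cards do not state): for large n the rank sum T is approximately normal with mean n(n+1)/4 and variance n(n+1)(2n+1)/24, so it can be converted to a z value. No correction for ties is applied, and the T value below is hypothetical.

```python
from math import sqrt


def wilcoxon_z(t_statistic, n):
    """Normal approximation for the Wilcoxon signed-rank statistic (no tie correction)."""
    mean_t = n * (n + 1) / 4
    sd_t = sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return (t_statistic - mean_t) / sd_t


n = 30                 # above the n > 25 threshold on the card
observed_t = 330       # hypothetical sum of positive ranks

z = wilcoxon_z(observed_t, n)
print(round(z, 2))
```

The resulting z is then compared with the standard normal distribution, e.g. against 1.96 for a two-sided 5 percent test.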