Describing Data Flashcards
What does descriptive statistics do?
Helps to organise and summarise data in an easily communicable manner.
What are measures of central tendency?
Mean
Median
Mode
Is the mean or median more affected by extreme values?
Mean
What makes the mean more accurate?
Higher number of samples
What is the unit of mean the same as?
The unit of original measure
What is a geometric mean?
The mean obtained when individual observations are log-transformed, averaged and then back-transformed using the antilog
Advantage of geometric mean?
Will be closer to median if log-transformed data had symmetrical distribution
Difference between mean and geometrical mean?
Geometric mean will be lower than (or at most equal to) the arithmetic mean
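Worked example (a minimal Python sketch; the data and the function name `geometric_mean` are made up for illustration). It follows the log-transform, average, antilog steps described in the cards above and compares the result with the arithmetic mean.

```python
import math

def geometric_mean(values):
    """Log-transform each value, average the logs, then back-transform (antilog)."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean needs strictly positive values")
    mean_log = sum(math.log(v) for v in values) / len(values)
    return math.exp(mean_log)

data = [1, 10, 100]  # hypothetical, positively skewed observations
arithmetic = sum(data) / len(data)
geometric = geometric_mean(data)

# The geometric mean comes out at roughly 10 (the median of this data),
# well below the arithmetic mean.
print(arithmetic, geometric)
```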
What is weighted mean?
Individual values are multiplied by weights (constants) attached to them before averaging
When is weighted mean used?
When some individual observations are more or less valuable than others
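Worked example (a minimal Python sketch with made-up scores and weights). Dividing by the sum of the weights is the usual convention and is assumed here; the cards only say that values are multiplied by weights before averaging.

```python
def weighted_mean(values, weights):
    """Multiply each value by its weight, then divide by the total weight."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight

scores = [60, 80]   # hypothetical observations
weights = [1, 3]    # the second observation counts three times as much
result = weighted_mean(scores, weights)
print(result)  # (60*1 + 80*3) / 4 = 75.0
```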
Another name for the median?
50th percentile
What data is median preferable for?
Ordinal data when treated as values (not as counts)
What does 5th percentile mean?
The value below which 5% of observations lie
What type of data is mode mostly used for?
Nominal
When can mode be useful for ordinal data?
To understand most common rating obtained
In which type of distribution are the mean, mode and median equal?
Normal, symmetric distribution
Where will median lie in skewed distribution?
Between mean and mode
What happens to mean in positive skew?
Mean will be higher than median
Name some measures of variability
Range
Variance
SD
SE
What is range?
Difference between highest and lowest scores in a distribution
What is the interquartile range?
Difference between 75th and 25th percentile
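Worked example (a minimal Python sketch; the data are made up). It uses `statistics.quantiles` from the standard library with its default method. Other quartile conventions exist and can give slightly different cut points, so treat the exact numbers as one convention among several.

```python
import statistics

def interquartile_range(values):
    """75th percentile minus 25th percentile."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

data = list(range(1, 12))  # hypothetical observations: 1, 2, ..., 11
iqr = interquartile_range(data)
print(iqr)
```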
Why does variance give more information than the range?
Takes every score in the distribution into account, whereas the range uses only the two extremes
Formula for variance
Sum of squared differences of individual observations from mean/(number of observations - 1)
What is degrees of freedom?
N-1
When is variance high?
When scores are widely scattered
How is variance expressed?
In squared units of the original measure
What is the formula for SD?
Square root of variance
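Worked example (a minimal Python sketch; the scores and function names are made up). It applies the variance formula from the cards above, with N-1 in the denominator, and takes the square root for the SD.

```python
import math

def sample_variance(values):
    """Sum of squared differences from the mean, divided by (n - 1)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def sample_sd(values):
    """Square root of the variance, back in the original units."""
    return math.sqrt(sample_variance(values))

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical scores, mean = 5
variance = sample_variance(scores)  # 32 / 7, in squared units
sd = sample_sd(scores)
print(variance, sd)
```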
What is the most commonly used measure of dispersion?
SD
What is coefficient of variation a measure of?
Relative spread of data
How does one calculate the coefficient of variation?
SD / mean (multiplied by 100 to express it as a percentage)
Unit of coefficient of variation?
Percentage
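Worked example (a minimal Python sketch; the heights and weights are made up). Both data sets have the same SD, but the coefficient of variation shows that the spread is relatively larger for the one with the smaller mean.

```python
import statistics

def coefficient_of_variation(values):
    """SD divided by the mean, expressed as a percentage."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return statistics.stdev(values) / mean * 100

heights_cm = [150, 160, 170]  # hypothetical: mean 160, SD 10
weights_kg = [50, 60, 70]     # hypothetical: mean 60, SD 10

cv_height = coefficient_of_variation(heights_cm)
cv_weight = coefficient_of_variation(weights_kg)
print(cv_height, cv_weight)
```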
Formula of SE?
SD / square root of sample size
What leads to smaller SE?
Larger sample
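Worked example (a minimal Python sketch; the data are made up). The larger sample is simply the small one repeated four times, which is an artificial device used only to show how SE shrinks as n grows.

```python
import math
import statistics

def standard_error(values):
    """SD divided by the square root of the sample size."""
    return statistics.stdev(values) / math.sqrt(len(values))

small_sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical, n = 8
large_sample = small_sample * 4          # same values repeated, n = 32

se_small = standard_error(small_sample)
se_large = standard_error(large_sample)
print(se_small, se_large)  # the larger sample gives the smaller SE
```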
What do authors use SE for?
To describe variability of sample
What does SE give estimate of?
How the mean of the sample is related to the mean of the population
Precision and uncertainty of how study sample represents population
What does SD estimate?
Variability in study sample
What does SE tell us of the mean?
How precise our estimate of the mean is
Graphs used for categorical and discrete numerical data
Bar chart
Pie chart
Graphs for continuous data
Histogram
Dot plot
Scatter diagram
Difference between bar chart and histogram
A histogram has no gaps between its bars because the data are continuous; a bar chart has gaps
How to draw a dot plot
Dot placed for each observation along one axis
When does dot plot become a scattergram?
When dot plot is extended to two axes
What measures can be plotted on a scattergram?
Two continuous measures
What happens in a stem and leaf plot?
Plot first few digits of numerical observation along vertical axis
Then add numbers to one or both sides to represent individual values of observations
What is a box whisker plot?
Rectangle drawn from the 1st to the 3rd quartile (25th to 75th percentile), covering the middle 50% of observations
Median value is the line cutting through the rectangle
What do whiskers in box whisker plot show?
Minimum and maximum values of observation
Why is a normal distribution important?
A number of statistical tests assume data comes from normal distribution
In a normal population, the mean and variance (and SD) are not dependent on each other
Many natural phenomena are normally distributed
Central limit theorem
What is the central limit theorem?
States that if we draw equally sized samples from a non-normal distribution, the distribution of the means of these samples will still be normal as long as the samples are large enough
What sample size is large enough to give normal distribution for experimental purposes?
30
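Simulation sketch (minimal Python; the exponential population, the seed and the number of samples are assumptions chosen for illustration). Samples of 30 are drawn from a clearly skewed population, and the means of those samples cluster symmetrically around the population mean with a spread close to SD / square root of n.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

SAMPLE_SIZE = 30   # the "large enough" size quoted in the card above
N_SAMPLES = 2000   # hypothetical number of repeated samples

def one_sample_mean():
    # Exponential population with rate 1: skewed, with mean 1 and SD 1.
    sample = [random.expovariate(1.0) for _ in range(SAMPLE_SIZE)]
    return statistics.fmean(sample)

means = [one_sample_mean() for _ in range(N_SAMPLES)]

mean_of_means = statistics.fmean(means)
sd_of_means = statistics.stdev(means)
expected_sd = 1.0 / math.sqrt(SAMPLE_SIZE)  # population SD / sqrt(n)

print(mean_of_means, sd_of_means, expected_sd)
```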
Properties of normal distribution
Bell shaped
Mean, median and mode are same value
Curve is symmetric about the mean - skew is 0
Excess kurtosis is 0 (kurtosis itself is 3)
Tails of curve reach close to x axis but never touch it
What is kurtosis?
Flatness of the curve
What parameters have to be specified to describe normal distribution
Mean - where the peak of the density occurs
SD - indicates spread of curve
At a given value for variance, what will a higher mean do to a curve?
Shift curve to right
At a given value for mean, what will higher SD do to a curve?
Decrease peakedness of curve
At a given value for a mean, what will lower SD do to a curve?
Increase peakedness
What is a leptokurtic curve?
Sharp peak
What is a standard normal distribution?
Normal distribution whose mean is 0 and SD is 1 unit
What is standard normal deviate expression denoted by?
z
What is the formula for standard normal deviate?
(random value ‘x’ - mean) / SD
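Worked example (a minimal Python sketch; the mean of 100 and SD of 15 are made-up values). It applies the standard normal deviate formula from the card above.

```python
def z_score(x, mean, sd):
    """Standard normal deviate: how many SDs x lies from the mean."""
    if sd <= 0:
        raise ValueError("SD must be positive")
    return (x - mean) / sd

# Hypothetical scale with mean 100 and SD 15.
z_high = z_score(130, mean=100, sd=15)  # 2.0, i.e. two SDs above the mean
z_mean = z_score(100, mean=100, sd=15)  # 0.0, exactly at the mean
print(z_high, z_mean)
```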
Value of mean in negative skew?
Left of the median
What is the interquartile range?
Distance from value at 1st quartile to value at 3rd quartile
SE calculation
SD/square root of n
Calculation for CI for population mean
Mean +/- 1.96 x SE
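Worked example (a minimal Python sketch; the readings are made up). It applies mean +/- 1.96 x SE as in the card above. The 1.96 multiplier comes from the normal distribution and gives an approximate 95% interval; for small samples a t-distribution multiplier is normally used instead, so the tiny sample here is for arithmetic only.

```python
import math
import statistics

def confidence_interval_95(values):
    """Approximate 95% CI for the population mean: mean +/- 1.96 x SE."""
    mean = statistics.mean(values)
    se = statistics.stdev(values) / math.sqrt(len(values))
    margin = 1.96 * se
    return mean - margin, mean + margin

readings = [98, 102, 100, 97, 103, 100]  # hypothetical readings, mean = 100
lower, upper = confidence_interval_95(readings)
print(lower, upper)
```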
What is Gaussian distribution?
Normal distribution
What do one tailed tests do?
Examine only one direction of alternative hypothesis
What is usual value of beta?
0.2
What is an unpaired test?
2 groups have different subjects
What is a paired test?
Same subjects at different points in time
Descriptions of categorical data
Mode
Frequency
Descriptions of non-normal data
Median
Inter-quartile range
Descriptions of normal data
Mean
SD
Comparing two unpaired groups of categorical data
Chi-squared
Fisher’s exact test
Comparing two paired categorical groups
McNemar’s test
Comparing two unpaired non-normal groups
Mann-Whitney U Test
Comparing two paired non-normal groups
Wilcoxon’s signed-rank test
Comparing paired or unpaired normal data
Student’s t test
Comparing >2 unpaired categorical groups
Chi-squared
Comparing >2 paired categorical groups
McNemar’s test
Comparing >2 unpaired non-normal groups
Kruskal-Wallis ANOVA
Comparing >2 paired non-normal groups
Friedman test
Comparing >2 normal data; paired or unpaired
ANOVA
What do statistical tests give us?
Value for p
What types of data are contingency tables used for?
Categorical
X and Y axis for contingency tables
X: Outcome
Y: Risk/variable
Impact of small sample size on correlation coefficient?
The value of r is lower
How can one dampen the effect of outlying values in small samples?
Using ranks of raw data instead of absolute numbers
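Worked example (a minimal Python sketch; the data and helper names are made up, and ties are not handled). A correlation is computed on the raw values and then on their ranks, to show how ranking dampens the pull of a single outlying value.

```python
import math

def pearson(x, y):
    """Product-moment correlation on the values as given."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def ranks(values):
    """Rank 1 = smallest value. Tied values are not handled in this sketch."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def rank_correlation(x, y):
    """Correlation computed on ranks instead of raw values."""
    return pearson(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5, 6]
y = [2, 3, 4, 5, 6, 60]  # hypothetical data; the last value is an outlier

r_raw = pearson(x, y)
r_ranked = rank_correlation(x, y)
print(r_raw, r_ranked)  # the ranked value reflects the perfectly monotonic trend
```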
What is used if both variables are normal
Pearson
What is used if 1 variable is normal, the other non-normal
Spearman
What is used if 1 variable is normal, the other categorical
Spearman
What is used if 1 variable is non-normal, the other normal
Spearman
What is used if both variables are non-normal?
Kendall
What is used if one variable is categorical and the other normal?
Spearman
What is used if both variables are categorical?
Spearman
Kendall
What does regression equation do?
Describes relationship between 2+ variables by an equation that has a predictive value
What is needed to construct a regression line?
Regression equation
What can a regression line represent?
Relationship between variables on a scattergraph
Where on the scattergraph is the IV?
X axis
Where on the scattergraph is the DV?
Y axis
Equation of best fit for regression line
y=a+bx
What is a in y=a+bx
intercept of the regression line on y axis
What is b in y=a+bx
Regression coefficient (slope of regression line)
What does b in y=a+bx describe
How much y (the DV) changes for each unit change in x (the IV)
What is x in y=a+bx
Value of IV
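Worked example (a minimal Python sketch; the data are made up). Ordinary least squares is assumed as the fitting method, since the cards do not name one. The function returns the intercept a and slope b of y=a+bx.

```python
def fit_line(x, y):
    """Least-squares estimates of a (intercept) and b (slope) in y = a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

iv = [1, 2, 3, 4]  # hypothetical IV values (x axis)
dv = [3, 5, 7, 9]  # hypothetical DV values (y axis), lying exactly on y = 1 + 2x

a, b = fit_line(iv, dv)
predicted = a + b * 5  # predicted DV for a new IV value of 5
print(a, b, predicted)
```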
What happens to PPV and NPV as prevalence of a disorder decreases?
PPV will decrease
NPV will increase
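Worked example (a minimal Python sketch; the sensitivity, specificity and prevalence figures are made up). Predictive values are computed from the expected proportions of true and false results, for a common and a rare disorder.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given prevalence."""
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    false_pos = (1 - specificity) * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv

# Hypothetical test: 90% sensitive and 90% specific.
ppv_common, npv_common = predictive_values(0.9, 0.9, prevalence=0.50)
ppv_rare, npv_rare = predictive_values(0.9, 0.9, prevalence=0.01)

print(ppv_common, npv_common)
print(ppv_rare, npv_rare)  # PPV falls and NPV rises as prevalence drops
```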
What is serial testing?
When 2 or more tests are used in sequence until the test returns a negative result
A diagnosis is only confirmed if all tests return a positive result
Advantages of serial testing
Increases specificity
Useful if treatment is hazardous
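Worked example (a minimal Python sketch; the figures are made up). It assumes the two tests err independently of each other, which the cards do not state and which real tests often violate. Under that assumption, requiring both tests to be positive raises specificity at the cost of sensitivity.

```python
def serial_testing(sens_a, spec_a, sens_b, spec_b):
    """Combined sensitivity and specificity when BOTH tests must be positive.

    Assumes the two tests are conditionally independent.
    """
    combined_sens = sens_a * sens_b                  # a case must be caught by both
    combined_spec = 1 - (1 - spec_a) * (1 - spec_b)  # a non-case must fool both
    return combined_sens, combined_spec

# Two hypothetical tests, each 90% sensitive and 80% specific.
sens, spec = serial_testing(0.9, 0.8, 0.9, 0.8)
print(sens, spec)  # roughly 0.81 and 0.96
```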
What does larger AUC in ROC curve correspond to?
The better the test
AUC of 0.5 in ROC curve?
Worthless test
AUC of 1 in ROC curve?
Perfect test
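Worked example (a minimal Python sketch; the scores are made up). It uses the fact that the AUC equals the probability that a randomly chosen case scores higher than a randomly chosen non-case, with ties counted as half.

```python
def auc(case_scores, noncase_scores):
    """Proportion of (case, non-case) pairs in which the case scores higher."""
    wins = 0.0
    for c in case_scores:
        for n in noncase_scores:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5  # a tie counts as half
    return wins / (len(case_scores) * len(noncase_scores))

perfect = auc([0.8, 0.9], [0.1, 0.2])    # every case outranks every non-case
worthless = auc([0.5, 0.5], [0.5, 0.5])  # scores carry no information
print(perfect, worthless)
```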
How is cumulative survival probability calculated?
Each time an endpoint event occurs, the cumulative survival probability just before the event is multiplied by the proportion of remaining uncensored subjects who survive that event.
Endpoint probability calculation?
1 - survival probability
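Worked example (a minimal Python sketch; the follow-up times and event flags are made up). The calculation described in the cards above resembles the product-limit (Kaplan-Meier) method; treating it as such is an assumption. Censored subjects leave the at-risk group without changing the survival probability.

```python
def cumulative_survival(times, events):
    """Return [(time, cumulative survival)] at each time an endpoint occurs.

    times  : follow-up time for each subject
    events : True if the endpoint occurred at that time, False if censored
    """
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        at_risk = sum(1 for ti in times if ti >= t)
        endpoints = sum(1 for ti, e in zip(times, events) if ti == t and e)
        if endpoints:
            # Prior survival, adjusted by the fraction of at-risk subjects surviving.
            survival *= (at_risk - endpoints) / at_risk
            curve.append((t, survival))
    return curve

times = [1, 2, 2, 3, 4]                    # hypothetical follow-up times
events = [True, False, True, True, False]  # False = censored

curve = cumulative_survival(times, events)
endpoint_probability = 1 - curve[-1][1]    # 1 - survival probability
print(curve, endpoint_probability)
```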
What is hazard?
Probability that a subject will have an endpoint at a given time
What does a hazard ratio >1 mean?
The factor increases risk of outcome
What does a hazard ratio <1 mean?
The factor decreases risk of outcome
What does it mean if chi-square is bigger than its degrees of freedom?
Evidence of heterogeneity
How does forest plot show evidence of heterogeneity?
The CIs of individual studies do not overlap with one another