Data Collection, Sampling and Descriptive Statistics Flashcards
Data Collection Techniques (5)
Observations
Tests and assessments
Surveys
Document analysis (published articles)
Interviews
Cannot mix the techniques
Types of Data (2)
Primary: data that you collected
Secondary: data that someone else collected
Secondary Data Disadvantages (6)
May be out of date (limited by time)
May not have been collected long enough to detect trends.
May be missing info on some observations
May be incomplete
No control over data quality
Values may have been estimated rather than measured directly
Secondary Data Advantages (4)
Saves time
Saves money
Easily accessible
Makes collaboration easier, e.g. multicenter collaboration on rare diseases.
Primary Data Disadvantages (4)
Can be expensive to collect
Selection of population or sample
Difficulty recruiting participants
Pretesting the instrument to determine presence or absence of measurement bias.
Probabilistic Sampling Methods (4)
Simple random
Stratified random
Systematic random
Clustered random
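The four probabilistic methods above can be sketched roughly as follows. This is a minimal illustration, not a survey-grade implementation: the population of IDs 1 to 100, the "young"/"old" strata and the clinic names are all hypothetical.

```python
import random

def simple_random(population, k, rng):
    # Every unit has the same chance of being selected.
    return rng.sample(population, k)

def systematic_random(population, k, rng):
    # Random starting point, then every step-th unit.
    step = len(population) // k
    start = rng.randrange(step)
    return population[start::step][:k]

def stratified_random(strata, k_per_stratum, rng):
    # Draw a simple random sample inside each stratum (subgroup).
    sample = []
    for units in strata.values():
        sample.extend(rng.sample(units, k_per_stratum))
    return sample

def clustered_random(clusters, n_clusters, rng):
    # Pick whole clusters at random and keep every unit in them.
    chosen = rng.sample(list(clusters), n_clusters)
    return [unit for name in chosen for unit in clusters[name]]

rng = random.Random(0)                      # fixed seed for repeatability
population = list(range(1, 101))            # hypothetical IDs 1..100
strata = {"young": population[:50], "old": population[50:]}
clusters = {f"clinic_{i}": population[i * 10:(i + 1) * 10] for i in range(10)}

srs = simple_random(population, 10, rng)
sys_sample = systematic_random(population, 10, rng)
strat = stratified_random(strata, 5, rng)
clus = clustered_random(clusters, 2, rng)
```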
Non-Probabilistic Sampling Methods (3)
Convenience
Purposive
Snowball
What are Descriptive Statistics used for? (3)
To summarize data, describe data and present data.
Types of Descriptive Statistics (4)
- Measures of frequency: count, percent and frequency (how often an observation occurs).
- Measures of central tendency: mean, median and mode (locate the middle of the distribution).
- Measures of dispersion or variability: range, variance, standard deviation (typical distance between observed scores and the mean) and interquartile range.
- Measures of position and rank: percentile ranks, quartiles.
Mean
Average
Mean = (Y1+Y2+…+Yn)/n
Y: variable
Y1: 1st observation of variable Y
Yn: last observation of variable Y
n: number of observations in sample
The mean is sensitive to outliers, so it can be a poor measure of central tendency when outliers are present.
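A small sketch of the formula and of the outlier effect, using made-up scores (the data are hypothetical):

```python
def mean(values):
    # Mean = (Y1 + Y2 + ... + Yn) / n
    return sum(values) / len(values)

scores = [4, 5, 5, 6, 5]            # hypothetical observations
with_outlier = scores + [65]        # same data plus one extreme value

print(mean(scores))        # 5.0
print(mean(with_outlier))  # 15.0
```

A single extreme value moves the mean from 5 to 15, although five of the six observations are still between 4 and 6.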
Median
With all values placed in rank order, the median is the value that splits the data set into two equal halves. Same as the 50th percentile.
If the number of observations is even, the median is the average of the two middle values.
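As a minimal sketch, the odd and even cases can be handled like this (the input lists are hypothetical):

```python
def median(values):
    ordered = sorted(values)          # rank order first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]           # odd count: the single middle value
    # even count: average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([7, 1, 3]))      # 3
print(median([7, 1, 3, 5]))   # 4.0
```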
Mode
Observation with the highest frequency.
Can have more than one mode: Bimodal (2 modes).
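One possible way to find the mode, including the bimodal case, is to count frequencies and keep every value tied for the highest count (the example data are hypothetical):

```python
from collections import Counter

def modes(values):
    counts = Counter(values)                  # frequency of each observation
    top = max(counts.values())                # highest frequency
    return sorted(v for v, c in counts.items() if c == top)

print(modes([1, 2, 2, 3]))       # [2]
print(modes([1, 1, 2, 3, 3]))    # [1, 3]  (bimodal)
```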
Finding the mean when you have a bar chart with class intervals
The exact mean cannot be found from class intervals alone. It can be estimated using the midpoint of each interval.
Estimated mean = Σ(frequency × midpoint) / Σ(frequency)
Multiply the frequency of each class interval by its midpoint, add the products together, and divide the sum by the total frequency.
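A short sketch of the estimate, assuming each class is given as (lower bound, upper bound, frequency); the three classes below are hypothetical:

```python
def grouped_mean(classes):
    # classes: list of (lower, upper, frequency) tuples
    total_freq = sum(f for _, _, f in classes)
    weighted = sum(f * (lo + hi) / 2 for lo, hi, f in classes)  # frequency x midpoint
    return weighted / total_freq

classes = [(0, 10, 2), (10, 20, 5), (20, 30, 3)]   # midpoints 5, 15, 25
print(grouped_mean(classes))   # 16.0  -> (2*5 + 5*15 + 3*25) / 10
```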
Range
Difference between the lowest value and the highest value in a dataset.
Range = maximum value - minimum value
Can be affected by outliers.
Percentile
Percentile rank = ((C + 0.5 × f) / N) × 100%
C: number (count) of observations lower than the observation of interest.
f: frequency of the observation of interest.
N: total number of observations.
If you have two of the same observations you have to use the higher observation when finding C.
The 100th percentile corresponds to the highest score and the 0th percentile to the lowest score. A percentile is not the same as a percentage.
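The percentile formula with C, f and N can be sketched as below. The function name and the scores are hypothetical, and C counts only observations strictly lower than the observation of interest.

```python
def percentile_rank(values, x):
    c = sum(1 for v in values if v < x)    # C: observations lower than x
    f = sum(1 for v in values if v == x)   # f: frequency of x
    n = len(values)                        # N: all observations
    return 100 * (c + 0.5 * f) / n

scores = [50, 60, 60, 70, 80, 80, 80, 90, 95, 100]
print(percentile_rank(scores, 80))   # 55.0  -> 100 * (4 + 0.5*3) / 10
```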
Interquartile Range
Q1: the value at the 1/4 position of the ranked values.
Q3: the value at the 3/4 position of the ranked values.
IQR = Q3 - Q1
When the number of observations is even, Q2 lies between two middle values: the first middle value is used in calculating Q1 and the second in calculating Q3.
When the number of observations is odd, Q2 is a single value and is not included in the calculations of Q1 and Q3.
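A minimal sketch of one common convention, the "median of halves" method: with an even number of observations the two middle values fall into the lower and upper halves, and with an odd number the median is left out of both halves. Other conventions exist and statistical software may interpolate differently, so results can vary slightly. The data are hypothetical.

```python
def median_sorted(ordered):
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # odd n: skip the median itself; even n: upper half starts at the 2nd middle value
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median_sorted(lower), median_sorted(ordered), median_sorted(upper)

q1, q2, q3 = quartiles([1, 2, 3, 4, 5, 6, 7, 8])
print(q1, q2, q3)   # 2.5 4.5 6.5
print(q3 - q1)      # 4.0  (IQR)
```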
Variance
Measure of how close together or far apart the values in a dataset are.
The larger the variance, the further the individual values are from the mean.
The smaller the variance, the closer the individual values are to the mean.
Standard Deviation and Variance
s = standard deviation
s² = variance
Therefore s = √(s²)
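A brief sketch of the relationship between the two. It assumes the sample variance (dividing by n - 1), which the flashcards do not specify; dividing by n instead would give the population variance. The data are hypothetical.

```python
import math

def sample_variance(values):
    n = len(values)
    m = sum(values) / n
    # Assumption: sample variance, so the divisor is n - 1
    return sum((v - m) ** 2 for v in values) / (n - 1)

def sample_sd(values):
    # Standard deviation is the square root of the variance
    return math.sqrt(sample_variance(values))

data = [2, 4, 4, 4, 5, 5, 7, 9]     # mean is 5, squared deviations sum to 32
s2 = sample_variance(data)           # 32 / 7
s = sample_sd(data)
```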
Empirical Rules of Normal Distribution
In a normal distribution (symmetric and bell-shaped):
About 68% of values are within 1 SD of the mean
About 95% of values are within 2 SDs of the mean
About 99.7% of values are within 3 SDs of the mean.
Values more than 3 SDs from the mean are commonly treated as outliers.
Mean = Median = Mode for unimodal symmetrical normal distribution
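The 68 / 95 / 99.7 figures can be checked against the theoretical normal distribution. A small sketch, using the standard result that the share of a normal distribution within k SDs of the mean equals erf(k / √2); the function name is hypothetical:

```python
import math

def share_within(k):
    # Proportion of a normal distribution lying within k SDs of the mean
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(share_within(k) * 100, 1))
# 1 68.3
# 2 95.4
# 3 99.7
```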
Asymmetrical Distribution (2 Types)
Positively skewed/right tailed: skewness > 0, the long tail of the distribution extends to the right side.
Negatively skewed/left tailed: skewness < 0, the long tail of the distribution extends to the left side.
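The sign of the skewness can be illustrated with a rough sketch. It assumes the simple moment-based coefficient (third central moment divided by the 1.5 power of the second); other skewness formulas exist. The three data sets are hypothetical.

```python
def skewness(values):
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n   # second central moment
    m3 = sum((v - m) ** 3 for v in values) / n   # third central moment
    return m3 / m2 ** 1.5

right_tailed = [1, 2, 2, 3, 3, 3, 4, 10]     # one large value on the right
left_tailed = [-v for v in right_tailed]     # mirror image
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_tailed) > 0)   # True
print(skewness(left_tailed) < 0)    # True
print(skewness(symmetric))          # 0.0
```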
Describing what you see in relation to the mean example
To describe the relationship of the mean with the symmetry/asymmetry of the distribution, you could say that out of the 40 observations, 23 (57.5% of the sample) have IQ scores greater than or equal to the mean, so most people have IQ scores at or above the mean, while fewer people (n = 17, or 42.5% of the sample) have IQ scores below the mean.
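A description like this comes from counting observations on each side of the mean. A minimal sketch with a hypothetical helper and five made-up scores (not the 40 IQ scores from the example), followed by the percentage arithmetic for the 23 / 17 split:

```python
def split_by_mean(scores):
    m = sum(scores) / len(scores)
    at_or_above = sum(1 for s in scores if s >= m)
    below = len(scores) - at_or_above
    return at_or_above, below

print(split_by_mean([90, 95, 100, 105, 110]))   # (3, 2)

# Percentages for the example in the text: 23 and 17 out of 40
pct_at_or_above = 23 / 40 * 100
pct_below = 17 / 40 * 100
```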