Descriptive Stats Flashcards
What is Stats
Science of collecting, organizing and analyzing data.
Two types of Stats
i) Descriptive : What is the mean of scores in class?
ii) Inferential : Is the mean score of sample data similar to the population mean?
Descriptive Stats concept, Measures of Central Tendency :
Mean, Median and Mode.
Descriptive Stats concept, Measures of Dispersion :
Standard Deviation, Mean Deviation, Variance and Range.
Descriptive Stats concept, Mean, Median and Mode :
The Mean is the average of all data points:
μ = ∑X / N
where X is each data point and N is the number of data points.
Median:
For odd no. of observations:
The median would be: (n+1)/2th observation’s value.
For even no. of observations:
((n/2)th observation + (n/2+1)th observation)/2
Mode
It is the observation with the highest frequency.
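All three measures can be computed with Python's statistics module. A minimal sketch, using a hypothetical data set chosen so the results are easy to check by hand:

```python
import statistics

scores = [4, 7, 7, 8, 10]  # hypothetical data set

mean = statistics.mean(scores)      # sum / count = 36 / 5 = 7.2
median = statistics.median(scores)  # middle value of the sorted data (odd n) = 7
mode = statistics.mode(scores)      # most frequent value = 7

print(mean, median, mode)  # 7.2 7 7

# For an even number of observations, the median averages the two middle values:
print(statistics.median([4, 7, 8, 10]))  # (7 + 8) / 2 = 7.5
```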
When to use Mean, Median, Mode?
Mean :
- The mean is the best measure of central tendency to use when the data distribution is continuous and symmetrical. Mean is best used for a data set with numbers that are close together.
Median :
- The median is a better measure of central tendency than the mean in skewed distributions.
- The median is usually preferred to other measures of central tendency when dealing with ordinal data.
Mode :
- The mode is the only measure of central tendency for Categorical data.
- The mode is useful for reflecting the most popular answer on a ranked scale of ordinal data.
- The mode is not usually helpful for quantitative data like height or reaction time because there are often many possible values, so it’s unlikely for values to repeat.
- The mode is also not very useful if the distribution is fairly even.
What is Variability ?
Variability refers to how spread out scores are in a distribution; that is, it refers to the amount of spread of the scores around the mean.
For example, distributions with the same mean can have different amounts of variability or dispersion.
Measures of variability :
There are four frequently used measures of the variability of a distribution:
- Range
- Interquartile range
- Variance
- Standard deviation.
What is Range
Range
The most basic measure of variation is the range, which is the distance from the smallest to the largest value in a distribution.
Range= Largest value – Smallest Value
Imagine the distribution of scores on a quiz, where the lowest score is 5 and the highest is 9:
Range = 9 - 5 = 4
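The range calculation is a one-liner. A sketch with a hypothetical set of quiz scores whose minimum is 5 and maximum is 9, matching the numbers above:

```python
scores = [5, 6, 8, 9, 7]  # hypothetical quiz scores

data_range = max(scores) - min(scores)  # largest value minus smallest value
print(data_range)  # 4
```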
What is IQR?
The interquartile range (IQR) is the range of the middle 50% of scores in a distribution:
IQR = 75th percentile – 25th percentile
Quartiles are the values that divide scores into quarters.
IQR = Q3 - Q1
Q1 is the lower quartile and is the middle number between the smallest number and the median of a data set. Q2 is the middle quartile, or median. Q3 is the upper quartile and is the middle value between the median and the highest value of a data set.
Example : 5, 6, 7, 8, 9
If the median is 7, then Q1 is 6 (middle value between median and lowest value) and Q3 is 8 (middle value between median and highest value).
To calculate the IQR:
IQR= 8-6= 2
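The quartiles for the example above can be reproduced with statistics.quantiles. One caveat worth noting: the default "exclusive" method gives different quartiles for small data sets, so the "inclusive" method is used here to match the median-of-halves approach described in the text:

```python
import statistics

data = [5, 6, 7, 8, 9]  # example data from the text

# method="inclusive" reproduces the median-of-halves quartiles (Q1=6, Q3=8)
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
print(q1, q3, iqr)  # 6.0 8.0 2.0
```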
What is Variance?
The variance is the average squared difference of the scores from the mean. To compute the variance in a population:
Population Variance (σ²):
σ² (sigma-squared) = ∑(xᵢ − μ)² / N
xᵢ: Each individual data point
μ: Population mean
N: Number of data points in the population.
Sample Variance has a different formula: it divides by n − 1 instead of N (Bessel's correction), which corrects the bias that comes from estimating the mean from the sample itself.
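The population and sample versions can be compared directly with the statistics module. A sketch with a hypothetical data set whose mean is exactly 5:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data; mean = 5

pop_var = statistics.pvariance(data)  # divides by N:     32 / 8 = 4
samp_var = statistics.variance(data)  # divides by N - 1: 32 / 7 ≈ 4.571

print(pop_var, samp_var)
```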
What is the intuition behind Variance formula?
1 ) Mean as a Central Point:
First, the mean (μ) serves as the central point of the data. It’s the “balance point” where data points tend to cluster around.
2) Deviation from the Mean:
For each data point we calculate how far it is from the mean which is called the deviation. Some data points may be below the mean (negative deviation) and some above the mean (positive deviation).
3) Squaring the Deviations:
To avoid the problem of positive and negative deviations canceling each other out, we square each deviation. Squaring makes all deviations positive, emphasizing larger differences (because large deviations, when squared, become disproportionately larger).
4) Average of Squared Deviations:
The variance is the average of these squared deviations, giving us 1 representative number.
5) Measure of Dispersion:
The result is a single number that gives a sense of how spread out the data is. If the variance is small, most data points are close to the mean, and if it’s large, the data points are spread out over a wider range.
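The five steps above can be walked through by hand in a few lines (a sketch with arbitrary hypothetical values):

```python
data = [2, 4, 6, 8]                      # hypothetical values

mean = sum(data) / len(data)             # 1) central point: 5.0
deviations = [x - mean for x in data]    # 2) [-3.0, -1.0, 1.0, 3.0]; these sum to 0
squared = [d ** 2 for d in deviations]   # 3) [9.0, 1.0, 1.0, 9.0]; all positive now
variance = sum(squared) / len(data)      # 4) and 5) average squared deviation

print(variance)  # 5.0
```

Note how the raw deviations cancel to zero, which is exactly why step 3 squares them.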
What is Standard Deviation?
The standard deviation is, roughly, the average amount by which scores differ from the mean. The standard deviation is the square root of the variance, and it is a useful measure of variability when the distribution is normal or approximately normal.
Distributions with the same mean can have different standard deviations. As mentioned before, a small standard deviation indicates that scores are close together, whilst a large standard deviation indicates that scores are far apart.
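The square-root relationship to the variance is easy to verify in code. A sketch reusing the hypothetical data set whose population variance is 4:

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data; population variance is 4

sd = statistics.pstdev(data)                  # population standard deviation
print(sd)                                     # 2.0
print(math.sqrt(statistics.pvariance(data)))  # same value: sqrt(variance)
```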
Common types of Distribution :
Bernoulli, Binomial, Uniform, Gaussian or Normal, Exponential and Poisson Distribution.
CLT
The Central Limit Theorem (CLT) states that the distribution of the sample means will approach a normal distribution as the sample size increases, regardless of the shape of the population distribution. This theorem is crucial for making inferences about the population from sample data.
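The CLT can be illustrated with a small simulation using only the standard library. This is a sketch: the uniform population, the seed, and the sample sizes are arbitrary choices, not anything prescribed by the theorem:

```python
import random
import statistics

random.seed(0)  # fixed seed so the simulation is reproducible

# Population: uniform on [0, 1) -- clearly not a normal distribution.
def sample_means(sample_size, n_samples=2000):
    """Draw n_samples samples of the given size and return their means."""
    return [statistics.mean(random.random() for _ in range(sample_size))
            for _ in range(n_samples)]

means = sample_means(sample_size=30)
print(statistics.mean(means))   # close to the population mean, 0.5
print(statistics.stdev(means))  # close to sigma/sqrt(n) = (1/sqrt(12))/sqrt(30) ≈ 0.053
```

Plotting a histogram of `means` would show the characteristic bell shape even though the underlying population is flat.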
Standard Error
The standard error (SE) specifically measures the expected variability of the sample mean from the population mean across many samples.
S.E. = σ / √n, where σ is the population standard deviation and n is the sample size.
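The formula translates directly to code. A sketch with hypothetical values for σ and n:

```python
import math

sigma = 15  # hypothetical population standard deviation
n = 100     # hypothetical sample size

se = sigma / math.sqrt(n)  # standard error of the sample mean
print(se)  # 1.5
```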
Sample Size Effect
Increasing the sample size lowers the standard error, making the sampling distribution narrower and the estimates more precise.
P-Value
The p-value can be defined as a measure of how likely the observed data could occur due to random chance, under the assumption that the null hypothesis is true.
Higher P-Value: Suggests that there is a high probability of obtaining data similar to what you observed if the null hypothesis is true. Therefore, we do not have strong evidence to reject the null hypothesis. The differences observed are likely minor and can be attributed to random fluctuations.
Lower P-Value: It indicates a low probability of observing data as extreme as yours if the null hypothesis is true, implying there might be a real effect or relationship present. Thus, this gives us stronger evidence to consider rejecting the null hypothesis. The differences observed are substantial and not easily explained by random chance.
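Both cases can be made concrete with a one-sample z-test, which assumes the population standard deviation is known. This is a sketch: the function name and all the numbers are hypothetical, and the two-sided p-value for a standard normal is computed via math.erfc:

```python
import math

def two_sided_p_value(sample_mean, mu0, sigma, n):
    """P-value for a one-sample z-test of H0: population mean == mu0."""
    se = sigma / math.sqrt(n)         # standard error of the mean
    z = (sample_mean - mu0) / se      # how many SEs the sample mean is from mu0
    # Two-sided tail probability P(|Z| >= |z|) for a standard normal
    return math.erfc(abs(z) / math.sqrt(2))

# Sample mean far from mu0 -> low p-value, evidence against H0
print(round(two_sided_p_value(105, 100, 15, 36), 4))  # 0.0455

# Sample mean equal to mu0 -> p-value of 1.0, no evidence against H0
print(two_sided_p_value(100, 100, 15, 36))  # 1.0
```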