Descriptive Stats Flashcards
What are measures of central tendency
Methods employed to determine central point in a given distribution
Mode
- Corresponds to the score that has the highest frequency in a frequency distribution (visually, it's the highest peak).
- For a grouped distribution (histogram), it's defined as the most frequently occurring interval (or the midpoint of that interval).
- It is applicable to almost all kinds of data sets.
Disadvantages of mode
- Lack of reliability
- Lack of precision in some cases
Unimodal distribution
Distributions with a single peak (one highest-frequency value)
Bimodal distribution
Distributions with two peaks (two highest-frequency values)
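A minimal Python sketch of finding the mode, assuming made-up scores. `multimode` from the standard `statistics` module (Python 3.8+) returns every value tied for the highest frequency, so one result suggests a unimodal sample and two suggest a bimodal one.

```python
from statistics import multimode

bimodal_scores = [2, 3, 3, 4, 5, 5, 6]    # hypothetical scores
unimodal_scores = [1, 2, 2, 2, 3, 4]      # hypothetical scores

# multimode lists all values that share the highest frequency
bimodal_modes = multimode(bimodal_scores)
unimodal_modes = multimode(unimodal_scores)

print(bimodal_modes)     # [3, 5]
print(unimodal_modes)    # [2]
```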
Properties of mean
- If a constant is added to (or subtracted from) every score in a distribution, the mean is increased (or decreased) by that constant.
- If every score is multiplied (or divided) by the same constant, the mean will be multiplied (or divided) by the same constant.
- The sum of deviations from the mean will be equal to zero.
- The sum of squared deviations from the mean will be less than the sum of squared deviations around any other point in the distribution.
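The four properties above can be checked numerically. This is only a sketch on made-up scores; the helper `ss_around` is a hypothetical name introduced for illustration.

```python
from statistics import mean

scores = [2, 4, 6, 8]                            # hypothetical scores
m = mean(scores)                                 # 5

shifted_mean = mean([x + 10 for x in scores])    # constant added to every score
scaled_mean = mean([x * 3 for x in scores])      # every score multiplied by 3
deviation_sum = sum(x - m for x in scores)       # deviations from the mean

def ss_around(point):
    """Sum of squared deviations of the scores around an arbitrary point."""
    return sum((x - point) ** 2 for x in scores)

print(shifted_mean, scaled_mean, deviation_sum)  # 15 15 0
print(ss_around(m), ss_around(4))                # 20 24
```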
Median
- Corresponds to the middle score of a distribution after arranging the data in ascending order.
- Corresponds to the 50th percentile of a distribution.
- If a distribution has an odd number of scores, the median is literally the middle value (provided the data are arranged in ascending order).
- If a distribution has an even number of scores, the median is the average of the two middle scores.
- In principle, it divides a distribution into two equal halves.
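A short sketch of the odd and even cases, assuming made-up scores. `statistics.median` sorts the data itself, so the input does not need to be pre-arranged.

```python
from statistics import median

odd_scores = [7, 1, 5, 3, 9]     # sorted: 1 3 5 7 9, middle value is 5
even_scores = [7, 1, 5, 3]       # sorted: 1 3 5 7, middle pair is 3 and 5

odd_median = median(odd_scores)
even_median = median(even_scores)

print(odd_median)     # 5
print(even_median)    # 4.0
```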
Disadvantages of median
- It's not applicable to all kinds of data sets (e.g., the median cannot be identified for categorical nominal data, as such data cannot be logically ordered).
- The median is more informative if there are not many ties and the distribution is skewed.
What happens if there is a great deal of variability
No measure of central tendency is very representative of the scores, if the
distribution contains a great deal of variability.
Different measures of variability
- Range
- Semi-Interquartile range
- Mean deviation
- Variance
- Standard deviation.
Range
- Evaluates the width of a distribution by subtracting the lowest score (lower real limit) from the highest score (upper real limit).
- The advantage is that it captures the whole distribution.
Disadvantages of range
- The major disadvantage of range is that, just like mode, it’s unreliable.
- The range can be changed drastically by removing or adding just one score in the
distribution.
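To see how fragile the range is, here is a small sketch on made-up scores. It uses the plain highest-minus-lowest version; with real limits, whole-number scores would add one unit to each result.

```python
scores = [12, 15, 15, 18, 20]            # hypothetical scores
with_extreme = scores + [95]             # the same scores plus one extreme score

score_range = max(scores) - min(scores)
extreme_range = max(with_extreme) - min(with_extreme)

print(score_range)      # 8
print(extreme_range)    # 83
```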
Semi-interquartile range
- This type of measure of variability can be used for open-ended distributions.
- The interquartile range is obtained by subtracting the 25th percentile from the 75th percentile. The semi-interquartile range is half the interquartile range.
- It does not get affected much by addition or subtraction of extreme scores from a distribution.
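A rough sketch of the semi-interquartile range, assuming made-up scores. Percentile conventions differ between textbooks and software; this uses `statistics.quantiles` with `method="inclusive"` as one possible choice, and `semi_interquartile_range` is a hypothetical helper name.

```python
from statistics import quantiles

def semi_interquartile_range(scores):
    """Half the distance between the 25th and 75th percentiles."""
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    return (q3 - q1) / 2

scores = list(range(1, 10))              # 1 to 9, hypothetical scores
with_extreme = scores + [100]            # one extreme score added

siqr = semi_interquartile_range(scores)
siqr_extreme = semi_interquartile_range(with_extreme)

print(siqr)            # 2.0
print(siqr_extreme)    # 2.25
```

The range of the same data jumps from 8 to 99 when the extreme score is added, while the semi-interquartile range barely moves.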
Mean deviation
- It evaluates the distance of every score from the mean of the distribution and averages these distances.
Deviation score = ?
Individual score - Mean
Mean deviation calculation
- Compute the deviation score for every individual score
- Take the mean of all deviation scores (using absolute deviation scores)
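The two calculation steps can be sketched as follows, assuming made-up scores and deviation scores taken as individual score minus mean. Absolute values are needed because the signed deviations always sum to zero.

```python
from statistics import fmean

scores = [2, 4, 6, 8]                            # hypothetical scores
m = fmean(scores)                                # 5.0

deviations = [x - m for x in scores]             # signed deviation scores
mean_deviation = fmean([abs(d) for d in deviations])

print(deviations)        # [-3.0, -1.0, 1.0, 3.0]
print(sum(deviations))   # 0.0
print(mean_deviation)    # 2.0
```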
What is variance also referred to as
Mean square
SS = ?
summation of (individual score - mean)^2
Variance = ?
SS / N
Standard deviation
- Calculated by taking the square root of the variance
- Also called the root mean square (of the deviations from the mean)
- Strongly affected by scores having large deviations in the distribution
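A minimal sketch tying SS, variance, and standard deviation together, assuming made-up scores treated as a whole population (so the divisor is N). The hand-computed values are compared against `pvariance` and `pstdev` from the standard library.

```python
from math import sqrt
from statistics import fmean, pstdev, pvariance

scores = [2, 4, 6, 8]                        # hypothetical scores
m = fmean(scores)                            # 5.0

ss = sum((x - m) ** 2 for x in scores)       # sum of squared deviations
variance = ss / len(scores)                  # mean square: SS / N
sd = sqrt(variance)                          # root mean square

print(ss, variance)                          # 20.0 5.0
print(round(sd, 3))                          # 2.236
print(pvariance(scores), round(pstdev(scores), 3))
```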
Properties of standard deviation
- If a constant is added to (or subtracted from) every score in a distribution, the standard deviation is not affected.
- If every score is multiplied (or divided) by the same constant, the standard deviation will be multiplied (or divided) by the same constant.
- The standard deviation from the mean will be smaller than the standard deviation from any other point in the distribution.
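These properties can be checked on made-up scores. In this sketch `rms_deviation` is a hypothetical helper that computes the root mean square deviation around any chosen point, so the value at the mean is the ordinary standard deviation.

```python
from math import sqrt
from statistics import fmean, pstdev

scores = [2, 4, 6, 8]                            # hypothetical scores
sd = pstdev(scores)

sd_shifted = pstdev([x + 10 for x in scores])    # constant added: unchanged
sd_scaled = pstdev([x * 3 for x in scores])      # multiplied by 3: three times larger

def rms_deviation(point):
    """Root mean square deviation of the scores around an arbitrary point."""
    return sqrt(fmean([(x - point) ** 2 for x in scores]))

print(round(sd, 3), round(sd_shifted, 3), round(sd_scaled, 3))
print(round(rms_deviation(fmean(scores)), 3), round(rms_deviation(4), 3))
```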
What is positive skewness
Positive skewness represents an asymmetrical distribution with a long right tail.
What is negative skewness
Negative skewness represents an asymmetrical distribution with a long left tail.
Skewness = ?
[Summation of (individual score - mean)^3 / N] / (standard deviation)^3
Central tendencies of a skewed distribution
- When the distribution is negatively skewed, the mean will be to the left of the median
- When the distribution is positively skewed, the mean will be to the right of the median
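A small sketch of skewness and the mean/median relationship, assuming made-up scores. It uses the population form, the average cubed deviation divided by the cube of the standard deviation; `skewness` is a hypothetical helper name.

```python
from statistics import fmean, median, pstdev

def skewness(scores):
    """Average cubed deviation from the mean, divided by the cubed SD."""
    m = fmean(scores)
    third_moment = fmean([(x - m) ** 3 for x in scores])
    return third_moment / pstdev(scores) ** 3

right_tailed = [1, 2, 2, 3, 12]              # hypothetical, long right tail
left_tailed = [-x for x in right_tailed]     # mirror image, long left tail

for scores in (right_tailed, left_tailed):
    print(round(skewness(scores), 2), fmean(scores), median(scores))
```

For the right-tailed scores the skewness is positive and the mean (4.0) lies to the right of the median (2); the mirrored scores reverse both.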
Important distinction
two distributions can both be symmetric (i.e., skewness equals
zero), unimodal, and bell-shaped and yet not be identical in shape.
How can kurtosis be measured
by raising deviations from the mean to the fourth power,
taking their average, and then dividing by the square of the population variance.
What does negative kurtosis indicate?
relatively thin tails and a lesser peakedness in the middle (a
platykurtic distribution).
What does positive kurtosis indicate?
relatively fat tails and more peakedness in the middle of the distribution (a leptokurtic distribution).
What is a mesokurtic distribution?
A distribution with the same kurtosis as the normal distribution. Its kurtosis measure is zero when 3 is subtracted in the above formula.
What is kurtosis measured relative to?
relative to the kurtosis of a normal distribution, which is 3. Therefore, we are always interested in the “excess” kurtosis.
Excess kurtosis = ?
Excess kurtosis = sample kurtosis – 3
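A minimal sketch of the kurtosis calculation described above, assuming made-up scores: average the fourth-power deviations, divide by the squared population variance, then subtract 3. `excess_kurtosis` is a hypothetical helper name.

```python
from statistics import fmean, pvariance

def excess_kurtosis(scores):
    """Average fourth-power deviation over squared variance, minus 3."""
    m = fmean(scores)
    fourth_moment = fmean([(x - m) ** 4 for x in scores])
    return fourth_moment / pvariance(scores) ** 2 - 3

flat_scores = [1, 2, 3, 4, 5]                          # hypothetical, evenly spread
heavy_tailed = [-10, -1, 0, 0, 0, 0, 0, 1, 10]         # hypothetical, fat tails

print(round(excess_kurtosis(flat_scores), 2))      # -1.3  (platykurtic)
print(round(excess_kurtosis(heavy_tailed), 2))     # 1.41  (leptokurtic)
```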
What is kurtosis used for quantifying?
non-normality—the deviation from a normal distribution—of a distribution.
What does a value of 3 or more indicate?
large departure from normality.
What does a very small value of kurtosis indicate?
a deviation from normality, but one that is considered benign.
What is population analysis?
statistics applied to the whole data set
What is sample analysis
Statistics applied to a subset of the whole data
Sample variance can be
Larger or smaller than population variance
What equals the population variance
The average of infinitely many sample variances (each computed with N - 1 in the denominator)
What is degree of freedom
The number of deviations that are free to vary
df = ?
N-1
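The effect of dividing by the degrees of freedom can be seen with the standard library, assuming a made-up sample. `pvariance` divides SS by N, while `variance` divides by N - 1.

```python
from statistics import pvariance, variance

sample = [2, 4, 6, 8]                    # hypothetical sample, SS = 20
df = len(sample) - 1                     # degrees of freedom: N - 1

divided_by_n = pvariance(sample)         # SS / N
divided_by_df = variance(sample)         # SS / (N - 1)

print(divided_by_n)                      # 5
print(round(divided_by_df, 3))           # 6.667
```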
What is confidence interval?
the range of likely values of the parameter
What is the standard error of the mean
the standard deviation divided by the square root of the number of scores in the sample (the sample size).
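A rough sketch of the standard error of the mean and an approximate confidence interval, assuming a made-up sample. The multiplier 1.96 is the normal-curve value for 95% and is an assumption here; with a sample this small a t value would normally be used instead.

```python
from math import sqrt
from statistics import mean, stdev

sample = [2, 4, 6, 8]                    # hypothetical sample
n = len(sample)

sem = stdev(sample) / sqrt(n)            # sample SD (N - 1) over root of sample size

# Approximate 95% interval for the population mean (normal approximation)
lower = mean(sample) - 1.96 * sem
upper = mean(sample) + 1.96 * sem

print(round(sem, 3))                     # 1.291
print(round(lower, 2), round(upper, 2))
```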
What is the variance
the average of the squared deviations from the mean across the number of scores.
What are outliers
those observations that differ strongly (different properties) from the other data
points in the sample of a population.
Sources of outliers
- Human errors (wrong data entry)
- Measurement errors (faulty system/tool)
- Data manipulation errors (faulty data pre-processing)
- Sampling errors (creating samples from heterogeneous sources)
Methods for indicating outliers
- Tukey’s Fences (or Quartile method)
- Z-score
- Local Outlier Factor (LOF)
- Angle-Based Outlier Detection (ABOD)
- Silhouette (K-Means Clustering)
- Confidence Interval (CI) of fit
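As one illustration from the list, here is a sketch of the Z-score method on made-up scores. The cutoff of 3 is a common convention rather than something fixed by the method, and it is an assumption here; in very small samples no score can reach such a cutoff.

```python
from statistics import fmean, pstdev

scores = [10, 11, 12, 11, 10, 12, 11, 10, 12, 11, 50]   # hypothetical scores
cutoff = 3                                               # assumed threshold

m = fmean(scores)
sd = pstdev(scores)

# z-score: distance of each score from the mean in standard deviation units
z_scores = [(x - m) / sd for x in scores]
flagged = [x for x, z in zip(scores, z_scores) if abs(z) > cutoff]

print(flagged)    # [50]
```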
What is H-spread
The length of the box; it is equal to the interquartile range, not the semi-interquartile range
What are the inner fences
The outermost limits of the plot
Inner fence is equal to
1.5 times the H-spread beyond each end of the box
The whiskers do not generally extend to the
inner fences
Ends of the upper and lower whiskers are known as
the upper adjacent value and the lower adjacent value (the most extreme scores that still lie within the inner fences)
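The box-plot terms above can be sketched with Tukey's fences on made-up scores. Quartiles from `statistics.quantiles` with `method="inclusive"` stand in for the hinges here, which is an assumption; other quartile conventions can give slightly different fences.

```python
from statistics import quantiles

scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]     # hypothetical scores

q1, _, q3 = quantiles(scores, n=4, method="inclusive")
h_spread = q3 - q1                           # length of the box

# Inner fences sit 1.5 H-spreads beyond each end of the box
lower_fence = q1 - 1.5 * h_spread
upper_fence = q3 + 1.5 * h_spread

inside = [x for x in scores if lower_fence <= x <= upper_fence]
outliers = [x for x in scores if x < lower_fence or x > upper_fence]

# Whiskers end at the most extreme scores inside the fences
lower_adjacent, upper_adjacent = min(inside), max(inside)

print(h_spread, lower_fence, upper_fence)    # 4.5 -3.5 14.5
print(lower_adjacent, upper_adjacent)        # 1 9
print(outliers)                              # [30]
```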