Descriptive Stats / Intro Flashcards
A frequency distribution table is a summary table that shows the number of occurrences (frequency) of different values or ranges of values in a dataset.
A frequency distribution table
A graphical representation of the distribution of a dataset, displaying the frequencies of data values within specific intervals or bins.
A histogram
A table that shows the frequencies or proportions up to a certain point in a dataset, providing a running total of the frequencies.
A cumulative distribution table
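The frequency and cumulative tables above can be sketched with the standard library; the example values here are invented for illustration.

```python
# Build a frequency table and a cumulative frequency table for a small dataset.
from collections import Counter

data = [1, 2, 2, 3, 3, 3, 4]

freq = Counter(data)          # value -> frequency (how often each value occurs)
cumulative = {}
running = 0
for value in sorted(freq):
    running += freq[value]
    cumulative[value] = running   # running total of frequencies up to this value

print(freq[3])        # 3 occurs three times
print(cumulative[3])  # six observations are <= 3
```

`Counter` gives the frequency distribution directly; the running sum over sorted values is exactly the "running total" a cumulative distribution table shows.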
A distribution with kurtosis equal to that of the normal distribution, indicating moderate peakedness and tail behaviour.
Mesokurtic kurtosis (normal Kurtosis)
A distribution with a higher peak and heavier tails than the normal distribution, indicating more extreme values.
Leptokurtic kurtosis (positive kurtosis)
A distribution with a lower peak and lighter tails than the normal distribution, indicating fewer extreme values.
Platykurtic kurtosis (negative kurtosis)
What is the difference between inferential and descriptive statistics?
Descriptive statistics summarise and describe data, while inferential statistics make predictions or inferences about a population based on a sample.
What are descriptive statistics?
Descriptive statistics are methods used to summarise and describe the main aspects of a dataset, such as central tendency, variability, and distribution.
What are the main aspects of a dataset that descriptive statistics summarise?
Central tendency
Variability
Distribution
If the numbering scheme is arbitrary then it’s probably best to use the —– as a measure of central tendency.
Mode
If your data are ordinal scale you’re more likely to want to use the ——- as a measure of central tendency.
median
(The median only makes use of the order information in your data (i.e., which numbers are bigger) but doesn’t depend on the precise numbers involved. That’s exactly the situation that applies when your data are ordinal scale. The mean, on the other hand, makes use of the precise numeric values assigned to the observations, so it’s not really appropriate for ordinal data.)
The —— has the advantage that it uses all the information in the data (which is useful when you don’t have a lot of data). But it’s very sensitive to extreme, outlying values.
mean
——- of the data. That is, how “spread out” are the data? How “far” away from the mean or median do the observed values tend to be?
variability
the 50th percentile is the same as the ——– value
median
The —— ——– (—-) is like the range, but instead of the difference between the biggest and smallest value the difference between the 25th percentile and the 75th percentile is taken.
The interquartile range (IQR)
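The IQR can be computed with the standard library's `statistics.quantiles`; note that different software (jamovi included) may use slightly different quantile methods, so results can differ a little at the edges. The data here are made up.

```python
# Interquartile range: 75th percentile minus 25th percentile.
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
q1, q2, q3 = statistics.quantiles(data, n=4)  # the three quartile cut points
iqr = q3 - q1
print(q1, q3, iqr)   # 2.5 7.5 5.0
```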
Variability
Mean absolute deviation
absolute deviations from the mean, added and averaged
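The mean absolute deviation follows that recipe directly: deviations from the mean, sign dropped, then averaged. The example data are invented.

```python
# Mean absolute deviation: average distance of the scores from the mean.
import statistics

data = [2, 4, 4, 4, 6]
mean = statistics.mean(data)                 # 4.0
deviations = [x - mean for x in data]        # [-2.0, 0.0, 0.0, 0.0, 2.0]
mad = sum(abs(d) for d in deviations) / len(data)
print(mad)   # 0.8
```

Without the absolute value, the deviations from the mean always sum to zero, which is why the sign has to be dropped first.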
what is the RMSD
“root mean squared deviation”
Properties of distributions.
- What the central tendency is (mean, median or mode).
- How symmetrical the data are on either side of the mean (skew).
- How variable the data are (e.g. range, standard deviation and kurtosis).
- Whether it's a "normal distribution".
It’s often extremely useful to try to condense the data into a few simple “summary” statistics. In most situations, the first thing that you’ll want to calculate is a measure of ——- ———
central tendency
If your data are nominal scale, why shouldn't you be using either the mean or the median?
Both the mean and the median rely on the idea that the numbers assigned to values are meaningful. If the numbering scheme is arbitrary then it’s probably best to use the Mode instead.
If your data are ordinal scale you're more likely to want to use the median than the mean because
The median only makes use of the order information in your data but doesn’t depend on the precise numbers involved.
(The mean makes use of the precise numeric values, so it's not really appropriate for ordinal data.)
For interval and ratio scale data you can use :
either one (median or mean) is generally acceptable.
(Which one you pick depends a bit on what you’re trying to achieve. The mean has the advantage that it uses all the information in the data (which is useful when you don’t have a lot of data). But it’s very sensitive to extreme, outlying values.)
there are systematic differences between the mean and the median when
the histogram is asymmetric (Skew and kurtosis)
(Average income example: the median is more appropriate, as the mean will give an exaggerated view.)
The mean can be remembered as the
centre of gravity, or balancing point, of the data
out of “mean absolute deviation” (from the mean)
and
“median absolute deviation” (from the median).
which seems to be the better of the two?
The measure based on the median is the one used in statistics and does seem to be the better of the two.
(But to be honest I don’t think I’ve seen it used much in psychology.)
X̄ (X-bar) equals
The mean
deviation from the mean
Score - X̄ (the score minus the mean)
(first step in computing the absolute deviation from the mean)
Absolute deviation from the mean -
the absolute deviations from the mean, averaged
The variance of a data set X is sometimes written as Var(X), but it's more commonly denoted
s² (s-squared)
How does jamovi calculate variance differently?
It divides by N-1 instead of N.
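Python's standard library exposes both conventions, which makes the N versus N-1 difference easy to see. The example data are invented.

```python
# Variance two ways: dividing by N (population) vs. N-1 (sample, as in jamovi).
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]       # mean is 5; sum of squared deviations is 32
pop_var = statistics.pvariance(data)  # 32 / N     = 32 / 8
samp_var = statistics.variance(data)  # 32 / (N-1) = 32 / 7
print(pop_var)              # 4
print(round(samp_var, 3))   # 4.571
```

Dividing by N-1 gives a slightly larger number, which corrects the tendency of the N-divisor formula to underestimate the population variance from a sample.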
RMSD (root mean squared deviation) is
the square root of the variance
What are the two categories of descriptive stats?
Measures of central tendency and measures of dispersion
The square root of the variance, known as the standard deviation, is also called the
“root mean squared deviation”, or RMSD.
Range, variance and standard deviation are all measures of
Measures of Dispersion
the standard deviation is derived from the ——-
variance
In general, you should expect –% of the data to fall within 1 standard deviation of the mean, –% to fall within 2 standard deviations of the mean, and –% to fall within 3 standard deviations of the mean
68, 95, 99.7
(This isn't exact: it's calculated on the assumption that the histogram is symmetric and "bell shaped", so it's only approximately correct.)
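Where the 68/95/99.7 figures come from: for a normal distribution, the proportion of data within k standard deviations of the mean is erf(k / √2), which the standard library can evaluate directly.

```python
# Derive the 68-95-99.7 rule from the normal distribution.
import math

for k in (1, 2, 3):
    proportion = math.erf(k / math.sqrt(2))   # P(|Z| <= k) for standard normal Z
    print(k, round(100 * proportion, 1))      # 68.3, 95.4, 99.7
```

So the commonly quoted 68/95/99.7 are themselves rounded values, which is why real data only match them approximately even when roughly bell shaped.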
Gives you the full spread of the data. It’s very vulnerable to outliers and as a consequence it isn’t often used unless you have good reasons to care about the extremes in the data.
Range
Tells you where the “middle half” of the data sits. It’s pretty robust and complements the median nicely. This is used a lot.
Interquartile range
Tells you how far “on average” the observations are from the mean. It’s very interpretable but has a few minor issues (not discussed here) that make it less attractive to statisticians than the standard deviation. Used sometimes, but not often.
Mean absolute deviation
Tells you the average squared deviation from the mean. It’s mathematically elegant and is probably the “right” way to describe variation around the mean, but it’s completely uninterpretable because it doesn’t use the same units as the data. Almost never used except as a mathematical tool, but it’s buried “under the hood” of a very large number of statistical tools.
Variance
the — and the —— ——– are easily the two most common measures used to report the variability of the data.
IQR and the standard deviation
But there are situations in which the others are used. I’ve described all of them in this book because there’s a fair chance you’ll run into most of these somewhere.
This is the square root of the variance. It’s fairly elegant mathematically and it’s expressed in the same units as the data so it can be interpreted pretty well. In situations where the mean is the measure of central tendency, this is the default. This is by far the most popular measure of variation.
Standard deviation
A standard score is also referred to as a
Z-score
The standard score is defined as
the number of standard deviations above the mean that my score lies
standard score (z-score) =
(raw score - mean) divided by the standard deviation, e.g. (35 - mean) / sd for a raw score of 35
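A quick sketch of the z-score calculation for a raw score of 35; the dataset is invented, and the population standard deviation is used here (a sample SD would give a slightly different z).

```python
# Standard score (z-score): how many standard deviations above the mean a score lies.
import statistics

data = [20, 25, 30, 35, 40]
mean = statistics.mean(data)    # 30
sd = statistics.pstdev(data)    # population SD, about 7.07
z = (35 - mean) / sd
print(round(z, 3))   # 0.707: the score 35 is about 0.7 SDs above the mean
```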
—– ——- allow you to interpret a raw score in relation to a larger population (and thereby make sense of variables that lie on arbitrary scales).
standard scores (z-scores)
Standard scores can also be used to
compare scores to one another when the raw scores are scaled differently (e.g. in polls that use different scales).