Descriptive Stats / Intro Flashcards
A frequency distribution table is a summary table that shows the number of occurrences (frequency) of different values or ranges of values in a dataset.
A frequency distribution table
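As a rough illustration of the card above, a frequency distribution table can be sketched with Python's standard library. The `responses` list is made-up sample data, not taken from these notes.

```python
from collections import Counter

# Hypothetical sample: favourite colours reported by ten people.
responses = ["red", "blue", "blue", "green", "red",
             "blue", "green", "blue", "red", "blue"]

# Counter maps each distinct value to the number of times it occurs.
freq = Counter(responses)

# Print the table, most frequent value first.
for value, count in freq.most_common():
    print(f"{value:<6} {count}")
```
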
A graphical representation of the distribution of a dataset, displaying the frequencies of data values within specific intervals or bins.
A histogram
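One possible way to see what a histogram counts is to bin the values by hand and draw a text bar per bin. The `scores` data and the bin width of 10 are assumptions chosen only for illustration.

```python
# Hypothetical exam scores (illustrative data only).
scores = [52, 67, 71, 58, 88, 93, 75, 64, 79, 81, 69, 73]
bin_width = 10  # assumed bin width

# Count how many scores fall into each bin, keyed by the bin's lower edge.
counts = {}
for s in scores:
    start = (s // bin_width) * bin_width
    counts[start] = counts.get(start, 0) + 1

# Draw one row of '#' characters per bin, in ascending order.
for start in sorted(counts):
    print(f"{start}-{start + bin_width - 1}: {'#' * counts[start]}")
```
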
A table that shows the frequencies or proportions up to a certain point in a dataset, providing a running total of the frequencies.
A cumulative distribution table
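The running total described on this card can be sketched with `itertools.accumulate`. The bin labels and frequencies below are hypothetical.

```python
from itertools import accumulate

# Hypothetical frequency table: one frequency per bin.
bins = ["50-59", "60-69", "70-79", "80-89", "90-99"]
freqs = [2, 3, 4, 2, 1]

# Running total of the frequencies.
cum_freqs = list(accumulate(freqs))

# Cumulative proportions: running total divided by the grand total.
total = cum_freqs[-1]
cum_props = [c / total for c in cum_freqs]

for label, f, cf, cp in zip(bins, freqs, cum_freqs, cum_props):
    print(f"{label}  freq={f}  cum_freq={cf}  cum_prop={cp:.2f}")
```
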
A distribution with kurtosis equal to that of the normal distribution, indicating moderate peakedness and tail behaviour.
Mesokurtic (normal kurtosis)
A distribution with a higher peak and heavier tails than the normal distribution, indicating more extreme values.
Leptokurtic (positive kurtosis)
A distribution with a lower peak and lighter tails than the normal distribution, indicating fewer extreme values.
Platykurtic (negative kurtosis)
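To make the sign convention on these three cards concrete, here is a minimal sketch of excess kurtosis (kurtosis minus 3, so a normal distribution scores about zero). It assumes the simple moment-based formula; statistics packages often apply a bias correction, so their numbers can differ. The helper name `excess_kurtosis` and both datasets are illustrative.

```python
def excess_kurtosis(data):
    """Moment-based excess kurtosis: m4 / m2**2 - 3 (no bias correction)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in data) / n  # fourth central moment
    return m4 / m2 ** 2 - 3

# Evenly spread values: light tails, so the result is negative (platykurtic).
flat = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Mostly identical values plus two extremes: heavy tails (leptokurtic).
heavy = [-10, 0, 0, 0, 0, 0, 0, 0, 0, 10]

print(excess_kurtosis(flat))
print(excess_kurtosis(heavy))
```
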
What is the difference between inferential and descriptive statistics?
Descriptive statistics summarise and describe data, while inferential statistics make predictions or inferences about a population based on a sample.
What are descriptive statistics?
Descriptive statistics are methods used to summarise and describe the main aspects of a dataset, such as central tendency, variability, and distribution.
What are the main aspects of a dataset that descriptive statistics summarise?
Central tendency
Variability
Distribution
If the numbering scheme is arbitrary then it’s probably best to use the —– as a measure of central tendency.
Mode
If your data are ordinal scale you’re more likely to want to use the ——- as a measure of central tendency.
median
(The median only makes use of the order information in your data (i.e., which numbers are bigger) but doesn’t depend on the precise numbers involved. That’s exactly the situation that applies when your data are ordinal scale. The mean, on the other hand, makes use of the precise numeric values assigned to the observations, so it’s not really appropriate for ordinal data.)
The —— has the advantage that it uses all the information in the data (which is useful when you don’t have a lot of data). But it’s very sensitive to extreme, outlying values.
mean
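The sensitivity of the mean to outliers can be checked with a small sketch. The income figures (in thousands) are invented for illustration.

```python
import statistics

# Hypothetical incomes in thousands (illustrative data only).
incomes = [30, 32, 35, 38, 40]
with_outlier = incomes + [500]  # add one extreme value

# Without the outlier the mean and median agree.
print(statistics.mean(incomes), statistics.median(incomes))

# With the outlier the mean jumps, while the median barely moves.
print(statistics.mean(with_outlier), statistics.median(with_outlier))
```

Here a single extreme value moves the mean from 35 to 112.5, while the median only moves from 35 to 36.5.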
——- of the data. That is, how “spread out” are the data? How “far” away from the mean or median do the observed values tend to be?
variability
The 50th percentile is the same as the ——– value
median
The —— ——– (—-) is like the range, but instead of taking the difference between the biggest and smallest values, it takes the difference between the 75th percentile and the 25th percentile.
The interquartile range (IQR)
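A minimal sketch of the range and the IQR using `statistics.quantiles`. The data are hypothetical, and the `method="inclusive"` setting is an assumption: different quantile conventions can give slightly different quartiles on small samples.

```python
import statistics

# Hypothetical observations (illustrative data only).
data = [2, 4, 4, 5, 6, 7, 8, 9, 10]

# Range: biggest value minus smallest value.
data_range = max(data) - min(data)

# n=4 returns the three cut points: 25th, 50th and 75th percentiles.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

# IQR: 75th percentile minus 25th percentile.
iqr = q3 - q1

print(f"range={data_range}, Q1={q1}, median={q2}, Q3={q3}, IQR={iqr}")
```

In this sketch the 50th percentile `q2` coincides with `statistics.median(data)`.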
Variability
Mean absolute deviation
The absolute deviations from the mean, added up and averaged
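The calculation can be sketched in a few lines. The helper name `mean_absolute_deviation` and the data are assumptions made for this example.

```python
def mean_absolute_deviation(data):
    """Average of the absolute deviations from the mean."""
    mean = sum(data) / len(data)
    abs_devs = [abs(x - mean) for x in data]  # distance of each value from the mean
    return sum(abs_devs) / len(abs_devs)

# Hypothetical observations: the mean is 6 and the deviations are 4, 2, 0, 2, 4.
data = [2, 4, 6, 8, 10]
mad = mean_absolute_deviation(data)
print(mad)
```
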
What is the RMSD?
“Root mean squared deviation”: the square root of the mean of the squared deviations from the mean
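Reading the name right to left gives the recipe: take deviations, square them, average them, then take the square root. The sketch below assumes division by N (the population form), which matches `statistics.pstdev`. The sample standard deviation, `statistics.stdev`, divides by N - 1 instead. The helper name `rmsd` and the data are illustrative.

```python
import math
import statistics

def rmsd(data):
    """Root mean squared deviation from the mean (divides by N)."""
    mean = sum(data) / len(data)
    squared_devs = [(x - mean) ** 2 for x in data]   # squared deviations
    mean_sq_dev = sum(squared_devs) / len(data)      # their mean (the variance)
    return math.sqrt(mean_sq_dev)                    # the root

# Hypothetical observations (illustrative data only).
data = [2, 4, 6, 8, 10]
print(rmsd(data))
print(statistics.pstdev(data))  # population standard deviation, for comparison
```
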
Properties of distributions.
- What the central tendency is (mean, median or mode).
- How symmetrical the data is either side of the mean (skew).
- How variable the data is (e.g. data range, standard deviation and kurtosis).
- If it’s a “normal distribution”.
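As a rough sketch of the symmetry property in the list above, skewness can be estimated from the central moments. This assumes the simple moment-based formula without bias correction, so statistics packages may report slightly different values. The helper name `skewness` and both datasets are illustrative.

```python
def skewness(data):
    """Moment-based skewness: m3 / m2**1.5 (no bias correction)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n  # third central moment
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]             # mirror image around the mean
right_skewed = [1, 2, 2, 3, 3, 3, 20]   # long tail towards large values

print(skewness(symmetric))     # zero for perfectly symmetric data
print(skewness(right_skewed))  # positive for a right (positive) skew
```
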
It’s often extremely useful to try to condense the data into a few simple “summary” statistics. In most situations, the first thing that you’ll want to calculate is a measure of ——- ———
central tendency
If your data are nominal scale, why should you probably not use either the mean or the median?
Both the mean and the median rely on the idea that the numbers assigned to values are meaningful. If the numbering scheme is arbitrary then it’s probably best to use the mode instead.
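A short sketch of why the mode suits nominal data: it only needs to count labels, so no numeric coding is involved. The `transport` responses are hypothetical.

```python
import statistics

# Hypothetical nominal data: how six people travel to work.
transport = ["bus", "car", "car", "bike", "car", "bus"]

# The mode is the most frequent label; no numbers are assigned to categories.
most_common = statistics.mode(transport)
print(most_common)

# multimode returns every label tied for most frequent, as a list.
print(statistics.multimode(transport))
```
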
If your data are ordinal scale you’re more likely to want to use the median than the mean because:
The median only makes use of the order information in your data but doesn’t depend on the precise numbers involved.
(The mean makes use of the precise numeric values, so it’s not really appropriate for ordinal data.)
For interval and ratio scale data you can use:
Either one (median or mean) is generally acceptable.
(Which one you pick depends a bit on what you’re trying to achieve. The mean has the advantage that it uses all the information in the data (which is useful when you don’t have a lot of data). But it’s very sensitive to extreme, outlying values.)
There are systematic differences between the mean and the median when
The histogram is asymmetric (skewed)
(Average income example: the median is more appropriate, as the mean will give an exaggerated view.)
The mean can be remembered as the
Centre of gravity or the balancing point of the data