Statistics - Summarising and presenting data Flashcards
What is the purpose of statistics?
- To summarise and present the information contained in a data set
- To handle and quantify variation and uncertainty in the data, to help to infer what they tell us about the underlying theory of interest.
What are the 5 main summary measures of any numerical data?
Mean, median, mode, range, and inter-quartile range (IQR)
How do you calculate Mean?
Add all the values together and divide by how many values there are
How do you calculate Median?
The median is the middle value. Arrange all of the values in size order and locate the middle value.
If there are 2 middle values (an even number of values), the median is the average of those two: add them together and divide by 2.
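The mean and median calculations described above can be sketched in Python as follows. This is only an illustration: the function names and the example numbers are made up for this card and are not part of the original material.

```python
def mean(values):
    # Add all the values together and divide by how many there are.
    return sum(values) / len(values)


def median(values):
    # Arrange the values in size order, then locate the middle.
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of values: a single middle value exists.
        return ordered[mid]
    # Even number of values: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


example = [7, 2, 9, 4, 5, 4]      # made-up data
example_mean = mean(example)      # 31 / 6
example_median = median(example)  # middle values are 4 and 5
```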
How do you calculate inter-quartile range (IQR)?
Inter-quartile range (IQR) is the difference between the 75th and 25th percentiles of the data.
When the values are placed in rank order, three quartiles (Q1, Q2, and Q3) divide them into 4 equal parts:
- Q1 / lower quartile / 25%
- Q2 / the median / 50%
- Q3 / upper quartile / 75%
IQR = Q3 - Q1
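As a rough sketch, the quartiles and IQR can be computed with Python's standard statistics module. The data are made up, and the choice of the "inclusive" method is an assumption made for this example: software packages use slightly different conventions for locating quartiles, so results may differ a little from other tools.

```python
from statistics import quantiles

values = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # made-up data

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = quantiles(values, n=4, method="inclusive")

iqr = q3 - q1  # IQR = Q3 - Q1
```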
How do you calculate range?
Range = largest value - smallest value
How do you calculate Mode?
Mode is the number or value which is repeated most often among all of the values.
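Range and mode are simple enough to check by hand, but a short Python sketch may help. The data are made up; multimode is used here (an illustrative choice) because it returns every most-frequent value when there is a tie.

```python
from statistics import multimode

values = [3, 7, 7, 2, 9, 7, 2]  # made-up data

# Range = largest value - smallest value
value_range = max(values) - min(values)

# Mode = the value repeated most often (returned as a list in case of ties)
modes = multimode(values)
```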
What is standard deviation?
Standard deviation is the square root of the variance
Standard deviation (Std. Dev.) = √(variance)
How do you calculate variance?
You can calculate the variance of a dataset by calculating the distance of every value from the mean (value - mean), squaring each distance, adding the squared distances together, and dividing the total by the number of values.
Values below the mean give negative distances, so every distance is squared to make it positive before adding; otherwise the distances would cancel out and sum to zero.
Variance = sum of squared distances / number of values. (For a sample, the usual formula divides by the number of values minus 1 instead.)
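A minimal worked example of the variance and standard deviation calculation, using made-up numbers. The version dividing by the number of values is shown first; the sample version, dividing by one less, is included for comparison.

```python
from math import sqrt

values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data
n = len(values)
mean = sum(values) / n  # 40 / 8 = 5.0

# Square each distance from the mean so negatives do not cancel positives.
squared_distances = [(v - mean) ** 2 for v in values]

variance = sum(squared_distances) / n               # divide by number of values
sample_variance = sum(squared_distances) / (n - 1)  # sample version

std_dev = sqrt(variance)  # standard deviation = square root of variance
```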
STATA can be used to run statistical tests on a dataset once the variables and commands have been entered. TRUE or FALSE?
TRUE
Can STATA statistical software calculate mean, standard deviation, range, mode, median, and variance?
Yes it can, but you should still know how to calculate them all yourself.
When the variable ‘Age’ is selected in STATA, what is the command that should be used to calculate summary measures (Obs/Mean/Std. Dev./Min/Max)?
summarize Age
What command should be used in STATA to obtain more detailed summary measures (quartiles, median etc. rather than just mean/Std. Dev. etc.)?
summarize Age, detail
If data presents in a graph as either positively or negatively skewed (not normally distributed), are the mean and standard deviation appropriate measures?
No, median and inter-quartile range are more appropriate measures for data which is NOT normally distributed.
This is because in skewed data the mean is pulled towards the long tail: it is larger than the median when the data are positively skewed (long tail to the right) and smaller than the median when the data are negatively skewed (long tail to the left).
If data presents as normally distributed (symmetric and bell-shaped, with tails extending equally to the left and right) in a graph, are the mean and standard deviation appropriate measures?
Yes, the mean and standard deviation are appropriate measures for normally distributed data.
In positively skewed data (peak on the left of the graph, long tail to the right), is the mean larger or smaller than the median?
The mean is larger than the median in positively skewed data.
Positively skewed data: mean > median
In negatively skewed data (peak on the right of the graph, long tail to the left), is the mean larger or smaller than the median?
In negatively skewed data the mean is smaller than the median.
Negatively skewed data: mean < median
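The relationship between mean and median in skewed data can be checked on a small made-up example. The two lists below are invented for illustration; the second is simply a mirror image of the first.

```python
from statistics import mean, median

# Positively skewed: most values are small, one large value forms a long right tail.
right_tail = [1, 2, 2, 3, 3, 3, 4, 20]

# Negatively skewed: the mirror image, with a long left tail.
left_tail = [21 - x for x in right_tail]

pos_mean, pos_median = mean(right_tail), median(right_tail)
neg_mean, neg_median = mean(left_tail), median(left_tail)

# The mean is pulled towards the tail; the median barely moves.
positively_skewed_mean_is_larger = pos_mean > pos_median
negatively_skewed_mean_is_smaller = neg_mean < neg_median
```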
Name three main things that presenting data in graphs allows us to see easily.
Graphical representation of data enables us to get a feel for:
1. Typical (central) values and range of values
2. Shape and spread of the distribution of values
3. Interesting patterns and relationships in the data
Name two data-quality problems that graphical displays (graphs) can reveal.
Graphical displays can reveal problems concerning the quality of the data, including:
1. Identifying outlying / erroneous observations
2. Digit preference
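Digit preference means that recorded values end in certain digits (often 0 or 5) more often than expected, usually because of rounding when measurements are written down. One simple way to look for it, sketched here with made-up readings, is to count the final digit of each value.

```python
from collections import Counter

# Made-up whole-number readings (e.g. measurements recorded by hand).
readings = [120, 130, 125, 140, 120, 135, 118, 130, 150, 110]

# Count how often each final digit (0-9) occurs.
last_digit_counts = Counter(value % 10 for value in readings)

# A heavy excess of one or two digits suggests rounding when recording.
most_common_digit, most_common_count = last_digit_counts.most_common(1)[0]
```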
Name three types of graph used in statistical analysis.
- Bar charts
- Histograms
- Line graphs
Name two types of tables used in statistical analysis.
- Frequency tables
- Cross tabulations (contingency tables)
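To show the difference between the two table types, here is a small sketch with made-up records. A frequency table counts the categories of one variable; a cross tabulation counts each combination of two variables. The variable names are hypothetical.

```python
from collections import Counter

# Made-up records: (sex, smoker) for five people.
records = [("F", "yes"), ("M", "no"), ("F", "no"), ("F", "yes"), ("M", "yes")]

# Frequency table: counts for one variable.
sex_frequency = Counter(sex for sex, _ in records)

# Cross tabulation: counts for each combination of the two variables.
sex_by_smoker = Counter(records)
```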
What is the risk of having too few classes within your data set when using a histogram to present data?
If there are too few classes in the data set when using a histogram, it could be difficult to see any interesting patterns when the data is presented.
What is the risk associated with having too many classes within your data set when using a histogram to present data?
If there are too many classes when presenting data in a histogram, there may be only one observation per class as opposed to a group of observations. The number of observations per class should be no fewer than 2.
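The effect of the number of classes can be seen by counting the observations that fall into each class. The helper below is a hypothetical sketch using equal-width classes and made-up data; it is not how any particular software chooses its classes.

```python
def class_counts(values, n_classes):
    # Split the span from smallest to largest value into equal-width classes.
    lowest, highest = min(values), max(values)
    width = (highest - lowest) / n_classes
    counts = [0] * n_classes
    for v in values:
        # The largest value is placed in the last class.
        index = min(int((v - lowest) / width), n_classes - 1)
        counts[index] += 1
    return counts


values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]  # made-up data

too_few = class_counts(values, 2)   # pattern hidden: almost everything in one class
too_many = class_counts(values, 8)  # several classes hold 0 or 1 observations
```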