1. Summary Statistics Flashcards
What is exploratory data analysis (EDA)?
The part of statistics concerned with taking a first look at some data
What are two aspects of EDA?
- summary statistics
- data visualisation
What is meant by summary statistics?
Calculating numbers that briefly summarise the data
i.e. the central values of the data, how spread out the data is, or the relationship between two variables
What is meant by data visualisation?
drawing a picture based on the data to show the shape (centrality and spread) of data, or the relationship between two variables
What important questions should you ask about data before calculating summary statistics or drawing a plot?
- What is the data? “What variables were measured and how, how many datapoints, etc.”
- How was the data collected? “Sample or whole population?”
- Are there any outliers?
- Are there any ethical or privacy issues? “Should the data be confidential?”
Two types of summary statistic
- Statistics of centrality, which tell us where the middle of the data is
- Statistics of spread, which tell us how far the data typically spreads out from the middle
What is the mode of a dataset x?
the most common value of x(i)
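A minimal Python sketch of finding the mode. The function name `modes` and the sample data are illustrative assumptions, not part of the course notes. It returns a list because a dataset can have several equally common values, and it assumes the data is non-empty.

```python
from collections import Counter

def modes(x):
    """Return every value that occurs most often in x (ties are all kept)."""
    counts = Counter(x)            # value -> number of occurrences
    top = max(counts.values())     # the highest count seen
    return [value for value, count in counts.items() if count == top]

x = [2, 3, 3, 5, 7, 3, 5]
print(modes(x))  # [3]
```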
What is the median of a dataset x?
Once the data is sorted into increasing order, it is the central value in the ordered list
- if n is odd, this is x((n+1)/2)
- if n is even, this is (1/2)(x(n/2) + x((n/2)+1))
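The two cases above can be sketched in Python as follows. The function name `median` and the sample data are illustrative assumptions. The formulas count positions from 1, while Python lists count from 0, so each index is shifted down by one.

```python
def median(x):
    """Median of a non-empty dataset, following the odd/even cases above."""
    xs = sorted(x)                 # order the data first
    n = len(xs)
    if n % 2 == 1:
        # n odd: position (n+1)/2, i.e. index (n+1)/2 - 1
        return xs[(n + 1) // 2 - 1]
    # n even: average of positions n/2 and n/2 + 1
    return 0.5 * (xs[n // 2 - 1] + xs[n // 2])

print(median([5, 1, 3]))     # 3
print(median([4, 1, 3, 2]))  # 2.5
```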
What is the mean of a dataset x?
x(bar) = (1/n)(x1 + x2 + … + xn)
= (1/n) sum(x(i), i = 1 to n)
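A minimal Python sketch of the mean formula. The function name `mean` and the sample data are illustrative assumptions, and the data is assumed to be non-empty.

```python
def mean(x):
    """Add up all n values and divide by n."""
    n = len(x)
    return sum(x) / n

print(mean([1, 2, 3, 4]))  # 2.5
```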
What quantile is the median?
q(1/2)
What quantile is the maximum?
q(1)
What quantile is the minimum?
q(0)
What is the lower quartile?
q(1/4)
What is the upper quartile?
q(3/4)
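The quantile cards can be checked with a short Python sketch. These cards do not say how q(p) is computed when p falls between two datapoints, so the helper below assumes linear interpolation between neighbouring sorted values, which is one common convention among several. The function name `quantile` and the sample data are hypothetical.

```python
import math

def quantile(x, p):
    """q(p) for 0 <= p <= 1, ASSUMING linear interpolation between sorted values."""
    xs = sorted(x)
    pos = (len(xs) - 1) * p        # fractional index into the sorted data
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo                # how far pos sits between lo and hi
    return xs[lo] + frac * (xs[hi] - xs[lo])

x = [1, 2, 3, 4, 5]
print(quantile(x, 0))     # minimum
print(quantile(x, 0.25))  # lower quartile
print(quantile(x, 0.5))   # median
print(quantile(x, 0.75))  # upper quartile
print(quantile(x, 1))     # maximum
```

Under this convention q(1/2) agrees with the odd and even cases of the median, and q(0) and q(1) are the smallest and largest datapoints.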
What is meant by the number of distinct observations?
The number of different datapoints we have after removing any repeats
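A minimal Python sketch of counting distinct observations. The function name `n_distinct` and the sample data are illustrative assumptions. Building a set drops the repeats, so its size is the number of distinct values.

```python
def n_distinct(x):
    """Number of different values in x once repeats are removed."""
    return len(set(x))

x = [2, 3, 3, 5, 7, 3, 5]
print(len(x))         # 7 observations in total
print(n_distinct(x))  # 4 distinct observations: 2, 3, 5, 7
```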