L2: Exploratory Data Analysis Flashcards
After this deck you should be able to:
- differentiate discrete and continuous variables
- understand the basic statistics used to characterise distributions
- produce proper exploratory data analysis on different types of data via different R packages
What are the differences between quantitative and qualitative variables?
Quantitative: numeric values that measure or count something. Can be continuous or discrete.
Qualitative: descriptive values such as categories or types. Can be ordinal (logically ordered) or nominal (no inherent order).
How is variance calculated?
Variance is a measure of spread: how far the data deviates from the mean, on average, in squared units.
Sample variance: Var = 1/(n-1) * SUM_{i=1..n} (x_i - mean)^2
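As a quick check of the formula, here is a minimal Python sketch that computes the sample variance by hand and compares it with the standard library's `statistics.variance`. The data values and the helper name `sample_variance` are made up for illustration. R's `var()` uses the same n-1 divisor.

```python
import statistics

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance needs at least two values")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up illustration data, mean = 5
manual = sample_variance(data)          # squared deviations sum to 32, so 32 / 7
library = statistics.variance(data)     # same n - 1 definition
print(manual, library)
```

Dividing by n instead of n-1 would give the population variance (`statistics.pvariance`), which is slightly smaller for the same data.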
What is the IQR?
Interquartile range: the distance between the third and first quartiles (IQR = Q3 - Q1).
It represents the middle 50% of the data.
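A small Python sketch of the IQR calculation, assuming made-up data and the "inclusive" quantile method from the standard library. Quantile conventions differ between tools, so other software may report slightly different quartiles for the same data.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # made-up illustration data

# quantiles(..., n=4) returns the three cut points Q1, median, Q3.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
print(q1, median, q3, iqr)
```

Because the IQR ignores everything outside the middle half, it is far less sensitive to extreme values than the variance.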
What are some single variable visualisations?
Numerical:
Histogram
Boxplot
Categorical:
Pie chart
Bar plot
What are some two variable visualisations?
Numerical:
Scatter plot
Line plot
Categorical:
Segmented bar plot
Mosaic plot
When should histograms be used?
When we want to observe the shape of a numerical variable's distribution.
We can see whether the data is left- or right-skewed, and whether it is unimodal, bimodal, multimodal or uniform.
A distribution has the following order from smallest to largest: Mode, Median, Mean.
What kind of skew does it have?
This is a positively skewed distribution, also known as a right-skewed distribution (long right tail).
A distribution has an order of Mean, Median, Mode in increasing value.
What kind of skew does the distribution have?
This is a negatively skewed distribution, also known as a left-skewed distribution (long left tail).
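To see the ordering on real numbers, here is a hedged Python sketch using a made-up sample with a long right tail. The sign of the third central moment is used as a simple skew indicator. The mean/median/mode ordering is a rule of thumb that holds for typical unimodal data, not a guarantee.

```python
import statistics

data = [1, 2, 2, 2, 3, 3, 4, 5, 9, 14]  # made-up sample with a long right tail

mode = statistics.mode(data)
median = statistics.median(data)
mean = statistics.mean(data)

# Third central moment: positive for a right (positive) skew,
# negative for a left (negative) skew.
third_moment = sum((x - mean) ** 3 for x in data) / len(data)

skew = "right (positive)" if third_moment > 0 else "left (negative)"
print(mode, median, mean, skew)
```

The few large values drag the mean above the median, while the mode stays at the peak: Mode < Median < Mean goes with a right skew. Mirroring the data reverses both the ordering and the sign.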
Take the time to draw out the components of a boxplot
Upper Hinge
Upper Quartile
IQR
Median
Lower Quartile
Lower Hinge
Outliers
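The components can also be computed rather than drawn. The Python sketch below rests on two assumptions that the card does not state: the upper and lower hinges are taken to be the whisker ends, and outliers are flagged with the common 1.5 * IQR rule. The function name `boxplot_components` and the data are hypothetical.

```python
import statistics

def boxplot_components(values, whisker_factor=1.5):
    """Return the pieces a boxplot draws (1.5 * IQR fences are an assumed convention)."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker_factor * iqr
    high_fence = q3 + whisker_factor * iqr
    # Whiskers extend to the most extreme points that are still inside the fences.
    inside = [x for x in data if low_fence <= x <= high_fence]
    return {
        "lower_whisker": min(inside),
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "upper_whisker": max(inside),
        "outliers": [x for x in data if x < low_fence or x > high_fence],
    }

parts = boxplot_components([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])  # made-up data
print(parts)
```

With this data the value 30 lies beyond the upper fence, so it is drawn as a separate point and the upper whisker stops at 9.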
Where does 50% of the data fall, in terms of the quartiles?
By definition, 50% of the data lies between Q1 and Q3 (the span of the IQR).
A widely used plotting style:
- Has two numerical variables
- Ability to reveal linear/non-linear relationships
- Shows correlation between variables
- Shows presence of extreme outliers
What kind of plot is it?
Scatterplot (shows all individual data points)
How would you calculate the covariance of X and Y distributions?
cov(X,Y) = E[XY] - E[X]E[Y]
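The identity can be checked numerically. This Python sketch treats E[.] as a plain average over made-up paired data, which gives the population covariance. Library routines often report the sample version instead, which divides by n-1 (R's `cov()` does this). The helper names are illustrative.

```python
def population_covariance(xs, ys):
    """cov(X, Y) = E[XY] - E[X]E[Y], with E[.] taken as the plain average."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    mean_xy = sum(x * y for x, y in zip(xs, ys)) / n
    return mean_xy - mean_x * mean_y

def sample_covariance(xs, ys):
    """Same quantity rescaled to the n - 1 divisor used for samples."""
    n = len(xs)
    return population_covariance(xs, ys) * n / (n - 1)

xs = [1, 2, 3, 4, 5]  # made-up paired data
ys = [2, 4, 5, 4, 5]
pop_cov = population_covariance(xs, ys)   # 13.2 - 3 * 4
samp_cov = sample_covariance(xs, ys)
print(pop_cov, samp_cov)
```

A positive value means the two variables tend to move in the same direction. Its size depends on the units of X and Y, which is why correlation is usually easier to interpret.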
If two variables have a correlation that is close to 0, what might we assume?
The two variables have little to no linear relationship. They may still be related in a non-linear way, so it is worth checking a scatter plot.
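A near-zero correlation only rules out a straight-line trend. The Python sketch below uses made-up data where y is completely determined by x (y = x^2), yet the Pearson correlation is zero. The helper name `pearson_r` is illustrative.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]      # perfectly dependent on x, but not linearly
r_curve = pearson_r(xs, ys)

ys_line = [2 * x + 1 for x in xs]  # a straight line for comparison
r_line = pearson_r(xs, ys_line)
print(r_curve, r_line)
```

The symmetric U-shape cancels out the positive and negative contributions, so the correlation is zero even though the relationship is exact.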
If the data points on a Normal QQ plot do not follow the 1:1 reference line, what does this indicate?
The data does not follow a normal distribution.
It may be worth comparing the data against the quantiles of a different theoretical distribution (e.g. a uniform distribution).
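The points of a Normal QQ plot can be sketched without any plotting library. This Python example pairs each sorted observation with the quantile of a normal distribution fitted to the sample. The plotting positions (i - 0.5) / n are one common convention and are an assumption here, as are the data and the function name `normal_qq_points`.

```python
import statistics

def normal_qq_points(sample):
    """Return (theoretical quantile, observed value) pairs for a Normal QQ plot."""
    data = sorted(sample)
    n = len(data)
    # Normal distribution with the sample's own mean and standard deviation.
    fitted = statistics.NormalDist(statistics.mean(data), statistics.stdev(data))
    points = []
    for i, observed in enumerate(data, start=1):
        p = (i - 0.5) / n  # assumed plotting position
        points.append((fitted.inv_cdf(p), observed))
    return points

sample = [4.1, 4.8, 5.0, 5.2, 5.9]  # made-up illustration data
points = normal_qq_points(sample)

# Large gaps between the two coordinates mean the point sits far from the 1:1 line.
max_gap = max(abs(theoretical - observed) for theoretical, observed in points)
print(points, max_gap)
```

Plotting these pairs as a scatter plot gives the QQ plot. Swapping `NormalDist` for the quantile function of another distribution gives the comparison against a different shape.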