Exploratory Data Analysis Week 4 Flashcards
What are the 6 main goals of exploratory data analysis?
- checking for outliers
- checking assumptions
- checking for data entry errors
- finding patterns not otherwise obvious
- gaining a thorough descriptive analysis of the data
- analysing and dealing with missing data
In a perfectly symmetrical distribution (such as a normal distribution), the mean and median would...
be the same
What would a positively skewed distribution look like?
Tail pointing towards the higher numbers
What would a negatively skewed distribution look like?
Tail pointing towards the lower numbers
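As a rough check on the two skew cards, the sign of a skewness statistic tells you which way the tail points. This is a minimal sketch using the moment-based (population) formula; the sample data and the `skewness` helper are invented for illustration, and statistics packages often report a sample-adjusted value that differs slightly.

```python
import statistics

def skewness(values):
    """Moment-based (population) skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((x - mean) / sd) ** 3 for x in values) / len(values)

# Hypothetical samples, invented for illustration only.
right_tailed = [1, 2, 2, 3, 3, 3, 4, 10]   # long tail towards the higher numbers
left_tailed = [-x for x in right_tailed]   # mirror image: tail towards the lower numbers

print(skewness(right_tailed) > 0)  # True: positively skewed
print(skewness(left_tailed) < 0)   # True: negatively skewed
```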
What kind of information comes from the Explore command?
- central tendency
- variability
- quantitative measures of shape
- confidence intervals
- percentiles
- stem and leaf
- box and whisker
- histograms
- normality
- homogeneity of variance
- skewness and kurtosis
If the distribution is positively skewed, will the mode be higher or lower than the mean?
Lower, because the mode sits at the peak near the lower end, while the long tail towards the higher numbers pulls the mean up
If the distribution is negatively skewed, will the mode be higher or lower than the mean?
Higher, because the mode sits at the peak near the higher end, while the long tail towards the lower numbers pulls the mean down
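The ordering of mode, median and mean in a skewed distribution can be checked on a small example. This is a sketch on made-up, positively skewed scores using Python's standard `statistics` module; the numbers are an assumption chosen only to make the pattern visible.

```python
import statistics

# Hypothetical positively skewed sample: most values are low, a few are high.
scores = [1, 2, 2, 2, 3, 3, 4, 5, 9, 15]

mode = statistics.mode(scores)      # most frequent value, sits at the peak
median = statistics.median(scores)  # middle value, less affected by the tail
mean = statistics.mean(scores)      # pulled towards the long right tail

print(mode, median, mean)  # 2 3.0 4.6
```

Reversing the sign of every score mirrors the distribution, and the order flips to mean < median < mode.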
When might mode be a better estimate of central tendency than the mean?
In cases of extreme skewness
What are three underlying concepts of hypothesis testing?
- finding a statistically significant difference
- rejecting or failing to reject the null hypothesis
- generalising the sample result to the population
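To make the three concepts concrete, here is a minimal sketch of one way to test a difference between two groups: an exact permutation test. This technique is swapped in for illustration and is not necessarily the test used in the course; the scores and the 0.05 significance level are assumptions.

```python
from itertools import combinations
from statistics import mean

# Hypothetical scores for two small groups (invented data).
group_a = [1, 2, 3, 4, 5]
group_b = [6, 7, 8, 9, 10]

observed = abs(mean(group_a) - mean(group_b))

# Under the null hypothesis the group labels are arbitrary, so every way of
# splitting the pooled scores into two groups of this size is equally likely.
pooled = group_a + group_b
n_a = len(group_a)
extreme = 0
total = 0
for idx in combinations(range(len(pooled)), n_a):
    chosen = set(idx)
    a = [pooled[i] for i in chosen]
    b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
    total += 1
    if abs(mean(a) - mean(b)) >= observed:
        extreme += 1

p_value = extreme / total  # share of splits at least as extreme as the data
alpha = 0.05               # conventional significance level, an assumption here
decision = "reject" if p_value < alpha else "fail to reject"
print(total, extreme, decision)  # 252 2 reject
```

A small p-value means a difference this large would rarely arise from the labels alone, which is the basis for generalising from the sample to the population.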
What is a sample?
A subset of the population, from which the data are actually collected
What are two options for dealing with missing data or data entry errors?
- remove the data
- make educated guess about what was intended
What are the two command options for checking for data entry errors (for both continuous/scale and categorical/nominal variables)?
- frequencies (categorical/nominal)
- outliers (continuous/scale)
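The same two checks can be sketched outside a statistics package. This is a hedged illustration: the columns, the valid codes (1 and 2) and the 1.5 x IQR outlier rule are assumptions, and the quartile method used by `statistics.quantiles` may differ slightly from what other software reports.

```python
from collections import Counter
import statistics

# Hypothetical survey columns (invented). Valid gender codes are assumed to be 1 and 2.
gender = [1, 2, 2, 1, 22, 1, 2]
age = [19, 21, 20, 22, 23, 20, 21, 210]

# Frequencies: a code outside the valid set stands out in the frequency table.
freq = Counter(gender)
invalid_codes = sorted(code for code in freq if code not in {1, 2})

# Outliers: flag values more than 1.5 IQRs beyond the quartiles (the usual boxplot rule).
q1, _, q3 = statistics.quantiles(age, n=4)
iqr = q3 - q1
outliers = [x for x in age if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

print(invalid_codes)  # [22]
print(outliers)       # [210]
```

Both flagged values look like keying slips (22 for 2, 210 for 21), which is the kind of error these checks are meant to surface.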
What is normality?
The assumption that your data comes from a population that is normally distributed
What is homogeneity of variance?
The assumption that if your data were divided into groups, the level of variability in the groups would be approximately equal.
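A quick, informal way to look at this assumption is to compare the group variances directly. The sketch below uses invented data and a simple largest-to-smallest variance ratio; the ratio is only a rough indicator, not a formal test such as Levene's test.

```python
import statistics

# Hypothetical scores split into two groups (invented data).
groups = {
    "control": [10, 12, 11, 13, 12, 11],
    "treatment": [5, 18, 9, 20, 2, 16],
}

# Sample variance of each group.
variances = {name: statistics.variance(vals) for name, vals in groups.items()}

# A ratio near 1 is consistent with homogeneity of variance;
# a ratio far above 1 suggests the spreads differ.
ratio = max(variances.values()) / min(variances.values())
print(variances, ratio)
```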
What is a leptokurtic distribution?
The really tall, skinny one: a sharp central peak with heavier tails than a normal distribution (positive kurtosis)
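Kurtosis can be put into numbers in the same way as skewness. This is a minimal sketch using the moment-based (population) formula for excess kurtosis, where a normal distribution scores about 0; the two samples and the `excess_kurtosis` helper are invented for illustration, and software may report a sample-adjusted value.

```python
import statistics

def excess_kurtosis(values):
    """Moment-based excess kurtosis: mean of the fourth-power z-scores, minus 3."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((x - mean) / sd) ** 4 for x in values) / len(values) - 3

# Hypothetical samples, invented for illustration only.
peaked = [-6, -1, 0, 0, 0, 0, 0, 0, 1, 6]  # tall centre with a few extreme values
flat = [-3, -2, -1, 0, 1, 2, 3]            # evenly spread, no peak

print(excess_kurtosis(peaked) > 0)  # True: leptokurtic
print(excess_kurtosis(flat) < 0)    # True: flatter than normal (platykurtic)
```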