Exploratory Data Analysis Flashcards
What is exploratory data analysis (EDA)?
Using various tools to discover patterns in data
Not about testing or proving hypotheses
Inductive philosophy about data
EDA according to Behrens (1997)
Inductive, bottom up & data driven process
1) understanding what’s going on with data
2) central use of graphic/visual representations
3) tentative model building & hypothesis generation (not hypothesis testing)
4) robust methods (less influenced by bias, outliers or specific scales used) & subset analyses e.g. post-hocs with stats correlations (follow ups, not planned)
5) skepticism & flexibility in methods
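A minimal sketch of what "robust" means in point 4, assuming a small set of hypothetical scores (the numbers are made up for illustration): adding one extreme value moves the mean far more than the median.

```python
import statistics

scores = [42, 45, 47, 51, 53]   # hypothetical scores
with_outlier = scores + [98]    # same scores plus one extreme value

# How far does each summary move when the outlier is added?
mean_shift = statistics.mean(with_outlier) - statistics.mean(scores)
median_shift = statistics.median(with_outlier) - statistics.median(scores)

print(f"mean moves by {mean_shift:.1f}, median moves by {median_shift:.1f}")
```

The median is the less influenced of the two, which is why EDA tends to favour median and quartile based summaries.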
Tukey: EDA, CDA & in-between
Researchers work in either exploratory, rough confirmatory or confirmatory mindsets
In a strictly confirmatory (CDA) mindset you would never run post-hocs
Potential problems with EDA
You can often find more or less anything you’re looking for in a dataset (especially if it is big/complex), and it is then easy to claim you were looking for that all along
Is empiricism any use if we don’t have any underpinning theory or explanation of the data?
Techniques & hallmarks of EDA
Representing data visually/graphically
Trying to avoid assumptions about your data
Paying attention to outliers
How is EDA useful to us?
Core techniques are very good practice for looking at a dataset: checking for patterns to guide later analysis, examining distributions & qualities of variables
Easy to forget that the stats we use are full of assumptions; stats are often applied blindly, and checking & visualising data is a great way to make violated assumptions jump out at you
Useful when exploring new topics or new measures
Visualising data for EDA; histograms & stem & leaf plots
Help to identify shape of distribution (skew, kurtosis, spread or variation in scores)
Help to identify unusual scores
Show you when something obvious is wrong
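A rough sketch of a text stem & leaf plot, assuming non-negative integer scores split into a tens digit (stem) and a units digit (leaf). The function names `stem_and_leaf` and `render` and the scores are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

def stem_and_leaf(scores):
    """Group integer scores by tens digit (stem) and collect units digits (leaves)."""
    plot = defaultdict(list)
    for score in sorted(scores):
        plot[score // 10].append(score % 10)
    return dict(plot)

def render(plot):
    """Draw one row per stem, including empty stems so gaps stay visible."""
    lines = []
    for stem in range(min(plot), max(plot) + 1):
        leaves = "".join(str(leaf) for leaf in plot.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}")
    return "\n".join(lines)

# Hypothetical test scores; 98 is deliberately unusual.
scores = [42, 45, 47, 51, 53, 53, 56, 58, 61, 64, 98]
plot = stem_and_leaf(scores)
print(render(plot))
```

Empty stems are printed on purpose: the gap between the 60s and the lone score in the 90s is what makes the unusual score stand out.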
Visualising data for EDA; box & whisker plots
Box shows IQR & whiskers show full range of data (besides outliers)
Robust methods (quartiles/median)
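A sketch of the numbers behind a box & whisker plot. Two assumptions not stated in the card: outliers are flagged with the common 1.5 × IQR rule, and quartiles use the "inclusive" method of Python's `statistics.quantiles` (other software may compute quartiles slightly differently). The function name `box_summary` and the scores are hypothetical.

```python
import statistics

def box_summary(scores, fence=1.5):
    """Return the values a box & whisker plot would draw."""
    q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - fence * iqr, q3 + fence * iqr
    inside = [s for s in scores if low_fence <= s <= high_fence]
    outliers = [s for s in scores if s < low_fence or s > high_fence]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "whiskers": (min(inside), max(inside)),  # range of data besides outliers
        "outliers": outliers,
    }

# Hypothetical scores with one extreme value.
scores = [42, 45, 47, 51, 53, 53, 56, 58, 61, 64, 98]
summary = box_summary(scores)
print(summary)
```

Because the box is built from the median and quartiles, the extreme score is flagged as an outlier without stretching the box itself.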
Visualising data for EDA; error bar charts
Used to compare groups or samples, bar usually shows mean score
Error bar displays precision of the mean using one of:
1) standard deviation (rare, most sensible when exploring distribution of dataset rather than comparisons)
2) confidence interval (often used as they line up well with significance testing; as a rough guide, if error bars overlap there is likely no significant difference)
3) standard error of the mean (used a lot as it gives the smallest error bars, so the error looks smaller & a significant difference looks present)
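The three options above can be compared on one group with a short sketch. Assumptions: the scores are hypothetical, and the 95% confidence interval uses the normal approximation (z = 1.96) rather than a t critical value, so it will be slightly too narrow for small samples.

```python
import math
import statistics

def error_bars(scores, z=1.96):
    """Half-widths of the three common error bars around a group mean."""
    n = len(scores)
    sd = statistics.stdev(scores)   # sample standard deviation
    sem = sd / math.sqrt(n)         # standard error of the mean
    return {
        "mean": statistics.mean(scores),
        "sd": sd,
        "sem": sem,
        "ci95": z * sem,            # approximate 95% CI half-width
    }

# Hypothetical scores for one group of 10 participants.
group = [52, 55, 61, 64, 70, 73, 58, 66, 60, 69]
bars = error_bars(group)
print(bars)
```

For any group of four or more scores the ordering is SEM < 95% CI < SD, which is why the same data can look more or less precise depending on which bar is drawn. A chart should always say which one it uses.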
Visualising data for EDA; Scatterplots
Plot 2 or more variables against each other
More variables= more dimensions
Lets you see potential relationships between variables (potential correlations & whether data fulfil assumptions for linear analyses)
Simple scatterplot= 1 group of participants
Grouped scatterplot= different groups in data (one colour per group)
3D grouped scatterplot= 3 continuous variables & at least 2 groups
Matrix scatterplot= grid of scatterplots looking at paired relationships of multiple variables in dataset, useful when going in blind on new set of measures
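A numeric companion to the matrix scatterplot, as a sketch: one Pearson correlation per pair of variables, i.e. one number per panel of the grid. The variable names and values are hypothetical, and `pearson_r` is a hand-written helper rather than a library call.

```python
import math
from itertools import combinations

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical dataset: three measures from the same six participants.
data = {
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "exam_score": [52, 55, 61, 64, 70, 73],
    "errors_made": [9, 8, 6, 5, 3, 2],
}

# One correlation per pair of variables, like one panel per pair in the grid.
pairs = {(a, b): pearson_r(data[a], data[b]) for a, b in combinations(data, 2)}
for (a, b), r in pairs.items():
    print(f"{a} vs {b}: r = {r:.2f}")
```

The numbers do not replace the plots: a correlation only summarises a linear relationship, so the scatterplots are still needed to check that assumption.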
Presenting data clearly
Extremely important for conveying ideas to others