R - 7. Exploratory Data Analysis Flashcards
What does EDA stand for?
Exploratory data analysis.
What is EDA?
The practice of using visualisation and transformation to explore your data in a systematic way.
EDA is an iterative cycle. You:
- Generate questions about your data.
- Search for answers by visualising, transforming, and modelling your data.
- Use what you learn to refine your questions and/or generate new questions.
There are no strict rules. Keep asking questions until you get to the core of the dataset.
What two types of questions will always be useful for making discoveries within your data?
You can loosely word these questions as:
What type of variation occurs within my variables?
What type of covariation occurs between my variables?
What is a variable?
A variable is a quantity, quality, or property that you can measure.
What is a value?
A value is the state of a variable when you measure it. The value of a variable may change from measurement to measurement.
What is an observation?
An observation is a set of measurements made under similar conditions (you usually make all of the measurements in an observation at the same time and on the same object). An observation will contain several values, each associated with a different variable. An observation is sometimes called a data point.
What is tabular data?
Tabular data is a set of values, each associated with a variable and an observation. Tabular data is tidy if each value is placed in its own “cell”, each variable in its own column, and each observation in its own row.
What is variation?
Variation is the tendency of the values of a variable to change from measurement to measurement. If you measure any continuous variable twice, you will get two different results. This is true even if you measure quantities that are constant, like the speed of light, because each of your measurements will include a small amount of error. Categorical variables can also vary if you measure across different subjects (e.g. the eye colors of different people). Every variable has its own pattern of variation, which can reveal interesting information. The best way to understand that pattern is to visualise the distribution of the variable’s values.
What is a categorical variable?
A variable is categorical if it can only take one of a small set of values.
What is a continuous variable?
A variable is continuous if it can take any of an infinite set of ordered values.
What do you use to examine the distribution of a continuous variable?
A histogram.
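As a minimal sketch, assuming the ggplot2 package and its bundled diamonds dataset (neither is named on the card, and the bin width of 0.5 is an arbitrary illustrative choice), a histogram of a continuous variable might look like this:

```r
library(ggplot2)

# Histogram of a continuous variable: carat from the diamonds dataset
# bundled with ggplot2 (an assumed example dataset).
# binwidth sets the width of each bin, in units of the x variable.
p_hist <- ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.5)

print(p_hist)
```

Trying several bin widths is usually worthwhile, because different widths can reveal different patterns.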
What do you use to examine the distribution of a categorical variable?
A bar chart.
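A minimal sketch, again assuming ggplot2 and its diamonds dataset as a stand-in for your own data, with cut as the categorical variable:

```r
library(ggplot2)

# Bar chart of a categorical variable: cut from the diamonds dataset
# (an assumed example dataset). Each bar's height is the number of
# observations in that category.
p_bar <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

print(p_bar)

# The same counts computed directly, without plotting.
cut_counts <- table(diamonds$cut)
print(cut_counts)
```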
What is geom_freqpoly() and what is it used for?
geom_freqpoly() performs the same calculation as geom_histogram(), but displays the counts with lines instead of bars.
Use it instead of geom_histogram() when you want to overlay multiple histograms in the same plot. It’s much easier to understand overlapping lines than overlapping bars.
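A minimal sketch of overlaying several distributions, assuming ggplot2 and its diamonds dataset; the carat < 3 filter and the bin width of 0.1 are illustrative choices, not part of the card:

```r
library(ggplot2)

# Keep only the smaller diamonds so the bulk of the data is visible
# (assumed cut-off, chosen for illustration).
smaller <- subset(diamonds, carat < 3)

# One frequency polygon per level of cut, distinguished by colour.
p_freq <- ggplot(data = smaller, mapping = aes(x = carat, colour = cut)) +
  geom_freqpoly(binwidth = 0.1)

print(p_freq)
```

Swapping geom_freqpoly() for geom_histogram() here would stack the bars, which makes the individual distributions much harder to compare.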
What do you do with unusual values that don’t make sense?
Zoom in on them and figure out what they represent and whether the data makes sense. If you are sure that the data is wrong, you can replace the unusual values with missing values, but you must be careful.
If you replace them, use mutate() to replace the variable with a modified copy.
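A minimal sketch of that replacement, assuming dplyr and the diamonds dataset from ggplot2. Treating y values below 3 or above 20 as implausible is an illustrative assumption, and diamonds2 is a hypothetical name:

```r
library(ggplot2)  # provides the diamonds dataset (assumed example data)
library(dplyr)

# Replace implausible y values with NA and leave every other value as is.
# The cut-offs 3 and 20 are assumed for illustration only.
diamonds2 <- mutate(diamonds, y = ifelse(y < 3 | y > 20, NA, y))

# Count how many values were replaced.
n_replaced <- sum(is.na(diamonds2$y))
print(n_replaced)
```

Replacing values this way keeps every row, so the other measurements of the affected observations are not lost.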
What happens with missing values?
Missing values should never silently go missing. It’s not obvious where you should plot missing values, so ggplot2 doesn’t include them in the plot, but it does warn that they’ve been removed.
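A minimal sketch of this behaviour, assuming ggplot2, dplyr, and the diamonds dataset, with some y values set to NA using illustrative cut-offs. Setting na.rm = TRUE in the geom is one way to drop the missing values without the warning:

```r
library(ggplot2)
library(dplyr)

# Introduce missing values (the cut-offs 3 and 20 are assumed for illustration).
diamonds2 <- mutate(diamonds, y = ifelse(y < 3 | y > 20, NA, y))

# Drawing this plot drops the rows with missing y and emits a warning.
p_warn <- ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
  geom_point()

# na.rm = TRUE drops the same rows but suppresses the warning.
p_quiet <- ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
  geom_point(na.rm = TRUE)

print(p_quiet)
```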