Exploratory Data Analysis Flashcards
What is another name for Exploratory Data Analysis?
Statisticians call this task exploratory data analysis, or EDA for short.
What does it mean that EDA is an iterative cycle?
You:
Generate questions about your data.
Search for answers by visualising, transforming, and modelling your data.
Use what you learn to refine your questions and/or generate new questions.
EDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind. What does this mean?
During the initial phases of EDA you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends. As your exploration continues, you will home in on a few particularly productive areas that you’ll eventually write up and communicate to others.
EDA is fundamentally a creative process. Discuss.
EDA is fundamentally a creative process. And like most creative processes, the key to asking quality questions is to generate a large quantity of questions. It is difficult to ask revealing questions at the start of your analysis because you do not know what insights are contained in your dataset. On the other hand, each new question that you ask will expose you to a new aspect of your data and increase your chance of making a discovery. You can quickly drill down into the most interesting parts of your data—and develop a set of thought-provoking questions—if you follow up each question with a new question based on what you find.
There is no rule about which questions you should ask to guide your research. However, two types of questions will always be useful for making discoveries within your data. You can loosely word these questions as:
What type of variation occurs within my variables?
What type of covariation occurs between my variables?
What is a variable?
A variable is a quantity, quality, or property that you can measure.
What is a value?
A value is the state of a variable when you measure it. The value of a variable may change from measurement to measurement.
What is an observation?
An observation is a set of measurements made under similar conditions (you usually make all of the measurements in an observation at the same time and on the same object). An observation will contain several values, each associated with a different variable. I’ll sometimes refer to an observation as a data point.
What is tabular data?
Tabular data is a set of values, each associated with a variable and an observation.
When is tabular data tidy?
Tabular data is tidy if each value is placed in its own “cell”, each variable in its own column, and each observation in its own row.
In real life, most data isn’t tidy.
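The tidy layout described above can be sketched with a small made-up table. The names `measurements`, `subject`, `height_cm`, and `eye_color` are hypothetical and not from the original notes; only base R is assumed.

```r
# Hypothetical toy data: each row is one observation (a person),
# each column is one variable, and each cell holds exactly one value.
measurements <- data.frame(
  subject   = c("A", "B", "C"),
  height_cm = c(170.2, 165.8, 181.0),
  eye_color = c("brown", "blue", "green")
)

nrow(measurements)         # number of observations
names(measurements)        # the variables
measurements$height_cm[2]  # one value: the height of the second subject
```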
What is variation?
Variation is the tendency of the values of a variable to change from measurement to measurement. You can see variation easily in real life; if you measure any continuous variable twice, you will get two different results. This is true even if you measure quantities that are constant, like the speed of light. Each of your measurements will include a small amount of error that varies from measurement to measurement. Categorical variables can also vary if you measure across different subjects (e.g. the eye colors of different people), or different times (e.g. the energy levels of an electron at different moments). Every variable has its own pattern of variation, which can reveal interesting information. The best way to understand that pattern is to visualise the distribution of the variable’s values.
How do you visualise the distribution of a variable?
It depends on whether the variable is categorical or continuous.
How do you visualise a categorical variable?
A variable is categorical if it can only take one of a small set of values. In R, categorical variables are usually saved as factors or character vectors.
To examine the distribution of a categorical variable, use a bar chart:
e.g. ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))
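The bar heights in that chart are simply the number of observations per category. A minimal sketch of computing them by hand, assuming the ggplot2 package (which ships the `diamonds` dataset) and the dplyr package are installed:

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)    # provides count() and the %>% pipe

# Count the observations in each level of `cut`;
# these counts correspond to the bar heights in the bar chart.
cut_counts <- diamonds %>%
  count(cut)

cut_counts
```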
How do you visualise a continuous variable?
A variable is continuous if it can take any of an infinite set of ordered values. Numbers and date-times are two examples of continuous variables. To examine the distribution of a continuous variable, use a histogram:
e.g. ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.5)
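A histogram divides the x-axis into equal-width bins and counts the observations in each. As a rough sketch, the same counts can be computed by hand with `cut_width()` from ggplot2 and `count()` from dplyr (both packages assumed installed):

```r
library(ggplot2)  # provides diamonds and cut_width()
library(dplyr)    # provides count() and the %>% pipe

# Bin carat into intervals of width 0.5 and count the diamonds in each bin.
carat_bins <- diamonds %>%
  count(cut_width(carat, 0.5))

carat_bins
```

Trying several bin widths is worthwhile, since different widths can reveal different patterns.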
Why is geom_freqpoly() recommended instead of geom_histogram() for overlaid histograms?
If you wish to overlay multiple histograms in the same plot, I recommend using geom_freqpoly() instead of geom_histogram(). geom_freqpoly() performs the same calculation as geom_histogram(), but displays the counts with lines instead of bars. It’s much easier to understand overlapping lines than overlapping bars.
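A minimal sketch of such an overlay, assuming ggplot2 is installed. Mapping `colour` to `cut` and the bin width of 0.1 are illustrative choices, not prescribed by the notes above.

```r
library(ggplot2)

# One frequency polygon per level of `cut`, drawn on the same axes.
p <- ggplot(data = diamonds, mapping = aes(x = carat, colour = cut)) +
  geom_freqpoly(binwidth = 0.1)

p
```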
Typical values in EDA
In both bar charts and histograms, tall bars show the common values of a variable, and shorter bars show less-common values. Places that do not have bars reveal values that were not seen in your data. To turn this information into useful questions, look for anything unexpected:
Which values are the most common? Why?
Which values are rare? Why? Does that match your expectations?
Can you see any unusual patterns? What might explain them?
Unusual values
Outliers are observations that are unusual; data points that don’t seem to fit the pattern. Sometimes outliers are data entry errors; other times outliers suggest important new science. When you have a lot of data, outliers are sometimes difficult to see in a histogram.
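One way to make rare values visible is to zoom in on the small counts of the y-axis. The sketch below assumes ggplot2 and dplyr are installed and uses the `y` variable of `diamonds`. The limits (0 to 50 on the count axis, and 3 and 20 as bounds for `y`) are illustrative assumptions.

```r
library(ggplot2)
library(dplyr)

# coord_cartesian() zooms without discarding data, so the short bars
# that hold the unusual values become visible.
p <- ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
  coord_cartesian(ylim = c(0, 50))

p

# Pull out the unusual rows to inspect them directly.
unusual <- diamonds %>%
  filter(y < 3 | y > 20) %>%
  select(price, x, y, z) %>%
  arrange(y)

unusual
```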
How should you handle unusual values (missing values)?
If you’ve encountered unusual values in your dataset, and simply want to move on to the rest of your analysis, you have two options.
- Drop the entire row with the strange values:
diamonds2 <- diamonds %>%
  filter(between(y, 3, 20))
I don’t recommend this option: one invalid measurement doesn’t mean all the measurements are invalid. Additionally, if you have low-quality data, by the time you’ve applied this approach to every variable you might find that you don’t have any data left!
- Instead, I recommend replacing the unusual values with missing values. The easiest way to do this is to use mutate() to replace the variable with a modified copy. You can use the ifelse() function to replace unusual values with NA:
diamonds2 <- diamonds %>%
  mutate(y = ifelse(y < 3 | y > 20, NA, y))
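A short sketch of checking the result of the replacement, assuming ggplot2 and dplyr are installed. It shows that no rows are lost and that the missing values can be dropped quietly when plotting. The scatterplot of `x` against `y` is an illustrative choice.

```r
library(ggplot2)
library(dplyr)

diamonds2 <- diamonds %>%
  mutate(y = ifelse(y < 3 | y > 20, NA, y))

# How many values were converted to NA?
sum(is.na(diamonds2$y))

# Unlike filter(), this keeps every row.
nrow(diamonds2) == nrow(diamonds)

# na.rm = TRUE removes the missing values without a warning.
p <- ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
  geom_point(na.rm = TRUE)

p
```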
What is “real”: the objects in your environment or your R scripts?
In the long run, you’ll be much better off if you consider your R scripts as “real”.
With your R scripts (and your data files), you can recreate the environment. It’s much harder to recreate your R scripts from your environment! You’ll either have to retype a lot of code from memory (making mistakes all the way) or you’ll have to carefully mine your R history.
There is a great pair of keyboard shortcuts that will work together to make sure you’ve captured the important parts of your code in the editor:
Press Cmd/Ctrl + Shift + F10 to restart the R session in RStudio.
Press Cmd/Ctrl + Shift + S to rerun the current script.
I use this pattern hundreds of times a week.
How can you see your current working directory at a glance?
RStudio shows your current working directory at the top of the console.
How do you print the working directory from code?
Run getwd(). Note that setwd() changes the working directory; it does not print it.
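A minimal sketch of inspecting the working directory from code, using base R only:

```r
# Print the current working directory.
wd <- getwd()
wd

# Relative paths are resolved against it; "." refers to the same directory.
normalizePath(".")

# List what R can see there.
list.files(wd)
```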
What is an RStudio project?
R experts keep all the files associated with a project together: input data, R scripts, analytical results, figures. This is such a wise and common practice that RStudio has built-in support for it via projects.
R Project
In summary, RStudio projects give you a solid workflow that will serve you well in the future:
Create an RStudio project for each data analysis project.
Keep data files there; we’ll talk about loading them into R in data import.
Keep scripts there; edit them, run them in bits or as a whole.
Save your outputs (plots and cleaned data) there.
Only ever use relative paths, not absolute paths.
Everything you need is in one place, and cleanly separated from all the other projects that you are working on.
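The advice on relative paths can be sketched as follows, using base R only. The folder name `output`, the file name `example.csv`, and the use of the built-in `mtcars` data as a stand-in for cleaned data are all hypothetical.

```r
# Hypothetical layout: outputs live in an "output" folder inside the project.
out_dir <- "output"
dir.create(out_dir, showWarnings = FALSE)

# Build a relative path; file.path() inserts the separator for you.
out_file <- file.path(out_dir, "example.csv")

# mtcars is a built-in dataset, used here only as a stand-in.
write.csv(head(mtcars), out_file, row.names = FALSE)

file.exists(out_file)
```

Because the path is relative to the project directory, the script keeps working when the project folder is moved or shared, which an absolute path would not.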