Exploratory Data Analysis Flashcards
Exploratory Data Analysis
An approach or philosophy for data analysis that employs a variety of techniques (mostly graphical) to maximize insight into a data set; uncover underlying structure; extract important variables; detect outliers and anomalies; test underlying assumptions; develop parsimonious models; and determine optimal factor settings. It works with a set of techniques, and the flexibility to respond to the patterns revealed by successive iterations of the discovery process is an important attribute: the analyst is free to take many paths in revealing mysteries in the data. It emphasizes visual representations and graphical techniques over summary statistics. It is a philosophy about how we dissect a data set, what we look for, how we look, and how we interpret. EDA heavily uses the collection of techniques that we call "statistical graphics", but it is not identical to statistical graphics.
EDA and Visualization
- Exploratory Data Analysis (EDA) and visualization are very important steps in any analysis task.
- Get to know your data!
◦ distributions (symmetric, normal, skewed)
◦ data quality problems
◦ outliers
◦ correlations and inter-relationships
◦ subsets of interest
◦ suggest functional relationships
Previously Designed Techniques for Displaying Data
◦ frequency tables
◦ bar charts (histograms)
◦ pie charts
◦ stem and leaf displays
◦ boxplots
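Two of these displays can be sketched numerically: a frequency table, and the five-number summary that a boxplot draws. This is a minimal illustration using only the Python standard library. The helper names `frequency_table` and `five_number_summary` and the sample data are made up for this card, and the "inclusive" quartile method is only one of several conventions.

```python
from collections import Counter
import statistics


def frequency_table(values):
    """Return (value, count, proportion) rows, most frequent first."""
    counts = Counter(values)
    total = len(values)
    return [(value, count, count / total) for value, count in counts.most_common()]


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the numbers a boxplot displays."""
    # Assumption: quartiles use the "inclusive" convention; other
    # conventions can give slightly different Q1 and Q3.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


# Hypothetical sample data for illustration only.
colors = ["red", "blue", "red", "green", "red", "blue"]
table = frequency_table(colors)
for value, count, share in table:
    print(f"{value:<6} {count:>3} {share:.2f}")

summary = five_number_summary([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(summary)
```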
More detailed explanation of EDA
“Exploratory Data Analysis refers to the critical process of performing initial investigations on data so as to discover patterns, to spot anomalies, to test hypothesis and to check assumptions with the help of summary statistics and graphical representations.”
History of EDA
The seminal work in EDA is Exploratory Data Analysis, Tukey (1977). Over the years it has benefitted from other noteworthy publications such as Data Analysis and Regression, Mosteller and Tukey (1977); Interactive Data Analysis, Hoaglin (1977); and The ABC's of EDA, Velleman and Hoaglin (1981). It has gained a large following as "the" way to analyze a data set.
EDA Techniques
Most are graphical in nature, with a few quantitative techniques.
The reason for the heavy reliance on graphics is that the main role of EDA is to explore open-mindedly, and graphics give the analyst unparalleled power to do so: they entice the data to reveal its structural secrets and keep the analyst ready to gain some new, often unsuspected, insight into the data. This power comes from combining graphics with the natural pattern-recognition capabilities that we all possess.
Proportions
The proportion of elements in a collection that belong to a given category is the number of elements in that category divided by the total number of elements in the collection.
Percent
Percent means “per hundred”, “by the hundred”, or “out of a hundred”. A proportion can be converted to a percentage by multiplying it by 100.
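As a small worked example of the proportion and percent definitions, the sketch below counts category members, divides by the collection size, and multiplies by 100. The function names `proportion` and `to_percent` and the `pets` list are hypothetical, chosen only for illustration.

```python
def proportion(items, category):
    """Number of items in `category` divided by the total number of items."""
    items = list(items)
    if not items:
        raise ValueError("proportion is undefined for an empty collection")
    in_category = sum(1 for item in items if item == category)
    return in_category / len(items)


def to_percent(p):
    """Convert a proportion to a percentage by multiplying by 100."""
    return p * 100


# Hypothetical collection: 2 of the 4 elements are "cat".
pets = ["cat", "dog", "cat", "bird"]
cat_share = proportion(pets, "cat")
print(cat_share)              # 0.5
print(to_percent(cat_share))  # 50.0
```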
Ratio
The ratio of a number x to another number y expresses the size of one measure x with respect to the size of another measure y.
• It is written as x:y and is read as “x is to y”.
• When the measure x is divided by the measure y, the relationship that x bears to y is then expressed as a ratio to one.
• The measure y in the denominator is called the base.
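The two ways of writing a ratio can be sketched as follows: dividing x by the base y gives the "ratio to one", and reducing the pair to lowest terms gives the x:y form. The function names and the 30:20 example are assumptions made for illustration, and `simplest_ratio` assumes integer inputs.

```python
from fractions import Fraction


def ratio_to_one(x, y):
    """Express x relative to the base y as a ratio to one (x / y)."""
    if y == 0:
        raise ValueError("the base y must be nonzero")
    return x / y


def simplest_ratio(x, y):
    """Reduce integers x and y to lowest terms for the x:y notation."""
    reduced = Fraction(x, y)
    return reduced.numerator, reduced.denominator


# Hypothetical measures: 30 to 20.
print(ratio_to_one(30, 20))            # 1.5, read as "1.5 to 1"
a, b = simplest_ratio(30, 20)
print(f"{a}:{b}")                      # 3:2, read as "3 is to 2"
```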
Percent Change
Percent change = (new amount - original amount) / original amount × 100. When the new amount is less than the original amount, the numerator (the number on top) is negative and the result is a percent decrease; when the new amount is greater, the percent change is positive and is called a percent increase.
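A minimal sketch of the calculation, assuming the usual definition (new amount - original amount) / original amount × 100. The function name `percent_change` and the sample amounts are illustrative only.

```python
def percent_change(original, new):
    """Return the percent change from `original` to `new`.

    Negative results are percent decreases, positive results are increases.
    """
    if original == 0:
        raise ValueError("percent change is undefined when the original amount is 0")
    return (new - original) / original * 100


# Hypothetical amounts.
print(percent_change(50, 75))    # 50.0  (percent increase)
print(percent_change(200, 150))  # -25.0 (percent decrease)
```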