Exploratory Data Analysis Flashcards
Exploratory data analysis
Getting a feel for the data: making it easier to find mistakes, understand what actually happened, and find outliers.
- Understand and gain insights into the data before selecting analysis techniques.
- Approach data without assumptions, often using visual methods.
We need to get to know the data
- Numeric data distributions (symmetric, normal, skewed etc.)
- Data quality problems
- Find outliers
- Search for correlations and interrelationships
- Identify subsets of interest
- Suggest functional relationships
We can ask questions
- Descriptive stats: “Who is most profitable?”
- Hypothesis Testing: “Is there a difference between the value of these two customers?”
- Classification: “What are the common characteristics of customers?”
- Prediction: “Will this new customer become profitable?”
- We need to answer the question of what models and techniques to use given the problem context, data and underlying assumptions.
Comparison with Hypothesis Testing
- EDA: Open-ended exploration with no or incomplete prior expectations.
- Hypothesis Testing: Tests pre-defined hypotheses.
Systematic Process
- Understand Data Context:
- Who created the dataset, when, and why?
- Size, number of fields, and their meanings.
- Initial Exploration:
- Inspect familiar or interpretable records.
- Compute summary statistics (e.g., mean, min, max, quartiles, outliers).
- Visualization:
- Plot variable distributions (e.g., box plots, time-series).
- Examine relationships via scatterplot matrices.
- Visualize pairwise correlations and group breakdowns (e.g., gender, age).
- Transformations:
- Transform variables as needed to identify patterns and outliers.
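A minimal sketch of this process in Python with pandas; the file name "customers.csv" is a hypothetical stand-in for any tabular dataset:

```python
# Minimal EDA sketch; "customers.csv" is a hypothetical dataset.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("customers.csv")

# Understand data context: size, number of fields, and their types
print(df.shape)
print(df.dtypes)

# Initial exploration: inspect records and compute summary statistics
print(df.head())        # a few interpretable records
print(df.describe())    # mean, min, max, quartiles per numeric column

# Visualization: distributions and pairwise relationships
df.hist(figsize=(10, 6))
pd.plotting.scatter_matrix(df.select_dtypes("number"))
plt.show()
```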
Descriptive Statistics
Quantitatively describe main features of the data. Main data features:
- Measures of central tendency represent a center around which measurements are distributed (mean, median)
- Measures of variability represent the spread of data from the center (standard dev.)
- Measures of relative standing represent the ‘relative position’ of specific measurements in data (quantiles)
The mean
The average. It is badly affected by outliers, which can make it a poor measure of central tendency.
The median
The middle value when values are ranked in order, splitting the data into two halves. AKA the 50th percentile. It is unaffected by outliers, making it a more robust measure of central tendency. In skewed data, the mean lies further towards the skewed tail than the median.
The mode
The most common data point; there may be multiple modes.
Variance
The spread around the mean, computed as the average squared deviation from the mean. The lower the variance, the more consistent the data.
Standard Deviation
The square root of the variance: the spread around the mean in the data's original units. A high standard deviation means increased spread, less consistency and less clustering around the mean.
Quartiles
One of the three values that divide a ranked series of values into four equal parts. The median is the 2nd quartile and divides the data in half.
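A small worked example of these measures on an invented sample with one obvious outlier (250):

```python
# Descriptive statistics on a toy sample; 250 is a deliberate outlier.
import numpy as np
from statistics import mode

data = np.array([2, 3, 3, 4, 5, 6, 7, 250])

print(np.mean(data))     # 35.0 -- pulled far from the bulk of the data
print(np.median(data))   # 4.5  -- unaffected by the outlier
print(mode(data))        # 3    -- the most common value
print(np.var(data))      # spread around the mean, in squared units
print(np.std(data))      # spread in the original units
print(np.quantile(data, [0.25, 0.5, 0.75]))  # Q1, median (Q2), Q3
```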
Common Visualizations
- Histograms/Bar Charts
- Box Plots
- Scatterplots
Histograms/Bar Charts
Used to display a frequency distribution: counts of data falling in various ranges. A histogram is used for numeric data, a bar chart for categorical data. Bin size selection is important: bins that are too small may show false patterns, while bins that are too large may hide important patterns. Several variations are possible: plot relative frequencies instead of raw frequencies, or make the bar height equal to relative frequency divided by bin width (a density histogram), which keeps areas comparable when bin widths are unequal.
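A sketch of how bin size changes what the same histogram shows, on synthetic data; `density=True` plots relative frequency divided by bin width:

```python
# Same data, three bin sizes: too few bins hide structure,
# too many show spurious patterns. Data is synthetic.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = rng.normal(loc=0, scale=1, size=500)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, bins in zip(axes, [5, 30, 200]):
    ax.hist(data, bins=bins, density=True)  # density histogram
    ax.set_title(f"{bins} bins")
plt.show()
```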
Box plots
A plot of the five-value summary of the data: minimum, 1st quartile, median, 3rd quartile, and maximum. Often used together with a histogram in EDA.
Scatterplots
2D graphs, useful for understanding the relationship between two attributes. Features of the relationship are described by strength, shape, direction, and the presence of outliers.
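A sketch of both plots on synthetic data; the linear relationship and its noise level are invented for illustration:

```python
# Box plot (five-value summary) and scatterplot (pairwise relationship).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 2 * x + rng.normal(scale=2, size=100)  # positive, linear, fairly strong

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3))
ax1.boxplot(x)       # min, Q1, median, Q3, max at a glance
ax2.scatter(x, y)    # inspect strength, shape, direction, outliers
plt.show()
```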
Models Definition & Purpose
- Models encapsulate information into tools for forecasts/predictions.
- Key steps: Building, fitting, and validating.
- “All models are wrong, but some are useful.” — George Box
Philosophies of Models
- Occam’s Razor
- Bias Variance Trade-Off
Occam’s Razor
- Prefer simpler models when equally accurate, as they:
- Make fewer assumptions, reducing overfitting risk.
- Avoid memorizing features of the dataset.
- However, simplicity isn’t absolute:
- Complex models like deep learning can be more predictive despite higher parameter counts.
- Complexity comes with a trade-off between accuracy and cost.
Bias-Variance Trade-Off
- Bias: Error from overly simple assumptions (e.g., underfitting).
- Performs poorly on both training and testing data.
- Variance: Error from excessive sensitivity to noise (e.g., overfitting).
- Performs well on training data but poorly generalizes to new data.
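A small sketch of the trade-off, fitting polynomials of increasing degree to synthetic noisy data: degree 0 underfits (high bias, poor on both sets), while degree 15 chases the noise (high variance, good on training data only):

```python
# Underfitting vs. overfitting on noisy samples of a sine wave.
import numpy as np

rng = np.random.default_rng(2)
x_train = np.linspace(0, 1, 20)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.3, size=20)
x_test = np.linspace(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test)

for degree in [0, 3, 15]:
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```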
Principles of Good Models
- Probabilistic Predictions: Assign probabilities to forecasts (e.g., a 50% chance of rain); report the full probability distribution rather than just a single mean value
- Feedback Mechanism: Models should update dynamically and show how predictions evolve over time
- Consensus: Build multiple models with distinct methods for the same prediction
- Bayesian Reasoning: Update probabilities with new events. Requires prior probabilities from domain knowledge
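A toy Bayesian update for the last principle; the prior and test accuracies below are invented numbers:

```python
# Bayes' rule: update a prior probability after observing new evidence.
prior = 0.01                   # P(disease), from domain knowledge
p_pos_given_disease = 0.95     # test sensitivity (assumed)
p_pos_given_healthy = 0.05     # false positive rate (assumed)

# Law of total probability: P(positive)
p_pos = p_pos_given_disease * prior + p_pos_given_healthy * (1 - prior)

# Posterior: P(disease | positive)
posterior = p_pos_given_disease * prior / p_pos
print(posterior)  # ~0.16: one positive test moves 1% up to about 16%
```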
Baseline Models Purpose
- Assess model effectiveness by comparison to simple, reasonable benchmarks.
- Only when models decisively outperform baselines can they be deemed effective.
Classification Baselines
- Random selection of labels (no prior distribution).
- Most common label in the training data.
- Best single-feature model.
- Compare against an existing, well-known model.
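A sketch of the first two baselines using scikit-learn's DummyClassifier; the features and labels are invented:

```python
# Random-label and most-common-label baselines for classification.
import numpy as np
from sklearn.dummy import DummyClassifier

X = np.zeros((8, 1))                     # features are ignored by these baselines
y = np.array([0, 0, 0, 0, 0, 1, 1, 1])  # imbalanced labels

for strategy in ["uniform", "most_frequent"]:  # random vs. most common label
    baseline = DummyClassifier(strategy=strategy, random_state=0).fit(X, y)
    print(strategy, baseline.score(X, y))
```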
Prediction Baselines
- Mean or median value of the target.
- Linear regression for linear relationships.
- Previous value (useful in time-series forecasting).
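A sketch of these baselines: DummyRegressor covers the mean/median of the target, and a shifted array gives the previous-value baseline. The data is invented:

```python
# Mean/median and previous-value baselines for prediction.
import numpy as np
from sklearn.dummy import DummyRegressor

X = np.arange(6).reshape(-1, 1)
y = np.array([3.0, 4.0, 5.0, 5.5, 6.0, 8.0])

for strategy in ["mean", "median"]:
    baseline = DummyRegressor(strategy=strategy).fit(X, y)
    print(strategy, baseline.predict(X[:1]))

# Previous-value baseline for time series: predict y[t] = y[t-1]
print("naive MSE:", np.mean((y[1:] - y[:-1]) ** 2))
```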
Visualization
The visual representation and presentation of data to facilitate understanding.
The process of understanding (Visualization)
- Perceiving: what do I see, what is shown, how is data represented
- Interpreting: what does it mean, given the subject? What is interesting?
- Comprehending: what does it mean to me? what have I learnt?
To Facilitate Understanding
- Context is important as it helps determine what is interesting and what is important (signal vs noise)
- Any disconnect from the subject impedes the process of interpretation
- The onus is thus on the visualizer to bridge the gap by providing captions, headlines, use of colors etc.
- Comprehension: the viewer needs to answer “what does it mean to me?”
Chart Rules
- Show the data
- Persuade the user to think about the data
- Avoid distorting data
- Be concise: present more information with minimum ink
- Make large datasets coherent
- Encourage the reader to compare different pieces of data
- Reveal data
Use of Statistics
- Mathematically describe our findings as a numerical representation of the data
- Descriptive statistics summarize data
- Inferential statistics are tools that indicate how much confidence we can have when we generalize from a sample to a population
- Draw conclusions from our results
- Test hypotheses
- Test for relationships among variables
Statistics
A set of procedures and rules for reducing large masses of data to manageable proportions, allowing us to draw conclusions from the data
Types of questions answered by statistics
Statistical Questions:
- Seek generality, not a particular instance: they anticipate variability and concern groups of individuals. Studies are designed to answer such research questions (e.g., will this vaccine be effective? How tall are students at a given school?)
Non-Statistical Questions:
- Ask about a single instance or a direct comparison between individuals, with no variability involved (e.g., how tall is the president? Which dog weighs more?)
Populations
- Whole group of data is called the population
- Include all elements from the set of observations that can be made
- Members of a population share a common set of properties that are the subject of statistical analysis
- Subset of population is called a subpopulation if they share one or more additional properties
Samples
- Includes one or more observations from a population
- A sample is a portion of the population selected to be representative of the population from which it was drawn
- It's not always possible to perform a census of every individual member of a population
Using inferential statistics, we perform measurements on a subset of the population, which tell us about the corresponding measurements in the whole population.
A good sample is random and unbiased.
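A sketch of the idea: measure a random sample to estimate a population value. The population of heights here is synthetic:

```python
# A random sample's mean estimates the population mean.
import numpy as np

rng = np.random.default_rng(3)
population = rng.normal(loc=170, scale=10, size=100_000)  # heights in cm

sample = rng.choice(population, size=200, replace=False)  # random, unbiased
print("population mean:", population.mean())
print("sample mean:    ", sample.mean())  # close to the population mean
```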
Hypothesis Testing
The null hypothesis (H0) states the numerical assumption to be tested, e.g., each household has at least 3 TVs (H0: μ ≥ 3).
Begin with the assumption that the null hypothesis is TRUE. It refers to the status quo and always contains a form of the equality sign (=, ≤, or ≥). The null hypothesis may or may not be rejected.
The alternative hypothesis (H1) represents the opposite of the null hypothesis, e.g., each household has fewer than 3 TVs (H1: μ < 3).
Methodology
Statistical Testing: Formulate the null hypothesis, decide in advance what kinds of evidence/data will lead to rejection of the null hypothesis (define the rejection region). Gather the data and carry out the test.
Errors in Testing
Type 1 error or Type 2 error
Type 1 error
Rejecting the null hypothesis when it is actually true: taking action when it is not needed (a false positive).
Type 2 error
Failing to reject the null hypothesis when it is actually false: failing to take action when it is warranted (a false negative).
Rejection Region
The set of data values that are inconsistent with the null hypothesis. Evidence is divided into two types:
- Data that is inconsistent with the hypothesis (Rejection region)
- Everything else
The Testing Strategy
We are usually looking for the kind of data that would lead us to reject the hypothesis. Scientifically, if you want to show a hypothesis is true, begin by assuming it is not true and look for plausible evidence that contradicts the assumption.
- Formulate the null hypothesis
- Gather the evidence
- Ask: if my null hypothesis were true, how likely is it that I would have observed this evidence?
- Very unlikely: reject the hypothesis
- Not unlikely: Do not reject (retain the hypothesis for continued scrutiny)
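A minimal sketch of this strategy as a one-sample t-test with SciPy, reusing the TV example (H0: mean ≥ 3, H1: mean < 3); the observations are invented:

```python
# One-sample t-test: how likely is this evidence if H0 were true?
import numpy as np
from scipy import stats

tvs = np.array([2, 1, 3, 2, 2, 3, 1, 2, 2, 3])  # observed TVs per household

t_stat, p_value = stats.ttest_1samp(tvs, popmean=3, alternative="less")
print(t_stat, p_value)

alpha = 0.05  # rejection region chosen in advance
print("reject H0" if p_value < alpha else "do not reject H0")
```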