Exploratory Data Analysis Flashcards

1
Q

Exploratory data analysis

A

Getting a feel for the data: it makes mistakes and outliers easier to find and helps you guess what actually happened.
- Understand and gain insights into the data before selecting analysis techniques.
- Approach data without assumptions, often using visual methods.

2
Q

We need to get to know the data

A
  • Numeric data distributions (symmetric, normal, skewed etc.)
  • Data quality problems
  • Find outliers
  • Search for correlations and interrelationships
  • Identify subsets of interest
  • Suggest functional relationships
3
Q

We can ask questions

A
  • Descriptive stats: “Who is most profitable”
  • Hypothesis Testing: “Is there a difference between the value of these two customers”
  • Classification: “What are the common characteristics of customers”
  • Prediction: “Will this new customer become profitable”
  • We need to answer the question of what models and techniques to use given the problem context, data and underlying assumptions.
4
Q

Comparison with Hypothesis Testing

A
  • EDA: Open-ended exploration with no or incomplete prior expectations.
  • Hypothesis Testing: Tests pre-defined hypotheses.
5
Q

Systematic Process

A
  1. Understand Data Context:
    • Who created the dataset, when, and why?
    • Size, number of fields, and their meanings.
  2. Initial Exploration:
    • Inspect familiar or interpretable records.
    • Compute summary statistics (e.g., mean, min, max, quartiles, outliers).
  3. Visualization:
    • Plot variable distributions (e.g., box plots, time-series).
    • Examine relationships via scatterplot matrices.
    • Visualize pairwise correlations and group breakdowns (e.g., gender, age).
  4. Transformations:
    • Transform variables as needed to identify patterns and outliers.
6
Q

Descriptive Statistics

A

Quantitatively describe main features of the data. Main data features:
- Measures of central tendency represent a center around which measurements are distributed (mean, median)
- Measures of variability represent the spread of data from the center (standard dev.)
- Measures of relative standing represent the ‘relative position’ of specific measurements in data (quantiles)

7
Q

The mean

A

The arithmetic average: the sum of the values divided by their count. It is sensitive to outliers, which can make it a misleading measure of central tendency for skewed data.

8
Q

The median

A

The middle value when the values are ranked in order; it splits the data into two halves. AKA the 50th percentile. It is largely unaffected by outliers, which often makes it a better measure of central tendency than the mean for skewed data. In skewed data, the mean is pulled further toward the long tail than the median.
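
A minimal sketch of how one extreme value moves the mean much more than the median. The salary figures are hypothetical and chosen only for illustration.

```python
import statistics

# Hypothetical salaries in thousands; the second list adds one extreme value.
salaries = [30, 32, 35, 38, 40]
with_outlier = salaries + [400]

# Without the outlier: mean 35, median 35.
print(statistics.mean(salaries), statistics.median(salaries))

# With the outlier: the mean jumps to about 95.8, the median only moves to 36.5.
print(statistics.mean(with_outlier), statistics.median(with_outlier))
```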

9
Q

The mode

A

The most common value in the data. There may be more than one mode.
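
A small illustration of a dataset with two modes, assuming Python 3.8 or later (where `statistics.multimode` is available). The shoe sizes are made-up values.

```python
import statistics

# Hypothetical shoe sizes; 38 and 40 each appear twice.
sizes = [38, 40, 38, 41, 40, 42]

print(statistics.mode(sizes))       # first mode encountered: 38
print(statistics.multimode(sizes))  # all modes: [38, 40]
```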

10
Q

Variance

A

The average squared deviation from the mean, i.e. the spread around the mean. The lower the variance, the more consistent the data.

11
Q

Standard Deviation

A

The square root of the variance, expressed in the same units as the data. It measures spread around the mean: a high standard deviation means more spread, less consistency and less clustering.
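
As a rough illustration, the sketch below computes the population variance (dividing by n) and the standard deviation for two hypothetical samples that share a mean but differ in spread. The helper name `population_variance` is made up for this example.

```python
import math

# Two hypothetical samples with the same mean (50) but different spread.
tight = [48, 49, 50, 51, 52]
wide = [10, 30, 50, 70, 90]

def population_variance(values):
    """Average squared deviation from the mean."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

for name, values in (("tight", tight), ("wide", wide)):
    var = population_variance(values)
    std = math.sqrt(var)  # standard deviation is the square root of variance
    print(name, var, round(std, 2))
```

The sample variance divides by n - 1 instead of n; which one applies depends on whether the data is a full population or a sample.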

12
Q

Quartiles

A

The three values that divide a ranked series of values into four equal parts. The median is the 2nd quartile and divides the series in half.
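
A short sketch using `statistics.quantiles` (Python 3.8+) on hypothetical exam scores. Note that several quartile conventions exist, so other tools may return slightly different cut points for the same data.

```python
import statistics

# Hypothetical exam scores (illustrative values only).
scores = [55, 60, 62, 65, 70, 72, 75, 80, 85, 90, 95]

# n=4 requests the three cut points that split the data into four parts.
q1, q2, q3 = statistics.quantiles(scores, n=4)

print(q1, q2, q3)
print(q2 == statistics.median(scores))  # the 2nd quartile is the median: True
```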

13
Q

Common Visualizations

A
  • Histograms/Bar Charts
  • Box Plots
  • Scatterplots
14
Q

Histograms/Bar Charts

A

Used to display a frequency distribution: counts of data falling in various ranges. A histogram is used for numeric data and a bar chart for categorical data. Bin size selection is important: if the bins are too small they may show false patterns (noise); if too large they may hide important patterns. Several variations are possible, e.g. plotting relative frequencies instead of raw frequencies, or making the height of each bar equal to relative frequency / bin width so that the bar areas sum to 1.
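
To make the effect of bin size concrete, here is a minimal sketch that bins the same hypothetical values into 2, 4 and 8 equal-width bins and also computes density heights (relative frequency / bin width). The `histogram` helper and the data are assumptions made for this example, not part of any library.

```python
# Hypothetical measurements (illustrative values only).
values = [1.2, 1.9, 2.1, 2.4, 2.8, 3.3, 3.5, 3.6, 4.1, 4.7]

def histogram(data, low, high, bins):
    """Return (counts, density heights) for equal-width bins on [low, high]."""
    width = (high - low) / bins
    counts = [0] * bins
    for x in data:
        # Clamp so that x == high falls into the last bin.
        index = min(int((x - low) / width), bins - 1)
        counts[index] += 1
    # Height = relative frequency / bin width, so the bar areas sum to 1.
    heights = [c / len(data) / width for c in counts]
    return counts, heights

for bins in (2, 4, 8):
    counts, heights = histogram(values, low=1.0, high=5.0, bins=bins)
    print(bins, counts)
```

With 2 bins the counts look flat, while 8 bins spread ten points so thinly that small bumps may just be noise.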

15
Q

Box plots

A

A five-number summary plot of the data: minimum, 1st quartile, median, 3rd quartile and maximum. Often used alongside a histogram in EDA.
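
The numbers behind a box plot can be computed directly. The sketch below uses hypothetical delivery times and, as an assumption not stated on the card, the common Tukey convention of flagging points more than 1.5 * IQR beyond the quartiles as outliers.

```python
import statistics

# Hypothetical delivery times in minutes; 95 is an unusually slow delivery.
times = [22, 25, 27, 28, 30, 31, 33, 35, 38, 95]

q1, median, q3 = statistics.quantiles(times, n=4)
five_number = (min(times), q1, median, q3, max(times))

# Assumed convention (Tukey): points beyond 1.5 * IQR from the box are outliers.
iqr = q3 - q1
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [t for t in times if t < low_fence or t > high_fence]

print(five_number)
print(outliers)  # [95]
```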

16
Q

Scatterplots

A

2D graphs, useful for understanding the relationship between two attributes. The relationship is described by its strength, shape, direction and the presence of outliers.
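
One way to put a number on the strength and direction of a roughly linear relationship is the Pearson correlation coefficient. This is an illustrative choice, not something the card prescribes, and the paired data is hypothetical.

```python
import math

# Hypothetical paired observations: hours studied and exam score.
hours = [1, 2, 3, 4, 5, 6]
score = [52, 55, 61, 64, 70, 74]

def pearson_r(xs, ys):
    """Pearson correlation: sign gives direction, magnitude gives linear strength."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

r = pearson_r(hours, score)
print(round(r, 3))  # close to +1: strong, positive, roughly linear
```

A coefficient near zero does not rule out a curved relationship, which is why the shape still has to be checked on the plot itself.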

17
Q

Models Definition & Purpose

A
  • Models encapsulate information into tools for forecasts/predictions.
  • Key steps: Building, fitting, and validating.
  • “All models are wrong, but some are useful.” — George Box
18
Q

Philosophies of Models

A
  • Occam’s Razor
  • Bias Variance Trade-Off
19
Q

Occam’s Razor

A
  • Prefer simpler models when equally accurate, as they:
    • Make fewer assumptions, reducing overfitting risk.
    • Avoid memorizing features of the dataset.
  • However, simplicity isn’t absolute:
    • Complex models like deep learning can be more predictive despite higher parameter counts.
    • Complexity comes with a trade-off between accuracy and cost.
20
Q

Bias-Variance Trade-Off

A
  • Bias: Error from overly simple assumptions (e.g., underfitting).
    • Performs poorly on both training and testing data.
  • Variance: Error from excessive sensitivity to noise (e.g., overfitting).
    • Performs well on training data but poorly generalizes to new data.
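
The two failure modes can be sketched with two deliberately extreme models on synthetic data. Everything here is an assumption made for illustration: the data generator (y = 2x plus noise), the constant model standing in for high bias, and the 1-nearest-neighbor memorizer standing in for high variance.

```python
import random

random.seed(0)

def make_data(n):
    """Hypothetical data: y = 2x plus Gaussian noise."""
    xs = [random.uniform(0, 10) for _ in range(n)]
    return [(x, 2 * x + random.gauss(0, 3)) for x in xs]

train, test = make_data(30), make_data(30)

def mse(model, data):
    """Mean squared error of a model over (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

# High bias: always predict the training mean, ignoring x entirely.
mean_y = sum(y for _, y in train) / len(train)
constant_model = lambda x: mean_y

# High variance: memorize the training set and copy the nearest neighbor's y.
nearest_model = lambda x: min(train, key=lambda p: abs(p[0] - x))[1]

for name, model in (("constant", constant_model), ("1-nearest", nearest_model)):
    print(name, round(mse(model, train), 2), round(mse(model, test), 2))
```

The memorizing model has zero training error by construction but a clearly nonzero test error, while the constant model has a large error on both sets.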
21
Q

Principles of Good Models

A
  • Probabilistic Predictions: Assign probabilities to forecasts (e.g., 50% chance of rain), using a probability distribution rather than a single point estimate
  • Feedback Mechanism: Models should update dynamically and show how predictions evolve over time
  • Consensus: Build multiple models with distinct methods for the same prediction
  • Bayesian Reasoning: Update probabilities with new events. Requires prior probabilities from domain knowledge
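
Bayesian updating can be sketched with Bayes' rule for a single hypothesis. The function name and all probabilities below are hypothetical; in practice the prior would come from domain knowledge.

```python
def bayes_update(prior, p_evidence_if_true, p_evidence_if_false):
    """Posterior P(H | E) from a prior P(H) and the two likelihoods."""
    numerator = p_evidence_if_true * prior
    evidence = numerator + p_evidence_if_false * (1 - prior)
    return numerator / evidence

# Hypothetical numbers: prior chance of rain 20%; dark clouds appear on
# 90% of rainy days and 30% of dry days.
posterior = bayes_update(prior=0.2, p_evidence_if_true=0.9, p_evidence_if_false=0.3)
print(round(posterior, 3))  # 0.429
```

Seeing dark clouds raises the probability of rain from 20% to about 43%; the posterior can then serve as the prior for the next piece of evidence.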
22
Q

Baseline Models Purpose

A
  • Assess model effectiveness by comparison to simple, reasonable benchmarks.
  • Only when models decisively outperform baselines can they be deemed effective.
23
Q

Classification Baselines

A
  • Random selection of labels (no prior distribution).
  • Most common label in the training data.
  • Best single-feature model.
  • Compare against an existing, well-known model.
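
The first two baselines are simple enough to sketch with the standard library. The labels are hypothetical, and accuracy is used as the metric only for illustration.

```python
import random
from collections import Counter

# Hypothetical labels (illustrative only): 1 = profitable customer, 0 = not.
train_labels = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]
test_labels = [0, 1, 0, 0, 0, 1, 0, 0]

def accuracy(predictions, truth):
    """Fraction of predictions that match the true labels."""
    return sum(p == t for p, t in zip(predictions, truth)) / len(truth)

# Baseline 1: always predict the most common training label.
majority = Counter(train_labels).most_common(1)[0][0]
majority_acc = accuracy([majority] * len(test_labels), test_labels)

# Baseline 2: pick labels uniformly at random (seeded for repeatability).
rng = random.Random(0)
classes = sorted(set(train_labels))
random_acc = accuracy([rng.choice(classes) for _ in test_labels], test_labels)

print(majority, majority_acc, random_acc)
```

Here the majority baseline already reaches 0.75 accuracy, so a real model reporting 0.78 on this data would not be decisively better.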
24
Q

Prediction Baselines

A
  • Mean or median value of the target.
  • Linear regression for linear relationships.
  • Previous value (useful in time-series forecasting).
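
A minimal sketch of the mean baseline and the previous-value baseline on a hypothetical monthly sales series, scored with mean absolute error (an assumed choice of metric).

```python
import statistics

# Hypothetical monthly sales series (illustrative values only).
sales = [100, 104, 103, 108, 112, 111, 115, 120]
train, test = sales[:6], sales[6:]

def mae(predictions, truth):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(predictions, truth)) / len(truth)

# Baseline 1: predict the training mean for every future point.
mean_pred = [statistics.mean(train)] * len(test)

# Baseline 2: predict the previous observed value (naive forecast).
history = train + test
prev_pred = [history[len(train) + i - 1] for i in range(len(test))]

print(mae(mean_pred, test), mae(prev_pred, test))
```

On a trending series like this one the previous-value baseline is much harder to beat than the mean, which is why it is a common benchmark in time-series forecasting.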
25
Q

Visualization

A

The visual representation and presentation of data to facilitate understanding.

26
Q

The process of understanding (Visualization)

A
  1. Perceiving: what do I see, what is shown, how is data represented
  2. Interpreting: what does it mean, given the subject? What is interesting?
  3. Comprehending: what does it mean to me? what have I learnt?
27
Q

To Facilitate Understanding

A
  • Context is important as it helps determine what is interesting and what is important (signal vs noise)
  • Any disconnect from the subject impedes the process of interpretation
  • The onus is thus on the visualizer to bridge the gap by providing captions, headlines, use of colors etc.
  • Comprehension: the viewer needs to answer “what does it mean to me?”
28
Q

Chart Rules

A
  • Show the data
  • Persuade the user to think about the data
  • Avoid distorting data
  • Be concise: present more information with minimum ink
  • Make large datasets coherent
  • Encourage the reader to compare different pieces of data
  • Reveal data
29
Q

Use of Statistics

A
  • Mathematically describe our findings as a numerical representation of the data
  • Descriptive statistics summarize data
  • Inferential statistics are tools that indicate how much confidence we can have when we generalize from a sample to a population
  • Draw conclusions from our results
  • Test hypotheses
  • Test for relationships among variables
30
Q

Statistics

A

A set of procedures and rules for reducing large masses of data to manageable proportions, allowing us to draw conclusions from the data.

31
Q

Types of questions answered by statistics

A

Statistical Questions:
- Studies are designed to answer research questions (ex. will this vaccine be effective, how tall are students at a given school).
- They seek generality rather than a particular instance: they concern groups of individuals and are interested in variability and features.

Non-Statistical Questions:
- Ask about a particular instance or a direct comparison with a single answer (ex. how tall is the president, which dog weighs more).

32
Q

Populations

A
  • Whole group of data is called the population
  • Include all elements from the set of observations that can be made
  • Members of a population share a common set of properties that are the subject of statistical analysis
  • Subset of population is called a subpopulation if they share one or more additional properties
33
Q

Samples

A
  • Includes one or more observations from a population
  • A sample is a portion of the population; it should be representative of the population from which it was selected
  • It's not always possible to perform a census of every individual member of a population
  • Using inferential statistics, we perform measurements on a subset of the population, which tell us about the corresponding measurements in the population

A good sample is random and unbiased.

34
Q

Hypothesis Testing

A

The null hypothesis (H0) states the numerical assumption to be tested, e.g. the average household has at least 3 TVs (mean ≥ 3).
Begin with the assumption that the null hypothesis is TRUE. It refers to the status quo and always contains an equality (=, ≤ or ≥). The null hypothesis may or may not be rejected.
The alternative hypothesis (H1) represents the opposite of the null hypothesis, e.g. the average household has fewer than 3 TVs (mean < 3).

35
Q

Methodology

A

Statistical Testing: Formulate the null hypothesis, decide in advance what kinds of evidence/data will lead to rejection of the null hypothesis (define the rejection region). Gather the data and carry out the test.

36
Q

Errors in Testing

A

Type 1 error or Type 2 error

37
Q

Type 1 error

A

Rejecting a null hypothesis that is actually true (a false positive): taking action when not needed.

38
Q

Type 2 error

A

Failing to reject a null hypothesis that is actually false (a false negative): failing to take action when warranted.

39
Q

Rejection Region

A

The set of outcomes that are inconsistent with the null hypothesis and therefore lead to its rejection. Evidence is divided into two types:
- Data that is inconsistent with the hypothesis (Rejection region)
- Everything else

40
Q

The Testing Strategy

A

We usually look for the kind of data that would lead us to reject the hypothesis. Scientifically, if you want to show that a hypothesis is true, begin by assuming it is not true and look for plausible evidence that contradicts that assumption.

  • Formulate the null hypothesis
  • Gather the evidence
  • Q: if my null hypothesis were true, how likely is it that I would have observed this evidence?
  • Very unlikely: reject the hypothesis
  • Not unlikely: Do not reject (retain the hypothesis for continued scrutiny)
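
The strategy above can be sketched with an exact one-sided binomial test on coin flips. The scenario, the helper name `binomial_p_value`, and the 0.05 significance level are assumptions made for this example; `math.comb` requires Python 3.8+.

```python
from math import comb

def binomial_p_value(successes, trials, p_null=0.5):
    """One-sided p-value: P(X >= successes) if the null hypothesis is true."""
    return sum(
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Hypothetical evidence: 16 heads in 20 flips.
# H0: the coin is fair (p = 0.5). H1: the coin favors heads (p > 0.5).
alpha = 0.05  # assumed significance level, chosen before seeing the data
p_value = binomial_p_value(16, 20)

print(round(p_value, 4))  # 0.0059
print("reject H0" if p_value < alpha else "do not reject H0")
```

A fair coin would produce 16 or more heads in 20 flips less than 1% of the time, so this evidence falls in the rejection region at the assumed 5% level.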