Data Analysis Flashcards
Methods of Presenting Data
Data can be organized using tables, graphical methods, and numerical methods. A variable represents a characteristic that varies within a population and can be:
Quantitative (numerical): e.g., height, age
Categorical (nonnumerical): e.g., eye color, political preference.
The distribution of a variable describes how frequently different values occur.
Frequency: The count of a specific value in the dataset.
Relative Frequency: The proportion of a value in relation to the total dataset (expressed as a percentage, fraction, or decimal).
Frequency Distributions and Relative Frequency Distributions use tables or graphs to summarize data effectively.
Tables
Tables help organize and present data clearly. They are often used for frequency distributions (showing how often values occur) and relative frequency distributions (showing the proportion of each value).
A frequency distribution table lists categories or numerical values in one column and their frequencies in another.
A relative frequency table follows the same format but shows proportions (percentages, fractions, or decimals) instead of counts.
If there are many unique values, data can be grouped into ranges to simplify the table.
- Understand the Structure of Tables
Rows and columns organize data clearly—always check labels to understand what’s being measured.
Identify categories (qualitative) vs. numerical values (quantitative).
- Frequency vs. Relative Frequency
Frequency = The number of times a value appears.
Relative Frequency = Frequency ÷ Total, expressed as a fraction, decimal, or percentage.
Make sure relative frequencies sum to 1 (or 100%).
- Recognize Grouped Data
If there are too many unique values, data is often grouped into ranges.
Pay attention to range boundaries (e.g., “71-80” includes both 71 and 80).
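A minimal Python sketch of a frequency and relative frequency table, following the definitions above. The eye-color sample and the function name frequency_table are made up for illustration.

```python
from collections import Counter

# Hypothetical sample: eye colors recorded for 10 people.
eye_colors = ["brown", "blue", "brown", "green", "brown",
              "blue", "brown", "hazel", "blue", "brown"]

def frequency_table(values):
    """Map each value to (frequency, relative frequency)."""
    counts = Counter(values)   # frequency = how many times each value appears
    total = len(values)        # total number of data points
    return {value: (count, count / total) for value, count in counts.items()}

table = frequency_table(eye_colors)
for value, (freq, rel) in table.items():
    print(f"{value:<6} {freq:>3} {rel:>6.0%}")

# The relative frequencies should add up to 1 (or 100%).
total_relative = sum(rel for _, rel in table.values())
```

Checking that total_relative is 1 (up to rounding) is a quick way to catch a miscounted table.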
Mean (Average)
Mean = ∑x / n
Sum up all the numbers (∑x)
Divide by the total number of numbers (n)
Median
Step 1: Arrange numbers in order from smallest to largest.
Step 2: If there’s an odd number of values:
Median = middle value
If there’s an even number of values:
Median = (middle value 1 + middle value 2) / 2
Example (odd number of values):
Numbers: 4, 4, 6, 7, 10 → Middle number is 6
Median = 6
Example (even number of values):
Numbers: 1, 2, 3, 4 → Middle numbers are 2 and 3
Median = (2 + 3) / 2 = 2.5
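The two median cases can be sketched in Python as below. This is an illustrative helper (the name median is our own choice) and it assumes the list is not empty.

```python
def median(values):
    """Median of a non-empty list of numbers."""
    ordered = sorted(values)          # Step 1: smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd count: single middle value
        return ordered[mid]
    # even count: average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([4, 4, 6, 7, 10]))  # 6
print(median([1, 2, 3, 4]))      # 2.5
```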
Mode
The number that appears most often
If one number appears the most → it’s the mode
If multiple numbers appear the most → multiple modes
If no number repeats → no mode
Example:
Numbers: 1, 2, 3, 3, 4, 4, 4, 5 → Mode = 4 (because it appears the most often)
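A small Python sketch covering the three mode cases above (one mode, several modes, no mode). The function name modes is hypothetical, and returning an empty list for "no mode" is a choice made for this example.

```python
from collections import Counter

def modes(values):
    """Return a sorted list of modes; empty if no value repeats."""
    counts = Counter(values)
    highest = max(counts.values())    # assumes values is not empty
    if highest == 1:
        return []                     # nothing repeats, so no mode
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([1, 2, 3, 3, 4, 4, 4, 5]))  # [4]
print(modes([1, 1, 2, 2, 3]))           # [1, 2]
print(modes([1, 2, 3]))                 # []
```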
Quartiles (Splitting into 4 Parts)
Think of quartiles as cutting the list into 4 equal chunks:
Q1 (First Quartile): The number that splits off the first 25% of the data.
Q2 (Second Quartile or Median): The middle of the whole list (50% mark).
Q3 (Third Quartile): The number that splits off 75% of the data.
So if we lined up 16 numbers in order, we’d:
Find the median (Q2)—the middle of the list.
Take the first half of the numbers and find the middle of that → That’s Q1.
Take the second half of the numbers and find the middle of that → That’s Q3.
For example, if we have this list:
2, 4, 4, 5, 7, 7, 7, 7, 7, 7, 8, 8, 9, 9, 9, 9
Q2 (Median) = 7 (average of the two middle numbers, 7 and 7)
Q1 = 6 (middle of first half: average of 5 and 7)
Q3 = 8.5 (middle of second half: average of 8 and 9)
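The "median of each half" method can be sketched in Python as follows. The function name quartiles is our own, and leaving the median out of both halves when the count is odd is one common convention, assumed here; other textbooks and software use slightly different rules.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Assumption: for an odd count, the middle value belongs to neither half.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return median(lower), median(ordered), median(upper)

data = [2, 4, 4, 5, 7, 7, 7, 7, 7, 7, 8, 8, 9, 9, 9, 9]
q1, q2, q3 = quartiles(data)
print(q1, q2, q3)  # 6.0 7.0 8.5
```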
Percentiles (Splitting into 100 Parts)
Now, if we want even smaller sections, we use percentiles, which break the list into 100 tiny pieces.
Q1 is the 25th percentile (25% of data falls below it).
Q2 (median) is the 50th percentile (halfway point).
Q3 is the 75th percentile (75% of data falls below it).
Percentiles are useful when dealing with big lists, like test scores or income levels.
Range
Range – The difference between the biggest and smallest number.
Range = Maximum value − Minimum value
Interquartile Range (IQR)
Interquartile Range (IQR) – The spread of the middle 50% of your data.
This ignores the extreme numbers (outliers) and focuses on the “core” data.
Example: If your numbers are 2, 4, 5, 7, 8, 9, then Q1 = 4 (middle of 2, 4, 5) and Q3 = 8 (middle of 7, 8, 9), so IQR = 8 − 4 = 4.
Formula: IQR = Q3 − Q1
Where:
Q1 (First Quartile) = 25th percentile (middle of the lower half)
Q3 (Third Quartile) = 75th percentile (middle of the upper half)
Standard Deviation (σ)
Standard Deviation (SD) – Measures how far the numbers typically are from the average.
If numbers are super close to the average, SD is small. If they’re spread out, SD is big.
Formula:
σ = √( ∑(xᵢ − x̄)² / n )
Where:
xᵢ = each data value
x̄ = mean (average) of the data
n = number of data points
Steps:
- Find the mean: x̄ = ∑xᵢ / n
- Subtract the mean from each data point
- Square each result
- Find the average of those squared differences
- Take the square root of that average
Example for numbers 0, 7, 8, 10, 10:
Mean = (0 + 7 + 8 + 10 + 10) / 5 = 35 / 5 = 7
Squared differences:
(0 − 7)², (7 − 7)², (8 − 7)², (10 − 7)², (10 − 7)² → 49, 0, 1, 9, 9
Average = (49 + 0 + 1 + 9 + 9) / 5 = 68 / 5 = 13.6
Standard deviation = √13.6 ≈ 3.7
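The five steps can be followed one by one in a short Python sketch. The function name population_sd is our own. It divides by n, as on this card; some courses and tools divide by n − 1 for a sample, which gives a different result.

```python
from math import sqrt

def population_sd(values):
    """Standard deviation using the divide-by-n formula from this card."""
    n = len(values)
    mean = sum(values) / n                              # find the mean
    squared_diffs = [(x - mean) ** 2 for x in values]   # subtract, then square
    variance = sum(squared_diffs) / n                   # average of the squares
    return sqrt(variance)                               # square root

data = [0, 7, 8, 10, 10]
sd = population_sd(data)
print(round(sd, 1))  # 3.7
```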
Standard Score (Z-Score)
Z = (x − x̄) / σ
Where:
x = the data value
x̄ = mean
σ = standard deviation
Sets
A set is a collection of objects (called elements or members) that share a common property.
Example: The set of even digits: {0, 2, 4, 6, 8}.
Types of Sets
Finite Set: A set whose elements can be counted, with the count coming to an end. Example: {1, 2, 3, 4}.
Infinite Set: A set whose elements go on without end. Example: The set of all integers.
Empty Set (∅): A set with no elements. Example: The set of all real numbers greater than 5 and less than 5.
Nonempty Set: Any set that has at least one element.
Subset (⊆): A set A is a subset of set B if all elements of A are also in B. Example: {2, 8} ⊆ {0, 2, 4, 6, 8}.
Universal Set (U): The set containing all elements under discussion.
Notation
Number of Elements in a Set: The number of elements in a set S is denoted as ∣S∣
Example: If S={6.2,−9,π,0.01,0}, then ∣S∣=5.
For the empty set: ∣∅∣=0.
Lists
A list is a collection of objects in a specific order, where repetitions matter.
Example: The lists (1, 2, 3, 2) and (1, 2, 2, 3) are different.
A list and a set are different: in a list, order matters and repetitions are counted; in a set, neither matters.
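A quick Python illustration of the list/set difference, using tuples to stand in for ordered lists. The variable names are made up for the example.

```python
list_a = (1, 2, 3, 2)
list_b = (1, 2, 2, 3)

# As ordered lists they differ, because the order differs.
same_as_lists = list_a == list_b            # False

# As sets they are equal: order and repetition are ignored.
same_as_sets = set(list_a) == set(list_b)   # True

# The list counts the repeated 2; the set does not.
list_length = len(list_a)                   # 4
set_size = len(set(list_a))                 # 3

print(same_as_lists, same_as_sets, list_length, set_size)
```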
Intersection ( ∩ )
The intersection of two sets S and T is the set of elements that are in both S and T.
Formula: S ∩ T = {x ∣ x ∈ S and x ∈ T}
Example:
If S={1,2,3} and T={2,3,4}, then
S∩T={2,3}.
Union ( ∪ )
The union of two sets S and T is the set of elements that are in either S, T, or both.
Formula: S ∪ T = {x ∣ x ∈ S or x ∈ T}
Example:
If S = {1, 2, 3} and T = {2, 3, 4}, then
S ∪ T = {1, 2, 3, 4}.
Disjoint Sets
Two sets are disjoint if they have no elements in common.
Formula:
S∩T=∅.
Example: If A={1,2,3} and B={4,5,6}, then
A∩B=∅.
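Python's built-in set type supports these operations directly, so the examples above can be checked with a short sketch. The variable names mirror the cards.

```python
S = {1, 2, 3}
T = {2, 3, 4}
A = {1, 2, 3}
B = {4, 5, 6}

intersection = S & T            # elements in both S and T
union = S | T                   # elements in S, T, or both
are_disjoint = A.isdisjoint(B)  # True when A and B share no elements

print(intersection == {2, 3})      # True
print(union == {1, 2, 3, 4})       # True
print(are_disjoint)                # True
print(A & B == set())              # True: the intersection is empty
```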
Venn Diagrams
A Venn diagram is a visual representation of sets and their relationships using overlapping circles.
Each circle represents a set.
Overlapping regions represent intersections.
The universal set (U) is often represented by a rectangle containing all sets.
Inclusion-Exclusion Principle
A formula used to count elements in the union of two finite sets while avoiding double-counting elements in their intersection.
Formula
For two sets A and B:
∣A∪B∣=∣A∣+∣B∣−∣A∩B∣
For two disjoint sets B and C:
∣B∪C∣=∣B∣+∣C∣
(B∩C=∅, so no need to subtract the intersection).
Example
If:
∣A∣=30
∣B∣=25
∣A∩B∣=10
Then: ∣A∪B∣=30+25−10=45
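To see the principle at work, here is a Python sketch with two concrete sets chosen (as an assumption for this example) so that their sizes match the numbers above: 30, 25, and an overlap of 10.

```python
# Hypothetical sets built to match the sizes in the example.
A = set(range(1, 31))    # 1..30  -> 30 elements
B = set(range(21, 46))   # 21..45 -> 25 elements, sharing 21..30 with A

size_by_formula = len(A) + len(B) - len(A & B)   # 30 + 25 - 10
size_by_counting = len(A | B)                    # count the union directly

print(len(A), len(B), len(A & B))   # 30 25 10
print(size_by_formula)              # 45
print(size_by_counting)             # 45
```

Without subtracting the overlap, the 10 shared elements would be counted twice.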
Multiplication Principle (Fundamental Counting Principle)
The Multiplication Principle states that if one event can occur in m ways and a second event can occur in n ways, then the total number of ways both events can occur together is:
Total ways = m × n
General Formula
If a sequence of k independent events can occur in n₁, n₂, …, nₖ ways respectively, then the total number of ways all events can happen is:
n₁ × n₂ × … × nₖ
Example
A restaurant offers 4 appetizers, 3 main courses, and 5 desserts.
The number of different meals (one appetizer, one main course, and one dessert) is:
4×3×5=60
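The restaurant example can be checked by listing every possible meal. This Python sketch uses placeholder dish names, since the card does not name the dishes.

```python
from itertools import product

# Placeholder menu items; only the counts (4, 3, 5) come from the example.
appetizers = [f"appetizer {i}" for i in range(1, 5)]
mains = [f"main {i}" for i in range(1, 4)]
desserts = [f"dessert {i}" for i in range(1, 6)]

# Every (appetizer, main, dessert) combination is one meal.
meals = list(product(appetizers, mains, desserts))

by_formula = len(appetizers) * len(mains) * len(desserts)
print(by_formula)   # 60
print(len(meals))   # 60
```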
Permutations
A permutation is just a fancy word for arranging things in a specific order. Order matters in permutations.
Formula:
P(n,k)= n! / (n−k)!
n = total objects
k = number of objects you’re picking
Example:
If you want to pick 5 digits from 7 and arrange them:
P(7, 5) = 7! / (7 − 5)! = 7! / 2! = (7 × 6 × 5 × 4 × 3 × 2 × 1) / (2 × 1)
Cancel out 2 × 1 (because it appears in both the top and bottom):
7 × 6 × 5 × 4 × 3 = 2,520
So, there are 2,520 ways to make a 5-digit number using 5 different digits chosen from 1 to 7.
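A Python sketch that checks the permutation count three ways: the factorial formula, the standard library's math.perm (available in Python 3.8 and later), and direct enumeration.

```python
from itertools import permutations
from math import factorial, perm

n, k = 7, 5

by_formula = factorial(n) // factorial(n - k)   # n! / (n - k)!
by_library = perm(n, k)                         # built-in permutation count

# List every ordered arrangement of 5 different digits from 1..7.
digits = range(1, 8)
by_enumeration = sum(1 for _ in permutations(digits, k))

print(by_formula, by_library, by_enumeration)   # 2520 2520 2520
```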
Factorial (!)
A factorial is when you multiply a positive whole number by every positive whole number below it. It’s written as n! (read as “n factorial”).
Formula: n! = n × (n − 1) × (n − 2) × … × 3 × 2 × 1
Examples:
3! = 3 × 2 × 1 = 6
4! = 4 × 3 × 2 × 1 = 24
5! = 5 × 4 × 3 × 2 × 1 = 120
0! is defined as 1 (a convention that keeps formulas such as P(n, n) = n! / 0! = n! working).
Factorials help us quickly count how many ways we can arrange things.
Combinations
Combinations are used when you want to choose items from a set, but the order doesn’t matter.
Formula for Combinations (n choose k):
C(n, k) = n! / (k! × (n − k)!)
Where:
n = total number of items
k = number of items you’re choosing
! = factorial
Example:
How many ways can you pick 3 students from a group of 5?
C(5, 3) = 5! / (3! × (5 − 3)!) = 5! / (3! × 2!) = (5 × 4 × 3!) / (3! × 2 × 1)
= (5 × 4) / (2 × 1) = 10
So there are 10 ways to choose 3 students from 5.
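The same answer can be checked in Python by listing the groups. The student labels A to E are placeholders; math.comb requires Python 3.8 or later.

```python
from itertools import combinations
from math import comb, factorial

students = ["A", "B", "C", "D", "E"]   # placeholder names for 5 students

# Each group of 3 appears once, regardless of the order it was picked in.
groups = list(combinations(students, 3))

by_formula = factorial(5) // (factorial(3) * factorial(5 - 3))
by_library = comb(5, 3)

print(len(groups), by_formula, by_library)   # 10 10 10
```

Compare with permutations: the ordered count P(5, 3) is 60, and dividing by the 3! = 6 orderings of each group gives the same 10.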
Basics of Probability
Definition: Probability is a numerical way to describe uncertainty.
General Probability Rules
Rule 1 (Certain & Impossible Events):
P(certain event) = 1
P(impossible event) = 0
Rule 2 (Complement Rule):
The probability that an event does not occur:
P(not E) = 1 − P(E)
Rule 3 (Sum of All Probabilities):
The sum of the probabilities of all possible outcomes is 1.
Probability formula
If all outcomes are equally likely, the probability of an event E is:
P(E) = (Number of outcomes in E) / (Total number of outcomes in the sample space)
Example 1: Probability of rolling a 4 on a fair six-sided die:
P(4) = 1/6
Mutually Exclusive Events (Cannot Happen Together)
Two events that cannot happen at the same time.
Example: Rolling an odd number and rolling an even number on a single roll of a die.
Formula:
P(E or F) = P(E) + P(F)
Independent Events (One Does Not Affect the Other)
Two events do not influence each other.
Example: Rolling a die twice → The result of the first roll does not affect the second roll.
Formula:
P(E and F) = P(E) × P(F)
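The probability formula and the two rules above can be checked by counting outcomes for a fair six-sided die. This Python sketch uses exact fractions; the helper name probability is our own.

```python
from fractions import Fraction
from itertools import product

die = list(range(1, 7))   # sample space for one roll of a fair die

def probability(event, sample_space):
    """Outcomes in the event divided by total outcomes (all equally likely)."""
    favorable = [o for o in sample_space if event(o)]
    return Fraction(len(favorable), len(sample_space))

p_four = probability(lambda x: x == 4, die)        # 1/6
p_odd = probability(lambda x: x % 2 == 1, die)     # 1/2
p_even = probability(lambda x: x % 2 == 0, die)    # 1/2

# Mutually exclusive: odd and even cannot both happen on one roll.
p_odd_or_even = probability(lambda x: x % 2 == 1 or x % 2 == 0, die)

# Independent: two separate rolls, both showing a 4.
two_rolls = list(product(die, repeat=2))           # 36 equally likely pairs
p_both_four = probability(lambda r: r[0] == 4 and r[1] == 4, two_rolls)

print(p_four, p_odd_or_even, p_both_four)   # 1/6 1 1/36
```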
Data Distribution
How numerical data is spread out or organized.
Relative Frequency
The proportion of times a value appears compared to the total data set.
Formula: Frequency of value / Total number of data points
Standard Deviation (SD)
Measures how spread out the data is around the mean (average).
d = √( ∑(xᵢ − m)² / n ), where m is the mean and n is the number of data points
Standard Deviation Ranges
For normally distributed data:
1 SD from mean: m ± d → Includes ~68% of data
2 SD from mean: m ± 2d → Includes ~95% of data
3 SD from mean: m ± 3d → Includes ~99.7% of data
Total Area Under Distribution Curve
Total area = 1
The total area under a probability distribution curve (or under the bars of a relative frequency histogram) always equals 1 (or 100% of data).
Random Variable
A random variable is a way to represent uncertain outcomes with numbers. Instead of saying “something random happens,” we assign a number to each possible outcome.
Mean (Expected Value)
The mean (expected value) is the average outcome you’d expect if you repeated the experiment many times.
This is used when you have probabilities associated with different values.
The formula is:
E(X) = ∑ x · P(x)
where:
x: each possible value of the random variable
P(x): the probability of that value
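As an illustration, the expected value of one roll of a fair six-sided die can be computed in Python. The die is an assumed example, not one from the card, and the function name expected_value is our own.

```python
from fractions import Fraction

# Assumed example: a fair die, each face 1..6 with probability 1/6.
distribution = {face: Fraction(1, 6) for face in range(1, 7)}

def expected_value(dist):
    """Sum of (value × probability) over all possible values."""
    return sum(x * p for x, p in dist.items())

mean = expected_value(distribution)
print(mean)                        # 7/2
print(sum(distribution.values()))  # 1 (probabilities add up to 1)
```

The result, 3.5, is not a face of the die: the expected value is a long-run average, not necessarily a possible outcome.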
Normal Distribution (Bell Curve)
A normal distribution is a symmetric, bell-shaped distribution that approximates many naturally occurring measurements, such as heights or test scores.
The bell shape appears when you graph the normal distribution on a coordinate plane.
Mean (m), Standard Deviation (d), and the Bell Curve
Mean (m): The center of the data, also called the average. The bell curve reaches its highest point at the mean.
Standard Deviation (d): A measure of how spread out the data is. A small d means the data is tightly packed near the mean, while a large d means the data is more spread out.
The bell curve is symmetric, meaning the left and right sides look the same.
Normal Distribution Properties
- Mean = Median = Mode → The most common value is also the average, so everything is centered.
- Symmetry → The left and right sides of the curve are mirror images.
- Most values (around 68%) aren’t too far from the average.
One standard deviation (σ) tells us how spread out the data is. If data is normally distributed, about 68% of values fall between:
Mean − σ to Mean + σ
Example: If test scores have a mean of 75 and a standard deviation of 5, about 68% of students scored between 70 and 80 (75 ± 5).
- If we go a bit farther from the mean (2 standard deviations), we cover nearly all values (about 95%).
If data is normally distributed, about 95% of values fall between:
Mean−2σ to Mean+2σ
Example:
Using the same test score example (mean = 75, standard deviation = 5),
About 95% of students scored between 65 and 85 (75 ± 10).
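The 68% and 95% figures can be checked against an exact normal curve with Python's statistics.NormalDist (Python 3.8 and later). This sketch assumes the test scores follow a normal distribution with mean 75 and standard deviation 5, as in the examples; the helper name share_between is our own.

```python
from statistics import NormalDist

scores = NormalDist(mu=75, sigma=5)   # assumed normal model of the scores

def share_between(dist, low, high):
    """Fraction of the distribution lying between low and high."""
    return dist.cdf(high) - dist.cdf(low)

within_1_sd = share_between(scores, 70, 80)   # roughly 0.68
within_2_sd = share_between(scores, 65, 85)   # roughly 0.95

print(f"{within_1_sd:.1%}")
print(f"{within_2_sd:.1%}")
```

The exact values are slightly above 68% and 95%, which is why the cards say "about".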
Probability and the Normal Curve
The total area under the curve represents 100% of the data.
If you randomly pick a person, the chance of them falling in a certain range is the area under that section of the curve.
Example:
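If heights are normally distributed, the probability of someone being taller than the average is 50% (since the curve is symmetrical).
The probability of someone having an IQ between 85 and 115 is about 68% (assuming IQ scores have a mean of 100 and a standard deviation of 15, that range is within 1 standard deviation).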
Standard Normal Distribution
This is just a special normal distribution where the mean is 0 and the standard deviation is 1.
Any data point can be converted into this form using the formula:
Z = (X − m) / d
Where:
X = the data value
m = the mean
d = the standard deviation
Example: If your test score is 85, the average is 70, and the standard deviation is 10:
Z = (85 − 70) / 10
= 15 / 10
= 1.5
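The conversion can be sketched in Python as below. The function name z_score is our own. The last step, which looks up the share of scores below that Z-value on the standard normal curve, assumes the scores are normally distributed.

```python
from statistics import NormalDist

def z_score(x, mean, sd):
    """How many standard deviations x lies above (+) or below (−) the mean."""
    return (x - mean) / sd

z = z_score(85, 70, 10)
print(z)   # 1.5

# Assuming a normal distribution: share of scores below Z = 1.5.
share_below = NormalDist().cdf(z)   # standard normal: mean 0, SD 1
print(f"{share_below:.1%}")         # about 93%
```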