BA 1 - Describing and Summarizing Data Flashcards
Axes of a histogram?
X-axis - bins corresponding to ranges of data;
Y-axis - frequency of observations falling into each bin.
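A minimal Python sketch of how histogram bins and frequencies could be tallied. The data values and the bin width of 5 are made up for illustration.

```python
from collections import Counter

# Hypothetical data set and bin width (both are assumptions for illustration).
data = [1, 3, 4, 7, 8, 8, 12, 15, 19]
bin_width = 5

# Map each value to the lower edge of its bin (x-axis),
# then count how many values fall into each bin (y-axis).
counts = Counter((value // bin_width) * bin_width for value in data)

for start in sorted(counts):
    print(f"{start}-{start + bin_width - 1}: {counts[start]}")
```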
What’s an outlier?
An outlier is a value that falls far from the rest of the data.
How do you examine the validity of an outlier?
i. Check if it’s valid, though unusual;
ii. Check for a data entry error; and
iii. Check if it was collected under different circumstances than the rest of the data.
What do you do about an outlier?
Leave it; change it to its corrected value; or in extreme cases, delete it.
Skewness
Skewness measures the degree of a graph’s asymmetry.
What are descriptive statistics?
Summary measures that provide an overview of the data set without showing every data point.
Mean
Sum of all data points divided by the number of data points
The mean is affected by outliers.
Median
Middle value of the data set; i.e. 50th percentile.
When the number of values is even, it’s the average of the middle two values.
Mode
Value that occurs most frequently
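A short Python sketch of the three measures, using the standard `statistics` module. The data set is hypothetical and includes one outlier (100) to show how the mean reacts while the median does not.

```python
import statistics

# Hypothetical data set; 100 is an outlier.
data = [2, 3, 3, 5, 7, 100]

mean = statistics.mean(data)      # sum / count = 120 / 6
median = statistics.median(data)  # even count: average of the middle two (3 and 5)
mode = statistics.mode(data)      # most frequent value (3 appears twice)

print(mean, median, mode)
```

Here the outlier pulls the mean well above most of the values, while the median stays near the centre of the data.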
Conditional mean
The mean of a subset of the data that includes all values satisfying a certain condition.
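As an illustration, a conditional mean can be sketched in Python by filtering first and averaging second. The region/sales records below are invented for the example.

```python
import statistics

# Hypothetical (region, sales) records.
records = [("East", 10), ("West", 20), ("East", 30), ("West", 40)]

# Keep only the values whose row satisfies the condition, then average them.
east_sales = [sales for region, sales in records if region == "East"]
east_mean = statistics.mean(east_sales)

print(east_mean)
```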
Percentile value
Value beneath which a certain percentage of the data lie
e.g., the 25th percentile is the smallest value that is greater than or equal to 25% of the data points.
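One way to turn this definition into code is the nearest-rank method, sketched below in Python. The function name `percentile_nearest_rank` and the data are hypothetical. Interpolating methods, such as Excel's PERCENTILE.INC, can return different values for the same data.

```python
import math

def percentile_nearest_rank(values, pct):
    """Smallest value that is >= pct percent of the data points."""
    ordered = sorted(values)
    # Rank (1-based) of the first value with at least pct% of points at or below it.
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical data set.
data = [15, 20, 35, 40, 50]
p25 = percentile_nearest_rank(data, 25)
p50 = percentile_nearest_rank(data, 50)

print(p25, p50)
```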
Range
Maximum value - Minimum value
Relationship between standard deviation and variance?
SD = square root (Variance)
What does variance measure?
Variance measures how spread out the data are: it is the average squared distance of the data points from the mean.
Difference between population and sample variance/SD?
For a population, the denominator is N; for a sample, the denominator is n-1.
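The two denominators can be compared side by side with a small Python sketch. The data set is hypothetical; the results are checked against the `statistics` module's population (`pvariance`, `pstdev`) and sample (`variance`, `stdev`) functions.

```python
import statistics

# Hypothetical data set: mean is 5, sum of squared deviations is 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(data)
mean = sum(data) / n
squared_devs = sum((x - mean) ** 2 for x in data)

pop_var = squared_devs / n           # population: divide by N
sample_var = squared_devs / (n - 1)  # sample: divide by n - 1

pop_sd = pop_var ** 0.5              # SD = square root of variance
sample_sd = sample_var ** 0.5

# Cross-check against the standard library.
print(pop_var, statistics.pvariance(data))
print(sample_var, statistics.variance(data))
print(pop_sd, statistics.pstdev(data))
print(sample_sd, statistics.stdev(data))
```

The sample figures are always slightly larger than the population figures for the same data, because the denominator is smaller.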
Coefficient of Variation
SD/mean
Measures the standard deviation relative to the size of the mean.
It’s useful for comparing variation in different data sets.
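A brief Python sketch of why this is useful: the two hypothetical data sets below have the same standard deviation but different means, so their coefficients of variation differ. The sample standard deviation is assumed here; the helper name `coefficient_of_variation` is made up for the example.

```python
import statistics

def coefficient_of_variation(values):
    # SD / mean, using the sample standard deviation (an assumption).
    return statistics.stdev(values) / statistics.mean(values)

# Hypothetical data sets with identical spread but different means.
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [50, 60, 70, 80, 90]

cv_heights = coefficient_of_variation(heights_cm)
cv_weights = coefficient_of_variation(weights_kg)

print(round(cv_heights, 3), round(cv_weights, 3))
```

Relative to its mean, the weights data vary more than the heights data, even though both have the same standard deviation.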
Kurtosis
Measure of flatness or sharpness of a distribution.
Low kurtosis => flat distribution
[EXCEL] Mean
=AVERAGE(range)
[EXCEL] Median
=MEDIAN(range)
[EXCEL] Mode
=MODE.SNGL(range)
[EXCEL] Conditional Mean
=AVERAGEIF(range,criteria,average_range)
[EXCEL] Percentile
=PERCENTILE.INC(range,k)
33rd percentile => k = 0.33
[EXCEL] Variance
=VAR.S(range)
[EXCEL] Standard deviation
=STDEV.S(range)
[EXCEL] Square root
=SQRT(number)
[EXCEL] Number of values
=COUNT(range)
[EXCEL] Range
=MAX(range)-MIN(range)
[EXCEL] Total
=SUM(range)
[EXCEL] Correlation
=CORREL(range1,range2)
Scatter plot
Visualizes the relationship between two variables, though we cannot assume a causal link.
Correlation
- Measure that quantifies the strength of a linear relationship between two variables.
- Ranges from -1 to +1; 0 = no linear relationship
- Does not imply causation
- Strongly influenced by outliers
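The points above can be sketched in Python with a hand-written Pearson correlation. The function name `pearson_r` and all data values are hypothetical, chosen so that one added outlier visibly changes the result.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [2, 4, 6, 8, 10])    # perfectly linear, increasing
r_down = pearson_r(x, [10, 8, 6, 4, 2])  # perfectly linear, decreasing

# Same increasing data plus one outlying point (6, -20).
r_outlier = pearson_r(x + [6], [2, 4, 6, 8, 10, -20])

print(r_up, r_down, r_outlier)
```

In this example a single outlier is enough to turn a perfect positive correlation into a negative one.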
Hidden variables
A variable that is correlated with each of two other variables that are not fundamentally related to each other, which can make those two variables appear related.
Mediating variable
Variable which is affected by one variable, and in turn affects another.
Time series
When one of the variables is time, the data are known as a time series.
Cross-sectional data
Cross-sectional data provide a snapshot of data across multiple groups at a given point in time.