Intro Flashcards
Definition / equation of churn rate
Cancellations / total subscribers (current + new subscribers)
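As a minimal sketch of the churn rate formula above, the division can be written as a small Python function. The function name, parameter names, and sample numbers are hypothetical and chosen only for illustration.

```python
def churn_rate(cancellations: int, current_subscribers: int, new_subscribers: int) -> float:
    """Return cancellations divided by total subscribers (current + new)."""
    total_subscribers = current_subscribers + new_subscribers
    if total_subscribers == 0:
        # The rate is undefined when there are no subscribers to divide by.
        raise ValueError("total subscribers must be greater than zero")
    return cancellations / total_subscribers


# Hypothetical numbers: 50 cancellations, 900 current and 100 new subscribers.
rate = churn_rate(50, 900, 100)
print(f"{rate:.1%}")  # prints 5.0%
```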
What are the two types of organized observations
Methodology and shape
What is the most common shape for data
Table or spreadsheet
Variables
The things we measure (columns of a table)
Observations / entity / instance
Rows - Individual instances of the things we are measuring
Numerical variables
Both the measurement and the unit of measurement (without a unit, a numerical variable is just a number)
What are the two ways of getting a number
Counting (discrete) or measuring (continuous)
Whole numbers are what type of variable
Discrete variable
Partial values are what type of variable
Continuous variable
Categorical variables
Characteristics with words or relative values
Nominal variable
A categorical variable whose values are names (labels with no inherent order)
Dichotomous variable
A categorical variable that is binary (yes / no, true / false, on / off, 1 / 0)
Ordinal variable
A categorical variable whose values have a meaningful order (for example, a ranking from 1 to 5)
3 common messy data problems
- Typos
- Missing data
- Inconsistent coding (three instead of 3 or N/A instead of 0)
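The inconsistent-coding problem above can be illustrated with a rough Python sketch that maps differently coded values onto one representation. The lookup table, the function name, and the decision to treat "N/A" as missing (rather than as 0) are assumptions made for illustration; the right choice depends on the dataset.

```python
# Hypothetical lookup for numbers that were typed out as words.
WORD_TO_NUMBER = {"three": 3}

# Hypothetical set of spellings that stand for "no value recorded".
MISSING_MARKERS = {"", "n/a", "na"}


def normalize(value):
    """Return an int, or None when the value is coded as missing."""
    text = str(value).strip().lower()
    if text in MISSING_MARKERS:
        return None  # assumption: "N/A" means missing, not zero
    if text in WORD_TO_NUMBER:
        return WORD_TO_NUMBER[text]
    return int(text)  # raises ValueError on typos such as "3e"


raw = ["3", "three", "N/A", " 4 "]
cleaned = [normalize(v) for v in raw]
print(cleaned)  # prints [3, 3, None, 4]
```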
Missing completely at random
Vs
Missing at random
Vs
Structurally missing
Data was simply not entered, or was not entered properly
We can predict if one value is missing based on the value in another variable
We don’t expect there to be a value to begin with
Accuracy
A measure of how well records reflect reality
Validity
The data actually measures what we think it is measuring
Various ways that a dataset can be low quality
- Typos
- Mistakes
- Missing data
- Poor measurement
- Duplicate observations
What are the two types of categorical variables
Ordinal (ordered)
Nominal (unordered)
A distribution is
A function that shows all possible values of a variable and how frequently each value occurs
Interquartile range
The range between the first and third quartile of the dataset
First quartile marks the point at 25% into the range of data
Third quartile marks the point at 75% into the range of data
A range of data is all values arranged from smallest to largest
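The quartile definitions above can be checked with a short Python sketch using the standard library. The sample data is made up, and the "inclusive" method is one assumption among several quartile conventions; other methods can give slightly different cut points.

```python
from statistics import quantiles

# Hypothetical sample, already arranged from smallest to largest.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 splits the data into quarters and returns the three cut points.
q1, median, q3 = quantiles(data, n=4, method="inclusive")

iqr = q3 - q1  # interquartile range: third quartile minus first quartile
print(q1, q3, iqr)  # prints 3.0 7.0 4.0
```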
Bimodal distribution
A distribution with two peaks (modes)
The act of aggregating data
Summarizing a numeric variable across each value of a categorical variable
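A minimal sketch of aggregation as described above: group the rows by a categorical variable, then summarize the numeric variable within each group. The column meanings ("plan" and "monthly spend"), the sample rows, and the choice of the mean as the summary are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (plan, monthly spend). Plan is categorical, spend is numeric.
rows = [("basic", 10.0), ("premium", 30.0), ("basic", 20.0), ("premium", 50.0)]


def aggregate_mean(rows):
    """Return the mean of the numeric value for each category."""
    groups = defaultdict(list)
    for category, value in rows:
        groups[category].append(value)
    return {category: mean(values) for category, values in groups.items()}


summary = aggregate_mean(rows)
print(summary)  # prints {'basic': 15.0, 'premium': 40.0}
```

Swapping mean for sum, min, max, or len gives other common summaries over the same groups.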
Correlation coefficient
Direction: - or +
Strength: 0 to 1 (the absolute value of the coefficient)
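To show how direction and strength are read off a single number, here is a sketch that computes the Pearson correlation coefficient by hand. The card does not say which coefficient is meant, so Pearson is an assumption, and the function name and sample data are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / (spread_x * spread_y)


r = pearson([1, 2, 3, 4], [2, 4, 6, 8])  # hypothetical, perfectly linear data
direction = "+" if r > 0 else "-" if r < 0 else "none"
strength = abs(r)  # 0 means no linear relationship, 1 means a perfect one
print(direction, round(strength, 3))  # prints + 1.0
```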