Descriptive Analytics Flashcards
Data
Set of pieces of individual information (can be likes, tweets, etc.)
3 Questions to Ask With New Data
1) How many variables are in the dataset?
2) How many observations in the dataset?
3) What level are the observations (customer, transaction, store, etc.)?
Quantitative
Scale is relevant: the size of the number carries meaning
Numerical
Qualitative
Set of categories
Example: coding gender {m, f, o} as {0, 1, 2} doesn’t mean o is double the value of f; the numbers are only labels
Discrete
Takes only certain values
Continuous
Can take on any numerical value
Variation
Observations of one variable take on different values; e.g., not all values of “churn” will be the same in the dataset
Mean
Average
Sensitive to outliers
Median
Middle value of the sorted observations
Ignores a lot of information
Can move abruptly
Mode
Most common value
Not good for continuous variables
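A minimal sketch of the three measures above, using Python's standard `statistics` module. The spend figures are made up for illustration; the last value is a deliberate outlier to show how the mean reacts while the median and mode do not.

```python
import statistics

# Hypothetical monthly spend for six customers; 400 is an outlier.
spend = [20, 22, 22, 25, 27, 400]

mean_spend = statistics.mean(spend)      # pulled upward by the outlier
median_spend = statistics.median(spend)  # middle of the sorted values
mode_spend = statistics.mode(spend)      # most common value

# Same data without the outlier, for comparison.
mean_without_outlier = statistics.mean(spend[:-1])

print(mean_spend, median_spend, mode_spend, mean_without_outlier)
```

Dropping the single outlier moves the mean a long way, while the median of the remaining values barely changes.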
Frequency Distribution
Table showing fraction of observations for which a variable takes on each possible value
Value    # Obs    % of Overall Obs
A
B
C
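As a rough illustration, a frequency distribution like the table above can be built with `collections.Counter`. The plan labels and their counts are invented for the example.

```python
from collections import Counter

# Hypothetical plan type recorded for ten customers.
plans = ["A", "B", "A", "C", "A", "B", "A", "C", "B", "A"]

counts = Counter(plans)
total = len(plans)

# One row per value: (number of observations, share of all observations).
freq_table = {value: (n, n / total) for value, n in sorted(counts.items())}

for value, (n, share) in freq_table.items():
    print(f"{value}  {n:>3}  {share:.0%}")
```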
Range of Values
Important if you care about the likelihood of a very good outcome (e.g., a blockbuster)
Or want to avoid a very bad outcome (e.g., healthcare)
Or care about inequality/dispersion (e.g., inventory management)
Interquartile Range
75th percentile minus 25th percentile (the spread of the middle 50% of the data)
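A small sketch comparing the full range with the interquartile range, assuming made-up delivery times. Note that percentile values depend on the interpolation method; `statistics.quantiles` defaults to the "exclusive" method, and other tools may give slightly different quartiles.

```python
import statistics

# Hypothetical delivery times in days; 15 is an unusually slow delivery.
days = [1, 2, 2, 3, 3, 4, 5, 6, 9, 15]

# Range: distance between the smallest and largest observation.
value_range = max(days) - min(days)

# quantiles(n=4) returns the three quartile cut points (25th, 50th, 75th).
q1, q2, q3 = statistics.quantiles(days, n=4)
iqr = q3 - q1

print(value_range, q1, q3, iqr)
```

The range is driven entirely by the two extreme values, whereas the interquartile range ignores them.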
Standard Deviation
sqrt(variance)
Gives the typical deviation from the mean, in the units of the variable
Lower means data are close to the mean
Higher means data are spread out
Coefficient of Variation
SD/Mean
Compares degree of variation between data sets, even when their units or scales differ
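To see why SD/Mean is useful for comparisons, here is a sketch with two invented stores whose sales differ by a factor of ten. It uses the population standard deviation (`statistics.pstdev`) as an assumption; a sample standard deviation (`statistics.stdev`) would work the same way.

```python
import statistics

# Hypothetical daily sales for two stores on very different scales.
small_store = [90, 100, 110]
large_store = [900, 1000, 1100]

def coefficient_of_variation(values):
    # Population SD divided by the mean; the result is unitless.
    return statistics.pstdev(values) / statistics.mean(values)

sd_small = statistics.pstdev(small_store)
sd_large = statistics.pstdev(large_store)
cv_small = coefficient_of_variation(small_store)
cv_large = coefficient_of_variation(large_store)

print(sd_small, sd_large, cv_small, cv_large)
```

The standard deviations differ tenfold because they are in sales units, but the coefficients of variation match: relative to its own mean, each store varies by the same degree.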
Summary Statistics
For numerical variables: # obs, mean, median, std dev, min, max
Data will have
Randomness
Measurement Error
True Unexplained Factors (things we didn’t realize impacted our variable of interest)
Tools to Uncover Systematic Relationships
Scatter Plot
Binned Scatter Plot
Coefficient of Correlation
Conditional Means
Cross Tab
Binned Scatter Plot
Reduces the number of data points by grouping x into bins and plotting the mean of y within each bin (a conditional mean)
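One way to sketch the binning step, without any plotting. The (x, y) pairs and the bin width of 10 are assumptions chosen for illustration; the result is one point per bin that could then be plotted in place of the raw scatter.

```python
from collections import defaultdict
import statistics

# Hypothetical (x, y) pairs, e.g. customer age and monthly spend.
points = [(21, 30), (24, 34), (28, 38), (33, 45), (37, 47), (39, 52),
          (42, 50), (46, 58), (48, 60)]

bin_width = 10  # assumed bin width; choose one that suits the data

# Group y values by the x bin they fall into.
bins = defaultdict(list)
for x, y in points:
    bin_start = (x // bin_width) * bin_width
    bins[bin_start].append(y)

# One point per bin: (bin start, mean of y within that bin).
binned = [(start, statistics.mean(ys)) for start, ys in sorted(bins.items())]

print(binned)
```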
Coefficient of Correlation
Measures direction and strength of the linear relationship between two quantitative variables
Sign = direction
Abs Val = strength
Always between -1 and 1; unitless; not a slope
X with Y = Y with X (symmetric)
A symmetric V shape gives roughly 0 because the downward and upward halves cancel out
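The properties above can be checked with a small hand-written Pearson correlation. The function name `pearson_r` and the data are illustrative choices, not taken from any course material.

```python
import math

def pearson_r(xs, ys):
    # Pearson correlation: covariance scaled by both standard deviations.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

x = [-2, -1, 0, 1, 2]
line = [2 * v + 1 for v in x]   # perfectly linear, upward sloping
v_shape = [abs(v) for v in x]   # symmetric V centered at 0

r_line = pearson_r(x, line)      # strength 1 even though the slope is 2
r_swapped = pearson_r(line, x)   # same value with the variables swapped
r_v = pearson_r(x, v_shape)      # near 0: the two halves cancel

print(r_line, r_swapped, r_v)
```

The V-shaped data has an obvious pattern, yet its correlation is essentially zero, which is why a scatter plot is worth checking alongside the coefficient.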
Conditional Means
Mean of y computed separately for each value of x, to show how y changes with x
y must be numerical
Best if x is discrete or has categories
Can be y vs a dummy variable x
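A minimal sketch of conditional means against a dummy variable. The membership flag and basket sizes are hypothetical.

```python
from collections import defaultdict
import statistics

# Hypothetical records: (is_member dummy, basket size in dollars).
records = [(0, 20), (1, 35), (0, 25), (1, 45), (0, 15), (1, 40)]

# Collect the y values observed for each value of the dummy x.
groups = defaultdict(list)
for is_member, basket in records:
    groups[is_member].append(basket)

# Mean of y conditional on each value of x.
conditional_means = {x: statistics.mean(ys) for x, ys in sorted(groups.items())}

print(conditional_means)
```

Comparing the two means shows how the average basket differs between members and non-members, without claiming that membership causes the difference.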
Cross Tab
Table showing how frequently each combination of two variables' categories occurs; row or column shares give conditional frequencies
E.g., airports on rows and buckets of delays on columns
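The airport example can be sketched as a nested dictionary of counts. The flights listed and the two delay buckets are invented for illustration.

```python
from collections import Counter

# Hypothetical flights: (airport, delay bucket).
flights = [("ORD", "0-15 min"), ("ORD", "15+ min"), ("ORD", "15+ min"),
           ("SFO", "0-15 min"), ("SFO", "0-15 min"), ("SFO", "15+ min")]

cell_counts = Counter(flights)
airports = sorted({airport for airport, _ in flights})
buckets = sorted({bucket for _, bucket in flights})

# Rows are airports, columns are delay buckets, cells are counts.
crosstab = {a: {b: cell_counts[(a, b)] for b in buckets} for a in airports}

# Row shares: fraction of each airport's flights that fall in each bucket.
row_shares = {a: {b: n / sum(row.values()) for b, n in row.items()}
              for a, row in crosstab.items()}

for a in airports:
    print(a, crosstab[a], row_shares[a])
```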
Confounding Effects
Mixture of effects: the observed relationship between two variables also reflects the influence of other variables
Descriptive Analysis Steps
1) Get to know data
2) Explore distribution
3) Explore correlations between key variables
4) Recognize that some variation is driven by factors we don’t know or can’t measure