Marketing Analytics Flashcards
What is data?
A distinct piece of information
Two main data types
Quantitative data takes on numeric values that allow us to perform mathematical operations (like the number of dogs). It can be divided into continuous and discrete data.
Categorical data is used to label a group or set of items (like dog breeds - Collies, Labs, Poodles, etc.). It can be divided into ordinal and nominal data.
Categorical Ordinal vs Categorical Nominal
We can divide categorical data further into two types: Ordinal and Nominal.
Categorical Ordinal data take on a ranked ordering (like rating an interaction with the dogs on a scale from Very Poor to Very Good).
Categorical Nominal data do not have an order or ranking (like the breeds of the dog).
Continuous vs. Discrete Data
Continuous data can be split into smaller and smaller units, and still a smaller unit exists. An example of this is the age of the dog - we can measure the units of the age in years, months, days, hours, seconds, but there are still smaller units that could be associated with the age.
Discrete data only takes on countable values. The number of dogs we interact with is an example of a discrete data type.
Four Aspects of Quantitative Data
There are four main aspects to analyzing Quantitative data.
Measures of Center
Measures of Spread
Shape of the Data
Outliers
Measures of Center
There are three measures of center:
Mean
Median
Mode
Calculating the Mean
Sum of all values divided by the count of values
Median
The median is the middle value of a dataset once it has been ordered from smallest to largest. With an even number of values, it is the average of the two middle values.
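The two calculations above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `dog_ages` data is made up, and averaging the two middle values for an even-sized dataset is the usual convention assumed here.

```python
def mean(values):
    """Sum of all values divided by the count of values."""
    return sum(values) / len(values)


def median(values):
    """Middle value of the data once sorted from smallest to largest."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: a single middle value exists.
        return ordered[mid]
    # Even count (assumed convention): average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical data: ages (in years) of five dogs.
dog_ages = [5, 1, 9, 3, 12]
print(mean(dog_ages))    # 6.0
print(median(dog_ages))  # 5
```

Python's standard library offers `statistics.mean` and `statistics.median` for the same purpose; the hand-written versions are shown only to make the steps visible.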
Mode
The mode is the most frequently observed value in our dataset.
There might be multiple modes for a particular dataset, or no mode at all.
No Mode
If all observations in our dataset are observed with the same frequency, there is no mode. If we have the dataset:
1, 1, 2, 2, 3, 3, 4, 4
There is no mode, because all observations occur the same number of times.
Many Modes
If two (or more) numbers share the maximum frequency, then there is more than one mode. If we have the dataset:
1, 2, 3, 3, 3, 4, 5, 6, 6, 6, 7, 8, 9
There are two modes, 3 and 6, because each appears 3 times, the maximum frequency, while all other values appear only once.
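The mode rules above, including the "no mode" and "many modes" cases, can be sketched as follows. The function name `modes` is hypothetical, and returning an empty list for "no mode" is an assumption made to match these cards; other tools (for example `statistics.multimode`) would instead report every value as a mode when all frequencies tie.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s), or [] when there is no mode."""
    counts = Counter(values)
    if not counts:
        return []
    # Convention from the cards: if several distinct values all occur
    # equally often, the dataset has no mode.
    if len(counts) > 1 and len(set(counts.values())) == 1:
        return []
    top = max(counts.values())
    return sorted(value for value, count in counts.items() if count == top)


print(modes([1, 1, 2, 2, 3, 3, 4, 4]))                    # []
print(modes([1, 2, 3, 3, 3, 4, 5, 6, 6, 6, 7, 8, 9]))     # [3, 6]
```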
Notation
Notation is a common language used to communicate mathematical ideas.
Random Variable
A random variable is a placeholder for the possible values of some process (the term ‘some process’ is admittedly a bit ambiguous). As stated before, notation is useful because it lets us condense complex ideas, often to a single letter or symbol. Random variables are represented by capital letters (X, Y, and Z are common choices).
We might have the random variable X, which is a placeholder for the possible values of the amount of time someone spends on our site. Or the random variable Y, which is a placeholder for the possible values of whether or not an individual purchases a product.
X is a placeholder for the values that could possibly occur for the amount of time spent on our website. That could be any non-negative number.
x1
First observed value of the random variable X
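As an illustration of how this notation compresses an idea, the mean ("sum of all values divided by the count of values") can be written with lowercase observed values. The symbols n (the number of observations) and x̄ (the mean of the observed values) are standard conventions assumed here; they do not appear on the cards.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \cdots + x_n}{n}
```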
Measures of Spread
Measures of Spread are used to provide us an idea of how spread out our data are from one another. Common measures of spread include:
Range
Interquartile Range (IQR)
Standard Deviation
Variance
Histograms
Histograms are useful for understanding the different aspects of quantitative data, such as measures of spread.
Calculating the 5 Number Summary
The five number summary consists of 5 values:
Minimum: The smallest value in the dataset.
Q1: The value such that 25% of the data fall below.
Q2: The value such that 50% of the data fall below.
Q3: The value such that 75% of the data fall below.
Maximum: The largest value in the dataset.
Essentially, each quartile is the median of a set of values: Q2 is the median of the whole dataset, Q1 is the median of the lower half, and Q3 is the median of the upper half.
Range
The range is calculated as the difference between the maximum and the minimum.
IQR
The interquartile range is calculated as the difference between Q3 and Q1.
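The five number summary, range, and IQR can be sketched together. This is one possible approach, assuming the "median of each half" method: Q1 is the median of the values below the middle and Q3 the median of the values above it, with the overall median left out of both halves when the count is odd. Other tools use interpolation rules that can give slightly different quartiles. The function name and the sample data are hypothetical.

```python
from statistics import median


def five_number_summary(values):
    """Min, Q1, Q2, Q3, max using medians of halves (needs >= 2 values)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]            # values below the middle
    upper = ordered[half + n % 2:]    # values above the middle (skips the median when n is odd)
    return {
        "min": ordered[0],
        "q1": median(lower),
        "q2": median(ordered),
        "q3": median(upper),
        "max": ordered[-1],
    }


# Hypothetical data: number of dogs seen on eight different days.
data = [1, 2, 3, 3, 5, 8, 10, 12]
summary = five_number_summary(data)
data_range = summary["max"] - summary["min"]   # maximum minus minimum
iqr = summary["q3"] - summary["q1"]            # Q3 minus Q1

print(summary)     # {'min': 1, 'q1': 2.5, 'q2': 4.0, 'q3': 9.0, 'max': 12}
print(data_range)  # 11
print(iqr)         # 6.5
```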
Box Plot
Useful for quickly comparing the spread of two datasets across a key metric.
How to measure spread with a single value?
Use the standard deviation or the variance.
Standard deviation vs Variance
Both tell us how far, on average, the data points are from the mean of the data.
The standard deviation is the square root of the variance.
In practice, you usually use the standard deviation rather than the variance, because the standard deviation shares the same units as the original data, while the variance has squared units.
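The relationship between the two measures can be sketched in code. This sketch assumes the population form of the variance (the average squared distance from the mean, dividing by n); the sample form divides by n - 1 instead, which is what `statistics.variance` computes. The `minutes_on_site` data is made up.

```python
import math


def variance(values):
    """Population variance: average squared distance from the mean."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / len(values)


def standard_deviation(values):
    """Square root of the variance, in the same units as the data."""
    return math.sqrt(variance(values))


# Hypothetical data: minutes spent on a site by eight visitors.
minutes_on_site = [2, 4, 4, 4, 5, 5, 7, 9]
print(variance(minutes_on_site))            # 4.0 (squared minutes)
print(standard_deviation(minutes_on_site))  # 2.0 (minutes)
```

The standard library's `statistics.pvariance` and `statistics.pstdev` compute the same population quantities.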
What is standard deviation used for?
If the data is associated with money, a higher standard deviation is associated with higher risk.
Standard deviation is also used to judge whether an observation is statistically significant or part of the expected variation.
Which Greek symbol is used to denote standard deviation?
Sigma (σ)
Normal distribution
A symmetrical, bell-shaped distribution where the mean = the median = the mode.
For a normal distribution, it might be sufficient to look only at the mean and standard deviation to reach a conclusion.
Right skewed shape
The mean is typically greater than the median, which is greater than the mode.
For skewed distributions, the five number summary might provide better insight than the mean and standard deviation.
Left skewed shape
The mean is typically less than the median, which is less than the mode.
For skewed distributions, the five number summary might provide better insight than the mean and standard deviation.
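The ordering of mean, median, and mode in skewed data can be checked on small examples. Both datasets below are made up to illustrate the pattern; real data will not always follow this ordering exactly, so it is best treated as a rule of thumb.

```python
from statistics import mean, median, mode

# Hypothetical right-skewed data: a long tail of large values.
right_skewed = [1, 1, 1, 2, 3, 4, 10]
# Hypothetical left-skewed data: a long tail of small values.
left_skewed = [1, 7, 8, 9, 10, 10, 10]

for name, data in [("right", right_skewed), ("left", left_skewed)]:
    print(name, "mean:", round(mean(data), 2),
          "median:", median(data), "mode:", mode(data))

# Right skew: the tail pulls the mean above the median and the mode.
print(mean(right_skewed) > median(right_skewed) > mode(right_skewed))  # True
# Left skew: the tail pulls the mean below the median and the mode.
print(mean(left_skewed) < median(left_skewed) < mode(left_skewed))     # True
```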
Descriptive Statistics
Descriptive statistics is about describing our collected data.
Inferential Statistics
Inferential statistics is about using our collected data to draw conclusions about a larger population.
Population - our entire group of interest.
Parameter - a numeric summary about a population.
Sample - a subset of the population.
Statistic - a numeric summary about a sample.
Parameter vs Statistic
A parameter is a number describing a whole population (e.g., population mean), while a statistic is a number describing a sample (e.g., sample mean).
The goal of quantitative research is to understand characteristics of populations by finding parameters.
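The difference between a parameter and a statistic can be sketched with simulated data. Everything here is hypothetical: the population is generated at random (uniform between 0 and 30 minutes on a site), whereas in real research the full population is usually not available, which is why a sample statistic is used to estimate the parameter.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: minutes on site for 10,000 visitors.
population = [random.uniform(0, 30) for _ in range(10_000)]

# Sample: a random subset of 100 visitors from the population.
sample = random.sample(population, 100)

parameter = mean(population)  # numeric summary of the whole population
statistic = mean(sample)      # numeric summary of the sample

print(f"parameter (population mean): {parameter:.2f}")
print(f"statistic (sample mean):     {statistic:.2f}")
```

The two printed values should be close but not identical; the gap between them is the sampling error that inferential statistics tries to account for.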