data science - statistics I Flashcards
EDA
Exploratory Data Analysis - typically the first step of a data science project: familiarizing yourself with the data
continuous data
data that can take any value in an interval (float)
discrete data
data that can only take integer values, such as counts (int)
categorical data
data that can take on only a specific set of values representing a set of possible categories
binary data
special case of categorical data with just two category values (ex: 0/1, true/false)
ordinal data
categorical data that has an explicit ordering
feature
often a column in a table, attribute/predictor of a row of data
record
a row in a table
data scientists use features to predict a target, while statisticians..
use predictor variables in a model to predict a response/dependent variable
trimmed mean
average of all values after dropping a fixed number of extreme values from each end of the sorted data
robust
not sensitive to extreme values
x-bar
the mean of a sample drawn from a population
reasons to use a weighted mean
1) observations that are highly variable may be given a lower weight (ex: a sensor that is less accurate)
2) the data collected doesn't equally represent the different groups that we are interested in measuring (ex: give greater weight to underrepresented groups)
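A minimal sketch of a weighted mean for the sensor case. The readings and weights are made-up illustration values, and `weighted_mean` is a hypothetical helper name, not something from the deck:

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Hypothetical readings from three sensors; the third sensor is assumed
# to be less accurate, so it gets a lower weight.
readings = [10.0, 12.0, 20.0]
weights = [1.0, 1.0, 0.5]

plain = sum(readings) / len(readings)       # ordinary mean: 14.0
weighted = weighted_mean(readings, weights)  # pulled toward the trusted sensors
print(plain, weighted)
```

The less trusted reading of 20.0 moves the weighted mean less than it moves the plain mean.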
why is the median more robust than mean as an estimate of central tendency
It isn’t influenced by outliers / extreme cases that could skew the results
what is thought to be a compromise between mean and median
trimmed mean - robust to extreme values in data, but uses more data to calculate the estimate for central tendency than median
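To see how the three estimates react to one extreme value, here is a small sketch. The data are invented, and `trimmed_mean` is a hypothetical helper that drops `k` values from each end:

```python
from statistics import mean, median


def trimmed_mean(values, k):
    """Mean after dropping the k smallest and k largest values."""
    if k < 0 or 2 * k >= len(values):
        raise ValueError("k must leave at least one value")
    ordered = sorted(values)
    return mean(ordered[k:len(ordered) - k])


# Made-up data with one extreme value (100).
data = [2, 3, 3, 4, 5, 5, 6, 100]

print(mean(data))             # 16, dragged up by the outlier
print(median(data))           # 4.5
print(trimmed_mean(data, 1))  # about 4.33, uses 6 of the 8 values
```

The trimmed mean lands close to the median while still averaging most of the data.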
variance
sum of squared deviations from the mean divided by n - 1 (aka: mean-squared-error)
standard deviation
the square root of variance (aka: L2-norm, Euclidean norm)
range
difference between largest and smallest value in a dataset
interquartile range
the difference between the 75th percentile and the 25th percentile
mean absolute deviation
mean of the absolute values of the deviations from the mean
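A sketch of the range, interquartile range, and mean absolute deviation on a made-up sample. Note the assumption on percentiles: `statistics.quantiles` uses its default "exclusive" method here, and other percentile conventions can give a slightly different IQR:

```python
from statistics import mean, quantiles

# Made-up sample for illustration.
sample = [2, 4, 4, 4, 5, 5, 7, 9]

# Range: largest value minus smallest value.
value_range = max(sample) - min(sample)

# IQR: 75th percentile minus 25th percentile.
# quantiles(..., n=4) returns the three quartile cut points.
q1, _q2, q3 = quantiles(sample, n=4)
iqr = q3 - q1

# Mean absolute deviation: average distance from the mean.
center = mean(sample)
mean_abs_dev = mean(abs(x - center) for x in sample)

print(value_range, iqr, mean_abs_dev)
```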
Why use n-1 instead of n when calculating variance?
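Dividing by n gives a biased estimate that tends to underestimate the true variance and standard deviation of the population. Dividing by n - 1 (the degrees of freedom) makes the variance estimate unbiased.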
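The two denominators can be compared directly with the standard library, where `pvariance` divides by n and `variance` divides by n - 1. The sample values are made up:

```python
from math import sqrt
from statistics import pvariance, stdev, variance

# Made-up sample: mean is 5, sum of squared deviations is 32.
sample = [2, 4, 4, 4, 5, 5, 7, 9]

var_n = pvariance(sample)         # 32 / 8, divides by n
var_n_minus_1 = variance(sample)  # 32 / 7, divides by n - 1
sd = stdev(sample)                # square root of the n - 1 variance

print(var_n, var_n_minus_1, sd)
```

The n - 1 version is always the larger of the two, which is the correction for the underestimate.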
what measure of variability is most robust to extremes / outliers?
median absolute deviation from the median (MAD) = median(|x1 - m|, |x2 - m|, ..., |xN - m|), where m is the median
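A minimal sketch of the median absolute deviation, checked with and without an outlier. The data are invented and `median_abs_deviation` is a hypothetical helper name:

```python
from statistics import median


def median_abs_deviation(values):
    """Median of the absolute deviations from the median."""
    m = median(values)
    return median([abs(x - m) for x in values])


# Same made-up data twice: once with a mild last value, once with an outlier.
without_outlier = [2, 3, 3, 4, 5, 5, 6, 7]
with_outlier = [2, 3, 3, 4, 5, 5, 6, 100]

print(median_abs_deviation(without_outlier))  # 1.5
print(median_abs_deviation(with_outlier))     # 1.5, unchanged by the outlier
```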
Graph that shows the min/max, IQR, median
box plot, box and whisker plot
tally of the count of data falling into intervals/bins
frequency table
plot of the frequency table
histogram
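A frequency table and a rough text histogram can be sketched with equal-width bins. The data, the bin count, and the `frequency_table` helper are all assumptions made for illustration:

```python
def frequency_table(values, n_bins):
    """Count values in n_bins equal-width bins between min and max."""
    lo, hi = min(values), max(values)
    if n_bins < 1 or hi == lo:
        raise ValueError("need n_bins >= 1 and at least two distinct values")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in values:
        # The maximum value is placed in the last bin.
        idx = min(int((x - lo) / width), n_bins - 1)
        counts[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, c)
            for i, c in enumerate(counts)]


# Made-up data.
data = [1, 2, 2, 3, 5, 6, 8, 9, 9, 10]
table = frequency_table(data, 3)

# Plot of the frequency table: one row of '#' marks per bin.
for left, right, count in table:
    print(f"{left:5.1f} to {right:5.1f} | {'#' * count}")
```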
smoothed version of a histogram
density plot
When the categories can be associated with a numeric value, this gives an average value based on a category’s probability of occurrence.
expected value
How is the expected value calculated (2 steps)
A marketer for a new cloud technology, for example, offers two levels of service, one priced at $300/month and another at $50/month. The marketer offers free webinars to generate leads, and the firm figures that 5% of the attendees will sign up for the $300 service, 15% for the $50 service, and 80% will not sign up for anything. This data can be summed up, for financial purposes, in a single “expected value,” which is a form of weighted mean in which the weights are probabilities.
The expected value is calculated as follows:
- Multiply each outcome by its probability of occurring.
- Sum these values.
In the cloud service example, the expected value of a webinar attendee is thus $22.50 per month, calculated as follows:
EV = (0.05)(300) + (0.15)(50) + (0.80)(0) = 22.5
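The same two steps, sketched in code with the prices and sign-up rates from the webinar example (the variable names are illustrative only):

```python
from math import isclose

# (monthly revenue, probability) for each outcome in the webinar example.
outcomes = [(300, 0.05), (50, 0.15), (0, 0.80)]

# The probabilities should cover every outcome, so they must sum to 1.
total_probability = sum(p for _, p in outcomes)
if not isclose(total_probability, 1.0):
    raise ValueError("probabilities must sum to 1")

# Step 1: multiply each outcome by its probability. Step 2: sum.
expected_value = sum(value * p for value, p in outcomes)
print(round(expected_value, 2))
```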
Exploratory data analysis often begins with what 3 things?
- univariate analysis
- examining correlation among predictors (features)
- examining correlation among features and the target
A metric that measures the extent to which numeric variables are associated with one another (ranges from –1 to +1).
correlation coefficient (r), also called Pearson's correlation coefficient
A table where the variables are shown on both rows and columns, and the cell values are the correlations between the variables.
correlation matrix
A plot in which the x-axis is the value of one variable, and the y-axis the value of another.
scatterplot
How do we compute the correlation coefficient?
multiply the deviations from the mean for variable 1 by those for variable 2, sum the products, and divide by (n - 1) times the product of the two standard deviations:
r = sum((xi - xbar)(yi - ybar)) / ((n - 1) * SDx * SDy)
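A direct translation of the formula, as a sketch. `pearson_r` is a hypothetical helper name, the data are made up, and the function assumes neither variable is constant (otherwise a standard deviation is zero and the division fails):

```python
from statistics import mean, stdev


def pearson_r(xs, ys):
    """Correlation coefficient using the (n - 1) * SDx * SDy denominator."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with n >= 2")
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of products of paired deviations from the means.
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # stdev() divides by n - 1, matching the formula.
    return cross / ((n - 1) * stdev(xs) * stdev(ys))


xs = [1, 2, 3, 4, 5]
r_up = pearson_r(xs, [2, 4, 6, 8, 10])    # perfectly increasing line
r_down = pearson_r(xs, [10, 8, 6, 4, 2])  # perfectly decreasing line
print(r_up, r_down)
```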
what is an example where correlation coefficient isn’t a useful metric
when the relationship is not linear
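A small sketch of the nonlinear case: y is completely determined by x, yet the correlation coefficient comes out as zero. The data are made up, and `pearson_r` is a hypothetical helper that follows the formula with the (n - 1) * SDx * SDy denominator:

```python
from statistics import mean, stdev


def pearson_r(xs, ys):
    """Correlation coefficient using the (n - 1) * SDx * SDy denominator."""
    x_bar, y_bar = mean(xs), mean(ys)
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / ((len(xs) - 1) * stdev(xs) * stdev(ys))


# y = x squared: a perfect but non-linear (U-shaped) relationship.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x * x for x in xs]

r = pearson_r(xs, ys)
print(r)  # essentially zero despite the exact dependence
```

The positive and negative halves of the U cancel out, so a scatterplot is a better check than the coefficient alone.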
A tally of counts between two or more categorical variables.
Contingency tables
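A contingency table can be sketched with `collections.Counter`. The loan grades and statuses below are hypothetical records invented for illustration:

```python
from collections import Counter

# Hypothetical records: (loan grade, loan status) pairs.
records = [
    ("A", "paid"), ("A", "paid"), ("A", "late"),
    ("B", "paid"), ("B", "late"), ("B", "late"),
    ("C", "late"),
]

counts = Counter(records)
grades = sorted({grade for grade, _ in records})
statuses = sorted({status for _, status in records})

# Rows are grades, columns are statuses; unseen combinations count as 0.
table = {g: {s: counts[(g, s)] for s in statuses} for g in grades}

print("grade", *statuses, sep="\t")
for g in grades:
    print(g, *(table[g][s] for s in statuses), sep="\t")
```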
A plot of two numeric variables with the records binned into hexagons.
Hexagonal binning
A plot showing the density of two numeric variables like a topographical map.
Contour plots
Similar to a boxplot but showing the density estimate.
Violin plots
scatterplots are good for smaller amounts of data; what are good alternatives for large amounts of data?
hexagonal binning, contour plots, heat maps