data science - statistics I Flashcards
EDA
Exploratory Data Analysis - typically the first step of a data science project: familiarizing yourself with the data
continuous data
data that can take any value in an interval (float)
discrete data
data that can only take integer values, such as counts (int)
categorical data
data that can take on only a specific set of values representing a set of possible categories
binary data
special case of categorical data with just two categories of values (ex: 0/1, true/false)
ordinal data
categorical data that has an explicit ordering
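As a rough illustration of the five data types above, here is one way each might be represented in plain Python. All variable names and values (`wind_speed`, `SCREEN_TYPES`, `RATING_ORDER`, etc.) are made up for this sketch, and `ordinal_rank` is a hypothetical helper, not a library function.

```python
# Hypothetical example values, one per data type from the cards above.
wind_speed = 12.7                          # continuous: any value in an interval (float)
event_count = 3                            # discrete: integer values such as counts (int)
SCREEN_TYPES = {"plasma", "LCD", "LED"}    # categorical: a fixed set of possible categories
is_fraud = False                           # binary: categorical with just two values
RATING_ORDER = ["low", "medium", "high"]   # ordinal: categories with an explicit ordering


def ordinal_rank(value, order=RATING_ORDER):
    """Return the position of an ordinal value within its ordering."""
    return order.index(value)


# Ordinal values can be compared by rank; plain categorical values cannot.
print(ordinal_rank("low") < ordinal_rank("high"))
```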
feature
often a column in a table, attribute/predictor of a row of data
record
a row in a table
data scientists use features to predict a target, while statisticians...
use predictor variables in a model to predict a response/dependent variable
trimmed mean
avg of all values after dropping a fixed number of extreme values
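A minimal sketch of the trimmed mean, assuming the "fixed number" `k` is dropped from each end of the sorted data (a common convention, though the card does not say so explicitly). The function name `trimmed_mean` and the sample `scores` are illustrative only.

```python
def trimmed_mean(values, k):
    """Mean of the values left after dropping the k smallest and k largest."""
    if k < 0 or 2 * k >= len(values):
        raise ValueError("k must satisfy 0 <= 2*k < len(values)")
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]  # discard k values from each end
    return sum(kept) / len(kept)


scores = [1, 2, 3, 4, 100]             # 100 is an extreme value
print(sum(scores) / len(scores))       # ordinary mean: 22.0
print(trimmed_mean(scores, 1))         # drops 1 and 100: 3.0
```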
robust
not sensitive to extreme values
x-bar
mean of a sample drawn from a population
reasons to use a weighted mean
1) observations that are highly variable may be given a lower weight (ex: a sensor that is less accurate)
2) data collected doesn’t equally represent the different groups that we are interested in measuring (ex: give greater weight to underrepresented minorities)
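The first reason above can be sketched as follows: each value is multiplied by its weight, and the sum is divided by the total weight. The `readings` and `weights` are invented for illustration; here the third sensor is assumed to be less accurate, so it gets half the weight of the others.

```python
def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(x * w for x, w in zip(values, weights)) / total_weight


readings = [10.0, 12.0, 20.0]   # hypothetical sensor readings
weights = [1.0, 1.0, 0.5]       # assumed: the third sensor is less accurate

print(sum(readings) / len(readings))      # unweighted mean: 14.0
print(weighted_mean(readings, weights))   # the noisy reading counts for less
```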
why is the median more robust than mean as an estimate of central tendency
It isn’t influenced by outliers / extreme cases that could skew the results
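A small demonstration of that robustness, using Python's standard `statistics` module on made-up numbers: adding one extreme value moves the mean a long way but barely shifts the median.

```python
import statistics

incomes = [30, 32, 35, 38, 40]       # hypothetical values, no outliers
with_outlier = incomes + [1000]      # same data plus one extreme case

# Without the outlier, mean and median agree.
print(statistics.mean(incomes), statistics.median(incomes))

# With the outlier, the mean is pulled far above most of the data,
# while the median only moves to the midpoint of 35 and 38.
print(statistics.mean(with_outlier), statistics.median(with_outlier))
```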
what is thought to be a compromise between mean and median
trimmed mean - robust to extreme values in the data, but uses more of the data than the median does to estimate central tendency