EDA Flashcards
Continuous vs Discrete Data
Numerical data
1) Data that can take on any value in an interval
2) Data that can take on only integer values, such as counts
Binary vs Ordinal Data
Categorical data - data that can take on only a specific set of values representing a set of possible categories
1) special case of categorical data with just 2 categories of values
2) categorical data that has an explicit ordering
Advantages of explicit identification of data
1) tells software how statistical procedures, such as making a chart or fitting a model, should behave
Example: sklearn.preprocessing.OrdinalEncoder
2) storage and indexing can be optimized
3) the possible values a given categorical variable can take are enforced in the software
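To make the ordering and enforcement points concrete, here is a minimal plain-Python sketch of ordinal encoding. It is not how sklearn.preprocessing.OrdinalEncoder is implemented; the level names and the encode_ordinal helper are hypothetical and exist only for illustration.

```python
# Hypothetical ordered category levels; the names are illustrative only.
SIZE_LEVELS = ["small", "medium", "large"]


def encode_ordinal(values, levels):
    """Map each category to its position in `levels`, rejecting unknown values."""
    index = {level: i for i, level in enumerate(levels)}
    codes = []
    for value in values:
        if value not in index:
            # Enforcement: a value outside the declared categories is an error.
            raise ValueError(f"unexpected category: {value!r}")
        codes.append(index[value])
    return codes


codes = encode_ordinal(["medium", "small", "large", "medium"], SIZE_LEVELS)
print(codes)  # [1, 0, 2, 1]
```

Because the codes follow the declared order, comparisons such as "small < large" carry over to the integers, and an undeclared value fails loudly instead of silently creating a new category.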
Types of nonrectangular data structures
1) time series - successive measurements of the same variable, raw material for statistical forecasting methods
2) spatial data structure - mapping and location
3) graph data structures - physical, social, and abstract relationships
Estimate
Getting a typical value for each feature: an estimate of where most of the data is located
Mean
Average
Sum of all values divided by the number of values
Weighted mean
weighted average
Sum of each value times its weight, divided by the sum of the weights
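The two definitions above can be sketched in a few lines of Python. The data and weights are made up for illustration.

```python
values = [2.0, 4.0, 6.0]   # illustrative data
weights = [1.0, 1.0, 2.0]  # illustrative weights, one per value

# Mean: sum of all values divided by the number of values.
mean = sum(values) / len(values)

# Weighted mean: sum of value * weight divided by the sum of the weights.
weighted_mean = sum(x * w for x, w in zip(values, weights)) / sum(weights)

print(mean)           # 4.0
print(weighted_mean)  # 4.5
```

The larger weight on 6.0 pulls the weighted mean above the plain mean.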
Median
Value such that half of the data lies above it and half below
Percentile
Quantile
Value such that p percent of the data lies below
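A short sketch of median and percentiles using Python's standard statistics module. The data is illustrative, and note that several percentile conventions exist; the "inclusive" method used here is just one of them, so other tools may return slightly different cut points.

```python
import statistics

data = [1, 3, 5, 7, 9, 11, 13]  # illustrative data

# Median: the middle value of the sorted data.
median = statistics.median(data)

# quantiles() returns the n - 1 cut points that split the data into n groups;
# with n=4 these correspond to the 25th, 50th, and 75th percentiles.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

print(median)      # 7
print(q1, q2, q3)  # 4.0 7.0 10.0
```

The 50th percentile coincides with the median, as the definitions imply.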
Weighted median
Value such that, in the sorted data, half of the sum of the weights lies above it and half below
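One way to sketch a weighted median in Python is shown below. The weighted_median helper is hypothetical, and it follows one simple convention (return the first sorted value at which the cumulative weight reaches half the total, with no interpolation); other conventions handle ties differently.

```python
def weighted_median(values, weights):
    """Return the first sorted value whose cumulative weight reaches half the total."""
    pairs = sorted(zip(values, weights))  # sort by value, keeping weights attached
    half = sum(weights) / 2
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= half:
            return value
    raise ValueError("values must be non-empty")


print(weighted_median([1, 2, 3, 4], [1, 1, 1, 5]))  # 4
print(weighted_median([1, 2, 3], [1, 1, 1]))        # 2
```

With equal weights the result matches the ordinary median of an odd-length data set; a heavy weight on one value drags the weighted median toward it.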
Trimmed mean
truncated mean
Average of all values after dropping a fixed number of extreme values
Eliminates influence of extreme values
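A minimal sketch of a trimmed mean, assuming the fixed number of values (called p here, a name chosen for illustration) is dropped from each end of the sorted data. The trimmed_mean helper is hypothetical.

```python
def trimmed_mean(values, p):
    """Mean of the values left after dropping the p smallest and p largest."""
    if p < 0 or 2 * p >= len(values):
        raise ValueError("p must leave at least one value")
    ordered = sorted(values)
    kept = ordered[p:len(ordered) - p]
    return sum(kept) / len(kept)


data = [1, 2, 3, 4, 100]      # illustrative data with one extreme value
print(sum(data) / len(data))  # 22.0
print(trimmed_mean(data, 1))  # 3.0
```

Dropping one value from each end removes the 100, so the trimmed mean lands near the bulk of the data.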
Robust
Not sensitive to extreme values
Median is a robust estimator
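Robustness can be seen by adding one extreme value to a small data set and comparing how far the mean and the median move. The numbers are illustrative.

```python
import statistics

data = [10, 12, 11, 13, 12]    # illustrative data
with_outlier = data + [1000]   # same data plus one extreme value

mean_shift = statistics.mean(with_outlier) - statistics.mean(data)
median_shift = statistics.median(with_outlier) - statistics.median(data)

print(mean_shift)    # large: the mean is pulled toward the extreme value
print(median_shift)  # 0.0: the median does not move in this example
```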
Outlier
Data value that is very different from most of the data
Metrics vs estimates
Estimate - statisticians' term; accounts for uncertainty and draws a distinction between what we see in the data and the theoretical true or exact state of affairs
Metric - reflects the concrete business or organizational objectives that are the focus of data science
anomaly detection
The outliers are the points of interest, and the greater mass of data serves primarily to define the "normal" against which anomalies are measured
Dispersion/variability
Measures whether the data values are tightly clustered or spread out
Heart of stats: measuring variability, reducing it, distinguishing random from real variability, identifying the various sources of real variability, and making decisions in the presence of it
Deviations
Difference between the observed values and the estimate of location (mean)
Errors, residuals
Variance
Sum of squared deviations from the mean, divided by n - 1, where n is the number of data values
an average of the squared deviations
Mean squared error
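The deviation and variance definitions above can be worked through directly in Python and checked against the standard library's statistics.variance, which also divides by n - 1. The data is illustrative.

```python
import statistics

data = [1, 2, 3, 4, 5]  # illustrative data
n = len(data)
mean = sum(data) / n

# Deviations: observed value minus the estimate of location (the mean).
deviations = [x - mean for x in data]

# Variance: sum of squared deviations divided by n - 1.
variance = sum(d ** 2 for d in deviations) / (n - 1)

print(deviations)                 # [-2.0, -1.0, 0.0, 1.0, 2.0]
print(variance)                   # 2.5
print(statistics.variance(data))  # 2.5
```

The deviations sum to zero, which is why they are squared before averaging.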