EDA Flashcards
Continuous vs Discrete Data
Numerical data
1) Data that can take on any value in an interval
2) Data that can take on only integer values, such as counts
Binary vs Ordinal Data
Categorical data - specific set of values representing a set of possible categories
1) special case of categorical data with just 2 categories of values
2) categorical data that has an explicit ordering
Advantages of explicit identification of data
1) tells software how statistical procedures, such as making a chart or fitting a model, should behave
e.g. sklearn.preprocessing.OrdinalEncoder
2) storage and indexing can be optimized
3) the possible values a given categorical variable can take are enforced in the software
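A minimal sketch of the enforcement idea, assuming a hypothetical three-level size variable. The names `SIZES` and `encode_ordinal` are illustrative and are not part of scikit-learn; the helper only mimics the idea of mapping ordered categories to integers and rejecting undeclared values.

```python
# Hypothetical helper (not the scikit-learn API): encode an ordered
# categorical variable as integers and reject values outside the declared set.
SIZES = ["small", "medium", "large"]  # explicit ordering, assumed for illustration
SIZE_CODE = {label: code for code, label in enumerate(SIZES)}


def encode_ordinal(values, codes=SIZE_CODE):
    """Map each category to its integer rank; raise on unknown categories."""
    unknown = [v for v in values if v not in codes]
    if unknown:
        raise ValueError(f"unknown categories: {unknown}")
    return [codes[v] for v in values]


encoded = encode_ordinal(["medium", "small", "large", "small"])
print(encoded)  # [1, 0, 2, 0]
```

Storing integer codes instead of strings is also what makes the storage and indexing optimization possible.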
Types of nonrectangular data structures
1) time series - successive measurements of the same variable, raw material for statistical forecasting methods
2) spatial data structure - mapping and location
3) graph data structures - physical, social, and abstract relationships
Estimate
Getting a typical value for each feature: an estimate of where most of the data is located
Mean
Average
Sum of all values divided by the number of values
Weighted mean
weighted average
Sum of each value times its weight, divided by the sum of the weights
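A short sketch comparing the plain mean with the weighted mean. The data, the weights, and the helper name `weighted_mean` are made up for illustration.

```python
from statistics import mean


def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


values = [2, 4, 6]   # illustrative data
weights = [1, 1, 2]  # illustrative weights: the last value counts double

plain = mean(values)                       # (2 + 4 + 6) / 3
weighted = weighted_mean(values, weights)  # (2 + 4 + 12) / 4
print(plain, weighted)  # 4 4.5
```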
Median
Value such that half of the data lies above it and half below
Percentile
Quantile
Value such that p percent of the data lies below
Weighted median
Value in the sorted data such that half of the sum of the weights lies above it and half below
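One possible way to compute a weighted median, sketched under the assumption that the first sorted value whose cumulative weight reaches half the total is returned (other tie-breaking conventions exist). The function name and data are illustrative.

```python
def weighted_median(values, weights):
    """Return the first sorted value at which the running weight reaches half the total."""
    pairs = sorted(zip(values, weights))  # sort by value, keeping each weight attached
    half = sum(weights) / 2
    running = 0
    for value, weight in pairs:
        running += weight
        if running >= half:
            return value
    raise ValueError("empty input")


# The heavy weight on 4 pulls the weighted median up to 4,
# while the unweighted median of the same values is 2.5.
print(weighted_median([1, 2, 3, 4], [1, 1, 1, 5]))  # 4
```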
Trimmed mean
truncated mean
Average of all values after dropping a fixed number of extreme values
Eliminates influence of extreme values
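A minimal sketch of a trimmed mean, assuming the same number `k` of values is dropped from each end of the sorted data. The helper name and data are illustrative.

```python
def trimmed_mean(values, k):
    """Mean after dropping the k smallest and the k largest values."""
    if k < 0 or 2 * k >= len(values):
        raise ValueError("k must satisfy 0 <= k < len(values) / 2")
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


data = [1, 2, 3, 4, 100]          # 100 is an extreme value
plain = sum(data) / len(data)     # 110 / 5
trimmed = trimmed_mean(data, 1)   # (2 + 3 + 4) / 3
print(plain, trimmed)  # 22.0 3.0
```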
Robust
Not sensitive to extreme values
Median is a robust estimator
Outlier
Data value that is very different from most of the data
Metrics vs estimates
Estimate - statisticians account for uncertainty, drawing a distinction between what we see in the data and the theoretical true or exact state of affairs
Metric - concrete business or organizational objectives at the focus of data science
Anomaly detection
The points of interest are the outliers, and the greater mass of data serves primarily to define the “normal” against which anomalies are measured
Dispersion/variability
Measures whether the data values are tightly clustered or spread out
Heart of statistics: measuring variability, reducing it, distinguishing random from real variability, identifying the various sources of real variability, and making decisions in the presence of it
Deviations
Difference between the observed values and the estimate of location (mean)
Errors, residuals
Variance
Sum of squared deviations from the mean divided by n-1 where n is the number of data values
an average of the squared deviations
Mean squared error
Standard deviation
Square root of the variance
L2-norm, Euclidean norm
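A small sketch of the sample variance and standard deviation, written out by hand so the n-1 denominator is visible. The data and function names are illustrative; the standard library's `statistics.variance` and `statistics.stdev` use the same n-1 convention.

```python
from math import sqrt


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    m = sum(values) / n
    return sum((x - m) ** 2 for x in values) / (n - 1)


def sample_std(values):
    """Square root of the sample variance."""
    return sqrt(sample_variance(values))


data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean is 5, squared deviations sum to 32
var = sample_variance(data)      # 32 / 7
std = sample_std(data)
```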
Mean absolute deviation
Mean of the absolute value of the deviations from the mean
L1-norm, Manhattan norm
Median absolute deviation from the median
The median of the absolute value of the deviations from the median
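A sketch contrasting the two deviation measures on data with one extreme value. Here the mean absolute deviation is taken around the mean, and the median absolute deviation is taken around the median, as that card's title states. Function names and data are illustrative.

```python
from statistics import mean, median


def mean_absolute_deviation(values):
    """Mean of |x - mean|."""
    m = mean(values)
    return mean(abs(x - m) for x in values)


def median_absolute_deviation(values):
    """Median of |x - median|; robust to extreme values."""
    med = median(values)
    return median(abs(x - med) for x in values)


data = [1, 2, 3, 4, 100]
mean_ad = mean_absolute_deviation(data)      # (21 + 20 + 19 + 18 + 78) / 5
median_ad = median_absolute_deviation(data)  # median of [2, 1, 0, 1, 97]
```

The single value 100 inflates the mean-based measure to about 31.2, while the median-based measure stays at 1.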
Range
Difference between the largest and the smallest value in a dataset
Order statistics
Metrics based on the data values sorted from smallest to biggest
Percentile
Value such that P percent of the values take on this value or less and (100-P) percent take on this value or more
Quantile
Interquartile range
Difference between the 75th percentile and the 25th percentile
IQR
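A minimal sketch of quartiles and the interquartile range using the standard library. Percentile conventions differ between tools; `method="inclusive"` is one choice among several, picked here only for illustration, and the data is made up.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 returns the three cut points: 25th, 50th, and 75th percentiles.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
print(q1, q2, q3, iqr)  # 3.0 5.0 7.0 4.0
```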
Degrees of freedom
n-1 in the denominator instead of n
If you use n you will underestimate the true value of the variance and the standard deviation in the population - biased
When you do n-1, the variance becomes an unbiased estimate
Degrees of freedom = takes into account the number of constraints in computing an estimate
One constraint - standard deviation depends on calculating the sample mean
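A small sketch of why n-1 gives an unbiased estimate, assuming a toy population of three values and every possible sample of size 2 drawn with replacement. Averaging over all samples, the n-denominator variance falls short of the population variance, while the n-1 version matches it.

```python
from itertools import product
from statistics import pvariance, variance

population = [1, 2, 3]
true_variance = pvariance(population)  # population variance, 2/3

# All 9 ordered samples of size 2, drawn with replacement.
samples = list(product(population, repeat=2))

# pvariance divides by n (biased); variance divides by n - 1 (unbiased).
avg_biased = sum(pvariance(s) for s in samples) / len(samples)
avg_unbiased = sum(variance(s) for s in samples) / len(samples)

print(true_variance, avg_biased, avg_unbiased)
```

With samples of size 2, the n-denominator average comes out at half the true variance, which is the (n-1)/n shrinkage the card describes.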
Boxplot
Visualize distribution of the data
Top and the bottom of the box are 75th and 25th percentiles, respectively
Median is the horizontal line in the box
Whiskers - extend from the top and bottom of the box to indicate the range of the bulk of the data
Frequency table
Tally of the count of numeric data values that fall into a set of intervals
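A sketch of a frequency table with equal-width bins, assuming the bins span the minimum to the maximum of the data and the maximum value is counted in the last bin. The function name and data are illustrative.

```python
def frequency_table(values, n_bins):
    """Count values falling into n_bins equal-width intervals from min to max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; cannot form bins")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in values:
        # Clamp so the maximum value lands in the last bin.
        index = min(int((x - lo) / width), n_bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts


data = [1, 2, 2, 3, 5, 7, 8, 9, 9, 10]
edges, counts = frequency_table(data, 3)
print(edges)   # [1.0, 4.0, 7.0, 10.0]
print(counts)  # [4, 1, 5]
```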
Histogram
Plot of the frequency table with the bins on the x-axis and the count on the y-axis
Density plot
Smooth version of the histogram, often based on a kernel density estimate
Why make bins?
Both frequency tables and percentiles summarize the data by creating bins
In general, quartiles and deciles will have the same count in each bin (equal-count bins), but bin size will be different
Small bins = result is too granular and the ability to see the bigger picture is lost
Statistical moments
1) location
2) variability
3) skewness
4) kurtosis
Skewness
Refers to whether the data is skewed to larger or smaller values
Kurtosis
Propensity of the data to have extreme values
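A rough sketch of skewness and kurtosis computed from central moments. The cards give no formula, so the moment-based definitions below (third and fourth central moments scaled by powers of the second) are an assumption; statistical packages often apply additional corrections.

```python
def central_moment(values, k):
    """Average of (x - mean) ** k, using n in the denominator."""
    n = len(values)
    m = sum(values) / n
    return sum((x - m) ** k for x in values) / n


def skewness(values):
    """Third central moment divided by the second raised to the power 1.5."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def kurtosis(values):
    """Fourth central moment divided by the square of the second."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 1, 10]  # one large value stretches the right tail

print(skewness(symmetric), skewness(right_skewed))
print(kurtosis(symmetric), kurtosis(right_skewed))
```

Under these definitions the symmetric data has skewness 0, and the data with one large value has positive skewness and higher kurtosis.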
Density estimates, density plot
Smoothed histogram
A density plot corresponds to plotting the histogram as a proportion rather than counts
Mode
The most commonly occurring category or value in a data set
Expected value
When the categories can be associated with a numeric value, this gives an average value based on a category’s probability of occurrence
1) multiply each outcome by its probability of occurring
2) sum these values
- a form of weighted mean that adds the ideas of future expectations and probability weights
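The two steps can be sketched as follows, assuming three hypothetical outcomes with made-up values and probabilities. The function name `expected_value` is illustrative.

```python
from math import isclose


def expected_value(outcomes):
    """outcomes: (value, probability) pairs whose probabilities sum to 1."""
    total_p = sum(p for _, p in outcomes)
    if not isclose(total_p, 1.0):
        raise ValueError("probabilities must sum to 1")
    # Step 1: multiply each outcome by its probability. Step 2: sum.
    return sum(value * p for value, p in outcomes)


# Hypothetical outcomes: 5% yield 300, 15% yield 50, 80% yield 0.
plans = [(300, 0.05), (50, 0.15), (0, 0.80)]
ev = expected_value(plans)  # 0.05 * 300 + 0.15 * 50 + 0.80 * 0 = 22.5, up to rounding
```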
Bar charts
Frequency or proportion for each category plotted as bars
Pie charts
Frequency or proportion for each category plotted as wedges in a pie
Correlation coefficient
A metric that measures the extent to which numeric variables are associated with one another (ranges from -1 to +1)
Average the products of the deviations from the mean for variable 1 and variable 2 (sum divided by n-1), then divide by the product of the standard deviations
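A minimal sketch of the Pearson correlation coefficient, assuming the n-1 denominator is used for both the averaged product of deviations and the standard deviations. The function name and data are illustrative.

```python
from math import sqrt


def pearson_r(x, y):
    """Averaged product of deviations divided by the product of standard deviations."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    sx = sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))
    sy = sqrt(sum((b - my) ** 2 for b in y) / (n - 1))
    return cov / (sx * sy)


x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [2, 4, 6, 8, 10])    # perfectly increasing: close to +1
r_down = pearson_r(x, [10, 8, 6, 4, 2])  # perfectly decreasing: close to -1
```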
Correlation matrix
A table where the variables are shown on both rows and columns, and the cell values are the correlations between the variables
Scatterplot
A plot in which the x-axis is the values of one variable, and the y-axis the value of another
Contingency tables
A tally of counts between 2 or more categorical variables
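A small sketch of a contingency table for two categorical variables, built with the standard library. The records (a group label and a yes/no response) are made up for illustration.

```python
from collections import Counter

# Each record is (group, response); both are categorical.
records = [
    ("A", "yes"), ("A", "no"), ("B", "yes"),
    ("A", "yes"), ("B", "no"), ("B", "no"),
]

table = Counter(records)  # counts for each (group, response) combination

groups = sorted({g for g, _ in records})
responses = sorted({r for _, r in records})
for g in groups:
    # One row per group, one count per response, in sorted response order.
    print(g, [table[(g, r)] for r in responses])
```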
Hexagonal binning
A plot of two numeric variables with the records binned into hexagons
Contour plots
A plot showing the density of 2 numeric variables like a topographical map
Violin plots
Similar to a boxplot but showing the density estimate
Plot a numeric variable against a categorical variable
Boxplot
Visually compare the distributions of a numeric variable grouped according to a categorical variable