Exploratory Data Analysis Flashcards
Data that are expressed on a numerical scale are what data type?
numeric
Data that can take only a specific set of values representing a set of possible categories (enums, enumerated, factors, nominal) are what data type?
categorical
Cite the two numerical data types
continuous and discrete
Cite the two categorical data types
binary and ordinal
Data that can take on any value in an interval (float, numeric)
continuous
Data that can take only integer values, such as counts
discrete
True or False: Data typing in software acts as a signal for how to process the data.
True
Rectangular data (like a spreadsheet) is the basic structure for statistical and machine learning models. What is this structure called?
dataframe
A column (series) within a table is commonly referred to as a _______?
feature
Many data science projects involve predicting an ______?
outcome (dependent variable, response, target, output)
A row in a table is referred to as a ______?
record
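A minimal sketch of rectangular data in plain Python, assuming a table stored as a list of dictionaries. The field names (`weight_kg`, `birth_year`, `smoker`) and the `column` helper are hypothetical, invented for illustration. Each dictionary is a record (row), each key is a feature (column), and one feature is treated as the outcome.

```python
# Hypothetical records; the field names are invented for this sketch.
records = [
    {"weight_kg": 61.5, "birth_year": 1990, "smoker": False},
    {"weight_kg": 82.0, "birth_year": 1985, "smoker": True},
    {"weight_kg": 70.3, "birth_year": 2001, "smoker": False},
]


def column(rows, name):
    """Return one feature (column) as a list, one entry per record."""
    return [row[name] for row in rows]


weights = column(records, "weight_kg")  # a continuous numeric feature
outcome = column(records, "smoker")     # a binary feature used as the outcome

print(len(records), "records")
print("weight_kg:", weights)
print("smoker:", outcome)
```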
What is the sum of all values divided by the number of values?
mean
The sum of each value times its weight, divided by the sum of the weights
weighted mean
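The mean and weighted mean can be sketched as follows. The `weighted_mean` helper and the sample numbers are assumptions made for illustration.

```python
from statistics import mean


def weighted_mean(values, weights):
    """Sum of value * weight, divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


values = [2, 4, 6]
weights = [1, 1, 2]  # the last value counts twice as much

plain = mean(values)                       # (2 + 4 + 6) / 3
weighted = weighted_mean(values, weights)  # (2*1 + 4*1 + 6*2) / (1 + 1 + 2)

print(plain, weighted)
```

With equal weights the weighted mean reduces to the ordinary mean.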
The value such that one-half of the data lies above it and one-half below
median
The value such that P percent of the data lies below
percentile (quantile)
The value in the sorted data such that one-half of the sum of the weights lies above it and one-half below
weighted median
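One possible sketch of a weighted median, assuming non-negative weights. The `weighted_median` helper is hypothetical. When the cumulative weight lands exactly on one-half, conventions differ; this version returns the lower of the two candidate values.

```python
def weighted_median(values, weights):
    """Return the first sorted value whose cumulative weight reaches half the total."""
    if not values or len(values) != len(weights):
        raise ValueError("values and weights must be non-empty and equal in length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")

    half = total / 2
    running = 0
    # Walk through the data in sorted order, accumulating weight.
    for value, weight in sorted(zip(values, weights)):
        running += weight
        if running >= half:
            return value


heavy_tail = weighted_median([1, 2, 3, 4], [1, 1, 1, 5])
equal = weighted_median([1, 2, 3], [1, 1, 1])
print(heavy_tail, equal)
```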
The average of all values after dropping a fixed number of extreme values
trimmed mean (truncated mean)
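A short sketch of a trimmed mean, assuming the same number of values `k` is dropped from each end of the sorted data. The `trimmed_mean` helper and the sample data are illustrative only.

```python
from statistics import mean, median


def trimmed_mean(values, k):
    """Mean after dropping the k smallest and k largest values."""
    if k < 0 or 2 * k >= len(values):
        raise ValueError("k must leave at least one value")
    ordered = sorted(values)
    return mean(ordered[k:len(ordered) - k])


data = [1, 2, 3, 4, 100]  # 100 is far from the rest of the data

plain = mean(data)
middle = median(data)
trimmed = trimmed_mean(data, 1)

print(plain, middle, trimmed)
```

The single extreme value pulls the plain mean far above the median and the trimmed mean.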
Not sensitive to extreme values
Robust (resistant)
What is a data value that is very different from most of the data?
Outlier (extreme value)
The difference between the observed values and the estimate of location?
deviations
The sum of squared deviations from the mean divided by n-1 where n is the number of data values.
variance
The square root of the variance
standard deviation
The mean of the absolute values of the deviations from the mean (L1-norm, Manhattan norm)
mean absolute deviation
The median of the absolute values of the deviations from the median
median absolute deviation from the median (MAD)
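The variability estimates above can be computed step by step as in this sketch. The sample data are made up, and the standard-library `variance` and `stdev` functions are used only to cross-check the hand-written formulas.

```python
from statistics import mean, median, variance, stdev

data = [1, 4, 4, 5, 6]
n = len(data)

center = mean(data)
deviations = [x - center for x in data]  # observed value minus estimate of location

# Sample variance: sum of squared deviations divided by n - 1.
var = sum(d ** 2 for d in deviations) / (n - 1)
sd = var ** 0.5  # standard deviation

# Mean absolute deviation: mean of |deviation from the mean|.
mean_abs_dev = mean([abs(d) for d in deviations])

# Median absolute deviation: median of |deviation from the median|.
med = median(data)
mad = median([abs(x - med) for x in data])

print(var, sd, mean_abs_dev, mad)
```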
The difference between the largest and smallest value in a data set.
range
Metrics based on the data values sorted from smallest to largest (ranks)
order statistics
The value such that P percent of the values take on this value or less and (100-P) percent take on this value or more
percentile
The difference between the 75th percentile and the 25th percentile (IQR)
interquartile range
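A sketch of the range, quartiles, and interquartile range using `statistics.quantiles` (Python 3.8+). Software packages interpolate percentiles in different ways; the `"inclusive"` method is chosen here as an assumption, so other tools may report slightly different values.

```python
from statistics import quantiles

data = [7, 1, 9, 3, 5, 2, 8, 4, 6]

ordered = sorted(data)               # order statistics start from the sorted data
data_range = ordered[-1] - ordered[0]

# n=4 yields the three cut points: 25th, 50th, and 75th percentiles.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(data_range, q1, q2, q3, iqr)
```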
The basic metric for location is the ____, but it can be sensitive to extreme values, known as _____.
mean, outliers
Location is one dimension in summarizing a _______?
feature
A second dimension, variability (dispersion), measures what?
whether the data values are tightly clustered or spread out.
What topics are covered during the exploratory data analysis (EDA) phase?
elements of structured data
estimates of location
estimates of variability (dispersion metrics)
exploring the data distribution
exploring binary and categorical data
correlation
exploring two or more variables
Estimates of location can be described by?
mean, median and robust estimates.
What is being described in estimates of location?
the mean, median and robust estimates.
What is being described in the estimates of variability?
standard deviation and related estimates
estimates based on percentiles
What is being described when exploring the data distributions?
percentiles and boxplots
frequency tables and histograms
density plots and estimates
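A frequency table and a rough text histogram can be sketched as below. The data and the bin width of 5 are arbitrary choices for illustration. Empty bins are printed too, since a frequency table usually reports them.

```python
from collections import Counter

data = [1, 2, 2, 3, 7, 8, 9, 15]
bin_width = 5  # assumed bin size

# Map each value to the lower edge of its bin, then count values per bin.
counts = Counter((x // bin_width) * bin_width for x in data)

first_bin = min(counts)
last_bin = max(counts)

# Print every bin between the first and last, including empty ones.
for lower in range(first_bin, last_bin + bin_width, bin_width):
    bar = "#" * counts[lower]
    print(f"[{lower:>2}, {lower + bin_width:>2}) {counts[lower]} {bar}")
```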
What is meant by exploring binary and categorical data?
mode, expected value, probability
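Mode, proportions, and expected value for a categorical feature might be computed as in this sketch. The brand list and the prices are hypothetical values invented for the example.

```python
from collections import Counter
from statistics import mode

# Hypothetical categorical data.
brands = ["Sony", "Canon", "Sony", "Nikon", "Sony", "Canon"]

most_common = mode(brands)  # the most frequently occurring category

# Proportion of records in each category (an estimate of its probability).
counts = Counter(brands)
proportions = {brand: count / len(brands) for brand, count in counts.items()}

# Expected value: each category's numeric value times its probability, summed.
prices = {"Sony": 300, "Canon": 200, "Nikon": 100}  # hypothetical prices
expected_price = sum(prices[brand] * p for brand, p in proportions.items())

print(most_common, proportions, round(expected_price, 2))
```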
What describes exploring two or more variables?
hexagon binning and contours (plotting numeric vs numeric data)
two categorical variables
visualizing multiple variables
Give an example of numerical continuous data.
weight (which can be infinitely divided)
Give an example of numerical discrete data.
year of birth (numerical data that can’t be divided)
Give an example of a categorical binary value? (only two options)
whether a camera is a Sony or not (yes/no)
Give an example of a categorical ordinal value? (ordinal meaning order)
a rating such as low, medium, high (data where the order matters)
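One way to represent an ordinal category in Python is an `IntEnum`, which keeps named categories while preserving their order. The `Rating` class and its levels are assumptions for illustration.

```python
from enum import IntEnum


class Rating(IntEnum):
    """Hypothetical ordinal categories: the numeric values encode the order only."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


ratings = [Rating.HIGH, Rating.LOW, Rating.MEDIUM, Rating.LOW]

ordered = sorted(ratings)          # sorting respects the category order
highest = max(ratings)

print([r.name for r in ordered], highest.name)
```

The comparison `Rating.LOW < Rating.HIGH` is meaningful here, which would not be the case for a purely nominal category such as a brand name.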
Pandas stands for what?
panel data