Supervised Learning Flashcards
Data cleaning
Also called data cleansing, data munging, or data wrangling, the process of identifying and then eliminating problems in the data
Data exploration
The process of exploring the data to discover relationships and features, using visualizations, statistics, and other methods
Continuous variable
A variable that can take an infinite number of values, where the difference between two values can be arbitrarily small
Categorical variable
A variable that can take only a limited number of distinct values
Interval variable
A type of continuous variable that is sensitive to both rank order and difference between two values, but doesn’t have an absolute zero point
Ratio variable
A type of continuous variable that is sensitive to both rank order and distance between two values, and has a meaningful absolute zero point
Ordinal variable
A type of categorical variable that is sensitive to rank-ordering but not the difference between two values
Nominal variable
A type of categorical variable that doesn’t have any natural order or ranking
Outlier
An observation that is distant from other observations
Box plot
A chart that indicates the minimum value, the maximum value, the sample median, and the first and third quartiles
Quartiles
A type of quantile: the three cut points that divide a ranked dataset into four equal parts, which helps describe the spread and center of the data. Quartiles are used in box plots to visualize the distribution and identify outliers
First quartile (Q1)
25% of the data falls below this value
Second quartile (Q2)
Also known as the median; 50% of the data falls below this value
Third quartile (Q3)
75% of the data falls below this value
Interquartile range (IQR)
The difference between the third and first quartiles (Q3 - Q1), which spans the middle 50% of the data
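As a rough illustration of the quartile and IQR cards, the sketch below computes Q1, Q2, Q3, and the IQR with Python's standard library. The sample data is hypothetical, and the 1.5 * IQR outlier rule is a common convention assumed here, not something the cards define. Different quantile methods can give slightly different cut points.

```python
from statistics import quantiles

# Hypothetical sample; 30 is deliberately far from the rest.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]

# quantiles(..., n=4) returns the three cut points Q1, Q2, Q3.
# method="inclusive" treats the data as the whole population;
# other methods can give slightly different cut points.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention (not from the flashcards): flag values more
# than 1.5 * IQR below Q1 or above Q3 as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q2, q3, iqr, outliers)
```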
Histogram
A column chart showing the frequency distribution of a variable, with values grouped into bins
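One way to see what a histogram counts is to bin the values by hand. This is a minimal sketch with hypothetical ages and an assumed bin width of 10; the function name histogram_counts is made up for illustration, and a plotting library would normally do this step for you.

```python
from collections import Counter

def histogram_counts(values, bin_width):
    """Count how many values fall into each bin of the given width."""
    # Map each value to the left edge of its bin, then count per bin.
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical data
ages = [21, 23, 25, 31, 34, 38, 39, 45, 52, 58]
width = 10
counts = histogram_counts(ages, width)

# Text rendering: one '#' per observation in the bin.
for left_edge, count in counts.items():
    print(f"[{left_edge}, {left_edge + width}): {'#' * count}")
```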
Winsorization
The process of replacing extreme observations with values that are less extreme
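A minimal sketch of winsorization, assuming extreme values are clamped to chosen lower and upper percentiles. The function name, the default 5th/95th percentile cutoffs, and the sample data are illustrative assumptions; other cutoffs and percentile methods are equally valid.

```python
from statistics import quantiles

def winsorize(values, lower_pct=5, upper_pct=95):
    """Clamp values below/above the given percentiles to those percentiles."""
    # n=100 yields 99 cut points; cuts[k - 1] is the k-th percentile.
    cuts = quantiles(values, n=100, method="inclusive")
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(v, low), high) for v in values]

# Hypothetical data with one extreme observation (100).
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
clipped = winsorize(data, lower_pct=10, upper_pct=90)
print(clipped)
```

Unlike dropping outliers, this keeps every observation: the number of values is unchanged, and only the extremes are pulled toward the rest of the data.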
Monotonic transformation
A transformation that doesn’t change the relative ordering of the values in a variable
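The natural logarithm is a common example of a monotonic transformation for positive values. The sketch below uses hypothetical income figures and a made-up helper, rank_order, to check that the ordering of observations is the same before and after the transformation.

```python
import math

def rank_order(values):
    """Return the indices of values sorted from smallest to largest."""
    return sorted(range(len(values)), key=lambda i: values[i])

# Hypothetical, strongly skewed data (all values must be > 0 for log).
incomes = [35_000, 20_000, 1_200_000, 50_000]
logged = [math.log(v) for v in incomes]

# The magnitudes change, but the relative ordering does not.
same_order = rank_order(incomes) == rank_order(logged)
print(same_order)
```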
Univariate analysis
Analysis of a single variable in a dataset
Multivariate analysis
Analysis that incorporates two or more variables in a dataset
Bivariate analysis
A type of multivariate analysis that focuses on exactly two variables
Scatter plot (scattergram)
A chart that typically uses dots to represent two numeric variables, with one variable on the x-axis and the other on the y-axis
Correlation coefficient
A numeric measure of the strength and direction of the linear relationship between two continuous variables, ranging from -1 to 1
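The card does not name a specific coefficient; the sketch below assumes the Pearson correlation coefficient, which is the most common choice for linear relationships. The function name pearson_r and the data are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data
hours = [1, 2, 3, 4, 5]
score_up = [2, 4, 6, 8, 10]    # rises in step with hours
score_down = [10, 8, 6, 4, 2]  # falls in step with hours

r_up = pearson_r(hours, score_up)
r_down = pearson_r(hours, score_down)
print(r_up, r_down)
```

Values near 1 or -1 indicate a strong positive or negative linear relationship, and values near 0 indicate little or no linear relationship.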
Heat map
A type of chart that indicates a variable’s magnitude by color variation, such as hue or intensity