Supervised Learning Flashcards
Data cleaning
Also called data cleansing, data munging, or data wrangling, the process of identifying and then eliminating problems in the data
Data exploration
The process of exploring the data to discover relationships and features, using visualizations, statistics, and other methods
Continuous variable
A variable that can take an infinite number of values, where the difference between two values can be arbitrarily small
Categorical variable
A variable that can take only a limited number of distinct values
Interval variable
A type of continuous variable that is sensitive to both rank order and difference between two values, but doesn’t have an absolute zero point
Ratio variable
A type of continuous variable that is sensitive to both rank order and the difference between two values, and has a meaningful absolute zero point
Ordinal variable
A type of categorical variable that is sensitive to rank-ordering but not the difference between two values
Nominal variable
A type of categorical variable that doesn’t have any natural order or ranking
Outlier
An observation that is distant from other observations
Box plot
A chart that indicates the minimum value, the maximum value, the sample median, and the first and third quartiles
Quartiles
A type of quantile that divides a ranked dataset into four equal parts to help describe the spread and center of the data. Quartiles are used in box plots to visualize the distribution and identify outliers
First quartile (Q1)
25% of the data falls below this value
Second quartile (Q2)
Known as the median, 50% of the data falls below this value
Third quartile (Q3)
75% of the data falls below this value
Interquartile range (IQR)
The difference between the third and first quartiles (Q3 - Q1), which covers the middle 50% of the data
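As a rough illustration of the quartile and IQR cards above, the sketch below computes Q1, Q2, Q3, and the IQR with Python's standard statistics module. The function name iqr_summary, the sample data, and the 1.5 x IQR cutoff for flagging outliers (a common convention, not something these cards specify) are assumptions made for the example. Quartile values also depend on the interpolation method; this sketch uses method="inclusive".

```python
import statistics


def iqr_summary(values, k=1.5):
    """Return (q1, q2, q3, iqr, outliers) for a list of numbers.

    k is the multiplier for the outlier cutoff; 1.5 is a common
    convention and is an assumption here, not part of the definition.
    """
    # quantiles() sorts the data itself and returns the three cut points.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return q1, q2, q3, iqr, outliers


sample = [1, 2, 3, 4, 5, 6, 7, 8, 50]  # made-up data with one extreme value
q1, q2, q3, iqr, outliers = iqr_summary(sample)
print(q1, q2, q3, iqr, outliers)  # 3.0 5.0 7.0 4.0 [50]
```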
Histogram
A column chart showing the frequency distribution of a variable
Winsorization
The process of replacing extreme observations with values that are less extreme
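One simple way to winsorize, sketched below, is to clamp the k smallest and k largest observations to the nearest values that are kept. The function name winsorize, the parameter k, and the sample data are hypothetical; percentile-based limits are another common choice.

```python
def winsorize(values, k=1):
    """Clamp the k lowest and k highest observations to the nearest kept value."""
    if k < 0 or 2 * k >= len(values):
        raise ValueError("k must be non-negative and leave at least one value untouched")
    ordered = sorted(values)
    low, high = ordered[k], ordered[-k - 1]
    # Values below `low` are raised to it; values above `high` are lowered to it.
    return [min(max(v, low), high) for v in values]


print(winsorize([1, 10, 11, 12, 13, 100], k=1))  # [10, 10, 11, 12, 13, 13]
```

The original order of the observations is preserved; only the extreme values change.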
Monotonic transformation
A transformation that doesn’t change the relative ordering of the values in a variable
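To make the definition concrete, the sketch below compares the rank order of some values before and after a transformation. The helper name ranks and the sample values are made up for the example. The logarithm keeps the ordering of positive values, while squaring does not keep the ordering when the values have mixed signs.

```python
import math


def ranks(values):
    """Return the rank (0 = smallest) of each value, in the original order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


positive = [100, 1, 10]
logged = [math.log(v) for v in positive]
print(ranks(positive) == ranks(logged))  # True: log preserves the ordering

mixed = [-3, 1, 2]
squared = [v ** 2 for v in mixed]
print(ranks(mixed) == ranks(squared))  # False: squaring reorders -3 and 2
```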
Univariate analysis
Analysis of a single variable in a dataset
Multivariate analysis
Analysis that incorporates two or more variables in a dataset
Bivariate analysis
A type of multivariate analysis that focuses on exactly two variables
Scatter plot (Scattergram)
A chart that typically uses dots to represent two numeric variables, with one variable on the x-axis and the other on the y-axis
Correlation coefficient
A numeric representation of the linear relationship between two continuous variables
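The card does not name a specific coefficient; the sketch below assumes the Pearson correlation coefficient, which is the usual choice for linear relationships. The function name pearson_r and the sample data are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 values each")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sum of products of deviations, and the two root sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / (spread_x * spread_y)


hours = [1, 2, 3, 4]
score = [3, 5, 7, 9]  # exactly 2 * hours + 1, so r should be very close to 1
print(pearson_r(hours, score))
```

The result always lies between -1 and 1: values near 1 indicate a strong positive linear relationship, values near -1 a strong negative one, and values near 0 little or no linear relationship.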
Heat map
A type of chart that indicates a variable’s magnitude by color variation such as hue or intensity
Correlation heat map
A heat map in which the color of each cell indicates the correlation between a pair of variables
One-hot encoding
The process of transforming a categorical variable into dichotomous indicator variables so that the data is numeric.
Indicator variable
Also known as a dummy variable, a dichotomous variable that indicates the presence or absence of a given category of a qualitative variable
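The two cards above can be illustrated with a small sketch that turns one categorical variable into a set of 0/1 indicator variables, one per category. The function name one_hot, the is_ prefix on the column names, and the sample values are assumptions made for the example.

```python
def one_hot(values):
    """Map each categorical value to a dict of 0/1 indicator variables."""
    categories = sorted(set(values))  # one indicator per distinct category
    return [
        {f"is_{category}": int(value == category) for category in categories}
        for value in values
    ]


colors = ["red", "green", "red"]
for row in one_hot(colors):
    print(row)
# {'is_green': 0, 'is_red': 1}
# {'is_green': 1, 'is_red': 0}
# {'is_green': 0, 'is_red': 1}
```

Exactly one indicator is 1 in each row, because every observation belongs to exactly one category.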
Dichotomy
A division into two mutually exclusive or contradictory groups. In data science, it often refers to a binary classification where there are only two possible categories (e.g., True/False, Yes/No, 0/1).
Box-Cox transformation
A power transformation of strictly positive data, designed to make the data more closely resemble a normal distribution
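For a fixed parameter lambda, the Box-Cox transformation is (x^lambda - 1) / lambda when lambda is not zero and ln(x) when lambda is zero, and it is defined only for positive x. The sketch below applies that formula with a lambda chosen by hand; in practice lambda is usually estimated from the data, which this example does not attempt. The function name box_cox and the sample values are hypothetical.

```python
import math


def box_cox(values, lam):
    """Apply the Box-Cox transformation with a fixed lambda to positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]  # limiting case as lambda -> 0
    return [(v ** lam - 1) / lam for v in values]


skewed = [1.0, 4.0, 9.0, 100.0]  # made-up right-skewed data
print(box_cox(skewed, 0.5))  # square-root-like compression of large values
print(box_cox(skewed, 0))    # natural log
```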
Normalization
The process of rescaling variables into the [0,1] range
Standardization
The process of rescaling a variable to have a mean of zero and a standard deviation of one
Rescaling a variable
Adjusting a variable’s values to fit within a specific range or scale. This matters when variables have different units or magnitudes, because it keeps variables with large magnitudes from dominating the analysis.
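The normalization and standardization cards above describe two common ways to rescale a variable. A minimal sketch of both follows, using only the standard library. The function names and sample data are hypothetical, and the standardization here divides by the population standard deviation, which is an assumption; some tools use the sample standard deviation instead.

```python
import statistics


def min_max_normalize(values):
    """Rescale values linearly into the [0, 1] range."""
    low, high = min(values), max(values)
    if low == high:
        raise ValueError("cannot normalize a constant variable")
    return [(v - low) / (high - low) for v in values]


def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        raise ValueError("cannot standardize a constant variable")
    return [(v - mean) / sd for v in values]


ages = [2, 4, 6]  # made-up data
print(min_max_normalize(ages))  # [0.0, 0.5, 1.0]
print(standardize(ages))        # symmetric around 0
```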