Supervised Learning Flashcards
Data cleaning
Also called data cleansing, data munging, or data wrangling, the process of identifying and then eliminating problems in the data
Data exploration
The process of exploring the data to discover relationships and features, using visualizations, statistics, and other methods
Continuous variable
A variable that can take an infinite number of values, where the difference between two values can be arbitrarily small
Categorical variable
A variable that can take only a limited number of distinct values
Interval variable
A type of continuous variable that is sensitive to both rank order and difference between two values, but doesn’t have an absolute zero point
Ratio variable
A type of continuous variable that is sensitive to both rank order and difference between two values, and has a meaningful absolute zero point
Ordinal variable
A type of categorical variable that is sensitive to rank-ordering but not the difference between two values
Nominal variable
A type of categorical variable that doesn’t have any natural order or ranking
Outlier
An observation that is distant from other observations
Box plot
A chart that indicates the minimum value, the maximum value, the sample median, and the first and third quartiles
Quartiles
A type of quantile that divides a ranked dataset into four equal parts to help understand the spread and center of the data. Used in box plots to visualize the distribution and identify outliers
First quartile (Q1)
25% of the data falls below this value
Second quartile (Q2)
Known as the median, 50% of the data falls below this value
Third quartile (Q3)
75% of the data falls below this value
Interquartile range (IQR)
The difference between the third and first quartiles (Q3 - Q1), which covers the middle 50% of the data
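Example (a minimal sketch; the sample values are made up, and the 1.5 × IQR outlier rule is a common convention assumed here rather than something the cards define):

```python
import statistics

# Hypothetical sample; 40 is deliberately far from the rest.
values = [2, 4, 4, 5, 6, 7, 8, 9, 10, 40]

# With n=4, statistics.quantiles returns the three cut points Q1, Q2, Q3.
# method="inclusive" treats the sample minimum and maximum as the 0th and 100th percentiles.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention: flag points more than 1.5 * IQR outside the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]

print(q1, q2, q3, iqr, outliers)
```

Other quantile methods (for example the default `method="exclusive"`) can give slightly different quartiles on small samples.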
Histogram
A column chart showing the frequency distribution of a variable
Winsorization
The process of replacing extreme observations with values that are less extreme
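Example (a minimal sketch of one simple variant; the function name `winsorize` and the parameter `k` are illustrative, and percentile-based limits are an equally common choice):

```python
def winsorize(values, k=1):
    """Replace the k smallest and k largest values with the nearest retained value.

    Assumes 0 <= k and 2 * k < len(values).
    """
    ordered = sorted(values)
    low, high = ordered[k], ordered[-k - 1]
    # Clamp every value into the [low, high] range; the original order is kept.
    return [min(max(v, low), high) for v in values]


data = [1, 20, 22, 23, 25, 27, 300]  # made-up values with one extreme at each end
print(winsorize(data))  # [20, 20, 22, 23, 25, 27, 27]
```

Unlike dropping outliers, this keeps the number of observations unchanged.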
Monotonic transformation
A transformation that doesn’t change the relative ordering of the values in a variable
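Example (a minimal sketch; the base-10 logarithm is used as one familiar monotonic transformation, and the values are made up):

```python
import math

values = [100, 1, 10, 1000]  # made-up, strictly positive values
logged = [math.log10(v) for v in values]


def rank_order(xs):
    """Return the indices of xs sorted from smallest to largest value."""
    return sorted(range(len(xs)), key=lambda i: xs[i])


# The log changes the spacing between values but not their relative ordering.
same_order = rank_order(values) == rank_order(logged)
print(same_order)  # True
```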
Univariate analysis
Analysis of a single variable in a dataset
Multivariate analysis
Analysis that incorporates two or more variables in a dataset
Bivariate analysis
A type of multivariate analysis that focuses on exactly two variables
Scatter plot (Scattergram)
A chart that typically uses dots to represent two numeric variables, with one variable on the x-axis and the other on the y-axis
Correlation coefficient
A numeric representation of the linear relationship between two continuous variables
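Example (a minimal sketch of the Pearson correlation coefficient, which is assumed here to be the coefficient the card refers to; the helper name `pearson_r` is illustrative):

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


xs = [1, 2, 3, 4]
print(pearson_r(xs, [2, 4, 6, 8]))  # close to +1: perfect increasing line
print(pearson_r(xs, [8, 6, 4, 2]))  # close to -1: perfect decreasing line
```

The result always lies between -1 and 1; values near 0 indicate little linear relationship.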
Heat map
A type of chart that indicates a variable’s magnitude by color variation such as hue or intensity
Heat map (correlation)
A heat map in which each cell’s color indicates the correlation between a pair of variables
One-hot encoding
The process of transforming a categorical variable into dichotomous indicator variables so that the data is numeric.
Indicator variable
Also known as a dummy variable, a dichotomous variable that indicates the presence or absence of a given category of a qualitative variable
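Example (a minimal sketch of one-hot encoding into indicator variables; the function name `one_hot` and the `is_` column prefix are illustrative choices):

```python
def one_hot(values):
    """Turn a list of category labels into rows of 0/1 indicator variables."""
    categories = sorted(set(values))
    rows = [{f"is_{c}": int(v == c) for c in categories} for v in values]
    return categories, rows


colors = ["red", "green", "red", "blue"]  # made-up categorical variable
categories, encoded = one_hot(colors)

for label, row in zip(colors, encoded):
    print(label, row)
```

Each row contains exactly one 1, in the column for that observation's category.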
Dichotomy
A division between two mutually exclusive or contradictory groups. In data science, it often refers to a binary classification where there are only two possible categories (Ex: True/False, Yes/No, 0/1).
Box-Cox transformation
A power transformation designed to make strictly positive data more closely resemble a normal distribution
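Example (a minimal sketch of the Box-Cox formula for a fixed parameter; the name `lam` for the power parameter is an assumption, and in practice the parameter is usually estimated from the data rather than chosen by hand):

```python
import math


def box_cox(values, lam):
    """Apply the Box-Cox transform: (x**lam - 1) / lam, or log(x) when lam is 0."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]


print(box_cox([1, 4, 9], 0.5))  # [0.0, 2.0, 4.0]
print(box_cox([1, 10, 100], 0))  # natural logs of the inputs
```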
Normalization
The process of rescaling variables into the [0,1] range
Standardization
The process of rescaling a variable to have a mean of zero and a standard deviation of one
Rescaling a variable
Adjusting a variable’s values to fit within a specific range or scale. This is important when variables have different units or magnitudes, because it keeps variables with large values from dominating the analysis.
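Example (a minimal sketch of the two rescaling methods defined above; the function names are illustrative, and both assume the values are not all identical, since that would cause a division by zero):

```python
import statistics


def normalize(values):
    """Min-max rescaling into the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Rescale to mean 0 and (population) standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


print(normalize([10, 20, 30]))  # [0.0, 0.5, 1.0]
z = standardize([2, 4, 4, 4, 5, 5, 7, 9])
print(z)
```

Using the sample standard deviation (`statistics.stdev`) instead of the population version is an equally valid choice and changes the result slightly.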
Filter methods
A class of feature-selection methods that evaluate each feature separately and assign it a score used to rank the features; features scoring above a chosen cutoff are retained and the rest are discarded
Wrapper methods
A class of feature-selection methods that construct sets of features, evaluate each set in terms of its predictive power in a model, and compare the set’s performance to the performance of other sets
Embedded methods
A class of feature-selection methods that select sets of features as an intrinsic part of the fitting method for the particular type of model being used
Principal components analysis (PCA)
A complexity reduction technique that tries to reduce a set of variables down to a smaller set of components that represent most of the information in the variables
Eigenvector of a linear transformation
A vector that doesn’t change its direction when the linear transformation is applied to it
Vector
A quantity with both magnitude and direction, represented as an array of numbers. In data science, vectors often represent features of a dataset. (Example: in 2D space, a vector might look like [x, y], where x and y are the coordinates.)
Eigenvalue of an eigenvector
The factor by which the eigenvector is scaled when the linear transformation is applied to it
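Example (a minimal sketch; the 2×2 matrix `A` and the test vectors are made up, and the helper `mat_vec` is illustrative):

```python
def mat_vec(matrix, vector):
    """Apply a linear transformation (a matrix) to a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


A = [[2, 1],
     [1, 2]]

v = [1, 1]
Av = mat_vec(A, v)          # [3, 3]: same direction as v, three times as long
eigenvalue = Av[0] / v[0]   # 3.0

u = [1, 0]
Au = mat_vec(A, u)          # [2, 1]: direction changed, so u is not an eigenvector

print(Av, eigenvalue, Au)
```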
Components
Eigenvectors that have been divided by the square roots of their eigenvalues
Statistical model
A simplified mathematical representation of the data scientist’s best guess about the underlying processes that created the data
Dense feature
An element with information that explains a large amount of variance in the outcome of interest
Artificial intelligence
Known as AI, the study of systems that perform tasks that require human intelligence, such as understanding natural language, recognizing objects, or driving a car
Feature sets
Processed data that is ready to be used in models
Instance space
The vector space of all instances of the data
Supervised learning
A machine-learning approach where the computer is presented with a set of features and their corresponding targets, and then asked to learn what the pattern in the dataset is
Unsupervised learning
A machine-learning approach where the learning algorithm is given features without labels, meaning that it needs to discover the pattern in the data
Semisupervised learning
A machine-learning approach where the computer is given a partially complete feature-target set, in which the targets are missing for many instances
Reinforcement learning
A machine-learning approach where feedback is given to the learning agent (or algorithm) in a dynamic environment in the form of rewards and punishments
Generalization
How well a learning agent can apply the concepts that it’s learned to new instances that it didn’t see during training
Underfitting
A scenario where the model is too simple to capture the pattern in the data, so it fits poorly on all data, including training, test, and unseen data
Overfitting
A phenomenon that occurs in machine learning models when a model becomes too complex or fits the training data so closely that it cannot perform well on new data
Classification
The process of determining categories for objects and then predicting which category previously unseen objects belong to
Labeled data
Data that is already associated with a target value
Confusion matrix
A table showing every combination of predicted and actual values
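Example (a minimal sketch for a binary problem; the label lists are made up, with 1 as the positive class):

```python
from collections import Counter

actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]

# Count every (actual, predicted) combination.
counts = Counter(zip(actual, predicted))
tp = counts[(1, 1)]  # true positives
fn = counts[(1, 0)]  # false negatives
fp = counts[(0, 1)]  # false positives
tn = counts[(0, 0)]  # true negatives

print("              pred 1  pred 0")
print(f"actual 1      {tp:6d}  {fn:6d}")
print(f"actual 0      {fp:6d}  {tn:6d}")
```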
Linearly separable data
Data that when graphed in two dimensions can be separated into two classes by a straight line
Classification algorithm
An algorithm that aims to predict the labeled class to which each observation belongs
Linear classifier
An algorithm that classifies objects based on a linear combination of the characteristics
Decision boundary
A line or surface that separates different predicted classes
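Example (a minimal sketch of a linear classifier and its decision boundary; the weights and bias are hand-picked for illustration, not learned from data):

```python
def linear_classify(features, weights, bias):
    """Predict class 1 when the linear combination of features is non-negative."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score >= 0 else 0


# With these made-up parameters the decision boundary is the line x1 + x2 = 3.
weights, bias = [1.0, 1.0], -3.0

print(linear_classify([1, 1], weights, bias))  # 0: below the boundary
print(linear_classify([2, 2], weights, bias))  # 1: above the boundary
```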
Gradient descent algorithm
An optimization algorithm that repeatedly updates the parameters of the hypothesis function in the direction that reduces the error, measuring the error after each update, until the error is as small as possible.
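Example (a minimal sketch, assuming a straight-line hypothesis `w * x + b` and mean squared error as the error measure; the learning rate, step count, and data are illustrative choices):

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w * x + b by repeatedly stepping against the gradient of the MSE."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error with respect to w and b.
        grad_w = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2 / n * sum(errors)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]  # generated from y = 2x + 1
w, b = gradient_descent(xs, ys)
print(round(w, 4), round(b, 4))  # 2.0 1.0
```

A learning rate that is too large makes the updates overshoot and diverge; one that is too small makes convergence slow.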
Balanced dataset or class-balanced dataset
A dataset with a fairly even distribution of instances across the classes
Unbalanced dataset or class-imbalanced dataset
A dataset with a skewed distribution of values across each class, thus creating a challenge for predictive modeling
False positive rate (FPR)
The probability that a negative instance will be incorrectly predicted as positive
True positive rate (TPR)
The probability that a positive instance will be correctly predicted as positive
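Example (a minimal sketch computing both rates from confusion-matrix counts; the counts are made up and the function name `rates` is illustrative):

```python
def rates(tp, fn, fp, tn):
    """Return (TPR, FPR) from the four confusion-matrix counts."""
    tpr = tp / (tp + fn)  # share of actual positives predicted positive
    fpr = fp / (fp + tn)  # share of actual negatives predicted positive
    return tpr, fpr


tpr, fpr = rates(tp=3, fn=1, fp=1, tn=3)
print(tpr, fpr)  # 0.75 0.25
```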
Probability threshold
The cutoff used to convert a predicted probability into a class label (for example, predicting the positive class when the probability is at least 0.5)
Precision
The proportion of positive predictions that are correct
Recall
The proportion of instances in the positive class that were correctly predicted as positive
Precision-recall curve
A visualization created by plotting precision against recall while varying the threshold from 0 to 1, which is useful for class-imbalanced data.
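Example (a minimal sketch that computes the points of a precision-recall curve at a few thresholds; the labels and predicted probabilities are made up, and returning 0.0 when a ratio is undefined is an assumed convention):

```python
def precision_recall(actual, probabilities, threshold):
    """Convert probabilities to labels at the threshold, then score them."""
    predicted = [int(p >= threshold) for p in probabilities]
    tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
    fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
    fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


actual        = [1,   1,   0,   1,   0,   0]
probabilities = [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]

for threshold in (0.35, 0.5, 0.65):
    print(threshold, precision_recall(actual, probabilities, threshold))
```

On this data, raising the threshold increases precision and lowers recall, which is the trade-off the curve is meant to show.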
One-vs-rest (OvR)
A strategy for transforming a multiclass problem into several binary problems by training a single classifier per class
Multinomial
Regression
The process of estimating the relationship between one or more observed features and some continuous target variable
Noise
Unexplained variability within a target variable or data
Ordinary least squares (OLS)
An optimization method that fits a line by choosing the one that minimizes the sum of squared vertical distances between each point and the line
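Example (a minimal sketch of OLS for a single feature, using the standard closed-form solution; the data and the helper names `ols_line` and `sse` are illustrative):

```python
def ols_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sse(xs, ys, slope, intercept):
    """Sum of squared vertical distances between the points and the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]  # made-up noisy data
slope, intercept = ols_line(xs, ys)
print(slope, intercept)  # approximately 0.6 and 2.2

# Any other line has a larger sum of squared distances.
print(sse(xs, ys, slope, intercept) < sse(xs, ys, slope + 0.1, intercept))  # True
```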
Linear regression model
A regression model that aims to model a linear relationship between the target variable and the coefficients of the features
Skewness
A measure of the degree of asymmetry of a distribution
Kurtosis
A measure of the sharpness of a distribution’s peak and, equivalently, of how heavy its tails are
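Example (a minimal sketch using the moment-based population formulas, which are one of several conventions; the function name and data are illustrative):

```python
def skewness_and_kurtosis(values):
    """Moment-based skewness (m3 / m2**1.5) and kurtosis (m4 / m2**2)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]

print(skewness_and_kurtosis(symmetric))     # skewness is 0 for symmetric data
print(skewness_and_kurtosis(right_skewed))  # positive skewness: long right tail
```

Under this convention a normal distribution has kurtosis 3; some libraries report excess kurtosis, which subtracts 3.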
Optimization
The process of finding the optimal values of the unknown coefficients
Error term
Also known as the residual, the information in the target variable that isn’t explained by the features
Estimation (in linear regression)
Creating a model from known data (past observations) to understand the underlying relationship between variables. (For example, using a linear regression model to estimate the relationship between features and a target variable.)
Prediction
Using the estimated model to forecast unknown outcomes or future data points, that is, applying the regression model to new data to predict the target variable’s value.