Supervised Learning Flashcards

1
Q

Data cleaning

A

Also called data cleansing, data munging, or data wrangling, the process of identifying and then eliminating problems in the data

2
Q

Data exploration

A

The process of exploring the data to discover relationships and features, using visualizations, statistics, and other methods

3
Q

Continuous Variable

A

A variable that can take an infinite number of values, where the difference between two values can be arbitrarily small

4
Q

Categorical variable

A

A variable that can take only a limited number of distinct values

5
Q

Interval variable

A

A type of continuous variable that is sensitive to both rank order and difference between two values, but doesn’t have an absolute zero point

6
Q

Ratio variable

A

A type of continuous variable that is sensitive to both rank order and difference between two values, and has a meaningful absolute zero point

7
Q

Ordinal variable

A

A type of categorical variable that is sensitive to rank-ordering but not the difference between two values

8
Q

Nominal variable

A

A type of categorical variable that doesn’t have any natural order or ranking

9
Q

Outlier

A

An observation that is distant from other observations

10
Q

Box plot

A

A chart that indicates the minimum value, the maximum value, the sample median, and the first and third quartiles

11
Q

Quartiles

A

A type of quantile that divides a ranked dataset into four equal parts to help understand the spread and center of the data. Used in box plots to visualize the distribution and identify outliers

12
Q

First Quartile (Q1)

A

25% of the data falls below this value

13
Q

Second quartile (Q2)

A

Known as the median, 50% of the data falls below this value

14
Q

Third quartile (Q3)

A

75% of the data falls below this value

15
Q

Interquartile range (IQR)

A

The range between the first and third quartiles, calculated as Q3 minus Q1
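
As a rough illustration of the quartile cards, the sketch below computes Q1, Q2, Q3, and the IQR with Python's standard `statistics` module. The sample data is made up, and the 1.5 × IQR outlier rule is a common convention assumed here, not something stated on the cards.

```python
import statistics

# Hypothetical sample; 40 is deliberately far from the other values.
data = [2, 4, 4, 5, 6, 7, 8, 9, 40]

# With n=4, statistics.quantiles returns the three quartile cut points.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Assumed rule of thumb: flag points more than 1.5 * IQR beyond Q1 or Q3.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < lower or x > upper]

print(q1, q2, q3, iqr, outliers)
```

Different quartile conventions (for example `method="exclusive"`) can give slightly different cut points on small samples.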

16
Q

Histogram

A

A column chart showing the frequency distribution of a variable

17
Q

Winsorization

A

The process of replacing extreme observations with values that are less extreme
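
One simple way to picture this is clamping values to fixed limits. The helper name `winsorize` and the limits 10 and 20 are hypothetical choices for this sketch; in practice the limits are often taken from percentiles of the data.

```python
def winsorize(values, lower, upper):
    """Clamp each value into [lower, upper]; length and order are unchanged."""
    return [min(max(v, lower), upper) for v in values]

# Hypothetical data with one extreme value at each end.
data = [1, 12, 14, 15, 16, 18, 95]
clipped = winsorize(data, 10, 20)

print(clipped)
```

Unlike dropping outliers, this keeps every observation and only pulls the extremes inward.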

18
Q

Monotonic transformation

A

A transformation that doesn’t change the relative ordering of the values in a variable
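
A quick check of the idea, using a base-10 logarithm as an example of a monotonic transformation (the data and the helper `rank_order` are invented for illustration):

```python
import math

# Hypothetical positive values spanning several orders of magnitude.
values = [100, 1, 10_000, 10]
logged = [math.log10(v) for v in values]

def rank_order(xs):
    """Return the indices of xs sorted from smallest to largest value."""
    return sorted(range(len(xs)), key=lambda i: xs[i])

# The values change, but their relative ordering does not.
print(rank_order(values), rank_order(logged))
```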

19
Q

Univariate analysis

A

Analysis of a single variable in a dataset

20
Q

Multivariate analysis

A

Analysis that incorporates two or more variables in a dataset

21
Q

Bivariate analysis

A

A type of multivariate analysis that focuses on exactly two variables

22
Q

Scatter plot (Scattergram)

A

A chart that typically uses dots to represent two numeric variables, with one variable on the x-axis and the other on the y-axis

23
Q

Correlation coefficient

A

A numeric representation of the linear relationship between two continuous variables
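
The card doesn't name a specific coefficient; the sketch below assumes the Pearson correlation coefficient, which is the usual choice for linear relationships. The function name and data are illustrative only.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

xs = [1, 2, 3, 4]
rising = [2, 4, 6, 8]    # perfectly linear, increasing
falling = [8, 6, 4, 2]   # perfectly linear, decreasing

print(pearson_r(xs, rising), pearson_r(xs, falling))
```

Values near +1 or -1 indicate a strong linear relationship; values near 0 indicate little or none.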

24
Q

Heat map

A

A type of chart that indicates a variable’s magnitude by color variation such as hue or intensity

25
Q

Heat map

A

A type of chart that uses color to indicate how strongly one variable is correlated with another

26
Q

One-hot encoding

A

The process of transforming a categorical variable into dichotomous indicator variables so that the data is numeric.
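
A minimal pure-Python sketch of the idea, assuming one indicator column per distinct category. The function name `one_hot` and the color data are hypothetical.

```python
def one_hot(values):
    """Return the sorted category list and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot(colors)

print(categories)
print(encoded)
```

Each row contains exactly one 1, marking the category that observation belongs to.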

27
Q

Indicator variable

A

Also known as a dummy variable, a dichotomous variable that indicates the presence or absence of a given category of a qualitative variable

28
Q

Dichotomy

A

A division between two mutually exclusive or contradictory groups. In data science, it often refers to a binary classification where there are only two possible categories (for example, True/False, Yes/No, 0/1).

29
Q

Box-Cox transformation

A

A transformation designed to transform data to resemble a normal distribution
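
The card gives no formula. The sketch below assumes the standard one-parameter Box-Cox form, (x^λ - 1) / λ for λ ≠ 0 and ln(x) for λ = 0, which is defined only for positive x. The parameter name `lam` is a choice made here; in practice λ is usually estimated from the data rather than picked by hand.

```python
import math

def box_cox(x, lam):
    """One-parameter Box-Cox transform of a single positive value."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(x)          # limiting case as lam approaches 0
    return (x ** lam - 1) / lam

# Hypothetical right-skewed values transformed with lam = 0.5.
skewed = [1, 4, 9, 100]
transformed = [box_cox(x, 0.5) for x in skewed]

print(transformed)
```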

30
Q

Normalization

A

The process of rescaling variables into the [0,1] range
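
A small sketch assuming min-max scaling, the usual way to reach the [0,1] range. The function name is hypothetical, and returning zeros for a constant column is an arbitrary choice made here to avoid dividing by zero.

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column has no spread, so map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_normalize([10, 20, 30]))
```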

31
Q

Standardization

A

The process of rescaling a variable to have a mean of zero and a standard deviation of one
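
This can be sketched as computing z-scores: subtract the mean, then divide by the standard deviation. The example assumes the population standard deviation (`statistics.pstdev`); some tools use the sample version instead. The data is invented.

```python
import statistics

def standardize(values):
    """Return z-scores: (value - mean) / population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# Hypothetical data with mean 5 and population standard deviation 2.
data = [2, 4, 4, 4, 5, 5, 7, 9]
z_scores = standardize(data)

print(z_scores)
```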

32
Q

Rescaling a variable

A

Adjusting a variable’s values to fit within a specific range or scale. This is important when variables have different units or magnitudes, because it helps ensure that all variables contribute equally to the analysis.

33
Q

Filter methods

A

A class of feature-selection methods that evaluate each feature separately and assign it a score that’s used to rank the features; features scoring above a chosen cutoff are retained and the rest are discarded

34
Q

Wrapper methods

A

A class of feature-selection methods that construct sets of features, evaluate each set by its predictive power in a model, and compare its performance with that of other sets

35
Q

Embedded methods

A

A class of feature-selection methods that select sets of features as an intrinsic part of the fitting method for the particular type of model being used

36
Q

Principal components analysis (PCA)

A

A complexity reduction technique that tries to reduce a set of variables down to a smaller set of components that represent most of the information in the variables

37
Q

Eigenvector of a linear transformation

A

A vector that doesn’t change its direction when the linear transformation is applied to it

38
Q

Vector

A

A quantity with both magnitude and direction, represented as an array of numbers. In data science, vectors often represent the features of a dataset. Example: in 2D space, a vector might look like [x, y], where x and y are the coordinates.

39
Q

Eigenvalue of an eigenvector

A

The factor by which the eigenvector is scaled
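
To make the eigenvector and eigenvalue cards concrete, here is a small check on a hand-picked 2×2 matrix. The matrix `A`, the helper `mat_vec`, and the test vectors are all chosen for illustration.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]

A = [[2, 1],
     [1, 2]]

# [1, 1] is an eigenvector of A: applying A only scales it.
v = [1, 1]
Av = mat_vec(A, v)
eigenvalue = Av[0] / v[0]   # the scaling factor

# [1, 0] is not an eigenvector: A changes its direction.
w = [1, 0]
Aw = mat_vec(A, w)

print(Av, eigenvalue, Aw)
```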

40
Q

Components

A

Eigenvectors that have been divided by the square roots of their eigenvalues

41
Q

Statistical model

A

A simplified mathematical representation of the data scientist’s best guess about the underlying processes that created the data

42
Q

Dense feature

A

An element with information that explains a large amount of variance in the outcome of interest

43
Q

Artificial intelligence

A

Known as AI, the study of systems that perform tasks that require human intelligence, such as understanding natural language, recognizing objects, or driving a car

44
Q

Feature sets

A

Processed data that is ready to be used in models

45
Q

Instance space

A

The vector space of all instances of the data

46
Q

Supervised learning

A

A machine-learning approach where the computer is presented with a set of features and their corresponding targets, and then asked to learn what the pattern in the dataset is

47
Q

Unsupervised learning

A

A machine-learning approach where the learning algorithm is given features without labels, meaning that it needs to discover the pattern in the data

48
Q

Semisupervised learning

A

A machine-learning approach where the computer is given a partially complete feature-target set, in which the targets are missing for many instances

49
Q

Reinforcement learning

A

A machine-learning approach where feedback is given to the learning agent (or algorithm) in a dynamic environment in the form of rewards and punishments

50
Q

Generalization

A

How well a learning agent can apply the concepts that it’s learned to new instances that it didn’t see during training

51
Q

Underfitting

A

A scenario where the model can’t fit the data well, performing poorly on training, test, and unseen data alike

52
Q

Overfitting

A

A phenomenon in machine learning where a model becomes so complex, or fits the training data so closely, that it cannot perform well on new data

53
Q

Classification

A

The process of determining categories for objects and then predicting which category previously unseen objects belong to

54
Q

Labeled data

A

Data that is already associated with a target value

55
Q

Confusion matrix

A

A table showing every combination of predicted and actual values
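
For a binary problem the table has four cells. The sketch below tallies them from made-up label lists, treating 1 as the positive class (an assumption of this example).

```python
from collections import Counter

# Hypothetical actual and predicted labels (1 = positive, 0 = negative).
actual =    [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# Count each (actual, predicted) combination.
counts = Counter(zip(actual, predicted))
tp = counts[(1, 1)]   # true positives
fn = counts[(1, 0)]   # false negatives
fp = counts[(0, 1)]   # false positives
tn = counts[(0, 0)]   # true negatives

print("            pred 1  pred 0")
print(f"actual 1    {tp:6d}  {fn:6d}")
print(f"actual 0    {fp:6d}  {tn:6d}")
```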

56
Q

Linearly separable data

A

Data that when graphed in two dimensions can be separated into two classes by a straight line

57
Q

Classification algorithm

A

An algorithm that aims to predict the labeled class to which each observation belongs

58
Q

Linear classifier

A

An algorithm that classifies objects based on a linear combination of the characteristics

59
Q

Decision boundary

A

A line or surface that separates different predicted classes

60
Q

Gradient descent algorithm

A

An optimization algorithm that repeatedly updates the parameters of the hypothesis function and measures the error until the error is as small as possible
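
A minimal sketch of the loop, assuming a one-parameter hypothesis y = w * x, mean squared error as the error measure, and a fixed learning rate. The data, starting value, learning rate, and step count are all arbitrary choices for illustration.

```python
# Hypothetical data generated by y = 2 * x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

def mean_squared_error(w):
    """Average squared gap between predictions w * x and targets y."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient(w):
    """Derivative of the mean squared error with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 0.0                 # initial guess
learning_rate = 0.05    # step size (assumed)
errors = []
for _ in range(200):
    errors.append(mean_squared_error(w))
    w -= learning_rate * gradient(w)   # step against the gradient

print(w, errors[0], errors[-1])
```

If the learning rate is too large the updates can overshoot and the error can grow instead of shrink.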

61
Q

Balanced dataset or class-balanced dataset

A

A dataset with a fairly even distribution of values across each class

62
Q

Unbalanced dataset or class-imbalanced dataset

A

A dataset with a skewed distribution of values across each class, thus creating a challenge for predictive modeling

63
Q

False positive rate, FPR

A

The probability that a negative instance will be incorrectly predicted as positive

64
Q

True positive rate, TPR

A

The probability that a positive instance will be correctly predicted as positive
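
Both rates can be estimated from confusion-matrix counts: TPR = TP / (TP + FN) and FPR = FP / (FP + TN). The counts below are invented for illustration.

```python
# Hypothetical confusion-matrix counts.
tp, fn = 3, 1   # actual positives: predicted positive / predicted negative
fp, tn = 1, 3   # actual negatives: predicted positive / predicted negative

tpr = tp / (tp + fn)   # share of actual positives caught
fpr = fp / (fp + tn)   # share of actual negatives wrongly flagged

print(tpr, fpr)
```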

65
Q

Probability threshold

A

A parameter that determines when to convert a predicted probability into a class label
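
A short sketch of applying a threshold. The function name, the default of 0.5, and the choice to treat a probability equal to the threshold as positive are assumptions of this example; conventions vary.

```python
def to_labels(probabilities, threshold=0.5):
    """Label an instance 1 when its probability reaches the threshold, else 0."""
    return [1 if p >= threshold else 0 for p in probabilities]

# Hypothetical predicted probabilities for the positive class.
probs = [0.10, 0.45, 0.50, 0.80]

print(to_labels(probs))        # default threshold of 0.5
print(to_labels(probs, 0.4))   # a lower threshold labels more instances positive
```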

66
Q

Precision

A

The proportion of positive predictions that are correct

67
Q

Recall

A

The proportion of instances in the positive class that were correctly predicted as positive
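
Precision and recall share a numerator but differ in the denominator: precision = TP / (TP + FP), recall = TP / (TP + FN). The counts below are made up.

```python
# Hypothetical counts from a classifier's predictions.
tp = 8   # predicted positive, actually positive
fp = 2   # predicted positive, actually negative
fn = 4   # predicted negative, actually positive

precision = tp / (tp + fp)   # of the positive predictions, how many were right
recall = tp / (tp + fn)      # of the actual positives, how many were found

print(precision, recall)
```

Recall is the same quantity as the true positive rate.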

68
Q

Precision-recall curve

A

A visualization created by plotting precision against recall while varying the probability threshold from 0 to 1, which is useful for class-imbalanced data

69
Q

One-vs-rest, OvR

A

A strategy for transforming a multiclass problem into several binary problems by training a single classifier per class

70
Q

Multinomial

A
71
Q

Regression

A

The process of estimating the relationship between one or more observed features and some continuous target variable

72
Q

Noise

A

Unexplained variability within a target variable or data

73
Q

Ordinary least squares, OLS

A

An optimization method that chooses the line minimizing the sum of squared vertical distances between each point and the line
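
For a single feature, the minimizing line has a well-known closed form: slope = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)², intercept = ȳ - slope · x̄. The sketch below applies it to invented data; `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical points lying exactly on y = 2x + 1.
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]
slope, intercept = fit_line(xs, ys)

print(slope, intercept)
```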

74
Q

Linear regression model

A

A regression model that represents the target variable as a linear combination of the features, each weighted by a coefficient

75
Q

Skewness

A

The measure of the degree of asymmetry of the distribution
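
The card doesn't say which measure is meant. The sketch below assumes the moment-based coefficient, the third central moment divided by the second central moment raised to the power 1.5. The data is invented.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2 ** 1.5 using population moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]        # mirror image around the mean
right_skewed = [1, 1, 1, 2, 10]    # long tail toward large values

print(skewness(symmetric), skewness(right_skewed))
```

A value of zero indicates symmetry, a positive value a longer right tail, and a negative value a longer left tail.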

76
Q

Kurtosis

A

Measure of the sharpness of a distribution’s peak

77
Q

Optimization

A

The process of finding the optimal values of the unknown coefficients

78
Q

Error term

A

Also known as the residual, the information in the target variable that isn’t explained by the features