Machine Learning Flashcards

1
Q

___ is intelligence demonstrated by machines, unlike the natural intelligence displayed by humans and animals, which involves consciousness and emotionality. The distinction between the former and the latter categories is often revealed by the acronym chosen. ‘Strong’ ___ is usually labelled as AGI (Artificial General Intelligence), while attempts to emulate ‘natural’ intelligence have been called ABI (Artificial Biological Intelligence). Leading ___ textbooks define the field as the study of “intelligent agents”: any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals. Colloquially, the term ___ is often used to describe machines (or computers) that mimic “cognitive” functions that humans associate with the human mind, such as “learning” and “problem solving”.

A

Artificial intelligence (AI)

2
Q

___ is the study of computer algorithms that improve automatically through experience and by the use of data. It is seen as a part of artificial intelligence. ___ algorithms build a model based on sample data, known as “training data”, in order to make predictions or decisions without being explicitly programmed to do so. ___ algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks.

A

Machine learning (ML)

3
Q

___ is part of a broader family of machine learning methods based on artificial neural networks with representation learning. Learning can be supervised, semi-supervised or unsupervised.

A

Deep learning (also known as deep structured learning)

4
Q

___ is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples. In ___, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). A ___ algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. An optimal scenario will allow for the algorithm to correctly determine the class labels for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a “reasonable” way (see inductive bias). This statistical quality of an algorithm is measured through the so-called generalization error.

A

Supervised learning

5
Q

___ is an approach to machine learning that combines a small amount of labeled data with a large amount of unlabeled data during training. ___ falls between unsupervised learning (with no labeled training data) and supervised learning (with only labeled training data). It is a special instance of weak supervision.

Unlabeled data, when used in conjunction with a small amount of labeled data, can produce considerable improvement in learning accuracy. The acquisition of labeled data for a learning problem often requires a skilled human agent (e.g. to transcribe an audio segment) or a physical experiment (e.g. determining the 3D structure of a protein or determining whether there is oil at a particular location). The cost associated with the labeling process thus may render large, fully labeled training sets infeasible, whereas acquisition of unlabeled data is relatively inexpensive. In such situations, ___ can be of great practical value. ___ is also of theoretical interest in machine learning and as a model for human learning.

A

Semi-supervised learning

6
Q

___ is a type of algorithm that learns patterns from untagged data. The hope is that through mimicry, the machine is forced to build a compact internal representation of its world. In contrast to supervised learning (SL), where data is tagged by a human, e.g. as “car” or “fish”, ___ exhibits self-organization that captures patterns as neuronal predilections or probability densities. The other levels in the supervision spectrum are reinforcement learning, where the machine is given only a numerical performance score as its guidance, and semi-supervised learning, where a smaller portion of the data is tagged. Two broad methods in ___ are neural networks and probabilistic methods.

A

Unsupervised learning (UL)

7
Q

___ is an area of machine learning concerned with how intelligent agents ought to take actions in an environment in order to maximize the notion of cumulative reward. ___ is one of three basic machine learning paradigms, alongside supervised learning and unsupervised learning.

___ differs from supervised learning in not needing labelled input/output pairs to be presented, and in not needing sub-optimal actions to be explicitly corrected. Instead, the focus is on finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge).

The environment is typically stated in the form of a Markov decision process (MDP), because many ___ algorithms for this context use dynamic programming techniques. The main difference between the classical dynamic programming methods and ___ algorithms is that the latter do not assume knowledge of an exact mathematical model of the MDP and they target large MDPs where exact methods become infeasible.

A

Reinforcement learning (RL)

8
Q

___ are computing systems vaguely inspired by the biological neural networks that constitute animal brains.

An ___ is based on a collection of connected units or nodes called artificial neurons, which loosely model the neurons in a biological brain. Each connection, like the synapses in a biological brain, can transmit a signal to other neurons. An artificial neuron that receives a signal then processes it and can signal neurons connected to it. The “signal” at a connection is a real number, and the output of each neuron is computed by some non-linear function of the sum of its inputs. The connections are called edges. Neurons and edges typically have a weight that adjusts as learning proceeds. The weight increases or decreases the strength of the signal at a connection. Neurons may have a threshold such that a signal is sent only if the aggregate signal crosses that threshold. Typically, neurons are aggregated into layers. Different layers may perform different transformations on their inputs. Signals travel from the first layer (the input layer), to the last layer (the output layer), possibly after traversing the layers multiple times.

A

Artificial neural networks (ANNs), usually simply called neural networks (NNs)

9
Q

In statistics, ___ is a linear regression model with a single explanatory variable. That is, it concerns two-dimensional sample points with one independent variable and one dependent variable (conventionally, the x and y coordinates in a Cartesian coordinate system) and finds a linear function (a non-vertical straight line) that, as accurately as possible, predicts the dependent variable values as a function of the independent variable. The adjective simple refers to the fact that the outcome variable is related to a single predictor.

A

Simple linear regression
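
A minimal pure-Python sketch of the usual closed-form (ordinary least squares) fit; the sample points are made up for illustration:

    # Fit y = b0 + b1*x by ordinary least squares (closed form).
    def simple_linear_regression(x, y):
        n = len(x)
        mean_x = sum(x) / n
        mean_y = sum(y) / n
        # Slope: co-deviation of x and y divided by squared deviation of x.
        num = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
        den = sum((xi - mean_x) ** 2 for xi in x)
        b1 = num / den
        b0 = mean_y - b1 * mean_x
        return b0, b1

    b0, b1 = simple_linear_regression([1, 2, 3, 4], [2.1, 3.9, 6.0, 8.1])
    print(b0, b1)   # roughly 0.0 and 2.01, i.e. y ≈ 2.01x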

10
Q

___ is an extension of simple linear regression. It is used when we want to predict the value of a variable based on the value of two or more other variables. The variable we want to predict is called the dependent variable (or sometimes, the outcome, target or criterion variable). The variables we are using to predict the value of the dependent variable are called the independent variables (or sometimes, the predictor, explanatory or regressor variables).

A

Multiple linear regression

11
Q

In statistics, ___ is a form of regression analysis in which the relationship between the independent variable x and the dependent variable y is modelled as an nth degree polynomial in x. ___ fits a nonlinear relationship between the value of x and the corresponding conditional mean of y, denoted E(y |x). Although ___ fits a nonlinear model to the data, as a statistical estimation problem it is linear, in the sense that the regression function E(y | x) is linear in the unknown parameters that are estimated from the data. For this reason, ___ is considered to be a special case of multiple linear regression.

A

Polynomial regression

12
Q

In machine learning and pattern recognition, a ___ is an individual measurable property or characteristic of a phenomenon being observed. Choosing informative, discriminating and independent ___ is a crucial step for effective algorithms in pattern recognition, classification and regression. ___ are usually numeric, but structural ___ such as strings and graphs are used in syntactic pattern recognition. The concept of “___” is related to that of explanatory variable used in statistical techniques such as linear regression.

A

Feature

13
Q

In digital circuits and machine learning, a ___ is a group of bits among which the legal combinations of values are only those with a single high (1) bit and all the others low (0). A similar implementation in which all bits are ‘1’ except one ‘0’ is sometimes called one-cold. In statistics, dummy variables represent a similar technique for representing categorical data.

A

One-hot, or one-hot encoding
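
A minimal pure-Python sketch; the category list and its ordering are arbitrary choices for illustration:

    def one_hot(value, categories):
        # Exactly one position is high (1); all others are low (0).
        return [1 if value == c else 0 for c in categories]

    colors = ["red", "green", "blue"]
    print(one_hot("green", colors))   # [0, 1, 0]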

14
Q

A ___ is a dataset of examples used during the learning process and is used to fit the parameters (e.g., weights) of, for example, a classifier.

For classification tasks, a supervised learning algorithm looks at the ___ to determine, or learn, the optimal combinations of variables that will generate a good predictive model. The goal is to produce a trained (fitted) model that generalizes well to new, unknown data. The fitted model is evaluated using “new” examples from the held-out datasets (validation and test datasets) to estimate the model’s accuracy in classifying new data. To reduce the risk of issues such as overfitting, the examples in the validation and test datasets should not be used to train the model.

A

Training dataset

15
Q

A ___ is a dataset that is independent of the training dataset, but that follows the same probability distribution as the training dataset. If a model fit to the training dataset also fits the ___ well, minimal overfitting has taken place. A better fitting of the training dataset as opposed to the ___ usually points to overfitting.

A ___ is therefore a set of examples used only to assess the performance (i.e. generalization) of a fully specified classifier. To do this, the final model is used to predict classifications of examples in the ___. Those predictions are compared to the examples’ true classifications to assess the model’s accuracy.

A

Test dataset

16
Q

___ is a method used to normalize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step.

A

Feature scaling

17
Q

___ is a feature scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1.

A

Normalization, also known as Min-Max scaling
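
A minimal pure-Python sketch (assumes the values are not all identical, so the range is non-zero):

    def min_max_scale(values):
        lo, hi = min(values), max(values)
        # Shift by the minimum, then rescale by the range; assumes hi > lo.
        return [(v - lo) / (hi - lo) for v in values]

    print(min_max_scale([10, 20, 25, 30]))   # [0.0, 0.5, 0.75, 1.0]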

18
Q

___ is a feature scaling technique where the values are centered around the mean with a unit standard deviation. This means that the mean of the attribute becomes zero and the resultant distribution has a unit standard deviation.

A

Standardization
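
A minimal pure-Python sketch using the population standard deviation (assumed non-zero):

    def standardize(values):
        n = len(values)
        mean = sum(values) / n
        std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
        # Center on the mean, then rescale to unit standard deviation.
        return [(v - mean) / std for v in values]

    print(standardize([2, 4, 6, 8]))   # mean 0, unit standard deviation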

19
Q

In statistics, ___ is “the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably”. An ___ model is a statistical model that contains more parameters than can be justified by the data. The essence of ___ is to have unknowingly extracted some of the residual variation (i.e. the noise) as if that variation represented underlying model structure.
In other words, the model remembers a huge number of examples instead of learning to notice features.
___ can occur in machine learning in particular, where the phenomenon is sometimes called “___-training”.

A

Overfitting

20
Q

___ occurs when a statistical model cannot adequately capture the underlying structure of the data. An ___ model is a model where some parameters or terms that would appear in a correctly specified model are missing. ___ would occur, for example, when fitting a linear model to non-linear data. Such a model will tend to have poor predictive performance.
___ can occur in machine learning in particular, where the phenomenon is sometimes called “___-training”.

A

Underfitting

21
Q

The ___ is a scenario in which the independent variables are multicollinear - a scenario in which two or more variables are highly correlated; in simple terms one variable can be predicted from the others.

To demonstrate the ___, take the case of gender (male/female) as an example. Including a dummy variable for each is redundant (if male is 0, then female is 1, and vice versa).

A

Dummy Variable trap

22
Q

In statistics, ___ is a method of fitting regression models in which the choice of predictive variables is carried out by an automatic procedure. In each step, a variable is considered for addition to or subtraction from the set of explanatory variables based on some prespecified criterion. Usually, this takes the form of a sequence of F-tests or t-tests, but other techniques are possible, such as adjusted R2, Akaike information criterion, Bayesian information criterion, Mallows’s Cp, PRESS, or false discovery rate.

A

Stepwise regression

23
Q

___ is a stepwise regression, which involves starting with no variables in the model, testing the addition of each variable using a chosen model fit criterion, adding the variable (if any) whose inclusion gives the most statistically significant improvement of the fit, and repeating this process until none improves the model to a statistically significant extent.

A

Forward selection

24
Q

___ is a stepwise regression, which involves starting with all candidate variables, testing the deletion of each variable using a chosen model fit criterion, deleting the variable (if any) whose loss gives the most statistically insignificant deterioration of the model fit, and repeating this process until no further variables can be deleted without a statistically significant loss of fit.

A

Backward elimination

25
Q

___ is a stepwise regression, a combination of forward selection and backward elimination testing at each step for variables to be included or excluded.

A

Bidirectional elimination

26
Q

An ___ is any statistical test in which the test statistic has an F-distribution under the null hypothesis. It is most often used when comparing statistical models that have been fitted to a data set, in order to identify the model that best fits the population from which the data were sampled. Exact “___” mainly arise when the models have been fitted to the data using least squares. The name was coined by George W. Snedecor, in honour of Sir Ronald A. Fisher. Fisher initially developed the statistic as the variance ratio in the 1920s.

A

F-test

27
Q

The ___ is any statistical hypothesis test in which the test statistic follows a Student’s t-distribution under the null hypothesis.

A ___ is most commonly applied when the test statistic would follow a normal distribution if the value of a scaling term in the test statistic were known. When the scaling term is unknown and is replaced by an estimate based on the data, the test statistic (under certain conditions) follows a Student’s t-distribution. The ___ can be used, for example, to determine if the means of two sets of data are significantly different from each other.

A

t-test

28
Q

The ___ is an estimator of prediction error and thereby relative quality of statistical models for a given set of data. Given a collection of models for the data, ___ estimates the quality of each model, relative to each of the other models. Thus, ___ provides a means for model selection.

___ is founded on information theory. When a statistical model is used to represent the process that generated the data, the representation will almost never be exact; so some information will be lost by using the model to represent the process. ___ estimates the relative amount of information lost by a given model: the less information a model loses, the higher the quality of that model.

In estimating the amount of information lost by a model, ___ deals with the trade-off between the goodness of fit of the model and the simplicity of the model. In other words, ___ deals with both the risk of overfitting and the risk of underfitting.

The ___ is named after the Japanese statistician Hirotugu ___, who formulated it. It now forms the basis of a paradigm for the foundations of statistics and is also widely used for statistical inference.

A

Akaike information criterion (AIC)

29
Q

In statistics, the ___ is a criterion for model selection among a finite set of models; the model with the lowest ___ is preferred. It is based, in part, on the likelihood function and it is closely related to the Akaike information criterion (AIC).

When fitting models, it is possible to increase the likelihood by adding parameters, but doing so may result in overfitting. Both ___ and AIC attempt to resolve this problem by introducing a penalty term for the number of parameters in the model; the penalty term is larger in ___ than in AIC.

The ___ was developed by Gideon E. ___ and published in a 1978 paper, where he gave a Bayesian argument for adopting it.

A

Bayesian information criterion (BIC) or Schwarz information criterion (also SIC, SBC, SBIC)

30
Q

In statistics, ___, named for Colin Lingwood ___, is used to assess the fit of a regression model that has been estimated using ordinary least squares. It is applied in the context of model selection, where a number of predictor variables are available for predicting some outcome, and the goal is to find the best model involving a subset of these predictors. A small value of ___ means that the model is relatively precise.

___ has been shown to be equivalent to Akaike information criterion in the special case of Gaussian linear regression.

A

Mallows’s Cp

31
Q

In statistics, the ___ statistic is a form of cross-validation used in regression analysis to provide a summary measure of the fit of a model to a sample of observations that were not themselves used to estimate the model. It is calculated as the sum of squares of the prediction residuals for those observations.

A fitted model having been produced, each observation in turn is removed and the model is refitted using the remaining observations. The out-of-sample predicted value is calculated for the omitted observation in each case, and the ___ statistic is calculated as the sum of the squares of all the resulting prediction errors.

Given this procedure, the ___ statistic can be calculated for a number of candidate model structures for the same dataset, with the lowest values of ___ indicating the best structures. Models that are over-parameterised (over-fitted) would tend to give small residuals for observations included in the model-fitting but large residuals for observations that are excluded. The ___ statistic has been extensively used in lazy learning and locally linear learning to speed up the assessment and the selection of the neighbourhood size.

A

predicted residual error sum of squares (PRESS)

32
Q

In statistics, the ___ is a method of conceptualizing the rate of type I errors in null hypothesis testing when conducting multiple comparisons. ___-controlling procedures are designed to control the expected proportion of “discoveries” (rejected null hypotheses) that are false (incorrect rejections of the null). ___-controlling procedures provide less stringent control of Type I errors compared to familywise error rate (FWER) controlling procedures (such as the Bonferroni correction), which control the probability of at least one Type I error. Thus, ___-controlling procedures have greater power, at the cost of increased numbers of Type I errors.

A

False discovery rate (FDR)

33
Q

In computer science, ___ is the concept that flawed or nonsense input data produces nonsense output, or “___”.

A

garbage in, garbage out (GIGO), or rubbish in, rubbish out (RIRO).

34
Q

In probability theory and intertemporal portfolio choice, the ___ is a formula for bet sizing that leads almost surely to higher wealth compared to any other strategy in the long run (i.e. approaching the limit as the number of bets goes to infinity). The ___ size is found by maximizing the expected value of the logarithm of wealth, which is equivalent to maximizing the expected geometric growth rate. The ___ is to bet a predetermined fraction of assets, and it can seem counterintuitive. It was described by J. L. ___ Jr, a researcher at Bell Labs, in 1956.

For an even money bet, the ___ computes the wager size percentage by multiplying the percent chance to win by two, then subtracting one-hundred percent. So, for a bet with a 70% chance to win the optimal wager size is 40% of available funds.

The practical use of the formula has been demonstrated for gambling and the same idea was used to explain diversification in investment management. In the 2000s, ___-style analysis became a part of mainstream investment theory and the claim has been made that well-known successful investors including Warren Buffett and Bill Gross use ___ methods. William Poundstone wrote an extensive popular account of the history of ___ betting.

A

Kelly criterion (or Kelly strategy or Kelly bet), also known as the scientific gambling method
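
A minimal pure-Python sketch of the even-money rule described above, together with the general form f* = p − (1 − p)/b for a bet paying b-to-1; the probabilities are illustrative:

    def kelly_even_money(p):
        # Even-money bet: wager fraction = 2p - 1 (percent chance to win
        # times two, minus one hundred percent).
        return 2 * p - 1

    def kelly_fraction(p, b):
        # General form for b-to-1 payout odds: f* = p - (1 - p) / b.
        return p - (1 - p) / b

    print(kelly_even_money(0.70))      # 0.4 -> bet 40% of available funds
    print(kelly_fraction(0.70, 1.0))   # same bet via the general form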

35
Q

In machine learning, ___ are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. Developed at AT&T Bell Laboratories by Vladimir Vapnik with colleagues (Boser et al., 1992, Guyon et al., 1993, Vapnik et al., 1997), ___ are one of the most robust prediction methods, being based on statistical learning frameworks or VC theory proposed by Vapnik and Chervonenkis (1974) and Vapnik (1982, 1995). Given a set of training examples, each marked as belonging to one of two categories, an ___ training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier (although methods such as Platt scaling exist to use ___ in a probabilistic classification setting). An ___ maps training examples to points in space so as to maximise the width of the gap between the two categories. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall.

A

Support-vector machines (SVMs, also support-vector networks)

36
Q

In machine learning, ___ are a class of algorithms for pattern analysis, whose best known member is the support-vector machine (SVM). The general task of pattern analysis is to find and study general types of relations (for example clusters, rankings, principal components, correlations, classifications) in datasets. For many algorithms that solve these tasks, the data in raw representation have to be explicitly transformed into feature vector representations via a user-specified feature map: in contrast, ___ methods require only a user-specified ___, i.e., a similarity function over pairs of data points in raw representation.

___ owe their name to the use of ___ functions, which enable them to operate in a high-dimensional, implicit feature space without ever computing the coordinates of the data in that space, but rather by simply computing the inner products between the images of all pairs of data in the feature space. This operation is often computationally cheaper than the explicit computation of the coordinates. This approach is called the “___ trick”. ___ functions have been introduced for sequence data, graphs, text, images, as well as vectors.

A

Kernel machines or kernel methods

37
Q

___ learning is one of the predictive modelling approaches used in statistics, data mining and machine learning. It uses a ___ (as a predictive model) to go from observations about an item (represented in the branches) to conclusions about the item’s target value (represented in the leaves). Tree models where the target variable can take a discrete set of values are called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. ___ where the target variable can take continuous values (typically real numbers) are called regression trees. The term Classification And Regression Tree (CART) analysis is an umbrella term used to refer to both of the above procedures. ___ are among the most popular machine learning algorithms given their intelligibility and simplicity.

A

Decision tree

38
Q

In information theory, the ___ of a random variable is the average level of “information”, “surprise”, or “uncertainty” inherent in the variable’s possible outcomes. The concept of ___ was introduced by Claude Shannon in his 1948 paper “A Mathematical Theory of Communication”. As an example, consider a biased coin with probability p of landing on heads and probability 1 − p of landing on tails. The maximum surprise is for p = 1/2, when there is no reason to expect one outcome over another, and in this case a coin flip has an ___ of one bit. The minimum surprise is when p = 0 or p = 1, when the event is known and the ___ is zero bits. Other values of p give entropies between zero and one bit.

A

Information entropy and is sometimes called Shannon entropy.
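
A minimal pure-Python sketch of the biased-coin example, computing entropy in bits:

    import math

    def binary_entropy(p):
        # H(p) = -p*log2(p) - (1-p)*log2(1-p); a certain event contributes zero.
        if p in (0.0, 1.0):
            return 0.0
        return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    print(binary_entropy(0.5))   # 1.0 bit: maximum surprise
    print(binary_entropy(1.0))   # 0.0 bits: the outcome is certain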

39
Q

___ are an ensemble learning method for classification, regression and other tasks that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean/average prediction (regression) of the individual trees. ___ correct for decision trees’ habit of overfitting to their training set. ___ generally outperform decision trees, but their accuracy is lower than gradient boosted trees. However, data characteristics can affect their performance.

A

Random forests or random decision forests

40
Q

In statistics and machine learning, ___ use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. Unlike a statistical ensemble in statistical mechanics, which is usually infinite, a machine learning ensemble consists of only a concrete finite set of alternative models, but typically allows for much more flexible structure to exist among those alternatives.

A

Ensemble learning or ensemble methods

41
Q

___ is a statistical model that in its basic form uses a logistic function to model a binary dependent variable, although many more complex extensions exist. In regression analysis, ___ is estimating the parameters of a logistic model (a form of binary regression). Mathematically, a binary logistic model has a dependent variable with two possible values, such as pass/fail which is represented by an indicator variable, where the two values are labeled “0” and “1”.

A

Logistic regression or logit regression

42
Q

A ___ is a mathematical function having a characteristic “S”-shaped curve or ___ curve. A common example of a ___ is the logistic function.

A

Sigmoid function

43
Q

In statistics, the ___ is a non-parametric classification method first developed by Evelyn Fix and Joseph Hodges in 1951, and later expanded by Thomas Cover. It is used for classification and regression. In both cases, the input consists of the k closest training examples in a data set. The output depends on whether ___ is used for classification or regression:

  • In ___ classification, the output is a class membership. An object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its ___ (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.
  • In ___ regression, the output is the property value for the object. This value is the average of the values of ___.

A

k-nearest neighbors algorithm (k-NN)
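
A minimal pure-Python sketch of k-NN classification with Euclidean distance and a plurality vote; the toy training points are made up:

    from collections import Counter

    def knn_classify(point, examples, k):
        # examples: list of (feature_vector, label) pairs.
        def dist(a, b):
            return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
        nearest = sorted(examples, key=lambda ex: dist(point, ex[0]))[:k]
        # Plurality vote among the k nearest neighbors.
        return Counter(label for _, label in nearest).most_common(1)[0][0]

    train = [((0, 0), "A"), ((0, 1), "A"), ((5, 5), "B"), ((6, 5), "B")]
    print(knn_classify((1, 1), train, k=3))   # "A"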

44
Q

In mathematics, the ___ between two points in Euclidean space is the length of a line segment between the two points. It can be calculated from the Cartesian coordinates of the points using the Pythagorean theorem, therefore occasionally being called the Pythagorean distance. These names come from the ancient Greek mathematicians Euclid and Pythagoras, although Euclid did not represent distances as numbers, and the connection from the Pythagorean theorem to distance calculation was not made until the 18th century.

A

Euclidean distance

45
Q

In geometry, a ___ is a subspace whose dimension is one less than that of its ambient space. If a space is 3-dimensional then its ___ are the 2-dimensional planes, while if the space is 2-dimensional, its ___ are the 1-dimensional lines. This notion can be used in any general space in which the concept of the dimension of a subspace is defined.

In different settings, ___ may have different properties. For instance, a ___ of an n-dimensional affine space is a flat subset with dimension n − 1, and it separates the space into two half-spaces, while a ___ of an n-dimensional projective space does not have this property.

A

Hyperplane

46
Q

In machine learning, the ___ of a single data point is defined to be the distance from the data point to a decision boundary. Note that there are many distances and decision boundaries that may be appropriate for certain datasets and goals. A ___ classifier is a classifier that explicitly utilizes the ___ of each example while learning a classifier. There are theoretical justifications (based on the VC dimension) as to why maximizing the ___ (under some suitable constraints) may be beneficial for machine learning and statistical inference algorithms.

There are many hyperplanes that might classify the data. One reasonable choice as the best hyperplane is the one that represents the largest separation, or ___, between the two classes. So we choose the hyperplane so that the distance from it to the nearest data point on each side is maximized. If such a hyperplane exists, it is known as the maximum-___ hyperplane and the linear classifier it defines is known as a maximum ___ classifier; or equivalently, the perceptron of optimal stability.

A

Margin

47
Q

The ___ avoids the explicit mapping to a higher dimension that is needed to get linear learning algorithms to learn a nonlinear function or decision boundary. For all x and x’ in the input space X, certain functions k(x, x’) can be expressed as an inner product in another space V. The word “___” is used in mathematics to denote a weighting function for a weighted sum or integral.

A

Kernel trick

48
Q

In statistics, ___ are a family of simple “probabilistic classifiers” based on applying Bayes’ theorem with strong (___) independence assumptions between the features. They are among the simplest Bayesian network models, but coupled with kernel density estimation, they can achieve higher accuracy levels.

A

Naive Bayes classifiers

49
Q

___ is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. When a decision tree is the weak learner, the resulting algorithm is called ___ trees, which usually outperforms random forest. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function.

A

Gradient boosting

50
Q

In the field of machine learning and specifically the problem of statistical classification, a ___ is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one (in unsupervised learning it is usually called a matching matrix). Each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class (or vice versa). The name stems from the fact that it makes it easy to see whether the system is ___ two classes (i.e. commonly mislabeling one as another).

It is a special kind of contingency table, with two dimensions (“actual” and “predicted”), and identical sets of “classes” in both dimensions (each combination of dimension and class is a variable in the contingency table).

A

Confusion matrix, also known as an error matrix
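
A minimal pure-Python sketch that tallies a 2×2 matrix from actual and predicted binary labels (rows are actual classes and columns predicted, one of the two layouts mentioned above):

    def confusion_matrix(actual, predicted):
        # m[a][p]: count of examples with actual class a predicted as p.
        m = [[0, 0], [0, 0]]
        for a, p in zip(actual, predicted):
            m[a][p] += 1
        return m

    actual    = [1, 0, 1, 1, 0, 0]
    predicted = [1, 0, 0, 1, 0, 1]
    # Layout: [[TN, FP], [FN, TP]]
    print(confusion_matrix(actual, predicted))   # [[2, 1], [1, 2]]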

51
Q

The ___ is the paradoxical finding that accuracy is not a good metric for predictive models when classifying in predictive analytics. This is because a simple model may have a high level of accuracy but be too crude to be useful. For example, if the incidence of category A is dominant, being found in 99% of cases, then predicting that every case is category A will have an accuracy of 99%. The underlying issue is that there is a class imbalance between the positive class and the negative class. Prior probabilities for these classes need to be accounted for in error analysis.

A

Accuracy paradox

52
Q

A ___ is a concept utilized in data science to visualize discrimination power. The ___ of a model represents the cumulative number of positive outcomes along the y-axis versus the corresponding cumulative number of a classifying parameter along the x-axis. The output is called a ___ curve. The ___ is distinct from the receiver operating characteristic (ROC) curve, which plots the true-positive rate against the false-positive rate.

___ are used in robustness evaluations of classification models.

A

Cumulative accuracy profile (CAP)

53
Q

A ___ is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The method was originally developed for operators of military radar receivers, which is why it is so named.

The ___ is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The true-positive rate is also known as sensitivity, recall or probability of detection in machine learning. The false-positive rate is also known as probability of false alarm and can be calculated as (1 − specificity).

A

Receiver operating characteristic curve, or ROC curve

54
Q

In cluster analysis, the ___ is a heuristic used in determining the number of clusters in a data set. The method consists of plotting the explained variation as a function of the number of clusters, and picking the ___ of the curve as the number of clusters to use.

A

Elbow method

55
Q

In mathematics and physics, the ___ of a plane figure is the arithmetic mean position of all the points in the figure. Informally, it is the point at which a cutout of the shape could be perfectly balanced on the tip of a pin.

The definition extends to any object in n-dimensional space: its ___ is the mean position of all the points in all of the coordinate directions.

A

Centroid or geometric center

56
Q

___ is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster centers or cluster centroid), serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. ___ clustering minimizes within-cluster variances (squared Euclidean distances) or WCSS (Within Cluster Sum of Squares), but not regular Euclidean distances, which would be the more difficult Weber problem: the mean optimizes squared errors, whereas only the geometric median minimizes Euclidean distances.

A

k-means clustering
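
A minimal NumPy sketch of the standard iteration (Lloyd's algorithm) with naive random seeding; it assumes no cluster ever becomes empty (see k-means++ below for better seeding):

    import numpy as np

    def k_means(X, k, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # Naive seeding: pick k distinct observations as initial centroids.
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # Assignment step: each point joins its nearest centroid's cluster.
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Update step: each centroid moves to the mean of its points.
            centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        return centroids, labels

    X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.8]])
    centroids, labels = k_means(X, k=2)
    print(labels)   # two clusters: points near the origin vs. points near (5, 5)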

57
Q

In data mining, ___ is an algorithm for choosing the initial values (or “seeds”) for the k-means clustering algorithm. It was proposed in 2007 by David Arthur and Sergei Vassilvitskii, as an approximation algorithm for the NP-hard k-means problem—a way of avoiding the sometimes poor clusterings found by the standard k-means algorithm.

The k-means algorithm has a major theoretical shortcoming: the approximation found can be arbitrarily bad with respect to the objective function compared to the optimal clustering. The ___ algorithm addresses this by specifying a procedure to initialize the cluster centers before proceeding with the standard k-means optimization iterations. With the ___ initialization, the algorithm is guaranteed to find a solution that is O(log k) competitive to the optimal k-means solution.

A

k-means++

58
Q

In data mining and statistics, ___ is a method of cluster analysis which seeks to build a ___ of clusters.

Strategies for ___ generally fall into two types:

  • Agglomerative: This is a “bottom-up” approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the ___.
  • Divisive: This is a “top-down” approach: all observations start in one cluster, and splits are performed recursively as one moves down the ___.

In general, the merges and splits are determined in a greedy manner. The results of ___ are usually presented in a dendrogram.

A

Hierarchical clustering (also called hierarchical cluster analysis or HCA)

59
Q

A ___ is a diagram representing a tree. This diagrammatic representation is frequently used in hierarchical clustering, where it illustrates the arrangement of the clusters produced by the corresponding analyses.

A

Dendrogram

60
Q

In statistics, ___ is a criterion applied in hierarchical cluster analysis. ___ is a special case of the objective function approach originally presented by Joe H. ___, Jr. ___ suggested a general agglomerative hierarchical clustering procedure, where the criterion for choosing the pair of clusters to merge at each step is based on the optimal value of an objective function. This objective function could be “any function that reflects the investigator’s purpose.” Many of the standard clustering procedures are contained in this very general class. To illustrate the procedure, ___ used the example where the objective function is the error sum of squares, and this example is known as ___.

A

Ward’s method or Ward’s minimum variance method

61
Q

___ is a rule-based machine learning method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using some measures of interestingness.

Based on the concept of strong rules, Rakesh Agrawal, Tomasz Imieliński and Arun Swami introduced association rules for discovering regularities between products in large-scale transaction data recorded by point-of-sale (POS) systems in supermarkets. For example, the rule {onions,potatoes} => {burger} found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes together, they are likely to also buy hamburger meat. Such information can be used as the basis for decisions about marketing activities such as, e.g., promotional pricing or product placements.

A

Association rule learning

62
Q

___ is an algorithm for frequent item set mining and association rule learning over relational databases.

___ uses a breadth-first search strategy to count the support of itemsets and uses a candidate generation function which exploits the downward closure property of support.

The frequent item sets determined by ___ can be used to determine association rules which highlight general trends in the database: this has applications in domains such as market basket analysis.

A

Apriori

63
Q

___ is an algorithm for frequent item set mining and association rule learning over relational databases.

___ is a depth-first search algorithm based on set intersection. It is suitable for both sequential as well as parallel execution with locality-enhancing properties.

The frequent item sets determined by ___ can be used to determine association rules which highlight general trends in the database: this has applications in domains such as market basket analysis.

A

Eclat (alt. ECLAT, stands for Equivalence Class Transformation)

64
Q

In probability theory and machine learning, the ___ is a problem in which a fixed limited set of resources must be allocated between competing (alternative) choices in a way that maximizes their expected gain, when each choice’s properties are only partially known at the time of allocation, and may become better understood as time passes or by allocating resources to the choice. This is a classic reinforcement learning problem that exemplifies the exploration–exploitation tradeoff dilemma. The name comes from imagining a gambler at a row of slot machines (sometimes known as “one-armed bandits”), who has to decide which machines to play, how many times to play each machine and in which order to play them, and whether to continue with the current machine or try a different machine. The ___ also falls into the broad category of stochastic scheduling.

A

Multi-armed bandit problem (sometimes called the K- or N-armed bandit problem)

65
Q

___, named after William R. ___, is a heuristic for choosing actions that addresses the exploration-exploitation dilemma in the multi-armed bandit problem. It consists of choosing the action that maximizes the expected reward with respect to a randomly drawn belief.

A

Thompson sampling
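
A minimal NumPy sketch for Bernoulli-reward arms with Beta(1, 1) priors; the hidden arm probabilities are made up for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    true_probs = [0.3, 0.5, 0.7]   # hidden reward probability of each arm
    wins = np.ones(3)              # Beta posterior alpha per arm (prior = 1)
    losses = np.ones(3)            # Beta posterior beta per arm (prior = 1)

    for _ in range(2000):
        # Draw a random belief about each arm, then play the best-looking arm.
        samples = rng.beta(wins, losses)
        arm = int(samples.argmax())
        reward = rng.random() < true_probs[arm]
        wins[arm] += reward
        losses[arm] += 1 - reward

    print(wins + losses - 2)   # pull counts: most pulls go to the 0.7 arm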

66
Q

___ is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of ___ data. The result is a computer capable of “understanding” the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.

A

Natural language processing (NLP)

67
Q

The ___ is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity. The ___ has also been used for computer vision.

The ___ is commonly used in methods of document classification where the (frequency of) occurrence of each word is used as a feature for training a classifier.

An early reference to “___” in a linguistic context can be found in Zellig Harris’s 1954 article on Distributional Structure.

A

Bag-of-words model
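
A minimal pure-Python sketch with naive whitespace tokenization and a hand-picked vocabulary:

    def bag_of_words(text, vocabulary):
        # Grammar and word order are discarded; only multiplicity is kept.
        tokens = text.lower().split()
        return [tokens.count(word) for word in vocabulary]

    vocab = ["the", "cat", "sat", "mat"]
    print(bag_of_words("The cat sat on the mat", vocab))   # [2, 1, 1, 1]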

68
Q

In statistical analysis of binary classification, the ___ is a measure of a test’s accuracy. It is calculated from the precision and recall of the test, where the precision is the number of true positive results divided by the number of all positive results, including those not identified correctly, and the recall is the number of true positive results divided by the number of all samples that should have been identified as positive. Precision is also known as positive predictive value, and recall is also known as sensitivity in diagnostic binary classification.

A

F-score or F-measure
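
A minimal pure-Python sketch of the balanced F1 score; it assumes at least one predicted positive and one actual positive:

    def f1_score(actual, predicted):
        tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
        fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
        fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        # F1 is the harmonic mean of precision and recall.
        return 2 * precision * recall / (precision + recall)

    print(f1_score([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 0, 1]))   # ≈ 0.667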

69
Q

In artificial neural networks, the ___ of a node defines the output of that node given an input or set of inputs. A standard integrated circuit can be seen as a digital network of ___ that can be “ON” (1) or “OFF” (0), depending on input. This is similar to the linear perceptron in neural networks. However, only nonlinear ___ allow such networks to compute nontrivial problems using only a small number of nodes, and such ___ are called nonlinearities.

A

Activation function

70
Q

In the context of artificial neural networks, the ___ is an activation function defined as the positive part of its argument: f(x) = max(0, x), where x is the input to a neuron.

A

Rectifier or ReLU (Rectified Linear Unit) activation function. Also known as a ramp function and is analogous to half-wave rectification in electrical engineering.
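
A minimal pure-Python sketch:

    def relu(x):
        # Positive part of the argument: max(0, x).
        return max(0.0, x)

    print([relu(x) for x in (-2.0, -0.5, 0.0, 1.5)])   # [0.0, 0.0, 0.0, 1.5]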

71
Q

A ___ is a mathematical function having a characteristic “S”-shaped curve or ___ curve. A common example of a ___ is the logistic function, defined by the formula S(x) = 1/(1 + e^(−x)) = 1 − S(−x). A wide variety of ___, including the logistic and hyperbolic tangent functions, have been used as the activation function of artificial neurons.

A

Sigmoid function
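
A minimal pure-Python sketch of the logistic function, checking the S(x) = 1 − S(−x) symmetry numerically:

    import math

    def logistic(x):
        # S(x) = 1 / (1 + e^(-x)); squashes any real input into (0, 1).
        return 1.0 / (1.0 + math.exp(-x))

    print(logistic(0.0))                       # 0.5
    print(logistic(2.0), 1 - logistic(-2.0))   # equal: S(x) = 1 - S(-x)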

72
Q

In machine learning, the ___ is an algorithm for supervised learning of binary classifiers. A binary classifier is a function which can decide whether or not an input, represented by a vector of numbers, belongs to some specific class. It is a type of linear classifier, i.e. a classification algorithm that makes its predictions based on a linear predictor function combining a set of weights with the feature vector.

A

Perceptron

73
Q

In mathematical optimization and decision theory, a ___ is a function that maps an event or values of one or more variables onto a real number intuitively representing some “___” associated with the event. An optimization problem seeks to minimize a ___. An objective function is either a ___ or its negative (in specific domains, variously called a reward function, a profit function, a utility function, a fitness function, etc.), in which case it is to be maximized.

A

Loss function or cost function (sometimes also called an error function)

74
Q

In terms of artificial neural networks, an ___ refers to one cycle through the full training dataset. Usually, training a neural network takes more than a few ___. In other words, if we feed a neural network the training data for more than one ___ in different patterns, we hope for a better generalization when given a new “unseen” input (test data). An ___ is often confused with an iteration: the number of iterations is the number of batches or steps through partitioned packets of the training data needed to complete one ___. Heuristically, one motivation is that (especially for large but finite training sets) it gives the network a chance to see the previous data to readjust the model parameters, so that the model is not biased towards the last few data points during training.

A

Epoch

75
Q

The ___ refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces that do not occur in low-dimensional settings such as the three-dimensional physical space of everyday experience. The expression was coined by Richard E. Bellman when considering problems in dynamic programming.

A

Curse of dimensionality

76
Q

___ is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function. The idea is to take repeated steps in the opposite direction of the ___ (or approximate ___) of the function at the current point, because this is the direction of steepest ___.

A

Gradient descent
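
A minimal pure-Python sketch minimizing the one-variable function f(x) = (x − 3)², whose gradient is 2(x − 3); the learning rate is chosen by hand:

    def gradient_descent(grad, x0, learning_rate=0.1, n_steps=100):
        x = x0
        for _ in range(n_steps):
            # Take a step in the direction opposite the gradient.
            x -= learning_rate * grad(x)
        return x

    # Minimize f(x) = (x - 3)^2; its gradient is f'(x) = 2 * (x - 3).
    print(gradient_descent(lambda x: 2 * (x - 3), x0=0.0))   # ≈ 3.0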

77
Q

In deep learning, a ___ is a class of artificial neural network, most commonly applied to analyze visual imagery. They are also known as shift invariant or space invariant artificial neural networks (SIANN), based on the shared-weight architecture of the ___ kernels or filters that slide along input features and provide translation equivariant responses known as feature maps. Counter-intuitively, most ___ are only equivariant, as opposed to invariant, to translation.

A

Convolutional neural network (CNN, or ConvNet)

78
Q

In information theory, the ___ between two probability distributions p and q over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set if a coding scheme used for the set is optimized for an estimated probability distribution q, rather than the true distribution p.

A

Cross-entropy
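
A minimal pure-Python sketch computing cross-entropy in bits; the two distributions are made up, and q is assumed positive wherever p is:

    import math

    def cross_entropy(p, q):
        # H(p, q) = -sum_i p_i * log2(q_i): average bits needed when coding
        # events drawn from p with a code optimized for q.
        return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

    p = [0.5, 0.5]
    q = [0.9, 0.1]
    print(cross_entropy(p, p))   # 1.0, which equals the entropy of p
    print(cross_entropy(p, q))   # ≈ 1.737: the cost of using the wrong q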

79
Q

In statistics, the ___ measures the goodness of fit of a statistical model to a sample of data for given values of the unknown parameters. It is formed from the joint probability distribution of the sample, but viewed and used as a function of the parameters only, thus treating the random variables as fixed at the observed values.

The ___ describes a hypersurface whose peak, if it exists, represents the combination of model parameter values that maximize the probability of drawing the sample obtained. The procedure for obtaining these arguments of the maximum of the ___ is known as maximum ___ estimation, which for computational convenience is usually done using the natural logarithm of the ___, known as the log-___ function. Additionally, the shape and curvature of the ___ surface represent information about the stability of the estimates, which is why the ___ is often plotted as part of a statistical analysis.

A

Likelihood function (often simply called the likelihood)

80
Q

The ___ of a collection of points in a real coordinate space are a sequence of p unit vectors, where the i-th vector is the direction of a line that best fits the data while being orthogonal to the first i-1 vectors.

___ is used in exploratory data analysis and for making predictive models. It is commonly used for dimensionality reduction by projecting each data point onto only the first few ___ to obtain lower-dimensional data while preserving as much of the data’s variation as possible. The first ___ can equivalently be defined as a direction that maximizes the variance of the projected data. The i-th ___ can be taken as a direction orthogonal to the first i-1 ___ that maximizes the variance of the projected data.

A

Principal component analysis (PCA)
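
A minimal NumPy sketch via eigendecomposition of the covariance matrix (in practice the SVD of the centered data is the more numerically stable route); the sample points are illustrative:

    import numpy as np

    def pca(X, n_components):
        # Center the data, then diagonalize its covariance matrix.
        Xc = X - X.mean(axis=0)
        cov = np.cov(Xc, rowvar=False)
        eigenvalues, eigenvectors = np.linalg.eigh(cov)
        # eigh returns eigenvalues in ascending order; take the largest ones.
        order = np.argsort(eigenvalues)[::-1][:n_components]
        components = eigenvectors[:, order]
        return Xc @ components   # project onto the principal components

    X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
    print(pca(X, n_components=1))   # 1-D projection preserving the most variance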

81
Q

___ is a generalization of Fisher’s ___, a method used in statistics and other fields, to find a linear combination of features that characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier, or, more commonly, for dimensionality reduction before later classification.

A

Linear discriminant analysis (LDA), normal discriminant analysis (NDA), or discriminant function analysis

82
Q

In linear algebra, an ___ of a linear transformation is a nonzero vector that changes at most by a scalar factor when that linear transformation is applied to it. The corresponding ___ is the factor by which the ___ is scaled.

A

Eigenvector or characteristic vector and eigenvalue, often denoted by the lambda character

83
Q

In ___, the original sample is randomly partitioned into k equal sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data.

A

k-fold cross-validation
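
A minimal pure-Python sketch that just generates the k train/test index splits (shuffling and the actual model fitting are omitted):

    def k_fold_indices(n, k):
        # Partition indices 0..n-1 into k (nearly) equal folds.
        folds = [list(range(i, n, k)) for i in range(k)]
        for i in range(k):
            test = folds[i]
            train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
            yield train, test

    for train_idx, test_idx in k_fold_indices(n=6, k=3):
        print(test_idx, "held out; train on", train_idx)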

84
Q

The traditional way of performing hyperparameter optimization has been ___, which is simply an exhaustive search through a manually specified subset of the hyperparameter space of a learning algorithm. A ___ algorithm must be guided by some performance metric, typically measured by cross-validation on the training set or evaluation on a held-out validation set.

A

Grid search, or a parameter sweep
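
A minimal pure-Python sketch of the sweep; evaluate is a hypothetical stand-in for cross-validated scoring:

    import itertools

    def grid_search(param_grid, evaluate):
        # Exhaustively score every combination in the manually specified grid.
        best_score, best_params = float("-inf"), None
        keys = list(param_grid)
        for values in itertools.product(*(param_grid[k] for k in keys)):
            params = dict(zip(keys, values))
            score = evaluate(params)   # e.g. mean cross-validation accuracy
            if score > best_score:
                best_score, best_params = score, params
        return best_params, best_score

    grid = {"C": [0.1, 1.0, 10.0], "gamma": [0.01, 0.1]}
    # Toy stand-in objective; a real evaluate() would train and score a model.
    print(grid_search(grid, lambda p: -abs(p["C"] - 1.0) - p["gamma"]))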

85
Q

In machine learning, ___ is the problem of choosing a set of optimal ___ for a learning algorithm. A ___ is a parameter whose value is used to control the learning process. By contrast, the values of other parameters (typically node weights) are learned.

A

Hyperparameter optimization or tuning

86
Q

In statistics and machine learning, ___ is the use of information in the model training process which would not be expected to be available at prediction time, causing the predictive scores (metrics) to overestimate the model’s utility when run in a production environment.

___ is often subtle and indirect, making it hard to detect and eliminate. ___ can cause a statistician or modeler to select a suboptimal model, which could be outperformed by a ___-free model.

A

Leakage (also known as data leakage or target leakage)

87
Q

In statistics, ___ is the reduction in the effects of sampling variation. In regression analysis, a fitted relationship appears to perform less well on a new data set than on the data set used for fitting. In particular the value of the coefficient of determination ‘shrinks’. This idea is complementary to overfitting and, separately, to the standard adjustment made in the coefficient of determination to compensate for the subjunctive effects of further sampling, like controlling for the potential of new explanatory terms improving the model by chance: that is, the adjustment formula itself provides “___.” But the adjustment formula yields an artificial ___.

A ___ estimator is an estimator that, either explicitly or implicitly, incorporates the effects of ___. In loose terms this means that a naive or raw estimate is improved by combining it with other information. The term relates to the notion that the improved estimate is made closer to the value supplied by the ‘other information’ than the raw estimate. In this sense, ___ is used to regularize ill-posed inference problems.

A

Shrinkage