Machine Learning Flashcards
___ is intelligence demonstrated by machines, unlike the natural intelligence displayed by humans and animals, which involves consciousness and emotionality. The distinction between the former and the latter categories is often revealed by the acronym chosen. ‘Strong’ ___ is usually labelled AGI (Artificial General Intelligence), while attempts to emulate ‘natural’ intelligence have been called ABI (Artificial Biological Intelligence). Leading ___ textbooks define the field as the study of “intelligent agents”: any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals. Colloquially, the term ___ is often used to describe machines (or computers) that mimic “cognitive” functions that humans associate with the human mind, such as “learning” and “problem solving”.
Artificial intelligence (AI)
___ is the study of computer algorithms that improve automatically through experience and by the use of data. It is seen as a part of artificial intelligence. ___ algorithms build a model based on sample data, known as “training data”, in order to make predictions or decisions without being explicitly programmed to do so. ___ algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or infeasible to develop conventional algorithms to perform the needed tasks.
Machine learning (ML)
___ is part of a broader family of machine learning methods based on artificial neural networks with representation learning. Learning can be supervised, semi-supervised or unsupervised.
Deep learning (also known as deep structured learning)
___ is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples. In ___, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). A ___ algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. An optimal scenario will allow for the algorithm to correctly determine the class labels for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a “reasonable” way (see inductive bias). This statistical quality of an algorithm is measured through the so-called generalization error.
Supervised learning
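A minimal Python sketch of the idea; scikit-learn is an assumed dependency and the data is invented for illustration:

```python
# Minimal supervised-learning sketch (scikit-learn assumed installed).
from sklearn.linear_model import LogisticRegression

# Training data: input vectors paired with desired output labels.
X_train = [[0.1, 1.2], [0.8, 0.4], [0.9, 0.3], [0.2, 1.1]]
y_train = [0, 1, 1, 0]

model = LogisticRegression()
model.fit(X_train, y_train)           # infer a function from labeled examples

print(model.predict([[0.85, 0.35]]))  # map a new, unseen example to a label
```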
___ is an approach to machine learning that combines a small amount of labeled data with a large amount of unlabeled data during training. ___ falls between unsupervised learning (with no labeled training data) and supervised learning (with only labeled training data). It is a special instance of weak supervision.
Unlabeled data, when used in conjunction with a small amount of labeled data, can produce considerable improvement in learning accuracy. The acquisition of labeled data for a learning problem often requires a skilled human agent (e.g. to transcribe an audio segment) or a physical experiment (e.g. determining the 3D structure of a protein or determining whether there is oil at a particular location). The cost associated with the labeling process thus may render large, fully labeled training sets infeasible, whereas acquisition of unlabeled data is relatively inexpensive. In such situations, ___ can be of great practical value. ___ is also of theoretical interest in machine learning and as a model for human learning.
Semi-supervised learning
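A minimal sketch, assuming scikit-learn is available; LabelPropagation is one of several semi-supervised algorithms, and the data is invented for illustration:

```python
# Semi-supervised sketch: a few labeled points plus unlabeled ones (label -1).
from sklearn.semi_supervised import LabelPropagation

X = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
y = [0, -1, -1, 1, -1, -1]          # -1 marks the unlabeled examples

model = LabelPropagation().fit(X, y)
print(model.transduction_)           # inferred labels for every example
```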
___ is a type of algorithm that learns patterns from untagged data. The hope is that, through mimicry, the machine is forced to build a compact internal representation of its world. In contrast to supervised learning (SL), where data is tagged by a human, e.g. as “car” or “fish”, ___ exhibits self-organization that captures patterns as neuronal predilections or probability densities. The other levels in the supervision spectrum are reinforcement learning, where the machine is given only a numerical performance score as its guidance, and semi-supervised learning, where a smaller portion of the data is tagged. Two broad methods in ___ are neural networks and probabilistic methods.
Unsupervised learning (UL)
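As an illustration, k-means clustering is a classic unsupervised method; a minimal sketch assuming scikit-learn is available, with invented data:

```python
# Unsupervised sketch: k-means discovers cluster structure in untagged data.
from sklearn.cluster import KMeans

X = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]]    # no labels provided
print(KMeans(n_clusters=2, n_init=10).fit_predict(X))   # e.g. [0 0 1 1]
```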
___ is an area of machine learning concerned with how intelligent agents ought to take actions in an environment in order to maximize the notion of cumulative reward. ___ is one of three basic machine learning paradigms, alongside supervised learning and unsupervised learning.
___ differs from supervised learning in not needing labelled input/output pairs to be presented, and in not needing sub-optimal actions to be explicitly corrected. Instead, the focus is on finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge).
The environment is typically stated in the form of a Markov decision process (MDP), because many ___ algorithms for this context use dynamic programming techniques. The main difference between the classical dynamic programming methods and ___ algorithms is that the latter do not assume knowledge of an exact mathematical model of the MDP and they target large MDPs where exact methods become infeasible.
Reinforcement learning (RL)
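One concrete RL algorithm is tabular Q-learning. The sketch below is illustrative only (the state/action counts and hyperparameters are invented) and shows the exploration/exploitation choice and the value update:

```python
# Tabular Q-learning sketch for a toy problem.
import random

n_states, n_actions = 5, 2
alpha, gamma, epsilon = 0.1, 0.9, 0.1   # learning rate, discount, exploration
Q = [[0.0] * n_actions for _ in range(n_states)]

def choose_action(state):
    # Exploration vs. exploitation: occasionally act at random.
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[state][a])

def update(state, action, reward, next_state):
    # Move Q(s, a) toward the reward plus the discounted best next value.
    best_next = max(Q[next_state])
    Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])

# One illustrative transition: in state 0, the chosen action earns reward 1.
update(0, choose_action(0), 1.0, 2)
print(Q[0])
```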
___ are computing systems vaguely inspired by the biological neural networks that constitute animal brains.
An ___ is based on a collection of connected units or nodes called artificial neurons, which loosely model the neurons in a biological brain. Each connection, like the synapses in a biological brain, can transmit a signal to other neurons. An artificial neuron that receives a signal then processes it and can signal neurons connected to it. The “signal” at a connection is a real number, and the output of each neuron is computed by some non-linear function of the sum of its inputs. The connections are called edges. Neurons and edges typically have a weight that adjusts as learning proceeds. The weight increases or decreases the strength of the signal at a connection. Neurons may have a threshold such that a signal is sent only if the aggregate signal crosses that threshold. Typically, neurons are aggregated into layers. Different layers may perform different transformations on their inputs. Signals travel from the first layer (the input layer), to the last layer (the output layer), possibly after traversing the layers multiple times.
Artificial neural networks (ANNs), usually simply called neural networks (NNs)
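A minimal forward pass in NumPy (assumed installed), showing each layer applying a non-linear function to a weighted sum of its inputs; the sizes and weights are invented:

```python
# Forward pass of a tiny two-layer network.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)   # input layer -> hidden layer
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)   # hidden layer -> output layer

def forward(x):
    h = np.tanh(W1 @ x + b1)     # non-linear function of the weighted sum
    return np.tanh(W2 @ h + b2)  # output activation

print(forward(np.array([0.5, -1.0])))
```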
In statistics, ___ is a linear regression model with a single explanatory variable. That is, it concerns two-dimensional sample points with one independent variable and one dependent variable (conventionally, the x and y coordinates in a Cartesian coordinate system) and finds a linear function (a non-vertical straight line) that, as accurately as possible, predicts the dependent variable values as a function of the independent variable. The adjective simple refers to the fact that the outcome variable is related to a single predictor.
Simple linear regression
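The least-squares line has a closed form; a small self-contained sketch with invented data:

```python
# Closed-form ordinary least squares for a single predictor.
def simple_linear_regression(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
          / sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    return slope, intercept

print(simple_linear_regression([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8]))  # about (1.94, 0.15)
```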
___ is an extension of simple linear regression. It is used when we want to predict the value of a variable based on the value of two or more other variables. The variable we want to predict is called the dependent variable (or sometimes, the outcome, target or criterion variable). The variables we are using to predict the value of the dependent variable are called the independent variables (or sometimes, the predictor, explanatory or regressor variables).
Multiple linear regression
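A minimal sketch using NumPy's least-squares solver (assumed installed), with two predictors and an intercept column; the data is invented so that y is roughly 2*x1 + 3*x2:

```python
# Multiple linear regression via least squares.
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])  # two predictors
y = np.array([8.1, 7.0, 17.9, 16.8])

A = np.column_stack([X, np.ones(len(y))])    # append an intercept column
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coef)                                  # weights for x1, x2 and intercept
```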
In statistics, ___ is a form of regression analysis in which the relationship between the independent variable x and the dependent variable y is modelled as an nth degree polynomial in x. ___ fits a nonlinear relationship between the value of x and the corresponding conditional mean of y, denoted E(y | x). Although ___ fits a nonlinear model to the data, as a statistical estimation problem it is linear, in the sense that the regression function E(y | x) is linear in the unknown parameters that are estimated from the data. For this reason, ___ is considered to be a special case of multiple linear regression.
Polynomial regression
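A short NumPy sketch (assumed installed): the model is nonlinear in x, but the fit itself is the same linear least squares, now over the polynomial terms; the data is invented:

```python
# Polynomial regression: nonlinear in x, linear in the coefficients.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 9.2, 19.1, 33.0])   # roughly y = 2x^2 + 1

coeffs = np.polyfit(x, y, deg=2)   # least squares over [x^2, x, 1]
print(np.polyval(coeffs, 5.0))     # predicted conditional mean at x = 5
```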
In machine learning and pattern recognition, a ___ is an individual measurable property or characteristic of a phenomenon being observed. Choosing informative, discriminating and independent ___ is a crucial step for effective algorithms in pattern recognition, classification and regression. ___ are usually numeric, but structural ___ such as strings and graphs are used in syntactic pattern recognition. The concept of “___” is related to that of explanatory variable used in statistical techniques such as linear regression.
Feature
In digital circuits and machine learning, a ___ is a group of bits among which the legal combinations of values are only those with a single high (1) bit and all the others low (0). A similar implementation in which all bits are ‘1’ except one ‘0’ is sometimes called one-cold. In statistics, dummy variables represent a similar technique for representing categorical data.
One-hot, or one-hot encoding
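A minimal hand-rolled sketch (most libraries provide an equivalent):

```python
# One-hot encoding: a single high bit, all others low.
def one_hot(value, categories):
    return [1 if value == c else 0 for c in categories]

print(one_hot("green", ["red", "green", "blue"]))   # [0, 1, 0]
```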
A ___ is a dataset of examples used during the learning process and is used to fit the parameters (e.g., weights) of, for example, a classifier.
For classification tasks, a supervised learning algorithm looks at the ___ to determine, or learn, the optimal combinations of variables that will generate a good predictive model. The goal is to produce a trained (fitted) model that generalizes well to new, unknown data. The fitted model is evaluated using “new” examples from the held-out datasets (validation and test datasets) to estimate the model’s accuracy in classifying new data. To reduce the risk of issues such as overfitting, the examples in the validation and test datasets should not be used to train the model.
Training dataset
A ___ is a dataset that is independent of the training dataset, but that follows the same probability distribution as the training dataset. If a model fit to the training dataset also fits the ___ well, minimal overfitting has taken place. A better fitting of the training dataset as opposed to the ___ usually points to overfitting.
A ___ is therefore a set of examples used only to assess the performance (i.e. generalization) of a fully specified classifier. To do this, the final model is used to predict classifications of examples in the ___. Those predictions are compared to the examples’ true classifications to assess the model’s accuracy.
Test dataset
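A hedged sketch of the three-way split described in the two cards above; the 70/15/15 proportions are an arbitrary illustration, and the integers stand in for labeled examples:

```python
# Shuffle once, then carve out train / validation / test splits.
import random

data = list(range(100))
random.seed(0)
random.shuffle(data)

train = data[:70]         # fit model parameters here
validation = data[70:85]  # tune hyperparameters, watch for overfitting
test = data[85:]          # final, untouched estimate of generalization
```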
___ is a method used to normalize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step.
Feature scaling
___ is a feature scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1.
Normalization, also known as Min-Max scaling
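A minimal sketch of the computation:

```python
# Min-max scaling: shift and rescale every value into the [0, 1] range.
def min_max_scale(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_scale([10, 15, 20]))   # [0.0, 0.5, 1.0]
```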
___ is a feature scaling technique in which values are centered around the mean with unit standard deviation: after scaling, the attribute has a mean of zero and a standard deviation of one.
Standardization
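A minimal sketch of the computation (using the population standard deviation):

```python
# Standardization (z-scoring): subtract the mean, divide by the
# standard deviation, giving mean 0 and unit standard deviation.
def standardize(values):
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

print(standardize([10, 20, 30]))   # approximately [-1.22, 0.0, 1.22]
```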
In statistics, ___ is “the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably”. An ___ model is a statistical model that contains more parameters than can be justified by the data. The essence of ___ is to have unknowingly extracted some of the residual variation (i.e. the noise) as if that variation represented underlying model structure.
In other words, the model memorizes specific training examples instead of learning the general features that distinguish them.
___ can occur in machine learning in particular, where the phenomenon is sometimes called “over-training”.
Overfitting
___ occurs when a statistical model cannot adequately capture the underlying structure of the data. An ___ model is a model where some parameters or terms that would appear in a correctly specified model are missing. ___ would occur, for example, when fitting a linear model to non-linear data. Such a model will tend to have poor predictive performance.
___ can occur in machine learning in particular, where the phenomenon is sometimes called “under-training”.
Underfitting
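A small NumPy demonstration of both phenomena on the two cards above (the data and degrees are invented): degree 1 underfits quadratic data, degree 9 memorizes its noise, and held-out error exposes both:

```python
# Under- vs over-fitting on noisy quadratic data.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = x ** 2 + rng.normal(scale=0.05, size=20)
x_tr, y_tr, x_te, y_te = x[::2], y[::2], x[1::2], y[1::2]

for degree in (1, 2, 9):
    coeffs = np.polyfit(x_tr, y_tr, degree)
    test_err = np.mean((np.polyval(coeffs, x_te) - y_te) ** 2)
    print(degree, test_err)   # degree 2 should generalize best
```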
The ___ is a scenario in which the independent variables are multicollinear, that is, two or more variables are highly correlated; in simple terms, one variable can be predicted from the others.
To demonstrate the ___, take gender (male/female) as an example. Including a dummy variable for each category is redundant: if male is 0 then female must be 1, and vice versa.
Dummy Variable trap
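A minimal sketch of the usual fix: encode k categories with k-1 dummy columns, letting the first category be the all-zeros baseline:

```python
# Drop-first dummy encoding avoids the redundant column.
def dummies_drop_first(value, categories):
    return [1 if value == c else 0 for c in categories[1:]]

print(dummies_drop_first("female", ["male", "female"]))  # [1]; "male" is [0]
```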
In statistics, ___ is a method of fitting regression models in which the choice of predictive variables is carried out by an automatic procedure. In each step, a variable is considered for addition to or subtraction from the set of explanatory variables based on some prespecified criterion. Usually, this takes the form of a sequence of F-tests or t-tests, but other techniques are possible, such as adjusted R2, Akaike information criterion, Bayesian information criterion, Mallows’s Cp, PRESS, or false discovery rate.
Stepwise regression
___ is a stepwise regression, which involves starting with no variables in the model, testing the addition of each variable using a chosen model fit criterion, adding the variable (if any) whose inclusion gives the most statistically significant improvement of the fit, and repeating this process until none improves the model to a statistically significant extent.
Forward selection
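A hedged sketch of greedy forward selection (NumPy assumed installed, data invented). For simplicity it scores candidates by R² improvement against an arbitrary 0.01 threshold, where classical stepwise regression would use F- or t-tests; backward elimination runs the same loop in reverse, starting from all predictors:

```python
# Greedy forward selection scored by R^2 improvement.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))                   # four candidate predictors
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(scale=0.1, size=50)

def r_squared(cols):
    A = np.column_stack([X[:, cols], np.ones(len(y))])
    resid = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

selected, remaining = [], [0, 1, 2, 3]
while remaining:
    best = max(remaining, key=lambda c: r_squared(selected + [c]))
    if r_squared(selected + [best]) - r_squared(selected) < 0.01:
        break                                  # no meaningful improvement left
    selected.append(best)
    remaining.remove(best)

print(selected)                                # likely [0, 2]
```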
___ is a stepwise regression, which involves starting with all candidate variables, testing the deletion of each variable using a chosen model fit criterion, deleting the variable (if any) whose loss gives the most statistically insignificant deterioration of the model fit, and repeating this process until no further variables can be deleted without a statistically significant loss of fit.
Backward elimination
___ is a stepwise regression, a combination of forward selection and backward elimination testing at each step for variables to be included or excluded.
Bidirectional elimination
An ___ is any statistical test in which the test statistic has an F-distribution under the null hypothesis. It is most often used when comparing statistical models that have been fitted to a data set, in order to identify the model that best fits the population from which the data were sampled. Exact ___s mainly arise when the models have been fitted to the data using least squares. The name was coined by George W. Snedecor, in honour of Sir Ronald A. Fisher. Fisher initially developed the statistic as the variance ratio in the 1920s.
F-test
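One common exact F-test is one-way ANOVA; a minimal sketch assuming SciPy is available, with invented data:

```python
# One-way ANOVA F-test: do three groups share a common mean?
from scipy import stats

f_stat, p_value = stats.f_oneway([5.1, 4.9, 5.3], [5.8, 6.0, 5.7], [4.2, 4.4, 4.1])
print(f_stat, p_value)   # large F, small p: the group means likely differ
```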
The ___ is any statistical hypothesis test in which the test statistic follows a Student’s t-distribution under the null hypothesis.
A ___ is most commonly applied when the test statistic would follow a normal distribution if the value of a scaling term in the test statistic were known. When the scaling term is unknown and is replaced by an estimate based on the data, the test statistic (under certain conditions) follows a Student's t-distribution. The ___ can be used, for example, to determine whether the means of two sets of data are significantly different from each other.
t-test
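A minimal two-sample sketch assuming SciPy is available, with invented data:

```python
# Two-sample t-test: are the two group means significantly different?
from scipy import stats

group_a = [5.1, 4.9, 5.3, 5.0, 5.2]
group_b = [5.8, 6.0, 5.7, 6.1, 5.9]

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(t_stat, p_value)   # small p-value: the two means likely differ
```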
The ___ is an estimator of prediction error and thereby relative quality of statistical models for a given set of data. Given a collection of models for the data, ___ estimates the quality of each model, relative to each of the other models. Thus, ___ provides a means for model selection.
___ is founded on information theory. When a statistical model is used to represent the process that generated the data, the representation will almost never be exact; so some information will be lost by using the model to represent the process. ___ estimates the relative amount of information lost by a given model: the less information a model loses, the higher the quality of that model.
In estimating the amount of information lost by a model, ___ deals with the trade-off between the goodness of fit of the model and the simplicity of the model. In other words, ___ deals with both the risk of overfitting and the risk of underfitting.
The ___ is named after the Japanese statistician Hirotugu ___, who formulated it. It now forms the basis of a paradigm for the foundations of statistics and is also widely used for statistical inference.
Akaike information criterion (AIC)
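In its usual form, AIC = 2k - 2 ln(L_hat), where k is the number of estimated parameters and L_hat the maximized likelihood; lower is better. A trivial sketch with invented numbers:

```python
# AIC = 2k - 2*ln(L_hat): lower AIC means less estimated information loss.
def aic(k, log_likelihood):
    return 2 * k - 2 * log_likelihood

print(aic(k=3, log_likelihood=-120.5))   # 247.0 (invented numbers)
```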
In statistics, the ___ is a criterion for model selection among a finite set of models; the model with the lowest ___ is preferred. It is based, in part, on the likelihood function and it is closely related to the Akaike information criterion (AIC).
When fitting models, it is possible to increase the likelihood by adding parameters, but doing so may result in overfitting. Both ___ and AIC attempt to resolve this problem by introducing a penalty term for the number of parameters in the model; the penalty term is larger in ___ than in AIC.
The ___ was developed by Gideon E. ___ and published in a 1978 paper, where he gave a Bayesian argument for adopting it.
Bayesian information criterion (BIC) or Schwarz information criterion (also SIC, SBC, SBIC)
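The corresponding formula is BIC = k ln(n) - 2 ln(L_hat), where n is the number of observations; a trivial sketch with invented numbers:

```python
# BIC's per-parameter penalty ln(n) exceeds AIC's constant 2 whenever
# n > e^2, i.e. about 8 or more observations.
import math

def bic(k, n, log_likelihood):
    return k * math.log(n) - 2 * log_likelihood

print(bic(k=3, n=100, log_likelihood=-120.5))   # about 254.8 (invented numbers)
```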
In statistics, ___, named for Colin Lingwood ___, is used to assess the fit of a regression model that has been estimated using ordinary least squares. It is applied in the context of model selection, where a number of predictor variables are available for predicting some outcome, and the goal is to find the best model involving a subset of these predictors. A small value of ___ means that the model is relatively precise.
___ has been shown to be equivalent to Akaike information criterion in the special case of Gaussian linear regression.
Mallows’s Cp
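One common form of the statistic, sketched below; texts differ on exactly how the parameter count p is defined, so treat the convention as an assumption:

```python
# Cp = SSE_p / S^2 - n + 2p, where SSE_p is the error sum of squares of the
# p-parameter candidate model and S^2 is the error variance estimate from
# the full model.
def mallows_cp(sse_p, s2_full, n, p):
    return sse_p / s2_full - n + 2 * p

print(mallows_cp(sse_p=52.0, s2_full=1.1, n=50, p=3))   # invented numbers
```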
In statistics, the ___ statistic is a form of cross-validation used in regression analysis to provide a summary measure of the fit of a model to a sample of observations that were not themselves used to estimate the model. It is calculated as the sum of squares of the prediction residuals for those observations.
Once a model has been fitted, each observation in turn is removed and the model is refitted using the remaining observations. The out-of-sample predicted value is calculated for the omitted observation in each case, and the ___ statistic is calculated as the sum of the squares of all the resulting prediction errors.
Given this procedure, the ___ statistic can be calculated for a number of candidate model structures for the same dataset, with the lowest values of ___ indicating the best structures. Models that are over-parameterised (over-fitted) tend to give small residuals for observations included in the model-fitting but large residuals for observations that are excluded. The ___ statistic has been extensively used in lazy learning and locally linear learning to speed up the assessment and selection of the neighbourhood size.
predicted residual error sum of squares (PRESS)
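A self-contained sketch for simple linear regression (data invented): each observation is held out in turn, the line is refitted, and the squared prediction error accumulates:

```python
# PRESS for a one-predictor least-squares line.
def press_simple(xs, ys):
    total = 0.0
    for i in range(len(xs)):
        xr, yr = xs[:i] + xs[i+1:], ys[:i] + ys[i+1:]
        mx, my = sum(xr) / len(xr), sum(yr) / len(yr)
        slope = sum((x - mx) * (y - my) for x, y in zip(xr, yr)) \
              / sum((x - mx) ** 2 for x in xr)
        pred = my + slope * (xs[i] - mx)          # out-of-sample prediction
        total += (ys[i] - pred) ** 2
    return total

print(press_simple([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 4.1, 5.9, 8.2, 9.9]))
```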
In statistics, the ___ is a method of conceptualizing the rate of type I errors in null hypothesis testing when conducting multiple comparisons. ___-controlling procedures are designed to control the expected proportion of “discoveries” (rejected null hypotheses) that are false (incorrect rejections of the null). ___-controlling procedures provide less stringent control of Type I errors compared to familywise error rate (FWER) controlling procedures (such as the Bonferroni correction), which control the probability of at least one Type I error. Thus, ___-controlling procedures have greater power, at the cost of increased numbers of Type I errors.
False discovery rate (FDR)
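The Benjamini-Hochberg procedure is the classic FDR-controlling method; a minimal sketch with invented p-values:

```python
# Benjamini-Hochberg: reject the k smallest p-values, where k is the
# largest rank with p_(k) <= (k/m) * q.
def benjamini_hochberg(p_values, q=0.05):
    m = len(p_values)
    indexed = sorted(enumerate(p_values), key=lambda t: t[1])
    cutoff = 0
    for rank, (_, p) in enumerate(indexed, start=1):
        if p <= rank / m * q:
            cutoff = rank
    return sorted(i for i, _ in indexed[:cutoff])   # indices of rejected nulls

print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.2, 0.9]))
```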
In computer science, ___ is the concept that flawed or nonsensical input data produces nonsense output, or “___”.
garbage in, garbage out (GIGO), or rubbish in, rubbish out (RIRO).
In probability theory and intertemporal portfolio choice, the ___ is a formula for bet sizing that leads almost surely to higher wealth compared to any other strategy in the long run (i.e. approaching the limit as the number of bets goes to infinity). The ___ size is found by maximizing the expected value of the logarithm of wealth, which is equivalent to maximizing the expected geometric growth rate. The ___ is to bet a predetermined fraction of assets, and it can seem counterintuitive. It was described by J. L. ___ Jr., a researcher at Bell Labs, in 1956.
For an even money bet, the ___ computes the wager size percentage by multiplying the percent chance to win by two, then subtracting one-hundred percent. So, for a bet with a 70% chance to win the optimal wager size is 40% of available funds.
The practical use of the formula has been demonstrated for gambling and the same idea was used to explain diversification in investment management. In the 2000s, ___-style analysis became a part of mainstream investment theory and the claim has been made that well-known successful investors including Warren Buffett and Bill Gross use ___ methods. William Poundstone wrote an extensive popular account of the history of ___ betting.
Kelly criterion (or Kelly strategy or Kelly bet), also known as the scientific gambling method
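For a bet paying net odds b to 1 with win probability p, the Kelly fraction is f* = p - (1 - p)/b; a sketch that also checks the even-money example above:

```python
# Kelly fraction; for even money (b = 1) this reduces to 2p - 1.
def kelly_fraction(p, b=1.0):
    return p - (1 - p) / b

print(kelly_fraction(0.70))   # about 0.40: wager 40% of funds, matching the text
```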