ch1 Flashcards
big data
data sets that are so voluminous and complex that traditional data-processing application software is inadequate to deal with them
machine learning
a set of methods that can automatically detect patterns in data, and then use the uncovered patterns to predict future data, or to perform other kinds of decision making under uncertainty
the long tail
the property of many data sets that a few instances are highly common/likely while most other instances are quite rare
this property apparently shows up in data generated across a wide variety of domains (why?)
supervised learning
the machine learning task of inferring a function from labeled training data
Formally put, the task is to learn a mapping from inputs x to outputs y, given a labeled set of input-output pairs D = {(x_i, y_i)} for i = 1, …, N.
Here, D is the training set, and N is the number of training examples.
features, attributes, covariates
the training inputs x_i in a supervised learning problem, often represented as fixed-length vectors of numbers
response variable
the output y as a function of the input in a supervised learning problem
categorical/nominal variable
a variable that can take on one of a limited, and usually fixed, number of possible values, assigning each individual or other unit of observation to a particular group or nominal category on the basis of some qualitative property
classification vs regression
When the response variable is categorical (i.e. takes on values from a finite set), supervised learning is called classification. When the response variable is real-valued, the problem is known as regression.
ordinal regression
supervised learning when the response variable takes values from a finite set that has some natural ordering (e.g. grades A-F)
descriptive/unsupervised learning, or knowledge discovery
learning where only inputs D = {x_i} for i = 1, …, N are provided and the goal is to find “interesting patterns” in the data.
A much less well-defined problem, since we aren’t told what kinds of patterns to look for and no obvious error metric exists
reinforcement learning
learning how to act or behave when given only occasional reward or punishment signals
binary classification
classification between two classes
multiclass classification
classification between more than two classes
multi-label classification
classification where class labels aren’t mutually exclusive
often best viewed as predicting multiple related binary class labels from the same input (a multiple output model)
multiple output model
model of multi-label classification treating problem as involving prediction of multiple related binary class labels
function approximation
formalization of classification with the assumption that y = f(x) for some unknown function f; the goal is to estimate f from a labeled training set and then predict using ŷ = f̂(x)
maximum a posteriori probability (MAP) estimate
an estimate of an unknown quantity, that equals the mode of the posterior distribution
posterior probability
the conditional probability that is assigned after the relevant evidence or background is taken into account.
posterior probability distribution
the probability distribution of an unknown quantity, treated as a random variable, conditional on the evidence obtained from an experiment or survey
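Tying the MAP and posterior cards together (θ for the unknown quantity and D for the observed data are illustrative notation, not fixed by the cards): by Bayes’ rule, p(θ | D) = p(D | θ) p(θ) / p(D), and the MAP estimate is θ̂_MAP = argmax_θ p(θ | D) = argmax_θ p(D | θ) p(θ), since the denominator p(D) does not depend on θ.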
central idea of bayesian statistics
probability is orderly opinion, and inference from data is nothing other than the revision of such opinion in the light of relevant new information
bag-of-words / vector space model
a simplifying representation used in natural language processing and information retrieval (IR), also known as the vector space model. In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.
Documents and queries are represented as vectors. Each dimension corresponds to a separate term. If a term occurs in the document, its value in the vector is non-zero.
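A minimal bag-of-words sketch in Python (the toy documents and vocabulary are made up for illustration):

```python
# Toy bag-of-words sketch: each document becomes a count vector over a shared
# vocabulary; grammar and word order are discarded, multiplicity is kept.
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat"]
vocab = sorted(set(w for d in docs for w in d.split()))

def bow_vector(doc):
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]   # one dimension per vocabulary term

print(vocab)
print([bow_vector(d) for d in docs])    # "the" has count 2 in the first document
```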
tf–idf or TFIDF, short for term frequency–inverse document frequency
often used as a weighting factor in searches of information retrieval, text mining, and user modeling. The tf-idf value increases proportionally to the number of times a word appears in the document, but is often offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.
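A hedged sketch of one common tf-idf variant (raw term frequency times log(N / document frequency); real libraries differ in smoothing and normalization details):

```python
# Toy tf-idf sketch: raw term frequency times log(N / document frequency).
import math
from collections import Counter

docs = [d.split() for d in ["the cat sat on the mat", "the dog sat", "the dog barked"]]
N = len(docs)
df = Counter(w for d in docs for w in set(d))        # number of documents containing w

def tfidf(doc):
    tf = Counter(doc)
    return {w: tf[w] * math.log(N / df[w]) for w in tf}

print(tfidf(docs[0]))   # "the" gets weight 0: it appears in every document
```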
stop words
words which are filtered out before or after processing of natural language data (text)
feature extraction
starts from an initial set of measured data and builds derived values (features) intended to be informative and non-redundant, facilitating the subsequent learning and generalization steps, and in some cases leading to better human interpretations
dimensionality reduction or dimension reduction
the process of reducing the number of random variables under consideration by obtaining a set of principal variables. It can be divided into feature selection and feature extraction.
feature selection, also known as variable selection, attribute selection or variable subset selection
the process of selecting a subset of relevant features (variables, predictors) for use in model construction
curse of dimensionality
various phenomena that arise when analyzing and organizing data in high-dimensional spaces (often with hundreds or thousands of dimensions) that do not occur in low-dimensional settings such as the three-dimensional physical space of everyday experience
The common theme of these problems is that when the dimensionality increases, the volume of the space increases so fast that the available data become sparse. This sparsity is problematic for any method that requires statistical significance. In order to obtain a statistically sound and reliable result, the amount of data needed to support the result often grows exponentially with the dimensionality. Also, organizing and searching data often relies on detecting areas where objects form groups with similar properties; in high dimensional data, however, all objects appear to be sparse and dissimilar in many ways, which prevents common data organization strategies from being efficient.
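A small numeric illustration of this sparsity, under the toy assumption that the data are uniform in the unit hypercube: the edge length of a sub-cube needed to capture a fixed fraction of the data approaches 1 as the dimension grows.

```python
# Toy illustration: if data are uniform in the unit hypercube [0, 1]^d, a
# sub-cube containing a fraction f of the data must have edge length f**(1/d).
f = 0.10                      # we want the "nearest" 10% of the data
for d in (1, 2, 10, 100):
    print(f"d = {d:>3}: required edge length = {f ** (1 / d):.2f}")
# d = 1: 0.10, d = 2: 0.32, d = 10: 0.79, d = 100: 0.98
```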
overfitting
the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably
The essence of overfitting is to have unknowingly extracted some of the residual variation (i.e. the noise) as if that variation represented underlying model structure.
Underfitting
occurs when a statistical model cannot adequately capture the underlying structure of the data. An underfitted model is a model where some parameters or terms that would appear in a correctly specified model are missing. Underfitting would occur, for example, when fitting a linear model to non-linear data. Such a model will tend to have poor predictive performance.
density estimation
formulation of unsupervised learning as involving the construction of an estimate, based on observed data, of an unobservable underlying probability density function
Cluster analysis or clustering
the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters)
latent variables
variables that are not directly observed but are rather inferred (through a mathematical model) from other variables that are observed (directly measured).
model based clustering
fitting a probabilistic model to the data, rather than running some ad hoc algorithm
advantages include that one can compare different kinds of models in an objective way (in terms of the likelihood that they assign to the data)
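A sketch of this idea using Gaussian mixture models (assumes scikit-learn is installed; the data and the candidate component counts are made up), comparing candidate models by the average log-likelihood they assign to held-out data:

```python
# Sketch of model-based clustering with Gaussian mixtures; candidate models are
# compared by held-out average log-likelihood.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(5, 1, (200, 2))])
rng.shuffle(X)
train, test = X[:300], X[300:]

for k in (1, 2, 3):
    gm = GaussianMixture(n_components=k, random_state=0).fit(train)
    print(k, gm.score(test))   # average log-likelihood per held-out point
```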
imputation
the process of replacing missing data with substituted values
matrix completion
the task of filling in the missing entries of a partially observed matrix
collaborative filtering
In the newer, narrower sense, collaborative filtering is a method of making automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating).
In the more general sense, collaborative filtering is the process of filtering for information or patterns using techniques involving collaboration among multiple agents, viewpoints, data sources, etc
parametric model
A statistical model with a fixed number of parameters
non-parametric model
statistical model whose number of parameters grows with the amount of training data. More flexible than parametric models (i.e. they make weaker assumptions), but often computationally intractable for large data sets.
K nearest neighbor classifier
Non-parametric classifier that looks at the K points of the training set nearest to the test input x, counts how many members of each class are among them, and returns that empirical fraction as the estimate of the class probabilities.
Loses effectiveness in high dimensions or whenever the nearest neighbors are in fact distant from the test input.
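A minimal KNN sketch in plain NumPy (Euclidean distance and tiny toy data assumed), returning the empirical class fractions among the K nearest training points:

```python
# Minimal KNN sketch: return the empirical fraction of each class among the
# K nearest training points.
import numpy as np

def knn_predict_proba(X_train, y_train, x_test, K=3):
    dists = np.linalg.norm(X_train - x_test, axis=1)    # distance to every training point
    nearest = y_train[np.argsort(dists)[:K]]            # labels of the K closest points
    return {c: float(np.mean(nearest == c)) for c in np.unique(y_train)}

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict_proba(X_train, y_train, np.array([0.2, 0.1])))  # mostly class 0
```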
instance-based or memory-based learning
a family of learning algorithms that, instead of performing explicit generalization, compares new problem instances with instances seen in training, which have been stored in memory.
inductive bias
the set of assumptions that a learning algorithm uses to predict outputs for inputs it has not encountered
linear regression
parametric model asserting that the response is a linear function of the inputs
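A short least-squares sketch in NumPy (synthetic data; a column of ones is appended so the fit includes an intercept):

```python
# Least-squares linear regression sketch on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, 50)     # noisy line with slope 2, intercept 1

X = np.column_stack([x, np.ones_like(x)])      # design matrix [x, 1]
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)                                       # roughly [2.0, 1.0]
```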
error
the deviation of the observed value from the (unobservable) true value of a quantity of interest
residual
the difference between the observed value and the estimated value of the quantity of interest
scalar product
dot product of two vectors
basis function expansion
using linear regression to model nonlinear relationships by replacing x with some nonlinear function of the inputs, φ(x)
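A sketch of a polynomial basis function expansion φ(x) = [1, x, x², …] (degree and data chosen arbitrarily): the model stays linear in the weights even though it is nonlinear in x.

```python
# Polynomial basis function expansion: ordinary least squares on phi(x) fits a
# nonlinear curve with a model that is still linear in its weights.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 100)
y = np.sin(x) + rng.normal(0, 0.1, 100)        # nonlinear target

Phi = np.vander(x, 6, increasing=True)         # columns 1, x, x^2, ..., x^5
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(w)                                       # weights of the expanded model
```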
bernoulli distribution
the probability distribution of a random variable which takes the value 1 with probability p and the value 0 with probability q = 1 − p
sigmoid function, aka logistic or logit function
S-shaped “squashing function” that maps the whole real line to [0,1], which is necessary for the output to be interpreted as a probability
logistic regression
a regression model where the dependent variable (DV) is categorical (usually binary)
obtained from linear regression by replacing the Gaussian distribution over the response with a Bernoulli distribution and passing the linear combination of the inputs through the sigmoid function
actually classification, not regression, despite its name
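A hedged sketch of logistic regression fitted by plain gradient descent on the negative log-likelihood (synthetic, well-separated two-class data; learning rate and step count are arbitrary):

```python
# Logistic regression by gradient descent: the sigmoid squashes the linear
# score w.x + b into a probability in [0, 1].
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(500):
    p = sigmoid(X @ w + b)                     # predicted P(y = 1 | x)
    w -= lr * X.T @ (p - y) / len(y)           # gradient step on the weights
    b -= lr * np.mean(p - y)                   # gradient step on the bias

print(((sigmoid(X @ w + b) > 0.5) == y).mean())   # training accuracy, near 1.0
```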
decision rule
a function which maps an observation to an appropriate action
linearly separable
two sets of points are linearly separable if there exists at least one line (more generally, a hyperplane) with all points of the first set on one side and all points of the second set on the other side
generalization error
a measure of how accurately an algorithm is able to predict outcome values for previously unseen data
cross validation
partitioning a sample of data into complementary subsets, performing the analysis on one subset (called the training set), and validating the analysis on the other subset (called the validation set or testing set). To reduce variability, in most methods multiple rounds of cross-validation are performed using different partitions, and the validation results are combined (e.g. averaged) over the rounds to estimate a final predictive model.
often used when there is not enough data available to partition it into separate training and test sets without losing significant modelling or testing capability
LOOCV, leave-one-out cross-validation
(or, more generally, leave-p-out cross-validation)
using p observations as the validation set and the remaining observations as the training set. This is repeated over all ways of splitting the original sample into a validation set of p observations and a training set; LOOCV is the special case p = 1.
k-fold cross-validation
the original sample is randomly partitioned into k equal sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The cross-validation process is then repeated k times (the folds), with each of the k subsamples used exactly once as the validation data. The k results from the folds can then be averaged to produce a single estimation.
The advantage of this method over repeated random sub-sampling is that all observations are used for both training and validation, and each observation is used for validation exactly once. 10-fold cross-validation is commonly used, but in general k remains an unfixed parameter.
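A hand-rolled k-fold sketch in NumPy, with a trivial majority-class predictor standing in for a real learner just to show the split / fit / validate / average mechanics:

```python
# Hand-rolled k-fold cross-validation with a toy majority-class "model".
import numpy as np

def k_fold_score(X, y, k=5, seed=0):
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        majority = np.bincount(y[train]).argmax()     # "fit" on the training folds
        scores.append(np.mean(y[val] == majority))    # validate on the held-out fold
    return float(np.mean(scores))

y = np.array([0] * 70 + [1] * 30)
X = np.zeros((100, 1))                                # features unused by the toy model
print(k_fold_score(X, y, k=10))                       # about 0.7
```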
no free lunch theorem
a result that states that, for certain classes of mathematical problems, the computational cost of finding a solution, averaged over all problems in the class, is the same for any solution method, so no method offers a universal “short cut”. In machine learning this is usually read as: there is no single model or learning algorithm that works best across all problems.