ch1 Flashcards

1
Q

big data

A

data sets that are so voluminous and complex that traditional data processing application software are inadequate to deal with them

2
Q

machine learning

A

a set of methods that can automatically detect patterns in data, and then use the uncovered patterns to predict future data, or perform other kinds of decision making under uncertainty

3
Q

the long tail

A

the property of a data set in which a few instances are highly common/likely while most other instances are quite rare

this pattern apparently shows up in data generated across a wide variety of domains (why?)

4
Q

supervised learning

A

the machine learning task of inferring a function from labeled training data

Formally put, the task is to learn a mapping from inputs x to outputs y, given a labeled set of input-output pairs D = {(xi, yi)}, i = 1, …, N

Here, D is the training set, and N is the number of training examples.
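
As a minimal sketch (toy data and a deliberately trivial learner, both invented here for illustration), the training set is just a collection of N input-output pairs, and a learner is anything that turns D into a prediction rule:

    # A toy training set D = {(xi, yi)}: inputs are (height, weight) pairs,
    # outputs are class labels. All values are made up.
    D = [
        ((1.70, 65.0), "female"),
        ((1.60, 55.0), "female"),
        ((1.65, 58.0), "female"),
        ((1.82, 85.0), "male"),
        ((1.90, 92.0), "male"),
    ]
    N = len(D)  # number of training examples

    # A trivial "learner": ignore x and always predict the most common label in D.
    def fit_majority(train):
        labels = [y for _, y in train]
        return max(set(labels), key=labels.count)

    print(N, fit_majority(D))  # 5 female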

5
Q

features, attributes, covariates

A

the input variables x of a supervised learning problem, typically a fixed-length vector of numbers describing each training example

6
Q

response variable

A

output as a function of input in a supervised learning problem

7
Q

categorical/nominal variable

A

a variable that can take on one of a limited, and usually fixed, number of possible values, assigning each individual or other unit of observation to a particular group or nominal category on the basis of some qualitative property

8
Q

classification vs regression

A

When the response variable is categorical (i.e. takes on values from a finite set), supervised learning is called classification. When the response variable is real-valued, the problem is known as regression.

9
Q

ordinal regression

A

supervised learning where the response variable takes values from a finite set that has some natural ordering (e.g. grades A–F)

10
Q

descriptive/unsupervised learning, or knowledge discovery

A

learning where only inputs D = {xi}, i = 1, …, N, are provided and the goal is to find “interesting patterns” in the data.

A much less well-defined problem, since we aren’t told what kinds of patterns to look for and no obvious error metric exists

11
Q

reinforcement learning

A

learning how to act or make decisions given only occasional reward or punishment signals

12
Q

binary classification

A

classification between two classes

13
Q

multiclass classification

A

classification among more than two classes

14
Q

multi-label classification

A

classification where class labels aren’t mutually exclusive

often best viewed as predicting multiple related binary class labels from the same input (multiple output model)

15
Q

multiple output model

A

model of multi-label classification treating problem as involving prediction of multiple related binary class labels

16
Q

function approximation

A

formalization of classification with the assumption y = f(x) for some unknown function f; the goal is to estimate the function f given a labeled training set and then to predict using y^ = f^(x)

17
Q

maximum a posteriori probability (MAP) estimate

A

an estimate of an unknown quantity, that equals the mode of the posterior distribution
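
In symbols (θ for the unknown quantity, D for the observed data; this is standard notation, not part of the card), the MAP estimate maximizes the posterior, which by Bayes' rule is proportional to likelihood times prior:

    \hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} \, p(\theta \mid D) = \arg\max_{\theta} \, p(D \mid \theta)\, p(\theta)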

18
Q

posterior probability

A

the conditional probability that is assigned after the relevant evidence or background is taken into account.

19
Q

posterior probability distribution

A

the probability distribution of an unknown quantity, treated as a random variable, conditional on the evidence obtained from an experiment or survey

20
Q

central idea of Bayesian statistics

A

that probability is orderly opinion, and that inference from data is nothing other than the revision of such opinion in the light of relevant new information.

21
Q

bag-of-words / vector space model

A

a simplifying representation used in natural language processing and information retrieval (IR). Also known as vector space model. In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.

Documents and queries are represented as vectors. Each dimension corresponds to a separate term. If a term occurs in the document, its value in the vector is non-zero.
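
A minimal sketch in plain Python (the two example sentences are made up), showing how documents become count vectors over a shared vocabulary:

    docs = ["the cat sat on the mat", "the dog sat"]   # toy documents

    # One dimension per distinct term across the corpus.
    vocab = sorted({w for d in docs for w in d.split()})

    # Each document becomes a vector of term counts: word order is discarded,
    # multiplicity is kept.
    vectors = [[d.split().count(w) for w in vocab] for d in docs]

    print(vocab)     # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
    print(vectors)   # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]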

22
Q

tf–idf or TFIDF, short for term frequency–inverse document frequency

A

often used as a weighting factor in information retrieval searches, text mining, and user modeling. The tf-idf value increases proportionally with the number of times a word appears in the document, but is often offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.
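
A minimal sketch of one common variant of the weighting (there are several; the tiny corpus here is invented):

    import math

    docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "the", "end"]]
    N = len(docs)

    def tf_idf(term, doc):
        tf = doc.count(term) / len(doc)           # term frequency within this document
        df = sum(1 for d in docs if term in d)    # number of documents containing the term
        idf = math.log(N / df)                    # inverse document frequency
        return tf * idf

    print(tf_idf("the", docs[2]))   # corpus-wide word: idf = log(3/3) = 0, so weight 0
    print(tf_idf("cat", docs[0]))   # rare word: positive weight (about 0.37)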

23
Q

stop words

A

words which are filtered out before or after processing of natural language data (text)

24
Q

feature extraction

A

starts from an initial set of measured data and builds derived values (features) intended to be informative and non-redundant, facilitating the subsequent learning and generalization steps, and in some cases leading to better human interpretations

25
Q

dimensionality reduction or dimension reduction

A

the process of reducing the number of random variables under consideration by obtaining a set of principal variables. It can be divided into feature selection and feature extraction.

26
Q

feature selection, also known as variable selection, attribute selection or variable subset selection

A

the process of selecting a subset of relevant features (variables, predictors) for use in model construction

27
Q

curse of dimensionality

A

various phenomena that arise when analyzing and organizing data in high-dimensional spaces (often with hundreds or thousands of dimensions) that do not occur in low-dimensional settings such as the three-dimensional physical space of everyday experience

The common theme of these problems is that when the dimensionality increases, the volume of the space increases so fast that the available data become sparse. This sparsity is problematic for any method that requires statistical significance. In order to obtain a statistically sound and reliable result, the amount of data needed to support the result often grows exponentially with the dimensionality. Also, organizing and searching data often relies on detecting areas where objects form groups with similar properties; in high dimensional data, however, all objects appear to be sparse and dissimilar in many ways, which prevents common data organization strategies from being efficient.
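
A small numeric illustration of the sparsity effect (a standard back-of-the-envelope calculation, not taken from the card): to capture a fixed fraction of data spread uniformly over the unit hypercube, the edge length of the required sub-cube approaches 1 as the dimension grows, so "local" neighbourhoods stop being local:

    # Edge length s of a sub-cube holding a fraction f of data uniform on [0, 1]^d.
    f = 0.01   # we want 1% of the data
    for d in (1, 2, 10, 100):
        s = f ** (1.0 / d)
        print(d, round(s, 3))
    # d=1   -> 0.01  (a tiny interval is enough)
    # d=2   -> 0.1
    # d=10  -> 0.631
    # d=100 -> 0.955 (the neighbourhood spans nearly the whole axis)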

28
Q

overfitting

A

the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably

The essence of overfitting is to have unknowingly extracted some of the residual variation (i.e. the noise) as if that variation represented underlying model structure.

29
Q

Underfitting

A

occurs when a statistical model cannot adequately capture the underlying structure of the data. An underfitted model is a model where some parameters or terms that would appear in a correctly specified model are missing. Underfitting would occur, for example, when fitting a linear model to non-linear data. Such a model will tend to have poor predictive performance.

30
Q

density estimation

A

formulation of unsupervised learning as involving the construction of an estimate, based on observed data, of an unobservable underlying probability density function
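
As one simple concrete estimator (a histogram; my choice for illustration, not implied by the card), a minimal sketch in plain Python:

    # Toy samples from an unknown distribution; estimate the density with a histogram.
    samples = [0.1, 0.15, 0.2, 0.8, 0.85, 0.9, 0.95, 0.5]
    bins = 4
    lo, hi = 0.0, 1.0
    width = (hi - lo) / bins

    counts = [0] * bins
    for s in samples:
        counts[min(int((s - lo) / width), bins - 1)] += 1

    # Normalise so the estimated density integrates to 1.
    density = [c / (len(samples) * width) for c in counts]
    print(density)   # height of the density estimate over each of the 4 bins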

31
Q

Cluster analysis or clustering

A

the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters)
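
As one concrete example (k-means, which is only one of many clustering algorithms and is not implied by the card), a minimal sketch on made-up one-dimensional data:

    # Minimal k-means: alternately assign each point to its nearest centroid,
    # then move each centroid to the mean of its assigned points.
    data = [1.0, 1.2, 0.8, 9.8, 10.1, 10.3]
    centroids = [0.0, 5.0]                  # crude initial guesses

    for _ in range(10):                     # a few fixed iterations for simplicity
        clusters = [[] for _ in centroids]
        for x in data:
            nearest = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
            clusters[nearest].append(x)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]

    print(centroids)   # roughly [1.0, 10.07]: points within a cluster are similar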

32
Q

latent variables

A

variables that are not directly observed but are rather inferred (through a mathematical model) from other variables that are observed (directly measured).

33
Q

model based clustering

A

fitting a probabilistic model to the data, rather than running some ad hoc algorithm

advantages include that one can compare different kinds of models in an objective way (in terms of the likelihood that they assign to the data)

34
Q

imputation

A

the process of replacing missing data with substituted values

35
Q

matrix completion

A

the task of filling in the missing entries of a partially observed matrix

36
Q

collaborative filtering

A

In the newer, narrower sense, collaborative filtering is a method of making automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating).

In the more general sense, collaborative filtering is the process of filtering for information or patterns using techniques involving collaboration among multiple agents, viewpoints, data sources, etc

37
Q

parametric model

A

A statistical model with a fixed number of parameters

38
Q

non-parametric model

A

a statistical model whose number of parameters grows with the amount of training data. More flexible than parametric models (i.e. they make weaker assumptions), but often computationally intractable for large data sets.

39
Q

K nearest neighbor classifier

A

Non-parametric classifier that looks at the K points of the training set nearest to the test input x, counts how many members of each class are in that set, and returns the empirical fraction as the estimate.

Loses effectiveness in high dimensions or whenever the nearest neighbors are in fact distant from the test input.
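
A minimal sketch of such a classifier in plain Python (toy two-dimensional data and Euclidean distance are my choices for illustration):

    import math
    from collections import Counter

    # Toy labeled training set: (feature vector, class label).
    train = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"),
             ((4.0, 4.2), "b"), ((3.8, 4.0), "b")]

    def knn_predict(x, k=3):
        # Sort training points by distance to the test input x and keep the K nearest.
        nearest = sorted(train, key=lambda pair: math.dist(x, pair[0]))[:k]
        votes = Counter(label for _, label in nearest)
        # Empirical fraction of each class among the K nearest neighbours.
        fractions = {c: n / k for c, n in votes.items()}
        return max(fractions, key=fractions.get), fractions

    print(knn_predict((1.1, 0.9)))   # roughly ('a', {'a': 0.67, 'b': 0.33})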

40
Q

instance-based or memory-based learning

A

a family of learning algorithms that, instead of performing explicit generalization, compares new problem instances with instances seen in training, which have been stored in memory.

41
Q

inductive bias

A

the set of assumptions that a learning algorithm uses to predict outputs given inputs that it has not encountered

43
Q

linear regression

A

parametric model asserting that the response is a linear function of the inputs
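
A minimal sketch (one-dimensional case, ordinary least squares in closed form; the noisy data are made up):

    # Toy data roughly following y = 2x + 1 plus a little noise.
    xs = [0.0, 1.0, 2.0, 3.0, 4.0]
    ys = [1.1, 2.9, 5.2, 6.8, 9.1]
    n = len(xs)

    # Fit y = w0 + w1 * x by ordinary least squares.
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    w1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
    w0 = mean_y - w1 * mean_x

    print(round(w1, 2), round(w0, 2))   # close to 2 and 1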

44
Q

error

A

the deviation of the observed value from the (unobservable) true value of a quantity of interest

45
Q

residual

A

the difference between the observed value and the estimated value of the quantity of interest

46
Q

scalar product

A

dot product of two vectors

47
Q

basis function expansion

A

using linear regression to model nonlinear relationships by replacing x with some nonlinear function of the inputs, φ(x)
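
A minimal sketch using a polynomial basis φ(x) = [1, x, x²] (my choice of basis for illustration); the model stays linear in the weights, so ordinary least squares still applies even though the fit is nonlinear in x:

    import numpy as np

    # Toy data that is exactly quadratic in x.
    x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
    y = 0.5 * x**2 - x + 2.0

    # Replace the scalar input x with the expanded features phi(x) = [1, x, x^2].
    Phi = np.column_stack([np.ones_like(x), x, x**2])

    # Ordinary least squares on the expanded features.
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    print(np.round(w, 2))   # approximately [ 2.  -1.   0.5]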

48
Q

Bernoulli distribution

A

the probability distribution of a random variable which takes the value 1 with probability p and the value 0 with probability q = 1 − p
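
Written as a single probability mass function (standard notation, not taken from the card):

    \Pr(Y = y) = p^{y} (1 - p)^{1 - y}, \qquad y \in \{0, 1\}
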
49
Q

sigmoid function, aka logistic or logit function

A

S-shaped “squashing function” that maps the whole real line to [0,1], which is necessary for the output to be interpreted as a probability
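
A minimal sketch of the function itself in plain Python, showing how a few real values get squashed into the unit interval:

    import math

    def sigmoid(z):
        # sigm(z) = 1 / (1 + e^(-z)): maps the whole real line into (0, 1).
        return 1.0 / (1.0 + math.exp(-z))

    for z in (-5.0, -1.0, 0.0, 1.0, 5.0):
        print(z, round(sigmoid(z), 3))
    # -5.0 -> 0.007, -1.0 -> 0.269, 0.0 -> 0.5, 1.0 -> 0.731, 5.0 -> 0.993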

50
Q

logistic regression

A

a regression model where the dependent variable (DV) is categorical (usually binary)

obtained by modeling the output with a Bernoulli distribution whose parameter is the sigmoid of a linear function of the inputs

despite the name, actually classification, not regression
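
A minimal sketch (one-dimensional input, batch gradient descent on the Bernoulli negative log-likelihood; the toy data, learning rate, and iteration count are my choices):

    import math

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    # Toy binary data: the label tends to be 1 for larger x.
    xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
    ys = [0, 0, 0, 1, 1, 1]

    w, b = 0.0, 0.0                       # weight and bias
    lr = 0.5
    for _ in range(1000):                 # batch gradient descent
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(w * x + b)        # predicted P(y = 1 | x)
            grad_w += (p - y) * x         # gradient of the negative log-likelihood
            grad_b += (p - y)
        w -= lr * grad_w
        b -= lr * grad_b

    print(round(sigmoid(w * 1.5 + b), 3))   # near 1: x = 1.5 is assigned to class 1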

51
Q

decision rule

A

a function which maps an observation to an appropriate action

52
Q

linearly separable

A

two sets of points are linearly separable if there exists at least one line in the plane with all points of one set on one side of the line and all points of the other set on the other side (in higher dimensions, a separating hyperplane)

53
Q

generalization error

A

a measure of how accurately an algorithm is able to predict outcome values for previously unseen data

54
Q

cross validation

A

partitioning a sample of data into complementary subsets, performing the analysis on one subset (called the training set), and validating the analysis on the other subset (called the validation set or testing set). To reduce variability, in most methods multiple rounds of cross-validation are performed using different partitions, and the validation results are combined (e.g. averaged) over the rounds to estimate a final predictive model.

often used when there is not enough data available to partition it into separate training and test sets without losing significant modelling or testing capability

55
Q

LOOCV (leave-one-out cross validation)

or, more generally, leave-p-out cross validation

A

using p observations as the validation set and the remaining observations as the training set; this is repeated for all ways of splitting the original sample into a validation set of p observations and a training set (leave-one-out is the special case p = 1)

56
Q

k-fold cross-validation

A

the original sample is randomly partitioned into k equal sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The cross-validation process is then repeated k times (the folds), with each of the k subsamples used exactly once as the validation data. The k results from the folds can then be averaged to produce a single estimation.

The advantage of this method over repeated random sub-sampling is that all observations are used for both training and validation, and each observation is used for validation exactly once. 10-fold cross-validation is commonly used, but in general k remains an unfixed parameter.
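
A minimal sketch of the partitioning logic in plain Python (the "model" is just a mean predictor, and the folds are formed by simple interleaving rather than random shuffling, purely to keep the example short):

    # Toy targets and a stand-in model: predict the mean of the training targets.
    ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
    k = 3

    fold_errors = []
    for fold in range(k):
        # Every k-th observation forms the validation fold; the rest are training data.
        valid = [y for i, y in enumerate(ys) if i % k == fold]
        train = [y for i, y in enumerate(ys) if i % k != fold]
        prediction = sum(train) / len(train)               # "fit" on the training part
        mse = sum((y - prediction) ** 2 for y in valid) / len(valid)
        fold_errors.append(mse)

    print(sum(fold_errors) / k)   # average of the k validation errors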

57
Q

no free lunch theorem

A

a result that states that for certain types of mathematical problems, the computational cost of finding a solution, averaged over all problems in the class, is the same for any solution method. No solution therefore offers a “short cut”.