definitions and terms Flashcards
data science
a set of fundamental principles that guide the extraction of knowledge from data, the main aim of which in the business community is to improve decision making; a combination of analytical engineering and exploration; data science is a broader term than “data mining”
data mining
data mining focuses on the automated search for knowledge, patterns, or regularities in data; aka KDD: Knowledge Discovery and Data Mining; relative to formal statistical techniques, data mining might be considered partly as hypothesis generation (vs testing)–ie can we find patterns in the data in the first place (to which ordinary statistical hypothesis testing might then be applied)?
machine learning
machine learning is, in part, the collection of methods for extracting (predictive) models from data; ML is concerned with the analysis of data to find useful or informative patterns; the methods are drawn from machine learning itself, applied statistics, and pattern recognition
vs. data mining, ML is more general in application, eg to robotics or computer vision, and may put more emphasis on theory (than on real-world application), while data mining is more narrowly concerned with practical, commercial, and business applications
model (for machine learning)
a model is an abstraction that can perform a prediction, (re)action, or transformation on or in respect of an instance of input values; a simplified representation of reality created to serve a purpose; it is usually simplified by fitting it to a specific purpose, or perhaps simplified because of constraints on information or tractability
predictive vs descriptive models
predictive model: a formula for estimating the unknown value of interest: the target; the formula is often mathematical or a logical statement (such as a rule), and usually it is a hybrid of the two
descriptive model: a model whose primary purpose is to gain insight into the underlying phenomenon or process (eg a descriptive model of customer churn behavior would reveal what attributes customers who churn (leave) typically have); in some sense, we have all the data, and are trying to understand it
induction
the process of generalizing from specific cases to general rules, laws, or truths
the creation of models from data is termed model induction; the tool or procedure that creates the model from the data is aka the induction algorithm, or learner; eg a linear regression procedure will induce a ~parametrized-surface model to fit the data
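as a concrete sketch of induction (toy data invented here; scikit-learn is one choice of library), a linear-regression learner inducing a model from examples:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # toy training instances (invented): two attributes, one numeric target
    X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
    y = np.array([5.0, 4.0, 11.0, 10.0])

    # the induction algorithm ("learner") generalizes from the specific
    # cases in (X, y) to a parametrized model: y ~= w*x + b
    learner = LinearRegression()
    model = learner.fit(X, y)

    print(model.coef_, model.intercept_)  # the induced parameters
    print(model.predict([[5.0, 5.0]]))    # apply the model to a new instance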
instance
an instance, aka an example, represents a data point; it is described by a set of attributes or predictors, and conventionally corresponds to a row in a database or spreadsheet; an instance is sometimes called a feature vector, and in the context of statistics may be called a “case”
loss function
a loss function determines how much penalty should be assigned to an instance based on the error in the model’s predicted value (some aggregate of the per-instance penalties may be used for training the model, or for evaluating it after it’s trained)
some types:
- zero-one function: penalty=0 for correct decision; penalty=1 for incorrect decision; often used for classification problems
- hinge function / hinge loss: penalizes an instance according to how far it sits on the wrong side of a desired separation boundary in the attribute space (the loss graph looks like a hinge); often used for classification problems; the penalty increases the farther on the wrong side of the dividing line an instance is
- squared function: squared error is the square of the distance from the desired value; often used in regression contexts
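a minimal numpy sketch of the three losses above (function names are my own; the hinge version assumes labels in {-1, +1} and a signed distance from the boundary):

    import numpy as np

    def zero_one_loss(y_true, y_pred):
        # penalty 0 for a correct decision, 1 for an incorrect one
        return np.where(y_true == y_pred, 0.0, 1.0)

    def hinge_loss(y_true, score):
        # no penalty when safely on the correct side (margin >= 1);
        # penalty grows linearly the further the instance sits on the
        # wrong side of the boundary; the graph looks like a hinge
        return np.maximum(0.0, 1.0 - y_true * score)

    def squared_loss(y_true, y_pred):
        # square of the distance from the desired value (regression)
        return (y_true - y_pred) ** 2

    print(zero_one_loss(np.array([1, -1]), np.array([1, 1])))   # [0. 1.]
    print(hinge_loss(np.array([1, -1]), np.array([2.0, 0.5])))  # [0. 1.5]
    print(squared_loss(np.array([3.0]), np.array([2.5])))       # [0.25]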
generalization
the property of a model or modeling process, whereby the model applies to data that were not used to build the model
overfitting
finding chance occurrences in the dataset (that seem to fit interesting patterns but) that do not generalize is called overfitting the data
the underlying reasons for overfitting when building models from data are essentially problems of multiple comparisons
mutual information
the amount of information one can obtain about one random variable by observing another; a measure of dependence or “mutual dependence” between two random variables
I(X;Y) = H(X)-H(X|Y) = H(Y)-H(Y|X)
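a quick numpy check of the identity (joint counts invented; using the equivalent form I(X;Y) = H(X) + H(Y) - H(X,Y)):

    import numpy as np

    joint = np.array([[30.0, 10.0],   # invented joint counts for X, Y
                      [10.0, 50.0]])
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)

    def H(p):
        p = p[p > 0]                  # entropy in bits
        return -(p * np.log2(p)).sum()

    # I(X;Y) = H(X) + H(Y) - H(X,Y), equivalent to H(X) - H(X|Y)
    print(H(p_x) + H(p_y) - H(p_xy.ravel()))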
classifiers and “positive” vs “negative” examples
for a classifier, a bad outcome (the event worth detecting, eg churn or fraud) is conventionally regarded as a “positive” example, and a good outcome as a “negative” example
data leakage
in supervised learning, when the training data instances inadvertently contain information about those same instances’ target (eg putting the value of the target variable “in” the attribute vector accidentally)–the target / label value leaks into the attribute / feature vectors we’re training on
base rate
- for classification problems, the base rate classifier (usually) predicts the majority class in the dataset; the base rate is then the number of times that class appears in the dataset, divided by the dataset size
- for regression problems, the baseline is simply the mean or median value of the numeric target variable–a simple model that always predicts this average value exhibits base rate performance
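scikit-learn ships base-rate models directly (one way to do it; data invented):

    import numpy as np
    from sklearn.dummy import DummyClassifier, DummyRegressor

    X = np.zeros((6, 1))                  # attributes are ignored anyway
    y_cls = np.array([0, 0, 0, 0, 1, 1])  # majority class is 0
    y_reg = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

    clf = DummyClassifier(strategy="most_frequent").fit(X, y_cls)
    print(clf.score(X, y_cls))   # base rate: 4/6

    reg = DummyRegressor(strategy="mean").fit(X, y_reg)  # or "median"
    print(reg.predict(X[:1]))    # always predicts the mean, 3.5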
validation set
an intermediate holdout set used to compare different classes of models or, say, to tune over a region in parameter space; after the model is finalized, an outer holdout or final test set may be used for performance metrics
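a minimal sketch of the nesting (split sizes arbitrary), via two successive splits:

    import numpy as np
    from sklearn.model_selection import train_test_split

    X, y = np.arange(100).reshape(50, 2), np.arange(50)

    # first carve off the outer (final) test set...
    X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    # ...then split the remainder into training and validation sets
    X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)
    # tune/compare models on (X_val, y_val); report performance once,
    # on (X_test, y_test), only after the model is finalized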
sequential forward selection (SFS) / sequential backward selection (SBS)
SFS is a method for choosing relevant features for model building; it uses an iterative process with holdout nesting / cross-validation, starting with a single feature, optimizing, then selecting another feature to pair with it, and so on
SBS goes “backwards” from some oversized set of features, paring them down
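scikit-learn’s SequentialFeatureSelector implements both directions (the estimator, dataset, and feature count here are arbitrary choices for illustration):

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)

    # direction="forward" gives SFS; direction="backward" gives SBS
    sfs = SequentialFeatureSelector(
        KNeighborsClassifier(n_neighbors=3),
        n_features_to_select=2,
        direction="forward",
        cv=5,  # cross-validation supplies the holdout nesting
    )
    sfs.fit(X, y)
    print(sfs.get_support())  # boolean mask of the selected features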
regularization (aka shrinkage)
when fitting a function-based model to the data, we optimize not just the accuracy of the fit but also the simplicity of the model; we’re fitting on both accuracy and simplicity
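eg ridge (L2) and lasso (L1) regression add a penalty on coefficient size to the fitting objective; a sketch with invented data:

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    rng = np.random.RandomState(0)
    X = rng.randn(20, 5)
    y = 2.0 * X[:, 0] + 0.1 * rng.randn(20)  # only feature 0 matters

    # alpha controls the accuracy/simplicity trade-off: larger alpha
    # shrinks the coefficients harder (a "simpler" model)
    print(Ridge(alpha=1.0).fit(X, y).coef_)  # small but nonzero weights
    print(Lasso(alpha=0.1).fit(X, y).coef_)  # L1 can zero weights out entirely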
the curse of dimensionality
when there are very many feature attributes, ie the attribute space is of very high dimension, many methods degrade: data become sparse relative to the space, and distance measures lose their discriminating power
eg for nearest neighbor methods, there may be so many irrelevant dimensions that the distance metric is swamped by all the extraneous distance components
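a quick numpy demo of the effect (uniform random data; the exact numbers don’t matter): as dimension grows, nearest and farthest neighbors become nearly indistinguishable:

    import numpy as np

    rng = np.random.default_rng(0)
    for d in (2, 10, 100, 1000):
        points = rng.random((1000, d))     # 1000 random points in d dimensions
        query = rng.random(d)
        dist = np.linalg.norm(points - query, axis=1)
        print(d, dist.min() / dist.max())  # ratio creeps toward 1 as d grows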
Jaccard distance
used eg for nearest neighbor models–treat each instance object as a set of elements (eg whiskey attributes, like “light yellow”, “salty”, “peaty”, etc)
the distance treats two instance objects as sets of characteristics, with binary “in or out” tags for each attribute (as with the whiskey case study); the distance is one minus the cardinality of the set intersection (logical AND) of the two instance sets, divided by the cardinality of the set union (logical OR) of the two instance sets; it is close to 1 if the sets have little in common, and goes to 0 if the sets are identical
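in code (whiskey attribute names invented for illustration):

    # two whiskeys as sets of binary characteristics
    a = {"light yellow", "salty", "peaty"}
    b = {"light yellow", "peaty", "fruity", "smoky"}

    # 1 - |intersection| / |union|
    jaccard_distance = 1 - len(a & b) / len(a | b)
    print(jaccard_distance)  # 1 - 2/5 = 0.6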
cosine distance
used eg in text classification contexts, to measure the similarity of two documents
for feature vectors x and y, it’s: 1 - (x·y) / (||x||_2 ||y||_2); ie 1 minus the dot product of the feature vectors divided by the product of their L2 norms (so really 1 minus the cosine of the angle between the vectors)
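in code (toy term-count vectors invented):

    import numpy as np

    def cosine_distance(x, y):
        # 1 minus the cosine of the angle between the feature vectors
        return 1 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

    x = np.array([1.0, 2.0, 0.0])  # eg term counts for document 1
    y = np.array([2.0, 4.0, 0.0])  # document 2 points in the same direction

    print(cosine_distance(x, y))                          # 0.0 (same angle)
    print(cosine_distance(x, np.array([0.0, 0.0, 3.0])))  # 1.0 (orthogonal)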
class priors
in classification problems, the class prior is an estimate of the probability that randomly sampling an instance from the population will yield the given class (regardless of any attributes of the instance); this has overtones of Bayes, where the “prior probability distribution” is the distribution under consideration “prior” to receiving “new information” (such as that gleaned by a predictive model)
precision/recall, sensitivity/specificity, PPV/NPV
                       true
                 pos     neg     (measure)
    pred  pos    TP      FP      PPV
          neg    FN      TN      NPV
    (measure)    SN      SP      AC
PPV: TP/(TP+FP) (aka precision)
NPV: TN/(FN+TN)
SN: TP/(TP+FN) (aka recall; aka true positive rate)
SP: TN/(FP+TN) (aka true negative rate)
AC: (TP+TN)/(TP+FP+TN+FN) (aka accuracy)
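computing these from a confusion matrix (labels invented; note scikit-learn orders the 2x2 matrix with row/col 0 = negative):

    import numpy as np
    from sklearn.metrics import confusion_matrix

    y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
    y_pred = np.array([1, 1, 0, 0, 0, 1, 0, 1])

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

    ppv = tp / (tp + fp)  # precision
    npv = tn / (fn + tn)
    sn = tp / (tp + fn)   # recall / sensitivity / true positive rate
    sp = tn / (fp + tn)   # specificity / true negative rate
    ac = (tp + tn) / (tp + fp + tn + fn)
    print(ppv, npv, sn, sp, ac)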
F-measure, aka F1 score
harmonic mean of precision and recall (sensitivity): 2*(precision*recall)/(precision + recall)
features:
- if both sensitivity (recall) and precision are “perfect” at 1.0, then the F-measure is 1.0 as well
- if both sensitivity (recall) and precision are “worst” at some small epsilon, then the F-measure is 2*eps^2/(2*eps) = eps
- the F-measure may be useful for imbalanced data sets, where there is the risk of a classifier (when there are very few positive cases) “hedging” by just predicting all cases to be negative–the F1 score may be used to score the model(s) for training (see the sketch after this list)
- NB: this is not to be confused with the F statistic or F test or F value (used eg in ANOVA)
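a tiny sketch of the imbalance point (data invented): an all-negative “hedging” classifier scores high on accuracy but zero on F1:

    import numpy as np
    from sklearn.metrics import accuracy_score, f1_score

    y_true = np.array([0] * 95 + [1] * 5)  # only 5% positives
    y_all_negative = np.zeros(100, dtype=int)

    print(accuracy_score(y_true, y_all_negative))             # 0.95, misleading
    print(f1_score(y_true, y_all_negative, zero_division=0))  # 0.0, recall is zero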
ranking classifier
a ranking classifier does more than just predict a class for the test instance; along with the predicted class membership it returns a “certainty” level (eg a score in [0,1]; note this may differ from a true probability estimate of class membership); the instances may then be ranked in order of predicted certainty
features:
* ranking classifiers allow setting better cutoffs for accuracy, and allow more detailed performance analysis through eg confusion matrices and profit curves, ROC curves, and lift curves
* note, a confusion matrix for a ranking classifier with binary categories generally means we’ve assumed the “top” n instances are “positive”; then, in the 2x2 classification table, the top row has n entries and the bottom row has the rest of the instances in the test set
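a sketch of ranking and a top-n cutoff (model, data, and n are arbitrary choices for illustration):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.RandomState(0)
    X = rng.randn(100, 3)
    y = (X[:, 0] + 0.5 * rng.randn(100) > 0).astype(int)  # invented labels

    clf = LogisticRegression().fit(X, y)

    # certainty scores for the positive class; rank instances by them
    scores = clf.predict_proba(X)[:, 1]
    ranking = np.argsort(-scores)  # most-certain positives first

    n = 10                         # call the top n instances "positive"
    top_n = ranking[:n]
    tp = y[top_n].sum()
    fp = n - tp
    print(tp, fp)                  # the top row of the 2x2 table has n entries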