Module 6 Flashcards

1
Q

Describe the Supervised learning problem

A
  • Outcome measurement Y (dependent variable, response/target)
  • Vector of p predictor measurements X (inputs, regressors, covariates, independent variables)

2
Q

What are X and Y in regression/classification problems?

A

Regression problem
- Y is quantitative (price, blood pressure)

Classification problem
- Y takes values in a finite unordered set (classes, true/false)

In both cases we have training data: observed instances of (X, Y)

3
Q

List objectives of supervised learning (AUA)

A
  • Accurately predict unseen test cases
  • Understand which inputs affect the outcome and how
  • Assess the quality of our predictions and inferences
4
Q

Describe unsupervised learning

A
  • No outcome variable, just a set of predictors/features measured on a set of samples
  • objective is fuzzier: find groups of similar samples, interesting patterns
  • difficult to tell how well you're doing
  • useful as pre-processing for supervised learning
5
Q

Describe Statistical Learning vs ML

A

ML is a subset of AI
Statistical learning (SL) is a subfield of statistics

ML has a greater emphasis on large-scale applications and prediction accuracy

SL emphasizes models and their interpretability, precision, and uncertainty

6
Q

Describe the regression function

A
  • f(x) = E(Y | X = x); also defined for a vector X: f(x) = f(x1, x2, x3) = E(Y | X1 = x1, X2 = x2, X3 = x3)
  • Is the ideal/optimal predictor of Y with regard to mean squared prediction error: it minimizes that error
  • ε = Y − f(x) is the irreducible error: error in prediction due to the distribution of Y values at each x
  • mean squared prediction error = reducible error + irreducible error:
    E[(Y − ˆf(X))^2 | X = x] = [f(x) − ˆf(x)]^2 + Var(ε)
                                (reducible)       (irreducible)
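The decomposition can be checked numerically. A sketch using numpy, where the "true" f, the imperfect estimate ˆf, and the noise level are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):      return 2.0 * x + 1.0   # made-up true regression function
def f_hat(x):  return 2.0 * x + 1.5   # made-up estimate, off by 0.5 everywhere
sigma = 1.0                           # sd of the irreducible noise eps

x = 3.0
y = f(x) + rng.normal(0.0, sigma, size=1_000_000)  # many draws of Y at X = x

mse = np.mean((y - f_hat(x)) ** 2)   # estimates E[(Y - f_hat(X))^2 | X = x]
reducible = (f(x) - f_hat(x)) ** 2   # [f(x) - f_hat(x)]^2 = 0.25
irreducible = sigma ** 2             # Var(eps) = 1.0

print(mse, reducible + irreducible)  # the two should nearly agree
```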
7
Q

Describe the nearest neighbor

A
  • ˆf(x) = Ave(Y | X ∈ N(x)), where N(x) is a neighbourhood of x
  • good for small p (p <= 4) and large-ish sample sizes
  • can be lousy when p is large due to the curse of dimensionality: nearest neighbours tend to be far away in high dimensions
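Nearest-neighbour averaging can be sketched in a few lines of numpy; the toy data and choice of k below are made up:

```python
import numpy as np

def knn_predict(x0, X, y, k=3):
    """Nearest-neighbour estimate: average the responses y_i of the k
    training points x_i closest to x0 (the neighbourhood N(x0))."""
    dists = np.linalg.norm(X - x0, axis=1)  # Euclidean distance to each point
    nearest = np.argsort(dists)[:k]         # indices of the k nearest neighbours
    return y[nearest].mean()

X = np.array([[0.0], [1.0], [2.0], [10.0]])
y = np.array([0.0, 1.0, 2.0, 10.0])
print(knn_predict(np.array([1.2]), X, y, k=3))  # averages y at x = 0, 1, 2 -> 1.0
```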
8
Q

Describe the linear model

A

f(X) = β0 + β1X1 + β2X2 + … + βpXp

  • Parametric model
  • specified in terms of p + 1 parameters
  • almost never correct, but a good and interpretable approximation to the unknown true function
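A sketch of estimating the p + 1 parameters by least squares with numpy; the simulated data and true coefficients are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up data from y = 1 + 2*x1 - 3*x2 + noise (p = 2 predictors)
n = 500
X = rng.normal(size=(n, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(0.0, 0.1, size=n)

# p + 1 parameters: prepend a column of ones for the intercept beta_0
X1 = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)  # least-squares fit of beta_0..beta_p
print(beta)  # close to [1, 2, -3]
```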
9
Q

trade-offs of linear model (PGP)

A
  1. Prediction accuracy vs interpretability
    - linear models are easy to interpret
  2. Good fit vs over/under-fit
  3. Parsimony vs black box
    - prefer a simple model involving fewer variables
10
Q

Describe assessing model accuracy

A

Compute average squared prediction error over TE (fresh test data) rather than TR (training data) to avoid bias towards overfit models.
- MSE_Te = Ave_{i∈Te} [yi − ˆf(xi)]^2

11
Q

Describe Bias Variance Trade-off

A
  • As the flexibility of ˆf increases, its variance increases and its bias decreases
  • choosing the flexibility that minimizes average test error amounts to a bias-variance trade-off
12
Q

Describe Classification Problem (BAU)

A
  • Response variable Y is qualitative
  • Goals are to:
    1) Build a classifier that assigns a class label from C to a future unlabeled observation X
    2) Assess uncertainty in each classification
    3) Understand the roles of different predictors among X
13
Q

Is there an ideal C(X)?

A
- Let pk(x) = Pr(Y = k | X = x), k = 1, 2, …, K.
These are the conditional class probabilities.

The Bayes-optimal classifier at x is
C(x) = j if pj(x) = max{p1(x), p2(x), …, pK(x)}
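The Bayes classifier is just an argmax over the conditional class probabilities. A sketch, where the probabilities pk(x) at some point x are made up:

```python
import numpy as np

def bayes_classifier(p):
    """C(x) = j where p_j(x) is the largest conditional class probability.
    Classes are labelled 1..K."""
    return int(np.argmax(p)) + 1

# Made-up conditional class probabilities p_k(x) at some x, K = 3
p = [0.2, 0.5, 0.3]
print(bayes_classifier(p))  # class 2 has the largest probability
```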

14
Q

Classification details (MBS)

A
  • Measure Performance through misclassification rate
    Err_Te = Ave_{i∈Te} I[yi ≠ ˆC(xi)]
  • Bayes classifier has the smallest error
  • SVM builds structured models for C(x)
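The misclassification rate is the fraction of test cases where the predicted class differs from the true one. A sketch with made-up labels and predictions:

```python
import numpy as np

def error_rate(y_te, c_hat):
    """Test misclassification rate: average of the indicator I[y_i != C_hat(x_i)]."""
    return np.mean(np.asarray(y_te) != np.asarray(c_hat))

y_te  = ["spam", "ham", "ham", "spam"]   # made-up true labels
c_hat = ["spam", "spam", "ham", "spam"]  # made-up predictions
print(error_rate(y_te, c_hat))  # 1 of 4 wrong -> 0.25
```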
15
Q

Describe Tree based models

A
  • for regression and classification
  • involve stratifying or segmenting predictor space into a number of simple regions
  • since the splitting rules can be summarized in a tree, these are known as decision tree methods
16
Q

Describe Pros and Cons of tree-based methods

A
  • Simple / useful for interpretation
  • not competitive with the best supervised learning approaches in terms of prediction accuracy
  • combining many trees can dramatically improve prediction accuracy, at the cost of some interpretability
17
Q

Details of tree building process

A
  • Divide predictor space into J distinct nonoverlapping regions
  • For every observation in region R, we make the same prediction = mean of response values for training observations in R
  • Goal is to find boxes R1, …, RJ that minimize RSS = Σ_{j=1}^{J} Σ_{i∈Rj} (yi − ˆyRj)^2
  • Takes a top-down greedy approach - recursive binary splitting
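One greedy step of recursive binary splitting for a single predictor can be sketched as below; the toy data are made up, and a full implementation would scan all predictors and recurse on each resulting region:

```python
import numpy as np

def best_split(x, y):
    """Scan candidate cut points s and pick the one minimizing the RSS of the
    two resulting regions, predicting the mean response within each region."""
    best = (np.inf, None)
    for s in np.unique(x)[1:]:  # candidate cut points (left side never empty)
        left, right = y[x < s], y[x >= s]
        rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if rss < best[0]:
            best = (rss, s)
    return best

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.0, 1.1, 0.9, 5.0, 5.1, 4.9])
rss, s = best_split(x, y)
print(s)  # cuts between the two obvious groups, at s = 10.0
```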
18
Q

Describe classification tree

A
  • Used to predict qualitative response

- Predict that each observation belongs to the most commonly occurring class of training observations in its region

19
Q

Details of classification tree

A
  • uses recursive binary splitting
  • Uses classification error rate rather than RSS: E = 1 − max_k(ˆpmk)
  • ˆpmk = proportion of training observations in the mth region that are from the kth class
  • Two other measures are preferable - Gini index and deviance
20
Q

Describe Gini index

A

G = Σ_{k=1}^{K} ˆpmk (1 − ˆpmk)

- takes on a small value if all of the ˆpmk are close to 0 or 1
  • measure of node purity: a small value means the node contains predominantly observations from a single class
  • similar to cross-entropy
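The Gini index is a short sum over the class proportions of one region; the example proportions below are made up:

```python
def gini(p_hat):
    """Gini index G = sum_k p_mk * (1 - p_mk) for one region's class proportions."""
    return sum(p * (1 - p) for p in p_hat)

print(gini([1.0, 0.0]))  # pure node -> 0.0
print(gini([0.5, 0.5]))  # maximally mixed two-class node -> 0.5
```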
21
Q

Tree 10 fold / N fold cross validation

A
  • Divide the dataset into 10 (or N) parts; use 9 (N − 1) parts as the training set and the remaining part as the test set
  • repeat the process 10 (N) times, using every part once for testing
  • stratified sampling is used to divide the dataset so each fold preserves the class proportions
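The fold construction can be sketched in numpy. This version does plain (non-stratified) splitting; stratified sampling would additionally balance class proportions per fold:

```python
import numpy as np

def kfold_indices(n, k=10, seed=0):
    """k-fold CV split: shuffle the n sample indices, cut them into k roughly
    equal parts, and let each part serve once as the test fold while the
    remaining k - 1 parts form the training set."""
    idx = np.random.default_rng(seed).permutation(n)
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)  # everything not in the test fold
        yield train, fold

for train, test in kfold_indices(20, k=10):
    print(len(train), len(test))  # 18 train / 2 test per fold for n = 20
```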
22
Q

Evaluation Measures

A

Accuracy = (TP + TN) / (TP + TN + FP + FN)
True Positive Rate = TP / (TP + FN)
False Positive Rate = FP / (FP + TN)
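These three measures are direct ratios of confusion-matrix counts; the counts below are made up:

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def tpr(tp, fn):
    """True positive rate (sensitivity/recall): correct among actual positives."""
    return tp / (tp + fn)

def fpr(fp, tn):
    """False positive rate: false alarms among actual negatives."""
    return fp / (fp + tn)

tp, tn, fp, fn = 40, 45, 5, 10  # made-up confusion-matrix counts
print(accuracy(tp, tn, fp, fn))  # 85/100 = 0.85
print(tpr(tp, fn))               # 40/50 = 0.8
print(fpr(fp, tn))               # 5/50 = 0.1
```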

23
Q

Issues with decision trees

A
  • Missing values
    - assign the most common attribute value or the most common value for the class
  • Overfitting
    - accuracy is high on training data but low on test data
  • Reduced-error pruning
    - remove a sub-tree and make it a leaf node
24
Q

Describe unsupervised Learning

A
  • Only observe the features such as X1, X2, etc.

- Not interested in prediction since no response variable Y

25
Q

Goals of unsupervised learning

A
  • discover interesting things about the measurements: patterns, subgroups, informative ways to visualize the data

- two main methods: clustering and principal components analysis

26
Q

Challenges of unsupervised learning

A
  • More subjective than supervised, no simple goal
27
Q

Advantage of unsupervised learning

A
  • Growing importance

- easier to obtain unlabeled data than labeled data

28
Q

Describe clustering

A
  • techniques for finding subgroups or clusters in a dataset
  • find similarity patterns
  • must define what is similar vs different
29
Q

clustering advantages

A
  • Clustering data
  • Discover communities
  • Crash report grouping
30
Q

Details of k means clustering

A
  • each observation belongs to at least one cluster
  • no observation belongs to more than one cluster
  • a good clustering is one for which the within-cluster variation is as small as possible
  • Thus, minimize the total WCV(Ck) over all clusters
31
Q

How to define within cluster variation

A
  • Euclidean distance

Σ_{k=1}^{K} (1/|Ck|) Σ_{i,i′∈Ck} Σ_{j=1}^{p} (xij − xi′j)^2
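The inner term (one cluster's contribution) can be computed directly from all pairwise squared distances; the toy cluster below is made up:

```python
import numpy as np

def wcv(cluster):
    """Within-cluster variation for one cluster C_k (squared Euclidean form):
    (1/|C_k|) * sum over ordered pairs i, i' in C_k of sum_j (x_ij - x_i'j)^2."""
    cluster = np.asarray(cluster)
    diffs = cluster[:, None, :] - cluster[None, :, :]  # all pairwise differences
    return (diffs ** 2).sum() / len(cluster)

# Two points at distance 1: both ordered pairs contribute 1 -> 2 / |C_k| = 1.0
print(wcv([[0.0, 0.0], [1.0, 0.0]]))
```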

32
Q

K-Means clustering algorithm

A
  1. Randomly assign an initial cluster label to each observation
  2. Iterate until cluster assignments stop changing
    - compute each cluster's centroid
    - assign each observation to the cluster whose centroid is closest in Euclidean distance

not guaranteed to find the global minimum (only a local one)
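The alternating steps can be sketched in numpy. This variant initializes centroids from k distinct data points rather than from random labels (a common, simpler-to-code choice); the toy data are made up:

```python
import numpy as np

def kmeans(X, k, seed=0, iters=100):
    """Plain K-means: pick k data points as initial centroids, then alternate
    (a) assigning each observation to its nearest centroid (Euclidean) and
    (b) recomputing centroids, until assignments stop changing.
    Converges to a local, not necessarily global, minimum of the WCV."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # distinct starts
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        new = d.argmin(axis=1)              # nearest-centroid assignment
        if np.array_equal(new, labels):     # assignments stopped changing
            break
        labels = new
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels = kmeans(X, k=2)
print(labels)  # the two nearby pairs end up in separate clusters
```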