Module 6 Flashcards
Describe the supervised learning problem
- Outcome measurement Y (dependent variable, response, target)
- Vector of p predictor measurements X (inputs, regressors, covariates, independent variables)
What are X and Y in regression/classification problems
Regression problem
- Y is quantitative (e.g., price, blood pressure)
Classification problem
- Y takes values in a finite unordered set (classes, e.g., true/false)
Both problems come with training data: observed instances of the data (x1, y1), …, (xN, yN)
List objectives of supervised learning (AUA)
- Accurately predict unseen test cases
- Understand which inputs affect the outcome and how
- Assess the quality of our predictions and inferences
Describe unsupervised learning
- No outcome variable, just a set of predictors/features measured on a set of samples
- Objective is fuzzier: find groups of samples that behave similarly
- Difficult to tell how well you are doing
- Useful as pre-processing for supervised learning
Describe Statistical Learning vs ML
ML is a subset of AI
SL is a subfield of stats
ML has a greater emphasis on large-scale applications and prediction accuracy
SL emphasizes models and their interpretability, precision, and uncertainty
Describe the regression function
- Is also defined for vector X: f(x) = f(x1, x2, x3) = E(Y | X1 = x1, X2 = x2, X3 = x3)
- Is the ideal/optimal predictor of Y with regard to mean-squared prediction error: f(x) = E(Y | X = x) minimizes the error
- ε = Y − f(x) is the irreducible error: even knowing f(x), prediction errors remain because of the distribution of Y values at each x
- Mean-squared prediction error = reducible error + irreducible error:
E[(Y − f̂(X))² | X = x] = [f(x) − f̂(x)]² + Var(ε)
where [f(x) − f̂(x)]² is the reducible part (improve f̂) and Var(ε) is irreducible
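A quick way to see the decomposition is to simulate it. A minimal NumPy sketch, where the true f, the fitted f̂, and the noise level are all made-up assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

f = lambda x: 2.0 + 3.0 * x        # true regression function (assumed for the demo)
f_hat = lambda x: 1.5 + 3.2 * x    # some imperfect fitted model (assumed)
sigma = 0.5                        # standard deviation of the irreducible noise eps

x = 1.0
y = f(x) + rng.normal(0.0, sigma, size=100_000)  # many draws of Y at X = x

mse = np.mean((y - f_hat(x)) ** 2)       # estimates E[(Y - f_hat(X))^2 | X = x]
reducible = (f(x) - f_hat(x)) ** 2       # [f(x) - f_hat(x)]^2
irreducible = sigma ** 2                 # Var(eps)
print(mse, reducible + irreducible)      # the two values nearly coincide
```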
Describe nearest-neighbor averaging
- f̂(x) = Ave(Y | X ∈ N(x)), where N(x) is a neighborhood of x
- Works well when p is small (p ≤ 4) and the sample size N is large
- Can be lousy when p is large, due to the curse of dimensionality: nearest neighbors tend to be far away in high dimensions
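A minimal NumPy sketch of nearest-neighbor averaging (the helper name knn_regress and the toy data are my own, not from the course):

```python
import numpy as np

def knn_regress(x0, X, y, k=5):
    """Nearest-neighbor averaging: f_hat(x0) = Ave(y_i | x_i in N(x0))."""
    dists = np.linalg.norm(X - x0, axis=1)   # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]          # indices of the k closest points, i.e. N(x0)
    return y[nearest].mean()                 # average their responses

# toy usage: p = 2, comfortably inside the p <= 4 guideline
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(200, 2))
y = X[:, 0] + np.sin(X[:, 1]) + rng.normal(0, 0.1, 200)
print(knn_regress(np.array([0.5, 0.5]), X, y))
```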
Describe the linear model
f(X) = β0 + β1X1 + β2X2 + … + βpXp
- A parametric model
- Specified in terms of p + 1 parameters
- Almost never correct, but serves as a good and interpretable approximation to the unknown true function
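A minimal least-squares sketch of fitting the p + 1 parameters (the coefficients and data are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, 2.0, -1.0, 0.5])   # beta_0 .. beta_3 (assumed)
y = beta_true[0] + X @ beta_true[1:] + rng.normal(0, 0.3, n)

# Fit the p + 1 parameters by least squares: prepend an intercept column of ones.
X1 = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(beta_hat)   # approximately recovers beta_true
```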
Trade-offs of the linear model (PGP)
- Prediction accuracy vs. interpretability: linear models are easy to interpret
- Good fit vs. over/under-fit
- Parsimony vs. black box: prefer a simpler model with fewer variables when it performs comparably
Describe assessing model accuracy
Compute the average squared prediction error over Te (fresh test data) rather than Tr (training data), to avoid bias toward overfit models:
MSE_Te = Ave_{i∈Te} [yi − f̂(xi)]²
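As a sketch, the test MSE is a one-line helper (the function name and usage are illustrative):

```python
import numpy as np

def test_mse(y_te, y_pred):
    """MSE_Te = Ave_{i in Te} (y_i - f_hat(x_i))^2, computed on held-out data."""
    return float(np.mean((np.asarray(y_te) - np.asarray(y_pred)) ** 2))

# hypothetical usage: evaluate on the test set Te, never on the training set Tr
# err = test_mse(y_test, model_predictions_on_test)
```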
Describe Bias Variance Trade-off
- As the flexibility of f̂ increases, its variance increases and its bias decreases
- Choosing the flexibility based on average test error amounts to a bias-variance trade-off
Describe Classification Problem (BAU)
- Response variable Y is qualitative
- Goals are to:
1) Build a classifier that assigns a class label from C to a future unlabeled observation X
2) Assess uncertainty in each classification
3) Understand the roles of different predictors among X
Is there an ideal C(X)?
- Let pk(x) = Pr(Y = k | X = x), k = 1, 2, …, K. These are the conditional class probabilities at x.
The Bayes optimal classifier at x is
C(x) = j if pj(x) = max{p1(x), p2(x), …, pK(x)}
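The Bayes rule is just an argmax over the conditional class probabilities; a tiny illustrative sketch (function name and probabilities are made up):

```python
import numpy as np

def bayes_classify(p):
    """Given conditional class probabilities p_k(x) for k = 1..K,
    return the class j with the largest probability (1-indexed)."""
    return int(np.argmax(np.asarray(p))) + 1

print(bayes_classify([0.2, 0.5, 0.3]))   # -> 2
```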
Classification details (MBS)
- Measure performance using the misclassification error rate:
Err_Te = Ave_{i∈Te} I[yi ≠ Ĉ(xi)]
- The Bayes classifier has the smallest error
- Support vector machines (SVMs) build structured models for C(x)
Describe Tree based models
- Can be used for both regression and classification
- Involve stratifying or segmenting the predictor space into a number of simple regions
- Since the splitting rules can be summarized in a tree, these are known as decision tree methods
Describe Pros and Cons of tree-based methods
- Simple and useful for interpretation
- Not competitive with the best supervised learning approaches in terms of prediction accuracy
- Combining many trees can dramatically improve prediction accuracy, at the cost of some interpretability
Details of tree building process
- Divide the predictor space into J distinct, non-overlapping regions R1, …, RJ
- For every observation that falls in region Rj, make the same prediction: the mean of the response values of the training observations in Rj
- Goal is to find boxes R1, …, RJ that minimize RSS = Σ_{j=1}^{J} Σ_{i∈Rj} (yi − ŷRj)²
- Since considering every possible partition is infeasible, take a top-down, greedy approach: recursive binary splitting
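One greedy step of recursive binary splitting can be sketched directly: scan every predictor j and cutpoint s and keep the pair that minimizes RSS (the helper name and structure are my own, not the course's code):

```python
import numpy as np

def best_split(X, y):
    """One greedy step of recursive binary splitting: choose predictor j and
    cutpoint s whose two half-planes {X_j < s} and {X_j >= s} give the lowest
    total RSS, where each region predicts its own mean response."""
    best_j, best_s, best_rss = None, None, np.inf
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] < s], y[X[:, j] >= s]
            if len(left) == 0 or len(right) == 0:
                continue  # skip degenerate splits with an empty region
            rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if rss < best_rss:
                best_j, best_s, best_rss = j, s, rss
    return best_j, best_s, best_rss
```

Applying best_split recursively to each resulting region (until a stopping rule) yields the full tree.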
Describe classification tree
- Used to predict qualitative response
- Predict that each observation belongs to the most commonly occurring class of training observations in its region
Details of classification tree
- uses recursive binary splitting
- Uses the classification error rate rather than RSS: E = 1 − max_k(p̂mk)
- p̂mk = proportion of training observations in the mth region that are from the kth class
- Two other measures are preferable in practice: the Gini index and the deviance
Describe Gini index
G = Σ_{k=1}^{K} p̂mk(1 − p̂mk)
- Takes on a small value if all of the p̂mk are close to 0 or 1
- A measure of node purity: a small value indicates the node contains predominantly observations from a single class
- similar to cross-entropy
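A small sketch of the Gini index for one region, assuming the region's training labels are given as an array (function name is my own):

```python
import numpy as np

def gini(labels):
    """G = sum_k p_mk * (1 - p_mk) over the K classes present in region m.
    Near 0 when the region is dominated by a single class (high purity)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()                # the proportions p_mk
    return float(np.sum(p * (1 - p)))

print(gini([0, 0, 0, 1]))   # mixed region -> larger Gini (0.375)
print(gini([1, 1, 1, 1]))   # pure region  -> 0.0
```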
Tree 10-fold / N-fold cross-validation
- Divide the dataset into 10 (or N) parts; use 9 (N − 1) parts as the training set and 1 part as the test set
- Repeat the process 10 (N) times so that every part is used once for testing, and average the results
- Stratified sampling is used to divide the dataset so each fold preserves the class proportions
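A minimal sketch of N-fold cross-validation. It uses plain random folds for brevity (a stratified split would additionally preserve class proportions), and fit_predict is an assumed user-supplied callable, not a library API:

```python
import numpy as np

def k_fold_error(X, y, fit_predict, k=10, seed=0):
    """k-fold CV: each of the k parts serves once as the test fold while the
    other k-1 parts form the training set; average the fold error rates."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    errs = []
    for i in range(k):
        te = folds[i]
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        y_hat = fit_predict(X[tr], y[tr], X[te])   # train on k-1 parts, predict held-out part
        errs.append(np.mean(y_hat != y[te]))       # misclassification rate on this fold
    return float(np.mean(errs))
```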
Evaluation Measures
Accuracy = (TP + TN) / (TP + TN + FP + FN)
True positive rate = TP / (TP + FN)
False positive rate = FP / (FP + TN)
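As a sketch, the three measures computed from the four confusion-matrix counts (function name and example counts are made up):

```python
def eval_measures(tp, tn, fp, fn):
    """Accuracy, true positive rate, and false positive rate
    from the counts of a 2x2 confusion matrix."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    tpr = tp / (tp + fn)    # true positive rate
    fpr = fp / (fp + tn)    # false positive rate
    return accuracy, tpr, fpr

print(eval_measures(tp=40, tn=45, fp=5, fn=10))   # hypothetical counts
```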
Issues with decision trees
- Missing values: assign the most common attribute value, or the most common value among training examples with the same class
- Overfitting: accuracy is high on training data but low on test data
- Fix overfitting with reduced-error pruning: remove a sub-tree and replace it with a leaf node, keeping the change if accuracy on held-out data does not drop
Describe unsupervised learning
- Only observe features X1, X2, …, Xp
- Not interested in prediction, since there is no associated response variable Y
Goals of unsupervised learning
- Discover interesting things about the measurements: informative patterns, subgroups, etc.
- Two main methods: clustering and principal components analysis
Challenges of unsupervised learning
- More subjective than supervised learning; there is no simple goal such as prediction
Advantage of unsupervised learning
- Of growing importance
- It is often easier to obtain unlabeled data than labeled data
Describe clustering
- Techniques for finding subgroups, or clusters, in a dataset
- Seek similarity patterns: observations within a group are similar to each other, while groups differ from one another
- Must define what makes observations similar or different
Clustering applications
- Clustering similar data
- Discovering communities
- Grouping crash reports
Details of k means clustering
- Each observation belongs to at least one cluster
- No observation belongs to more than one cluster (the clusters partition the observations)
- A good clustering is one for which the within-cluster variation is as small as possible
- Thus, minimize Σ_{k=1}^{K} WCV(Ck) over the cluster assignments
How to define within cluster variation
- Typically squared Euclidean distance:
WCV(Ck) = (1/|Ck|) Σ_{i,i′∈Ck} Σ_{j=1}^{p} (xij − xi′j)²
K-Means clustering algorithm
- Randomly assign an initial cluster (1 to K) to each observation
- Iterate until the cluster assignments stop changing:
- compute each cluster's centroid (the vector of the p feature means for its observations)
- assign each observation to the cluster whose centroid is closest in Euclidean distance
- Not guaranteed to reach the global minimum; run it several times from different random starts (see the sketch below)
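A compact NumPy sketch of the algorithm above. The empty-cluster re-seeding and the iteration cap are my own pragmatic additions, not part of the textbook algorithm:

```python
import numpy as np

def k_means(X, K, n_iter=100, seed=0):
    """Plain K-means: random initial assignments, then alternate centroid
    updates and nearest-centroid reassignment until assignments stabilize."""
    rng = np.random.default_rng(seed)
    assign = rng.integers(0, K, size=len(X))     # step 1: random initial cluster per observation
    for _ in range(n_iter):
        # step 2a: compute each cluster centroid (re-seed any empty cluster)
        centroids = np.array([X[assign == k].mean(axis=0) if np.any(assign == k)
                              else X[rng.integers(len(X))] for k in range(K)])
        # step 2b: assign each observation to its nearest centroid (Euclidean)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        if np.array_equal(new_assign, assign):   # assignments stopped changing
            break
        assign = new_assign
    return assign, centroids

# toy usage with different random starts, keeping any one result
X = np.random.default_rng(1).normal(size=(100, 2))
labels, centers = k_means(X, K=3, seed=7)
```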