Final Exam Flashcards
The Four V’s
Volume - the sheer scale/amount of data
Variety - many different types and sources of data
Velocity - the speed at which new data arrives
Veracity - a lot of noise/false alarms; data quality and uncertainty
What makes predictive modeling difficult?
- Millions of patients to analyze - dx, rx, etc.
- Many models to be built
Computational Phenotyping
Raw data (demo, dx, rx, labs) -> phenotypes
Patient Similarity
Simulate doctor’s case-based reasoning with algorithms
Hadoop
Distributed disk-based big data system
Spark
Distributed in-memory big data system
T/F: Hadoop is much faster than Spark
False. Spark is in-memory, so it is faster
What are the steps of the predictive modeling pipeline?
Prediction target definition -> cohort construction -> feature construction -> feature selection -> predictive model building -> performance evaluation (iterate as needed)
Prediction Target should be both ____ and ____
interesting and possible
Cohort Construction
Defining the study population
Prospective vs. Retrospective
Prospective - identify cohort then collect data
Retrospective - Retrieve historical data then identify cohort
T/F: A prospective study has more noise in the data than a retrospective study
False. Retrospective study has more noise in historical data
T/F: A prospective study is more expensive than a retrospective study
True. The data collection has to be pre-planned for the study
T/F: A prospective study takes more time than a retrospective study
True. The data collection has to be planned and executed before analysis of the data.
T/F: A prospective study more commonly involves a larger dataset than a retrospective study.
False. A retrospective study more often involves a large dataset because historical data can be accessed more easily
Cohort Study
The goal is to select a group of patients who are exposed to the risk of the target outcome.
Example: Target is heart failure readmission. The Cohort contains all HF patients discharged from hospital. The key in a cohort study is to define the right inclusion/exclusion criteria.
Case-Control Study
Identify two sets of patients - cases and controls. Put the case patients and control patients together to define the cohort.
Case in Case-Control study
Patients with positive outcome (have disease)
Control in Case-Control study
Patients with negative outcome (healthy) but otherwise similar to the case patients
Feature Construction Goal
Construct all potentially relevant features about patients in order to predict the target outcome
Example components of a Feature Construction pipeline
Define the index date, observation window, and prediction window, then aggregate the raw events (dx, rx, labs) that fall in the observation window into features (e.g., counts, averages)
Large observation window and short prediction window
Small observation window and large prediction window. This is the most useful setup but the most unrealistic and difficult
Curve B because it can predict accurately for a longer period of time while the performance drops quickly for the other models
C, 630 days. The performance plateaus beyond that point. There is a trade-off between how long the observation window is and how many patients have enough data for the longer window.
Goal of Feature Selection
Find the truly predictive features to be included in the model.
T/F: Training error is not very useful
True. The model can overfit the training data, so training error is overly optimistic
Leave one out cross validation
Take one example at a time as the validation set and use the remaining examples for training. Repeat the process for every example. The final performance is the average predictive performance across all iterations
K Fold Cross Validation
The dataset is split into K chunks (folds); each fold is used once as the validation set while the remaining K-1 folds are used for training. Performance is averaged over the K iterations
Randomized Cross Validation
Randomly split the dataset into train and validation sets. The model is fit to the training data and accuracy is assessed on the validation data. Results are averaged over all the splits. Advantage over K-Fold - the proportion of the training/validation split does not depend on the number of folds. Disadvantage - some observations may never be selected into the validation set.
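A minimal sketch of the three validation strategies, assuming scikit-learn is available; the synthetic dataset and logistic regression model are placeholder choices, not from the cards:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, KFold, ShuffleSplit, cross_val_score

# Placeholder data and model for illustration only
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

# Leave-one-out: one example held out per iteration, n_samples iterations in total
loo = cross_val_score(model, X, y, cv=LeaveOneOut())

# K-fold: split into K chunks, each chunk used once as the validation set
kfold = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))

# Randomized: repeated random train/validation splits, split size independent of K
shuffled = cross_val_score(model, X, y, cv=ShuffleSplit(n_splits=10, test_size=0.2, random_state=0))

print(loo.mean(), kfold.mean(), shuffled.mean())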
What is Hadoop MapReduce?
- A programming model
- An execution environment (Hadoop is the open-source Java implementation)
- A software package - tools developed to facilitate data science tasks
What capabilities does Hadoop provide?
- Distributed storage - the HDFS file system
- Distributed computation - MapReduce
- Fault tolerance - recovery from system failures
Computational Process of Hadoop
Data is partitioned across machines; map tasks process the local partitions in parallel, intermediate results are shuffled and grouped by key, and reduce tasks aggregate the values for each key
The fundamental pattern for writing an algorithm with Hadoop is to specify the algorithm as ____
aggregation statistics
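A toy, single-process sketch of that aggregation pattern in Python (not real Hadoop code); the patient records and the choice of key are made up for illustration:

from collections import defaultdict

# Made-up (patient, event) records
records = [("patient1", "dx:heart_failure"),
           ("patient2", "dx:diabetes"),
           ("patient1", "dx:heart_failure"),
           ("patient3", "dx:diabetes")]

# Map: emit a (key, 1) pair for every event we want to count
mapped = [(event, 1) for _, event in records]

# Shuffle: group the intermediate pairs by key
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: aggregate the values for each key (here, a simple count)
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # {'dx:heart_failure': 2, 'dx:diabetes': 2}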
First stage of MapReduce System
Map - apply the map function to each input record in parallel, emitting intermediate key-value pairs
Second stage of MapReduce System
Shuffle/sort - group all intermediate values by key and route each group to a reducer
Final stage of MapReduce System
Reduce - aggregate the values for each key to produce the final output
In what way is MapReduce designed to minimize re-computation
When a component fails, only that component's work is re-computed rather than the whole job
What is HDFS?
The back-end file system to store all the data to process using the MapReduce paradigm.
What are limitations of MapReduce?
- Cannot directly access data (must use map/reduce and aggregation query)
- Logistic Regression is not easy to implement in MapReduce due to the iterative batch gradient descent approach - the data must be loaded again (twice) for each iteration
MapReduce KNN
Map - each mapper computes the distance from the query point to the training examples in its partition and emits the local k nearest (distance, label) pairs. Reduce - merge the partition results, keep the global k nearest, and output the majority label
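A single-process sketch of KNN written in map/reduce style; the training points, query point, and k are illustrative values, not from the cards:

import heapq
import math
from collections import Counter

# Made-up training data, query point, and k
train = [([1.0, 2.0], "case"), ([2.0, 1.0], "control"),
         ([8.0, 9.0], "case"), ([9.0, 8.0], "control"),
         ([1.5, 1.5], "control")]
query = [1.2, 1.8]
k = 3

def mapper(partition):
    # Map: emit the k nearest (distance, label) pairs found in this partition
    dists = [(math.dist(x, query), label) for x, label in partition]
    return heapq.nsmallest(k, dists)

def reducer(mapped_partitions):
    # Reduce: merge partition results, keep the global k nearest, majority vote
    merged = heapq.nsmallest(k, (p for part in mapped_partitions for p in part))
    return Counter(label for _, label in merged).most_common(1)[0][0]

partitions = [train[:3], train[3:]]  # pretend the data is split across two nodes
print(reducer([mapper(p) for p in partitions]))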
True Positive
Prediction Outcome Positive & Condition Positive
False Positive
Prediction Outcome Positive & Condition Negative
False Negative
Prediction Outcome Negative & Condition Positive
True Negative
Prediction Outcome Negative & Condition Negative
Type I Error
False Positive
Type II Error
False Negative
Accuracy
(TP + TN) / Total Population
True Positive Rate
TP / (TP + FN)
False Positive Rate
FP / (FP + TN)
False Negative Rate
FN / (TP + FN)
True Negative Rate
TN / (FP + TN)
Sensitivity
TP / (TP + FN)
Recall
TP / (TP + FN)
Specificity
TN / (FP + TN)
Prevalence
Condition Positive (TP + FN) / Total Population
Positive Predictive Value
TP / (TP + FP)
False Discovery Rate
FP / (TP + FP)
False Omission Rate
FN / (FN + TN)
Negative Predictive Value
TN / (FN + TN)
F1 Score
2 * [ (Precision * Recall) / (Precision + Recall) ]
Harmonic mean of Precision and Recall
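A small worked example computing these metrics from a made-up confusion matrix (the counts are arbitrary):

# Made-up confusion matrix counts
TP, FP, FN, TN = 40, 10, 5, 45
total = TP + FP + FN + TN

accuracy    = (TP + TN) / total
sensitivity = recall = TP / (TP + FN)   # true positive rate
specificity = TN / (TN + FP)            # true negative rate
precision   = TP / (TP + FP)            # positive predictive value
npv         = TN / (TN + FN)            # negative predictive value
prevalence  = (TP + FN) / total
f1          = 2 * precision * recall / (precision + recall)

print(accuracy, sensitivity, specificity, precision, npv, prevalence, f1)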
What does the ROC curve do?
Illustrates overall performance of a classifier when varying the threshold value
What is the AUC?
A performance metric that does not depend on threshold value
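A short sketch with scikit-learn's roc_curve and roc_auc_score; the labels and scores are made-up values:

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = np.array([0, 0, 1, 1, 0, 1, 1, 0])                   # made-up labels
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.5])  # predicted scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one (FPR, TPR) point per threshold
print(roc_auc_score(y_true, y_score))              # threshold-independent summary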
Regression Metrics (MSE & MAE)
MSE = (1/n) * sum of (y_i - yhat_i)^2; MAE = (1/n) * sum of |y_i - yhat_i|. Both depend on the scale of the outcome
Regression Metric that can be used across datasets
R^2 (coefficient of determination) = 1 - SS_res / SS_tot; it is normalized by the variance of the outcome, so it is comparable across datasets
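A small worked example with made-up predictions:

import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # made-up targets
y_pred = np.array([2.5, 5.0, 3.0, 8.0])   # made-up predictions

mse = np.mean((y_true - y_pred) ** 2)
mae = np.mean(np.abs(y_true - y_pred))
r2  = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)
print(mse, mae, r2)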
Gradient Descent Method
Iteratively update the parameters in the direction of the negative gradient of the loss: w <- w - eta * grad J(w), where eta is the learning rate
Gradient Descent Method for Linear Regression
Batch update using all training examples per step: w_j <- w_j + eta * sum over i of (y_i - yhat_i) * x_ij
Stochastic Gradient Descent
Approximate the gradient using a single (or a few) randomly chosen training example(s) per update instead of the full dataset; much cheaper per step and suited to large or streaming data
SGD for Linear Regression
w_j <- w_j + eta * (y_i - yhat_i) * x_ij, applied one example i at a time
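A minimal sketch of batch gradient descent vs. SGD for linear regression with squared-error loss; the synthetic data, learning rate, and iteration counts are arbitrary choices:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=200)

eta = 0.01  # learning rate

# Batch gradient descent: every update uses all examples
w = np.zeros(3)
for _ in range(500):
    grad = -(X.T @ (y - X @ w)) / len(y)
    w -= eta * grad

# Stochastic gradient descent: one randomly chosen example per update
w_sgd = np.zeros(3)
for _ in range(5000):
    i = rng.integers(len(y))
    grad_i = -(y[i] - X[i] @ w_sgd) * X[i]
    w_sgd -= eta * grad_i

print(w, w_sgd)  # both should approach true_w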
Steps of Ensemble Methods
- Generate a set of datasets (independently in bagging or sequentially in boosting)
- Each dataset is used to train a separate model (the models can be trained independently)
- Combine the models with an aggregation function F (average or weighted average)
Bias Variance Tradeoff
Simple models have high bias and low variance; complex models have low bias and high variance. Expected error = bias^2 + variance + irreducible noise, so good generalization requires balancing the two
Bagging
Take repeated samples of the dataset (with replacement) to create subsamples, train a separate model on each, then classify a data point by majority vote of the models
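A short sketch of bagging with bootstrap samples and a majority vote; the synthetic dataset, number of models, and depth-limited decision tree base learner are illustrative choices:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
rng = np.random.default_rng(0)

# Train 25 trees, each on a bootstrap sample (drawn with replacement)
models = []
for _ in range(25):
    idx = rng.integers(0, len(y), size=len(y))
    models.append(DecisionTreeClassifier(max_depth=3).fit(X[idx], y[idx]))

# Classify by majority vote across the models (labels are 0/1 here)
votes = np.array([m.predict(X) for m in models])
pred = (votes.mean(axis=0) > 0.5).astype(int)
print((pred == y).mean())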
Random Forest
- Build many simple (shallow) decision trees, each on a bootstrap sample with a random subset of features considered at each split, and average/vote over their predictions
- Simple trees keep the computational cost low
- The ensemble of simple trees generalizes better than a single complex tree
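A usage sketch with scikit-learn's RandomForestClassifier; the dataset and hyperparameters are placeholders:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, random_state=0)

# Many shallow trees, each fit on a bootstrap sample with a random subset
# of features considered at every split; predictions are averaged/voted
rf = RandomForestClassifier(n_estimators=100, max_depth=3, max_features="sqrt", random_state=0)
print(cross_val_score(rf, X, y, cv=5).mean())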
Why does bagging work?
Reduces variance without increasing Bias
Boosting
Incrementally build models one at a time.
Each new model is built to correct the mistakes/misclassifications of the previous ones.
Repeat the process over and over.
The final model is a weighted average of all the models.
(Often more accurate than bagging but more likely to overfit)
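A usage sketch with scikit-learn's AdaBoostClassifier as one concrete boosting method; the dataset and hyperparameters are placeholders:

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, random_state=0)

# Models are added one at a time; each new one up-weights the examples the
# previous ones misclassified, and the final prediction is a weighted vote
boost = AdaBoostClassifier(n_estimators=50, random_state=0)
print(cross_val_score(boost, X, y, cv=5).mean())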