Final Exam Flashcards
The Four V’s
Volume - the sheer scale of the data
Variety - many different types and sources of data
Velocity - the speed at which new data arrives
Veracity - uncertainty in the data: a lot of noise/false alarms
What makes predictive modeling difficult?
- Millions of patients to analyze - dx, rx, etc.
- Many models to be built
Computational Phenotyping
Raw data (demo, dx, rx, labs) -> phenotypes
Patient Similarity
Simulate doctor’s case-based reasoning with algorithms
Hadoop
Distributed disk-based big data system
Spark
Distributed in-memory big data system
T/F: Hadoop is much faster than Spark
False. Spark keeps data in memory, so it is much faster than disk-based Hadoop
What are the steps of the predictive modeling pipeline?
Prediction target definition -> cohort construction -> feature construction -> feature selection -> predictive model building -> performance evaluation
Prediction Target should be both ____ and ____
interesting and possible
Cohort Construction
Defining the study population
Prospective vs. Retrospective
Prospective - identify cohort then collect data
Retrospective - Retrieve historical data then identify cohort
T/F: A prospective study has more noise in the data than a retrospective study
False. Retrospective study has more noise in historical data
T/F: A prospective study is more expensive than a retrospective study
True. The data collection has to be pre-planned for the study
T/F: A prospective study takes more time than a retrospective study
True. The data collection has to be planned and executed before analysis of the data.
T/F: A prospective study more commonly involves a larger dataset than a retrospective study.
False. A retrospective study more often involves a large dataset because historical data can be accessed more easily
Cohort Study
The goal is to select a group of patients who are exposed to a risk.
Example: Target is heart failure readmission. The Cohort contains all HF patients discharged from hospital. The key in a cohort study is to define the right inclusion/exclusion criteria.
Case-Control Study
Identify two sets of patients - cases and controls. Put the case patients and control patients together to define the cohort.
Case in Case-Control study
Patients with positive outcome (have disease)
Control in Case-Control study
Patients with negative outcomes (healthy) but otherwise similar to the case patients
Feature Construction Goal
Construct all potentially relevant features about patients in order to predict the target outcome
Example components of a Feature Construction pipeline
Index date, observation window (before the index date, where features are constructed), and prediction window (after the index date, where the target outcome is observed)
Which observation/prediction window setup is easiest to model?
Large observation window and short prediction window
Which observation/prediction window setup is most useful?
Small observation window and large prediction window. This is the most useful model but most likely unrealistic and difficult
Which curve is the best model (performance vs. prediction window)?
Curve B, because it can predict accurately for a longer period of time while the performance of the other models drops quickly
What is the optimal observation window length (performance vs. observation window)?
C, 630 days. The performance plateaus beyond that point. There is a trade-off between how long the observation window is and how many patients have enough data for the longer window.
Goal of Feature Selection
Find the truly predictive features to be included in the model.
T/F: Training error is not very useful
True. Models tend to overfit the training data, so training error underestimates the true (generalization) error
Leave one out cross validation
Hold out one example at a time as the validation set and train on the remaining examples. Repeat for every example. The final performance is the average predictive performance across all iterations
K Fold Cross Validation
The dataset is split into K chunks; each chunk is used once as the validation set while the remaining K-1 chunks are used for training
Randomized Cross Validation
Randomly split the dataset into train and validation sets. The model is fit to the training data and accuracy is assessed on the validation set. Results are averaged over all the splits. Advantage over K-fold: the proportion of the train/validation split does not depend on the number of folds. Disadvantage: some observations may never be selected into the validation set.
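A minimal sketch of the three schemes with scikit-learn; the logistic regression model and the synthetic dataset are illustrative assumptions, not from the cards:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (KFold, LeaveOneOut, ShuffleSplit,
                                     cross_val_score)

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

# Leave-one-out: n iterations, each with a single held-out example.
loo = cross_val_score(model, X, y, cv=LeaveOneOut())
# K-fold: dataset split into K chunks, each used once for validation.
kfold = cross_val_score(model, X, y,
                        cv=KFold(n_splits=5, shuffle=True, random_state=0))
# Randomized: repeated random splits; the split ratio is independent
# of the number of repetitions.
rand = cross_val_score(model, X, y,
                       cv=ShuffleSplit(n_splits=10, test_size=0.2, random_state=0))

print(loo.mean(), kfold.mean(), rand.mean())
```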
What is hadoop mapreduce?
- Programming Model
- Execution Environment - Hadoop is the open-source Java implementation
- Software package - tools developed to facilitate data science tasks
Hadoop provides what capabilities
- Distributed Storage - file system
- Distributed Computation - mapReduce
- Fault Tolerance - for sys failures
Computational Process of Hadoop
Partition the data across machines -> run map tasks locally on each partition -> shuffle/sort the intermediate key-value pairs by key -> run reduce tasks to aggregate the results
Fundamental pattern of writing algorithm using Hadoop is to specify algorithm as ____
aggregation statistics
First stage of MapReduce System
Map - each mapper processes input records and emits intermediate key-value pairs
Second stage of MapReduce System
Shuffle/Sort - intermediate pairs are grouped by key and routed to the reducers
Final Stage of MapReduce System
Reduce - each reducer aggregates the values for its key and writes the output
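The three stages above, simulated in plain Python as a hedged sketch (a real Hadoop job distributes them across machines; the patient/diagnosis-code records are made up):

```python
from collections import defaultdict

records = [("p1", ["E11", "I10"]), ("p2", ["I10"]), ("p3", ["E11", "E11"])]

# Map stage: emit (key, value) pairs from each input record.
def map_fn(patient, codes):
    for code in codes:
        yield (code, 1)

mapped = [pair for pat, codes in records for pair in map_fn(pat, codes)]

# Shuffle/sort stage: group intermediate values by key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce stage: aggregate the values for each key.
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # {'E11': 3, 'I10': 2}
```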
In what way is MapReduce designed to minimize re-computation
When a component fails only the specific component is re-computed
What is HDFS?
The back-end file system to store all the data to process using the MapReduce paradigm.
What are limitations of MapReduce?
- Cannot directly access data (must use map/reduce and aggregation query)
- Logistic Regression is not easy to implement in MapReduce due to the iterative batch gradient descent approach: every iteration launches a new MapReduce job and reloads the data from disk
MapReduce KNN
Map phase: compute the distance from the query point to every training point and emit (distance, label) pairs. Reduce phase: keep the k smallest distances and output the majority label (a sketch follows below)
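A minimal sketch of that map/reduce k-NN pattern in plain Python; the toy points, labels, and the Euclidean distance choice are assumptions for illustration:

```python
import heapq
from collections import Counter

train = [((0.0, 0.0), "A"), ((1.0, 0.9), "B"),
         ((0.2, 0.1), "A"), ((0.9, 1.1), "B")]
query, k = (0.1, 0.2), 3

# Map: emit (distance to query, label) for every training point.
def map_fn(point, label):
    dist = sum((p - q) ** 2 for p, q in zip(point, query)) ** 0.5
    return (dist, label)

mapped = [map_fn(point, label) for point, label in train]

# Reduce: keep the k nearest neighbors and take a majority vote.
nearest = heapq.nsmallest(k, mapped)
prediction = Counter(label for _, label in nearest).most_common(1)[0][0]
print(prediction)  # 'A' - two of the three nearest points are labeled A
```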
True Positive
Prediction Outcome Positive & Condition Positive
False Positive
Prediction Outcome Positive & Condition Negative
False Negative
Prediction Outcome Negative & Condition Positive
True Negative
Prediction Outcome Negative & Condition Negative
Type I Error
False Positive
Type II Error
False Negative
Accuracy
(TP + TN) / Total Population
True Positive Rate
TP / (TP + FN)
False Positive Rate
FP / (FP + TN)
False Negative Rate
FN / (TP + FN)
True Negative Rate
TN / (FP + TN)
Sensitivity
TP / (TP + FN)
Recall
TP / (TP + FN)
Specificity
TN / (FP + TN)
Prevalence
Condition Positive (TP + FN) / Total Population
Positive Predictive Value
TP / (TP + FP)
False Discovery Rate
FP / (TP + FP)
False Omission Rate
FN / (FN + TN)
Negative Predictive Value
TN / (FN + TN)
F1 Score
2 * [ (Precision * Recall) / (Precision + Recall) ]
Harmonic mean of Precision and Recall
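A small sketch that computes the metrics above from raw confusion-matrix counts; the example counts are arbitrary:

```python
def confusion_metrics(tp, fp, fn, tn):
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)          # positive predictive value
    recall = tp / (tp + fn)             # sensitivity / true positive rate
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": recall,
        "specificity": tn / (fp + tn),  # true negative rate
        "fpr": fp / (fp + tn),
        "precision": precision,
        "prevalence": (tp + fn) / total,
        "f1": 2 * precision * recall / (precision + recall),
    }

print(confusion_metrics(tp=40, fp=10, fn=5, tn=45))
```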
What does the ROC curve do?
Illustrates overall performance of a classifier when varying the threshold value
What is the AUC?
A performance metric that does not depend on threshold value
Regression Metrics (MSE & MAE)
MSE (mean squared error) = (1/n) * sum((y_i - yhat_i)^2); MAE (mean absolute error) = (1/n) * sum(|y_i - yhat_i|)
Regression Metric that can be used across datasets
R^2 (coefficient of determination) - it is scale-independent, so values are comparable across datasets
Gradient Descent Method
Iteratively update parameters in the direction of the negative gradient: theta <- theta - eta * grad J(theta), where eta is the learning rate
Gradient Descent Method for Linear Regression
theta_j <- theta_j - eta * sum_i (theta^T x^(i) - y^(i)) * x_j^(i), summing the gradient over all training examples each iteration
Stochastic Gradient Descent
Update the parameters using the gradient of the loss on a single randomly chosen example (or a small batch) instead of the full dataset
SGD for Linear Regression
theta_j <- theta_j - eta * (theta^T x^(i) - y^(i)) * x_j^(i) for one random example i at a time (a sketch follows below)
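A minimal sketch of that SGD update on a toy 1-D linear regression; the learning rate, epoch count, and synthetic data are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 1))
y = 3.0 * X[:, 0] + 1.0 + rng.normal(0, 0.1, size=100)

Xb = np.hstack([np.ones((100, 1)), X])   # prepend a bias column
theta = np.zeros(2)
eta = 0.1

for epoch in range(50):
    for i in rng.permutation(100):       # one example at a time
        error = Xb[i] @ theta - y[i]
        theta -= eta * error * Xb[i]     # theta <- theta - eta * (h - y) * x

print(theta)  # approximately [1.0, 3.0]
```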
Steps of Ensemble Methods
- Generate a set of datasets (independently in bagging or sequentially in boosting)
- Each dataset is used to train a separate model (can be independently trained models)
- Aggregation function F (avg or weighted avg)
Bias Variance Tradeoff
Expected error decomposes into bias^2 + variance + irreducible noise. Simple models have high bias and low variance; complex models have low bias and high variance. Ensembles aim to reduce one without increasing the other
Bagging
Take repeated samples of a dataset to create subsamples (with replacement), train separate models, then classify data point by taking majority vote of the models
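A hedged sketch of bagging with scikit-learn's BaggingClassifier (its default base learner is a decision tree); the synthetic data and 50-estimator choice are assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=500, random_state=0)

# 50 trees, each trained on a bootstrap sample (drawn with replacement);
# the final prediction is the majority vote of the trees.
bag = BaggingClassifier(n_estimators=50, bootstrap=True, random_state=0)
bag.fit(X, y)
print(bag.predict(X[:5]))
```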
Random Forest
- Build many simple decision trees (on bootstrap samples, with random feature subsets) and average their predictions
- Simple trees keep the computational cost low
- The ensemble of simple trees works better than a single complex tree
Why does bagging work?
Reduces variance without increasing Bias
Boosting
Incrementally build models one at a time
Each subsequent model focuses on the mistakes (misclassified examples) of the previous ones
Repeat the process over and over
The final model is a weighted average of all the models
(May perform better than bagging but is more likely to overfit)
Pros of Ensemble Methods
Better predictive performance than any single model; bagging reduces variance and boosting reduces bias
Cons of Ensemble Methods
Harder to interpret than a single model and more expensive to train
Computational Phenotyping
Converting raw data (demographics, dx, meds, labs) into medical concepts (phenotypes, e.g. Type 1 diabetes)
Applications of Phenotyping
- Genomic Studies
- Clinical Predictive Modeling
- Pragmatic Clinical Trials
- Healthcare Quality Measurements
Genome-Wide Association Study (GWAS)
Approach that involves scanning biomarkers from single-nucleotide polymorphisms (SNPs) in the DNA of many people to find genetic associations for specific phenotypes
How to run a Genome-Wide Association Study
- Identify the phenotypes
- Group patients into case and control
- Get DNA samples from all patients
- For each SNP:
- Compute the frequency of the SNP in cases and in controls
- Compute the odds ratio
- Compute the p-value; if the p-value is small, conclude the SNP is significant (a sketch follows below)
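A hedged sketch of the per-SNP computation using a 2x2 table of made-up allele counts; scipy's chi-square test stands in for whatever significance test the course used:

```python
from scipy.stats import chi2_contingency

#                risk allele, other allele
table = [[120, 80],    # cases
         [90, 110]]    # controls

# Odds ratio from the 2x2 table: (a*d) / (b*c).
odds_ratio = (table[0][0] * table[1][1]) / (table[0][1] * table[1][0])
chi2, p_value, dof, expected = chi2_contingency(table)

print(f"OR = {odds_ratio:.2f}, p = {p_value:.4f}")
# A small p-value suggests the SNP is significantly associated
# with the phenotype.
```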
Why do we care about phenotyping?
Rich and deep phenotypic data is needed to analyze genomic data. Cost of phenotypic data is increasing while cost of genomic data is decreasing
Clinical Predictive Modeling
Raw EHR data -> predictive modeling algorithm -> model
Why is predictive modeling from raw data not ideal?
- noise in raw data
- complex/high-dimensional raw data
- model is tied to raw data so cannot be adapted from one hospital to another
Pragmatic Clinical Trials differ from traditional trials how?
They measure the effectiveness of a treatment in routine clinical practice (real-world data) rather than efficacy under ideal, tightly controlled conditions
Main phenotyping methods
- Supervised learning
a. expert-defined rules (most popular)
b. classification models (need a large amount of labeled training data)
- Unsupervised learning (clustering) - takes less manual effort but has no ground truth
a. dimensionality reduction
b. tensor factorization
Which phenotyping approach requires more human effort during evaluation?
1. expert-defined rules
2. classification models
classification models
Which phenotyping approach is easier to interpret?
1. expert-defined rules
2. classification models
expert-defined rules because they use clinical intuition and knowledge.
Healthcare Applications of clustering
- Patient stratification - group patients into clusters
- Disease Hierarchy discovery - learning hierarchy between diseases and how they relate
- phenotyping - data to concepts
Classical Clustering Algorithms include
K-Means, Hierarchical Clustering, Gaussian Mixture Modeling
Scalable Clustering Algorithms include
Mini batch K-Means, DBScan
K-Means Algorithm
Initialize k centers; assign each point to its nearest center; recompute each center as the mean of its assigned points; repeat until the assignments stop changing (a sketch follows below)
K-Means Complexity
n * k * d * i - assigning n data points to k centers in d dimensions, run for i iterations
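A minimal NumPy sketch of the algorithm above (empty-cluster handling omitted); the two-blob data is illustrative:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: nearest center for every point (n * k * d work).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its assigned points.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
print(kmeans(X, k=2)[0])
```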
Hierarchical Clustering
Build a hierarchy (dendrogram) of clusters, either bottom-up (agglomerative: start with every point as its own cluster and merge) or top-down (divisive: start with one cluster and split)
Which Hierarchical Clustering approach is more efficient? Agglomerative or Divisive?
Agglomerative
Algorithm for Agglomerative Clustering
Start with each point as its own cluster; repeatedly merge the two closest clusters (by a linkage criterion such as single, complete, or average linkage) until one cluster or the desired number remains
Example of a soft clustering method
Gaussian Mixture Model (GMM)
What is the equation for the Gaussian Mixture Model (GMM)?
p(x) = sum_k pi_k * N(x | mu_k, sigma_k) - a weighted sum of K Gaussian components, where the mixing coefficients pi_k sum to 1
GMM uses which popular optimization strategy?
Expectation Maximization (EM)
GMM Expectation Maximization Algorithm
Initialize pi_k, mu_k, sigma_k; alternate the E-step (compute responsibilities) and M-step (re-estimate parameters) until the log-likelihood converges
GMM Initialization
Need to initialize mixing coefficient (pi_k), centers (mu_k), and variance (sigma_k)
- good initialization can lead to faster convergence
- bad initialization can fail to converge
How to improve GMM initialization
Use the k-means result to initialize the GMM: mu_k = the center from the k-means result, sigma_k = the covariance computed from the data points in cluster k, and pi_k = (size of cluster k) / (n data points)
GMM Expectation Step
Compute the responsibility of component k for point x_n: gamma_nk = pi_k * N(x_n | mu_k, sigma_k) / sum_j pi_j * N(x_n | mu_j, sigma_j)
GMM Maximization Step
Re-estimate the parameters from the responsibilities: N_k = sum_n gamma_nk; mu_k = (1/N_k) * sum_n gamma_nk * x_n; sigma_k from the responsibility-weighted spread of points around mu_k; pi_k = N_k / n (a sketch follows below)
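A hedged sketch of the E and M steps for a 1-D GMM; the two-component data and fixed iteration count are assumptions:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, 100), rng.normal(5, 1, 100)])
K = 2
pi, mu, sigma = np.full(K, 1 / K), np.array([0.0, 1.0]), np.ones(K)

for _ in range(50):
    # E-step: responsibility gamma[n, k] of component k for point x_n.
    dens = np.stack([pi[k] * norm.pdf(X, mu[k], sigma[k]) for k in range(K)],
                    axis=1)
    gamma = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate pi_k, mu_k, sigma_k (sigma is a std. deviation here).
    Nk = gamma.sum(axis=0)
    mu = (gamma * X[:, None]).sum(axis=0) / Nk
    sigma = np.sqrt((gamma * (X[:, None] - mu) ** 2).sum(axis=0) / Nk)
    pi = Nk / len(X)

print(mu)  # roughly [0, 5]
```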
K-Means vs. GMM (clustering, parameters, algorithm)
K-means: hard cluster assignments; parameters are the centers mu_k; fit with the iterative assign/update algorithm. GMM: soft (probabilistic) assignments; parameters are pi_k, mu_k, sigma_k; fit with EM
Mini Batch K-Means Benefit
With large dataset, it allows streaming data and using mini-batches rather than the full dataset
Mini Batch K-Means Algorithm
Repeatedly sample a mini-batch of b points, assign each to its nearest center, then update each center with a per-center learning rate that decays as more points are assigned to it
Mini batch K-Means:
How to assign points in M (mini-batch) to current center in C
Save center in hash map to retrieve quickly
Mini batch K-Means:
How to update C center based on assignments in M (mini batch)
center update becomes smaller and smaller (stabilizes) as the step size decreases
Mini batch K-Means Complexity
t * b * k * d - t iterations, mini-batch size b, k centers, d dimensions
DBScan
Density-Based Spatial Clustering of Applications with Noise
- clusters defined as areas of high density separated by areas of low density
T/F: The clusters found by DBScan are typically oval shaped
False - they can be any shape
DBScan Key Concepts
- Density = # of points within epsilon to point p
- Point is in a dense region if density is greater than a threshold
- Core - Points in the dense region
- Border - Points within epsilon distance from a core point
- Noise - Points outside of epsilon distance to a core point
DBScan Algorithm
For each unvisited point, find its epsilon-neighborhood; if it contains at least the threshold number of points, start a new cluster and grow it by recursively adding all density-reachable points; points not reachable from any core point are labeled noise
How many clusters can a datapoint belong to using the DBScan algorithm?
0 or 1
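A minimal sketch with scikit-learn's DBSCAN; eps and min_samples are illustrative and must be tuned per dataset:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(5, 0.3, (50, 2)),
               rng.uniform(-2, 7, (5, 2))])   # a few scattered noise points

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
# Each point gets exactly one cluster label; noise points are labeled -1.
print(set(labels))
```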
Clustering Evaluation Metrics
- Rand Index - requires ground truth
- Mutual Information - requires ground truth
- Silhouette Coefficient - no ground truth required
Rand Index (RI)
RI = (# of agreeing pairs) / (# of all pairs): the fraction of point pairs on which the clustering and the ground truth agree (same cluster in both, or different clusters in both)
Mutual Information
Measures the mutual dependency of two random variables (from information theory).
Pros/Cons of Rand Index and Mutual Information
Pros: simple to interpret; normalized versions fall in [0, 1]. Cons: both require ground-truth labels, which are rarely available in practice
Silhouette Coefficient
For each point, s = (b - a) / max(a, b), where a is the mean distance to points in its own cluster and b is the mean distance to points in the nearest other cluster; averaged over all points, ranges from -1 to 1
Pros/Cons of Silhouette Coefficient
Pro: does not require ground truth; higher scores indicate denser, better-separated clusters. Con: favors convex clusters, so it can be misleading for density-based clusters like those from DBScan
Which better supports iterative ML Algorithms: MapReduce or Spark
Spark
___ is based on acyclic data flow from stable storage to stable storage
Hadoop
Limitations of Hadoop
Inefficient for applications that repeatedly reuse a working set of data:
- Iterative Algorithms (ML, Graph Analysis)
- Interactive Data Mining Tools (R, Python)
Problem with iteration in Hadoop Map-Reduce
- Iteratively reload the data over and over
- Redundantly save output between stages
These steps result in repeated reading/writing from disk
Key objective for supporting iterative algorithms is what?
Keep working set in memory to perform quick operations.
Load the dataset once into distributed memory
What is the challenge to keeping data in memory?
Designing a distributed memory abstraction that is both fault-tolerant and efficient
Resilient Distributed Datasets (RDDs)
Balance between granularity of the computation and the efficiency for enabling fault tolerance. Help to provide fault-tolerant and efficient solution
RDDs lead to efficient fault recovery, how?
Using lineage.
Root RDD -> transformation applied -> derived RDD generated
Log one operation to apply to many elements
Recompute lost partitions on failure
No cost if nothing fails (everything is in memory)
RDD Recovery
Can selectively reload the failed input data to refresh memory or iteration data and run that subset of the data to recover.
Spark Stack
Spark Core (RDDs) at the base, with Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (graph processing) built on top
Spark Programming Interface
A driver program coordinates workers; RDDs are manipulated through lazy transformations (map, filter, join) and actions (count, collect, save), with APIs in Scala, Java, and Python
Map vs Flatmap
map = one output element per input element, so a function returning lists yields a list of lists
flatMap = the results are flattened into a single list (see the sketch below)
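A minimal PySpark sketch of the difference, assuming a local Spark session is available:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
rdd = spark.sparkContext.parallelize(["a b", "c d e"])

print(rdd.map(lambda s: s.split()).collect())
# [['a', 'b'], ['c', 'd', 'e']]  <- map: one output list per input

print(rdd.flatMap(lambda s: s.split()).collect())
# ['a', 'b', 'c', 'd', 'e']      <- flatMap: flattened into one list
```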
Cost of RDD Transformations (union, intersection, map, subtract)
-union - cheap
-intersection - expensive (sorting)
-map - cheap
-subtract - expensive (distinct elements and set difference operation)
Spark Operations
Transformations (lazy; build the lineage graph): map, filter, groupBy, join. Actions (trigger computation and return results): count, collect, reduce, save
Shared Variable in Spark
Broadcast Variable - Allows the program to efficiently send a large, read-only value to all worker nodes
Fault Tolerance of Spark
RDDs track lineage information that can be used to efficiently reconstruct lost partitions. Can be used to recompute efficiently in the event of a loss
Health Data Standards
- LOINC (Logical Observation Identifiers Names and Codes) for labs
- ICD (International Classification of Diseases) for dx
- CPT (Current Procedural Terminology) for procedures
- NDC (National Drug Code) for meds
Most popular medical ontology
SNOMED - Systematized Nomenclature of Medicine
UMLS
Unified Medical Language System
ICD Codes
- Maintained by the WHO
- Categorizes diseases
- Current version: ICD-10
- Covers dx and procedures
ICD 9 Codes
- [E/V/n x x] . [ y1 y2]
a. x - category (17 categories + supplemental categories)
b. y1- subcategory
c. y2 - subclassification
3 to 5 digits
ICD 10 Codes
- [ x x x ] . [ y1 y2 y3 y4]
a - x - category
b - y1 - etiology
c - y2 - body part
d - y3 - severity/vital details
e - y4 - extension
ICD9 to ICD10 Mapping
One-to-many relationships (ICD10 is more specific) but may be one-to-one occasionally
CPT
Current Procedural Terminology - medical/surgical/diagnostic services.
Maintained by the AMA
Used by insurance to determine how much to pay
- Category I (5 digits)
- Category II (4 digits + F for quality metrics)
- Category III (4 digits + T for experimental use)
LOINC
A standard for lab and clinical observation created by Regenstrief Institute
- Used to capture lab tests
NDC
National Drug Code
Medication Standard maintained by FDA
Used through drug supply chain to track medications
3 parts
company/labeler - product code - package code
SNOMED
Comprehensive, multi-lingual clinical healthcare terminology
Maintained by IHTSDO (a non-profit based in Denmark)
Encode health information and support effective clinical recording of data
Purpose of SNOMED: improve clinical documentation, support semantic interoperability, enable clinical decision support and data retrieval
Logical Model of SNOMED CT
Concepts (unique meanings with numeric IDs), Descriptions (human-readable terms for each concept), and Relationships (e.g. is-a links) connecting concepts
UMLS
Unified Medical Language System
Maintained by national library of medicine
Integrates all data standards
Software tools to map data to medical concepts
3 knowledge sources:
- Metathesaurus (concepts)
- Semantic Network
- SPECIALIST Lexicon and tools
PageRank
- Algorithm developed for ranking of web pages
- Nodes with more incoming edges are higher ranked and more important: PR(v) = (1 - d)/N + d * sum over nodes u linking to v of PR(u) / out-degree(u)
MapReduce PageRank
Iterate map and reduce phases over the graph until the ranks converge
MapReduce PageRank - Map Phase
Each node emits its current rank divided by its out-degree to every neighbor it links to (and passes the graph structure along)
MapReduce PageRank - Reduce Phase
Each node sums the contributions it receives and applies the damping factor to produce its new rank (a sketch follows below)
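A hedged sketch of those two phases in plain Python on a made-up three-node graph; the damping factor d = 0.85 is the conventional choice:

```python
from collections import defaultdict

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
rank = {node: 1 / len(graph) for node in graph}
d, N = 0.85, len(graph)

for _ in range(20):
    # Map: each node sends rank/out-degree to each of its neighbors.
    contributions = defaultdict(float)
    for node, neighbors in graph.items():
        for nb in neighbors:
            contributions[nb] += rank[node] / len(neighbors)
    # Reduce: sum incoming contributions and apply the damping factor.
    rank = {node: (1 - d) / N + d * contributions[node] for node in graph}

print(rank)  # C ranks highest: it has the most incoming edges
```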
Spectral Clustering
- Construct a graph (patient vectors)
- Create a similarity graph of patients
- Store graph as matrix (adjacency matrix)
- Find the top K eigenvectors of the graph Laplacian
- Cluster into K Groups of patients using eigenvectors
Similarity Graph Construction
Epsilon-Neighborhood Graph
- Connect patients within epsilon distance to each other
Fully Connected Graph
Similarity function w is Gaussian kernel (or radial basis function) . Use fully connected graph but parameterize edges differently (edge weights different)
Singular Value Decomposition (SVD)
Factorize A (m x n) as A = U * S * V^T: the columns of U are the left singular vectors, S is diagonal with the singular values, and the columns of V are the right singular vectors
Singular Value Decomposition Example
(see the NumPy sketch below)
SVD Properties
U and V have orthonormal columns; singular values are non-negative and sorted in decreasing order; keeping the top k singular values/vectors gives the best rank-k approximation of A
Principal Component Analysis (PCA)
Take the SVD of the mean-centered data matrix; the right singular vectors are the principal components (directions of maximum variance) and the data is projected onto the top components
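A minimal NumPy sketch of SVD, a low-rank approximation, and PCA via SVD on a small random matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))

# SVD: A = U @ diag(s) @ Vt, singular values sorted in decreasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.allclose(A, U @ np.diag(s) @ Vt))  # True

# Best rank-2 approximation: keep the top 2 singular values/vectors.
A2 = U[:, :2] @ np.diag(s[:2]) @ Vt[:2]

# PCA: SVD of the mean-centered data; rows of Vt are principal components.
Ac = A - A.mean(axis=0)
_, s_c, Vt_c = np.linalg.svd(Ac, full_matrices=False)
scores = Ac @ Vt_c[:2].T   # project the data onto the top 2 components
```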
Sparsity problem with SVD
SVD destroys sparsity in the original data
CUR vs SVD
CUR maintains the sparsity of the original data
CUR Decomposition
Use actual rows and columns to form factorization matrices
CUR Algorithm
Sample actual columns of A (with probability proportional to their squared norms) to form C, sample rows similarly to form R, and compute U from the (pseudo-inverse of the) intersection of the sampled rows and columns so that A ≈ C * U * R
Tensor Factorization
Generalization of matrix factorization to data with more than two modes (e.g. a patient x diagnosis x medication tensor)
Rank 1 Tensor
Outer product of a set of vectors (1 from each mode)
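A minimal NumPy sketch of a rank-1 three-mode tensor; the patient/diagnosis/medication interpretation of the modes follows the phenotyping cards, and the vectors are made up:

```python
import numpy as np

a = np.array([0.5, 0.5])        # patient mode
b = np.array([1.0, 0.0, 0.0])   # diagnosis mode
c = np.array([0.0, 1.0])        # medication mode

# Outer product: T[i, j, k] = a[i] * b[j] * c[k]
T = np.einsum("i,j,k->ijk", a, b, c)
print(T.shape)  # (2, 3, 2)
```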
Example Phenotype
Phenotyping through Tensor Factorization
Factorize tensor as sum of Rank 1 Tensors (Rank 1 Approximation of input).
Lambda corresponds to the importance of the phenotype
Canonical Decomposition / Parallel Factor Analysis (CP Decomposition)
Approximate a tensor as a weighted sum of R rank-1 tensors: X ≈ sum_r lambda_r * (a_r outer b_r outer c_r)
Phenotyping Process Using Tensor Factorization
Build a patient x diagnosis x medication tensor from EHR data, apply CP decomposition, and interpret each rank-1 component as a candidate phenotype (lambda_r gives its importance)
Tensor Factorization vs Nonnegative Matrix Factorization (NMF)
Tensor-derived phenotypes are much more concise than NMF-derived phenotypes
Benefits for Tensor Factorizations
- Unsupervised - Multiple phenotypes can be discovered
- Predictive - phenotypes can be used for predictive modeling
Traditional Paradigm (Evidence-based medicine)
Medical decisions are based on well-designed and well-conducted research.
- Randomized clinical trials to test hypothesis
- Successful hypothesis becomes evidence
- Evidence becomes clinical guidelines
- Clinicians apply guidelines in practice
New Paradigm (precision medicine)
- Pragmatic trials (Data-driven evidence)
- patient similarity search (practice based evidence)
- individualized recommendations
- precision medicine
Randomized Clinical Trials (RCT)
Start w/ study population
Two groups - current treatment (control/placebo) and new treatment group
RCT compares groups for improved outcomes in treatment group
Pragmatic Trials
- Measure effectiveness of treatment in routine clinical practice
- Do similarity search with patients related to current patient
- Look at patient outcome and recommend treatment with best outcome for similar patients
Using patient similarity
- practice-based medicine (look for similar patients)
- hypothesis with retrospective evidence (other patients)
- randomized clinical trials (prospective study)
- evidence generation
- clinical guidelines
- apply in practice
Patient Similarity Approaches
Distance Metric Learning - learn a distance metric so that patients with the same ground-truth label are close together and patients with different labels are far apart.
Graph-based similarity learning - connect patients to regions of a disease network and measure similarity through the graph.
Locally Supervised Metric Learning
Sigmoid Function
sigmoid(x) = 1 / (1 + e^-x), squashes the input to (0, 1)
Activation Function
A nonlinear function applied to a neuron's weighted input sum; without it, stacked layers would collapse into a single linear model
Tanh Function
tanh(x) = (e^x - e^-x) / (e^x + e^-x), squashes the input to (-1, 1)
Rectified Linear Function (ReLU)
ReLU(x) = max(0, x)
Stochastic Gradient Descent
Update parameters using the gradient of the loss on a single randomly chosen example (or a small mini-batch) rather than the full dataset
Forward Computation for a neuron
Weighted sum of inputs plus bias, passed through the activation: z = sum_i w_i * x_i + b, then a = f(z)
Backward computation for a neuron
Given dL/da from the next layer, apply the chain rule: dL/dz = dL/da * f'(z); then dL/dw_i = dL/dz * x_i and dL/dx_i = dL/dz * w_i (a sketch follows below)
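A minimal NumPy sketch of the forward and backward formulas for a single sigmoid neuron; the inputs, weights, and squared-loss gradient are arbitrary:

```python
import numpy as np

x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.2, -0.3])
b = 0.05

# Forward: weighted sum, then activation.
z = w @ x + b
a = 1 / (1 + np.exp(-z))           # sigmoid

# Backward: given dL/da from the next layer, apply the chain rule.
dL_da = a - 1.0                    # e.g. squared-loss gradient, target 1
dL_dz = dL_da * a * (1 - a)        # sigmoid'(z) = a * (1 - a)
dL_dw = dL_dz * x                  # gradient for each weight
dL_dx = dL_dz * w                  # gradient passed to earlier layers
print(dL_dw)
```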
Advantage of CNN model
sparse interactions, parameter sharing, and translational invariance
Dimension Calculation of Convolution Layers
Output width = (W - F + 2P)/S + 1 (W = input width, F = filter size, P = padding, S = stride); same formula for height; output depth = number of filters
Dimension Calculation of Pooling Layer
Output width = (W - F)/S + 1 (typically no padding); depth is unchanged
Number of calculations for Convolution Layers
Roughly H_out * W_out * D_out * (F * F * D_in) multiply-adds - one dot product with the filter per output element
Number of calculations for Pooling Layers
Roughly H_out * W_out * D * (F * F) comparisons; pooling has no parameters
Number of calculations for fully-connected layers
Roughly N_in * N_out multiply-adds (and N_in * N_out + N_out parameters) - see the sketch below
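A small sketch of these formulas as arithmetic; the 32x32x3 input, 5x5 filters, and layer sizes are made-up examples:

```python
def conv_output(w, f, p, s):
    """Output width/height of a conv or pooling layer."""
    return (w - f + 2 * p) // s + 1

w_out = conv_output(w=32, f=5, p=2, s=1)       # 32: "same" padding
print(w_out)

# Multiply-adds for that conv layer with 16 filters over 3 input channels:
ops = w_out * w_out * 16 * (5 * 5 * 3)
print(ops)  # 32 * 32 * 16 * 75 = 1,228,800

# Fully-connected layer: parameters ~ inputs * outputs (plus biases).
print(1024 * 10 + 10)
```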
T/F: Most of the parameters in a model are in the Convolution layer but most of the operations (calculations) are in the fully-connected layer
False. It is the reverse: most of the parameters are in the fully-connected layers, while most of the operations are in the convolution layers
Problem with RNN
- The gradient can become very small (vanish) over a long sequence
- A standard RNN therefore has difficulty remembering state from early in the sequence