Final Exam Flashcards
The Four V’s
Volume - the sheer scale of the data
Variety - many different types and sources of data
Velocity - the speed at which new data arrives
Veracity - uncertainty in the data: a lot of noise/false alarms
What makes predictive modeling difficult?
- Millions of patients to analyze - dx, rx, etc.
- Many models to be built
Computational Phenotyping
Raw data (demo, dx, rx, labs) -> phenotypes
Patient Similarity
Simulate doctor’s case-based reasoning with algorithms
Hadoop
Distributed disk-based big data system
Spark
Distributed in-memory big data system
T/F: Hadoop is much faster than Spark
False. Spark keeps data in memory, so it is much faster than disk-based Hadoop
What are the steps of the predictive modeling pipeline?
Prediction target definition -> cohort construction -> feature construction -> feature selection -> predictive model building -> performance evaluation
Prediction Target should be both ____ and ____
interesting and possible
Cohort Construction
Defining the study population
Prospective vs. Retrospective
Prospective - identify cohort then collect data
Retrospective - Retrieve historical data then identify cohort
T/F: A prospective study has more noise in the data than a retrospective study
False. Retrospective study has more noise in historical data
T/F: A prospective study is more expensive than a retrospective study
True. The data collection has to be pre-planned for the study
T/F: A prospective study takes more time than a retrospective study
True. The data collection has to be planned and executed before analysis of the data.
T/F: A prospective study more commonly involves a larger dataset than a retrospective study.
False. A retrospective study more often involves a large dataset because historical data can be accessed more easily
Cohort Study
The goal is to select a group of patients who are exposed to a risk.
Example: Target is heart failure readmission. The Cohort contains all HF patients discharged from hospital. The key in a cohort study is to define the right inclusion/exclusion criteria.
Case-Control Study
Identify two sets of patients - cases and controls. Put the case patients and control patients together to define the cohort.
Case in Case-Control study
Patients with positive outcome (have disease)
Control in Case-Control study
Patients with negative outcomes (healthy) but otherwise similar to the case patients
Feature Construction Goal
Construct all potentially relevant features about patients in order to predict the target outcome
Example components of a Feature Construction pipeline
Index date, observation window (before the index date, where features are constructed), and prediction window (after the index date, where the target outcome is observed)
Which observation/prediction window setup is easiest to model?
Large observation window and short prediction window
Which observation/prediction window setup is most useful?
Small observation window and large prediction window. This is the most useful model but most likely unrealistic and difficult
Which curve is the best model (performance vs. prediction window)?
Curve B, because it can predict accurately for a longer period of time while the performance of the other models drops quickly
What is the optimal observation window length (performance vs. observation window)?
C, 630 days. The performance plateaus beyond that point. There is a trade-off between how long the observation window is and how many patients have enough data for the longer window.
Goal of Feature Selection
Find the truly predictive features to be included in the model.
T/F: Training error is not very useful
True. Models tend to overfit the training data, so training error underestimates the true (generalization) error
Leave one out cross validation
Hold out one example at a time as the validation set and train on the remaining examples. Repeat for every example. The final performance is the average predictive performance across all iterations
K Fold Cross Validation
The dataset is split into K chunks; each chunk is used once as the validation set while the remaining K-1 chunks are used for training
Randomized Cross Validation
Randomly split the dataset into train and validation sets. The model is fit to the training data and accuracy is assessed on the validation set. Results are averaged over all the splits. Advantage over K-fold: the proportion of the train/validation split does not depend on the number of folds. Disadvantage: some observations may never be selected into the validation set.
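A minimal sketch of the three schemes with scikit-learn; the logistic regression model and the synthetic dataset are illustrative assumptions, not from the cards:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (KFold, LeaveOneOut, ShuffleSplit,
                                     cross_val_score)

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

# Leave-one-out: n iterations, each with a single held-out example.
loo = cross_val_score(model, X, y, cv=LeaveOneOut())
# K-fold: dataset split into K chunks, each used once for validation.
kfold = cross_val_score(model, X, y,
                        cv=KFold(n_splits=5, shuffle=True, random_state=0))
# Randomized: repeated random splits; the split ratio is independent
# of the number of repetitions.
rand = cross_val_score(model, X, y,
                       cv=ShuffleSplit(n_splits=10, test_size=0.2, random_state=0))

print(loo.mean(), kfold.mean(), rand.mean())
```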
What is hadoop mapreduce?
- Programming Model
- Execution Environment - Hadoop is the open-source Java implementation
- Software package - tools developed to facilitate data science tasks
Hadoop provides what capabilities
- Distributed Storage - file system
- Distributed Computation - mapReduce
- Fault Tolerance - for sys failures
Computational Process of Hadoop
Partition the data across machines -> run map tasks locally on each partition -> shuffle/sort the intermediate key-value pairs by key -> run reduce tasks to aggregate the results
Fundamental pattern of writing algorithm using Hadoop is to specify algorithm as ____
aggregation statistics
First stage of MapReduce System
Map - each mapper processes input records and emits intermediate key-value pairs
Second stage of MapReduce System
Shuffle/Sort - intermediate pairs are grouped by key and routed to the reducers
Final Stage of MapReduce System
Reduce - each reducer aggregates the values for its key and writes the output
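The three stages above, simulated in plain Python as a hedged sketch (a real Hadoop job distributes them across machines; the patient/diagnosis-code records are made up):

```python
from collections import defaultdict

records = [("p1", ["E11", "I10"]), ("p2", ["I10"]), ("p3", ["E11", "E11"])]

# Map stage: emit (key, value) pairs from each input record.
def map_fn(patient, codes):
    for code in codes:
        yield (code, 1)

mapped = [pair for pat, codes in records for pair in map_fn(pat, codes)]

# Shuffle/sort stage: group intermediate values by key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce stage: aggregate the values for each key.
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # {'E11': 3, 'I10': 2}
```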
In what way is MapReduce designed to minimize re-computation
When a component fails only the specific component is re-computed
What is HDFS?
The back-end file system to store all the data to process using the MapReduce paradigm.
What are limitations of MapReduce?
- Cannot directly access data (must use map/reduce and aggregation query)
- Logistic Regression is not easy to implement in MapReduce due to the iterative batch gradient descent approach: every iteration launches a new MapReduce job and reloads the data from disk
MapReduce KNN
Map phase: compute the distance from the query point to every training point and emit (distance, label) pairs. Reduce phase: keep the k smallest distances and output the majority label (a sketch follows below)
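A minimal sketch of that map/reduce k-NN pattern in plain Python; the toy points, labels, and the Euclidean distance choice are assumptions for illustration:

```python
import heapq
from collections import Counter

train = [((0.0, 0.0), "A"), ((1.0, 0.9), "B"),
         ((0.2, 0.1), "A"), ((0.9, 1.1), "B")]
query, k = (0.1, 0.2), 3

# Map: emit (distance to query, label) for every training point.
def map_fn(point, label):
    dist = sum((p - q) ** 2 for p, q in zip(point, query)) ** 0.5
    return (dist, label)

mapped = [map_fn(point, label) for point, label in train]

# Reduce: keep the k nearest neighbors and take a majority vote.
nearest = heapq.nsmallest(k, mapped)
prediction = Counter(label for _, label in nearest).most_common(1)[0][0]
print(prediction)  # 'A' - two of the three nearest points are labeled A
```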
True Positive
Prediction Outcome Positive & Condition Positive
False Positive
Prediction Outcome Positive & Condition Negative
False Negative
Prediction Outcome Negative & Condition Positive
True Negative
Prediction Outcome Negative & Condition Negative
Type I Error
False Positive
Type II Error
False Negative
Accuracy
(TP + TN) / Total Population
True Positive Rate
TP / (TP + FN)
False Positive Rate
FP / (FP + TN)
False Negative Rate
FN / (TP + FN)
True Negative Rate
TN / (FP + TN)
Sensitivity
TP / (TP + FN)
Recall
TP / (TP + FN)
Specificity
TN / (FP + TN)
Prevalence
Condition Positive (TP + FN) / Total Population
Positive Predictive Value
TP / (TP + FP)
False Discovery Rate
FP / (TP + FP)
False Omission Rate
FN / (FN + TN)
Negative Predictive Value
TN / (FN + TN)
F1 Score
2 * [ (Precision * Recall) / (Precision + Recall) ]
Harmonic mean of Precision and Recall
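A small sketch that computes the metrics above from raw confusion-matrix counts; the example counts are arbitrary:

```python
def confusion_metrics(tp, fp, fn, tn):
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)          # positive predictive value
    recall = tp / (tp + fn)             # sensitivity / true positive rate
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": recall,
        "specificity": tn / (fp + tn),  # true negative rate
        "fpr": fp / (fp + tn),
        "precision": precision,
        "prevalence": (tp + fn) / total,
        "f1": 2 * precision * recall / (precision + recall),
    }

print(confusion_metrics(tp=40, fp=10, fn=5, tn=45))
```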
What does the ROC curve do?
Illustrates overall performance of a classifier when varying the threshold value
What is the AUC?
A performance metric that does not depend on threshold value
Regression Metrics (MSE & MAE)
MSE (mean squared error) = (1/n) * sum((y_i - yhat_i)^2); MAE (mean absolute error) = (1/n) * sum(|y_i - yhat_i|)
Regression Metric that can be used across datasets
R^2 (coefficient of determination) - it is scale-independent, so values are comparable across datasets
Gradient Descent Method
Iteratively update parameters in the direction of the negative gradient: theta <- theta - eta * grad J(theta), where eta is the learning rate
Gradient Descent Method for Linear Regression
theta_j <- theta_j - eta * sum_i (theta^T x^(i) - y^(i)) * x_j^(i), summing the gradient over all training examples each iteration
Stochastic Gradient Descent
Update the parameters using the gradient of the loss on a single randomly chosen example (or a small batch) instead of the full dataset
SGD for Linear Regression
theta_j <- theta_j - eta * (theta^T x^(i) - y^(i)) * x_j^(i) for one random example i at a time (a sketch follows below)
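A minimal sketch of that SGD update on a toy 1-D linear regression; the learning rate, epoch count, and synthetic data are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 1))
y = 3.0 * X[:, 0] + 1.0 + rng.normal(0, 0.1, size=100)

Xb = np.hstack([np.ones((100, 1)), X])   # prepend a bias column
theta = np.zeros(2)
eta = 0.1

for epoch in range(50):
    for i in rng.permutation(100):       # one example at a time
        error = Xb[i] @ theta - y[i]
        theta -= eta * error * Xb[i]     # theta <- theta - eta * (h - y) * x

print(theta)  # approximately [1.0, 3.0]
```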
Steps of Ensemble Methods
- Generate a set of datasets (independently in bagging or sequentially in boosting)
- Each dataset is used to train a separate model (can be independently trained models)
- Aggregation function F (avg or weighted avg)
Bias Variance Tradeoff
Expected error decomposes into bias^2 + variance + irreducible noise. Simple models have high bias and low variance; complex models have low bias and high variance. Ensembles aim to reduce one without increasing the other
Bagging
Take repeated samples of a dataset to create subsamples (with replacement), train separate models, then classify data point by taking majority vote of the models
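A hedged sketch of bagging with scikit-learn's BaggingClassifier (its default base learner is a decision tree); the synthetic data and 50-estimator choice are assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=500, random_state=0)

# 50 trees, each trained on a bootstrap sample (drawn with replacement);
# the final prediction is the majority vote of the trees.
bag = BaggingClassifier(n_estimators=50, bootstrap=True, random_state=0)
bag.fit(X, y)
print(bag.predict(X[:5]))
```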
Random Forest
- Build many simple decision trees (on bootstrap samples, with random feature subsets) and average their predictions
- Simple trees keep the computational cost low
- The ensemble of simple trees works better than a single complex tree
Why does bagging work?
Reduces variance without increasing Bias
Boosting
Incrementally build models one at a time
Each subsequent model focuses on the mistakes (misclassified examples) of the previous ones
Repeat the process over and over
The final model is a weighted average of all the models
(May perform better than bagging but is more likely to overfit)
Pros of Ensemble Methods
Better predictive performance than any single model; bagging reduces variance and boosting reduces bias
Cons of Ensemble Methods
Harder to interpret than a single model and more expensive to train
Computational Phenotyping
Converting raw data (demographics, dx, meds, labs) into medical concepts (phenotypes, e.g. Type 1 diabetes)
Applications of Phenotyping
- Genomic Studies
- Clinical Predictive Modeling
- Pragmatic Clinical Trials
- Healthcare Quality Measurements
Genome-Wide Association Study (GWAS)
Approach that involves scanning biomarkers from single-nucleotide polymorphisms (SNPs) in the DNA of many people to find genetic associations for specific phenotypes
How to run a Genome-Wide Association Study
- Identify the phenotypes
- Group patients into case and control
- Get DNA samples from all patients
- For each SNP:
- Compute the frequency of the SNP in cases and in controls
- Compute the odds ratio
- Compute the p-value; if the p-value is small, conclude the SNP is significant (a sketch follows below)
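A hedged sketch of the per-SNP computation using a 2x2 table of made-up allele counts; scipy's chi-square test stands in for whatever significance test the course used:

```python
from scipy.stats import chi2_contingency

#                risk allele, other allele
table = [[120, 80],    # cases
         [90, 110]]    # controls

# Odds ratio from the 2x2 table: (a*d) / (b*c).
odds_ratio = (table[0][0] * table[1][1]) / (table[0][1] * table[1][0])
chi2, p_value, dof, expected = chi2_contingency(table)

print(f"OR = {odds_ratio:.2f}, p = {p_value:.4f}")
# A small p-value suggests the SNP is significantly associated
# with the phenotype.
```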
Why do we care about phenotyping?
Rich and deep phenotypic data is needed to analyze genomic data. Cost of phenotypic data is increasing while cost of genomic data is decreasing
Clinical Predictive Modeling
Raw EHR data -> predictive modeling algorithm -> model
Why is predictive modeling from raw data not ideal?
- noise in raw data
- complex/high-dimensional raw data
- model is tied to raw data so cannot be adapted from one hospital to another
Pragmatic Clinical Trials differ from traditional trials how?
They measure the effectiveness of a treatment in routine clinical practice (real-world data) rather than efficacy under ideal, tightly controlled conditions
Main phenotyping methods
- Supervised learning
a. expert-defined rules (most popular)
b. classification models (need a large amount of labeled training data)
- Unsupervised learning (clustering) - takes less manual effort but has no ground truth
a. dimensionality reduction
b. tensor factorization
Which phenotyping approach requires more human effort during evaluation?
1. expert-defined rules
2. classification models
classification models
Which phenotyping approach is easier to interpret?
1. expert-defined rules
2. classification models
expert-defined rules because they use clinical intuition and knowledge.
Healthcare Applications of clustering
- Patient stratification - group patients into clusters
- Disease Hierarchy discovery - learning hierarchy between diseases and how they relate
- phenotyping - data to concepts
Classical Clustering Algorithms include
K-Means, Hierarchical Clustering, Gaussian Mixture Modeling
Scalable Clustering Algorithms include
Mini batch K-Means, DBScan
K-Means Algorithm
Initialize k centers; assign each point to its nearest center; recompute each center as the mean of its assigned points; repeat until the assignments stop changing (a sketch follows below)
K-Means Complexity
n * k * d * i - assigning n data points to k centers in d dimensions, run for i iterations
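A minimal NumPy sketch of the algorithm above (empty-cluster handling omitted); the two-blob data is illustrative:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: nearest center for every point (n * k * d work).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its assigned points.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
print(kmeans(X, k=2)[0])
```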
Hierarchical Clustering
Build a hierarchy (dendrogram) of clusters, either bottom-up (agglomerative: start with every point as its own cluster and merge) or top-down (divisive: start with one cluster and split)
Which Hierarchical Clustering approach is more efficient? Agglomerative or Divisive?
Agglomerative
Algorithm for Agglomerative Clustering
Start with each point as its own cluster; repeatedly merge the two closest clusters (by a linkage criterion such as single, complete, or average linkage) until one cluster or the desired number remains
Example of a soft clustering method
Gaussian Mixture Model (GMM)
What is the equation for the Gaussian Mixture Model (GMM)?
p(x) = sum_k pi_k * N(x | mu_k, sigma_k) - a weighted sum of K Gaussian components, where the mixing coefficients pi_k sum to 1
GMM uses which popular optimization strategy?
Expectation Maximization (EM)
GMM Expectation Maximization Algorithm
Initialize pi_k, mu_k, sigma_k; alternate the E-step (compute responsibilities) and M-step (re-estimate parameters) until the log-likelihood converges
GMM Initialization
Need to initialize mixing coefficient (pi_k), centers (mu_k), and variance (sigma_k)
- good initialization can lead to faster convergence
- bad initialization can fail to converge
How to improve GMM initialization
Use the k-means result to initialize the GMM: mu_k = the center from the k-means result, sigma_k = the covariance computed from the data points in cluster k, and pi_k = (size of cluster k) / (n data points)
GMM Expectation Step
Compute the responsibility of component k for point x_n: gamma_nk = pi_k * N(x_n | mu_k, sigma_k) / sum_j pi_j * N(x_n | mu_j, sigma_j)
GMM Maximization Step
Re-estimate the parameters from the responsibilities: N_k = sum_n gamma_nk; mu_k = (1/N_k) * sum_n gamma_nk * x_n; sigma_k from the responsibility-weighted spread of points around mu_k; pi_k = N_k / n (a sketch follows below)
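A hedged sketch of the E and M steps for a 1-D GMM; the two-component data and fixed iteration count are assumptions:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, 100), rng.normal(5, 1, 100)])
K = 2
pi, mu, sigma = np.full(K, 1 / K), np.array([0.0, 1.0]), np.ones(K)

for _ in range(50):
    # E-step: responsibility gamma[n, k] of component k for point x_n.
    dens = np.stack([pi[k] * norm.pdf(X, mu[k], sigma[k]) for k in range(K)],
                    axis=1)
    gamma = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate pi_k, mu_k, sigma_k (sigma is a std. deviation here).
    Nk = gamma.sum(axis=0)
    mu = (gamma * X[:, None]).sum(axis=0) / Nk
    sigma = np.sqrt((gamma * (X[:, None] - mu) ** 2).sum(axis=0) / Nk)
    pi = Nk / len(X)

print(mu)  # roughly [0, 5]
```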
K-Means vs. GMM (clustering, parameters, algorithm)
K-means: hard cluster assignments; parameters are the centers mu_k; fit with the iterative assign/update algorithm. GMM: soft (probabilistic) assignments; parameters are pi_k, mu_k, sigma_k; fit with EM
Mini Batch K-Means Benefit
With large dataset, it allows streaming data and using mini-batches rather than the full dataset
Mini Batch K-Means Algorithm
Repeatedly sample a mini-batch of b points, assign each to its nearest center, then update each center with a per-center learning rate that decays as more points are assigned to it
Mini batch K-Means:
How to assign points in M (mini-batch) to current center in C
Save center in hash map to retrieve quickly
Mini batch K-Means:
How to update C center based on assignments in M (mini batch)
center update becomes smaller and smaller (stabilizes) as the step size decreases
Mini batch K-Means Complexity
t * b * k * d - t iterations, mini-batch size b, k centers, d dimensions
DBScan
Density-Based Spatial Clustering of Applications with Noise
- clusters defined as areas of high density separated by areas of low density
T/F: The clusters found by DBScan are typically oval shaped
False - they can be any shape
DBScan Key Concepts
- Density = # of points within epsilon to point p
- Point is in a dense region if density is greater than a threshold
- Core - Points in the dense region
- Border - Points within epsilon distance from a core point
- Noise - Points outside of epsilon distance to a core point
DBScan Algorithm
For each unvisited point, find its epsilon-neighborhood; if it contains at least the threshold number of points, start a new cluster and grow it by recursively adding all density-reachable points; points not reachable from any core point are labeled noise
How many clusters can a datapoint belong to using the DBScan algorithm?
0 or 1
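A minimal sketch with scikit-learn's DBSCAN; eps and min_samples are illustrative and must be tuned per dataset:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(5, 0.3, (50, 2)),
               rng.uniform(-2, 7, (5, 2))])   # a few scattered noise points

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
# Each point gets exactly one cluster label; noise points are labeled -1.
print(set(labels))
```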
Clustering Evaluation Metrics
- Rand Index - requires ground truth
- Mutual Information - requires ground truth
- Silhouette Coefficient - no ground truth required
Rand Index (RI)
RI = (# of agreeing pairs) / (# of all pairs): the fraction of point pairs on which the clustering and the ground truth agree (same cluster in both, or different clusters in both)
Mutual Information
Measures the mutual dependency of two random variables (from information theory).
Pros/Cons of Rand Index and Mutual Information
Pros: simple to interpret; normalized versions fall in [0, 1]. Cons: both require ground-truth labels, which are rarely available in practice
Silhouette Coefficient
For each point, s = (b - a) / max(a, b), where a is the mean distance to points in its own cluster and b is the mean distance to points in the nearest other cluster; averaged over all points, ranges from -1 to 1
Pros/Cons of Silhouette Coefficient
Pro: does not require ground truth; higher scores indicate denser, better-separated clusters. Con: favors convex clusters, so it can be misleading for density-based clusters like those from DBScan
Which better supports iterative ML Algorithms: MapReduce or Spark
Spark
___ is based on acyclic data flow from stable storage to stable storage
Hadoop
Limitations of Hadoop
Inefficient for applications that repeatedly reuse a working set of data:
- Iterative Algorithms (ML, Graph Analysis)
- Interactive Data Mining Tools (R, Python)
Problem with iteration in Hadoop Map-Reduce
- Iteratively reload the data over and over
- Redundantly save output between stages
These steps result in repeated reading/writing from disk
Key objective for supporting iterative algorithms is what?
Keep working set in memory to perform quick operations.
Load the dataset once into distributed memory
What is the challenge to keeping data in memory?
Designing a distributed memory abstraction that is both fault-tolerant and efficient
Resilient Distributed Datasets (RDDs)
Balance between granularity of the computation and the efficiency for enabling fault tolerance. Help to provide fault-tolerant and efficient solution
RDDs lead to efficient fault recovery, how?
Using lineage.
Root RDD -> transformation applied -> derived RDD generated
Log one operation to apply to many elements
Recompute lost partitions on failure
No cost if nothing fails (everything is in memory)
RDD Recovery
Can selectively reload the failed input data to refresh memory or iteration data and run that subset of the data to recover.
Spark Stack
Spark Core (RDDs) at the base, with Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (graph processing) built on top
Spark Programming Interface
A driver program coordinates workers; RDDs are manipulated through lazy transformations (map, filter, join) and actions (count, collect, save), with APIs in Scala, Java, and Python
Map vs Flatmap
map = one output element per input element, so a function returning lists yields a list of lists
flatMap = the results are flattened into a single list (see the sketch below)
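A minimal PySpark sketch of the difference, assuming a local Spark session is available:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
rdd = spark.sparkContext.parallelize(["a b", "c d e"])

print(rdd.map(lambda s: s.split()).collect())
# [['a', 'b'], ['c', 'd', 'e']]  <- map: one output list per input

print(rdd.flatMap(lambda s: s.split()).collect())
# ['a', 'b', 'c', 'd', 'e']      <- flatMap: flattened into one list
```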
Cost of RDD Transformations (union, intersection, map, subtract)
-union - cheap
-intersection - expensive (sorting)
-map - cheap
-subtract - expensive (distinct elements and set difference operation)
Spark Operations
Transformations (lazy; build the lineage graph): map, filter, groupBy, join. Actions (trigger computation and return results): count, collect, reduce, save
Shared Variable in Spark
Broadcast Variable - Allows the program to efficiently send a large, read-only value to all worker nodes
Fault Tolerance of Spark
RDDs track lineage information that can be used to efficiently reconstruct lost partitions. Can be used to recompute efficiently in the event of a loss
Health Data Standards
- LOINC (Logical Observation Identifiers Names and Codes) for labs
- ICD (International Classification of Diseases) for dx
- CPT (Current Procedural Terminology) for procedures
- NDC (National Drug Code) for meds
Most popular medical ontology
SNOMED - Systematized Nomenclature of Medicine
UMLS
Unified Medical Language System
ICD Codes
- Maintained by the WHO
- Categorizes diseases
- Current version: ICD-10
- Covers dx and procedures
ICD 9 Codes
- [E/V/n x x] . [ y1 y2]
a. x - category (17 categories + supplemental categories)
b. y1- subcategory
c. y2 - subclassification
3 to 5 digits
ICD 10 Codes
- [ x x x ] . [ y1 y2 y3 y4]
a - x - category
b - y1 - etiology
c - y2 - body part
d - y3 - severity/vital details
e - y4 - extension
ICD9 to ICD10 Mapping
One-to-many relationships (ICD10 is more specific) but may be one-to-one occasionally
CPT
Current Procedural Terminology - medical/surgical/diagnostic services.
Maintained by the AMA
Used by insurance to determine how much to pay
- Category I (5 digits)
- Category II (4 digits + F for quality metrics)
- Category III (4 digits + T for experimental use)
LOINC
A standard for lab and clinical observation created by Regenstrief Institute
- Used to capture lab tests
NDC
National Drug Code
Medication Standard maintained by FDA
Used through drug supply chain to track medications
3 parts
company/labeler - product code - package code
SNOMED
Comprehensive, multi-lingual clinical healthcare terminology
Maintained by IHTSDO (a non-profit based in Denmark)
Encode health information and support effective clinical recording of data
Purpose of SNOMED: improve clinical documentation, support semantic interoperability, enable clinical decision support and data retrieval
Logical Model of SNOMED CT
Concepts (unique meanings with numeric IDs), Descriptions (human-readable terms for each concept), and Relationships (e.g. is-a links) connecting concepts
UMLS
Unified Medical Language System
Maintained by national library of medicine
Integrates all data standards
Software tools to map data to medical concepts
3 knowledge sources:
- Metathesaurus (concepts)
- Semantic Network
- SPECIALIST Lexicon and tools
PageRank
- Algorithm developed for ranking of web pages
- Nodes with more incoming edges are higher ranked and more important: PR(v) = (1 - d)/N + d * sum over nodes u linking to v of PR(u) / out-degree(u)
MapReduce PageRank
Iterate map and reduce phases over the graph until the ranks converge
MapReduce PageRank - Map Phase
Each node emits its current rank divided by its out-degree to every neighbor it links to (and passes the graph structure along)
MapReduce PageRank - Reduce Phase
Each node sums the contributions it receives and applies the damping factor to produce its new rank (a sketch follows below)
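A hedged sketch of those two phases in plain Python on a made-up three-node graph; the damping factor d = 0.85 is the conventional choice:

```python
from collections import defaultdict

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
rank = {node: 1 / len(graph) for node in graph}
d, N = 0.85, len(graph)

for _ in range(20):
    # Map: each node sends rank/out-degree to each of its neighbors.
    contributions = defaultdict(float)
    for node, neighbors in graph.items():
        for nb in neighbors:
            contributions[nb] += rank[node] / len(neighbors)
    # Reduce: sum incoming contributions and apply the damping factor.
    rank = {node: (1 - d) / N + d * contributions[node] for node in graph}

print(rank)  # C ranks highest: it has the most incoming edges
```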
Spectral Clustering
- Construct a graph (patient vectors)
- Create a similarity graph of patients
- Store graph as matrix (adjacency matrix)
- Find the top K eigenvectors of the graph Laplacian
- Cluster into K Groups of patients using eigenvectors
Similarity Graph Construction
Epsilon-Neighborhood Graph
- Connect patients within epsilon distance to each other
Fully Connected Graph
Similarity function w is Gaussian kernel (or radial basis function) . Use fully connected graph but parameterize edges differently (edge weights different)
Singular Value Decomposition (SVD)
Factorize A (m x n) as A = U * S * V^T: the columns of U are the left singular vectors, S is diagonal with the singular values, and the columns of V are the right singular vectors
Singular Value Decomposition Example
(see the NumPy sketch below)
SVD Properties
U and V have orthonormal columns; singular values are non-negative and sorted in decreasing order; keeping the top k singular values/vectors gives the best rank-k approximation of A
Principal Component Analysis (PCA)
Take the SVD of the mean-centered data matrix; the right singular vectors are the principal components (directions of maximum variance) and the data is projected onto the top components
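A minimal NumPy sketch of SVD, a low-rank approximation, and PCA via SVD on a small random matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))

# SVD: A = U @ diag(s) @ Vt, singular values sorted in decreasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.allclose(A, U @ np.diag(s) @ Vt))  # True

# Best rank-2 approximation: keep the top 2 singular values/vectors.
A2 = U[:, :2] @ np.diag(s[:2]) @ Vt[:2]

# PCA: SVD of the mean-centered data; rows of Vt are principal components.
Ac = A - A.mean(axis=0)
_, s_c, Vt_c = np.linalg.svd(Ac, full_matrices=False)
scores = Ac @ Vt_c[:2].T   # project the data onto the top 2 components
```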
Sparsity problem with SVD
SVD destroys sparsity in the original data
CUR vs SVD
CUR maintains the sparsity of the original data
CUR Decomposition
Use actual rows and columns to form factorization matrices
CUR Algorithm
Sample actual columns of A (with probability proportional to their squared norms) to form C, sample rows similarly to form R, and compute U from the (pseudo-inverse of the) intersection of the sampled rows and columns so that A ≈ C * U * R
Tensor Factorization
Generalization of matrix factorization to data with more than two modes (e.g. a patient x diagnosis x medication tensor)
Rank 1 Tensor
Outer product of a set of vectors (1 from each mode)
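A minimal NumPy sketch of a rank-1 three-mode tensor; the patient/diagnosis/medication interpretation of the modes follows the phenotyping cards, and the vectors are made up:

```python
import numpy as np

a = np.array([0.5, 0.5])        # patient mode
b = np.array([1.0, 0.0, 0.0])   # diagnosis mode
c = np.array([0.0, 1.0])        # medication mode

# Outer product: T[i, j, k] = a[i] * b[j] * c[k]
T = np.einsum("i,j,k->ijk", a, b, c)
print(T.shape)  # (2, 3, 2)
```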
Example Phenotype
Phenotyping through Tensor Factorization
Factorize tensor as sum of Rank 1 Tensors (Rank 1 Approximation of input).
Lambda corresponds to the importance of the phenotype
Canonical Decomposition / Parallel Factor Analysis (CP Decomposition)
Approximate a tensor as a weighted sum of R rank-1 tensors: X ≈ sum_r lambda_r * (a_r outer b_r outer c_r)
Phenotyping Process Using Tensor Factorization
Build a patient x diagnosis x medication tensor from EHR data, apply CP decomposition, and interpret each rank-1 component as a candidate phenotype (lambda_r gives its importance)
Tensor Factorization vs Nonnegative Matrix Factorization (NMF)
Tensor-derived phenotypes are much more concise than NMF-derived phenotypes
Benefits for Tensor Factorizations
- Unsupervised - Multiple phenotypes can be discovered
- Predictive - phenotypes can be used for predictive modeling
Traditional Paradigm (Evidence-based medicine)
Medical decisions are based on well-designed and well-conducted research.
- Randomized clinical trials to test hypothesis
- Successful hypothesis becomes evidence
- Evidence becomes clinical guidelines
- Clinicians apply guidelines in practice
New Paradigm (precision medicine)
- Pragmatic trials (Data-driven evidence)
- patient similarity search (practice based evidence)
- individualized recommendations
- precision medicine
Randomized Clinical Trials (RCT)
Start w/ study population
Two groups - current treatment (control/placebo) and new treatment group
RCT compares groups for improved outcomes in treatment group
Pragmatic Trials
- Measure effectiveness of treatment in routine clinical practice
- Do similarity search with patients related to current patient
- Look at patient outcome and recommend treatment with best outcome for similar patients
Using patient similarity
- practice-based medicine (look for similar patients)
- hypothesis with retrospective evidence (other patients)
- randomized clinical trials (prospective study)
- evidence generation
- clinical guidelines
- apply in practice
Patient Similarity Approaches
Distance Metric Learning - learn a distance metric so that patients with the same ground-truth label are close together and patients with different labels are far apart.
Graph-based similarity learning - connect patients to regions of a disease network and measure similarity through the graph.
Locally Supervised Metric Learning
Sigmoid Function
sigmoid(x) = 1 / (1 + e^-x), squashes the input to (0, 1)
Activation Function
A nonlinear function applied to a neuron's weighted input sum; without it, stacked layers would collapse into a single linear model
Tanh Function
tanh(x) = (e^x - e^-x) / (e^x + e^-x), squashes the input to (-1, 1)
Rectified Linear Function (ReLU)
ReLU(x) = max(0, x)
Stochastic Gradient Descent
Update parameters using the gradient of the loss on a single randomly chosen example (or a small mini-batch) rather than the full dataset
Forward Computation for a neuron
Weighted sum of inputs plus bias, passed through the activation: z = sum_i w_i * x_i + b, then a = f(z)
Backward computation for a neuron
Given dL/da from the next layer, apply the chain rule: dL/dz = dL/da * f'(z); then dL/dw_i = dL/dz * x_i and dL/dx_i = dL/dz * w_i (a sketch follows below)
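A minimal NumPy sketch of the forward and backward formulas for a single sigmoid neuron; the inputs, weights, and squared-loss gradient are arbitrary:

```python
import numpy as np

x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.2, -0.3])
b = 0.05

# Forward: weighted sum, then activation.
z = w @ x + b
a = 1 / (1 + np.exp(-z))           # sigmoid

# Backward: given dL/da from the next layer, apply the chain rule.
dL_da = a - 1.0                    # e.g. squared-loss gradient, target 1
dL_dz = dL_da * a * (1 - a)        # sigmoid'(z) = a * (1 - a)
dL_dw = dL_dz * x                  # gradient for each weight
dL_dx = dL_dz * w                  # gradient passed to earlier layers
print(dL_dw)
```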
Advantage of CNN model
sparse interactions, parameter sharing, and translational invariance
Dimension Calculation of Convolution Layers
Output width = (W - F + 2P)/S + 1 (W = input width, F = filter size, P = padding, S = stride); same formula for height; output depth = number of filters
Dimension Calculation of Pooling Layer
Output width = (W - F)/S + 1 (typically no padding); depth is unchanged
Number of calculations for Convolution Layers
Roughly H_out * W_out * D_out * (F * F * D_in) multiply-adds - one dot product with the filter per output element
Number of calculations for Pooling Layers
Roughly H_out * W_out * D * (F * F) comparisons; pooling has no parameters
Number of calculations for fully-connected layers
Roughly N_in * N_out multiply-adds (and N_in * N_out + N_out parameters) - see the sketch below
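A small sketch of these formulas as arithmetic; the 32x32x3 input, 5x5 filters, and layer sizes are made-up examples:

```python
def conv_output(w, f, p, s):
    """Output width/height of a conv or pooling layer."""
    return (w - f + 2 * p) // s + 1

w_out = conv_output(w=32, f=5, p=2, s=1)       # 32: "same" padding
print(w_out)

# Multiply-adds for that conv layer with 16 filters over 3 input channels:
ops = w_out * w_out * 16 * (5 * 5 * 3)
print(ops)  # 32 * 32 * 16 * 75 = 1,228,800

# Fully-connected layer: parameters ~ inputs * outputs (plus biases).
print(1024 * 10 + 10)
```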
T/F: Most of the parameters in a model are in the Convolution layer but most of the operations (calculations) are in the fully-connected layer
False. It is the reverse: most of the parameters are in the fully-connected layers, while most of the operations are in the convolution layers
Problem with RNN
- The gradient can become very small (vanish) over a long sequence
- A standard RNN therefore has difficulty remembering state from early in the sequence