Domain 4: Data Governance & Data Analytics Flashcards

1
Q

What is the ACID test for reliable database (DB) transactions?

A

Atomicity - a transaction is indivisible (all of its steps complete or none do)

Consistency - every transaction takes the database from one valid state to another

Isolation - transactions running simultaneously do not interfere with each other

Durability - once a transaction is committed, its changes persist even after a crash or power loss

2
Q

What is included in the OSI 7-layer model?

A

<Host>
DATA
7. Application (e.g. POP3/IMAP4 for email, HTTP and HTTPS for web content and secure browsing, FTP for file transfer, SSH for secure remote login)

6. Presentation (Encryption, decryption, conversion to character sets like ASCII)

5. Session (Lightweight Directory Access Protocol = LDAP for authenticating users, SSL)

SEGMENTS
4. Transport (Transmission of segments via TCP, UDP)

<Media>
Packet
3. Network (transmission of packets via IP through routers; DHCP (used to assign IP addresses to hosts) and DNS support this layer but are themselves application-layer protocols)

Frame
2. Data Link (transmission of frames via Ethernet, PPP (Point-to-Point Protocol), ARP (Address Resolution Protocol))

Bit
1. Physical (transmission of binary bits via copper wire, coaxial or fiber optic cable)
</Media></Host>

3
Q

In MUMPS, global variables are referenced by what symbol?

A

Up arrow symbol (later became a caret “^”)

4
Q

In an event monitor, a high test result is a?

A

Condition

5
Q

What is evoking strength in INTERNIST-1/QMR

A

Positive predictive value (PPV): how strongly a finding suggests a given disease

6
Q

What is the observed value for an item in a ML data set?

A

Label

7
Q

What is an example of an expert system?

A

CADUCEUS

8
Q

Examples of reinforcement learning

A

Markov decision process (use with known model)
- Q-learning

Monte Carlo (use when one or more elements are unknown)

9
Q

What are measures of information retrieval success?

A

Precision (PPV) = % of returned documents that are relevant to the query

Recall (sensitivity) = % of all relevant documents in the corpus that were found

Fall-out (false positive rate) = % of irrelevant documents that are retrieved
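
The three measures can be computed from the four retrieval counts. A minimal sketch follows; the function name `retrieval_metrics` and the example counts are made up for illustration.

```python
def retrieval_metrics(tp, fp, fn, tn):
    """Return (precision, recall, fall_out) from retrieval counts.

    tp: relevant documents that were retrieved
    fp: irrelevant documents that were retrieved
    fn: relevant documents that were missed
    tn: irrelevant documents that were correctly left out
    """
    precision = tp / (tp + fp)   # share of returned documents that are relevant
    recall = tp / (tp + fn)      # share of relevant documents that were found
    fall_out = fp / (fp + tn)    # share of irrelevant documents that were retrieved
    return precision, recall, fall_out


# Hypothetical query: 40 documents returned, 30 of them relevant;
# the corpus holds 50 relevant and 950 irrelevant documents.
precision, recall, fall_out = retrieval_metrics(tp=30, fp=10, fn=20, tn=940)
print(precision, recall, round(fall_out, 4))
```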

10
Q

Difference between bias and variance

A

Bias = measure of inaccuracy
Variance = measure of imprecision

11
Q

What is the null error rate

A

Rate of being wrong if you ALWAYS pick the majority class.

Ex) If the majority class has 105 instances out of 165 total instances, null error rate = (165-105)/165 = 60/165 ≈ 36.4%
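
The same calculation as a small sketch. The helper name `null_error_rate` and the class labels are illustrative assumptions; the counts are the ones from the example.

```python
from collections import Counter


def null_error_rate(labels):
    """Error rate of a classifier that always predicts the majority class."""
    counts = Counter(labels)
    majority_count = max(counts.values())
    return (len(labels) - majority_count) / len(labels)


# 105 majority-class instances out of 165 total.
labels = ["yes"] * 105 + ["no"] * 60
rate = null_error_rate(labels)
print(round(rate, 3))  # 0.364
```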

12
Q

What evaluation methods suit models that screen for low-incidence conditions (imbalanced classes)?

A
  • Fbeta score
  • Matthews Correlation Coefficient
  • Stratified K-fold cross-validation
13
Q

Measures to evaluate regression methods (numerical output)

A
  • Root mean squared error (RMSE): lower = better fit
  • Correlation coefficient (r): strength of relationship between x and y on a scatter plot. Ranges from -1 to 1; no correlation is r = 0.
  • Coefficient of determination (r², "R squared"): Goodness of fit. Represents % of variation in y that is explained by variation in x. 100% = perfectly fit model.
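
A minimal sketch of the three regression measures in plain Python. The function names and the toy data are assumptions for illustration; R squared is computed here as 1 minus the ratio of residual to total sum of squares, i.e. the share of variation in y that the model explains.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: lower means a better fit."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def pearson_r(x, y):
    """Correlation coefficient between x and y (-1 to 1)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def r_squared(y_true, y_pred):
    """Share of the variation in y_true explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
perfect = [2 * v for v in x]   # predictions from the true line y = 2x
off = [2, 4, 6, 8, 9]          # one prediction misses by 1

print(rmse(y, perfect), pearson_r(x, y), r_squared(y, perfect))
print(round(rmse(y, off), 3), round(r_squared(y, off), 3))
```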
14
Q

Measures to evaluate classification methods (categorical output)

A
  • Confusion Matrix: Assessment of model’s “confusion” in analyzing instances (i.e., assigning instances to the wrong class or outcome)
  • Cohen’s Kappa: measure of how well the classifier performs as compared to random chance
  • Reliability diagram: Assesses calibration by plotting observed event frequency against predicted probability
15
Q

Measure used to evaluate both supervised & unsupervised models

A
  • Receiver Operating Characteristic (ROC) curve: Plot sensitivity against 1-specificity
  • Area Under the Curve (AUC): AKA concordance (c) statistic. AUC = 1 –> perfect discrimination.
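
The "concordance" reading of AUC can be sketched directly: it is the fraction of (positive, negative) pairs in which the positive instance received the higher score. The function name and the scores below are hypothetical; ties are counted as half, which is a common convention.

```python
def auc_concordance(scores, labels):
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    positives = [s for s, label in zip(scores, labels) if label == 1]
    negatives = [s for s, label in zip(scores, labels) if label == 0]
    concordant = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5
    return concordant / (len(positives) * len(negatives))


scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]   # model outputs
labels = [1, 1, 0, 1, 0, 0]               # observed outcomes
auc = auc_concordance(scores, labels)     # 8 of 9 pairs are ranked correctly
print(round(auc, 3))
```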
16
Q

What is the most important, time-consuming and expensive part of developing ML model

A

Gathering appropriate data (instances)

17
Q

Difference between validation and testing data set

A
  • Validation data set: used to evaluate the preliminary model while it is still being tuned
  • Testing data set: used for final evaluation of the model. No further change is anticipated.
18
Q

General rule for feature to instance ratio

A

Select <= 1 feature for each 10 instances in the development data set

High # of features –> overfitting

19
Q

Methods for feature selection

A
  • Forward selection (iterative inclusion)
  • Backward selection (iterative removal)
  • Stepwise selection (combination of forward & backward selection)
  • Forced inclusion
20
Q

Methods for reducing number of features when an unsupervised model is overfit

A

Dimensionality reduction method

21
Q

Methods for validating model when data (instance) is limited

A
  • K-fold cross-validation: training data randomly partitioned into equal # (k) subsamples (folds)
  • Leave One Out Cross-Validation (LOOCV): extreme case of k-fold
  • Bootstrapping
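
A minimal sketch of how k-fold partitioning might be done by hand. The function name `k_fold_indices` and the seed are assumptions; setting k equal to the number of instances gives LOOCV.

```python
import random


def k_fold_indices(n_instances, k, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)        # random partition
    folds = [indices[i::k] for i in range(k)]   # k near-equal folds
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


splits = list(k_fold_indices(n_instances=10, k=5))
for train, validation in splits:
    print(sorted(validation), "held out;", len(train), "used for training")
```

Each instance is held out exactly once, so every instance contributes to both training and validation across the k rounds.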
22
Q

Methods for optimizing (aka tuning) model

A
  • Tune hyperparameters
  • Dimensionality reduction
  • Regularization: force algorithm to build a less complex model (i.e. more generalizable –> less likely to overfit data)
23
Q

Types of deployed model

A
  • Static model (most common in medicine)
  • Incremental/continuous model
24
Q

ML algorithm that can be used for both supervised and unsupervised learning

A

Neural networks (aka connectionist systems)

25
ML algorithms used for supervised learning
  • Regression methods
  • Classification methods
  • Ensemble methods
26
ML algorithms used for unsupervised learning
  • Clustering methods
  • Association rules
  • Dimensionality reduction methods
27
How many hidden layers does a deep artificial neural network (ANN) usually have?
>3
28
Different types of ANN
  • Feed-forward network (unidirectional from input to output)
  • Multilayer perceptron (MLP). Perceptron: iterative algorithm that determines the best values for the coefficient vector
  • Convolutional neural network (CNN)
  • Recurrent neural network (RNN)
  • Generative adversarial network (GAN)
29
Common output of convolutional neural network (CNN)
Image classification and/or image feature selection (e.g. saliency map showing which features considered more relevant)
30
Common use of recurrent neural network (RNN)
Natural language processing (NLP)
31
Common use of generative adversarial network
Used to generate deepfake images and to simulate cat-and-mouse fraud schemes
32
How does generative adversarial network work?
A pair of deep learning neural networks trained in tandem over repeated rounds: the generator network produces synthetic examples and the discriminator network tries to tell them apart from real ones
33
What are examples of classification models
  • Logistic regression
  • Naive Bayes classifier
  • Support vector machines
  • k-nearest neighbor
  • Decision trees (most common method)
34
What do an internal node and a leaf node represent in a decision tree?
Internal node = 1 feature (independent variable)
Leaf node = outcome class (dependent variable)
35
Method for checking quality of decision tree model
  • Gini impurity (0 with single-class populations)
  • Entropy (high when there is a large number of evenly mixed classes)
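
Both measures depend only on the class proportions at a node. A minimal sketch, with function names chosen for illustration and entropy measured in bits (base-2 logarithm):

```python
import math
from collections import Counter


def class_proportions(labels):
    """Share of instances belonging to each class present in `labels`."""
    counts = Counter(labels)
    return [count / len(labels) for count in counts.values()]


def gini_impurity(labels):
    """1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Shannon entropy in bits."""
    return sum(p * math.log2(1 / p) for p in class_proportions(labels))


pure = ["a", "a", "a", "a"]     # single-class node
mixed_2 = ["a", "a", "b", "b"]  # two evenly mixed classes
mixed_4 = ["a", "b", "c", "d"]  # four evenly mixed classes

for node in (pure, mixed_2, mixed_4):
    print(gini_impurity(node), entropy(node))
```

Both values are 0 for the pure node and grow as more classes are mixed evenly.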
36
Types of ensemble methods
  • Parallel ensembles (use unweighted voting)
  • Sequential (series) ensembles (use weighted voting)
37
What are methods to create diversity in parallel ensembles to help decrease overfitting
  • Bagging (bootstrap aggregating): train each model in the ensemble on a bootstrap sample of the data
  • Random subspaces: use a random subset of features per model
  • Random forest: ensemble of randomly built decision trees that makes a "forest." Uses BOTH bagging and random subspaces.
38
How are sequential (series) ensembles constructed?
  • Combine constrained (weak learner) models into a single strong learner
  • Constructed using boosting methods, which use weighted voting
39
Difference between hard and soft clustering
  • Exclusive (hard) clustering: an instance can only belong to 1 cluster
  • Fuzzy (soft) clustering: an instance can have more than 1 cluster assignment
39
Types of boosting methods
  • AdaBoost (Adaptive Boosting): instances misclassified by each model have their weights increased for the next model
  • Gradient boosting: similar to AdaBoost, except each model trains on the residual errors of the previously run model
  • CatBoost
40
Types of clustering methods
  • Hierarchical clustering: instances grouped based on similarities and differences
  • Probabilistic clustering: instances clustered based on the probability that they belong to a particular distribution
  • K-means clustering: assigns instances to a manually defined number (K) of clusters based on similarity. Compute the distance from every instance to each centroid (center of a cluster) using a defined distance metric, assign each instance to the cluster with the closest centroid, recompute the centroids, and repeat until the centroids stop moving.
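
The K-means loop can be sketched in a few lines. This is an illustrative version, assuming Euclidean distance and caller-supplied starting centroids; the function name `k_means` and the toy points are made up.

```python
import math


def k_means(points, centroids, max_iterations=100):
    """Alternate assignment and centroid updates until the centroids stop moving."""
    centroids = list(centroids)
    clusters = []
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            distances = [math.dist(point, c) for c in centroids]
            clusters[distances.index(min(distances))].append(point)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(tuple(sum(axis) / len(cluster) for axis in zip(*cluster)))
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:     # centroids stopped moving
            break
        centroids = new_centroids
    return centroids, clusters


points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```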
41
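What is a silhouette coefficient?
Measure that compares cluster cohesion (a = mean distance from an instance to the others in its own cluster) with cluster separation (b = mean distance to the instances of the nearest other cluster): s = (b - a) / max(a, b)
  • 1: clusters well apart
  • 0: clusters indifferent (overlapping)
  • -1: instances assigned to the wrong cluster
Used for K-means clustering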
42
Types of association rules
  • Market basket analysis (e.g. customers who bought X also bought Y)
  • Apriori algorithm: uses a hash tree to count item sets, navigating through the data set in a breadth-first manner
43
Dimensionality reduction methods
  • Principal Component Analysis (principal component = axis through the data that is a function of contribution to variability in a population. Each principal component has to be orthogonal to all other principal components)
  • Singular value decomposition
  • Autoencoders
44
What is denormalization of data
Intentional duplication of data to improve database performance
45
3 Database integrity requirements
1. Entity integrity: every table in the database has a unique primary key
2. Referential integrity: whenever a database column refers to a row in another table, that row exists
3. Domain integrity: each column accepts only values from a specific list (domain) of acceptable values
46
What is F1 score and how is it calculated?
F-score (aka F-measure or F1 score) = harmonic mean of precision and recall
F1 = 2 x (Precision x Recall) / (Precision + Recall)
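
F1 is the beta = 1 case of the more general F-beta score. A small sketch with a hypothetical helper `f_beta` and made-up precision and recall values:

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta = 1 gives F1, beta > 1 weights recall more heavily."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


f1 = f_beta(0.75, 0.6)           # harmonic mean of 0.75 and 0.6
f2 = f_beta(0.75, 0.6, beta=2)   # leans toward recall
print(round(f1, 3), round(f2, 3))
```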
47
What is tokenization?
Process of breaking documents into searchable items. Tokenization must be done before items are placed into an index.
48
Which SQL keyword can be used to eliminate duplicate rows from a query result?
DISTINCT
49
Equation for P(A, B), the joint probability of events A and B occurring simultaneously
P(A, B) = P(A | B) x P(B) = P(B | A) x P(A)
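
Dividing the joint probability identity by P(B), assumed to be greater than 0, gives Bayes' theorem. A short derivation in LaTeX notation:

```latex
\begin{align}
P(A, B) &= P(A \mid B)\,P(B) = P(B \mid A)\,P(A) \\
P(A \mid B) &= \frac{P(B \mid A)\,P(A)}{P(B)}
\end{align}
```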
50
What is the DIKW Model
1. Data = Item or signal with little to no meaning by itself
2. Information = Data + Meaning
3. Knowledge = Patterns and relationships between pieces of info
4. Wisdom = Understanding and internalization of knowledge to apply it appropriately
51
Difference between calibration and discrimination as metrics for an expert system
Discrimination measures the ability to tell the difference between two sets of patients
Calibration measures the degree to which the system produces an accurate probability
52
What does a class, instance, attribute and method in OOP correspond to in RDBMS?
  • Class --> DB relation (table)
  • Instance --> Tuple (row)
  • Attribute --> Attribute (column)
  • Method (accessors, mutators) --> Database CRUD function
53
What is the difference between primary and foreign key?
Primary key: an attribute that uniquely identifies a tuple (row)
Foreign key: an attribute whose values must have matching values in the primary key of another table
54
Difference between inner, outer and cross join
Inner join
  • Only includes rows that match in both tables
  • Can also be written as a WHERE condition
Left outer join
  • Includes all rows in the left table (i.e. the primary table specified in "FROM"); columns from the right table are NULL (blank) where there is no match
Right outer join
  • Includes all rows in the right table; columns from the left table are NULL where there is no match
Full outer join
  • Includes all rows in both tables, with NULLs on whichever side has no match
Cartesian (cross) join
  • Gives the "cross product" of both tables (every row of one paired with every row of the other)
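
A runnable sketch of the inner, left outer and cross joins using Python's built-in sqlite3 module. The `patient` and `visit` tables and their rows are invented for illustration; right and full outer joins are left out because older SQLite versions do not support them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patient (patient_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE visit (visit_id INTEGER PRIMARY KEY, patient_id INTEGER, reason TEXT);
    INSERT INTO patient VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO visit VALUES (10, 1, 'flu'), (11, 1, 'rash'), (12, 2, 'cough');
""")

# Inner join: only patients that have a matching visit.
inner = conn.execute("""
    SELECT p.name, v.reason FROM patient p
    INNER JOIN visit v ON v.patient_id = p.patient_id
    ORDER BY v.visit_id
""").fetchall()

# Left outer join: every patient; reason is NULL (None) when there is no visit.
left = conn.execute("""
    SELECT p.name, v.reason FROM patient p
    LEFT OUTER JOIN visit v ON v.patient_id = p.patient_id
    ORDER BY p.patient_id, v.visit_id
""").fetchall()

# Cross join: every patient paired with every visit (3 x 3 rows).
cross = conn.execute(
    "SELECT p.name, v.reason FROM patient p CROSS JOIN visit v"
).fetchall()

print(inner)
print(left)
print(len(cross))
```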
55
What is the "LIKE" condition used for in SQL?
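  • Matches strings against a pattern
  • Case sensitivity of LIKE depends on the DBMS and collation (it is case sensitive in Oracle and PostgreSQL, for example)
  • UPPER([char]) and LOWER([char]) convert to all upper or lower case
Ex) LIKE 'B%' --> "%" matches a string of any length
Ex) LIKE 'Smith_' --> "_" must match a single character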
56
What are the strengths and limitations of a hierarchical database?
Strengths
  • Optimized for rapid transactions of hierarchical data
  • Makes it easy to know every attribute about one thing
Limitations
  • Can only traverse the tree from the root (top parent) node
  • Child nodes can only have 1 parent --> difficult to model relationships between child nodes
57
What is an example of hierarchical database?
MUMPS (MGH Utility Multi-Programming System)
58
Example of object oriented database management system
InterSystems Caché (OODBMS behind the Epic EHR)
59
Benefits of object oriented database management systems
  • Support for more data types (e.g. graphics, photo, video, webpages)
  • Usually integrated into the programming language, so accessing data doesn't require complex driver configuration
  • Most web app frameworks support interaction with OODBMS
60
Tools of Unified Modeling Language
  • Class diagram (describes object-oriented classes: class title, attributes, methods, inheritance)
  • Activity diagram (~process flowchart)
  • Use case diagram (describes actors, goals and dependencies)
  • Entity-Relationship (ER) diagram (describes objects and their relationships) --> used to define RDBMS logical schema
61
What level of normalization is considered to be sufficient to call a database "normalized"?
Third Normal Form (3NF)
62
Pros & Cons of normalized database
Pros
  • Reduces inconsistencies and dependencies in relational databases
  • Safe against most INSERT, UPDATE and DELETE anomalies
Cons
  • To generate a report you have to "denormalize" the data
  • Requires lots of PK, FK and JOIN logic in your queries
63
What is the ETL process
Extract, Transform, Load (ETL) process: gets transactional data into a format that can be integrated into a data warehouse for reporting/queries
64
What is TCP/IP
TCP = Transmission Control Protocol
IP = Internet Protocol
65
What is the difference between TCP & UDP (user datagram protocol)
  • TCP requires acknowledgment. UDP does not.
  • TCP guarantees sequence/order of packets. UDP does not.
  • TCP is used when packet loss is unacceptable
  • UDP is used where packet loss is less important (e.g. VoIP or streaming protocols)
66
What is CODEC?
Coder / Decoder - compression algorithm used for digital stream to transmit audio, video, etc. Can be "lossy" or "lossless."
67
Examples of short range wireless standards
Short-range PAN (personal area network):
  • RFID (one way) and NFC (two way)
  • IEEE 802.15: wireless PAN and derivatives (Bluetooth and infrared)
68
Examples of medium range wireless standards
Medium-range WLAN (wireless local area network):
  • 802.11b
  • 802.11g
  • 802.11n
  • 802.11ac
69
Long Range Wireless Standards
  • WiMAX
  • CDMA
  • 3G
  • 4G / 4G LTE
  • 5G
70
Advantages of Bluetooth Low Energy (aka bluetooth smart)
  • Low battery consumption
  • Limited need for data transfer
  • One-way communication in close proximity
  • BLE beacons broadcast packets of data at regular intervals; nearby devices pick them up through pre-installed apps or services
  • Popular uses: health and fitness, indoor navigation, proximity-based marketing
71
Requirement for first normal form DB
  • Each cell contains a single value
  • Each record is unique
72
Requirements for second normal form DB
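Table must be in 1NF and every non-key attribute must depend on the whole primary key (no partial dependencies). A table with a single-column, non-composite primary key satisfies this automatically.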
73
Requirements for third normal form DB
Table must meet all criteria for 1NF AND 2NF AND must have no "transitive functional dependencies," which means no non-key attribute may depend on another non-key attribute (changing one non-key value should not force a change to another non-key column).
74
What are the assumptions of Bayes' theorem and the corresponding limitations for diagnosis of disease?
  • Conditional independence of predictors --> findings in disease are usually not conditionally independent
  • Mutual exclusivity of conditions --> diseases may not be mutually exclusive
  • Calculation is not simple --> when there are multiple findings, computation becomes complex quickly
75
Difference between notifiable vs. reportable diseases in the context of public health reporting
Notifiable diseases must be reported to the CDC
Reportable diseases must be reported to states
76
Difference between validation & verification of software/systems
Validation
  • Extensive testing, including regression testing
  • Used on non-FDA-approved software/systems
Verification
  • Limited tests on a sampling of functions
  • Used for FDA-approved software/systems to make sure they did not break in transit or installation
77
Difference between data reconciliation and data validation
Data reconciliation: performed during data migration (large-scale data transfer) and followed by data validation of a small subset of the data. Performed one time or in several large batches.
Data validation: used for continuously interfaced data that is transferred on a small scale. Performed in real time or very frequently.
78
What is automation bias?
Assumption that computer is right even when it doesn’t make sense
79
What is Polanyi’s paradox?
The observation that there are many tasks that we, as human beings, intuitively understand how to perform but whose rules or procedures we cannot verbalize
80
How to achieve AI Intelligibility from an ethical perspective
Transparency + Explainability
81
What is the goal of reinforcement learning?
Learn a policy through trial and error while optimizing long-term reward (i.e. value)
82
What method is used to reduce loss function (measure of deviation from correct output)
Stochastic gradient descent
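
A minimal sketch of stochastic gradient descent on a one-variable linear regression. The function name, learning rate, epoch count and toy data are all assumptions; the loss is squared error and the parameters are updated after every single instance.

```python
import random


def sgd_linear_regression(xs, ys, learning_rate=0.05, epochs=300, seed=0):
    """Fit y = w * x + b by stochastic gradient descent on squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(zip(xs, ys))
    for _ in range(epochs):
        rng.shuffle(data)                    # visit instances in random order
        for x, y in data:
            error = (w * x + b) - y          # deviation from the correct output
            # Step against the gradient of 0.5 * error**2 w.r.t. w and b.
            w -= learning_rate * error * x
            b -= learning_rate * error
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]   # generated from y = 2x + 1
w, b = sgd_linear_regression(xs, ys)
print(round(w, 3), round(b, 3))  # should land near 2 and 1
```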
83
Correlation between variance and fit
High variance --> overfit
Low variance (with high bias) --> underfit
84
Types of regularization methods
Mathematical methods
  • Least Absolute Shrinkage and Selection Operator (LASSO, L1) regularization
  • Ridge (L2) regularization
  • Elastic Net regularization (combo of L1 & L2)
  • Special regularization for neural networks
Non-mathematical methods
  • Early stopping
  • Pruning decision trees
85
What is back-propagation in an ANN?
Process by which an ANN learns from its errors: the loss function measures how far the output is from the correct answer, and the internal parameters of the transfer functions (nodes) are adjusted by stochastic gradient descent in waves propagating backwards from the output nodes to the input nodes
86
How does a recurrent neural network work and what is an example?
  • Output recurrently feeds back on itself to inform the next prediction (i.e. analyzes current and past data)
  • Involves context nodes (network nodes that can accumulate historical data)
Ex) Long Short-Term Memory (LSTM): data storage units called gated nodes can flexibly represent short- and long-term data
87
How does a convolutional neural network work?
  • Convolution (filter) layer: data that match the pattern of weights are amplified (creates hot spots)
  • Pooling layer: masks data except for the hot spots (downsampling)
88
Which machine learning algorithm requires the least number of training instances?
Regression methods
89
When to use polynomial regression and what is the limitation
  • Used for modeling non-linear data (e.g. growth rates, progression of pandemics, etc.)
  • Uses powers (x, x^2, x^3, ...) of a SINGLE input variable
  • Limitation: higher risk of overfitting
90
Goal of regression vs classification algorithms
Regression: minimize the loss function
Classification: maximize the likelihood (maximum likelihood estimation)
91
How does support vector machine work?
Determines the optimal boundary between 2 classes in multidimensional space by maximizing the margin between the support vectors of the different classes
  • 2 features --> boundary is a line
  • 3 features --> boundary is a plane
  • >3 features --> boundary is a HYPERPLANE
Support vector = data point closest to the boundary (hardest to classify)
* Focus on borderline instances is UNIQUE to this method (instances far from the boundary are effectively ignored)
92
What are common uses of Support Vector Machine?
  • Fetal aneuploidy screening
  • Prediction of metastasis from gene profiles
  • Autoverification of GC/MS in the lab
93
How does k-nearest neighbor work?
  • Instance-based method
  • Plots instances in multi-dimensional space; a new instance is assigned the majority class of its k nearest neighbors
  • Knowledge is stored in the structure of the mapped data. Training data is not discarded.
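
A small sketch of k-nearest neighbor classification, assuming Euclidean distance and a simple majority vote. The function name `knn_predict` and the training points are hypothetical. Note that there is no training step: the stored instances are the model.

```python
import math
from collections import Counter


def knn_predict(training, query, k=3):
    """Return the majority label among the k training instances nearest to `query`.

    training: list of (features, label) pairs, kept in full at prediction time.
    """
    nearest = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


training = [
    ((1.0, 1.0), "low"), ((1.5, 2.0), "low"), ((2.0, 1.0), "low"),
    ((8.0, 8.0), "high"), ((9.0, 8.5), "high"),
]
print(knn_predict(training, (2.0, 2.0)))   # all 3 nearest neighbors are "low"
print(knn_predict(training, (8.5, 8.0)))   # 2 of the 3 nearest neighbors are "high"
```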
94
Difference between hard & soft voting when random forest is used for classification
Hard voting: each tree votes for a class; the most frequently predicted class is selected
Soft voting: the predicted probabilities for each class are averaged across trees; the class with the highest average probability is selected
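
The two schemes can disagree, which a short sketch makes concrete. The function names and the three trees' probabilities are invented for illustration.

```python
from collections import Counter


def hard_vote(predicted_classes):
    """Return the class predicted by the most models."""
    return Counter(predicted_classes).most_common(1)[0][0]


def soft_vote(probabilities):
    """Average each class's probability across models and return the top class."""
    classes = probabilities[0].keys()
    averages = {c: sum(p[c] for p in probabilities) / len(probabilities) for c in classes}
    return max(averages, key=averages.get)


# Hypothetical output of three trees for one instance.
tree_probabilities = [
    {"benign": 0.55, "malignant": 0.45},
    {"benign": 0.60, "malignant": 0.40},
    {"benign": 0.05, "malignant": 0.95},
]
tree_classes = [max(p, key=p.get) for p in tree_probabilities]

print(hard_vote(tree_classes))        # two of three trees say benign
print(soft_vote(tree_probabilities))  # average probability favors malignant
```

Here two weakly confident trees outvote one strongly confident tree under hard voting, while soft voting lets the confident tree carry the decision.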
95
2 main types of hierarchical clustering
1. Agglomerative clustering (bottom-up approach; most common)
2. Divisive clustering (top-down approach; not common)
96
What is the output of hierarchical clustering?
Dendrogram (tree diagram)
97
Example of probabilistic clustering
Gaussian mixture model: determines which Gaussian distribution an instance most probably belongs to
98
How are association rules rated?
1. Support - % of total transactions in a transaction database that the rule satisfies
2. Confidence - degree of certainty of an association (of the transactions containing X, the % that also contain Y)
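
Support and confidence for a rule X --> Y can be sketched over a toy transaction database. The function names and the four transactions are made up for illustration.

```python
def support(transactions, itemset):
    """Share of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for transaction in transactions if itemset <= set(transaction))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the share that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

rule_support = support(transactions, {"bread", "milk"})            # 2 of 4 transactions
rule_confidence = confidence(transactions, {"bread"}, {"milk"})    # 2 of the 3 bread transactions
print(rule_support, round(rule_confidence, 3))
```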
99
What is discretization and when is it necessary to do discretization?
Converting continuous data into bins (categorical data)
Ex) Discretization must be done to use continuous data with association rules
100
What is multi-collinearity and how do you fix it?
Data having redundant (highly correlated) features/attributes
Can be reduced using dimensionality reduction methods
101
How does bootstrapping work
  • Random sampling WITH replacement
  • Model is trained on each bootstrapped sample and validated on the out-of-bag (OOB) sample
  • On average ~36.8% of instances are out-of-bag for a given bootstrap sample
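
The ~36.8% figure can be checked empirically: drawing n instances with replacement leaves each instance out with probability (1 - 1/n)^n, which approaches 1/e. A sketch with a hypothetical helper `bootstrap_sample`; the sample size, repeat count and seed are arbitrary.

```python
import random


def bootstrap_sample(n_instances, rng):
    """Return (in_bag, out_of_bag) index lists for one bootstrap sample."""
    # Sampling WITH replacement: the same index may be drawn more than once.
    in_bag = [rng.randrange(n_instances) for _ in range(n_instances)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n_instances) if i not in chosen]
    return in_bag, out_of_bag


rng = random.Random(0)
n = 1000
oob_fractions = [len(bootstrap_sample(n, rng)[1]) / n for _ in range(200)]
mean_oob = sum(oob_fractions) / len(oob_fractions)
print(round(mean_oob, 3), round((1 - 1 / n) ** n, 3))
```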
102
What is one-hot encoding
  • Transforming categories into an array of binary switches, one item per category
  • Adds dimensionality to features
Example:
Melanoma = [1,0,0]
Dysplastic nevus = [0,1,0]
Benign nevus = [0,0,1]
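
The encoding in the example can be reproduced with a few lines. The helper name `one_hot` is an assumption; the categories are the ones from the card.

```python
def one_hot(value, categories):
    """Return a list with 1 at the position of `value` and 0 everywhere else."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if category == value else 0 for category in categories]


categories = ["Melanoma", "Dysplastic nevus", "Benign nevus"]
for category in categories:
    print(category, "=", one_hot(category, categories))
```

One categorical feature with three possible values becomes three binary features, which is why the encoding adds dimensionality.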
103
7 Examples of low-level NLP tasks
1. Sentence boundary detection
2. Tokenization
3. Part-of-speech assignment (complicated by gerunds, which are nouns formed from verbs + ing)
4. Morphological decomposition of compound words (e.g. nasogastric)
5. Shallow parsing (chunking): identifying phrases (e.g. a noun phrase made up of adjective + noun)
6. Problem-specific segmentation
7. Coreference resolution (e.g. determining that Mr. XXX, he and his refer to the same person)
104
7 Higher level NLP tasks
1. Spelling/grammatical error identification and recovery
2. Named entity recognition (e.g. categorizing as persons, locations, meds, diseases, etc.)
3. Word sense disambiguation (determining a homograph's correct meaning)
4. Negation and uncertainty detection
5. Relationship extraction
6. Temporal inference extraction
7. Information extraction
105
Difference between semi-supervised and weakly supervised ML models
Semi-supervised: some data are labeled while other data are not
Weakly supervised: a small amount of data have detailed labels; the rest of the data have fewer labels