Domain 4: Data Governance & Data Analytics Flashcards

1
Q

What is the ACID test for reliable database (DB) transactions?

A

Atomicity - transaction is indivisible (all of its steps succeed or none do)

Consistency - every transaction takes the database from one valid state to another

Isolation - transactions running concurrently do not interfere with each other

Durability - once a transaction is committed, its changes persist in the data store even after a crash or power loss
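
A minimal sketch of atomicity, assuming an in-memory SQLite database and a hypothetical `account` table (neither is part of the card). The `transfer` helper is illustrative only: when the second check fails, both updates are rolled back together.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` between two accounts as one atomic transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src)
            )
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst)
            )
            (balance,) = conn.execute(
                "SELECT balance FROM account WHERE id = ?", (src,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()

ok = transfer(conn, 1, 2, 30)       # succeeds and is committed
failed = transfer(conn, 1, 2, 500)  # fails, so neither update is kept
balances = dict(conn.execute("SELECT id, balance FROM account"))
```

After the failed transfer the balances still reflect only the first, committed transfer.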

2
Q

What is included in the OSI 7 layer model

A

<Host>
DATA
7. Application (e.g. POP3/IMAP4 for email, HTTP for web content, FTP for file transfer, SSH and HTTPS for secure browsing)

6. Presentation (Encryption, decryption, conversion to character sets like ASCII)

5. Session (Lightweight Directory Access Protocol = LDAP for authenticating users, SSL)

SEGMENTS
4. Transport (Transmission of segments via TCP, UDP)

<Media>
Packet
3. Network (Transmission of packets via IP through routers. DHCP, which assigns IP addresses to hosts, and DNS supply addressing for this layer but are themselves application-layer protocols)

Frame
2. Data Link (transmission of frames via Ethernet, PPP (Point-to-Point Protocol), ARP (Address Resolution Protocol))

Bit
1. Physical (transmission of binary bits via copper wire, coaxial or fiber optic cable)
</Media></Host>

3
Q

In MUMPS, global variables are referenced by what symbol?

A

Up arrow symbol (later became a caret “^”)

4
Q

In an event monitor, a high test result is a?

A

Condition

5
Q

What is evoking strength in INTERNIST-1/QMR

A

PPV

6
Q

What is the observed value for an item in a ML data set?

A

Label

7
Q

What is an example of an expert system?

A

CADUCEUS

8
Q

Examples of reinforcement learning

A

Markov decision process (use with known model)
- Q-learning

Monte Carlo (use when one or more elements are unknown)

9
Q

What are measures of information retrieval success?

A

Precision (PPV) = % of returned documents that are relevant to the query

Recall (sensitivity) = % of all relevant documents in the corpus that were found

Fall-out (false positive rate) = % of irrelevant documents that are retrieved
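
The three measures above can be sketched as follows. The function name, document IDs and corpus size are hypothetical values chosen for illustration.

```python
def retrieval_metrics(retrieved, relevant, corpus_size):
    """Compute precision, recall and fall-out from sets of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)      # relevant documents that were returned
    fp = len(retrieved - relevant)      # irrelevant documents that were returned
    fn = len(relevant - retrieved)      # relevant documents that were missed
    tn = corpus_size - tp - fp - fn     # irrelevant documents correctly left out
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "fall_out": fp / (fp + tn),
    }

# Hypothetical query: 4 documents returned, 5 relevant, 20 in the corpus.
metrics = retrieval_metrics(
    retrieved={1, 2, 3, 4}, relevant={2, 3, 4, 5, 6}, corpus_size=20
)
```

Here 3 of the 4 returned documents are relevant (precision 0.75), 3 of the 5 relevant documents were found (recall 0.6), and 1 of the 15 irrelevant documents was returned (fall-out 1/15).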

10
Q

Difference between bias and variance

A

Bias = measure of inaccuracy
Variance = measure of imprecision

11
Q

What is the null error rate

A

Rate of being wrong if you ALWAYS pick the majority class.

Ex) If majority class has 105 instances out of 165 total instances, null error rate = (165-105)/165 = 60/165 ≈ 36.4%
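
The same calculation as a short sketch. The label values "yes" and "no" are assumptions; only the 105-of-165 split comes from the card.

```python
from collections import Counter

def null_error_rate(labels):
    """Error rate of a classifier that always predicts the majority class."""
    counts = Counter(labels)
    majority = max(counts.values())
    return (len(labels) - majority) / len(labels)

# 105 instances in the majority class out of 165 total, as in the example.
labels = ["yes"] * 105 + ["no"] * 60
rate = null_error_rate(labels)  # (165 - 105) / 165
```

A model is only useful if its error rate beats this baseline.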

12
Q

What statistical models to use to screen for low incidence conditions

A
  • Fbeta score
  • Matthews Correlation Coefficient
  • Stratified K-fold cross-validation
13
Q

Measures to evaluate regression methods (numerical output)

A
  • Root mean squared error (RMSE): lower = better fit
  • Correlation coefficient (r): strength of relationship between x and y on a scatter plot. No correlation is r = 0.
  • Coefficient of determination (r², “r squared”): Goodness of fit. Represents % variation in y that is explained by variation in x. 100% = perfectly fit model.
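
A minimal sketch of the three measures in plain Python. The function names and sample values are hypothetical; r squared is computed here as 1 minus the ratio of residual to total sum of squares.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: lower means a better fit."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r_squared(y_true, y_pred):
    """Share of the variation in y that the model explains."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def pearson_r(x, y):
    """Correlation coefficient between two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(sxx * syy)

# Hypothetical data: y is exactly 2x, predictions are off by 0.5 each.
x = [1, 2, 3, 4]
y = [2, 4, 6, 8]
y_pred = [2.5, 3.5, 6.5, 7.5]

error = rmse(y, y_pred)
fit = r_squared(y, y_pred)
r = pearson_r(x, y)
```

With these values the RMSE is 0.5, r squared is 0.95, and r between x and y is 1 because the relationship is perfectly linear.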
14
Q

Measures to evaluate classification methods (categorical output)

A
  • Confusion Matrix: Assessment of model’s “confusion” in analyzing instances (i.e., assigning instances to the wrong class or outcome)
  • Cohen’s Kappa: measure of how well the classifier performs as compared to random chance
  • Reliability diagram: Assess calibration by plotting observed event frequency against predicted probability
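
As an illustration of the first two measures, the sketch below derives Cohen's Kappa from a confusion matrix. The matrix values are hypothetical; rows are actual classes and columns are predicted classes.

```python
def cohens_kappa(matrix):
    """Agreement beyond chance; matrix[i][j] = actual class i predicted as class j."""
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    # Observed agreement: share of instances on the diagonal.
    observed = sum(matrix[i][i] for i in range(size)) / total
    # Expected agreement by chance: product of row and column marginals.
    expected = sum(
        (sum(matrix[i]) / total) * (sum(row[i] for row in matrix) / total)
        for i in range(size)
    )
    return (observed - expected) / (1 - expected)

# Hypothetical 2x2 confusion matrix with 165 instances.
confusion = [
    [50, 10],   # actual negative: 50 predicted negative, 10 predicted positive
    [5, 100],   # actual positive: 5 predicted negative, 100 predicted positive
]
kappa = cohens_kappa(confusion)
```

Here observed agreement is 150/165 and chance agreement is 6/11, which gives a Kappa of 0.8.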
15
Q

Measure used to evaluate both supervised & unsupervised models

A
  • Receiver Operating Characteristic (ROC) curve: Plot sensitivity against 1-specificity
  • Area Under the Curve (AUC): AKA concordance (c) statistic. AUC = 1 –> perfect discrimination.
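
One way to see why AUC is also called the concordance statistic: it equals the share of positive/negative pairs in which the positive instance receives the higher score. The sketch below assumes that pairwise definition, and the scores are hypothetical.

```python
def auc_concordance(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pairs = 0
    concordant = 0.0
    for p in scores_pos:
        for n in scores_neg:
            pairs += 1
            if p > n:
                concordant += 1.0   # positive ranked above negative
            elif p == n:
                concordant += 0.5   # ties count as half
    return concordant / pairs

# Hypothetical model scores for 3 positive and 3 negative instances.
auc = auc_concordance([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
```

Eight of the nine pairs are ranked correctly, so the AUC is 8/9. A value of 0.5 would mean the scores discriminate no better than chance.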
16
Q

What is the most important, time-consuming and expensive part of developing ML model

A

Gathering appropriate data (instances)

17
Q

Difference between validation and testing data set

A
  • Validation data set used to evaluate PRELIM model to tune model
  • Testing data set used for final evaluation of model. No further change is anticipated.
18
Q

General rule for feature to instance ratio

A

Select <= 1 feature for each 10 instances in the development data set

High # of features –> overfitting

19
Q

Methods for feature selection

A
  • Forward selection (iterative inclusion)
  • Backward selection (iterative removal)
  • Stepwise selection (combination of forward & backward selection)
  • Forced inclusion
20
Q

Methods for reducing number of features when an unsupervised model is overfit

A

Dimensionality reduction method

21
Q

Methods for validating model when data (instance) is limited

A
  • K-fold cross-validation: training data randomly partitioned into equal # (k) subsamples (folds)
  • Leave One Out Cross-Validation (LOOCV): extreme case of k-fold
  • Bootstrapping
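
A minimal sketch of k-fold partitioning, assuming instances are addressed by index. The function name and the fixed seed are illustrative choices. Setting k equal to the number of instances gives LOOCV.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train, validation) index lists for each of k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)        # random partition
    folds = [indices[i::k] for i in range(k)]   # k roughly equal folds
    for i in range(k):
        validation = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, validation

# 10 instances split into 5 folds: each fold validates on 2 and trains on 8.
splits = list(k_fold_indices(n=10, k=5))
```

Every instance is used for validation exactly once, which is what makes the approach attractive when data are limited.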
22
Q

Methods for optimizing (aka tuning) model

A
  • Tune hyperparameters
  • Dimensionality reduction
  • Regularization: force algorithm to build a less complex model (i.e. more generalizable –> less likely to overfit data)
23
Q

Types of deployed model

A
  • Static model (most common in medicine)
  • Incremental/continuous model
24
Q

ML algorithm that can be used for both supervised and unsupervised algorithms

A

Neural networks (aka connectionist systems)

25
Q

ML algorithm used for supervised algorithms

A
  • Regression Methods
  • Classification Methods
  • Ensemble Methods
26
Q

ML algorithm used for unsupervised algorithms

A
  • Clustering Methods
  • Association Rules
  • Dimensionality Reduction Methods
27
Q

How many hidden layers does Deep Artificial Neural Network usually have?

A

> 3

28
Q

Different types of ANN

A
  • Feed-forward network (unidirectional from input to output)
  • Multilayer Perceptron* (MLP)
  • Perceptron: iterative algorithm that determines best values for the coefficient vector
  • Convolutional neural network (CNN)
  • Recurrent neural network (RNN)
  • Generative Adversarial Network (GAN)
29
Q

Common output of convolutional neural network (CNN)

A

Image classification and/or image feature selection (e.g. saliency map showing which features considered more relevant)

30
Q

Common use of recurrent neural network (RNN)

A

NLP

31
Q

Common use of generative adversarial network

A

Use to generate DeepFake images and simulate cat-and-mouse fraud schemes

32
Q

How does generative adversarial network work?

A

Pairs of deep learning neural networks (discriminator network vs. generator network) trained in tandem repeatedly

33
Q

What are examples of classification models

A
  • Logistic regression
  • Naive Bayes Classifier
  • Support Vector Machines
  • k-Nearest Neighbor
  • Decision Trees (most common method)
34
Q

What does an internal and leaf node represent in a decision tree

A

Internal node = 1 feature (independent variable)

Leaf node = outcome class (dependent variable)

35
Q

Method for checking quality of decision tree model

A
  • Gini impurity (0 with single class populations)
  • Entropy (high when large number of evenly mixed classes)
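
Both measures can be sketched from the class labels that reach a node. The function names and label values are hypothetical; entropy is measured in bits (log base 2).

```python
import math
from collections import Counter

def gini_impurity(labels):
    """1 minus the sum of squared class proportions; 0 for a pure node."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits; highest when classes are evenly mixed."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

pure = ["a"] * 8                      # single-class population
two_even = ["a"] * 4 + ["b"] * 4      # two evenly mixed classes
four_even = ["a", "b", "c", "d"] * 2  # four evenly mixed classes
```

A pure node scores 0 on both measures, while entropy rises from 1 bit to 2 bits as the number of evenly mixed classes goes from two to four.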
36
Q

Types of ensemble methods

A

Parallel ensembles (uses unweighted voting)

Sequential (series) ensembles (uses weighted voting)

37
Q

What are methods to create diversity in parallel ensembles to help decrease overfitting

A
  • Bagging (bootstrap aggregating): bootstrap each model in the ensemble
  • Random subspaces: Use random subset of features per model
  • Random forest: Ensemble of randomly selected decision trees to make a “forest.” Uses BOTH bagging and random subspaces.
38
Q

How are sequential (series) ensembles constructed?

A
  • Combines constrained (weak learner) models into a single strong learner
  • Constructed using boosting methods, which uses weighted voting
39
Q

Difference between hard and soft clustering

A
  • Exclusive (hard) clustering: An instance can only belong to 1 cluster
  • Fuzzy (soft) clustering: An instance can have more than 1 cluster assignment
39
Q

Types of boosting methods

A
  • AdaBoost (Adaptive Boosting): misclassified data from each algorithm have weights increased
  • Gradient boosting: similar to AdaBoost except each model trains on the residual errors of the previously run model
  • CatBoost
40
Q

Types of clustering methods

A
  • Hierarchical clustering: instances grouped based on similarities and differences
  • Probabilistic clustering: instances clustered based on probability that they belong to a particular distribution
  • K-Means clustering: Assign instances to a manually defined number (K) of clusters based on similarity –> Compute the distance from each instance to every centroid (center of a cluster) using a defined distance metric. Assign each instance to the cluster with the closest centroid, then recompute each centroid as the mean of its assigned instances. Repeat until the centroids stop moving.
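
A minimal sketch of the K-Means loop, assuming Euclidean distance, 2-D points and caller-supplied starting centroids (all illustrative choices, not part of the card).

```python
import math

def k_means(points, centroids, max_iter=100):
    """Alternate assignment and centroid updates until centroids stop moving."""
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)), key=lambda i: math.dist(p, centroids[i])
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, clusters

# Hypothetical data: two visually separate groups, K = 2.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

The result depends on the starting centroids, which is why practical implementations usually run several random initializations.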
41
Q

What is a silhouette coefficient

A

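Compares cluster cohesion (how close an instance is to the other members of its own cluster) with cluster separation (how far it is from the nearest other cluster)

1: clusters well apart
0: clusters overlapping (instance sits on the boundary between clusters)
-1: instances assigned to the wrong cluster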

Used for K-means clustering

42
Q

Types of association rules

A
  • Market basket analysis (e.g. customers who bought X also bought Y)
  • Apriori algorithm: Uses a hash tree to count item sets navigating through data set in breadth first manner
43
Q

Dimensionality reduction methods

A
  • Principal Component Analysis (principal component = axis through data that is a function of contribution of variability in a population. Each principal component has to be orthogonal to all other principal components)
  • Singular value decomposition
  • Autoencoders
44
Q

What is denormalization of data

A

Intentional duplication of data to improve database performance

45
Q

3 Database integrity requirements

A
  1. Entity integrity: every table in the database has a primary key whose values are unique and never null
  2. Referential integrity: Whenever a database column refers to a row in another table, that row exists
  3. Domain integrity: specific list of values that are acceptable for a particular column
46
Q

What is F1 score and how is it calculated?

A

F-Score (aka F-measure or F1 score) = harmonic mean between precision and recall

F1 = 2 * (Precision x Recall) / (Precision + Recall)
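
The formula can be sketched as a special case of the more general F-beta score, where beta = 1 gives F1 and beta greater than 1 weights recall more heavily. The function name and the precision and recall values are hypothetical.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta=1 gives F1."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

f1 = f_beta(0.75, 0.6)            # 2 * (0.75 * 0.6) / (0.75 + 0.6)
f2 = f_beta(0.75, 0.6, beta=2.0)  # recall counts more than precision
```

Because it is a harmonic mean, F1 (about 0.667 here) sits below the arithmetic mean of 0.675 and is pulled toward the weaker of the two values.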

47
Q

What is tokenization?

A

Process of breaking documents into searchable items. Tokenization must be done before items are placed into an index.

48
Q

Which SQL keyword can be used to eliminate duplicate rows from a query result?

A

DISTINCT

49
Q

Equation for P(A, B), the joint probability of events A and B occurring simultaneously

A

P(A, B) = P(A | B) x P(B) = P(B | A) x P(A)
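
A worked check of the identity using exact fractions. The counts are hypothetical: A could stand for "has the disease" and B for "test is positive" in a population of 100.

```python
from fractions import Fraction

# Hypothetical counts for each combination of events.
both = 8        # A and B
a_only = 2      # A without B
b_only = 12     # B without A
neither = 78
total = both + a_only + b_only + neither

p_a = Fraction(both + a_only, total)           # P(A)
p_b = Fraction(both + b_only, total)           # P(B)
p_ab = Fraction(both, total)                   # P(A, B)
p_a_given_b = Fraction(both, both + b_only)    # P(A | B)
p_b_given_a = Fraction(both, both + a_only)    # P(B | A)

via_b = p_a_given_b * p_b   # P(A | B) x P(B)
via_a = p_b_given_a * p_a   # P(B | A) x P(A)
```

Both products equal 8/100, the joint probability read directly from the counts. Setting the two products equal and dividing by P(B) yields Bayes' theorem.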

50
Q

What is the DIKW Model

A
  1. Data = Item or signal with little to no meaning by itself
  2. Information = Data + Meaning
  3. Knowledge = Patterns and relationships between pieces of info
  4. Wisdom = Understanding and internalization of knowledge to apply it appropriately
51
Q

Difference between calibration and discrimination as metrics for an expert system

A

Discrimination measures ability to tell the difference between two sets of patients

Calibration measures the degree to which the system produces an accurate probability

52
Q

What does a class, instance, attribute and method in OOP correspond to in RDBMS?

A
  • Class –> DB relation (table)
  • Instance –> Tuple (row)
  • Attribute –> Attribute (column)
  • Method (accessors, mutators) –> Database CRUD function
53
Q

What is the difference between primary and foreign key?

A

Primary key: an attribute that uniquely identifies a tuple (row)

Foreign key: an attribute whose values must have matching values in the primary key of another table

54
Q

Difference between inner, outer and cross join

A

Inner join
- Only includes rows that have a match in both tables
- Can be written as a WHERE clause

Left outer join
- Includes all rows in the left table (i.e. primary table specified in “FROM”) and displays blanks (NULLs) for right-table columns with no match

Right outer join
- Includes all rows in the right table and displays blanks (NULLs) for left-table columns with no match

Full Outer Join
- Includes all rows in both tables, with blanks (NULLs) on whichever side has no match

Cartesian (Cross) Join
- Give you “cross product” of both tables
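
A small sketch of the join types using Python's built-in sqlite3 module. The `patient` and `visit` tables and their rows are hypothetical. Right and full outer joins are left out because older SQLite versions do not support them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patient (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE visit (id INTEGER PRIMARY KEY, patient_id INTEGER, reason TEXT)")
conn.executemany("INSERT INTO patient VALUES (?, ?)", [(1, "Ann"), (2, "Bob"), (3, "Cy")])
conn.executemany(
    "INSERT INTO visit VALUES (?, ?, ?)",
    [(10, 1, "flu"), (11, 1, "checkup"), (12, 2, "rash")],
)

# Inner join: only patients that have a matching visit.
inner = conn.execute(
    "SELECT p.name, v.reason FROM patient p "
    "INNER JOIN visit v ON v.patient_id = p.id ORDER BY v.id"
).fetchall()

# The same inner join written with a WHERE clause.
where_form = conn.execute(
    "SELECT p.name, v.reason FROM patient p, visit v "
    "WHERE v.patient_id = p.id ORDER BY v.id"
).fetchall()

# Left outer join: every patient, with NULL (None) where no visit exists.
left = conn.execute(
    "SELECT p.name, v.reason FROM patient p "
    "LEFT OUTER JOIN visit v ON v.patient_id = p.id ORDER BY p.id, v.id"
).fetchall()

# Cross join: every patient paired with every visit (3 x 3 rows).
(cross_count,) = conn.execute(
    "SELECT COUNT(*) FROM patient CROSS JOIN visit"
).fetchone()
```

Cy has no visits, so Cy appears only in the left outer join, paired with a NULL reason.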

55
Q

What is the “LIKE” condition used for in SQL?

A
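  • Match strings against a pattern
  • Case sensitivity of LIKE depends on the DBMS and its collation settings (e.g. case sensitive in Oracle and PostgreSQL; often case insensitive by default in MySQL and SQL Server)
  • UPPER([char]) & LOWER([char]) convert to all upper or lower case, which allows a case-insensitive comparison

Ex) LIKE ‘B%’ –> “%” matches any sequence of zero or more characters
Ex) LIKE ‘Smith_’ –> “_” must match exactly one character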
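
The two wildcard examples can be tried with Python's built-in sqlite3 module. The `person` table and the names in it are hypothetical. Note that SQLite's own LIKE ignores case for ASCII letters by default, so the sample data uses consistent capitalization to keep the result independent of that setting.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (last_name TEXT)")
conn.executemany(
    "INSERT INTO person VALUES (?)",
    [(n,) for n in ["Baker", "Brown", "Smith", "Smithe", "Smiths", "Smithson"]],
)

def match(pattern):
    """Return last names matching a LIKE pattern, in alphabetical order."""
    rows = conn.execute(
        "SELECT last_name FROM person WHERE last_name LIKE ? ORDER BY last_name",
        (pattern,),
    )
    return [row[0] for row in rows]

starts_with_b = match("B%")       # % matches any run of characters
smith_plus_one = match("Smith_")  # _ matches exactly one character
```

"Smith" itself does not match the second pattern because the underscore requires one additional character, and "Smithson" has too many.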

56
Q

What are the strengths and limitations of hierarchical database

A

Strengths
- Optimized for rapid transactions of hierarchical data
- Makes it easy to know every attribute about one thing

Limitations
- Can only traverse tree from root (top parent) node
- Child nodes can only have 1 parent –> difficult to model relationships between child nodes

57
Q

What is an example of hierarchical database?

A

MUMPS (MGH Utility Multi-Programming System)

58
Q

Example of object oriented database management system

A

InterSystems Caché (OODBMS behind Epic EHR)

59
Q

Benefits of object oriented database management systems

A
  • Support for more data types (e.g. graphics, photo, video, webpages)
  • Usually integrated into programming language so accessing data doesn’t require complex driver configuration
  • Most web app frameworks support interaction with OODBMS
60
Q

Tools of Unified Modeling Language

A
  • Class diagram (describes object oriented classes like class title, attributes, methods, inheritance)
  • Activity diagram (~process flowchart)
  • Use case diagram (describes actors, goals and dependencies)
  • Entity-Relationship (ER) diagram (describes objects and their relationships) –> used to define RDBMS logical schema
61
Q

What level of normalization is considered to be sufficient to call a database “normalized”?

A

Third Normal Form (3NF)

62
Q

Pros & Cons of normalized database

A

Pros
- Reduce inconsistencies and dependencies in relational databases
- Safe against most INSERT, UPDATE and DELETE anomalies

Cons
- To generate a report you have to “denormalize” the data
- Requires lots of PK, FK, JOIN logic in your query

63
Q

What is the ETL process

A

Extract-Transform-Load process gets transactional data into a format that can be integrated into a Data Warehouse for reporting/queries

64
Q

What is TCP/IP

A

TCP = transmission control protocol
IP = internet protocol

65
Q

What is the difference between TCP & UDP (user datagram protocol)

A
  • TCP requires acknowledgment. UDP does not.
  • TCP guarantees sequence/order of packets. UDP does not.
  • TCP used when packet loss is unacceptable
  • UDP used where packet loss is less important (e.g. VoIP or streaming protocols)
66
Q

What is CODEC?

A

Coder / Decoder - compression algorithm used for digital stream to transmit audio, video, etc. Can be “lossy” or “lossless.”

67
Q

Examples of short range wireless standards

A

Short range PAN (personal area network):
- RFID (one way) and NFC (two way)
- IEEE 802.15 - wireless PAN and derivatives (Bluetooth and infrared)

68
Q

Examples of medium range wireless standards

A

Medium Range WLAN (Wireless Local Area Network)
- 802.11b
- 802.11g
- 802.11n
- 802.11ac

69
Q

Long Range Wireless Standards

A

WiMax
CDMA
3G
4G / 4G LTE
5G

70
Q

Advantages of Bluetooth Low Energy (aka Bluetooth Smart)

A
  • Low battery consumption
  • Limited need for data transfer
  • One-way communication in close proximity
  • BLE Beacons broadcast packets of data at regular intervals and devices pick them up, detected by pre-installed apps or services
  • Popular use in health and fitness, indoor navigation, proximity-based marketing
71
Q

Requirement for first normal form DB

A

Each cell contains a single value
Each record is unique

72
Q

Requirements for second normal form DB

A

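Table must meet all criteria for 1NF AND every non-key column must depend on the whole primary key (no partial dependencies). A 1NF table with a single-column (non-composite) primary key satisfies this automatically.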

73
Q

Requirements for third normal form DB

A

Table must meet all criteria for 1NF AND 2NF AND must have no “transitive functional dependencies,” which means a non-key column must not depend on another non-key column (e.g. storing both zip code and city, where city is determined by zip code).

74
Q

What are the assumptions of Bayes’ theorem and the corresponding limitations for diagnosis of disease?

A
  • Conditional independence of predictors –> Findings in disease are usually not conditionally independent
  • Mutual exclusivity of conditions –> diseases may not be mutually exclusive
  • Calculation is not simple –> when there are multiple findings, computation becomes complex quickly
75
Q

Difference between notifiable vs. reportable diseases in the context of public health reporting

A

Notifiable diseases must be reported to CDC

Reportable diseases must be reported to states

76
Q

Difference between validation & verification of software/systems

A

Validation
- Extensive testing including regression testing
- Used on non-FDA approved software/systems

Verification
- Limited tests on sampling of functions
- Used for FDA-approved software/systems to make sure it did not break in transit or installation

77
Q

Difference between data reconciliation and data validation

A

Data reconciliation performed during data migration (large scale data transfer) and is followed by data validation of small subset of data. Performed one time or in several large batches.

Data validation is used for continuously interfaced data that is transferred in small scale. Performed in real-time or very frequently.

78
Q

What is automation bias?

A

Assumption that computer is right even when it doesn’t make sense

79
Q

What is Polanyi’s paradox?

A

The cognitive phenomenon that there are many tasks humans intuitively understand how to perform but cannot verbalize the rules or procedures for

80
Q

How to achieve AI Intelligibility from an ethical perspective

A

Transparency + Explainability

81
Q

What is the goal of reinforcement learning?

A

Learn a policy through trial and error while optimizing long-term reward (i.e. value)

82
Q

What method is used to minimize the loss function (measure of deviation from correct output)?

A

Stochastic gradient descent

83
Q

Correlation between variance and fit

A

High variance —> overfit
Low variance (with high bias) —> underfit

84
Q

Types of regularization methods

A

Mathematical methods
- Least Absolute Shrinkage and Selection Operator (LASSO) (L1) regularization
- Ridge (L2) regularization
- Elastic Net regularization (combo of L1 & L2)
- Special regularization for neural networks

Non mathematical methods
- Early stopping, pruning decision trees

85
Q

What is back-propagation in ANN

A

Process where ANN learns whether it made a mistake or not based on output.

Adjust internal parameters of transfer functions (nodes) using loss functions and stochastic gradient descent functions in waves propagating backwards from the output nodes to the input nodes

86
Q

How does a recurrent neural network work and what is an example?

A

Ex) Long Short-Term Memory (LSTM)
- Data storage units called gated nodes can flexibly represent short/long-term data
- Output recurrently feeds back on itself to inform next prediction (i.e. analyzes current and past data)
- Involves context nodes (network nodes that can accumulate historical data)

87
Q

How does a convolutional neural network work?

A

In the convolution (filter) layer data that match the pattern of weights are amplified (creates hotspots)

Pooling layer masks data except for hot spots

88
Q

Which machine learning algorithm requires the least number of training instances?

A

Regression methods

89
Q

When to use polynomial regression and what is the limitation

A

Modeling non-linear exponential data (e.g. growth rates, progression of pandemics, etc.)

Uses exponentials of SINGLE input variable

Higher risk of overfitting

90
Q

Goal of regression vs classification algorithms

A

Regression: Minimize the loss function

Classification: Maximize the likelihood (maximum likelihood estimation)

91
Q

How does support vector machine work?

A

Determines optimal boundary between 2 classes in multidimensional space by maximizing margin between support vectors of different classes
- 2 features –> boundary is a line
- 3 features –> boundary is a plane
- >3 features –> boundary is a HYPERPLANE

Support vector = data point closest to the boundary (hardest to classify)

  • Focus on borderline instances is UNIQUE to SVM (instances far from the boundary are effectively ignored)
92
Q

What are common uses of Support Vector Machine?

A
  • Fetal aneuploidy screening
  • Prediction of metastasis from gene profiles
  • Autoverification of GC/MS in the lab
93
Q

How does k-nearest neighbor work?

A
  • Instance-based method
  • Plots instances in multi-dimensional space. Knowledge is stored in the structure of the mapped data. Training data is not discarded.
94
Q

Difference between hard & soft voting when random forest is used for classification

A

Hard voting: each tree votes for one class, and the most frequently selected class wins

Soft voting: average the predicted probability of each class across all trees, then select the class with the highest average probability
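
A small sketch showing that the two voting schemes can disagree. The three trees, the class names and the probabilities are hypothetical.

```python
from collections import Counter

def hard_vote(prob_rows, classes):
    """Each model votes for its most probable class; the majority wins."""
    votes = [
        classes[max(range(len(classes)), key=row.__getitem__)] for row in prob_rows
    ]
    return Counter(votes).most_common(1)[0][0]

def soft_vote(prob_rows, classes):
    """Average each class probability across models; the highest average wins."""
    averages = [sum(column) / len(prob_rows) for column in zip(*prob_rows)]
    return classes[max(range(len(classes)), key=averages.__getitem__)]

classes = ["benign", "malignant"]
# One row per tree: [P(benign), P(malignant)].
tree_probabilities = [
    [0.55, 0.45],  # weakly favors benign
    [0.60, 0.40],  # weakly favors benign
    [0.10, 0.90],  # strongly favors malignant
]

hard = hard_vote(tree_probabilities, classes)
soft = soft_vote(tree_probabilities, classes)
```

Two of three trees vote benign, so hard voting returns "benign". The one confident tree pulls the average probability of malignant to about 0.58, so soft voting returns "malignant".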

95
Q

2 main types of hierarchical clustering

A
  1. Agglomerative clustering (bottom up approach; most common)
  2. Divisive clustering (top down approach; not common)
96
Q

What is the output of hierarchical clustering?

A

Dendrogram (tree diagram)

97
Q

Example of probabilistic clustering

A

Gaussian Mixture Model
- Determine which gaussian distribution an instance belongs to

98
Q

How are association rules rated?

A
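  1. Support - % of total transactions in a transaction database that the rule satisfies (i.e. that contain every item in the rule)
  2. Confidence - degree of certainty of an association (of the transactions containing X, the % that also contain Y)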
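
A minimal sketch of both ratings for a rule X –> Y. It assumes the common definitions: support is the share of all transactions containing both X and Y, and confidence is that count divided by the number of transactions containing X. The transactions are hypothetical.

```python
def support(transactions, antecedent, consequent):
    """Share of all transactions that contain every item in the rule."""
    items = antecedent | consequent
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the share with the consequent."""
    with_antecedent = [t for t in transactions if antecedent <= t]
    items = antecedent | consequent
    return sum(items <= t for t in with_antecedent) / len(with_antecedent)

# Hypothetical market baskets.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
    {"butter", "milk"},
]

rule_support = support(transactions, {"bread"}, {"butter"})
rule_confidence = confidence(transactions, {"bread"}, {"butter"})
```

For the rule bread –> butter, 2 of 5 baskets contain both items (support 0.4) and 2 of the 3 bread baskets also contain butter (confidence 2/3).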
99
Q

What is discretization and when is it necessary to do discretization?

A

Converting continuous data into bins (categorical data)

Ex) Discretization must be done to use data with association rules

100
Q

What is multi-collinearity and how do you fix it?

A

Data having redundant features/attributes

Can be reduced using dimensionality reduction methods

101
Q

How does bootstrapping work

A

Random sampling WITH replacement

Model is trained on each bootstrapped sample + validated on out-of-bag sample

On average ~36.8% of instances are left out of each bootstrapped sample (out-of-bag, OOB)
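
Where the 36.8% figure comes from: when n instances are drawn with replacement n times, the chance a given instance is never picked is (1 - 1/n)^n, which approaches 1/e ≈ 0.368 as n grows. The sketch below checks this with a hypothetical sample size and a fixed seed.

```python
import random

def bootstrap_split(n, rng):
    """Draw n indices with replacement; return in-bag and out-of-bag indices."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n) if i not in chosen]
    return in_bag, out_of_bag

n = 1000
expected_oob = (1 - 1 / n) ** n   # theoretical out-of-bag fraction
in_bag, out_of_bag = bootstrap_split(n, random.Random(0))
observed_oob = len(out_of_bag) / n
```

The observed fraction varies slightly from sample to sample but stays close to the theoretical value.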

102
Q

What is one-hot encoding

A
  • Transforming categories into an array of binary switches, one element per category
  • Adds dimensionality to features

Example:
Melanoma = [1,0,0]
Dysplastic nevus = [0,1,0]
Benign nevus = [0,0,1]
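
The same encoding as a short sketch. The `one_hot` helper is a hypothetical name; the category order determines which position is switched on.

```python
def one_hot(value, categories):
    """Return a list with 1 at the position of `value` and 0 elsewhere."""
    return [1 if value == category else 0 for category in categories]

categories = ["Melanoma", "Dysplastic nevus", "Benign nevus"]
encoded = {category: one_hot(category, categories) for category in categories}
```

One categorical feature with three values becomes three binary features, which is the added dimensionality noted above.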

103
Q

7 Examples of low-level NLP tasks

A
  1. Sentence boundary detection
  2. Tokenization
  3. Part-of-speech assignment (complicated by gerunds, which are nouns formed from verbs + ing)
  4. Morphological decomposition of compound words (e.g. nasogastric)
  5. Shallow parsing (chunking) - identifying phrases (e.g. noun phrase comprised of adjective + noun)
  6. Problem-specific segmentation
  7. Coreference resolution (e.g. determining that Mr. XXX, he and his refer to same person)
104
Q

7 Higher level NLP tasks

A
  1. Spelling/grammatical error ID and recovery
  2. Named entity recognition (e.g. categorize as persons, locations, med, disease, etc.)
  3. Word sense disambiguation (determining a homograph’s correct meaning)
  4. Negation and uncertainty detection
  5. Relationship extraction
  6. Temporal inferences extraction
  7. Information extraction
105
Q

Difference between semi-supervised and weakly supervised ML models

A

Semi-supervised: Some data are labeled while other data are not

Weakly supervised: Small amount of data have detailed labels; rest of data have fewer labels