Domain 4: Data Governance & Data Analytics Flashcards
What is the ACID test for reliable database (DB) transactions?
Atomicity - transaction is indivisible (all-or-nothing)
Consistency - database is always in a valid state
Isolation - 2 transactions running simultaneously will not interfere with each other
Durability - once a transaction is committed, its changes persist in the data store even after a crash or power loss
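Atomicity can be sketched with Python's built-in sqlite3 module. The two-account table and the transfer helper below are hypothetical, made up for illustration; the point is only that a failed transfer leaves no partial update behind.

```python
import sqlite3

# Hypothetical two-account ledger, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (id INTEGER PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money between accounts as one all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, 1, 2, 30)       # both updates are committed together
failed = transfer(conn, 1, 2, 500)  # debit violates CHECK, so the credit is undone too
balances = dict(conn.execute("SELECT id, balance FROM account"))
```

After the failed transfer the balances are the same as after the first one, which is the atomicity property; the CHECK constraint plays the role of a consistency rule.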
What is included in the OSI 7 layer model
<Host>
DATA
7. Application (e.g. POP3/IMAP4 for email, HTTP for web content, FTP for file transfer, SSH for secure remote login, HTTPS for secure browsing)
6. Presentation (Encryption, decryption, conversion to character sets like ASCII)
5. Session (Lightweight Directory Access Protocol = LDAP for authenticating users, SSL)
SEGMENTS
4. Transport (Transmission of segments via TCP, UDP)
<Media>
TCP Packet/UDP Datagram
3. Network (Transmission of packets via IP, DHCP (used to assign IP addresses to hosts), DNS server, through routers)
Frame
2. Data Link (transmission of frames via Ethernet, PPP (Point-to-Point Protocol), ARP (Address Resolution Protocol))
Bit
1. Physical (transmission of binary bits via copper wire, coaxial or fiber optic cable)
</Media></Host>
Global variables referenced by what symbol?
Up arrow symbol (later became a caret “^”)
In an event monitor, a high test result is a?
Condition
What is evoking strength in INTERNIST-1/QMR
PPV
What is the observed value for an item in a ML data set?
Label
What is an example of an expert system?
CADUCEUS
Examples of reinforcement learning
Markov decision process (use with known model)
- Q-learning
Monte Carlo (use when one or more elements are unknown)
What are measures of information retrieval success?
Precision (PPV) = % of returned documents that are relevant to the query
Recall (sensitivity) = % of all relevant documents in the corpus that were found
Fall-out (false positive rate) = % of irrelevant documents that are retrieved
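A minimal sketch of the three retrieval measures, computed from counts of true/false positives and negatives. The counts in the example are invented for illustration.

```python
def retrieval_metrics(tp, fp, fn, tn):
    """Return (precision, recall, fall_out) from retrieval counts.

    tp: relevant documents returned      fp: irrelevant documents returned
    fn: relevant documents missed        tn: irrelevant documents not returned
    """
    precision = tp / (tp + fp)  # share of returned documents that are relevant
    recall = tp / (tp + fn)     # share of relevant documents that were found
    fall_out = fp / (fp + tn)   # share of irrelevant documents that were returned
    return precision, recall, fall_out


# Hypothetical query: 20 documents returned, 15 of them relevant;
# the corpus holds 100 documents, 25 of which are relevant.
precision, recall, fall_out = retrieval_metrics(tp=15, fp=5, fn=10, tn=70)
```

Here precision is 15/20, recall is 15/25, and fall-out is 5/75.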
Difference between bias and variance
Bias = measure of inaccuracy
Variance = measure of imprecision
What is the null error rate
Rate of being wrong if you ALWAYS pick the majority class.
Ex) If the majority class has 105 instances out of 165 total instances, null error rate = (165-105)/165 ≈ 36.4%
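The same arithmetic as a short sketch; the label names "neg" and "pos" are placeholders, and only the 105-of-165 split comes from the card.

```python
from collections import Counter


def null_error_rate(labels):
    """Error rate of a classifier that always predicts the majority class."""
    counts = Counter(labels)
    majority = max(counts.values())
    return (len(labels) - majority) / len(labels)


# 105 majority-class instances out of 165 total (class names are placeholders).
labels = ["neg"] * 105 + ["pos"] * 60
rate = null_error_rate(labels)  # (165 - 105) / 165
```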
What statistical models to use to screen for low incidence conditions
- Fbeta score
- Matthews Correlation Coefficient
- Stratified K-fold cross-validation
Measures to evaluate regression methods (numerical output)
- Root mean squared error (RMSE): lower = better fit
- Correlation coefficient (r): strength of relationship between x and y on a scatter plot. No correlation is r = 0.
- Coefficient of determination (r², r squared): Goodness of fit. Represents % variation in y that is explained by variation in x. 100% = perfectly fit model.
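A rough sketch of these regression measures in plain Python. The x and y values are invented, and squaring r to get the coefficient of determination assumes a simple linear regression with one predictor.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: lower means a better fit."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Invented data that is close to the line y = 2x.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
r = pearson_r(x, y)
r_squared = r ** 2  # share of variation in y explained by x (one-predictor linear case)
```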
Measures to evaluate classification methods (categorical output)
- Confusion Matrix: Assessment of model’s “confusion” in analyzing instances (i.e., assigning instances to the wrong class or outcome)
- Cohen’s Kappa: measure of how well the classifier performs as compared to random chance
- Reliability diagram: Assess calibration by plotting observed event frequency against predicted probability
Measure used to evaluate both supervised & unsupervised models
- Receiver Operating Characteristic (ROC) curve: Plot sensitivity against 1-specificity
- Area Under the Curve (AUC): AKA concordance (c) statistic. AUC = 1 –> perfect discrimination.
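The concordance reading of AUC can be sketched directly: it is the share of (positive, negative) pairs in which the positive instance gets the higher score, with ties counted as half. The scores below are invented.

```python
def auc_concordance(scores_pos, scores_neg):
    """AUC as the c-statistic: share of positive/negative pairs ranked correctly."""
    concordant = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5  # ties count as half
    return concordant / (len(scores_pos) * len(scores_neg))


perfect = auc_concordance([0.9, 0.8], [0.1, 0.2])  # every pair ranked correctly
partial = auc_concordance([0.9, 0.4], [0.5, 0.1])  # 3 of 4 pairs ranked correctly
```

This pairwise loop is quadratic in the number of instances, so it is meant to show the definition rather than to be used on large data sets.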
What is the most important, time-consuming and expensive part of developing ML model
Gathering appropriate data (instances)
Difference between validation and testing data set
- Validation data set used to evaluate PRELIM model to tune model
- Testing data set used for final evaluation of model. No further change is anticipated.
General rule for feature to instance ratio
Select <= 1 feature for each 10 instances in the development data set
High # of features –> overfitting
Methods for feature selection
- Forward selection (iterative inclusion)
- Backward selection (iterative removal)
- Stepwise selection (combination of forward & backward selection)
- Forced inclusion
Methods for reducing number of features when an unsupervised model is overfit
Dimensionality reduction method
Methods for validating model when data (instance) is limited
- K-fold cross-validation: training data randomly partitioned into equal # (k) subsamples (folds)
- Leave One Out Cross-Validation (LOOCV): extreme case of k-fold
- Bootstrapping
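A minimal sketch of k-fold partitioning on instance indices. The function name and the fixed seed are assumptions for illustration; setting k equal to the number of instances gives LOOCV.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)      # random partition of the instances
    folds = [indices[i::k] for i in range(k)]  # k roughly equal-sized folds
    for i in range(k):
        validation = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, validation


splits = list(k_fold_indices(10, 5))  # 5 folds: 8 training and 2 validation instances each
loocv = list(k_fold_indices(4, 4))    # k = n: one validation instance per fold
```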
Methods for optimizing (aka tuning) model
- Tune hyperparameters
- Dimensionality reduction
- Regularization: force algorithm to build a less complex model (i.e. more generalizable –> less likely to overfit data)
Types of deployed model
- Static model (most common in medicine)
- Incremental/continuous model
ML algorithm that can be used for both supervised and unsupervised algorithms
Neural networks (aka connectionist systems)
ML algorithm used for supervised algorithms
- Regression Methods
- Classification Methods
- Ensemble Methods
ML algorithm used for unsupervised algorithms
- Clustering Methods
- Association Rules
- Dimensionality Reduction Methods
How many hidden layers does Deep Artificial Neural Network usually have?
> 3
Different types of ANN
- Feed-forward network (unidirectional from input to output)
- Multilayer Perceptron* (MLP)
- Perceptron: iterative algorithm that determines best values for the coefficient vector
- Convolutional neural network (CNN)
- Recurrent neural network (RNN)
- Generative Adversarial Network (GAN)
Common output of convolutional neural network (CNN)
Image classification and/or image feature selection (e.g. saliency map showing which features considered more relevant)
Common use of recurrent neural network (RNN)
NLP
Common use of generative adversarial network
Use to generate DeepFake images and simulate cat-and-mouse fraud schemes
How does generative adversarial network work?
Pairs of deep learning neural networks (discriminator network vs. generator network) trained in tandem repeatedly
What are examples of classification models
- Logistic regression
- Naive Bayes Classifier
- Support Vector Machines
- k-Nearest Neighbor
- Decision Trees (most common method)
What do internal and leaf nodes represent in a decision tree?
Internal node = 1 feature (independent variable)
Leaf node = outcome class (dependent variable)
Methods for checking the quality (purity) of splits in a decision tree model
- Gini impurity (0 with single class populations)
- Entropy (high when large number of evenly mixed classes)
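Both measures can be computed from the class proportions at a node. This is a small sketch with invented labels; entropy is taken in bits (log base 2), which is a common but not the only convention.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Proportion of each class among the labels at a node."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def gini_impurity(labels):
    """1 minus the sum of squared class proportions; 0 for a single-class node."""
    return 1 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Shannon entropy in bits; highest when many classes are evenly mixed."""
    return -sum(p * math.log2(p) for p in class_proportions(labels))


pure = ["a"] * 4              # single-class population
mixed = ["a", "b", "c", "d"]  # four evenly mixed classes
```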
Types of ensemble methods
Parallel ensembles (uses unweighted voting)
Sequential (series) ensembles (uses weighted voting)
What are methods to create diversity in parallel ensembles to help decrease overfitting
- Bagging (bootstrap aggregating): bootstrap each model in the ensemble
- Random subspaces: Use random subset of features per model
- Random forest: Ensemble of randomly selected decision trees to make a “forest.” Uses BOTH bagging and random subspaces.
How are sequential (series) ensembles constructed?
- Combines constrained (weak learner) models into a single strong learner
- Constructed using boosting methods, which uses weighted voting
Difference between hard and soft clustering
- Exclusive (hard) clustering: An instance can only belong to 1 cluster
- Fuzzy (soft) clustering: An instance can have more than 1 cluster assignment
Types of boosting methods
- AdaBoost (Adaptive Boosting): misclassified data from each algorithm have weights increased
- Gradient boosting: similar to AdaBoost except each model trains on the residual errors of the previously run model
- CatBoost
Types of clustering methods
- Hierarchical clustering: instances grouped based on similarities and differences
- Probabilistic clustering: instances clustered based on probability that they belong to a particular distribution
- K-Means clustering: Assign instances to a manually defined number (K) of clusters based on similarity –> Compute the distance of every instance from each centroid (center of a cluster) using a defined distance metric. Assign each instance to the cluster with the closest centroid, then recompute each centroid from its members. Repeat until the centroids stop moving.
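The K-means loop can be sketched in plain Python as below. Euclidean distance, the starting centroids, and the sample points are assumptions chosen for illustration; real implementations usually pick starting centroids more carefully.

```python
import math


def k_means(points, centroids, max_iter=100):
    """Cluster points (tuples) around the given starting centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(p, centroids[i]),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # centroids stopped moving
            break
        centroids = new_centroids
    return centroids, clusters


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = k_means(points, centroids=[(0, 0), (10, 10)])
```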
What is a silhouette coefficient
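Compares cluster cohesion (how close an instance is to its own cluster, e.g. via sum of squared error, SSE) with cluster separation (distance to the nearest other cluster)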
1: clusters well apart
0: clusters indifferent
-1: clusters assigned incorrectly
Used for K-means clustering
Types of association rules
- Market basket analysis (e.g. customers who bought X also bought Y)
- Apriori algorithm: Uses a hash tree to count item sets navigating through data set in breadth first manner
Dimensionality reduction methods
- Principal Component Analysis (principal component = axis through data that is a function of contribution of variability in a population. Each principal component has to be orthogonal to all other principal components)
- Singular value decomposition
- Autoencoders
What is denormalization of data
Intentional duplication of data to improve database performance
3 Database integrity requirements
- Entity integrity: every table in the database has a unique primary key
- Referential integrity: Whenever a database column refers to a row in another table, that row exists
- Domain integrity: specific list of values that are acceptable for a particular column
What is F1 score and how is it calculated?
F-Score (aka F-measure or F1 score) = harmonic mean between precision and recall
F1 = 2 * (Precision x Recall) / (Precision + Recall)
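A small sketch of the formula, written in the general F-beta form so that beta = 1 gives F1. The precision and recall values are invented.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta=1 gives F1."""
    if precision == 0 and recall == 0:
        return 0.0  # avoid dividing by zero when nothing relevant is retrieved
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


f1 = f_beta(0.75, 0.6)          # 2 * (0.75 * 0.6) / (0.75 + 0.6)
f2 = f_beta(0.75, 0.6, beta=2)  # beta > 1 weights recall more heavily
```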
What is tokenization?
Process of breaking documents into searchable items. Tokenization must be done before items are placed into an index.
Which SQL keyword can be used to eliminate duplicate rows from a query result?
DISTINCT
Equation for P(A, B), the joint probability of events A and B occurring simultaneously
P(A, B) = P(A | B) x P(B) = P(B | A) x P(A)
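Rearranging this identity gives Bayes' theorem, P(A | B) = P(B | A) x P(A) / P(B). A sketch for a diagnostic test follows; the prevalence, sensitivity, and specificity figures are invented for illustration.

```python
def posterior(prior, sensitivity, specificity):
    """P(disease | positive test) via Bayes' theorem."""
    p_pos_given_disease = sensitivity
    p_pos_given_healthy = 1 - specificity
    # Total probability of a positive test, P(B).
    p_pos = p_pos_given_disease * prior + p_pos_given_healthy * (1 - prior)
    return p_pos_given_disease * prior / p_pos


# Hypothetical test: 1% prevalence, 90% sensitivity, 95% specificity.
ppv = posterior(prior=0.01, sensitivity=0.90, specificity=0.95)
```

With these numbers the posterior is 0.009 / 0.0585, roughly 15%, which shows how a low prior keeps the PPV low even for a fairly accurate test.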
What is the DIKW Model
- Data = Item or signal with little to no meaning by itself
- Information = Data + Meaning
- Knowledge = Patterns and relationships between pieces of info
- Wisdom = Understanding and internalization of knowledge to apply it appropriately
Difference between calibration and discrimination as metrics for an expert system
Discrimination measures ability to tell the difference between two sets of patients
Calibration measures the degree to which the system produces an accurate probability
What do a class, instance, attribute and method in OOP correspond to in RDBMS?
- Class –> DB relation (table)
- Instance –> Tuple (row)
- Attribute –> Attribute (column)
- Method (accessors, mutators) –> Database CRUD function
What is the difference between primary and foreign key?
Primary key: an attribute that uniquely identifies a tuple (row)
Foreign key: an attribute whose values must have matching values in the primary key of another table
Difference between inner, outer and cross join
Inner join
- Only includes rows that match both tables
- Can be written as WHERE statement
Left outer join
- Include all rows in the left table (i.e. primary table specified in “FROM”) and display blanks from right
Right outer join
- Include all rows in the right table, display blanks from left
Full Outer Join
- Includes all rows in both tables, blanks in both
Cartesian (Cross) Join
- Give you “cross product” of both tables
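The join types can be tried with Python's built-in sqlite3 module. The patient and visit tables are hypothetical, and only INNER, LEFT OUTER, and CROSS joins are shown because RIGHT and FULL OUTER JOIN support depends on the SQLite version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE visit (id INTEGER PRIMARY KEY, patient_id INTEGER, reason TEXT);
INSERT INTO patient VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO visit VALUES (10, 1, 'checkup'), (11, 3, 'unmatched visit');
""")

# Inner join: only rows with a match in both tables.
inner = conn.execute(
    "SELECT p.name, v.reason FROM patient p "
    "INNER JOIN visit v ON v.patient_id = p.id"
).fetchall()

# Left outer join: every patient, with NULL (None) where no visit matches.
left = conn.execute(
    "SELECT p.name, v.reason FROM patient p "
    "LEFT OUTER JOIN visit v ON v.patient_id = p.id ORDER BY p.id"
).fetchall()

# Cross join: every patient paired with every visit (2 x 2 rows).
cross = conn.execute(
    "SELECT p.name, v.reason FROM patient p CROSS JOIN visit v"
).fetchall()
```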
What is the “LIKE” condition used for in SQL?
- Match strings
- LIKE is case sensitive in many DBMSs (behavior depends on the DBMS and collation)
- UPPER([char]) & LOWER([char]) convert to all upper or lower case
Ex) Like ‘B%’ –> % matches any length
Ex) Like ‘Smith_’ –> “_” must match a single character
What are the strengths and limitations of hierarchical database
Strengths
- Optimized for rapid transactions of hierarchical data
- Makes it easy to know every attribute about one thing
Limitations
- Can only traverse tree from root (top parent) node
- Child nodes can only have 1 parent –> difficult to model relationships between child nodes
What is an example of hierarchical database?
MUMPS (MGH Utility Multi-Programming System)
Example of object oriented database management system
InterSystems Caché (OODBMS behind Epic EHR)
Benefits of object oriented database management systems
- Support for more data types (e.g. graphics, photo, video, webpages)
- Usually integrated into the programming language so accessing data doesn’t require complex driver configuration
- Most web app frameworks support interaction with OODBMS
Tools of Unified Modeling Language
- Class diagram (describes object oriented classes like class title, attributes, methods, inheritance)
- Activity diagram (~process flowchart)
- Use case diagram (describes actors, goals and dependencies)
- Entity-Relationship (ER) diagram (describes objects and their relationships) –> used to define RDBMS logical schema
What level of normalization is considered to be sufficient to call a database “normalized”?
Third Normal Form (3NF)
Pros & Cons of normalized database
Pros
- Reduce inconsistencies and dependencies in relational databases
- Safe against most INSERT, UPDATE and DELETE anomalies
Cons
- To generate a report you have to “denormalize” the data
- Requires lots of PK, FK, JOIN logic in your query
What is the ETL process
Extract-Transform-Load process gets transactional data into a format that can be integrated into a Data Warehouse for reporting/queries
What is TCP/IP
TCP = transmission control protocol
IP = internet protocol
What is the difference between TCP & UDP (user datagram protocol)
- TCP requires acknowledgment. UDP does not.
- TCP guarantees sequence/order of packets. UDP does not.
- TCP used when packet loss is unacceptable
- UDP used where packet loss is less important (e.g. VoIP or streaming protocols)
What is CODEC?
Coder / Decoder - compression algorithm used for digital stream to transmit audio, video, etc. Can be “lossy” or “lossless.”
Examples of short range wireless standards
Short range PAN (personal area network):
- RFID (one way) and NFC (two way)
- IEEE 802.15 - wireless PAN and derivatives (Bluetooth and infrared)
Examples of medium range wireless standards
Medium Range WLAN (Wireless Local Area Network)
- 802.11b
- 802.11g
- 802.11n
- 802.11ac
Long Range Wireless Standards
WiMax
CDMA
3G
4G / 4G LTE
5G
Advantages of Bluetooth Low Energy (aka Bluetooth Smart)
- Low battery consumption
- Limited need for data transfer
- One-way communication in close proximity
- BLE Beacons broadcast packets of data at regular intervals and devices pick them up, detected by pre-installed apps or services
- Popular use in health and fitness, indoor navigation, proximity-based marketing
Requirement for first normal form DB
Each cell contains a single value
Each record is unique
Requirements for second normal form DB
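Table must be in 1NF, and every non-key column must depend on the whole primary key (no partial dependencies). A single-column, non-composite primary key guarantees this.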
Requirements for third normal form DB
Table must meet all criteria for 1NF AND 2NF AND must have no “transitive functional dependencies,” which means no non-key column depends on another non-key column (changing one non-key value should not force a change to another non-key column).
What are the assumptions of Bayes’ theorem and the corresponding limitations for diagnosis of disease?
- Conditional independence of predictors –> Findings in disease are usually not conditionally independent
- Mutual exclusivity of conditions –> diseases may not be mutually exclusive
- Calculation is not simple –> when there are multiple findings, computation becomes complex quickly
Difference between notifiable vs. reportable diseases in the context of public health reporting
Notifiable diseases must be reported to CDC
Reportable diseases must be reported to states
Difference between validation & verification of software/systems
Validation
- Extensive testing including regression testing
- Used on non-FDA approved software/systems
Verification
- Limited tests on sampling of functions
- Used for FDA-approved software/systems to make sure it did not break in transit or installation
Difference between data reconciliation and data validation
Data reconciliation is performed during data migration (large scale data transfer) and is followed by data validation of a small subset of the data. Performed one time or in several large batches.
Data validation is used for continuously interfaced data that is transferred in small scale. Performed in real-time or very frequently.
What is automation bias?
Assumption that the computer is right even when its output doesn’t make sense
What is Polanyi’s paradox?
Explains the cognitive phenomenon that there exist many tasks which we, human beings, understand intuitively how to perform but cannot verbalize their rules or procedures
How to achieve AI Intelligibility from an ethical perspective
Transparency + Explainability
What is the goal of reinforcement learning?
Learn policy through trial and error while optimizing long term reward (i.e. value)
What method is used to reduce loss function (measure of deviation from correct output)
Stochastic gradient descent
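A toy sketch of stochastic gradient descent fitting a single slope w in y ≈ w·x with a squared-error loss. The data, learning rate, and epoch count are assumptions picked so the example converges; it updates on one instance at a time, which is what makes it "stochastic".

```python
import random


def sgd_fit_slope(xs, ys, lr=0.01, epochs=200, seed=0):
    """Fit w in y ≈ w * x by stochastic gradient descent on squared error."""
    rng = random.Random(seed)
    data = list(zip(xs, ys))
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)               # visit instances in random order
        for x, y in data:
            grad = 2 * (w * x - y) * x  # derivative of (w*x - y)^2 with respect to w
            w -= lr * grad              # step against the gradient to reduce the loss
    return w


w = sgd_fit_slope([1, 2, 3, 4], [2, 4, 6, 8])  # data generated by y = 2x
```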
Correlation between variance and fit
High variance —> overfit
Low variance —> underfit
Types of regularization methods
Mathematical methods
- Least Absolute Shrinkage and Selection Operator (LASSO) (L1) regularization
- Ridge (L2) regularization
- Elastic Net regularization (combo of L1 & L2)
- Special regularization for neural networks
Non mathematical methods
- Early stopping, pruning decision trees
What is back-propagation in ANN
Process where ANN learns whether it made a mistake or not based on output.
Adjust internal parameters of transfer functions (nodes) using loss functions and stochastic gradient descent functions in waves propagating backwards from the output nodes to the input nodes
How does a recurrent neural network work and what is an example?
Ex) Long Short-Term Memory (LSTM)
- Data storage units called gated nodes can flexibly represent short/long-term data
- Output recurrently feeds back on itself to inform next prediction (i.e. analyzes current and past data)
- Involves context nodes (network nodes that can accumulate historical data)
How does a convolutional neural network work?
In the convolution (filter) layer data that match the pattern of weights are amplified (creates hotspots)
Pooling layer masks data except for hot spots
Which machine learning algorithm requires the least number of training instances?
Regression methods
When to use polynomial regression and what is the limitation
Modeling non-linear data (e.g. growth rates, progression of pandemics, etc.)
Uses powers (exponents) of a SINGLE input variable
Higher risk of overfitting
Goal of regression vs classification algorithms
Regression: Minimize the loss function
Classification: Maximize the likelihood (maximum likelihood estimation)
How does support vector machine work?
Determines optimal boundary between 2 classes in multidimensional space by maximizing margin between support vectors of different classes
- 2 features –> boundary is a line
- 3 features –> boundary is a plane
- >3 features –> boundary is a HYPERPLANE
Support vector = data point closest to the boundary (hardest to classify)
- Focus on borderline is UNIQUE (outliers automatically ignored)
What are common uses of Support Vector Machine?
- Fetal aneuploidy screening
- Prediction of metastasis from gene profiles
- Autoverification of GC/MS in the lab
How does k-nearest neighbor work?
- Instance-based method
- Plots instances in multi-dimensional space. Knowledge is stored in the structure of the mapped data. Training data is not discarded.
Difference between hard & soft voting when random forest is used for classification
Hard voting: most frequent class selected is voted for
Soft voting: average the predicted probabilities for each class across all trees, then select the class with the highest average probability
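The two voting schemes can disagree, as this sketch shows. The three per-tree probability outputs are invented for illustration.

```python
from collections import Counter


def hard_vote(probas):
    """Each model votes for its top class; the most frequent class wins."""
    votes = [max(p, key=p.get) for p in probas]
    return Counter(votes).most_common(1)[0][0]


def soft_vote(probas):
    """Average the class probabilities across models; the highest average wins."""
    classes = probas[0].keys()
    averages = {c: sum(p[c] for p in probas) / len(probas) for c in classes}
    return max(averages, key=averages.get)


# Hypothetical outputs from three trees for one instance.
tree_outputs = [
    {"benign": 0.55, "malignant": 0.45},
    {"benign": 0.60, "malignant": 0.40},
    {"benign": 0.05, "malignant": 0.95},
]
hard = hard_vote(tree_outputs)  # two of three trees favor "benign"
soft = soft_vote(tree_outputs)  # average probability favors "malignant" (0.6 vs 0.4)
```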
2 main types of hierarchical clustering
- Agglomerative clustering (bottom up approach; most common)
- Divisive clustering (top down approach; not common)
What is the output of hierarchical clustering?
Dendrogram (tree diagram)
Example of probabilistic clustering
Gaussian Mixture Model
- Determine which gaussian distribution an instance belongs to
How are association rules rated?
- Support - % total transactions from a transaction database that the rule satisfies
- Confidence - degree of certainty of an association (of the transactions that contain X, the % that also contain Y)
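A short sketch of both ratings for a rule X –> Y over a toy transaction database. The baskets are invented for illustration.

```python
def support(transactions, itemset):
    """Share of all transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the share also containing the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
rule_support = support(baskets, {"bread", "milk"})          # 2 of 4 baskets
rule_confidence = confidence(baskets, {"bread"}, {"milk"})  # 2 of the 3 bread baskets
```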
What is discretization and when is it necessary to do discretization?
Converting continuous data into bins (categorical data)
Ex) Discretization must be done to use data with association rules
What is multi-collinearity and how do you fix it?
Data having redundant features/attributes
Can be reduced using dimensionality reduction methods
How does bootstrapping work
Random sampling WITH replacement
Model is trained on each bootstrapped sample + validated on out-of-bag sample
~36.8% of instances are out-of-bag (OOB) for each bootstrapped sample (≈ 1/e)
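The 36.8% figure follows from sampling n instances with replacement: each instance is missed with probability (1 - 1/n)^n, which approaches 1/e. A quick simulation sketch, with the sample size and seed chosen arbitrarily:

```python
import random


def bootstrap_sample(n, rng):
    """Draw n indices with replacement; return (in_bag, out_of_bag) index lists."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    out_of_bag = sorted(set(range(n)) - set(in_bag))
    return in_bag, out_of_bag


rng = random.Random(0)
n = 10_000
in_bag, oob = bootstrap_sample(n, rng)
oob_fraction = len(oob) / n  # observed share of instances never drawn
expected = (1 - 1 / n) ** n  # theoretical share, close to 1/e
```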
What is one-hot encoding
- Transforming categories into an array of binary switches, one item per category
- Adds dimensionality to features
Example:
Melanoma = [1,0,0]
Dysplastic nevus = [0,1,0]
Benign nevus = [0,0,1]
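The same encoding as a small sketch; the helper name one_hot is an assumption, and the category order fixes which position each switch occupies.

```python
def one_hot(categories):
    """Build an encoder that maps a category to a binary vector."""
    index = {category: i for i, category in enumerate(categories)}

    def encode(value):
        vector = [0] * len(categories)  # one switch per category
        vector[index[value]] = 1        # turn on the switch for this category
        return vector

    return encode


encode = one_hot(["Melanoma", "Dysplastic nevus", "Benign nevus"])
melanoma = encode("Melanoma")
benign = encode("Benign nevus")
```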
7 Examples of low-level NLP tasks
- Sentence boundary detection
- Tokenization
- Part-of-speech assignment (complicated by gerunds, which are nouns formed from verbs + ing)
- Morphological decomposition of compound words (e.g. nasogastric)
- Shallow parsing (chunking) - identifying phrases (e.g. noun phrase comprised of adjective + noun)
- Problem-specific segmentation
- Coreference resolution (e.g. determining that Mr. XXX, he and his refer to same person)
7 Higher level NLP tasks
- Spelling/grammatical error ID and recovery
- Named entity recognition (e.g. categorize as persons, locations, med, disease, etc.)
- Word sense disambiguation (determining a homograph’s correct meaning)
- Negation and uncertainty detection
- Relationship extraction
- Temporal inferences extraction
- Information extraction
Difference between semi-supervised and weakly supervised ML models
Semi-supervised: Some data are labeled while other data are not
Weakly supervised: Small amount of data have detailed labels; rest of data have fewer labels