Domain 4: Data Governance & Data Analytics Flashcards
What is the ACID test for reliable database (DB) transactions?
Atomicity - transaction is indivisible (all-or-nothing)
Consistency - database is always in a valid state
Isolation - 2 transactions run simultaneously will not interfere with each other
Durability - once committed, a transaction persists in the data store even after a system failure
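A minimal sketch of atomicity using Python's built-in sqlite3 (the accounts table and the simulated failure are illustrative assumptions, not from the source):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    with conn:  # one atomic transaction: commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
        raise RuntimeError("simulated crash mid-transfer")  # the matching credit never runs
except RuntimeError:
    pass

# Atomicity: the partial debit was rolled back, so alice still has 100
print(conn.execute("SELECT balance FROM accounts WHERE name = 'alice'").fetchone())
```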
What is included in the OSI 7-layer model?
Host layers:
7. Application (data) - e.g. POP3/IMAP4 for email, HTTP for web content, FTP for file transfer, SSH and HTTPS for secure browsing
6. Presentation (data) - encryption, decryption, conversion between character sets such as ASCII
5. Session (data) - Lightweight Directory Access Protocol (LDAP) for authenticating users, SSL
4. Transport (segments: TCP segment/UDP datagram) - transmission of segments via TCP, UDP
Media layers:
3. Network (packets) - transmission of packets via IP through routers; DHCP (used to assign IP addresses to hosts); DNS servers
2. Data Link (frames) - transmission of frames via Ethernet, PPP (Point-to-Point Protocol), ARP (Address Resolution Protocol)
1. Physical (bits) - transmission of binary bits via copper wire, coaxial or fiber optic cable
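As a rough illustration (not from the source) of where the layers sit in practice, a plain HTTP request in Python touches layer 7 (HTTP), layer 4 (TCP via the socket API), and layer 3 (IP addressing); example.com is just a placeholder host:

```python
import socket

# Layers 3-4: the OS resolves the hostname to an IP address and opens a TCP connection
with socket.create_connection(("example.com", 80)) as sock:
    # Layer 7: a minimal HTTP/1.1 request written over the TCP stream
    sock.sendall(b"GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n")
    print(sock.recv(200).decode(errors="replace"))  # first bytes of the HTTP response
```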
Global variables (in MUMPS) are referenced by what symbol?
An up-arrow, which later became the caret ("^")
In an event monitor, a high test result is a?
Condition
What is evoking strength in INTERNIST-1/QMR?
Positive predictive value (PPV) - given a finding, how strongly it suggests the disease
What is the observed value for an item in an ML data set?
Label
What is an example of an expert system?
CADUCEUS
Examples of reinforcement learning
- Markov decision process (use with a known model)
- Q-learning (model-free; see the sketch after this list)
- Monte Carlo methods (use when one or more elements of the model are unknown)
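A minimal sketch of the Q-learning update rule on a made-up three-state chain (the environment, rewards, and hyperparameters are illustrative assumptions):

```python
import random

n_states, n_actions = 3, 2
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, epsilon = 0.1, 0.9, 0.2  # learning rate, discount, exploration

def step(s, a):
    # Toy environment: action 1 moves right; reaching the last state pays +1
    s2 = min(s + a, n_states - 1)
    return s2, (1.0 if s2 == n_states - 1 else 0.0)

for _ in range(1000):
    s = 0
    while s != n_states - 1:
        if random.random() < epsilon:
            a = random.randrange(n_actions)                   # explore
        else:
            a = max(range(n_actions), key=lambda i: Q[s][i])  # exploit
        s2, r = step(s, a)
        # Model-free temporal-difference update toward r + gamma * max_a' Q(s', a')
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

print(Q)  # action 1 (move right) should dominate in states 0 and 1
```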
What are measures of information retrieval success?
Precision (PPV) = % of returned documents that are relevant to the query
Recall (sensitivity) = % of all relevant documents in the corpus that were found
Fall-out (false positive rate) = % of irrelevant documents that are retrieved
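A quick worked example under assumed counts (1000 documents in the corpus, 50 relevant, 20 retrieved of which 15 are relevant):

```python
# Assumed retrieval counts, purely for illustration
tp, fp, fn, tn = 15, 5, 35, 945

precision = tp / (tp + fp)  # 15/20  = 0.75:  returned docs that are relevant
recall    = tp / (tp + fn)  # 15/50  = 0.30:  relevant docs that were found
fallout   = fp / (fp + tn)  # 5/950 ~= 0.005: irrelevant docs that were retrieved
print(precision, recall, fallout)
```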
Difference between bias and variance
Bias = measure of inaccuracy (systematic error in predictions)
Variance = measure of imprecision (sensitivity to the particular training data)
What is the null error rate?
Rate of being wrong if you ALWAYS pick the majority class.
Ex) If the majority class has 105 instances out of 165 total instances, null error rate = (165-105)/165 ≈ 0.364
Which statistical measures to use when screening for low-incidence (imbalanced) conditions?
- F-beta score
- Matthews Correlation Coefficient (MCC)
- Stratified K-fold cross-validation
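A hedged sketch of all three with scikit-learn (assuming it is installed; the imbalanced data here is synthetic):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, matthews_corrcoef
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic low-incidence data: roughly 5% positives
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
model = LogisticRegression(max_iter=1000)

# Stratified folds preserve the 95/5 class ratio in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
print(cross_val_score(model, X, y, cv=cv, scoring="matthews_corrcoef"))

pred = model.fit(X, y).predict(X)
print(fbeta_score(y, pred, beta=2))  # beta > 1 weights recall over precision
print(matthews_corrcoef(y, pred))    # robust to class imbalance
```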
Measures to evaluate regression methods (numerical output)
- Root mean squared error (RMSE): lower = better fit
- Correlation coefficient (r): strength of relationship between x and y on a scatter plot. No correlation is r = 0.
- Coefficient of determination (r²): Goodness of fit. Represents the % of variation in y that is explained by variation in x. 100% = perfectly fit model.
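A minimal numpy check of the three regression measures (toy values, assumed):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.4])

rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))  # lower = better fit
r = np.corrcoef(y_true, y_pred)[0, 1]            # strength of x-y relationship
r2 = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)
print(rmse, r, r2)                               # r2 near 1 = good fit
```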
Measures to evaluate classification methods (categorical output)
- Confusion Matrix: Assessment of model’s “confusion” in analyzing instances (i.e., assigning instances to the wrong class or outcome)
- Cohen’s Kappa: measure of how well the classifier performs as compared to random chance
- Reliability diagram: Assess calibration by plotting observed event against predictive probability
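A short scikit-learn sketch of the first two measures (the labels are made up for illustration):

```python
from sklearn.metrics import cohen_kappa_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))   # rows = actual class, columns = predicted
print(cohen_kappa_score(y_true, y_pred))  # 1 = perfect, 0 = chance-level agreement
```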
Measures used to evaluate both supervised & unsupervised models
- Receiver Operating Characteristic (ROC) curve: Plot sensitivity against 1-specificity
- Area Under the Curve (AUC): AKA concordance (c) statistic. AUC = 1 -> perfect discrimination.
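A brief scikit-learn sketch (outcomes and scores are assumed toy values):

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true  = [0, 0, 1, 1, 0, 1]               # observed outcomes
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]  # model's predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # 1-specificity vs sensitivity
print(roc_auc_score(y_true, y_score))              # 1.0 = perfect discrimination
```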
What is the most important, time-consuming, and expensive part of developing an ML model?
Gathering appropriate data (instances)
Difference between validation and testing data sets
- Validation data set: used to evaluate the PRELIMINARY model and tune it
- Testing data set: used for the final evaluation of the model; no further change is anticipated
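One common way to carve out both sets, sketched with scikit-learn on toy data (the 60/20/20 proportions are an assumption, not a rule from the source):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(200).reshape(100, 2), np.arange(100) % 2  # toy data

# Two successive splits yield a 60/20/20 train/validation/test partition
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)
# Tune on (X_val, y_val); touch (X_test, y_test) only once, for the final report
```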
General rule for feature-to-instance ratio
Select <= 1 feature for each 10 instances in the development data set
High # of features -> overfitting
Methods for feature selection
- Forward selection (iterative inclusion)
- Backward selection (iterative removal)
- Stepwise selection (combination of forward & backward selection)
- Forced inclusion
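One way to run the first two methods is scikit-learn's SequentialFeatureSelector (a sketch on a bundled dataset; stepwise and forced inclusion are not built into this class):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)  # 30 candidate features
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=5000),
    n_features_to_select=5,
    direction="forward",  # "backward" gives iterative removal instead
)
selector.fit(X, y)
print(selector.get_support(indices=True))  # indices of the retained features
```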
Methods for reducing the number of features when an unsupervised model is overfit
Dimensionality reduction methods (e.g. principal component analysis)
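A minimal PCA sketch (scikit-learn assumed; 5 components is an arbitrary choice):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA

X, _ = load_breast_cancer(return_X_y=True)        # 30 original features
X_reduced = PCA(n_components=5).fit_transform(X)  # compressed to 5 components
print(X_reduced.shape)                            # (569, 5)
```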
Methods for validating a model when data (instances) are limited
- K-fold cross-validation: training data randomly partitioned into an equal # (k) of subsamples (folds)
- Leave One Out Cross-Validation (LOOCV): extreme case of k-fold where k = the number of instances
- Bootstrapping
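A brief sketch of all three on six toy instances (scikit-learn and numpy assumed):

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(12).reshape(6, 2)  # six toy instances

for train_idx, test_idx in KFold(n_splits=3).split(X):  # k = 3 equal folds
    print("k-fold test fold:", test_idx)

loo = LeaveOneOut()          # extreme case: k = number of instances
print(loo.get_n_splits(X))   # 6 models, each tested on one held-out instance

# Bootstrapping: sample n instances WITH replacement; evaluate on those left out
boot = np.random.default_rng(0).choice(len(X), size=len(X), replace=True)
print("bootstrap sample indices:", boot)
```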
Methods for optimizing (aka tuning) model
- Tune hyperparameters
- Dimensionality reduction
- Regularization: force algorithm to build a less complex model (i.e. more generalizable -> less likely to overfit data)
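A small sketch of regularization with scikit-learn (synthetic data; the alpha values are arbitrary):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=50)  # only feature 0 matters

# alpha is the regularization hyperparameter: larger = simpler, more general model
print(Ridge(alpha=1.0).fit(X, y).coef_.round(2))  # L2: shrinks all coefficients
print(Lasso(alpha=0.1).fit(X, y).coef_.round(2))  # L1: zeroes out weak features
```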
Types of deployed model
- Static model (most common in medicine)
- Incremental/continuous model
ML algorithm that can be used for both supervised and unsupervised learning
Neural networks (aka connectionist systems)