Domain 4: Data Governance & Data Analytics Flashcards
What is the ACID test for reliable database (DB) transactions?
Atomicity - transaction is indivisible (all of it is applied or none of it)
Consistency - database is always in a valid state
Isolation - 2 transactions run simultaneously will not interfere with each other
Durability - once a transaction is committed, its changes remain in the persistent data store, even after a crash or power loss
What is included in the OSI 7-layer model?
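Worked example (a minimal sketch, not a full ACID demonstration): atomicity shown with Python's built-in sqlite3 module. The accounts table, the CHECK constraint, and the transfer helper are illustrative assumptions, not part of the flashcard.

```python
import sqlite3

# In-memory database; the schema below is a hypothetical example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('a', 100), ('b', 0)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move amount from src to dst; both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The debit violated the CHECK constraint, so the credit is undone too.
        return False


ok = transfer(conn, "a", "b", 30)       # valid transfer
failed = transfer(conn, "a", "b", 500)  # would overdraw, so it is rolled back
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

After the failed transfer, the credit to account b has been rolled back along with the rejected debit, which is the "indivisible" property in practice.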
- Application (e.g. POP3/IMAP4, HTTP, FTP, SSH and HTTPS)
- Presentation (Encryption, decryption, conversion to character sets like ASCII)
- Session (Lightweight Directory Access Protocol = LDAP, SSL)
- Transport (TCP, UDP)
- Network (IPv4, IPv6, DHCP)
- Data Link (Address Resolution Protocol = ARP)
- Physical (transmission of binary bits via copper wire, coaxial or fiber optic cable)
In MUMPS, global variables are referenced by what symbol?
Up arrow symbol (later became a caret “^”)
In an event monitor, a high test result is a?
Condition
What is evoking strength in INTERNIST-1/QMR?
Positive predictive value (PPV)
What is the observed value for an item in an ML data set?
Label
What is an example of an expert system?
CADUCEUS
Examples of reinforcement learning
Markov decision process (use with known model)
Monte Carlo (use when one or more elements are unknown)
What are measures of information retrieval success?
Precision (PPV) = % of returned documents that are relevant to the query
Recall (sensitivity) = % of all relevant documents in the corpus that were found
Fall-out (false positive rate) = % of irrelevant documents that are retrieved
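Worked example (a sketch with made-up document IDs): the three measures above computed from sets of retrieved and relevant documents. The function name and the toy corpus are assumptions for illustration.

```python
def retrieval_metrics(retrieved, relevant, corpus):
    """Return (precision, recall, fall_out) for one query."""
    retrieved, relevant, corpus = set(retrieved), set(relevant), set(corpus)
    true_pos = len(retrieved & relevant)   # relevant documents that were returned
    false_pos = len(retrieved - relevant)  # irrelevant documents that were returned
    irrelevant = corpus - relevant
    precision = true_pos / len(retrieved)
    recall = true_pos / len(relevant)
    fall_out = false_pos / len(irrelevant)
    return precision, recall, fall_out


# Hypothetical corpus of 10 documents, 4 of them relevant, 3 returned.
corpus = range(1, 11)
relevant = {1, 2, 3, 4}
retrieved = {1, 2, 5}
precision, recall, fall_out = retrieval_metrics(retrieved, relevant, corpus)
```

Here 2 of the 3 returned documents are relevant (precision 2/3), 2 of the 4 relevant documents were found (recall 1/2), and 1 of the 6 irrelevant documents was returned (fall-out 1/6).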
Difference between bias and variance
Bias = measure of inaccuracy
Variance = measure of imprecision
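Worked example (a sketch with invented numbers): bias as the distance of the average estimate from the true value, variance as the spread of the estimates around their own average. The two sample lists are hypothetical.

```python
import statistics

true_value = 10.0

# Hypothetical repeated estimates of the same quantity.
biased_precise = [12.0, 12.1, 11.9, 12.0]  # tightly clustered, but off target
unbiased_noisy = [7.0, 13.0, 9.0, 11.0]    # centred on target, but scattered


def bias(estimates, truth):
    """Average estimate minus the true value (inaccuracy)."""
    return statistics.mean(estimates) - truth


def variance(estimates):
    """Population variance of the estimates (imprecision)."""
    return statistics.pvariance(estimates)


bias_a, var_a = bias(biased_precise, true_value), variance(biased_precise)
bias_b, var_b = bias(unbiased_noisy, true_value), variance(unbiased_noisy)
```

The first list has high bias and low variance; the second has no bias and high variance.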
What is the null error rate?
Rate of being wrong if you ALWAYS pick the majority class.
Ex) If the majority class has 105 instances out of 165 total instances, null error rate = (165-105)/165 ≈ 36.4%
What evaluation methods are suited to screening for low-incidence conditions?
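The same arithmetic as a short sketch; the "yes"/"no" class names are assumed for illustration.

```python
from collections import Counter


def null_error_rate(labels):
    """Error rate of a classifier that always predicts the majority class."""
    counts = Counter(labels)
    majority = max(counts.values())
    return (len(labels) - majority) / len(labels)


# 105 majority-class instances out of 165 total, as in the example above.
labels = ["yes"] * 105 + ["no"] * 60
rate = null_error_rate(labels)  # 60 / 165
```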
- Fbeta score
- Matthews Correlation Coefficient
- Stratified K-fold cross-validation
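Worked example (a sketch; the confusion-matrix counts are invented): the F-beta score and Matthews Correlation Coefficient computed from raw counts for a condition with 1% prevalence. Stratified k-fold is a data-splitting technique and is not shown here.

```python
import math


def f_beta(tp, fp, fn, beta=2.0):
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)


def mcc(tp, fp, fn, tn):
    """Matthews Correlation Coefficient, from -1 to +1 (0 = chance level)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical screening result: 10 true cases among 1000 people.
tp, fn, fp, tn = 8, 2, 10, 980
accuracy = (tp + tn) / (tp + fn + fp + tn)
f2 = f_beta(tp, fp, fn, beta=2.0)
mcc_value = mcc(tp, fp, fn, tn)
```

Accuracy comes out near 99% largely because true negatives dominate, while F2 and MCC are noticeably lower, which is why they are preferred when positives are rare.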
Measures to evaluate regression methods (numerical output)
- Root mean squared error (RMSE): lower = better fit
- Correlation coefficient (r): strength of relationship between x and y on a scatter plot. No correlation is r = 0.
- Coefficient of determination (r², "r squared"): Goodness of fit. Represents % variation in y that is explained by variation in x. 100% = perfectly fit model.
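Worked example (a sketch on made-up data): the three regression measures in plain Python. The data points and the model y = 2x are assumptions for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error of the predictions."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def pearson_r(x, y):
    """Correlation coefficient between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def r_squared(y_true, y_pred):
    """Share of the variation in y_true accounted for by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.0]  # hypothetical observations
y_pred = [2 * v for v in x]     # hypothetical model: y = 2x

error = rmse(y, y_pred)
r = pearson_r(x, y)
r2 = r_squared(y, y_pred)
```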
Measures to evaluate classification methods (categorical output)
- Cohen’s Kappa: measure of how well the classifier performs as compared to random chance
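Worked example (a sketch; the confusion-matrix counts are invented): kappa compares observed agreement with the agreement expected by chance from the row and column totals.

```python
def cohens_kappa(tp, fp, fn, tn):
    """Cohen's kappa for a binary classifier versus the true labels."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Chance agreement: product of marginal rates, summed over both classes.
    chance_pos = ((tp + fp) / n) * ((tp + fn) / n)
    chance_neg = ((fn + tn) / n) * ((fp + tn) / n)
    expected = chance_pos + chance_neg
    return (observed - expected) / (1 - expected)


# Hypothetical counts: 165 instances, 105 of them truly positive.
kappa = cohens_kappa(tp=100, fp=10, fn=5, tn=50)
```

With these counts the observed agreement is 150/165 and the chance agreement is 90/165, giving kappa = 0.8.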
Measure used to evaluate both supervised & unsupervised models
- Receiver Operating Characteristic (ROC) curve: Plot sensitivity against 1-specificity
- Area Under the Curve (AUC): AKA concordance (c) statistic. AUC = 1 –> perfect discrimination.
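Worked example (a sketch with invented scores): ROC points from a threshold sweep, and AUC computed as the concordance statistic, i.e. the share of (positive, negative) pairs in which the positive case gets the higher score. Function names are assumptions.

```python
def roc_points(scores, labels):
    """(1 - specificity, sensitivity) pairs, one per distinct threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_concordance(scores, labels):
    """AUC as the fraction of concordant pairs; ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    concordant = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5
    return concordant / (len(pos) * len(neg))


# Hypothetical model scores and true labels (1 = positive).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
auc = auc_concordance(scores, labels)
```

In this toy data 8 of the 9 positive-negative pairs are ranked correctly, so AUC = 8/9.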
What is the most important, time-consuming and expensive part of developing an ML model?
Gathering appropriate data (instances)
Difference between validation and testing data set
- Validation data set used to evaluate PRELIM model to tune model
- Testing data set used for final evaluation of model. No further change is anticipated.
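Worked example (a sketch): one way to carve a development data set into training, validation, and testing subsets. The 60/20/20 proportions and the fixed seed are assumptions, not a recommendation from the flashcard.

```python
import random


def three_way_split(instances, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle once, then slice into (train, validation, test) lists."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(instances)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]                # held back for the final evaluation
    val = shuffled[n_test:n_test + n_val]   # used repeatedly while tuning
    train = shuffled[n_test + n_val:]       # used to fit the model
    return train, val, test


train, val, test = three_way_split(range(100))
```

The testing subset is touched only once, after tuning against the validation subset is finished.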
General rule for feature to instance ratio
Select <= 1 feature for each 10 instances in the development data set
High # of features –> overfitting
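The rule of thumb as arithmetic (a sketch; function names are hypothetical):

```python
def max_features(n_instances, instances_per_feature=10):
    """Largest feature count allowed by the 1-per-10 rule of thumb."""
    return n_instances // instances_per_feature


def exceeds_rule(n_features, n_instances):
    """True when the feature count suggests a risk of overfitting."""
    return n_features > max_features(n_instances)


limit = max_features(250)          # 250 instances allow up to 25 features
too_many = exceeds_rule(40, 250)   # 40 features on 250 instances breaks the rule
```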
Methods for feature selection
- Forward selection (iterative inclusion)
- Backward selection (iterative removal)
- Stepwise selection (combination of forward & backward selection)
- Forced inclusion
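Worked example (a sketch of forward selection only): features are added one at a time, keeping whichever raises the score most, until no candidate improves it. The scoring function and feature names are invented stand-ins; in practice the score would come from fitting a model on validation data.

```python
def forward_selection(candidates, score, min_gain=1e-9):
    """Greedy iterative inclusion; score(list_of_features) -> float, higher is better."""
    selected = []
    best = score(selected)
    remaining = list(candidates)
    while remaining:
        # Try each remaining feature added to the current selection.
        trials = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(trials, key=lambda t: t[0])
        if top_score - best <= min_gain:
            break  # no candidate improves the score enough
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected


# Hypothetical scorer: each useful feature adds a fixed gain,
# and every feature costs a small complexity penalty.
USEFUL = {"age": 0.30, "creatinine": 0.25, "heart_rate": 0.10}


def toy_score(features):
    return sum(USEFUL.get(f, 0.0) for f in features) - 0.01 * len(features)


chosen = forward_selection(["zip_code", "heart_rate", "age", "creatinine"], toy_score)
```

Backward selection runs the same loop in reverse (start with all features, drop the least useful), and stepwise selection alternates between the two.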