Domain 4: Data Governance & Data Analytics Flashcards
What is the ACID test for reliable database (DB) transactions?
Atomicity - transaction is indivisible (all or nothing)
Consistency - database is always in a valid state
Isolation - two transactions running simultaneously will not interfere with each other
Durability - persistent data store will always contain up-to-date info
What is included in the OSI 7-layer model?
Host layers:
- Data: 7. Application (e.g. POP3/IMAP4 for email, HTTP for web content, FTP for file transfer, SSH and HTTPS for secure browsing)
- Data: 6. Presentation (encryption, decryption, conversion to character sets like ASCII)
- Data: 5. Session (Lightweight Directory Access Protocol = LDAP for authenticating users, SSL)
- Segments: 4. Transport (transmission of segments via TCP, UDP)
Media layers:
- Packets (TCP packet/UDP datagram): 3. Network (transmission of packets via IP, DHCP (used to assign IP addresses to hosts), DNS server, through routers)
- Frames: 2. Data Link (transmission of frames via Ethernet, PPP (Point-to-Point Protocol), ARP (Address Resolution Protocol))
- Bits: 1. Physical (transmission of binary bits via copper wire, coaxial or fiber optic cable)
Global variables referenced by what symbol?
Up arrow symbol (later became a caret “^”)
In an event monitor, a high test result is a?
Condition
What is evoking strength in INTERNIST-1/QMR
PPV (positive predictive value)
What is the observed value for an item in an ML data set?
Label
What is an example of an expert system?
CADUCEUS
Examples of reinforcement learning
- Markov decision process (use with known model)
- Q-learning
- Monte Carlo (use when one or more elements are unknown)
What are measures of information retrieval success?
Precision (PPV) = % of returned documents that are relevant to the query
Recall (sensitivity) = % of all relevant documents in the corpus that were found
Fall-out (false positive rate) = % of irrelevant documents that are retrieved
Difference between bias and variance
Bias = measure of inaccuracy
Variance = measure of imprecision
What is the null error rate
Rate of being wrong if you ALWAYS pick the majority class.
Ex) If majority class has 105 instances out of 165 total instances, null error rate = (165 - 105)/165 = 60/165 ≈ 36%
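A minimal worked sketch of the calculation above; the 105/165 counts come from the card's example.

```python
# Null error rate: the error you get by ALWAYS predicting the majority class.
majority_count = 105   # instances in the majority class (from the card's example)
total_count = 165      # total instances

null_error_rate = (total_count - majority_count) / total_count
print(f"Null error rate = {null_error_rate:.3f}")  # ~0.364, i.e. ~36%
```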
What evaluation measures and techniques to use when screening for low-incidence conditions
- F-beta score
- Matthews Correlation Coefficient
- Stratified K-fold cross-validation
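A minimal scikit-learn sketch of the three items above on a made-up imbalanced data set; the synthetic data, the logistic regression model, and the beta = 2 weighting are illustrative assumptions, not part of the card.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, matthews_corrcoef
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic low-incidence (imbalanced) data: roughly 5% positive class.
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

model = LogisticRegression(max_iter=1000)

# Stratified K-fold keeps the rare class proportion in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
y_pred = cross_val_predict(model, X, y, cv=cv)

# F-beta (beta=2 weights recall more than precision) and Matthews Correlation Coefficient.
print("F2 score:", fbeta_score(y, y_pred, beta=2))
print("MCC     :", matthews_corrcoef(y, y_pred))
```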
Measures to evaluate regression methods (numerical output)
- Root mean squared error (RMSE): lower = better fit
- Correlation coefficient (r): strength of relationship between x and y on a scatter plot. No correlation is r = 0.
- Coefficient of determination (r², i.e. r squared): Goodness of fit. Represents % variation in y that is explained by variation in x. 100% = perfectly fit model.
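A minimal sketch computing the three regression measures above with NumPy/scikit-learn; the observed and predicted values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up observed and predicted values for illustration.
y_true = np.array([3.0, 5.0, 7.5, 9.0, 11.0])
y_pred = np.array([2.8, 5.4, 7.0, 9.3, 10.6])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # lower = better fit
r = np.corrcoef(y_true, y_pred)[0, 1]               # correlation coefficient
r2 = r2_score(y_true, y_pred)                       # share of variation in y explained by the model

print(f"RMSE = {rmse:.3f}, r = {r:.3f}, r^2 = {r2:.3f}")
```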
Measures to evaluate classification methods (categorical output)
- Confusion Matrix: Assessment of model’s “confusion” in analyzing instances (i.e., assigning instances to the wrong class or outcome)
- Cohen’s Kappa: measure of how well the classifier performs as compared to random chance
- Reliability diagram: Assess calibration by plotting observed event against predictive probability
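A minimal scikit-learn sketch of the confusion matrix and Cohen's kappa; the true and predicted labels are made up for illustration.

```python
from sklearn.metrics import confusion_matrix, cohen_kappa_score

# Made-up true and predicted class labels for illustration.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# Confusion matrix: rows = actual class, columns = predicted class.
print(confusion_matrix(y_true, y_pred))

# Cohen's kappa: agreement between classifier and truth, corrected for chance.
print("Kappa:", cohen_kappa_score(y_true, y_pred))

# A reliability diagram can be built from sklearn.calibration.calibration_curve
# by plotting observed event frequency against predicted probability.
```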
Measures used to evaluate both supervised & unsupervised models
- Receiver Operating Characteristic (ROC) curve: Plot sensitivity against 1 - specificity
- Area Under the Curve (AUC): AKA concordance (c) statistic. AUC = 1 –> perfect discrimination.
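A minimal scikit-learn sketch of the ROC curve and AUC; the outcomes and predicted probabilities are made up for illustration.

```python
from sklearn.metrics import roc_curve, roc_auc_score

# Made-up outcomes and predicted probabilities for illustration.
y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_prob = [0.1, 0.3, 0.4, 0.8, 0.2, 0.9, 0.7, 0.5]

fpr, tpr, thresholds = roc_curve(y_true, y_prob)  # fpr = 1 - specificity, tpr = sensitivity
auc = roc_auc_score(y_true, y_prob)               # 1.0 = perfect discrimination, 0.5 = chance
print("AUC (c statistic):", auc)
```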
What is the most important, time-consuming, and expensive part of developing an ML model?
Gathering appropriate data (instances)
Difference between validation and testing data set
- Validation data set: used to evaluate the PRELIMINARY model in order to tune it
- Testing data set: used for the final evaluation of the model. No further change is anticipated.
General rule for feature to instance ratio
Select <= 1 feature for each 10 instances in the development data set
High # of features –> overfitting
Methods for feature selection
- Forward selection (iterative inclusion)
- Backward selection (iterative removal)
- Stepwise selection (combination of forward & backward selection)
- Forced inclusion
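A minimal scikit-learn sketch of forward and backward selection using SequentialFeatureSelector; the breast cancer data set, the k-nearest-neighbor estimator, and the target of 5 features are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# Forward selection: start with no features and iteratively add the most useful one.
# direction="backward" instead starts with all features and iteratively removes them.
selector = SequentialFeatureSelector(KNeighborsClassifier(n_neighbors=3),
                                     n_features_to_select=5, direction="forward")
selector.fit(X, y)

print("Selected feature indices:", selector.get_support(indices=True))
```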
Methods for reducing number of features when an unsupervised model is overfit
Dimensionality reduction methods (e.g. principal component analysis)
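A minimal sketch of one common dimensionality reduction method, principal component analysis (PCA); the card does not name a specific technique, so PCA and the synthetic data are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# Synthetic data with many (possibly redundant) features.
X, _ = make_classification(n_samples=200, n_features=30, random_state=0)

# Project onto the 5 components that explain the most variance.
pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                      # (200, 5)
print(pca.explained_variance_ratio_.sum())  # share of variance retained
```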
Methods for validating model when data (instance) is limited
- K-fold cross-validation: training data randomly partitioned into equal # (k) subsamples (folds)
- Leave One Out Cross-Validation (LOOCV): extreme case of k-fold
- Bootstrapping
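A minimal scikit-learn sketch of the three approaches above; the iris data set and logistic regression model are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.utils import resample

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# K-fold: data split into k equal folds; each fold serves as the validation set once.
kfold_scores = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print("5-fold mean accuracy:", kfold_scores.mean())

# LOOCV: extreme case of k-fold where k = number of instances.
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print("LOOCV mean accuracy:", loo_scores.mean())

# Bootstrapping: sample instances with replacement to form a training set.
X_boot, y_boot = resample(X, y, replace=True, n_samples=len(X), random_state=0)
```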
Methods for optimizing (aka tuning) model
- Tune hyperparameters
- Dimensionality reduction
- Regularization: force algorithm to build a less complex model (i.e. more generalizable –> less likely to overfit data)
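A minimal sketch of hyperparameter tuning with regularization; ridge regression, the diabetes data set, and the alpha grid are illustrative assumptions, not part of the card.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = load_diabetes(return_X_y=True)

# Regularization strength (alpha) is a hyperparameter: larger alpha -> simpler,
# more generalizable model that is less likely to overfit.
grid = GridSearchCV(Ridge(), param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
grid.fit(X, y)

print("Best alpha:", grid.best_params_["alpha"])
print("Best CV score (r^2):", grid.best_score_)
```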
Types of deployed model
- Static model (most common in medicine)
- Incremental/continuous model
ML algorithm that can be used for both supervised and unsupervised learning
Neural networks (aka connectionist systems)
ML algorithms used for supervised learning
- Regression Methods
- Classification Methods
- Ensemble Methods
ML algorithms used for unsupervised learning
- Clustering Methods
- Association Rules
- Dimensionality Reduction Methods
How many hidden layers does a deep artificial neural network usually have?
> 3
Different types of ANN
- Feed-forward network (unidirectional from input to output)
- Multilayer Perceptron (MLP)
- Perceptron: iterative algorithm that determines best values for the coefficient vector
- Convolutional Neural Network (CNN)
- Recurrent Neural Network (RNN)
- Generative Adversarial Network (GAN)
Common output of convolutional neural network (CNN)
Image classification and/or image feature selection (e.g. saliency map showing which features were considered most relevant)
Common use of recurrent neural network (RNN)
NLP
Common use of generative adversarial network
Used to generate DeepFake images and to simulate cat-and-mouse fraud schemes
How does generative adversarial network work?
Pairs of deep learning neural networks (discriminator network vs. generator network) trained in tandem repeatedly
What are examples of classification models
- Logistic regression
- Naive Bayes Classifier
- Support Vector Machines
- k-Nearest Neighbor
- Decision Trees (most common method)
What do an internal node and a leaf node represent in a decision tree?
Internal node = 1 feature (independent variable)
Leaf node = outcome class (dependent variable)
Methods for checking quality of a decision tree model
- Gini impurity (0 with single class populations)
- Entropy (high when large number of evenly mixed classes)
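A minimal sketch of the two purity measures above, computed from class proportions (Gini = 1 - Σp², entropy = -Σ p·log₂p); the example proportions are made up.

```python
import math

def gini(proportions):
    # Gini impurity: 0 for a node containing a single class.
    return 1.0 - sum(p ** 2 for p in proportions)

def entropy(proportions):
    # Entropy: highest when many classes are evenly mixed.
    return -sum(p * math.log2(p) for p in proportions if p > 0)

print(gini([1.0]), entropy([1.0]))            # pure node: 0.0, 0.0
print(gini([0.5, 0.5]), entropy([0.5, 0.5]))  # evenly mixed node: 0.5, 1.0
```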
Types of ensemble methods
Parallel ensembles (uses unweighted voting)
Sequential (series) ensembles (uses weighted voting)
What are methods to create diversity in parallel ensembles to help decrease overfitting
- Bagging (bootstrap aggregating): bootstrap the training data for each model in the ensemble
- Random subspaces: Use random subset of features per model
- Random forest: Ensemble of randomly selected decision trees to make a “forest.” Uses BOTH bagging and random subspaces.
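A minimal scikit-learn sketch of a random forest, which combines bagging with random feature subsets; the breast cancer data set and parameter values are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# bootstrap=True -> bagging; max_features="sqrt" -> random subset of features at each split.
forest = RandomForestClassifier(n_estimators=100, bootstrap=True,
                                max_features="sqrt", random_state=0)
forest.fit(X, y)

print("Training accuracy:", forest.score(X, y))
```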
How are sequential (series) ensembles constructed?
- Combines constrained (weak learner) models into a single strong learner
- Constructed using boosting methods, which uses weighted voting
Difference between hard and soft clustering
- Exclusive (hard) clustering: An instance can only belong to 1 cluster
- Fuzzy (soft) clustering: An instance can have more than 1 cluster assignment
Types of boosting methods
- AdaBoost (Adaptive Boosting): misclassified data from each algorithm have weights increased
- Gradient boosting: similar to AdaBoost, except each model trains on the residual errors of the previously run model
- CatBoost
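A minimal scikit-learn sketch of the first two boosting methods above (CatBoost is a separate third-party library); the breast cancer data set and parameter values are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier

X, y = load_breast_cancer(return_X_y=True)

# AdaBoost: each round re-weights misclassified instances before fitting the next weak learner.
ada = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X, y)

# Gradient boosting: each new tree is fit to the residual errors of the previous model.
gbm = GradientBoostingClassifier(n_estimators=100, random_state=0).fit(X, y)

print(ada.score(X, y), gbm.score(X, y))
```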
Types of clustering methods
- Hierarchical clustering: instances grouped based on similarities and differences
- Probabilistic clustering: instances clustered based on probability that they belong to a particular distribution
- K-Means clustering: Assign instances to manually defined number (K) of clusters based on similarity –> Compute distance of all instances in cluster from the centroid (center of the cluster) using a defined distance metric. Move instances closest to centroid to that k-cluster. Continue until centroids stop moving location.
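A minimal scikit-learn sketch of K-means; the synthetic data, K = 3, and the default Euclidean distance metric are illustrative assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 natural groupings.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# K is defined manually; centroids are recomputed until they stop moving.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)  # final centroid locations
```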
What is a silhouette coefficient
Compares cluster cohesion (how tightly instances fit their own cluster, e.g. cluster sum of squared error) with cluster separation (distance from the nearest other cluster)
1: clusters well apart
0: clusters indifferent (overlapping)
-1: clusters assigned incorrectly
Used for K-means clustering
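A minimal scikit-learn sketch computing the mean silhouette coefficient for a K-means result; the synthetic data and K = 3 are illustrative assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Mean silhouette coefficient: near 1 = clusters well apart,
# near 0 = overlapping clusters, negative = likely misassigned instances.
print("Silhouette:", silhouette_score(X, labels))
```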