Domain 4: Data Governance & Data Analytics Flashcards

1
Q

What is the ACID test for reliable database (DB) transactions?

A

Atomicity - transaction is indivisible (either every step is applied or none is)

Consistency - database is always in a valid state

Isolation - transactions that run simultaneously do not interfere with each other

Durability - once a transaction is committed, its changes persist in the data store, even after a crash or power loss
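
Atomicity can be seen in a minimal sketch with Python's built-in sqlite3 module. The `account` table, the `transfer` helper and the balances are hypothetical and only illustrate the idea: the second transfer fails halfway, so the step that had already succeeded is rolled back too.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one all-or-nothing transaction."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?",
                (amount, dst))
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?",
                (amount, src))
        return True
    except sqlite3.IntegrityError:
        # The debit violated the CHECK constraint, so the credit is undone too.
        return False

def balances(conn):
    return dict(conn.execute("SELECT id, balance FROM account ORDER BY id"))

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "id TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("a", 100), ("b", 50)])
conn.commit()

ok = transfer(conn, "a", "b", 30)       # valid: a=70, b=80
failed = transfer(conn, "b", "a", 500)  # b cannot go negative: nothing changes
print(ok, failed, balances(conn))
```

The CHECK constraint also illustrates consistency: the database refuses to enter a state with a negative balance.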

2
Q

What are the layers of the OSI 7-layer model?

A
  7. Application (e.g. POP3/IMAP4, HTTP, FTP, SSH and HTTPS)
  6. Presentation (Encryption, decryption, conversion to character sets like ASCII)
  5. Session (Lightweight Directory Access Protocol = LDAP, SSL)
  4. Transport (TCP, UDP)
  3. Network (IPv4, IPv6, DHCP)
  2. Data Link (Address Resolution Protocol = ARP)
  1. Physical (transmission of binary bits via copper wire, coaxial or fiber optic cable)
3
Q

In MUMPS (M), global variables are referenced by what symbol?

A

Up arrow symbol (later became a caret “^”)

4
Q

In an event monitor, a high test result is a?

A

Condition

5
Q

What is evoking strength in INTERNIST-1/QMR?

A

Analogous to positive predictive value (PPV): how strongly a finding suggests a given disease

6
Q

What is the observed value for an item in a ML data set?

A

Label

7
Q

What is an example of an expert system?

A

CADUCEUS

8
Q

Examples of reinforcement learning

A

Markov decision process (use with known model)

Monte Carlo (use when one or more elements are unknown)

9
Q

What are measures of information retrieval success?

A

Precision (PPV) = % of returned documents that are relevant to the query

Recall (sensitivity) = % of all relevant documents in the corpus that were found

Fall-out (false positive rate) = % of irrelevant documents that are retrieved
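
The three measures can be computed from sets of document IDs. This is a minimal sketch; the function name `retrieval_metrics` and the ten-document corpus are made up for illustration, and it assumes none of the sets involved is empty.

```python
def retrieval_metrics(retrieved, relevant, corpus):
    """Precision, recall and fall-out for one query, given sets of document IDs."""
    retrieved, relevant, corpus = set(retrieved), set(relevant), set(corpus)
    true_pos = len(retrieved & relevant)   # relevant documents that were returned
    false_pos = len(retrieved - relevant)  # irrelevant documents that were returned
    irrelevant = corpus - relevant
    return {
        "precision": true_pos / len(retrieved),
        "recall": true_pos / len(relevant),
        "fall_out": false_pos / len(irrelevant),
    }

corpus = range(1, 11)     # documents 1..10
relevant = {1, 2, 3, 4}   # what the user actually wanted
retrieved = {1, 2, 5}     # what the search returned
metrics = retrieval_metrics(retrieved, relevant, corpus)
print(metrics)  # precision 2/3, recall 2/4, fall-out 1/6
```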

10
Q

Difference between bias and variance

A

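Bias = measure of inaccuracy (systematic error: predictions are consistently off target)
Variance = measure of imprecision (predictions scatter widely when the training data changes)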

11
Q

What is the null error rate?

A

Rate of being wrong if you ALWAYS pick the majority class.

Ex) If the majority class has 105 instances out of 165 total instances, null error rate = (165-105)/165 = 60/165 ≈ 36.4%
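
The same arithmetic as a short sketch. The function name `null_error_rate` and the "yes"/"no" labels are placeholders; only the 105-of-165 split comes from the card.

```python
from collections import Counter

def null_error_rate(labels):
    """Error rate of a classifier that always predicts the majority class."""
    majority_count = max(Counter(labels).values())
    return (len(labels) - majority_count) / len(labels)

labels = ["yes"] * 105 + ["no"] * 60   # 165 instances, majority class = 105
rate = null_error_rate(labels)
print(round(rate, 3))  # 60/165, about 0.364
```

A useful model should have an error rate below this baseline.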

12
Q

Which evaluation methods are suited to screening for low-incidence conditions (imbalanced classes)?

A
  • Fbeta score
  • Matthews Correlation Coefficient
  • Stratified K-fold cross-validation
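
A sketch of why these measures help when the condition is rare. The confusion-matrix counts below are invented: 1,000 patients, 10 with the condition. Accuracy looks excellent, while the F-beta score and the Matthews Correlation Coefficient show how weak the screen really is. Function names are illustrative.

```python
import math

def f_beta(tp, fp, fn, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta > 1 favors recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient: -1 (all wrong) .. 0 (chance) .. 1 (perfect)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical screen: 10 true cases among 1,000 patients.
tp, fn, fp, tn = 5, 5, 20, 970
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(accuracy)                     # 0.975, looks excellent
print(f_beta(tp, fp, fn))           # F1, far lower than the accuracy
print(f_beta(tp, fp, fn, beta=2))   # F2 weights recall more heavily
print(mcc(tp, tn, fp, fn))          # also far lower than the accuracy
```

Stratified K-fold cross-validation complements these scores by keeping the rare class at the same proportion in every fold.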
13
Q

Measures to evaluate regression methods (numerical output)

A
  • Root mean squared error (RMSE): lower = better fit
  • Correlation coefficient (r): strength of relationship between x and y on a scatter plot. No correlation is r = 0.
  • Coefficient of determination (r squared, R²): Goodness of fit. Represents the % of variation in y that is explained by variation in x. 100% = perfectly fit model.
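
A minimal sketch of the three regression measures in plain Python. The function names and the four data points are hypothetical. R² is computed here as 1 - (residual sum of squares / total sum of squares), which is one common definition.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: 0 means every prediction is exact."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def pearson_r(x, y):
    """Correlation coefficient: -1 .. 1, with 0 meaning no linear relationship."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def r_squared(y_true, y_pred):
    """Share of the variation in y that the model's predictions explain."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [3, 5, 7, 9]
y_pred = [2, 5, 7, 10]   # two predictions are off by 1

print(rmse(y_true, y_pred))                 # sqrt(2/4), about 0.707
print(r_squared(y_true, y_pred))            # 1 - 2/20 = 0.9
print(pearson_r([1, 2, 3, 4], y_true))      # perfectly linear, so r is 1
```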
14
Q

Measures to evaluate classification methods (categorical output)

A
  • Cohen’s Kappa: measure of how well the classifier performs as compared to random chance
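
Cohen's Kappa can be sketched as (observed agreement - chance agreement) / (1 - chance agreement). The function name and the 100 example labels below are invented for illustration.

```python
from collections import Counter

def cohens_kappa(y_true, y_pred):
    """Agreement between labels and predictions, corrected for chance agreement."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    # Chance agreement: probability that both pick the same class independently.
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (observed - expected) / (1 - expected)

# 50 positives (40 classified correctly) and 50 negatives (30 classified correctly).
y_true = [1] * 50 + [0] * 50
y_pred = [1] * 40 + [0] * 10 + [0] * 30 + [1] * 20

kappa = cohens_kappa(y_true, y_pred)
print(kappa)  # observed 0.70, chance 0.50, so kappa is about 0.4
```

A Kappa of 0 means the classifier does no better than chance, and 1 means perfect agreement.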
15
Q

Measure used to evaluate both supervised & unsupervised models

A
  • Receiver Operating Characteristic (ROC) curve: Plots sensitivity (true positive rate) against 1 - specificity (false positive rate)
  • Area Under the Curve (AUC): AKA concordance (c) statistic. AUC = 1 –> perfect discrimination. AUC = 0.5 –> no better than chance.
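
The concordance interpretation of AUC can be sketched directly: it is the probability that a randomly chosen positive instance gets a higher score than a randomly chosen negative one, with ties counted as half. The function name `auc` and the six scores are hypothetical; this pairwise version is meant for small examples, not large data sets.

```python
def auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked in the right order."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    concordant = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return concordant / (len(positives) * len(negatives))

labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]

area = auc(labels, scores)
print(area)  # 8 of the 9 pairs are ranked correctly, about 0.889
```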
16
Q

What is the most important, time-consuming and expensive part of developing an ML model?

A

Gathering appropriate data (instances)

17
Q

Difference between validation and testing data set

A
  • Validation data set: used to evaluate the preliminary model while it is still being tuned
  • Testing data set: used for the final evaluation of the model, once no further changes are anticipated
18
Q

General rule for feature to instance ratio

A

Select <= 1 feature for each 10 instances in the development data set

High # of features –> overfitting

19
Q

Methods for feature selection

A
  • Forward selection (iterative inclusion)
  • Backward selection (iterative removal)
  • Stepwise selection (combination of forward & backward selection)
  • Forced inclusion
20
Q

Methods for reducing number of features when an unsupervised model is overfit

A

Dimensionality reduction method

21
Q

Methods for validating model when data (instance) is limited

A
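  • K-fold cross-validation: training data randomly partitioned into k equal-sized subsamples (folds); each fold is held out once for validation while the model trains on the other k-1 folds
  • Leave One Out Cross-Validation (LOOCV): extreme case of k-fold where k = number of instances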
  • Bootstrapping
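
The three resampling schemes can be sketched as index generators. The function names, the seed and the sizes are illustrative, and the functions return instance indexes rather than training any model.

```python
import random

def k_fold_splits(n, k, seed=0):
    """Yield (train_indexes, validation_indexes) for each of k folds."""
    indexes = list(range(n))
    random.Random(seed).shuffle(indexes)        # random partition
    folds = [indexes[i::k] for i in range(k)]   # k roughly equal folds
    for held_out in range(k):
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, folds[held_out]

def bootstrap_sample(n, seed=0):
    """Draw n indexes WITH replacement; some instances repeat, some are left out."""
    rng = random.Random(seed)
    return [rng.randrange(n) for _ in range(n)]

for train, validation in k_fold_splits(10, 5):
    print(sorted(validation), "held out;", len(train), "used for training")

loocv = list(k_fold_splits(6, 6))   # LOOCV: k equals the number of instances
sample = bootstrap_sample(10)
```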
22
Q

Methods for optimizing (aka tuning) model

A
  • Tune hyperparameters
  • Dimensionality reduction
  • Regularization: force algorithm to build a less complex model (i.e. more generalizable –> less likely to overfit data)
23
Q

Types of deployed model

A
  • Static model (most common in medicine)
  • Incremental/continuous model
24
Q

ML algorithm that can be used for both supervised and unsupervised learning

A

Neural networks (aka connectionist systems)

25
Q

ML algorithms used for supervised learning

A
  • Regression Methods
  • Classification Methods
  • Ensemble Methods
26
Q

ML algorithms used for unsupervised learning

A
  • Clustering Methods
  • Association Rules
  • Dimensionality Reduction Methods
27
Q

How many hidden layers does Deep Artificial Neural Network usually have?

A

> 3

28
Q

Different types of ANN

A
  • Feed-forward network
  • Multilayer Perceptron (MLP)
  • Convolutional neural network (CNN)
  • Recurrent neural network (RNN)
  • Generative Adversarial Network (GAN)
29
Q

Common output of convolutional neural network (CNN)

A

Image classification and/or image feature selection (e.g. saliency map showing which features considered more relevant)

30
Q

Common use of recurrent neural network (RNN)

A

NLP

31
Q

Common use of generative adversarial network

A

Used to generate DeepFake images and to simulate cat-and-mouse fraud schemes

32
Q

How does generative adversarial network work?

A

A pair of deep learning neural networks trained in tandem over many rounds: the generator network produces synthetic samples, and the discriminator network tries to tell them apart from real ones

33
Q

What are examples of classification models

A
  • Logistic regression
  • Naive Bayes Classifier
  • Support Vector Machines
  • k-Nearest Neighbor
  • Decision Trees (most common method)
34
Q

What do internal and leaf nodes represent in a decision tree?

A

Internal node = a test on 1 feature (independent variable)

Leaf node = outcome class (dependent variable)

35
Q

Method for checking quality of decision tree model

A
  • Gini impurity (0 with single class populations)
  • Entropy (high when large number of evenly mixed classes)
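
Both impurity measures can be computed from the class labels that reach a node. A minimal sketch with made-up labels follows; the function names are illustrative.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 0 when every instance in the node has the same class."""
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: highest when many classes are evenly mixed."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())

pure = ["a", "a", "a", "a"]
two_mixed = ["a", "a", "b", "b"]
four_mixed = ["a", "b", "c", "d"]

print(gini(pure), entropy(pure))              # both 0 for a single-class node
print(gini(two_mixed), entropy(two_mixed))    # 0.5 and 1 bit
print(gini(four_mixed), entropy(four_mixed))  # 0.75 and 2 bits
```

A split is judged good when the child nodes are purer (lower Gini or entropy) than the parent.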
36
Q

Types of ensemble methods

A

Parallel ensembles

Sequential (series) ensembles

37
Q

What are methods to create diversity in parallel ensembles to help decrease overfitting

A
  • Bagging (bootstrap aggregating): train each model in the ensemble on its own bootstrap sample of the data
  • Random subspaces: Use random subset of features per model
  • Random forest: Ensemble of randomly selected decision trees to make a “forest.” Uses BOTH bagging and random subspaces.
38
Q

How are sequential (series) ensembles constructed?

A
  • Combine constrained (weak learner) models into a single strong learner
  • Constructed using boosting methods, which use weighted voting
39
Q

Difference between hard and soft clustering

A
  • Exclusive (hard) clustering: An instance can only belong to 1 cluster
  • Fuzzy (soft) clustering: An instance can have more than 1 cluster assignment
39
Q

Types of boosting methods

A
  • AdaBoost (Adaptive Boosting): instances misclassified by each model have their weights increased for the next model
  • Gradient boosting: similar to AdaBoost, except each new model trains on the residual errors of the previously run model
  • CatBoost
40
Q

Types of clustering methods

A
  • Hierarchical clustering: instances grouped based on similarities and differences
  • Probabilistic clustering: instances clustered based on probability that they belong to a particular distribution
  • K-Means clustering: Assign instances to a manually defined number (K) of clusters based on similarity –> Compute the distance from each instance to every centroid (center of a cluster) using a defined distance metric. Assign each instance to the cluster with the closest centroid, then recompute each centroid as the mean of its instances. Repeat until the centroids stop moving.
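
The K-Means loop can be sketched in a few lines of plain Python. This assumes Euclidean distance, 2-D points and hand-picked starting centroids, all chosen only for illustration; real implementations usually pick the starting centroids randomly.

```python
import math

def k_means(points, centroids, max_iter=100):
    """Alternate between assigning points and moving centroids until stable."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        new_centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]   # keep an empty cluster's centroid
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:     # centroids stopped moving
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, centroids=[(0, 0), (10, 10)])
print(centroids)
print(clusters)
```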
41
Q

What is a silhouette coefficient

A

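Compares cluster cohesion (how close an instance is to the other instances in its own cluster) with cluster separation (how far it is from the nearest other cluster)

1: clusters well apart
0: clusters overlap (instance sits on the boundary between clusters)
-1: instances assigned to the wrong cluster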
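
A sketch for a single instance, using the commonly cited per-instance definition s = (b - a) / max(a, b), where a is the mean distance to the other members of the instance's own cluster and b is the mean distance to the nearest other cluster. This formula, the function name and the points are assumptions brought in for illustration; the card itself does not give a formula.

```python
import math

def silhouette(point, own_cluster, other_clusters):
    """Silhouette value of one point: near 1 = well placed, near -1 = misplaced."""
    neighbors = list(own_cluster)
    neighbors.remove(point)          # compare with the OTHER members only
    if not neighbors:
        return 0.0                   # convention for a single-member cluster
    a = sum(math.dist(point, q) for q in neighbors) / len(neighbors)
    b = min(
        sum(math.dist(point, q) for q in cluster) / len(cluster)
        for cluster in other_clusters
    )
    return (b - a) / max(a, b)

own = [(0, 0), (0, 2)]
other = [(0, 10), (0, 12)]

s = silhouette((0, 0), own, [other])
print(s)  # a = 2, b = 11, so s = 9/11, about 0.82
```

The silhouette coefficient of a whole clustering is usually reported as the mean of these per-instance values.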

42
Q

Types of association rules

A
  • Market basket analysis (e.g. customers who bought X also bought Y)
  • Apriori algorithm: Uses a hash tree to count item sets, navigating through the data set in a breadth-first manner
43
Q

Dimensionality reduction methods

A
  • Principal Component Analysis (principal component = axis through the data that captures a share of the variability in a population. Each principal component has to be orthogonal to all other principal components)
  • Singular value decomposition
  • Autoencoders
44
Q

What is denormalization of data

A

Intentional duplication of data to improve database performance

45
Q

3 Database integrity requirements

A
  1. Entity integrity: every table in the database has a unique primary key
  2. Referential integrity: Whenever a database column refers to a row in another table, that row exists
  3. Domain integrity: every value in a column comes from the specific list of values that are acceptable for that column
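
All three requirements can be enforced declaratively. The sketch below uses Python's built-in sqlite3 module with hypothetical `patient` and `visit` tables: a PRIMARY KEY for entity integrity, a REFERENCES clause for referential integrity, and a CHECK constraint for domain integrity. Note that SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite leaves this off by default
conn.execute(
    "CREATE TABLE patient ("
    "id INTEGER PRIMARY KEY, "                              # entity integrity
    "sex TEXT NOT NULL CHECK (sex IN ('F', 'M', 'U')))")    # domain integrity
conn.execute(
    "CREATE TABLE visit ("
    "id INTEGER PRIMARY KEY, "
    "patient_id INTEGER NOT NULL REFERENCES patient(id))")  # referential integrity
conn.execute("INSERT INTO patient VALUES (1, 'F')")

def rejected(sql):
    """Return True if the database refuses the statement."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True

violations = {
    "entity": rejected("INSERT INTO patient VALUES (1, 'M')"),    # duplicate key
    "referential": rejected("INSERT INTO visit VALUES (10, 99)"), # no patient 99
    "domain": rejected("INSERT INTO patient VALUES (2, 'X')"),    # 'X' not allowed
}
valid_visit_rejected = rejected("INSERT INTO visit VALUES (11, 1)")
print(violations, valid_visit_rejected)
```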
46
Q

What is F1 score and how is it calculated?

A

F-Score (aka F-measure or F1 score) = harmonic mean of precision and recall

F1 = 2 * (Precision x Recall) / (Precision + Recall)

47
Q

What is tokenization?

A

Process of breaking documents into searchable items. Tokenization must be done before items are placed into an index.
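
A minimal sketch of tokenizing before indexing. The rule used here (lowercase, then split on anything that is not a letter or digit) is only one simple assumption; real systems often add stemming, stop-word removal or language-specific rules. The documents and function names are made up.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Break text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Inverted index: token -> set of IDs of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

docs = {
    1: "Chest pain, acute onset.",
    2: "Acute renal failure",
}
index = build_index(docs)

print(tokenize(docs[1]))       # ['chest', 'pain', 'acute', 'onset']
print(sorted(index["acute"]))  # [1, 2]
```

Because the query is tokenized the same way, a search for "ACUTE" matches both documents.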