Domain 4: Data Governance & Data Analytics Flashcards

Question

ML algorithms used for supervised learning

Answer 1

- Regression Methods - Classification Methods - Ensemble Methods

Answer 2

- Clustering Methods - Association Rules - Dimensionality Reduction Methods

Answer 3

- Feed-forward network (unidirectional from input to output)
- Multilayer Perceptron* (MLP)
- Convolutional Neural Network (CNN)
- Recurrent Neural Network (RNN)
- Generative Adversarial Network (GAN)
* Perceptron: iterative algorithm that determines best values for the coefficient vector

Answer 4

Image classification and/or image feature selection (e.g. saliency map showing which features considered more relevant)

Answer 5

Use to generate DeepFake images and simulate cat-and-mouse fraud schemes

Answer 6

Pairs of deep learning neural networks (discriminator network vs. generator network) trained in tandem repeatedly

Answer 7

- Logistic regression - Naive Bayes Classifier - Support Vector Machines - k-Nearest Neighbor - Decision Trees (most common method)

Answer 8

Internal node = 1 feature (independent variable)
Leaf node = outcome class (dependent variable)

Answer 9

- Gini impurity (0 with single class populations) - Entropy (high when large number of evenly mixed classes)
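The two impurity measures on this card can be sketched in a few lines. This is a minimal illustration, assuming class labels arrive as a plain Python list; the function names `gini_impurity` and `entropy` and the sample labels are hypothetical, not part of the flashcards.

```python
from collections import Counter
from math import log2

def class_proportions(labels):
    """Return the fraction of instances that fall in each class."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]

def gini_impurity(labels):
    """Gini impurity: 1 - sum(p_i^2). It is 0 when only one class is present."""
    return 1.0 - sum(p * p for p in class_proportions(labels))

def entropy(labels):
    """Shannon entropy in bits: sum(-p_i * log2(p_i)). Highest for an even mix."""
    return sum(-p * log2(p) for p in class_proportions(labels))

# Hypothetical node populations: one pure, one evenly mixed.
pure = ["benign"] * 4
mixed = ["benign", "malignant", "benign", "malignant"]
print(gini_impurity(pure), entropy(pure))
print(gini_impurity(mixed), entropy(mixed))
```

A decision tree split is usually chosen to reduce one of these measures as much as possible in the child nodes.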

Answer 10

Parallel ensembles (uses unweighted voting)
Sequential (series) ensembles (uses weighted voting)

Answer 11

- Bagging (bootstrap aggregating): bootstrap the training data for each model in the ensemble
- Random subspaces: use a random subset of features per model
- Random forest: ensemble of randomly selected decision trees to make a "forest." Uses BOTH bagging and random subspaces.

Answer 12

- Combines constrained (weak learner) models into a single strong learner - Constructed using boosting methods, which uses weighted voting

Answer 13

- Exclusive (hard) clustering: An instance can only belong to 1 cluster - Fuzzy (soft) clustering: An instance can have more than 1 cluster assignment

Answer 14

- AdaBoost (Adaptive Boosting): instances misclassified by each model have their weights increased for the next model
- Gradient boosting: similar to AdaBoost, except each model trains on the residual errors of the previously run model
- CatBoost

Answer 15

- Hierarchical clustering: instances grouped based on similarities and differences
- Probabilistic clustering: instances clustered based on the probability that they belong to a particular distribution
- K-Means clustering: assign instances to a manually defined number (K) of clusters based on similarity --> Compute the distance from each instance to every centroid (center of a cluster) using a defined distance metric. Assign each instance to the cluster with the closest centroid, then recompute each centroid from its members. Repeat until the centroids stop moving.
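The K-means loop can be sketched as below. This is a minimal version under stated assumptions: Euclidean distance, starting centroids supplied by the caller, and a hypothetical function name `k_means` with made-up 2-D points.

```python
from math import dist

def k_means(points, centroids, max_iter=100):
    """Alternate between assigning points and recomputing centroids until stable."""
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(old)  # an empty cluster keeps its old centroid
        if new_centroids == centroids:  # converged: centroids stopped moving
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical data: two visually separate groups, K = 2.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```

The result depends on the starting centroids, which is why practical implementations typically run several random initializations.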

Answer 16

Ratio of cluster sum of squared error (SSE) & cluster separation
- 1: clusters well apart
- 0: clusters indifferent
- -1: clusters assigned incorrectly
Used for K-means clustering

Answer 17

- Market basket analysis (e.g. customers who bought X also bought Y) - Apriori algorithm: Uses a hash tree to count item sets navigating through data set in breadth first manner

Answer 18

- Principal Component Analysis (principal component = axis through the data that accounts for a share of the variability in a population. Each principal component has to be orthogonal to all other principal components)
- Singular value decomposition
- Autoencoders

Answer 19

Intentional duplication of data to improve database performance

Answer 20

1. Entity integrity: every table in the database has a unique primary key
2. Referential integrity: whenever a database column refers to a row in another table, that row exists
3. Domain integrity: specific list of values that are acceptable for a particular column

Answer 21

F-Score (aka F-measure or F1 score) = harmonic mean of precision and recall
F1 = 2 * (Precision * Recall) / (Precision + Recall)
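A worked instance of the formula, as a minimal sketch. The function name `f1_score` and the confusion-matrix counts are hypothetical; precision and recall are computed with their standard definitions, TP / (TP + FP) and TP / (TP + FN).

```python
def f1_score(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * (precision * recall) / (precision + recall)

# Hypothetical counts: precision = 8/10 = 0.8, recall = 8/16 = 0.5
print(f1_score(tp=8, fp=2, fn=8))
```

Because it is a harmonic mean, the F1 score sits closer to the lower of the two values than a simple average would.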

Answer 22

Process of breaking documents into searchable items. Tokenization must be done before items are placed into an index.
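A minimal sketch of tokenizing before indexing, assuming a simple rule (lowercase runs of letters and digits) and a toy inverted index. The function names and sample documents are hypothetical; real search engines use more elaborate tokenizers.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Break a document into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Map each token to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

# Hypothetical two-document collection.
documents = {
    1: "Chest pain, shortness of breath.",
    2: "No chest pain reported.",
}
index = build_index(documents)
print(sorted(index["chest"]))
print(sorted(index["breath"]))
```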

Answer 23

P(A | B) x P(B) = P(B | A) x P(A)
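Rearranging the identity gives P(A | B) = P(B | A) x P(A) / P(B). The sketch below applies it to a diagnostic test; the function name `posterior` and the prevalence, sensitivity and false-positive figures are hypothetical values chosen for illustration.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(disease | positive test) via Bayes' theorem."""
    # P(positive) by the law of total probability.
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_positive

# Hypothetical test: 1% prevalence, 90% sensitivity, 5% false-positive rate.
p_disease = 0.01
p_pos_given_disease = 0.90
p_disease_given_pos = posterior(p_disease, p_pos_given_disease, 0.05)
print(round(p_disease_given_pos, 3))
```

Even with a fairly accurate test, the low prior keeps the posterior probability well below the sensitivity.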

Answer 24

1. Data = Item or signal with little to no meaning by itself
2. Information = Data + Meaning
3. Knowledge = Patterns and relationships between pieces of info
4. Wisdom = Understanding and internalization of knowledge to apply it appropriately

Answer 25

Discrimination measures ability to tell the difference between two sets of patients
Calibration measures the degree to which the system produces an accurate probability

Answer 26

- Class --> DB relation (table) - Instance --> Tuple (row) - Attribute --> Attribute (column) - Method (accessors, mutators) --> Database CRUD function

Answer 27

Primary key: an attribute that uniquely identifies a tuple (row)
Foreign key: an attribute whose values must have matching values in the primary key of another table
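Both key rules can be seen in action with Python's built-in `sqlite3` module. This is a sketch under assumptions: the `patient` and `encounter` tables are hypothetical, and SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
conn.execute("CREATE TABLE patient (patient_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE encounter ("
    " encounter_id INTEGER PRIMARY KEY,"
    " patient_id INTEGER NOT NULL REFERENCES patient(patient_id))"
)
conn.execute("INSERT INTO patient VALUES (1, 'Smith')")
conn.execute("INSERT INTO encounter VALUES (100, 1)")  # accepted: patient 1 exists

def try_insert(sql):
    """Return True if the statement succeeds, False if an integrity rule rejects it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False

# Second row with the same primary key value.
duplicate_pk_ok = try_insert("INSERT INTO patient VALUES (1, 'Jones')")
# Foreign key value with no matching patient row.
orphan_fk_ok = try_insert("INSERT INTO encounter VALUES (101, 999)")
print(duplicate_pk_ok, orphan_fk_ok)
```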

Answer 28

Inner join
- Only includes rows that match in both tables
- Can be written as a WHERE clause
Left outer join
- Includes all rows in the left table (i.e. primary table specified in "FROM") and displays blanks (NULLs) from the right
Right outer join
- Includes all rows in the right table, displays blanks from the left
Full outer join
- Includes all rows in both tables, blanks in both
Cartesian (cross) join
- Gives you the "cross product" of both tables
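The join types can be compared side by side using Python's built-in `sqlite3` module. This is a sketch with hypothetical `patient` and `lab` tables; right and full outer joins are left out because older SQLite versions do not support them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patient (patient_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE lab (lab_id INTEGER PRIMARY KEY, patient_id INTEGER, test TEXT);
    INSERT INTO patient VALUES (1, 'Smith'), (2, 'Jones');
    INSERT INTO lab VALUES (10, 1, 'CBC'), (11, 3, 'BMP');
""")

# Inner join: only patients that have a matching lab row.
inner = conn.execute(
    "SELECT p.name, l.test FROM patient p "
    "INNER JOIN lab l ON p.patient_id = l.patient_id ORDER BY p.patient_id"
).fetchall()

# The same inner join written with a WHERE clause.
inner_where = conn.execute(
    "SELECT p.name, l.test FROM patient p, lab l "
    "WHERE p.patient_id = l.patient_id ORDER BY p.patient_id"
).fetchall()

# Left outer join: every patient, with NULL (None) where no lab row matches.
left = conn.execute(
    "SELECT p.name, l.test FROM patient p "
    "LEFT OUTER JOIN lab l ON p.patient_id = l.patient_id ORDER BY p.patient_id"
).fetchall()

# Cross join: every patient paired with every lab row.
cross = conn.execute("SELECT p.name, l.test FROM patient p CROSS JOIN lab l").fetchall()

print(inner, left, len(cross))
```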

Answer 29

- Match strings
- LIKE is case sensitive (in some DBMSs this depends on the collation)
- UPPER([char]) & LOWER([char]) convert to all upper or lower case
Ex) LIKE 'B%' --> "%" matches any number of characters
Ex) LIKE 'Smith_' --> "_" matches exactly one character

Answer 30

Strengths
- Optimized for rapid transactions of hierarchical data
- Makes it easy to know every attribute about one thing
Limitations
- Can only traverse tree from root (top parent) node
- Child nodes can only have 1 parent --> difficult to model relationships between child nodes

Answer 31

MUMPS (MGH Utility Multi-Programming System)

Answer 32

InterSystems Caché (OODBMS behind Epic EHR)

Answer 33

- Support for more data types (e.g. graphics, photo, video, webpages)
- Usually integrated into the programming language, so accessing data doesn't require complex driver configuration
- Most web app frameworks support interaction with OODBMS

Answer 34

- Class diagram (describes object-oriented classes: class title, attributes, methods, inheritance)
- Activity diagram (~process flowchart)
- Use case diagram (describes actors, goals and dependencies)
- Entity-Relationship (ER) diagram (describes objects and their relationships) --> used to define RDBMS logical schema

Answer 35

Third Normal Form (3NF)

Answer 36

Pros
- Reduces inconsistencies and dependencies in relational databases
- Safe against most INSERT, UPDATE and DELETE anomalies
Cons
- To generate a report you have to "denormalize" the data
- Requires lots of PK, FK, JOIN logic in your query

Answer 37

Extract-Transform-Load (ETL) process gets transactional data into a format that can be integrated into a Data Warehouse for reporting/queries

Answer 38

TCP = Transmission Control Protocol
IP = Internet Protocol

Answer 39

- TCP requires acknowledgment. UDP does not.
- TCP guarantees sequence/order of packets. UDP does not.
- TCP used when packet loss is unacceptable
- UDP used where packet loss is less important (e.g. VoIP or streaming protocols)

Answer 40

Coder / Decoder - compression algorithm used for digital stream to transmit audio, video, etc. Can be "lossy" or "lossless."
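The "lossless" property can be shown with a round trip through Python's built-in `zlib` module. This is only an illustration: `zlib` is a general-purpose lossless compressor, not an audio or video codec, and the byte string is made up.

```python
import zlib

# Hypothetical highly repetitive byte stream.
original = b"ABABABABABABABABABABABAB" * 10

compressed = zlib.compress(original)    # encode
restored = zlib.decompress(compressed)  # decode

# Lossless: the decoded stream is byte-for-byte identical to the input.
print(len(original), len(compressed), restored == original)
```

A lossy codec, by contrast, discards detail during encoding, so the decoded stream only approximates the original.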

Answer 41

Short range PAN (personal area network):
- RFID (one way) and NFC (two way)
- IEEE 802.15 - wireless PAN and derivatives (Bluetooth and infrared)

Answer 42

Medium Range WLAN (Wireless Local Area Network) - 802.11b - 802.11g - 802.11n - 802.11ac

Answer 43

WiMax CDMA 3G 4G / 4G LTE 5G

Answer 44

- Low battery consumption
- Limited need for data transfer
- One-way communication in close proximity
- BLE beacons broadcast packets of data at regular intervals and devices pick them up, detected by pre-installed apps or services
- Popular use in health and fitness, indoor navigation, proximity-based marketing

Answer 45

Each cell contains a single value
Each record is unique

Answer 46

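Table must meet all criteria for 1NF AND have no partial dependencies: every non-key column depends on the whole primary key. A table with a single-column (non-composite) primary key meets this automatically.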

Answer 47

Table must meet all criteria for 1NF AND 2NF AND must have no "transitive functional dependencies," which means a non-key column must not depend on another non-key column (changing one non-key value should not require a change to another non-key value).

Answer 48

- Conditional independence of predictors --> findings in disease are usually not conditionally independent
- Mutual exclusivity of conditions --> diseases may not be mutually exclusive
- Calculation is not simple --> when there are multiple findings, computation becomes complex quickly

Answer 49

Notifiable diseases must be reported to CDC
Reportable diseases must be reported to states

Answer 50

Validation
- Extensive testing including regression testing
- Used on non-FDA approved software/systems
Verification
- Limited tests on sampling of functions
- Used for FDA-approved software/systems to make sure it did not break in transit or installation

Answer 51

Data reconciliation is performed during data migration (large-scale data transfer) and is followed by data validation of a small subset of the data. Performed one time or in several large batches.
Data validation is used for continuously interfaced data that is transferred on a small scale. Performed in real time or very frequently.

Answer 52

Assumption that computer is right even when it doesn’t make sense

Answer 53

Explains the cognitive phenomenon that there are many tasks which we, as human beings, intuitively understand how to perform but whose rules or procedures we cannot verbalize

Answer 54

Transparency + Explainability

Answer 55

Learn policy through trial and error while optimizing long-term reward (i.e. value)

Answer 56

Stochastic gradient descent

Answer 57

High variance --> overfit
Low variance (high bias) --> underfit

Answer 58

Mathematical methods
- Least Absolute Shrinkage and Selection Operator (LASSO) (L1) regularization
- Ridge (L2) regularization
- Elastic Net regularization (combo of L1 & L2)
- Special regularization for neural networks
Non-mathematical methods
- Early stopping, pruning decision trees

Answer 59

Process where an ANN learns whether it made a mistake or not based on its output. Adjusts internal parameters of transfer functions (nodes) using loss functions and stochastic gradient descent in waves propagating backwards from the output nodes to the input nodes

Answer 60

Ex) Long Short-Term Memory (LSTM)
- Data storage units called gated nodes can flexibly represent short/long-term data
- Output recurrently feeds back on itself to inform next prediction (i.e. analyzes current and past data)
- Involves context nodes (network nodes that can accumulate historical data)

Answer 61

In the convolution (filter) layer, data that match the pattern of weights are amplified (creates hot spots)
Pooling layer masks data except for hot spots
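A minimal one-dimensional sketch of the two layers, assuming a sliding weighted sum without kernel flipping (the form used in most deep learning libraries) and non-overlapping max pooling. The function names, signal and kernel are hypothetical.

```python
def convolve1d(signal, kernel):
    """Slide the kernel over the signal and take the weighted sum at each position."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

def max_pool1d(values, size):
    """Keep only the largest value in each non-overlapping window."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]

signal = [0, 0, 1, 2, 1, 0, 0, 0, 0]
kernel = [1, 2, 1]  # weights that respond most strongly to a 1-2-1 bump

feature_map = convolve1d(signal, kernel)   # [1, 4, 6, 4, 1, 0, 0]
pooled = max_pool1d(feature_map, size=2)   # [4, 6, 1]
print(feature_map, pooled)
```

The feature map peaks where the signal matches the kernel's pattern, and pooling keeps that peak while discarding the weaker neighboring responses.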

Answer 62

Regression methods

Answer 63

Modeling non-linear exponential data (e.g. growth rates, progression of pandemics, etc.)
Uses exponentials of SINGLE input variable
Higher risk of overfitting

Answer 64

Regression: Minimize the loss function
Classification: Maximize the likelihood (maximum likelihood estimation)

Answer 65

Determines optimal boundary between 2 classes in multidimensional space by maximizing the margin between support vectors of different classes
- 2 features --> boundary is a line
- 3 features --> boundary is a plane
- >3 features --> boundary is a HYPERPLANE
Support vector = data point closest to the boundary (hardest to classify)
* Focus on borderline is UNIQUE (outliers automatically ignored)

Answer 66

- Fetal aneuploidy screening
- Prediction of metastasis from gene profiles
- Autoverification of GC/MS in the lab

Answer 67

- Instance-based method - Plots instances in multi-dimensional space. Knowledge is stored in the structure of the mapped data. Training data is not discarded.
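An instance-based classifier of this kind can be sketched as follows, assuming Euclidean distance and a simple majority vote among the k nearest stored instances. The function name `knn_predict` and the training points are hypothetical.

```python
from collections import Counter
from math import dist

def knn_predict(training, query, k=3):
    """Classify `query` by majority vote among its k nearest training instances."""
    # The training data is kept and searched at prediction time.
    neighbors = sorted(training, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical labeled instances plotted in 2-D feature space.
training = [
    ((1.0, 1.0), "benign"), ((1.2, 0.8), "benign"), ((0.9, 1.1), "benign"),
    ((5.0, 5.0), "malignant"), ((5.2, 4.9), "malignant"), ((4.8, 5.1), "malignant"),
]
print(knn_predict(training, (1.1, 1.0)))
print(knn_predict(training, (5.0, 5.2)))
```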

Answer 68

Hard voting: the class predicted by the most models is selected
Soft voting: average each class's predicted probability across the models, then select the class with the highest average probability
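The two voting schemes can disagree, as the sketch below shows. The function names and the per-model probabilities are hypothetical, chosen so that one confident model outweighs two lukewarm ones under soft voting.

```python
from collections import Counter

def hard_vote(predictions):
    """Return the class label predicted by the most models."""
    return Counter(predictions).most_common(1)[0][0]

def soft_vote(probabilities):
    """Average each class's probability across models and return the top class."""
    classes = probabilities[0].keys()
    averages = {c: sum(p[c] for p in probabilities) / len(probabilities) for c in classes}
    return max(averages, key=averages.get)

# Hypothetical class probabilities from three models for one instance.
model_probs = [
    {"sick": 0.45, "well": 0.55},
    {"sick": 0.40, "well": 0.60},
    {"sick": 0.95, "well": 0.05},
]
model_labels = [max(p, key=p.get) for p in model_probs]  # each model's top class

print(hard_vote(model_labels))  # two of three models say "well"
print(soft_vote(model_probs))   # average P(sick) = 0.6, so "sick"
```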

Answer 69

1. Agglomerative clustering (bottom up approach; most common) 2. Divisive clustering (top down approach; not common)

Answer 70

Dendrogram (tree diagram)

Answer 71

Gaussian Mixture Model - Determine which gaussian distribution an instance belongs to

Answer 72

1. Support - % of total transactions in a transaction database that the rule satisfies
2. Confidence - degree of certainty of an association
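Both measures can be computed directly from a transaction list. This sketch uses the standard definition of confidence for a rule A --> B, support(A and B) / support(A); the function names and the shopping baskets are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical market baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
print(support(transactions, {"bread", "milk"}))       # 2 of 4 baskets
print(confidence(transactions, {"bread"}, {"milk"}))  # 2 of the 3 bread baskets
```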

Answer 73

Converting continuous data into bins (categorical data)
Ex) Discretization must be done to use continuous data with association rules
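A minimal binning sketch using the standard-library `bisect` module. The function name `discretize`, the cut points and the bin labels are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right

def discretize(value, edges, labels):
    """Return the label of the bin that `value` falls into.

    `edges` are ascending cut points, and there is one more label than edges.
    A value equal to an edge goes into the higher bin.
    """
    return labels[bisect_right(edges, value)]

# Hypothetical BMI-style cut points and category labels.
edges = [18.5, 25.0, 30.0]
labels = ["underweight", "normal", "overweight", "obese"]

for bmi in (17.0, 22.0, 25.0, 31.0):
    print(bmi, discretize(bmi, edges, labels))
```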

Answer 74

Data having redundant features/attributes Can be reduced using dimensionality reduction methods

Answer 75

Random sampling WITH replacement
Model is trained on each bootstrapped sample + validated on the out-of-bag (OOB) sample
~36.8% of instances are OOB for each bootstrapped sample
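The ~36.8% figure follows from the chance that a given instance is never picked in n draws with replacement, (1 - 1/n)^n, which approaches 1/e for large n. The sketch below checks this by simulation; the function name, seed and sample size are arbitrary choices.

```python
import random
from math import exp

def oob_fraction(n, rng):
    """Draw one bootstrap sample of size n and return the fraction of instances never drawn."""
    drawn = {rng.randrange(n) for _ in range(n)}  # sampling WITH replacement
    return 1 - len(drawn) / n

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 10_000

simulated = oob_fraction(n, rng)
theoretical = (1 - 1 / n) ** n  # tends to 1/e as n grows

print(round(simulated, 3), round(theoretical, 3), round(exp(-1), 3))
```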

Answer 76

- Transforming categories into an array of binary switches, one item per category
- Adds dimensionality to features
Example:
Melanoma = [1,0,0]
Dysplastic nevus = [0,1,0]
Benign nevus = [0,0,1]
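The encoding in the example can be generated with a short helper. This is a sketch; the function name `one_hot` is hypothetical, and the category order is simply the order in which the categories are listed.

```python
def one_hot(categories):
    """Map each category to a binary vector with a single 1 in its own position."""
    return {
        category: [1 if i == j else 0 for j in range(len(categories))]
        for i, category in enumerate(categories)
    }

encoding = one_hot(["Melanoma", "Dysplastic nevus", "Benign nevus"])
for category, vector in encoding.items():
    print(category, vector)
```

Each added category adds one more position to every vector, which is how the technique increases feature dimensionality.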

Answer 77

1. Sentence boundary detection
2. Tokenization
3. Part-of-speech assignment (complicated by gerunds, which are nouns formed from verbs + ing)
4. Morphological decomposition of compound words (e.g. nasogastric)
5. Shallow parsing (chunking) - identifying phrases (e.g. noun phrase composed of adjective + noun)
6. Problem-specific segmentation
7. Coreference resolution (e.g. determining that Mr. XXX, he and his refer to the same person)

Answer 78

1. Spelling/grammatical error ID and recovery
2. Named entity recognition (e.g. categorize as persons, locations, med, disease, etc.)
3. Word sense disambiguation (determining a homograph's correct meaning)
4. Negation and uncertainty detection
5. Relationship extraction
6. Temporal inferences extraction
7. Information extraction

Answer 79

Semi-supervised: Some data are labeled while other data are not
Weakly supervised: Small amount of data have detailed labels; rest of data have fewer labels

(106 cards)