Study Guide - questions B Flashcards
What type of sampling has been shown to lead to significant bias?
Convenience sampling - surveying subjects who are easy to identify or reach
In rare event analysis, it may be advantageous to bias the sampling toward those individuals most likely to have _______.
Experienced the event of interest. This approach is known as stratified random sampling.
Which sampling method ensures that each subgroup of a given population is adequately represented within the whole sample of a research study?
Stratified random sampling
It is common to use __________ (and especially regression) to specify the value of interest as a function of the covariates (characteristics).
Response surface modeling
When the variable is ratio scale, _________ are often used to achieve normality.
Box-Cox Transformations
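A minimal sketch of a Box-Cox transformation, assuming SciPy is available; the skewed sample here is synthetic and only for illustration.

```python
# Minimal sketch: applying a Box-Cox transformation to skewed ratio-scale data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=500)  # positive, right-skewed sample

transformed, lam = stats.boxcox(skewed)                # lam is the fitted lambda parameter
print(f"fitted lambda: {lam:.3f}")
print(f"skewness before: {stats.skew(skewed):.2f}, after: {stats.skew(transformed):.2f}")
```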
When the dependent variable is categorical, the regression model is typically ______?
logistic
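A minimal sketch of a logistic regression for a binary (categorical) dependent variable, using scikit-learn on synthetic data.

```python
# Minimal sketch: logistic regression with a categorical (binary) dependent variable.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))                                   # three covariates
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200) > 0).astype(int)

model = LogisticRegression().fit(X, y)
print(model.coef_, model.intercept_)
print(model.predict_proba(X[:5]))                               # class probabilities
```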
When the dependent variable is ordinal, the regression model is typically ordered ______?
logit
When the dependent variable is ratio, ________ is often used?
standard regression
If Y is the dependent variable and X1…Xn represent the independent variables, then the typical regression model has the form ________?
Y = E[Y] + e
where e is a normally distributed error term and E[Y], the expected value of Y, is a parameterized function of X1…Xn.
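Written out, assuming the common linear parameterization (the card leaves the function unspecified):

```latex
Y = \mathrm{E}[Y] + \varepsilon, \qquad \varepsilon \sim N(0,\sigma^{2}), \qquad
\mathrm{E}[Y] = f(X_1, \dots, X_n; \beta)
\quad\text{e.g. } \mathrm{E}[Y] = \beta_0 + \beta_1 X_1 + \cdots + \beta_n X_n
```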
Time series analysis typically corrects for _____?
Seasonal patterns; it also provides a natural way of identifying trends
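A minimal sketch of a seasonal decomposition, assuming statsmodels; the monthly series is synthetic.

```python
# Minimal sketch: decomposing a series into trend, seasonal, and residual parts.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

idx = pd.date_range("2020-01-01", periods=48, freq="MS")        # 4 years of monthly data
t = np.arange(48)
series = pd.Series(10 + 0.3 * t + 5 * np.sin(2 * np.pi * t / 12), index=idx)

result = seasonal_decompose(series, model="additive", period=12)
print(result.trend.dropna().head())                             # the underlying trend
print(result.seasonal.head(12))                                 # the repeating seasonal pattern
```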
Sampling plan - A simple rule of thumb is that _______ the number of individuals sampled reduces the uncertainty by half.
Quadrupling - the standard error of an estimate scales as 1/√n, so multiplying the sample size by four cuts the uncertainty in half.
Sampling plan - _______ is a common way to measure uncertainty
Standard deviation
Sampling plan - If the standard deviation does not exist, then the difference between the ____________ is more appropriate.
third and first fractile of the uncertainty distribution
Sampling plan - If our uncertainty is described by an exponential family distribution, it will have how many parameters?
two
Determining questions to be asked - A key issue in designing the experiment is determining what?
The nature of the variable being assessed (e.g. categorical)
YES/NO questions or multiple choice are typically used for what type of scale?
Nominal scales
For ordinal scales, it is possible to define the normalized quantity for each response x by the fraction of responses __________?
Less than or equal to x
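A minimal sketch of that normalization (the cumulative fraction of responses at or below each level); the response values and their ordering are assumptions for illustration.

```python
# Minimal sketch: normalizing ordinal responses by the fraction of responses <= x.
import pandas as pd

responses = pd.Series(["low", "med", "med", "high", "low", "med", "high", "high", "high"])
order = ["low", "med", "high"]                       # assumed ordinal level ordering

counts = responses.value_counts().reindex(order, fill_value=0)
cum_fraction = counts.cumsum() / counts.sum()        # fraction of responses <= each level
print(cum_fraction)
```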
For semantic differential survey responses of the form “very hard, somewhat hard, okay, somewhat easy, very easy”, where the two ends of the scale represent opposites, the response is _______?
Ordinal
What survey approach asks individuals to rate various factors in order of importance?
Rank- order
Determining a control group - measurements are typically only meaningful if there is reference to some kind of _____________?
Underlying standard
When the item is an uncertain quantity, the score of an item is the probability of the item outranking a randomly chosen item from the __________?
Benchmark group
The benchmark group is commonly referred to as a _____ with the item’s score being called its _______?
Control; effect size
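A minimal sketch of that score computed by brute force over all pairs; both samples are synthetic, and the choice of normal draws is purely illustrative.

```python
# Minimal sketch: score = P(item draw outranks a random draw from the control group).
import numpy as np

rng = np.random.default_rng(2)
item_draws = rng.normal(loc=1.0, scale=1.0, size=1000)     # uncertain item quantity
control_draws = rng.normal(loc=0.0, scale=1.0, size=1000)  # benchmark (control) group

# Fraction of (item, control) pairs in which the item comes out ahead.
score = np.mean(item_draws[:, None] > control_draws[None, :])
print(f"effect size (probability of outranking the control): {score:.3f}")
```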
The purpose of extraction is to collect all this data from the many sources so that it can eventually be loaded into a common ________.
database
In extracting data, it is critical to know the _______ from which each data element was taken.
data source
What is needed if there is a change in the client's analysis and it is important to transition the database to reflect the data sources the new client considers important?
Traceability - this typically requires careful documentation
What are three reasons why survey quality may be deficient?
- Respondents get fatigued and enter arbitrary values
- Respondents may be offended by questions and deliberately fill in false answers
- Respondents refuse to fill out the survey
Data cleaning involves the following 6 items:
- Identifying the range of valid responses
- Identifying invalid data responses
- Identifying inconsistent data encodings
- Identifying suspicious data responses
- Identifying suspicious distribution of values
- Identifying suspicious interrelationships between fields.
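A minimal sketch of a few of these checks in pandas; the column names, valid ranges, and valid codes are assumptions for illustration.

```python
# Minimal sketch: a few of the data-cleaning checks above, applied to a toy table.
import pandas as pd

df = pd.DataFrame({
    "age": [34, 27, 180, 45, -3],               # 180 and -3 are suspicious/invalid
    "gender": ["M", "F", "F", "X", "M"],         # "X" is not a valid encoding here
})

valid_age = df["age"].between(0, 120)            # identify the range of valid responses
valid_gender = df["gender"].isin(["M", "F"])     # identify inconsistent encodings

print(df[~valid_age])                            # rows with invalid or suspicious ages
print(df[~valid_gender])                         # rows with unexpected gender codes
print(df["age"].describe())                      # eyeball the distribution of values
```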
A key part of data cleaning is determining whether the data makes sense, and also involves handling _______.
Null or missing values
What are four possible solutions to missing values?
- Deletion
- Deletion when necessary
- Imputing a value
- Randomly imputing a value
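A minimal sketch of those options in pandas; the toy series is synthetic, and "deletion when necessary" is simply applying the same dropna selectively.

```python
# Minimal sketch: deletion, imputation, and random imputation of missing values.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0, 6.0])

dropped = s.dropna()                                   # deletion
imputed_mean = s.fillna(s.mean())                      # imputing a value (here, the mean)
random_fill = s.copy()
mask = random_fill.isna()
random_fill[mask] = rng.choice(s.dropna().to_numpy(), size=mask.sum())  # random imputation
print(dropped, imputed_mean, random_fill, sep="\n\n")
```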
What are the 10 “Cs” checks on quality of the data?
- Completeness
- Correctness
- Consistency (is data under a given field consistent with definition of that field?)
- Currency (is data obsolete?)
- Collaborative (is data based on one opinion or a consensus of experts?)
- Confidential
- Clarity (is data legible and comprehensible)
- Common format
- Convenient (can data be conveniently and quickly accessed)
- Cost-effective (is cost of collecting data commensurate with its value).
A data warehouse is generally used to describe these three things:
- A staging area
- Data integration in centralized source
- Access layers in OLAP data marts
Data marts are organized along a single point of view for efficient data retrieval. They allow analysts to do these 5 things:
- Slice data (filtering)
- Dice data (grouping)
- Drill down
- Roll-up
- Pivot
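A minimal sketch of slicing, dicing, rolling up, and pivoting with pandas on a toy fact table; the column names are assumptions, and a real data mart would do this in the OLAP layer.

```python
# Minimal sketch: slice (filter), dice (group), roll-up, and pivot on a toy fact table.
import pandas as pd

sales = pd.DataFrame({
    "region":  ["East", "East", "West", "West", "West"],
    "quarter": ["Q1", "Q2", "Q1", "Q1", "Q2"],
    "product": ["A", "A", "A", "B", "B"],
    "revenue": [100, 120, 90, 150, 160],
})

sliced = sales[sales["region"] == "East"]                      # slice: filter one dimension
diced = sales.groupby(["region", "product"])["revenue"].sum()  # dice: group on two dimensions
rolled_up = sales.groupby("region")["revenue"].sum()           # roll-up: coarser aggregation
pivoted = sales.pivot_table(index="region", columns="quarter",
                            values="revenue", aggfunc="sum")   # pivot: rotate dimensions
print(sliced, diced, rolled_up, pivoted, sep="\n\n")
```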
What are three examples of fact tables?
- Transaction fact tables
- Snapshot fact tables (at point in time)
- Accumulating fact tables (aggregate facts)
Do dimension tables have a larger or smaller number of records compared to fact tables?
smaller
What are 5 examples of dimension tables?
- time
- geography
- product
- employee
- range
Discovering relationships in data - what are 5 methods to reduce dimensions in the data?
- PCA or factor analysis (can determine if there is correlation across different dimensions)
- Term frequency-inverse document frequency (TF-IDF)
- Feature hashing (creating fixed number of features)
- Sensitivity analysis and wrapper methods
- Self-organizing maps and Bayes nets
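A minimal sketch of the first item above, PCA for dimension reduction with scikit-learn; the four-column input is synthetic and deliberately correlated.

```python
# Minimal sketch: reducing correlated features to a few principal components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
base = rng.normal(size=(300, 2))
noise = 0.05 * rng.normal(size=(300, 2))
X = np.column_stack([base, base @ np.array([[1.0, 0.5], [0.5, 1.0]]) + noise])
# X has 4 columns, but the last two are nearly linear combinations of the first two.

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(pca.explained_variance_ratio_)   # nearly all variance in the first two components
print(X_reduced.shape)                 # (300, 2)
```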
When data has a variable number of features, _________ is an efficient method of creating a fixed number of features which form the indices of an array.
Feature hashing
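A minimal sketch of feature hashing with scikit-learn's FeatureHasher; the token lists are made up and vary in length on purpose.

```python
# Minimal sketch: hashing a variable number of features into a fixed-length vector.
from sklearn.feature_extraction import FeatureHasher

hasher = FeatureHasher(n_features=8, input_type="string")
records = [
    ["red", "small", "glass"],          # each record has a variable number of tokens
    ["blue", "large"],
    ["red", "metal", "heavy", "round"],
]
X = hasher.transform(records)           # sparse matrix with exactly 8 columns
print(X.toarray())
```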
For unstructured text data, __________ identifies the importance of a word in a document in a collection by comparing the frequency with which the word appears in that document with its frequency across the collection as a whole.
Term frequency-inverse document frequency (TF-IDF)
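A minimal sketch of TF-IDF weighting with scikit-learn; the three documents are made up.

```python
# Minimal sketch: TF-IDF weighting of words across a small document collection.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are common pets",
]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)            # rows = documents, columns = terms
print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))                   # high weight = frequent here, rare elsewhere
```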
_______ and _______ are typically essential when you don’t know which features of your data are important.
Sensitivity analysis and wrapper methods
Wrapper methods, unlike sensitivity analysis, typically involve identifying a set of features on a small sample and then testing that set on a ________.
holdout sample
________ and _______ are helpful in understanding the probability distribution of the data.
Self-organizing maps and Bayes nets
Extracting features - ________ is required to ensure your data stays within common ranges.
Normalization
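A minimal sketch of two common normalizations with scikit-learn; the toy matrix is synthetic.

```python
# Minimal sketch: rescaling features to comparable ranges before modeling.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 1000.0]])  # columns on very different scales

print(MinMaxScaler().fit_transform(X))      # each column rescaled to [0, 1]
print(StandardScaler().fit_transform(X))    # each column to zero mean, unit variance
```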
Format conversion is typically required when data is in __________?
binary format
Fast Fourier transforms and discrete wavelet transforms are used for _________?
frequency data
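A minimal sketch of pulling dominant frequencies out of a signal with NumPy's FFT; the sampling rate and the two component frequencies are assumptions.

```python
# Minimal sketch: extracting dominant frequencies from a sampled signal with an FFT.
import numpy as np

fs = 100.0                                    # sampling rate in Hz (assumed)
t = np.arange(0, 2.0, 1.0 / fs)
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 20 * t)

spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(signal.size, d=1.0 / fs)
print(freqs[np.argsort(spectrum)[-2:]])       # the two strongest frequencies (~5 and ~20 Hz)
```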
Coordinate transformations are used for geometric data defined over ________?
Euclidean space
Collecting and summarizing data - These three plots provide compact representations of how data is distributed?
- Box plots
- Scatter plots
- box and whisker plots
Collecting and summarizing data - when the data can be reasonably described by parametric distributions, ___________ is an even more efficient way of summarizing the data.
Distribution fitting
Collecting and summarizing data - ___________ aggregation is an effective way of summarizing all the information available on an entity
Baseball card
Adding new information to the data - ________ is recommended for tracking source information and other user-defined parameters.
Annotation
Adding new info to the data - ____________ and _______ can be helpful in processing certain data fields together or in using one field to compute the value of another.
Relational algebra rename and feature addition
What are the 6 methods for segmenting data to find natural groupings?
- Connectivity-based methods (hierarchical clustering)
- Centroid-based methods
- Distribution-based methods
- Density-based methods
- Graph-based methods
- Topic modeling (text data)
segmentation -A connectivity-based method called _________ generates an ordered set of clusters with variable precision.
Hierarchical clustering
segmentation - A centroid-based method with a known number of clusters
K-means clustering
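A minimal sketch of k-means with a known number of clusters, using scikit-learn on synthetic blobs.

```python
# Minimal sketch: k-means with a known number of clusters (k = 3) on synthetic data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)
print(kmeans.labels_[:10])          # cluster assignment for the first few points
```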
segmentation - A centroid-based method with an unknown number of clusters.
x-means clustering
segmentation - A centroid-based method that is an alternate way of enhancing k-means when the number of clusters is unknown
canopy clustering
segmentation - A distribution-based method that typically uses the expectation-maximization (EM) algorithm and is appropriate if you want any data elements’ membership in a segment to be ‘soft’
Gaussian mixture models
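A minimal sketch of a Gaussian mixture fit by EM in scikit-learn, showing the 'soft' membership probabilities; the blobs are synthetic.

```python
# Minimal sketch: a Gaussian mixture fit by EM, giving 'soft' segment membership.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=1)
gmm = GaussianMixture(n_components=3, random_state=1).fit(X)
print(gmm.predict_proba(X[:5]).round(3))   # each row: probability of belonging to each segment
```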
segmentation - Two density-based methods used for non-elliptical clusters are _________?
Fractal clustering and DBSCAN
segmentation - _________ methods are often based on constructing cliques and semi-cliques, and are useful when you only have knowledge of how one item is connected to another.
Graph-based methods
segmentation - For text data, this method allows for segmentation of the data.
topic modeling
variable importance - When the structure of the data is unknown, these methods are helpful.
tree-based methods
variable importance - If statistical measures of importance are needed, these models are appropriate.
Generalized linear models
variable importance - if statistical measures of importance are NOT needed, these two methods are useful.
- regression with shrinkage (e.g. Lasso or elastic net)
- stepwise regression
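A minimal sketch of regression with shrinkage (lasso and elastic net) in scikit-learn on synthetic data; the alpha values are arbitrary and would normally be tuned.

```python
# Minimal sketch: regression with shrinkage (lasso and elastic net).
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)
print((lasso.coef_ != 0).sum(), "features kept by the lasso")   # many coefficients shrink to 0
print((enet.coef_ != 0).sum(), "features kept by the elastic net")
```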
classifying data into groups - These two methods are helpful if you’re unsure of feature importance.
- neural nets
- random forests
classifying data into groups - If you require a highly transparent model, this type of model can be preferable.
decision trees (e.g. CART, CHAID)
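A minimal sketch contrasting a random forest with a single, more transparent decision tree, using scikit-learn on synthetic data.

```python
# Minimal sketch: a random forest (robust when feature importance is unknown)
# next to a single decision tree (more transparent, CART-style).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

print("forest accuracy:", forest.score(X_test, y_test))
print("tree accuracy:  ", tree.score(X_test, y_test))
print(export_text(tree))     # the tree's rules can be read directly, hence 'transparent'
```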