Study Guide - questions B Flashcards
What type of sampling has been shown to lead to significant bias?
Convenience sampling - asking subjects who are easy to identify
In rare event analysis, it may be advantageous to bias the sampling toward those individuals most likely to ________.
Experience the event of interest. This is known as stratified random sampling.
This sampling method ensures that each subgroup of a given population is adequately represented within the whole sample population of a research study?
Stratified random sampling
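The idea above can be sketched in a few lines of Python; `stratified_sample` and its arguments are hypothetical names invented for illustration, not terms from the study guide:

```python
import random
from collections import defaultdict

def stratified_sample(population, strata_key, fraction, seed=0):
    """Draw the same fraction from each subgroup so every stratum
    is represented in proportion to its size in the population."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for item in population:
        groups[strata_key(item)].append(item)
    sample = []
    for members in groups.values():
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample

# 80 "A" individuals and 20 "B" individuals; a 10% stratified
# sample keeps both subgroups in their original proportions.
population = [("A", i) for i in range(80)] + [("B", i) for i in range(20)]
sample = stratified_sample(population, strata_key=lambda x: x[0], fraction=0.10)
```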
It is common to use __________ (and especially regression) to specify the value of interest as a function of the covariates (characteristics).
Response surface modeling
When the variable is ratio scale, _________ are often used to achieve normality.
Box-Cox Transformations
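A minimal sketch of the transform itself (choosing the best lambda is a separate fitting step, typically by maximum likelihood; `box_cox` is an illustrative name):

```python
import math

def box_cox(x, lam):
    """Box-Cox power transform for positive x:
    (x**lam - 1)/lam when lam != 0, and log(x) when lam == 0."""
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam
```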
When the dependent variable is categorical, the regression model is typically ______?
logistic
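The logistic link underlying such a model can be sketched as follows; `logistic` is just an illustrative name:

```python
import math

def logistic(x):
    """Logistic (sigmoid) link: maps a linear predictor in
    (-inf, inf) to a probability in (0, 1) for a categorical outcome."""
    return 1.0 / (1.0 + math.exp(-x))
```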
When the dependent variable is ordinal, the regression model is typically ordered ______?
logit
When the dependent variable is ratio, ________ is often used?
standard regression
If Y is the dependent variable and X1…Xn represent the independent variables, then the typical regression model has the form ________?
Y = E[Y] + e
where e is a normally distributed error term and E[Y], the expected value of Y, is a parameterized function of X1…Xn.
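A minimal illustration of estimating a parameterized E[Y] with one covariate, using ordinary least squares (function and variable names are illustrative):

```python
def ols_fit(xs, ys):
    """Least-squares fit of y = b0 + b1*x, i.e. E[Y] modeled as a
    parameterized function of a single covariate X."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    b0 = my - b1 * mx
    return b0, b1

# Data lying exactly on the line y = 1 + 2x recovers those parameters.
intercept, slope = ols_fit([0, 1, 2, 3], [1, 3, 5, 7])
```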
Time series analysis typically corrects for _____?
Seasonal patterns; it also provides a natural way of identifying trends
Sampling plan - A simple rule of thumb is that _______ the number of individuals sampled reduces the uncertainty in half.
Quadrupling
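The rule of thumb follows from the standard error of the mean shrinking like 1/sqrt(n), which a two-line sketch makes concrete (`standard_error` is an illustrative name):

```python
import math

def standard_error(sd, n):
    """Standard error of the mean: uncertainty shrinks like 1/sqrt(n),
    so quadrupling n halves the uncertainty."""
    return sd / math.sqrt(n)
```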
Sampling plan - _______ is a common way to measure uncertainty
Standard deviation
Sampling plan - If standard deviation does not exist, then the difference between the ____________, is more appropriate.
third and first quartiles of the uncertainty distribution
Sampling plan - If our uncertainty is described by an exponential family distribution, it will have how many parameters?
two
Determining questions to be asked - A key issue in designing the experiment is determining what?
The nature of the variable being assessed - e.g. categorical
What type of scale uses YES/NO or multiple-choice questions?
Nominal scales
For ordinal scales, it is possible to define the normalized quantity for each response x by the fraction of responses __________?
Less than or equal to x
For semantic differential survey responses of the form “very hard, somewhat hard, okay, somewhat easy, very easy”, where the two ends of the scale represent opposites, the response scale is _______?
Ordinal
What survey approach asks individuals to rate various factors in order of importance?
Rank-order
Determining a control group - measurements are typically only meaningful if there is reference to some kind of _____________?
Underlying standard
When the item is an uncertain quantity, the score of an item is the probability of the item outranking a randomly chosen item from the __________?
Benchmark group
The benchmark group is commonly referred to as a _____ with the item’s score being called its _______?
control; effect size
The purpose of extraction is to collect all this data from the many sources so that it can eventually be loaded into a common ________.
database
In extracting data, it is critical to know the _______ from which each data element was taken.
data source
What is it called when a change in the client's analysis makes it important to transition the database to reflect the data sources the new clients consider important?
traceability - and typically requires careful documentation
What are three reasons why survey quality may be deficient?
- Respondents get fatigued and put in any value
- Respondents may be offended by questions and deliberately fill in false answers
- Respondents refuse to fill out the survey
Data cleaning involves the following 6 items:
- Identifying the range of valid responses
- Identifying invalid data responses
- Identifying inconsistent data encodings
- Identifying suspicious data responses
- Identifying suspicious distribution of values
- Identifying suspicious interrelationships between fields.
A key part of data cleaning is determining whether the data makes sense, and also involves handling _______.
Null or missing values
What are four possible solutions to missing values?
- Deletion
- Deletion when necessary
- Imputing a value
- Randomly imputing a value
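One of the "imputing a value" strategies above can be sketched as a hypothetical helper that substitutes the mean of the observed values:

```python
def impute_mean(values):
    """Replace None (missing) entries with the mean of the observed
    values - one simple instance of the 'imputing a value' strategy."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]
```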
What are the 10 “Cs” checks on quality of the data?
- Completeness
- Correctness
- Consistency (is data under a given field consistent with definition of that field?)
- Currency (is data obsolete?)
- Collaborative (is data based on one opinion or a consensus of experts?)
- Confidential
- Clarity (is data legible and comprehensible)
- Common format
- Convenient (can data be conveniently and quickly accessed)
- Cost-effective (is cost of collecting data commensurate with its value).
A data warehouse is generally used to describe these three things:
- A staging area
- Data integration in centralized source
- Access layers in OLAP data marts
Data marts are organized along a single point of view for efficient data retrieval. It allows analysts to do these 5 things:
- Slice data (filtering)
- Dice data (grouping)
- Drill down
- Roll-up
- pivot
What are three examples of fact tables?
- Transaction fact tables
- Snapshot fact tables (at point in time)
- Accumulating fact tables (aggregate facts)
Do dimension tables have a larger or smaller number of records compared to fact tables?
smaller
What are 5 examples of dimension tables?
- time
- geography
- product
- employee
- range
Discovering relationships in data - what are 5 methods to reduce dimensions in the data?
- PCA or factor analysis (can determine if there is correlation across different dimensions)
- Term frequency-inverse document frequency (TF-IDF)
- Feature hashing (creating fixed number of features)
- Sensitivity analysis and wrapper methods
- Self-organizing maps and Bayes nets
When data has a variable number of features, _________ is an efficient method of creating a fixed number of features which form the indices of an array.
Feature hashing
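A minimal sketch of the idea, with `hash_features` and the bucket count as illustrative choices:

```python
def hash_features(tokens, n_buckets=8):
    """Feature hashing: map a variable number of tokens into a
    fixed-length count vector indexed by hash(token) % n_buckets."""
    vec = [0] * n_buckets
    for tok in tokens:
        vec[hash(tok) % n_buckets] += 1
    return vec
```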
For unstructured text data, __________ identifies the importance of a word in a document in a collection by comparing the frequency with which the word appears in the document…
term frequency-inverse document frequency (TF-IDF)
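A bare-bones version of the TF-IDF score (function and variable names are illustrative; real implementations add smoothing):

```python
import math

def tf_idf(term, doc, corpus):
    """tf-idf: term frequency within the document, scaled by the
    inverse document frequency across the collection."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df)
    return tf * idf

# A word in every document scores 0; a word unique to one scores higher.
corpus = [["cat", "sat"], ["cat", "ran"], ["cat", "hid"]]
everywhere = tf_idf("cat", corpus[0], corpus)
distinctive = tf_idf("sat", corpus[0], corpus)
```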
_______ and _______ are typically essential when you don’t know which features of your data are important.
Sensitivity analysis and wrapper methods
Wrapper methods, unlike sensitivity analysis, typically involve identifying a set of features on a small sample and then testing that set on a ________.
holdout sample
________ and _______ are helpful in understanding the probability distribution of the data.
Self-organizing maps and Bayes nets
Extracting features - ________ is required to ensure your data stays within common ranges.
Normalization
Format conversion is typically required when data is in __________?
binary format
Fast Fourier transforms and discrete wavelet transforms are used for _________?
frequency data
Coordinate transformations are used for geometric data defined over ________?
Euclidean space
Collecting and summarizing data - These three plots provide compact representations of how data is distributed?
- Box plots
- Scatter plots
- box and whisker plots
Collecting and summarizing data - when the data can be reasonably described by parametric distributions, ___________ is an even more efficient way of summarizing data.
distribution fitting
Collecting and summarizing data - ___________ aggregation is an effective way of summarizing all the information available on an entity
Baseball card
Adding new information to the data - ________ is recommended for tracking source information and other user-defined parameters.
Annotation
Adding new info to the data - ____________ and _______ can be helpful in processing certain data fields together or in using one field to compute the value of another.
Relational algebra rename and feature addition
What are the 6 methods for segmenting data to find natural groupings?
- Connectivity-based methods (hierarchical clustering)
- Centroid-based methods
- Distribution-based methods
- Density-based method
- Graph-based methods
- Topic modeling (text data)
segmentation -A connectivity-based method called _________ generates an ordered set of clusters with variable precision.
Hierarchical clustering
segmentation - A centroid-based method with a known number of clusters
K-means clustering
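A minimal one-dimensional k-means sketch showing the assign/update loop; the naive initialization and function name are assumptions for illustration:

```python
def kmeans_1d(points, k, iters=20):
    """Minimal 1-D k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    centroids = points[:k]  # naive init: assumes first k points differ
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)
```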
segmentation - A centroid-based method with an unknown number of clusters.
x-means clustering
segmentation - A centroid-based method that is an alternate way of enhancing k-means when the number of clusters is unknown
canopy clustering
segmentation - A distribution-based method that typically uses the expectation-maximization (EM) algorithm and is appropriate if you want any data elements’ membership in a segment to be ‘soft’
Gaussian mixture models
segmentation - Two density-based methods used for non-elliptical clusters are _________?
fractal clustering and DBSCAN
segmentation - _________ methods are often based on constructing cliques and semi-cliques, and are useful when you only have knowledge of how one item is connected to another.
Graph-based methods
segmentation - For text data, this method allows for segmentation of the data.
topic modeling
variable importance - When the structure of the data is unknown, these methods are helpful.
tree-based methods
variable importance - If statistical measures of importance are needed, these models are appropriate.
Generalized linear models
variable importance - if statistical measures of importance are NOT needed, these two methods are useful.
- regression with shrinkage (e.g. Lasso or elastic net)
- stepwise regression
classifying data into groups - These two methods are helpful if you’re unsure of feature importance.
- neural nets
- random forests
classifying data into groups - If you require a highly transparent model, this type of model can be preferable.
decision trees (e.g. CART, CHAID)
classifying data into groups - What method should you use if the number of data dimensions is less than 20?
k nearest neighbor methods
classifying data into groups - If you have a large dataset with an unknown classification, what method should you use?
Naive Bayes
classifying data into groups - These models are useful in estimating an unobservable state based on observable values.
Hidden Markov models
Refining BP and AP statements- You may find at this point that the true _______ of the system isn’t what you thought it was, and that therefore the analytics problem needs to be reframed around the newly surfaced constraint.
constraint
APF - When reformulating the “what” of the business problem into the “how” of the analytics problem, what are the four questions you need to ask?
- What result do we want?
- Who will act?
- What will they do?
- What will change in the organization as a result of the new information generated?
APF - This formal method of decomposition is a rigorous process that maps the translation of requirements from one level to the next (i.e. from business level to the first analytics level)
quality function deployment
APF - If you are formally decomposing and parsing a complex business statement, or less formally brainstorming with a project sponsor, it is critical to account for these two types of requirements.
Tacit and Formal
APF - This is the best known model for decomposing and parsing requirements.
Kano’s requirements model
APF - Kano’s requirements model distinguishes between unexpected customer delights, known customer requirements, and customer ________ that are not explicitly stated.
must-haves
APF - When you ask business stakeholders for a list of requirements, they will tend to focus on the “normal” requirements not the _______ requirements.
expected
APF - Your _________ functions are strongly related to your assumptions about what is important about this problem as well as the key metrics by which you’ll measure the organizational response to the problem.
input/output functions
APF - Once you have inputs and general sense of their predicted effects, what is the next step?
communicate them to the team
APF - What are two simple approaches for communicating back to the team?
- Input table
- black box sketch
APF - Key business metrics need to be negotiated, published, committed to, and _______
tracked
APF - the output of the stakeholder agreement will vary by organization, but should include the following 5 items:
- budget
- timeline
- interim milestones
- goals
- any known effort that is excluded as out of scope
APF - translation of problems from business domain to analytics domain requires that all parties agree to __________
definitions and terms
APF - Requirements should be these three things:
- unitary (no conjunctions such as and, but, or)
- positive
- testable
APF - __________ is the act of breaking down a higher-level requirement to multiple lower-level requirements.
decomposition
Methodology - Almost all analytical methods can be classified into one of these three categories
- Descriptive
- Predictive
- Prescriptive
Methodology - Generally speaking, this type of model answers the question “what is the best action or outcome?”
prescriptive
Methodology - three types of prescriptive techniques are:
- Optimization
- Simulation-Optimization
- Stochastic Optimization
Methodology - 7 types of Optimization techniques:
- Linear programming
- Integer programming
- non-linear programming
- Mixed integer programming
- Network optimization
- Dynamic programming
- Metaheuristics
Methodology - These types of methodologies include any forecasting models such as time-series models, moving averages, and auto-regression models. Answers the question “What could happen?”
predictive models
Methodology - List 7 types of predictive models:
- Simulation
- Regression
- Statistical inferences
- Classification
- Clustering
- Artificial Intelligence
- Game Theory
Methodology - List three types of simulation techniques:
- Discrete event
- Monte Carlo
- Agent-based modeling
Methodology - List 4 types of statistical inference techniques:
- Confidence intervals
- Hypothesis testing
- Analysis of variance
- Design of experiments
Methodology - Descriptive methodologies can be conveyed through these 2 methods:
- Charts and graphs
- numerical presentations (mean, median, mode, etc)
Methodology - These techniques answer the question “What happened?”
Descriptive
Methodology - Prescriptive analytics evaluates and determines new ways to operate, targets business objectives, and balances __________.
constraints
Methodology - What are the 7 primary factors that an analyst generally considers to select an appropriate methodology?
- Time
- Accuracy of the model
- Relevance of the methodology and scope of project
- Accuracy of the data
- Data availability and readiness
- Staff and resource availability
- Methodology popularity
Methodology - ________ methods are most helpful when there is a need to pinpoint certain decisions to the level of quantifying the variables that enhance the performance under study.
Prescriptive
Methodology - common methods - This type of method is often used to understand bottlenecks in systems, handles cases that cannot be handled by queueing theory, and is often used to model multistage processes with variation in arrivals and service times that use shared resources to perform multiple operations.
Discrete event simulation
Methodology - common methods - This method is designed to identify the most efficient pathway to a solution; for example, it might identify the number of tellers needed to satisfy customers within a particular time frame, such as no more than 10 minutes waiting.
Queuing model
Methodology - common methods - This method is used primarily to estimate the randomness of a dependent variable from the randomness of a set of independent variables. It is necessary when the distributions of the input variables are not normal and the relationship used to estimate the dependent variable is not simple (e.g. additive). Use when a queueing model is not needed.
Monte Carlo simulation
Methodology - common methods - This method simulates a collection of autonomous decision-making entities and is used to discover emergent behavior that is hard to predict without simulation.
Agent-based modeling (ABM)
Methodology - common methods - This is a simulation approach used to understand the interaction of a complex system over time.
System dynamics (SD)
Methodology - common methods - This is the study of strategic decision-making processes through competition and collaboration
Game theory
Methodology - common method (econ) - Discount rate used in capital budgeting to compare returns on investment opportunities.
IRR - Internal rate of return
Methodology - common method (econ) - Difference between the present value of income and the present value of outgo
NPV - Net present value
Methodology - common method (econ) - Value of a future event or item based on current value that is adjusted by some standard
FV - Future value
Methodology - common method (econ) - Period of time after which an expenditure is fully amortized and income begins to accrue in excess of expense
Payback period
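The NPV card above can be made concrete with a short sketch; `npv` is an illustrative helper, and the discounting convention (cash flow at time t divided by (1 + rate)^t) is the standard one:

```python
def npv(rate, cash_flows):
    """Net present value: sum of discounted cash flows.
    cash_flows[0] is the time-0 amount, usually a negative outlay."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows))
```

A project's IRR is the rate at which this function returns zero, which ties the two cards together.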
Methodology - common methods - A class of statistical methods used to map dependent variables to independent variables and to understand the significance of the variables and their correlations.
Regression
Methodology - common methods - method of model building that successively adds or deletes variables based on performance
Stepwise regression
Methodology - common methods - a regression analysis often used to predict the outcome of categorical variables
logistic regression
Methodology - common methods - What are two types of statistical inferences:
- Confidence intervals
- Hypothesis testing
Methodology - what are three types of AI models?
- Artificial neural networks
- Fuzzy logic
- Expert systems
Methodology - common methods - What is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.
Markov chains
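The defining property - the next-state distribution depends only on the current one - can be sketched as a single matrix-vector step (names are illustrative):

```python
def markov_step(dist, transition):
    """One step of a Markov chain: the next-state distribution depends
    only on the current distribution and the transition matrix."""
    n = len(transition)
    return [sum(dist[i] * transition[i][j] for i in range(n))
            for j in range(n)]

# Starting surely in state 0 with rows [stay, leave] probabilities.
P = [[0.9, 0.1],
     [0.5, 0.5]]
after_one = markov_step([1.0, 0.0], P)
```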
Methodology - ________ mapping requires more aggregate data compared to a discrete-event simulation model.
Value-stream
Methodology - Lower level aggregation is more accurate and descriptive, but is harder to ______ and will certainly lead to more mistakes.
validate
Methodology - Higher level aggregation usually provides _______ results that are easier to understand.
faster
Methodology - The general rule of thumb is to model at the highest level of aggregation possible that will ensure a satisfactory level of _______
accuracy
Methodology - It is often advisable to run scenarios on the “back of the envelope” often referred to as QnD
quick and dirty
Methodology - products that specialize in visualization, optimization, simulation, data mining, and statistical analysis are ________
software tools
Methodology - After the model is developed, the _______ step refers to making certain that the model is built the way it was designed and meant to be.
verification
Methodology - After the model is developed, the _______ step refers to making certain that the model is representing real-life to a certain level of accuracy.
validation
Methodology - To help the testing process, it is advisable to divide data into these three portions:
- Building
- Testing
- Validation
Model building - A logistic regression, a decision tree, and a neural network can all predict a ______ target.
binary - (this is NOT typically done a priori. Instead you might identify several types of models, fit them all, and select a champion)
Model building - It should be possible to perform ______ in a real-time production environment where specialized analytics software might not be available.
Scoring
Model building - A predictive model should always be selected using an honest assessment of the model on ______ data.
holdout
Model building - A model that will be used to select the “top x%” from a sample should be assessed using a metric that evaluates the rank order of predicted values such as these three things:
- concordance
- discordance
- ROC/c-statistic
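Concordance (the c-statistic for a binary target) can be sketched directly from its definition as the fraction of event/non-event pairs the model ranks correctly; `concordance` is an illustrative name:

```python
def concordance(pairs):
    """pairs = [(predicted_score, actual_label)], labels 0/1.
    Fraction of (event, non-event) pairs where the event case got
    the higher score, counting ties as half."""
    events = [p for p, y in pairs if y == 1]
    nonevents = [p for p, y in pairs if y == 0]
    wins = sum(1 for e in events for n in nonevents if e > n)
    ties = sum(1 for e in events for n in nonevents if e == n)
    return (wins + 0.5 * ties) / (len(events) * len(nonevents))
```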
Model building - validation assessment techniques can vary and include the following 3 types:
- data splitting
- k-fold cross validation
- leave-one-out cross validation
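A k-fold split can be sketched as a simple index partition (leave-one-out is just the special case k = n); the round-robin assignment and function name are illustrative choices:

```python
def k_fold_indices(n, k):
    """Partition indices 0..n-1 into k folds for cross-validation;
    each observation appears in exactly one validation fold."""
    folds = [[] for _ in range(k)]
    for i in range(n):
        folds[i % k].append(i)
    return folds
```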
Model building - Honest validation assessment - It is critical that the observations used to fit the model and estimate parameters are not observations that are ________ in the assessment.
scored
Model building - honest assessment with data splitting on binary target - You must select a large sample of data for modeling; for a binary target, a good practice is to ensure that you have at least _______ observations in the smaller of the two classes.
2000
Model building - honest assessment - use stratified random sampling without replacement to create two data sets with approximately the same proportion of ______ target levels.
0 and 1
Model building - Fit models and estimate parameters using the _______ data.
training
Model building - using assessment statistic, score observations in the _______ data set.
validation
Model building - If the model uses stop training, pruning, or model selection without stopping rules, then those selections should always be based on the ________ data performance
validation
Model building - selection - you might select the champion based on a combination of model performance and ________
interpretability - (models like neural networks might not be selected because they are difficult to interpret, but might be used as a benchmark against which other models are compared)
Model building - Segmentation through clustering, rule generation through market basket/association analysis, deriving links among nodes through social network analysis, measurement of latent variables through common factor analysis are all examples of:
unsupervised techniques
Model building - Techniques for validating unsupervised analyses are not as straightforward and typically rely on the analyst's _________
best judgement
BPF - Popular way to frame business opportunity or problem is to obtain reliable info on the 5 Ws. What are they?
- Who are the stakeholders
- What problem are we trying to solve
- Where does the problem occur
- When does the problem occur
- Why does the problem occur
BPF - Of the 5 Ws, which is the most critical to the long term success of the project
Who - stakeholders
BPF - In determining if problem is amenable to analytics solution, most important question is can the organization accept and ______ the answer.
deploy
BPF - If there is no feasible way forward, the ethical analyst will notify who?
stakeholders
BPF - After initial analysis, it may be necessary to refine the problem statement to make it more accurate, more appropriate to the stakeholders, or _______
more amenable to available analytic tools/methods
BPF - It will be necessary to define constraints. These constraints could be any of these three:
- analytical
- financial
- political
BPF - If an optimization problem has a large number of constraints, it may need to be restated with fewer constraints and/or a less complex _________.
objective function
BPF - List 4 types of potential constraints:
- Desired accuracy and repeatability
- Program cost
- Timeframe
- Number of stakeholders impacted
BPF - After problem statement is set, you define business benefits which can be quantitative or qualitative. This is also known as the _______
business case
overall - What are the 5 E’s that are the pillars of the Certified Analytics Professional?
- Ethics
- Education
- Experience
- Examination
- Effectiveness
Model Building - what are the 4 overall objectives in the Model Building phase?
- Identify and build effective model structures
- Run and evaluate
- Calibrate models and data
- Integrate the models
Deployment - what are the two methods used for deployment?
- CRISP-DM (cross industry standard process for data mining)
- DMAIC (6 sigma - define, measure, analyze, improve, control)
Deployment - What are the four steps for CRISP-DM deployment?
- Planning deployment - your methods for integrating data mining discoveries into use
- Planning monitoring and maintenance
- Reporting final results
- Reviewing final results
Deployment - After deployment, it is necessary to ensure your answer is still tied to the original question. However, discrepancies can creep in. It is common for business context to have changed, which can invalidate key ________?
Assumptions
Deployment - For organizations to accept the results of the process, those results must be integral and acknowledged as having _______?
Integrity (not just what senior mgmt wants to hear)
Deployment - What are the 2 key items to consider as a model becomes the basis for an organization taking action?
- Plan the deployment
- Plan monitoring and maintenance
Deployment - When surveying key stakeholders' use of the model, pay attention to functional areas where the model is being ignored - this will tell you where key assumptions have been invalidated; use that as a way to ____________ the model.
Strengthen and update
Life cycle mgmt - A good lifecycle process helps with the following 3 items:
- Keeps the process orderly
- Minimizes cost and effort
- Provides business users with clear roles
Life cycle mgmt - An effective process requires defining the roles of the various departments involved and the _______ process that will be used to iron out differences and make decisions.
Governance
Life cycle mgmt - For the model to be trusted it has to be _______.
Repeatable
Life cycle mgmt - Documentation should include the following 6 items:
- Key assumptions made about the business context and analytics problem
- Data sources and schema
- Methods used to clean and harmonize the data
- Model approach and model review artifacts
- Documentation for any software code written
- Recommendations for future improvements to the model
Life cycle mgmt - Evaluation criteria should be created up front both in terms of the business results expected and the ______ and ______ expected from the model.
Accuracy and Confidence
Life cycle mgmt - What are 5 useful model evaluation criteria?
- Value of the model in terms of the business
- Does the model discover/predict something that is new and useful?
- Is the model reliable across a wide range of data?
- Can a “lift” or “gain” graph be constructed to show how well the model is predicting?
- Check if the model’s predictions on unknown data vs. train/test data
Life cycle mgmt - When the model quality starts to decay, it is time for the next step of _______ the model and rechecking its _______.
Recalibrating, assumptions
Life cycle mgmt - The results of the model should be tracked over the long term because a model may degrade if either of these 2 things happen:
- input data changes
- user requirements change
Life cycle mgmt - If there has been a fundamental change in a key assumption or two, then the project needs to be….
revalidated against the business problem (to see if the overall approach is still valid)
Life cycle mgmt - One of the keys to a successful analytics project or engagement is appropriate _______ for the users of the model and its results
training
Life cycle mgmt - One way to demonstrate business benefits of a model is to compare how your organization is doing against industry _______ during the time period in question.
benchmarks