CAP Study Guide Flashcards
What are the seven CAP domains?
1. Frame the business problem, 2. Frame the analytics problem, 3. Data, 4. Select methodology / approach, 5. Build model, 6. Deploy solution, 7. Model lifecycle
The five E’s are
ethics, education, experience, examination, and effectiveness
One popular way to frame a business opportunity or problem is to obtain reliable information on
the five W’s: who, what, where, when, and why
Five W’s: Who
are the stakeholders who satisfy one or more of the following with respect to the project: funding, using, creating, or affected by the project’s outcome?
Five W’s: What
problem/function is the project meant to solve/perform?
Five W’s: When
does the problem occur, or the function need to be performed? When does the project need to be completed?
Five W’s: Where
does the problem occur? Or where does the function need to be performed? Are the physical and spatial characteristics articulated?
Five W’s: Why
does the problem occur, or function need to occur?
After the initial analysis, it may be necessary to
refine the problem statement to make it more accurate, more appropriate to the stakeholders, or more amenable to available analytic tools/methods.
In framing the analytics problem, one danger we’re trying to avoid is
“anchoring.”
What is “anchoring”?
People have a tendency to hang on to views that they’ve seen and held before, even if those views are incorrect.
How can you help mitigate the anchoring effect?
Remind the team that assumptions are initial and preliminary rather than finalized views.
Decomposition
the act of breaking down a higher-level requirement to multiple lower-level requirements
A requirement should be
unitary (no conjunctions such as “and,” “but,” or “or”), positive, and testable
What is EDA?
Exploratory data analysis
DBSCAN stands for
Density-based spatial clustering of applications with noise
DBSCAN is a _____-based _____
density-based clustering algorithm
DBSCAN is one of the _____ algorithms
most common clustering algorithms
DBSCAN works by
grouping together points that are closely packed (points with many nearby neighbors) and marking as outliers the points that lie alone in low-density regions (whose nearest neighbors are too far away).
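The grouping rule on the card above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the names `dbscan`, `region_query`, `eps` (neighborhood radius), and `min_pts` (minimum neighbors for a dense point) are chosen here for the example, and the brute-force neighbor search is only suitable for tiny data sets.

```python
from math import dist

NOISE = -1  # label used for outliers


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not dense
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # dense point: keep expanding
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The two tight squares become clusters 0 and 1, and the isolated point (5, 5) is marked as noise because it has no neighbors within `eps`.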
R squared is a statistic that will give some information about
the goodness of fit of a model.
In regression, the R squared coefficient of determination is a statistical measure of
how well the regression line approximates the real data points.
R squared is also known as
Coefficient of determination
An R squared of 1 indicates that
the regression line perfectly fits the data.
Low R-squared values are
not always bad and high R-squared values are not always good!
In a normal distribution, _____ percent of the data values are within one standard deviation of the mean.
68%
In a normal distribution, _____ percent of the data values are within two standard deviations of the mean.
95%
In a normal distribution, _____ percent of the data values are within three standard deviations of the mean.
99.7%
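The three percentages above are rounded values. As a quick check, the share of a normal distribution within k standard deviations of the mean equals erf(k / sqrt(2)); the helper name `share_within` below is made up for this sketch.

```python
from math import erf, sqrt


def share_within(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return erf(k / sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} sd: {share_within(k):.4f}")
# within 1 sd: 0.6827
# within 2 sd: 0.9545
# within 3 sd: 0.9973
```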
Conjoint analysis
is a statistical technique used in market research to determine how people value different attributes (feature, function, benefits) that make up an individual product or service.
Goodness of fit
how well a statistical model fits a set of observations. It is assessed with statistical measures such as the chi-square test or the coefficient of determination.
R-squared =
Explained variation / Total variation
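A small worked example of the ratio above, assuming a one-predictor least-squares line with an intercept. The data and the helper names `fit_line` and `r_squared` are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for one predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(xs, ys):
    """Explained variation divided by total variation."""
    slope, intercept = fit_line(xs, ys)
    fitted = [intercept + slope * x for x in xs]
    mean_y = sum(ys) / len(ys)
    explained = sum((f - mean_y) ** 2 for f in fitted)  # variation captured by the line
    total = sum((y - mean_y) ** 2 for y in ys)          # variation in the raw data
    return explained / total


xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(round(r_squared(xs, ys), 3))  # 0.6
```

Here the fitted line is y = 2.2 + 0.6x, the explained variation is 3.6, and the total variation is 6, so the line accounts for 60% of the variation in y.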
R-squared is always between
0% and 100% (for an ordinary least-squares fit with an intercept)
adjusted R-squared is a modified version of R-squared that has been adjusted for
the number of predictors in the model.
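The usual adjustment formula is 1 - (1 - R²)(n - 1) / (n - p - 1), where n is the number of observations and p the number of predictors. A minimal sketch, with a made-up function name and made-up input values:

```python
def adjusted_r_squared(r2, n_obs, n_predictors):
    """Penalize R-squared for the number of predictors in the model."""
    if n_obs - n_predictors - 1 <= 0:
        raise ValueError("need more observations than predictors + 1")
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# One predictor, R-squared of 0.60 on 5 observations.
print(round(adjusted_r_squared(0.60, n_obs=5, n_predictors=1), 3))  # 0.467
# A second predictor nudges R-squared up to 0.61, but the adjusted value drops.
print(round(adjusted_r_squared(0.61, n_obs=5, n_predictors=2), 3))  # 0.22
```

The second call shows why the adjusted value is useful when comparing models of different sizes: a tiny gain in R-squared does not offset the cost of the extra predictor.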
When you add a predictor to a model, the R-squared
increases or stays the same, even if the improvement is due to chance alone. It never decreases.
If a model has too many predictors and higher order polynomials, it begins to
model the random noise in the data. This condition is known as overfitting.
The predicted R-squared indicates
how well a regression model predicts responses for new observations. This statistic helps you determine when the model fits the original data but is less capable of providing valid predictions for new observations.
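One common way to compute predicted R-squared is leave-one-out: remove each observation in turn, refit the model, predict the removed point, and sum the squared prediction errors (the PRESS statistic). Predicted R-squared is then 1 - PRESS / total variation. The sketch below assumes a one-predictor least-squares line; the function names and data are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for one predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predicted_r_squared(xs, ys):
    """1 - PRESS / total variation, using leave-one-out refits."""
    mean_y = sum(ys) / len(ys)
    total = sum((y - mean_y) ** 2 for y in ys)
    press = 0.0
    for i in range(len(xs)):
        # Refit without observation i, then predict it.
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (intercept + slope * xs[i])) ** 2
    return 1 - press / total


xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(predicted_r_squared(xs, ys))
```

For this data the ordinary R-squared is 0.6, and the predicted value comes out lower because every point is predicted by a line that never saw it. A large gap between the two is a sign of overfitting.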
In time series analysis, the Box–Jenkins method
applies autoregressive moving average (ARMA) or autoregressive integrated moving average (ARIMA) models to find the best fit of a time-series model to past values of a time series.
ARMA and ARIMA stand for
autoregressive moving average (ARMA) and autoregressive integrated moving average (ARIMA)
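Two of the building blocks behind these names can be shown in plain Python: differencing (the "integrated" step in ARIMA) and a first-order autoregressive fit. This is only a partial sketch under simplifying assumptions. It has no moving-average term and no model-selection step, so it is not the full Box–Jenkins procedure, and the function names are made up for the example.

```python
def difference(series):
    """First difference: each value minus the one before it (ARIMA with d = 1)."""
    return [b - a for a, b in zip(series, series[1:])]


def fit_ar1(series):
    """Least-squares estimate of c and phi in x[t] = c + phi * x[t-1] + noise."""
    prev, curr = series[:-1], series[1:]
    mean_prev = sum(prev) / len(prev)
    mean_curr = sum(curr) / len(curr)
    sxx = sum((p - mean_prev) ** 2 for p in prev)
    sxy = sum((p - mean_prev) * (c - mean_curr) for p, c in zip(prev, curr))
    phi = sxy / sxx
    return mean_curr - phi * mean_prev, phi


# Differencing turns a trending series into its step-to-step changes.
print(difference([1, 3, 6, 10]))  # [2, 3, 4]

# A noise-free series generated by x[t] = 1 + 0.5 * x[t-1], starting at 0.
series = [0.0]
for _ in range(8):
    series.append(1 + 0.5 * series[-1])
c, phi = fit_ar1(series)
print(round(c, 6), round(phi, 6))  # 1.0 0.5
```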
Canopy Clustering
is a very simple, fast and surprisingly accurate method for grouping objects into clusters. Each object is represented as a point in a multidimensional feature space. The algorithm uses a fast approximate distance metric and two distance thresholds, T1 > T2, for processing.
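A minimal sketch of how the two thresholds are typically used: pick a remaining point as a canopy center, add every point closer than T1 to that canopy, and drop every point closer than T2 from further consideration. Manhattan distance stands in here for the "fast approximate distance metric"; that choice, the function names, and the sample data are assumptions made for this example.

```python
def cheap_distance(a, b):
    """Manhattan distance, used here as an inexpensive stand-in metric."""
    return sum(abs(x - y) for x, y in zip(a, b))


def canopy_clusters(points, t1, t2):
    """Return a list of (center, members) canopies using thresholds t1 > t2."""
    if t1 <= t2:
        raise ValueError("t1 (loose threshold) must be greater than t2 (tight threshold)")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)
        members = [center]
        still_remaining = []
        for p in remaining:
            d = cheap_distance(center, p)
            if d < t1:
                members.append(p)          # loosely close: joins this canopy
            if d >= t2:
                still_remaining.append(p)  # not tightly close: may join or seed others
        remaining = still_remaining
        canopies.append((center, members))
    return canopies


points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (5, 5)]
for center, members in canopy_clusters(points, t1=4, t2=2):
    print(center, members)
# (0, 0) [(0, 0), (1, 0), (0, 1)]
# (10, 10) [(10, 10), (11, 10)]
# (5, 5) [(5, 5)]
```

Points that fall between T2 and T1 of a center stay in the pool, so canopies can overlap. That overlap is what lets a later, more expensive algorithm restrict its distance calculations to points sharing a canopy.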
Canopy clustering is often used as a preprocessing step for
the K-means algorithm or the hierarchical clustering algorithm.
Canopy clustering is intended to speed up clustering operations on
large data sets