CAP Study Guide Flashcards

1
Q

What are the seven CAP domains?

A
  1. Frame the business problem, 2. Frame the analytics problem, 3. Data, 4. Select methodology / approach, 5. Build model, 6. Deploy solution, 7. Model lifecycle
2
Q

The five E’s are

A

ethics, education, experience, examination, and effectiveness

3
Q

One popular way to frame a business opportunity or problem is to obtain reliable information on

A

the five W’s: who, what, where, when, and why

4
Q

Five W’s: Who

A

are the stakeholders who satisfy one or more of the following with respect to the project: funding, using, creating, or affected by the project’s outcome?

5
Q

Five W’s: What

A

problem/function is the project meant to solve/perform?

6
Q

Five W’s: When

A

does the problem occur, or the function need to be performed? When does the project need to be completed?

7
Q

Five W’s: Where

A

does the problem occur? Or where does the function need to be performed? Are the physical and spatial characteristics articulated?

8
Q

Five W’s: Why

A

does the problem occur, or function need to occur?

9
Q

After the initial analysis, it may be necessary to

A

refine the problem statement to make it more accurate, more appropriate to the stakeholders, or more amenable to available analytic tools/methods.

10
Q

In framing the analytics problem, one danger we’re trying to avoid is

A

“anchoring.”

11
Q

What is “anchoring”?

A

People have a tendency to hang on to views that they’ve seen and held before, even if they are incorrect.

12
Q

How can you help mitigate the anchoring effect?

A

Remind the team that assumptions are initial and preliminary rather than finalized views.

13
Q

Decomposition

A

the act of breaking down a higher-level requirement into multiple lower-level requirements

14
Q

A requirement should be

A

unitary (no conjunctions such as “and,” “but,” or “or”), positive, and testable

15
Q

What is EDA?

A

Exploratory data analysis

16
Q

DBSCAN stands for

A

Density-based spatial clustering of applications with noise

17
Q

DBSCAN is a _____-based _____

A

density-based clustering algorithm

18
Q

DBSCAN is one of the _____ algorithms

A

most common clustering algorithms

19
Q

DBSCAN works by

A

grouping together points that are closely packed (points with many nearby neighbors) and marking as outliers the points that lie alone in low-density regions (whose nearest neighbors are too far away).
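
The grouping rule above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the parameter names `eps` (neighborhood radius) and `min_pts` (neighbors needed to count as "closely packed") follow common usage but do not come from the card, and the sample points are made up.

```python
import math

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)        # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1               # provisionally noise; may become a border point
            continue
        cluster += 1                     # i is a core point, so it starts a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster      # noise reachable from a core point joins as border
            if labels[j] is not None:
                continue                 # already assigned, nothing more to do
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:     # j is also a core point, keep expanding
                queue.extend(nbrs)
    return labels

# Hypothetical data: two dense squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # the isolated point gets label -1
```

The brute-force neighbor search keeps the sketch short; real implementations usually rely on a spatial index.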

20
Q

R squared is a statistic that will give some information about

A

the goodness of fit of a model.

21
Q

In regression, the R squared coefficient of determination is a statistical measure of

A

how well the regression line approximates the real data points.

22
Q

R squared is also known as

A

Coefficient of determination

23
Q

An R squared of 1 indicates that

A

the regression line perfectly fits the data.

24
Q

Low R-squared values are

A

not always bad and high R-squared values are not always good!

25
Q

In a normal distribution, _____ percent of the data values are within one standard deviation of the mean.

A

68%

26
Q

In a normal distribution, _____ percent of the data values are within two standard deviations of the mean.

A

95%

27
Q

In a normal distribution, _____ percent of the data values are within three standard deviations of the mean.

A

99.7%
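
The 68 / 95 / 99.7 figures in the last three cards can be checked against the standard normal distribution. A small sketch using only the standard library; the function name `within_k_sigma` is an illustrative choice.

```python
import math

def within_k_sigma(k: float) -> float:
    """Probability that a normal variable lies within k standard deviations of its mean."""
    # For a standard normal Z, P(|Z| < k) = erf(k / sqrt(2)).
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {within_k_sigma(k):.2%}")
```

The exact values are slightly different from the rounded figures on the cards (for example, about 95.45% rather than 95% for two standard deviations).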

28
Q

Conjoint analysis

A

is a statistical technique used in market research to determine how people value different attributes (feature, function, benefits) that make up an individual product or service.

29
Q

Goodness of fit

A

how well a model's predicted values match the observed data; it is computed with statistical methods such as the chi-square test or the coefficient of determination

30
Q

R-squared =

A

Explained variation / Total variation
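
As a worked illustration of this ratio, the sketch below fits a least-squares line to a small made-up data set and computes R-squared both as explained / total variation and as 1 - residual / total variation. The helper names and the data are assumptions for the example only.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for a single predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    return slope, mean_y - slope * mean_x

def r_squared(y, y_hat):
    """Explained variation divided by total variation."""
    mean_y = sum(y) / len(y)
    explained = sum((p - mean_y) ** 2 for p in y_hat)
    total = sum((v - mean_y) ** 2 for v in y)
    return explained / total

# Hypothetical data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
slope, intercept = fit_line(x, y)
y_hat = [intercept + slope * a for a in x]

r2 = r_squared(y, y_hat)
# For a least-squares fit with an intercept, the same value equals
# 1 - (residual variation / total variation).
mean_y = sum(y) / len(y)
r2_alt = 1 - sum((v - p) ** 2 for v, p in zip(y, y_hat)) / sum((v - mean_y) ** 2 for v in y)
print(round(r2, 4), round(r2_alt, 4))
```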

31
Q

R-squared is always between

A

0 and 100% (for a least-squares linear regression with an intercept, measured on the data used to fit it)

32
Q

adjusted R-squared is a modified version of R-squared that has been adjusted for

A

the number of predictors in the model.
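
The card does not give the formula; the usual textbook form is adjusted R-squared = 1 - (1 - R²)(n - 1) / (n - p - 1), where n is the number of observations and p the number of predictors. A minimal sketch under that assumption:

```python
def adjusted_r_squared(r2: float, n: int, p: int) -> float:
    """Adjusted R-squared for n observations and p predictors (intercept not counted)."""
    if n - p - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Same R-squared, more predictors: the adjusted value drops.
print(adjusted_r_squared(0.60, n=50, p=1))
print(adjusted_r_squared(0.60, n=50, p=10))
```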

33
Q

When you add a predictor to a model, the R-squared

A

increases or stays the same, even if the improvement is due to chance alone. It never decreases.

34
Q

If a model has too many predictors and higher order polynomials, it begins to

A

model the random noise in the data. This condition is known as overfitting.

35
Q

The predicted R-squared indicates

A

how well a regression model predicts responses for new observations. This statistic helps you determine when the model fits the original data but is less capable of providing valid predictions for new observations.

36
Q

In time series analysis, the Box–Jenkins method

A

applies autoregressive moving average (ARMA) or autoregressive integrated moving average (ARIMA) models to find the best fit of a time-series model to past values of a time series.

37
Q

ARMA or ARIMA

A

ARMA: autoregressive moving average; ARIMA: autoregressive integrated moving average

38
Q

Canopy Clustering

A

is a very simple, fast and surprisingly accurate method for grouping objects into clusters. Each object is represented as a point in a multidimensional feature space. The algorithm uses a fast approximate distance metric and two distance thresholds, T1 > T2, for processing.
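
A minimal sketch of how the two thresholds are commonly used, assuming the usual formulation: points within the loose threshold T1 of a canopy center join that canopy, and points within the tight threshold T2 are removed from further consideration. Euclidean distance stands in here for the "fast approximate distance metric", and the data are made up.

```python
import math

def canopy_clustering(points, t1, t2, dist=math.dist):
    """Return canopies as lists of point indices; a point may appear in several canopies."""
    if not t1 > t2:
        raise ValueError("the loose threshold T1 must exceed the tight threshold T2")
    remaining = list(range(len(points)))
    canopies = []
    while remaining:
        center = remaining.pop(0)            # next unprocessed point starts a canopy
        members = [center]
        still_remaining = []
        for i in remaining:
            d = dist(points[center], points[i])
            if d < t1:
                members.append(i)            # loosely close: belongs to this canopy
            if d >= t2:
                still_remaining.append(i)    # not tightly close: may seed or join others
        canopies.append(members)
        remaining = still_remaining
    return canopies

# Hypothetical one-dimensional data.
points = [(0,), (1,), (3,), (10,), (11,)]
canopies = canopy_clustering(points, t1=4, t2=2)
print(canopies)
```

Canopies can overlap, which is why the result is normally handed to a second, more precise clustering step rather than used as the final grouping.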

39
Q

Canopy clustering is often used as a preprocessing step for

A

the K-means algorithm or the hierarchical clustering algorithm.

40
Q

Canopy clustering is intended to speed up clustering operations on

A

large data sets

41
Q

Hierarchical clustering involves creating clusters that have

A

a predetermined ordering from top to bottom.

42
Q

In the hierarchical clustering divisive method

A

we assign all of the observations to a single cluster and then partition that cluster into the two least similar clusters. We then proceed recursively on each cluster until there is one cluster for each observation.

43
Q

In the hierarchical clustering agglomerative method

A

we (1) assign each observation to its own cluster, (2) compute the similarity (e.g., distance) between each pair of clusters, and (3) join the two most similar clusters. Steps 2 and 3 are repeated until there is only a single cluster left.
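
The agglomerative procedure can be sketched as below. The card does not say how the distance between two clusters is measured; this sketch assumes single linkage (distance between the closest members), and the `n_clusters` stopping parameter is an addition for the example.

```python
import math

def agglomerative(points, n_clusters=1):
    """Single-linkage agglomerative clustering; returns (clusters, merge history)."""
    clusters = [[i] for i in range(len(points))]   # each observation starts alone
    merges = []
    while len(clusters) > n_clusters:
        best = None
        # Compare every pair of clusters by the distance between their closest members.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((list(clusters[a]), list(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]    # join the two most similar clusters
        del clusters[b]
    return clusters, merges

# Hypothetical one-dimensional data.
points = [(0,), (1,), (5,), (6,), (20,)]
clusters, merges = agglomerative(points, n_clusters=2)
print(clusters)
for left, right, d in merges:
    print(left, "+", right, "at distance", d)
```

The merge history is what a dendrogram draws; running with `n_clusters=1` continues until a single cluster remains.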

44
Q

subjective probability

A

summary measure of an individual’s beliefs about whether an event occurs

45
Q

stratified random sampling

A

Intentionally sampling from subpopulations to reduce sampling error for low frequency groups
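
A minimal sketch of the idea, assuming a fixed number of draws per subpopulation (proportional allocation is another common choice). The function name, the `per_stratum` parameter, and the data are hypothetical.

```python
import random

def stratified_sample(rows, key, per_stratum, seed=0):
    """Draw up to per_stratum rows at random from each subpopulation."""
    rng = random.Random(seed)
    strata = {}
    for row in rows:
        strata.setdefault(key(row), []).append(row)   # group rows by stratum
    sample = []
    for name in sorted(strata):
        group = strata[name]
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample

# Hypothetical population: 95 "common" rows and 5 "rare" rows.
rows = [("common", i) for i in range(95)] + [("rare", i) for i in range(5)]
sample = stratified_sample(rows, key=lambda r: r[0], per_stratum=5)
print(sum(1 for r in sample if r[0] == "rare"), "rare rows in a sample of", len(sample))
```

A simple random sample of 10 rows from this population would often contain no rare rows at all; sampling within each stratum guarantees the low-frequency group is represented.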

46
Q

When the variable is ratio scale use

A

Box-Cox transformations
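
For reference, the one-parameter Box-Cox transform is usually written as (y^λ - 1) / λ for λ ≠ 0 and ln(y) for λ = 0, and it requires strictly positive values. A small sketch of that formula; the choice of λ is left to the caller here, whereas in practice it is typically estimated from the data.

```python
import math

def box_cox(y: float, lam: float) -> float:
    """One-parameter Box-Cox transform of a strictly positive value."""
    if y <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(y)          # limiting case as lambda approaches 0
    return (y ** lam - 1) / lam

values = [1.0, 4.0, 9.0, 16.0]
print([box_cox(v, 0.5) for v in values])   # square-root-like transform
print([round(box_cox(v, 0), 4) for v in values])   # log transform
```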

47
Q

Blue, red, white is an example of

A

Categorical

48
Q

Fully agree, partially agree, neutral, partially disagree, fully disagree is an example of

A

Likert-type

49
Q

Very hard, somewhat hard, OK, somewhat easy, very easy is an example of

A

Semantic differential

50
Q

Rate factors in order of importance is an example of

A

Rank-order

51
Q

30 degrees, 40 degrees, 50 degrees is an example of

A

Interval

52
Q

Data quality check for Completeness:

A

Are all the fields of the data complete?

53
Q

Data quality check for Correctness:

A

Is the data accurate?

54
Q

Data quality check for Consistency:

A

Is the data provided under a given field and for a given concept consistent with the definition of that field and concept?

55
Q

Data quality check for Currency:

A

Is the data obsolete?

56
Q

Data quality check for Collaborative:

A

Is the data based on one opinion or on a consensus of experts in the relevant area?

57
Q

Data quality check for Confidential:

A

Is the data secure from unauthorized use by individuals other than the decision maker?

58
Q

Data quality check for Clarity:

A

Is the data legible and comprehensible?

59
Q

Data quality check for Common Format:

A

Is the data in a format easily used in the application for which it is intended?

60
Q

Data quality check for Convenient:

A

Can the data be conveniently and quickly accessed by the intended user in a time-frame that allows for it to be effectively used?

61
Q

Data quality check for Cost-effective:

A

Is the cost of collecting and using the data commensurate with its value?

62
Q

Data warehouse

A

Staging area, centralized data, access layers (multiple OLAP data marts)

63
Q

Data mart

A

Organized along a single point of view (e.g. time, product type, geography) for efficient data retrieval

64
Q

Data mart: slice data

A

filtering data by picking a specific subset of the data cube and choosing a single value for one of its dimensions

65
Q

Data mart: dice data

A

grouping data by picking specific values for multiple dimensions

66
Q

Data mart: drill-down/up

A

allow the user to navigate among levels of data, from the most summarized (drill-up) to the most detailed (drill-down)

67
Q

Data mart: roll-up

A

summarize the data along a dimension (e.g., computing totals or using some other formula)

68
Q

Data mart: pivot

A

interchange rows and columns (“rotate the cube”).
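
The slice, dice, roll-up, and pivot operations from the cards above can be illustrated on a tiny in-memory "cube". This is only a sketch: the records, dimension names, and helper functions are made up, and a real data mart would run these operations in an OLAP engine or SQL.

```python
# Hypothetical fact table: three dimensions (year, region, product) and one measure (units).
sales = [
    {"year": 2023, "region": "East", "product": "A", "units": 10},
    {"year": 2023, "region": "West", "product": "A", "units": 7},
    {"year": 2023, "region": "East", "product": "B", "units": 4},
    {"year": 2024, "region": "East", "product": "A", "units": 12},
    {"year": 2024, "region": "West", "product": "B", "units": 9},
]

def slice_(rows, dim, value):
    """Slice: fix a single value for one dimension."""
    return [r for r in rows if r[dim] == value]

def dice(rows, **allowed):
    """Dice: keep rows whose values fall in the chosen sets for several dimensions."""
    return [r for r in rows if all(r[d] in vals for d, vals in allowed.items())]

def roll_up(rows, dim, measure="units"):
    """Roll-up: total the measure along one dimension."""
    totals = {}
    for r in rows:
        totals[r[dim]] = totals.get(r[dim], 0) + r[measure]
    return totals

def pivot(rows, row_dim, col_dim, measure="units"):
    """Pivot: build a two-way table; swapping row_dim and col_dim rotates it."""
    table = {}
    for r in rows:
        cells = table.setdefault(r[row_dim], {})
        cells[r[col_dim]] = cells.get(r[col_dim], 0) + r[measure]
    return table

print(slice_(sales, "year", 2023))
print(dice(sales, region={"East"}, product={"A"}))
print(roll_up(sales, "region"))
print(pivot(sales, "year", "region"))
print(pivot(sales, "region", "year"))   # same numbers, rows and columns interchanged
```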

69
Q

Wrapper methods

A

identify a set of features on a small sample and then test that set on a holdout sample.

70
Q

Stochastic

A

Situations or models containing a random element, hence unpredictable and without a stable pattern or order. All natural events are stochastic phenomena.

71
Q

Often used to understand bottleneck(s) in systems

A

Discrete event simulation

72
Q

Handles cases that cannot be handled by queuing theory

A

Discrete event simulation

73
Q

Often used for multistage processes modeling with variations in their arrivals and service time and utilizing shared resources to perform multiple operations

A

Discrete event simulation

74
Q

Designed to identify the most efficient pathway to a solution; e.g., at a bank it might identify the number of tellers needed to satisfy customers within a particular time frame, such as no more than 10 minutes of waiting.

A

Queuing model

75
Q

Monte Carlo simulation can be used if

A

Queuing modeling is not needed

76
Q

Used primarily to estimate the randomness of a dependent variable from the randomness of a set of independent variables. This is especially useful when the input variables are not necessarily normally distributed and the relationship used to estimate the dependent variable is not simple (e.g., not additive).

A

Monte Carlo simulation
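
A minimal sketch of the idea: two non-normal inputs are combined through a multiplicative (non-additive) relationship, and the spread of the output is estimated by repeated random draws. The profit model, the distributions, and all numbers are invented for illustration.

```python
import random
import statistics

def simulate_profit(n_trials=20_000, seed=42):
    """Draw random inputs and return the resulting profit for each trial."""
    rng = random.Random(seed)                     # fixed seed so runs are repeatable
    profits = []
    for _ in range(n_trials):
        units = rng.triangular(50, 150, 80)       # skewed demand: low, high, mode
        unit_cost = rng.uniform(4.0, 7.0)         # flat (uniform) cost uncertainty
        profits.append(units * (10.0 - unit_cost))  # price of 10 is an assumption
    return profits

profits = simulate_profit()
mean_profit = statistics.mean(profits)
cuts = statistics.quantiles(profits, n=20)        # 19 cut points at 5% steps
p5, p95 = cuts[0], cuts[-1]
print(f"mean profit ~ {mean_profit:.0f}, 5th to 95th percentile ~ {p5:.0f} to {p95:.0f}")
```

The output distribution is read directly from the simulated values, so no assumption about its shape is needed.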

77
Q

A simulation approach used to understand the interactions of a complex system over time.

A

System dynamics (SD)

78
Q

Study of strategic decision-making processes through competition and collaboration.

A

Game theory.

79
Q

The likelihood of a particular event occurring expressed as a percentage to make decisions under chosen risk or tolerance.

A

Probability

80
Q

Binomial distribution

A

In probability theory and statistics, the binomial distribution with parameters n and p is the discrete probability distribution of the number of successes in a sequence of n independent yes/no experiments, each of which yields success with probability p. A success/failure experiment is also called a Bernoulli experiment or Bernoulli trial; when n = 1, the binomial distribution is a Bernoulli distribution.
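
The definition corresponds to the probability mass function P(X = k) = C(n, k) p^k (1 - p)^(n - k). A short sketch; the function name is an illustrative choice.

```python
import math

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials with success probability p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

# Ten fair coin flips: chance of exactly five heads.
print(binomial_pmf(5, 10, 0.5))
# With n = 1 this reduces to the Bernoulli distribution.
print(binomial_pmf(1, 1, 0.3), binomial_pmf(0, 1, 0.3))
```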

81
Q

Gamma distribution

A

distribution that arises naturally in processes for which the waiting times between events are relevant. In particular, the arrival times in the Poisson process have gamma distributions, and the chi-square distribution in statistics is a special case of the gamma distribution. Also, the gamma distribution is widely used to model physical quantities that take positive values.

82
Q

Poisson distribution

A

a discrete frequency distribution that gives the probability of a number of independent events occurring in a fixed time.
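
In formula form, the probability of observing k events when the expected count in the interval is λ is λ^k e^(-λ) / k!. A short sketch, with the rate of 3 events per interval chosen only as an example:

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k events in an interval with expected count lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

lam = 3.0   # hypothetical average number of events per interval
for k in range(6):
    print(k, round(poisson_pmf(k, lam), 4))
```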

83
Q

ROC stands for

A

receiver operating characteristic

84
Q

ROC curve is a

A

graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied.

85
Q

ROC curve is created by

A

plotting the true positive rate against the false positive rate at various threshold settings
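
The construction can be sketched by sweeping the threshold over the observed scores and recording one (false positive rate, true positive rate) pair per threshold. The scores and labels below are made up, and the helper assumes both classes are present.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, strictest threshold first."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]                         # threshold above every score
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

# Hypothetical classifier scores and true classes (1 = positive).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Plotting these pairs with the false positive rate on the horizontal axis gives the ROC curve.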

86
Q

ROC true-positive rate is also known as

A

sensitivity in signal detection and biomedical informatics, or recall in machine learning

87
Q

The ROC false-positive rate is also known as

A

fall-out

88
Q

ROC True negative rate is also known as

A

Specificity

89
Q

ROC False negative rate is also known as

A

Miss rate

90
Q

ROC Type I error:

A

False positive

91
Q

ROC Type II error:

A

False negative
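
The four rates named in the preceding cards all come from the same confusion-matrix counts. A small sketch with invented counts; the function and key names are illustrative.

```python
def confusion_rates(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Rates derived from true/false positive/negative counts."""
    return {
        "true_positive_rate": tp / (tp + fn),    # sensitivity, recall
        "false_positive_rate": fp / (fp + tn),   # fall-out; share of Type I errors among negatives
        "true_negative_rate": tn / (tn + fp),    # specificity
        "false_negative_rate": fn / (fn + tp),   # miss rate; share of Type II errors among positives
    }

# Hypothetical counts: 60 actual positives, 100 actual negatives.
rates = confusion_rates(tp=40, fp=10, tn=90, fn=20)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

The true positive and false negative rates sum to 1, as do the true negative and false positive rates.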

92
Q

Quadrupling the number of individuals sampled

A

reduces the uncertainty by half
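
This follows from the standard error of a sample mean, sigma / sqrt(n): multiplying n by 4 divides the standard error by 2. A quick check, assuming "uncertainty" here means the standard error and using a made-up population standard deviation:

```python
import math

def standard_error(sigma: float, n: int) -> float:
    """Standard error of the mean for a sample of size n."""
    return sigma / math.sqrt(n)

sigma = 10.0                          # hypothetical population standard deviation
se_small = standard_error(sigma, 100)
se_large = standard_error(sigma, 400)
print(se_small, se_large, se_large / se_small)
```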

93
Q

Model honest assessment: For a binary target, a good practice is to ensure that you have at least

A

2000 observations in the smaller of the two target classes.

94
Q

Lift

A

a measure of the effectiveness of a predictive model calculated as the ratio between the results obtained with and without the predictive model
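
One common way to compute this ratio is to divide the response rate among the cases the model selects by the response rate of the whole population. A sketch with invented campaign numbers; the function and argument names are assumptions.

```python
def lift(responders_targeted: int, size_targeted: int,
         responders_total: int, size_total: int) -> float:
    """Response rate with the model divided by the baseline response rate."""
    targeted_rate = responders_targeted / size_targeted
    baseline_rate = responders_total / size_total
    return targeted_rate / baseline_rate

# Hypothetical campaign: the model's top 100 prospects contain 30 responders,
# while the full list of 10,000 contains 500 responders.
top_group_lift = lift(30, 100, 500, 10_000)
print(round(top_group_lift, 2))
```

A lift above 1 means the model's selection responds better than a random selection of the same size.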

95
Q

Stochastic dominance

A

The concept arises in decision theory and decision analysis in situations where one gamble (a probability distribution over possible outcomes, also known as prospects) can be ranked as superior to another gamble for a broad class of decision-makers. It is based on shared preferences regarding sets of possible outcomes and their associated probabilities. Only limited knowledge of preferences is required for determining dominance.

96
Q

Select approaches or software that deal with data accuracy at

A

same or better level than your data accuracy (e.g. +/- 10% if your data is less than +/- 20%)

97
Q

If you can’t explain precisely with exact numbers you may consider using

A

Fuzzy logic

98
Q

Model maintenance is necessary when

A

Underlying assumptions change

99
Q

Local or neighborhood searches

A

take a potential solution to a problem and check its immediate neighbors (that is, solutions that are similar except for one or two minor details) in the hope of finding an improved solution. Local search methods have a tendency to become stuck in suboptimal regions or on plateaus where many solutions are equally fit.

100
Q

Tabu search

A

is a metaheuristic search method employing local search methods used for mathematical optimization. Tabu search enhances the performance of local search by relaxing its basic rule. First, at each step worsening moves can be accepted if no improving move is available (as when the search is stuck at a strict local minimum). In addition, prohibitions (hence the term tabu) are introduced to discourage the search from coming back to previously visited solutions.
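
The difference from plain local search can be shown on a toy problem: minimizing over positions in a short list where neighbors are the adjacent positions. This is a simplified sketch, with a fixed-length list of recently visited positions as the tabu list; the `tenure` and `max_iters` parameters and the data are assumptions.

```python
from collections import deque

def local_search(values, start):
    """Move to the better neighbor until no neighbor improves; may stop at a local minimum."""
    current = start
    while True:
        neighbors = [i for i in (current - 1, current + 1) if 0 <= i < len(values)]
        best_neighbor = min(neighbors, key=lambda i: values[i])
        if values[best_neighbor] >= values[current]:
            return current
        current = best_neighbor

def tabu_search(values, start, tenure=3, max_iters=20):
    """Always take the best non-tabu neighbor, even if it is worse than the current position."""
    current = best = start
    tabu = deque([start], maxlen=tenure)          # recently visited positions are forbidden
    for _ in range(max_iters):
        neighbors = [i for i in (current - 1, current + 1)
                     if 0 <= i < len(values) and i not in tabu]
        if not neighbors:
            break                                  # every move is prohibited
        current = min(neighbors, key=lambda i: values[i])
        tabu.append(current)
        if values[current] < values[best]:
            best = current                         # remember the best position seen so far
    return best

# Hypothetical objective: a local minimum at index 1, the global minimum at index 5.
values = [5, 3, 4, 6, 2, 1, 7]
stuck_at = local_search(values, start=0)
found = tabu_search(values, start=0)
print("local search stops at index", stuck_at, "with value", values[stuck_at])
print("tabu search finds index", found, "with value", values[found])
```

Because the tabu list blocks the move back to index 0 or 1, the search is pushed over the hill at index 3 and reaches the lower region beyond it.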

101
Q

Ant colony optimization

A

is a probabilistic technique for solving computational problems which can be reduced to finding good paths through graphs.