CAP definitions Flashcards

1
Q

What is cross correlation?

A

A measure of similarity between two different sequences, computed as a function of the lag of one relative to the other.

2
Q

What is autocorrelation?

A

Degree of similarity between values of the same variable at different observations, for example a time series and a lagged copy of itself.

3
Q

What is spatial autocorrelation?

A

Degree of similarity between nearby observations: error terms across cross-sectional data are correlated.

4
Q

What is serial autocorrelation?

A

Degree of similarity between successive observations: error terms across time series data are correlated.

5
Q

What is OLAP?

A

Online Analytical Processing: uses complex queries to analyze aggregated historical data from OLTP systems, and is associated with data warehouses. Operations include roll-up, drill-down, slice and dice, pivoting, drill-through, and drill-across. OLAP data cubes can be mapped to any number of dimensions.

6
Q

What is OLTP?

A

Online Transaction Processing: captures, stores, and processes data from transactions in real time. Faster than OLAP, because each transaction is a short, simple read or write.

7
Q

Monte Carlo simulation

A

Technique that allows people to account for risk in quantitative analysis and decision making by repeatedly sampling uncertain inputs at random. Developing a cumulative probability distribution for each uncertain input is a necessary step.
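
A minimal sketch of the idea in Python, assuming a hypothetical project whose total cost is the sum of two uncertain inputs. The triangular ranges below are made-up illustrations, not real data. Each trial samples the inputs at random, and the sorted results form an empirical cumulative distribution from which risk figures can be read.

```python
import random

def simulate_costs(trials, seed=0):
    """Sample the total cost `trials` times; all input ranges are hypothetical."""
    rng = random.Random(seed)
    totals = []
    for _ in range(trials):
        labor = rng.triangular(80, 150, 100)     # low, high, most likely
        materials = rng.triangular(40, 90, 50)
        totals.append(labor + materials)
    return sorted(totals)  # sorted samples act as an empirical CDF

def prob_at_most(sorted_totals, budget):
    """Empirical cumulative probability that the total cost is <= budget."""
    return sum(1 for t in sorted_totals if t <= budget) / len(sorted_totals)

costs = simulate_costs(10_000)
print(prob_at_most(costs, 180))  # estimated chance of staying within 180
```

Raising the number of trials tightens the estimate; the seed is fixed only so that the run is repeatable.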

8
Q

snowball sampling

A

Survey sampling in which subjects are recruited through referrals from other survey respondents.

9
Q

quota sampling

A

Method for selecting survey participants that is a non-probabilistic version of stratified sampling. Participants are chosen from specific groups until a set quota for each group is filled.

10
Q

judgement sampling

A

Non-probability sampling in which participants are selected based on the researcher’s judgement.

11
Q

strata sampling

A

Stratified random sampling: the population is divided into strata (subgroups) and a random sample is drawn from each stratum.

12
Q

Central limit theorem

A

The central limit theorem (CLT) states that the distribution of sample means approximates a normal distribution as the sample size gets larger, regardless of the population’s distribution. Sample sizes equal to or greater than 30 are often considered sufficient for the CLT to hold.
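
The claim can be checked with a small simulation. This is a sketch, assuming an exponential population with mean 1 (a strongly skewed distribution chosen purely for illustration): the means of many samples of size 30 cluster around the population mean, with a spread close to 1 / sqrt(30).

```python
import random
import statistics

def sample_means(sample_size, n_samples, seed=0):
    """Draw n_samples samples from an exponential population and return their means."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(sample_size))
        for _ in range(n_samples)
    ]

means = sample_means(sample_size=30, n_samples=5_000)
print(statistics.fmean(means))   # close to the population mean, 1.0
print(statistics.stdev(means))   # close to 1 / sqrt(30)
```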

13
Q

When do you reject null hypothesis?

A

When the p-value (probability) is less than alpha (the level of significance); alpha is typically 0.05. Equivalently, the null hypothesis is rejected when the calculated test statistic is more extreme than the critical (tabulated) value.

14
Q

What is alpha in statistics?

A

The level of significance: the probability of rejecting the null hypothesis when it is actually true.

15
Q

When do you fail to reject null hypothesis?

A

When p-value is above alpha

16
Q

p-value

A

A p-value, or probability value, is the probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true.

17
Q

r squared

A

R-squared is a statistical measure of fit that indicates how much variation of a dependent variable is explained by the independent variable(s) in a regression model. R-squared evaluates the scatter of the data points around the fitted regression line. This statistic indicates the percentage of the variance in the dependent variable that the independent variables explain collectively.
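
As a rough illustration, R-squared can be computed as one minus the ratio of residual variation to total variation. The sketch below assumes the fitted values are already available; the numbers are invented for the example.

```python
def r_squared(actual, predicted):
    """Share of the variance in `actual` explained by `predicted`."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # unexplained
    ss_tot = sum((a - mean) ** 2 for a in actual)                  # total
    return 1 - ss_res / ss_tot

actual = [1, 2, 3, 4]
predicted = [1.1, 1.9, 3.2, 3.8]   # hypothetical fitted values
print(r_squared(actual, predicted))
```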

18
Q

paired t-test

A

Used when we are interested in the difference between two measurements on the same subject. Cannot be applied to two independent samples. Data is in the form of matched pairs. Parametric test.

19
Q

one-sample t-test

A

Used to determine whether the mean of a single sample differs significantly from a known or hypothesized population mean. If the sample size is less than 30 and the population variance is unknown, the one-sample t-test is the appropriate statistic to test the hypothesis.

20
Q

two sample t-test

A

Used to determine if two population means are equal. Used to test if a new process is superior to a current process.

21
Q

one sample z-test

A

Used when we want to know whether our sample comes from a particular population with a known variance. If N is greater than 30, the z-test is commonly used even when the variance is unknown, with the sample variance as an estimate.
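
A minimal sketch of a one-sample z-test, assuming the population standard deviation is known. The sample figures are hypothetical, and the normal CDF is built from the standard library's error function.

```python
import math

def one_sample_z_test(sample_mean, pop_mean, pop_sd, n):
    """Return (z statistic, two-sided p-value) for a one-sample z-test."""
    z = (sample_mean - pop_mean) / (pop_sd / math.sqrt(n))
    # Standard normal CDF evaluated at |z|, via the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return z, 2 * (1 - cdf)

z, p = one_sample_z_test(sample_mean=103, pop_mean=100, pop_sd=15, n=100)
alpha = 0.05
print(z, p, "reject H0" if p < alpha else "fail to reject H0")
```

Here z is 2.0 and the p-value is about 0.046, so the null hypothesis is rejected at alpha = 0.05.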

22
Q

f-test

A

An “F test” is a catch-all term for any test that uses the F-distribution. In most cases, when people talk about the F-test, what they are actually talking about is the F-test to compare two variances.

23
Q

chi-squared test

A

Measures the difference between observed and expected values. Used on categorical data. Chi-square test uses the observed and expected frequency of categorical data from the contingency table. Other tests are based on mean and variance of the data.
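
The comparison of observed and expected frequencies can be sketched as follows, assuming a small hypothetical contingency table. Expected counts come from the row and column totals under the assumption of independence.

```python
def chi_squared_statistic(observed):
    """Chi-squared statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (obs - expected) ** 2 / expected
    return stat

table = [[20, 30], [30, 20]]   # hypothetical counts
print(chi_squared_statistic(table))
```

The statistic is then compared against the chi-squared distribution with (rows - 1) x (columns - 1) degrees of freedom.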

24
Q

Mann-Whitney test (U test)

A

Used to test if two samples came from same population. Involves the calculation of a statistic, called U, whose distribution under the null hypothesis is known. Non-parametric test to compare outcomes between two independent groups. Alternative to two sample t-test.

25
Q

Kruskal-Wallis test

A

non-parametric method to test whether samples originate from same distribution. Used for comparing two or more independent samples of equal or different sample sizes. Alternative to ANOVA.

26
Q

Wilcoxon signed-rank test

A

Non-parametric test used either to test the location of a population based on a sample, or to compare the locations of two populations using a set of matched samples. Does not assume the data is normally distributed. Alternative to the paired-sample t-test.

27
Q

Sign test

A

Method to test for consistent differences between pairs of observations, such as weight of subjects before and after treatment. Alternative to one sample t-test

28
Q

cross tabulation test

A

Generally for categorical data on 2 or more dimensions to store the frequency of the data. Method to quantitatively analyze the relationship between multiple variables.

29
Q

interval level data

A

Data type measured along a scale in which adjacent values are an equal distance apart. Does not have a fixed (true) zero point; temperature in degrees Celsius is an example.

30
Q

ratio level data

A

Quantitative data with the same properties as interval data, plus an equal and definitive ratio between values and an absolute “zero” treated as the point of origin. Has a fixed zero point.

31
Q

ordinal level data

A

has a predetermined or natural order. Non metric data (categorical)

32
Q

nominal level data

A

classified without a natural order or rank. Non metric data (categorical)

33
Q

Scalar data

A

Contain a single value. Can be continuous or categorical.

34
Q

reference data

A

only used when working with complex data structures. Subset of master data used for classification.

35
Q

Parametric statistics

A

Can be applied to continuous data. Used when the data are assumed to follow a normal distribution. Used on quantitative data measured on approximately interval or ratio scales.

36
Q

OGIVE

A

OGIVE is a graph of cumulative distribution showing data value on the horizontal axis and either the cumulative frequencies, or the cumulative relative frequencies, or the cumulative percent frequencies on the vertical scales.

37
Q

Stem and leaf display

A

Used for exploratory data analysis. Easy to construct and can provide more info within a class interval than any other methods.

38
Q

Inter-quartile range

A

Measures difference between first and third quartile. Middle 50% of the data.

39
Q

box plot

A

The upper limit is 1.5 times the interquartile range above the third quartile, and the lower limit is 1.5 times the interquartile range below the first quartile. Points beyond the upper and lower limits are considered outliers. Important for exploratory analysis.
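
A short sketch of the outlier rule, with the limits placed 1.5 times the interquartile range beyond the first and third quartiles. It assumes the "inclusive" quartile convention of Python's `statistics.quantiles`; other conventions give slightly different quartiles. The data values are made up.

```python
import statistics

def iqr_fences(data):
    """Return the (lower, upper) limits used to flag outliers in a box plot."""
    q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1  # spread of the middle 50% of the data
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

values = [2, 4, 4, 5, 6, 7, 8, 9, 30]
low, high = iqr_fences(values)
outliers = [v for v in values if v < low or v > high]
print(low, high, outliers)  # -2.0 14.0 [30]
```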

40
Q

ANCOVA

A

analysis of covariance

41
Q

ANOVA

A

analysis of variance. Used for two or more groups and one dependent variable. Used to analyze the differences among means.

42
Q

MANOVA

A

Multivariate analysis of variance. Used for multiple groups and multiple dependent variables. There is no single residual per observation as in ANOVA: each observation has a vector of residuals, one per dependent variable.

43
Q

MANCOVA

A

Multivariate analysis of covariance

44
Q

Unimodal

A

distribution with one clear peak or most frequent value

45
Q

Bimodal

A

distribution with two clear peaks

46
Q

Multimodal

A

distribution with two peaks or more

47
Q

No modal

A

distribution with no peaks

48
Q

Little’s MCAR test

A

To assess whether data are missing completely at random (MCAR). Used in missing value analysis.

49
Q

Cronbach Alpha test

A

to determine internal consistency, how closely related a set of items are as a group. Common method to test survey reliability.

50
Q

Durbin-Watson test

A

Used to test for autocorrelation in the residuals from a regression analysis.
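
A minimal sketch of the statistic itself, assuming the regression residuals are already available in time order. Values near 2 suggest no first-order autocorrelation, values toward 0 suggest positive autocorrelation, and values toward 4 suggest negative autocorrelation. The residual series are invented.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squares."""
    num = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    den = sum(e ** 2 for e in residuals)
    return num / den

alternating = [1, -1, 1, -1, 1, -1]   # sign flips every step
trending = [1, 1, 1, -1, -1, -1]      # long runs of the same sign
print(durbin_watson(alternating))     # well above 2: negative autocorrelation
print(durbin_watson(trending))        # well below 2: positive autocorrelation
```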

51
Q

Shapiro Wilk Test

A

to evaluate whether the observations deviate from the normal curve (nonparametric test). Used to check normality of data.

52
Q

split half test

A

To test for survey reliability by splitting the items into two halves and correlating the scores on the two halves.

53
Q

test-retest

A

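To test for survey reliability by giving the same survey to the same respondents at two different times and correlating the results.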

54
Q

Hotelling T-Square

A

used for two groups and two or more dependent variables.

55
Q

Tukey HSD (honestly significant difference)

A

Post-hoc test for pairwise comparisons of group means. Controls the family-wise Type 1 error rate.

56
Q

Newman-Keuls test

A

stepwise multiple comparisons procedure used to identify sample means that are significantly different from each other.

57
Q

Conjoint analysis

A

Conjoint analysis maps consumer preference structures into mathematical tradeoffs. A market research approach well suited to measuring the value that consumers place on features of a product or service. Multivariate technique. Based on the premise that consumers evaluate the value of an object by combining the separate amounts of value provided by each attribute.

58
Q

Regression analysis

A

statistical process for estimating the relationship between a dependent variable and one or more independent variables. One important assumption while performing regression is that the errors are independent.

59
Q

Discriminant analysis

A

used by the researcher to analyze the research data when the criterion or dependent variable is categorical and the predictor or independent variable is interval.

60
Q

Confirmatory Factor Analysis (CFA)

A

CFA is usually used for Structural Equation Modelling (SEM)

61
Q

lift chart

A

graphically represents the improvement that a mining model provides when compared against a random guess. To see prediction accuracy lines for any individual value of the predictable attribute, you need not create a separate lift chart for each targeted value. A measure of the effectiveness of a predictive model calculated as the ratio between the results obtained with and without the predictive model.

62
Q

Project Mgmt - Network Analysis

A

Goal is to minimize total project duration

63
Q

SLIQ / SPRINT

A

Algorithms used for addressing scalability issues of decision trees construction from very large training sets

64
Q

BOAT

A

Bootstrapped optimistic algorithm for tree construction used for creating small samples of training set while constructing a tree

65
Q

cost / time tradeoff

A

Cost decreases linearly as time increases

66
Q

Correlation

A

The covariance of the two variables normalized by the standard deviation of each variable (that is, divided by the product of the two standard deviations).
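
A small sketch of the calculation, using the standard definition in which the covariance is divided by the product of the two standard deviations. The sample data are invented; the square root of the product of the two variances equals that product of standard deviations.

```python
import math

def covariance(x, y):
    """Sample covariance of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def correlation(x, y):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
print(covariance(x, y), correlation(x, y))  # 5.0 1.0
```

The covariance depends on the units of the data, while the correlation always falls between -1 and 1.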

67
Q

correlation coefficient -R-

A

Measures the strength and direction of the linear relationship between two variables in a correlation analysis; ranges from -1 to 1.

68
Q

shadow price

A

Shadow price is defined as the rate of change in the optimal objective function value with respect to the unit change in the availability of the resources.

69
Q

slack variable

A

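A non-negative variable added to a less-than-or-equal-to constraint to turn it into an equality. It enters the objective function with a coefficient of zero. In post-optimality (sensitivity) analysis, its value shows how much of a resource is left unused.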

70
Q

What is sensitivity analysis or post-optimality analysis

A

It measures the degree to which a solution responds to modifications of the elements of the analysis (such as objective function or coefficients). Study of knowing the effect on optimal solution of Linear Programming model due to variations in the input coefficients one at a time.

71
Q

binary integer programming problem

A

Requires each decision variable to take a value of either zero or one.

72
Q

Primal or dual problem

A

If either the primal or dual problem has an unbounded objective function value, the other problem has no feasible solution

73
Q

Economic Order Quantity

A

Order quantity that minimizes the total holding costs and ordering costs

74
Q

queuing theory

A

the study of the movement of people, objects, or information through a line

75
Q

Calling population

A

the population of potential customers in queuing theory

76
Q

data prep

A

Cleaning, formatting, and integration are all part of data prep. Division into training and testing is not part of initial data prep.

77
Q

4 regression model assumptions

A

1) The true relationship is linear 2) Errors are normally distributed 3) homoscedasticity - equal variance around the line 4) independence of observations - errors are independent

78
Q

How is exponential distribution characterized?

A

By only one parameter which is Lambda. It has a constant failure rate

79
Q

Information architecture

A

Data and information flow within an organization

80
Q

How does increase in size of data affect R^2?

A

Negligible impact on the adjusted R^2

81
Q

What is multi-collinearity?

A

When two or more regressors are highly correlated with each other; if they are perfectly correlated, there is perfect multi-collinearity.

82
Q

What does a box and whisker plot show?

A

Shows if data are skewed, and in which direction. Way to graphically display distribution. Ends of the box are the first and third quartile. Line in the box represents the median value. Whiskers extend to the min and max values, or possibly less if they do not include points identified as outliers.

83
Q

How can an under prediction be detected?

A

Bias. The bias measures the difference between the estimate and the right answer, including the direction of that difference. Depending on whether it’s positive or negative, it will show whether there is an over- or underestimate.

84
Q

Customer segmentation

A

Consists of dividing a customer base into groups of individuals that are similar in specific ways relevant to marketing. Two ways to do this are clustering and decision trees.

85
Q

Data accuracy % required

A

approach or software that deals with data at +/- 10% accuracy is preferred.

86
Q

Regression / Stepwise regression can be used for?

A

computing the most likely value for the missing value in a particular column or row.

87
Q

project scheduling problem can be formulated as what?

A

A linear programming problem

88
Q

What is standard normal form (or distribution) function?

A

μ = 0, σ = 1 (mean of 0 and standard deviation of 1)

89
Q

covariance

A

Covariance is a measure of how much two random variables vary together. If two random variables are independent, their covariance is necessarily 0; the converse does not hold.

90
Q

Increase of confidence level does what?

A

Increases the width of the confidence interval

91
Q

What are sequencing problems?

A

They are concerned with an appropriate selection of a sequence of jobs to be done on a finite number of service facilities (like machines). The selection of an appropriate order for finite number of different jobs to be done on a finite number of machines is called sequencing problem.

92
Q

What is linear programming?

A

A mathematical technique for maximizing or minimizing a linear function of several variables such as output or cost. (a class of optimization problems)

93
Q

What is duality in linear programming?

A

Duality implies that each linear programming problem can be analyzed in two different ways but would have equivalent solutions.

94
Q

What is Integer linear programming?

A

An integer programming (IP) problem is an LP problem in which the decision variables are further constrained to take integer values.

95
Q

What is goal programming?

A

A goal programming model seeks to simultaneously take into account multiple objectives or goals that are of concern. LP models consist of constraints and a single objective to be maximized or minimized. A Goal Programming model consists of constraints and a set of goals that are prioritized.

96
Q

What are Transportation Problems?

A

TP is a special kind of Linear Programming problem (LPP) in which goods are transported from a set of sources to a set of destinations subject to the supply and demand of the sources and destination respectively such that the total cost of the transportation is minimized.

97
Q

What are Assignment Problems?

A

An assignment problem is a particular case of transportation problem where the objective is to assign a number of resources to an equal number of activities. (workers and tasks)

98
Q

What are Sequencing Problems?

A

They are concerned with an appropriate selection of a sequence of jobs to be done on a finite number of service facilities (like machines). The selection of an appropriate order for finite number of different jobs to be done on a finite number of machines is called sequencing problem.

99
Q

What are Decision Trees?

A

Decision support tool that uses a tree-like model of decisions and their possible consequences. Non-parametric supervised learning method for classification and regression. Goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.

100
Q

What are PERT/CPM in Project Management?

A

Both are network-based project management techniques that show the flow and sequence of activities and events. PERT (Program Evaluation and Review Technique) is appropriate for projects where the time required to complete activities is not known. CPM (Critical Path Method) is apt for projects that are recurring in nature. PERT is a probabilistic method; CPM is deterministic. PERT is used more for uncertain activities, CPM for predictable ones.
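
A deterministic, CPM-style calculation can be sketched as a longest-path search through the activity network. The activities, durations, and precedence links below are hypothetical, and the sketch assumes positive durations and a network with no cycles.

```python
def critical_path(durations, predecessors):
    """Return (project_duration, critical_path) for an activity network.

    durations: activity -> time needed.
    predecessors: activity -> list of activities that must finish first.
    """
    finish = {}      # earliest finish time of each activity
    best_pred = {}   # predecessor that determines each earliest start

    def earliest_finish(act):
        if act not in finish:
            start, best = 0, None
            for p in predecessors.get(act, []):
                if earliest_finish(p) > start:
                    start, best = earliest_finish(p), p
            finish[act] = start + durations[act]
            best_pred[act] = best
        return finish[act]

    last = max(durations, key=earliest_finish)  # activity that finishes last
    path = []
    while last is not None:                     # walk back along the chain
        path.append(last)
        last = best_pred[last]
    return finish[path[0]], path[::-1]

durations = {"A": 3, "B": 2, "C": 4, "D": 1}
predecessors = {"C": ["A"], "D": ["B", "C"]}
print(critical_path(durations, predecessors))  # (8, ['A', 'C', 'D'])
```

The activities on the returned path have no slack: delaying any of them delays the whole project.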

101
Q

What are Inventory Control Models?

A

Models concerned with minimizing the total cost of inventory. Common models are EOQ (Economic Order Quantity), Inventory Production Quantity, and ABC Analysis. EOQ: the most prudent number of units a business should order to limit costs and maximize value when restocking. EOQ = sqrt(2DS/C) (D = annual demand, S = ordering cost per order, C = carrying cost per unit per year).
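
A worked sketch of the EOQ calculation, using the standard square-root form EOQ = sqrt(2DS/C). The demand and cost figures are hypothetical. The helper `total_cost` is an illustrative addition showing that the EOQ balances ordering cost against carrying cost.

```python
import math

def eoq(annual_demand, ordering_cost, carrying_cost):
    """Economic order quantity: sqrt(2 * D * S / C)."""
    return math.sqrt(2 * annual_demand * ordering_cost / carrying_cost)

def total_cost(q, annual_demand, ordering_cost, carrying_cost):
    """Annual ordering cost plus annual carrying cost for order size q."""
    return annual_demand / q * ordering_cost + q / 2 * carrying_cost

q = eoq(annual_demand=1000, ordering_cost=50, carrying_cost=4)
print(round(q, 2))                              # about 158.11 units per order
print(round(total_cost(q, 1000, 50, 4), 2))     # cost at the optimum
```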

102
Q

What is Queuing Theory?

A

The study of the movement of people, objects, or information through a line. A queuing model is constructed so that queue lengths and waiting time can be predicted.

103
Q

What is Replacement Theory?

A

In operations research, used in the decision-making process of replacing used equipment with a substitute.

104
Q

What are Markov Chains?

A

An important stochastic process. Assumes that the description of the present state fully captures all the information that could influence the future evolution of the process. Predicting traffic flows, communication networks, genetic issues, and queues are examples where Markov chains can be used to model performance. Markov processes are the basis for general stochastic simulation methods such as Markov chain Monte Carlo, which are used for simulating sampling from complex probability distributions. A Markov chain essentially consists of a set of transitions, which are determined by some probability distribution.
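
A toy simulation makes the "present state only" property concrete. The two-state weather chain and its transition probabilities below are invented for illustration; each step looks only at the current state.

```python
import random

# Hypothetical transition probabilities: current state -> next state.
TRANSITIONS = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def simulate(start, steps, seed=0):
    """Walk the chain for `steps` transitions and return the visited states."""
    rng = random.Random(seed)
    state, path = start, [start]
    for _ in range(steps):
        options = TRANSITIONS[state]  # depends only on the current state
        state = rng.choices(list(options), weights=list(options.values()))[0]
        path.append(state)
    return path

path = simulate("sunny", 10_000)
print(path.count("sunny") / len(path))  # long-run share of sunny days
```

For these probabilities the long-run share of sunny days settles near 2/3, whatever the starting state.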

105
Q

What is Dynamic Programming (DP)?

A

An algorithmic technique for solving a problem by recursively breaking it down into simpler subproblems. When we see a recursive solution that has repeated calls for the same inputs, use dynamic programming. The idea is to simply store the results of the subproblems.
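
The Fibonacci numbers are a common illustration. The sketch below uses Python's built-in cache to store subproblem results, so each value is computed only once instead of exponentially many times.

```python
from functools import lru_cache

@lru_cache(maxsize=None)   # stores the result of each subproblem
def fib(n):
    """n-th Fibonacci number, with repeated subproblems served from the cache."""
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

print(fib(50))  # 12586269025
```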

106
Q

What are Optimization Methods?

A

Used to find solutions that maximize or minimize some study parameters, such as minimize costs in the production of a good or service, maximize profits, minimize raw material in development of a good, or maximize production

107
Q

What is Frequent Pattern Mining?

A

Same as association rules mining – analytical process that finds frequent patterns, associations.

108
Q

What is multivariate analysis?

A

A statistical procedure for analysis of data involving more than one type of measurement or observation. It may also mean solving problems where more than one dependent variable is analyzed simultaneously with other variables.

109
Q

What is Qualitative forecasting?

A

A method of making predictions using judgement from experts. Is necessary at times but can lead to bias due to recency or personal worldview

110
Q

What is Probability Theory and Bayes Theorem?

A

Describes probability of event based on prior knowledge of conditions that might be related to the event. Bayesian inference is a method used to update the probability for a hypothesis as more evidence or information becomes available.
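
A minimal sketch of a Bayesian update, assuming a hypothetical diagnostic test. The prior, the true-positive rate, and the false-positive rate are made-up numbers. Feeding the posterior back in as the new prior shows how the probability is revised as more evidence arrives.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(hypothesis | positive evidence) by Bayes' theorem."""
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

after_one = posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
after_two = posterior(prior=after_one, likelihood=0.95, false_positive_rate=0.05)
print(round(after_one, 3), round(after_two, 3))
```

With these numbers, one positive result raises the probability from 1% to roughly 16%, and a second positive result raises it to roughly 78%.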

111
Q

What is non-parametric testing?

A

Also called a distribution-free test; does not assume that the underlying data follow a particular distribution.

112
Q

What is net working capital?

A

The difference between a company’s current assets (cash, accounts receivable, inventory) and its current liabilities (accounts payable and debts). NWC formula: NWC = Current Assets - Current Liabilities; a narrower operating version is NWC = Accounts Receivable + Inventory - Accounts Payable.

113
Q

What is Operating Margin?

A

The ratio of operating income to net sales. OM formula: Operating margin = (Operating income / Revenue)

114
Q

What is Payback Period?

A

The amount of time it takes to recover the cost of an investment or the length of time it takes an investor to break-even. PP formula: Payback period = amount to be invested / estimated annual net cash flow

115
Q

What is Net Present Value?

A

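Method used to determine the current value of all future cash flows generated by a project, net of the initial investment. Present value of a single cash flow: PV = F / (1 + r)^n, where PV = present value, F = future payment (cash flow), r = discount rate, and n = the number of periods in the future. NPV is the sum of the present values of all the project’s cash flows, with the initial investment counted as a negative cash flow.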
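
A short sketch of the calculation: each cash flow is discounted by F / (1 + r)^n and the results are summed, with the initial outlay entered as a negative flow at period 0. The cash flows and the 10% discount rate are hypothetical.

```python
def npv(rate, cash_flows):
    """Sum of discounted cash flows; cash_flows[0] occurs now (period 0)."""
    return sum(cf / (1 + rate) ** n for n, cf in enumerate(cash_flows))

flows = [-1000, 400, 400, 400]   # invest 1000, then receive 400 a year
print(round(npv(0.10, flows), 2))  # -5.26
```

The undiscounted total is +200, yet the NPV at 10% is slightly negative, so the project would not clear that hurdle rate.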

116
Q

What is Standard Deviation?

A

Measure of the amount of variation or dispersion of a set of values. Empirical rule or 68-95-99.7 rule (for normally distributed data): 68% of scores are within 1 standard deviation of the mean, 95% are within 2 standard deviations, and 99.7% are within 3 standard deviations.

117
Q

What is Efficient Frontier?

A

Set of optimal portfolios that offer the highest expected return for a defined level of risk.

118
Q

What is Anchoring Bias?

A

A cognitive bias that causes us to rely too heavily on the first piece of information we are given about a topic.

119
Q

What is the 80/20 rule?

A

AKA the Pareto principle: roughly 80% of results come from 20% of effort

120
Q

What is activity-based costing?

A

Method of assigning costs to products or services based on the resources that they consume.

121
Q

What is Agent-based modeling?

A

A class of computational models for simulating actions and interactions of autonomous agents with a view to assessing their effects on the system as a whole.

122
Q

What is Amortization?

A

Allocation of the cost of an item or items over a time period such that the actual cost is recovered.

123
Q

What are artificial neural networks?

A

Computer based models inspired by animal central nervous systems

124
Q

What is Assemble-to-Order (ATO)?

A

manufacturing process where products are assembled as they are ordered; characterized by rapid production and customization

125
Q

What is benchmarking?

A

act of comparison against a standard or the behavior of another in attempt to determine degree of conformity to standard or behavior

126
Q

What is Branch-and-Bound?

A

a general algorithm for finding optimal solutions of various optimization problems

127
Q

What is Business Process Modeling or Mapping?

A

Act of representing processes of an enterprise so that the current process may be analyzed and improved; typically performed by business analysts and managers seeking improved efficiency and quality.

128
Q

What is Chi-squared Automated Interaction Detection (CHAID)

A

CHAID is one of several commonly used techniques for decision trees and is based upon hypothesis testing using

129
Q

What is the confidence interval?

A

A type of interval estimate of a population parameter used to indicate the reliability of an estimate.

130
Q

What is the confidence level?

A

A confidence level refers to the percentage of all possible samples that can be expected to include the true population parameter. In surveys, confidence levels of 90/95/99% are frequently used

131
Q

What is the cost of capital?

A

The cost of funds used for financing a business

132
Q

What is the cumulative density function?

A

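Gives the probability that a random variable takes a value less than or equal to a given value (more commonly called the cumulative distribution function). Also used to specify the distribution of multivariate random variables.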

133
Q

What is a cutting stock problem?

A

Optimization or integer linear programming problem arising from applications in industry where high production problems exist

134
Q

What is the effective domain?

A

The domain of a function for which its value is finite

135
Q

What is experimental design?

A

In quality management, a written plan that describes the specifics for conducting an experiment such as which conditions, factors, responses, tools, and treatments are to be included or used.

136
Q

What are expert systems?

A

a computer program that simulates the judgment and behavior of a human or an organization that has expert knowledge and experience in a particular field

137
Q

What is factor analysis?

A

A statistical method used to describe variability among observed, correlated variables in terms of a potentially lower number of unobserved variables called factors.

138
Q

What is fuzzy logic?

A

A form of mathematical logic in which truth can assume a continuum of values between 0 and 1

139
Q

What are greedy heuristics?

A

an algorithm that follows the problem-solving heuristic of making the locally-optimal choice at each stage with the hope of finding a global optimum

140
Q

What is a heuristic?

A

in mathematical programming, this usually means a procedure that seeks an optimal solution but does not guarantee it will find one, even if one exists. It is often used in contrast to an algorithm, so branch and bound would not be considered a heuristic in this sense. In AI, however, a heuristic is an algorithm (with some guarantees) that uses a heuristic function to estimate the “cost” of branching from a given node to a leaf of the search tree

141
Q

What is KDD?

A

Acronym for knowledge discovery in databases

142
Q

What is the knapsack problem?

A

An integer program of the form max{cx : x in Z^n_+ and ax <= b}, where a > 0. The original problem models the maximum value of a knapsack that is limited by volume or weight (b), where x_j = number of items of type j put into the knapsack at unit return c_j, each using a_j units of capacity.
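
One way to solve small instances is dynamic programming over the capacity. The sketch below assumes integer item sizes a_j and allows any non-negative number of each item type, matching the x_j in the definition; the item values and sizes are invented.

```python
def knapsack(values, weights, capacity):
    """Maximum total value when any number of each item type may be packed."""
    best = [0] * (capacity + 1)   # best[cap] = best value using capacity cap
    for cap in range(1, capacity + 1):
        for c, a in zip(values, weights):
            if a <= cap:
                best[cap] = max(best[cap], best[cap - a] + c)
    return best[capacity]

print(knapsack(values=[10, 40, 50], weights=[1, 3, 4], capacity=7))  # 90
```

If each item may be used at most once (the 0-1 variant), the loops need to be reordered so that an item cannot be counted twice.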

143
Q

What is Little’s Law?

A

Little’s Law, a result from queuing theory: the long-term average number of customers in a stable system, L, is equal to the long-term average effective arrival rate, λ, multiplied by the average time a customer spends in the system, W; expressed algebraically, L = λW. The relationship holds regardless of the arrival distribution or the service order.
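
The relationship L = λW reduces to a single multiplication, as this sketch with made-up figures shows: 12 customers arriving per hour, each spending a quarter of an hour in the system.

```python
def average_in_system(arrival_rate, avg_time_in_system):
    """Little's Law: L = lambda * W (both in the same time unit)."""
    return arrival_rate * avg_time_in_system

print(average_in_system(arrival_rate=12, avg_time_in_system=0.25))  # 3.0
```

The same identity can be rearranged to estimate the average time in the system from an observed queue length and arrival rate.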

144
Q

What is mean squared error (MSE)?

A

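The average of the squared differences between estimated and actual values. In regression, the residual mean square (the sum of squared errors divided by the error degrees of freedom) is an unbiased estimator of the error variance.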

145
Q

What is the mode?

A

Value of the term that occurs most frequently

146
Q

What is the Nominal Group Technique (NGT)?

A

A structured method for group brainstorming that encourages contributions from everyone

147
Q

What is normalization?

A

Splits up data to avoid redundancy (duplication) by moving commonly repeating groups of data into new tables. Normalization tends to increase the number of tables that need to be joined to perform a given query, but reduces space required to hold the data.

148
Q

What is precision?

A

The degree to which repeated measurements under unchanged conditions show the same results

149
Q

What is Principal Component Analysis (PCA)?

A

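A dimension reduction tool that can be used to reduce a large set of variables to a smaller set of uncorrelated components that still contains most of the information in the original set.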

150
Q

What is a probability density function?

A

The equation used to describe a continuous probability distribution

151
Q

What is return on investment (ROI)?

A

Calculation that provides a basis for comparison with other investment opportunities; typically calculated using ROI = (Net Return on Investment / Total Cost of Investment) * 100

152
Q

What is RFM?

A

Recency, frequency, and monetary value of purchases

153
Q

What is Six Sigma?

A

A set of strategies, techniques, and tools for process improvement

154
Q

What is stepwise regression?

A

A semi-automated process of building a model by successively adding or removing variables based solely on the t-statistics of their estimated coefficients.

155
Q

What is the Poisson distribution?

A

A discrete frequency distribution which gives the probability of a number of independent events occurring in a fixed time.

156
Q

What is kurtosis?

A

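Measure of the tailedness (heaviness of the tails) of a distribution relative to a normal distribution. Distinct from skewness, which measures asymmetry.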

157
Q

What is a c-Chart?

A

A quality control chart used to control the number of defects per unit output

158
Q

What is a p-Chart?

A

A quality control chart used to control attributes; it tracks the proportion of defective items in each sample.

159
Q

What is an R-chart?

A

A process control chart that tracks the range within a sample

160
Q

What is an X-chart?

A

A quality control chart that indicates when changes occur in the central tendency of a production process

161
Q

What is uniform distribution?

A

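A distribution in which every outcome in a range is equally likely; its density plots as a rectangle, so it is also called the rectangular distribution.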

162
Q

What is Type 1 error?

A

False positive: rejecting a null hypothesis that is actually true.

163
Q

What is Type 2 error?

A

False negative: failing to reject a null hypothesis that is actually false.

164
Q

What is type 3 and type 4 error?

A

There is no such thing

165
Q

What is goodness of fit?

A

Degree of assurance or confidence to which the results of a survey or test can be relied upon for making dependable projections. Also described as the degree of linear correlation of variables. It is computed with statistical methods such as the chi-square test or the coefficient of determination.

166
Q

Examples of parametric tests?

A

t-test (n < 30), ANOVA, Pearson’s r correlation, z-test for large samples (n > 30)

167
Q

Examples of non-parametric tests?

A

Mann-Whitney U test, Wilcoxon signed-rank test, Kruskal-Wallis test