CAP definitions Flashcards
What is cross correlation?
When two different sequences are correlated
What is autocorrelation?
Degree of similarity between the values of the same variables
What is spatial autocorrelation?
Degree of similarity - When error terms across cross section data are correlated
What is serial autocorrelation?
degree of similarity - When error terms across time series data are correlated
What is OLAP?
Online Analytical Processing - uses complex queries to analyze aggregated historical data from OLTP systems - associated with data warehouses. Operations include roll-up, drill down, slice and dice, pivoting, drill through, drill across, etc. OLAP data cubes can be mapped to any (infinite) number of dimensions.
What is OLTP?
Online Transaction Processing - captures, stores, and processes data from transactions in real time. Its short, simple transactions run faster than OLAP queries.
Monte Carlo simulation
Necessary to develop a cumulative probability distribution. Technique that allows people to account for risk in quantitative analysis and decision making.
snowball sampling
survey sampling where subjects are based on referral from other survey respondents
quota sampling
Method for selecting survey participants that is a non-probabilistic version of stratified sampling. Involves a specific group.
judgement sampling
based on researcher’s judgement to select
strata sampling
Stratified random sampling: the population is divided into subgroups (strata) and a random sample is drawn from each stratum.
Central limit theorem
The central limit theorem (CLT) states that the distribution of sample means approximates a normal distribution as the sample size gets larger, regardless of the population’s distribution. Sample sizes equal to or greater than 30 are often considered sufficient for the CLT to hold.
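A minimal simulation sketch of the CLT, assuming a skewed exponential population with rate 1 (so mean 1 and standard deviation 1). The function name, the seed, and the sample counts are illustrative choices, not part of the flashcard.

```python
import random
import statistics

def sample_means(draw, sample_size, num_samples, seed=0):
    """Return the means of num_samples samples, each made of sample_size draws."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [
        statistics.fmean(draw(rng) for _ in range(sample_size))
        for _ in range(num_samples)
    ]

# Skewed population: exponential with rate 1 (mean 1, standard deviation 1).
means = sample_means(lambda rng: rng.expovariate(1.0), sample_size=30, num_samples=2000)

# The sample means cluster around the population mean, with a spread
# close to sigma / sqrt(n) = 1 / sqrt(30), roughly 0.18.
center = statistics.fmean(means)
spread = statistics.stdev(means)
```

Plotting a histogram of `means` would show a roughly bell-shaped curve even though the underlying population is strongly skewed.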
When do you reject null hypothesis?
When p-value (probability) is less than alpha (level of significance) - alpha is typically 0.05. The null hypothesis is rejected in two cases: when the p-value is less than alpha, or when the calculated test statistic is greater than the tabulated critical value.
What is alpha in statistics?
The level of significance
When do you fail to reject null hypothesis?
When p-value is above alpha
p-value
A p-value, or probability value, is a number describing how likely it is that data at least as extreme as yours would have occurred by random chance if the null hypothesis were true.
r squared
R-Squared is a statistical measure of fit that indicates how much variation of a dependent variable is explained by the independent variable(s) in a regression model. R-squared evaluates the scatter of the data points around the fitted regression line. This statistic indicates the percentage of the variance in the dependent variable that the independent variables explain collectively.
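A short sketch of how R-squared can be computed from actual and predicted values, using the usual definition 1 - SS_res / SS_tot. The data values are made up for illustration.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    # Unexplained variation: squared residuals around the fitted values.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total variation: squared deviations around the mean of the actuals.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
score = r_squared(actual, predicted)  # about 0.98
```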
paired t-test
used when we are interested in the difference between two variables for the same subject. Can never be applied on two different samples. Data is in the form of matched pairs. Parametric test
one-sample t-test
Used to determine if there is a significant difference between the mean of a single sample and a known or hypothesized population mean. If the sample size is less than 30 and the variance is unknown, the one-sample t-test is the best statistic to test the hypothesis.
two sample t-test
Used to determine if two population means are equal. Used to test if a new process is superior to a current process.
one sample z-test
Used when we want to know whether our sample comes from a particular population. When N is greater than 30, the z-test can be used even if the population variance is unknown, with the sample variance standing in for it.
f-test
F test used when variance is known. An “F Test” is a catch-all term for any test that uses the F-distribution. In most cases, when people talk about the F-Test, what they are actually talking about is the F-Test to Compare Two Variances.
chi-squared test
Measures the difference between observed and expected values. Used on categorical data. Chi-square test uses the observed and expected frequency of categorical data from the contingency table. Other tests are based on mean and variance of the data.
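As a sketch, the chi-squared statistic sums (observed - expected)^2 / expected over the categories. The frequencies below are hypothetical, and the comparison against a critical value is left out.

```python
def chi_squared_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [10, 20, 30]  # hypothetical observed frequencies
expected = [20, 20, 20]  # expected frequencies under the null hypothesis
stat = chi_squared_statistic(observed, expected)  # 5 + 0 + 5 = 10
```

A larger statistic means the observed frequencies sit further from what the null hypothesis predicts.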
Mann-whitney test (U test)
Used to test if two samples came from same population. Involves the calculation of a statistic, called U, whose distribution under the null hypothesis is known. Non-parametric test to compare outcomes between two independent groups. Alternative to two sample t-test.
Kruskal-Wallis test
non-parametric method to test whether samples originate from same distribution. Used for comparing two or more independent samples of equal or different sample sizes. Alternative to ANOVA.
Wilcoxon signed-rank test
Non-parametric test used either to test the location of a set of samples, or locations of two populations using a set of matched samples. Does not assume the data is normally distributed. Alternative to paired sample t-test.
Sign test
Method to test for consistent differences between pairs of observations, such as weight of subjects before and after treatment. Alternative to one sample t-test
cross tabulation test
Generally for categorical data on 2 or more dimensions to store the frequency of the data. Method to quantitatively analyze the relationship between multiple variables.
interval level data
also called integer, data type measured along a scale. Does not have fixed zero point.
ratio level data
quantitative data with same properties as interval data, with an equal and definitive ratio between each data and absolute “zero” being treated as a point of origin. Has a fixed zero point.
ordinal level data
has a predetermined or natural order. Non metric data (categorical)
nominal level data
classified without a natural order or rank. Non metric data (categorical)
Scalar data
Contain a single value. Can be continuous or categorical.
reference data
only used when working with complex data structures. Subset of master data used for classification.
Parametric statistics
Can be applied on the data type which is of continuous type. Used when data should follow Normal Distribution. Used on Quantitative data. Used when data is measured on approximate interval or ratio scales of measurement.
OGIVE
OGIVE is a graph of cumulative distribution showing data value on the horizontal axis and either the cumulative frequencies, or the cumulative relative frequencies, or the cumulative percent frequencies on the vertical scales.
Stem and leaf display
Used for exploratory data analysis. Easy to construct and can provide more info within a class interval than any other methods.
Inter-quartile range
Measures difference between first and third quartile. Middle 50% of the data.
box plot
The upper and lower limits (fences) lie 1.5 times the interquartile range above the third quartile and below the first quartile. Points beyond the upper and lower limits are considered outliers. Important for exploratory analysis.
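A minimal sketch of the 1.5 x IQR rule, with the limits measured from the first and third quartiles. The quartile method ("inclusive") is an assumption; other quartile conventions give slightly different limits.

```python
import statistics

def tukey_fences(data, k=1.5):
    """Return (lower, upper) limits placed k * IQR beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1  # spread of the middle 50% of the data
    return q1 - k * iqr, q3 + k * iqr

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
lower, upper = tukey_fences(data)
# Points outside the limits are flagged as outliers.
outliers = [x for x in data if x < lower or x > upper]
```

For this data the quartiles are 3.25 and 7.75, so the limits are -3.5 and 14.5 and only the value 100 is flagged.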
ANCOVA
analysis of covariance
ANOVA
analysis of variance. Used for two or more groups and one dependent variable. Used to analyze the differences among means.
MANOVA
multivariate analysis of variance. Used for multiple groups and multiple dependent variables. There is no concept of residuals in MANOVA analysis.
MANCOVA
multivariate analysis of covariance
Unimodal
distribution with one clear peak or most frequent value
Bimodal
distribution with two clear peaks
Multimodal
distribution with two peaks or more
No modal
distribution with no peaks
Little’s MCAR test
to assess if data is missing completely at random (MCAR). Used for Random Value analysis.
Cronbach Alpha test
to determine internal consistency, how closely related a set of items are as a group. Common method to test survey reliability.
Durbin-Watson test
to test for autocorrelation in the residuals from a regression analysis. Used to check auto-correlation
Shapiro Wilk Test
to evaluate whether the observations deviate from the normal curve (nonparametric test). Used to check normality of data.
split half test
to test for survey reliability
test-retest
to test for survey reliability
Hotelling T-Square
used for two groups and two or more dependent variables.
Tukey HSD (honestly significant difference)
post-hoc test. Controls the Type 1 error rate
Newman-Keuls test
stepwise multiple comparisons procedure used to identify sample means that are significantly different from each other.
Conjoint analysis
Conjoint analysis maps consumer preference structures into mathematical tradeoffs. Optimal market research approach for measuring the value that consumers place on features of a product or service. Multivariate technique. Based on the premise that consumers evaluate the value of an object by combining the separate amounts of value provided by each attribute.
Regression analysis
statistical process for estimating the relationship between a dependent variable and one or more independent variables. One important assumption while performing regression is that the errors are independent.
Discriminant analysis
used by the researcher to analyze the research data when the criterion or dependent variable is categorical and the predictor or independent variable is interval.
Confirmatory Factor Analysis (CFA)
CFA is usually used for Structural Equation Modelling (SEM)
lift chart
graphically represents the improvement that a mining model provides when compared against a random guess. To see prediction accuracy lines for any individual value of the predictable attribute, you need not create a separate lift chart for each targeted value. A measure of the effectiveness of a predictive model calculated as the ratio between the results obtained with and without the predictive model.
Project Mgmt - Network Analysis
Goal is to minimize total project duration
SLIQ / SPRINT
Algorithms used for addressing scalability issues of decision trees construction from very large training sets
BOAT
Bootstrapped optimistic algorithm for tree construction used for creating small samples of training set while constructing a tree
cost / time tradeoff
Cost decreases linearly as time increases
Correlation
The covariance of the two variables normalized by the standard deviation of each variable
correlation coefficient -R-
measure the strength of the linear relationship between two variables in a correlation analysis
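A small sketch of the correlation coefficient as the covariance divided by the product of the two standard deviations. The data is illustrative and sample (n - 1) estimates are assumed throughout.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: cov(x, y) / (sd(x) * sd(y))."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    return cov / (sd_x * sd_y)

x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [2, 4, 6, 8, 10])    # perfect positive linear relationship
r_down = pearson_r(x, [10, 8, 6, 4, 2])  # perfect negative linear relationship
```

The result always falls between -1 and 1, which is what makes it comparable across variables measured in different units.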
shadow price
Shadow price is defined as the rate of change in the optimal objective function value with respect to the unit change in the availability of the resources.
slack variable
in post-optimality analysis or sensitivity analysis, the slack variable enters in the objective function coefficients.
What is sensitivity analysis or post-optimality analysis
It measures the degree to which a solution responds to modifications of the elements of the analysis (such as objective function or coefficients). Study of knowing the effect on optimal solution of Linear Programming model due to variations in the input coefficients one at a time.
binary integer programming problem
Requires the decision variables to take values of either zero or one
Primal or dual problem
If either the primal or dual problem has an unbounded objective function value, the other problem has no feasible solution
Economic Order Quantity
Order quantity that minimizes the total holding costs and ordering costs
queuing theory
the study of the movement of people, objects, or information through a line
Calling population
the population of potential customers in queuing theory
data prep
Cleaning, formatting, and integration are all part of data prep. Division into training and testing is not part of initial data prep.
4 regression model assumptions
1) The true relationship is linear 2) Errors are normally distributed 3) homoscedasticity - equal variance around the line 4) independence of observations - errors are independent
How is exponential distribution characterized?
By only one parameter which is Lambda. It has a constant failure rate
Information architecture
Data and information flow within an organization
How does increase in size of data affect R^2?
Negligible impact on the adjusted R^2
What is multi-collinearity?
If regressors are highly or perfectly correlated with each other, there is multi-collinearity
What does a box and whisker plot show?
Shows if data are skewed, and in which direction. Way to graphically display distribution. Ends of the box are the first and third quartile. Line in the box represents the median value. Whiskers extend to the min and max values, or possibly less if they do not include points identified as outliers.
How can an under prediction be detected?
Bias. The bias measures the difference, including the direction of the estimate and the right answer. Depending on whether it’s positive or negative, it will show whether there is an over or under estimate.
Customer segmentation
Consists of dividing a customer base into groups of individuals that are similar in specific ways relevant to marketing. Two ways to do this are clustering and decision trees.
Data accuracy % required
approach or software that deals with data at +/- 10% accuracy is preferred.
Regression / Stepwise regression can be used for?
computing the most likely value for the missing value in a particular column or row.
project scheduling problem can be formulated as what?
A linear programming problem
What is standard normal form (or distribution) function?
μ = 0, σ = 1
covariance
Covariance is a measure of how much two random variables vary together. If two random variables are independent, their covariance is necessarily 0 (the converse does not hold).
Increase of confidence level does what?
Increases the width of the confidence interval
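A sketch of why the interval widens, assuming a z-interval for a mean with known population standard deviation. The sample mean, sigma, and n are hypothetical values.

```python
import math
from statistics import NormalDist

def mean_confidence_interval(sample_mean, sigma, n, confidence):
    """Two-sided z-interval for a mean with known sigma."""
    # Critical value: a higher confidence level gives a larger z.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * sigma / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

low95, high95 = mean_confidence_interval(100, 15, 36, 0.95)
low99, high99 = mean_confidence_interval(100, 15, 36, 0.99)
width95 = high95 - low95
width99 = high99 - low99  # wider than width95
```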
What are sequencing problems?
They are concerned with an appropriate selection of a sequence of jobs to be done on a finite number of service facilities (like machines). The selection of an appropriate order for finite number of different jobs to be done on a finite number of machines is called sequencing problem.
What is linear programming?
A mathematical technique for maximizing or minimizing a linear function of several variables such as output or cost. (a class of optimization problems)
What is duality in linear programming?
Duality implies that each linear programming problem can be analyzed in two different ways but would have equivalent solutions.
What is Integer linear programming?
An integer programming (IP) problem is an LP problem in which the decision variables are further constrained to take integer values
What is goal programming?
A goal programming model seeks to simultaneously take into account multiple objectives or goals that are of concern. LP models consist of constraints and a single objective to be maximized or minimized. A Goal Programming model consists of constraints and a set of goals that are prioritized.
What are Transportation Problems?
TP is a special kind of Linear Programming problem (LPP) in which goods are transported from a set of sources to a set of destinations subject to the supply and demand of the sources and destination respectively such that the total cost of the transportation is minimized.
What are Assignment Problems?
An assignment problem is a particular case of transportation problem where the objective is to assign a number of resources to an equal number of activities. (workers and tasks)
What are Decision Trees
Decision support tool that uses a tree-like model of decisions and their possible consequences. Non-parametric supervised learning method for classification and regression. Goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.
What are PERT/CPM in Project Management?
Both are network-based project management techniques which exhibit the flow and sequence of the activities and events. PERT (Program Evaluation and Review Technique) is appropriate for projects where the time required to complete activities is not known. CPM (Critical Path Method) is apt for projects which are recurring in nature. PERT is a probabilistic method. CPM is deterministic. PERT is used more for uncertain activities, CPM for predictable ones.
What are Inventory Control Models?
Probabilistic model: concerned with minimizing the total cost of inventory. Common models are EOQ (Economic Order Quantity), Inventory Production Quantity, and ABC Analysis. EOQ: the order quantity a business should request to minimize costs and maximize value when restocking. EOQ = sqrt(2DS / C) (D = annual demand, C = carrying cost, S = ordering cost).
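A quick worked sketch using the standard square-root form of the formula, EOQ = sqrt(2DS / C). The demand and cost figures are hypothetical.

```python
import math

def economic_order_quantity(annual_demand, ordering_cost, carrying_cost):
    """EOQ = sqrt(2 * D * S / C)."""
    return math.sqrt(2 * annual_demand * ordering_cost / carrying_cost)

# Hypothetical inputs: D = 1000 units/year, S = 10 per order, C = 2 per unit per year.
eoq = economic_order_quantity(annual_demand=1000, ordering_cost=10, carrying_cost=2)
# sqrt(2 * 1000 * 10 / 2) = sqrt(10000) = 100 units per order
```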
What is Queuing Theory?
The study of the movement of people, objects, or information through a line. A queuing model is constructed so that queue lengths and waiting time can be predicted.
What is Replacement Theory?
In operations research used in the decision making process of replacing a used equipment with a substitute.
What are Markov Chains?
An important stochastic process. Assumes that the description of the present state fully captures all the information that could influence the future evolution of the process. Predicting traffic flows, communication networks, genetic issues, and queues are examples where Markov chains can be used to model performance. Markov processes are the basis for general stochastic simulation methods, i.e., Markov Chain Monte Carlo, which are used for simulating sampling from complex probability distributions. A Markov chain essentially consists of a set of transitions, which are determined by some probability distribution.
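A minimal sketch of a two-state Markov chain, where the next state distribution depends only on the current one. The transition probabilities are made up for illustration.

```python
def step(distribution, transition):
    """Advance a state distribution by one transition of the chain."""
    n = len(transition)
    return [
        sum(distribution[i] * transition[i][j] for i in range(n))
        for j in range(n)
    ]

# transition[i][j] = probability of moving from state i to state j (rows sum to 1).
transition = [
    [0.9, 0.1],
    [0.5, 0.5],
]
state = [1.0, 0.0]  # start in state 0 with certainty
for _ in range(2):
    state = step(state, transition)
# After two steps the distribution is approximately [0.86, 0.14].
```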
What is Dynamic Programming (DP)?
An algorithmic technique for solving a problem by recursively breaking it down into simpler subproblems. When we see a recursive solution that has repeated calls for the same inputs, use dynamic programming. The idea is to simply store the results of the subproblems.
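A small sketch of the idea, using Fibonacci numbers as the classic example of a recursion with repeated calls for the same inputs. The cache from `functools.lru_cache` stores each subproblem result so it is computed only once.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    """n-th Fibonacci number, with subproblem results stored in a cache."""
    if n < 2:
        return n
    # Without the cache these two calls would recompute the same values many times.
    return fib(n - 1) + fib(n - 2)

result = fib(30)  # 832040
```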
What are Optimization Methods?
Used to find solutions that maximize or minimize some study parameters, such as minimize costs in the production of a good or service, maximize profits, minimize raw material in development of a good, or maximize production
What is Frequent Pattern Mining?
Same as association rules mining – analytical process that finds frequent patterns, associations.
What is multivariate analysis?
A statistical procedure for analysis of data involving more than one type of measurement or observation. It may also mean solving problems where more than one dependent variable is analyzed simultaneously with other variables.
What is Qualitative forecasting?
A method of making predictions using judgement from experts. Is necessary at times but can lead to bias due to recency or personal worldview
What is Probability Theory and Bayes Theorem?
Describes probability of event based on prior knowledge of conditions that might be related to the event. Bayesian inference is a method used to update the probability for a hypothesis as more evidence or information becomes available.
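A worked sketch of Bayes' theorem, P(H | E) = P(E | H) P(H) / P(E), using a hypothetical screening test. The prior, true positive rate, and false positive rate are illustrative numbers only.

```python
def posterior(prior, true_positive_rate, false_positive_rate):
    """P(hypothesis | positive evidence) via Bayes' theorem."""
    # Total probability of seeing positive evidence, true or false.
    evidence = true_positive_rate * prior + false_positive_rate * (1 - prior)
    return true_positive_rate * prior / evidence

# Hypothetical test: 1% prior, 90% true positive rate, 5% false positive rate.
p = posterior(prior=0.01, true_positive_rate=0.90, false_positive_rate=0.05)
# 0.009 / (0.009 + 0.0495), about 0.154
```

Even with a fairly accurate test, the low prior keeps the updated probability at about 15%.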
What is non-parametric testing?
Also called a distribution-free test; does not assume anything about the distribution of the underlying data.
What is net working capital?
the difference between a company’s current assets (cash, accounts receivable unpaid goods) and its current liabilities (accounts payable and debts). NWC formula: NWC = Accounts Receivable + Inventory – Accounts Payable.
What is Operating Margin?
The ratio of operating income to net sales. OM formula: Operating margin = (Operating income / Revenue)
What is Payback Period?
The amount of time it takes to recover the cost of an investment or the length of time it takes an investor to break-even. PP formula: Payback period = amount to be invested / estimated annual net cash flow
What is Net Present Value?
Method used to determine the current value of all future cash flows generated by a project. Each cash flow is discounted with PV = F / (1 + r)^n, where PV = present value, F = future payment (cash flow), r = discount rate, and n = the number of periods in the future. NPV is the sum of these present values, net of the initial investment.
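A sketch of the calculation, assuming the first cash flow occurs today (period 0) and is entered as a negative number for the initial investment. The discount rate and cash flows are hypothetical.

```python
def present_value(future_payment, rate, periods):
    """PV = F / (1 + r)^n."""
    return future_payment / (1 + rate) ** periods

def net_present_value(rate, cash_flows):
    """Sum of discounted cash flows; cash_flows[0] occurs at period 0."""
    return sum(present_value(cf, rate, t) for t, cf in enumerate(cash_flows))

# Hypothetical project: invest 1000 today, receive 500 at the end of each of 3 years.
npv = net_present_value(0.10, [-1000, 500, 500, 500])  # about 243.43
```

A positive NPV means the discounted inflows exceed the initial outlay at the chosen discount rate.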
What is Standard Deviation?
Measure of the amount of variation or dispersion of a set of values. Empirical rule or 68-95-99.7 rule: 68% of scores are within 1 standard deviation of the mean, 95% are within 2 standard deviations of the mean, and 99.7% of scores are within 3 standard deviations of the mean.
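The empirical rule can be checked against the standard normal distribution. This sketch uses `statistics.NormalDist` from the Python standard library.

```python
from statistics import NormalDist

def share_within(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    standard_normal = NormalDist(mu=0, sigma=1)
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

within_1 = share_within(1)  # about 0.6827
within_2 = share_within(2)  # about 0.9545
within_3 = share_within(3)  # about 0.9973
```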
What is Efficient Frontier?
Set of optimal portfolios that offer the highest expected return for a defined level of risk.
What is Anchoring Bias?
A cognitive bias that causes us to rely too heavily on the first piece of information we are given about a topic.
What is the 80/20 rule?
AKA the Pareto principle: roughly 80% of results come from 20% of effort
What is activity-based costing?
method of assigning costs to products or services based on the resources that they consume
What is Agent-based modeling?
a class of computational models for simulating actions and interactions of autonomous agents with a view to assessing their effects on the system as a whole
What is Amortization?
allocation of cost of an item or items over a time period such that the actual cost is recovered
What are artificial neural networks?
Computer based models inspired by animal central nervous systems
What is Assemble-to-Order (ATO)?
manufacturing process where products are assembled as they are ordered; characterized by rapid production and customization
What is benchmarking?
act of comparison against a standard or the behavior of another in attempt to determine degree of conformity to standard or behavior
What is Branch-and-Bound?
a general algorithm for finding optimal solutions of various optimization problems
What is Business Process Modeling or Mapping?
act of representing processes of an enterprise so that the current process may be analyzed and improved; typically performed by business analysts and managers seeking improved efficiency and quality
What is Chi-squared Automated Interaction Detection (CHAID)
CHAID is one of several commonly used techniques for decision trees and is based upon hypothesis testing using
What is the confidence interval?
A type of interval estimate of a population parameter used to indicate the reliability of an estimate.
What is the confidence level?
A confidence level refers to the percentage of all possible samples that can be expected to include the true population parameter. In surveys, confidence levels of 90/95/99% are frequently used
What is the cost of capital?
The cost of funds used for financing a business
What is the cumulative density function?
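Gives the probability that a random variable takes a value less than or equal to a given value. Also used to specify the distribution of multivariate random variables.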
What is a cutting stock problem?
Optimization or integer linear programming problem arising from applications in industry where high production problems exist
What is the effective domain?
The domain of a function for which its value is finite
What is experimental design?
In quality management, a written plan that describes the specifics for conducting an experiment such as which conditions, factors, responses, tools, and treatments are to be included or used.
What are expert systems?
a computer program that simulates the judgment and behavior of a human or an organization that has expert knowledge and experience in a particular field
What is factor analysis?
A statistical method used to describe variability among observed, correlated variables in terms of a potentially lower number of unobserved variables called factors.
What is fuzzy logic?
A form of mathematical logic in which truth can assume a continuum of values between 0 and 1
What are greedy heuristics?
an algorithm that follows the problem-solving heuristic of making the locally-optimal choice at each stage with the hope of finding a global optimum
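A small sketch of a greedy heuristic, using coin change as a stand-in example (not from the flashcard). It also shows why the locally optimal choice is only a hope: with the hypothetical coin set below, greedy uses three coins where two would do.

```python
def greedy_change(amount, coins):
    """Make change by always taking the largest coin that still fits."""
    picked = []
    for coin in sorted(coins, reverse=True):
        while amount >= coin:
            picked.append(coin)  # locally optimal choice at this stage
            amount -= coin
    return picked

# Greedy picks 4 + 1 + 1 (three coins), but the optimum is 3 + 3 (two coins).
change = greedy_change(6, [1, 3, 4])
```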
What is a heuristic?
in mathematical programming, this usually means a procedure that seeks an optimal solution but does not guarantee it will find one, even if one exists. It is often used in contrast to an algorithm, so branch and bound would not be considered a heuristic in this sense. In AI, however, a heuristic is an algorithm (with some guarantees) that uses a heuristic function to estimate the “cost” of branching from a given node to a leaf of the search tree
What is KDD?
Acronym for knowledge discovery in database processes
What is the knapsack problem?
an integer program of the form max{cx : x in Z^n_+ and ax <= b}, where a > 0. The original problem models the maximum value of a knapsack that is limited by volume or weight (b), where x_j = number of items of type j put into the knapsack at unit return c_j, that uses a_j units per item
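A dynamic programming sketch for this form of the problem, where each item type may be used any non-negative integer number of times. It assumes the weights a_j and the capacity b are positive integers; the item data is hypothetical.

```python
def unbounded_knapsack(values, weights, capacity):
    """Max total value with any non-negative integer count of each item type."""
    # best[cap] = best value achievable with total weight at most cap.
    best = [0] * (capacity + 1)
    for cap in range(1, capacity + 1):
        for value, weight in zip(values, weights):
            if weight <= cap:
                best[cap] = max(best[cap], best[cap - weight] + value)
    return best[capacity]

# c = unit returns, a = units of capacity used per item, b = capacity.
c = [60, 100, 120]
a = [10, 20, 30]
b = 50
best_value = unbounded_knapsack(c, a, b)  # five items of the first type: 300
```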
What is Little’s Law?
A queuing theory result. Little's Law: the long-term average number of customers in a stable system, L, is equal to the long-term average effective arrival rate, λ, multiplied by the (Palm) average time a customer spends in the system, W; or expressed algebraically: L = λW
What is mean squared error (MSE)?
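The average of the squared differences between estimated and actual values. In regression, the residual sum of squares divided by its degrees of freedom is an unbiased estimator of the error variance.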
What is the mode?
Value of the term that occurs most frequently
What is the Nominal Group Technique (NGT)?
A structured method for group brainstorming that encourages contributions from everyone
What is normalization?
Splits up data to avoid redundancy (duplication) by moving commonly repeating groups of data into new tables. Normalization tends to increase the number of tables that need to be joined to perform a given query, but reduces space required to hold the data.
What is precision?
The degree to which repeated measurements under unchanged conditions show the same results
What is Principal Component Analysis (PCA)?
A dimension reduction tool that can be used to reduce a large set of variables
What is a probability density function?
The equation used to describe a continuous probability distribution
What is return on investment (ROI)?
Calculation that provides a basis for comparison with other investment opportunities; typically calculated using ROI = (Net Return on Investment / Total Cost of Investment) * 100
What is RFM?
Recency, frequency, and monetary value of purchases
What is Six Sigma?
A set of strategies, techniques, and tools for process improvement
What is stepwise regression?
A semi-automated process of building a model by successively adding or removing variables based solely on the t-statistics of their estimated coefficients.
What is Poisson distribution?
A discrete probability distribution which gives the probability of a given number of independent events occurring in a fixed time.
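A short sketch of the Poisson probability mass function, P(k) = e^(-λ) λ^k / k!, where λ is the average number of events in the interval. The rate used here is an illustrative value.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events when the average count is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

# With an average of 2 events per interval, probability of exactly 3 events:
p_three = poisson_pmf(3, 2.0)  # about 0.1804
# The probabilities over all counts sum to 1 (checked here up to k = 49).
total = sum(poisson_pmf(k, 2.0) for k in range(50))
```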
What is kurtosis?
Measure of the tailedness (heaviness of the tails) of a distribution. Skewness is a separate measure that describes asymmetry.
What is a c-Chart?
A quality control chart used to control the number of defects per unit output
What is a p-Chart?
A quality control chart that is used to control attributes
What is an R-chart?
A process control chart that tracks the range within a sample
What is an X-chart?
A quality control chart that indicates when changes occur in the central tendency of a production process
What is uniform distribution?
Looks like a rectangular distribution
What is Type 1 error?
False-positive. Rejecting the null hypothesis when it is actually true.
What is Type 2 error?
False-negative. Failing to reject the null hypothesis when it is actually false.
What is type 3 and type 4 error?
There is no such thing
What is goodness of fit?
Degree of assurance or confidence to which the results of a survey or test can be relied upon for making dependable projections. Also described as the degree of linear correlation of variables. It is computed with statistical methods such as the chi-square test or the coefficient of determination.
Examples of parametric tests?
t-test (n<30), anova, pearsons r correlation, z-test for large samples (>30)
Examples of non-parametric tests?
Mann-Whitney U test, Wilcoxon Signed Rank, Kruskal-Wallis