CAP definitions Flashcards
What is cross correlation?
When two different sequences are correlated
What is autocorrelation?
Degree of similarity between values of the same variable across observations, for example between a series and a lagged copy of itself
What is spatial autocorrelation?
Degree of similarity - when error terms across cross-sectional data are correlated
What is serial autocorrelation?
Degree of similarity - when error terms across time series data are correlated
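As a rough illustration of serial autocorrelation, the sketch below computes the sample autocorrelation of a sequence at a given lag. The function name `autocorrelation` and the two toy series are assumptions made for this example.

```python
def autocorrelation(series, lag=1):
    """Sample autocorrelation of a sequence at the given lag."""
    n = len(series)
    mean = sum(series) / n
    # Denominator: total squared deviation from the mean.
    denom = sum((x - mean) ** 2 for x in series)
    # Numerator: products of deviations that are `lag` steps apart.
    num = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
    return num / denom

trend = [1, 2, 3, 4, 5, 6, 7, 8]        # smooth upward trend
zigzag = [1, -1, 1, -1, 1, -1, 1, -1]   # flips sign every step
print(autocorrelation(trend))    # positive: neighbours are similar
print(autocorrelation(zigzag))   # negative: neighbours are opposite
```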
What is OLAP?
Online Analytical Processing - uses complex queries to analyze aggregated historical data from OLTP systems - associated with data warehouses. Operations include roll-up, drill down, slice and dice, pivoting, drill through, drill across, etc. OLAP data cubes can be mapped to any (infinite) number of dimensions.
What is OLTP?
Online Transaction Processing - captures, stores, and processes data from transactions in real-time. Individual operations are typically faster than OLAP queries.
Monte Carlo simulation
Technique that allows people to account for risk in quantitative analysis and decision making by repeatedly sampling uncertain inputs. A cumulative probability distribution must be developed for each uncertain variable.
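A minimal Monte Carlo sketch, assuming a hypothetical project whose total cost is the sum of two uncertain inputs. The cost ranges, distributions, and function names are illustrative assumptions, not taken from the flashcards.

```python
import random

def simulate_total_cost(n_trials=10_000, seed=0):
    """Draw random inputs many times and collect the simulated totals."""
    rng = random.Random(seed)
    totals = []
    for _ in range(n_trials):
        labor = rng.uniform(100, 200)            # hypothetical cost range
        materials = rng.triangular(50, 150, 80)  # low, high, most likely value
        totals.append(labor + materials)
    return sorted(totals)

def cumulative_probability(sorted_totals, threshold):
    """Fraction of simulated outcomes at or below the threshold."""
    return sum(1 for t in sorted_totals if t <= threshold) / len(sorted_totals)

totals = simulate_total_cost()
# Estimated probability that the total cost stays within a budget of 250.
print(cumulative_probability(totals, 250))
```

Evaluating `cumulative_probability` at several thresholds traces out the empirical cumulative distribution of the simulated outcome.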
snowball sampling
Survey sampling where subjects are recruited through referrals from other survey respondents
quota sampling
Method for selecting survey participants that is a non-probabilistic version of stratified sampling. Involves filling a quota from each specific group.
judgement sampling
Participants are selected based on the researcher’s judgement
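strata (stratified) sampling
The population is divided into subgroups (strata) and a random sample is drawn from each stratum; this is stratified random sampling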
Central limit theorem
The central limit theorem (CLT) states that the distribution of sample means approximates a normal distribution as the sample size gets larger, regardless of the population’s distribution. Sample sizes equal to or greater than 30 are often considered sufficient for the CLT to hold.
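The behaviour described by the CLT can be sketched with a small simulation. The example assumes a skewed exponential population with mean 1 and standard deviation 1; the function name and sample counts are illustrative choices.

```python
import random
import statistics

def sample_means(sample_size, n_samples=2000, seed=1):
    """Means of repeated samples drawn from a skewed (exponential) population."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_samples):
        sample = [rng.expovariate(1.0) for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
    return means

means_5 = sample_means(5)
means_30 = sample_means(30)
# The sample means centre on the population mean (1.0), and their spread
# shrinks roughly like 1 / sqrt(sample_size) as the sample size grows.
print(statistics.fmean(means_30), statistics.stdev(means_30))
```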
When do you reject null hypothesis?
When the p-value (probability) is less than alpha (the level of significance) - alpha is typically 0.05. The null hypothesis is rejected in only two cases: the p-value is less than alpha, or the calculated test statistic is more extreme than the tabulated (critical) value.
What is alpha in statistics?
The level of significance
When do you fail to reject null hypothesis?
When p-value is above alpha
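The decision rule can be sketched as follows, assuming a two-sided test whose statistic follows a standard normal distribution. The helper names `two_sided_p_value` and `decide` are made up for this example.

```python
from statistics import NormalDist

def two_sided_p_value(z):
    """Probability of a statistic at least as extreme as z under the null."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

def decide(p_value, alpha=0.05):
    """Reject the null only when the p-value falls below alpha."""
    return "reject H0" if p_value < alpha else "fail to reject H0"

for z in (1.0, 2.5):
    p = two_sided_p_value(z)
    print(z, round(p, 4), decide(p))
```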
p-value
A p-value, or probability value, is the probability of observing data at least as extreme as yours if the null hypothesis were true
r squared
R-Squared is a statistical measure of fit that indicates how much variation of a dependent variable is explained by the independent variable(s) in a regression model. R-squared evaluates the scatter of the data points around the fitted regression line. This statistic indicates the percentage of the variance in the dependent variable that the independent variables explain collectively.
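A minimal sketch of the calculation for a simple (one-predictor) least-squares line, where R-squared is 1 minus the ratio of residual variation to total variation. The function name and the toy data are assumptions for illustration.

```python
def r_squared(x, y):
    """R-squared of the least-squares line fitted to (x, y)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Least-squares slope and intercept.
    slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
             / sum((xi - mean_x) ** 2 for xi in x))
    intercept = mean_y - slope * mean_x
    # Variation left unexplained by the line vs. total variation in y.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

print(r_squared([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))  # about 0.64
```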
paired t-test
Used when we are interested in the difference between two variables for the same subject. Cannot be applied to two independent samples. Data is in the form of matched pairs. Parametric test.
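The test statistic for matched pairs can be sketched as the mean of the per-subject differences divided by its standard error. The before/after values below are invented for illustration.

```python
import math
import statistics

def paired_t_statistic(before, after):
    """t statistic and degrees of freedom for matched pairs."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation of the differences
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1

t_stat, dof = paired_t_statistic([10, 12, 9, 11, 13], [12, 14, 10, 13, 14])
print(round(t_stat, 3), dof)
```

The statistic is then compared against the t-distribution with `dof` degrees of freedom, a step this sketch leaves out.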
one-sample t-test
Used to determine if there is a significant difference between a sample mean and a known or hypothesized population mean. If the sample size is less than 30 and the variance is unknown, the one-sample t-test is the best statistic to test the hypothesis.
two sample t-test
Used to determine if two population means are equal. Used to test if a new process is superior to a current process.
one sample z-test
Used when we want to know whether our sample comes from a particular population. If N is greater than 30, the z-test can be used even when the population variance is unknown.
f-test
An “F Test” is a catch-all term for any test that uses the F-distribution. In most cases, when people talk about the F-Test, what they are actually talking about is the F-Test to Compare Two Variances.
chi-squared test
Measures the difference between observed and expected values. Used on categorical data. Chi-square test uses the observed and expected frequency of categorical data from the contingency table. Other tests are based on mean and variance of the data.
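As a sketch of the observed-versus-expected comparison, the function below computes the chi-square statistic for a contingency table, taking each expected count as (row total × column total) / grand total. The table values are invented.

```python
def chi_square_statistic(table):
    """Chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected frequency if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

stat, dof = chi_square_statistic([[20, 30], [30, 20]])
print(stat, dof)
```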
Mann-Whitney test (U test)
Used to test if two samples came from same population. Involves the calculation of a statistic, called U, whose distribution under the null hypothesis is known. Non-parametric test to compare outcomes between two independent groups. Alternative to two sample t-test.
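The U statistic can be sketched by counting, over all pairs, how often a value from one sample exceeds a value from the other. Counting ties as one half and reporting the smaller of the two U values are common conventions assumed here.

```python
def mann_whitney_u(a, b):
    """Smaller of the two U statistics for independent samples a and b."""
    # U for sample a: pairs (x, y) with x > y; a tie contributes one half.
    u_a = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)
    u_b = len(a) * len(b) - u_a
    return min(u_a, u_b)

print(mann_whitney_u([1, 2, 3], [4, 5, 6]))  # complete separation gives U = 0
print(mann_whitney_u([1, 4, 6], [2, 3, 5]))  # interleaved samples give a larger U
```

Judging significance requires comparing U with its null distribution (tables or a normal approximation), which this sketch omits.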
Kruskal-Wallis test
non-parametric method to test whether samples originate from same distribution. Used for comparing two or more independent samples of equal or different sample sizes. Alternative to ANOVA.
Wilcoxon signed-rank test
Non-parametric test used either to test the location of a set of samples, or to compare the locations of two populations using a set of matched samples. Does not assume the data is normally distributed. Alternative to paired sample t-test.
Sign test
Method to test for consistent differences between pairs of observations, such as weight of subjects before and after treatment. Alternative to one sample t-test
cross tabulation test
Generally for categorical data on 2 or more dimensions to store the frequency of the data. Method to quantitatively analyze the relationship between multiple variables.
interval level data
Sometimes called integer data; a data type measured along a scale with equal distances between points. Does not have a fixed (true) zero point.
ratio level data
Quantitative data with the same properties as interval data, plus an equal and definitive ratio between values, with absolute “zero” treated as the point of origin. Has a fixed zero point.
ordinal level data
has a predetermined or natural order. Non metric data (categorical)
nominal level data
classified without a natural order or rank. Non metric data (categorical)
Scalar data
Contain a single value. Can be continuous or categorical.
reference data
only used when working with complex data structures. Subset of master data used for classification.
Parametric statistics
Can be applied to continuous data. Assumes the data follow a Normal Distribution. Used on quantitative data measured on approximately interval or ratio scales of measurement.
OGIVE
An OGIVE is a graph of a cumulative distribution showing data values on the horizontal axis and either the cumulative frequencies, the cumulative relative frequencies, or the cumulative percent frequencies on the vertical axis.
Stem and leaf display
Used for exploratory data analysis. Easy to construct and can provide more information within a class interval than other methods.
Inter-quartile range
Measures difference between first and third quartile. Middle 50% of the data.
box plot
The lower limit is the first quartile minus 1.5 times the interquartile range, and the upper limit is the third quartile plus 1.5 times the interquartile range. Points beyond the upper and lower limits are considered outliers. Important for exploratory analysis.
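A short sketch of the interquartile range and the box plot limits (Q1 - 1.5 × IQR and Q3 + 1.5 × IQR). It uses `statistics.quantiles` with its default method; other quartile conventions can give slightly different limits. The data are invented.

```python
import statistics

def box_plot_limits(data):
    """Return (iqr, lower_limit, upper_limit) using the 1.5 * IQR rule."""
    q1, _median, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1  # spread of the middle 50% of the data
    return iqr, q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
iqr, lower, upper = box_plot_limits(data)
# Points outside the limits are flagged as outliers.
outliers = [x for x in data if x < lower or x > upper]
print(iqr, lower, upper, outliers)
```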
ANCOVA
analysis of covariance
ANOVA
analysis of variance. Used for two or more groups and one dependent variable. Used to analyze the differences among means.
MANOVA
Multivariate analysis of variance. Used for multiple groups and multiple dependent variables. There is no concept of residuals in MANOVA analysis.
MANCOVA
Multivariate analysis of covariance
Unimodal
distribution with one clear peak or most frequent value
Bimodal
distribution with two clear peaks
Multimodal
distribution with two peaks or more
No modal
distribution with no peaks
Little’s MCAR test
To assess whether data is missing completely at random. Used for missing value analysis. MCAR = missing completely at random.
Cronbach Alpha test
to determine internal consistency, how closely related a set of items are as a group. Common method to test survey reliability.
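A minimal sketch of the standard formula, alpha = k / (k - 1) × (1 - sum of item variances / variance of total scores), where k is the number of items. The survey scores below are invented.

```python
import statistics

def cronbach_alpha(items):
    """items: one list of scores per survey item, aligned by respondent."""
    k = len(items)
    sum_item_variances = sum(statistics.variance(item) for item in items)
    # Total score of each respondent across all items.
    totals = [sum(scores) for scores in zip(*items)]
    return k / (k - 1) * (1 - sum_item_variances / statistics.variance(totals))

survey = [
    [1, 2, 3, 4, 5],  # item 1, five respondents
    [2, 2, 3, 5, 5],  # item 2
    [1, 3, 3, 4, 4],  # item 3
]
print(round(cronbach_alpha(survey), 3))
```

Values closer to 1 indicate that the items move together, that is, higher internal consistency.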
Durbin-Watson test
To test for autocorrelation in the residuals from a regression analysis
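The statistic itself is simple to sketch: the sum of squared successive differences of the residuals divided by the sum of squared residuals. Values near 2 suggest no first-order autocorrelation, values toward 0 suggest positive autocorrelation, and values toward 4 suggest negative autocorrelation. The residuals below are invented.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for a sequence of regression residuals."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

print(durbin_watson([1, 1, 1, -1, -1, -1]))  # below 2: positive autocorrelation
print(durbin_watson([1, -1, 1, -1]))         # above 2: negative autocorrelation
```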
Shapiro-Wilk test
to evaluate whether the observations deviate from the normal curve (nonparametric test). Used to check normality of data.
split half test
to test for survey reliability
test-retest
to test for survey reliability
Hotelling T-Square
used for two groups and two or more dependent variables.
Tukey HSD (honestly significant difference)
Post-hoc test. Controls the family-wise Type I error rate.
Newman-Keuls test
stepwise multiple comparisons procedure used to identify sample means that are significantly different from each other.
Conjoint analysis
Conjoint analysis maps consumer preference structures into mathematical tradeoffs. Market research approach for measuring the value that consumers place on features of a product or service. Multivariate technique. Based on the premise that consumers evaluate the value of an object by combining the separate amounts of value provided by each attribute.
Regression analysis
statistical process for estimating the relationship between a dependent variable and one or more independent variables. One important assumption while performing regression is that the errors are independent.
Discriminant analysis
used by the researcher to analyze the research data when the criterion or dependent variable is categorical and the predictor or independent variable is interval.
Confirmatory Factor Analysis (CFA)
CFA is usually used for Structural Equation Modelling (SEM)
lift chart
graphically represents the improvement that a mining model provides when compared against a random guess. To see prediction accuracy lines for any individual value of the predictable attribute, you need not create a separate lift chart for each targeted value. A measure of the effectiveness of a predictive model calculated as the ratio between the results obtained with and without the predictive model.
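The ratio form of lift can be sketched as the response rate among the top-scored cases divided by the overall response rate, which is what random selection would achieve on average. The scores, labels, and function name are illustrative assumptions.

```python
def lift_at(scores, labels, fraction):
    """Lift of the top `fraction` of cases ranked by model score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, int(len(ranked) * fraction))
    top_rate = sum(label for _, label in ranked[:n_top]) / n_top
    base_rate = sum(labels) / len(labels)  # rate expected without the model
    return top_rate / base_rate

scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]   # 1 = responded
print(lift_at(scores, labels, 0.2))  # top 20% of cases
print(lift_at(scores, labels, 1.0))  # whole population: lift of 1
```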
Project Mgmt - Network Analysis
Goal is to minimize total project duration
SLIQ / SPRINT
Algorithms used for addressing scalability issues of decision trees construction from very large training sets
BOAT
Bootstrapped optimistic algorithm for tree construction used for creating small samples of training set while constructing a tree
cost / time tradeoff
Cost decreases linearly as time increases
Correlation
The covariance of the two variables normalized by the standard deviation of each variable
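A minimal sketch of the Pearson correlation coefficient: the covariance divided by the product of the two standard deviations. The common 1/(n - 1) factors cancel, so the code works with raw sums. The toy data are invented.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / math.sqrt(var_x * var_y)

print(pearson_r([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))  # about 0.8
```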