Data Science and Statistics Vocab Flashcards
What are the three modeling types? Describe each.
Here is a brief description of the three modeling types:
Continuous - values are numeric and are used directly in an analysis.
Ordinal - values are category labels, but their order is meaningful.
Nominal - values are treated as unordered, categorical names of levels.
The ordinal and nominal modeling types are treated the same in most analyses, and are often referred to collectively as categorical.
Distribution
Provides a histogram for continuous data and a bar chart for nominal or ordinal data, along with
relevant summary statistics. Presents options for many one-sample analyses, based on modeling
type.
Fit Y by X
Shows plots that describe the relationship between any two variables. Provides two-sample
analyses based on the modeling types of the two variables, such as bivariate, oneway, logistic,
and contingency analysis.
Matched Pairs
Analyzes two continuous variables that are measurements on the same experimental unit or
subject.
Tabulate
Constructs tables of descriptive statistics using an interactive interface.
Fit Model
Fits models involving one or more Y variables and multiple X variables. Techniques include
standard least squares, stepwise, generalized regression, mixed models, MANOVA, loglinear
variance, logistic, proportional hazards, parametric survival, generalized linear models, partial
least squares, and response screening.
Modeling
Offers various modeling techniques: nonlinear, neural, Gaussian process, partition analysis, time
series, and model comparison. Screening is for designs with many effects. Response screening
is for a larger number of effects across groups.
Multivariate Methods
Offers techniques for exploring relationships among multiple variables: multivariate fitting,
clustering, principal components, discriminant analysis, and partial least squares.
Quality and Process
Offers techniques for evaluating quality-related issues in processes or products: control charts
(including an interactive control chart builder), measurement systems analysis, variability and
attribute gauge charts, capability charts on multiple responses, Pareto plots, and fishbone
(Ishikawa Cause and Effect) diagrams.
Reliability and Survival
Offers techniques for fitting survival and reliability data: life distribution, fit life by x, recurrence
analysis, degradation, reliability growth and forecasting, product-limit survival fit, parametric
survival distributions, and proportional hazards modeling.
Consumer Research
Provides methods for studying consumer preferences. Options include categorical response
survey analysis, factor analysis, choice models, item analysis, and uplift models for identifying
the positive effects of marketing actions.
t test
A t-test is most commonly applied when the test statistic would follow a normal distribution if the value of a scaling term in the test statistic were known. When the scaling term is unknown and is replaced by an estimate based on the data, the test statistic (under certain conditions) follows a Student's t distribution.
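As an illustration, here is a minimal sketch of the one-sample case, where the unknown scaling term is the population standard deviation and its estimate is the sample standard deviation. The function name `one_sample_t` and the sample values are hypothetical, chosen only for this card.

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """Return the one-sample t statistic and its degrees of freedom."""
    n = len(sample)
    mean = statistics.mean(sample)
    # The unknown scaling term (population standard deviation) is
    # replaced by the sample standard deviation estimated from the data.
    s = statistics.stdev(sample)
    t = (mean - mu0) / (s / math.sqrt(n))
    return t, n - 1

# Illustrative data only: test whether the mean differs from 3.
t_stat, df = one_sample_t([2, 4, 4, 6], mu0=3)
```

The statistic is then compared against a Student's t distribution with `df` degrees of freedom to obtain a p-value.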
X axis
The x-axis is the horizontal number line in a Cartesian coordinate system.
Y axis
The y-axis is the vertical number line in a Cartesian coordinate system.
Box Plot
Used to display the response distribution at different combinations of factor levels. Box plots can reveal differences in the response Mean at different levels, suggesting Main Effects. Box plots can also reveal whether the response variation is homogeneous across factor levels, an assumption made in ANOVA.
Bubble Plot
A two-dimensional Scatterplot showing the relationship between two Variables over time. Each circle, or bubble, represents a single instance of an ID variable.
Arithmetic Mean
The arithmetic mean is the average of a set of n numbers: their sum divided by n.
Median
The median is the middle value in a group of ordered numbers. It splits the group into a lower half and an upper half. When the group has an even number of values, the median is the average of the two middle values.
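A short sketch of the median for an odd-sized and an even-sized group, using Python's standard `statistics` module. The numbers are made up for illustration.

```python
import statistics

odd = [7, 1, 5]       # sorted: 1, 5, 7 -> the middle value is 5
even = [7, 1, 5, 3]   # sorted: 1, 3, 5, 7 -> two middle values, 3 and 5

median_odd = statistics.median(odd)
# With an even count, statistics.median averages the two middle values.
median_even = statistics.median(even)
```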
Mode
The mode is the value that occurs most often in a group of numbers. It is one way to summarize a variable, for example in a population. There can be more than one mode in a group of numbers.
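A minimal sketch showing a group with more than one mode. It assumes Python 3.8 or later, where `statistics.multimode` is available; the values are illustrative.

```python
import statistics

values = [1, 2, 2, 3, 3, 4]

# multimode returns every value tied for the highest count,
# in the order each was first encountered.
modes = statistics.multimode(values)
```

Here both 2 and 3 appear twice, so the group has two modes.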
Geometric mean
The geometric mean is a kind of average of a set of numbers that is different from the arithmetic average. The geometric mean is well defined only for sets of positive real numbers. It is calculated by multiplying all n numbers together and taking the nth root of the product. A common example where the geometric mean is the correct choice is when averaging growth rates.
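The growth-rate case can be sketched as follows, assuming Python 3.8 or later (for `math.prod` and `statistics.geometric_mean`). The growth factors are hypothetical: a quantity doubles in one period and grows eightfold in the next.

```python
import math
import statistics

growth_factors = [2.0, 8.0]   # hypothetical per-period growth factors
n = len(growth_factors)

# Multiply all the numbers, then take the nth root of the product.
by_hand = math.prod(growth_factors) ** (1 / n)

# The standard library computes the same quantity.
by_library = statistics.geometric_mean(growth_factors)
```

Applying the geometric mean as a constant growth factor for n periods reproduces the total growth, which is why it suits growth rates better than the arithmetic mean.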
Harmonic Mean
The harmonic mean is another kind of average of a set of numbers. It is the number of elements divided by the sum of the reciprocals of the elements. For positive numbers that are not all equal, the harmonic mean is always the lowest of the three means (harmonic, geometric, arithmetic).
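A small sketch of the harmonic mean next to the arithmetic and geometric means, assuming Python 3.8 or later and positive inputs. The two values are illustrative only.

```python
import statistics

values = [40, 60]   # illustrative positive numbers

# Number of elements divided by the sum of their reciprocals.
harmonic_by_hand = len(values) / sum(1 / v for v in values)

harmonic = statistics.harmonic_mean(values)
geometric = statistics.geometric_mean(values)
arithmetic = statistics.mean(values)
```

For these values the ordering harmonic <= geometric <= arithmetic holds, as it does for any set of positive numbers.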
Interpreting the t-Test
The primary result of a t-test is the p-value. In this example, the p-value is 0.396 and the analyst is using a significance level of 0.05. Since 0.396 is greater than 0.05, the analyst cannot conclude that the average weight of car models in the broader population is significantly different from 3000 pounds. Had the p-value been lower than the significance level, the analyst would have concluded that the average car weight in the broader population is significantly different from 3000 pounds.
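The decision rule in this card can be written as a one-line comparison. This is only a sketch; the function name `is_significant` is hypothetical, and the p-value and significance level are the ones quoted in the example.

```python
def is_significant(p_value, alpha=0.05):
    """Return True when the p-value is below the significance level."""
    return p_value < alpha

# Values from the example: p-value 0.396, significance level 0.05.
decision = is_significant(0.396, alpha=0.05)
```

A result of `False` means the data do not support concluding that the mean differs from the target value.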
Analyzing a continuous variable might include questions such as the following:
- Does the shape of the data match any known distributions?
- Are there any outliers in the data?
- What is the average of the data?
- Is the average statistically different from a target or historical value?
- How spread out are the data? In other words, what is the standard deviation?
- What are the minimum and maximum values?
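Several of these questions (average, spread, minimum, and maximum) can be answered with a few summary statistics. The sketch below uses Python's standard `statistics` module; the helper name `summarize` and the data are hypothetical.

```python
import statistics

def summarize(values):
    """Return basic summary statistics for a continuous variable."""
    return {
        "mean": statistics.mean(values),
        "std_dev": statistics.stdev(values),   # sample standard deviation
        "min": min(values),
        "max": max(values),
    }

summary = summarize([2, 4, 4, 6])   # illustrative data
```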
Binomial Regression
A regression method where the Dependent Variable contains binomial values (for example, 0 and 1, often corresponding to ‘no’ and ‘yes’, or ‘failure’ and ‘success’, respectively).
chi-squared test
A statistical test used to test the existence of a relationship between two nominal Variables where the sampling distribution of the Test Statistic is a chi-squared distribution when the Null Hypothesis is true (or where it is asymptotically true).
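As a rough sketch, the Pearson chi-squared test statistic for a contingency table of two nominal variables can be computed from observed and expected counts. The function name `chi_squared_statistic` and the 2x2 table are hypothetical, and the code assumes every expected count is nonzero.

```python
def chi_squared_statistic(table):
    """Pearson chi-squared statistic and degrees of freedom for a
    contingency table given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Illustrative 2x2 table of counts.
stat, dof = chi_squared_statistic([[10, 20], [20, 10]])
```

The statistic is then compared against a chi-squared distribution with `dof` degrees of freedom.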
Matched Pairs
The Matched Pairs command handles bivariate data in the special situation where the two responses
form a pair of measurements coming from the same experimental unit or subject.
For example, a matched pair might be a before-and-after blood pressure measurement from the same subject. The responses are correlated, and the statistical method called the paired t-test takes that into account.
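A minimal sketch of the paired t statistic: it works on the within-subject differences, which is how the correlation between the two responses is taken into account. The function name `paired_t` and the blood pressure readings are hypothetical.

```python
import math
import statistics

def paired_t(before, after):
    """Return the paired t statistic and its degrees of freedom."""
    # One difference per subject; pairing removes between-subject variation.
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, n - 1

# Made-up before-and-after readings for four subjects.
before = [120, 130, 125, 140]
after = [118, 126, 124, 135]
t_stat, df = paired_t(before, after)
```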
Tree Plot
Creates a rectangular tiling for a nominal or ordinal variable where you can tile categories to be proportional
in size to a selected variable. A positional specification is optional. Useful when there are a lot of categories
or when histograms are ineffective.
Confidence
Also has a broader meaning in statistics (confidence interval), concerning the degree of error in an estimate that results from selecting one sample as opposed to another.
P (A | B)
Is the conditional probability of event A occurring given that event B has occurred. Read as “the probability that A will occur given that B has occurred.”
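For equally likely outcomes, P(A | B) is the share of the outcomes in B that are also in A. The sketch below assumes a fair six-sided die and a non-empty event B; the function name `conditional_probability` is hypothetical.

```python
from fractions import Fraction

def conditional_probability(outcomes, event_a, event_b):
    """P(A | B) over equally likely outcomes; B must be non-empty."""
    in_b = [o for o in outcomes if event_b(o)]
    in_a_and_b = [o for o in in_b if event_a(o)]
    return Fraction(len(in_a_and_b), len(in_b))

die = range(1, 7)
# Probability of rolling a 6 given that the roll is even.
p = conditional_probability(die, lambda o: o == 6, lambda o: o % 2 == 0)
```

Knowing the roll is even narrows the outcomes to 2, 4, and 6, so the probability of a 6 is one in three.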
Predictor
Usually denoted by X, is also called a feature, input variable, independent variable, or from a database perspective, a field.
Response
Usually denoted by Y, is the variable being predicted in supervised learning; also called dependent variable, output variable, target variable, or outcome variable.
Score
Refers to a predicted value or class. Scoring new data means to use a model developed with training data to predict output values in new data.
Supervised Learning
Refers to the process of providing an algorithm (logistic regression, regression tree, etc.) with records in which an output variable of interest is known and the algorithm “learns” how to predict this value with new records where the output is unknown.
Test Data (or test set)
Refers to that portion of the data used only at the end of the model building and selection process to assess how well the final model might perform on additional data.
Training Data (or training set)
Refers to that portion of data used to fit a model.
Unsupervised Learning
Refers to analysis in which one attempts to learn something about the data other than predicting an output value of interest (e.g., whether it falls into clusters).
Validation Data (or validation set)
Refers to that portion of the data used to assess how well the model fits, to adjust some models, and to select the best model from among those that have been tried.
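One common way to obtain training, validation, and test sets is a random partition of the records. The sketch below is illustrative only: the function name `split_records`, the 60/20/20 proportions, and the seed are assumptions, not values from these cards.

```python
import random

def split_records(records, train_frac=0.6, valid_frac=0.2, seed=0):
    """Randomly partition records into training, validation, and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]     # whatever remains
    return train, valid, test

train, valid, test = split_records(range(10))
```

Each record lands in exactly one of the three sets, so the test set stays untouched until the final model is assessed.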
CRISP
Cross-Industry Standard Process for Data Mining (CRISP-DM). Its six phases are Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment.