Data Science and Statistics Vocab Flashcards
What are the three modeling types? Describe each.
Here is a brief description of the three modeling types:
Continuous - values are numeric and are used directly in an analysis.
Ordinal - values are category labels, but their order is meaningful.
Nominal - values are treated as unordered, categorical names of levels.
The ordinal and nominal modeling types are treated the same in most analyses, and are often referred to collectively as categorical.
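The distinction can be illustrated with a minimal Python sketch. The variable names and data values below are made up for illustration, and the mapping from modeling type to operation is a simplification.

```python
from collections import Counter
import statistics

# Hypothetical example data, one list per modeling type.
heights_cm = [160.5, 172.0, 181.3]                 # continuous
shirt_sizes = ["small", "large", "medium", "small"]  # ordinal
colors = ["red", "blue", "red", "green"]           # nominal

# Continuous: arithmetic on the values themselves is meaningful.
mean_height = statistics.fmean(heights_cm)

# Ordinal: the labels are categories, but they can be sorted by a known order.
size_order = {"small": 0, "medium": 1, "large": 2}
sorted_sizes = sorted(shirt_sizes, key=size_order.__getitem__)

# Nominal: only counting per level is meaningful; there is no order.
color_counts = Counter(colors)

print(mean_height, sorted_sizes, color_counts)
```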
Distribution
Provides a histogram for continuous data and a bar chart for nominal or ordinal data, along with
relevant summary statistics. Presents options for many one-sample analyses, based on modeling
type.
Fit Y by X
Shows plots that describe the relationship between any two variables. Provides two-sample
analyses based on the modeling types of the two variables, such as bivariate, oneway, logistic,
and contingency analysis.
Matched Pairs
Analyzes two continuous variables that are measurements on the same experimental unit or
subject.
Tabulate
Constructs tables of descriptive statistics using an interactive interface.
Fit Model
Fits models involving one or more Y variables and multiple X variables. Techniques include
standard least squares, stepwise, generalized regression, mixed models, MANOVA, loglinear
variance, logistic, proportional hazards, parametric survival, generalized linear models, partial
least squares, and response screening.
Modeling
Offers various modeling techniques: nonlinear, neural, Gaussian process, partition analysis, time
series, and model comparison. Screening is for designs with many effects. Response screening
is for a larger number of effects across groups.
Multivariate Methods
Offers techniques for exploring relationships among multiple variables: multivariate fitting,
clustering, principal components, discriminant analysis, and partial least squares.
Quality and Process
Offers techniques for evaluating quality-related issues in processes or products: control charts
(including an interactive control chart builder), measurement systems analysis, variability and
attribute gauge charts, capability charts on multiple responses, Pareto plots, and fishbone
(Ishikawa Cause and Effect) diagrams.
Reliability and Survival
Offers techniques for fitting survival and reliability data: life distribution, fit life by x, recurrence
analysis, degradation, reliability growth and forecasting, product-limit survival fit, parametric
survival distributions, and proportional hazards modeling.
Consumer Research
Provides methods for studying consumer preferences. Options include categorical response
survey analysis, factor analysis, choice models, item analysis, and uplift models for identifying
the positive effects of marketing actions.
t test
The t test is most commonly applied when the test statistic would follow a normal distribution if the value of a scaling term in the test statistic (typically the population standard deviation) were known. When the scaling term is unknown and is replaced by an estimate based on the data, the test statistic (under certain conditions) follows a Student's t distribution.
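As a rough illustration, the one-sample case replaces the unknown scaling term with the sample standard deviation. The function name `one_sample_t` and the data are hypothetical; this is a sketch of the statistic only, not of the p-value lookup.

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """Return (t statistic, degrees of freedom) for a one-sample t test."""
    n = len(sample)
    mean = statistics.fmean(sample)
    s = statistics.stdev(sample)        # estimate of the scaling term (n - 1 denominator)
    standard_error = s / math.sqrt(n)
    return (mean - mu0) / standard_error, n - 1

# Made-up measurements, tested against a hypothesized mean of 4.0.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
t_stat, df = one_sample_t(sample, mu0=4.0)
print(t_stat, df)
```

The statistic would then be compared against a Student's t distribution with `df` degrees of freedom.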
X Axis
The x axis is the horizontal number line in a Cartesian coordinate system.
Y Axis
The y axis is the vertical number line in a Cartesian coordinate system.
Box Plot
Used to display the response distribution at different combinations of factor levels. Box plots can reveal differences in the response mean at different levels, suggesting main effects. Box plots can also reveal whether the response variation is homogeneous across factor levels, an assumption made in ANOVA.
Bubble Plot
A two-dimensional scatterplot showing the relationship between two variables over time. Each circle, or bubble, represents a single instance of an ID variable.
Arithmetic Mean
The arithmetic mean is the sum of a set of n numbers divided by n. It is what is usually meant by "the average."
Median
The median is the middle value in a group of numbers sorted in order. It splits the data into a lower half and an upper half. When the group has an even count, the median is the mean of the two middle values.
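A short sketch using Python's standard `statistics` module shows the odd-count and even-count cases. The lists are made-up examples.

```python
import statistics

odd_count = [7, 1, 5]        # sorted: 1, 5, 7 -> middle value is 5
even_count = [7, 1, 5, 3]    # sorted: 1, 3, 5, 7 -> mean of 3 and 5

median_odd = statistics.median(odd_count)
median_even = statistics.median(even_count)
print(median_odd, median_even)
```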
Mode
The value that occurs most often in a group of numbers is called the mode. It is a way to summarize the most typical value of a variable, for example in a population. There can be more than one mode in a collection of numbers.
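The case of more than one mode can be seen with `statistics.multimode` (Python 3.8 or later), which returns every value tied for the highest count. The data are made up for illustration.

```python
import statistics

values = [1, 2, 2, 3, 3, 4]

# 2 and 3 both occur twice, so this group has two modes.
modes = statistics.multimode(values)
print(modes)
```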
Geometric mean
The geometric mean is a kind of average of a set of numbers that is different from the arithmetic average. It is well defined only for sets of positive real numbers. It is calculated by multiplying all n numbers together and taking the nth root of the product. A common example where the geometric mean is the correct choice is when averaging growth rates.
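The growth-rate case can be sketched as follows. The helper `geometric_mean` and the growth factors are hypothetical: a value grows by 50% in one period, then shrinks by 50% in the next.

```python
import math
import statistics

def geometric_mean(values):
    """Return the nth root of the product of n positive numbers."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean requires positive numbers")
    return math.prod(values) ** (1.0 / len(values))

growth_factors = [1.5, 0.5]   # +50%, then -50%
gm = geometric_mean(growth_factors)

# Applying the geometric mean once per period reproduces the real end value.
final_value = 100 * gm ** len(growth_factors)

# The arithmetic mean of the factors is 1.0, which wrongly suggests no change.
arithmetic = statistics.fmean(growth_factors)
print(gm, final_value, arithmetic)
```

Starting from 100, the true end value is 100 × 1.5 × 0.5 = 75, which the geometric mean recovers and the arithmetic mean does not.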
Harmonic Mean
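The harmonic mean is another kind of average of a set of numbers. It is calculated by dividing the number of elements by the sum of the reciprocals of the elements. For positive numbers, the harmonic mean is never greater than the geometric mean or the arithmetic mean, and it is strictly lower unless all the numbers are equal.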
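A minimal sketch, assuming positive inputs, computes the harmonic mean directly and compares it with the other two means. The speeds are a made-up example (the same distance driven at 40 and then at 60).

```python
import statistics

def harmonic_mean(values):
    """Return n divided by the sum of reciprocals (positive inputs assumed)."""
    return len(values) / sum(1.0 / v for v in values)

speeds = [40.0, 60.0]
hm = harmonic_mean(speeds)                # average speed over equal distances
gm = statistics.geometric_mean(speeds)
am = statistics.fmean(speeds)
print(hm, gm, am)
```

For these values the harmonic mean is the smallest of the three, the arithmetic mean the largest.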
Interpreting the t-Test
The primary result of a t test is the p-value. In this example, the p-value is 0.396 and the analyst is using a significance level of 0.05. Since 0.396 is greater than 0.05, the analyst cannot conclude that the average weight of car models in the broader population is significantly different from 3000 pounds. Had the p-value been lower than the significance level, the analyst would have concluded that the average car weight in the broader population is significantly different from 3000 pounds.
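The decision rule can be written out as a small sketch. The function name `interpret_p_value` is hypothetical; the p-value and significance level are the ones quoted in the example.

```python
def interpret_p_value(p_value, alpha=0.05):
    """Compare a p-value with the significance level and state the decision."""
    if p_value < alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"

# Values from the car-weight example: p = 0.396, significance level = 0.05.
decision = interpret_p_value(0.396, alpha=0.05)
print(decision)
```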
Analyzing a continuous variable might include questions such as the following:
- Does the shape of the data match any known distributions?
- Are there any outliers in the data?
- What is the average of the data?
- Is the average statistically different from a target or historical value?
- How spread out are the data? In other words, what is the standard deviation?
- What are the minimum and maximum values?
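Several of these questions can be answered with a few lines of standard-library Python. The function `summarize` and the data are hypothetical, and the 1.5 × IQR fence used to flag outliers is one common convention assumed here, not the only one.

```python
import statistics

def summarize(values):
    """Return basic summary statistics for a continuous variable."""
    data = sorted(values)
    q1, _, q3 = statistics.quantiles(data, n=4)   # quartiles
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "mean": statistics.fmean(data),
        "std_dev": statistics.stdev(data),        # sample standard deviation
        "min": data[0],
        "max": data[-1],
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }

summary = summarize([2, 4, 4, 4, 5, 5, 7, 9, 30])
print(summary)
```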
Binomial Regression
A regression method where the dependent variable takes one of two values (for example, 0 and 1, often corresponding to ‘no’ and ‘yes’, or ‘failure’ and ‘success’, respectively).
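One common form of this is logistic regression, which assumes a logit link between the predictor and the probability of a 1. The sketch below fits a single-predictor model by plain gradient ascent. The function names, learning rate, step count, and data are all illustrative assumptions, not a production fitting routine.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, learning_rate=0.1, steps=5000):
    """Fit P(y = 1 | x) = sigmoid(b0 + b1 * x) by gradient ascent
    on the mean log-likelihood."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            residual = y - sigmoid(b0 + b1 * x)   # observed minus predicted
            g0 += residual
            g1 += residual * x
        b0 += learning_rate * g0 / n
        b1 += learning_rate * g1 / n
    return b0, b1

# Made-up data: hours studied (x) and pass/fail coded as 1/0 (y).
hours = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
passed = [0, 0, 0, 1, 0, 1, 1, 1]
b0, b1 = fit_logistic(hours, passed)

p_low = sigmoid(b0 + b1 * 0.5)    # predicted probability at 0.5 hours
p_high = sigmoid(b0 + b1 * 4.0)   # predicted probability at 4.0 hours
print(b0, b1, p_low, p_high)
```

The fitted slope is positive for this data, so the predicted probability of a 1 rises with the predictor.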