Data Science and Statistics Vocab Flashcards by Brian Toro

What are the three modeling types? Describe each.

Here is a brief description of the three modeling types:

Continuous - values are numeric and are used directly in an analysis.

Ordinal - values are category labels, but their order is meaningful.

Nominal - values are treated as unordered, categorical names of levels.

The ordinal and nominal modeling types are treated the same in most analyses, and are often referred to collectively as categorical.

Distribution

Provides a histogram for continuous data and a bar chart for nominal or ordinal data, along with
relevant summary statistics. Presents options for many one-sample analyses, based on modeling
type.

Fit Y by X

Shows plots that describe the relationship between any two variables. Provides two-sample
analyses based on the modeling types of the two variables, such as bivariate, oneway, logistic,
and contingency analysis.

Matched Pairs

Analyzes two continuous variables that are measurements on the same experimental unit or
subject.

Tabulate

Constructs tables of descriptive statistics using an interactive interface.

Fit Model

Fits models involving one or more Y variables and multiple X variables. Techniques include
standard least squares, stepwise, generalized regression, mixed models, MANOVA, loglinear
variance, logistic, proportional hazards, parametric survival, generalized linear models, partial
least squares, and response screening.

Modeling

Offers various modeling techniques: nonlinear, neural, Gaussian process, partition analysis, time
series, and model comparison. Screening is for designs with many effects. Response screening
is for a larger number of effects across groups.

Multivariate Methods

Offers techniques for exploring relationships among multiple variables: multivariate fitting,
clustering, principal components, discriminant analysis, and partial least squares.

Quality and Process

Offers techniques for evaluating quality-related issues in processes or products: control charts
(including an interactive control chart builder), measurement systems analysis, variability and
attribute gauge charts, capability charts on multiple responses, Pareto plots, and fishbone
(Ishikawa Cause and Effect) diagrams.

Reliability and Survival

Offers techniques for fitting survival and reliability data: life distribution, fit life by x, recurrence
analysis, degradation, reliability growth and forecasting, product-limit survival fit, parametric
survival distributions, and proportional hazards modeling.

Consumer Research

Provides methods for studying consumer preferences. Options include categorical response
survey analysis, factor analysis, choice models, item analysis, and uplift models for identifying
the positive effects of marketing actions.

t test

A t-test is most commonly applied when the test statistic would follow a normal distribution if the value of a scaling term in the test statistic were known. When the scaling term is unknown and is replaced by an estimate based on the data, the test statistic (under certain conditions) follows a Student's t distribution.
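As a rough illustration of the estimated scaling term, the sketch below computes a one-sample t statistic with only the Python standard library. The function name `one_sample_t` and the sample values are made up for this example; the sample standard deviation plays the role of the estimated scaling term.

```python
import math
import statistics

def one_sample_t(sample, target):
    """Return (t statistic, degrees of freedom) for H0: population mean == target."""
    n = len(sample)
    mean = statistics.mean(sample)
    # The sample standard deviation is the data-based estimate of the scaling term.
    sd = statistics.stdev(sample)
    standard_error = sd / math.sqrt(n)
    return (mean - target) / standard_error, n - 1

# Hypothetical car weights in pounds, compared with a target of 3000.
weights = [2900, 3100, 3050, 2950, 3000, 3200]
t_stat, df = one_sample_t(weights, 3000)
print(t_stat, df)
```

The statistic would then be compared with a Student's t distribution with `df` degrees of freedom to obtain a p-value.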

X axis

The x-axis is the horizontal number line in a Cartesian coordinate system.

Y Axis

The y-axis is the vertical number line in a Cartesian coordinate system.

Box Plot

Used to display the response distribution at different combinations of factor levels. Box plots can reveal differences in the response Mean at different levels, suggesting Main Effects. Box plots can also reveal whether the response variation is homogeneous across factor levels, an assumption made in ANOVA.

Bubble Plot

A two-dimensional Scatterplot showing the relationship between two Variables over time. Each circle, or bubble, represents a single instance of an ID variable.

Arithmetic Mean

The arithmetic mean is the sum of a set of n numbers divided by n.

Median

The median is the middle value in a group of ordered numbers. It separates the higher half of the values from the lower half. When the group has an even number of values, the median is the average of the two middle values.

Mode

The mode is the value that occurs most often in a group. It is a way to summarize the most common value of a variable, for example in a population. There can be more than one mode in a group (or collection) of numbers.
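A short sketch of the median and mode using Python's standard `statistics` module. The list `values` is made-up data; `multimode` is used because it returns every value tied for most frequent.

```python
import statistics

# Made-up data with two values (1 and 5) tied for most frequent.
values = [3, 1, 4, 1, 5, 9, 2, 6, 5]

middle = statistics.median(values)     # middle value of the sorted data
modes = statistics.multimode(values)   # all most-frequent values, in first-seen order

# With an even number of values, the median averages the two middle values.
even_middle = statistics.median([1, 2, 3, 4])

print(middle, modes, even_middle)
```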

Geometric mean

Geometric mean is a kind of average of a set of numbers that is different from the arithmetic average. The geometric mean is well defined only for sets of positive real numbers. It is calculated by multiplying all n numbers together and taking the nth root of the product. A common example where the geometric mean is the correct choice is when averaging growth rates.

Harmonic Mean

Harmonic mean is another kind of average of a set of numbers. It is calculated by dividing the number of elements by the sum of the reciprocals of the elements. For positive numbers, the harmonic mean is never greater than the geometric mean or the arithmetic mean.
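The three means can be compared side by side. This is a minimal sketch using the standard `statistics` module (Python 3.8 or later for `geometric_mean`); the growth factors in `rates` are hypothetical, and the manual formulas are included only to mirror the definitions on the cards.

```python
import math
import statistics

# Hypothetical yearly growth factors (1.10 means +10%).
rates = [1.10, 1.20, 1.50]
n = len(rates)

am = statistics.mean(rates)
gm = statistics.geometric_mean(rates)
hm = statistics.harmonic_mean(rates)

# The same quantities written out from their definitions.
gm_manual = math.prod(rates) ** (1 / n)      # nth root of the product
hm_manual = n / sum(1 / x for x in rates)    # n over the sum of reciprocals

print(hm, gm, am)
```

For positive inputs like these, the printed values come out in non-decreasing order: harmonic, geometric, arithmetic.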

Interpreting the t-Test

The primary result of a t-test is the p-value. In this example, the p-value is 0.396 and the analyst is using a significance level of 0.05. Since 0.396 is greater than 0.05, the analyst cannot conclude that the average weight of car models in the broader population is significantly different from 3000 pounds. Had the p-value been lower than the significance level, the analyst would have concluded that the average car weight in the broader population is significantly different from 3000 pounds.

Analyzing a continuous variable might include questions such as the following:

Does the shape of the data match any known distributions?
Are there any outliers in the data?
What is the average of the data?
Is the average statistically different from a target or historical value?
How spread out are the data? In other words, what is the standard deviation?
What are the minimum and maximum values?
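Several of these questions can be answered with a few summary statistics. The sketch below uses only the standard library; the `data` values and the `summary` dictionary are invented for illustration.

```python
import statistics

# Made-up measurements of a continuous variable.
data = [12.1, 11.8, 12.4, 12.0, 15.9, 11.9, 12.2]

summary = {
    "mean": statistics.mean(data),
    "stdev": statistics.stdev(data),   # sample standard deviation (spread)
    "min": min(data),
    "max": max(data),
}

for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

A maximum far from the mean relative to the standard deviation, as with 15.9 here, is the kind of value worth checking as a possible outlier.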

Binomial Regression

A regression method where the Dependent Variable contains binomial values (for example, 0 and 1, often corresponding to ‘no’ and ‘yes’, or ‘failure’ and ‘success’, respectively).

chi-squared test

A statistical test used to test the existence of a relationship between two nominal Variables where the sampling distribution of the Test Statistic is a chi-squared distribution when the Null Hypothesis is true (or where it is asymptotically true).
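A minimal sketch of the test statistic for a contingency table of two nominal variables, assuming the usual Pearson form (observed versus expected counts under independence). The function name `chi_squared_statistic` and the example table are hypothetical.

```python
def chi_squared_statistic(table):
    """Return (statistic, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were unrelated.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df

# Hypothetical counts: rows are one nominal variable, columns the other.
stat, df = chi_squared_statistic([[20, 0], [0, 20]])
print(stat, df)
```

The statistic would then be compared with a chi-squared distribution with `df` degrees of freedom.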

Matched Pairs

The Matched Pairs command handles bivariate data in the special situation where the two responses form a pair of measurements coming from the same experimental unit or subject. For example, a matched pair might be a before-and-after blood pressure measurement from the same subject. The responses are correlated, and the statistical method called the paired t-test takes that into account.
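One way to see how the paired t-test accounts for the correlation is that it works on the within-subject differences. This is a sketch in plain Python, not the Matched Pairs command itself; `paired_t` and the blood pressure readings are made up.

```python
import math
import statistics

def paired_t(before, after):
    """Return (t statistic, degrees of freedom) for H0: mean difference == 0."""
    # Each difference comes from one subject, so between-subject variation cancels.
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / standard_error, n - 1

# Hypothetical before-and-after readings for three subjects.
before = [120, 130, 140]
after = [121, 132, 143]
t_stat, df = paired_t(before, after)
print(t_stat, df)
```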

Tree Plot

Creates a rectangular tiling for a nominal or ordinal variable where you can tile categories to be proportional in size to a selected variable. A positional specification is optional. Useful when there are a lot of categories or when histograms are ineffective.

Confidence

In statistics, confidence has a broader meaning (as in confidence interval): it concerns the degree of error in an estimate that results from selecting one sample as opposed to another.

P (A | B)

Is the conditional probability of event A occurring given that event B has occurred. Read as “the probability that A will occur given that B has occurred.”
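A small sketch of estimating P(A | B) from a list of equally likely outcomes: keep only the outcomes where B occurred, then take the share of those where A also occurred. The helper `conditional_probability` and the die example are illustrative assumptions.

```python
def conditional_probability(outcomes, event_a, event_b):
    """Share of outcomes satisfying A among those satisfying B."""
    given_b = [o for o in outcomes if event_b(o)]
    if not given_b:
        raise ValueError("event B never occurs, so P(A | B) is undefined")
    return sum(1 for o in given_b if event_a(o)) / len(given_b)

# One roll of a fair die: A = "roll is even", B = "roll is greater than 3".
die = [1, 2, 3, 4, 5, 6]
p = conditional_probability(die, lambda o: o % 2 == 0, lambda o: o > 3)
print(p)
```

Here B leaves the outcomes 4, 5, and 6, and two of those three are even.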

Predictor

Usually denoted by X; also called a feature, input variable, independent variable, or, from a database perspective, a field.

Response

Usually denoted by Y; the variable being predicted in supervised learning. Also called dependent variable, output variable, target variable, or outcome variable.

Score

Refers to a predicted value or class. Scoring new data means to use a model developed with training data to predict output values in new data.

Supervised Learning

Refers to the process of providing an algorithm (logistic regression, regression tree, etc.) with records in which an output variable of interest is known and the algorithm “learns” how to predict this value with new records where the output is unknown.

Test Data (or test set)

Refers to that portion of the data used only at the end of the model building and selection process to assess how well the final model might perform on additional data.

Training Data (or training set)

Refers to that portion of data used to fit a model.

Unsupervised Learning

Refers to analysis in which one attempts to learn something about the data other than predicting an output value of interest (e.g., whether it falls into clusters).

Validation Data (or validation set)

Refers to that portion of the data used to assess how well the model fits, to adjust some models, and to select the best model from among those that have been tried.
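The training, validation, and test sets can be sketched as a random three-way split of the records. The function name `split_records`, the 60/20/20 fractions, and the fixed seed are assumptions chosen for this example, not a recommended standard.

```python
import random

def split_records(records, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle the records and return (training, validation, test) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)   # seeded so the split is repeatable
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]     # held back until the very end
    return train, valid, test

train, valid, test = split_records(range(100))
print(len(train), len(valid), len(test))
```

Every record lands in exactly one of the three sets, so the test set stays unseen during fitting and model selection.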

CRISP

Cross-Industry Standard Process for Data Mining. The phases are:
Business Understanding
Data Understanding
Data Preparation
Modeling
Evaluation
Deployment

(38 cards)