Predictive Analytics Flashcards by Julian Sneve

Marital status and eye color are examples of what sort of data?

Nominal scales - categorical data divided into distinct categories with no inherent order

Survey responses (e.g. strongly agree, agree, neutral…) are examples of what sort of data?

Ordinal scales - categorical data whose categories have a meaningful rank order

What are the three patterns of missing data?

MCAR - Missing completely at random
MAR - Missing at random
MNAR - Missing not at random

Data missing due to test sensors losing connectivity is an example of which missing data pattern?

MCAR - Missing completely at random

If more women than men answer a survey about skin-care routines, the resulting gaps in the data are an example of which missing data pattern?

MAR - Missing at random

A pattern in the missing data that does not depend on the primary dependent variable is a trait of which missing data pattern?

MAR - Missing at random

Data missing due to patients refusing to disclose info on sensitive topics is an example of which missing data pattern?

MNAR - Missing not at random
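
The three patterns differ only in what drives the chance that a value goes missing. A minimal sketch, assuming a hypothetical survey with `gender` and `income` fields; the field names, the probabilities, and the 80,000 income cutoff are illustrative and not from the deck:

```python
import random

def make_missing(rows, pattern, seed=0):
    """Blank out some 'income' values according to a missing data pattern."""
    rng = random.Random(seed)
    result = []
    for row in rows:
        if pattern == "MCAR":
            # Same chance for every row, unrelated to any variable.
            p_drop = 0.3
        elif pattern == "MAR":
            # Chance depends on another, fully observed variable (gender).
            p_drop = 0.6 if row["gender"] == "M" else 0.1
        elif pattern == "MNAR":
            # Chance depends on the very value that goes missing (income).
            p_drop = 0.6 if row["income"] > 80000 else 0.1
        else:
            raise ValueError(f"unknown pattern: {pattern}")
        income = None if rng.random() < p_drop else row["income"]
        result.append({**row, "income": income})
    return result

# Hypothetical survey rows, for illustration only.
survey = [{"gender": "F", "income": 50000}, {"gender": "M", "income": 90000}] * 5
for pattern in ("MCAR", "MAR", "MNAR"):
    missing = sum(r["income"] is None for r in make_missing(survey, pattern))
    print(pattern, "missing values:", missing)
```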

A __________ is a supposition or observation regarding the results of sampling data.

Hypothesis

Rejecting the null hypothesis when it is actually true is which type of error?

Type I error

Failing to reject the null hypothesis when it is actually false is which type of error?

Type II error

This statistic is used for normally distributed data and a known population standard deviation.

z-statistic

This statistic is used for normally distributed data and an unknown population standard deviation.

t-statistic
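
The two statistics share the same form and differ only in which standard deviation goes in the denominator. A small sketch of both for a one-sample test of a mean; the sample values, the hypothesized mean `mu0`, and the known `sigma` are made up for illustration:

```python
import math
import statistics

def z_statistic(sample, mu0, sigma):
    """One-sample statistic when the population standard deviation is known."""
    n = len(sample)
    return (statistics.fmean(sample) - mu0) / (sigma / math.sqrt(n))

def t_statistic(sample, mu0):
    """One-sample statistic when sigma is unknown and estimated from the sample."""
    n = len(sample)
    s = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
    return (statistics.fmean(sample) - mu0) / (s / math.sqrt(n))

sample = [2, 4, 4, 4, 5, 5, 7, 9]               # sample mean is 5
print(round(z_statistic(sample, 4, 2), 4))      # 1.4142
print(round(t_statistic(sample, 4), 4))         # 1.3229
```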

T/F: P-values measure the probability that the null hypothesis is true.

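False - a p-value is the probability of observing results at least as extreme as the sample's, assuming the null hypothesis is true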

Which statistical summary value provides the measure of location, or central tendency?

Arithmetic mean

The _______ is the middle number in a set of observations that are in order.

Median

The _______ is the number in a set of observations that occurs most often.

Mode

The _______ measures the spread of the distribution of a set of observations.

Standard deviation

Most often in predictive modeling, _______ is considered in the context of normal distributions and provides a measure of statistical dispersion.

Standard deviation

Which statistical summary measurement measures the balance of a distribution?

Skewness

_______ measures the shape (how fat or thin compared to a normal distribution) of the distribution.

Kurtosis
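
Skewness and kurtosis can both be computed from central moments. The sketch below uses the plain population-moment definitions and reports excess kurtosis (a normal distribution scores 0); statistical packages often apply sample-size corrections, so their numbers may differ slightly. The data sets are illustrative.

```python
import statistics

def central_moment(data, k):
    """Average of (x - mean)**k over the data."""
    mean = statistics.fmean(data)
    return sum((x - mean) ** k for x in data) / len(data)

def skewness(data):
    """0 for a balanced distribution, positive when the right tail is longer."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data):
    """Positive for fatter tails than a normal distribution, negative for thinner."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 10]
print(skewness(symmetric))                    # 0.0
print(skewness(right_tailed) > 0)             # True
print(round(excess_kurtosis(symmetric), 2))   # -1.3
```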

_______ is a measure of the numerical relationship of one variable to another.

Correlation
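
One common way to put a number on that relationship is the Pearson correlation coefficient, which ranges from -1 to 1. The deck does not name a specific coefficient, so treat the choice of Pearson here, and the sample values, as assumptions:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_dev / (dev_x * dev_y)

hours = [1, 2, 3, 4, 5]
print(round(pearson_r(hours, [2, 4, 6, 8, 10]), 6))   # 1.0  (perfect positive)
print(round(pearson_r(hours, [10, 8, 6, 4, 2]), 6))   # -1.0 (perfect negative)
```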

What does PCA stand for?

Principal component analysis

From chapter 3, what are the three key measures of central tendency?

Mean, median, mode

From chapter 3, what are the five key measures of variability?

Standard deviation, variance, range, kurtosis, and skewness

As a measure of central tendency the _______ is not influenced by outliers.

Median
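
A quick illustration with made-up numbers: replacing one value with an extreme outlier moves the mean a long way but leaves the median where it was.

```python
import statistics

clean = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 500]   # the last value is an extreme outlier

print(statistics.mean(clean), statistics.median(clean))                # 3 3
print(statistics.mean(with_outlier), statistics.median(with_outlier))  # 102 3
```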

What is the average of the squared deviations of each observation from the mean?

Variance

What is the square root of the variance?

Standard deviation
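
Both definitions translate directly into code. This sketch uses the population form, dividing by n as the "average of the squared deviations" wording suggests; the sample form divides by n - 1 instead. The data values are illustrative.

```python
import math

def population_variance(data):
    """Average of the squared deviations of each observation from the mean."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** 2 for x in data) / len(data)

def population_std(data):
    """Square root of the variance."""
    return math.sqrt(population_variance(data))

data = [2, 4, 4, 4, 5, 5, 7, 9]
print(population_variance(data))   # 4.0
print(population_std(data))        # 2.0
```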

The _______ is the most widely used transformation method to deal with skewed data.

Logarithmic function
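
To see why the log helps, the sketch below measures moment-based skewness before and after a base-10 log transform. The data set is contrived (each value is ten times the previous one) so that the effect is exact; real data will usually end up less skewed rather than perfectly symmetric, and the log only applies to positive values.

```python
import math

def skewness(data):
    """Moment-based skewness: 0 when balanced, positive with a long right tail."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    return m3 / m2 ** 1.5

raw = [1, 10, 100, 1000, 10000]            # strongly right-skewed
logged = [math.log10(x) for x in raw]      # roughly 0, 1, 2, 3, 4

print(round(skewness(raw), 2))             # clearly positive
print(round(abs(skewness(logged)), 6))     # 0.0
```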

By _______ data there is a loss of information and power and, if not necessary, it should be avoided.

Binning

An r-squared value of 0.25 indicates what?

That 25% of the variation in the target variable (y) is explained by the input variable (x)
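
As a worked illustration, r-squared for a one-input least-squares line is 1 minus the ratio of unexplained variation to total variation. The data points here are invented and happen to give 0.64 rather than 0.25:

```python
def r_squared(xs, ys):
    """Share of the variation in y explained by a least-squares line on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    # Variation left over after fitting the line.
    ss_residual = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    # Total variation of y around its mean.
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_residual / ss_total

print(round(r_squared([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]), 2))   # 0.64
```

Here 64% of the variation in y is explained by x, and the remaining 36% is not.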

_______ identifies the correlations and covariances between the input variables and creates groups or clusters of similar variables.

Variable clustering

When using variable clustering, the clusters are determined by calculating the _______ distance.

Euclidean

When using variable clustering, the _______ is the average of the points in the cluster; a representative point that lies at the center.

Centroid
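
Both ideas are easy to check by hand. The sketch below treats cluster members as plain coordinate tuples, which is a simplification: in variable clustering the "points" would be variables, and the coordinates shown are made up.

```python
import math

def centroid(points):
    """Coordinate-wise average of the points in a cluster."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

cluster = [(0, 0), (2, 0), (2, 2), (0, 2)]
center = centroid(cluster)
print(center)                          # (1.0, 1.0)

# Euclidean distance: straight-line distance between two points.
print(math.dist((0, 0), (3, 4)))       # 5.0
print(math.dist(center, (1, 3)))       # 2.0
```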

The _______ for a given cluster measures the variance in all the variables.

Eigenvalue

_______ is a variable reduction strategy that is used when there are several redundant variables or variables that are correlated with one another and may be measuring the same construct.

Principal component analysis (PCA)
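
For just two variables, the eigenvalues that PCA works with can be written in closed form from the 2x2 covariance matrix. This is a sketch for intuition only; with more variables a linear algebra library would normally do the decomposition. The two input variables below are invented and perfectly correlated, so a single component carries all of the variance.

```python
import math

def pca_eigenvalues_2d(xs, ys):
    """Eigenvalues of the sample covariance matrix of two variables, largest first."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    # Closed-form eigenvalues of the symmetric matrix [[var_x, cov_xy], [cov_xy, var_y]].
    middle = (var_x + var_y) / 2
    half_gap = math.sqrt(((var_x - var_y) / 2) ** 2 + cov_xy ** 2)
    return middle + half_gap, middle - half_gap

first, second = pca_eigenvalues_2d([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(first, second)                 # 12.5 0.0
print(first / (first + second))      # 1.0, the share of variance in the first component
```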

The _______ defines the condition that the researchers need to discredit before suggesting an effect exists.

Null hypothesis

Regression analysis is _______, meaning it assumes the data come from a population that follows a probability distribution based on a fixed set of parameters.

Parametric

What are the five common assumptions that must be validated for a model to generate good results?

1. Linearity
2. Independence: no multicollinearity
3. Constant variance: homoskedasticity
4. No autocorrelation
5. Normality

Which assumption states there should be a linear relationship between the target and input variables?

Linearity

Which assumption states the input variables should not be correlated?

Independence: no multicollinearity

_______ assumes that the variance of a variable is constant across all values of another variable.

Homoskedasticity (constant variance)

_______ occurs when one data point in a variable is dependent on another data point within the same variable.

Autocorrelation

Which assumption states that the error terms are normally distributed, with a mean of zero, for any given value of the input variables?

Normality

According to chapter 4, what are the three common metrics to evaluate the strength of a regression line?

1. R-squared
2. Adjusted r-squared
3. P-value

Which regression evaluation metric provides the percentage variation in the target variable explained by the input variables?

R-squared (coefficient of determination)

Which regression evaluation metric integrates the model's degrees of freedom as a means of adjustment?

Adjusted r-squared
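
The usual formula is adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of input variables. A minimal sketch with hypothetical values for n and p:

```python
def adjusted_r_squared(r_squared, n_observations, n_inputs):
    """Penalize R-squared for the degrees of freedom used by the input variables."""
    return 1 - (1 - r_squared) * (n_observations - 1) / (n_observations - n_inputs - 1)

# Hypothetical model: R-squared of 0.25 from 21 observations and 4 input variables.
print(adjusted_r_squared(0.25, 21, 4))   # 0.0625
```

With so few observations per input, most of the apparent fit disappears after the adjustment.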

T/F: When using multiple linear regression, the adjusted r-squared should always be used over the r-squared value.

True

Generally, a p-value less than or equal to _______ indicates strong evidence against the null hypothesis.

0.05
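
To connect this threshold to a test statistic, the sketch below converts a z-statistic into a two-sided p-value using the standard normal distribution. The z values are illustrative, and the two-sided form is an assumption since the card does not say which kind of test is meant.

```python
from statistics import NormalDist

def two_sided_p_value(z):
    """Probability of a result at least this extreme if the null hypothesis is true."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

print(round(two_sided_p_value(1.96), 3))   # 0.05
print(round(two_sided_p_value(0.5), 3))    # 0.617
```

A z-statistic of about 1.96 sits right at the 0.05 cutoff, while 0.5 gives little evidence against the null hypothesis.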

Predictive Analytics Flashcards

(49 cards)