Predictive Analytics Flashcards
Marital status and eye color are examples of what sort of data?
Nominal scales - categorical data divided into distinct categories
Survey responses (e.g. strongly agree, agree, neutral…) are examples of what sort of data?
Ordinal scales - categorical data whose categories have a meaningful rank order
What are the three patterns of missing data?
MCAR - Missing completely at random
MAR - Missing at random
MNAR - Missing not at random
Data missing due to test sensors losing connectivity is an example of which missing data pattern?
MCAR - Missing completely at random
If more women than men answer a survey about skincare routines, the resulting missing data is an example of which missing data pattern?
MAR - Missing at random
A pattern in the missing data that is not related to the primary dependent variable is a trait of which missing data pattern?
MAR - Missing at random
Data missing due to patients refusing to disclose info on sensitive topics is an example of which missing data pattern?
MNAR - Missing not at random
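Worked example (a minimal sketch, not from the source deck): the three patterns differ only in what drives the chance that a value goes missing. The `sex` and `income` fields, the probabilities, and the 80,000 cutoff below are all made-up assumptions for illustration.

```python
import random

def simulate_missing(rows, pattern, rng):
    """Return a copy of rows with some 'income' values replaced by None."""
    out = []
    for row in rows:
        if pattern == "MCAR":
            # Every value has the same chance of being lost (e.g. a sensor dropout).
            p = 0.2
        elif pattern == "MAR":
            # Missingness depends on another observed variable (here: sex).
            p = 0.4 if row["sex"] == "M" else 0.05
        elif pattern == "MNAR":
            # Missingness depends on the value that goes missing (here: income).
            p = 0.5 if row["income"] > 80_000 else 0.05
        else:
            raise ValueError(f"unknown pattern: {pattern}")
        income = None if rng.random() < p else row["income"]
        out.append({"sex": row["sex"], "income": income})
    return out

rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows = [{"sex": rng.choice("MF"), "income": rng.randint(20_000, 120_000)}
        for _ in range(2000)]
mnar = simulate_missing(rows, "MNAR", rng)
```

In the MNAR case the missing rate is much higher among high incomes, which is why simply dropping incomplete rows would bias any income estimate.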
A __________ is a supposition or observation regarding the results of sampling data.
Hypothesis
Rejecting the null hypothesis when it is actually true is which type of error?
Type I error
Failing to reject the null hypothesis when it is actually false is which type of error?
Type II error
This statistic is used for normally distributed data and a known population standard deviation.
z-statistic
This statistic is used for normally distributed data and an unknown population standard deviation.
t-statistic
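Worked example (a sketch under assumed numbers): for a one-sample test of a mean, the two statistics share the same form and differ only in where the standard deviation comes from. The sample values, the hypothesized mean `mu0`, and the known `sigma` below are illustrative assumptions.

```python
import math
import statistics

def z_statistic(sample, mu0, sigma):
    """One-sample statistic when the population standard deviation is known."""
    n = len(sample)
    return (statistics.mean(sample) - mu0) / (sigma / math.sqrt(n))

def t_statistic(sample, mu0):
    """One-sample statistic when sigma is unknown and estimated from the sample."""
    n = len(sample)
    s = statistics.stdev(sample)  # sample standard deviation (n - 1 divisor)
    return (statistics.mean(sample) - mu0) / (s / math.sqrt(n))

sample = [52, 48, 55, 51, 49, 53, 50, 54]
z = z_statistic(sample, mu0=50, sigma=2)
t = t_statistic(sample, mu0=50)
print(round(z, 3), round(t, 3))
```

The t-statistic is compared against a t-distribution with n - 1 degrees of freedom rather than the standard normal.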
T/F: P-values measure the probability that the null hypothesis is true.
False
Which statistical summary value provides the measure of location, or central tendency?
Arithmetic mean
The _______ is the middle number in a set of observations that are in order.
Median
The _______ is the number in a set of observations that occurs most often.
Mode
The _______ measures the spread of the distribution of a set of observations.
Standard deviation
Most often in predictive modeling, _______ is considered in the context of normal distributions and provides a measure of statistical dispersion.
Standard deviation
Which summary statistic measures the balance (symmetry) of a distribution?
Skewness
_______ measures the shape of the distribution (how heavy or light its tails are compared to a normal distribution).
Kurtosis
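Worked example (a minimal sketch): both statistics can be computed from standardized moments. The population-moment definitions below are one common convention and are an assumption here; statistical packages often apply small-sample corrections, so their numbers may differ slightly.

```python
import statistics

def skewness(xs):
    """Third standardized moment: 0 for symmetric data, > 0 for a long right tail."""
    m = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return sum(((x - m) / sd) ** 3 for x in xs) / len(xs)

def excess_kurtosis(xs):
    """Fourth standardized moment minus 3, so a normal distribution scores 0."""
    m = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return sum(((x - m) / sd) ** 4 for x in xs) / len(xs) - 3

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]
print(skewness(symmetric), skewness(right_skewed), excess_kurtosis(symmetric))
```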
_______ is a measure of the numerical relationship of one variable to another.
Correlation
What does PCA stand for?
Principal component analysis
From chapter 3, what are the three key measures of central tendency?
Mean, median, mode
From chapter 3, what are the five key measures of variability?
Standard deviation, variance, range, kurtosis, and skewness
As a measure of central tendency, the _______ is not influenced by outliers.
Median
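Worked example (made-up numbers): replacing the largest value with an extreme one shifts the mean substantially but leaves the median where it was.

```python
import statistics

# Hypothetical salaries in thousands; the second list swaps the top value for an outlier.
typical = [40, 42, 45, 47, 50]
with_outlier = [40, 42, 45, 47, 500]

mean_shift = statistics.mean(with_outlier) - statistics.mean(typical)
median_shift = statistics.median(with_outlier) - statistics.median(typical)
print(mean_shift, median_shift)
```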
What is the average of the squared deviations of each observation from the mean?
Variance
What is the square root of the variance?
Standard deviation
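Worked example (illustrative data): the definition on the card, the average of the squared deviations, corresponds to the population variance. The sample variance divides by n - 1 instead of n, which is a detail added here rather than taken from the deck.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean is 5

pop_var = statistics.pvariance(data)    # average squared deviation (divide by n)
pop_sd = statistics.pstdev(data)        # square root of the population variance
sample_var = statistics.variance(data)  # divides by n - 1 instead
print(pop_var, pop_sd, sample_var)
```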
The _______ is the most widely used transformation method to deal with skewed data.
Logarithmic function
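Worked example (a sketch with made-up values): taking logs compresses large values more than small ones, which pulls in a long right tail. The skewness helper uses the population-moment definition, an assumption of this sketch, and the transform requires strictly positive data.

```python
import math
import statistics

def skewness(xs):
    """Third standardized moment (population version)."""
    m = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return sum(((x - m) / sd) ** 3 for x in xs) / len(xs)

raw = [1, 2, 3, 5, 8, 13, 100, 1000]  # strongly right-skewed, all positive
logged = [math.log(x) for x in raw]   # natural log transform

print(skewness(raw), skewness(logged))
```

For data containing zeros, a shifted transform such as `math.log1p(x)` is a common workaround.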
_______ data causes a loss of information and power and should be avoided unless it is necessary.
Binning
An r-squared value of 0.25 indicates what?
That 25% of the variation in the target variable (y) is explained by the input variable (x)
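Worked example (a sketch on toy data): r-squared can be computed as one minus the ratio of residual variation to total variation after fitting a least-squares line. The data points are made up, and the helper name `r_squared` is hypothetical.

```python
import statistics

def r_squared(xs, ys):
    """R-squared of a simple least-squares line fitted to (xs, ys)."""
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r2 = r_squared(xs, ys)
print(round(r2, 3))
```

With a single input variable, this value equals the square of the Pearson correlation between x and y.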
_______ identifies the correlations and covariances between the input variables and creates groups or clusters of similar variables.
Variable clustering
When using variable clustering, the clusters are determined by calculating the _______ distance.
Euclidean distance
When using variable clustering, the _______ is the average of the points in the cluster; a representative point that lies at the center.
Centroid
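Worked example (illustrative points): the centroid is the coordinate-wise average of the points in a cluster, and the Euclidean distance is the straight-line distance between two points. The points are made up; `math.dist` requires Python 3.8 or later.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a list of equal-length points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

cluster = [(1, 2), (3, 4), (5, 0)]
center = centroid(cluster)

# Euclidean distance from each point to the cluster center.
distances = [math.dist(p, center) for p in cluster]
print(center, distances)
```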
The _______ for a given cluster measures the variance in all the variables.
Eigenvalue
_______ is a variable reduction strategy that is used when there are several redundant variables or variables that are correlated with one another and may be measuring the same construct.
Principal component analysis (PCA)
The _______ defines the condition that the researchers need to discredit before suggesting an effect exists.
Null hypothesis
Regression analysis is _______, meaning it assumes the data come from a population that follows a probability distribution based on a fixed set of parameters.
Parametric
What are the five common assumptions that must be validated for a model to generate good results?
- Linearity
- Independence : No multicollinearity
- Constant variance : Homoskedasticity
- No autocorrelation
- Normality
Which assumption states there should be a linear relationship between the target and input variables?
Linearity
Which assumption states the input variables should not be correlated?
Independence : No multicollinearity
_______ assumes that the variance of a variable is constant across all values of another variable.
Homoskedasticity (constant variance)
_______ occurs when one data point in a variable is dependent on another data point within the same variable.
Autocorrelation
Which assumption states that the error terms are normally distributed, with a mean of zero, for any given value of the input variables?
Normality
According to chapter 4, what are the three common metrics to evaluate the strength of a regression line?
- R-squared
- Adjusted r-squared
- P-value
Which regression evaluation metric provides the percentage variation in the target variable explained by the input variables?
R-squared (coefficient of determination)
Which regression evaluation metric integrates the model’s degrees of freedom as a means of adjustment?
Adjusted r-squared
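Worked example (a sketch using the standard textbook formula, which the deck itself does not spell out): the adjustment penalizes r-squared for each additional input variable. Here `n` is the number of observations and `p` the number of input variables; the numbers are illustrative.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R-squared for n observations and p input variables."""
    if n - p - 1 <= 0:
        raise ValueError("need more observations than parameters")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

one_input = adjusted_r_squared(0.60, n=30, p=1)
five_inputs = adjusted_r_squared(0.60, n=30, p=5)
print(round(one_input, 4), round(five_inputs, 4))
```

With the same r-squared, the model using five inputs gets the lower adjusted value, which is why the adjusted figure is preferred when comparing multiple regression models.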
T/F: When using multiple linear regression, the adjusted r-squared should always be used over the r-squared value.
True
Generally, a p-value less than or equal to _______ indicates strong evidence against the null hypothesis.
0.05
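Worked example (a sketch, assuming a two-sided z-test): the p-value is the probability, if the null hypothesis is true, of a statistic at least as extreme as the one observed. The helper names are hypothetical; `statistics.NormalDist` requires Python 3.8 or later.

```python
from statistics import NormalDist

ALPHA = 0.05  # conventional significance threshold

def two_sided_p_value(z):
    """Probability of a |Z| at least this large under a standard normal null."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

def reject_null(z, alpha=ALPHA):
    """True when the p-value is less than or equal to the threshold."""
    return two_sided_p_value(z) <= alpha

print(two_sided_p_value(1.96), reject_null(2.5), reject_null(1.0))
```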