Predictive Analytics Flashcards

1
Q

Marital status and eye color are examples of what sort of data?

A

Nominal scales - categorical data divided into distinct categories that have no inherent order

2
Q

Survey responses (e.g. strongly agree, agree, neutral…) are examples of what sort of data?

A

Ordinal scales - categorical data whose categories follow a specific rank order
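
A minimal sketch of how the two scales differ in practice: nominal values can only be counted or compared for equality, while ordinal values can also be ranked. The sample values and the `LIKERT_ORDER` mapping are assumptions made for illustration, not part of the flashcards.

```python
from collections import Counter

# Nominal: categories have no order, so counting is the natural summary.
eye_colors = ["brown", "blue", "brown", "green", "blue", "brown"]
eye_color_counts = Counter(eye_colors)

# Ordinal: categories have a rank order, encoded here as integers.
# This mapping is an assumption made for the example.
LIKERT_ORDER = {
    "strongly disagree": 1,
    "disagree": 2,
    "neutral": 3,
    "agree": 4,
    "strongly agree": 5,
}

responses = ["agree", "neutral", "strongly agree", "disagree", "agree"]

# Sorting by rank is meaningful for ordinal data.
ranked = sorted(responses, key=LIKERT_ORDER.__getitem__)

# A median category is well defined because the categories can be ordered.
median_response = ranked[len(ranked) // 2]

print(eye_color_counts.most_common(1))
print(ranked)
print(median_response)
```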

3
Q

What are the three patterns of missing data?

A

MCAR - Missing completely at random
MAR - Missing at random
MNAR - Missing not at random

4
Q

Data missing due to test sensors losing connectivity is an example of which missing data pattern?

A

MCAR - Missing completely at random

5
Q

If women are more likely than men to answer a survey about skin-care routines, the resulting nonresponse is an example of which missing data pattern?

A

MAR - Missing at random

6
Q

There is a pattern to the missing data, but the pattern does not depend on the primary dependent variable. Which missing data pattern does this describe?

A

MAR - Missing at random

7
Q

Data missing due to patients refusing to disclose info on sensitive topics is an example of which missing data pattern?

A

MNAR - Missing not at random
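
The three patterns can be simulated side by side. The sketch below is illustrative only: the survey records, the missingness rates, and the income threshold are all made-up assumptions. It shows that MAR missingness varies with an observed variable (gender here), while MNAR missingness depends on the hidden value itself and so biases the observed mean.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical survey records: (gender, income). All values are made up.
records = [(random.choice("FM"), random.gauss(50_000, 10_000)) for _ in range(1000)]

def mcar(rows, rate=0.3):
    """MCAR: every value has the same chance of going missing."""
    return [(g, None if random.random() < rate else x) for g, x in rows]

def mar(rows, rate_m=0.5, rate_f=0.1):
    """MAR: missingness depends on an observed variable (gender), not on income."""
    return [
        (g, None if random.random() < (rate_m if g == "M" else rate_f) else x)
        for g, x in rows
    ]

def mnar(rows, threshold=60_000):
    """MNAR: missingness depends on the unobserved value (high incomes withheld)."""
    return [(g, None if x > threshold else x) for g, x in rows]

def missing_rate(rows, gender=None):
    subset = [x for g, x in rows if gender is None or g == gender]
    return sum(x is None for x in subset) / len(subset)

def observed_mean(rows):
    seen = [x for _, x in rows if x is not None]
    return sum(seen) / len(seen)

true_mean = sum(x for _, x in records) / len(records)
mar_rows, mnar_rows = mar(records), mnar(records)

# Under MAR the missing rate differs by gender.
print(round(missing_rate(mar_rows, "M"), 2), round(missing_rate(mar_rows, "F"), 2))
# Under MNAR the observed mean is biased downward.
print(observed_mean(mnar_rows) < true_mean)
```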

8
Q

A __________ is a supposition or observation regarding the results of sampling data.

A

Hypothesis

9
Q

Rejecting the null hypothesis when it is actually true is which type of error?

A

Type I error

10
Q

Failing to reject the null hypothesis when it is actually false is which type of error?

A

Type II error

11
Q

This statistic is used for normally distributed data and a known population standard deviation.

A

z-statistic

12
Q

This statistic is used for normally distributed data and an unknown population standard deviation.

A

t-statistic
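
The two statistics differ only in where the standard deviation comes from. A minimal sketch, assuming a one-sample test of a mean; the sample values, the hypothesized mean of 50, and the "known" population standard deviation of 4 are made up for illustration.

```python
import math
import statistics

def z_statistic(sample, pop_mean, pop_sd):
    """z = (sample mean - hypothesized mean) / (sigma / sqrt(n)), sigma known."""
    n = len(sample)
    return (statistics.fmean(sample) - pop_mean) / (pop_sd / math.sqrt(n))

def t_statistic(sample, pop_mean):
    """t = (sample mean - hypothesized mean) / (s / sqrt(n)), s estimated from the sample."""
    n = len(sample)
    s = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
    return (statistics.fmean(sample) - pop_mean) / (s / math.sqrt(n))

sample = [52, 48, 55, 51, 49, 53, 50, 54]  # made-up measurements

z = z_statistic(sample, pop_mean=50, pop_sd=4)
t = t_statistic(sample, pop_mean=50)  # compare against a t distribution with n - 1 df

print(z)
print(t)
```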

13
Q

T/F: P-values measure the probability that the null hypothesis is true.

A

False
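
A p-value is the probability of seeing data at least as extreme as the observed data, computed under the assumption that the null hypothesis is true. It is not the probability that the null hypothesis itself is true. The sketch below assumes a two-sided z-test; the function name `two_sided_p_value` is hypothetical.

```python
from statistics import NormalDist

def two_sided_p_value(z):
    """P(|Z| >= |z|) under the null hypothesis, where Z is standard normal."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

print(two_sided_p_value(1.96))  # close to 0.05
print(two_sided_p_value(0.5))   # large p-value: the data are unremarkable under the null
```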

14
Q

Which statistical summary value provides the measure of location, or central tendency?

A

Arithmetic mean

15
Q

The _______ is the middle number in a set of observations that are in order.

A

Median

16
Q

The _______ is the number in a set of observations that occurs most often.

A

Mode
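
The mean, median, and mode can be compared with Python's `statistics` module. The salary figures below are made up; the point of the sketch is that adding one extreme value moves the mean a great deal while the median and mode barely change.

```python
import statistics

salaries = [40, 42, 45, 45, 48, 50, 52]  # made-up values, in thousands
with_outlier = salaries + [500]

for label, data in [("original", salaries), ("with outlier", with_outlier)]:
    print(
        label,
        "mean:", statistics.mean(data),
        "median:", statistics.median(data),
        "mode:", statistics.mode(data),
    )
```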

17
Q

The _______ measures the spread of the distribution of a set of observations.

A

Standard deviation

18
Q

Most often in predictive modeling, _______ is considered in the context of normal distributions and provides a measure of statistical dispersion.

A

Standard deviation

19
Q

Which statistical summary measure describes the balance (symmetry) of a distribution?

A

Skewness

20
Q

_______ measures the shape (how fat or thin compared to a normal distribution) of the distribution.

A

Kurtosis

21
Q

_______ is a measure of the numerical relationship of one variable to another.

A

Correlation
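
One common way to quantify this relationship is the Pearson correlation coefficient, which ranges from -1 to 1. The flashcard does not name a specific coefficient, so Pearson's r is an assumption here, and the data are made up.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

hours = [1, 2, 3, 4, 5]
scores = [2, 4, 5, 4, 5]

print(pearson_r(hours, scores))             # positive, but not perfect
print(pearson_r(hours, [10, 8, 6, 4, 2]))   # perfectly negative linear relationship
```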

22
Q

What does PCA stand for?

A

Principal component analysis

23
Q

From chapter 3, what are the three key measures of central tendency?

A

Mean, median, mode

24
Q

From chapter 3, what are the five key measures of variability?

A

Standard deviation, variance, range, kurtosis, and skewness
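
The five measures can be computed on a small data set as follows. This is a sketch under stated assumptions: it uses population (divide-by-n) formulas and reports excess kurtosis (kurtosis minus 3), which may differ from the conventions in the chapter. The data and the helper names are made up.

```python
import statistics

def central_moment(data, k):
    """Average of the k-th power of deviations from the mean."""
    mean = statistics.fmean(data)
    return sum((x - mean) ** k for x in data) / len(data)

def skewness(data):
    """Third standardized moment; 0 for a symmetric distribution."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data):
    """Fourth standardized moment minus 3; 0 for a normal distribution."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3

data = [2, 4, 4, 4, 5, 5, 7, 9]

print("range:", max(data) - min(data))
print("variance:", statistics.pvariance(data))
print("standard deviation:", statistics.pstdev(data))
print("skewness:", skewness(data))
print("excess kurtosis:", excess_kurtosis(data))
```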

25
Q

As a measure of central tendency the _______ is not influenced by outliers.

A

Median

26
Q

What is the average of the squared deviations of each observation from the mean?

A

Variance

27
Q

What is the square root of the variance?

A

Standard deviation

28
Q

The _______ is the most widely used transformation method to deal with skewed data.

A

Logarithmic function

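
A small sketch of the effect of a log transformation on right-skewed data. The values are made up and chosen so that they are evenly spaced on a log scale; the `skewness` helper uses the population formula, which is an assumption.

```python
import math

def skewness(data):
    """Third standardized moment (population formula)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    return m3 / m2 ** 1.5

incomes = [1, 10, 100, 1_000, 10_000]    # made-up, strongly right-skewed
logged = [math.log(x) for x in incomes]  # natural log; requires positive values

print(skewness(incomes))  # clearly positive
print(skewness(logged))   # approximately zero
```
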
29
Q

By _______ data there is a loss of information and power and, if not necessary, it should be avoided.

A

Binning

30
Q

An r-squared value of 0.25 indicates what?

A

That 25% of the variation in the target variable (y) is explained by the input variable (x)

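
To make the interpretation concrete, the sketch below fits a least-squares line and computes r-squared as one minus the ratio of unexplained to total variation. The data are made up and give an r-squared of 0.6, meaning 60% of the variation in y is explained by x.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single input: returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x

def r_squared(xs, ys):
    """1 - (residual sum of squares / total sum of squares)."""
    slope, intercept = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

print(fit_line(x, y))
print(r_squared(x, y))
```
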
31
Q

_______ identifies the correlations and covariances between the input variables and creates groups or clusters of similar variables.

A

Variable clustering

32
Q

When using variable clustering, the clusters are determined by calculating the _______ distance.

A

Euclidean

33
Q

When using variable clustering, the _______ is the average of the points in the cluster; a representative point that lies at the center.

A

Centroid

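
Both ideas reduce to short formulas: Euclidean distance is the square root of the summed squared coordinate differences, and the centroid is the coordinate-wise average of the points. A minimal sketch with made-up points; the function names are hypothetical.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def centroid(points):
    """Coordinate-wise average of the points in a cluster."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

cluster = [(0, 0), (2, 0), (2, 2), (0, 2)]
center = centroid(cluster)

print(center)                              # (1.0, 1.0)
print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```
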
34
Q

The _______ for a given cluster measures the variance in all the variables.

A

Eigenvalue

35
Q

_______ is a variable reduction strategy that is used when there are several redundant variables or variables that are correlated with one another and may be measuring the same construct.

A

Principal component analysis (PCA)

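
For two variables, PCA can be worked by hand: the eigenvalues of the 2x2 covariance matrix give the variance captured by each principal component. The sketch below is a simplified illustration, not a general PCA implementation. The two input variables are made up and perfectly correlated (fully redundant), so the first component carries all of the variance.

```python
import math

def covariance(xs, ys):
    """Sample covariance (n - 1 denominator)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def pca_2d_eigenvalues(xs, ys):
    """Eigenvalues of the covariance matrix [[var_x, cov], [cov, var_y]], largest first."""
    var_x, var_y, cov_xy = covariance(xs, xs), covariance(ys, ys), covariance(xs, ys)
    mid = (var_x + var_y) / 2
    half_gap = math.sqrt(((var_x - var_y) / 2) ** 2 + cov_xy ** 2)
    return mid + half_gap, mid - half_gap

a = [1, 2, 3, 4, 5]
b = [2, 4, 6, 8, 10]  # exactly 2 * a, so b adds no new information

first, second = pca_2d_eigenvalues(a, b)
explained_by_first = first / (first + second)

print(first, second)
print(explained_by_first)
```
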
36
Q

The _______ defines the condition that the researchers need to discredit before suggesting an effect exists.

A

Null hypothesis

37
Q

Regression analysis is _______, meaning it assumes the data come from a population that follows a probability distribution based on a fixed set of parameters.

A

Parametric

38
Q

What are the five common assumptions that must be validated for a model to generate good results?

A

1. Linearity
2. Independence: no multicollinearity
3. Constant variance: homoskedasticity
4. No autocorrelation
5. Normality

39
Q

Which assumption states there should be a linear relationship between the target and input variables?

A

Linearity

40
Q

Which assumption states the input variables should not be correlated?

A

Independence: no multicollinearity

41
Q

_______ assumes that the variance of a variable is constant across all values of another variable.

A

Homoskedasticity (constant variance)

42
Q

_______ occurs when one data point in a variable is dependent on another data point within the same variable.

A

Autocorrelation

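
One simple way to check for this dependence is the lag-1 autocorrelation, which correlates each value with the value that follows it. This is only one of several possible diagnostics and is an assumption here; the two series are made up.

```python
def lag1_autocorrelation(series):
    """Correlation between each observation and the next one in the series."""
    n = len(series)
    mean = sum(series) / n
    num = sum((series[t] - mean) * (series[t + 1] - mean) for t in range(n - 1))
    den = sum((x - mean) ** 2 for x in series)
    return num / den

trend = list(range(1, 11))          # each value depends strongly on the previous one
alternating = [1, -1, 1, -1, 1, -1]

print(lag1_autocorrelation(trend))        # positive, about 0.7
print(lag1_autocorrelation(alternating))  # negative
```
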
43
Q

Which assumption states that the error terms are normally distributed, with a mean of zero, for any given value of the input variables?

A

Normality

44
Q

According to chapter 4, what are the three common metrics to evaluate the strength of a regression line?

A

1. R-squared
2. Adjusted r-squared
3. P-value

45
Q

Which regression evaluation metric provides the percentage variation in the target variable explained by the input variables?

A

R-squared (coefficient of determination)

46
Q

Which regression evaluation metric integrates the model's degrees of freedom as a means of adjustment?

A

Adjusted r-squared

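
The usual formula for the adjustment is 1 - (1 - R^2)(n - 1)/(n - p - 1), where n is the number of observations and p the number of input variables. The sketch below applies it to made-up values to show that adding inputs without improving r-squared lowers the adjusted value.

```python
def adjusted_r_squared(r_squared, n_obs, n_inputs):
    """1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_inputs - 1)

one_input = adjusted_r_squared(0.25, n_obs=100, n_inputs=1)
ten_inputs = adjusted_r_squared(0.25, n_obs=100, n_inputs=10)

print(one_input)
print(ten_inputs)  # same r-squared, more inputs: lower adjusted value
```
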
47
Q

T/F: When using multiple linear regression, the adjusted r-squared should always be used over the r-squared value.

A

True

48
Q

Generally, a p-value less than or equal to _______ indicates strong evidence against the null hypothesis.

A

0.05

49