Predictive Analytics Flashcards

1
Q

Marital status and eye color are examples of what sort of data?

A

Nominal scale - categorical data divided into distinct categories with no inherent order

2
Q

Survey responses (e.g. strongly agree, agree, neutral…) are examples of what sort of data?

A

Ordinal scale - categorical data whose categories follow a specific rank order

3
Q

What are the three patterns of missing data?

A

MCAR - Missing completely at random
MAR - Missing at random
MNAR - Missing not at random

4
Q

Data missing due to test sensors losing connectivity is an example of which missing data pattern?

A

MCAR - Missing completely at random

5
Q

If more women than men answer a survey about skin-care routines, the missing responses are an example of which missing data pattern?

A

MAR - Missing at random

6
Q

The missing data follow a pattern, but the pattern does not involve the primary dependent variable. Which missing data pattern does this describe?

A

MAR - Missing at random

7
Q

Data missing due to patients refusing to disclose info on sensitive topics is an example of which missing data pattern?

A

MNAR - Missing not at random

8
Q

A __________ is a supposition or observation regarding the results of sampling data.

A

Hypothesis

9
Q

Rejecting the null hypothesis when it is actually true is which type of error?

A

Type I error

10
Q

Failing to reject the null hypothesis when it is actually false is which type of error?

A

Type II error

11
Q

This statistic is used for normally distributed data and a known population standard deviation.

A

z-statistic

12
Q

This statistic is used for normally distributed data and an unknown population standard deviation.

A

t-statistic
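
Worked example (a minimal sketch, not from the source deck): the two statistics differ only in which standard deviation goes in the denominator. The sample values, the hypothesized mean `mu0`, and the "known" `sigma` below are made-up assumptions for illustration.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical sample and parameters, invented for illustration only.
sample = [52, 48, 55, 51, 49, 53, 50, 54]
mu0 = 50      # hypothesized population mean
sigma = 2.5   # population standard deviation, assumed known for the z case

n = len(sample)
x_bar = mean(sample)

# z-statistic: uses the known population standard deviation.
z = (x_bar - mu0) / (sigma / sqrt(n))

# t-statistic: population SD unknown, so use the sample SD (n - 1 divisor).
s = stdev(sample)
t = (x_bar - mu0) / (s / sqrt(n))

print(f"z = {z:.3f}, t = {t:.3f}")
```

The t value is then compared against a t distribution with n - 1 degrees of freedom rather than the standard normal.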

13
Q

T/F: P-values measure the probability that the null hypothesis is true.

A

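False. A p-value is the probability of observing data at least as extreme as the sample, assuming the null hypothesis is true.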

14
Q

Which statistical summary value provides the measure of location, or central tendency?

A

Arithmetic mean

15
Q

The _______ is the middle number in a set of observations that are in order.

A

Median

16
Q

The _______ is the number in a set of observations that occurs most often.

A

Mode

17
Q

The _______ measures the spread of the distribution of a set of observations.

A

Standard deviation

18
Q

Most often in predictive modeling, _______ is considered in the context of normal distributions and provides a measure of statistical dispersion.

A

Standard deviation

19
Q

Which statistical summary measure describes the balance (symmetry) of a distribution?

A

Skewness

20
Q

_______ measures the shape (how fat or thin compared to a normal distribution) of the distribution.

A

Kurtosis
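
Worked example (a sketch under stated assumptions): skewness and kurtosis can both be computed from central moments. The helper names and data sets below are hypothetical, and the moment-based (population) formulas are one common convention; statistical packages may apply small-sample corrections that give slightly different numbers.

```python
def central_moment(xs, k):
    """k-th central moment: average of (x - mean)**k."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** k for x in xs) / len(xs)

def skewness(xs):
    """Third central moment scaled by variance**1.5; 0 for a symmetric set."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5

def excess_kurtosis(xs):
    """Fourth central moment scaled by variance**2, minus 3 (the normal value)."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2 - 3

# Made-up data for illustration.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]

print(skewness(symmetric), skewness(right_skewed))
print(excess_kurtosis(symmetric))
```

A positive skewness indicates a longer right tail; a negative excess kurtosis indicates thinner tails than a normal distribution.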

21
Q

_______ is a measure of the numerical relationship of one variable to another.

A

Correlation
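
Worked example (a minimal sketch): the most common correlation measure is Pearson's r, which the card does not name explicitly, so treating it as the intended measure is an assumption. The function name and data are hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
print(f"r = {r:.4f}")
```

Values of r range from -1 (perfect negative linear relationship) to +1 (perfect positive linear relationship).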

22
Q

What does PCA stand for?

A

Principal component analysis

23
Q

From chapter 3, what are the three key measures of central tendency?

A

Mean, median, mode

24
Q

From chapter 3, what are the five key measures of variability?

A

Standard deviation, variance, range, kurtosis, and skewness

25
Q

As a measure of central tendency the _______ is not influenced by outliers.

A

Median
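
Worked example (a minimal sketch with made-up numbers): replacing one value with an extreme outlier drags the mean far from the bulk of the data, while the median stays put.

```python
from statistics import mean, median

# Hypothetical observations; in the second list a recording error
# has turned the largest value (16) into 500.
clean = [10, 12, 13, 14, 16]
with_outlier = [10, 12, 13, 14, 500]

mean_clean, mean_outlier = mean(clean), mean(with_outlier)
median_clean, median_outlier = median(clean), median(with_outlier)

print(f"mean:   {mean_clean} -> {mean_outlier}")
print(f"median: {median_clean} -> {median_outlier}")
```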

26
Q

What is the average of the squared deviations of each observation from the mean?

A

Variance

27
Q

What is the square root of the variance?

A

Standard deviation
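
Worked example (a minimal sketch with made-up data): the previous two cards can be checked by hand. This uses the population form (divide by n), which matches "average of the squared deviations"; the sample form divides by n - 1 instead.

```python
from math import sqrt
from statistics import pstdev, pvariance

# Hypothetical observations.
data = [2, 4, 4, 4, 5, 5, 7, 9]

m = sum(data) / len(data)
# Variance: average of the squared deviations from the mean.
variance = sum((x - m) ** 2 for x in data) / len(data)
# Standard deviation: square root of the variance.
std_dev = sqrt(variance)

print(variance, std_dev)
# The standard library gives the same population values.
print(pvariance(data), pstdev(data))
```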

28
Q

The _______ is the most widely used transformation method to deal with skewed data.

A

Logarithmic function
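
Worked example (a sketch under stated assumptions): a log transform compresses large values, which pulls in a long right tail. The data and the `skewness` helper are hypothetical, and the transform only applies to strictly positive values.

```python
from math import log

def skewness(xs):
    """Moment-based skewness: third central moment over variance**1.5."""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

# Made-up right-skewed data (each value doubles the previous one).
raw = [1, 2, 4, 8, 16, 32, 64]
logged = [log(x) for x in raw]  # natural log; requires x > 0

skew_raw = skewness(raw)
skew_logged = skewness(logged)
print(f"skewness before: {skew_raw:.3f}, after: {skew_logged:.3f}")
```

Because this particular data set grows geometrically, the logged values are evenly spaced and the skewness drops to essentially zero; real data will rarely be that tidy.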

29
Q

By _______ data there is a loss of information and power and, if not necessary, it should be avoided.

A

Binning

30
Q

An r-squared value of 0.25 indicates what?

A

That 25% of the variation in the target variable (y) is explained by the input variable (x)

31
Q

_______ identifies the correlations and covariances between the input variables and creates groups or clusters of similar variables.

A

Variable clustering

32
Q

When using variable clustering, the clusters are determined by calculating the _______ distance.

A

Euclidean

33
Q

When using variable clustering, the _______ is the average of the points in the cluster; a representative point that lies at the center.

A

Centroid
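
Worked example (a minimal sketch, not the algorithm of any particular package): Euclidean distance and a centroid for a handful of made-up two-dimensional points. The `centroid` helper is hypothetical; `math.dist` requires Python 3.8 or later.

```python
from math import dist

def centroid(points):
    """Average of the points, coordinate by coordinate."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

# Hypothetical members of one cluster.
cluster = [(1.0, 2.0), (3.0, 4.0), (5.0, 0.0)]

center = centroid(cluster)
# Straight-line (Euclidean) distance from each point to the centroid.
distances = [dist(p, center) for p in cluster]

print(center)
print([round(d, 3) for d in distances])
```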

34
Q

The _______ for a given cluster measures the variance in all the variables.

A

Eigenvalue

35
Q

_______ is a variable reduction strategy that is used when there are several redundant variables or variables that are correlated with one another and may be measuring the same construct.

A

Principal component analysis (PCA)

36
Q

The _______ defines the condition that the researchers need to discredit before suggesting an effect exists.

A

Null hypothesis

37
Q

Regression analysis is _______, meaning it assumes the data come from a population that follows a probability distribution based on a fixed set of parameters.

A

Parametric

38
Q

What are the five common assumptions that must be validated for a regression model to generate good results?

A

  1. Linearity
  2. Independence: no multicollinearity
  3. Constant variance: homoskedasticity
  4. No autocorrelation
  5. Normality

39
Q

Which assumption states there should be a linear relationship between the target and input variables?

A

Linearity

40
Q

Which assumption states the input variables should not be correlated?

A

Independence: no multicollinearity

41
Q

_______ assumes that the variance of a variable is constant across all values of another variable.

A

Homoskedasticity (constant variance)

42
Q

_______ occurs when one data point in a variable is dependent on another data point within the same variable.

A

Autocorrelation

43
Q

Which assumption states that the error terms are normally distributed with a mean of zero for any given value of the input variables?

A

Normality

44
Q

According to chapter 4, what are the three common metrics to evaluate the strength of a regression line?

A

  1. R-squared
  2. Adjusted R-squared
  3. P-value

45
Q

Which regression evaluation metric provides the percentage variation in the target variable explained by the input variables?

A

R-squared (coefficient of determination)
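
Worked example (a minimal sketch with made-up data): R-squared is one minus the ratio of unexplained variation to total variation. The least-squares line is fitted by hand here for a single input variable; all names are hypothetical.

```python
# Hypothetical input (x) and target (y) observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

# Ordinary least-squares slope and intercept for one input variable.
slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
intercept = mean_y - slope * mean_x
predicted = [intercept + slope * xi for xi in x]

# Residual (unexplained) and total sums of squares.
ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))
ss_tot = sum((yi - mean_y) ** 2 for yi in y)

r_squared = 1 - ss_res / ss_tot
print(f"R-squared = {r_squared:.2f}")
```

Here about 60% of the variation in y is explained by x.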

46
Q

Which regression evaluation metric integrates the model’s degrees of freedom as a means of adjustment?

A

Adjusted r-squared
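
Worked example (a minimal sketch): the usual textbook formula is adjusted R-squared = 1 - (1 - R-squared)(n - 1) / (n - p - 1), where n is the number of observations and p the number of input variables. The function name and the numbers are hypothetical.

```python
def adjusted_r_squared(r_squared, n_obs, n_inputs):
    """Penalize R-squared for the degrees of freedom used by the inputs."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_inputs - 1)

# Same made-up R-squared and sample size, different numbers of inputs.
few_inputs = adjusted_r_squared(0.75, n_obs=21, n_inputs=4)
many_inputs = adjusted_r_squared(0.75, n_obs=21, n_inputs=10)

print(few_inputs, many_inputs)
```

With R-squared held fixed, adding input variables lowers the adjusted value, which is why it is preferred when comparing multiple regression models of different sizes.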

47
Q

T/F: When using multiple linear regression, the adjusted R-squared should always be used over the R-squared value.

A

True

48
Q

Generally, a p-value less than or equal to _______ indicates strong evidence against the null hypothesis.

A

0.05

49
Q
A