Basic Definitions Flashcards

Question 1

Q

Target Variable

Answer

A

-Variable we wish to study in PA (denoted y; a prediction of it is denoted ŷ).
-Hope to predict the target of a future observation and to see whether it can be understood better using other variables.
-Response variable; Output variable

Question 2

Q

Predictor

Answer

A

-Any variable used to investigate and reveal patterns of the target (denoted x_j)
-We aim to discover and exploit the relationship that potentially exists between the target and a predictor.
-Explanatory variable; Input variable.

Question 3

Q

Variables: Continuous; Count; Factor

Answer

A

Continuous - Numeric variable that takes on values from an interval.
Count - Numeric variable that takes on non-negative integer values.
Factor - Categorical variable; a level refers to a category of a factor. A factor with 2 levels is binary (often coded as 0s and 1s).

Other:
-Variables that measure time may be numeric or categorical depending on perspective and/or preference.
-A variable with integer values may be viewed as either a count or a factor; the integers may represent levels of a factor rather than numeric values.

Question 4

Q

Dimensionality vs Granularity

Answer

A

Dimensionality - For a factor, the number of levels

Granularity - The degree of precision in recording the variable

-It is always possible to compare the dimensionality of two factors, but granularity can only be compared when the factors describe the same subject.
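
A small sketch of the two ideas, assuming two made-up location columns (`state` and `city` are hypothetical, and Python is used only for illustration). Dimensionality is a count of distinct levels; the granularity check only makes sense because both columns describe the same subject.

```python
# Hypothetical columns describing the same subject (location) at two levels of precision.
state = ["CA", "CA", "NY", "NY", "NY"]
city = ["Los Angeles", "San Diego", "Albany", "Buffalo", "Albany"]

def dimensionality(column):
    """Number of distinct levels of a factor."""
    return len(set(column))

print(dimensionality(state), dimensionality(city))

# City is more granular than state if every city belongs to exactly one state.
city_to_states = {}
for c, s in zip(city, state):
    city_to_states.setdefault(c, set()).add(s)
city_is_finer = all(len(states) == 1 for states in city_to_states.values())
print(city_is_finer)
```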

Question 5

Q

Structured vs Unstructured

Answer

A

Structured
-Data that fits naturally in tabular form
-Easier to access but rigidly defined

Unstructured
-Data that does not fit naturally in tabular form
-Harder to access but flexible in form

Semi-structured
-Data with elements of both

Question 6

Q

Data Adequacy

Answer

A

Expectations of adequate data
-Historical data should reflect future behavior
-The sample should be representative of the population

When and how the data was collected are essential.
-Note relevance of old data, presence of impactful and rare events during data collection, and sampling bias

Question 7

Q

Features

Answer

A

Predictor derived from the dataset, meaning some thought has gone into it to be a suitable predictor.

A common way to improve a dataset is to create features, often through transformations
–A variable without transformation can be a predictor
–Not allowing a variable to be a feature can also be seen as part of the feature creation/selection process.

Question 8

Q

Objective function

Answer

A

The expression to be optimized for the model to produce predictions that satisfy a goal or objective

MLR: Use Ordinary Least Squares (OLS) to find b’s that minimize the Error Sum of Squares (SSE)

GLM: Use Maximum Likelihood Estimation (MLE) to find b’s that maximize the log-likelihood
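
As a rough illustration of the MLR case, the sketch below fits a one-predictor linear regression with the closed-form OLS estimates and checks that nudging either coefficient increases the SSE. The data and the names `sse`, `b0`, and `b1` are made up for this example; Python is used only for illustration.

```python
# Hypothetical toy data (not from the flashcards).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

def sse(b0, b1):
    """Error sum of squares for the line y = b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Closed-form OLS estimates for simple linear regression.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, SSE = {sse(b0, b1):.4f}")

# Moving either coefficient away from the OLS estimate gives a larger SSE.
print(sse(b0 + 0.1, b1) > sse(b0, b1), sse(b0, b1 - 0.1) > sse(b0, b1))
```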

Question 9

Q

Interactions

Answer

A

An interaction exists when the effect of one predictor on the target changes based on a second predictor’s value, producing a joint influence on the target.

Interaction terms come from multiplying different predictors (*)

Interactions are not easy to describe in simple language relative to linear relationships.

If two predictors are related in some way, this may explain model output that seems confusing, as is common when there is collinearity.
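
A minimal sketch of an interaction term, assuming a linear predictor with made-up coefficients (`predict` and `effect_of_x1` are hypothetical names; Python is used only for illustration). The effect of a one-unit increase in x1 is not a single number: it depends on the value of x2.

```python
def predict(x1, x2, b0=1.0, b1=2.0, b2=0.5, b3=1.5):
    """Linear predictor with an interaction term x1 * x2 (coefficients are illustrative)."""
    return b0 + b1 * x1 + b2 * x2 + b3 * (x1 * x2)

def effect_of_x1(x2):
    """Change in the prediction when x1 rises from 0 to 1, holding x2 fixed."""
    return predict(1.0, x2) - predict(0.0, x2)

for x2 in (0.0, 1.0, 2.0):
    print(f"x2 = {x2}: effect of a one-unit increase in x1 = {effect_of_x1(x2)}")
```

Without the b3 term, the three printed effects would be identical.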

Question 10

Q

Compound variable

Answer

A

Factor resulting from merging two factors; each level from one factor is paired with each level from the other factor to produce all possible pairings.

Reason to create - to combine two overlapping, yet distinct factors

As a predictor, a compound variable is equivalent to modeling an interaction. However, it is more common to model an interaction in other ways, such as with an interaction term, than with a compound variable.
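
A sketch of building a compound variable, assuming two made-up factors (`smoker_levels` and `region_levels` are hypothetical; Python is used only for illustration). The level list contains every pairing, and each observation gets the pairing that matches its two original values.

```python
from itertools import product

# Hypothetical factors and levels, chosen only for illustration.
smoker_levels = ["no", "yes"]
region_levels = ["north", "south", "west"]

# Every level of one factor paired with every level of the other.
compound_levels = [f"{s}_{r}" for s, r in product(smoker_levels, region_levels)]
print(compound_levels)

# Compound value for each observation, built from its two original factor values.
observations = [("yes", "north"), ("no", "west"), ("yes", "north")]
compound_column = [f"{s}_{r}" for s, r in observations]
print(compound_column)
```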

Question 11

Q

Transformations used for

Answer

A

Validity
-Ensure dataset is suitable for analysis
-Potential issues could be either context-specific or relevant regardless of context
-If a predictor is only known/available after knowing the target, the model will suffer from target leakage.

Format
-Purposes: exploration and modeling
–Format needed from one may not be appropriate for the other.

Feature creation
Numeric vs Factor

Question 12

Q

Missing Data - Forms of

Answer

A

Missing: expect to find data, but it isn’t there

NA in R
Factor level labeled “unknown” or similar
Dummy value
Blank character string in R

A dataset’s data dictionary and its summary output can help decipher the presence of missing data

Fixing:
If % missing is low - remove rows; should have low impact on analysis
If % missing is high - remove columns; limited information available about the variable
Impute - used if data appears to be missing at random
Replace values in a categorical column with a level indicating the value was initially missing. Can work in almost all situations
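
The fixes above can be sketched as follows. This is an illustration in Python rather than R: `None` stands in for R's NA, and the dummy value `-1` and the labels `"unknown"` and `"missing"` are assumptions made for the example.

```python
# Hypothetical records; None stands in for R's NA, and -1 is an assumed dummy value for age.
rows = [
    {"age": 34, "region": "north"},
    {"age": None, "region": "south"},
    {"age": -1, "region": ""},
    {"age": 51, "region": "unknown"},
]

def is_missing(column, value):
    """Flag the forms of missing data listed above (the specific codes are assumptions)."""
    if value is None:
        return True
    if column == "age":
        return value == -1
    return value.strip().lower() in ("", "unknown")

# Share of missing values per column, to guide the remove-rows vs remove-columns choice.
for col in ("age", "region"):
    share = sum(is_missing(col, r[col]) for r in rows) / len(rows)
    print(col, share)

# Mean imputation for the numeric column (reasonable only if values are missing at random).
observed = [r["age"] for r in rows if not is_missing("age", r["age"])]
mean_age = sum(observed) / len(observed)
imputed_age = [mean_age if is_missing("age", r["age"]) else r["age"] for r in rows]

# A dedicated level marking values that were initially missing in the categorical column.
region_clean = ["missing" if is_missing("region", r["region"]) else r["region"] for r in rows]
print(imputed_age, region_clean)
```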

Question 13

Q

Predictor predictive

Answer

A

A predictor is predictive when a change in its value corresponds to a change in the target. If a bivariate analysis indicates a variable may be predictive, we would have confidence including it in the model.

Ex:
A factor is predictive when its different levels relate to different means or medians of the target
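
A quick bivariate check along these lines, assuming made-up (factor level, target) pairs; Python is used only for illustration. Levels whose means or medians differ noticeably suggest the factor may be predictive.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical (factor level, target) pairs.
data = [("A", 10.0), ("A", 12.0), ("B", 20.0), ("B", 22.0), ("B", 24.0), ("C", 11.0)]

# Group the target values by factor level.
by_level = defaultdict(list)
for level, target in data:
    by_level[level].append(target)

# Mean and median of the target within each level.
summary = {level: (mean(vals), median(vals)) for level, vals in by_level.items()}
for level, (m, med) in summary.items():
    print(level, m, med)
```

Here level B stands apart from A and C, while A and C look alike and might be candidates for combining.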

Question 14

Q

Overlapping Information

Answer

A

If predictors have overlapping information, they can be less effective when used together.

Question 15

Q

Exponentiating

Answer

A

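Exponentiating a real number always results in a positive number.

Log-transforming a target ensures predictions are positive: the model predicts on the log scale, and exponentiating converts those predictions back to the original scale.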
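
A tiny check of this property, assuming some made-up predictions on the log scale (Python is used only for illustration):

```python
import math

# Hypothetical predictions on the log scale; they can be any real number.
log_predictions = [-3.2, 0.0, 1.7]

# Exponentiating maps them back to the original scale, where they are always positive.
predictions = [math.exp(p) for p in log_predictions]
print(predictions)
print(all(p > 0 for p in predictions))
```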

Question 16

Q

Collinearity

Answer

A

Collinearity occurs when a predictor is close to being a linear combination of the other predictors. It will be difficult for the model to distinguish which ones are truly meaningful. This can lead to unstable coefficient estimates with inflated standard errors, which may result in high p-values for t-tests.

Perfect collinearity (aka singularities or rank-deficient fit) occurs when a predictor is an exact linear combination of the other predictors. OLS will fail to find a unique solution for the coefficient estimates.

If Perfect Collinearity: A fitting procedure may attempt to self-correct by excluding predictors, allowing all other coefficients to be estimated
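
A sketch of why perfect collinearity breaks OLS, assuming a model with an intercept and two made-up predictors (Python is used only for illustration). In that setting the slope estimates are unique only if the determinant of the centered cross-product matrix is nonzero; the helper name `cross_product_det` is hypothetical.

```python
def centered(v):
    """Subtract the mean from every value."""
    m = sum(v) / len(v)
    return [vi - m for vi in v]

def cross_product_det(a, b):
    """Determinant of the 2x2 centered cross-product matrix of two predictors."""
    a_c, b_c = centered(a), centered(b)
    s_aa = sum(ai * ai for ai in a_c)
    s_bb = sum(bi * bi for bi in b_c)
    s_ab = sum(ai * bi for ai, bi in zip(a_c, b_c))
    return s_aa * s_bb - s_ab ** 2

x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2_perfect = [2.0 * v for v in x1]      # exact linear combination of x1
x2_near = [2.1, 3.9, 6.0, 8.2, 9.8]     # close to, but not exactly, 2 * x1

# Zero determinant: no unique solution. Small determinant: unstable estimates.
print(cross_product_det(x1, x2_perfect))
print(cross_product_det(x1, x2_near))
```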

Question 17

Q

Higher-order terms

Answer

A

Variables raised to an integer power that exceeds 1. As predictors, include a term for each power of the original variable leading up to the highest order desired.

Polynomial relationships are not easy to describe in simple language relative to linear relationships; it is harder to interpret the relevant coefficient estimates. Performing a sensitivity analysis could be a helpful alternative.
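
A minimal sketch of higher-order terms and a simple sensitivity check, assuming a quadratic model with made-up coefficients (`polynomial_terms` and `predict` are hypothetical names; Python is used only for illustration). Rather than interpreting each coefficient, the loop shows how the prediction moves as x changes.

```python
def polynomial_terms(x, degree):
    """Return [x, x**2, ..., x**degree]: one term per power up to the highest order."""
    return [x ** k for k in range(1, degree + 1)]

def predict(x, intercept, coefs):
    """Prediction from a polynomial model (the coefficients used below are illustrative)."""
    terms = polynomial_terms(x, len(coefs))
    return intercept + sum(b * t for b, t in zip(coefs, terms))

# Sensitivity analysis: vary x and watch the prediction, holding everything else fixed.
coefs = [2.0, -0.5]
for x in (1.0, 2.0, 3.0):
    print(x, predict(x, 1.0, coefs))
```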

Question 18

Q

Missing Data - NA vs NULL

Answer

A

NA - indicates that a specific data point is unknown
NULL - indicates that an object does not exist

Distinction lies between a value in an object vs an object itself

(18 cards)