Basic Definitions Flashcards

1
Q

Target Variable

A

-Variable we wish to study in PA (denoted y; its prediction is denoted ŷ).
-Hope to predict the target of a future observation and to see whether it can be understood better using other variables.
-Response variable; Output variable

2
Q

Predictor

A

-Any variable used to investigate and reveal patterns of the target (denoted x_j)
-We aim to discover and exploit the relationship that potentially exists between the target and a predictor.
-Explanatory variable; Input variable.

3
Q

Variables: Continuous; Count; Factor

A

Continuous - Numeric variable that takes on values from an interval.
Count - Numeric variable that takes on non-negative integers
Factor - Categorical variable; a level is one category of the factor. A factor with 2 levels is binary (often coded as 0s and 1s).

Other:
-Variables that measure time may be numeric or categorical depending on perspective and/or preference.
-An integer-valued variable may be treated as either a count or a factor; the integers may represent levels of a factor rather than numeric values.

4
Q

Dimensionality vs Granularity

A

Dimensionality - For a factor, the number of levels

Granularity - The degree of precision in recording the variable

-Dimensionality can always be compared between two factors, but granularity can be compared only when the factors describe the same subject.

5
Q

Structured vs Unstructured

A

Structured
-Data that fits a tabular form
-Easier to access but rigidly defined

Unstructured
-Data that does not fit a tabular form
-Harder to access but flexible in form

Semi-structured
-Data with elements of both

6
Q

Data Adequacy

A

Expectations of adequate data
-Historical data should reflect future behavior
-The sample should be representative of the population

When and how the data was collected are essential.
-Note relevance of old data, presence of impactful and rare events during data collection, and sampling bias

7
Q

Features

A

A predictor derived from the dataset, meaning some thought has gone into making it a suitable predictor.

A common way to improve a dataset is to create features, often through transformations
–A variable without transformation can be a predictor
–Not allowing a variable to be a feature can also be seen as part of the feature creation/selection process.

8
Q

Objective function

A

The expression to be optimized for the model to produce predictions that satisfy a goal or objective

MLR: Use Ordinary Least Squares (OLS) to find the b’s that minimize the Error Sum of Squares (SSE)

GLM: Use Maximum Likelihood Estimation (MLE) to find the b’s that maximize the log-likelihood
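
To make the MLR case concrete, here is a minimal Python sketch of OLS with a single predictor. The data values and the function names (fit_simple_ols, sse) are made up for illustration; the closed-form estimates shown apply only to the one-predictor case, and the GLM/MLE case is not covered here.

```python
def fit_simple_ols(x, y):
    """Closed-form OLS estimates for y = b0 + b1 * x (one predictor)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def sse(b0, b1, x, y):
    """Objective function: the sum of squared errors for given coefficients."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_simple_ols(x, y)
best = sse(b0, b1, x, y)
```

Moving either coefficient away from its OLS estimate can only increase the SSE, which is what it means for the b’s to optimize the objective function.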

9
Q

Interactions

A

The effect of one predictor on the target changes depending on a second predictor’s value, producing a joint influence on the target.

Interaction terms come from multiplying different predictors (*)

Interactions are not easy to describe in simple language relative to linear relationships.

If two predictors are related in some way, this may explain model output that is otherwise confusing, which is common when there is collinearity.
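
A small Python sketch of why an interaction term makes one predictor's effect depend on another. The coefficients and the names predict and effect_of_x1 are hypothetical, chosen only to show the arithmetic.

```python
def predict(x1, x2, b0=1.0, b1=2.0, b2=0.5, b3=1.5):
    """Linear predictor with an interaction term b3 * (x1 * x2).

    All coefficient values are made up for illustration.
    """
    return b0 + b1 * x1 + b2 * x2 + b3 * (x1 * x2)


def effect_of_x1(x2):
    """Change in the prediction when x1 goes from 0 to 1, holding x2 fixed."""
    return predict(1.0, x2) - predict(0.0, x2)


effect_when_x2_is_0 = effect_of_x1(0.0)  # equals b1
effect_when_x2_is_1 = effect_of_x1(1.0)  # equals b1 + b3
```

Without the product term (b3 = 0), the two effects would be identical; with it, the effect of x1 shifts by b3 for each unit of x2.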

10
Q

Compound variable

A

Factor resulting from merging two factors; each level from one factor is paired with each level from the other factor to produce all possible pairings.

Reason to create - to combine two overlapping, yet distinct factors

As a predictor, a compound variable is equivalent to modeling an interaction. However, it is more common to model the interaction in another way than through a compound variable.
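
The pairing of levels can be sketched in Python as follows. The two factors (smoker, region), their levels, and the helper name compound_levels are hypothetical examples.

```python
from itertools import product


def compound_levels(levels_a, levels_b, sep="_"):
    """Pair every level of one factor with every level of the other."""
    return [f"{a}{sep}{b}" for a, b in product(levels_a, levels_b)]


# Hypothetical factors, for illustration only.
smoker = ["no", "yes"]
region = ["north", "south", "west"]

levels = compound_levels(smoker, region)
```

A factor with 2 levels merged with a factor with 3 levels gives a compound variable with 2 x 3 = 6 levels, so dimensionality grows quickly.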

11
Q

Transformations used for

A

Validity
-Ensure dataset is suitable for analysis
-Potential issues could be either context-specific or relevant regardless of context
-If a predictor is only known/available after knowing the target, the model will suffer from target leakage.

Format
-Purposes: exploration and modeling
–Format needed from one may not be appropriate for the other.

Feature creation
Numeric vs Factor

12
Q

Missing Data - Forms of

A

Missing: expect to find data, but it isn’t there

NA in R
Factor level labeled “unknown” or similar
Dummy value
blank character in R

A dataset’s data dictionary and its summary output can help decipher the presence of missing data

Fixing:
If % missing is low - remove rows; should have low impact on analysis
If % missing is high - remove columns; limited information available about the variable
Impute - used if data appears to be missing at random
Replace values in a categorical column with a level indicating the value was initially missing. Can work in almost all situations
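
The fixes above can be sketched in plain Python. This is an illustration under assumptions: None stands in for R's NA, the rows and column names are made up, and mean imputation is just one simple choice of imputation method.

```python
# Hypothetical records; None plays the role of a missing value.
rows = [
    {"age": 34, "region": "north"},
    {"age": None, "region": "south"},
    {"age": 51, "region": None},
    {"age": 45, "region": "north"},
]


def drop_incomplete_rows(rows):
    """Remove every row that has at least one missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def impute_mean(rows, col):
    """Fill missing values in a numeric column with the observed mean."""
    observed = [r[col] for r in rows if r[col] is not None]
    mean = sum(observed) / len(observed)
    return [{**r, col: mean if r[col] is None else r[col]} for r in rows]


def flag_missing_level(rows, col, label="missing"):
    """Replace missing values in a categorical column with their own level."""
    return [{**r, col: label if r[col] is None else r[col]} for r in rows]


complete = drop_incomplete_rows(rows)
imputed = impute_mean(rows, "age")
flagged = flag_missing_level(rows, "region")
```

Each function returns a new list and leaves the original rows unchanged, so the three approaches can be compared side by side.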

13
Q

When a Predictor Is Predictive

A

When a change in its value corresponds to a change in the target. If a bivariate analysis indicates a variable may be predictive, we would have confidence including it in the model.

Ex:
A factor is predictive when its different levels relate to different means or medians of the target
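
A minimal Python sketch of this bivariate check: compute the mean of the target within each level of a factor. The data and the function name mean_target_by_level are hypothetical.

```python
from collections import defaultdict


def mean_target_by_level(levels, target):
    """Group target values by factor level and return the mean per level."""
    groups = defaultdict(list)
    for level, y in zip(levels, target):
        groups[level].append(y)
    return {level: sum(ys) / len(ys) for level, ys in groups.items()}


# Hypothetical data, for illustration only.
levels = ["a", "a", "b", "b", "b"]
target = [10.0, 12.0, 20.0, 22.0, 24.0]

means = mean_target_by_level(levels, target)
```

Clearly different means across levels suggest the factor may be predictive; near-identical means suggest it may not be.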

14
Q

Overlapping Information

A

If predictors have overlapping information, they can be less effective when used together.

15
Q

Exponentiating

A

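Exponentiating any real number always results in a positive number.

Log-transforming a target ensures predictions are positive, because predictions made on the log scale are exponentiated to return to the original scale.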
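
A short Python illustration, using made-up log-scale predictions: whatever their sign, exponentiating them gives positive values on the original scale.

```python
import math

# Hypothetical predictions on the log scale (negative, zero, positive).
log_scale_predictions = [-3.0, 0.0, 2.5]

# Back-transform to the original scale of the target.
predictions = [math.exp(p) for p in log_scale_predictions]
```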

16
Q

Collinearity

A

When a predictor is close to being a linear combination of the other predictors. It will be difficult for the model to distinguish which ones are truly meaningful. This can lead to unstable coefficient estimates, which may result in high p-values for t-tests.

Perfect collinearity (aka singularities or rank-deficient fit) occurs when a predictor is an exact linear combination of the other predictors. OLS will fail to find a unique solution for the coefficient estimates.

If Perfect Collinearity: A fitting procedure may attempt to self-correct by excluding predictors, allowing all other coefficients to be estimated
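
One way to see why OLS fails under perfect collinearity: the matrix X'X that OLS must invert becomes singular (determinant 0). The Python sketch below uses three made-up predictors without an intercept, and the helper names gram_matrix and det3 are hypothetical.

```python
def gram_matrix(columns):
    """Compute X'X, where each entry is the dot product of two predictor columns."""
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in columns]
            for ci in columns]


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


# Hypothetical predictors, for illustration only.
x1 = [1, 2, 3, 4]
x2 = [2, 1, 0, 5]
x3 = [a + b for a, b in zip(x1, x2)]  # exact linear combination of x1 and x2
x3_changed = [3, 3, 3, 10]            # no longer an exact combination

det_collinear = det3(gram_matrix([x1, x2, x3]))
det_not_collinear = det3(gram_matrix([x1, x2, x3_changed]))
```

With x3 = x1 + x2 the determinant is exactly 0, so no unique set of coefficient estimates exists; changing a single value of x3 breaks the exact relationship and the determinant becomes nonzero.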

17
Q

Higher-order terms

A

Variables raised to an integer power that exceeds 1. As predictors, include a term for each power of the original variable leading up to the highest order desired.

Polynomial relationships are not easy to describe in simple language relative to linear relationships; it is harder to interpret the relevant coefficient estimates. Performing a sensitivity analysis could be a helpful alternative.
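
A minimal Python sketch of building the predictors for a higher-order fit: one term for each power from 1 up to the chosen degree. The function name polynomial_terms and the input values are hypothetical.

```python
def polynomial_terms(x, degree):
    """For each observation, return [x, x^2, ..., x^degree]."""
    return [[xi ** p for p in range(1, degree + 1)] for xi in x]


# Hypothetical values of the original variable.
x = [1.0, 2.0, 3.0]

# A cubic fit uses the first, second, and third powers as predictors.
terms = polynomial_terms(x, 3)
```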

18
Q

Missing Data - NA vs NULL

A

NA - indicates that a specific data point is unknown
NULL - indicates that an object does not exist

The distinction is between a missing value within an object (NA) and a missing object itself (NULL)