Basic Definitions Flashcards
Target Variable
-Variable we wish to study in predictive analytics (PA), denoted y; a prediction of it is denoted ŷ.
-Hope to predict the target of a future observation and to see whether it can be understood better using other variables.
-Response variable; Output variable
Predictor
-Any variable used to investigate and reveal patterns of the target (denoted x_j)
-We aim to discover and exploit the relationship that potentially exists between the target and a predictor.
-Explanatory variable; Input variable.
Variables: Continuous; Count; Factor
Continuous - Numeric variable that takes on values from an interval.
Count - Numeric variable that takes on non-negative integers
Factor - Categorical variable; a level is one category of a factor. A factor with 2 levels is binary (often coded as 0s and 1s).
Other:
-Variables that measure time may be numeric or categorical depending on perspective and/or preference.
-An integer-valued variable may be viewed as either a count or a factor; integers may represent levels of a factor rather than numeric values.
Dimensionality vs Granularity
Dimensionality - For a factor, the number of levels
Granularity - The degree of precision in recording the variable
-Dimensionality can always be compared between two factors; granularity can only be compared when the factors describe the same subject.
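A minimal sketch of the distinction, using made-up location data. Dimensionality is simply the number of distinct levels. For granularity, the check below assumes that "more granular" means every level of the finer factor falls inside exactly one level of the coarser factor; that nesting test and the function names are illustrative assumptions, not definitions from these cards.

```python
# Hypothetical data: the same locations recorded at two levels of precision.
state = ["CA", "CA", "NY", "NY", "TX"]
zip_code = ["90001", "94105", "10001", "11201", "73301"]

def dimensionality(factor):
    """Number of distinct levels of a factor."""
    return len(set(factor))

def is_at_least_as_granular(fine, coarse):
    """True if every level of `fine` maps to exactly one level of `coarse`."""
    mapping = {}
    for f, c in zip(fine, coarse):
        # setdefault stores the first coarse level seen for this fine level
        if mapping.setdefault(f, c) != c:
            return False
    return True

print(dimensionality(state))      # 3
print(dimensionality(zip_code))   # 5
print(is_at_least_as_granular(zip_code, state))  # True
print(is_at_least_as_granular(state, zip_code))  # False
```

Both factors here describe location, which is why their granularity can be compared at all; the dimensionality count works for any factor.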
Structured vs Unstructured
Structured
-Data that fits a tabular form (rows and columns)
-Easier to access, but rigidly defined
Unstructured
-Data that does not fit a tabular form
-Harder to access, but flexible in form
Semi-structured
-Data with elements of both
Data Adequacy
Expectations of adequate data
-Historical data should reflect future behavior
-The sample should be representative of the population
When and how the data was collected are essential.
-Note relevance of old data, presence of impactful and rare events during data collection, and sampling bias
Features
A predictor derived from the dataset, meaning some thought has gone into making it a suitable predictor.
A common way to improve a dataset is to create features, often through transformations
–A variable without transformation can be a predictor
–Not allowing a variable to be a feature can also be seen as part of the feature creation/selection process.
Objective function
The expression to be optimized for the model to produce predictions that satisfy a goal or objective
MLR: Use Ordinary Least Squares (OLS) to find the coefficients (b’s) that minimize the Error Sum of Squares (SSE)
GLM: Use Maximum Likelihood Estimation (MLE) to find the coefficients (b’s) that maximize the log-likelihood
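To make the MLR objective concrete, here is a minimal sketch for the one-predictor case, where the OLS coefficients have a closed form. The data and function names are made up for illustration. The final checks show that nudging either coefficient away from the OLS solution increases the SSE.

```python
# Made-up data for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

def sse(b0, b1, x, y):
    """Error Sum of Squares: the objective function that OLS minimizes."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

def ols_fit(x, y):
    """Closed-form OLS estimates for a single predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

b0, b1 = ols_fit(x, y)
best = sse(b0, b1, x, y)

# Any other choice of coefficients produces a larger SSE.
assert best < sse(b0 + 0.1, b1, x, y)
assert best < sse(b0, b1 - 0.1, x, y)
print(round(b0, 2), round(b1, 2))
```

A GLM follows the same pattern with a different objective: the log-likelihood replaces the SSE, and it is maximized rather than minimized, usually by an iterative routine instead of a closed form.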
Interactions
The effect of one predictor on the target changes based on a second predictor’s value, producing a joint influence on the target.
Interaction terms come from multiplying different predictors (*)
Interactions are not easy to describe in simple language relative to linear relationships.
If two non-target variables are related to each other in some way, this may explain model output that seems confusing; such output is common when there is collinearity.
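The idea of an interaction term can be sketched with a toy linear model. The coefficients below are arbitrary values chosen for illustration, not estimates from any dataset. The interaction term is the product of the two predictors, and the effect of x1 on the prediction depends on the value of x2.

```python
def predict(x1, x2, b0=1.0, b1=2.0, b2=3.0, b12=4.0):
    """Toy linear model with an interaction term (arbitrary coefficients)."""
    interaction = x1 * x2  # interaction term: product of the two predictors
    return b0 + b1 * x1 + b2 * x2 + b12 * interaction

def effect_of_x1(x2):
    """Change in the prediction when x1 goes from 0 to 1, holding x2 fixed."""
    return predict(1, x2) - predict(0, x2)

print(effect_of_x1(0))  # 2.0
print(effect_of_x1(1))  # 6.0
```

Without the interaction term (b12 = 0), the effect of x1 would be 2.0 regardless of x2.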
Compound variable
Factor resulting from merging two factors; each level from one factor is paired with each level from the other factor to produce all possible pairings.
Reason to create - to combine two overlapping, yet distinct factors
As a predictor, a compound variable is equivalent to modeling an interaction. However, it is more common to model the interaction by other means, such as an interaction term, than by creating a compound variable.
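A minimal sketch of building a compound variable from two factors. The factor names, levels, and the underscore separator are hypothetical choices made for illustration.

```python
from itertools import product

# Hypothetical factors and levels.
smoker = ["no", "yes"]
region = ["east", "west", "north"]

def compound(s, r):
    """Label for one pairing of levels (separator is an arbitrary choice)."""
    return f"{s}_{r}"

# Every level of one factor is paired with every level of the other.
compound_levels = [compound(s, r) for s, r in product(smoker, region)]
print(len(compound_levels))  # 6

# Applying it to observations gives the compound factor's value for each row.
rows = [("yes", "east"), ("no", "west")]
print([compound(s, r) for s, r in rows])  # ['yes_east', 'no_west']
```

The number of compound levels is the product of the two factors' dimensionalities, so the compound variable can become high-dimensional quickly.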
Transformations used for
Validity
-Ensure dataset is suitable for analysis
-Potential issues could be either context-specific or relevant regardless of context
-If a predictor is only known/available after knowing the target, the model will suffer from target leakage.
Format
-Purposes: exploration and modeling
–Format needed from one may not be appropriate for the other.
Feature creation
Numeric vs Factor
Missing Data - Forms of
Missing: expect to find data, but it isn’t there
NA in R
Factor level labeled “unknown” or similar
Dummy value
Blank character string ("") in R
A dataset’s data dictionary and its summary output can help decipher the presence of missing data
Fixing:
If % missing is low - remove rows; should have low impact on analysis
If % missing is high - remove columns; limited information available about the variable
Impute - used if data appears to be missing at random
Replace values in a categorical column with a level indicating the value was initially missing. Can work in almost all situations
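The forms of missing data and the fixes above can be sketched as follows. This is an illustration in plain Python rather than R: None stands in for NA, and the specific markers (-1 as a dummy value, "unknown" as a label) and all function names are assumptions made for the example. Mean imputation is just one possible imputation method.

```python
# Made-up records; None plays the role of NA.
rows = [
    {"age": 34, "region": "east"},
    {"age": None, "region": "west"},
    {"age": -1, "region": ""},          # -1 is a dummy value, "" a blank string
    {"age": 51, "region": "unknown"},   # factor level labeled "unknown"
]

# Assumed markers; in practice the data dictionary tells you what to look for.
MISSING_MARKERS = {None, "", "unknown", -1}

def is_missing(value):
    return value in MISSING_MARKERS

def missing_share(rows, col):
    """Fraction of rows where `col` is missing."""
    return sum(is_missing(r[col]) for r in rows) / len(rows)

def drop_missing_rows(rows, col):
    """Fix for a low share of missing values: remove the affected rows."""
    return [r for r in rows if not is_missing(r[col])]

def impute_mean(rows, col):
    """Fix for numeric data that appears to be missing at random."""
    observed = [r[col] for r in rows if not is_missing(r[col])]
    mean = sum(observed) / len(observed)
    return [dict(r, **{col: mean if is_missing(r[col]) else r[col]}) for r in rows]

def flag_missing_level(rows, col, label="missing"):
    """Fix for a categorical column: make missingness its own level."""
    return [dict(r, **{col: label if is_missing(r[col]) else r[col]}) for r in rows]

print(missing_share(rows, "age"))                          # 0.5
print([r["age"] for r in impute_mean(rows, "age")])        # [34, 42.5, 42.5, 51]
print([r["region"] for r in flag_missing_level(rows, "region")])
```

Removing a column when the share of missing values is high amounts to dropping the key from every record, so it is not shown separately.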
Predictive Predictor
When a change in its value corresponds to a change in the target. If a bivariate analysis indicates a variable may be predictive, we would have confidence including it in the model.
Ex:
A factor is predictive when its different levels relate to different means or medians of the target
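A simple bivariate check along these lines is to compare the target's mean across the levels of a factor. The sketch below uses made-up data and a hypothetical helper name.

```python
from collections import defaultdict
from statistics import mean

# Made-up (factor level, target value) pairs.
data = [("A", 10.0), ("A", 12.0), ("B", 20.0), ("B", 22.0), ("C", 11.0)]

def target_mean_by_level(pairs):
    """Mean of the target within each level of the factor."""
    groups = defaultdict(list)
    for level, target in pairs:
        groups[level].append(target)
    return {level: mean(values) for level, values in groups.items()}

print(target_mean_by_level(data))  # {'A': 11.0, 'B': 21.0, 'C': 11.0}
```

Here level B stands apart from A and C, which suggests the factor may be predictive; A and C have the same mean, so the factor does not separate those two levels. Swapping mean for statistics.median gives the median-based version of the same check.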
Overlapping Information
If predictors have overlapping information, they can be less effective when used together.
Exponentiating
Always results in a positive number
Log-transforming a target ensures predictions are positive, because predictions made on the log scale are exponentiated to return to the original scale.
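A quick sketch of why this works: whatever value a model predicts on the log scale, negative or not, exponentiating it returns a positive number. The function name and the sample log-scale values are made up for illustration.

```python
import math

def predict_original_scale(log_prediction):
    """Convert a prediction on the log scale back to the target's scale."""
    return math.exp(log_prediction)

# Log-scale predictions can be negative, zero, or positive.
for log_pred in (-5.0, 0.0, 2.3):
    assert predict_original_scale(log_pred) > 0

print(predict_original_scale(0.0))  # 1.0
```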