2.2 Variable Transformations Flashcards
What is the main goal of data transformations, and how do transformed variables differ from the original variables in a dataset?
The main goal of data transformations is to create variables from the existing dataset that may enhance its value. Transformed variables are altered or converted versions of original variables, often created to improve the dataset’s usefulness for analysis.
Explain the difference between variables and features in the context of predictive analytics.
Variables are data in its original recorded form, while features are predictors derived from the dataset. Features have been thoughtfully created and are deemed suitable predictors, while variables may or may not be used as features.
What is target leakage, and why should it be avoided when deciding whether a variable should be a feature?
Target leakage occurs when a model is built with access to information about the target that would not be available during prediction. It should be avoided because it can lead to overly optimistic model performance that will not generalize to new data.
Describe the challenges posed by skewed variables in predictive analytics and how they can impact interpretability and modeling.
Skewed variables, particularly right-skewed ones, pose challenges in predictive analytics. They can obscure understanding of where most values lie, making it difficult to examine relationships with the target variable. In models like linear regression, they can violate the assumption of a normally distributed target variable. Additionally, some models may be biased towards predicting observations with extremely large values when dealing with skewed variables.
What are the two common types of transformations used to handle skewed variables, and how do they differ in their approach?
Two common transformations for skewed variables are logarithms and roots (e.g., square or cube roots). Both compress large values, with the logarithm compressing them more aggressively. A key practical difference is that the logarithm is undefined at zero, so variables containing zeros must be shifted (e.g., log(x + 1)) before applying it, while root transformations can handle zeros directly.
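A minimal sketch in R, using an illustrative right-skewed vector that contains a zero:

```r
# Illustrative right-skewed data containing a zero
x <- c(0, 1, 2, 5, 10, 100, 1000)

# log() is undefined at 0 (returns -Inf), so shift by 1 first;
# log1p(x) computes log(x + 1) in one step
x_log <- log1p(x)

# Root transformations handle zeros directly
x_sqrt <- sqrt(x)
x_cube <- x^(1/3)
```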
Explain the purpose of standardization in data formatting and how it is achieved.
Standardization ensures different variables are on the same scale, facilitating better comparison. It is achieved by subtracting the sample mean from a variable and dividing by its sample standard deviation.
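The calculation can be sketched in R, either by hand or with the built-in scale() function:

```r
x <- c(10, 20, 30, 40, 50)

# Standardize by hand: subtract the sample mean, divide by the sample sd
z <- (x - mean(x)) / sd(x)

# scale() does the same thing (it returns a matrix, so coerce back to numeric)
z2 <- as.numeric(scale(x))

# A standardized variable has mean 0 and standard deviation 1
```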
When dealing with a variable in R, what things should be considered when deciding whether to treat it as a numeric vector or a factor?
When deciding whether to treat a variable as numeric or a factor in R:
- If it is already a factor, it is likely a factor for a good reason; an exception is a binary target, which often needs to be numeric.
- For numeric variables:
  - Use the “math test” to determine whether it is naturally a factor.
  - Consider the number of unique values (generally, more than 10 unique values suggests keeping it numeric).
  - Use data exploration to help decide, such as examining the variable’s relationship with the target.
- Consider the context and intended use in the analysis.
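The unique-value check can be sketched in R with a hypothetical count variable (the threshold of 10 follows the guideline above):

```r
# Hypothetical variable: number of dependents, recorded as numbers
dependents <- c(0, 1, 2, 0, 3, 1, 2, 0)

# A small number of unique values suggests a factor may be appropriate
n_unique <- length(unique(dependents))  # 4 here

if (n_unique <= 10) {
  dependents_fct <- as.factor(dependents)
}
levels(dependents_fct)  # "0" "1" "2" "3"
```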
What are polynomial transformations, and how are they important in certain models?
Polynomial transformations create multiple polynomial terms to capture non-linear relationships between a numeric variable and the target. They are important in models that struggle to identify such relationships on their own, like multiple linear regression.
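As a sketch, poly() in R generates polynomial terms inside a linear model formula (simulated data; the true relationship here is quadratic by construction):

```r
# Simulated data with a quadratic relationship
set.seed(1)
x <- runif(100, 0, 10)
y <- 2 + 0.5 * x - 0.3 * x^2 + rnorm(100)

# poly() creates orthogonal polynomial terms up to the given degree,
# letting linear regression capture the non-linear pattern
fit <- lm(y ~ poly(x, degree = 2))
coef(fit)  # intercept plus two polynomial coefficients
```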
Describe the process of creating factors of intervals and how it can help detect non-linear patterns in the relationship between a numeric variable and the target.
Creating factors of intervals involves dividing a numeric variable into several intervals and transforming it into a factor. This can help detect different mean targets for each level, potentially revealing non-linear patterns as the numeric variable increases.
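In R this is typically done with cut(); a sketch with hypothetical ages and arbitrary break points:

```r
age <- c(18, 25, 34, 47, 52, 63, 71)

# Divide the numeric variable into intervals, producing a factor;
# the mean target can then be compared across the resulting levels
age_band <- cut(age, breaks = c(0, 30, 50, 70, 100))
table(age_band)
```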
What are the reasons for combining levels of a factor, and how can it improve the model?
Combining levels of a factor simplifies analysis, especially when there are too many categories or some categories have insufficient data. It can improve model accuracy and interpretability by grouping similar categories together.
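One base-R way to combine levels is to assign duplicate level names, which merges those levels (hypothetical regions; the grouping shown is illustrative):

```r
region <- factor(c("NE", "NW", "SE", "SW", "NE", "SE"))
levels(region)  # "NE" "NW" "SE" "SW" (alphabetical order)

# Assigning duplicate level names merges those levels into one
levels(region) <- c("North", "North", "South", "South")
table(region)   # North and South counts
```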
What does it mean to relevel a factor, and what is one common way to do so?
Releveling a factor means changing its reference level. A common way to do this is to choose the level with the most observations as the reference level.
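A sketch in R using relevel() with hypothetical data, where the most common level becomes the reference:

```r
vehicle <- factor(c("Sedan", "SUV", "Sedan", "Truck", "Sedan"))
levels(vehicle)[1]  # default reference is alphabetical: "SUV"

# Find the level with the most observations and make it the reference
most_common <- names(which.max(table(vehicle)))
vehicle <- relevel(vehicle, ref = most_common)
levels(vehicle)[1]  # now "Sedan"
```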
Explain the concept of compound variables and the reasons for creating them when transforming multiple variables.
Compound variables are created by merging multiple factors into one. Reasons for creating them include:
- Combining distinct but overlapping factors.
- Including interactions between factors.
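In R, interaction() merges factors into one compound factor (hypothetical sex and smoker variables for illustration):

```r
sex <- factor(c("F", "F", "M", "M"))
smoker <- factor(c("No", "Yes", "No", "Yes"))

# Merge the two factors into a single compound factor;
# each combination of levels becomes its own level
sex_smoker <- interaction(sex, smoker, sep = "_")
levels(sex_smoker)  # "F_No" "M_No" "F_Yes" "M_Yes"
```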
What are the two main unsupervised learning techniques used for transforming multiple numeric variables, and how do they differ?
Two main unsupervised learning techniques for transforming multiple numeric variables:
- Principal components analysis (PCA): Condenses information into a smaller set of new numeric variables called principal components.
- Clustering: Groups similar data points together, transforming multiple numeric variables into one factor with multiple levels.
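Both techniques can be sketched in base R on the built-in iris dataset (column choice and k = 3 are illustrative):

```r
# Three numeric columns from the built-in iris dataset
X <- iris[, c("Sepal.Length", "Petal.Length", "Petal.Width")]

# PCA: condense the variables into principal components
# (center and scale first so no variable dominates)
pca <- prcomp(X, center = TRUE, scale. = TRUE)
pc1 <- pca$x[, 1]  # the first principal component, a new numeric variable

# Clustering: k-means groups similar rows, yielding one factor with k levels
set.seed(42)
km <- kmeans(scale(X), centers = 3)
cluster_fct <- factor(km$cluster)
```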
Why is identifying and handling data errors crucial during the data cleaning process, and what strategies can be used to address them?
Identifying and handling data errors is crucial because they can negatively impact analysis quality. Strategies include removing outliers, correcting inconsistencies, or removing rows or columns with significant errors.
Describe the different forms of missing data and the options available for handling them in predictive analytics.
Missing data can be denoted by NA, special factor levels, or dummy values. Before deciding how to handle it, understand why it is missing. If missing at random, it can often be ignored. Options for handling missing data include:
- Removing columns with missing data (if a lot of missing values).
- Removing rows with missing data (if very few missing values).
- Imputing values (e.g., using mean for numeric variables or a new “unknown” level for factors).
The choice depends on the amount of missing data and its nature (missing at random or not).
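The two imputation options above can be sketched in R on a hypothetical data frame:

```r
# Hypothetical data frame with missing values
df <- data.frame(
  income = c(50, NA, 70, 60),
  region = factor(c("East", NA, "West", "East"))
)

# Numeric variable: impute with the mean of the observed values
df$income[is.na(df$income)] <- mean(df$income, na.rm = TRUE)

# Factor variable: add a new "unknown" level for missing entries
levels(df$region) <- c(levels(df$region), "unknown")
df$region[is.na(df$region)] <- "unknown"
```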
How can unstructured text data be transformed into structured data using sentiment analysis and keyword identification techniques?
Unstructured text data can be transformed into structured data using:
- Sentiment analysis: Classifying text as positive, negative, or neutral.
- Keyword identification: Creating dummy variables to indicate presence or absence of specific words or phrases in the text.
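Keyword identification can be sketched in R with grepl() on hypothetical free-text comments (the keyword "fast" is arbitrary):

```r
comments <- c("Great service, very fast",
              "slow response and rude staff",
              "Fast and friendly")

# Dummy variable: 1 if the keyword appears (case-insensitive), else 0
has_fast <- as.integer(grepl("fast", comments, ignore.case = TRUE))
has_fast  # 1 0 1
```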