Data Design and Visualization (20-30%) Flashcards
Define structured data.
Info that typically sits in tables and is easily comparable - easier to access but less flexible.
Define unstructured data.
Info that typically doesn’t fit into a tabular structure, e.g. free-text mentions of certain diseases, or of birth control and health programs that cause life expectancy increases - more flexible but harder to access.
Define target variable.
Variables that we are trying to predict.
- Also known as dependent variables.
- We want to predict the variables that have a direct impact on the business’s key performance indicators (KPIs).
- We want to know their values before they happen, which is why we are trying to predict them.
Define predictor variables.
Variables used to predict the target variable.
- Watch out for predictors that are “self-fulfilling”: they reflect the true outcome and aren’t known until the outcome is observed. Using such variables is known as “target leakage”.
- Having too many predictor variables can introduce issues such as “collinearity” (which can increase the variance of the parameter estimates) and the “curse of dimensionality”.
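A minimal sketch of why collinearity inflates the variance of parameter estimates. For the special case of exactly two predictors, the variance inflation factor (VIF) of each is 1 / (1 - r^2), where r is their Pearson correlation. The data and variable names below are hypothetical and chosen only for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def vif_two_predictors(x, y):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(x, y)
    return 1.0 / (1.0 - r ** 2)


# Hypothetical predictors: years_licensed is almost a linear function of age.
age = [25, 32, 41, 47, 53, 60]
years_licensed = [7, 13, 24, 28, 36, 41]
annual_mileage = [12, 8, 15, 9, 14, 10]

vif_collinear = vif_two_predictors(age, years_licensed)    # very large
vif_independent = vif_two_predictors(age, annual_mileage)  # close to 1
```

A VIF near 1 means the predictor adds little variance inflation; a very large VIF suggests one of the two predictors is nearly redundant.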
Define categorical variables.
Variables that have predefined discrete values that are not treated as numbers.
- If there is a meaningful order associated with the levels, the variable is called ordinal, e.g. gold, silver, bronze. Otherwise, it is called nominal (refer to notes).
- Factor variables (categorical) have predefined levels, e.g. state, gender, postal code.
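A small sketch of the ordinal versus nominal distinction, using only the standard library. The level order and the sample values are assumptions made for illustration.

```python
from collections import Counter

# Ordinal: the levels have a meaningful order, so ranking and sorting make sense.
MEDAL_ORDER = ["bronze", "silver", "gold"]  # assumed order, lowest to highest
medal_rank = {level: i for i, level in enumerate(MEDAL_ORDER)}

results = ["silver", "gold", "bronze", "gold"]
sorted_results = sorted(results, key=lambda level: medal_rank[level])

# Nominal: the levels have no order, so only equality checks and counts make sense.
states = ["NY", "CA", "NY", "TX"]
state_counts = Counter(states)
```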
Define numerical variables.
Variables that take the form of numbers and have a range associated with them.
- Continuous variables can take any value within range.
- Discrete variables are restricted to certain values within that range.
- Only define variables as numeric if they have an order
Define binary/Boolean/indicator.
A type of variable that can only take one of two values, true or false (also stored as 1 or 0, in which case they are binary).
- Boolean variables are typically used as “indicators” or “flags” that highlight whether a particular characteristic is true for an observation or not.
- “Binarization” turns a single categorical (factor) variable into multiple binary/boolean variables.
Define date/time/geospatial data.
Variables that appear to be numeric but have special properties that make it suboptimal to store them as numeric variables.
Geospatial variables (location) have a defined order (ie. one point is further north than the other) but this is rarely useful in a predictive modeling sense. Typically, geospatial data (latitude and longitude) are mapped into a variable that represents regions.
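A rough sketch of both points above. Dates stored as YYYYMMDD integers look numeric, but subtracting them gives a misleading gap. For locations, `to_region` is a hypothetical and deliberately crude mapping (hemisphere quadrants); in practice regions would come from business knowledge or a lookup table.

```python
from datetime import date

# Two consecutive days stored as YYYYMMDD integers.
numeric_gap = 20240101 - 20231231                       # arithmetic on the raw numbers
true_gap = (date(2024, 1, 1) - date(2023, 12, 31)).days  # arithmetic on real dates


def to_region(lat, lon):
    """Map latitude/longitude to a coarse region label (hypothetical quadrant split)."""
    north_south = "N" if lat >= 0 else "S"
    east_west = "E" if lon >= 0 else "W"
    return north_south + east_west


new_york_region = to_region(40.7, -74.0)
sydney_region = to_region(-33.9, 151.2)
```

The raw integer difference is in the thousands even though the dates are one day apart, which is the kind of property that makes plain numeric storage suboptimal.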
Define dimensionality.
- Usually talking about the number of variables within the data (number of columns).
- Dimensionality of a categorical variable: the number of different possible values or levels that the variable has.
- It is useful to reduce the dimensionality of a variable to make it more manageable. Note, this is not the same as reducing the dimensionality of the data.
When can high-dimensional variables be problematic?
- When there may be low exposure (or occurrence) in some levels which hinders our ability to build robust predictive models.
- When some algorithms treat each level separately and consider every possible combination of variable levels, which can lead to unstable and unintuitive results when you have a large number of variable levels.
- When high-dimensional variables are more difficult to comprehend, and human intuition can often fail as a result.
- Because of the potential issues outlined here, high-dimensional variables should always be treated with care.
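One common way to handle low-exposure levels is to fold them into a single catch-all level. The sketch below assumes a hypothetical `min_count` threshold and an "Other" label; both are illustrative choices, not fixed rules.

```python
from collections import Counter


def group_rare_levels(values, min_count, other_label="Other"):
    """Replace levels that occur fewer than min_count times with other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


# Hypothetical high-dimensional variable with two very rare levels.
occupations = ["nurse"] * 5 + ["teacher"] * 4 + ["falconer", "glassblower"]
grouped = group_rare_levels(occupations, min_count=3)

levels_before = len(set(occupations))  # 4 levels
levels_after = len(set(grouped))       # 3 levels
```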
Define granularity.
Refers to how precisely a variable is measured, e.g. for locations, addresses can be recorded (more granular), postal code (less granular), country (even less granular).
Granularity is closely related to dimensionality, in that high granularity often implies high dimensionality and low granularity implies low dimensionality. Often, transformations that reduce the dimensionality of a variable take the form of reducing the granularity, e.g. instead of using a customer’s exact address in your analysis (as there is only one observation per address, making it useless for identifying trends), you might want to transform the data to look at postal code instead.
Reasons for reducing granularity?
Similar reasons to reducing a variable’s dimensionality.
- To increase the number of observations per level of the variable, smoothing out trends and reducing the likelihood of overfitting to noise in the data.
- To make model results more intuitive.
- To reduce the complexity of a model.
- In some circumstances, it might make sense to increase the granularity of your data in order to identify more detailed trends, assuming you have enough observations at the higher level of granularity, e.g. more useful insights might be found at the postal code level than at state or province level.
- Table in notes
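A minimal sketch of the address versus postal code example: counting observations per level before and after reducing granularity. The records are made up for illustration.

```python
from collections import Counter

# Hypothetical records: (street address, postal code).
records = [
    ("1 Elm St", "90210"),
    ("2 Elm St", "90210"),
    ("9 Oak Ave", "90210"),
    ("5 Pine Rd", "10001"),
    ("7 Pine Rd", "10001"),
]

# High granularity: one observation per level, so no trend can be identified.
by_address = Counter(address for address, _ in records)

# Lower granularity: fewer levels, more observations per level.
by_postal_code = Counter(code for _, code in records)
```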
Define binarization.
The process of transforming a single categorical variable into multiple binary variables, where each new binary variable is an “indicator” for one of the levels in the categorical variable.
If an algorithm requires numeric values to be supplied, you can binarize the categorical variables. This associates a new numeric variable to each level of the categorical variable that takes the value of 0 or 1. This notation has implicit order (only two values), so it will be safe to use in algorithms that make this assumption.
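Binarization as described above can be sketched in a few lines. The `binarize` helper and the `is_` column prefix are hypothetical names used only for this example.

```python
def binarize(values):
    """Turn one categorical variable into one 0/1 indicator list per level."""
    levels = sorted(set(values))
    return {
        f"is_{level}": [1 if v == level else 0 for v in values]
        for level in levels
    }


medals = ["gold", "silver", "gold", "bronze"]
indicators = binarize(medals)

# Each observation belongs to exactly one level, so each row of indicators sums to 1.
row_sums = [sum(column[i] for column in indicators.values()) for i in range(len(medals))]
```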
Purpose of variable distributions?
- To check if the data makes sense (the distribution is appropriate given your knowledge of what a sensible distribution would look like).
- To understand what the distribution of the target variable is, so you can make sure your model is capable of fitting to it, e.g. if you are trying to predict occurrence of claims and only 0.01% of the data has a claim, this will affect how you model the data.
- To ensure that data samples are representative of the wider population.
- To identify any areas where there is limited exposure (ie. a small number of rows) for certain values of a variable, which could lead to model overfitting.
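Two of these checks can be sketched with simple counts: the share of positive target values, and the levels of a variable with limited exposure. The data, the helper names, and the `min_rows` threshold are assumptions for illustration.

```python
from collections import Counter


def target_rate(target):
    """Share of observations where a 0/1 target equals 1."""
    return sum(target) / len(target)


def low_exposure_levels(values, min_rows):
    """Levels of a variable that appear in fewer than min_rows rows."""
    counts = Counter(values)
    return sorted(level for level, n in counts.items() if n < min_rows)


# Hypothetical data: a rare target and one sparsely populated level.
has_claim = [0] * 98 + [1] * 2
region = ["urban"] * 60 + ["suburban"] * 37 + ["rural"] * 3

claim_rate = target_rate(has_claim)
sparse_regions = low_exposure_levels(region, min_rows=10)
```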
How can a problematic variable be treated in modeling?
- Use an alternative algorithm
- Ignore the variable
- Transform the variable into something more useful