Data Design and Visualization (20-30%) Flashcards

1
Q

Define structured data.

A

Info that typically sits in tables and is easily comparable - easier to access but less flexible.

2
Q

Define unstructured data.

A

Info that typically doesn’t fit into a tabular structure, e.g. free text such as mentions of certain diseases, or of birth control and health programs that cause life expectancy increases - more flexible but harder to access.

3
Q

Define target variable.

A

Variables that we are trying to predict.
- Also known as dependent variables.
- We want to predict the variables that have a direct impact on the business’s key performance indicators (KPIs).
- Not known before they happen, which is why we are trying to predict them.

4
Q

Define predictor variables.

A

Variables used to predict the target variable.
- Predictor variables that reflect the true outcome and aren’t known until the outcome is observed are referred to as “self-fulfilling”. Including them is known as “target leakage” and should be avoided.
- Having too many predictor variables can introduce issues such as “collinearity” (which can increase the variance of the parameter estimates) and “curse of dimensionality”.

5
Q

Define categorical variables.

A

Variables that have predefined discrete values that are not treated as numbers.
- If there is a meaningful order associated with the levels, the variable is called ordinal, e.g. gold, silver, bronze. Otherwise, it is called nominal (refer to notes).
- Factor variables (categorical) have predefined levels, e.g. state, gender, postal code.

6
Q

Define numerical variables.

A

Variables that take the form of numbers and have a range associated with them.
- Continuous variables can take any value within the range.
- Discrete variables are restricted to certain values within that range.
- Only define variables as numeric if they have a meaningful order.

7
Q

Define binary/Boolean/indicator.

A

A type of variable that can only take one of two values, true or false (also stored as 1 or 0, in which case they are called binary).
- Boolean variables are typically used as “indicators” or “flags” that highlight whether a particular characteristic is true for an observation or not.
- “Binarization” turns a single categorical (factor) variable into multiple binary/boolean variables.

8
Q

Define date/time/geospatial data.

A

Variables that appear to be numeric but have special properties that make it suboptimal to store them as numeric variables.

Geospatial variables (location) have a defined order (e.g. one point is further north than another) but this is rarely useful in a predictive modeling sense. Typically, geospatial data (latitude and longitude) are mapped into a variable that represents regions.

9
Q

Define dimensionality.

A
  • Usually refers to the number of variables in the data (number of columns).
  • Dimensionality of a categorical variable: how many different possible values or levels that variable has.
  • It is often useful to reduce the dimensionality of a variable to make it more manageable. Note, this is not the same as reducing the dimensionality of the data.
10
Q

When can high-dimensional variables be problematic?

A
  • When there is low exposure (or occurrence) in some levels, which hinders our ability to build robust predictive models.
  • When an algorithm treats each level separately and considers every possible combination of variable levels, which can lead to unstable and unintuitive results when there is a large number of levels.
  • When high-dimensional variables are difficult to comprehend, so human intuition can often fail as a result.
  • Because of these potential issues, high-dimensional variables should always be treated with care.
11
Q

Define granularity.

A

Refers to how precisely a variable is measured, e.g. for locations: exact address (more granular), postal code (less granular), country (even less granular).

Granularity is closely related to dimensionality, in that high granularity often implies high dimensionality and low granularity implies low dimensionality. Often, transformations that reduce the dimensionality of a variable take the form of reducing the granularity, e.g. instead of using a customer’s exact address in your analysis (as there is only one observation per address, making it useless for identifying trends), you might want to transform the data to look at postal code instead.

12
Q

Reasons for reducing granularity?

A

Similar reasons to reducing a variable’s dimensionality.

  • To increase the number of observations per level of the variable, smoothing out trends and reducing the likelihood of overfitting to noise in the data.
  • To make model results more intuitive.
  • To reduce the complexity of a model.
  • In some circumstances, it might make sense to increase the granularity of your data in order to identify more detailed trends, assuming you have enough observations at the higher level of granularity, e.g. more useful insights might be found at the postal code level than at the state or province level
  • Table in notes
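The address-to-postal-code idea can be sketched as a simple aggregation. The records and field names below are made up for illustration:

```python
from collections import defaultdict

# Toy records at address-level granularity: one observation per address,
# which is too sparse to reveal trends.
records = [
    {"address": "12 Oak St", "postal": "M5V", "claim": 100},
    {"address": "9 Elm Ave", "postal": "M5V", "claim": 300},
    {"address": "4 Pine Rd", "postal": "K1A", "claim": 200},
]

# Reduce granularity to the postal-code level: group, then average.
by_postal = defaultdict(list)
for rec in records:
    by_postal[rec["postal"]].append(rec["claim"])

avg_claim = {postal: sum(c) / len(c) for postal, c in by_postal.items()}
# "M5V" now pools two observations instead of one per address.
```

Each postal-code level now aggregates several address-level observations, smoothing out noise at the cost of detail.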
13
Q

Define binarization.

A

The process of transforming a single categorical variable into multiple binary variables, where each new binary variable is an “indicator” for one of the levels of the categorical variable.

If an algorithm requires numeric inputs, you can binarize the categorical variable. This associates a new numeric variable with each level of the categorical variable that takes the value 0 or 1. With only two values, the implicit numeric order is harmless, so binarized variables are safe to use in algorithms that assume their inputs are ordered.
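A minimal sketch of binarization with the standard library (the `binarize` helper and level names are hypothetical):

```python
def binarize(values, levels=None):
    """One-hot encode a categorical variable: one 0/1 indicator per level."""
    if levels is None:
        levels = sorted(set(values))
    return [{f"is_{lvl}": int(v == lvl) for lvl in levels} for v in values]

rows = binarize(["gold", "silver", "gold"])
# Each row carries an indicator for every level of the original variable:
# rows[0] == {"is_gold": 1, "is_silver": 0}
```

In practice a library routine (e.g. a one-hot encoder) would be used, but the transformation is exactly this: one indicator column per level.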

14
Q

Purpose of variable distributions?

A
  • To check if the data makes sense (the distribution is appropriate given your knowledge of what a sensible distribution would look like).
  • To understand what the distribution of the target variable is, so you can make sure your model is capable of fitting to it, e.g. if you are trying to predict occurrence of claims and only 0.01% of the data has a claim, this will affect how you model the data.
  • To ensure that data samples are representative of the wider population.
  • To identify any areas where there is limited exposure (i.e. a small number of rows) for certain values of a variable, which could lead to model overfitting.
15
Q

How can problematic variables be handled in modeling?

A
  • Use an alternative algorithm
  • Ignore the variable
  • Transform the variable into something more useful
16
Q

Define sampling.

A

The process of taking a subset of records from a larger dataset.

17
Q

Define random sampling.

A

Drawing random records (without replacement) from the dataset until you have the required number, where each record is equally likely to be drawn.
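A minimal sketch of random sampling without replacement using the standard library (the helper name and seed are illustrative):

```python
import random

def random_sample(records, n, seed=None):
    """Draw n records without replacement, each equally likely."""
    rng = random.Random(seed)
    return rng.sample(records, n)

# Reduce 1,000 records to a manageable subset of 100 distinct records.
subset = random_sample(list(range(1_000)), 100, seed=42)
```

Fixing the seed makes the sample reproducible, which matters when the sample feeds a model you want to rebuild later.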

18
Q

Define stratified sampling.

A

Independently drawing a set number of random records from each stratum or group in your data. This is similar to random sampling, except that you control exactly how many records you take from each group. Special cases of stratified sampling include methods to control imbalances in datasets, which help models better predict the minority group, such as oversampling, undersampling, and systematic sampling.
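A sketch of stratified sampling with the standard library (the `stratum_of` key function and the "east"/"west" groups are made up for illustration):

```python
import random
from collections import defaultdict

def stratified_sample(records, stratum_of, n_per_stratum, seed=None):
    """Independently draw a fixed number of records from each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[stratum_of(rec)].append(rec)
    sample = []
    for group in strata.values():
        # Draw without replacement within each stratum.
        sample.extend(rng.sample(group, min(n_per_stratum, len(group))))
    return sample

# 2 records from each region, regardless of the regions' original sizes.
records = [("east", i) for i in range(90)] + [("west", i) for i in range(10)]
balanced = stratified_sample(records, stratum_of=lambda r: r[0],
                             n_per_stratum=2, seed=1)
```

Unlike plain random sampling, the group counts in the result are fixed in advance rather than left to chance.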

19
Q

Define oversampling.

A

Drawing more samples from the minority group than the majority group. In some cases, you may actually duplicate records to increase the number of records from the minority group.
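A sketch of oversampling by duplicating minority records until the groups are the same size (the `is_minority` predicate and claim data are illustrative):

```python
import random

def oversample(records, is_minority, seed=None):
    """Duplicate minority records (drawn with replacement) until the
    minority group matches the majority group in size."""
    rng = random.Random(seed)
    minority = [r for r in records if is_minority(r)]
    majority = [r for r in records if not is_minority(r)]
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra

# 5 claims vs 95 non-claims: duplicate claims until both groups have 95.
data = [("claim", i) for i in range(5)] + [("no_claim", i) for i in range(95)]
balanced = oversample(data, is_minority=lambda r: r[0] == "claim", seed=0)
```

The duplicated records add no new information; they simply force the model to weight the minority group more heavily during fitting.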

20
Q

Define undersampling.

A

Same effect as oversampling but phrased as drawing fewer samples from the majority group than the minority group.

21
Q

Define systematic sampling.

A

Drawing according to a pattern, e.g. every 5th record until you have 100 records.

Other types of systematic sampling include drawing samples according to certain conditions, e.g. a time frame.
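The "every 5th record" pattern is a one-line slice; a small sketch (helper name is illustrative):

```python
def systematic_sample(records, step, limit=None):
    """Take every `step`-th record, optionally stopping after `limit` draws."""
    sample = records[::step]
    return sample if limit is None else sample[:limit]

# Every 5th record until we have 100 records: 0, 5, 10, ..., 495.
picked = systematic_sample(list(range(1_000)), step=5, limit=100)
```

Note that systematic sampling assumes the record order carries no hidden pattern; if it does (e.g. records sorted by date), the sample can be biased.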

22
Q

Why sample data?

A
  1. Size - the original dataset could be too big for efficient analysis, so random sampling can reduce it to a more manageable yet representative subset
  2. Model testing and validation process - creating training/test sets, perhaps applying sampling techniques to correct for bias or to emphasize inherent features
  3. Irrelevant or misleading data
  4. Imbalanced data
23
Q

Define imbalanced data.

A

Where one or more groups in the data account for significantly more of the distribution density than others, e.g. this is typical of health insurance claims data (such as emergency room coverage), which may have fewer than 1% of policies with an actual claim.

Imbalanced datasets can be problematic when trying to fit a model because the model places more weight on the majority groups by virtue of the fact that there is more data; thus, it is easier to achieve an overall good model fit by fitting well to those data points, even if it means ignoring the others. This won’t be an issue if the effect of ignoring the minority groups isn’t large; however, it becomes a real problem if the minority group is what you are most interested in, as in the claims example. Common methods for overcoming this include over- and undersampling.
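A tiny numeric illustration (made-up proportions) of why the majority group dominates a naive fit:

```python
# 1 claim among 10,000 policies: a model that always predicts "no claim"
# achieves near-perfect accuracy while missing every claim.
labels = [1] + [0] * 9_999           # 1 = claim (minority), 0 = no claim
preds = [0] * len(labels)            # trivial majority-class "model"

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
claims_caught = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
# accuracy == 0.9999, yet claims_caught == 0
```

This is why raw accuracy is a poor yardstick on imbalanced data, and why over- or undersampling (or metrics focused on the minority class) are used instead.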

24
Q

Univariate data exploration technique purpose?

A
  • Understand basic relationships in the data in order to perform common sense checks on model output.
  • Check relationships in the data against common knowledge and intuition to identify potential data errors that could lead to misleading models.
  • Identify outliers and understand their potential effects on the model.
  • Gain clues about relationships in the data and how the target or response varies for different predictors, leading to more informed variable transformation and modeling choices, which will improve the overall performance of the model.
25
Q

Types of univariate variable distributions?

A
  1. Numeric statistics or summaries, e.g. mean, variance, and frequencies
    - Statistical summaries have the advantage of being precise and easily comparable across variables.
  2. Visualization (showing the values of a variable in one graphic image), e.g. histograms and bar plots
    - Visualizations have the advantage of showing the behaviour of the variable without being confined to predefined statistics, e.g. revealing a bimodal distribution or the presence of outliers
26
Q

Numeric variables - Univariate

A
  1. Statistics or summaries - mean, median, variance
  2. Visualization - boxplots
    - for continuous numeric variables: histograms and boxplots
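A quick sketch of univariate numeric summaries with the standard library (the values are made up; 8.5 plays the role of an outlier):

```python
import statistics

# Made-up observations; 8.5 is an outlier relative to the rest.
values = [1.2, 1.5, 1.7, 2.0, 2.1, 8.5]

mean = statistics.mean(values)          # pulled toward the outlier
median = statistics.median(values)      # robust to the outlier
variance = statistics.variance(values)  # sample variance
# A large gap between mean and median hints at skew or outliers.
```

Comparing the mean against the median is a cheap first check for the outliers and skew that a histogram or boxplot would show visually.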
27
Q

Categorical variables - Univariate

A
  1. Statistics or summaries - N/A because there is no sense of “order”
  2. Visualization - frequency tables, counts, percentages, and bar charts
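A frequency table with counts and percentages can be built in a couple of lines (the levels below are illustrative):

```python
from collections import Counter

# Frequency table for a categorical variable.
levels = ["gold", "silver", "gold", "bronze", "gold"]
counts = Counter(levels)
percentages = {lvl: n / len(levels) for lvl, n in counts.items()}
# counts["gold"] == 3 and percentages["gold"] == 0.6
```

These counts are exactly what a bar chart of the variable would plot.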
28
Q

Why is data exploration important?

A

It allows us to see patterns and to question what it is about the data that makes these patterns appear. The patterns can indicate:
- An error in the data that needs to be corrected
- An abnormality in the data collection that should be accounted for or acknowledged
- An interesting occurrence that can lead to useful transformations of the data, or decisions that will improve our model

29
Q

What are the 3 types of Bivariate data exploration and their techniques?

A
  1. Categorical vs Categorical - stacked bar charts
  2. Categorical vs numeric - split box plots, and split histograms
  3. Numeric vs numeric - scatter plots
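For the categorical-vs-categorical case, the underlying summary is a two-way frequency table (cross-tabulation); a sketch with toy data:

```python
from collections import Counter

# Two-way frequency table for two categorical variables (toy data) -
# the kind of summary a stacked bar chart visualizes.
observations = [
    ("smoker", "claim"), ("smoker", "no_claim"),
    ("nonsmoker", "no_claim"), ("smoker", "claim"),
    ("nonsmoker", "no_claim"),
]
crosstab = Counter(observations)
# crosstab[("smoker", "claim")] == 2
```

Each cell count becomes one segment of a stacked bar in the corresponding chart.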
30
Q

Bivariate data exploration steps

A
  1. Examine relationships between the target variable and each predictor variable
  2. Examine “obvious” or well-known relationships, e.g. insured amounts vs premium, or how some predictors change over time
  3. Examine other potentially interesting relationships that you hypothesize will be important in your model
  4. Examine any other relationship for sense checking purposes