2.1 Data Exploration Flashcards

Question 1

Q

Explain the difference between observations and variables in a dataset. Provide an example to illustrate your understanding.

Answer

A

Observations are rows in a dataset, each representing a single entity with all its recorded data points.

Variables are columns, representing different attributes measured for each observation.

For example, in a student dataset, each row (observation) might represent a student, while columns (variables) could include age, grade, and gender.
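
As a rough illustration of the row/column idea, the sketch below stores a hypothetical student dataset as a list of dictionaries. The values and the name `students` are made up for this example; each dictionary stands for one observation and each key for one variable.

```python
# Hypothetical student dataset: each dict is one observation (row),
# and each key is one variable (column).
students = [
    {"age": 15, "grade": 10, "gender": "F"},
    {"age": 16, "grade": 11, "gender": "M"},
    {"age": 15, "grade": 10, "gender": "M"},
]

n_observations = len(students)         # number of rows
variables = list(students[0].keys())   # column names

# One variable viewed across all observations
ages = [s["age"] for s in students]

print(n_observations, variables, ages)
```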

Question 2

Q

Describe continuous variables, count variables, and factors. Provide an example for each type.

Answer

A

Continuous variables: Represent measurements that can take any value within a given interval, including decimals (e.g., age, income).
Count variables: Represent discrete counts, restricted to non-negative integers (e.g., number of purchases).
Factors: Record categories, with each category termed as a level (e.g., marital status: single, married, divorced).
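
A minimal sketch of the three variable types, using made-up records. The field names (`income`, `purchases`, `marital_status`) are assumptions chosen to match the examples above, not part of any real dataset.

```python
from collections import Counter

# Hypothetical records with one variable of each type
records = [
    {"income": 52350.75, "purchases": 3, "marital_status": "single"},
    {"income": 61200.00, "purchases": 0, "marital_status": "married"},
    {"income": 48999.10, "purchases": 7, "marital_status": "married"},
    {"income": 75010.40, "purchases": 1, "marital_status": "divorced"},
]

# Continuous: any value in an interval, decimals allowed
incomes = [r["income"] for r in records]

# Count: non-negative integers only
purchases = [r["purchases"] for r in records]
is_valid_count = all(isinstance(p, int) and p >= 0 for p in purchases)

# Factor: categories; the distinct categories are its levels
levels = sorted({r["marital_status"] for r in records})
level_counts = Counter(r["marital_status"] for r in records)

print(is_valid_count, levels, dict(level_counts))
```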

Question 3

Q

Why is it important to understand the distinction between target variables and predictor variables in predictive analytics?

Answer

A

Understanding the distinction between target and predictor variables is crucial because the goal in predictive analytics is to use predictor variables to investigate and reveal patterns of the target variable. This distinction guides the analysis and model-building process.

Question 4

Q

Describe the three main sampling techniques covered in the manual.

Answer

A

Random sampling: Every record has an equal probability of being sampled.
Stratified sampling: Dataset is divided into groups or strata, then samples are drawn from each stratum.

Oversampling: Draw a higher proportion of samples from the minority stratum compared to the majority stratum.
Undersampling: Draw a lower proportion of samples from the majority stratum compared to the minority stratum.

Systematic sampling: Follows a pattern when drawing records.
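
The three techniques can be sketched with Python's standard library. This is only an illustration under stated assumptions: the 100 records, the `label` field, the 10% sampling rate, and the step size of 10 are all hypothetical choices.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical dataset: 10 "minority" and 90 "majority" records
records = [
    {"id": i, "label": "minority" if i % 10 == 0 else "majority"}
    for i in range(100)
]

# Random sampling: every record is equally likely to be drawn
random_sample = rng.sample(records, k=10)

# Stratified sampling: split into strata, then sample within each stratum
strata = {}
for r in records:
    strata.setdefault(r["label"], []).append(r)

stratified_sample = []
for label, group in strata.items():
    k = max(1, len(group) // 10)  # assumed 10% from each stratum
    stratified_sample.extend(rng.sample(group, k))

# Systematic sampling: follow a pattern, here every 10th record
step = 10
start = rng.randrange(step)
systematic_sample = records[start::step]
```

Sampling the same fraction from every stratum keeps the class proportions of the full dataset; changing the fraction per stratum is what oversampling and undersampling do.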

Question 5

Q

What are some key reasons for sampling data? Explain how sampling can help address each of these issues.

Answer

A

When working with datasets, it’s important to consider how the data was collected.

Key reasons for sampling:

– Managing dataset size for computational limitations.
– Avoiding irrelevant or misleading data.
– Addressing imbalanced data.
– Facilitating model testing (creating training and testing sets).

Sampling helps by creating manageable, representative subsets of data. Note that sampling does not always result in a smaller dataset, especially when dealing with imbalanced data.
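
For the model-testing reason, one common approach is a random split into training and testing sets. The sketch below assumes a 70/30 split and uses integers as stand-ins for records; both are illustrative choices, not values taken from the manual.

```python
import random

rng = random.Random(123)  # fixed seed for reproducibility

records = list(range(100))  # stand-ins for 100 hypothetical records

# Shuffle a copy, then cut it into two non-overlapping pieces
shuffled = records[:]
rng.shuffle(shuffled)

cut = len(shuffled) * 7 // 10  # assumed 70% for training
train, test = shuffled[:cut], shuffled[cut:]

print(len(train), len(test))
```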

Question 6

Q

What is the difference between oversampling and undersampling? In what situations might you apply these techniques?

Answer

A

Oversampling draws a higher proportion of samples from the minority stratum, potentially including duplicate records.

Undersampling draws a lower proportion from the majority stratum.

These techniques are used when dealing with imbalanced datasets to achieve better representation of minority classes.
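
A minimal sketch of both techniques on a hypothetical 90/10 imbalanced dataset. Balancing the classes exactly 1:1 is an assumption made here for simplicity; other target ratios are possible.

```python
import random

rng = random.Random(0)  # fixed seed for reproducibility

# Hypothetical imbalanced data: 90 majority records, 10 minority records
majority = [{"id": i, "label": 0} for i in range(90)]
minority = [{"id": 90 + i, "label": 1} for i in range(10)]

# Oversampling: draw the minority class WITH replacement, so the
# result can contain duplicate records
oversampled = majority + rng.choices(minority, k=len(majority))

# Undersampling: draw the majority class WITHOUT replacement,
# discarding most majority records
undersampled = rng.sample(majority, k=len(minority)) + minority

print(len(oversampled), len(undersampled))
```

In this sketch the oversampled dataset is larger than the original 100 records, which shows why sampling does not always shrink the data.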

Question 7

Q

Describe the purpose of univariate analyses. What are some common numerical and graphical summaries used in univariate analyses?

Answer

A

Univariate analyses study single variables. Numerical summaries include mean, variance, and quantiles. Graphical summaries include histograms, density plots, bar charts, and box plots. Frequency counts are especially useful for factors. These help understand the distribution and characteristics of individual variables.
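
The numerical summaries can be computed with Python's `statistics` module, and frequency counts with `collections.Counter`. The data values below are made up for illustration.

```python
import statistics
from collections import Counter

# Hypothetical numeric variable
values = [2, 4, 4, 5, 7, 9, 11]

mean = statistics.mean(values)
variance = statistics.variance(values)       # sample variance (n - 1)
quartiles = statistics.quantiles(values, n=4)  # three cut points

# Hypothetical factor: frequency counts per level
colors = ["red", "blue", "red", "green", "red", "blue"]
frequencies = Counter(colors)

print(mean, variance, quartiles, frequencies.most_common())
```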

Question 8

Q

How can you assess the skewness of a distribution using the mean and median? Explain the rule of thumb presented in the manual.

Answer

A

Rule of thumb for assessing skewness:

If mean < median: possible left skewness.
If mean > median: possible right skewness.
If mean = median: likely symmetric distribution.
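
The rule of thumb can be written as a small helper. The function name `skew_hint` and the tolerance `tol` used to decide when the mean and median count as equal are assumptions for this sketch; the rule only suggests skewness and does not prove it.

```python
import math
import statistics

def skew_hint(values, tol=1e-9):
    """Apply the mean-vs-median rule of thumb (hypothetical helper)."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if math.isclose(mean, median, rel_tol=tol, abs_tol=tol):
        return "likely symmetric"
    if mean < median:
        return "possible left skewness"
    return "possible right skewness"

# A large value on the right pulls the mean above the median
print(skew_hint([1, 2, 2, 3, 3, 3, 4, 50]))
# A small value on the left pulls the mean below the median
print(skew_hint([1, 47, 48, 48, 49, 49, 49, 50]))
print(skew_hint([1, 2, 3, 4, 5]))
```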

Question 9

Q

What is the goal of bivariate analyses? How does it differ from univariate analyses?

Answer

A

Bivariate analyses study how pairs of variables behave together. While analyses between predictors are important, the focus is on target-predictor pairs. They differ from univariate analyses by examining relationships between variables rather than individual variables in isolation.

Question 10

Q

Explain the concept of correlation in the context of bivariate analyses. How is correlation interpreted, and what are its limitations?

Answer

A

Correlation measures the linear relationship between two numeric variables, ranging from -1 to 1. Perfect linear relationships have correlations of -1 or 1, with the sign indicating direction. A correlation of 0 suggests no linear relationship, but does not rule out non-linear relationships.
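
The sketch below computes the Pearson correlation by hand to show both the interpretation and the limitation. The helper `pearson` and the data are written for this example: a perfect quadratic relationship is included because its correlation is 0 even though the variables are clearly related.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(sxx * syy)

x = [-2, -1, 0, 1, 2]

r_up = pearson(x, [2 * v + 1 for v in x])   # perfect increasing line
r_down = pearson(x, [-3 * v for v in x])    # perfect decreasing line
r_quad = pearson(x, [v ** 2 for v in x])    # perfect but non-linear

print(r_up, r_down, r_quad)
```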

Question 11

Q

Describe some common visualization techniques used in bivariate analyses. How can these techniques help in understanding the relationship between two variables?

Answer

A

Common bivariate visualization techniques:

Scatterplots: Show the relationship between two numeric variables.
Special bar charts: Plot statistics of a numeric variable against levels of a factor.
Split plots (histograms, bar charts, box plots): Compare a numeric variable across levels of a factor.
Density plots and split line graphs.

These help visualize patterns, trends, and differences between variables. For example, in a split plot, if the distribution of the target variable differs across the levels of a factor, the factor is considered predictive of the target variable.
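
The statistic behind a bar chart of a numeric variable against a factor is a per-level summary such as the mean. The sketch below computes it for hypothetical data; the names `region` and `claim_amount` are invented for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical data: numeric target and one factor
policies = [
    {"region": "urban", "claim_amount": 1200.0},
    {"region": "urban", "claim_amount": 1500.0},
    {"region": "urban", "claim_amount": 1800.0},
    {"region": "rural", "claim_amount": 700.0},
    {"region": "rural", "claim_amount": 900.0},
]

# Group the numeric variable by factor level
by_level = defaultdict(list)
for p in policies:
    by_level[p["region"]].append(p["claim_amount"])

# One summary statistic per level; these would be the bar heights
level_means = {level: mean(vals) for level, vals in by_level.items()}

print(level_means)
```

A clear gap between the level means, as in this made-up data, is the kind of difference that would suggest the factor is predictive of the target.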

Question 12

Q

What is the purpose of multivariate analyses? How might techniques like color-coding and faceting be employed?

Answer

A

Multivariate analyses explore relationships between three or more variables simultaneously. Color-coding can introduce a third variable in a visual by assigning colors to levels of a factor. Faceting creates subplots for each level of a factor, allowing comparison of bivariate relationships across these levels.
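
Faceting can be imitated numerically by computing a bivariate summary separately for each level of a factor. In the constructed example below, the pooled correlation between `x` and `y` is 0, yet within each level of the hypothetical factor `group` the relationship is perfectly linear. The data and names are assumptions chosen to make the effect obvious.

```python
import math
from collections import defaultdict

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(sxx * syy)

# Hypothetical rows: (group, x, y)
rows = [
    ("A", 1, 2), ("A", 2, 4), ("A", 3, 6),
    ("B", 1, 9), ("B", 2, 7), ("B", 3, 5),
]

# Pooled view: the third variable is ignored
pooled_r = pearson([r[1] for r in rows], [r[2] for r in rows])

# "Faceted" view: one correlation per level of the factor
facets = defaultdict(lambda: ([], []))
for group, x, y in rows:
    facets[group][0].append(x)
    facets[group][1].append(y)
facet_r = {g: pearson(xs, ys) for g, (xs, ys) in facets.items()}

print(pooled_r, facet_r)
```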

(12 cards)