T4: EDA Flashcards
What is EDA?
- A crucial first step in the data analysis process.
- Helps understand the data’s main characteristics.
Why EDA?
- Offers a clear insight into the underlying structure of the data.
- Helps identify obvious errors and outliers.
- Provides a foundation for subsequent analysis.
IMPORTANCE OF EDA
- Uncovering patterns: Helps detect and visualize patterns in the data.
- Identifying anomalies: Spot potential outliers or mistakes in the data.
- Informing model selection: Understand which models might work best.
- Validating assumptions: Ensure data meets assumptions required by modeling techniques.
STEPS IN EDA
1) Data Collection: Gathering relevant data from various sources (last time)
2) Data Cleaning: Preparing the data for analysis (last time)
3) Data Visualization: Using plots and charts to understand data (today)
4) Statistical Analysis: Applying stats to derive insights (next time).
1) DATA COLLECTION (RECAP)
- Sources of Data: Surveys, databases, logs, etc.
- Diverse and Accurate Data: Ensure varied sources for unbiased results.
- Initial Observations: First look at raw data for obvious issues or patterns.
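A minimal sketch of loading data and taking that first look in R (the file name survey_data.csv is a placeholder, not from the slides):
# Load data from a CSV file and inspect it
data <- read.csv("survey_data.csv")
str(data)      # column types and structure
head(data)     # first few rows
summary(data)  # quick per-column summary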
2) DATA CLEANING (HANDLING MISSING VALUES)
R Code:
# Identify missing values
missing_values <- is.na(data)
# Remove rows with missing values
cleaned_data <- na.omit(data)
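Dropping rows is not the only option; imputation is a common alternative. A minimal sketch, assuming a numeric column named column_name (a placeholder):
# Count missing values per column
colSums(is.na(data))
# Replace missing values in one column with the column mean
data$column_name[is.na(data$column_name)] <- mean(data$column_name, na.rm = TRUE)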
2) DATA CLEANING (DEALING WITH OUTLIERS)
Definition: Data points that differ significantly from others.
Types: Point outliers, contextual outliers, and collective outliers.
R Code:
# Boxplot to visualize outliers
boxplot(data$column_name)
# IQR method to identify outliers
IQR <- IQR(data$column_name)
upper_bound <- quantile(data$column_name, 0.75) + 1.5 * IQR
lower_bound <- quantile(data$column_name, 0.25) - 1.5 * IQR
outliers <- data$column_name[data$column_name > upper_bound |
  data$column_name < lower_bound]
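Once flagged, outliers can be removed or capped (winsorized). A minimal sketch reusing the bounds computed above:
# Keep only rows inside the IQR bounds
no_outliers <- data[data$column_name >= lower_bound &
  data$column_name <= upper_bound, ]
# Or cap extreme values at the bounds instead of dropping them
capped <- pmin(pmax(data$column_name, lower_bound), upper_bound)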
2) DATA TRANSFORMATION AND NORMALIZATION
Why Transform Data? Enhance model performance; meet assumptions of certain algorithms.
Common Transformations: Log, square root, z-score.
R Code:
# Log transformation (e.g., reduce skewness → Normal dist.)
log_data <- log(data$column_name)
# Square root transformation (e.g., reduce skewness in count data → Uniform dist.)
sqrt_data <- sqrt(data$column_name)
# Z-score normalization (e.g., to create mean = 0 and sd = 1)
z_score <- scale(data$column_name)
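A quick way to check whether a transformation helped is to compare the distribution before and after. A minimal sketch in base R, assuming column_name holds positive values:
# Side-by-side histograms: raw vs. log-transformed
par(mfrow = c(1, 2))
hist(data$column_name, main = "Raw")
hist(log(data$column_name), main = "Log-transformed")
par(mfrow = c(1, 1))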
3) DATA VISUALIZATION
Histograms and Box Plots: Understand data distribution.
Scatter Plots: Visualize bivariate relationships.
Heatmaps: Show correlations (a sketch follows the scatter-plot code below).
R Code (for a simple scatter plot):
plot(data$column1, data$column2, main = "Scatter Plot of Column1 vs Column2",
  xlab = "Column1", ylab = "Column2")
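For the correlation heatmap mentioned above, a minimal base-R sketch (selecting numeric columns first is an assumption, not from the slides):
# Correlation matrix of numeric columns, drawn as a heatmap
numeric_cols <- data[sapply(data, is.numeric)]
heatmap(cor(numeric_cols, use = "complete.obs"))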
4) STATISTICAL ANALYSIS (NEXT TIME)
- Descriptive Statistics: Summarize main features of data.
- Inferential Statistics: Make predictions or inferences.
- Testing Hypotheses: Determine validity of certain claims.
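As a brief preview only (standard base-R functions; column_name and the reference value mu = 0 are placeholders):
# Descriptive statistics for every column
summary(data)
# One-sample t-test: compare a column's mean against a hypothesized value
t.test(data$column_name, mu = 0)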
DATA VISUALIZATION
- Line plot
- Bar plots
- Box plots
- Density plots
- Scatter plots
- Word clouds
- Pie chart
- Raincloud plot
- Heatmap
- Animated plots
- (Interactive plots)
LINE PLOT
library(ggplot2)
ggplot(data = df, aes(x = date, y = unemploy)) +
  geom_line()
BAR PLOT
ggplot(data = df, aes(x = class)) +
  geom_bar()
BOX PLOT
ggplot(data = df, aes(x = "Distance measure", y = temperature)) +
  geom_boxplot()
DENSITY PLOT
ggplot(data = df, aes(x = X, fill = cut)) +
geom_density(alpha = 0.5)
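SCATTER PLOT
A ggplot2 counterpart to the base-R scatter plot shown earlier (column1 and column2 are placeholder names):
library(ggplot2)
ggplot(data = df, aes(x = column1, y = column2)) +
  geom_point()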