Module 2B Exploring Data Visually Flashcards
What is Exploratory Data Analysis (EDA)?
EDA is the process of understanding data through the heavy use of descriptive statistics and visualization.
What are the objectives of Exploratory Data Analysis?
The objectives include detecting errors, missing values, and unusual values; characterizing the distribution of values for each variable; and identifying patterns and relationships between variables.
What are the challenges associated with Exploratory Data Analysis?
Challenges include dealing with tall data (many rows) and wide data (many columns).
What are ordinal variables?
Ordinal variables are categorical variables that have a natural ordering.
How can the strength of a correlation between two variables be visually gauged?
The strength can be gauged by how closely data points cluster around the linear trendline.
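The visual impression of how tightly points hug the trendline can be backed up with a number. The sketch below is not part of the module; it assumes Pearson's correlation coefficient as the numeric counterpart, and the data and the helper name `pearson_r` are hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical data: the same x values paired with two different y series
hours = [1, 2, 3, 4, 5]
tight = [2, 4, 6, 8, 10]   # points fall exactly on a straight line
loose = [2, 9, 3, 10, 6]   # points scatter widely around the trend

r_tight = pearson_r(hours, tight)
r_loose = pearson_r(hours, loose)
print(f"tight cluster: r = {r_tight:.2f}")
print(f"loose cluster: r = {r_loose:.2f}")
```

A value of r close to 1 or -1 corresponds to points clustered tightly around the trendline; a value near 0 corresponds to a loose scatter.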
What is a spurious relationship in data analysis?
A spurious relationship occurs when two variables appear to be related but are not actually connected, often because of a lurking variable or sample bias.
How can data be organized to facilitate exploratory analysis?
Techniques include creating and using Excel tables, applying filters, sorting values, and creating new variables for summary statistics.
What does univariate analysis focus on?
It focuses on the distribution of values within a single variable using methods like frequency tables, histograms, and box-and-whisker plots.
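As a rough illustration of two of these methods outside Excel, the sketch below builds a frequency table for a categorical variable and the five-number summary that underlies a box-and-whisker plot. The data are hypothetical, and the 1.5 × IQR fence is the common box-plot convention for flagging unusual values, assumed here rather than taken from the module.

```python
from collections import Counter
from statistics import quantiles

# Hypothetical categorical variable: a frequency table counts each value
ratings = ["low", "high", "medium", "low", "low", "medium"]
freq = Counter(ratings)
for value, count in freq.most_common():
    print(f"{value:<8}{count}")

# Hypothetical numeric variable: five-number summary behind a box plot
values = [12, 15, 11, 19, 15, 14, 40, 13]
q1, q2, q3 = quantiles(values, n=4)  # quartile cut points (default method)
five_num = (min(values), q1, q2, q3, max(values))
print("min, Q1, median, Q3, max:", five_num)

# Assumed convention: values beyond 1.5 * IQR from the quartiles are unusual
iqr = q3 - q1
outliers = [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]
print("possible outliers:", outliers)
```

Different tools compute quartiles slightly differently, so Q1 and Q3 from this sketch may not match Excel's values exactly.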
What is crosstabulation used for in data analysis?
Crosstabulation is used to compare two or more variables using PivotTables and PivotCharts.
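The counting that a PivotTable performs for two categorical variables can be sketched in a few lines. This is a Python stand-in for illustration only, not the Excel workflow the module describes; the records and variable names are hypothetical.

```python
from collections import Counter

# Hypothetical records: (region, plan) for each customer
records = [
    ("East", "Basic"), ("East", "Premium"), ("West", "Basic"),
    ("East", "Basic"), ("West", "Premium"), ("West", "Premium"),
]

# Count how often each (row value, column value) pair occurs
counts = Counter(records)
rows = sorted({region for region, _ in records})
cols = sorted({plan for _, plan in records})

# Arrange the counts as a table: one row per region, one column per plan
table = {r: {c: counts[(r, c)] for c in cols} for r in rows}

print(f"{'':<8}" + "".join(f"{c:<10}" for c in cols))
for r in rows:
    print(f"{r:<8}" + "".join(f"{table[r][c]:<10}" for c in cols))
```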
What is legitimately missing data?
Missing data are deemed legitimate when they occur naturally, so no remedial action is taken.
What features can be distinguished in time-series data using line charts?
Features include trends, variability, and seasonality.
What are the two general types of geographic visualizations?
Choropleth maps, which use colors or shading to represent data values for geographic areas, and cartograms, which resize or distort areas so that their size reflects a data value rather than true land area.
What is considered illegitimately missing data?
When the missing data do not occur naturally, they are deemed illegitimate.
What are the options to address illegitimately missing data?
Discard the observations, estimate the missing values, or treat missing values as a separate category for a categorical variable.
What are the categories of illegitimately missing data?
Missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR).
What are the remedial actions for each category of illegitimately missing data?
MCAR = discard the observations or replace the missing values with the mean, median, or mode
MAR = estimate the missing values by using other values in the observation
MNAR = consider removing the variable
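A minimal sketch of the simpler remedies, assuming missing values are stored as `None`: discarding observations, replacing missing values with the median (the mean or mode would work the same way), and treating missing as its own category. The data and the "Unknown" label are hypothetical.

```python
from statistics import median

# Hypothetical numeric variable with missing values recorded as None
ages = [34, None, 29, 41, None, 38]

# Option 1: discard the observations that have a missing value
complete = [a for a in ages if a is not None]

# Option 2: replace each missing value with the median of the observed values
fill = median(complete)
imputed = [fill if a is None else a for a in ages]

# Hypothetical categorical variable: treat missing as a separate category
colors = ["red", None, "blue", "red", None]
labeled = ["Unknown" if c is None else c for c in colors]

print("complete cases:", complete)
print("median-imputed:", imputed)
print("with category: ", labeled)
```

Which option is appropriate depends on why the values are missing, as the categories above indicate.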