Data Analysis Flashcards
Exploratory Data Analysis
The process of examining a dataset to find insights and patterns.
Data Cleaning
EDA helps in identifying errors, missing values, and outliers that need to be addressed before deeper analysis.
Insight Generation
It provides a first look at the data, allowing analysts to develop insights and hypotheses for testing.
Model Preparation
By understanding relationships and patterns, EDA informs feature selection and model building strategies.
Assumption Validation
Many statistical models and machine learning algorithms require assumptions about data distribution or variable relationships. EDA validates these assumptions upfront.
Communication
EDA techniques generate visualizations and summaries that communicate findings effectively to stakeholders.
Common data formats
CSV/TSV: Tabular data, easy to work with.
JSON: Nested dictionary format, ideal for hierarchical data.
XML/HTML: Nested data, commonly encountered when web scraping.
Log data: Unstructured text, parsed with regular expressions.
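As a rough illustration of how these formats differ in practice, the sketch below reads one small sample of each using only the Python standard library. The sample strings, the field names, and the log layout (an Apache-style access line) are assumptions made up for this example, not data from the course.

```python
import csv
import io
import json
import re

# Hypothetical sample inputs, invented for illustration.
csv_text = "id,city\n1,Oakland\n2,Berkeley\n"
json_text = '{"id": 1, "location": {"city": "Oakland", "zip": "94607"}}'
log_line = '127.0.0.1 - - [10/Oct/2023:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326'

# CSV: each row becomes a flat dict keyed by the header line.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# JSON: nested objects become nested dicts.
record = json.loads(json_text)
city = record["location"]["city"]

# Log data: a regular expression pulls structured fields out of free text.
pattern = re.compile(
    r'\[(?P<timestamp>[^\]]+)\] "(?P<method>\w+) (?P<path>\S+)[^"]*" (?P<status>\d{3})'
)
match = pattern.search(log_line)
fields = match.groupdict() if match else {}

print(rows)
print(city)
print(fields)
```

Note that CSV values arrive as strings, so numeric columns still need an explicit conversion before analysis.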
Granularity
The level of detail represented by each record in a dataset.
Scope
The coverage of the dataset in relation to what we are interested in analyzing.
Temporality
The timing aspects of data, crucial for understanding when events occurred.
Two different types of scope
Temporal scope: The dataset spans a limited time period.
Geographical scope: The dataset is geographically focused on certain hotspots.
Four aspects of temporality
Date and time fields - In the calls and stops datasets, datetime fields mark when police interactions were reported.
Timezone awareness - Consider the time zone and daylight saving time for accurate temporal analysis.
Date format - The US datetime format (MM/DD/YYYY) is used; knowing this is essential for correct interpretation.
Placeholder dates - Watch for default timestamps that may indicate missing values.
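The four aspects above can be sketched in a few lines of standard-library Python. The raw strings, the set of placeholder dates, and the fixed UTC-8 offset are all assumptions chosen for illustration; real data that observes daylight saving time would need a named time zone (for example via the `zoneinfo` module) rather than a fixed offset.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical raw values; the last one is a typical placeholder date.
raw = ["03/04/2021 14:30", "12/31/2020 23:45", "01/01/1900 00:00"]

# Assumed placeholder timestamps that stand in for missing values.
PLACEHOLDERS = {datetime(1900, 1, 1), datetime(1970, 1, 1)}

# Assumption: timestamps were recorded at a fixed UTC-8 offset (no DST).
LOCAL = timezone(timedelta(hours=-8))

def parse_us(text):
    """Parse MM/DD/YYYY HH:MM, returning None for placeholder dates."""
    naive = datetime.strptime(text, "%m/%d/%Y %H:%M")
    if naive in PLACEHOLDERS:
        return None
    return naive.replace(tzinfo=LOCAL)

parsed = [parse_us(t) for t in raw]

# Convert to UTC so records from different zones are comparable.
as_utc = [p.astimezone(timezone.utc) if p else None for p in parsed]

print(parsed)
print(as_utc)
```

Being explicit about the format string matters: "03/04/2021" is March 4 under MM/DD/YYYY but would be April 3 under a day-first convention.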
Faithfulness
A dataset is considered faithful if it accurately reflects reality, crucial for reliable analysis.
Common faithfulness issues
Unrealistic values - Future dates, nonexistent locations, negative counts, or significant outliers.
Dependency violations - Mismatch between related fields, like age and birthdate.
Manual data entry errors - Prone to spelling mistakes and inconsistencies.
Data falsification indicators - Repeated unusual names or email addresses, suggesting fabricated entries.
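Several of these issues can be screened for with simple rule-based checks. The following is a minimal sketch, assuming a hypothetical list of records with `birthdate`, `age`, `visits`, and `email` fields and an assumed reference date `AS_OF`; none of these names come from a real dataset.

```python
from collections import Counter
from datetime import date

# Hypothetical records, invented for illustration.
records = [
    {"email": "ana@example.com", "birthdate": date(1990, 5, 1), "age": 33, "visits": 4},
    {"email": "test@example.com", "birthdate": date(2000, 1, 1), "age": 50, "visits": 2},
    {"email": "test@example.com", "birthdate": date(2090, 1, 1), "age": 0, "visits": -1},
]
AS_OF = date(2023, 6, 1)  # assumed date on which ages were recorded

def age_on(birthdate, as_of):
    """Whole years elapsed between birthdate and as_of."""
    before_birthday = (as_of.month, as_of.day) < (birthdate.month, birthdate.day)
    return as_of.year - birthdate.year - before_birthday

issues = []
for i, r in enumerate(records):
    if r["birthdate"] > AS_OF:
        issues.append((i, "future date"))  # unrealistic value
    elif age_on(r["birthdate"], AS_OF) != r["age"]:
        issues.append((i, "age does not match birthdate"))  # dependency violation
    if r["visits"] < 0:
        issues.append((i, "negative count"))  # unrealistic value

# Repeated email addresses can hint at fabricated or duplicated entries.
email_counts = Counter(r["email"] for r in records)
repeated = [email for email, n in email_counts.items() if n > 1]

print(issues)
print(repeated)
```

Checks like these only flag candidates; whether a flagged record is actually wrong still needs a human judgment.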
Quantitative data plots
Histograms, box plots, scatter plots
Qualitative data plots
Bar plots, dot plots, mosaic plots
Mixed data type plots
Overlaid density curves, side-by-side box plots
Rug plots
These plots show individual data points as marks along an axis, like threads on a rug. They’re helpful for a small number of observations but become cluttered with larger datasets.
Two quantitative features in a relationship
Use scatter plots to explore relationships, looking for linear or nonlinear patterns.
One qualitative and one quantitative variable in a relationship
Divide data into groups based on the qualitative feature and compare the quantitative distribution across these groups.
Density curves, box plots, and violin plots help compare distributions, highlighting differences in spread, central tendency, and outliers.
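The numbers behind a side-by-side box plot can be computed directly, which makes the group comparison concrete. This sketch assumes hypothetical `(group, value)` pairs and uses the "inclusive" quartile method from Python's `statistics` module; other tools may use slightly different quartile definitions.

```python
from collections import defaultdict
from statistics import median, quantiles

# Hypothetical (group, value) pairs, invented for illustration.
data = [
    ("A", 10), ("A", 12), ("A", 14), ("A", 16),
    ("B", 20), ("B", 22), ("B", 30), ("B", 40),
]

# Divide the quantitative values into groups by the qualitative feature.
groups = defaultdict(list)
for label, value in data:
    groups[label].append(value)

# Summarize center (median) and spread (interquartile range) per group.
summary = {}
for label, values in groups.items():
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    summary[label] = {"median": median(values), "iqr": q3 - q1}

print(summary)
```

Here group B has both a higher median and a wider interquartile range than group A, which is the kind of difference a box plot would show at a glance.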
Two qualitative features in a relationship
Compare the distribution of one feature across subgroups defined by another feature, focusing on proportions.
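Comparing proportions rather than raw counts matters when subgroups differ in size. A minimal sketch of such a comparison, assuming hypothetical `(subgroup, category)` observations:

```python
from collections import Counter, defaultdict

# Hypothetical (subgroup, category) observations, invented for illustration.
obs = [
    ("north", "yes"), ("north", "yes"), ("north", "no"),
    ("south", "yes"), ("south", "no"), ("south", "no"), ("south", "no"),
]

# Count categories within each subgroup (a small contingency table).
counts = defaultdict(Counter)
for subgroup, category in obs:
    counts[subgroup][category] += 1

# Normalize each subgroup's counts so its proportions sum to 1.
proportions = {
    subgroup: {cat: n / sum(c.values()) for cat, n in c.items()}
    for subgroup, c in counts.items()
}

print(proportions)
```

These per-subgroup proportions are what a mosaic plot or a normalized bar plot would display.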
Scatter matrix
A grid of scatter plots, one for each pair of variables. It shows how the variables relate to one another and can also help reveal nonlinear relationships.
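One reason to look at scatter plots instead of relying on a correlation coefficient alone is that the Pearson correlation only measures linear association. The sketch below uses made-up data and a hand-written `pearson` helper (a hypothetical name, written out here to keep the example self-contained) to show a perfectly dependent but nonlinear pair with a correlation of zero.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical data, invented for illustration.
xs = [-2, -1, 0, 1, 2]
linear = [2 * x + 1 for x in xs]   # straight-line relationship
curved = [x ** 2 for x in xs]      # U-shaped relationship

r_linear = pearson(xs, linear)
r_curved = pearson(xs, curved)

print(r_linear, r_curved)
```

The curved series is fully determined by `xs`, yet its correlation with `xs` is zero; a scatter plot would show the U shape immediately.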