Data Analysis Flashcards
Exploratory Data Analysis
The process of examining a dataset to find insights and patterns.
Data Cleaning
EDA helps in identifying errors, missing values, and outliers that need to be addressed before deeper analysis.
Insight Generation
It provides a first look at the data, allowing analysts to develop insights and hypotheses for testing.
Model Preparation
By understanding relationships and patterns, EDA informs feature selection and model building strategies.
Assumption Validation
Many statistical models and machine learning algorithms make assumptions about data distributions or relationships between variables. EDA lets you check these assumptions up front.
Communication
EDA techniques generate visualizations and summaries that communicate findings effectively to stakeholders.
Common Data Formats
CSV/TSV: Tabular data, easy to work with.
JSON: Nested dictionary format, ideal for hierarchical data.
XML/HTML: Nested data, often encountered when web scraping.
Log data: Unstructured text, parsed with regular expressions.
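A minimal sketch of reading three of these formats with Python's standard library. The sample strings, field names, and the log layout are hypothetical, chosen only for illustration; real files will differ.

```python
import csv
import io
import json
import re

# Hypothetical sample data, inlined so the sketch is self-contained.
csv_text = "id,city\n1,Springfield\n2,Riverton\n"
json_text = '{"stop": {"id": 1, "location": {"city": "Springfield"}}}'
log_line = '127.0.0.1 - - [10/Oct/2023:13:55:36 -0700] "GET /index.html HTTP/1.1" 200'

# CSV: each row becomes a dict keyed by the header line (values stay strings).
rows = list(csv.DictReader(io.StringIO(csv_text)))

# JSON: nested objects become nested dicts, so hierarchy is preserved.
record = json.loads(json_text)
city = record["stop"]["location"]["city"]

# Log data: pull named fields out of free text with a regular expression.
pattern = re.compile(r'\[(?P<timestamp>[^\]]+)\] "(?P<method>\w+) (?P<path>\S+)')
match = pattern.search(log_line)
fields = match.groupdict() if match else {}

print(rows[0]["city"], city, fields.get("path"))
```

Note that the CSV reader returns every value as a string, so numeric columns need an explicit conversion before analysis.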
Granularity
The level of detail represented by each record in a dataset.
Scope
The coverage of the dataset in relation to what we are interested in analyzing.
Temporality
Refers to the timing aspects of data, crucial for understanding when events occurred.
Two different types of scope
Temporal scope: The dataset spans a limited period of time.
Geographical scope: The dataset is geographically focused on certain hotspots.
Four aspects of temporality
Date and time fields - In the calls and stops datasets, datetime fields mark when police interactions were reported.
Timezone awareness - Consider the timezone and daylight saving time for accurate temporal analysis.
Date format - The US datetime format (MM/DD/YYYY) is used; knowing this is essential for correct interpretation.
Placeholder dates - Watch for default timestamps that may indicate missing values.
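A minimal sketch of these temporality checks using only the standard library. The sample strings, the placeholder dates (1900-01-01 and 1970-01-01), and the fixed UTC-8 offset are all assumptions for illustration; a fixed offset does not account for daylight saving time, which a real analysis would need a timezone database for.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical raw values; the last one looks like a placeholder date.
raw = ["03/04/2021 14:30", "12/25/2020 09:00", "01/01/1900 00:00"]

US_FORMAT = "%m/%d/%Y %H:%M"  # MM/DD/YYYY, month first
PLACEHOLDERS = {datetime(1900, 1, 1), datetime(1970, 1, 1)}  # assumed defaults

# Date format: parse with an explicit format instead of guessing.
parsed = [datetime.strptime(value, US_FORMAT) for value in raw]

# Reading the same string day-first silently yields a different date.
day_first = datetime.strptime(raw[0], "%d/%m/%Y %H:%M")
print(parsed[0].date(), "vs", day_first.date())

# Placeholder dates: treat default timestamps as missing values.
cleaned = [None if ts in PLACEHOLDERS else ts for ts in parsed]

# Timezone awareness: attach the assumed local offset, then convert to UTC.
local = timezone(timedelta(hours=-8))
as_utc = [
    ts.replace(tzinfo=local).astimezone(timezone.utc) if ts else None
    for ts in cleaned
]
print(as_utc)
```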
Faithfulness
A dataset is considered faithful if it accurately reflects reality, which is crucial for reliable analysis.
Common faithfulness issues
Unrealistic values - Future dates, nonexistent locations, negative counts, or extreme outliers.
Dependency violations - Mismatches between related fields, like age and birthdate.
Manual data entry errors - Hand-entered data is prone to spelling mistakes and inconsistencies.
Data falsification indicators - Repeated unusual names or email addresses, suggesting fabricated entries.
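A minimal sketch of how some of these checks might be automated. The records, field names, and the fixed reference date are hypothetical; which rules make sense depends on the dataset.

```python
from collections import Counter
from datetime import date

TODAY = date(2024, 1, 1)  # fixed reference date so the results are reproducible

# Hypothetical records with deliberately planted problems.
records = [
    {"email": "ana@example.com", "birthdate": date(1990, 5, 1), "age": 33, "count": 2},
    {"email": "test@example.com", "birthdate": date(2030, 1, 1), "age": 20, "count": 1},
    {"email": "test@example.com", "birthdate": date(1980, 7, 9), "age": 25, "count": -4},
]

def age_on(birthdate, today):
    """Age in whole years on the given day."""
    had_birthday = (today.month, today.day) >= (birthdate.month, birthdate.day)
    return today.year - birthdate.year - (0 if had_birthday else 1)

def check(record):
    """Return a list of faithfulness issues found in one record."""
    issues = []
    if record["birthdate"] > TODAY:
        issues.append("future date")  # unrealistic value
    elif age_on(record["birthdate"], TODAY) != record["age"]:
        issues.append("age/birthdate mismatch")  # dependency violation
    if record["count"] < 0:
        issues.append("negative count")  # unrealistic value
    return issues

issues = [check(r) for r in records]

# Repeated email addresses can hint at fabricated or duplicated entries.
email_counts = Counter(r["email"] for r in records)
repeated = [email for email, n in email_counts.items() if n > 1]

print(issues)
print(repeated)
```

A flagged record is a prompt to investigate, not proof that the value is wrong.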
Quantitative Data Plots
Histograms, box plots, scatter plots
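A minimal sketch of the numbers behind a histogram and a box plot, computed without a plotting library. The data values and the bin width are hypothetical, and the 1.5 x IQR rule for flagging outliers is a common box plot convention assumed here rather than something these cards specify. Scatter plots, which pair two variables, are not covered.

```python
from collections import Counter
from statistics import quantiles

# Hypothetical measurements with one unusually large value.
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 12]

# Histogram: count how many values fall into each fixed-width bin.
BIN_WIDTH = 2
bins = Counter((x // BIN_WIDTH) * BIN_WIDTH for x in data)
for start in sorted(bins):
    print(f"[{start:>2}, {start + BIN_WIDTH:>2}) {'#' * bins[start]}")

# Box plot: quartiles, interquartile range, and outlier fences.
q1, median, q3 = quantiles(data, n=4)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("quartiles:", q1, median, q3)
print("outliers:", outliers)
```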