Data Wrangling Flashcards
Data wrangling consists of…
Data preparation, data munging, data transformation
What is the curse of dimensionality and how to solve it?
- If there are many more dimensions than data points, the data become sparse
- Huge areas of the multi-dimensional space are empty
- Hard to draw conclusions
- Solution: dimensionality reduction, i.e. projecting the points onto a few cleverly chosen dimensions of this huge space.
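One common way to pick those few dimensions is principal component analysis (PCA). The sketch below is only an illustration, not the method these cards prescribe: the helper name pca_project, the random data, and the choice of two components are all assumptions.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto the top principal components (hypothetical helper)."""
    # Center each column so the components describe variance around the mean.
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

# Made-up data: 20 points in 100 dimensions, i.e. far more dimensions than points.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 100))

Z = pca_project(X, n_components=2)
print(Z.shape)
```

Each row of Z is one original point expressed in only two coordinates, which makes plotting and distance-based reasoning practical again.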
Wrangling Stages
a. Raw: data ingestion and discovery — “unboxing”
- Data unboxing
What data do I have? What do I want to do with it?
Basic tools
- UNIX command line
- SublimeText editor
Trifacta: free visual data wrangling tool
- Codifies some good practices you can also follow by hand
Python’s Pandas library
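A first "unboxing" pass in Pandas could look like the sketch below. The inline CSV and its column names (id, city, temp_c) are invented for illustration; in practice the data would come from a real file.

```python
import io
import pandas as pd

# Made-up raw data standing in for a freshly received CSV file.
raw = io.StringIO(
    "id,city,temp_c\n"
    "1,Zurich,21.5\n"
    "2,Basel,\n"
    "3,Geneva,19.0\n"
)
df = pd.read_csv(raw)

print(df.shape)         # how many rows and columns do I have?
print(df.dtypes)        # what type did each column get parsed as?
print(df.head())        # what do the first few records look like?
print(df.isna().sum())  # how many values are missing per column?
```

These four calls answer "What data do I have?" before any decision about what to do with it.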
b. Refined: curating data for reuse
- What: data warehousing, canonical models
- Who: data curators, IT engineers, actuaries …
c. Production: ensuring feeds and workflows
- What: recurrent, automated use cases
- Who: often involves SW engineers and IT/ops folks
Data Wrangling issues
- Structure: the shape of the data
- Regular data: easy to access, filter, and tabulate. Two variants: relations (tables, data frames) and matrices (manipulate with algebra)
- Granularity: how fine/coarse is each datum
- Faithfulness: how well does the data capture “reality”. Related topics: outliers, functional dependencies, correlations, good data cleaning
- Temporality: how is the data situated in time
- Scope: how (in)complete is the data
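Some of these issues can be probed with a few lines of Pandas. The sketch below is an illustration under stated assumptions: the toy sales table is invented, and the 1.5 x IQR rule is just one common outlier heuristic, not one named in these cards.

```python
import pandas as pd

# Made-up table: one row per sale, with a suspicious value and a missing value.
df = pd.DataFrame({
    "day": ["mon", "mon", "tue", "tue", "wed"],
    "sales": [10.0, 12.0, 11.0, 500.0, None],
})

# Scope: what fraction of each column is missing?
missing = df.isna().mean()

# Faithfulness: flag outliers with the 1.5 * IQR rule (an assumed heuristic).
q1, q3 = df["sales"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["sales"] < q1 - 1.5 * iqr) | (df["sales"] > q3 + 1.5 * iqr)]

# Granularity: coarsen from one row per sale to one row per day.
per_day = df.groupby("day", as_index=False)["sales"].sum()

print(missing)
print(outliers)
print(per_day)
```

Note that the per-day sum silently skips the missing value, which is itself a faithfulness question worth checking before the data is reused.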