Data Wrangling Flashcards

1
Q

Data wrangling consists of…

A

Data preparation, data munging, and data transformation

2
Q

What is the curse of dimensionality, and how can it be addressed?

A
  • If there are many more dimensions than data points, the data become sparse
  • Huge areas of the multi-dimensional space are empty
  • It becomes hard to draw conclusions
  • Solution: dimensionality reduction, i.e. projecting the points onto a few cleverly chosen dimensions of this huge space
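
Worked example (a minimal sketch, not part of the original card). The first function illustrates sparsity: it splits every axis of the unit cube in half and counts how many of the resulting cells contain at least one random point. The second approximates one "cleverly chosen" dimension, the direction of largest variance, by power iteration. That is one possible reduction technique, chosen here as an assumption; the card does not name a method. The function names and sample data are hypothetical.

```python
import random


def occupied_fraction(n_points, n_dims, seed=0):
    """Fraction of the 2**n_dims half-axis cells holding at least one point."""
    rng = random.Random(seed)
    # A uniform point falls in the lower or upper half of each axis with
    # probability 0.5, so a cell is identified by one boolean per dimension.
    cells = {
        tuple(rng.random() < 0.5 for _ in range(n_dims))
        for _ in range(n_points)
    }
    return len(cells) / 2 ** n_dims


def project_to_main_direction(points, n_iter=100):
    """Project points onto their direction of largest variance.

    Uses power iteration on the (unnormalised) covariance matrix.
    Assumes the points are not all identical.
    """
    n, d = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - means[j] for j in range(d)] for p in points]
    v = [1.0] * d  # arbitrary starting direction
    for _ in range(n_iter):
        # w = X^T (X v), which is proportional to covariance @ v
        scores = [sum(row[j] * v[j] for j in range(d)) for row in centered]
        w = [sum(scores[i] * centered[i][j] for i in range(n)) for j in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    coords = [sum(row[j] * v[j] for j in range(d)) for row in centered]
    return v, coords


# With 100 points, the occupied share of the space collapses as dimensions grow.
for dims in (2, 5, 10, 20):
    print(dims, occupied_fraction(100, dims))

# Points lying on a line in 2-D are fully described by one coordinate.
line_points = [(3.0 * t, 4.0 * t) for t in range(5)]
direction, coords = project_to_main_direction(line_points)
```

With 20 dimensions there are over a million cells, so 100 points can occupy at most a tiny fraction of them, which is the sparsity the card describes.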
3
Q

Wrangling Stages

A

a. Raw: data ingestion and discovery (“unboxing”)
 - Data unboxing: What data do I have? What do I want to do with it?
 - Basic tools: UNIX command line, SublimeText editor
 - Trifacta: free visual data wrangling tool; codifies some good practices you can also follow by hand
 - Python’s Pandas library
b. Refined: curating data for reuse
 - What: data warehousing, canonical models
 - Who: data curators, IT engineers, actuaries …
c. Production: ensuring feeds and workflows
 - What: recurrent, automated use cases
 - Who: often involves SW engineers and IT/ops folks
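
Worked example for the raw stage (a minimal sketch, not part of the original card). It answers "What data do I have?" for a small CSV by counting rows and, per column, missing and distinct values. The sample data and the `unbox` function are hypothetical. Python's standard `csv` module is used here so the sketch runs without extra dependencies; the Pandas library named on the card would be a natural alternative.

```python
import csv
import io

# Hypothetical sample; in practice this would be a file on disk.
RAW = """id,city,temp_c
1,Oslo,4.5
2,Lima,
3,Oslo,5.1
"""


def unbox(text):
    """Return the row count and, per column, missing and distinct counts."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    summary = {}
    for col in reader.fieldnames or []:
        values = [row[col] for row in rows]
        # Treat empty strings and absent trailing fields as missing.
        present = [v for v in values if v not in ("", None)]
        summary[col] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
    return len(rows), summary


n_rows, profile = unbox(RAW)
for column, stats in profile.items():
    print(column, stats)
```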

4
Q

Data Wrangling issues

A
  • Structure: the shape of the data
      - Regular data: easy to access, filter, tabulate
      - Two variants: relations (tables, data frames) and matrices (manipulated with algebra)
  • Granularity: how fine or coarse each datum is
  • Faithfulness: how well the data captures “reality”
      - Outliers
      - Functional dependencies
      - Correlations
      - Good data cleaning
  • Temporality: how is the data situated in time
  • Scope: how (in)complete is the data
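
Worked example (a minimal sketch, not part of the original card). It runs one simple check per issue on a small relation: granularity (is there exactly one row per sensor and day?), faithfulness (outliers), temporality (date range covered), and scope (missing values). The records, field names, and the median-absolute-deviation rule with a threshold of 5 are all assumptions made for illustration.

```python
from datetime import date
from statistics import median

# Hypothetical relation: (sensor_id, reading_date, value); None means missing.
records = [
    ("a", date(2024, 1, 1), 10.0),
    ("a", date(2024, 1, 2), 11.0),
    ("b", date(2024, 1, 1), 9.5),
    ("b", date(2024, 1, 2), None),
    ("b", date(2024, 1, 3), 250.0),
    ("a", date(2024, 1, 3), 10.5),
]


def flag_outliers(values, threshold=5.0):
    """Return values further than threshold * MAD from the median."""
    med = median(values)
    mad = median(abs(v - med) for v in values)  # median absolute deviation
    # If mad is 0 (most values identical), nothing is flagged.
    return [v for v in values if mad and abs(v - med) > threshold * mad]


observed = [v for _, _, v in records if v is not None]
dates = [d for _, d, _ in records]

report = {
    # Granularity: each (sensor, day) pair should appear exactly once.
    "one_row_per_sensor_day": len({(s, d) for s, d, _ in records}) == len(records),
    # Faithfulness: readings that look implausible next to the rest.
    "outliers": flag_outliers(observed),
    # Temporality: the period the data covers.
    "date_range": (min(dates), max(dates)),
    # Scope: how many readings are missing.
    "missing_values": sum(v is None for _, _, v in records),
}

for check, result in report.items():
    print(check, result)
```

A flagged value is only a candidate for review; whether 250.0 is an error or a real event is a judgment the data alone cannot settle.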