Data Wrangling Flashcards

Question 1

Q

Data wrangling consists of…

Answer

A

Data preparation, data munging, and data transformation

Question 2

Q

What is the curse of dimensionality, and how can it be addressed?

Answer

A

If there are many more dimensions than data points, the data become sparse.
Huge areas of the multi-dimensional space are empty.
It is hard to draw conclusions.
Solution: dimensionality reduction, i.e. flattening the points onto a few cleverly chosen dimensions in this huge space.
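
As a rough illustration of the "flattening" idea, here is a minimal sketch of principal component analysis (PCA), one common dimensionality reduction technique, written with NumPy's SVD. The card does not name a specific method, so PCA is an assumption here; the helper `pca_project` and the synthetic data are hypothetical.

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X onto the top-k principal directions (via SVD)."""
    Xc = X - X.mean(axis=0)  # center each column
    # Rows of Vt are the principal directions, ordered by singular value
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)  # share of variance per component
    return Xc @ Vt[:k].T, explained[:k]

rng = np.random.default_rng(0)

# Hypothetical data: 50 points in 200 dimensions (far more dimensions than
# points) that really vary along only 2 latent directions, plus small noise
latent = rng.normal(size=(50, 2))
mixing = rng.normal(size=(2, 200))
X = latent @ mixing + 0.01 * rng.normal(size=(50, 200))

Z, explained = pca_project(X, k=2)
print(Z.shape)          # 50 points, now described by 2 coordinates each
print(explained.sum())  # share of total variance kept by the 2 dimensions
```

In this constructed case two dimensions retain almost all of the variance, because the data were built that way; real data usually need more components and lose some information.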

Question 3

Q

Wrangling Stages

Answer

A

a. Raw: data ingestion and discovery (“unboxing”)
- Data unboxing: What data do I have? What do I want to do with it?
- Basic tools: UNIX command line, SublimeText editor
- Trifacta: free visual data wrangling tool; codifies some good practices you can also follow by hand
- Python’s Pandas library
b. Refined: curating data for reuse
- What: data warehousing, canonical models
- Who: data curators, IT engineers, actuaries …
c. Production: ensuring feeds and workflows
- What: recurrent, automated use cases
- Who: often involves SW engineers and IT/ops folks
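
The "unboxing" questions of the raw stage can be sketched with pandas roughly as follows. The CSV contents and column names are hypothetical and inlined only so the example is self-contained; in practice the data would come from a real file.

```python
import io
import pandas as pd

# Hypothetical raw file contents, inlined so the sketch is self-contained
raw = io.StringIO(
    "id,city,amount\n"
    "1,Zurich,10.5\n"
    "2,Geneva,\n"
    "3,Zurich,7.0\n"
)

df = pd.read_csv(raw)

# "Unboxing": what data do I have?
print(df.shape)         # number of rows and columns
print(df.dtypes)        # inferred type of each column
print(df.isna().sum())  # missing values per column
print(df.head())        # first few records
```

These four calls are only a starting point; they answer "what do I have?" before any decision about what to do with it.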

Question 4

Q

Data Wrangling issues

Answer

A

Structure: the shape of the data
- Regular data: easy to access, filter, tabulate
- Two variants: relations (tables, data frames) and matrices (manipulate with algebra)
Granularity: how fine/coarse is each datum
Faithfulness: how well does the data capture “reality”
- Outliers
- Functional dependencies
- Correlations
- Good data cleaning
Temporality: how is the data situated in time
Scope: how (in)complete is the data
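
Two of the faithfulness checks listed above, outliers and functional dependencies, might be sketched in pandas like this. The table, the column names, the 1.5 × IQR outlier rule, and the assumed dependency `zip -> city` are all illustrative assumptions, not part of the original card.

```python
import pandas as pd

# Hypothetical table; values are made up for illustration
df = pd.DataFrame({
    "zip":    ["8001", "8001", "1201", "1201", "8001"],
    "city":   ["Zurich", "Zurich", "Geneva", "Lausanne", "Zurich"],
    "amount": [10.0, 12.0, 11.0, 9.0, 500.0],
})

# Outliers: flag values more than 1.5 * IQR beyond the quartiles
# (one common rule of thumb, assumed here)
q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["amount"] < q1 - 1.5 * iqr) | (df["amount"] > q3 + 1.5 * iqr)]

# Functional dependency zip -> city: each zip should map to exactly one city
cities_per_zip = df.groupby("zip")["city"].nunique()
fd_violations = cities_per_zip[cities_per_zip > 1].index.tolist()

print(outliers)
print(fd_violations)
```

A flagged row is a candidate for inspection, not automatically an error: an outlier may be a genuine extreme value, and a dependency violation may mean the assumed rule is wrong rather than the data.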
