15_Dataprep Flashcards
1
Q
What is Dataprep?
- Explore, clean, and prepare data
- Built in partnership with Trifacta as a data cleaning/processing service
- Fully managed, serverless and web-based
- User-friendly interface
- Clean data by clicking on it
- Visually define transformations
- Export to Cloud Dataflow
- Supported file types
- Input: CSV, JSON (including nested), Plain text, Excel, LOG, TSV and Avro
- Output: CSV, JSON, Avro, BigQuery table
- CSV/JSON can be compressed or uncompressed
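To make the input notes concrete, here is a minimal sketch of what "compressed or uncompressed CSV" and "nested JSON" mean in practice. It is plain Python, not Dataprep's own code; the helper names `read_csv_bytes` and `flatten` and the sample data are assumptions made for illustration.

```python
import csv
import gzip
import io
import json


def read_csv_bytes(raw: bytes) -> list[dict]:
    """Parse CSV bytes, transparently handling gzip compression."""
    # gzip streams start with the magic bytes 0x1f 0x8b.
    if raw[:2] == b"\x1f\x8b":
        raw = gzip.decompress(raw)
    return list(csv.DictReader(io.StringIO(raw.decode("utf-8"))))


def flatten(record: dict, prefix: str = "") -> dict:
    """Flatten nested JSON objects into dotted column names."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


# Hypothetical sample data: the same CSV, plain and gzip-compressed.
plain = b"id,city\n1,Oslo\n2,Lima\n"
rows_plain = read_csv_bytes(plain)
rows_gzip = read_csv_bytes(gzip.compress(plain))

# Hypothetical nested JSON record, flattened into tabular columns.
nested = json.loads('{"id": 1, "address": {"city": "Oslo", "geo": {"lat": 59.9}}}')
flat_row = flatten(nested)
```

Both CSV variants parse to the same rows, and the nested record becomes a single flat row with columns such as `address.city`, which is roughly the tabular shape a visual tool needs before transformations can be applied.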
A
2
Q
How it works
- Backed by Cloud Dataflow
- After preparation, Dataflow processes the data via an Apache Beam pipeline
- “User-friendly Dataflow pipeline”
- Dataprep process:
- Create a flow, which is a container object used to access and manipulate datasets
- Import dataset
- Transform sampled data with recipes
- Run Dataflow job on transformed dataset
- Export results (GCS, BigQuery)
- Intelligent suggestions:
- Selecting data often surfaces the best transformation suggestion automatically
- Recipes can be created manually; however, simple tasks (remove outliers, de-duplicate) should use auto-suggestions
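The process above (build a recipe on a sample, then run it on the full dataset) can be sketched in plain Python. This is only an illustration of the idea, not Dataprep or Apache Beam code; the function names, the `amount` column, the median-based outlier rule, and the sample data are all assumptions.

```python
import random
import statistics


def deduplicate(rows):
    """Drop exact duplicate rows, keeping the first occurrence."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def remove_outliers(rows, column="amount", max_ratio=5.0):
    """Drop rows whose value exceeds max_ratio times the median (assumed rule)."""
    median = statistics.median(row[column] for row in rows)
    return [row for row in rows if row[column] <= max_ratio * median]


def run_recipe(recipe, rows):
    """Apply each recipe step in order, like an ordered list of transforms."""
    for step in recipe:
        rows = step(rows)
    return rows


# Hypothetical dataset with one duplicate row and one obvious outlier.
dataset = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": 12.0},
    {"id": 2, "amount": 12.0},
    {"id": 3, "amount": 11.0},
    {"id": 4, "amount": 9000.0},
    {"id": 5, "amount": 9.0},
]

recipe = [deduplicate, remove_outliers]

# Step 1: develop and preview the recipe on a small sample.
sample = random.Random(0).sample(dataset, 4)
preview = run_recipe(recipe, sample)

# Step 2: run the same recipe over the full dataset (the "job").
result = run_recipe(recipe, dataset)
```

The key design point is that the recipe is defined once and is independent of the data it runs on, which is why it can be previewed cheaply on a sample and then executed at scale as a Dataflow job.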
A
3
Q
IAM
- Dataprep User: Run Dataprep in a project
- Dataprep Service Agent: Gives Trifacta necessary access to project resources
- Covers access to GCS buckets plus the Dataflow Developer and BigQuery User/Data Editor roles
- Necessary for cross-project access + GCE service account
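As a rough sketch, the two roles above can be pictured as bindings in a project IAM policy. The `role`/`members` binding structure is the standard IAM policy format; the role IDs `roles/dataprep.projectUser` and `roles/dataprep.serviceAgent` are my best understanding of the predefined role names and should be verified against current Google Cloud documentation. The member addresses and the `roles_for` helper are hypothetical.

```python
import json

# Hypothetical principals, for illustration only.
PROJECT_USER = "user:analyst@example.com"
SERVICE_AGENT = "serviceAccount:dataprep-agent@example.iam.gserviceaccount.com"

# Role IDs are assumptions to verify against the IAM predefined roles list.
bindings = [
    {"role": "roles/dataprep.projectUser", "members": [PROJECT_USER]},
    {"role": "roles/dataprep.serviceAgent", "members": [SERVICE_AGENT]},
]


def roles_for(member, bindings):
    """Return the sorted list of roles granted to a member."""
    return sorted(b["role"] for b in bindings if member in b["members"])


# Serialize in the same shape as the "bindings" section of an IAM policy.
policy_fragment = json.dumps({"bindings": bindings}, indent=2)
```

Here the user role lets a person run Dataprep in the project, while the service agent binding is what allows the service itself to reach the project's storage, Dataflow, and BigQuery resources.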
A