15_Dataprep Flashcards

1
Q

What is Dataprep?

  • Service for exploring, cleaning, and preparing data
  • Built in partnership with Trifacta as a data cleaning/processing service
  • Fully managed, serverless and web-based
  • User-friendly interface
    • Clean data by clicking on it
    • Visually define transformations
  • Export to Cloud Dataflow
  • Supported file types
    • Input: CSV, JSON (including nested), Plain text, Excel, LOG, TSV and Avro
    • Output: CSV, JSON, Avro, BigQuery table
      • CSV/JSON can be compressed or uncompressed
A
2
Q

How it works

  • Backed by Cloud Dataflow
    • After the data is prepared, Dataflow runs the processing as an Apache Beam pipeline
    • “User-friendly Dataflow pipeline”
  • Dataprep process:
    • Create a flow, which is a container object used to access and manipulate datasets
    • Import dataset
    • Transform sampled data with recipes
    • Run Dataflow job on transformed dataset
    • Export results (GCS, BigQuery)
  • Intelligent suggestions:
    • Selecting data often brings up a fitting transformation suggestion automatically
    • Recipes can be created manually, but simple tasks (removing outliers, de-duplicating) should use the auto-suggestions
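
The process above (build a recipe against a sample, then run it on the whole dataset) can be sketched in plain Python. This is only an illustration of the idea, not Dataprep's or Dataflow's actual API: the names `deduplicate`, `remove_outliers`, `run_recipe`, the `amount` column, and the "10 times the median" outlier rule are all hypothetical choices made for the example.

```python
import random
from statistics import median

# Hypothetical recipe steps: each takes a list of row dicts and returns a new list.
def deduplicate(rows):
    """Drop exact duplicate rows, keeping the first occurrence."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def remove_outliers(rows, column="amount", factor=10):
    """Drop rows whose value exceeds `factor` times the column median (assumed rule)."""
    cutoff = factor * median(row[column] for row in rows)
    return [row for row in rows if row[column] <= cutoff]

def run_recipe(recipe, rows):
    """Apply every step of the recipe in order."""
    for step in recipe:
        rows = step(rows)
    return rows

# Stand-in for an imported dataset.
dataset = [
    {"id": 1, "amount": 10},
    {"id": 2, "amount": 12},
    {"id": 2, "amount": 12},    # exact duplicate
    {"id": 3, "amount": 11},
    {"id": 4, "amount": 5000},  # outlier
]

# The recipe is an ordered list of transformation steps.
recipe = [deduplicate, remove_outliers]

# Preview the recipe on a small sample while designing it.
sample = random.Random(0).sample(dataset, 3)
preview = run_recipe(recipe, sample)

# Then apply the same recipe to the full dataset (the step Dataflow performs at scale).
result = run_recipe(recipe, dataset)
print(result)
```

Because the preview only sees a sample, its output may differ from the full run; here the outlier cutoff depends on which rows happen to be sampled.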
A
3
Q

IAM

  • Dataprep User: Run Dataprep in a project
  • Dataprep Service Agent: Gives Trifacta necessary access to project resources
    • Includes access to GCS buckets plus the Dataflow Developer and BigQuery User/Data Editor roles
    • Cross-project access requires granting access to both this service agent and the GCE service account
A