Course 2 - Leveraging Unstructured Data with Cloud Dataproc on Google Cloud Platform Flashcards
In a boolean 2x2 matrix of analyzed vs. collected data, where is the focus of “big data”?
In the cell for data that has been collected but not yet analyzed. Think of an iceberg: most of it is unused and ignored.
Give some examples of unused data.
- Google Street View data.
- Emails.
- Parking footage.
- Purchase history.
What are some barriers to big data analysis?
- Unstructured data
- Data volumes that are too large
- Data quality
- Data streams that arrive too fast
Big data problems are often described as counting problems. What’s the difference between easy and hard counting problems?
Hard problems:
It is difficult to quantify “fitness”, e.g. in vision analysis or natural language processing.
Easy problems:
The problem itself is straightforward, but the amount of data is large.
Is one petabyte large?
It depends on the data type and the available funds. A petabyte is a lot of text, but not necessarily a lot of pictures or video.
However, a large amount of data does not necessarily mean a long processing time.
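As a rough illustration of why size alone does not determine processing time, the sketch below estimates how long it takes to scan one petabyte. The throughput of 100 MB/s per worker and the worker counts are assumptions picked for the arithmetic, and the model assumes the work splits perfectly across workers.

```python
# Back-of-the-envelope scan time for one petabyte.
# Assumptions (not from the course): 100 MB/s per worker, perfect parallelism.
PETABYTE_BYTES = 10**15
BYTES_PER_SECOND_PER_WORKER = 100 * 10**6


def scan_hours(total_bytes, bytes_per_second_per_worker, workers):
    """Return the hours needed to read total_bytes once with the given workers."""
    seconds = total_bytes / (bytes_per_second_per_worker * workers)
    return seconds / 3600


one_worker = scan_hours(PETABYTE_BYTES, BYTES_PER_SECOND_PER_WORKER, 1)
thousand_workers = scan_hours(PETABYTE_BYTES, BYTES_PER_SECOND_PER_WORKER, 1000)

print(f"1 worker: about {one_worker:.0f} hours")
print(f"1000 workers: about {thousand_workers:.1f} hours")
```

Under these assumptions a single worker needs months, while a thousand workers finish in a few hours.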
Describe how MapReduce works.
Split the data into small chunks that can be processed in parallel (the map step). The outputs are then aggregated (the reduce step).
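The idea can be sketched in plain Python with a word count. This is a single-process illustration only: the function names and sample lines are made up for the example, and on a real cluster each chunk would be handled by a different worker.

```python
from collections import defaultdict


def split_into_chunks(items, chunk_size):
    """Split a list into consecutive chunks of at most chunk_size items."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def map_chunk(lines):
    """Map step: emit a (word, 1) pair for every word in the chunk."""
    pairs = []
    for line in lines:
        for word in line.lower().split():
            pairs.append((word, 1))
    return pairs


def reduce_pairs(mapped_chunks):
    """Reduce step: sum the counts for each word across all chunks."""
    totals = defaultdict(int)
    for pairs in mapped_chunks:
        for word, count in pairs:
            totals[word] += count
    return dict(totals)


lines = ["big data is big", "data is collected", "big clusters process data"]
chunks = split_into_chunks(lines, 2)
# Each chunk is mapped independently, so this loop could run in parallel.
mapped = [map_chunk(chunk) for chunk in chunks]
counts = reduce_pairs(mapped)
print(counts)
```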
What is the difference between typical development with Dataproc and typical Spark/Hadoop?
Dataproc handles the setup that Spark/Hadoop requires.
Self-managed Spark/Hadoop needs a lot of setup, configuration and optimization.
What are some drawbacks of managing a Hadoop cluster yourself?
- Difficult to scale/add new hardware.
- Less than 100% utilization -> bigger cost.
- Downtime when upgrading/redistributing tasks.
What is a cluster?
A set of master and worker nodes for processing big data jobs. The master node coordinates the job and distributes (maps) the work to the worker nodes.
Why use nearby zones?
- Lower latency
- Egress (exporting) data might incur costs
What’s the difference between cluster masters and workers?
Master: Splits the data so workers can process it in chunks. This is called mapping. The results are aggregated later in the reduce step.
Worker: Compute power attached to a master node. Receives data and processes it. Workers can be configured as preemptible, in which case they may disappear from the cluster.
What is a preemptible worker?
A worker that runs on spare Google compute capacity. Think of last-minute airplane tickets. It can be revoked when someone else requests that capacity.
What can images do for Dataproc?
Image versions let you create clusters with different versions of the software stack.
What is the gcloud tool?
A command-line program for interfacing with Google Cloud services, including creating Dataproc clusters and submitting jobs.
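To show what the two tasks look like, here is a small Python sketch that only builds and prints the gcloud command lines without running them. The cluster name, region, zone and script name are placeholders chosen for the example, and the helper functions are hypothetical, not part of any Google library.

```python
import shlex


def create_cluster_command(name, region, zone, num_workers=2, image_version=None):
    """Build the argument list for `gcloud dataproc clusters create`."""
    command = [
        "gcloud", "dataproc", "clusters", "create", name,
        f"--region={region}",
        f"--zone={zone}",
        f"--num-workers={num_workers}",
    ]
    if image_version is not None:
        # Optional: pin the software stack to a specific image version.
        command.append(f"--image-version={image_version}")
    return command


def submit_pyspark_command(script, cluster, region):
    """Build the argument list for `gcloud dataproc jobs submit pyspark`."""
    return [
        "gcloud", "dataproc", "jobs", "submit", "pyspark", script,
        f"--cluster={cluster}",
        f"--region={region}",
    ]


# Placeholder values; replace them with your own project settings.
create_cmd = create_cluster_command("demo-cluster", "us-central1", "us-central1-a")
submit_cmd = submit_pyspark_command("wordcount.py", "demo-cluster", "us-central1")
print(shlex.join(create_cmd))
print(shlex.join(submit_cmd))
```

The printed lines can be pasted into a shell where gcloud is installed and authenticated, or passed as lists to `subprocess.run`.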
What is pyspark?
A Python interface to the Spark framework for distributed computing.
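A minimal word-count sketch using the PySpark RDD API follows. It assumes pyspark is installed (it is available on Dataproc clusters); the helper names and sample lines are made up for the example. The Spark import sits inside `main()` so the helper functions can be read and reused without Spark present.

```python
from operator import add


def tokenize(line):
    """Split a line into lowercase words."""
    return line.lower().split()


def count_words(sc, lines):
    """Count words with Spark: map each word to (word, 1), then reduce by key."""
    pairs = sc.parallelize(lines).flatMap(tokenize).map(lambda word: (word, 1))
    return dict(pairs.reduceByKey(add).collect())


def main():
    # Imported here because it requires a pyspark installation.
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    try:
        print(count_words(sc, ["big data is big", "data is collected"]))
    finally:
        sc.stop()
```

Call `main()` on a machine with pyspark installed, or save the file with a call to `main()` at the end and submit it to a cluster as a PySpark job.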