Course 2 - Leveraging Unstructured Data with Cloud Dataproc on Google Cloud Platform Flashcards

1
Q

In a boolean 2x2 matrix of analyzed vs. collected data, where is the focus of “big data”?

A

In the quadrant of data that has been collected but not yet analyzed. Think of an iceberg: most of it sits below the surface, unused and ignored.

2
Q

Give some examples of unused data.

A
  • Google Street View data.
  • Emails.
  • Parking footage.
  • Purchase history.
3
Q

What are some barriers to big data analysis?

A
  • Unstructured data
  • Data volumes that are too large
  • Poor data quality
  • Data streams that arrive too fast
4
Q

Big data problems are often described as counting problems. What’s the difference between easy and hard counting problems?

A

Hard problems:
Difficult to quantify “fitness”, e.g. vision analysis or natural language processing.

Easy problems:
Straightforward to compute, but over large amounts of data.

5
Q

Is one petabyte large?

A

Depends on data type and budget. A petabyte is a lot of text, but not necessarily a lot of pictures or video.

Also, a large volume does not necessarily mean long processing times.

6
Q

Describe how MapReduce works.

A

Split the data into small chunks that can be processed in parallel (map). The outputs are then aggregated afterwards (reduce).
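
The map-then-aggregate flow above can be sketched in plain Python (a conceptual word count, not actual Hadoop code):

```python
from collections import defaultdict

def map_phase(chunk):
    # Map: emit a (word, 1) pair for every word in one chunk.
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    # Reduce: aggregate the mapped pairs by key.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Each chunk could be processed by a different worker in parallel.
chunks = ["the quick brown fox", "the lazy dog", "the end"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
print(reduce_phase(mapped))  # {'the': 3, 'quick': 1, ...}
```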

7
Q

What is the difference between typical development with Dataproc and typical Spark/Hadoop?

A

Dataproc manages all the setup that Spark/Hadoop normally requires.

Self-managed Spark/Hadoop involves a lot of setup, configuration and optimization work.

8
Q

What are some drawbacks of managing a Hadoop cluster yourself?

A
  • Difficult to scale/add new hardware.
  • Less than 100% utilization -> higher cost.
  • Downtime when upgrading/redistributing tasks.
9
Q

What is a cluster?

A

A setup of master and worker nodes for crunching big data tasks. Data is centralized in master nodes and distributed (mapped) to worker nodes.

10
Q

Why use nearby zones?

A
  • Lower latency.
  • Egress (exporting data out of a zone/region) might incur costs.

11
Q

What’s the difference between cluster masters and nodes?

A

Master: Holds and splits the data so workers can process it in chunks (mapping), and aggregates the results afterwards (reducing).

Worker: Compute capacity attached to a master node; receives chunks of data and processes them. Workers may be configured as preemptible and can disappear from the cluster.

12
Q

What is a preemptible worker?

A

Unused compute capacity in Google’s data centers that can be allocated and used at a discount. Think of last-minute airplane tickets. Preemptible workers can be revoked when Google needs the capacity back.

13
Q

What can images do for Dataproc?

A

Clusters can be created with different versions of the software stack (Hadoop, Spark, etc.).

14
Q

What is the gcloud tool?

A

A command-line program for interacting with Google Cloud services, including creating Dataproc clusters and submitting jobs.

15
Q

What is pyspark?

A

A Python interface to the Spark framework for distributed computing.

16
Q

What is the hamburger stack?

A

The three-line menu icon in the top-left corner of the Google Cloud web console.

17
Q

Which ports does Hadoop use?

A

HDFS (NameNode web UI): 50070

Hadoop web interface/YARN: 8088 and 8080 (unsure)

18
Q

How can you make custom machines? What can be changed?

A

Through the web console or command line.

CPU and memory can be changed.

19
Q

How are data and processing structured in MapReduce?

A

Data and operations are kept separate: the operations are shipped out to the nodes holding each chunk of data, rather than moving the data to the code.

20
Q

How is data stored in a Hadoop system?

A

Data is typically split into multiple parts on the Hadoop file system (HDFS). This is called sharding.

21
Q

What is sharding?

A

Splitting data into several chunks for processing.
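
As a rough illustration in plain Python (HDFS actually splits files into fixed-size blocks; this round-robin version is just a sketch):

```python
def shard(records, num_shards):
    # Distribute records across shards round-robin.
    # Real systems like HDFS split by fixed-size blocks instead.
    shards = [[] for _ in range(num_shards)]
    for i, record in enumerate(records):
        shards[i % num_shards].append(record)
    return shards

print(shard(list(range(10)), 3))  # [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```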

22
Q

What is the traditional way of storing data on Hadoop vs. Google’s way?

A

Traditional: Sharded data is copied to each node’s local disks.

Google’s way: Data is kept in Google Cloud Storage and read over Google’s network, decoupling storage from the cluster.

23
Q

Describe a traditional Google workflow.

A

Ingest -> process -> analysis
using
Pub/Sub -> Dataflow -> BigQuery

24
Q

What’s a problem with keeping data on Hadoop nodes?

A

If a node dies, its data must be moved.

25
Q

How should you move data to Hadoop on Dataproc?

A

1) Move data to GCS.
2) Update prefixes (hdfs:// to gs://).
3) Start using Hadoop on Dataproc as usual.
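
Step 2 is essentially a path rewrite. A minimal sketch of the idea in Python (bucket and file names are made up):

```python
def to_gcs_path(path):
    # Swap the HDFS scheme for the Cloud Storage scheme;
    # assumes the data was copied to a bucket of the same name.
    if path.startswith("hdfs://"):
        return "gs://" + path[len("hdfs://"):]
    return path

print(to_gcs_path("hdfs://mybucket/data/input.csv"))  # gs://mybucket/data/input.csv
```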

26
Q

How do you install software to a Dataproc stack?

A

1) Write an init script.
2) Upload it to GCS.
3) Provide it when creating a Dataproc cluster.

27
Q

What is Hadoop?

A

Apache Hadoop is an open-source software framework used for distributed storage and processing of big data using the MapReduce programming model.

28
Q

What is Apache Pig?

A

Apache Pig is an abstraction over MapReduce. It’s a tool/platform used to analyze large datasets by representing them as data flows.

29
Q

What is PySpark?

A

PySpark is a Python library for interacting with Spark.
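
PySpark exposes Spark’s chained, functional transformations (map, filter, reduce) from Python. A rough stdlib analogue of that style (not PySpark itself — a real job would go through an RDD, e.g. sc.parallelize(...)):

```python
from functools import reduce

data = range(1, 11)
# In PySpark this would read roughly:
# sc.parallelize(data).filter(...).map(...).reduce(...)
evens = filter(lambda x: x % 2 == 0, data)
squares = map(lambda x: x * x, evens)
total = reduce(lambda a, b: a + b, squares)
print(total)  # 4 + 16 + 36 + 64 + 100 = 220
```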

30
Q

What is Spark?

A

Spark is a distributed data-processing engine similar to Hadoop, and often run on top of a Hadoop cluster.