Exam questions Flashcards

1
Q

What is the storage transfer service?

A

The storage transfer service is a tool to import online data into GCS.

2
Q

When would you use the storage transfer service?

A

When transferring data from an online source (HTTP(S), Amazon S3 etc.) to a data sink (Eg. GCS).

3
Q

What is GCS?

A

Google Cloud Storage, Google’s online file storage service.

4
Q

What is a bucket?

A

It’s similar to a partition (file system) on a hard drive. A segregated place to store files.

5
Q

Which access rights do you need to use the storage transfer service?

A

Editor or owner of the project that creates the transfer job. Viewers can view info/jobs.

6
Q

When do you use storage transfer service vs. gsutil?

A

From on-premises, use gsutil. Use the storage transfer service (STS) when transferring from other cloud storage providers.

7
Q

What is Google Transfer Appliance?

A

High-capacity storage servers rented from Google. Used to transfer data sets too large to send over the network.

8
Q

When would you use the Google Transfer Appliance?

A

When transferring your data over the Internet takes more than a few weeks. (Evaluate data / speed)
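
The "data / speed" evaluation can be sketched as a rough calculation. This is a minimal sketch, assuming decimal terabytes and a link that sustains its full rated bandwidth; the function name and the example figures are hypothetical.

```python
def transfer_days(data_tb: float, bandwidth_mbps: float) -> float:
    """Estimate the days needed to send data_tb terabytes over a bandwidth_mbps link."""
    bits = data_tb * 1e12 * 8                 # decimal terabytes -> bits
    seconds = bits / (bandwidth_mbps * 1e6)   # megabits per second -> bits per second
    return seconds / 86_400                   # 86,400 seconds per day


# Hypothetical example: 100 TB over a 100 Mbit/s connection.
print(round(transfer_days(100, 100), 1))
```

If the estimate comes out at several weeks or more, the appliance is worth considering. Real links rarely sustain their rated speed, so the result is a lower bound.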

9
Q

How is Google Transfer Appliance priced?

A

100TB: $300
480TB: $1800

Late fees:

10
Q

How is the storage transfer service priced?

A

No extra costs for the transfer service itself. Other fees apply, like:

  • GCS storage/bandwidth ($0.01/GB regional or ~$0.1 inter-regional)
  • Data source’s pricing
  • Data insertion costs (PUT operations).
11
Q

What is nearline storage?

A

Storage for data accessed infrequently (roughly once a month or less). Compared with coldline: higher cost per GB, lower price for bandwidth (retrieval).

12
Q

What is coldline storage?

A

Storage for rarely used data, eg. backups. Low cost per GB, but high bandwidth cost.

13
Q

What’s the price difference between nearline/coldline storage?

A

Nearline: $0.01/GB
Coldline: $0.007/GB

14
Q

What is data egress?

A

Data leaving a network or location, e.g. data transferred out of a region or out of Google Cloud.

15
Q

When would you use GCS as your storage platform?

A

If

  • your data is not structured
  • you don’t need mobile SDKs
16
Q

When would you use Google Firebase as your storage platform?

A

When

  • your data is (un)structured,
  • your data is non-relational
  • your main workload is not analytics
  • you need mobile SDKs.
17
Q

When would you use Cloud Spanner as your storage platform?

A

When

  • your data is relational,
  • you don’t primarily need analytics
  • you need horizontal scaling.
18
Q

What is horizontal scaling?

A

When you add more machines.

19
Q

What is vertical scaling?

A

When you add more power to an existing machine.

20
Q

When would you use Cloud SQL as your storage platform?

A

When

  • your data is relational
  • you don’t primarily need analytics
  • you don’t need to scale horizontally.
21
Q

When would you use Cloud Datastore as your storage platform?

A

When

  • your data is structured
  • your data is non-relational
  • your primary workload is not analytics
22
Q

When would you use Cloud Bigtable as your storage platform?

A

When

  • your data is structured
  • your data is non-relational
  • your workload is analytics
  • you need low latency
23
Q

When would you use BigQuery as your storage platform?

A

When

  • your data is structured
  • your data is non-relational
  • your workload is analytics
  • you don’t need low latency
24
Q

Describe Google Cloud Storage.

A

A scalable, fully-managed, highly reliable, and cost-efficient object / blob store.

25
Q

What is CBT?

A

A scalable, fully-managed NoSQL wide-column database that is suitable for both real-time access and analytics workloads.

26
Q

What is cloud datastore?

A

A scalable, fully-managed NoSQL document database for your web and mobile applications.

27
Q

What is cloud sql?

A

A fully-managed MySQL and PostgreSQL database service that is built on the strength and reliability of Google’s infrastructure.

28
Q

What is cloud spanner?

A

Mission-critical, relational database service with transactional consistency, global scale and high availability.

29
Q

What is bigquery?

A

A scalable, fully-managed Enterprise Data Warehouse (EDW) with SQL and fast response times.

30
Q

What is OLAP short for?

A

Online analytical processing.

31
Q

Give some examples of unused data.

A
  • Google Street View data.
  • Emails.
  • Parking footage.
  • Purchase history.
32
Q

What are some barriers to big data analysis?

A
  • Unstructured data
  • Data volumes that are too large
  • Data quality
  • Data streams that are too fast
33
Q

Big data is often called counting problems. What’s the difference between easy and hard counting problems?

A

Hard problems:
Difficult to quantify “fitness”. Eg. vision analysis or natural language processing.

Easy problems:
Straightforward problems but large data amounts.

34
Q

Is one petabyte large?

A

It depends on the data type and budget. A petabyte is a lot of text, but not necessarily a lot of pictures or video.

However, a large amount of data does not necessarily impact processing time.

35
Q

Describe how MapReduce works.

A

Split the data into small, parallelizable chunks. The output is then aggregated later.

36
Q

What is the difference between typical development with Dataproc and typical Spark/Hadoop?

A

Dataproc manages all the setup necessary in Spark/Hadoop.

Spark/Hadoop has a lot of setup, config and optimization.

37
Q

What are some drawbacks of managing a Hadoop cluster yourself?

A
  • Difficult to scale/add new hardware.
  • Less than 100% utilization -> bigger cost.
  • Downtime when upgrading/redistributing tasks.
38
Q

What is a cluster?

A

A setup of master and worker nodes for crunching big data tasks. Data is centralized in master nodes and distributed (mapped) to worker nodes.

39
Q

Why use nearby zones?

A
  • Lower latency
  • Egress (exporting) data might incur costs

40
Q

What’s the difference between cluster masters and nodes?

A

Master: Holds the data and splits it into chunks for the workers. This is called mapping. Aggregates the results later, which is called reducing.

Worker: Compute power attached to a master node. Receives data and processes it. Workers may be configured as preemptible, in which case they can disappear from the cluster.

41
Q

What is a preemptible worker?

A

Unused compute capacity from Google may be allocated and utilized. Think of last-minute airplane tickets. The worker can be revoked when someone else requests that capacity.

42
Q

What can images do for Dataproc?

A

Images let you install clusters with different versions of the software stack.

43
Q

What is the gcloud tool?

A

A command-line program for interfacing with Google Cloud services, including creating Dataproc clusters and submitting jobs.

44
Q

What is pyspark?

A

A Python interface to the Spark framework for distributed computing.

45
Q

What is the hamburger stack?

A

The three horizontal lines (the navigation menu icon) in the top-left corner of the Google web console interface.

46
Q

How can you make custom machines? What can be changed?

A

Through the web console or command line.

CPU and memory can be changed.

47
Q

How are data and processing structured in MapReduce?

A

The data and operations are separated.

48
Q

How is data stored in a Hadoop system?

A

Data is typically split into multiple parts on the Hadoop file system (HDFS). This is called sharding.

49
Q

What is sharding?

A

Splitting data into several chunks for processing.

50
Q

What is the traditional way of storing data on Hadoop vs. Google’s way?

A

Traditional: Sharded data is transferred to each node separately.

Google’s way: Data is stored in Google Cloud Storage.

51
Q

Describe a traditional Google workflow.

A

Ingest -> process -> analysis
using
Pub/Sub -> Dataflow -> BigQuery

52
Q

What’s a problem with keeping data on Hadoop nodes?

A

If the node dies, its data must be moved.

53
Q

How should you move data to Hadoop on Dataproc?

A

1) Move data to GCS.
2) Update prefixes (hdfs:// to gs://).
3) Start using Hadoop on Dataproc as usual.
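
Step 2 above can be sketched as a small path rewrite. This is a minimal sketch, assuming paths of the form hdfs://host:port/path; the helper name, the bucket name and the namenode address are hypothetical.

```python
def to_gcs_uri(path: str, bucket: str) -> str:
    """Rewrite an hdfs:// path to the matching gs:// path in the given bucket."""
    prefix = "hdfs://"
    if not path.startswith(prefix):
        return path  # leave non-HDFS paths untouched
    # Drop the scheme and the namenode authority (host:port), keep the file path.
    _, _, file_path = path[len(prefix):].partition("/")
    return f"gs://{bucket}/{file_path}"


# Hypothetical namenode and bucket names.
print(to_gcs_uri("hdfs://namenode:8020/data/input.csv", "my-bucket"))
```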

54
Q

How do you install software to a Dataproc stack?

A

1) Write an init script.
2) Upload it to GCS.
3) Provide it when creating a Dataproc cluster.

55
Q

What is Hadoop?

A

Apache Hadoop is an open-source software framework used for distributed storage and processing of big data sets using the MapReduce programming model.

56
Q

What is Apache Pig?

A

Apache Pig is an abstraction over MapReduce. It’s a tool/platform used to analyze large datasets with a data flow representation.

57
Q

What is PySpark?

A

PySpark is a Python library for interacting with Spark.

58
Q

What is Spark?

A

Spark is a distributed big data processing framework similar to Hadoop.

59
Q

What is BigQuery?

A

BigQuery is a data warehouse for data analysis. It’s built to run large SQL statements. It supports streaming ingestion of data, which offers real-time analysis.

60
Q

What is Dataflow?

A

Dataflow is a service for transforming and enriching data in stream and batch modes.

61
Q

In statistics, what is accuracy?

A

How many items did you get right out of the total?

(TP + TN) / (TP + TN + FP + FN)
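
As a worked example, accuracy can be computed from the four counts. This is a minimal sketch; the function name and the counts are made up for illustration.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Share of all items that were classified correctly."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


# Hypothetical counts: 90 correct out of 100.
print(accuracy(tp=40, tn=50, fp=5, fn=5))
```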

62
Q

Assume you’re using a device to test for infected people in a village. Describe what recall is in this case.

A

Recall: Out of the people that were actually infected, what percentage tested positive?

TP / (TP + FN)

63
Q

Assume you’re using a device to test for infected people in a village. Describe what precision is in this case.

A

Precision is the percentage of people who were actually infected when your device said they were.

TP / (TP + FP)
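
The two metrics are easy to mix up, so here is a minimal sketch using the standard definitions (precision = TP / (TP + FP), recall = TP / (TP + FN)). The village counts are made up for illustration.

```python
def precision(tp: int, fp: int) -> float:
    """Of everyone the device flagged as infected, the share that really was."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Of everyone who really was infected, the share the device flagged."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical village: 8 infected people flagged, 2 healthy people flagged,
# 4 infected people missed.
print(precision(tp=8, fp=2))
print(round(recall(tp=8, fn=4), 3))
```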

64
Q

What is dataproc?

A

Dataproc is Google’s managed Hadoop service.

65
Q

What is the core idea behind MapReduce?

A

Split data into smaller chunks. Run operations on these chunks in parallel (mapping a function over the data). Aggregate the results from these functions (reducing).

Eg. parallelize the squaring of every number in a list, then summing the results.
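
The squaring example can be sketched in plain Python. This is only an illustration of the idea, not how Hadoop implements it: threads stand in for worker nodes, and the function names and chunk size are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def sum_of_squares_chunk(chunk):
    """Map step: square every number in one chunk and return the partial sum."""
    return sum(x * x for x in chunk)


def map_reduce_sum_of_squares(numbers, chunk_size=3, workers=2):
    # Split the data into smaller chunks.
    chunks = [numbers[i:i + chunk_size] for i in range(0, len(numbers), chunk_size)]
    # Map: process the chunks in parallel (threads stand in for worker nodes).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = list(pool.map(sum_of_squares_chunk, chunks))
    # Reduce: aggregate the partial results into one value.
    return reduce(lambda a, b: a + b, partial_sums, 0)


print(map_reduce_sum_of_squares([1, 2, 3, 4, 5]))  # 1 + 4 + 9 + 16 + 25
```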

66
Q

What is a cluster in dataproc?

A

A combination of a master node and several worker nodes.

67
Q

Where is data stored in dataproc?

A

Data is best stored in GCS, then copied to each worker node as needed.

68
Q

When would you use dataproc over BigQuery?

A

When you need to run something other than SQL, e.g. machine learning algorithms.

69
Q

What is the sql code for a windowing function?

A
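SELECT <function>(<column>)
  OVER (
    [PARTITION BY <columns>]
    [ORDER BY <columns>]
    [<frame clause>]
  )
FROM <table>;

E.g. an average over the current row and the 10 preceding rows:
SELECT
  AVG(value)
    OVER (ORDER BY value
          ROWS BETWEEN 10 PRECEDING AND CURRENT ROW)
FROM Dataset;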
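
To see what the frame does, the same calculation can be sketched in plain Python. Python stands in for SQL here; this is an illustration of the windowing logic under the assumption of a single partition, not BigQuery code.

```python
def running_average(values, preceding=10):
    """Average over the current row and up to `preceding` rows before it."""
    ordered = sorted(values)  # mirrors ORDER BY value
    result = []
    for i in range(len(ordered)):
        # Frame: ROWS BETWEEN `preceding` PRECEDING AND CURRENT ROW
        frame = ordered[max(0, i - preceding):i + 1]
        result.append(sum(frame) / len(frame))
    return result


print(running_average([3, 1, 2], preceding=1))
```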
70
Q

When would you use windowing functions in bigquery?

A

For instance when calculating running averages and analysis of time series.

71
Q

What is a UDF?

A

User defined function.

72
Q

How can you optimize BigQuery queries?

A
  • Select only the columns you need.
  • Big joins first, small joins later.

73
Q

Does Dataflow support compressed files?

A

Yes, TextIO supports them.

74
Q

Where should you store data when using Dataflow for batch processing?

A

“Anywhere”, but GCS is a great place to start. BigQuery also works if you have structured data.

75
Q

What is ParDo?

A

A transform in Dataflow that applies a user-defined function to each element of a collection in parallel.

76
Q

In Dataflow, what is the difference between GroupBy and Combine?

A

Combine typically uses predefined functions optimized for one task.

GroupBy is slower, but lets you write the function yourself.

77
Q

What is a side input in Dataflow?

A

A side input is like another set of parameters. This is typically done with “views”. You can combine two flows into one.

78
Q

What is softmax?

A

An ML term. The softmax function takes in a vector and rescales its values so that they are all positive and sum to 1. Think classification probabilities in a neural network.
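
A minimal sketch of softmax in plain Python, using the standard definition exp(v_i) / sum(exp(v_j)). Subtracting the maximum first is a common numerical-stability trick and does not change the result.

```python
import math


def softmax(values):
    """Turn a list of scores into positive values that sum to 1."""
    shift = max(values)  # subtract the max so exp() cannot overflow
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


probabilities = softmax([1.0, 2.0, 3.0])
print(probabilities)  # the largest score gets the largest probability
```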

79
Q

What is argmax?

A

An ML term. The argmax function takes in a vector and returns the position of its highest value. In one-hot form, the output has 1 in the cell with the highest value and 0 in all other cells.
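
A minimal sketch of both forms of argmax in plain Python: the index of the highest value, and the one-hot vector described on this card. The function names are illustrative.

```python
def argmax(values):
    """Index of the highest value (the first one if there are ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def one_hot_argmax(values):
    """Vector with 1 in the cell holding the highest value and 0 elsewhere."""
    best = argmax(values)
    return [1 if i == best else 0 for i in range(len(values))]


print(argmax([0.1, 0.7, 0.2]))
print(one_hot_argmax([0.1, 0.7, 0.2]))
```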

80
Q

How many neurons should you use in a neural network?

A

Only as many as needed. Unused neurons’ weights tend towards 0 (not activated).

81
Q

What is a typical machine learning workflow? (5)

A
Collect data
Organize
Prepare/preprocess
Number crunching
Deployment
82
Q

How much data do you need for a neural network?

A

Enough to cover every case you want to predict.

83
Q

How can you confuse a neural network? (2)

A

Negative examples, same label - cloud vs. cartoon clouds.

Outliers - Only a problem when there are too few.

84
Q

What is a regression problem?

A

A problem on (semi-)continuous data. Eg. house pricing.

85
Q

What is logistic regression?

A

A model for classification problems, e.g. is this a cat or a dog?

86
Q

What is MSE?

A

Mean squared error, often used as a fitness/error function in neural networks.

87
Q

How do you calculate the mean squared error?

A

1) For each value in the dataset, sum (real_n - predicted_n)^2, i.e. (y_n - Y_n)^2 where y_n is the real value and Y_n is the predicted value.
2) Divide by the number of data points.
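
The two steps can be sketched in plain Python. This is a minimal sketch; the function name and sample values are made up for illustration.

```python
def mean_squared_error(real, predicted):
    """Average of the squared differences between real and predicted values."""
    if len(real) != len(predicted) or not real:
        raise ValueError("inputs must be non-empty and of equal length")
    # Step 1: sum the squared differences.
    squared_sum = sum((r - p) ** 2 for r, p in zip(real, predicted))
    # Step 2: divide by the number of data points.
    return squared_sum / len(real)


print(mean_squared_error([1, 2, 3], [1, 2, 5]))  # (0 + 0 + 4) / 3
```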

88
Q

When do you use MSE?

A

To evaluate the error in a regression problem.

89
Q

What is cross entropy?

A

An error function for logistic regression.

90
Q

How do you calculate cross entropy?

A

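1) Sum (y_n * log(Y_n) + (1 - y_n) * log(1 - Y_n)), where y_n is the true label (0 or 1) and Y_n is the predicted probability.

2) Divide by the number of data points ( |Y| ) and negate the result.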
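
A minimal sketch of binary cross entropy in plain Python, using the standard formula -(1/N) * sum(y * log(p) + (1 - y) * log(1 - p)). Clamping the probabilities with a small eps is an assumption added here to avoid log(0).

```python
import math


def binary_cross_entropy(labels, probabilities, eps=1e-12):
    """Mean negative log-likelihood of the true labels (0 or 1)."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1 - eps)  # keep log() away from 0
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(labels)


# A coin-flip prediction on every example gives log(2).
print(binary_cross_entropy([1, 0], [0.5, 0.5]))
```

A confident correct prediction yields a loss near 0, while a confident wrong prediction is penalized heavily.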

91
Q

What is a confusion matrix?

A

For binary classification, a confusion matrix has 4 values: the number of true positives, true negatives, false positives and false negatives.
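
A minimal sketch of counting the four values for a binary classifier, assuming labels are encoded as 1 (positive) and 0 (negative). The function name and sample data are made up for illustration.

```python
def confusion_counts(actual, predicted):
    """Count TP, TN, FP and FN for binary labels encoded as 1 and 0."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["TP"] += 1
        elif a == 0 and p == 0:
            counts["TN"] += 1
        elif a == 0 and p == 1:
            counts["FP"] += 1
        else:  # a == 1 and p == 0
            counts["FN"] += 1
    return counts


print(confusion_counts([1, 0, 1, 0], [1, 1, 0, 0]))
```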

92
Q

What does a confusion matrix help you with?

A

The confusion matrix helps illustrate the performance of your ML model.

93
Q

What is an unbalanced dataset?

A

Balanced datasets have a roughly even distribution of categories/values. Unbalanced datasets are skewed.

94
Q

What is thresholding in ML?

A

A value to separate positive/negative guesses in a classification model.

Eg, the guess is that the image is 75% cat. Should this be counted as a cat? Threshold might be at 80%.
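
The cat example can be sketched as a one-line decision rule. This is a minimal sketch; the function name and the default threshold of 0.8 are taken from the example above, not from any library.

```python
def is_positive(probability: float, threshold: float = 0.8) -> bool:
    """Count the guess as positive only if it reaches the threshold."""
    return probability >= threshold


print(is_positive(0.75))                 # 75% cat with an 80% threshold
print(is_positive(0.75, threshold=0.5))  # same guess with a lower threshold
```

Lowering the threshold produces more positive guesses and raising it produces fewer.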

95
Q

What is an epoch in ML?

A

Feeding the neural network your whole dataset once.

96
Q

What is training loss in ML?

A

How badly the neural network performs when its output is compared with the expected output. Synonym: error. Target function to minimize.

97
Q

What is batch size in ML?

A

How many examples are shown to the neural network before backpropagation.
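
A minimal sketch of how a dataset is cut into batches. Iterating over every batch once corresponds to one epoch. The helper name is illustrative, and no actual training happens here.

```python
def batches(examples, batch_size):
    """Yield consecutive slices of at most batch_size examples."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]


dataset = [1, 2, 3, 4, 5]
# One epoch: every example is shown to the network exactly once.
for batch in batches(dataset, batch_size=2):
    print(batch)  # a real training loop would backpropagate after each batch
```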

98
Q

What is feature engineering?

A

Studying and selecting which features to use/not use in a machine learning model.

99
Q

What is one-hot encoding?

A

A way of giving the model categorical data.

Eg. for a rating feature from 1-5:
[3] vs. [0, 0, 1, 0, 0]
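
A minimal sketch of one-hot encoding in plain Python: the vector has one cell per category, with 1 at the position of the given value. The function name is illustrative.

```python
def one_hot(value, categories):
    """Encode value as a vector with 1 at its category and 0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if c == value else 0 for c in categories]


# A rating of 3 on a 1-5 scale.
print(one_hot(3, [1, 2, 3, 4, 5]))
```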

100
Q

What is a sparse input?

A

Inputs where only a few inputs are activated at the same time. Eg. one-hot is sparse.

101
Q

What is dense input?

A

Most of the input is activated at the same time.

102
Q

What is a hyperparameter?

A

A tunable part of an ML model not related to its inputs (examples). Eg. how many neurons to use or learning rate.

103
Q

What is hyperparameter tuning?

A

Trying out several hyperparameters to find a good combination.

104
Q

What is streaming?

A

Processing of unbounded data, eg. data coming in over time.

105
Q

What are the three Vs of streaming analytics?

A

Volume - lots of data.
Velocity - data is generated quickly.
Variety - unstructured, lots of different kinds.

106
Q

What’s the difference between tight and loose coupling?

A

Tight coupling: one receiver/sender for all data.

Loose coupling: message buffer between sender/receiver.
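
Loose coupling can be sketched with an in-process queue acting as the message buffer. This is only an illustration of the idea using Python's standard library, not Pub/Sub itself; the function name and the None end-of-stream marker are assumptions.

```python
import queue
import threading


def run_loosely_coupled(messages):
    """Send messages through a buffer so sender and receiver never talk directly."""
    buffer = queue.Queue()
    received = []

    def receiver():
        while True:
            message = buffer.get()
            if message is None:  # hypothetical end-of-stream marker
                break
            received.append(message)

    worker = threading.Thread(target=receiver)
    worker.start()
    for message in messages:  # the sender only knows about the buffer
        buffer.put(message)
    buffer.put(None)
    worker.join()
    return received


print(run_loosely_coupled(["a", "b", "c"]))
```

Because the sender only writes to the buffer, the receiver can be slow, restarted or replaced without the sender noticing.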

107
Q

What are senders/receivers called in pub/sub?

A

Sender: publisher.
Receiver: subscriber.