Exam questions Flashcards

Question

What is CBT?

Answer 1

A scalable, fully-managed NoSQL wide-column database that is suitable for both real-time access and analytics workloads.

Answer 2

A scalable, fully-managed NoSQL document database for your web and mobile applications.

Answer 3

A fully-managed MySQL and PostgreSQL database service that is built on the strength and reliability of Google’s infrastructure.

Answer 4

Mission-critical, relational database service with transactional consistency, global scale and high availability.

Answer 5

A scalable, fully-managed Enterprise Data Warehouse (EDW) with SQL and fast response times.

Answer 6

Online analytical processing.

Answer 7

- Google Street View data.
- Emails.
- Parking footage.
- Purchase history.

Answer 8

- Unstructured data
- Too large data amounts
- Data quality
- Too fast data streams

Answer 9

Hard problems: Difficult to quantify "fitness". Eg. vision analysis or natural language processing. Easy problems: Straightforward problems but large data amounts.

Answer 10

Depends on data type and funds. A petabyte (PB) is a lot of text, but not necessarily a lot of pictures or video. However, a large amount of data does not necessarily impact processing time.

Answer 11

Split the data into small, parallelizable chunks. The output is then aggregated later.

Answer 12

Dataproc manages all the setup necessary in Spark/Hadoop. Spark/Hadoop has a lot of setup, config and optimization.

Answer 13

- Difficult to scale/add new hardware.
- Less than 100% utilization -> bigger cost.
- Downtime when upgrading/redistributing tasks.

Answer 14

A setup of master and worker nodes for crunching big data tasks. Data is centralized in master nodes and distributed (mapped) to worker nodes.

Answer 15

- Lower latency | - Egress (exporting) data might incur costs

Answer 16

Master: Contains and splits data so workers can work in chunks. This is called mapping. Aggregates the results later, which is called reducing. Worker: Compute power attached to a master node. Receives data and processes it. Workers might be configured as preemptible and disappear from the cluster.

Answer 17

Unused compute power from Google may be allocated and utilized. Think of last-minute airplane tickets. It can be revoked when someone else requests that compute power.

Answer 18

Clusters can be installed with different versions of software stack.

Answer 19

A command-line program for interfacing with Google Cloud services, including creating Dataproc clusters and submitting jobs.

Answer 20

A Python interface to the Spark framework for distributed computing.

Answer 21

The three lines in the left corner of the Google web console interface.

Answer 22

Through the web console or command line. CPU and memory can be changed.

Answer 23

The data and operations are separated.

Answer 24

Data is typically split into multiple parts on the Hadoop file system (HDFS). This is called sharding.

Answer 25

Splitting data into several chunks for processing.

Answer 26

Traditional: Sharded data is transferred to each node separately. Google's way: Data is stored in Google Cloud Storage.

Answer 27

Ingest -> process -> analysis using Pub/Sub -> Dataflow -> BigQuery

Answer 28

If the node dies, its data must be moved.

Answer 29

1) Move data to GCS. 2) Update prefixes (hdfs:// to gs://). 3) Start using Hadoop on Dataproc as usual.

Answer 30

1) Write an init script. 2) Upload it to GCS. 3) Provide it when creating a Dataproc cluster.

Answer 31

Apache Hadoop is an open-source software framework used for distributed storage and processing of big datasets using the MapReduce programming model.

Answer 32

Apache Pig is an abstraction over MapReduce. It's a tool/platform used to analyze large datasets with a data flow representation.

Answer 33

PySpark is a Python library for interacting with Spark.

Answer 34

Spark is a big data platform similar to Hadoop.

Answer 35

BigQuery is a data warehouse for data analysis. It's built to run large SQL statements. It supports streaming ingestion of data, which offers real-time analysis.

Answer 36

Dataflow is a service for transforming and enriching data in stream and batch modes.

Answer 37

How many items did you get right out of the total? | (TP + TN) / (TP + TN + FP + FN)
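As a minimal sketch, the accuracy formula can be written as a small Python function. The function name and the example counts are hypothetical, chosen only for illustration.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that were correct."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one prediction is required")
    return (tp + tn) / total


# Hypothetical counts: 90 correct predictions out of 100.
print(accuracy(tp=40, tn=50, fp=5, fn=5))
```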

Answer 38

Recall: Out of the people that were actually infected, what percentage tested positive? TP / (TP + FN)

Answer 39

Precision is the percentage of people who were actually infected when your device said they were. TP / (TP + FP)

Answer 40

Dataproc is Google's managed Hadoop service.

Answer 41

Split data into smaller chunks. Run operations on these chunks in parallel (mapping a function to the data). Aggregate the results from these functions (reducing). Eg. parallelize the squaring of every number in a list, then sum the results.
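The squaring example can be sketched in plain Python. This is only an analogy: a thread pool stands in for worker nodes, and the helper names, chunk size and worker count are assumptions made for illustration, not part of Hadoop or Dataproc.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def map_chunk(chunk):
    """Map step: square every number in the chunk and pre-sum the chunk."""
    return sum(x * x for x in chunk)


def sum_of_squares(numbers, chunk_size=3, workers=4):
    chunks = list(chunked(numbers, chunk_size))
    # Each chunk is handed to a worker thread (a stand-in for a worker node).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    # Reduce step: aggregate the partial results into one value.
    return reduce(lambda a, b: a + b, partials, 0)


print(sum_of_squares([1, 2, 3, 4, 5, 6, 7]))  # 140
```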

Answer 42

A combination of a master node and several worker nodes.

Answer 43

Data is best stored in GCS, then copied to each worker node as needed.

Answer 44

When you need to run other things than SQL, eg. machine learning algorithms.

Answer 45

```
select over (
  partition by
  [order by]
  frame
)
```

Eg.

```
SELECT AVG(value) OVER (ORDER BY value ROWS BETWEEN 10 PRECEDING AND CURRENT ROW) FROM Dataset;
```

Answer 46

For instance when calculating running averages and analysis of time series.
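To make the running-average case concrete, here is a rough Python sketch of what a frame such as `ROWS BETWEEN 10 PRECEDING AND CURRENT ROW` computes. It assumes the values are already in the order the window would use; the function name and the `preceding` parameter are hypothetical.

```python
def moving_average(values, preceding=10):
    """Average each row with up to `preceding` rows before it."""
    result = []
    for i in range(len(values)):
        # The frame covers the current row and at most `preceding` earlier rows.
        window = values[max(0, i - preceding): i + 1]
        result.append(sum(window) / len(window))
    return result


# With preceding=1, every row is averaged with the single row before it.
print(moving_average([2, 4, 6, 8], preceding=1))  # [2.0, 3.0, 5.0, 7.0]
```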

Answer 47

User defined function.

Answer 48

- Select only the columns you need. | - Big joins first, small joins later.

Answer 49

Yes, TextIO supports them.

Answer 50

"Anywhere", but GCS is a great place to start. BigQuery also works if you have structured data.

Answer 51

A function in Dataflow for operating on data in parallel.

Answer 52

Combines are typically predefined functions optimized for one task. GroupBys are slower, but let you write the function yourself.

Answer 53

A side input is like another set of parameters. This is typically done with "views". You can combine two flows into one.

Answer 54

An ML term. The softmax function takes in a vector and rescales its values into positive numbers that sum to 1. Think classification probabilities in a neural network.
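A minimal softmax sketch in pure Python, assuming a plain list of scores as input. Subtracting the maximum is a common numerical-stability trick and does not change the result.

```python
import math


def softmax(scores):
    """Turn a list of scores into positive values that sum to 1."""
    shift = max(scores)  # keeps exp() from overflowing on large scores
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```

The largest score gets the largest probability, and the ordering of the inputs is preserved.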

Answer 55

An ML term. The argmax function takes in a vector and returns the position of the highest value. As a vector, the output has 1 in the cell with the highest value, all other cells are 0.
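A small sketch of both views of argmax, assuming a non-empty list as input. The function names are hypothetical; ties resolve to the first highest value.

```python
def argmax(values):
    """Index of the highest value (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def argmax_one_hot(values):
    """Vector with 1 in the cell of the highest value and 0 elsewhere."""
    winner = argmax(values)
    return [1 if i == winner else 0 for i in range(len(values))]


print(argmax([0.1, 0.7, 0.2]))          # 1
print(argmax_one_hot([0.1, 0.7, 0.2]))  # [0, 1, 0]
```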

Answer 56

Only as many as needed. Unused neurons' weights tend towards 0 (not activated).

Answer 57

```
Collect data
Organize
Prepare/preprocess
Number crunching
Deployment
```

Answer 58

Enough to cover every case you want to predict.

Answer 59

Negative examples, same label - cloud vs. cartoon clouds. | Outliers - Only a problem when there are too few.

Answer 60

A problem on (semi-)continuous data. Eg. house pricing.

Answer 61

Classification problems - is this a cat or a dog?

Answer 62

Mean squared error, often used as a fitness/error function in neural networks.

Answer 63

1) For each data point n in the dataset, sum (real_n - predicted_n)^2 = (Y_n - y_n)^2. 2) Divide by the number of data points.
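The two steps can be sketched in Python as follows. The function and argument names are assumptions made for illustration.

```python
def mean_squared_error(real, predicted):
    """Sum the squared differences, then divide by the number of data points."""
    if len(real) != len(predicted) or not real:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(r - p) ** 2 for r, p in zip(real, predicted)]
    return sum(squared) / len(squared)


# Errors are 1 and -2, so the squared errors are 1 and 4.
print(mean_squared_error([3.0, 5.0], [2.0, 7.0]))  # 2.5
```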

Answer 64

To evaluate the error in a regression problem.

Answer 65

An error function for logistic regression.

Answer 66

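1) Sum (Y_n * log(y_n) + (1 - Y_n) * log(1 - y_n)), where Y_n is the real label (0 or 1) and y_n is the predicted probability. | 2) Divide by the number of data points ( |Y| ) and negate the result.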
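A sketch of the standard binary cross-entropy, where the second term uses log(1 - prediction) and the averaged sum is negated so that the loss is positive. The names and the `eps` clamp are assumptions; the clamp only guards against log(0).

```python
import math


def binary_cross_entropy(labels, predictions, eps=1e-12):
    """Average negative log-likelihood of 0/1 labels under predicted probabilities."""
    total = 0.0
    for label, p in zip(labels, predictions):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total += label * math.log(p) + (1 - label) * math.log(1 - p)
    return -total / len(labels)


# Confident, correct predictions give a small loss; wrong ones a large loss.
print(binary_cross_entropy([1, 0], [0.9, 0.1]))
print(binary_cross_entropy([1, 0], [0.1, 0.9]))
```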

Answer 67

A confusion matrix has 4 values: number of true positive, true negative, false positive and false negative.
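As an illustration, the four values can be counted from lists of actual and predicted binary labels. The helper names and the sample data are hypothetical; precision and recall follow their standard definitions, TP / (TP + FP) and TP / (TP + FN).

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if a and p:
            counts["TP"] += 1
        elif not a and not p:
            counts["TN"] += 1
        elif p:
            counts["FP"] += 1  # predicted positive, actually negative
        else:
            counts["FN"] += 1  # predicted negative, actually positive
    return counts


def precision(c):
    return c["TP"] / (c["TP"] + c["FP"])


def recall(c):
    return c["TP"] / (c["TP"] + c["FN"])


counts = confusion_counts(actual=[1, 1, 0, 0, 1], predicted=[1, 0, 0, 1, 1])
print(counts)  # {'TP': 2, 'TN': 1, 'FP': 1, 'FN': 1}
print(precision(counts), recall(counts))
```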

Answer 68

The confusion matrix helps illustrate the performance of your ML model.

Answer 69

Balanced datasets have a roughly even distribution of categories/values. Unbalanced datasets are skewed.

Answer 70

A value to separate positive/negative guesses in a classification model. Eg, the guess is that the image is 75% cat. Should this be counted as a cat? Threshold might be at 80%.
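The cat example can be sketched as a one-line decision rule. The function name and the default threshold of 0.8 mirror the example above and are otherwise arbitrary.

```python
def is_positive(probability, threshold=0.8):
    """Count a guess as positive only if it reaches the threshold."""
    return probability >= threshold


print(is_positive(0.75))                 # False: 75% is below the 80% threshold
print(is_positive(0.75, threshold=0.5))  # True with a more lenient threshold
```

Lowering the threshold produces more positive guesses, which typically trades fewer false negatives for more false positives.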

Answer 71

Feeding the neural network your whole dataset once.

Answer 72

How badly the neural network performs when comparing input/output. Synonym: error. Target function to minimize.

Answer 73

How many examples are shown to the neural network before backpropagation.

Answer 74

Studying and selecting which features to use/not use in a machine learning model.

Answer 75

A way of giving the model classification data. Eg. for a rating of 3 on a scale from 1-5: [3] vs. [0, 0, 1, 0, 0]
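A minimal one-hot encoder for the rating example, assuming the full list of categories is known up front. The function name is hypothetical.

```python
def one_hot(value, categories):
    """Vector with 1 at the position of `value` and 0 everywhere else."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if c == value else 0 for c in categories]


# A rating of 3 on a 1-5 scale activates the third cell only.
print(one_hot(3, [1, 2, 3, 4, 5]))  # [0, 0, 1, 0, 0]
```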

Answer 76

Inputs where only a few inputs are activated at the same time. Eg. one-hot is sparse.

Answer 77

Most of the input is activated at the same time.

Answer 78

A tunable part of an ML model not related to its inputs (examples). Eg. how many neurons to use or learning rate.

Answer 79

Trying out several hyperparameters to find a good combination.
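One simple way to do this is an exhaustive grid search, sketched below. The `toy_score` function is a made-up stand-in for training and evaluating a model; in practice it would be replaced by a real validation score.

```python
from itertools import product


def grid_search(score, grid):
    """Try every combination in `grid` and return the best-scoring one."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        current = score(params)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score


def toy_score(params):
    # Hypothetical score that peaks at learning_rate=0.01 and neurons=32.
    return -abs(params["learning_rate"] - 0.01) - abs(params["neurons"] - 32) / 100


grid = {"learning_rate": [0.1, 0.01, 0.001], "neurons": [16, 32, 64]}
best, _ = grid_search(toy_score, grid)
print(best)  # {'learning_rate': 0.01, 'neurons': 32}
```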

Answer 80

Processing of unbounded data, eg. data coming in over time.

Answer 81

Volume - lots of data. Velocity - data is generated quickly. Variety - unstructured, lots of different kinds.

Answer 82

Tight coupling: one receiver/sender for all data. Loose coupling: message buffer between sender/receiver.

Answer 83

Sender: publisher. Receiver: subscriber.
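Loose coupling through a message buffer can be illustrated with an in-process queue. This is only an analogy built on Python's standard library, not the Cloud Pub/Sub API; the function names and the `None` sentinel are assumptions.

```python
import queue
import threading


def publisher(buffer, messages):
    """Sender: puts messages on the buffer without knowing who reads them."""
    for message in messages:
        buffer.put(message)
    buffer.put(None)  # sentinel telling the subscriber to stop


def subscriber(buffer, received):
    """Receiver: pulls messages from the buffer at its own pace."""
    while True:
        message = buffer.get()
        if message is None:
            break
        received.append(message)


def run_demo(messages):
    buffer = queue.Queue()  # the message buffer between sender and receiver
    received = []
    consumer = threading.Thread(target=subscriber, args=(buffer, received))
    consumer.start()
    publisher(buffer, messages)
    consumer.join()
    return received


print(run_demo(["a", "b", "c"]))  # ['a', 'b', 'c']
```

Because the publisher only talks to the buffer, either side can be slowed down, restarted or replaced without changing the other.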


(107 cards)