Databases & Analytics Flashcards

1
Q

What are the four Cloud Storage classes, and what are their cost structures?

A

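Standard - No minimum storage duration and no retrieval fee; highest storage price.

Nearline - 30-day minimum storage duration; lower storage price, with a retrieval fee.

Coldline - 90-day minimum storage duration; lower storage price again, with a higher retrieval fee.

Archive - 365-day minimum storage duration; lowest storage price, with the highest retrieval fee.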

2
Q

What is strong transactional consistency and which databases on GCP offer strong transactional consistency?

A

Strong transactional consistency means that once a write to a database is committed, every subsequent read returns the updated data, whichever replica or shard serves it.

Cloud Spanner and Cloud Firestore are two examples of database services with strong transactional consistency.

3
Q

It is a requirement of your application that even if the database is distributed across multiple nodes, writes to the database need to be replicated to all nodes before any reads to the data are allowed. What is the name of this concept?

A

Strong transactional consistency.

4
Q

You need a fully managed data warehouse for analytics datasets with millions of rows, but your analytics team must be able to query it using standard SQL statements. Which GCP product should you choose?

A

BigQuery

5
Q

Which Cloud Storage class would you choose for backups that need to be kept for at least 90 days and will only be accessed in a disaster recovery scenario?

A

Coldline storage

6
Q

Your company is streaming real-time sensor data into Bigtable. You will be required to run searches of the data based on the “SensorID” of the sensor and a time window. Bearing in mind that Bigtable sorts its rows lexicographically, what would be a sensible row key design?

a) SensorID#TimeStamp
b) TimeStamp#SensorID
c) TimeStamp
d) SensorID

A

a) SensorID#TimeStamp
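
To see why leading with the sensor ID helps, here is a minimal Python sketch that imitates lexicographic row ordering with a sorted list. It does not use the Bigtable client; the `row_key` and `scan` helpers, the zero-padding width and the sample readings are all assumptions made for illustration.

```python
from bisect import bisect_left

def row_key(sensor_id: str, ts: int) -> str:
    # Zero-pad the timestamp so that string order matches numeric order.
    return f"{sensor_id}#{ts:010d}"

# Hypothetical readings: (sensor ID, epoch seconds).
readings = [("s2", 100), ("s1", 300), ("s1", 100), ("s2", 200), ("s1", 200)]

# Sorting the keys as strings mimics a lexicographically ordered table.
keys = sorted(row_key(s, t) for s, t in readings)

def scan(sorted_keys, sensor_id, start_ts, end_ts):
    """Return the keys of one sensor with start_ts <= timestamp < end_ts."""
    lo = bisect_left(sorted_keys, row_key(sensor_id, start_ts))
    hi = bisect_left(sorted_keys, row_key(sensor_id, end_ts))
    return sorted_keys[lo:hi]

window = scan(keys, "s1", 100, 300)
print(window)
```

Because all rows for one sensor are contiguous and ordered by time, a sensor-plus-time-window query becomes a single range read. With the timestamp first, rows from every sensor would be interleaved and new writes would all land at the end of the table.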

7
Q

You need to store collections of JSON documents in a managed database service. Which GCP product should you choose?

A

Cloud Firestore

8
Q

You need to store millions of rows of wide-column NoSQL time-series data in a managed database service. Which GCP product should you choose?

A

Cloud Bigtable

9
Q

Your team has developed a mobile web application where global users vote on popular topics. For each topic, you expect a very high volume of votes during each individual 30-minute voting window. You need to capture and count all votes within 24 hours and then store the votes for future analysis and reporting. What should you do?

A. Save the votes to Memorystore, and use Cloud Functions to insert the data into BigQuery. Display the results in Google Data Studio or Looker.

B. Publish the votes to Pub/Sub, and use a Dataflow pipeline to insert the data into BigQuery. Display the results in Google Data Studio or Looker.

C. Publish the votes to Pub/Sub, and use Cloud Functions to insert the data into Cloud Storage. Display the results in Google Data Studio or Looker.

D. Use Firebase to authenticate the mobile users, and publish the data directly to the database. Export the data to a CSV file, and import it into Sheets for reporting.

A

B is correct. Pub/Sub can ingest millions of messages per second and delivers each message at least once, Dataflow processes the stream, and BigQuery stores the votes for analysis and reporting.

10
Q

Your company has a successful multi-player game that has become popular in the US. Now, it wants to expand to other regions. It is launching a new feature that allows users to trade points. This feature will work for users across the globe. Your company’s current MySQL backend is reaching the limit of the Compute Engine instance that hosts the game. Your company wants to migrate to a different database that will provide global consistency and high availability across the regions. Which database should they choose?

A

Cloud Spanner

11
Q

You are creating a new web application for a global audience. You need to choose a database service specifically for storing user sessions. Your users may connect from any location in the world, and transactions should be strongly consistent. As this is a new application, you would like to keep costs down where possible. Which database should you choose?

A

Cloud Firestore

12
Q

Your company plans to expand their analytics use cases. One of the new use cases requires your data analysts to analyze events using SQL on a near real-time basis. You expect rapid growth and want to use managed services as much as possible. What should you do?

A

Create a Pub/Sub topic and a subscription. Stream your events from the source into the Pub/Sub topic. Leverage Dataflow to ingest these events into BigQuery for SQL analysis.

13
Q

You have two tables in Cloud SQL with identical columns that you need to quickly combine into a single table, removing duplicate rows from the result set. What should you do?

A. Use the JOIN operator in SQL to combine the tables.

B. Use nested WITH statements to combine the tables.

C. Use the UNION operator in SQL to combine the tables.

D. Query the tables from a Linux shell, combine the results into a single CSV, and re-import the rows into the database. Use the UNION ALL operator in SQL to combine the tables.

A

C is correct because the UNION operator combines result sets while removing duplicates.
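
The difference between UNION and UNION ALL can be checked with a small sketch. It uses Python's built-in SQLite as a stand-in for Cloud SQL (an assumption made so the example is self-contained); the table names `t1` and `t2` and their rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id INTEGER, name TEXT);
    CREATE TABLE t2 (id INTEGER, name TEXT);
    INSERT INTO t1 VALUES (1, 'a'), (2, 'b');
    INSERT INTO t2 VALUES (2, 'b'), (3, 'c');
""")

# UNION removes duplicate rows from the combined result set.
union_rows = conn.execute(
    "SELECT id, name FROM t1 UNION SELECT id, name FROM t2 ORDER BY id"
).fetchall()

# UNION ALL keeps every row, including the duplicate (2, 'b').
union_all_rows = conn.execute(
    "SELECT id, name FROM t1 UNION ALL SELECT id, name FROM t2 ORDER BY id"
).fetchall()

print(union_rows)
print(union_all_rows)
```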

14
Q

You are building a storage layer for an analytics Hadoop cluster in a region for your company. This cluster will run multiple jobs on a nightly basis, and you need to access the data frequently. You want to use Cloud Storage for this purpose. What is the most cost-effective option?

A. Regional Coldline storage

B. Regional Nearline storage

C. Regional Standard storage

D. Multi-regional Standard storage

A

C. Regional Standard storage

15
Q

You have a data warehouse built on BigQuery that contains a table with array fields. To analyze the data for a specific use case using Standard SQL, you need to read all elements from the array and write them with all other non-array fields in a table. You don’t want to lose any records if they don’t match records in the array fields. What should you do?

A. Perform SELECT * FROM tablename.

B. Perform UNNEST and JOIN with the table to get these results.

C. Perform UNNEST and INNER JOIN with the table to get these results.

D. Perform UNNEST and CROSS JOIN with the table to get these results.

A

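D is correct because UNNEST flattens each array into one row per element and the CROSS JOIN pairs every element with the other fields of its row, so no array element is lost. One caveat: rows whose array is empty or NULL produce no output with CROSS JOIN, whereas a LEFT JOIN with UNNEST retains them.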
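
As an illustration only, the Python sketch below emulates what flattening an array column does; it is not BigQuery code, and the `rows` data and both helper functions are hypothetical. It also shows the behaviour BigQuery documents for rows with an empty array: CROSS JOIN with UNNEST leaves them out, while LEFT JOIN with UNNEST keeps them.

```python
# Hypothetical table: one non-array field (id) and one array field (tags).
rows = [
    {"id": 1, "tags": ["a", "b"]},
    {"id": 2, "tags": []},
]

def cross_join_unnest(table):
    # One output row per array element; the other fields are repeated.
    return [(r["id"], tag) for r in table for tag in r["tags"]]

def left_join_unnest(table):
    # Same, but a row with an empty array is kept once with a None element.
    out = []
    for r in table:
        elements = r["tags"] or [None]
        out.extend((r["id"], tag) for tag in elements)
    return out

print(cross_join_unnest(rows))
print(left_join_unnest(rows))
```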

16
Q

What is Cloud Memorystore?

A

Memorystore is a fully managed in-memory data store service for Redis and Memcached at Google Cloud.

Instead of having to spin up your own VM and install Redis or Memcached, you use a managed service.

17
Q

You are creating a new application that will use Cloud Bigtable as its backend database. What are the different options you have for connecting to Bigtable?
(Select all that apply)

A) Using Google’s cbt tool
B) Using the HBase client for Java
C) Using the MariaDB SQL client
D) Using Google’s bq tool
E) Using Client Libraries
F) Directly Through BigQuery

A

A, B and E

18
Q

Your company has a BigQuery data mart that provides analytics information to hundreds of employees. One user wants to run jobs without interrupting important workloads. This user isn’t concerned about the time it takes to run these jobs. You want to fulfill this request while minimizing cost to the company and the effort required on your part.
What should you do?

A

Ask the user to run the jobs as batch jobs.

19
Q

Which Google Cloud database service supports the HBase API?

A

Cloud Bigtable

20
Q

What are the 3 benefits of creating Views within a SQL database?

A

1) Limit Access - A view can expose only the columns that a particular user is allowed to see, instead of giving them access to the whole table.

2) Aggregate and Anonymise Data - A view can present a summary of anonymised data while masking the underlying data points.

3) Repeatable Queries - Enables repeatability of common queries.
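
A small sketch of the three benefits, using Python's built-in SQLite so that it runs anywhere. The `employees` table, its rows and both view names are hypothetical, and SQLite has no user permissions, so the actual access control (granting users the view but not the table) is assumed to happen in a server database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, dept TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        ('Ann', 'eng', 100), ('Bob', 'eng', 80), ('Cy', 'ops', 60);

    -- Limit access: expose only the non-sensitive columns.
    CREATE VIEW employee_directory AS
        SELECT name, dept FROM employees;

    -- Aggregate and anonymise: per-department averages, no names.
    CREATE VIEW dept_salary AS
        SELECT dept, AVG(salary) AS avg_salary
        FROM employees GROUP BY dept;
""")

# Repeatable queries: the saved definition is reused by name.
cursor = conn.execute("SELECT * FROM employee_directory")
directory_cols = [d[0] for d in cursor.description]
dept_avgs = conn.execute("SELECT * FROM dept_salary ORDER BY dept").fetchall()

print(directory_cols)
print(dept_avgs)
```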

21
Q

What are Vectors?

A

Vectors are arrays of numbers generated by a machine learning model that can represent complex structured and unstructured data, like words, images, videos and audio.

22
Q

What are Vector Embeddings?

A

Vectors are represented in a continuous, multi-dimensional space known as an embedding, grouping sets of data based on semantic meaning or similar features across virtually any data type.

Think of this as an x, y, z coordinate graph: the semantic similarity of vector database entries is represented by their distance in n-dimensional vector space.

23
Q

What are Vector Databases?

A

Vector databases serve to store and index the output of an embedding model.

They are specifically designed for fast retrieval and search, making them perfect for AI inference.

24
Q

What are some of the key use cases of Vector Search?

A

Recommendations Engines

Search Engines

Autonomous Vehicles

Medical Image Analysis

Drug / Protein / Molecule Discovery

25
Q

What are the 3 stages of creating a Vector Database?

A

Encode - Generate embeddings with AI models

Index - Build a Vector Search index

Search - Search vector space

26
Q

What is the Manhattan Distance measure type in Vector Index creation?

A

L1 - Measures the distance across all dimensions using a grid-like system, where the distance is the shortest path ALONG the gridlines.

Good for datasets with a large number of dimensions, given its grid-based system.
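
A minimal sketch of the L1 calculation in plain Python; the function name and the sample vectors are illustrative.

```python
def manhattan(a, b):
    # Sum of the absolute differences along each dimension.
    return sum(abs(x - y) for x, y in zip(a, b))

# |1-4| + |2-0| + |3-3| = 3 + 2 + 0
d = manhattan([1, 2, 3], [4, 0, 3])
print(d)
```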

27
Q

What is the Euclidean Distance measure type in Vector Index creation?

A

L2 - Measures the straight-line (shortest) distance between two points in multi-dimensional space.
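
A minimal sketch of the L2 calculation in plain Python; the function name and the sample points are illustrative.

```python
import math

def euclidean(a, b):
    # Square root of the summed squared differences along each dimension.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# The classic 3-4-5 right triangle.
d = euclidean([0, 0], [3, 4])
print(d)
```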

28
Q

What is the Cosine Similarity / Distance measure type in Vector index creation?

A

Measures the similarity between two vectors in terms of direction. This metric is used when the magnitude of the vectors does not matter, only their orientation.

This metric is widely used in recommendation systems.
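
A minimal sketch in plain Python showing that only direction matters: a vector and a scaled copy of it score about 1, while perpendicular vectors score 0. The function name and sample vectors are illustrative.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1, 2], [2, 4])  # second vector is 2x the first
orthogonal = cosine_similarity([1, 0], [0, 1])      # perpendicular vectors
print(same_direction, orthogonal)
```

Cosine distance is commonly defined as one minus this similarity.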

29
Q

What is the difference between the Brute-Force Algorithm and TreeAh Algorithm in Vector index creation?

A

The brute-force algorithm performs an exhaustive search, calculating the distance between the query and every single vector in the search space. This becomes extremely computationally expensive once the database holds billions to trillions of vectors.

Therefore, TreeAh is favored for production: it is an Approximate Nearest Neighbor approach that divides the vector space into multiple partitions and indexes them with a tree structure, so only a fraction of the vectors is compared against the query.
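
The exhaustive approach can be sketched in a few lines of plain Python. This is an illustration of brute-force search only, not of TreeAh or any Google API; the function name and sample vectors are hypothetical, and `math.dist` needs Python 3.8 or later.

```python
import math

def brute_force_search(query, vectors, k=1):
    """Exact nearest neighbours: score every vector, then sort."""
    scored = [(math.dist(query, v), i) for i, v in enumerate(vectors)]
    return [i for _, i in sorted(scored)[:k]]

vectors = [[0, 0], [5, 5], [1, 1], [9, 9]]
nearest = brute_force_search([1.2, 0.9], vectors, k=2)
print(nearest)
```

Every query touches every stored vector, so the cost grows linearly with the size of the database, which is what approximate methods try to avoid.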

30
Q

What are the two main techniques used by ScaNN (Scalable Approximate Nearest Neighbor) to improve search performance in vector search?

A

1) Space Pruning - Reduces the search space with a multi-level tree search: the space is divided into hierarchical partitions, and the search selects the nearest partition at each level, descending the tree layer by layer.

2) Data Quantization - Compresses the size of vectors; for example, it can compress a 9-dimensional vector to 12 bits.
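
One way to arrive at 12 bits for a 9-dimensional vector is product quantization: split the vector into 3 chunks of 3 dimensions and replace each chunk by a 4-bit index into a codebook of 16 centroids. That breakdown is an assumption made for this sketch, not a description of ScaNN's internals, and the codebooks here are random placeholders where a real system would learn them (for example with k-means).

```python
import math
import random

random.seed(0)
DIM, SUBSPACES, CENTROIDS = 9, 3, 16  # 3 chunks x 4 bits each = 12 bits
SUB_DIM = DIM // SUBSPACES

# Hypothetical codebooks: 16 random centroids per 3-dimensional chunk.
codebooks = [
    [[random.uniform(-1, 1) for _ in range(SUB_DIM)] for _ in range(CENTROIDS)]
    for _ in range(SUBSPACES)
]

def encode(vector):
    """Replace each chunk by the index of its nearest centroid."""
    code = 0
    for s in range(SUBSPACES):
        chunk = vector[s * SUB_DIM:(s + 1) * SUB_DIM]
        nearest = min(range(CENTROIDS),
                      key=lambda c: math.dist(chunk, codebooks[s][c]))
        code = (code << 4) | nearest  # append 4 bits per chunk
    return code

def decode(code):
    """Rebuild an approximate vector from the three 4-bit indices."""
    approx = []
    for s in range(SUBSPACES):
        index = (code >> (4 * (SUBSPACES - 1 - s))) & 0xF
        approx.extend(codebooks[s][index])
    return approx

vector = [random.uniform(-1, 1) for _ in range(DIM)]
code = encode(vector)
print(f"{code:012b}", decode(code))
```

The decoded vector is only an approximation of the original, which is the trade-off quantization makes in exchange for a much smaller index.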

31
Q

What is the main benefit of using vector search in RAG (retrieval-augmented generation) to address LLM (Large Language Models) hallucination?

A

It allows the LLM to access real-time information for fact-checking.