Adswerve Study Guide Flashcards by Zach M

What is BigQuery?

BigQuery is the petabyte-scale data warehouse on Google Cloud

BigQuery Formats

Avro, CSV, JSON (newline-delimited), ORC, Parquet, Cloud Datastore exports, Cloud Firestore exports

Parquet

A columnar data format from the Hadoop ecosystem, commonly stored on HDFS. BQ compatible

ORC

(Optimized Row Columnar) A columnar data format from the Hadoop ecosystem, commonly stored on HDFS. BQ compatible

Hadoop

Software framework for distributed storage and processing of big data

HDFS

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware.

Dataproc: HDFS or GCS

Google recommends using GCS for storage instead of HDFS

Why does Google recommend PubSub over Kafka?

Better scaling, fully managed service

How long can Kafka retain messages?

For as long as you configure it to

How long can PubSub retain messages?

7 days

Is Kafka push or pull?

Pull

Is PubSub push or pull?

Both

Does Kafka guarantee ordering?

Yes, within a partition
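
As a rough illustration of why ordering holds only within a partition, the sketch below routes keyed messages to partitions and appends them to per-partition logs. The `partition_for` helper, the three-partition topic, and the use of `zlib.crc32` are assumptions made for illustration (Kafka's default partitioner uses a different hash); only the idea that ordering is per partition comes from the card above.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3  # hypothetical topic with three partitions


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition; crc32 stands in for Kafka's real hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def produce(messages):
    """Append (key, value) pairs to per-partition logs in arrival order."""
    logs = defaultdict(list)
    for key, value in messages:
        logs[partition_for(key)].append((key, value))
    return logs


messages = [("user-a", 1), ("user-b", 1), ("user-a", 2), ("user-b", 2), ("user-a", 3)]
logs = produce(messages)

# Within any one partition, each key's values appear in the order they were sent.
for partition, log in sorted(logs.items()):
    print(partition, log)
```

Because the same key always maps to the same partition, per-key order survives; order across different partitions is not defined.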

Does PubSub guarantee ordering?

Not by default. To get ordering, include sequence information in the message

Kafka Delivery Guarantee

At most once, at least once, exactly once (limited)

PubSub Delivery Guarantee

At least once for each subscription
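
At-least-once delivery means a subscriber can see the same message more than once, so handlers are usually written to be idempotent. A minimal sketch, assuming each message carries a unique `message_id`; the `Deduplicator` class is hypothetical and keeps seen IDs in memory only.

```python
class Deduplicator:
    """Process each message ID once, even if it is delivered repeatedly."""

    def __init__(self):
        self.seen = set()      # IDs already processed (in-memory, illustrative only)
        self.processed = []    # payloads that were actually handled

    def handle(self, message_id: str, payload: str) -> bool:
        """Return True if the payload was processed, False if it was a duplicate."""
        if message_id in self.seen:
            return False
        self.seen.add(message_id)
        self.processed.append(payload)
        return True


dedup = Deduplicator()
deliveries = [("m1", "create"), ("m2", "update"), ("m1", "create")]  # m1 is redelivered
results = [dedup.handle(mid, body) for mid, body in deliveries]
print(results, dedup.processed)
```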

Spark

Runs on Hadoop. Framework that processes data in memory (RAM)

What is BQ default encoding?

UTF-8

Dataproc: is HDFS data persistent?

No, it goes away when the Dataproc cluster is shut down

Dataproc: is GCS data persistent?

Yes, it remains even when a cluster is shut down

Bigtable: What causes hotspotting?

Contiguous row keys. Example keys: 20190101 20190102 20190103 …

Bigtable: How to prevent hotspotting?

Make row keys non-contiguous. Example keys: a93js-20190101, vomdn-20190102, odsjs-20190103
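
One way to get keys like the ones above is to prepend a short prefix derived from a hash of the natural key. This is a sketch under assumptions: the `salted_key` helper, the 5-character prefix, and the choice of SHA-256 are illustrative, not a Bigtable API. A deterministic prefix keeps the key recomputable for point lookups, which a random prefix would not.

```python
import hashlib


def salted_key(natural_key: str, prefix_len: int = 5) -> str:
    """Prefix the key with part of its SHA-256 hex digest to scatter sequential keys."""
    digest = hashlib.sha256(natural_key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}-{natural_key}"


dates = ["20190101", "20190102", "20190103"]
keys = [salted_key(d) for d in dates]
for k in keys:
    print(k)
```

The trade-off is that consecutive dates no longer sit next to each other, so a range scan over a date interval is no longer a single contiguous read.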

What is ANSI SQL?

The standardized SQL language. BQ's Standard SQL dialect follows it

Dataproc: when to create a cluster?

Clusters are recommended to be job specific. Have a separate cluster for each job

Which is simpler: cbt or HBase shell?

cbt

Databases: What does an index do?

Improves the search speed of a specific column
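
A toy illustration of the idea, not any particular database engine: a linear scan checks every row, while an index (here a plain dict built on the `email` column of a made-up table) jumps straight to the matching rows.

```python
from collections import defaultdict

rows = [
    {"id": 1, "email": "ann@example.com"},
    {"id": 2, "email": "bob@example.com"},
    {"id": 3, "email": "ann@example.com"},
]


def scan(rows, column, value):
    """Without an index: inspect every row."""
    return [r["id"] for r in rows if r[column] == value]


def build_index(rows, column):
    """With an index: map each column value to the IDs of the rows that hold it."""
    index = defaultdict(list)
    for r in rows:
        index[r[column]].append(r["id"])
    return index


email_index = build_index(rows, "email")
print(scan(rows, "email", "ann@example.com"))
print(email_index["ann@example.com"])
```

Both lookups return the same rows; the index pays a one-time build cost (and extra storage) so that later searches on that column avoid the full scan.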

What is an MID?

Machine-generated IDentifier. A unique identifier for an entity in Google's Knowledge Graph

BQ: Does LIMIT clause reduce cost?

No. All the data will still be queried and billed

BQ: Can you change an existing table to use partitions?

No. You must create a partitioned table from scratch

BQ: Which column specifies a partition?

_PARTITIONTIME (a pseudo column on ingestion-time partitioned tables)
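
For an ingestion-time partitioned table, filtering on `_PARTITIONTIME` is what limits the partitions scanned. The sketch below only builds the query text and does not call BigQuery; the table name `my_project.my_dataset.events` and the helper `partition_query` are hypothetical.

```python
from datetime import date


def partition_query(table: str, start: date, end: date) -> str:
    """Build a query restricted to partitions between start and end (inclusive)."""
    return (
        f"SELECT COUNT(*) FROM `{table}` "
        f"WHERE _PARTITIONTIME BETWEEN TIMESTAMP('{start.isoformat()}') "
        f"AND TIMESTAMP('{end.isoformat()}')"
    )


sql = partition_query("my_project.my_dataset.events", date(2019, 1, 1), date(2019, 1, 3))
print(sql)
```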

BQ: which column specifies a shard (wildcard)?

_TABLE_SUFFIX
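
With date-sharded tables, a wildcard table plus a `_TABLE_SUFFIX` filter selects which shards are read. This sketch only assembles query text; the `events_` prefix, the YYYYMMDD suffix convention, and the `wildcard_query` helper are assumptions for illustration.

```python
def wildcard_query(table_prefix: str, first_suffix: str, last_suffix: str) -> str:
    """Build a wildcard-table query limited to a range of shard suffixes."""
    return (
        f"SELECT COUNT(*) FROM `{table_prefix}*` "
        f"WHERE _TABLE_SUFFIX BETWEEN '{first_suffix}' AND '{last_suffix}'"
    )


sql = wildcard_query("my_project.my_dataset.events_", "20190101", "20190103")
print(sql)
```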

BQ: At what levels can you control access?

Project, dataset, and table or view

BQ: Can you limit access to a table?

Yes. Access can be granted on individual tables and views as well as on datasets and projects

BQ: How long are query results cached?

24 hours

BQ: Majority of time spent in wait stage

Options: wait, buy more slots, query less data, or write queries whose compute stages finish faster

BQ: Majority of time spent in read stage

All other operations were less expensive than the base cost of reading input data. Ideal. Can improve via partitioning commonly used tables but not much else to do.

BQ: Majority of time spent in compute stage

Filter as early as possible. Pre-calculate common calculations.

BQ: Majority of time spent in write stage

Expected if we emit more data than was originally read from inputs.

Which storage options have transactions?

Cloud SQL, Spanner, Datastore

Which storage options have high throughput?

Bigtable

When to use Datastore?

NoSQL. Fast reads, slower writes

When to use Bigtable?

NoSQL. Massive throughput

PubSub: What is a push subscription?

PubSub sends each message to a preconfigured endpoint

PubSub: What is a pull subscription?

Subscriber must ask PubSub for a message

PubSub: How to get message ordering?

Include sequence information in the message
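
A sketch of what the subscriber side might look like, assuming the publisher attaches a gap-free integer `seq` starting at 0 to each message. The `Resequencer` class is hypothetical; it holds back early arrivals until the missing sequence numbers show up.

```python
class Resequencer:
    """Release messages in sequence order, buffering any that arrive early."""

    def __init__(self):
        self.next_seq = 0   # next sequence number that may be released
        self.pending = {}   # seq -> payload for messages that arrived early

    def receive(self, seq: int, payload: str) -> list:
        """Accept one message and return every payload that is now releasable."""
        if seq < self.next_seq or seq in self.pending:
            return []  # duplicate delivery: already released or already buffered
        self.pending[seq] = payload
        released = []
        while self.next_seq in self.pending:
            released.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return released


rs = Resequencer()
arrivals = [(1, "b"), (0, "a"), (3, "d"), (2, "c")]  # out-of-order delivery
ordered = [p for seq, payload in arrivals for p in rs.receive(seq, payload)]
print(ordered)  # ['a', 'b', 'c', 'd']
```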

PubSub: Subscriptions are auto-deleted after how many days of inactivity by default?

Wide-column store

NoSQL database. Rows and columns. Names and format of columns can vary from row to row. 2D key value store
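
The "2D key value store" description can be pictured as a dict of dicts: the outer key is the row key and the inner key is the column name. This is a toy model only; the row keys and column names below are made up.

```python
# Outer key = row key, inner key = column name. Rows need not share columns.
table = {
    "user#001": {"name": "Ann", "email": "ann@example.com"},
    "user#002": {"name": "Bob", "last_login": "2019-01-02"},
}


def get_cell(table, row_key, column, default=None):
    """Look up one cell by (row key, column); missing cells return the default."""
    return table.get(row_key, {}).get(column, default)


print(get_cell(table, "user#001", "email"))
print(get_cell(table, "user#002", "email"))
```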

Pig

Scripting language that compiles into MapReduce jobs. Runs on top of Hadoop

Hive

SQL-like data warehousing system and language that runs on top of Hadoop

Sqoop

Sqoop imports data from a relational database system or a mainframe into HDFS.

Oozie

Workflow scheduler system to manage Apache Hadoop jobs

Cassandra

Wide-column store based on ideas of Bigtable. Has a Query Language

HBase

Wide-column store based on ideas of Bigtable. No Query Language

Redis

Very fast in-memory data structure store

Kafka

Pub/sub message queue

Impala

Use SQL to query data in HDFS or HBase. Comparable to BQ or Spanner

Stackdriver: Monitoring

Full-stack monitoring for Google Cloud Platform and Amazon Web Services

Stackdriver: Logging

Real-time log management and analysis

Stackdriver: Error Reporting

Identify and understand your application errors

Stackdriver: Debugger

Investigate your code's behavior in production

Stackdriver: Trace

Find performance bottlenecks in production.

Stackdriver: Profiler

Identify patterns of CPU, time, and memory consumption in production

YARN

Yet Another Resource Negotiator: the resource manager for Hadoop. Allows arbitrary applications to be executed on a Hadoop cluster

Giraph

Graph processing on Hadoop

BigQuery ML

Train ML models entirely within BQ

AutoML Tables

Beta: Automatically build and deploy ML models on structured data

Cloud Inference API

Alpha: Run large-scale correlations over typed time-series datasets

Recommendations AI

Beta: Build an end-to-end personalized recommendation system

Cloud AutoML

Beta: Train custom ML models through a graphical interface, with little ML expertise required

When to use synchronous speech recognition?

Audio files shorter than ~1 min

When to use asynchronous speech recognition?

Audio files are long (longer than 1 min)
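
The two cards above amount to a simple rule of thumb. A minimal sketch: `choose_recognition_mode` is a hypothetical helper, and the 60-second cutoff is the approximate figure from the cards, not an exact service limit.

```python
SYNC_LIMIT_SECONDS = 60  # approximate cutoff taken from the cards above


def choose_recognition_mode(duration_seconds: float) -> str:
    """Return which style of speech recognition request suits the audio length."""
    if duration_seconds < SYNC_LIMIT_SECONDS:
        return "synchronous"
    return "asynchronous"


for seconds in (15, 59.9, 60, 600):
    print(seconds, choose_recognition_mode(seconds))
```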

SSML

Speech Synthesis Markup Language. Has markup for emphasis, pitch, contour, volume, etc.
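
A small SSML document using some of the markup the card mentions. The `speak`, `emphasis`, and `prosody` elements (with `pitch` and `volume` attributes) are part of the W3C SSML specification; the sentence and the attribute values are made up, and the snippet only checks that the markup is well-formed XML rather than calling any text-to-speech service.

```python
import xml.etree.ElementTree as ET

ssml = (
    "<speak>"
    'This is <emphasis level="strong">important</emphasis>. '
    '<prosody pitch="+2st" volume="loud">This part is higher and louder.</prosody>'
    "</speak>"
)

root = ET.fromstring(ssml)  # raises ParseError if the markup is malformed
tags = [child.tag for child in root]
print(root.tag, tags)
```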

NLU

Natural Language Understanding

Bigtable: minimum nodes

When should you consider HDDs for Bigtable?

Generally only when storing at least 10 TB of data that is not latency-sensitive

How many clusters can you have in a Bigtable instance?

Bigtable: recommended row size limit

100 MB

Bigtable: recommended column value limit

10 MB

Max size of single object in GCS

5 TB

Cloud SQL: About how big should you go?

10 TB

Size of Largest Datastore Entity

1 MiB

Bigtable: Hard limit row size

256 MB

Bigtable: max tables per instance

1,000 tables

(83 cards)