Adswerve Study Guide Flashcards

1
Q

What is BigQuery?

A

BigQuery is Google Cloud's serverless, petabyte-scale data warehouse

2
Q

BigQuery Formats

A

Avro, CSV, JSON (newline-delimited), ORC, Parquet, Cloud Datastore exports, Cloud Firestore exports

3
Q

Parquet

A

A columnar data format commonly stored on HDFS; BigQuery can load it

4
Q

ORC

A

(Optimized Row Columnar) A columnar data format commonly stored on HDFS; BigQuery can load it

5
Q

Hadoop

A

A software framework for distributed storage and processing of big data

6
Q

HDFS

A

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware.

7
Q

Dataproc: HDFS or GCS

A

Google recommends using GCS (Cloud Storage) for storage instead of HDFS

8
Q

Why does Google recommend Pub/Sub over Kafka?

A

Better scaling, fully managed service

9
Q

How long can Kafka retain messages?

A

As long as you configure it to; retention is configurable per broker or per topic, including indefinitely
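As a sketch of what that configuration looks like (the values shown are Kafka's documented settings, not from this guide):

```properties
# server.properties (broker-level default retention)
log.retention.hours=168   # 7 days

# topic-level override
retention.ms=-1           # -1 = retain messages indefinitely
```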

10
Q

How long can PubSub retain messages?

A

Up to 7 days (for unacknowledged messages)

11
Q

Is Kafka push or pull?

A

Pull

12
Q

Is PubSub push or pull?

A

Both: a subscription can be configured as push or pull

13
Q

Does Kafka guarantee ordering?

A

Yes, within a partition (not across partitions)

14
Q

Does PubSub guarantee ordering?

A

No

15
Q

Kafka Delivery Guarantee

A

At most once, at least once, exactly once (limited)

16
Q

PubSub Delivery Guarantee

A

At least once for each subscription

17
Q

Spark

A

A data-processing framework that keeps working data in RAM; commonly runs on top of Hadoop (via YARN), though it can also run standalone

18
Q

What is BQ's default encoding?

A

UTF-8

19
Q

Dataproc: is HDFS data persistent?

A

No; it is deleted when the Dataproc cluster is shut down

20
Q

Dataproc: is GCS data persistent?

A

Yes, it remains even when a cluster is shut down

21
Q

BigTable: What causes hotspotting?

A

Sequential (monotonically increasing) row keys, e.g. timestamp-based keys: 20190101, 20190102, 20190103, …

22
Q

BigTable: How to prevent hotspotting?

A

Make row keys non-sequential, e.g. by salting or hashing a prefix. Example keys: a93js-20190101, vomdn-20190102, odsjs-20190103
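A minimal sketch of key salting in Python (the function name and the choice of 8 prefixes are illustrative, not from this guide):

```python
import hashlib

def salted_row_key(date_str: str, num_prefixes: int = 8) -> str:
    """Prepend a deterministic hash-derived prefix to a date key so that
    sequential dates spread across tablets instead of hotspotting one."""
    digest = hashlib.md5(date_str.encode("utf-8")).hexdigest()
    prefix = int(digest, 16) % num_prefixes
    return f"{prefix}-{date_str}"

keys = [salted_row_key(d) for d in ("20190101", "20190102", "20190103")]
```

Because the prefix is derived from the date itself, reads remain deterministic: a reader can recompute the same key for any date.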

23
Q

What is ANSI SQL?

A

The standardized SQL dialect; BigQuery's Standard SQL is compliant with the ANSI SQL 2011 standard

24
Q

Dataproc: when to create a cluster?

A

Clusters should be job-specific (ephemeral): create a separate cluster for each job and delete it when the job completes

25
Which is simpler: cbt or the HBase shell?
cbt
26
Databases: What does an index do?
Improves lookup speed on a specific column
27
What is an MID?
Machine-generated IDentifier. A unique identifier for an entity in Google's Knowledge Graph
28
BQ: Does LIMIT clause reduce cost?
No. LIMIT only restricts the rows returned; all scanned data is still read and billed
29
BQ: Can you change an existing table to use partitions?
No. You must create a partitioned table from scratch
30
BQ: which column specifies a partition?
_PARTITIONTIME
31
BQ: which column specifies a shard (wildcard)?
_TABLE_SUFFIX
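The two pseudo columns above can be sketched in a pair of example queries (the `mydataset.events` dataset and table names are hypothetical):

```python
# Hypothetical sharded tables: mydataset.events_20190101, events_20190102, ...
wildcard_query = """
SELECT user_id
FROM `mydataset.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20190101' AND '20190107'
"""

# Hypothetical ingestion-time-partitioned table: mydataset.events
partition_query = """
SELECT user_id
FROM `mydataset.events`
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2019-01-01') AND TIMESTAMP('2019-01-07')
"""
```

In both cases the WHERE clause prunes the shards or partitions scanned, which (unlike LIMIT) reduces billed bytes.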
32
BQ: At what levels can you control access?
Project and Dataset
33
BQ: Can you limit access to a table?
No. Dataset is the most granular level of access
34
BQ: How long are query results cached?
24 hours
35
BQ: Majority of time spent in wait stage
Queries are waiting for available slots. Options: buy more slots, query smaller datasets, or speed up compute stages so slots free up sooner
36
BQ: Majority of time spent in read stage
All other operations cost less than the base cost of reading the input data; this is the ideal case. Partitioning commonly used tables can help, but there is little else to do.
37
BQ: Majority of time spent in compute stage
Filter as early as possible. Pre-calculate common calculations.
38
BQ: Majority of time spent in write stage
Expected if we emit more data than was originally read from inputs.
39
Which storage options have transactions?
Cloud SQL, Spanner, Datastore
40
Which storage options have high throughput?
BigTable
41
When to use datastore?
NoSQL. Fast reads, slower writes
42
When to use bigtable?
NoSQL. Massive throughput
43
PubSub: What is a push subscription?
PubSub sends message to preconfigured endpoint
44
PubSub: What is a pull subscription?
Subscriber must ask PubSub for a message
45
PubSub: How to get message ordering?
Include sequence information in the message
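Since Pub/Sub (per this guide) does not guarantee ordering, the subscriber can restore order itself. A minimal sketch, assuming each publisher attaches a monotonically increasing `seq` field (the dict shape here is illustrative):

```python
def reorder(messages):
    """Restore publish order from per-message sequence numbers.

    Each message is a dict like {"seq": int, "data": ...} as it might
    arrive, out of order, from a pull subscription."""
    return [m["data"] for m in sorted(messages, key=lambda m: m["seq"])]

# Messages arriving out of order:
arrived = [{"seq": 2, "data": "c"}, {"seq": 0, "data": "a"}, {"seq": 1, "data": "b"}]
# reorder(arrived) → ["a", "b", "c"]
```

A real subscriber would buffer until it holds a contiguous run of sequence numbers before emitting; this sketch assumes all messages have arrived.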
46
PubSub: Subscriptions are auto-deleted after how many days of inactivity by default?
31
47
Wide-column store
A NoSQL database with rows and columns, where the names and format of columns can vary from row to row; essentially a two-dimensional key-value store
48
Pig
A scripting language that compiles into MapReduce jobs; runs on top of Hadoop
49
Hive
A SQL-like data warehousing system and query language that runs on top of Hadoop
50
Sqoop
Sqoop imports data from a relational database system or a mainframe into HDFS.
51
Oozie
Workflow scheduler system to manage Apache Hadoop jobs
52
Cassandra
Wide-column store based on ideas of BigTable. Has a Query Language
53
HBase
Wide-column store based on ideas of BigTable. No Query Language
54
Redis
Very fast in-memory data structure store
55
Kafka
Pub/sub message queue
56
Impala
A SQL query engine for data in HDFS or HBase; roughly analogous to BigQuery
57
Stackdriver: Monitoring
Full-stack monitoring for Google Cloud Platform and Amazon Web Services
58
Stackdriver: Logging
Real-time log management and analysis
59
Stackdriver: Error Reporting
Identify and understand your application errors
60
Stackdriver: Debugger
Investigate your code's behavior in production
61
Stackdriver: Trace
Find performance bottlenecks in production.
62
Stackdriver: Profiler
Identify patterns of CPU, time, and memory consumption in production
63
YARN
(Yet Another Resource Negotiator) Hadoop's resource manager; allows arbitrary applications to be executed on a Hadoop cluster
64
Giraph
Graph processing on hadoop
65
BigQueryML
Train ML models entirely within BQ
66
AutoML Tables
Beta: Automatically build and deploy ML models on structured data
67
Cloud Inference API
Alpha: Run large-scale correlations over typed time-series datasets
68
Recommendations AI
Beta: Build an end-to-end personalized recommendation system
69
Cloud AutoML
Beta: Train custom ML models through a graphical interface, with minimal coding required
70
When to use synchronous speech recognition?
Audio files shorter than ~1 min
71
When to use asynchronous speech recognition?
Audio files are long (longer than 1 min)
72
SSML
Speech Synthesis Markup Language. Has markup for emphasis, pitch, contour, volume, etc.
73
NLU
Natural Language Understanding
74
BigTable Minimum Nodes
3
75
When should you consider HDDs for Bigtable
Generally only when storing at least 10 TB of data and the data is not latency-sensitive
76
How many clusters can you have in a Bigtable instance
4
77
Bigtable: recommended row size limit
100 MB
78
Bigtable: recommended column value limit
10 MB
79
Max size of single object in GCS
5 TB
80
Cloud SQL: About how big you should go?
10 TB
81
Size of Largest Datastore Entity
1 MiB
82
Bigtable: Hard limit row size
256 MB
83
Bigtable: max tables per instance
1,000 tables