Big Data Flashcards

Question 1

Q

What defines Big Data (3V)

Answer

A

Volume, Velocity, Variety

Question 2

Q

What is Volume

Answer

A

The scale of information being handled by a data processing system

Question 3

Q

What is Velocity

Answer

A

The speed at which data is being processed: ingested, analyzed, and visualized

Question 4

Q

What is Variety

Answer

A

The diversity of data sources, formats, and quality

Question 5

Q

Data Warehouses

Answer

A

Structured or Processed: Data is organized, may have been transformed, and is stored in a structured way
Ready to use: Data exists in the warehouse for a defined purpose, and in a format where it is ready to be consumed
Rigid: Data may be easier to understand, but less up-to-date. Structures are hard to change

Question 6

Q

Data Lakes

Answer

A

Raw or Unstructured: The data lake contains all raw unprocessed data, before any kind of transformation or organization
Ready to analyze: Data is more up to date, but may require more advanced tools for analysis
Flexible: No structure is enforced, so new types of data can be added at any time

Question 7

Q

OLTP

Answer

A

High volume of short transactions
Fast queries
High integrity

MODIFY DATA

Question 8

Q

OLAP

Answer

A

Low volume of long-running queries
Aggregated historical data

QUERY DATA
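
The OLTP and OLAP cards can be contrasted in a minimal sketch using Python's built-in sqlite3 module. The `orders` table, its columns, and the sample rows are hypothetical, chosen only for illustration. The loop shows OLTP-style short transactions that modify data. The final query shows an OLAP-style aggregation over the accumulated rows.

```python
import sqlite3

def run_demo():
    # In-memory database with a hypothetical "orders" table.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
    )

    # OLTP-style work: many short transactions that MODIFY data.
    for region, amount in [("eu", 10.0), ("eu", 15.0), ("us", 20.0)]:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO orders (region, amount) VALUES (?, ?)",
                (region, amount),
            )

    # OLAP-style work: one query that aggregates historical data (QUERY only).
    rows = conn.execute(
        "SELECT region, COUNT(*), SUM(amount) "
        "FROM orders GROUP BY region ORDER BY region"
    ).fetchall()
    conn.close()
    return rows
```

In practice the two workloads usually run on different systems. A single SQLite database is used here only to keep the sketch self-contained.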

Question 9

Q

Stages of a Data Pipeline

Answer

A

Ingestion
Storage
Processing
Visualization
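
The four stages above can be sketched as a toy pipeline in plain Python. All function names and the `key,amount` input format are assumptions made for illustration. A real pipeline would use dedicated services for each stage.

```python
def ingest(raw_lines):
    """Ingestion: parse raw 'key,amount' text lines into records."""
    return [line.strip().split(",") for line in raw_lines if line.strip()]

def store(records, storage):
    """Storage: persist records (here, simply an in-memory list)."""
    storage.extend(records)
    return storage

def process(storage):
    """Processing: total the amounts per key."""
    totals = {}
    for key, amount in storage:
        totals[key] = totals.get(key, 0) + int(amount)
    return totals

def visualize(totals):
    """Visualization: render a tiny text bar chart, one line per key."""
    return [f"{key}: {'#' * value}" for key, value in sorted(totals.items())]

def run_pipeline(raw_lines):
    # Chain the four stages in order.
    return visualize(process(store(ingest(raw_lines), [])))
```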

Question 10

Q

Data ingestion Technical Challenges

Answer

A

choose the correct compute and storage options. Otherwise, a solution can be too expensive or too slow
data should have value
security of data

Question 11

Q

Common data transformations

Answer

A

formatting
labeling
filtering
validating
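
A minimal sketch of the four transformations applied to a list of records. The field names (`name`, `amount`, `size`) and the threshold of 100 are hypothetical, not taken from the card.

```python
def transform(records):
    cleaned = []
    for rec in records:
        # Validating: skip records whose amount is missing or not numeric.
        try:
            amount = float(rec["amount"])
        except (KeyError, TypeError, ValueError):
            continue
        # Filtering: drop records that are not relevant (non-positive amounts).
        if amount <= 0:
            continue
        # Formatting: normalize whitespace and capitalization of the name.
        name = str(rec.get("name", "")).strip().title()
        # Labeling: attach a category derived from the data.
        label = "large" if amount >= 100 else "small"
        cleaned.append({"name": name, "amount": amount, "size": label})
    return cleaned
```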

Question 12

Q

Stages of Data Modeling

Answer

A

Conceptual. What are the entities in my data? What are their attributes and relationships?
Logical
Physical

Question 13

Q

Google Cloud Storage (GCS)

Answer

A

Fully managed object storage
For unstructured data: images, videos. Access via API or programmatic SDKs
Multiple storage classes
Instant access in all classes. Lifecycle management for objects and buckets
Secure and durable
Secure access control. High availability and maximum durability

Question 14

Q

Google Cloud Storage concepts (buckets)

Answer

A

a bucket is a logical container for objects
buckets exist within projects
bucket names exist within a global namespace
a bucket can be:
- regional
- dual-regional
- multi-regional

Question 15

Q

Storage classes in GCS

Answer

A

Standard
Nearline
Coldline
Archive

Question 16

Q

Standard storage class in GCS

Answer

A

minimum storage: -
storage fee (per GB): $0.02
retrieval fee: -
regional availability: 99.99%
multi and dual reg.: > 99.99%

Question 17

Q

Nearline storage class in GCS

Answer

A

minimum storage: 30 days
storage fee (per GB): $0.01
retrieval fee: $0.01
regional availability: 99.9%
multi and dual reg.: 99.95%

Question 18

Q

Coldline storage class in GCS

Answer

A

minimum storage: 90 days
storage fee (per GB): $0.004
retrieval fee: $0.02
regional availability: 99.9%
multi and dual reg.: 99.95%

Question 19

Q

Archive storage class in GCS

Answer

A

minimum storage: 365 days
storage fee (per GB): $0.0012
retrieval fee: $0.05
regional availability: 99.9%
multi and dual reg.: 99.95%
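
The trade-off between the four storage classes can be worked through as arithmetic. This sketch copies the per-GB figures from the cards above and assumes the storage fee is billed per GB per month and the retrieval fee per GB read. It ignores minimum storage durations, operation charges, and network charges, and real prices vary by location and change over time.

```python
# Per-GB figures copied from the storage class cards; treat them as illustrative.
STORAGE_CLASSES = {
    "standard": {"storage_fee": 0.02, "retrieval_fee": 0.0},
    "nearline": {"storage_fee": 0.01, "retrieval_fee": 0.01},
    "coldline": {"storage_fee": 0.004, "retrieval_fee": 0.02},
    "archive": {"storage_fee": 0.0012, "retrieval_fee": 0.05},
}

def monthly_cost(storage_class, stored_gb, retrieved_gb):
    """Storage plus retrieval cost for one month (assumed billing model)."""
    fees = STORAGE_CLASSES[storage_class]
    return stored_gb * fees["storage_fee"] + retrieved_gb * fees["retrieval_fee"]

def cheapest_class(stored_gb, retrieved_gb):
    """Name of the class with the lowest monthly cost for this access pattern."""
    return min(
        STORAGE_CLASSES,
        key=lambda name: monthly_cost(name, stored_gb, retrieved_gb),
    )
```

With these figures, data that is never read is cheapest in Archive, while data that is read back heavily each month is cheapest in Standard.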

Question 20

Q

Objects in Google Cloud Storage

Answer

A

Objects are stored as opaque data
Objects are immutable
Overwrites are atomic
Objects can be versioned (optionally)

Question 21

Q

Accessing Buckets and Objects

Answer

A

Google Cloud Console
HTTP API
SDKs
gsutil (command line tool)

Question 22

Q

Advanced features of Google Cloud Storage

Answer

A

Parallel uploads of composite objects
Integrity checking
Transcoding
Requester Pays
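
Integrity checking can be sketched with the standard library alone. GCS reports an object's MD5 hash as a base64-encoded string, so a client can hash its local bytes the same way and compare. The function names below are hypothetical, and `reported_md5` stands for a value that would be read from the object's metadata.

```python
import base64
import hashlib

def md5_base64(data: bytes) -> str:
    """Base64-encoded MD5 digest of the given bytes."""
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")

def is_intact(data: bytes, reported_md5: str) -> bool:
    """True if the local bytes match the hash reported for the object."""
    return md5_base64(data) == reported_md5
```

GCS also supports CRC32C checksums. MD5 is used here only because it is available in the Python standard library.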

Question 23

Q

Google Cloud Storage Costs

Answer

A

operation charges
network charges
data retrieval charges

Question 24

Q

Google Cloud Storage lifecycle management

Answer

A

apply a lifecycle configuration to a bucket
GCS periodically checks configuration
matching rules applied to objects
delete objects or set storage classes

the lifecycle management configuration file is a JSON file
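
A minimal sketch of what such a JSON lifecycle configuration might contain, built with Python's json module. The rule layout (`rule`, `action`, `condition`, `age`) follows the GCS lifecycle configuration format as commonly documented, and the day thresholds and function names are assumptions. Verify the field names against the current GCS documentation before applying a configuration to a bucket.

```python
import json

def build_lifecycle_config(nearline_after_days, delete_after_days):
    """Two rules: move objects to Nearline, then delete them later."""
    return {
        "rule": [
            {
                # Set a colder storage class once objects reach a given age.
                "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
                "condition": {"age": nearline_after_days},
            },
            {
                # Delete objects once they reach a greater age.
                "action": {"type": "Delete"},
                "condition": {"age": delete_after_days},
            },
        ]
    }

def to_json(config):
    """Serialize the configuration as it would be written to a file."""
    return json.dumps(config, indent=2)
```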

Question 25

Q

Security and Access Control in GCS

Answer

A

IAM for bulk access to buckets
ACLs (access control lists) for granular access to buckets
Signed URLs for temporary access
Signed policy documents

Question 26

Q

Amazon analog of Google Cloud SQL

Answer

A

Amazon RDS

Question 27

Q

Google Cloud SQL

Answer

A

Managed SQL instances
Automate instance and database creation, replication, backups, patches and updates
Multiple database engines
MySQL 5.6 and 5.7, PostgreSQL 9.6 or 11, SQL Server in beta
Scalability and availability
Vertically scale to 64 cores and 416 GB RAM. Live migration and less configurations

Question 28

Q

Google Cloud Firestore replaces…

Answer

A

Cloud DataStore

Question 29

Q

Amazon analog of Google Cloud Firestore

Answer

A

Amazon DynamoDB

Question 30

Q

Google Cloud Firestore

Answer

A

Fully managed NoSQL database
Serverless autoscaling NoSQL document store. Integrated with GCP and Firebase
Realtime DB with mobile SDK
Android and iOS client libraries, frameworks for all popular programming languages
Scalability and consistency
Horizontal autoscaling and strong consistency, with support of ACID transactions

Question 31

Q

Firestore Data Model

Answer

A

it’s a document store
a document is just some JSON data
documents bundled together into a collection
documents can contain nested sub-collections
references
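
The data model above can be sketched with plain Python dictionaries. This is not the Firestore client library. The layout (`fields`, `collections`), the sample data, and the helper names are hypothetical. The sketch relies on one property of Firestore paths: segments alternate between collection and document, so a path with an even number of segments names a document.

```python
# Hypothetical in-memory stand-in for a Firestore database.
database = {
    "users": {  # collection
        "alice": {  # document
            "fields": {"name": "Alice", "age": 30},
            "collections": {
                "orders": {  # nested sub-collection
                    "order1": {
                        # "buyer" holds a reference: the path of another document.
                        "fields": {"total": 42, "buyer": "users/alice"},
                        "collections": {},
                    }
                }
            },
        }
    }
}

def split_path(path):
    segments = [s for s in path.split("/") if s]
    if not segments:
        raise ValueError("path must not be empty")
    return segments

def path_kind(path):
    """Even number of segments: document. Odd: collection."""
    return "document" if len(split_path(path)) % 2 == 0 else "collection"

def get_document(db, path):
    """Walk collection/document pairs and return the document's fields."""
    segments = split_path(path)
    if len(segments) % 2 != 0:
        raise ValueError("not a document path: " + path)
    collections, doc = db, None
    for i in range(0, len(segments), 2):
        doc = collections[segments[i]][segments[i + 1]]
        collections = doc["collections"]
    return doc["fields"]
```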

Question 32

Q

Firestore supported datatypes

Answer

A

String, integer, boolean, float, null
Bytes, date and time, geographical point
Array and map
Reference (to document)

Question 33

Q

Indexes in Cloud Firestore

Answer

A

Automatic single-field indexes
Index exemption
Composite indexes

Question 34

Q

Google Cloud Spanner

Answer

A

Managed SQL-compliant DB
SQL (ANSI 2011) schemas and queries with ACID transactions
Horizontally scalable
Strong consistency across rows and regions, from 1 to 1000s of nodes
Highly available
Automatic global replication, no planned downtime and 99.999% SLA

Question 35

Q

CAP theorem

Answer

A

any 2 of 3:
- Consistency
- Availability
- Partition tolerance

Question 36

Q

Google Cloud Spanner is (CAP)

Answer

A

CP system. Sometimes it sacrifices availability for consistency