Big Data Flashcards

1
Q

What defines Big Data (the 3 Vs)?

A

Volume, Velocity, Variety

2
Q

What is Volume?

A

The scale of information handled by a data processing system

3
Q

What is Velocity?

A

The speed at which data is being processed: ingested, analyzed, and visualized

4
Q

What is Variety?

A

The diversity of data sources, formats, and quality

5
Q

Data Warehouses

A
  1. Structured or Processed: Data is organized, may have been transformed, and is stored in a structured way
  2. Ready to use: Data exists in the warehouse for a defined purpose, and in a format where it is ready to be consumed
  3. Rigid: Data may be easier to understand, but less up-to-date. Structures are hard to change
6
Q

Data Lakes

A
  1. Raw or Unstructured: The data lake contains all raw unprocessed data, before any kind of transformation or organization
  2. Ready to analyze: Data is more up to date, but may require more advanced tools for analysis
  3. Flexible: No structure is enforced, so new types of data can be added at any time
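
The contrast between a rigid, structured warehouse and a flexible, raw lake can be sketched in a few lines of Python. This is only an illustration, not how any particular product works: the `Warehouse` and `Lake` classes, the `SCHEMA` mapping, and the sample records are all hypothetical.

```python
import json

# Hypothetical fixed schema: the warehouse only accepts these fields and types.
SCHEMA = {"user_id": int, "amount": float}


class Warehouse:
    """Structured store: records are validated and shaped before storage."""

    def __init__(self):
        self.rows = []

    def load(self, record):
        # Reject anything that does not match the predefined structure.
        if set(record) != set(SCHEMA):
            raise ValueError(f"unexpected fields: {sorted(record)}")
        self.rows.append({k: SCHEMA[k](record[k]) for k in SCHEMA})


class Lake:
    """Raw store: anything is accepted as-is and interpreted at read time."""

    def __init__(self):
        self.blobs = []

    def load(self, raw):
        self.blobs.append(raw)  # no structure enforced

    def read_json(self):
        # The analysis step has to cope with whatever was stored.
        out = []
        for blob in self.blobs:
            try:
                out.append(json.loads(blob))
            except json.JSONDecodeError:
                continue  # skip blobs that are not JSON
        return out


warehouse = Warehouse()
warehouse.load({"user_id": "7", "amount": "19.5"})

lake = Lake()
lake.load('{"user_id": 7, "amount": 19.5, "coupon": "SPRING"}')
lake.load("free-form log line")
```

The warehouse would reject the record with the extra `coupon` field, while the lake keeps it and leaves the interpretation to whoever reads the data later.
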
7
Q

OLTP

A
  • High volume of short transactions
  • Fast queries
  • High integrity

MODIFY DATA

8
Q

OLAP

A
  • Low volume of long-running queries
  • Aggregated historical data

QUERY DATA
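
The difference between the OLTP and OLAP access patterns can be shown with a small sketch using Python's built-in `sqlite3` module. The `orders` table and its rows are made up for illustration; real OLTP and OLAP workloads normally run on separate, purpose-built systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")

# OLTP style: many short transactions, each modifying a little data.
for region, amount in [("eu", 10.0), ("eu", 5.0), ("us", 20.0)]:
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO orders (region, amount) VALUES (?, ?)", (region, amount)
        )

# OLAP style: one read-only query that aggregates historical data.
totals = dict(
    conn.execute(
        "SELECT region, SUM(amount) FROM orders GROUP BY region"
    ).fetchall()
)
```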

9
Q

Stages of a Data Pipeline

A
  1. Ingestion
  2. Storage
  3. Processing
  4. Visualization
10
Q

Data ingestion Technical Challenges

A
  • choose the correct compute and storage options. Otherwise, a solution can be too expensive or too slow
  • data should have value
  • security of data
11
Q

Common data transformations

A
  • formatting
  • labeling
  • filtering
  • validating
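
The four transformations listed above can be sketched on a toy dataset. The event fields, the 30-degree threshold, and the function names are assumptions chosen for the example, not part of the original cards.

```python
from datetime import datetime

raw_events = [
    {"ts": "2021-03-01 10:00:00", "temp_c": "21.5"},
    {"ts": "2021-03-01 10:05:00", "temp_c": "not-a-number"},
    {"ts": "2021-03-01 10:10:00", "temp_c": "35.0"},
]


def validate(event):
    """Validating: accept only events whose reading parses as a number."""
    try:
        float(event["temp_c"])
        return True
    except ValueError:
        return False


def format_event(event):
    """Formatting: convert strings into typed values and ISO 8601 timestamps."""
    ts = datetime.strptime(event["ts"], "%Y-%m-%d %H:%M:%S")
    return {"ts": ts.isoformat(), "temp_c": float(event["temp_c"])}


def label(event):
    """Labeling: attach a derived category to each event."""
    return {**event, "label": "hot" if event["temp_c"] >= 30 else "normal"}


valid = [e for e in raw_events if validate(e)]
formatted = [format_event(e) for e in valid]
labeled = [label(e) for e in formatted]
# Filtering: keep only the events of interest.
hot_only = [e for e in labeled if e["label"] == "hot"]
```
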
12
Q

Stages of Data Modeling

A
  1. Conceptual. What are the entities in my data? What are their attributes and relationships?
  2. Logical
  3. Physical
13
Q

Google Cloud Storage (GCS)

A
  • Fully managed object storage
    For unstructured data: images, videos. Access via API or programmatic SDKs
  • Multiple storage classes
    Instant access in all classes. Lifecycle management for objects and buckets
  • Secure and durable
    Secure access control. High availability and maximum durability
14
Q

Google Cloud Storage concepts (buckets)

A
  • a bucket is a logical container for objects
  • buckets exist within projects
  • bucket names exist within a global namespace
  • a bucket can be:
    - regional
    - dual-regional
    - multi-regional
15
Q

Storage classes in GCS

A
  • Standard
  • Nearline
  • Coldline
  • Archive
16
Q

Standard storage class in GCS

A

minimum storage: -
storage fee (per GB): $0.02
retrieval fee: -
regional availability: 99.99%
multi and dual reg.: > 99.99%

17
Q

Nearline storage class in GCS

A

minimum storage: 30 days
storage fee (per GB): $0.01
retrieval fee: $0.01
regional availability: 99.9%
multi and dual reg.: 99.95%

18
Q

Coldline storage class in GCS

A

minimum storage: 90 days
storage fee (per GB): $0.004
retrieval fee: $0.02
regional availability: 99.9%
multi and dual reg.: 99.95%

19
Q

Archive storage class in GCS

A

minimum storage: 365 days
storage fee (per GB): $0.0012
retrieval fee: $0.05
regional availability: 99.9%
multi and dual reg.: 99.95%
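
The trade-off between the four storage classes (cheaper storage, more expensive retrieval) can be made concrete with a little arithmetic. The sketch below copies the fees from the cards above and assumes the storage fee is charged per GB per month and the retrieval fee per GB retrieved. It ignores minimum storage durations, operation charges, and network charges, and the function names are made up for the example.

```python
# Figures copied from the storage class cards; treat them as illustrative,
# since actual prices vary by region and change over time.
CLASSES = {
    "Standard": {"storage": 0.02, "retrieval": 0.0},
    "Nearline": {"storage": 0.01, "retrieval": 0.01},
    "Coldline": {"storage": 0.004, "retrieval": 0.02},
    "Archive": {"storage": 0.0012, "retrieval": 0.05},
}


def monthly_cost(storage_class, stored_gb, retrieved_gb):
    """Storage fee plus retrieval fee for one month, in dollars."""
    fees = CLASSES[storage_class]
    return stored_gb * fees["storage"] + retrieved_gb * fees["retrieval"]


def cheapest(stored_gb, retrieved_gb):
    """Name of the class with the lowest combined cost for this access pattern."""
    return min(CLASSES, key=lambda c: monthly_cost(c, stored_gb, retrieved_gb))
```

For example, `cheapest(1000, 0)` picks Archive for data that is never read, while `cheapest(1000, 2000)` picks Standard for data that is read twice over each month.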

20
Q

Objects in Google Cloud Storage

A
  • Objects are stored as opaque data
  • Objects are immutable
  • Overwrites are atomic
  • Objects can be versioned (optionally)
21
Q

Accessing Buckets and Objects

A
  • Google Cloud Console
  • HTTP API
  • SDKs
  • gsutil (command line tool)
22
Q

Advanced features of Google Cloud Storage

A
  • Parallel uploads of composite objects
  • Integrity checking
  • Transcoding
  • Requester Pays
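
Integrity checking comes down to comparing a hash of the local bytes with the hash the service reports for the stored object. Cloud Storage reports object hashes as base64-encoded strings; the sketch below uses MD5 because it is available in Python's standard library. The helper names are hypothetical, and `reported` is a stand-in for a value that would normally come from the object's metadata.

```python
import base64
import hashlib


def md5_base64(data: bytes) -> str:
    """Base64-encoded MD5 digest of the given bytes."""
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")


def is_intact(data: bytes, expected_md5_b64: str) -> bool:
    """True when the local bytes hash to the expected value."""
    return md5_base64(data) == expected_md5_b64


payload = b"hello world"
reported = md5_base64(payload)  # stand-in for the hash reported by the service
```
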
23
Q

Google Cloud Storage Costs

A
  • operation charges
  • network charges
  • data retrieval charges
24
Q

Google Cloud Storage Lifecycle Management

A
  • apply a lifecycle configuration to a bucket
  • GCS periodically checks the configuration
  • matching rules are applied to objects
  • rules delete objects or set storage classes

The lifecycle management configuration is a JSON file
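
A lifecycle configuration of this kind might look like the one built below. The policy itself is hypothetical (move Standard objects to Nearline after 30 days, delete everything after 365 days); the sketch builds it as a Python dictionary and serializes it to JSON.

```python
import json

# Hypothetical policy: each rule pairs an action with the condition
# under which it applies.
lifecycle_config = {
    "lifecycle": {
        "rule": [
            {
                "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
                "condition": {"age": 30, "matchesStorageClass": ["STANDARD"]},
            },
            {
                "action": {"type": "Delete"},
                "condition": {"age": 365},
            },
        ]
    }
}

config_json = json.dumps(lifecycle_config, indent=2)
```

Saved to a file such as `lifecycle.json`, the result could then be applied with `gsutil lifecycle set lifecycle.json gs://BUCKET_NAME`, where `BUCKET_NAME` is a placeholder.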

25
Q

Security and Access Control in GCS

A
  • IAM for bulk access to buckets
  • ACLs (access control lists) for granular access to buckets
  • Signed URLs for temporary access
  • Signed policy documents
26
Q

Amazon analog of Google Cloud SQL

A

Amazon RDS

27
Q

Google Cloud SQL

A
  • Managed SQL instances
    Automate instance and database creation, replication, backups, patches and updates
  • Multiple database engines
    MySQL 5.6 and 5.7, PostgreSQL 9.6 or 11, SQL Server in beta
  • Scalability and availability
    Vertically scale to 64 cores and 416 GB RAM. Live migration and less configurations
28
Q

Google Cloud Firestore replaces…

A

Cloud Datastore

29
Q

Amazon analog of Google Cloud Firestore

A

Amazon DynamoDB

30
Q

Google Cloud Firestore

A
  • Fully managed NoSQL database
    Serverless autoscaling NoSQL document store. Integrated with GCP and Firebase
  • Realtime DB with mobile SDK
    Android and iOS client libraries, frameworks for all popular programming languages
  • Scalability and consistency
    Horizontal autoscaling and strong consistency, with support of ACID transactions
31
Q

Firestore Data Model

A
  • it’s a document store
  • a document is just some JSON data
  • documents are bundled together into collections
  • documents can contain nested subcollections
  • references
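
The data model above (collections of documents, nested subcollections, and references) can be mimicked with plain Python dictionaries. This is a toy model rather than the Firestore client API: the `db` layout, the `get_document` helper, and the use of a path string as a reference are all assumptions made for illustration.

```python
# Hypothetical data: collections map document IDs to documents; a document
# holds fields plus, optionally, named subcollections.
db = {
    "users": {
        "alice": {
            "fields": {"name": "Alice", "age": 30},
            "subcollections": {
                "orders": {
                    "order1": {
                        "fields": {"total": 12.5, "buyer": "users/alice"},
                        "subcollections": {},
                    }
                }
            },
        }
    }
}


def get_document(root, path):
    """Resolve a path of alternating collection and document IDs to its fields."""
    parts = path.split("/")
    if len(parts) % 2 != 0:
        raise ValueError("a document path needs an even number of segments")
    collections = root
    doc = None
    for i in range(0, len(parts), 2):
        doc = collections[parts[i]][parts[i + 1]]
        collections = doc["subcollections"]
    return doc["fields"]


order = get_document(db, "users/alice/orders/order1")
buyer = get_document(db, order["buyer"])  # follow the reference
```
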
32
Q

Firestore supported datatypes

A
  • String, integer, boolean, float, null
  • Bytes, date and time, geographical point
  • Array and map
  • Reference (to document)
33
Q

Indexes in Cloud Firestore

A
  • Automatic single-field indexes
  • Index exemption
  • Composite indexes
34
Q

Google Cloud Spanner

A
  • Managed SQL-compliant DB
    SQL (ANSI 2011) schemas and queries with ACID transactions
  • Horizontally scalable
    Strong consistency across rows and regions, from 1 to thousands of nodes
  • Highly available
    Automatic global replication, no planned downtime and 99.999% SLA
35
Q

CAP theorem

A

A distributed system can guarantee only 2 of 3:
- Consistency
- Availability
- Partition tolerance

36
Q

Google Cloud Spanner is (CAP)

A

CP system. Sometimes it sacrifices availability for consistency