GCP & ML Fundamentals Flashcards

1
Q

Big data challenges are…

A
  • Migrating existing data solutions
  • Analysing large datasets at scale
  • Building streaming pipelines
  • Applying machine learning
2
Q

Google Cloud Platform is made up of three layers…

A

Top: Big Data & ML Products

Middle: Compute Power, Storage, Networking

Bottom: Security

3
Q

Google Cloud Storage bucket names have to be…

A

Globally unique.

4
Q

What is the command-line tool for Google Cloud?

A

gcloud is the general-purpose command-line tool; gsutil is the command-line tool for Cloud Storage.

5
Q

Are we charged when a virtual machine is ‘stopped’?

A

Yes - but only for the disk space, which keeps the VM and the software installed on it. We are not charged for CPU and memory while the VM is stopped.

6
Q

How does cloud computing differ from desktop computing?

A

In cloud computing, compute power and storage are independent of each other.

7
Q

What are the 4 Google Cloud Storage classes?

A

Standard, Nearline, Coldline and Archive

8
Q

What is Standard storage?

A

Best for frequently accessed ("hot") data and for data which is only stored for a short time.

9
Q

What is nearline storage?

A

Low cost and highly durable, for data which is read/modified about once a month or less.

10
Q

What is coldline storage?

A

Lower cost than Nearline and highly durable, for data which is read/modified about once a quarter or less.

11
Q

What is archive storage?

A

Lowest cost for storing data, used for archiving and online backup.
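
The four classes differ mainly in how often the data is expected to be accessed. A minimal Python sketch of that rule of thumb follows. The function name and the exact thresholds are illustrative assumptions (the monthly and quarterly cut-offs come from the cards, the "less than once a year" cut-off for Archive is assumed); this is not a Google API.

```python
def pick_storage_class(accesses_per_year: float) -> str:
    """Suggest a Cloud Storage class from expected access frequency.

    The thresholds are illustrative assumptions, not official limits.
    """
    if accesses_per_year > 12:   # more often than about once a month
        return "STANDARD"
    if accesses_per_year > 4:    # up to about once a month
        return "NEARLINE"
    if accesses_per_year >= 1:   # up to about once a quarter
        return "COLDLINE"
    return "ARCHIVE"             # less than once a year


for rate in (365, 12, 4, 0.5):
    print(rate, "->", pick_storage_class(rate))
```

In practice cost also depends on retrieval fees and minimum storage durations, so access frequency alone is only a first approximation.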

12
Q

What is a project?

A

Base level organising entity for creating and using resources, managing billing, APIs and permissions.

13
Q

What is a folder?

A

A folder contains multiple projects within an organisation.

14
Q

What is the root node of the GCP hierarchy?

A

Organisation

Organisation -> Folder -> Project -> Resources (e.g. a BigQuery dataset)

15
Q

Which Google service controls access to GCP resources?

A

Identity and Access Management

16
Q

What is IAM and what does it do?

A

Identity and Access Management - it allows you to fine-tune access control (who can do what) on all the GCP resources in use.
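
To make "who can do what" concrete, here is a small Python sketch that checks a member against a policy. It assumes the policy is a dictionary with a list of bindings, each pairing a role with its members, which mirrors the general shape of an IAM policy document. The helper name, the example members and the policy contents are hypothetical, and real IAM evaluation also covers inheritance, groups and conditions, which this sketch ignores.

```python
def has_role(policy: dict, member: str, role: str) -> bool:
    """Return True if `member` is directly bound to `role` in `policy`."""
    for binding in policy.get("bindings", []):
        if binding.get("role") == role and member in binding.get("members", []):
            return True
    return False


# Hypothetical policy for illustration only.
example_policy = {
    "bindings": [
        {
            "role": "roles/storage.objectViewer",
            "members": ["user:alice@example.com", "group:analysts@example.com"],
        },
        {
            "role": "roles/bigquery.dataViewer",
            "members": ["user:bob@example.com"],
        },
    ]
}

print(has_role(example_policy, "user:alice@example.com", "roles/storage.objectViewer"))
print(has_role(example_policy, "user:alice@example.com", "roles/bigquery.dataViewer"))
```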

17
Q

What is the main benefit of networking?

A

We don’t have to do everything on one machine if we have a fast enough network. Google's data center network enables the separation of compute and storage (data can be processed without copying it).

18
Q

What are edge points of presence (in networking)?

A

Edge points of presence are where Google's private network connects to the public internet.

19
Q

What are the different levels of security responsibility you can have?

A
  • On-premises (full responsibility)
  • Infrastructure as a Service (IaaS)
  • Platform as a Service (PaaS)
  • Managed Services (least responsibility)
20
Q

Which Google tool helps to implement security policies?

A

IAM (Identity and Access Management)

21
Q

How can BigQuery be accessed?

A

Command line (bq), REST API, web UI and third-party tools (e.g. Matillion).

22
Q

What is Compute Engine?

A

Google's IaaS solution: it lets you run virtual machines on demand in the cloud.

23
Q

What are the 3 types of Compute Engine machine available on GCP?

A
  1. Custom Machines - CPU and memory sized for the problem.
  2. Spot Machines - reduce computing cost by up to 91%.
  3. Predefined Machines
24
Q

What is Google Kubernetes Engine (GKE)?

A

GKE runs clusters of machines which run containers; it is a way to orchestrate code which is running in containers.

25
Q

What is a container?

A

A container is a package of code and its dependencies - it is highly portable and resource efficient.

26
Q

What is App Engine?

A

Google's PaaS: a way to run code without worrying about infrastructure.

27
Q

What are Cloud Functions?

A

Serverless execution environments (FaaS, function as a service) - they execute code in response to an event and Google scales resources as required.
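
As an illustration of "code executed in response to an event", the sketch below shows a handler written in the (event, context) style used by Python background functions. The function name is hypothetical, and the event is assumed to be a Cloud Storage upload notification carrying `bucket` and `name` fields. The last line only simulates the trigger locally with a hand-written dictionary; in a deployed function the platform would supply the event.

```python
def on_file_uploaded(event: dict, context=None) -> str:
    """Hypothetical handler: react to a file landing in a bucket."""
    bucket = event.get("bucket", "<unknown bucket>")
    name = event.get("name", "<unknown object>")
    message = f"Processing gs://{bucket}/{name}"
    print(message)
    return message


# Local simulation of the triggering event (made-up values).
on_file_uploaded({"bucket": "example-bucket", "name": "data/input.csv"})
```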

28
Q

What are Google's managed database and storage services?

A

Cloud Bigtable, Cloud Storage, Cloud SQL, Cloud Spanner and Cloud Datastore.

29
Q

What roles exist in an analytics team?

A
  • Data Engineer
  • Decision Makers
  • Analysts
  • Statisticians
  • Applied ML Engineer
  • Data Scientists
30
Q

What is a recommendation system?

A

A model which recommends products based on user preferences, e.g. Netflix or YouTube.

  • It must scale to meet demand.
  • Predictions can be made in streaming or batch mode.
31
Q

Where should you store unstructured data?

A

Cloud Storage

32
Q

Where should you store data which is structured and needs a latency of seconds?

A

BigQuery

33
Q

Where should you store data which is structured and needs a latency of milliseconds?

A

Cloud Bigtable

34
Q

Where should you store data which is structured and is a NoSQL workload?

A

Cloud Datastore

35
Q

Where should you store data which is structured, is SQL based and where 1 database is enough?

A

Cloud SQL
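
The last five cards form a small decision tree. It can be sketched in Python as below; the function name, the `need` labels and the lookup table are hypothetical names introduced for illustration, and only the cases covered by the cards are handled.

```python
# Mapping of requirement -> product for structured data, taken from the cards.
STRUCTURED_RULES = {
    "latency_seconds": "BigQuery",
    "latency_milliseconds": "Cloud Bigtable",
    "nosql": "Cloud Datastore",
    "sql_single_database": "Cloud SQL",
}


def pick_storage_product(structured: bool, need: str = "") -> str:
    """Pick a storage product using only the rules from the cards above."""
    if not structured:
        return "Cloud Storage"
    if need not in STRUCTURED_RULES:
        raise ValueError(f"no card covers structured data with need {need!r}")
    return STRUCTURED_RULES[need]


print(pick_storage_product(structured=False))
print(pick_storage_product(structured=True, need="latency_milliseconds"))
```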

36
Q

Where did Big Data Tools evolve from?

A
  • Hadoop = MapReduce (map: performs filtering; reduce: performs summary operations).
  • Cloud Services = separate, specialise and connect.
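
The map and reduce steps can be illustrated with a word count. This is a single-process Python sketch of the idea only, not Hadoop code; the function names are made up for the example, and the explicit shuffle step (grouping values by key between map and reduce) is included to make the data flow visible.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: summarise each group, here by summing the counts."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    return reduce_phase(shuffle(map_phase(lines)))


print(word_count(["big data", "Big tools"]))
```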
37
Q

What are clusters?

A

Clusters are a fungible resource: they are used when required and managed automatically through Dataproc.
  • Clusters can be up and running in about 90 seconds.

38
Q

What can we do to clusters to avoid under or over provisioning?

A
  • Autoscaling = turn workers on or off according to job size.
  • Incorporate preemptible VMs - affordable (up to 80% cheaper) and short-lived, but with limited availability.
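
The cost effect of mixing in preemptible workers is simple arithmetic. The Python sketch below assumes the 80% discount quoted on the card and a made-up price of 1.0 per worker-hour; the function name and all figures are hypothetical, not real GCP pricing.

```python
def hourly_cluster_cost(
    standard_workers: int,
    preemptible_workers: int,
    price_per_worker_hour: float,
    preemptible_discount: float = 0.80,  # assumed from the card, not a quoted price
) -> float:
    """Estimate the hourly cost of a cluster mixing standard and preemptible workers."""
    standard_cost = standard_workers * price_per_worker_hour
    preemptible_cost = (
        preemptible_workers * price_per_worker_hour * (1 - preemptible_discount)
    )
    return standard_cost + preemptible_cost


# Compare 10 standard workers with 2 standard plus 8 preemptible workers.
all_standard = hourly_cluster_cost(10, 0, price_per_worker_hour=1.0)
mixed = hourly_cluster_cost(2, 8, price_per_worker_hour=1.0)
print(f"all standard: {all_standard:.2f}, mixed: {mixed:.2f}")
```

Because preemptible VMs can be reclaimed at any time, a sketch like this only covers price; it says nothing about how long a job actually takes on the mixed cluster.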