Lecture 9 - Cloud Computing, Big Data, TensorFlow, Recurrent Neural Networks, Distributed Deep Learning Flashcards

1
Q

What is cloud computing?

A

A model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management or service provider interaction.

2
Q

What is cloud computing composed of?

A

5 characteristics, 4 deployment models and 3 service models

3
Q

What are the five characteristics of cloud computing?

A
  1. On-demand self-service: Provision computing resources as needed, without requiring human interaction with each service provider
  2. Broad network access: Resources are available over the network and accessed through standard mechanisms
  3. Resource pooling: The provider's computing resources are pooled to serve multiple consumers using a multi-tenant model (gives a sense of location independence)
  4. Rapid elasticity: Capabilities can be elastically provisioned and released, to scale rapidly outward and inward with demand
  5. Measured service: Control and optimize resource use via a metering capability
    - Usage is monitored, controlled, and reported -> transparency
4
Q

What are the four deployment models of cloud computing?

A
  1. Private cloud: Exclusive use by a single organization comprising multiple consumers
    - May be owned, managed, and operated by the organization, a third party, or some combination (exists on or off premises)
  2. Community cloud: Exclusive use by a specific community of consumers from organizations that have shared concerns
  3. Public cloud: Open use by the general public
  4. Hybrid cloud: Composition of two or more distinct cloud infrastructures (private, community, or public)
    - Bound together by standardized technology that enables data and application portability
    - E.g., cloud bursting for load balancing between clouds
5
Q

What are the three service models of cloud computing?

A
  1. Software as a Service (SaaS): The capability provided to consumers is to use the provider's applications running on the cloud. Applications are accessible from various client devices (example: Dropbox)
  2. Platform as a Service (PaaS): The capability provided to consumers is to deploy onto the cloud consumer-created or acquired applications, built using programming languages, libraries, services, and tools supported by the provider (examples: MS Azure, Google Cloud Platform, AWS)
  3. Infrastructure as a Service (IaaS): The capability provided to consumers is to provision processing, storage, networks, and other fundamental computing resources, on which the consumer can deploy and run arbitrary software, including OSs and applications
6
Q

What is the difference between traditional data and big data?

A

Big Data refers to the inability of traditional data architectures to efficiently handle new datasets.

7
Q

What are the characteristics of big data? (4 V’s)

A
  • Volume (size of the dataset)
  • Variety (data from multiple sources/types)
  • Velocity (rate of flow)
  • Variability (change in the other characteristics over time)

Bonus (a newer V):
- Value

(There are different ideas about how many V's there are. Our teacher sticks to 4 (5) V's.)

8
Q

True or False: Machine learning techniques are very capable of processing large amounts of raw data

A

True!

9
Q

Big data requires a ____ architecture for efficient storage, manipulation, and analysis

A

Big data requires a SCALABLE architecture for efficient storage, manipulation, and analysis

10
Q

What does Data Science try to extract from data? And how?

A

Data science tries to extract: Actionable knowledge / Patterns

Through: Discovery or hypothesis formulation and hypothesis testing

11
Q

What are the six primary components of the Data lifecycle management system? (DLMS)

A

Comprises six primary components:

  • Metadata management: maintains static and dynamic characteristics of data (i.e., data about the data)
  • Data placement: handles efficient data placement and data replication while satisfying user requirements (i.e., will the data be placed on your premises or in the cloud?)
  • Data storage: is responsible for efficient (transactional) storage and data retrieval support (key-value storage or other)
  • Data ingestion: enables importing and exporting data over the respective system
  • Big data processing: supports efficient and clustered processing of big data by executing the main logic of the user application(s) (i.e., how do we process the data)
  • Resource management: is responsible for proper and efficient management of computational resources

Note: this is not necessarily part of the Data Scientist's job, but we should know it. It could be the responsibility of a DevOps engineer.
12
Q

Name a couple of Big Data processing frameworks

A
  • Hadoop MapReduce
  • Apache Spark
  • Apache Beam
13
Q

How does Hadoop MapReduce work? (I highly doubt this will be very relevant - See that chart in Notion)

A

A MapReduce (MR) job splits data into independent chunks, which are processed in parallel by map tasks. The sorted map outputs are then fed to reduce tasks (sketched below).

YARN (Yet Another Resource Negotiator) offers resource management and job scheduling. It is a cluster management technology that became a core part of Hadoop 2.0
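A toy word-count sketch of the map/reduce model in plain Python (not Hadoop's actual API; with Hadoop Streaming, the mapper and reducer would run as separate scripts over HDFS chunks):

```python
# Minimal word-count sketch of the MapReduce model.
# Hadoop distributes map tasks over independent chunks and
# shuffles/sorts the map output before the reduce tasks run.
from itertools import groupby
from operator import itemgetter

def map_task(chunk):
    """Map: emit (word, 1) for every word in one input chunk."""
    for line in chunk:
        for word in line.split():
            yield (word, 1)

def reduce_task(word, counts):
    """Reduce: sum all counts emitted for one word."""
    return (word, sum(counts))

chunks = [["the cat sat"], ["the dog sat"]]      # independent input splits
mapped = [kv for chunk in chunks for kv in map_task(chunk)]
mapped.sort(key=itemgetter(0))                   # the shuffle/sort phase
reduced = [reduce_task(w, (c for _, c in grp))
           for w, grp in groupby(mapped, key=itemgetter(0))]
print(reduced)  # [('cat', 1), ('dog', 1), ('sat', 2), ('the', 2)]
```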

14
Q

What are the advantages of Apache Spark?

A

Apache Spark offers:

  • In-memory processing capability, where data is cached in memory for rapid reads and writes
  • Support for data analytics and machine learning algorithms
  • Support for multiple languages (such as Scala, Java, Python, R)
15
Q

What are the five components of Apache Spark?

A
  • Apache Spark Core: the execution engine for Spark (provides in-memory computing); Resilient Distributed Datasets (RDDs) are the fundamental data structure of Spark
  • Spark SQL: introduces a data abstraction (DataFrames) for structured and semi-structured data
  • Spark Streaming: performs streaming analytics; crunches data in mini-batches and performs RDD transformations on those mini-batches of data
  • MLlib (Machine Learning Library): a distributed machine learning framework
  • GraphX: a distributed graph-processing framework
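A minimal PySpark sketch of the RDD and DataFrame APIs (assumes a local Spark installation and the pyspark package; names here are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("demo").getOrCreate()

# Spark Core / RDD: in-memory word count with explicit caching
rdd = spark.sparkContext.parallelize(["the cat sat", "the dog sat"])
counts = (rdd.flatMap(lambda line: line.split())
             .map(lambda w: (w, 1))
             .reduceByKey(lambda a, b: a + b)
             .cache())                 # keep the result in memory for reuse
print(counts.collect())

# Spark SQL / DataFrame: the structured-data abstraction
df = spark.createDataFrame([(1, "cat"), (2, "dog")], ["id", "animal"])
df.filter(df.id > 1).show()

spark.stop()
```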
16
Q

How is Apache Beam different from Apache Spark?

A
  • Apache Beam is a unified Batch + strEAM processing model (hence the name)
  • Primarily focused on the programming model

Beam does not contain infrastructure for distributed processing; pipelines are executed by a runner on top of an existing engine

Dunno, don’t focus too much on this, I think

17
Q

What are the primary components of Apache Beam?

A

Beam supports (What/Where/When/How) via Pipelines, PCollections, Transforms, and Runners:

  • Pipeline: a collection of processing steps encapsulating the user's streaming and processing logic (kinda similar to a machine learning pipeline)
  • PCollection: a distributed data set, processed inside a pipeline
  • Transform: a data processing operation, i.e., one step of a pipeline
  • Runner: used to execute pipelines (e.g., the Spark runner for Apache Spark)
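A minimal Beam pipeline sketch (assumes the apache-beam package; by default it runs on the local DirectRunner):

```python
import apache_beam as beam

with beam.Pipeline() as pipeline:                   # Pipeline
    (pipeline
     | beam.Create(["the cat sat", "the dog sat"])  # PCollection of lines
     | beam.FlatMap(str.split)                      # Transform: split into words
     | beam.combiners.Count.PerElement()            # Transform: count each word
     | beam.Map(print))                             # emit the results
```

The same pipeline can be handed to a different runner (e.g., the Spark runner) via pipeline options, without changing the user logic.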
18
Q

What are the core features of TensorFlow?

A
  • Key idea: express a numeric computation as a graph
  • Graph nodes are operations with any number of inputs and outputs
  • Graph edges are tensors, which flow between nodes
  • Computation graphs can be exported to a portable format: train here, run there
  • TF implements autodiff and provides optimizers
  • These can minimize all sorts of loss functions

Also has a lot of data processing features: tf.keras, data loading and preprocessing (tf.data, tf.io), image processing (tf.image), signal processing (tf.signal)
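A small sketch of the graph and autodiff ideas in TensorFlow 2 (assumes the tensorflow package):

```python
import tensorflow as tf

@tf.function                      # traces the Python function into a TF graph
def f(x):
    return x ** 2 + 3.0 * x

x = tf.Variable(2.0)
with tf.GradientTape() as tape:   # autodiff: record operations on a tape
    y = f(x)
grad = tape.gradient(y, x)        # dy/dx = 2x + 3 = 7 at x = 2
print(y.numpy(), grad.numpy())    # 10.0 7.0
```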

19
Q

What is TensorFlow?

A

Its core is very similar to NumPy, but with GPU support

TensorFlow's API revolves around tensors, which flow from operation to operation, hence the name TensorFlow

A tensor is like a NumPy ndarray

  • Can create a tensor from a NumPy array and vice versa
    • Can apply TF operations to NumPy arrays and NumPy operations to tensors
  • Typically a multi-dimensional array, but can also hold a scalar
  • Helps when writing custom cost functions, custom metrics, and custom layers

Note: Type conversions can significantly hurt performance
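A quick tensor/NumPy interoperability sketch (TensorFlow 2 assumed):

```python
import numpy as np
import tensorflow as tf

a = np.array([1.0, 2.0, 3.0])
t = tf.constant(a)           # tensor created from a NumPy array
print(tf.square(t))          # TF operation applied to the tensor
print(np.square(t))          # NumPy operation applied to the tensor
print(t.numpy())             # back to a NumPy ndarray

print(tf.constant(42.0))     # a tensor can also hold a scalar
# Caveat: NumPy defaults to float64 while TF prefers float32;
# implicit type conversions like this can hurt performance.
```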

20
Q

Describe (in very broad terms) TensorFlow’s architecture

A
ARCHITECTURE:
High level (Python code) -> Keras / Data API -> low-level Python API / C++ core -> local/distributed execution engine -> CPU/GPU/TPU kernels

Execution steps:

  • Build a graph using variables and placeholders
  • Deploy the graph for execution
  • Train the model by defining the loss function and gradient computations
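A minimal sketch of the training step in TF 2 style (TF 1.x built the graph explicitly with placeholders and ran it in a session; here GradientTape plays that role):

```python
import tensorflow as tf

w = tf.Variable(0.0)
b = tf.Variable(0.0)
x = tf.constant([1.0, 2.0, 3.0])
y = tf.constant([3.0, 5.0, 7.0])          # true relation: y = 2x + 1
opt = tf.keras.optimizers.SGD(learning_rate=0.05)

for _ in range(200):
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean((w * x + b - y) ** 2)   # MSE loss
    grads = tape.gradient(loss, [w, b])               # gradient computations
    opt.apply_gradients(zip(grads, [w, b]))           # parameter update

print(w.numpy(), b.numpy())   # approaches 2.0 and 1.0
```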
21
Q

True or False: Deep Learning frameworks hide mathematics and focus on design of neural nets

A

True

22
Q

Name some Deep Learning Frameworks

A
  1. TensorFlow
  2. Caffe
  3. CNTK
  4. PyTorch
23
Q

How do you train a Neural Network?

A

1) Build a computational graph from network definition
2) Input Training data and compute loss function
3) Update parameters

Define-and-run: DL frameworks complete step 1 in advance (TensorFlow, Caffe)

Define-by-run: combines steps 1 and 2 into a single step (PyTorch)
- The computational graph is not given before training, but obtained while training (see the sketch below)
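A define-by-run sketch in PyTorch (assumes the torch package): the graph is recorded on the fly as the forward computation executes, so ordinary Python control flow can shape it:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 2 + 3 * x        # graph is built while these ops run
if y > 5:                 # plain Python control flow is fine, since the
    y = y * 2             # graph is traced anew on every execution
y.backward()              # backprop through whatever graph was built
print(x.grad)             # d(2*(x^2 + 3x))/dx = 4x + 6 = 14 at x = 2
```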

24
Q

Not important, but pretty cool: what does ONNX stand for? (https://onnx.ai)

A

Open Neural Network eXchange: an open format for representing machine learning models, so trained models can be exchanged between frameworks (https://onnx.ai)

25
Q

What is the difference between Distributed ML and non-distributed ML?

A

Non-distributed ML is what we've been doing throughout the course: working on the whole dataset on a single machine

Distributed ML is a way of partitioning a dataset (or model) and working on the parts simultaneously across machines. Implemented by frameworks such as Hadoop and Spark

26
Q

What is data parallelism?

A

A number of machines each load an identical copy of a DL model -> the training data is split into non-overlapping chunks, which are then fed to the workers

Basically a way of splitting the data and letting different "workers" work on it simultaneously (see the sketch below)
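A conceptual data-parallelism sketch in NumPy (illustrative only; real frameworks such as tf.distribute or Horovod handle the distribution and gradient averaging for you):

```python
import numpy as np

np.random.seed(0)
w = np.zeros(2)                           # identical model copy on each worker
X = np.random.randn(8, 2)
y = X @ np.array([1.0, -2.0])             # ground-truth weights

def worker_grad(w, X_shard, y_shard):
    """Gradient of the MSE loss on one worker's data shard."""
    err = X_shard @ w - y_shard
    return 2 * X_shard.T @ err / len(y_shard)

for _ in range(100):
    shards = zip(np.array_split(X, 4), np.array_split(y, 4))  # 4 workers
    grads = [worker_grad(w, Xs, ys) for Xs, ys in shards]     # run in parallel
    w -= 0.1 * np.mean(grads, axis=0)     # average gradients, update all copies

print(w)   # approaches [1.0, -2.0]
```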

27
Q

What is model parallelism?

A

Using the same data, but splitting the model into chunks and letting different "workers" work on it simultaneously

28
Q

Why were Recurrent Neural Networks made? (RNN)

A

RNNs were developed to solve learning problems (time series or sequential tasks) where information about the past (i.e., past instances/events) is directly linked to making future predictions

29
Q

A recurrent neuron is different from a traditional neuron in that it…

A

maintains a memory or a state from past computations

30
Q

Name some different use cases for RNNs

A

Each of these has a different graph, showing the input -> memory -> output relationship. See Notion:

  • Image captioning
  • Sentiment analysis
  • Video classification
  • Machine translation and speech recognition
31
Q

How do you train RNNs?

A

Backpropagation through time (BPTT):

  1. Unroll the recurrent neuron across time instants
  2. Apply backpropagation to the unrolled neurons at each time layer, the same way it is done for a traditional feedforward NN

(See the sketch below)
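A tiny BPTT sketch in TensorFlow (names and shapes here are illustrative): unroll one recurrent neuron over a short sequence, then backpropagate through the unrolled computation:

```python
import tensorflow as tf

Wx = tf.Variable(0.5)                 # input weight
Wh = tf.Variable(0.1)                 # recurrent (state) weight
xs = tf.constant([1.0, 2.0, 3.0])     # a short input sequence
target = tf.constant(2.0)

with tf.GradientTape() as tape:
    h = tf.constant(0.0)              # initial state (memory)
    for x in xs:                      # 1) unroll across time instants
        h = tf.tanh(Wx * x + Wh * h)  #    same weights reused at each step
    loss = (h - target) ** 2
grads = tape.gradient(loss, [Wx, Wh]) # 2) backprop through the unrolled graph
print([g.numpy() for g in grads])
```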
32
Q

The challenge of training RNNs is…

A

the vanishing and exploding gradient problem

-> Addressed by creating a new model, the Long Short-Term Memory (LSTM)