Lecture 9 - Cloud Computing, Big Data, TensorFlow, Recurrent Neural Networks, Distributed Deep Learning Flashcards

1
Q

What is cloud computing?

A

A model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management or service provider interaction.

2
Q

What is cloud computing composed of?

A

5 characteristics, 4 deployment models and 3 service models

3
Q

What are the five characteristics of cloud computing?

A
  1. On-demand self-service: Provision computing resources as needed, without requiring human interaction with each service provider
  2. Broad network access: Resources are available over the network and accessed through standard mechanisms
  3. Resource pooling: The provider's computing resources are pooled to serve multiple consumers using a multi-tenant model (gives a sense of location independence)
  4. Rapid elasticity: Capabilities can be elastically provisioned and released, to scale rapidly outward and inward with demand
  5. Measured service: Control and optimize resource use via a metering capability
    - Usage is monitored, controlled, and reported -> transparency
4
Q

What are the four deployment models of cloud computing?

A
  1. Private cloud: Exclusive use by a single organization comprising multiple consumers
    - May be owned, managed, and operated by the organization, a third party, or some combination (exists on or off premises)
  2. Community cloud: Exclusive use by a specific community of consumers from organizations that have shared concerns
  3. Public cloud: Open use by the general public
  4. Hybrid cloud: Composition of two or more distinct cloud infrastructures (private, community, or public)
    - Bound together by standardized technology that enables data and application portability
    - E.g., cloud bursting for load balancing between clouds
5
Q

What are the three service models of cloud computing?

A
  1. Software as a Service (SaaS): The capability provided to consumers is to use the provider's applications running on the cloud. Applications are accessible from various client devices (example: Dropbox)
  2. Platform as a Service (PaaS): The capability provided to consumers is to deploy onto the cloud consumer-created or acquired applications, built using programming languages, libraries, services, and tools supported by the provider (examples: MS Azure, Google Cloud Platform, AWS)
  3. Infrastructure as a Service (IaaS): The capability provided to consumers is to provision processing, storage, networks, and other fundamental computing resources, on which the consumer can deploy and run arbitrary software, including OSs and applications
6
Q

What is the difference between traditional data and big data?

A

Big Data refers to the inability of traditional data architectures to efficiently handle new datasets.

7
Q

What are the characteristics of big data? (4 V’s)

A
  • Volume (size of the dataset)
  • Variety (data from multiple sources/types)
  • Velocity (rate of flow)
  • Variability (change in the other characteristics over time)

Bonus (a newer V):
- Value

(There are different ideas about how many V's there are. Our teacher sticks to 4 (5) V's.)

8
Q

True or False: Machine learning techniques are very capable of processing large amounts of raw data

A

True!

9
Q

Big data requires a ____ architecture for efficient storage, manipulation, and analysis

A

Big data requires a SCALABLE architecture for efficient storage, manipulation, and analysis

10
Q

What does Data Science try to extract from data? And how?

A

Data science tries to extract: Actionable knowledge / Patterns

Through: Discovery or hypothesis formulation and hypothesis testing

11
Q

What are the six primary components of the Data lifecycle management system? (DLMS)

A

Comprises six primary components:

  • Metadata management: maintains static and dynamic characteristics of data (i.e., data about the data)
  • Data placement: handles efficient data placement and data replication while satisfying user requirements (i.e., will the data be placed on your premises or in the cloud?)
  • Data storage: is responsible for efficient (transactional) storage and data retrieval support (key-value storage or other)
  • Data ingestion: enables importing and exporting data over the respective system
  • Big data processing: supports efficient and clustered processing of big data by executing the main logic of the user application(s) (i.e., how do we process the data)
  • Resource management: is responsible for proper and efficient management of computational resources

Note: this is not necessarily part of the Data Scientist's job, but we should know it. It could be the responsibility of a DevOps engineer.
12
Q

Name a couple of Big Data processing frameworks

A
  • Hadoop MapReduce
  • Apache Spark
  • Apache Beam
13
Q

How does Hadoop MapReduce work? (I highly doubt this will be very relevant - See that chart in Notion)

A

A MapReduce (MR) job splits data into independent chunks, which are processed in parallel by map tasks. The sorted map outputs are then fed to reduce tasks (sketched below).

YARN (Yet Another Resource Negotiator) offers resource management and job scheduling. It is a cluster management technology that became a core part of Hadoop 2.0
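A toy word-count sketch of the map/reduce model in plain Python (not Hadoop's actual API; with Hadoop Streaming, the mapper and reducer would run as separate scripts over HDFS chunks):

```python
# Minimal word-count sketch of the MapReduce model.
# Hadoop distributes map tasks over independent chunks and
# shuffles/sorts the map output before the reduce tasks run.
from itertools import groupby
from operator import itemgetter

def map_task(chunk):
    """Map: emit (word, 1) for every word in one input chunk."""
    for line in chunk:
        for word in line.split():
            yield (word, 1)

def reduce_task(word, counts):
    """Reduce: sum all counts emitted for one word."""
    return (word, sum(counts))

chunks = [["the cat sat"], ["the dog sat"]]      # independent input splits
mapped = [kv for chunk in chunks for kv in map_task(chunk)]
mapped.sort(key=itemgetter(0))                   # the shuffle/sort phase
reduced = [reduce_task(w, (c for _, c in grp))
           for w, grp in groupby(mapped, key=itemgetter(0))]
print(reduced)  # [('cat', 1), ('dog', 1), ('sat', 2), ('the', 2)]
```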

14
Q

What are the advantages of Apache Spark?

A

Apache Spark offers:

  • In-memory processing capability, where data is cached in memory for rapid reads and writes
  • Support for data analytics and machine learning algorithms
  • Support for multiple languages (such as Scala, Java, Python, R)
15
Q

What are the five components of Apache Spark?

A
  • Apache Spark Core: the execution engine for Spark (provides in-memory computing); Resilient Distributed Datasets (RDDs) are the fundamental data structure of Spark
  • Spark SQL: introduces a data abstraction (DataFrames) for structured and semi-structured data
  • Spark Streaming: performs streaming analytics; crunches data in mini-batches and performs RDD transformations on those mini-batches of data
  • MLlib (Machine Learning Library): a distributed machine learning framework
  • GraphX: a distributed graph-processing framework
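A minimal PySpark sketch of the RDD and DataFrame APIs (assumes a local Spark installation and the pyspark package; names here are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("demo").getOrCreate()

# Spark Core / RDD: in-memory word count with explicit caching
rdd = spark.sparkContext.parallelize(["the cat sat", "the dog sat"])
counts = (rdd.flatMap(lambda line: line.split())
             .map(lambda w: (w, 1))
             .reduceByKey(lambda a, b: a + b)
             .cache())                 # keep the result in memory for reuse
print(counts.collect())

# Spark SQL / DataFrame: the structured-data abstraction
df = spark.createDataFrame([(1, "cat"), (2, "dog")], ["id", "animal"])
df.filter(df.id > 1).show()

spark.stop()
```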
16
Q

How is Apache Beam different from Apache Spark?

A
  • Apache Beam is a unified Batch + strEAM processing model (hence the name)
  • Primarily focused on the programming model

Beam does not contain infrastructure for distributed processing; pipelines are executed by a runner on top of an existing engine

Dunno, don’t focus too much on this, I think

17
Q

What are the primary components of Apache Beam?

A

Beam supports (What/Where/When/How) via Pipelines, PCollections, Transforms, and Runners:

  • Pipeline: a collection of processing steps encapsulating the user's streaming and processing logic (kinda similar to a machine learning pipeline)
  • PCollection: a distributed data set, processed inside a pipeline
  • Transform: a data processing operation, i.e., one step of a pipeline
  • Runner: used to execute pipelines (e.g., the Spark runner for Apache Spark)
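A minimal Beam pipeline sketch (assumes the apache-beam package; by default it runs on the local DirectRunner):

```python
import apache_beam as beam

with beam.Pipeline() as pipeline:                   # Pipeline
    (pipeline
     | beam.Create(["the cat sat", "the dog sat"])  # PCollection of lines
     | beam.FlatMap(str.split)                      # Transform: split into words
     | beam.combiners.Count.PerElement()            # Transform: count each word
     | beam.Map(print))                             # emit the results
```

The same pipeline can be handed to a different runner (e.g., the Spark runner) via pipeline options, without changing the user logic.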
18
Q

What are the core features of TensorFlow?

A
  • Key idea: express a numeric computation as a graph
  • Graph nodes are operations with any number of inputs and outputs
  • Graph edges are tensors, which flow between nodes
  • Computation graphs can be exported to a portable format: train here, run there
  • TF implements autodiff and provides optimizers
  • These can minimize all sorts of loss functions

Also has a lot of data processing features: tf.keras, data loading and preprocessing (tf.data, tf.io), image processing (tf.image), signal processing (tf.signal)
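A small sketch of the graph and autodiff ideas in TensorFlow 2 (assumes the tensorflow package):

```python
import tensorflow as tf

@tf.function                      # traces the Python function into a TF graph
def f(x):
    return x ** 2 + 3.0 * x

x = tf.Variable(2.0)
with tf.GradientTape() as tape:   # autodiff: record operations on a tape
    y = f(x)
grad = tape.gradient(y, x)        # dy/dx = 2x + 3 = 7 at x = 2
print(y.numpy(), grad.numpy())    # 10.0 7.0
```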

19
Q

What is TensorFlow?

A

Its core is very similar to NumPy, but with GPU support

TensorFlow's API revolves around tensors, which flow from operation to operation, hence the name TensorFlow

A tensor is like a NumPy ndarray

  • Can create a tensor from a NumPy array and vice versa
    • Can apply TF operations to NumPy arrays and NumPy operations to tensors
  • Typically a multi-dimensional array, but can also hold a scalar
  • Helps when writing custom cost functions, custom metrics, and custom layers

Note: Type conversions can significantly hurt performance
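A quick tensor/NumPy interoperability sketch (TensorFlow 2 assumed):

```python
import numpy as np
import tensorflow as tf

a = np.array([1.0, 2.0, 3.0])
t = tf.constant(a)           # tensor created from a NumPy array
print(tf.square(t))          # TF operation applied to the tensor
print(np.square(t))          # NumPy operation applied to the tensor
print(t.numpy())             # back to a NumPy ndarray

print(tf.constant(42.0))     # a tensor can also hold a scalar
# Caveat: NumPy defaults to float64 while TF prefers float32;
# implicit type conversions like this can hurt performance.
```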

20
Q

Describe (in very broad terms) TensorFlow’s architecture

A
ARCHITECTURE:
High level (Python code) -> Keras / Data API -> low-level Python API / C++ core -> local/distributed execution engine -> CPU/GPU/TPU kernels

Execution steps:

  • Build a graph using variables and placeholders
  • Deploy the graph for execution
  • Train the model by defining the loss function and gradient computations
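A minimal sketch of the training step in TF 2 style (TF 1.x built the graph explicitly with placeholders and ran it in a session; here GradientTape plays that role):

```python
import tensorflow as tf

w = tf.Variable(0.0)
b = tf.Variable(0.0)
x = tf.constant([1.0, 2.0, 3.0])
y = tf.constant([3.0, 5.0, 7.0])          # true relation: y = 2x + 1
opt = tf.keras.optimizers.SGD(learning_rate=0.05)

for _ in range(200):
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean((w * x + b - y) ** 2)   # MSE loss
    grads = tape.gradient(loss, [w, b])               # gradient computations
    opt.apply_gradients(zip(grads, [w, b]))           # parameter update

print(w.numpy(), b.numpy())   # approaches 2.0 and 1.0
```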
21
Q

True or False: Deep Learning frameworks hide mathematics and focus on design of neural nets

A

True

22
Q

Name some Deep Learning Frameworks

A
  1. TensorFlow
  2. Caffe
  3. CNTK
  4. PyTorch
23
Q

How do you train a Neural Network?

A

1) Build a computational graph from network definition
2) Input Training data and compute loss function
3) Update parameters

Define-and-run: DL frameworks complete step 1 in advance (TensorFlow, Caffe)

Define-by-run: combines steps 1 and 2 into a single step (PyTorch)
- The computational graph is not given before training, but obtained while training (see the sketch below)
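A define-by-run sketch in PyTorch (assumes the torch package): the graph is recorded on the fly as the forward computation executes, so ordinary Python control flow can shape it:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 2 + 3 * x        # graph is built while these ops run
if y > 5:                 # plain Python control flow is fine, since the
    y = y * 2             # graph is traced anew on every execution
y.backward()              # backprop through whatever graph was built
print(x.grad)             # d(2*(x^2 + 3x))/dx = 4x + 6 = 14 at x = 2
```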

24
Q

Not important, but pretty cool: what does ONNX stand for? (https://onnx.ai)

A

Open Neural Network eXchange: an open format for representing machine learning models, so trained models can be exchanged between frameworks (https://onnx.ai)

25
Q

What is the difference between Distributed ML and non-distributed ML?

A

Non-distributed ML is what we've been doing throughout the course: working on the whole dataset on a single machine

Distributed ML is a way of partitioning a dataset (or model) and working on the parts simultaneously across machines. Implemented by frameworks such as Hadoop and Spark

26
Q

What is data parallelism?

A

A number of machines each load an identical copy of a DL model -> the training data is split into non-overlapping chunks, which are then fed to the workers

Basically a way of splitting the data and letting different "workers" work on it simultaneously (see the sketch below)
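A conceptual data-parallelism sketch in NumPy (illustrative only; real frameworks such as tf.distribute or Horovod handle the distribution and gradient averaging for you):

```python
import numpy as np

np.random.seed(0)
w = np.zeros(2)                           # identical model copy on each worker
X = np.random.randn(8, 2)
y = X @ np.array([1.0, -2.0])             # ground-truth weights

def worker_grad(w, X_shard, y_shard):
    """Gradient of the MSE loss on one worker's data shard."""
    err = X_shard @ w - y_shard
    return 2 * X_shard.T @ err / len(y_shard)

for _ in range(100):
    shards = zip(np.array_split(X, 4), np.array_split(y, 4))  # 4 workers
    grads = [worker_grad(w, Xs, ys) for Xs, ys in shards]     # run in parallel
    w -= 0.1 * np.mean(grads, axis=0)     # average gradients, update all copies

print(w)   # approaches [1.0, -2.0]
```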

27
Q

What is model parallelism?

A

Using the same data, but splitting the model into chunks and letting different "workers" work on it simultaneously

28
Q

Why were Recurrent Neural Networks made? (RNN)

A

RNNs were developed to solve learning problems (time series or sequential tasks) where information about the past (i.e., past instances/events) is directly linked to making future predictions

29
Q

A recurrent neuron is different from a traditional neuron in that it…

A

maintains a memory or a state from past computations

30
Q

Name some different use cases for RNNs

A

Each of these has a different graph, showing the input -> memory -> output relationship. See Notion:

  • Image captioning
  • Sentiment analysis
  • Video classification
  • Machine translation and speech recognition
31
Q

How do you train RNNs?

A

Backpropagation through time (BPTT):

  1. Unroll the recurrent neuron across time instants
  2. Apply backpropagation to the unrolled neurons at each time layer, the same way it is done for a traditional feedforward NN

(See the sketch below)
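A tiny BPTT sketch in TensorFlow (names and shapes here are illustrative): unroll one recurrent neuron over a short sequence, then backpropagate through the unrolled computation:

```python
import tensorflow as tf

Wx = tf.Variable(0.5)                 # input weight
Wh = tf.Variable(0.1)                 # recurrent (state) weight
xs = tf.constant([1.0, 2.0, 3.0])     # a short input sequence
target = tf.constant(2.0)

with tf.GradientTape() as tape:
    h = tf.constant(0.0)              # initial state (memory)
    for x in xs:                      # 1) unroll across time instants
        h = tf.tanh(Wx * x + Wh * h)  #    same weights reused at each step
    loss = (h - target) ** 2
grads = tape.gradient(loss, [Wx, Wh]) # 2) backprop through the unrolled graph
print([g.numpy() for g in grads])
```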
32
Q

The challenge of training RNNs is…

A

the vanishing and exploding gradient problem

-> Addressed by creating a new model, the Long Short-Term Memory (LSTM)