Lecture 9 - Cloud Computing, Big Data, TensorFlow, Recurrent Neural Networks, Distributed Deep Learning Flashcards
What is cloud computing?
A model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management or service provider interaction.
What is cloud computing composed of?
5 characteristics, 4 deployment models and 3 service models
What are the five characteristics of cloud computing?
- On-demand self-service: Provision computing resources as needed without requiring human interaction with each service provider
- Broad network access: Resources are available over network and accessed through standard mechanisms
- Resource Pooling: Provider’s computing resources are pooled to serve multiple consumers using a multi-tenant model (sense of location independence)
- Rapid Elasticity: Capabilities can be elastically provisioned and released, to scale rapidly outward and inward as per demand
- Measured Service: Control and optimize resource use by a metering capability. Usage is monitored, controlled and reported -> transparency
What are the four deployment models of cloud computing?
- Private cloud: Exclusive use by a single organization comprising multiple consumers. May be owned, managed and operated by the organization, a third party, or a combination (exists on or off premises)
- Community cloud: Exclusive use by a specific community of consumers from organizations that have shared concerns
- Public cloud: Open use by the general public
- Hybrid cloud: Composition of two or more distinct cloud infrastructures (Private, community or public)
- Bound together by standardized technology that enables data and application portability
- E.g., Cloud bursting for load balancing between clouds
What are the three service models of cloud computing?
- Software as a Service (SaaS): Capability provided to consumers is to use provider’s applications running on Cloud. Applications are accessible from various client devices (Example: Dropbox)
- Platform as a Service (PaaS): Capability provided to consumers is to deploy onto Cloud consumer-created or acquired applications created using programming languages, libraries, services and tools supported by provider (Example: MS Azure Cloud, Google Cloud Platform, AWS)
- Infrastructure as a Service (IaaS): Capability provided to consumers is to provision processing, storage, networks and other fundamental computing resources where consumer can deploy and run arbitrary software including OSs and applications
What is the difference between traditional data and big data?
Big Data refers to the inability of traditional data architectures to efficiently handle new datasets.
What are the characteristics of big data? (4 V’s)
- Volume (Size of dataset)
- Variety (Data from multiple sources/types)
- Velocity (Rate of flow)
- Variability (Change in other characteristics)
Bonus: (New V)
- Value
(There are different ideas about how many V’s there are. Our teacher sticks to 4(5) V’s)
True or False: Machine Learning Techniques are very capable of processing large raw data
True!
Big data requires a ____ architecture for efficient storage, manipulation and analysis
Big data requires a SCALABLE architecture for efficient storage, manipulation and analysis
What does Data Science try to extract from data? And how?
Data science tries to extract: Actionable knowledge / Patterns
Through: Discovery or hypothesis formulation and hypothesis testing
What are the six primary components of the Data Lifecycle Management System (DLMS)?
It comprises six primary components:
- Metadata management (maintains static and dynamic characteristics of data, i.e., data about the data)
- Data placement (handles efficient data placement and data replication while satisfying user requirements, e.g., will it be placed on premises or in the cloud?)
- Data storage (responsible for efficient (transactional) storage and data retrieval support - key-value stores or other)
- Data ingestion (enables importing and exporting data over the respective system)
- Big data processing (supports efficient, clustered processing of big data by executing the main logic of the user application(s), i.e., how the data is processed)
- Resource management (responsible for proper and efficient management of computational resources)
- Not necessarily part of a Data Scientist's job, but we should know it; it might be the responsibility of a DevOps engineer
Name a couple of Big Data processing frameworks
- Hadoop MapReduce
- Apache Spark
- Apache Beam
How does Hadoop MapReduce work? (I highly doubt this will be very relevant - See that chart in Notion)
An MR job splits data into independent chunks, which are processed in parallel by map tasks. Sorted map outputs are fed to reduce tasks.
YARN (Yet Another Resource Negotiator) offers resource management and job scheduling. It is a cluster management technology that became a core part of Hadoop 2.0
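To make the map -> shuffle -> reduce flow concrete, here is a minimal single-machine Python sketch of the classic word-count job (illustrative only; a real Hadoop job distributes the map tasks across the cluster and lets YARN schedule them):

```python
from collections import defaultdict

def map_task(chunk):
    # Map phase: emit (word, 1) pairs for one independent input chunk.
    return [(word, 1) for word in chunk.split()]

def reduce_task(word, counts):
    # Reduce phase: aggregate all counts collected for one key.
    return word, sum(counts)

chunks = ["the cat sat", "the dog sat"]  # data split into independent chunks

# Shuffle/sort: group the sorted map outputs by key before reducing.
grouped = defaultdict(list)
for key, value in (pair for chunk in chunks for pair in map_task(chunk)):
    grouped[key].append(value)

print(dict(reduce_task(w, c) for w, c in grouped.items()))
# {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1}
```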
What are the advantages of Apache Spark?
Apache Spark offers:
- In-memory processing capability, where data is cached in memory for rapid reads and writes
- Support for data analytics and machine learning algorithms
- Support for multiple languages (Scala, Java, Python, R)
What are the five components of Apache Spark?
- Apache Spark Core: Execution engine for spark (Provides in-memory computing)
- Spark SQL: Introduces a data abstraction for structured and semi-structured data. (Resilient Distributed Datasets (RDDs) are the fundamental data structure of Spark)
- Spark Streaming: Performs streaming analytics. Crunches data in mini-batches and performs RDD transformations on those mini-batches
- MLlib (Machine Learning Library): A distributed machine learning framework
- GraphX: A distributed graph-processing framework
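A minimal PySpark sketch tying the pieces together (assumes a local Spark installation; the numbers are illustrative). It shows the RDD abstraction and the in-memory caching that gives Spark its speed:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("demo").getOrCreate()

rdd = spark.sparkContext.parallelize(range(1, 6))  # an RDD of 1..5
rdd.cache()                                 # keep the dataset in memory
squares = rdd.map(lambda x: x * x)          # transformation (lazy)
print(squares.reduce(lambda a, b: a + b))   # action triggers execution -> 55

spark.stop()
```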
How is Apache Beam different from Apache Spark?
- Apache Beam is a unified Batch+strEAM processing engine.
- Primarily focused on the programming model
- Beam does not itself contain infrastructure for distributed processing; pipelines are executed by a runner
Dunno, don’t focus too much on this, I think
What are the primary components of Apache Beam?
Beam supports (What/Where/When/How) via Pipelines, PCollections, Transformations and Runners
- Pipeline (a collection of processes): encapsulates the streaming and processing user logic (kinda similar to machine learning pipelines)
- PCollection: a distributed data set, processed inside a pipeline
- Transform: a data processing operation, i.e., a step of a pipeline
- Runner: used to execute pipelines, e.g., the Spark runner for Apache Spark
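A minimal Beam sketch showing all four concepts (assumes the apache-beam Python package; by default it executes on the built-in DirectRunner, and the same pipeline could be handed to e.g. the Spark runner):

```python
import apache_beam as beam

with beam.Pipeline() as pipeline:                    # Pipeline: the user logic
    (
        pipeline
        | beam.Create(["apple", "banana", "apple"])  # creates a PCollection
        | beam.Map(lambda w: (w, 1))                 # Transform: one step
        | beam.CombinePerKey(sum)                    # Transform: aggregate
        | beam.Map(print)                            # ('apple', 2), ...
    )
```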
What are the core features of TensorFlow?
- Key idea: Express a numeric computation as a graph
- Graph Nodes are operations with any number of inputs and outputs
- Graph edges are tensors which flow between nodes
- Computation graphs can be exported to a portable format, train here - run there
- TF implements autodiff and provides optimizers
- Minimize all sorts of loss functions
Also has a lot of data processing features: tf.keras, data loading and preprocessing (tf.data, tf.io), image processing (tf.image), signal processing (tf.signal)
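A minimal sketch of the autodiff + optimizer idea in TF 2.x eager style (the variable and loss are illustrative): minimize (w - 3)^2 by gradient descent.

```python
import tensorflow as tf

w = tf.Variable(0.0)
opt = tf.keras.optimizers.SGD(learning_rate=0.1)

for _ in range(100):
    with tf.GradientTape() as tape:       # record operations for autodiff
        loss = tf.square(w - 3.0)
    grads = tape.gradient(loss, [w])      # automatic differentiation
    opt.apply_gradients(zip(grads, [w]))  # optimizer step

print(w.numpy())  # ~3.0, the minimizer of the loss
```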
What is TensorFlow?
Its core is very similar to NumPy, but with GPU support.
TensorFlow's API revolves around tensors, which flow from operation to operation, hence the name TensorFlow.
A tensor is like a NumPy ndarray:
- Can create a tensor from a NumPy array and vice versa
- Can apply TF operations to NumPy arrays and NumPy operations to tensors
- Usually a multi-dimensional array, but it can also hold a scalar
- This interoperability helps when writing custom cost functions, custom metrics and custom layers
Note: Type conversions can significantly hurt performance
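A quick sketch of this interop (the values are illustrative; note the dtypes, since NumPy defaults to float64 while TF prefers float32):

```python
import numpy as np
import tensorflow as tf

a = np.array([2.0, 4.0, 5.0])
t = tf.constant(a)      # tensor from a NumPy array
back = t.numpy()        # NumPy array from a tensor
print(tf.square(a))     # TF operation applied to a NumPy array
print(np.square(t))     # NumPy operation applied to a tensor
```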
Describe (in very broad terms) TensorFlow’s architecture
ARCHITECTURE: High level (Python code) -> Keras/Data API -> Low-level python API/C++ -> Local/distributed execution engine -> CPU-/GPU-/TPU Kernels
Execution steps:
- Build a graph using variables and placeholders
- Deploy the graph for execution
- Train model by defining loss function and gradients computations
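These steps describe the TF1-style graph workflow; a minimal sketch using today's compat API (the tiny linear model is an illustrative assumption):

```python
import tensorflow as tf

tf.compat.v1.disable_eager_execution()  # TF1-style define-and-run mode

# Build a graph using variables and placeholders
x = tf.compat.v1.placeholder(tf.float32, shape=[None, 1])
y = tf.compat.v1.placeholder(tf.float32, shape=[None, 1])
w = tf.Variable(tf.zeros([1, 1]))
b = tf.Variable(tf.zeros([1]))
y_pred = tf.matmul(x, w) + b

# Define the loss function and gradient computations
loss = tf.reduce_mean(tf.square(y_pred - y))
train_op = tf.compat.v1.train.GradientDescentOptimizer(0.1).minimize(loss)

# Deploy the graph for execution and train
with tf.compat.v1.Session() as sess:
    sess.run(tf.compat.v1.global_variables_initializer())
    sess.run(train_op, feed_dict={x: [[1.0]], y: [[2.0]]})
```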
True or False: Deep Learning frameworks hide mathematics and focus on design of neural nets
True
Name some Deep Learning Frameworks
- TensorFlow
- Caffe
- CNTK
- PyTorch
How do you train a Neural Network?
1) Build a computational graph from network definition
2) Input Training data and compute loss function
3) Update parameters
Define-and-run: DL Frameworks complete step one in advance (TensorFlow, Caffe)
Define-by-run: Combines steps one and two into a single step (PyTorch)
- Computational graph is not given before training, but obtained while training
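A minimal PyTorch sketch of define-by-run (assumes PyTorch is installed): the computational graph is recorded while the forward pass executes, then backpropagated.

```python
import torch

x = torch.randn(3, requires_grad=True)
y = (x ** 2).sum()   # the graph is built as this line runs
y.backward()         # backprop over the graph recorded during the forward pass
print(x.grad)        # dy/dx = 2x
```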
Not important, but pretty cool: ONNX (https://onnx.ai)
Acronym for: Open Neural Network eXchange
What is the difference between Distributed ML and non-distributed ML?
Non-distributed ML is what we've been doing throughout the course: working on the whole dataset on a single machine
Distributed ML is a way of partitioning a dataset and working on it simultaneously across machines. Implemented by Hadoop, Spark, etc.
What is data parallelism?
A number of machines each load an identical copy of a DL model -> training data is split into non-overlapping chunks and then fed to the workers
Basically a way of splitting the data and letting different "workers" work on it simultaneously (see the sketch below)
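A minimal sketch of data parallelism with tf.distribute (TF 2.x; the model, shapes and local-GPU setup are illustrative assumptions):

```python
import tensorflow as tf

# Every replica loads an identical copy of the model; each training batch
# is split into non-overlapping shards, one per replica (worker).
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
    model.compile(optimizer="sgd", loss="mse")
# model.fit(x_train, y_train, batch_size=64)  # batches sharded across replicas
```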
What is model parallelism?
Using the same data, but splitting the model into chunks and letting different "workers" work on the parts simultaneously (see the sketch below)
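A minimal sketch of model parallelism via manual device placement (the two-GPU layout and layer sizes are illustrative assumptions; TF falls back to available devices with soft placement):

```python
import tensorflow as tf

x = tf.random.normal([32, 100])  # one shared batch of data
with tf.device("/GPU:0"):        # first chunk of the model on one worker
    h = tf.keras.layers.Dense(64, activation="relu")(x)
with tf.device("/GPU:1"):        # second chunk of the model on another worker
    y = tf.keras.layers.Dense(10)(h)
```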
Why were Recurrent Neural Networks made? (RNN)
RNNs were developed to solve learning problems (time series or sequential tasks) where information about the past (i.e., past instances/events) is directly linked to making future predictions
A recurrent neuron is different from a traditional neuron in that it…
maintains a memory or a state from past computations
Name some different use cases for RNNs
Each of these has a different graph showing the input -> memory -> output relationship. See Notion:
- Image captioning
- Sentiment analysis
- Video classification
- Machine translation and speech recognition
How do you train RNNs?
Backpropagation through time (BPTT)
- Unroll the recurrent neuron across time instants
- Apply backpropagation to the unrolled neurons at each time layer, the same way it is done for a traditional feedforward NN
The challenge of training RNN is…
the vanishing and exploding gradient problem
-> Addressed by a new model, the Long Short-Term Memory (LSTM)
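A minimal Keras sketch of an LSTM for a sequence task (input shape and sizes are illustrative assumptions):

```python
import tensorflow as tf

# LSTM cells maintain a state across time steps, which mitigates the
# vanishing gradient problem of plain recurrent neurons.
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(None, 8)),  # (time steps, features)
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.summary()
```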