Lecture 9 - Cloud Computing, Big Data, TensorFlow, Recurrent Neural Networks, Distributed Deep Learning Flashcards
What is cloud computing?
A model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management or service provider interaction.
What is cloud computing composed of?
5 characteristics, 4 deployment models and 3 service models
What are the five characteristics of cloud computing?
- On-demand self-service: Provision computing resources as needed without requiring human interaction with each service provider
- Broad network access: Resources are available over network and accessed through standard mechanisms
- Resource Pooling: Provider’s computing resources are pooled to serve multiple consumers using a multi-tenant model (sense of location independence)
- Rapid Elasticity: Capabilities can be elastically provisioned and released, to scale rapidly outward and inward as per demand
- Measured Service: Control and optimize resource use through a metering capability; usage is monitored, controlled, and reported -> transparency
What are the four deployment models of cloud computing?
- Private Cloud: Exclusive use by single organization comprising multiple consumers.
- May be owned, managed, and operated by the organization, a third party, or a combination of them (exists on or off premises)
- Community cloud: Exclusive use by a specific community of consumers from organizations that have shared concerns
- Public cloud: Open use by the general public.
- Hybrid cloud: Composition of two or more distinct cloud infrastructures (Private, community or public)
- Bound together by standardized technology that enables data and application portability
- E.g., Cloud bursting for load balancing between clouds
What are the three service models of cloud computing?
- Software as a Service (SaaS): Capability provided to consumers is to use the provider’s applications running on the cloud. Applications are accessible from various client devices (Example: Dropbox)
- Platform as a Service (PaaS): Capability provided to consumers is to deploy onto Cloud consumer-created or acquired applications created using programming languages, libraries, services and tools supported by provider (Example: MS Azure Cloud, Google Cloud Platform, AWS)
- Infrastructure as a Service (IaaS): Capability provided to consumers is to provision processing, storage, networks and other fundamental computing resources where consumer can deploy and run arbitrary software including OSs and applications
What is the difference between traditional data and big data?
Big Data refers to the inability of traditional data architectures to efficiently handle new datasets.
What are the characteristics of big data? (4 V’s)
- Volume (Size of dataset)
- Variety (Data from multiple sources/types)
- Velocity (Rate of flow)
- Variability (Change in other characteristics)
Bonus: (New V)
- Value
(There are different ideas about how many V’s there are. Our teacher sticks to 4(5) V’s)
True or False: Machine learning techniques are very capable of processing large amounts of raw data
True!
Big data require a ____ architecture for efficient storage, manipulation and analysis
Big data require a SCALABLE architecture for efficient storage, manipulation and analysis
What does Data Science try to extract from data? And how?
Data science tries to extract: Actionable knowledge / Patterns
Through: Discovery or hypothesis formulation and hypothesis testing
What are the six primary components of the Data lifecycle management system? (DLMS)
It comprises six primary components:
- Metadata management (maintains static and dynamic characteristics of data, i.e., data about the data)
- Data placement (handles efficient data placement and data replication while satisfying user requirements, i.e., will the data be placed on premises or in the cloud?)
- Data storage (responsible for efficient (transactional) storage and data retrieval support: key-value storage or other)
- Data ingestion (enables importing data into and exporting data out of the respective system)
- Big data processing (supports efficient, clustered processing of big data by executing the main logic of the user application(s), i.e., how the data is processed)
- Resource management (responsible for proper and efficient management of computational resources)
- Not necessarily part of the Data Scientist’s job, but we should know about it; it could be the responsibility of a DevOps engineer, for example
Name a couple of Big Data processing frameworks
- Hadoop MapReduce
- Apache Spark
- Apache Beam
How does Hadoop MapReduce work? (I highly doubt this will be very relevant - See that chart in Notion)
A MapReduce (MR) job splits data into independent chunks which are processed in parallel by map tasks. Sorted map outputs are fed to reduce tasks.
YARN (Yet Another Resource Negotiator) offers resource management and job scheduling. It is a cluster management technology that became a core part of Hadoop 2.0
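As a rough illustration of this map/shuffle/reduce flow, here is a minimal word-count sketch in Python. It is not the lecture's example and runs in a single process; in real Hadoop the framework splits the input, runs map tasks in parallel, shuffles and sorts their output, and schedules everything via YARN.

```python
# Minimal word-count sketch in the MapReduce style (illustrative only).
import sys
from itertools import groupby

def map_phase(lines):
    """Map task: emit a (word, 1) pair for every word in the input chunk."""
    for line in lines:
        for word in line.strip().split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce task: sum the counts for each word in the sorted map output."""
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

if __name__ == "__main__":
    # Hadoop would run many map tasks in parallel and shuffle/sort their
    # output before the reduce tasks; here the whole pipeline runs locally
    # just to show the data flow.
    for word, count in reduce_phase(map_phase(sys.stdin)):
        print(word, count)
```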
What are the advantages of Apache Spark?
Apache Spark offers:
- In-memory processing capability, where data is cached in memory for rapid reads and writes (a rough sketch follows below)
- Support for data analytics and machine learning algorithms
- Support for multiple languages (such as Scala, Java, Python, R)
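A minimal PySpark sketch of the in-memory caching idea, assuming a local Spark installation; the dataset and the "value" column are made up for illustration:

```python
# Minimal PySpark sketch: cache data in memory and reuse it across actions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

# Hypothetical numeric dataset; in practice this would be read from storage.
df = spark.range(0, 1_000_000).withColumnRenamed("id", "value")

df.cache()                                     # keep the data in memory
total = df.count()                             # first action materialises the cache
mean = df.selectExpr("avg(value)").first()[0]  # later actions read from memory

print(total, mean)
spark.stop()
```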
What are the five components of Apache Spark?
- Apache Spark Core: Execution engine for Spark (provides in-memory computing)
- Spark SQL: Introduces a data abstraction for structured and semi-structured data; the Resilient Distributed Dataset (RDD) is a fundamental data structure of Spark (see the sketch after this list)
- Spark Streaming: Performs streaming analytics; crunches data in mini-batches and performs RDD transformations on those mini-batches of data
- MLlib (Machine Learning Library): A distributed machine learning framework
- GraphX: A distributed graph-processing framework
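A hedged sketch of how the Spark SQL and MLlib components fit together; the tiny dataset, the "points" view, and the column names are invented for illustration and are not from the lecture:

```python
# Sketch: Spark SQL for structured data plus MLlib for distributed ML.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("components-demo").getOrCreate()

# Spark SQL: structured data as a DataFrame (built on top of RDDs),
# queryable with SQL through a temporary view.
df = spark.createDataFrame([(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)], ["x", "y"])
df.createOrReplaceTempView("points")
spark.sql("SELECT avg(y) AS mean_y FROM points").show()

# MLlib: fit a simple distributed linear regression on the same data.
train = VectorAssembler(inputCols=["x"], outputCol="features").transform(df)
model = LinearRegression(featuresCol="features", labelCol="y").fit(train)
print(model.coefficients, model.intercept)

spark.stop()
```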