U5 Flashcards
As its name suggests, “big data” refers to huge and fast-growing data. Big data was initially:
attributed to search engines and social networks, and is now making its way into enterprises.
There exist several challenges when working with big data, including?
how to store it and how to process it.
Among these challenges is enabling the databases to meet the needs of?
high concurrent reading and writing with low latency.
there is an immense need to lower the costs of storing big data?
Because, with the dramatic increase in data, database costs,
e.g., hardware,
software,
and operating costs,
have accordingly increased.
The traditional relational databases, i.e., those queried using the structured query language (SQL), are a?
collection of data items with pre-defined relationships between them.
These items are organized as a set of tables with:
columns and rows.
Unfortunately, these relational databases have some inherent limitations which emerge with:
the rapid growth of data.
In these cases, relational databases are:
highly prone to deadlocks and other concurrency issues.
These situations lead to rapid declines in?
the efficiency of reading and writing.
Furthermore, the multi-table correlation mechanism that exists in —————— represents a major limitation of database scalability. To overcome these problems, —————— databases were proposed instead of the traditional database. NoSQL is an —————— term for —————— databases which do not use the SQL structure.
relational database
NoSQL
umbrella
non-relational
NoSQL databases are useful for ?
applications that deal with very large semi-structured and unstructured data.
Unlike relational databases, NoSQL databases are designed to ?
scale horizontally and can be hosted on a cluster of processors.
In most of these databases, each row is a ?
key-value pair.
NoSQL databases include truly elastic databases, e.g., MongoDB and Cassandra, which allow?
the addition/removal of nodes to/from a cluster without any observable down-time for the clients.
To this end, routing algorithms are used to decide when to move the inter-related data chunks, for instance, ?
when data must be moved to newly added node B. During the copying process, the data is served from the original node A. When the new node B has an up-to-date version of the data, the routing processes start to send requests to the node B.
In general, there are some important aspects related to distributed databases that need to be thoroughly addressed, including :
scalability,
availability,
and consistency.
First, scaling is typically achieved through?
“sharding” to meet the data volume.
Sharding is ?
a type of database partitioning that separates very large databases into smaller, faster, more easily managed parts, referred to as data shards.
NoSQL databases support an auto-sharding mode in which?
the shards are automatically balanced across the nodes on a cluster.
Additional nodes can be easily added as ?
necessary to the cluster to align with data volume.
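The sharding cards above can be made concrete with a small sketch. The following Python illustration of hash-based sharding is a minimal, hypothetical example (the node names and the MD5-based scheme are assumptions for illustration, not any specific NoSQL product's algorithm): the shard for a record is derived from a stable hash of its key, so records spread roughly evenly across the cluster.

```python
import hashlib

# Hypothetical 4-node cluster; real systems also rebalance existing
# shards when nodes are added or removed (auto-sharding).
NODES = ["node-0", "node-1", "node-2", "node-3"]

def shard_for(key: str, nodes=NODES) -> str:
    """Map a record key to a shard/node using a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

if __name__ == "__main__":
    for sensor_id in ("sensor-17", "sensor-42", "sensor-99"):
        print(sensor_id, "->", shard_for(sensor_id))
```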
Second, availability can be achieved via replication, i.e., ?
master-slave replication or peer-to-peer replication.
With master-slave replication, two types of nodes are typically implemented, including:
a master node, to which all the write operations go, and slave nodes.
Data can be read from any node, either a ——————. If a master node goes down, a slave node gets promoted to a ——————, and continues to replicate to the ——————.
master or a slave.
master node
third node
When a failed master node is resurrected, it joins the cluster as a slave. Alternatively,?
peer-to-peer replication is slightly more complex, as all the nodes receive read/write requests.
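To make the replication cards concrete, here is a toy Python sketch (the names and failover behavior are simplified assumptions, not any particular database's implementation) of how a client-side router might direct writes to the master, serve reads from any node, and promote a slave when the master fails:

```python
import random

class ReplicaSet:
    """Toy master-slave request router; illustrative only."""

    def __init__(self, master, slaves):
        self.master = master
        self.slaves = list(slaves)

    def route_write(self):
        return self.master  # all write operations go to the master

    def route_read(self):
        # Reads may be served by any node, master or slave.
        return random.choice([self.master] + self.slaves)

    def fail_master(self):
        # Promote one slave to master; the failed master would later
        # rejoin the cluster as a slave.
        self.master = self.slaves.pop(0)

rs = ReplicaSet("node-A", ["node-B", "node-C"])
print("write ->", rs.route_write())
rs.fail_master()
print("write after failover ->", rs.route_write())
```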
In terms of consistency, two major types of inconsistencies exist:
read and write.
Read inconsistencies arise in —————— replication when a user tries to read from a —————— before changes propagate from the ——————, while in —————— replication the user runs into both read and write inconsistencies, as writes (updates) are allowed on ——————.
master/slave
slave
master node
peer-to-peer
multiple nodes
It is obvious that availability and consistency are ?
two contradicting metrics.
Achieving the right balance between these metrics highly depends on ?
the nature of the IoT application.
For example,
a user can prohibit read and write inconsistencies by treating slaves as hot standbys without reading from them.
MongoDB is a prominent example of a document-oriented, scalable NoSQL database system which has?
a powerful query language.
MongoDB supports complex data types, e.g.,?
BSON data structures.
It allows most functions like?
- single-table queries as in relational databases,
- and it also supports indexing.
- Furthermore, MongoDB has the advantage of supporting high-speed access to mass data.
When the stored data exceeds 50 GB, the access speed of MongoDB is?
ten times higher than MySQL (Yan, 2015).
Thanks to these characteristics, many system designers are?
considering MongoDB instead of relational databases.
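As a hedged illustration of MongoDB's document model and indexing, the following sketch uses the `pymongo` driver; it assumes a MongoDB server running on localhost, and the database and collection names are made up for this example:

```python
from pymongo import MongoClient

# Connect to a (assumed) local MongoDB server.
client = MongoClient("mongodb://localhost:27017/")
readings = client["iot_demo"]["sensor_readings"]

# Documents are schema-free BSON, so semi-structured data fits naturally.
readings.insert_one({"sensor": "temp-01", "year": 2020, "value": 31.4})

# Secondary index on (sensor, year) to speed up queries.
readings.create_index([("sensor", 1), ("year", 1)])

for doc in readings.find({"sensor": "temp-01"}, {"_id": 0}):
    print(doc)
```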
Another example of a NoSQL database is?
Apache Cassandra. It offers good scalability and high availability without compromising performance.
Cassandra demonstrated?
fault-tolerance on commodity hardware (i.e., cloud infrastructures)
and linear scalability,
thus making it the ideal platform for mission-critical data.
Cassandra features allow?
replication across multiple datacenters,
offering lower latency for data availability during regional outages.
With Cassandra, columns can be easily indexed with ?
a powerful built-in caching mechanism.
Netflix, Twitter, Urban Airship, Reddit, Cisco, OpenX, and Digg are examples of the companies that use?
Cassandra to deal with huge, active, online interactive datasets.
The largest known Cassandra cluster has over?
300 TB (terabytes) of information on over 400 machines.
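For comparison, a minimal Cassandra sketch using the DataStax `cassandra-driver` package is shown below; it assumes a reachable Cassandra node on localhost, and the keyspace and table are invented for illustration:

```python
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])     # assumed local Cassandra node
session = cluster.connect()

# Single-datacenter keyspace for demo purposes only.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS iot_demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS iot_demo.readings (
        sensor text, ts timestamp, value double,
        PRIMARY KEY (sensor, ts))
""")

# Insert one reading and read it back.
session.execute(
    "INSERT INTO iot_demo.readings (sensor, ts, value) "
    "VALUES (%s, toTimestamp(now()), %s)",
    ("temp-01", 31.4),
)
for row in session.execute(
        "SELECT * FROM iot_demo.readings WHERE sensor = %s", ["temp-01"]):
    print(row.sensor, row.ts, row.value)
```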
Processing a massive amount of data, i.e., big data, demands?
a shift from the client-server model of data processing, where a client node pulls the data from a server node.
Instead, data can be processed on the ——————. In addition, this processing can be carried out independently in parallel, as the underlying data is already —————— across different nodes.
cluster nodes
partitioned
This approach of data processing is referred to as ?
the MapReduce framework and it also interestingly uses key-value pairs.
MapReduce makes use of hundreds or even thousands of “pluggable” nodes in a cluster to?
process data in parallel, which significantly shortens the time between the operational events and presenting the analytics results.
The —————— framework offers an effective method for the efficient analysis of the collected —————— data, especially when the computations involve linearly computable —————— functions over the elements of the data streams, e.g.,
——————.
MapReduce
sensor
statistical
MIN, MAX, SUM, and MEAN
Google’s original MapReduce framework was designed for?
analyzing large amounts of web logs, and more specifically deriving such linearly computable statistics from the logs.
In fact, the sensor-generated data has many conceptual similarities to web logs. Specifically,?
they are similarly repetitive, and the typical statistical computations which are often performed on sensor data for many applications are linear in nature.
this framework represents an ideal candidate for sensor data analytics?
Because sensor-generated data resembles web logs, and the statistical computations often performed on it are linear in nature.
The figure below demonstrates the MapReduce architecture for processing sensor data in parallel on different processing nodes.
To understand the MapReduce framework, consider a case where?
the maximum temperature each year is to be determined from sensor data recorded over a long period of time.
To this end, the “Map” and “Reduce” functions of MapReduce are defined with respect to data structured in (key, value) pairs.
For example,
the data can be in the form of (year, value) where the year is the key.
The Map function takes a list of pairs (year, value) from one domain and then returns a list of pairs (year, local max value).
The local max value denotes the local maximum in the subset of the data processed by that node.
This computation is typically performed in parallel by dividing the key-value pairs across different distributed computers.
Subsequently, the MapReduce framework combines ?
all pairs with the same key from all lists, thus creating one group for each one of the different generated keys.
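The Map and grouping (shuffle) steps can be sketched in plain Python for the max-temperature example; this is a single-process illustration of logic that a real framework would run across many nodes:

```python
from collections import defaultdict

def map_step(records):
    """Emit (year, local_max) pairs for one node's subset of (year, value) data."""
    local_max = {}
    for year, value in records:
        local_max[year] = max(value, local_max.get(year, float("-inf")))
    return list(local_max.items())

# Two "nodes", each holding a partition of the sensor data.
node_a = [(2019, 30.1), (2020, 28.4), (2019, 33.7)]
node_b = [(2020, 35.2), (2019, 29.9)]

# Shuffle: group all (year, local_max) pairs by key across the nodes.
groups = defaultdict(list)
for year, local_max in map_step(node_a) + map_step(node_b):
    groups[year].append(local_max)

print(dict(groups))   # {2019: [33.7, 29.9], 2020: [28.4, 35.2]}
```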
This grouping step requires —————— between the different ——————. However, the cost of this communication is much lower than moving the —————— around because the —————— has already generated a —————— summary of the processed data.
communication
computers
original data
Map step
compact
It is worth mentioning that the exact implementation of the Map step widely depends upon?
the implementation of the adopted MapReduce,
and also on the exact nature of the distributed data.
For instance, the sensor data may be distributed over a local cluster of computers (with the use of an implementation such as Hadoop). An alternative solution is to geographically distribute the sensor data because?
the data is originally created at different locations, and it is too expensive to move the data around.
The latter scenario is much more suited for —————— applications. Nevertheless, the steps for collecting the intermediate results from the different Map steps may depend upon the specific —————— in which the MapReduce framework is ——————.
IoT
implementation and scenario
utilized
After performing the grouping step, the Reduce function is applied in?
parallel to each group.
Such a step generates a collection of?
values in the same domain.
Next, we apply Reduce —————— in order to create list——————.
(k2, list(V2))
(v3)
Each execution of the Reduce function returns only one value, although it is also?
possible for the function to return more than one value.
For instance,
the input to the “Reduce” function will be a list in the form (Year, [local max1, local max2, …, local maxr]),
where the local maximum values are determined by the execution of the different Map functions.
The Reduce function determines the maximum value over the corresponding list in each call of the Reduce function.
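Continuing the single-process sketch above, the Reduce step takes each (year, [local maxima]) group produced by the shuffle and computes the global maximum per year; this standalone snippet hard-codes the shuffle output for clarity:

```python
def reduce_step(year, local_maxima):
    """Return the global maximum for one year's list of local maxima."""
    return year, max(local_maxima)

groups = {2019: [33.7, 29.9], 2020: [28.4, 35.2]}  # output of the shuffle
results = [reduce_step(year, maxima) for year, maxima in groups.items()]
print(results)   # [(2019, 33.7), (2020, 35.2)]
```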
A Hadoop client typically submits jobs to the MapReduce framework through what is called?
the “jobtracker” running on the master server.
Subsequently, the —————— automatically assigns the jobs to —————— running on many ——————.
jobtracker
“tasktrackers”
slave nodes
The tasktrackers regularly send heartbeats to the jobtracker to update the status, e.g.,?
alive, idle or busy.
If a job fails or times out, or a node dies, the jobtracker can automatically reschedule?
the jobs to run on available nodes.
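A toy sketch of this heartbeat-based failure detection is shown below; it is loosely modeled on the jobtracker/tasktracker interaction, and the timeout value is an arbitrary assumption rather than Hadoop's actual configuration:

```python
import time

HEARTBEAT_TIMEOUT = 10.0  # seconds; assumed value, not Hadoop's default

class JobTracker:
    """Toy tracker that marks tasktrackers dead after a heartbeat timeout."""

    def __init__(self):
        self.last_heartbeat = {}   # tasktracker name -> last heartbeat time

    def heartbeat(self, tracker, status="alive"):
        # Tasktrackers regularly report status, e.g., alive, idle, or busy.
        self.last_heartbeat[tracker] = time.time()

    def dead_trackers(self):
        now = time.time()
        return [t for t, ts in self.last_heartbeat.items()
                if now - ts > HEARTBEAT_TIMEOUT]

jt = JobTracker()
jt.heartbeat("tasktracker-1")
# Jobs on any tracker in jt.dead_trackers() would be rescheduled onto
# the remaining healthy nodes.
print(jt.dead_trackers())
```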
In general, HDFS comprises two components, namely ?
name-nodes
and data-nodes.
A name-node is?
responsible for keeping the metadata about the data on each data-node.
When a client application reads or writes data into HDFS, it must communicate with?
the name-node to get the locations of the data blocks to be read from or written to.
The metadata is read into main memory when ?
Hadoop starts, and is dynamically maintained.
A data-node updates the name-node with the metadata of its local data blocks through?
heartbeats.
Hadoop also has?
a secondary name-node mainly used to store the latest checkpoints of HDFS states.
Although the Hadoop MapReduce framework has the goal of
high scalability
and better fault-tolerance,
it is not ?
optimized for input/output efficiency. Specifically, both the Map and Reduce functions are “block operations” in which data transition cannot proceed to the next stage until the tasks of the current stage have finished.
Accordingly, the output of mappers needs to be ?
first written into HDFS before being shuffled to the reducers.
Such blocking, the one-to-one shuffling strategy, and the runtime scheduling ?
degrade the performance of each node.
The MapReduce framework lacks:
a database management system
and does not optimize data transfer across various nodes.
it is more suitable for batch jobs than real-time processing?
Because Hadoop has an inherent latency problem.
Large-scale IoT applications,
e.g., traffic monitoring,
weather forecasting,
homeland security,
entertainment,
and disaster response, often have ?
the challenge of capturing too much data with too little inter-operability.
Such challenges are accompanied with ?
too little knowledge about the ability to utilize different resources which are available in real time.
To overcome these challenges, the Sensor Web Enablement initiative ?
defines service interfaces which enable developers to make all types of sensors, transducers, and sensor data repositories discoverable, accessible, and usable via the Web.
Such standardized interfaces are extremely beneficial since?
they hide the heterogeneity of the underlying IoT devices from the applications that use them.
In this context, the term “Sensor Web” defines ?
an infrastructure enabling access to IoT devices and archived sensor data.
Such data can readily be discovered and accessed using ——————. The goal of the Sensor Web is to enable real-time “——————” in order to ensure timely —————— to a wide variety of events.
standard protocols and APIs
situation awareness
responses
The major benefits of the IoT sensor data can only be realized if we have?
the infrastructure and mechanisms to synthesize, interpret, and apply this data intelligently via automated means.
The Sensor Web enables automated applications to
understand,
interpret,
and reason with basic but critical semantic notions such as ?
“nearby,” “far,” “soon,” “immediately,” “dangerously high,” “safe,” “blocked,” or “smooth.” Ontologies are at the heart of the semantic sensor web technology.
An ontology is a mechanism for?
knowledge sharing and reuse.
Ontologies are generally knowledge representation systems. To represent?
resources, the Resource Description Framework (RDF) data model is widely used.
Literally, a resource is ?
any device or concept, e.g., person, place, restaurant. Each resource is uniquely identified by a URI.
Aside from describing resources, RDFs are capable of ?
specifying how resources are inter-related through performing inference.
The building blocks of RDF are triples, where a triple is?
a 3-tuple of the form <subject, predicate, object> where subject, predicate, and object are interpreted as in a natural language sentence.
It is most helpful to perceive RDF as a
——————, where subject resources are represented in ——————, literals in ——————, and predicates (relationships) are represented as directed —————— or between ——————.
graph
ovals
rectangles
edges between ovals
ovals and rectangles
For instance,
the triple representation of the sentence, “Washington, D.C. is the capital of the United States,” is illustrated in the following figure.
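The same triple can also be built programmatically; the following sketch uses the `rdflib` Python library, with URIs loosely modeled on DBpedia naming (the exact property name `capitalOf` is an illustrative assumption):

```python
from rdflib import Graph, Namespace

g = Graph()
dbr = Namespace("http://dbpedia.org/resource/")   # resources (subjects/objects)
dbo = Namespace("http://dbpedia.org/ontology/")   # properties (predicates)

# <subject, predicate, object>:
# "Washington, D.C. is the capital of the United States."
g.add((dbr["Washington,_D.C."], dbo.capitalOf, dbr.United_States))

for subject, predicate, obj in g:
    print(subject, predicate, obj)
```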
The Web Ontology Language (OWL) is ?
another ontology formalism that was developed to overcome the drawbacks of RDF.
Specifically, RDFs do not provide ways to represent constraints,
e.g., domain or range constraints.
Furthermore, —————— cannot be represented in the RDF data model.
transitive or inverse properties
Extending RDF(S) makes it straightforward to provide?
a formal specification in OWL.
Both RDF and OWL ontology formats have extensive developer community support in terms of?
the availability of tools for ontology creation and authoring.
An example is ?
Protege, which supports RDF and OWL formats, and data storage and management stores, such as
OpenSesame, for efficient storage and querying of data in RDF or OWL formats.
Furthermore, there is significant availability of actual ontologies in a variety of domains in the —————— formats.
RDF and OWL
The Semantic Sensor Network (SSN) is an example of ?
an ontology which relies on the OWL data model to describe sensors and observations.
It describes the IoT sensors in terms of ?
their capabilities,
measurement processes,
observations,
and deployments.
The SSN ontology is conceptually organized into ——————. In fact, the ontology can be seen from —————— main perspectives, namely?
ten modules
four
sensor perspective,
observation perspective,
system perspective,
and feature and property perspective.
The full ontology consists of?
41 concepts and 39 object properties.
The ontology can describe?
sensors,
the accuracy and capabilities of such sensors,
observations,
and methods used for sensing.
Concepts for operating and survival ranges are also included, as?
these are often part of a given specification for a sensor, along with its performance within those ranges.
Finally, a structure for field deployments is included to?
describe deployment lifetimes and sensing purposes of the deployed macro instrument.
To achieve automatic processing and interpretation of the IoT data, we need ?
common agreements on providing and describing the IoT data.
To evaluate the quality aspects of data, the source provider,
device,
and environment-specific information also need to be?
associated with the data.
Considering the diversity of data types, device types, and potential providers in the IoT domain, common description frameworks are essential to ?
describe
and represent the data to make it seamlessly accessible
and processable across heterogeneous platforms.
The semantic descriptions and annotations must be provided at different layers of the IoT framework, including:
the “Things” level, the device and network level (e.g., the SSN ontology), and the interaction and business process model, to?
enable autonomous processing and interpretation of the IoT data.
In fact, the effective discovery, access, and utilization of the IoT resources require?
machine-interpretable descriptions of different components and resources in the IoT framework,
e.g., sensors, actuators, and network resources.
The current Semantic Web technologies and ontologies can efficiently describe various aspects of?
the IoT data and resources.
Description models and representation frameworks that can describe the IoT data and services need to consider the constraints and dynamicity of the IoT domain,
since IoT environments are often dynamic and pervasive.
In this context, the concept of “linked data” emerges to connect?
individual data items to support semantic query and inferences on the data coming from the physical and virtual objects.
In other words,
linked data simply refers to data published on the Web in such a way that it is machine-readable, its meaning is explicitly defined, and it is readily linked to other external data sets.
The linked data, represented using formal knowledge representation —————— such as ——————, provides potential for information reuse and —————— among —————— sources. The information published as linked data is typically structured and interlinked.
formalisms
RDF and OWL
interoperability
heterogeneous
In general, publishing linked data widely encourages the reuse?
of existing information rather than creating new information.
This implies that human users can exploit the existing knowledge base by?
simply providing links to the data in it.
For instance,
the DBpedia project
- extracts structured information from Wikipedia.
- DBpedia enables sophisticated queries over the information that exists in Wikipedia.
- Moreover, it provides new ways of browsing and navigation through the semantic links.
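As an illustration, a SPARQL query against DBpedia's public endpoint can be issued with the `SPARQLWrapper` package; this sketch assumes network access, and the property used may differ from DBpedia's current schema:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
# Ask DBpedia for the capital of the United States (property assumed).
sparql.setQuery("""
    SELECT ?capital WHERE {
        <http://dbpedia.org/resource/United_States>
            <http://dbpedia.org/ontology/capital> ?capital .
    }
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["capital"]["value"])
```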
Nevertheless, semantic descriptions without being linked to other existing data on the Web would be mostly?
processed locally and according to the domain descriptions (i.e., ontologies) and their properties.
The linked data approach offers four main principles to publish linked data:
(1) using URIs as names for data;
(2) providing HTTP access to those URIs;
(3) providing useful information for URIs using standards such as RDF and SPARQL; and finally
(4) including links to other URIs.
In fact, the emergence of sensor data as linked data enables IoT applications and sensor network providers to?
connect sensor descriptions to potentially endless data existing on the Web.
Specifically, the action of relating sensor data attributes, such as location, type, and measurement features, to the other resources on the Web of data enables the users to integrate physical world data and the logical world data.
The results of such an integration are:
drawing beneficial conclusions,
creating business intelligence,
enabling smart environments,
and supporting automated decision-making systems.
In order to get the most out of the integration of IoT and cloud computing, the use of ?
microservices is recommended.
Microservices represent an architectural approach for developing applications as a set of small services, where?
each service is running as a separate process, communicating through simple mechanisms.
Most of the advantages of the microservices architecture stem from ?
decomposing a service or an application into smaller components, i.e., microservices.
Each of these components should implement a specific functionality. As a result:
we can independently develop,
deploy,
upgrade,
and scale every microservice.
each microservice can be separately scaled?
Since the different microservices may have different workloads.
Accordingly, we can?
- use an optimal amount of resources, making the microservices architecture a natural fit for achieving both scalability and elasticity.
- separately control every microservice; each is easily manageable thanks to being small.
Developing microservices separately enables?
the employment of different technologies, e.g., different programming languages, for each microservice.
Furthermore, the task of releasing an update for a part of our application or service does not require?
the redeployment of the whole application,
but only
the corresponding microservice.
Microservices often communicate through web services, such as:
REST,
or through remote procedure calls (RPC).
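A minimal sketch of one such REST-communicating microservice is shown below, using Flask; the service name, route, and in-memory store are illustrative assumptions, and any HTTP framework would serve equally well:

```python
from flask import Flask, jsonify, request

app = Flask("temperature-service")
readings = []  # in-memory store; a real service would use its own database

@app.route("/readings", methods=["POST"])
def add_reading():
    # Another microservice would POST sensor readings here as JSON.
    readings.append(request.get_json())
    return jsonify(status="stored"), 201

@app.route("/readings", methods=["GET"])
def list_readings():
    return jsonify(readings)

if __name__ == "__main__":
    # Each microservice runs as its own process; a peer service would
    # call this one over HTTP, e.g. GET http://localhost:5000/readings
    app.run(port=5000)
```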
there is a need to reduce the communication between the different microservices to a minimum?
As communication between processes may become costly.
The advantages of microservices architecture are best identified when?
we compare it to the traditional monolithic architecture.
A monolithic application has all of its components packed together.
For instance,
monolithic web applications have the client-side,
the server-side,
and the database in a single logical executable.
Similarly, monolithic IoT applications have the whole logic for communication with IoT devices,
processing of devices’ data, communication with databases,
and visualization, in a single logical executable.
To achieve scalability and elasticity in monolithic applications, more instances of the whole application must be?
deployed or terminated. However, different application functionalities rarely have an equal share of the workload.
Alternatively, every microservice is packed as?
an independent component in the microservices approach.
We can scale every —————— independently and change the number of instances for each microservice separately. In this context, the application —————— can be controlled according to the workload of each of the microservices. To summarize, microservices —————— application development.
microservice
scalability
enable a scalable
elastic,
and resource-efficient
Another major difference between microservices and monolithic architecture is that?
the latter usually runs as a single process.
If we want to release an update of the application, the whole application must be?
redeployed for the changes to take effect;
it does not matter which component we have changed. With microservices, the update of one microservice has to cause?
no changes or only minor changes to the other microservices.
In this realm, we can highlight a potential challenge of the microservices approach. As mentioned previously, the communication between components is?
relatively expensive and has to be deliberately minimized.
If a change in a single microservice imposes many changes in other microservices, ?
the advantages of the microservices architecture might be lost.
For instance,
we should “componentize” the application into microservices in a way that would allow for the communication between microservices to be minimal?
Because a change in how an application communicates with IoT devices and receives data from them must have no impact or only minimal impact on how we process the data.
In some applications, the monolithic architecture could become excessively large. In these cases, several drawbacks emerge, such as:
the difficulty of software management, being more vulnerable, and being harder to update.
Bugs in monolithic applications could be expensive,?
as they cause the whole application to crash, whereas in the microservices architecture only the corresponding microservice collapses.
In this case, the microservices-based application can continue?
running and only the specific functionality implemented by the malfunctioned microservice is unavailable.
Such behavior of —————— is highly important in the IoT domain. For example,?
microservices-based architectures
if a microservice which communicates with a certain group of sensors crashes,
such a crash will not affect or stop the processing of the data provided by microservices
which communicate with other sensors.
The other components of the application will still be up and properly running.
In general, IoT applications have high requirements regarding?
scalability.
These scalability requirements fundamentally push toward?
designing distributed architectures rather than monolithic ones.
In general, the microservices architecture is ?
adaptable to the requirements of IoT applications.
When developing applications, it is generally good practice to break down the application into several ——————. Such components that programmers frequently use are referred to as ——————.
components
libraries
The concept of services in microservices architecture is similar to the libraries concept with one major difference:
libraries are essentially linked to a main program and when the program is running, there is only one process.
On the other hand, the microservices architecture tends to componentize a project into services, where each service is running in its own separate process.
each microservice could be deployed and scaled independently?
As the microservices architecture tends to componentize a project into services, where each service is running in its own separate process.
each microservice could be deployed and scaled independently. By ?
componentization into microservices, the problem of vast heterogeneity of devices could be simply addressed.
To this end, distinctive microservices can be implemented as?
proxies for the IoT devices that communicate using different protocols, e.g., Wi-Fi, LoRa, BLE
Furthermore, adding new devices, which may communicate using unsupported protocol, is usually resolved by?
adding a microservice acting as a proxy between protocols.
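The proxy idea can be sketched as follows; the protocol parsers are stand-ins for real BLE/LoRa handling, and the payload formats are invented for illustration. Supporting a new protocol amounts to adding a parser (or a new proxy microservice), with no change to the downstream services:

```python
def parse_ble(payload: bytes) -> dict:
    """Stand-in for a BLE parser: interpret the payload as a big-endian int."""
    return {"protocol": "BLE", "value": int.from_bytes(payload, "big")}

def parse_lora(payload: str) -> dict:
    """Stand-in for a LoRa parser: interpret the payload as a float string."""
    return {"protocol": "LoRa", "value": float(payload)}

def normalize(protocol: str, payload) -> dict:
    """Proxy entry point: route a raw payload to the right protocol parser
    and hand back one common reading format for the rest of the system."""
    parsers = {"ble": parse_ble, "lora": parse_lora}
    return parsers[protocol](payload)

print(normalize("ble", b"\x00\x1f"))   # {'protocol': 'BLE', 'value': 31}
print(normalize("lora", "31.4"))       # {'protocol': 'LoRa', 'value': 31.4}
```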
In general, there exist two common approaches for decomposition of applications into microservices:
verb-based
and noun-based strategies.
The former strategy deals with the —————— of an application around single use cases. Such a decomposition strategy is —————— for —————— applications.
decomposition
ill-suited
IoT
If we are dealing with multiple groups of devices, we might group the logic for communication with a certain type of devices, e.g.,?
temperature sensors, the data processing logic, and the visualization logic for this certain group of sensors in one microservice.
In fact, the approach is not a natural fit for IoT applications, ?
as the scaling of different modules is dependent upon different factors.
For instance:
The communication with devices is most dependent upon the number of devices and the amount of data they generate, while the visualization application must consider the number of users which access it simultaneously.
In the noun-based decomposition, a microservice is responsible for?
every operation related to a certain functionality.
A single microservice communicates with?
the devices and exchanges data with them
a second microservice processes the data, e.g., ?
CEP engine;
a third microservice might store the data in ?
a database for later processing;
and finally, a fourth microservice might be responsible for?
data visualization.
Such a decomposition leads to the design of a dynamic application, where ?
each functionality can be separately scaled.
A combination of the verb-based and noun-based approach is also ?
possible.
fault tolerance can be easily considered?
Since microservices are independent components.