Data Storage and Processing Flashcards

Question 1

Q

What challenge do most existing IoT solutions face?

Answer

A

Most IoT solutions are tailored to specific verticals, leading to separate data silos, which makes it difficult to capture the full potential of IoT across multiple domains.

Question 2

Q

Why is handling IoT data from different domains challenging?

Answer

A

IoT data come from various structures, sources, and descriptions, making it complex to integrate and process them properly across different domains.

Question 3

Q

What is required to ensure interoperability of IoT devices?

Answer

A

IoT data must be stored in different network databases, shared among multiple nodes, analyzed by various tools, and interpreted by different machines to ensure interoperability.

Question 4

Q

What is the Semantic Web, and how does it help with IoT data?

Answer

A

The Semantic Web, or linked data web, provides reasoning engines and tools to analyze and link IoT data meaningfully across various domains.

Question 5

Q

What role does complex event processing play in IoT data analysis?

Answer

A

Complex event processing searches for dependencies and patterns in streaming IoT data, creating real-time insights to help businesses identify opportunities and threats early.

Question 6

Q

Why is a single server insufficient for handling IoT data?

Answer

A

IoT data are often too large for a single server or database to handle, requiring distributed processing approaches like MapReduce.

Question 7

Q

How does the MapReduce programming model help manage IoT data?

Answer

A

MapReduce splits a dataset across multiple machines, processes the pieces separately in parallel, and then recombines the results, making it possible to handle large volumes of structured and unstructured IoT data.

Question 8

Q

How did the web evolve from its initial phase to the Semantic Web (Web 3.0)?

Answer

A

The web started as a collection of documents linked to each other and gradually evolved into the Semantic Web, where documents and pieces of data are meaningfully connected.

Question 9

Q

What was unclear about the relationships between documents in the early phases of the web?

Answer

A

In the early phases, relationships between documents were unclear because they were not linked to specific pieces of data.

Question 10

Q

What does the Semantic Web enable for users and machines?

Answer

A

The Semantic Web provides meaningful links between data, allowing users (both humans and machines) to explore and understand connections between pieces of information.

Question 11

Q

What is linked data, and what does it create?

Answer

A

Linked data refers to semantically linking and integrating pieces of information across domains, creating a global web that connects data on topics like books, companies, and social media.

Question 12

Q

How do machines use linked data in the Semantic Web?

Answer

A

Machines can connect distributed data sources, process new data as they appear on the web, and produce integrated results, enhancing applications like data browsers and search engines.

Question 13

Q

What does a generic linked data browser allow users to do?

Answer

A

A generic linked data browser lets users browse a data source and travel along links to related sources, enhancing data exploration.

Question 14

Q

What capability do linked data search engines provide?

Answer

A

Linked data search engines allow expressive query capabilities over aggregated data by crawling the global web of linked data.

Question 15

Q

What is linked data?

Answer

A

Linked data refers to machine-readable, well-defined information published on the web that can be connected to external datasets from various sources.

Question 16

Q

What format is used in linked data technologies to connect information?

Answer

A

Linked data technologies use the Resource Description Framework (RDF) format to create a web of data by linking different things.

Question 17

Q

What kinds of data sources can linked data technologies connect?

Answer

A

Linked data can connect data sources ranging from geographically distributed databases to heterogeneous systems that cannot interoperate at the data level.

Question 18

Q

Who specified the rules for publishing data as part of the global web of data?

Answer

A

Tim Berners-Lee, the inventor of the World Wide Web, specified the rules for publishing data as part of the global web of data.

Question 19

Q

What are the four linked data principles as specified by Tim Berners-Lee?

Answer

A

Use Uniform Resource Identifiers (URIs) as names for things.
Use HTTP URIs so that people can look up those names.
Use RDF and SPARQL standards to provide useful data.
Include links to other URIs to help people discover more things.

Question 20

Q

Which two fundamental web technologies are relied on by the first two linked data principles?

Answer

A

The first two linked data principles rely on Uniform Resource Identifiers (URIs) and Hypertext Transfer Protocol (HTTP).

Question 21

Q

How does RDF enhance linked data?

Answer

A

RDF supports a generic, graph-based data model that structures and links data describing things in the world, enhancing linked data.

Question 22

Q

What does the Resource Description Framework (RDF) syntax encode and represent?

Answer

A

RDF encodes and represents web resources and data in a structure known as triples.

Question 23

Q

What are the three components of an RDF triple?

Answer

A

Subject: A resource identified by a URI.
Predicate: A URI specifying the relationship between the subject and object.
Object: A resource identified by a URI, or a literal (a basic value such as a string), related to the subject.

Question 24

Q

What does the predicate represent in an RDF triple?

Answer

A

The predicate specifies the relationship between the subject and the object, represented by a URI.

Question 25

Q

What can the object in an RDF triple be?

Answer

A

The object can be either a resource identified by a URI or a literal (a basic value such as a string).

Question 26

Q

How are subjects and objects in RDF triples similar to hypertext links?

Answer

A

Like hypertext links that connect documents, subjects and objects in RDF triples link items in various datasets, contributing to the web of data.

Question 27

Q

Give an example of an RDF triple relationship.

Answer

A

An example is “Berlin” (subject) and “Germany” (object) being related through the predicate “is the capital of,” showing that Berlin is the capital of Germany.
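
The Berlin example can be written down as data. The following is a minimal sketch, assuming a hypothetical `http://example.org/` namespace and hypothetical predicate names (`isCapitalOf`, `name`); it only illustrates the subject, predicate, object shape and is not tied to any particular RDF library.

```python
from typing import NamedTuple, Union


class Literal(NamedTuple):
    """A plain value such as a string."""
    value: str


class Triple(NamedTuple):
    subject: str              # URI of the resource being described
    predicate: str            # URI naming the relationship
    obj: Union[str, Literal]  # URI of another resource, or a literal value


EX = "http://example.org/"  # hypothetical namespace, for illustration only

# "Berlin is the capital of Germany": the object is another resource.
capital = Triple(EX + "Berlin", EX + "isCapitalOf", EX + "Germany")

# The object can also be a literal value instead of a resource.
name = Triple(EX + "Berlin", EX + "name", Literal("Berlin"))


def object_is_resource(triple: Triple) -> bool:
    """True when the object is a URI, i.e. a link to another resource."""
    return isinstance(triple.obj, str)
```

Here `object_is_resource(capital)` is true, while `object_is_resource(name)` is false because the object is a literal.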

Question 28

Q

What type of relationship exists in RDF between subject and object resources?

Answer

A

RDF defines a unidirectional relationship from the subject to the object resource.

Question 29

Q

Can a resource in RDF be used in multiple triples? If yes, in what roles?

Answer

A

Yes, a resource can be used in various triples with different roles: as a subject, predicate, or object.

Question 30

Q

What do multiple connections between RDF triples create?

Answer

A

Multiple connections between RDF triples create a connected graph of data.

Question 31

Q

In an RDF graph, how are resources and predicates represented?

Answer

A

Resources are represented as nodes.
Predicates (relationships between nodes) are depicted with lines connecting the nodes.
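
To see how triples form a graph, the sketch below stores each triple as a directed, labeled edge and then follows the edges outward from one node. The `ex:` names are hypothetical, and the traversal is a plain depth-first search rather than anything RDF-specific.

```python
from collections import defaultdict

# Hypothetical triples; "ex:Germany" appears as an object and as a subject.
triples = [
    ("ex:Berlin", "ex:isCapitalOf", "ex:Germany"),
    ("ex:Germany", "ex:isPartOf", "ex:Europe"),
    ("ex:Berlin", "ex:locatedIn", "ex:Europe"),
]


def build_graph(triples):
    """Map each subject node to its outgoing (predicate, object) edges."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return graph


def reachable(graph, start):
    """Return every node that can be reached from start by following edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for _, target in graph.get(node, []):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen
```

Starting from `ex:Berlin`, the traversal reaches both `ex:Germany` and `ex:Europe`, because the triples share resources and so connect into one graph.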

Question 32

Q

What is the significance of the connected graph in RDF?

Answer

A

The connected graph allows for multiple relationships between data points, enabling more complex and meaningful data linkages across different datasets.

Question 33

Q

What is a major benefit of using centralized systems for RDF datasets?

Answer

A

Benefit: No communication overhead between different nodes, as all data storage and queries are processed on a single machine.

Question 34

Q

What limits the capabilities of centralized systems in handling RDF datasets?

Answer

A

Limitation: The system is restricted by the memory and computational capacity of the single node.

Question 35

Q

How do distributed systems improve over centralized systems for RDF datasets?

Answer

A

Improvement: Distributed systems offer larger memory and computational power by utilizing multiple machines.

Question 36

Q

What are the potential drawbacks of distributed systems when processing RDF data?

Answer

A

Drawback 1: Expensive communication between machines.
Drawback 2: Intermediate data shuffling during complex queries can degrade system performance.

Question 37

Q

What is the DBpedia project, and what does it aim to do?

Answer

A

DBpedia Project: It extracts the structured content of Wikipedia and makes it available in RDF. It allows users to semantically query properties and relationships and to link to related datasets.

Question 38

Q

How does the DBpedia project improve user experience in applications?

Answer

A

Improvement: Applications can exploit information from other datasets to enhance the user experience by linking related information in RDF triples.

Question 39

Q

Why is RDF Schema (RDFS) used in conjunction with RDF?

Answer

A

RDF Schema (RDFS) is used to define classes of resources in RDF, enabling the categorization of things into hierarchical classes, which RDF alone does not support.

Question 40

Q

What is a resource in RDFS, and how is it classified?

Answer

A

A resource in RDFS is an instance of a certain class, and each class can have subclasses with additional descriptions.
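
The class hierarchy idea can be sketched in a few lines. This is an illustrative model only: the class and instance names are hypothetical, and each class is assumed to have at most one direct superclass, which is a simplification made here and not a rule of RDFS.

```python
# Hypothetical hierarchy: TemperatureSensor is a Sensor, a Sensor is a Device.
SUBCLASS_OF = {
    "ex:TemperatureSensor": "ex:Sensor",
    "ex:Sensor": "ex:Device",
}

# Hypothetical instance data: each resource is an instance of one class.
INSTANCE_OF = {"ex:sensor42": "ex:TemperatureSensor"}


def superclasses(cls):
    """Walk up the hierarchy and list every ancestor class."""
    ancestors = []
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        ancestors.append(cls)
    return ancestors


def is_instance_of(resource, cls):
    """A resource belongs to its own class and to all of its superclasses."""
    direct = INSTANCE_OF.get(resource)
    if direct is None:
        return False
    return cls == direct or cls in superclasses(direct)
```

With this data, `ex:sensor42` counts as an instance of `ex:Device` even though that fact is never stated directly; it follows from the subclass chain.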

Question 41

Q

Does RDF Schema (RDFS) specify how applications should use the class descriptions?

Answer

A

No, RDFS does not specify how an application should use the descriptions of resources in the classes.

Question 42

Q

How does linked data facilitate data abstraction in IoT?

Answer

A

Linked data uses common identifiers like Internationalized Resource Identifiers (IRIs), which integrate common data structures from various IoT sensors, enhancing data abstraction.

Question 43

Q

What role do machines play in interpreting linked data in IoT?

Answer

A

Machines can interpret data descriptions by extracting the origin and attributes of the data and by understanding the relationships between the data and other related information.

Question 44

Q

What is the main purpose of the Internet of Things (IoT)?

Answer

A

The main purpose of IoT is to interpret the semantic data captured from various sources and sensors and transform it into actionable knowledge.

Question 45

Q

When are IoT data considered useless?

Answer

A

IoT data are considered useless if they cannot be understood or interpreted, as they must provide meaningful insights to be actionable.

Question 46

Q

What challenges arise from the heterogeneous nature of IoT data?

Answer

A

The heterogeneous nature of IoT data presents challenges in ensuring interoperability among IoT devices due to the support for different protocols and data formats.

Question 47

Q

How does the Semantic Web contribute to IoT?

Answer

A

The Semantic Web provides analytical tools and best practices that facilitate data reasoning, help satisfy interoperability requirements, and enable effective integration and analysis of different sources of IoT data.

Question 48

Q

What is the relationship between the Internet of Things and the Semantic Web?

Answer

A

The relationship between IoT and the Semantic Web results in global interoperability between devices, enabling the generation of new services through effective data integration and analysis.

Question 49

Q

What are the key open approaches developed by the Semantic Web community for data analytics?

Answer

A

The key open approaches include sharing and reusing open data through linked data, linked vocabularies, and linked services.

Question 50

Q

How are semantic IoT data stored and managed in the Semantic Web?

Answer

A

Semantic IoT data are stored and managed in RDF databases as RDF graphs.

Question 51

Q

What language is used for querying and reasoning over RDF graphs?

Answer

A

SPARQL is used for querying and reasoning over the stored RDF graphs.
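
A SPARQL query is built from triple patterns in which variables stand for the unknown parts. The sketch below imitates one such pattern with a toy matcher in plain Python. It is not a SPARQL engine, the `ex:` prefix and data are hypothetical, and `None` is used here as a stand-in for a variable.

```python
# The SPARQL query this sketch imitates (ex: is a hypothetical prefix):
#   PREFIX ex: <http://example.org/>
#   SELECT ?city WHERE { ?city ex:isCapitalOf ex:Germany . }

TRIPLES = [
    ("ex:Berlin", "ex:isCapitalOf", "ex:Germany"),
    ("ex:Paris", "ex:isCapitalOf", "ex:France"),
    ("ex:Berlin", "ex:locatedIn", "ex:Europe"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a variable."""
    return [
        (s, p, o)
        for s, p, o in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


# Bind ?city to the subject of every matching triple.
cities = [s for s, _, _ in match(TRIPLES, predicate="ex:isCapitalOf", obj="ex:Germany")]
```

Only the first triple fits the pattern, so `cities` holds the single binding `ex:Berlin`.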

Question 52

Q

What is the role of semantic technologies in data analytics?

Answer

A

Semantic technologies help derive meaning from collected data, transforming it into actionable information.

Question 53

Q

What are some well-known methods and technologies employed by the Semantic Web to process IoT data?

Answer

A

Some well-known methods include linking data, real-time and linked stream processing, logic-based approaches, machine learning, distributed semantic reasoning, and cross-domain recommender systems.

Question 54

Q

What advantage does linking data provide in the context of the Semantic Web?

Answer

A

Linking data allows for meaningful connections not just between documents, but also between machine-readable and interpretable datasets, enhancing data interoperability and insight extraction.

Question 55

Q

What extension has been added to SPARQL to handle stream sensor data?

Answer

A

An extension called linked stream data has been added to SPARQL to help handle stream sensor data.

Question 56

Q

What does linked stream data allow SPARQL to do?

Answer

A

Linked stream data allows SPARQL to enrich stream sensor data with linked open data, which can be freely used and distributed.

Question 57

Q

What mechanisms does the Semantic Web provide to ensure the consistency of IoT data?

Answer

A

The Semantic Web provides mechanisms to develop rules, check data consistency, and ensure that IoT data are logically valid.

Question 58

Q

What is the purpose of the Linked Edit Rules (LER) approach?

Answer

A

The Linked Edit Rules (LER) approach checks the consistency of data to ensure its validity, such as verifying that relative humidity values cannot be negative.
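
The humidity example amounts to a rule applied to each reading. The following sketch shows that idea in plain Python; it does not use or reproduce the actual LER tooling, and the field name `relative_humidity` and the sample readings are assumptions made for illustration.

```python
def non_negative_humidity(reading):
    """Rule from the card: relative humidity cannot be negative."""
    return reading["relative_humidity"] >= 0


def find_violations(readings, rules):
    """Return (reading, rule name) pairs for every rule a reading breaks."""
    return [
        (reading, rule.__name__)
        for reading in readings
        for rule in rules
        if not rule(reading)
    ]


readings = [
    {"sensor": "s1", "relative_humidity": 45.0},
    {"sensor": "s2", "relative_humidity": -3.0},  # invalid value
]
violations = find_violations(readings, [non_negative_humidity])
```

Only the reading from `s2` is reported, together with the name of the rule it breaks.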

Question 59

Q

How is logic-based reasoning utilized in the analysis of IoT data?

Answer

A

Logic-based reasoning is used to analyze simple sensor data, such as temperature or humidity, and is characterized by being fast and easy to implement.
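
Logic-based reasoning over simple readings can be as small as a handful of if-then rules. The sketch below is one such rule set; the thresholds and labels are arbitrary assumptions chosen for illustration, not values from the source.

```python
def classify(temperature_c, humidity_pct):
    """Derive a high-level label from raw temperature and humidity values."""
    # Rules are checked in order; the first one that applies wins.
    if temperature_c >= 30.0 and humidity_pct >= 70.0:
        return "hot and humid"
    if temperature_c >= 30.0:
        return "hot"
    if temperature_c <= 5.0:
        return "cold"
    return "comfortable"
```

For example, `classify(32.0, 80.0)` yields "hot and humid" and `classify(21.0, 40.0)` yields "comfortable". Rules like these are quick to write and evaluate, which is why the approach suits simple sensor data.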

Question 60

Q

When are machine learning techniques and data mining approaches applied in the context of IoT?

Answer

A

Machine learning techniques and data mining approaches are applied to reason about complex semantic IoT data (e.g., electrocardiography [ECG] signals) where logic-based reasoning alone is insufficient.

Question 61

Q

Where is time-sensitive data processed in many IoT network architectures?

Answer

A

Time-sensitive data are processed at the edge of the network to reduce latency and save bandwidth.

Question 62

Q

What new challenges are introduced by processing data at the edge of the network?

Answer

A

New challenges include the need for reasoning at every layer of the IoT data management and computational stack (i.e., cloud, fog, and edge) and the difficulty for smart nodes to understand and interpret data from heterogeneous IoT sensors.

Question 63

Q

How does distributed reasoning improve reasoning latency in large datasets?

Answer

A

Distributed reasoning can improve reasoning latency by allowing processing to occur at the sensors and edge devices, thus reducing the amount of data sent to centralized locations.

Question 64

Q

What are the advantages of distributed reasoning over centralized reasoning?

Answer

A

Distributed reasoning is advantageous when:
* Data are distributed both logically and physically.
* Communication costs are negligible compared to problem solution costs.
* There is collaboration between the system’s components to solve problems.

Answer 65

A

Distributed reasoning is beneficial when:
* Data are dynamic with ambiguous content.
* Data size exceeds the computational capacity of IoT devices.
* Sharing data and reasoning tasks can yield comprehensive intelligence.

Answer 66

A

Distributed reasoning can improve knowledge system performance by splitting large computational tasks into sub-tasks that can be solved more efficiently.

Answer 67

A

Traditional recommender systems focus on a single vertical and assist users in finding their topic of interest among vast amounts of data in a specific domain.

Answer 68

A

CDRS use data and knowledge gained from multiple source domains to provide recommendations in a target domain, assuming there is information overlap between items and users across different domains.

Answer 69

A

Cross-domain recommendation approaches are classified into two categories based on how knowledge is exploited:
* Knowledge linking approach: User preferences are merged, and recommendations from both domains are combined.
* Knowledge sharing approach: The source domain transfers its data to the target domain for producing recommendations.

Answer 70

A

Prefix.cc simplifies the RDF development process by looking up URI prefixes.

Answer 71

A

rdf-vocab is an open-source project used by RDF developers to look up and search for linked data vocabularies.

Answer 72

A

The W3C RDF Validator is an online service that checks and visualizes RDF documents.

Answer 73

A

Examples of data reasoners include:
* CEL Description Logic (DL)
* Euler
* FaCT++
* HermiT Reasoner
* Java Expert System Shell (JESS)
* Jena Eyeball (a command-line semantics validator).

Answer 74

A

Data reasoners can be classified based on their linkage and discovery mechanisms or their usability.

Answer 75

A

CEP is a set of techniques used to aggregate, process, and analyze large amounts of streaming data to generate real-time insights from those events as they happen, even before the data is stored in databases.

Answer 76

A

CEP generates real-time insights as the events happen, before storing the data in databases.

Answer 77

A

Insights are generated by searching for dependencies and complex patterns in the incoming raw data.

Answer 78

A

CEP helps businesses and organizations identify opportunities and threats, enabling systems and applications to respond in real time.

Answer 79

A

CEP identifies meaningful events by continuously processing raw data and finding correlations between other events before the data is stored in databases.

Answer 80

A

CEP searches for dependencies and complex patterns in the raw data to generate insights.

Answer 81

A

Processing data before storing it allows for real-time insights and enables faster responses to opportunities and threats.

Answer 82

A

CEP insights help systems and applications respond to opportunities and threats in real time.

Answer 83

A

They are used interchangeably because they rely on the same underlying technologies.

Answer 84

A

CEP is focused on searching for complex patterns and dependencies between different events to identify a particular event.

Answer 85

A

Stream processing focuses on aggregating data in time windows and responding to a single event, often using time series data.
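
Aggregating in time windows can be illustrated with a tumbling (non-overlapping) window. This is a minimal sketch: the function name, the choice of a mean as the aggregate, and the `(timestamp, value)` input format are assumptions, and a real stream processor would work incrementally instead of on a finished list.

```python
from collections import defaultdict


def tumbling_window_mean(readings, window_s):
    """Average (timestamp, value) readings per fixed-size time window.

    Returns a dict mapping each window's start time to the mean value.
    """
    buckets = defaultdict(list)
    for timestamp, value in readings:
        # Readings with the same window index fall into the same bucket.
        buckets[int(timestamp // window_s)].append(value)
    return {
        index * window_s: sum(values) / len(values)
        for index, values in sorted(buckets.items())
    }


readings = [(0, 20.0), (4, 22.0), (10, 30.0), (19, 32.0)]
means = tumbling_window_mean(readings, window_s=10)
```

With 10-second windows, the first two readings are averaged together and the last two are averaged together, giving one value per window.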

Answer 86

A

It collects and processes the images (data) captured by the camera and responds to that event by analyzing the time series data.

Answer 87

A

Apache Kafka is an open-source distributed event streaming platform used for streaming analytics, data integration, and mission-critical applications.

Answer 88

A

CEP is used to monitor business processes and resources, helping businesses identify opportunities and problems at early stages.

Answer 89

A

CEP is used in sensor networks to measure physical parameters, such as temperature, for predictive maintenance in industrial and manufacturing facilities.

Answer 90

A

CEP analyzes patterns from IoT devices to predict when equipment may need to be shut down or repaired.

Answer 91

A

CEP processes RFID data to help management optimize store layout and inventory tracking.

Answer 92

A

CEP is used to derive useful data about the stock market by analyzing real-time data streams and detecting complex patterns.

Answer 93

A

Some well-known tools include Hadoop/MapReduce, Amazon Kinesis Analytics, and Microsoft Azure Stream Analytics. LinkedIn uses Apache Samza, and Twitter uses Apache Storm.

Answer 94

A

CEP detects complex patterns from real-time data streams, transforming low-level data into high-level business information.

Answer 95

A

CEP is time-sensitive because it requires ultra-low latency (typically less than a few milliseconds) to handle real-time data effectively.

Answer 96

A

An example is a vehicle where ice sensors warn of a slippery road, the weather forecast shows a high chance of precipitation, and other sensors in the car indicate unusual conditions. CEP processes all these events to provide critical information to the driver, connected cars, and road units.
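
The vehicle scenario is a pattern over several event types arriving close together. The sketch below raises an alert when all required event types have been seen within a time window. The event names, the 10-second window, and the `(timestamp, kind)` format are assumptions for illustration; a production CEP engine would offer a much richer pattern language.

```python
# Hypothetical event types that together indicate a road hazard.
REQUIRED = {"ice_warning", "precipitation_forecast", "abnormal_handling"}


def detect_hazard(events, window_s=10.0):
    """Return timestamps at which every required event type has occurred
    within the last window_s seconds.

    events: iterable of (timestamp, kind) pairs, assumed sorted by timestamp.
    """
    last_seen = {}
    alerts = []
    for timestamp, kind in events:
        if kind not in REQUIRED:
            continue  # unrelated events do not affect this pattern
        last_seen[kind] = timestamp
        all_seen = len(last_seen) == len(REQUIRED)
        if all_seen and timestamp - min(last_seen.values()) <= window_s:
            alerts.append(timestamp)
    return alerts


events = [
    (0.0, "ice_warning"),
    (2.0, "precipitation_forecast"),
    (3.0, "speed_update"),
    (5.0, "abnormal_handling"),
    (30.0, "ice_warning"),
]
alerts = detect_hazard(events)
```

The alert fires at the moment the third required event arrives. The later ice warning on its own does not fire again, because the other two events are by then outside the window.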

Answer 97

A

Ultra-low latency is crucial in CEP because it enables the system to process and respond to real-time events almost instantly, which is necessary in scenarios like connected vehicles.

Answer 98

A

CEP handles low-level data streams that are transformed into meaningful, high-level business information.

Answer 99

A

CEP processes data from various sensors (e.g., ice sensors, brakes, steering) to provide crucial information to drivers, other vehicles, and road infrastructure in real time.

Answer 100

A

Big data refers to the large volumes of structured and unstructured data produced by billions of connected IoT devices and sensors. This data is often too complex to be processed using conventional tools.

Answer 101

A

The 5Vs are:
* Volume: Size of the data.
* Velocity: Speed at which data is generated.
* Variety: Types of data (structured, unstructured, semi-structured).
* Veracity: Trustworthiness and accuracy of the data.
* Value: Usefulness of the data in gaining insights.

Answer 102

A

Apache Hadoop is an open-source software utility that allows for distributed storage and processing of large datasets using the MapReduce programming model.

Answer 103

A

The map function breaks down a dataset into key-value pairs and transforms it into a structured set of data. It operates on one key-value pair at a time.

Answer 104

A

The reduce function combines the output of the map function into smaller sets of data tuples. It groups values associated with the same key and outputs reduced key-value pairs.
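
The map and reduce steps can be sketched on a single machine. The example below finds the maximum reading per sensor. The input format (`"sensor_id,value"` lines) and the function names are assumptions, and the explicit `shuffle` step stands in for the grouping that a framework such as Hadoop performs between the two phases across many machines.

```python
from collections import defaultdict


def map_reading(line):
    """Map: turn one raw input line into a (key, value) pair."""
    sensor_id, value = line.split(",")
    yield sensor_id.strip(), float(value)


def shuffle(pairs):
    """Group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_max(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, max(values)


def run(lines):
    pairs = [pair for line in lines for pair in map_reading(line)]
    return dict(reduce_max(key, values) for key, values in shuffle(pairs).items())


result = run(["s1,20.5", "s2,18.0", "s1,23.1"])
```

Each map call looks at one line in isolation, which is what lets the work be spread over many machines; the reduce step then sees all values for a key at once.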

Answer 105

A

In a shopping scenario, the grocery list is split into smaller lists (e.g., bakery, seafood), and multiple shopping carts collect items in parallel. They all meet at the cashier, speeding up the process compared to using a single cart.

Answer 106

A

Traditional relational databases are limited in handling the complexity and volume of unstructured data typically found in big data because they rely on tabular relations and SQL for querying structured data.

Answer 107

A

NoSQL databases are non-relational, highly scalable databases designed to store and retrieve unstructured data. They don’t use tabular relations, and they support various models such as key-value, document, column-oriented, and graph models.

Answer 108

A

Key-value store databases store data as key-value pairs, with the key being a unique identifier and the value a large data field. They offer high performance but do not support complex queries, as only keys can be queried, not the values.
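
A toy in-memory model makes the key-only access pattern concrete. The class and key naming below are hypothetical and ignore persistence, distribution, and concurrency.

```python
class KeyValueStore:
    """Toy in-memory key-value store; values are treated as opaque blobs."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        # Lookup by key is the only supported query; returns None if absent.
        return self._data.get(key)


store = KeyValueStore()
store.put("sensor:42:latest", b'{"temp": 21.5}')
latest = store.get("sensor:42:latest")
```

The store never inspects the stored bytes, so a question like "which sensors report more than 20 degrees" cannot be answered without reading and decoding every value.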

Answer 109

A

Column-oriented databases store data in columns instead of rows. They use a key space that contains column families, each having rows and columns.
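
The nesting of key space, column family, rows, and columns can be pictured with nested dictionaries. This is only a conceptual sketch with hypothetical names; real column-oriented databases lay data out on disk by column and are not implemented this way.

```python
# key space -> column family -> row key -> {column name: value}
keyspace = {
    "readings": {  # column family
        "sensor-1": {"temp": 21.5, "humidity": 40.0},
        "sensor-2": {"temp": 19.0},  # this row has no humidity column
    },
}


def read_column(keyspace, family, column):
    """Collect one column across all rows that actually store it."""
    return {
        row_key: columns[column]
        for row_key, columns in keyspace[family].items()
        if column in columns
    }


temps = read_column(keyspace, "readings", "temp")
humidity = read_column(keyspace, "readings", "humidity")
```

Reading one column touches only that column's values, and rows that lack the column are simply skipped.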

Answer 110

A

Document-oriented databases store data as JSON documents. They allow fast querying and are flexible, making them suitable for use cases such as IoT applications in healthcare.
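
The sketch below treats a list of parsed JSON documents as a tiny document collection and filters it by a field. The patient records and field names are invented for illustration, and the `find` helper is hypothetical, not the API of any particular database.

```python
import json

# Documents need not share the same fields: only the first has "device".
raw_documents = [
    '{"patient": "p1", "heart_rate": 72, "device": {"type": "wearable"}}',
    '{"patient": "p2", "heart_rate": 118}',
]
documents = [json.loads(raw) for raw in raw_documents]


def find(documents, predicate):
    """Return every document for which predicate(document) is true."""
    return [doc for doc in documents if predicate(doc)]


elevated = find(documents, lambda doc: doc.get("heart_rate", 0) > 100)
```

Unlike a key-value store, the query here looks inside the stored values, and the documents can differ in structure without a schema change.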

Answer 111

A

Graph-oriented databases are used to store graph-based data, such as social network information. They focus on the relationships between data points, storing the data as originally produced without a predefined structure.
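
A relationship-centred query such as "friends of friends" shows why the connections matter as much as the records. The social graph and names below are hypothetical, and the adjacency dictionary is a stand-in for what a graph database stores natively.

```python
# Hypothetical "follows" relationships in a small social network.
follows = {
    "alice": {"bob", "carol"},
    "bob": {"dave"},
    "carol": {"dave", "erin"},
}


def friends_of_friends(graph, person):
    """People two hops away who are not already direct connections."""
    direct = graph.get(person, set())
    second = set()
    for friend in direct:
        second |= graph.get(friend, set())
    return second - direct - {person}


suggestions = friends_of_friends(follows, "alice")
```

For `alice`, the result contains `dave` and `erin`, found purely by walking relationships.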

Answer 112

A

The main benefit of the MapReduce model is its ability to scale data processing across multiple machines, making it efficient for handling large datasets in distributed environments.

Answer 113

A

Veracity refers to the uncertainty, accuracy, and trustworthiness of the data in big data analytics.

Answer 114

A

Value refers to the ability of data to provide useful insights or information that can be acted upon.

Answer 115

A

Tools like Apache Hadoop are needed to process big data using distributed computing over several machines and servers.

Answer 116

A

The Map stage breaks down a dataset into key-value pairs, converting unstructured data into structured data.

Answer 117

A

IoT devices produce large volumes of structured and unstructured data, often exceeding the processing capabilities of traditional databases.

Answer 118

A

Relational databases are designed for structured data with predefined relationships and cannot handle the complexity and volume of unstructured big data.

Answer 119

A

NoSQL databases are non-relational databases that are highly scalable and can store unstructured data. They are important because they can handle the complexities of big data without requiring tabular relations.

Answer 120

A

Key-value store: Stores data as key-value pairs.
Column-oriented: Stores data in columns rather than rows.
Document-oriented: Stores data as JSON documents.
Graph-oriented: Stores data in graph format, focusing on relationships.

Answer 121

A

Key-value store databases store data as key-value pairs and offer high performance. However, they do not support querying or searching values, only the keys.

Answer 122

A

Column-oriented databases store sparse tabular data in columns instead of rows, organized by key spaces that contain column families, rows, and columns.

Answer 123

A

A document-oriented database stores data as JSON documents. It provides flexibility and fast queries, making it ideal for IoT applications like healthcare that need real-time data access.

Answer 124

A

Graph-oriented databases are used to store graph-based data, such as social networks, where relationships between data are as important as the data itself.

Answer 125

A

Traditional tools cannot handle the massive, heterogeneous data produced by IoT devices, which requires special tools and techniques like semantic technologies and NoSQL databases.

Answer 126

A

Semantic technologies, such as linking data, machine learning, distributed semantic reasoning, and cross-domain recommender systems, help derive meaning from the large amounts of data collected from IoT devices.

Answer 127

A

Complex event processing is a set of techniques used to aggregate, process, and analyze large amounts of streaming data to provide real-time insights as events occur.

Answer 128

A

CEP helps businesses uncover patterns, identify opportunities, and detect potential threats in their early stages by analyzing real-time data streams.

Answer 129

A

Relational databases struggle to handle complex, semi-structured, and unstructured data typical in IoT environments, making them unsuitable for big data storage.

Answer 130

A

NoSQL databases store IoT data in documents, key-value pairs, or graph models and rely on the MapReduce programming model for processing large datasets.