Theory Flashcards

1
Q

What are the four Vs of big data and what do they stand for?

A

Volume: the sheer scale of the data being produced and stored.

Velocity: the rate at which data is generated and must be processed.

Variety: the heterogeneity of data formats, models, and sources.

Veracity: the trustworthiness and quality of the data.

2
Q

What should an information system be able to govern?

A

We should be able to govern:

• data abundance
• data and user dynamicity and mobility
• heterogeneity and data semantics
• incompleteness/uncertainty
• interaction with the real world

And make sense of all this data: extract useful knowledge.

3
Q

What are some of the challenges of data management?

A

What has changed in recent years:

• Technological breakthroughs in machine learning (ML) and artificial intelligence (AI):
o Data science as a discipline that combines elements of data cleaning and transformation, statistical analysis, data visualization, and ML techniques
o Automatic generation of text (e.g., ChatGPT and similar systems)

• Cloud computing is mainstream: the industry offers on-demand, elastic storage and computation services.

• For cloud-based systems, the industry is converging on data lakes as novel systems for data integration and on modern data warehousing query engines.

• Society has become more concerned about the state of data governance, focused on data usage, e.g., quality-aware, privacy-aware, ethical, and fair use of data.

Extraction of (synthetic and useful) knowledge: build environments that mimic the progressive inspecting, observing, and surveying activity with which users make decisions.

Massive data analysis and processing: a process of inspecting, cleaning, transforming, and modeling data with the goal of highlighting useful information, suggesting conclusions, and supporting decision making.

Massive data integration: people and enterprises need to integrate data and the systems that handle those data: relational DBMSs and their extensions, legacy data and legacy DBMSs, sensors, and user-generated content produce structured or unstructured data.

Data warehousing: "a single, complete and consistent store of data obtained from a variety of different sources made available to end users, so that they can understand and use it in a business context." [Barry Devlin]

Massive use of machine learning, which requires clean and ethical input data to ensure that the outputs are reasonable and ethical.

Use of reasoning services, which allow various forms of deduction and inference.

4
Q

What is the relation between data, information, knowledge and wisdom?

A
5
Q

What is the relation between the data analysis and the rest of the application?

A

The actual implementation of the data analysis algorithm (ML, statistics, data mining, …) is usually less than 5% of the lines of code in a real, non-trivial application.

The main effort (i.e., the other 95% of LOC) is spent on:

• Data cleaning & annotation
• Data extraction, transformation, loading
• Data integration, pruning, & possibly new cleaning
• Parameter tuning
• Model training & deployment

6
Q

What are the parts in the data pipeline?

A
7
Q

What is the definition of data integration?

A

Combining data coming from different data sources, providing the user with a unified view of the data.

Detecting correspondences between similar concepts that come from different sources, and resolving conflicts.

8
Q

How do the four Vs relate to data integration?

A

Velocity: As a direct consequence of the rate at which data is being collected and continuously made available, many of the data sources are very dynamic.

Variety: Data sources (even in the same domain) are extremely heterogeneous, both at the schema level, regarding how they structure their data, and at the instance level, regarding how they describe the same real-world entity, exhibiting considerable variety even for substantially similar entities.

Veracity: Data sources (even in the same domain) are of widely differing qualities, with significant differences in the coverage, accuracy, and timeliness of the data provided. This is consistent with the observation that "1 in 3 business leaders do not trust the information they use to make decisions."

Volume: Not only can each data source contain a huge volume of data, but the number of data sources has grown to be in the millions.

9
Q

What are the variety and veracity dimensions?

A

The Variety dimension:
People and enterprises need to integrate data and the systems that handle those data: relational DBMSs and their extensions, legacy data and legacy DBMSs, sensors, and user-generated content produce heterogeneous, structured or unstructured data.

The Veracity dimension:
Data quality is the most general and most widely used term, and covers a number of quality aspects besides veracity:
• Completeness
• Validity
• Consistency
• Timeliness
• Accuracy
• Ethics and fairness (a new entry)

10
Q

What is data quality and what do its terms represent?

A
11
Q

What is the origin of heterogeneity?

A

Heterogeneity originates from the autonomy of the sources:

• Design (representation) autonomy
• Communication (querying) autonomy
• Execution (algorithmic) autonomy

This autonomy gives rise to several kinds of heterogeneity:

  1. Different platforms -> Technological heterogeneity
  2. Different data models of the participating datasets -> Model heterogeneity
  3. Different query languages -> Language heterogeneity
  4. Different data schemas and different conceptual representations in previously developed DBs -> Schema (semantic) heterogeneity
  5. Different values for the same information (due to errors or to different knowledge) -> Instance (semantic) heterogeneity
12
Q

What are the steps in data integration?

A
  1. Schema reconciliation: mapping the data structure (if it exists!)
  2. Record linkage (aka entity resolution): data matching based on the same content
  3. Data fusion: reconciliation of non-identical content
13
Q

What are the kinds of data integration?

A

Use a materialized database (data are merged into a new database).

Use a virtual, non-materialized database (data remain at the sources).

14
Q

What is virtual integration? What is materialized integration? What are their strengths and weaknesses?

A

Virtual integration and materialized integration are two common approaches to integrating data from multiple sources in the context of data integration systems, such as data warehouses or federated databases. Here’s a breakdown of each:

  1. Virtual Integration

Virtual integration, also known as federated integration, does not physically consolidate data into a single repository. Instead, it provides a unified view of data by querying the underlying data sources in real time or near-real time.

Strengths:

•	Real-time Data Access: Allows access to the most up-to-date data directly from the source.
•	Lower Storage Costs: No need to replicate and store data in a central location.
•	Fast Deployment: Avoids the need to design and implement a full data warehouse.
•	Flexibility: Can handle dynamic changes in the structure of data sources more easily.

Weaknesses:

•	Performance Challenges: Querying multiple data sources in real-time can lead to latency and bottlenecks.
•	Complex Query Processing: Requires sophisticated query engines to translate and optimize queries across heterogeneous data sources.
•	Source Dependency: Heavily reliant on the availability and performance of underlying data sources.
•	Limited Historical Data: Cannot easily provide historical snapshots unless the source systems retain them.

  2. Materialized Integration

Materialized integration involves consolidating data from multiple sources into a single repository, such as a data warehouse. Data is extracted, transformed, and loaded (ETL) into this central system, where it can be queried.

Strengths:

•	Improved Performance: Queries run faster since data is already consolidated and optimized for analysis.
•	Availability and Independence: Does not rely on source systems being online for querying.
•	Data Consistency: Ensures a consistent and unified version of data for analysis and reporting.
•	Support for Historical Data: Facilitates maintaining historical snapshots and trends.

Weaknesses:

•	Data Latency: Consolidated data may not be as current as data in the source systems due to batch processing.
•	High Storage Costs: Requires significant storage capacity for the centralized repository.
•	Longer Setup Time: Building and maintaining a data warehouse can be time-consuming and complex.
•	Data Duplication: Data is replicated, which may introduce redundancy and require additional synchronization efforts.

When to Use Each Approach

•	Virtual Integration: Best suited for scenarios requiring real-time or near-real-time access to data, especially when data volumes are not very high, or for exploratory analysis across diverse data sources.
•	Materialized Integration: Ideal for large-scale analytics, historical trend analysis, and environments where performance and data availability are critical.


15
Q

Describe a general framework for data integration. What is the function of the middleware in it? What is the function of the data layer? What does the right most part of the picture represent?

A

A general framework for data integration consists of the application, the middleware (where the translations happen) and the different data layers

16
Q

Explain the example given.

A

When a query is submitted, the integration system has to decompose it into queries against the component datasets.

It first determines which parts of the query refer to which dataset:
- which parts apply to data from a single dataset, and
- which parts apply to data from different datasets.

The latter can only be evaluated over the integrated data (view), whereas the former can be evaluated within the component datasets.
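
A hedged illustration of this decomposition, with all table and attribute names invented for the example:

SELECT e.name, d.budget
FROM Employee e JOIN Dept d ON e.deptID = d.ID   -- Employee lives in source S1, Dept in source S2
WHERE e.salary > 50000 AND d.city = 'Milano';

-- The selection on e.salary refers only to S1 and the selection on d.city only to S2,
-- so both are pushed down to the component datasets; the join spans both sources
-- and must be evaluated by the integration system over the two partial results.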

17
Q

Why do we use data integration when designing a unique DB?

A

Each area of the company will ask the designer to design their (part of the) database (DB).

However, much of this data is common to several areas.

If the global company DB were just the collection of these partial DBs, there would be many redundancies (wasted storage).

Worst of all, when one copy of duplicated data is updated, the other copies may remain unchanged, leading to inconsistencies.

18
Q

What does a data integration system for a multidatabase do?

A

We must build a system that:
• Supports access to different data sources
• “Knows” the contents of these data sources
• Integrates the different data sources by means of a unifying, global schema
• Receives queries expressed in the language of the global schema
• Distributes “rewritten” queries to the sources
• Combines the answers received from the sources to build the final answer

19
Q

What does the global schema provide?

A

The global schema provides a reconciled, integrated, and virtual view of the data sources.

20
Q

What are the steps taken to design a data integration system for a multidatabase?

A
  1. Source schema identification (when present)
  2. Source schema reverse engineering (data source conceptual schemata)
  3. Conceptual schemata integration and restructuring: related concept identification, conflict analysis and resolution
  4. Conceptual to logical translation (of the obtained global conceptual schema)
  5. Mapping between the global logical schema and the single schemata (logical view definition)
  6. After integration: query-answering through data views
21
Q

What are the possible conflicts?

A

Name conflicts: we could have homonyms (different attributes with the same name) and synonyms (the same attribute with different names).

Type conflicts: within a single attribute, e.g., gender could be represented by strings, numbers, or chars; within an entity, different abstractions of the same real-world concept could have different sets of attributes.

Data semantics: different currencies, different measurement systems, different granularities.

Structure conflicts: the same concept modeled with different constructs, e.g., as an attribute in one schema and as an entity in another.

Dependency or cardinality conflicts: the same relationship modeled with different cardinalities in different schemas, e.g., 1:N in one and M:N in another.

Key conflicts: the same object represented with different keys.
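
As a hedged illustration of resolving a type conflict (the table, attributes, and encodings are invented for the example), a view can normalize different representations to a single global one:

CREATE VIEW NormalizedPerson AS
SELECT code, name,
       CASE WHEN gender IN ('M', 'm', '1') THEN 'M'
            WHEN gender IN ('F', 'f', '2') THEN 'F'
       END AS gender                -- maps the source's mixed encodings to one convention
FROM Source1.Person;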

22
Q

Given the schemas:

Poli Robots
Report(ID, datetime, #ofFaults)
ReportFault(reportID, faultID)
Fault(ID, description, solution, responsible)

Uni Robots
Message(robotID, datetime, errorCode)
Error(code, description, solution, personInCharge, urgency)

Go through the first three steps in the data integration process and rationalize the decision.

A
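
One possible walkthrough, hedged since the official solution is not given here:

  1. Source schema identification: both sources are relational; Poli Robots exposes Report, ReportFault, and Fault, while Uni Robots exposes Message and Error.
  2. Source schema reverse engineering: in Poli Robots a report can group several faults (an N:M relationship realized by ReportFault); in Uni Robots each message signals exactly one error (a reference from Message to Error).
  3. Conceptual integration and restructuring: Fault and Error are synonyms for the same concept, as are responsible and personInCharge; urgency appears only in Uni Robots and can be kept as an optional attribute of the global concept; Report and Message both represent robot notifications, with a cardinality conflict (N:M vs. 1:N) to resolve in the global schema.
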
23
Q

Given the schemas:

Poli Robots
Report(ID, datetime, #ofFaults)
ReportFault(reportID, faultID)
Fault(ID, description, solution, responsible)

Uni Robots
Message(robotID, datetime, errorCode)
Error(code, description, solution, personInCharge, urgency)

Create a command to create a view for the result global scheme.

A
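
A minimal sketch of such a view, assuming a global concept GlobalError that merges Fault and Error (these global names are one possible choice, not the official solution):

CREATE VIEW GlobalError(code, description, solution, personInCharge, urgency) AS
SELECT ID, description, solution, responsible, NULL   -- Poli Robots has no urgency: pad with NULL (outer union)
FROM PoliRobots.Fault
UNION
SELECT code, description, solution, personInCharge, urgency
FROM UniRobots.Error;
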
24
Q

What are the techniques for mapping the global logical schema and the single schemata? Explain them.

A

Mapping the global logical schema (GLS) to the single schemata (local data sources) is a key challenge in data integration systems. These mappings are essential to create a unified view of distributed and heterogeneous data. The two primary techniques for such mappings are Global-as-View (GAV) and Local-as-View (LAV), along with a hybrid approach known as Both-as-View (BAV). Here’s a detailed explanation:

  1. Global-as-View (GAV)

In GAV, the global schema is defined as a set of views over the local schemata. Each global concept is explicitly mapped to queries on the local data sources.

Process:

•	The global schema is predefined based on the structure and semantics of the local data sources.
•	For each concept in the global schema, queries are written that extract relevant data from the local schemata.

Strengths:

•	Simple Query Translation: Querying the global schema is straightforward since the mappings are directly tied to the local data sources.
•	Efficient Query Execution: Optimized for specific queries as the mappings are explicitly defined.

Weaknesses:

•	Low Flexibility: Changes in the local sources (e.g., schema updates) require updates to the global schema mappings.
•	Scalability Issues: Defining and maintaining mappings for large, complex systems can become cumbersome.

  2. Local-as-View (LAV)

In LAV, each local schema is defined as a view over the global schema. Here, the focus is on describing how the data in each local source contributes to the global schema.

Process:

•	The global schema is designed independently of the local sources, based on an abstract understanding of the data domain.
•	For each local schema, mappings specify how its data fits into the global schema.

Strengths:

•	High Flexibility: Changes to local sources do not necessarily require modifications to the global schema.
•	Scalable Design: Easier to add new data sources since their mappings can be written independently of existing sources.

Weaknesses:

•	Complex Query Translation: Translating a query on the global schema to the local schemata is computationally expensive due to the need for query rewriting.
•	Performance Overhead: Queries may involve extensive reasoning over multiple local views, impacting execution efficiency.

  3. Both-as-View (BAV)

BAV is a hybrid approach that combines elements of GAV and LAV. Both the global schema and the local schemata are defined as views over an intermediary conceptual schema.

Process:

•	Introduces an intermediate schema that acts as a bridge between the global and local schemata.
•	Both global and local schemas are described in terms of this intermediary schema, reducing the tight coupling between them.

Strengths:

•	Modular and Flexible: Decouples the global and local schemas, making the system adaptable to changes.
•	Balance in Complexity: Mitigates the weaknesses of both GAV and LAV.

Weaknesses:

•	Increased Design Overhead: Requires additional effort to design and maintain the intermediary schema.
•	Complexity in Implementation: May increase the computational cost of query translation.

Choosing the Right Technique

•	GAV: Suitable for systems where the local sources are stable and well-understood, and query execution needs to be efficient.
•	LAV: Better for dynamic environments where local sources frequently change or new sources are added.
•	BAV: Ideal for large-scale, complex integration scenarios requiring modularity and adaptability.
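
A hedged sketch of the two mapping styles, with all names invented for the example:

-- GAV: each global relation is defined as a query over the sources
CREATE VIEW GlobalPerson(name, city) AS
SELECT name, city FROM S1.Employee
UNION
SELECT fullName, town FROM S2.Citizen;

-- LAV: each source is instead described as a view over the global schema,
-- a logical assertion used for query rewriting rather than executable SQL:
--   S1.Employee(name, city) ⊆ SELECT name, city FROM GlobalPerson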


25
Q

What are the situations in which is better to use GAV? What about LAV?

A

In the context of data integration, Global-As-View (GAV) and Local-As-View (LAV) are two main approaches for defining the mappings between a global schema and local data sources. The choice between these approaches depends on the specific requirements and constraints of the system. Here’s when to prefer each:

Global-As-View (GAV)

In GAV, the global schema is defined as a set of views over the local schemas (data sources). Each global schema concept is directly expressed as a query over the local schemas.

When to use GAV:
1. Static and stable data sources:
• Use GAV when the structure and content of the data sources are relatively static and unlikely to change frequently.
• Changes in the data sources require updates to the global schema mappings, which can be cumbersome in dynamic environments.
2. Few data sources:
• GAV is manageable when there are relatively few data sources, as each global schema element needs explicit mapping to one or more local schemas.
3. Simple integration logic:
• When the relationships between the data sources and the global schema are straightforward, GAV provides clarity and efficiency.
4. Performance optimization:
• Queries are typically more efficient in GAV because the integration system can rely on the pre-defined mappings to directly transform global queries into queries over the sources.

Advantages of GAV:
• Easier query processing: The mappings are explicitly defined, so query rewriting is straightforward.
• Performance-focused: Optimized for environments where query execution speed is critical.

Local-As-View (LAV)

In LAV, the local schemas are defined as views over the global schema. This means that each local data source is described in terms of the global schema.

When to use LAV:
1. Dynamic and evolving data sources:
• Use LAV when the structure or number of data sources may change frequently, as the global schema remains stable, and only the views for new or updated sources need to be modified.
2. Many data sources:
• LAV is better suited for systems with numerous or heterogeneous data sources because it abstracts the integration logic from the data sources.
3. Complex integration requirements:
• When the relationships between data sources and the global schema are complex or require significant abstraction, LAV offers more flexibility.
4. Exploratory data integration:
• In environments where the goal is to explore or experiment with different ways of integrating data, LAV’s flexibility is advantageous.

Advantages of LAV:
• Scalability: New data sources can be integrated without modifying the global schema.
• Flexibility: Easier to handle heterogeneous or complex data sources.

Comparison and Trade-offs:

Aspect                 | GAV                            | LAV
Query processing       | Easier and faster              | More complex (requires reasoning)
Flexibility            | Limited                        | High
Scalability            | Not suitable for many sources  | Handles many sources well
Maintenance            | High when sources change       | Easier to adapt to source changes
Complexity of mappings | Simpler                        | Can be more abstract

Hybrid Approach:

Sometimes, systems adopt a combination of GAV and LAV (GLAV) to leverage the advantages of both. This approach defines mappings that are neither strictly GAV nor strictly LAV, providing a balance between flexibility and query efficiency.


26
Q

Given the schemas:

Source 1:
Product(code, name, description, warnings, notes, costID)
Category(ID, name, description)
Version(productCode, versionCode, size, color, name, description, stock, price)

Source 2:
Product(code, name, size, color, description, type, price, quantity)
Type(typeCode, name, description)

Implement the data integration using GAV

A
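
A hedged GAV sketch for these sources. The global schema below is one possible design; Category and Type are omitted because the link between Product and Category in Source 1 is not shown in the schemas:

CREATE VIEW GlobalProduct(code, name, size, color, description, price, stock) AS
SELECT V.productCode, P.name, V.size, V.color, P.description, V.price, V.stock
FROM Source1.Product P JOIN Source1.Version V ON V.productCode = P.code   -- one tuple per sellable version
UNION
SELECT P.code, P.name, P.size, P.color, P.description, P.price, P.quantity
FROM Source2.Product P;                                                   -- Source 2 stores versions inline

Note that in Source 1 a version is really identified by (productCode, versionCode); a fuller design would expose both attributes in the global key.
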
27
Q

X

A
28
Q

When can we say the source is incomplete?

A
29
Q

What are the definitions for sound, exact, and complete mappings?

A
30
Q

What are the operations most used by GAV systems to combine the data coming from the sources?

A
31
Q

When is union allowed? What about outer union? What is the result of the operation?

A
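
A hedged sketch with invented names: plain union requires union-compatible relations (the same attributes with compatible types); outer union also accepts relations with different schemas, padding each side's missing attributes with NULLs, so its result ranges over the union of the attributes:

-- R1(id, name) and R2(id, name, urgency) are not union-compatible,
-- so an outer union pads R1's missing urgency with NULL
SELECT id, name, NULL AS urgency FROM R1
UNION
SELECT id, name, urgency FROM R2;
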
32
Q

What happens when joining both R1 and R2 in the operations?

A
33
Q

What does the generalization operation do?

A
34
Q

What is record linkage? What is its relation with data fusion?

A
35
Q

When do we perform record linkage?

A
36
Q

What are the errors that might lead to different values for the same instance?

A
37
Q

What are the methods to overcome the duplication problems?

A
38
Q

What are the types of similarities used for determining if 2 values are the same?

A
39
Q

What is the difference between similarity and distance?

A
40
Q

What is the distance between two strings?

A
41
Q

What is the edit distance? What is the Jaccard measure?

A
42
Q

Analyze the following words using the method:

Politecnico di Milano vs Politecnico Milano

And

Pino vs Pin

A
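
A hedged worked example, assuming Levenshtein (edit) distance and, for Jaccard, word tokens in the first pair and characters in the second:

"Politecnico di Milano" vs "Politecnico Milano":
edit distance = 3 (delete "d", "i", and one space);
Jaccard = |{Politecnico, Milano}| / |{Politecnico, di, Milano}| = 2/3 ≈ 0.67.

"Pino" vs "Pin":
edit distance = 1 (delete the trailing "o");
Jaccard = |{P, i, n}| / |{P, i, n, o}| = 3/4 = 0.75.
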
43
Q

What are the phonetic similarity measures?

A
44
Q

Why do we need methods to apply the similarities and distance evaluations?

A
45
Q

What are the types of record matching?

A
46
Q

Describe the rule-base matching technique from record matching. How can we allow the system to learn the matching rules?

A
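
A hedged sketch of a hand-written matching rule (similarity is an assumed user-defined function; the attributes and the 0.9 threshold are invented):

-- declare a match when names are very similar and the zip code agrees
SELECT a.id AS id1, b.id AS id2
FROM CustomersA a, CustomersB b
WHERE similarity(a.name, b.name) > 0.9
  AND a.zip = b.zip;

Instead of hand-writing such thresholds and attribute combinations, they can be learned from labeled examples of matching and non-matching record pairs.
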
47
Q

Describe the probabilistic distribution technique for record matching.

A
48
Q

How do we design the resolution function?

A
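
A hedged sketch of a resolution function applied during fusion (the per-attribute choices are invented examples):

-- one possible resolution over already-matched records:
-- average conflicting prices, pick a description deterministically
SELECT code,
       AVG(price) AS price,              -- numeric conflict: take the mean
       MAX(description) AS description   -- string conflict: pick the lexicographically largest
FROM MatchedProducts
GROUP BY code;
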
49
Q

What are the local schemata? What are the export schemata?

A
50
Q

When do we unify the model of the data?

A
51
Q

What are wrappers? What do they do?

A
52
Q

What could be the basis for the semistructured data?

A
53
Q

What are the parts of a mediator? What are their responsibilities?

A
54
Q

The mediator has to know the semantics of the domain of the services. True or false?

A
55
Q

What are the mediator characteristics?

A
57
Q

What happens if the data model changes?

A
58
Q

What is the data guide?

A
59
Q

When can we use automatic wrapper generators?

A
60
Q

What is a wrapper generator? What does it do?

A
61
Q

Describe the two techniques GAV and LAV for the data integration in the case of existence of a Global Schema, and underline the main pros and cons of the two approaches.

A

In the context of data integration with a Global Schema, two common techniques are Global-As-View (GAV) and Local-As-View (LAV). Both approaches aim to integrate data from different sources into a unified view but differ in how they relate the Global Schema to the local data sources.

  1. Global-As-View (GAV)

In the GAV approach, the global schema is expressed as a set of queries over the local schemas. Each concept in the global schema is directly mapped to a specific query over one or more data sources. Essentially, the global schema is defined as views over the underlying data sources.

Pros:

•	Simplicity: Since the global schema is defined directly as views over the data sources, query processing is straightforward. The system can translate queries posed to the global schema directly into queries over the local sources.
•	Efficient Query Execution: As the mappings are explicit, the system knows exactly how to retrieve data from local sources, which can make query processing more efficient.

Cons:

•	Limited Flexibility: Adding new data sources can be challenging because each new source may require redefining or adjusting the global schema to incorporate the new information.
•	Less Adaptable to Change: If the structure of a local data source changes, the global schema must be updated, making maintenance more difficult.

  2. Local-As-View (LAV)

In the LAV approach, the local data sources are described as views over the global schema. Instead of defining the global schema in terms of the sources, the data sources are described as how they contribute to the global schema. The global schema remains fixed, while each source is mapped as a view over it.

Pros:

•	Flexibility: Adding new data sources is easier in LAV because it doesn’t require changes to the global schema. Instead, the new data sources are simply mapped as new views over the existing schema.
•	Scalability: This approach is more scalable in heterogeneous environments, where data sources may change frequently.

Cons:

•	Complex Query Processing: Since the mappings are indirect, answering queries over the global schema can be more complex. The system must figure out how to combine data from multiple sources, which can lead to less efficient query execution.
•	Higher Overhead: Query rewriting and optimization in LAV is more complex, often requiring advanced algorithms to ensure that queries can be efficiently answered using the available source views.

Summary of Pros and Cons:

Aspect           | GAV                                              | LAV
Simplicity       | Simple query processing                          | More complex query processing
Flexibility      | Less flexible, harder to add new sources         | More flexible, easier to integrate new sources
Query efficiency | Efficient, direct query execution                | Potentially less efficient due to query rewriting
Maintenance      | Changes in sources require global schema update  | Changes in sources are easier to manage

In conclusion, GAV is ideal for environments with stable data sources and a well-defined global schema, while LAV is better suited for dynamic environments where data sources may evolve over time.