Theory Flashcards
What are the four Vs of big data and what do they stand for?
Volume: the sheer scale of data to be stored and processed.
Velocity: the speed at which data is generated and must be collected and analyzed.
Variety: the heterogeneity of data formats and sources (structured, semi-structured, unstructured).
Veracity: the uncertainty and trustworthiness (quality) of the data.
What should an information system be able to govern?
We should be able to govern:
o data abundance
o data and user dynamicity and mobility
o heterogeneity and data semantics
o incompleteness/uncertainty
o interaction with the real-world
And make sense of all this data: Extract useful knowledge
What are some of the challenges of data management?
What has changed in recent years:
• Technological breakthroughs in machine learning (ML) and artificial intelligence (AI):
o Data science as a discipline that combines elements of data cleaning and transformation, statistical analysis, data visualization, and ML techniques
o Automatic generation of text (e.g., ChatGPT and other similar systems)
• Cloud computing is mainstream: the industry offers on-demand, elastic storage and computation services
• For cloud-based systems, the industry converges on data lakes as novel systems for data integration and on modern data-warehousing query engines
• Society has become more concerned about the state of data governance, with a focus on data usage, e.g., quality-aware, privacy-aware, ethical and fair use of data
Extraction of (synthetic and useful) knowledge: build environments that mimic the progressive inspecting, observing, surveying activity with which users make decisions.
Massive data analysis and processing: A process of inspecting, cleaning, transforming, and modeling data with the goal of highlighting useful information, suggesting conclusions, and supporting decision making.
Massive data integration: People and enterprises need to integrate data and the systems that handle those data: Relational DBMSs and their extensions, legacy data and legacy DBMSs, sensors and user-generated content produce structured or unstructured data
Data warehousing: A single, complete and consistent store of data obtained from a variety of different sources made available to end users, so that they can understand and use it in a business context.[Barry Devlin]
Massive use of Machine Learning, which requires clean and ethical input data to ensure that the outputs are reasonable and ethical
Use of reasoning services, which allow various forms of deduction and inference
What is the relation between data, information, knowledge and wisdom?
They form a hierarchy (the DIKW pyramid): data are raw facts; data placed in context become information; information that is interpreted and connected becomes knowledge; knowledge applied with judgment becomes wisdom.
What is the relation between the data analysis and the rest of the application?
The actual implementation of the data analysis algorithm (ML, statistics, data mining, …) is usually less than 5% of the lines of code in a real, non-trivial application
The main effort (i.e., those 95% of LOC) is spent on:
§Data cleaning & annotation
§Data extraction, transformation, loading
§Data integration, pruning, & possibly new cleaning
§Parameter tuning
§Model training & deployment
…
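To make the cleaning and transformation share of that effort concrete, here is a minimal SQL sketch (the raw_readings table and its columns are hypothetical, and CREATE TABLE ... AS is assumed to be supported by the target DBMS) that standardizes units, discards rows without a usable key, and removes exact duplicates before analysis:

```sql
-- Hypothetical staging table: raw_readings(sensor_id, ts, temperature_f, temperature_c)
-- Cleaning/transformation step: standardize to Celsius, drop rows without a key,
-- and remove exact duplicates before loading into the analysis table.
CREATE TABLE clean_readings AS
SELECT DISTINCT
    sensor_id,
    ts,
    -- keep the Celsius value if present, otherwise convert from Fahrenheit
    COALESCE(temperature_c, (temperature_f - 32) * 5.0 / 9.0) AS temperature_c
FROM raw_readings
WHERE sensor_id IS NOT NULL
  AND ts IS NOT NULL;
```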
What are the parts in the data pipeline?
What is the definition of data integration?
Combining data coming from different data sources, providing the user with a unified vision of the data
Detecting correspondences between similar concepts that come from different sources, and resolving the conflicts among them
How do the four Vs relate to data integration?
Velocity: As a direct consequence of the rate at which data is being collected and
continuously made available, many of the data sources are very dynamic.
Variety: Data sources (even in the same domain) are extremely heterogeneous both at the schema level, regarding how they structure their data, and at the instance level, regarding how they describe the same real world entity, exhibiting considerable variety even for substantially similar entities.
Veracity: Data sources (even in the same domain) are of widely differing qualities,
with significant differences in the coverage, accuracy and timeliness of data provided. This is consistent with the observation that “1 in 3 business leaders do not trust the information they use to make decisions.”
Volume: Not only can each data source contain a huge volume of data, but also the number of data sources has grown to be in the millions.
What are the variety and veracity dimensions?
The Variety dimension:
people and enterprises need to integrate data and the systems that handle those data: relational DBMSs and their extensions, legacy data and legacy DBMSs, sensors and user-generated content produce heterogeneous, structured or unstructured data
The Veracity dimension:
Data quality is the most general and most widely used term; it encompasses a number of quality aspects besides veracity:
• Completeness
• Validity
• Consistency
• Timeliness
• Accuracy
• Ethics and fairness (a new entry)
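As an illustration (not part of the original notes), several of these dimensions can be estimated with plain SQL aggregates; the customer table and its columns below are hypothetical, completeness is measured as the fraction of non-null emails and validity as the fraction of plausible birth dates:

```sql
-- Hypothetical table: customer(id, email, birth_date, updated_at)
SELECT
    AVG(CASE WHEN email IS NOT NULL THEN 1.0 ELSE 0.0 END) AS completeness_email,
    AVG(CASE WHEN birth_date BETWEEN DATE '1900-01-01' AND CURRENT_DATE
             THEN 1.0 ELSE 0.0 END)                        AS validity_birth_date,
    MAX(updated_at)                                        AS latest_update  -- rough proxy for timeliness
FROM customer;
```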
What is data quality and what do its terms represent?
What is the origin of heterogeneity?
Heterogeneity originates from the autonomy of the data sources:
- Design (representation) autonomy
- Communication (querying) autonomy
- Execution (algorithmic) autonomy
This autonomy shows up at several levels:
- Different platforms → Technological heterogeneity
- Different data models of the participating datasets → Model heterogeneity
- Different query languages → Language heterogeneity
- Different data schemas and different conceptual representations in previously developed DBs → Schema (semantic) heterogeneity
- Different values for the same information (due to errors or to different knowledge) → Instance (semantic) heterogeneity
What are the steps in data integration?
- Schema reconciliation: mapping the data structures (if they exist!)
- Record linkage (aka entity resolution): data matching based on the same content
- Data fusion: reconciliation of non-identical content (sketched below)
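A minimal SQL sketch of the last two steps, assuming two hypothetical, already schema-reconciled customer tables src_a and src_b: record linkage matches rows on a normalized email, and data fusion resolves conflicting phone numbers by preferring the more recently updated source.

```sql
-- Record linkage: match customers from the two sources on a normalized key.
-- Data fusion: when both sources know the phone number, keep the fresher one.
SELECT
    a.customer_id AS id_in_a,
    b.customer_id AS id_in_b,
    CASE
        WHEN b.updated_at > a.updated_at THEN b.phone
        ELSE a.phone
    END           AS fused_phone
FROM src_a AS a
JOIN src_b AS b
  ON LOWER(TRIM(a.email)) = LOWER(TRIM(b.email));  -- simplistic linkage rule
```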
What are the kinds of data integration?
Use a materialized database (data are merged into a new database)
Use a virtual, non-materialized database (data remain at the sources)
What is virtual integration? What is materialized integration? What are their strengths and weaknesses?
Virtual integration and materialized integration are two common approaches to integrating data from multiple sources in the context of data integration systems, such as data warehouses or federated databases. Here’s a breakdown of each:
- Virtual Integration
Virtual integration, also known as federated integration, does not physically consolidate data into a single repository. Instead, it provides a unified view of data by querying the underlying data sources in real time or near-real time.
Strengths:
• Real-time Data Access: Allows access to the most up-to-date data directly from the source.
• Lower Storage Costs: No need to replicate and store data in a central location.
• Fast Deployment: Avoids the need to design and implement a full data warehouse.
• Flexibility: Can handle dynamic changes in the structure of data sources more easily.
Weaknesses:
• Performance Challenges: Querying multiple data sources in real time can lead to latency and bottlenecks.
• Complex Query Processing: Requires sophisticated query engines to translate and optimize queries across heterogeneous data sources.
• Source Dependency: Heavily reliant on the availability and performance of the underlying data sources.
• Limited Historical Data: Cannot easily provide historical snapshots unless the source systems retain them.
- Materialized Integration
Materialized integration involves consolidating data from multiple sources into a single repository, such as a data warehouse. Data is extracted, transformed, and loaded (ETL) into this central system, where it can be queried.
Strengths:
• Improved Performance: Queries run faster since data is already consolidated and optimized for analysis.
• Availability and Independence: Does not rely on source systems being online for querying.
• Data Consistency: Ensures a consistent and unified version of data for analysis and reporting.
• Support for Historical Data: Facilitates maintaining historical snapshots and trends.
Weaknesses:
• Data Latency: Consolidated data may not be as current as data in the source systems due to batch processing.
• High Storage Costs: Requires significant storage capacity for the centralized repository.
• Longer Setup Time: Building and maintaining a data warehouse can be time-consuming and complex.
• Data Duplication: Data is replicated, which may introduce redundancy and require additional synchronization efforts.
When to Use Each Approach
• Virtual Integration: Best suited for scenarios requiring real-time or near-real-time access to data, especially when data volumes are not very high, or for exploratory analysis across diverse data sources.
• Materialized Integration: Ideal for large-scale analytics, historical trend analysis, and environments where performance and data availability are critical.
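A rough SQL sketch of the difference, assuming two source tables sales_eu and sales_us that the integration system can reach (e.g., as foreign or linked tables) and a DBMS that supports CREATE TABLE ... AS: the virtual approach defines a plain view evaluated against the sources at query time, while the materialized approach copies the data into a local table that must be refreshed periodically.

```sql
-- Virtual integration: nothing is copied; the view is expanded at query time.
CREATE VIEW all_sales AS
SELECT region, amount, sale_date FROM sales_eu
UNION ALL
SELECT region, amount, sale_date FROM sales_us;

-- Materialized integration: data are extracted and loaded into a local table
-- (a simplified ETL step; a real system would also transform and refresh it).
CREATE TABLE all_sales_mat AS
SELECT region, amount, sale_date FROM sales_eu
UNION ALL
SELECT region, amount, sale_date FROM sales_us;
```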
Describe a general framework for data integration. What is the function of the middleware in it? What is the function of the data layer? What does the rightmost part of the picture represent?
A general framework for data integration consists of the application layer, the middleware (where query translation and integration happen), and the different data layers (the individual data sources).
Explain the example given.
When a query is submitted, the integration system has to decompose it into queries against the component datasets.
Determine first which parts of the query refer to which dataset:
- which parts apply to data from a single dataset and
- which parts apply to data from different datasets.
The latter ones can only be evaluated over the integrated data (view), whereas the former ones can be evaluated within the component datasets.
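An illustrative decomposition in SQL (all source, table, and column names are hypothetical): a global query asking for cheap products present in both sources is split into single-source selections, which are pushed down to the component datasets, and a cross-source join, which is evaluated in the integration layer over the partial results.

```sql
-- Global query: products cheaper than 100 that appear in both sources.
-- Decomposition: the selections run locally at each source,
-- the join across sources runs in the integration layer.

-- Sub-query sent to source 1
SELECT product_name, price FROM source1_catalog WHERE price < 100;

-- Sub-query sent to source 2
SELECT product_name, price FROM source2_catalog WHERE price < 100;

-- Cross-source part, evaluated over the two partial results shipped back
SELECT s1.product_name, s1.price AS price_src1, s2.price AS price_src2
FROM source1_result AS s1
JOIN source2_result AS s2
  ON s1.product_name = s2.product_name;
```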
Why do we use data integration when designing a single DB?
Each area of the company will ask the designer to design its (part of the) database (DB)
However, much of this data is common to several areas
If the global company DB were just the collection of these partial DBs, there would be many redundancies (wasted storage)
Worst of all, when one copy of a duplicated item is updated, the other copies may be left unchanged, leading to inconsistencies
What does a data integration system for a multidatabase do?
We must build a system that:
• Supports access to different data sources
• “Knows” the contents of these data sources
• Integrates the different data sources by means of a unifying, global schema
• Receives queries expressed in the language of the global schema
• Distributes “rewritten” queries to the sources
• Combines the answers received from the sources to build the final answer
What does the global schema provide?
The global schema will provide a reconciled, integrated, and virtual view of the data sources
What are the steps taken to design a data integration system for a multidatabase?
- Source schema identification (when present)
- Source schema reverse engineering (data source conceptual schemata)
- Conceptual schemata integration and restructuring: related concept identification, conflict analysis and resolution
- Conceptual to logical translation (of the obtained global conceptual schema)
- Mapping between the global logical schema and the single schemata (logical view definition)
- After integration: query-answering through data views
What are the possible conflicts?
Name conflicts: we could have homonyms (different attributes with the same name) and synonyms (the same attribute with different names)
Type conflicts: at the attribute level, e.g., gender could be represented by strings, numbers, or chars in different sources; at the entity level, different abstractions of the same real-world concept could have different sets of attributes
Data semantics: different currencies, different measurement systems, different granularities
Structure conflicts: the same concept modeled with different constructs (e.g., as an entity in one schema and as an attribute in another)
Dependency or cardinality conflicts: the same relationship modeled with different cardinalities in different schemas (e.g., 1:N in one, N:M in another)
Key conflicts: the same object represented with different keys in different sources
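A small, invented DDL example that exhibits several of these conflicts at once between two sources describing employees: a synonym conflict (salary vs. wage), a type conflict on gender, a data-semantics conflict (yearly euros vs. monthly dollars), and a key conflict (fiscal code vs. internal id).

```sql
-- Source A
CREATE TABLE employee_a (
    fiscal_code   CHAR(16) PRIMARY KEY,  -- key conflict: natural key
    full_name     VARCHAR(100),
    gender        CHAR(1),               -- type conflict: 'M'/'F'
    salary_eur_y  DECIMAL(10,2)          -- synonym + semantics: yearly, euros
);

-- Source B
CREATE TABLE employee_b (
    emp_id        INTEGER PRIMARY KEY,   -- key conflict: surrogate key
    full_name     VARCHAR(100),
    gender        SMALLINT,              -- type conflict: 0/1 encoding
    wage_usd_m    DECIMAL(10,2)          -- synonym + semantics: monthly, dollars
);
```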
Given the schemas:
Poli Robots
Report(ID, datetime, #ofFaults)
ReportFault(reportID, faultID)
Fault(ID, description, solution, responsible)
Uni Robots
Message(robotID, datetime, errorCode)
Error(code, description, solution, personInCharge, urgency)
Go through the first three steps in the data integration process and rationalize the decision.
Given the schemas:
Poli Robots
Report(ID, datetime, #ofFaults)
ReportFault(reportID, faultID)
Fault(ID, description, solution, responsible)
Uni Robots
Message(robotID, datetime, errorCode)
Error(code, description, solution, personInCharge, urgency)
Write the command that creates a view for the resulting global schema.
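One possible answer, as a hedged sketch: the exact global schema depends on the choices made in the previous steps, but a natural global relation unifies Poli's Fault with Uni's Error, padding the attribute that only one source provides. The qualified names poli.Fault and uni.Error are assumptions about how the sources are exposed to the integration system.

```sql
-- Global view unifying Poli's Fault and Uni's Error.
-- Poli has no urgency, so NULL is used (some DBMSs need CAST(NULL AS ...)).
CREATE VIEW GlobalFault(code, description, solution, responsible, urgency, source) AS
SELECT ID,   description, solution, responsible,    NULL,    'Poli' FROM poli.Fault
UNION ALL
SELECT code, description, solution, personInCharge, urgency, 'Uni'  FROM uni.Error;
```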
What are the techniques for mapping the global logical schema and the single schemata? Explain them.
Mapping the global logical schema (GLS) to the single schemata (local data sources) is a key challenge in data integration systems. These mappings are essential to create a unified view of distributed and heterogeneous data. The two primary techniques for such mappings are Global-as-View (GAV) and Local-as-View (LAV), along with a hybrid approach known as Both-as-View (BAV). Here’s a detailed explanation:
- Global-as-View (GAV)
In GAV, the global schema is defined as a set of views over the local schemata. Each global concept is explicitly mapped to queries on the local data sources.
Process:
• The global schema is predefined based on the structure and semantics of the local data sources.
• For each concept in the global schema, queries are written that extract relevant data from the local schemata.
Strengths:
• Simple Query Translation: Querying the global schema is straightforward since the mappings are directly tied to the local data sources.
• Efficient Query Execution: Optimized for specific queries as the mappings are explicitly defined.
Weaknesses:
• Low Flexibility: Changes in the local sources (e.g., schema updates) require updates to the global schema mappings.
• Scalability Issues: Defining and maintaining mappings for large, complex systems can become cumbersome.
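A GAV sketch in SQL, with hypothetical source tables s1_product(pid, name, price) and s2_item(code, title, cost): each global relation is directly a view over the sources, so a query on the global schema is answered simply by unfolding the view definition.

```sql
-- GAV: the global relation Product is a view written over the sources.
CREATE VIEW Product(id, name, price) AS
SELECT pid,  name,  price FROM s1_product
UNION ALL
SELECT code, title, cost  FROM s2_item;

-- A query posed on the global schema ...
SELECT name FROM Product WHERE price < 50;
-- ... is answered by unfolding, i.e., substituting the view definition above.
```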
- Local-as-View (LAV)
In LAV, each local schema is defined as a view over the global schema. Here, the focus is on describing how the data in each local source contributes to the global schema.
Process:
• The global schema is designed independently of the local sources, based on an abstract understanding of the data domain.
• For each local schema, mappings specify how its data fits into the global schema.
Strengths:
• High Flexibility: Changes to local sources do not necessarily require modifications to the global schema.
• Scalable Design: Easier to add new data sources since their mappings can be written independently of existing sources.
Weaknesses:
• Complex Query Translation: Translating a query on the global schema to the local schemata is computationally expensive due to the need for query rewriting.
• Performance Overhead: Queries may involve extensive reasoning over multiple local views, impacting execution efficiency.
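A LAV sketch, reversing the direction: each source is described as a view over the global schema. The global relation Product(id, name, price, category) and the source s1_electronics are hypothetical; the definition documents what the source contains (only electronics, with no price information), and answering a global query requires rewriting it using such descriptions rather than simple unfolding.

```sql
-- LAV (conceptual): the source is described in terms of the global schema.
-- The mapping states that s1_electronics contains the id and name of Products
-- in category 'electronics'; it says nothing about their price.
-- Shown as a view only for readability: in a real LAV system this is a mapping
-- assertion, not an executable view, because Product itself is virtual.
CREATE VIEW s1_electronics(id, name) AS
SELECT id, name
FROM Product
WHERE category = 'electronics';
```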
- Both-as-View (BAV)
BAV is a hybrid approach that combines elements of GAV and LAV. Both the global schema and the local schemata are defined as views over an intermediary conceptual schema.
Process:
• Introduces an intermediate schema that acts as a bridge between the global and local schemata.
• Both global and local schemas are described in terms of this intermediary schema, reducing the tight coupling between them.
Strengths:
• Modular and Flexible: Decouples the global and local schemas, making the system adaptable to changes.
• Balance in Complexity: Mitigates the weaknesses of both GAV and LAV.
Weaknesses:
• Increased Design Overhead: Requires additional effort to design and maintain the intermediary schema.
• Complexity in Implementation: May increase the computational cost of query translation.
Choosing the Right Technique
• GAV: Suitable for systems where the local sources are stable and well understood, and query execution needs to be efficient.
• LAV: Better for dynamic environments where local sources frequently change or new sources are added.
• BAV: Ideal for large-scale, complex integration scenarios requiring modularity and adaptability.