Theory Up To Structured Flashcards
What are the four Vs of big data and what do they stand for?
Volume: the sheer scale of data being produced and stored.
Velocity: the rate at which data is generated, collected, and must be processed.
Variety: the heterogeneity of data formats, models, and sources (structured, semi-structured, unstructured).
Veracity: the trustworthiness and quality of the data.
What should an information system be able to govern?
We should be able to govern:
o data abundance
o data and user dynamicity and mobility
o heterogeneity and data semantics
o incompleteness/uncertainty
o interaction with the real-world
And make sense of all this data: Extract useful knowledge
What are some of the challenges of data management?
What has changed in recent years:
• Technological breakthroughs in machine learning (ML) and artificial intelligence (AI):
o Data science as a discipline that combines elements of data cleaning and transformation, statistical analysis, data visualization, and ML techniques
o Automatic generation of text (e.g., ChatGPT and other similar systems)
• Cloud computing is mainstream: the industry offers on-demand, elastic storage and computation services.
• For cloud-based systems, the industry converges on data lakes as novel systems for data integration and on modern data warehousing query engines
• Society has become more concerned about the state of data governance, focused on data usage, e.g., quality-aware, privacy-aware, ethical and fair use of data
Extraction of (synthetic and useful) knowledge: build environments that mimic the progressive inspecting, observing, surveying activity with which users make decisions.
Massive data analysis and processing: A process of inspecting, cleaning, transforming, and modeling data with the goal of highlighting useful information, suggesting conclusions, and supporting decision making.
Massive data integration: People and enterprises need to integrate data and the systems that handle those data: Relational DBMSs and their extensions, legacy data and legacy DBMSs, sensors and user-generated content produce structured or unstructured data
Data warehousing: A single, complete and consistent store of data obtained from a variety of different sources made available to end users, so that they can understand and use it in a business context.[Barry Devlin]
Massive use of Machine Learning, which requires clean and ethical input data to ensure that the outputs are reasonable and ethical
Use of reasoning services, which allow various forms of deduction and inference
What is the relation between data, information, knowledge and wisdom?
They form a hierarchy (the DIKW pyramid): data are raw facts; information is data given context and meaning; knowledge is information that has been interpreted and understood; wisdom is knowledge applied to make sound decisions.
What is the relation between the data analysis and the rest of the application?
The actual implementation of the data analysis (ML, statistics, data mining, ...) algorithm is usually less than 5% of the lines of code in a real, non-trivial application
The main effort (i.e., the remaining 95% of LOC) is spent on:
§Data cleaning & annotation
§Data extraction, transformation, loading
§Data integration, pruning, & possibly new cleaning
§Parameter tuning
§Model training & deployment
…
What are the parts in the data pipeline?
What is the definition of data integration?
Combining data coming from different data sources, providing the user with a unified vision of the data
Detecting correspondences between similar concepts that come from different sources, and conflict solving
How do the four Vs relate to data integration?
Velocity: As a direct consequence of the rate at which data is being collected and
continuously made available, many of the data sources are very dynamic.
Variety: Data sources (even in the same domain) are extremely heterogeneous both at the schema level, regarding how they structure their data, and at the instance level, regarding how they describe the same real world entity, exhibiting considerable variety even for substantially similar entities.
Veracity: Data sources (even in the same domain) are of widely differing qualities,
with significant differences in the coverage, accuracy and timeliness of data provided. This is consistent with the observation that “1 in 3 business leaders do not trust the information they use to make decisions.”
Volume: Not only can each data source contain a huge volume of data, but also the number of data sources has grown to be in the millions.
What are the variety and veracity dimensions?
The Variety dimension:
people and enterprises need to integrate data and the systems that handle those data: relational DBMSs and their extensions, legacy data and legacy DBMSs, sensors and user-generated content produce heterogeneous, structured or unstructured data
The Veracity dimension:
Data Quality is the most general and most widely used term; it covers a number of quality aspects besides veracity:
• Completeness
• Validity
• Consistency
• Timeliness
• Accuracy
• Ethics and fairness (a new entry)
What is data quality and what do its terms represent?
What is the origin of heterogeneity?
- Design (representation) autonomy
- Communication (querying) autonomy
- Execution (algorithmic) autonomy
These forms of source autonomy lead to:
- Different platforms → Technological heterogeneity
- Different data models of the participating datasets → Model heterogeneity
- Different query languages → Language heterogeneity
- Different data schemas and different conceptual representations in previously developed DBs → Schema (semantic) heterogeneity
- Different values for the same information (due to errors or to different knowledge) → Instance (semantic) heterogeneity
What are the steps in data integration?
- Schema reconciliation: mapping the data structure (if it exists!)
- Record linkage (aka entity resolution): data matching based on the same content
- Data fusion: reconciliation of non-identical content
What are the kinds of data integration?
Use a materialized database (data are merged into a new database)
Use a virtual, non-materialized database (data remain at the sources)
What is virtual integration? What is materialized integration? What are their strengths and weaknesses?
Virtual integration and materialized integration are two common approaches to integrating data from multiple sources in the context of data integration systems, such as data warehouses or federated databases. Here’s a breakdown of each:
- Virtual Integration
Virtual integration, also known as federated integration, does not physically consolidate data into a single repository. Instead, it provides a unified view of data by querying the underlying data sources in real time or near-real time.
Strengths:
• Real-time Data Access: Allows access to the most up-to-date data directly from the source.
• Lower Storage Costs: No need to replicate and store data in a central location.
• Fast Deployment: Avoids the need to design and implement a full data warehouse.
• Flexibility: Can handle dynamic changes in the structure of data sources more easily.
Weaknesses:
• Performance Challenges: Querying multiple data sources in real time can lead to latency and bottlenecks.
• Complex Query Processing: Requires sophisticated query engines to translate and optimize queries across heterogeneous data sources.
• Source Dependency: Heavily reliant on the availability and performance of underlying data sources.
• Limited Historical Data: Cannot easily provide historical snapshots unless the source systems retain them.
- Materialized Integration
Materialized integration involves consolidating data from multiple sources into a single repository, such as a data warehouse. Data is extracted, transformed, and loaded (ETL) into this central system, where it can be queried.
Strengths:
• Improved Performance: Queries run faster since data is already consolidated and optimized for analysis.
• Availability and Independence: Does not rely on source systems being online for querying.
• Data Consistency: Ensures a consistent and unified version of data for analysis and reporting.
• Support for Historical Data: Facilitates maintaining historical snapshots and trends.
Weaknesses:
• Data Latency: Consolidated data may not be as current as data in the source systems due to batch processing.
• High Storage Costs: Requires significant storage capacity for the centralized repository.
• Longer Setup Time: Building and maintaining a data warehouse can be time-consuming and complex.
• Data Duplication: Data is replicated, which may introduce redundancy and require additional synchronization efforts.
When to Use Each Approach
• Virtual Integration: Best suited for scenarios requiring real-time or near-real-time access to data, especially when data volumes are not very high, or for exploratory analysis across diverse data sources.
• Materialized Integration: Ideal for large-scale analytics, historical trend analysis, and environments where performance and data availability are critical.
Describe a general framework for data integration. What is the function of the middleware in it? What is the function of the data layer? What does the rightmost part of the picture represent?
A general framework for data integration consists of the application layer, the middleware (where query translation and mediation between the schemas happen), and the data layer, which contains the different data sources.
Explain the example given.
When a query is submitted, the integration system has to decompose it into queries against the component datasets.
Determine first which parts of the query refer to which dataset:
- which parts apply to data from a single dataset and
- which parts apply to data from different datasets.
The latter ones can only be evaluated over the integrated data (view), whereas the former ones can be evaluated within the component datasets.
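A minimal sketch of this decomposition in Python; the two toy datasets, their schemas, and the example query are assumptions made only for illustration:

```python
# Global query (informally): names of employees working on projects with budget > 60,000,
# where employees come from dataset A and projects from dataset B.

dataset_A = [("Alice", "P1"), ("Bob", "P2")]      # Employee(name, projectID)
dataset_B = [("P1", 100_000), ("P2", 50_000)]     # Project(projectID, budget)

# Parts that apply to a single dataset are evaluated at that source:
a_part = dataset_A                                           # no local condition on A
b_part = [(p, b) for (p, b) in dataset_B if b > 60_000]      # selection pushed to B

# The part that spans both datasets (the join) can only be evaluated
# over the integrated view, i.e., by the integration system:
answer = [(name, budget) for (name, pid) in a_part
                         for (pid2, budget) in b_part if pid == pid2]
print(answer)  # [('Alice', 100000)]
```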
Why do we need data integration even when designing a single (company-wide) DB?
Each area of the company will ask the designer to design their (part of) database (DB)
However, much of this data is common to several areas
If the global company DB is just the collection of these partial DBs, there will be many redundancies (wasted storage)
Worst of all, when one copy of a duplicated piece of data is updated, the other copies may be left unchanged, making the data inconsistent
What does a data integration system for a multidatabase do?
We must build a system that:
• Supports access to different data sources
• “Knows” the contents of these data sources
• Integrates the different data sources by means of a unifying, global schema
• Receives queries expressed in the language of the global schema
• Distributes “rewritten” queries to the sources
• Combines the answers received from the sources to build the final answer
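A toy mediator skeleton illustrating these responsibilities; the two sources, their local schemas, and the global schema Person(name, city) are assumptions, not part of the course material:

```python
# Two autonomous sources with different local schemas.
source1 = [{"fullName": "Alice", "town": "Milan"}]   # schema: (fullName, town)
source2 = [{"nome": "Bruno", "citta": "Rome"}]       # schema: (nome, citta)

def query_source1(city=None):
    """Query rewritten against source1's own schema, results mapped to the global schema."""
    rows = [r for r in source1 if city is None or r["town"] == city]
    return [{"name": r["fullName"], "city": r["town"]} for r in rows]

def query_source2(city=None):
    """Query rewritten against source2's own schema, results mapped to the global schema."""
    rows = [r for r in source2 if city is None or r["citta"] == city]
    return [{"name": r["nome"], "city": r["citta"]} for r in rows]

def answer_global_query(city=None):
    """Distribute the rewritten queries to the sources and combine the answers."""
    return query_source1(city) + query_source2(city)

print(answer_global_query(city="Milan"))  # [{'name': 'Alice', 'city': 'Milan'}]
```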
What does the global schema provide?
The global schema will provide a
• Reconciled
• Integrated
• Virtual
view of the data sources
What are the steps taken to design a data integration system for a multidatabase?
- Source schema identification (when present)
- Source schema reverse engineering (data source conceptual schemata)
- Conceptual schemata integration and restructuring: related concept identification, conflict analysis and resolution
- Conceptual to logical translation (of the obtained global conceptual schema)
- Mapping between the global logical schema and the single schemata (logical view definition)
- After integration: query-answering through data views
What are the possible conflicts?
Name conflicts: we could have homonyms (different attributes with the same name) and synonyms (the same attribute with different names)
Type conflicts: within a single attribute, e.g., gender could be represented by strings, numbers, or chars; within an entity, different abstractions of the same real-world concept could have different sets of attributes
Data semantics: different currencies, different measurement systems, different granularities
Structure conflicts: the same concept represented with different constructs (e.g., as an attribute in one schema and as an entity in another)
Dependency or cardinality conflicts: the same relationship modeled with different cardinalities in different schemas (e.g., one-to-many in one and many-to-many in another)
Key conflicts: the same object represented with different keys
What is the general constitution of a data integration system?
Formally, a data integration system is a triple
(G, S, M)
where G is the global schema, S is the set of sources and M is the set of mappings.
A query to the integrated system is posed in terms of G and specifies which data of the virtual database we are
interested in.
Given the schemas:
Poli Robots
Report(ID, datetime, #ofFaults)
ReportFault(reportID, faultID)
Fault(ID, description, solution, responsible)
Uni Robots
Message(robotID, datetime, errorCode)
Error(code, description, solution, personInCharge, urgency)
Write the command that creates a view for the resulting global schema.
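One possible answer, sketched in Python with sqlite3 so that it runs end to end. The global schema GlobalFault(source, datetime, faultCode, description, solution, responsible), the renaming of #ofFaults to nOfFaults, and dropping urgency are assumptions, since the target global schema is not reproduced in these notes; the view follows the GAV style (the global relation is defined as a query over the sources):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
-- Poli Robots source ('#ofFaults' renamed to nOfFaults to be a valid identifier)
CREATE TABLE Report(ID, datetime, nOfFaults);
CREATE TABLE ReportFault(reportID, faultID);
CREATE TABLE Fault(ID, description, solution, responsible);

-- Uni Robots source
CREATE TABLE Message(robotID, datetime, errorCode);
CREATE TABLE Error(code, description, solution, personInCharge, urgency);

-- GAV-style view: an assumed global schema unifying the two sources
CREATE VIEW GlobalFault(source, datetime, faultCode, description, solution, responsible) AS
    SELECT 'Poli', R.datetime, F.ID, F.description, F.solution, F.responsible
    FROM Report R
    JOIN ReportFault RF ON RF.reportID = R.ID
    JOIN Fault F        ON F.ID = RF.faultID
  UNION
    SELECT 'Uni', M.datetime, E.code, E.description, E.solution, E.personInCharge
    FROM Message M
    JOIN Error E ON E.code = M.errorCode;
""")

# Queries are now posed against the global view only (empty here, since no data was loaded).
print(cur.execute("SELECT * FROM GlobalFault").fetchall())
```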
What are the techniques for mapping the global logical schema and the single schemata? Explain them.
Mapping the global logical schema (GLS) to the single schemata (local data sources) is a key challenge in data integration systems. These mappings are essential to create a unified view of distributed and heterogeneous data. The two primary techniques for such mappings are Global-as-View (GAV) and Local-as-View (LAV), along with a hybrid approach known as Both-as-View (BAV). Here’s a detailed explanation:
- Global-as-View (GAV)
In GAV, the global schema is defined as a set of views over the local schemata. Each global concept is explicitly mapped to queries on the local data sources.
Process:
• The global schema is predefined based on the structure and semantics of the local data sources.
• For each concept in the global schema, queries are written that extract relevant data from the local schemata.
Strengths:
• Simple Query Translation: Querying the global schema is straightforward since the mappings are directly tied to the local data sources.
• Efficient Query Execution: Optimized for specific queries as the mappings are explicitly defined.
Weaknesses:
• Low Flexibility: Changes in the local sources (e.g., schema updates) require updates to the global schema mappings.
• Scalability Issues: Defining and maintaining mappings for large, complex systems can become cumbersome.
- Local-as-View (LAV)
In LAV, each local schema is defined as a view over the global schema. Here, the focus is on describing how the data in each local source contributes to the global schema.
Process:
• The global schema is designed independently of the local sources, based on an abstract understanding of the data domain.
• For each local schema, mappings specify how its data fits into the global schema.
Strengths:
• High Flexibility: Changes to local sources do not necessarily require modifications to the global schema.
• Scalable Design: Easier to add new data sources since their mappings can be written independently of existing sources.
Weaknesses:
• Complex Query Translation: Translating a query on the global schema to the local schemata is computationally expensive due to the need for query rewriting.
• Performance Overhead: Queries may involve extensive reasoning over multiple local views, impacting execution efficiency.
- Both-as-View (BAV)
BAV is a hybrid approach that combines elements of GAV and LAV. Both the global schema and the local schemata are defined as views over an intermediary conceptual schema.
Process:
• Introduces an intermediate schema that acts as a bridge between the global and local schemata.
• Both global and local schemas are described in terms of this intermediary schema, reducing the tight coupling between them.
Strengths:
• Modular and Flexible: Decouples the global and local schemas, making the system adaptable to changes.
• Balance in Complexity: Mitigates the weaknesses of both GAV and LAV.
Weaknesses:
• Increased Design Overhead: Requires additional effort to design and maintain the intermediary schema.
• Complexity in Implementation: May increase the computational cost of query translation.
Choosing the Right Technique
• GAV: Suitable for systems where the local sources are stable and well understood, and query execution needs to be efficient.
• LAV: Better for dynamic environments where local sources frequently change or new sources are added.
• BAV: Ideal for large-scale, complex integration scenarios requiring modularity and adaptability.
What are the situations in which is better to use GAV? What about LAV?
In the context of data integration, Global-As-View (GAV) and Local-As-View (LAV) are two main approaches for defining the mappings between a global schema and local data sources. The choice between these approaches depends on the specific requirements and constraints of the system. Here’s when to prefer each:
Global-As-View (GAV)
In GAV, the global schema is defined as a set of views over the local schemas (data sources). Each global schema concept is directly expressed as a query over the local schemas.
When to use GAV:
1. Static and stable data sources:
• Use GAV when the structure and content of the data sources are relatively static and unlikely to change frequently.
• Changes in the data sources require updates to the global schema mappings, which can be cumbersome in dynamic environments.
2. Few data sources:
• GAV is manageable when there are relatively few data sources, as each global schema element needs explicit mapping to one or more local schemas.
3. Simple integration logic:
• When the relationships between the data sources and the global schema are straightforward, GAV provides clarity and efficiency.
4. Performance optimization:
• Queries are typically more efficient in GAV because the integration system can rely on the pre-defined mappings to directly transform global queries into queries over the sources.
Advantages of GAV:
• Easier query processing: The mappings are explicitly defined, so query rewriting is straightforward.
• Performance-focused: Optimized for environments where query execution speed is critical.
Local-As-View (LAV)
In LAV, the local schemas are defined as views over the global schema. This means that each local data source is described in terms of the global schema.
When to use LAV:
1. Dynamic and evolving data sources:
• Use LAV when the structure or number of data sources may change frequently, as the global schema remains stable, and only the views for new or updated sources need to be modified.
2. Many data sources:
• LAV is better suited for systems with numerous or heterogeneous data sources because it abstracts the integration logic from the data sources.
3. Complex integration requirements:
• When the relationships between data sources and the global schema are complex or require significant abstraction, LAV offers more flexibility.
4. Exploratory data integration:
• In environments where the goal is to explore or experiment with different ways of integrating data, LAV’s flexibility is advantageous.
Advantages of LAV:
• Scalability: New data sources can be integrated without modifying the global schema.
• Flexibility: Easier to handle heterogeneous or complex data sources.
Comparison and Trade-offs:
• Query processing: easier and faster in GAV; more complex in LAV (requires reasoning about views).
• Flexibility: limited in GAV; high in LAV.
• Scalability: GAV is not well suited to many sources; LAV handles many sources well.
• Maintenance: high in GAV when sources change; LAV adapts to source changes more easily.
• Complexity of mappings: simpler in GAV; can be more abstract in LAV.
Hybrid Approach:
Sometimes, systems adopt a combination of GAV and LAV (GLAV) to leverage the advantages of both. This approach defines mappings that are neither strictly GAV nor strictly LAV, providing a balance between flexibility and query efficiency.
How does query processing occur in a GAV and in a LAV data integration system?
In data integration systems, Global-As-View (GAV) and Local-As-View (LAV) are two main approaches for defining how the global schema relates to the underlying data sources. Query processing differs significantly between these two approaches due to their differing definitions and mappings.
Global-As-View (GAV):
In the GAV approach, each relation in the global schema is explicitly defined as a query (or a view) over the underlying data sources. The global schema is essentially a set of views over the local data.
Query Processing in GAV:
1. Predefined Mappings: Since the global schema is directly mapped to the local sources, queries against the global schema can often be rewritten into queries against the data sources using the predefined mappings.
2. Simple Query Rewriting: Query processing involves replacing global relations in the query with their corresponding definitions (i.e., views) in terms of the local sources.
3. Execution: Once rewritten, the query is sent to the data sources for execution, and the results are combined and returned.
4. Example: If the global schema has a relation Employee(Name, Dept) and is defined as SELECT name, department FROM HR_DB, a global query like SELECT Name FROM Employee is rewritten directly to the local query SELECT name FROM HR_DB.
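A minimal runnable sketch of this unfolding, reusing the Employee/HR_DB names from the example above; the toy data and the Python encoding of the mapping are assumptions made for illustration:

```python
# Local source (toy extension): HR_DB(name, department)
HR_DB = [("Alice", "IT"), ("Bob", "Sales")]

# GAV mapping: the global relation Employee(Name, Dept) is *defined* as a view over HR_DB.
def Employee():
    return [(name, department) for (name, department) in HR_DB]

# Global query: SELECT Name FROM Employee WHERE Dept = 'IT'
# Rewriting simply unfolds the view definition into a query over HR_DB.
result = [name for (name, dept) in Employee() if dept == "IT"]
print(result)  # ['Alice']
```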
Strengths of GAV:
• Simple query rewriting because the mappings are predefined.
• Efficient for query execution when data sources are stable.
Challenges in GAV:
• Hard to maintain mappings when data sources change.
• Adding or modifying a source requires updates to all mappings in the global schema.
Local-As-View (LAV):
In the LAV approach, each data source is described as a view over the global schema. The global schema is independent of the sources, and the relationships between the global schema and the sources are described in terms of the views.
Query Processing in LAV:
1. Query Reformulation: When a query is posed against the global schema, the system reformulates it by determining how the query can be answered using the views (i.e., the descriptions of the data sources).
2. Query Containment: Query rewriting in LAV requires reasoning about containment and equivalence of queries to find plans for answering the query using available views.
3. Use of Query Rewriting Algorithms: Algorithms like Bucket Algorithm or Inverse Rules Algorithm are employed to generate query plans.
4. Execution: Once reformulated, the rewritten queries are sent to the data sources, results are retrieved, and combined.
5. Example: If a source is described as a view HR_DB(Name, Dept) ⊆ Employee(Name, Dept), a global query like SELECT Name FROM Employee WHERE Dept = 'IT' requires reasoning to identify that the local source HR_DB can partially answer it.
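A highly simplified sketch of the bucket idea: each source view is described by what it covers in the global schema, and for each part of the query the system collects the views that can contribute. The view descriptions (HR_DB, BADGE_DB) are assumptions, and the containment/equivalence reasoning of the real Bucket or Inverse Rules algorithms is deliberately omitted:

```python
# LAV descriptions: each source view states which global relation it covers
# and which global attributes it exposes.
views = {
    "HR_DB":    {"relation": "Employee", "attributes": {"Name", "Dept"}},
    "BADGE_DB": {"relation": "Employee", "attributes": {"Name"}},
}

def bucket_for(relation, needed_attributes):
    """Collect the views mentioning `relation` that expose all attributes the query needs."""
    return [name for name, d in views.items()
            if d["relation"] == relation and needed_attributes <= d["attributes"]]

# Global query: SELECT Name FROM Employee WHERE Dept = 'IT'
# It needs attributes {Name, Dept} of Employee, so only HR_DB can (partially) answer it.
print(bucket_for("Employee", {"Name", "Dept"}))  # ['HR_DB']
```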
Strengths of LAV:
• Flexible to changes in data sources, as mappings are defined per source.
• Easier to integrate new sources without altering the global schema.
Challenges in LAV:
• Query rewriting is computationally more complex due to reasoning about containment and equivalence.
• Not all queries may have answers since mappings may not fully describe the global schema.
Comparison of GAV and LAV Query Processing:
Aspect GAV LAV
Mapping Definition Global schema as views over local sources. Local sources as views over the global schema.
Query Rewriting Simple and direct, using predefined mappings. Requires reasoning about views, often more complex.
Flexibility Less flexible to changes in sources. Highly flexible to changes and additions of sources.
Use Case Best when sources are stable and well-understood. Best when sources are dynamic and may change frequently.
Complexity Lower computational complexity in query rewriting. Higher computational complexity due to reasoning and containment.
What are the definitions for sound, exact, and complete mappings?
A mapping defined over some data source is sound when it provides a subset of the data available in the data source that corresponds to the definition: everything it returns satisfies the definition, but it may miss some of the data.
A mapping is complete if it provides a superset of the available data that corresponds to the definition: it returns all the relevant data, possibly together with extra data.
A mapping is exact if it provides all and only the data corresponding to the definition: it is both sound and complete.
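In symbols, a compact restatement of the three definitions, where M(s) denotes the data the mapping provides for source s and def(s) the data described by its definition (notation assumed here, not from the notes):

```latex
M(s) \subseteq \mathrm{def}(s) \;\; \text{(sound)} \qquad
M(s) \supseteq \mathrm{def}(s) \;\; \text{(complete)} \qquad
M(s) = \mathrm{def}(s) \;\; \text{(exact)}
```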
When is union allowed? What about outer union? What is the result of the operation?
Union is allowed only between union-compatible relations, i.e., relations with the same attributes (same number, with compatible domains); the result contains the tuples of both relations.
Outer union can also be applied when the schemas are only partially compatible: the result schema is the union of the two schemas, and the attributes a tuple does not have are padded with NULLs.
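An illustrative sketch of the difference (relation schemas and data are assumptions): union works only on the common, compatible schema, while outer union keeps all attributes and pads the missing ones with NULLs:

```python
R1 = [{"id": 1, "name": "Alice"}]                   # R1(id, name)
R2 = [{"id": 2, "name": "Bob", "dept": "IT"}]       # R2(id, name, dept)

# Union: only allowed over union-compatible schemas, here the common (id, name).
union = [{"id": t["id"], "name": t["name"]} for t in R1 + R2]

# Outer union: result schema is the union of the schemas; missing values become None (NULL).
all_attrs = ["id", "name", "dept"]
outer_union = [{a: t.get(a) for a in all_attrs} for t in R1 + R2]

print(union)        # [{'id': 1, 'name': 'Alice'}, {'id': 2, 'name': 'Bob'}]
print(outer_union)  # Alice's tuple has dept = None
```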
What happens when R1 and R2 are joined in these operations?
What are the steps in data integration?
- Schema reconciliation: mapping the data structure
- Record linkage: data matching based on the same content
- Data fusion: reconciliation of non-identical content
What is record linkage? What is its relation with data fusion?
• Record linkage (aka entity resolution): finding the pieces of information that refer to the same real-world entity.
• Data fusion: once we have recognized that two items refer to the same entity, how do we reconcile inconsistent information?
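A toy illustration of the two steps; the records, the matching rule (shared phone number), and the fusion rules (prefer non-null values and the longer name) are all assumptions:

```python
a = {"name": "J. Smith",   "phone": "555-1234", "city": None}
b = {"name": "John Smith", "phone": "555-1234", "city": "Milan"}

# Record linkage: decide whether a and b refer to the same real-world entity
# (here with a crude rule; in practice similarity measures are used).
same_entity = a["phone"] == b["phone"]

# Data fusion: once linked, reconcile the inconsistent attribute values.
if same_entity:
    fused = {
        "name":  max(a["name"], b["name"], key=len),   # keep the more informative name
        "phone": a["phone"],
        "city":  a["city"] or b["city"],               # prefer a non-null value
    }
    print(fused)  # {'name': 'John Smith', 'phone': '555-1234', 'city': 'Milan'}
```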
When do we perform record linkage?
What are the errors that might lead to different values for the same instance?
What are the methods to overcome the duplication problems?
What are the types of similarities used for determining if 2 strings are the same?
Types of similarity measures:
- Sequence-based: edit distance, Needleman-Wunsch, affine gap, Smith-Waterman, Jaro, Jaro-Winkler
- Set-based: overlap, Jaccard, TF/IDF
- Hybrid: generalized Jaccard, soft TF/IDF, Monge-Elkan
- Phonetic: Soundex
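A short sketch of two of these measures, one set-based (Jaccard on word tokens) and one sequence-based (Levenshtein edit distance); the example strings are arbitrary:

```python
def jaccard(a: str, b: str) -> float:
    """Set-based similarity: size of the token overlap over the size of the token union."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def edit_distance(a: str, b: str) -> int:
    """Sequence-based (Levenshtein) distance: minimum number of single-character
    insertions, deletions, and substitutions that turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution (or match)
        prev = curr
    return prev[len(b)]

print(jaccard("John A. Smith", "Smith John"))  # ~0.67
print(edit_distance("Jon", "John"))            # 1
```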
What is the difference between similarity and distance?
Similarity grows as two objects become more alike (typically normalized in [0,1], with 1 for identical objects), whereas distance grows as they become more different (0 for identical objects); for normalized measures the two are often related by similarity = 1 − distance.