Lecture 2 - Knowledge Graphs Flashcards
Data management
- Becoming essential when organizations aim to be data-driven.
- Also becoming a huge challenge with Big Data (5 V’s)
We achieve good data quality through (cf. the Verhoef model):
• Governance and leadership - defined roles and responsibilities to ensure accountability for data quality, with policies and procedures in place to support the process
• Systems and processes - in place that secure the quality of data. Cf. auditing (next week)
• People and skills - train staff so they have the appropriate knowledge, competencies and capacity for their roles
• Data use - the purpose of collecting and reporting robust, good quality data is to inform management, make improvements to service delivery and to promote accountability to customers, stakeholders, local residents and Government
• Data security - data collected must be secure and should only be used for authorised purposes
Data Quality dimensions
• Accuracy
• Completeness
• Consistency
• Timeliness
• Validity (potential to be accurate, e.g., right datatype)
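Two of these dimensions are easy to check mechanically. The sketch below is only an illustration: the records, field names and helper functions are hypothetical and not part of the lecture. It measures completeness of a field and validity of a date value (right datatype, which does not yet mean the date is accurate).

```python
from datetime import date

# Hypothetical records; None marks a missing value.
records = [
    {"id": 1, "name": "Alice", "birth_date": "1990-04-12"},
    {"id": 2, "name": None, "birth_date": "1985-13-40"},  # missing name, impossible date
]

def is_valid_date(text):
    """Validity: the value parses as an ISO date (it may still be inaccurate)."""
    try:
        date.fromisoformat(text)
        return True
    except (TypeError, ValueError):
        return False

def completeness(rows, field):
    """Completeness: fraction of rows in which the field is filled in."""
    return sum(row[field] is not None for row in rows) / len(rows)

name_completeness = completeness(records, "name")
valid_dates = [is_valid_date(row["birth_date"]) for row in records]
```

Accuracy, consistency and timeliness usually need a reference source or a second data set to compare against, so they are left out of this sketch.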
Steps in
Step 1 - Separate and manage master data
Step 2 - Cleanse the data
Step 3 - Standardize
Step 4 - Publish (Open Data)
Step 4 - Publish (Open Data)
• Make (selected) data sets available within the enterprise, business
network or to the world
• Open data allows others to build new services, combine data etc.
• Open data is more and more expected from government agencies
• Note similarities and differences with traditional “data integration”
The problem (the need to link open data)
- Data everywhere
- Relevant data is scattered over many files and applications
- For many tasks, data from multiple sources needs to be used together
- For many tasks, data needs to be re-used out of context
- Exchange across systems, departments, organizations
- No “integrated schema”
- No centralized data governance possible anymore when you cross organizational borders
Solution: Linked Data (now called Knowledge graphs)
- URIs: Universal Identifiers for everything - object identification
- RDF: the “HTML” (markup language) of Linked Data - data representation
- SPARQL: SQL for Linked Data - data retrieval
Triples
- All information can be broken down into simple “Subject-Predicate-Object” triples.
- Thing – Attribute – Value
  - This course has name “Business Analytics Emerging Trends”
  - This lecture has date “2020-12-07”
- Thing – Relationship – Thing
  - This lecture location is Room WZ 104
  - This lecture teacher is Weigand
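The four example statements can be written down as plain (subject, predicate, object) tuples. This is a minimal sketch, not real RDF: the `http://example.org/` namespace and the identifiers built on it are assumptions made for illustration. Attribute values are plain strings, relationships point to another identified thing.

```python
EX = "http://example.org/"  # hypothetical namespace, for illustration only

triples = [
    # Thing - Attribute - Value: the object is a literal value
    (EX + "thisCourse", EX + "name", "Business Analytics Emerging Trends"),
    (EX + "thisLecture", EX + "date", "2020-12-07"),
    # Thing - Relationship - Thing: the object is itself an identified thing
    (EX + "thisLecture", EX + "location", EX + "RoomWZ104"),
    (EX + "thisLecture", EX + "teacher", EX + "Weigand"),
]

def describe(subject):
    """Collect everything the triples say about one subject."""
    return {p: o for s, p, o in triples if s == subject}

lecture = describe(EX + "thisLecture")
```

Here `lecture` holds the date, location and teacher of the lecture, keyed by predicate.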
Things are identified by URIs
• Benefits of using URIs:
  • Globally unique
  • Decentralized: duplicates are not prevented, but can be resolved easily using a “same-as” relationship
  • Resolvable (use a browser)
• Costs of using URIs:
  • Can be long and ugly
  • Can use international alphabets nowadays
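How a “same-as” relationship resolves duplicate identifiers can be sketched as follows. The URIs and the `ex:` predicates are hypothetical, `owl:sameAs` is the usual name for this relationship in Linked Data, and the union-find approach is one possible way to do the merge, not something prescribed by the lecture.

```python
SAME_AS = "owl:sameAs"

# Two sources describe the same city under different (hypothetical) URIs.
triples = [
    ("http://a.example/Winterthur", "ex:country", "ex:Switzerland"),
    ("http://b.example/city/42", "ex:label", "Winterthur"),
    ("http://a.example/Winterthur", SAME_AS, "http://b.example/city/42"),
]

def merge_duplicates(statements):
    """Rewrite every identifier to one representative per same-as group."""
    parent = {}

    def find(term):
        parent.setdefault(term, term)
        while parent[term] != term:
            term = parent[term]
        return term

    for s, p, o in statements:
        if p == SAME_AS:
            root_s, root_o = find(s), find(o)
            if root_s != root_o:
                # Keep the alphabetically smallest URI as representative.
                parent[max(root_s, root_o)] = min(root_s, root_o)

    return {(find(s), p, find(o)) for s, p, o in statements if p != SAME_AS}

merged = merge_duplicates(triples)
```

After the merge, both facts are attached to `http://a.example/Winterthur`, so a query on either source's data sees the combined description.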
How to query across data sources
- Make all data sources available as RDF (RDF is usually not the primary data representation)
- Put them into a single store (physical or virtual, ad hoc or persistent)
- Execute SPARQL queries!
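The three steps can be imitated in plain Python. This is a toy sketch under stated assumptions: the two sources, the `ex:` names and the `query` helper are hypothetical, and the matcher only mimics a SPARQL basic graph pattern; a real setup would use an RDF store and an actual SPARQL engine.

```python
# Step 1: each source exposes its data as triples (hypothetical content).
course_admin = {("ex:lecture2", "ex:teacher", "ex:Weigand")}
room_booking = {("ex:lecture2", "ex:location", "ex:RoomWZ104")}

# Step 2: put them into a single (here: ad hoc, in-memory) store.
store = course_admin | room_booking

def match(pattern, triple, binding):
    """Extend `binding` so that `pattern` equals `triple`; None if impossible."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):  # variable, as in SPARQL
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended

def query(triples, patterns):
    """Step 3: evaluate a list of triple patterns as a join over shared variables."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in triples:
                extended = match(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings

# Roughly: SELECT ?lecture ?room WHERE {
#   ?lecture ex:teacher ex:Weigand . ?lecture ex:location ?room . }
results = query(store, [
    ("?lecture", "ex:teacher", "ex:Weigand"),
    ("?lecture", "ex:location", "?room"),
])
```

The query needs a fact from each source, which is why the data has to be brought together in one store first.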
Knowledge Graphs in Web Search
• KGs are already used heavily by, e.g., Google. Some predict that in five years' time the Google interface will have changed completely (voice interface; no 2,300,576 web page results, only facts and ads).
• The Winterthur example (Denny Vrandecic, 2020)
Knowledge Graphs in Web Search: the problem
• Some twin towns are included in the Winterthur description.
• The Ontario page mentions Winterthur as Sister City, in a text description.
• There is no easy way to resolve the differences.
• Solution: Wikidata - publicly curated Knowledge Graph, where the relationship
is modeled as being symmetric
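What modeling a relationship as symmetric buys can be shown with a small inference step. A minimal sketch, assuming simplified names (`sisterCity`, `locatedIn`) rather than actual Wikidata identifiers: the fact is stored in one direction only and the reverse direction is derived.

```python
# Relationships declared symmetric (names are simplified assumptions).
SYMMETRIC = {"sisterCity"}

facts = {
    ("Ontario", "sisterCity", "Winterthur"),     # stated on one side only
    ("Winterthur", "locatedIn", "Switzerland"),  # not symmetric
}

def symmetric_closure(triples):
    """Add the reversed triple for every symmetric relationship."""
    inferred = {(o, p, s) for s, p, o in triples if p in SYMMETRIC}
    return set(triples) | inferred

closed = symmetric_closure(facts)
```

After the closure, asking for the sister cities of Winterthur also returns Ontario, without anyone editing the Winterthur description.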
Artificial Intelligence
• KG can be the output representation for
  • Natural Language Processing
  • Computer Vision
• KG can be input for several AI tasks
  • Simple reasoning
    • Based on properties of the relationships. E.g., “sister city” is symmetric.
  • Machine Learning
    • Requires conversion from KG to numerical input: word embeddings, graph embeddings
    • The ML results can be added back into the KG (link prediction).
      - For example, the query (?, StarringIn, Terminator) asks the model to predict the stars of the film Terminator when the data is incomplete.
  • Chatbots
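The link prediction query (?, StarringIn, Terminator) can be illustrated with graph embeddings. The sketch below uses a TransE-style score (head + relation should land close to tail), which is one common embedding model chosen here as an assumption; the lecture does not name a specific one. The two-dimensional vectors are hand-picked toy values, not trained embeddings.

```python
import math

# Toy embeddings, hand-picked for illustration (a real model learns these).
entity_vec = {
    "ArnoldSchwarzenegger": (1.0, 0.0),
    "LindaHamilton": (0.9, 0.1),
    "Winterthur": (-1.0, 2.0),
    "Terminator": (1.0, 1.0),
}
relation_vec = {"StarringIn": (0.0, 1.0)}

def distance(head, relation, tail):
    """TransE-style distance ||h + r - t||: small means the triple is plausible."""
    h, r, t = entity_vec[head], relation_vec[relation], entity_vec[tail]
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

def predict_heads(relation, tail, k=2):
    """Answer (?, relation, tail) by ranking all other entities as candidate heads."""
    candidates = [e for e in entity_vec if e != tail]
    return sorted(candidates, key=lambda e: distance(e, relation, tail))[:k]

top_candidates = predict_heads("StarringIn", "Terminator")
```

The highest-ranked candidates can then be proposed as new (head, StarringIn, Terminator) triples for the KG, ideally after review.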
Edge Detection
[Figure: photo with circled objects and the relations between them: man wearing glasses; man feeding horse; horse eating from bucket]
Contribution of KGs to Conversational AI
• First of all, the contribution of KG is providing more data from heterogeneous sources, including personal data (personalization)
• KG data can also be used to generate queries that can be used to train the (ML-based) NL Interpreter
• KG data can be used to improve the Intention finder, by attaching domain-specific intentions to objects.
  • Example: table reservation intention for restaurant objects in a touristic chatbot
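Attaching domain-specific intentions to objects could look roughly like this. Everything in the sketch is hypothetical (object names, class names, intention labels, and the naive substring matching); it only shows the idea that the type of a recognized KG object narrows down what the user probably wants.

```python
# Hypothetical KG fragment: (object, "type", class) triples.
kg = [
    ("LaPiazza", "type", "Restaurant"),
    ("CityMuseum", "type", "Museum"),
]

# Domain-specific intentions attached to object classes (labels are assumptions).
intentions_by_type = {
    "Restaurant": ["reserve_table"],
    "Museum": ["buy_ticket"],
}

def candidate_intentions(utterance):
    """Return the intentions attached to the types of KG objects named in the utterance."""
    text = utterance.lower()
    found = []
    for obj, predicate, cls in kg:
        if predicate == "type" and obj.lower() in text:
            found.extend(intentions_by_type.get(cls, []))
    return found

suggested = candidate_intentions("Is LaPiazza open tonight?")
```

A real Intention finder would combine such candidates with the ML-based interpretation of the utterance rather than rely on name matching alone.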