Week 2 - Data Management Flashcards
What is data management?
It includes the collection, storage, retrieval, quality assurance, and security of data
Explain data-information-knowledge-wisdom (DIKW)
Data: raw observations of the world
Information: data that has been processed to provide meaning
Knowledge: what makes possible the transformation of information into instructions or knowing how to do something
Wisdom: insight that is integrated and actionable
What is metadata?
Data that describes the properties or characteristics of end-user data and the context of those data
What does metadata do?
It enhances searchability, categorisation, and the efficiency of data management
Structured data
Strictly organised such that it is easily searchable - database with a rigid schema
Unstructured data
Requires special handling - email body, social media post
Semi-structured data
Mix of both structured and unstructured data - e.g. JSON or XML, which carry keys or tags but no rigid schema
What is a database?
An organised collection of logically related data
Data management system
Data integrity: ensuring accuracy and consistency
Data security: protecting sensitive information
Scalability: adapting to growing amounts of data
Collaboration: enabling cross-functional access and analysis
What are the fundamental database operations?
CRUD: create, read (retrieve), update, delete
They form the basis of data manipulation and access
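The four operations above can be sketched with Python's built-in sqlite3 module. This is only a minimal illustration: the in-memory database, the "students" table, and its columns are hypothetical and not part of the course material.

```python
import sqlite3

# In-memory database; the "students" table is a made-up example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT)")

# Create: insert a new row.
conn.execute("INSERT INTO students (id, name) VALUES (?, ?)", (1, "Ada"))

# Read: retrieve the row that was just inserted.
name_after_create = conn.execute(
    "SELECT name FROM students WHERE id = ?", (1,)
).fetchone()[0]

# Update: change a value in the existing row.
conn.execute("UPDATE students SET name = ? WHERE id = ?", ("Ada Lovelace", 1))
name_after_update = conn.execute(
    "SELECT name FROM students WHERE id = ?", (1,)
).fetchone()[0]

# Delete: remove the row again.
conn.execute("DELETE FROM students WHERE id = ?", (1,))
remaining = conn.execute("SELECT COUNT(*) FROM students").fetchone()[0]

conn.close()
```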
What does ACID stand for?
Atomicity: all-or-nothing approach; a transaction is an indivisible unit that either completes fully or has no effect [buying concert ticket]
Consistency: ensuring that transactions bring the database from one valid state to another [library checkout]
Isolation: making sure transactions are processed independently [airplane seat tickets]
Durability: guarantees that once a transaction is committed, it will remain even in the case of a system failure [saving a paper]
ACID are the principles that …
Ensure reliable transactions in a database
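Atomicity in particular can be illustrated with a small sqlite3 sketch. The "accounts" table, the names, and the transfer() helper are hypothetical; the example assumes a CHECK constraint is used to reject negative balances, so that a failing transfer is rolled back as a whole.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: the CHECK constraint forbids negative balances.
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 50), ("bob", 0)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one all-or-nothing transaction."""
    try:
        with conn:  # commits if the block succeeds, rolls back on an exception
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The debit violated the CHECK constraint, so the credit is undone too.
        return False


ok = transfer(conn, "alice", "bob", 80)  # alice only has 50
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

After the failed transfer both balances are unchanged: the credit to bob is not kept on its own.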
Tabular data
[+] ideal for small amounts of data
[+] easy to create and use
[-] not suitable for complex relationships, only 2 dimensional
CSV files
Plain-text file of comma-separated values, often used for data exchange between different systems
What is a relational database?
It is a collection of tables (relations) that are linked to each other
What does a relational database enable?
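As a rough sketch of CSV as an exchange format, the snippet below writes two made-up records to CSV text and reads them back with Python's standard csv module. The field names and values are illustrative assumptions.

```python
import csv
import io

# Hypothetical records; CSV stores every value as text.
rows = [{"id": "1", "name": "Ada"}, {"id": "2", "name": "Grace"}]

# Write the records as CSV text (an in-memory buffer stands in for a file).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "name"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

# Another system could now parse the same text back into records.
read_back = list(csv.DictReader(io.StringIO(csv_text)))
```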
Enables complex queries and data manipulation
What are its benefits?
Data integrity, flexibility, scalability, security
What is the process of normalisation?
Process of organising data in a database to reduce redundancy
How do you normalise data?
Divide large tables into smaller, related tables and define relationships between them
What are the goals of normalisation?
- Improve data integrity and consistency
- Optimise storage and query performance
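A minimal sketch of the idea, using plain Python lists rather than a real database: one flat table that repeats customer details is split into a customers table and an orders table linked by a customer_id. All names and values here are hypothetical.

```python
# Flat table: the customer's name and email are repeated on every order.
flat_orders = [
    {"order_id": 1, "customer": "Ada", "email": "ada@example.com", "item": "Book"},
    {"order_id": 2, "customer": "Ada", "email": "ada@example.com", "item": "Pen"},
    {"order_id": 3, "customer": "Grace", "email": "grace@example.com", "item": "Lamp"},
]

customers = {}  # email -> customer row, so each customer is stored once
orders = []     # orders keep only a reference (customer_id) to the customer

for row in flat_orders:
    if row["email"] not in customers:
        customers[row["email"]] = {
            "customer_id": len(customers) + 1,
            "name": row["customer"],
            "email": row["email"],
        }
    orders.append(
        {
            "order_id": row["order_id"],
            "customer_id": customers[row["email"]]["customer_id"],
            "item": row["item"],
        }
    )

customer_table = list(customers.values())
```

Changing a customer's email now means updating one row in customer_table instead of every matching order.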
Columnar database
Data is stored by column rather than by row. Typically used for data warehousing
What are the pros and cons of columnar databases?
[+] efficient data compression - data in column, same type
[+] fast for queries that sum, count, average or otherwise aggregate values
[-] not suited to OLTP (online transaction processing)
[-] slower for write operations
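The trade-off can be sketched in plain Python by storing the same made-up data row-wise and column-wise. This only illustrates the layout, not how a real columnar engine is implemented.

```python
# Row-oriented layout: each record is kept together.
rows = [
    {"id": 1, "region": "north", "sales": 120},
    {"id": 2, "region": "south", "sales": 80},
    {"id": 3, "region": "north", "sales": 200},
]

# Column-oriented layout: one list per column, all values of the same type.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# Aggregating touches a single column list ...
total_from_columns = sum(columns["sales"])
# ... whereas the row layout has to visit every whole record.
total_from_rows = sum(row["sales"] for row in rows)

# Writing one new record, however, means appending to every column list.
new_record = {"id": 4, "region": "east", "sales": 50}
for key, value in new_record.items():
    columns[key].append(value)
```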
What are document databases?
NoSQL databases with no fixed schema, designed to store, retrieve, and manage document-oriented information
Schema Flexibility
Document databases typically allow for a flexible schema within the documents. Documents within the same collection may have different fields and structures
Hierarchical Data representation
Documents can contain nested structures, arrays, and other complex data types, making them suitable for hierarchical data
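Schema flexibility and nesting can be sketched with plain Python dictionaries serialised as JSON. The collection, field names, and values below are hypothetical, and no particular document database's API is being shown.

```python
import json

# Two documents in the same (made-up) collection with different fields.
collection = [
    {
        "_id": 1,
        "name": "Ada",
        "emails": ["ada@example.com"],                    # array field
        "address": {"city": "London", "postcode": "N1"},  # nested document
    },
    {"_id": 2, "name": "Grace", "languages": ["COBOL"]},  # no address at all
]

# Queries have to allow for fields that some documents do not have.
cities = [doc["address"]["city"] for doc in collection if "address" in doc]

# Documents serialise naturally to JSON and back.
round_tripped = json.loads(json.dumps(collection))
```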
Distributed architecture
Distributed and can scale horizontally across multiple nodes or clusters
Indexing and querying
Allowing for efficient search and retrieval of documents
Lack of ACID transactions
Some document databases may not support full ACID properties across multiple documents or collections
Graph Databases
Data entities represented as nodes and the relationship between them represented as edges
Graph schema
Some graph databases enforce a schema that defines the types of nodes and relationships, while others are schema-less, allowing for more flexibility
Graph query language
Graph databases support graph query languages such as Cypher to enable efficient querying and manipulation of graph structures
Directed and undirected graphs
Directed - one-way relationship
Undirected - two-way relationship
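The difference can be sketched with a simple adjacency-set representation in Python. The build_graph() helper and the node names are hypothetical and stand in for what a graph database would manage internally.

```python
def build_graph(edges, directed):
    """Return a dict mapping each node to the set of nodes it points to."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())  # make sure every node appears as a key
        if not directed:
            graph[dst].add(src)  # an undirected edge is stored both ways
    return graph


edges = [("ada", "grace"), ("grace", "alan")]

# Directed: "ada follows grace" does not imply "grace follows ada".
follows = build_graph(edges, directed=True)

# Undirected: "ada is friends with grace" holds in both directions.
friends = build_graph(edges, directed=False)
```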
Graph databases are useful for
Highly interconnected data, where the relationships are the defining characteristic of the data and queries are based on the relationships themselves. A graph represents such a model in a natural and intuitive way.
Uses of graph databases?
Connections (LinkedIn), knowledge discovery, recommender systems (TikTok)