Databases Flashcards
What is Big Data?
Data sets so large or complex that traditional systems cannot process them effectively. It includes unstructured and semi-structured formats like images, videos, and logs.
What are the 5 V’s of Big Data?
Volume - enormous size
Velocity - speed at which data is generated and processed (e.g. stock market transfers)
Variety - different formats of data: structured semi-structured and unstructured.
Veracity - the quality and trustworthiness of the data; managed bias, noise, misinformation in data.
Value - the insight and benefits derived from the data.
What is Data Mining?
Principles and steps involved in discovering patterns from large datasets.
What is a Data Warehouse?
A centralised repository designed for structured data, optimised for analytics and reporting. It uses schema-on-write, supports historical analysis an BI tools, and is optimised for SQL-based queries.
What is a Data Lake?
Stores raw data in its native format, supporting structured, semi-structured, and unstructured data. It uses schema-on-read, is scalable and cost effective for large datasets, and supports machine learning and big data analytics.
What is a Data Lakehouse?
Combines the flexibility of a data lake with the analytical capabilities of a data warehouse. It supports both schema-on-read and schema-on-write, structured and unstructured data, and is ideal for machine learning and advanced analytics.
What are some key steps in data cleaning?
Remove duplicates, Handle missing data (fill, drop, or interpolate), Fix structural errors (e.g. inconsistent date formats), Standardise data (convert measurements to the same unit).
What is the Freedom of Information Act (2000)?
Provides any citizen the right to access information controlled by public authorities.
Public authorities must produce an information scheme approved by the information commissioner.
Public authorities must deal with requests for data which is non-personal.
Public have the right to be told when information exists, to secure access to information on request and in the manner requested.
What is the Human Rights Act (1998)?
The right to respect for private and family life, home and correspondence.
What is the Data Protection Act (1998)?
Requires anyone handling personal data to comply with eight enforceable principles of good information handling practice.
What are some of the key principles of the Data Protection Act (1998)?
Data must be fairly and lawfully processed.
Processed for limited purposes.
Adequate, relevant, and not excessive.
Accurate.
Not kept longer than necessary.
Processed in accordance with the data subject’s rights.
Secure.
Not transferred to countries without adequate protection.
What are some of the rights of individuals under the Data Protection Act?
Right to subject access.
Right to prevent processing likely to cause substantial damage or distress.
Right to prevent processing for the purposes of direct marketing.
Rights in relation to automated decision-taking.
Right to compensation.
Right to rectify, block, erase, or destroy inaccurate data.
Right to ask the Commissioner to assess whether the Act has been complied with.
What does GDPR stand for?
General Data Protection Regulation.
What is personal data under GDPR?
Any information relating to an identified or identifiable natural person, including:
HR records
CCTV images
Emails
Confidential opinions
Automated and manual filing data
Even “anonymised” data can often be identifiable.
What is sensitive/special category personal data under GDPR?
Racial/ethnic origin
Political opinions
Religious/philosophical beliefs
Trade union membership
Genetic or biometric data
Health
Sex life/sexual orientation.
What are some key requirements for companies under GDPR?
Implement appropriate technical & organisational measures to ensure and demonstrate compliance.
Maintain relevant documentation.
Implement data protection by design.
Use Data Protection Impact Assessments/Risk Assessments.
Appoint a Data Protection Officer (for big companies).
What are the basic operators in Relational Algebra?
Projection (π)
Selection (σ)
Cross product (x)
Union (υ)
Rename (ρ)
Set difference (-)
What are the derived operators in Relational Algebra?
Join (⋈)
Intersect (∩)
Division (/,÷)
What does the Projection operator (π) do?
Selects specific columns from a relation.
What does the Selection operator (σ) do?
Selects rows that satisfy a given condition.
What does the Cross product operator (x) do?
Combines tuples from two relations in every possible way.
What does the Union operator (υ) do?
Combines tuples from two relations, removing duplicates.
What does the Intersection operator (∩) do?
Returns tuples that are present in both relations.
What does the Set difference operator (-) do?
Returns tuples from the first relation that are not in the second relation.
What does the Rename operator (ρ) do?
Renames a relation or its attributes.
What is a Natural Join (⋈)?
Combines tuples from two relations based on common attribute values.
What is a Full Outer Join (⟗)?
Includes all matching and unmatched rows from both tables, with NULL values where data is missing.
What is a Theta Join (θ)?
Tuples from two relations are combined based on a condition specified by a comparison operator (e.g., =, <, >, <=, >=, or !=) instead of just equality.
What are the main categories of SQL commands?
Data Definition Language (DDL)
Data Manipulation Language (DML)
Data Control Language (DCL)
Transaction Control Language (TCL)
Give examples of Data Definition Language commands.
CREATE
ALTER
DROP
TRUNCATE
RENAME
Give examples of Data Manipulation Language commands.
SELECT
INSERT
UPDATE
DELETE
Give examples of Data Control Language commands.
GRANT
REVOKE
Give examples of Transaction Control Language commands.
COMMIT
ROLLBACK
SAVEPOINT
What is the SQL equivalent of the Projection (π) operator in relational algebra?
SELECT
What is the SQL equivalent of the Selection (σ) operator in relational algebra?
WHERE
What is the SQL equivalent of the Cross Product (×) operator in relational algebra?
CROSS JOIN
What is the SQL equivalent of the Union (∪) operator in relational algebra?
UNION
What is the SQL equivalent of the Rename (ρ) operator in relational algebra?
ALTER or AS if temporary.
What is the SQL equivalent of the Set Difference (-) operator in relational algebra?
MINUS or EXCEPT
What is the SQL equivalent of the Intersection (∩) operator in relational algebra?
INTERSECT
What is the SQL equivalent of the Natural Join (⋈) operator in relational algebra?
JOIN
What is the SQL equivalent of the θ Join (⋈) operator in relational algebra?
INNER JOIN, LEFT JOIN, RIGHT JOIN
What is the SQL equivalent of the Full Outer Join (⟗), Left Outer Join (⟗<), and Right Outer Join (⟗>) in relational algebra?
FULL OUTER JOIN
LEFT OUTER JOIN
RIGHT OUTER JOIN
What is SQL injection?
A type of security exploit where malicious SQL code is inserted into a query, allowing attackers to access or manipulate a database.
What is a Prepared Statement and why is it important?
A way to execute SQL queries where the query structure is defined separately from user-supplied data, which prevents SQL injection by sanitizing inputs before processing.
What are the key types of NoSQL data models?
Key-Value Databases, Wide-Column Databases, Graph Databases, Document Databases.
What are Key-Value stores and what are they good for?
They store data in ‘key’ and ‘value’ pairs (e.g. a phonebook). Good for small, volatile, and fast-changing data. Examples include session tracking.
What are Wide-Column (or big table) databases and what are they good for?
Database is one big table, no distinct schema, each row is flexible and can hold different attributes.
Good fast-changing data and easy scaling. Very quick to query.
What are Graph databases and what are they good for?
Treat data like a graph with edges and nodes, edges relate two nodes, both can contain information.
Fraud detection or complex network analysis. Not good if edges/nodes have complex attributes.
What are Document databases and what are they good for?
Every ‘entity’ is considered a document.
Instead of tables, use a JSON-style format.
They provide increased variety through a flexible schema, and the option to embed data rather than relate. Not good if you need data integrity and consistency, can be hard to update.
What is Cloud Computing?
A technology which abstracts services, allowing them to be accessed over the internet.
What are the three main forms of cloud computing services?
Software as a service (SAAS) (e.g. zoom, slack).
Platform as a service (PAAS) (e.g. AWS Elastic Beanstalk).
Infrastructure as a service (IAAS) (e.g. AWS).
What is a database?
An organised collection of interrelated data that models some aspect of the real-world.
What is a DBMS?
Software that helps create, maintain, and secure a database.
What is an ER model used for?
A conceptual modelling tool to visualise data structure.
What are the main components of an ER model?
Entities, Attributes, and Relationships.
What is Normalisation?
The process of organising data to minimise redundancy and dependency issues by splitting it into smaller, related tables.
What is First Normal Form (1NF)?
Remove multivalued or repeating fields and ensure each column has atomic values.
What is Second Normal Form (2NF)?
Ensure the table is in 1NF and remove partial dependencies (dependent on only part of the composite key).
What is Third Normal Form (3NF)?
Ensure the table is in 2NF and remove transitive dependencies (non-key attributes dependent on other non-key attributes).
What is an update anomaly?
When the same information is expressed in multiple rows, meaning an update to the data in one row may not be reflected in all rows, leading to inconsistent data.
What is an Insertion Anomaly?
When certain facts cannot be recorded because the table structure requires specific data to be inserted first. For example, if a new product category needs to be added but no products in that category exist yet.
What is a Deletion Anomaly?
When deleting data representing certain facts inadvertently causes the loss of data representing entirely different facts. For example, if deleting a customer’s order removes the customer’s address information.
What does ACID stand for?
Atomicity: Ensures that all parts of a transaction are completed successfully; otherwise, the transaction is aborted and no changes are made.
Consistency: Guarantees that a transaction will bring the database from one valid state to another, maintaining all rules, constraints, and integrity.
Isolation: Ensures that the operations of a transaction are isolated from other transactions, preventing concurrency issues.
Durability: Ensures that once a transaction is committed, its changes are permanent, even in the event of a system failure.
Why might normalisation be used?
Reduce duplication.
Reduce chance of errors.
Reduce data loss.
Reduce storage.
Maintain integrity.
Simplify the model.
Avoid modification anomalies.