databases Flashcards
top down database design
create ER diagrams (graphical description): start with a set of system requirements, identify entities and attributes, then construct the relational data model, i.e. tables for entities
bottom up database design
start with initial tables for entities and their attributes, then redesign in a “better” way: trickier for larger databases, so a formalisation of the process is needed
purpose of normalization
- relations represent real world entities
- single valued columns
- avoid redundancy
- easier to update and maintain correctly without anomalies
no redundancy
each data item is stored only once:
- minimises space required
- simplifies maintenance
consequence of redundancy
- takes up unnecessary space
- when modifying data, have to modify in different places
- risk of inconsistencies
exception for redundancy
foreign keys: act as pointers
what causes redundancy
- set-valued attributes -> multiple rows in the corresponding table, e.g. more than one value of a particular attribute for a single entry
- dependency, e.g. town is functionally dependent on postcode
define redundancy
repeating data in multiple different locations
modification anomaly
failure to maintain all existing instances of a specific value
deletion anomaly
losing other values as a side effect of deleting data. e.g. deleting staff member and deleting info of their workplace simultaneously if unnormalised
insertion anomaly
when adding more data items, much more irrelevant data needs to be added. adding rows forces us to add info about other entities –> can lead to inconsistency
e.g. adding details of new surgery, with no staff, add null to staff fields
overcoming redundancy
schema refinement (decomposition): use 2+ relations to store the original data. can be done manually for smaller tables but formalisation needed for larger DB
functional dependencies specify…
- specify which are candidate, primary and foreign keys
- specify which attributes to combine in the new tables
simple key
key consists of only one attribute
composite key
key consists of several attributes
candidate key
minimal set of attributes whose values uniquely identify tuples.
functionally determines all attributes in a relation
primary key
the candidate key selected to identify rows uniquely within a table
what is functional dependency
describes relationship between two attributes in the same relation
if B is functionally dependent on A
Represent as A -> B
If we know attribute values for A, then we know the unique attribute values of B.
Each value of A is associated with exactly one value of B
determinant
set of attributes on the left hand side
dependent
set of attributes on the right hand side
full functional dependency
B is not functionally dependent on any proper subset of A, i.e. B depends on the whole of A (the whole composite primary key, not just individual parts)
partial functional dependency
B remains functionally dependent on at least one proper subset of A
transitive functional dependency
if A->B and B->C functional dependencies exist, then the functional dependency A -> C also exists = transitive
how to determine functional dependencies
either -obvious/common sense
or -require discussion/specification from customers
closure F+ (superscript +) of a set of functional dependencies F
set of all functional dependencies implied by dependencies in F. inference rules required to compute this
Armstrong’s axioms (complete and sound)
- reflexivity (if B is a subset of A, then A -> B)
- augmentation (A -> B then A, C -> B, C)
- transitivity (if A ->B and B-> C then A->C)
what does it mean by inference rules being complete
given a set X functional dependencies, all dependencies implied by X can be derived from X using these rules
what does it mean by inference rules being sound
no additional functional dependencies (not implied by X) can be derived from these rules
further rules derived from armstrong’s axioms
- decomposition: if A -> B, C then A -> B and A -> C
- union: if A -> B and A -> C then A -> B, C
- composition: if A -> B and C -> D then A, C -> B, D
proof of union rule using armstrong’s axioms
A-> B: augmentation with C: A, C -> B, C
A -> C: augmentation with A: A, A -> A, C ( = A -> A,C)
transitivity from above 2 dependencies: A -> B, C
pseudocode for computing the closure F+
F+ = F (initialisation)
repeat until F+ doesn't change any more:
  apply reflexivity and augmentation to each functional dependency f in F+, and add the new functional dependencies to F+
  for each pair f1, f2 in F+: if they imply a functional dependency f3 by transitivity, add f3 to F+
F = set of functional dependencies
A = set of attributes in a relation
what is A+
set of all attributes that can be implied by the attributes of A using functional dependencies from F
how to compute A+
similarly to F+: start with the attributes of A, then repeatedly apply the functional dependencies of F (via Armstrong's axioms), adding dependent attributes whose determinants are already included
candidate key in terms of relational dependencies
minimal set of attributes A such that A+ (under F) includes all attributes of the relation, i.e. A functionally determines every attribute
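The A+ computation above can be sketched in Python. The relation Staff(staffNo, name, postcode, town) and its dependencies are illustrative examples, not from a specific dataset.

```python
# Compute the attribute closure A+ under a set of functional dependencies F,
# then use it to test whether a set of attributes is a key.

def attribute_closure(attrs, fds):
    """attrs: set of attribute names; fds: list of (lhs, rhs) pairs of sets."""
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If the determinant is already in the closure, add the dependent.
            if lhs <= closure and not rhs <= closure:
                closure |= rhs
                changed = True
    return closure

# Illustrative relation Staff(staffNo, name, postcode, town)
fds = [
    ({"staffNo"}, {"name", "postcode"}),
    ({"postcode"}, {"town"}),
]

plus = attribute_closure({"staffNo"}, fds)
all_attrs = {"staffNo", "name", "postcode", "town"}
is_key = plus == all_attrs  # staffNo determines every attribute, so it is a key
```

Note town enters the closure transitively: staffNo -> postcode and postcode -> town.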
1NF
- no repeating groups i.e. an attribute that occurs with multiple values for a single occurrence of the primary key
- no identical rows
why not repeat columns horizontally to get into 1NF
- waste of space, not all column spaces used
- harder to query and reference specific columns
- need a fixed upper limit on number of repetitions
why not one long string with all attributes for 1NF
-harder to query
1NF: one table solution
-repeat data for each attribute so each belongs to one entry, but redundant data (i.e. fill in blanks/flatten the table)
1NF: two table solution
-place repeating data in a separate relation, with original primary key as a foreign key. iterate until no repeated groups remain
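A minimal sketch of the two-table solution, using SQLite via Python; the Staff/phone-number example is hypothetical. A staff member with several phone numbers is a repeating group, so the repeating data moves to its own relation with the original primary key as a foreign key.

```python
import sqlite3

# 1NF two-table solution: repeating phone numbers go into StaffPhone,
# keyed by (staffNo, phone), with staffNo referencing Staff.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Staff (
        staffNo INTEGER PRIMARY KEY,
        name    TEXT NOT NULL
    );
    CREATE TABLE StaffPhone (
        staffNo INTEGER REFERENCES Staff(staffNo),
        phone   TEXT NOT NULL,
        PRIMARY KEY (staffNo, phone)   -- one row per phone number
    );
    INSERT INTO Staff VALUES (1, 'Ann');
    INSERT INTO StaffPhone VALUES (1, '0111'), (1, '0222');
""")
rows = conn.execute(
    "SELECT phone FROM StaffPhone WHERE staffNo = 1 ORDER BY phone"
).fetchall()
conn.close()
```

Each phone number now occupies one single-valued cell, with no repeating group in Staff.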
2NF
- 1NF
- no partial functional dependencies: every non-key attribute is fully functionally dependent on the whole of the (possibly composite) primary key
How to place in 2NF
-remove partially dependent attributes and place in separate relation with the copy of their determinant
3NF
- 2NF
- no transitive functional dependencies: no non-key attribute is transitively dependent on the primary key: all attributes are dependent on the key, the whole key and nothing but the key
How to place in 3NF
-remove transitively dependent attributes and place in new relation. take the attributes of their determinant as the primary key in the new table
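The 3NF step can be sketched with the postcode/town dependency mentioned earlier (staffNo -> postcode, postcode -> town); the table contents are made up for illustration.

```python
import sqlite3

# 3NF decomposition: town is transitively dependent on staffNo, so it moves
# to a new relation keyed by its determinant (postcode).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Postcode (
        postcode TEXT PRIMARY KEY,
        town     TEXT            -- town is stored once per postcode
    );
    CREATE TABLE Staff (
        staffNo  INTEGER PRIMARY KEY,
        name     TEXT,
        postcode TEXT REFERENCES Postcode(postcode)
    );
    INSERT INTO Postcode VALUES ('AB1', 'Aberdeen');
    INSERT INTO Staff VALUES (1, 'Ann', 'AB1'), (2, 'Bob', 'AB1');
""")
# Two staff share the postcode, yet 'Aberdeen' is stored only once;
# the join recovers the original flat view.
towns = conn.execute("""
    SELECT s.name, p.town FROM Staff s
    JOIN Postcode p USING (postcode) ORDER BY s.staffNo
""").fetchall()
conn.close()
```

Updating a town now means changing one Postcode row, avoiding the modification anomaly.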
how to find primary key
- intuitive observations from data
- computing functional dependency closure
how to depict functional dependencies
-dependency diagram
Limitations of file based approach
- Data duplication in different programs: different values/formats/waste of space
- Incompatibility: programs written in different languages find it difficult to access each other's files
- Data dependencies: structure of the file is defined in the code of the program
- Separation and isolation of data: programs maintain their own data and users may be unaware of useful data in other programs
- Limited queries: each program has a defined purpose, for different queries, more application programs have to be designed
- Hard to recover when crashes occur
- Inefficient searching large data sets
- Many users: controlling simultaneous access requires locking, and sensitive data can't be hidden from some users = inefficient
- Not synchronised: inconsistencies+lost updates
reason for limitations of file based approach
- definition of data is embedded in the application program (instead of stored separately and independently)
- no central control over the access/manipulation of data
database
a large collection of logically related data, designed to meet the needs of an organisation:
- single repository of data
- minimum duplication
- large databases have a data dictionary (metadata repository i.e. data describing the data)
DBMS
a software system that enables the user to define/create/maintain/control access to the DB
basic features of DBMS
- DDL (data definition language): defines the database- specify the structure, data types + constraints
- DML (data manipulation language): allows users to insert/delete/update/retrieve data from the database. The query language is the part that retrieves data: unlimited queries (advantage), and a good DBMS can answer SQL queries efficiently
DBMS allows controlled access to the DB
- security system
- concurrency control
- recovery control
DB application program
computer program that interacts with the user + DBMS. Sends SQL statements to the DBMS + can be conventional (local) or online applications
Components of DBMS environment
- hardware: PC or computer network
- software: DBMS, application program, OS (network software if necessary)
- data: used by the organisation. schema = abstract description of the data
- procedures: documented instructions on how to use/run the system e.g. how to log on/use an application/make backups
- person: anyone involved in the system
people in a DBMS environment
- DB designers
- DB admins: maintenance of DBMS+OS, security+integrity control, satisfactory performance of user applications
- application developers
- end user
DB designers
- Logical designer: what to store- entities, attributes, relationships+constraints
- Physical designer: maps logical designs to tables+integrity constraints, selects access methods+storage structures
advantages of databases
- shared data
- concurrency control
- better data backup/recovery procedures
- consistency
- less redundancy
- economy of scale
- easier data access
- data security+integrity
- faster development of new applications
disadvantages of databases
- more hardware
- complexity/size of system
- high cost of implementation of DBMS
- slow processing of some applications
- high impact of failure
need to trust a DBMS so have mechanisms to ensure the database is
-reliable
-always in a consistent state
especially when hardware/software failures and multiple users access the database simultaneously
database recovery
process of restoring database to a correct state after a failure
concurrency control protocols
prevent database accesses interfering with each other
transaction
an action carried out by a single person/program which reads/updates the database
during a transaction vs after a transaction
database may not be in a consistent state vs database in a consistent state + valid integrity/referential constraints
transaction outcomes
committed- completes successfully
rollback- not completed successfully
(performed entirely or not at all)
concurrency control
process of managing simultaneous operations on the DB without having them interfere with each other
operations may be correct individually but cause inconsistency when executed simultaneously
how is DBMS different from multi user OS
- an OS allows two people to edit a document at the same time, but if both try to write simultaneously -> one's changes get lost; a DBMS coordinates such simultaneous operations
issue with DDL
too low level to describe data organisation in a simple way for users: so have data model
data model
a collection of intuitive concepts describing entities, relationships + constraints
three characterisations of data
- structured
- semi structured
- unstructured
-structured data
data is represented in a strict format (relational data model)
the DBMS checks that the data follows the structures and integrity/referential constraints specified in the schema
-semi structured data
self describing data, schema info is mixed in with the data values
ad hoc collected data, when not known how it will be stored/managed in advance
may have some structure but not all data has the same structure, each data object has different attributes not known in advance
-unstructured data
very little indication of the type/structure of data
e.g. text document with some info in it
e.g. web doc with some HTML in it
relational data model
relations = tables
attributes = columns
tuples = rows
issue with relational data model
too low level for big companies so express in a non-technical way: E-R model
E-R model
top down approach to DB design, graphical description
represents entities, attributes and relationships and constraints on these
uses UML notation + crow's foot notation
3-Level ANSI-SPARC architecture
external level: user’s view of data
conceptual level: logical structure of data as seen by DB administrator
internal level: physical representation of data: algorithms, data structures
DB schema
abstract description of the collection of data in a DB:
defines the relations, the domain for each attribute, and specifies integrity constraints
DB instance
data at a particular moment
objectives of DB schema
all users have access at every point to the same DB instance w/ customised views of parts of data
data independence
upper level of DB schema unaffected by changes to lower levels
e.g. physical data independence: if internal schema changed- conceptual level unchanged, and users only notice a change in performance
three main phases of database design
conceptual design
logical design
physical design
conceptual design
acts as fundamental understanding of DB
high level graphical representation via ER diagrams based on the user requirements specification (completely independent of physical considerations)
logical design
create relational data model using conceptual design then normalise to eliminate redundancy/anomalies
physical design
describe database implementation of logical design:
- specific storage structures/access methods/security protection
- aim is optimum performance
classification of design phases into 3-level ANSI-SPARC
external schema + conceptual schema =logical/conceptual database design
internal schema+physical storage =physical DB design
relational database
collection of normalised relations
a relation is ‘normalised’ if
it is appropriately structured i.e. one value per cell and no repeating rows
domain
set of allowable values for an attribute
degree of a relation
number of attributes
cardinality of a relation
number of tuples
tuple
row of a relation with concrete values for each of the attributes
attribute
named column of a relation: each with a unique name- properties/common characteristics shared by all entity instances of an entity type
relation
table with rows and columns
how does this differ from mathematical relations
ordering of attributes doesn’t matter
foreign key
attribute in one table A whose values must match the primary key of another table B (or be null) then A references B
entity integrity
no attribute of a primary key can be null:
ensures every entity has a unique identifier and foreign key values can reference primary key values
referential integrity
ensures a foreign key value matches a primary key value from another relation or is null:
ensures references between tables are valid and prevents deleting a row in a table if there is a foreign key in another relation referencing it
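Both behaviours can be demonstrated in SQLite (which, note, only enforces foreign keys when `PRAGMA foreign_keys = ON` is set per connection); the Branch/Staff tables are hypothetical.

```python
import sqlite3

# Referential integrity: reject a foreign key with no matching primary key,
# and refuse to delete a row that is still referenced.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite requires this per connection
conn.executescript("""
    CREATE TABLE Branch (branchNo INTEGER PRIMARY KEY);
    CREATE TABLE Staff (
        staffNo  INTEGER PRIMARY KEY,
        branchNo INTEGER REFERENCES Branch(branchNo)
    );
    INSERT INTO Branch VALUES (10);
    INSERT INTO Staff VALUES (1, 10);
""")
try:
    conn.execute("INSERT INTO Staff VALUES (2, 99)")  # no branch 99 exists
    insert_ok = True
except sqlite3.IntegrityError:
    insert_ok = False
try:
    conn.execute("DELETE FROM Branch WHERE branchNo = 10")  # still referenced
    delete_ok = True
except sqlite3.IntegrityError:
    delete_ok = False
conn.close()
```

Both statements fail with an integrity error, as referential integrity requires.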
types of relations
base relations: physically stored in the database
view (virtual relation): not physically stored in the database but derived from content of base relations. used to show customised info to every user and calculate dynamic quantities. computed upon request from user and changes when base relations change.
alternatives to relational data model
- network data model: tuples modelled as nodes and relationships (foreign keys) as edges
- hierarchical data model: type of NDM where the graph is a tree and the structure models a 'parent-child' relationship
(limitations: deleting parent or adding record without a parent)
objective of ER model
ER model (graphical representation of DB):
understand nature and relationships among data
help derive tables in RDM
entity
data objects: real thing (or abstract notion) that we recognise as a separate concern within the database
entity type
group of all objects with the same properties, identified as having an independent existence
entity occurrence
a uniquely identifiable instance of an entity type
relationship
named association between two entity types which has some meaning in the context of the database
cardinality of a relationship
the number of entity occurrences that are related to a single occurrence of an associated entity type through this relationship
entity participates 'optionally' (indicated by O on the other side of the relationship line)
it has partial participation; otherwise it has total participation (indicated by a line on the other side of the relationship line)
simple attribute
consists of a single component- can’t be broken down further
composite attribute
multiple components: can be subdivided into smaller components
single valued attribute
multi valued attribute
derived attribute
- only holds one value
- it can hold many values
- it can be derived from the value of a related attribute
literals
constants used in SQL statements
SQL statement syntax notation: {a}, [a], …, |
{a} = required element
[a] = optional element
… = optional repetition
| = choice
SQL is ‘free format’
free format = parts of statements don’t need to be written on specific parts of the screen
types of query languages
SQL: formal definition of a new relation (which data is to be retrieved) from existing DB relations
relational algebra: how to build a new relation from existing DB relations
‘WHERE’
‘SELECT DISTINCT’
testing for null
filters rows according to some condition
eliminates duplicates
WHERE comment IS NULL
% vs _
% = wildcard, represents an arbitrary sequence of 0 or more characters
_ = represents an arbitrary single character
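The clauses above can be exercised together in SQLite; the Viewing table and its rows are invented for illustration.

```python
import sqlite3

# WHERE filters rows, SELECT DISTINCT removes duplicates, IS NULL tests for
# missing values, and LIKE uses % (any sequence) and _ (single character).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Viewing (propertyNo TEXT, comment TEXT);
    INSERT INTO Viewing VALUES
        ('PA14', 'too small'), ('PA14', NULL), ('PG4', 'too remote');
""")
no_comment = conn.execute(
    "SELECT propertyNo FROM Viewing WHERE comment IS NULL"
).fetchall()
distinct = conn.execute(
    "SELECT DISTINCT propertyNo FROM Viewing ORDER BY propertyNo"
).fetchall()
# 'PA__' matches PA followed by exactly two characters
like_rows = conn.execute(
    "SELECT DISTINCT propertyNo FROM Viewing WHERE propertyNo LIKE 'PA__'"
).fetchall()
conn.close()
```

Note that `comment = NULL` would match nothing: null comparisons require IS NULL.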
aggregate function
operates on a single column and returns a single numerical value
GROUP BY
partitions data into groups, producing single summary row
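A sketch of aggregates with GROUP BY, using a hypothetical Staff table: the aggregate collapses each group's column to one value, and GROUP BY produces one summary row per group.

```python
import sqlite3

# GROUP BY partitions the rows by branch; COUNT and SUM then summarise
# each partition into a single row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Staff (staffNo INTEGER, branchNo TEXT, salary INTEGER);
    INSERT INTO Staff VALUES
        (1, 'B3', 12000), (2, 'B3', 18000), (3, 'B5', 9000);
""")
per_branch = conn.execute("""
    SELECT branchNo, COUNT(*), SUM(salary)
    FROM Staff
    GROUP BY branchNo
    ORDER BY branchNo
""").fetchall()
conn.close()
```

Three staff rows collapse to two summary rows, one per branch.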
types of subquery
- single-value/scalar subquery: single column and single row
- multiple-value subquery: single column and multiple rows
- table subquery: multiple rows and columns
nested queries
- scalar subquery: 'WHERE branchNo = (SELECT …)'
- multivalued subquery: 'WHERE branchNo IN (SELECT …)'
multivalued subquery
-ANY or SOME before subquery so WHERE is true if satisfied by at least one value returned by subquery and ALL if satisfied by all values returned by subquery
'WHERE salary > SOME (SELECT …)'
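The subquery forms can be sketched in SQLite with made-up Staff/Branch tables. One caveat: SQLite does not support the SOME/ANY/ALL quantifiers, so 'salary > SOME (…)' is rewritten with MIN, which is equivalent (true when the salary exceeds at least one, i.e. the smallest, value).

```python
import sqlite3

# Nested queries: scalar subquery, IN with a multivalued subquery,
# and a quantified comparison rewritten via MIN.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Staff (staffNo INTEGER, branchNo TEXT, salary INTEGER);
    INSERT INTO Staff VALUES
        (1, 'B3', 12000), (2, 'B3', 18000), (3, 'B5', 9000);
    CREATE TABLE Branch (branchNo TEXT, city TEXT);
    INSERT INTO Branch VALUES ('B3', 'Glasgow'), ('B5', 'London');
""")
# scalar subquery: single column, single row
glasgow = conn.execute("""
    SELECT staffNo FROM Staff
    WHERE branchNo = (SELECT branchNo FROM Branch WHERE city = 'Glasgow')
    ORDER BY staffNo
""").fetchall()
# multivalued subquery with IN: single column, multiple rows
any_city = conn.execute("""
    SELECT staffNo FROM Staff
    WHERE branchNo IN (SELECT branchNo FROM Branch) ORDER BY staffNo
""").fetchall()
# '> SOME' rewritten: earns more than at least one B3 salary
above_some = conn.execute("""
    SELECT staffNo FROM Staff
    WHERE salary > (SELECT MIN(salary) FROM Staff WHERE branchNo = 'B3')
""").fetchall()
conn.close()
```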
alias for a column
in the FROM clause leave a space and write alias for column
inner join vs natural inner join
natural inner join demands equality for columns with the same name in the two tables, e.g. a.salary = b.salary
an inner join without 'natural' can specify arbitrary conditions between column values
inner join vs outer join
inner join: omits all rows that don’t satisfy join conditions
outer join: retains (some of) the rows that don’t satisfy the join conditions
(left outer join: retains rows of the left table that are unmatched with rows from right table.
full outer join: retains unmatched rows from both tables (fills necessary fields with NULL)
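The inner/outer distinction can be seen in SQLite with a hypothetical branch that has no staff: the inner join drops it, the left outer join keeps it with NULL (None in Python) filling the unmatched column.

```python
import sqlite3

# Inner join omits the unmatched branch B7; left outer join retains it.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Branch (branchNo TEXT PRIMARY KEY);
    CREATE TABLE Staff (staffNo INTEGER, branchNo TEXT);
    INSERT INTO Branch VALUES ('B3'), ('B7');
    INSERT INTO Staff VALUES (1, 'B3');
""")
inner = conn.execute("""
    SELECT b.branchNo, s.staffNo FROM Branch b
    INNER JOIN Staff s ON b.branchNo = s.branchNo
""").fetchall()
outer = conn.execute("""
    SELECT b.branchNo, s.staffNo FROM Branch b
    LEFT OUTER JOIN Staff s ON b.branchNo = s.branchNo
    ORDER BY b.branchNo
""").fetchall()
conn.close()
```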
database updates
INSERT INTO TableName [(columnList)] VALUES (dataValueList)
UPDATE TableName SET columnName1 = dataValue1 [, columnName2 = dataValue2 …] [WHERE searchCondition]
DELETE FROM TableName [WHERE searchCondition]
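The three update statements in action, via SQLite on a made-up Staff table:

```python
import sqlite3

# INSERT adds rows, UPDATE modifies rows matching the WHERE condition,
# DELETE removes rows matching the WHERE condition.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Staff (staffNo INTEGER PRIMARY KEY, salary INTEGER)")
conn.execute("INSERT INTO Staff (staffNo, salary) VALUES (1, 10000), (2, 20000)")
conn.execute("UPDATE Staff SET salary = salary + 1000 WHERE staffNo = 1")
conn.execute("DELETE FROM Staff WHERE staffNo = 2")
remaining = conn.execute("SELECT staffNo, salary FROM Staff").fetchall()
conn.close()
```

Omitting the optional WHERE clause would update or delete every row, which is a common source of accidents.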
DDL basic commands
CREATE table: assign name to table and defines names and domains to each column in table
ALTER table: amends relation schema/table structure because of a design error or design change
DROP table: deletes a table
Specify integrity + referential constraints: -PRIMARY KEY -FOREIGN KEY
Create view- creates virtual table derived from base relation according to query to provide user with a virtual relation
BIT(n)
INTEGER
SMALLINT
NUMERIC(p,d)
BIT(n)- bit string of length n
INTEGER- large +ve or -ve integer
SMALLINT- small +ve or -ve integer
NUMERIC(p,d)- +ve/-ve decimal number with at most p precision (num digits) and d scale (num digits after .)
define custom domain types + with constraints
CREATE DOMAIN Name AS VARCHAR(10);
CREATE DOMAIN SEX AS CHAR(1)
CHECK (VALUE IN (‘M’, ‘F’))
or replace bracket (‘M’, ‘F’) with a SELECT statement
view
relation that depends on other relations and isn’t physically stored as a table
query language is relationally complete if
it can be used to produce any relation that can be created from relation calculus/algebra expressions
teleprocessing architecture
one computer with single CPU:
many end terminals cabled to the central computer; all processing done in the central computer: receives requests, processes them and responds with information
performance decreases as the burden on the central computer grows -> 'downsizing': replace with a more cost-effective PC network giving the same or better performance