transactions Flashcards
transactions
A DBMS uses the concept of a Transaction
* A transaction is a logical unit of work which completes in its entirety
or not at all
* Aim is to keep the database consistent
* A transaction can contain any number of database operations using
the SQL we’ve previously introduced in Part 1
* Can be read (SELECT) or write (INSERT INTO/UPDATE/DELETE FROM)
operations
A transaction can begin automatically when a piece of SQL is executed (or
explicitly, e.g. with START TRANSACTION in MySQL)
* They end with
* COMMIT – commit the changes and successfully end
* ROLLBACK – an error has occurred so undo everything from the transaction
* Programmatic SQL – normal termination of the program is a COMMIT
* Programmatic SQL – abnormal/error termination is a ROLLBACK
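As a minimal sketch in MySQL-style SQL (the accounts table and its columns
are made up for illustration):

START TRANSACTION;
-- Transfer 100 between two accounts: either both updates take effect or neither does
UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;
UPDATE accounts SET balance = balance + 100 WHERE account_id = 2;
COMMIT;   -- make the changes permanent
-- If an error had occurred part way through, we would instead issue:
-- ROLLBACK;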
ACID properties of a transaction
Atomicity – completes in its entirety or not at all
* Consistency – each transaction must take the database from one consistent
state to another
* DBMS and application developers need to ensure this
* Isolation – each transaction executes independently of others
* Durability – changes made by a transaction must persist
* Recovery system must ensure this
OLTP
- Online Transaction Processing
- Many DBs are examples of OLTP systems
- Quick, real-time access to data to read or modify it
- The alternative is OLAP (Online Analytical Processing), which involves
fewer, more intensive transactions
problems with transactions
The following video shows some problems that modern multi-user
databases can encounter
* We focus on three different issues
* The Lost Update Problem
* The Dirty Read or Uncommitted Dependency Problem
* The Inconsistent Analysis Problem
* In it, our two users (who we shall refer to as HP1 and HP2) are trying to
perform operations on their expenses system and on their product
database
Making all transactions run serially would fix these problems, but there are
performance issues if only one transaction can ever run at a time
* Read operations can be run in parallel
* If a transaction is writing to one part of the database, other transactions
can still access the parts not affected by the update
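For instance, the lost update problem above can be sketched in SQL like this,
assuming a hypothetical stock table and two sessions (HP1 and HP2)
interleaving as shown:

-- HP1: reads the quantity for product 7 and sees 10
SELECT quantity FROM stock WHERE product_id = 7;
-- HP2: reads the same row and also sees 10
SELECT quantity FROM stock WHERE product_id = 7;
-- HP1: sells 3, so writes back 10 - 3 = 7
UPDATE stock SET quantity = 7 WHERE product_id = 7;
-- HP2: sells 5, so writes back 10 - 5 = 5, overwriting HP1's change
UPDATE stock SET quantity = 5 WHERE product_id = 7;
-- HP1's update is lost: the quantity should now be 2, but the table shows 5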
locking
The DBMS needs to come up with a serializable schedule – an interleaving of
operations that gives the same result as running the transactions one after
another
* The solution to these issues is locking – transactions lock part of the DB
before updating
* Shared lock (read only) or exclusive lock (read or write)
* Can still be a problem if a transaction performs an unlock then locks
again later
Two Phase Locking (2PL) – unlock operations only performed after all locks
have been acquired
* Deadlock – it is possible that two transactions are each waiting for the other
to release a lock
* Can be resolved by the DBMS
* Timeouts (transaction rolls back after an amount of time)
* Deadlock detection (roll back the transaction which would cost the least to stop)
* Deadlock prevention (try to look for the problem in advance – not so common as this is
quite tricky)
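A sketch of how these locks can be requested explicitly in MySQL/InnoDB,
reusing the hypothetical stock table from above (behaviour is assumed to
follow InnoDB's defaults):

START TRANSACTION;
-- Shared (read) lock: other transactions can still read this row but not modify it
SELECT quantity FROM stock WHERE product_id = 7 LOCK IN SHARE MODE;
COMMIT;

START TRANSACTION;
-- Exclusive (write) lock: the row stays locked until this transaction commits or rolls back
SELECT quantity FROM stock WHERE product_id = 7 FOR UPDATE;
UPDATE stock SET quantity = quantity - 3 WHERE product_id = 7;
COMMIT;

-- Timeout approach to deadlock: a transaction waiting longer than this many
-- seconds for a lock gives up with an error
SET innodb_lock_wait_timeout = 10;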
distributed databases
Many databases are now distributed – this means the data is stored
in different physical locations but the DBMS must make this invisible
to the end users
* Data is allocated to different disks in different places
* More storage capability
* More possible performance issues
transactions and CAP theorem
C – Consistency. Every node sees the same, up-to-date data. Transactions
help ensure this by rolling back if an error would leave the database in
an inconsistent state
* A – Availability. Every query request gets a response
* P – Partition Tolerance. Distributed database can cope with network
failures, e.g. failures of individual nodes in a distributed DB and/or
network delays
CAP Theorem states that we can only expect to have two of these
three aspects at one time
* Partition tolerance is a necessity for a distributed DB, so designers
have to trade off between availability and consistency
* Do we sometimes not get a response due to network issues?
* Or do we accept the fact that data might not be 100% up to date?
* Also known as Brewer’s Theorem
logging and archiving
Transactions are the unit of work used when recovering from major problems
* The database keeps a log of all the transactions made in the order they
happened
* This includes the before and after values
* Checkpoints are made periodically, noting which transactions are running,
which have committed, etc. Transactions are suspended while the checkpoint is
made. All transactions shown as committed are fully complete and written to disk
* The database is also regularly archived by its administrator onto offline
storage
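In MySQL, for example, the binary log is one such ordered record of changes
(InnoDB also keeps internal redo/undo logs); an administrator can inspect it
with statements like the following (the log file name is just an example):

-- List the binary log files the server currently holds
SHOW BINARY LOGS;
-- Look at the first few logged events in one of those files
SHOW BINLOG EVENTS IN 'binlog.000001' LIMIT 10;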
recovery
Database systems can be hit by a number of issues, for example:
* Sudden crash or loss of power to servers (as per the video)
* Hardware problems, e.g. corrupt/broken disks
* Flood/fire, etc. in the server room
* Accidental issues, e.g. bug in a program accessing the DB, user doing
something silly
* Malicious damage
The log file is very important here as it shows what transactions the
database has done
* Any transactions that were incomplete at the time of failure need to
be undone – these will have no COMMIT or ROLLBACK in the log
* Completed transactions will need to be redone in case the DBMS had
not finished writing the changes to disk
* Transactions marked as completed at the last checkpoint don’t need
to be redone
The log file has to be stored/archived on a different storage medium
to the database
* The archive has to be stored somewhere other than the server
location
* Depending on the problem, the database may need to be reloaded
from the previous archive before the log file entries are redone
indexing
An index is just like an index in a book
* Without an index, a DBMS such as MySQL would need to search a whole
table
* The DBMS can use the index to jump straight to the relevant data without
searching through it all
* Indexes must be set up by the database administrator
* MySQL uses a data structure called a B-Tree to store indexes
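For example, in MySQL (the customer table and surname column are made up
for illustration):

-- Create a (B-Tree) index on a column that is often searched
CREATE INDEX idx_customer_surname ON customer (surname);
-- EXPLAIN shows whether MySQL plans to use the index for a given query
EXPLAIN SELECT * FROM customer WHERE surname = 'Smith';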
Clustered indexes relate to how the data is stored on disk. Given that rows
can only be stored in one order, there can only be one clustered index per
database table
* Non-clustered indexes contain a pointer to the data row itself. The pointer
from the index row to the data row is called the row locator
* The primary key is usually the clustered index and some DBMS make this
happen automatically
* Some DBMS also do this to foreign keys as these are often used in queries to
join tables
Generally, it’s best not to index small tables as it’s faster to scan the table
* Indexes need to be updated when the table is updated, so it's best not to
index tables where the data changes very frequently
* Can index columns which often appear in GROUP BY, ORDER BY or
aggregate operators
* Therefore, it is a good idea to have an idea of what queries are particularly
common – recall the importance of requirements analysis
* DBeaver has an “Index” tab for each table which lets you see the current
indexes and allows you to create, modify and delete indexes
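The same operations are also available directly in SQL; a small sketch (the
orders table and order_date column are made up):

-- Index a column that frequently appears in ORDER BY / GROUP BY clauses
CREATE INDEX idx_order_date ON orders (order_date);
-- List the current indexes on the table
SHOW INDEX FROM orders;
-- Remove an index that is no longer useful
DROP INDEX idx_order_date ON orders;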