Database Concept Flashcards
Design a relational database
two sub-systems: storage (file system) and management software
management software:
- storage manager (block read/write)
- cache
- SQL parser
- log system (for undo/redo)
- authorization management (roles)
- disaster recovery system
- indexing (fast reads)
- lock management (for concurrency)
why use an index
Indexes speed up data retrieval and query processing by giving swift access to the matching table rows, so the engine does not have to scan all of the table's data to find the requested records.
what column(s) should be indexed
- The column (or combination of columns) has a high degree of uniqueness (high selectivity).
- You frequently need to look up a certain value, or a range of values, in the column.
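A minimal sketch (table and column names are hypothetical): an index on a high-selectivity, frequently searched column pays off; one on a low-selectivity column usually does not.

-- users.email is nearly unique and often searched: a good candidate
CREATE INDEX idx_users_email ON users (email);
-- users.gender has only a few distinct values: usually a poor candidate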
index’s data structure
B+ tree
other candidates: binary search tree, B-tree, hash table
dense index and sparse index
Of MySQL's two common storage engines, MyISAM uses only sparse indexes, while InnoDB has exactly one dense (clustered) index: if a primary key is defined, it is the dense index; if not, the first unique index whose columns are all NOT NULL is chosen; if neither exists, InnoDB generates a hidden primary key for us.
In a dense index, there is an index record for every search key value in the database. This makes searching faster but requires more space to store the index records themselves. Each index record contains a search key value and a pointer to the actual record on disk.
In a sparse index, index records are not created for every search key. An index record here contains a search key and a pointer to the data on disk. To search for a record, we first locate the nearest index record and follow it to the actual location of the data. If the data we are looking for is not where the index entry points directly, the system does a sequential scan from there until the desired data is found.

Why not use a binary search tree to store the index
Each node has only two children, so a lookup touches O(log2 N) nodes, and each node visit can cost a separate disk I/O. For millions of rows that is dozens of random reads, which is too much, and can even be slower than a sequential full-table scan.
B+ tree as index’s data structure
- a non-leaf node's number of children equals its number of keys (in MySQL's B+ tree variant)
- a non-leaf node's child pointer P[i] points to the sub-tree whose keys fall in [K[i], K[i+1])
- non-leaf nodes store only keys (the index); all data records live in leaf nodes
- every leaf node has a pointer to the next leaf node

Why use B+ tree as the DS for index?
- low I/O cost: internal nodes contain only keys, so one bulk page read brings in many keys, which reduces the total number of reads
- search time is stable (every lookup descends the same tree height)
- better for range and full-table scans (just walk the linked leaves)
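A rough back-of-envelope shows why the I/O cost is low. Assuming InnoDB's default 16 KB page, roughly 1200 child pointers per internal node, and about 100 records per leaf page (typical, purely illustrative figures):
- a 3-level B+ tree holds about 1200 × 1200 × 100 ≈ 144 million rows, reachable in at most 3 page reads (often fewer, since the top levels stay cached)
- a binary search tree over the same data would be about log2(144,000,000) ≈ 27 levels deep, i.e. up to 27 random reads per lookup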
Why not use hash index for database?
- lookup time complexity is actually better (O(1) on average)
- however, it supports only '=' lookups, not range queries
- it does not preserve key order, so ordered results need an extra sort every time
- queries on a prefix of a composite key are awkward for hash-based methods, since the whole key is hashed together
- hash collisions cause unnecessary extra row lookups
- heavy hash collisions degrade time efficiency
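To see the range limitation concretely, a hedged illustration using MySQL's MEMORY engine, which supports true hash indexes (table and column names are made up):

CREATE TABLE lookup_demo (
  id INT,
  val VARCHAR(20),
  KEY idx_id (id) USING HASH
) ENGINE = MEMORY;

SELECT * FROM lookup_demo WHERE id = 42;              -- can use the hash index
SELECT * FROM lookup_demo WHERE id BETWEEN 10 AND 50; -- cannot: hash keeps no order, falls back to a scan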
Bitmap for index?
yes, Oracle uses it.
not good for high concurrency; thus it is an option for OLAP, but not for OLTP
How to optimize a slow query
- identify slow queries through the slow query log
- use a tool such as EXPLAIN / EXPLAIN ANALYZE to analyze the query
- revise the query so that it exploits the indexes better
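For the first step, a minimal sketch of turning on the slow query log at runtime (the 1-second threshold is illustrative):

SET GLOBAL slow_query_log = 'ON';    -- start logging slow queries
SET GLOBAL long_query_time = 1;      -- log statements slower than 1 second
SHOW VARIABLES LIKE '%slow_query%';  -- confirm log state and file location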
Is using the primary key always faster than using other indexes?
No.
For example, for count(id) MySQL will often not use the id (primary key) index to aggregate; it will choose a sparse (secondary) index instead, because a sparse index stores only the key and a pointer in its nodes, so each batch read brings in more records, which in turn speeds up the query.
Script commands related to tuning/optimizing queries
- show variables like '%query%' // find slow_query related variables
- show status like '%slow_queries%' // the number of slow queries executed within this session
- explain <query> // if type = ALL or index: needs improvement | if Extra = Using filesort or Using temporary: the index can't be used
If the above steps reveal a slow query, rewrite it to filter on an indexed field, or alter the table to index the field being filtered on.
Sometimes MySQL will choose a non-primary-key index on its own to achieve better performance.
If you want to force the query to use a particular index:
select count(id) from table1 force index(primary)
Composite Index
Like a single-column index, a composite index is a data structure of records sorted on something. But unlike a single-column index, that something is not one field but a concatenation of multiple fields.
Key-related SQL commands
PRIMARY KEY (`id`),
UNIQUE KEY `account` (`account`),
KEY `index_area_title` (`area`, `title`),
KEY `idx_name` (`name`)
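In context, these clauses would sit inside a CREATE TABLE statement; a sketch with assumed column definitions:

CREATE TABLE demo (
  id INT NOT NULL AUTO_INCREMENT,
  account VARCHAR(32) NOT NULL,
  name VARCHAR(64),
  area VARCHAR(32),
  title VARCHAR(128),
  PRIMARY KEY (`id`),
  UNIQUE KEY `account` (`account`),
  KEY `index_area_title` (`area`, `title`),
  KEY `idx_name` (`name`)
) ENGINE = InnoDB;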
Left-prefix rule of composite index
The left-prefix rule says a query can search an index efficiently if it provides search terms (such as values in a WHERE clause) that match the leading columns of the index, in left-to-right order.
MySQL matches the index columns from left to right and stops at the first range operator (>, <, BETWEEN, LIKE).
key(a,b,c,d): WHERE a=1 AND b=2 AND c>5 AND d=4 uses the index only up to c; d=4 cannot use it
key(a,b,d,c): WHERE a=1 AND b=2 AND c>5 AND d=4 can use the index for all four predicates
(MySQL moves the equality d=4 ahead of the range c>5 to fit the index)
predicates using IN and = may appear in any order
example:
table: KEY `mycomkey` (A,B)
query 1: select * from table1 where A=1 and B=2
query 2: select * from table1 where A=1
query 3: select * from table1 where B=2
Queries 1 and 2 will use the index mycomkey; query 3 will do a full table scan, because of the left-prefix rule: B=2 without A will not trigger the index.
The order of A and B inside the WHERE clause does not matter; MySQL reorders them to fit the index.
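EXPLAIN confirms which queries hit the index; a sketch (exact output varies with data volume and MySQL version):

-- assuming: CREATE TABLE table1 (A INT, B INT, KEY mycomkey (A, B));
EXPLAIN SELECT * FROM table1 WHERE A = 1 AND B = 2;  -- key: mycomkey
EXPLAIN SELECT * FROM table1 WHERE A = 1;            -- key: mycomkey (left prefix)
EXPLAIN SELECT * FROM table1 WHERE B = 2;            -- key: NULL (full table scan)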
are more indexes always better
no.
- Small tables should not be indexed: the overhead of building and storing the index exceeds the efficiency gain over a full table scan.
- Updating the table also updates its indexes, so more indexes mean more cost in maintaining the index system.
- More indexes mean more storage cost.
What is the difference between MyISAM's and InnoDB's internal locking
MyISAM's default is table-level locking; it does not support row-level locking.
InnoDB's default is row-level locking, and it also supports table-level locking.
read lock
A READ lock has the following features:
- A READ lock for a table can be acquired by multiple sessions at the same time. In addition, other sessions can read data from the table without acquiring the lock.
- The session that holds the READ lock can only read from the table, not write to it. Other sessions cannot write to the table until the READ lock is released; their write operations wait until then.
- If the session is terminated, either normally or abnormally, MySQL will release all the locks implicitly. This feature is also relevant for the WRITE lock.
a read is normally a shared lock, but you can force an exclusive lock with FOR UPDATE:
select * from table1 where id between 1 and 300 for update
write lock
A WRITE lock has the following features:
- Only the session that holds the WRITE lock can read and write data from the table.
- Other sessions can neither read from nor write to the table until the WRITE lock is released.
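A minimal sketch of taking and releasing explicit table locks (table name assumed):

LOCK TABLES table1 READ;   -- this and other sessions may read; all writes block
-- ... read-only queries ...
UNLOCK TABLES;

LOCK TABLES table1 WRITE;  -- only this session may read or write table1
-- ... read/write queries ...
UNLOCK TABLES;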
script to add a shared lock
select … lock in share mode
What cases is MyISAM suitable for
- frequent full-table count(*) (MyISAM stores the exact row count)
- read-intensive, write-light workloads
- no transactions needed
What cases is InnoDB suitable for
- read- and write-intensive workloads
- high reliability requirements; transactions are supported
different types of lock
- granularity: row-level, table-level, page-level
- shared lock vs exclusive lock
- implicit (automatic) lock vs explicit lock
- DML lock vs DDL lock
- optimistic lock vs pessimistic lock
the 4 features of a transaction
ACID
- Atomicity: all done or nothing done
- Consistency: integrity constraints hold
- Isolation: concurrent transactions should not interfere with each other
- Durability: after commit, the transaction's result is kept effective (backed by the redo log file)
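Atomicity in practice, as a minimal sketch (the account table and amounts are made up):

START TRANSACTION;
UPDATE account SET balance = balance - 100 WHERE id = 1;
UPDATE account SET balance = balance + 100 WHERE id = 2;
-- if anything fails in between, ROLLBACK undoes both updates (undo log)
COMMIT;  -- once committed, the result survives a crash (redo log)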
Concurrent transaction problem 1
lost update: one transaction's update overwrites another's
all major databases already avoid it through locking
Concurrent transaction problem 2
dirty read
one transaction reads another transaction's uncommitted data
READ UNCOMMITTED is the lowest isolation level
to avoid dirty reads: set session transaction isolation level read committed (Oracle's default)
Concurrent transaction problem 3
non-repeatable read
reading the same row several times within one transaction gives inconsistent results
happens when one transaction is reading while another transaction is updating the value
solution:
set session transaction isolation level repeatable read (InnoDB's default)
Concurrent transaction problem 4
phantom reads
MySQL's REPEATABLE READ avoids this (conceptually, only SERIALIZABLE can avoid it)
By default, InnoDB operates in REPEATABLE READ transaction isolation level and with the innodb_locks_unsafe_for_binlog system variable disabled. In this case, InnoDB uses next-key locks for searches and index scans, which prevents phantom rows
A phantom read occurs when, in the course of a transaction, new rows are added to (or removed from) the set of records being read by another transaction.
This can occur when range locks are not acquired on performing a SELECT … WHERE operation. The phantom read is a special case of the non-repeatable read: Transaction 1 repeats a ranged SELECT … WHERE query and, between the two executions, Transaction 2 INSERTs new rows (in the target table) which fulfil that WHERE clause.
In SERIALIZABLE isolation, the first query would lock all records in the queried range (say, age 10 to 30), so Transaction 2's insert would block until Transaction 1 committed. In standard REPEATABLE READ (without InnoDB's next-key locks), the range would not be locked, allowing the row to be inserted, and the second execution of the range query would include the new row in its results.
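A two-session sketch of the anomaly under READ COMMITTED (or standard RR without next-key locking); the person table is hypothetical:

-- Session 1:
START TRANSACTION;
SELECT COUNT(*) FROM person WHERE age BETWEEN 10 AND 30;  -- returns N

-- Session 2, meanwhile:
INSERT INTO person (age) VALUES (20);  -- commits (autocommit)

-- Session 1 again:
SELECT COUNT(*) FROM person WHERE age BETWEEN 10 AND 30;  -- returns N+1: a phantom row
COMMIT;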
snapshot read
a non-locking, non-blocking read, used when the isolation level is below SERIALIZABLE
i.e. a plain SELECT statement
current read
reads the newest committed version, and takes a lock:
select … lock in share mode //shared lock
select … for update //exclusive lock
insert //exclusive lock
update //exclusive lock
delete //exclusive lock
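Under InnoDB's REPEATABLE READ the two kinds of read are observably different; a sketch with a hypothetical table t1 holding the row (id=1, val='old'):

-- Session 1:
START TRANSACTION;
SELECT val FROM t1 WHERE id = 1;             -- snapshot read: 'old'

-- Session 2, meanwhile:
UPDATE t1 SET val = 'new' WHERE id = 1;      -- commits

-- Session 1 again:
SELECT val FROM t1 WHERE id = 1;             -- still 'old' (same snapshot)
SELECT val FROM t1 WHERE id = 1 FOR UPDATE;  -- current read: 'new', takes an exclusive lock
COMMIT;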
Under RC (read committed) and RR (repeatable read), how is snapshot read implemented
- 3 hidden columns: transaction id (DB_TRX_ID), rollback pointer (DB_ROLL_PTR), and row id (DB_ROW_ID, auto-incremented when a new row is inserted)
- undo log: holds the old versions of the data
insert undo log (discarded after commit)
update undo log (for update and delete)
- read view: taken at read time, it records the then-active transactions; a row version is visible only if its transaction committed before the view was created (the most stable version for now, i.e. the snapshot)
MVCC
Multiversion concurrency control
Is RR's snapshot read a real MVCC
not a real MVCC
what is the real mechanism by which RR avoids phantom reads
next-key lock = record lock + gap lock
- If the WHERE condition matches rows exactly (the query uses a unique index), only a record lock is taken; other sessions' concurrent inserts are still possible (no gap lock).
- If the WHERE condition matches partially or not at all, a gap lock is added as well.
Example: index values (6, 9, 11); executing "delete from t1 where id = 9" with id in a non-unique index
locks the record 9 and the gaps around it (a next-key lock), so nothing can be inserted between 6 and 11
- If "delete from t1 where id = 9" is run and id is not indexed at all, no index can be used, so InnoDB locks every record and every gap in the table (effectively a table-wide gap lock)
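A sketch of the gap lock in action, using the id values from the example above (id carries a non-unique index; table name assumed):

-- t1 contains rows with id values 6, 9, 11; non-unique index on id

-- Session 1:
START TRANSACTION;
DELETE FROM t1 WHERE id = 9;      -- next-key lock: record 9 plus the gaps around it

-- Session 2 (blocks until session 1 finishes):
INSERT INTO t1 (id) VALUES (10);  -- falls inside the locked gap between 9 and 11

-- Session 1:
COMMIT;                           -- now session 2's insert proceeds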