Distributed database management systems Flashcards
Distributed architectures
Distributed DB
- different DBMS servers on different network nodes
- autonomous
- able to cooperate
- Guaranteeing ACID properties requires complex techniques
Client/server
- the simplest architecture: clients send queries to a single DBMS server
Data replication
- a replica is a copy of the data stored on a different network node
- the replication server autonomously manages the update of the copies
- simpler architecture than a distributed database
Parallel architecture
- Performance is the objective
- multiprocessor machines vs CPU clusters
Data warehouses
- decision support
Distributed systems relevant properties
Portability
- Capability of moving a program from one system to another
- for databases, this is guaranteed by the SQL standard
Interoperability
- capability of different DBMS servers to cooperate on a given task
- interaction protocols are needed
SQL execution types
Compile and go
- query sent to server
- query prepared (execution plan)
- query executed
- result shipped
Compile and store
- query sent to server
- query prepared (execution plan)
- plan is stored for later use
- query executed
- result shipped
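A minimal sketch of the difference, using Python's built-in sqlite3 module as a stand-in for the server (so "sent to server" is only simulated): compile and go re-prepares the statement on every call, while compile and store prepares a parameterized statement once and reuses its plan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, dept TEXT)")
conn.executemany("INSERT INTO employee VALUES (?, ?)",
                 [(1, "production"), (2, "sales")])

# Compile and go: the full SQL text is prepared on every call and the
# execution plan is discarded once the result has been shipped back.
for dept in ("production", "sales"):
    rows = conn.execute(
        "SELECT id FROM employee WHERE dept = '%s'" % dept).fetchall()

# Compile and store: a parameterized statement is prepared once and its
# plan is reused for each execution (sqlite3 caches prepared statements
# keyed on the SQL text, so reusing the same string reuses the plan).
stored = "SELECT id FROM employee WHERE dept = ?"
for dept in ("production", "sales"):
    rows = conn.execute(stored, (dept,)).fetchall()
```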
Advantages of distributed database systems
Functional:
- Appropriate localization of data and applications
Technological
- Increased data availability
- enhanced scalability
Data fragmentation types and properties (distributed database systems)
Given a relation R, a data fragment is a subset of R in terms of tuples, schema, or both.
Fragmentation criteria:
- horizontal
- subsets of tuples with the same schema as R
- obtained by a selection predicate (e.g. Employee where dept = 'production')
- fragments do not overlap
- vertical
- a subset of the schema of R
- selects a set of columns for each node
- the primary key is included in each fragment
- all tuples are included
- mixed
Properties:
- completeness: every piece of information is contained in at least one fragment
- correctness: the information in R can be rebuilt from its fragments (see the sketch below)
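A toy Python model of the two criteria and the two properties; the Employee relation and all attribute names are made up for the example.

```python
employee = [
    {"id": 1, "name": "Ada", "dept": "production", "salary": 50},
    {"id": 2, "name": "Bob", "dept": "sales", "salary": 40},
]

# Horizontal fragmentation: selection predicates partition the tuples.
prod = [t for t in employee if t["dept"] == "production"]
rest = [t for t in employee if t["dept"] != "production"]
assert all(t in prod + rest for t in employee)   # completeness
rebuilt_h = prod + rest                          # correctness: union rebuilds R

# Vertical fragmentation: each fragment projects a subset of columns,
# always keeping the primary key so the tuples can be re-joined.
frag1 = [{"id": t["id"], "name": t["name"]} for t in employee]
frag2 = [{"id": t["id"], "dept": t["dept"], "salary": t["salary"]}
         for t in employee]
rebuilt_v = [{**a, **b} for a in frag1 for b in frag2 if a["id"] == b["id"]]
assert rebuilt_v == employee                     # correctness: join on the key rebuilds R
```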
Transparency levels (distributed dbms)
describe the knowledge of data distribution
an application should operate differently depending on the transparency level supported by the DBMS
Fragmentation transparency
- the application addresses the logical table; fragmentation and allocation of data are not visible
allocation transparency
- the application knows the fragments but not their allocation to nodes
- it is not aware of replicas
- it must enumerate all fragments
language transparency
- the programmer must specify both fragments and their allocation
- queries written at higher transparency levels are converted to this format
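For concreteness, here is the same lookup written at each level, held in Python strings; the fragment names (emp1, emp2), node names (nodeA, nodeB), and the table@node syntax are invented for the example.

```python
# Fragmentation transparency: the application addresses the logical table.
q_fragmentation = "SELECT name FROM employee WHERE id = :id"

# Allocation transparency: the application names the fragments but not
# the nodes, so it must try each fragment in turn.
q_allocation = [
    "SELECT name FROM emp1 WHERE id = :id",
    "SELECT name FROM emp2 WHERE id = :id",
]

# Language (no) transparency: fragment and allocation are both spelled out.
q_language = [
    "SELECT name FROM emp1@nodeA WHERE id = :id",
    "SELECT name FROM emp2@nodeB WHERE id = :id",
]
```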
Transaction types in distributed dbms
- remote request (read-only on single server)
- remote transaction (single server)
- distributed transaction (each SQL statement is addressed to a single server)
- global atomicity is needed (2PC)
- Distributed request
- each sql command may refer to data on different nodes
- distributed optimization is needed (performed by the DBMS receiving the request)
- fragmentation transparency is in this class only
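Illustrative SQL for each class (held in Python strings), assuming account data split across two hypothetical servers, acc1@node1 and acc2@node2; the @node syntax is for illustration only.

```python
remote_request = "SELECT balance FROM acc1@node1 WHERE num = 45"  # read-only, one server

remote_transaction = [  # several statements, still one server
    "UPDATE acc1@node1 SET balance = balance - 10 WHERE num = 45",
]

distributed_transaction = [  # several servers, each statement on a single server
    "UPDATE acc1@node1 SET balance = balance - 10 WHERE num = 45",
    "UPDATE acc2@node2 SET balance = balance + 10 WHERE num = 77",
]

distributed_request = (  # a single statement spanning several servers
    "SELECT a.num FROM acc1@node1 a JOIN acc2@node2 b ON a.cust = b.cust"
)
```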
ACID properties in relation to distributed dbms
Atomicity
- requires 2PC
- all nodes participating in a distributed transaction must implement the same decision (commit or rollback)
- possible failures: node crash, network failure, network partitioning
Consistency
- constraints are currently enforced only locally
Isolation
- requires strict 2PL and 2PC
Durability
- Requires extending local recovery procedures to manage atomicity in the presence of failures
2PC
Goal: coordination of the conclusion of a distributed transaction
Parallel with a wedding:
- Priest coordinates the agreement
- Couple participates in the agreement
One coordinator (the transaction manager, TM), several participating DBMS servers (resource managers, RMs); any participant can act as coordinator (even the client)
TM new log records:
- Prepare (contains the identity of all participating RMs)
- Global commit/abort
- Complete
RM new log record:
- Ready
- RM is willing to perform commit of the transaction
- decision cannot be changed afterwards
- the RM loses its autonomy (it must wait for the global decision)
Protocol
- Phase 1
- TM writes the prepare record and sends the prepare message to all RMs
- TM sets timeout
- if RMs receive the message
- state = reliable: write ready record, send the ready message
- state = not reliable: send not ready, terminate protocol, rollback
- state = crashed: no answer
- TM collects the incoming messages; if it receives a not ready or the timeout expires, it decides global abort
- Phase 2
- TM sends the global decision to the RMs and sets a timeout
- RMs wait for the global decision; when they receive it, they write commit/abort in the log and send an ACK back
- TM waits for all ACKs; if the timeout expires, it sets a new one and resends the decision message
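A single-process Python sketch of this message flow; the logs are in-memory lists (a real system forces them to stable storage), timeouts are omitted, and class/message names are illustrative.

```python
class RM:
    def __init__(self, name, reliable=True):
        self.name, self.reliable, self.log = name, reliable, []

    def on_prepare(self):
        if self.reliable:
            self.log.append("ready")   # from here on the decision is out of the RM's hands
            return "ready"
        return "not ready"             # local rollback; the RM leaves the protocol

    def on_decision(self, decision):
        self.log.append(decision)      # commit or abort written to the log
        return "ack"

class TM:
    def __init__(self, rms):
        self.rms, self.log = rms, []

    def run(self):
        # Phase 1: write prepare (with the RM identities), collect the votes.
        self.log.append("prepare " + str([rm.name for rm in self.rms]))
        votes = {rm: rm.on_prepare() for rm in self.rms}
        decision = "commit" if all(v == "ready" for v in votes.values()) else "abort"
        self.log.append("global " + decision)
        # Phase 2: send the decision to the RMs that voted ready and
        # wait for every ACK before writing the complete record.
        acks = [rm.on_decision(decision) for rm, v in votes.items() if v == "ready"]
        assert all(a == "ack" for a in acks)
        self.log.append("complete")
        return "global " + decision

print(TM([RM("rm1"), RM("rm2", reliable=False)]).run())  # -> global abort
```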
Failure of participant and coordinator in distributed systems
RM
- Warm restart procedure is modified with a new case
- if the last record for transaction T is ready, the RM does not know the global decision for T
- Recovery
- Ready list: collects the IDs of all transactions in the ready state
- for each transaction in the list, the global decision is requested from the corresponding TM
TM
- Messages can be lost
- Prepare (out)
- Ready (in)
- Global decision (out)
- Recovery
- if the last record in the TM log is prepare: the global abort decision is written in the log and sent to all RMs (the alternative of repeating phase 1 is not implemented)
- if the last record in the TM log is a global decision: phase 2 is repeated
Network problems:
- in phase 1 -> global abort
- in phase 2 -> repetition of phase 2
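The recovery rules above can be sketched as follows; the log layout and the helper names (ask_tm, resend_decision) are hypothetical.

```python
def last_record(log, t):
    # Last record written for transaction t, or None.
    return next((rec for tid, rec in reversed(log) if tid == t), None)

def rm_warm_restart(rm_log, transactions, ask_tm):
    # New warm-restart case: transactions whose last record is "ready"
    # are in doubt, so the global decision is requested from the TM.
    ready_list = [t for t in transactions if last_record(rm_log, t) == "ready"]
    return {t: ask_tm(t) for t in ready_list}

def tm_recovery(tm_log, t, resend_decision):
    rec = last_record(tm_log, t)
    if rec == "prepare":
        # Global abort is decided (repeating phase 1 is not implemented).
        tm_log.append((t, "global abort"))
        resend_decision(t, "abort")
    elif rec in ("global commit", "global abort"):
        resend_decision(t, rec.split()[1])  # repeat phase 2
```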
X/Open DTP
- protocol for coordination of distributed transactions
- guarantees interoperability between heterogeneous DBMSs
- one client, one TM, several RMs
- defines interfaces between client and TM (TM interface) and between TM and RMs (XA interface)
- RMs are passive and only answer to TM
- two optimizations of 2PC (see the sketch below):
- presumed abort: when no information is available in the log, the TM answers abort to a remote recovery request by an RM
- reduces the number of synchronous log writes
- read only: used by RMs that did not modify the database during the transaction
- the RM answers read only to the prepare request
- it does not write the log and terminates the protocol locally
- TM will ignore the RM in phase 2
- Heuristic decision: evolution in the presence of TM failures
- during the uncertainty window
- the blocked transaction evolves locally
- on TM recovery, the decisions are compared; if they differ, atomicity is lost and the inconsistency is notified
- resolving the inconsistency is up to the user applications
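A Python sketch of the two optimizations; the function and message names are illustrative, not part of the actual XA interface.

```python
def rm_vote(db_was_modified):
    # Read-only optimization: an RM that did not modify the database
    # answers "read only", writes no log record, and terminates the
    # protocol locally; the TM will skip it in phase 2.
    return "ready" if db_was_modified else "read only"

def tm_answer_remote_recovery(last_tm_record_for_t):
    # Presumed abort: when the TM log holds no information about the
    # transaction, the answer to an RM's recovery request is abort, so
    # abort-related records never need synchronous log writes.
    return last_tm_record_for_t or "abort"
```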
Parallel DBMS
Through: multiprocessor systems or computer clusters
Queries can be efficiently parallelized: large table scans, GROUP BY aggregations
Inter-query parallelism: different queries are scheduled on different processors, often used in OLTP systems
Intra-query parallelism: subparts of the same query are executed on different processors (OLAP, heavy queries)
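A toy illustration of intra-query parallelism with Python's process pool: one heavy GROUP BY is split into partial aggregations over partitions and merged at the end; inter-query parallelism would instead submit different queries as separate tasks. All names are illustrative.

```python
from concurrent.futures import ProcessPoolExecutor

table = [("production", 50), ("sales", 40)] * 250_000  # in-memory "table"

def partial_group_by(rows):
    # Per-partition partial aggregate for GROUP BY dept / SUM(salary).
    acc = {}
    for dept, salary in rows:
        acc[dept] = acc.get(dept, 0) + salary
    return acc

if __name__ == "__main__":
    n = 4
    partitions = [table[i::n] for i in range(n)]
    with ProcessPoolExecutor(max_workers=n) as pool:
        partials = list(pool.map(partial_group_by, partitions))
    result = {}
    for p in partials:  # merge the partial aggregates
        for dept, s in p.items():
            result[dept] = result.get(dept, 0) + s
    print(result)       # {'production': 12500000, 'sales': 10000000}
```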
DBMS benchmarks
TPC C (TPC E): emulates OLTP
TPC H: OLAP
TPCx-HS: Big data management (Hadoop clusters)
2PC: Uncertainty Window
It spans from when the RM sends the ready message until it receives the global decision from the TM.
Local resources stay locked during this time; that is why the uncertainty window should be as small as possible.