Distributed databases Flashcards

Question

What is No replication?

Answer 1

◦ Each fragment is stored at a unique site

Answer 2

◦ Very fast query answering: you can connect to any server and it will have everything needed to answer the query, so you don't need to wait for a particular server to be available
◦ Very slow updates: every update has to be applied to every single server

Answer 3

◦ Updates are very fast as you’re only updating one place.

Answer 4

◦ Crashes are a big problem, because if the site holding a fragment is down there is no way to answer a query that needs it.
◦ Queries are slower, because if you need information stored on another server you must wait for that information

Answer 5

◦ Limit the number of copies of each fragment; as a result, you may not be able to answer every query.
◦ Replicate only some fragments, for example the most important or the most frequently accessed ones

Answer 6

Fragmentation transparency (highest level)
◦ Fragmentation is transparent to users
◦ Users pose queries against the entire database
◦ The distributed DBMS translates each query into a query plan that fetches the required information from the appropriate nodes
Replication transparency
◦ Ability to store copies of data items / fragments at different sites
◦ Replication is transparent to users
Location transparency
◦ The location where data is stored is transparent to the user; this is similar to replication transparency, but it’s more specialised than fragmentation transparency.
Naming transparency
◦ A given name (of a relation) must have the same meaning everywhere in the system, so every instance of a relation must use the same name.

Answer 7

Concurrency control is the part of the system responsible for ensuring that the ACID properties of isolation and consistency are retained. One way of doing this is to use locks, which give full isolation/consistency.

Answer 8

Have one master computer manage all the locks, so this one computer decides whether or not to grant each lock request.

Answer 9

1. If the master computer fails, you need to restart the entire system and all running transactions, because you don’t know who held the locks when it failed.
- You can have a backup system running; then you don’t need to restart everything. The primary and backup computers must be kept synchronised, otherwise the backup’s information can be inaccurate, but synchronisation is very expensive.
2. If too many transactions require locks at the same time, one master computer cannot handle the load.
- One way of dealing with too many transactions is to have more computers, each being the authority for the locks on different items.
- However, it may then no longer be clear who you need to ask to get a lock.

Answer 10

◦ Each site with a copy of an item has a local lock that it can grant to transactions for that item
◦ If a transaction gets over half the local locks for an item, it has a global lock on the item
◦ If so, it must tell the sites with a copy that it has the lock
◦ If it takes too long to get the lock or to announce that it has the lock, it must stop trying to get the lock and abort this part.
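
The voting rule above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the `Site` class and `acquire_global_lock` function are hypothetical names, message passing is replaced by direct method calls, and the timeout is left out (a site that is down simply refuses the request).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Site:
    """A site holding a copy of the item (hypothetical model, not a real API)."""
    name: str
    up: bool = True
    holder: Optional[str] = None  # transaction holding this site's local lock

    def request_local_lock(self, txn: str) -> bool:
        # Grant the local lock only if the site is up and the lock is free
        # (or already held by the same transaction).
        if not self.up or self.holder not in (None, txn):
            return False
        self.holder = txn
        return True

    def release(self, txn: str) -> None:
        if self.holder == txn:
            self.holder = None


def acquire_global_lock(txn: str, sites: List[Site]) -> bool:
    """Return True if txn obtains more than half of the local locks."""
    granted = [s for s in sites if s.request_local_lock(txn)]
    if 2 * len(granted) > len(sites):
        return True  # strict majority: txn holds the global lock
    for s in granted:
        s.release(txn)  # no majority: give the local locks back and abort
    return False
```

With three sites, a transaction needs two local locks; a second transaction asking for the same item while the first holds a majority cannot reach a majority itself.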

Answer 11

Much more distributed than the non-voting approach: individual sites can go down, and as long as more than half of the sites holding a copy of an item are running you can still lock and access that item

Answer 12

Requires much more communication between the computers in the network, and this takes a long time

Answer 13

atomicity and durability

Answer 14

- Start our local transaction T0 at the central office
- Then we instruct the other sites to start local transactions T1, T2 and T3
- These smaller transactions T1, T2 and T3 figure out how much inventory they have on site and send this information back to T0
- T0 then determines how to move the product between the different sites and tells them how to move it (not physically, but how the database is changed)

Answer 15

◦ We can assume atomicity is enforced locally at each node, so T1 is either fully executed or not executed at all
◦ Atomicity could still be violated globally: T1 and T3 could succeed while T2 fails and has to be rolled back. T1 and T3 must then be rolled back too, even though they ran successfully, because a global transaction must be either fully executed or not executed at all

Answer 16

A protocol designed to ensure that either everybody commits or no-one commits

Answer 17

- This protocol’s job is to commit actions globally
- A designated node (the coordinator) decides if and when local transactions can commit
- Each node does logging locally, and also logs the messages it receives from other nodes and the messages it sends to other nodes

Answer 18

Phase 1: Decide whether to commit or abort
- The coordinator sends “prepare T” to all involved nodes.
- Each node decides whether it is ready to commit right now.
- If a node is ready, it goes into the pre-committed state and sends “ready T” to the coordinator
- Once in the pre-committed state, a node is not allowed to abort on its own; it may only abort if the coordinator tells it to
- If a node isn't ready to commit, it sends “don’t commit T” and aborts the local transaction
- If one node aborts, then all the other nodes must abort too
- A timeout handles delays: if a node doesn’t answer within this time, it is treated as failed and the transaction is aborted.

Answer 19

- The coordinator waits for the responses of the nodes and assumes a timeout means a node wishes to abort
- If any node responds “don’t commit T” or there’s a timeout, the coordinator sends “abort T” to all nodes.
- If every node responds with “ready T”, every node has decided to commit
- The coordinator then sends “commit T” to every node
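
The coordinator's decision rule can be sketched as a small Python function. This is a minimal illustration, not a full protocol implementation: the function names are hypothetical, a timed-out node is modelled as a vote of `None`, and the network is replaced by a dictionary of votes.

```python
from typing import Dict, List, Optional, Tuple

READY = "ready T"
DONT_COMMIT = "don't commit T"


def coordinator_decision(votes: Dict[str, Optional[str]]) -> str:
    """Decide phase 2 from the phase 1 votes; None models a timeout."""
    # Commit only if every involved node answered "ready T".
    if all(vote == READY for vote in votes.values()):
        return "commit T"
    # Any "don't commit T" or timeout leads to a global abort.
    return "abort T"


def run_phase_two(
    votes: Dict[str, Optional[str]],
) -> Tuple[Dict[str, str], List[str]]:
    """Return the message sent to each node and the coordinator's log."""
    decision = coordinator_decision(votes)
    log = [decision]  # the decision is logged before it is sent out
    messages = {node: decision for node in votes}
    return messages, log
```

For example, `coordinator_decision({"n1": READY, "n2": None})` yields `"abort T"`, because the missing answer from `n2` is treated as a wish to abort.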

Answer 20

- Before the coordinator sends “prepare T”, it writes this to the log file in the buffer
- Before a node sends “ready T”, it must enter the pre-committed state and make sure all log entries are written to disk
- If the last entry in the log is “don’t commit T”, we know we wanted to abort, so we simply abort afterwards.

Answer 21

- There are two cases: either all nodes are committing or all nodes are aborting
- In both cases we write what we’re doing to the log file, either COMMIT T or ABORT T
- If COMMIT T is the last entry in the log file and there’s a failure, the transaction should be redone, both locally and at the other nodes

Answer 22

- Fixes a small issue of two-phase commit
- Ensures the DDBMS stays consistent and reliable

Answer 23

- The prepare phase
- The commit phase
- The finish phase

Answer 24

- each node involved writes its changes to a log file in buffer and sends to the coordinator

Answer 25

- if all nodes involved send the coordinator sends to all nodes

Answer 26

- each node that receives writes changes to the disk and sends ACK to confirm success - if at this point a node does not send an ACK, coordinator sends transaction is rolled back

Answer 27

- If, in phase 2, the coordinator and some transaction crash while everybody else is in the pre-committed state, nothing can be done until the coordinator or the crashed transaction recovers
- We could try to abort all transactions or to commit all transactions, but neither works.
- If we get all the remaining transactions to commit, the crashed transaction might be in the middle of aborting or might already have aborted, and this can break durability
- If the durability of a database is compromised, it may be difficult or impossible to recover lost data, which can lead to data loss or corruption
- If everyone except the crashed nodes aborts, the crashed transactions might already have been told to commit and might have done so. When they come back online they would have to roll back, and we’re not allowed to roll back committed transactions, so we’d be breaking durability
- In either case we’re breaking durability, so we have to leave everything in the pre-committed state with all the items staying locked, which is extremely expensive.

Answer 28

- We have two relations placed at different sites: R is stored at site A and S is stored at site B
- At B, we pose the query R natural join S
- The naive approach has to send all of R over from site A
- The connection between the two sites might be very slow, so the more data we send, the longer this takes. We therefore want to minimise the data transferred by sending only what we need.

Answer 29

R semijoin S is the set of all tuples in R that join with at least one tuple in S under the natural join, i.e. that agree with some tuple of S on all common attributes

Answer 30

- Site A stores R and site B stores S; we want to compute the left semijoin at site A and then send the result over to site B
- Site B sends S’, the distinct projection of S onto the attributes common to R and S, to site A
- Site A computes the left semijoin of R on S, giving R’. These are the relevant tuples needed.
- Site A sends R’ over to site B, where R’ natural join S is computed
Runtime cost: (number of tuples in S’) × (size of each tuple in S’) + (number of tuples in R’) × (size of each tuple in R’)

Answer 31

- Is this more efficient than the other method of just sending the entire relation over?
- It depends. If the distinct projection is much smaller than the full relation, because many duplicates get eliminated, the semijoin method can be more efficient
- It can also be that the semijoin is much smaller than R, and again this will typically be more efficient
- In general, if the size of the projection of S onto the common attributes plus the size of the semijoin of R on S is smaller than the size of R, the semijoin method is more efficient than sending over all of R
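
The comparison can be made concrete with a small cost calculation. The sketch below is only an illustration: the function names are hypothetical, sizes are assumed to be in bytes, and the numbers in the usage note are made up.

```python
def semijoin_transfer_cost(
    n_s_proj: int, s_proj_tuple_size: int, n_r_reduced: int, r_tuple_size: int
) -> int:
    """Bytes sent by the semijoin method: S' to site A, then R' to site B."""
    return n_s_proj * s_proj_tuple_size + n_r_reduced * r_tuple_size


def full_transfer_cost(n_r: int, r_tuple_size: int) -> int:
    """Bytes sent by the naive method: all of R to site B."""
    return n_r * r_tuple_size


def semijoin_is_cheaper(
    n_r: int,
    r_tuple_size: int,
    n_s_proj: int,
    s_proj_tuple_size: int,
    n_r_reduced: int,
) -> bool:
    semi = semijoin_transfer_cost(
        n_s_proj, s_proj_tuple_size, n_r_reduced, r_tuple_size
    )
    return semi < full_transfer_cost(n_r, r_tuple_size)
```

With 1000 tuples of 40 bytes in R, 100 projected tuples of 4 bytes, and 50 surviving tuples of R, the semijoin method sends 2400 bytes instead of 40000. If nearly every tuple of R survives the semijoin, the projection is pure overhead and the naive method wins.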

(55 cards)