sdi 2 Flashcards by ace ang

dropbox design considerations

We should expect huge read and write volumes.
The read-to-write ratio is expected to be nearly the same.
Internally, files can be stored in small parts or chunks (say 4MB). This provides several benefits, e.g. if an upload fails, only the failing chunk needs to be retried.
We can reduce the amount of data exchanged by transferring only the updated chunks.
ACID guarantees are required for all file operations.

dropbox capacity estimation

Ask for total users and daily active users (DAU).
Assume that on average each user connects from 3 different devices.
Each user has 200 files/photos.
Average file size is 100KB.
1M active connections per minute.
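
A quick worked example of the storage estimate. The card leaves total users as a question to ask, so the 500M figure here is an assumed value for illustration only; the per-user numbers come from the card.

```python
# Back-of-the-envelope storage sizing.
TOTAL_USERS = 500_000_000   # assumption for illustration; the card says to ask
FILES_PER_USER = 200        # from the card
AVG_FILE_SIZE_KB = 100      # from the card

total_files = TOTAL_USERS * FILES_PER_USER
total_storage_kb = total_files * AVG_FILE_SIZE_KB
total_storage_pb = total_storage_kb / 10**12   # 1 PB = 10^12 KB (decimal units)

print(f"total files: {total_files:,}")
print(f"total storage: {total_storage_pb:.0f} PB")
```

Under that assumption the estimate is 100 billion files and about 10 PB of storage.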

dropbox high level design

The user specifies a folder as the workspace on their device. Anything placed in this folder is uploaded to the cloud, and whenever a file is modified or deleted, the change is reflected in cloud storage. The user can specify similar workspaces on all their devices, and any modification made on one device is propagated to all the others.
3 types of “main” servers:
1. Block servers work w/ clients to upload/download files from cloud storage
2. Metadata servers
3. Synchronization servers notify clients about changes

insta high level design

We need to support two scenarios: one to upload photos and the other to view/search photos.
Storage includes:
1. obj storage servers for photos
2. db servers for metadata

pastebin high level design

An application layer serves all read and write requests and talks to a storage layer to store and retrieve data.
- storage layer is divided into obj storage and metadata storage

dropbox - what does client application do?

monitor workspace folder on user’s machine to detect changes
work with the storage servers to upload, download, and modify actual files to backend Cloud Storage
interacts with the remote Synchronization Service to handle any file metadata updates

dropbox metadata

Keeping a local copy of metadata not only enables us to do offline updates but also saves a lot of round trips to update remote metadata.

http long polling

A way for clients to maintain an open connection with the server.

Client requests information from server w/ expectation that the server may not respond immediately.
If the server has no new data for the client when the poll is received, instead of sending an empty response, the server holds the request open and waits for response information to become available.
Once it does have new info, the server immediately sends an HTTP/S response to the client. Upon receipt of the server response, the client can immediately issue another request.
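
The hold-then-respond behavior can be sketched in-process. This is a simulation with threads rather than real HTTP, and `LongPollServer`, `publish`, and `poll` are hypothetical names made up for the example.

```python
import threading

class LongPollServer:
    """Holds a poll open until an update exists or the timeout expires."""

    def __init__(self):
        self._updates = []
        self._cond = threading.Condition()

    def publish(self, update):
        with self._cond:
            self._updates.append(update)
            self._cond.notify_all()   # wake any held requests

    def poll(self, last_seen, timeout):
        """Return updates after index last_seen, waiting up to timeout seconds."""
        with self._cond:
            self._cond.wait_for(lambda: len(self._updates) > last_seen, timeout)
            return self._updates[last_seen:]

server = LongPollServer()
# Simulate new data arriving 0.1 s after the client starts polling.
threading.Timer(0.1, server.publish, args=["file.txt changed"]).start()

received = server.poll(last_seen=0, timeout=5.0)   # held until the publish
print(received)
# The client would now immediately issue the next poll with last_seen=1.
empty = server.poll(last_seen=1, timeout=0.05)     # nothing new: times out
print(empty)
```

A real server would also cap how long it holds a request, which is what the `timeout` argument stands in for here.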

dropbox client consists of:

Internal Metadata Database: keeps track of all the files, chunks, their versions, and their location in the file system.
chunker
watcher
indexer

chunker

splits files into chunks
reconstructs file from its chunks
The chunking algorithm detects the parts of a file that the user has modified and transfers only those parts to Cloud Storage.
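
A minimal sketch of split, reconstruct, and change detection, assuming fixed-size chunks compared by SHA-256 hash. The function names are hypothetical and the 4-byte chunk size is only for the demo. Fixed-size chunking detects in-place edits; an insertion shifts every later chunk, which this sketch does not handle.

```python
import hashlib

CHUNK_SIZE = 4  # bytes, tiny for the demo; real chunks would be far larger

def split_into_chunks(data: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    return [data[i:i + size] for i in range(0, len(data), size)]

def reconstruct(chunks: list[bytes]) -> bytes:
    return b"".join(chunks)

def chunk_hashes(chunks: list[bytes]) -> list[str]:
    return [hashlib.sha256(c).hexdigest() for c in chunks]

def changed_chunk_indexes(old: bytes, new: bytes, size: int = CHUNK_SIZE) -> list[int]:
    """Indexes of chunks in `new` that are absent or different in `old`."""
    old_h = chunk_hashes(split_into_chunks(old, size))
    new_h = chunk_hashes(split_into_chunks(new, size))
    return [i for i, h in enumerate(new_h) if i >= len(old_h) or old_h[i] != h]

old = b"aaaabbbbcccc"
new = b"aaaaBBBBccccdd"
print(changed_chunk_indexes(old, new))   # only these chunks need uploading
```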

We can statically calculate an optimal chunk size based on:

1) Storage devices we use in the cloud
2) Network bandwidth
3) Average file size in the storage

watcher

Monitors the local workspace folders and notifies the Indexer of any action performed by the user.
Also listens for changes happening on other clients that are broadcast by the Synchronization Service.
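
One simple way to detect local changes is to compare two snapshots of the workspace folder. This polling approach is an assumption made for the sketch (a real client would more likely use OS file-change notifications), and `snapshot` and `diff_snapshots` are hypothetical helper names.

```python
import os
import tempfile

def snapshot(folder):
    """Map relative path -> (mtime_ns, size) for every file under folder."""
    state = {}
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            st = os.stat(path)
            state[os.path.relpath(path, folder)] = (st.st_mtime_ns, st.st_size)
    return state

def diff_snapshots(before, after):
    """Events the Watcher would hand to the Indexer."""
    events = [("created", p) for p in sorted(after.keys() - before.keys())]
    events += [("deleted", p) for p in sorted(before.keys() - after.keys())]
    events += [("modified", p) for p in sorted(before.keys() & after.keys())
               if before[p] != after[p]]
    return events

with tempfile.TemporaryDirectory() as workspace:
    with open(os.path.join(workspace, "a.txt"), "w") as f:
        f.write("v1")
    before = snapshot(workspace)
    with open(os.path.join(workspace, "a.txt"), "w") as f:
        f.write("version 2")   # size changes, so the diff sees it
    with open(os.path.join(workspace, "b.txt"), "w") as f:
        f.write("new")
    events = diff_snapshots(before, snapshot(workspace))

print(events)
```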

indexer

Processes events received from the Watcher and updates the internal metadata database w/ info about the chunks.
Once chunks are successfully submitted to Cloud Storage, the Indexer communicates w/ the Sync Service to broadcast changes to other clients and update the remote metadata database.

Should mobile clients sync remote changes immediately?

Unlike desktop or web clients, mobile clients usually sync on demand to save the user’s bandwidth and space.

dropbox metadata db

Sync Service should be able to provide a consistent view of the files using this db, esp if more than 1 user is working w/ the same file simultaneously.

If we choose NoSQL (the card's example is DynamoDB):
Many NoSQL stores relax ACID guarantees in favor of scalability and performance. Any guarantee the chosen store lacks has to be implemented programmatically in the logic of our Sync Service.

If we use a relational database such as MySQL, the Sync Service implementation will be simpler b/c relational DBs natively support ACID properties.
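
A small sketch of what native ACID support buys the Sync Service, using an in-memory SQLite database as a stand-in for MySQL. The table layout and the `commit_new_version` helper are assumptions for illustration. The version check plus the chunk-list rewrite happen in one transaction, so a stale concurrent update changes nothing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (id INTEGER PRIMARY KEY, name TEXT, version INTEGER)")
conn.execute("CREATE TABLE chunks (file_id INTEGER, seq INTEGER, hash TEXT)")
conn.execute("INSERT INTO files VALUES (1, 'notes.txt', 1)")
conn.commit()

def commit_new_version(conn, file_id, expected_version, chunk_hashes):
    """Atomically replace a file's chunk list if nobody else updated it first."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            cur = conn.execute(
                "UPDATE files SET version = version + 1 WHERE id = ? AND version = ?",
                (file_id, expected_version),
            )
            if cur.rowcount != 1:
                raise RuntimeError("version conflict")
            conn.execute("DELETE FROM chunks WHERE file_id = ?", (file_id,))
            conn.executemany(
                "INSERT INTO chunks VALUES (?, ?, ?)",
                [(file_id, i, h) for i, h in enumerate(chunk_hashes)],
            )
        return True
    except RuntimeError:
        return False

first = commit_new_version(conn, 1, expected_version=1, chunk_hashes=["h1", "h2"])
stale = commit_new_version(conn, 1, expected_version=1, chunk_hashes=["h3"])
print(first, stale)
```

The second call carries an out-of-date version number, so it is rejected and the chunk list written by the first call stays intact.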

The dropbox metadata database should store information about the following:

Chunks
Files
User
Devices
Workspace folders
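
One possible shape for these five entities, written as dataclasses. Every field name below is an assumption for illustration; the card only lists the entity names.

```python
from dataclasses import dataclass, field

@dataclass
class User:
    user_id: int
    name: str

@dataclass
class Device:
    device_id: int
    user_id: int          # owner of the device

@dataclass
class Workspace:
    workspace_id: int
    user_id: int
    path: str             # local folder being synced

@dataclass
class Chunk:
    chunk_hash: str       # content hash doubles as the chunk ID
    size: int

@dataclass
class File:
    file_id: int
    workspace_id: int
    name: str
    version: int = 1
    chunk_hashes: list[str] = field(default_factory=list)  # ordered chunk list

f = File(file_id=1, workspace_id=10, name="notes.txt", chunk_hashes=["h1", "h2"])
print(f)
```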

dropbox file processing workflow

Client A uploads chunks to cloud storage.
Client A updates metadata.
Client A gets confirmation and notifications are sent to Clients B and C about the changes.
Clients B and C receive metadata changes and download updated chunks.

Data deduplication

eliminating duplicate copies of data to improve storage utilization.

For each new incoming chunk, we get a hash of it and compare that hash with all the hashes of the existing chunks to see if we already have the same chunk present in our storage.

Two approaches: post-process and in-line.

Post-process deduplication

new chunks are first stored on the storage device, and later, some process analyzes the data, looking for duplication.

pros: clients won’t need to wait for hash calculation or hash lookup before storing the data
cons: 1) we will unnecessarily store duplicate data, though for a short time; 2) duplicate data will be transferred, consuming bandwidth.

In-line deduplication

hash calculations (for chunks) can be done in real-time as the clients are entering data on their device. If our system identifies a chunk that it has already stored, only a reference to the existing chunk will be added in the metadata, rather than a full copy of the chunk, giving us optimal network and storage usage.
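
A minimal sketch of the in-line check, assuming SHA-256 as the chunk hash and an in-memory dict standing in for Cloud Storage. `ChunkStore` and its fields are hypothetical names.

```python
import hashlib

class ChunkStore:
    def __init__(self):
        self.blobs = {}      # hash -> chunk bytes (stands in for Cloud Storage)
        self.refcount = {}   # hash -> number of metadata references

    def put(self, chunk: bytes):
        """Return (hash, uploaded); uploaded is False for a duplicate chunk."""
        digest = hashlib.sha256(chunk).hexdigest()
        uploaded = digest not in self.blobs
        if uploaded:
            self.blobs[digest] = chunk   # first copy: store the bytes
        # Either way, metadata gains one more reference to this chunk.
        self.refcount[digest] = self.refcount.get(digest, 0) + 1
        return digest, uploaded

store = ChunkStore()
results = [store.put(c)[1] for c in (b"hello", b"world", b"hello")]
print(results, len(store.blobs))
```

The third chunk repeats the first, so only a reference is added and nothing new is stored.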

how can Sync Service achieve better response time?

by transmitting less data btwn clients and Cloud Storage.

use differencing algo - Instead of transmitting entire files from clients to the server, we can just transmit the difference between 2 versions of a file.
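
The card does not name a specific differencing algorithm. As a stand-in, this sketch uses Python's `difflib.SequenceMatcher` to build a delta of copy and data instructions and then rebuilds the new version from the old one. `make_delta` and `apply_delta` are hypothetical helper names.

```python
from difflib import SequenceMatcher

def make_delta(old: bytes, new: bytes):
    """Describe `new` as copies from `old` plus literal new bytes."""
    delta = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))       # receiver already has these bytes
        elif j2 > j1:                            # replace or insert
            delta.append(("data", new[j1:j2]))   # only these bytes are sent
        # a pure delete sends nothing
    return delta

def apply_delta(old: bytes, delta) -> bytes:
    out = bytearray()
    for op in delta:
        if op[0] == "copy":
            out += old[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)

old = b"The quick brown fox jumps over the lazy dog"
new = b"The quick red fox jumps over the lazy dog!"
delta = make_delta(old, new)
sent = sum(len(op[1]) for op in delta if op[0] == "data")
print(sent, "bytes of new data vs", len(new), "bytes for the whole file")
```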

Sync Service main functions

Desktop clients communicate w/ Sync Service to either get updates from or send updates to Cloud Storage and other users.
it receives and carries out metadata update requests, then notifies all subscribed users or devices about the update.
we should use communication middleware between clients and Sync Service.

Message Queuing Service for message-based communication between clients and the Synchronization Service

asynchronous
must handle large volume
highly available, reliable, scalable
2 types of queues

Request Q
- shared by all clients
- Clients’ requests to update the Metadata db are sent to the Request Q first; from there the Sync Service takes them to update metadata.
Response Qs
- each client has its own Q
- responsible for delivering the update messages to each client
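
The two queue types can be sketched with in-process queues. This is a toy model, not a real message broker; the client names, the message shape, and `sync_service_step` are assumptions for illustration.

```python
import queue

request_q = queue.Queue()                                   # shared by all clients
response_qs = {c: queue.Queue() for c in ("A", "B", "C")}   # one per client
metadata = {}                                               # stands in for the Metadata db

def sync_service_step():
    """Take one request, update metadata, notify every other client."""
    sender, filename, version = request_q.get()
    metadata[filename] = version
    for client, q in response_qs.items():
        if client != sender:
            q.put((filename, version))

request_q.put(("A", "notes.txt", 2))   # Client A reports a new version
sync_service_step()
print(response_qs["B"].get_nowait(), response_qs["A"].empty())
```

Client A gets nothing in its own Response Q, while B and C each receive the update message.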

Vertical Partitioning:

store tables related to 1 feature on 1 server. ex: we can store all the user-related tables in one db and all files/chunks-related tables in another db.

This is straightforward to implement, but:
- we can easily run into scaling issues (we may need to store trillions of chunks, more than a single db server can hold)
- joining 2 tables in two separate databases can cause performance and consistency issues.

dropbox caching

2 types: 1 for Block storage, 1 for the metadata db. We can use an off-the-shelf solution like Memcached that can store whole chunks w/ their respective IDs/hashes. Before hitting Block storage, Block servers can quickly check whether the cache has the desired chunk. Use LRU eviction.
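
LRU eviction itself can be sketched with an `OrderedDict`. This is an in-process stand-in for illustration, not how Memcached is implemented, and `LRUChunkCache` is a hypothetical name.

```python
from collections import OrderedDict

class LRUChunkCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()   # chunk hash -> chunk bytes

    def get(self, chunk_hash):
        if chunk_hash not in self._items:
            return None                       # miss: caller reads Block storage
        self._items.move_to_end(chunk_hash)   # mark as most recently used
        return self._items[chunk_hash]

    def put(self, chunk_hash, data):
        self._items[chunk_hash] = data
        self._items.move_to_end(chunk_hash)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)   # evict least recently used

cache = LRUChunkCache(capacity=2)
cache.put("h1", b"one")
cache.put("h2", b"two")
cache.get("h1")             # h1 is now the most recently used
cache.put("h3", b"three")   # cache is full, so h2 is evicted
print(cache.get("h2"), cache.get("h1"))
```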

dropbox security

we will be storing the permissions of each file in our metadata DB to reflect what files are visible or modifiable by any user.
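
A tiny sketch of such a permission lookup. The dict stands in for rows in the metadata DB, and the user names, action names, and `can` helper are all assumptions for illustration.

```python
# (file_id, user_id) -> set of allowed actions; stands in for metadata DB rows
permissions = {
    (1, "alice"): {"read", "write"},
    (1, "bob"): {"read"},
}

def can(user_id, file_id, action):
    """True only if the metadata explicitly grants this action."""
    return action in permissions.get((file_id, user_id), set())

print(can("bob", 1, "read"), can("bob", 1, "write"), can("carol", 1, "read"))
```

Users with no entry for a file get no access by default.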

messaging app non functional requirements

1. Users should have a real-time chat experience with minimum latency.
2. Highly consistent: the same chat history on all devices.
3. High availability is desirable, but consistency is more important.

sdi 2 Flashcards

(27 cards)