sdi 2 Flashcards

1
Q

dropbox design considerations

A
  • We should expect huge read and write volumes.
  • Read and write volumes are expected to be nearly equal (read-to-write ratio close to 1:1).
  • Internally, files can be stored in small parts or chunks (say 4MB). This has several
    benefits, e.g. if an upload fails, only the failing chunk needs to be retried.
  • We can reduce the amount of data exchanged by transferring only the updated chunks.
  • ACID guarantees are required for all file operations.
2
Q

dropbox capacity estimation

A
  • Ask for the total number of users and daily active users (DAU).
  • Let’s assume on average each user connects from 3 different devices.
  • Each user has 200 files/photos
  • average file size is 100KB
  • 1M active connections per min
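A rough sketch of the arithmetic behind these numbers. The files-per-user and file-size figures come from the card; the total user count of 500M is only a placeholder assumption, since the card says to ask for it.

```python
# Back-of-the-envelope storage estimate.
TOTAL_USERS = 500_000_000   # ASSUMPTION: placeholder value, ask the interviewer
FILES_PER_USER = 200        # from the card
AVG_FILE_SIZE_KB = 100      # from the card


def estimate(total_users, files_per_user, avg_file_kb):
    """Return (total file count, total storage in decimal petabytes)."""
    total_files = total_users * files_per_user
    total_kb = total_files * avg_file_kb
    # KB -> MB -> GB -> TB -> PB is four steps of 1000 each.
    total_pb = total_kb / 1000**4
    return total_files, total_pb


total_files, total_pb = estimate(TOTAL_USERS, FILES_PER_USER, AVG_FILE_SIZE_KB)
print(f"{total_files:,} files, about {total_pb:g} PB")
```

Under the assumed 500M users this gives 100 billion files and about 10 PB of storage; per user it is 200 × 100KB = 20MB.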
3
Q

dropbox high level design

A
  • The user will specify a folder as the workspace on their device. Anything placed in this
    folder will be uploaded to the cloud, and whenever a file is modified or deleted, it will be reflected in the same way in the cloud storage. The user can specify similar workspaces on all their devices and any modification done on one device will be propagated to all other devices
  • 3 types of “main” servers:
    1. Block servers work with clients to upload/download files from cloud storage
    2. Metadata servers
    3. Synchronization servers will notify clients about changes
4
Q

insta high level design

A

We need to support two scenarios: one to upload photos and the other to view/search photos.
Storage includes:
1. Object storage servers for photos
2. DB servers for metadata

5
Q

pastebin high level design

A

An application layer serves all read and write requests. It talks to a storage layer to store and retrieve data.
- The storage layer is divided into object storage and metadata storage.

6
Q

dropbox - what does client application do?

A
  • monitors the workspace folder on the user’s machine to detect changes
  • works with the storage servers to upload, download, and modify actual files in backend Cloud Storage
  • interacts with the remote Synchronization Service to handle any file metadata updates
7
Q

dropbox metadata

A

Keeping a local copy of metadata not only enables us to do offline updates but also saves a lot of round trips to update remote metadata.

8
Q

http long polling

A

A way for clients to maintain an open connection with the server.

Client requests information from server w/ expectation that the server may not respond immediately.
If the server has no new data for the client when the poll is received, instead of sending an empty response, the server holds the request open and waits for response information to become available.
Once it does have new info, the server immediately sends an HTTP/S response to the client. Upon receipt of the server response, the client can immediately issue another request.
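A minimal in-process sketch of the hold-then-respond behavior described above. It uses a blocking `queue.Queue` in place of a real HTTP connection; the names `LongPollServer`, `publish`, `poll`, and `client_loop` are hypothetical, and a real server would keep one pending request per client rather than a single shared queue.

```python
import queue
import threading


class LongPollServer:
    """In-process stand-in for a long-polling endpoint (hypothetical)."""

    def __init__(self):
        self._updates = queue.Queue()

    def publish(self, message):
        # New data arrives; any request currently being held is released.
        self._updates.put(message)

    def poll(self, timeout):
        """Hold the request open until data arrives or the timeout expires."""
        try:
            return self._updates.get(timeout=timeout)
        except queue.Empty:
            return None  # nothing new in time; the client just polls again


def client_loop(server, wanted, timeout=1.0):
    """Keep re-issuing polls until `wanted` messages have been received."""
    received = []
    while len(received) < wanted:
        message = server.poll(timeout)
        if message is not None:
            received.append(message)
        # Either way, the next poll is issued immediately.
    return received


if __name__ == "__main__":
    server = LongPollServer()
    # Simulate the server learning about a change shortly after the poll starts.
    threading.Timer(0.05, server.publish, args=["file.txt changed"]).start()
    print(client_loop(server, wanted=1))
```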

9
Q

dropbox client consists of:

A
  1. Internal Metadata Database will keep track of all the files, chunks, their versions, and their
    location in the file system.
  2. chunker
  3. watcher
  4. indexer
10
Q

chunker

A
  • splits files into chunks
  • reconstructs file from its chunks
  • chunking algorithm will detect the parts of the files that have
    been modified by the user and transfer only those parts to the Cloud Storage
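A small sketch of the three chunker duties, assuming fixed-size chunks and SHA-256 hashes to detect modified chunks. The function names are hypothetical, and the 4MB default is just the example size mentioned in the design considerations. Note that fixed-size chunking treats every chunk after an insertion as changed, because the boundaries shift.

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4MB, the example size from the design notes


def split(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[bytes]:
    """Split file contents into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def join(chunks: list[bytes]) -> bytes:
    """Reconstruct the file from its ordered chunks."""
    return b"".join(chunks)


def chunk_hashes(chunks: list[bytes]) -> list[str]:
    return [hashlib.sha256(chunk).hexdigest() for chunk in chunks]


def changed_chunks(old_hashes: list[str], new_chunks: list[bytes]) -> list[int]:
    """Indices of chunks that are new or whose hash differs from the old version."""
    changed = []
    for index, chunk in enumerate(new_chunks):
        digest = hashlib.sha256(chunk).hexdigest()
        if index >= len(old_hashes) or old_hashes[index] != digest:
            changed.append(index)
    return changed
```

Only the chunks at the returned indices would need to be uploaded; the hashes of the previous version can come from the client's internal metadata database.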
11
Q

We can statically calculate what could be an optimal
chunk size based on

A

1) Storage devices we use in the cloud
2) Network bandwidth
3) Average file size in the storage

12
Q

watcher

A
  • monitors the local workspace folders and notifies the Indexer of any action performed by the users
  • also listens for changes happening on other clients that are broadcast by the
    Synchronization Service
13
Q

indexer

A
  • processes events received from the Watcher and updates the internal metadata database with info about the chunks
  • Once chunks are successfully submitted to Cloud Storage, Indexer communicates w/ Sync Service to broadcast changes to other clients and update remote metadata database.
14
Q

Should mobile clients sync remote changes immediately?

A

Unlike desktop or web clients, mobile clients usually sync on demand to save the user’s bandwidth and space.

15
Q

dropbox metadata db

A

Sync Service should be able to provide a consistent view of the files using this db, esp if more than 1 user is working w/ the same file simultaneously.

If we choose a NoSQL store such as DynamoDB:
such stores typically trade full ACID guarantees for scalability and performance, so we may need to implement ACID-like behavior programmatically in the logic of our Sync Service.

If we use a relational database such as MySQL, the Sync Service implementation will be simpler because relational DBs natively support ACID properties.

16
Q

The dropbox metadata Database should be storing information about the following:

A
  1. Chunks
  2. Files
  3. User
  4. Devices
  5. Workspace folders
17
Q

dropbox file processing workflow

A
  1. Client A uploads chunks to cloud storage.
  2. Client A updates metadata.
  3. Client A gets confirmation and notifications are sent to Clients B and C about the changes.
  4. Client B and C receive metadata changes and download updated chunks.
18
Q

Data deduplication

A

eliminating duplicate copies of data to improve storage utilization.

For each new incoming chunk, we get a hash of it and compare that hash with all the hashes of the existing chunks to see if we already have the same chunk present in our storage.

  • Two approaches: post-process and in-line
19
Q

Post-process deduplication

A

new chunks are first stored on the storage device, and later, some process analyzes the data, looking for duplication.

  • pros: clients won’t need to wait for hash calculation or hash lookup before storing the data
  • cons: 1) we unnecessarily store duplicate data, though only for a short time; 2) duplicate data is still transferred, consuming bandwidth.
20
Q

In-line deduplication

A

hash calculations (for chunks) can be done in real-time as the clients are entering data on their device. If our system identifies a chunk that it has already stored, only a reference to the existing chunk will be added in the metadata, rather than a full copy of the chunk, giving us optimal network and storage usage.
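A minimal sketch of in-line deduplication, assuming SHA-256 as the chunk hash and an in-memory dictionary standing in for Cloud Storage. `ChunkStore` and its methods are hypothetical names used only for illustration.

```python
import hashlib


class ChunkStore:
    """Stores each distinct chunk once; files hold references (hashes) only."""

    def __init__(self):
        self.blobs = {}        # chunk hash -> chunk bytes (stand-in for Cloud Storage)
        self.file_chunks = {}  # file name -> ordered list of chunk hashes (metadata)

    def add_file(self, name, chunks):
        """Store a file; return how many chunks actually had to be uploaded."""
        uploaded = 0
        refs = []
        for chunk in chunks:
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.blobs:
                self.blobs[digest] = chunk  # first time we see this chunk
                uploaded += 1
            refs.append(digest)             # otherwise only a reference is added
        self.file_chunks[name] = refs
        return uploaded

    def read_file(self, name):
        return b"".join(self.blobs[digest] for digest in self.file_chunks[name])


store = ChunkStore()
store.add_file("a.txt", [b"hello", b"world"])  # uploads 2 chunks
store.add_file("b.txt", [b"hello", b"there"])  # uploads 1; b"hello" is reused
```

Because the hash is checked before the chunk is stored, the duplicate chunk is neither transferred nor stored a second time.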

21
Q

how can Sync Service achieve better response time?

A

by transmitting less data btwn clients and Cloud Storage.

Use a differencing algorithm: instead of transmitting entire files from clients to the server, we can transmit just the difference between 2 versions of a file.

22
Q

Sync Service main functions

A
  • Desktop clients communicate w/ Sync Service to either get updates from or send updates to Cloud Storage and other users.
  • it receives and carries out metadata update requests, then notifies all subscribed users or devices about the update.
  • we should use communication middleware between clients and Sync Service.
23
Q

Message Queuing Service for message-based communication between clients and the Synchronization Service

A
  • asynchronous
  • must handle large volume
  • highly available, reliable, scalable
  • 2 types of queues:
    1. Request Q
      - shared by all clients
      - Clients’ requests to update the Metadata DB are sent to the Request Q first; from there the Sync Service picks them up to update the metadata.
    2. Response Qs
      - each client has its own Q
      - responsible for delivering the update messages to each client
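The request/response queue layout can be sketched in-process as follows. This is only an illustration using Python's `queue.Queue`, not a real message broker; the `SyncService` class, its methods, and the `(path, version)` message shape are assumptions.

```python
import queue


class SyncService:
    """Toy Sync Service: one shared request queue, one response queue per client."""

    def __init__(self):
        self.request_q = queue.Queue()  # shared by all clients
        self.response_qs = {}           # client id -> that client's own queue
        self.metadata = {}              # file path -> latest version (stand-in for the DB)

    def register(self, client_id):
        self.response_qs[client_id] = queue.Queue()

    def submit(self, client_id, path, version):
        """Called by a client: enqueue a metadata update request."""
        self.request_q.put((client_id, path, version))

    def process_one(self):
        """Take one request, update metadata, and notify every other client."""
        sender, path, version = self.request_q.get()
        self.metadata[path] = version
        for client_id, response_q in self.response_qs.items():
            if client_id != sender:
                response_q.put((path, version))


service = SyncService()
for client in ("A", "B", "C"):
    service.register(client)
service.submit("A", "notes.txt", 2)
service.process_one()
print(service.response_qs["B"].get_nowait())  # B's own queue now holds the update
```

Giving each client its own response queue means a message is removed only when that client has consumed it, so a slow or offline client does not affect delivery to the others.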
24
Q

Vertical Partitioning:

A

store tables related to 1 feature on 1 server. ex: we can store all the user-related tables in one db and all files/chunks-related tables in another db.

this is straightforward to implement but
- easily run into scaling issues (we may need to store trillions of chunks, but our db can’t store that much)
- Joining 2 tables in two separate databases can cause performance and consistency issues.

25
Q

dropbox caching

A

2 types - 1 for Block storage, 1 for metadata db

We can use an off-the-shelf solution like Memcached that can store whole chunks with their respective IDs/hashes. Before hitting Block storage, Block servers can quickly check whether the cache has the desired chunk. Use LRU eviction.
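To illustrate LRU eviction, here is a small in-process cache keyed by chunk hash. It is a sketch built on `collections.OrderedDict`, not the Memcached API; the class name `LRUChunkCache` and the capacity counted in chunks are assumptions.

```python
from collections import OrderedDict


class LRUChunkCache:
    """Keeps at most `capacity` chunks, evicting the least recently used one."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()  # chunk hash -> chunk bytes, oldest first

    def get(self, chunk_hash):
        if chunk_hash not in self._items:
            return None  # cache miss: the caller falls back to Block storage
        self._items.move_to_end(chunk_hash)  # mark as most recently used
        return self._items[chunk_hash]

    def put(self, chunk_hash, data):
        self._items[chunk_hash] = data
        self._items.move_to_end(chunk_hash)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used entry


cache = LRUChunkCache(capacity=2)
cache.put("h1", b"chunk-1")
cache.put("h2", b"chunk-2")
cache.get("h1")              # h1 becomes the most recently used
cache.put("h3", b"chunk-3")  # evicts h2, the least recently used
```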

26
Q

dropbox security

A

we will be storing the permissions of each
file in our metadata DB to reflect what files are visible or modifiable by any user.
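One way such a permission lookup could work, as a sketch only: a dictionary stands in for the permissions table in the metadata DB, and the `"view"`/`"edit"` labels, the key layout, and the `can` helper are all hypothetical.

```python
# (file id, user id) -> set of granted permissions; stand-in for a metadata DB table.
permissions = {
    ("f1", "alice"): {"view", "edit"},
    ("f1", "bob"): {"view"},
}


def can(user_id, file_id, action):
    """True if the metadata grants `action` on the file to the user."""
    return action in permissions.get((file_id, user_id), set())


print(can("bob", "f1", "view"))  # bob may see the file
print(can("bob", "f1", "edit"))  # but has not been granted edit rights
```

Users with no entry for a file get an empty permission set, so access is denied by default.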

27
Q

messaging app non functional requirements

A
  1. Users should have real-time chat experience with minimum latency.
  2. highly consistent - same chat history on all devices.
  3. High availability is desirable, but consistency is more important.