System Design - Tweet Search Flashcards
Functional Requirements
Users should be able to search for tweets by keyword.
Users should be able to sort by either Popularity (likes) or Chronologically (time).
Users should be able to access results in pages, up to 100.
Non-Functional Requirements
- The system must be fast, median queries should return in < 500ms.
- The system must support a high volume of requests.
- New tweets must be searchable in a short amount of time, < 30s.
- All tweets must be discoverable, especially old or unpopular tweets may take longer to return.
- The system should be highly available.
Scale Estimations
Twitter has 100M users.
Assume user produces 1 tweet/day and Likes 10 tweets/day. Assume 100k seconds in a day.
Number of writes:
Tweets created: 100M * 1 tweet/day / (100k seconds/day) = 1k tweets/second
Likes created: 100M * 10 likes/day / (100k seconds/day) = 10k likes/second
Noteworthy here is that the number of likes is significantly more than the number of tweet creations. Now let’s look at reads.
Searches: 100M * 1 search/day / (100k seconds/day) = 1k searches/second
While this is a lot (and may burst the 10x this value or more), our system is write-heavy vs read-heavy.
Next let’s evaluate the storage requirements of the system. First let’s assume Twitter is 10 years old and that the full tweet metadata is 1kb (probably an overestimate).
Tweets searchable: 100M tweets/day * 365 days/year * 10 years = 365B tweets
Raw size: 36B tweets * 1kb/tweet = 365 TB
Things we need to address. Strategy
1) Indexing our tweets so we can quickly search for them by keyword.
2) Handling the large volume of writes to our system.
3) How to deal with pagination.
API - System Interface
GET /search?query=keywords&sort=SORT&page=5
How to quickly find Tweets(Posts) for a certain keyword?
ElasticSearch or a full-text index from relational DB. Need to study how this works
Alternatively:
Reverse index.
Keyword => Post IDs
One index can be an array (for time sorted stuff)
Another index can be a sorted set, which is essentially a heap.
Need some service that quickly populates this
Large volume of writes for these reverse indexes. Can do an approximate Like count (only update once every 4 likes for a tweet)
How to handle large number of writes to Tweets datastore?
Use a fast key-value store for storing Tweets like Cassandra.
For Likes, we don’t update the reverse index (sorted sets) every time someone Likes a tweet. Instead we updated it every N times (perhaps every 4 Likes or 10 Likes). This gives an approximate sorted result.
Later, we can re-query for the actual number of likes from the top 1000, do a real sort then return the result.
How to deal with Pagination?
When we’re sorting chronologically and working backwards, the timestamp of the last tweet tells us everything we need to know about where we are within the results. With this approach we can re-use the search result cache for pagination - we simply extract N (where N is the page size) records with timestamps less than our input “cursor”.
Challenges
We’ll need something like the result cache (above) to handle results sorted by likes.