Facebook Newsfeed Flashcards
Requirements
Functional
Non-Functional
Functional:
A user can upload/download photos/videos.
A user can follow other users/groups/pages.
User’s Newsfeed is generated from the posts by the users/groups/pages that the user followed.
A user may follow many friends.
Feeds may contain images/videos/text.
Advanced: we should support adding new posts as they arrive, for active users.
Non-functional:
Maximum latency for user should be 2s, p50 200ms.
Availability > Consistency. But ideally, new posts should take <5s for active users to see.
Reliability - uploaded content never lost.
Estimation
Requests per second?
Storage estimates?
Assume 300M DAU that make 5 requests per day.
That is 1.5B requests per day i.e.
1.5 * 109 / 86400 = 1.5 * 10^9 / 0.86 * 10^5 = 17.5K QPS.
Storage: Assume 500 posts stored per newsfeed. Each post is 1KB. = 500kb per feed. * 300M DAU = 15 * 10^4 * 10^6 = 150B kb = 150TB
Assume a server can hold 100GB in-memory, that’s 1500 machines
API
How would client get the news feed?
Something like:
get_user_feed(user_id, start, end, count):
- start, end: default None, but we can use start if client wants to fetch newer items. use end for use case of scrolling deeper into history.
- count: diff clients can determine max items. Perhaps default is 200, more scrolling is smaller.
DB Design
What tables/relationships?
Starting with SQL:
User table
- id, name, email, dob, last login
Entity table (page, group) - name, type (tinyint), desc, create date, email etc
UserFollow (User follows user/page/group)
- user id, followingid, type (tinyint - is following another user or entity)
FeedItem
- id, created, userid, entityid, location, content (varchar 256), likes, media_id
Media
- type (img, video etc), desc, path, location, created
FeedItemMedia
- feed item id, media id
Steps to generate Newsfeed (naive and improved)? Risks?
- Fetch all users/entities that this user follows.
- Get latest feed items for these.
- Rank them (by popularity, relevance?)
High latency to fetch all following users and all their posts and order them by time/popularity.
#1 We can improve this by storing the news feed for each user in the DB. This can just be a list of user id -> all the feed item ids we want in the feed.
#2 Cache the feed. The cache can have a map + linked list. It points to current head and tail of the feed and tracks "most_recent_item" timestamp. As a linked list, it is fast to add/remove feed items and also fetch middle of the feed if user is scrolling past first page.
How are feed updates published to users? Pros/cons?
Would these new posts be cached?
- It depends. We could consider only updating cached feed when the feed is requested. The server would check “most_recent” timestamp on feed and update with newer items from the db layer (which could have its own FIFO cache). This would save space and prevent work updating cache feeds that won’t be pulled for a long time.
Pull or Push model…
Pull model: Clients can pull on a cadence or when user manually refreshes.
Risks:
- You create a copy of the post for each follower and it sits in cache memory (if you cache it). If you don’t cache, you increase latency for users who haven’t pulled in a while and have many feed updates.
- Many pulls may have empty responses (no updates), wastes resources and client battery/bandwidth.
Push model (fan-out): Immediately send to followers via long-polling or web-sockets.
Pros/Cons:
- (+) Fewer reads to server / cache / DB.
- Have to maintain long-polling (websockets or push notifications are better)
- Celebrity users can cause a ton of updates at once.
Hybrid: limit push updates to conserve resources. e.g. 1) don’t push celebrity posts, or 2) only push to online/active followers. Both require the posts service to have logic to know what notifications to queue.