sdi 6 Flashcards
yelp scale estimation
assume 500M places, 100K qps, 20% growth in the number of places and QPS each year
yelp db schema
4 tables: locations, reviews, location photos, review photos
how many bytes are the following?
- ID
- name of a location
- lat/long
- description/review
- category/rating
- 8 bytes
- 256 bytes
- 8 bytes
- 512 bytes
- 1 byte
1 byte = 8 bits; 32 bits means we can have 2^32 unique IDs
yelp api definition
search(api_dev_key, search_terms, user_location, radius_filter,
maximum_results_to_return,
category_filter, sort, page_token)
category_filter (string): Optional category to filter search results, e.g., Restaurants, Shopping Centers
sort (number): Optional sort mode: Best matched (0 - default), Minimum distance (1), Highest rated (2).
Returns: (JSON)
A JSON containing information about a list of businesses matching the search query. Each result entry
will have the business name, address, category, rating, and thumbnail.
How will we build a QuadTree?
We will start with one node that will represent the whole world in
one grid. Since it will have more than 500 locations, we will break it down into four nodes and
distribute locations among them. We will keep repeating this process with each child node until there
are no nodes left with more than 500 locations.
How will we find the grid for a given location?
We will start with the root node and search
downward to find our required node/grid. At each step, we will see if the current node we are visiting
has children. If it has, we will move to the child node that contains our desired location and repeat this
process. If the node does not have any children, then that is our desired node.
How will we find neighboring grids of a given grid?
We can keep a pointer in each node to access its parent. We
can keep expanding our search for neighboring grids by going up through the parent pointers.
another option is to connect all leaf nodes with a doubly linked list.
yelp - search workflow
We will first find the node that contains the user’s location. If that node has enough desired places, we can return them to the user. If not, we will keep expanding to the neighboring nodes until either we find the required number of places or exhaust our search based on the maximum radius.
yelp - data sharding
- based on region (problems w/ hot regions and dense regions)
- better option: hash function will map each LocationID to a server where we will store that place.
yelp - basic system design considerations
we need to store and index each dataset (places, reviews, etc.). For users to query this massive database, the indexing should be read-efficient, since when searching for nearby places, users expect to see the results in real time. Given that the location of a place doesn’t change that often, we don’t need to worry about frequent updates of the data.
yelp - How would we insert a new Place into our system?
we need to insert it into the databases as well as in the QuadTree. If our tree resides on 1 server, it is easy to add a new Place, but if the QuadTree is distributed among different servers, first we need to find the grid/server of the new Place and then add it there.
How can we return most popular places within a given radius?
Let’s assume we keep track of the
overall popularity of each place. An aggregated number can represent this popularity in our system. We will store this number in the database as well as in the QuadTree. While searching for the
top 100 places within a given radius, we can ask each partition of the QuadTree to return the top 100 places with maximum popularity. Then the aggregator server can determine the top 100 places among all the places returned by different partitions.
uber backend - how do we handle drivers’ frequently changing locations?
Since all active drivers report their location every 3s, there’ll be a lot more updates than querying on our QuadTree. We can keep the latest position reported by each driver in a HT and update our QT less frequently. Let’s assume we guarantee that a driver’s current location will be reflected in the
QT within 15s.
How would “Request Ride” use case work?
- The customer requests.
- An Aggregator server takes the request and asks QuadTree servers to return nearby drivers.
- Aggregator server collects all results and sorts them by ratings.
- Aggregator sends a notification to the top (say 3) drivers; whoever accepts the request first will be assigned. If none of the 3 respond, Aggregator asks the next 3 on the list.
- Once a driver accepts a request, the customer is notified.
uber backend - Do we need to repartition a grid as soon as it reaches the maximum limit?
We can have a cushion to let each grid grow a little bigger beyond the limit before we decide to partition it. Let’s say our grids can grow/shrink an extra 10% before we partition/merge them. This should decrease the load for a grid
partition or merge on high traffic grids.