sdi 5 Flashcards
tweet search api call
search(api_dev_key, search_terms, maximum_results_to_return, sort, page_token)
sort (number): Optional sort mode: Latest first (0 - default), Best matched (1), Most liked (2).
page_token (string): This token will specify a page in the result set that should be returned.
Returns: A JSON containing info about a list of tweets matching the search. Each result can have the user ID & name, tweet text, tweet ID, creation time, number of likes, etc.
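A minimal sketch of what calling this API could look like over HTTP. The function mirrors the signature above, but the base URL, endpoint path, and the `requests`-based client wrapper are assumptions for illustration only.

```python
import requests

def search(api_dev_key, search_terms, maximum_results_to_return=20,
           sort=0, page_token=None):
    """sort: 0 = latest first (default), 1 = best matched, 2 = most liked."""
    params = {
        "api_dev_key": api_dev_key,
        "search_terms": search_terms,
        "maximum_results_to_return": maximum_results_to_return,
        "sort": sort,
    }
    if page_token:
        params["page_token"] = page_token
    # assumed endpoint; the real service URL is not specified in these notes
    resp = requests.get("https://api.example.com/v1/search", params=params)
    resp.raise_for_status()
    # JSON with a list of tweets: user ID & name, tweet text, tweet ID,
    # creation time, number of likes, etc.
    return resp.json()
```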
tweet search high level design
We need to store all the statuses (tweets) in a database and also build an index that keeps track of which word appears in which tweet. This index will help us quickly find the tweets that users are searching for.
tweet search indexing
- the index should tell us which words appear in which tweet objects. With roughly 300K English words plus 200K nouns (names, places, etc.) at 5 characters per word, we need about 500K × 5 bytes ≈ 2.5 MB of memory to store all the words.
- we may want to keep in memory only the index for tweets from the past two years
- our index would be like a big distributed hash table, where ‘key’ would be the word and ‘value’ will
be a list of TweetIDs of all those tweets which contain that word.
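A minimal in-memory sketch of that hash table: the key is a word, the value is a list of TweetIDs containing it. The whitespace tokenization and the single-process dictionary are simplifying assumptions; the real index would be distributed across many servers.

```python
from collections import defaultdict

index = defaultdict(list)  # word -> [TweetID, ...]

def index_tweet(tweet_id, text):
    # index each distinct word of the tweet
    for word in set(text.lower().split()):
        index[word].append(tweet_id)

def search_word(word):
    return index.get(word.lower(), [])

index_tweet(101, "distributed systems are fun")
index_tweet(102, "search systems need an index")
print(search_word("systems"))  # -> [101, 102]
```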
tweet search data sharding
- we can shard based on words (get hash of each word); to find all tweets containing a specific word we have to query only that server. Issues: hot words (lots of queries on that server) and words that appear in many tweets (non-uniform distribution of words)
- alternatively, we can hash the TweetID and index all the words of that tweet on the corresponding server. While querying for a word, we query all the servers, and each returns a set of TweetIDs. A centralized server aggregates these results to return to the user.
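A sketch of the TweetID-based sharding scheme: each index server holds the words of the tweets hashed to it, and a query fans out to every server and unions the returned TweetID sets. The server count and in-process classes are assumptions standing in for real networked services.

```python
NUM_SERVERS = 4

class IndexServer:
    def __init__(self):
        self.index = {}  # word -> set of TweetIDs

    def add(self, tweet_id, text):
        for word in set(text.lower().split()):
            self.index.setdefault(word, set()).add(tweet_id)

    def query(self, word):
        return self.index.get(word.lower(), set())

servers = [IndexServer() for _ in range(NUM_SERVERS)]

def index_tweet(tweet_id, text):
    # shard by TweetID: all words of this tweet live on one server
    servers[hash(tweet_id) % NUM_SERVERS].add(tweet_id, text)

def search(word):
    # "centralized" aggregator: query all servers and union the results
    result = set()
    for s in servers:
        result |= s.query(word)
    return result
```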
tweet search - What if both primary and secondary servers die at the same time?
We have to allocate a new server and rebuild the same index on it.
Our Index-Builder server can hold a HT where ‘key’ is index server number and ‘value’ is a HashSet containing all the TweetIDs being kept at that index server.
Whenever an index server has to rebuild itself, it can ask the Index-Builder server for all the tweets it needs to store and then fetch those tweets to build the index. This approach will surely be fast. We should also have a replica of the Index-Builder server for fault tolerance.
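A sketch of the Index-Builder bookkeeping described above: a hash table from index-server number to the set of TweetIDs stored there, used to rebuild a server that lost both its primary and secondary copies. The `fetch_tweet` accessor into the tweet database is an assumed helper.

```python
from collections import defaultdict

class IndexBuilder:
    def __init__(self):
        # 'key' is the index server number, 'value' is the set of TweetIDs kept there
        self.server_to_tweets = defaultdict(set)

    def record(self, server_id, tweet_id):
        self.server_to_tweets[server_id].add(tweet_id)

    def tweets_for(self, server_id):
        return self.server_to_tweets[server_id]

def rebuild_index_server(server_id, builder, fetch_tweet, index_server):
    # fetch_tweet(tweet_id) -> tweet text is an assumed lookup into the tweet store
    for tweet_id in builder.tweets_for(server_id):
        index_server.add(tweet_id, fetch_tweet(tweet_id))
```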
- avg size of an HTML page? ~100 KB
- avg size of metadata for each page? ~500 bytes
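A back-of-the-envelope calculation using those two numbers; the total page count of 15 billion is only an illustrative assumption, not a figure from these notes.

```python
pages = 15_000_000_000          # assumed number of pages to store
page_size = 100 * 1024          # ~100 KB of HTML per page
metadata_size = 500             # ~500 bytes of metadata per page

total_bytes = pages * (page_size + metadata_size)
print(f"{total_bytes / 1e15:.2f} PB")  # roughly 1.5 PB
```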
how do search engines use web crawlers?
Web crawlers systematically browse webpages to learn what each page on the website is about, so this information can be indexed, updated and retrieved when a user makes a search query.
web crawler high level design
it takes a list of seed URLs as its input, then
1. Pick a URL from the unvisited URL list.
2. Establish a connection to host to download the corresponding document.
3. Parse the document contents to look for new URLs.
4. Add the new URLs to the list of unvisited URLs.
5. Process the downloaded document, e.g., store it or index its contents, etc.
6. Go back to step 1 and repeat.
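A minimal single-threaded sketch of that loop. Fetching, parsing, and storage are stubbed with simple helpers (`requests` plus a naive href regex), so this is an illustration of the six steps, not a production design.

```python
import re
import requests

def crawl(seed_urls, max_pages=100):
    unvisited = list(seed_urls)   # the unvisited URL list, seeded as input
    visited = set()
    store = {}                    # processed/stored documents

    while unvisited and len(visited) < max_pages:
        url = unvisited.pop(0)                         # 1. pick a URL
        if url in visited:
            continue
        try:
            doc = requests.get(url, timeout=5).text    # 2. download the document
        except requests.RequestException:
            continue
        visited.add(url)
        new_urls = re.findall(r'href="(https?://[^"]+)"', doc)      # 3. parse for new URLs
        unvisited.extend(u for u in new_urls if u not in visited)   # 4. add them to the list
        store[url] = doc                               # 5. store/index the contents
    return store                                       # 6. loop until the frontier is empty
```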
web crawler BFS vs DFS
usually BFS. However, DFS is used sometimes; ex: if your crawler has already established a connection with the website, it might just DFS all the URLs within this website to save some handshaking overhead.
Path-ascending crawling:
Path-ascending crawling can help discover isolated resources, or resources for which no inbound link would have been found with regular crawling of a particular site. In this scheme, a crawler ascends to every path in each URL that it intends to crawl. Ex: when given a seed URL of http://foo.com/a/b/page.html, it will attempt to crawl /a/b/, /a/, and /.
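A small sketch of generating those ancestor paths from a seed URL; the helper name is made up for illustration.

```python
from urllib.parse import urlparse, urlunparse

def ascend_paths(url):
    parts = urlparse(url)
    segments = [s for s in parts.path.split("/") if s]
    paths = ["/"]
    # every ancestor directory of the URL's path, up to the site root
    for i in range(1, len(segments)):
        paths.append("/" + "/".join(segments[:i]) + "/")
    return [urlunparse((parts.scheme, parts.netloc, p, "", "", "")) for p in paths]

print(ascend_paths("http://foo.com/a/b/page.html"))
# ['http://foo.com/', 'http://foo.com/a/', 'http://foo.com/a/b/']
```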
There are two important characteristics of the Web that make crawling a very difficult task:
- Large volume of web pages: the crawler can only download a fraction of the web at any time, so it is critical that it be intelligent enough to prioritize what it downloads.
- Web pages change very frequently: by the time the crawler downloads the last page from a site, earlier pages may have changed or new pages may have been added.
A bare minimum crawler needs at least these components:
- URL frontier: stores the list of URLs to download and prioritizes which URLs should be crawled first.
- HTTP Fetcher: To retrieve a web page from the server.
- Extractor: To extract links from HTML documents.
- Duplicate Eliminator: makes sure the same content is not extracted twice.
- Datastore: To store retrieved pages, URLs, and other metadata.
web crawler - URL frontier
contains all the URLs that remain to be downloaded.
Since we have a huge list of URLs to crawl, we can distribute our URL frontier onto multiple servers. Let’s assume:
- on each server we have multiple worker threads performing the crawling tasks.
- we have a hash function mapping each URL to the server responsible for crawling it.
keep in mind politeness requirements.
web crawler - politeness requirements
- Our crawler should not overload a server by downloading a lot of pages from it.
- We should not have multiple machines connecting to the same web server.
our crawler can have a collection of distinct FIFO sub-queues on each server. Each worker thread has its own sub-queue. When a new URL needs to be added, the sub-queue in which it is placed will be determined by the URL’s canonical hostname, using a hash function.
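A sketch of those per-worker FIFO sub-queues: every URL from the same canonical hostname hashes to the same sub-queue, so only one worker thread (and hence one connection at a time) talks to a given web server. The number of workers is an arbitrary assumption.

```python
from collections import deque
from urllib.parse import urlparse
import hashlib

NUM_WORKERS = 8
sub_queues = [deque() for _ in range(NUM_WORKERS)]  # one FIFO sub-queue per worker thread

def enqueue(url):
    host = urlparse(url).hostname or ""
    # a stable hash of the canonical hostname picks the sub-queue
    idx = int(hashlib.md5(host.encode()).hexdigest(), 16) % NUM_WORKERS
    sub_queues[idx].append(url)

def dequeue(worker_id):
    q = sub_queues[worker_id]
    return q.popleft() if q else None
```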
how do we handle the large size of our url frontier?
The frontier could contain hundreds of millions of URLs, so we need to store most of them on disk.
- Enqueue buffer, once filled, will be dumped to the disk.
- Dequeue buffer will keep a cache of URLs that need to be visited; it can periodically read from disk to fill itself.
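A sketch of that disk-backed frontier: an enqueue buffer that is dumped to disk once full, and a dequeue buffer that refills itself from disk. A flat text file stands in for whatever on-disk structure would actually be used.

```python
import os

class DiskBackedFrontier:
    def __init__(self, path="frontier.txt", buffer_size=1000):
        self.path = path
        self.buffer_size = buffer_size
        self.enqueue_buffer = []
        self.dequeue_buffer = []

    def add(self, url):
        self.enqueue_buffer.append(url)
        if len(self.enqueue_buffer) >= self.buffer_size:   # dump to disk once filled
            with open(self.path, "a") as f:
                f.writelines(u + "\n" for u in self.enqueue_buffer)
            self.enqueue_buffer.clear()

    def next(self):
        if not self.dequeue_buffer:                        # periodically refill from disk
            if os.path.exists(self.path):
                with open(self.path) as f:
                    lines = f.read().splitlines()
                self.dequeue_buffer = lines[:self.buffer_size]
                with open(self.path, "w") as f:
                    f.writelines(u + "\n" for u in lines[self.buffer_size:])
            else:
                # nothing spilled to disk yet; drain the in-memory enqueue buffer
                self.dequeue_buffer = self.enqueue_buffer[:]
                self.enqueue_buffer.clear()
        return self.dequeue_buffer.pop(0) if self.dequeue_buffer else None
```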