Web Search Flashcards
What 2 characteristics make retrieval of information from the web a hard task?
- The large and distributed volume of data available
- The fast pace of change
What are the 2 main types of challenges posed by the web?
- Data-centric problems related to the data itself
- Interaction-centric problems related to the user and their interactions
What are some examples of data-centric challenges?
- Distributed data
- Large volumes
- Unstructured and redundant data
- Quality of data
- High percentage of volatile data
What are some examples of interaction-centric challenges?
- Expressing a query
- Interpreting results
- User key challenge: to conceive a good query
- System key challenge: to do a fast search and return only relevant answers even to poor queries
What are some characteristics of HTML pages?
- Most HTML pages do not comply with HTML specifications so browsers fill in the gaps
- HTML pages are small and contain few images
- The average number of external pages pointing to a page is close to zero
- Most referenced sites are the search engines
How can we handle the challenges associated with web search?
- Scalability: use parallel indexing and searching with MapReduce
- Low quality info: spam detection and robust ranking
- Dynamic pages: real-time crawling
- Search accuracy: link analysis & multi-feature ranking
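The parallel-indexing idea above can be sketched as a MapReduce-style job: a map phase emits (term, doc_id) pairs, and a reduce phase groups them into posting lists. This is a minimal single-process illustration with made-up documents, not a real distributed implementation:

```python
from collections import defaultdict

# Toy document collection (hypothetical content).
docs = {1: "web search is hard", 2: "search engines crawl the web"}

def map_phase(doc_id, text):
    # Map: emit one (term, doc_id) pair per token.
    return [(term, doc_id) for term in text.split()]

def reduce_phase(pairs):
    # Reduce: group doc ids by term to form posting lists.
    index = defaultdict(set)
    for term, doc_id in pairs:
        index[term].add(doc_id)
    return index

pairs = [p for d, t in docs.items() for p in map_phase(d, t)]
index = reduce_phase(pairs)
```

In a real cluster, the map and reduce phases would each run on many machines in parallel, with the framework shuffling pairs by term between them.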
What are the key components of a search engine?
- Web crawling and indexing
- Query processing
- Ranking
- Snippet generation
- Displaying top results and snippets
What is web crawling and indexing?
Crawling the web to discover and download web pages and indexing the content of those pages to create a local database containing necessary info for fast retrieval
What is query processing?
Processing an entered query against the local indexed copies of web pages
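A minimal sketch of conjunctive (AND) query processing against an inverted index, assuming a hypothetical index mapping terms to sets of doc ids:

```python
# Hypothetical inverted index: term -> posting list of doc ids.
index = {"web": {1, 2, 3}, "search": {1, 3}, "engine": {3}}

def process_query(query, index):
    # Intersect the posting lists of all query terms;
    # a term missing from the index contributes an empty list.
    postings = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()
```

For example, `process_query("web search", index)` returns only the documents containing both terms.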
What is ranking?
The search engine applies its ranking algorithm to the indexed documents to determine the relevance of each document
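One simple ranking scheme is tf-idf scoring: each document is scored by how often it contains the query terms, weighted by how rare those terms are in the collection. This toy sketch (with made-up documents) stands in for the far more elaborate multi-feature ranking real engines use:

```python
import math
from collections import Counter

docs = {1: "web search engines rank web pages",
        2: "ranking uses link analysis",
        3: "web pages link to other pages"}

def tf_idf_rank(query, docs):
    # Rank doc ids by the summed tf-idf of the query terms.
    n = len(docs)
    tokenized = {d: t.split() for d, t in docs.items()}
    # Document frequency: how many docs contain each term.
    df = Counter(term for toks in tokenized.values() for term in set(toks))
    scores = {}
    for d, toks in tokenized.items():
        tf = Counter(toks)
        scores[d] = sum((tf[t] / len(toks)) * math.log(n / df[t])
                        for t in query.split() if df.get(t))
    return sorted(scores, key=scores.get, reverse=True)
```

Here document 1 ranks first for the query "web" because it mentions the term twice.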
What is snippet generation?
Search engine generating short summaries of the content within the document
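A minimal snippet generator can simply return a window of words around the first occurrence of a query term (real engines use far more sophisticated query-biased summarization):

```python
def make_snippet(text, query, window=4):
    # Return a short window of words centred on the first query-term match,
    # falling back to the document's opening words if nothing matches.
    words = text.split()
    terms = set(query.lower().split())
    for i, w in enumerate(words):
        if w.lower().strip(".,") in terms:
            start = max(0, i - window)
            return " ".join(words[start:i + window + 1]) + " ..."
    return " ".join(words[:2 * window]) + " ..."
```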
What is a crawler/spider/robot?
Software agents that traverse the web copying pages
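At its core, a crawler is a graph traversal: fetch a page, record it, and enqueue its unseen out-links. This sketch runs breadth-first over a toy in-memory link graph (hypothetical page names) instead of making real network requests:

```python
from collections import deque

# Toy link graph standing in for the web (hypothetical pages).
links = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
}

def crawl(seed, links):
    # Breadth-first traversal from the seed page.
    seen, frontier, order = {seed}, deque([seed]), []
    while frontier:
        page = frontier.popleft()
        order.append(page)            # "download" the page
        for out in links.get(page, []):
            if out not in seen:       # avoid re-crawling pages
                seen.add(out)
                frontier.append(out)
    return order
```

A production crawler adds politeness delays, robots.txt handling, URL normalization, and a distributed frontier, but the visited-set-plus-queue structure is the same.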
What is the centralized crawler-indexer architecture?
- Pages are crawled, stored in a central repository and indexed
- Index is used in a centralized fashion to answer queries
- Only a summarized or abstract representation of the content needs to be indexed
What are the main problems faced by centralized crawler-indexer architecture and what was the solution?
- Gathering the data
- Large volume
The solution was to distribute and parallelize computation