Web Search Flashcards

Question 1

Q

What 2 characteristics make retrieval of information from the web a hard task?

Answer

A

The large and distributed volume of data available
The fast pace of change

Question 2

Q

What are the 2 main types of challenges posed by the web?

Answer

A

Data-centric problems related to the data itself
Interaction-centric problems related to the user and their interactions

Question 3

Q

What are some examples of data-centric challenges?

Answer

A

Distributed data
Large volumes
Unstructured and redundant data
Quality of data
High percentage of volatile data

Question 4

Q

What are some examples of interaction-centric challenges?

Answer

A

Expressing a query
Interpreting results
User key challenge: to conceive a good query
System key challenge: to do a fast search and return only relevant answers even to poor queries

Question 5

Q

What are some characteristics of HTML pages?

Answer

A

Most HTML pages do not comply with HTML specifications so browsers fill in the gaps
HTML pages are small and contain few images
The average number of external pages pointing to a page is close to 0
The most referenced sites are the search engines

Question 6

Q

How can we handle the challenges associated with web search?

Answer

A

Scalability: use parallel indexing and searching with MapReduce
Low-quality information: spam detection and robust ranking
Dynamic pages: real-time crawling
Search accuracy: link analysis & multi-feature ranking
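
Link analysis can be illustrated with a small sketch. The card does not name a specific algorithm, so PageRank is used here as one well-known example; the function name `pagerank`, the damping value 0.85, the iteration count, and the toy graph are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank; links maps each page to the pages it links to."""
    # Include pages that only appear as link targets.
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share (the "random jump").
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "c" is linked from every other page.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
```

In this toy graph the page with the most incoming links ("c") ends up with the highest score and the page nobody links to ("d") with the lowest, which is the intuition behind using links as a ranking feature.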

Question 7

Q

What are the key components of a search engine?

Answer

A

Web crawling and indexing
Query processing
Ranking
Snippet generation
Displaying top results and snippets

Question 8

Q

What is web crawling and indexing?

Answer

A

Crawling the web to discover and download web pages, then indexing the content of those pages to build a local database that holds the information needed for fast retrieval
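
One common shape for that local database is an inverted index, which maps each term to the pages containing it. The sketch below assumes the pages have already been downloaded into a dict; `tokenize`, `build_index`, and the sample pages are hypothetical names and data, not taken from the deck.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Lowercase and keep runs of letters and digits as terms.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each term to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term].add(url)
    return index


# Hypothetical downloaded pages: URL -> page text.
pages = {
    "p1": "Web search engines crawl the web.",
    "p2": "A crawler downloads pages.",
}
index = build_index(pages)
```

Looking up a term is then a single dictionary access rather than a scan over every stored page.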

Question 9

Q

What is query processing?

Answer

A

Processing an entered query against the local indexed copies of web pages
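
A minimal sketch of this step, assuming the index is an inverted index (term to set of page IDs) and that a query means "all terms must appear". Both assumptions, along with the name `search` and the sample index, are illustrative only.

```python
def search(index, query):
    """Return the pages that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Fetch the posting set for each term; unknown terms match nothing.
    postings = [index.get(term, set()) for term in terms]
    # Intersect starting from the smallest set to keep the work low.
    postings.sort(key=len)
    result = set(postings[0])
    for posting in postings[1:]:
        result &= posting
    return result


# Hypothetical local index: term -> pages containing it.
index = {
    "web": {"p1", "p2"},
    "search": {"p1", "p3"},
    "crawler": {"p2"},
}
hits = search(index, "web search")
```

Note that the query never touches the live web; it is answered entirely from the local index.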

Question 10

Q

What is ranking?

Answer

A

The search engine applies its ranking algorithm to the indexed documents to determine how relevant each document is to the query
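
The card does not say which ranking algorithm is used. As one possible illustration, the sketch below scores documents with a simple TF-IDF weighting; the function name `rank`, the exact formula, and the sample documents are assumptions, and real engines combine many more features.

```python
import math
from collections import Counter


def rank(docs, query):
    """Return document IDs ordered from most to least relevant."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n = len(tokenized)
    scores = {}
    for doc_id, tokens in tokenized.items():
        if not tokens:
            scores[doc_id] = 0.0
            continue
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            # Document frequency: how many documents contain the term.
            df = sum(1 for toks in tokenized.values() if term in toks)
            if df == 0:
                continue
            # Rare terms get a larger weight than common ones.
            idf = math.log(n / df) + 1.0
            score += (tf[term] / len(tokens)) * idf
        scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical indexed documents.
docs = {
    "d1": "web search engines rank web pages",
    "d2": "cooking pasta at home",
    "d3": "search tips",
}
ordering = rank(docs, "web search")
```

Here "d1" matches both query terms, "d3" matches one, and "d2" matches none, so they come back in that order.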

Question 11

Q

What is snippet generation?

Answer

A

The search engine generates a short summary of each document's content
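
One simple way to build such a summary is to show a window of words around the first query term found in the document. This is only a sketch of that idea; `make_snippet`, the `width` parameter, and the sample text are hypothetical.

```python
def make_snippet(text, query, width=5):
    """Return a few words of context around the first matching query term."""
    words = text.split()
    terms = {t.lower() for t in query.split()}
    for i, word in enumerate(words):
        # Ignore trailing punctuation when comparing against query terms.
        if word.lower().strip(".,;:!?") in terms:
            start = max(0, i - width)
            end = min(len(words), i + width + 1)
            prefix = "... " if start > 0 else ""
            suffix = " ..." if end < len(words) else ""
            return prefix + " ".join(words[start:end]) + suffix
    # No query term found: fall back to the opening words of the document.
    return " ".join(words[: 2 * width + 1])


text = ("Search engines crawl the web and build an index "
        "so that queries can be answered quickly.")
snippet = make_snippet(text, "index", width=2)
# snippet == "... build an index so that ..."
```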

Question 12

Q

What is a crawler/spider/robot?

Answer

A

Software agents that traverse the web copying pages
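
The traversal can be sketched as a breadth-first walk over links. To keep the example runnable without network access, the "web" below is a hypothetical in-memory dict and `fetch` is an injected function; `crawl`, `FAKE_WEB`, and `max_pages` are illustrative names, and a real crawler would also need politeness rules, robots.txt handling, and error handling.

```python
from collections import deque


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl from seed; returns a dict of URL -> page content."""
    frontier = deque([seed])   # URLs waiting to be visited
    seen = {seed}              # URLs already queued, to avoid revisiting
    repository = {}            # local copies of downloaded pages
    while frontier and len(repository) < max_pages:
        url = frontier.popleft()
        page = fetch(url)
        if page is None:
            # Broken link or unreachable page: skip it.
            continue
        content, links = page
        repository[url] = content
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return repository


# Hypothetical tiny web: URL -> (content, outgoing links).
FAKE_WEB = {
    "a": ("home", ["b", "c"]),
    "b": ("about", ["a", "d"]),
    "c": ("news", []),
    "d": ("contact", ["missing"]),
}
copied = crawl("a", FAKE_WEB.get)
```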

Question 13

Q

What is the centralized crawler-indexer architecture?

Answer

A

Pages are crawled, stored in a central repository and indexed
Index is used in a centralized fashion to answer queries
Only a summarized or abstract representation of the content needs to be indexed

Question 14

Q

What are the main problems faced by centralized crawler-indexer architecture and what was the solution?

Answer

A

Gathering the data
Large volume
The solution was to distribute and parallelize computation
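
A rough sketch of what distributing the indexing work might look like, in a map/reduce style: each shard of pages is indexed independently, then the partial indexes are merged. The names `map_shard` and `merge` and the sample shards are assumptions, and the shards are processed sequentially here only to keep the example simple; in a real system each shard would be handled by a different machine.

```python
from collections import defaultdict
from functools import reduce


def map_shard(shard):
    """Build a partial inverted index for one shard of pages."""
    partial = defaultdict(set)
    for url, text in shard.items():
        for term in text.lower().split():
            partial[term].add(url)
    return partial


def merge(left, right):
    """Combine two partial indexes into one."""
    merged = defaultdict(set)
    for part in (left, right):
        for term, urls in part.items():
            merged[term] |= urls
    return merged


# Hypothetical shards: each would live on a separate worker.
shards = [
    {"p1": "web search", "p2": "web crawler"},
    {"p3": "search ranking"},
]
index = reduce(merge, map(map_shard, shards))
```

Because each shard is indexed without looking at the others, adding more workers lets the system handle more pages in the same amount of time.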