Web Search Flashcards

1
Q

What 2 characteristics make retrieval of information from the web a hard task?

A
  1. The large and distributed volume of data available
  2. The fast pace of change
2
Q

What are the 2 main types of challenges posed by the web?

A
  1. Data-centric problems related to the data itself
  2. Interaction-centric problems related to the user and their interactions
3
Q

What are some examples of data-centric challenges?

A
  1. Distributed data
  2. Large volumes
  3. Unstructured and redundant data
  4. Quality of data
  5. Percentage of volatile data
4
Q

What are some examples of interaction-centric challenges?

A
  1. Expressing a query
  2. Interpreting results
  3. User key challenge: to conceive a good query
  4. System key challenge: to do a fast search and return only relevant answers even to poor queries
5
Q

What are some characteristics of HTML pages?

A
  1. Most HTML pages do not comply with the HTML specifications, so browsers fill in the gaps
  2. HTML pages are small and contain few images
  3. The average number of external pages pointing to a page is close to 0
  4. Most referenced sites are the search engines
6
Q

How can we handle the challenges associated with web search?

A
  1. Scalability: use parallel indexing and searching with MapReduce
  2. Low quality info: spam detection and robust ranking
  3. Dynamic pages: real-time crawling
  4. Search accuracy: link analysis & multi-feature ranking
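The scalability point can be illustrated with a small MapReduce-style index build. This is only a sketch under stated assumptions: the names `map_phase`, `reduce_phase`, and `build_index` are hypothetical, and everything runs sequentially in one process, whereas a real MapReduce job would run the map calls in parallel on many machines.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Map step: emit one (term, doc_id) pair per word in a document."""
    return [(term, doc_id) for term in text.lower().split()]

def reduce_phase(pairs):
    """Reduce step: group document ids by term into sorted posting lists."""
    postings = defaultdict(set)
    for term, doc_id in pairs:
        postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def build_index(docs):
    """Run map over every document, then reduce the emitted pairs.

    Each map_phase call depends only on its own document, which is
    what would allow the calls to be spread across workers.
    """
    pairs = []
    for doc_id, text in docs.items():
        pairs.extend(map_phase(doc_id, text))
    return reduce_phase(pairs)

# Hypothetical toy collection
docs = {1: "web search engines", 2: "web crawlers copy pages"}
index = build_index(docs)
```

Here `index["web"]` lists both documents, while `index["pages"]` lists only document 2.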
7
Q

What are the key components of a search engine?

A
  1. Web crawling and indexing
  2. Query processing
  3. Ranking
  4. Snippet generation
  5. Displaying top results and snippets
8
Q

What is web crawling and indexing?

A

Crawling the web to discover and download web pages and indexing the content of those pages to create a local database containing necessary info for fast retrieval
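As a rough illustration of the indexing half of this answer, the sketch below builds a small inverted index from pages that are assumed to be already downloaded. The page URLs, the `tokenize` helper, and the `index_pages` function are hypothetical names chosen for the example.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def index_pages(pages):
    """Build term -> {url: term frequency} from downloaded pages."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term][url] = index[term].get(url, 0) + 1
    return dict(index)

# Hypothetical local copies of two crawled pages
pages = {
    "a.example/home": "Search engines index the web.",
    "a.example/about": "Crawlers download web pages; web pages change.",
}
index = index_pages(pages)
```

Looking up a term in `index` now costs a single dictionary access instead of a scan over every stored page, which is the "fast retrieval" the answer refers to.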

9
Q

What is query processing?

A

Processing an entered query against the local indexed copies of web pages
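One simple way to picture this is a conjunctive (AND) query answered by intersecting posting lists. The sketch assumes an index that maps each term to a list of document ids; `process_query` and the toy index are hypothetical, and real engines support far richer query semantics.

```python
def process_query(query, index):
    """Return ids of documents that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return []
    # A term missing from the index contributes an empty posting set
    postings = [set(index.get(term, ())) for term in terms]
    return sorted(set.intersection(*postings))

# Hypothetical index: term -> document ids
index = {"web": [1, 2, 3], "search": [1, 3], "crawler": [2]}
results = process_query("web search", index)
```

The query never touches the live web: it is answered entirely from the local index.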

10
Q

What is ranking?

A

Applying the search engine's ranking algorithm to the indexed documents to determine how relevant each document is to the query
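To make the idea concrete, here is a minimal sketch that scores documents with TF-IDF. TF-IDF is used only as a familiar example of a ranking function, not as what any particular engine does; the `rank` function and the toy documents are hypothetical.

```python
import math
from collections import Counter

def rank(query, docs):
    """Order documents by a simple TF-IDF score for the query terms."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(docs)
    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            # Document frequency: how many documents contain the term
            df = sum(1 for toks in tokenized.values() if term in toks)
            if df:
                score += tf[term] * math.log(1 + n_docs / df)
        scores[doc_id] = score
    matching = [doc_id for doc_id in scores if scores[doc_id] > 0]
    return sorted(matching, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical toy collection
docs = {"d1": "web search ranking", "d2": "web web crawler", "d3": "cooking recipes"}
ranked = rank("web crawler", docs)
```

Here `d2` comes first because it matches both query terms, `d1` follows with one match, and `d3` is dropped because it matches none.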

11
Q

What is snippet generation?

A

Generating a short summary of each document's content to show alongside the result
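A very simple snippet strategy is to show a window of words around the first query term found in the document. The sketch below assumes that strategy; `make_snippet` and its `width` parameter are hypothetical, and production systems typically use more elaborate passage selection.

```python
def make_snippet(text, query, width=5):
    """Return up to `width` words on each side of the first query-term hit."""
    words = text.split()
    terms = {term.lower() for term in query.split()}
    # Index of the first word matching a query term; fall back to the start
    hit = next(
        (i for i, word in enumerate(words)
         if word.lower().strip(".,;:!?") in terms),
        0,
    )
    start = max(0, hit - width)
    end = min(len(words), hit + width + 1)
    prefix = "... " if start > 0 else ""
    suffix = " ..." if end < len(words) else ""
    return prefix + " ".join(words[start:end]) + suffix

text = "Crawlers traverse the web and copy pages so that the engine can index them later."
snippet = make_snippet(text, "index", width=2)
```

The ellipses mark where the document text was cut off on either side of the window.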

12
Q

What is a crawler/spider/robot?

A

Software agents that traverse the web copying pages
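The traversal can be sketched as a breadth-first walk over links. To keep the example runnable offline, the "web" here is a hypothetical in-memory dictionary (`FAKE_WEB`) and `fetch` is any callable that returns `(text, links)` or `None`. A real crawler would also need HTTP fetching, robots.txt handling, and politeness delays, none of which are shown.

```python
from collections import deque

def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl from `seed`, storing a copy of each page's text."""
    frontier = deque([seed])
    seen = {seed}
    store = {}
    while frontier and len(store) < max_pages:
        url = frontier.popleft()
        page = fetch(url)
        if page is None:  # unreachable or missing page
            continue
        text, links = page
        store[url] = text
        for link in links:
            if link not in seen:  # avoid revisiting pages
                seen.add(link)
                frontier.append(link)
    return store

# Hypothetical in-memory web: url -> (text, outgoing links)
FAKE_WEB = {
    "a": ("page a", ["b", "c"]),
    "b": ("page b", ["a", "d"]),
    "c": ("page c", []),
}
store = crawl("a", FAKE_WEB.get)
```

The link to `"d"` is followed but yields nothing, because `FAKE_WEB.get` returns `None` for pages that do not exist.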

13
Q

What is the centralized crawler-indexer architecture?

A
  1. Pages are crawled, stored in a central repository and indexed
  2. Index is used in a centralized fashion to answer queries
  3. Only a summarized or abstract representation of the content needs to be indexed
14
Q

What are the main problems faced by centralized crawler-indexer architecture and what was the solution?

A
  1. Gathering the data
  2. Large volume
  Solution: distribute and parallelize the computation
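One way to picture distributing the computation is to partition the documents into shards and search the shards concurrently, then merge the partial results. The sketch below uses threads in a single process purely for illustration; `search_shard`, `parallel_search`, and the toy shards are hypothetical, and a real deployment would place each shard on a separate machine.

```python
from concurrent.futures import ThreadPoolExecutor

def search_shard(shard, term):
    """Return ids of documents in one shard that contain the term."""
    return [doc_id for doc_id, text in shard.items()
            if term in text.lower().split()]

def parallel_search(shards, term):
    """Query every shard concurrently and merge the partial results."""
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        partials = pool.map(lambda shard: search_shard(shard, term), shards)
        return sorted(doc_id for part in partials for doc_id in part)

# Hypothetical document-partitioned collection
shards = [
    {1: "web search", 2: "cooking"},
    {3: "web crawler", 4: "search engine"},
]
hits = parallel_search(shards, "web")
```

Each shard only sees its own documents, so adding shards spreads both the storage and the query work.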