19 Web Search Flashcards
Web search: Challenges compared to ‘traditional’ IR
- Scale (billions of web pages)
- Heterogeneous content
- Trust
- Web users different from “traditional” IR users
- Business aspects (sponsored search)
Types of web queries
- Informational [~50%]: General info on a topic. Ex: Italian cuisine, Britney Spears' family life
- Navigational [~20%]: Search for a specific entity. Ex: University of Oslo in Norway
- Transactional [~30%]: Want to do something. Ex: Car rental from Gardermoen
Web queries: Important aspects
Precision often more important than recall
- Especially precision on top results
- Necessary to filter untrusted pages/spam
- Need to consider other qualities than relevance (trustworthiness, recency of content)
- Recall only matters if the number of matches is very small
The query language must be lightweight (mostly phrase queries)
The web graph
We refer to the hyperlinks pointing to a page as its in-links (the in-degree), and the out-degree of a web page is the number of links out of it.
Strongly connected: A set of pages is strongly connected if every page can be reached from every other page by following hyperlinks. If there is a pair of pages such that one cannot proceed from one page of the pair to the other by following hyperlinks, the pages are not strongly connected.
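A minimal sketch on a toy adjacency-list graph (the pages A-D are made up) showing how in-degree, out-degree and strong connectivity can be checked:

```python
# Toy web graph as an adjacency list (hypothetical pages A-D).
graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": [],          # D has no out-links
}

def out_degree(page):
    return len(graph.get(page, []))

def in_degree(page):
    return sum(page in targets for targets in graph.values())

def reachable(start):
    """Pages reachable from start by following hyperlinks (DFS)."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return seen

# The graph is strongly connected iff every page can reach every other page.
strongly_connected = all(reachable(p) == set(graph) for p in graph)

print(out_degree("A"), in_degree("C"), strongly_connected)  # 2 2 False
```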
Spamdexing
Manipulation of web content to appear artificially high on search results for particular keywords.
Common spamming techniques:
- Keyword stuffing, invisible text
- Cloaking: Server returns fake content to web crawlers
- Doorways: dummy start page carefully crafted for keywords
- Optimisation of metadata on the page
Counter measures:
- Exploit “quality signals” (from the web and from users) to determine whether a web page is trustworthy
- Analysis of the web graph to detect suspicious linkages
- Machine learning to classify spam
- Editorial intervention (blacklists etc.)
Web crawling and indexing
Web crawling is the process by which we gather pages from the web in order to index them and support a search engine. The objective of crawling is to quickly and efficiently gather as many useful web pages as possible.
A crawler must be:
Robust
The web contains servers that create spider traps, which are generators of web pages that mislead crawlers into getting stuck fetching an infinite number of pages in a particular domain.
Polite
Avoid flooding servers and only crawl allowed pages
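A minimal sketch of such a politeness check, using Python's standard urllib.robotparser to respect robots.txt; the site, the "MyCrawler" user agent and the 2-second delay are assumptions for illustration:

```python
import time
import urllib.robotparser

# Hypothetical site; reading robots.txt tells us which pages we may crawl.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

def fetch_allowed(url, user_agent="MyCrawler"):
    """Return True only if robots.txt permits fetching this URL."""
    return rp.can_fetch(user_agent, url)

CRAWL_DELAY = 2.0        # assumed seconds between successive requests to a host
last_request_time = 0.0

def wait_politely():
    """Sleep so that successive requests to the host are spaced out."""
    global last_request_time
    elapsed = time.time() - last_request_time
    if elapsed < CRAWL_DELAY:
        time.sleep(CRAWL_DELAY - elapsed)
    last_request_time = time.time()
```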
A crawler should be
Distributed
Should have the ability to execute in a distributed fashion across multiple machines
Scalable
The architecture should permit scaling up the crawl rate by adding extra machines and bandwidth
Quality
Must be biased toward fetching useful pages.
The fetched page is parsed to extract both the text and the links from the page. The extracted text is fed to a text indexer. The extracted URLs/links are then added to a URL frontier, which at all times consists of URLs whose corresponding pages have yet to be fetched by the crawler. This process may be viewed as traversing the web graph.
Crawling architecture
- The URL frontier, containing URLs yet to be fetched in the current crawl
- A DNS resolution module that determines the web server from which to fetch the page specified by a URL
- A parsing module that extracts the text and set of links from a fetched web page
- A duplicate elimination module that determines whether an extracted link is already in the URL frontier or has recently been fetched.
A crawler should crawl web pages that are of high quality and/or frequently updated more often than other pages.
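How these modules fit together can be sketched as a minimal single-threaded crawl loop; fetch_page, extract_text_and_links and index_text are hypothetical stand-ins for the fetcher (DNS resolution + HTTP), the parsing module and the text indexer:

```python
from collections import deque

def crawl(seed_urls, fetch_page, extract_text_and_links, index_text, max_pages=1000):
    """Minimal single-threaded crawl loop over the modules described above."""
    frontier = deque(seed_urls)      # URL frontier: URLs not yet fetched
    seen = set(seed_urls)            # duplicate-elimination module
    fetched = 0

    while frontier and fetched < max_pages:
        url = frontier.popleft()
        page = fetch_page(url)       # DNS resolution + HTTP fetch
        if page is None:
            continue
        text, links = extract_text_and_links(page)   # parsing module
        index_text(url, text)        # feed extracted text to the indexer
        fetched += 1
        for link in links:
            # only add links that are neither queued nor already fetched
            if link not in seen:
                seen.add(link)
                frontier.append(link)
```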
URL frontier
The URL frontier has a set of front queues and back queues to implement prioritization and politeness. Its goals are to ensure that
- Only one connection is open to any host
- A waiting time of a few seconds occurs between successive requests to a host
- High-priority pages are crawled preferentially
The two major modules are a set of F front queues and a set of B back queues, both FIFO queues. The front queues implement the prioritization and the back queues the politeness.
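A sketch of such a frontier, under the simplifying assumptions of a single thread, a fixed per-host delay and priority given directly as a front-queue index; the class name and parameters are illustrative, not part of any particular implementation:

```python
import heapq
import time
from collections import deque
from urllib.parse import urlparse

class URLFrontier:
    """Sketch of a frontier with F front queues (prioritization) and
    back queues holding URLs of a single host each (politeness),
    plus a heap of earliest-allowed fetch times per host."""

    def __init__(self, num_front=3, delay=2.0):
        self.front = [deque() for _ in range(num_front)]  # F FIFO front queues, 0 = highest priority
        self.back = {}            # host -> FIFO back queue (one host per queue)
        self.heap = []            # (next allowed fetch time, host)
        self.delay = delay

    def add(self, url, priority=0):
        self.front[priority].append(url)

    def _refill_back(self):
        # Move URLs from the highest-priority non-empty front queue
        # into the back queue of their host.
        for queue in self.front:
            while queue:
                url = queue.popleft()
                host = urlparse(url).netloc
                if host not in self.back:
                    self.back[host] = deque()
                    heapq.heappush(self.heap, (time.time(), host))
                self.back[host].append(url)
            if self.back:
                return

    def next_url(self):
        """Return the next URL whose host may be contacted again."""
        if not self.back:
            self._refill_back()
        if not self.heap:
            return None
        ready_at, host = heapq.heappop(self.heap)
        wait = ready_at - time.time()
        if wait > 0:
            time.sleep(wait)              # politeness: space out requests per host
        url = self.back[host].popleft()
        if self.back[host]:
            heapq.heappush(self.heap, (time.time() + self.delay, host))
        else:
            del self.back[host]
        return url
```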
Web indexing
Two types of index partitioning:
- Partitioning by terms: Index terms are divided into subsets, and each subset is allocated to a node.
  Pro: Greater concurrency
  Cons: Must exchange and merge long posting lists (PLs) across nodes
- Partitioning by documents: Each node is responsible for a local index for a subset of all documents (the query is sent to each node and the results are merged back).
  Pro: Often easier to distribute, more efficient I/O on posting lists
  Cons: More disk seeks, need to calculate global statistics separately
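A toy sketch of query processing under document partitioning, with made-up local indexes and raw term frequency standing in for the real scoring function:

```python
# Hypothetical local indexes: each "node" holds postings for its own documents.
node_indexes = [
    {"web": {"d1": 3, "d2": 1}, "search": {"d1": 2}},   # node 0
    {"web": {"d7": 1}, "crawler": {"d7": 4, "d9": 2}},   # node 1
]

def query_node(index, term):
    """Local scoring on one node: here simply the raw term frequency."""
    return index.get(term, {})

def distributed_query(term, k=3):
    """Send the query to every node and merge the partial results."""
    merged = {}
    for index in node_indexes:
        merged.update(query_node(index, term))
    # Return the k best documents over all partitions.
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)[:k]

print(distributed_query("web"))   # [('d1', 3), ('d2', 1), ('d7', 1)]
```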
Link analysis
The study of link analysis builds on two intuitions:
- The anchor text pointing to page B is a good description of page B
- The hyperlink from A to B represents an endorsement of page B by the creator of page A. Therefore, good nodes will tend to point to good nodes, and bad nodes to bad nodes.
PageRank
The most well-known algorithm for ranking the quality of web pages according to their link structure (used by Google). It assigns a numerical score between 0 and 1 to each page, known as its PageRank. Given a web-search query, the search engine computes a score for each web page in which, among other factors, its PageRank is used.
A web surfer can switch to a new node (web page) in two ways:
- By clicking on a hyperlink on a node to the new node
- “Teleport” from A to B, for example by typing a URL into the browser. All web pages are equally likely: if N is the total number of nodes in the web graph, the teleport operation takes the surfer to each node with probability 1/N.
In assigning a PageRank score to each node, we use the teleport operation in two ways:
- When at a node with no out-links, the surfer invokes the teleport operation
- At any node that has out-links, the surfer invokes the teleport operation with probability α and a standard random walk (clicking on one of the hyperlinks at random) with probability 1 - α, where 0 < α < 1 is a fixed parameter chosen in advance, typically 0.1.
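A simulation sketch of this random surfer on a toy graph (α = 0.1 and the graph are assumptions); the visit frequencies approximate the PageRank scores:

```python
import random

# Toy web graph; page D has no out-links, so the surfer always teleports from it.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []}
ALPHA = 0.1  # teleport probability

def simulate_surfer(steps=100_000):
    pages = list(graph)
    visits = {p: 0 for p in pages}
    current = random.choice(pages)
    for _ in range(steps):
        visits[current] += 1
        out_links = graph[current]
        if not out_links or random.random() < ALPHA:
            current = random.choice(pages)          # teleport to a uniformly random page
        else:
            current = random.choice(out_links)      # follow a random hyperlink
    # Visit frequencies approximate the PageRank scores.
    return {p: count / steps for p, count in visits.items()}
```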
Markov Chain
A random walk on the web graph can be represented with a Markov chain, given by an N x N transition probability matrix P. Every entry is a number between 0 and 1, and every row sums to 1. The entry Pij tells us the probability that the state at the next time step is j, conditioned on the current state being i (i = row, j = column). Each entry Pij is known as a transition probability and depends only on the current state i. This is known as the Markov property.
How to compute the matrix P
There is one state for each web page, and each transition probability represents the probability of moving from one web page to another. Let A be the adjacency matrix: if there is a hyperlink from page i to page j, then Aij = 1; otherwise Aij = 0. We can derive P from A as follows. If a row of A has no 1s, replace each element of that row by 1/N. For all other rows, proceed as follows:
- Divide each 1 in A by the number of 1s in its row.
- Multiply the resulting matrix by 1 - α, and add α/N to every entry of the resulting matrix, to obtain P.
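A sketch that follows this recipe on a toy adjacency matrix and then finds the steady-state distribution (the PageRank vector) by repeated multiplication; the value of α and the example matrix are assumptions:

```python
import numpy as np

ALPHA = 0.1  # teleport probability (the parameter alpha above)

def transition_matrix(A):
    """Build P from the 0/1 adjacency matrix A following the recipe above."""
    A = np.asarray(A, dtype=float)
    N = A.shape[0]
    P = np.empty_like(A)
    for i in range(N):
        row_sum = A[i].sum()
        if row_sum == 0:
            P[i] = 1.0 / N                          # no out-links: teleport uniformly
        else:
            P[i] = (1 - ALPHA) * A[i] / row_sum + ALPHA / N
    return P

def pagerank(P, iterations=100):
    """Power iteration: the PageRank vector is the steady state of P."""
    N = P.shape[0]
    x = np.full(N, 1.0 / N)                         # start from the uniform distribution
    for _ in range(iterations):
        x = x @ P
    return x

A = [[0, 1, 1],
     [0, 0, 1],
     [1, 0, 0]]
P = transition_matrix(A)
print(P.sum(axis=1))   # every row of P sums to 1
print(pagerank(P))     # one score per page; the scores sum to 1
```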