19 Web Search Flashcards

1
Q

Web search: Challenges compared to ‘traditional’ IR

A
  • Scale (billions of web pages)
  • Heterogeneous content
  • Trust
  • Web users different from “traditional” IR users
  • Business aspects (sponsored search)
2
Q

Types of web queries

A
  1. Informational [~50%]: General info on a topic
    Ex: Italian cuisine, Britney Spears' family life
  2. Navigational [~20%]: Search for a specific entity
    Ex: University of Oslo in Norway
  3. Transactional [~30%]: Want to do something
    Ex: Car rental from Gardermoen
3
Q

Web queries: Important aspects

A

Precision often more important than recall

  • Especially precision on top results
  • Necessary to filter untrusted pages/spam
  • Need to consider other qualities than relevance (trustworthiness, recency of content)
  • Recall only matters if the number of matches is very small

Query language must be lightweight (mostly phrase queries)

4
Q

The web graph

A

We refer to the hyperlinks into a page as its in-links, and its in-degree is the number of such links; the out-degree of a web page is the number of links out of it.

Strongly connected: A web graph is strongly connected if one can proceed from any page to any other page by following hyperlinks. If there are pairs of pages such that one cannot get from one page of the pair to the other in this way, the graph is not strongly connected.
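
A minimal Python sketch (page names invented) of in-degree and out-degree computed from an adjacency-list view of a small web graph:

```python
from collections import defaultdict

# tiny invented web graph, stored as an adjacency list of out-links
out_links = {
    "A": ["B", "C"],   # page A links to B and C
    "B": ["C"],
    "C": ["A"],
    "D": [],           # D has no out-links
}

out_degree = {page: len(links) for page, links in out_links.items()}

in_degree = defaultdict(int)
for page in out_links:
    in_degree[page]            # ensure every page appears, even with in-degree 0
for links in out_links.values():
    for target in links:
        in_degree[target] += 1

print(out_degree)              # {'A': 2, 'B': 1, 'C': 1, 'D': 0}
print(dict(in_degree))         # {'A': 1, 'B': 1, 'C': 2, 'D': 0}
```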

5
Q

Spamdexing

A

Manipulation of web content to appear artificially high in search results for particular keywords.

Common spamming techniques:

  • Keyword stuffing, invisible text
  • Cloaking: Server returns fake content to web crawlers
  • Doorways: dummy start page carefully crafted for keywords
  • Optimisation of metadata on the page

Counter measures:

  • Exploit “quality signals” (from the web and from users) to determine whether a webpage is trustworthy
  • Analysis of the web graph to detect suspicious linkages
  • Machine learning to classify spam
  • Editorial intervention (blacklists etc.)
6
Q

Web crawling and indexing

A

Crawling is the process by which we gather pages from the web in order to index them and support a search engine. The objective of crawling is to quickly and efficiently gather as many useful web pages as possible.

7
Q

A crawler must be:

A

Robust
The web contains servers that create spider traps, which are generators of web pages that mislead crawlers into getting stuck fetching an infinite number of pages in a particular domain.

Polite
Avoid flooding servers and only crawl allowed pages
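
A minimal sketch of these politeness rules, using Python's standard urllib.robotparser to honour robots.txt; the user-agent name and delay value are invented for illustration:

```python
import time
from urllib import robotparser
from urllib.parse import urlparse

USER_AGENT = "ExampleCrawler"      # hypothetical crawler name
CRAWL_DELAY = 2.0                  # seconds between requests to one host

last_fetch_time = {}               # host -> timestamp of previous request

def allowed(url):
    """Check the host's robots.txt before fetching (only crawl allowed pages)."""
    parts = urlparse(url)
    rp = robotparser.RobotFileParser()
    rp.set_url(parts.scheme + "://" + parts.netloc + "/robots.txt")
    rp.read()
    return rp.can_fetch(USER_AGENT, url)

def wait_politely(url):
    """Space out successive requests to the same host (avoid flooding servers)."""
    host = urlparse(url).netloc
    elapsed = time.time() - last_fetch_time.get(host, 0.0)
    if elapsed < CRAWL_DELAY:
        time.sleep(CRAWL_DELAY - elapsed)
    last_fetch_time[host] = time.time()
```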

8
Q

A crawler should be

A

Distributed
Should have the ability to execute in a distributed fashion across multiple machines.

Scalable
The architecture should permit scaling up the crawl rate by adding extra machines and bandwidth

Quality
Must be biased toward fetching useful pages. The fetched page is parsed, to extract both the text and the links from the page. The extracted text is fed to a text indexer. The extracted URLs/links are then added to a URL frontier, which at all times consists of URLs whose corresponding pages have yet to be fetched by the crawler. This process may be viewed as traversing the web graph.

9
Q

Crawling architecture

A
  1. The URL frontier, containing URLs yet to be fetched in the current crawl
  2. A DNS resolution module that determines the web server from which to fetch the page specified by a URL
  3. A parsing module that extracts the text and set of links from a fetched web page
  4. A duplicate elimination module that determines whether an extracted link is already in the URL frontier or has recently been fetched.

A crawler should crawl web pages that are of high quality and/or frequently updated more often than other pages.
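
A rough Python sketch of the loop these modules form; fetch(), extract_links() and index_text() are hypothetical stand-ins for the DNS/HTTP module, the parsing module and the text indexer:

```python
from collections import deque

def crawl(seed_urls, max_pages=1000):
    frontier = deque(seed_urls)   # URL frontier: URLs not yet fetched
    seen = set(seed_urls)         # duplicate elimination: URLs already queued

    pages_fetched = 0
    while frontier and pages_fetched < max_pages:
        url = frontier.popleft()
        html = fetch(url)                  # DNS resolution + HTTP fetch (stand-in)
        if html is None:
            continue
        text, links = extract_links(html)  # parsing module (stand-in)
        index_text(url, text)              # feed extracted text to the indexer (stand-in)
        for link in links:
            if link not in seen:           # duplicate URL elimination
                seen.add(link)
                frontier.append(link)      # traverse the web graph
        pages_fetched += 1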

10
Q

URL frontier

A

The URL frontier has a set of front queues and a set of back queues to take politeness and prioritization into consideration. Its goals are to ensure that

  1. Only one connection is open to any host
  2. A waiting time of a few seconds occurs between successive requests to a host
  3. High-priority pages are crawled preferentially

The two major modules are a set of F front queues and a set of B back queues, both FIFO queues. The front queues implement the prioritization and the back queues the politeness (sketched below).
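
A heavily simplified Python sketch of such a frontier, assuming F front FIFO queues ordered by priority, B back FIFO queues, and a heap recording when each back queue may next be used (a full implementation would also pin every host to exactly one back queue and keep the back queues full):

```python
import heapq
import time
from collections import deque
from urllib.parse import urlparse

F, B = 3, 4                                  # number of front / back queues (invented)
front = [deque() for _ in range(F)]          # front[0] = highest priority
back = [deque() for _ in range(B)]
host_of = {}                                 # back-queue index -> host it currently serves
heap = [(0.0, i) for i in range(B)]          # (earliest allowed fetch time, back-queue id)
heapq.heapify(heap)

def add_url(url, priority):
    front[priority].append(url)              # prioritization happens in the front queues

def refill(queue_id):
    """Move a URL from the highest-priority non-empty front queue to a back queue."""
    for q in front:
        if q:
            url = q.popleft()
            host_of[queue_id] = urlparse(url).netloc
            back[queue_id].append(url)
            return

def next_url(delay=2.0):
    """Return the next URL to fetch, respecting per-host politeness."""
    ready_at, queue_id = heapq.heappop(heap)
    time.sleep(max(0.0, ready_at - time.time()))   # wait until the host may be contacted
    if not back[queue_id]:
        refill(queue_id)
    url = back[queue_id].popleft() if back[queue_id] else None
    heapq.heappush(heap, (time.time() + delay, queue_id))
    return url
```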

11
Q

Web indexing

A

Two types of index partitioning:

  1. Partitioning by terms:
    Index terms are divided into subsets, and each subset is allocated to a node
    Pro: Greater concurrency
    Con: Must exchange and merge long postings lists (PLs) across nodes
  2. Partitioning by documents (see the sketch below):
    Each node is responsible for a local index over a subset of all documents (the query is sent to each node and the results are merged back)
    Pro: Often easier to distribute, more efficient I/O on PLs
    Con: More disk seeks, need to calculate global statistics separately
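
A toy Python sketch of partitioning by documents, with invented node contents and simple AND semantics: the query goes to every node's local index and the per-node results are merged:

```python
def search_node(local_index, query_terms):
    """Score documents in one node's local index (toy AND semantics)."""
    postings = [set(local_index.get(t, [])) for t in query_terms]
    return set.intersection(*postings) if postings else set()

def distributed_search(nodes, query_terms):
    results = set()
    for local_index in nodes:                 # in practice the query is sent in parallel
        results |= search_node(local_index, query_terms)
    return results

# two nodes, each indexing a disjoint subset of documents (made up)
nodes = [
    {"web": ["d1", "d2"], "search": ["d2"]},
    {"web": ["d5"], "search": ["d5", "d7"]},
]
print(distributed_search(nodes, ["web", "search"]))   # {'d2', 'd5'}
```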
12
Q

Link analysis

A

The study of link analysis builds on two intuitions:

  1. The anchor text pointing to page B is a good description of page B
  2. The hyperlink from A to B represents an endorsement of page B by the creator of page A. Therefore, good nodes will tend to point to good nodes, and bad nodes to bad nodes.
13
Q

PageRank

A

The most well-known algorithm for ranking the quality of web pages according to their link structure (used by Google). It assigns a numerical score between 0 and 1 to each page, known as its PageRank. Given a web-search query, the search engine gives each web page a score in which, amongst other factors, PageRank is used.

A web surfer can switch to a new node (web page) in two ways:

  1. By clicking on a hyperlink on the current node that leads to the new node
  2. By “teleporting” from A to B, for example by typing a URL in the browser. All web pages are equally likely, so if N is the total number of nodes in the web graph, the teleport operation takes the surfer to each node with probability 1/N.

In assigning a PageRank score to each node, we use the teleport operation in two ways:

  1. When at a node with no out-links, the surfer invokes the teleport operation
  2. At any node that has hyperlinks (outgoing links), the surfer invokes the teleport operation with probability a and a standard random walk (clicking on one of the hyperlinks at random) with probability 1 - a, where 0 < a < 1 is a fixed parameter chosen in advance, typically 0.1.
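
A small Monte Carlo sketch of this surfer (graph, a, and step count invented): over a long walk, the fraction of visits to each node approximates its PageRank:

```python
import random

# tiny invented web graph: node -> list of nodes it links to
out_links = {0: [1, 2], 1: [2], 2: [0], 3: []}   # node 3 has no out-links
N, a, steps = 4, 0.1, 100_000

visits = [0] * N
node = 0
for _ in range(steps):
    visits[node] += 1
    if not out_links[node] or random.random() < a:
        node = random.randrange(N)               # teleport: each node with probability 1/N
    else:
        node = random.choice(out_links[node])    # follow a random hyperlink

print([v / steps for v in visits])               # approximate PageRank scores
```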
14
Q

Markov Chain

A

A random walk on the web graph can be represented by a Markov chain with an N x N transition probability matrix P. Every entry must be a number between 0 and 1, and every row must sum to 1. The entry Pij tells us the probability that the state at the next time step is j, conditioned on the current state being i (i = row, j = column). Each entry Pij is known as a transition probability and depends only on the current state i. This is known as the Markov property.
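
A tiny made-up example of such a matrix, checking that it is a valid (row-stochastic) transition matrix:

```python
# A 3-state transition probability matrix P (values invented).
# P[i][j] = probability that the next state is j, given the current state is i.
P = [
    [0.1, 0.6, 0.3],
    [0.5, 0.0, 0.5],
    [0.2, 0.2, 0.6],
]

# Markov-chain requirements: every entry in [0, 1], every row sums to 1.
assert all(0.0 <= p <= 1.0 for row in P for p in row)
assert all(abs(sum(row) - 1.0) < 1e-9 for row in P)
```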

15
Q

How to compute the matrix P

A

One state for each web page, and each transition probability represents the probability of moving from one web page to another. If there is a hyperlink from page i to page j, then Aij = 1; otherwise Aij = 0. We can derive P from A as follows. If a row of A has no 1s, then replace each element by 1/N. For all other rows, proceed as follows (sketched below):

  1. Divide each 1 in A by the number of 1s in its row.
  2. Multiply the resulting matrix by 1 - δ, then add δ/N to every entry of the resulting matrix, to obtain P.
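
A small Python sketch of this recipe; the 0/1 link matrix A below is invented, and delta stands for the teleport probability δ (the parameter called a on the PageRank card, typically 0.1):

```python
def transition_matrix(A, delta=0.1):
    """Derive the transition probability matrix P from a 0/1 link matrix A."""
    N = len(A)
    P = []
    for row in A:
        ones = sum(row)
        if ones == 0:
            # a row with no 1s (no out-links): replace each element by 1/N
            P.append([1.0 / N] * N)
        else:
            # divide each 1 by the number of 1s in its row, scale by 1 - delta,
            # then add delta/N to every entry
            P.append([(x / ones) * (1 - delta) + delta / N for x in row])
    return P

A = [
    [0, 1, 1],   # page 0 links to pages 1 and 2
    [0, 0, 1],   # page 1 links to page 2
    [0, 0, 0],   # page 2 has no out-links
]
P = transition_matrix(A)   # every row of P now sums to 1
```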
16
Q

The PageRank computation

A

The probability distribution of the surfer's position at any time is given by a probability vector x. The probability distribution at t = 1 (having started at t = 0) is given by the probability vector xP, at t = 2 by (xP)P = xP², and so on.

PageRank is query-independent (a static quality score). On the other hand, the relative ordering of pages should, intuitively, depend on the query being served. For this reason, search engines use static quality measures such as PageRank as just one of many factors in scoring a web page.
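
A sketch of the computation itself: power iteration, repeatedly applying x ← xP until the distribution settles. The matrix P below is a toy example of the kind produced by the recipe on the previous card:

```python
def pagerank(P, iterations=100):
    """Repeatedly apply x <- xP, starting from the uniform distribution."""
    N = len(P)
    x = [1.0 / N] * N                          # surfer equally likely to start anywhere
    for _ in range(iterations):
        x = [sum(x[i] * P[i][j] for i in range(N)) for j in range(N)]
    return x                                   # steady-state visit probabilities

# toy transition matrix (rows sum to 1), of the kind built on the previous card
P = [
    [1/30, 29/60, 29/60],
    [1/30, 1/30, 14/15],
    [1/3, 1/3, 1/3],
]
print(pagerank(P))   # one PageRank score per page; the scores sum to 1
```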