19 Web Search Flashcards
Web search: Challenges compared to ‘traditional’ IR
- Scale (billions of web pages)
- Heterogeneous content
- Trust
- Web users different from “traditional” IR users
- Business aspects (sponsored search)
Types of web queries
- Informational [~50%]: General info on a topic. Ex: Italian cuisine, Britney Spears' family life
- Navigational [~20%]: Search for a specific entity. Ex: University of Oslo in Norway
- Transactional [~30%]: Want to do something. Ex: Car rental from Gardermoen
Web queries: Important aspects
Precision often more important than recall
- Especially precision on top results
- Necessary to filter untrusted pages/spam
- Need to consider other qualities than relevance (trustworthiness, recency of content)
- Recall only matters if the number of matches is very small
The query language must be lightweight (mostly phrase queries)
The web graph
We refer to the hyperlinks pointing to a page as its in-links (the in-degree), and the out-degree of a web page is the number of links out of it.
Strongly connected: A set of pages is strongly connected if every page can be reached from every other page by following hyperlinks. If there is a pair of pages such that one cannot proceed from one page of the pair to the other by following hyperlinks, the pages are not strongly connected.
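A minimal sketch on a toy adjacency-list graph (the pages A-D are made up) showing how in-degree, out-degree and strong connectivity can be checked:

```python
# Toy web graph as an adjacency list (hypothetical pages A-D).
graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": [],          # D has no out-links
}

def out_degree(page):
    return len(graph.get(page, []))

def in_degree(page):
    return sum(page in targets for targets in graph.values())

def reachable(start):
    """Pages reachable from start by following hyperlinks (DFS)."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return seen

# The graph is strongly connected iff every page can reach every other page.
strongly_connected = all(reachable(p) == set(graph) for p in graph)

print(out_degree("A"), in_degree("C"), strongly_connected)  # 2 2 False
```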
Spamdexing
Manipulation of web content to appear artificially high on search results for particular keywords.
Common spamming techniques:
- Keyword stuffing, invisible text
- Cloaking: Server returns fake content to web crawlers
- Doorways: dummy start page carefully crafted for keywords
- Optimisation of metadata on the page
Counter measures:
- Exploit “quality signals” (from the web and from users) to determine whether a web page is trustworthy
- Analysis of the web graph to detect suspicious linkages
- Machine learning to classify spam
- Editorial intervention (blacklists etc.)
Web crawling and indexing
Web crawling is the process by which we gather pages from the web in order to index them and support a search engine. The objective of crawling is to quickly and efficiently gather as many useful web pages as possible.
A crawler must be:
Robust
The web contains servers that create spider traps, which are generators of web pages that mislead crawlers into getting stuck fetching an infinite number of pages in a particular domain.
Polite
Avoid flooding servers and only crawl allowed pages
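A minimal sketch of such a politeness check, using Python's standard urllib.robotparser to respect robots.txt; the site, the "MyCrawler" user agent and the 2-second delay are assumptions for illustration:

```python
import time
import urllib.robotparser

# Hypothetical site; reading robots.txt tells us which pages we may crawl.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

def fetch_allowed(url, user_agent="MyCrawler"):
    """Return True only if robots.txt permits fetching this URL."""
    return rp.can_fetch(user_agent, url)

CRAWL_DELAY = 2.0        # assumed seconds between successive requests to a host
last_request_time = 0.0

def wait_politely():
    """Sleep so that successive requests to the host are spaced out."""
    global last_request_time
    elapsed = time.time() - last_request_time
    if elapsed < CRAWL_DELAY:
        time.sleep(CRAWL_DELAY - elapsed)
    last_request_time = time.time()
```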
A crawler should be
Distributed
Should have the ability to execute in a distributed fashion across multiple machines
Scalable
The architecture should permit scaling up the crawl rate by adding extra machines and bandwidth
Quality
Must be biased toward fetching useful pages.
The fetched page is parsed to extract both the text and the links from the page. The extracted text is fed to a text indexer. The extracted URLs/links are then added to a URL frontier, which at all times consists of URLs whose corresponding pages have yet to be fetched by the crawler. This process may be viewed as traversing the web graph.
Crawling architecture
- The URL frontier, containing URLs yet to be fetched in the current crawl
- A DNS resolution module that determines the web server from which to fetch the page specified by a URL
- A parsing module that extracts the text and set of links from a fetched web page
- A duplicate elimination module that determines whether an extracted link is already in the URL frontier or has recently been fetched.
A crawler should crawl web pages that are of high quality and/or frequently updated more often than other pages.
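How these modules fit together can be sketched as a minimal single-threaded crawl loop; fetch_page, extract_text_and_links and index_text are hypothetical stand-ins for the fetcher (DNS resolution + HTTP), the parsing module and the text indexer:

```python
from collections import deque

def crawl(seed_urls, fetch_page, extract_text_and_links, index_text, max_pages=1000):
    """Minimal single-threaded crawl loop over the modules described above."""
    frontier = deque(seed_urls)      # URL frontier: URLs not yet fetched
    seen = set(seed_urls)            # duplicate-elimination module
    fetched = 0

    while frontier and fetched < max_pages:
        url = frontier.popleft()
        page = fetch_page(url)       # DNS resolution + HTTP fetch
        if page is None:
            continue
        text, links = extract_text_and_links(page)   # parsing module
        index_text(url, text)        # feed extracted text to the indexer
        fetched += 1
        for link in links:
            # only add links that are neither queued nor already fetched
            if link not in seen:
                seen.add(link)
                frontier.append(link)
```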
URL frontier
The URL frontier has a set of front queues and back queues to implement prioritization and politeness. Its goals are to ensure that
- Only one connection is open to any host
- A waiting time of a few seconds occurs between successive requests to a host
- High-priority pages are crawled preferentially
The two major modules are a set of F front queues and a set of B back queues, both FIFO queues. The front queues implement the prioritization and the back queues the politeness.
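A sketch of such a frontier, under the simplifying assumptions of a single thread, a fixed per-host delay and priority given directly as a front-queue index; the class name and parameters are illustrative, not part of any particular implementation:

```python
import heapq
import time
from collections import deque
from urllib.parse import urlparse

class URLFrontier:
    """Sketch of a frontier with F front queues (prioritization) and
    back queues holding URLs of a single host each (politeness),
    plus a heap of earliest-allowed fetch times per host."""

    def __init__(self, num_front=3, delay=2.0):
        self.front = [deque() for _ in range(num_front)]  # F FIFO front queues, 0 = highest priority
        self.back = {}            # host -> FIFO back queue (one host per queue)
        self.heap = []            # (next allowed fetch time, host)
        self.delay = delay

    def add(self, url, priority=0):
        self.front[priority].append(url)

    def _refill_back(self):
        # Move URLs from the highest-priority non-empty front queue
        # into the back queue of their host.
        for queue in self.front:
            while queue:
                url = queue.popleft()
                host = urlparse(url).netloc
                if host not in self.back:
                    self.back[host] = deque()
                    heapq.heappush(self.heap, (time.time(), host))
                self.back[host].append(url)
            if self.back:
                return

    def next_url(self):
        """Return the next URL whose host may be contacted again."""
        if not self.back:
            self._refill_back()
        if not self.heap:
            return None
        ready_at, host = heapq.heappop(self.heap)
        wait = ready_at - time.time()
        if wait > 0:
            time.sleep(wait)              # politeness: space out requests per host
        url = self.back[host].popleft()
        if self.back[host]:
            heapq.heappush(self.heap, (time.time() + self.delay, host))
        else:
            del self.back[host]
        return url
```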
Web indexing
Two types of index partitioning:
- Partitioning by terms: Index terms are divided into subsets, and each subset is allocated to a node.
  Pro: Greater concurrency
  Cons: Must exchange and merge long posting lists (PLs) across nodes
- Partitioning by documents: Each node is responsible for a local index for a subset of all documents (the query is sent to each node and the results are merged back).
  Pro: Often easier to distribute, more efficient I/O on posting lists
  Cons: More disk seeks, need to calculate global statistics separately
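A toy sketch of query processing under document partitioning, with made-up local indexes and raw term frequency standing in for the real scoring function:

```python
# Hypothetical local indexes: each "node" holds postings for its own documents.
node_indexes = [
    {"web": {"d1": 3, "d2": 1}, "search": {"d1": 2}},   # node 0
    {"web": {"d7": 1}, "crawler": {"d7": 4, "d9": 2}},   # node 1
]

def query_node(index, term):
    """Local scoring on one node: here simply the raw term frequency."""
    return index.get(term, {})

def distributed_query(term, k=3):
    """Send the query to every node and merge the partial results."""
    merged = {}
    for index in node_indexes:
        merged.update(query_node(index, term))
    # Return the k best documents over all partitions.
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)[:k]

print(distributed_query("web"))   # [('d1', 3), ('d2', 1), ('d7', 1)]
```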
Link analysis
The study of link analysis builds on two intuitions:
- The anchor text pointing to page B is a good description of page B
- The hyperlink from A to B represents an endorsement of page B by the creator of page A. Therefore, good nodes will tend to point to good nodes, and bad nodes to bad nodes.
PageRank
The most well-known algorithm for ranking the quality of web pages according to their link structure (used by Google). It assigns a numerical score between 0 and 1 to each page, known as its PageRank. Given a web-search query, the search engine computes a score for each web page in which, among other factors, its PageRank is used.
A web surfer can switch to a new node (web page) in two ways:
- By clicking on a hyperlink on a node to the new node
- “Teleport” from A to B, for example by typing a URL into the browser. All web pages are equally likely: if N is the total number of nodes in the web graph, the teleport operation takes the surfer to each node with probability 1/N.
In assigning a PageRank score to each node, we use the teleport operation in two ways:
- When at a node with no out-links, the surfer invokes the teleport operation
- At any node that has out-links, the surfer invokes the teleport operation with probability α and a standard random walk (clicking on one of the hyperlinks at random) with probability 1 - α, where 0 < α < 1 is a fixed parameter chosen in advance, typically 0.1.
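A simulation sketch of this random surfer on a toy graph (α = 0.1 and the graph are assumptions); the visit frequencies approximate the PageRank scores:

```python
import random

# Toy web graph; page D has no out-links, so the surfer always teleports from it.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []}
ALPHA = 0.1  # teleport probability

def simulate_surfer(steps=100_000):
    pages = list(graph)
    visits = {p: 0 for p in pages}
    current = random.choice(pages)
    for _ in range(steps):
        visits[current] += 1
        out_links = graph[current]
        if not out_links or random.random() < ALPHA:
            current = random.choice(pages)          # teleport to a uniformly random page
        else:
            current = random.choice(out_links)      # follow a random hyperlink
    # Visit frequencies approximate the PageRank scores.
    return {p: count / steps for p, count in visits.items()}
```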
Markov Chain
A random walk on the web graph can be represented with a Markov chain, given by an N x N transition probability matrix P. Every entry is a number between 0 and 1, and every row sums to 1. The entry Pij tells us the probability that the state at the next time step is j, conditioned on the current state being i (i = row, j = column). Each entry Pij is known as a transition probability and depends only on the current state i. This is known as the Markov property.
How to compute the matrix P
There is one state for each web page, and each transition probability represents the probability of moving from one web page to another. Let A be the adjacency matrix: if there is a hyperlink from page i to page j, then Aij = 1; otherwise Aij = 0. We can derive P from A as follows. If a row of A has no 1s, replace each element of that row by 1/N. For all other rows, proceed as follows:
- Divide each 1 in A by the number of 1s in its row.
- Multiply the resulting matrix by 1 - α, and add α/N to every entry of the resulting matrix, to obtain P.
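A sketch that follows this recipe on a toy adjacency matrix and then finds the steady-state distribution (the PageRank vector) by repeated multiplication; the value of α and the example matrix are assumptions:

```python
import numpy as np

ALPHA = 0.1  # teleport probability (the parameter alpha above)

def transition_matrix(A):
    """Build P from the 0/1 adjacency matrix A following the recipe above."""
    A = np.asarray(A, dtype=float)
    N = A.shape[0]
    P = np.empty_like(A)
    for i in range(N):
        row_sum = A[i].sum()
        if row_sum == 0:
            P[i] = 1.0 / N                          # no out-links: teleport uniformly
        else:
            P[i] = (1 - ALPHA) * A[i] / row_sum + ALPHA / N
    return P

def pagerank(P, iterations=100):
    """Power iteration: the PageRank vector is the steady state of P."""
    N = P.shape[0]
    x = np.full(N, 1.0 / N)                         # start from the uniform distribution
    for _ in range(iterations):
        x = x @ P
    return x

A = [[0, 1, 1],
     [0, 0, 1],
     [1, 0, 0]]
P = transition_matrix(A)
print(P.sum(axis=1))   # every row of P sums to 1
print(pagerank(P))     # one score per page; the scores sum to 1
```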