06: Search Engines Flashcards
Three Main Components of a Search Engine
- input agents
- database engine
- the query server
* in practice, these three components are distributed, but conceptually they can be thought of as services on the same machine
Input Agents
(Three Main Components of a Search Engine)
web crawlers that surf the WWW, requesting and downloading web pages
Database Engine
(Three Main Components of a Search Engine)
manages the URLs and the input agents in general
Components of a Search Engine
Web Crawlers
Refers to a class of software that:
- downloads pages
- identifies the hyperlinks
- adds links to a database for future crawling
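The three steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the code from the cards: `crawl`, `LinkExtractor`, the `fetch` callable, and the in-memory `FAKE_WEB` are all hypothetical names, and a simple queue stands in for the database of links.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, fetch, max_pages=10):
    """Breadth-first crawl; `fetch` is a hypothetical callable: url -> HTML text."""
    frontier = deque([seed_url])  # links waiting for a future crawl
    seen = {seed_url}             # avoids queueing the same URL twice
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)                  # 1. download the page
        visited.append(url)
        extractor = LinkExtractor()
        extractor.feed(html)               # 2. identify the hyperlinks
        for href in extractor.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:       # 3. store new links for later
                seen.add(absolute)
                frontier.append(absolute)
    return visited


# Tiny in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.com/a.html": '<a href="/b.html">B</a>',
    "http://example.com/b.html": '<a href="/">home</a>',
}
order = crawl("http://example.com/", lambda url: FAKE_WEB.get(url, ""))
```

A real crawler would replace the lambda with an HTTP request and would normally persist the frontier in a database rather than in memory.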
Code for Simplified Crawler
// namespaces implied
public partial class _Default : System.Web.UI.Page {
    protected void Page_Load(object sender, EventArgs e) {
        // content in image
    }
}
Code for Putting Web Crawler Results to a Table
Code for a Recursive Crawler
Robots Exclusion Standard
Implemented with a plain text file named robots.txt, stored at the root of the domain, that has two syntactic elements:
- user-agent we want to make a rule for (the special character * means all agents)
- one disallow directive per line to identify patterns
Regular expressions are not supported
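As an illustration, a small robots.txt can be checked with Python's standard `urllib.robotparser` module. The rules, the agent name `ExampleBot`, and the URLs below are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt: one user-agent line, one Disallow directive per line.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Disallow values are matched as plain path prefixes, not regular expressions.
allowed = parser.can_fetch("ExampleBot", "http://example.com/index.html")
blocked = parser.can_fetch("ExampleBot", "http://example.com/private/report.html")
```

Here `allowed` is `True` and `blocked` is `False`, because only the second URL's path starts with a disallowed prefix.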
Scrapers
- programs that identify certain pieces of information from the web to be stored in databases
- sometimes combined with Crawlers
Scraper Classes
- URL scrapers
- Email scrapers
- Word scrapers
- Media scrapers
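One of these classes, an email scraper, might look something like the sketch below. The regular expression is a deliberately loose assumption about what an email address looks like, and `scrape_emails` is a hypothetical helper name.

```python
import re

# Loose, illustrative pattern; real-world address syntax is more permissive.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def scrape_emails(html):
    """Return the unique email-like strings in `html`, in order of first appearance."""
    found = []
    for match in EMAIL_PATTERN.findall(html):
        if match not in found:
            found.append(match)
    return found


page = (
    '<p>Contact <a href="mailto:info@example.com">info@example.com</a> '
    "or sales@example.org.</p>"
)
emails = scrape_emails(page)
```

For this sample page, `emails` is `['info@example.com', 'sales@example.org']`; the duplicate from the `mailto:` link is dropped.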
Word Scrapers
- may want to parse out all of the text within a web page
- words are the most difficult content to parse since the tags they appear in reflect how important they are to the page overall
- words in a large font are more important than small ones at the bottom of a page
- words that appear next to one another should be linked while words that are at opposite ends of a page or sentence are less related
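The idea that the enclosing tag signals importance can be sketched as below. The weights in `TAG_WEIGHTS` are invented for illustration, and `WordScraper` is a hypothetical class; word proximity is not modelled here.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Assumed weights, for illustration only: headings count more than body text.
TAG_WEIGHTS = {"title": 5, "h1": 4, "h2": 3, "strong": 2}


class WordScraper(HTMLParser):
    """Scores each word by the heaviest tag it appears inside."""

    def __init__(self):
        super().__init__()
        self.open_tags = []      # stack of currently open tags
        self.scores = Counter()  # word -> accumulated weight

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, if there is one.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        weight = max([TAG_WEIGHTS.get(t, 1) for t in self.open_tags], default=1)
        for word in re.findall(r"[a-z']+", data.lower()):
            self.scores[word] += weight


scraper = WordScraper()
scraper.feed("<h1>Search engines</h1><p>A search engine crawls pages.</p>")
scores = scraper.scores
```

In this example "search" scores 5 (4 from the heading plus 1 from the paragraph), while "pages" scores 1.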
To understand indexing, consider what a _____ and a _____ might identify from a web page and how they might _____ it.
To understand indexing, consider what a crawler and a scraper might identify from a web page and how they might store it.
Reverse Index
- indexes the words rather than the URLs (also known as an inverted index)
- the mechanics of how this is done are not standardized
- generally, word tables are created (for every word found in pages) so that each word can be referenced by a unique integer, and indexes of these references can be built for faster searches
- demands on these indexes far exceed what a single database server can support
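A toy version of the word table and the index built on it might look like this. It is a single-process sketch under the assumption that everything fits in memory; `build_reverse_index`, `search`, and the sample pages are hypothetical.

```python
import re
from collections import defaultdict


def build_reverse_index(pages):
    """pages: dict of url -> page text. Returns (word table, index)."""
    word_ids = {}                # word table: word -> unique integer
    postings = defaultdict(set)  # word id -> URLs containing that word
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            word_id = word_ids.setdefault(word, len(word_ids))
            postings[word_id].add(url)
    return word_ids, postings


def search(term, word_ids, postings):
    """Look the word up by its integer id and return the matching URLs."""
    word_id = word_ids.get(term.lower())
    if word_id is None:
        return []
    return sorted(postings[word_id])


pages = {
    "http://example.com/a": "Web crawlers download pages",
    "http://example.com/b": "Scrapers parse pages",
}
word_ids, postings = build_reverse_index(pages)
hits = search("pages", word_ids, postings)
```

Looking up a word is then a dictionary access rather than a scan of every page, which is the point of indexing by word instead of by URL.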
PageRank
method for computing a ranking for every web page based on the graph of the web
(graph of the web = hyperlinks between web pages)
* sites with thousands of backlinks are considered more important than sites with only a handful
PageRank Definition Equation
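The equation itself is not present in the text of this card. As an assumption about what it shows, the form commonly quoted from the original PageRank description is:

```latex
PR(A) = (1 - d) + d \left( \frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)} \right)
```

Here $T_1, \dots, T_n$ are the pages that link to page $A$, $C(T_i)$ is the number of outbound links on page $T_i$, and $d$ is a damping factor between 0 and 1, often set to 0.85.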
If a page has no links to other pages, it becomes a ____, and therefore ____ the ____ ____ ____. If the random surfer arrives at a sink page, it picks another ____ at random and continues surfing again.
If a page has no links to other pages, it becomes a sink and therefore terminates the random surfing process. If the random surfer arrives at a sink page, it picks another URL at random and continues surfing again.
The PageRank Theory holds that an imaginary surfer who is randomly clicking on links will eventually ____ ____.
The PageRank Theory holds that an imaginary surfer who is randomly clicking on links will eventually stop clicking.
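The random surfer, including the jump away from sink pages, can be approximated with a short iterative computation. This is a sketch of a common normalized variant in which the ranks sum to 1, not necessarily the exact form used in the course; `pagerank`, the `damping` default of 0.85, and the four-page graph are assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict of page -> list of pages it links to (all keys of `links`)."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start with equal rank
    for _ in range(iterations):
        # Rank held by sink pages is shared equally among all pages,
        # modelling the surfer who picks another URL at random.
        sink_rank = sum(rank[p] for p in pages if not links[p])
        new_rank = {}
        for page in pages:
            incoming = sum(
                rank[src] / len(links[src]) for src in pages if page in links[src]
            )
            # (1 - damping) is the chance the surfer stops following links
            # and starts again from a random page.
            new_rank[page] = (1 - damping) / n + damping * (incoming + sink_rank / n)
        rank = new_rank
    return rank


# Hypothetical graph: D is a sink with no backlinks.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []}
ranks = pagerank(graph)
```

In this graph C ends up ranked above B because it has more backlinks, and D ranks lowest because nothing links to it.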
PageRank Algorithm Factors
Modern ranking algorithms take much more into account than simple backlinks, including:
- Search History
- Geographic Location
- Authorship
- Freshness of the pages
Search Engine Optimization (SEO)
the process a webmaster undertakes to make a website more appealing to search engines and, by doing so, increase its ranking in search results for the terms the webmaster is interested in targeting