Search Engines and Crawlers Flashcards
Vertical search
A more limited form of web search, restricted to a specific topic or type of content (e.g. only news articles or only certain file formats)
Enterprise search
Searching an organisation’s internal documents (e.g. intranet pages, email, reports)
Desktop search
Searching the files stored on a user’s own computer (e.g. documents, email, web history)
Classification
Assigning documents to pre-defined categories or labels
Ad-hoc search
Searching a largely static collection of unstructured data with arbitrary, previously unseen queries
Relevance in information retrieval
o Topical relevance and user relevance
o Retrieval models – formal models of how documents are matched and ranked against a query
o Ranking algorithms
Evaluation in information retrieval
o Precision and recall – the fraction of retrieved documents that are relevant, and the fraction of relevant documents that are retrieved (see the sketch after this list)
o Test collections
o Clickthrough and log data
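A minimal Python sketch of precision and recall for a single query; the document IDs are made up for illustration.

```python
# Precision: fraction of retrieved documents that are relevant.
# Recall: fraction of relevant documents that were retrieved.
retrieved = {"d1", "d2", "d3", "d4"}   # documents the engine returned (hypothetical IDs)
relevant = {"d2", "d4", "d7"}          # documents judged relevant for the query

hits = retrieved & relevant
precision = len(hits) / len(retrieved)
recall = len(hits) / len(relevant)

print(f"precision={precision:.2f}, recall={recall:.2f}")  # precision=0.50, recall=0.67
```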
Information needs
o Query suggestions (e.g. autofill)
o Query expansion – suggesting or adding further, potentially relevant query terms
o Relevance feedback – refining the query using documents the user has identified as relevant
What are the primary goals of a search engine?
- Effectiveness – retrieving the most relevant set of documents possible
- Efficiency – processing queries as quickly as possible
Issues in search engines
o Performance - efficient searching and indexing
o Incorporating new data
o Scalability – growing with data and users (e.g. handling large amounts of traffic)
o Adaptability – tuning for applications (e.g. adapting for use on a variety of devices)
Document statistics
Gathering and recording statistical information about words and documents
How are document statistics used?
The gathered information is stored in lookup tables and used by ranking algorithms
Lookup table
Data structure designed for quick retrieval
Weighting
Calculating weight using document statistics and storing it in a lookup table
tf.idf weighting
Weighting a term by how often it appears in a document (tf) and giving higher weights to terms that appear in very few documents (idf)
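A rough sketch of one common tf.idf variant (raw term frequency times log-scaled inverse document frequency), using a toy in-memory collection; real systems read these counts from the index and the exact formula varies.

```python
import math
from collections import Counter

docs = {  # toy collection; a real engine would use the document statistics in the index
    "d1": "fish are caught by fishing boats".split(),
    "d2": "boats sail the sea".split(),
    "d3": "fish swim in the sea".split(),
}

def tf_idf(term, doc_id):
    tf = Counter(docs[doc_id])[term]                         # term frequency in the document
    df = sum(1 for words in docs.values() if term in words)  # documents containing the term
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(docs) / df)                     # rare terms get a high idf

print(round(tf_idf("fishing", "d1"), 2))  # 1.10 - appears in only one document
print(round(tf_idf("sea", "d3"), 2))      # 0.41 - appears in two documents, so lower weight
```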
True or false: Weight is calculated during the query process
False! It can be calculated as part of the query process, but calculating during indexing makes querying more efficient
Inversion
Changing document-term info into term-document info
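A small sketch of inversion, assuming documents have already been tokenised: document-term lists are turned into a term-document index.

```python
from collections import defaultdict

doc_terms = {            # document-term information
    "d1": ["fish", "boats"],
    "d2": ["boats", "sea"],
    "d3": ["fish", "sea"],
}

inverted = defaultdict(list)             # term-document information (an inverted index)
for doc_id, terms in doc_terms.items():
    for term in terms:
        inverted[term].append(doc_id)

print(dict(inverted))
# {'fish': ['d1', 'd3'], 'boats': ['d1', 'd2'], 'sea': ['d2', 'd3']}
```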
Methods of query transformation
o Spell checking
o Query suggestion
o Suggesting additional terms via query expansion
When search results are displayed, snippets are generated to…
o Summarise retrieved documents
o Identify related groups of documents
o Highlight important words and passages (see the sketch below)
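A very simplified sketch of snippet generation: pick the sentence containing the most query terms and mark those terms; real systems score passages far more carefully.

```python
import re

def snippet(text, query_terms):
    sentences = re.split(r"(?<=[.!?])\s+", text)
    # Choose the sentence that mentions the most query terms.
    best = max(sentences, key=lambda s: sum(t.lower() in s.lower() for t in query_terms))
    # Highlight the query terms in the chosen sentence.
    for t in query_terms:
        best = re.sub(t, lambda m: f"**{m.group(0)}**", best, flags=re.IGNORECASE)
    return best

doc = "Crawlers download pages. A web crawler follows links to new pages. Indexing comes later."
print(snippet(doc, ["crawler", "links"]))  # A web **crawler** follows **links** to new pages.
```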
Document data store
A database that manages large numbers of documents and the structured data (usually metadata or links) associated with them
True or false: A document data store can be stored in a relational database
True, but some applications use other storage systems for faster retrieval
Scoring
Calculating a score for each retrieved document using a ranking algorithm
Performance optimisation
Designing ranking algorithms to decrease response time and increase query throughput
Methods of distributing ranked documents
o Query broker
o Caching the results of common search queries (see the sketch below)
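A minimal sketch of result caching; run_query stands in for the real ranking backend and is invented for this example.

```python
from functools import lru_cache

def run_query(query):
    # Placeholder for the real ranking backend (hypothetical).
    return (f"result for {query!r}",)

@lru_cache(maxsize=10_000)
def cached_search(query):
    # Repeated (common) queries are answered from the cache
    # without touching the ranking backend again.
    return run_query(query)

cached_search("fishing boats")   # computed and cached
cached_search("fishing boats")   # served from the cache
```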
Parser
Processes sequences of text tokens (usually words)
Stopping
Removes common words (stopwords) from the token stream to reduce index size
Stemming
Grouping words derived from a common stem (e.g. “fish” and “fishing”) and replacing them with a single designated form
Why is stemming used?
Words in queries and documents are more likely to match
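A toy sketch of stopping and stemming: a tiny stopword list and a crude suffix-stripping stemmer; real engines use curated stopword lists and algorithms such as the Porter stemmer.

```python
STOPWORDS = {"the", "a", "of", "to", "and", "in"}   # illustrative only

def stop(tokens):
    # Stopping: drop very common words to shrink the index.
    return [t for t in tokens if t not in STOPWORDS]

def stem(token):
    # Crude suffix stripping so related word forms share one index term.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

tokens = "the fishing boats landed a catch of fish".split()
print([stem(t) for t in stop(tokens)])
# ['fish', 'boat', 'land', 'catch', 'fish'] - "fishing" and "fish" now match
```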
Information extraction
Using syntactic analysis to identify complex index terms
Classifier
o Identify class-related metadata
o Assign labels to documents representing topic categories
o Group documents without pre-defined categories
Crawler
Follows links to web pages to discover and download new pages
Web crawling
- Client program connects to domain name system (DNS) server
- DNS server translates host name into an internet protocol (IP) address
- Program attempts to connect to computer with that IP address
- Once the connection is established, the client program sends an HTTP request to the server (see the sketch below)
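A bare-bones sketch of those steps with Python's standard library (plain HTTP only, no redirects or error handling; example.com is a stand-in host).

```python
import socket

host = "example.com"                              # stand-in host name

# 1. DNS translates the host name into an IP address.
ip_address = socket.gethostbyname(host)

# 2. Connect to the computer with that IP address (port 80 for plain HTTP).
with socket.create_connection((ip_address, 80), timeout=10) as conn:
    # 3. Once connected, send an HTTP request for a page.
    request = f"GET / HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
    conn.sendall(request.encode("ascii"))
    response = b""
    while chunk := conn.recv(4096):
        response += chunk

print(response.split(b"\r\n", 1)[0])              # status line, e.g. b'HTTP/1.1 200 OK'
```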
Politeness policies
Standards that aim to reduce a crawler’s impact on a web server’s performance. This could include a politeness window (a minimum delay between requests to the same server) or being prevented from accessing certain pages (e.g. via robots.txt)
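A minimal sketch of a politeness policy: obey robots.txt and wait out a politeness window between requests to the same host; the 5-second delay and user agent name are illustrative.

```python
import time
from urllib import robotparser

POLITENESS_WINDOW = 5.0    # seconds between requests to the same host (illustrative)
last_fetch = {}            # host -> time of the most recent request

def polite_to_fetch(host, path, user_agent="ExampleCrawler"):
    # Pages the crawler is prevented from accessing are listed in robots.txt.
    rp = robotparser.RobotFileParser(f"http://{host}/robots.txt")
    rp.read()
    if not rp.can_fetch(user_agent, f"http://{host}{path}"):
        return False
    # Politeness window: wait before hitting the same host again.
    elapsed = time.monotonic() - last_fetch.get(host, float("-inf"))
    if elapsed < POLITENESS_WINDOW:
        time.sleep(POLITENESS_WINDOW - elapsed)
    last_fetch[host] = time.monotonic()
    return True
```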
Focused crawling
Relies on web pages linking to other pages on the same topic
How does focused crawling work?
- Can use popular pages as seeds
- Use text classifiers to determine what a page is about
- If the page is on topic, keep it and use its links to find related sites
How does a focused crawler decide which pages to visit next?
- Tracking topicality of downloaded pages to determine whether to download similar pages
- Anchor text data and topicality data can be combined to determine which pages to visit next (see the sketch below)
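A rough sketch of a focused-crawl loop; is_on_topic and fetch are hypothetical placeholders for a real text classifier and HTTP fetcher.

```python
from collections import deque

def is_on_topic(page_text):
    # Stand-in for a real text classifier.
    return "fishing" in page_text.lower()

def fetch(url):
    # Stand-in for a real HTTP fetch: returns (page text, outgoing links).
    return "a page about fishing boats", []

def focused_crawl(seed_urls, limit=100):
    # Popular pages are used as seeds; on-topic pages are kept and
    # their links are used to find related sites.
    frontier, seen, kept = deque(seed_urls), set(seed_urls), []
    while frontier and len(kept) < limit:
        url = frontier.popleft()
        text, links = fetch(url)
        if is_on_topic(text):
            kept.append(url)
            frontier.extend(l for l in links if l not in seen)
            seen.update(links)
    return kept

print(focused_crawl(["http://example.com/"]))  # ['http://example.com/']
```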
Deep web
Pages that are difficult for a crawler to find
Examples of deep web pages
- Sites that require an account
- Form results (e.g. flight timetables, product search)
- Scripted pages
How can web pages become easier to find?
Sitemaps – files listing a site’s URLs that a crawler can read to discover pages it might not find by following links
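A short sketch of reading page URLs out of a sitemap with the standard library XML parser; the sitemap content is inlined here so the example runs without a network.

```python
import xml.etree.ElementTree as ET

sitemap_xml = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/</loc></url>
  <url><loc>http://example.com/products?id=42</loc></url>
</urlset>"""

ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(sitemap_xml)
urls = [loc.text for loc in root.findall("sm:url/sm:loc", ns)]
print(urls)   # pages the crawler might never have reached by following links alone
```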
Issues in crawlers and how to solve them
- Text documents stored in incompatible formats - can be converted to tagged formats
- Crawling can be expensive in terms of CPU and network load - it may be more efficient to store copies of downloaded documents than to fetch them again
BigTable
A distributed database system in which the table is split into small pieces (tablets), which are served by thousands of machines
How are changes recorded in BigTable?
Changes are recorded in a transaction log and stored in a shared file system
True or false: If a BigTable tablet server crashes, the whole table is inaccessible
False! Another server will immediately read the tablet data and transaction log and take over
True or false: BigTable is a relational database model
False! Unlike relational databases, not all rows have the same columns.
Feed
A real-time stream of documents; this is how search engines acquire new documents
Push feed
Alerts subscribers to new documents
Pull feed
Requires the subscriber to check periodically
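A minimal sketch of a pull feed: the subscriber polls the feed URL on a schedule and reacts only when the content has changed; the URL and interval are invented for the example.

```python
import hashlib
import time
import urllib.request

FEED_URL = "http://example.com/feed.xml"   # hypothetical feed location

def poll_feed(interval=300):
    # Pull feed: the subscriber checks periodically;
    # a push feed would alert the subscriber instead.
    last_digest = None
    while True:
        with urllib.request.urlopen(FEED_URL) as resp:
            digest = hashlib.sha256(resp.read()).hexdigest()
        if digest != last_digest:
            last_digest = digest
            print("new documents available in the feed")
        time.sleep(interval)
```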