Search Engines and Crawlers Flashcards
Vertical search
Web search restricted to a particular topic or domain (e.g. only news, images, or academic papers)
Enterprise search
Searching for information across the files scattered over a company's intranet (e.g. internal documentation)
Desktop search
Searching the files stored on an individual computer, including their contents (e.g. documents and email messages)
Classification
Assigning documents to predefined categories (labels)
Ad-hoc search
Finding documents relevant to an arbitrary, one-off text query
Relevance in information retrieval
o Topical relevance and user relevance
o Retrieval models – formal descriptions of how a query is matched against documents; the basis for ranking algorithms
o Ranking algorithms
Evaluation in information retrieval
o Precision and recall – precision is the fraction of retrieved documents that are relevant; recall is the fraction of relevant documents that are retrieved
o Test collections
o Clickthrough and log data
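Precision and recall for a single query can be sketched as below. This is a minimal illustration, assuming the relevant documents are known in advance (as in a test collection); the function name and document IDs are made up for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: document IDs returned by the search engine
    relevant:  document IDs judged relevant (e.g. from a test collection)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents retrieved, 3 documents are actually relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were found (recall about 0.67).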
Information needs
o Query suggestions (e.g. autocomplete)
o Query expansion – adding related terms to the query so that it matches more relevant documents
o Relevance feedback – refining the query using the results the user marks (or clicks) as relevant
What are the primary goals of a search engine?
- Effectiveness – retrieving the most relevant set of documents possible
- Efficiency – processing queries as quickly as possible
Issues in search engines
o Performance – efficient searching and indexing
o Incorporating new data
o Scalability – growing with data and users (e.g. handling large amounts of traffic)
o Adaptability – tuning for applications (e.g. adjusting ranking features for a specific collection or group of users)
Document statistics
Gathering and recording statistical information about words and documents
How are document statistics used?
The gathered information is stored in lookup tables and used by ranking algorithms
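As a rough sketch of what gathering document statistics can look like, the snippet below counts term frequencies, document frequencies, and document lengths and keeps them in dictionaries acting as lookup tables. The three tiny documents and the whitespace tokenizer are assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical mini-collection: document ID -> text.
docs = {
    "d1": "tropical fish tank",
    "d2": "tropical fish food fish",
    "d3": "fish market",
}

# Term frequency: how often each word occurs in each document.
term_freq = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}

# Document frequency: in how many documents each word occurs.
doc_freq = Counter()
for counts in term_freq.values():
    doc_freq.update(counts.keys())

# Document length in words, often used to normalize scores.
doc_length = {doc_id: sum(counts.values()) for doc_id, counts in term_freq.items()}

print(term_freq["d2"]["fish"], doc_freq["fish"], doc_length["d2"])
```

A ranking algorithm can then read these tables at query time instead of rescanning the documents.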
Lookup table
Data structure designed for quick retrieval
Weighting
Calculating index term weights from document statistics and storing them in lookup tables
tf.idf weighting
Giving high weights to terms that occur often in a document (tf) but appear in very few documents across the collection (idf)
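One possible way to compute tf.idf weights is sketched below. There are many variants; this one assumes tf is the term count divided by the document length and idf is log(N / df), and the sample documents are invented for the example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return {doc_id: {term: weight}} using tf = count / doc length, idf = log(N / df)."""
    n_docs = len(docs)
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))

    weights = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        weights[doc_id] = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return weights


# Hypothetical mini-collection.
weights = tf_idf({
    "d1": "tropical fish tank",
    "d2": "tropical fish food fish",
    "d3": "fish market",
})
print(weights["d1"])
```

In this collection "fish" appears in every document, so its idf (and weight) is zero, while "tank" appears in only one document and gets the highest weight in d1.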
True or false: Weight is calculated during the query process
False! It can be calculated during query processing, but doing as much of the calculation as possible during indexing makes querying more efficient
Inversion
Converting document-term information into term-document information to build the inverted index
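Inversion can be illustrated with a small sketch that turns a document-to-terms mapping into a term-to-documents mapping. The input data and the function name are hypothetical, and real indexes store postings far more compactly.

```python
from collections import defaultdict


def invert(doc_terms):
    """Turn {doc_id: {term: weight}} into {term: [(doc_id, weight), ...]}."""
    index = defaultdict(list)
    for doc_id in sorted(doc_terms):  # keep each postings list ordered by doc ID
        for term, weight in doc_terms[doc_id].items():
            index[term].append((doc_id, weight))
    return dict(index)


# Hypothetical document-term data (term counts per document).
doc_terms = {
    "d1": {"fish": 1, "tank": 1},
    "d2": {"fish": 2},
}
inverted = invert(doc_terms)
print(inverted["fish"])
```

Looking up a query term in the inverted structure gives the documents that contain it directly, without scanning every document.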
Methods of query transformation
o Spell checking
o Query suggestion
o Suggesting additional terms via query expansion
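The methods above can be sketched together as follows. This is only an illustration: the vocabulary and the related-terms table are made-up assumptions, and spelling correction is approximated with the standard library's `difflib.get_close_matches` rather than a real spell checker.

```python
import difflib

# Hypothetical index vocabulary and related-terms table (assumptions for the example).
VOCABULARY = ["search", "engine", "crawler", "index", "ranking"]
RELATED_TERMS = {"crawler": ["spider"], "search": ["retrieval"]}


def correct_spelling(term):
    """Replace a term with its closest vocabulary word, if one is similar enough."""
    matches = difflib.get_close_matches(term, VOCABULARY, n=1, cutoff=0.8)
    return matches[0] if matches else term


def transform_query(query):
    """Spell-check each query term, then expand the query with related terms."""
    corrected = [correct_spelling(term) for term in query.lower().split()]
    expanded = list(corrected)
    for term in corrected:
        for extra in RELATED_TERMS.get(term, []):
            if extra not in expanded:
                expanded.append(extra)
    return expanded


result = transform_query("serch crawlr")
print(result)
```

In a real engine the corrected or expanded terms would typically be offered to the user as suggestions rather than silently replacing the original query.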