Unit 5 - Web Searching Flashcards
Data structures used for storing indices:
- Suffix tree
- Inverted index
- Citation index
- N-gram index
- Term document matrix
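As a minimal sketch of one of these structures, the inverted index maps each term to the documents that contain it. The toy collection and the function name `build_inverted_index` are assumptions made for illustration; real engines also store positions, weights and compressed postings.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical toy collection, used only for illustration.
docs = {
    1: "web search engines index pages",
    2: "crawlers gather pages from the web",
    3: "an index maps terms to pages",
}

inverted_index = build_inverted_index(docs)
print(inverted_index["web"])    # [1, 2]
print(inverted_index["index"])  # [1, 3]
```

Looking up a query term is then a single dictionary access, which is why this structure suits keyword search.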
What are indices?
Indices are short descriptions of each web page, which may include the title, creation date, size, first line, etc.
What is XML?
XML stands for Extensible Markup Language. It is used for exchanging data on the Web.
It enables the separation of content (XML) from presentation (XSL).
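To illustrate content without presentation, here is a small sketch that reads a made-up XML fragment with Python's standard `xml.etree.ElementTree` module. The element names (`library`, `book`, `title`, `year`) and their values are hypothetical; the tags only describe the data, and nothing in the fragment says how it should be displayed.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: tags describe the data, not its appearance.
xml_text = """
<library>
  <book id="b1">
    <title>Example Title A</title>
    <year>2001</year>
  </book>
  <book id="b2">
    <title>Example Title B</title>
    <year>2002</year>
  </book>
</library>
"""

root = ET.fromstring(xml_text.strip())

# Pull the content out by element name and attribute.
titles = [book.findtext("title") for book in root.findall("book")]
ids = [book.get("id") for book in root.findall("book")]

print(titles)  # ['Example Title A', 'Example Title B']
print(ids)     # ['b1', 'b2']
```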
Who created XML?
The W3C, to provide an easy-to-use, standardised way to store self-describing data.
INEX 2002 defined:
Two assessment dimensions: component coverage and topical relevance
Four cases in the component coverage dimension:
Exact coverage (E)
Too small (S)
Too large (L)
No coverage (N)
Cases in Topical relevance:
Highly relevant (3)
Fairly relevant (2)
Marginally relevant (1)
Non relevant (0)
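The two dimensions can be modelled as a pair of values per assessed component. The sketch below is an assumption about representation only: the class names, member names and the compact label format (relevance digit followed by coverage letter) are chosen for illustration.

```python
from enum import Enum, IntEnum

class Coverage(Enum):
    EXACT = "E"
    TOO_SMALL = "S"
    TOO_LARGE = "L"
    NONE = "N"

class Relevance(IntEnum):
    NON_RELEVANT = 0
    MARGINAL = 1
    FAIR = 2
    HIGH = 3

def label(relevance, coverage):
    """Combine both dimensions into a short label such as '3E'."""
    return f"{int(relevance)}{coverage.value}"

best = label(Relevance.HIGH, Coverage.EXACT)
worst = label(Relevance.NON_RELEVANT, Coverage.NONE)
print(best, worst)  # 3E 0N
```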
What is a search engine?
A search engine is a program that helps users find information stored on computers anywhere on the World Wide Web.
Centralised crawler-indexer architecture:
Used by most search engines. Crawlers gather information from the Web and bring it to a single site, where it is indexed by the indexer.
Components of the crawler-indexer architecture
Crawlers, indexer, query engine, user interface
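A minimal sketch of how these components fit together, assuming a tiny in-memory "web" instead of real HTTP requests. The `WEB` dictionary, the `.example` hosts and the function names are hypothetical; the final `print` calls stand in for the user interface.

```python
from collections import defaultdict, deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
WEB = {
    "a.example": ("search engines crawl the web", ["b.example", "c.example"]),
    "b.example": ("an indexer builds the index", ["c.example"]),
    "c.example": ("the query engine answers user queries", ["a.example"]),
}

def crawl(seed):
    """Crawler: breadth-first traversal that brings every page to one site."""
    seen, queue, pages = {seed}, deque([seed]), {}
    while queue:
        url = queue.popleft()
        text, links = WEB[url]
        pages[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages

def build_index(pages):
    """Indexer: map each term to the set of URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    return index

def query(index, terms):
    """Query engine: URLs that contain all of the query terms."""
    postings = [index.get(t, set()) for t in terms.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []

index = build_index(crawl("a.example"))
print(query(index, "the web"))  # ['a.example']
print(query(index, "index"))    # ['b.example']
```

Everything ends up in one central index, which is what makes the dynamic web, server load and data volume hard for this design.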
Problems with the crawler-indexer architecture
Dynamic nature of the web
High load on web servers
Large volume of data
Saturated communication links
Harvest distributed crawler-indexer architecture
Problems of the centralised architecture that Harvest addresses:
Requests from many different crawlers increase the load on web servers
Objects retrieved by the crawlers are mostly useless and discarded
No coordination among the crawlers
Components of Harvest:
Gatherers
Brokers
Replicator
Object cache
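The division of labour between gatherers and brokers can be sketched as follows: a gatherer collects indexing information from servers once, and several brokers build their indices from that same output. This is an illustration under assumptions, not Harvest's actual code. The `SERVERS` data, class names and methods are hypothetical, and the replicator and object cache are not modelled.

```python
from collections import defaultdict

# Hypothetical web servers: server name -> {path: page text}.
SERVERS = {
    "site1.example": {"/a": "harvest uses gatherers", "/b": "brokers build indices"},
    "site2.example": {"/c": "gatherers feed brokers"},
}

class Gatherer:
    """Collects indexing information from one or more servers."""
    def __init__(self, servers):
        self.servers = servers
        self.fetches = 0  # counts page retrievals

    def gather(self):
        summaries = {}
        for server in self.servers:
            for path, text in SERVERS[server].items():
                self.fetches += 1
                summaries[server + path] = sorted(set(text.split()))
        return summaries

class Broker:
    """Builds an index from gatherer output and answers queries."""
    def __init__(self):
        self.index = defaultdict(set)

    def update(self, summaries):
        for url, terms in summaries.items():
            for term in terms:
                self.index[term].add(url)

    def search(self, term):
        return sorted(self.index.get(term, set()))

gatherer = Gatherer(["site1.example", "site2.example"])
summaries = gatherer.gather()   # each page is fetched once
broker_a, broker_b = Broker(), Broker()
broker_a.update(summaries)      # the same output feeds several brokers
broker_b.update(summaries)

print(broker_a.search("gatherers"))  # ['site1.example/a', 'site2.example/c']
print(gatherer.fetches)              # 3
```

Because both brokers reuse one gathering pass, the servers see three fetches rather than six, which is the load reduction the architecture aims for.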