Lecture 10 - Content-Based Applications Flashcards
What’s a B2B transaction?
Business to Business uses the WWW as a distributed document delivery service
What are the components of a search engine?
Database of references to webpages
A web crawler
An interface
Information retrieval system
What are the elements of the search engine database?
Where the user's queries are matched
Contains only essential parts of the page
Only includes indexed pages
Tends to be out of date relative to the live web
What does a web crawler do?
Records the data it finds such as words, metadata and alt attributes
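A minimal sketch of the recording step, assuming the page has already been fetched and using Python's standard html.parser module. The names PageRecorder and record_page are hypothetical, and a real crawler would also skip script and style content, follow links and respect robots.txt.

```python
from html.parser import HTMLParser


class PageRecorder(HTMLParser):
    """Collects page words, <meta> name/content pairs, and image alt text."""

    def __init__(self):
        super().__init__()
        self.words = []
        self.metadata = {}
        self.alt_texts = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and "name" in attributes:
            # e.g. <meta name="description" content="...">
            self.metadata[attributes["name"]] = attributes.get("content", "")
        elif tag == "img" and attributes.get("alt"):
            self.alt_texts.append(attributes["alt"])

    def handle_data(self, data):
        # Text between tags, lower-cased and split on whitespace.
        self.words.extend(data.lower().split())


def record_page(html):
    """Parse one HTML document and return the populated recorder."""
    recorder = PageRecorder()
    recorder.feed(html)
    recorder.close()
    return recorder
```

Calling record_page on a page's HTML gives back an object whose words, metadata and alt_texts attributes hold what was found.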
What does the search engine interface do?
Gathers input from users
Presents results from the IR system
Often presents items in a ranked order
Requires user input
What are the two main methods of search term matching?
Keyword Searching and Concept-based searching
How does keyword searching work?
Matches individual terms, scoring each document by the cosine of the angle between the query and document term vectors
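A minimal sketch of a cosine score between a query and a document, assuming raw term counts as the vector weights. Real engines usually use weighted schemes such as TF-IDF, and the function name cosine_similarity is hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(query, document):
    """Cosine of the angle between two term-count vectors (0.0 to 1.0)."""
    q = Counter(query.lower().split())
    d = Counter(document.lower().split())
    # Dot product over the query terms; Counter returns 0 for absent terms.
    dot = sum(q[term] * d[term] for term in q)
    norm_q = math.sqrt(sum(count * count for count in q.values()))
    norm_d = math.sqrt(sum(count * count for count in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # an empty query or document matches nothing
    return dot / (norm_q * norm_d)
```

A document sharing no terms with the query scores 0.0, and the score rises towards 1.0 as the term distributions become more alike.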
How does concept-based searching work?
Examines clusters of words
Attempts to determine the meaning of a query
What are the basic information retrieval features of a search engine?
Boolean Operators
Extended Operators
Stop word deletion
Stemming
Searching in fields (e.g. host)
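Stop word deletion and stemming can be sketched as a query normalisation step. The stop word list and the suffix-stripping rule here are illustrative assumptions, far cruder than a real stemmer such as Porter's, and the names naive_stem and normalise_query are hypothetical.

```python
# Illustrative stop word list; real engines use much longer ones.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "is"}
# Suffixes tried in order; a deliberately naive rule set.
SUFFIXES = ("ing", "ed", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalise_query(query):
    """Lower-case the query, drop stop words, and stem what remains."""
    terms = query.lower().split()
    return [naive_stem(term) for term in terms if term not in STOP_WORDS]
```

With this rule set, "Searching the crawled links" reduces to the terms search, crawl and link, so the query can match pages that use other forms of those words.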
What are the rules of ranked output for most search engines?
Early words more important
Title is important
Frequency of occurrence matters for some
Infrequent words matter more
Modification date
How does Google handle searches differently than other SEs?
The PageRank™ method is based on popularity, use of keywords and relevance.
Links as money
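The "links as money" idea can be sketched as a simplified PageRank iteration: each page splits its current rank evenly among the pages it links to. This is only an illustration under stated assumptions (a damping factor of 0.85, a fixed number of iterations, and dangling pages spreading their rank evenly), not Google's production method, and the function name pagerank is hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # "Spend" this page's rank equally on its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank
```

In a graph where pages a and b both link to c, and c links back to a, page c ends up with the highest rank and b, which nobody links to, with the lowest.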
Google’s Anatomy: What does the URL server do?
Sends lists of URLs to be fetched
Fetched pages are sent to the store server
The store server compresses and stores pages into a repository
Each page has a docID
What does Google’s Indexer do?
Reads repository, uncompresses and parses documents
Converts pages into stats on word occurrences, hits
Includes info about the page, font size, capitalization
What does Google’s sorter do?
Re-sorts barrels by wordID instead of docID, producing the inverted index
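A minimal sketch of the re-sorting idea, assuming a forward index is a plain mapping from docID to the wordIDs found in that document. Real barrels also store hit lists with positions and formatting; the function name invert is hypothetical.

```python
from collections import defaultdict


def invert(forward_index):
    """Turn {doc_id: [word_id, ...]} into {word_id: [doc_id, ...]}."""
    inverted = defaultdict(set)
    for doc_id, word_ids in forward_index.items():
        for word_id in word_ids:
            inverted[word_id].add(doc_id)
    # Order entries by wordID, and docIDs within each entry.
    return {
        word_id: sorted(doc_ids)
        for word_id, doc_ids in sorted(inverted.items())
    }
```

For example, invert({1: [10, 20], 2: [20, 30]}) returns {10: [1], 20: [1, 2], 30: [2]}, so a searcher can go straight from a word to the documents containing it.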
What does dumpLexicon do?
Takes the sorter's list of wordIDs and the indexer's lexicon to produce a new lexicon
The new lexicon is used by the searcher to answer queries
The searcher uses the inverted index, lexicon and PageRanks
What is Googlebombing?
Specifically targeting a web page to rank in 1st position for a particular search query
What is the Deep Web?
Also called the invisible web, it contains documents not indexed by Search Engines.
How is the invisible web changing over time?
More search engines are parsing non-HTML content than before
Companies are making more content available by keeping URLs stable and including sitemaps
What is dogpile?
A well-known meta-crawler
As a searcher, what are steps to success in search engines?
Use multiple search engines
Search within results
Use Boolean expressions
As a creator, what are the steps to success in search engines?
Always use ALT attributes
Avoid frames
Links between your pages
Use metadata, formal and informal
How do you increase your page's popularity?
Don’t use systematic reciprocal linking
Use a context map at the top of each page
Don’t use frames
Think through dynamic content implications
Why is metadata important?
- It’s useful for describing and locating info
- Judge relevance of information
- Promote good information management
- Search tools and information gateways can use metadata when locating and describing resources
How can we reduce inconsistencies in our metadata?
Clearly label attributes
Stick to formats and rules
Catalogue Rules
What is Dublin Core (DC)?
It has 15 core elements
Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage and Rights
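One common way to attach Dublin Core to a web page is through HTML meta tags named "DC." plus the element name. The sketch below generates such tags from a dictionary; the function name dublin_core_meta and the sample values are hypothetical.

```python
from html import escape

# The 15 core elements, in lower-case form.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def dublin_core_meta(record):
    """Return one <meta> tag per Dublin Core element present in record."""
    unknown = set(record) - set(DC_ELEMENTS)
    if unknown:
        raise ValueError(f"Not Dublin Core elements: {sorted(unknown)}")
    return [
        f'<meta name="DC.{name}" content="{escape(record[name], quote=True)}">'
        for name in DC_ELEMENTS
        if name in record
    ]
```

Passing {"title": "Lecture 10", "creator": "A. Lecturer"} yields two tags, DC.title and DC.creator, ready to place in the page head.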
What does RDF stand for?
Resource Description Framework
What does the Resource Description Framework do?
It aims to provide the infrastructure to exchange metadata on the web.
Allows for mix of metadata schemes
Enables automated processing of web resources
Interoperability between applications that exchange machine-understandable information
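RDF expresses metadata as (subject, predicate, object) statements, which is what lets schemes be mixed. A minimal sketch using plain tuples rather than an RDF library: the DC namespace URI is the real Dublin Core one, while the EX namespace, the resource URI and the values are hypothetical.

```python
DC = "http://purl.org/dc/elements/1.1/"  # Dublin Core element namespace
EX = "http://example.org/terms/"         # hypothetical second scheme

PAGE = "http://example.org/lecture10"    # hypothetical resource

# Each statement is a (subject, predicate, object) triple.
TRIPLES = [
    (PAGE, DC + "title", "Content-Based Applications"),
    (PAGE, DC + "language", "en"),
    (PAGE, EX + "contentRating", "general"),
]


def objects(triples, subject, predicate):
    """Return every object stated for the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]
```

Because every predicate is a full URI, statements from Dublin Core and from the second scheme sit side by side without clashing, and a program can query either with the same function.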
What are the applications of RDF?
- Resource discovery - search engines
- Cataloguing - describe content and content relationships
- Describing intellectual property rights
- Intelligent software agents - info sharing
- Content rating
- Privacy preferences/policies
- Collections of pages as a single “document”
What are the disadvantages of metadata?
Stored in separate files
Difficult to convince information providers of its importance
Need for standardised usage and procedures
Not trusted by some search engines (because of keyword spamming)
What is the short term disadvantage of metadata?
Metadata imposes a load on the server
Metadata is becoming important, how should we handle it when creating a site?
Start collecting it immediately
Automate as much as possible
Ensure information providers use metadata