10 XML Retrieval Flashcards
Structured retrieval
Search over structured documents
XML
Extensible Markup Language. A standard for encoding structured documents, and the most widely used one.
XML element
A node in a tree
XML attribute
An element can have one or more XML attributes
XML DOM
Document Object Model. The standard way of accessing and processing XML documents. The DOM represents elements, attributes, and text within elements as nodes in a tree. With a DOM API, we can process an XML document by starting at the root element and then descending the tree from parents to children.
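The root-to-children traversal described on this card can be sketched with Python's standard `xml.dom.minidom` module. The sample document and the helper name `walk` are made up for illustration, and attributes are collected into a plain dict rather than visited as separate nodes.

```python
from xml.dom import minidom

# Hypothetical sample document, loosely modeled on a play.
XML = (
    '<play><title>Macbeth</title>'
    '<act number="I"><scene number="vii">'
    '<verse>Will I with wine and wassail</verse>'
    '</scene></act></play>'
)

def walk(node, depth=0, out=None):
    """Depth-first traversal: record a node, then descend to its children."""
    if out is None:
        out = []
    if node.nodeType == node.ELEMENT_NODE:
        attrs = dict(node.attributes.items())
        out.append((depth, "element", node.tagName, attrs))
    elif node.nodeType == node.TEXT_NODE and node.data.strip():
        out.append((depth, "text", node.data.strip(), {}))
    for child in node.childNodes:
        walk(child, depth + 1, out)
    return out

doc = minidom.parseString(XML)
nodes = walk(doc.documentElement)  # start at the root element
for depth, kind, value, attrs in nodes:
    print("  " * depth + f"{kind}: {value} {attrs if attrs else ''}")
```

Each printed line is indented by its depth in the tree, so the output mirrors the parent-child structure of the document.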
XML Context/contexts
An XML path.
Schema
A schema puts constraints on the structure of allowable XML documents for a particular application. A schema for Shakespeare's plays may stipulate that scenes can only occur as children of acts and that only acts and scenes have the number attribute.
NEXI
Narrowed Extended XPath I. A common format for XML queries.
Structured document retrieval principle
A system should always retrieve the most specific part of a document that answers the query.
Indexing unit
Which parts of a document to index.
Schema heterogeneity/diversity
In many cases, several different XML schemas occur in a collection because the XML documents in an IR application often come from more than one source. This is called schema heterogeneity. It presents further challenges:
- Comparable elements may have different names, such as author vs. creator.
- The structural organization of the schemas may differ: in one schema, author names are direct descendants of the node author, but in another schema firstname and lastname are direct children of author.
Extended query
We can support the user by interpreting all parent-child relationships in queries as descendant relationships with any number of intervening nodes allowed. Such queries are called extended queries.
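A minimal sketch of this relaxation, using the limited XPath support in Python's `xml.etree.ElementTree`. The sample document and the helper name `extend` are hypothetical. The helper only rewrites simple slash-separated paths and is not a NEXI implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: the only title under chapter is nested in a section.
XML = """
<book>
  <chapter>
    <section><title>Gallic War</title></section>
  </chapter>
</book>
"""

def extend(path):
    """Rewrite every parent-child step a/b as a descendant step a//b."""
    return ".//" + "//".join(path.split("/"))

root = ET.fromstring(XML)  # root is the <book> element

# Strict reading: title must be a direct child of chapter.
strict = [t.text for t in root.findall("./chapter/title")]

# Extended reading: any number of nodes may sit between chapter and title.
extended = [t.text for t in root.findall(extend("chapter/title"))]

print(strict)
print(extended)
```

The strict query finds nothing because a section node sits between chapter and title, while the extended query still reaches the title.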
Structural term
An XML-context/term pair. We index all paths that end in a single vocabulary term, in other words, all XML-context/term pairs, and call each such pair a structural term.
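One way to enumerate such pairs, sketched in Python. It rests on several assumptions that are not part of the card: tokens are split on whitespace and lowercased, every suffix of the root-to-element path counts as a context, and text that follows a child element (tail text) is ignored. The function name `structural_terms` is hypothetical.

```python
import xml.etree.ElementTree as ET

def structural_terms(elem, path=()):
    """Yield (context, term) pairs for every token found at or below elem."""
    path = path + (elem.tag,)
    for token in (elem.text or "").lower().split():
        # Assumption: each suffix of the path is a valid XML context,
        # e.g. both "book/title" and "title" for a token inside <title>.
        for start in range(len(path)):
            yield ("/".join(path[start:]), token)
    for child in elem:
        yield from structural_terms(child, path)

# Hypothetical sample document.
doc = ET.fromstring(
    "<book><title>Julius Caesar</title><author>Shakespeare</author></book>"
)
pairs = sorted(set(structural_terms(doc)))
for context, term in pairs:
    print(f"{context}#{term}")
```

An index built over these pairs can then treat each context/term combination as a single dimension, in the same way a text index treats a plain term.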
Text-centric XML
XML retrieval in which we match the text of the query with the text of the XML documents.
Data-centric XML
Mainly encodes numerical and nontext attribute-value data. When querying data-centric XML, we want to impose exact match conditions in most cases. An example query: "Find employees whose salary is the same this month as it was 12 months ago." This query requires no ranking. It is purely structural, and an exact match of the salaries in the two time periods is probably sufficient to meet the user's information need.
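The salary query can be sketched as an exact-match filter in Python. The payroll layout, the month attribute, and the function name `unchanged_salary` are assumptions made for the example. No scores are computed, because the result is a set rather than a ranking.

```python
import xml.etree.ElementTree as ET

# Hypothetical data-centric document.
XML = """
<payroll>
  <employee name="Ada">
    <salary month="2023-05">5000</salary>
    <salary month="2024-05">5000</salary>
  </employee>
  <employee name="Bob">
    <salary month="2023-05">4200</salary>
    <salary month="2024-05">4500</salary>
  </employee>
</payroll>
"""

def unchanged_salary(root, now, year_ago):
    """Return names of employees whose salary is identical in both months."""
    names = []
    for emp in root.findall("employee"):
        current = emp.findtext(f"salary[@month='{now}']")
        previous = emp.findtext(f"salary[@month='{year_ago}']")
        # Exact match on the numeric value; employees missing either month are skipped.
        if current is not None and previous is not None:
            if int(current) == int(previous):
                names.append(emp.get("name"))
    return names

root = ET.fromstring(XML)
result = unchanged_salary(root, "2024-05", "2023-05")
print(result)
```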