gembin

OSGi, Eclipse Equinox, ECF, Virgo, Gemini, Apache Felix, Karaf, Aires, Camel, Eclipse RCP

HBase, Hadoop, ZooKeeper, Cassandra

Flex4, AS3, Swiz framework, GraniteDS, BlazeDS etc.

There is nothing that software can't fix. Unfortunately, there is also nothing that software can't completely fuck up. That gap is called talent.

About Me

 

XQuery Search and Update

Abstract

Abstract

The XQuery 1.0 specification describes a powerful language for accessing and manipulating XML. In its initial form XQuery lacks two interesting features: fast token-based text search and node-level update. Specifications for these features are under development at the W3C with release dates after XQuery 1.0. This paper looks at the mechanisms by which XQuery can support search and update based on our experience at Mark Logic implementing these features.

Are Search and Update Needed?

To begin, let us ask the question, are search and update capabilities needed in an XQuery environment? The answer depends on the use case. When using XQuery to federate queries against data-oriented RDBMS stores, text-based search isn't required and updates can be handled through direct SQL calls. The XQuery in this situation acts as a read-only view, a convenience rather than a primary access and manipulation language.

In other use cases the need can be acute. Take for example O'Reilly Media, a Mark Logic customer with a custom publishing site called SafariU that's targeted at helping university professors create custom course readers. Professors access book and article content (held as Docbook within Mark Logic) by performing advanced searches to locate items of interest -- chapters, sections, or code examples. Professors gather together these items into a new book, reviewing HTML and later PDF copies, and eventually receiving a stack of their custom books at the university bookstore.

SafariU requires fast text search so professors can find content to select and update to track the new hierarchical book structure and content usage for royalty reports. In fact, before the first professor visits SafariU, O'Reilly uses XQuery update calls to "decorate" the book and article content: adding <indexterm> elements on articles, adding estimated page counts to content pieces, etc.

Mark Logic has opted to provide search and update using function calls -- rather than language extensions -- so as to conform to the XQuery standard as much as possible. We have been able to make search and update calls efficient, executing quickly even against gigabyte documents in terabyte collections. In the rest of this paper we'll explain the Mark Logic approaches to search and update and compare and contrast that approach with the preliminary W3C proposals.

Section Title

If XQuery search were to provide only what a typical search engine provides, it would be fairly easy. You specify the search word or phrase, possibly add a basic document selection constraint, and retrieve a relevance ordered list of documents containing the search term(s). Using the Mark Logic search function calls and the input() function to return a sequence of all document nodes, the expression would look like this:

 

 
cts:search(input(), "word or phrase")
 

 

Yet XQuery enables so much more. First, XQuery has direct access to and can utilize the XML document structure. This lets you determine what subset of content should be searched. As shown in Example 1, you can look for the most relevant <sect1> blocks discussing "cougars" where the <sect1> blocks must be within chapters about "veterinary medicine" in books published after 2003.

 

Example 1: A Structure-aware Search
 
cts:search(
input()/book[meta/pubdate >= 2003]
       
//chapter[cts:contains(title, "veterinary medicine")]//sect1,
"cougars
)
 

Second, XQuery can access the items returned from each search and programmatically manipulate each item for rendering or to feed into another subsequent search. In the cougar example, the search expression might return a sequence of <sect1> elements, then the query might continue on to extract all the figure blocks within those sections for display, along with any paragraph linking to the figure. Example 2 shows what that query would look like, and how a theoretical print-references() might be implemented:

 

Example 2: Manipulating Search Results
 
for $s in
cts:search(...)[1 to 10]
for $f in $s//figure
return (
print-figure($f),
print-references($f)
)
 
define function print-figure($f as element(figure)) { ... }
 
define function print-references($f as element(figure))
{
let $id := $f/@id
for $link in $f/ancestor::book//link[@linkend=$id]
return
<p>
 
{ string-join($link/ancestor::para[1]//text(), " ") }
</p>
}
 

 

The power to control what's searched and what's returned proves especially valuable when documents have useful semantic structure. Figure 1 shows a screen shot of a query for "tennis elbow" executed against a collection of medical textbooks:

The results were rendered with custom XQuery logic. Start at the most relevant <para> elements, walk up the document hierarchy to the nearest containing section, then render that block as HTML. Clickable breadcrumbs were added based on the full ancestral history of the matching <para>. The search returns paragraphs, but for the end result we've opted to include full sections.

By comparison, a search engine presented with a large number of Docbook files would return simply URL pointers to which files were most relevant. Any post processing would be entirely manual and external to the search engine.

Executing a Search

Mark Logic provides two functions to execute searches: cts:contains() and cts:search(). The "cts" namespace stands for "core text search" and is a standard extension in Mark Logic.

The cts:contains() call is the simpler of the pair. It returns a boolean true() if the node passed as the first argument contains the word or phrase specified by the second argument. Example 3 returns all the programlistings within any example element whose title contains "black and white" within any book element whose title contains "java servlet". The query executes quickly based on indexes of both content and structure.

Example 3: Searching content and structure
input()/book[cts:contains(title, "java servlet")]
     
//example[cts:contains(title, "black and white")]
    
//programlisting
 

Those familiar with XQuery may recall a standard fn:contains() function. The standard function operates on a character level rather than token (word), so fn:contains(title, "foo") somewhat unexpectedly matches the word "food". The cts:contains(title, "foo") call would not, because it matches tokens.

The cts:contains() function works well for black and white searches. By this I don't mean searches for "black and white", but rather those cases where your item is either in or it's out. There's no relevance rank ordering. For that you use cts:search().

The cts:search() call returns a result sequence in decreasing relevance order. In common usage it accepts two arguments, an expression to be searched and a query specifying what search to perform. Below is the Example 1 search query shown earlier:

 
cts:search(
input()/book[meta/pubdate >= 2003]
      
//chapter[cts:contains(title, "veterinary medicine")]//sect1,
"cougars"
)[1 to 10]
 

 

The first argument limits the search domain to <sect1> elements within specific chapters within specific books. The second argument indicates the search criteria. It's the search criteria that influences the relevance ranking. The following two queries are very different:

 
//title[cts:contains(., "XQuery search")]
 
 
cts:search(//title, "XQuery search")
 

The first returns titles that contain "XQuery search", in document order. The second returns titles that contain "XQuery search", in decreasing relevance order.

The second argument to cts:search() can accept not just simple words and phrases but cts:query instances. Through a cts:query instance you can apply an arbitrarily complex and nested search criteria. Each criteria can provide search options (case insensitivity, stemming enablement, wildcard enablement, language) and a term weight (an indicator of this term's relative importance in the search, between 0.0 and 1.0).

You build a cts:query instance through of a series of "cts" function calls. Example 4 shows several examples:

 

Example 4: Query Examples
 
(: Simple strings are treated as
cts:word-query() :)
cts:word-query("XQuery search")
 
(: How to use a
cts:word-query() :)
cts:search(input(),
cts:word-query("rhinos"))
 
(: Indicates to use stemming so "searching" would be a match also :)
cts:word-query("XQuery search", "stemmed")
 
(: Either wine color is acceptable, red 2x preferred :)
cts:or-query(
cts:word-query("white wine", "stemmed", 0.5),
cts:word-query("red wine", "stemmed", 1.0)
)
 
(: Both explorers should appear :)
cts:and-query(
cts:word-query("Lewis", "case-sensitive"),
cts:word-query("Clark", "case-sensitive")
)
 
(: Same as previous. Capitalized words trigger case sensitivity. :)
cts:and-query(
cts:word-query("Lewis"),
cts:word-query("Clark")
)
 
(: Look for "spring" with relevance weighting by location :)
cts:or-query(
cts:element-word-query(xs:QName("title"), "spring", (), 1.0),
cts:element-word-query(xs:QName("subtitle"), "spring", (), 0.3),
cts:element-word-query(xs:QName("para"), "spring", 0.1)
)
 
(: Use
cts:near-query() to specify a term distance :)
cts:near-query(
(cts:word-query("white wine", "stemmed", 0.5),
cts:word-query("red wine", "stemmed", 1.0)),
2)
 

Scoring a Search

The cts:search() call returns results in (decreasing) relevance order. You can quantify each item's relevance using the cts:score() function, as shown in Example 5:

Example 5: Scoring Results
 
for $x in
cts:search(//para, "spring break")
return <p score="{cts:score($x)}"> { $x/* } </p>
 

A score is a non-negative integer, unbounded and increasing logarithmetically. The result of calling xdmp:score() for a given nodes depends on the structure of the most recent function call which returned the node as a value. This implementation of score violates XQuery side-effect free-ness. On the other hand, it appears to be the only feasible approach to scoring which is compatible with the current XQuery grammar. Recently the Free Text Task Force of the W3C XQuery Working Group has published a draft recommendation for XML free text search which includes a grammatical recommendation for managing relevance scoring. We will describe this is the next section and compare it to the Mark Logic infrastructure.

Relevance scores are computed using a "tf-idf" formula (tf-idf stands for term frequency, inverse document frequency) with byte-length normalization. Element nodes with a higher occurrence of a given word or tag will score higher than nodes of the same size with a lower occurrence of the given word or tag. Between two element nodes with the same number of occurrences of a given word or tag, the smaller element will normalize to larger score, and therefore a higher relevance. Between two words or tags occurring with the same frequency in a given element, the word or tag which is less common in the collection will make a bigger contribution to the relevance score.

Relation of Mark Logic Search to the W3C FTTF Activity

The Free Text Task Force of the W3C XQuery Working Group has published a draft recommendation for XQuery text search. The recommendation describes a surface syntax for XML search which integrates with the existing XQuery grammar. The basic component of this syntax is the "ftcontains" predicate relation. This predicate maps very closely to the Mark Logic cts:contains() predicate. For example, the FTTF proposed search expression:

 
/books/book[title
ftcontains ("dog" with stemming) && "cat"]/author
 

maps directly to the Mark Logic syntax

 
/books/book[cts:contains(title,
 
cts:and-query((cts:word-query("dog","stemmed"),"cat")))]/author
 

The Mark Logic search function API acts effectively as an implementation API for the W3C draft recommendation. It would be premature to start implementing the FTTF grammar given the lack of maturity and large number of unresolved issues in that recommendation.

The FTTF grammar also uses the "ftcontains" predicate relation as an integral part of the FLWOR expression. For example, the set of author children of book elements, where the book element has a title child containing both "dog" and "cat":

 
for $b in //book
where $b/title
ftcontains "dog" && "cat"
return $b/author
 

maps to the Mark Logic function API:

 
for $b in
cts:search(//book,
 
cts:element-query(QName("title"),
cts:and-query(("dog","cat"))))
return $b/author
 

A key difference, however, is that the Mark Logic cts:search() function returns results in (decreasing) relevance order by default, whereas the FLWOR expression returns results in document order unless an explicit order by clause is included in the expression. The fully equivalent FTTF expression would be:

 
for $b in //book
score $s as $b/title
ftcontains "dog" && "cat"
where $s > 0
order by $s descending
return $b/author
 

which is surprisingly more verbose than the lower-level Mark Logic implementation. De-coupling the search and the relevance expression is potentially very useful, but in practice the two expressions are almost always equal. The EBNF grammar allows that the right-hand-side of the "ftcontains" relation may be any expression, but the specification restricts the expression to be a Boolean combination of ft expressions.

Considering FT Expressions

FT search expressions appear on the right hand side of the ftcontains predicate. They are composed from the following primitives, which are almost all supported directly by the Mark Logic search function API:

  • Primitives
    • Boolean And, Or, Not -- supported;MildNot -- not supported

    • Proximity – supported

    • Case and diacritic sensitivity -- supported

    • Stemming and thesaurus options -- supported

    • Stop word options -- not supported

    • Language options -- supported

    • Wildcard matching -- supported

    • Match incidence counting -- not supported

    • Ordered matching -- supported

    • "at start" and "at end" anchors -- supported with regexp matching

    • "words", "sentences", and "paragraph" scope -- supported with appropriate markup

 

From the FTTF Mildnot specification:

If I want to find articles that mention Mexico, I might search for '"Mexico" mild not "New Mexico"'. '"Mexico" mild not "New Mexico"' matches any Expr that contains Mexico on its own. An Expr that contains "New Mexico" is not "excluded" from the result - it may mention "Mexico" as well.

(The specification probably means to say "node" where it says "Expr" as one does not match expressions, one evaluates expressions to obtain a set of nodes which may match some search condition.)

 

This seems like a nice idea, and it will be considered for subsequent releases of the Mark Logic server.

Conversely, the Mark Logic search system includes important concepts which are not present in the FTTF draft recommendation. The Mark Logic server may be configured to "phrase through", "phrase around", or "phrase into" a given set of markup. Phrasing through has the effect of making markup "transparent". For example, inline emphasis tags should not interrupt the phrase structure of a text search.

 
You <b>phrase through</b> a bold tag.
 

Phrasing around has the effect of "ignoring" markup which interrupt the phrase structure. For example, inline footnotes should not contribute to the phrase structure of the footnoted text.

 
A GPS <footnote>Global Positioning System</footnote> device should match a search for "GPS device".
 

Lastly, there are cases when the best behavior is to allow the phrase structure to descend recursively into the string value of the subtree below a given element node.

Mark Logic Update

Mark Logic provides a handful of functions to enable document-level and node-level update. A single query can execute any number of these functions.

Three basic functions perform document-level updates. These functions all use the Mark Logic standard "xdmp" namespace prefix, as shown in Example 6.

Example 6: Document-level Updates
 
(: Loads a document from the
filesystem to the specified URI.
The URI is what would be used as a parameter to
fn:doc().
 
Replaces any existing document at that URI. :)
xdmp:load($filepath, $uri)
 
(: Loads a dynamically constructed document or element node :)
xdmp:document-insert($uri, $root as node())
 
(: Deletes the specified document :)
xdmp:document-delete($uri)
 

Three other functions provide node-level insert abilities. Through the three basic calls shown in Example 7 a node can be added at any location:

Example 7: Node-level Inserts
 
(: Inserts the new node as the last child of the parent :)
xdmp:node-insert-child($parent as node(), $new as node())
 
(: Inserts the node as a following sibling to the reference node :)
xdmp:node-insert-after($sibling as node(), $new as node())
 
(: Inserts the node as a preceding sibling to the reference node :)
xdmp:node-insert-before($sibling as node(), $new as node())
 

A final pair of functions in Example 8 provide the rest of the node-level updates:

Example 8: Node-level Replacements
 
(: Replace the old with the new :)
xdmp:node-replace($old as node(), $new as node())
 
(: Delete the node (and its descendents) from the document :)
xdmp:node-delete($old as node())

There are additional methods to get and set document properties (used for WebDAV), to add and remove documents from collections, and to alter document permissions. Example 9 utilizing several of these calls appears below. It's an implementation of a named queue:

Example 9: Named Queue
 
define function
enqueue($name as xs:string, $i as item()) as empty() {
(: wrap $i in an element() so it can be added to a doc,
: whether it's already an element or not: works for attributes, etc.
:)
let $n := <entry>{$i}</entry>
let $queue := doc($name)/queue
return
if (exists($queue))
  
then
xdmp:node-insert-child($queue, $n)
  
else
xdmp:document-insert($name, <queue>{$n}</queue>)
}
 
define function
dequeue($name as xs:string) as item()? {
(: get the first child element under the queue root, and unwrap it :)
let $node := doc($name)/queue/entry[1]
return
 
if (exists($node))
    
then (xdmp:node-delete($node), $node/@*, $node/node())
    
(: if there's nothing on the queue, return the empty sequence :)
    
else ()
}
 

Everything is straightforward in this example except perhaps the return value from dequeue(). The "then" clause returns a sequence of three items: the result of the xdmp:node-delete() call (always empty), the $node/@* (any possible attribute on the <entry> in case the item stored was an attribute), and $node/node() (any nodes under the <entry> in case the item stored was an element, PI, text node, etc).

To conform to the spirit of the XQuery specification in disallowing side effects, updates are not made visible to the query environment executing the update. This requires some getting used to, but doesn't hinder productivity.

Updates occur after the query completes without error. Should a query produce an uncaught error, the set of updates from that query do not occur. Mark Logic does not yet provide explicit commit/rollback support. However, Mark Logic provides a try/catch expression so that errors can be caught and processed while allowing execution to continue. Its syntax:

 
try {
$try-expression
}
catch ($exception as element()) {
$catch-expression
}
 

 

Example 10 loads files from a list of filenames, reporting errors but not letting any errors stop the other files from being loaded:

Example 10: Handling Load Errors
 
for $file in $files
return
try {
  
<li>
 
{ xdmp:load($file, $file) }
 
{ $file } loaded
  
</li>
}
catch ($exception) {
  
<span>
    
Problem while loading { $file }:<br/>
    
<i xmlns:e="http://marklogic.com/xdmp/error">
      
{$exception/e:message/text()}
    
</i>
  
</span>
}
 

ACID Semantics

Mark Logic updates satisfy a subset of ACID semantics. Below are some definitions conveniently extracted from the Wikipedia that define ACID semantics.

  • ACID Semantics
  • Atomicity refers to the ability of the DBMS to guarantee that either all of the tasks of a transaction are performed or none of them are.
  • Consistency refers to the database being in a legal state when the transaction begins and when it ends. This means that a transaction can't break the rules, or integrity constraints, of the database.
  • Isolation refers to the ability of the application to make operations in a transaction appear isolated from all other operations. This means that no operation outside the transaction can ever see the data in an intermediate state.
  • Durability refers to the guarantee that once the user has been notified of success, the transaction will persist, and not be undone. This means it will survive system failure, and that the database system has checked the integrity constraints and won't need to abort the transaction. Typically, all transactions are written into a log that can be played back to recreate the system to its state right before the failure. A transaction can only be deemed committed after it is safely in the log.

The Mark Logic Server evaluates an XQuery module by spawning an evaluation thread which operates on a snapshot of the database. All the updates appearing in the module are collected and a change vector computed and journaled. The XQuery module has an implied commit which guarantees that either all the changes or none of the changes become persistent. Update evaluation guarantees atomicity, isolation, and durability. Consistency is satisfied, but only in the weak sense, since integrity constraints are not explicitly part of the Mark Logic architecture. Integrity is expressed as de-normalized documents: all the elements collected in one document are maintained as a consistent collection. Isolation is maintained in the strongest sense: each query evaluation thread operates on a static snapshot, and all the updates refer to the original values, even if an update sub-expression modifies one of the values appearing elsewhere in the XQuery module.

W3C XQuery Update Activity

The W3C XQuery Working Group has published a draft requirements document for XML XQuery update. This draft is available at http://www.w3.org/TR/2005/WD-xquery-update-requirements-20050211/. The pertinent parts for this paper are the basic functional requirements and the ACID semantics requirements.

W3C XQuery Update Functional Requirements

From the specification:

Locus of modifications -- supported

The XQuery Update Facility MUST be able to change the properties of existing nodes while preserving their identity. The XQuery Update Facility MUST be able to create a new copy of a node with a specific set of changes.

Delete -- supported

The XQuery Update Facility MUST be able to delete nodes.

Insert -- supported

The XQuery Update Facility MUST be able to insert new nodes in specified positions.

Replace -- supported

The XQuery Update Facility MUST be able to replace a node.

Changing values -- supported

The XQuery Update Facility MUST be able to change the value returned by the typed-value accessor for a node.

Modifying properties -- not supported

The XQuery Update Facility SHOULD be able to modify some of the properties of a node such as the name, type, content, nilability, base-URI, etc.

Moving nodes -- not supported

The XQuery Update Facility MAY be able to move a node from one location to another.

Conditional updates -- supported

The XQuery Update Facility MUST be able to do conditional updates.

Iterative updates -- supported

The XQuery Update Facility MUST be able to iterate over nodes to do updates.

Validation -- not supported

The XQuery Update Facility MAY support an explicit XML Schema validation operation that preserves node identity.

Compositionality -- supported

The XQuery Update Facility MUST be able to compose update operators with other update operators. The XQuery Update Facility MAY be compositional with respect to XQuery expressions; that is, it may be possible to use an update wherever an XQuery expression is used.

Parameterization -- supported

The XQuery Update Facility SHOULD provide a means to parameterize update operations.

W3C XQuery Update Semantics Requirements

From the specification:

Atomicity -- supported

The XQuery Update Facility MUST provide a set of atomic operations, and MUST define a means to group atomic operations into an atomic execution unit.

Consistency -- supported

At the end of an outermost update operation (that is, an update operation invoked from the external environment), the data model MUST be consistent with respect to the constraints specified in the Data Model. In particular, all type annotations MUST be consistent with the content of the items they govern. The XQuery Update Facility MAY define additional levels of granularity at which Data Model constraints are enforced.

Isolation -- supported

The XQuery Update Facility SHOULD define the means by which operations can be isolated from concurrent operations.

Durability -- supported

The XQuery Update Facility MUST define a means to control the durability of atomic operations and atomic execution units.

Conclusion

Interested parties can download evaluation and community licensed copies of Mark Logic Server from http://xqzone.marklogic.com. Full function documentation for Mark Logic search and update functions can be found at http://xqzone.marklogic.com/pubs/.

 

Biography

Jason Hunter

Mark Logic

Jason Hunter works as a Lead Applications Engineer at Mark Logic. He's author of Java Servlet Programming (O'Reilly). He's a member of the expert groups responsible for Servlet, JSP, JAXP, and XQJ API development, and has assisted the W3C XQuery Working Group. He is also an Apache Member and as Apache's representative to the Java Community Process Executive Committee he established a landmark agreement allowing open source Java. He co-created the open source JDOM library to enable optimized Java and XML integration.

posted on 2008-07-29 17:15 gembin 阅读(726) 评论(0)  编辑  收藏 所属分类: DatabaseXMLXQuery


只有注册用户登录后才能发表评论。


网站导航:
 

导航

统计

常用链接

留言簿(6)

随笔分类(440)

随笔档案(378)

文章档案(6)

新闻档案(1)

相册

收藏夹(9)

Adobe

Android

AS3

Blog-Links

Build

Design Pattern

Eclipse

Favorite Links

Flickr

Game Dev

HBase

Identity Management

IT resources

JEE

Language

OpenID

OSGi

SOA

Version Control

最新随笔

搜索

积分与排名

最新评论

阅读排行榜

评论排行榜

free counters