Showing posts with label XQuery. Show all posts

Thursday, October 20, 2011

Testing for an empty sequence in XQuery

In an XQuery code base I was maintaining recently, all external variables were strongly typed as strings (declare variable $input as xs:string external;). This saved some time and lines of code since we always knew the type of those variables and the code would fail fast if they were not strings.

Unfortunately, the writers of the calling code decided it was too hard to make sure an empty string was passed in for certain cases and they wanted to pass in null, which was translated to the empty sequence. Rather than make edits to many different functions, we edited the main modules to be more defensive.

For some reason I have a hard time remembering how to test for an empty sequence, and I also wasn't sure whether I could redefine a declared variable in a FLWOR expression, so I wrote the little test snippet below.
xquery version "1.0-ml";
declare variable $input := ();

let $original-input := $input
let $input := if (fn:empty($input)) then "" else $input
return (
  element results {
    element original-input-empty { fn:empty($original-input) },
    element revised-input-empty { fn:empty($input) }
  }
)

Thursday, May 12, 2011

XQuery Katas

After having several MarkLogic projects back-to-back, I haven't had a new one in far too long. So, to practice my XQuery in the hopes of having a new one soon, I've started a collection of XQuery katas.

The intent of these is less TDD and more brushing up on XQuery, but I'm also interested in testing my functions. To that end you'll see I have my tests built using:
  1. Visual Studio
  2. NUnit
  3. Saxon HE
Meager, but it suits my current needs.

My first kata is up now on BitBucket and I've gotten some constructive feedback from the xquery-talk mailing list. I hope to add more over the next few weeks.
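To give a flavor of the kind of exercise I mean, here's a hypothetical warm-up in plain XQuery 1.0 (a classic FizzBuzz, not one of the actual katas in the repository), runnable as-is under Saxon HE:

```xquery
xquery version "1.0";

(: Illustration only -- not one of the katas on BitBucket.
   Return "Fizz" for multiples of 3, "Buzz" for multiples of 5,
   "FizzBuzz" for both, otherwise the number itself. :)
declare function local:fizzbuzz($n as xs:integer) as xs:string
{
  if ($n mod 15 eq 0) then "FizzBuzz"
  else if ($n mod 3 eq 0) then "Fizz"
  else if ($n mod 5 eq 0) then "Buzz"
  else xs:string($n)
};

string-join(for $i in 1 to 15 return local:fizzbuzz($i), " ")
```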

Tuesday, April 13, 2010

XQuery Link-Love

Hi there.

I recently received some link-love from Pete Aven (Twitter, Blog) for some XQuery and MarkLogic Server posts I have here. The posts are somewhat outdated at this point, but I leave them up in case they help someone out. When I was just starting out, this kind of information was invaluable to me, so I'm just trying to pay it forward.

I'm currently on a big SQL Server based project and my last MarkLogic project had some hellish deadlines that didn't leave much time for posting newer bits. With a little luck the next project I have lined up will bring me back into the XQuery/MarkLogic fold!

In the meantime, read Pete's blog and the other bloggers he has listed here.

Friday, July 31, 2009

Unique Attribute Values Across Multiple Documents using XQuery

It's a little slow, but here's one way to get a list of all the unique attribute values across multiple XML documents using XQuery.


let $raw-values :=
  for $book in collection("abc")/(gbook|set)[@type='oeb']
  return
    element { "book" }
    {
      for $value in distinct-values($book//node()/@class)
      return element { "class" } { $value }
    }
for $item in distinct-values($raw-values//class)
order by $item
return element { "uniques" } { $item }

Friday, July 10, 2009

MarkLogic XCC Layer File Open Errors

If you have library modules you're importing, the query may work fine in cq, but if you try to use the same query via the XCC layer you may get "File Open Error" messages.

One cause of this for me was the pathing in the import statement. cq seems to handle a relative path while XCC cannot, at least in MarkLogic 4.1.

I needed to change from...

import module namespace my = "http://blah.com" at "search-parser-xml.xqy",
"search-snippet.xqy";

...to...

import module namespace my = "http://blah.com" at "/search-parser-xml.xqy",
"/search-snippet.xqy";

Monday, July 6, 2009

MarkLogic, cq and Namespaces

If you import an XQuery library in cq and declare the namespace, cq gets fussy if you then try to declare your own functions. I know there are clear reasons for this, but here's what I do so I can use my own functions during testing.


xquery version "1.0-ml";

import module namespace search = "http://marklogic.com/appservices/search"
at "/MarkLogic/appservices/search/search.xqy";

declare namespace my="http://www.my-web-site.com/xquery";

declare variable $options-title :=
<options xmlns="http://marklogic.com/appservices/search">
<searchable-expression>
collection("abc123")//(div)
</searchable-expression>
<transform-results apply="snippet">
<per-match-tokens>30</per-match-tokens>
<max-matches>1</max-matches>
<max-snippet-chars>200</max-snippet-chars>
<preferred-elements/>
</transform-results>
</options>;

declare function my:do-search()
{
search:search("food", $options-title, (), 25)
};

my:do-search()

Monday, November 24, 2008

Loading XML into eXist Using XQuery and the Sandbox

This past weekend I was tinkering with the eXist XML database. The installation went fine and some of their sample queries ran fine. My next step was to load some of my content into it.

Rather than use their web interface or desktop client, I wanted to load the documents using XQuery through their sandbox application. I thought this would be quick and easy and would allow me to compare some features of eXist to MarkLogic Server.

There is quite a bit of documentation for eXist, but the XQuery API is light on specific usage examples. I also ran into some non-obvious gotchas. Here is the XQuery code that I used to load a document into a specific collection, along with some notes below.
declare namespace xmldb="http://exist-db.org/xquery/xmldb";

declare variable $file as xs:string {
  "file:///C:/Program%20Files/eXist/samples/mattio/sample.xml" };
declare variable $name as xs:string { "sample.xml" };
declare variable $collection as xs:string { "/db/test/" };

<results>
{
  let $collection-status :=
    if (not(xmldb:collection-exists($collection))) then
      xmldb:create-collection("", $collection)
    else ("Collection already exists.")
  return <collection-status> { $collection-status } </collection-status>
  ,
  let $load-status := xmldb:store($collection, $name, xs:anyURI($file))
  return <load-status> { $load-status } </load-status>
}
</results>
When I tried to start my path with C:\ or left out xs:anyURI(), I got a misleading error that implied there was something wrong with my document. The error was:

XMLDB reported an exception while storing documentorg.xmldb.api.base.XMLDBException: fatal error at (1,1) : Content is not allowed in prolog. [at line 120, column 21] In call to function: sandbox:exec-query(xs:string) [134:10]

Here are some other notes.

  1. Note that the xmldb namespace needs to be declared.

  2. Note the syntax of $file. This is how you reference a document on your file system, including encoding the path to use %20 instead of a space.

  3. Note that $file must be wrapped in xs:anyURI() when used in xmldb:store() in order to force it to be considered a URI and not a simple string.


Thanks to Dannes and Wolfgang for their help with this. They were on the exist-open list on a Saturday.

Next up I'll load about 50 large documents to build some basic queries to review index tuning.

Wednesday, October 8, 2008

Add XQuery Support to UltraEdit

Leave it to the team that works on UltraEdit to make adding XQuery support easy.

Here's a tutorial on adding a language.

Here's the XQuery wordfile they provide.

Done in about 15 seconds.  Nice.

Sunday, September 28, 2008

Stop a Long-Running Query in MarkLogic Server

If you're like me, every once in a while you'll be working on a query in cq, do something stupid in XQuery, run the query, and it will run forever. MarkLogic Server has built-in timeouts, but you can also stop a long-running query rather than waiting. Here's how to do it using the default admin console in 3.2.
  1. In the left-hand column, click on Groups.
  2. Click on Default.
  3. Click on App Servers.
  4. Click on the app server cq is connected to.
  5. Click the Status tab.
  6. Click the Show More button.
  7. Scroll to the bottom and you should see a request with the /cq/ path referenced.
  8. Click the cancel link.
  9. Confirm that you want to cancel the query.
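If you'd rather not click through the console, newer MarkLogic releases also expose this through built-ins. The snippet below is only a sketch: the app server name and the request id are placeholders you'd look up first in the xdmp:server-status() output for your cluster.

```xquery
(: Hypothetical sketch: cancel a request programmatically.
   "cq" and the request id are placeholders -- get the real
   values from xdmp:server-status() before running this. :)
let $host := xdmp:host()
let $server := xdmp:server("cq")
let $request-id := 1234567890
return xdmp:request-cancel($host, $server, $request-id)
```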

Thursday, September 11, 2008

New XQuery Component for SyntaxHighlighter

I created an XQuery component for Alex Gorbatchev's SyntaxHighlighter, which you see in use with my XQuery snippets on this blog.

I've asked Alex to add it to the download package, but you can also download it from my site.

Wednesday, June 11, 2008

Removing Noise Words from a String with XQuery

MarkLogic doesn't offer a way to do stop words (a/k/a suppression lists a/k/a noise words) by default for various reasons -- and I didn't want to block them from being used in searches -- but I was asked to remove them from consideration when using hit highlighting. Here's the code I used to remove a fixed set of noise words from a user's search string.

define variable $NOISE_WORDS as xs:string*
{
(: \b is a word boundary. This catches beginning,
end, and middle of string matches on whole words. :)
('\bthe\b', '\bof\b', '\ban\b', '\bor\b',
'\bis\b', '\bon\b', '\bbut\b', '\ba\b')
}

define function remove-noise-words($string, $noise)
{
  (: This is a recursive function. :)
  if(not(empty($noise))) then
    remove-noise-words(
      replace($string, $noise[1], '', 'i'),
      (: This passes along the noise words after
         the one just evaluated. :)
      $noise[position() > 1]
    )
  else normalize-space($string)
}

let $source-string1 := "The Tragedy of King Lear"
let $source-string2 := "The Tragedy OF King Lear These an"
let $source-string3 :=
  "The Tragedy of the an of King Lear These of"
let $source-string4 := "The of an of"
(: Need to handle empty result if all noise words,
   as in #4 above. :)
let $final := remove-noise-words($source-string1, $NOISE_WORDS)
return $final

Tuesday, April 15, 2008

Creating a Summary from a MarkLogic Search Result

It's called a summary, or a snippet, or context. It's the string beneath each search result that shows you some words around your search term(s) in the document that was returned.

There's a good one in lib-search if you're using it. I'm not ... yet. At first I tried to use just the relevant functions, but it wasn't doing quite what I wanted and it seemed pretty heavy, especially when returning 25 documents per page. The additional things I wanted it to do were to allow me to ignore certain elements and to cross element boundaries. So, even though I'm an XQuery and MarkLogic rookie, I decided to try to roll my own!

module "http://greenwood.com"
default function namespace="http://www.w3.org/2003/05/xpath-functions"
declare namespace gpg="http://greenwood.com"

(: Take a search result and create a snippet of text based on the first hit
in the file. Exclude selected elements when generating the snippet. If the
hit is in an element that is removed, it will use the next available hit
or default to the first string of words available. Element boundaries are
ignored, which is a perceived benefit. :)

define variable $gpg:START-TEXT as xs:string { "ML-HIT-START" }
define variable $gpg:END-TEXT as xs:string { "ML-HIT-END" }

define function gpg:get-summary($node as node(), $cts-query as cts:query, $word-buffer as xs:integer) as node()
{
let $myHighlight as node() := cts:highlight( $node, $cts-query, ($gpg:START-TEXT, $cts:text, $gpg:END-TEXT) )
let $mySummary as node() := <summary> { gpg:remove-elements($myHighlight) } </summary>
let $mySnippet as xs:string := gpg:create-snippet($mySummary, $word-buffer)
(: Yes, we are running cts:highlight twice. The advantage is that it greatly simplifies
the logic for getting the snippet text and has minimal impact on performance when
compared to that alternative. It's the lesser of two evils. :)
return
cts:highlight(<summary> { $mySnippet } </summary>, $cts-query, <span class="hit"> { $cts:text } </span> )
}

define function gpg:create-snippet($node as node(), $word-buffer as xs:integer) as xs:string
{
let $myString := normalize-space(string($node))
let $myTokenizedString := tokenize($myString, "\s")
(: If the sequence contains the start of the search indicator use it, else use 1. :)
(: index-of() can return a sequence of hits, so just grab the first. :)
let $myStartHit := if(index-of($myTokenizedString, $gpg:START-TEXT)[1] castable as xs:integer) then
index-of($myTokenizedString, $gpg:START-TEXT)[1]
else 1
(: If starting the buffer's number of words before the hit is a negative number,
start at 1, otherwise start at the first hit minus the buffer. :)
let $myStart := if( ($myStartHit - $word-buffer) < 0 ) then 1 else ($myStartHit - $word-buffer)
let $myEnd := $word-buffer*2
(: Subsequence does not really care if you feed it negative numbers or numbers that
extend beyond the source sequence's actual size, which is very useful here.
Negative numbers can have odd results, though. :)
let $myTokenizedStringSmall := subsequence($myTokenizedString, $myStart, $myEnd)
(: Join the sequence back together as a string with spaces between each item. :)
let $myUneditedString := string-join($myTokenizedStringSmall, " ")
(: Delete the placeholder text completely. :)
let $myEditedStringStart := replace($myUneditedString, $gpg:START-TEXT, '')
let $myEditedStringEnd := replace($myEditedStringStart, $gpg:END-TEXT, '')
(: When this is returned, run cts:highlight on it to get highlighting in the snippet.
Or don't if it's not needed. :)
return $myEditedStringEnd
}

(: This group of elements is used to remove selected nodes recursively. This means we can
remove hits on head or metadata elements, which might look odd in a snippet. :)
define function gpg:remove-elements($node as node()) as node()
{
for $i in $node/node() return gpg:removal($i)
}
(: This function removes nodes or pass them to the correct handler for processing. :)
define function gpg:removal($node as node()) as node()
{
typeswitch($node)
case text() return gpg:text-handler($node)
case element(content-metadata) return () (: This is one that is removed. :)
case element(head) return ()
case element(entry-head) return ()
case element(taxonomy) return ()
case processing-instruction() return ()
default return gpg:default-handler($node) (: The default is to return the node and recurse. :)
}
define function gpg:text-handler($node as node()?) as node()*
{
if(empty($node)) then ()
else (text {$node})
}
define function gpg:default-handler($node as node()?) as element()*
{
element { local-name($node) }
{ $node/@*, gpg:remove-elements($node) }
}

If you're reading this on a narrower screen resolution, you may be losing the right-hand side of the code. Copy and paste it out to see it better.

UPDATE: This is painfully slow, primarily I think because my documents are too large to process like this. I'm working to make this leaner.

Friday, March 28, 2008

Setting and Getting Document Quality in MarkLogic Server

I had a request to influence the score of documents returned by searches based on the year of publication. Since a year isn't used as part of most searches, it seemed like the best approach was to set the document quality to the pub year. Then I could use that value in the scoring calculations. Take a look at the Developer's Guide for how to do that.

Here's how I looped through all the documents in a collection and set document quality using a value stored in the XML already.

(: Set document quality :)
for $i in collection('myCollection')/book
return
  let $myYear := $i/metadata/publication-date
  let $myBaseUri := base-uri($i)
  (: I don't know about you, but I don't trust my XML vendors.
     This tests casting the data to an int first. :)
  let $myDocumentQuality :=
    if($myYear castable as xs:integer) then
      $myYear cast as xs:integer
    else 1990 (: This is a default setting in case of bad data. :)
  return xdmp:document-set-quality($myBaseUri, $myDocumentQuality)

Depending on how many documents you have stored, you may need to modify this to set the document quality in smaller batches because it is quite intensive.
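One simple, if manual, way to batch it is to take a slice of the books per run with subsequence(). The range below is an arbitrary assumption; you'd re-run with 501, 1001, and so on until you run out of books:

```xquery
(: Hypothetical batching sketch: only books 1 through 500 this run. :)
for $i in subsequence(collection('myCollection')/book, 1, 500)
let $myYear := $i/metadata/publication-date
let $myDocumentQuality :=
  if($myYear castable as xs:integer) then $myYear cast as xs:integer
  else 1990 (: Default in case of bad data, as above. :)
return xdmp:document-set-quality(base-uri($i), $myDocumentQuality)
```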

Once that's run, you can go back and review all your document quality settings. I originally had these two queries run together, but I think it takes a minute or two (depending on your system) for the settings to actually be indexed.

(: Get document quality :)
<results>
{
for $i in collection('myCollection')/book
return
<result
base-uri="{base-uri($i)}"
year="{$i/metadata/publication-date}"
set-document-quality="{xdmp:document-get-quality(base-uri($i))}"/>
}
</results>

Sunday, March 16, 2008

Basic Information on MarkLogic Collections

I wanted to get some quick information about all of my collections, starting with a list of names and how many documents were in each. This isn't rocket science, but I'll add to this post as the query expands.

(: Set "collection lexicon" to true in the index definition. :)
let $collections := cts:collections() return
<root count="{ fn:count($collections) }">
{
for $collection in $collections
return
<collection>
<name> { $collection } </name>
<size> { fn:count(fn:collection($collection)) } </size>
</collection>
}
</root>

Thursday, January 24, 2008

Modify an XML Fragment Retrieved by MarkLogic Server

Last week I was experimenting with a scenario where I wanted to retrieve a section of a larger XML document from MarkLogic Server, but modify it before passing it up to the web application layer (we're using MarkLogic -> XCC -> ASP.NET). This XQuery sample recurses through all of the element and text nodes, modifies what's needed, and passes the rest back unchanged. In the end we decided on a different approach, so this query is not tuned but the premise is interesting. I think there's probably a better way to structure the default handler, but for now ...

define variable $myId as xs:string external

define function recurse($node as node()) as node()*
{
for $i in $node/node() return modify-fragment($i)
}

(: Here we define which elements we want to modify and declare
a handler for them. :)
define function modify-fragment($node as node()) as node()*
{
typeswitch ($node)
case text() return text-handler($node)
case element(ref) return ref($node)
case element(see) return see($node)
case element(see-also) return see-also($node)
default return default-handler($node)
}

define function text-handler($node as node()?) as node()*
{
if(empty($node)) then ()
else (text {$node})
}

define function default-handler($node as node()?) as element()*
{
element { local-name($node) }
{ $node/@*, recurse($node) }
}

define function see($element as element(see)) as element()
{
let $destination-node := $element/ancestor::book/
descendant::node()[@local-id=$element/@seeref] return
if(not(empty($destination-node/ancestor-or-self::node()
[@fragment='true'][1]))) then
<see>
{ $element/@* }
{
attribute {"destination-node"}
{local-name($destination-node)},
attribute {"fragment-local-id"}
{$destination-node/ancestor-or-self::node()
[@fragment='true'][1]/@local-id}
}
{ recurse($element) }
</see>
else recurse($element)
}

define function see-also($element as element(see-also)) as element()
{
let $destination-node := $element/ancestor::book/
descendant::node()[@local-id=$element/@seeref] return
if(not(empty($destination-node/ancestor-or-self::node()
[@fragment='true'][1]))) then
<see-also>
{ $element/@* }
{
attribute {"destination-node"}
{local-name($destination-node)},
attribute {"fragment-local-id"}
{$destination-node/ancestor-or-self::node()
[@fragment='true'][1]/@local-id}
}
{ recurse($element) }
</see-also>
else recurse($element)
}

define function ref($element as element(ref)) as element()
{
if($element/@type='local-id') then
let $destination-node := $element/ancestor::book/
descendant::node()[@local-id=$element/@value] return
if(not(empty($destination-node/ancestor-or-self::node()
[@fragment='true'][1]))) then
<ref>
{ $element/@* }
{
attribute {"destination-node"}
{local-name($destination-node)},
attribute {"fragment-local-id"}
{$destination-node/ancestor-or-self::node()
[@fragment='true'][1]/@local-id}
}
{ recurse($element) }
</ref>
else recurse($element)
else recurse($element)
}

(:let $myId := 'ID1234':)

for $i in (//chapter[@local-id = $myId] | //entry[@local-id = $myId])[1]
return

<fragment imagepath="{$i/property::imagepath/text()}">
{
recurse($i)
}
</fragment>

Tuesday, January 15, 2008

XQuery Support in MarkLogic Server

It's buried a little deep, but here's how to find out which version of the XQuery spec your version of MarkLogic supports.

Go to http://developer.marklogic.com/ and find the documentation. Find the release notes. Find the section called "Compatibility with XQuery Drafts."

Here's the information for 3.2.
This release implements the XQuery language, functions and operators specified in the May 02, 2003 W3C XQuery Working Group Draft Recommendations:
Additionally, much of the added functionality in the January 2007 W3C XQuery Recommendation is implemented in MarkLogic Server 3.2.

Tuesday, August 28, 2007

Querying a MarkLogic Server Collection for a Title List, Plus Performance

The real theme of this post is, "Keep it simple, stupid."

I had a query I was using to generate a list of documents stored in a MarkLogic Server collection. The results were being passed to a web application for use as a title list. I was getting the results I wanted, but the query was taking about 20 seconds to complete and there were fewer than 200 book-length XML files to search. The key line in that query was:
for $i in cts:search(//book, cts:collection-query($myCollection))
I knew there had to be a better way to express this, but I couldn't find anything in the API that helped. Then someone suggested using the built-in XQuery function, collection(). I changed the query to...
for $i in collection($myCollection)/book
...and now the query runs in less than .05 seconds, including some trips down into the structure to grab metadata. That's what I was expecting.
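If you want to measure this sort of thing yourself, xdmp:elapsed-time() gives a rough reading inside a single query. A sketch, with the collection name as a placeholder:

```xquery
(: Rough timing sketch; 'myCollection' is a placeholder name.
   fn:count() forces the evaluation before the second reading. :)
let $before := xdmp:elapsed-time()
let $count := fn:count(collection('myCollection')/book)
return <timing books="{ $count }" elapsed="{ xdmp:elapsed-time() - $before }"/>
```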

Thanks to the people on the Mark Logic developer email list for helping with this.

Sunday, May 6, 2007

Get a Document's Properties by Attribute Value from MarkLogic Server

Here's a query to get a document's properties from MarkLogic Server using an attribute value. The attribute name is "id" and it is on the node named "document." I used the cts:element-attribute-value-query() function because I can set the case sensitivity and other options. The entire <prop:properties> node is returned.

define variable $myId as xs:string external
(:let $myId := 'ID1234':)

for $i in cts:search(//document,
cts:element-attribute-value-query(xs:QName("document"),
xs:QName("id"),
$myId,
"case-insensitive"
)
)
return $i/property::node()/..

You can pass in a variable from an external app or you can define it in the query using let. You can also tack on some additional metadata, like the URI of the document and any collections it belongs to. Change the return block for this to something like:

return
<result>
{ $i/property::node()/.. }
<uri> { base-uri($i) } </uri>
<collections> { xdmp:document-get-collections(base-uri($i)) } </collections>
</result>

Thursday, May 3, 2007

Using cts:search to Search More Than One Node Level in MarkLogic Server

I had a relatively simple requirement for building a search application running against MarkLogic Server: search multiple levels of the XML hierarchy for all files in the repository and return each level as a document; the search must be case and diacritic insensitive. Here are the beginnings of the query that did the trick:

define variable $ORIGINAL_QUERY as xs:string external

for $i in cts:search( //(chapter |
                         div |
                         entry |
                         section),
cts:word-query(
$ORIGINAL_QUERY,
("case-insensitive", "diacritic-insensitive")
)
)
return

<result id="{ $i/@local-id }">
{
$i/( content-metadata/title ),
$i/( content-metadata/subtitle ),
$i/( content-metadata/label ),
$i/( head ),
$i/( entry-head ),
$i/( content-metadata/contributor/display-name ),
$i/( content-metadata/copyright/display-date )
}
</result>

There's a lot more work to do on this -- tokenizing words, tokenizing quoted strings as phrases, accepting Boolean terms, returning different values based on the node, etc., etc., etc.

Thanks to the people on the Mark Logic developer email list for helping with this.