Processing large XML files with Shell Scripts

I recently did some work analysing XML files for data imports. This kind of task is usually well suited to Taco Bell programming, that is, combining a handful of standard tools instead of writing new software. XML, however, is not easily manipulated with standard Unix utilities, so I looked for a way to run XQuery expressions against my files.

The first thing I found was the eXist XML database. It has a reasonable HTTP interface and can therefore be scripted using curl. The trouble with eXist is that you first have to import your data into the database, and it doesn’t deal gracefully with big files (~500 MB), at least not in an ad hoc, naïve fashion.

After some research I found xqilla, a command-line utility that operates directly on files and easily filters hundreds of megabytes.

I also found XQuery really nice: it doesn’t use angle brackets. Here is a little example to make my point. Say I want to extract some information from this XML input (bigfile.xml):

<?xml version="1.0" encoding="UTF-8"?>
<library>
    <book author="J.D. Salinger" title="The Catcher in the Rye" lang="en">
        <isbn>0-316-76953-3</isbn>
    </book>
    <book author="Joseph Heller" title="Catch-22" lang="en">
        <isbn>0-684-83339-5</isbn>
    </book>
    <book author="Ödön von Horváth" title="Jugend ohne Gott" lang="de">
        <isbn>3-518-18807-0</isbn>
    </book>
</library>

Now I create the following XQuery and write it to a file called test.xquery:

for $x in ./library/book
where $x/@lang = "en"
return 
  concat(
    data($x/@title),
    " by ",
    data($x/@author),
    ": ",
    data($x/isbn)
  )

Now running

xqilla test.xquery -i bigfile.xml

yields this:

The Catcher in the Rye by J.D. Salinger: 0-316-76953-3
Catch-22 by Joseph Heller: 0-684-83339-5

If you are running multiple queries against a single file, eXist will probably be faster, provided you create the right indices. For just filtering XML in a single pass, though, xqilla is a tool to consider.
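
Judging by the output above, xqilla prints one result item per line, so it slots into an ordinary pipeline with the usual utilities. Here is a minimal sketch of that idea. Both tally (a helper function) and langs.xquery (a query file) are hypothetical names made up for this illustration and are not part of xqilla:

```shell
# Hypothetical helper (not part of xqilla): count identical lines
# read from standard input and list the most frequent ones first.
tally() {
    sort | uniq -c | sort -rn
}

# Assumed usage, with langs.xquery (a made-up file name) containing
#   for $x in ./library/book return data($x/@lang)
#
#   xqilla langs.xquery -i bigfile.xml | tally
```

For the sample file above, this should report two books for en and one for de.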

For ad hoc usage it is often impractical to create a file for the query expression. In such cases process substitution is your friend:

xqilla <(echo 'for $x in ./library/book return data($x/@title)') -i bigfile.xml
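
Process substitution is a feature of bash (zsh and ksh have it too) and is not available in a plain POSIX sh. Where that matters, a small wrapper around a temporary file does the same job. The following is only a sketch, assuming mktemp is installed. The name xq is hypothetical and is not something xqilla ships:

```shell
# Hypothetical helper (not part of xqilla): xq QUERY XMLFILE
# Writes the query to a temporary file, runs xqilla on it,
# removes the file again and returns xqilla's exit status.
xq() {
    xq_tmp=$(mktemp) || return 1
    printf '%s\n' "$1" > "$xq_tmp"
    xqilla "$xq_tmp" -i "$2"
    xq_status=$?
    rm -f "$xq_tmp"
    return "$xq_status"
}

# Assumed usage:
#   xq 'for $x in ./library/book return data($x/@title)' bigfile.xml
```

The single quotes around the query matter: they stop the shell from expanding $x before xqilla sees it.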