Processing large XML files with Shell Scripts

I recently did some work analysing XML files for data imports. This kind of task is usually well suited to Taco Bell programming, that is, combining a handful of standard Unix tools instead of writing new software. XML, however, is not easily manipulated with the standard Unix utilities, so I looked for a way to run XQuery expressions against my files.

The first thing I found was the eXist XML database. It has a reasonable HTTP interface and can therefore be scripted with curl. The trouble with eXist is that you first have to import your data into the database, and it doesn’t deal gracefully with big files (~500 MB), at least not in an ad-hoc, naïve fashion.

After some research I found xqilla, a command-line utility that operates directly on files and easily filters hundreds of megabytes.

I also found XQuery really nice to work with. The query language itself is not written in XML, so there are no angle brackets. Here is a little example to make my point. Say I want to extract some information from this XML input (bigfile.xml):

<?xml version="1.0" encoding="UTF-8"?>
<library>
    <book author="J.D. Salinger" title="The Catcher in the Rye" lang="en">
        <isbn>0-316-76953-3</isbn>
    </book>
    <book author="Joseph Heller" title="Catch-22" lang="en">
        <isbn>0-684-83339-5</isbn>
    </book>
    <book author="Ödön von Horváth" title="Jugend ohne Gott" lang="de">
        <isbn>3-518-18807-0</isbn>
    </book>
</library>

Now I write the following XQuery expression to a file called test.xquery:

for $x in ./library/book
where $x/@lang = "en"
return 
  concat(
    data($x/@title),
    " by ",
    data($x/@author),
    ": ",
    data($x/isbn)
  )

Now running

xqilla test.xquery -i bigfile.xml

yields this:

The Catcher in the Rye by J.D. Salinger: 0-316-76953-3
Catch-22 by Joseph Heller: 0-684-83339-5
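
Because xqilla prints one plain-text line per result here, the output feeds straight into the usual Unix tools. The following is only a small sketch, assuming the line format produced by test.xquery (title, " by ", author, ": ", ISBN); isbn_only is a hypothetical helper name, not something xqilla provides.

```shell
# Keep only the ISBN, i.e. whatever follows the last ": " on each line.
# isbn_only is a made-up helper name for this sketch.
isbn_only() {
    awk -F': ' '{ print $NF }'
}

# With xqilla installed, the real pipeline would be:
#   xqilla test.xquery -i bigfile.xml | isbn_only | sort
# Here the two result lines from the example stand in for xqilla's output.
printf '%s\n' \
    'The Catcher in the Rye by J.D. Salinger: 0-316-76953-3' \
    'Catch-22 by Joseph Heller: 0-684-83339-5' |
    isbn_only | sort
```

Splitting on the last ": " rather than the first keeps the helper working if a title itself happens to contain a colon.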

If you are running multiple queries against a single file, eXist will probably be faster once you have created the right indices, but for just filtering XML in a single pass, xqilla is a tool to consider.

For ad-hoc usage it is often impractical to create a file for the query expression. In such cases process substitution is your friend:

xqilla <(echo 'for $x in ./library/book return data($x/@title)') -i bigfile.xml
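
Process substitution is a feature of bash, zsh and ksh; a script that has to run under plain sh can get a similar effect with a temporary file. The following is a minimal sketch of that fallback, not a feature of xqilla: xq is a hypothetical helper name, and it assumes mktemp is available on the system.

```shell
#!/bin/sh
# xq QUERY FILE: run an inline XQuery expression against FILE.
# "xq" is a hypothetical helper name; it only wraps the
# "xqilla <query file> -i <input file>" call shown above.
xq() {
    xq_query_file=$(mktemp) || return 1
    # Write the query text verbatim; printf avoids echo's escape handling.
    printf '%s\n' "$1" > "$xq_query_file"
    xq_status=0
    xqilla "$xq_query_file" -i "$2" || xq_status=$?
    # Clean up the temporary query file whether or not xqilla succeeded.
    rm -f "$xq_query_file"
    return "$xq_status"
}

# Example call, only attempted when xqilla and the input file are present.
if command -v xqilla >/dev/null 2>&1 && [ -f bigfile.xml ]; then
    xq 'for $x in ./library/book return data($x/@title)' bigfile.xml
fi
```

The query is passed in single quotes so that the shell does not try to expand $x itself.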

Comments

One response to “Processing large XML files with Shell Scripts”

Stefan Meier

7. October 2011

How big are the files you have processed successfully in this fashion? I’m trying to query against a 19.5GB XML file and xqilla hogs memory without end and then just swaps endlessly.

– Stefan
