XSLT 2.0 Processing in Postgres

Introduction

Through the xml2 module, Postgres can apply XSL stylesheets to arbitrary texts. There are two big caveats with this approach:

The xslt functionality is through libxslt, which conforms to XSLT 1.0
The xml2 module is deprecated, slated for removal, and Postgres doesn’t implement the functionality in the core server, either.

All is not lost for those who want to use XSLT 2.0+ in Postgres. We can embed the gold standard XML Library, saxon, in Postgres using PL/Java. In the end, we will have a regular Postgres function that passes in a stylesheet and data file path, and returns the text result. Since this is a non-trivial, I’ve decided to document the process of setting up, installing, and running an example stylesheet.

Setting up Postgres

We’re going to be using Postgres 9.4 on Ubuntu, and setting up Postgres 9.4 is as simple as following the tutorial outlined on the Postgres site, but for convenience, I’ll copy it down below. One thing to note is that we installing a development dependencies as PL/Java tie into those.

echo deb http://apt.postgresql.org/pub/repos/apt/ trusty-pgdg main >\
  /etc/apt/sources.list.d/pgdg.list

wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | \
sudo apt-key add -
sudo apt-get update
sudo apt-get install postgresql-9.4 postgresql-server-dev-9.4 libecpg-dev libpgtypes3

Now you can sudo -u postgres psql to log into the database!

Setting up PL/Java

PL/Java allows Postgres to call out to the JVM for additional functionality. The default JDK on Ubuntu is OpenJDK, and I happen to prefer the Oracle JDK, so we’ll install that. Additionally there isn’t a maven package for PL/Java so we’ll have to build it on the machine with Postgres on it because of the use of JNI in PL/Java. For those interested in the rationale, there is an interesting article on the PL/Java wiki that explains the choice of JNI.

sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer maven
git clone https://github.com/tada/pljava.git
cd pljava
mvn package

Now that we have PL/Java built, we are going to add it as an extension to Postgres.

After the extension had been added, I needed to add the install extension jar file to the class path in the module configuration.

# Add settings for extensions here
pljava.classpath = '/usr/share/postgresql/9.4/extension/pjlava-1.4.4.jar'

Initial Test

Before we commit to writing our XSLT 2.0+ processor, let’s make sure that our extension we installed works. Log into the database and execute.

CREATE EXTENSION pljava;
CREATE FUNCTION getsysprop(VARCHAR) RETURNS VARCHAR
  AS 'java.lang.System.getProperty'
  LANGUAGE java;

SELECT getsysprop('user.home');

If you see an output and no errors, we know the installation was a success!

Only have one pljava extension for the entire server! I made the mistake of creating the extension on the “postgres” database and on “mydb” the result was very uninformative “ERROR: Failed to create Java VM”

Implementation

This part assumes a Maven project.

Add the Saxon library to the pom file.

<dependency>
    <groupId>net.sf.saxon</groupId>
    <artifactId>Saxon-HE</artifactId>
    <version>${saxon.version}</version>
</dependency>

Since PL/Java works with static functions, we create a class of Helpers in the com.example.pljava.test package with a static xslt function.

The following implementation takes in a file path to the XSLT and a file path to the XML.

private static final TransformerFactory factory = new TransformerFactoryImpl();
private static final Logger logger = Logger.getAnonymousLogger();

public static String xslt(String xsltPath, String dataPath) {
    try {
        final Templates late = factory.newTemplates(
            new StreamSource(new File(xsltPath)));
        StringWriter writer = new StringWriter();
        Transformer transformer = late.newTransformer();
        transformer.transform(new StreamSource(new File(dataPath)),
            new StreamResult(writer));
        return writer.toString();
    } catch (TransformerException e) {
        logger.warning(e.getMessage());
        return null;
    }
}

Build the project (I prefer to combine everything into an uber jar using Maven Shade). To load it into Postgres, log into the database (no need to CREATE EXTENSION again), and execute.

SELECT sqlj.install_jar('file:///home/nick/pljava-test.jar', 'test', false);
SELECT sqlj.set_classpath('public', 'test');

Documentation on install jar and set_classpath. The classpath is set on public namespace as public is the default namespace in Postgres.

Next step. Create a Postgres function to call out to our function.

CREATE FUNCTION xslt(varchar, varchar) RETURNS varchar
  AS 'com.example.pljava.test.Helpers.xslt'
  LANGUAGE javaU;

The language is javaU because we are accessing the file system, which is an inheritantly unsafe operation. Only a super user can create this function.

Just to illustrate that our solution works, we’ll use the generic identity transform. Since our function works with file paths, we’ll store the following in /home/nick/identity.xsl.

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>

In /home/nick/data.xml we have

<a>Hello</a>

And now the clincher.

SELECT xslt('/home/nick/identity.xsl', '/home/nick/data.xml')

Will output <a>Hello</a>. Hurray! We just sucessfully embeded an XSLT 2.0 processor in Postgres.

Improvements

Looking at the code, we see that each time the xslt function is called we are parsing the xslt file each time, but intepreting the same xslt transformation is redundant for each data file. We can implement a cache such that xslt files only need to be parsed and interpreted once. The code below solves the issue with Guava’s Cache, which evicts the templates after an hour of no accesses.

private static final TransformerFactory factory = new TransformerFactoryImpl();
private static final Logger logger = Logger.getAnonymousLogger();

private static final LoadingCache<String, Templates> XSLT_CACHE =
        CacheBuilder.newBuilder()
                .expireAfterAccess(1, TimeUnit.HOURS)
                .build(new CacheLoader<String, Templates>() {
                    @Override
                    public Templates load(String s) throws Exception {
                        return factory.newTemplates(new StreamSource(new File(s)));
                    }
                });

public static String xslt(String xsltPath, String dataPath) {
    try {
        final Templates late = XSLT_CACHE.get(xsltPath);
        StringWriter writer = new StringWriter();
        Transformer transformer = late.newTransformer();
        transformer.transform(new StreamSource(new File(dataPath)),
            new StreamResult(writer));
        return writer.toString();
    } catch (TransformerException | ExecutionException e) {
        logger.warning(e.getMessage());
        return null;
    }
}

We can build our project and replace the exist jar

SELECT sqlj.replace_jar('file:///home/nick/pljava-test.jar', 'test', false);

Performance

With \timing and 15 values running through the function yielded about 1.25 milliseconds per row, which is on a single core, 2GB RAM, hard drive, and a VM. Obviously your mileage will vary, but I’m expecting this method to scale out beautifully because our xslt function is logically immutable given the files don’t change. If the contents of the file are subject to change, create the function with STABLE volatility, else use IMMUTABLE.

Additional Information

For more information, see a great article Introduction to PostgreSQL PL/java, which will go over additional features, such as deployment descriptors (making deployments easier) in more details.