XSLT 2.0 Processing in Postgres
Published on:Table of Contents
Introduction
Through the xml2 module, Postgres can apply XSL stylesheets to arbitrary texts. There are two big caveats with this approach:
- The xslt functionality is through libxslt, which conforms to XSLT 1.0
- The xml2 module is deprecated, slated for removal, and Postgres doesn’t implement the functionality in the core server, either.
All is not lost for those who want to use XSLT 2.0+ in Postgres. We can embed the gold standard XML Library, saxon, in Postgres using PL/Java. In the end, we will have a regular Postgres function that passes in a stylesheet and data file path, and returns the text result. Since this is a non-trivial, I’ve decided to document the process of setting up, installing, and running an example stylesheet.
Setting up Postgres
We’re going to be using Postgres 9.4 on Ubuntu, and setting up Postgres 9.4 is as simple as following the tutorial outlined on the Postgres site, but for convenience, I’ll copy it down below. One thing to note is that we installing a development dependencies as PL/Java tie into those.
echo deb http://apt.postgresql.org/pub/repos/apt/ trusty-pgdg main >\
/etc/apt/sources.list.d/pgdg.list
wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | \
sudo apt-key add -
sudo apt-get update
sudo apt-get install postgresql-9.4 postgresql-server-dev-9.4 libecpg-dev libpgtypes3
Now you can sudo -u postgres psql
to log into the database!
Setting up PL/Java
PL/Java allows Postgres to call out to the JVM for additional functionality. The default JDK on Ubuntu is OpenJDK, and I happen to prefer the Oracle JDK, so we’ll install that. Additionally there isn’t a maven package for PL/Java so we’ll have to build it on the machine with Postgres on it because of the use of JNI in PL/Java. For those interested in the rationale, there is an interesting article on the PL/Java wiki that explains the choice of JNI.
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer maven
git clone https://github.com/tada/pljava.git
cd pljava
mvn package
Now that we have PL/Java built, we are going to add it as an extension to Postgres.
After the extension had been added, I needed to add the install extension jar file to the class path in the module configuration.
# Add settings for extensions here
pljava.classpath = '/usr/share/postgresql/9.4/extension/pjlava-1.4.4.jar'
Initial Test
Before we commit to writing our XSLT 2.0+ processor, let’s make sure that our extension we installed works. Log into the database and execute.
CREATE EXTENSION pljava;
CREATE FUNCTION getsysprop(VARCHAR) RETURNS VARCHAR
AS 'java.lang.System.getProperty'
LANGUAGE java;
SELECT getsysprop('user.home');
If you see an output and no errors, we know the installation was a success!
Only have one pljava extension for the entire server! I made the mistake of creating the extension on the “postgres” database and on “mydb” the result was very uninformative “ERROR: Failed to create Java VM”
Implementation
This part assumes a Maven project.
Add the Saxon library to the pom file.
<dependency>
<groupId>net.sf.saxon</groupId>
<artifactId>Saxon-HE</artifactId>
<version>${saxon.version}</version>
</dependency>
Since PL/Java works with static functions, we create a class of Helpers
in
the com.example.pljava.test
package with a static xslt
function.
The following implementation takes in a file path to the XSLT and a file path to the XML.
private static final TransformerFactory factory = new TransformerFactoryImpl();
private static final Logger logger = Logger.getAnonymousLogger();
public static String xslt(String xsltPath, String dataPath) {
try {
final Templates late = factory.newTemplates(
new StreamSource(new File(xsltPath)));
StringWriter writer = new StringWriter();
Transformer transformer = late.newTransformer();
transformer.transform(new StreamSource(new File(dataPath)),
new StreamResult(writer));
return writer.toString();
} catch (TransformerException e) {
logger.warning(e.getMessage());
return null;
}
}
Build the project (I prefer to combine everything into an uber jar using Maven
Shade). To load it into
Postgres, log into the database (no need to CREATE EXTENSION
again), and
execute.
SELECT sqlj.install_jar('file:///home/nick/pljava-test.jar', 'test', false);
SELECT sqlj.set_classpath('public', 'test');
Documentation on install jar
and
set_classpath
.
The classpath is set on public
namespace as public
is the default namespace
in Postgres.
Next step. Create a Postgres function to call out to our function.
CREATE FUNCTION xslt(varchar, varchar) RETURNS varchar
AS 'com.example.pljava.test.Helpers.xslt'
LANGUAGE javaU;
The language is javaU
because we are accessing the file system, which is an
inheritantly unsafe operation. Only a super user can create this function.
Just to illustrate that our solution works, we’ll use the generic identity
transform. Since our function works with file paths, we’ll store the following
in /home/nick/identity.xsl
.
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
In /home/nick/data.xml
we have
<a>Hello</a>
And now the clincher.
SELECT xslt('/home/nick/identity.xsl', '/home/nick/data.xml')
Will output <a>Hello</a>
. Hurray! We just sucessfully embeded an XSLT 2.0
processor in Postgres.
Improvements
Looking at the code, we see that each time the xslt
function is called we are
parsing the xslt file each time, but intepreting the same xslt transformation
is redundant for each data file. We can implement a cache such that xslt files
only need to be parsed and interpreted once. The code below solves the issue
with Guava’s Cache, which evicts the templates after an hour of no
accesses.
private static final TransformerFactory factory = new TransformerFactoryImpl();
private static final Logger logger = Logger.getAnonymousLogger();
private static final LoadingCache<String, Templates> XSLT_CACHE =
CacheBuilder.newBuilder()
.expireAfterAccess(1, TimeUnit.HOURS)
.build(new CacheLoader<String, Templates>() {
@Override
public Templates load(String s) throws Exception {
return factory.newTemplates(new StreamSource(new File(s)));
}
});
public static String xslt(String xsltPath, String dataPath) {
try {
final Templates late = XSLT_CACHE.get(xsltPath);
StringWriter writer = new StringWriter();
Transformer transformer = late.newTransformer();
transformer.transform(new StreamSource(new File(dataPath)),
new StreamResult(writer));
return writer.toString();
} catch (TransformerException | ExecutionException e) {
logger.warning(e.getMessage());
return null;
}
}
We can build our project and replace the exist jar
SELECT sqlj.replace_jar('file:///home/nick/pljava-test.jar', 'test', false);
Performance
With \timing
and 15 values running through the function yielded about 1.25
milliseconds per row, which is on a single core, 2GB RAM, hard drive, and a VM.
Obviously your mileage will vary, but I’m expecting this method to scale out
beautifully because our xslt
function is logically immutable given the files
don’t change. If the contents of the file are subject to change, create the
function with STABLE
volatility, else use IMMUTABLE
.
Additional Information
For more information, see a great article Introduction to PostgreSQL PL/java, which will go over additional features, such as deployment descriptors (making deployments easier) in more details.
Comments
If you'd like to leave a comment, please email [email protected]