12 February 2010

Processing large XML documents with XSLT

I've resurrected an old Java program called xsltstream which might be useful for BioHackathon 2010. This program applies an XSLT stylesheet to each occurrence of a given node in a large XML document. For each target element, a DOM is built in memory from a SAX stream, processed with XSLT, and then disposed of. Now, say you want to transform an XML file from dbSNP with XSLT to make an RDF document. You cannot do that with xsltproc because the XML file is just too big (e.g. ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/XML/ds_ch1.xml.gz is 1,099,375 KB).
But with xsltstream, an XSLT stylesheet can be applied to all the <Rs> elements of 'ds_ch1.xml.gz':

java -jar xsltstream.jar -x 'http://lindenb.googlecode.com/svn/trunk/src/xsl/dbsnp2rdf.xsl' -q Rs \
'ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/XML/ds_ch1.xml.gz' |\
grep -v "rdf:RDF" | grep -v "<?xml version"


(...)
<o:SNP rdf:about="http://www.ncbi.nlm.nih.gov/snp/830">
<dc:title>rs830</dc:title>
<o:taxon rdf:resource="http://www.ncbi.nlm.nih.gov/taxonomy/9606"/>
<o:het rdf:datatype="http://www.w3.org/2001/XMLSchema#float">0.02</o:het>
<o:hasHandle rdf:resource="urn:void:ncbi:snp:handle:WIAF"/>
<o:hasHandle rdf:resource="urn:void:ncbi:snp:handle:SNP500CANCER"/>
<o:hasHandle rdf:resource="urn:void:ncbi:snp:handle:SEQUENOM"/>
<o:hasMapping>
<o:Mapping>
<o:build rdf:resource="urn:void:ncbi:build:Celera/36_3"/>
<o:chrom rdf:resource="urn:void:ncbi:chromosome:9606/chr1"/>
<o:start rdf:datatype="http://www.w3.org/2001/XMLSchema#int">66444409</o:start>
<o:end rdf:datatype="http://www.w3.org/2001/XMLSchema#int">66444410</o:end>
<o:orient>+</o:orient>
</o:Mapping>
</o:hasMapping>
<o:hasMapping>
<o:Mapping>
<o:build rdf:resource="urn:void:ncbi:build:HuRef/36_3"/>
<o:chrom rdf:resource="urn:void:ncbi:chromosome:9606/chr1"/>
<o:start rdf:datatype="http://www.w3.org/2001/XMLSchema#int">66263806</o:start>
<o:end rdf:datatype="http://www.w3.org/2001/XMLSchema#int">66263807</o:end>
<o:orient>-</o:orient>
</o:Mapping>
</o:hasMapping>
<o:hasMapping>
<o:Mapping>
<o:build rdf:resource="urn:void:ncbi:build:reference/36_3"/>
<o:chrom rdf:resource="urn:void:ncbi:chromosome:9606/chr1"/>
<o:start rdf:datatype="http://www.w3.org/2001/XMLSchema#int">67926134</o:start>
<o:end rdf:datatype="http://www.w3.org/2001/XMLSchema#int">67926135</o:end>
<o:orient>+</o:orient>
</o:Mapping>
</o:hasMapping>
</o:SNP>


<o:SNP rdf:about="http://www.ncbi.nlm.nih.gov/snp/844">
<dc:title>rs844</dc:title>
<o:taxon rdf:resource="http://www.ncbi.nlm.nih.gov/taxonomy/9606"/>
<o:het rdf:datatype="http://www.w3.org/2001/XMLSchema#float">0.42</o:het>
<o:hasHandle rdf:resource="urn:void:ncbi:snp:handle:WIAF"/>
<o:hasHandle rdf:resource="urn:void:ncbi:snp:handle:LEE"/>
<o:hasHandle rdf:resource="urn:void:ncbi:snp:handle:HGBASE"/>
<o:hasHandle rdf:resource="urn:void:ncbi:snp:handle:SC_JCM"/>
<o:hasHandle rdf:resource="urn:void:ncbi:snp:handle:TSC-CSHL"/>
<o:hasHandle rdf:resource="urn:void:ncbi:snp:handle:LEE"/>
<o:hasHandle rdf:resource="urn:void:ncbi:snp:handle:YUSUKE"/>
<o:hasHandle rdf:resource="urn:void:ncbi:snp:handle:CGAP-GAI"/>
<o:hasHandle rdf:resource="urn:void:ncbi:snp:handle:CSHL-HAPMAP"/>
<o:hasHandle rdf:resource="urn:void:ncbi:snp:handle:PERLEGEN"/>
<o:hasHandle rdf:resource="urn:void:ncbi:snp:handle:ABI"/>
<o:hasHandle rdf:resource="urn:void:ncbi:snp:handle:SI_EXO"/>
<o:hasHandle rdf:resource="urn:void:ncbi:snp:handle:BCMHGSC_JDW"/>
<o:hasHandle rdf:resource="urn:void:ncbi:snp:handle:HUMANGENOME_JCVI"/>
<o:hasHandle rdf:resource="urn:void:ncbi:snp:handle:SNP500CANCER"/>
<o:hasHandle rdf:resource="urn:void:ncbi:snp:handle:1000GENOMES"/>
<o:hasHandle rdf:resource="urn:void:ncbi:snp:handle:ILLUMINA-UK"/>
<o:hasMapping>
<o:Mapping>
<o:build rdf:resource="urn:void:ncbi:build:Celera/36_3"/>
<o:chrom rdf:resource="urn:void:ncbi:chromosome:9606/chr1"/>
<o:start rdf:datatype="http://www.w3.org/2001/XMLSchema#int">134750981</o:start>
<o:end rdf:datatype="http://www.w3.org/2001/XMLSchema#int">134750982</o:end>
<o:orient>+</o:orient>
</o:Mapping>
</o:hasMapping>
<o:hasMapping>
<o:Mapping>
<o:build rdf:resource="urn:void:ncbi:build:HuRef/36_3"/>
<o:chrom rdf:resource="urn:void:ncbi:chromosome:9606/chr1"/>
<o:start rdf:datatype="http://www.w3.org/2001/XMLSchema#int">132892081</o:start>
<o:end rdf:datatype="http://www.w3.org/2001/XMLSchema#int">132892082</o:end>
<o:orient>-</o:orient>
(...)


The java archive for xsltstream is available at http://lindenb.googlecode.com/files/xsltstream.jar

Usage:
  -x <xslt-stylesheet file/url> (required)
  -p <param-name> <param-value> (add a parameter to the XSLT engine)
  -d <depth> (0-based; default: -1)
  -q <qName> target element (default: null)
  <file> | stdin
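
The core trick behind xsltstream is simple enough to sketch in plain JAXP. The sketch below is mine, not the actual xsltstream source (the class name StreamChunks and the helper transformString are hypothetical): a SAX handler builds a small, throwaway DOM only while inside a target element, hands it to a Transformer when the element closes, and then drops it, so memory is bounded by the size of one record rather than the whole file. An identity transform stands in for a real stylesheet such as dbsnp2rdf.xsl.

```java
import java.io.ByteArrayInputStream;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/* Sketch of the xsltstream idea: scan a huge file with SAX, materialize
 * a DOM only for each target element, transform it, then throw it away. */
public class StreamChunks extends DefaultHandler {
    private final String targetName;
    private final Transformer transformer;
    private final StringWriter out = new StringWriter();
    private Document doc;   // non-null only while inside a target element
    private Node current;

    StreamChunks(String targetName, Transformer transformer) {
        this.targetName = targetName;
        this.transformer = transformer;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        if (doc == null && qName.equals(targetName)) {
            try { // open a fresh, small DOM for this record
                doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            } catch (Exception e) { throw new SAXException(e); }
            current = doc;
        }
        if (doc != null) {
            Element e = doc.createElement(qName);
            for (int i = 0; i < atts.getLength(); i++)
                e.setAttribute(atts.getQName(i), atts.getValue(i));
            current.appendChild(e);
            current = e;
        }
    }

    @Override
    public void characters(char[] ch, int start, int len) {
        if (doc != null) current.appendChild(doc.createTextNode(new String(ch, start, len)));
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        if (doc == null) return;
        current = current.getParentNode();
        if (current == doc) { // target element complete: transform, then dispose
            try {
                transformer.transform(new DOMSource(doc), new StreamResult(out));
            } catch (Exception e) { throw new SAXException(e); }
            doc = null;
            current = null;
        }
    }

    /* Apply the identity transform to every <targetName> element of an
     * in-memory XML string, concatenating the results. */
    public static String transformString(String xml, String targetName) throws Exception {
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty("omit-xml-declaration", "yes");
        StreamChunks handler = new StreamChunks(targetName, identity);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")), handler);
        return handler.out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(transformString(
            "<ExchangeSet><Rs rsId=\"830\"/><junk/><Rs rsId=\"844\"/></ExchangeSet>", "Rs"));
    }
}
```

Only the elements between the start and end of each <Rs> are ever kept in memory; everything else, like the <junk/> element above, streams past.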



That's it!
Pierre

3 comments:

Michael Kay said...

There's a lot of capability for streamed processing in Saxon-EE these days, which goes well beyond this - though I agree there are many cases which can be solved using this kind of approach.

Pierre Lindenbaum said...

Thanks Michael, it is nice to know.

Reg Whitton said...

Hi Pierre,

Digging through your code, I notice that it interprets the SAX events to build the DOM. I would guess that you have done this because there is no obvious event mechanism available while a DOM is built normally. However, there is one: see Document Object Model Load and Save.

Below is my rough code to transform nodes in a stream.

Reg


import java.io.BufferedReader;
import java.io.FileReader;
import java.io.InputStream;
import java.io.OutputStream;
import javax.xml.transform.Result;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSInput;
import org.w3c.dom.ls.LSParser;
import org.w3c.dom.ls.LSParserFilter;
import org.w3c.dom.traversal.NodeFilter;

public class Transform {

    public static void main(String[] args) throws Exception {

        final String targetNodeName = args[0];
        String xsltFilename = args[1];
        InputStream inputStream = System.in;
        OutputStream outputStream = System.out;

        /*
         * Get hold of the factory for Load and Save objects. See
         * http://www.w3.org/TR/DOM-Level-3-LS/load-save.html
         */
        DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
        DOMImplementationLS domImpl =
                (DOMImplementationLS) registry.getDOMImplementation("XML 1.0 LS 3.0");

        /*
         * Read the XSLT transformation. Each thread would need to create its
         * own copy of the Transformer.
         */
        TransformerFactory transformerFactory = TransformerFactory.newInstance();
        Templates templates = transformerFactory.newTemplates(new StreamSource(
                new BufferedReader(new FileReader(xsltFilename))));
        final Transformer transformer = templates.newTransformer();
        transformer.setOutputProperty("omit-xml-declaration", "yes");

        /*
         * Create a DOM filter that rejects all nodes outside the target node,
         * and even rejects the target node itself once it has been transformed.
         */
        final Result result = new StreamResult(outputStream);
        LSParserFilter filter = new LSParserFilter() {

            boolean withinTargetNode = false;
            int indexInStream = 0;

            public short acceptNode(Node node) {

                if (Node.ELEMENT_NODE == node.getNodeType()
                        && targetNodeName.equals(node.getNodeName())) {
                    /*
                     * We have a completed target node; transform it to the
                     * output stream.
                     */
                    indexInStream++;
                    transformer.setParameter("nodeIndex", indexInStream);
                    DOMSource domSource = new DOMSource(node);
                    try {
                        transformer.transform(domSource, result);
                    } catch (Exception ex) {
                        System.err.println(ex);
                        return FILTER_INTERRUPT;
                    }
                    withinTargetNode = false;
                    return FILTER_REJECT;
                }
                return withinTargetNode ? FILTER_ACCEPT : FILTER_REJECT;
            }

            public int getWhatToShow() {
                return NodeFilter.SHOW_ALL;
            }

            public short startElement(Element element) {
                if (targetNodeName.equals(element.getNodeName())) {
                    withinTargetNode = true;
                }
                return FILTER_ACCEPT;
            }
        };

        /*
         * Create a parser that uses the filter, and parse the input stream.
         * We ignore the resulting document, which contains only the root node.
         */
        LSParser parser = domImpl.createLSParser(
                DOMImplementationLS.MODE_SYNCHRONOUS, null);
        parser.setFilter(filter);
        LSInput input = domImpl.createLSInput();
        input.setByteStream(inputStream);
        parser.parse(input);
    }
}