07 March 2009

A lightweight Java parser for RDF

About a year ago, I wrote a lightweight Java parser for RDF based on the Streaming API for XML (StAX). It is far from perfect: for example, it does not handle reified statements or xml:base. But it is small (24K) and works fine with most RDF files. Inspired by the XML SAX parsers, this RDF parser doesn't keep the statements in memory; instead it calls a method "found" each time a triple is found. This method can be overridden to run your own code.

Source code


The code is available at

RDFEvent


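First we need a small internal class to record the content of each triple:
private static class RDFEvent
{
    URI subject=null;
    URI predicate=null;
    Object value=null;   // a URI for a resource, a String for a literal
    URI valueType=null;  // datatype of a literal
    String lang=null;
    int listIndex=-1;    // position in a collection, -1 otherwise
    (...)
}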

Searching for rdf:RDF


First we scan the elements of the document until the <rdf:RDF> element is found. Then, the method parseRDF is called.
this.parser = this.factory.createXMLEventReader(in);

while(getReader().hasNext())
{
    XMLEvent event = getReader().nextEvent();
    if(event.isStartElement())
    {
        StartElement start=(StartElement)event;
        if(name2string(start).equals(RDF.NS+"RDF"))
        {
            parseRDF();
        }
    }
}
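
As a self-contained illustration of this scan, the sketch below looks for rdf:RDF in a string. The class FindRdfRoot and the method containsRdfRoot are made up for this example, and this name2string is only a guess at what the original helper does (namespace URI followed by local name); it takes a QName rather than a StartElement.

```java
import java.io.StringReader;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

final class FindRdfRoot {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    /** Hypothetical stand-in for name2string: namespace URI followed by local name. */
    static String name2string(QName name) {
        return name.getNamespaceURI() + name.getLocalPart();
    }

    /** Returns true if an rdf:RDF start element occurs anywhere in the document. */
    static boolean containsRdfRoot(String xml) {
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            factory.setProperty(XMLInputFactory.IS_NAMESPACE_AWARE, Boolean.TRUE);
            XMLEventReader reader = factory.createXMLEventReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    XMLEvent event = reader.nextEvent();
                    if (event.isStartElement()
                            && name2string(event.asStartElement().getName())
                                    .equals(RDF_NS + "RDF")) {
                        return true;
                    }
                }
                return false;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Wrapped so that callers do not need a throws clause in this sketch.
            throw new IllegalStateException(e);
        }
    }
}
```

Comparing the full namespace URI rather than the "rdf" prefix means the element is found whatever prefix the document uses.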

parseRDF: Searching the statements


All the child nodes of rdf:RDF are then scanned. The method parseDescription is called for each element. Processing instructions and non-blank text are rejected.
while(getReader().hasNext())
{
    XMLEvent event = getReader().nextEvent();
    if(event.isEndElement())
    {
        return;
    }
    else if(event.isStartElement())
    {
        parseDescription(event.asStartElement());
    }
    else if(event.isProcessingInstruction())
    {
        throw new XMLStreamException("Found Processing Instruction in RDF ???");
    }
    else if(event.isCharacters() &&
            event.asCharacters().getData().trim().length()>0)
    {
        throw new XMLStreamException("Found text in RDF ???");
    }
}

parseDescription: Parsing the subject of a triple


The current element is the subject of the triple, so its URI needs to be extracted.
First we check whether this URI can be read from an rdf:about attribute:
Attribute att= description.getAttributeByName(new QName(RDF.NS,"about"));
if(att!=null) descriptionURI= createURI( att.getValue());

If it was not found, the attribute rdf:nodeID is searched:
att= description.getAttributeByName(new QName(RDF.NS,"nodeID"));
if(att!=null) descriptionURI= createURI( att.getValue());

If it was not found, the attribute rdf:ID is searched.
att= description.getAttributeByName(new QName(RDF.NS,"ID"));
if(att!=null) descriptionURI= resolveBase(att.getValue());

If it was not found, this is an anonymous node. We create a random URI.
descriptionURI= createAnonymousURI();
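
Put together, the four steps above form a fallback chain. The sketch below is one way to write it as a single method; it is not the original code. SubjectSketch, resolveSubject and subjectOfFirstElement are hypothetical names, subjects are returned as plain strings, the "_:" prefix for blank nodes and the naive base + "#" + ID resolution are assumptions of this example.

```java
import java.io.StringReader;
import java.util.UUID;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

final class SubjectSketch {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    /** Tries rdf:about, rdf:nodeID and rdf:ID in turn, then falls back to a fresh blank node. */
    static String resolveSubject(StartElement description, String base) {
        Attribute att = description.getAttributeByName(new QName(RDF_NS, "about"));
        if (att != null) return att.getValue();
        att = description.getAttributeByName(new QName(RDF_NS, "nodeID"));
        if (att != null) return "_:" + att.getValue();       // assumption: "_:" marks blank nodes
        att = description.getAttributeByName(new QName(RDF_NS, "ID"));
        if (att != null) return base + "#" + att.getValue(); // assumption: naive base resolution
        return "_:" + UUID.randomUUID();                     // anonymous node
    }

    /** Convenience for trying it out: resolves the subject of the first element in xml. */
    static String subjectOfFirstElement(String xml, String base) {
        try {
            XMLEventReader reader = XMLInputFactory.newInstance()
                    .createXMLEventReader(new StringReader(xml));
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()) {
                    return resolveSubject(event.asStartElement(), base);
                }
            }
            throw new IllegalArgumentException("no element found");
        } catch (XMLStreamException e) {
            // Wrapped so that callers do not need a throws clause in this sketch.
            throw new IllegalStateException(e);
        }
    }
}
```

Returning early from each branch makes the priority order explicit: a later attribute is only consulted when the earlier ones are absent.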


rdf:type


Unless the element is a plain rdf:Description, its qualified name gives the rdf:type of the subject, and we can emit a new triple for this type:
QName qn=description.getName();
if(!(qn.getNamespaceURI().equals(RDF.NS) &&
     qn.getLocalPart().equals("Description")))
{
    RDFEvent evt= new RDFEvent();
    evt.subject=descriptionURI;
    evt.predicate= createURI(RDF.NS+"type");
    evt.value=name2uri(qn);
    found(evt);
}


Other attributes


The other attributes of the current element may hold new triples: each one is a property of the subject with a literal value.
for(Iterator<?> i=description.getAttributes();
    i.hasNext();)
{
    att=(Attribute)i.next();
    qn= att.getName();
    String local= qn.getLocalPart();
    if(qn.getNamespaceURI().equals(RDF.NS) &&
       ( local.equals("about") ||
         local.equals("ID") ||
         local.equals("nodeID")))
    {
        continue;
    }
    RDFEvent evt= new RDFEvent();
    evt.subject=descriptionURI;
    evt.predicate= name2uri(qn);
    evt.value= att.getValue();
    found(evt);
}
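
To see what this loop produces on a real element, here is a small stand-alone sketch. PropertyAttributeSketch and its two methods are invented for this example, and each triple is reduced to a "predicate = value" string instead of an RDFEvent.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

final class PropertyAttributeSketch {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    /** The rdf: attributes that name the subject and therefore are not properties. */
    static final List<String> SUBJECT_ATTRIBUTES = Arrays.asList("about", "ID", "nodeID");

    /** Returns one "predicate = value" string per property attribute. */
    static List<String> propertyAttributes(StartElement description) {
        List<String> triples = new ArrayList<String>();
        for (Iterator<?> i = description.getAttributes(); i.hasNext();) {
            Attribute att = (Attribute) i.next();
            QName qn = att.getName();
            if (RDF_NS.equals(qn.getNamespaceURI())
                    && SUBJECT_ATTRIBUTES.contains(qn.getLocalPart())) {
                continue; // already used to build the subject
            }
            triples.add(qn.getNamespaceURI() + qn.getLocalPart() + " = " + att.getValue());
        }
        return triples;
    }

    /** Convenience for trying it out on the first element of a string. */
    static List<String> propertyAttributesOfFirstElement(String xml) {
        try {
            XMLEventReader reader = XMLInputFactory.newInstance()
                    .createXMLEventReader(new StringReader(xml));
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()) {
                    return propertyAttributes(event.asStartElement());
                }
            }
            throw new IllegalArgumentException("no element found");
        } catch (XMLStreamException e) {
            // Wrapped so that callers do not need a throws clause in this sketch.
            throw new IllegalStateException(e);
        }
    }
}
```

With a namespace-aware reader the xmlns declarations are not returned by getAttributes(), so they need no special case. The order of the attributes is up to the StAX implementation.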

Searching the predicates


We then loop over the children of the current element. Those nodes are the predicates of the current subject. The method parsePredicate is called each time a new element is found.
while(getReader().hasNext())
{
    XMLEvent event = getReader().nextEvent();
    if(event.isEndElement())
    {
        return descriptionURI;
    }
    else if(event.isStartElement())
    {
        parsePredicate(descriptionURI,event.asStartElement());
    }
    else if(event.isProcessingInstruction())
    {
        throw new XMLStreamException("Found Processing Instruction in RDF ???");
    }
    else if(event.isCharacters() &&
            event.asCharacters().getData().trim().length()>0)
    {
        throw new XMLStreamException("Found text in RDF ??? \""+
            event.asCharacters().getData()+"\""
            );
    }
}

parsePredicate: Parsing the predicate of the current triple


First the property attributes of the current element are scanned, and some new triples may be created. For example:
<rdf:Description ex:fullName="Dave Beckett">
<ex:homePage rdf:resource="http://purl.org/net/dajobe/"/>
</rdf:Description>

During this process, the value of the attribute rdf:parseType is noted if it was found.
Furthermore, if there is an rdf:resource attribute, this element yields a new triple whose object is that resource.
<ex:homePage rdf:resource="http://purl.org/net/dajobe/"/>

If rdf:parseType="Literal" then we transform the children of the current node into a string, and a new triple is created.
if(parseType.equals("Literal"))
{
    StringBuilder b= parseLiteral();
    RDFEvent evt= new RDFEvent();
    evt.subject=descriptionURI;
    evt.predicate= predicateURI;
    evt.value= b.toString();
    evt.lang=lang;
    evt.valueType=datatype;
    found(evt);
}
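
The method parseLiteral is not shown in this post. As a rough idea of what such a method could look like, the sketch below copies every event to a string until it meets the end tag that closes the predicate element, using a depth counter to skip the end tags of nested elements. LiteralSketch and literalOfFirstChild are hypothetical names, the reader is passed as an argument, and the exact text written by writeAsEncodedUnicode depends on the StAX implementation.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

final class LiteralSketch {
    /**
     * Serializes every event up to the end tag that closes the current element.
     * The reader must be positioned just after the start tag of the predicate.
     */
    static String parseLiteral(XMLEventReader reader) throws XMLStreamException {
        StringWriter out = new StringWriter();
        int depth = 0; // number of nested elements currently open
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                depth++;
            } else if (event.isEndElement()) {
                if (depth == 0) break; // end tag of the predicate itself
                depth--;
            }
            event.writeAsEncodedUnicode(out);
        }
        return out.toString();
    }

    /** Convenience for trying it out: the literal content of the first child of the root. */
    static String literalOfFirstChild(String xml) {
        try {
            XMLEventReader reader = XMLInputFactory.newInstance()
                    .createXMLEventReader(new StringReader(xml));
            int starts = 0;
            while (reader.hasNext()) {
                if (reader.nextEvent().isStartElement() && ++starts == 2) {
                    return parseLiteral(reader);
                }
            }
            throw new IllegalArgumentException("expected a root element with one child");
        } catch (XMLStreamException e) {
            // Wrapped so that callers do not need a throws clause in this sketch.
            throw new IllegalStateException(e);
        }
    }
}
```

Stopping at the first end tag seen at depth 0 leaves the reader just after the predicate, so the loop over the remaining predicates can carry on from there.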

If rdf:parseType="Resource", then the current node is a blank node: the rdf:Description is omitted. A blank node is created and we call parsePredicate recursively, using this blank node as the new subject.
URI blanck = createAnonymousURI();
RDFEvent evt= new RDFEvent();
evt.subject=descriptionURI;
evt.predicate= predicateURI;
evt.value=blanck;
evt.lang=lang;
evt.valueType=datatype;
found(evt);

while(getReader().hasNext())
{
    XMLEvent event = getReader().nextEvent();
    if(event.isStartElement())
    {
        parsePredicate(blanck, event.asStartElement());
    }
    else if(event.isEndElement())
    {
        return;
    }
}

If rdf:parseType="Collection", the child elements are the nodes of the collection. We call parseDescription recursively for each of these nodes and record its position in listIndex.
int index=0;
while(getReader().hasNext())
{
    XMLEvent event = getReader().nextEvent();
    if(event.isStartElement())
    {
        URI value= parseDescription(event.asStartElement());
        RDFEvent evt= new RDFEvent();
        evt.subject=descriptionURI;
        evt.predicate= predicateURI;
        evt.value=value;
        evt.lang=lang;
        evt.valueType=datatype;
        evt.listIndex=(++index);

        found(evt);
    }
    else if(event.isEndElement())
    {
        return;
    }
}
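
Each member of a collection is thus reported as an ordinary triple carrying a listIndex that starts at 1. A handler that wants the ordered list back could regroup those triples by subject and predicate, as in the sketch below. This is only an illustration of how listIndex might be used: CollectionSketch and its methods are hypothetical, and subjects, predicates and values are simplified to strings.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

final class CollectionSketch {
    /** Members of each collection, keyed by "subject predicate" and sorted by index. */
    private final Map<String, SortedMap<Integer, String>> lists =
            new HashMap<String, SortedMap<Integer, String>>();

    /** Simplified callback: records the triple only if it belongs to a collection. */
    public void found(String subject, String predicate, String value, int listIndex) {
        if (listIndex < 0) return; // -1 marks a triple outside any collection
        String key = subject + " " + predicate;
        SortedMap<Integer, String> members = lists.get(key);
        if (members == null) {
            members = new TreeMap<Integer, String>();
            lists.put(key, members);
        }
        members.put(listIndex, value);
    }

    /** Returns the members of the collection in list order, or an empty list. */
    public List<String> members(String subject, String predicate) {
        SortedMap<Integer, String> members = lists.get(subject + " " + predicate);
        return members == null
                ? new ArrayList<String>()
                : new ArrayList<String>(members.values());
    }
}
```

A TreeMap keeps the members sorted by index, so the order survives even if the triples reach the handler out of order.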

Otherwise this is the default rdf:parseType.
If a new element is found, it is the subject of a new resource and we call parseDescription recursively. Otherwise the object of the current statement is a literal, and we concatenate all the text.
StringBuilder b= new StringBuilder();
while(getReader().hasNext())
{
    XMLEvent event = getReader().nextEvent();
    if(event.isStartElement())
    {
        URI childURI=parseDescription(event.asStartElement());
        RDFEvent evt= new RDFEvent();
        evt.subject=descriptionURI;
        evt.predicate= predicateURI;
        evt.value= childURI;
        found(evt);
        b.setLength(0);
        foundResourceAsChild=true;
    }
    else if(event.isCharacters())
    {
        b.append(event.asCharacters().getData());
    }
    else if(event.isEndElement())
    {
        if(!foundResourceAsChild)
        {
            RDFEvent evt= new RDFEvent();
            evt.subject=descriptionURI;
            evt.predicate= predicateURI;
            evt.value= b.toString();
            evt.lang=lang;
            evt.valueType=datatype;
            found(evt);
        }
        else
        {
            if(b.toString().trim().length()!=0) throw new XMLStreamException("Found bad text "+b);
        }
        return;
    }
}

Testing


The following code parses go.rdf.gz (1744 KB) and prints the number of statements.
// 'count' must be a field of the enclosing class (for example a static long):
// an anonymous class cannot modify a local variable.
long now= System.currentTimeMillis();
URL url= new URL("ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf/go.rdf.gz");
InputStream r= new GZIPInputStream(url.openStream());
RDFHandler h= new RDFHandler()
{
    @Override
    public void found(URI subject, URI predicate, Object value,
            URI dataType, String lang, int index)
            throws IOException {
        ++count;
    }
};
h.parse(r);
r.close();
System.out.println("time:"+((System.currentTimeMillis()-now)/1000)+" secs count:"+count+" triples");

Result:
time:17 secs count:188391 triples



That's it.

Pierre
