03 February 2010

Using a FASTA file as a source of RDF statements for SPARQL.


In this post, I'll show how a Fasta file can be used as a source of RDF statements for the Jena API.
The DNA sequences in the Fasta file are used by Jena without any prior transformation: the file itself is exposed as a Graph by implementing com.hp.hpl.jena.graph.Graph.

Here, my example uses a Fasta file, but it could have been any kind of input: an SQL database, an XML file, a GFF file, etc.

How it works


com.hp.hpl.jena.graph.Graph is the interface to be satisfied by implementations maintaining collections of RDF triples. The core interface is small (add, delete, find, contains) and is augmented by additional classes to handle more complicated matters such as reification, query handling, bulk update, event management, and transaction handling.

My implementation of this interface extends com.hp.hpl.jena.graph.impl.GraphBase: it reads a Fasta file and creates a set of RDF triples for each sequence. All we need is (love and) an implementation of the abstract function ExtendedIterator<Triple> graphBaseFind(TripleMatch matcher):
public class FastaModel
extends GraphBase
{
protected ExtendedIterator<Triple> graphBaseFind(TripleMatch matcher)
{
//the function to be implemented....
}
}
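
To make the contract of a 'find' more concrete, here is a minimal sketch in plain Java, without any Jena dependency: given a pattern where some positions are wildcards, return every triple that matches. The class name PatternSketch, the String[3] representation of a triple and the use of null as the wildcard are assumptions made for this illustration only; Jena itself works with Node objects.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: a triple is a String[3] {subject, predicate, object}; null in a pattern means "any". */
class PatternSketch {
    /** true if every non-null position of the pattern equals the value found in the triple */
    static boolean matches(String[] pattern, String[] triple) {
        for (int i = 0; i < 3; i++) {
            if (pattern[i] != null && !pattern[i].equals(triple[i])) return false;
        }
        return true;
    }

    /** the essence of a 'find': keep only the triples accepted by the pattern */
    static List<String[]> find(List<String[]> triples, String[] pattern) {
        List<String[]> hits = new ArrayList<String[]>();
        for (String[] t : triples) {
            if (matches(pattern, t)) hits.add(t);
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String[]> triples = new ArrayList<String[]>();
        triples.add(new String[]{"seq:1", "rdf:type", "Sequence"});
        triples.add(new String[]{"seq:1", "length", "303"});
        triples.add(new String[]{"seq:2", "length", "304"});
        // all the 'length' statements, whatever the subject and the object
        for (String[] t : find(triples, new String[]{null, "length", null})) {
            System.out.println(t[0] + " " + t[1] + " " + t[2]);
        }
    }
}
```

The Fasta-backed graph does the same thing, except that the triples are generated on the fly while reading the file instead of being stored in a list.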

The FastaSequence


A simple container for a name and a sequence.
/** a simple fasta sequence */
private static class FastaSequence
{
StringBuilder name=new StringBuilder();
StringBuilder sequence=new StringBuilder();
}

Reading the next Fasta Sequence

... Not Rocket Science...
/** the file reader */
private PushbackReader reader;
(...)
reader=new PushbackReader(new FileReader(fastaFile));
(...)
private FastaSequence readNext() throws IOException
{
if(this.reader==null) return null;
int c;
FastaSequence seq=null;
while((c=this.reader.read())!=-1)
{
if(c=='>')
{
if(seq!=null)
{
this.reader.unread(c);
return seq;
}
seq=new FastaSequence();
while((c=this.reader.read())!=-1)
{
if(c=='\n') break;
seq.name.append((char)c);
}
}
else if(seq!=null && Character.isLetter(c))
{
seq.sequence.append((char)c);
}
}
this.close();//close the FileReader
return seq;
}
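
For those who want to try the reading logic on its own, here is a self-contained sketch of the same idea working on an in-memory string. The names FastaReaderSketch and parse are hypothetical, a record is simplified to a String[2], and, unlike the code above, carriage returns are stripped from the header; treat it as an illustration rather than the code of the post.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Illustration only: a record is a String[2] holding {name, sequence}. */
class FastaReaderSketch {
    /** reads one record, or returns null when the input is exhausted */
    static String[] readNext(PushbackReader reader) throws IOException {
        StringBuilder name = null;
        StringBuilder sequence = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {
            if (c == '>') {
                if (name != null) {
                    // this '>' opens the following record: push it back and stop here
                    reader.unread(c);
                    break;
                }
                name = new StringBuilder();
                // the header runs up to the end of the line
                while ((c = reader.read()) != -1 && c != '\n') {
                    if (c != '\r') name.append((char) c);
                }
            } else if (name != null && Character.isLetter(c)) {
                // keep the letters only, which skips line breaks and spaces
                sequence.append((char) c);
            }
        }
        if (name == null) return null;
        return new String[]{name.toString(), sequence.toString()};
    }

    /** parses a whole Fasta text held in memory */
    static List<String[]> parse(String fasta) {
        List<String[]> records = new ArrayList<String[]>();
        try (PushbackReader reader = new PushbackReader(new StringReader(fasta))) {
            String[] record;
            while ((record = readNext(reader)) != null) records.add(record);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    public static void main(String[] args) {
        for (String[] r : parse(">seq1 first\nACGT\nAC\n>seq2\nTTGA\n")) {
            System.out.println(r[0] + " : " + r[1]);
        }
    }
}
```

The PushbackReader is what makes this simple: the '>' that starts the next record is pushed back, so the following call finds it again.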

Implementing the Iterator of Triples


My FastaIterator extends com.hp.hpl.jena.util.iterator.NiceIterator<Triple>, a class implementing the ExtendedIterator returned by the Graph function ExtendedIterator<Triple> graphBaseFind(TripleMatch matcher). The class contains three fields:
  • a reader on the Fasta file (a PushbackReader wrapping a FileReader)
  • a com.hp.hpl.jena.graph.Triple that is used as a filter
  • a stack/queue of RDF Triples
The constructor for 'FastaIterator' opens the stream:
FastaIterator(TripleMatch matcher)
{
this.filter=matcher.asTriple();
try
{
this.reader=new PushbackReader(new FileReader(FastaModel.this.fastaFile));
}
catch (IOException e)
{
throw new JenaException(e);
}
}
The method 'close' just closes the input stream:
@Override
public void close()
{
try
{
if(this.reader!=null) reader.close();
}
catch (IOException e)
{
throw new JenaException(e);
}
finally
{
this.reader=null;
super.close();
}
}
The method 'next()' checks whether there is something in the RDF queue; if so, an RDF triple is removed and returned:
@Override
public Triple next()
{
if(this.triples_queue.isEmpty()) hasNext();
if(this.triples_queue.isEmpty()) throw new IllegalStateException();
return this.triples_queue.pop();
}
The method 'hasNext()' returns true if the queue of RDF triples is not empty. Otherwise, it gets the next Fasta sequence from the input stream and transforms it into a set of RDF triples, which are added to the queue if they match this.filter. That is to say, the following Fasta sequence...:


>gi|227935373|gb|FJ425127.1| Rotavirus G8 isolate 6862/2000/ARN NSP3 gene, partial cds
GGCCACTTCAACATTAGAATTAATGGGTATTCAATATGATTACAATGAAGTATTTACCAGAGTTAAAAGT
AAATTTGATTATGTGATGGATGACTCTGGTGTTAAAAACAATCTTTTGGGTAAAGCTATAACTATTGATC
AGGCATTAAATGGAAAGTTTGGCTCAGCTATTAGAAATAGAAATTGGATGACTGATTCTAAAACGGTTGC
TAAATTAGATGAAGACGTGAATAAACTTAGAATGACATTATCTTCTAAAGGAATCGACCAAAAGATGAGA
GTACTTAATGCTTGTTTAGTGTA

... will generate these four RDF statements:
<http://www.ncbi.nlm.nih.gov/nuccore/227935373> <urn:lindenb:ontology:length> "303"^^<http://www.w3.org/2001/XMLSchema#int>
<http://www.ncbi.nlm.nih.gov/nuccore/227935373> <urn:lindenb:ontology:sequence> "GGCCACTTCAACATTAGAATTAATGGGTATTCAATATGATTACAATGAAGTATTTACCAGAGTTAAAAGTAAATTTGATTATGTGATGGATGACTCTGGTGTTAAAAACAATCTTTTGGGTAAAGCTATAACTATTGATCAGGCATTAAATGGAAAGTTTGGCTCAGCTATTAGAAATAGAAATTGGATGACTGATTCTAAAACGGTTGCTAAATTAGATGAAGACGTGAATAAACTTAGAATGACATTATCTTCTAAAGGAATCGACCAAAAGATGAGAGTACTTAATGCTTGTTTAGTGTA"
<http://www.ncbi.nlm.nih.gov/nuccore/227935373> <http://purl.org/dc/elements/1.1/title> "gi|227935373|gb|FJ425127.1| Rotavirus G8 isolate 6862/2000/ARN NSP3 gene, partial cds"
<http://www.ncbi.nlm.nih.gov/nuccore/227935373> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <urn:lindenb:ontology:Sequence>
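
The mapping from one Fasta record to the four statements can be sketched without Jena, as plain string manipulation. The class and method names below (HeaderToTriplesSketch, subjectFor, statementsFor) are hypothetical, and the literals are not escaped, so take it as a rough illustration of the mapping only:

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: builds the four statements of one record as plain text. */
class HeaderToTriplesSketch {
    static final String NUCCORE = "http://www.ncbi.nlm.nih.gov/nuccore/";
    static final String ONTOLOGY = "urn:lindenb:ontology:";

    /** subject URI built from the gi number, or null if the header does not start with "gi|...|" */
    static String subjectFor(String header) {
        if (!header.startsWith("gi|")) return null;
        int end = header.indexOf('|', 3);
        if (end == -1) return null;
        return NUCCORE + header.substring(3, end);
    }

    /** the statements for one record; the list is empty if the record is skipped */
    static List<String> statementsFor(String header, String sequence) {
        List<String> out = new ArrayList<String>();
        String uri = subjectFor(header);
        if (uri == null) return out;
        String s = "<" + uri + "> ";
        out.add(s + "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <" + ONTOLOGY + "Sequence>");
        out.add(s + "<http://purl.org/dc/elements/1.1/title> \"" + header + "\"");
        out.add(s + "<" + ONTOLOGY + "sequence> \"" + sequence + "\"");
        out.add(s + "<" + ONTOLOGY + "length> \"" + sequence.length()
                + "\"^^<http://www.w3.org/2001/XMLSchema#int>");
        return out;
    }

    public static void main(String[] args) {
        String header = "gi|227935373|gb|FJ425127.1| Rotavirus G8 isolate 6862/2000/ARN NSP3 gene, partial cds";
        for (String statement : statementsFor(header, "GGCCACTTCAACATTAG")) {
            System.out.println(statement);
        }
    }
}
```

Records whose header does not start with "gi|" produce no statement at all, which is also what the 'hasNext' function does.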
Here is the code for the 'hasNext' function:
@Override
public boolean hasNext()
{
if(!triples_queue.isEmpty()) return true;
if(this.reader==null) return false;
try
{
/* loop until the queue is not empty or the stream is closed */
while(this.triples_queue.isEmpty())
{
//try to get a new fasta sequence
FastaSequence seq=readNext();
if(seq==null) return false;

String name=seq.name.toString();
//check it is a genbank file with a gi
if(!name.startsWith("gi|"))
{
continue;
}
int i=name.indexOf('|',3);
if(i==-1) continue;
//create the subject
Node subject =Node.createURI("http://www.ncbi.nlm.nih.gov/nuccore/"+name.substring(3,i));

//make a triple for the rdf:type
Triple triple=new Triple(
subject,
RDF.type.asNode(),
Node.createURI("urn:lindenb:ontology:Sequence")
);
//append this triple to the queue if it is accepted by this.filter
if(this.filter.asTriple().matches(triple))
{
this.triples_queue.add(triple);
}

//make a triple for the dc:title
triple=new Triple(
subject,
DC.title.asNode(),
Node.createLiteral(name)
);

//append this triple to the queue if it is accepted by this.filter
if(this.filter.asTriple().matches(triple))
{
this.triples_queue.add(triple);
}

//make a triple for the DNA sequence
triple=new Triple(
subject,
Node.createURI("urn:lindenb:ontology:sequence"),
Node.createLiteral(seq.sequence.toString())
);

//append this triple to the queue if it is accepted by this.filter
if(this.filter.asTriple().matches(triple))
{
this.triples_queue.add(triple);
}

//make a triple for the size of this sequence
triple=new Triple(
subject,
Node.createURI("urn:lindenb:ontology:length"),
Node.createLiteral(String.valueOf(seq.sequence.length()),null,XSDDatatype.XSDint)
);

//append this triple to the queue if it is accepted by this.filter
if(this.filter.asTriple().matches(triple))
{
this.triples_queue.add(triple);
}

}
}
catch (IOException e)
{
close();
throw new JenaException(e);
}
return !triples_queue.isEmpty();
}
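
Stripped of the Jena and Fasta details, 'hasNext' and 'next' follow a common pattern: a small buffer is refilled from the source each time it runs empty, and the loop keeps reading as long as the filter rejects everything. Here is a minimal sketch of that pattern in plain Java. The class LazyTripleIterator, the String[] records and the predicate-only filter are assumptions for this illustration. It also uses a FIFO queue and throws NoSuchElementException when exhausted, which differs slightly from the code above.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Illustration only: expands {name, sequence} records into {subject, predicate, object} arrays. */
class LazyTripleIterator implements Iterator<String[]> {
    private final Iterator<String[]> records;  // each record is {name, sequence}
    private final String predicateFilter;      // null accepts every predicate
    private final Deque<String[]> queue = new ArrayDeque<String[]>();

    LazyTripleIterator(Iterator<String[]> records, String predicateFilter) {
        this.records = records;
        this.predicateFilter = predicateFilter;
    }

    /** queues the statement only if the filter accepts its predicate */
    private void offer(String s, String p, String o) {
        if (predicateFilter == null || predicateFilter.equals(p)) {
            queue.add(new String[]{s, p, o});
        }
    }

    @Override
    public boolean hasNext() {
        // refill until something passes the filter or the source is exhausted
        while (queue.isEmpty() && records.hasNext()) {
            String[] rec = records.next();
            offer(rec[0], "sequence", rec[1]);
            offer(rec[0], "length", String.valueOf(rec[1].length()));
        }
        return !queue.isEmpty();
    }

    @Override
    public String[] next() {
        if (!hasNext()) throw new NoSuchElementException();
        return queue.remove();
    }

    @Override
    public void remove() {
        throw new UnsupportedOperationException();
    }

    public static void main(String[] args) {
        List<String[]> recs = Arrays.asList(
            new String[]{"seq1", "ACGT"},
            new String[]{"seq2", "TT"});
        Iterator<String[]> it = new LazyTripleIterator(recs.iterator(), "length");
        while (it.hasNext()) {
            String[] t = it.next();
            System.out.println(t[0] + " " + t[1] + " " + t[2]);
        }
    }
}
```

Only one record is expanded at a time, so the memory used does not depend on the size of the file.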

Using the graph

Creating a new Jena RDF Model
Model m=ModelFactory.createModelForGraph(
new FastaModel(
new File("rotavirus.fa")
));

Looping over the RDF statements
Once created, this Model can be used like any regular Jena RDF Model, e.g.:
StmtIterator i=m.listStatements();
while(i.hasNext())
{
System.err.println(i.next());
}
Result
[http://www.ncbi.nlm.nih.gov/nuccore/227935373, urn:lindenb:ontology:length, "303"^^http://www.w3.org/2001/XMLSchema#int]
[http://www.ncbi.nlm.nih.gov/nuccore/227935373, urn:lindenb:ontology:sequence, "GGCCACTTCAACATTAGAATTAATGGGTATTCAATATGATTACAATGAAGTATTTACCAGAGTTAAAAGTAAATTTGATTATGTGATGGATGACTCTGGTGTTAAAAACAATCTTTTGGGTAAAGCTATAACTATTGATCAGGCATTAAATGGAAAGTTTGGCTCAGCTATTAGAAATAGAAATTGGATGACTGATTCTAAAACGGTTGCTAAATTAGATGAAGACGTGAATAAACTTAGAATGACATTATCTTCTAAAGGAATCGACCAAAAGATGAGAGTACTTAATGCTTGTTTAGTGTA"]
[http://www.ncbi.nlm.nih.gov/nuccore/227935373, http://purl.org/dc/elements/1.1/title, "gi|227935373|gb|FJ425127.1| Rotavirus G8 isolate 6862/2000/ARN NSP3 gene, partial cds"]
[http://www.ncbi.nlm.nih.gov/nuccore/227935373, http://www.w3.org/1999/02/22-rdf-syntax-ns#type, urn:lindenb:ontology:Sequence]
[http://www.ncbi.nlm.nih.gov/nuccore/227935371, urn:lindenb:ontology:length, "303"^^http://www.w3.org/2001/XMLSchema#int]
[http://www.ncbi.nlm.nih.gov/nuccore/227935371, urn:lindenb:ontology:sequence, "GGCCACTTCAACATTAGAATTAATGGGTATTCAATATGATTACAATGAAGTATTTACCAGAGTTAAAAGTAAATTTGATTATGTGATGGATGACTCTGGTGTTAAAAACAATCTTTTGGGTAAAGCTATAACTATTGATCAGGCATTAAATGGAAAGTTTGGCTCAGCTATTAGAAATAGAAATTGGATGACTGATTCTAAAACGGTTGCTAAATTAGATGAAGACGTGAATAAACTTAGAATGACATTATCTTCTAAAGGAATCGACCAAAAGATGAGAGTACTTAATGCTTGTTTAGTGTA"]
[http://www.ncbi.nlm.nih.gov/nuccore/227935371, http://purl.org/dc/elements/1.1/title, "gi|227935371|gb|FJ425126.1| Rotavirus G8 isolate 6854/2002/ARN NSP3 gene, partial cds"]
[http://www.ncbi.nlm.nih.gov/nuccore/227935371, http://www.w3.org/1999/02/22-rdf-syntax-ns#type, urn:lindenb:ontology:Sequence]
[http://www.ncbi.nlm.nih.gov/nuccore/227935369, urn:lindenb:ontology:length, "303"^^http://www.w3.org/2001/XMLSchema#int]
[http://www.ncbi.nlm.nih.gov/nuccore/227935369, urn:lindenb:ontology:sequence, "GGCCACTTCAACATTAGAATTAATGGGTATTCAATATGATTACAATGAAGTATTTACCAGAGTTAAAAGTAAATTTGATTATGTGATGGATGACTCTGGTGTTAAAAACAATCTTCTGGGTAAAGCTATAACTATTGATCAGGCATTAAATGGAAAGTTTGGCTCAGCTATTAGAAATAGAAATTGGATGACTGATTCTAAAACGGTTGCTAAATTAGATGAAGACGTGAATAAACTTAGAATGACATTATCTTCTAAAGGAATCGACCAAAAGATGAGAGTACTTAATGCTTGTTTAGTGTA"]
[http://www.ncbi.nlm.nih.gov/nuccore/227935369, http://purl.org/dc/elements/1.1/title, "gi|227935369|gb|FJ425125.1| Rotavirus G8 isolate 6810/2004/ARN NSP3 gene, partial cds"]
(...)

This model can also be used as a source of RDF by ARQ, the SPARQL engine for Jena (!). Here we create a new SPARQL query execution and list every pair of sequences in which the first one is shorter than the second one:
Query query=QueryFactory.create(
"SELECT ?Seq1 ?Len1 ?Seq2 ?Len2" +
"{" +
"?Seq1 a <urn:lindenb:ontology:Sequence> . " +
"?Seq1 <urn:lindenb:ontology:length> ?Len1 . " +
"?Seq2 a <urn:lindenb:ontology:Sequence> . " +
"?Seq2 <urn:lindenb:ontology:length> ?Len2 . " +
"FILTER (?Seq1!=?Seq2 && ?Len1 < ?Len2) "+

"}"
);
QueryExecution execution = QueryExecutionFactory.create(query, m);
ResultSet row=execution.execSelect();
while(row.hasNext())
{
QuerySolution solution=row.next();

for(Iterator<String> si=solution.varNames();si.hasNext();)
{
String name=si.next();
System.out.println(name+" : "+solution.get(name));
}
System.out.println();
}
Result:
Seq1 : http://www.ncbi.nlm.nih.gov/nuccore/227935373
Len1 : 303^^http://www.w3.org/2001/XMLSchema#int
Seq2 : http://www.ncbi.nlm.nih.gov/nuccore/227935361
Len2 : 304^^http://www.w3.org/2001/XMLSchema#int

Seq1 : http://www.ncbi.nlm.nih.gov/nuccore/227935373
Len1 : 303^^http://www.w3.org/2001/XMLSchema#int
Seq2 : http://www.ncbi.nlm.nih.gov/nuccore/227935359
Len2 : 304^^http://www.w3.org/2001/XMLSchema#int

Seq1 : http://www.ncbi.nlm.nih.gov/nuccore/227935373
Len1 : 303^^http://www.w3.org/2001/XMLSchema#int
Seq2 : http://www.ncbi.nlm.nih.gov/nuccore/215489730
Len2 : 305^^http://www.w3.org/2001/XMLSchema#int

Seq1 : http://www.ncbi.nlm.nih.gov/nuccore/227935371
Len1 : 303^^http://www.w3.org/2001/XMLSchema#int
Seq2 : http://www.ncbi.nlm.nih.gov/nuccore/227935361
Len2 : 304^^http://www.w3.org/2001/XMLSchema#int

Seq1 : http://www.ncbi.nlm.nih.gov/nuccore/227935371
Len1 : 303^^http://www.w3.org/2001/XMLSchema#int
Seq2 : http://www.ncbi.nlm.nih.gov/nuccore/227935359
Len2 : 304^^http://www.w3.org/2001/XMLSchema#int

Seq1 : http://www.ncbi.nlm.nih.gov/nuccore/227935371
Len1 : 303^^http://www.w3.org/2001/XMLSchema#int
Seq2 : http://www.ncbi.nlm.nih.gov/nuccore/215489730
Len2 : 305^^http://www.w3.org/2001/XMLSchema#int

Seq1 : http://www.ncbi.nlm.nih.gov/nuccore/227935369
Len1 : 303^^http://www.w3.org/2001/XMLSchema#int
Seq2 : http://www.ncbi.nlm.nih.gov/nuccore/227935361
Len2 : 304^^http://www.w3.org/2001/XMLSchema#int
(...)
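
For readers less familiar with SPARQL: the query is a self-join over the sequences, filtered on their lengths. In plain Java, and assuming the lengths were already available in a map (the class name ShorterPairsSketch and the map are hypothetical, used for this illustration only), it computes roughly the following:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustration only: what the SPARQL query computes, as a nested loop. */
class ShorterPairsSketch {
    /** every ordered pair {seq1, seq2} such that seq1 is strictly shorter than seq2 */
    static List<String[]> shorterPairs(Map<String, Integer> lengths) {
        List<String[]> pairs = new ArrayList<String[]>();
        for (Map.Entry<String, Integer> a : lengths.entrySet()) {
            for (Map.Entry<String, Integer> b : lengths.entrySet()) {
                // same test as FILTER (?Seq1!=?Seq2 && ?Len1 < ?Len2)
                if (!a.getKey().equals(b.getKey()) && a.getValue() < b.getValue()) {
                    pairs.add(new String[]{a.getKey(), b.getKey()});
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        Map<String, Integer> lengths = new LinkedHashMap<String, Integer>();
        lengths.put("227935373", 303);
        lengths.put("227935371", 303);
        lengths.put("227935361", 304);
        for (String[] p : shorterPairs(lengths)) {
            System.out.println(p[0] + " is shorter than " + p[1]);
        }
    }
}
```

Two sequences of the same length never form a pair, which is why 227935373 and 227935371 are not listed together in the result.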
Hey, I think it's coool ! :-)
BTW, I wonder how searching the Graph could be optimized when the FILTER of the SPARQL query is known, for example if we know that the sequences have been sorted in the fasta file according to their lengths... Any idea?

Compiling



export JENAPATH=${JENALIB}/icu4j-3.4.4.jar:${JENALIB}/iri-0.7.jar:${JENALIB}/jena-2.6.2.jar:${JENALIB}/jena-2.6.2-tests.jar:${JENALIB}/junit-4.5.jar:${JENALIB}/log4j-1.2.13.jar:${JENALIB}/lucene-core-2.3.1.jar:${JENALIB}/slf4j-api-1.5.6.jar:${JENALIB}/slf4j-log4j12-1.5.6.jar:${JENALIB}/stax-api-1.0.1.jar:${JENALIB}/wstx-asl-3.2.9.jar:${JENALIB}/xercesImpl-2.7.1.jar:${JENALIB}/arq-2.8.1.jar
javac -cp ${JENAPATH}:. -d bin -sourcepath src src/test/FastaModel.java

Running


java -cp ${JENAPATH}:bin test.FastaModel

All in one, here is the code



That's it !
Pierre

2 comments:

Unknown said...

Thanks a lot.
I have a problem, could you help me?
I followed the user manual of the D2R Platform, but I have a problem using D2RQ within Jena!
When I run SPARQL queries I don't get any results!
For example, in the code below, "rs.hasNext()" is always false! Please help me.


ModelD2RQ mm = new ModelD2RQ("file:/C:/last%20dl/d2r-server-0.6/d2r-server-0.6/mapping-iswc.n3");

String sparql =
"PREFIX dc: " +
"PREFIX foaf: " +
"SELECT ?paperTitle ?authorName WHERE {" +
" ?paper dc:title ?paperTitle . " +
" ?paper dc:creator ?author ." +
" ?author foaf:name ?authorName ." +
"}";
Query q = QueryFactory.create(sparql);
ResultSet rs = QueryExecutionFactory.create(q, m).execSelect();
while (rs.hasNext()) {
QuerySolution row = rs.nextSolution();
System.out.println("Title: " + row.getLiteral("paperTitle").getString());
System.out.println("Author: " + row.getLiteral("authorName").getString());
}

Pierre Lindenbaum said...

you should ask this on http://stackoverflow.com/ or http://www.semanticoverflow.com/