12 March 2009

A few nightmares before biohackathon 2009.

OK, after Scifoo 2007 (http://plindenbaum.blogspot.com/2007/07/scifoo-07-anxiety-from-homebody.html). Here are my apprehensions for BioHackathon 2009, you know, I'm lost and anxious when I cannot see the Peripherique ;-)
















My notebook for a Stupid RDF Server

In this post, I'm writing my notes about the java package com.sun.net.httpserver.*. This package contains a lightweight HTTP server and I've tested it to create a simple-and-stupid RDF server.
The source code is available at:



OK. First we need a 'Statement' class holding a RDF triple:
private static class Statement
{
/** subject of this statement */
private URI subject;
/** predicate of this statement */
private URI predicate;
/** value of this statement a String or a URI*/
private Object value;


boolean isLiteral()
{
return value.getClass()==String.class;
}

@Override
public int hashCode() {
final int prime = 31;
int result = 1;
result = prime * result + predicate.hashCode();
result = prime * result + subject.hashCode();
result = prime * result + value.hashCode();
return result;
}

@Override
public boolean equals(Object obj) {
if (this == obj) return true;
if (obj == null) return false;
Statement other = (Statement) obj;
return subject.equals(other.subject) &&
predicate.equals(other.predicate) &&
value.equals(other.value)
;
}


@Override
public String toString() {
return "<"+subject+"> <"+predicate+"> "+(isLiteral()
?"\""+C.escape(String.class.cast(value))+"\""
:"<"+value+"> ."
);
}
}

The statements are just stored in a synchronized set. That is to say, we the server stops, all the statements are lost :-).
private Set<Statement> statements= Collections.synchronizedSet(new HashSet<Statement>());


The 'servlet' StupidRDFServer is a class that implementing com.sun.net.httpserver.HttpHandler, that is to say it must implement the method public void handle(HttpExchange http) throws IOException

The following code starts the server.
public static void main(String[] args)
{
try
{
HttpServer server = HttpServer.create(new InetSocketAddress(PORT), 0);
server.createContext(CONTEXT, new StupidRDFServer());
server.start();
}
catch(IOException err)
{
err.printStackTrace();
}
}
If the query is empty, the following form is returned to the client.
Add Statement 
 
 
  


The 'Add Statements' button adds a statement in the list:
respHeader.add("Content-Type", "text/html");
http.sendResponseHeaders(200, 0);
boolean added=this.statements.add(stmt);
printForm(out,(added?"Statement Added":"Statement already in model"));

The 'Query N3' button prints the triples, using the parameters of the form as a filter:
respHeader.add("Content-Type", "text/plain");
http.sendResponseHeaders(200, 0);
synchronized(statements)
{
Iterator<Statement> iter= statements.iterator();
while(iter.hasNext())
{
Statement triple=iter.next();
if(
(stmt.subject==null || stmt.subject.equals(triple.subject)) &&
(stmt.predicate==null || stmt.predicate.equals(triple.predicate)) &&
(stmt.value==null || stmt.value.equals(triple.value))
)
{
out.println(triple);
}
}
}


The 'Query RDF' button prints the triples as RDF, using the parameters of the form as a filter. Example of output:
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Statement>
<rdf:subject rdf:resource="foaf%3Ame"/>
<rdf:predicate rdf:resource="foaf%3Aname"/>
<rdf:object>Pierre</rdf:object>
</rdf:Statement>
<rdf:Statement>
<rdf:subject rdf:resource="urn%3Aother"/>
<rdf:predicate rdf:resource="foaf%3Aknows"/>
<rdf:object rdf:resource="foaf%3Aknows"/>
</rdf:Statement>
</rdf:RDF>


That's it

07 March 2009

A lightweight java parser for RDF

About one year ago, I wrote a lightweight java parser for RDF based on the Stream API for XML (Stax). It is far from being perfect as , for example, it does not handle the reified statements, xml:base, ... but it is small (24K) and works fine with most RDF files. Inspired by the XML SAX parsers, this RDF parser doesn't keep the statements in memory but calls a method "found" each time a triple is found. This method can be overridden to implement your own code.

Source code


The code is available at

RDFEvent


First we need a small internal class to record the content of each triple
private static class RDFEvent
{
URI subject=null;
URI predicate=null;
Object value=null;
URI valueType=null;
String lang=null;
int listIndex=-1;
(...)
}

Searching for rdf:RDF


First we scan the elements of the document until the <rdf:RDF> element is found. Then, the method parseRDF is called.
this.parser = this.factory.createXMLEventReader(in);

while(getReader().hasNext())
{
XMLEvent event = getReader().nextEvent();
if(event.isStartElement())
{
StartElement start=(StartElement)event;
if(name2string(start).equals(RDF.NS+"RDF"))
{
parseRDF();
}
}
}

parseRDF: Searching the statements


All the nodes are then scanned .The method parseDescription is called for each element.
while(getReader().hasNext())
{
XMLEvent event = getReader().nextEvent();
if(event.isEndElement())
{
return;
}
else if(event.isStartElement())
{
parseDescription(event.asStartElement());
}
else if(event.isProcessingInstruction())
{
throw new XMLStreamException("Found Processing Instruction in RDF ???");
}
else if(event.isCharacters() &&
event.asCharacters().getData().trim().length()>0)
{
throw new XMLStreamException("Found text in RDF ???");
}
}

parseDescription: Parsing the subject of a triple


The current element will be the subject of the triple.
The URI of this subject need to extracted.
First we check if this URI can be extracted from an attribute rdf:about
Attribute att= description.getAttributeByName(new QName(RDF.NS,"about"));
if(att!=null) descriptionURI= createURI( att.getValue());

If it was not found, the attribute rdf:nodeID is searched:
att= description.getAttributeByName(new QName(RDF.NS,"nodeID"));
if(att!=null) descriptionURI= createURI( att.getValue());

If it was not found, the attribute rdf:ID is searched.
att= description.getAttributeByName(new QName(RDF.NS,"ID"));
if(att!=null) descriptionURI= resolveBase(att.getValue());

If it was not found, this is an anonymous node. We create a random URI.
descriptionURI= createAnonymousURI();


rdf:type


The qualified name of the element contains the rdf:type of this statement. We can emit a new triple about this type:
QName qn=description.getName();
if(!(qn.getNamespaceURI().equals(RDF.NS) &&
qn.getLocalPart().equals("Description")))
{
RDFEvent evt= new RDFEvent();
evt.subject=descriptionURI;
evt.predicate= createURI(RDF.NS+"type");
evt.value=name2uri(qn);
found(evt);
}


Other attributes


The other attributes of the current element may contains some new triples.
for(Iterator<?> i=description.getAttributes();
i.hasNext();)
{
att=(Attribute)i.next();
qn= att.getName();
String local= qn.getLocalPart();
if(qn.getNamespaceURI().equals(RDF.NS) &&
( local.equals("about") ||
local.equals("ID") ||
local.equals("nodeID")))
{
continue;
}
RDFEvent evt= new RDFEvent();
evt.subject=descriptionURI;
evt.predicate= name2uri(qn);
evt.value= att.getValue();
found(evt);
}

Searching the predicates


We then loop over the children of the current element. Those nodes are the predicates of the current subject. The method parsePredicate is called, each time a new element is found.
while(getReader().hasNext())
{
XMLEvent event = getReader().nextEvent();
if(event.isEndElement())
{
return descriptionURI;
}
else if(event.isStartElement())
{
parsePredicate(descriptionURI,event.asStartElement());
}
else if(event.isProcessingInstruction())
{
throw new XMLStreamException("Found Processing Instruction in RDF ???");
}
else if(event.isCharacters() &&
event.asCharacters().getData().trim().length()>0)
{
throw new XMLStreamException("Found text in RDF ??? \""+
event.asCharacters().getData()+"\""
);
}
}

parsePredicate: Parsing the predicate of the current triple


First the property attributes of the current element are scanned, and some new triples may be created. e.g:
<rdf:Description ex:fullName="Dave Beckett">
<ex:homePage rdf:resource="http://purl.org/net/dajobe/"/>
</rdf:Description>

During this process, the value of the attribute rdf:parseType is noted if it was found.
Furthermore, if there was an attribute rdf:resource, then this element is a new triple linking another resource.
<ex:homePage rdf:resource="http://purl.org/net/dajobe/"/>

If rdf:parseType="Literal" then we transform the children of the current node into a string, and a new triple is created.
if(parseType.equals("Literal"))
{
StringBuilder b= parseLiteral();
RDFEvent evt= new RDFEvent();
evt.subject=descriptionURI;
evt.predicate= predicateURI;
evt.value= b.toString();
evt.lang=lang;
evt.valueType=datatype;
found(evt);
}

If rdf:parseType="Resource", then the current node is a blank node: The rdf:Description will be omitted. A blank node is created and we call recursively parsePredicate using this blank node has the new subject.
URI blanck = createAnonymousURI();
RDFEvent evt= new RDFEvent();
evt.subject=descriptionURI;
evt.predicate= predicateURI;
evt.value=blanck;
evt.lang=lang;
evt.valueType=datatype;
found(evt);

while(getReader().hasNext())
{
XMLEvent event = getReader().nextEvent();
if(event.isStartElement())
{
parsePredicate(blanck, event.asStartElement());
}
else if(event.isEndElement())
{
return;
}
}

If rdf:parseType="Collection", The children elements give the set of subject nodes of the collection. We call recursively parseDescription for each of these nodes.
int index=0;
while(getReader().hasNext())
{
XMLEvent event = getReader().nextEvent();
if(event.isStartElement())
{
URI value= parseDescription(event.asStartElement());
RDFEvent evt= new RDFEvent();
evt.subject=descriptionURI;
evt.predicate= predicateURI;
evt.value=value;
evt.lang=lang;
evt.valueType=datatype;
evt.listIndex=(++index);

found(evt);
}
else if(event.isEndElement())
{
return;
}
}

Else this is the default rdf:parseType.
If a new element is found, then , this is the subject of a new resource (We call recursively parseDescription), else the current statement has a Literal as the object of this statement and we concatenate all the text.
StringBuilder b= new StringBuilder();
while(getReader().hasNext())
{
XMLEvent event = getReader().nextEvent();
if(event.isStartElement())
{
URI childURI=parseDescription(event.asStartElement());
RDFEvent evt= new RDFEvent();
evt.subject=descriptionURI;
evt.predicate= predicateURI;
evt.value= childURI;
found(evt);
b.setLength(0);
foundResourceAsChild=true;
}
else if(event.isCharacters())
{
b.append(event.asCharacters().getData());
}
else if(event.isEndElement())
{
if(!foundResourceAsChild)
{
RDFEvent evt= new RDFEvent();
evt.subject=descriptionURI;
evt.predicate= predicateURI;
evt.value= b.toString();
evt.lang=lang;
evt.valueType=datatype;
found(evt);
}
else
{
if(b.toString().trim().length()!=0) throw new XMLStreamException("Found bad text "+b);
}
return;
}
}

Testing


The following code parses go.rdf.gz (1744 Ko) and returns the number of statements.
long now= System.currentTimeMillis();
URL url= new URL("ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf/go.rdf.gz");
InputStream r= new GZIPInputStream(url.openStream());
RDFHandler h= new RDFHandler()
{
@Override
public void found(URI subject, URI predicate, Object value,
URI dataType, String lang, int index)
throws IOException {
++count;
}
};
h.parse(r);
r.close();
System.out.println("time:"+((System.currentTimeMillis()-now)/1000)+" secs count:"+count+" triples");

Result:
time:17 secs count:188391 triples



That's it.

Pierre

03 March 2009

String Challenge: My (brute-force) solution.

In this post, I present my (brute-force/quick n'dirty) solution to the recent 'String Challenge' submited by Thomas Mailund on his blog: http://www.mailund.dk/index.php/2009/03/02/a-string-algorithms-challenge/. Briefly, here is the problem:

Given an input string X, an integer k and a frequency f, report all k-mers that occur with frequency higher than f. Expect the length of X to be from a few hundred thousands to a few millions and k to be between 5 and 20.



I wrote a simple java code to solve this problem. This is not big science as I used a brute-force algorithm, but you might be interested about how the k-mers were mapped to their occurrences. Here, my code uses the java implementation of the BerkeleyDB API (http://www.oracle.com/technology/documentation/berkeley-db/je/index.html) to map the k-mers to their occurrences.

The source code is available here:http://anybody.cephb.fr/perso/lindenb/tmp/StringChallenge.java (Opps, please remove this extra (c=fin.read();), I cannot change this at this time)


The code



First the BerkeleyDB environment is initialized and we create a temporary database that will map the k-mers to the counts.

File envHome=new File(System.getProperty("java.io.tmpdir"));
EnvironmentConfig envCfg= new EnvironmentConfig();
envCfg.setAllowCreate(true);
envCfg.setReadOnly(false);
Environment env= new Environment(envHome,envCfg);

//create a first database mapping k-mers to count
DatabaseConfig cfg= new DatabaseConfig();
cfg.setAllowCreate(true);
cfg.setReadOnly(false);
cfg.setTemporary(true);
Database db= env.openDatabase(null, "kmers", cfg);
DatabaseEntry key = new DatabaseEntry();
DatabaseEntry value = new DatabaseEntry();


The sequence is then scanned and an array of bytes of length k is filled with the characters.

FileReader fin= new FileReader(file);
byte array[]=new byte[kmer];
int c=-1;
int array_size=0;
//c=fin.read(); oopps, I should have removed this....


while((c=fin.read())!=-1)
{
if(Character.isWhitespace(c)) continue;
c=Character.toUpperCase(c);
array[array_size++]=(byte)c;

}



This array is filled, searched in the database and the count is incremented back in the BerkeleyDB. The the content of the array is shifted to the left.

key.setData(array);

int count=0;
//is this data already exists ?
if(db.get(null,key, value, null)==OperationStatus.SUCCESS)
{
count =IntegerBinding.entryToInt(value);
}

IntegerBinding.intToEntry(count+1, value);
db.put(null,key, value);

//switch to left
for(int i=1;i< kmer;++i)
{
array[i-1]=array[i];
}
array_size--;


At the end, in order to have a set of ordered results, a reverse database is created . It maps the counts to the k-mers.



//create a second database mapping count to k-mers
cfg= new DatabaseConfig();
cfg.setAllowCreate(true);
cfg.setReadOnly(false);
cfg.setTemporary(true);
cfg.setSortedDuplicates(true);
Database db2= env.openDatabase(null, "occ",cfg);
key = new DatabaseEntry();
value = new DatabaseEntry();

Cursor cursor= db.openCursor(null, null);
while(cursor.getNext(key, value,null)==OperationStatus.SUCCESS)
{
int count=IntegerBinding.entryToInt(value);
if((count/(float)total)< freq) continue;
db2.put(null, value, key);
}
cursor.close();


This second database is then scanned to print the ordered results.

while(cursor.getNext(key, value,null)==OperationStatus.SUCCESS)
{
int count=IntegerBinding.entryToInt(key);
String seq= new String(value.getData(),value.getOffset(),value.getSize());
System.out.println(seq+"\t"+count+"/"+total);
}


Compilation



javac -cp ${BERKELEY-JE-PATH}/lib/je-3.3.75.jar:. StringChallenge.java

Excecution


java -cp ${BERKELEY-JE-PATH}/lib/je-3.3.75.jar:. StringChallenge

Test with the Human chr22 (~49E6 bp)


time java -cp ${BERKELEY-JE-PATH}/lib/je-3.3.75.jar:. StringChallenge -k 8 -f 0.001 chr22.fa
AAAAAAAA 71811/49691430
TTTTTTTT 72474/49691430
NNNNNNNN 14840030/49691430

real 5m44.240s
user 5m56.186s
sys 0m3.178s

time java -cp ${BERKELEY-JE-PATH}/lib/je-3.3.75.jar:. StringChallenge -k 10 -f 0.001 chr22.fa
AAAAAAAAAA 50128/49691428
TTTTTTTTTT 50841/49691428
NNNNNNNNNN 14840010/49691428

real 8m15.152s
user 8m41.429s
sys 0m3.748s



That's it. Now I'd like to know how the 'elegant' solutions will be implemented :-)