19 December 2006

CiteXplore – integrating biomedical literature and data

Via Prosper:


CiteXplore combines literature search with text mining tools for biology. Search results are cross-referenced to EBI applications based on publication identifiers. Links to full text versions are provided where available.

CiteXplore uses powerful text-mining tools developed by EMBL-EBI researchers to link literature and databases automatically, so that at the touch of a button the biological terms are identified in the text and you can call up the record of the molecule that you are looking for.


http://www.ebi.ac.uk/citexplore/


15 December 2006

JAVA 1.6 Mustang, Derby/JavaDB and Bioinformatics.

Java 1.6 now contains an embedded SQL database engine called Derby. In this post I show how I tested Derby to store some fasta sequences. What is cool is that you can call any public static java method directly from the SQL queries.

The source code is available here

When the SQL driver is called, it creates a new directory (here derby4fasta) where Derby will store its data.


File database=new File("derby4fasta");
String DRIVER="org.apache.derby.jdbc.EmbeddedDriver";
Class<?> driver=Class.forName(DRIVER);
boolean create=!database.exists();
driver.newInstance();
String url="jdbc:derby:";
Properties props= new Properties();
if(create)
{
props.setProperty("create", "true");
}
props.setProperty("user", "dba");
props.setProperty("password", "");
props.setProperty("databaseName","derby4fasta");
this.connection = DriverManager.getConnection(url,props);


I declare a static method returning the GC% of a sequence.

public static double gcPercent(String sequence)
{
int n=0;
for(int i=0;i< sequence.length();++i)
{
char base= Character.toUpperCase(sequence.charAt(i));
n+=(base=='G' || base=='C'?1:0);
}
return (double)(n/(double)sequence.length());
}


I declare a new function that will call this method.

/* the function FASTA.GC calls org.lindenb.sandbox.Derby4Fasta.gcPercent() */
statement.executeUpdate(
"create function FASTA.GC( seq VARCHAR(2000) ) returns DOUBLE "+
" LANGUAGE JAVA "+
" NO SQL "+
" PARAMETER STYLE JAVA "+
" EXTERNAL NAME \'org.lindenb.sandbox.Derby4Fasta.gcPercent\'"
);
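
These snippets don't show the FASTA.SEQUENCE table itself. A minimal sketch of its creation and of the insert step, continuing from the connection and statement above (the column names name and seq are inferred from the query further down, and the full linked source may do this differently), could be:

//hypothetical sketch: create the FASTA.SEQUENCE table and insert one fasta record
//(the FASTA schema is assumed to already exist, e.g. via an earlier "create schema FASTA")
statement.executeUpdate(
 "create table FASTA.SEQUENCE(name VARCHAR(255), seq VARCHAR(2000))");

PreparedStatement insert = connection.prepareStatement(
 "insert into FASTA.SEQUENCE(name,seq) values(?,?)");
insert.setString(1, ">example header"); //fasta header line
insert.setString(2, "GATTACAGATTACA"); //dna sequence
insert.executeUpdate();
insert.close();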




The fasta sequences are read and inserted into the database. We can now select the sequences having a GC% greater than 55%.

//find sequences having a GC% > 55%
Statement stmt= app.connection.createStatement();
ResultSet row=stmt.executeQuery("select FASTA.GC(seq)*100.0,name,length(seq) from FASTA.SEQUENCE " +
"where FASTA.GC(seq)>0.55");
while(row.next())
{
System.out.println(row.getInt(1)+"%\t"+row.getString(2)+"("+row.getInt(3)+")");
}


compiling and executing....

55% >gi|11448650|gb|BF436335.1|BF436335 7p06d03.x1 NCI_CGA
55% >gi|23257376|gb|BU583411.1|BU583411 mai04h08.y1 McCarr
57% >gi|10811235|gb|BF057339.1|BF057339 7k19e02.x1 NCI_CGA
57% >gi|11271233|gb|BF321955.1|BF321955 uz66f08.y1 NCI_CGA
57% >gi|11271233|gb|BF321955.1|BF321955 uz66f08.y1 NCI_CGA
58% >gi|13675777|gb|BG625264.1|BG625264 pgn1c.pk002.c2 Nor
58% >gi|13675786|gb|BG625273.1|BG625273 pgn1c.pk002.d1 Nor
58% >gi|13675786|gb|BG625273.1|BG625273 pgn1c.pk002.d1 Nor
63% >gi|84131965|gb|CV878005.1|CV878005 PDUts1172A01 Porci
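
When the program exits, embedded Derby is conventionally shut down through its special shutdown URL; Derby reports a successful shutdown with an SQLException, so the call is wrapped in a try/catch. A small sketch (not part of the original snippets):

try
{
 //embedded Derby always throws an SQLException on shutdown; this is the expected behaviour
 DriverManager.getConnection("jdbc:derby:;shutdown=true");
}
catch(SQLException expected)
{
 //ignore: a clean shutdown is signalled by this exception (SQLState XJ015)
}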


That's it.

Pierre

JAVA 1.6 Mustang, StAX and Bioinformatics

About StAX, via xml.com: Most current XML APIs fall into one of two broad classes: event-based APIs like SAX and XNI or tree-based APIs like DOM and JDOM. Most programmers find the tree-based APIs to be easier to use; but such APIs are less efficient, especially with respect to memory usage. (...) However, the common streaming APIs like SAX are all push APIs. They feed the content of the document to the application as soon as they see it, whether the application is ready to receive that data or not. SAX and XNI are fast and efficient, but the patterns they require programmers to adopt are unfamiliar and uncomfortable to many developers. (...)

StAX shares with SAX the ability to read arbitrarily large documents. However, in StAX the application is in control rather than the parser. The application tells the parser when it wants to receive the next data chunk rather than the parser telling the client when the next chunk of data is ready. Furthermore, StAX exceeds SAX by allowing programs to both read existing XML documents and create new ones. Unlike SAX, StAX is a bidirectional API.

I've tested StAX to see how it could be used to read the NCBI/TinySeq XML format.
Each TSeq sequence in the xml file was parsed using the StAX API (XMLEventReader). Once in memory, all the sequences were printed to stdout using an XMLStreamWriter.

[the source code is here]
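
The TSeq holder class itself is not reproduced in this post; a minimal sketch, with field names and types inferred from the reader and writer code below, might look like:

/** simple holder for one TinySeq record (sketch; fields inferred from the code below) */
static class TSeq
{
String type; // TSeq_seqtype/@value
int gi; // TSeq_gi
String accver; // TSeq_accver
String sid; // TSeq_sid
int taxid; // TSeq_taxid
String orgname; // TSeq_orgname
String defline; // TSeq_defline
int length; // TSeq_length
String sequence; // TSeq_sequence
}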


(...)
XMLInputFactory factory = XMLInputFactory.newInstance();
factory.setProperty("javax.xml.stream.isNamespaceAware", Boolean.FALSE);
factory.setProperty("javax.xml.stream.isCoalescing", Boolean.TRUE);
/** create a XML Event parser */
XMLEventReader parser = factory.createXMLEventReader(in);
TSeq seq= null;


/** loop over the events */
while(parser.hasNext()) {
XMLEvent event = parser.nextEvent();

if(event.isStartElement())
{
StartElement start=((StartElement)event);
String localName= start.getName().getLocalPart();
if(localName.equals("TSeq"))
{
seq= new TSeq();
this.TSeqSet.addElement(seq);
}
else if(localName.equals("TSeq_seqtype"))
{
seq.type= start.getAttributeByName(new QName("value")).getValue();
}
else if(localName.equals("TSeq_gi"))
{
seq.gi= Integer.parseInt(parser.getElementText());
}
else if(localName.equals("TSeq_accver"))
{
seq.accver= parser.getElementText();
(...)

... and to write the sequences...
 (...)
XMLOutputFactory factory= XMLOutputFactory.newInstance();
XMLStreamWriter w= factory.createXMLStreamWriter(out);
w.writeStartDocument();
w.writeStartElement("TSeqSet");

for(TSeq seq: TSeqSet)
{
w.writeStartElement("TSeqSet");
w.writeEmptyElement("TSeq_seqtype");
w.writeAttribute("value", seq.type);
w.writeStartElement("TSeq_gi");
w.writeCharacters(String.valueOf(seq.gi));
w.writeEndElement();
w.writeStartElement("TSeq_accver");
w.writeCharacters(seq.accver);
w.writeEndElement();
w.writeStartElement("TSeq_sid");
w.writeCharacters(seq.sid);
w.writeEndElement();
w.writeStartElement("TSeq_taxid");
w.writeCharacters(String.valueOf(seq.taxid));
w.writeEndElement();
w.writeStartElement("TSeq_orgname");
w.writeCharacters(seq.orgname);
w.writeEndElement();
w.writeStartElement("TSeq_defline");
w.writeCharacters(seq.defline);
w.writeEndElement();
w.writeStartElement("TSeq_length");
w.writeCharacters(String.valueOf(seq.length));
w.writeEndElement();
w.writeStartElement("TSeq_sequence");
w.writeCharacters(seq.sequence);
w.writeEndElement();
w.writeEndElement();
}

w.writeEndElement();
w.writeEndDocument();
w.flush();
(....)

compiling and running...

pierre@linux:> javac org/lindenb/sandbox/STAXTinySeq.java

pierre@linux:> java org/lindenb/sandbox/STAXTinySeq tinyseq.xml


<?xml version="1.0" ?><TSeqSet><TSeqSet><TSeq_se
qtype value="nucleotide"/><TSeq_gi>27592135</TSeq_gi>&
lt;TSeq_accver>CB017399.1</TSeq_accver><TSeq_sid>gnl|d
bEST|16653996</TSeq_sid><TSeq_taxid>9031</TSeq_taxid&g
t;<TSeq_orgname>Gallus gallus</TSeq_orgname><TSeq_defl
ine>pgn1c.pk016.a18 Chicken lymphoid cDNA library (pgn1c) Gallus g
allus cDNA clone pgn1c.pk016.a18 5' similar to ref|XP_176823.1 simila
r to Rotavirus X associated non-structural protein (RoXaN) [Mus muscu
lus] ref|XP_193795.1| similar to Rotavirus X as></TSeq_defline&
gt;<TSeq_length>671</TSeq_length><TSeq_sequence>GGA
AGGGCTGCCCCACCATTCATCCTTTTCTCGTAGTTTGTGCACGGTGCGGGAGGTTGTCTGAGTGACTTC
ACGGGTCGCCTTTGTGCAGTACTAGATATGCAGCAGACCTATGACATGTGGCTAAAGAAACACAATCCT
GGGAAGCCTGGAGAGGGAACACCACTCACTTCGCGAGAAGGGGAGAAACAGATCCAGATGCCCACTGAC
TATGCTGACATCATGATGGGCTACCACTGCTGGCTCTGCGGGAAGAACAGCAACAGCAAGAAGCAATGG
CAGCAGCACATCCAGTCAGAGAAGCACAAGGAGAAGGTCTTCACCTCAGACAGTGACTCCAGCTGCTGG
AGCTATCGCTTCCCTATGGGCGAGTTCCAGCTCTGTGAAAGGTACCATGCACATGGCTCTGTTTGATCC
CAGAAGTGATGACTACTTAGTGGTAAAAACACATTTCCAGACACACAACTTCAGAAAATGAGTGCAAGC
TTCAAGTCTGCCCTTTGTAGCCATAATGTGCTCAGCTCTCGGTCTGCTGAACAGAGTCTACTTGGCTCA
ATTCTTGGGGGAATCCCAGATGCTTTATTAGATTGTTTGAATGTCTCACGCCCTCTGAATCAGTGCCTT



That's it.
Pierre

JAVA 1.6 Mustang, JAXB and Bioinformatics

JAXB provides a convenient way to bind an XML schema to a representation in Java code. It makes it easy for you to incorporate XML data and processing functions in applications based on Java technology without having to know much about XML itself. The Java Architecture for XML Binding is now included in the new Java 1.6. I wanted to test JAXB to see how it could be used to parse the NCBI/TinySeq XML format. First I created an XSD description of a TSeq:

Source tinyseq.xsd
(...)
<xs:element name="TSeqSet">
<xs:annotation>
<xs:documentation>Set of sequences</xs:documentation>
</xs:annotation>
<xs:complexType>
<xs:sequence>
<xs:element ref="TSeq" maxOccurs="unbounded">
</xs:element>
</xs:sequence>
</xs:complexType>
(...)
I then invoked the binding compiler XJC. XJC generates Java classes corresponding to the elements. It parsed my tinyseq xsd schema and created three files:
pierre@linux:~> xjc org/lindenb/sandbox/tinyseq.xsd -d ./ -p org.lindenb.sandbox.tinyseq
parsing a schema...
compiling a schema...
org/lindenb/sandbox/tinyseq/ObjectFactory.java
org/lindenb/sandbox/tinyseq/TSeq.java
org/lindenb/sandbox/tinyseq/TSeqSet.java
I then wrote a java class using the JAXB API to read and then write a TinySeq file.

[source is here]
/** find the JAXB context in the defined path */
JAXBContext jc = JAXBContext.newInstance("org.lindenb.sandbox.tinyseq");
Unmarshaller u = jc.createUnmarshaller();
/** read the sequence */
TSeqSet seqSet = (TSeqSet)u.unmarshal(new FileInputStream("org/lindenb/sandbox/tinyseq.xml"));

Marshaller m= jc.createMarshaller();
/** echo the sequence to stdout */
m.marshal(seqSet, System.out);
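
By default the marshaller writes the whole document on a single line, as shown in the output below; if you prefer indented XML, the standard JAXB_FORMATTED_OUTPUT property can be set on the marshaller (a small optional addition, not used for the run below):

//optional: ask JAXB to indent the output instead of writing one long line
m.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, Boolean.TRUE);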
compiling and running...
pierre@linux> javac org/lindenb/sandbox/JAXBTinySeq.java org/lindenb/sandbox/tinyseq/ObjectFactory.java
pierre@linux> java -cp . org.lindenb.sandbox.JAXBTinySeq tinyseq.xml

<?xml version="1.0" ?><TSeqSet><TSeqSet><TSeq_se
qtype value="nucleotide"/><TSeq_gi>27592135</TSeq_gi><
TSeq_accver>CB017399.1</TSeq_accver><TSeq_sid>gnl|d
bEST|16653996</TSeq_sid><TSeq_taxid>9031</TSeq_taxid
><TSeq_orgname>Gallus gallus</TSeq_orgname><TSeq_defl
ine>pgn1c.pk016.a18 Chicken lymphoid cDNA library (pgn1c) Gallus g
allus cDNA clone pgn1c.pk016.a18 5' similar to ref|XP_176823.1 simila
r to Rotavirus X associated non-structural protein (RoXaN) [Mus muscu
lus] ref|XP_193795.1| similar to Rotavirus X as></TSeq_defline>
<TSeq_length>671</TSeq_length><TSeq_sequence>GGA
AGGGCTGCCCCACCATTCATCCTTTTCTCGTAGTTTGTGCACGGTGCGGGAGGTTGTCTGAGTGACTTC
ACGGGTCGCCTTTGTGCAGTACTAGATATGCAGCAGACCTATGACATGTGGCTAAAGAAACACAATCCT
GGGAAGCCTGGAGAGGGAACACCACTCACTTCGCGAGAAGGGGAGAAACAGATCCAGATGCCCACTGAC
(...)


That's it: all the classes and methods needed to store and parse the XML were generated by xjc and everything was ready for direct use.

Pierre

13 December 2006

JAVA 1.6 Mustang , Scripting and Bioinformatics.

Java 1.6 has been released and it is now open source. Among the new features of java there are an embedded sql engine and a scripting engine. I've tested the latter to see if I could create a simple filter for fasta sequences, just like awk does with regular text files. What is interesting is that you can call (complex) methods already coded in java from the script (see below with the trivial static functions 'reverseComplement' and 'gcPercent').

The source code using the new Scripting API is at the bottom of this post.

It was compiled like this:

javac org/lindenb/sandox/FastaAWK.java

and here is a test: the program takes some fasta sequences as input and only prints the ones where (the length is greater than 900 bp, or the GC percent is lower than 0.45, or (the reverse complement contains ATGCTTCTTG and the name contains Xenopus)).

cat ~/roxan.fasta | java org/lindenb/sandox/FastaAWK "FastaAWK.gcPercent(sequence)< 0.45 || sequence.length>90 ||  (FastaAWK .reverseComplement(sequence).toUpperCase().indexOf('ATGCTTCTTG')!=-1 && name.indexOf('Xenopus')!=-1 )"


>gi|27592135|gb|CB017399.1|CB017399 pgn1c.pk016.a18 Chicken lymphoid cDNA librar
y (pgn1c) Gallus gallus cDNA clone pgn1c.pk016.a18 5' similar to ref|XP_176823.1
similar to Rotavirus X associated non-structural protein (RoXaN) [Mus musculus]
ref|XP_193795.1| similar to Rotavirus X associated non-structural protein (RoXa
N) [Mus musculus], mRNA sequence
GGAAGGGCTGCCCCACCATTCATCCTTTTCTCGTAGTTTGTGCACGGTGC
GGGAGGTTGTCTGAGTGACTTCACGGGTCGCCTTTGTGCAGTACTAGATA
TGCAGCAGACCTATGACATGTGGCTAAAGAAACACAATCCTGGGAAGCCT
GGAGAGGGAACACCACTCACTTCGCGAGAAGGGGAGAAACAGATCCAGAT
GCCCACTGACTATGCTGACATCATGATGGGCTACCACTGCTGGCTCTGCG
GGAAGAACAGCAACAGCAAGAAGCAATGGCAGCAGCACATCCAGTCAGAG
AAGCACAAGGAGAAGGTCTTCACCTCAGACAGTGACTCCAGCTGCTGGAG
CTATCGCTTCCCTATGGGCGAGTTCCAGCTCTGTGAAAGGTACCATGCAC
ATGGCTCTGTTTGATCCCAGAAGTGATGACTACTTAGTGGTAAAAACACA
TTTCCAGACACACAACTTCAGAAAATGAGTGCAAGCTTCAAGTCTGCCCT
TTGTAGCCATAATGTGCTCAGCTCTCGGTCTGCTGAACAGAGTCTACTTG
GCTCAATTCTTGGGGGAATCCCAGATGCTTTATTAGATTGTTTGAATGTC
TCACGCCCTCTGAATCAGTGCCTTGAGGTGCCTTCAGAAGGCTTGTGATG
GTTAGNNNTNGCATTTTGGTT
>gi|13675786|gb|BG625273.1|BG625273 pgn1c.pk002.d1 Normalized chicken lymphoid c
DNA library Gallus gallus cDNA clone pgn1c.pk002.d1 5' similar to gb|AAF05541.1|
AF188530_1 (AF188530) ubiquitous tetratricopeptide containing protein RoXaN [Hom
o sapiens]G, mRNA sequence
CAATTGATGACATCGAAACAGACTGCTCTATGGACCTGCAGTGCCTGCCA
GCTCCTGTGGCCACCTCCATCTCTGTGAGCGAGGGGCTGTCCCCTTTGCA
(...)




The source:

package org.lindenb.sandox;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import javax.script.Compilable;
import javax.script.CompiledScript;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineFactory;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;
import javax.script.SimpleBindings;

/**
* my first test for java 1.6 mustang/ ScriptEngine
* @author Pierre Lindenbaum PhD
*
* example: cat *.fasta | java org.lindenb.sandox.FastaAWK "FastaAWK.gcPercent(sequence)<0.45 || (sequence.length>900) || (FastaAWK .reverseComplement(sequence).toUpperCase().indexOf('ATGCTTCTTG')!=-1 && name.indexOf('Xenopus')!=-1 );"
*
*/

public class FastaAWK {
/** header of fasta sequence */
private String name=null;
/** dna sequence */
private StringBuilder sequence= new StringBuilder();
/** compiled user script */
private CompiledScript compiledScript=null;


/** constructor
* initialize the statements
* @param statements
*/

public FastaAWK(String args[]) throws ScriptException
{
//copy statements to add the 'importClass'
String statements[]= new String[args.length+1];
//import this class to get a handle on gcPercent and reverseComplement
statements[0]="importClass(Packages."+this.getClass().getName()+");";
System.arraycopy(args, 0, statements, 1, args.length);

//get a javascript engine
ScriptEngineManager sem = new ScriptEngineManager();
ScriptEngine scriptEngine = sem.getEngineByName("js");
ScriptEngineFactory scriptEngineFactory= scriptEngine.getFactory();
String program = scriptEngineFactory.getProgram(statements);
//compile this program
this.compiledScript=((Compilable) scriptEngine).compile(program);
}

/**
*
* @param sequence the dna sequence
* @return the GC%
*/

public static double gcPercent(String sequence)
{
int n=0;
for(int i=0;i< sequence.length();++i)
{
char base= Character.toUpperCase(sequence.charAt(i));
n+=(base=='G' || base=='C'?1:0);
}
return (double)(n/(double)sequence.length());
}

/**
* return the reverse complement of a sequence
* @param sequence the dna sequence
* @return the reverse complement of a sequence
*/

public static String reverseComplement(String sequence)
{
StringBuilder b= new StringBuilder();
for(int i=sequence.length()-1;i>=0;--i)
{
switch(sequence.charAt(i))
{
case 'A': b.append('T');break;
case 'T': b.append('A');break;
case 'G': b.append('C');break;
case 'C': b.append('G');break;
case 'a': b.append('t');break;
case 't': b.append('a');break;
case 'g': b.append('c');break;
case 'c': b.append('g');break;
default: b.append('N');break;
}
}

return b.toString();
}

/**
* print the current fasta sequence if the compiled script return true
* @throws ScriptException
*/

public void eval() throws ScriptException
{
/* bind name and sequence to the javascript variables 'name' and 'sequence' */

SimpleBindings bindings= new SimpleBindings();
bindings.put("name", this.name);
bindings.put("sequence", this.sequence.toString());
//invoke the script with the current binding and get the result
Object o=this.compiledScript.eval(bindings);
if(o==null || !(o instanceof Boolean ) ) return;
//if the result is true: print the fasta sequence
Boolean b=(Boolean)o;
if(b.equals(Boolean.FALSE))
{
return;
}
System.out.print(name);
for(int i=0;i< sequence.length();++i)
{
if(i%50==0) System.out.println();
System.out.print(sequence.charAt(i));
}
System.out.println();

}

/**
* @param args
*/

public static void main(String[] args)
{
try {
FastaAWK awk= new FastaAWK(args);
//loop over fasta sequences
BufferedReader in= new BufferedReader(new InputStreamReader(System.in));
String line=null;
while((line=in.readLine())!=null)
{
if(line.startsWith(">"))
{
if(awk.sequence.length()>0)
{
awk.eval();
}
awk.sequence.setLength(0);
awk.name=line.trim();
}
else
{
awk.sequence.append(line.trim());
}
}
if(awk.sequence.length()>0)
{
awk.eval();
}

in.close();
}
catch (Throwable e)
{
e.printStackTrace();
}

}

}




Pierre Lindenbaum

PS: hey I put my latest presentation (in french) for the first time on slideshare.net !

06 December 2006

Visual Pipeline Editor (Continued)

Following my post about Amadea, Egon suggested that I have a look at two tools:

KNIME:
KNIME is a modular data exploration platform based on Eclipse that enables the user to visually create pipelines, selectively execute some or all analysis steps, and later investigate the results through interactive views on data and models.


KNIME offers the possibility to extend its functionality by creating your own nodes.

Taverna
The Taverna project aims to provide a language and software tools to facilitate easy use of workflow and distributed compute technology within the eScience community. At first glance, it works with web services, but I need to investigate it a little more.


04 December 2006

Visual Unix Pipeline

A few days ago I was shown a really impressive demo of Amadea. In a simplistic view, it looks like a "visual unix pipeline": just drag your 'grep', 'sort', etc... onto your desktop, draw the links, choose your data sources (mysql, excel spreadsheets...) and run your analysis.

Amadea Screenshot
Via ISoft: The central window enables the user to draw and control the execution of the transformation process. Selecting the output of one of the operators automatically updates the data grid to reflect the transformations processed by this operator on the input table. The frames on the left of the screen give access to transformation operators. The parameters of each operator can be spelt out in the right of the screen.


I wondered: is there any other tool which acts as such a visual editor and which could help people in my lab perform simple operations on their data?

On the other hand, and without pretension, it might be easy to create a naive version of those unix filters in java just by extending java.io.InputStream. For example, for a Grep, I would write:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GrepInputStream
extends InputStream
{
/** parent stream */
private InputStream parent;

/** regular expression */
private Pattern pattern;

/** byte buffer */
private byte byteBuffer[];

/** position in buffer */
private int curr;


/** constructor */
public GrepInputStream(
InputStream parent,
Pattern pattern)
{
this.parent= parent;
this.pattern=pattern;
this.byteBuffer=null;
}

@Override
public int read() throws IOException
{
/** byte buffer already exists */
if(byteBuffer!=null && curr< byteBuffer.length)
{
byte b= byteBuffer[curr++];
if(curr== byteBuffer.length)
{
byteBuffer=null;
curr=0;
}
return b;
}

while(true)
{
/** read next line from parent */
int i;
ByteArrayOutputStream byteOutputStream=new ByteArrayOutputStream();
while((i=parent.read())!=-1)
{
byteOutputStream.write(i);
if(i=='\n') break;//eol reached
}
this.byteBuffer= byteOutputStream.toByteArray();
if(byteBuffer.length==0) return -1;

/** creates the line. remove the CR if needed */
String line= new String(
this.byteBuffer,
0,
byteBuffer[byteBuffer.length-1]=='\n'?byteBuffer.length-1:byteBuffer.length
);

Matcher match= this.pattern.matcher(line);
/* this line matches our pattern */
if(match.find()) break;
}
this.curr=1;
return this.byteBuffer[0];
}

}
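
A quick usage sketch (the GrepTest class below is hypothetical, not part of the original post): wrap System.in with the filter and copy whatever comes through to stdout, just like a naive grep.

import java.util.regex.Pattern;

public class GrepTest
{
public static void main(String[] args) throws Exception
{
//keep only the lines of stdin that match the pattern given as first argument
GrepInputStream grep = new GrepInputStream(System.in, Pattern.compile(args[0]));
int c;
while((c = grep.read()) != -1)
{
System.out.write(c);
}
System.out.flush();
}
}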



Pierre

30 November 2006

Social Network: Viadeo

Viadeo is the new name of the largest French social network, previously known as Viaduc. The new site speaks English and is now open to everybody in the world. There are 623 members working in the sector of "Biotechnology".

Pierre

Proteins@Home

Proteins@Home, a large-scale protein structure prediction project, is now open for beta-testing. Proteins@Home is based on BOINC, a program that lets you donate your idle computer time to science projects like SETI@home, Climateprediction.net, Rosetta@home, World Community Grid, and many others.

23 November 2006

www.oboedit.org

Via : GO friends
OBO-Edit, the open source ontology editor written in Java. OBO-Edit is optimized for the OBO biological ontology file format. It features an easy to use editing interface, a simple but fast reasoner, and powerful search capabilities. It now has its own website! Visit http://www.oboedit.org for the central online resource for all things OBO-Edit.

21 November 2006

JAVA, multithread, k-means, clustering, OTMI.

K-means: A tutorial about the Java concurrency API posted on javaworld contains java code for the k-means clustering algorithm, an algorithm that clusters objects into k partitions based on their attributes and is commonly used in bioinformatics.

OTMI: The Open Text Mining Interface (OTMI) was introduced on nascent a few months ago. OTMI aims to enable scientific, technical and medical (STM) publishers, among others, to disclose their full text for indexing and text-mining purposes but without giving it away in a form that is readily human-readable. There is now a wiki dedicated to OTMI at http://www.opentextmining.org/.

14 November 2006

UCSC Genome Browser wiki site launched

Via: Genome-announce.

The UCSC Genome Bioinformatics group has launched a wiki site for sharing
information about the UCSC Genome Browser and its data. The wiki -- at
http://genomewiki.ucsc.edu -- provides an informal forum for the browser users,
mirror sites, and staff to discuss topics of interest in the genome biology
field and exchange usage tips, scripts, programs, and notes about mirroring the
Genome Browser and working with the Genome Browser source.

09 November 2006

Photo tourism

Photo tourism is a system for browsing large collections of photographs in 3D. Amazing.

Noah Snavely, Steven M. Seitz, Richard Szeliski, "Photo tourism:Exploring photo collections in 3D," ACM Transactions on Graphics (SIGGRAPH Proceedings), 25(3), 2006, 835-846.

06 November 2006

State of the Blogosphere

Via Technorati.

On July 31, 2006, Technorati tracked its 50 millionth blog. The blogosphere that Technorati tracks continues to show significant growth. The blogosphere has been doubling in size every 6 months or so. It is over 100 times bigger than it was just 3 years ago....

03 November 2006

Soylent Green

Via : Science Mag:
Impacts of Biodiversity Loss on Ocean Ecosystem Services: Boris Worm & al.
(...).We analyzed local experiments, long-term regional time series, and global fisheries data to test how biodiversity loss affects marine ecosystem services across temporal and spatial scales. Overall, rates of resource collapse increased and recovery potential, stability, and water quality decreased exponentially with declining diversity. Restoration of biodiversity, in contrast, increased productivity fourfold and decreased variability by 21%, on average. We conclude that marine biodiversity loss is increasingly impairing the ocean's capacity to provide food, maintain water quality, and recover from perturbations. (...)

Conclusions. Positive relationships between diversity and ecosystem functions and services were found using experimental and correlative approaches along trajectories of diversity loss and recovery. Our data highlight the societal consequences of an ongoing erosion of diversity that appears to be accelerating on a global scale. This trend is of serious concern because it projects the global collapse of all taxa currently fished by the mid–21st century (based on the extrapolation of regression to 100% in the year 2048)(...).

02 November 2006

Bio2RDF

Via public-semweb-lifesci.
The Bio2RDF project is a tool to convert bioinformatics data and knowledge bases to RDF format. It is a kind of generalized rdfizer for bioinformatics applications, and it is a place for the semantic web life science community to develop and grow. It is said to be optimized with a Firefox Search plugin and Simile Piggy Bank.

Pierre

"XML, SQL, and C" by Jim Kent

Jim Kent, the author of the BLAT algorithm, has published an article in "Dr Dobb's Journal" (I guess you cannot find this via pubmed :-) !): XML, SQL, and C: Tools for mapping between C and XML data structures, among other tasks. XML is an increasingly popular format for exchanging data. It handles optional fields and hierarchical data structures well, is readable by humans as well as computers, is portable across a wide variety of platforms, and it can even handle recursive data structures. In my own field of bioinformatics, XML has become almost as common as the venerable tab-separated file for exchanging information between databases. Still, integrating data from XML sources into our relational databases and our largely C code base sometimes seemed to involve more work than it should. Consequently, I was motivated to write the four tools—autoXml, AutoDtd, sqlToXml, and xmlToSql—presented in this article.

Pierre

01 November 2006

Honey Bee Genome



Genome Research has published a special issue about the honey bee genome.
There's also a Honey Bee Genome Focus in Nature.

Pierre


PubMed as a Search Engine

Via PubMedNews.
PubMed is now available as a search engine add-on on the search bar in the upper-right corner of Firefox 2.0 and Internet Explorer 7.0. From a PubMed Web page, click the search box drop down arrow next to the default search engine Google, and then select Add PubMed search.

Pierre

31 October 2006

JAVA custom Annotation & Concurrent Versions System

I manage my sources using CVS (Concurrent Versions System). When a user reported a bug in one of my java programs, I faced the problem of getting the version of the java class where the Exception was thrown. I don't know if this has been described before, but here is my solution using a custom java annotation. CVS can use a mechanism known as keyword substitution (or keyword expansion) to help identify the files. Embedded strings of the form $keyword$ and $keyword:…$ in a file are replaced with strings of the form $keyword:value$ whenever you obtain a new revision of the file. Here I used Date, Source and Revision.

First I defined a new custom Annotation called @RCS returning the author, a date and a revision.

import java.lang.annotation.*;
/**
* describe a Revision Control System
* @author pierre
*
*/
@Retention(RetentionPolicy.RUNTIME) /* The annotation should be available for reflection at runtime.*/
@Target(ElementType.TYPE) /* place in : class, interface, enum */
public @interface RCS {
/** author of this source */
String author() default "[undefined]";
/** revision number of this source */
String revision() default "[undefined]";
/** date of revision of this source */
String date() default "[undefined]";
/** file name for this source */
String source() default "[undefined]";
}


and here are the two files used for the test. Both classes contain a @RCS annotation which can be found at runtime. Test2 just throws an Exception. Test catches this exception and displays the stack trace; for each StackTraceElement it tries to find a @RCS annotation.


import java.lang.annotation.*;
@RCS( author="$Author: $",
date="$Date: $",
source="$Source: $",
revision="$Revision: $"
)

public class Test
{

Test()
{
Test2 t=new Test2();
t.doSomething();
}



public static void main(String args[])
{
try
{
Test t= new Test();
}
catch(Exception err)
{

System.err.println(err.getLocalizedMessage());
for(StackTraceElement e: err.getStackTrace())
{
System.err.println("Class :\t\t\t"+e.getClassName());
System.err.println("File :\t\t\t"+e.getFileName());
System.err.println("Line :\t\t\t"+e.getLineNumber());

if(e.getClassName()!=null)
{
try {
Class c= Class.forName( e.getClassName());
RCS rcsInfo=(RCS)c.getAnnotation(RCS.class);
System.err.println("Revision:\t\t"+(rcsInfo==null?"N/A":rcsInfo.revision()));
System.err.println("Date:\t\t\t"+(rcsInfo==null?"N/A":rcsInfo.date()));
System.err.println("Author:\t\t\t"+(rcsInfo==null?"N/A":rcsInfo.author()));
}
catch (Exception err2)
{
}
}
System.err.println();
}
}

}
}



@RCS(   author="$Author:  $",
date="$Date: $",
source="$Source: $",
revision="$Revision: $"
)

public class Test2
{
Test2()
{
}
void doSomething()
{
throw new RuntimeException("Test");
}

}


When the files were committed, the CVS keywords were substituted with their values. Here is the stack trace as it was printed:


pierre@linux:~/tmp/src> cvs commit
(...)
pierre@linux:~/tmp/src> javac Test.java
pierre@linux:~/tmp/src> java Test
Test
Class : Test2
File : Test2.java
Line : 13
Revision: $Revision: 1.2 $
Date: $Date: 2006/10/31 18:10:44 $
Author: $Author: pierre $


Class : Test
File : Test.java
Line : 15
Revision: $Revision: 1.5 $
Date: $Date: 2006/10/31 18:54:08 $
Author: $Author: pierre $


Class : Test
File : Test.java
Line : 24
Revision: $Revision: 1.5 $
Date: $Date: 2006/10/31 18:54:08 $
Author: $Author: pierre $



That's it.
Pierre

Pubmed "Abstract Plus" and Pubmed2connotea

I didn't mention it before, but I've upgraded the greasemonkey script pubmed2connotea for the new "Abstract Plus" format in pubmed. Pubmed2connotea/Pubmed2CiteULike is a Greasemonkey user script which alters the web page when you browse your bibliography on NCBI pubmed by inserting a hyperlink. This new link adds a bookmark for the currently selected paper to connotea.

Update: the latest version is at http://www.urbigene.com/pubmed2connotea/.



Pierre

Zotero

Zotero is a free, easy-to-use Firefox extension and research tool that helps you gather and organize resources (whether bibliography or the full text of articles), and then lets you annotate, organize, and share the results of your research. It includes the best parts of older reference manager software (like EndNote)—the ability to store full reference information in author, title, and publication fields and to export that as formatted references—and the best parts of modern software such as del.icio.us, like the ability to sort, tag, and search in advanced ways.



In Zotero, data are stored using SQLite, so with an external SQLite client you can query or update the database without firefox/mozilla:

pierre@linux:~/> echo ".tables" |./sqlite3-3.3.8.bin -separator '   ' -header ~/.mozilla/firefox/xxxx.default/zotero/zotero.sqlite

charsets itemAttachments tags
collectionItems itemCreators transactionLog
collections itemData transactionSets
creatorTypes itemNotes transactions
creators itemSeeAlso translators
csl itemTags userFieldMask
fieldFormats itemTypeCreatorTypes userFields
fields itemTypeFields userItemTypeFields
fileTypeMimeTypes itemTypes userItemTypeMask
fileTypes items userItemTypes
fulltextItems savedSearchConditions version
fulltextWords savedSearches


pierre@linux:~/> echo "select * from items order by dateAdded;" |./sqlite3-3.3.8.bin-separator ' ' -header ~/.mozilla/firefox/xxxx.default/zotero/zotero.sqlite

itemID itemTypeID title dateAdded dateModified
8423 4 Haplotypes in the gene encoding protein kinase c-beta (PRKCB1) on chromosome 16 are associated with autism. 2006-10-30 18:47:16 2006-10-30 18:47:56
15625 1 2006-10-30 18:48:11 2006-10-31 10:50:53
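
For the java-minded, the same kind of query can also be run through JDBC; this is a hypothetical sketch that assumes a third-party SQLite JDBC driver (for example the org.sqlite.JDBC class from the xerial sqlite-jdbc project) is on the classpath:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ZoteroQuery
{
public static void main(String[] args) throws Exception
{
//load the (third-party) SQLite JDBC driver; the class name is an assumption
Class.forName("org.sqlite.JDBC");
//the path to zotero.sqlite is given as the first argument
Connection con = DriverManager.getConnection("jdbc:sqlite:" + args[0]);
Statement stmt = con.createStatement();
ResultSet row = stmt.executeQuery("select itemID,title,dateAdded from items order by dateAdded");
while(row.next())
{
System.out.println(row.getInt(1) + "\t" + row.getString(2) + "\t" + row.getString(3));
}
row.close();
stmt.close();
con.close();
}
}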


About sqlite: There is a video about this engine on google-video.


23 October 2006

Things about Bioinformatics #1

Hum, too much work, I haven't had enough time to write original posts on this blog. So, as Deepak Singh did on business|bytes|genes|molecules, I'm also starting a series of posts about things I noticed:

dbSNP: From dbSNP: Users may now query genotypes for 1 or more snps by rs#, chromosome location, or gene Id. SNP properties and populations are specified prior to retrieval. Output includes HTML, XML, Text and HaploView format by population.
http://www.ncbi.nlm.nih.gov/projects/SNP/snp_gf.cgi

Calendar: I've started to use google's calendar and I created a shared one about bioinformatics (mostly for testing, but I'll try to keep it up to date). Here is its address in XML and in iCal format. (See also: upcoming.org)

Connotea: Ben Lund has created a new (beta) tool for connotea called Notea: Notea is a Firefox Extension for storing and organizing local copies of online works. It optionally integrates with Connotea to allow easy sharing of bookmarks for your locally stored works.

Del.icio.us: from time to time, I synchronize my tagged bookmarks from connotea to del.icio.us (using xslt and curl). Del.icio.us has a few features that are missing from connotea: discovering sites using the network, suggesting links, tag bundles....

Comics: want to know more about North Korea? Do read this comic: "Pyongyang: A Journey to North Korea."



Comics: 300, a film based on Frank Miller's graphic novel 300 about the battle of Thermopylae, will be in theaters in 2007. The comparison between the book and the movie is astonishing!

The book:


The movie:


Pierre

05 October 2006

A Google Gadget for Bioinformatics.

As Pedro said on his blog, Google announced that their widgets (or gadgets) can now be used on third-party webpages. I've written ([here]) a google gadget searching pubmed or genbank using the NCBI EUtilities. Enjoy!





Add to Google



Pierre

Updated 2010-08-12: source code

<?xml version="1.0" encoding="UTF-8"?>
<Module>
<ModulePrefs
title="Search NCBI/__UP_db__"
directory_title="Search NCBI/__UP_db__"
description="Search Pubmed/Genbank at the National Center for Biotechnology Information (NCBI)"
author="Pierre Lindenbaum PhD"
author_email="plindenbaum+gadget4ncbi@yahoo.fr"
title_url="http://www.urbigene.com/googlegadget/index.html#gadget4ncbi"
author_affiliation="Integragen"
author_location="Evry,France"
height="200"
width="320"
scrolling="true"
singleton="true"
screenshot="http://www.urbigene.com/googlegadget/ncbi_screenshot.png"
thumbnail="http://www.urbigene.com/googlegadget/ncbi_thumbnail.png"
author_photo="http://www.urbigene.com/googlegadget/plindenbaum.png"
author_link="http://plindenbaum.blogspot.com"
author_aboutme="Bioinformatician at Integragen"
author_quote="A child of five would understand this. Send someone to fetch a child of five. "
>
<Locale lang="en"/>
<Locale lang="ja"/>
<Locale lang="fr"/>
<Locale lang="de"/>
<Locale lang="zn-cn"/>
<Locale lang="zh-tw"/>
</ModulePrefs>
<UserPref name="term"
display_name="Your Query"
datatype="string"
required="true"
default_value="Rotavirus"
/>
<UserPref name="db"
display_name="Database"
datatype="enum"
required="true"
default_value="pubmed">
<EnumValue value="pubmed" display_value="Pubmed"/>
<EnumValue value="nucleotide" display_value="Nucleotide"/>
<EnumValue value="protein" display_value="Protein"/>
</UserPref>
<UserPref name="retmax"
display_name="Number of items retrieved"
datatype="enum"
required="true"
default_value="10">
<EnumValue value="1" display_value="1"/>
<EnumValue value="2" display_value="2"/>
<EnumValue value="10" display_value="10"/>
<EnumValue value="15" display_value="15"/>
<EnumValue value="20" display_value="20"/>
<EnumValue value="25" display_value="25"/>
<EnumValue value="50" display_value="50"/>
<EnumValue value="100" display_value="100"/>
</UserPref>
<Content type="html"><![CDATA[
<style>
#content__MODULE_ID__ {
font-size: 9pt;
margin: 5px;
background-color: #FFFFBF;
}
dt {
font-size: 9pt;
}
dd {
font-size: 7pt;
}

</style>
<div id="content__MODULE_ID__"></div>
<script type="text/javascript">

var prefs__MODULE_ID__ = new _IG_Prefs(__MODULE_ID__);

function dom1__MODULE_ID__(node,tag)
{
if(node==null || tag==null) return null;
var n=node.firstChild;
while(n!=null)
{
if(n.nodeName==tag) return n;
n=n.nextSibling;
}
return null;
}

function dom2__MODULE_ID__(node)
{
if(node==null) return "undefined";
var c= node.firstChild;
if(c==null) return "undefined";
return c.nodeValue;
}

function pubmed__MODULE_ID__(response)
{
var nodes = response.getElementsByTagName("PubmedArticle");
var html="<div><dl>";
for (var i = 0; i < nodes.length ; i++)
{
var MedlineCitation= dom1__MODULE_ID__(nodes.item(i),"MedlineCitation");
var PMID= dom1__MODULE_ID__(MedlineCitation,"PMID");
var Article = dom1__MODULE_ID__(MedlineCitation,"Article");
var ArticleTitle = dom1__MODULE_ID__(Article,"ArticleTitle");
var Title = dom1__MODULE_ID__(dom1__MODULE_ID__(MedlineCitation,"MedlineJournalInfo"),"MedlineTA");
var AuthorList = dom1__MODULE_ID__(Article,"AuthorList");
var Author = dom1__MODULE_ID__(AuthorList,"Author");
var LastName = dom1__MODULE_ID__(Author,"LastName");

html+='<dt><a title=\"show this Article\" href=\"http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=pubmed&amp;cmd=Retrieve&amp;dopt=AbstractPlus&amp;list_uids='+dom2__MODULE_ID__(PMID)+
'\" target=\"pubmed'+dom2__MODULE_ID__(PMID)+
'\">'+dom2__MODULE_ID__(ArticleTitle)+'</a></dt>';
html+='<dd><i>'+dom2__MODULE_ID__(Title)+'</i>. <u>'+dom2__MODULE_ID__(LastName)+'</u> &amp; al.</dd>';
}

html+="</dl></div>";
_gel("content__MODULE_ID__").innerHTML = html;

}

function sequence__MODULE_ID__(response)
{
var nodes = response.getElementsByTagName("TSeq");
var html="<div><dl>";
for (var i = 0; i < nodes.length ; i++)
{
var c=nodes.item(i);
var TSeq_seqtype = dom1__MODULE_ID__(c,"TSeq_seqtype");
var TSeq_gi = dom1__MODULE_ID__(c,"TSeq_gi");
var TSeq_accver = dom1__MODULE_ID__(c,"TSeq_accver");
var TSeq_orgname = dom1__MODULE_ID__(c,"TSeq_orgname");
var TSeq_defline = dom1__MODULE_ID__(c,"TSeq_defline");
var TSeq_length = dom1__MODULE_ID__(c,"TSeq_length");

html+='<dt><a title=\"Show this Sequence\" href=\"http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db='+
TSeq_seqtype.getAttribute("value")+
'&amp;val='+dom2__MODULE_ID__(TSeq_gi)+
'\" target=\"seq'+dom2__MODULE_ID__(TSeq_gi)+
'\">'+dom2__MODULE_ID__(TSeq_defline)+'</a></dt>'+
'<dd><i>'+
dom2__MODULE_ID__(TSeq_accver)+
'</i>. <i>'+dom2__MODULE_ID__(TSeq_orgname)+'</i>. (length:'+dom2__MODULE_ID__(TSeq_length)+')</dd>';
}

html+="</dl></div>";


_gel("content__MODULE_ID__").innerHTML = html;
}


function dofetch__MODULE_ID__(response)
{
if (response == null ||
typeof(response) != "object" ||
response.firstChild == null)
{
_gel("content__MODULE_ID__").innerHTML = "<i>Invalid data for efetch</i>";
return;
}
if(prefs__MODULE_ID__.getString("db")=="pubmed")
{
pubmed__MODULE_ID__(response);
}
else
{
sequence__MODULE_ID__(response);
}
}

function doesearch__MODULE_ID__(response)
{

if (response == null ||
typeof(response) != "object" ||
response.firstChild == null)
{
_gel("content__MODULE_ID__").innerHTML = "<i>Invalid data for search</i>";
return;
}
var nodes = response.getElementsByTagName("QueryKey");
if(nodes==null || nodes.length!=1)
{
_gel("content__MODULE_ID__").innerHTML = "<i>Error with QueryKey</i>";
return;
}
var QueryKey=nodes.item(0).firstChild.nodeValue;
nodes = response.getElementsByTagName("WebEnv");
if(nodes==null || nodes.length!=1)
{
_gel("content__MODULE_ID__").innerHTML = "<i>Error with WebEnv</i>";
return;
}
var WebEnv=nodes.item(0).firstChild.nodeValue;
var url= "http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db="+
prefs__MODULE_ID__.getString("db")+
"&WebEnv="+escape(WebEnv)+
"&query_key="+escape(QueryKey)+
"&tool=gadget4ncbi"+
"&retmode=xml"+
"&usehistory=y"+
"&retmax="+prefs__MODULE_ID__.getString("retmax");

if(prefs__MODULE_ID__.getString("db")!="pubmed")
{
url+="&rettype=fasta";
}

_IG_FetchXmlContent(url,dofetch__MODULE_ID__);
}

function dogadget__MODULE_ID__()
{
var url= "http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db="+
prefs__MODULE_ID__.getString("db")+
"&term="+escape(prefs__MODULE_ID__.getString("term"))+
"&tool=gadget4ncbi"+
"&retmode=xml"+
"&usehistory=y"+
"&retmax="+prefs__MODULE_ID__.getString("retmax");
_IG_FetchXmlContent(url,doesearch__MODULE_ID__);
}

_IG_RegisterOnloadHandler(dogadget__MODULE_ID__);

</script>
]]></Content>
</Module>

26 September 2006

MYSQL UDF, trees of data, hierarchy

In a previous post I described how to write a mysql user defined function (UDF) in C to create a translate-dna-to-protein function for mysql. Now, I've been playing with hierarchies and I wrote a few UDFs to explore a tree of data; I guess this could have been done with a mysql stored procedure but I'm currently using an old server (and I need a deeper knowledge of mysql5 :-) ). The source code is available [here].
The tree itself is written as a sorted static const array of data, so you cannot modify it via a mysql command; you have to recompile the code instead.

typedef struct Taxonomy
{
/** the ncbi-id */
taxon_type_t tax_id;
/** id of the parent */
taxon_type_t parent_id;
/** scientific name */
char name[TAXON_NAME_SIZE];
}Taxon,*const TaxonPtr;


It may also not be suitable for large trees. In my example, I've been using a subset of the NCBI taxonomy:

static const Taxon all_taxons[]={
{1, -1, "root"},
{2759, 131567, "Eukaryota"},
{6072, 33208, "Eumetazoa"},
{7711, 33511, "Chordata"},
{7742, 89593, "Vertebrata"},
{7776, 7742, "Gnathostomata"},
{8287, 117571, "Sarcopterygii"},
{9347, 32525, "Eutheria"},
{9443, 314146, "Primates"},
...)
};


but one could imagine something like this...

{1,-1,"root"},
{2,1,"world"},
{3,2,"europe"},
{4,3,"france"},
...


or...

{1,-1,"rdf:Property"},
{2,1,"foaf:knows"},
{3,2,"rel:friendOf"},
{4,2,"rel:childOf"},
...


The source taxonudf.c was successfully compiled on my computer using the following command-line.

gcc -fPIC -shared -DDBUG_OFF -O3 -I/usr/include/mysql -lmysqlclient -o /usr/lib/taxonudf.o taxonudf.c


taxon_name returns the scientific name of an organism from an ncbi-tax-id.

mysql> create function taxon_name
returns string soname "taxonudf.o";
Query OK, 0 rows affected (0,00 sec)

mysql> select taxon_name(9606);
+------------------+
| taxon_name(9606) |
+------------------+
| Homo sapiens |
+------------------+
1 row in set (0,43 sec)


taxon_id returns the ncbi-tax-id of an organism from its name.

mysql> create function taxon_id
returns integer soname "taxonudf.o";
Query OK, 0 rows affected (0,00 sec)

mysql> select taxon_id("Homo sapiens");
+--------------------------+
| taxon_id("Homo sapiens") |
+--------------------------+
| 9606 |
+--------------------------+
1 row in set (0,00 sec)


taxon_childof returns whether a node in the hierarchy is a descendant of another node.

mysql> create function taxon_childof
returns integer soname "taxonudf.o";
Query OK, 0 rows affected (0,00 sec)

mysql> select taxon_childof(taxon_id("Homo"),taxon_id("Homo Sapiens"))
as "is Homo child of Homo.Sapiens ?";
+---------------------------------+
| is Homo child of Homo.Sapiens ? |
+---------------------------------+
| 0 |
+---------------------------------+
1 row in set (0,00 sec)


mysql> select taxon_childof(taxon_id("Homo Sapiens"),taxon_id("Homo"))
as "Is Homo.Sapiens descendant of Homo ?";
+--------------------------------------+
| Is Homo.Sapiens descendant of Homo ? |
+--------------------------------------+
| 1 |
+--------------------------------------+
1 row in set (0,00 sec)



taxon_com is an aggregate function which finds the common ancestral node of a set of nodes.

mysql> create aggregate function taxon_com
returns integer soname "taxonudf.o";
Query OK, 0 rows affected (0,00 sec)

mysql> create temporary table t1(cluster varchar(20),taxon int);
Query OK, 0 rows affected (0,07 sec)

mysql> insert into t1(cluster,taxon) values("A",251093),
("A",9781), ("A",37348),("B",9605),("B",9606),("B",63221),
("C",32523),("C",33154),("C",7776),("C",9443);
Query OK, 10 rows affected (0,03 sec)
Records: 10 Duplicates: 0 Warnings: 0

mysql> select cluster,taxon as "ncbi-id",taxon_name(taxon) as "Name",taxon_childof(taxon,taxon_id("Primates")) as "Is_Primate" from t1;
+---------+---------+-------------------------------+------------+
| cluster | ncbi-id | Name | Is_Primate |
+---------+---------+-------------------------------+------------+
| A | 251093 | Elephas antiquus | 0 |
| A | 9781 | Elephantidae gen. sp. | 0 |
| A | 37348 | Mammuthus | 0 |
| B | 9605 | Homo | 1 |
| B | 9606 | Homo sapiens | 1 |
| B | 63221 | Homo sapiens neanderthalensis | 1 |
| C | 32523 | Tetrapoda | 0 |
| C | 33154 | Fungi/Metazoa group | 0 |
| C | 7776 | Gnathostomata | 0 |
| C | 9443 | Primates | 1 |
+---------+---------+-------------------------------+------------+
10 rows in set (0,00 sec)

mysql> select cluster,taxon_com(taxon) as ncbi_id
from t1 group by cluster;
+---------+---------+
| cluster | ncbi_id |
+---------+---------+
| A | 9780 |
| B | 9605 |
| C | 33154 |
+---------+---------+
3 rows in set (0.00 sec)


That's all folks

03 September 2006

Scott McCloud at SciFoo 2006

I've just discovered on Flickr that Scott McCloud was present at SciFoo 2006.

scifoo


McCloud is the author of Understanding Comics, one of the best books about comics I've ever read.

scifoo


In this book, McCloud introduced a map of visual iconography that took the shape of a triangle.

The lower left corner was visual resemblance (e.g., photography and realistic painting). The lower right included the products of what he called iconic abstraction (e.g., cartooning). And at the top were the denizens of the picture plane ("pure" abstraction) which ceased to make reference to any visual phenomena other than themselves. The move from realism to cartoons along the bottom edge was a move away from resemblance that still retained "meaning," so words, the next logical step in the progression, were included at far right, thereby enclosing anything in comics' visual vocabulary between the three points.

See also: http://www.scottmccloud.com/inventions/triangle/triangle.html


29 August 2006

My own little scifoo camp 2006.

Back from holidays at Montresor where, as the leader of Nature Network BrieComte Robert, I organized my own little private rainy scifoo camp.

My own private scifoo camp


The reason you didn't see me at SciFoo 2006

The reason you didn't see me at SciFoo 2006
The reason you didn't see me at SciFoo 2006


11 August 2006

The Life Sciences Semantic Web is Full of Creeps!

An article published in "Briefings in Bioinformatics Advance Access".

The Life Sciences Semantic Web is Full of Creeps!


Benjamin M. Good and Mark D. Wilkinson
Abstract:The Semantic Web for the Life Sciences (SWLS), when realized, will dramatically improve our ability to conduct bioinformatics analyses using the vast and growing stores of web-accessible resources. This ability will be achieved through the widespread acceptance and application of standards for naming, representing, describing and accessing biological information. The W3C-led Semantic Web initiative has established most, if not all, of the standards and technologies needed to achieve a unified, global SWLS. Unfortunately, the bioinformatics community has, thus far, appeared reluctant to fully adopt them. Rather, we are seeing what could be described as ‘semantic creep’--timid, piecemeal and ad hoc adoption of parts of standards by groups that should be stridently taking a leadership role for the community. We suggest that, at this point, the primary hindrances to the creation of the SWLS may be social rather than technological in nature, and that, like the original Web, the establishment of the SWLS will depend primarily on the will and participation of its consumers.

Mark Wilkinson is one of the creators of BioMoby. BioMoby is a system for interoperability between biological data hosts and analytical services. Benjamin Good is a PhD student in the British Columbia Strategic Training Program in Bioinformatics. Both of them have a profile on connotea (users bgood, mwilkinson; group: Wilkinson Laboratory).

Although I'm convinced that the semantic web/RDF/XML model is the format of choice for any application (please! use it for your output format!), I admit I never had the time and the technical knowledge about web services to really understand how BioMoby works, why I should use it, and why I should use an LSID instead of a good old URI... :-)

hey, I'm going to ask them for an offprint :-)

09 August 2006

A Bookmarklet for Offprint Requests

Hi, I'm pleased to share the javascript Bookmarklet I wrote today. A bookmarklet is a small JavaScript program that can be stored as a URL within a bookmark in most popular web browsers, or within hyperlinks on a web page. This bookmarklet opens a new mail in thunderbird, filled with a message requesting an offprint of an article. The first <a href="mailto:xxx@xxx.xxx"> tag found in the current page is used as the recipient of the mail and the subject is the title of the current page.


Here is the bookmarklet (you have to modify it by editing its properties in order to include your own message...):

Drag this Link: Offprint Request up to your Bookmarks Toolbar.


The bookmarklet was successfully tested on firefox/thunderbird with Bioinformatics: Building chromosome-wide LD maps Bioinformatics 2006 22(16):1933-1934 and NAR SYBR Green real-time telomeric repeat amplification protocol for the rapid quantification of telomerase activity Nucleic Acids Research, 2002, Vol. 31, No. 2 e3.

Example of mail generated from the previous paper from NAR:

From: me
To: xxxx@ucdavis.xxx
Subject: [offprint request] SYBR Green real-time telomeric repeat amplification protocol for the rapid quantification of telomerase activity -- Wege et al. 31 (2): e3 -- Nucleic Acids Research

Hi,
my name is Bruce Banner, I'm a nuclear physicist working on gamma radiations at Los-Alamos. Your recent paper titled

"SYBR Green real-time telomeric repeat amplification protocol for the rapid quantification of telomerase activity -- Wege et al. 31 (2): e3 -- Nucleic Acids Research"

caught my attention.
Would it be possible for you to forward me a PDF copy ? I thank you in advance.

Best Regards.

B. Banner

--
Bruce Banner PhD.
Gamma Radiation Laboratory
Los Alamos
http://www.marvel.com/universe/Hulk