YOKOFAKUN: Taxonomy and Semantic Web: writing an extension for ARQ/SPARQL

In this post I'll show how I've implemented a custom function in ARQ, the SPARQL engine of Jena, for querying an RDF graph. The new function tests whether a node in the NCBI taxonomy hierarchy has a given ancestor.

Requirements

Jena/ARQ: http://jena.sourceforge.net/ARQ/
A Java 1.6 compiler
nodes.dmp, the NCBI taxonomy downloaded from ftp://ftp.ncbi.nih.gov/pub/taxonomy/

Here is a sample of the very first lines of nodes.dmp: the first column is the node-id of the taxon, the second column is its parent-id.

cat nodes.dmp | cut -c 1-20 | head
  |    1    |    no rank    |        |
  |    131567    |    superki
  |    335928    |    genus    |
  |    6    |    species    |    AC
  |    32199    |    species   
  |    135621    |    genus   
  |    10    |    species    |   
  |    203488    |    genus   
  |    13    |    species    |   
  |    32011    |    genus    |

The input

Our input is an RDF file:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:tax="http://species.lindenb.org"
>

<tax:Individual rdf:about="http://fr.wikipedia.org/wiki/Tintin">
<dc:title xml:lang="fr">Tintin</dc:title>
<dc:title xml:lang="en">Tintin</dc:title>
<tax:taxon rdf:resource="lsid:ncbi.nlm.nih.gov:taxonomy:9606"/>
</tax:Individual>

<tax:Individual rdf:about="http://fr.wikipedia.org/wiki/Babar">
<dc:title xml:lang="fr">Babar</dc:title>
<dc:title xml:lang="en">Babar</dc:title>
<tax:taxon rdf:resource="lsid:ncbi.nlm.nih.gov:taxonomy:9785"/>
</tax:Individual>

<tax:Individual rdf:about="http://fr.wikipedia.org/wiki/Milou">
<dc:title xml:lang="fr">Milou</dc:title>
<dc:title xml:lang="en">Snowy</dc:title>
<tax:taxon rdf:resource="lsid:ncbi.nlm.nih.gov:taxonomy:9615"/>
</tax:Individual>

<tax:Individual rdf:about="http://fr.wikipedia.org/wiki/Donald_Duck">
<dc:title xml:lang="fr">Donald</dc:title>
<dc:title xml:lang="en">Donald Duck</dc:title>
<tax:taxon rdf:resource="lsid:ncbi.nlm.nih.gov:taxonomy:8839"/>
</tax:Individual>

<tax:Individual rdf:about="http://fr.wikipedia.org/wiki/Le_L%C3%A9zard">
<dc:title xml:lang="fr">Lezard</dc:title>
<dc:title xml:lang="en">Lizard</dc:title>
<dc:title xml:lang="fr">Curt Connors</dc:title>
<dc:title xml:lang="en">Curt Connors</dc:title>
<tax:taxon rdf:resource="lsid:ncbi.nlm.nih.gov:taxonomy:9606"/>
<tax:taxon rdf:resource="lsid:ncbi.nlm.nih.gov:taxonomy:8504"/>
</tax:Individual>

</rdf:RDF>

Images via Wikipedia: Tintin & Snowy, Babar, Donald, The Lizard

Basically, this file describes:

5 individuals: Tintin (human), Snowy (dog), Donald (duck), Babar (elephant) and Dr Connors/The Lizard (Spider-Man's foe)
Each individual is unambiguously identified by its URI in Wikipedia
Each individual is named in English and in French
For each individual, its ID in the NCBI hierarchy is specified using a simple URI (here I've tried to use an LSID, but it could have been something else, such as a URL)

A basic query

The following SPARQL query retrieves the URI, the taxon and the English name of each individual.

The query

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dc:  <http://purl.org/dc/elements/1.1/>
PREFIX tax:  <http://species.lindenb.org>

SELECT ?individual ?taxon ?title
  {
  ?individual a tax:Individual .
  ?individual dc:title ?title .
  ?individual tax:taxon ?taxon .
  FILTER langMatches( lang(?title), "en" )
  }

Invoking ARQ

arq --query query01.rq --data taxonomy.rdf

Result

-------------------------------------------------------------------------------------------------------------
| individual                                    | taxon                                 | title             |
=============================================================================================================
| <http://fr.wikipedia.org/wiki/Le_L%C3%A9zard> | <lsid:ncbi.nlm.nih.gov:taxonomy:8504> | "Curt Connors"@en |
| <http://fr.wikipedia.org/wiki/Le_L%C3%A9zard> | <lsid:ncbi.nlm.nih.gov:taxonomy:9606> | "Curt Connors"@en |
| <http://fr.wikipedia.org/wiki/Le_L%C3%A9zard> | <lsid:ncbi.nlm.nih.gov:taxonomy:8504> | "Lizard"@en       |
| <http://fr.wikipedia.org/wiki/Le_L%C3%A9zard> | <lsid:ncbi.nlm.nih.gov:taxonomy:9606> | "Lizard"@en       |
| <http://fr.wikipedia.org/wiki/Donald_Duck>    | <lsid:ncbi.nlm.nih.gov:taxonomy:8839> | "Donald Duck"@en  |
| <http://fr.wikipedia.org/wiki/Milou>          | <lsid:ncbi.nlm.nih.gov:taxonomy:9615> | "Snowy"@en        |
| <http://fr.wikipedia.org/wiki/Babar>          | <lsid:ncbi.nlm.nih.gov:taxonomy:9785> | "Babar"@en        |
| <http://fr.wikipedia.org/wiki/Tintin>         | <lsid:ncbi.nlm.nih.gov:taxonomy:9606> | "Tintin"@en       |
-------------------------------------------------------------------------------------------------------------

Adding a custom function

Now, I want to add a new function to SPARQL. This function, 'isA', takes two parameters as input: the taxon/LSID of the child and the taxon/LSID of the parent. It returns the boolean 'true' if the 'child' has the 'parent' in its lineage. The new function is implemented by extending the class com.hp.hpl.jena.sparql.function.FunctionBase2. The class contains an associative array, child2parent, mapping each taxon-id to its parent. This map is loaded as described below:

                Pattern pat= Pattern.compile("[ \t]*\\|[ \t]*");
                String line;
                BufferedReader r= new BufferedReader(new FileReader(TAXONOMY_NODES_PATH));
                while((line=r.readLine())!=null)
                    {
                    String tokens[]=pat.split(line, 3);
                    this.child2parent.put(
                            Integer.parseInt(tokens[0]),
                            Integer.parseInt(tokens[1])
                            );
                    }
                r.close();
(...)
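
To make the splitting step concrete, here is a minimal, self-contained sketch that applies the same regular expression to a few lines held in memory instead of the real file. The class name NodesDmpDemo, the method names and the three sample lines are assumptions made up for this illustration (the ids are toy values, not copied from nodes.dmp); only the pattern and the limit of 3 come from the snippet above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Illustrative sketch: NodesDmpDemo, parseLine and load are hypothetical names.
class NodesDmpDemo {
    // Same delimiter as in the post: a pipe surrounded by optional spaces or tabs.
    static final Pattern DELIM = Pattern.compile("[ \t]*\\|[ \t]*");

    // Hand-written lines in the nodes.dmp layout (tab, pipe, tab); toy ids, not real taxa.
    static final String[] SAMPLE = {
        "1\t|\t1\t|\tno rank\t|\t\t|",
        "10\t|\t1\t|\tgenus\t|\t\t|",
        "20\t|\t10\t|\tspecies\t|\tXX\t|"
    };

    // Returns {childId, parentId} for one line.
    static int[] parseLine(String line) {
        // Limit 3: only the first two columns are split off, the rest stays in tokens[2].
        String[] tokens = DELIM.split(line, 3);
        return new int[] { Integer.parseInt(tokens[0]), Integer.parseInt(tokens[1]) };
    }

    // Builds the child -> parent map from any sequence of lines.
    static Map<Integer, Integer> load(Iterable<String> lines) {
        Map<Integer, Integer> child2parent = new HashMap<Integer, Integer>();
        for (String line : lines) {
            int[] ids = parseLine(line);
            child2parent.put(ids[0], ids[1]);
        }
        return child2parent;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> map = load(Arrays.asList(SAMPLE));
        System.out.println(map.get(20)); // 10
        System.out.println(map.get(1));  // 1: the root is its own parent
    }
}
```

Passing a limit of 3 to split keeps the parsing cheap, since the remaining columns of each line are never tokenized.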

The function 'exec' checks that both arguments are URIs and then invokes the method isChildOf:

  public NodeValue exec(NodeValue childNode, NodeValue parentNode)
    {
    // (...) check that both nodes are URIs and extract childId and parentId
    return NodeValue.makeBoolean(isChildOf(childId,parentId));
    }
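
The elided check boils down to turning an LSID such as lsid:ncbi.nlm.nih.gov:taxonomy:9606 into the integer 9606 and rejecting everything else. Here is a minimal sketch of that step in isolation, without any Jena types. The class LsidDemo and the method taxonId are hypothetical names introduced for this example; the complete class further down does the same work inline.

```java
// Illustrative sketch: LsidDemo and taxonId are hypothetical names.
class LsidDemo {
    // Same prefix as the LSID constant of the isA class.
    static final String LSID = "lsid:ncbi.nlm.nih.gov:taxonomy:";

    // Returns the numeric taxon id, or null when the URI is not a
    // well-formed NCBI taxonomy LSID.
    static Integer taxonId(String uri) {
        if (uri == null || !uri.startsWith(LSID)) return null;
        try {
            return Integer.valueOf(uri.substring(LSID.length()));
        } catch (NumberFormatException e) {
            return null; // empty or non-numeric suffix
        }
    }

    public static void main(String[] args) {
        System.out.println(taxonId("lsid:ncbi.nlm.nih.gov:taxonomy:9606")); // 9606
        System.out.println(taxonId("http://fr.wikipedia.org/wiki/Tintin")); // null
    }
}
```

Returning null for anything unexpected mirrors the choice made in exec, where a malformed argument makes the filter evaluate to false instead of raising an error.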

The function 'isChildOf' walks up the map child2parent to check whether the parent is an ancestor of the child:

       while(true)
            {
            Integer id= child2parent.get(childid);
            if(id==null || id==childid) return false;
            if(id==parentid) return true;
            childid=id;
            }
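
To see how the loop behaves, it can be run against a tiny hand-made tree. The sketch below is only an illustration: the class AncestorWalkDemo, the method toyTree and the ids 1 to 4 are made up for this example, and the map is passed as a parameter instead of being a field. As in the complete class below, a taxon is treated as a match for itself.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: AncestorWalkDemo and toyTree are hypothetical names.
class AncestorWalkDemo {
    static boolean isChildOf(Map<Integer, Integer> child2parent, int childId, int parentId) {
        if (childId == parentId) return true; // a taxon matches itself
        while (true) {
            Integer id = child2parent.get(childId);
            // Unknown id, or the root (which is its own parent): nothing left to climb.
            if (id == null || id == childId) return false;
            if (id == parentId) return true;
            childId = id; // move one level up
        }
    }

    // Toy tree: 1 is the root, 2 and 4 are children of 1, 3 is a child of 2.
    static Map<Integer, Integer> toyTree() {
        Map<Integer, Integer> map = new HashMap<Integer, Integer>();
        map.put(1, 1);
        map.put(2, 1);
        map.put(3, 2);
        map.put(4, 1);
        return map;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> tree = toyTree();
        System.out.println(isChildOf(tree, 3, 1));  // true: 3 -> 2 -> 1
        System.out.println(isChildOf(tree, 3, 4));  // false: 4 is not on the path to the root
        System.out.println(isChildOf(tree, 99, 1)); // false: 99 is unknown
    }
}
```

The loop stops at the root because the root is recorded as its own parent, which is how the first line of the nodes.dmp sample above lists it. A cycle anywhere else in the data would make the loop run forever, so this relies on the file being a proper tree.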

Here is the complete source code of this class:

package org.lindenb.arq4taxonomy;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

import com.hp.hpl.jena.sparql.expr.ExprEvalException;
import com.hp.hpl.jena.sparql.expr.NodeValue;
import com.hp.hpl.jena.sparql.function.FunctionBase2;

public class isA
    extends FunctionBase2
    {
    public static final String LSID="lsid:ncbi.nlm.nih.gov:taxonomy:";
    public static final String TAXONOMY_NODES_PATH="/home/lindenb/tmp/TAXONOMY_NCBI/nodes.dmp";
    private Map<Integer, Integer> child2parent=null;
   
    public isA()
        {
       
        }
    /**
     * Returns an associative map child.id -> parent.id,
     * loaded from nodes.dmp on the first call.
     * @return the child-to-parent map
     */
    private Map<Integer, Integer> getTaxonomy()
        {
        if(this.child2parent==null)
            {
            this.child2parent= new HashMap<Integer, Integer>();
            try
                {
                Pattern pat= Pattern.compile("[ \t]*\\|[ \t]*");
                String line;
                BufferedReader r= new BufferedReader(new FileReader(TAXONOMY_NODES_PATH));
                while((line=r.readLine())!=null)
                    {
                    String tokens[]=pat.split(line, 3);
                    this.child2parent.put(
                            Integer.parseInt(tokens[0]),
                            Integer.parseInt(tokens[1])
                            );
                    }
                r.close();
                System.err.println(this.child2parent.size());
                }
            catch(IOException err)
                {
                err.printStackTrace();
                throw new ExprEvalException(err);
                }
            }
        return this.child2parent;
        }
   
    private boolean isChildOf(int childid,int parentid)
        {
        if(childid==parentid) return true;
        Map<Integer,Integer> map= getTaxonomy();
        while(true)
            {
            Integer id= map.get(childid);
            if(id==null || id==childid) return false;
            if(id==parentid) return true;
            childid=id;
            }
        }
   
    @Override
    public NodeValue exec(NodeValue childNode, NodeValue parentNode)
        {

        if( childNode.isLiteral() ||
            parentNode.isLiteral() ||
            childNode.asNode().isBlank() ||
            parentNode.asNode().isBlank())
            {
            return NodeValue.makeBoolean(false);
            }

        String childURI = childNode.asNode().getURI();
        if(!childURI.startsWith(LSID))
            {
            return NodeValue.makeBoolean(false);
            }
       

        String parentURI = parentNode.asNode().getURI();
        if(!parentURI.startsWith(LSID))
            {
            return NodeValue.makeBoolean(false);
            }

        int childId=0;
        try {
            childId= Integer.parseInt(childURI.substring(LSID.length()));
            }
        catch (NumberFormatException e)
            {
            return NodeValue.makeBoolean(false);
            }
       
        int parentId=0;
        try {
            parentId= Integer.parseInt(parentURI.substring(LSID.length()));
            }
        catch (NumberFormatException e)
            {
            return NodeValue.makeBoolean(false);
            }
   
        return NodeValue.makeBoolean(isChildOf(childId,parentId));
        }
   
    }

This class is then compiled and packaged into the file tax.jar:

    javac -cp ${ARQ_CLASSPATH}:. -sourcepath src src/org/lindenb/arq4taxonomy/isA.java
    jar cvf tax.jar -C src org

and we add this jar in the classpath:

export CP=$PWD/tax.jar

To tell ARQ about this new function, we just add its Java package as a new PREFIX in the SPARQL query:

PREFIX fn: <java:org.lindenb.arq4taxonomy.>
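
Why does a prefix suffice? With this PREFIX, fn:isA expands to the URI java:org.lindenb.arq4taxonomy.isA, and ARQ treats the part after java: as the name of a class to load from the classpath. The sketch below mimics that idea with plain reflection. It is a simplified illustration and not ARQ's actual loading code; JavaUriDemo and its methods are hypothetical names, and java.util.ArrayList stands in for the isA class so that the example runs without tax.jar.

```java
// Illustrative sketch only: JavaUriDemo and its methods are hypothetical names,
// and this is not ARQ's actual loading code.
class JavaUriDemo {
    static final String SCHEME = "java:";

    // A prefixed name is expanded by concatenating the prefix namespace and the local name.
    static String expand(String namespace, String localName) {
        return namespace + localName;
    }

    // Strips the "java:" scheme to obtain a fully qualified class name.
    static String classNameOf(String functionUri) {
        if (!functionUri.startsWith(SCHEME)) {
            throw new IllegalArgumentException("not a java: URI: " + functionUri);
        }
        return functionUri.substring(SCHEME.length());
    }

    // Loads the class by name and calls its public no-argument constructor.
    static Object instantiate(String functionUri) {
        try {
            return Class.forName(classNameOf(functionUri)).getDeclaredConstructor().newInstance();
        } catch (Exception e) {
            throw new IllegalStateException("cannot load " + functionUri, e);
        }
    }

    public static void main(String[] args) {
        String uri = expand("java:org.lindenb.arq4taxonomy.", "isA");
        System.out.println(classNameOf(uri)); // org.lindenb.arq4taxonomy.isA
        // java.util.ArrayList stands in for isA, which is only loadable with tax.jar on the classpath.
        System.out.println(instantiate("java:java.util.ArrayList").getClass().getName()); // java.util.ArrayList
    }
}
```

This also explains two details of the class above: its name is the lowercase-initial isA because it must match the local name used in the query, and it has a public constructor without arguments so that it can be instantiated by name.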

First test

The following SPARQL query retrieves all the mammals (http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Undef&id=40674) in the data set.

The query

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dc:  <http://purl.org/dc/elements/1.1/>
PREFIX tax:  <http://species.lindenb.org>
PREFIX fn: <java:org.lindenb.arq4taxonomy.>

SELECT ?individual ?taxon ?title
{
?individual a tax:Individual .
?individual dc:title ?title .
?individual tax:taxon ?taxon .
 FILTER fn:isA(?taxon,<lsid:ncbi.nlm.nih.gov:taxonomy:40674> )
 FILTER langMatches( lang(?title), "en" )
}

The command line

arq --query query02.rq --data taxonomy.rdf

The result

-------------------------------------------------------------------------------------------------------------
| individual                                    | taxon                                 | title             |
=============================================================================================================
| <http://fr.wikipedia.org/wiki/Le_L%C3%A9zard> | <lsid:ncbi.nlm.nih.gov:taxonomy:9606> | "Curt Connors"@en |
| <http://fr.wikipedia.org/wiki/Le_L%C3%A9zard> | <lsid:ncbi.nlm.nih.gov:taxonomy:9606> | "Lizard"@en       |
| <http://fr.wikipedia.org/wiki/Milou>          | <lsid:ncbi.nlm.nih.gov:taxonomy:9615> | "Snowy"@en        |
| <http://fr.wikipedia.org/wiki/Babar>          | <lsid:ncbi.nlm.nih.gov:taxonomy:9785> | "Babar"@en        |
| <http://fr.wikipedia.org/wiki/Tintin>         | <lsid:ncbi.nlm.nih.gov:taxonomy:9606> | "Tintin"@en       |
-------------------------------------------------------------------------------------------------------------

Second query

The following SPARQL query retrieves all the Sauropsida (http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Undef&id=8457) in the RDF file.

The SPARQL file

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dc:  <http://purl.org/dc/elements/1.1/>
PREFIX tax:  <http://species.lindenb.org>
PREFIX fn: <java:org.lindenb.arq4taxonomy.>

SELECT ?individual ?taxon ?title
{
?individual a tax:Individual .
?individual dc:title ?title .
?individual tax:taxon ?taxon .
 FILTER fn:isA(?taxon,<lsid:ncbi.nlm.nih.gov:taxonomy:8457> )
 FILTER langMatches( lang(?title), "en" )
}

Command line

arq --query query03.rq --data taxonomy.rdf

The result

-------------------------------------------------------------------------------------------------------------
| individual                                    | taxon                                 | title             |
=============================================================================================================
| <http://fr.wikipedia.org/wiki/Le_L%C3%A9zard> | <lsid:ncbi.nlm.nih.gov:taxonomy:8504> | "Curt Connors"@en |
| <http://fr.wikipedia.org/wiki/Le_L%C3%A9zard> | <lsid:ncbi.nlm.nih.gov:taxonomy:8504> | "Lizard"@en       |
| <http://fr.wikipedia.org/wiki/Donald_Duck>    | <lsid:ncbi.nlm.nih.gov:taxonomy:8839> | "Donald Duck"@en  |
-------------------------------------------------------------------------------------------------------------

Et hop ! voila ! That's it !


25 November 2008
