26 September 2006

MYSQL UDF, trees of data, hierarchy

In a previous post I described how to write a mysql user defined function (UDF) in C to create a translate-dna-to-protein function for mysql. Now, I've been playing with hierachies and I wrote a few UDFs to explore a tree of data; I guess this could have been done with a mysql stored procedure but I'm currently using an old server (and I need a deeper knowledge of mysql5 :-) ). The source code is available [here].
The tree itself is written as a sorted static const array of data so you cannot modify it via a mysql command but you'll have to recompile the code.

typedef struct Taxonomy
{
/** the ncbi-id */
taxon_type_t tax_id;
/** id of the parent */
taxon_type_t parent_id;
/** scientific name */
char name[TAXON_NAME_SIZE];
}Taxon,*const TaxonPtr;


It may also not be suitable for large trees. In my example, I've been using a subset of the NCBI taxonomy:

static const Taxon all_taxons[]={
{1, -1, "root"},
{2759, 131567, "Eukaryota"},
{6072, 33208, "Eumetazoa"},
{7711, 33511, "Chordata"},
{7742, 89593, "Vertebrata"},
{7776, 7742, "Gnathostomata"},
{8287, 117571, "Sarcopterygii"},
{9347, 32525, "Eutheria"},
{9443, 314146, "Primates"},
...)
};


but one could imagine something like this...

{1,-1,"root"},
{2,1,"world"},
{3,2,"europe"},
{4,3,"france"},
...


or...

{1,-1,"rdf:Property"},
{2,1,"foaf:knows"},
{3,2,"rel:friendOf"},
{4,2,"rel:childOf"},
...


the source taxonudf.c was successfuly compiled on my computer using the following command-line.

gcc -fPIC -shared -DDBUG_OFF -O3 -I/usr/include/mysql -lmysqlclient -o /usr/lib/taxonudf.o taxonudf.c


taxon_name returns the scientific name of an organism from a ncbi-tax-id.

mysql> create function taxon_name
returns string soname "taxonudf.o";
Query OK, 0 rows affected (0,00 sec)

mysql> select taxon_name(9606);
+------------------+
| taxon_name(9606) |
+------------------+
| Homo sapiens |
+------------------+
1 row in set (0,43 sec)


tax_id returns the ncbi-tax-id from from its name.

mysql> create function taxon_id
returns integer soname "taxonudf.o";
Query OK, 0 rows affected (0,00 sec)

mysql> select taxon_id("Homo sapiens");
+--------------------------+
| taxon_id("Homo sapiens") |
+--------------------------+
| 9606 |
+--------------------------+
1 row in set (0,00 sec)


taxon_childof returns wether a node in the hierarchy is a descendant of another node.

mysql> create function taxon_childof
returns integer soname "taxonudf.o";
Query OK, 0 rows affected (0,00 sec)

mysql> select taxon_childof(taxon_id("Homo"),taxon_id("Homo Sapiens"))
as "is Homo child of Homo.Sapiens ?";
+---------------------------------+
| is Homo child of Homo.Sapiens ? |
+---------------------------------+
| 0 |
+---------------------------------+
1 row in set (0,00 sec)


mysql> select taxon_childof(taxon_id("Homo Sapiens"),taxon_id("Homo"))
as "Is Homo.Sapiens descendant of Homo ?";
+--------------------------------------+
| Is Homo.Sapiens descendant of Homo ? |
+--------------------------------------+
| 1 |
+--------------------------------------+
1 row in set (0,00 sec)



taxon_com is an aggregate function which finds the common ancestral node in a set of node.

mysql> create aggregate function taxon_com
returns integer soname "taxonudf.o";
Query OK, 0 rows affected (0,00 sec)

mysql> create temporary table t1(cluster varchar(20),taxon int);
Query OK, 0 rows affected (0,07 sec)

mysql> insert into t1(cluster,taxon) values("A",251093),
("A",9781), ("A",37348),("B",9605),("B",9606),("B",63221),
("C",32523),("C",33154),("C",7776),("C",9443);
Query OK, 10 rows affected (0,03 sec)
Records: 10 Duplicates: 0 Warnings: 0

mysql> select cluster,taxon as "ncbi-id",taxon_name(taxon) as "Name",taxon_childof(taxon,taxon_id("Primates")) as "Is_Primate" from t1;
+---------+---------+-------------------------------+------------+
| cluster | ncbi-id | Name | Is_Primate |
+---------+---------+-------------------------------+------------+
| A | 251093 | Elephas antiquus | 0 |
| A | 9781 | Elephantidae gen. sp. | 0 |
| A | 37348 | Mammuthus | 0 |
| B | 9605 | Homo | 1 |
| B | 9606 | Homo sapiens | 1 |
| B | 63221 | Homo sapiens neanderthalensis | 1 |
| C | 32523 | Tetrapoda | 0 |
| C | 33154 | Fungi/Metazoa group | 0 |
| C | 7776 | Gnathostomata | 0 |
| C | 9443 | Primates | 1 |
+---------+---------+-------------------------------+------------+
10 rows in set (0,00 sec)

mysql> select cluster,taxon_com(taxon) as ncbi_id
from t1 group by cluster;
+---------+---------+
| cluster | ncbi_id |
+---------+---------+
| A | 9780 |
| B | 9605 |
| C | 33154 |
+---------+---------+
3 rows in set (0.00 sec)


That's all folks

2 comments:

Jo Yo said...

I wrote an interesting UDF tool that may be interesting to you.

JsMap. Basically, I can map a javascript to a MySQL database. This lets me migrate computation from the application to the database for greatly locality.

Pierre Lindenbaum said...

@Jo fun, a JSON/UDF was on my todo list :-)