26 July 2011

A MySQL full-text parser searching for some SNPs

A MySQL full-text parser server plugin can be used to replace or modify the built-in full-text parser.
This post describes a full-text parser plugin named bioparser that extracts the rs#### ids (dbSNP identifiers) from a text.

The source

The source of this plugin is available on GitHub at:
https://github.com/lindenb/bioudf/blob/master/src/bioparser.c

The workhorse of the plugin is a simple function, bioparser_parse, that scans the SQL query or the TEXT. Each time it finds a word that starts with "rs" followed by one or more digits, and that is delimited on both sides, it adds this word to the list of words to be searched for:
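static int bioparser_parse(MYSQL_FTPARSER_PARAM *param)
    {
    char *curr=param->doc;
    const char *begin=param->doc;
    const char *end= begin + param->length;

    /* ask the server to copy each word instead of keeping a pointer into doc */
    param->flags = MYSQL_FTFLAGS_NEED_COPY;
    while(curr+2<end)
        {
        /* the casts to unsigned char keep tolower/isdigit well-defined
           for non-ASCII bytes */
        if(tolower((unsigned char)*curr)=='r' &&
           tolower((unsigned char)*(curr+1))=='s' &&
           isdigit((unsigned char)*(curr+2)) &&
           (curr==begin || IS_DELIM(*(curr-1) ) )
           )
            {
            /* skip the digits following "rs" */
            char* p=curr+2;
            while(p!=end && isdigit((unsigned char)*p))
                {
                ++p;
                }
            /* keep the word only if it ends at a delimiter or at the end of the text */
            if(p==end || IS_DELIM(*p))
                {
                my_add_word(param,curr,p-curr);
                }
            curr=p;
            }
        else
            {
            curr++;
            }
        }

    return 0;
    }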
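
To try the scanning logic without a MySQL server, here is a minimal standalone sketch of the same loop. It is not the plugin's code: is_delim is an assumed stand-in for the IS_DELIM macro (here, any byte that is not a letter, a digit or an underscore), and the token_cb callback is a hypothetical replacement for my_add_word. The names scan_rs_ids and count_rs_ids are made up for this example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* ASSUMPTION: a delimiter is any byte that is not a letter, a digit or '_'.
   The real IS_DELIM macro of the plugin may be defined differently. */
static int is_delim(char c)
{
    unsigned char u = (unsigned char)c;
    return !(isalnum(u) || u == '_');
}

/* Hypothetical callback standing in for my_add_word(). */
typedef void (*token_cb)(const char *word, size_t len, void *ctx);

/* Scan text[0..len) and report every standalone "rs<digits>" word.
   Returns the number of words reported. cb may be NULL. */
static size_t scan_rs_ids(const char *text, size_t len, token_cb cb, void *ctx)
{
    const char *begin = text;
    const char *end = text + len;
    const char *curr = text;
    size_t found = 0;

    assert(text != NULL);
    while (end - curr > 2) {
        if (tolower((unsigned char)curr[0]) == 'r' &&
            tolower((unsigned char)curr[1]) == 's' &&
            isdigit((unsigned char)curr[2]) &&
            (curr == begin || is_delim(curr[-1]))) {
            /* skip the digits following "rs" */
            const char *p = curr + 2;
            while (p != end && isdigit((unsigned char)*p)) {
                ++p;
            }
            /* accept only if the word ends at a delimiter or at the end */
            if (p == end || is_delim(*p)) {
                if (cb != NULL) {
                    cb(curr, (size_t)(p - curr), ctx);
                }
                ++found;
            }
            curr = p;
        } else {
            ++curr;
        }
    }
    return found;
}

/* Convenience wrapper for NUL-terminated strings: only counts the ids. */
static size_t count_rs_ids(const char *text)
{
    return scan_rs_ids(text, strlen(text), NULL, NULL);
}
```

Under these assumptions, count_rs_ids("rs25,rs2476601") finds two ids because the comma acts as a delimiter, while "xrs34" and "rs56abc" are ignored because the "rs" word is not delimited on both sides.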

Install the Plugin


mysql> INSTALL PLUGIN bioparser SONAME 'bioparser.so';
Query OK, 0 rows affected (0.00 sec)

mysql> show plugins;
+-----------------------+----------+--------------------+--------------+---------+
| Name | Status | Type | Library | License |
+-----------------------+----------+--------------------+--------------+---------+
(...)
| partition | ACTIVE | STORAGE ENGINE | NULL | GPL |
| bioparser | ACTIVE | FTPARSER | bioparser.so | GPL |
+-----------------------+----------+--------------------+--------------+---------+
21 rows in set (0.00 sec)

Invoke Plugin


Create a table that will use the plugin:
mysql> create table pubmed(
abstract TEXT,
FULLTEXT (abstract) WITH PARSER bioparser
) ENGINE=MyISAM;

Insert some abstracts.
mysql> insert into pubmed(abstract) values("A predictive role in radiation pneumonitis (RP) development was observed for the LIG4 SNP rs1805388 (adjusted hazard ratio, 2.08; 95% confidence interval, 1.04-4.12; P = .037 for the CT/TT genotype vs the CC genotype). In addition, men with the TT genotype of the XRCC4 rs6869366 SNP and women with AG + AA genotypes of the XRCC5 rs3835 SNP also were at increased risk of developing severe RP.");
Query OK, 1 row affected (0.00 sec)

(...)

mysql> select abstract from pubmed;
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| abstract |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| A predictive role in radiation pneumonitis (RP) development was observed for the LIG4 SNP rs1805388 (adjusted hazard ratio, 2.08; 95% confidence interval, 1.04-4.12; P = .037 for the CT/TT genotype vs the CC genotype). In addition, men with the TT genotype of the XRCC4 rs6869366 SNP and women with AG + AA genotypes of the XRCC5 rs3835 SNP also were at increased risk of developing severe RP. |
| Nonhomologous end joining (NHEJ) is a pathway that repairs DNA double-strand breaks (DSBs) to maintain genomic stability in response to irradiation. The authors hypothesized that single nucleotide polymorphisms (SNPs) in NHEJ repair genes may affect clinical outcomes in patients with nonsmall cell lung cancer (NSCLC) who receive definitive radio(chemo)therapy. |
| The authors genotyped 5 potentially functional SNPs-x-ray repair complementing defective repair in Chinese hamster cells 4 (XRCC4) reference SNP (rs) number rs6869366 (-1394 guanine to thymine [-1394G?T] change) and rs28360071 (intron 3, deletion/insertion), XRCC5 rs3835 (guanine to adenine [G?A] change at nucleotide 2408), XRCC6 rs2267437 (-1310 cytosine to guanine [C?G) change], and DNA ligase IV (LIG4) rs1805388 (threonine-to-isoleucine change at codon 9 [T9I])-and estimated their associations with severe radiation pneumonitis (RP) (grade ?3) in 195 patients with NSCLC. |
| The current results indicated that NHEJ genetic polymorphisms, particularly LIG4 rs1805388, may modulate the risk of RP in patients with NSCLC who receive definitive radio(chemo)therapy. Large studies will be needed to confirm these findings. |
| The repair of DNA double-strand breaks (DSBs) is the major mechanism to maintain genomic stability in response to irradiation. We hypothesized that genetic polymorphisms in DSB repair genes may affect clinical outcomes among non-small cell lung cancer (NSCLC) patients treated with definitive radio(chemo)therapy. |
| We also found that RAD51 -135G>C and XRCC2 R188H SNPs were independent prognostic factors for overall survival (adjusted HR?=?1.70, 95% CI, 1.14-2.62, P?=?0.009 for CG/CC vs. GG; and adjusted HR?=?1.70; 95% CI, 1.02-2.85, P?=?0.043 for AG vs. GG, respectively) and that the SNP-survival association was most pronounced in the presence of RP. |
| A total of 291 patients (145 male/146 female, mean age (± S.D.) 52.2 (± 13.1) years) with PsA were examined clinically, by standard laboratory tests and their DNA was genotyped for the SNP rs2476601 (PTPN22 +1858 C/T). Allelic frequencies were determined and compared with 725 controls. |
| this is a test rs2476601, rs1805388, rs3835 and rs25 |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
8 rows in set (0.00 sec)

Now, test the plugin. A search that contains no rs id matches nothing, because bioparser only extracts rs ids:

mysql> select
concat(left(abstract,40),"...") as ABSTRACT,
match(abstract) against("DNA double-strand") as SCORE
from pubmed group by 2 HAVING SCORE!=0;
Empty set (0.00 sec)

mysql> select
concat(left(abstract,40),"...") as ABSTRACT,
match(abstract) against("rs25") as SCORE
from pubmed group by 2 HAVING SCORE!=0;
+---------------------------------------------+--------------------+
| ABSTRACT | SCORE |
+---------------------------------------------+--------------------+
| this is a test rs2476601, rs1805388, rs3... | 1.8603347539901733 |
+---------------------------------------------+--------------------+
1 row in set (0.01 sec)

mysql> select
concat(left(abstract,40),"...") as ABSTRACT,
match(abstract) against("rs2476601 rs1805388 rs6869366") as SCORE
from pubmed group by 2 HAVING SCORE!=0 order by 2 desc;
+---------------------------------------------+--------------------+
| ABSTRACT | SCORE |
+---------------------------------------------+--------------------+
| A total of 291 patients (145 male/146 fe... | 1.086121916770935 |
| A predictive role in radiation pneumonit... | 1.0619741678237915 |
| this is a test rs2476601, rs1805388, rs3... | 1.0502985715866089 |
| The authors genotyped 5 potentially func... | 1.0388768911361694 |
+---------------------------------------------+--------------------+
4 rows in set (0.00 sec)

mysql> select
concat(left(abstract,40),"...") as ABSTRACT,
match(abstract) against("rs25,rs2476601,rs1805388,rs6869366") as SCORE
from pubmed group by 2 HAVING SCORE!=0 order by 2 desc;
+---------------------------------------------+--------------------+
| ABSTRACT | SCORE |
+---------------------------------------------+--------------------+
| this is a test rs2476601, rs1805388, rs3... | 2.9106333255767822 |
| A total of 291 patients (145 male/146 fe... | 1.086121916770935 |
| A predictive role in radiation pneumonit... | 1.0619741678237915 |
| The authors genotyped 5 potentially func... | 1.0388768911361694 |
+---------------------------------------------+--------------------+
4 rows in set (0.00 sec)

Uninstall the Plugin


mysql> UNINSTALL PLUGIN bioparser;
Query OK, 0 rows affected (0.00 sec)


That's it,

Pierre
