27 February 2013

4 Tools I wrote today to annotate VCF files.

Here are four java-bases tools I wrote today for @SolenaLS to annotate VCF files.

VCFTabix

VCFTabix : we want to use the VCF of the 1Kgenomes project (ALL.wgs.phase1_release_v3.20101123.snps_indels_sv.sites.vcf.gz) as a source of annotation for our VCF.

Options

-f (vcf indexed with tabix) REQUIRED.
 -T (tag String) VCF-INFO-ID optional can be used several times.
 -R doesn't use REF allele
 -A doesn't use ALT allele
 -I don't replace ID if it exists.
 -F don't replace INFO field if it exists.
 -C (TAG) use this tag in case of conflict with the ALT allele.

Example:

java -jar dist/vcftabix.jar -f ALL.wgs.phase1_release_v3.20101123.snps_indels_sv.sites.vcf.gz \
   -T AC -T SNPSOURCE  input.vcf 

##fileformat=VCFv4.1
(....)
##Annotated with fr.inserm.umr1087.jvarkit.tools.vcftabix.VCFTabix:ALL.wgs.phase1_release_v3.20101123.snps_indels_sv.sites.vcf.gz
##INFO=<ID=AC,Number=.,Type=Integer,Description="Alternate Allele Count">
##INFO=<ID=SNPSOURCE,Number=.,Type=String,Description="indicates if a snp was called when analysing the low coverage or exome alignment data">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  CD10977
5       2999819 rs21234 A       G       1620    .       AC=2156;AF=1.00;AN=2;DB;DP=41;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=59.44;MQ0=0;QD=39.51;SB=-6.519e-03;SNPSOURCE=LOWCOV;       GT:AD:DP:GQ:PL  1/1:0,41:41:99:1653,123,0
(...)


EVS2BED

EVS2BED: bulk download of the XML data of EVS into a BED file CHROM/START/END/XML (see my related post: http://plindenbaum.blogspot.fr/2012/05/downloading-xml-data-from-exome-variant.html )

Example

Download 10 records and index the data with tabix:
java -jar dist/evs2bed.jar -L 10 | LC_ALL=C sort -k1,1 -k2,2n -k3,3n -k4,4 -u |\
  /path/totabix-0.2.6/bgzip -c > test.bed.gz
/path/totabix-0.2.6/tabix -p bed -f test.bed.gz


$ gunzip -c  test.bed.gz  | cut -c 1-500 | fold -w 80
1 69427 69428 <snpList><positionString>1:69428</positionString><chrPos
ition>69428</chrPosition><alleles>G/T</alleles><uaAlleleCounts>G=313/T=6535</uaA
lleleCounts><aaAlleleCounts>G=14/T=3808</aaAlleleCounts><totalAlleleCounts>G=327
/T=10343</totalAlleleCounts><uaMAF>4.5707</uaMAF><aaMAF>0.3663</aaMAF><totalMAF>
3.0647</totalMAF><avgSampleReadDepth>110</avgSampleReadDepth><geneList>OR4F5</ge
neList><snpFunction><chromosome>1</chromosome><position>69428</position><conserv
ationScore>1.0</conservationSc
1 69475 69476 <snpList><positionString>1:69476</positionString><chrPos
ition>69476</chrPosition><alleles>C/T</alleles><uaAlleleCounts>C=2/T=7020</uaAll
eleCounts><aaAlleleCounts>C=0/T=3908</aaAlleleCounts><totalAlleleCounts>C=2/T=10
928</totalAlleleCounts><uaMAF>0.0285</uaMAF><aaMAF>0.0</aaMAF><totalMAF>0.0183</
totalMAF><avgSampleReadDepth>123</avgSampleReadDepth><geneList>OR4F5</geneList><
snpFunction><chromosome>1</chromosome><position>69476</position><conservationSco
re>0.6</conservationScore><con
1 69495 69496 <snpList><positionString>1:69496</positionString><chrPos
ition>69496</chrPosition><alleles>A/G</alleles><uaAlleleCounts>A=2/G=6764</uaAll
eleCounts><aaAlleleCounts>A=23/G=3785</aaAlleleCounts><totalAlleleCounts>A=25/G=
10549</totalAlleleCounts><uaMAF>0.0296</uaMAF><aaMAF>0.604</aaMAF><totalMAF>0.23
64</totalMAF><avgSampleReadDepth>91</avgSampleReadDepth><geneList>OR4F5</geneLis
t><snpFunction><chromosome>1</chromosome><position>69496</position><conservation
Score>0.5</conservationScore><
1 69510 69511 <snpList><positionString>1:69511</positionString><chrPos
ition>69511</chrPosition><alleles>G/A</alleles><uaAlleleCounts>G=5337/A=677</uaA
lleleCounts><aaAlleleCounts>G=1937/A=1623</aaAlleleCounts><totalAlleleCounts>G=7
274/A=2300</totalAlleleCounts><uaMAF>11.2571</uaMAF><aaMAF>45.5899</aaMAF><total
MAF>24.0234</totalMAF><avgSampleReadDepth>69</avgSampleReadDepth><geneList>OR4F5
</geneList><snpFunction><chromosome>1</chromosome><position>69511</position><con
servationScore>1.0</conservati
1 69589 69590 <snpList><positionString>1:69590</positionString><chrPos
ition>69590</chrPosition><alleles>A/T</alleles><uaAlleleCounts>A=0/T=6214</uaAll
eleCounts><aaAlleleCounts>A=1/T=3555</aaAlleleCounts><totalAlleleCounts>A=1/T=97
69</totalAlleleCounts><uaMAF>0.0</uaMAF><aaMAF>0.0281</aaMAF><totalMAF>0.0102</t
otalMAF><avgSampleReadDepth>119</avgSampleReadDepth><geneList>OR4F5</geneList><s
npFunction><chromosome>1</chromosome><position>69590</position><conservationScor
e>0.8</conservationScore><cons
1 69593 69594 <snpList><positionString>1:69594</positionString><chrPos
ition>69594</chrPosition><alleles>C/T</alleles><uaAlleleCounts>C=2/T=6190</uaAll
eleCounts><aaAlleleCounts>C=0/T=3548</aaAlleleCounts><totalAlleleCounts>C=2/T=97
38</totalAlleleCounts><uaMAF>0.0323</uaMAF><aaMAF>0.0</aaMAF><totalMAF>0.0205</t
otalMAF><avgSampleReadDepth>109</avgSampleReadDepth><geneList>OR4F5</geneList><s
npFunction><chromosome>1</chromosome><position>69594</position><conservationScor
e>0.8</conservationScore><cons
1 69619 69620 <snpList><positionString>1:69620</positionString><chrPos
ition>69620</chrPosition><alleles>T/TA</alleles><uaAlleleCounts>A1=2/R=4694</uaA
lleleCounts><aaAlleleCounts>A1=0/R=2954</aaAlleleCounts><totalAlleleCounts>A1=2/
R=7648</totalAlleleCounts><uaMAF>0.0426</uaMAF><aaMAF>0.0</aaMAF><totalMAF>0.026
1</totalMAF><avgSampleReadDepth>56</avgSampleReadDepth><geneList>OR4F5</geneList
><snpFunction><chromosome>1</chromosome><position>69620</position><conservationS
core>0.7</conservationScore><c
1 69744 69745 <snpList><positionString>1:69745</positionString><chrPos
ition>69745</chrPosition><alleles>CA/C</alleles><uaAlleleCounts>A1=4/R=4254</uaA
lleleCounts><aaAlleleCounts>A1=0/R=2716</aaAlleleCounts><totalAlleleCounts>A1=4/
R=6970</totalAlleleCounts><uaMAF>0.0939</uaMAF><aaMAF>0.0</aaMAF><totalMAF>0.057
4</totalMAF><avgSampleReadDepth>10</avgSampleReadDepth><geneList>OR4F5</geneList
><snpFunction><chromosome>1</chromosome><position>69745</position><conservationS
core>0.9</conservationScore><c
1 69760 69761 <snpList><positionString>1:69761</positionString><chrPos
ition>69761</chrPosition><alleles>T/A</alleles><uaAlleleCounts>T=645/A=4093</uaA
lleleCounts><aaAlleleCounts>T=62/A=2840</aaAlleleCounts><totalAlleleCounts>T=707
/A=6933</totalAlleleCounts><uaMAF>13.6133</uaMAF><aaMAF>2.1365</aaMAF><totalMAF>
9.2539</totalMAF><avgSampleReadDepth>8</avgSampleReadDepth><geneList>OR4F5</gene
List><snpFunction><chromosome>1</chromosome><position>69761</position><conservat
ionScore>0.1</conservationScor

VCFTabixml

VCFTabixml retrieves the annotations from a BED (CHROM/START/END/XML) and, using XLST, annotates a VCF.

Example

In the following example, a few variations are downloaded from EVS using EVS2BED. Then a small VCF is created and annotated using the bed and the following XSLT stylesheet: evs2vcf.xsl.
java -jar dist/evs2bed.jar -L 10  |\
       LC_ALL=C sort -k1,1 -k2,2n -k3,3n -k4,4 -u |\
       tabix-0.2.6/bgzip -c > test.bed.gz

/tabix-0.2.6/tabix -p bed -f test.bed.gz
java -jar  dist/vcftabixml.jar \
   -f test.bed.gz -x  dist/evs2vcf.xsl test.vcf



echo "#CHROM POS ID REF ALT QUAL FILTER INFO" > test.vcf
echo "1 1 . C T . . T=test1" >> test.vcf
echo "1 69476 . T C . . T=test2" >> test.vcf
echo "1 69511 . A C . . T=test3" >> test.vcf 
java -jar  dist/vcftabixml.jar \
    -f test.bed.gz -x  dist/evs2vcf.xsl test.vcf

Result:
##Annotated with fr.inserm.umr1087.jvarkit.tools.vcftabixml.VCFTabixml:test.bed.gz
#CHROM POS ID REF ALT QUAL FILTER INFO
1 1 . C T . . T=test1;
1 69476 . T C . . T=test2;EVS_UAMAF=0.0285;EVS_AAMAF=0.0;EVS_TOTALMAF=0.0183;EVS_AVGSAMPLEREADDEPTH=123;EVS_GENELIST=OR4F5;EVS_CONSERVATIONSCORE=0.6;EVS_CONSERVATIONSCOREGERP=2.3;EVS_RSIDS=rs148502021;EVS_CLINICALLINK=unknown;EVS_ONEXOMECHIP=false;EVS_GWASPUBMEDIDS=unknown;
1 69511 . A C . . T=test3;EVS_UAMAF=11.2571;EVS_AAMAF=45.5899;EVS_TOTALMAF=24.0234;EVS_AVGSAMPLEREADDEPTH=69;EVS_GENELIST=OR4F5;EVS_CONSERVATIONSCORE=1.0;EVS_CONSERVATIONSCOREGERP=1.1;EVS_RSIDS=rs75062661;EVS_CLINICALLINK=unknown;EVS_ONEXOMECHIP=false;EVS_GWASPUBMEDIDS=unknown;EVS_CONFLICTALT=G;

VCFBigWig

VCFBigWig use a bigwig file to annotate a VCF file.

Example:

add the GERP score to a VCF:
$ java -jar dist/vcfbigwig.jar -f /commun/data/pubdb/ucsc/hg19/bbi/All_hg19_RS_noprefix.bw  test.vep.vcf
##fileformat=VCFv4.1
##Annotated with class fr.inserm.umr1087.jvarkit.tools.vcfbigwig.VCFBigWig:/commun/data/pubdb/ucsc/hg19/bbi/All_hg19_RS_noprefix.bw
##INFO=<ID=All_hg19_RS_noprefix,Number=1,Type=Float,Description="Annotations from /commun/data/pubdb/ucsc/hg19/bbi/All_hg19_RS_noprefix.bw">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
22      12345   .       G       T       100.0   .       All_hg19_RS_noprefix=4.829999923706055

11 February 2013

Counting the reads in a BAM file using SGE and the Open-MPI library: my notebook.

In the current post, I'll describe my first Open MPI program. Open MPI is "a Message Passing Interface (MPI) library, a standardized and portable message-passing system to function on a wide variety of parallel computers". My C program takes a list of BAMs, distributes some jobs on the SGE (SUN/Oracle Grid Engine) to count the number of reads and returns the results to a master process.

Downloading and installing

The library was downloaded from http://www.open-mpi.org/software/ompi/v1.6/, compiled with the option --with-sge and installed in '/commun/data/packages/openmpi'.
./configure --prefix=/commun/data/packages/openmpi --with-sge
make 
make install

Configuring SGE to run MPI-based program

As described in http://docs.oracle.com/cd/E19080-01/n1.grid.eng6/817-5677/6ml49n2c0/index.html .
$ su
$ source /commun/sge_root/bird/common/settings.csh
$ setenv ARCH lx24-amd64
$ qconf -ap orte
And added:
pe_name            orte
slots              448
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE
Add orte to the parallele env list:
qconf -mattr queue pe_list "orte" all.q
Using qconf -mconf I've also set shell_start_mode to 'unix_behavior'.

Algorithm

In the 'main' method, test if this is the master(==0) or a slave process(!=0):
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0)
 {
 master(argc-1,&argv[1]);
 }
else
 {
 slave(argc-1,&argv[1]);
 }

MASTER

The master gets the number of available proc:
MPI_Comm_size(MPI_COMM_WORLD, &proc_count)
Loop over the number of available processors and the BAMS, for each BAM, create a fixed size structure containing the path to the BAM as well as the count of reads initialized to '0':
typedef struct Params_t
 {
 char filename[FILENAME_MAX];
 int is_error;
 long count_total;
 long count_mapped;
 long count_properly_paired;
 long count_dup;
 }Params;
This structure is sent to the slaves:
 MPI_Send(
 &param, /* message buffer */
 sizeof(Params), /* one data item */
 MPI_CHAR, /* data item is an array of bytes */
 rank, /* destination process rank */
 TAG_COUNT_BAM, /* user chosen message tag */
 MPI_COMM_WORLD /* default communicator */
 ); 
and we wait for the slaves to return those 'Params' with the filled values.:
MPI_Recv(&param, /* message buffer */
 sizeof(Params), /* one data item */
 MPI_CHAR, /* data item is an array of bytes */
 MPI_ANY_SOURCE, /* receive from any sender */
 MPI_ANY_TAG, /* any type of message */
 MPI_COMM_WORLD, /* default communicator */
 &status);
And the result is printed:
printf("%s\t%ld\t%ld\t%ld\t%ld\n",
 param.filename,
 param.count_total,
 param.count_mapped,
 param.count_properly_paired,
 param.count_dup
 );
At the end of the master method, when no more BAM is available, the slaves are released (tag is 'TAG_RELEASE_SLAVE') with :
MPI_Send(
 &empty, /* message buffer */
 sizeof(Params), /* one data item */
 MPI_CHAR, /* data item is an array of bytes */
 rank, /* destination process rank */
 TAG_RELEASE_SLAVE, /* SHutdown */
 MPI_COMM_WORLD
 ); /* default communicator */

SLAVE

The slave method receives some messages from the master:
MPI_Recv(
 &param,
 sizeof(Params),
 MPI_CHAR,
 0,
 MPI_ANY_TAG,
 MPI_COMM_WORLD,
 &status);
If the message was 'TAG_RELEASE_SLAVE' we exit the method. Else the BAM file is scanned using the SAmtools API for C.
count_reads(&param);
and the result is sent back to the master:
MPI_Send(
 &param,
 sizeof(Params),
 MPI_CHAR,
 0,
 0,
 MPI_COMM_WORLD
 );
.

Running the Job on SGE

I still don't understand why I need to write the following line:
# qsh -pe orte 4
Your job 84581 ("INTERACTIVE") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 84581 has been successfully scheduled.
Run the analysis:
qrsh -cwd -pe orte 100 /commun/data/packages/openmpi/bin/mpirun \
 /path/to/ompibam \
 `find /commun/data/projects/folder1234/align -name "*.bam"`
qstat -f shows the jobs running:
arch          states
   ---------------------------------------------------------------------------------
   all.q@node01                   BIP   0/15/64        2.76 lx24-amd64
      84598 0.55500 mpirun     lindenb      r     02/11/2013 12:03:36    15
   ---------------------------------------------------------------------------------
   all.q@node02                   BIP   0/14/64        3.89 lx24-amd64
      84598 0.55500 mpirun     lindenb      r     02/11/2013 12:03:36    14
   ---------------------------------------------------------------------------------
   all.q@node03                   BIP   0/14/64        3.23 lx24-amd64
      84598 0.55500 mpirun     lindenb      r     02/11/2013 12:03:36    14
   ---------------------------------------------------------------------------------
   all.q@node04                   BIP   0/14/64        3.68 lx24-amd64
      84598 0.55500 mpirun     lindenb      r     02/11/2013 12:03:36    14
   ---------------------------------------------------------------------------------
   all.q@node05                   BIP   0/15/64        2.91 lx24-amd64
      84598 0.55500 mpirun     lindenb      r     02/11/2013 12:03:36    15
   ---------------------------------------------------------------------------------
   all.q@node06                   BIP   0/14/64        3.91 lx24-amd64
      84598 0.55500 mpirun     lindenb      r     02/11/2013 12:03:36    14
   ---------------------------------------------------------------------------------
   all.q@node07                   BIP   0/14/64        3.79 lx24-amd64
      84598 0.55500 mpirun     lindenb      r     02/11/2013 12:03:36    14 
output:
#filename total mapped properly_paired dup
/commun/data/projects/folder1234/11X0589_markdup.bam 1179766 973943 684794 0
/commun/data/projects/folder1234/11X0589_pair1_sorted.bam 1179766 973943 684794 0
/commun/data/projects/folder1234/11X0589_realigned.bam 1179766 973943 684794 0
/commun/data/projects/folder1234/11X0589_recal.bam 1179766 973943 684794 0
/commun/data/projects/folder1234/11X0602_pair1_sorted.bam 982006 815636 566136 0
/commun/data/projects/folder1234/11X0602_realigned.bam 982006 815636 566136 0
/commun/data/projects/folder1234/11X0602_markdup.bam 982006 815636 566136 0
/commun/data/projects/folder1234/11X0602_pair1_unsorted.bam 982006 815636 566136 0
/commun/data/projects/folder1234/11X0375_pair1_sorted.bam 1255772 1031155 735480 0
/commun/data/projects/folder1234/11X0375_realigned.bam 1255772 1031155 735480 0
/commun/data/projects/folder1234/11X0375_markdup.bam 1255772 1031155 735480 0
/commun/data/projects/folder1234/11X0367_pair1_sorted.bam 1422144 1158315 796466 0
/commun/data/projects/folder1234/11X0367_markdup.bam 1422144 1158315 796466 0
/commun/data/projects/folder1234/11X0375_pair1_unsorted.bam 1255772 1031155 735480 0
/commun/data/projects/folder1234/11X0367_realigned.bam 1422144 1158315 796466 0
/commun/data/projects/folder1234/11X0291_pair1_sorted.bam 1296294 1067233 743924 0
/commun/data/projects/folder1234/11X0291_markdup.bam 1296294 1067233 743924 0
/commun/data/projects/folder1234/11X0291_realigned.bam 1296294 1067233 743924 0
/commun/data/projects/folder1234/11X0311_markdup.bam 1353534 1117463 788474 0
/commun/data/projects/folder1234/11X0311_realigned.bam 1353534 1117463 788474 0
/commun/data/projects/folder1234/11X0375_recal.bam 1255772 1031155 735480 0
/commun/data/projects/folder1234/11X0589_pair1_unsorted.bam 1179766 973943 684794 0
/commun/data/projects/folder1234/11X0367_pair1_unsorted.bam 1422144 1158315 796466 0
/commun/data/projects/folder1234/11X0290_pair1_sorted.bam 1436970 1191044 837726 0
/commun/data/projects/folder1234/11X0290_realigned.bam 1436970 1191044 837726 0
/commun/data/projects/folder1234/11X0291_pair1_unsorted.bam 1296294 1067233 743924 0
/commun/data/projects/folder1234/11X0311_pair1_sorted.bam 1353534 1117463 788474 0
/commun/data/projects/folder1234/11X0240_pair1_sorted.bam 1489672 1229654 830042 0
/commun/data/projects/folder1234/11X0240_realigned.bam 1489672 1229654 830042 0
/commun/data/projects/folder1234/11X0240_markdup.bam 1489672 1229654 830042 0
/commun/data/projects/folder1234/XX10272_pair1_sorted.bam 1579608 1296449 876040 0
/commun/data/projects/folder1234/XX10272_markdup.bam 1579608 1296449 876040 0
/commun/data/projects/folder1234/XX10272_realigned.bam 1579608 1296449 876040 0
/commun/data/projects/folder1234/11X0602_recal.bam 982006 815636 566136 0
/commun/data/projects/folder1234/11X0290_markdup.bam 1436970 1191044 837726 0
/commun/data/projects/folder1234/11X0290_pair1_unsorted.bam 1436970 1191044 837726 0
/commun/data/projects/folder1234/11X0247_pair1_sorted.bam 1489832 1233208 862842 0
/commun/data/projects/folder1234/11X0456_pair1_sorted.bam 1486490 1220969 848738 0
/commun/data/projects/folder1234/11X0456_realigned.bam 1486490 1220969 848738 0
/commun/data/projects/folder1234/11X0247_markdup.bam 1489832 1233208 862842 0
/commun/data/projects/folder1234/11X0247_realigned.bam 1489832 1233208 862842 0
/commun/data/projects/folder1234/11X0456_markdup.bam 1486490 1220969 848738 0
/commun/data/projects/folder1234/11X0367_recal.bam 1422144 1158315 796466 0
/commun/data/projects/folder1234/11X0305_pair1_sorted.bam 1818236 1489267 1027458 0
/commun/data/projects/folder1234/11X0305_markdup.bam 1818236 1489267 1027458 0
/commun/data/projects/folder1234/11X0305_realigned.bam 1818236 1489267 1027458 0
/commun/data/projects/folder1234/11X0542_pair1_sorted.bam 1784860 1469183 1016314 0
/commun/data/projects/folder1234/XX07551_recal.bam 1778358 1456518 1018694 0
/commun/data/projects/folder1234/11X0542_realigned.bam 1784860 1469183 1016314 0
/commun/data/projects/folder1234/11X0542_markdup.bam 1784860 1469183 1016314 0
/commun/data/projects/folder1234/11X0311_pair1_unsorted.bam 1353534 1117463 788474 0
/commun/data/projects/folder1234/11X0291_recal.bam 1296294 1067233 743924 0
/commun/data/projects/folder1234/XX07551_pair1_sorted.bam 1778358 1456518 1018694 0
/commun/data/projects/folder1234/11X0240_pair1_unsorted.bam 1489672 1229654 830042 0
/commun/data/projects/folder1234/XX10272_pair1_unsorted.bam 1579608 1296449 876040 0
/commun/data/projects/folder1234/XX07551_markdup.bam 1778358 1456518 1018694 0
/commun/data/projects/folder1234/XX07551_realigned.bam 1778358 1456518 1018694 0
/commun/data/projects/folder1234/11X0305_pair1_unsorted.bam 1818236 1489267 1027458 0
/commun/data/projects/folder1234/11X0311_recal.bam 1353534 1117463 788474 0
/commun/data/projects/folder1234/11X0456_pair1_unsorted.bam 1486490 1220969 848738 0
/commun/data/projects/folder1234/11X0247_pair1_unsorted.bam 1489832 1233208 862842 0
/commun/data/projects/folder1234/11X0542_pair1_unsorted.bam 1784860 1469183 1016314 0
/commun/data/projects/folder1234/XX07551_pair1_unsorted.bam 1778358 1456518 1018694 0
/commun/data/projects/folder1234/11X0433_pair1_sorted.bam 2178798 1800943 1249738 0
/commun/data/projects/folder1234/11X0433_markdup.bam 2178798 1800943 1249738 0
/commun/data/projects/folder1234/XX10272_recal.bam 1579608 1296449 876040 0
/commun/data/projects/folder1234/11X0433_realigned.bam 2178798 1800943 1249738 0
/commun/data/projects/folder1234/11X0240_recal.bam 1489672 1229654 830042 0
/commun/data/projects/folder1234/11X0290_recal.bam 1436970 1191044 837726 0
/commun/data/projects/folder1234/11X0456_recal.bam 1486490 1220969 848738 0
/commun/data/projects/folder1234/11X0247_recal.bam 1489832 1233208 862842 0
/commun/data/projects/folder1234/11X0433_pair1_unsorted.bam 2178798 1800943 1249738 0
/commun/data/projects/folder1234/11X0305_recal.bam 1818236 1489267 1027458 0
/commun/data/projects/folder1234/11X0542_recal.bam 1784860 1469183 1016314 0
/commun/data/projects/folder1234/11X0433_recal.bam 2178798 1800943 1249738 0
However I noticed seems that running this code on the cluster is slower that running it on the master node...

The C code

Note: the C code is compiled with ortecc, not gcc.


I'm still very new to this API and to those concepts. Feel free to suggest anything :-)
That's it,
Pierre

07 February 2013

A tool to compare the BAMs

Following that thread on Biostar, I've created a tool that compares two or more BAMs. This java program uses the Picard and berkeleydb-JE libraries and is available at: http://code.google.com/p/jvarkit/wiki/CompareBams.

Download & install

see http://code.google.com/p/jvarkit/wiki/CompareBams.

Example

The following Makefile align the same pair of FASTQs with 5 different parameters for bwa aln -O (gap_open_penalty)

FASTQ1=001.fastq.gz
FASTQ2=002.fastq.gz
REF=/path/to/human_g1k_v37.fasta
BWA=bwa
SAMTOOLS=samtools
ALL_BAMS= 

define SAI
$(1)_1.sai : ${FASTQ1}  $(REF)
        $(BWA) aln  $(2) -t 2  -f $$@ $(REF) $$<
$(1)_2.sai :${FASTQ2}  $(REF)
        $(BWA) aln  $(2) -t 2  -f $$@ $(REF) $$<

endef

define ALN

ALL_BAMS+= $(1).bam 

$(eval $(call SAI,$(1),$(2)))

$(1).bam: $(1)_1.sai $(1)_2.sai
        $(BWA) sampe $(3)  ${REF} \
                $(1)_1.sai $(1)_2.sai  \
                $(FASTQ1) $(FASTQ2) |\
        $(SAMTOOLS) view -S -b -o $$@ -T $(REF) - 

endef

.PHONY:all

all: diff.gz

$(eval $(foreach GAP, 8 9 10 11 12 , $(call ALN,GAP$(GAP), -O $(GAP) , )))


diff.gz: $(ALL_BAMS)
        mkdir -p tmp.bdb
        java -jar  /commun/data/packages/jvarkit/comparebams.jar -d tmp.bdb $^ | gzip --best > $@
execute:
$ make

(...)
java -jar  /path/to/jvarkit/comparebams.jar -d tmp.bdb GAP8.bam GAP9.bam GAP10.bam GAP11.bam GAP12.bam | gzip --best > diff.gz
Feb 07, 2013 2:09:57 PM fr.inserm.umr1087.jvarkit.tools.cmpbams.CompareBams run
#INFO: in GAP8.bam count:1
Feb 07, 2013 2:10:30 PM fr.inserm.umr1087.jvarkit.tools.cmpbams.CompareBams run
INFO: in GAP9.bam count:1
(...)

The output is a tab delimited file containing

  • the name of the read
  • the comparison of each bam vs the others EQ=equals, NE=not-equals
  • the positions of the reads in each bam
$ gunzip -c diff.gz | egrep -E  '(#|NE)' | head


#READ-Name      GAP8.bam GAP9.bam|GAP8.bam GAP10.bam|GAP8.bam GAP11.bam|GAP8.bam GAP12.bam|GAP9.bam GAP10.bam|GAP9.bam GAP11.bam|GAP9.bam GAP12.bam|GAP10.bam GAP11.bam|GAP10.bam GAP12.bam|GAP11.bam GAP12.bam GAP8.bam        GAP9.bam        GAP10.bam       GAP11.bam       GAP12.bam
M00491:10:000000000-A27BP:1:1101:10029:10672    EQ|NE|NE|NE|NE|NE|NE|EQ|EQ|EQ   6:123892013,6:123892006 6:123892013,6:123892006 6:123892005,6:123892013 6:123892005,6:123892013 6:123892005,6:123892013
M00491:10:000000000-A27BP:1:1101:10265:10054    EQ|EQ|NE|NE|EQ|NE|NE|NE|NE|NE   19:49671437,19:49671412 19:49671437,19:49671412 19:49671437,19:49671412 19:49671435     19:49671412,19:49671435
M00491:10:000000000-A27BP:1:1101:10904:12333    EQ|NE|NE|NE|NE|NE|NE|EQ|EQ|EQ   10:88681151,10:88681156 10:88681151,10:88681156 10:88681156,10:88681150 10:88681156,10:88681150 10:88681156,10:88681150
M00491:10:000000000-A27BP:1:1101:11211:13492    EQ|NE|NE|NE|NE|NE|NE|EQ|EQ|EQ   8:52321469      8:52321469      8:52321470      8:52321470      8:52321470
M00491:10:000000000-A27BP:1:1101:11298:18283    NE|NE|NE|NE|EQ|EQ|EQ|EQ|EQ|EQ   6:126071103,6:126071110 6:126071104,6:126071110 6:126071104,6:126071110 6:126071104,6:126071110 6:126071104,6:126071110
M00491:10:000000000-A27BP:1:1101:11381:15675    EQ|NE|NE|NE|NE|NE|NE|EQ|EQ|EQ   1:156106900,1:156106905 1:156106900,1:156106905 1:156106905,1:156106899 1:156106905,1:156106899 1:156106905,1:156106899
M00491:10:000000000-A27BP:1:1101:12189:14088    EQ|NE|NE|EQ|NE|NE|EQ|EQ|NE|NE   15:22015803     15:22015803     15:21009140     15:21009140     15:22015803
M00491:10:000000000-A27BP:1:1101:12382:11193    EQ|NE|NE|NE|NE|NE|NE|EQ|EQ|EQ   4:111542263,4:111542256 4:111542263,4:111542256 4:111542263,4:111542254 4:111542263,4:111542254 4:111542263,4:111542254
M00491:10:000000000-A27BP:1:1101:12998:24492    EQ|NE|NE|NE|NE|NE|NE|EQ|EQ|EQ   2:179433503,2:179433496 2:179433503,2:179433496 2:179433503,2:179433497 2:179433503,2:179433497 2:179433503,2:179433497
(....)

That's it
Pierre

05 February 2013

Making use of Picard Metrics files using XML and XSLT. #ngs

Many tools in the Picard package produce some "Metrics File" (described at http://picard.sourceforge.net/picard-metric-definitions.shtml). The picard API contains a java parser "MetricsFile" parsing those metrics-file:

MetricsFile<MetricBase, Comparable<?>> metricsFile=new MetricsFile<MetricBase, Comparable<?>>();
metricsFile.read(new FileReader("metrics.txt"));
In order produce some custom reports from those files, I've created a tool that dump the content of the MetricsFile as a XML file. The source code is available at: http://code.google.com/p/jvarkit/source/browse/trunk/src/main/java/fr/inserm/umr1087/jvarkit/tools/picard/metrics2xml/PicardMetricsToXML.java.

Compilation

$ mkdir tmp
$ javac -d tmp -cp  /path/to/picard.jar:/path/to/sam.jar \
     -sourcepath  src/main/java \
     src/main/java/fr/inserm/umr1087/jvarkit/tools/picard/metrics2xml/PicardMetricsToXML.java
$ jar vcf picardmetrics2xml.jar -C tmp .

Usage

Say you have used the tool 'CollectInsertSizeMetrics.jar' from picard:
$ java -jar/path/to/CollectInsertSizeMetrics.jar \
 O=out.metrics \
 I=/path/to/samtools/examples/sorted.bam \
 AS=true \
 R=/path/to/samtools/ex1.fa \
 H=chart.pdf
The file out.metrics looks like this:
## net.sf.picard.metrics.StringHeader
# net.sf.picard.analysis.CollectInsertSizeMetrics HISTOGRAM_FILE=(...)
## net.sf.picard.metrics.StringHeader
# Started on: Tue Feb 05 12:51:30 CET 2013

## METRICS CLASS net.sf.picard.analysis.InsertSizeMetrics
MEDIAN_INSERT_SIZE MEDIAN_ABSOLUTE_DEVIATION MIN_INSERT_SIZE MAX_INSERT_SIZE MEAN_INSERT_SIZE STANDARD_DEVIATION READ_PAIRS
209 10 54 243 208.857506 13.614603 4716 FR 5 9 13 17 21 25 29 35 43 

## HISTOGRAM java.lang.Integer
insert_size All_Reads.fr_count
54 3
170 3
173 9
174 3
175 3
177 6
(...)
This file can be converted to XML using the following command:
$ java -cp /path/to/picard.jar:/path/to/sam.jar:picardmetrics2xml.jar file.metrics


<?xml version="1.0" encoding="UTF-8"?><picard-metrics xmlns="http://picard.sourc
eforge.net/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><metrics-file
 file="file.metrics"><headers><header class="net.sf.picard.metrics.StringHeader"
>net.sf.picard.analysis.CollectInsertSizeMetrics HISTOGRAM_FILE=jeter2 INPUT=/ho
me/lindenb/package/samtools-0.1.18/examples/sorted.bam OUTPUT=jeter REFERENCE_SE
QUENCE=/home/lindenb/package/samtools-0.1.18/examples/ex1.fa ASSUME_SORTED=true 
   DEVIATIONS=10.0 MINIMUM_PCT=0.05 METRIC_ACCUMULATION_LEVEL=[ALL_READS] STOP_A
FTER=0 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL
=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false</header><h
eader class="net.sf.picard.metrics.StringHeader">Started on: Tue Feb 05 12:51:30
 CET 2013</header></headers><metrics><thead class="net.sf.picard.analysis.Insert
SizeMetrics"><th class="double">MEDIAN_INSERT_SIZE</th><th class="double">MEDIAN
_ABSOLUTE_DEVIATION</th><th class="int">MIN_INSERT_SIZE</th><th class="int">MAX_
INSERT_SIZE</th><th class="double">MEAN_INSERT_SIZE</th><th class="double">STAND
ARD_DEVIATION</th><th class="long">READ_PAIRS</th><th class="net.sf.picard.sam.S
amPairUtil$PairOrientation">PAIR_ORIENTATION</th><th class="int">WIDTH_OF_10_PER
CENT</th><th class="int">WIDTH_OF_20_PERCENT</th><th class="int">WIDTH_OF_30_PER
CENT</th><th class="int">WIDTH_OF_40_PERCENT</th><th class="int">WIDTH_OF_50_PER
CENT</th><th class="int">WIDTH_OF_60_PERCENT</th><th class="int">WIDTH_OF_70_PER
CENT</th><th class="int">WIDTH_OF_80_PERCENT</th><th class="int">WIDTH_OF_90_PER
CENT</th><th class="int">WIDTH_OF_99_PERCENT</th><th class="java.lang.String">SA
MPLE</th><th class="java.lang.String">LIBRARY</th><th class="java.lang.String">R
EAD_GROUP</th></thead><tbody><tr><td>209.0</td><td>10.0</td><td>54</td><td>243</
td><td>208.857506</td><td>13.614603</td><td>4716</td><td>FR</td><td>5</td><td>9<
/td><td>13</td><td>17</td><td>21</td><td>25</td><td>29</td><td>35</td><td>43</td
><td>65</td><td xsi:nil="true"/><td xsi:nil="true"/><td xsi:nil="true"/></tr></t
body></metrics><histogram class="java.lang.Integer"><thead><th>insert_size</th><
th>All_Reads.fr_count</th></thead><tbody><tr><td>54</td><td>3.0</td></tr><tr><td
>170</td><td>3.0</td></tr><tr><td>173</td><td>9.0</td></tr><tr><td>174</td><td>3
.0</td></tr><tr><td>175</td><td>3.0</td></tr><tr><td>177</td><td>6.0</td></tr><t
r><td>178</td><td>6.0</td></tr><tr><td>179</td><td>9.0</td></tr><tr><td>180</td>
<td>6.0</td></tr><tr><td>181</td><td>6.0</td></tr><tr><td>182</td><td>21.0</td><
/tr><tr><td>183</td><td>9.0</td></tr><tr><td>184</td><td>15.0</td></tr><tr><td>1
85</td><td>33.0</td></tr><tr><td>186</td><td>15.0</td></tr><tr><td>187</td><td>3
(...)

Converting to JSON

Now, we can convert the XML to whatever we want using XSLT. I wrote a stylesheet picardmetrics2json.xsl converting the XML to JSON (though, I should escape the quotes in the strings ).
$ xsltproc picardmetrics2json.xsl metrics.xml


{
    "metrics.xml": {
        "headers": [
            {
                "class": "net.sf.picard.metrics.StringHeader",
                "value": "net.sf.picard.analysis.CollectInsertSizeMetrics HISTOGRAM_FILE=metrics.pdf INPUT=samtools-0.1.18/examples/sorted.bam OUTPUT=metrics.txt REFERENCE_SEQUENCE=/home/lindenb/package/samtools-0.1.18/examples/ex1.fa ASSUME_SORTED=true    DEVIATIONS=10.0 MINIMUM_PCT=0.05 METRIC_ACCUMULATION_LEVEL=[ALL_READS] STOP_AFTER=0 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false"
            },
            {
                "class": "net.sf.picard.metrics.StringHeader",
                "value": "Started on: Tue Feb 05 12:51:30 CET 2013"
            }
        ],
        "metrics": [
            {
                "MEDIAN_INSERT_SIZE": 209,
                "MEDIAN_ABSOLUTE_DEVIATION": 10,
                "MIN_INSERT_SIZE": 54,
                "MAX_INSERT_SIZE": 243,
                "MEAN_INSERT_SIZE": 208.857506,
                "STANDARD_DEVIATION": 13.614603,
                "READ_PAIRS": 4716,
                "PAIR_ORIENTATION": "FR",
                "WIDTH_OF_10_PERCENT": 5,
                "WIDTH_OF_20_PERCENT": 9,
                "WIDTH_OF_30_PERCENT": 13,
                "WIDTH_OF_40_PERCENT": 17,
                "WIDTH_OF_50_PERCENT": 21,
                "WIDTH_OF_60_PERCENT": 25,
                "WIDTH_OF_70_PERCENT": 29,
                "WIDTH_OF_80_PERCENT": 35,
                "WIDTH_OF_90_PERCENT": 43,
                "WIDTH_OF_99_PERCENT": 65,
                "SAMPLE": null,
                "LIBRARY": null,
                "READ_GROUP": null
            }
        ],
        "histogram": [
            {
                "insert_size": 54,
                "All_Reads.fr_count": 3
            },
            {
                "insert_size": 170,
                "All_Reads.fr_count": 3
            },(...)

Converting to HTML

Another stylesheet convert the XML to HTML. It also produces the javascript code to display the histograms using Google chart:
$ xsltproc picardmetrics2html.xsl metrics.xml > output.html


That's it,
Pierre