21 October 2011

A reference genome with or without the 'chr' prefix

The name of the chromosomes in the fasta files for the human genome are prefixed with 'chr' :

$  grep ">" hg19.fa 
>chr1
>chr2
>chr3
>chr4
>chr5
>chr6
(...)
The FAIDX index for this fasta file looks like this:
chr1 249250621 6 50 51
chr2 243199373 254235646 50 51
chr3 198022430 502299013 50 51
chr4 191154276 704281898 50 51
chr5 180915260 899259266 50 51
chr6 171115067 1083792838 50 51
(...)
.Today, I've been asked to call the variations for a set of BAM files mapped on a reference genome without this 'chr' prefix. One way to get around this problem is to change the header for those BAM. Another way is to create a copy of the faidx file where the 'chr' prefixes have been removed (the faidx is still valid as the positions in the chromosomes didn't change):
sed 's/^chr//' hg19.fa.fai > hg19_NOPREFIX.fa.fai
and to create a symbolic link named hg19_NOPREFIX.fa pointing to the original reference:
 ln -s hg19.fa hg19_NOPREFIX.fa
. The result:
ls -lah

-rw-r--r-- 1 root root 3.0G Jan  4  2011 hg19.fa
-rw-r--r-- 1 root    root   788 Jan 27  2011 hg19.fa.fai
lrwxrwxrwx 1 root    root     7 Oct 20 16:12 hg19_NOPREFIX.fa -> hg19.fa
-rw-r--r-- 1 root    root   713 Oct 20 16:12 hg19_NOPREFIX.fa.fai
This solution worked so far with samtools mpileup.

That's it,

Pierre

07 October 2011

Knime4Bio: a set of custom nodes for the interpretation of NGS data with KNIME

Our paper has just been published in Bioinformatics  :-)

http://bioinformatics.oxfordjournals.org/content/early/2011/10/07/bioinformatics.btr554.abstract



Knime4Bio: a set of custom nodes for the interpretation of Next Generation Sequencing data with KNIME.
   Pierre Lindenbaum, Solena Le Scouarnec, Vincent Portero and Richard Redon


Summary: Here, we describe Knime4Bio, a set of custom nodes for the KNIME (The Konstanz Information Miner) interactive graphical workbench, for the interpretation of large biological datasets. We demonstrate that this tool can be utilised to quickly retrieve previously published scientific findings.
Availability: http://code.google.com/p/knime4bio/.





That's it,
Pierre

05 October 2011

Grouping mutations/Gene=f(sample)

GroupByGene is a small C++ tool grouping the data:

  • CHROM
  • POS
  • REF
  • GENE
  • SAMPLE
by gene=f(sample). This tool is available on google code : http://code.google.com/p/variationtoolkit/source/browse/trunk/src/groupbygene.cpp
Example:
$ cat input.tsv

#CHROM	POS	REF	ALT	GENE	SAMPLE	
chr1	10	A	T	gene1	indi1
chr1	10	A	T	gene1	indi2
chr1	11	C	G	gene1	indi2
chr2	110	C	G	gene2	indi3
chr3	210	A	T	gene3	indi1
chr3	211	C	T	gene3	indi2
chr3	211	C	T	gene3	indi3
chr3	215	C	G	gene3	indi3
chr3	216	C	T	gene3	indi3
chr4	390	C	T	gene4	indi1
chr4	390	C	A	gene4	indi3

Calling "groupbygene:


$ groupbygene  --chrom 1 --pos 2 --ref 3 --alt 4 --sample 6 --gene 5 < input.tsv

GENECHROMSTARTENDcount
SAMPLES
distinct
MUTATIONS
count(indi1)count(indi2)count(indi3)
gene1chr1101122120
gene2chr211011011001
gene3chr321021634113
gene4chr439039022101


$ groupbygene  --chrom 1 --pos 2 --ref 3 --alt 4 --sample 6 --gene 5 --norefalt < input.tsv

GENECHROMSTARTENDcount
SAMPLES
distinct
MUTATIONS
count(indi1)count(indi2)count(indi3)
gene1chr1101122120
gene2chr211011011001
gene3chr321021634113
gene4chr439039021101


That's it,

Pierre

Verticalize: printing the input stream vertically.

A useful tool: verticalize is a small C++ tool printing the input stream vertically. The source is available on github : https://github.com/lindenb/ccsandbox/blob/master/src/verticalize.cpp.
An Example with 1000genomes.org :

$ curl -s "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.sites.vcf.gz"|\
gunzip  -c | grep -v "##" |\
verticalize  | head -n 30

>>>	2
$1	#CHROM	1
$2	POS   	10327
$3	ID    	rs112750067
$4	REF   	T
$5	ALT   	C
$6	QUAL  	.
$7	FILTER	PASS
$8	INFO  	DP=65;AF=0.208;CB=BC,NCBI
<<<	2

>>>	3
$1	#CHROM	1
$2	POS   	10469
$3	ID    	rs117577454
$4	REF   	C
$5	ALT   	G
$6	QUAL  	.
$7	FILTER	PASS
$8	INFO  	DP=2055;AF=0.020;CB=UM,BC,NCBI
<<<	3

(...)
That's it, Pierre