30 September 2014

Using the Ensembl Regulatory Build to annotate some VCF files

Via the UCSC Genome Browser project announcements: "Data from the Ensembl Regulatory Build are now available in the UCSC Genome Browser as a public track hub for both hg19 and hg38. This track hub contains promoters and their flanking regions, enhancers, and many other regulatory features predicted across a number of cell lines using annotated segmentation states".
For example, looking at chr21:33037019-33037021 returns the following screen:

Those new annotations are deployed by the Sanger Institute as a UCSC track hub. By the way, those files can be handled directly with the UCSC standalone tools:
$ bigWigSummary -type=mean -udcDir=.  \
  "http://ngs.sanger.ac.uk/production/ensembl/regulation//hg19/segmentation_summaries/Segway_17/1.bw" \
  chr1 1  110301 1

1.23587
I wrote a Java tool to annotate VCFs with those files. This tool uses the BigWig library for Java ( https://code.google.com/p/bigwig/ ) and is available at: https://github.com/lindenb/jvarkit/wiki/VcfEnsemblReg.
Here is an example with the following VCF:
##fileformat=VCFv4.1
(...)
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample
chr21 33037029 . C T 6.20 . . GT:PL:DP:GQ 1/1:35,3,0:1:4
VcfEnsemblReg is invoked:
$  java -jar dist/vcfensemblreg.jar in.vcf > out.vcf
Here is the content of out.vcf:
##fileformat=VCFv4.1
##INFO=<ID=AP2ALPHA,Number=1,Type=Float,Description="Overlap summary of AP2ALPHA ChipSeq binding peaks across available datasets http://ngs.sanger.ac.uk/production/ensembl/regulation//hg19/tfbs/AP2ALPHA.bw">
##INFO=<ID=AP2GAMMA,Number=1,Type=Float,Description="Overlap summary of AP2GAMMA ChipSeq binding peaks across available datasets http://ngs.sanger.ac.uk/production/ensembl/regulation//hg19/tfbs/AP2GAMMA.bw">
##INFO=<ID=ATF3,Number=1,Type=Float,Description="Overlap summary of ATF3 ChipSeq binding peaks across available datasets http://ngs.sanger.ac.uk/production/ensembl/regulation//hg19/tfbs/ATF3.bw">
##INFO=<ID=BAF155,Number=1,Type=Float,Description="Overlap summary of BAF155 ChipSeq binding peaks across available datasets http://ngs.sanger.ac.uk/production/ensembl/regulation//hg19/tfbs/BAF155.bw">
##INFO=<ID=BAF170,Number=1,Type=Float,Description="Overlap summary of BAF170 ChipSeq binding peaks across available datasets http://ngs.sanger.ac.uk/production/ensembl/regulation//hg19/tfbs/BAF170.bw">
(...)
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample
chr21 33037029 . C T 6.20 . BuildOverview=ctcf_45704|CTCFBindingSite;Segway_17_1=3.0;Segway_17_14=7.0;Segway_17_24=3.0;Segway_17_6=1.0;Segway_17_7=2.0;Segway_17_8=1.0;Segway_17_A549_projected=ctcf_45704|InactiveRegions;Segway_17_A549_segments=14_gene_79558|TranscriptionAssociated;Segway_17_DND41_projected=ctcf_45704|InactiveRegions;Segway_17_DND41_segments=1_distal_17115|DistalEnhancer;Segway_17_GM12878_projected=ctcf_45704|InactiveRegions;Segway_17_GM12878_segments=1_distal_29075|DistalEnhancer;Segway_17_H1HESC_projected=ctcf_45704|ActiveCTCFBindingSite;Segway_17_H1HESC_segments=8_ctcf_27831|DistalCTF;Segway_17_HELAS3_projected=ctcf_45704|InactiveRegions;Segway_17_HELAS3_segments=6_distal_76536|DistalEnhancer;Segway_17_HEPG2_projected=ctcf_45704|InactiveRegions;Segway_17_HEPG2_segments=1_distal_21535|DistalEnhancer;Segway_17_HMEC_projected=ctcf_45704|InactiveRegions;Segway_17_HMEC_segments=14_gene_44998|TranscriptionAssociated;Segway_17_HSMMT_projected=ctcf_45704|InactiveRegions;Segway_17_HSMMT_segments=24_gene_70780|TranscriptionAssociated;Segway_17_HSMM_projected=ctcf_45704|InactiveRegions;Segway_17_HSMM_segments=24_gene_80902|TranscriptionAssociated;Segway_17_HUVEC_projected=ctcf_45704|InactiveRegions;Segway_17_K562_projected=ctcf_45704|InactiveRegions;Segway_17_K562_segments=14_gene_68692|TranscriptionAssociated;Segway_17_MONO_projected=ctcf_45704|InactiveRegions;Segway_17_MONO_segments=14_gene_35200|TranscriptionAssociated;Segway_17_NHA_projected=ctcf_45704|InactiveRegions;Segway_17_NHDFAD_projected=ctcf_45704|InactiveRegions;Segway_17_NHDFAD_segments=14_gene_57366|TranscriptionAssociated;Segway_17_NHEK_projected=ctcf_45704|InactiveRegions;Segway_17_NHEK_segments=24_gene_95458|TranscriptionAssociated;Segway_17_NHLF_projected=ctcf_45704|InactiveRegions;Segway_17_NHLF_segments=14_gene_59524|TranscriptionAssociated;Segway_17_OSTEO_projected=ctcf_45704|InactiveRegions;Segway_17_OSTEO_segments=14_gene_61575|TranscriptionAssociated GT:PL:DP:GQ 1/1:35,3,0:1:4
Here are the new fields in the INFO column:
Segway_17_1 3.0
Segway_17_14 7.0
Segway_17_24 3.0
Segway_17_6 1.0
Segway_17_7 2.0
Segway_17_8 1.0
Segway_17_A549_projected ctcf_45704|InactiveRegions
Segway_17_A549_segments 14_gene_79558|TranscriptionAssociated
Segway_17_DND41_projected ctcf_45704|InactiveRegions
Segway_17_DND41_segments 1_distal_17115|DistalEnhancer
Segway_17_GM12878_projected ctcf_45704|InactiveRegions
Segway_17_GM12878_segments 1_distal_29075|DistalEnhancer
Segway_17_H1HESC_projected ctcf_45704|ActiveCTCFBindingSite
Segway_17_H1HESC_segments 8_ctcf_27831|DistalCTF
Segway_17_HELAS3_projected ctcf_45704|InactiveRegions
Segway_17_HELAS3_segments 6_distal_76536|DistalEnhancer
Segway_17_HEPG2_projected ctcf_45704|InactiveRegions
Segway_17_HEPG2_segments 1_distal_21535|DistalEnhancer
Segway_17_HMEC_projected ctcf_45704|InactiveRegions
Segway_17_HMEC_segments 14_gene_44998|TranscriptionAssociated
Segway_17_HSMMT_projected ctcf_45704|InactiveRegions
Segway_17_HSMMT_segments 24_gene_70780|TranscriptionAssociated
Segway_17_HSMM_projected ctcf_45704|InactiveRegions
Segway_17_HSMM_segments 24_gene_80902|TranscriptionAssociated
Segway_17_HUVEC_projected ctcf_45704|InactiveRegions
Segway_17_K562_projected ctcf_45704|InactiveRegions
Segway_17_K562_segments 14_gene_68692|TranscriptionAssociated
Segway_17_MONO_projected ctcf_45704|InactiveRegions
Segway_17_MONO_segments 14_gene_35200|TranscriptionAssociated
Segway_17_NHA_projected ctcf_45704|InactiveRegions
Segway_17_NHDFAD_projected ctcf_45704|InactiveRegions
Segway_17_NHDFAD_segments 14_gene_57366|TranscriptionAssociated
Segway_17_NHEK_projected ctcf_45704|InactiveRegions
Segway_17_NHEK_segments 24_gene_95458|TranscriptionAssociated
Segway_17_NHLF_projected ctcf_45704|InactiveRegions
Segway_17_NHLF_segments 14_gene_59524|TranscriptionAssociated
Segway_17_OSTEO_projected ctcf_45704|InactiveRegions
Segway_17_OSTEO_segments 14_gene_61575|TranscriptionAssociated
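
To browse those fields outside of a VCF viewer, the INFO column can be split into key/value pairs. The following is only a minimal Python sketch and not part of VcfEnsemblReg: the function names parse_info and projected_labels are hypothetical, and reading a value such as ctcf_45704|InactiveRegions as "feature id, then label" is an assumption based on the output above.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; keys without '=' (flags) map to True."""
    fields = {}
    if info in ("", "."):
        return fields
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def projected_labels(fields):
    """Map cell line -> label for keys shaped like Segway_17_<CELL>_projected."""
    prefix, suffix = "Segway_17_", "_projected"
    labels = {}
    for key, value in fields.items():
        if key.startswith(prefix) and key.endswith(suffix) and isinstance(value, str):
            cell = key[len(prefix):-len(suffix)]
            # Assumption: the value is "<feature id>|<label>", keep the label only.
            labels[cell] = value.split("|", 1)[-1]
    return labels


# Shortened INFO string copied from the annotated record above.
info = ("BuildOverview=ctcf_45704|CTCFBindingSite;Segway_17_1=3.0;"
        "Segway_17_A549_projected=ctcf_45704|InactiveRegions;"
        "Segway_17_H1HESC_projected=ctcf_45704|ActiveCTCFBindingSite")

fields = parse_info(info)
for key, value in fields.items():
    print(key, value)
print(projected_labels(fields))
```

With the full INFO string, projected_labels would give one label per cell line, which makes it easy to spot the cell lines where the label is not InactiveRegions (H1HESC in this record).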

OK, now I've got a VCF containing those 'Ensembl Regulatory' annotations. What can I do with this? I currently have no idea :-)

That's it,
Pierre

1 comment:

jgoldmann said...

Hi Pierre,

Why don't you check which of your variants disturb the binding motifs of relevant transcription factors? These variants should be among the ones with the largest impact on gene regulation.

Best, Jakob