31 August 2012

The 500th post: Generating an NGS analysis pipeline (Makefile) with XSLT.

This is my 500th post! Happy birthday, my blog! (I know, I know, I have been unfaithful since the beginning of the year.)

In my previous post, I showed how the .INTERMEDIATE special target can be used in a Makefile to reduce the number of temporary files generated in a classical NGS pipeline. Make also has a '-j' option to specify the number of parallel jobs: independent tasks (such as aligning two distinct samples) can then be processed in parallel.
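For example, this toy Makefile (hypothetical targets, not from the post) has two targets that do not depend on each other, so invoking make with -j runs both recipes concurrently (recipe lines must start with a TAB):

```make
# a.txt and b.txt are independent: 'make -j 2 all' builds them at the same time
all: a.txt b.txt

a.txt:
	sleep 2; echo A > $@

b.txt:
	sleep 2; echo B > $@
```

Here `make -j 2 all` finishes in roughly 2 seconds instead of 4, because the two `sleep 2` recipes overlap; the same reasoning applies to two `bwa aln` jobs for two distinct samples.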

At this point, I should cite this tweet from Sean Davis, and this even-more-recent tweet from Justin H. Johnson (the embedded tweets are not reproduced here).
Here, I will show how XSLT can be used to generate a whole Makefile to process the output of Illumina/CASAVA, after the FASTQ files have been generated with configureBclToFastq.pl and the --fastq-cluster-count option.

An XML descriptor generated by Illumina CASAVA looks like this:
<DemultiplexConfig>
  <Software/>
  <FlowcellInfo ID="C0KTCACXX" Operator="JohnDoe" Recipe="" Desc="">
    <Lane Number="1">
      <Sample ProjectId="WGS" Control="N" Index="CTTGTA" SampleId="Father" Desc="1.5pM" Ref="" />
      <Sample ProjectId="WGS" Control="N" Index="GCCAAT" SampleId="Mother" Desc="1.5pM" Ref="" />
      <Sample ProjectId="WGS" Control="N" Index="TGACCA" SampleId="Daughter" Desc="1.5pM" Ref="" />
      <Sample ProjectId="Undetermined_indices" Control="N" Index="Undetermined" SampleId="lane1" Desc="Clusters with unmatched barcodes for lane 1" Ref="unknown" />
    </Lane>
    <Lane Number="2">
      <Sample ProjectId="WGS" Control="N" Index="CTTGTA" SampleId="Father" Desc="1.5pM" Ref="" />
      <Sample ProjectId="WGS" Control="N" Index="GCCAAT" SampleId="Mother" Desc="1.5pM" Ref="" />
      <Sample ProjectId="WGS" Control="N" Index="TGACCA" SampleId="Daughter" Desc="1.5pM" Ref="" />
      <Sample ProjectId="Undetermined_indices" Control="N" Index="Undetermined" SampleId="lane2" Desc="Clusters with unmatched barcodes for lane 2" Ref="unknown" />
    </Lane>
  </FlowcellInfo>
</DemultiplexConfig>
This document contains the information required to locate the FASTQ files and to generate a Makefile. I wrote the following XSLT stylesheet:
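The full stylesheet is not reproduced here; a simplified sketch of the core idea (NOT the author's actual stylesheet) is a template matching each <Sample> element that prints one Makefile target per (sample, lane), skipping the unmatched-barcode pseudo-samples:

```xml
<!-- simplified sketch: emit one BAM target per (sample, lane) -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/DemultiplexConfig">
    <xsl:apply-templates
        select="FlowcellInfo/Lane/Sample[@ProjectId!='Undetermined_indices']"/>
  </xsl:template>

  <xsl:template match="Sample">
    <!-- builds e.g. $(OUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L001.bam: -->
    <xsl:text>$(OUTDIR)/Project_</xsl:text>
    <xsl:value-of select="@ProjectId"/>
    <xsl:text>/Sample_</xsl:text>
    <xsl:value-of select="@SampleId"/>
    <xsl:text>/</xsl:text>
    <xsl:value-of select="@SampleId"/>
    <xsl:text>_</xsl:text>
    <xsl:value-of select="@Index"/>
    <xsl:text>_L00</xsl:text>
    <xsl:value-of select="../@Number"/>
    <xsl:text>.bam:&#10;</xsl:text>
    <!-- ...recipe lines (bwa aln/sampe, samtools, picard) would follow... -->
  </xsl:template>
</xsl:stylesheet>
```

Such a stylesheet can be applied with, e.g., `xsltproc casava2make.xsl DemultiplexConfig.xml > Makefile` (the stylesheet file name here is hypothetical). The `idpNNNN` prefixes used for the temporary .sai/.sam/.bam files in the generated Makefile below are characteristic of XSLT's generate-id() function, which yields a unique identifier per sample node.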


This stylesheet transforms the DemultiplexConfig.xml file to the following Makefile:

#
#files must have been generated using CASAVA/configureBclToFastq.pl with --fastq-cluster-count 0
#path to bwa
override BWA=/path/to/bwa-0.6.1/bwa
BWASAMPEFLAG= -a 600
BWAALNFLAG= -t 4
#path to samtools
SAMTOOLS=/path/to/samtools
#path to BEDTOOLS
BEDTOOLS=/path/to/bedtools/bin
#output directory
OUTDIR=2012831.OUTPUT.idp0
INPUTDIR=.
REF=/path/to/human_g1k_v37.fasta
JAVA=java

PICARDLIB=/path/to/picard-tools
TABIX=/path/to/tabix
GATK=/path/to/GenomeAnalysisTK.jar
GATKFLAGS=
VCFDBSNP=/path/to/ncbi/snp/00-All.vcf.gz

#rule how to create a bai from a bam:
%.bam.bai: %.bam
 #indexing BAM $<
 $(SAMTOOLS) index $<

#creates the default output directory
$(OUTDIR):
 mkdir -p $@

#indexes the reference for bwa
$(REF).bwt $(REF).ann $(REF).amb $(REF).pac $(REF).sa : $(REF)
 $(BWA) index -a bwtsw $<

#indexes the reference for samtools
$(REF).fai: $(REF)
 $(SAMTOOLS) faidx $<

#creates sequence dict for picard
$(REF).dict: $(REF)
 $(JAVA) -jar $(PICARDLIB)/CreateSequenceDictionary.jar REFERENCE=$(REF) OUTPUT=$@



#intermediate see http://plindenbaum.blogspot.fr/2012/08/next-generation-sequencing-gnu-make-and.html
.INTERMEDIATE :  \
 $(OUTDIR)/Project_WGS/Sample_Father/Father.bam.bai \
 $(OUTDIR)/Project_WGS/Sample_Father/Father.bam \
 $(OUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L001.bam \
 $(OUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L002.bam \
 $(OUTDIR)/Project_WGS/Sample_Mother/Mother.bam.bai \
 $(OUTDIR)/Project_WGS/Sample_Mother/Mother.bam \
 $(OUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L001.bam \
 $(OUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L002.bam \
 $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.bam.bai \
 $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.bam \
 $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L001.bam \
 $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L002.bam


#main target : generate the VCFs
all : vcfs


vcfs:  \
 $(OUTDIR)/Project_WGS/Sample_Father/Father.vcf.gz \
 $(OUTDIR)/Project_WGS/Sample_Mother/Mother.vcf.gz \
 $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.vcf.gz


$(OUTDIR)/Project_WGS/Sample_Father/Father.realigned.bam : $(OUTDIR)/Project_WGS/Sample_Father/Father.bam $(OUTDIR)/Project_WGS/Sample_Father/Father.bam.bai
 $(JAVA) -jar $(GATK) $(GATKFLAGS) \
  -T RealignerTargetCreator \
  -R $(REF) \
  -I $< \
  -o $<.intervals \
  --known $(VCFDBSNP)
 $(JAVA) -jar $(GATK) $(GATKFLAGS) \
  -T IndelRealigner \
  -R $(REF) \
  -I $< \
  -o $@ \
  -targetIntervals $<.intervals \
  --knownAlleles $(VCFDBSNP)
 rm -f $<.intervals


#creates a BAM for 'Father' by merging the other lanes
$(OUTDIR)/Project_WGS/Sample_Father/Father.bam: $(OUTDIR)  \
  $(OUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L001.bam \
  $(OUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L002.bam
 #merge BAMS
 $(JAVA) -jar $(PICARDLIB)/MergeSamFiles.jar SORT_ORDER=coordinate OUTPUT=$@ \
  INPUT=$(OUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L001.bam \
  INPUT=$(OUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L002.bam
 #fix mate information
 $(JAVA) -jar $(PICARDLIB)/FixMateInformation.jar \
  INPUT=$@
 #validate
 $(JAVA) -jar $(PICARDLIB)/ValidateSamFile.jar \
  VALIDATE_INDEX=true \
  I=$@


#Creates a BAM for 'Project_WGS/Sample_Father/Father_CTTGTA_L001'

$(OUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L001.bam: $(REF).bwt \
 $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L001_R1_001.fastq.gz \
 $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L001_R2_001.fastq.gz
 #create directory
 mkdir -p $(OUTDIR)/Project_WGS/Sample_Father
 #align fastq1 with BWA
 $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idp22528_1.sai $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L001_R1_001.fastq.gz
 #align fastq2 with BWA
 $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idp22528_2.sai $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L001_R2_001.fastq.gz
 #sampe those files
 $(BWA) sampe $(BWASAMPEFLAG)  -r '@RG ID:idp22528 CN:MyLaboratory LB:Father PG:bwa SM:Father PL:ILLUMINA' $(REF) \
  -f $(OUTDIR)/idp22528.sam \
  $(OUTDIR)/idp22528_1.sai \
  $(OUTDIR)/idp22528_2.sai \
  $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L001_R1_001.fastq.gz \
  $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L001_R2_001.fastq.gz
 #remove sai
 rm -f $(OUTDIR)/idp22528_1.sai $(OUTDIR)/idp22528_2.sai
 #convert sam to bam
 $(SAMTOOLS) view -b -T $(REF) -o $(OUTDIR)/idp22528_unsorted.bam $(OUTDIR)/idp22528.sam
 #remove the sam
 rm -f $(OUTDIR)/idp22528.sam
 #sort this bam
 $(SAMTOOLS) sort  $(OUTDIR)/idp22528_unsorted.bam $(OUTDIR)/idp22528_sorted
 #remove the unsorted bam
 rm -f $(OUTDIR)/idp22528_unsorted.bam
 #mark duplicates
 $(JAVA) -jar $(PICARDLIB)/MarkDuplicates.jar VALIDATION_STRINGENCY=LENIENT ASSUME_SORTED=true TMP_DIR=$(OUTDIR) \
  INPUT=$(OUTDIR)/idp22528_sorted.bam \
  OUTPUT=$(OUTDIR)/idp22528_dup.bam \
  METRICS_FILE=$(OUTDIR)/Project_WGS/Sample_Father/Father.metrics
 #remove before duplicate
 rm -f $(OUTDIR)/idp22528_sorted.bam
 #move to final bam
 mv $(OUTDIR)/idp22528_dup.bam $@



$(OUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L002.bam: $(REF).bwt \
 $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L002_R1_001.fastq.gz \
 $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L002_R2_001.fastq.gz
 #create directory
 mkdir -p $(OUTDIR)/Project_WGS/Sample_Father
 #align fastq1 with BWA
 $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idp31344_1.sai $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L002_R1_001.fastq.gz
 #align fastq2 with BWA
 $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idp31344_2.sai $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L002_R2_001.fastq.gz
 #sampe those files
 $(BWA) sampe $(BWASAMPEFLAG)  -r '@RG ID:idp31344 CN:MyLaboratory LB:Father PG:bwa SM:Father PL:ILLUMINA' $(REF) \
  -f $(OUTDIR)/idp31344.sam \
  $(OUTDIR)/idp31344_1.sai \
  $(OUTDIR)/idp31344_2.sai \
  $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L002_R1_001.fastq.gz \
  $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L002_R2_001.fastq.gz
 #remove sai
 rm -f $(OUTDIR)/idp31344_1.sai $(OUTDIR)/idp31344_2.sai
 #convert sam to bam
 $(SAMTOOLS) view -b -T $(REF) -o $(OUTDIR)/idp31344_unsorted.bam $(OUTDIR)/idp31344.sam
 #remove the sam
 rm -f $(OUTDIR)/idp31344.sam
 #sort this bam
 $(SAMTOOLS) sort  $(OUTDIR)/idp31344_unsorted.bam $(OUTDIR)/idp31344_sorted
 #remove the unsorted bam
 rm -f $(OUTDIR)/idp31344_unsorted.bam
 #mark duplicates
 $(JAVA) -jar $(PICARDLIB)/MarkDuplicates.jar VALIDATION_STRINGENCY=LENIENT ASSUME_SORTED=true TMP_DIR=$(OUTDIR) \
  INPUT=$(OUTDIR)/idp31344_sorted.bam \
  OUTPUT=$(OUTDIR)/idp31344_dup.bam \
  METRICS_FILE=$(OUTDIR)/Project_WGS/Sample_Father/Father.metrics
 #remove before duplicate
 rm -f $(OUTDIR)/idp31344_sorted.bam
 #move to final bam
 mv $(OUTDIR)/idp31344_dup.bam $@




#creates a VCF for 'Father'
$(OUTDIR)/Project_WGS/Sample_Father/Father.vcf.gz : $(OUTDIR)/Project_WGS/Sample_Father/Father.realigned.bam
 $(JAVA) -jar $(GATK) \
  -T UnifiedGenotyper \
  -glm BOTH \
  -baq OFF \
  -R $(REF) \
  -I $< \
  --dbsnp $(VCFDBSNP) \
  -o $(OUTDIR)/Project_WGS/Sample_Father/Father.vcf
 $(TABIX)/bgzip $(OUTDIR)/Project_WGS/Sample_Father/Father.vcf
 $(TABIX)/tabix -f -p vcf $(OUTDIR)/Project_WGS/Sample_Father/Father.vcf.gz
$(OUTDIR)/Project_WGS/Sample_Mother/Mother.realigned.bam : $(OUTDIR)/Project_WGS/Sample_Mother/Mother.bam $(OUTDIR)/Project_WGS/Sample_Mother/Mother.bam.bai
 $(JAVA) -jar $(GATK) $(GATKFLAGS) \
  -T RealignerTargetCreator \
  -R $(REF) \
  -I $< \
  -o $<.intervals \
  --known $(VCFDBSNP)
 $(JAVA) -jar $(GATK) $(GATKFLAGS) \
  -T IndelRealigner \
  -R $(REF) \
  -I $< \
  -o $@ \
  -targetIntervals $<.intervals \
  --knownAlleles $(VCFDBSNP)
 rm -f $<.intervals


#creates a BAM for 'Mother' by merging the other lanes
$(OUTDIR)/Project_WGS/Sample_Mother/Mother.bam: $(OUTDIR)  \
  $(OUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L001.bam \
  $(OUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L002.bam
 #merge BAMS
 $(JAVA) -jar $(PICARDLIB)/MergeSamFiles.jar SORT_ORDER=coordinate OUTPUT=$@ \
  INPUT=$(OUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L001.bam \
  INPUT=$(OUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L002.bam
 #fix mate information
 $(JAVA) -jar $(PICARDLIB)/FixMateInformation.jar \
  INPUT=$@
 #validate
 $(JAVA) -jar $(PICARDLIB)/ValidateSamFile.jar \
  VALIDATE_INDEX=true \
  I=$@


#Creates a BAM for 'Project_WGS/Sample_Mother/Mother_GCCAAT_L001'

$(OUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L001.bam: $(REF).bwt \
 $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L001_R1_001.fastq.gz \
 $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L001_R2_001.fastq.gz
 #create directory
 mkdir -p $(OUTDIR)/Project_WGS/Sample_Mother
 #align fastq1 with BWA
 $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idp15352_1.sai $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L001_R1_001.fastq.gz
 #align fastq2 with BWA
 $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idp15352_2.sai $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L001_R2_001.fastq.gz
 #sampe those files
 $(BWA) sampe $(BWASAMPEFLAG)  -r '@RG ID:idp15352 CN:MyLaboratory LB:Mother PG:bwa SM:Mother PL:ILLUMINA' $(REF) \
  -f $(OUTDIR)/idp15352.sam \
  $(OUTDIR)/idp15352_1.sai \
  $(OUTDIR)/idp15352_2.sai \
  $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L001_R1_001.fastq.gz \
  $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L001_R2_001.fastq.gz
 #remove sai
 rm -f $(OUTDIR)/idp15352_1.sai $(OUTDIR)/idp15352_2.sai
 #convert sam to bam
 $(SAMTOOLS) view -b -T $(REF) -o $(OUTDIR)/idp15352_unsorted.bam $(OUTDIR)/idp15352.sam
 #remove the sam
 rm -f $(OUTDIR)/idp15352.sam
 #sort this bam
 $(SAMTOOLS) sort  $(OUTDIR)/idp15352_unsorted.bam $(OUTDIR)/idp15352_sorted
 #remove the unsorted bam
 rm -f $(OUTDIR)/idp15352_unsorted.bam
 #mark duplicates
 $(JAVA) -jar $(PICARDLIB)/MarkDuplicates.jar VALIDATION_STRINGENCY=LENIENT ASSUME_SORTED=true TMP_DIR=$(OUTDIR) \
  INPUT=$(OUTDIR)/idp15352_sorted.bam \
  OUTPUT=$(OUTDIR)/idp15352_dup.bam \
  METRICS_FILE=$(OUTDIR)/Project_WGS/Sample_Mother/Mother.metrics
 #remove before duplicate
 rm -f $(OUTDIR)/idp15352_sorted.bam
 #move to final bam
 mv $(OUTDIR)/idp15352_dup.bam $@



$(OUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L002.bam: $(REF).bwt \
 $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L002_R1_001.fastq.gz \
 $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L002_R2_001.fastq.gz
 #create directory
 mkdir -p $(OUTDIR)/Project_WGS/Sample_Mother
 #align fastq1 with BWA
 $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idm22432_1.sai $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L002_R1_001.fastq.gz
 #align fastq2 with BWA
 $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idm22432_2.sai $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L002_R2_001.fastq.gz
 #sampe those files
 $(BWA) sampe $(BWASAMPEFLAG)  -r '@RG ID:idm22432 CN:MyLaboratory LB:Mother PG:bwa SM:Mother PL:ILLUMINA' $(REF) \
  -f $(OUTDIR)/idm22432.sam \
  $(OUTDIR)/idm22432_1.sai \
  $(OUTDIR)/idm22432_2.sai \
  $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L002_R1_001.fastq.gz \
  $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L002_R2_001.fastq.gz
 #remove sai
 rm -f $(OUTDIR)/idm22432_1.sai $(OUTDIR)/idm22432_2.sai
 #convert sam to bam
 $(SAMTOOLS) view -b -T $(REF) -o $(OUTDIR)/idm22432_unsorted.bam $(OUTDIR)/idm22432.sam
 #remove the sam
 rm -f $(OUTDIR)/idm22432.sam
 #sort this bam
 $(SAMTOOLS) sort  $(OUTDIR)/idm22432_unsorted.bam $(OUTDIR)/idm22432_sorted
 #remove the unsorted bam
 rm -f $(OUTDIR)/idm22432_unsorted.bam
 #mark duplicates
 $(JAVA) -jar $(PICARDLIB)/MarkDuplicates.jar VALIDATION_STRINGENCY=LENIENT ASSUME_SORTED=true TMP_DIR=$(OUTDIR) \
  INPUT=$(OUTDIR)/idm22432_sorted.bam \
  OUTPUT=$(OUTDIR)/idm22432_dup.bam \
  METRICS_FILE=$(OUTDIR)/Project_WGS/Sample_Mother/Mother.metrics
 #remove before duplicate
 rm -f $(OUTDIR)/idm22432_sorted.bam
 #move to final bam
 mv $(OUTDIR)/idm22432_dup.bam $@




#creates a VCF for 'Mother'
$(OUTDIR)/Project_WGS/Sample_Mother/Mother.vcf.gz : $(OUTDIR)/Project_WGS/Sample_Mother/Mother.realigned.bam
 $(JAVA) -jar $(GATK) \
  -T UnifiedGenotyper \
  -glm BOTH \
  -baq OFF \
  -R $(REF) \
  -I $< \
  --dbsnp $(VCFDBSNP) \
  -o $(OUTDIR)/Project_WGS/Sample_Mother/Mother.vcf
 $(TABIX)/bgzip $(OUTDIR)/Project_WGS/Sample_Mother/Mother.vcf
 $(TABIX)/tabix -f -p vcf $(OUTDIR)/Project_WGS/Sample_Mother/Mother.vcf.gz
$(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.realigned.bam : $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.bam $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.bam.bai
 $(JAVA) -jar $(GATK) $(GATKFLAGS) \
  -T RealignerTargetCreator \
  -R $(REF) \
  -I $< \
  -o $<.intervals \
  --known $(VCFDBSNP)
 $(JAVA) -jar $(GATK) $(GATKFLAGS) \
  -T IndelRealigner \
  -R $(REF) \
  -I $< \
  -o $@ \
  -targetIntervals $<.intervals \
  --knownAlleles $(VCFDBSNP)
 rm -f $<.intervals


#creates a BAM for 'Daughter' by merging the other lanes
$(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.bam: $(OUTDIR)  \
  $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L001.bam \
  $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L002.bam
 #merge BAMS
 $(JAVA) -jar $(PICARDLIB)/MergeSamFiles.jar SORT_ORDER=coordinate OUTPUT=$@ \
  INPUT=$(OUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L001.bam \
  INPUT=$(OUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L002.bam
 #fix mate information
 $(JAVA) -jar $(PICARDLIB)/FixMateInformation.jar \
  INPUT=$@
 #validate
 $(JAVA) -jar $(PICARDLIB)/ValidateSamFile.jar \
  VALIDATE_INDEX=true \
  I=$@


#Creates a BAM for 'Project_WGS/Sample_Daughter/Daughter_TGACCA_L001'

$(OUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L001.bam: $(REF).bwt \
 $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L001_R1_001.fastq.gz \
 $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L001_R2_001.fastq.gz
 #create directory
 mkdir -p $(OUTDIR)/Project_WGS/Sample_Daughter
 #align fastq1 with BWA
 $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idp5176_1.sai $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L001_R1_001.fastq.gz
 #align fastq2 with BWA
 $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idp5176_2.sai $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L001_R2_001.fastq.gz
 #sampe those files
 $(BWA) sampe $(BWASAMPEFLAG)  -r '@RG ID:idp5176 CN:MyLaboratory LB:Daughter PG:bwa SM:Daughter PL:ILLUMINA' $(REF) \
  -f $(OUTDIR)/idp5176.sam \
  $(OUTDIR)/idp5176_1.sai \
  $(OUTDIR)/idp5176_2.sai \
  $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L001_R1_001.fastq.gz \
  $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L001_R2_001.fastq.gz
 #remove sai
 rm -f $(OUTDIR)/idp5176_1.sai $(OUTDIR)/idp5176_2.sai
 #convert sam to bam
 $(SAMTOOLS) view -b -T $(REF) -o $(OUTDIR)/idp5176_unsorted.bam $(OUTDIR)/idp5176.sam
 #remove the sam
 rm -f $(OUTDIR)/idp5176.sam
 #sort this bam
 $(SAMTOOLS) sort  $(OUTDIR)/idp5176_unsorted.bam $(OUTDIR)/idp5176_sorted
 #remove the unsorted bam
 rm -f $(OUTDIR)/idp5176_unsorted.bam
 #mark duplicates
 $(JAVA) -jar $(PICARDLIB)/MarkDuplicates.jar VALIDATION_STRINGENCY=LENIENT ASSUME_SORTED=true TMP_DIR=$(OUTDIR) \
  INPUT=$(OUTDIR)/idp5176_sorted.bam \
  OUTPUT=$(OUTDIR)/idp5176_dup.bam \
  METRICS_FILE=$(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.metrics
 #remove before duplicate
 rm -f $(OUTDIR)/idp5176_sorted.bam
 #move to final bam
 mv $(OUTDIR)/idp5176_dup.bam $@



$(OUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L002.bam: $(REF).bwt \
 $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L002_R1_001.fastq.gz \
 $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L002_R2_001.fastq.gz
 #create directory
 mkdir -p $(OUTDIR)/Project_WGS/Sample_Daughter
 #align fastq1 with BWA
 $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idp211744_1.sai $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L002_R1_001.fastq.gz
 #align fastq2 with BWA
 $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idp211744_2.sai $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L002_R2_001.fastq.gz
 #sampe those files
 $(BWA) sampe $(BWASAMPEFLAG)  -r '@RG ID:idp211744 CN:MyLaboratory LB:Daughter PG:bwa SM:Daughter PL:ILLUMINA' $(REF) \
  -f $(OUTDIR)/idp211744.sam \
  $(OUTDIR)/idp211744_1.sai \
  $(OUTDIR)/idp211744_2.sai \
  $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L002_R1_001.fastq.gz \
  $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L002_R2_001.fastq.gz
 #remove sai
 rm -f $(OUTDIR)/idp211744_1.sai $(OUTDIR)/idp211744_2.sai
 #convert sam to bam
 $(SAMTOOLS) view -b -T $(REF) -o $(OUTDIR)/idp211744_unsorted.bam $(OUTDIR)/idp211744.sam
 #remove the sam
 rm -f $(OUTDIR)/idp211744.sam
 #sort this bam
 $(SAMTOOLS) sort  $(OUTDIR)/idp211744_unsorted.bam $(OUTDIR)/idp211744_sorted
 #remove the unsorted bam
 rm -f $(OUTDIR)/idp211744_unsorted.bam
 #mark duplicates
 $(JAVA) -jar $(PICARDLIB)/MarkDuplicates.jar VALIDATION_STRINGENCY=LENIENT ASSUME_SORTED=true TMP_DIR=$(OUTDIR) \
  INPUT=$(OUTDIR)/idp211744_sorted.bam \
  OUTPUT=$(OUTDIR)/idp211744_dup.bam \
  METRICS_FILE=$(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.metrics
 #remove before duplicate
 rm -f $(OUTDIR)/idp211744_sorted.bam
 #move to final bam
 mv $(OUTDIR)/idp211744_dup.bam $@




#creates a VCF for 'Daughter'
$(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.vcf.gz : $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.realigned.bam
 $(JAVA) -jar $(GATK) \
  -T UnifiedGenotyper \
  -glm BOTH \
  -baq OFF \
  -R $(REF) \
  -I $< \
  --dbsnp $(VCFDBSNP) \
  -o $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.vcf
 $(TABIX)/bgzip $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.vcf
 $(TABIX)/tabix -f -p vcf $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.vcf.gz
(Of course, this workflow satisfies my needs; it may not satisfy yours.)

Et voilà! That's it!


Pierre

30 August 2012

Next Generation Sequencing, GNU-Make and .INTERMEDIATE

I gave a crash course about NGS to a few colleagues today. For my demonstration I wrote a simple Makefile. Basically, it downloads a subset of human chromosome 22, indexes it with bwa, generates a set of FASTQs with wgsim, aligns the FASTQs, generates the *.sai, the *.sam and the *.bam, sorts the BAM and calls the SNPs with mpileup.

SAMDIR=/usr/local/package/samtools-0.1.18
SAMTOOLS=$(SAMDIR)/samtools
BCFTOOLS=$(SAMDIR)/bcftools/bcftools
BWA=/usr/local/package/bwa-0.6.1/bwa

%.bam : %.sam
        $(SAMTOOLS) view -o $@ -b -S -T chr22.fa $<
%.bam.bai : %.bam
        $(SAMTOOLS) index $<

variations.vcf: variations.bcf
        $(BCFTOOLS) view $< > $@

variations.bcf : align.sorted.bam align.sorted.bam.bai
            $(SAMTOOLS) mpileup -uf chr22.fa $< | $(BCFTOOLS) view -bvcg - > $@


align.sam : random_1.sai random_2.sai  
        $(BWA) sampe chr22.fa $^ random_1.fq.gz random_2.fq.gz > $@

chr22.fa:
        curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr22.fa.gz" |\
        gunzip -c | grep -v NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN > $@


random_1.sai :  random_1.fq.gz chr22.fa.bwt
        $(BWA) aln -f $@ chr22.fa $<

random_2.sai :  random_2.fq.gz chr22.fa.bwt
        $(BWA) aln -f $@ chr22.fa $<

random_1.fq.gz random_2.fq.gz : chr22.fa
        $(SAMDIR)/misc/wgsim -N 1000000 $< random_1.fq random_2.fq > wgsim.output
        gzip --best random_1.fq random_2.fq

chr22.fa.bwt : chr22.fa
        $(BWA) index -a bwtsw $<

chr22.fa.fai :  chr22.fa
        $(SAMTOOLS) faidx $< 

align.sorted.bam : align.bam
        $(SAMTOOLS) sort $< align.sorted


clean:
        rm -f chr22.* *.bam *.vcf *.bcf *.sai *.gz *.fq *.bai  wgsim.output *.sam
I was asked about the weakness of "Make":
"The problem here is that you need to generate a bunch of (possibly huge) files that won't be of use later: the *.sam, the unsorted *.bam, the *.sai, etc... If one of those files is removed, everything will be re-built even if the final sorted.bam and the VCF exist."

I asked StackOverflow ( http://stackoverflow.com/questions/12199237 ) whether a solution to this problem existed, and I received a nice answer from Eldar Abusalimov:

Use the "intermediate files" feature of GNU Make:


Intermediate files are remade using their rules just like all other files. But intermediate files are treated differently in two ways.


The first difference is what happens if the intermediate file does not exist. If an ordinary file b does not exist, and make considers a target that depends on b, it invariably creates b and then updates the target from b. But if b is an intermediate file, then make can leave well enough alone. It won't bother updating b, or the ultimate target, unless some prerequisite of b is newer than that target or there is some other reason to update that target.


The second difference is that if make does create b in order to update something else, it deletes b later on after it is no longer needed. Therefore, an intermediate file which did not exist before make also does not exist after make. make reports the deletion to you by printing a rm -f command showing which file it is deleting.


Ordinarily, a file cannot be intermediate if it is mentioned in the makefile as a target or prerequisite. However, you can explicitly mark a file as intermediate by listing it as a prerequisite of the special target .INTERMEDIATE. This takes effect even if the file is mentioned explicitly in some other way.


You can prevent automatic deletion of an intermediate file by marking it as a secondary file. To do this, list it as a prerequisite of the special target .SECONDARY. When a file is secondary, make will not create the file merely because it does not already exist, but make does not automatically delete the file. Marking a file as secondary also marks it as intermediate.
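For instance (a sketch reusing the target names from the Makefile above), if one wanted to keep the unsorted align.bam around for debugging while still marking the other temporary files as disposable:

```make
# deleted automatically by make once align.sorted.bam is built:
.INTERMEDIATE : align.sam random_1.sai random_2.sai variations.bcf
# treated as intermediate for dependency purposes (its absence alone
# does not trigger a rebuild), but never auto-deleted:
.SECONDARY : align.bam
```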


So, adding a single line to the Makefile should be enough. I've added it to mine:
.INTERMEDIATE : align.sam random_1.sai random_2.sai align.bam variations.bcf

and I've invoked make:
[lindenb@srv-clc-02 20120830.demongs]$ make
curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr22.fa.gz" |\
    gunzip -c | grep -v N | head -n 100000 > chr22.fa
/usr/local/package/samtools-0.1.18/misc/wgsim -N 1000000 chr22.fa random_1.fq random_2.fq > wgsim.output
[wgsim_core] calculating the total length of the reference sequence...
[wgsim_core] 1 sequences, total length: 4999950
gzip -f --best random_1.fq random_2.fq
/usr/local/package/bwa-0.6.1/bwa index -a bwtsw chr22.fa
[bwa_index] Pack FASTA... 0.12 sec
[bwa_index] Construct BWT for the packed sequence...
[BWTIncCreate] textLength=9999900, availableWord=12703528
[bwt_gen] Finished constructing BWT in 5 iterations.
[bwa_index] 4.02 seconds elapse.
[bwa_index] Update BWT... 0.05 sec
[bwa_index] Pack forward-only FASTA... 0.08 sec
[bwa_index] Construct SA from BWT and Occ... 1.46 sec
[main] Version: 0.6.1-r104
[main] CMD: /usr/local/package/bwa-0.6.1/bwa index -a bwtsw chr22.fa
[main] Real time: 5.740 sec; CPU: 5.740 sec
/usr/local/package/bwa-0.6.1/bwa aln -f random_1.sai chr22.fa random_1.fq.gz
[bwa_aln] 17bp reads: max_diff = 2
[bwa_aln] 38bp reads: max_diff = 3
[bwa_aln] 64bp reads: max_diff = 4
[bwa_aln] 93bp reads: max_diff = 5
[bwa_aln] 124bp reads: max_diff = 6
[bwa_aln] 157bp reads: max_diff = 7
[bwa_aln] 190bp reads: max_diff = 8
[bwa_aln] 225bp reads: max_diff = 9
[bwa_aln_core] calculate SA coordinate... 26.07 sec
[bwa_aln_core] write to the disk... 0.08 sec
[bwa_aln_core] 262144 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 26.51 sec
[bwa_aln_core] write to the disk... 0.06 sec
[bwa_aln_core] 524288 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 26.80 sec
[bwa_aln_core] write to the disk... 0.08 sec
[bwa_aln_core] 786432 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 21.77 sec
[bwa_aln_core] write to the disk... 0.05 sec
[bwa_aln_core] 1000000 sequences have been processed.
[main] Version: 0.6.1-r104
[main] CMD: /usr/local/package/bwa-0.6.1/bwa aln -f random_1.sai chr22.fa random_1.fq.gz
[main] Real time: 104.702 sec; CPU: 104.675 sec
/usr/local/package/bwa-0.6.1/bwa aln -f random_2.sai chr22.fa random_2.fq.gz
[bwa_aln] 17bp reads: max_diff = 2
[bwa_aln] 38bp reads: max_diff = 3
[bwa_aln] 64bp reads: max_diff = 4
[bwa_aln] 93bp reads: max_diff = 5
[bwa_aln] 124bp reads: max_diff = 6
[bwa_aln] 157bp reads: max_diff = 7
[bwa_aln] 190bp reads: max_diff = 8
[bwa_aln] 225bp reads: max_diff = 9
[bwa_aln_core] calculate SA coordinate... 28.40 sec
[bwa_aln_core] write to the disk... 0.09 sec
[bwa_aln_core] 262144 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 28.94 sec
[bwa_aln_core] write to the disk... 0.07 sec
[bwa_aln_core] 524288 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 29.18 sec
[bwa_aln_core] write to the disk... 0.08 sec
[bwa_aln_core] 786432 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 23.07 sec
[bwa_aln_core] write to the disk... 0.05 sec
[bwa_aln_core] 1000000 sequences have been processed.
[main] Version: 0.6.1-r104
[main] CMD: /usr/local/package/bwa-0.6.1/bwa aln -f random_2.sai chr22.fa random_2.fq.gz
[main] Real time: 113.270 sec; CPU: 113.233 sec
/usr/local/package/bwa-0.6.1/bwa sampe chr22.fa random_1.sai random_2.sai random_1.fq.gz random_2.fq.gz > align.sam
[bwa_sai2sam_pe_core] convert to sequence coordinate...
[infer_isize] (25, 50, 75) percentile: (466, 500, 534)
[infer_isize] low and high boundaries: 330 and 670 for estimating avg and std
[infer_isize] inferred external isize from 220114 pairs: 499.899 +/- 49.771
[infer_isize] skewness: -0.006; kurtosis: -0.083; ap_prior: 1.00e-05
[infer_isize] inferred maximum insert size: 795 (5.93 sigma)
[bwa_sai2sam_pe_core] time elapses: 7.34 sec
[bwa_sai2sam_pe_core] changing coordinates of 6560 alignments.
[bwa_sai2sam_pe_core] align unmapped mate...
[bwa_paired_sw] 16796 out of 16796 Q17 singletons are mated.
[bwa_paired_sw] 27 out of 27 Q17 discordant pairs are fixed.
[bwa_sai2sam_pe_core] time elapses: 4.83 sec
[bwa_sai2sam_pe_core] refine gapped alignments... 0.97 sec
[bwa_sai2sam_pe_core] print alignments... 2.46 sec
[bwa_sai2sam_pe_core] 262144 sequences have been processed.
[bwa_sai2sam_pe_core] convert to sequence coordinate...
[infer_isize] (25, 50, 75) percentile: (466, 500, 534)
[infer_isize] low and high boundaries: 330 and 670 for estimating avg and std
[infer_isize] inferred external isize from 219840 pairs: 499.874 +/- 49.751
[infer_isize] skewness: 0.001; kurtosis: -0.072; ap_prior: 1.00e-05
[infer_isize] inferred maximum insert size: 795 (5.93 sigma)
[bwa_sai2sam_pe_core] time elapses: 7.36 sec
[bwa_sai2sam_pe_core] changing coordinates of 6668 alignments.
[bwa_sai2sam_pe_core] align unmapped mate...
[bwa_paired_sw] 16833 out of 16834 Q17 singletons are mated.
[bwa_paired_sw] 38 out of 38 Q17 discordant pairs are fixed.
[bwa_sai2sam_pe_core] time elapses: 4.78 sec
[bwa_sai2sam_pe_core] refine gapped alignments... 0.98 sec
[bwa_sai2sam_pe_core] print alignments... 2.55 sec
[bwa_sai2sam_pe_core] 524288 sequences have been processed.
[bwa_sai2sam_pe_core] convert to sequence coordinate...
[infer_isize] (25, 50, 75) percentile: (466, 500, 534)
[infer_isize] low and high boundaries: 330 and 670 for estimating avg and std
[infer_isize] inferred external isize from 220140 pairs: 500.062 +/- 49.780
[infer_isize] skewness: -0.000; kurtosis: -0.075; ap_prior: 1.00e-05
[infer_isize] inferred maximum insert size: 795 (5.93 sigma)
[bwa_sai2sam_pe_core] time elapses: 7.76 sec
[bwa_sai2sam_pe_core] changing coordinates of 6522 alignments.
[bwa_sai2sam_pe_core] align unmapped mate...
[bwa_paired_sw] 16804 out of 16806 Q17 singletons are mated.
[bwa_paired_sw] 33 out of 33 Q17 discordant pairs are fixed.
[bwa_sai2sam_pe_core] time elapses: 4.79 sec
[bwa_sai2sam_pe_core] refine gapped alignments... 1.07 sec
[bwa_sai2sam_pe_core] print alignments... 2.57 sec
[bwa_sai2sam_pe_core] 786432 sequences have been processed.
[bwa_sai2sam_pe_core] convert to sequence coordinate...
[infer_isize] (25, 50, 75) percentile: (466, 500, 534)
[infer_isize] low and high boundaries: 330 and 670 for estimating avg and std
[infer_isize] inferred external isize from 179161 pairs: 499.910 +/- 49.860
[infer_isize] skewness: -0.001; kurtosis: -0.075; ap_prior: 1.00e-05
[infer_isize] inferred maximum insert size: 796 (5.93 sigma)
[bwa_sai2sam_pe_core] time elapses: 6.22 sec
[bwa_sai2sam_pe_core] changing coordinates of 5351 alignments.
[bwa_sai2sam_pe_core] align unmapped mate...
[bwa_paired_sw] 13904 out of 13905 Q17 singletons are mated.
[bwa_paired_sw] 27 out of 27 Q17 discordant pairs are fixed.
[bwa_sai2sam_pe_core] time elapses: 3.97 sec
[bwa_sai2sam_pe_core] refine gapped alignments... 0.86 sec
[bwa_sai2sam_pe_core] print alignments... 2.10 sec
[bwa_sai2sam_pe_core] 1000000 sequences have been processed.
[main] Version: 0.6.1-r104
[main] CMD: /usr/local/package/bwa-0.6.1/bwa sampe chr22.fa random_1.sai random_2.sai random_1.fq.gz random_2.fq.gz
[main] Real time: 70.067 sec; CPU: 69.984 sec
/usr/local/package/samtools-0.1.18/samtools view -o align.bam -b -S -T chr22.fa align.sam
[samopen] SAM header is present: 1 sequences.
/usr/local/package/samtools-0.1.18/samtools sort align.bam align.sorted
/usr/local/package/samtools-0.1.18/samtools index align.sorted.bam
/usr/local/package/samtools-0.1.18/samtools mpileup -uf chr22.fa align.sorted.bam | /usr/local/package/samtools-0.1.18/bcftools/bcftools view -bvcg - > variations.bcf
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
[bcfview] 100000 sites processed.
[afs] 0:99762.176 1:154.256 2:83.568
[bcfview] 200000 sites processed.
[afs] 0:99790.512 1:146.516 2:62.972
[bcfview] 300000 sites processed.
[afs] 0:99750.990 1:143.270 2:105.740
[bcfview] 400000 sites processed.
[afs] 0:99764.146 1:156.783 2:79.071
[bcfview] 500000 sites processed.
[afs] 0:99749.220 1:177.647 2:73.133
[bcfview] 600000 sites processed.
[afs] 0:99758.419 1:160.146 2:81.435
[bcfview] 700000 sites processed.
[afs] 0:99769.451 1:155.968 2:74.581
[bcfview] 800000 sites processed.
[afs] 0:99761.914 1:149.704 2:88.382
[bcfview] 900000 sites processed.
[afs] 0:99746.200 1:158.466 2:95.334
[bcfview] 1000000 sites processed.
[afs] 0:99732.973 1:177.167 2:89.860
[bcfview] 1100000 sites processed.
[afs] 0:99815.394 1:109.322 2:75.284
[bcfview] 1200000 sites processed.
[afs] 0:99762.584 1:160.194 2:77.222
[bcfview] 1300000 sites processed.
[afs] 0:99748.201 1:155.610 2:96.189
[bcfview] 1400000 sites processed.
[afs] 0:99739.710 1:170.437 2:89.853
[bcfview] 1500000 sites processed.
[afs] 0:99729.049 1:176.956 2:93.995
[bcfview] 1600000 sites processed.
[afs] 0:99747.410 1:163.200 2:89.390
[bcfview] 1700000 sites processed.
[afs] 0:99751.841 1:171.932 2:76.228
[bcfview] 1800000 sites processed.
[afs] 0:99800.303 1:145.575 2:54.122
[bcfview] 1900000 sites processed.
[afs] 0:99828.697 1:126.069 2:45.234
[bcfview] 2000000 sites processed.
[afs] 0:99730.074 1:158.242 2:111.684
[afs] 0:87928.796 1:106.271 2:97.933
/usr/local/package/samtools-0.1.18/bcftools/bcftools view variations.bcf > variations.vcf
rm random_1.sai variations.bcf random_2.sai align.bam align.sam

Here is the last line of the output:
rm random_1.sai variations.bcf random_2.sai align.bam align.sam
'Make' has removed the intermediate files. The working directory now only contains the final files:
$ ls -la
total 168020
drwxr-xr-x  2 lindenb users    4096 Aug 30 17:46 .
drwxr-xr-x 12 lindenb users    4096 Aug 30 14:18 ..
-rw-r--r--  1 lindenb users 81537078 Aug 30 17:43 align.sorted.bam
-rw-r--r--  1 lindenb users    15096 Aug 30 17:43 align.sorted.bam.bai
-rw-r--r--  1 lindenb users  5099956 Aug 30 17:36 chr22.fa
-rw-r--r--  1 lindenb users      181 Aug 30 17:37 chr22.fa.amb
-rw-r--r--  1 lindenb users      41 Aug 30 17:37 chr22.fa.ann
-rw-r--r--  1 lindenb users  5000048 Aug 30 17:37 chr22.fa.bwt
-rw-r--r--  1 lindenb users      22 Aug 30 17:42 chr22.fa.fai
-rw-r--r--  1 lindenb users  1249989 Aug 30 17:37 chr22.fa.pac
-rw-r--r--  1 lindenb users  2500024 Aug 30 17:37 chr22.fa.sa
-rw-r--r--  1 lindenb users    1344 Aug 30 17:35 Makefile
-rw-r--r--  1 lindenb users 37831888 Aug 30 17:36 random_1.fq.gz
-rw-r--r--  1 lindenb users 37836209 Aug 30 17:36 random_2.fq.gz
-rw-r--r--  1 lindenb users  611760 Aug 30 17:46 variations.vcf
-rw-r--r--  1 lindenb users  101626 Aug 30 17:36 wgsim.output

That's it,

Pierre



27 August 2012

Reasoning with the Variation Ontology using Apache Jena #OWL #RDF

The Variation Ontology (VariO) "is an ontology for standardized, systematic description of effects, consequences and mechanisms of variations".
In this post I will use the Apache Jena library for RDF to load this ontology. A reasoner will then be used to extract, from a set of variations, those whose type is a subclass of a given VariO class.

Loading the ontology

The OWL ontology is available for download here: http://www.variationontology.org/download/VariO_0.979.owl. A new RDF model for an OWL ontology is created and the OWL file is loaded:
OntModel ontModel = ModelFactory.createOntologyModel();
InputStream in = FileManager.get().open(VO_OWL_URL);
ontModel.read(in, "");
in.close();

Creating a Reasoner

An OWL reasoner is then created and bound to the previous model:
Reasoner reasoner = ReasonerRegistry.getOWLReasoner();
reasoner = reasoner.bindSchema(ontModel);

Creating a random set of variations

A new RDF model is created to hold a few instances of random variations. For each instance, we add a random property 'my:chromosome', a random property 'my:position', and we associate one of the following types:
  • vo:VariO_0000029 "modified amino acid", a sub-class of vo:VariO_0000028 ("post translationally modified protein")
  • vo:VariO_0000030 "spliced protein", a sub-class of vo:VariO_0000028 ("post translationally modified protein")
  • vo:VariO_0000033 "effect on protein subcellular localization", which is NOT a sub-class of vo:VariO_0000028
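The reasoner's job here is essentially the transitive closure of rdfs:subClassOf: an instance typed vo:VariO_0000029 or vo:VariO_0000030 should also be reported as a vo:VariO_0000028, while vo:VariO_0000033 should not. Here is a minimal plain-Java sketch of that walk (an illustration only, not Jena's implementation; the map holds only the three VariO assertions used in this post):

```java
import java.util.HashMap;
import java.util.Map;

public class SubClassSketch {
    // direct rdfs:subClassOf assertions, as stated in the VariO OWL file
    static final Map<String, String> PARENT = new HashMap<>();
    static {
        PARENT.put("VariO_0000029", "VariO_0000028"); // modified amino acid
        PARENT.put("VariO_0000030", "VariO_0000028"); // spliced protein
        // VariO_0000033 has no path to VariO_0000028
    }

    /** walk the subClassOf chain: is 'child' a (transitive) subclass of 'ancestor'? */
    static boolean isSubClassOf(String child, String ancestor) {
        for (String c = PARENT.get(child); c != null; c = PARENT.get(c)) {
            if (c.equals(ancestor)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isSubClassOf("VariO_0000029", "VariO_0000028")); // true
        System.out.println(isSubClassOf("VariO_0000033", "VariO_0000028")); // false
    }
}
```

Jena's OWL reasoner performs this (and much more) for us once the ontology has been bound as a schema.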
Random rand=new Random();
com.hp.hpl.jena.rdf.model.Model instances = ModelFactory.createDefaultModel();
instances.setNsPrefix("vo",VO_PREFIX);
instances.setNsPrefix("my",MY_URI);

for(int i=0;i< 10;++i)
 {
 Resource subject= null;
 Resource rdftype=null;
 switch(i%3)
    {
    case 0:
       {
       //modified amino acid
       subject=instances.createResource(AnonId.create("modaa_"+i));
       rdftype=instances.createResource(VO_PREFIX+"VariO_0000029");
       break;
       }
    case 1:
       {
       //spliced protein
       subject=instances.createResource(AnonId.create("spliced_"+i));
       rdftype=instances.createResource(VO_PREFIX+"VariO_0000030");
       break;
       }
    default:
       {
       //effect on protein subcellular localization
       subject=instances.createResource(AnonId.create("subcell_"+i));
       rdftype=instances.createResource(VO_PREFIX+"VariO_0000033");
       break;
       }
    }
 instances.add(subject, RDF.type, rdftype);
 instances.add(subject, hasChromosome, instances.createLiteral("chr"+(1+rand.nextInt(22))));
 instances.add(subject, hasPosition, instances.createTypedLiteral(rand.nextInt(1000000)));
 }

Reasoning

A new inference model is created from the reasoner and the instances of variation. An iterator then lists only the variations that the reasoner infers to be instances of vo:VariO_0000028 and that have both a "my:chromosome" and a "my:position" property.
InfModel model = ModelFactory.createInfModel (reasoner, instances);
ExtendedIterator<Statement> sti = model.listStatements(
        null, null, model.createResource(VO_PREFIX+"VariO_0000028"));
sti=sti.filterKeep(new Filter<Statement>()
      {
      @Override
      public boolean accept(Statement stmt)
         {
         return   stmt.getSubject().getProperty(hasChromosome)!=null &&
               stmt.getSubject().getProperty(hasPosition)!=null
               ;
         }
      });
We loop over the iterator and print the results:
while(sti.hasNext() )
         {
         Statement stmt = sti.next();
         System.out.println("\t+ " + PrintUtil.print(stmt));
         Statement val=stmt.getSubject().getProperty(hasChromosome);
         System.out.println("\t\tChromosome:\t"+val.getObject());
         val=stmt.getSubject().getProperty(hasPosition);
         System.out.println("\t\tPosition:\t"+val.getObject());
         }

Result

   + (spliced_7 rdf:type http://purl.obolibrary.org/obo/VariO_0000028)
      Chromosome:   chr7
      Position:   134172^^http://www.w3.org/2001/XMLSchema#int
   + (spliced_4 rdf:type http://purl.obolibrary.org/obo/VariO_0000028)
      Chromosome:   chr13
      Position:   674316^^http://www.w3.org/2001/XMLSchema#int
   + (spliced_1 rdf:type http://purl.obolibrary.org/obo/VariO_0000028)
      Chromosome:   chr22
      Position:   457596^^http://www.w3.org/2001/XMLSchema#int
   + (modaa_9 rdf:type http://purl.obolibrary.org/obo/VariO_0000028)
      Chromosome:   chr12
      Position:   803303^^http://www.w3.org/2001/XMLSchema#int
   + (modaa_6 rdf:type http://purl.obolibrary.org/obo/VariO_0000028)
      Chromosome:   chr15
      Position:   794137^^http://www.w3.org/2001/XMLSchema#int
   + (modaa_3 rdf:type http://purl.obolibrary.org/obo/VariO_0000028)
      Chromosome:   chr14
      Position:   34487^^http://www.w3.org/2001/XMLSchema#int
   + (modaa_0 rdf:type http://purl.obolibrary.org/obo/VariO_0000028)
      Chromosome:   chr15
      Position:   536371^^http://www.w3.org/2001/XMLSchema#int

Full source code

import java.io.IOException;
import java.io.InputStream;
import java.util.Random;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.hp.hpl.jena.ontology.OntModel;
import com.hp.hpl.jena.ontology.OntModelSpec;
import com.hp.hpl.jena.rdf.model.AnonId;
import com.hp.hpl.jena.rdf.model.InfModel;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Property;
import com.hp.hpl.jena.rdf.model.Resource;
import com.hp.hpl.jena.rdf.model.Statement;
import com.hp.hpl.jena.reasoner.Reasoner;
import com.hp.hpl.jena.reasoner.ReasonerRegistry;
import com.hp.hpl.jena.util.FileManager;
import com.hp.hpl.jena.util.PrintUtil;
import com.hp.hpl.jena.util.iterator.ExtendedIterator;
import com.hp.hpl.jena.util.iterator.Filter;
import com.hp.hpl.jena.vocabulary.RDF;

public class VariationOntologyReasoner
  {
  private static final String VO_PREFIX="http://purl.obolibrary.org/obo/";
  private static final String MY_URI="urn:my:ontology";
  private static final String VO_OWL_URL="http://www.variationontology.org/download/VariO_0.979.owl";
  private Reasoner reasoner;
  static final private Property hasChromosome=ModelFactory.createDefaultModel().createProperty(MY_URI,"chromosome");
  static final private Property hasPosition=ModelFactory.createDefaultModel().createProperty(MY_URI,"position");
  
  private VariationOntologyReasoner() throws IOException
    {
    OntModel ontModel = ModelFactory.createOntologyModel();
    InputStream in = FileManager.get().open(VO_OWL_URL);
    ontModel.read(in, "");
    in.close();
    this.reasoner = ReasonerRegistry.getOWLReasoner();
    this.reasoner=this.reasoner.bindSchema(ontModel);
    }
  
  private void run()
    {
    Random rand=new Random();
    com.hp.hpl.jena.rdf.model.Model instances = ModelFactory.createDefaultModel();
    instances.setNsPrefix("vo",VO_PREFIX);
    instances.setNsPrefix("my",MY_URI);

    for(int i=0;i< 10;++i)
      {
      Resource subject= null;
      Resource rdftype=null;
      switch(i%3)
        {
        case 0:
          {
          //modified amino acid
          subject=instances.createResource(AnonId.create("modaa_"+i));
          rdftype=instances.createResource(VO_PREFIX+"VariO_0000029");
          break;
          }
        case 1:
          {
          subject=instances.createResource(AnonId.create("spliced_"+i));
          rdftype=instances.createResource(VO_PREFIX+"VariO_0000030");
          break;
          }
        default:
          {
          //effect on protein subcellular localization
          subject=instances.createResource(AnonId.create("subcell_"+i));
          rdftype=instances.createResource(VO_PREFIX+"VariO_0000033");
          break;
          }
        }
      instances.add(subject, RDF.type, rdftype);
      instances.add(subject, hasChromosome, instances.createLiteral("chr"+(1+rand.nextInt(22))));
      instances.add(subject, hasPosition, instances.createTypedLiteral(rand.nextInt(1000000)));
      }
    
    InfModel model = ModelFactory.createInfModel (reasoner, instances);
    ExtendedIterator<Statement> sti = model.listStatements(null, null, model.createResource(VO_PREFIX+"VariO_0000028"));
    sti=sti.filterKeep(new Filter<Statement>()
        {
        @Override
        public boolean accept(Statement stmt)
          {
          return  stmt.getSubject().getProperty(hasChromosome)!=null &&
              stmt.getSubject().getProperty(hasPosition)!=null
              ;
          }
        });
    while(sti.hasNext() )
      {
      Statement stmt = sti.next();
      System.out.println("\t+ " + PrintUtil.print(stmt));
      Statement val=stmt.getSubject().getProperty(hasChromosome);
      System.out.println("\t\tChromosome:\t"+val.getObject());
      val=stmt.getSubject().getProperty(hasPosition);
      System.out.println("\t\tPosition:\t"+val.getObject());
      }
    }
  
  public static void main(String[] args) throws Exception
    {
    VariationOntologyReasoner app=new VariationOntologyReasoner();
    app.run();
    }
  }

That's it,

Pierre






14 August 2012

Apache Pig: first contact with some 'bio' data.

via wikipedia: "Apache Pig is a high-level platform for creating MapReduce programs used with Hadoop. The language for this platform is called Pig Latin. Pig Latin abstracts the programming from the Java MapReduce idiom into a notation which makes MapReduce programming high level, similar to that of SQL for RDBMS systems.".

Here I've played with Pig using a local datastore (that is, not a distributed environment) to handle some data from the UCSC database.

Download and Install

Install Pig:
$ wget "http://mirror.cc.columbia.edu/pub/software/apache/pig/pig-0.10.0/pig-0.10.0.tar.gz"
$ tar xvfz pig-0.10.0.tar.gz
$ rm pig-0.10.0.tar.gz
$ cd pig-0.10.0
$ export PIG_INSTALL=${PWD}
$ export JAVA_HOME=/your/path/to/jdk1.7

Download 'knownGene' from the UCSC:
$ wget "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/knownGene.txt.gz"
$ gunzip knownGene.txt.gz

Start the command line interface

Run Pig's Grunt shell in local mode:
$ pig -x local 
2012-08-15 10:37:25,247 [main] INFO  org.apache.pig.Main - Apache Pig version 0.10.0 (r1328203) compiled Apr 19 2012, 22:54:12
2012-08-15 10:37:25,248 [main] INFO  org.apache.pig.Main - Logging error messages to: /home/lindenb/tmp/HADOOP/pig-0.10.0/pig_1344933445243.log
2012-08-15 10:37:25,622 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///

Getting the number of Genes by Chromosome

knownGenes = LOAD '/home/lindenb/tmp/HADOOP/knownGene.txt' as 
  ( name:chararray ,  chrom:chararray ,  strand:chararray ,  txStart:int , 
    txEnd:int ,  cdsStart:int ,  cdsEnd:int ,  exonCount:chararray ,
     exonStarts:chararray ,  exonEnds:chararray ,  proteinID:chararray , 
     alignID:chararray );
Keep the genes coding for a protein:
coding = FILTER knownGenes BY cdsStart < cdsEnd ;
Remove some columns:
coding = FOREACH coding GENERATE chrom,cdsStart,cdsEnd,strand,name;
Group by chromosome:
C = GROUP coding by chrom;
Filter out some chromosomes:
C = FILTER C by NOT(
  group matches  '.*_random' OR
  group matches  'chrUn_.*' OR
  group matches '.*hap.*'
  );
Get the count of genes for each chromosome:
D= FOREACH C GENERATE
 group as CHROM,
 COUNT(coding.name) as numberOfGenes
 ;
And dump the data:
dump D;
Result:
(chr1,6139)
(chr2,3869)
(chr3,3406)
(chr4,2166)
(chr5,2515)
(chr6,3021)
(chr7,2825)
(chr8,1936)
(chr9,2338)
(chrM,1)
(chrX,2374)
(chrY,318)
(chr10,2470)
(chr11,3579)
(chr12,3066)
(chr13,924)
(chr14,2009)
(chr15,1819)
(chr16,2510)
(chr17,3396)
(chr18,862)
(chr19,3784)
(chr20,1570)
(chr21,663)
(chr22,1348)
Interestingly, nothing seems to happen until you ask to dump the data.
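Pig only builds a logical plan from the preceding statements and executes it when a result is actually requested. The behaviour is much like the lazy intermediate operations of a java.util.stream pipeline, where nothing runs until a terminal operation plays the role of 'dump' (an analogy only, not how Pig is implemented):

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

public class LazySketch {
    /** returns {elements visited before the terminal op, after it, and the count} */
    static int[] demo() {
        AtomicInteger visited = new AtomicInteger();
        // peek() and filter() are intermediate operations: like Pig's
        // LOAD/FILTER/GROUP statements, they only describe the plan
        Stream<Integer> plan = Stream.of(1, 2, 3, 4, 5)
                .peek(i -> visited.incrementAndGet())
                .filter(i -> i % 2 == 0);
        int before = visited.get();  // still 0: nothing has been executed
        long count = plan.count();   // the terminal operation is our 'dump'
        return new int[]{before, visited.get(), (int) count};
    }

    public static void main(String[] args) {
        int[] r = demo();
        System.out.println(r[0] + " visited before, " + r[1] + " after, count=" + r[2]);
    }
}
```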

Finding the overlapping genes

I want to get the list of pairs of overlapping genes that are on opposite strands. I haven't found a quick way to join two tables using a complex criterion.

Create two identical lists of genes, E1 and E2, and add an extra column "1" that will be used to join the two tables.

E1 = FOREACH coding GENERATE 1 as pivot , $0 , $1 , $2 , $3, $4;
E2 = FOREACH coding GENERATE 1 as pivot , $0 , $1 , $2 , $3, $4;
Join the tables using the extra column.
E3 = join E1 by pivot, E2 by pivot;
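A join on a constant key is simply a Cartesian product of the two relations: N input rows produce N×N joined rows, which also explains why this step is so expensive. A plain-Java sketch of the same idea (an illustration only, with strings standing in for gene records):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class CrossJoinSketch {
    /** the 'pivot = 1' join is just the Cartesian product E1 × E2 */
    static List<String> crossJoin(List<String> e1, List<String> e2) {
        return e1.stream()
                 .flatMap(a -> e2.stream().map(b -> a + "|" + b))
                 .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> genes = Arrays.asList("geneA", "geneB");
        // every pair from E1 × E2: 2 rows in, 4 rows out
        System.out.println(crossJoin(genes, genes));
    }
}
```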
Extract and rename the fields from the join:
E3 = FOREACH E3 generate 
 $1 as chrom1, $2 as start1, $3 as end1, $4 as strand1, $5 as name1,
 $7 as chrom2, $8 as start2, $9 as end2, $10 as strand2, $11 as name2
 ;
At this point, the data in E3 look like this:
(...)
(chr1,664484,665108,-,uc009vjm.3,chr1,324342,325605,+,uc001aau.3)
(chr1,664484,665108,-,uc009vjm.3,chr1,664484,665108,-,uc001abe.4)
(chr1,664484,665108,-,uc009vjm.3,chr1,324342,325605,+,uc009vjk.2)
(chr1,664484,665108,-,uc009vjm.3,chr1,664484,665108,-,uc009vjm.3)
(chr1,664484,665108,-,uc009vjm.3,chr1,12189,13639,+,uc010nxq.1)
(chr1,664484,665108,-,uc009vjm.3,chr1,367658,368597,+,uc010nxu.2)
(chr1,664484,665108,-,uc009vjm.3,chr1,621095,622034,-,uc010nxv.2)
(chr1,664484,665108,-,uc009vjm.3,chr1,324514,325605,+,uc021oeh.1)
(chr1,664484,665108,-,uc009vjm.3,chr1,327745,328213,+,uc021oei.1)
(...)
Extract the overlapping genes:
E3= FILTER E3 BY
    name1 < name2 AND
    chrom1==chrom2 AND
    strand1!=strand2 AND
    NOT(end1 < start2 OR end2 < start1);
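The last condition is the classic interval-overlap test: two intervals overlap unless one ends before the other starts. A plain-Java sketch of the same predicate (illustration only; like the Pig expression above, coordinates are treated as inclusive, so touching endpoints count as overlapping):

```java
public class OverlapSketch {
    /** NOT(end1 < start2 OR end2 < start1): overlap unless one ends before the other starts */
    static boolean overlaps(int start1, int end1, int start2, int end2) {
        return !(end1 < start2 || end2 < start1);
    }

    public static void main(String[] args) {
        System.out.println(overlaps(100, 200, 150, 250)); // true
        System.out.println(overlaps(100, 200, 300, 400)); // false
    }
}
```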
and dump the result:
dump E3;
After a few hours the result is computed:
(...)
(chr9,119188129,120177216,-,uc004bjt.2,chr9,119460021,119461983,+,uc004bjw.2)
(chr9,119188129,120177216,-,uc004bjt.2,chr9,119460021,119461983,+,uc004bjx.2)
(chr9,119188129,120177216,-,uc004bjt.2,chr9,119460021,119461983,+,uc022bmo.1)
(chr9,119460021,119461983,+,uc004bjw.2,chr9,119188129,119903719,-,uc022bml.1)
(chr9,119460021,119461983,+,uc004bjw.2,chr9,119188129,119903719,-,uc022bmm.1)
(chr9,119460021,119461983,+,uc004bjx.2,chr9,119188129,119903719,-,uc022bml.1)
(chr9,119460021,119461983,+,uc004bjx.2,chr9,119188129,119903719,-,uc022bmm.1)
(chr9,129724568,129981048,+,uc004bqo.2,chr9,129851217,129871010,-,uc004bqr.1)
(chr9,129724568,129981048,+,uc004bqo.2,chr9,129851217,129856116,-,uc010mxg.1)
(chr9,129724568,129979280,+,uc004bqq.4,chr9,129851217,129871010,-,uc004bqr.1)
(chr9,129724568,129979280,+,uc004bqq.4,chr9,129851217,129856116,-,uc010mxg.1)
(chr9,129851217,129871010,-,uc004bqr.1,chr9,129724568,129940183,+,uc022bno.1)
(chr9,129851217,129871010,-,uc004bqr.1,chr9,129724568,129946390,+,uc011mab.2)
(chr9,129851217,129871010,-,uc004bqr.1,chr9,129724568,129979280,+,uc011mac.2)
(chr9,129851217,129856116,-,uc010mxg.1,chr9,129724568,129940183,+,uc022bno.1)
(chr9,129851217,129856116,-,uc010mxg.1,chr9,129724568,129946390,+,uc011mab.2)
(chr9,129851217,129856116,-,uc010mxg.1,chr9,129724568,129979280,+,uc011mac.2)
(chr9,130455527,130477918,-,uc004brm.3,chr9,130469310,130476184,+,uc004brn.1)
(chr9,131703812,131719311,+,uc004bwq.1,chr9,131707965,131709582,-,uc004bwr.3)
(chrX,11156982,11445715,-,uc004cun.1,chrX,11312908,11318732,+,uc004cus.3)
(chrX,11156982,11445715,-,uc004cun.1,chrX,11312908,11318732,+,uc004cut.3)
(chrX,11156982,11445715,-,uc004cun.1,chrX,11312908,11318732,+,uc004cuu.3)
(chrX,11156982,11682948,-,uc004cup.1,chrX,11312908,11318732,+,uc004cus.3)
(chrX,11156982,11682948,-,uc004cup.1,chrX,11312908,11318732,+,uc004cut.3)
(chrX,11156982,11682948,-,uc004cup.1,chrX,11312908,11318732,+,uc004cuu.3)
(...)
That was very slow. There might be a better way to do this, and I wonder whether using a Hadoop filesystem would really speed up the computation. For now, I'll continue to use a SQL database for such a small amount of data.

That's it.

Pierre