08 September 2014

It's a kind of magic(5)

The following question was recently asked on biostars.org, "Tool that detects data types": "More specifically I am interested for detecting between SAM, BAM, FASTA, FASTQ (if possible discriminating between these types), BED, BED12, BED15, GFF, GFF2, GFF3. The detection should be performed without just looking at the extension of the file...."
I suggested creating a new magic file for the Linux file command. For example, a BAM file, once decompressed, starts (offset 0) with the magic bytes BAM\1. A magic file named 'bam' describing BAM would contain:

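0 string BAM\1 BAM file v1.0

Compile the new magic file (this produces bam.mgc):

file -C -m bam

Now use this magic. The -z option tells file to look inside compressed data, which is needed because a BAM file is compressed:

file -z -m bam.mgc file.bam
file.bam: BAM file v1.0 (data)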
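
To make it clearer what this rule tests, here is a minimal Python sketch of the same check: decompress the beginning of the file and compare the first four bytes with BAM\1. It only illustrates the idea and is not how file itself is implemented. The names has_bam_magic, is_bam and FIRST_BLOCK are made up for this example.

```python
import gzip
import io
import zlib

# The four bytes the magic rule "0 string BAM\1" matches at offset 0.
BAM_MAGIC = b"BAM\x01"

# Assumption for this sketch: reading the first 64 KiB is enough, because
# a BGZF block (the gzip-compatible container used by BAM) is at most 64 KiB.
FIRST_BLOCK = 65536


def has_bam_magic(data: bytes) -> bool:
    """Return True if the gzip-decompressed data starts with BAM\\1."""
    try:
        with gzip.GzipFile(fileobj=io.BytesIO(data)) as stream:
            return stream.read(len(BAM_MAGIC)) == BAM_MAGIC
    except (OSError, EOFError, zlib.error):
        # Not gzip data, or too damaged to decompress: not a BAM file.
        return False


def is_bam(path: str) -> bool:
    """Apply the check to the beginning of a file on disk."""
    with open(path, "rb") as handle:
        return has_bam_magic(handle.read(FIRST_BLOCK))
```

Calling is_bam("file.bam") should agree with the file command above for a well-formed BAM file. Uncompressed text such as a SAM file is rejected because it is not gzip data.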

I've now started a GitHub repo containing some 'magic' patterns for bioinformatics (FASTQ, BLAST, BAM, bigWig, etc.).
(My current problem is prioritizing some results, such as differentiating a 'Nucleotide' from a 'Protein' FASTA sequence: http://unix.stackexchange.com/questions/154096/file1-and-magic5-prioritizing-a-result.)
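
The nucleotide versus protein question is hard to phrase as a fixed pattern at a fixed offset, because the answer depends on the overall composition of the sequence. Outside of file, a rough heuristic is easy to write. The Python sketch below counts how many residues are nucleotide letters. The function name classify_fasta, the letter set and the 0.9 threshold are all assumptions chosen for illustration, not part of the magic patterns in the repo.

```python
# Letters treated as nucleotides in this sketch (assumption: A, C, G, T, U
# plus N for unknown bases; other IUPAC ambiguity codes are ignored).
NUCLEOTIDE_LETTERS = set("ACGTUN")


def classify_fasta(text: str, threshold: float = 0.9) -> str:
    """Return 'nucleotide', 'protein' or 'unknown' for FASTA-formatted text.

    The 0.9 threshold is an arbitrary value picked for illustration.
    """
    if not text.startswith(">"):
        return "unknown"  # does not look like FASTA at all

    # Collect the sequence letters, skipping the '>' header lines.
    residues = [
        ch
        for line in text.splitlines()
        if not line.startswith(">")
        for ch in line.strip().upper()
        if ch.isalpha()
    ]
    if not residues:
        return "unknown"  # headers only, no sequence to inspect

    nucleotide_count = sum(ch in NUCLEOTIDE_LETTERS for ch in residues)
    fraction = nucleotide_count / len(residues)
    return "nucleotide" if fraction >= threshold else "protein"
```

A heuristic like this can still be fooled: a short peptide made only of A, C, G and T would be reported as nucleotide.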

That's it,
Pierre

1 comment:

Heikki said...

Great idea. As you noticed (protein versus nucleotide), you'll run into the limitations of the file utility pretty soon.

I suggest that you wrap this into a (bash) script, say biofile, that extends this project's capabilities by scripting the necessary functions. The Bio* projects have plenty of code examples illustrating practical algorithms to choose from.