
Monday, July 16, 2012

Custom serialization of big DNA or proteins

Lately I've been experimenting with serialization engines in Pharo. Besides the "traditional" alternative (SmartReferenceStream), I took the chance to evaluate Fuel, a new serializer which is nicely documented and supported. Mariano Martinez Peck and Martin Dias (the Fuel developers) answered my questions and requirements privately and very quickly, so thanks to them I can now show you an interesting feature in BioSmalltalk.
A typical serialization in bioinformatics involves a huge group of sequences or a big XML tree, so one of my requirements is to customize the serialization strategy to save precious memory. This means changing the serializer on the fly when a particular object is found in a graph of objects; specifically, if a DNA or protein sequence above a particular size threshold is found, you certainly would like to zip it. The following example serializes an Array containing a random object ('hello') and chromosome 28 of the chicken:
objectToSerialize := Array with: 'hello' with: (FileStream readOnlyFileNamed: 'GGA28.fa') contents.
threshold := 1000.

FileStream forceNewFileNamed: 'demo.fuel' do: [ :aStream |
   aSerializer := FLSerializer newDefault.
   aSerializer analyzer
       when: [ :o | o isString and: [ o size > threshold and: [ (BioParser tokenizeFasta: o) second isDNASequence ] ] ]
       substituteBy: [ :o | o zipped ].
   aSerializer        
       serialize: objectToSerialize
       on: aStream binary ].
And of course, the corresponding materialization:
result := FileStream oldFileNamed: 'demo.fuel' do: [ :aStream |
 aMaterialization := FLMaterializer newDefault materializeFrom: aStream binary.
 zippedStrings := aMaterialization objects select: [:o | o isString and: [ o isDNASequence ]].
 unzippedStrings := zippedStrings collect: [:o | o unzipped ].
 zippedStrings elementsExchangeIdentityWith: unzippedStrings.
 aMaterialization root ].
Looking at the possibilities, many custom DNA (or even XML) compression algorithms could be attached and used if saving space becomes an issue in your bioinformatics experiments.
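As a rough illustration of the savings such a pluggable compression hook can buy, here is a sketch (in Python, not BioSmalltalk) of 2-bit packing, one of the simplest custom DNA compressions. It assumes an uppercase A/C/G/T-only alphabet; real sequences with ambiguity codes (N, R, Y, ...) would need a larger code table.

```python
# Illustrative 2-bit packing of an A/C/G/T-only sequence.
# Four bases fit in one byte, so the packed form is ~1/4 the raw string.

CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

def pack_dna(seq: str) -> bytes:
    """Pack an uppercase A/C/G/T string into 2 bits per base."""
    out = bytearray()
    byte = 0
    for i, base in enumerate(seq):
        byte = (byte << 2) | CODES[base]
        if i % 4 == 3:          # every 4 bases, flush one byte
            out.append(byte)
            byte = 0
    rem = len(seq) % 4
    if rem:                     # left-align the trailing partial byte
        out.append(byte << (2 * (4 - rem)))
    return bytes(out)

def unpack_dna(packed: bytes, length: int) -> str:
    """Recover the original string, given its length."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])

seq = "GGGAATTCTTGGGGTGCTGGGATCTTTTTGGGGTTGGAAAGAAAATGGCC"
packed = pack_dna(seq)
assert unpack_dna(packed, len(seq)) == seq
assert len(packed) < len(seq) // 3   # roughly 4x smaller
```

A hook like the #when:substituteBy: one shown above could swap in a packed representation in exactly the same way it swaps in the zipped string.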

Sunday, July 15, 2012

Using the BioEntrezClient

Sometimes you have a list of accession numbers and want to get the corresponding FASTA sequences. This is the way to do it:

| result |
    
result := BioEntrezClient new nuccore
                uids: #('AB177765.1' 'AB177791.1');
                setFasta;
                fetch.

result outputToFile: 'nuccore_seqs.fasta'

The script will download the sequences in one trip via the NCBI Entrez API. If you just wanted the GenBank format, send #setGb instead of #setFasta above. The default format is ASN.1, which is a "hard" format for bioinformaticians. To download PubMed records from UIDs, the following is a simple possible script:

| result |

result := 
 BioEntrezClient new pubmed 
  uids: #( 11877539 11822933 );
  fetch.
result outputToFile: 'fetchPubmed-01.txt'
 
And pretty much the same goes for the spelling service, using the PubMed Central database:

| result |
 
result := 
 BioEntrezClient new pmc 
  term: 'fiberblast cell grwth';
  spell.
result outputToFile: 'eSpell1.xml'
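Under the hood, these clients talk to the NCBI E-utilities over HTTP. For comparison, here is a sketch in Python of the raw URLs behind calls like the ones above. The parameter names (db, id, rettype, retmode, term) are the documented E-utilities ones; how BioEntrezClient maps onto them internally is my assumption.

```python
# Sketch of the raw NCBI E-utilities URLs behind requests like the above.
# Only builds the URLs; actually issuing them needs network access.
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def efetch_url(db, uids, rettype="fasta", retmode="text"):
    """URL for efetch: download records for a list of UIDs in one trip."""
    params = {"db": db, "id": ",".join(uids),
              "rettype": rettype, "retmode": retmode}
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"

def espell_url(db, term):
    """URL for espell: spelling suggestions for a search term."""
    return f"{EUTILS}/espell.fcgi?{urlencode({'db': db, 'term': term})}"

print(efetch_url("nuccore", ["AB177765.1", "AB177791.1"]))
print(espell_url("pmc", "fiberblast cell grwth"))
```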

With some classes you may want to view the possible messages you can send, for example to get the list of databases available through Entrez. This is easily done with the reflection capabilities of Smalltalk:

BioEntrezClient organization
   listAtCategoryNamed: 'accessing public - databases'

Monday, March 19, 2012

Filtering a FASTA file

I wondered how Smalltalk would translate a common and simple FASTA file filtering problem, so I randomly picked a question from the BioStar community to test: http://biostar.stackexchange.com/questions/1710/extract-a-group-of-fasta-sequences-from-a-file. To replicate the problem, I created a dummy FASTA file named 'Test-Sequences01.fasta' and moved it to the BioSmalltalkTestFiles subdirectory:
>1
GGGAATTCTTGGGGTGCTGGGATCTTTTTGGGGTTGGAAAGAAAATGGCC
GTACTGTTATATTGTTGGGGTGGGAACCCGGGGTGGGGGGAGGGAATTTA
>2
GGGAATTCTTGGGGTGCTGGGATCTTTTTGGGGTTGGAAAGAAAATGGCC
GTACTGTTATATTGTTGGGGTGGGAACCCGGGGTGGGGGGAGGGAATTTC
>3
GGGAATTCTTGGGGTGCTGGGATCTTTTTGGGGTTGGAAAGAAAATGGCC
GTACTGTTATATTGTTGGGGTGGGAACCCGGGGTGGGGGGAGGGAATTTG
>4
GGGAATTCTTGGGGTGCTGGGATCTTTTTGGGGTTGGAAAGAAAATGGCC
GTACTGTTATATTGTTGGGGTGGGAACCCGGGGTGGGGGGAGGGAATTTT
>5
GGGAATTCTTGGGGTGCTGGGATCTTTTTGGGGTTGGAAAGAAAATGGCC
GTACTGTTATATTGTTGGGGTGGGAACCCGGGGTGGGGGGAGGGAATTAG
>6
GGGAATTCTTGGGGTGCTGGGATCTTTTTGGGGTTGGAAAGAAAATGGCC
GTACTGTTATATTGTTGGGGTGGGAACCCGGGGTGGGGGGAGGGAATTCG
>7
GGGAATTCTTGGGGTGCTGGGATCTTTTTGGGGTTGGAAAGAAAATGGCC
GTACTGTTATATTGTTGGGGTGGGAACCCGGGGTGGGGGGAGGGAAGGGG
Suppose I want to filter the FASTA entries numbered 1, 2, 5 and 7. This could be a solution using BioSmalltalk:
| idSubset fastaRecords |

idSubset := #(1 2 5 7) collect: #asString.
fastaRecords :=
  BioParser parseMultiFasta: ( BioFASTAFile on: 'BioSmalltalkTestFiles\Test-Sequences01.fasta' ) contents.

( fastaRecords select: [: fRecord | idSubset includes: fRecord sequenceName ] ) outputToFile: 'Filtered-Sequences.fa'

Note the following differences with the accepted BioPython script.
SeqIO.parse(open(fasta_file),'fasta')
In BioSmalltalk, instead of specifying formats through String parameters like 'fasta', we use a BioFASTAFile object. This not only prevents typos in parameters (and even in the case of a class-name typo, the syntax highlighter will notify you, in bold red typography, that the class is not recognized), but also decouples the file from the parser, letting you use the FASTA file as a real object and, for example, perform validity checks in the class.

This is another example of API design which IMO is simplified in BioSmalltalk:
with open(result_file, "w") as f:
    for seq in fasta_sequences:
        if seq.id in wanted:
            SeqIO.write([seq], f, "fasta")
Opening a file in "w"rite mode isn't something a developer should have to know about; mostly we just want to write the results to a file name, not deal with handles or write modes.
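For a library-free baseline, the same filter fits in a few lines of plain Python. This is a sketch of the general idea, not the accepted answer's code; it handles multi-line sequences by accumulating lines under each '>' header.

```python
# Plain-Python version of the same filter, no Biopython needed.
# Parses standard FASTA layout: '>' header lines, sequence lines below.

def parse_fasta(text):
    """Yield (name, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].split()[0], []
        elif line.strip():
            chunks.append(line.strip())
    if name is not None:
        yield name, "".join(chunks)

def filter_fasta(text, wanted):
    """Keep only the records whose name is in the wanted set."""
    return "\n".join(f">{n}\n{s}"
                     for n, s in parse_fasta(text) if n in wanted)

sample = ">1\nGGGAAT\nTCTTGG\n>2\nGTACTG\n>5\nAATTAG\n"
print(filter_fasta(sample, {"1", "5"}))   # records 1 and 5 only
```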
If you prefer a more functional style, the script could avoid the fastaRecords variable:
| idSubset |
idSubset := #(1 2 5 7) collect: #asString.
( ( BioParser parseMultiFasta: ( BioFASTAFile on: 'BioSmalltalkTestFiles\Test-Sequences01.fasta' ) contents )
      select: [: fRecord | idSubset includes: fRecord sequenceName ] ) outputToFile: 'Filtered-Sequences.fa'.
Notice first that #parseMultiFasta: answers a BioFASTAMultiRecords object, which contains a collection of BioFASTARecords. The #select: message then acts over this multi-records object, answering another BioFASTAMultiRecords rather than an OrderedCollection or another typical Smalltalk collection. This way you may continue using the specialized protocol, and in case of need you may ask for the #sequences at any time. In the next posts I will compare the performance of filtering over big FASTA files, so we can get a measure of how well BioSmalltalk performs for high-demand biocomputing.