Monday, July 16, 2012

Custom serialization of big DNA or proteins

Lately I've been experimenting with serialization engines in Pharo. Besides the "traditional" alternative (SmartReferenceStream) I took the chance of evaluating Fuel, a new serializer which is nicely documented and supported, actually Mariano Martinez Peck and Martin Dias (the Fuel developers) answered privately my questions and requirements in a very fast way, so thanks to them I can show you an interesting feature now in BioSmalltalk
A typical serialization in bioinformatics includes a huge group of sequences or a big XML tree, so one of my requirements is to customize the serialization strategy to save precious memory. This means to change the serializer on-the-fly when a particular object is found in a graph of objects, specifically, if a DNA or protein sequence with a particular threshold is found, you certainly would like to zip it. Follows an example for serializing an Array with a random object ('hello') and the chromosome 28 of chicken:
objectToSerialize := Array with: 'hello' with: (FileStream readOnlyFileNamed: 'GGA28.fa') contents.
threshold := 1000.

FileStream forceNewFileNamed: 'demo.fuel' do: [ :aStream |
   aSerializer := FLSerializer newDefault.
   aSerializer analyzer
       when: [ :o | o isString and: [ o size > threshold and: [ (BioParser tokenizeFasta: o) second isDNASequence ] ] ]
       substituteBy: [ :o | o zipped ].
   aSerializer        
       serialize: objectToSerialize
       on: aStream binary ].
and of course, the corresponding materialization
result := FileStream oldFileNamed: 'demo.fuel' do: [ :aStream |
 aMaterialization := FLMaterializer newDefault materializeFrom: aStream binary.
 zippedStrings := aMaterialization objects select: [:o | o isString and: [ o isDNASequence ]].
 unzippedStrings := zippedStrings collect: [:o | o unzipped ].
 zippedStrings elementsExchangeIdentityWith: unzippedStrings.
 aMaterialization root ].
Looking at the possibilities, many of the custom DNA compression algorithms (or even XML) could be attached and used if saving space is becoming an issue in your bioinformatics experiments.

0 comments:

Post a Comment