Monday, July 16, 2012

Custom serialization of big DNA or proteins

Lately I've been experimenting with serialization engines in Pharo. Besides the "traditional" alternative (SmartReferenceStream) I took the chance of evaluating Fuel, a new serializer which is nicely documented and supported, actually Mariano Martinez Peck and Martin Dias (the Fuel developers) answered privately my questions and requirements in a very fast way, so thanks to them I can show you an interesting feature now in BioSmalltalk
A typical serialization in bioinformatics includes a huge group of sequences or a big XML tree, so one of my requirements is to customize the serialization strategy to save precious memory. This means to change the serializer on-the-fly when a particular object is found in a graph of objects, specifically, if a DNA or protein sequence with a particular threshold is found, you certainly would like to zip it. Follows an example for serializing an Array with a random object ('hello') and the chromosome 28 of chicken:
objectToSerialize := Array with: 'hello' with: (FileStream readOnlyFileNamed: 'GGA28.fa') contents.
threshold := 1000.

FileStream forceNewFileNamed: 'demo.fuel' do: [ :aStream |
   aSerializer := FLSerializer newDefault.
   aSerializer analyzer
       when: [ :o | o isString and: [ o size > threshold and: [ (BioParser tokenizeFasta: o) second isDNASequence ] ] ]
       substituteBy: [ :o | o zipped ].
       serialize: objectToSerialize
       on: aStream binary ].
and of course, the corresponding materialization
result := FileStream oldFileNamed: 'demo.fuel' do: [ :aStream |
 aMaterialization := FLMaterializer newDefault materializeFrom: aStream binary.
 zippedStrings := aMaterialization objects select: [:o | o isString and: [ o isDNASequence ]].
 unzippedStrings := zippedStrings collect: [:o | o unzipped ].
 zippedStrings elementsExchangeIdentityWith: unzippedStrings.
 aMaterialization root ].
Looking at the possibilities, many of the custom DNA compression algorithms (or even XML) could be attached and used if saving space is becoming an issue in your bioinformatics experiments.


Post a Comment