Monday, March 19, 2012

Filtering a FASTA file

I wondered how Smalltalk would handle a common and simple FASTA file filtering problem, so I picked a question at random from the BioStar community to test. To replicate the problem, I created a dummy FASTA file named 'Test-Sequences01.fasta' and moved it to the BioSmalltalkTestFiles subdirectory.
Suppose I want to filter the FASTA entries numbered 1, 2, 5 and 7. This could be a solution using BioSmalltalk:
| idSubset fastaRecords |

idSubset := #(1 2 5 7) collect: #asString.
fastaRecords :=
  BioParser parseMultiFasta: ( BioFASTAFile on: 'BioSmalltalkTestFiles\Test-Sequences01.fasta' ) contents.

( fastaRecords select: [: fRecord | idSubset includes: fRecord sequenceName ] ) outputToFile: 'Filtered-Sequences.fa'

Note the following differences from the accepted BioPython script.
In BioSmalltalk, instead of specifying formats as strings like 'fasta', we use a BioFASTAFile object. This not only prevents typos in parameters (and even if you mistype a class name, the syntax highlighter will flag the unrecognized class in bold red), but also decouples the file from the parser, so the FASTA file can be used as a real object that can, for example, perform validity checks in its own class.
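To make the "file as a real object" idea concrete on the Python side, here is a minimal sketch using only the standard library. The class FastaFile and its methods looks_valid and records are hypothetical names invented for this illustration; they are not part of BioSmalltalk or BioPython, and the validity check is deliberately naive.

```python
from pathlib import Path


class FastaFile:
    """Hypothetical wrapper: a FASTA file as an object, not a format string."""

    def __init__(self, path):
        self.path = Path(path)

    def looks_valid(self):
        """Naive check: the first non-blank line must be a '>' header."""
        with self.path.open() as handle:
            for line in handle:
                stripped = line.strip()
                if stripped:
                    return stripped.startswith(">")
        return False  # an empty file is not treated as valid FASTA

    def records(self):
        """Return (name, sequence) pairs; name is the first word of the header."""
        if not self.looks_valid():
            raise ValueError(f"{self.path} does not look like a FASTA file")
        result, name, chunks = [], None, []
        with self.path.open() as handle:
            for line in handle:
                line = line.strip()
                if line.startswith(">"):
                    if name is not None:
                        result.append((name, "".join(chunks)))
                    header = line[1:].split()
                    name = header[0] if header else ""
                    chunks = []
                elif line:
                    chunks.append(line)
        if name is not None:
            result.append((name, "".join(chunks)))
        return result
```

Because the format lives in the class rather than in a string argument, a typo fails as an unknown name instead of silently selecting the wrong parser.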

This is another example of API design that IMO is simpler in BioSmalltalk. The BioPython script writes its results like this:
with open(result_file, "w") as f:
    for seq in fasta_sequences:
        if seq.id in wanted:
            SeqIO.write([seq], f, "fasta")
Opening a file in "w"rite mode isn't something a developer should have to know about; mostly we just want to write the results to a named file, without dealing with handles or write modes.
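The point about handles can be illustrated with a small wrapper that takes only a file name, roughly the convenience that #outputToFile: gives in the Smalltalk script. This is a sketch under stated assumptions: write_fasta is a hypothetical helper, not a BioPython function, and records are modeled as plain (name, sequence) tuples.

```python
def write_fasta(records, file_name, width=60):
    """Write (name, sequence) pairs to file_name; handle and mode stay hidden."""
    with open(file_name, "w") as handle:
        for name, sequence in records:
            handle.write(f">{name}\n")
            # Split the sequence into fixed-width lines, as FASTA files usually do.
            for start in range(0, len(sequence), width):
                handle.write(sequence[start:start + width] + "\n")
```

With such a helper, the caller filters first and then names the output file, for example write_fasta(selected, "Filtered-Sequences.fa"), and never sees the handle.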
If you prefer a more functional style, the script could avoid the fastaRecords variable:
| idSubset |
idSubset := #(1 2 5 7) collect: #asString.
( ( BioParser parseMultiFasta: ( BioFASTAFile on: 'BioSmalltalkTestFiles\Test-Sequences01.fasta' ) contents )
      select: [: fRecord | idSubset includes: fRecord sequenceName ] ) outputToFile: 'Filtered-Sequences.fa'.
Notice first that #parseMultiFasta: answers a BioFASTAMultiRecords object, which contains a collection of BioFASTARecord objects. The #select: message then acts on this multi-records object, answering another BioFASTAMultiRecords rather than an OrderedCollection or other typical Smalltalk collection. This way you can keep using the specialized protocol and, if needed, ask for the #sequences at any time. In upcoming posts I will compare the performance of filtering big FASTA files, so we can get a measure of how well BioSmalltalk performs for high-demand biocomputing.
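The design choice described above, a filter that answers the same specialized type instead of a generic collection, can be sketched in Python as follows. FastaRecords, select, names and sequences are hypothetical names chosen to echo the Smalltalk messages; this is not how BioSmalltalk is implemented, and what #sequences actually answers there is not shown in this post (here it simply returns plain sequence strings).

```python
class FastaRecords:
    """Hypothetical container of (name, sequence) pairs with its own protocol."""

    def __init__(self, records):
        self._records = list(records)

    def select(self, predicate):
        # Answer another FastaRecords, not a plain list, so calls can be chained.
        return FastaRecords(r for r in self._records if predicate(r))

    def names(self):
        return [name for name, _ in self._records]

    def sequences(self):
        # Escape hatch: plain sequence strings when a generic list is needed.
        return [sequence for _, sequence in self._records]


# Made-up records named "1" to "8", filtered by the same id subset as above.
id_subset = {str(n) for n in (1, 2, 5, 7)}
all_records = FastaRecords((str(n), "ACGT" * n) for n in range(1, 9))
filtered = all_records.select(lambda record: record[0] in id_subset)
print(type(filtered).__name__, filtered.names())
# prints: FastaRecords ['1', '2', '5', '7']
```

Since select returns a FastaRecords, a second select or a call to names can follow directly, which is the chaining benefit the paragraph describes.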

