Wednesday, March 21, 2012

Modifying sequence names in a FASTA file

I want to show you now another formatting example taken from a recent post in the BioPerl's mailing list. This one will include how to parse a CSV file, a very common taks in bioinformatics programming. The question is about consolidating a FASTA file from a source FASTA and a CSV file containing complementing and corresponding identifiers. First as before, let's use two dumb files: The DNANumbers-Sequences.fasta file
>2863
AGGATTAAAAATCAACGCTATGAATCTGGTGTAATTCCATATGCTAAAATGGGCTATTGGGATCCTAATT
ATGCAATTAAAGAAACTGATGTATTAGCATTATTTC

>2864
AGGATTAAAAATCAACGCTATGAATCTGGTGTAATTCCATATGCTAAAATGGGCTATTGGGATCCTAATT
ATGCAATTAAAGAAACTGATGTATTAGCATTATTTCGTATTACTCCACAACCAGGTGTAGAT
and the DNANumbers-TaxaNames.csv
2863 Gelidium
2864 Poa
Let's look at the code then:
| multiFasta hashTable |

multiFasta := BioParser parseMultiFasta: ( BioFASTAFile on: 'DNANumbers-Sequences.fasta') contents.

hashTable := BioParser
    tokenizeCSV: ( BioCSVFile on: 'DNANumbers-TaxaNames.csv' ) contents
    delimiter: Character space.

( multiFasta renameFromDictionary: hashTable ) outputToFile: 'Renamed-Sequences.fa'.
You will notice a pattern here. In the reading of the FASTA file, the message sent is prefixed with #parse, while the CSV file is "parsed" through #tokenize. This is because in BioSmalltalk we distinguish between two modes of parsing. The tokenize messages always answer a Collection-like object containing other collections or primitive Smalltalk objects (this is something like #('object1' 'object2')), while the parse messages answer BioObject's, which could be sometimes "expensive" objects but it could be useful if you need to keep working with bio-objects and to learn relationships between them. In this case we needed a BioMultiFastaRecord because we wanted to rename the identifiers and output its contents to a new file, but for the CSV we only needed to tokenize it as a Dictionary (or hast table).

Another little thing to take into account, you are responsible to specify the delimiter of your CSV file, this will be the case until someone implements a pattern recognition algorithm for CSV files.

Monday, March 19, 2012

Filtering a FASTA file

I wondered how Smalltalk will translate a common and simple FASTA file filtering problem, so I've picked randomly a question in the BioStar community: http://biostar.stackexchange.com/questions/1710/extract-a-group-of-fasta-sequences-from-a-file to test. To replicate the problem, I've created a dumb FASTA file named 'Test-Sequences01.fasta' and moved to the BioSmalltalkTestFiles subdirectory:
>1
GGGAATTCTTGGGGTGCTGGGATCTTTTTGGGGTTGGAAAGAAAATGGCC
GTACTGTTATATTGTTGGGGTGGGAACCCGGGGTGGGGGGAGGGAATTTA
>2
GGGAATTCTTGGGGTGCTGGGATCTTTTTGGGGTTGGAAAGAAAATGGCC
GTACTGTTATATTGTTGGGGTGGGAACCCGGGGTGGGGGGAGGGAATTTC
>3
GGGAATTCTTGGGGTGCTGGGATCTTTTTGGGGTTGGAAAGAAAATGGCC
GTACTGTTATATTGTTGGGGTGGGAACCCGGGGTGGGGGGAGGGAATTTG
>4
GGGAATTCTTGGGGTGCTGGGATCTTTTTGGGGTTGGAAAGAAAATGGCC
GTACTGTTATATTGTTGGGGTGGGAACCCGGGGTGGGGGGAGGGAATTTT
>5
GGGAATTCTTGGGGTGCTGGGATCTTTTTGGGGTTGGAAAGAAAATGGCC
GTACTGTTATATTGTTGGGGTGGGAACCCGGGGTGGGGGGAGGGAATTAG
>6
GGGAATTCTTGGGGTGCTGGGATCTTTTTGGGGTTGGAAAGAAAATGGCC
GTACTGTTATATTGTTGGGGTGGGAACCCGGGGTGGGGGGAGGGAATTCG
>7
GGGAATTCTTGGGGTGCTGGGATCTTTTTGGGGTTGGAAAGAAAATGGCC
GTACTGTTATATTGTTGGGGTGGGAACCCGGGGTGGGGGGAGGGAAGGGG
Suppose I want to filter those fasta entries numbered 1 2 5 and 7. This could be a solution using BioSmalltalk
| idSubset fastaRecords |

idSubset := #(1 2 5 7) collect: #asString.
fastaRecords :=
  BioParser parseMultiFasta: ( BioFASTAFile on: 'BioSmalltalkTestFiles\Test-Sequences01.fasta' ) contents.

( fastaRecords select: [: fRecord | idSubset includes: fRecord sequenceName ] ) outputToFile: 'Filtered-Sequences.fa'

note the following differences with the BioPython accepted script.
SeqIO.parse(open(fasta_file),'fasta')
In BioSmalltalk instead of specifying parameters for formats in Strings like 'fasta', we use a BioFASTAFile object, this not only prevents typos in parameters (and even in the case of a class name typo, the syntax highlighter will notify you in bold red typography that the class is not recognized), but also decouples the file with the parser, enabling to use the fasta file as an real object, and perform validity checks for example in the class.

This is another example of API design which IMO is simplified in BioSmalltalk:
with open(result_file, "w") as f:
for seq in fasta_sequences:
   if seq.id in wanted:
       SeqIO.write([seq], f, "fasta")
opening a file with a "w"rite mode isn't something that a developer must know, mostly we just want to write the results in a file name, not dealing with handles(?) or write modes.
If you prefer a more functional style, the script could avoid the fastaRecords variable:
| idSubset |
idSubset := #(1 2 5 7) collect: #asString.
( ( BioParser parseMultiFasta: ( BioFASTAFile on: 'BioSmalltalkTestFiles\Test-Sequences01.fasta' ) contents )
      select: [: fRecord | idSubset includes: fRecord sequenceName ] ) outputToFile: 'Filtered-Sequences.fa'.
Notice first, the #parseMultiFasta: answers a BioFASTAMultiRecords object, which contains a collection of BioFASTARecord's. Then the #select: message acts over this multi records object, answering another BioFASTAMultiRecords and not an OrderedCollection or other typical Smalltalk collection. This way you may continue using the specialized protocol and in case of need, you may ask for the #sequences anytime. In the next posts I will compare the performance of filtering over big FASTA files, so we can get a measure of how well BioSmalltalk will perform for high-demand biocomputing