Wednesday, March 21, 2012

Modifying sequence names in a FASTA file

I want to show you another formatting example, taken from a recent post on the BioPerl mailing list. This one includes parsing a CSV file, a very common task in bioinformatics programming. The question is about consolidating a FASTA file from a source FASTA file and a CSV file containing complementary, corresponding identifiers. As before, let's start with two toy files. First, the DNANumbers-Sequences.fasta file

and the DNANumbers-TaxaNames.csv file:
2863 Gelidium
2864 Poa
Let's look at the code then:
| multiFasta hashTable |

multiFasta := BioParser parseMultiFasta: ( BioFASTAFile on: 'DNANumbers-Sequences.fasta') contents.

hashTable := BioParser
    tokenizeCSV: ( BioCSVFile on: 'DNANumbers-TaxaNames.csv' ) contents
    delimiter: Character space.

( multiFasta renameFromDictionary: hashTable ) outputToFile: 'Renamed-Sequences.fa'.
You will notice a pattern here. When reading the FASTA file, the message sent is prefixed with #parse, while the CSV file is "parsed" through #tokenize. This is because BioSmalltalk distinguishes between two modes of parsing. The tokenize messages always answer a Collection-like object containing other collections or primitive Smalltalk objects (something like #('object1' 'object2')), while the parse messages answer BioObjects. These can sometimes be "expensive" objects, but they are useful if you need to keep working with bio-objects and learn the relationships between them. In this case we needed a BioMultiFastaRecord because we wanted to rename the identifiers and write its contents to a new file, but for the CSV we only needed to tokenize it into a Dictionary (or hash table).
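To make the idea concrete, here is a minimal sketch in plain Pharo, without any BioSmalltalk classes, of what a tokenized Dictionary looks like and how it is enough to rename header lines. The FASTA content below is hypothetical and invented for the illustration, and the sketch assumes that renaming means replacing the identifier with the name found in the CSV; the actual behavior of #renameFromDictionary: may differ. In older images the #substrings: selector may be spelled #subStrings:.

```smalltalk
"Plain-Pharo sketch: no BioSmalltalk classes are involved."
| csvText fastaText table renamed |
csvText := '2863 Gelidium' , String cr , '2864 Poa'.
"Hypothetical FASTA content, invented for this illustration."
fastaText := '>2863' , String cr , 'ACGT' , String cr , '>2864' , String cr , 'TTGA'.

"Tokenize: each row becomes a key -> value pair in a plain Dictionary."
table := Dictionary new.
csvText linesDo: [ :line |
	| fields |
	fields := line substrings: ' '.
	fields size >= 2
		ifTrue: [ table at: fields first put: fields second ] ].

"Rename: replace the identifier on each header line when the table knows it."
renamed := OrderedCollection new.
fastaText linesDo: [ :line |
	(line beginsWith: '>')
		ifTrue: [ | id |
			id := line allButFirst.
			renamed add: '>' , (table at: id ifAbsent: [ id ]) ]
		ifFalse: [ renamed add: line ] ].
renamed asArray. "answers #('>Gelidium' 'ACGT' '>Poa' 'TTGA')"
```

Identifiers that are missing from the table are left unchanged here, which is a design choice of this sketch and not something the library is known to do.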

One more thing to keep in mind: you are responsible for specifying the delimiter of your CSV file. This will remain the case until someone implements a pattern recognition algorithm for CSV files.
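Why the delimiter matters can be seen with plain Pharo strings. This is only a sketch: the comma-separated row is a hypothetical variant of the sample data, and #substrings: stands in for whatever splitting the library does internally.

```smalltalk
"Plain-Pharo sketch of splitting one row with the right and wrong delimiter."
| spaceRow commaRow |
spaceRow := '2863 Gelidium'.
"Hypothetical comma-separated variant of the same row."
commaRow := '2863,Gelidium'.
(spaceRow substrings: ' ') size. "2: identifier and name"
(commaRow substrings: ' ') size. "1: the row is not split at all"
(commaRow substrings: ',') size. "2: split correctly once the delimiter matches"
```

With the wrong delimiter each row comes back as a single field, so there is no key and value to put in the Dictionary.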

