Wednesday, March 21, 2012

Modifying sequence names in a FASTA file

Here is another formatting example, taken from a recent post on the BioPerl mailing list. This one also shows how to parse a CSV file, a very common task in bioinformatics programming. The question is about building a consolidated FASTA file from a source FASTA file and a CSV file that pairs each identifier in the FASTA file with a complementary name. As before, let's start with two dummy files.

First, the DNANumbers-Sequences.fasta file:

>2863
AGGATTAAAAATCAACGCTATGAATCTGGTGTAATTCCATATGCTAAAATGGGCTATTGGGATCCTAATT
ATGCAATTAAAGAAACTGATGTATTAGCATTATTTC

>2864
AGGATTAAAAATCAACGCTATGAATCTGGTGTAATTCCATATGCTAAAATGGGCTATTGGGATCCTAATT
ATGCAATTAAAGAAACTGATGTATTAGCATTATTTCGTATTACTCCACAACCAGGTGTAGAT

and the DNANumbers-TaxaNames.csv file, which is space-delimited:

2863 Gelidium
2864 Poa

Let's look at the code then:

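| multiFasta hashTable |

"Parse the FASTA file into a bio-object holding all the records."
multiFasta := BioParser parseMultiFasta: ( BioFASTAFile on: 'DNANumbers-Sequences.fasta' ) contents.

"Tokenize the space-delimited CSV file into an identifier -> name lookup."
hashTable := BioParser
    tokenizeCSV: ( BioCSVFile on: 'DNANumbers-TaxaNames.csv' ) contents
    delimiter: Character space.

"Rename the records using the lookup and write the result to a new file."
( multiFasta renameFromDictionary: hashTable ) outputToFile: 'Renamed-Sequences.fa'.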

You will notice a pattern here. The message that reads the FASTA file is prefixed with #parse, while the CSV file is "parsed" through a #tokenize message. This is because BioSmalltalk distinguishes between two modes of parsing. The tokenize messages always answer a Collection-like object containing other collections or primitive Smalltalk objects (something like #('object1' 'object2')). The parse messages answer BioObjects, which can sometimes be "expensive" objects, but are useful if you need to keep working with bio-objects and to learn the relationships between them. In this case we needed a BioMultiFastaRecord because we wanted to rename the identifiers and output the contents to a new file, but for the CSV file we only needed to tokenize it as a Dictionary (or hash table).
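
To make the tokenize side more concrete, here is a minimal sketch in plain Pharo Smalltalk that builds the same kind of lookup by hand and applies it to FASTA header lines. It uses only core string and collection messages, not the BioSmalltalk API. The variable names and the inlined, shortened sample data are assumptions for illustration, and this is not a description of how #tokenizeCSV:delimiter: or #renameFromDictionary: are actually implemented.

```smalltalk
| csvLines fastaLines lookup renamed |

"Sample data inlined from the two files above; sequences are shortened (illustration only)."
csvLines := #('2863 Gelidium' '2864 Poa').
fastaLines := #('>2863' 'AGGATTAAAAATCAACGC' '>2864' 'AGGATTAAAAATCAACGCTATG').

"Tokenize: split each CSV line on the space delimiter and store identifier -> name."
lookup := Dictionary new.
csvLines do: [ :line |
	| fields |
	fields := line substrings: ' '.
	fields size = 2
		ifTrue: [ lookup at: fields first put: fields last ] ].

"Rename: swap the identifier on header lines, leave sequence lines untouched.
 Identifiers missing from the lookup are kept as they are."
renamed := fastaLines collect: [ :line |
	(line beginsWith: '>')
		ifTrue: [ '>' , (lookup at: line allButFirst ifAbsent: [ line allButFirst ]) ]
		ifFalse: [ line ] ].

renamed first.   "'>Gelidium'"
```

The result here is just an Array of Strings, which is fine for a one-off rename. The parse route in the post answers a richer object instead, which is what makes messages such as #outputToFile: available afterwards.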

One more thing to take into account: you are responsible for specifying the delimiter of your CSV file. This will be the case until someone implements a pattern recognition algorithm for CSV files.
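
As a rough idea of what such delimiter detection could look like, the following is a naive sketch in plain Pharo Smalltalk. It is not part of BioSmalltalk: the candidate list, the block name guessDelimiter and the heuristic itself (pick the first candidate that occurs the same non-zero number of times on every line) are all assumptions made for illustration, and the heuristic would be fooled by quoted fields or by names that contain spaces.

```smalltalk
| candidates guessDelimiter |

"Hypothetical list of delimiters worth trying, in order of preference."
candidates := { Character space. Character tab. $,. $; }.

"Answer the first candidate that appears on every line the same number of
 times (at least once), or nil when none qualifies. Assumes a non-empty
 collection of lines."
guessDelimiter := [ :lines |
	candidates
		detect: [ :each |
			| counts |
			counts := lines collect: [ :line | line occurrencesOf: each ].
			counts first > 0 and: [ counts asSet size = 1 ] ]
		ifNone: [ nil ] ].

guessDelimiter value: #('2863 Gelidium' '2864 Poa').   "answers Character space"
guessDelimiter value: #('2863,Gelidium' '2864,Poa').   "answers $,"
```

The guessed character could then be passed as the delimiter: argument, although naming the delimiter explicitly remains the safer choice.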

BioSmalltalk

Usage and development comments about the BioSmalltalk project
