Tuesday, July 17, 2012

How to do a BLAST from Smalltalk

Provided by the National Center for Biotechnology Information (NCBI), BLAST is a sequence homology search program that finds all the sequences that match a particular given sequence from a database, and it is considered the most fundamental tool in bioinformatics today, although there are many other similar programs for different cases. I'm not going here to review the huge literature on sequence/string matching, for what any simple search in any search engine would get you a lot of resources to start.

For remote queries, you may access the free NCBI BLAST service in two ways, from a web browser and programatically through the BLAST URL API, previously known as QBlast. In this post you may read how to use the API using Smalltalk, and downloading a file with results in XML format for later processing
| search |
 
search := BioNCBIWWWBlastClient new nucleotide query: 'CCCTCAAACAT...TTTGAGGAG';
   hitListSize: 150;
   filterLowComplexity;
   expectValue: 10;
   wordSize: 11;
   blastn;
   blastPlainService;
   alignmentViewFlatQueryAnchored;
   formatTypeXML;   
   fetch.
search outputToFile: 'blast-query-result.xml' contents: search result.
in this script, at first we set up the Blast Client with the database and query sequence. Notice at this stage you may serialize the BioNCBIWWWBlastClient instance without further parameters. The second "set" of messages are "in cascade", in Smalltalk sometimes you need to send one object several consecutive messages, and in this case are directed to configure the search and send the requests in the #fetch. Each command is implemented as a subclass of BioNCBICommand because each one (like GET or PUT) accepts specific parameter types, but you don't need to worry about the valid parameters as they are validated internally, for example, when setting the database. Each message to the Blast client builds the corresponding URL, for example a GET url:

http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?CMD=Get&RID=954517013-7639-11119

during execution you may open a Transcript and see the request progress:

17 July 2012 6:58:51 am Fetching NCBI BLAST request identifier...
17 July 2012 6:58:53 am Parsing RTOE...
17 July 2012 6:58:53 am New RTOE = 18
17 July 2012 6:58:53 am RTOE = 18
17 July 2012 6:58:53 am RID = 0ADUV9FM016
17 July 2012 6:58:53 am Executing...BioNCBIFetch
17 July 2012 6:58:53 am QBLAST: Delaying retry for 18 seconds
17 July 2012 6:59:11 am Sending another NCBI BLAST request...
17 July 2012 6:59:14 am NCBI BLAST request sent, waiting for results...
17 July 2012 6:59:15 am Finished successfully

Monday, July 16, 2012

Parsing and profiling BLAST results


For my application I wanted to display BLAST summaries because we track some history data in the Nuccore database. If you, as a good bioinformatician, have downloaded the BLAST XML files you may find a summary of each BLAST is found in the header. 

As we may have several BLAST files, processing entirely each one just to read the header is a waste of time, so we need to parse a XML up to a particular node is found, which is provided by the BioSmalltalk API using StAX parser (XMLPullParser) as backend.


Now, compare the time spent reading a whole XML file versus just stopping after the last header node is found.

BioObject requestXmlFile
 ifNotNilDo: [ : fileLocation |
  | blastReader timeToReadFile timeToProcessNode output |  
  timeToReadFile :=   [ blastReader := BioNCBIBlastReader newFromXML: fileLocation ] timeToRun.
  timeToProcessNode :=  [ output := blastReader blastProgramName ] timeToRun.
  timeToReadFile :=   ( Duration milliSeconds: timeToReadFile ) asSeconds.
  timeToProcessNode :=  ( Duration milliSeconds: timeToProcessNode ) asSeconds.
  Transcript 
   cr;
   show: 'Seconds to read ' , fileLocation , ' = ';
   show: timeToReadFile;
   cr;
   show: 'Seconds to process node = ' , timeToProcessNode asString;
   cr;
   show: output ]
Wonder how to read a whole group of BLAST XML files contained in a directory? This script would do it:
| time blastReaders blastHits |

Smalltalk garbageCollect.
Transcript clear.
time := [ blastReaders := BioBlastCollection filesFromXMLDirectory: 'Data\BLASTs\'.
  Transcript show: blastReaders size asString , ' blast XML files were read'; cr.
  blastHits := blastReaders parsedContents ] timeToRun.
blastHits explore.
Transcript 
 cr;
 show: 'Seconds to read NCBI BLAST XML''s = ';
 show: ( Duration milliSeconds: time ) asSeconds

but wrapping everything you want to measure in a #timeToRun block isn't that fun. Plus getting rid of that bunch of temporary variables, in Pharo you may profile the task in a beautiful one-liner:

( BioBlastCollection filesFromXMLDirectory: 'Data\BLASTs\' ) parsedContents.

just right-click the selected line and select "Profile it" as shown in the screenshot.

 

after execution, the Time Profiler is open with detailed information of time spent in each method so you may see where to look better for bottlenecks.

 

Custom serialization of big DNA or proteins

Lately I've been experimenting with serialization engines in Pharo. Besides the "traditional" alternative (SmartReferenceStream) I took the chance of evaluating Fuel, a new serializer which is nicely documented and supported, actually Mariano Martinez Peck and Martin Dias (the Fuel developers) answered privately my questions and requirements in a very fast way, so thanks to them I can show you an interesting feature now in BioSmalltalk
A typical serialization in bioinformatics includes a huge group of sequences or a big XML tree, so one of my requirements is to customize the serialization strategy to save precious memory. This means to change the serializer on-the-fly when a particular object is found in a graph of objects, specifically, if a DNA or protein sequence with a particular threshold is found, you certainly would like to zip it. Follows an example for serializing an Array with a random object ('hello') and the chromosome 28 of chicken:
objectToSerialize := Array with: 'hello' with: (FileStream readOnlyFileNamed: 'GGA28.fa') contents.
threshold := 1000.

FileStream forceNewFileNamed: 'demo.fuel' do: [ :aStream |
   aSerializer := FLSerializer newDefault.
   aSerializer analyzer
       when: [ :o | o isString and: [ o size > threshold and: [ (BioParser tokenizeFasta: o) second isDNASequence ] ] ]
       substituteBy: [ :o | o zipped ].
   aSerializer        
       serialize: objectToSerialize
       on: aStream binary ].
and of course, the corresponding materialization
result := FileStream oldFileNamed: 'demo.fuel' do: [ :aStream |
 aMaterialization := FLMaterializer newDefault materializeFrom: aStream binary.
 zippedStrings := aMaterialization objects select: [:o | o isString and: [ o isDNASequence ]].
 unzippedStrings := zippedStrings collect: [:o | o unzipped ].
 zippedStrings elementsExchangeIdentityWith: unzippedStrings.
 aMaterialization root ].
Looking at the possibilities, many of the custom DNA compression algorithms (or even XML) could be attached and used if saving space is becoming an issue in your bioinformatics experiments.

Sunday, July 15, 2012

Using the BioEntrezClient

Sometimes you have a list of accession numbers and want to get the corresponding FASTA sequences, this is the way to do it:

| result |
    
result := BioEntrezClient new nuccore
                uids: #('AB177765.1' 'AB177791.1');
                setFasta;
                fetch.

result outputToFile: 'nuccore_seqs.fasta'

the script will download the sequences in one trip vía the NCBI Entrez API, if you just wanted the GenBank format, just set #setGb instead of #setFasta above.The default format is ASN.1, which is a "hard" format for bioinformaticians. To download PubMed records from UID's, the following is a simple possible script:

| result |

result := 
 BioEntrezClient new pubmed 
  uids: #( 11877539 11822933 );
  fetch.
result outputToFile: 'fetchPubmed-01.txt'
 
And pretty much the same way with the spelling service using PubMedCentral database

| result |
 
result := 
 BioEntrezClient new pmc 
  term: 'fiberblast cell grwth';
  spell.
result outputToFile: 'eSpell1.xml'

With some classes you would want to view the possible messages you may send, for example to get the list of databases through Entrez. In this case this is easily done with the reflection capabilities of Smalltalk:

BioEntrezClient organization
   listAtCategoryNamed: 'accessing public - databases'

Wednesday, March 21, 2012

Modifying sequence names in a FASTA file

I want to show you now another formatting example taken from a recent post in the BioPerl's mailing list. This one will include how to parse a CSV file, a very common taks in bioinformatics programming. The question is about consolidating a FASTA file from a source FASTA and a CSV file containing complementing and corresponding identifiers. First as before, let's use two dumb files: The DNANumbers-Sequences.fasta file
>2863
AGGATTAAAAATCAACGCTATGAATCTGGTGTAATTCCATATGCTAAAATGGGCTATTGGGATCCTAATT
ATGCAATTAAAGAAACTGATGTATTAGCATTATTTC

>2864
AGGATTAAAAATCAACGCTATGAATCTGGTGTAATTCCATATGCTAAAATGGGCTATTGGGATCCTAATT
ATGCAATTAAAGAAACTGATGTATTAGCATTATTTCGTATTACTCCACAACCAGGTGTAGAT
and the DNANumbers-TaxaNames.csv
2863 Gelidium
2864 Poa
Let's look at the code then:
| multiFasta hashTable |

multiFasta := BioParser parseMultiFasta: ( BioFASTAFile on: 'DNANumbers-Sequences.fasta') contents.

hashTable := BioParser
    tokenizeCSV: ( BioCSVFile on: 'DNANumbers-TaxaNames.csv' ) contents
    delimiter: Character space.

( multiFasta renameFromDictionary: hashTable ) outputToFile: 'Renamed-Sequences.fa'.
You will notice a pattern here. In the reading of the FASTA file, the message sent is prefixed with #parse, while the CSV file is "parsed" through #tokenize. This is because in BioSmalltalk we distinguish between two modes of parsing. The tokenize messages always answer a Collection-like object containing other collections or primitive Smalltalk objects (this is something like #('object1' 'object2')), while the parse messages answer BioObject's, which could be sometimes "expensive" objects but it could be useful if you need to keep working with bio-objects and to learn relationships between them. In this case we needed a BioMultiFastaRecord because we wanted to rename the identifiers and output its contents to a new file, but for the CSV we only needed to tokenize it as a Dictionary (or hast table).

Another little thing to take into account, you are responsible to specify the delimiter of your CSV file, this will be the case until someone implements a pattern recognition algorithm for CSV files.

Monday, March 19, 2012

Filtering a FASTA file

I wondered how Smalltalk will translate a common and simple FASTA file filtering problem, so I've picked randomly a question in the BioStar community: http://biostar.stackexchange.com/questions/1710/extract-a-group-of-fasta-sequences-from-a-file to test. To replicate the problem, I've created a dumb FASTA file named 'Test-Sequences01.fasta' and moved to the BioSmalltalkTestFiles subdirectory:
>1
GGGAATTCTTGGGGTGCTGGGATCTTTTTGGGGTTGGAAAGAAAATGGCC
GTACTGTTATATTGTTGGGGTGGGAACCCGGGGTGGGGGGAGGGAATTTA
>2
GGGAATTCTTGGGGTGCTGGGATCTTTTTGGGGTTGGAAAGAAAATGGCC
GTACTGTTATATTGTTGGGGTGGGAACCCGGGGTGGGGGGAGGGAATTTC
>3
GGGAATTCTTGGGGTGCTGGGATCTTTTTGGGGTTGGAAAGAAAATGGCC
GTACTGTTATATTGTTGGGGTGGGAACCCGGGGTGGGGGGAGGGAATTTG
>4
GGGAATTCTTGGGGTGCTGGGATCTTTTTGGGGTTGGAAAGAAAATGGCC
GTACTGTTATATTGTTGGGGTGGGAACCCGGGGTGGGGGGAGGGAATTTT
>5
GGGAATTCTTGGGGTGCTGGGATCTTTTTGGGGTTGGAAAGAAAATGGCC
GTACTGTTATATTGTTGGGGTGGGAACCCGGGGTGGGGGGAGGGAATTAG
>6
GGGAATTCTTGGGGTGCTGGGATCTTTTTGGGGTTGGAAAGAAAATGGCC
GTACTGTTATATTGTTGGGGTGGGAACCCGGGGTGGGGGGAGGGAATTCG
>7
GGGAATTCTTGGGGTGCTGGGATCTTTTTGGGGTTGGAAAGAAAATGGCC
GTACTGTTATATTGTTGGGGTGGGAACCCGGGGTGGGGGGAGGGAAGGGG
Suppose I want to filter those fasta entries numbered 1 2 5 and 7. This could be a solution using BioSmalltalk
| idSubset fastaRecords |

idSubset := #(1 2 5 7) collect: #asString.
fastaRecords :=
  BioParser parseMultiFasta: ( BioFASTAFile on: 'BioSmalltalkTestFiles\Test-Sequences01.fasta' ) contents.

( fastaRecords select: [: fRecord | idSubset includes: fRecord sequenceName ] ) outputToFile: 'Filtered-Sequences.fa'

note the following differences with the BioPython accepted script.
SeqIO.parse(open(fasta_file),'fasta')
In BioSmalltalk instead of specifying parameters for formats in Strings like 'fasta', we use a BioFASTAFile object, this not only prevents typos in parameters (and even in the case of a class name typo, the syntax highlighter will notify you in bold red typography that the class is not recognized), but also decouples the file with the parser, enabling to use the fasta file as an real object, and perform validity checks for example in the class.

This is another example of API design which IMO is simplified in BioSmalltalk:
with open(result_file, "w") as f:
for seq in fasta_sequences:
   if seq.id in wanted:
       SeqIO.write([seq], f, "fasta")
opening a file with a "w"rite mode isn't something that a developer must know, mostly we just want to write the results in a file name, not dealing with handles(?) or write modes.
If you prefer a more functional style, the script could avoid the fastaRecords variable:
| idSubset |
idSubset := #(1 2 5 7) collect: #asString.
( ( BioParser parseMultiFasta: ( BioFASTAFile on: 'BioSmalltalkTestFiles\Test-Sequences01.fasta' ) contents )
      select: [: fRecord | idSubset includes: fRecord sequenceName ] ) outputToFile: 'Filtered-Sequences.fa'.
Notice first, the #parseMultiFasta: answers a BioFASTAMultiRecords object, which contains a collection of BioFASTARecord's. Then the #select: message acts over this multi records object, answering another BioFASTAMultiRecords and not an OrderedCollection or other typical Smalltalk collection. This way you may continue using the specialized protocol and in case of need, you may ask for the #sequences anytime. In the next posts I will compare the performance of filtering over big FASTA files, so we can get a measure of how well BioSmalltalk will perform for high-demand biocomputing

Thursday, February 16, 2012

Exploring interactively a BLAST XML alignment

Suppose you want to explore interactively the result of a BLAST alignment. If you have exported the XML format from the NCBI site, you may load it in BioSmalltalk into a BioNCBIBlastReader object, and open an explorer window to send messages to this object.
| blastReader |
blastReader :=
BioNCBIBlastReader
newFromXML: 'Data\BLASTs\YT817301R-Alignment.xml'.
The usual way smalltalkers evaluate the source code, implies selecting the code and opening the contextual menu to select the "Explore" item. Here is a screenshot of how an Explorer window looks like:



The next step is to find which messages we could send to that object, so you need to browse the object class. This is done by bringing the contextual menu over that object and selecting Browse.

Now as you may select any of the browse items, the "browse hierarchy" is the best option when you don't remember all the class behaviors, because will open a Hierarchy Browser enabling to view the whole BioNCBIBlastReader hierarchy.










The picture describes the basic architecture of a Class Hierarchy Browser. Generally when exploring an object behavior, the most important is the "Method List" (the 3rd one) which lists all the messages you may send to an object. The so-called "Method Pane" or "Code Pane" (the 4th pane) is the region where you may edit the source code, and is where you look at when you try to understand method usage, or modify the implementation.

These message are sent in the bottom pane in the Explorer window (so the "self" is the blast reader), and shown results to some message sends to the blast reader object:


self blastDbName " -> 'nr' "
self blastQueryLength " -> '923'"
self hitAccessionsAsString first: 3 " -> an OrderedCollection('JN382560.1' 'JQ070349.1' 'EU281529.1')"
self hitAccessionsAsTokens first: 3 " -> an OrderedCollection(#('JN382560' '1') #('JQ070349' '1') #('EU281529' '1') "
self hspHSeqs first: 2 " -> an OrderedCollection('AACTATTCCCTGAACACTATTAATATAGTTCCATAAATACAAAGAGCCTTATCAGTATTAAATTTATCAAAAATCCCAATAACTCAACACAGAATTTGCACCCTAACCAAATATTACAAACACCACTAGCTAACATAACACGCCCATACACAGACCACAGAATGAATTACCTACGCAAGGGGTAATGTACATAACATTAATGTAATAAAGACATAATATGTATATAGTACATTAAATTATATGCCCCATGCATATAAGCAAGTACATGACCTCTATAGCAGTACATAATACATATAATTATTGACTGTACATAGTACATTATGTCAAATTCATTCTTGATAGTATATCTATTATATATTCCTTACCATTAGATCACGAGCTTAATTACCATGCCGCGTGAAACCAGCAACCCGCTAGGCAAGGATCCCTCTTCTCGCTCCGGGCCCATAAATCGTGGGGGTCGCTATCCAATGAATTTTACCAGGCATCTGGTTCTTTCTTCAGGGCCATCTCATCTAAAACGGTCCATTCTTTCCTCTTAAATAAGACATCTCGATGGACTAATGGCTAATCAGCCCATGCTCACACATAACTGTGCTGTCATACATTTGGTATTTTTTTATTTTGGGGGATGCTTGGACTCAGCTATGGCCGTCAAAGGCCCTGACCCGGAGCATCTATTGTAGCTGGACTTAACTGCATCTTGAGCACCAGCATAATGATAAGCGTGGACATTACAGTCAATGGTCACAGGACATAAATTATATTATATATCCCCCCTTCATAAAAATTTCCCCCTTAAATATCTACCACCACTTTTAACAGACTTTTCCCTAGATACTTATTTAAATTTTTCACGCTTTCAATACTCAATTTAGCACTCCAAACAAAGTCAATATATAAACGCAGGCCCCCCCCCCCCG' 'AACTATTCCCTGAACACTATTAATATAGTTCCATAAATACAAAGAGCCTTATCAGTATTAAATTTATCAAAAATCCCAATAACTCAACACAGAATTTGCACCCTAACCAAATATTACAAACACCACTAGCTAACATAACACGCCCATACACAGACCACAGAATGAATTACCTACGCAAGGGGTAATGTACATAACATTAATGTAATAAAGACATAATATGTATATAGTACATTAAATTATATGCCCCATGCATATAAGCAAGTACATGACCTCTATAGCAGTACATAATACATATAATTATTGACTGTACATAGTACATTATGTCAAATTCATTCTTGATAGTATATCTATTATATATTCCTTACCATTAGATCACGAGCTTAATTACCATGCCGCGTGAAACCAGCAACCCGCTAGGCAGGGATCCCTCTTCTCGCTCCGGGCCCATAAACCGTGGGGGTCGCTATCCAATGAATTTTACCAGGCATCTGGTTCTTTCTTCAGGGCCATCTCATCTAAAACGGTCCATTCTTTCCTCTTAAATAAGACATCTCGATGGACTAATGGCTAATCAGCCCATGCTCACACATAACTGTGCTGTCATACATTTGGTATTTTTTTATTTTGGGGGATGCTTGGACTCAGCTATGGCCGTCAAAGGCCCTGACCCGGAGCATCTATTGTAGCTGGACTTAACTGCATCTTGAGCACCAGCATAATGATAAGCGTGGACATTACAGTCAATGGTCACAGGACATAAATTATATTATATATCCCCCCTTCATAAAAATTTCCCCCTTAAATATCTACCACCACTTTTAACAGACTTTTCCCTAGATACTTATTTAAATTTTTCACGCTTTCAATACTCAATTTAGCACTCCAAACAAAGTCAATATATAAACGCAGGCCCCCCCCCCCCG') "

Nothing prevents you from opening another Explorer with the results of sending #hitIdentifiers and writing more messages to the resulting collection, as depicted in the following screenshot:


That's a little introduction of how to work in the Smalltalk environment with Bio objects. Basically remember all the time you have real objects, even the window buttons are objects, and almost all the environment is easily accessible by the very basic tools which exists for more than 30 years!. Following posts will include more tools and more interesting messages for working with BioSmalltalk.

Friday, January 6, 2012

Basic sequence manipulation with BioSmalltalk

My first entry is dedicated to some basic Sequence String manipulation, to get some familiarization with basic Smalltalk objects. The important thing is that you don't need to create files to evaluate the following expressions, you just select it and "print it" the result with the contextual menu option or just the keyboard shortcut. Note that as everything is an object, you could select again the answer and "explore it" again and again.

To obtain the aminoacid name you just send #asAminoacidName to a String, for example
'a' asAminoacidName.    " --> 'Alanine' "
'G'asAminoacidName. " --> 'Glycine' "

Many times we copy and paste sequences from several sources which aren't properly formatted , so to remove all spacing characters from a sequence you could use:
'   ATGCTAGT
CAG
C
AGTTAGCGACA ' asCondensedString.

Both messages are received by a String and answer a String. Now another basic object is Boolean, with its two instances: true and false. For example to determine if a DNA sequence contains ambiguous letters:
'atcggtcggctta' hasAmbiguousDNABases.        " -> false "
'atcggfcggctta' hasAmbiguousDNABases. " -> true "

An important part of working in Smalltalk is the Collection objects, being Array one of the most used. A case which answer an Array instance would be to get the positions of gaps (i.e. : - characters) in a DNA sequence:
'ATCGAT-CAGTGCA--CAGTCA-TTC' indicesOfAscii: $- asciiValue.     
" --> #(7 15 16 23) "

and of course, you could use the impressive amount of features of the String hierarchy.

"Get the sequence size:"
'AATGATCGATGCTAGTCGACA' size.  " -> 21 (a SmallInteger) "
"Compare two sequences:" 
'AATGATCGATGCTAGTCGACA' = 'AATGATCGATCCTAGTCGACA'     
" -> false (a False) " " Find the position of the first (answer 0 if doesn't) subsequence passed as parameter " 'AATGATCGATGCTAGTCGACATGCTA' findString: 'TGCTA' " -> 10 "


And that's all for now, as I don't like long posts we will stop here and I will check the feedback if any, hopefully I will receive comments about what people would like to use.