Showing posts with label BLAST. Show all posts
Showing posts with label BLAST. Show all posts

Saturday, February 23, 2013

PhyloclassTalk preview

In this post I want to present a preview of PhyloclassTalk, an application for phylogenetics analysis using the BioSmalltalk environment with Pharo 1.4. The main parts are presented through the Metro style UI popularized in Windows 8. The following screenshot shows the main application window:


excepting for the icons, the layout was generated programatically with simple and plain Morphic objects. The "Territory Builder" uses a wizard library called Merlin and it is based in a Territorial library which basically is a Composite pattern implementation to build complex territorial objects. I have integrated the Help in just one hour, based in the HelpSystem without any previous knowledge of the library.

The main module window is a "Case Study Browser" implemented with the OmniBrowser framework. From the browser one can create and associate several phylogenetic data to a species case study, classify according to defined territories and then export results into formats like Arlequin, Google Fusion Tables or Fluxus Network.

The following screenshot describes the "Blast Query Builder", which enables dynamic generation and execution of Blast XML results, producing filtered objects which can be later loaded in the case study browser for further associations. Fitered results could be cumulative, meaning that each new execution is applied on the previous results.



Detailed features as the rule engine protocol and the post-curation of classified data are going to be described an the upcoming paper. I will provide also new posts on this front as I prepare a release, stay there online.


Tuesday, July 17, 2012

How to do a BLAST from Smalltalk

Provided by the National Center for Biotechnology Information (NCBI), BLAST is a sequence homology search program that finds all the sequences that match a particular given sequence from a database, and it is considered the most fundamental tool in bioinformatics today, although there are many other similar programs for different cases. I'm not going here to review the huge literature on sequence/string matching, for what any simple search in any search engine would get you a lot of resources to start.

For remote queries, you may access the free NCBI BLAST service in two ways, from a web browser and programatically through the BLAST URL API, previously known as QBlast. In this post you may read how to use the API using Smalltalk, and downloading a file with results in XML format for later processing
| search |
 
search := BioNCBIWWWBlastClient new nucleotide query: 'CCCTCAAACAT...TTTGAGGAG';
   hitListSize: 150;
   filterLowComplexity;
   expectValue: 10;
   wordSize: 11;
   blastn;
   blastPlainService;
   alignmentViewFlatQueryAnchored;
   formatTypeXML;   
   fetch.
search outputToFile: 'blast-query-result.xml' contents: search result.
in this script, at first we set up the Blast Client with the database and query sequence. Notice at this stage you may serialize the BioNCBIWWWBlastClient instance without further parameters. The second "set" of messages are "in cascade", in Smalltalk sometimes you need to send one object several consecutive messages, and in this case are directed to configure the search and send the requests in the #fetch. Each command is implemented as a subclass of BioNCBICommand because each one (like GET or PUT) accepts specific parameter types, but you don't need to worry about the valid parameters as they are validated internally, for example, when setting the database. Each message to the Blast client builds the corresponding URL, for example a GET url:

http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?CMD=Get&RID=954517013-7639-11119

during execution you may open a Transcript and see the request progress:

17 July 2012 6:58:51 am Fetching NCBI BLAST request identifier...
17 July 2012 6:58:53 am Parsing RTOE...
17 July 2012 6:58:53 am New RTOE = 18
17 July 2012 6:58:53 am RTOE = 18
17 July 2012 6:58:53 am RID = 0ADUV9FM016
17 July 2012 6:58:53 am Executing...BioNCBIFetch
17 July 2012 6:58:53 am QBLAST: Delaying retry for 18 seconds
17 July 2012 6:59:11 am Sending another NCBI BLAST request...
17 July 2012 6:59:14 am NCBI BLAST request sent, waiting for results...
17 July 2012 6:59:15 am Finished successfully

Monday, July 16, 2012

Parsing and profiling BLAST results


For my application I wanted to display BLAST summaries because we track some history data in the Nuccore database. If you, as a good bioinformatician, have downloaded the BLAST XML files you may find a summary of each BLAST is found in the header. 

As we may have several BLAST files, processing entirely each one just to read the header is a waste of time, so we need to parse a XML up to a particular node is found, which is provided by the BioSmalltalk API using StAX parser (XMLPullParser) as backend.


Now, compare the time spent reading a whole XML file versus just stopping after the last header node is found.

BioObject requestXmlFile
 ifNotNilDo: [ : fileLocation |
  | blastReader timeToReadFile timeToProcessNode output |  
  timeToReadFile :=   [ blastReader := BioNCBIBlastReader newFromXML: fileLocation ] timeToRun.
  timeToProcessNode :=  [ output := blastReader blastProgramName ] timeToRun.
  timeToReadFile :=   ( Duration milliSeconds: timeToReadFile ) asSeconds.
  timeToProcessNode :=  ( Duration milliSeconds: timeToProcessNode ) asSeconds.
  Transcript 
   cr;
   show: 'Seconds to read ' , fileLocation , ' = ';
   show: timeToReadFile;
   cr;
   show: 'Seconds to process node = ' , timeToProcessNode asString;
   cr;
   show: output ]
Wonder how to read a whole group of BLAST XML files contained in a directory? This script would do it:
| time blastReaders blastHits |

Smalltalk garbageCollect.
Transcript clear.
time := [ blastReaders := BioBlastCollection filesFromXMLDirectory: 'Data\BLASTs\'.
  Transcript show: blastReaders size asString , ' blast XML files were read'; cr.
  blastHits := blastReaders parsedContents ] timeToRun.
blastHits explore.
Transcript 
 cr;
 show: 'Seconds to read NCBI BLAST XML''s = ';
 show: ( Duration milliSeconds: time ) asSeconds

but wrapping everything you want to measure in a #timeToRun block isn't that fun. Plus getting rid of that bunch of temporary variables, in Pharo you may profile the task in a beautiful one-liner:

( BioBlastCollection filesFromXMLDirectory: 'Data\BLASTs\' ) parsedContents.

just right-click the selected line and select "Profile it" as shown in the screenshot.

 

after execution, the Time Profiler is open with detailed information of time spent in each method so you may see where to look better for bottlenecks.