Monday, July 16, 2012

Parsing and profiling BLAST results

For my application I wanted to display BLAST summaries because we track some history data in the Nuccore database. If you, as a good bioinformatician, have downloaded the BLAST XML files you may find a summary of each BLAST is found in the header. 

As we may have several BLAST files, processing entirely each one just to read the header is a waste of time, so we need to parse a XML up to a particular node is found, which is provided by the BioSmalltalk API using StAX parser (XMLPullParser) as backend.

Now, compare the time spent reading a whole XML file versus just stopping after the last header node is found.

BioObject requestXmlFile
 ifNotNilDo: [ : fileLocation |
  | blastReader timeToReadFile timeToProcessNode output |  
  timeToReadFile :=   [ blastReader := BioNCBIBlastReader newFromXML: fileLocation ] timeToRun.
  timeToProcessNode :=  [ output := blastReader blastProgramName ] timeToRun.
  timeToReadFile :=   ( Duration milliSeconds: timeToReadFile ) asSeconds.
  timeToProcessNode :=  ( Duration milliSeconds: timeToProcessNode ) asSeconds.
   show: 'Seconds to read ' , fileLocation , ' = ';
   show: timeToReadFile;
   show: 'Seconds to process node = ' , timeToProcessNode asString;
   show: output ]
Wonder how to read a whole group of BLAST XML files contained in a directory? This script would do it:
| time blastReaders blastHits |

Smalltalk garbageCollect.
Transcript clear.
time := [ blastReaders := BioBlastCollection filesFromXMLDirectory: 'Data\BLASTs\'.
  Transcript show: blastReaders size asString , ' blast XML files were read'; cr.
  blastHits := blastReaders parsedContents ] timeToRun.
blastHits explore.
 show: 'Seconds to read NCBI BLAST XML''s = ';
 show: ( Duration milliSeconds: time ) asSeconds

but wrapping everything you want to measure in a #timeToRun block isn't that fun. Plus getting rid of that bunch of temporary variables, in Pharo you may profile the task in a beautiful one-liner:

( BioBlastCollection filesFromXMLDirectory: 'Data\BLASTs\' ) parsedContents.

just right-click the selected line and select "Profile it" as shown in the screenshot.


after execution, the Time Profiler is open with detailed information of time spent in each method so you may see where to look better for bottlenecks.



Post a Comment