Showing posts with label NCBI. Show all posts
Showing posts with label NCBI. Show all posts

Tuesday, July 17, 2012

How to do a BLAST from Smalltalk

Provided by the National Center for Biotechnology Information (NCBI), BLAST is a sequence homology search program that finds all the sequences that match a particular given sequence from a database, and it is considered the most fundamental tool in bioinformatics today, although there are many other similar programs for different cases. I'm not going here to review the huge literature on sequence/string matching, for what any simple search in any search engine would get you a lot of resources to start.

For remote queries, you may access the free NCBI BLAST service in two ways, from a web browser and programatically through the BLAST URL API, previously known as QBlast. In this post you may read how to use the API using Smalltalk, and downloading a file with results in XML format for later processing
| search |
 
search := BioNCBIWWWBlastClient new nucleotide query: 'CCCTCAAACAT...TTTGAGGAG';
   hitListSize: 150;
   filterLowComplexity;
   expectValue: 10;
   wordSize: 11;
   blastn;
   blastPlainService;
   alignmentViewFlatQueryAnchored;
   formatTypeXML;   
   fetch.
search outputToFile: 'blast-query-result.xml' contents: search result.
in this script, at first we set up the Blast Client with the database and query sequence. Notice at this stage you may serialize the BioNCBIWWWBlastClient instance without further parameters. The second "set" of messages are "in cascade", in Smalltalk sometimes you need to send one object several consecutive messages, and in this case are directed to configure the search and send the requests in the #fetch. Each command is implemented as a subclass of BioNCBICommand because each one (like GET or PUT) accepts specific parameter types, but you don't need to worry about the valid parameters as they are validated internally, for example, when setting the database. Each message to the Blast client builds the corresponding URL, for example a GET url:

http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?CMD=Get&RID=954517013-7639-11119

during execution you may open a Transcript and see the request progress:

17 July 2012 6:58:51 am Fetching NCBI BLAST request identifier...
17 July 2012 6:58:53 am Parsing RTOE...
17 July 2012 6:58:53 am New RTOE = 18
17 July 2012 6:58:53 am RTOE = 18
17 July 2012 6:58:53 am RID = 0ADUV9FM016
17 July 2012 6:58:53 am Executing...BioNCBIFetch
17 July 2012 6:58:53 am QBLAST: Delaying retry for 18 seconds
17 July 2012 6:59:11 am Sending another NCBI BLAST request...
17 July 2012 6:59:14 am NCBI BLAST request sent, waiting for results...
17 July 2012 6:59:15 am Finished successfully

Sunday, July 15, 2012

Using the BioEntrezClient

Sometimes you have a list of accession numbers and want to get the corresponding FASTA sequences, this is the way to do it:

| result |
    
result := BioEntrezClient new nuccore
                uids: #('AB177765.1' 'AB177791.1');
                setFasta;
                fetch.

result outputToFile: 'nuccore_seqs.fasta'

the script will download the sequences in one trip vía the NCBI Entrez API, if you just wanted the GenBank format, just set #setGb instead of #setFasta above.The default format is ASN.1, which is a "hard" format for bioinformaticians. To download PubMed records from UID's, the following is a simple possible script:

| result |

result := 
 BioEntrezClient new pubmed 
  uids: #( 11877539 11822933 );
  fetch.
result outputToFile: 'fetchPubmed-01.txt'
 
And pretty much the same way with the spelling service using PubMedCentral database

| result |
 
result := 
 BioEntrezClient new pmc 
  term: 'fiberblast cell grwth';
  spell.
result outputToFile: 'eSpell1.xml'

With some classes you would want to view the possible messages you may send, for example to get the list of databases through Entrez. In this case this is easily done with the reflection capabilities of Smalltalk:

BioEntrezClient organization
   listAtCategoryNamed: 'accessing public - databases'