Monday, December 22, 2014

Download a human chromosome in one line of code

Let's write plain Smalltalk code to download the human chromosome 22 FASTA file from the NCBI servers (about 9.6 MB, gzip-compressed):
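| client fileName fStream |

fileName := 'hs_alt_HuRef_chr22.fa.gz'.
"Open an anonymous FTP session in binary mode and move to the chromosome 22 directory."
[ client := (FTPClient openOnHostNamed: 'ftp.ncbi.nlm.nih.gov')
                loginUser: 'anonymous' password: '';
                binary;
                changeDirectoryTo: 'genomes/H_sapiens/CHR_22';
                yourself.
"Fetch the remote file and write its bytes to a local file of the same name."
(FileStream newFileNamed: fileName)
        binary;
        nextPutAll: (client getFileNamed: fileName);
        close.
"Release the FTP connection once the file has been written."
client close ]
on: NetworkError, LoginFailedException
do: [ : ex | self error: 'Connection failed' ].

"Read the downloaded file back and serialize its contents with Fuel."
fStream := fileName asFileReference readStream.
(ByteArray streamContents: [ : stream |
    FLSerializer serialize: fStream binary contents on: stream ]) storeString.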
That seems like a lot of typing for a bioinformatics library, and for the Smalltalk tradition. That's why I wrote a Genome Downloader class, which makes it really easy to download the latest build:
BioHSapiensGD new downloadChromosome: 22.
If you don't want the call to block, you can easily download in the background by forking the block at a background process priority:
[ BioHSapiensGD new downloadChromosome: 22 ] 
 forkAt: Processor userBackgroundPriority 
 named: 'Downloading Human Chromosome...'.
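Since a forked download returns immediately, it can help to know when it has ended. A minimal sketch, assuming downloadChromosome: returns only once the file has been written (as the blocking behaviour mentioned above suggests); the Transcript message is purely illustrative:

```smalltalk
"Fork the download and report on the Transcript when it ends.
ensure: runs its block when the download returns, and also if the
process is unwound or terminated."
[ [ BioHSapiensGD new downloadChromosome: 22 ]
    ensure: [ Transcript show: 'Chromosome 22 download ended'; cr ] ]
  forkAt: Processor userBackgroundPriority
  named: 'Downloading Human Chromosome 22'.
```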
Results are downloaded to the directory where the virtual image's .image and .changes files are located. But why stop at the human species? There are subclasses for Bos taurus (from the UMD, Center for Bioinformatics and Computational Biology, University of Maryland, and The Bovine Genome Sequencing Consortium), Gallus gallus (International Chicken Genome Sequencing Consortium) and Mus musculus (Celera Genomics and Genome Reference Consortium), and others can be built by specializing just a few methods. We can download any available assembled genome with just one line of code. Enjoy.
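To illustrate the one-line claim, several chromosomes can be fetched in sequence. This is only a sketch: it assumes downloadChromosome: accepts other chromosome numbers in the same way it accepts 22, which is not shown above.

```smalltalk
"Download chromosomes 20 to 22 one after another.
Assumption: downloadChromosome: takes any of these numbers; each call blocks until it returns."
20 to: 22 do: [ :number |
    BioHSapiensGD new downloadChromosome: number ].
```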