Monday, December 22, 2014

Download a human chromosome in one line of code

Let's write plain Smalltalk code to download the Human chromosome 22 FASTA from the NCBI servers (about 9,6 Mbytes gzip compressed)
| client fileName fStream |

fileName := 'hs_alt_HuRef_chr22.fa.gz'.
[ client := (FTPClient openOnHostNamed: 'ftp.ncbi.nlm.nih.gov')
                loginUser: 'anonymous' password: '';
                binary;
                changeDirectoryTo: 'genomes/H_sapiens/CHR_22'.
(FileStream newFileNamed: fileName)
        binary;
        nextPutAll: (client getFileNamed: fileName);
        close ]
on: NetworkError, LoginFailedException
do: [ : ex | self error: 'Connection failed' ].

fStream := fileName asFileReference readStream.
(ByteArray streamContents: [ : stream |
    FLSerializer serialize: fStream binary contents on: stream ]) storeString.
That seems a lot of typing for a Bioinformatics library and Smalltalk tradition. That's why I wrote a Genome Downloader class which makes really easy to download the latest build:
BioHSapiensGD new downloadChromosome: 22.
If you don't want the blocking feature, you can easily download in background by setting the process priority:
[ BioHSapiensGD new downloadChromosome: 22 ] 
 forkAt: Processor userBackgroundPriority 
 named: 'Downloading Human Chromosome...'.
Results will be downloaded in the directory where the virtual .image and .changes files are located. But why stop at human race? There are subclasses for Bos Taurus (from the UMD, Center for Bioinformatics and Computational Biology, University of Maryland, and The Bovine Genome Sequencing Consortium), Gallus Gallus (International Chicken Genome Sequencing Consortium) and Mus Musculus (Celera Genomics and Genome Reference Consortium) and others can be built by just specializing very few methods. We can just download any available assembled genomes with just one line of code. Enjoy.

Tuesday, January 14, 2014

Arlequin format writer

Introduction


Arlequin is a famous software for population genetics data analysis. The file format is well documented in the Arlequin's Manual, so I will not duplicate information here. Writing an Arlequin file consists of basically generating a customized INI file with both Profile and Samples sections.
Now you can use the API provided in BioSmalltalk to write Arlequin files programatically. The API pattern in the most naive form looks like this
arlequinFile := BioArlequinFile new.
arlequinFile profileSection
        addTitle: 'Sample Title';
        " ... profile configuration messages ... ".

arlequinFile samplesSection
        addSampleName: '"SAMPLE1"';
        addSampleSize: '8';
        addSampleData: " ... sample data 1 ... ";

        addSampleName: '"SAMPLE2"';
        addSampleSize: '8';
        addSampleData: " ... sample data 2 ... ";       

        " ... you guessed it ... "

it seems pretty simple, but in practice you will not type the hundreds of samples in a typical Arlequin data set. You would like to iterate your input data. 

Building the Samples Collection
If you observe the pattern above, each sample contains three pieces of information: Sample Name, Sample Size and Sample Data. Basically you have two input layouts. Each population comes from separate collections, i.e.:
| arlequinFile samplesSection samplesCollection idCollection frqCollection |
arlequinFile := BioArlequinFile new.
samplesSection := arlequinFile samplesSection.

idCollection := #('OT1' 'B1' 'A1' 'CH1' 'J1' 'USA1' 'OT2' 'OT3' 'B2' 'A2' 'A3' 'A4' 'USA2' 'USA3' 'USA4' 'USA5' 'USA6' 'USA7' 'OT4' 'B3' 'B4' 'B5' 'A5' 'J2' 'J3' 'USA8' 'USA9' 'USA10' 'USA11' 'USA12' 'USA13' 'B6' 'C1' 'J4' 'USA14' 'OT5' 'OT6' 'B7' 'CH2' 'CH3' 'A6' 'CH4' 'A7').
frqCollection := #(5 5 6 3 2 11 1 2 1 1 1 1 1 2 1 1 1 1 5 2 1 1 1 1 1 1 1 4 1 1 1 3 1 1 2 4 3 1 1 1 1 1 1).
samplesCollection := #('ATCTAGCAATACTGTTTTGTCTTCTATCGTCAACCATT' 'ATCTAGCAATACTGTTTTGTCTTCTATCGTCACCCATT' 'ATCTAGCAATACTGTTTTGTCTTCTATCGTCACCCATT' 'ATCTAGCAATACTGTTTTGTCTTCTATCGTCACCCATT' 'ATCTAGCAACACTGTTTTGTCTTCTATCGTCACCCATT' 'ATCTAGCAATACTGTTTTGTCTTCTATCGTCACCCATT' 'ATCTAGCAATACTGTTTTGTCTTCTGTCGTCACCGATT' 'ATCTAGCAATACTGCTTTGTCTTCTATCGTCACCCATT' 'ATCTAGCAATACTATTTTGTCTTCTATCGTCACCCATT' 'ATCTGGCAATACTGTTTTGTCTTCTATCGTCACCCATT' 'ATCTAGCAATACTATTTTGTCTTCTATCATCACCCATT' 'ATCTAGCAATATTGTTTTGTCTTCTATCGTCACCCATT' 'ATCTAGCAATACTGTCTTGTCTTCTATCGTCACCCATT' 'ATCTAACAATACTGTCTTGTCTTCTATCGTCACCCTTT' 'ATCTAGCAATACTGTCTTGTCTTCTATCGTCATCTATT' 'ACCTAGCAATACTGTCTTGTCTTCTATCGTCACCCATT' 'ATCTAGCAATTCTGTCTTATCTTCTATCGTCACCCATT' 'ATCTAGCAATACTGTCTTATGTTTTATCGTCACCCATT' 'ATCTAGCAATACTGCCTTATCTTTTATCGTCACCCACT' 'ATCTAGCAATACTGTCTCATTTTTTATCGTCACCCATT' 'ATCTAGCAATACTGCCTTATCTTTTATCGTCACCCACT' 'ATCTAGTAATACTGCCTTATCTTTTATCGTCGCCCATT' 'ATCTAGCAATACTGCCCCATCTTTTATCGTCACCCATT' 'ATCTAACAACACTGCCTTATCTTTTATCGTCACCCATT' 'ATCTAGCTGTACTGCCTTACCTTTTATCGTCACCCATT' 'ATCCAGCAATACTGCCTCATCTTTTATCGTCACCCATT' 'ATCTAGCAATACCATCTTATCTTTCATCGTCACCCATT' 'ATCTAGCAATACTGCCTTATCTTTTGTCGTCACCCACT' 'ATCTAGCAATACTGTCTTACCCTTTATCGTCACCCATT' 'GTCTAGCAATACTGTCTTACCTTTTATCGTCACCCATT' 'ATCTAGCAATACTGTCTTATCTTTTATCGTCACCCGTT' 'ATTTAGTAATACCGTCTTATCTTTTATCGTCACCCATT' 'ATCTAGCTATACTGTCTTATCTCTCATCGTTACCCATT' 'ATCTAACAATACTGCCTTATCTTTTATCGTCACCCACT' 'ACCTAGCAATACTGTCTTATCTTTTATCGTCATTCATT' 'ATCTAGCGATACTGTCTTATCTTTTATCACCACCTATT' 'ATCTAGCGATACTGTCTTATCTTTTATCACCACCCATG' 'ATCTAGCGATACTGTCTTATCTCTTATCACCACCTATT' 'ATCTAACAACACTGTCCTATCTTTTATCGTCACCCACT' 'ATTTAACAATACTGTCCTATCTTTTATCGTCACCCACT' 'ATTTAGCAATACTCTCCTATCTTTTACCGTCACCCACT' 'ATTTAGCAATACTGTCCTATCTCTTATCGTCACCTACT' 'ATTTAGCAATGCTGTCCCATCTTTTATTGTCACCCACT').
 
samplesSection addSamples: (BioA31SampleCollection forDNA
 identifiers: idCollection;
 frequencies: frqCollection;
 sequences: samplesCollection;
 yourself).

" Export contents into a file "
arlequinFile contents writeOn: (FileStream newFileNamed: 'myArlequin.arp')
Or population data comes as a triplet. This could be the case after you have grouped your input by alignment and calculated the frequencies. In that case you may use #triplesDo: to take each population by 3-element and build your Arlequin file like this:
| arlequinFile samplesSection populations |
arlequinFile := BioArlequinFile new.
samplesSection := arlequinFile samplesSection.

populations := #('OT1' 5 'ATCTAGCAATACTGTTTTGTCTTCTATCGTCAACCATT' 'B1' 5 'ATCTAGCAATACTGTTTTGTCTTCTATCGTCACCCATT' 'A1' 6 'ATCTAGCAATACTGTTTTGTCTTCTATCGTCACCCATT' 'CH1' 3 'ATCTAGCAATACTGTTTTGTCTTCTATCGTCACCCATT' 'J1' 2 'ATCTAGCAACACTGTTTTGTCTTCTATCGTCACCCATT' 'USA1' 11 'ATCTAGCAATACTGTTTTGTCTTCTATCGTCACCCATT' 'OT2' 1 'ATCTAGCAATACTGTTTTGTCTTCTGTCGTCACCGATT' 'OT3' 2 'ATCTAGCAATACTGCTTTGTCTTCTATCGTCACCCATT' 'B2' 1 'ATCTAGCAATACTATTTTGTCTTCTATCGTCACCCATT' 'A2' 1 'ATCTGGCAATACTGTTTTGTCTTCTATCGTCACCCATT' 'A3' 1 'ATCTAGCAATACTATTTTGTCTTCTATCATCACCCATT' 'A4' 1 'ATCTAGCAATATTGTTTTGTCTTCTATCGTCACCCATT' 'USA2' 1 'ATCTAGCAATACTGTCTTGTCTTCTATCGTCACCCATT' 'USA3' 2 'ATCTAACAATACTGTCTTGTCTTCTATCGTCACCCTTT' 'USA4' 1 'ATCTAGCAATACTGTCTTGTCTTCTATCGTCATCTATT' 'USA5' 1 'ACCTAGCAATACTGTCTTGTCTTCTATCGTCACCCATT' 'USA6' 1 'ATCTAGCAATTCTGTCTTATCTTCTATCGTCACCCATT' 'USA7' 1 'ATCTAGCAATACTGTCTTATGTTTTATCGTCACCCATT' 'OT4' 5 'ATCTAGCAATACTGCCTTATCTTTTATCGTCACCCACT' 'B3' 2 'ATCTAGCAATACTGTCTCATTTTTTATCGTCACCCATT' 'B4' 1 'ATCTAGCAATACTGCCTTATCTTTTATCGTCACCCACT' 'B5' 1 'ATCTAGTAATACTGCCTTATCTTTTATCGTCGCCCATT' 'A5' 1 'ATCTAGCAATACTGCCCCATCTTTTATCGTCACCCATT' 'J2' 1 'ATCTAACAACACTGCCTTATCTTTTATCGTCACCCATT' 'J3' 1 'ATCTAGCTGTACTGCCTTACCTTTTATCGTCACCCATT' 'USA8' 1 'ATCCAGCAATACTGCCTCATCTTTTATCGTCACCCATT' 'USA9' 1 'ATCTAGCAATACCATCTTATCTTTCATCGTCACCCATT' 'USA10' 4 'ATCTAGCAATACTGCCTTATCTTTTGTCGTCACCCACT' 'USA11' 1 'ATCTAGCAATACTGTCTTACCCTTTATCGTCACCCATT' 'USA12' 1 'GTCTAGCAATACTGTCTTACCTTTTATCGTCACCCATT' 'USA13' 1 'ATCTAGCAATACTGTCTTATCTTTTATCGTCACCCGTT' 'B6' 3 'ATTTAGTAATACCGTCTTATCTTTTATCGTCACCCATT' 'C1' 1 'ATCTAGCTATACTGTCTTATCTCTCATCGTTACCCATT' 'J4' 1 'ATCTAACAATACTGCCTTATCTTTTATCGTCACCCACT' 'USA14' 2 'ACCTAGCAATACTGTCTTATCTTTTATCGTCATTCATT' 'OT5' 4 'ATCTAGCGATACTGTCTTATCTTTTATCACCACCTATT' 'OT6' 3 'ATCTAGCGATACTGTCTTATCTTTTATCACCACCCATG' 'B7' 1 'ATCTAGCGATACTGTCTTATCTCTTATCACCACCTATT').
populations triplesDo: [ : id : freq : seq |
 samplesSection 
  addSampleName: id;
  addSampleSize: freq;
  addSampleData: seq;
  yourself ].
" Export contents into a file "
arlequinFile contents writeOn: (FileStream newFileNamed: 'myArlequin.arp')
Don't forget to check BioArlequinFile convenience methods for building for different data types: #buildHaplotypicDataDNAProfileTitle: aString groups: aNumber missingData: missingCharacter #buildHaplotypicDataFrequencyProfileTitle: aString groups: aNumber missingData: missingCharacter And let me know any suggestions for improving the Arlequin API.