Tuesday, January 14, 2014

Arlequin format writer

Introduction


Arlequin is a famous software for population genetics data analysis. The file format is well documented in the Arlequin's Manual, so I will not duplicate information here. Writing an Arlequin file consists of basically generating a customized INI file with both Profile and Samples sections.
Now you can use the API provided in BioSmalltalk to write Arlequin files programatically. The API pattern in the most naive form looks like this
arlequinFile := BioArlequinFile new.
arlequinFile profileSection
        addTitle: 'Sample Title';
        " ... profile configuration messages ... ".

arlequinFile samplesSection
        addSampleName: '"SAMPLE1"';
        addSampleSize: '8';
        addSampleData: " ... sample data 1 ... ";

        addSampleName: '"SAMPLE2"';
        addSampleSize: '8';
        addSampleData: " ... sample data 2 ... ";       

        " ... you guessed it ... "

it seems pretty simple, but in practice you will not type the hundreds of samples in a typical Arlequin data set. You would like to iterate your input data. 

Building the Samples Collection
If you observe the pattern above, each sample contains three pieces of information: Sample Name, Sample Size and Sample Data. Basically you have two input layouts. Each population comes from separate collections, i.e.:
| arlequinFile samplesSection samplesCollection idCollection frqCollection |
arlequinFile := BioArlequinFile new.
samplesSection := arlequinFile samplesSection.

idCollection := #('OT1' 'B1' 'A1' 'CH1' 'J1' 'USA1' 'OT2' 'OT3' 'B2' 'A2' 'A3' 'A4' 'USA2' 'USA3' 'USA4' 'USA5' 'USA6' 'USA7' 'OT4' 'B3' 'B4' 'B5' 'A5' 'J2' 'J3' 'USA8' 'USA9' 'USA10' 'USA11' 'USA12' 'USA13' 'B6' 'C1' 'J4' 'USA14' 'OT5' 'OT6' 'B7' 'CH2' 'CH3' 'A6' 'CH4' 'A7').
frqCollection := #(5 5 6 3 2 11 1 2 1 1 1 1 1 2 1 1 1 1 5 2 1 1 1 1 1 1 1 4 1 1 1 3 1 1 2 4 3 1 1 1 1 1 1).
samplesCollection := #('ATCTAGCAATACTGTTTTGTCTTCTATCGTCAACCATT' 'ATCTAGCAATACTGTTTTGTCTTCTATCGTCACCCATT' 'ATCTAGCAATACTGTTTTGTCTTCTATCGTCACCCATT' 'ATCTAGCAATACTGTTTTGTCTTCTATCGTCACCCATT' 'ATCTAGCAACACTGTTTTGTCTTCTATCGTCACCCATT' 'ATCTAGCAATACTGTTTTGTCTTCTATCGTCACCCATT' 'ATCTAGCAATACTGTTTTGTCTTCTGTCGTCACCGATT' 'ATCTAGCAATACTGCTTTGTCTTCTATCGTCACCCATT' 'ATCTAGCAATACTATTTTGTCTTCTATCGTCACCCATT' 'ATCTGGCAATACTGTTTTGTCTTCTATCGTCACCCATT' 'ATCTAGCAATACTATTTTGTCTTCTATCATCACCCATT' 'ATCTAGCAATATTGTTTTGTCTTCTATCGTCACCCATT' 'ATCTAGCAATACTGTCTTGTCTTCTATCGTCACCCATT' 'ATCTAACAATACTGTCTTGTCTTCTATCGTCACCCTTT' 'ATCTAGCAATACTGTCTTGTCTTCTATCGTCATCTATT' 'ACCTAGCAATACTGTCTTGTCTTCTATCGTCACCCATT' 'ATCTAGCAATTCTGTCTTATCTTCTATCGTCACCCATT' 'ATCTAGCAATACTGTCTTATGTTTTATCGTCACCCATT' 'ATCTAGCAATACTGCCTTATCTTTTATCGTCACCCACT' 'ATCTAGCAATACTGTCTCATTTTTTATCGTCACCCATT' 'ATCTAGCAATACTGCCTTATCTTTTATCGTCACCCACT' 'ATCTAGTAATACTGCCTTATCTTTTATCGTCGCCCATT' 'ATCTAGCAATACTGCCCCATCTTTTATCGTCACCCATT' 'ATCTAACAACACTGCCTTATCTTTTATCGTCACCCATT' 'ATCTAGCTGTACTGCCTTACCTTTTATCGTCACCCATT' 'ATCCAGCAATACTGCCTCATCTTTTATCGTCACCCATT' 'ATCTAGCAATACCATCTTATCTTTCATCGTCACCCATT' 'ATCTAGCAATACTGCCTTATCTTTTGTCGTCACCCACT' 'ATCTAGCAATACTGTCTTACCCTTTATCGTCACCCATT' 'GTCTAGCAATACTGTCTTACCTTTTATCGTCACCCATT' 'ATCTAGCAATACTGTCTTATCTTTTATCGTCACCCGTT' 'ATTTAGTAATACCGTCTTATCTTTTATCGTCACCCATT' 'ATCTAGCTATACTGTCTTATCTCTCATCGTTACCCATT' 'ATCTAACAATACTGCCTTATCTTTTATCGTCACCCACT' 'ACCTAGCAATACTGTCTTATCTTTTATCGTCATTCATT' 'ATCTAGCGATACTGTCTTATCTTTTATCACCACCTATT' 'ATCTAGCGATACTGTCTTATCTTTTATCACCACCCATG' 'ATCTAGCGATACTGTCTTATCTCTTATCACCACCTATT' 'ATCTAACAACACTGTCCTATCTTTTATCGTCACCCACT' 'ATTTAACAATACTGTCCTATCTTTTATCGTCACCCACT' 'ATTTAGCAATACTCTCCTATCTTTTACCGTCACCCACT' 'ATTTAGCAATACTGTCCTATCTCTTATCGTCACCTACT' 'ATTTAGCAATGCTGTCCCATCTTTTATTGTCACCCACT').
 
samplesSection addSamples: (BioA31SampleCollection forDNA
 identifiers: idCollection;
 frequencies: frqCollection;
 sequences: samplesCollection;
 yourself).

" Export contents into a file "
arlequinFile contents writeOn: (FileStream newFileNamed: 'myArlequin.arp')
Or population data comes as a triplet. This could be the case after you have grouped your input by alignment and calculated the frequencies. In that case you may use #triplesDo: to take each population by 3-element and build your Arlequin file like this:
| arlequinFile samplesSection populations |
arlequinFile := BioArlequinFile new.
samplesSection := arlequinFile samplesSection.

populations := #('OT1' 5 'ATCTAGCAATACTGTTTTGTCTTCTATCGTCAACCATT' 'B1' 5 'ATCTAGCAATACTGTTTTGTCTTCTATCGTCACCCATT' 'A1' 6 'ATCTAGCAATACTGTTTTGTCTTCTATCGTCACCCATT' 'CH1' 3 'ATCTAGCAATACTGTTTTGTCTTCTATCGTCACCCATT' 'J1' 2 'ATCTAGCAACACTGTTTTGTCTTCTATCGTCACCCATT' 'USA1' 11 'ATCTAGCAATACTGTTTTGTCTTCTATCGTCACCCATT' 'OT2' 1 'ATCTAGCAATACTGTTTTGTCTTCTGTCGTCACCGATT' 'OT3' 2 'ATCTAGCAATACTGCTTTGTCTTCTATCGTCACCCATT' 'B2' 1 'ATCTAGCAATACTATTTTGTCTTCTATCGTCACCCATT' 'A2' 1 'ATCTGGCAATACTGTTTTGTCTTCTATCGTCACCCATT' 'A3' 1 'ATCTAGCAATACTATTTTGTCTTCTATCATCACCCATT' 'A4' 1 'ATCTAGCAATATTGTTTTGTCTTCTATCGTCACCCATT' 'USA2' 1 'ATCTAGCAATACTGTCTTGTCTTCTATCGTCACCCATT' 'USA3' 2 'ATCTAACAATACTGTCTTGTCTTCTATCGTCACCCTTT' 'USA4' 1 'ATCTAGCAATACTGTCTTGTCTTCTATCGTCATCTATT' 'USA5' 1 'ACCTAGCAATACTGTCTTGTCTTCTATCGTCACCCATT' 'USA6' 1 'ATCTAGCAATTCTGTCTTATCTTCTATCGTCACCCATT' 'USA7' 1 'ATCTAGCAATACTGTCTTATGTTTTATCGTCACCCATT' 'OT4' 5 'ATCTAGCAATACTGCCTTATCTTTTATCGTCACCCACT' 'B3' 2 'ATCTAGCAATACTGTCTCATTTTTTATCGTCACCCATT' 'B4' 1 'ATCTAGCAATACTGCCTTATCTTTTATCGTCACCCACT' 'B5' 1 'ATCTAGTAATACTGCCTTATCTTTTATCGTCGCCCATT' 'A5' 1 'ATCTAGCAATACTGCCCCATCTTTTATCGTCACCCATT' 'J2' 1 'ATCTAACAACACTGCCTTATCTTTTATCGTCACCCATT' 'J3' 1 'ATCTAGCTGTACTGCCTTACCTTTTATCGTCACCCATT' 'USA8' 1 'ATCCAGCAATACTGCCTCATCTTTTATCGTCACCCATT' 'USA9' 1 'ATCTAGCAATACCATCTTATCTTTCATCGTCACCCATT' 'USA10' 4 'ATCTAGCAATACTGCCTTATCTTTTGTCGTCACCCACT' 'USA11' 1 'ATCTAGCAATACTGTCTTACCCTTTATCGTCACCCATT' 'USA12' 1 'GTCTAGCAATACTGTCTTACCTTTTATCGTCACCCATT' 'USA13' 1 'ATCTAGCAATACTGTCTTATCTTTTATCGTCACCCGTT' 'B6' 3 'ATTTAGTAATACCGTCTTATCTTTTATCGTCACCCATT' 'C1' 1 'ATCTAGCTATACTGTCTTATCTCTCATCGTTACCCATT' 'J4' 1 'ATCTAACAATACTGCCTTATCTTTTATCGTCACCCACT' 'USA14' 2 'ACCTAGCAATACTGTCTTATCTTTTATCGTCATTCATT' 'OT5' 4 'ATCTAGCGATACTGTCTTATCTTTTATCACCACCTATT' 'OT6' 3 'ATCTAGCGATACTGTCTTATCTTTTATCACCACCCATG' 'B7' 1 'ATCTAGCGATACTGTCTTATCTCTTATCACCACCTATT').
populations triplesDo: [ : id : freq : seq |
 samplesSection 
  addSampleName: id;
  addSampleSize: freq;
  addSampleData: seq;
  yourself ].
" Export contents into a file "
arlequinFile contents writeOn: (FileStream newFileNamed: 'myArlequin.arp')
Don't forget to check BioArlequinFile convenience methods for building for different data types: #buildHaplotypicDataDNAProfileTitle: aString groups: aNumber missingData: missingCharacter #buildHaplotypicDataFrequencyProfileTitle: aString groups: aNumber missingData: missingCharacter And let me know any suggestions for improving the Arlequin API.