Showing posts with label Arlequin. Show all posts

Tuesday, January 14, 2014

Arlequin format writer

Introduction


Arlequin is a well-known software package for population genetics data analysis. The file format is well documented in the Arlequin manual, so I will not duplicate that information here. Writing an Arlequin file basically consists of generating a customized INI-style file with both Profile and Samples sections.
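For orientation only, the overall layout of such a file can be sketched with plain Pharo streams, without BioSmalltalk. The keys follow the project file structure described in the Arlequin manual; the title, the sample name and the single haplotype line are placeholder values, and the variable name arp is just illustrative. This is a minimal sketch of the target text, not how BioArlequinFile is implemented.

```smalltalk
| arp |
"Build the text of a minimal Arlequin project by hand (placeholder values)."
arp := String streamContents: [ :out |
    out
        nextPutAll: '[Profile]'; cr;
        tab; nextPutAll: 'Title="Sample Title"'; cr;
        tab; nextPutAll: 'NbSamples=1'; cr;
        tab; nextPutAll: 'DataType=DNA'; cr;
        tab; nextPutAll: 'GenotypicData=0'; cr;
        tab; nextPutAll: 'LocusSeparator=NONE'; cr;
        tab; nextPutAll: 'MissingData=''?'''; cr;
        nextPutAll: '[Data]'; cr;
        tab; nextPutAll: '[[Samples]]'; cr;
        tab; tab; nextPutAll: 'SampleName="SAMPLE1"'; cr;
        tab; tab; nextPutAll: 'SampleSize=5'; cr;
        tab; tab; nextPutAll: 'SampleData= {'; cr;
        "One haplotype line: identifier, frequency, sequence."
        tab; tab; tab; nextPutAll: 'OT1 5 ATCTAGCAATACTGTTTTGTCTTCTATCGTCAACCATT'; cr;
        tab; tab; nextPutAll: '}'; cr ].
Transcript show: arp.
```

Each sample repeats the SampleName, SampleSize and SampleData entries, which is why generating them from collections pays off quickly.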
Now you can use the API provided in BioSmalltalk to write Arlequin files programmatically. In its most naive form, the API pattern looks like this:
arlequinFile := BioArlequinFile new.
arlequinFile profileSection
        addTitle: 'Sample Title';
        " ... profile configuration messages ... ".

arlequinFile samplesSection
        addSampleName: '"SAMPLE1"';
        addSampleSize: '8';
        addSampleData: " ... sample data 1 ... ";

        addSampleName: '"SAMPLE2"';
        addSampleSize: '8';
        addSampleData: " ... sample data 2 ... ";       

        " ... you guessed it ... "

It seems pretty simple, but in practice you will not type in the hundreds of samples of a typical Arlequin data set by hand. You will want to iterate over your input data instead.

Building the Samples Collection
If you look at the pattern above, each sample contains three pieces of information: Sample Name, Sample Size and Sample Data. Basically there are two input layouts. In the first one, the identifiers, frequencies and sequences come from separate collections, matched by position:
| arlequinFile samplesSection samplesCollection idCollection frqCollection |
arlequinFile := BioArlequinFile new.
samplesSection := arlequinFile samplesSection.

idCollection := #('OT1' 'B1' 'A1' 'CH1' 'J1' 'USA1' 'OT2' 'OT3' 'B2' 'A2' 'A3' 'A4' 'USA2' 'USA3' 'USA4' 'USA5' 'USA6' 'USA7' 'OT4' 'B3' 'B4' 'B5' 'A5' 'J2' 'J3' 'USA8' 'USA9' 'USA10' 'USA11' 'USA12' 'USA13' 'B6' 'C1' 'J4' 'USA14' 'OT5' 'OT6' 'B7' 'CH2' 'CH3' 'A6' 'CH4' 'A7').
frqCollection := #(5 5 6 3 2 11 1 2 1 1 1 1 1 2 1 1 1 1 5 2 1 1 1 1 1 1 1 4 1 1 1 3 1 1 2 4 3 1 1 1 1 1 1).
samplesCollection := #('ATCTAGCAATACTGTTTTGTCTTCTATCGTCAACCATT' 'ATCTAGCAATACTGTTTTGTCTTCTATCGTCACCCATT' 'ATCTAGCAATACTGTTTTGTCTTCTATCGTCACCCATT' 'ATCTAGCAATACTGTTTTGTCTTCTATCGTCACCCATT' 'ATCTAGCAACACTGTTTTGTCTTCTATCGTCACCCATT' 'ATCTAGCAATACTGTTTTGTCTTCTATCGTCACCCATT' 'ATCTAGCAATACTGTTTTGTCTTCTGTCGTCACCGATT' 'ATCTAGCAATACTGCTTTGTCTTCTATCGTCACCCATT' 'ATCTAGCAATACTATTTTGTCTTCTATCGTCACCCATT' 'ATCTGGCAATACTGTTTTGTCTTCTATCGTCACCCATT' 'ATCTAGCAATACTATTTTGTCTTCTATCATCACCCATT' 'ATCTAGCAATATTGTTTTGTCTTCTATCGTCACCCATT' 'ATCTAGCAATACTGTCTTGTCTTCTATCGTCACCCATT' 'ATCTAACAATACTGTCTTGTCTTCTATCGTCACCCTTT' 'ATCTAGCAATACTGTCTTGTCTTCTATCGTCATCTATT' 'ACCTAGCAATACTGTCTTGTCTTCTATCGTCACCCATT' 'ATCTAGCAATTCTGTCTTATCTTCTATCGTCACCCATT' 'ATCTAGCAATACTGTCTTATGTTTTATCGTCACCCATT' 'ATCTAGCAATACTGCCTTATCTTTTATCGTCACCCACT' 'ATCTAGCAATACTGTCTCATTTTTTATCGTCACCCATT' 'ATCTAGCAATACTGCCTTATCTTTTATCGTCACCCACT' 'ATCTAGTAATACTGCCTTATCTTTTATCGTCGCCCATT' 'ATCTAGCAATACTGCCCCATCTTTTATCGTCACCCATT' 'ATCTAACAACACTGCCTTATCTTTTATCGTCACCCATT' 'ATCTAGCTGTACTGCCTTACCTTTTATCGTCACCCATT' 'ATCCAGCAATACTGCCTCATCTTTTATCGTCACCCATT' 'ATCTAGCAATACCATCTTATCTTTCATCGTCACCCATT' 'ATCTAGCAATACTGCCTTATCTTTTGTCGTCACCCACT' 'ATCTAGCAATACTGTCTTACCCTTTATCGTCACCCATT' 'GTCTAGCAATACTGTCTTACCTTTTATCGTCACCCATT' 'ATCTAGCAATACTGTCTTATCTTTTATCGTCACCCGTT' 'ATTTAGTAATACCGTCTTATCTTTTATCGTCACCCATT' 'ATCTAGCTATACTGTCTTATCTCTCATCGTTACCCATT' 'ATCTAACAATACTGCCTTATCTTTTATCGTCACCCACT' 'ACCTAGCAATACTGTCTTATCTTTTATCGTCATTCATT' 'ATCTAGCGATACTGTCTTATCTTTTATCACCACCTATT' 'ATCTAGCGATACTGTCTTATCTTTTATCACCACCCATG' 'ATCTAGCGATACTGTCTTATCTCTTATCACCACCTATT' 'ATCTAACAACACTGTCCTATCTTTTATCGTCACCCACT' 'ATTTAACAATACTGTCCTATCTTTTATCGTCACCCACT' 'ATTTAGCAATACTCTCCTATCTTTTACCGTCACCCACT' 'ATTTAGCAATACTGTCCTATCTCTTATCGTCACCTACT' 'ATTTAGCAATGCTGTCCCATCTTTTATTGTCACCCACT').
 
samplesSection addSamples: (BioA31SampleCollection forDNA
 identifiers: idCollection;
 frequencies: frqCollection;
 sequences: samplesCollection;
 yourself).

" Export contents into a file "
arlequinFile contents writeOn: (FileStream newFileNamed: 'myArlequin.arp')
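Because the three collections are matched by position, a size mismatch would pair identifiers with the wrong frequencies or sequences. A quick sanity check before calling #addSamples: could look like the sketch below. It uses only standard collection messages and the first three entries of the data above; the variable names are illustrative, and it does not rely on any validation inside BioA31SampleCollection.

```smalltalk
| ids freqs seqs totalIndividuals |
ids := #('OT1' 'B1' 'A1').
freqs := #(5 5 6).
seqs := #('ATCTAGCAATACTGTTTTGTCTTCTATCGTCAACCATT'
          'ATCTAGCAATACTGTTTTGTCTTCTATCGTCACCCATT'
          'ATCTAGCAATACTGTTTTGTCTTCTATCGTCACCCATT').

"The three collections must describe the same haplotypes, position by position."
(ids size = freqs size and: [ freqs size = seqs size ])
    ifFalse: [ Error signal: 'Identifiers, frequencies and sequences differ in size' ].

"Aligned sequences are expected to share a single length."
((seqs collect: [ :each | each size ]) asSet size = 1)
    ifFalse: [ Error signal: 'Sequences have different lengths' ].

"Total number of individuals represented by the frequencies."
totalIndividuals := freqs inject: 0 into: [ :sum :each | sum + each ].
```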
In the second layout, the population data comes as triplets. This could be the case after you have grouped your input by alignment and calculated the frequencies. In that case you may use #triplesDo: to take the elements three at a time (identifier, frequency, sequence) and build your Arlequin file like this:
| arlequinFile samplesSection populations |
arlequinFile := BioArlequinFile new.
samplesSection := arlequinFile samplesSection.

populations := #('OT1' 5 'ATCTAGCAATACTGTTTTGTCTTCTATCGTCAACCATT' 'B1' 5 'ATCTAGCAATACTGTTTTGTCTTCTATCGTCACCCATT' 'A1' 6 'ATCTAGCAATACTGTTTTGTCTTCTATCGTCACCCATT' 'CH1' 3 'ATCTAGCAATACTGTTTTGTCTTCTATCGTCACCCATT' 'J1' 2 'ATCTAGCAACACTGTTTTGTCTTCTATCGTCACCCATT' 'USA1' 11 'ATCTAGCAATACTGTTTTGTCTTCTATCGTCACCCATT' 'OT2' 1 'ATCTAGCAATACTGTTTTGTCTTCTGTCGTCACCGATT' 'OT3' 2 'ATCTAGCAATACTGCTTTGTCTTCTATCGTCACCCATT' 'B2' 1 'ATCTAGCAATACTATTTTGTCTTCTATCGTCACCCATT' 'A2' 1 'ATCTGGCAATACTGTTTTGTCTTCTATCGTCACCCATT' 'A3' 1 'ATCTAGCAATACTATTTTGTCTTCTATCATCACCCATT' 'A4' 1 'ATCTAGCAATATTGTTTTGTCTTCTATCGTCACCCATT' 'USA2' 1 'ATCTAGCAATACTGTCTTGTCTTCTATCGTCACCCATT' 'USA3' 2 'ATCTAACAATACTGTCTTGTCTTCTATCGTCACCCTTT' 'USA4' 1 'ATCTAGCAATACTGTCTTGTCTTCTATCGTCATCTATT' 'USA5' 1 'ACCTAGCAATACTGTCTTGTCTTCTATCGTCACCCATT' 'USA6' 1 'ATCTAGCAATTCTGTCTTATCTTCTATCGTCACCCATT' 'USA7' 1 'ATCTAGCAATACTGTCTTATGTTTTATCGTCACCCATT' 'OT4' 5 'ATCTAGCAATACTGCCTTATCTTTTATCGTCACCCACT' 'B3' 2 'ATCTAGCAATACTGTCTCATTTTTTATCGTCACCCATT' 'B4' 1 'ATCTAGCAATACTGCCTTATCTTTTATCGTCACCCACT' 'B5' 1 'ATCTAGTAATACTGCCTTATCTTTTATCGTCGCCCATT' 'A5' 1 'ATCTAGCAATACTGCCCCATCTTTTATCGTCACCCATT' 'J2' 1 'ATCTAACAACACTGCCTTATCTTTTATCGTCACCCATT' 'J3' 1 'ATCTAGCTGTACTGCCTTACCTTTTATCGTCACCCATT' 'USA8' 1 'ATCCAGCAATACTGCCTCATCTTTTATCGTCACCCATT' 'USA9' 1 'ATCTAGCAATACCATCTTATCTTTCATCGTCACCCATT' 'USA10' 4 'ATCTAGCAATACTGCCTTATCTTTTGTCGTCACCCACT' 'USA11' 1 'ATCTAGCAATACTGTCTTACCCTTTATCGTCACCCATT' 'USA12' 1 'GTCTAGCAATACTGTCTTACCTTTTATCGTCACCCATT' 'USA13' 1 'ATCTAGCAATACTGTCTTATCTTTTATCGTCACCCGTT' 'B6' 3 'ATTTAGTAATACCGTCTTATCTTTTATCGTCACCCATT' 'C1' 1 'ATCTAGCTATACTGTCTTATCTCTCATCGTTACCCATT' 'J4' 1 'ATCTAACAATACTGCCTTATCTTTTATCGTCACCCACT' 'USA14' 2 'ACCTAGCAATACTGTCTTATCTTTTATCGTCATTCATT' 'OT5' 4 'ATCTAGCGATACTGTCTTATCTTTTATCACCACCTATT' 'OT6' 3 'ATCTAGCGATACTGTCTTATCTTTTATCACCACCCATG' 'B7' 1 'ATCTAGCGATACTGTCTTATCTCTTATCACCACCTATT').
populations triplesDo: [ :id :freq :seq |
 samplesSection 
  addSampleName: id;
  addSampleSize: freq;
  addSampleData: seq;
  yourself ].
" Export contents into a file "
arlequinFile contents writeOn: (FileStream newFileNamed: 'myArlequin.arp')
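In case #triplesDo: is not available in your image, the same three-at-a-time traversal can be sketched with standard messages only. The snippet below uses the first two triples of the data above and simply collects them instead of writing a file; the names triples and index are illustrative, and the size check is an extra precaution rather than documented behaviour of #triplesDo:.

```smalltalk
| populations triples |
populations := #('OT1' 5 'ATCTAGCAATACTGTTTTGTCTTCTATCGTCAACCATT'
                 'B1' 5 'ATCTAGCAATACTGTTTTGTCTTCTATCGTCACCCATT').

"A flat collection of triples must have a size that is a multiple of 3."
(populations size \\ 3 = 0)
    ifFalse: [ Error signal: 'Expected identifier, frequency and sequence for every entry' ].

"Walk the collection in steps of 3 and group each identifier, frequency and sequence."
triples := OrderedCollection new.
1 to: populations size by: 3 do: [ :index |
    triples add: (Array
        with: (populations at: index)
        with: (populations at: index + 1)
        with: (populations at: index + 2)) ].
```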
Don't forget to check the BioArlequinFile convenience methods for the different data types:
#buildHaplotypicDataDNAProfileTitle: aString groups: aNumber missingData: missingCharacter
#buildHaplotypicDataFrequencyProfileTitle: aString groups: aNumber missingData: missingCharacter
And let me know if you have any suggestions for improving the Arlequin API.

Saturday, February 23, 2013

PhyloclassTalk preview

In this post I want to present a preview of PhyloclassTalk, an application for phylogenetic analysis using the BioSmalltalk environment with Pharo 1.4. The main parts are presented through the Metro-style UI popularized in Windows 8. The following screenshot shows the main application window:


Except for the icons, the layout was generated programmatically with plain Morphic objects. The "Territory Builder" uses a wizard library called Merlin and is based on a Territorial library, which is basically a Composite pattern implementation for building complex territorial objects. I integrated the Help in just one hour, based on HelpSystem, without any previous knowledge of the library.

The main module window is a "Case Study Browser" implemented with the OmniBrowser framework. From the browser one can create phylogenetic data and associate it with a species case study, classify it according to defined territories and then export the results into formats like Arlequin, Google Fusion Tables or Fluxus Network.

The following screenshot shows the "Blast Query Builder", which enables dynamic generation and execution of queries over Blast XML results, producing filtered objects which can later be loaded in the case study browser for further associations. Filtered results can be cumulative, meaning that each new execution is applied to the previous results.
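The idea of cumulative filtering can be illustrated with a few lines of plain Pharo. This is only a sketch of the concept, not the PhyloclassTalk implementation: the hits are hypothetical arrays of (identifier, percent identity, alignment length), and the two filter blocks are made-up criteria.

```smalltalk
| hits filters results |
"Hypothetical hits: identifier, percent identity, alignment length."
hits := Array
    with: #('hit1' 99 380)
    with: #('hit2' 91 120)
    with: #('hit3' 85 400)
    with: #('hit4' 99 90).

"Each filter is a block; the criteria here are invented for illustration."
filters := OrderedCollection new.
filters add: [ :hit | (hit at: 2) >= 90 ].
filters add: [ :hit | (hit at: 3) >= 100 ].

"Cumulative: every filter runs on what the previous one left."
results := filters
    inject: hits
    into: [ :current :filter | current select: filter ].
```

Here the first filter keeps hit1, hit2 and hit4, and the second one then drops hit4, so only hit1 and hit2 remain.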



Detailed features such as the rule engine protocol and the post-curation of classified data will be described in the upcoming paper. I will also publish new posts on this front as I prepare a release, so stay tuned.