Introduction
Recently we published in the Animal Journal, an article analyzing Bovine Lymphocyte antigen (BoLA) region of the Brangus cattle. We used LAMP-LD (Local Ancestry Inference in Admixed populations) which is a window-based algorithm combined within a hierarchical Hidden Markov Model to represent haplotypes in a population and allows to estimate enrichment from two parental populations. The ancestral haplotype is the one which appeared first in the parental populations. As we do not know which one it is, we assume it is the most abundant combining both of them, and considering its association to the mixing ratio.
In the script below, we analyzed MHC regions (this is chromosome 23 for Cattle) and generated a histogram plot on annotating the ranges on these regions. The script was adapted for general usage, and each step is commented.
Implementation
For replicating the workflow into your Bioinformatics experiments, you will need the reference alleles of your SNPs. Here, we extracted them from the Affymetrix TXT report, exported from the Axiom Analysis Suite software. The PLINK wrapper is used to filter families of interest and genotyping error rate. Our cattle families are labeled as 'AA' for Angus, 'BRANG' for Brangus and 'BRAH' for Brahman respectively. If you do not have family information in your PED file, then you will have to add it manually or using tools like UNIX command-line utilities (cut, paste, etc). We also calculated the effective population size for each family (which is commonly referred as "Ne" in the literature) and used as parameter of ShapeIt.
Plotting was performed using the ROASSAL visualization engine, for which I should thank to Alexandre Bergel for its tireless help with their impressive API.
| shapeItSamplesDir baseDir reportFile alleleAFile plinkFile | " Base directory for all input and output files " baseDir := '.'. " Axiom Analysis Suite TXT Report " reportFile := baseDir , 'AAS_report.txt'. alleleAFile := baseDir , 'allele_A.txt'. plinkFile := baseDir , 'Bos1_REF-FINAL'. " Extract Allele_A column from Affymetrix report file to use it as reference alleles " BioAffyTXTFormatter new inputFile: reportFile; outputFilename: alleleAFile; buildSNP_AlleleA. " Build the PLINK BED file with reference alleles " " The SNPs_BoLA_Bos1-annotdb32.txt is just a text file used to extract the SNPs of interest in the MHC region from the Microarray output " BioPLINK2Wrapper new bfile: plinkFile; out: baseDir , 'Bos1_REF-FINAL_REFA'; referenceAlleles: alleleAFile; extract: baseDir , 'SNPs_BoLA_Bos1-annotdb32.txt'; cow; makeBed; execute. " Apply families of interest filter " BioPLINK2Wrapper new bfile: baseDir , 'Bos1_REF-FINAL_REFA'; out: baseDir , 'Bos1_REF-FINAL-AA-BRAH-BRANG_r1'; keepFams: #('AA' 'BRANG' 'BRAH'); cow; autosome; makeBed; execute. " Apply basic genotyping error rate filter " BioPLINK2Wrapper new bfile: baseDir , 'Bos1_REF-FINAL-AA-BRAH-BRANG_r1'; out: baseDir , 'Bos1_REF-FINAL-AA-BRAH-BRANG_Atsm_GINO-chr23_MHC'; cow; geno: 0.05; makeBed; execute. " Regenerate BED for duplicated BRAH lines " " Split data set into 3 separate BED files by family " #('AA' 'BRANG' 'BRAH') do: [ : famName | BioPLINK2Wrapper new file: baseDir , 'Bos1_REF-FINAL-' , famName , '_Atsm_GINO-chr23_MHC'; out: baseDir , 'Bos1_REF-FINAL-' , famName , '_Atsm_GINO-chr23_MHC'; cow; keepFam: famName; makeBed; execute ]. " Run ShapeIt with Ne parameters " #('AA' -> 100 . 'BRANG' -> 106 . 'BRAH' -> 159) do: [ : assoc | BioShapeIt2WrapperR644 new inputBinarized: baseDir , 'Bos1_REF-FINAL-' , assoc key , '_r1'; outputMax: baseDir , 'Bos1_REF-FINAL-AA_r1'; effectiveSize: assoc value; execute ]. " Generate .GENO file for LAMP-LD " BioLAMPLDGenotypeFormatter new bimFilePath: baseDir , 'Bos1_REF-FINAL-BRANG_Atsm_GINO-chr23_MHC.bim'; pedFilePath: baseDir , 'Bos1_REF-FINAL-BRANG_Atsm_GINO-chr23_MHC.ped'; outputFilename: baseDir , 'Bos1_REF-FINAL-BRANG_Atsm_GINO-chr23_MHC.geno'; export. " Extract positions to build .POS file for LAMP-LD " (baseDir , 'Bos1_REF-FINAL-AA_Atsm_GINO-chr23_MHC.map') asFileReference cut: #(4) to: (baseDir , 'Bos1_REF-FINAL-Atsm_GINO-chr23_MHC.pos') asFileReference writeStream. " Transpose HAPS files " " ... to be included " " Run LAMP-LD " shapeItSamplesDir := 'ShapeItFiles' asFileReference. BioLAMPLDWrapper new unolanc2way; windowSize: 100; nrFounders: 2; positionsFile: (shapeItSamplesDir / 'Bos1_REF-FINAL-Atsm_GINO-chr23_MHC.pos') fullName; positionsFileInMb; addPopFile: (shapeItSamplesDir / 'Bos1_REF-FINAL-AA_Atsm_GINO-chr23_MHC-ShapeIt2-Ne100.haps.ref') fullName atOrder: 1; addPopFile: (shapeItSamplesDir / 'Bos1_REF-FINAL-BRAH_Atsm_GINO-chr23_MHC-ShapeIt2-Ne150.haps.ref') fullName atOrder: 2; genosFile: (shapeItSamplesDir / 'Bos1_REF-FINAL-BRANG_Atsm_GINO-chr23_MHC.geno') fullName; outputFile: baseDir , 'lamp-ld_example1.out'; execute. " Convert positions to mbases " BioLAMPLDWrapper new positionsFile: baseDir , 'Bos1_REF-FINAL-Atsm_GINO-chr23_MHC.pos'; positionsFileInMb. BioLAMPLD2WayVisualizer new lineWidth: 2; posFile: baseDir , 'Bos1_REF-FINAL-Atsm_GINO-chr23_MHC_1000.pos'; population1Name: 'Angus' color: Color red; population2Name: 'Brahman' color: Color blue; readExpanded: baseDir , 'postlampld_ws-100.txt' title: 'Angus versus Brahman'; addGenomicRangeBelowXFrom: 27862 to: 27989 label: 'C I'; addGenomicRangeBelowXFrom: 25350 to: 25592 label: 'C IIa'; addGenomicRangeBelowXFrom: 7011 to: 7534 label: 'C IIb'; addGenomicRangeBelowXFrom: 26973 to: 27576 label: 'C III'; open.
The resulting image can be viewed below:
Depending on your version of ShapeIt, if you have few samples for a population shapeIt v2 won't run with the following error message:
src/modes/phaser/phaser_algorithm.cpp:147: void phaser::phaseSegment(int): Assertion `conditional_index[segment].size() > 5' failed.
Of course you should get more samples, but for continuing testing the workflow you could just duplicate lines and change the ID's of the duplicated samples. This is how to do it:
" First, you split data set into 3 separate PED files by family, but recode the file as textual format (PED) " #('AA' 'BRANG' 'BRAH') do: [ : famName | BioPLINK2Wrapper new bfile: baseDir , 'Bos1_REF-FINAL-AA-BRAH-BRANG_Atsm_GINO-chr23_MHC'; out: baseDir , 'Bos1_REF-FINAL-' , famName , '_Atsm_GINO-chr23_MHC'; cow; keepFam: famName; recodeTab; execute ]. " Create a new temporary file " (baseDir , 'Bos1_REF-FINAL-BRAH_Atsm_GINO-chr23_MHC.ped') asFileReference copyTo: (baseDir , 'Bos1_REF-FINAL-BRAH_Atsm_GINO-chr23_MHC.nwped') asFileReference. " Duplicate BRAH lines with different IDs " (FileStream fileNamed: baseDir , 'Bos1_REF-FINAL-BRAH_Atsm_GINO-chr23_MHC.nwped') ifNotNil: [ : stream | stream setToEnd. (baseDir , 'Bos1_REF-FINAL-BRAH_Atsm_GINO-chr23_MHC.ped') asFileReference readStream contents linesDo: [ : line | | tkl newId | tkl := line findTokens: Character tab. newId := 'A' , tkl second. tkl at: 2 put: newId. stream nextPutAll: (tkl joinUsing: Character tab); nextPutTerminator ] ]. " Rename again to the original file " (baseDir , 'Bos1_REF-FINAL-BRAH_Atsm_GINO-chr23_MHC.nwped') asFileReference renameTo: baseDir , 'Bos1_REF-FINAL-BRAH_Atsm_GINO-chr23_MHC.ped'.