BioSmalltalk workflow for ancestral haplotype analysis published in Animals ~ BioSmalltalk

Introduction

Recently we published in the Animal Journal, an article analyzing Bovine Lymphocyte antigen (BoLA) region of the Brangus cattle. We used LAMP-LD (Local Ancestry Inference in Admixed populations) which is a window-based algorithm combined within a hierarchical Hidden Markov Model to represent haplotypes in a population and allows to estimate enrichment from two parental populations. The ancestral haplotype is the one which appeared first in the parental populations. As we do not know which one it is, we assume it is the most abundant combining both of them, and considering its association to the mixing ratio.

In the script below, we analyzed MHC regions (this is chromosome 23 for Cattle) and generated a histogram plot on annotating the ranges on these regions. The script was adapted for general usage, and each step is commented.

Implementation

For replicating the workflow into your Bioinformatics experiments, you will need the reference alleles of your SNPs. Here, we extracted them from the Affymetrix TXT report, exported from the Axiom Analysis Suite software. The PLINK wrapper is used to filter families of interest and genotyping error rate. Our cattle families are labeled as 'AA' for Angus, 'BRANG' for Brangus and 'BRAH' for Brahman respectively. If you do not have family information in your PED file, then you will have to add it manually or using tools like UNIX command-line utilities (cut, paste, etc). We also calculated the effective population size for each family (which is commonly referred as "Ne" in the literature) and used as parameter of ShapeIt.

Plotting was performed using the ROASSAL visualization engine, for which I should thank to Alexandre Bergel for its tireless help with their impressive API.

| shapeItSamplesDir baseDir reportFile alleleAFile plinkFile |

" Base directory for all input and output files "
baseDir := '.'.
" Axiom Analysis Suite TXT Report "
reportFile := baseDir , 'AAS_report.txt'.
alleleAFile := baseDir , 'allele_A.txt'.
plinkFile := baseDir , 'Bos1_REF-FINAL'.

" Extract Allele_A column from Affymetrix report file to use it as reference alleles "
BioAffyTXTFormatter new
    inputFile: reportFile;
    outputFilename: alleleAFile;
    buildSNP_AlleleA.

" Build the PLINK BED file with reference alleles "
" The SNPs_BoLA_Bos1-annotdb32.txt is just a text file used to extract the SNPs of interest in the MHC region from the Microarray output "
BioPLINK2Wrapper new
    bfile: plinkFile;
    out: baseDir , 'Bos1_REF-FINAL_REFA';
    referenceAlleles: alleleAFile;
    extract: baseDir , 'SNPs_BoLA_Bos1-annotdb32.txt';
    cow;
    makeBed;
    execute.

" Apply families of interest filter "
BioPLINK2Wrapper new
    bfile: baseDir , 'Bos1_REF-FINAL_REFA';
    out: baseDir , 'Bos1_REF-FINAL-AA-BRAH-BRANG_r1';
    keepFams: #('AA' 'BRANG' 'BRAH');  
    cow;
    autosome;
    makeBed;
    execute.

" Apply basic genotyping error rate filter  "
BioPLINK2Wrapper new
    bfile: baseDir , 'Bos1_REF-FINAL-AA-BRAH-BRANG_r1';
    out: baseDir , 'Bos1_REF-FINAL-AA-BRAH-BRANG_Atsm_GINO-chr23_MHC';
    cow;
    geno: 0.05;
    makeBed;
    execute.

" Regenerate BED for duplicated BRAH lines "
" Split data set into 3 separate BED files by family "
#('AA' 'BRANG' 'BRAH') do: [ : famName |
    BioPLINK2Wrapper new
        file: baseDir , 'Bos1_REF-FINAL-' , famName , '_Atsm_GINO-chr23_MHC';
        out: baseDir , 'Bos1_REF-FINAL-' , famName , '_Atsm_GINO-chr23_MHC';
        cow;
        keepFam: famName;
        makeBed;
        execute ].

" Run ShapeIt with Ne parameters "
#('AA' -> 100 . 'BRANG' -> 106 . 'BRAH' -> 159) do: [ : assoc |
    BioShapeIt2WrapperR644 new
        inputBinarized:  baseDir , 'Bos1_REF-FINAL-' , assoc key , '_r1';
        outputMax:  baseDir , 'Bos1_REF-FINAL-AA_r1';
        effectiveSize: assoc value;
        execute ].

" Generate .GENO file for LAMP-LD "
BioLAMPLDGenotypeFormatter new
    bimFilePath: baseDir , 'Bos1_REF-FINAL-BRANG_Atsm_GINO-chr23_MHC.bim';
    pedFilePath: baseDir , 'Bos1_REF-FINAL-BRANG_Atsm_GINO-chr23_MHC.ped';
    outputFilename: baseDir , 'Bos1_REF-FINAL-BRANG_Atsm_GINO-chr23_MHC.geno';
    export.
   
" Extract positions to build .POS file for LAMP-LD "
(baseDir , 'Bos1_REF-FINAL-AA_Atsm_GINO-chr23_MHC.map') asFileReference
    cut: #(4)
    to: (baseDir , 'Bos1_REF-FINAL-Atsm_GINO-chr23_MHC.pos') asFileReference writeStream.

" Transpose HAPS files "
" ... to be included "

" Run LAMP-LD "
shapeItSamplesDir := 'ShapeItFiles' asFileReference.
BioLAMPLDWrapper new
    unolanc2way;
    windowSize: 100;
    nrFounders: 2;
    positionsFile:     (shapeItSamplesDir / 'Bos1_REF-FINAL-Atsm_GINO-chr23_MHC.pos') fullName;
    positionsFileInMb;
    addPopFile:         (shapeItSamplesDir / 'Bos1_REF-FINAL-AA_Atsm_GINO-chr23_MHC-ShapeIt2-Ne100.haps.ref') fullName atOrder: 1;
    addPopFile:         (shapeItSamplesDir / 'Bos1_REF-FINAL-BRAH_Atsm_GINO-chr23_MHC-ShapeIt2-Ne150.haps.ref') fullName atOrder: 2;
    genosFile:          (shapeItSamplesDir / 'Bos1_REF-FINAL-BRANG_Atsm_GINO-chr23_MHC.geno') fullName;
    outputFile:         baseDir , 'lamp-ld_example1.out';
    execute.
   
" Convert positions to mbases "
BioLAMPLDWrapper new
    positionsFile: baseDir , 'Bos1_REF-FINAL-Atsm_GINO-chr23_MHC.pos';
    positionsFileInMb.

BioLAMPLD2WayVisualizer new
 lineWidth: 2;
 posFile: baseDir , 'Bos1_REF-FINAL-Atsm_GINO-chr23_MHC_1000.pos';
 population1Name: 'Angus' color: Color red;
 population2Name: 'Brahman' color: Color blue;
 readExpanded: baseDir , 'postlampld_ws-100.txt' title: 'Angus versus Brahman';
 addGenomicRangeBelowXFrom: 27862 to: 27989 label: 'C I';
 addGenomicRangeBelowXFrom: 25350 to: 25592 label: 'C IIa';
 addGenomicRangeBelowXFrom: 7011 to: 7534 label: 'C IIb';
 addGenomicRangeBelowXFrom: 26973 to: 27576 label: 'C III'; 
 open.

The resulting image can be viewed below:

Depending on your version of ShapeIt, if you have few samples for a population shapeIt v2 won't run with the following error message:

src/modes/phaser/phaser_algorithm.cpp:147: void phaser::phaseSegment(int): Assertion `conditional_index[segment].size() > 5' failed.

Of course you should get more samples, but for continuing testing the workflow you could just duplicate lines and change the ID's of the duplicated samples. This is how to do it:

" First, you split data set into 3 separate PED files by family, but recode the file as textual format (PED) "
#('AA' 'BRANG' 'BRAH') do: [ : famName |
    BioPLINK2Wrapper new
        bfile: baseDir , 'Bos1_REF-FINAL-AA-BRAH-BRANG_Atsm_GINO-chr23_MHC';
        out: baseDir , 'Bos1_REF-FINAL-' , famName , '_Atsm_GINO-chr23_MHC';
        cow;
        keepFam: famName;
        recodeTab;
        execute ].

" Create a new temporary file "
(baseDir , 'Bos1_REF-FINAL-BRAH_Atsm_GINO-chr23_MHC.ped') asFileReference
    copyTo: (baseDir , 'Bos1_REF-FINAL-BRAH_Atsm_GINO-chr23_MHC.nwped') asFileReference.

" Duplicate BRAH lines with different IDs "
(FileStream fileNamed: baseDir , 'Bos1_REF-FINAL-BRAH_Atsm_GINO-chr23_MHC.nwped')
    ifNotNil: [ : stream |
        stream setToEnd.
        (baseDir , 'Bos1_REF-FINAL-BRAH_Atsm_GINO-chr23_MHC.ped') asFileReference readStream contents linesDo: [ : line |
            | tkl newId |
            tkl := line findTokens: Character tab.
            newId := 'A' , tkl second.
            tkl at: 2 put: newId.
            stream
                nextPutAll: (tkl joinUsing: Character tab);
                nextPutTerminator ] ].

" Rename again to the original file "
(baseDir , 'Bos1_REF-FINAL-BRAH_Atsm_GINO-chr23_MHC.nwped') asFileReference renameTo: baseDir , 'Bos1_REF-FINAL-BRAH_Atsm_GINO-chr23_MHC.ped'.

3 comments:

alwardahalrihanMay 7, 2025 at 8:39 AM
Fascinating read! The BioSmalltalk workflow for ancestral haplotype analysis is groundbreaking. Perfect for researchers needing high-performance tools—just like how MacBook Rentals empower tech-driven projects with flexibility. Well done!
laptoprentaluaeMay 30, 2025 at 12:38 AM
This workflow is really fascinating—especially how BioSmalltalk simplifies the ancestral sequence reconstruction process. It’s amazing to see open-source tools contribute so much to bioinformatics. At Techno Edge Systems LLC, we’re always looking for innovative approaches like this that merge computing with complex scientific tasks. Thanks for sharing such a detailed overview!
dubailaptoprentalSeptember 16, 2025 at 9:21 AM
Great post! I really enjoyed reading about the workflow for ancestral sequence reconstruction. Very informative and well-explained.

If you ever need a break from research, I highly recommend checking out Gaming Laptop Rental in Dubai — it’s a smart way to access high-performance machines without the full purchase cost. Whether it’s for computational biology simulations or intense graphics work, having powerful hardware on demand really helps.

Looking forward to more posts like this!

BioSmalltalk

Usage and development comments about the BioSmalltalk project

Menu

Friday, July 28, 2017

BioSmalltalk workflow for ancestral haplotype analysis published in Animals

Introduction

Implementation

3 comments:

About

Blogroll

Flattr this blog

Popular Posts

Categories

Blog Archive