Friday, July 28, 2017

BioSmalltalk workflow for ancestral haplotype analysis published in Animals

Introduction

Recently we published in the Animal Journal, an article analyzing Bovine Lymphocyte antigen (BoLA) region of the Brangus cattle. We used LAMP-LD (Local Ancestry Inference in Admixed populations) which is a window-based algorithm combined within a hierarchical Hidden Markov Model to represent haplotypes in a population and allows to estimate enrichment from two parental populations. The ancestral haplotype is the one which appeared first in the parental populations. As we do not know which one it is, we assume it is the most abundant combining both of them, and considering its association to the mixing ratio.

In the script below, we analyzed MHC regions (this is chromosome 23 for Cattle) and generated a histogram plot on annotating the ranges on these regions. The script was adapted for general usage, and each step is commented.

Implementation

For replicating the workflow into your Bioinformatics experiments, you will need the reference alleles of your SNPs. Here, we extracted them from the Affymetrix TXT report, exported from the Axiom Analysis Suite software. The PLINK wrapper is used to filter families of interest and genotyping error rate. Our cattle families are labeled as 'AA' for Angus, 'BRANG' for Brangus and 'BRAH' for Brahman respectively. If you do not have family information in your PED file, then you will have to add it manually or using tools like UNIX command-line utilities (cut, paste, etc). We also calculated the effective population size for each family (which is commonly referred as "Ne" in the literature) and used as parameter of ShapeIt.

Plotting was performed using the ROASSAL visualization engine, for which I should thank to Alexandre Bergel for its tireless help with their impressive API.

| shapeItSamplesDir baseDir reportFile alleleAFile plinkFile |

" Base directory for all input and output files "
baseDir := '.'.
" Axiom Analysis Suite TXT Report "
reportFile := baseDir , 'AAS_report.txt'.
alleleAFile := baseDir , 'allele_A.txt'.
plinkFile := baseDir , 'Bos1_REF-FINAL'.

" Extract Allele_A column from Affymetrix report file to use it as reference alleles "
BioAffyTXTFormatter new
    inputFile: reportFile;
    outputFilename: alleleAFile;
    buildSNP_AlleleA.

" Build the PLINK BED file with reference alleles "
" The SNPs_BoLA_Bos1-annotdb32.txt is just a text file used to extract the SNPs of interest in the MHC region from the Microarray output "
BioPLINK2Wrapper new
    bfile: plinkFile;
    out: baseDir , 'Bos1_REF-FINAL_REFA';
    referenceAlleles: alleleAFile;
    extract: baseDir , 'SNPs_BoLA_Bos1-annotdb32.txt';
    cow;
    makeBed;
    execute.

" Apply families of interest filter "
BioPLINK2Wrapper new
    bfile: baseDir , 'Bos1_REF-FINAL_REFA';
    out: baseDir , 'Bos1_REF-FINAL-AA-BRAH-BRANG_r1';
    keepFams: #('AA' 'BRANG' 'BRAH');  
    cow;
    autosome;
    makeBed;
    execute.

" Apply basic genotyping error rate filter  "
BioPLINK2Wrapper new
    bfile: baseDir , 'Bos1_REF-FINAL-AA-BRAH-BRANG_r1';
    out: baseDir , 'Bos1_REF-FINAL-AA-BRAH-BRANG_Atsm_GINO-chr23_MHC';
    cow;
    geno: 0.05;
    makeBed;
    execute.

" Regenerate BED for duplicated BRAH lines "
" Split data set into 3 separate BED files by family "
#('AA' 'BRANG' 'BRAH') do: [ : famName |
    BioPLINK2Wrapper new
        file: baseDir , 'Bos1_REF-FINAL-' , famName , '_Atsm_GINO-chr23_MHC';
        out: baseDir , 'Bos1_REF-FINAL-' , famName , '_Atsm_GINO-chr23_MHC';
        cow;
        keepFam: famName;
        makeBed;
        execute ].

" Run ShapeIt with Ne parameters "
#('AA' -> 100 . 'BRANG' -> 106 . 'BRAH' -> 159) do: [ : assoc |
    BioShapeIt2WrapperR644 new
        inputBinarized:  baseDir , 'Bos1_REF-FINAL-' , assoc key , '_r1';
        outputMax:  baseDir , 'Bos1_REF-FINAL-AA_r1';
        effectiveSize: assoc value;
        execute ].

" Generate .GENO file for LAMP-LD "
BioLAMPLDGenotypeFormatter new
    bimFilePath: baseDir , 'Bos1_REF-FINAL-BRANG_Atsm_GINO-chr23_MHC.bim';
    pedFilePath: baseDir , 'Bos1_REF-FINAL-BRANG_Atsm_GINO-chr23_MHC.ped';
    outputFilename: baseDir , 'Bos1_REF-FINAL-BRANG_Atsm_GINO-chr23_MHC.geno';
    export.
   
" Extract positions to build .POS file for LAMP-LD "
(baseDir , 'Bos1_REF-FINAL-AA_Atsm_GINO-chr23_MHC.map') asFileReference
    cut: #(4)
    to: (baseDir , 'Bos1_REF-FINAL-Atsm_GINO-chr23_MHC.pos') asFileReference writeStream.

" Transpose HAPS files "
" ... to be included "

" Run LAMP-LD "
shapeItSamplesDir := 'ShapeItFiles' asFileReference.
BioLAMPLDWrapper new
    unolanc2way;
    windowSize: 100;
    nrFounders: 2;
    positionsFile:     (shapeItSamplesDir / 'Bos1_REF-FINAL-Atsm_GINO-chr23_MHC.pos') fullName;
    positionsFileInMb;
    addPopFile:         (shapeItSamplesDir / 'Bos1_REF-FINAL-AA_Atsm_GINO-chr23_MHC-ShapeIt2-Ne100.haps.ref') fullName atOrder: 1;
    addPopFile:         (shapeItSamplesDir / 'Bos1_REF-FINAL-BRAH_Atsm_GINO-chr23_MHC-ShapeIt2-Ne150.haps.ref') fullName atOrder: 2;
    genosFile:          (shapeItSamplesDir / 'Bos1_REF-FINAL-BRANG_Atsm_GINO-chr23_MHC.geno') fullName;
    outputFile:         baseDir , 'lamp-ld_example1.out';
    execute.
   
" Convert positions to mbases "
BioLAMPLDWrapper new
    positionsFile: baseDir , 'Bos1_REF-FINAL-Atsm_GINO-chr23_MHC.pos';
    positionsFileInMb.

BioLAMPLD2WayVisualizer new
 lineWidth: 2;
 posFile: baseDir , 'Bos1_REF-FINAL-Atsm_GINO-chr23_MHC_1000.pos';
 population1Name: 'Angus' color: Color red;
 population2Name: 'Brahman' color: Color blue;
 readExpanded: baseDir , 'postlampld_ws-100.txt' title: 'Angus versus Brahman';
 addGenomicRangeBelowXFrom: 27862 to: 27989 label: 'C I';
 addGenomicRangeBelowXFrom: 25350 to: 25592 label: 'C IIa';
 addGenomicRangeBelowXFrom: 7011 to: 7534 label: 'C IIb';
 addGenomicRangeBelowXFrom: 26973 to: 27576 label: 'C III'; 
 open.

The resulting image can be viewed below:

Depending on your version of ShapeIt, if you have few samples for a population shapeIt v2 won't run with the following error message:

src/modes/phaser/phaser_algorithm.cpp:147: void phaser::phaseSegment(int): Assertion `conditional_index[segment].size() > 5' failed.

Of course you should get more samples, but for continuing testing the workflow you could just duplicate lines and change the ID's of the duplicated samples. This is how to do it:

" First, you split data set into 3 separate PED files by family, but recode the file as textual format (PED) "
#('AA' 'BRANG' 'BRAH') do: [ : famName |
    BioPLINK2Wrapper new
        bfile: baseDir , 'Bos1_REF-FINAL-AA-BRAH-BRANG_Atsm_GINO-chr23_MHC';
        out: baseDir , 'Bos1_REF-FINAL-' , famName , '_Atsm_GINO-chr23_MHC';
        cow;
        keepFam: famName;
        recodeTab;
        execute ].

" Create a new temporary file "
(baseDir , 'Bos1_REF-FINAL-BRAH_Atsm_GINO-chr23_MHC.ped') asFileReference
    copyTo: (baseDir , 'Bos1_REF-FINAL-BRAH_Atsm_GINO-chr23_MHC.nwped') asFileReference.

" Duplicate BRAH lines with different IDs "
(FileStream fileNamed: baseDir , 'Bos1_REF-FINAL-BRAH_Atsm_GINO-chr23_MHC.nwped')
    ifNotNil: [ : stream |
        stream setToEnd.
        (baseDir , 'Bos1_REF-FINAL-BRAH_Atsm_GINO-chr23_MHC.ped') asFileReference readStream contents linesDo: [ : line |
            | tkl newId |
            tkl := line findTokens: Character tab.
            newId := 'A' , tkl second.
            tkl at: 2 put: newId.
            stream
                nextPutAll: (tkl joinUsing: Character tab);
                nextPutTerminator ] ].

" Rename again to the original file "
(baseDir , 'Bos1_REF-FINAL-BRAH_Atsm_GINO-chr23_MHC.nwped') asFileReference renameTo: baseDir , 'Bos1_REF-FINAL-BRAH_Atsm_GINO-chr23_MHC.ped'.

Monday, June 19, 2017

Convert rehh output to UCSC Bed file

The rehh (Relative Extended Haplotype Homozygosity) R package provides several scores for detecting recent natural positive selection taking as input SNP data. Resulting regions with high p-values can be considered as candidates for selective sweeps.

The following BioSmalltalk script reads rehh output files from ihs and rsb functions (ihh2ihs() or ies2rsb()), and for each chromosome, collects peaks given by p-values to generate an UCSC BED file (Browser Extensible Data). BED files can be used to view the genome annotations in significant regions using Genome Browsers, such as the Animal Genome Browser, through the "add custom track" option.

The collector object first filter valid (not null) -log10(p-value)'s which are greater than ylim parameter, this is done to find significative p-values. Then scans resulting collection to detect a peak range, checking if difference between two adjacent SNP's is greater than the nbases parameter. Finally the output is exported to BED file format (#dumpAsUCSCBedFile, #dumpAsEnsembleBedFile or #dumpAsEnsembleBedFileBothStrands), ensuring each feature is written in correct order

| minChr maxChr cwd nbases |
minChr := 1.
maxChr := 29.
cwd := 'c:\...'.
nbases := 1000000.

minChr to: maxChr do: [ : chrNumber | 
 | inputFilename outputFilename |
 inputFilename := 'chr_' , chrNumber asString , '.rsb.txt'.
 outputFilename := 'peaks_chr' , chrNumber asString , '.bed'.
 BioRehhLogPValueCollector new
  cwd: cwd;
  ylim: 2.0;
  nbases: nbases;
  chrNumber: chrNumber;
  inputFile: inputFilename;
  outputFilename: outputFilename;
  collectPeaks;
  dumpAsEnsembleBedFileBothStrands ]