Monday, June 19, 2017

Convert rehh output to UCSC Bed file

The rehh (Relative Extended Haplotype Homozygosity) R package provides several scores for detecting recent natural positive selection taking as input SNP data. Resulting regions with high p-values can be considered as candidates for selective sweeps.

The following BioSmalltalk script reads rehh output files from ihs and rsb functions (ihh2ihs() or ies2rsb()), and for each chromosome, collects peaks given by p-values to generate an UCSC BED file (Browser Extensible Data). BED files can be used to view the genome annotations in significant regions using Genome Browsers, such as the Animal Genome Browser, through the "add custom track" option.

The collector object first filter valid (not null) -log10(p-value)'s which are greater than ylim parameter, this is done to find significative p-values. Then scans resulting collection to detect a peak range, checking if difference between two adjacent SNP's is greater than the nbases parameter. Finally the output is exported to BED file format (#dumpAsUCSCBedFile, #dumpAsEnsembleBedFile or #dumpAsEnsembleBedFileBothStrands), ensuring each feature is written in correct order

| minChr maxChr cwd nbases |
minChr := 1.
maxChr := 29.
cwd := 'c:\...'.
nbases := 1000000.

minChr to: maxChr do: [ : chrNumber | 
 | inputFilename outputFilename |
 inputFilename := 'chr_' , chrNumber asString , '.rsb.txt'.
 outputFilename := 'peaks_chr' , chrNumber asString , '.bed'.
 BioRehhLogPValueCollector new
  cwd: cwd;
  ylim: 2.0;
  nbases: nbases;
  chrNumber: chrNumber;
  inputFile: inputFilename;
  outputFilename: outputFilename;
  dumpAsEnsembleBedFileBothStrands ]