Friday, January 6, 2012

Basic sequence manipulation with BioSmalltalk

My first entry is dedicated to some basic sequence String manipulation, as a way to get familiar with basic Smalltalk objects. The important thing is that you don't need to create files to evaluate the following expressions: you just select one and "print it" with the contextual menu option or the keyboard shortcut. Note that, as everything is an object, you can select the answer and "explore it" again and again.

To obtain the name of an amino acid you just send #asAminoacidName to a String, for example:
'a' asAminoacidName.    " --> 'Alanine' "
'G' asAminoacidName.    " --> 'Glycine' "

We often copy and paste sequences from sources which aren't properly formatted, so to remove all spacing characters from a sequence you can use:
'AGTTAGCGACA ' asCondensedString.
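
If you are curious how such a cleanup could be done with the plain String protocol, here is a minimal sketch. It is not the BioSmalltalk implementation of #asCondensedString, only an approximation under the assumption that "spacing characters" means whatever Character>>isSeparator answers true for. The input sequence is made up for the example.

```smalltalk
"Sketch with core String messages only; not how BioSmalltalk implements it."
| raw condensed |
raw := ' AGT TAG CGA CA '.    "hypothetical sequence with stray spaces"
condensed := raw reject: [ :each | each isSeparator ].
condensed.    " --> 'AGTTAGCGACA' "
```

Because #reject: answers a collection of the same species as the receiver, the result is again a String.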

Both messages are sent to a String and answer a String. Another basic object is Boolean, with its two instances: true and false. For example, to determine if a DNA sequence contains ambiguous letters:
'atcggtcggctta' hasAmbiguousDNABases.        " -> false "
'atcggfcggctta' hasAmbiguousDNABases. " -> true "
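
To make the idea concrete, here is a rough sketch of a similar check written with core collection messages. It assumes, for illustration only, that every letter other than A, C, G or T counts as ambiguous. The real rules behind #hasAmbiguousDNABases may differ, and the block name hasOtherLetters is hypothetical.

```smalltalk
"Sketch only: treats every letter outside ACGT as ambiguous (an assumption)."
| hasOtherLetters |
hasOtherLetters := [ :seq |
	(seq asUppercase reject: [ :base | 'ACGT' includes: base ]) notEmpty ].
hasOtherLetters value: 'atcggtcggctta'.    " --> false "
hasOtherLetters value: 'atcggfcggctta'.    " --> true "
```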

An important part of working in Smalltalk is the Collection objects, Array being one of the most used. An example which answers an Array instance is getting the positions of gaps (i.e. the - characters) in a DNA sequence:
'ATCGAT-CAGTGCA--CAGTCA-TTC' indicesOfAscii: $- asciiValue.     
" --> #(7 15 16 23) "

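The same positions can also be collected with nothing more than an Interval and #select:, which shows how the answer is built up index by index. This is a sketch in plain Smalltalk rather than the way #indicesOfAscii: is implemented.

```smalltalk
"Sketch: keep every index whose character is the gap symbol."
| seq gaps |
seq := 'ATCGAT-CAGTGCA--CAGTCA-TTC'.
gaps := (1 to: seq size) select: [ :index | (seq at: index) = $- ].
gaps.    " --> #(7 15 16 23) "
```

Remember that Smalltalk collections are indexed from 1, so the first gap is reported at position 7.
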
And of course, you can use the impressive number of features of the String hierarchy.

"Get the sequence size:"
'AATGATCGATGCTAGTCGACA' size.  " -> 21 (a SmallInteger) "
"Compare two sequences:"
" -> false (a False) "
"Find the position of the first occurrence of the subsequence passed as parameter (answers 0 if it is not found):"
'AATGATCGATGCTAGTCGACATGCTA' findString: 'TGCTA'.  " -> 10 "

And that's all for now. As I don't like long posts, we will stop here. I will check the feedback, if any, and hopefully I will receive comments about what people would like to use.