[Biococoa-dev] BCSequence

Wed Aug 25 17:25:43 EDT 2004

On Aug 24, 2004, at 9:47 PM, John Timmer wrote:

> That's a great idea.  The main reason I was putting in a new method 
> was that
> I added a separate weight for the ambiguous bases that's the average of
> every base they represent, since that would be more accurate than 
> returning
> 0 for estimating the weight.  I also wanted to avoid stomping on the 
> figures
> you put in, since I don't know enough about how they're typically 
> used.  I
> just wanted to use something different that didn't interfere with what 
> you
> were likely to do.
>

There is an alternative solution, which is what BioPerl does:

         # Obtain the molecular weight of a sequence. Since the sequence 
may contain
         # ambiguous monomers, the molecular weight is returned as a 
(reference to) a
         # two element array containing greatest lower bound (GLB) and 
least upper bound
         # (LUB) of the molecular weight
         $weight = $seq_stats->get_mol_wt();
         print "\nMolecular weight (using statistics object) of sequence 
", $seqobj->id(),
              " is between ", $$weight[0], " and " ,
              $$weight[1], "\n";

         #  or
         $weight = Bio::Tools::SeqStats->get_mol_wt($seqobj);
         print "\nMolecular weight (without statistics object) of 
sequence ", $seqobj->id(),
              " is between ", $$weight[0], " and " ,
              $$weight[1], "\n";

I have looked at their code, and it doesn't seem to be that difficult 
to do. What I propose is the following. The bases have 4 (extra) 
variables:

lowMonoisotopicMass
highMonoisotopicMass
lowAverageMass
highAverageMass

which are calculated during the initialization of the singletons.

Then when we calculate the mass of a sequence, we actually calculate 
two masses, one high and one low. For ACGT these values are of course 
the same. Because we use the NSCountedSet we don't have to iterate 
through all symbols, just through the set and multiply by their 
occurance number. Although it will take more time to calculate, 
especially for DNA segments with a large number of nucleotides, I don't 
think this is critical, because we use the counted set.

I will write this code and add it all in a new class MassCalculator 
which will end up in BCUtils. I will also work on a 
IsoelectricPointCalculator and ProteinDigest class.

- Koen.