[Biophp-dev] Fasta filetype parser updated

S Clark biophp-dev@bioinformatics.org
Tue, 6 May 2003 22:25:07 -0600


On Tuesday 06 May 2003 09:04 pm, nicos@itsa.ucsf.edu wrote:
> Shall we wait until we have a few more and then see how much redundancy
> there is?  I somehow dislike creating yet another object, chain will get
> so long, and harder to understand...

Yeah, I think we should wait until we've got a better idea of what's redundant
before designing the "base" class.  When we do, it shouldn't cause a problem with
unnecessary extra classes - there'll just be a technically-optional base 
class that one can extend, and just add or override the parts that need
changing for the specific parser.  

> I thought that clustalx is simply the X-windows interface to clustalw.
> Fileformats should be identical.

Sorta - ClustalX has both MS Windows and XWindows versions.  The data file
does specify the version on its first line:

"CLUSTAL X (1.81)"

I'm PRETTY sure they're the same anyway, but have never confirmed that.
Even if there are differences, I suspect the way I have the parser dealing
with them will handle it (i.e. I think clustalW may have been limited to
10(?)-character labels for the sequences, whereas the newer ClustalX allows
for more, but the parser doesn't care, it just splits the line at the space
between the label and the section of sequence wherever it may be.)
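A minimal sketch of that kind of split, assuming each alignment line is a label, a run of whitespace, then a chunk of sequence. The function name splitAlignmentLine is hypothetical and this is not the actual parser code; it also assumes the caller has already skipped the "CLUSTAL ..." header and the consensus lines, and it targets current PHP (7.1+):

```php
<?php
// Hypothetical helper, not the actual BioPHP parser code.
// Splits one alignment line into [label, sequence chunk] at the first
// run of whitespace, so the label may be any length.
function splitAlignmentLine(string $line): ?array
{
    $parts = preg_split('/\s+/', trim($line));
    if ($parts === false || count($parts) < 2) {
        return null; // blank line, or no sequence section
    }
    // Anything after the second field (e.g. a trailing residue count) is ignored.
    return [$parts[0], $parts[1]];
}

// A label longer than 10 characters splits the same way as a short one.
print_r(splitAlignmentLine('seq_with_long_label_01   ACGT--ACGT'));
```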

> I added seqlength to both the clustal and fasta parsers.  Also, it is (a
> little bit) better to quote with single quotes if you do not need
> variable interpolation, that way php does not have to invoke its parser.
> The clustal parser output has a whole lot of dashes in the sequences.
> Can you have a look at those? (reg expressions are not my strong point as
> you might have noticed)

Ah, thanks for reminding me - I've been just using "" so long out of laziness
that I'd forgotten about the difference.
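For reference, a small illustration of the difference (the variable names are made up for the example):

```php
<?php
$label = 'seq1';

// Double quotes: PHP scans the string and expands $label.
$interpolated = "Label: $label";

// Single quotes: the text is taken literally, with no variable expansion.
$literal = 'Label: $label';

echo $interpolated, "\n";
echo $literal, "\n";
```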

I was actually going to suggest that length instead be implemented inside
the seq object - (strlen($this->sequence)) rather than having the parser
specify - seems odd to be able to "lie" about the sequence length.  That is, 
as it currently stands, seqlength really doesn't have any "real" relation to
the sequence...other than the fact that the parsers CLAIM the seqlength
is a particular number.  (But if one clips or appends to the sequence later, 
the "seqlength" stays the same...)

Not really a big deal - If we really want to keep seqlength as a "specified" 
rather than 'figured' characteristic, it's a one-liner to make the clustal
parser figure the length of the actual sequence (minus gaps, that is).
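A rough sketch of the "figured" approach, assuming the seq object keeps its residues in a $sequence string and that '-' is the only gap character. The class and method names here are hypothetical stand-ins, not the real seq object:

```php
<?php
// Hypothetical stand-in for the seq object; names are illustrative only.
class SeqSketch
{
    public $sequence = '';

    public function __construct(string $sequence)
    {
        $this->sequence = $sequence;
    }

    // Length is always derived from the current sequence string,
    // so it cannot drift out of sync after clipping or appending.
    public function getLength(): int
    {
        return strlen($this->sequence);
    }

    // Length of the actual sequence, not counting '-' gap markers.
    public function getUngappedLength(): int
    {
        return strlen($this->sequence) - substr_count($this->sequence, '-');
    }
}

$seq = new SeqSketch('ACG--T');
$seq->sequence .= 'AA'; // appending later changes the reported length
echo $seq->getLength(), ' ', $seq->getUngappedLength(), "\n";
```

The second method is the "one-liner" variant a parser could use if seqlength stays a specified value.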

I left the dashes in intentionally - those are the "gap" markers in 
the sequences.  I figure that'll come in handy when loading up a seqalign
full of seq objects. That way if you load up a bunch of sequences from
an alignment, you keep the alignment information.

In my version of the nuc_sequence object, I actually just implemented a
"removeGaps()" method to get rid of them if desired.
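Something along these lines would do it. This is only a sketch: the class below is a hypothetical stand-in (the real nuc_sequence object is not shown in this thread), and it assumes '-' is the only gap marker:

```php
<?php
// Hypothetical sketch, not the actual nuc_sequence implementation.
class NucSequenceSketch
{
    public $sequence;

    public function __construct(string $sequence)
    {
        $this->sequence = $sequence;
    }

    // Strips '-' gap markers from the stored sequence and returns the result.
    // The alignment information is lost once this is called.
    public function removeGaps(): string
    {
        $this->sequence = str_replace('-', '', $this->sequence);
        return $this->sequence;
    }
}

$nuc = new NucSequenceSketch('AC--GT-A');
echo $nuc->removeGaps(), "\n";
```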

Regular expressions are a bit cryptic to learn at first, but I've been finding
that they are ridiculously useful when dealing with text...("Perl-Compatible
Regular Expressions" are one of the things that make Perl so good at dealing
with text...)

> Cool

Thanks!  I may have time to get back to the NCBI-BLAST query module shortly.
Going to have to figure out where that and the EUtils modules fit in
(the EFetch parsers for nucleotide and protein sequences can fit with the
rest of the sequence file parsers, but the rest don't...)