ViewVC Help
View File | Revision Log | Show Annotations | Root Listing
root/yamap/pfam_scan.pl
Revision: 1.1.1.1 (vendor branch)
Committed: Thu Sep 7 15:35:21 2006 UTC (9 years, 10 months ago) by knirirr
Branch: MAIN, cehox
CVS Tags: start, HEAD
Changes since 1.1: +0 -0 lines
Log Message:
Imported sources

Line File contents
1 #!/usr/bin/perl -w
2
3 # Copyright (c) 2002-2005 Genome Research Ltd
4 # Distributed under the same terms as the Pfam database
5 # See ftp://ftp.sanger.ac.uk/pub/databases/Pfam/COPYRIGHT
6 # for more details.
7 #
8 # Written by Sam Griffiths-Jones borrowing heavily from some
9 # old code by Ewan Birney. Contact pfam@sanger.ac.uk for
10 # help.
11 #
12 # version 0.5 (16/3/2005)
13
14 =head1 NAME
15
16 pfam_scan.pl - search protein fasta sequences against the Pfam
17 library of HMMs.
18
19 =head1 VERSION
20
21 This is version 0.5 of pfam_scan.pl. See the history section for
22 recent changes.
23
24 Behaviour of recent versions is a significantly different from 0.1.
25 From version 0.5, overlapping matches to families within the same clan
26 are removed, keeping the best scoring hit. This behaviour can be
27 overridden with the --overlap option. From version 0.2, we can use
28 BLAST to preprocess the input sequences with the --fast option, so we
29 only have to search a subset of sequences against a subset of HMMs
30 using hmmpfam. For puritanical reasons we don't do this by default!
31 Read the notes about this below.
32
33 This version has been tested with Perl 5.6.1, Pfam 10.0 (through
34 13.0), Bioperl 1.2 and HMMER 2.3.1. It should work with any versions
35 higher than these.
36
37 =head1 REQUIREMENTS
38
39 - this script
40 - Perl 5.6 or higher (and maybe lower)
41 - The Pfam database (downloadable from
42 ftp://ftp.sanger.ac.uk/pub/databases/Pfam/)
43 - HMMER software (from http://hmmer.wustl.edu/)
44 - NCBI BLAST binaries (from http://www.ncbi.nlm.nih.gov/Ftp/)
45 - Bioperl (from http://bio.perl.org/)
46
47 The Bioperl modules directory must be in your perl library path, and
48 the HMMER and BLAST binaries must be in your executable path.
49
50 You also need to be able to read and write to /tmp on your machine.
51
52 Some of these requirements are easily circumvented, but this script
53 should at least give you a start.
54
55 =head1 HOW TO INSTALL PFAM LOCALLY
56
57 1. Get the Pfam database from
58 ftp://ftp.sanger.ac.uk/pub/databases/Pfam/. In particular you need
59 the files Pfam-A.fasta, Pfam_ls, Pfam_fs, and Pfam-A.seed.
60
61 2. Unzip them if necessary
62 $ gunzip Pfam*.gz
63
64 3. Grab and install HMMER, NCBI BLAST and Bioperl, and make sure your
65 paths etc are set up properly.
66
67 4. Index Pfam-A.fasta for BLAST searches
68 $ formatdb -i Pfam-A.fasta -p T
69
70 5. Index the Pfam_ls and Pfam_fs libraries for HMM fetching
71 $ hmmindex Pfam_ls
72 $ hmmindex Pfam_fs
73
74
75 =head1 SEARCHING PFAM
76
77 This script is really just a wrapper around hmmpfam.
78
79 Run pfam_scan.pl -h to get a list of options. Probably the only thing
80 to worry about is supplying the -d option with the location of your
81 downloaded Pfam database. Or you can set the PFAMDB environment
82 variable to point to the right place and things should work without
83 -d. And you should decide whether or not to use --fast.
84
85 A few things to note:
86
87 --fast uses BLAST as a preprocessor to reduce the amount of compute we
88 have to do with hmmpfam. This is known to reduce sensitivity in the
89 case of a very small number of families (those whose length is
90 exceptionally short, like the XYPPX repeat). If you're annotating
91 genomes then you *probably* don't care too much about these families.
92 Omiting this option may give you a small added sensitivity, but with a
93 rough 10 fold time cost. If you want to exactly replicate the Pfam
94 web site results or distributed data, you probably shouldn't use this.
95
96 Overlapping above-threshold hits to families within the same clan are
97 removed -- only the best scoring hit is kept. You can override this
98 behaviour with the --overlap option.
99
100 Pfam provides two sets of models, called ls and fs models, for whole
101 domain and fragment searches. This wrapper basically returns all hits
102 to the ls models, and then adds to these all non-overlapping hits to
103 the fragment models. This mimics the behaviour of Pfam web site
104 searches. You can choose to search only one set of models with the
105 --mode option.
106
107 Unless you want to grub around in the noise you should probably use
108 the default thresholds - these are hand curated for every family by
109 the Pfam team, such that we believe false positives will not score
110 above these levels. The consequence is that some families may miss
111 members.
112
113 You may want to adjust the threshold used for the preprocessing BLAST
114 search (default evalue 10). Raising this to 50 will slow everything
115 down a bit but may gain you a little sensitivity. Lowering the evalue
116 cutoff will speed things up but with an inevitable sensitivity cost.
117
118 It is important that each sequence in the fasta file has a unique
119 identifier. Note that the fasta header format should be:
120
121 >identifier <optional description>
122
123 so the identifier should not contain whitespace.
124
125 The format of the output is:
126
127 <seq id> <seq start> <seq end> <hmm acc> <hmm start> <hmm end> <bit score> <evalue> <hmm name>
128
129 hmmpfam returns scores for sequence and domain matches seperately.
130 For simplicity, the single line for each domain format returned here
131 reports domain scores.
132
133 =head1 BUGS
134
135 Many options are not rigorously tested. Error messages are
136 uninformative. The documentation is inadequate. You may find it
137 useful. You may not.
138
139 =head1 HISTORY
140
141 Version Main changes
142 ------- ------------
143
144 0.5 Removes overlapping above-threshold hits to families
145 within the same clan. --overlap overrides.
146
147 0.4 Work-around for hmmpfam bug/feature that reports hits
148 above domain threshold even if the sequence doesn't
149 score above the sequence threshold.
150
151 0.3 Fix minor bugs to be compatable with HMM versions in
152 Pfam 13.
153
154 0.2 --fast option to use BLAST preprocessing for significant
155 speed-up.
156
157 0.1 First effort, simply wraps up hmmpfam without doing
158 anything clever.
159
160 =head1 CONTACT
161
162 This script is copyright (c) Genome Research Ltd 2002-2005. Please
163 contact pfam@sanger.ac.uk for help.
164
165 =cut
166
167 use strict;
168 use Getopt::Long;
169 use IO::File;
170 use Bio::SeqIO;
171 use Bio::SearchIO;
172
173
174 sub show_help {
175 print STDERR <<EOF;
176
177 $0: search a fasta file against Pfam
178
179 Usage: $0 <options> fasta_file
180 Options
181 -h : show this help
182 -d <dir> : directory location of Pfam flatfiles
183 -o <file> : output file, otherwise send to STDOUT
184 --fast : use BLAST preprocessing
185
186 Expert options
187 --overlap : show overlapping hits within clan member families
188 --mode <ls|fs> : search only ls or fs models (default both)
189 -e <n> : specify hmmpfam evalue cutoff, else use Pfam cutoffs
190 -t <n> : specify hmmpfam bits cutoff, else use Pfam cutoffs
191 -be <n> : specify blast evalue cutoff (default 1)
192
193 Output format is:
194 <seq id> <seq start> <seq end> <hmm acc> <hmm start> <hmm end> <bit score> <evalue> <hmm name>
195
196 EOF
197 }
198
199 my( $help,
200 $ecut,
201 $bcut,
202 $outfile,
203 $pfamdir,
204 $mode,
205 $fast,
206 $noclean,
207 $verbose,
208 $overlap,
209 );
210
211 # defaults
212 my $blastecut = 1;
213
214 if( $ENV{ 'PFAMDB' } ) {
215 $pfamdir = $ENV{ 'PFAMDB' };
216 }
217 else {
218 $pfamdir = '.';
219 }
220
221 &GetOptions( "mode=s" => \$mode,
222 "fast" => \$fast,
223 "o=s" => \$outfile,
224 "e=s" => \$ecut,
225 "t=s" => \$bcut,
226 "h" => \$help,
227 "be=s" => \$blastecut,
228 "d=s" => \$pfamdir,
229 "noclean" => \$noclean,
230 "v" => \$verbose,
231 "overlap" => \$overlap,
232 );
233
234 my $fafile = shift;
235
236 if( $help ) {
237 &show_help();
238 exit(1);
239 }
240 if( !-s $fafile ) {
241 &show_help();
242 print STDERR "FATAL: can't find fasta file [$fafile]\n\n";
243 exit(1);
244 }
245 if( !-d $pfamdir ) {
246 &show_help();
247 print STDERR "FATAL: database location [$pfamdir] is not a directory\n\n";
248 exit(1);
249 }
250 if( !-s "$pfamdir/Pfam_ls" and ( $mode ne "fs" ) ) {
251 &show_help();
252 print STDERR "FATAL: database location [$pfamdir] does not contain Pfam_ls\n\n";
253 exit(1);
254 }
255 if( !-s "$pfamdir/Pfam_fs" and ( $mode ne "ls" ) ) {
256 &show_help();
257 print STDERR "FATAL: database location [$pfamdir] does not contain Pfam_fs\n\n";
258 exit(1);
259 }
260 if( $fast ) {
261 if( !-s "$pfamdir/Pfam-A.fasta" ) {
262 &show_help();
263 print STDERR "FATAL: database location [$pfamdir] does not contain Pfam-A.fasta.\nYou cannot use --fast without this file.\n\n";
264 exit(1);
265 }
266 if( !-s "$pfamdir/Pfam-A.fasta.phr" ) {
267 &show_help();
268 print STDERR "FATAL: you need to index Pfam-A.fasta using formatdb to use --fast\n\n";
269 exit(1);
270 }
271 if( !-s "$pfamdir/Pfam_ls.ssi" or !-s "$pfamdir/Pfam_fs.ssi" ) {
272 &show_help();
273 print STDERR "FATAL: you need to index Pfam_ls and Pfam_fs using hmmindex to use --fast\n\n";
274 exit(1);
275 }
276 }
277
278 if( !$overlap and !-s "$pfamdir/Pfam-C" ) {
279 print STDERR "WARNING: database directory [$pfamdir] does not contain Pfam-C.\nYou may get overlapping hits to families in the same clan\nwithout this file.\n\n";
280 }
281
282
283
284 my $maxidlength = 1;
285 open( FA, $fafile ) or die "FATAL: can't open fasta file [$fafile]\n";
286 while(<FA>) {
287 if( /^\>(\S+)/ ) {
288 $maxidlength = length( $1 ) if( $maxidlength < length( $1 ) );
289 }
290 }
291 close FA;
292
293 my $options = "";
294 if( $ecut ) {
295 $options .= "-E $ecut";
296 }
297 elsif( $bcut ) {
298 $options .= "-T $bcut";
299 }
300 else {
301 $options .= "--cut_ga";
302 }
303
304 # map Pfam accessions to ids
305 # expensive, but we have to read Pfam-A.seed or Pfam-A.fasta
306 my %accmap;
307 if( -s "$pfamdir/Pfam-A.seed" ) {
308 my $id;
309 open( SEED, "$pfamdir/Pfam-A.seed" ) or die "FATAL: can't open Pfam-A.seed file\n";
310 while(<SEED>) {
311 if( /^\#=GF ID\s+(\S+)/ ) {
312 $id = $1;
313 }
314 if( /^\#=GF AC\s+(PF\d+\.?\d*)/ ) {
315 $accmap{$id} = $1;
316 }
317 }
318 }
319 elsif( -s "$pfamdir/Pfam-A.fasta" ) {
320 open( FASTA, "$pfamdir/Pfam-A.fasta" ) or die;
321 while(<FASTA>) {
322 if( /^\>\S+\s+(PF\d+\.?\d*)\;(\S+)\;/ ) {
323 $accmap{$2} = $1;
324 }
325 }
326 close FASTA;
327 }
328 else {
329 die "FATAL: can't get Pfam id/accession mapping. I need either\nPfam-A.seed or Pfam-A.fasta.\n";
330 }
331 $options .= " -Z ".scalar( keys %accmap );
332
333
334 my( %clanmap, $clanacc );
335 # read the clan mappings in
336 if( !$overlap ) {
337 open( CLAN, "$pfamdir/Pfam-C" ) or die;
338 while(<CLAN>) {
339 if( /^AC\s+(\S+)/ ) {
340 $clanacc = $1;
341 }
342 if( /^MB\s+(\S+)\;/ ) {
343 $clanmap{$1} = $clanacc;
344 }
345 }
346 close CLAN;
347 }
348
349
350 # The organisation here is not intuitive. This is the first way I
351 # thought of reusing code nicely for both the full hmmpfam search
352 # and the mini database searches created from blast results
353
354 my( $hmm_seq_map, @hmmlist, %seq );
355
356 if( $fast ) {
357 # read sequences in
358 my $seqin = Bio::SeqIO -> new( '-file' => $fafile,
359 '-format' => 'Fasta' );
360 while( my $seq = $seqin->next_seq() ) {
361 if( exists $seq{ $seq->id } ) {
362 die "FATAL: We're going to struggle here - you have two sequences\nwith the same name in your fasta file\n";
363 }
364 $seq{ $seq->id } = $seq;
365 }
366
367 # then run blast searches
368 my $blastdb = "$pfamdir/Pfam-A.fasta";
369 my $blastcmd = "blastall -p blastp -i $fafile -d $blastdb -e $blastecut -F F -b 100000 -v 100000";
370 print STDERR "running [$blastcmd > /tmp/$$.blast] ....\n" if( $verbose );
371 system "$blastcmd > /tmp/$$.blast" and die "FATAL: failed to run blastall - is it in your path?\n";
372 print STDERR "parsing blast output ....\n" if( $verbose );
373 $hmm_seq_map = parse_blast( "/tmp/$$.blast" );
374 unless( $noclean ) {
375 unlink( "/tmp/$$.blast" ) or die "FATAL: failed to remove temporary files - can you write to /tmp?\n";
376 }
377 @hmmlist = keys %{ $hmm_seq_map };
378 }
379
380 unless( $fast ) {
381 @hmmlist = (1); # just so we get through the following while loop once
382 }
383
384 my $allresults = HMMResults->new();
385
386 while( my $hmmacc = shift @hmmlist ) {
387 my $seqfile;
388 if( $fast ) {
389 $seqfile = "/tmp/$$.fa";
390
391 open( SOUT, ">$seqfile" ) or die "FATAL: failed to create temporary files - can you write to /tmp?\n";
392 my $seqout = Bio::SeqIO -> new( '-fh' => \*SOUT,
393 '-format' => 'Fasta' );
394
395 foreach my $id ( keys %{ $hmm_seq_map->{$hmmacc} } ) {
396 if( not exists $seq{ $id } ) {
397 warn "can't find [$id] in your sequence file\n";
398 }
399 else {
400 $seqout -> write_seq( $seq{ $id } );
401 print STDERR "searching [$id] against [$hmmacc]\n" if( $verbose );
402 }
403 }
404 close SOUT;
405 }
406 else {
407 $seqfile = $fafile;
408 }
409
410 foreach my $m ( "ls", "fs" ) {
411 my $hmmfile;
412 if( $mode ) {
413 next unless( $m eq $mode );
414 }
415 if( $fast ) {
416 $hmmfile = "/tmp/$$.hmm_$m";
417 system "hmmfetch $pfamdir/Pfam_$m $hmmacc > $hmmfile" and die "FATAL: failed to fetch [$hmmacc] from $pfamdir/Pfam_$m\n";
418 }
419 else {
420 $hmmfile = "$pfamdir/Pfam_$m";
421 }
422
423 my $fh = new IO::File;
424 my $results = new HMMResults;
425
426 $fh -> open( "hmmpfam --compat -A 0 $options $hmmfile $seqfile |" );
427 $results -> parse_hmmpfam( $fh );
428 $fh -> close;
429 $allresults -> merge_results( $results );
430 if( -e "/tmp/$$.hmm_$m" and not $noclean ) {
431 unlink "/tmp/$$.hmm_$m" or die "FATAL: failed to remove temporary files - can you write to /tmp?\n";
432 }
433 }
434 if( -e "/tmp/$$.fa" and not $noclean ) {
435 unlink "/tmp/$$.fa" or die "FATAL: failed to remove temporary files - can you write to /tmp?\n";
436 }
437 }
438
439 if( !$overlap ) {
440 $allresults = $allresults->remove_overlaps( \%clanmap );
441 }
442
443
444 if( $outfile ) {
445 open( O, ">$outfile" ) or die "FATAL: can't write to your output file [$outfile]\n";
446 $allresults -> write_ascii_out( \*O );
447 close O;
448 }
449 else {
450 $allresults -> write_ascii_out( \*STDOUT );
451 }
452
453 exit(0);
454
455
456
457 ###############
458 # SUBROUTINES #
459 ###############
460
461 sub parse_blast {
462 my $file = shift;
463 my %hmm_seq_map;
464 my %already;
465
466 my $searchin = Bio::SearchIO -> new( '-file' => $file,
467 '-format' => 'Blast' );
468
469 while( my $result = $searchin->next_result() ) {
470 while( my $hit = $result -> next_hit() ) {
471 my( $acc, $id ) = $hit->description =~ /(PF\d+\.?\d*)\;(\S+)\;/;
472 # print STDERR "seq [",$result->query_name(),"] hmm [$acc]\n" if( $verbose );
473 $hmm_seq_map{$acc} -> { $result->query_name() } = 1;
474 }
475 }
476
477 return \%hmm_seq_map;
478 }
479
480
481 ############
482 # PACKAGES #
483 ############
484
485 package HMMResults;
486
487 sub new {
488 my $ref = shift;
489 my $class = ref($ref) || $ref;
490 my $self = {
491 'domain' => [],
492 'seq' => {},
493 'hmmname' => undef };
494 bless $self, $class;
495 return $self;
496 }
497
498 sub hmmname {
499 my $self = shift;
500 my $value = shift;
501 $self->{'hmmname'} = $value if( defined $value );
502 return $self->{'hmmname'};
503 }
504
505 sub addHMMUnit {
506 my $self = shift;
507 my $unit = shift;
508 my $name = $unit->seqname();
509
510 if( !exists $self->{'seq'}->{$name} ) {
511 warn "Adding a domain of $name but with no HMMSequence. Will be kept in domain array but not added to a HMMSequence";
512 } else {
513 $self->{'seq'}->{$name}->addHMMUnit($unit);
514 }
515 push( @{$self->{'domain'}}, $unit );
516 }
517
518 sub eachHMMUnit {
519 my $self = shift;
520 return @{$self->{'domain'}};
521 }
522
523 sub addHMMSequence {
524 my $self = shift;
525 my $seq = shift;
526 my $name = $seq->name();
527 if( exists $self->{'seq'}->{$name} ) {
528 warn "You already have $name in HMMResults. Replacing by a new entry!";
529 }
530 $self->{'seq'}->{$name} = $seq;
531 }
532
533 sub eachHMMSequence {
534 my $self = shift;
535 my (@array,$name);
536
537 foreach $name ( keys %{$self->{'seq'}} ) {
538 push( @array, $self->{'seq'}->{$name} );
539 }
540
541 return @array;
542 }
543
544 sub getHMMSequence {
545 my $self = shift;
546 my $name = shift;
547 return $self->{'seq'}->{$name};
548 }
549
550 sub parse_hmmpfam {
551 my $self = shift;
552 my $file = shift;
553
554 QUERY: {
555 my( %hash, $seqname );
556 while(<$file>) {
557 /^Scores for sequence/ && last;
558 if( /^Query:\s+(\S+)/ ) {
559 $seqname = $1;
560 my $seq = HMMSequence->new();
561 $seq->name($seqname);
562 $self->addHMMSequence($seq);
563 }
564 }
565
566 while(<$file>) {
567 /^Parsed for domains/ && last;
568 if( my( $id, $sc, $ev, $nd ) = /^\s*(\S+).+?\s(\S+)\s+(\S+)\s+(\d+)\s*$/ ) {
569 $hash{$id} = $sc;
570 }
571 }
572
573 while(<$file>) {
574 /^\/\// && redo QUERY;
575
576 if( my( $id, $sqfrom, $sqto, $hmmf, $hmmt, $sc, $ev ) =
577 /(\S+)\s+\S+\s+(\d+)\s+(\d+).+?(\d+)\s+(\d+)\s+\S+\s+(\S+)\s+(\S+)\s*$/ ) {
578
579 if( !exists($hash{$id}) ) {
580 # hmmpfam seems to return hits above the domain score that aren't
581 # above the sequence score. We don't want these, so skip if we
582 # haven't seen a sequence score.
583 next;
584 # warn("HMMResults parsing error in hmmpfam for $id - can't find sequence score");
585 }
586
587 my $unit = HMMUnit->new();
588 $unit -> seqname($seqname);
589 $unit -> hmmname($id);
590 $unit -> hmmacc($accmap{$id});
591 $unit -> start_seq($sqfrom);
592 $unit -> end_seq($sqto);
593 $unit -> start_hmm($hmmf);
594 $unit -> end_hmm($hmmt);
595 $unit -> bits($sc);
596 $unit -> evalue($ev);
597 $unit->seqbits($hash{$id});
598
599 # this should find it's own sequence!
600 $self->addHMMUnit($unit);
601 }
602 }
603 }
604 return $self;
605 }
606
607
608 sub get_unit_nse {
609 my $self = shift;
610 my $seqname = shift;
611 my $domname = shift;
612 my $start = shift;
613 my $end = shift;
614 my($seq,$unit);
615
616 $seq = $self->getHMMSequence($seqname);
617 if( !defined $seq ) {
618 warn("Could not get sequence name $seqname - so can't get its unit");
619 return undef;
620 }
621 foreach $unit ( $seq->eachHMMUnit() ) {
622 if( $unit->hmmname() eq $domname && $unit->start_seq() == $start && $unit->end_seq() == $end ) {
623 return $unit;
624 }
625 }
626
627 return undef;
628 }
629
630 sub write_ascii_out {
631 my $self = shift;
632 my $fh = shift;
633
634 if( !defined $fh) {
635 $fh = \*STDOUT;
636 }
637
638 my @units = sort { $a->seqname cmp $b->seqname ||
639 $a->start_seq() <=> $b->start_seq } $self->eachHMMUnit();
640
641 foreach my $unit ( @units ) {
642 print $fh sprintf( "%-".$maxidlength."s %4d %4d %-11s %4d %4d %7s %8s %s\n",
643 $unit->seqname(),
644 $unit->start_seq(),
645 $unit->end_seq(),
646 $unit->hmmacc,
647 $unit->start_hmm,
648 $unit->end_hmm,
649 $unit->bits,
650 $unit->evalue,
651 $unit->hmmname );
652 }
653 }
654
655
656 sub remove_overlaps {
657 # self is unchanged
658 # so need to grab the returned results object
659 my $self = shift;
660 # if clanmap is specified then remove overlaps between
661 # families within a clan
662 my $clanmap = shift;
663
664 my $new = HMMResults->new();
665
666 UNIT:
667 foreach my $unit ( sort { $b->bits <=> $a->bits } $self->eachHMMUnit() ) {
668 if( not $new->getHMMSequence( $unit->seqname ) ) {
669 my $seq = HMMSequence->new();
670 $seq->name( $unit->seqname );
671 $seq->bits( $unit->seqbits );
672 $new->addHMMSequence( $seq );
673 }
674 else {
675 # check overlaps
676 my ($acc1) = $accmap{ $unit->hmmname } =~ /(\S+)\.\d+/; # grrrrr
677 foreach my $u ( $new->getHMMSequence( $unit->seqname )->eachHMMUnit ) {
678 my ($acc2) = $accmap{ $u->hmmname } =~ /(\S+)\.\d+/;
679
680 if( $unit->hmmname eq $u->hmmname ) {
681 next UNIT if( $unit->overlap( $u ) );
682 }
683
684 if( ( $clanmap and
685 exists $clanmap->{ $acc1 } and
686 exists $clanmap->{ $acc2 } and
687 $clanmap->{ $acc1 } eq $clanmap->{ $acc2 } ) ) {
688 next UNIT if( $unit->overlap( $u ) );
689 }
690 }
691 }
692
693 $new->addHMMUnit( $unit );
694 }
695
696 return $new;
697 }
698
699
700 sub merge_results {
701 # merge two HMMResults objects, preferring those from self over res in case of overlap
702 my $self = shift;
703 my $res = shift;
704
705 UNIT:
706 foreach my $unit ( $res->eachHMMUnit() ) {
707 my $seqname = $unit->seqname();
708 my $seqbits = $unit->seqbits();
709 my $oldseq = $self->getHMMSequence($seqname);
710 if( not $oldseq ) {
711 my $seq = HMMSequence->new();
712 $seq->name($seqname);
713 $seq->bits($seqbits);
714 $self->addHMMSequence($seq);
715 }
716 else {
717 foreach my $oldunit ( $oldseq->eachHMMUnit() ) {
718 if( ( $unit->hmmname() eq $oldunit->hmmname() ) and $unit->overlap( $oldunit ) ) {
719 next UNIT;
720 }
721 }
722 }
723 $self->addHMMUnit($unit);
724 }
725 return $self;
726 }
727
728
729 ######################
730
731 package HMMSequence;
732
733 sub new {
734 my $ref = shift;
735 my $class = ref($ref) || $ref;
736 my $self = {
737 'name' => undef,
738 'desc' => undef,
739 'bits' => undef,
740 'evalue' => undef,
741 'domain' => [] };
742 bless $self, $class;
743 return $self;
744 }
745
746 sub name {
747 my $self = shift;
748 my $name = shift;
749
750 if( defined $name ) {
751 $self->{'name'} = $name;
752 }
753 return $self->{'name'};
754 }
755
756 sub desc {
757 my $self = shift;
758 my $desc = shift;
759
760 if( defined $desc ) {
761 $self->{'desc'} = $desc;
762 }
763 return $self->{'desc'};
764 }
765
766 sub bits {
767 my $self = shift;
768 my $bits = shift;
769
770 if( defined $bits ) {
771 $self->{'bits'} = $bits;
772 }
773 return $self->{'bits'};
774 }
775
776 sub evalue {
777 my $self = shift;
778 my $evalue = shift;
779
780 if( defined $evalue ) {
781 $self->{'evalue'} = $evalue;
782 }
783 return $self->{'evalue'};
784 }
785
786 sub addHMMUnit {
787 my $self = shift;
788 my $unit = shift;
789
790 push(@{$self->{'domain'}},$unit);
791 }
792
793 sub eachHMMUnit {
794 my $self = shift;
795
796 return @{$self->{'domain'}};
797 }
798
799
800 ###################
801
802 package HMMUnit;
803
804 sub new {
805 my $ref = shift;
806 my $class = ref($ref) || $ref;
807 my $self = {
808 seqname => undef,
809 seq_range => new Range,
810 hmmname => undef,
811 hmmacc => undef,
812 hmm_range => new Range,
813 bits => undef,
814 evalue => undef,
815 prob => undef,
816 seqbits => undef,
817 alignlines => [],
818 mode => undef };
819
820 bless $self, $class;
821 return $self;
822 }
823
824 sub seqname {
825 my $self = shift;
826 my $value = shift;
827 $self->{'seqname'} = $value if( defined $value );
828 return $self->{'seqname'};
829 }
830
831 sub hmmname {
832 my $self = shift;
833 my $value = shift;
834 $self->{'hmmname'} = $value if( defined $value );
835 return $self->{'hmmname'};
836 }
837
838 sub hmmacc {
839 my $self = shift;
840 my $value = shift;
841 $self->{'hmmacc'} = $value if( defined $value );
842 return $self->{'hmmacc'};
843 }
844
845 sub bits {
846 my $self = shift;
847 my $value = shift;
848 $self->{'bits'} = $value if( defined $value );
849 return $self->{'bits'};
850 }
851
852 sub evalue {
853 my $self = shift;
854 my $value = shift;
855 $self->{'evalue'} = $value if( defined $value );
856 return $self->{'evalue'};
857 }
858
859 sub seqbits {
860 my $self = shift;
861 my $value = shift;
862 $self->{'seqbits'} = $value if( defined $value );
863 return $self->{'seqbits'};
864 }
865
866 sub mode {
867 my $self = shift;
868 my $value = shift;
869 $self->{'mode'} = $value if( defined $value );
870 return $self->{'mode'};
871 }
872
873 sub add_alignment_line {
874 my $self = shift;
875 my $line = shift;
876 push(@{$self->{'alignlines'}},$line);
877 }
878
879 sub each_alignment_line {
880 my $self = shift;
881 return @{$self->{'alignlines'}};
882 }
883
884 sub get_nse {
885 my $self = shift;
886 my $sep1 = shift;
887 my $sep2 = shift;
888
889 if( !defined $sep2 ) {
890 $sep2 = "-";
891 }
892 if( !defined $sep1 ) {
893 $sep1 = "/";
894 }
895
896 return sprintf("%s%s%d%s%d",$self->seqname,$sep1,$self->start_seq,$sep2,$self->end_seq);
897 }
898
899
900 sub start_seq {
901 my $self = shift;
902 my $start = shift;
903
904 if( !defined $start ) {
905 return $self->{'seq_range'}->start();
906 }
907 $self->{'seq_range'}->start($start);
908 return $start;
909 }
910
911 sub end_seq {
912 my $self = shift;
913 my $end = shift;
914
915 if( !defined $end ) {
916 return $self->{'seq_range'}->end();
917 }
918 $self->{'seq_range'}->end($end);
919 return $end;
920
921 }
922
923
924 sub start_hmm {
925 my $self = shift;
926 my $start = shift;
927
928 if( !defined $start ) {
929 return $self->{'hmm_range'}->start();
930 }
931 $self->{'hmm_range'}->start($start);
932 return $start;
933 }
934
935 sub end_hmm {
936 my $self = shift;
937 my $end = shift;
938
939 if( !defined $end ) {
940 return $self->{'hmm_range'}->end();
941 }
942 $self->{'hmm_range'}->end($end);
943 return $end;
944
945 }
946
947
948 sub overlap {
949 # does $self overlap with $unit?
950 my $self = shift;
951 my $unit = shift;
952 my( $u1, $u2 ) = sort { $a->start_seq <=> $b->start_seq } ( $self, $unit );
953
954 if( $u2->start_seq <= $u1->end_seq ) {
955 return 1;
956 }
957
958 return 0;
959 }
960
961
962 #################
963
964 package Range;
965
966 sub new {
967 my $ref = shift;
968 my $class = ref($ref) || $ref;
969
970 my $self = {
971 start => undef,
972 end => undef
973 };
974
975 bless $self, $class;
976
977 if(@_ == 2) {
978 $self->start(shift);
979 $self->end(shift);
980 } elsif (@_) {
981 print STDERR "Usage: new Range()\n\tnew Range(start, stop)\n";
982 }
983
984 return $self;
985 }
986
987 sub start {
988 my $self = shift;
989 my $value = shift;
990 $self->{'start'} = $value if( defined $value );
991 return $self->{'start'};
992 }
993
994 sub end {
995 my $self = shift;
996 my $value = shift;
997 $self->{'end'} = $value if( defined $value );
998 return $self->{'end'};
999 }