ViewVC Help
View File | Revision Log | Show Annotations | View Changeset | Root Listing
root/gclib/scripts/pfam_scan.pl
Revision: 24
Committed: Tue Jul 26 21:46:39 2011 UTC (8 years, 1 month ago) by gpertea
File size: 24537 byte(s)
Log Message:
Line File contents
1 #!/usr/bin/perl -w
2
3 # Copyright (c) 2002-2005 Genome Research Ltd
4 # Distributed under the same terms as the Pfam database
5 # See ftp://ftp.sanger.ac.uk/pub/databases/Pfam/COPYRIGHT
6 # for more details.
7 #
8 # Written by Sam Griffiths-Jones borrowing heavily from some
9 # old code by Ewan Birney. Contact pfam@sanger.ac.uk for
10 # help.
11 #
12 # version 0.5 (16/3/2005)
13
14 =head1 NAME
15
16 pfam_scan.pl - search protein fasta sequences against the Pfam
17 library of HMMs.
18
19 =head1 VERSION
20
21 This is version 0.5 of pfam_scan.pl. See the history section for
22 recent changes.
23
24 Behaviour of recent versions is a significantly different from 0.1.
25 From version 0.5, overlapping matches to families within the same clan
26 are removed, keeping the best scoring hit. This behaviour can be
27 overridden with the --overlap option. From version 0.2, we can use
28 BLAST to preprocess the input sequences with the --fast option, so we
29 only have to search a subset of sequences against a subset of HMMs
30 using hmmpfam. For puritanical reasons we don't do this by default!
31 Read the notes about this below.
32
33 This version has been tested with Perl 5.6.1, Pfam 10.0 (through
34 13.0), Bioperl 1.2 and HMMER 2.3.1. It should work with any versions
35 higher than these.
36
37 =head1 REQUIREMENTS
38
39 - this script
40 - Perl 5.6 or higher (and maybe lower)
41 - The Pfam database (downloadable from
42 ftp://ftp.sanger.ac.uk/pub/databases/Pfam/)
43 - HMMER software (from http://hmmer.wustl.edu/)
44 - NCBI BLAST binaries (from http://www.ncbi.nlm.nih.gov/Ftp/)
45 - Bioperl (from http://bio.perl.org/)
46
47 The Bioperl modules directory must be in your perl library path, and
48 the HMMER and BLAST binaries must be in your executable path.
49
50 You also need to be able to read and write to /tmp on your machine.
51
52 Some of these requirements are easily circumvented, but this script
53 should at least give you a start.
54
55 =head1 HOW TO INSTALL PFAM LOCALLY
56
57 1. Get the Pfam database from
58 ftp://ftp.sanger.ac.uk/pub/databases/Pfam/. In particular you need
59 the files Pfam-A.fasta, Pfam_ls, Pfam_fs, and Pfam-A.seed.
60
61 2. Unzip them if necessary
62 $ gunzip Pfam*.gz
63
64 3. Grab and install HMMER, NCBI BLAST and Bioperl, and make sure your
65 paths etc are set up properly.
66
67 4. Index Pfam-A.fasta for BLAST searches
68 $ formatdb -i Pfam-A.fasta -p T
69
70 5. Index the Pfam_ls and Pfam_fs libraries for HMM fetching
71 $ hmmindex Pfam_ls
72 $ hmmindex Pfam_fs
73
74
75 =head1 SEARCHING PFAM
76
77 This script is really just a wrapper around hmmpfam.
78
79 Run pfam_scan.pl -h to get a list of options. Probably the only thing
80 to worry about is supplying the -d option with the location of your
81 downloaded Pfam database. Or you can set the PFAMDB environment
82 variable to point to the right place and things should work without
83 -d. And you should decide whether or not to use --fast.
84
85 A few things to note:
86
87 --fast uses BLAST as a preprocessor to reduce the amount of compute we
88 have to do with hmmpfam. This is known to reduce sensitivity in the
89 case of a very small number of families (those whose length is
90 exceptionally short, like the XYPPX repeat). If you're annotating
91 genomes then you *probably* don't care too much about these families.
92 Omiting this option may give you a small added sensitivity, but with a
93 rough 10 fold time cost. If you want to exactly replicate the Pfam
94 web site results or distributed data, you probably shouldn't use this.
95
96 Overlapping above-threshold hits to families within the same clan are
97 removed -- only the best scoring hit is kept. You can override this
98 behaviour with the --overlap option.
99
100 Pfam provides two sets of models, called ls and fs models, for whole
101 domain and fragment searches. This wrapper basically returns all hits
102 to the ls models, and then adds to these all non-overlapping hits to
103 the fragment models. This mimics the behaviour of Pfam web site
104 searches. You can choose to search only one set of models with the
105 --mode option.
106
107 Unless you want to grub around in the noise you should probably use
108 the default thresholds - these are hand curated for every family by
109 the Pfam team, such that we believe false positives will not score
110 above these levels. The consequence is that some families may miss
111 members.
112
113 You may want to adjust the threshold used for the preprocessing BLAST
114 search (default evalue 10). Raising this to 50 will slow everything
115 down a bit but may gain you a little sensitivity. Lowering the evalue
116 cutoff will speed things up but with an inevitable sensitivity cost.
117
118 It is important that each sequence in the fasta file has a unique
119 identifier. Note that the fasta header format should be:
120
121 >identifier <optional description>
122
123 so the identifier should not contain whitespace.
124
125 The format of the output is:
126
127 <seq id> <seq start> <seq end> <hmm acc> <hmm start> <hmm end> <bit score> <evalue> <hmm name>
128
129 hmmpfam returns scores for sequence and domain matches seperately.
130 For simplicity, the single line for each domain format returned here
131 reports domain scores.
132
133 =head1 BUGS
134
135 Many options are not rigorously tested. Error messages are
136 uninformative. The documentation is inadequate. You may find it
137 useful. You may not.
138
139 =head1 HISTORY
140
141 Version Main changes
142 ------- ------------
143
144 0.5 Removes overlapping above-threshold hits to families
145 within the same clan. --overlap overrides.
146
147 0.4 Work-around for hmmpfam bug/feature that reports hits
148 above domain threshold even if the sequence doesn't
149 score above the sequence threshold.
150
151 0.3 Fix minor bugs to be compatable with HMM versions in
152 Pfam 13.
153
154 0.2 --fast option to use BLAST preprocessing for significant
155 speed-up.
156
157 0.1 First effort, simply wraps up hmmpfam without doing
158 anything clever.
159
160 =head1 CONTACT
161
162 This script is copyright (c) Genome Research Ltd 2002-2005. Please
163 contact pfam@sanger.ac.uk for help.
164
165 =cut
166
167 use strict;
168 use lib '/fs/sz-user-supported/common/lib/bioperl-1.5.0';
169 use Getopt::Long;
170 use IO::File;
171 use Bio::SeqIO;
172 use Bio::SearchIO;
173
174
175 sub show_help {
176 print STDERR <<EOF;
177
178 $0: search a fasta file against Pfam
179
180 Usage: $0 <options> fasta_file
181 Options
182 -h : show this help
183 -d <dir> : directory location of Pfam flatfiles
184 -o <file> : output file, otherwise send to STDOUT
185 --fast : use BLAST preprocessing
186
187 Expert options
188 --overlap : show overlapping hits within clan member families
189 --mode <ls|fs> : search only ls or fs models (default both)
190 -e <n> : specify hmmpfam evalue cutoff, else use Pfam cutoffs
191 -t <n> : specify hmmpfam bits cutoff, else use Pfam cutoffs
192 -be <n> : specify blast evalue cutoff (default 1)
193
194 Output format is:
195 <seq id> <seq start> <seq end> <hmm acc> <hmm start> <hmm end> <bit score> <evalue> <hmm name>
196
197 EOF
198 }
199
200 my( $help,
201 $ecut,
202 $bcut,
203 $outfile,
204 $pfamdir,
205 $mode,
206 $fast,
207 $noclean,
208 $verbose,
209 $overlap,
210 );
211
212 # defaults
213 my $blastecut = 1;
214
215 if( $ENV{ 'PFAMDB' } ) {
216 $pfamdir = $ENV{ 'PFAMDB' };
217 }
218 else {
219 $pfamdir = '.';
220 }
221
222 &GetOptions( "mode=s" => \$mode,
223 "fast" => \$fast,
224 "o=s" => \$outfile,
225 "e=s" => \$ecut,
226 "t=s" => \$bcut,
227 "h" => \$help,
228 "be=s" => \$blastecut,
229 "d=s" => \$pfamdir,
230 "noclean" => \$noclean,
231 "v" => \$verbose,
232 "overlap" => \$overlap,
233 );
234
235 my $fafile = shift;
236
237 if( $help ) {
238 &show_help();
239 exit(1);
240 }
241 if( !-s $fafile ) {
242 &show_help();
243 print STDERR "FATAL: can't find fasta file [$fafile]\n\n";
244 exit(1);
245 }
246 if( !-d $pfamdir ) {
247 &show_help();
248 print STDERR "FATAL: database location [$pfamdir] is not a directory\n\n";
249 exit(1);
250 }
251 if( !-s "$pfamdir/Pfam_ls" and ( $mode ne "fs" ) ) {
252 &show_help();
253 print STDERR "FATAL: database location [$pfamdir] does not contain Pfam_ls\n\n";
254 exit(1);
255 }
256 if( !-s "$pfamdir/Pfam_fs" and ( $mode ne "ls" ) ) {
257 &show_help();
258 print STDERR "FATAL: database location [$pfamdir] does not contain Pfam_fs\n\n";
259 exit(1);
260 }
261 if( $fast ) {
262 if( !-s "$pfamdir/Pfam-A.fasta" ) {
263 &show_help();
264 print STDERR "FATAL: database location [$pfamdir] does not contain Pfam-A.fasta.\nYou cannot use --fast without this file.\n\n";
265 exit(1);
266 }
267 if( !-s "$pfamdir/Pfam-A.fasta.phr" ) {
268 &show_help();
269 print STDERR "FATAL: you need to index Pfam-A.fasta using formatdb to use --fast\n\n";
270 exit(1);
271 }
272 if( !-s "$pfamdir/Pfam_ls.ssi" or !-s "$pfamdir/Pfam_fs.ssi" ) {
273 &show_help();
274 print STDERR "FATAL: you need to index Pfam_ls and Pfam_fs using hmmindex to use --fast\n\n";
275 exit(1);
276 }
277 }
278
279 if( !$overlap and !-s "$pfamdir/Pfam-C" ) {
280 print STDERR "WARNING: database directory [$pfamdir] does not contain Pfam-C.\nYou may get overlapping hits to families in the same clan\nwithout this file.\n\n";
281 }
282
283
284
285 my $maxidlength = 1;
286 open( FA, $fafile ) or die "FATAL: can't open fasta file [$fafile]\n";
287 while(<FA>) {
288 if( /^\>(\S+)/ ) {
289 $maxidlength = length( $1 ) if( $maxidlength < length( $1 ) );
290 }
291 }
292 close FA;
293
294 my $options = "";
295 if( $ecut ) {
296 $options .= "-E $ecut";
297 }
298 elsif( $bcut ) {
299 $options .= "-T $bcut";
300 }
301 else {
302 $options .= "--cut_ga";
303 }
304
305 # map Pfam accessions to ids
306 # expensive, but we have to read Pfam-A.seed or Pfam-A.fasta
307 my %accmap;
308 if( -s "$pfamdir/Pfam-A.seed" ) {
309 my $id;
310 open( SEED, "$pfamdir/Pfam-A.seed" ) or die "FATAL: can't open Pfam-A.seed file\n";
311 while(<SEED>) {
312 if( /^\#=GF ID\s+(\S+)/ ) {
313 $id = $1;
314 }
315 if( /^\#=GF AC\s+(PF\d+\.?\d*)/ ) {
316 $accmap{$id} = $1;
317 }
318 }
319 }
320 elsif( -s "$pfamdir/Pfam-A.fasta" ) {
321 open( FASTA, "$pfamdir/Pfam-A.fasta" ) or die;
322 while(<FASTA>) {
323 if( /^\>\S+\s+(PF\d+\.?\d*)\;(\S+)\;/ ) {
324 $accmap{$2} = $1;
325 }
326 }
327 close FASTA;
328 }
329 else {
330 die "FATAL: can't get Pfam id/accession mapping. I need either\nPfam-A.seed or Pfam-A.fasta.\n";
331 }
332 $options .= " -Z ".scalar( keys %accmap );
333
334
335 my( %clanmap, $clanacc );
336 # read the clan mappings in
337 if( !$overlap ) {
338 open( CLAN, "$pfamdir/Pfam-C" ) or die;
339 while(<CLAN>) {
340 if( /^AC\s+(\S+)/ ) {
341 $clanacc = $1;
342 }
343 if( /^MB\s+(\S+)\;/ ) {
344 $clanmap{$1} = $clanacc;
345 }
346 }
347 close CLAN;
348 }
349
350
351 # The organisation here is not intuitive. This is the first way I
352 # thought of reusing code nicely for both the full hmmpfam search
353 # and the mini database searches created from blast results
354
355 my( $hmm_seq_map, @hmmlist, %seq );
356
357 if( $fast ) {
358 # read sequences in
359 my $seqin = Bio::SeqIO -> new( '-file' => $fafile,
360 '-format' => 'Fasta' );
361 while( my $seq = $seqin->next_seq() ) {
362 if( exists $seq{ $seq->id } ) {
363 die "FATAL: We're going to struggle here - you have two sequences\nwith the same name in your fasta file\n";
364 }
365 $seq{ $seq->id } = $seq;
366 }
367
368 # then run blast searches
369 my $blastdb = "$pfamdir/Pfam-A.fasta";
370 my $blastcmd = "blastall -p blastp -i $fafile -d $blastdb -e $blastecut -F F -b 100000 -v 100000";
371 print STDERR "running [$blastcmd > tmp_$$.blast] ....\n" if( $verbose );
372 system "$blastcmd > tmp_$$.blast" and die "FATAL: failed to run blastall - is it in your path?\n";
373 print STDERR "parsing blast output ....\n" if( $verbose );
374 $hmm_seq_map = parse_blast( "tmp_$$.blast" );
375 unless( $noclean ) {
376 unlink( "tmp_$$.blast" ) or die "FATAL: failed to remove temporary files - can you write to tmp?\n";
377 }
378 @hmmlist = keys %{ $hmm_seq_map };
379 }
380
381 unless( $fast ) {
382 @hmmlist = (1); # just so we get through the following while loop once
383 }
384
385 my $allresults = HMMResults->new();
386
387 while( my $hmmacc = shift @hmmlist ) {
388 my $seqfile;
389 if( $fast ) {
390 $seqfile = "tmp_$$.fa";
391
392 open( SOUT, ">$seqfile" ) or die "FATAL: failed to create temporary files - can you write to /tmp?\n";
393 my $seqout = Bio::SeqIO -> new( '-fh' => \*SOUT,
394 '-format' => 'Fasta' );
395
396 foreach my $id ( keys %{ $hmm_seq_map->{$hmmacc} } ) {
397 if( not exists $seq{ $id } ) {
398 warn "can't find [$id] in your sequence file\n";
399 }
400 else {
401 $seqout -> write_seq( $seq{ $id } );
402 print STDERR "searching [$id] against [$hmmacc]\n" if( $verbose );
403 }
404 }
405 close SOUT;
406 }
407 else {
408 $seqfile = $fafile;
409 }
410
411 foreach my $m ( "ls", "fs" ) {
412 my $hmmfile;
413 if( $mode ) {
414 next unless( $m eq $mode );
415 }
416 if( $fast ) {
417 $hmmfile = "tmp_$$.hmm_$m";
418 system "hmmfetch $pfamdir/Pfam_$m $hmmacc > $hmmfile" and die "FATAL: failed to fetch [$hmmacc] from $pfamdir/Pfam_$m\n";
419 }
420 else {
421 $hmmfile = "$pfamdir/Pfam_$m";
422 }
423
424 my $fh = new IO::File;
425 my $results = new HMMResults;
426
427 $fh -> open( "hmmpfam --compat -A 0 $options $hmmfile $seqfile |" );
428 $results -> parse_hmmpfam( $fh );
429 $fh -> close;
430 $allresults -> merge_results( $results );
431 if( -e "tmp_$$.hmm_$m" and not $noclean ) {
432 unlink "tmp_$$.hmm_$m" or die "FATAL: failed to remove temporary files - can you write to /tmp?\n";
433 }
434 }
435 if( -e "tmp_$$.fa" and not $noclean ) {
436 unlink "tmp_$$.fa" or die "FATAL: failed to remove temporary files - can you write to /tmp?\n";
437 }
438 }
439
440 if( !$overlap ) {
441 $allresults = $allresults->remove_overlaps( \%clanmap );
442 }
443
444
445 if( $outfile ) {
446 open( O, ">$outfile" ) or die "FATAL: can't write to your output file [$outfile]\n";
447 $allresults -> write_ascii_out( \*O );
448 close O;
449 }
450 else {
451 $allresults -> write_ascii_out( \*STDOUT );
452 }
453
454 exit(0);
455
456
457
458 ###############
459 # SUBROUTINES #
460 ###############
461
462 sub parse_blast {
463 my $file = shift;
464 my %hmm_seq_map;
465 my %already;
466
467 my $searchin = Bio::SearchIO -> new( '-file' => $file,
468 '-format' => 'Blast' );
469
470 while( my $result = $searchin->next_result() ) {
471 while( my $hit = $result -> next_hit() ) {
472 my( $acc, $id ) = $hit->description =~ /(PF\d+\.?\d*)\;(\S+)\;/;
473 # print STDERR "seq [",$result->query_name(),"] hmm [$acc]\n" if( $verbose );
474 $hmm_seq_map{$acc} -> { $result->query_name() } = 1;
475 }
476 }
477
478 return \%hmm_seq_map;
479 }
480
481
482 ############
483 # PACKAGES #
484 ############
485
486 package HMMResults;
487
488 sub new {
489 my $ref = shift;
490 my $class = ref($ref) || $ref;
491 my $self = {
492 'domain' => [],
493 'seq' => {},
494 'hmmname' => undef };
495 bless $self, $class;
496 return $self;
497 }
498
499 sub hmmname {
500 my $self = shift;
501 my $value = shift;
502 $self->{'hmmname'} = $value if( defined $value );
503 return $self->{'hmmname'};
504 }
505
506 sub addHMMUnit {
507 my $self = shift;
508 my $unit = shift;
509 my $name = $unit->seqname();
510
511 if( !exists $self->{'seq'}->{$name} ) {
512 warn "Adding a domain of $name but with no HMMSequence. Will be kept in domain array but not added to a HMMSequence";
513 } else {
514 $self->{'seq'}->{$name}->addHMMUnit($unit);
515 }
516 push( @{$self->{'domain'}}, $unit );
517 }
518
519 sub eachHMMUnit {
520 my $self = shift;
521 return @{$self->{'domain'}};
522 }
523
524 sub addHMMSequence {
525 my $self = shift;
526 my $seq = shift;
527 my $name = $seq->name();
528 if( exists $self->{'seq'}->{$name} ) {
529 warn "You already have $name in HMMResults. Replacing by a new entry!";
530 }
531 $self->{'seq'}->{$name} = $seq;
532 }
533
534 sub eachHMMSequence {
535 my $self = shift;
536 my (@array,$name);
537
538 foreach $name ( keys %{$self->{'seq'}} ) {
539 push( @array, $self->{'seq'}->{$name} );
540 }
541
542 return @array;
543 }
544
545 sub getHMMSequence {
546 my $self = shift;
547 my $name = shift;
548 return $self->{'seq'}->{$name};
549 }
550
551 sub parse_hmmpfam {
552 my $self = shift;
553 my $file = shift;
554
555 QUERY: {
556 my( %hash, $seqname );
557 while(<$file>) {
558 /^Scores for sequence/ && last;
559 if( /^Query:\s+(\S+)/ ) {
560 $seqname = $1;
561 my $seq = HMMSequence->new();
562 $seq->name($seqname);
563 $self->addHMMSequence($seq);
564 }
565 }
566
567 while(<$file>) {
568 /^Parsed for domains/ && last;
569 if( my( $id, $sc, $ev, $nd ) = /^\s*(\S+).+?\s(\S+)\s+(\S+)\s+(\d+)\s*$/ ) {
570 $hash{$id} = $sc;
571 }
572 }
573
574 while(<$file>) {
575 /^\/\// && redo QUERY;
576
577 if( my( $id, $sqfrom, $sqto, $hmmf, $hmmt, $sc, $ev ) =
578 /(\S+)\s+\S+\s+(\d+)\s+(\d+).+?(\d+)\s+(\d+)\s+\S+\s+(\S+)\s+(\S+)\s*$/ ) {
579
580 if( !exists($hash{$id}) ) {
581 # hmmpfam seems to return hits above the domain score that aren't
582 # above the sequence score. We don't want these, so skip if we
583 # haven't seen a sequence score.
584 next;
585 # warn("HMMResults parsing error in hmmpfam for $id - can't find sequence score");
586 }
587
588 my $unit = HMMUnit->new();
589 $unit -> seqname($seqname);
590 $unit -> hmmname($id);
591 $unit -> hmmacc($accmap{$id});
592 $unit -> start_seq($sqfrom);
593 $unit -> end_seq($sqto);
594 $unit -> start_hmm($hmmf);
595 $unit -> end_hmm($hmmt);
596 $unit -> bits($sc);
597 $unit -> evalue($ev);
598 $unit->seqbits($hash{$id});
599
600 # this should find it's own sequence!
601 $self->addHMMUnit($unit);
602 }
603 }
604 }
605 return $self;
606 }
607
608
609 sub get_unit_nse {
610 my $self = shift;
611 my $seqname = shift;
612 my $domname = shift;
613 my $start = shift;
614 my $end = shift;
615 my($seq,$unit);
616
617 $seq = $self->getHMMSequence($seqname);
618 if( !defined $seq ) {
619 warn("Could not get sequence name $seqname - so can't get its unit");
620 return undef;
621 }
622 foreach $unit ( $seq->eachHMMUnit() ) {
623 if( $unit->hmmname() eq $domname && $unit->start_seq() == $start && $unit->end_seq() == $end ) {
624 return $unit;
625 }
626 }
627
628 return undef;
629 }
630
631 sub write_ascii_out {
632 my $self = shift;
633 my $fh = shift;
634
635 if( !defined $fh) {
636 $fh = \*STDOUT;
637 }
638
639 my @units = sort { $a->seqname cmp $b->seqname ||
640 $a->start_seq() <=> $b->start_seq } $self->eachHMMUnit();
641
642 foreach my $unit ( @units ) {
643 print $fh sprintf( "%-".$maxidlength."s %4d %4d %-11s %4d %4d %7s %8s %s\n",
644 $unit->seqname(),
645 $unit->start_seq(),
646 $unit->end_seq(),
647 $unit->hmmacc,
648 $unit->start_hmm,
649 $unit->end_hmm,
650 $unit->bits,
651 $unit->evalue,
652 $unit->hmmname );
653 }
654 }
655
656
657 sub remove_overlaps {
658 # self is unchanged
659 # so need to grab the returned results object
660 my $self = shift;
661 # if clanmap is specified then remove overlaps between
662 # families within a clan
663 my $clanmap = shift;
664
665 my $new = HMMResults->new();
666
667 UNIT:
668 foreach my $unit ( sort { $b->bits <=> $a->bits } $self->eachHMMUnit() ) {
669 if( not $new->getHMMSequence( $unit->seqname ) ) {
670 my $seq = HMMSequence->new();
671 $seq->name( $unit->seqname );
672 $seq->bits( $unit->seqbits );
673 $new->addHMMSequence( $seq );
674 }
675 else {
676 # check overlaps
677 my ($acc1) = $accmap{ $unit->hmmname } =~ /(\S+)\.\d+/; # grrrrr
678 foreach my $u ( $new->getHMMSequence( $unit->seqname )->eachHMMUnit ) {
679 my ($acc2) = $accmap{ $u->hmmname } =~ /(\S+)\.\d+/;
680
681 if( $unit->hmmname eq $u->hmmname ) {
682 next UNIT if( $unit->overlap( $u ) );
683 }
684
685 if( ( $clanmap and
686 exists $clanmap->{ $acc1 } and
687 exists $clanmap->{ $acc2 } and
688 $clanmap->{ $acc1 } eq $clanmap->{ $acc2 } ) ) {
689 next UNIT if( $unit->overlap( $u ) );
690 }
691 }
692 }
693
694 $new->addHMMUnit( $unit );
695 }
696
697 return $new;
698 }
699
700
701 sub merge_results {
702 # merge two HMMResults objects, preferring those from self over res in case of overlap
703 my $self = shift;
704 my $res = shift;
705
706 UNIT:
707 foreach my $unit ( $res->eachHMMUnit() ) {
708 my $seqname = $unit->seqname();
709 my $seqbits = $unit->seqbits();
710 my $oldseq = $self->getHMMSequence($seqname);
711 if( not $oldseq ) {
712 my $seq = HMMSequence->new();
713 $seq->name($seqname);
714 $seq->bits($seqbits);
715 $self->addHMMSequence($seq);
716 }
717 else {
718 foreach my $oldunit ( $oldseq->eachHMMUnit() ) {
719 if( ( $unit->hmmname() eq $oldunit->hmmname() ) and $unit->overlap( $oldunit ) ) {
720 next UNIT;
721 }
722 }
723 }
724 $self->addHMMUnit($unit);
725 }
726 return $self;
727 }
728
729
730 ######################
731
732 package HMMSequence;
733
734 sub new {
735 my $ref = shift;
736 my $class = ref($ref) || $ref;
737 my $self = {
738 'name' => undef,
739 'desc' => undef,
740 'bits' => undef,
741 'evalue' => undef,
742 'domain' => [] };
743 bless $self, $class;
744 return $self;
745 }
746
747 sub name {
748 my $self = shift;
749 my $name = shift;
750
751 if( defined $name ) {
752 $self->{'name'} = $name;
753 }
754 return $self->{'name'};
755 }
756
757 sub desc {
758 my $self = shift;
759 my $desc = shift;
760
761 if( defined $desc ) {
762 $self->{'desc'} = $desc;
763 }
764 return $self->{'desc'};
765 }
766
767 sub bits {
768 my $self = shift;
769 my $bits = shift;
770
771 if( defined $bits ) {
772 $self->{'bits'} = $bits;
773 }
774 return $self->{'bits'};
775 }
776
777 sub evalue {
778 my $self = shift;
779 my $evalue = shift;
780
781 if( defined $evalue ) {
782 $self->{'evalue'} = $evalue;
783 }
784 return $self->{'evalue'};
785 }
786
787 sub addHMMUnit {
788 my $self = shift;
789 my $unit = shift;
790
791 push(@{$self->{'domain'}},$unit);
792 }
793
794 sub eachHMMUnit {
795 my $self = shift;
796
797 return @{$self->{'domain'}};
798 }
799
800
801 ###################
802
803 package HMMUnit;
804
805 sub new {
806 my $ref = shift;
807 my $class = ref($ref) || $ref;
808 my $self = {
809 seqname => undef,
810 seq_range => new Range,
811 hmmname => undef,
812 hmmacc => undef,
813 hmm_range => new Range,
814 bits => undef,
815 evalue => undef,
816 prob => undef,
817 seqbits => undef,
818 alignlines => [],
819 mode => undef };
820
821 bless $self, $class;
822 return $self;
823 }
824
825 sub seqname {
826 my $self = shift;
827 my $value = shift;
828 $self->{'seqname'} = $value if( defined $value );
829 return $self->{'seqname'};
830 }
831
832 sub hmmname {
833 my $self = shift;
834 my $value = shift;
835 $self->{'hmmname'} = $value if( defined $value );
836 return $self->{'hmmname'};
837 }
838
839 sub hmmacc {
840 my $self = shift;
841 my $value = shift;
842 $self->{'hmmacc'} = $value if( defined $value );
843 return $self->{'hmmacc'};
844 }
845
846 sub bits {
847 my $self = shift;
848 my $value = shift;
849 $self->{'bits'} = $value if( defined $value );
850 return $self->{'bits'};
851 }
852
853 sub evalue {
854 my $self = shift;
855 my $value = shift;
856 $self->{'evalue'} = $value if( defined $value );
857 return $self->{'evalue'};
858 }
859
860 sub seqbits {
861 my $self = shift;
862 my $value = shift;
863 $self->{'seqbits'} = $value if( defined $value );
864 return $self->{'seqbits'};
865 }
866
867 sub mode {
868 my $self = shift;
869 my $value = shift;
870 $self->{'mode'} = $value if( defined $value );
871 return $self->{'mode'};
872 }
873
874 sub add_alignment_line {
875 my $self = shift;
876 my $line = shift;
877 push(@{$self->{'alignlines'}},$line);
878 }
879
880 sub each_alignment_line {
881 my $self = shift;
882 return @{$self->{'alignlines'}};
883 }
884
885 sub get_nse {
886 my $self = shift;
887 my $sep1 = shift;
888 my $sep2 = shift;
889
890 if( !defined $sep2 ) {
891 $sep2 = "-";
892 }
893 if( !defined $sep1 ) {
894 $sep1 = "/";
895 }
896
897 return sprintf("%s%s%d%s%d",$self->seqname,$sep1,$self->start_seq,$sep2,$self->end_seq);
898 }
899
900
901 sub start_seq {
902 my $self = shift;
903 my $start = shift;
904
905 if( !defined $start ) {
906 return $self->{'seq_range'}->start();
907 }
908 $self->{'seq_range'}->start($start);
909 return $start;
910 }
911
912 sub end_seq {
913 my $self = shift;
914 my $end = shift;
915
916 if( !defined $end ) {
917 return $self->{'seq_range'}->end();
918 }
919 $self->{'seq_range'}->end($end);
920 return $end;
921
922 }
923
924
925 sub start_hmm {
926 my $self = shift;
927 my $start = shift;
928
929 if( !defined $start ) {
930 return $self->{'hmm_range'}->start();
931 }
932 $self->{'hmm_range'}->start($start);
933 return $start;
934 }
935
936 sub end_hmm {
937 my $self = shift;
938 my $end = shift;
939
940 if( !defined $end ) {
941 return $self->{'hmm_range'}->end();
942 }
943 $self->{'hmm_range'}->end($end);
944 return $end;
945
946 }
947
948
949 sub overlap {
950 # does $self overlap with $unit?
951 my $self = shift;
952 my $unit = shift;
953 my( $u1, $u2 ) = sort { $a->start_seq <=> $b->start_seq } ( $self, $unit );
954
955 if( $u2->start_seq <= $u1->end_seq ) {
956 return 1;
957 }
958
959 return 0;
960 }
961
962
963 #################
964
965 package Range;
966
967 sub new {
968 my $ref = shift;
969 my $class = ref($ref) || $ref;
970
971 my $self = {
972 start => undef,
973 end => undef
974 };
975
976 bless $self, $class;
977
978 if(@_ == 2) {
979 $self->start(shift);
980 $self->end(shift);
981 } elsif (@_) {
982 print STDERR "Usage: new Range()\n\tnew Range(start, stop)\n";
983 }
984
985 return $self;
986 }
987
988 sub start {
989 my $self = shift;
990 my $value = shift;
991 $self->{'start'} = $value if( defined $value );
992 return $self->{'start'};
993 }
994
995 sub end {
996 my $self = shift;
997 my $value = shift;
998 $self->{'end'} = $value if( defined $value );
999 return $self->{'end'};
1000 }

Properties

Name Value
svn:executable *