[Biodevelopers] RE: [BiO BB] Poly A tail length - script help please

Joe Landman landman at scalableinformatics.com
Wed Sep 10 04:23:51 EDT 2003


Malcolm,

  Good catch.  I paid attention to the tail part, not the longest
sequence part.  It should be easy to modify the regex and generate a
length-sorted array of matches, but as you have already solved the
(correct) problem ...
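
For the archive, a minimal sketch of that modification, assuming one
sequence per string and that "longest stretch" means the longest maximal
run of A and/or N anywhere in the sequence.  The helper names
longest_runs and longest_run_length are made up for illustration; this
is not Malcolm's attached polyfind script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return every maximal run of A/N in a sequence, longest first.
# (longest_runs is a hypothetical helper name, not from the thread.)
sub longest_runs {
    my ($sequence) = @_;
    my @runs = $sequence =~ /([AN]+)/g;               # all maximal A/N runs
    return sort { length($b) <=> length($a) } @runs;  # length sorted, descending
}

# Length of the longest run, or 0 when the sequence has no A or N.
# (Also a hypothetical helper name.)
sub longest_run_length {
    my ($sequence) = @_;
    my @runs = longest_runs($sequence);
    return @runs ? length($runs[0]) : 0;
}

# The sample sequence from the original question.
my $example = "AGTAGTCGATCATNATANCTANTACNACTACTAACTATGCTAGNNAATATAAAAAAAAANAAA";
printf "%i %s\n", longest_run_length($example), "example";
```

Because the pattern is not anchored, this counts the longest run wherever
it occurs, which may or may not be at the 3' end of the read.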

Joe

On Wed, 2003-09-10 at 16:12, Cook, Malcolm wrote:
> But that does not compute the 'longest stretch'.
> 
> The attached perl script does, and will allow you to write:
> 
> > polyfind [-all] *.seq > polyfind.results
> 
> Enjoy,
> 
> Malcolm Cook
> 
> > -----Original Message-----
> > From: Joseph Landman [mailto:landman at scalableinformatics.com]
> > Sent: Tuesday, September 09, 2003 6:58 PM
> > To: BiO BB
> > Cc: biodevelopers
> > Subject: Re: [BiO BB] Poly A tail length - script help please
> > 
> > 
> > First one is free ... 
> > 
> >         #!/usr/bin/perl
> >         
> >         use strict;
> >         
> >         my ($directory,$directory_handle,$file,@files,$sequence);
> >         my ($file_handle,$poly_a_tail,$rseq);
> >         
> >         $directory = "./";	# directory to open
> >         if (!(opendir $directory_handle,$directory))
> >            {
> >              die "FATAL ERROR: Unable to open directory = ".$directory."\n";
> >            }
> >            
> >         # select only the .seq files
> >         @files = grep { /\.seq$/ } readdir($directory_handle); 
> >         
> >         # loop over these selected files
> >         foreach $file (@files)
> >           {    
> >             # try to open the file
> >             if (!(open($file_handle,"<",$file)))
> >                {
> >                  # if we cannot open it, warn the user, and skip to the next file
> >                  warn "Warning: unable to open file = ".$file.". Skipping.\n";
> >         	 next;
> >                }
> >               else
> >                {
> >                  # assume one line per file, or we will have to modify this
> >                  chomp($sequence=<$file_handle>);
> >                  # now time to bring out the heavy artillery
> >                  $rseq=reverse $sequence;       # poly-a is now at the head
> >                  # match A's and/or N's at the front; empty string if there are none
> >                  $poly_a_tail = ($rseq =~ /^([AN]+)/) ? $1 : "";
> >                  printf "%i %s\n",length($poly_a_tail),$file;   # tell the world ...
> >         	 close($file_handle);
> >                }
> >           }
> > 
> > 
> > 
> > On Tue, 2003-09-09 at 17:00, Tristan Fiedler wrote:
> > > Thanks for the scripting tips!  I have a 'counting' issue which I need to
> > > quickly resolve.  A typical sequence input file (5 - 700 bases) looks like:
> > > 
> > > AGTAGTCGATCATNATANCTANTACNACTACTAACTATGCTAGNNAATATAAAAAAAAANAAA
> > > 
> > > I have over 500 files, named *.seq.  I would like to create a script which:
> > > 
> > > a.  runs through all the files,
> > > b.  counts the length of the 'poly A' tail (defined as the longest stretch
> > >     of A or N)
> > > c.  sends the output to a file, e.g.
> > > 
> > > 25 1.seq
> > > 87 2.seq
> > > 13 3.seq
> > > 
> > > Example valid poly A tails :
> > > 
> > > AAAANANANANAAANNAAAAAA
> > > 
> > > AAAAAAAAAAAAAA
> > > 
> > > NNNNNNNNNNNNN
> > > 
> > > AAANNNNNNNNNNNAAAAAAAAA
> > > 
> > > Thank you so much for your expertise!
> > > 
> > > Tristan
> > -- 
> > Joseph Landman, Ph.D
> > Scalable Informatics LLC
> > email: landman at scalableinformatics.com
> >   web: http://scalableinformatics.com
> > phone: +1 734 612 4615
> > 
> > 
> > _______________________________________________
> > BiO_Bulletin_Board maillist  -  BiO_Bulletin_Board at bioinformatics.org
> > https://bioinformatics.org/mailman/listinfo/bio_bulletin_board
> > 
-- 
Joseph Landman, Ph.D
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
phone: +1 734 612 4615



