[BiO BB] how to work on two txt files simultaneously by handle corresponding lines from each file

Fri Jul 22 03:16:02 EDT 2005

Hello Alex,

It is easy as pie.

The idea:
1) First generate the library of posibilities (16) in an array named 
@library.
2) There should be another array name @already_used with the same 
amount  of  entries  (At the beginning assign 1 to all entries, and when 
i-th element from @library  is selected to be used in any group. [Think 
on it as a label 0->Used, 1->Not used for the same entry index.]
3) When you choose the element to be inserted in a group skip and repeat 
in case it has been already used (this is, the label is 0), if not,  
mark it as visited and copy it to your group array.

The code (It is not debugged, it might contain errors, think of it as a 
very detailed pseudocode)

#generate the library
       @library=qw(AA AC AT AG CA CC CG CT GA GC GG GT TA TC TG TT);
#generate already used array
       for ($i=0;$i<scalar @library;$i++){
             $already_used[$i]=1;
             }
#Randomly select group A
      my $ct=0;
      while ($ct<4){
               $ri=int (rand(scalar @library));
               if ($already_used[$ri]==1){
                         $already_used[$ri]=0;
                         $groupA[$ct]=$library[$ri];
                         $ct++;
                         }
          }
    print "GROUP A: @groupA\n";
#Randomly select group B
      my $ct=0;
      while ($ct<4){
                   $ri=int (rand(scalar @library));
                   if ($already_used[$ri]==1){
                         $already_used[$ri]=0;
                         $groupB[$ct]=$library[$ri];
                         $ct++;
                         }
       }
        print "GROUP B: @groupB\n";

#NOW THAT THE GROUPS ARE GENERATED BY RANDOM... YOU CAN CONTINUE WITH 
YOUR OWN CODE.

VERY IMPORTANT: If in any of the 99 iterations you want to perform, the 
groups A and B are changed DONT FORGET TO REINITIALIZE LABELS IN 
ALREADY_VISITED TO 1 (otherwise you are deemed to an infinite loop at 
the 3rd iteration)

I think this is enough to let you continue, if not, please dont hesitate 
in asking BUT please, use the list rather than my personal mail  
address... because sometimes the answers are useful to someone else.

Good luck.
Txema
...........................................................................................................................................................................................
Alex Zhang wrote:

> Dear Jose,
>  
> Thank you very much for your suggestions! I also came up with the almost
> same idea with yours. That was great and I appreciate it.
>  
> Can I ask for another help from you on my question?
>  
>  .
>
> The thread is:
>
> (1)First,randomly select 4 pairs of nucleotides as the group A.
>  
> For example, we randomly select 4 pairs of nucleotides, ¡°CA¡±, 
> ¡°TG¡±, ¡°CG¡±, and ¡°TC¡± from 16 combinations of two nucleoties :
>
> AA,AC,AG,AT,
> CA,CC,CG,CT,
> GA,GC,GG,GT,
> TA,TC,TG,TT          
>
>       
> Then randomly select another 4 pairs of nucleotides, for instance, 
> like ¡°AA¡±,¡±CA¡±,¡±CT¡± and ¡°TT¡± as the group B.
>
> Any pair in group A should be different from the one in group B.
>
> (2)To create the motif binding site. Choose ¡°CA¡±, the first pair 
> from the group A with 85% probability or ¡°AA¡± from the group B with 
> 15% probability to occupy the 1st position in the binding site, which 
> means that each pair in the group A will be the dominant pair compared 
> to the one in the group B. The sum of probabilities of corresponding 
> pairs in group A and B will be 1.00 (0.85+0.15=1.00).
>
> Repeat this for the next 3 positions and put 4 pairs of nucleotides 
> together, we will end up with a random binding motif like:
>
> CA --- 1st position                       
> CC --- 2nd position                    
> CG --- 3rd position                       
> TC --- 4th position
>
> CACCCGTC----the 8 bp long binding motif site
>
> We also can form another group A like GG,GT,AT,CG and group
> B like GA,AA,AC,CC.
>
> Then we can get a random motif like: GGAAATCG
>
>
> (3)Repeat this to make another 99 binding motif sites.
>
>
>
> I had experience of generating 8 base-pair long indepent motif binding 
> sites with the probability of dominant nucleotide as 0.85. But I don't 
> know how to generate two groups of pairs of nucleotides which are not 
> identical. Thank you very much for suggestions!
>
> Alex
>
>
>
> The previous code is:
>
> ## $len = length of the sequence - 1
> $len = 2;
>
> ## $ar[i] letter in position i with probability $ap[i]
> ## other letters have the same probability (in our case 0.05)
> ## you have to build the arrays
> $ar[0] = "T";
> $ap[0] = 0.85;
>
> $ar[1] = "C";
> $ap[1] = 0.85;
>
> $ar[2] = "A";
> $ap[2] = 0.85;
>
> for ($i=0;$i<=$len;$i++) {
>
>     $ar_other[$i][0] = "A";
>     $ar_other[$i][1] = "C";
>     $ar_other[$i][2] = "G";
>
>     if ($ar[$i] eq "A") {
>         $ar_other[$i][0] = "T";
>     } elsif ($ar[$i] eq "C") {
>         $ar_other[$i][1] = "T";
>     } elsif ($ar[$i] eq "G") {
>         $ar_other[$i][2] = "T";
>     }
>
>     $p = rand(1);
>     #print 
> <http://www.tek-tips.com/viewthread.cfm?qid=1095895&page=1#> "$p";
>     if ($p<(1-$ap[$i])) {
>         $frac = (1-$ap[$i])/3;
>         if ($p<(1-$ap[$i]-(2*$frac))) {
>                 $ran_seq[$i] = $ar_other[$i][0];
>         } elsif ($p<(1-$ap[$i]-$frac)) {
>                 $ran_seq[$i] = $ar_other[$i][1];
>         }  else {
>                 $ran_seq[$i] = $ar_other[$i][2];
>         }
>     } else {
>
>         $ran_seq[$i] = $ar[$i];
>
>     }
>
> }
>
> ## you have your sequence in the array @ran_seq
> for ($i=0;$i<=$len;$i++) {
>     print "$ran_seq[$i] ";
> }
> print "\n";
>
>
> A overly longer code is:
>
> #!/usr/bin/perl
> # randommotifs.pl
> #
> # This program 
> <http://www.tek-tips.com/viewthread.cfm?qid=1095895&page=1#> generates 
> randomized instances of pseudomotifs, i.e.
> # sequences that are obtained by adding "noise", according to a
> specified
> # model to a consensus sequence. To keep this reasonably general, we
> # proceed through the following steps:
> #
> #    Define consensus motif and alphabet
> #    For the required number of repetitions
> #       For each position in motif
> #          Define an appropriate probability distribution over the
> alphabet
> #          Produce a character according to the distribution
> #       ...
> #    ...
> #
> # The probability distributions are generated to produce a consensus
> character
> # with a particular "consensus probability" and all other characters
> # with equal probabilites (the noise). This is defined in a single
> subroutine
> # so it is straightforward to change this for a different model of noise
> # (e.g. increase the probabilities for "similar" characters). It's easy
> # to become creative here and incorporate (biological) knowledge into
> the
> # noise-model.
> #
> # Probablities are then converted into cumulative intervals to chose a
> # particular character:
> #
> # Eg: Probabilities  A: 0.4, C: 0.3, G: 0.2, T: 0.1
> #     Intervals      A: 0.4  C: 0.7, G: 0.9, T: 1.0
> #
> # This example can be pictured as
> #
> #   0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0
> #    +----+----+----+----+----+----
> +----+----+----+----+
> #    |         A         |      C       |    G    | T  |
> #    +----+----+----+----+----+----+----+----+----+----+
> #
> # .. now all we need to do is generate a random number between 0 and 1
> # and see in which interval it falls. This then points to the character
> to be
> # written to output.
> #
> # B. Steipe, July 2005. Public domain 
> <http://www.tek-tips.com/viewthread.cfm?qid=1095895&page=1#> code.
> # =====================================================================
>
> use strict;
> use warnings;
>
>
> my $N_Sequences = 100;                # Number of sequences to be
> generated
> my @Motif = split(//,'TTGACATATAAT'); # This example is simply a
> concatenation
>                                       # of the -35 and -10 consensus
> elements
>                                       # of prokaryotic promoters;
> replace with
>                                       # whatever ...
> my @Alphabet = split(//,'ACGT');      # Nucleotides; replace with
> whatever ...
> my $P_Consensus = 0.85;               # Replace with whatever ...
>
> # ====== Globals ==========================
> my @Probabilities;              # Stores the probability of each
> character
> # ====== Program ==========================
>
>
> for (my $i=0; $i < $N_Sequences; $i++) {
>     for (my $j=0; $j < scalar(@Motif); $j++) {
>
>         loadConsensusCharacter($Motif[$j]);    # Load the character for
> this position
>         addNoiseToDistribution();              # Modify probabilities
> according to noise model
>         convertToIntervals();
>         print(getRandomCharacter(rand(1.0)));  # Output character
>     }
> print "\n";
> }
>
> exit();
>
> # ====== Subroutines =======================
> #
> sub loadConsensusCharacter {
>     my ($char) = @_;
>     my $Found = 'FALSE';
>
>     for (my $i=0; $i < scalar(@Alphabet); $i++) {
>         if ( $char eq $Alphabet[$i]) { # found character in i_th
> position in Alphabet
>             $Probabilities[$i] = 1.0;
>             $Found = 'TRUE';
>         } else {
>             $Probabilities[$i] = 0.0;
>         }
>     }
>     if ($Found eq 'FALSE') {
>     die("Panic: Motif-Character\"$char\" was not found in Alphabet.
> Aborting.\n");
>     }
>
> return();
> }
>
> # ==========================================
> sub addNoiseToDistribution {
>
>     # This implements a very simple noise model:
>     # We work on an array in which one element is 1.0 and
>     # all others are 0.0. The element with 1.0 has the same
>     # index as the consensus character in the Alphabet.
>     #
>     # We change that value to the consensus probability and
>     # we distribute the remaining probability equally among
>     # all other elements.
>     #
>     # It should be straightforward to implement more elaborate
>     # noise models, or use a position specific scoring matrix
>     # or something else here.
>
>     my $P_NonConsensus = ( 1.0-$P_Consensus) / (scalar(@Alphabet) - 1);
>
>     for (my $i=0; $i < scalar(@Probabilities); $i++) {
>         if ( $Probabilities[$i] == 1.0 ) {     # ... this is the consensus
> character
>             $Probabilities[$i] = $P_Consensus;
>         } else {
>             $Probabilities[$i] = $P_NonConsensus;
>         }
>     }
>
>     return();
> }
>
> # ==========================================
> sub convertToIntervals {
>
>     my $Sum = 0;
>
>     # Convert the values of the probabilities array to the top
> boundaries
>     # of intervals having widths proportional to their relative
> frequency.
>     # Numbers range from 0 to 1.0  ...
>
>     for (my $i=1; $i < scalar(@Probabilities); $i++) {
>         $Probabilities[$i] += $Probabilities[$i-1];
>     }
>
>     return();
> }
>
> # ==========================================
> sub getRandomCharacter {
>
>     my ($RandomNumber) = @_;
>     my $i=0;
>
>     # Test which interval contains the RandomNumber ...
>     # The index of the interval points to the Event we should choose.
>
>     for ($i=0; $i < scalar(@Probabilities); $i++) {
>         if ($Probabilities[$i] > $RandomNumber) { last; }
>     }
>
>     return($Alphabet[$i]);
> }
>
> */Jose Maria Gonzalez Izarzugaza <biopctgi at yahoo.es>/* wrote:
>
>     Hello Alex,
>
>     I think that what you want is to modify long1 with short1, long2 with
>     short2 and so on.
>
>     I recommed you to replace your 2 loops with this one.
>
>     for ($seq=0;$seq
>     $short=$short[$seq];
>     $long=$long[$seq];
>     $offset = int(rand(length($long)%193));
>     substr($long,$offset,length($short),$short);
>     printf "%3d", $offset+1;
>     print "\n", $long, "\n";
>     }
>
>     Good Luck!
>     Txema
>
>
>     Alex Zhang wrote:
>
>     >Dear All,
>     >
>     >Sorry to bother you again.
>     >
>     >I have two txt files to handle. One is
>     >"short_sequences" and the other
>     >one is "long_sequences". The "short_sequences" holds
>     >100 short sequences (8 nucleotide long) and 100 long
>     >sequences (200 nucleotide long) in the
>     >"long_sequence".
>     >
>     >For example, the first short sequence is "TTGACATA"
>     >and the first long sequence is
>     >"GAATCATATATTAGTCTCCACATACTCCGTTCGTGACCCATTACCCTTTCGGGAGA
>     >GCCACAGCAACTGTAGATCTCGAAGTTGACAGGGGCAACTAGAGGCCTCAGAATTCT
>     >CACTCTTGAGGAGAGAAGTCTAAGACCTACAGTATGGTCGGGTTAGTTTTTGTTCCGTC
>     >GAACCTTGGACTAACCACTGTCTGGATA".
>     >
>     >Basically, I want to generate a random position as a
>     >starting site to replace a substring
>     >in the long sequence with a short sequence. In this
>     >example, we can choose a starting site
>     >as 5th nucleotide in the long sequence, after
>     >replacing using "TTGACATA", the replaced
>     >long sequence is
>     >"GAATTTGACATAAGTCTCCACATACTCCGTTCGTGACCCATTACCCTTTCGGGAGA
>     >GCCACAGCAACTGTAGATCTCGAAGTTGACAGGGGCAACTAGAGGCCTCAGAATTCT
>     >CACTCTTGAGGAGAGAAGTCTAAGACCTACAGTATGGTCGGGTTAGTTTTTGTTCCGTC
>     >GAACCTTGGACTAACCACTGTCTGGATA".
>     >
>     >Then I want replace the 2nd long sequence with the 2nd
>     >short sequence and then repeat this over and over
>     >again until the last long sequence is reached and
>     >replaced. I think the only problem is that the
>     >starting site should not be larger than 193.
>     >Otherwise, there are
>     >not enough nucleotides in the long sequence for
>     >replacement.
>     >
>     >Furthurmore, I want to keep track the starting
>     >replacement site for each long sequence.
>     >
>     >
>     >I am copying my code in the below.
>     >******************************************
>     >use strict;
>     >use warnings;
>     >
>     >my (@short, @long, $offset); # the 'short' array will
>     >hold the short
>     > #sequences while 'long'
>     >array the long sequences
>     >
>     >open(FILE1, '<', "short_sequences.txt") || die "Can't
>     >open short_sequences.txt: $!\n";
>     >while(){
>     > chomp;
>     > push(@short, $_);
>     >}
>     >close FILE1; #Close the file
>     >
>     >open(FILE2, '<', "long_sequences.txt") || die "Can't
>     >open long_sequences.txt: $!\n";
>     >while(){
>     > chomp;
>     > push(@long, $_);
>     >}
>     >close FILE2; #Close the file
>     >
>     >
>     ># replacement
>     >foreach my $short(@short){
>     > foreach my $long(@long){
>     > $offset = int(rand(length($long)%193));
>     > substr($long,$offset,length($short),$short);
>     > printf "%3d", $offset+1;
>     > print "\n", $long, "\n";
>     >
>     > }
>     >}
>     >********************************************
>     >
>     >But I just realized that there is a problem for the
>     >two
>     >loops. The problem is that each short sequence will be
>     >used to replace all long sequences not the
>     >corresponding one.
>     >
>     >So I seek your suggestions on how to handle two files
>     >simultaneously for my case.
>     >
>     >Thank you very much and look forward to your reply!
>     >
>     >Best Regards,
>     > Alex
>     >
>     >__________________________________________________
>     >Do You Yahoo!?
>     >Tired of spam? Yahoo! Mail has the best spam protection around
>     >http://mail.yahoo.com
>     >_______________________________________________
>     >Bioinformatics.Org general forum -
>     BiO_Bulletin_Board at bioinformatics.org
>     >https://bioinformatics.org/mailman/listinfo/bio_bulletin_board
>     >
>     >
>     >
>
>     _______________________________________________
>     Bioinformatics.Org general forum -
>     BiO_Bulletin_Board at bioinformatics.org
>     https://bioinformatics.org/mailman/listinfo/bio_bulletin_board
>
> __________________________________________________
> Do You Yahoo!?
> Tired of spam? Yahoo! Mail has the best spam protection around
> http://mail.yahoo.com
>