[BiO BB] Matching and Filtering -- try grep- thanks

Harry Mangalam hjm at tacgi.com
Mon Nov 17 16:08:38 EST 2003

Try this.  It's named scut because it can act as a 'super cut' and can also
do a lot of slicing/dicing scut work for you.

I wrote it to do exactly what (I think) you're describing.  Here's the description:

Usage: scut [options, below] > output_file
   --f1=[file1]    - the shorter or 'needle' file.  If using as a smarter cut,
                    use STDIN.
   --f2=[file2]    - the 'longer' or 'haystack' file

   --k1=col#       - the key column from file1 (numbered from ZERO, not 1)
                      i.e. the number of the column (starting from 0) that
                      has the key column name for file1 (see example below)
   --k2=col#       - the key column from file2 (ditto)

   --c1='# # ..'   - the numbers of the columns from file1 that you want
      or              printed out, in the order in which you want them.  If
   --c1='A C F ..'    you DON'T want any columns from the file, just
                      enter it as '' (2 single quotes) or omit it
                      completely.  If you want the whole line, type 'all'.
                      1) #s are split on whitespace, not commas.
                      2) scut also supports Excel-style column specifiers
                      (A B F AD BG etc) for up to 78 columns (->BZ).  If you
                      want more, add them to the %excel_ids hash in the script
                      or create an algo that does it right.

   --c2='# # ..'   - ditto for file2
   --c2='A C F ..'

   --id1='...'     - the delimiter string for file1; defaults to whitespace
                     (specify TAB = '\t'), but can be a multicharacter string
                     as well such as '_|_'

   --id2='...'     - ditto for file2

   --od='...'      - the delimiter string for the output (defaults to TAB)

   --noerr         - stops most stderr from being generated.  For large files,
                      most of the CPU is dedicated to processing the STDERR text
                      stream (thanks for stressing it, Peter), but if you need
                      this output, you'll just have to deal with it.

   NB: the following 3 options: --begin, --end, --excl currently only work with
   the single file version (as a smarter cut, not the merging functions).
   Stay tuned for the 2 file version..

   --begin=[#|regex] - specifies the line to START processing data at (for
                       example, if the file has 2 format sections and you only
                       want to process one of them).  The option can be either
                       an integer value to specify the line number, or a non-
                       repeating regular expression that unambiguously identifies
                       the line.

   --end=[#|regex] - as above, but specifies the line to STOP processing data at.

   --excl          - if added to the arguments, excludes the lines specified by
                       --begin and --end (in case you need to exclude the
                       defining header lines).

   --version       - gives the version of the software and dies.

   --nocase        - makes the merging key case INSENSITIVE.

   --sync          - whether you want the output sync'ed on file2.  The sync
                     will insert blank lines where there are comments as well.
   --help          - dumps these lines to stdout and dies.
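
On the "create an algo that does it right" remark for the Excel-style column
specifiers: one way to avoid a lookup table is to treat the letters as a
base-26 number in which A counts as 1.  The sketch below is not taken from
scut; the function name excel_col_to_index is made up for illustration, and
it assumes that 'A' corresponds to column 0, in line with scut's
zero-based numbering.

```shell
#!/bin/sh
# Hypothetical helper, not part of scut: convert an Excel-style column
# specifier (A, B, ..., Z, AA, AB, ...) to a zero-based column index.
# Assumption: 'A' means column 0.
excel_col_to_index() {
    awk -v spec="$1" 'BEGIN {
        letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        n = 0
        for (i = 1; i <= length(spec); i++) {
            # Position of this letter in the alphabet (A=1 ... Z=26), 0 if absent.
            pos = index(letters, toupper(substr(spec, i, 1)))
            if (pos == 0) exit 1      # reject anything that is not a letter
            n = n * 26 + pos          # each extra letter shifts by a factor of 26
        }
        if (n == 0) exit 1            # reject an empty specifier
        print n - 1                   # shift to zero-based numbering
    }'
}

# Example call:
#   excel_col_to_index AD
```

Because nothing is enumerated, the same arithmetic keeps working past BZ.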


  = there have to be the same number of columns in each line or it will get
  confused.  The matches are case-sensitive, unless you use the '--nocase'
  option to turn it off.

  = scut sends its output to stdout, so if you want to catch the output in a
  file, use redirection '>' (see below) and if you want to catch the stderr
  you'll have to catch that as well ( >& out ).

  = scut ignores any line that starts with a '#', so you can document what
  the columns mean, add column numbering, etc, as long as those lines start
  with a '#'

  = scut always puts the matched key in the 1st column of the output

  = under Win/DOS execution, you will probably need to run it with the perl
    prefix, i.e. perl scut [options], and will also have to enclose the option
    strings in DOUBLE QUOTES ("opts") instead of single quotes ('opts').
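
If you only want the basic needle/haystack match and nothing else, the idea
can be sketched with awk.  This is not how scut is implemented, just a
minimal illustration of matching on a key column.  The function name
filter_by_key is made up, and the sketch assumes the key is the first
column of both files, that the haystack file is tab-delimited, and that
lines starting with '#' are comments.

```shell
#!/bin/sh
# Hypothetical helper, not part of scut: print each line of a tab-delimited
# "haystack" file whose first column exactly equals a key listed in the
# "needle" file (one key per line, or key in the first tab-separated column).
filter_by_key() {
    needle=$1
    haystack=$2
    awk -F'\t' '
        # Skip comment lines in both files.
        /^#/ { next }
        # First file: remember every non-empty key, print nothing.
        FILENAME == ARGV[1] { if ($1 != "") keys[$1] = 1; next }
        # Second file: print the line only if its key was remembered.
        $1 in keys
    ' "$needle" "$haystack"
}

# Example call:
#   filter_by_key file1.txt file2.txt > file3.txt
```

Since the comparison is on the whole key field, one accession number that
is a prefix of another will not produce a false match.  Stray carriage
returns or trailing spaces in the key file would stop keys from matching,
so clean those up first if the file came from Windows or a spreadsheet.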

Pooja Jain wrote:
> Hi Dmitri I Gouliaev ,
> Thank you for your suggestion. I followed the grep man pages and used 
> grep -f  and it worked.
> grep  -f 'file1.txt'  file2.txt  > file3.txt
> Where file1.txt has the list of accession numbers corresponding to which I
> would like to filter the details from file2.txt. But the above command
> writes the contents of the file2.txt to file3.txt.
> thanks again.
> Regards,
> -Pooja
>>Hi, Pooja Jain !
>> On Mon, Nov 17, 2003 at 11:15:10AM -0000, Pooja Jain wrote:
>>>I have a txt file with a list of accession numbers for a few of the
>>>sequences from the entire Arabidopsis thaliana genome. I have another
>>>tab-delimited txt file with the accession numbers and other details for
>>>every sequence present in the genome (one per row). From this latter
>>>file I want to filter the details for only those sequences which have
>>>the same accession numbers as in the former file.
>>>Could someone please suggest a simple way to do this matching and
>>>filtering? I tried using simple shell commands like cmp and
>>>diff but none of them worked. Is there any other command I can use in
>>>the shell? Any way to do this with perl is also welcome.
>>From man pages:
>>    grep, egrep, fgrep - print lines matching a pattern
>>You should use grep.
>>    file-with-a-list is a txt file with a list of accession numbers
>>    file-with-all-the-details is the other file,
>>then this shell one-liner
>>    user@host$ cat file-with-a-list \
>>               | while read AN ; do \
>>                   grep "^$AN" file-with-all-the-details ; \
>>                 done >> file-with-the-details-for-the-listed-accnum
>>should work for you (if the accession numbers are at the beginning of the
>>lines in the "other" file).  If they are not, but there are some
>>white-space characters at the beginning of each line, then change "^$AN"
>>to "[[:space:]]$AN" (with quotation marks).
>>Hope this helps,
>>DIG (Dmitri I GOULIAEV)        http://www.bioinformatics.org/~dig/
>>1024D/63A6C649: 26A0 E4D5 AB3F C2D4 0112  66CD 4343 C0AF 63A6 C649
>>BiO_Bulletin_Board maillist  -  BiO_Bulletin_Board at bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/bio_bulletin_board
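
On the grep -f approach quoted above: if file3.txt ends up containing all
of file2.txt, one possible cause is a blank line in file1.txt, because an
empty pattern matches every line.  Plain grep -f also treats each accession
number as a regular expression and will match it as a substring of a longer
one.  A more defensive variant is sketched below.  The function name
grep_by_list is made up, and it assumes mktemp is available on your system.

```shell
#!/bin/sh
# Hypothetical wrapper around grep: filter a details file by a list of
# accession numbers, one per line.
grep_by_list() {
    list=$1
    details=$2
    cleaned=$(mktemp) || return 1
    # Drop blank lines from the list: an empty pattern would match everything.
    grep -v '^[[:space:]]*$' "$list" > "$cleaned"
    # -F: patterns are fixed strings, not regular expressions
    # -w: match whole words only, so a key does not match inside a longer key
    # -f: read the patterns from a file
    grep -F -w -f "$cleaned" "$details"
    status=$?
    rm -f "$cleaned"
    return $status
}

# Example call:
#   grep_by_list file1.txt file2.txt > file3.txt
```

This still matches an accession number anywhere on the line, not only in
the key column, so it is only safe when the numbers cannot appear in the
other fields.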

Cheers, Harry
Harry J Mangalam - 949 856 2847 (v&f) - hjm at tacgi.com
             <<plain text preferred>>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: scut
URL: <http://www.bioinformatics.org/pipermail/bbb/attachments/20031117/344692f1/attachment.ksh>
