The FINDPATTERN language

The FINDPATTERN language was developed by the GCG developers to allow a more flexible description of nucleic acid and protein patterns.

                    Implied Sets and Repeat Counts

               Parentheses () enclose one or more symbols that can be repeated some
               number of times.  Braces {} enclose numbers that tell how many times
               the symbols within the preceding parentheses must be found.

               You can sometimes leave out part of an expression.  If braces appear
               without preceding parentheses, the numbers in the braces define the
               number of repeats for the immediately preceding symbol.  One or both
               of the numbers within the braces may be missing: a missing first
               number is taken as 0 and a missing second number as 350,000.  For
               instance, the pattern GATG{2,}A means GAT, followed by G repeated
               from 2 to 350,000 times, followed by A; the pattern GATG{}A means
               GAT, followed by G repeated from 0 to 350,000 times, followed by A;
               the pattern GAT(TG){,2}A means GAT, followed by TG repeated from 0
               to 2 times, followed by A.  (If the pattern in the parentheses is an
               OR expression (see below), it cannot be repeated more than 2,000
               times.)
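
               The defaults for missing repeat counts can be sketched in Python.
               This is only an illustration, not how FINDPATTERN itself is
               implemented: the names parse_repeat and repeat_regex are
               hypothetical, the translation to Python regular expressions is
               a stand-in, and treating a lone number such as {3} as "exactly
               3" is an assumption not stated in the text.

```python
import re

MAX_REPEAT = 350_000  # default upper bound quoted in the text above


def parse_repeat(braces):
    """Return (minimum, maximum) for a brace expression such as '{2,}'."""
    inner = braces.strip()[1:-1]
    if "," in inner:
        low, high = (part.strip() for part in inner.split(","))
    else:
        # Assumption: a lone number such as {3} means exactly 3 repeats;
        # empty braces {} fall through to the defaults 0 and MAX_REPEAT.
        low = high = inner.strip()
    return (int(low) if low else 0, int(high) if high else MAX_REPEAT)


def repeat_regex(unit, braces):
    """Build a Python regex fragment that repeats `unit` as `braces` asks."""
    low, high = parse_repeat(braces)
    return f"(?:{re.escape(unit)}){{{low},{high}}}"


# GATG{2,}A and GAT(TG){,2}A, assembled by hand for illustration
gatg = re.compile("GAT" + repeat_regex("G", "{2,}") + "A")
gattg = re.compile("GAT" + repeat_regex("TG", "{,2}") + "A")
```

               With these definitions, gatg.fullmatch("GATGGGA") succeeds and
               gatg.fullmatch("GATGA") fails, because the second sequence has
               only one G before the final A.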

                    OR Matching

               If you are searching nucleic acids, the ambiguity symbols defined in
               Appendix III let you define any combination of G, A, T, or C.  If
               you are searching proteins, you can specify any of several symbol
               choices by enclosing the different choices in parentheses and
               separating the choices with commas.  For instance, RGF(Q,A)S means
               RGF followed by either Q or A followed by S.  The choices need not
               be the same length, and there can be up to 31 different choices
               within each set of parentheses.  The pattern GAT(TG,T,G){1,4}A means
               GAT followed by any combination of TG, T, or G from 1 to 4 times
               followed by A.  The sequence GATTGGA matches this pattern (GAT, then
               TG, then G, then A).  There can be several sets of parentheses in a
               pattern, but parentheses cannot be nested.
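
               An OR expression behaves much like alternation in a regular
               expression.  The sketch below shows that correspondence in
               Python under stated assumptions: or_group_regex is a
               hypothetical helper, symbols are treated as literal characters
               (nucleic acid ambiguity symbols are not expanded), and the
               repeat count is appended by hand.

```python
import re

MAX_CHOICES = 31  # limit on choices per set of parentheses, from the text


def or_group_regex(group):
    """Turn an OR expression such as '(TG,T,G)' into '(?:TG|T|G)'."""
    choices = [c.strip() for c in group.strip()[1:-1].split(",")]
    if len(choices) > MAX_CHOICES:
        raise ValueError("more than 31 choices in one set of parentheses")
    return "(?:" + "|".join(re.escape(c) for c in choices) + ")"


# RGF(Q,A)S and GAT(TG,T,G){1,4}A, assembled by hand for illustration
protein = re.compile("RGF" + or_group_regex("(Q,A)") + "S")
nucleic = re.compile("GAT" + or_group_regex("(TG,T,G)") + "{1,4}A")
```

               Here protein.fullmatch("RGFQS") and nucleic.fullmatch("GATTGGA")
               both succeed, while protein.fullmatch("RGFGS") fails because G
               is not one of the listed choices.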

                    NOT Matching

               The pattern GC~CAT means GC, followed by any symbol except C,
               followed by AT.  The pattern GC~(A,T)CC means GC, followed by any
               symbol except A or T, followed by CC.
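
               NOT matching corresponds roughly to a negated character class
               in a regular expression.  The following Python sketch makes
               that mapping explicit; translate_not is a hypothetical helper,
               and it assumes uppercase sequences and single-symbol choices
               after the ~ sign, which is all the examples above require.

```python
import re


def translate_not(pattern):
    """Rewrite ~X and ~(X,Y) as negated character classes [^X] and [^XY]."""

    def negate_group(match):
        choices = [c.strip() for c in match.group(1).split(",")]
        # Assumption: each excluded choice is a single symbol.
        if any(len(c) != 1 for c in choices):
            raise ValueError("only single-symbol choices are handled here")
        return "[^" + "".join(choices) + "]"

    # First the parenthesized form ~(A,T), then the single-symbol form ~C.
    pattern = re.sub(r"~\(([^()]*)\)", negate_group, pattern)
    return re.sub(r"~([A-Za-z])", r"[^\1]", pattern)


not_c = re.compile(translate_not("GC~CAT"))        # GC[^C]AT
not_a_or_t = re.compile(translate_not("GC~(A,T)CC"))  # GC[^AT]CC
```

               For example, not_c.fullmatch("GCGAT") succeeds and
               not_c.fullmatch("GCCAT") fails.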

                    Begin and End Constraints

               The pattern <GACCAT is found only if it occurs at the beginning of
               the sequence range being searched.  Likewise, the pattern GACCAT> is
               found only if it occurs at the end of the sequence range.
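
               The begin and end constraints apply to the range being
               searched, not to the whole sequence.  A minimal Python sketch
               of that behavior follows; translate_anchors and search_range
               are hypothetical names, the range is given as 0-based
               positions with an exclusive end (a convention chosen for this
               example only), and the rest of the pattern is assumed to be
               plain symbols.

```python
import re


def translate_anchors(pattern):
    """Map a leading < to \\A and a trailing > to \\Z."""
    regex = pattern
    if regex.startswith("<"):
        regex = r"\A" + regex[1:]
    if regex.endswith(">"):
        regex = regex[:-1] + r"\Z"
    return regex


def search_range(pattern, sequence, begin, end):
    """Search sequence[begin:end]; return the match start or None."""
    window = sequence[begin:end]  # anchors refer to this window only
    match = re.search(translate_anchors(pattern), window)
    return None if match is None else begin + match.start()


seq = "GACCATTTGACCAT"
at_start = search_range("<GACCAT", seq, 0, len(seq))
at_end = search_range("GACCAT>", seq, 0, len(seq))
```

               Narrowing the range changes the result: search_range("<GACCAT",
               seq, 1, len(seq)) returns None, because the window no longer
               begins with GACCAT.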