[Pipet Devel] New version of tacg: (mostly) free sequence analysis app

Harry Mangalam mangalam at home.com
Fri Nov 5 15:56:34 EST 1999

Hi All,

Version 3 of tacg is feature-complete and is ready for testing on as many
platforms as you can run it on.  It has been tested to various levels on
Linux (Intel, Alpha), Solaris, and IRIX.  It should compile on anything that
supports gcc.

Unfortunately, 'feature-complete' doesn't include a lot of things I was
thinking about, including graphic restriction maps, XML output, more
protein analyses, etc, etc, etc.  That'll be next...

I'm being overwhelmed with other obligations and consulting and so the
final polishing, testing, and especially documenting is not going as fast
as I thought it would (currently only tacg3.man.html and tacg.1 in the Docs
subdirectory can give you any hint of what it does).

BUT it is not GPL.  Not to take anything away from RMS and it's not to say
that someday it won't be, but for the foreseeable future, it'll be 'mostly
free', as defined in 'tacg.h'.  But what that means for Loci is that as
long as Loci is freely available, tacg can be bundled with it for no
charge.  I think that makes it advantageous to everyone (but I'm willing to
be convinced otherwise)...

I thought the Loci group would be a good group to release it to first as you're all
pretty savvy about compilation and you might be able to give functional
feedback instead of 'it doesn't work'.  

Here's the promo blurb - let me know what it dies on or how it can be made better.

   tacg is a command-line program that performs many of the common routines in
pattern matching in biological strings.  It was originally designed for
restriction enzyme analysis and, while that still forms a core of the program,
it has been expanded to fill more roles, sort of a 'grep' for DNA.  However, it
is also more than that: a brief description of its abilities is at the bottom of
this announcement.

What it is not:

- It is not a point and click graphical user program like Lasergene,
  or even GCG's SeqLab.  
- It is not free, nor public domain, nor GPL, nor Open Source.  
  It remains the intellectual property of tacg Informatics.  
  However, that said, the distribution policy is meant to encourage
  use.  It is freely available to you in both binary AND SOURCE CODE for
  personal use and examination, with the only restriction being that you
  not incorporate it into other software that is sold for profit without
  licensing it for that purpose.  I will be happy to consider making
  exemptions for copies to be included with certain distributions of Linux or
  other collections of software.

What it is:

- It is a command-line tool, albeit one designed for humans.  It has
  brief help hints, a decent man page (and expanded version thereof in the
  HTML version) and a full length manual as well.
- It was written congruent to the philosophy that most large-scale analyses
  will be done in a pipeline and therefore the tools for those analyses should
  support pipelines as much as possible.  It is also highly applicable to
  interfaces and an updated one will be made available shortly.  
- It was also written for the analysis of genomic DNA, so it uses dynamic
  memory for almost all internal storage and therefore it is not limited to a
  fixed sequence length.  
- It is written in ANSI C, using standard libraries or supplied code, so
  it compiles and runs on all unix variants that I've tried: Linux (PPC,
  Alpha (64-bit)), DEC Unix(64bit), IRIX(32/64bit), Solaris, ConvexOS,
- It also runs on Win95/NT (compiled with the amazing Cygwin tools), DOS
  (with the DJGPP extender), and has been compiled for the Mac as part of Don
  Gilbert's SeqPup application, although the DOS and Mac versions have not yet
  been compiled as I write this.
- It is meant to be extended to include your own functions.  An example
  function is included as a guide, so that even relative newcomers to the
  should be able to plug in analytical code to extend it.  This code is, of
  course, YOUR Intellectual Property.
- perhaps you can think of it as the bioinformatics equivalent of a
  chainsaw with all the guards and brakes removed :)  
  Savage but effective.

It supports:            [* = new or improved]
- searching arbitrarily large nucleic acid strings (dynamic memory
   allows searching whole bacterial genomes, or eukaryotic chromosomes (or
   genomes) in one sweep)
- handles sequences as small as 5 bases for analysis of linkers, oligos
- handles circular and linear DNA appropriately, correctly
- allows subsequences to be extracted from larger sequences, generates both
   subsequence #ing and original sequence #ing
* integrated with Jim Knight's Seqio to allow automatic sequence conversion
   on input, and scanning multiple sequence databases (or not - it still 
   supports the ability to assume that ALL input is sequence - this is useful
   for analyzing file fragments or editor buffers).
- FAST (5-35X equiv routines in GCG) pattern matching of nucleic acids 
   up to about 30 bases
- simultaneous searching of thousands of patterns read from a database or a
   few explicit patterns read from the command-line
- searching with errors
- searching for patterns containing IUPAC degeneracies in strings which
   also contain IUPAC degeneracies
* searching for regular expressions (in nucleic acid), with autoconversion
   of IUPAC degeneracies to the appropriate regex.
* searching for TRANSFAC matrices, with user-specified cutoffs
- GCG-style ladder maps
* gel simulations with low and high end cutoffs for expansion
* selection of Restriction Enzymes explicitly, by overhang generated,
   magnitude of recognition site, price, minimum, maximum number of cuts
   (overall or on a per-pattern basis)
* supports Combination Cuts of up to 15 REs at a time
* supports limited AFLP fragment matching / simulation
* simulates Dam and/or Dcm methylation of DNA
- generates summaries of # of patterns found, Sites, Fragments
* searches for silent sites, with reverse translation
- Full Linear Maps with enzyme cuts marked AT THE POINT OF CUTTING, not
   the beginning of the pattern, with double/single strand selection
- co-translation of DNA, based on a (user-expandable) number of Codon usage 
   tables, in 1 2 or 6 frames
* ORF finding in any combination of frames with FASTA output, with
   offsets in DNA, protein, Molecular Wt, pI, with optional additional info
   on AA  frequency in #s or %
- dump of internal data for analysis / plotting with external plotting
   programs in 3 formats, incl gnuplot.
* conditional output based on matches, for scanning large numbers of
   sequences at a time
* 2 types of Proximity matching:
   1 - exact specification of the relationships of 2 patterns (upstream,
        downstream, by how much, within/outside of a range)
   2 - specify rules for arbitrarily complex relationships among many
        patterns (in a sliding window or in the whole sequence) with logical
        conjunctions (if people want the FULL logical connectives (XOR, NAND,
        NOR, etc) it's trivial to add them) joining pattern specifications
        (unbiased comment :) - this is cool!)
* sequence extraction surrounding pattern matches, with variable upstream,
   downstream inclusion, optional reverse translation in FASTA format.
* uses autoconf/configure to ease building on different platforms.
* includes an explicit example function to show how to add your own functions
- (mostly) free
- source code included
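
To give a feel for what "autoconversion of IUPAC degeneracies to a regex" in
the list above involves, here is a minimal sketch in Python.  It is not tacg's
code (tacg is written in C) and the function names iupac_to_regex and
find_sites are made up for illustration.  It only handles a degenerate pattern
searched against a non-degenerate sequence on one strand; degenerate sequences,
errors and reverse complements are left out.

```python
import re

# Standard IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}

def iupac_to_regex(pattern):
    """Translate an IUPAC pattern into an equivalent regular expression.

    Hypothetical helper for illustration; not part of tacg.
    """
    parts = []
    for code in pattern.upper():
        bases = IUPAC[code]  # raises KeyError on a non-IUPAC character
        # Unambiguous bases stay literal; degenerate codes become a class.
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)

def find_sites(pattern, sequence):
    """Return the 0-based start of every match, overlapping ones included."""
    # Wrapping the pattern in a lookahead makes each match zero-width, so
    # the scan advances one base at a time and overlaps are not skipped.
    regex = re.compile("(?=" + iupac_to_regex(pattern) + ")")
    return [m.start() for m in regex.finditer(sequence.upper())]

if __name__ == "__main__":
    # GANTC is the HinfI recognition sequence.
    print(iupac_to_regex("GANTC"))
    print(find_sites("GANTC", "AAGATTCGAGTCTT"))
```

The positions returned are pattern starts, not cut points; mapping a match to
the point of cutting needs the enzyme's cut offset as well.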

And you can get it from:


tar -xzvf tacg-latest-beta.tar.gz
cd tacg; ./configure; make

the files in the Data subdir need to go in the current directory, your home
directory, or in '/usr/local/lib/tacg' or wherever you define the
environment var TACGLIB to be.
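
As a rough picture of that lookup, here is a small Python sketch.  The order
in which the directories are tried, and the reading that TACGLIB replaces
'/usr/local/lib/tacg' when it is set, are assumptions made for the example;
the function names and the file name 'example.data' are hypothetical too.

```python
import os

def candidate_data_dirs(environ=None, cwd=None, home=None):
    """List the directories to search for data files, in an ASSUMED order."""
    environ = os.environ if environ is None else environ
    cwd = os.getcwd() if cwd is None else cwd
    home = os.path.expanduser("~") if home is None else home
    # Assumption: TACGLIB, when set, stands in for the default lib directory.
    libdir = environ.get("TACGLIB") or "/usr/local/lib/tacg"
    return [cwd, home, libdir]

def find_data_file(name, dirs):
    """Return the first existing path for `name` within `dirs`, else None."""
    for directory in dirs:
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            return path
    return None

if __name__ == "__main__":
    # 'example.data' is a placeholder, not a real tacg data file name.
    print(find_data_file("example.data", candidate_data_dirs()))
```

If the lookup comes back empty, copying the contents of Data into one of the
listed directories (or pointing TACGLIB at it) should be the fix.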

(no make install just yet - put the bits where you want them)


Harry J Mangalam -- (949) 856 2847 -- mangalam at home.com
