Hi All,

Version 3 of tacg is feature-complete and ready for testing on as many platforms as you can run it on. It has been tested to various levels on Linux (Intel, Alpha), Solaris, and IRIX. It should compile on anything that supports gcc.

Unfortunately, 'feature-complete' doesn't include a lot of things I was thinking about, including graphic restriction maps, XML output, more protein analyses, etc. That'll be next...

I'm being overwhelmed with other obligations and consulting, so the final polishing, testing, and especially documenting are not going as fast as I thought they would (currently only tacg3.man.html and tacg.1 in the Docs subdirectory can give you any hint of what it does).

BUT it is not GPL. That's not to take anything away from RMS, nor to say that someday it won't be, but for the foreseeable future it'll be 'mostly free', as defined in 'tacg.h'. What that means for Loci is that as long as Loci is freely available, tacg can be bundled with it at no charge. I think that makes it advantageous to everyone (but I'm willing to be convinced otherwise)...

I thought the Loci group would be a good one to release it to first, as you're all pretty savvy about compilation and you might be able to give functional feedback instead of 'it doesn't work'.

Here's the promo blurb - let me know what it dies on or how it can be made better.

tacg is a command-line program that performs many of the common routines in pattern matching in biological strings. It was originally designed for restriction enzyme analysis, and while that still forms the core of the program, it has been expanded to fill more roles, sort of a 'grep' for DNA. However, it is also more than that: a brief description of its abilities is at the bottom of this announcement.

What it is not:

- It is not a point-and-click graphical user program like Lasergene, MacVector, or even GCG's SeqLab.
- It is not free, nor public domain, nor GPL, nor Open Source.
  It remains the intellectual property of tacg Informatics. That said, the distribution policy is meant to encourage widespread use. It is freely available to you in both binary AND SOURCE CODE for personal use and examination, with the only restriction being that you may not incorporate it into other software that is sold for profit without licensing it for that purpose. I will be happy to consider making specific exemptions for copies to be included with certain distributions of Linux and other collections of software.

What it is:

- It is a command-line tool, albeit one designed for humans. It has compiled-in brief help hints, a decent man page (and an expanded version thereof in HTML), and a full-length manual as well.
- It was written in keeping with the philosophy that most large-scale bioinformatics will be done in a pipeline, and therefore the tools for those analyses should support pipelines as much as possible. It is also highly applicable to Web interfaces, and an updated one will be made available shortly.
- It was also written for the analysis of genomic DNA, so it uses dynamic memory for almost all internal storage and is therefore not limited to a specific sequence length.
- It is written in ANSI C, using standard libraries or supplied code, so that it compiles and runs on all the unix variants that I've tried: Linux (PPC, Intel, Alpha (64-bit)), DEC Unix (64-bit), IRIX (32/64-bit), Solaris, ConvexOS, HP/UX, NeXTStep.
- It also runs on Win95/NT (compiled with the amazing Cygwin tools), DOS (with the DJGPP extender), and has been compiled for the Mac as part of Don Gilbert's SeqPup application, although the DOS and Mac versions have not yet been compiled as I write this.
- It is meant to be extended to include your own functions. An example skeleton function is included as a guide, so that even relative newcomers to the field should be able to plug in analytical code to extend it. This code is, of course, YOUR Intellectual Property.
- Perhaps you can think of it as the bioinformatics equivalent of a Husqvarna chainsaw with all the guards and brakes removed :) Savage but effective.

It supports: [* = new or improved]

- searching arbitrarily large nucleic acid strings (dynamic memory allows searching whole bacterial genomes, or eukaryotic chromosomes (or genomes), in one sweep)
- handles sequences as small as 5 bases for analysis of linkers, oligos
- handles both circular and linear DNA correctly
- allows subsequences to be extracted from larger sequences, generates both subsequence numbering and original sequence numbering
* integrated with Jim Knight's Seqio to allow automatic sequence conversion on input, and scanning of multiple sequence databases (or not - it still supports the ability to assume that ALL input is sequence, which is useful for analyzing file fragments or editor buffers)
- FAST (5-35X the speed of equivalent routines in GCG) pattern matching of nucleic acids up to about 30 bases
- simultaneous searching of thousands of patterns read from a database or a few explicit patterns read from the command line
- searching with errors
- searching for patterns containing IUPAC degeneracies in strings which also contain IUPAC degeneracies
* searching for regular expressions (in nucleic acid), with autoconversion of IUPAC degeneracies to the appropriate regex
* searching for TRANSFAC matrices, with user-specified cutoffs
- GCG-style ladder maps
* gel simulations with low and high end cutoffs for expansion
* selection of Restriction Enzymes explicitly, by overhang generated, magnitude of recognition site, price, minimum and maximum number of cuts (overall or on a per-pattern basis)
* supports Combination Cuts of up to 15 REs at a time
* supports limited AFLP fragment matching / simulation
* simulates Dam and/or Dcm methylation of DNA
- generates summaries of number of patterns found, Sites, Fragments (sorted/unsorted)
* searches for silent sites, with reverse translation
- Full Linear Maps with enzyme cuts marked AT THE POINT OF CUTTING, not at the beginning of the pattern, with double/single strand selection
- co-translation of DNA, based on a (user-expandable) number of Codon usage tables, in 1, 2, or 6 frames
* ORF finding in any combination of frames with FASTA output, with offsets in DNA, protein, Molecular Wt, pI, and optional additional info on AA frequency in numbers or %
- dump of internal data for analysis / plotting with external plotting programs in 3 formats, incl. gnuplot
* conditional output based on matches, for scanning large numbers of sequences at a time
* 2 types of Proximity matching:
    1 - exact specification of the relationships of 2 patterns (upstream, downstream, by how much, within/outside of a range)
    2 - rules for arbitrarily complex relationships among many patterns (in a sliding window or in the whole sequence), with logical AND, OR conjunctions joining pattern specifications (if people want the FULL logical connectives (XOR, NAND, NOR, etc), it's trivial to add them) (unbiased editorial comment :) - this is cool!)
* sequence extraction surrounding pattern matches, with variable upstream, downstream inclusion, optional reverse translation, in FASTA format
* uses autoconf/configure to ease building on different platforms
* includes an explicit example function to show how to add your own functionality
- (mostly) free
- source code included

And you can get it from:

http://24.1.175.29/tacg/Beta/tacg-latest-beta.tar.gz

Instructions:

    tar -xzvf tacg-latest-beta.tar.gz
    cd tacg; ./configure; make

The files in the Data subdir need to go in the current directory, your home directory, '/usr/local/lib/tacg', or wherever you define the environment variable TACGLIB to be. (No 'make install' just yet - put the bits where you want them.)

--
Cheers,
Harry

Harry J Mangalam -- (949) 856 2847 -- mangalam at home.com
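
Until there is a 'make install', the data-file step in the instructions above could be scripted along these lines. This is only a sketch under assumptions: the function name install_tacg_data and the target directory in the usage line are made up for illustration, and only the Data subdirectory and the TACGLIB variable come from the announcement itself.

```shell
#!/bin/sh
# Hypothetical helper (not part of tacg): copy the contents of the Data
# subdirectory to a directory of your choice and point TACGLIB at it.
# Usage: install_tacg_data SRC_DATA_DIR TARGET_DIR
install_tacg_data() {
    src=$1
    dest=$2

    # Refuse to continue if the source directory is missing.
    [ -d "$src" ] || { echo "no such data directory: $src" >&2; return 1; }

    # Create the target and copy everything in the source directory into it.
    mkdir -p "$dest" || return 1
    cp -R "$src"/. "$dest"/ || return 1

    # Tell tacg where the files now live (current shell and its children).
    TACGLIB=$dest
    export TACGLIB
}
```

Run from the unpacked tacg directory, a call might look like `install_tacg_data ./Data "$HOME/lib/tacg"`. The export only lasts for the shell that runs the function, so TACGLIB would also need to be set in your shell startup file to make it stick.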