Revision: 1.2
Committed: Sun Apr 22 17:40:35 2001 UTC (15 years, 5 months ago) by indraneel
Branch: MAIN
CVS Tags: HEAD_NEW, HEAD
Changes since 1.1: +1 -1 lines
Log Message:
License issues:
Dropped the binary and the GPL
Source was always under LGPL-2.1 (no changes here)

dnacgr version 0.2 copyright (c) Indraneel Majumdar <indraneel@123india.com>

/* Program to plot Chaos Game Representation of a DNA or RNA sequence. */

This file describes the application of chaos game theory in the visualisation of DNA and RNA sequence patterns. It also acts as a brief user's manual.

dnacgr is available at:
http://scorpius.iwarp.com
ftp://metalab.unc.edu/pub/linux/science/biology/

Code for printing to file in PNG format has been modified from code donated by Sergio Masci.

dnacgr has been tested only on linux-2.2.16, as I do not have any other OS to work on. If you have the resources and time, kindly consider porting the application. Please send your queries, comments, suggestions, bugfixes and features to the above address so that they can be put up on the primary site, if applicable.

Why DNACGR?
Looking through metalab's site I was astounded to find so much molecular biology software and so little of it in the /pub/linux directory (all of it is at ftp://metalab.unc.edu/pub/academic/biology, none of which is specifically for Linux). Also, I didn't notice any software that used CGR, even on the NCBI and EBI servers. I was learning C, so...

What is CGR?
The Chaos Game Representation (CGR) generates fractal structures using an iterated function system: the same set of equations is iterated to generate a pattern of points. The simplest form gives rise to what is known in mathematics as the 'Sierpinski triangle', named after the mathematician who described it (open the IFS compose box of 'The GIMP' to see this).

Briefly, this is what is done. Mark any three points (x, y, z) on a sheet of paper. Now select any other point and one of the three points, and mark a point halfway between them (this is P1). Select any of the three initial points (x, y, z) and mark P2 midway between it and P1. Select one of x, y, z again and mark P3 midway between it and P2, and so on. After a sufficient number of iterations a pattern (the Sierpinski triangle) will be formed.

However, when this is done with not three but four initial points (the corners of a square), and with equal probability of selecting any of them at each step, there is no pattern: the box is filled evenly. Any uneven selection of the four points at each step will therefore form a pattern, more so if the points are sometimes chosen in a particular sequence. For our program the four initial points are the DNA bases A, T, G and C (A, U, G, C for RNA also works), read from a DNA (or RNA) sequence file. Labelling the points A, T, G, C and selecting them in the order in which they appear in the sequence file gives interesting patterns, since nucleic acids generally have a lopsided probability distribution and contain repeated sequences.
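
The midpoint rule described above can be sketched in a few lines. This is an illustration in Python, not dnacgr's own C source. The function name cgr_points, the assignment of bases to corners and the starting point at the centre of the box are all assumptions made for this example.

```python
# Assumed corner layout on the unit square; dnacgr's real layout may differ.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(sequence, start=(0.5, 0.5)):
    """Return one (x, y) point per base, each halfway to that base's corner."""
    x, y = start
    points = []
    for base in sequence.upper():
        if base == "U":          # treat RNA's U like T
            base = "T"
        corner = CORNERS.get(base)
        if corner is None:       # assumption: skip anything that is not A, T, U, G, C
            continue
        # Move halfway from the current point towards the chosen corner.
        x = (x + corner[0]) / 2.0
        y = (y + corner[1]) / 2.0
        points.append((x, y))
    return points


if __name__ == "__main__":
    for point in cgr_points("ATGC"):
        print(point)
```

Each point depends most strongly on the most recent bases, which is why repeated subsequences show up as dense regions in the plot.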

USING DNACGR:

dnacgr runs in interactive mode and does not (yet) accept command line arguments. The shared libraries from readline, ncurses, svgalib, libpng and zlib should be in the library search path. On startup dnacgr shows the about screen; press any key to start. Enter '*' for a random plot, or enter the name of a sequence file. Readline assists by providing tab completion of file names.

Currently the fasta format is supported. The first line of the file is shown on screen and is not scored. The sequence is assumed to start at the first subsequent line that begins with 'a', 't', 'u', 'g', 'c' (in upper or lower case) or a blank. Any line between the first line and that line is discarded. You may therefore need to modify some sequence files by hand so that the sequence can be read (simply put an 'X' at the start of each line until the sequence begins).
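
The file reading rule above can be sketched as follows. This is a Python illustration and not dnacgr's actual reader: the function name read_sequence is hypothetical, "blank" is assumed to mean a line beginning with a space or tab, and characters other than A, T, U, G, C are assumed to be skipped once the sequence has started.

```python
def read_sequence(lines):
    """Return (header, bases) from an iterable of text lines."""
    lines = [line.rstrip("\r\n") for line in lines]
    if not lines:
        return "", ""
    header = lines[0]            # shown on screen, never scored
    started = False
    bases = []
    for line in lines[1:]:
        if not started:
            first = line[:1]
            # The sequence starts at the first line beginning with a base
            # letter or (assumed) a space/tab; earlier lines are discarded.
            if first and first in "atugcATUGC \t":
                started = True
            else:
                continue
        # Assumption: keep only base letters from sequence lines.
        bases.extend(ch for ch in line.upper() if ch in "ATUGC")
    return header, "".join(bases)


if __name__ == "__main__":
    demo = [">demo sequence", "XLOCUS line to ignore", "acgt", "TTGA"]
    print(read_sequence(demo))
```

A comment line such as "Comment" begins with 'C' and would be taken as the start of the sequence, which is why prefixing such lines with 'X' is suggested above.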

Enter probabilities and iterations for a random plot, or the first and last bases to check for a sequence file.

The probabilities of A, T, G and C in the plot are shown on the left, together with the total number of points plotted (a million points take about 2 seconds). The start and end bases are shown if a file has been read. (If you enter a first base number greater than the number of bases in the file, dnacgr shows the total number of bases present in the file.)
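
As a rough illustration of the two inputs described above, the sketch below draws a random sequence from user-supplied base probabilities and measures the proportions of A, T, G and C in a sequence. It is written in Python and does not reflect dnacgr's internals; the names random_sequence and base_proportions and the optional seed parameter are assumptions for the example.

```python
import random

BASES = "ATGC"


def random_sequence(probabilities, iterations, seed=None):
    """Draw `iterations` bases using the given per-base weights."""
    rng = random.Random(seed)
    weights = [probabilities.get(base, 0.0) for base in BASES]
    return "".join(rng.choices(BASES, weights=weights, k=iterations))


def base_proportions(sequence):
    """Return the fraction of A, T, G and C among the bases in `sequence`."""
    counts = {base: 0 for base in BASES}
    for ch in sequence.upper():
        if ch == "U":            # count RNA's U as T
            ch = "T"
        if ch in counts:
            counts[ch] += 1
    total = sum(counts.values())
    if total == 0:
        return {base: 0.0 for base in BASES}
    return {base: counts[base] / total for base in BASES}


if __name__ == "__main__":
    seq = random_sequence({"A": 0.4, "T": 0.3, "G": 0.2, "C": 0.1}, 10000, seed=1)
    print(base_proportions(seq))
```

With equal weights the resulting plot should fill the box evenly; skewed weights are one simple way to see a pattern emerge.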

The file name is shown below the plot, with the first line of the file below that.

f   select a new file to read
i   select the number of points to plot (if you entered 0 (for the
    whole file) last time, no points will be plotted when you choose
    '*' as the file name; you have to press 'i' to specify iterations)
p   select new probabilities (only for a random plot; you can't
    change the probability of bases in your gene (with dnacgr-0.1))
c   display the number of points in each grid box per 10'000 total
    points (only for sequence file reads)
g   toggle the grid above or below the plot
s   show the sequence of the last 4 bases (see below)
P   print to a PNG image file
R   replot or reread from file
r   refresh the screen
a   about screen and LGPL
q   quit (and report bugs)

The grid numbering helps to locate any point easily. Any two distinct points in the plot must differ somewhere in their last 9 bases (counting the base that produced the point), i.e. TTTTATTTTTTTT and TTTTGTTTTTTTT will be two different points, whereas TTTATTTTTTTTT and TTTGTTTTTTTTT will be plotted at the same point. This happens because the box is 600 x 600 pixels (which is between 2^9 and 2^10). The number of points in a box of the grid per 10'000 points plotted can, however, be seen by pressing 'c'. This helps if the sequence is highly repetitive and points tend to overlap. The last four bases leading to a point in any box of the grid can be seen by pressing 's'. The lines of the grid are 37.5 pixels apart (alternately 37 and 38), so each box encloses all points whose last four bases are the same. Similarity or patterns of base sequences (or the lack of them) can thus be identified very easily.
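
The pixel and grid arithmetic above can be checked with a short sketch. It is written in Python and is not taken from dnacgr: the corner layout, the start at the centre of the box, truncation to whole pixels and the names final_point, pixel, grid_box and counts_per_10000 are all assumptions. The sequences are expected to contain only A, T, U, G and C.

```python
from collections import Counter

BOX = 600        # side of the plot in pixels (from the text above)
DIVISIONS = 16   # 600 / 16 = 37.5 pixels between grid lines

# Assumed corner layout; dnacgr's real layout may differ.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def final_point(sequence):
    """Return the (x, y) point reached after playing the whole sequence."""
    x, y = 0.5, 0.5
    for base in sequence.upper().replace("U", "T"):
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
    return x, y


def pixel(point):
    """Map a point in the unit square to a pixel of the 600 x 600 box."""
    return int(point[0] * BOX), int(point[1] * BOX)


def grid_box(point):
    """Return the (column, row) of the 16 x 16 grid box holding the point."""
    return int(point[0] * DIVISIONS), int(point[1] * DIVISIONS)


def counts_per_10000(points):
    """Count points per grid box, scaled to 10'000 plotted points."""
    boxes = Counter(grid_box(p) for p in points)
    total = sum(boxes.values())
    return {box: 10000 * n / total for box, n in boxes.items()}


if __name__ == "__main__":
    for seq in ("TTTTATTTTTTTT", "TTTTGTTTTTTTT", "TTTATTTTTTTTT", "TTTGTTTTTTTTT"):
        print(seq, pixel(final_point(seq)))
```

Under these assumptions the first two sequences land on different pixels and the last two on the same pixel, matching the example in the text. A base that lies k steps back shifts the point by at most 600 / 2^k pixels, which drops below one pixel at k = 10. Four bases fix the first four binary digits of each coordinate, hence the 16 x 16 grid.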

References:

1) Chaos game representation of gene structure - Jeffrey, H. Joel - Nucleic Acids Research, Vol 18, No 8, 2163 - 1990
2) Mathematical characterization of chaos game representation: new algorithms for nucleotide sequence analysis - Chitra Dutta and Jyotirmoy Das - Journal of Molecular Biology 228, 715-719 - 1992

Sequences I have used to test this software were from:

ftp://metalab.unc.edu/pub/academic/biology/molbio/data/
ftp://ftp.ebi.ac.uk/pub/databases/embl/genomes/

(contigs of human chromosome 22 give a very nice pattern)
(.fasta and .seq files are easy to read)

Now that you know how to use dnacgr, look at your genes and find out more about yourself. Good luck ;-)

TODO
lots of things:
    command line operation
    converting to 3D view
    handlers for different sequence formats
    porting to Mesa (?)
    ...