Revision: 8
Committed: Mon Mar 22 22:11:25 2010 UTC (12 years, 6 months ago) by gpertea
File size: 11295 byte(s)
Log Message:
added cdbfasta source files

CDB (Constant DataBase) indexing and retrieval tools for FASTA files
====================================================================

This is a brief introduction to a pair of platform-independent, file-based
hashing tools (cdbfasta and cdbyank) that can be used to create indices for
quick retrieval of any particular sequence from large multi-FASTA files. The
latest version has the option to compress data records in order to save space.
The index files are now architecture independent: the same index file can be
created and used on many different Unix platforms (32-bit or 64-bit,
big-endian or little-endian architectures) and even on Windows.

1.Install instructions
2.Typical usage
3.Retrieving sequence ranges or only the defline
4.Data compression option
5.Development notes

1.Install instructions
======================
Before running 'make' in the source directory, please take a look at
the Makefile and note the following:

 * GCLDIR must point to the directory containing the gclib source
   files (these should already be included in this source package as a
   subdirectory)
 * in order to support record compression, change the BASEFLAGS variable
   (default is: no compression support)
 * if compression was enabled, ZDIR should point to the directory where the
   zlib library (libz.a and all the zlib header files like zlib.h) can be
   found. This is only needed if your system does not have the zlib library
   installed already (most systems do). If you get zlib-related errors when
   you try to compile cdbfasta, you might have to download zlib and
   install/build it in a directory that should then be specified as ZDIR in
   the Makefile

Running 'make' should produce the binaries 'cdbfasta' (the indexer program)
and 'cdbyank' (the query program) in the current directory.

2.Typical usage
===============

Use cdbfasta to create the index file for a multi-FASTA file and cdbyank to
pull records based on that index file. A usage message is displayed if
cdbfasta or cdbyank is run without any parameters.

In order to create an index file, only the name of the fasta file must be
provided:

    cdbfasta <fasta_file>

The fasta file can be specified with its full path (if it is not in the
current directory), e.g.

    cdbfasta /usr/local/db/GUDB.human

By default cdbfasta creates an index file with the same name as the database
file but with the .cidx suffix appended to the original name. So in the
example above, a file GUDB.human.cidx will be created in /usr/local/db/. By
default, the key for a FASTA record is the first space-delimited token
following the ">" starting character of the definition line. For example, if
a FASTA record had a defline like this:

    >AA141526

then the string 'AA141526' can be used with cdbyank to retrieve the full FASTA
record associated with that sequence name:

    cdbyank -a 'AA141526' /usr/local/db/GUDB.human.cidx
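
The idea behind the index can be illustrated with a small in-memory sketch.
This is not the cdbfasta implementation or its file format: the function names
(default_key, build_index, fetch) are hypothetical, the index is a plain
Python dict rather than a cdb hash file, and keeping only the first record for
a repeated key is an assumption made for brevity.

```python
def default_key(defline):
    """Return the first space-delimited token after the leading '>'."""
    return defline[1:].split(" ", 1)[0]


def build_index(data):
    """Map each record key to the (offset, length) of its record in data."""
    # A record starts at a '>' that opens the data or follows a newline.
    starts = [0] if data.startswith(b">") else []
    pos = data.find(b"\n>")
    while pos != -1:
        starts.append(pos + 1)
        pos = data.find(b"\n>", pos + 1)

    index = {}
    for start, end in zip(starts, starts[1:] + [len(data)]):
        defline = data[start:end].split(b"\n", 1)[0].decode("ascii").rstrip("\r")
        # Assumption for this sketch: the first record seen for a key wins.
        index.setdefault(default_key(defline), (start, end - start))
    return index


def fetch(data, index, key):
    """Return the full record bytes for key, or None if it is not indexed."""
    span = index.get(key)
    if span is None:
        return None
    offset, length = span
    return data[offset:offset + length]
```

Because only an offset and a length are stored per key, retrieval is a lookup
followed by a single slice (a seek and a read on a real file), no matter how
large the multi-FASTA file is.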

Sometimes all the space-delimited tokens in the defline need to be declared as
keys in the index file, all pointing to the same fasta record. This can be
accomplished with the "-m" switch of cdbfasta.

For long and complex fasta accessions like this:

    EGAD|61|GP|186739|gb|AAA63210.1||M60828

there is an option to create the index file in such a way that cdbyank does
not need the full string in order to retrieve the sequence: the first
"<db>|<accession>" pair (i.e. a substring ending at the second '|' character,
EGAD|61 in the example above) is enough. To enable this feature, there are
two alternative options for cdbfasta:

  -c : the index file is built by storing only the "shortcut key" (the first
       "db|accession" pair found in the defline of each fasta record). In this
       case, cdbyank will only accept these "shortcut" accessions for
       record retrieval.

  -C : the index file is built by storing both the "shortcut key" and the full
       key (which is considered to end at the first space character in the
       defline). In this case, two strings are stored as keys for each fasta
       record, so either of them can be used as an accession to retrieve the
       same record with cdbyank.
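
The key-selection rules described above can be summarized in a short sketch.
The helper names (shortcut_key, keys_for) and the "mode" argument are
hypothetical and only mirror the switch letters; how cdbfasta treats a key
with fewer than two '|' characters is not stated here, so the sketch simply
keeps such a key whole.

```python
def shortcut_key(full_key):
    """Return the part of full_key before its second '|' (e.g. 'EGAD|61')."""
    return "|".join(full_key.split("|")[:2])


def keys_for(defline, mode=""):
    """Return the keys that would point to this record.

    mode "" : first token only (default behaviour)
    mode "m": every space-delimited token of the defline
    mode "c": the shortcut key only
    mode "C": the shortcut key and the full key
    """
    tokens = defline[1:].split()
    if not tokens:
        return []
    full_key = tokens[0]
    if mode == "m":
        return tokens
    if mode == "c":
        return [shortcut_key(full_key)]
    if mode == "C":
        short = shortcut_key(full_key)
        # Avoid storing the same string twice when the key has no second '|'.
        return [short] if short == full_key else [short, full_key]
    return [full_key]
```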

In order to retrieve records from the database file, cdbyank should be provided
with the name of the index file created previously with cdbfasta, e.g.:

    cdbyank -a 'human|Z98492' /usr/local/db/GUDB.human.cidx

A list of accessions is expected at stdin if the -a option is not provided,
e.g.:

    cat seq_list | cdbyank /usr/local/db/GUDB.human.cidx

This way the output will be a series of fasta records at stdout. Redirecting
this output to a file produces a multi-FASTA file.

By default cdbyank locates the database file by stripping the '.cidx' suffix
off the index file name. This is not enforced, however: with the -d option,
cdbyank can be given a user-provided database file to use with the given index
file. In the example above, if the index file "GUDB.human.cidx" is moved into
another directory, a cdbyank command (in that other directory) can be issued
like this:

    cdbyank -a 'human|Z98492' -d /usr/local/db/GUDB.human GUDB.human.cidx

The position of the index file in the list of cdbyank arguments is not
enforced. For the -a usage, the exit status returned by cdbyank to the shell
is 0 on success and 1 if the given key was not found.
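
The exit status makes it straightforward to call cdbyank from a script. The
wrapper below is a minimal sketch: the function name yank and its "run"
parameter are hypothetical, only the -a option and the 0/1 exit statuses come
from the text above, and treating any other status as an error is an
assumption.

```python
import subprocess


def yank(key, index_path, run=subprocess.run):
    """Return the FASTA record for key, or None if the key was not found.

    The run argument exists only so the command runner can be replaced
    (for example in tests); by default it is subprocess.run.
    """
    result = run(["cdbyank", "-a", key, index_path],
                 capture_output=True, text=True)
    if result.returncode == 0:
        return result.stdout
    if result.returncode == 1:  # documented status for "key not found"
        return None
    # Assumption: any other status signals a failure worth reporting.
    raise RuntimeError("cdbyank failed: " + result.stderr.strip())
```

For example, yank('human|Z98492', '/usr/local/db/GUDB.human.cidx') would
return the record text when the key is present in that index.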

The total number of fasta records indexed and the list of the keys stored in a
specific cdb index file can be retrieved with cdbyank's -n and -l switches,
respectively. This information is obtained from the index file directly (the
database file is not needed for that). There is also a -s option that displays
a summary of the information stored in the index at indexing time: the
original name of the fasta file, its size, how the index was created (e.g.
whether the -m (multiple keys) option or the -c or -C (shortcut keys) options
were given), the number of keys stored in the file, and the number of fasta
records indexed. The latter is the same as what the -n option returns.

As an extra feature, cdbfasta and cdbyank can also be used in special cases
where a database has different records with the same key (non-unique keys).
Although performance degrades a little, cdbfasta is able to index such files,
but by default cdbyank only outputs the first record found. If you want all
the records sharing the same key (accession) to be retrieved and displayed,
the -x option should be given to cdbyank.
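
The difference between the default lookup and the -x lookup can be pictured
with a small sketch. It is only an illustration: the names build_multi_index
and lookup are hypothetical, records are held in a Python list, and nothing
here reflects how the cdb index actually stores duplicate keys.

```python
from collections import defaultdict


def build_multi_index(records):
    """Map each key to the positions of every record that carries it.

    records is a list of (key, record_text) pairs in file order.
    """
    index = defaultdict(list)
    for position, (key, _record) in enumerate(records):
        index[key].append(position)
    return index


def lookup(records, index, key, all_matches=False):
    """Return the first matching record, or all of them if all_matches."""
    positions = index.get(key, [])
    if not all_matches:
        positions = positions[:1]  # default behaviour: first record only
    return [records[p][1] for p in positions]
```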

3.Retrieving sequence ranges or only the defline
================================================

There are two cdbyank options added for convenience. The -F option returns
only the definition line of each requested FASTA record (the first line of
each record). The -R option is intended for FASTA files containing actual
genetic sequences (nucleotide or protein) and expects each retrieval request
to have the following space-delimited format:

    <key> <left_coordinate> <right_coordinate>

For example, to retrieve only the sequence range 24...178 (letter numbering
starts at 1) from the sequence named 'human|Z98492', the cdbyank command
would look like this:

    cdbyank -a 'human|Z98492 24 178' -R GUDB.human.cidx

Multiple sequence ranges can be extracted this way by providing a file in
which each line follows the format above (key followed by the two
coordinates). Then, as before, such a file can be piped into cdbyank with the
-R option to pull the specified range for each of the sequences listed in the
input file:

    cat seqlistranges | cdbyank -R GUDB.human.cidx

Note that this range option works by actually parsing and looping through the
characters of the retrieved record internally, so performance is poor when a
range near the end of a very large record is pulled.
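
A minimal sketch of such a range extraction shows why the cost grows with the
position of the range: every residue before the requested end has to be
walked. The names parse_request and extract_range are hypothetical, and the
request order (key, then the lower coordinate, then the higher one) follows
the '24 178' example above.

```python
def parse_request(line):
    """Split a 'key start end' request line into (key, start, end)."""
    key, start, end = line.split()
    return key, int(start), int(end)


def extract_range(record, start, end):
    """Return residues start..end (1-based, inclusive) of a FASTA record."""
    picked = []
    position = 0
    for line in record.splitlines()[1:]:  # skip the defline
        for residue in line.strip():
            position += 1
            if position > end:
                return "".join(picked)
            if position >= start:
                picked.append(residue)
    # The record ended before reaching 'end'; return what was collected.
    return "".join(picked)
```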

4.Data compression option
=========================
(This only applies if the programs were built with compression support enabled)

The indexing program cdbfasta has the -z <compressed_db> option, which creates
a compressed file from the input file and at the same time creates an index
file for this compressed file. The original input file can then be discarded
(if it is only needed for random access through cdbyank). The entire input
file can be recovered from the resulting <compressed_db> by using the -z
option of cdbyank. Because each record is compressed separately, compression
is poor if the records are small. Compression is only advised when:

 * data records are large enough for the compression algorithm to adapt (at
   least 1KB, the more the better)

 * only random access to the data records is needed (so the original file can
   be discarded)

Compression can be quite slow for large files, and there is also some
performance penalty for cdbyank, as it has to decompress the retrieved records
on the fly.

The input data for cdbfasta compression can be read from stdin if '-' is
used instead of a file name:

    cat my_data_files* | cdbfasta - -z mydata.cdbz

This option is especially useful when the total size of the input data files
is extremely large (over the file-system limits or over the 4GB internal limit
of cdbfasta) while the compressed output is small enough to fall under such
limits.

With compressed databases cdbyank can be used normally, without extra options,
as it will auto-detect the compression (from the index file info) and activate
on-the-fly decompression of the retrieved records.

The -F and -R options are not yet accepted when working with compressed
records.

5.Development notes
===================

These tools were developed in C++, based on the publicly available cdb
("constant database") code written by D.J. Bernstein. "Constant databases"
are those that do not need records added or removed. The original C source
was (rather crudely) wrapped into C++ classes and adjusted to automatically
index fasta records and to create an external index instead of compacting the
original data file like the original cdb library code does. Also, the
"endianness" is now checked at runtime and the bytes are swapped accordingly,
so that the file offsets and record sizes are always read and written in the
same way in the index file.
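
The effect of writing offsets and sizes in one fixed byte order can be shown
with a short sketch. The field widths and the little-endian order used here
are assumptions chosen for illustration, not a description of the actual
index layout, and the names ENTRY, pack_entry and unpack_entry are
hypothetical.

```python
import struct

# Assumption: two unsigned 32-bit fields in little-endian order. The "<"
# prefix fixes the byte order regardless of the machine running the code.
ENTRY = struct.Struct("<II")


def pack_entry(offset, size):
    """Encode a record's file offset and size as 8 bytes."""
    return ENTRY.pack(offset, size)


def unpack_entry(raw):
    """Decode 8 bytes back into (offset, size)."""
    return ENTRY.unpack(raw)
```

With an explicit byte order, pack_entry(1, 258) yields the same eight bytes on
a big-endian and on a little-endian host, which is what allows one index file
to be shared across architectures.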

The compression option uses zlib's "deflate" method. The program uses deflate()
with Z_FULL_FLUSH after each record, such that random record decompression is
possible after the first dummy record is decompressed.
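
The full-flush technique can be tried out with Python's zlib module. This is
a sketch under stated assumptions, not the cdbfasta code: the names
compress_records and read_record are hypothetical, the dummy record content
is arbitrary, and a single record is read back with a raw-deflate
decompressor instead of first decompressing the dummy record, which is a
substitution made to keep the example short.

```python
import zlib


def compress_records(records):
    """Compress records into one stream with a full flush after each one.

    Returns (blob, spans) where spans[i] is the (offset, length) of the
    compressed bytes of records[i] inside blob.
    """
    comp = zlib.compressobj()
    blob = bytearray()
    # A dummy first record absorbs the zlib stream header, so every real
    # record starts exactly at a full-flush boundary.
    blob += comp.compress(b"\n") + comp.flush(zlib.Z_FULL_FLUSH)
    spans = []
    for record in records:
        chunk = comp.compress(record) + comp.flush(zlib.Z_FULL_FLUSH)
        spans.append((len(blob), len(chunk)))
        blob += chunk
    blob += comp.flush()  # finish the stream so it can be inflated whole
    return bytes(blob), spans


def read_record(blob, span):
    """Decompress one record without touching the rest of the stream."""
    offset, length = span
    # A full flush resets the compressor state and aligns output to a byte
    # boundary, so the chunk can be decoded on its own as raw deflate data.
    inflater = zlib.decompressobj(-zlib.MAX_WBITS)
    return inflater.decompress(blob[offset:offset + length])
```

Since each flush discards the compressor's history, a record cannot borrow
repeated patterns from the records before it, which is one way to see why
compression is poor when the records are small.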

The index file contains an info chunk (actually stored at the end of the file)
which holds summary data and flags about the indexing process (the -s
option of cdbyank retrieves this information). Since the compression option
was added, cdbyank always tries to read this information first (before
opening the data file) in order to determine whether the data records are
compressed or not.

Please let me know if you notice problems running these tools.

--
Geo Pertea

06/09/2003

7. Copyright
============

Copyright (c) 2002-2003, The Institute for Genomic Research,
All Rights Reserved
This software is OSI Certified Open Source Software.
OSI Certified is a certification mark of the Open Source Initiative.