CDB (Constant DataBase) indexing and retrieval tools for FASTA files
=====================================================================

This is a brief introduction to a pair of platform-independent, file-based
hashing tools (cdbfasta and cdbyank) that can be used to create indices for
quick retrieval of any particular sequence from large multi-FASTA files. The
latest version has the option to compress data records in order to save space.
The index files are now architecture independent: the same index file can be
created and used on many different Unix platforms (be it 32-bit/64-bit,
big-endian or little-endian architectures) and even on Windows.

1.Install instructions
2.Typical usage
3.Retrieving sequence ranges or only the defline
4.Data compression option
5.Development notes


1.Install instructions
======================
Before running 'make' in the source directory, please take a look at
the Makefile and note the following:

 * GCLDIR must point to the directory containing the gclib source
   files (these should already be included in this source package as a
   subdirectory)
 * in order to support record compression, change the BASEFLAGS variable
   to have -DENABLE_COMPRESSION=1 instead of -DENABLE_COMPRESSION=0
   (default is: no compression support)
 * if compression was enabled, ZDIR should point to the directory where the
   zlib library (libz.a and all the zlib header files like zlib.h) can be
   found. This is only needed if your system does not have the zlib library
   installed already (most systems do). If you get zlib-related errors when
   you try to compile cdbfasta, you might have to download zlib and
   install/build it in a directory that should then be specified as ZDIR in
   the Makefile

Running 'make' should produce the binaries 'cdbfasta' (the indexer program)
and 'cdbyank' (the query program) in the current directory.

2.Typical usage
===============

Use cdbfasta to create the index file for a multi-FASTA file and cdbyank to
pull records based on that index file. A usage message is displayed if
cdbfasta or cdbyank is run without any parameters.

In order to create an index file, only the name of the FASTA file must be
provided:

    cdbfasta <fasta_file>

The FASTA file can be specified with its full path (if it is not in the
current directory), e.g.:

    cdbfasta /usr/local/db/GUDB.human

By default cdbfasta creates an index file with the same name as the database
file but with the .cidx suffix appended. So in the example above, a file
GUDB.human.cidx will be created in /usr/local/db/. By default, the key for a
FASTA record is the first space-delimited token following the ">" character
that starts the definition line. For example, if a FASTA record has a defline
like this:

    >AA141526

then the string 'AA141526' can be used with cdbyank to retrieve the full FASTA
record associated with that sequence name:

    cdbyank -a 'AA141526' /usr/local/db/GUDB.human.cidx

Sometimes all the space-delimited tokens in the defline need to be declared as
keys in the index file, all pointing to the same FASTA record. This can be
accomplished by running cdbfasta with the "-m" switch.

For long and complex FASTA accessions like this one:

    EGAD|61|GP|186739|gb|AAA63210.1||M60828

there is an option to create the index file in such a way that cdbyank does
not need the full string in order to retrieve the sequence: the first
"<db>|<accession>" pair (i.e. the substring ending just before the second '|'
character; EGAD|61 in the example above) is enough. To enable this feature,
there are two alternative options for cdbfasta:


 -c : the index file is built by storing only the "shortcut key" (the first
      "db|accession" pair found in the defline of each FASTA record). In this
      case, cdbyank will only accept these "shortcut" accessions for
      record retrieval.

 -C : the index file is built by storing both the "shortcut key" and the full
      key (which is considered to end at the first space character in the
      defline). In this case, two strings are stored as keys for each FASTA
      record, so either of them can be used as an accession to retrieve the
      same record with cdbyank.

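As an illustration only, the key choices described above (default, -m, -c and
-C) can be sketched in a few lines of Python. The helper name defline_keys and
its "mode" argument are hypothetical and are not part of cdbfasta. How cdbfasta
treats a key that has no second '|' character is not stated here, so the
sketch simply assumes the whole key doubles as the shortcut key in that case.

```python
def defline_keys(defline, mode="default"):
    """Return the keys one defline would contribute under each indexing mode.

    mode is "default" (first token), "m" (all tokens), "c" (shortcut key
    only) or "C" (shortcut key plus full key). Hypothetical helper for
    illustration; this is not cdbfasta's actual code.
    """
    text = defline[1:] if defline.startswith(">") else defline
    tokens = text.split()
    if not tokens:
        return []
    full_key = tokens[0]
    # Shortcut key: everything before the second '|' of the full key.
    # Assumption: a key with fewer than two '|' is its own shortcut.
    shortcut = "|".join(full_key.split("|")[:2])
    if mode == "default":
        return [full_key]
    if mode == "m":
        return tokens
    if mode == "c":
        return [shortcut]
    if mode == "C":
        # Avoid listing the same string twice when both keys coincide.
        return [shortcut] if shortcut == full_key else [shortcut, full_key]
    raise ValueError("unknown mode: " + mode)
```

For the defline ">EGAD|61|GP|186739|gb|AAA63210.1||M60828 some text", mode "c"
yields only EGAD|61, while mode "C" yields EGAD|61 and the full accession.
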
In order to retrieve records from the database file, cdbyank should be given
the name of the index file created previously with cdbfasta, e.g.:

    cdbyank -a 'human|Z98492' /usr/local/db/GUDB.human.cidx

If the -a option is not provided, a list of accessions is expected at stdin,
e.g.:

    cat seq_list | cdbyank /usr/local/db/GUDB.human.cidx

This way the output will be a series of FASTA records at stdout. Redirecting
this output to a file produces a multi-FASTA file. By default cdbyank locates
the database file by stripping the '.cidx' suffix from the index file name.
This can be overridden with the -d option, which makes cdbyank use a
user-provided database file with the given index file. In the example
above, if the index file "GUDB.human.cidx" is moved into another directory, a
cdbyank command can be issued from that other directory like this:

    cdbyank -a 'human|Z98492' -d /usr/local/db/GUDB.human GUDB.human.cidx

The position of the index file in the list of cdbyank arguments is not
enforced. For the -a usage, the exit status returned by cdbyank to the shell
is 0 on success and 1 if the given key was not found.

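When cdbyank is called from a script, the exit status described above can be
used to tell a missing key from a successful lookup. The following is a
minimal sketch in Python, assuming the cdbyank binary is on the PATH. The
function names cdbyank_command and fetch_record are made up for this example,
and only the -a and -d options described above are used.

```python
import subprocess

def cdbyank_command(key, index_file, db_file=None):
    """Build the argument list for a single-key lookup with -a (and -d)."""
    cmd = ["cdbyank", "-a", key]
    if db_file is not None:
        # -d points cdbyank at a database file stored elsewhere.
        cmd += ["-d", db_file]
    cmd.append(index_file)
    return cmd

def fetch_record(key, index_file, db_file=None):
    """Return the FASTA record for key, or None if the key was not found.

    Assumes cdbyank is installed and on PATH (hypothetical wrapper).
    """
    result = subprocess.run(
        cdbyank_command(key, index_file, db_file),
        capture_output=True,
        text=True,
    )
    if result.returncode == 0:
        return result.stdout
    if result.returncode == 1:
        # Status 1 is documented for the -a usage as "key not found".
        return None
    raise RuntimeError(
        "cdbyank exited with status %d: %s" % (result.returncode, result.stderr)
    )
```

Passing the arguments as a list avoids shell quoting problems with accessions
that contain '|' characters.
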
The total number of FASTA records indexed and the list of the keys stored in a
specific cdb index file can be retrieved with cdbyank's -n and -l switches,
respectively. This information is obtained directly from the index file (the
database file is not needed). There is also a -s option that displays
a summary of the information stored in the index at indexing time:
the original name of the FASTA file, its size, how the index was created
(e.g. whether the -m (multiple keys) option or the -c or -C (shortcut keys)
option was given), the number of keys stored in the file, and the number of
FASTA records indexed (the latter being the same as what the -n option
returns).

As an extra feature, cdbfasta and cdbyank can also be used in special
cases where a database has different records with the same key
(non-unique keys). Although performance degrades a little, cdbfasta is
able to index such files, but by default cdbyank outputs only the first
record found. To retrieve and display all the records sharing the same key
(accession), give cdbyank the -x option.

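The difference between the default lookup and the -x behaviour can be pictured
with a small in-memory index that keeps every (offset, length) pair stored
under a key. This is only a conceptual sketch in Python: build_index and yank
are hypothetical names, and the real tools use an on-disk cdb hash file rather
than a Python dictionary.

```python
def build_index(fasta_text):
    """Map each key to the list of (offset, length) spans of its records."""
    index = {}
    offset = 0
    start, key = None, None
    for line in fasta_text.splitlines(keepends=True):
        if line.startswith(">"):
            if key is not None:
                # Close the previous record before starting a new one.
                index.setdefault(key, []).append((start, offset - start))
            start = offset
            tokens = line[1:].split()
            key = tokens[0] if tokens else ""
        offset += len(line)
    if key is not None:
        index.setdefault(key, []).append((start, offset - start))
    return index

def yank(fasta_text, index, key, all_matches=False):
    """Return the first record for key, or all of them if all_matches is set."""
    spans = index.get(key, [])
    if not all_matches:
        spans = spans[:1]
    return [fasta_text[start:start + length] for start, length in spans]
```

With all_matches left at False only the first record stored under a key is
returned, which mirrors the default described above; setting it to True
corresponds to asking for every record that shares the key.
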
3.Retrieving sequence ranges or only the defline
================================================

Two cdbyank options were added for convenience. The -F option returns only the
definition line (the first line) of each requested FASTA record. The -R option
is intended for FASTA files containing actual genetic sequences (nucleotide
or protein) and expects each retrieval request to have the following
space-delimited format:

    <key> <left_coordinate> <right_coordinate>

For example, to retrieve only the sequence range 24...178 (letter
numbering starts at 1) from the sequence named 'human|Z98492', the
cdbyank command would look like this:

    cdbyank -a 'human|Z98492 24 178' -R GUDB.human.cidx

Multiple sequence ranges can be extracted this way by providing a file in which
each line follows the format above (a key followed by the two coordinates).
Then, as before, such a file can be piped into cdbyank with the -R option to
pull the specified range for each of the sequences listed in the input file:

    cat seqlistranges | cdbyank -R GUDB.human.cidx

Note that the range option works by actually parsing and looping through the
characters of the retrieved record internally, so performance is poor when a
range near the end of a very large record is pulled.

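To make the cost of that character loop concrete, here is a rough Python
sketch of pulling a 1-based range out of a FASTA record. It is not cdbyank's
code. The name extract_range is hypothetical, and the sketch assumes that both
coordinates are inclusive and that line breaks inside the record do not count
as sequence positions.

```python
def extract_range(record, left, right):
    """Return residues left..right (1-based, assumed inclusive) of a record."""
    if left < 1 or right < left:
        raise ValueError("expected 1 <= left <= right")
    picked = []
    position = 0  # number of sequence characters seen so far
    for line in record.splitlines()[1:]:  # skip the defline
        for ch in line.strip():
            position += 1
            if position > right:
                return "".join(picked)
            if position >= left:
                picked.append(ch)
    # The record ended before reaching 'right': return what was found.
    return "".join(picked)
```

Every character before the left coordinate still has to be visited and
counted, which is why a range near the end of a very large record is slow to
extract with this approach.
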
4.Data compression option
=========================
(This only applies if the programs were built with compression support enabled)

The indexing program cdbfasta has the -z <compressed_db> option, which creates
a compressed file from the input file and at the same time creates an index
file for this compressed file. The original input file can then be discarded
(if it is only needed for random access through cdbyank). The entire input file
can be recovered from the resulting <compressed_db> by using the -z option of
cdbyank. Because each record is compressed separately, compression is poor if
the records are small. Compression is only advised when:

 * data records are large enough for the compression algorithm to adapt (at
   least 1KB, the more the better)

 * only random access to the data records is needed (so the original file can
   be discarded)

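The effect of record size on per-record compression can be checked with a
short experiment. The Python sketch below uses the standard zlib module on
made-up random DNA records. The helper names and record sizes (60 and 6000
letters) are arbitrary choices for illustration, and the ratios obtained on
real data will differ.

```python
import random
import zlib

def per_record_ratio(records):
    """Compressed size / original size when each record is deflated alone."""
    original = sum(len(r) for r in records)
    compressed = sum(len(zlib.compress(r)) for r in records)
    return compressed / original

def random_dna(length, rng):
    """Made-up test data: a random string over the letters A, C, G, T."""
    return "".join(rng.choice("ACGT") for _ in range(length)).encode("ascii")

rng = random.Random(0)  # fixed seed so repeated runs use the same data
small_records = [random_dna(60, rng) for _ in range(100)]
large_records = [random_dna(6000, rng)]

small_ratio = per_record_ratio(small_records)
large_ratio = per_record_ratio(large_records)
print("small records: %.2f  large record: %.2f" % (small_ratio, large_ratio))
```

The same number of letters is compressed in both cases, but the small records
each pay the fixed per-stream overhead and give the algorithm little data to
adapt to, so their ratio should come out noticeably worse.
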
Compression can be quite slow for large files, and there is also some
performance penalty for cdbyank as it has to decompress the retrieved records
on the fly.

The input data for cdbfasta compression can be read from stdin if '-' is
used instead of a file name:

    cat my_data_files* | cdbfasta - -z mydata.cdbz

This option is especially useful when the total size of the input data files is
extremely large (over the file-system limits or over the 4GB internal limit of
cdbfasta) while the compressed output can be small enough to fall under such
limits.

With compressed databases, cdbyank can be used normally without extra options,
as it will auto-detect the compression (from the index file info) and activate
on-the-fly decompression of the retrieved records.

The -F and -R options are not yet accepted when working with compressed
records.

5.Development notes
===================

These tools were developed in C++, based on the publicly available cdb
("constant database") code written by D.J. Bernstein
(http://cr.yp.to/djb.html). "Constant databases" are those to which records
do not need to be added or removed. The original C source was (rather crudely)
wrapped into C++ classes and adjusted to automatically index FASTA records and
to create an external index instead of compacting the original data file as
the original cdb library code does. Also, the "endianness" is now checked at
runtime and the bytes are swapped accordingly, so that the file offsets and
record sizes are always read and written in the same way in the index file.

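The idea of always writing offsets and sizes in one fixed byte order can be
shown with Python's struct module, where the byte order is part of the format
string rather than a property of the host. The layout below (a 32-bit offset
followed by a 32-bit size, both big-endian) is an assumption made for the
example only; the actual field widths and byte order used by cdbfasta are not
described here.

```python
import struct

# Assumed layout, for illustration only: 32-bit offset, then 32-bit size,
# both stored big-endian regardless of the machine running the code.
ENTRY = struct.Struct(">II")

def pack_entry(offset, size):
    """Encode one (offset, size) pair in the fixed byte order."""
    return ENTRY.pack(offset, size)

def unpack_entry(data):
    """Decode an (offset, size) pair written by pack_entry."""
    return ENTRY.unpack(data)
```

Because the ">" prefix fixes the byte order, the bytes produced are identical
on big-endian and little-endian hosts, which is the property that makes an
index file portable between architectures.
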
The compression option uses zlib's "deflate" method. The program calls
deflate() with Z_FULL_FLUSH after each record, so that random record
decompression is possible after the first dummy record is decompressed.

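The role of Z_FULL_FLUSH can be demonstrated with Python's zlib bindings. Note
that the sketch below swaps in a slightly different arrangement from the one
described above: it writes a raw deflate stream (negative wbits), which has no
stream header, so no dummy first record is needed. The function names are
hypothetical and the on-disk format of cdbfasta is not reproduced.

```python
import zlib

def compress_records(records):
    """Deflate records into one blob; return (blob, [(offset, length), ...])."""
    # Negative wbits selects a raw deflate stream without a zlib header.
    compressor = zlib.compressobj(6, zlib.DEFLATED, -zlib.MAX_WBITS)
    chunks, spans, offset = [], [], 0
    for record in records:
        # Z_FULL_FLUSH byte-aligns the output and resets the compression
        # state, so the next record does not refer back to earlier data.
        chunk = compressor.compress(record) + compressor.flush(zlib.Z_FULL_FLUSH)
        chunks.append(chunk)
        spans.append((offset, len(chunk)))
        offset += len(chunk)
    chunks.append(compressor.flush())  # final flush terminates the stream
    return b"".join(chunks), spans

def read_record(blob, span):
    """Decompress a single record given its (offset, length) in the blob."""
    offset, length = span
    decompressor = zlib.decompressobj(-zlib.MAX_WBITS)
    return decompressor.decompress(blob[offset:offset + length])
```

Each (offset, length) pair plays the part of an index entry: a record can be
decompressed on its own, in any order, without reading the records before it.
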
The index file contains an info chunk (actually stored at the end of the file)
which holds summary data and flags about the indexing process (the -s
option of cdbyank retrieves this information). Since the compression option
was added, cdbyank always tries to read this information first (before
opening the data file) in order to determine whether the data records are
compressed.

Please let me know if you notice any problems when running these tools.

--
Geo Pertea
gpertea@tigr.org
06/09/2003


7. Copyright
============

Copyright (c) 2002-2003, The Institute for Genomic Research,
All Rights Reserved
This software is OSI Certified Open Source Software.
OSI Certified is a certification mark of the Open Source Initiative.