root/gclib/cdbfasta/README
Revision: 8
Committed: Mon Mar 22 22:11:25 2010 UTC (12 years, 6 months ago) by gpertea
File size: 11295 byte(s)
Log Message:
added cdbfasta source files

CDB (Constant DataBase) indexing and retrieval tools for FASTA files
=====================================================================

This is a brief introduction to a pair of platform-independent, file-based
hashing tools (cdbfasta and cdbyank) that can be used to create indices for
quick retrieval of any particular sequence from large multi-FASTA files. The
latest version has the option to compress data records in order to save space.
The index files are now architecture independent: the same index file can be
created and used on many different Unix platforms (32-bit or 64-bit,
big-endian or little-endian) and even on Windows.

1.Install instructions
2.Typical usage
3.Retrieving sequence ranges or only the defline
4.Data compression option
5.Development notes


1.Install instructions
======================
Before running 'make' in the source directory, please take a look at
the Makefile and note the following:

* GCLDIR must point to the directory containing the gclib source
  files (should be included in this source package already as a subdirectory)
* in order to support record compression, change the BASEFLAGS variable
  to have -DENABLE_COMPRESSION=1 instead of -DENABLE_COMPRESSION=0
  (default is: no compression support)
* if compression was enabled, ZDIR should point to the directory where the
  zlib library (libz.a and all the zlib header files like zlib.h) can be found.
  This is only needed if your system does not have the zlib library installed
  already (most systems do). In case you get zlib related errors when you try
  to compile cdbfasta you might have to download zlib and install/build it
  in a directory that should then be specified as ZDIR in the Makefile

Running 'make' should produce the binaries 'cdbfasta' (the indexer program)
and 'cdbyank' (the query program) in the current directory.

2.Typical usage
===============

Use cdbfasta to create the index file for a multi-FASTA file and cdbyank to
pull records based on that index file. A usage message is displayed if
cdbfasta or cdbyank is run without any parameters.

In order to create an index file, only the name of the fasta file must be
provided:

    cdbfasta <fasta_file>

The fasta file can be specified with the whole path (if it's not in the current
directory), e.g.

    cdbfasta /usr/local/db/GUDB.human

By default cdbfasta creates an index file with the same name as the database
file but with the .cidx suffix added to the original name. So in the example
above, a file GUDB.human.cidx will be created in /usr/local/db/. The default
usage considers the key for a FASTA record to be the first space-delimited
token following the ">" starting character from the definition line. For
example, if a FASTA record had a defline like this:

    >AA141526

then we can use the string 'AA141526' with cdbyank to retrieve the full FASTA
record associated with that sequence name:

    cdbyank -a 'AA141526' /usr/local/db/GUDB.human.cidx
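
To make the idea of key-based retrieval concrete, here is a minimal Python
sketch of an in-memory index that maps the default key (first token after
">") to the byte offset and length of its record. It only illustrates the
concept: the function names are hypothetical, the dictionary stands in for
the on-disk cdb hash, and nothing here reflects the real .cidx file layout.

```python
import io


def default_key(defline: str) -> str:
    """Return the first whitespace-delimited token after the leading '>'."""
    tokens = defline[1:].split()
    return tokens[0] if tokens else ""


def build_index(fasta) -> dict:
    """Map each record key to (offset, length) within a binary FASTA stream."""
    index = {}
    key, start, pos = None, 0, 0
    for line in fasta:
        if line.startswith(b">"):
            if key is not None:
                # Keep the first record seen for a key.
                index.setdefault(key, (start, pos - start))
            key, start = default_key(line.decode()), pos
        pos += len(line)
    if key is not None:
        index.setdefault(key, (start, pos - start))
    return index


def fetch(fasta, index: dict, key: str) -> bytes:
    """Seek straight to the record for `key` and read only that record."""
    offset, length = index[key]
    fasta.seek(offset)
    return fasta.read(length)


# Toy data standing in for a multi-FASTA file on disk.
db = io.BytesIO(b">AA141526 example defline\nACGT\nGGCC\n>human|Z98492\nTTTT\n")
idx = build_index(db)
record = fetch(db, idx, "AA141526")
```

The point of the design is that a lookup costs one hash probe plus one seek,
regardless of how large the FASTA file is.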

Sometimes all the space-delimited tokens in the defline need to be declared as
keys in the index file, pointing to the same fasta record. This can be
accomplished with the "-m" switch of cdbfasta.

For long and complex fasta file accessions like this:
EGAD|61|GP|186739|gb|AAA63210.1||M60828 there is an option to create the
index file in such a way that there is no need to provide the full string to
cdbyank in order to retrieve such a sequence; the first "<db>|<accession>"
pair (i.e. the substring ending at the second '|' character, EGAD|61 in the
example above) is enough. In order to enable this feature, there are two
alternative options for cdbfasta:


-c : the index file is built only by storing the "shortcut key" (the first
     "db|accession" pair found in the defline of each fasta record). In this
     case, cdbyank will only be able to accept these "shortcut" accessions for
     record retrieval.

-C : the index file is built by storing both the "shortcut key" and the full
     keys (which are considered to end at the first space character in the
     defline). In this case, two strings are stored as keys for each fasta
     record so any of them can be used as an accession for retrieval of the
     same record with cdbyank.
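
The key-selection modes described above can be sketched in Python as follows.
This is an illustration, not the cdbfasta implementation: the function names
and the `mode` argument are hypothetical, and the fallback to the full key
when an accession contains fewer than two '|' characters is an assumption
(the text above does not say what cdbfasta does in that case).

```python
def shortcut_key(full_key: str) -> str:
    """Return the leading 'db|accession' pair of a '|'-separated accession."""
    parts = full_key.split("|")
    if len(parts) < 3:
        # Assumption: no second '|' present, so use the full key unchanged.
        return full_key
    return "|".join(parts[:2])


def keys_for_defline(defline: str, mode: str = "") -> list:
    """Keys stored for one defline; mode is '', 'm', 'c' or 'C' (hypothetical)."""
    tokens = defline[1:].split()
    if not tokens:
        return []
    full = tokens[0]
    if mode == "m":
        return tokens                    # every token becomes a key
    if mode == "c":
        return [shortcut_key(full)]      # shortcut key only
    if mode == "C":
        short = shortcut_key(full)       # shortcut key plus full key
        return [short] if short == full else [short, full]
    return [full]                        # default: first token only


defline = ">EGAD|61|GP|186739|gb|AAA63210.1||M60828 some description"
```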

In order to retrieve records from the database file, cdbyank should be provided
with the name of the index file created previously with cdbfasta, e.g.:

    cdbyank -a 'human|Z98492' /usr/local/db/GUDB.human.cidx

A list of accessions is expected at stdin if the -a option is not provided,
e.g.:

    cat seq_list | cdbyank /usr/local/db/GUDB.human.cidx

This way the output will be a series of fasta records at stdout. By redirecting
this output to a file, a multi-FASTA file is obtained. cdbyank locates the
database file by stripping the '.cidx' suffix off the index filename. This is
not enforced, however: with the -d option, cdbyank can use a user-provided
database file with the given index file. In the example above, if the index
file "GUDB.human.cidx" is moved into another directory, a cdbyank command (in
that other directory) can be issued like this:

    cdbyank -a 'human|Z98492' -d /usr/local/db/GUDB.human GUDB.human.cidx

The position of the index file in the list of arguments of cdbyank is not
enforced. For the -a usage, the exit status returned by cdbyank to the shell
will be 1 if the given key was not found and 0 for success.

The total number of fasta records indexed and the list of the keys stored in a
specific cdb index file can be retrieved with cdbyank's -n and -l switches,
respectively. This information is obtained from the index file directly (the
database file is not needed for that). There is also a -s option that displays
a summary of the indexing information stored in the index at index time: the
initial name of the fasta file, its size, how the index was created (e.g.
whether the -m (multiple keys) option or the -c or -C (shortcut keys) option
was given), the number of keys stored in the file, and the number of fasta
records indexed. The latter is the same as what the -n option returns.

As an extra feature, cdbfasta and cdbyank can also be used for some special
cases where databases may have different records with the same key
(non-unique keys). Although the performance will degrade a little, cdbfasta is
able to index such files, but by default cdbyank only outputs the first
record found. If you want all the records sharing the same key (accession) to
be retrieved and displayed, the -x option should be given to cdbyank.
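
The difference between the default lookup and the -x behaviour can be pictured
with a small Python sketch in which each key maps to a list of record offsets.
The names `index_offsets` and `lookup` are hypothetical and the list-per-key
layout is only an illustration, not the structure of the real index file.

```python
from collections import defaultdict


def index_offsets(pairs):
    """Collect every offset seen for each key, in input order."""
    index = defaultdict(list)
    for key, offset in pairs:
        index[key].append(offset)
    return index


def lookup(index, key, all_matches=False):
    """Return the first offset for `key`, or all of them (like -x)."""
    hits = index.get(key, [])
    return list(hits) if all_matches else hits[:1]


# Two different records share the key 'dup'.
toy_index = index_offsets([("dup", 0), ("uniq", 120), ("dup", 340)])
```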

3.Retrieving sequence ranges or only the defline
================================================

Two cdbyank options were added for convenience. The -F option returns only
the definition line of each requested FASTA record (the first line of each
record). The -R option is intended for FASTA files containing actual genetic
sequences (nucleotide or protein) and expects each of the retrieval commands
to have the following format (space delimited):

    <key> <left_coordinate> <right_coordinate>

For example, if we only want to retrieve the sequence range 24...178 (letter
numbering starts at 1) from the sequence with the name 'human|Z98492', then
the cdbyank command would look like this:

    cdbyank -a 'human|Z98492 24 178' -R GUDB.human.cidx

Multiple sequence ranges can be extracted this way by providing a file in
which each line follows the format above (key followed by the two
coordinates). Then, as before, such a file can be piped into cdbyank with the
-R option to pull specific sequence ranges for each of the sequences specified
in the input file:

    cat seqlistranges | cdbyank -R GUDB.human.cidx

Note that this range option works by actually parsing and looping through the
retrieved record characters internally, so the performance is poor when a
range near the end of a very large record is pulled.
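
As a rough illustration of what a range query involves, the Python sketch
below parses a request line such as 'human|Z98492 24 178' and pulls the
1-based, inclusive range out of an already retrieved record. The function
names are hypothetical, and the sketch assumes the coordinates count sequence
letters only (line breaks and the defline are skipped). Like the approach
described above, it has to walk the record from its beginning, so the cost
grows with the position of the range.

```python
def parse_range_request(line: str):
    """Split '<key> <start> <end>' into (key, start, end)."""
    key, start, end = line.split()
    return key, int(start), int(end)


def extract_range(record: str, start: int, end: int) -> str:
    """Return sequence letters start..end (1-based, inclusive) of one record."""
    lines = record.splitlines()
    sequence = "".join(lines[1:])   # drop the defline, join the wrapped lines
    return sequence[start - 1:end]


toy_record = ">human|Z98492 toy record\nACGT\nTTGA\n"
key, start, end = parse_range_request("human|Z98492 3 6")
piece = extract_range(toy_record, start, end)
```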

4.Data compression option
=========================
(This only applies if the programs were built with compression support enabled)

The indexing program cdbfasta has the -z <compressed_db> option which creates
a compressed file from the input file and at the same time creates an index
file for this compressed file. The original input file can then be discarded
(if it is only needed for random access through cdbyank). The entire input file
can be recovered from the resulting <compressed_db> by using the -z option of
cdbyank. Because each record is compressed separately, compression is poor if
the records are small. Compression is only advised when:

* data records are large enough for the compression algorithm to adapt (at
  least 1KB, the more the better)

* only random access is needed to the data records (so the original file can
  be discarded)
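
The effect of record size on per-record compression can be checked with a
small experiment. The Python sketch below compresses a short and a long
pseudo-random nucleotide record independently with zlib and compares the
ratios; the record contents, sizes and the helper name `compression_ratio`
are illustrative assumptions, and real sequence data will behave differently
in detail.

```python
import random
import zlib


def compression_ratio(record: bytes) -> float:
    """Compressed size divided by original size (lower is better)."""
    return len(zlib.compress(record)) / len(record)


rng = random.Random(0)  # fixed seed so the experiment is repeatable

# A 40-letter record and a 100,000-letter record, each compressed on its own.
small_record = b">s1\n" + bytes(rng.choice(b"ACGT") for _ in range(40)) + b"\n"
large_record = b">s2\n" + bytes(rng.choice(b"ACGT") for _ in range(100_000)) + b"\n"

small_ratio = compression_ratio(small_record)
large_ratio = compression_ratio(large_record)
```

For the short record the fixed overhead of the compressed stream dominates,
which is why the ratio stays much closer to 1 than for the long record.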

The compression can be quite slow for large files and there is also some
performance penalty for cdbyank as it has to decompress the retrieved records
on the fly.

The input data for cdbfasta compression can be collected from stdin if '-' is
used instead of a file name:

    cat my_data_files* | cdbfasta - -z mydata.cdbz

This option is useful especially when the total size of input data files is
extremely large (over the file-system limits or over the 4GB internal limit of
cdbfasta) while the compressed output can be small enough to fall under such
limits.

With compressed databases cdbyank can be used normally without extra options as
it will auto-detect the compression (from the index file info) and activate
on-the-fly decompression of the retrieved records.

The -F and -R options are not yet accepted when working with compressed
records.

5.Development notes
===================

These tools were developed in C++, based on the publicly available cdb
("constant database") code written by D.J. Bernstein
(http://cr.yp.to/djb.html). "Constant databases" are those that we don't need
to add to or remove records from. The original C source was (rather crudely)
wrapped into C++ classes and adjusted to automatically index fasta records and
to create an external index instead of compacting the original data file like
the original cdb library code does. Also the "endianness" is now checked at
runtime and the bytes are swapped accordingly such that the file offsets and
record sizes are always read/written in the same way in the index file.
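
The idea of writing offsets in one fixed byte order, whatever the host
architecture, can be sketched in Python with the struct module. The choice of
little-endian 32-bit values below is an assumption made for illustration (the
text above does not state which order cdbfasta uses), and the function names
are hypothetical.

```python
import struct
import sys


def pack_u32(value: int) -> bytes:
    """Encode an unsigned 32-bit value in a fixed (little-endian) byte order."""
    return struct.pack("<I", value)   # '<' forces the order, ignoring the host


def unpack_u32(data: bytes) -> int:
    """Decode a value written by pack_u32 on any architecture."""
    return struct.unpack("<I", data)[0]


host_order = sys.byteorder            # 'little' or 'big', checked at runtime
encoded = pack_u32(0x12345678)
```

Because the byte order is fixed in the file rather than inherited from the
machine, an index written on one platform decodes to the same offsets on any
other.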

The compression option uses zlib's "deflate" method. The program uses deflate()
with Z_FULL_FLUSH after each record, such that random record decompression is
possible after the first dummy record is decompressed.
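
The Z_FULL_FLUSH technique can be demonstrated with Python's zlib module. In
the sketch below a single deflate stream is written with a full flush after
every record, the byte span of each record is remembered, and a record is
later decompressed on its own after priming the decompressor with the leading
dummy record. The function names, the dummy record contents and the in-memory
list of spans are illustrative assumptions, not the cdbfasta file layout.

```python
import zlib


def compress_records(records, dummy=b"dummy"):
    """Compress records into one stream, full-flushing after each one.

    Returns the compressed blob and a list of (offset, length) spans;
    span 0 belongs to the dummy record, span i + 1 to records[i].
    """
    comp = zlib.compressobj()
    blob = bytearray()
    spans = []
    for rec in [dummy, *records]:
        # Z_FULL_FLUSH byte-aligns the output and resets the history window,
        # so the next record does not refer back to earlier data.
        chunk = comp.compress(rec) + comp.flush(zlib.Z_FULL_FLUSH)
        spans.append((len(blob), len(chunk)))
        blob += chunk
    return bytes(blob), spans


def fetch_record(blob, spans, i):
    """Decompress only records[i], without touching the records before it."""
    dec = zlib.decompressobj()
    off0, len0 = spans[0]
    dec.decompress(blob[off0:off0 + len0])      # dummy record: consumes the stream header
    off, length = spans[i + 1]
    return dec.decompress(blob[off:off + length])


toy_records = [b">a\nACGTACGT\n", b">b\nTTTTTTTT\n", b">c\nGGGGCCCC\n"]
blob, spans = compress_records(toy_records)
last = fetch_record(blob, spans, 2)
```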

The index file contains an info chunk (actually stored at the end of the file)
which holds summary data and flags about the indexing process (the -s
option of cdbyank retrieves this information). Since the compression option
was added, cdbyank always tries to read this information first (before
opening the data file) in order to determine whether the data records are
compressed or not.

Please let me know if you notice problems running these tools.

--
Geo Pertea
gpertea@tigr.org
06/09/2003


6. Copyright
============

Copyright (c) 2002-2003, The Institute for Genomic Research,
All Rights Reserved
This software is OSI Certified Open Source Software.
OSI Certified is a certification mark of the Open Source Initiative.