More formats.
Example output converters (eg, for HTML)
The patterns are useable for other langauges, so would it to
generate parsers for perl, C, etc. Probably write something in C
which is a SWIGged engine for use by everyone else.
Add more definitions for common patterns, like 'hostname' 'email',
'URL', etc.
I think SAX 2.0 uses an InputSource instead of parse/ parseFile/
parseString. Need to check, and possibly change Martel.
Have a working Locator. (How to deal with line numbers?)
Namespace support? Features? Other SAX 2.0-isms?
Need standard element/tag names. For example, NCBI has a DTD for
BLAST output, but that's just the semantic data. There is extra
organization of the BLAST output, so I need names for them.
Even better, the names and meanings should by consistent across
different formats (eg, "sequence") and consistent with XML usage (like
'*_list' for a list of items (?))
If I have access to a description of the datatype for each named
group, I can likely make a DTD for it. Hmm, but a DTD only allows
one tag lookaheads so perhaps not.
Still want to include "lenient" parsers which skip data they
assume is correct.
I want to make an example for how to index files. It might be
called like:
make_index --record <record tag> --id <id tag> [--aliases <alias tag>]*
[--format <format name>] --dbm <filename> files...
where:
<record tag> is the outermost name for the record (like 'swissprot38_record')
<id tag> is the unique id for the record (like 'entry_name')
<alias tag> is a set of tags listing aliases for the record (like 'ac_number')
<format name> is the (optional) format name for the data file
<filename> is the name for the DBM file
files... is the file or files to parse to generate the index
The index will contain mappings from aliases to primary id, and from primary
id to filename and character ranges. The data is stored in a dbm file
(probably Berkeley DBM from Sleepycat).
Using the index could be something like:
lookup --dbm <filename> [--primary | --alias | --either ] [--show] identifier
where
<filename> is the name of the DBM file
--primary means the identifier is a primary key name
--alias means the identifier is an alias
--either means it's either; returns all unique matches
By default, the output is the list of matching records as (filename,
start byte, end byte). When --show is used, the matching records are
displayed as a stream to stdout.
Add keyword searching by adding a list of fields which should be
indexed. This would allow searching for all records from Spam and
Eggs which didn't have Vikings for a gene name.
|