[BiO BB] GI numbers
pculpep at hotmail.com
Wed Mar 31 11:44:06 EST 2004
A package, Integrated Genomics Data System, has been submitted to the Baylor
College of Medicine Office of Technology. A description of the system follows.
Integrated Genomics Data System (IGDS)
The Integrated Genomics Data System (IGDS) integrates data from multiple
publicly available genomic databases into a relational database format.
The core of the IGDS system is a C/C++ program that data mines National
Center for Biotechnology Information (NCBI) binary ASN1 files for sequence
data. This data is integrated by means of Perl scripts with data from Locus
Link, UniGene, Gene Ontology Association, Protein Data Bank, and other
sites. The resulting information is uploaded by a Java program into a
relational database defined by the Integrated Genomics Database System
The IGDS is a computational tool for data gathering and interpretation of
genomic data, which saves time and reduces repetition of rote processes.
A methodology using tested relationships among various pieces of data in
different files reduces the necessity for the accrual and processing of
massive amounts of data.
Data mining tactic based on the NCBI toolkit, thereby utilizing code that
has been approved for interpretation of NCBI ASN1. Native representation of
pertinent data elements is maintained as are the nesting levels inherent in
the ASN1 structure.
Data download and interpretation are performed on the most compact
representation of NCBI data - ASN1 binary files.
Less computer disk space is required to store data files when data mining
processes are invoked.
Processing of NCBI data in binary format provides optimal computer
performance with quick results.
Configurable interface affords various levels of processing granularity.
Processing may be allocated among many processes on one computer or across
Final relational representation of genomic data provides dynamic inference
not possible with flat file or ASN1 data representation.
The system is fully configurable and will download and interpret the entire
NCBI ASN1 sequence library or a few select sequence sets.
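
The point above about relational representation can be illustrated with a
small sketch. This is not the IGDS schema: the table and column names are
hypothetical, and SQLite stands in for whatever relational database is used.
The rows are taken from the APE0180 example in the quoted question below.

```python
import sqlite3

# In-memory database; table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sequence (gi INTEGER PRIMARY KEY, gene_name TEXT);
CREATE TABLE gene (gene_name TEXT PRIMARY KEY, organism TEXT);
""")

# Both GI numbers below refer to the protein of gene APE0180.
conn.execute("INSERT INTO sequence VALUES (?, ?)", (14600509, "APE0180"))
conn.execute("INSERT INTO sequence VALUES (?, ?)", (5103570, "APE0180"))
conn.execute("INSERT INTO gene VALUES (?, ?)", ("APE0180", "Aeropyrum pernix"))

# A join lists every GI recorded for a gene together with its organism.
rows = conn.execute(
    """
    SELECT s.gi, g.organism
    FROM sequence AS s
    JOIN gene AS g ON g.gene_name = s.gene_name
    WHERE g.gene_name = ?
    ORDER BY s.gi
    """,
    ("APE0180",),
).fetchall()
print(rows)
```

With flat files, the same question would need a custom parser for each
file format; here it is a single query.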
A separate series of Perl scripts cross-references the NCBI Locus
Link/UniGene libraries, providing Accession and GI Number, Gene Names, Alias
Gene Names, Preferred Gene Names, Clone, Lib, UniGene Id, Tissue, Vector,
Organ, Cyto_Genetic_Loc, and relevant Gene Ontology information such as GO
Id, categories, etc. This data can be merged with the ASN1 data to create
a fully integrated DB system of genomic information.
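
As a rough illustration of this kind of cross-referencing (a sketch only,
not the IGDS Perl scripts), two ID tables that share a gene identifier can
be joined in a few lines. The (gene, GI) row layout and the function name
map_gis_by_gene are assumptions made for the example; the single data row is
the APE0180 case from the quoted question below.

```python
def map_gis_by_gene(cog_rows, kegg_rows):
    """Join two (gene_id, gi) tables on gene_id.

    Returns a dict {cog_gi: kegg_gi} plus a list of gene ids that
    appear in the COG table but have no counterpart in the KEGG table.
    """
    kegg_by_gene = {gene: gi for gene, gi in kegg_rows}
    mapping = {}
    unmatched = []
    for gene, cog_gi in cog_rows:
        kegg_gi = kegg_by_gene.get(gene)
        if kegg_gi is None:
            unmatched.append(gene)  # keep these for manual review
        else:
            mapping[cog_gi] = kegg_gi
    return mapping, unmatched


# Hypothetical in-memory tables; real input would be parsed from files.
cog_rows = [("APE0180", 14600509)]
kegg_rows = [("APE0180", 5103570)]

mapping, unmatched = map_gis_by_gene(cog_rows, kegg_rows)
print(mapping)
print(unmatched)
```

The join only works where both tables use the same gene identifier, so the
unmatched list is returned rather than silently dropped.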
A Rational Rose UML Data Model is provided as well as relevant SQL tables.
Requirements: C/C++, Perl, a compiled version of the NCBI toolkit, and a relational
database management system.
Contact information for the Baylor College of Medicine Office of Technology
is as follows --
Baylor College of Medicine
Office of Technology Administration (i.e. Baylor Licensing)
One Baylor Plaza
Mail Stop: BCM210 600D
Houston, TX 77030
P (713) 798-6821
F (713) 798-1252
lhope at bcm.tmc.edu
>From: "Stefanie Lager" <stefanielager at fastmail.ca>
>Reply-To: bio_bulletin_board at bioinformatics.org
>To: bio_bulletin_board at bioinformatics.org
>Subject: Re: [BiO BB] GI numbers
>Date: Wed, 31 Mar 2004 10:44:26 +0000 (UTC)
>Try linking them through LocusLink, either using one of the mapping
>tables found at: ftp://ftp.ncbi.nih.gov/refseq/LocusLink/ or (a bit more
>complicated) using a system like OpenBNS: http://openbns.sourceforge.net/
> > Hi,
> > I'm analyzing a set of sequences with regard to their classifications
> > as homologs from both COG and Kegg databases of orthologs. Although
> > both COG and Kegg provide tables relating gene names to GI (PID)
> > numbers, I'm, up to this moment, unable to map GIs from one dataset to
> > the other, in order to check classifications for genes in both
> > catalogs.
> > GIs from COG appear to be from RefSeq and those from Kegg seem to be
> > from GenPept. How can I map GI numbers from Kegg to GI numbers from
> > COG database? Is there any query I can make to download such info for
> > 185904 proteins in COG and their equivalents on Kegg Orthologs
> > database?
> > Here is an example:
> > Sequence 14600509 is the protein coded by gene APE0180 from Aeropyrum
> > pernix complete genome, as described in COG's table myva=gb. The same
> > sequence is identified by GI 5103570 in Kegg. In this case, I was able
> > to map COG's GI to Kegg's GI by using the gene identifier and annotation,
> > a procedure that is not easily automated.
> > How can I retrieve equivalent IDs for the whole COG gene set?
> > Thanks in advance for any help.
> > Robson