[BiO BB] GI numbers

Pamela Culpepper pculpep at hotmail.com
Wed Mar 31 11:44:06 EST 2004

A package, the Integrated Genomics Data System, has been submitted to the 
Baylor College of Medicine Office of Technology.  A description of the 
system is as follows:

Integrated Genomics Data System (IGDS)


The Integrated Genomics Data System (IGDS) integrates data from multiple 
publicly available genomic databases into a relational database format.

The core of IGDS is a C/C++ program that mines National Center for 
Biotechnology Information (NCBI) binary ASN1 files for sequence data.  Perl 
scripts integrate this data with data from LocusLink, UniGene, Gene Ontology 
Association, Protein Data Bank, and other sites.  A Java program then 
uploads the resulting information into a relational database defined by the 
Integrated Genomics Database System.


IGDS is a computational tool for gathering and interpreting genomic data.  
It saves time and reduces repetition of rote processes.

Because the method relies on tested relationships among pieces of data in 
different files, it reduces the need to accumulate and process massive 
amounts of data.

The data mining approach is based on the NCBI toolkit, and so uses code that 
has been approved for interpretation of NCBI ASN1.  The native 
representation of pertinent data elements is maintained, as are the nesting 
levels inherent in the ASN1 structure.

Data download and interpretation are performed on the most compact 
representation of NCBI data: ASN1 binary files.

Less computer disk space is required to store data files when data mining 
processes are invoked.

Processing of NCBI data in binary format provides optimal computer 
performance with quick results.

A configurable interface affords various levels of processing granularity.  
Processing may be allocated among many processes on one computer or across 
several computers.
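
One simple way to divide such work is to deal the selected sequence sets out 
round-robin, one batch per worker.  The Python sketch below only illustrates 
that idea: the function name, the worker count, and the set names are 
hypothetical, and it is not how the IGDS configuration interface is actually 
implemented.

```python
def split_into_batches(items, n_workers):
    """Distribute items round-robin into n_workers batches (one per worker)."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    batches = [[] for _ in range(n_workers)]
    for index, item in enumerate(items):
        batches[index % n_workers].append(item)
    return batches


if __name__ == "__main__":
    # Hypothetical sequence-set names, used only for illustration.
    sequence_sets = ["set_a", "set_b", "set_c", "set_d", "set_e"]
    for worker_id, batch in enumerate(split_into_batches(sequence_sets, 2)):
        print(worker_id, batch)
```

Each batch could then be handed to a separate process on one machine or to a 
separate machine.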

The final relational representation of genomic data provides dynamic 
inference not possible with flat file or ASN1 data representation.

The system is fully configurable and will download and interpret the entire 
NCBI ASN1 sequence library or a few select sequence sets.

A separate series of Perl scripts cross-references the NCBI 
LocusLink/UniGene libraries, providing Accession and GI Number, Gene Names, 
Alias Gene Names, Preferred Gene Names, Clone, Lib, UniGene Id, Tissue, 
Vector, Organ, Cyto_Genetic_Loc, and relevant Gene Ontology information such 
as GO Id, categories, etc.  This data can be merged with the ASN1 data to 
create a fully integrated DB system of genomic information.
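
To show the kind of cross-reference query a relational layout makes possible, 
here is a minimal sketch in Python using the standard sqlite3 module.  It 
uses the single example from the question quoted further down in this 
message (gene APE0180, GI 14600509 in COG and GI 5103570 in Kegg).  The table 
and column names (cog_protein, kegg_protein, gi, gene_name) are assumptions 
made for illustration and are not the actual IGDS schema.

```python
import sqlite3


def build_demo_db():
    """Build an in-memory database with two hypothetical GI tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE cog_protein  (gi INTEGER PRIMARY KEY, gene_name TEXT);
        CREATE TABLE kegg_protein (gi INTEGER PRIMARY KEY, gene_name TEXT);
        """
    )
    # The only rows are the single example given in the quoted question.
    conn.execute("INSERT INTO cog_protein VALUES (?, ?)", (14600509, "APE0180"))
    conn.execute("INSERT INTO kegg_protein VALUES (?, ?)", (5103570, "APE0180"))
    conn.commit()
    return conn


def map_cog_to_kegg(conn):
    """Pair COG and Kegg GI numbers that share the same gene name."""
    query = """
        SELECT c.gi, k.gi, c.gene_name
        FROM cog_protein AS c
        JOIN kegg_protein AS k ON k.gene_name = c.gene_name
        ORDER BY c.gi
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for cog_gi, kegg_gi, gene in map_cog_to_kegg(build_demo_db()):
        print(gene, cog_gi, kegg_gi)
```

Joining on gene name alone assumes the names are unique.  Across many genomes 
a real query would probably need an organism or genome key as well.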

A Rational Rose UML Data Model is provided as well as relevant SQL tables.

Requirements: C/C++, Perl, a compiled version of the NCBI toolkit, and a 
relational database management system.

Contact information for the Baylor College of Medicine Office of Technology 
is as follows --

Baylor College of Medicine
Office of Technology Administration (i.e. Baylor Licensing)
One Baylor Plaza
Mail Stop:  BCM210 600D
Houston, TX 77030
P (713) 798-6821
F (713) 798-1252
lhope at bcm.tmc.edu


Pam Culpepper

>From: "Stefanie Lager" <stefanielager at fastmail.ca>
>Reply-To: bio_bulletin_board at bioinformatics.org
>To: bio_bulletin_board at bioinformatics.org
>Subject: Re: [BiO BB] GI numbers
>Date: Wed, 31 Mar 2004 10:44:26 +0000 (UTC)
>Try linking them through LocusLink, either using one of the mapping
>tables found at: ftp://ftp.ncbi.nih.gov/refseq/LocusLink/ or (a bit more
>complicated) using a system like OpenBNS: http://openbns.sourceforge.net/
> > Hi,
> >
> > I'm analyzing a set of sequences with regard to their classifications
> > as homologs from both COG and Kegg databases of orthologs. Although
> > both COG and Kegg provide tables relating gene names to GI (PID)
> > numbers, I'm, up to this moment, unable to map GIs from one dataset to
> > the other, in order to check classifications for genes in both
> > catalogs.
> >
> > GIs from COG appear to be from RefSeq and those from Kegg seem to be
> > from GenPept. How can I map GI numbers from Kegg to GI numbers from
> > COG database? Is there any query I can make to download such info for
> > 185904 proteins in COG and their equivalents on Kegg Orthologs
> > database?
> >
> > Here is an example:
> >
> > Sequence 14600509 is the protein coded by gene APE0180 from Aeropyrum
> > pernix complete genome, as described in COG's table myva=gb. The same
> > sequence is identified by GI 5103570 in Kegg. In this case, I was able
> > to map COG's GI to Kegg's GI by using the gene identifier and annotation,
> > a procedure that is not easily automated.
> >
> > How can I retrieve equivalent IDs for the whole COG gene set?
> >
> > Thanks in advance for any help.
> > Robson