[Biophp-dev] My humunguous posting - finally got it out of my system!
Serge Gregorio
biophp-dev@bioinformatics.org
Tue, 10 Jun 2003 04:17:54 +0800
Hello all! Ola amigos! Salut tout le monde! Gluchlich Geburstag! Err.. Guten Tag!
Yesterday, Greg suggested using the BioSQL database schema for the Seq object-to-SQL Database interface Nico and I were discussing.
I've downloaded the MySQL scripts and did an inventory of the database tables it creates:
1) taxon
2) taxon_name
3) ontology
4) term
5) term_synonym
6) term_dbxref
7) term_relationship
8) term_path
9) bioentry
10) bioentry_relationship
11) bioentry_path
12) biosequence
13) dbxref
14) dbxref_qualifier_value
15) bioentry_dbxref
16) reference
17) bioentry_reference
18) comment
19) bioentry_qualifier_value
20) seqfeature
21) seqfeature_relationship
22) seqfeature_path
23) seqfeature_qualifier_value
24) seqfeature_dbxref
25) location
26) location_qualifier_value
Initial reaction: Whoa, it's simply overwhelming! But if Greg or anyone here would volunteer to relate the pertinent table/columns to what I have at http://genephp.sf.net/mysql_dbscripts.html, it would certainly be a big help.
Nonetheless, I proceeded to write some PHP code that generates SQL INSERTs from data displayed on web forms. You can find them at:
http://genephp.sf.net/parse_na_gb_par.html [1]
http://genephp.sf.net/parse_na_gb_res.php [2]
http://genephp.sf.net/insert_na_gb_res.php [3]
You start out with [1] where you paste a GenBank DNA record, click the SUBMIT button, go to [2], make changes to the data if you like, click on the WRITE SQL button, and you arrive at [3]. It's still not finished yet (there are no SQL statements for the REFERENCES and FEATURES table, among others), but we'll get there.
I'm including the code that does this (see end of this email). The SQL statements are for the sample database which uses MySQL.
Nico has graciously offered to help out in the MySQL-to-ADODB conversion (so the SQL scripts could be used for other RDBMs).
If anyone would like to work on a MySQL-to-PearDB conversion of the CREATEs in http://genephp.sf.net/mysql_dbscripts.html,
or the INSERTs at insert_na_gb_res.php, you're mighty welcome!
If anyone would also like to work on a DATABASE SCHEMA for data extracted by the various parser scripts (there are 30 formats supported fully/partially at http://genephp.sf.net/io_set.html), well, that would be swell...
Referring to the BioSQL schema every now and then is encouraged.
Senor Andres Pinzon, could you help me out with some translation work? Drop me a note at serge525@hotmail.com. Thanks!
Herr Frankie: Next email, I'll attach a PDF file which is a survey/overview of computational biology algorithms. Hope it helps you vis-a-vis your earlier email re: traceback routines.
>Cayte: I'm a nature buff living in New England with a summer cottage on an
>island. I like birds, cats, raindrops, seasonal change and large windows
>that let in the sun. I've offered to help with genePHP and I am
>investigating CIF (unless you suggest something with more priority ).
>Although I haven't yet scripted a project in php, I've worked with similar
>languages like C, javascript, perl, python and lisp.
As I promised Cayte, here's my own short spiel: I'm a computer gaming buff living in the Philippines, more precisely in Marikina City, home of the world's largest pair of shoes (http://www.marikina.gov.ph).
I like dogs, temperate weather, and large glass ceiling through which I can view the stars at night... but I have none of these at the moment. =(
I started GenePHP well... on a cerebral level, because I think biotech has a lot to offer humanity, and on a motivational or emotional level, because I just got tired of writing business apps, and wanted to do something more challenging. I would also like to believe that I have recessive "scientist"
gene in me, that has been long dormant but is now expressing itself.
Lastly, I envy your proficiency in several computer languages. The only python I know is the one at the local zoo. Seriously now, I'm sure you have a lot to contribute to the project.
The priority now is to MAKE BIOPHP INTERFACE WITH AS MANY DATA FORMATS/FILES, RDBMs, and BIO APPS (web and non-web) AS POSSIBLE. (For example, Nico has put in a pDRAW parser, Sean is working on NCBI's eSearch & eFetch utility, etc.) CIF is one of them. You are right in investigating CIF first, looking out for opportunities for using BioPHP. I wouldn't plunge into coding it yet though.
Another priority is developing a class schema for the project which is further subdivided into two: data and methods. For data, the challenge is to integrate
data from several sources. Some databases have overlapping data but call them different names, or store them in different formats/data types. Some have data that others don't have. And so on.
== SIDEBAR =======================================
I'd like to know how people work with bioinformatics data and applications -- and draw/diagram it as some sort of "computational pathway" similar to "biological pathways". For instance:
GenBank -> DNA Sequences -> Clustalw -> ALN file
-> Custom ALN viewer/analyzer -> etc.
Anyone care to add more (usage diagrams)?
==================================================
There is also the issue of WHAT to model, i.e. choosing the classes, delineating the classes. For instance, do we create a separate class for an AMINO ACID? A CODON? AN ENZYME? A PROTEIN'S SECONDARY, TERTIARY, QUATERNARY STRUCTURE?
Current OO features of PHP seem to favor simpler structures. But with PHP 5 just around the corner, should this (still) be the case?
Well, these are just some of the issues confronting us. Isn't it exciting? =)
Cheers!
Serge
=======================================================
/* SOURCE CODE LISTING FOR:
1) insert_na_gb() - 'na' is Nucleic Acid,
'gb' is GenBank.
2) format_mysqlcol()
*/
// Accepts a Seq object and returns an array of strings. Each string is an SQL
// INSERT statement.
function insert_na_gb($oSeq)
{
$aSQL_Stmt = array();
// SQL for SEQUENCE table: PRIME_ACC ENTRY_NAME SEQ_LENGTH MOL_TYPE DATE SOURCE SEQUENCE
$accession = $oSeq->accession;
$entry_name = $oSeq->id;
$seqlen = format_mysqlcol($oSeq->seqlength);
$moltyp = format_mysqlcol($oSeq->moltype);
$date = format_mysqlcol($oSeq->date);
$source = format_mysqlcol($oSeq->source);
$seqdat = format_mysqlcol($oSeq->sequence);
$desc = format_mysqlcol($oSeq->definition);
$organi = format_mysqlcol($oSeq->organism);
$sql = "INSERT INTO sequence VALUES('$accession', '$entry_name', " .
"$seqlen, $moltyp, $date, $source, $seqdat, $desc, $organi)";
$aSQL_Stmt[] = $sql;
// SQL for GBSEQUENCE table: STRANDS, TOPOLOGY, DIVISION, etc.
$stra = format_mysqlcol($oSeq->strands);
$topo = format_mysqlcol($oSeq->topology);
$divi = format_mysqlcol($oSeq->division);
$segno = format_mysqlcol($oSeq->segment_no);
$segct = format_mysqlcol($oSeq->segment_count);
$versi = format_mysqlcol($oSeq->version);
$ncbid = format_mysqlcol($oSeq->ncbi_gi_id);
$sql = "INSERT INTO gbsequence VALUES('$accession', $stra, $topo, $divi, $segno, $segct, $versi, $ncbid)";
$aSQL_Stmt[] = $sql;
// ACCESSION table:
if (count($oSeq->sec_accession) > 0)
{
foreach($oSeq->sec_accession as $sec_acc)
{
$sql = "INSERT INTO accession VALUES('$accession', '$sec_acc')";
$aSQL_Stmt[] = $sql;
}
}
// KEYWORDS table
if ( (isset($oSeq->keywords)) and (count($oSeq->keywords) >= 1) and ($oSeq->keywords[0] != "") )
{
foreach($oSeq->keywords as $kword)
{
$sql = "INSERT INTO keywords VALUES('$accession', '$kword')";
$aSQL_Stmt[] = $sql;
}
}
return $aSQL_Stmt;
}
// format_mysqlcol() formats the Seq object attribute to conform with MySQL syntax
// esp. in the case of NULL values. Recall that a blank string or a zero (0) is
// NOT the same as NULL (which stands for unknown/unavailable data).
function format_mysqlcol($value)
{
if ($value == NULL) return 'NULL';
elseif (gettype($value) == "string")
{
if (strlen(trim($value)) > 0) return ("'" . addslashes($value) . "'");
else return 'NULL';
} }
elseif (is_numeric($value) == TRUE) return $value;
else return $value;
// later, add more ifs for other data types like boolean values
}
Need a new email address that people can remember
Check out the new EudoraMail at
http://www.eudoramail.com