Hello all! Ola amigos! Salut tout le monde! Gluchlich Geburstag! Err.. Guten Tag! Yesterday, Greg suggested using the BioSQL database schema for the Seq object-to-SQL Database interface Nico and I were discussing. I've downloaded the MySQL scripts and did an inventory of the database tables it creates: 1) taxon 2) taxon_name 3) ontology 4) term 5) term_synonym 6) term_dbxref 7) term_relationship 8) term_path 9) bioentry 10) bioentry_relationship 11) bioentry_path 12) biosequence 13) dbxref 14) dbxref_qualifier_value 15) bioentry_dbxref 16) reference 17) bioentry_reference 18) comment 19) bioentry_qualifier_value 20) seqfeature 21) seqfeature_relationship 22) seqfeature_path 23) seqfeature_qualifier_value 24) seqfeature_dbxref 25) location 26) location_qualifier_value Initial reaction: Whoa, it's simply overwhelming! But if Greg or anyone here would volunteer to relate the pertinent table/columns to what I have at http://genephp.sf.net/mysql_dbscripts.html, it would certainly be a big help. Nonetheless, I proceeded to write some PHP code that generates SQL INSERTs from data displayed on web forms. You can find them at: http://genephp.sf.net/parse_na_gb_par.html [1] http://genephp.sf.net/parse_na_gb_res.php [2] http://genephp.sf.net/insert_na_gb_res.php [3] You start out with [1] where you paste a GenBank DNA record, click the SUBMIT button, go to [2], make changes to the data if you like, click on the WRITE SQL button, and you arrive at [3]. It's still not finished yet (there are no SQL statements for the REFERENCES and FEATURES table, among others), but we'll get there. I'm including the code that does this (see end of this email). The SQL statements are for the sample database which uses MySQL. Nico has graciously offered to help out in the MySQL-to-ADODB conversion (so the SQL scripts could be used for other RDBMs). If anyone would like to work on a MySQL-to-PearDB conversion of the CREATEs in http://genephp.sf.net/mysql_dbscripts.html, or the INSERTs at insert_na_gb_res.php, you're mighty welcome! If anyone would also like to work on a DATABASE SCHEMA for data extracted by the various parser scripts (there are 30 formats supported fully/partially at http://genephp.sf.net/io_set.html), well, that would be swell... Referring to the BioSQL schema every now and then is encouraged. Senor Andres Pinzon, could you help me out with some translation work? Drop me a note at serge525@hotmail.com. Thanks! Herr Frankie: Next email, I'll attach a PDF file which is a survey/overview of computational biology algorithms. Hope it helps you vis-a-vis your earlier email re: traceback routines. >Cayte: I'm a nature buff living in New England with a summer cottage on an >island. I like birds, cats, raindrops, seasonal change and large windows >that let in the sun. I've offered to help with genePHP and I am >investigating CIF (unless you suggest something with more priority ). >Although I haven't yet scripted a project in php, I've worked with similar >languages like C, javascript, perl, python and lisp. As I promised Cayte, here's my own short spiel: I'm a computer gaming buff living in the Philippines, more precisely in Marikina City, home of the world's largest pair of shoes (http://www.marikina.gov.ph). I like dogs, temperate weather, and large glass ceiling through which I can view the stars at night... but I have none of these at the moment. =( I started GenePHP well... on a cerebral level, because I think biotech has a lot to offer humanity, and on a motivational or emotional level, because I just got tired of writing business apps, and wanted to do something more challenging. I would also like to believe that I have recessive "scientist" gene in me, that has been long dormant but is now expressing itself. Lastly, I envy your proficiency in several computer languages. The only python I know is the one at the local zoo. Seriously now, I'm sure you have a lot to contribute to the project. The priority now is to MAKE BIOPHP INTERFACE WITH AS MANY DATA FORMATS/FILES, RDBMs, and BIO APPS (web and non-web) AS POSSIBLE. (For example, Nico has put in a pDRAW parser, Sean is working on NCBI's eSearch & eFetch utility, etc.) CIF is one of them. You are right in investigating CIF first, looking out for opportunities for using BioPHP. I wouldn't plunge into coding it yet though. Another priority is developing a class schema for the project which is further subdivided into two: data and methods. For data, the challenge is to integrate data from several sources. Some databases have overlapping data but call them different names, or store them in different formats/data types. Some have data that others don't have. And so on. == SIDEBAR ======================================= I'd like to know how people work with bioinformatics data and applications -- and draw/diagram it as some sort of "computational pathway" similar to "biological pathways". For instance: GenBank -> DNA Sequences -> Clustalw -> ALN file -> Custom ALN viewer/analyzer -> etc. Anyone care to add more (usage diagrams)? ================================================== There is also the issue of WHAT to model, i.e. choosing the classes, delineating the classes. For instance, do we create a separate class for an AMINO ACID? A CODON? AN ENZYME? A PROTEIN'S SECONDARY, TERTIARY, QUATERNARY STRUCTURE? Current OO features of PHP seem to favor simpler structures. But with PHP 5 just around the corner, should this (still) be the case? Well, these are just some of the issues confronting us. Isn't it exciting? =) Cheers! Serge ======================================================= /* SOURCE CODE LISTING FOR: 1) insert_na_gb() - 'na' is Nucleic Acid, 'gb' is GenBank. 2) format_mysqlcol() */ // Accepts a Seq object and returns an array of strings. Each string is an SQL // INSERT statement. function insert_na_gb($oSeq) { $aSQL_Stmt = array(); // SQL for SEQUENCE table: PRIME_ACC ENTRY_NAME SEQ_LENGTH MOL_TYPE DATE SOURCE SEQUENCE $accession = $oSeq->accession; $entry_name = $oSeq->id; $seqlen = format_mysqlcol($oSeq->seqlength); $moltyp = format_mysqlcol($oSeq->moltype); $date = format_mysqlcol($oSeq->date); $source = format_mysqlcol($oSeq->source); $seqdat = format_mysqlcol($oSeq->sequence); $desc = format_mysqlcol($oSeq->definition); $organi = format_mysqlcol($oSeq->organism); $sql = "INSERT INTO sequence VALUES('$accession', '$entry_name', " . "$seqlen, $moltyp, $date, $source, $seqdat, $desc, $organi)"; $aSQL_Stmt[] = $sql; // SQL for GBSEQUENCE table: STRANDS, TOPOLOGY, DIVISION, etc. $stra = format_mysqlcol($oSeq->strands); $topo = format_mysqlcol($oSeq->topology); $divi = format_mysqlcol($oSeq->division); $segno = format_mysqlcol($oSeq->segment_no); $segct = format_mysqlcol($oSeq->segment_count); $versi = format_mysqlcol($oSeq->version); $ncbid = format_mysqlcol($oSeq->ncbi_gi_id); $sql = "INSERT INTO gbsequence VALUES('$accession', $stra, $topo, $divi, $segno, $segct, $versi, $ncbid)"; $aSQL_Stmt[] = $sql; // ACCESSION table: if (count($oSeq->sec_accession) > 0) { foreach($oSeq->sec_accession as $sec_acc) { $sql = "INSERT INTO accession VALUES('$accession', '$sec_acc')"; $aSQL_Stmt[] = $sql; } } // KEYWORDS table if ( (isset($oSeq->keywords)) and (count($oSeq->keywords) >= 1) and ($oSeq->keywords[0] != "") ) { foreach($oSeq->keywords as $kword) { $sql = "INSERT INTO keywords VALUES('$accession', '$kword')"; $aSQL_Stmt[] = $sql; } } return $aSQL_Stmt; } // format_mysqlcol() formats the Seq object attribute to conform with MySQL syntax // esp. in the case of NULL values. Recall that a blank string or a zero (0) is // NOT the same as NULL (which stands for unknown/unavailable data). function format_mysqlcol($value) { if ($value == NULL) return 'NULL'; elseif (gettype($value) == "string") { if (strlen(trim($value)) > 0) return ("'" . addslashes($value) . "'"); else return 'NULL'; } } elseif (is_numeric($value) == TRUE) return $value; else return $value; // later, add more ifs for other data types like boolean values } Need a new email address that people can remember Check out the new EudoraMail at http://www.eudoramail.com