[Biophp-dev] My humunguous posting - finally got it out of my system!

Serge Gregorio biophp-dev@bioinformatics.org
Tue, 10 Jun 2003 04:17:54 +0800


Hello all!  Ola amigos!  Salut tout le monde!  Gluchlich Geburstag! Err.. Guten Tag!

Yesterday, Greg suggested using the BioSQL database schema for the Seq object-to-SQL Database interface Nico and I were discussing.

I've downloaded the MySQL scripts and did an inventory of the database tables it creates:

  1) taxon
  2) taxon_name
  3) ontology
  4) term
  5) term_synonym
  6) term_dbxref
  7) term_relationship
  8) term_path
  9) bioentry
  10) bioentry_relationship
  11) bioentry_path
  12) biosequence
  13) dbxref
  14) dbxref_qualifier_value
  15) bioentry_dbxref
  16) reference
  17) bioentry_reference
  18) comment
  19) bioentry_qualifier_value
  20) seqfeature
  21) seqfeature_relationship
  22) seqfeature_path
  23) seqfeature_qualifier_value
  24) seqfeature_dbxref
  25) location
  26) location_qualifier_value

Initial reaction: Whoa, it's simply overwhelming!  But if Greg or anyone here would volunteer to relate the pertinent table/columns to what I have at http://genephp.sf.net/mysql_dbscripts.html, it would certainly be a big help.

Nonetheless, I proceeded to write some PHP code that generates SQL INSERTs from data displayed on web forms.  You can find them at:

   http://genephp.sf.net/parse_na_gb_par.html   [1]
   http://genephp.sf.net/parse_na_gb_res.php    [2]
   http://genephp.sf.net/insert_na_gb_res.php   [3]

You start out with [1] where you paste a GenBank DNA record, click the SUBMIT button, go to [2], make changes to the data if you like, click on the WRITE SQL button, and you arrive at [3].  It's still not finished yet (there are no SQL statements for the REFERENCES and FEATURES table, among others), but we'll get there.

I'm including the code that does this (see end of this email).  The SQL statements are for the sample database which uses MySQL.

Nico has graciously offered to help out in the MySQL-to-ADODB conversion (so the SQL scripts could be used for other RDBMs).

If anyone would like to work on a MySQL-to-PearDB conversion of the CREATEs in  http://genephp.sf.net/mysql_dbscripts.html,
or the INSERTs at insert_na_gb_res.php, you're mighty welcome!

If anyone would also like to work on a DATABASE SCHEMA for data extracted by the various parser scripts (there are 30 formats supported fully/partially at http://genephp.sf.net/io_set.html), well, that would be swell...

Referring to the BioSQL schema every now and then is encouraged.

Senor Andres Pinzon, could you help me out with some translation work?  Drop me a note at serge525@hotmail.com.  Thanks!

Herr Frankie: Next email, I'll attach a PDF file which is a survey/overview of computational biology algorithms.  Hope it helps you vis-a-vis your earlier email re: traceback routines.

>Cayte: I'm a nature buff living in New England with a summer cottage on an
>island.  I like birds, cats, raindrops, seasonal change and large windows
>that let in the sun.  I've offered to help with genePHP and I am
>investigating CIF (unless you suggest something with more priority ). 
>Although I haven't yet scripted a project in php, I've worked with similar
>languages like C, javascript, perl, python and lisp.

As I promised Cayte, here's my own short spiel: I'm a computer gaming buff living in the Philippines, more precisely in Marikina City, home of the world's largest pair of shoes (http://www.marikina.gov.ph).
I like dogs, temperate weather, and large glass ceiling through which I can view the stars at night... but I have none of these at the moment.  =(

I started GenePHP well... on a cerebral level, because I think biotech has a lot to offer humanity, and on a motivational or emotional level, because I just got tired of writing business apps, and wanted to do something more challenging. I would also like to believe that I have recessive "scientist"
gene in me, that has been long dormant but is now expressing itself.

Lastly, I envy your proficiency in several computer languages.  The only python I know is the one at the  local zoo.  Seriously now, I'm sure you have a lot to contribute to the project. 

The priority now is to MAKE BIOPHP INTERFACE WITH AS MANY DATA FORMATS/FILES, RDBMs, and BIO APPS (web and non-web) AS POSSIBLE.  (For example, Nico has put in a pDRAW parser, Sean is working on NCBI's eSearch & eFetch utility, etc.)  CIF is one of them.  You are right in investigating CIF first, looking out for opportunities for using BioPHP.  I wouldn't plunge into coding it yet though.

Another priority is developing a class schema for the project which is further subdivided into two: data and methods. For data, the challenge is to integrate
data from several sources.  Some databases have overlapping data but call them different names, or store them in different formats/data types. Some have data that others don't have.  And so on.

== SIDEBAR =======================================

I'd like to know how people work with bioinformatics data and applications -- and draw/diagram it as some sort of "computational pathway" similar to "biological pathways".  For instance:

   GenBank -> DNA Sequences -> Clustalw -> ALN file
   -> Custom ALN viewer/analyzer -> etc.

Anyone care to add more (usage diagrams)?

==================================================

There is also the issue of WHAT to model, i.e. choosing the classes, delineating the classes.  For instance, do we create a separate class for an AMINO ACID?  A CODON?  AN ENZYME?  A PROTEIN'S SECONDARY, TERTIARY, QUATERNARY STRUCTURE? 

Current OO features of PHP seem to favor simpler structures.  But with PHP 5 just around the corner, should this (still) be the case?

Well, these are just some of the issues confronting us.  Isn't it exciting?  =)

Cheers!

Serge

=======================================================

/* SOURCE CODE LISTING FOR:
   1) insert_na_gb() - 'na' is Nucleic Acid,
      'gb' is GenBank.
   2) format_mysqlcol()
*/

// Accepts a Seq object and returns an array of strings.  Each string is an SQL 
// INSERT statement.
function insert_na_gb($oSeq)
   {
   $aSQL_Stmt = array();

   // SQL for SEQUENCE table: PRIME_ACC ENTRY_NAME SEQ_LENGTH MOL_TYPE DATE SOURCE SEQUENCE
   $accession = $oSeq->accession;
   $entry_name = $oSeq->id;
   $seqlen = format_mysqlcol($oSeq->seqlength);
   $moltyp = format_mysqlcol($oSeq->moltype);
   $date   = format_mysqlcol($oSeq->date);
   $source = format_mysqlcol($oSeq->source);
   $seqdat = format_mysqlcol($oSeq->sequence);	
   $desc   = format_mysqlcol($oSeq->definition);
   $organi = format_mysqlcol($oSeq->organism);

   $sql = "INSERT INTO sequence VALUES('$accession', '$entry_name', " . 
          "$seqlen, $moltyp, $date, $source, $seqdat, $desc, $organi)";					 					 	  		 

   $aSQL_Stmt[] = $sql;
		 
   // SQL for GBSEQUENCE table: STRANDS, TOPOLOGY, DIVISION, etc.	
   $stra = format_mysqlcol($oSeq->strands);
   $topo = format_mysqlcol($oSeq->topology);
   $divi = format_mysqlcol($oSeq->division);
   $segno = format_mysqlcol($oSeq->segment_no);
   $segct = format_mysqlcol($oSeq->segment_count);	
   $versi = format_mysqlcol($oSeq->version);		
   $ncbid = format_mysqlcol($oSeq->ncbi_gi_id);
   $sql = "INSERT INTO gbsequence VALUES('$accession', $stra, $topo, $divi, $segno, $segct, $versi, $ncbid)";			 	
	  
   $aSQL_Stmt[] = $sql;  
  
   // ACCESSION table:	
   if (count($oSeq->sec_accession) > 0)
      {
      foreach($oSeq->sec_accession as $sec_acc)
         { 
      	$sql = "INSERT INTO accession VALUES('$accession', '$sec_acc')";	
         $aSQL_Stmt[] = $sql;  		
         }
      }

   // KEYWORDS table 	
   if ( (isset($oSeq->keywords)) and (count($oSeq->keywords) >= 1) and ($oSeq->keywords[0] != "") )
      {
      foreach($oSeq->keywords as $kword)
         {
         $sql = "INSERT INTO keywords VALUES('$accession', '$kword')";	
         $aSQL_Stmt[] = $sql;
         }
      }			

   return $aSQL_Stmt;
   }	

// format_mysqlcol() formats the Seq object attribute to conform with MySQL syntax
// esp. in the case of NULL values.  Recall that a blank string or a zero (0) is 
// NOT the same as NULL (which stands for unknown/unavailable data).

function format_mysqlcol($value)
   {
   if ($value == NULL) return 'NULL';
   elseif (gettype($value) == "string")
      {
      if (strlen(trim($value)) > 0) return ("'" . addslashes($value) . "'");
      else return 'NULL';
      }		}
   elseif (is_numeric($value) == TRUE) return $value;
   else return $value;
   // later, add more ifs for other data types like boolean values
   }


Need a new email address that people can remember
Check out the new EudoraMail at
http://www.eudoramail.com