[BiO BB] protein clustering threshold

Mon Feb 9 16:02:44 EST 2004

I heard about an enzyme where a single amino acid change alters
substrate specificity. Also, mutating an active site residue will destroy
function. When clustering, I always keep mappings to the cluster members
(groupies), so you can assess things like functional annotations and species
distributions within clusters. The best reason for clustering that I know of
is to improve sequence search statistics and speed without sacrificing
coverage. If you are looking for genuine 'biological' non-redundancy, use
RefSeq or similar (if possible).

It should be very easy for you to get lots of examples from the following
file...

ftp://ftp.ebi.ac.uk/pub/databases/uniprot/uniref/uniref90/uniref90.xml.gz

It contains XML-format 'cluster' data for Swiss-Prot + TrEMBL records
derived from CD-HIT. I have an example XML parser in Perl if you need help
parsing this file.
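
For anyone who would rather not wait for the Perl parser, here is a rough
Python sketch of streaming through such a file one cluster at a time. The
element and attribute names ("entry", "representativeMember", "member",
"dbReference", "id") are my assumptions about the layout and are passed in
as parameters, so check them against the real file before relying on this.
The small inline sample is made up to match those assumed names.

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def iter_clusters(stream, entry_tag="entry", rep_tag="representativeMember",
                  member_tag="member", ref_tag="dbReference"):
    """Yield (representative_id, [member_ids]) for each cluster in the stream.

    All tag names are assumptions about the file layout, not a confirmed schema.
    """
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != entry_tag:
            continue
        rep, members = None, []
        for child in elem:
            kind = local(child.tag)
            if kind not in (rep_tag, member_tag):
                continue
            # Take the first reference element inside this member, if any.
            ref = next((c for c in child if local(c.tag) == ref_tag), None)
            if ref is None:
                continue
            if kind == rep_tag:
                rep = ref.get("id")
            else:
                members.append(ref.get("id"))
        yield rep, members
        elem.clear()  # drop the finished cluster's children to save memory


# Tiny made-up sample using the assumed layout.
SAMPLE = b"""<UniRef90>
  <entry id="cluster1">
    <representativeMember><dbReference id="P51857"/></representativeMember>
    <member><dbReference id="P23457"/></member>
    <member><dbReference id="P05980"/></member>
  </entry>
</UniRef90>"""

clusters = list(iter_clusters(io.BytesIO(SAMPLE)))
print(clusters)
```

For the real download you would pass gzip.open("uniref90.xml.gz", "rb")
instead of the in-memory sample, so the file never has to be unpacked or
loaded whole.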

Using this you could easily pull out cluster members (at 90%) with
different EC numbers as annotated in Swiss-Prot.
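
As a sketch of that last step, assuming you already have (accession, EC
number, cluster representative) rows from somewhere, the grouping could look
like this in Python. The first three rows are copied from the example table
in this post; the REP_X rows are invented placeholders so that there is also
a cluster with a single EC number. The function name is hypothetical.

```python
from collections import defaultdict

# (accession, ec_number, cluster_representative)
ROWS = [
    ("P23457", "1.1.1.50", "P51857"),
    ("P05980", "1.1.1.188", "P51857"),
    ("P31210", "1.3.99.6", "P51857"),
    ("ACC_A", "9.9.9.1", "REP_X"),  # invented placeholder row
    ("ACC_B", "9.9.9.1", "REP_X"),  # invented placeholder row
]


def mixed_function_clusters(rows):
    """Return {representative: sorted EC numbers} for clusters
    whose members carry more than one distinct EC number."""
    ecs_by_rep = defaultdict(set)
    for _accession, ec, rep in rows:
        ecs_by_rep[rep].add(ec)
    return {rep: sorted(ecs) for rep, ecs in ecs_by_rep.items() if len(ecs) > 1}


mixed = mixed_function_clusters(ROWS)
print(mixed)
```

Note that this treats incomplete EC numbers such as 1.1.1.- as distinct from
fully specified ones, which probably overstates the real functional
differences; you may want to filter those out first.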

Actually this information should be quite useful for the annotators.

You just made me curious...

Here is a (bad) example...

+--------+-----------------------------------------+-----------+--------+
| accn   | name                                    | ec        | rep    |
+--------+-----------------------------------------+-----------+--------+
| P70694 | Estradiol 17 beta-dehydrogenase 5       | 1.1.1.-   | P51857 |
| P52895 | Aldo-keto reductase family 1 member C2  | 1.1.1.-   | P51857 |
| P52895 | Aldo-keto reductase family 1 member C2  | 1.3.1.20  | P51857 |
| Q04828 | Aldo-keto reductase family 1 member C1  | 1.1.1.-   | P51857 |
| Q04828 | Aldo-keto reductase family 1 member C1  | 1.3.1.20  | P51857 |
| P42330 | Aldo-keto reductase family 1 member C3  | 1.1.1.-   | P51857 |
| P42330 | Aldo-keto reductase family 1 member C3  | 1.1.1.188 | P51857 |
| P42330 | Aldo-keto reductase family 1 member C3  | 1.3.1.20  | P51857 |
| P80508 | Prostaglandin-E2 9-reductase            | 1.1.1.149 | P51857 |
| P80508 | Prostaglandin-E2 9-reductase            | 1.1.1.189 | P51857 |
| P23457 | 3-alpha-hydroxysteroid dehydrogenase    | 1.1.1.50  | P51857 |
| P05980 | Prostaglandin-F synthase 1              | 1.1.1.188 | P51857 |
| P52897 | Prostaglandin-F synthase 2              | 1.1.1.188 | P51857 |
| P17516 | Aldo-keto reductase family 1 member C4  | 1.1.1.-   | P51857 |
| P17516 | Aldo-keto reductase family 1 member C4  | 1.1.1.225 | P51857 |
| P17516 | Aldo-keto reductase family 1 member C4  | 1.1.1.50  | P51857 |
| Q8VC28 | Aldo-keto reductase family 1 member C13 | 1.1.1.-   | P51857 |
| P51652 | 20-alpha-hydroxysteroid dehydrogenase   | 1.1.1.149 | P51857 |
| P31210 | 3-oxo-5-beta-steroid 4-dehydrogenase    | 1.3.99.6  | P51857 |
| P52898 | Dihydrodiol dehydrogenase 3             | 1.-.-.-   | P51857 |
+--------+-----------------------------------------+-----------+--------+

Is this revealing a complex cluster?

Ta,
Dan.

P.S. The above cluster file isn't enough to get this data; you need the
full UniProt data set. I am using a rough SQL schema of my own design,
built using (copying) swissknife. Does anyone know how to convert the XML
schema for the XML file into an SQL schema? Should I just use XML SQL?

On Mon, 9 Feb 2004, Hongyu Zhang wrote:

> 
> 
> Does anyone have examples of two proteins with >90%
> sequence identity that have different functions? I need it as
> proof to cluster a protein sequence library.
> 
> Thanks!
> 
> Hongyu