Bioinformatics.org
  • CD-HIT: Sequence clustering software - Support tickets


    [ Ticket #195 ] Longest sequence first???
    Date: 08/19/04 07:28
    Submitted by: dmb
    Assigned to: liwz
    Category: Clustering
    Priority: 9
    Ticket group: Critical
    Resolution: Resolved
    Summary: Longest sequence first???
    Original submission:


    CD-HIT should cluster from the longest sequence first, taking that sequence to be the representative.

    This doesn't appear to happen in a very simple test case:


    >15982
    kkekspkgkssispqarafleqvfrrkqslnskekeevakkcgitplqvrvwfinkrmrsk
    >79112
    aaaaaispqarafleqvfrrkqslnskekeevakkcgitplqvrvwfinkrmrsk
    >15981
    ispqarafleqvfrrkqslnskekeevakkcgitplqvrvwfinkrmrs


    (Which is easy to align...

    >15982
    kkekspkgkssispqarafleqvfrrkqslnskekeevakkcgitplqvrvwfinkrmrsk
    >79112
    ------aaaaaispqarafleqvfrrkqslnskekeevakkcgitplqvrvwfinkrmrsk
    >15981
    -----------ispqarafleqvfrrkqslnskekeevakkcgitplqvrvwfinkrmrs-

    )


    The longest sequence (SCOP SUNID 15982) should not cluster with 79112 at the 100 percent identity threshold, but should then cluster with 15981 at 100%.

    However, the results of the 100% clustering are:

    >Cluster 0
    0 61aa, >15982... *
    >Cluster 1
    0 55aa, >79112... *
    1 49aa, >15981... at 100%

    For some reason 79112 is clustering with 15981 at 100 percent *first*, so when 15982 comes along it doesn't see 100% identity to 15981 (which is hidden behind its representative, 79112) and forms a cluster on its own.

    The correct clustering should be

    >Cluster 0
    0 61aa, >15982... *
    1 49aa, >15981... at 100%
    >Cluster 1
    0 55aa, >79112... *

    What is the problem here?
    Followups
    Comment Date By

    I guess that's the way the code is written. (I was wondering about it too, and it seems to differ from the description in the paper, so I checked the code.)

    A sequence seems to be compared to the shortest representative sequence first. Therefore, 15981 is compared to 79112 first. Because they are identical above the specified threshold, 15981 is clustered with 79112 without ever being compared to 15982.
    01/07/06 21:28 unset
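
    The behaviour described in this follow-up can be reproduced with a small sketch. This is not CD-HIT's code. It is a minimal greedy longest-first clustering written for illustration, and it assumes that "100% identity" means the shorter sequence occurs as an ungapped substring of the representative. The names `greedy_cluster`, `shortest_rep_first` and `SEQS` are hypothetical.

```python
# Illustrative sketch only, NOT CD-HIT's implementation.
# Assumption: at a 100% threshold, a sequence joins a cluster when it is an
# exact ungapped substring of that cluster's representative.

SEQS = {
    "15982": "kkekspkgkssispqarafleqvfrrkqslnskekeevakkcgitplqvrvwfinkrmrsk",
    "79112": "aaaaaispqarafleqvfrrkqslnskekeevakkcgitplqvrvwfinkrmrsk",
    "15981": "ispqarafleqvfrrkqslnskekeevakkcgitplqvrvwfinkrmrs",
}


def greedy_cluster(seqs, shortest_rep_first=False):
    """Cluster sequences greedily, processing the longest sequence first.

    Returns a list of (representative_id, [member_ids]) in the order the
    clusters were created (longest representative first).
    """
    # Process sequences from longest to shortest.
    order = sorted(seqs, key=lambda sid: len(seqs[sid]), reverse=True)
    clusters = []  # each entry: [representative_id, [member_ids]]
    for sid in order:
        # Clusters are created in descending length order, so reversing the
        # list visits the shortest representative first.
        candidates = clusters[::-1] if shortest_rep_first else clusters
        for cluster in candidates:
            if seqs[sid] in seqs[cluster[0]]:
                cluster[1].append(sid)  # join the first matching cluster
                break
        else:
            clusters.append([sid, []])  # no match: become a representative
    return [(rep, members) for rep, members in clusters]


# Longest representative checked first: 15981 joins 15982 (expected result).
expected = greedy_cluster(SEQS)
# Shortest representative checked first: 15981 joins 79112 (reported result).
observed = greedy_cluster(SEQS, shortest_rep_first=True)
```

    Under these assumptions, the only difference between the two results is the order in which existing representatives are checked. 15981 matches both 15982 and 79112 at 100%, so it lands in whichever cluster is tried first.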
    Ticket change history
    Field Old value Date By
    status_id Open 05/16/11 00:37 liwz
    resolution_id Unset 05/16/11 00:37 liwz
    close_date 12/31/69 19:00 05/16/11 00:37 liwz
    priority 8 05/24/05 11:32 dmb
    status_id Unset 08/19/04 07:29 dmb
    priority 5 08/19/04 07:29 dmb
    assigned_to unset 08/19/04 07:29 dmb

     
