[Genquire-dev] Re: new tables (fwd)

David Block dblock@gnf.org
Fri, 16 Nov 2001 15:27:14 -0800 (PST)


We should probably handle repeats this way too, since the Ensembl people are
actually thinking it through.

Mark?

-- 
David Block
GNF http://www.gnf.org  La Jolla, California
dblock@gnf.org        Let's not talk about the weather...

---------- Forwarded message ----------
Date: Fri, 16 Nov 2001 14:36:54 +0000 (GMT)
From: James Gilbert <jgrg@sanger.ac.uk>
To: Arne Stabenau <stabenau@ebi.ac.uk>
Cc: Laura Clarke <lec@sanger.ac.uk>,
     Ensembl Development <ensembl-dev@ebi.ac.uk>
Subject: Re: new tables



Arne and Laura,

I need to write functions like:

  my @alu = $vc->get_all_Repeats_by_class('SINE');

I also want to keep all the repeats in one table,
so that get_repeatmasked_seq only has to visit one
table to get all the mask coordinates, and is
therefore nice and fast.

I propose the following:


CREATE TABLE repeat_feature (
  contig_id     int(10) unsigned NOT NULL,
  contig_start  int(10) unsigned NOT NULL,
  contig_end    int(10) unsigned NOT NULL,
  contig_strand tinyint(1) DEFAULT '1' NOT NULL, # 1 = positive orientation, 0 = not oriented, -1 = negative orientation
  repeat_id     int unsigned NOT NULL,
  repeat_start  int(10) NOT NULL,
  repeat_end    int(10) NOT NULL,
  analysis_id   int(10) unsigned NOT NULL,
  score float,
  
  KEY contig_idx( contig_id, contig_start, analysis_id ),
  KEY repeat_type( contig_id, repeat_id, contig_start )
) max_rows=300000000 avg_row_length=80;

CREATE TABLE repeat (
    repeat_id           int unsigned NOT NULL auto_increment,
    repeat_name         varchar(255) NOT NULL,
    repeat_class        varchar(40) NOT NULL,   # eg:  SINE, LINE, DNA Transposon,
                                                # Retroviral LTR, Satellite, Tandem
    repeat_consensus    text,   # Or dna_id with entry in DNA table?
    repeat_length       int NOT NULL,  # not needed with repeat_consensus?
    
    PRIMARY KEY( repeat_id )
);
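
A rough sketch of the kind of query get_all_Repeats_by_class('SINE')
might issue against these two tables. This is an assumption about how
the method could be implemented, not existing code; the contig id 233 is
just an illustrative value.

```sql
# Sketch only: every SINE repeat on one contig, in contig order.
# `repeat` is backquoted because newer MySQL versions reserve the word REPEAT.
SELECT f.contig_start,
       f.contig_end,
       f.contig_strand,
       r.repeat_name
FROM   repeat_feature f, `repeat` r
WHERE  f.repeat_id    = r.repeat_id
  AND  f.contig_id    = 233          # illustrative contig id
  AND  r.repeat_class = 'SINE'
ORDER BY f.contig_start;
```

The class lives only in the repeat table, so each distinct repeat is
described once and repeat_feature rows stay small.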

  
So an Alu would be represented:

  repeat_feature:
    contig_id         233
    contig_start      27673
    contig_end        27953
    contig_strand     1
    repeat_id         15
    repeat_start      40
    repeat_end        311
    score             2199
    analysis_id       17

  repeat:
    repeat_id         15
    repeat_name       AluY
    repeat_class      SINE
    repeat_length     311
    repeat_consensus  RGCCGGGCGCGGTGGCTCACGCCTGTAATC...

and a Tandem repeat:

  repeat_feature:
    contig_id         233
    contig_start      6386
    contig_end        6415
    contig_strand     1
    repeat_id         245
    repeat_start      1
    repeat_end        3
    score             90
    analysis_id       18

  repeat:
    repeat_id         245
    repeat_name       3-mer atc
    repeat_class      Tandem
    repeat_length     3
    repeat_consensus  ATC
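
For comparison, a sketch of the single-table lookup that
get_repeatmasked_seq could use to collect mask coordinates. Again this is
an assumed query, not existing code. It never touches the repeat table,
and the leading columns of the contig_idx key (contig_id, contig_start)
should be able to serve both the filter and the sort.

```sql
# Sketch only: mask coordinates for one contig, all repeat classes at once.
SELECT contig_start, contig_end
FROM   repeat_feature
WHERE  contig_id = 233               # contig used in the examples above
ORDER BY contig_start;
```

With just the two example rows above loaded, this would return
6386..6415 followed by 27673..27953.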

	James

James G.R. Gilbert
The Sanger Centre
Wellcome Trust Genome Campus
Hinxton
Cambridge                        Tel: 01223 494906
CB10 1SA                         Fax: 01223 494919