[Genquire-dev] Re: new tables (fwd)
David Block
dblock@gnf.org
Fri, 16 Nov 2001 15:27:14 -0800 (PST)
This is what we should do with repeats, probably, since they're actually
thinking about it.
Mark?
--
David Block
GNF http://www.gnf.org La Jolla, California
dblock@gnf.org Let's not talk about the weather...
---------- Forwarded message ----------
Date: Fri, 16 Nov 2001 14:36:54 +0000 (GMT)
From: James Gilbert <jgrg@sanger.ac.uk>
To: Arne Stabenau <stabenau@ebi.ac.uk>
Cc: Laura Clarke <lec@sanger.ac.uk>,
Ensembl Development <ensembl-dev@ebi.ac.uk>
Subject: Re: new tables
Arne and Laura,
I need to write functions like:
my @alu = $vc->get_all_Repeats_by_class('SINE');
I also want to keep all the repeats in one table,
so that get_repeatmasked_seq only has to visit one
table to get all the mask coordinates, and is
therefore nice and fast.
I propose the following:
CREATE TABLE repeat_feature (
  contig_id      int(10) unsigned NOT NULL,
  contig_start   int(10) unsigned NOT NULL,
  contig_end     int(10) unsigned NOT NULL,
  contig_strand  tinyint(1) DEFAULT '1' NOT NULL,  # 1 = positive, 0 = not oriented, -1 = negative
  repeat_id      int unsigned NOT NULL,
  repeat_start   int(10) NOT NULL,
  repeat_end     int(10) NOT NULL,
  analysis_id    int(10) unsigned NOT NULL,
  score          float,
  KEY contig_idx( contig_id, contig_start, analysis_id ),
  KEY repeat_type( contig_id, repeat_id, contig_start )
) max_rows=300000000 avg_row_length=80;
CREATE TABLE repeat (
  repeat_id         int unsigned NOT NULL AUTO_INCREMENT,
  repeat_name       varchar(255) NOT NULL,
  repeat_class      varchar(40) NOT NULL,  # eg: SINE, LINE, DNA Transposon,
                                           # Retroviral LTR, Satellite, Tandem
  repeat_consensus  text,                  # Or dna_id with entry in DNA table?
  repeat_length     int NOT NULL,          # not needed with repeat_consensus?
  PRIMARY KEY( repeat_id )
);
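As a rough sketch of how the two functions might use these tables (the contig_id value 233 and the selected columns are only illustrative assumptions, and the table name is backquoted because newer MySQL releases treat REPEAT as a reserved word):

```sql
-- Sketch only: one possible query behind get_all_Repeats_by_class('SINE').
-- The contig_id (233) is an illustrative value, not part of the proposal.
SELECT f.contig_start, f.contig_end, f.contig_strand,
       f.repeat_start, f.repeat_end, f.score,
       r.repeat_name
FROM repeat_feature f
JOIN `repeat` r ON r.repeat_id = f.repeat_id
WHERE f.contig_id = 233
  AND r.repeat_class = 'SINE'
ORDER BY f.contig_start;

-- Sketch only: mask coordinates for get_repeatmasked_seq.
-- Only repeat_feature is read; no join is needed.
SELECT contig_start, contig_end
FROM repeat_feature
WHERE contig_id = 233
ORDER BY contig_start;
```

The second query touches only repeat_feature, which is the point of keeping all repeats in one table.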
So an Alu would be represented:
repeat_feature:
  contig_id         233
  contig_start      27673
  contig_end        27953
  contig_strand     1
  repeat_id         15
  repeat_start      40
  repeat_end        311
  score             2199
  analysis_id       17
repeat:
  repeat_id         15
  repeat_name       AluY
  repeat_class      SINE
  repeat_length     311
  repeat_consensus  RGCCGGGCGCGGTGGCTCACGCCTGTAATC...
and a Tandem repeat:
repeat_feature:
  contig_id         233
  contig_start      6386
  contig_end        6415
  contig_strand     1
  repeat_id         245
  repeat_start      1
  repeat_end        3
  score             90
  analysis_id       18
repeat:
  repeat_id         245
  repeat_name       3-mer atc
  repeat_class      Tandem
  repeat_length     3
  repeat_consensus  ATC
James
James G.R. Gilbert
The Sanger Centre
Wellcome Trust Genome Campus
Hinxton
Cambridge Tel: 01223 494906
CB10 1SA Fax: 01223 494919