This is what we should do with repeats, probably, since they're actually
thinking about it. Mark?

-- 
David Block
GNF                     http://www.gnf.org
La Jolla, California    dblock@gnf.org

Let's not talk about the weather...

---------- Forwarded message ----------
Date: Fri, 16 Nov 2001 14:36:54 +0000 (GMT)
From: James Gilbert <jgrg@sanger.ac.uk>
To: Arne Stabenau <stabenau@ebi.ac.uk>
Cc: Laura Clarke <lec@sanger.ac.uk>,
    Ensembl Development <ensembl-dev@ebi.ac.uk>
Subject: Re: new tables

Arne and Laura,

I need to write functions like:

    my @alu = $vc->get_all_Repeats_by_class('SINE');

I also want to keep all the repeats in one table, so that
get_repeatmasked_seq only has to visit one table to get all the mask
coordinates, and is therefore nice and fast.

I propose the following:

    CREATE TABLE repeat_feature (
      contig_id      int(10) unsigned NOT NULL,
      contig_start   int(10) unsigned NOT NULL,
      contig_end     int(10) unsigned NOT NULL,
      contig_strand  tinyint(1) DEFAULT '1' NOT NULL,
                     # 1 = positive oriented, 0 = not oriented,
                     # -1 = negative oriented
      repeat_id      int unsigned NOT NULL,
      repeat_start   int(10) NOT NULL,
      repeat_end     int(10) NOT NULL,
      analysis_id    int(10) unsigned NOT NULL,
      score          float,

      KEY contig_idx( contig_id, contig_start, analysis_id ),
      KEY repeat_type( contig_id, repeat_id, contig_start )
    ) max_rows=300000000 avg_row_length=80;

    CREATE TABLE repeat (
      repeat_id         int unsigned NOT NULL AUTO_INCREMENT,
      repeat_name       varchar(255) NOT NULL,
      repeat_class      varchar(40) NOT NULL,
                        # eg: SINE, LINE, DNA Transposon,
                        # Retroviral LTR, Satellite, Tandem
      repeat_consensus  text,          # Or dna_id with entry in DNA table?
      repeat_length     int NOT NULL,  # not needed with repeat_consensus?

      PRIMARY KEY( repeat_id )
    );

So an Alu would be represented:

    repeat_feature:
      contig_id         233
      contig_start      27673
      contig_end        27953
      contig_strand     1
      repeat_id         15
      repeat_start      40
      repeat_end        311
      score             2199
      analysis_id       17

    repeat:
      repeat_id         15
      repeat_name       AluY
      repeat_class      SINE
      repeat_length     311
      repeat_consensus  RGCCGGGCGCGGTGGCTCACGCCTGTAATC...

and a Tandem repeat:

    repeat_feature:
      contig_id         233
      contig_start      6386
      contig_end        6415
      contig_strand     1
      repeat_id         245
      repeat_start      1
      repeat_end        3
      score             90
      analysis_id       18

    repeat:
      repeat_id         245
      repeat_name       3-mer atc
      repeat_class      Tandem
      repeat_length     3
      repeat_consensus  ATC

	James

James G.R. Gilbert
The Sanger Centre
Wellcome Trust Genome Campus
Hinxton
Cambridge          Tel: 01223 494906
CB10 1SA           Fax: 01223 494919
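
To illustrate how the two proposed tables might be queried, here is a minimal sketch, not the Ensembl implementation. It assumes SQLite (via Python's sqlite3 module) as a stand-in for MySQL, so the column types are simplified and the table options are dropped. The function names get_all_repeats_by_class and get_mask_coordinates, and the helper build_demo_db, are hypothetical. The two rows loaded are the Alu and Tandem examples from the message; the AluY consensus is left NULL because the message truncates it.

```python
import sqlite3

# SQLite stand-in for the proposed MySQL tables (types simplified; assumption).
SCHEMA = """
CREATE TABLE repeat_feature (
    contig_id     INTEGER NOT NULL,
    contig_start  INTEGER NOT NULL,
    contig_end    INTEGER NOT NULL,
    contig_strand INTEGER NOT NULL DEFAULT 1,
    repeat_id     INTEGER NOT NULL,
    repeat_start  INTEGER NOT NULL,
    repeat_end    INTEGER NOT NULL,
    analysis_id   INTEGER NOT NULL,
    score         REAL
);
CREATE INDEX contig_idx  ON repeat_feature (contig_id, contig_start, analysis_id);
CREATE INDEX repeat_type ON repeat_feature (contig_id, repeat_id, contig_start);
CREATE TABLE "repeat" (
    repeat_id        INTEGER PRIMARY KEY,
    repeat_name      TEXT NOT NULL,
    repeat_class     TEXT NOT NULL,
    repeat_consensus TEXT,
    repeat_length    INTEGER NOT NULL
);
"""


def build_demo_db():
    """Create an in-memory database holding the two example repeats."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        'INSERT INTO "repeat" VALUES (?, ?, ?, ?, ?)',
        [
            (15, "AluY", "SINE", None, 311),  # consensus truncated in the message
            (245, "3-mer atc", "Tandem", "ATC", 3),
        ],
    )
    conn.executemany(
        "INSERT INTO repeat_feature VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
        [
            (233, 27673, 27953, 1, 15, 40, 311, 17, 2199),
            (233, 6386, 6415, 1, 245, 1, 3, 18, 90),
        ],
    )
    return conn


def get_all_repeats_by_class(conn, contig_id, repeat_class):
    """Join features to their repeat definition and filter by class."""
    cursor = conn.execute(
        """
        SELECT rf.contig_start, rf.contig_end, rf.contig_strand, r.repeat_name
        FROM repeat_feature AS rf
        JOIN "repeat" AS r ON r.repeat_id = rf.repeat_id
        WHERE rf.contig_id = ? AND r.repeat_class = ?
        ORDER BY rf.contig_start
        """,
        (contig_id, repeat_class),
    )
    return cursor.fetchall()


def get_mask_coordinates(conn, contig_id):
    """Masking only needs coordinates, so it reads repeat_feature alone."""
    cursor = conn.execute(
        """
        SELECT contig_start, contig_end
        FROM repeat_feature
        WHERE contig_id = ?
        ORDER BY contig_start
        """,
        (contig_id,),
    )
    return cursor.fetchall()
```

The class lookup needs the join because repeat_class lives in the repeat table, while the masking query touches only repeat_feature, which is the single-table access the proposal is after.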