[BiO BB] Restriction sites frequencies in mouse genome
Benoit VARVENNE
varvenne at genoway.com
Thu Sep 7 03:39:42 EDT 2006
Hello Harry,
Thanks for your answer. I'd be very interested in having this code.
At first I only had to calculate frequencies in the mouse genome, but things
have changed: I now also need the positions of the hits, so that I can
calculate their distribution, the fragment lengths, ...
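
As an illustration of the fragment-length step, here is a minimal sketch in Python (not the Perl script discussed in this thread). It assumes a linear sequence and a list of 0-based cut offsets; the function names and the 1000 bp bin size are made up for the example.

```python
from collections import Counter


def fragment_lengths(cut_positions, sequence_length):
    """Return the fragment lengths obtained by cutting a linear sequence.

    cut_positions are 0-based offsets; a cut at offset p separates
    base p-1 from base p. Cuts at the very ends are ignored.
    """
    cuts = sorted({p for p in cut_positions if 0 < p < sequence_length})
    bounds = [0] + cuts + [sequence_length]
    return [right - left for left, right in zip(bounds, bounds[1:])]


def length_histogram(lengths, bin_size=1000):
    """Count fragments per length bin; keys are the lower bound of each bin."""
    counts = Counter((n // bin_size) * bin_size for n in lengths)
    return dict(sorted(counts.items()))
```

Where the enzyme actually cuts relative to the start of its recognition site varies by enzyme, so the hit positions would need to be shifted by that offset before being passed in.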
The next step will be to link the hits found to the corresponding
features available in the Ensembl databases (site in an existing gene,
centromere, repeat regions, ...).
I think I'm going to use the Ensembl Perl API to do so.
If anyone has other ideas, I'd be very interested in them.
If anyone's interested, I have a general Perl script, optimized for memory
use and speed, that counts the hits of a sequence (or of a pattern) in very
large sequences (such as chromosomes or a whole genome).
Let me know if you want it.
For the moment it does not handle a list of inputs, and it does not store
the positions of the hits.
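
For readers who want to see the general idea, here is a minimal sketch in Python of a memory-light scan that also reports positions. It is not the Perl script mentioned above, only one possible approach under stated assumptions: the input is a single FASTA record read line by line, the site may contain IUPAC codes, and only the strand as written is searched. All names are hypothetical.

```python
import re

# IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def site_regex(site):
    """Compile a site into a lookahead regex so overlapping hits are all found."""
    body = "".join(IUPAC[c] for c in site.upper())
    return re.compile("(?=" + body + ")")


def find_sites(lines, site):
    """Yield 0-based start positions of `site` in one FASTA record.

    Only the last len(site)-1 bases are carried from one line to the
    next, so memory use does not grow with the sequence length.
    """
    pattern = site_regex(site)
    k = len(site)
    tail = ""    # bases carried over from the previous line(s)
    offset = 0   # sequence position of the first base in `tail`
    for line in lines:
        if line.startswith(">"):
            continue  # skip the FASTA header
        chunk = tail + line.strip().upper()
        for match in pattern.finditer(chunk):
            yield offset + match.start()
        keep = min(k - 1, len(chunk))
        offset += len(chunk) - keep
        tail = chunk[len(chunk) - keep:]
```

Usage would look like `sum(1 for _ in find_sites(open(path), "GAATTC"))` for a plain count, where `path` is a placeholder for a chromosome FASTA file. Palindromic sites such as GAATTC occupy the same positions on both strands; a non-palindromic site would need a second pass with its reverse complement.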
Regards,
Benoit Varvenne,
Bioinformatics person in charge,
Genoway Lyon - France.
On 6/09/06 21:43, "Harry Mangalam" <harry.mangalam at uci.edu> wrote:
> If by calculating frequencies, you want to find all the sites in a
> genome, tacg will do this. It will find all the sites you give it
> (I've tested it on all human chromosome assemblies) as well as the
> predicted frequency based on the base pair distribution.
>
> It can theoretically do the entire genome in one shot if you have
> enough RAM, but I've never tried it and the output would be pretty
> ferocious.
> For example, for chromosome 21 (a paltry 33.6MB), the summary output
> is:
>
> ## Sequence: #1; from file: UNAVAILABLE
> Format: FASTA; ID: gi:89161201; Description: Homo sapiens
> chromosome 21, alternate assembly (based on Celera assembly), whole
> genome shotgun sequence.
>
> == Sequence info:
>
> NB: sequence length > A+C+G+T due to -> 224404 <- IUPAC degeneracies.
> # of: N:224404 Y:0 R:0 W:0 S:0 K:0 M:0 B:0 D:0 H:0 V:0
>
> #s below are for top strand; 'sites exp' values calculated on the
> basis of both strands.
> 33216610 bases; 9772353 A(29.42 %) 6752472 C(20.33 %) 6753971 G(20.33 %) 9713410 T(29.24 %)
>
> == Enzymes that DO NOT MAP to this sequence:
>
> There were NO NON-matches - ALL patterns matched at least ONCE.
>
>
> == Total Number of Hits per Enzyme:
> AatII      1068  BsiEI      1803  EcoRV      4841  PsiI      20384
> AccI      12230  BsiHKAI   23981  FauI      18509  PspGI    112279
> AccII      9733  BsiWI       174  Fnu4HI    74994  PspOMI     6067
> Acc65I     3021  BslI      91011  FokI      59656  PstI      15561
> AciI      52859  BsmI      13955  FseI        235  PvuI        181
> AclI       2047  BsmAI     73662  FspI       1211  PvuII     12841
> AfeI       1406  BsmBI      7619  HaeII      7030  RsaI      56361
> AflII      7226  BsmFI     45828  HaeIII    99508  RsrII       126
> AflIII    18426  Bsp1286I  57995  HgaI       8115  SacI       6829
> AgeI        676  BspEI      1246  HhaI      21013  SacII       893
> AhdI       3149  BspHI     11844  HinP1I    21013  SalI        392
> AluI     143869  BspMI     16591  HincII    13046  SanDI      3409
> AlwI      37296  BsrI      63802  HindIII    9457  SapI       4316
> AlwNI     16140  BsrBI      2994  HinfI     96900  Sau96I    77627
> ApaI       6067  BsrDI     16179  HpaI       4478  Sau3AI    79640
> ApaLI      6042  BsrFI      4609  HpaII     29934  SbfI       1068
> ApoI      74171  BsrGI      9408  HphI      67904  ScaI       5880
> AscI         47  BssHII      890  KasI       2793  ScrFI    137189
> AseI      17631  BssKI    137189  KpnI       3021  SexAI      3472
> AvaI      12916  BssSI      5101  MaeII     28783  SfaNI     42093
> AvaII     31938  BstAPI     9253  MaeIII    83257  SfcI      39408
> AvrII      6112  BstBI      1256  MboII    100007  SfiI        599
> BaeI       2868  Bst4CI    87767  MfeI       6359  SfoI       2793
> BaeI       2868  BstDSI    14918  MluI        334  SgfI         13
> BamHI      4165  BstEII     4065  MlyI      44962  SgrAI       214
> BanI      18704  BstF5I    59661  MnlI     308118  SmaI       4948
> BanII     27893  BstNI    112279  MscI      14579  SmlI      29332
> BbeI       2793  BstUI      9733  MseI     226716  SnaBI      1598
> BbsI      16623  BstXI     19685  MslI      38862  SpeI       4362
> BbvI      63057  BstYI     24349  MspA1I    17762  SphI       6477
> BbvCI     14806  BstZ17I    4605  MwoI      73785  SrfI        302
> BcgI       3733  Bsu36I    10646  NaeI       1898  SspI      28450
> BcgI       3733  BtgI      14918  NarI       2793  StuI       8988
> BciVI      7495  BtrI       3836  NciI      24927  StyI      34781
> BclI       8350  Cac8I     66066  NcoI       8941  SwaI       2801
> BfaI      83296  ClaI       1121  NdeI      10096  TaiI      28783
> BglI       6550  Csp6I     56361  NgoMIV     1898  TaqI      17908
> BglII      8895  CviJI    507227  NheI       2770  TatI      30303
> BlpI       6131  CviRI    168208  NlaIII   161486  TfiI      51945
> BmrI      19063  DdeI     155096  NlaIV     87348  TliI       1496
> BplI      11478  DpnI      79640  NotI        127  TseI      63101
> BpmI      32957  DraI      41466  NruI        209  Tsp45I    47283
> Bpu10I    25858  DraIII     6989  NsiI      11383  Tsp509I  254887
> BsaI      18254  DrdI       3165  NspI      36783  TspRI     98632
> BsaAI      9382  EaeI      20232  PacI       1946  Tth111I    7783
> BsaBI      4988  EagI       1139  PciI      12666  XbaI       9158
> BsaHI      6162  EarI      25525  PflMI     11275  XcmI       9507
> BsaJI    121468  EciI       6774  PleI      44962  XhoI       1496
> BsaWI      3529  Ecl136II   6829  PmeI        539  XmaI       4948
> BseMII   104754  Eco57I    24123  PmlI       4081  XmnI      11146
> BseRI     23673  EcoNI      8774  Ppu10I    11383
> BseSI     25059  EcoO109I  28937  PpuMI     12989
> BsgI      24191  EcoRI      8938  PshAI      3251
>
> To get the actual predicted number of sites, you have to generate the
> Sites info, which would be enormous but easily sed-able to extract
> what you need.
>
> This took 9.5s on a 2GHz Opteron running 64-bit Linux.
>
> If you want, I'll send you the source tarball in a separate email.
>
> hjm
>
>
> On Tuesday 29 August 2006 05:35, Benoit VARVENNE wrote:
>> Hello everybody,
>>
>> Thanks to all for your ideas and suggestions. I think I'm going to
>> write a Perl program to calculate restriction site frequencies,
>> as the software mentioned in your mails (plus the software I found)
>> doesn't seem to be usable at whole-genome scale. Programming was to be
>> avoided for this study, but it seems to be the only solution. I'm
>> really surprised not to be able to find a study like this already done.
>>
>> Thanks again,
>> Regards,
>>
>> Benoît Varvenne,
>> Bioinformatics person in charge,
>> Genoway Lyon - France.
>>
>> On 28/08/06 11:34, "Benoit VARVENNE" <varvenne at genoway.com> wrote:
>>> Dear Members,
>>>
>>> I am a new member of this mailing list and I don't know if such a
>>> post will draw the attention of anyone here. So excuse me in
>>> advance if my subject is not appropriate.
>>> I am searching for a way to calculate restriction site frequencies
>>> in the mouse genome (so sequences from 6 to 13bp). I have already
>>> tried to do so using blast (or blast-like) tools and configuring
>>> them as needed, but it gave no results, because of too many
>>> hits I think.
>>>
>>> I would be very grateful if someone could help me on this topic.
>>>
>>> Thanks a lot for your help,
>>> Best regards,
>>>
>>> Benoît Varvenne,
>>> Bioinformatics person in charge,
>>> Genoway Lyon - France
>>>
