[ssml] error from BLASTCLUST

Fri Sep 3 05:30:36 EDT 2004

On Fri, 3 Sep 2004, Manoj Tyagi wrote:

>Hello
>      I have subscribed to the list as asked by you.
>Well I looked at your reply but if you read the documentation of BLASTCLUST it
>says low complexity filtering is "off" by default. Now if this feature is off
>then why we get error.

I see what you mean..

Quoting
http://bioinformatics.ubc.ca/resources/tools/index.php?name=blastclust

"BLASTCLUST uses the default values for the BLAST and Mega BLAST
parameters. For protein sequences these are: matrix BLOSUM62; gap opening
cost 11; gap extension cost 1; no low-complexity filtering."

Somehow I thought that the default was to have "-F T" because this is
the default for the blast program, and the above says it uses the
default values.

Quoting
http://bioweb.pasteur.fr/docs/man/man/blast.1.html

-F str (bl2seq, blast, blastall, blastpgp, blastcl3, impala, megablast,
rpsblast) Filter options for DUST or SEG; defaults to T for bl2seq, blast,
blastall, blastcl3, and megablast, and to F for blastpgp, impala, and
rpsblast.

I don't know where the definitive docs are.

>Plus if some one wants to active above feature how you could do that
>because in program options there is no option for low complexity. there
>was -F but i didn't understand what it is.

For some reason blastclust does not directly support the -F option; rather,
you have to make a file with something like

... I can't find any examples ...

Found it...
http://www.sacs.ucsf.edu/Documentation/standblast.html

---- begin

-F  Filter query sequence (DUST with blastn, SEG with others) [T/F]
    default = T

BLAST 2.0 uses the dust low-complexity filter for blastn and seg for the
other programs. Both 'dust' and 'seg' are integral parts of the NCBI
toolkit and are accessed automatically.

If one uses "-F T" then normal filtering by seg or dust (for blastn)
occurs (likewise "-F F" means no filtering whatsoever).  The seg options
can be changed by using:

  -F "S 10 1.0 1.5"

which specifies a window of 10, locut of 1.0 and hicut of 1.5.  A
coiled-coiled filter, based on the work of Lupas et al. (Science, vol 252,
pp. 1162-4 (1991)) and written by John Kuzio (Wilson et al., J Gen Virol,
vol. 76, pp. 2923-32 (1995)), may be invoked by specifying:

 -F "C"

There are three parameters for this: window, cutoff (prob of a coil-coil),
and linker (distance between two coiled-coiled regions that should be
linked together).  These are now set to

 window: 22
 cutoff: 40.0
 linker: 32

One may also change the coiled-coiled parameters in a manner analogous to
that of seg:

 -F "C 28 40.0 32" will change the window to 28.

One may also run both seg and coiled-coiled together by using a ";":

 -F "C;S"

Filtering by dust may also be specified by:

 -F "D"

It is possible to specify that the masking should only be done during the
process of building the initial words by starting the filtering command
with 'm':

 -F "m S"

which specifies that seg (with default arguments) should be used for
masking, but that the masking should only be done when the words are being
built. This masking option is available with all filters.

---- end

Then use the -c option of blastclust to point at that file.

Basically the -F option takes a string giving the parameters to use in SEG:

http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Seg.html

I once compiled seg locally, which lets you pre-filter your database so
you know what you are doing

http://blast.wustl.edu/pub/seg/

but I found it didn't accept any command line options (when any were passed,
the behavior was unpredictable). So you had to hard-code any values you
wanted...

I am trying to find the recommended values for sequence searching and
database-database comparison, but I can't.

This looks like a good seg page...

http://www.biology.wustl.edu/gcg/seg.html

Additionally it looks like SEALS defines a different (but related) seg...

http://www.ncbi.nlm.nih.gov/CBBresearch/Walker/SEALS/info/readmes/famask.html

Actually it looks like what they call 'domain' isn't what I think of when
I use the term, they are talking about the same seg though.

Found it at last...

http://cbr-rbc.nrc-cnrc.gc.ca/documentation/seals/seg.html

EXAMPLES OF PARAMETER SETS
--------------------------

Default parameters are given by 'seg sequence' (equivalent to 'seg
sequence 12 2.2 2.5').  These  parameters are appropriate for low-
complexity masking of many amino acid sequences [with -x option].

Database-database comparisons:
-----------------------------
More stringent (lower) complexity parameters are suitable when
masked sequences are compared with masked sequences.  For example,
for BLAST or FASTA searches that compare two amino acid sequence
databases, the following masking may be applied to both databases:

  seg database 12 1.8 2.0 -x

Homopolymer analysis:
--------------------
To examine all homopolymeric subsequences of length (for example)
7 or greater:

  seg sequence 7 0 0

Non-globular regions of protein sequences:
-----------------------------------------
Many long non-globular domains may be diagnosed at longer window
lengths, typically:

  seg sequence 45 3.4 3.75

For some shorter non-globular domains, the following set is
appropriate:

  seg sequence 25 3.0 3.3

Nucleotide sequences:
--------------------
The maximum value of the complexity parameters is 2 (log[base 2]4).
For masking, the following is approximately equivalent in effect
to the default parameters for amino acid sequences:

  seg sequence.na 21 1.4 1.6
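
As a rough illustration of what those complexity numbers measure, here is a
minimal Python sketch that computes the Shannon entropy (in bits per residue)
of the composition of a sliding window. It is a simplification and not seg's
actual algorithm; the function names, and the rule of flagging any window at
or below the cutoff, are my own assumptions.

```python
import math
from collections import Counter


def window_entropy(window):
    """Shannon entropy, in bits per residue, of a window's composition."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def low_complexity_windows(seq, size=12, cutoff=2.2):
    """Start positions of windows whose entropy is at or below the cutoff.

    The defaults mirror the window (12) and first cutoff (2.2) quoted above;
    treating the cutoff as a simple threshold is an assumption of this sketch.
    """
    return [
        i
        for i in range(len(seq) - size + 1)
        if window_entropy(seq[i:i + size]) <= cutoff
    ]
```

A window containing all four nucleotides in equal amounts gives exactly 2
bits, which is where the maximum of 2 for nucleotide sequences comes from;
for 20 amino acids the ceiling is log2(20), about 4.32. A homopolymer run
gives 0.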

Well, none of that description of seg helps answer your question. My next
guess is that the sequences are very short, and that this messes with some
initialization somewhere in the blast query setup.
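
One way to test that guess before touching any options is to scan the input
for sequences that are very short or that consist only of X. This is a
minimal Python sketch; the function names and the min_length threshold of 10
are hypothetical choices of mine, not values taken from the BLAST
documentation.

```python
def read_fasta(lines):
    """Yield (identifier, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def suspicious(records, min_length=10):
    """Return ids of sequences that are very short or made up only of X.

    min_length is an arbitrary illustrative threshold.
    """
    flagged = []
    for name, seq in records:
        if len(seq) < min_length or set(seq.upper()) <= {"X"}:
            flagged.append(name)
    return flagged
```

With a real file you would call something like
suspicious(read_fasta(open("mydata.fasta"))), where "mydata.fasta" stands in
for your own input file.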

Try setting up a file with the option

-F F

and use blastclust -c to point at that file and see if the problem
persists. If it goes away you should report the minor bug.
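
A minimal Python sketch of that test follows. It assumes the file given to -c
simply holds the option text as written above, and that -i and -o name the
input and output files; I have not verified the exact file format against the
definitive docs, and the function names are made up for illustration.

```python
from pathlib import Path


def write_filter_config(path, filter_value="F"):
    """Write a one-line options file containing '-F <value>'.

    The one-line layout is an assumption based on the description above.
    """
    config = Path(path)
    config.write_text(f"-F {filter_value}\n")
    return config


def blastclust_command(fasta, output, config):
    """Build (but do not run) the blastclust command line as a list."""
    return ["blastclust", "-i", str(fasta), "-o", str(output), "-c", str(config)]
```

The returned list can be handed to subprocess.run once you are happy with it.
Building the command separately from running it makes it easy to print and
check first.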

If you are using blastclust, you could just use blastall and write your
own single linkage clustering on top (very quick to do), or you could look
at tribe-mcl for using the 'homology network' to make clusters. A simple
program is cd-hit, which currently should not be used with low complexity
filters at all.
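
To show how little is needed for the single linkage option, here is a minimal
Python sketch using union-find. It assumes tab-separated BLAST output with 12
columns (query, subject, ..., E-value in column 11, bit score in column 12)
and links any pair at or below an E-value cutoff. The cutoff of 1e-5 and the
function names are my own choices; blastclust itself uses different criteria,
so this is not a reimplementation of it.

```python
def hit_pairs(lines, max_evalue=1e-5):
    """Yield (query, subject) for tabular hits at or below max_evalue."""
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue
        if float(fields[10]) <= max_evalue:
            yield fields[0], fields[1]


def single_linkage(pairs):
    """Group identifiers into clusters: any link joins two clusters."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for x in list(parent):
        clusters.setdefault(find(x), []).append(x)
    return sorted(sorted(members) for members in clusters.values())
```

Self hits are harmless here and are actually useful, since they make sure a
sequence with no other matches still appears as its own cluster.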

Can anybody recommend a good ncbi-tools tutorial?

>
>regards,
>Manoj
>Quoting Dan Bolser <dmb at mrc-dunn.cam.ac.uk>:
>
>>
>> http://www.dur.ac.uk/biological.sciences/Bioinformatics/blast_FAQs.html#BLASTSetUpSearch
>>
>> This is a simple problem that is often encountered when beginning to use
>> blastclust.
>>
>> The problem comes when blastclust uses low complexity sequence filtering
>> by default, and the sequence it creates on the fly (which you may never
>> actually see) is totally 'screened' by the filter. i.e. the whole sequence
>> is replaced by X's.
>>
>> This prevents anything useful being done with the sequence and leads to
>> the error seen (I am not sure of the exact process that leads to the
>> error in these cases).
>>
>> The short answer is you can safely ignore these problems, or you can
>> switch off low complexity filtering at the risk of a few seemingly
>> significant matches (matches over low complexity regions are probably not
>> as unlikely as the random sequence approximation makes them seem).
>>
>> Sorry that isn't a very clear description...
>>
>> The best thing to do is understand low complexity sequences (very simple
>> sequence repeats) and why / how those are filtered.
>>
>> The standard program is SEG (for low complexity in proteins), with DUST
>> used for nucleotide sequences; a separate filter masks coiled-coil
>> sequences (which can be highly repetitive).
>>
>> A random sequence is maximally complex. A continuous repeat of one
>> character is minimally complex.
>>
>> Cheers,
>> Dan.
>>
>>
>> On Thu, 2 Sep 2004, Manoj Tyagi wrote:
>>
>> >Hello,
>> >
>> >Thanks for the reply to my BLAST query.
>> >
>> >This time I have another query about BLASTCLUST which I am trying to use to
>> >cluster my dataset. In the documentation it says by default it uses BLOSUM62
>> >with gap penalties etc.
>> >Now I want to use default options so I just simply give my dataset as input
>> >file & give output file names.
>> >
>> >Problem is it throws warning & error saying
>> >"[NULL_Caption] WARNING: SetUpBlastSearch failed.
>> >[NULL_Caption] ERROR: BLASTSetUpSearch: Unable to calculate Karlin-Altschul
>> >params, check query sequence"
>> >
>> >it means it didn't find lambda & K values in precomputed tables so giving
>> >warning & errors. normally it should be there anyway I can provide that
>> >values, the question is HOW? in BLASTCLUST there is no option for providing
>> >these values.
>> >
>> >Could you help me out, what to do in this case. & why it is giving error?
>> >
>> >regards,
>> >Manoj
>> >Quoting Kevin Karplus <karplus at soe.ucsc.edu>:
>> >
>> >> The matrix is not the whole set of parameterization for BLAST.
>> >> There are also the gap costs and the lambda and K values used for
>> >> computing E-values.
>> >>
>> >> Changing the matrix without correcting the other parameters leads to
>> >> uninterpretable results.
>> >>
>> >> Kevin Karplus    karplus at soe.ucsc.edu    http://www.soe.ucsc.edu/~karplus
>> >> Senior member, IEEE    Board of Directors, ISCB (starting Jan 2005)
>> >> Professor of Biomolecular Engineering, University of California, Santa Cruz
>> >> Undergraduate and Graduate Director, Bioinformatics
>> >> Affiliations for identification only.
>> >>
>> >
>> >
>> >**********************************************************************
>> > Manoj TYAGI
>> > Laboratoire de Biochimie et Génétique Moléculaire
>> > Université de La Réunion
>> > BP 7151, 15 avenue René Cassin
>> > 97715 Saint Denis Messag Cedex 09
>> > La Réunion
>> > FRANCE
>> > Tel : +262 262 938641
>> > Fax : +262 262 938237
>> >**********************************************************************
>> >
>> >
>> >-------------------------------------------------
>> >This mail sent through IMP: http://horde.org/imp/
>> >_______________________________________________
>> >ssml-general mailing list
>> >ssml-general at bioinformatics.org
>> >https://bioinformatics.org/mailman/listinfo/ssml-general
>> >