Q: I have just submitted a "batch" of queries to the BLAST server and it's taking a very long time. What's wrong with the server?
The NCBI WWW BLAST server is a shared resource, and it would be unfair for a few users to monopolize it. To prevent this, the server keeps track of how many queries each user has in the queue and penalizes users with many queued queries. This is done by calculating a 'Time of Execution' (TOE) for each query. If a user has only one query in the queue, its TOE is set to the current time. As the user adds more queries, each new query's TOE is set to the current time plus 60 seconds for every query that user already has in the queue. For example, if a user sent in five requests one after the other without waiting for any to be worked on, the TOEs for the requests would be:
1st request: current time
2nd request: current time + 60 seconds
3rd request: current time + 120 seconds
4th request: current time + 180 seconds
5th request: current time + 240 seconds
The BLAST server works through requests in order from earliest to latest TOE. A query will be executed before its TOE if there are no other queries with an earlier TOE. Users with large numbers of queries are encouraged to use the BLAST servers at off-peak hours, which are from 8 p.m. to 8 a.m. (EST).
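The TOE rule described above can be sketched in a few lines of Python. This is only an illustration of the stated rule (60 seconds per query already queued), not the server's actual code; the function name `times_of_execution` and the use of plain seconds for time are assumptions made for the example.

```python
from typing import List

# Delay added for each query the user already has in the queue (from the text above).
PENALTY_SECONDS = 60


def times_of_execution(now: float, n_requests: int, already_queued: int = 0) -> List[float]:
    """Return the TOE of each of n_requests submitted back to back at time `now`."""
    return [now + PENALTY_SECONDS * (already_queued + i) for i in range(n_requests)]


# Five requests sent one after the other, with none worked on in between.
# Offsets from the current time, in seconds:
print([toe - 1000.0 for toe in times_of_execution(1000.0, 5)])
```

The printed offsets match the five-request list above: 0, 60, 120, 180 and 240 seconds.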
Q: Causes for "No significant similarity found".
Below are several reasons that a BLAST search can result in the "No significant similarity found" message. Note: You may need to use more than one of these options at the same time (example: increase the Expect value AND turn off filtering).
Short Sequences: Depending on sequence composition, a sequence is considered short if it has fewer than 20 residues.
1) Try increasing the Expect value using the pull down tab. You can raise the E value even further than 1000 by using the -e option in the Other Advanced Options Box (Advanced BLAST only). Example: -e 100000
2) You may also need to decrease the Word Size from the default (11 for nucleotides or 3 for proteins). You can decrease the word size using the -W option in the Other Advanced Options Box (Advanced BLAST only). Example: -W 2
You should also consult the "How do I perform a similarity search with a short peptide/nucleotide sequence?" section below.
Filtering: BLAST filters regions of low complexity (for a description of low complexity see "What is low-complexity sequence?" below). If your sequence contains large regions of low complexity, it may not produce significant hits to the database. You can turn off filtering by setting the "Filter" option to "None" using the pull down tab.
Query Format: Another reason you may see the "No Significant Similarity found" message is using the wrong type of sequence in your search.
1) Accession/GI Number or FASTA. Check that you have the Input Data set to the correct format for your Query. Set the pull down menu to "Accession number or Gi" to search with GenBank accession numbers or Gi numbers. Set to FASTA for raw amino acid or nucleotide sequences. For more information on FASTA format, click here.
2) Sequence type and Program combination. You can search with an amino acid query sequence using the blastp and tblastn programs. With nucleotide query sequences you can use blastn, blastx, and tblastx. Please note that tblastx program cannot be used with the nr database on the BLAST Web page. For more information on the BLAST programs, click here.
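The query type and program combinations listed above can be written as a small lookup table. The sketch below is illustrative only: the program names come from the text, but `guess_query_type`, its 90% threshold, and `check_program` are hypothetical helpers invented for this example, and the heuristic can misclassify unusual sequences.

```python
# Programs that accept each kind of query sequence, as listed in the text above.
PROGRAMS_BY_QUERY_TYPE = {
    "protein": ("blastp", "tblastn"),
    "nucleotide": ("blastn", "blastx", "tblastx"),
}


def guess_query_type(sequence: str) -> str:
    """Crude heuristic (an assumption): nucleotide if >= 90% of letters are A, C, G, T, U or N."""
    letters = [c for c in sequence.upper() if c.isalpha()]
    if not letters:
        raise ValueError("sequence contains no letters")
    nucleotide_like = sum(c in "ACGTUN" for c in letters)
    return "nucleotide" if nucleotide_like / len(letters) >= 0.9 else "protein"


def check_program(sequence: str, program: str) -> bool:
    """Return True if `program` accepts the guessed type of `sequence`."""
    return program in PROGRAMS_BY_QUERY_TYPE[guess_query_type(sequence)]


print(check_program("ATGGCGTACGTTAGC", "blastn"))  # nucleotide query with blastn
print(check_program("MKVLAAGIVGLLLAQ", "blastn"))  # protein query with blastn
```

A check like this before submitting can catch the common mistake of pairing a protein query with a nucleotide-query program.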
Q: Why does my search time out on the BLAST servers?
Certain combinations of BLAST searches with large sequences against
large databases can cause the BLAST servers to time out.
This is due to a limit on server CPU usage, which prevents
sequences that generate many HSPs from hoarding server
resources.
However, there are some things you can do to prevent a timeout and still get results from large sequences.
- Some sequences contain large regions of ALU repeats. In this
case you can select the "Human Repeat" filtering option on the
main BLAST search page. This will mask repeat regions which generate
a large number of biologically uninteresting hits to
the databases.
- Increase the Word Size to 20 - 25. With the default Word Size
of 11, the BLAST algorithm finds initial HSPs of 11 bases in
length and begins extension of these from either end. In a large
sequence this can generate hundreds of initial HSPs between
the query sequence and even a single large genomic sequence in
the databases. Increasing the Word Size to 25 requires a longer
initial exact match, limiting the number of small initial fragments
to be extended.
- Decrease the Expect value to 1.0 or lower. Many of the hits from large
sequences are to small fragments in the database, and these
tend to have high Expect values. Decreasing the
Expect value eliminates these results and concentrates on
results which are more likely to contain large coding regions
and genomic fragments.
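To see why a larger Word Size reduces the work, it can help to count exact word matches directly. The following Python sketch is a toy model, not the BLAST seeding algorithm itself: `count_word_hits` is a hypothetical helper that counts every pair of positions where the query and a subject sequence share an identical word of the given length.

```python
from collections import Counter


def count_word_hits(query: str, subject: str, word_size: int) -> int:
    """Count (query position, subject position) pairs that share an identical word."""
    subject_words = Counter(
        subject[i:i + word_size] for i in range(len(subject) - word_size + 1)
    )
    return sum(
        subject_words[query[i:i + word_size]]
        for i in range(len(query) - word_size + 1)
    )


# A repetitive toy sequence compared with itself: longer words give fewer seeds.
repeat = "ACGT" * 3
for w in (4, 8):
    print(w, count_word_hits(repeat, repeat, w))
```

On this 12-base repeat the count falls from 21 seeds at a word size of 4 to 7 seeds at a word size of 8. Every seed is a candidate for extension, so repetitive sequence combined with a small word size multiplies the work.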
If you are still seeing a "timeout" error message after making the above
changes, please contact blast-help@ncbi.nlm.nih.gov
with the RID of your search.
Q: Why do I get the message "ERROR: BLASTSetUpSearch: Unable to calculate Karlin-Altschul params, check query sequence" ?
This will happen if your entire query sequence has been masked by low
complexity filtering. You will need to turn filtering off to get
hits. For further information on filtering, please read the sections
of the BLAST FAQs on Q: What is low-complexity sequence?
and also Q: After running a search why do I see a string
of "X"s (or "N"s) in my query sequence that I did not put there?
Q: Why do I get the message "ERROR: Blast: No valid letters to be indexed"?
You may have accidentally entered an accession number in the search
box without changing the input selection from "Sequence in
FASTA format" to "Accession or gi". You will also see this error message
if too many ambiguity codes (R,Y,K,W,N, etc. for
nucleotides) are present in your query sequence. Although BLAST allows
ambiguity codes, be aware that these will always
contribute a negative score in nucleic acid searches. Thus, sequences
such as degenerate PCR primers with ambiguity codes may
not find any significant hits even though they may be designed from
sequences that are present in the database.
Q: After running a search why do I see a string of "X"s (or "N"s) in my query sequence that I did not put there?
You are seeing the result of automatic filtering of your query for low-complexity sequence, which is performed to prevent artifactual hits. The filter replaces any low-complexity sequence that it finds with the letter "N" in nucleotide sequences (e.g., "NNNNNNNNNNNNN") or the letter "X" in protein sequences (e.g., "XXXXXXXXX"). Low-complexity regions can result in high scores that reflect compositional bias rather than significant position-by-position alignment (Wootton & Federhen, 1996). Filter programs can eliminate these potentially confounding matches from the BLAST reports, leaving regions whose BLAST statistics reflect the specificity of their pairwise alignment. Queries searched with the blastn program are filtered with DUST. The other BLAST programs use SEG.
You can change the default and remove these filters if you like. On the Basic BLAST Web interface you will see a button to click that will remove the filter. On the Advanced page you can set the filter to "none" in the menu. For e-mail BLAST you can use the command "filter NONE".
Q: Is it possible to search for a motif or pattern with BLAST?
Although NCBI does not have a database of protein motifs, you can use PHI-BLAST to search for a motif in a protein query sequence and then find other proteins in the NCBI databases that contain this motif. PHI-BLAST can only be used for protein sequences at this time.
In addition, you can search with short query sequences using BLAST after changing a few parameters (see "Q: How do I perform a similarity search with a short peptide/nucleotide sequence?" below). You may also be interested in checking out other molecular biology web sites, such as those mentioned in the Other Resources section at the end of this FAQ, for motif searching software.
Q: How do I perform a similarity search with a short peptide/nucleotide sequence?
First, you will probably need to increase the Expect (E) value in your search. A short query is more likely to occur by chance in the database. Therefore, even a perfect match can have low statistical significance and may not be reported. Increasing the E value allows you to look farther down in the hit list and see matches that would normally be discarded because of low statistical significance.
For most searches, an Expect value up to 1000 is enough to see results. However, you can raise the E value farther on the Advanced BLAST Web page by typing -e 10000, for example, in the Other Advanced Options Box.
If you still do not get results after increasing the E value, you may want to try decreasing the Word size (W), another parameter that becomes important with a short query. The BLAST algorithm uses "words" to nucleate regions of similarity. The default Word size for a protein sequence is 3 residues and for nucleotide sequences it is 11 bp. A blastn search will not work with a Word size of less than 7. A good rule of thumb is that the query length must be at least twice the Word size. For example, if your query is a protein sequence of 4 residues, then the Word size should be reduced to 2. Please note that the smaller the Word size, the slower your search will be.
You can lower the default word size on the Advanced BLAST Web page. In the Other Advanced Options, type -W some_number (for example, -W 9).
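The rule of thumb above (query length at least twice the Word size) can be turned into a small helper. This is a sketch under the defaults and limits quoted in this FAQ; the function name `suggest_word_size` is hypothetical, and clamping nucleotide queries to 7 when the rule of thumb cannot be met is a choice made for the example.

```python
# Defaults and the blastn lower limit as quoted in the text above.
DEFAULT_WORD_SIZE = {"protein": 3, "nucleotide": 11}
BLASTN_MIN_WORD_SIZE = 7


def suggest_word_size(query_length: int, query_type: str) -> int:
    """Largest word size, no bigger than the default, with query_length >= 2 * W."""
    if query_length < 2:
        raise ValueError("query is too short for the rule of thumb to apply")
    word_size = min(DEFAULT_WORD_SIZE[query_type], query_length // 2)
    if query_type == "nucleotide":
        # blastn does not work below 7, so never suggest less (assumed clamp).
        word_size = max(word_size, BLASTN_MIN_WORD_SIZE)
    return word_size


print(suggest_word_size(4, "protein"))      # the 4-residue example from the text
print(suggest_word_size(18, "nucleotide"))  # an 18-base query
```

For the 4-residue protein this gives 2, as in the text, and for an 18-base nucleotide query it gives 9, which would be entered as -W 9.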
Sometimes a short query does not produce results because it contains low-complexity sequence. Often this type of sequence can be recognized by the human eye because it looks very redundant, for example the protein sequence PADPPPDPPPP or the nucleotide sequence AAATTTAAAAAT. A filter for low complexity sequence is applied by default to BLAST nucleotide and protein searches. If your query has regions of low-complexity sequence, then large portions of your query may be filtered out, essentially making your query shorter than you might have expected. Removing the filter will help in these cases.
Finally, you can change the matrix to optimize for searching with short
protein sequences. For information on query length and the matrix see the
document http://www.ncbi.nlm.nih.gov/BLAST/matrix_info.html#matrix
To compare one sequence against a specific sequence or set of sequences, you can also use a separate multiple sequence alignment program. There are many such software tools available to do this. NCBI has developed a tool, MACAW, which will do multiple sequence alignments on PC or Mac platforms. The latest version of MACAW is available on the NCBI anonymous FTP site (ftp://ncbi.nlm.nih.gov) under /pub/macaw/. The instructions are included with the program. You may also be interested in checking out other molecular biology web sites, such as those mentioned in the Other Resources section at the end of this FAQ.
Q: What is low-complexity sequence?
Regions with low-complexity sequence have an unusual composition and
this can create problems in sequence similarity searching (Wootton
& Federhen, 1996). Low-complexity sequence can often be recognized
by visual inspection. For example, the protein sequence PPCDPPPPPKDKKKKDDGPP
has low complexity and so does the nucleotide sequence AAATAAAAAAAATAAAAAAT.
Filters are used to remove low-complexity sequence because it can cause
artifactual hits (please also see Q: After running a
search why do I see a string of "X"s (or "N"s) in my query sequence that
I did not put there?)
In BLAST searches performed without a filter, often certain hits will be reported with high scores only because of the presence of a low-complexity region. Most often, this type of match cannot be thought of as the result of homology shared by the sequences. Rather, it is as if the low-complexity region is "sticky" and is pulling out many sequences that are not truly related.
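One simple way to put a number on "low complexity" is the Shannon entropy of the residues in a sliding window. The Python sketch below uses that measure purely as an illustration; it is not the SEG or DUST algorithm used by the BLAST filters, and the function names, window length and threshold are assumptions chosen for the example.

```python
import math
from collections import Counter


def shannon_entropy(segment: str) -> float:
    """Entropy in bits per residue; 0.0 for a run of a single letter."""
    total = len(segment)
    return -sum((n / total) * math.log2(n / total) for n in Counter(segment).values())


def mask_low_entropy(sequence: str, window: int = 12, threshold: float = 1.0,
                     mask_char: str = "X") -> str:
    """Mask every window whose entropy falls below `threshold` (assumed values)."""
    masked = list(sequence)
    for start in range(len(sequence) - window + 1):
        if shannon_entropy(sequence[start:start + window]) < threshold:
            masked[start:start + window] = mask_char * window
    return "".join(masked)


# The nucleotide example from the text is masked completely.
print(mask_low_entropy("AAATAAAAAAAATAAAAAAT", mask_char="N"))
# A sequence that uses all four bases evenly is left alone.
print(mask_low_entropy("ACGTACGTACGTACGT", mask_char="N"))
```

With these settings the A/T-rich example is replaced entirely by "N", which is the kind of output described in the question about unexpected strings of "X"s or "N"s.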
Q: What is the Expect (E) value?
The Expect value (E) is a parameter that describes the number of hits one can "expect" to see just by chance when searching a database of a particular size. It decreases exponentially with the Score (S) that is assigned to a match between two sequences. Essentially, the E value describes the random background noise that exists for matches between sequences.
The Expect value is used as a convenient way to create a significance threshold for reporting results. When the Expect value is increased from the default value of 10, a larger list with more low-scoring hits can be reported.
In BLAST 2.0, the Expect value is also used instead of the P value (probability) to report the significance of matches. For example, an E value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size one might expect to see 1 match with a similar score simply by chance.
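The exponential dependence on score can be made concrete with the standard relation from BLAST statistics, E = m * n * 2^(-S'), where S' is the bit score, m the query length and n the total database length. This formula is not stated in this FAQ; the Python sketch below uses it as a simplification that ignores the effective-length corrections real BLAST applies, and the function name is hypothetical.

```python
def expect_value(bit_score: float, query_length: int, database_length: int) -> float:
    """Approximate E value from a bit score: E = m * n * 2 ** (-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Each extra bit of score halves E; doubling the database doubles it.
print(expect_value(10, 4, 256))
print(expect_value(11, 4, 256))
print(expect_value(10, 4, 512))
```

The same relation shows why short queries need a higher Expect threshold: a short query can only reach a modest score, so even a perfect match may have a large E value in a large database.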
Q: Why was the Ungapped BLAST 1.4 discontinued?
The NCBI strongly encourages use of the newest gapped BLAST (version 2.0) services. This includes Web pages, an e-mail server, network client, and stand-alone binaries. BLAST version 2.0 was released in September, 1997 and most users have already made the transition from version 1.4 (which is no longer supported). Those who have not done so already are strongly encouraged to switch as the ungapped service will not be maintained indefinitely.
The software behind BLAST version 2.0 was written from scratch to allow BLAST to handle the new challenges posed by the sequence databases in the coming years.
Improvements of BLAST version 2.0 over 1.4 of interest to all users are:
- Gapped alignments.
- Master-slave display options for the alignments.
- Organism specific BLAST.
- Position-specific-iterated searches with PSI-BLAST.
You can still get Ungapped BLAST alignments using BLAST 2.0. See the FAQ "How can I emulate UnGapped BLAST searches using the Gapped BLAST 2.0?" below.
Q: How can I emulate UnGapped BLAST searches using the Gapped BLAST 2.0?
You can emulate an Ungapped BLAST 1.4 search using the Advanced BLAST 2.0 web page. To do this:
1) Click the option "Perform ungapped alignment" on the main BLAST 2.0 search page.
2) Then enter the following parameters in the "Other Advanced Options" field:
-r 5 -q -4
This changes the reward for a match to 5 and the penalty for a mismatch to -4, which were the defaults in Ungapped BLAST.
Q: How can I see low-similarity matches when there are many strong hits to my query sequence?
Often, when the query is a member of a large sequence family, the summary
hit list and the alignments returned only contain very high scoring hits.
To look at low-similarity matches, you must increase the maximum number
of results returned. On the BLAST Web pages, often it is sufficient to
increase the size of the summary hit list and the number of alignments
shown using the menus on the Advanced pages. However, it is possible to
increase the lists even further using the Other
Advanced Options box on the Advanced BLAST pages. For BLAST 2.0, "-v
2000", for example, will increase the number of descriptions returned in
the summary hit list to 2000. The option "-b 2000" will similarly increase
the number of alignments returned.
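When several of these settings are combined, the text typed into the Other Advanced Options box is just the flags separated by spaces. As a convenience sketch, the hypothetical Python helper below assembles such a string from the flags mentioned in this FAQ (-e, -W, -v, -b); it only formats text and does not contact the BLAST server.

```python
from typing import Optional


def build_advanced_options(expect: Optional[float] = None,
                           word_size: Optional[int] = None,
                           descriptions: Optional[int] = None,
                           alignments: Optional[int] = None) -> str:
    """Join the flags that were given into one 'Other Advanced Options' string."""
    flags = (("-e", expect), ("-W", word_size), ("-v", descriptions), ("-b", alignments))
    return " ".join(f"{flag} {value}" for flag, value in flags if value is not None)


# Show more descriptions and alignments, as in the example above.
print(build_advanced_options(descriptions=2000, alignments=2000))
# Settings for a short nucleotide query.
print(build_advanced_options(expect=10000, word_size=9))
```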
(1) Install the BLAST 2.0 Server Locally:
There is information about Standalone BLAST in the "Overview" available
from the sidebar of the main BLAST page
(http://www.ncbi.nlm.nih.gov/BLAST/)
and at http://www.ncbi.nlm.nih.gov/BLAST/newblast.html#standalone
There is also some information on setting up the programs at the NHGRI site at http://genome.nhgri.nih.gov/blastall/blast_install/
The Standalone executables are available at the anonymous FTP location: ftp://ncbi.nlm.nih.gov/blast/executables/
(2) Install the BLAST 2.0 Network client software locally:
There are executables for Mac, PC, and various UNIX platforms.
Chapter 7 of Cold Spring Harbor Genome Analysis Laboratory Manual also
provides helpful introductory information for users of molecular biology
databases and software. This chapter is available over the WWW
from the Cold Spring Harbor Laboratory WWW home page (http://www.cshl.org/)
under CSHL Press.
There are many sites which offer software tools for molecular biologists
and for manipulating sequence data. Some of the larger of these are listed
below: