A region of the query sequence can be selected for BLAST searching. You can enter the range in nucleotides or protein residues in the "From" and "To" boxes provided under "Set Subsequence". For example, to limit matches to the region from nucleotide 24 to nucleotide 200 of a query sequence, you would enter From= 24 To= 200. If one of the limits you enter is out of range, the intersection of the [From,To] and [1,length] intervals will be searched, where length is the length of the whole query sequence.
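The intersection rule can be illustrated with a minimal Python sketch. The function name clamp_subsequence is hypothetical, and returning None for an empty intersection is an assumption; the text above does not say what the server does in that case.

```python
def clamp_subsequence(start, stop, length):
    """Intersect the 1-based inclusive range [start, stop] with [1, length].

    Returns the (start, stop) range that would be searched, or None when the
    two intervals do not overlap (assumed behaviour, not documented above).
    """
    lo = max(start, 1)
    hi = min(stop, length)
    if lo > hi:
        return None
    return lo, hi


# From= 24 To= 200 on a 150-residue query: only 24..150 can be searched.
print(clamp_subsequence(24, 200, 150))
```

With a query longer than 200 residues the range is returned unchanged.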
The BLAST pages offer several different databases for searching. Some of these, like SwissProt, PDB and Kabat, are compiled outside of NCBI. Others, like ecoli, dbEST and month, are subsets of the NCBI databases. Other "virtual databases" can be created using the Limit by Entrez Query option.
Peptide Sequence Databases
nr
All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF
month
All new or revised GenBank CDS translation+PDB+SwissProt+PIR+PRF released
in the last 30 days.
swissprot
Last major release of the SWISS-PROT protein sequence database (no updates)
Drosophila genome
Drosophila genome proteins provided by Celera and Berkeley Drosophila Genome Project (BDGP).
yeast
Yeast (Saccharomyces cerevisiae) genomic CDS translations
ecoli
Escherichia coli genomic CDS translations
pdb
Sequences derived from the 3-dimensional structures in the Brookhaven Protein Data Bank
Patent
Protein sequences derived from the Patent division of GenBank
Nucleotide Sequence Databases
nr
All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0, 1 or
2 HTGS sequences). No longer "non-redundant".
month
All new or revised GenBank+EMBL+DDBJ+PDB sequences released in the last 30
days.
Drosophila genome
Drosophila genome provided by Celera and Berkeley
Drosophila Genome Project (BDGP).
dbest
Database of GenBank+EMBL+DDBJ sequences from EST divisions
dbsts
Database of GenBank+EMBL+DDBJ sequences from STS divisions
htgs
Unfinished High Throughput Genomic Sequences: phases 0, 1 and 2 (finished, phase 3 HTG sequences are in nr)
gss
Genome Survey Sequence, includes single-pass genomic data, exon-trapped sequences, and Alu PCR sequences.
yeast
Yeast (Saccharomyces cerevisiae) genomic nucleotide sequences
E. coli
Escherichia coli genomic nucleotide sequences
pdb
Sequences derived from the 3-dimensional structures in the Brookhaven Protein Data Bank
Patent
Nucleotide sequences derived from the Patent division of GenBank
vector
Vector subset of GenBank(R), NCBI, in ftp://ncbi.nlm.nih.gov/blast/db/
mito
Database of mitochondrial sequences
alu
Select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences. It is available by anonymous FTP from ncbi.nlm.nih.gov (under the /pub/jmc/alu directory). See "Alu alert" by Claverie and Makalowski, Nature vol. 371, page 752 (1994).
CDD Search
Compares protein sequences to the Conserved Domain Database (CDD). The CDD is a collection of functional and/or structural domains derived from two popular collections, Smart and Pfam, plus contributions from colleagues at NCBI. For more information please see the CDD homepage.
Limit by Entrez Query
BLAST searches can be limited to the results of an Entrez query against the database chosen. This can be used to limit searches to subsets of the BLAST databases. Any terms can be entered that would normally be allowed in an Entrez search session. For example:
protease AND NOT hiv1[Organism]
This will limit a BLAST search to all proteases, except those in HIV 1. This can also be used to limit searches to a particular molecule type:
biomol_mrna[PROP] AND brain
To limit a search to a specific organism, you can either select it from the pulldown menu, which lists the most common organisms in the databases, or enter the name of the organism in the Entrez Query field with the [Organism] qualifier. For example:
Mus musculus[Organism]
For help in constructing Entrez queries please see the "Writing Advanced Search Statements" section of the Entrez Help document.
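When many limits have to be typed, it can help to assemble the query text programmatically. The Python sketch below only builds strings of the form shown in the examples above. The helper names tag, excluding and all_of are hypothetical and are not part of any NCBI tool.

```python
def tag(term, field):
    """Attach an Entrez field qualifier to a term, e.g. 'hiv1[Organism]'."""
    return f"{term}[{field}]"


def excluding(term):
    """Negate a term, following the 'AND NOT ...' form used in the examples."""
    return f"NOT {term}"


def all_of(*terms):
    """Join terms so that every one of them must apply."""
    return " AND ".join(terms)


# Rebuild the example queries from the text.
print(all_of("protease", excluding(tag("hiv1", "Organism"))))
print(all_of(tag("biomol_mrna", "PROP"), "brain"))
print(tag("Mus musculus", "Organism"))
```

The resulting text is what would be pasted into the Entrez Query field.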
Filter (Low-complexity)
Mask off segments of the query sequence that have low compositional complexity, as determined by the SEG program of Wootton & Federhen (Computers and Chemistry, 1993) or, for BLASTN, by the DUST program of Tatusov and Lipman (in preparation). Filtering can eliminate statistically significant but biologically uninteresting reports from the blast output (e.g., hits against common acidic-, basic- or proline-rich regions), leaving the more biologically interesting regions of the query sequence available for specific matching against database sequences.
Filtering is only applied to the query sequence (or its translation products), not to database sequences. Default filtering is DUST for BLASTN, SEG for other programs.
It is not unusual for nothing at all to be masked by SEG when applied to sequences in SWISS-PROT, so filtering should not be expected to always yield an effect. Furthermore, in some cases, sequences are masked in their entirety, indicating that the statistical significance of any matches reported against the unfiltered query sequence should be suspect.
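To give a feel for what "low compositional complexity" means, the Python sketch below masks every window whose Shannon entropy falls at or below a cutoff. This is not the SEG or DUST algorithm, only a simplified stand-in. The function names, the window of 12, the cutoff of 2.2 bits and the use of "X" as the mask character are all illustrative assumptions.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits of the residue composition of a window."""
    n = len(window)
    counts = Counter(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=12, threshold=2.2, mask_char="X"):
    """Replace residues in any low-entropy window with mask_char.

    Simplified illustration only; SEG uses a more elaborate two-stage
    procedure that this sketch does not reproduce.
    """
    masked = [False] * len(seq)
    for i in range(len(seq) - window + 1):
        if shannon_entropy(seq[i:i + window]) <= threshold:
            for j in range(i, i + window):
                masked[j] = True
    return "".join(mask_char if m else ch for ch, m in zip(seq, masked))


# A proline-rich stretch flanked by more varied sequence.
print(mask_low_complexity("ACDEFGHIKLMN" + "P" * 16 + "ACDEFGHIKLMN"))
```

A sequence with no low-entropy window comes back unchanged, which mirrors the remark that filtering often masks nothing at all.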
Filter (Human repeats)
This option masks human repeats (LINEs and SINEs) and is especially useful for human sequences that may contain these repeats. Filtering for repeats can increase the speed of a search, especially with very long sequences (>100 kb) and against databases which contain a large number of repeats (htgs). For more information please see "Why does my search timeout on the BLAST servers?" in the BLAST Frequently Asked Questions. Human repeat filtering is still experimental and under development, so it may change in the near future.
Filter (Mask for lookup table only)
This option masks only for purposes of constructing the lookup table used by BLAST. BLAST searches consist of two phases: finding hits based upon a lookup table, and then extending them. The option to "Mask for lookup table only" masks only for the lookup table, so that no hits are found based upon low-complexity sequence. The BLAST extensions are performed without masking, so they can be extended through low-complexity sequence. This option is still experimental and may change in the near future.
Expect
The statistical significance threshold for reporting matches against database sequences; the default value is 10, meaning that 10 matches are expected to be found merely by chance, according to the stochastic model of Karlin and Altschul (1990). If the E-value ascribed to a match is greater than the EXPECT threshold, the match will not be reported. Lower EXPECT thresholds are more stringent, leading to fewer chance matches being reported. Increasing the threshold allows less significant matches to be reported. Fractional values are acceptable.
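The effect of the threshold can be sketched in Python. In the Karlin-Altschul model the expected number of chance matches with raw score at least S is E = K * m * n * exp(-lambda * S), where m and n are the query and database lengths. The parameter values, sequence lengths and hit names below are invented for illustration; real values of lambda and K depend on the scoring system and are not given in this document.

```python
import math


def expect_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance matches scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def reportable(hits, threshold=10.0):
    """Keep only the (name, e_value) pairs at or below the EXPECT threshold."""
    return [hit for hit in hits if hit[1] <= threshold]


# Illustrative parameters only (assumptions, not BLAST defaults).
LAM, K = 0.3, 0.1
hits = [
    (name, expect_value(score, query_len=250, db_len=1_000_000, lam=LAM, k=K))
    for name, score in [("hit_a", 90), ("hit_b", 60), ("hit_c", 40)]
]

for name, e in reportable(hits, threshold=10.0):
    print(name, f"{e:.3g}")
```

Lowering the threshold argument shortens the reported list, which is the "more stringent" behaviour described above.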
Inclusion Threshold
The statistical significance threshold for including a sequence in the model used by PSI-BLAST to create the PSSM on the next iteration.
Query Genetic Code
Genetic code to be used in blastx translation of the query. (See List of Genetic Codes)
Number of hits
It is possible to speed up a search by specifying the maximum number of hits to be computed.
AutoFormat
If AutoFormat is disabled (unchecked), the "Status = Ready" message and a change of background color to blue indicate that the search is complete. However, the results will not be formatted automatically. Formatting can be performed by pressing the 'Format' button on a previous page.
When the AutoFormat option is enabled (checked), clicking the Format button will show the status and time stamps and then automatically format BLAST results when they are ready.
Graphical Overview
An overview of the database sequences aligned to the query sequence is shown. The score of each alignment is indicated by one of five different colors, which divide the range of scores into five groups. Multiple alignments on the same database sequence are connected by a striped line. Mousing over a hit sequence causes the definition and score to be shown in the window at the top; clicking on a hit sequence takes the user to the associated alignments.
NCBI-gi
Causes NCBI gi identifiers to be shown in the output, in addition to the accession and/or locus name.
Descriptions
Restricts the number of short descriptions of matching sequences reported to the number specified; default limit is 100 descriptions. See also EXPECT.
Alignments
Restricts the number of database sequences for which high-scoring segment pairs (HSPs) are reported to the number specified; the default limit is 100. If more database sequences than this happen to satisfy the statistical significance threshold for reporting (see EXPECT), only the matches ascribed the greatest statistical significance are reported.
Alignment Views
pairwise
Standard BLAST alignment in pairs of query sequence and database match.
Query-anchored with identities
The database alignments are anchored to (shown in relation to) the query sequence. Identities are displayed as dashes, with mismatches displayed as single-letter nucleotide abbreviations.
Query-anchored without identities
Identities are shown as single-letter nucleotide abbreviations.
Flat Query-anchored with identities
The 'flat' display shows inserts as deletions on the query. Identities are displayed as dashes, with mismatches displayed as single-letter nucleotide abbreviations.
Flat Query-anchored without identities
The 'flat' display shows inserts as deletions on the query. Identities are shown as single-letter nucleotide abbreviations.
Get ASN.1 for SeqAnnot
SeqAnnot format for importation into NCBI Toolkit programs.
Get ASN.1 for the BLAST Object
Object format for NCBI toolkit programs.
The translations include:
blastx
compares a nucleotide query sequence translated in all reading frames against a protein sequence database.
tblastn
compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames.
tblastx
compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. Please note that the tblastx program cannot be used with the nr database on the BLAST Web page.
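"Translated in all reading frames" means three frames on the given strand and three on its reverse complement. The Python sketch below shows one way to produce them. It assumes the standard genetic code and an unambiguous A/C/G/T alphabet; the function names and the +1..+3 / -1..-3 frame labels are choices made for this example, and any codon containing another character is rendered as "X".

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons from the start of dna; '*' marks a stop."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to translations."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(reverse[offset:])
    return frames


for frame, protein in sorted(six_frames("ATGGCCTAA").items()):
    print(frame, protein)
```

A different genetic code, as selected by the Query Genetic Code option, would correspond to a different AMINO string.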
Matrix
A key element in evaluating the quality of a pairwise sequence alignment is the "substitution matrix", which assigns a score for aligning any possible pair of residues. The matrix used in a BLAST search can be changed depending on the type of sequences you are searching with (see the Frequently Asked Questions).
More information on BLAST substitution matrices
Gap Cost and Lambda Ratio
The pull down menu shows the Gap Costs (penalty to open a gap and penalty to extend a gap) and the Lambda ratio settings for the matrix chosen. Only a limited number of options are available for these parameters. Increasing the Gap Costs and Lambda ratio will result in alignments with fewer gaps.
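How the two gap costs combine can be sketched in Python. The sketch assumes the usual affine convention in which a gap of length k costs the opening penalty plus k times the extension penalty; the function name is hypothetical and the defaults are the protein values listed under Program Advanced Options.

```python
def gap_cost(length, open_cost=11, extend_cost=1):
    """Total penalty for one gap of the given length (affine model, assumed)."""
    if length <= 0:
        return 0
    return open_cost + extend_cost * length


# Compare one gap of three residues with three separate single-residue gaps.
print(gap_cost(3))      # one long gap
print(3 * gap_cost(1))  # three short gaps
```

Because the opening penalty is paid once per gap, raising it favours alignments with fewer, longer gaps over many short ones.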
PSSM
PSI-BLAST can save the Position Specific Score Matrix (PSSM) to be used in other protein searches. The PSSM can be stored in a text file and cut and pasted into the PSSM field.
To save a PSSM file:
- Run a protein BLAST search.
- Check the PSI-BLAST box on the formatting page.
- Click the "Format" button.
- On the PSI-BLAST results page, click the "Run PSI-BLAST Iteration 2" button.
- Now, on the Format page, select "PSSM" from the "Show" pull down menu.
- Click "Format".
- This will display text output with the ASCII-encoded PSSM. The "Save as..." option of the browser can be used to save this to a plain text file on your hard drive.
From the protein BLAST page, choose any database, and paste the contents of the PSSM text file into the "PSSM" field. If the database is the same as when the PSSM was stored, you'll reproduce the iteration on which you saved the PSSM; a different database will yield a different hit list.
Inclusion Threshold
The statistical significance threshold for including a sequence in the model used by PSI-BLAST on the next iteration.
Composition-based statistics
BLAST and PSI-BLAST now permit calculated E-values to take into account the amino acid composition of the individual database sequences involved in reported alignments. This improves E-value accuracy, thereby reducing the number of false positive results.
The improved statistics are achieved with a scaling procedure [1,2] which in effect employs a slightly different scoring system for each database sequence. As a result, raw BLAST alignment scores in general will not correspond precisely to those implied by any standard substitution matrix. Furthermore, identical alignments can receive different scores, based upon the compositions of the sequences they involve.
The improved statistics are now used by default for all rounds of searching on the PSI-BLAST page, but not on the BLAST page. Therefore, if one uses default settings, the results of the first round of searching will be different on the BLAST and PSI-BLAST pages. In addition, adjustments have been made to two PSI-BLAST parameters: the pseudocount constant default has been changed from 10 to 7, and the E-value threshold for including matches in the PSI-BLAST model has been changed from 0.001 to 0.002.
[1] Altschul, S.F. et al. (1997) Nucl. Acids Res. 25:3389-3402.
[2] Schäffer, A.A. et al. (1999) Bioinformatics 15:1000-1011.
Program Advanced Options
-G Cost to open gap [Integer]
default = 5 for nucleotides, 11 for proteins
-E Cost to extend gap [Integer]
default = 2 for nucleotides, 1 for proteins
-q Penalty for nucleotide mismatch [Integer]
default = -3
-r Reward for nucleotide match [Integer]
default = 1
-e Expect value [Real]
default = 10
-W Word size [Integer]
default = 11 for nucleotides, 3 for proteins
-y Dropoff (X) for blast extensions in bits (default if zero)
default = 20 for blastn, 7 for other programs
-X X dropoff value for gapped alignment (in bits)
default = 30 for blastn, 15 for other programs
-Z Final X dropoff value for gapped alignment (in bits)
default = 50 for blastn, 25 for other programs
-I Number of database sequences to save hits for (see "Number of Hits Computed")
default = 500
-v Number of descriptions to show
default = 100
-b Number of database sequences to show alignments for
default = 50
-Y Effective search space
default = 0, real search space
-z Effective database length
default = 0, real database length
-c Pseudocount constant for PSI-BLAST
default = 7
-F String with filtering directives.
This option allows a user to change seg parameters. To use seg with a window of 10, locut of 1.0 and hicut of 1.5, one should specify in the Advanced Options box: -F'S 10 1.0 1.5'. Note that it is necessary to use single quotes here; double quotes are removed by the browser.
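As a convenience, the text for the Advanced Options box can be assembled in Python. The sketch below uses only flags from the list above and reproduces the single-quoted -F form from the seg example. The function name, its keyword arguments, and the assumption that options are separated by single spaces are illustrative and not taken from the BLAST documentation.

```python
def advanced_options(gap_open=None, gap_extend=None, expect=None, seg=None):
    """Build an Advanced Options string from the values that are set.

    seg, if given, is a (window, locut, hicut) tuple for the -F directive.
    """
    parts = []
    if gap_open is not None:
        parts.append(f"-G {gap_open}")
    if gap_extend is not None:
        parts.append(f"-E {gap_extend}")
    if expect is not None:
        parts.append(f"-e {expect}")
    if seg is not None:
        window, locut, hicut = seg
        # Single quotes, because double quotes are removed by the browser.
        parts.append(f"-F'S {window} {locut} {hicut}'")
    return " ".join(parts)


print(advanced_options(gap_open=11, gap_extend=1, seg=(10, 1.0, 1.5)))
```

Options left as None are omitted, so the server defaults listed above apply to them.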
Limited values for gap existence and extension are supported for these three programs. Some supported and suggested values are:
Existence  Extension
10         1
10         2
11         1
8          2
9          2
PHI-BLAST (Pattern-Hit Initiated BLAST) is a search program that combines matching of regular expressions with local alignments surrounding the match. Given a protein sequence S and a regular expression pattern P occurring in S, PHI-BLAST helps answer the question: What other protein sequences both contain an occurrence of P and are homologous to S in the vicinity of the pattern occurrences? PHI-BLAST may be preferable to just searching for pattern occurrences because it filters out those cases where the pattern occurrence is probably random and not indicative of homology. Please see the Rules for Pattern Syntax.
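The first half of that question, locating occurrences of P and the region around each one, can be sketched in Python. The pattern here is written in Python regular expression syntax, not in the PHI-BLAST pattern syntax. The function name, the flank size and the example sequence are invented for illustration, and the local alignment step that PHI-BLAST performs around each occurrence is not shown.

```python
import re


def pattern_vicinities(sequence, pattern, flank=5):
    """Find non-overlapping pattern occurrences and their flanking regions.

    Returns a list of (start, end, vicinity) with 0-based, end-exclusive
    coordinates; vicinity is the match plus up to `flank` residues per side.
    """
    hits = []
    for match in re.finditer(pattern, sequence):
        lo = max(match.start() - flank, 0)
        hi = min(match.end() + flank, len(sequence))
        hits.append((match.start(), match.end(), sequence[lo:hi]))
    return hits


# Hypothetical protein fragment and pattern: C, then D or E, then H.
for hit in pattern_vicinities("MKTAGCEHWLLSTGCDHRR", r"C[DE]H", flank=2):
    print(hit)
```

Each vicinity is the kind of region in which a local alignment to S would then be evaluated, so that chance occurrences of the pattern can be discarded.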
The Position-Specific Iterated BLAST, or PSI-BLAST, program performs an iterative search in which sequences found in one round of searching are used to build a score model for the next round of searching. In PSI-BLAST the algorithm is not tied to a specific score matrix. BLAST has traditionally been implemented using an AxA substitution matrix, where A is the alphabet size. PSI-BLAST instead uses a QxA matrix, where Q is the length of the query sequence; at each position the cost of a letter depends on the position with respect to the query and the letter in the subject sequence.
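The difference between the two matrix shapes can be made concrete with a small Python sketch that scores an ungapped subject segment against a QxA matrix. The two-letter alphabet, the score values and the function name are toy assumptions chosen to keep the example short; a real PSSM has one column per amino acid.

```python
def score_segment(pssm, alphabet, segment):
    """Sum position-specific scores for an ungapped alignment to the query.

    pssm has one row per query position (Q rows) and one column per letter
    of `alphabet` (A columns); segment must have exactly Q letters.
    """
    if len(segment) != len(pssm):
        raise ValueError("segment length must equal the number of PSSM rows")
    column = {letter: j for j, letter in enumerate(alphabet)}
    return sum(row[column[letter]] for row, letter in zip(pssm, segment))


# Toy 3 x 2 matrix: query length Q = 3, alphabet size A = 2.
ALPHABET = "AB"
PSSM = [
    [2, -1],  # position 1 of the query
    [-1, 3],  # position 2
    [1, 0],   # position 3
]

print(score_segment(PSSM, ALPHABET, "ABA"))
print(score_segment(PSSM, ALPHABET, "BBB"))
```

The same letter can score differently at different query positions, which an AxA substitution matrix cannot express.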
Questions?
If you have additional questions please contact blast-help@ncbi.nlm.nih.gov
Revised January 21, 2000