The BLAST family of programs allows all combinations of DNA or protein query sequences with searches against DNA or protein databases:

  blastp compares an amino acid query sequence against a protein sequence database.

  blastn compares a nucleotide query sequence against a nucleotide sequence database.

  blastx compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database.

  tblastn compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands).

  tblastx compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.

The default matrix for all protein-protein comparisons is BLOSUM62.

Gaps in Blast

Version 2.0 of BLAST allows the introduction of gaps (deletions and insertions) into alignments. With a gapped alignment tool, homologous domains do not have to be broken into several segments. Also, the scoring of gapped results tends to be more biologically meaningful than that of ungapped results.

The programs blastn and blastp offer fully gapped alignments. blastx and tblastn have 'in-frame' gapped alignments and use sum statistics to link alignments from different frames. tblastx provides only ungapped alignments.

Blast Report

The BLAST report consists of a number of sections. The descriptions below are for a blastp comparison, but the format for the other programs is analogous. The BLAST report is not intended to be a parseable document. It is subject to change with little or no notice.

The BLAST report starts with some header information that lists the type of program (here blastp), the version (here 2.0.1), and a release date. Also listed are a reference to the BLAST program, the query definition line, and a summary of the database used.

BLASTP 2.0.1 [Aug-20-1997]

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A.
Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.

Query= gi|129295|sp|P01013|OVAX_CHICK gene X protein - chicken (fragment)
         (232 letters)

Database: Non-redundant SwissProt sequences
           59,576 sequences; 21,219,450 total letters

One-line descriptions of the database matches found are presented next. These include a database sequence identifier, the corresponding definition line, as well as the score (in bits) and the statistical significance ('E value') for this match (please see the section on statistics for an explanation of bits and significance). Consider the output below, from a gapped blastp comparison of SwissProt accession P01013 against the SwissProt database.

                                                                    High   E
Sequences producing significant alignments:                         Score  Value

sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED)               442  e-124
sp|P01014|OVAY_CHICK GENE Y PROTEIN (OVALBUMIN-RELATED)               353  9e-98
sp|P01012|OVAL_CHICK OVALBUMIN (PLAKALBUMIN) (ALLERGEN GAL D II)      278  5e-75
sp|P19104|OVAL_COTJA OVALBUMIN                                        268  5e-72
sp|P48595|BOMA_HUMAN BOMAPIN (PROTEASE INHIBITOR 10)                  199  2e-51
sp|P29508|SCC1_HUMAN SQUAMOUS CELL CARCINOMA ANTIGEN 1 (SCCA-1) ...   198  5e-51
sp|P80229|ILEU_PIG LEUKOCYTE ELASTASE INHIBITOR (LEI) (LEUCOCYTE...   197  1e-50
sp|P48594|SCC2_HUMAN SQUAMOUS CELL CARCINOMA ANTIGEN 2 (SCCA-2) ...   196  2e-50
sp|P50453|PTI9_HUMAN CYTOPLASMIC ANTIPROTEINASE 3 (CAP3) (PROTEA...   195  6e-50
sp|P05619|ILEU_HORSE LEUKOCYTE ELASTASE INHIBITOR (LEI)               193  2e-49

The first match, in this case, is the actual query sequence. The identifiers shown here are all from SwissProt, so they all have 'sp' in the first field, followed by the accession, and then a Locus name.
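
As a small illustration of the identifier layout just described, the following Python sketch splits a SwissProt-style identifier into its three fields. The names split_swissprot_id and SwissProtId are hypothetical, chosen for this example only, and the sketch handles nothing but the three-field 'sp' form shown in the list above; the README referenced in the text remains the authority on identifier syntax.

```python
from typing import NamedTuple


class SwissProtId(NamedTuple):
    """Fields of an identifier such as 'sp|P01013|OVAX_CHICK' (hypothetical helper type)."""
    database: str   # 'sp' for SwissProt
    accession: str  # e.g. 'P01013'
    locus: str      # e.g. 'OVAX_CHICK'


def split_swissprot_id(identifier: str) -> SwissProtId:
    """Split 'sp|ACCESSION|LOCUS' on the '|' separators.

    Assumption: exactly three fields, the first being 'sp'. Other
    identifier forms are rejected rather than guessed at.
    """
    fields = identifier.split("|")
    if len(fields) != 3 or fields[0] != "sp":
        raise ValueError(f"not a three-field 'sp' identifier: {identifier!r}")
    return SwissProtId(*fields)


if __name__ == "__main__":
    print(split_swissprot_id("sp|P01013|OVAX_CHICK"))
```

Because the report format may change without notice, a helper like this is best treated as a reading aid rather than a parser to build on.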
The syntax of these identifiers is discussed in more detail in the appendices of ftp://ncbi.nlm.nih.gov/blast/db/README

The definition lines are taken from the definition line in the database, with an ellipsis (e.g., in the entry for P29508) indicating that the definition line was too long for the space available. Ungapped alignments and results from blastx and tblastn will have an additional column ('N'), displaying the number of different segment pairs used to produce the alignment, according to the Karlin-Altschul statistics.

Each alignment is preceded by the sequence identifier, the full definition line, and the length of the database sequence. Next come the score (in bits as well as the raw score) and the statistical significance of the match, followed by the number of identities and positive matches according to the scoring system (e.g., BLOSUM62) and, if applicable, the number of gaps in the alignment. Finally the actual alignment is shown, with the query on top and the database match labeled as 'Sbjct'. Between the two sequences, the residue is shown if it is conserved, and a '+' is shown if there is a positive match. One or more dashes, '-', indicate insertions or deletions. The example below is the third sequence listed in the one-line descriptions above.

>sp|P01012|OVAL_CHICK OVALBUMIN (PLAKALBUMIN) (ALLERGEN GAL D II)
          Length = 386

 Score = 278 bits (744), Expect = 5e-75
 Identities = 149/231 (64%), Positives = 182/231 (78%), Gaps = 2/231 (0%)

Query   2 IKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNS 61
          I+++L  SS D  T +VLVNAI FKG+W+ AF  EDT+ MPF VT+QESKPVQMM
Sbjct 158 IRNVLQPSSVDSQTAMVLVNAIVFKGLWEKAFKDEDTQAMPFRVTEQESKPVQMMYQIGL 217

Query  62 FNVATLPAEKMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKR 121
          F VA++ +EKMKILELPFASG +SMLVLLPDEVS LE++E  INFEKLTEWT+ N ME+R
Sbjct 218 FRVASMASEKMKILELPFASGTMSMLVLLPDEVSGLEQLESIINFEKLTEWTSSNVMEER 277

Query 122 RVKVYLPQMKIEEKYNLTSVLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSE 181
          ++KVYLP+MK+EEKYNLTSVLMA+G+TD+F  SANL+GISSAESLKISQAVH A  E++E
Sbjct 278 KIKVYLPRMKMEEKYNLTSVLMAMGITDVFSSSANLSGISSAESLKISQAVHAAHAEINE 337

Query 182 DGIEMAGSTGVIEDIKHSPESEQFRADHPFLFLIKHNPTNTIVYFGRYWSP 232
           G E+ GS      +  +  SE+FRADHPFLF IKH  TN +++FGR  SP
Sbjct 338 AGREVVGSAEA--GVDAASVSEEFRADHPFLFCIKHIATNAVLFFGRCVSP 386

The last section lists specifics about the database searched as well as statistical and search parameters used:

  Database: Non-redundant SwissProt sequences
    Posted date: Aug 14, 1997 9:52 AM
  Number of letters in database: 21,219,450
  Number of sequences in database: 59,576

Lambda    K        H
0.317     0.132    0.377

Gapped
Lambda    K        H
0.255     0.0350   0.190

Matrix: BLOSUM62
Gap Penalties: Existence: 10, Extension: 1
Number of Hits to DB: 8938654
Number of Sequences: 59576
Number of extensions: 335248
Number of successful extensions: 1188
Number of sequences better than 10: 116
Number of HSP's better than 10.0 without gapping: 106
Number of HSP's successfully gapped in prelim test: 10
Number of HSP's that attempted gapping in prelim test: 868
Number of HSP's gapped (non-prelim): 120
length of query: 232
length of database: 21219450
effective HSP length: 52
effective length of query: 180
effective length of database: 18121498
effective search space: -1033097656
T: 11
A: 40
X1: 16 ( 7.3 bits)
X2: 40 (14.7 bits)
X3: 67 (24.6 bits)
S1: 41 (21.7 bits)
S2: 64 (28.4 bits)

Blast Statistics and Scores

One may judge the results of a blast search by two numbers. One is the 'bit' score, which is defined as:

    S' (bits) = [lambda * S (raw) - ln K] / ln 2

where lambda and K are Karlin-Altschul parameters. The expression of the score in terms of bits makes it independent of the scoring system used (i.e., which matrix).

The other is the Expect value, which estimates the statistical significance of the match by specifying the number of matches with a given score that are expected purely by chance in a search of a database of this size. An Expect value of two would indicate that two matches with this score are expected purely by chance. The Expect value changes with the size of the database (in a larger database more chance matches with a given score are expected) and is the most intuitive way to rank results or compare the results of one query run against two different databases.

Blastall

Blastall may be used to perform all five flavors of blast comparison. One may obtain the blastall options by executing 'blastall -' (note the dash). A typical blastall command to perform a blastn search (nucl. vs. nucl.) of a file called QUERY would be:

    blastall -p blastn -d nr -i QUERY -o out.QUERY

The output is placed into the output file out.QUERY and the search is performed against the 'nr' database. If a protein vs. protein search is desired, then 'blastn' should be replaced with 'blastp', etc.

Some of the most commonly used blastall options are:

blastall arguments:

  -p  Program Name [String]
      Input should be one of "blastp", "blastn", "blastx", "tblastn", or "tblastx".

  -d  Database [String]
      default = nr
      Version 2.0.4 and higher will accept multiple database names (bracketed by quotations). An example would be

          -d "nr est"

      which will search both the nr and est databases, presenting the results as if one 'virtual' database consisting of all the entries from both were searched.
      The statistics are based on the 'virtual' database.

  -i  Query File [File In]
      default = stdin
      The query should be in FASTA format. If multiple FASTA entries are in the input file, all queries will be searched.

  -e  Expectation value (E) [Real]
      default = 10.0

  -o  BLAST report Output File [File Out] Optional
      default = stdout

  -F  Filter query sequence (DUST with blastn, SEG with others) [T/F]
      default = T
      See the "Low-complexity Filters" section below for details.

Blastpgp

Blastpgp performs gapped blastp searches and can be used to perform iterative searches in psi-blast and phi-blast mode. See the PSI-Blast and PHI-BLAST sections for a description of this binary. The options may be obtained by executing 'blastpgp -'.

PSI-Blast

The blastpgp program can do an iterative search in which sequences found in one round of searching are used to build a score model for the next round of searching. In this usage, the program is called Position-Specific Iterated BLAST, or PSI-BLAST. As explained in the accompanying paper, the BLAST algorithm is not tied to a specific score matrix. Traditionally, it has been implemented using an AxA substitution matrix where A is the alphabet size. PSI-BLAST instead uses a QxA matrix, where Q is the length of the query sequence; at each position the cost of a letter depends on the position w.r.t. the query and the letter in the subject sequence.

The position-specific matrix for round i+1 is built from a constrained multiple alignment among the query and the sequences found with sufficiently low e-value in round i. The top part of the output for each round divides the sequences into two groups: sequences found previously and used in the score model, and sequences not used in the score model. The output currently includes lots of diagnostics requested by users at NCBI. To skip quickly from the output of one round to the next, search for the string "producing", which is part of the header for each round and likely does not appear elsewhere in the output.
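
The tip about searching for "producing" can be sketched in Python as follows. This is a minimal illustration, not part of the BLAST distribution: the function name split_rounds is hypothetical, the toy report string is made up for the demonstration rather than real blastpgp output, and the approach inherits the caveat that the report format is not meant to be parsed and may change.

```python
from typing import List

# String that, according to the text above, is part of each round's header.
ROUND_MARKER = "producing"


def split_rounds(report: str) -> List[str]:
    """Return one chunk of report text per round (hypothetical helper).

    A new chunk starts at every line containing ROUND_MARKER; any text
    before the first such line (the report preamble) is dropped.
    """
    rounds: List[List[str]] = []
    for line in report.splitlines():
        if ROUND_MARKER in line:
            rounds.append([])      # a round header opens a new chunk
        if rounds:
            rounds[-1].append(line)
    return ["\n".join(chunk) for chunk in rounds]


if __name__ == "__main__":
    # Made-up stand-in for a two-round report, for illustration only.
    toy_report = (
        "preamble\n"
        "Sequences producing significant alignments:\n"
        "hit A\n"
        "Sequences producing significant alignments:\n"
        "hit A\n"
        "hit B\n"
    )
    for number, chunk in enumerate(split_rounds(toy_report), start=1):
        print(f"--- round {number} ---")
        print(chunk)
```

Splitting on a fixed marker keeps the sketch independent of the rest of the report layout, which is the same reason the text recommends searching for that string by hand.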
PSI-BLAST "converges" and stops if all sequences found at round i+1 below the e-value threshold were already in the model at the beginning of the round.

There are several blastpgp parameters specifically for PSI-BLAST:

  -j  is the maximum number of rounds (default 1; i.e., regular BLAST)
  -e  is the e-value threshold for including sequences in the score matrix model (default 0.01)
  -c  is the "constant" used in the pseudocount formula specified in the paper (default 10)

The -C and -R flags provide a "checkpointing" facility whereby a score model can be stored and later reused.

  -C  stores the query and frequency count ratio matrix in a file
  -R  restarts from a file stored previously

When using -R, it is required that the query specified on the command line match exactly the query in the restart file.

The checkpoint files are stored in a byte-encoded (not human readable) format, so as to prevent roundoff error between writing and reading the checkpoint.

Users who also develop their own sequence analysis software may wish to develop their own scoring systems. For this purpose the code in posit.c that writes out the checkpoint can be easily adapted to write out scoring systems derived by other algorithms in such a way that PSI-BLAST can read the files in later. The checkpoint structure is general in the sense that it can handle any position-specific matrix that fits in the Karlin-Altschul statistical framework for BLAST scoring.

PHI-Blast

PHI-BLAST (Pattern-Hit Initiated BLAST) is a search program that combines matching of regular expressions with local alignments surrounding the match. The most important features of the program have been incorporated into the BLAST software framework partly for user convenience and partly so that PHI-BLAST may be combined seamlessly with PSI-BLAST. Other features that do not fit into the BLAST framework will be released later as a separate program and/or separate Web page query options.
One very restrictive way to identify protein motifs is by regular expressions that every instance of the motif must match. The PROSITE database is a compilation of restricted regular expressions that describe protein motifs. Given a protein sequence S and a regular expression pattern P occurring in S, PHI-BLAST helps answer the question:

    What other protein sequences both contain an occurrence of P and are homologous to S in the vicinity of the pattern occurrences?

PHI-BLAST may be preferable to just searching for pattern occurrences because it filters out those cases where the pattern occurrence is probably random and not indicative of homology. PHI-BLAST may be preferable to other flavors of BLAST because it is faster and because it allows the user to express a rigid pattern occurrence requirement.

The pattern search methods in PHI-BLAST are based on the algorithms in:

R. Baeza-Yates and G. Gonnet, Communications of the ACM 35(1992), pp. 74-82.
S. Wu and U. Manber, Communications of the ACM 35(1992), pp. 83-91.

The calculation of local alignments is done using a method very similar to (and much of the same code as) gapped BLAST. However, the method of evaluating statistical significance is different, and is described below.

In the stand-alone mode the typical PHI-BLAST usage looks like:

    blastpgp -i -k -p patseedp

where -i is followed by the file containing the query in FASTA format, -k is followed by the file containing the pattern in a syntax given below, and "patseedp" indicates the mode of usage and does not represent any file.

The syntax for the query sequence is FASTA format as for all other BLAST queries. The syntax for patterns follows the rules of PROSITE and is documented in detail below. The specified pattern is not required to be in the PROSITE list.

Most of the other BLAST flags can be used with PHI-BLAST. One important exception is that PHI-BLAST requires gapped alignments (i.e.
forbids -g F in the flags) because ungapped alignments do not make sense for almost all patterns in PROSITE.

There is a second mode of PHI-BLAST usage that is important when the specified pattern occurs more than once in the query. In this case, the user may be interested in restricting the search for local alignments to a subset of the pattern occurrences. This can be done with a search that looks like:

    blastpgp -i -k -p seedp

in which case the use of the "seedp" option requires the user to specify the location(s) of the interesting pattern occurrence(s) in the pattern file. The syntax for how to specify pattern occurrences is below. When there are multiple pattern occurrences in the query it may be important to decide how many are of interest because the E-value for matches is effectively multiplied by the number of interesting pattern occurrences.

The PHI-BLAST Web page supports only the "patseedp" option.

PHI-BLAST is integrated with PSI-BLAST. In the command-line mode, PSI-BLAST can be invoked by using the -j option, as usual. When this is done as:

    blastpgp -i -k -p patseedp -j

then the first round of searching uses PHI-BLAST and all subsequent rounds use PSI-BLAST. In the Web page setting, the user must explicitly invoke one round at a time, and the PHI-BLAST Web page provides the option to initiate a PSI-BLAST round with the PHI-BLAST results. To describe a combined usage, use the term "PHI-PSI-BLAST" (Pattern-Hit Initiated, Position-Specific Iterated BLAST).

Determining statistical significance.

When a query sequence Q matches a database sequence D in PHI-BLAST, it is useful to subdivide Q and D into 3 disjoint pieces:

    Qleft  Qpattern  Qright
    Dleft  Dpattern  Dright

The substrings Qpattern and Dpattern contain the pattern specified in the pattern file. The pieces Qpattern and Dpattern are aligned and that alignment is displayed as part of the PHI-BLAST output, but the score for that alignment is mostly ignored.
The "reduced" score r of an alignment is the sum of the scores obtained by aligning Qleft with Dleft and by aligning Qright with Dright. The expected number of alignments with a reduced score >= x is given by:

    CN(Lambda*x + 1)e^(-Lambda*x)

where:

  C and Lambda are "constants" depending on the score matrix and the gap costs.
  N is (number of occurrences of pattern in database) * (number of occurrences of pattern in Q).
  e is the base of the natural logarithm.

It is important to understand that this method of computing the statistical significance of a PHI-BLAST alignment is mathematically different from the method used for BLAST and PSI-BLAST alignments. However, both methods provide E-values, so the E-values are displayed with a similar output syntax.

Rules for pattern syntax for PHI-BLAST.

The syntax for patterns in PHI-BLAST follows the conventions of PROSITE. When using the stand-alone program, it is permissible to have multiple patterns in a file separated by a blank line between patterns. When using the Web page, only one pattern is allowed per query.

Valid protein characters for PHI-BLAST patterns:

    ABCDEFGHIKLMNPQRSTVWXYZU

Valid DNA characters for PHI-BLAST patterns:

    ACGT

Other useful delimiters:

  [ ]     means any one of the characters enclosed in the brackets, e.g., [LFYT] means one occurrence of L or F or Y or T
  -       means nothing (this is a spacer character used by PROSITE)
  x       with nothing following means any residue
  x(5)    means 5 positions in which any residue is allowed (and similarly for any other single number in parentheses after x)
  x(2,4)  means 2 to 4 positions where any residue is allowed, and similarly for any other two numbers separated by a comma; the first number should be < the second number
  >       can occur only at the end of a pattern and means nothing; it may occur before a period (another spacer used by PROSITE)
  .       may be used at the end of the pattern and means nothing

When using the stand-alone program, the pattern should be in a file, with the first line starting:

    ID

followed by 2 spaces and a text string giving the pattern a name. There should also be a line starting

    PA

followed by 2 spaces followed by the pattern description.

All other PROSITE codes in the first two columns are allowed, but only the HI code, described below, is relevant to PHI-BLAST.

Here is an example from PROSITE.

ID  CNMP_BINDING_2; PATTERN.
AC  PS00889;
DT  OCT-1993 (CREATED); OCT-1993 (DATA UPDATE); NOV-1995 (INFO UPDATE).
DE  Cyclic nucleotide-binding domain signature 2.
PA  [LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R-[STAQ]-A-x-[LIVMA]-x-[STACV].
NR  /RELEASE=32,49340;
NR  /TOTAL=57(36); /POSITIVE=57(36); /UNKNOWN=0(0); /FALSE_POS=0(0);
NR  /FALSE_NEG=1; /PARTIAL=1;
CC  /TAXO-RANGE=??EP?; /MAX-REPEAT=2;

The line starting ID gives the pattern a name. The lines starting AC, DT, DE, NR, and CC are relevant to PROSITE users, but irrelevant to PHI-BLAST. These lines are tolerated, but ignored by PHI-BLAST.

The line starting PA describes the pattern as:

    one of LIVMF
    followed by G
    followed by E
    followed by any single character
    followed by one of GAS
    followed by one of LIVM
    followed by any 5 to 11 characters
    followed by R
    followed by one of STAQ
    followed by A
    followed by any single character
    followed by one of LIVMA
    followed by any single character
    followed by one of STACV

In this case the pattern ends with a period. It can end with nothing after the last specifying symbol, or with any number of > signs or periods or a combination thereof.

Here is another example, illustrating the use of an HI line.

ID  ER_TARGET; PATTERN.
PA  [KRHQSA]-[DENQ]-E-L>.
HI  (19 22)
HI  (201 204)

In this example, the HI lines specify that the pattern occurs twice, once from positions 19 through 22 in the sequence and once from positions 201 through 204 in the sequence.
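
The pattern rules above can be illustrated with a short Python sketch that translates a PA-line pattern into a Python regular expression and reports where it occurs in a sequence. This is not the PHI-BLAST implementation (which the text says is based on the Baeza-Yates/Gonnet and Wu/Manber algorithms); it swaps in Python's re module purely for illustration. The function names prosite_to_regex and find_occurrences are hypothetical, only the syntax elements listed above are supported, the test sequence is made up, and reporting positions as 1-based inclusive pairs (in the manner of HI lines) is an assumption.

```python
import re
from typing import List, Tuple

# One pattern element: 'x' with an optional (n) or (n,m) count,
# a single upper-case residue letter, or a bracketed set such as [LIVM].
_ELEMENT = re.compile(r"x(?:\((\d+)(?:,(\d+))?\))?|[A-Z]|\[[A-Z]+\]")


def prosite_to_regex(pattern: str) -> str:
    """Translate a PA-line pattern into Python regex source (illustrative only)."""
    # Trailing '>' and '.' "mean nothing" under the rules above, so drop them.
    core = pattern.strip().rstrip(">.")
    pieces = []
    for element in core.split("-"):        # '-' is only a spacer
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        if element.startswith("x"):
            low, high = match.groups()
            if low is None:
                pieces.append(".")                          # x: any one residue
            elif high is None:
                pieces.append(".{%s}" % low)                # x(5)
            else:
                pieces.append(".{%s,%s}" % (low, high))     # x(2,4)
        else:
            pieces.append(element)         # literal residue or [set]
    return "".join(pieces)


def find_occurrences(pattern: str, sequence: str) -> List[Tuple[int, int]]:
    """Return non-overlapping occurrences as (first, last) positions.

    Assumption: positions are 1-based and inclusive.
    """
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.end()) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    print(prosite_to_regex(
        "[LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R-[STAQ]-A-x-[LIVMA]-x-[STACV]."))
    # Made-up sequence containing one occurrence of the ER_TARGET pattern.
    print(find_occurrences("[KRHQSA]-[DENQ]-E-L>.", "MMKDELMM"))
```

Note that re.finditer reports only non-overlapping matches, so a sketch like this can miss overlapping occurrences that a dedicated pattern scanner would find.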
These specifications are relevant when stand-alone PHI-BLAST is used with the seedp option, in which the interesting occurrences of the pattern in the sequence are specified. In this case the HI lines specify which occurrence(s) of the pattern should be used to find good alignments.

In general, the seedp option is more useful than the standard patseedp option ONLY when the pattern occurs K > 1 times in the sequence AND the user is interested in matching to J < K of those occurrences. Then using the HI lines enables the user to specify which occurrences are of interest.

Additional functionality related to PHI-BLAST.

PHI-BLAST takes as input both a sequence and a pattern occurring in that sequence, and searches a sequence database for other sequences containing the same pattern and having a good alignment. One may be interested in asking two related, simpler questions:

  1. Given a sequence and a database of patterns, which patterns occur in the sequence and where?

  2. Given a pattern and a sequence database, which sequences contain the pattern and where?

These queries can be answered with software closely related to PHI-BLAST, but they do not fit into the output framework of BLAST because the answers are simple lists without alignments and with no notion of statistical significance.

The NCBI toolbox includes another program, currently called seedtop, to answer the two queries above.

Query 1 can be asked with:

    seedtop -i -k -p patmatchp

Query 2 can be asked with:

    seedtop -d -k -p patternp

The -k argument is used similarly in all queries and the file format is always the same. The standard pattern database is PROSITE, but others (or a subset) can be used. There are plans afoot to offer the patmatchp query (number 1) on the PHI-BLAST web page or in its vicinity, but this would be restricted to having PROSITE as the pattern database.

References

Zhang, Zheng, Alejandro A. Schäffer, Webb Miller, Thomas L. Madden, David J. Lipman, Eugene V. Koonin, and Stephen F.
Altschul (1998), "Protein sequence similarity searches using patterns as seeds", Nucleic Acids Res. 26:3986-3990.

Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.

Karlin, Samuel and Stephen F. Altschul (1990), "Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes", Proc. Natl. Acad. Sci. USA 87:2264-68.

Karlin, Samuel and Stephen F. Altschul (1993), "Applications and statistics for multiple high-scoring segments in molecular sequences", Proc. Natl. Acad. Sci. USA 90:5873-7.