Warren Gish

Warren Richard Gish
Alma mater	University of California, Berkeley
Known for	BLAST
	Scientific career
Fields	Bioinformatics
Institutions	National Center for Biotechnology Information; Washington University in St. Louis; Advanced Biocomputing LLC; University of California, Berkeley
Thesis	I. SV40 mutants isolated from transformed human cells. II. Methods for sequence analysis (1988)
Doctoral advisor	Michael Botchan

Warren Richard Gish is the owner of Advanced Biocomputing LLC. He joined Washington University School of Medicine as a junior faculty member in 1994, and was a Research Associate Professor of Genetics from 2002 to 2007.^[2]

Education

After initially studying physics, Gish obtained an A.B. degree in Biochemistry from the University of California, Berkeley, and completed work for his Ph.D. degree in Molecular Biology at the same institution in 1988.^[1]

Research

Gish is primarily known for his contributions to NCBI BLAST,^[3]^[4] including creation and support of the BLAST Network Service, development and support of the nr (quasi-nonredundant) databases in February 1991, introduction of gapped BLAST (WU-BLAST 2.0) in May 1996, and his continued work on AB-BLAST. At Washington University in St. Louis, Gish led the genome analysis group that annotated all finished human, mouse and rat genome data produced by the university's Genome Sequencing Center from 1995 through 2002.

First available to internal users in December 1989, the NCBI BLAST Network Service was opened to the public in March 1990, shortly after the BLAST manuscript was accepted for publication. This was several months before the paper would appear in print, so the NCBI director requested that availability of the network service not be published. Because the service ran the latest BLAST software on fast SMP hardware against comprehensive, daily-updated sequence databases, word-of-mouth publicity soon established the NCBI as a convenient, one-stop shop for sequence similarity searching. Like the BLAST Network Service, Gish's subsequent independently undertaken projects would rely on word-of-mouth publicity.

In 1985, to address a frequent need for rapid identification of restriction enzyme recognition sites in DNA, Gish developed a deterministic finite automaton function library in the C language. The idea of applying a finite-state machine to this problem was suggested by fellow graduate student and Berkeley Software Distribution developer Michael J. Karels, who noted that the finite-state automaton techniques used by grep might be applicable. Gish's DFA implementation used a Mealy machine architecture, which is more compact than an equivalent Moore machine and therefore more efficient. The resulting automata could scan subject sequences for recognition sites in a single pass without backtracking. Although developed independently, the DFA construction method was later observed to effectively consolidate Algorithms 3 and 4 described by Alfred Aho and Margaret J. Corasick.^[5]
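As an illustration of the single-pass scanning described above, the following is a minimal Python sketch of an Aho-Corasick automaton whose failure links have been folded into a complete transition table, so that each input base triggers exactly one table lookup. It is not Gish's C library: the function names `build_dfa` and `scan`, the choice of enzymes, and the handling of non-ACGT symbols are assumptions made for this example, and outputs are attached to states here for brevity rather than to transitions as in a Mealy machine.

```python
from collections import deque

ALPHABET = "ACGT"

def build_dfa(patterns):
    """Build a deterministic automaton for a dict of {name: recognition site}.

    Returns (delta, emit): delta[state][base] is the next state and
    emit[state] lists the site names recognized on entering that state.
    """
    goto = [{}]   # trie edges, state 0 is the root
    emit = [[]]   # site names ending at each state
    for name, site in patterns.items():
        state = 0
        for base in site:
            if base not in goto[state]:
                goto.append({})
                emit.append([])
                goto[state][base] = len(goto) - 1
            state = goto[state][base]
        emit[state].append(name)

    # Breadth-first pass: replace failure links with explicit transitions.
    fail = [0] * len(goto)
    delta = [dict() for _ in goto]
    queue = deque()
    for base in ALPHABET:
        nxt = goto[0].get(base, 0)
        delta[0][base] = nxt
        if nxt:
            queue.append(nxt)
    while queue:
        state = queue.popleft()
        for base in ALPHABET:
            if base in goto[state]:
                nxt = goto[state][base]
                fail[nxt] = delta[fail[state]][base]
                emit[nxt] = emit[nxt] + emit[fail[nxt]]
                delta[state][base] = nxt
                queue.append(nxt)
            else:
                delta[state][base] = delta[fail[state]][base]
    return delta, emit

def scan(sequence, delta, emit, patterns):
    """Scan once, left to right, returning (start, name) for every site found."""
    hits = []
    state = 0
    for pos, base in enumerate(sequence.upper()):
        # Assumption: any symbol outside ACGT (such as N) resets the automaton.
        state = delta[state].get(base, 0)
        for name in emit[state]:
            hits.append((pos - len(patterns[name]) + 1, name))
    return hits

if __name__ == "__main__":
    sites = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "HindIII": "AAGCTT"}
    delta, emit = build_dfa(sites)
    for start, name in scan("TTGAATTCGGGATCCAAGCTT", delta, emit, sites):
        print(start, name)
```

Because every state has a transition for every base, the scanning loop never revisits earlier input, which is the property the paragraph above refers to as scanning without backtracking.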

While working at the University of California, Berkeley, in December 1986, Gish sped up the FASTP program^[6] (later known as FASTA^[7]) of William R. Pearson and David J. Lipman 2- to 3-fold without altering the results. When the performance modifications were communicated to Pearson and Lipman, Gish further suggested replacing FASTP's k-tuple lookup table with a deterministic finite automaton, estimating that this could improve overall performance by as much as 10%. The authors concluded that such a modest gain did not justify the added code complexity. At about the same time, Gish envisioned a centralized sequence-search service in which all GenBank nucleotide sequences would be maintained in memory, stored in compressed form to conserve space, and searched remotely over the Internet, thereby eliminating database I/O bottlenecks and making it practical for a single high-performance system to serve a large community of users.

Early prototypes of BLAST executed substantially faster than contemporary versions of FASTA. To further improve performance, Gish adapted his DFA code to BLAST word-hit identification as a replacement for lookup tables. Other contributions to BLAST included the use of compressed nucleotide sequences both as a compact storage format and as a faster internal representation for sequence searching; parallel processing; the use of locked shared-memory segments to keep large sequence databases resident in memory; memory-mapped I/O; and the use of sentinel bytes at the start and end of sequences to improve the speed of word-hit extension.
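The idea of compressed nucleotide storage mentioned above can be sketched as packing four bases into each byte at two bits per base. The example below is only an illustration: the bit layout, the code assignment (A=0, C=1, G=2, T=3), and the function names `pack` and `unpack` are assumptions, not a description of the actual BLAST database format, and ambiguity codes such as N are not handled.

```python
# Hypothetical 2-bit code assignment chosen for this sketch.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

def pack(seq):
    """Pack an unambiguous DNA string into bytes, four bases per byte.

    Returns (packed_bytes, length); the length is needed because the
    last byte may be only partly filled.
    """
    packed = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        shift = 6 - 2 * (i % 4)  # the first base of each byte uses the high-order bits
        packed[i // 4] |= CODE[base] << shift
    return bytes(packed), len(seq)

def unpack(packed, length):
    """Recover the original DNA string from its packed form."""
    return "".join(
        BASES[(packed[i // 4] >> (6 - 2 * (i % 4))) & 3] for i in range(length)
    )

if __name__ == "__main__":
    packed, n = pack("GATTACA")
    print(len(packed), "bytes for", n, "bases")
    print(unpack(packed, n))
```

Under this scheme a sequence occupies roughly a quarter of its one-byte-per-base size, which is what makes keeping a large database resident in memory more practical.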

Gish also developed BLASTX,^[8] TBLASTN,^[3] and TBLASTX (the latter unpublished), as well as transparent support for external programs such as seg, xnu, and dust to mask low-complexity regions in query sequences. He created the NCBI BLAST E-mail Service with optional public-key-encrypted communications, and the NCBI BLAST Network Service.

Additional contributions included development of and ongoing support for the NCBI quasi-nonredundant (nr) protein and nucleotide sequence databases, typically updated daily and available both through the BLAST Network Service and for public download. These databases integrated sequences from GenBank (including GenPept), Swiss-Prot, and the Protein Information Resource. Gish also developed the first BLAST API, which was used in EST^[9] annotation and Entrez data production, as well as in the NCBI BLAST version 1.4 application suite (Gish, unpublished), and designed the initial NCBI Dispatcher for distributed services (inspired by CORBA's Object Request Broker).

At Washington University in St. Louis, Gish developed WU-BLAST 2.0, introducing a new X-drop method for gapped sequence alignment combined with new statistics for evaluating gapped alignment scores. The resulting programs were substantially more sensitive but only marginally slower than ungapped BLAST. Sensitivity of gapped alignments was enhanced by the novel extension of Karlin-Altschul Sum statistics^[10] to the evaluation of gapped alignment scores. Sum statistics had been developed analytically for the evaluation of multiple, ungapped alignment scores, and their empirical use to evaluate multiple, gapped alignment scores was validated in collaboration with Stephen Altschul. In May 1996, WU-BLAST version 2.0, with gapped alignments and Sum statistics in all search modes (BLASTP, BLASTN, BLASTX, TBLASTN and TBLASTX), was publicly released in the form of a drop-in upgrade for existing users of ungapped NCBI BLAST and WU-BLAST (both at version 1.4, after they were forked in 1994). NCBI BLAST does not offer both gapped alignments and Sum statistics in every search mode. In 1997, Gish implemented a two-hit BLAST algorithm that was faster, more memory-efficient and slightly more sensitive than the one used by NCBI software.
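The X-drop termination rule can be illustrated with a deliberately simplified sketch. WU-BLAST 2.0 applied X-drop to gapped alignment; the example below applies the same stopping rule only to an ungapped, rightward extension, which is easier to follow. The function name `xdrop_extend_right` and the match and mismatch scores of +5 and -4 are assumptions chosen for the example, not parameters taken from WU-BLAST.

```python
def xdrop_extend_right(query, subject, q_start, s_start, x_drop, match=5, mismatch=-4):
    """Extend an ungapped alignment rightwards from (q_start, s_start).

    Extension stops once the running score has fallen more than x_drop
    below the best score seen so far, or when either sequence ends.
    Returns (best_score, length_of_best_extension).
    """
    score = 0
    best = 0
    best_len = 0
    i = 0
    while q_start + i < len(query) and s_start + i < len(subject):
        score += match if query[q_start + i] == subject[s_start + i] else mismatch
        i += 1
        if score > best:
            best, best_len = score, i
        elif best - score > x_drop:
            break  # the extension has dropped too far below its best score
    return best, best_len

if __name__ == "__main__":
    # A single mismatch is tolerated when x_drop is generous enough.
    print(xdrop_extend_right("AAAATAAAA", "AAAACAAAA", 0, 0, x_drop=10))
    # With a small x_drop the extension stops at the mismatch.
    print(xdrop_extend_right("AAAATAAAA", "AAAACAAAA", 0, 0, x_drop=3))
```

A larger `x_drop` lets the extension cross poorly matching stretches at the cost of exploring more of the sequences, which is the trade-off between sensitivity and speed that the rule controls.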

In 1999, Gish added support to WU-BLAST for the Extended Database Format (XDF), the first BLAST database format capable of accurately representing the entire draft sequence of the human genome in full-length chromosome sequence objects. WU-BLAST introduced XDF transparently, preserving compatibility with the original BLAST database format through an abstracted database I/O layer. WU-BLAST with XDF was the first BLAST suite to support indexed retrieval with NCBI standard FASTA-format sequence identifiers as keys (including the entire range of NCBI identifier tags); the first to allow retrieval of individual sequences, either in part or in whole, and in native form, translated or reverse-complemented; and the first capable of dumping the entire contents of a BLAST database back into human-readable FASTA format.

In 2000, unique support for reporting links (consistent sets of HSPs; also called chains by some later software packages) was added to WU-BLAST, along with the ability for users to limit the distance between HSPs allowed in the same set to a biologically relevant length (e.g., the length of the expected longest intron in the species of interest), with the distance limitation incorporated into the calculation of E-values. Gish proposed multiplexing query sequences to speed up BLAST searches by an order of magnitude or more (MPBLAST) and implemented segmented sequences with internal sentinel bytes, in part to aid multiplexing with MPBLAST and in part to prevent alignments from extending across segment boundaries when analyzing segmented query sequences from shotgun sequencing assemblies.

He also directed use of WU-BLAST as a fast, flexible search engine for accurately identifying and masking repetitive elements and low-complexity sequences in genome sequences (the MaskerAid^[11] package for RepeatMasker). With doctoral student Miao Zhang, Gish directed development of EXALIN,^[12] which improved the accuracy of spliced alignment predictions by a novel approach that combined information from donor and acceptor splice-site models with information from sequence conservation. Although EXALIN performed full dynamic programming by default, it could optionally use the output from WU-BLAST to seed the dynamic programming and speed up the process by about 100-fold with little loss of sensitivity or accuracy.
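The query multiplexing and sentinel-separated segments described above can be pictured with a small sketch: several queries are joined into one sequence with a sentinel between them, and a position in the combined sequence is later mapped back to its originating query. This is a hypothetical illustration only; the source does not describe how MPBLAST is implemented, and the sentinel value and the names `multiplex` and `locate` are assumptions.

```python
from bisect import bisect_right

# Hypothetical sentinel: a character that never matches any residue,
# so no alignment can extend across a segment boundary.
SENTINEL = "\0"

def multiplex(queries):
    """Join queries into one sequence separated by sentinels.

    Returns (combined, starts), where starts[k] is the offset of
    query k within the combined sequence.
    """
    starts = []
    offset = 0
    for query in queries:
        starts.append(offset)
        offset += len(query) + 1  # +1 accounts for the sentinel that follows
    return SENTINEL.join(queries), starts

def locate(starts, pos):
    """Map a position in the combined sequence to (query_index, local_offset)."""
    index = bisect_right(starts, pos) - 1
    return index, pos - starts[index]

if __name__ == "__main__":
    combined, starts = multiplex(["ACGT", "GG", "TTTAA"])
    print(starts)
    print(locate(starts, 6))
```

A single search over the combined sequence then stands in for one search per query, and `locate` attributes each hit to the right query afterwards. A position that falls on a sentinel maps to an offset equal to the length of the preceding query, which a caller would need to treat as a boundary rather than a residue.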

In 2008, Gish founded Advanced Biocomputing LLC, where he continues development and support of the AB-BLAST package.

In 2024, after the NCBI discontinued distribution of sequence databases in the simple, widely supported FASTA format, Gish provided scripts that accelerated conversion of NCBI BLAST databases to FASTA format severalfold.^[13]

References

[1] Gish, Warren Richard (1988). I. SV40 mutants isolated from transformed human cells. II. Methods for sequence analysis (PhD thesis). University of California, Berkeley. ProQuest 303669506.
[2] Warren Gish at DBLP Bibliography Server
[3] Altschul, S.; Gish, W.; Miller, W.; Myers, E.; Lipman, D. (1990). "Basic Local Alignment Search Tool". Journal of Molecular Biology. 215 (3): 403–410. doi:10.1016/S0022-2836(05)80360-2. PMID 2231712. S2CID 14441902.
[4] Sense from Sequences: Stephen F. Altschul on Bettering BLAST
[5] Aho, Alfred V.; Corasick, Margaret J. (June 1975). "Efficient string matching: An aid to bibliographic search". Communications of the ACM. 18 (6): 333–340. doi:10.1145/360825.360855. S2CID 207735784.
[6] Lipman, D. J.; Pearson, W. R. (1985). "Rapid and sensitive protein similarity searches". Science. 227 (4693): 1435–1441. Bibcode:1985Sci...227.1435L. doi:10.1126/science.2983426. PMID 2983426.
[7] Pearson, W. R.; Lipman, D. J. (1988). "Improved tools for biological sequence comparison". Proceedings of the National Academy of Sciences of the United States of America. 85 (8): 2444–2448. Bibcode:1988PNAS...85.2444P. doi:10.1073/pnas.85.8.2444. PMC 280013. PMID 3162770.
[8] Gish, W.; States, D. J. (1993). "Identification of protein coding regions by database similarity search". Nature Genetics. 3 (3): 266–272. doi:10.1038/ng0393-266. PMID 8485583. S2CID 15295142.
[9] Boguski, M. S.; Lowe, T. M.; Tolstoshev, C. M. (1993). "dbEST--database for 'expressed sequence tags'". Nature Genetics. 4 (4): 332–333. doi:10.1038/ng0893-332. PMID 8401577. S2CID 40138950.
[10] Karlin, S.; Altschul, S. F. (1993). "Applications and statistics for multiple high-scoring segments in molecular sequences". Proceedings of the National Academy of Sciences of the United States of America. 90 (12): 5873–5877. Bibcode:1993PNAS...90.5873K. doi:10.1073/pnas.90.12.5873. PMC 46825. PMID 8390686.
[11] Bedell, J. A.; Korf, I.; Gish, W. (2000). "MaskerAid: A performance enhancement to RepeatMasker". Bioinformatics. 16 (11): 1040–1041. doi:10.1093/bioinformatics/16.11.1040. PMID 11159316.
[12] Zhang, M.; Gish, W. (2005). "Improved spliced alignment from an information theoretic approach". Bioinformatics. 22 (1): 13–20. doi:10.1093/bioinformatics/bti748. PMID 16267086.
[13] "NCBI Helper". blast.advbiocomp.com. Retrieved 2026-05-30.
