Bringing large-scale multiple genome analysis one step closer: ScalaBLAST and beyond

Creation Information

Oehmen, Christopher S.; Sofia, Heidi J.; Baxter, Douglas; Szeto, Ernest; Hugenholtz, Philip; Kyrpides, Nikos et al. June 1, 2007.

Context

This report is part of the collection entitled Office of Scientific & Technical Information Technical Reports and was provided by the UNT Libraries Government Documents Department to the UNT Digital Library, a digital repository hosted by the UNT Libraries. More information about this report can be viewed below.

Who

People and organizations associated with either the creation of this report or its content.

Provided By

UNT Libraries Government Documents Department

Serving as both a federal and a state depository library, the UNT Libraries Government Documents Department maintains millions of items in a variety of formats. The department is a member of the FDLP Content Partnerships Program and an Affiliated Archive of the National Archives.

What

Descriptive information to help identify this report. Follow the links below to find similar items on the Digital Library.

Description

Genome sequence comparisons of exponentially growing data sets form the foundation for the comparative analysis tools provided by community biological data resources such as the Integrated Microbial Genomes (IMG) system at the Joint Genome Institute (JGI). We present an example of how ScalaBLAST, a high-throughput sequence analysis program, harnesses increasingly critical high-performance computing to perform sequence analysis, which is a critical component of maintaining a state-of-the-art sequence data repository. The Integrated Microbial Genomes (IMG) system [1] is a data management and analysis platform for microbial genomes hosted at the JGI. IMG contains both draft and complete JGI genomes integrated with other publicly available microbial genomes of all three domains of life. IMG provides tools and viewers for interactive analysis of genomes, genes and functions, individually or in a comparative context. Most of these tools are based on pre-computed pairwise sequence similarities involving millions of genes. These computations are becoming prohibitively time consuming because of the rapid increase in the number of newly sequenced genomes incorporated into IMG and the need to regularly refresh the content of IMG to reflect changes in the annotations of existing genomes. Thus, building IMG 2.0 (released on December 1, 2006) entailed reloading from NCBI's RefSeq all the genomes in the previous version of IMG (IMG 1.6, as of September 1, 2006) together with 1,541 new public microbial, viral and eukaryal genomes, bringing the total of IMG genomes to 2,301. A critical part of building IMG 2.0 involved using PNNL's ScalaBLAST software to compute pairwise similarities for over 2.2 million genes in under 26 hours on 1,000 processors, illustrating the impact that new-generation bioinformatics tools are poised to make in biology.
The BLAST algorithm [2, 3] is a familiar bioinformatics application for computing sequence similarity, and has become a workhorse in large-scale genomics projects. The rapid growth of genome resources such as IMG cannot be sustained without more powerful tools such as ScalaBLAST that use large-scale computing resources more effectively to perform the core BLAST calculations. ScalaBLAST is a high-performance computing algorithm designed to give high-throughput BLAST results on high-end supercomputers. Other parallel sequence comparison applications have been developed [4-6]. However, problems with scaling generally prevent these applications from being used for very large searches. ScalaBLAST [7] is the first BLAST application to scale well with both the size of the database and the number of computer processors, on high-end hardware and on commodity clusters. ScalaBLAST achieves high throughput by dividing a large collection of query sequences into independent subgroups. These smaller tasks are assigned to independent process groups. Efficient scaling is achieved by sharing only one copy of the target database across all processors, transparently to the user, using the Global Arrays toolkit [8, 9], which provides a software implementation of a shared-memory interface. ScalaBLAST was initially deployed on the 1,960-processor MPP2 cluster in the William R. Wiley Environmental Molecular Sciences Laboratory at Pacific Northwest National Laboratory, and has since been ported to a variety of Linux-based clusters and shared-memory architectures, including SGI Altix, AMD Opteron, and Intel Xeon-based clusters. Future targets include IBM BlueGene, Cray, and SGI Altix XE architectures. The importance of performing high-throughput calculations rapidly lies in the rate of growth of sequence data.
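The query-partitioning idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not ScalaBLAST's implementation: the function names `split_queries` and `assign_processes`, the contiguous near-equal split, and the fixed group size are all hypothetical, and the shared database (handled by the Global Arrays toolkit in ScalaBLAST) is not modeled here.

```python
from typing import Dict, List, Sequence


def split_queries(queries: Sequence[str], n_groups: int) -> List[List[str]]:
    """Divide queries into n_groups contiguous subgroups of near-equal size.

    Hypothetical helper: the report does not say how subgroups are formed.
    """
    if n_groups < 1:
        raise ValueError("n_groups must be at least 1")
    base, extra = divmod(len(queries), n_groups)
    groups: List[List[str]] = []
    start = 0
    for g in range(n_groups):
        # The first `extra` groups take one additional query each.
        size = base + (1 if g < extra else 0)
        groups.append(list(queries[start:start + size]))
        start += size
    return groups


def assign_processes(n_procs: int, procs_per_group: int) -> Dict[int, List[int]]:
    """Map each process-group id to the process ranks it contains.

    Hypothetical helper: assumes equal-sized groups of consecutive ranks.
    """
    if procs_per_group < 1 or n_procs % procs_per_group != 0:
        raise ValueError("n_procs must be a positive multiple of procs_per_group")
    n_groups = n_procs // procs_per_group
    return {
        g: list(range(g * procs_per_group, (g + 1) * procs_per_group))
        for g in range(n_groups)
    }


if __name__ == "__main__":
    queries = [f"gene_{i}" for i in range(10)]  # made-up query identifiers
    process_groups = assign_processes(n_procs=8, procs_per_group=2)
    work = split_queries(queries, len(process_groups))
    for gid, ranks in process_groups.items():
        # Each process group searches its own subgroup independently.
        print(f"group {gid}: ranks {ranks} -> {work[gid]}")
```

Because each subgroup is independent, no communication between process groups is needed while searching; only the target database has to be visible to every group.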
For a genome sequencing center to provide multiple-genome comparison capabilities, it must keep pace with the exponentially growing collection of protein data, both from its own genomes and from public genome data. As sequence data continues to grow exponentially, this challenge will only increase with time. Solving the BLAST throughput challenge for centralized data resources like IMG has the potential to unlock the power of emerging analysis methods which, until recently, were limited by the availability of multiple-genome comparison data. Fig. 1 illustrates how the run time achieved by efficient scaling in ScalaBLAST enabled the IMG all vs. all BLAST calculations to complete in roughly 1 day. Note that to keep pace with the growing IMG database, we will have to double the number of processors used in these calculations during the upcoming year. Grid-based solutions for improving throughput for BLAST searches have become a popular and attractive option for some centers. The Institute for Genomic Research (http://www.tigr.org/), for instance, has implemented a grid-based BLAST tool allowing users to submit requests to be farmed out to available computers on an on-demand basis.
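As a rough back-of-the-envelope aid for the scaling figures above, the following Python sketch estimates how many processors would be needed to hold wall-clock time fixed as the gene count grows. It rests on assumptions that are not stated in the report: that work grows as a power of the gene count (quadratic by default, as a naive model of an all vs. all comparison) and that scaling across processors is ideal. The function names are hypothetical, and the report's figures ("over 2.2 million genes", "under 26 hours") are bounds rather than exact values.

```python
import math


def processor_hours(wall_hours: float, n_procs: int) -> float:
    """Total processor-hours consumed by a run (wall time x processor count)."""
    return wall_hours * n_procs


def procs_needed(base_procs: int, base_genes: int, new_genes: int,
                 exponent: float = 2.0) -> int:
    """Processors needed to keep wall-clock time fixed as the gene count grows.

    Assumptions (not from the report): work is proportional to
    genes ** exponent, and run time is inversely proportional to the
    number of processors (ideal scaling).
    """
    if base_procs < 1 or base_genes < 1 or new_genes < 1:
        raise ValueError("all counts must be positive")
    growth = (new_genes / base_genes) ** exponent
    return math.ceil(base_procs * growth)


if __name__ == "__main__":
    # Upper bound on the cost of the IMG 2.0 run quoted in the description.
    print(processor_hours(26, 1000), "processor-hours (upper bound)")
    # Hypothetical scenario: the gene count doubles and wall time is held fixed.
    print(procs_needed(1000, 2_200_000, 4_400_000), "processors (quadratic model)")
```

Under the quadratic model, doubling the processor count would absorb roughly a 1.4-fold growth in gene count; under a linear model (`exponent=1.0`) it would absorb a 2-fold growth. Which model is closer to the real workload is not something the description settles.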

Identifier

Unique identifying numbers for this report in the Digital Library or other systems.

  • Report No.: LBNL-62882
  • Grant Number: DE-AC02-05CH11231
  • DOI: 10.2172/960403
  • Office of Scientific & Technical Information Report Number: 960403
  • Archival Resource Key: ark:/67531/metadc934830

Collections

This report is part of the following collection of related materials.

Office of Scientific & Technical Information Technical Reports

Reports, articles and other documents harvested from the Office of Scientific and Technical Information.

Office of Scientific and Technical Information (OSTI) is the Department of Energy (DOE) office that collects, preserves, and disseminates DOE-sponsored research and development (R&D) results that are the outcomes of R&D projects or other funded activities at DOE labs and facilities nationwide and grantees at universities and other institutions.

When

Dates and time periods associated with this report.

Creation Date

  • June 1, 2007

Added to The UNT Digital Library

  • Nov. 13, 2016, 7:26 p.m.

Description Last Updated

  • Nov. 18, 2016, 2:26 p.m.

Oehmen, Christopher S.; Sofia, Heidi J.; Baxter, Douglas; Szeto, Ernest; Hugenholtz, Philip; Kyrpides, Nikos et al. Bringing large-scale multiple genome analysis one step closer: ScalaBLAST and beyond, report, June 1, 2007; Berkeley, California. (digital.library.unt.edu/ark:/67531/metadc934830/: accessed November 20, 2018), University of North Texas Libraries, Digital Library, digital.library.unt.edu; crediting UNT Libraries Government Documents Department.