The Sorcerer II Global Ocean Sampling Expedition: Expanding theUniverse of Protein Families

PDF Version Also Available for Download.

Description

Metagenomics projects based on shotgun sequencing of populations of micro-organisms yield insight into protein families. We used sequence similarity clustering to explore proteins with a comprehensive dataset consisting of sequences from available databases together with 6.12 million proteins predicted from an assembly of 7.7 million Global Ocean Sampling (GOS) sequences. The GOS dataset covers nearly all known prokaryotic protein families. A total of 3,995 medium- and large-sized clusters consisting of only GOS sequences are identified, out of which 1,700 have no detectable homology to known families. The GOS-only clusters contain a higher than expected proportion of sequences of viral origin, ... continued below

Creation Information

Yooseph, Shibu; Sutton, Granger; Rusch, Douglas B.; Halpern,Aaron L.; Williamson, Shannon J.; Remington, Karin et al. March 23, 2006.

Context

This article is part of the collection entitled: Office of Scientific & Technical Information Technical Reports and was provided by UNT Libraries Government Documents Department to Digital Library, a digital repository hosted by the UNT Libraries. More information about this article can be viewed below.

Who

People and organizations associated with either the creation of this article or its content.

Provided By

UNT Libraries Government Documents Department

Serving as both a federal and a state depository library, the UNT Libraries Government Documents Department maintains millions of items in a variety of formats. The department is a member of the FDLP Content Partnerships Program and an Affiliated Archive of the National Archives.

Contact Us

What

Descriptive information to help identify this article. Follow the links below to find similar items on the Digital Library.

Description

Metagenomics projects based on shotgun sequencing of populations of micro-organisms yield insight into protein families. We used sequence similarity clustering to explore proteins with a comprehensive dataset consisting of sequences from available databases together with 6.12 million proteins predicted from an assembly of 7.7 million Global Ocean Sampling (GOS) sequences. The GOS dataset covers nearly all known prokaryotic protein families. A total of 3,995 medium- and large-sized clusters consisting of only GOS sequences are identified, out of which 1,700 have no detectable homology to known families. The GOS-only clusters contain a higher than expected proportion of sequences of viral origin, thus reflecting a poor sampling of viral diversity until now. Protein domain distributions in the GOS dataset and current protein databases show distinct biases. Several protein domains that were previously categorized as kingdom specific are shown to have GOS examples in other kingdoms. About 6,000 sequences (ORFans) from the literature that heretofore lacked similarity to known proteins have matches in the GOS data. The GOS dataset is also used to improve remote homology detection. Overall, besides nearly doubling the number of current proteins, the predicted GOS proteins also add a great deal of diversity to known protein families and shed light on their evolution. These observations are illustrated using several protein families, including phosphatases, proteases, ultraviolet-irradiation DNA damage repair enzymes, glutamine synthetase, and RuBisCO. The diversity added by GOS data has implications for choosing targets for experimental structure characterization as part of structural genomics efforts. Our analysis indicates that new families are being discovered at a rate that is linear or almost linear with the addition of new sequences, implying that we are still far from discovering all protein families in nature.

Source

  • Journal Name: PLoS Biology; Journal Volume: 5; Journal Issue: 3; Related Information: Journal Publication Date: 03/2007

Language

Item Type

Identifier

Unique identifying numbers for this article in the Digital Library or other systems.

  • Report No.: LBNL--62557
  • Grant Number: DE-AC02-05CH11231
  • Office of Scientific & Technical Information Report Number: 910224
  • Archival Resource Key: ark:/67531/metadc887299

Collections

This article is part of the following collection of related materials.

Office of Scientific & Technical Information Technical Reports

What responsibilities do I have when using this article?

When

Dates and time periods associated with this article.

Creation Date

  • March 23, 2006

Added to The UNT Digital Library

  • Sept. 22, 2016, 2:13 a.m.

Usage Statistics

When was this article last used?

Congratulations! It looks like you are the first person to view this item online.

Interact With This Article

Here are some suggestions for what to do next.

Start Reading

PDF Version Also Available for Download.

Citations, Rights, Re-Use

Yooseph, Shibu; Sutton, Granger; Rusch, Douglas B.; Halpern,Aaron L.; Williamson, Shannon J.; Remington, Karin et al. The Sorcerer II Global Ocean Sampling Expedition: Expanding theUniverse of Protein Families, article, March 23, 2006; United States. (digital.library.unt.edu/ark:/67531/metadc887299/: accessed August 18, 2017), University of North Texas Libraries, Digital Library, digital.library.unt.edu; crediting UNT Libraries Government Documents Department.