DOE Joint Genome Institute 2008 Progress Report

Description: While initially a virtual institute, the driving force behind the creation of the DOE Joint Genome Institute in Walnut Creek, California in the Fall of 1999 was the Department of Energy's commitment to sequencing the human genome. With the publication in 2004 of a trio of manuscripts describing the finished 'DOE Human Chromosomes', the Institute successfully completed its human genome mission. In the time between the creation of the Department of Energy Joint Genome Institute (DOE JGI) and completion of the Human Genome Project, sequencing and its role in biology spread to fields extending far beyond what could be imagined when the Human Genome Project first began. Accordingly, the targets of the DOE JGI's sequencing activities changed, moving from a single human genome to the genomes of large numbers of microbes, plants, and other organisms, and the community of users of DOE JGI data similarly expanded and diversified. Transitioning into operating as a user facility, the DOE JGI modeled itself after other DOE user facilities, such as synchrotron light sources and supercomputer facilities, empowering the science of large numbers of investigators working in areas of relevance to energy and the environment. The JGI's approach to being a user facility is based on the concept that by focusing state-of-the-art sequencing and analysis capabilities on the best peer-reviewed ideas drawn from a broad community of scientists, the DOE JGI will effectively encourage creative approaches to DOE mission areas and produce important science. This clearly has occurred, only partially reflected in the fact that the DOE JGI has played a major role in more than 45 papers published in just the past three years alone in Nature and Science. The involvement of a large and engaged community of users working on important problems has helped maximize the impact of JGI science. A seismic ...
Date: March 12, 2009
Creator: Gilbert, David
JGI Fungal Genomics Program

Description: Genomes of fungi relevant to energy and the environment are the focus of the Fungal Genomics Program at the US Department of Energy Joint Genome Institute (JGI). Its key project, the Genomics Encyclopedia of Fungi, targets fungi related to plant health (symbionts, pathogens, and biocontrol agents) and biorefinery processes (cellulose degradation, sugar fermentation, industrial hosts), and explores fungal diversity by means of genome sequencing and analysis. Over 50 fungal genomes have been sequenced by JGI to date and released through MycoCosm (www.jgi.doe.gov/fungi), a fungal web portal that integrates sequence and functional data with genome analysis tools for the user community. Sequence analysis supported by functional genomics leads to the development of parts lists for complex systems ranging from ecosystems of biofuel crops to biorefineries. Recent examples of such 'parts' suggested by comparative genomics and functional analysis in these areas are presented here.
Date: March 14, 2011
Creator: Grigoriev, Igor V.
Genome Sequence Databases (Overview): Sequencing and Assembly

Description: From the date its role in heredity was discovered, DNA has been generating interest among scientists from different fields of knowledge: physicists have studied the three-dimensional structure of the DNA molecule, biologists have tried to decode the secrets of life hidden within these long molecules, and technologists have invented and improved methods of DNA analysis. The analysis of the nucleotide sequence of DNA occupies a special place among the methods developed. Thanks to the variety of sequencing technologies available, the process of decoding the sequence of genomic DNA (or whole genome sequencing) has become robust and inexpensive. Meanwhile, the assembly of whole genome sequences remains a challenging task. In addition to the need to assemble millions of DNA fragments of different lengths (from 35 bp (Solexa) to 800 bp (Sanger)), great interest in the analysis of microbial communities (metagenomes) of different complexities raises new problems and pushes some new requirements for sequence assembly tools to the forefront. The genome assembly process can be divided into two steps: draft assembly and assembly improvement (finishing). Although automatically performed assembly (or draft assembly) is capable of covering up to 98% of the genome, in most cases it still contains incorrectly assembled reads. The error rate of the consensus sequence produced at this stage is about 1/2000 bp. A finished genome represents a genome assembly of much higher accuracy (with no gaps or incorrectly assembled areas) and quality (approximately 1 error/10,000 bp), validated through a number of computer and laboratory experiments.
Date: January 1, 2009
Creator: Lapidus, Alla L.
Next-generation transcriptome assembly

Description: Transcriptomics studies often rely on partial reference transcriptomes that fail to capture the full catalog of transcripts and their variations. Recent advances in sequencing technologies and assembly algorithms have facilitated the reconstruction of the entire transcriptome by deep RNA sequencing (RNA-seq), even without a reference genome. However, transcriptome assembly from billions of RNA-seq reads, which are often very short, poses a significant informatics challenge. This Review summarizes the recent developments in transcriptome assembly approaches (reference-based, de novo and combined strategies), along with some perspectives on transcriptome assembly in the near future.
Date: September 1, 2011
Creator: Martin, Jeffrey A. & Wang, Zhong
ChIP-seq Mapping of Distant-Acting Enhancers and Their In Vivo Activities

Description: The genomic location and function of most distant-acting transcriptional enhancers in the human genome remain unknown. We performed ChIP-seq for various transcriptional coactivator proteins (such as p300) directly from different embryonic mouse tissues, identifying thousands of binding sites. Transgenic mouse experiments show that p300 and other coactivator peaks are highly predictive of both the genomic location and the tissue-specific activity patterns of distant-acting enhancers. Most enhancers are active in only one or very few tissues. The genomic location of tissue-specific p300 peaks correlates with tissue-specific expression of nearby genes. Most binding sites are conserved, but the global degree of conservation varies between tissues.
Date: June 1, 2011
Creator: Visel, Axel & Pennacchio, Len A.
BioPig: Developing Cloud Computing Applications for Next-Generation Sequence Analysis

Description: Next-generation sequencing is producing ever larger data sizes with a growth rate outpacing Moore's Law. The data deluge has made many of the current sequence analysis tools obsolete because they do not scale with data. Here we present BioPig, a collection of cloud computing tools to scale data analysis and management. Pig is a flexible data scripting language that uses Apache's Hadoop data structure and map reduce framework to process very large data files in parallel and combine the results. BioPig extends Pig with sequence analysis capability. We will show the performance of BioPig on a variety of bioinformatics tasks, including screening sequence contaminants, Illumina QA/QC, and gene discovery from metagenome data sets using the Rumen metagenome as an example.
Date: March 22, 2011
Creator: Bhatia, Karan & Wang, Zhong
A renaissance for the pioneering 16S rRNA gene

Description: Culture-independent molecular surveys using the 16S rRNA gene have become a mainstay for characterizing microbial community structure over the last quarter century. More recently this approach has been overshadowed by metagenomics, which provides a global overview of a community's functional potential rather than just an inventory of its inhabitants. However, the pioneering 16S rRNA gene is making a comeback in its own right thanks to a number of methodological advancements including higher resolution (more sequences), analysis of multiple related samples (e.g. spatial and temporal series) and improved metadata and use of metadata. The standard conclusion that microbial ecosystems are remarkably complex and diverse is now being replaced by detailed insights into microbial ecology and evolution based only on this one historically important marker gene.
Date: September 7, 2008
Creator: Tringe, Susannah & Hugenholtz, Philip
Challenges in Whole-Genome Annotation of Pyrosequenced Eukaryotic Genomes

Description: Pyrosequencing technologies such as 454/Roche and Solexa/Illumina vastly lower the cost of nucleotide sequencing compared to the traditional Sanger method, and thus promise to greatly expand the number of sequenced eukaryotic genomes. However, the new technologies also bring new challenges such as shorter reads and new kinds and higher rates of sequencing errors, which complicate genome assembly and gene prediction. At JGI we are deploying 454 technology for the sequencing and assembly of ever-larger eukaryotic genomes. Here we describe our first whole-genome annotation of a purely 454-sequenced fungal genome that is larger than a yeast (>30 Mbp). The pezizomycotine (filamentous ascomycote) Aspergillus carbonarius belongs to the Aspergillus section Nigri species complex, members of which are significant as platforms for bioenergy and bioindustrial technology, as members of soil microbial communities and players in the global carbon cycle, and as agricultural toxigens. Application of a modified version of the standard JGI Annotation Pipeline has so far predicted ~10k genes. ~12 percent of these preliminary annotations suffer a potential frameshift error, which is somewhat higher than the ~9 percent rate in the Sanger-sequenced and conventionally assembled and annotated genome of fellow Aspergillus section Nigri member A. niger. Also, >90 percent of A. niger genes have potential homologs in the A. carbonarius preliminary annotation. We conclude, and with further annotation and comparative analysis expect to confirm, that 454 sequencing strategies provide a promising substrate for annotation of modestly sized eukaryotic genomes. We will also present results of annotation of a number of other pyrosequenced fungal genomes of bioenergy interest.
Date: April 17, 2009
Creator: Kuo, Alan & Grigoriev, Igor
Genomic Prospecting for Microbial Biodiesel Production

Description: Biodiesel is defined as fatty acid mono-alkylesters and is produced from triacylglycerols. In the current article we provide an overview of the structure, diversity and regulation of the metabolic pathways leading to intracellular fatty acid and triacylglycerol accumulation in three types of organisms (bacteria, algae and fungi) of potential biotechnological interest and discuss possible intervention points to increase the cellular lipid content. The key steps that regulate carbon allocation and distribution in lipids include the formation of malonyl-CoA, the synthesis of fatty acids and their attachment onto the glycerol backbone, and the formation of triacylglycerols. The lipid biosynthetic genes and pathways are largely known for select model organisms. Comparative genomics allows the examination of these pathways in organisms of biotechnological interest and reveals the evolution of divergent and yet uncharacterized regulatory mechanisms. Utilization of microbial systems for triacylglycerol and fatty acid production is in its infancy; however, genomic information and technologies combined with synthetic biology concepts provide the opportunity to further exploit microbes for the competitive production of biodiesel.
Date: March 20, 2008
Creator: Lykidis, Athanasios & Ivanova, Natalia
Illumina Unamplified Indexed Library Construction: An Automated Approach

Description: Manual library construction is a limiting factor in Illumina sequencing. Constructing libraries by hand is costly, time-consuming, low-throughput, and ergonomically hazardous, and constructing multiple libraries introduces risk of library failure due to pipetting errors. The ability to construct multiple libraries simultaneously in automated fashion represents significant cost and time savings. Here we present a strategy to construct up to 96 unamplified indexed libraries using Illumina TruSeq reagents and a Biomek FX robotic platform. We also present data to indicate that this library construction method has little or no risk of cross-contamination between samples.
Date: March 21, 2011
Creator: Hack, Christopher A.; Sczyrba, Alexander & Cheng, Jan-Fang
The RNA-Seq Analysis pipeline on Galaxy

Description: Q: How do I know my RNA-Seq experiments worked well? A: RNA-Seq QC Pipeline. Q: How do I detect transcripts which are overexpressed or underexpressed in my samples? A: Counting and Statistic Analysis. Q: What do I do if I don't have a reference genome? A: Rnnotator de novo Assembly.
Date: May 31, 2011
Creator: Meng, Xiandong; Martin, Jeffrey & Wang, Zhong
DUK - A Fast and Efficient Kmer Based Sequence Matching Tool

Description: A new tool, DUK, has been developed to perform the matching task. Matching determines whether a query sequence partially or totally matches a set of given reference sequences. Matching is similar to alignment; indeed, many traditional analysis tasks such as contaminant removal use alignment tools. For matching, however, there is no need to know which bases of a query sequence match which positions of a reference sequence; it is only necessary to know whether a match exists. This subtle difference can make matching much faster than alignment. DUK is accurate, versatile, fast, and memory-efficient. It uses a Kmer hashing method to index reference sequences and a Poisson model to calculate p-values. DUK is carefully implemented in C++ with an object-oriented design, and the resulting classes can also be used to develop other tools quickly. DUK has been widely used at JGI for a wide range of applications such as contaminant removal, organelle genome separation, and assembly refinement. Many real applications and simulated datasets demonstrate its power.
Date: March 21, 2011
Creator: Li, Mingkun; Copeland, Alex & Han, James
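
The difference between matching and alignment described in the DUK record above can be illustrated with a minimal Python sketch. This is not DUK's code or interface: the function names (`kmers`, `build_index`, `matches`), the `min_hits` parameter, and the toy sequences are all hypothetical, and the Poisson p-value step and reverse-complement handling are omitted. It only shows the general idea of hashing reference k-mers and answering a yes/no question without recording match positions.

```python
def kmers(seq, k):
    """Yield every k-length substring of seq (uppercased)."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(references, k):
    """Hash all reference k-mers into a set.

    Positions are deliberately not stored: matching only needs to know
    whether a k-mer occurs, not where it occurs.
    """
    index = set()
    for ref in references:
        index.update(kmers(ref, k))
    return index


def matches(query, index, k, min_hits=1):
    """Return True once min_hits query k-mers have been found in the index."""
    hits = 0
    for kmer in kmers(query, k):
        if kmer in index:
            hits += 1
            if hits >= min_hits:
                return True  # stop early; no alignment is ever computed
    return False


# Toy data, invented for illustration only.
reference = ["ACGTACGTTAGCCGATAGGCTA"]
index = build_index(reference, k=8)
contaminant_like = "TTTTACGTTAGCCGTTTT"   # shares a stretch with the reference
unrelated = "GGGGGGGGCCCCCCCCAAAA"        # shares no 8-mer with the reference

print(matches(contaminant_like, index, k=8))
print(matches(unrelated, index, k=8))
```

Because the index is a plain hash set and the search stops at the first sufficient hit, the cost per query is roughly proportional to the query length. A real tool would additionally need a statistical cutoff, which the record says DUK derives from a Poisson model.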
Carboxysomal carbonic anhydrases: Structure and role in microbial CO2 fixation

Description: Cyanobacteria and some chemoautotrophic bacteria are able to grow in environments with limiting CO2 concentrations by employing a CO2-concentrating mechanism (CCM) that allows them to accumulate inorganic carbon in their cytoplasm to concentrations several orders of magnitude higher than that on the outside. The final step of this process takes place in polyhedral protein microcompartments known as carboxysomes, which contain the majority of the CO2-fixing enzyme, RubisCO. The efficiency of CO2 fixation by the sequestered RubisCO is enhanced by co-localization with a specialized carbonic anhydrase that catalyzes dehydration of the cytoplasmic bicarbonate and ensures saturation of RubisCO with its substrate, CO2. There are two genetically distinct carboxysome types that differ in their protein composition and in the carbonic anhydrase(s) they employ. Here we review the existing information concerning the genomics, structure and enzymology of these uniquely adapted carbonic anhydrases, which are of fundamental importance in the global carbon cycle.
Date: June 23, 2010
Creator: Cannon, Gordon C.; Heinhorst, Sabine & Kerfeld, Cheryl A.
Ready, set, go . . . well maybe

Description: The agenda for this presentation is: (1) understand organizational readiness for changes; (2) review benefits and challenges of change; (3) share case studies of ergonomic programs that were 'not ready' and some that were 'ready'; and (4) provide some ideas for facilitating change.
Date: February 28, 2011
Creator: Alexandre, Melanie M. & Bartolome, Terri-Lynn C.
Evolutionary Genomics of Life in (and from) the Sea

Description: High-throughput genome sequencing centers that were originally built for the Human Genome Project (Lander et al., 2001; Venter et al., 2001) have now become an engine for comparative genomics. The six largest centers alone are now producing over 150 billion nucleotides per year, more than 50 times the amount of DNA in the human genome, and nearly all of this is directed at projects that promise great insights into the pattern and processes of evolution. Unfortunately, this data is being produced at a pace far exceeding the capacity of the scientific community to provide insightful analysis, and few scientists with training and experience in evolutionary biology have played prominent roles to date. One of the consequences is that poor quality analyses are typical; for example, orthology among genes is generally determined by simple measures of sequence similarity, even though this approach was discredited by molecular evolutionary biologists decades ago. Here we discuss how genomes are chosen for sequencing and how the scientific community can have input. We describe the PhIGs database and web tools (Dehal and Boore 2005a; http://PhIGs.org), which provide phylogenetic analysis of all gene families for all completely sequenced genomes and the associated 'Synteny Viewer', which allows comparisons of the relative positions of orthologous genes. This is the best tool available for inferring gene function across multiple genomes. We also describe how we have used the PhIGs methods with the whole genome sequences of a tunicate, fish, mouse, and human to conclusively demonstrate that two rounds of whole genome duplication occurred at the base of vertebrates (Dehal and Boore 2005b). This evidence is found in the large scale structure of the positions of paralogous genes that arose from duplications inferred by evolutionary analysis to have occurred at the base of vertebrates.
Date: January 9, 2006
Creator: Boore, Jeffrey L.; Dehal, Paramvir & Fuerstenberg, Susan I.
Wrinkles in the rare biosphere: Pyrosequencing errors can lead to artificial inflation of diversity estimates

Description: Massively parallel pyrosequencing of the small subunit (16S) ribosomal RNA gene has revealed that the extent of rare microbial populations in several environments, the 'rare biosphere', is orders of magnitude higher than previously thought. One important caveat with this method is that sequencing error could artificially inflate diversity estimates. Although the per-base error of 16S rDNA amplicon pyrosequencing has been shown to be as good as or lower than Sanger sequencing, no direct assessments of pyrosequencing errors on diversity estimates have been reported. Using only Escherichia coli MG1655 as a reference template, we find that 16S rDNA diversity is grossly overestimated unless relatively stringent read quality filtering and low clustering thresholds are applied. In particular, the common practice of removing reads with unresolved bases and anomalous read lengths is insufficient to ensure accurate estimates of microbial diversity. Furthermore, common and reproducible homopolymer length errors can result in relatively abundant spurious phylotypes further confounding data interpretation. We suggest that stringent quality-based trimming of 16S pyrotags and clustering thresholds no greater than 97% identity should be used to avoid overestimates of the rare biosphere.
Date: August 1, 2009
Creator: Kunin, Victor; Engelbrektson, Anna; Ochman, Howard & Hugenholtz, Philip
Microbial co-habitation and lateral gene transfer: what transposases can tell us

Description: Determining the habitat range for various microbes is not a simple, straightforward matter, as habitats interlace, microbes move between habitats, and microbial communities change over time. In this study, we explore an approach using the history of lateral gene transfer recorded in microbial genomes to begin to answer two key questions: where have you been and who have you been with? All currently sequenced microbial genomes were surveyed to identify pairs of taxa that share a transposase that is likely to have been acquired through lateral gene transfer. A microbial interaction network including almost 800 organisms was then derived from these connections. Although the majority of the connections are between closely related organisms with the same or overlapping habitat assignments, numerous examples were found of cross-habitat and cross-phylum connections. We present a large-scale study of the distributions of transposases across phylogeny and habitat, and find a significant correlation between habitat and transposase connections. We observed cases where phylogenetic boundaries are traversed, especially when organisms share habitats; this suggests that the potential exists for genetic material to move laterally between diverse groups via bridging connections. The results presented here also suggest that the complex dynamics of microbial ecology may be traceable in the microbial genomes.
Date: March 1, 2009
Creator: Hooper, Sean D.; Mavromatis, Konstantinos & Kyrpides, Nikos C.
