Interpolative multidimensional scaling techniques for the identification of clusters in very large sequence sets
Hughes et al. BMC Bioinformatics 2012, 13(Suppl 2):S9
Adam Hughes1, Yang Ruan1,2, Saliya Ekanayake1,2, Seung-Hee Bae1,2, Qunfeng Dong3, Mina Rho2, Judy Qiu1,2
From Great Lakes Bioinformatics Conference 2011
Athens, OH, USA. 2-4 May 2011
Background: Modern pyrosequencing techniques make it possible to study complex bacterial populations, such as
16S rRNA, directly from environmental or clinical samples without the need for laboratory purification. Alignment of
sequences across the resultant large data sets (100,000+ sequences) is of particular interest for the purpose of
identifying potential gene clusters and families, but such analysis represents a daunting computational task. The
aim of this work is the development of an efficient pipeline for the clustering of large sequence read sets.
Methods: Pairwise alignment techniques are used here to calculate genetic distances between sequence pairs.
These methods are pleasingly parallel and have been shown to reflect true genetic distances more accurately
in highly variable regions of rRNA genes than do traditional multiple sequence alignment (MSA) approaches. By
utilizing Needleman-Wunsch (NW) pairwise alignment in conjunction with novel implementations of interpolative
multidimensional scaling (MDS), we have developed an effective method for visualizing massive biosequence data
sets and quickly identifying potential gene clusters.
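To make the distance calculation concrete, the following is a minimal Python sketch of Needleman-Wunsch global alignment used to fill a pairwise distance matrix. It is an illustration under stated assumptions, not the pipeline's actual implementation: the scoring values (match +1, mismatch -1, linear gap -1), the choice of 1 minus percent identity as the distance, and the function names nw_align and distance_matrix are all hypothetical and chosen here for the example. Each sequence pair is processed independently of every other pair, which is the property that makes the computation pleasingly parallel.

```python
from itertools import combinations


def nw_align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    Returns (optimal score, identical columns, alignment length).
    The scoring values are illustrative assumptions.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back one optimal path, counting identical columns and total length.
    i, j = n, m
    identical = length = 0
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            if a[i - 1] == b[j - 1]:
                identical += 1
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1  # gap in b
        else:
            j -= 1  # gap in a
        length += 1
    return score[n][m], identical, length


def distance_matrix(seqs):
    """Symmetric matrix of assumed distances: 1 - (identical / alignment length)."""
    n = len(seqs)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):  # each pair is independent work
        _, identical, length = nw_align(seqs[i], seqs[j])
        d = 1.0 - identical / length if length else 0.0
        dist[i][j] = dist[j][i] = d
    return dist
```

Under these assumed scores, aligning ACGT with AGT introduces a single gap, giving three identical columns out of four and a distance of 0.25.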
Results: This study demonstrates the use of interpolative MDS to obtain clustering results that are qualitatively
similar to those obtained through full MDS, but with substantial cost savings. In particular, the wall clock time
required to cluster a set of 100,000 sequences has been reduced from seven hours to less than one hour through
the use of interpolative MDS.
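The source of the savings can be pictured with a small sketch. It assumes that full MDS has already produced coordinates for an in-sample (training) subset, and that each remaining sequence is then placed on its own by minimizing the squared difference between its distances to a few fixed in-sample points and the corresponding target dissimilarities. The update used below is the standard majorization step (Guttman transform) for a single free point; it is one common formulation and is not claimed to be the implementation used in this work. The name interpolate_point, the centroid starting position, and the fixed iteration count are assumptions for illustration, and the selection of nearest in-sample neighbors is omitted.

```python
import math


def interpolate_point(anchors, deltas, iterations=100):
    """Place one out-of-sample point relative to fixed in-sample points.

    anchors: coordinates of already-mapped in-sample points (held fixed).
    deltas:  target dissimilarities from the new point to each anchor.
    Minimizes sum_i (||x - p_i|| - delta_i)^2 by iterative majorization.
    """
    k = len(anchors)
    dim = len(anchors[0])
    # Assumed starting position: the centroid of the anchors.
    x = [sum(p[c] for p in anchors) / k for c in range(dim)]
    for _ in range(iterations):
        new_x = [0.0] * dim
        for p, delta in zip(anchors, deltas):
            dist = math.sqrt(sum((x[c] - p[c]) ** 2 for c in range(dim)))
            # If x sits exactly on an anchor the direction is undefined; skip it.
            ratio = delta / dist if dist > 0 else 0.0
            for c in range(dim):
                new_x[c] += (p[c] + ratio * (x[c] - p[c])) / k
        x = new_x
    return x


# Three fixed in-sample points and the target distances of a fourth point
# whose true position is (1, 1).
anchors = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
deltas = [math.sqrt(2.0), 1.0, 1.0]
placed = interpolate_point(anchors, deltas)
```

Because each out-of-sample point depends only on the fixed in-sample coordinates, the points can be placed independently of one another, and the work grows linearly with the number of out-of-sample sequences rather than quadratically with the full set.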
Conclusions: Although work remains to be done in selecting the optimal training set size for interpolative MDS,
substantial computational cost savings will allow us to cluster much larger sequence sets in the future.
The continued advancement of pyrosequencing techniques
has made it possible for scientists to study complex
bacterial populations, such as 16S rRNA, directly from
environmental or clinical samples without the need for
involved and time-consuming laboratory purification.
As a result, there has been a rapid accumulation of raw
sequence reads awaiting analysis in recent years, placing
an extreme burden on existing software systems.
* Correspondence: email@example.com
Pervasive Technology Institute, Indiana University, Bloomington, IN 47408, USA
Full list of author information is available at the end of the article
Alignment of sequences across these large data sets
(100,000+ sequences) is of particular interest for the
purposes of sequence classification and identification of
potential gene clusters and families, but such analysis
cannot be completed manually and represents a daunting
computational task. The aim of this work is the
development of an efficient and effective pipeline for
clustering large quantities of raw biosequence reads.
One technique often used in sequence clustering is multiple
sequence alignment (MSA), which employs heuristic
methods in an attempt to determine optimal alignments
© 2012 Hughes et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in
any medium, provided the original work is properly cited.
Hughes, Adam; Ruan, Yang; Ekanayake, Saliya; Bae, Seung-Hee; Dong, Qunfeng; Rho, Mina et al. Interpolative multidimensional scaling techniques for the identification of clusters in very large sequence sets, article, March 13, 2012; [London, United Kingdom]. (digital.library.unt.edu/ark:/67531/metadc78283/m1/1/: accessed August 16, 2018), University of North Texas Libraries, Digital Library, digital.library.unt.edu; crediting UNT College of Arts and Sciences.