Interpolative multidimensional scaling techniques for the identification of clusters in very large sequence sets Page: 2
The following text was automatically extracted from the image on this page using optical character recognition software:
Hughes et al. BMC Bioinformatics 2012, 13(Suppl 2):S9
across an entire sample. However, global pairwise
sequence alignment algorithms have previously been
reported to better identify microbial richness in genomes
with hypervariable regions, like 16S rRNA, than do MSA
techniques, while also offering superior computational
scaling . For these studies, genetic distances produced
by the Needleman-Wunsch pairwise aligner algorithm 
were converted to Cartesian coordinates through Multi-
dimensional Scaling (MDS) for the purpose of clustering
and visualization .
This basic clustering pipeline is shown in Figure 1,
and it has been used to good effect for sample sizes of
100,000 or fewer sequences with less than 200 bases in
each. However, the computational complexity of both
the distance calculation and multidimensional scaling is
O(N2), where N is the number of sequences, rendering
the overall process untenable as the sample size grows
To overcome these performance bottlenecks, we have
employed an interpolative MDS algorithm , wherein a
small, in-sample subset of sequences is subjected to full
NW and MDS calculations and then the results are
used to interpolate Cartesian coordinates for the
remaining, out-of-sample sequences from the larger data
set. This reduces the computational complexity to O
(M2) + O(M*(N-M)) for both distance and scaling
operations, where N is the number of sequences in the
initial data set, M is the number of in-sample sequences,
and N-M is the number of out-of-sample sequences.
The basic interpolative MDS scheme is illustrated in
To further enhance computational throughput and ease
job management, we implemented the updated pipeline
utilizing the Twister Iterative MapReduce runtime  to
take advantage of the map-reduce pattern inherent in
these calculations. Twister, developed in our lab, also
512 2861164 67
788 62 40 343
34 608 486 389
887661 539 22
Figure 1 Basic computational pipeline for sequence clustering. Sequence clustering begins with a sampling of raw sequence reads, stripped
of duplicates. Pairwise sequence alignments and genetic distances are calculated over the entire sample. For this study, the Needleman-Wunsch
global alignment algorithm was employed. Next, the calculated distances are passed to multidimensional scaling and pairwise clustering
algorithms, producing Cartesian coordinates and clustering information which can be used to visualize the sequence space. Both the distance
calculation and multidimensional scaling are order O(N2), where N is the number of sequences, making the pipeline computationally expensive
as the sample grows very large.
Page 2 of 6
Figure 2 Interpolative Multidimensional Scaling (MDS). Interpolative MDS begins with a raw sequence file, which is then divided into in-
sample and out-of-sample sets. The in-sample data is then subjected to full NW distance and MDS calculations, resulting in a subset of genetic
distances. This trained data is then used to interpolate the distances for the remaining, out-of-sample sequences. The computational complexity
of the interpolation step is O((N-M)*M), where N is the size of the original sequence set and M is the size of the in-sample data.
Here’s what’s next.
This article can be searched. Note: Results may vary based on the legibility of text within the document.
Tools / Downloads
Get a copy of this page or view the extracted text.
Citing and Sharing
Basic information for referencing this web page. We also provide extended guidance on usage rights, references, copying or embedding.
Reference the current page of this Article.
Hughes, Adam; Ruan, Yang; Ekanayake, Saliya; Bae, Seung-Hee; Dong, Qunfeng; Rho, Mina et al. Interpolative multidimensional scaling techniques for the identification of clusters in very large sequence sets, article, March 13, 2012; [London, United Kingdom]. (digital.library.unt.edu/ark:/67531/metadc78283/m1/2/: accessed November 18, 2018), University of North Texas Libraries, Digital Library, digital.library.unt.edu; crediting UNT College of Arts and Sciences.