Interpolative multidimensional scaling techniques for the identification of clusters in very large sequence sets
Hughes et al. BMC Bioinformatics 2012, 13(Suppl 2):S9
enables us to target large, Linux-based compute clusters. This scaled-up pipeline is shown in Figure 3.
Results and discussion
Full calculation on entire data set
Figure 4 shows the results of running full Needleman-
Wunsch (NW) and Multidimensional Scaling (MDS)
calculations on a set of 100,000 raw 16S rRNA sequence
reads. The results of this calculation fit well with the
expected groupings for this genome [7,8]. The initial
clustering calculation colors the predicted sequences in
a given grouping, while the MDS calculation produces
Cartesian coordinates for each sequence. As Figure 4
shows, the spatial and colored results correspond to the
same sequences, indicating that the combination of NW
and MDS produces reasonable sequence clusters.
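
As a rough illustration of the NW step described above, the following minimal sketch computes a global alignment score by dynamic programming and turns it into a pairwise dissimilarity matrix of the kind an MDS routine takes as input. The function names (nw_score, nw_distance, distance_matrix), the scoring values (match = 1, mismatch = -1, gap = -1), and the normalization of the score into a distance are all assumptions made for this example; the scoring scheme and distance definition actually used in the pipeline are not given on this page.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score (scoring values are assumed)."""
    # prev holds the previous row of the dynamic-programming table.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def nw_distance(a, b):
    """Illustrative dissimilarity: 0 for identical reads, larger when the
    alignment score is lower. This normalization is an assumption."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return 1.0 - nw_score(a, b) / longest


def distance_matrix(reads):
    """Full pairwise matrix; this is the O(N^2) step of the basic pipeline."""
    n = len(reads)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = nw_distance(reads[i], reads[j])
    return matrix


reads = ["ACGT", "AGT", "ACGA"]  # toy reads, not data from the paper
for row in distance_matrix(reads):
    print(row)
```

The matrix is symmetric with a zero diagonal, so only the upper triangle is computed. Real 16S rRNA reads are hundreds of bases long, which is why this all-pairs step dominates the cost of the full calculation.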
Interpolation: 50000 in-sample sequences, 50000 out-of-sample sequences
Figure 5 shows the results of running interpolative MDS
and NW on the same 100,000 sequences, with 50,000 in-
sample and 50,000 out-of-sample data points. The basic
structure observed in this case is similar to that seen in
the full calculation discussed above. Some slight differ-
ences within individual clusters are noted, but the major
sequence groupings are intact.
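
To make the idea of out-of-sample placement concrete, here is a minimal sketch that positions one new point from its distances to already-mapped in-sample points. It uses a simple iterative majorization update that lowers the squared mismatch between target and mapped distances while the in-sample coordinates stay fixed. The function names, the optional restriction to the k nearest in-sample points, the iteration count, and the toy coordinates are assumptions for illustration; this is not presented as the exact interpolation algorithm used by the authors.

```python
import math


def interpolate_point(in_coords, dists, k=None, iters=200):
    """Place one out-of-sample point given target distances to in-sample points.

    in_coords: coordinates of in-sample points (fixed during interpolation).
    dists:     target distance from the new point to each in-sample point.
    k:         use only the k nearest in-sample points (all if None).
    """
    order = sorted(range(len(dists)), key=lambda i: dists[i])[: k or len(dists)]
    anchors = [in_coords[i] for i in order]
    targets = [dists[i] for i in order]
    n, dim = len(anchors), len(anchors[0])
    centroid = [sum(p[c] for p in anchors) / n for c in range(dim)]
    x = centroid[:]  # start from the centroid of the chosen anchors
    for _ in range(iters):
        step = [0.0] * dim
        for p, d in zip(anchors, targets):
            diff = [x[c] - p[c] for c in range(dim)]
            norm = math.sqrt(sum(v * v for v in diff))
            if norm > 0.0:  # skip an anchor the estimate sits exactly on
                for c in range(dim):
                    step[c] += d * diff[c] / norm
        x = [centroid[c] + step[c] / n for c in range(dim)]
    return x


def stress(x, in_coords, dists):
    """Sum of squared differences between mapped and target distances."""
    return sum((math.dist(x, p) - d) ** 2 for p, d in zip(in_coords, dists))


# Toy check: recover a known 2D position from exact distances.
in_sample = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0), (4.0, 3.0)]
true_point = (1.0, 1.0)
targets = [math.dist(true_point, p) for p in in_sample]
estimate = interpolate_point(in_sample, targets)
print(estimate, stress(estimate, in_sample, targets))
```

Because each out-of-sample point is placed independently of the others, this step needs only distances to in-sample points, never distances between two out-of-sample points.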
Interpolation: 10000 in-sample sequences, 90000 out-of-sample sequences
Figure 6 shows the results of running interpolative MDS
and NW on the same 100,000 sequences, with 10,000
in-sample and 90,000 out-of-sample data points. Once
again, the same basic clustering structure is observed,
Figure 3 Scaled-up computational pipeline for sequence clustering. As with the basic pipeline, the scaled-up workflow begins with a raw
sequence file. Before calculating genetic distances, the file is divided into in-sample and out-of-sample sets for use in Interpolative MDS. Full
MDS and NW distance calculations on the in-sample data yield trained distances, which are used to interpolate the remaining distances. The
interpolation step includes on-the-fly pairwise NW distance calculation. The overall complexity of the pipeline is reduced from O(N^2) for the basic
pipeline to O(M^2 + (N-M)*M) for the pipeline with interpolation, where N is the size of the original sequence set and M is the size of the
in-sample data. To enhance computational job management and resource availability, all computational portions of the depicted pipeline were
implemented using the Twister Iterative Map Reduce runtime.
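
The complexity terms in the Figure 3 caption can be evaluated for the two splits reported on this page (N = 100,000 with M = 50,000 and M = 10,000). The short sketch below does only that arithmetic; the function names are hypothetical, constant factors are ignored, and the numbers are growth terms rather than measured runtimes.

```python
def full_cost(n):
    """Growth term for the basic pipeline: N^2."""
    return n * n


def interpolated_cost(n, m):
    """Growth term with interpolation: M^2 + (N - M) * M."""
    return m * m + (n - m) * m


n = 100_000
print("full:", full_cost(n))
for m in (50_000, 10_000):
    cost = interpolated_cost(n, m)
    # M^2 + (N - M) * M simplifies algebraically to N * M.
    assert cost == n * m
    print(f"M={m}: {cost} ({full_cost(n) // cost}x fewer than full)")
```

Since M^2 + (N-M)*M simplifies to N*M, the term shrinks in direct proportion to the in-sample fraction M/N: by a factor of 2 for the 50,000/50,000 split and by a factor of 10 for the 10,000/90,000 split.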
Hughes, Adam; Ruan, Yang; Ekanayake, Saliya; Bae, Seung-Hee; Dong, Qunfeng; Rho, Mina et al. Interpolative multidimensional scaling techniques for the identification of clusters in very large sequence sets, article, March 13, 2012; [London, United Kingdom]. (digital.library.unt.edu/ark:/67531/metadc78283/m1/3/: accessed March 18, 2018), University of North Texas Libraries, Digital Library, digital.library.unt.edu; crediting UNT College of Arts and Sciences.