Chunking of Large Multidimensional Arrays Page: 4 of 19
This report is part of the collection entitled: Office of Scientific & Technical Information Technical Reports and was provided to UNT Digital Library by the UNT Libraries Government Documents Department.
Extracted Text
The following text was automatically extracted from the image on this page using optical character recognition software:
An important metric of the storage efficiency is the expected number of chunks retrieved by
queries under the access workload. The problem of optimal chunking was first introduced by
Sarawagi and Stonebraker [14], who gave an approximate solution to this problem. We show
that the optimal shape derivation given by Sarawagi and Stonebraker is only approximate and
under certain circumstances can deviate significantly from the true answer. We propose two
different models of the problem and show how the chunking parameters should be determined
based on the probabilistic access patterns of sub-array queries. The main contributions of this
paper are:
" The development of two accurate mathematical models of the chunking problem;
" Derivation of exact solutions, one using steepest descent and another using geometrical
programming method;
" Experimental comparison of the estimation errors induced by the models using synthetic
workloads on real life datasets.
In the rest of the paper, Section 2 presents some related work on array chunking where we also
discuss how chunks are organized and accessed for both dense and sparse multi-dimensional
arrays. Section 3 presents the two mathematical models for defining an optimal chunking shape.
The derivations of the optimal chunk shapes and sizes, under both models given some probabilis-
tic access patterns of sub-array queries are presented in Section 4. In section 5, we present the
results of our experimental comparisons for a synthetic workload. We conclude with Section 6,
giving some direction for future work.
2 Related Work
In nearly all applications that use disk resident large scale multi-dimensional arrays, the physical
organization of the array is by chunking. The global array is tessellated into sub-arrays or tiles
of size C and shape (c1, c2, ..., cA). Rather than mapping the elements of the array directly
onto consecutive linear storage, the chunks are mapped onto storage and, within each chunk,
the array elements are laid out using a conventional row-major or column-major ordering.
The rational for chunking large arrays, whether dense or sparse, is justified in general when
efficient I/O performance is desired in applications that access data with a high degree of local-
ity [17]. In [17], Vitter elaborated on the fact that an algorithm that does not exploit locality
can be reasonably efficient when the data sets fit entirely in internal memory, but performs
miserably when deployed naively on an External Memory (EM), setting and virtual memory is
used to handle page management. The linear mapping function for allocating chunks onto disk
storage can be done by the row-major or column-major ordering, any one of the mapping func-
tions for space filling curves [8] or done with the use of B+ -tree indexing as in HDF5 [7]. We
discuss some methods for chunk addressing in subsection 2.1. The problem of chunk addressing
is orthogonal to optimizing the chunk shape that requires taking into account the information
on sub-array access patterns. This is the problem first raised by Sarawagi and Stonebraker [14].
In [15], the problem of the storage of multidimensional arrays on disks for subsequent efficient
access in computational fluid dynamic applications that run on parallel processors is addressed.
The approach taken is to distribute all the array chunks among processors during program
executions. A similar approach is described in the design and implementation of disk resident
array storage (DRA) [11], used in Computational Chemistry applications. DRA extends the use
of a memory resident distributed array library, called Global Array (GA), to external memory3
Upcoming Pages
Here’s what’s next.
Search Inside
This report can be searched. Note: Results may vary based on the legibility of text within the document.
Tools / Downloads
Get a copy of this page or view the extracted text.
Citing and Sharing
Basic information for referencing this web page. We also provide extended guidance on usage rights, references, copying or embedding.
Reference the current page of this Report.
Rotem, Doron; Otoo, Ekow J. & Seshadri, Sridhar. Chunking of Large Multidimensional Arrays, report, February 28, 2007; Berkeley, California. (https://digital.library.unt.edu/ark:/67531/metadc896932/m1/4/: accessed April 24, 2024), University of North Texas Libraries, UNT Digital Library, https://digital.library.unt.edu; crediting UNT Libraries Government Documents Department.