The Scientific Data Management Center

Description: With the increasing volume and complexity of data produced by ultra-scale simulations and high-throughput experiments, understanding the science is largely hampered by the lack of comprehensive, end-to-end data management solutions ranging from initial data acquisition to final analysis and visualization. The Scientific Data Management (SDM) Center is bringing a set of advanced data management technologies to DOE scientists in various application domains including astrophysics, climate, fusion, and biology. Equally important, it has established collaborations with these scientists to better understand their science as well as their forthcoming data management and data analytics challenges. The SDM center has provided advanced data management technologies to DOE domain scientists in the areas of storage efficient access, data mining and analysis, and scientific process automation.
Date: June 30, 2006
Creator: Shoshani, Arie
Partner: UNT Libraries Government Documents Department

Breaking the Curse of Cardinality on Bitmap Indexes

Description: Bitmap indexes are known to be efficient for the ad-hoc range queries common in data warehousing and scientific applications. However, they suffer from the curse of cardinality: their efficiency deteriorates as attribute cardinalities increase. A number of strategies have been proposed, but none of them addresses the problem adequately. In this paper, we propose a novel binned bitmap index that greatly reduces the cost of answering queries and therefore breaks the curse of cardinality. The key idea is to augment the binned index with an Order-preserving Bin-based Clustering (OrBiC) structure. This data structure significantly reduces the I/O operations needed to resolve records that cannot be resolved with the bitmaps. To further improve the proposed index structure, we also present a strategy to create single-valued bins for frequent values. This strategy reduces index sizes and improves query processing speed. Overall, the binned indexes with OrBiC greatly improve query processing speed and are 3-25 times faster than the best available indexes for high-cardinality data. (A small sketch of the binning-plus-clustering idea follows this entry.)
Date: April 4, 2008
Creator: Wu, Kesheng; Stockinger, Kurt & Shoshani, Arie
Partner: UNT Libraries Government Documents Department
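
The following minimal Python sketch illustrates the binning-plus-candidate-check idea from the entry above. It is an assumption-laden toy: the class and method names are invented, real binned indexes store compressed bitmaps per bin (omitted here), and the only OrBiC-like feature shown is that each bin's raw values are kept contiguous so a candidate check scans a single run.

    import bisect

    class BinnedIndex:
        """Toy binned index; names are hypothetical, not the paper's API."""
        def __init__(self, values, bin_edges):
            self.bin_edges = bin_edges          # sorted bin upper bounds
            # Order-preserving clustering: (value, record id) pairs grouped
            # by bin, so a candidate check reads one contiguous run.
            self.clustered = [[] for _ in bin_edges]
            for rid, v in enumerate(values):
                self.clustered[bisect.bisect_left(bin_edges, v)].append((v, rid))

        def query_less_than(self, x):
            """Record ids with value < x."""
            b = bisect.bisect_left(self.bin_edges, x)
            # Bins entirely below x are answered without touching raw data.
            hits = [rid for bin_ in self.clustered[:b] for _, rid in bin_]
            # Only the single edge bin needs a candidate check.
            hits.extend(rid for v, rid in self.clustered[b] if v < x)
            return hits

    idx = BinnedIndex([0.1, 3.7, 2.2, 9.5, 4.8], bin_edges=[1.0, 5.0, 10.0])
    print(idx.query_less_than(4.0))   # [0, 1, 2]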

A New Approach in Advance Network Reservation and Provisioning for High-Performance Scientific Data Transfers

Description: Scientific applications already generate many terabytes and even petabytes of data from supercomputer runs and large-scale experiments. The need for transferring data chunks of ever-increasing sizes through the network shows no sign of abating, hence the need for high-bandwidth, high-speed networks such as ESnet (Energy Sciences Network). Network reservation systems such as ESnet's OSCARS (On-demand Secure Circuits and Advance Reservation System) establish guaranteed bandwidth on secure virtual circuits at a certain time, for a certain bandwidth and length of time. OSCARS checks network availability and capacity for the specified period of time, and allocates the requested bandwidth for that user if it is available. If the requested reservation cannot be granted, no alternative suggestion is returned to the user, so the user has no way to make an optimal choice. We report a new algorithm in which the user specifies the total volume that needs to be transferred, a maximum bandwidth that he/she can use, and a desired time period within which the transfer should be done. The algorithm can find alternate allocation possibilities, including the earliest time for completion or the shortest transfer duration, leaving the choice to the user. We present a novel approach for path finding in time-dependent networks, and a new polynomial algorithm to find possible reservation options according to the given constraints. We have implemented our algorithm for testing and incorporation into a future version of ESnet's OSCARS. Our approach provides a basis for provisioning end-to-end high-performance data transfers over storage and network resources. (A single-link sketch of the earliest-completion search follows this entry.)
Date: January 28, 2010
Creator: Balman, Mehmet; Chaniotakis, Evangelos; Shoshani, Arie & Sim, Alex
Partner: UNT Libraries Government Documents Department
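
The entry above leaves the network-wide algorithm to the paper, but the earliest-completion idea can be shown on a single link. The sketch below is an assumption: one link of fixed capacity, existing reservations given as (start, end, bandwidth) triples, and a user request of (volume, maximum bandwidth, time window).

    def earliest_completion(volume, max_bw, window, capacity, reservations):
        """Earliest time the transfer can finish on one link, or None.
        reservations: list of (start, end, bandwidth) already committed."""
        t0, t1 = window
        # Breakpoints where the available bandwidth can change.
        points = sorted({t0, t1, *[t for r in reservations for t in r[:2]
                                   if t0 < t < t1]})
        moved = 0.0
        for a, b in zip(points, points[1:]):
            used = sum(bw for s, e, bw in reservations if s <= a and e >= b)
            avail = min(max_bw, capacity - used)
            if avail <= 0:
                continue
            if moved + avail * (b - a) >= volume:
                return a + (volume - moved) / avail   # finishes inside [a, b)
            moved += avail * (b - a)
        return None  # not feasible within the window

    # 10 GB at up to 1 GB/s on a 2 GB/s link with one 1.5 GB/s reservation:
    print(earliest_completion(10, 1.0, (0, 60), 2.0, [(0, 10, 1.5)]))  # 15.0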

Scientific Data Services -- A High-Performance I/O System with Array Semantics

Description: As high-performance computing approaches exascale, the existing I/O system design is having trouble keeping pace in both performance and scalability. We propose to address this challenge by adopting database principles and techniques in parallel I/O systems. First, we propose to adopt an array data model because many scientific applications represent their data in arrays. This strategy follows a cardinal principle from database research, which separates the logical view from the physical layout of data. This high-level data model gives the underlying implementation more freedom to optimize the physical layout and to choose the most effective way of accessing the data. For example, knowing that a set of write operations is working on a single multi-dimensional array makes it possible to keep the subarrays in a log structure during the write operations and reassemble them later into another physical layout as resources permit. While maintaining the high-level view, the storage system could compress the user data to reduce the physical storage requirement, collocate data records that are frequently used together, or replicate data to increase availability and fault-tolerance. Additionally, the system could generate secondary data structures such as database indexes and summary statistics. We expect the proposed Scientific Data Services approach to create a “live” storage system that dynamically adjusts to user demands and evolves with the massively parallel storage hardware.
Date: September 21, 2011
Creator: Wu, Kesheng; Byna, Surendra; Rotem, Doron & Shoshani, Arie
Partner: UNT Libraries Government Documents Department

Improving Estimation Accuracy of Aggregate Queries on Data Cubes

Description: In this paper, we investigate the problem of estimating a target database from summary databases derived from a base data cube. We show that such estimates can be derived by choosing a primary database and using a proxy database to estimate the results. This technique is common in statistics, but an important issue we address is the accuracy of these estimates. Specifically, given multiple primary and multiple proxy databases that share the same summary measure, the problem is how to select the primary and proxy databases that will generate the most accurate estimation of the target database. We propose an algorithmic approach, based on the principles of information entropy, for determining the steps to select or compute the source databases from multiple summary databases. We show that the source databases with the largest number of cells in common provide the most accurate estimates, and we prove that this is consistent with maximizing the entropy. We provide experimental results on the accuracy of the target database estimation in order to verify our results. (A sketch of the underlying proportional estimator follows this entry.)
Date: August 15, 2008
Creator: Pourabbas, Elaheh & Shoshani, Arie
Partner: UNT Libraries Government Documents Department
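
As a concrete illustration of estimating a target from a primary and a proxy summary, the sketch below uses the textbook proportional estimator: the primary measure over (A, C) is spread across B according to the proxy's distribution over (C, B). This is our reading of the setup, not the paper's code; the paper's actual contribution, choosing among candidate sources by the number of shared cells (shown consistent with maximizing entropy), sits on top of such an estimator.

    from collections import defaultdict

    def estimate_target(primary, proxy):
        """primary: {(a, c): measure}, proxy: {(c, b): measure};
        returns the estimated target {(a, b): measure}."""
        proxy_totals = defaultdict(float)
        for (c, b), v in proxy.items():
            proxy_totals[c] += v
        target = defaultdict(float)
        for (a, c), v in primary.items():
            for (c2, b), w in proxy.items():
                if c2 == c and proxy_totals[c] > 0:
                    # spread v over B in proportion to the proxy's shares
                    target[(a, b)] += v * w / proxy_totals[c]
        return dict(target)

    primary = {("north", "food"): 100, ("north", "fuel"): 50}
    proxy = {("food", "2007"): 30, ("food", "2008"): 70, ("fuel", "2008"): 50}
    print(estimate_target(primary, proxy))
    # {('north', '2007'): 30.0, ('north', '2008'): 120.0}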

A Flexible Reservation Algorithm for Advance Network Provisioning

Description: Many scientific applications need support from a communication infrastructure that provides predictable performance, which requires effective algorithms for bandwidth reservations. Network reservation systems such as ESnet's OSCARS establish guaranteed bandwidth on secure virtual circuits for a certain bandwidth and length of time. However, users currently cannot inquire about bandwidth availability, nor do they receive alternative suggestions when reservation requests fail. In general, the number of reservation options is exponential in the number of nodes n and the current reservation commitments. We present a novel approach for path finding in time-dependent networks that takes advantage of user-provided parameters of total volume and time constraints, and produces options for earliest completion and shortest duration. The theoretical complexity is only O(n^2 r^2) in the worst case, where r is the number of reservations in the desired time interval. We have implemented our algorithm and developed efficient methodologies for incorporation into network reservation frameworks. Performance measurements confirm the theoretical predictions.
Date: April 12, 2010
Creator: Balman, Mehmet; Chaniotakis, Evangelos; Shoshani, Arie & Sim, Alex
Partner: UNT Libraries Government Documents Department

Bulk Data Movement for Climate Dataset: Efficient Data Transfer Management with Dynamic Transfer Adjustment

Description: Many scientific applications and experiments, such as high energy and nuclear physics, astrophysics, climate observation and modeling, combustion, nano-scale material sciences, and computational biology, generate extreme volumes of data spread over large numbers of files. These data sources are distributed among national and international data repositories and are shared by large numbers of geographically distributed scientists. A large portion of the data is frequently accessed, and large volumes are moved from one place to another for analysis and storage. One challenging issue in such efforts is moving large datasets over networks of limited capacity. The Bulk Data Mover (BDM), a data transfer management tool in the Earth System Grid (ESG) community, has been managing massive dataset transfers efficiently with pre-configured transfer properties in environments where network bandwidth is limited. We studied dynamic transfer adjustment to enhance the BDM to handle significant end-to-end performance changes in dynamic network environments, as well as to steer data transfers toward a desired transfer performance. We describe the results of BDM transfer management for climate datasets, along with the transfer estimation model and results from the dynamic transfer adjustment. (A simple adjustment-loop sketch follows this entry.)
Date: July 16, 2010
Creator: Sim, Alexander; Balman, Mehmet; Williams, Dean; Shoshani, Arie & Natarajan, Vijaya
Partner: UNT Libraries Government Documents Department
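
A minimal sketch of what dynamic transfer adjustment can look like in practice: a feedback loop that adds or sheds concurrent transfer streams based on measured throughput. The 5 percent margin, the stream bounds, and the controller itself are placeholders, not the BDM's actual estimation model.

    def adjust_concurrency(current, prev_throughput, new_throughput,
                           lo=1, hi=32, margin=1.05):
        if new_throughput > prev_throughput * margin:
            return min(current + 1, hi)       # still improving: add a stream
        if new_throughput < prev_throughput / margin:
            return max(current - 1, lo)       # degrading: shed a stream
        return current                        # plateau: hold steady

    streams, prev = 4, 0.0
    for measured in [400, 520, 610, 600, 480]:   # MB/s per measurement window
        streams = adjust_concurrency(streams, prev, measured)
        prev = measured
    print(streams)   # settles at 6 after backing off on the last sample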

Advance Network Reservation and Provisioning for Science

Description: We are witnessing a new era that offers new opportunities to conduct scientific research with the help of recent advancements in computational and storage technologies. Computationally intensive science spans multiple scientific domains, such as particle physics, climate modeling, and bio-informatics simulations. These large-scale applications require collaborators to access very large data sets resulting from simulations performed in geographically distributed institutions. Furthermore, scientific experimental facilities often generate massive data sets that need to be transferred to validate the simulation data at remote collaborating sites. A major component supporting these needs is the communication infrastructure, which enables high performance visualization and large volume data analysis, and also provides access to computational resources. In order to provide high-speed on-demand data access between collaborating institutions, national governments support next generation research networks such as Internet2 and ESnet (Energy Sciences Network). Delivering network-as-a-service that provides predictable performance, efficient resource utilization and better coordination between compute and storage resources is highly desirable. In this paper, we study network provisioning and advance bandwidth reservation in ESnet for on-demand high performance data transfers. We present a novel approach for path finding in time-dependent transport networks with bandwidth guarantees. We plan to improve the current ESnet advance network reservation system, OSCARS [3], by presenting to clients the possible reservation options and alternatives for earliest completion time and shortest transfer duration. The Energy Sciences Network (ESnet) provides high bandwidth connections between research laboratories and academic institutions for data sharing and video/voice communication. The ESnet On-Demand Secure Circuits and Advance Reservation System (OSCARS) establishes guaranteed bandwidth of secure virtual circuits at a certain time, for a certain bandwidth and length of time. Though OSCARS operates within ESnet, it also supplies end-to-end provisioning between multiple autonomous network domains. OSCARS gets reservation requests through a standard web service interface, ...
Date: July 10, 2009
Creator: Balman, Mehmet; Chaniotakis, Evangelos; Shoshani, Arie & Sim, Alex
Partner: UNT Libraries Government Documents Department

RRS: Replica Registration Service for Data Grids

Description: Over the last few years various scientific experiments and Grid projects have developed different catalogs for keeping track of their data files. Some projects use specialized file catalogs, others use distributed replica catalogs to reference files at different locations. Due to this diversity of catalogs, it is very hard to manage files across Grid projects, or to replace one catalog with another. In this paper we introduce a new Grid service called the Replica Registration Service (RRS). It can be thought of as an abstraction of the concepts for registering files and their replicas. In addition to traditional single file registration operations, the RRS supports collective file registration requests and keeps persistent registration queues. This approach is of particular importance for large-scale usage where thousands of files are copied and registered. Moreover, the RRS supports a set of error directives that are triggered in case of registration failures. Our goal is to provide a single uniform interface for various file catalogs to support the registration of files across multiple Grid projects, and to make Grid clients oblivious to the specific catalog used.
Date: July 15, 2005
Creator: Shoshani, Arie; Sim, Alex & Stockinger, Kurt
Partner: UNT Libraries Government Documents Department

Optimizing connected component labeling algorithms

Description: This paper presents two new strategies that can be used to greatly improve the speed of connected component labeling algorithms. To assign a label to a new object, most connected component labeling algorithms use a scanning step that examines some of its neighbors. The first strategy exploits the dependencies among the neighbors to reduce the number examined. When considering 8-connected components in a 2D image, this can reduce the number of neighbors examined from four to one in many cases. The second strategy uses an array to store the equivalence information among the labels, replacing the pointer-based rooted trees used to store the same information. It reduces the memory required and also produces consecutive final labels. Using an array instead of pointer-based rooted trees speeds up connected component labeling algorithms by a factor of 5 to 100 in our tests on random binary images. (A two-pass labeling sketch follows this entry.)
Date: January 16, 2005
Creator: Wu, Kesheng; Otoo, Ekow & Shoshani, Arie
Partner: UNT Libraries Government Documents Department
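
A runnable sketch of two-pass labeling with the array-based equivalence strategy described above. This toy uses 4-connectivity and omits the paper's first strategy (the neighbor decision tree for 8-connectivity); the point shown is the flat parent array replacing pointer-based trees, which also yields consecutive final labels.

    def label_components(image):
        rows, cols = len(image), len(image[0])
        labels = [[0] * cols for _ in range(rows)]
        parent = [0]                    # parent[l] = representative of label l

        def find(l):
            while parent[l] != l:
                parent[l] = parent[parent[l]]   # path halving
                l = parent[l]
            return l

        next_label = 1
        for i in range(rows):           # pass 1: provisional labels
            for j in range(cols):
                if not image[i][j]:
                    continue
                up = labels[i - 1][j] if i else 0
                left = labels[i][j - 1] if j else 0
                if up and left:
                    a, b = find(up), find(left)
                    labels[i][j] = min(a, b)
                    parent[max(a, b)] = min(a, b)   # record equivalence
                elif up or left:
                    labels[i][j] = up or left
                else:
                    labels[i][j] = next_label
                    parent.append(next_label)
                    next_label += 1
        # flatten the equivalence array into consecutive final labels
        final = [0] * next_label
        count = 0
        for l in range(1, next_label):
            if find(l) == l:
                count += 1
                final[l] = count
        for i in range(rows):           # pass 2: relabel (background stays 0)
            for j in range(cols):
                if labels[i][j]:
                    labels[i][j] = final[find(labels[i][j])]
        return labels

    img = [[1, 1, 0, 1],
           [0, 1, 0, 1],
           [1, 0, 0, 1]]
    for row in label_components(img):
        print(row)
    # [1, 1, 0, 2] / [0, 1, 0, 2] / [3, 0, 0, 2]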

The Composite OLAP-Object Data Model

Description: In this paper, we define an OLAP-Object model that combines the main characteristics of OLAP and Object data models in order to achieve their functionalities in a common framework. We classify three different object classes: primitive, regular, and composite. We then define a query language that uses the path concept to facilitate data navigation and manipulation. The main feature of the proposed language is the anchor, which allows us to dynamically fix an object class (primitive, regular, or composite) along the paths over the OLAP-Object data model when expressing queries. Queries can be formulated on objects, on composite objects, and on combinations of both. The power of the proposed query language is investigated through multiple query examples, and the semantics of its clauses and its syntax are examined.
Date: December 7, 2005
Creator: Pourabbas, Elaheh & Shoshani, Arie
Partner: UNT Libraries Government Documents Department

Storage resource managers: Middleware components for grid storage

Description: The amount of scientific data generated by simulations or collected from large-scale experiments has reached levels that cannot be stored in the researcher's workstation or even in his/her local computer center. Such data are vital to large scientific collaborations dispersed over wide-area networks. In the past, the concept of a Grid infrastructure [1] mainly emphasized the computational aspect of supporting large distributed computational tasks, and optimizing the use of the network by using bandwidth reservation techniques. In this paper we discuss the concept of Storage Resource Managers (SRMs) as components that complement this with support for the storage management of large distributed datasets. Access to data is becoming the main bottleneck in such "data intensive" applications because the data cannot be replicated at all sites. SRMs can be used to dynamically optimize the use of storage resources to help unclog this bottleneck.
Date: August 18, 2005
Creator: Shoshani, Arie; Sim, Alex & Gu, Junmin
Partner: UNT Libraries Government Documents Department

Performances of Multi-Level and Multi-Component Compressed Bitmap Indices

Description: This paper presents a systematic study of two large subsets of bitmap indexing methods that use multi-component and multi-level encodings. Earlier studies of bitmap indexes are either empirical or cover uncompressed versions only. Since most bitmap indexes in use are compressed, we set out to study the performance characteristics of these compressed indexes. To make the analyses manageable, we choose a particularly simple but efficient compression method called the Word-Aligned Hybrid (WAH) code. Using this compression method, a number of bitmap indexes are shown to be optimal because their worst-case time complexity for answering a query is a linear function of the number of hits. Since compressed bitmap indexes behave drastically differently from uncompressed ones, our analyses also lead to a number of new methods that are much more efficient than commonly used ones. To validate the analyses, we implement a number of the best methods and measure their performance against well-known indexes. The fastest new methods are predicted and observed to be 5 to 10 times faster than well-known indexing methods. (A sketch of multi-component encoding follows this entry.)
Date: April 30, 2007
Creator: Wu, Kesheng; Stockinger, Kurt & Shoshani, Arie
Partner: UNT Libraries Government Documents Department
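
A small sketch of the multi-component idea: the attribute value is decomposed into digits under chosen bases, and each digit is indexed with its own small set of bitmaps, so an equality query becomes an AND of one bitmap per component. The bases and the uncompressed bit-list representation here are illustrative only; the paper's analysis concerns WAH-compressed variants.

    def to_components(value, bases):
        """Decompose value into one digit per base (least significant first)."""
        digits = []
        for b in bases:
            digits.append(value % b)
            value //= b
        return digits

    def build_component_bitmaps(values, bases):
        n = len(values)
        # bitmaps[k][d][i] == 1 iff digit k of values[i] equals d
        bitmaps = [[[0] * n for _ in range(b)] for b in bases]
        for i, v in enumerate(values):
            for k, d in enumerate(to_components(v, bases)):
                bitmaps[k][d][i] = 1
        return bitmaps

    values = [5, 11, 5, 0]          # cardinality 12 covered by bases 4 * 3
    bms = build_component_bitmaps(values, bases=[4, 3])
    # equality query v == 5 is the AND of one bitmap per component
    d0, d1 = to_components(5, [4, 3])
    print([a & b for a, b in zip(bms[0][d0], bms[1][d1])])  # [1, 0, 1, 0]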

Replica Registration Service - Functional Interface Specification 1.0

Description: The goal of the Replica Registration Service (RRS) is to provide a uniform interface to various file catalogs, replica catalogs, and metadata catalogs. It can be thought of as an abstraction of the concepts used in such systems to register files and their replicas. Some experiments may prefer to support their own file catalogs (which may have their own specialized structures, semantics, and implementations) rather than use a standard replica catalog. Providing an RRS that can interact with such a catalog (for example, by invoking a script) can permit that catalog to be invoked as a service in the same way that other replica catalogs are. If at a later time the experiment wishes to change to another file catalog or an RLS, it is only a matter of developing an RRS for that new catalog and replacing the existing catalog. In addition, some systems use metadata catalogs or other catalogs to manage the file name spaces. Our goal is to provide a single interface that supports the registration of files into such name spaces as well as the retrieval of this information. (A hypothetical adapter sketch follows this entry.)
Date: February 28, 2005
Creator: Shoshani, Arie; Sim, Alex & Stockinger, Kurt
Partner: UNT Libraries Government Documents Department
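
A hypothetical Python rendering of the abstraction the RRS specification describes: clients program against one registration interface, and each catalog gets a thin adapter. Every method name below is invented for illustration; the real interface is defined in the specification itself.

    from abc import ABC, abstractmethod

    class ReplicaRegistration(ABC):
        @abstractmethod
        def register_file(self, logical_name: str, replica_url: str) -> None:
            """Register one replica of a logical file."""

        @abstractmethod
        def register_many(self, entries: list[tuple[str, str]]) -> None:
            """Collective registration of many (name, url) pairs."""

        @abstractmethod
        def lookup(self, logical_name: str) -> list[str]:
            """Return the known replica locations of a logical file."""

    class ScriptCatalogAdapter(ReplicaRegistration):
        """Wraps a legacy file catalog driven by a site-local script."""
        def register_file(self, logical_name, replica_url):
            ...  # e.g. invoke the catalog's own registration script
        def register_many(self, entries):
            for name, url in entries:         # naive loop; a real adapter
                self.register_file(name, url)  # would batch, queue, and retry
        def lookup(self, logical_name):
            return []  # placeholder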

Accurate modeling of cache replacement policies in a Data-Grid.

Description: Caching techniques have been used to bridge the performance gap between levels of storage hierarchies in computing systems. In data-intensive applications that access large data files over a wide-area network environment, such as a data grid, caching mechanisms can significantly improve data access performance under appropriate workloads. In a data grid, it is envisioned that local disk storage resources retain or cache the data files being used by local applications. Under a workload of shared accesses with high locality of reference, the performance of caching techniques depends heavily on the replacement policies being used. A replacement policy effectively determines which set of objects must be evicted when space is needed. Unlike cache replacement policies in virtual memory paging or database buffering, developing an optimal replacement policy for data grids is complicated by the fact that the file objects being cached have varying sizes and transfer and processing costs that vary with time. We present an accurate model for evaluating various replacement policies and propose a new replacement algorithm referred to as "Least Cost Beneficial based on K backward references" (LCB-K). Using this modeling technique, we compare LCB-K with various replacement policies such as Least Frequently Used (LFU), Least Recently Used (LRU), Greedy Dual-Size (GDS), etc., using synthetic and actual workloads of accesses to and from tertiary storage systems. The results obtained show that LCB-K and GDS are the most cost-effective cache replacement policies for storage resource management in data grids. (A simplified eviction sketch follows this entry.)
Date: January 23, 2003
Creator: Otoo, Ekow J. & Shoshani, Arie
Partner: UNT Libraries Government Documents Department
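
A hedged sketch of cost-beneficial eviction in the spirit of LCB-K: each cached file keeps its last K reference times, and the evictor removes the files with the smallest estimated benefit per byte. The utility formula below is a stand-in; the exact LCB-K function is defined in the paper.

    import heapq, time

    K = 3

    def utility(entry, now):
        refs = entry["refs"][-K:]                     # last K backward references
        rate = len(refs) / max(now - refs[0], 1e-9)   # recent reference rate
        return rate * entry["cost"] / entry["size"]   # benefit per byte kept

    def evict_until_fits(cache, need, capacity, now):
        used = sum(e["size"] for e in cache.values())
        heap = [(utility(e, now), name) for name, e in cache.items()]
        heapq.heapify(heap)
        while cache and used + need > capacity:
            _, victim = heapq.heappop(heap)           # lowest utility first
            used -= cache.pop(victim)["size"]

    now = time.time()
    cache = {
        "a": {"size": 4, "cost": 1.0, "refs": [now - 50, now - 40, now - 30]},
        "b": {"size": 2, "cost": 9.0, "refs": [now - 10, now - 2]},
    }
    evict_until_fits(cache, need=3, capacity=8, now=now)
    print(sorted(cache))   # ['b']: "a" goes first (cheap to refetch, cold)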

Compressing bitmap indexes for faster search operations

Description: In this paper, we study the effects of compression on bitmap indexes. The main operations on the bitmaps during query processing are bitwise logical operations such as AND, OR, NOT, etc. Using general-purpose compression schemes, such as gzip, the logical operations on the compressed bitmaps are much slower than on the uncompressed bitmaps. Specialized compression schemes, like the byte-aligned bitmap code (BBC), are usually faster in performing logical operations than the general-purpose schemes, but in many cases they are still orders of magnitude slower than the uncompressed scheme. To make the compressed bitmap indexes operate more efficiently, we designed a CPU-friendly scheme which we refer to as the word-aligned hybrid code (WAH). Tests on both synthetic and real application data show that the new scheme significantly outperforms well-known compression schemes at a modest increase in storage space. Compared to BBC, a scheme well known for its operational efficiency, WAH performs logical operations about 12 times faster and uses only 60 percent more space. Compared to the uncompressed scheme, in most test cases WAH is faster while still using less space. We further verified with additional tests that the improvement in logical operation speed translates to similar improvement in query processing speed. (A toy encoder illustrating the word-aligned idea follows this entry.)
Date: April 25, 2002
Creator: Wu, Kesheng; Otoo, Ekow J. & Shoshani, Arie
Partner: UNT Libraries Government Documents Department
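
A toy encoder showing the word-aligned hybrid idea: the bitmap is cut into fixed-size groups, runs of identical groups collapse into fill words, and mixed groups stay as literal words, so logical operations can proceed run by run. For readability this sketch uses 7 payload bits per word and Python tuples instead of machine words; real WAH packs 31 payload bits into a 32-bit word.

    GROUP = 7          # payload bits per code word in this toy

    def wah_encode(bits):
        # pad to a whole number of groups
        bits = bits + [0] * (-len(bits) % GROUP)
        groups = [tuple(bits[i:i + GROUP]) for i in range(0, len(bits), GROUP)]
        words = []
        for g in groups:
            if g == (0,) * GROUP or g == (1,) * GROUP:      # a fill group
                fill = (1, g[0])        # marker bit, fill value
                if words and words[-1][:2] == fill:
                    words[-1] = (*fill, words[-1][2] + 1)   # extend the run
                else:
                    words.append((*fill, 1))                # new fill word
            else:
                words.append((0, g))                        # literal word
        return words

    bitmap = [0] * 21 + [1, 0, 1, 1, 0, 0, 0] + [1] * 14
    print(wah_encode(bitmap))
    # [(1, 0, 3), (0, (1, 0, 1, 1, 0, 0, 0)), (1, 1, 2)]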

Evaluation Strategies for Bitmap Indices with Binning

Description: Bitmap indices are efficient data structures for querying read-only data with low attribute cardinalities. To improve the efficiency of bitmap indices on attributes with high cardinalities, we present a new strategy for evaluating queries using bitmap indices. This work is motivated by a number of scientific data analysis applications where most attributes have cardinalities in the millions. On these attributes, binning is a common strategy to reduce the size of the bitmap index. In this article we analyze how binning affects the number of pages accessed during query processing, and propose an optimal way of using the bitmap indices to reduce the number of pages accessed. Compared with two basic strategies, the new algorithm reduces the query response time by up to a factor of two. On a set of 5-dimensional queries on real application data, the bitmap indices are on average 10 times faster than the projection index. (A sketch of the edge-bin candidate-check issue follows this entry.)
Date: June 3, 2004
Creator: Stockinger, Kurt; Wu, Kesheng & Shoshani, Arie
Partner: UNT Libraries Government Documents Department
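
The page-access effect of binning can be made concrete in a few lines: for a range query, bins that lie entirely inside the query range are answered from the bitmaps alone, while at most the two boundary ("edge") bins force candidate checks against the raw data. The bin layout below is arbitrary.

    import bisect

    def classify_bins(bin_edges, lo, hi):
        """bin_edges: sorted boundaries, bin k = [edges[k], edges[k+1]).
        Returns (fully_covered, needs_check) bin numbers for lo <= x < hi."""
        first = bisect.bisect_right(bin_edges, lo) - 1
        last = bisect.bisect_right(bin_edges, hi) - 1
        covered, check = [], []
        for k in range(max(first, 0), min(last + 1, len(bin_edges) - 1)):
            if lo <= bin_edges[k] and bin_edges[k + 1] <= hi:
                covered.append(k)       # answered from bitmaps alone
            else:
                check.append(k)         # raw data must be read (extra pages)
        return covered, check

    edges = [0, 10, 20, 30, 40, 50]
    print(classify_bins(edges, 5, 35))   # ([1, 2], [0, 3])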

On the performance of bitmap indices for high cardinality attributes

Description: It is well established that bitmap indices are efficient for read-only attributes with small numbers of distinct values. For an attribute with a large number of distinct values, the size of the bitmap index can be very large. To overcome this size problem, specialized compression schemes are used. Even though there is empirical evidence that some of these compression schemes work well, there has not been any systematic analysis of their effectiveness. In this paper, we analyze the time and space complexities of the two most efficient bitmap compression techniques known, the Byte-aligned Bitmap Code (BBC) and the Word-Aligned Hybrid (WAH) code, and study their performance on high-cardinality attributes. Our analyses indicate that both compression schemes are optimal in time: the time and space required to operate on two compressed bitmaps are proportional to the total size of the two bitmaps. We demonstrate further that an in-place OR algorithm can operate on a large number of sparse bitmaps in time linear in their total size. Our analyses also show that the compressed indices are relatively small compared with commonly used indices such as B-trees. Given these facts, we conclude that bitmap indexes are efficient on attributes of low cardinality as well as on those of high cardinality. We verify the analytical results with extensive tests and identify an optimal way to combine different options to achieve the best performance. The test results confirm the linearity in the total size of the compressed bitmaps, and show that WAH outperforms BBC by about a factor of two.
Date: March 5, 2004
Creator: Wu, Kesheng; Otoo, Ekow & Shoshani, Arie
Partner: UNT Libraries Government Documents Department

Compressed bitmap indices for efficient query processing

Description: Many database applications make extensive use of bitmap indexing schemes. In this paper, we study how to improve the efficiency of these indexing schemes by proposing new compression schemes for the bitmaps. Most compression schemes are designed primarily to achieve good compression, and during query processing they can be orders of magnitude slower than their uncompressed counterparts. The new schemes are designed to bridge this performance gap by trading some compression effectiveness for operation speed. In a number of tests on both synthetic data and real application data, we found that the new schemes significantly outperform the well-known compression schemes while using only modestly more space. For example, compared to the Byte-aligned Bitmap Code, the new schemes are 12 times faster and use only 50 percent more space. The new schemes use much less space (under 30 percent) than the uncompressed scheme and are faster in a majority of the test cases.
Date: September 30, 2001
Creator: Wu, Kesheng; Otoo, Ekow & Shoshani, Arie
Partner: UNT Libraries Government Documents Department

FastBit -- Helps Finding the Proverbial Needle in a Haystack

Description: FastBit is a software package designed to meet the searching and filtering needs of data intensive sciences. In these applications, scientists are trying to find nuggets of information from petabytes of raw data. FastBit has been demonstrated to be an order of magnitude faster than comparable technologies. In this brief report, we highlight how we work with a visualization team, a network security team and a DNA sequencing center to find the nuggets in their data.
Date: July 5, 2006
Creator: Wu, Kesheng "John"; Stockinger, Kurt; Shoshani, Arie & Wes, Bethel
Partner: UNT Libraries Government Documents Department

Grid Collector: Using an event catalog to speed up user analysis in distributed environment

Description: Nuclear and high energy physics experiments such as STAR at BNL are generating millions of files with petabytes of data each year. In most cases, analysis programs have to read all events in a file in order to find the interesting ones. Since the interesting events may be a small fraction of the events in the file, a significant portion of the computer time is wasted on reading unwanted events. To address this issue, we developed a software system called Grid Collector. The core of Grid Collector is an Event Catalog that can be efficiently searched with compressed bitmap indices. Tests show that Grid Collector can index and search STAR event data much faster than database systems. It is fully integrated with an existing analysis framework, so minimal effort is required to use it. In addition, by taking advantage of existing file catalogs, Storage Resource Managers (SRMs), and GridFTP, Grid Collector automatically downloads the needed files anywhere on the Grid without user intervention. Grid Collector can significantly improve user productivity: for a user who typically performs computation on 50 percent of the events, using Grid Collector could reduce the turnaround time by 30 percent. The improvement is more significant when searching for rare events, because only a small number of events with the appropriate properties are read into memory, and the necessary files are automatically located and downloaded through the best available route.
Date: November 1, 2004
Creator: Wu, Kesheng; Shoshani, Arie; Zhang, Wei-Ming; Lauret, Jerome & Perevoztchikov, Victor
Partner: UNT Libraries Government Documents Department

Efficient Analysis of Live and Historical Streaming Data and its Application to Cybersecurity

Description: Applications that query data streams in order to identify trends, patterns, or anomalies can often benefit from comparing the live stream data with archived historical stream data. However, searching this historical data in real time has so far been considered prohibitively expensive. One of the main bottlenecks is the update cost of the indices over the archived data. In this paper, we address this problem by using our highly efficient bitmap indexing technology (called FastBit) and demonstrate that the index update operations are sufficiently efficient for this bottleneck to be removed. We describe our prototype system based on the TelegraphCQ streaming query processor and the FastBit bitmap index. We present a detailed performance evaluation of our system using a complex query workload for analyzing real network traffic data. The combined system uses TelegraphCQ to analyze streams of traffic information and FastBit to correlate current behaviors with historical trends. We demonstrate that our system can simultaneously analyze (1) live streams with high data rates and (2) a large repository of historical stream data.
Date: April 6, 2007
Creator: Reiss, Frederick; Stockinger, Kurt; Wu, Kesheng; Shoshani, Arie & Hellerstein, Joseph M.
Partner: UNT Libraries Government Documents Department

Analyzing Enron Data: Bitmap Indexing Outperforms MySQL Queries by Several Orders of Magnitude

Description: FastBit is an efficient, compressed bitmap indexing technology that was developed in our group. In this report we evaluate the performance of MySQL and FastBit for analyzing the email traffic of the Enron dataset. The first finding shows that materializing the join results of several tables significantly improves the query performance. The second finding shows that FastBit outperforms MySQL by several orders of magnitude.
Date: January 28, 2006
Creator: Stockinger, Kurt; Rotem, Doron; Shoshani, Arie & Wu, Kesheng
Partner: UNT Libraries Government Documents Department

DataMover: robust terabyte-scale multi-file replication over wide-area networks

Description: Typically, large scientific datasets (on the order of terabytes) are generated at large computational centers and stored on mass storage systems. However, large subsets of the data need to be moved to facilities available to application scientists for analysis. Replication of thousands of files is a tedious, error-prone, but extremely important task in scientific applications. Automating the file replication task requires automatic space acquisition and reuse, monitoring the progress of staging thousands of files from the source mass storage system, transferring them over the network, archiving them at the target mass storage system or disk systems, and recovering from transient system failures. We have developed a robust replication system, called DataMover, which is now in regular use in high-energy physics and climate modeling experiments. Only a single command is necessary to request multi-file replication or the replication of an entire directory. A web-based tool was developed to dynamically monitor the progress of the multi-file replication process.
Date: April 5, 2004
Creator: Sim, Alex; Gu, Junmin; Shoshani, Arie & Natarajan, Vijaya
Partner: UNT Libraries Government Documents Department