Search Results

Designing a Micro-Mechanical Transistor

Description: This is the final report of a three-year, Laboratory-Directed Research and Development (LDRD) project at the Los Alamos National Laboratory (LANL). Micro-mechanical electronic systems are chips with moving parts. They are fabricated with the same techniques used to manufacture electronic chips, sharing their low cost. Micro-mechanical chips can also contain electronic components. By combining mechanical parts with electronic parts it becomes possible to process signals mechanically. To achieve designs comparable to those obtained with electronic components, it is necessary to have a mechanical device that can change its behavior in response to a small input - a mechanical transistor. The work proposed will develop the design tools for these complex-shaped resonant structures using the geometrical ray technique. To overcome the limitations of geometrical ray chaos, the dynamics of the rays will be studied using the methods developed for the study of nonlinear dynamical systems. This leads to numerical methods that execute well on parallel computer architectures, using a limited amount of memory and no inter-process communication; a minimal sketch of such an independent-ray computation follows this record.
Date: June 3, 1999
Creator: Mainieri, R.
Partner: UNT Libraries Government Documents Department
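
The ray-dynamics approach lends itself to embarrassingly parallel computation, since each ray trajectory evolves independently. Below is a minimal, hypothetical sketch (not the report's actual method) that traces independent specular-reflection rays in a rectangular cavity; the cavity shape, step size, and ray count are illustrative assumptions.

```python
import numpy as np

# Hypothetical sketch: trace independent rays bouncing inside a rectangular
# cavity [0, Lx] x [0, Ly] by specular reflection. Each ray needs only its own
# position and direction, so rays can be distributed across processes with no
# inter-process communication. Cavity size, step, and ray count are made up.
def trace_ray(pos, vel, lx=2.0, ly=1.0, dt=1e-3, steps=20000):
    pos = np.array(pos, dtype=float)
    vel = np.array(vel, dtype=float)
    for _ in range(steps):
        pos += vel * dt
        # Specular reflection: flip the velocity component at each wall.
        if pos[0] < 0.0 or pos[0] > lx:
            vel[0] = -vel[0]
            pos[0] = np.clip(pos[0], 0.0, lx)
        if pos[1] < 0.0 or pos[1] > ly:
            vel[1] = -vel[1]
            pos[1] = np.clip(pos[1], 0.0, ly)
    return pos, vel

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for i in range(4):                       # each ray is an independent task
        theta = rng.uniform(0.0, 2.0 * np.pi)
        p, v = trace_ray([1.0, 0.5], [np.cos(theta), np.sin(theta)])
        print(f"ray {i}: final position {p}")
```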

Case study of isosurface extraction algorithm performance

Description: Isosurface extraction is an important and useful visualization method. Over the past ten years, the field has seen numerous isosurface extraction techniques published, leaving the user in a quandary about which one should be used. Some papers have published complexity analyses of the techniques, yet empirical evidence comparing different methods is lacking. This case study presents a comparative study of several representative isosurface extraction algorithms. It reports and analyzes empirical measurements of execution time and memory behavior for each algorithm (a minimal timing-harness sketch follows this record). The results show that asymptotically optimal techniques may not be the best choice when implemented on modern computer architectures.
Date: December 14, 1999
Creator: Sutton, P M; Hansen, C D; Shen, H & Schikore, D
Partner: UNT Libraries Government Documents Department
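
As an illustration of the kind of empirical measurement described above, here is a minimal, hypothetical benchmarking harness (not the study's actual methodology): it times arbitrary extraction callables over repeated runs and records peak Python-level memory with tracemalloc. The placeholder `full_scan` routine, the synthetic radial field, and the isovalue are assumptions of this sketch.

```python
import time
import tracemalloc
import numpy as np

def benchmark(fn, *args, repeats=3):
    """Return (best wall-clock seconds, peak traced bytes) over several runs."""
    best = float("inf")
    peak = 0
    for _ in range(repeats):
        tracemalloc.start()
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - t0)
        peak = max(peak, tracemalloc.get_traced_memory()[1])
        tracemalloc.stop()
    return best, peak

def full_scan(field, iso):
    """Placeholder 'algorithm': count cells where the isovalue lies between two
    opposite corner values (a crude proxy for active-cell search)."""
    lo = field[:-1, :-1, :-1]
    hi = field[1:, 1:, 1:]
    return np.count_nonzero((np.minimum(lo, hi) <= iso) & (iso <= np.maximum(lo, hi)))

if __name__ == "__main__":
    n = 128
    ax = np.linspace(-1.0, 1.0, n)
    x, y, z = np.meshgrid(ax, ax, ax, indexing="ij")
    field = np.sqrt(x * x + y * y + z * z)        # synthetic radial scalar field
    secs, peak = benchmark(full_scan, field, 0.5)
    print(f"full_scan: {secs:.4f} s, peak {peak / 1e6:.1f} MB")
```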

Creating science-driven computer architecture: A new path to scientific leadership

Description: This document proposes a multi-site strategy for creating a new class of computing capability for the U.S. by undertaking the research and development necessary to build supercomputers optimized for science in partnership with the American computer industry.
Date: October 14, 2002
Creator: McCurdy, C. William; Stevens, Rick; Simon, Horst; Kramer, William; Bailey, David; Johnston, William et al.
Partner: UNT Libraries Government Documents Department

Design Space Exploration of Domain Specific CGRAs Using Crowd-sourcing

Description: Coarse-grained reconfigurable array architectures (CGRAs) try to fill the gap between FPGAs and ASICs. Over three decades, research into CGRA design has produced a number of architectures. Each of these designs lies at a different point on a line drawn between FPGAs and ASICs, depending on the tradeoffs and design choices made during the design of the architecture. Thus, design space exploration (DSE) plays a very important role in the circuit design process. In this work I propose that the design space exploration of CGRAs can be done quickly and efficiently through crowd-sourcing and a game-driven approach based on an interactive mapping game, UNTANGLED, and a design environment called SmartBricks. Both UNTANGLED and SmartBricks have been developed by our research team at the Reconfigurable Computing Lab, UNT. I present the results of design space exploration of domain-specific reconfigurable architectures, comparing stripe versus mesh styles and heterogeneous versus homogeneous architectures. I also compare the results obtained from different interconnection topologies in mesh. These results show that this approach offers quick DSE for designers and also yields low-power architectures for a suite of benchmarks. All results were obtained using standard-cell ASICs with a 90 nm process.
Date: August 2014
Creator: Sistla, Anil Kumar
Partner: UNT Libraries

Transitive closure on the imagine stream processor

Description: The increasing gap between processor and memory speeds is a well-known problem in modern computer architecture. The Imagine system is designed to address the processor-memory gap through streaming technology. Stream processors are best suited for computationally intensive applications characterized by high data parallelism and producer-consumer locality with minimal data dependencies. This work examines an efficient streaming implementation of the computationally intensive Transitive Closure (TC) algorithm on the Imagine platform. We develop a tiled TC algorithm specifically for the Imagine environment, which efficiently reuses streams to minimize expensive off-chip data transfers; a generic blocked transitive-closure sketch follows this record. The implementation requires complex stream programming since the memory hierarchy and cluster organization of the underlying architecture are exposed to the Imagine programmer. Results demonstrate that the performance of TC is limited primarily by the complicated data dependencies of the blocked algorithm. This work is part of an ongoing effort to identify classes of scientific problems well suited to streaming processors.
Date: November 11, 2003
Creator: Griem, Gorden & Oliker, Leonid
Partner: UNT Libraries Government Documents Department
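
For reference, here is a minimal blocked (tiled) transitive-closure sketch in the spirit of the record above, written for a generic CPU with NumPy rather than the Imagine stream programming model; the tile size and the boolean-matrix-product formulation are choices made for this sketch.

```python
import numpy as np

def boolmm(a, b):
    """Boolean matrix product: AND along the inner dimension, OR-accumulated."""
    return (a.astype(np.uint8) @ b.astype(np.uint8)) > 0

def closure_in_place(d):
    """Floyd-Warshall closure of one tile, using only intra-tile intermediates."""
    for k in range(d.shape[0]):
        d |= np.outer(d[:, k], d[k, :])

def blocked_transitive_closure(adj, b):
    """Tiled Floyd-Warshall transitive closure of a boolean adjacency matrix.
    For brevity the matrix size is assumed to be a multiple of the tile size b."""
    r = adj.copy()
    np.fill_diagonal(r, True)                        # every vertex reaches itself
    n = r.shape[0]
    t = n // b
    for k in range(t):
        K = slice(k * b, (k + 1) * b)
        closure_in_place(r[K, K])                    # phase 1: diagonal tile
        for j in range(t):                           # phase 2: row/column panels
            J = slice(j * b, (j + 1) * b)
            if j != k:
                r[K, J] |= boolmm(r[K, K], r[K, J])
                r[J, K] |= boolmm(r[J, K], r[K, K])
        for i in range(t):                           # phase 3: remaining tiles
            I = slice(i * b, (i + 1) * b)
            for j in range(t):
                J = slice(j * b, (j + 1) * b)
                if i != k and j != k:
                    r[I, J] |= boolmm(r[I, K], r[K, J])
    return r

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, b = 12, 4
    adj = rng.random((n, n)) < 0.15
    tiled = blocked_transitive_closure(adj, b)
    ref = adj | np.eye(n, dtype=bool)                # cross-check: repeated squaring
    for _ in range(n):
        ref = ref | boolmm(ref, ref)
    print("tiled closure matches reference:", bool(np.array_equal(tiled, ref)))
```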

Using a Transfer Function to Describe the Load-Balancing Problem

Description: The dynamic load-balancing problem for mesh-connected parallel computers can be clearly described by introducing a function that identifies how much work is to be transmitted between neighboring processors. This function is a solution to an elliptic problem for which a wealth of knowledge exists. The non-uniqueness of the solution to the load-balancing problem is made explicit.
Date: November 1993
Creator: Conley, Andrew J.
Partner: UNT Libraries Government Documents Department
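
A minimal sketch of the kind of formulation the description above suggests, with notation chosen here rather than taken from the report: let w_i be the work on processor i and w-bar the average work; a transfer potential phi satisfying a discrete Poisson (elliptic) equation then prescribes how much work crosses each mesh edge, and it is determined only up to an additive constant, which is one way the non-uniqueness shows up.

```latex
% Hypothetical notation (not the report's own symbols): w_i = work on
% processor i, \bar{w} = average work, N(i) = mesh neighbors of i,
% \phi = transfer potential, P = number of processors.
\Delta \phi_i \;\equiv\; \sum_{j \in N(i)} \left( \phi_j - \phi_i \right)
      \;=\; \bar{w} - w_i ,
\qquad
\bar{w} \;=\; \frac{1}{P} \sum_{i=1}^{P} w_i .
% Work sent from processor i to its neighbor j:
t_{ij} \;=\; \phi_i - \phi_j ,
\qquad
\sum_{j \in N(i)} t_{ij} \;=\; w_i - \bar{w} .
% Non-uniqueness: \phi and \phi + c give identical transfers t_{ij}.
```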

Decoherence and a simple quantum computer

Description: The authors analyze the effect of decoherence on the operation of part of a simple quantum computer. The results indicate that quantum bit coding techniques may be used to mitigate the effects of two sources of decoherence - amplitude damping and phase randomization.
Date: October 1, 1995
Creator: Chuang, I.L.; Yamamoto, Y. & Laflamme, R.
Partner: UNT Libraries Government Documents Department
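
As background for the two noise sources named in the record above, here is a minimal single-qubit sketch (illustrative only, not the paper's model): amplitude damping and phase damping applied to a density matrix through standard Kraus operators; the damping strengths and the initial state are arbitrary choices.

```python
import numpy as np

# Illustrative single-qubit decoherence channels in the Kraus-operator picture.
# gamma (amplitude damping) and lam (phase damping) are arbitrary values here.
def apply_channel(rho, kraus_ops):
    """rho -> sum_k E_k rho E_k^dagger."""
    return sum(e @ rho @ e.conj().T for e in kraus_ops)

def amplitude_damping(gamma):
    e0 = np.array([[1.0, 0.0], [0.0, np.sqrt(1.0 - gamma)]])
    e1 = np.array([[0.0, np.sqrt(gamma)], [0.0, 0.0]])
    return [e0, e1]

def phase_damping(lam):
    e0 = np.array([[1.0, 0.0], [0.0, np.sqrt(1.0 - lam)]])
    e1 = np.array([[0.0, 0.0], [0.0, np.sqrt(lam)]])
    return [e0, e1]

if __name__ == "__main__":
    plus = np.array([[0.5, 0.5], [0.5, 0.5]])      # |+><+|, sensitive to both noises
    rho = apply_channel(plus, amplitude_damping(0.1))
    rho = apply_channel(rho, phase_damping(0.1))
    print(rho)               # off-diagonal (coherence) terms shrink
    print(np.trace(rho))     # trace stays 1
```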

GridRun: A lightweight packaging and execution environment for compact, multi-architecture binaries

Description: GridRun offers a very simple set of tools for creating and executing multi-platform binary executables. These "fat binaries" archive native machine code into compact packages that are typically a fraction of the size of the original binary images they store, enabling efficient staging of executables for heterogeneous parallel jobs. GridRun interoperates with existing distributed job launchers/managers such as Condor and the Globus GRAM to greatly simplify the logic required to launch native binary applications in distributed heterogeneous environments (a minimal architecture-dispatch sketch follows this record).
Date: February 1, 2004
Creator: Shalf, John & Goodale, Tom
Partner: UNT Libraries Government Documents Department
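
The following is a hypothetical sketch of the core idea behind a fat-binary launcher: pick the native executable matching the local platform from a bundle directory and exec it. The bundle layout ("<os>-<arch>/app" subdirectories) and the script interface are assumptions of this sketch, not GridRun's actual package format.

```python
import os
import platform
import sys

# Hypothetical fat-binary dispatcher: a bundle directory is assumed to hold one
# native executable per platform under "<os>-<arch>/app" (layout invented here,
# not GridRun's real format). The dispatcher execs the one matching this host.
def run_from_bundle(bundle_dir, *args):
    tag = f"{platform.system().lower()}-{platform.machine().lower()}"
    exe = os.path.join(bundle_dir, tag, "app")
    if not os.path.isfile(exe):
        raise RuntimeError(f"no binary for platform {tag} in {bundle_dir}")
    os.execv(exe, [exe, *args])          # replace this process with the binary

if __name__ == "__main__":
    run_from_bundle(sys.argv[1], *sys.argv[2:])
```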

A brief comparison between grid based real space algorithms and spectrum algorithms for electronic structure calculations

Description: Quantum mechanical ab initio calculation constitutes the biggest portion of the computer time in materials science and chemical science simulations. For a computer center like NERSC to better serve these communities, it will be very useful to have a prediction of the future trends of ab initio calculations in these areas. Such a prediction can help us decide what future computer architecture will be most useful for these communities, and what should be emphasized in future supercomputer procurements. As the size of the computer and the size of the simulated physical systems increase, there is a renewed interest in using the real space grid method in electronic structure calculations. This is fueled by two factors. First, it is generally assumed that the real space grid method is more suitable for parallel computation because of its limited communication requirement, compared with the spectrum method, where a global FFT is required. Second, as the size N of the calculated system increases together with the computer power, O(N) scaling approaches become more favorable than the traditional direct O(N^3) scaling methods. These O(N) methods are usually based on localized orbitals in real space, which can be described more naturally by a real space basis. In this report, the author compares the real space methods with the traditional plane wave (PW) spectrum methods, for their technical pros and cons and possible future trends (a minimal sketch contrasting the two representations follows this record). For the real space method, the author focuses on the regular grid finite difference (FD) method and the finite element (FE) method. These are the methods used most in materials science simulation. As for chemical science, the predominant methods are still the Gaussian basis method and sometimes the atomic orbital basis method. These two basis sets are localized in real space, and there is no indication that their roles in quantum ...
Date: December 1, 2006
Creator: Wang, Lin-Wang
Partner: UNT Libraries Government Documents Department
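
To make the grid-versus-spectrum contrast concrete, here is a minimal 1D sketch (illustrative only): the kinetic-energy-like operator -d^2/dx^2 applied to a periodic test function, once with a local finite-difference stencil and once with a global FFT. The grid size, domain, and test function are arbitrary choices for the sketch.

```python
import numpy as np

# Apply -d^2/dx^2 to a smooth periodic function on a uniform grid in two ways:
# a local 3-point finite-difference stencil (nearest-neighbor data only) and a
# global FFT-based spectral derivative (needs all grid points at once).
n, length = 256, 2.0 * np.pi
h = length / n
x = np.arange(n) * h
f = np.sin(3.0 * x)                       # exact answer: -f'' = 9 sin(3x)

# Finite difference: only nearest-neighbor communication would be needed.
lap_fd = -(np.roll(f, -1) - 2.0 * f + np.roll(f, 1)) / h**2

# Spectral: a global FFT couples every grid point.
k = np.fft.fftfreq(n, d=h) * 2.0 * np.pi
lap_sp = np.fft.ifft(k**2 * np.fft.fft(f)).real

exact = 9.0 * np.sin(3.0 * x)
print("FD max error:      ", np.max(np.abs(lap_fd - exact)))
print("Spectral max error:", np.max(np.abs(lap_sp - exact)))
```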

Performance and Accuracy of LAPACK's Symmetric Tridiagonal Eigensolvers

Description: We compare four algorithms from the latest LAPACK 3.1 release for computing eigenpairs of a symmetric tridiagonal matrix. These include QR iteration, bisection and inverse iteration (BI), the Divide-and-Conquer method (DC), and the method of Multiple Relatively Robust Representations (MR). Our evaluation considers speed and accuracy when computing all eigenpairs, and additionally subset computations. Using carefully selected test problems, our study spans a variety of today's computer architectures. Our conclusions can be summarized as follows. (1) DC and MR are generally much faster than QR and BI on large matrices. (2) MR almost always does the fewest floating point operations, but at a lower MFlop rate than all the other algorithms. (3) The exact performance of MR and DC strongly depends on the matrix at hand. (4) DC and QR are the most accurate algorithms, with observed accuracy O(√n·ε). The accuracy of BI and MR is generally O(n·ε). (5) MR is preferable to BI for subset computations. (A small accuracy-check sketch on a test matrix with known eigenvalues follows this record.)
Date: April 19, 2007
Creator: Demmel, Jim W.; Marques, Osni A.; Parlett, Beresford N. & Vomel, Christof
Partner: UNT Libraries Government Documents Department
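
In the spirit of the accuracy measurements above, here is a minimal sketch (not the paper's test suite or LAPACK drivers): it builds the classic (-1, 2, -1) tridiagonal matrix, whose eigenvalues are known in closed form, solves it with NumPy's symmetric eigensolver, and reports the error alongside n times machine epsilon. The matrix choice and error normalization are assumptions of this sketch.

```python
import numpy as np

# Accuracy sketch on a test matrix with analytically known spectrum:
# the n x n tridiagonal matrix with 2 on the diagonal and -1 off-diagonal has
# eigenvalues 2 - 2*cos(k*pi/(n+1)), k = 1..n.
n = 500
t = (np.diag(2.0 * np.ones(n))
     + np.diag(-np.ones(n - 1), 1)
     + np.diag(-np.ones(n - 1), -1))

computed = np.linalg.eigvalsh(t)                       # ascending eigenvalues
k = np.arange(1, n + 1)
exact = np.sort(2.0 - 2.0 * np.cos(k * np.pi / (n + 1)))

eps = np.finfo(float).eps
err = np.max(np.abs(computed - exact)) / np.linalg.norm(t, 2)
print(f"max relative eigenvalue error: {err:.2e}  (n*eps = {n * eps:.2e})")
```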

Computational Biology, Advanced Scientific Computing, and Emerging Computational Architectures

Description: This CRADA was established at the start of FY02 with $200 K from IBM and matching funds from DOE to support post-doctoral fellows in collaborative research between International Business Machines and Oak Ridge National Laboratory to explore effective use of emerging petascale computational architectures for the solution of computational biology problems. 'No cost' extensions of the CRADA were negotiated with IBM for FY03 and FY04.
Date: June 27, 2007
Partner: UNT Libraries Government Documents Department

Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms

Description: We are witnessing a dramatic change in computer architecture due to the multicore paradigm shift, as every electronic device from cell phones to supercomputers confronts parallelism of unprecedented scale. To fully unleash the potential of these systems, the HPC community must develop multicore-specific optimization methodologies for important scientific computations. In this work, we examine sparse matrix-vector multiply (SpMV) - one of the most heavily used kernels in scientific computing - across a broad spectrum of multicore designs (a reference CSR SpMV sketch follows this record). Our experimental platform includes the homogeneous AMD quad-core, AMD dual-core, and Intel quad-core designs, the heterogeneous STI Cell, as well as one of the first scientific studies of the highly multithreaded Sun Victoria Falls (a Niagara2 SMP). We present several optimization strategies especially effective for the multicore environment, and demonstrate significant performance improvements compared to existing state-of-the-art serial and parallel SpMV implementations. Additionally, we present key insights into the architectural trade-offs of leading multicore design strategies, in the context of demanding memory-bound numerical algorithms.
Date: October 16, 2008
Creator: Williams, Samuel; Oliker, Leonid; Vuduc, Richard; Shalf, John; Yelick, Katherine & Demmel, James
Partner: UNT Libraries Government Documents Department
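
For context, here is a minimal reference SpMV in compressed sparse row (CSR) form, the kind of baseline kernel that multicore-specific optimizations (blocking, format changes, thread and bandwidth tuning) improve upon. This is a generic textbook kernel written in Python for clarity, not the paper's optimized implementations.

```python
import numpy as np

def csr_spmv(values, col_idx, row_ptr, x):
    """y = A @ x for A stored in CSR: values/col_idx hold nonzeros row by row,
    and row_ptr[i]:row_ptr[i+1] is the slice belonging to row i."""
    n_rows = len(row_ptr) - 1
    y = np.zeros(n_rows)
    for i in range(n_rows):
        start, end = row_ptr[i], row_ptr[i + 1]
        y[i] = np.dot(values[start:end], x[col_idx[start:end]])
    return y

if __name__ == "__main__":
    # 3x3 example matrix: [[10, 0, 2], [0, 3, 0], [1, 0, 4]]
    values  = np.array([10.0, 2.0, 3.0, 1.0, 4.0])
    col_idx = np.array([0, 2, 1, 0, 2])
    row_ptr = np.array([0, 2, 3, 5])
    x = np.array([1.0, 2.0, 3.0])
    print(csr_spmv(values, col_idx, row_ptr, x))   # expect [16. 6. 13.]
```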

Fabric-based systems: model, tools, applications.

Description: A Fabric-Based System is a parameterized cellular architecture in which an array of computing cells communicates with an embedded processor through a global memory. This architecture is customizable to different classes of applications through functional unit, interconnect, and memory parameters, and can be instantiated efficiently on platform FPGAs. In previous work, we demonstrated the advantage of reconfigurable fabrics for image and signal processing applications. Recently, we have built a Fabric Generator (FG), a Java-based toolset that greatly accelerates construction of these fabrics. A module-generation library is used to define, instantiate, and interconnect cells' datapaths. FG generates customized sequencers for individual cells or collections of cells. We describe the Fabric-Based System model, the FG toolset, and concrete realizations of fabric architectures generated by FG on the Altera Excalibur ARM that can deliver 4.5 GigaMACs/s (8/16-bit data, multiply-accumulate).
Date: January 1, 2003
Creator: Wolinski, C. (Christophe); Gokhale, M. (Maya) & McCabe, K. P. (Kevin P.)
Partner: UNT Libraries Government Documents Department

Hydra: a service oriented architecture for scientific simulation integration

Description: One of the current major challenges in scientific modeling and simulation, in particular in the infrastructure-analysis community, is the development of techniques for efficiently and automatically coupling disparate tools that exist in separate locations on different platforms, are implemented in a variety of languages, and were designed to be standalone. Recent advances in web-based platforms for integrating systems, such as service-oriented architecture (SOA), provide an opportunity to address these challenges in a systematic fashion. This paper describes Hydra, an integrating architecture for infrastructure modeling and simulation that defines geography-based schemas that, when used to wrap existing tools as web services, allow for seamless plug-and-play composability. Existing users of these tools can enhance the value of their analysis by assessing how the simulations of one tool affect the behavior of another, and can automate existing ad hoc processes and workflows for integrating tools.
Date: January 1, 2008
Creator: Bent, Russell; Djidjev, Tatiana; Hayes, Birch P; Holland, Joe V; Khalsa, Hari S; Linger, Steve P et al.
Partner: UNT Libraries Government Documents Department

Non-preconditioned conjugate gradient on cell and FPGA based hybrid supercomputer nodes

Description: This work presents a detailed implementation of a double-precision, non-preconditioned Conjugate Gradient algorithm on a Roadrunner heterogeneous supercomputer node. These nodes utilize the Cell Broadband Engine Architecture(TM) in conjunction with x86 Opteron(TM) processors from AMD. We implement a common Conjugate Gradient algorithm on a variety of systems to compare and contrast performance (a reference sketch of the algorithm follows this record). Implementation results are presented for the Roadrunner hybrid supercomputer, the SRC Computers, Inc. MAPStation SRC-6 FPGA-enhanced hybrid supercomputer, and an AMD Opteron-only system. In all hybrid implementations, wall clock time is measured, including all transfer overhead and compute timings.
Date: January 1, 2009
Creator: Dubois, David H; Dubois, Andrew J; Boorman, Thomas M & Connor, Carolyn M
Partner: UNT Libraries Government Documents Department
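
For reference, here is a textbook non-preconditioned Conjugate Gradient in double precision, written generically in Python rather than for the Cell or FPGA platforms discussed in the record; the tolerance and the small test system are arbitrary.

```python
import numpy as np

def conjugate_gradient(a, b, tol=1e-10, max_iter=1000):
    """Solve a @ x = b for symmetric positive-definite a, without preconditioning."""
    x = np.zeros_like(b)
    r = b - a @ x                 # residual
    p = r.copy()                  # search direction
    rs_old = r @ r
    for _ in range(max_iter):
        ap = a @ p
        alpha = rs_old / (p @ ap)
        x += alpha * p
        r -= alpha * ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    m = rng.standard_normal((50, 50))
    a = m @ m.T + 50.0 * np.eye(50)      # symmetric positive-definite test matrix
    b = rng.standard_normal(50)
    x = conjugate_gradient(a, b)
    print("residual norm:", np.linalg.norm(a @ x - b))
```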

Efficient Graph Based Assembly of Short-Read Sequences on Hybrid Core Architecture

Description: Advanced architectures can deliver dramatically increased throughput for genomics and proteomics applications, reducing time-to-completion in some cases from days to minutes. One such architecture, hybrid-core computing, marries a traditional x86 environment with a reconfigurable coprocessor based on field programmable gate array (FPGA) technology. In addition to higher throughput, increased performance can fundamentally improve research quality by allowing more accurate, previously impractical approaches. We will discuss the approach used by Convey's de Bruijn graph constructor for short-read, de-novo assembly (a toy de Bruijn graph sketch follows this record). Bioinformatics applications that have random access patterns to large memory spaces, such as graph-based algorithms, experience memory performance limitations on cache-based x86 servers. Convey's highly parallel memory subsystem allows application-specific logic to simultaneously access 8192 individual words in memory, significantly increasing effective memory bandwidth over cache-based memory systems. Many algorithms, such as Velvet and other de Bruijn graph based, short-read, de-novo assemblers, can greatly benefit from this type of memory architecture. Furthermore, small data type operations (four nucleotides can be represented in two bits) make more efficient use of logic gates than the data types dictated by conventional programming models. JGI is comparing the performance of Convey's graph constructor and Velvet on both synthetic and real data. We will present preliminary results on memory usage and run time metrics for various data sets of different sizes, from small microbial and fungal genomes to a very large cow rumen metagenome. For genomes with references we will also present assembly quality comparisons between the two assemblers.
Date: March 22, 2011
Creator: Sczyrba, Alex; Pratap, Abhishek; Canon, Shane; Han, James; Copeland, Alex; Wang, Zhong et al.
Partner: UNT Libraries Government Documents Department
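
To illustrate the data structure named above, here is a toy de Bruijn graph builder with 2-bit nucleotide packing, written for clarity only; the k-mer size, the packing scheme, and the example reads are choices of this sketch, not Convey's or Velvet's implementations.

```python
from collections import defaultdict

# Toy de Bruijn graph: nodes are (k-1)-mers, edges are observed k-mers.
# Two-bit packing of A/C/G/T mirrors the "four nucleotides in two bits" point.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def pack(kmer):
    """Pack a k-mer into an integer, two bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    return value

def build_de_bruijn(reads, k):
    graph = defaultdict(set)      # (k-1)-mer prefix -> set of (k-1)-mer suffixes
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

if __name__ == "__main__":
    reads = ["ACGTACGT", "CGTACGTT"]
    graph = build_de_bruijn(reads, k=4)
    for prefix, suffixes in sorted(graph.items()):
        print(prefix, "->", sorted(suffixes), "| packed:", pack(prefix))
```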

Radiation transport algorithms on trans-petaflops supercomputers of different architectures.

Description: We seek to understand which architecture will be best for supercomputers at the petaflops scale and beyond. The process we use is to predict the cost and performance of several leading architectures at various years in the future. The basis for predicting the future is an expanded version of Moore's Law called the International Technology Roadmap for Semiconductors (ITRS). We abstract leading supercomputer architectures into chips connected by wires, where the chips and wires have electrical parameters predicted by the ITRS. We then compute the cost of a supercomputer system and the run time on a key problem of interest to the DOE (radiation transport). These calculations are parameterized by the time into the future and the technology expected to be available at that point. We find the new advanced architectures have substantial performance advantages, but conventional designs are likely to be less expensive (due to economies of scale). We do not find a universal "winner"; instead, the right architectural choice is likely to involve non-technical factors such as the availability of capital and how long people are willing to wait for results.
Date: August 1, 2003
Creator: Christopher, Thomas Woods
Partner: UNT Libraries Government Documents Department

High Performance Architecture using Speculative Threads and Dynamic Memory Management Hardware

Description: With the advances in very large scale integration (VLSI) technology, hundreds of billions of transistors can be packed into a single chip. With the increased hardware budget, how to take advantage of the available hardware resources becomes an important research area. Some researchers have shifted from the control-flow von Neumann architecture back to dataflow architectures in order to explore scalable designs leading to multi-core systems with several hundreds of processing elements. In this dissertation, I address how the performance of modern processing systems can be improved, while attempting to reduce hardware complexity and energy consumption. My research described here tackles both central processing unit (CPU) performance and memory subsystem performance. More specifically, I describe my research related to the design of an innovative decoupled multithreaded architecture that can be used in multi-core processor implementations. I also address how memory management functions can be off-loaded from processing pipelines to further improve system performance and eliminate cache pollution caused by runtime management functions.
Date: December 2007
Creator: Li, Wentong
Partner: UNT Libraries

Evaluating the Scalability of SDF Single-chip Multiprocessor Architecture Using Automatically Parallelizing Code

Description: Advances in integrated circuit technology continue to provide more and more transistors on a chip. Computer architects are faced with the challenge of finding the best way to translate these resources into high performance. The challenge in the design of the next generation of CPUs (central processing units) lies not in trying to use up the silicon area, but in finding smart ways to make use of the wealth of transistors now available. In addition, the next generation architecture should offer high throughput, scalability, modularity, and low energy consumption, instead of being an architecture that is suitable for only one class of applications or users, or only emphasizes a faster clock rate. A program exhibits different types of parallelism: instruction level parallelism (ILP), thread level parallelism (TLP), or data level parallelism (DLP). Likewise, architectures can be designed to exploit one or more of these types of parallelism. It is generally not possible to design architectures that can take advantage of all three types of parallelism without using very complex hardware structures and complex compiler optimizations. We present the state-of-the-art SDF (scheduled dataflow) architecture, which exploits as much TLP as the application supplies. We implement an SDF single-chip multiprocessor constructed from simpler processors and execute automatically parallelized applications on the single-chip multiprocessor. SDF has many desirable features such as high throughput, scalability, and low power consumption, which meet the requirements of the next generation of CPU design. Compared with superscalar, VLIW (very long instruction word), and SMT (simultaneous multithreading) architectures, the experimental results show that for applications with very little parallelism SDF is comparable to the other architectures, while for applications with large amounts of parallelism SDF outperforms them.
Date: December 2004
Creator: Zhang, Yuhua
Partner: UNT Libraries

PROCEEDINGS OF RIKEN BNL RESEARCH CENTER WORKSHOP: HIGH PERFORMANCE COMPUTING WITH QCDOC AND BLUEGENE.

Description: Staff of Brookhaven National Laboratory, Columbia University, IBM and the RIKEN BNL Research Center organized a one-day workshop held on February 28, 2003 at Brookhaven to promote the following goals: (1) To explore areas other than QCD applications where the QCDOC and BlueGene/L machines can be applied to good advantage, (2) To identify areas where collaboration among the sponsoring institutions can be fruitful, and (3) To expose scientists to the emerging software architecture. This workshop grew out of an informal visit last fall by BNL staff to the IBM Thomas J. Watson Research Center that resulted in a continuing dialog among participants on issues common to these two related supercomputers. The workshop was divided into three sessions, addressing the hardware and software status of each system, prospective applications, and future directions.
Date: March 11, 2003
Creator: CHRIST, N.; DAVENPORT, J.; DENG, Y.; GARA, A.; GLIMM, J.; MAWHINNEY, R. et al.
Partner: UNT Libraries Government Documents Department

An experimental evaluation of the parallel I/O systems of the IBM SP and Intel Paragon using a production application

Description: This paper presents the results of an experimental evaluation of the parallel I/O systems of the IBM SP and Intel Paragon. For the evaluation, we used a full, three-dimensional application code that is in production use for studying the nonlinear evolution of Jeans instability in self-gravitating gaseous clouds. The application performs I/O by using library routines that we developed and optimized separately for parallel I/O on the SP and Paragon. The I/O routines perform two-phase I/O and use the PIOFS file system on the SP and PFS on the Paragon. We studied the I/O performance for two different sizes of the application. We found that for the small case, I/O was faster on the SP, whereas for the large case, I/O took almost the same time on both systems. Communication required for I/O was faster on the Paragon in both cases. The highest read bandwidth obtained was 48 Mbytes/sec. and the highest write bandwidth obtained was 31.6 Mbytes/sec., both on the SP.
Date: September 1, 1996
Creator: Thakur, R.; Gropp, W. & Lusk, E.
Partner: UNT Libraries Government Documents Department
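
As a schematic illustration of the two-phase idea mentioned in the record above (aggregate small strided pieces, then let a few processes write large contiguous chunks), here is a simplified mpi4py sketch; the aggregation-to-rank-0 pattern, array sizes, and output file name are assumptions of this sketch, not the authors' optimized SP/Paragon library.

```python
import numpy as np
from mpi4py import MPI

# Simplified two-phase write: phase 1 gathers each rank's piece of the array to
# an aggregator (rank 0 here); phase 2 has the aggregator write one large,
# contiguous block. Sizes and the single-aggregator choice are illustrative.
comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n_global = 1 << 20
local = np.full(n_global // size, float(rank))   # this rank's piece of the data

# Phase 1: communication (reported as faster on the Paragon in the study).
pieces = comm.gather(local, root=0)

# Phase 2: a single large contiguous write instead of many small strided ones.
if rank == 0:
    np.concatenate(pieces).tofile("output.dat")
comm.Barrier()
```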

Achieving high performance in numerical computations on RISC workstations and parallel systems

Description: The nominal peak speeds of both serial and parallel computers are rising rapidly. At the same time, however, it is becoming increasingly difficult to obtain a significant fraction of this peak speed on modern computer architectures. In this tutorial the authors give scientists and engineers involved in numerically demanding calculations and simulations the basic knowledge necessary to write reasonably efficient programs. The basic principles are rather simple and the possible rewards large. Writing a program that takes into account optimization techniques related to the computer architecture can significantly speed up the program, often by factors of 10-100 (a small memory-access-order demonstration follows this record). As such, optimizing a program can, for instance, be a much better solution than buying a faster computer. If a few basic optimization principles are applied during program development, the additional time needed to obtain an efficient program is practically negligible. In-depth optimization is usually only needed for a few subroutines or kernels, and the effort involved is therefore acceptable.
Date: August 20, 1997
Creator: Goedecker, S. & Hoisie, A.
Partner: UNT Libraries Government Documents Department
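
As a small demonstration of the architecture-aware principle described above (access memory in the order it is laid out), the sketch below times summing a large row-major array by contiguous rows versus strided columns; the array size is arbitrary, and the exact speedup depends on the machine.

```python
import time
import numpy as np

# Memory-access-order demo: NumPy arrays are row-major (C order) by default,
# so traversing by rows touches contiguous memory while traversing by columns
# strides through it. The observed ratio depends on the machine and array size.
n = 4000
a = np.random.rand(n, n)

def sum_by_rows(m):
    total = 0.0
    for i in range(m.shape[0]):
        total += m[i, :].sum()        # contiguous access
    return total

def sum_by_cols(m):
    total = 0.0
    for j in range(m.shape[1]):
        total += m[:, j].sum()        # strided access
    return total

for fn in (sum_by_rows, sum_by_cols):
    t0 = time.perf_counter()
    fn(a)
    print(fn.__name__, f"{time.perf_counter() - t0:.3f} s")
```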