Clusters as large-scale development facilities. Page: 5 of 10
This article is part of the collection entitled: Office of Scientific & Technical Information Technical Reports and was provided to UNT Digital Library by the UNT Libraries Government Documents Department.
counters to network performance data, in order
to gain insight into execution and performance.
• Permission to Stress the System. Occasionally,
large-scale development and testing will tax the
system greatly, sometimes beyond the limits of
system stability. When such system instabilities
occur, the developer is usually interested in
determining the root cause of the problem and
may not consider the instability to be
detrimental. In order to support this type of
activity, expectations across the user community
must be set; users must be aware that the system
will occasionally have instabilities as a result of
user code and that this is acceptable (although
not actually desirable).
Typically, the intent of the basic development user is
to obtain information about performance properties of
applications. We note that, in many cases, computational
users also carry out basic development when they are
developing their computational code and generally have
these same requirements.
3.3. System Development
A more demanding type of user is one who carries out
"system" development. Projects that require system
development capabilities make some kind of modification
to the user nodes or to the system itself that requires some
type of cleanup before the system can be used by other
users.
System developers typically have the same
requirements as basic developers. In addition, they have
one or more of these:
• Root Access. Some developers, such as those
developing device drivers or testing daemons,
require privileged access on the user nodes. On
a system where root privileges can be assigned to
users, the software state on a node can become
untrustworthy. Even when a user is trusted,
honest mistakes can happen, causing a
configuration management nightmare. In effect,
when root access is granted to a user, the node
must be considered untrustworthy and must be
rebuilt when the user is finished. One
implication of this is that the rebuild process
must be robust and efficient. Giving root access
to a user also has security implications, which
are discussed below.
• Specialized Kernels. In some cases, a developer
needs a specific kernel that may be different
from the one installed by default on the node.
As with root access, the node will need to be
rebuilt when the user is done.
• Hardware Management. A user who is
working in this mode is often doing work that
can crash the node. If such a crash takes place,
the user will want access to system hardware in
order to debug or restart the node. At present on
our facility, the user must have access to the
management infrastructure of the cluster, a level
of access that we are not comfortable making
generally available. In practice, this situation
comes up rarely, and in such cases it has been
possible to have a system administrator
participate in the debugging activities. If this
becomes a bigger issue in the future, we will
need to develop a more general solution.
Again, the intent of this kind of user is to carry out
development and to test an application's properties, such
as stability and performance, rather than to generate
numerical results.
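
The rebuild rule stated above for root access and specialized kernels can be sketched as a small state tracker. This is only an illustration under stated assumptions, not the facility's actual tooling: the names `Node`, `grant_root`, `install_kernel`, and `release` are hypothetical, and the "rebuild" here simply resets a record rather than reinstalling anything.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical record of one cluster node's trust state."""
    name: str
    trusted: bool = True  # False once the software state may have changed
    reasons: list = field(default_factory=list)

    def grant_root(self, user: str) -> None:
        # Root access makes the node untrustworthy, even for a trusted user.
        self.trusted = False
        self.reasons.append(f"root access granted to {user}")

    def install_kernel(self, kernel: str) -> None:
        # A non-default kernel also forces a rebuild when the user is done.
        self.trusted = False
        self.reasons.append(f"non-default kernel {kernel}")

    def release(self) -> str:
        """Return the action to take when the user is finished."""
        if self.trusted:
            return "return to pool"
        # Stand-in for a real rebuild: reset the record to the standard state.
        self.trusted = True
        self.reasons.clear()
        return "rebuild"
```

The point of the sketch is that trust is lost the moment privileged access is granted, not when a problem is detected, which is why the rebuild step has to be robust and cheap enough to run after every such allocation.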
3.4. Extreme Development
The "extreme" developer is one who is developing or
packaging up a complete operating system or is working
on clusterwide system services. Most extreme
developers have the same requirements as the previous
types of developers but have one or both of the following
objectives as well:
• User-Defined Node Software. The user
provides an operating system in some form that
can be installed on the nodes allocated to them.
In order for this to be successful, these images
usually need to meet certain requirements: they
must be able to use the facility's network, which
opens up a number of issues related to node
identification. Nodes typically need to set up
trust relationships with other nodes in the same
project. These issues create a number of
technical challenges that are discussed in section
4. Once the user is done with the nodes, they
will need to be rebuilt into a standard
configuration.
• Dynamic System Services. Some projects
eventually mature to the point that they can be
installed as a part of the cluster fabric for serious
testing. Examples of such projects include
naming services, mapping services, grid
software, and file systems. In all such cases so
far, we have had the system managers of the
facility get directly involved in the project in
order to determine specific goals, testing
procedures, and fallback plans. Perhaps the
trickiest issue here is that these types of activity
tend to destabilize the system infrastructure, once
again requiring that the user community have the
correct expectations for system reliability.
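
The way each developer type builds on the previous one can be summarized in a short sketch. It is a reading aid under stated assumptions rather than part of the facility's software: `DeveloperClass`, `ADDED`, `possible_requirements`, and `needs_rebuild` are hypothetical names, and the basic developer's requirements are represented by a single placeholder entry. Because the text says a developer has "one or more" or "one or both" of the listed items, the function returns what a class may need, not what every project in it needs.

```python
from enum import IntEnum


class DeveloperClass(IntEnum):
    BASIC = 1
    SYSTEM = 2
    EXTREME = 3


# Requirements each class may add on top of the classes before it,
# using the labels from the text.
ADDED = {
    DeveloperClass.BASIC: ["basic development requirements"],  # placeholder
    DeveloperClass.SYSTEM: [
        "root access",
        "specialized kernels",
        "hardware management",
    ],
    DeveloperClass.EXTREME: [
        "user-defined node software",
        "dynamic system services",
    ],
}

# Items the text says end with the node being rebuilt.
REBUILD_TRIGGERS = {
    "root access",
    "specialized kernels",
    "user-defined node software",
}


def possible_requirements(cls: DeveloperClass) -> list:
    """List everything a developer of this class may require."""
    result = []
    for level in DeveloperClass:  # iterates in definition order
        if level <= cls:
            result.extend(ADDED[level])
    return result


def needs_rebuild(requested) -> bool:
    """True if any requested item forces a node rebuild afterwards."""
    return bool(REBUILD_TRIGGERS & set(requested))
```

For example, `needs_rebuild(["hardware management"])` is false in this sketch, since the text ties rebuilding to root access, specialized kernels, and user-provided node software rather than to hardware access itself.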
Evard, R.; Desai, N.; Navarro, J. P. & Nurmi, D. Clusters as large-scale development facilities., article, July 1, 2002; Illinois. (https://digital.library.unt.edu/ark:/67531/metadc741653/m1/5/: accessed May 5, 2024), University of North Texas Libraries, UNT Digital Library, https://digital.library.unt.edu; crediting UNT Libraries Government Documents Department.