Scheduling in Heterogeneous Grid Environments: The Effects of Data Migration
for migrating jobs between grid schedulers. The first
reference algorithm is a centralized scheme that uses a
single grid scheduler interacting with all local schedulers.
The second reference algorithm is a local scheme that has no grid
schedulers and executes all jobs on the compute server
where they are submitted. Details of all five methods are
given in the following subsections.
A. Distributed Algorithms
Our three distributed algorithms are based on these
common primary steps:
• A job j is submitted to a GS on compute server s_i,
and placed in the associated GQ.
• The GS queries the LS on s_i for the approximate
wait time (AWT) of j. AWT is the amount of time
LS estimates j, if submitted to it, will wait in
LQ before beginning to execute. AWT is computed
by simulating the local scheduling policy using
the local jobs that are either running or waiting
in LQ, and j. If LS cannot satisfy the resource
requirements of j, an AWT of infinity is returned.
• The GS compares the AWT for j against a threshold θ.
If the AWT is less than θ, j is moved from GQ to
LQ for execution on s_i. Otherwise, j is retained in
GQ and one of the following three distributed job
migration algorithms is invoked.
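The common steps above can be sketched in Python. This is only a toy illustration: in the paper, AWT comes from simulating the actual local scheduling policy, whereas the stand-in below just sums remaining run times, and the function names, job fields, and threshold value are all hypothetical.

```python
import math

THETA = 60.0  # hypothetical AWT threshold theta, in seconds (tunable)

def approximate_wait_time(job, local_jobs, free_cpus):
    """Toy stand-in for the LS policy simulation: returns infinity if
    j's CPU requirement can never be satisfied, otherwise the summed
    remaining time of local jobs running or waiting in LQ."""
    if job["cpus"] > free_cpus + sum(lj["cpus"] for lj in local_jobs):
        return math.inf
    return sum(lj["remaining"] for lj in local_jobs)

def dispatch(job, local_jobs, free_cpus):
    """Steps 1-3: keep j on the submission server if AWT < theta,
    otherwise hand it to one of the distributed migration algorithms."""
    awt = approximate_wait_time(job, local_jobs, free_cpus)
    return "local" if awt < THETA else "migrate"
```

An unsatisfiable resource request yields an infinite AWT, which always fails the threshold test and forces migration, matching the behavior described above.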
1) Sender-Initiated: In the sender-initiated (S-I) strat-
egy, the GS sends the resource requirements of j to
its peers. In this study, we only consider the CPU and
run time requirements of each job; however, this can
be easily extended to an arbitrary number of resource
constraints. In response to the query, each peer GS
returns the approximate turnaround time (ATT) for j and
the resource utilization (RU) of the associated compute
server. If a peer GS does not respond within a specified
time limit due to traffic congestion or machine failure,
it is simply ignored for that request.
ATT is an estimate of the amount of time it will take
to complete a job. The ATT for j on compute server s_f,
initially submitted to s_i, is derived as follows:

ATT(j, s_f) = max(AWT(j, s_f), ADT(j_in, s_i, s_f)) + ERT(j, s_f) + ADT(j_out, s_f, s_i).
Before j begins to execute, it must wait in a LQ and
transfer its input data to s_f. AWT(j, s_f) is the approximate
wait time of j on s_f, while ADT(j_in, s_i, s_f) is the
approximate data transfer time of j's input data j_in from
s_i to s_f. We assume these activities can be performed
simultaneously, so the maximum of the two constraints
determines when j can begin executing. The job then
runs on s_f with an expected run time of ERT(j, s_f),
and the output data j_out is transferred back to s_i in time
ADT(j_out, s_f, s_i). Note that ERT can vary from one
compute server to another depending on their architectural
designs and program characteristics. We simplify the
calculation of ERT by assuming that run time is only
related to the clock frequency of the compute server.
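Under those definitions, the ATT estimate and the simplified ERT scaling can be written down directly. This is a sketch: the inverse clock-frequency scaling is only the simplification stated above, and the numeric values in the usage are placeholders.

```python
def att(awt_f, adt_in, ert_f, adt_out):
    """ATT(j, s_f) = max(AWT, input ADT) + ERT + output ADT.
    Queue waiting and input staging overlap, so only their max counts."""
    return max(awt_f, adt_in) + ert_f + adt_out

def ert(base_runtime, base_clock_ghz, target_clock_ghz):
    """Simplified ERT: run time scales inversely with clock frequency."""
    return base_runtime * base_clock_ghz / target_clock_ghz
```

For example, a 10 s wait overlapped with a 30 s input transfer, a 100 s run, and a 5 s output transfer gives an ATT of 135 s, since only the larger of the two overlapped activities contributes.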
RU is the fraction of the compute server that is
currently being utilized. We assume our compute servers
have multiple CPUs that are space shared so we calculate
RU as the fraction of CPUs assigned to jobs.
Based on all collected information, the GS at server
s_i, where j is initially submitted, calculates its local ATT
and compares it against the values from each peer that
responded. If the local ATT is within a factor r of
the minimum ATT, j is scheduled for execution on s_i;
otherwise, j is migrated (the migration threshold r acts
as a gate to discourage excessive job movement). In
case multiple machines respond with ATT values that are
within a small tolerance ε, the server with the lowest RU
is chosen to accept j. This heuristic process attempts to
minimize the user's time-to-solution, while using system
utilization as a tiebreaker. We found this approach to be
more effective than simply relying on ATT. The job is
then sent to the LQ (by way of its partner GS and LS) on
the winning compute server. Note that once a job enters
a LQ, it is scheduled and run based exclusively on the
policy of the LS, and cannot migrate to another site.
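The destination choice just described might look like the following sketch. The constant values for the migration threshold r and the tolerance ε are illustrative only, and the response format is an assumption.

```python
R = 1.2    # migration threshold r: stay local if local ATT <= r * min ATT
EPS = 5.0  # tolerance eps: remote ATTs within eps of the minimum tie

def choose_server(local_att, responses):
    """responses maps peer name -> (ATT, RU) for the peers that replied
    within the time limit; returns 'local' or the winning peer's name."""
    if not responses:
        return "local"
    min_att = min(local_att, min(a for a, _ in responses.values()))
    if local_att <= R * min_att:
        return "local"  # the gate that discourages excessive job movement
    # among near-minimal ATTs, the lowest RU breaks the tie
    ties = [(ru, name) for name, (a, ru) in responses.items()
            if a <= min_att + EPS]
    return min(ties)[1]
```

Note that peers that failed to respond are simply absent from the map, mirroring the timeout behavior described earlier.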
2) Receiver-Initiated: The receiver-initiated (R-I) al-
gorithm takes a more passive approach to job migration.
Here, each compute server periodically checks its own
RU at a time interval τ. If the RU is below a threshold
δ, the system volunteers itself for receiving jobs by
informing other machines of its low utilization. Once a
peer GS at server p receives this information, it checks its
GQ to see if any jobs are waiting to be scheduled. If so,
the resource requirements of the first job are sent to the
volunteer server. The underutilized system then responds
with the job's ATT, as well as its own RU. If the ATT
of the volunteer system is lower than that of p (or if the
local and remote ATTs are within the tolerance ε but the
RU of the volunteer is smaller), the job is transferred to
the LQ of that system. Otherwise, the job continues to
wait in the GQ until either its local AWT falls below θ
(examined at a time interval α), or an available machine
again volunteers its services.
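The receiver side's periodic check can be sketched as follows. The threshold value and the broadcast callback are placeholders for whatever messaging layer connects the grid schedulers.

```python
DELTA = 0.5  # RU threshold delta: volunteer when utilization drops below this

def periodic_ru_check(cpus_assigned, cpus_total, broadcast):
    """Runs every tau seconds on each compute server; volunteers the
    server to its peers if it is underutilized."""
    ru = cpus_assigned / cpus_total  # fraction of space-shared CPUs in use
    if ru < DELTA:
        broadcast({"type": "volunteer", "ru": ru})
        return True
    return False
```

The subsequent ATT/RU exchange with the peer GS then proceeds as described above, with the job moving only if the volunteer wins the comparison.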
3) Symmetrically-Initiated: Unlike S-I and R-I, the
symmetrically-initiated (SY-I) algorithm works in both
active and passive modes. As with the R-I strategy, each
machine periodically checks its own RU and broadcasts
a message if it is underutilized. The difference occurs
when the local AWT of a job exceeds θ but no
underutilized machine volunteers its services. In the R-I
approach, the job passively sits in the GQ while waiting
for a volunteer, periodically checking its local AWT
at each interval α. However, the SY-I algorithm
immediately switches to active mode and sends out a
request using the S-I strategy. The main differences in the
Oliker, Leonid; Biswas, Rupak; Shan, Hongzhang & Smith, Warren. Scheduling in Heterogeneous Grid Environments: The Effects of Data Migration. January 1, 2004; Berkeley, California. https://digital.library.unt.edu/ark:/67531/metadc901613/m1/3/