On Batch-Processing Based Coded Computing for Heterogeneous Distributed Computing Systems

In recent years, coded distributed computing (CDC) has attracted significant attention, because it can efficiently protect many delay-sensitive computation tasks against unexpected latencies in distributed computing systems. Despite such a salient feature, many design challenges and opportunities remain. In this paper, we focus on practical computing systems with heterogeneous computing resources, and design a novel CDC approach, called batch-processing based coded computing (BPCC), which exploits the fact that every computing node can obtain some coded results before it completes the whole task. To this end, we first describe the main idea of the BPCC framework, and then formulate an optimization problem for BPCC to minimize the task completion time by configuring the computation load. Through formal theoretical analyses, extensive simulation studies, and comprehensive real experiments on Amazon EC2 computing clusters, we demonstrate the promising performance of the proposed BPCC scheme in terms of high computational efficiency and robustness to uncertain disturbances.


INTRODUCTION
In recent years, distributed computing has been widely adopted to perform various computation tasks in different computing systems [1]-[3]. For instance, to perform big data analytics in cloud computing systems, MapReduce [4] and Apache Spark [5] are two prevalent modern distributed computing frameworks that process data on the order of petabytes.
Despite the importance of distributed computing, many design challenges remain. One major challenge is that many computing frameworks are vulnerable to uncertain disturbances, such as node/link failures, communication congestion, and slow-downs [6]. Such disturbances, which can be modeled as stragglers that are slow or even fail in returning results, have been observed in many large-scale computing systems such as cloud computing [7], mobile edge computing [8], and fog computing [9]. A variety of solutions have been developed in the literature to address stragglers. For example, the authors of [10] proposed to identify and blacklist nodes that are in bad health and to run tasks only on well-performing nodes. However, empirical studies show that stragglers can occur in non-blacklisted nodes as well [11], [12]. As another type of solution, delayed computation tasks can be re-executed in a speculative manner [4], [10], [13], [14]. Nevertheless, such speculative execution techniques have to wait to collect the performance statistics of the tasks before generating speculative copies and thus have limitations in dealing with small jobs [12]. To avoid waiting and predicting stragglers, the authors of [12], [15] suggested executing multiple clones of each task and using the results generated by the fastest clones. Although their results show the promising performance of this approach in reducing the average completion time of small jobs, the extra resources required for launching clones can be considerable, because multiple clones are executed for each task.
Instead of directly replicating the whole task, coding techniques can be adopted to introduce arbitrary redundancy into the computation in a systematic way. However, until a few years ago, coding techniques were mostly known for their capability to improve the resilience of communication, storage, and cache systems to uncertain disturbances [16]-[18]. Lee et al. [19], [20] presented the first coded distributed computing (CDC) scheme to speed up matrix multiplication and data shuffling. Since then, CDC has attracted significant attention in the distributed computing community. Although a variety of CDC schemes have been developed to solve different computation problems, most of these schemes assume homogeneous computing nodes, which is not common in realistic scenarios. Moreover, they require each worker node to first complete its computation task and then send back the whole result to the master node, which introduces significant delays [19]-[22].
In this paper, we focus on a classical CDC task, matrix-vector multiplication, and propose a novel coding scheme, called batch-processing based coded computing (BPCC), to improve the computational efficiency of general distributed computing systems with heterogeneous computing nodes and to enhance their robustness to uncertain disturbances. Unlike most existing CDC schemes, BPCC allows each node to return partial computing results to the master node in batches before the whole computation task is completed, and therefore achieves lower latency. Also worthy of note is that the partial results can be used to generate approximate solutions, e.g., by applying the singular value decomposition (SVD) approach in [23], which is very useful for applications that require timely but not necessarily optimal decisions, such as emergency response. To the best of our knowledge, such a BPCC framework has not been fully investigated in the literature.
This paper extends our earlier work presented in [24], and further makes the following new contributions.
1) An optimal load allocation strategy. For systems with heterogeneous computing nodes, equally distributing the computation load may lead to poor performance. To optimize the computational efficiency, we formulate an optimization problem for general BPCC under the assumption that the processing time of each computing node follows a shifted exponential distribution. To solve this optimization problem, we formulate alternative optimization problems, based on which we design an optimal load allocation scheme that assigns a proper amount of load to each node to achieve the minimal expected task completion time.

2) Comprehensive theoretical analyses. We conduct formal theoretical analyses to prove the asymptotic optimality of BPCC and to characterize the impact of its key parameter. We also prove that it outperforms a state-of-the-art CDC scheme for heterogeneous systems, called Heterogeneous Coded Matrix Multiplication (HCMM) [21], [22].

3) Extensive simulation and real experimental studies.
To further demonstrate the performance of BPCC, we compare it with three benchmark schemes, namely the Uniform Uncoded, Load-Balanced Uncoded, and HCMM schemes. The simulation results show the impact of BPCC parameters, including the number of batches and the number of worker nodes. Specifically, the efficiency of BPCC improves as the number of batches increases, and the solution of BPCC becomes optimal as the number of worker nodes approaches infinity. A sensitivity study shows the performance of BPCC when the parameters in the computing model take erroneous values. Moreover, the simulation results also demonstrate that BPCC can improve computing performance by reducing the latency by up to 73%, 56%, and 34% over the aforementioned three benchmark schemes, respectively. In the real experiments, we test all distributed computing schemes on Amazon EC2 computing clusters. In particular, we deploy a heterogeneous computing cluster that consists of different machine instances in Amazon EC2. The results show that our BPCC scheme is more efficient and robust to uncertain disturbances than the benchmark schemes.
The rest of this paper is organized as follows. Section 2 presents the related work. In Section 3, we introduce the system model and the BPCC framework, and then formulate an optimization problem for BPCC. To solve this optimization problem, in Section 4, we provide a two-step alternative formulation, based on which we design the BPCC scheme and conduct theoretical analyses to prove its optimality and to understand the impact of its parameter. We then present extensive simulation and experimental results in Section 5 and Section 6, respectively, before concluding the paper in Section 7. For better readability, we move the proofs of all lemmas, theorems, and corollaries to the Appendix.

RELATED WORK
Following the seminal work in [17], [19], [20], many different computation problems have been explored using codes, such as gradient computation [25], large matrix-matrix multiplication [26], linear inverse problems [27], and nonlinear operations [28]. Other relevant coded computation solutions include the "Short-Dot" coding scheme [29], which offers computation speed-up by introducing additional sparsity into the coded matrices, and the unified coded framework [30], [31], which achieves a trade-off between communication load and computation latency.
While most CDC schemes consider homogeneous computing nodes, there have been a few recent studies that investigated CDC over heterogeneous computing clusters. In particular, Kim et al. [32], [33] considered the matrix-vector multiplication problem and presented an optimal load allocation method that achieves a lower bound of the expected latency. Reisizadeh et al. [21] introduced a different approach, namely Heterogeneous Coded Matrix Multiplication (HCMM), that maximizes the expected amount of computing results aggregated at the master node. In [21], [22], the authors proved that HCMM is asymptotically optimal under the assumption that the processing time of each computing node follows a shifted exponential or Weibull distribution. Also of interest, Keshtkarjahromi et al. [34] considered the scenario in which computing nodes have time-varying computing power, and introduced a coded cooperative computation protocol that allocates tasks in a dynamic and adaptive manner. Narra et al. [35] also developed an adaptive load allocation scheme and utilized an LSTM-based model to predict the computation capability of the worker nodes.
To reduce the output delay, there have been some attempts to enable early return of partial results [23], [36], [37]. In particular, an anytime coding technique was introduced in [23], which adopts the SVD to allow early output of approximate results. Also of interest is the study presented in [36], which introduced a hierarchical approach to address a limitation of the above coding techniques, namely that they wastefully ignore the work completed by slow worker nodes. In particular, to better utilize the work completed by each worker node, it partitions the total computation at each worker node into layers of sub-computations, with each layer encoding part of the job. It then processes each layer sequentially. The final result can be obtained after the master node recovers all layers. The simulation results demonstrate the effectiveness of this approach in reducing the computation latency. However, as the worker nodes have to process the layers in the same order, the results obtained by slow worker nodes for layers that have already been recovered are useless. Furthermore, this approach, like the aforementioned approaches, assumes homogeneous computing nodes. Another relevant study is presented in [37], which introduced a rateless fountain coding scheme that can utilize partial results returned by worker nodes.

SYSTEM MODELS
In this section, we first introduce the computing system for distributed matrix-vector multiplication. We then illustrate three computing schemes, including the proposed batch-processing based coded computing (BPCC) scheme. Finally, we formulate an optimization problem for BPCC.

Computing System
We consider a distributed computing system that consists of one master node and N (N ∈ Z^+) computing nodes, a.k.a. worker nodes. Using this system, we investigate how to quickly solve a matrix-vector multiplication problem, which is one of the most basic building blocks of many computation tasks. Specifically, we consider a matrix-vector multiplication problem y = Ax, where y ∈ R^r is the output vector to be calculated, x ∈ R^m is the input vector to be distributed from the master node to the workers, and A ∈ R^{r×m} is an r × m matrix pre-stored in the system. Both r and m can be very large, which implies that calculating Ax at a single computing node is not feasible. Finally, we define [n] = {1, 2, ..., n}, where n is an arbitrary positive integer, i.e., n ∈ Z^+.

Uncoded Distributed Computing
To solve the above problem, a traditional distributed computing scheme divides the matrix A into a set of sub-matrices A_1, A_2, ..., A_N, and pre-stores each sub-matrix A_i ∈ R^{ℓ_i×m} in computing node i, where ℓ_i ∈ Z^+, ∀i ∈ [N], and Σ_{i=1}^N ℓ_i = r. Upon receiving the input vector x, the master node sends x to all worker nodes. Each worker node i then computes y_i = A_i x and returns the result to the master node. After all results are received, the master node aggregates them and outputs y = [y_1^T, y_2^T, ..., y_N^T]^T, where T stands for transpose.

Due to the existence of uncertain system disturbances, the uncoded computing scheme may defer or even fail the computation, because the delay or loss of any y_i, i ∈ [N], will affect the calculation of the final result y = Ax. To address this issue, more computing nodes can be used to perform distributed computing. For instance, the master node can have two or more computing nodes compute each y_i. This approach, however, is not efficient, because the cost can be unnecessarily large.
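As a concrete illustration, the uncoded scheme above can be sketched in a few lines. This is a toy example with hypothetical sizes r, m, N and loads ℓ_i; numpy arrays stand in for the data held by the distributed workers.

```python
import numpy as np

rng = np.random.default_rng(0)
r, m, N = 12, 5, 3
A = rng.standard_normal((r, m))   # pre-stored matrix
x = rng.standard_normal(m)        # input vector sent to all workers

ell = [4, 4, 4]                   # per-worker loads l_i, summing to r
splits = np.cumsum(ell)[:-1]
A_parts = np.split(A, splits, axis=0)   # A_1, ..., A_N pre-stored at workers

# Each worker i computes y_i = A_i x; the master concatenates the results.
y_parts = [A_i @ x for A_i in A_parts]
y = np.concatenate(y_parts)

assert np.allclose(y, A @ x)      # master recovers y = Ax
```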

Coded Distributed Computing (CDC)
In recent years, a more efficient computing paradigm, CDC, has been introduced to tackle the issue of uncertain disturbances. There are many CDC schemes in the literature, and we consider a generic CDC scheme as follows.
In this CDC scheme, A will first be used to calculate a larger matrix Â ∈ R^{q×m} with more rows, i.e., q > r, by computing Â = HA, where H ∈ R^{q×r} is the encoding matrix with the property that any r of its row vectors are linearly independent of each other [28]. In other words, we can use any r rows of H to create an r × r full-rank matrix. Note that this encoding procedure is performed offline, and Â can be considered to be pre-stored in the system. Similar to the uncoded computing scheme, the matrix Â can then be divided into N sub-matrices Â_1, Â_2, ..., Â_N, where Â_i ∈ R^{ℓ_i×m}, ∀i ∈ [N], Σ_{i=1}^N ℓ_i = q, and each worker node i calculates ŷ_i = Â_i x.
Different from the uncoded computing scheme, the master node does not need to wait for all worker nodes to complete their calculations, because it can recover Ax once the total number of rows of the received results is equal to or larger than r. In particular, suppose the master node has received ŷ_b ∈ R^r by a certain time t. It can first infer that ŷ_b = Ĥ_b A x, where Ĥ_b ∈ R^{r×r} is the sub-matrix of the encoding matrix H corresponding to ŷ_b. The master node can then calculate the final result by

y = Ax = Ĥ_b^{-1} ŷ_b. (1)
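The encoding and decoding steps can be sketched as follows. The random Gaussian encoding matrix here is only an illustrative choice (it satisfies the any-r-rows full-rank property with probability one); the paper does not prescribe a specific H.

```python
import numpy as np

rng = np.random.default_rng(1)
r, m, q = 8, 4, 12             # q > r coded rows for redundancy
A = rng.standard_normal((r, m))
x = rng.standard_normal(m)

# Encoding (offline): a random Gaussian H has any r rows linearly
# independent with probability one.
H = rng.standard_normal((q, r))
A_hat = H @ A                   # coded matrix, distributed across workers

# Suppose the master has received the coded rows indexed by `got`
# (any r of the q coded rows suffice, regardless of which workers sent them).
got = rng.choice(q, size=r, replace=False)
H_b = H[got]                    # r x r sub-matrix of H, invertible
y_hat_b = A_hat[got] @ x        # received coded results

# Recover y = Ax by solving H_b y = y_hat_b, i.e., y = H_b^{-1} y_hat_b
y = np.linalg.solve(H_b, y_hat_b)
assert np.allclose(y, A @ x)
```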

BPCC
In the literature, most existing CDC schemes assume that each worker node i sends the complete ŷ_i to the master node when it is ready, which may incur large delays. To further speed up the computation, we propose the novel BPCC scheme, whose main idea is to allow each worker node to return partial results to the master node.
Specifically, we consider that each worker node i equally divides the pre-stored encoded matrix Â_i row-wise into p_i sub-matrices, called batches, where p_i ∈ Z^+ is the number of batches and p_i ≤ ℓ_i. Except for the last batch, each batch has b_i = ⌈ℓ_i/p_i⌉ rows. After receiving the input vector x from the master node, worker node i multiplies each batch with x and sends back each partial result once it is available. Suppose that the master node receives s_i(t) batches from worker node i by time t, where 0 ≤ s_i(t) ≤ p_i. It can then recover the final result by using Eq. (1) once Σ_{i=1}^N min(ℓ_i, s_i(t) b_i) ≥ r.
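The recovery condition above can be sketched as a small helper function (with hypothetical toy values; batch sizes follow the ceiling rule):

```python
from math import ceil

def can_decode(ell, p, s, r):
    """Check the BPCC recovery condition: the master can recover Ax once
    sum_i min(ell_i, s_i * b_i) >= r, where b_i = ceil(ell_i / p_i) is the
    batch size and s_i is the number of batches received from worker i."""
    total = 0
    for ell_i, p_i, s_i in zip(ell, p, s):
        b_i = ceil(ell_i / p_i)
        total += min(ell_i, s_i * b_i)   # last batch may be smaller than b_i
    return total >= r

# Toy example: r = 10 coded rows needed, three workers with 6 rows each.
assert can_decode(ell=[6, 6, 6], p=[3, 3, 3], s=[2, 2, 1], r=10)
assert not can_decode(ell=[6, 6, 6], p=[3, 3, 3], s=[1, 1, 0], r=10)
```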

Problem Formulation
In the previous sub-section, we introduced the key idea of the BPCC scheme. In the following study, we focus on optimizing the performance of BPCC. Specifically, we aim to minimize the task completion time, which is achieved by allocating a proper computation load (i.e., ℓ_i) to each worker node.
We now define T as the amount of time to complete a computation task. Given the number of batches for each worker node, p = (p_1, p_2, ..., p_N), where p_i ∈ Z^+, ∀i ∈ [N], the optimization problem can be formulated as follows:

P_main: min_ℓ E[T], s.t. ℓ_i ∈ Z^+ and ℓ_i ≥ p_i, ∀i ∈ [N], (2)

where ℓ = (ℓ_1, ℓ_2, ..., ℓ_N) denotes the load allocation.
To facilitate further analysis, we assume that the computation task scales with N, i.e., r = Θ(N). Next, we assume that the computing nodes are fixed with time-invariant computation capabilities, and that the network maintains a stable communication delay during the computing process.
We now consider the behavior of the waiting time, which is defined as the duration from the time that the master node distributes x to the time that it receives a certain result. For BPCC, we let T_{k,i} be the waiting time for the master node to receive k batches from worker node i, k ∈ Z^+. Clearly, T_{k,i} can be modeled as a random variable following a certain probability distribution. Following the modeling technique used in recent studies [19], [20], [22], [36], we consider that T_{k,i} follows a shifted exponential distribution defined below:

Pr[T_{k,i} ≤ t] = 1 − e^{−(μ_i/ℓ_i)(t − α_i k b_i)}, for t ≥ α_i k b_i, (3)

where μ_i and α_i are the straggling and shift parameters, respectively, and μ_i and α_i are positive constants for all i ∈ [N]. Furthermore, we assume that T_{k,i} is independent of T_{k′,j}, ∀j ∈ [N], j ≠ i, k′ ∈ Z^+. Based on the above definitions and assumptions, we see that T must satisfy Σ_{i=1}^N s_i(T) b_i ≥ r. In the following sections, we will first discuss how to solve the optimization problem, for which we will conduct theoretical analyses to show the optimality and advantages of BPCC. We will then conduct extensive simulation and real experimental studies to validate the assumptions and to evaluate the performance of the optimization algorithm.
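For intuition, the following sketch draws waiting-time samples from a generic shifted exponential distribution. The concrete rate and shift values are illustrative stand-ins for the node- and load-dependent quantities determined by μ_i, α_i, and the assigned load.

```python
import numpy as np

def sample_shifted_exp(rate, shift, size, rng):
    """Draw waiting times from a shifted exponential distribution:
    Pr[T <= t] = 1 - exp(-rate * (t - shift)) for t >= shift."""
    return shift + rng.exponential(1.0 / rate, size)

rng = np.random.default_rng(2)
samples = sample_shifted_exp(rate=2.0, shift=1.5, size=200_000, rng=rng)

assert samples.min() >= 1.5                       # support starts at the shift
assert abs(samples.mean() - (1.5 + 0.5)) < 0.01   # mean = shift + 1/rate
```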

MAIN RESULTS
In this section, we aim to solve the optimization problem P_main. In particular, we first provide a simplified formulation, to which we then apply a two-step alternative formulation. Next, we show how to solve the alternative problems and prove the optimality of the solution. We then analyze the impact of the parameters p_i, ∀i ∈ [N], on the solution, and finally prove that this solution outperforms a recent CDC scheme without batch processing.

Notations for Asymptotic Analysis
For any two given functions f(n) and g(n), f(n) = Θ(g(n)) if and only if there exist positive constants c_1, c_2, and n_0 such that c_1 g(n) ≤ f(n) ≤ c_2 g(n) for all n ≥ n_0.

A Simplified Formulation
We relax the constraint ℓ_i ∈ Z^+ to ℓ_i ≥ 0, ∀i ∈ [N], to simplify the analysis. We also remove the constraint ℓ_i ≥ p_i by assuming that each p_i ∈ Z^+ is properly selected such that the optimal solution satisfies this constraint. Consequently, the problem in Eq. (2) can be reformulated as follows:

min_ℓ E[T], subject to ℓ_i ≥ 0, ∀i ∈ [N].

Once the above problem is solved, we can round each optimal load number ℓ_i up to its nearest integer using the ceiling function (denoted as ⌈·⌉). Note that the effect of this rounding step is negligible in practical applications with large load numbers, such as those considered in our simulation and experimental studies [22]. In cases where the derived load number ℓ_i is smaller than p_i, we reduce the value of p_i until this assumption holds. Note that we can always find such a p_i that satisfies the constraint, as the derived load number ℓ_i is always larger than or equal to 1.

A Two-Step Alternative Formulation
To solve the above problem, which is NP-hard, we provide a two-step alternative formulation inspired by [22]. We will show later that this alternative formulation provides an asymptotically optimal solution to problem P_main.
The key idea of the two-step alternative formulation is to first maximize the amount of results accumulated at the master node by a feasible time t, i.e., t ≥ max_i{α_i ℓ_i}, and then minimize the time t such that a sufficient amount of results is available to recover the final result. In particular, we let s_i(t) b_i be the amount of results received by the master node from worker node i by time t, where b_i = ℓ_i/p_i is the batch size, and let S(t) = Σ_{i=1}^N s_i(t) b_i. For a feasible time t, we first maximize the expected amount of results received by the master node, through solving the following problem:

P^(1)_alt: max_ℓ E[S(t)], s.t. ℓ_i ≥ 0, ∀i ∈ [N].

After obtaining the solution to P^(1)_alt, denoted as ℓ*(t) = (ℓ*_1(t), ..., ℓ*_N(t)), we then minimize the time t such that there is a high probability that the results received by the master node by time t are sufficient to recover the final result, by solving

P^(2)_alt: min t, s.t. Pr[S*(t) ≥ r] ≥ 1 − o(1),

where S*(t) is the amount of results received by the master node by time t for load allocation ℓ*(t).

Solution to the Two-Step Alternative Problem
To solve the two-step alternative problem, we first consider P^(1)_alt. Note that the expected amount of results received by the master node by time t is

E[S(t)] = Σ_{i=1}^N Σ_{k=0}^{p_i} k b_i Pr[s_i(t) = k], (4)

where s_i(t) is an integer in the range 0 ≤ s_i(t) ≤ p_i, and Pr[s_i(t) = k], the probability that the master node receives exactly k batches from worker node i, can be computed from the distribution in Eq. (3), as given in Eq. (5). The solution to P^(1)_alt can then be obtained by solving Eq. (6) for each i ∈ [N], which yields ℓ*_i(t) = t/λ_i, where λ_i is the positive solution to Eq. (7) and is a constant independent of t. To show that Eq. (7) has a single positive solution, we can define an auxiliary function f_i(x) for each i and rewrite Eq. (7) as f_i(x) = 1. We can see that f_i(x) decreases monotonically with the increase of x when x > 0. We can also find that f_i(0) = e^{μ_i α_i} > 1 and f_i(∞) = 0. Based on these statements, we know that a unique λ_i exists and can be efficiently computed using a numerical approach. Next, we show in Lemma 1 that λ_i has a closed-form infimum and supremum.
Lemma 1. Let λ_i, i ∈ [N], be the positive solution to Eq. (7). Its infimum, inf λ_i, is given in closed form by Eq. (8). In addition, its supremum, sup λ_i, is given by Eq. (9), which is attained when p_i = 1, where W(·) is the Lambert W function [38].
From Lemma 1, we can further derive a lower bound on the feasible time t. Next, we solve P^(2)_alt. Since this problem is also NP-hard, we here provide an approximated solution. In particular, we approximate its optimal solution, denoted as t*, with the value τ*, such that the expected amount of results accumulated at the master node by time τ* equals the amount of results required for recovering the final result, i.e., E[S*(τ*)] = r. To find the value of τ*, we use the load allocation ℓ*_i(t) in Eq. (6) to express the expected amount of results received by the master node in closed form. We can then find the solution to Eq. (10) as given in Eq. (12), i.e., τ* = r/β, which is also a constant; here β is the constant given by Eq. (13).

Algorithm 1: BPCC
1: for i = 1 : N do
2:    Calculate λ_i by solving Eq. (7)
3: Calculate β by using Eq. (13)
4: for i = 1 : N do
5:    Calculate ℓ*_i by using Eq. (14)
Combining the solutions to P^(1)_alt and P^(2)_alt, we can then derive the load allocation ℓ*_i = τ*/λ_i, ∀i ∈ [N], as given in Eq. (14). The procedures of BPCC are summarized in Algorithm 1.

Optimality Analysis
In this sub-section, we conduct theoretical analyses to investigate the performance of BPCC. Specifically, we first show in Lemma 2 the optimality of the approximated solution τ* to P^(2)_alt. We then show in Theorem 3 that the solution provided by BPCC is asymptotically optimal. Finally, we show in Theorem 4 the accuracy of τ* in approximating the expected execution time of BPCC.

Lemma 2. Let t* be the optimal solution to P^(2)_alt, and τ* be the approximated solution given by Eq. (12). If the batch processing time follows the shifted exponential distribution in Eq. (3) and r = Θ(N), then τ* converges to t* as N → ∞.

Based on Lemma 2, we next show the asymptotic optimality of BPCC in Theorem 3.

Theorem 3. Consider problem P_main with the batch processing time following the shifted exponential distribution in Eq. (3) and r = Θ(N). Let E[T_BPCC] and E[T_OPT] be the expected execution time of BPCC and the optimal value of P_main, respectively. Then BPCC is asymptotically optimal, i.e., lim_{N→∞} E[T_BPCC]/E[T_OPT] = 1.

Theorem 3 and Lemma 2 further lead to the following theorem.

Theorem 4. Consider problem P_main with the batch processing time following the shifted exponential distribution in Eq. (3) and r = Θ(N). Then the approximated execution time τ* converges to the expected execution time E[T_BPCC] as N → ∞.

Analysis of the Impact of Parameter p
In the BPCC scheme shown in Algorithm 1, we note that p is the only parameter that can be tuned, while the other parameters, including r, N, μ, and α, are determined by the specific computation task and the properties of the distributed computing system. In this sub-section, we analyze the impact of this important parameter p on the performance of BPCC in Theorem 5. We then show in Theorem 6 that the approximated execution time of BPCC, i.e., τ* given by Eq. (12), has a closed-form infimum and supremum.
Theorem 5. Consider problem P_main with the batch processing time following the shifted exponential distribution in Eq. (3) and r = Θ(N). Let τ* be the approximated execution time of BPCC given by Eq. (12). Then the increase of any p_i, i ∈ [N], will cause τ* to decrease.
Theorem 6. Consider problem P_main with the batch processing time following the shifted exponential distribution in Eq. (3) and r = Θ(N). Let τ* be the approximated execution time of BPCC given by Eq. (12). Then τ* has a closed-form infimum, which is approached as p_i → ∞, ∀i ∈ [N], and a closed-form supremum, which is attained when p_i = 1, ∀i ∈ [N]. Here sup λ_i is given by Eq. (9).
From Theorem 6 and Eq. (14), we can derive the following corollary.

Corollary 6.1. Consider problem P_main with the batch processing time following the shifted exponential distribution in Eq. (3) and r = Θ(N). Let ℓ*_i be the solution of BPCC given by Eq. (14). Then, when the approximated execution time τ* of BPCC given by Eq. (12) converges to its infimum, ℓ*_i converges to a closed-form limit ℓ̂_i.

Comparison with HCMM
In this sub-section, we compare the performance of BPCC with HCMM [22], a state-of-the-art CDC scheme for heterogeneous worker nodes, and show that BPCC outperforms HCMM in computational efficiency.
HCMM can be considered a special case of BPCC with p_i = 1, ∀i ∈ [N]. It assigns each worker node i the load ℓ_H,i = r/(β_H λ_H,i), where λ_H,i is the positive solution to e^{μ_i λ_H,i} = e^{α_i μ_i}(μ_i λ_H,i + 1) and β_H = Σ_{i=1}^N μ_i/(1 + μ_i λ_H,i). Theorem 7 shows that BPCC is more efficient than HCMM.

Theorem 7. Consider problem P_main with the batch processing time following the shifted exponential distribution in Eq. (3) and r = Θ(N). Let T_BPCC and T_HCMM be the execution times of BPCC and HCMM, respectively. Then E[T_BPCC] ≤ E[T_HCMM].
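The defining equation of λ_H,i has a unique positive root and can be solved numerically, e.g., by bisection, as sketched below. The bracketing strategy and iteration count are implementation choices of ours, not prescribed by the paper.

```python
import math

def solve_lambda(mu, alpha):
    """Solve e^{mu*lam} = e^{alpha*mu} * (mu*lam + 1) for the unique
    positive root lam by bisection. The gap g(lam) = lhs - rhs satisfies
    g(0) = 1 - e^{alpha*mu} < 0 and g(lam) -> +inf, so a sign change exists."""
    g = lambda lam: math.exp(mu * lam) - math.exp(alpha * mu) * (mu * lam + 1)
    lo, hi = 0.0, 1.0
    while g(hi) < 0:          # expand the bracket until the root is enclosed
        hi *= 2.0
    for _ in range(100):      # bisect to high precision
        mid = 0.5 * (lo + hi)
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

mu, alpha = 5.0, 0.2
lam = solve_lambda(mu, alpha)
assert lam > 0
# residual of the defining equation is essentially zero at the root
assert abs(math.exp(mu * lam) - math.exp(alpha * mu) * (mu * lam + 1)) < 1e-6
```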

SIMULATION STUDIES
In this section, we conduct simulation studies to evaluate the performance of the proposed BPCC scheme. Specifically, we first explain the simulation settings, including the distributed computing schemes and scenarios. We then elaborate on the impact of important parameters, including p_i, N, μ_i, and α_i, on the performance of the BPCC scheme. Finally, we compare the proposed BPCC scheme with the benchmark schemes, including the state-of-the-art HCMM scheme [22].

Distributed Computing Schemes
In this study, we consider four distributed computing schemes:
• Uniform Uncoded: This method equally divides the computation load among the worker nodes, i.e., ℓ_i = r/N, ∀i ∈ [N].
• Load-Balanced Uncoded [22]: This method divides the computation load according to the computing capabilities of the worker nodes. In particular, the computation load assigned to each worker node i is inversely proportional to the expected time for this node to compute an inner product, i.e., ℓ_i ∝ μ_i/(μ_i α_i + 1) and Σ_{i=1}^N ℓ_i = r.
• HCMM [22]: In this method, the load assignment method in [22] is used. Note that this is a special case of Algorithm 1 in which p_i = 1, ∀i ∈ [N]; in this case, HCMM and BPCC have exactly the same load allocation for each worker.
• BPCC: In this scheme, Algorithm 1 is used, where the batch numbers p are the parameters to configure.
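As an illustration of the Load-Balanced Uncoded rule, the allocation ℓ_i ∝ μ_i/(μ_i α_i + 1) with Σ ℓ_i = r can be computed as follows. The integer rounding of leftover rows is our own choice, not specified in the paper.

```python
import numpy as np

def load_balanced_alloc(mu, alpha, r):
    """Load-Balanced Uncoded allocation: l_i proportional to
    mu_i / (mu_i * alpha_i + 1), i.e., inversely proportional to the
    expected per-row time alpha_i + 1/mu_i, with sum(l_i) = r."""
    mu, alpha = np.asarray(mu, float), np.asarray(alpha, float)
    w = mu / (mu * alpha + 1.0)
    ell = np.floor(r * w / w.sum()).astype(int)
    # hand the rows lost to flooring to the fastest workers (our choice)
    for i in np.argsort(-w)[: r - ell.sum()]:
        ell[i] += 1
    return ell

ell = load_balanced_alloc(mu=[10.0, 2.0, 1.0], alpha=[0.1, 0.5, 1.0], r=100)
assert ell.sum() == 100
assert ell[0] > ell[2]   # faster workers receive more rows
```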

Computation Scenarios
To evaluate the performance of different distributed computing schemes, we consider the following four computation scenarios:

Simulation Method
In our simulation, we implement all of the aforementioned distributed computing schemes in MATLAB. We assume that the processing time of each node follows the shifted exponential distribution in Eq. (3). Specifically, for each experiment of a scenario, we choose the straggling parameters μ_i, ∀i ∈ [N], randomly in [1, 50], and set each shift parameter α_i = 1/μ_i. In each experiment, we run every distributed computing scheme 100 times, and in each run the computing time of a node is simulated by using its straggling and shift parameters.
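A minimal sketch of such a simulation run is given below (in Python rather than MATLAB). For illustration it assumes the whole-load variant of the model, i.e., the p_i = 1 case in which worker i finishes its ℓ_i rows after a shift α_i ℓ_i plus an exponential term with rate μ_i/ℓ_i; the Uniform Uncoded task ends when the slowest worker finishes.

```python
import numpy as np

rng = np.random.default_rng(3)
N, r, runs = 20, 10_000, 100

# Straggling parameters drawn uniformly from [1, 50]; alpha_i = 1/mu_i.
mu = rng.uniform(1.0, 50.0, N)
alpha = 1.0 / mu

# Uniform Uncoded: every worker gets l_i = r / N rows.
ell = np.full(N, r // N)
times = []
for _ in range(runs):
    # per-worker completion time: shift alpha_i*l_i plus Exp(mu_i/l_i)
    t_workers = alpha * ell + rng.exponential(ell / mu)
    times.append(t_workers.max())   # task ends with the slowest worker
mean_time = np.mean(times)

assert mean_time > (alpha * ell).max()   # no run can beat the largest shift
```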

Parameter Impact Analysis
In this sub-section, we investigate the impact of the parameters in BPCC, including the number of batches, the number of worker nodes, and the straggling and shift parameters in the computing model.

Number of batches
The number of batches p_i is an important parameter to configure. In Section 4.6, we theoretically analyzed its impact on the performance of BPCC. Here, we conduct simulation studies to demonstrate the impacts described in Theorem 5, Theorem 6, and Corollary 6.1. In particular, two experiments are designed.
In the first experiment, we show that the approximated execution time τ* of BPCC given by Eq. (12) decreases with the increase of any p_i, i ∈ [N], as stated in Theorem 5. In particular, we vary the number of batches for one of the worker nodes and fix the number of batches for the others. Specifically, we vary p_1 and let p_j = 1, ∀j ∈ [N] \ {1}. As shown in Fig. 1(a), τ* indeed decreases as p_1 increases.

In the second experiment, we show that the approximated execution time τ* and the load ℓ*_i tend to converge as p_i increases for all i ∈ [N], as stated in Theorem 6 and Corollary 6.1. In this experiment, we vary all p_i simultaneously, i.e., we let p_i = p ∈ Z^+, ∀i ∈ [N], and vary the value of p. As shown in Fig. 1(b), τ* decreases with the increase of p and finally converges. Fig. 2(a) shows the trajectory of the load allocated to one of the worker nodes, i.e., ℓ*_1, which also decreases and finally converges as p increases. When p = 100, the values of τ* and ℓ*_1 in the four scenarios are almost the same as the associated inf τ* and ℓ̂_1, respectively.

In Fig. 2(b), we also show the impact of p on the total load q = Σ_{i=1}^N ℓ*_i. As we can see, the total load q increases with the increase of p, where p_i = p, ∀i ∈ [N]. This indicates that a larger p_i requires more storage space at the worker nodes. Note that the worker nodes stop execution once the master node receives a sufficient amount of results for recovering the final result; therefore, a larger total load q does not increase the computation load of the worker nodes. This study tells us that the configuration of parameter p_i should trade off computational efficiency against storage consumption.

As τ* is an approximation of BPCC's execution time, we also show in Fig. 3 the impact of p_i on the expected execution time E[T_BPCC] of BPCC, which is estimated using the Monte Carlo method, specifically, by repeating each experiment 100 times and averaging the execution times of the BPCC scheme. Comparing Fig. 1 and Fig. 3, we can see that τ* approximates E[T_BPCC] generally well. The fluctuations are caused by the uncertainty of the computation times and the relatively weak estimation capability of the Monte Carlo method, which requires a large number of simulations to obtain an accurate mean estimate. As we will show in the next study, the approximation accuracy of τ* is impacted by the number of worker nodes N.

Number of worker nodes
As we have theoretically proved in Theorem 4, the approximated execution time τ* converges to the true expected execution time E[T_BPCC] of BPCC when the number of worker nodes N approaches infinity. To demonstrate this theorem, we vary N, set r = 100N + 10000, and record the approximation error of τ*, given by |τ* − E[T_BPCC]|, for each value of N. The results are shown in Fig. 4. Note that, instead of the four scenarios described in Section 5.1.2, we design four new scenarios for this study, where the configuration of each scenario is specified in the figure. As we can see, the approximation error of τ* decreases with the increase of the number of worker nodes, and finally converges to zero.

From the above studies, we can see that, as the number of batches p_i for any worker node i ∈ [N] increases, the efficiency of BPCC improves, but the demand for storage also increases. Because storage consumption is not our main concern in this study, in the following experiments we set p_i to its maximum possible value, i.e., p_i = ℓ̂_i, ∀i ∈ [N], considering that a valid p_i should be a positive integer smaller than or equal to ℓ*_i and that ℓ*_i converges to ℓ̂_i as p_i increases for all i ∈ [N].

Straggling and shift parameters
In BPCC, to determine the load numbers ℓ_i, we need to know the values of the straggling and shift parameters, µ_i and α_i, which are estimated by measuring the actual execution behaviors in real experiments. To understand the impact of parameter estimation errors on the performance of BPCC, we conduct a sensitivity study. In particular, to study how sensitive BPCC is to the estimation errors associated with the straggling parameters µ_i, we fix the shift parameters α_i and deviate each µ_i from its true value µ*_i by randomly picking a value from the interval (µ^min_i, µ^max_i), where µ^min_i = (1 − ∆)µ*_i, µ^max_i = (1 + ∆)µ*_i, and ∆ > 0 represents the degree of deviation. As µ_i should be positive, we let µ^min_i = 0 if ∆ > 1. Fig. 5(a) shows the relative change of the mean execution time, measured by |Ê'[T] − Ê[T]| / Ê[T], at different values of ∆ in different scenarios, where Ê[T] and Ê'[T] are the mean execution times obtained by using the true and erroneous parameter values, respectively. Similarly, we plot in Fig. 5(b) the relative change of the mean execution time when the shift parameters α_i suffer from estimation errors. As we can see, the deviation of the straggling parameters µ_i has less impact on the performance of BPCC than that of the shift parameters α_i, and BPCC is robust to small errors in general.

Comparative Performance Studies
In this sub-section, we compare the performance of the proposed BPCC scheme with three benchmark schemes: Uniform Uncoded, Load-Balanced Uncoded, and HCMM. The parameter p_i in BPCC is set to ℓ̂_i, ∀i ∈ [N]. Fig. 6(a) shows the mean execution times for all schemes, grouped by the computation scenario. We can clearly observe that the proposed BPCC scheme outperforms the other benchmark schemes in all scenarios. For instance, BPCC achieves a performance improvement of up to 73% over the Uniform Uncoded scheme, up to 56% over the Load-Balanced Uncoded scheme, and up to 34% over HCMM. Note that the execution times are directly derived by using the computing model in Eq. (3), and the decoding times are not considered here.
In Fig. 6(a), the performance is expressed in terms of the mean execution time, which corresponds to E[T_BPCC] for the different schemes. In Fig. 6(b), we show the average amount of received results over time for Scenario 2, which corresponds to E[S(t)] in the theoretical analysis. Remarkably, we can observe from the figure that, under BPCC, the master node quickly receives results from the worker nodes from the very beginning. On the other hand, under the three benchmark schemes, there is a certain duration at the beginning during which the master node does not receive any result. This phenomenon occurs because our BPCC scheme allows partial results to be returned, which is very useful for applications that can utilize partial results. In Fig. 6(b), we also indicate the time when the master node receives the required amount of results, i.e., r. Such a time corresponds to τ* (i.e., E[S(τ*)] = r).

EXPERIMENTS ON THE AMAZON EC2 COMPUTING CLUSTER
In this section, we evaluate the performance of the proposed BPCC scheme in a real distributed computing system. Specifically, we implement the three benchmark schemes and the proposed BPCC scheme on the Amazon EC2 computing platform [39], a classical cloud computing system.

Experiment Settings
To implement the proposed BPCC and the three benchmark schemes over Amazon EC2 clusters, we apply a standard distributed computing interface, the Message Passing Interface (MPI) [40], by using an open-source package, mpi4py [41], which provides MPI interfaces in Python. The source code can be found at https://github.com/BaoqianWang/Batch-Processing-Coded-Computation. Moreover, to encode and decode matrices in BPCC, we use the Luby Transform (LT) codes with a peeling decoder [37] that are adopted by HCMM [22]. The utilization of the LT code relaxes the constraint of recovering the final computation result from any r rows to any r(1 + ε) rows, where ε > 0 is desired to be as small as possible. In this study, we adopt the configuration in [22] and set ε = 0.13. The parameter p_i in BPCC is set to ℓ̂_i, ∀i ∈ [N]. To evaluate the performance of the four computation schemes, we consider the following four scenarios:
• Scenario 1: r = 0.5 × 10^4 and N = 5, where one r4.2xlarge instance, two r4.xlarge instances, and two t2.large instances are used as the worker nodes.
• Scenario 2: r = 1 × 10^4 and N = 10, where two r4.2xlarge instances, four r4.xlarge instances, and four t2.large instances are used as the worker nodes.
In all above scenarios, the master node runs in an m4.xlarge instance, and the size of the input vector x ∈ R^m is set to m = 5 × 10^5.
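The coding principle behind these schemes, recovering the full product A·x from any sufficiently large subset of coded inner products, can be illustrated in a few lines. The paper uses LT codes with a peeling decoder; the sketch below instead uses a dense random linear code (any r coded rows suffice, at a higher decoding cost than LT peeling), and the function names are our own:

```python
import numpy as np

def encode(A, n_coded, seed=0):
    # Dense random linear code: each coded row is a random linear
    # combination of the r source rows of A. (Illustrative stand-in
    # for the LT code used in the paper.)
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((n_coded, A.shape[0]))
    return G, G @ A

def decode(G, coded_products, received_idx):
    # Recover A @ x from any r coded inner products by solving the
    # r x r linear system formed by the received rows of G.
    G_sub = G[received_idx]
    y = coded_products[received_idx]
    return np.linalg.solve(G_sub, y)

if __name__ == "__main__":
    r, m, n_coded = 50, 200, 80
    rng = np.random.default_rng(1)
    A, x = rng.standard_normal((r, m)), rng.standard_normal(m)
    G, C = encode(A, n_coded)
    products = C @ x  # in BPCC these would arrive batch by batch from workers
    idx = rng.choice(n_coded, size=r, replace=False)  # whichever r arrive first
    print(np.allclose(decode(G, products, idx), A @ x))
```

Because any r of the 80 coded rows determine A·x, the master never has to wait for specific (possibly straggling) workers, which is the property LT coding provides at r(1 + ε) rows with much cheaper decoding.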

Parameter Estimation
In our previous design and analysis, we have assumed that the task completion time T on each node follows a shifted exponential distribution of the general form Pr[T ≤ t] = 1 − e^{−(µ/r)(t − αr)} when t ≥ αr. Therefore, E[T] = r/µ + αr. Based on this assumption, we conduct extensive experiments and measure the actual execution behaviors to estimate the values of the straggling and shift parameters, µ and α, for different types of instances. In particular, letting t_c(r) = r/µ and t_0(r) = αr, we run tasks of different sizes. For each task size r, we execute the task repeatedly for M = 1000 times and obtain the execution times {T_1, T_2, ..., T_M}. The maximum likelihood estimates of t_0(r) and t_c(r) are then given by t̂_0(r) = min_{l∈[M]} T_l and t̂_c(r) = (1/M) Σ_{l=1}^M T_l − t̂_0(r), respectively [42], [43]. With t̂_0(r) and t̂_c(r) for different task sizes r, we can then estimate the values of µ and α by using least squares estimation. Fig. 7 shows the estimated cumulative distribution function (CDF) of the processing time of a t2.xlarge instance when the task size is r = 500. The estimated α and µ for different types of Amazon EC2 instances are summarized in Table 1. These estimated parameters are used to allocate computation loads for all computing schemes, except the Uniform Uncoded scheme.
Figure 7. The CDF of the processing time of an Amazon EC2 t2.xlarge instance for computing a task with r = 500.
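The estimation pipeline above can be sketched on synthetic data drawn from the assumed shifted-exponential model; the function name and all parameter values here are illustrative, not measurements:

```python
import random

def estimate_mu_alpha(task_sizes, mu_true=2.0, alpha_true=0.05, M=2000, seed=7):
    # Per task size r, draw M samples of T = alpha*r + Exp(mu/r), then:
    #   t0_hat(r) = min_l T_l                  (MLE of the shift part, alpha*r)
    #   tc_hat(r) = mean_l T_l - t0_hat(r)     (MLE of the exponential mean, r/mu)
    # and finally least-squares fits t0_hat(r) ~ alpha*r and tc_hat(r) ~ r/mu.
    rng = random.Random(seed)
    sum_t0_r = sum_tc_r = sum_r2 = 0.0
    for r in task_sizes:
        samples = [alpha_true * r + rng.expovariate(mu_true / r)
                   for _ in range(M)]
        t0 = min(samples)
        tc = sum(samples) / M - t0
        sum_t0_r += t0 * r
        sum_tc_r += tc * r
        sum_r2 += r * r
    alpha_hat = sum_t0_r / sum_r2  # least-squares slope of t0_hat(r) vs r
    mu_hat = sum_r2 / sum_tc_r     # slope of tc_hat(r) vs r is 1/mu
    return alpha_hat, mu_hat

if __name__ == "__main__":
    print(estimate_mu_alpha([100, 200, 400, 800]))
```

On real measurements one would replace the synthetic sampling with the recorded execution times per task size; the two least-squares fits are unchanged.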

Experimental Results
To evaluate the performance of the proposed BPCC scheme running on the heterogeneous Amazon clusters, we design three experiments.

Experiment 1
In the first experiment, we compare the performance of BPCC with the three benchmark schemes in different scenarios.
For each scenario, we run each scheme 100 times and record the mean execution time E[T]. To evaluate the robustness of these schemes to uncertain disturbances, we introduce unexpected stragglers that are randomly chosen in each run. In particular, we randomly select 20% of the worker nodes to be stragglers in each run. As stragglers can be slow in computing or returning results (e.g., when communication congestion happens), such stragglers are emulated by delaying the return of computing results, such that the computing time observed by the master node is three times the actual computing time. Fig. 8(a) illustrates the mean execution time of the different distributed computing schemes in different scenarios, and also highlights the decoding time required by the coded schemes, i.e., BPCC and HCMM. We can see from the figure that the proposed BPCC scheme outperforms all benchmark schemes in all scenarios. Specifically, the performance improves by up to 79% compared with the Uniform Uncoded scheme, up to 78% compared with the Load-Balanced Uncoded scheme, and up to 62% compared with HCMM. As stragglers can also fail in returning any results (e.g., when nodes/links fail), we also consider such stragglers and emulate them by setting the delay time to infinity. Since no results will be returned by such stragglers, the computation task can fail. Fig. 8(b) shows the success rate (measured by the ratio of successful runs) of each scheme in different scenarios. The mean execution time of successful runs is shown in Fig. 8(c). As we can see, the Uniform Uncoded and Load-Balanced Uncoded schemes fail to complete the task in all runs, as no redundancy is introduced in these schemes. Both HCMM and BPCC can successfully complete the task in most runs, but HCMM has a lower success rate and is less efficient than BPCC.
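Why the uncoded schemes fail under fail-stop stragglers while the coded ones survive can be seen with a tiny simulation. This is a simplified fail-stop model with illustrative numbers, not the paper's exact emulation, and the function is hypothetical:

```python
import random

def success_rate(n_nodes, loads, r, straggle_frac, coded, runs=2000, seed=3):
    # In each run, a random straggle_frac of the nodes never return anything.
    # An uncoded scheme needs every node's partition of the matrix, while a
    # coded scheme succeeds whenever the surviving nodes still deliver at
    # least r coded rows in total.
    rng = random.Random(seed)
    n_straggle = int(straggle_frac * n_nodes)
    ok = 0
    for _ in range(runs):
        dead = set(rng.sample(range(n_nodes), n_straggle))
        if coded:
            ok += sum(l for i, l in enumerate(loads) if i not in dead) >= r
        else:
            ok += not dead
    return ok / runs

if __name__ == "__main__":
    # 10 nodes with 300 coded rows each (3000 total) vs. r = 2000 required:
    # the redundancy absorbs the loss of 20% of the nodes.
    print(success_rate(10, [300] * 10, 2000, 0.2, coded=True))
    print(success_rate(10, [300] * 10, 2000, 0.2, coded=False))
```

The coded scheme's success rate stays at 1.0 as long as the surviving loads still sum to at least r, while any single fail-stop straggler drives an uncoded scheme to 0, matching the behavior in Fig. 8(b).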
In Fig. 9, we take Scenario 4 as an example and show the average amount of received results over time (E[S(t)]). As expected, under our BPCC scheme the master node continuously receives results from the very beginning. Under the other schemes, however, the master node needs to wait for a long time before receiving any result.

Experiment 2
In the second experiment, we study the impact of the number of unexpected stragglers on the performance of the four computation schemes by varying the percentage of stragglers from 0% to 60%. Similar to Experiment 1, stragglers can delay returning results for a finite or infinite amount of time. The mean execution time of each scheme in the presence of different numbers of stragglers with finite delay for Scenario 4 is shown in Fig. 10(a). As we can see, when there is no straggler, the Uniform Uncoded scheme and the Load-Balanced Uncoded scheme achieve the best performance, as they do not involve any computation redundancy, compared with the coded schemes. However, when stragglers exist, our BPCC scheme achieves the best performance, indicating its high robustness to uncertain stragglers. We can also observe from Fig. 10(a) that the performance of all schemes degrades with the increase of the number of stragglers. Interestingly, the performance degradation of the three benchmark schemes slows down when the number of stragglers reaches a certain value. This is because worker nodes in these schemes do not return any result to the master node until the whole assigned task is completed, and all stragglers delay returning the result for a period that is three times the task computation time. We also note that the performance of HCMM is even worse than that of the two uncoded schemes when the percentage of stragglers exceeds 20%. This is because each worker node in HCMM is assigned more computation load than in the uncoded schemes, which causes the stragglers in HCMM to take longer before returning any result.
In the case where stragglers delay returning results for an infinite amount of time, the success rate and the mean execution time of each scheme are shown in Fig. 10(b) and Fig. 10(c), respectively. As expected, both the Uniform Uncoded and Load-Balanced Uncoded schemes fail to complete the task. Additionally, the performances of BPCC and HCMM degrade with the increase of the number of stragglers, and BPCC outperforms HCMM.

Experiment 3
In the third experiment, we evaluate the impact of the number of batches p_i on the performance of BPCC running on the Amazon clusters. Similar to the simulation study, we let p_i = p, ∀i ∈ [N], and vary the value of p from 5 to 100. Fig. 10(d) shows the mean execution time of BPCC at different values of p, under the settings described in Scenario 4 and Experiment 1, when unexpected stragglers with finite delay are present. As expected, the efficiency of BPCC improves with the increase of p.

CONCLUSION
In this paper, we systematically investigated the design and evaluation of a novel coded distributed computing (CDC) framework, namely, batch-processing based coded computing (BPCC), for heterogeneous computing systems. The key idea of BPCC is to optimally exploit the partial coded results calculated by all distributed computing nodes. Under this BPCC framework, we investigated a classical CDC problem, matrix-vector multiplication, and formulated an optimization problem for BPCC to minimize the expected task completion time by configuring the computation load. BPCC was proved to provide an asymptotically optimal solution and to outperform a state-of-the-art CDC scheme for heterogeneous clusters, namely, heterogeneous coded matrix multiplication (HCMM). Theoretical analysis reveals the impact of BPCC's key parameter, the number of batches, on its performance, from which the worst and best performance that BPCC can achieve can be inferred. To evaluate the performance of the proposed BPCC scheme and better understand the impacts of its parameters, we conducted extensive simulation studies and real experiments on Amazon EC2 computing clusters. The simulation and experimental results verify the theoretical results and demonstrate that the proposed BPCC scheme outperforms all benchmark schemes in computing systems with uncertain stragglers, in terms of task completion time and robustness to stragglers. In the future, we will further enhance BPCC by jointly optimizing the load allocation and the number of batches to achieve a tradeoff between computational efficiency and storage consumption, and will explore its other properties, such as the convergence rate. We will also consider distributed computing systems with mobile computing nodes and other optimization objectives, such as minimizing energy consumption.

ACKNOWLEDGEMENT
We would like to thank the National Science Foundation (NSF) for supporting this work under Grants CI-1953048/1730589/1730675/1730570/1730325, CAREER-2048266, and CAREER-1714519.

Proof of Lemma 1
To derive the infimum and supremum of λ_i, we first prove that they are attained at p_i → ∞ and p_i = 1, respectively. Define the auxiliary function h_i(x, y), where x > 0 and y ∈ [0, 1]. Note that h_i(x, y) is monotonically increasing with respect to y, as ∂h_i(x, y)/∂y = (µ_i x / y^3) e^{−µ_i(x/y − α_i)} > 0. Also define P(z) = {y_0, y_1, y_2, ..., y_z} as a partition of the range of y, i.e., [0, 1]. We then let ∆y_k = y_k − y_{k−1} = 1/z, k ∈ [z], and define the sum U(z, x) accordingly. According to Theorem 6.4 in [44], if there exists another partition P(z') that satisfies P(z') ⊃ P(z), then the corresponding sums are ordered accordingly. We can then derive the bounds on U(z, x), as P(z) ⊇ P(1) and P(z) ⊂ P(∞), ∀z ∈ Z+, where the right equality holds when z = 1. Furthermore, since ∂U(z, x)/∂x < 0, U(z, x) is monotonically decreasing with respect to x.
Now let x = λ_i and z = p_i. We then obtain the corresponding relation according to Eq. (7). Using proof by contradiction, we can then derive that the infimum and supremum of λ_i are attained when p_i → ∞ and p_i = 1, respectively. Specifically, suppose sup λ_i = λ̄ is attained at p_i = p̄ > 1 and λ* is attained at p_i = 1. We then have λ̄ ≥ λ* and U(1, λ̄) ≤ U(1, λ*), as U(z, x) is monotonically decreasing with respect to x. Since U(p̄, λ̄) = U(1, λ*) according to Eq. (27), we have U(1, λ̄) ≤ U(p̄, λ̄), which contradicts U(1, λ̄) > U(p̄, λ̄) from Eq. (26). Therefore, sup λ_i must be attained at p_i = 1. Similarly, we can prove that the infimum of λ_i is attained when p_i → ∞. Next, we find the specific formulas for the infimum and supremum of λ_i. In particular, the infimum of λ_i is attained when p_i → ∞ and can be found by solving Eq. (27) in the limit. Defining the variable v = µ_i λ_i / x, the term on the left side of the resulting equation can be simplified, and combining Eq. (29) and Eq. (30) yields inf λ_i. Similarly, we can obtain the supremum of λ_i by solving Eq. (27) with p_i = 1. According to [38], the solution to an equation of this form can be expressed in terms of the Lambert W function W(·). We can then obtain the solution to the above equation.
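The closed form for sup λ_i relies on the Lambert W function, defined by W(a)·e^{W(a)} = a. A minimal Newton-iteration evaluator for the principal branch, useful for sanity-checking such closed forms numerically, is sketched below; the function name and iteration scheme are our own, not from the paper:

```python
import math

def lambert_w(a, tol=1e-12):
    # Principal-branch W(a) for a >= 0: solves w * exp(w) = a by Newton's
    # method on f(w) = w * exp(w) - a, with f'(w) = exp(w) * (w + 1).
    w = math.log(a + 1.0)  # reasonable starting point for a >= 0
    for _ in range(100):
        ew = math.exp(w)
        step = (w * ew - a) / (ew * (w + 1.0))
        w -= step
        if abs(step) < tol:
            break
    return w

if __name__ == "__main__":
    for a in (0.5, 1.0, 2.0, 10.0):
        w = lambert_w(a)
        print(a, w, w * math.exp(w))  # last column reproduces a
```

In practice one would use a library routine such as scipy.special.lambertw; the point here is only that W(·) is cheap to evaluate, so the closed-form supremum is directly computable.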
Applying the first McDiarmid inequality in Eq. (32), we can then derive the bound in Ineq. (37), using the asymptotic scales of the parameters on its right-hand side. Ineq. (37) shows that t = τ* + δ satisfies the constraint in P^(2)_alt. Since t* is the minimal time that satisfies the constraint, t* ≤ t = τ* + δ. Therefore, t* ≤ τ* + o(1).

Proof of Theorem 3
The asymptotic optimality of BPCC shown in Eq. (16) can be proved by showing a chain of inequalities. Since one of them follows directly from the fact that E[T_OPT] is the optimal value of P_main, we use two steps, inspired by [22], to prove the other two inequalities.
Step 1: Let ℓ_OPT = (ℓ_OPT,1, ..., ℓ_OPT,N) be the optimal load allocation obtained by solving P_main, and let S_OPT(t) be the amount of results received by the master node by time t under load allocation ℓ_OPT. The inequality above can be proved by showing the inequalities (a) and (b) below, where τ*_OPT is the solution to E[S_OPT(t)] = r. To prove Ineq. (a), we first define an auxiliary function g_i(t) for each node i, according to Eq. (5). According to our previous discussions, r = Θ(N), so ℓ_OPT,i = Θ(1), ∀i ∈ [N]. Therefore, g_i(t) does not change with N, i.e., g_i(t) = Θ(1). By using the McDiarmid inequality in Eq. (33), we then obtain a bound which implies that E[T_OPT] ≥ τ*_OPT − δ_1. Next, we proceed to prove Ineq. (b), using the fact that E[S*(t)] is the optimal value of the alternative problem. We have now proved t* − o(1) ≤ E[T_OPT].
Let T_max be a random variable that denotes the time required for all worker nodes to complete their tasks assigned using BPCC. Let E_1 = {T_max > Θ(N)} and E_2 = {T_BPCC > t*} be two events. E[T_BPCC] can then be computed by conditioning on these events, as in Eq. (38). The first term on the right-hand side of Eq. (38) involves f_max(t), the probability density function (PDF) of T_max. A stochastic upper bound of T_max can be found by considering N worker nodes that all take the smallest straggling parameter min{µ_i} and the largest shift parameter max{α_i}. Using the PDF of the maximum of N i.i.d. exponential random variables, we then obtain a bound with a constant k_1 = Θ(1).
The second term on the right-hand side of Eq. (38) involves E[T_max | T_max ≤ Θ(N), T_BPCC > t*]. Since the master node receives at least r rows of inner-product results by time T_BPCC, we have S*(T_BPCC) ≥ r. Next, since S*(t) is a monotonically increasing function of time t, we can derive that, if S*(t*) < r, then T_BPCC > t*. The third term on the right-hand side of Eq. (38) can be bounded similarly, which completes the derivation. Similarly, from Theorem 5, we can derive that the supremum of τ* is attained when p_i = 1, ∀i ∈ [N], where sup λ_i is given by Eq. (9).

Proof of Theorem 7
According to Theorem