A Shannon-Theoretic Approach to the Storage-Retrieval Tradeoff in PIR Systems

We consider the storage-retrieval rate tradeoff in private information retrieval (PIR) systems using a Shannon-theoretic approach. Our focus is on the canonical two-message two-database case, for which a coding scheme based on random codebook generation, joint typicality encoding, and the binning technique is proposed. It is first shown that when the retrieval rate is kept optimal, the proposed non-linear scheme requires less storage than the optimal linear scheme. Since the other extreme point, which uses the minimum storage, requires both messages to be retrieved, the performance obtained by space-sharing between the two extreme points is also achievable. The proposed scheme, however, improves on this simple space-sharing strategy. Although the random-coding-based scheme has a diminishing but nonzero probability of error, the coding error can be eliminated if variable-length codes are allowed. Novel outer bounds are finally provided and used to establish the superiority of non-linear codes over linear codes.


Introduction
Private information retrieval (PIR) addresses the situation of storing K messages of L bits each in N databases, with the requirement that the identity of any requested message must be kept private from any one (or any small subset) of the databases. The early works were largely computer-science-theoretic [1], where L = 1 and the main question is the scaling law of the retrieval rate in terms of (K, N).
The storage overhead in PIR systems has been studied in the coding and information theory community from several perspectives, using mainly two problem formulations. Shah et al. [2] considered the problem when N is allowed to vary with L and K, and obtained some conclusive results. In a similar vein, for L = 1, Fazeli et al. [3] proposed a technique to convert any linear PIR code into a new one with low storage overhead by increasing N. Other notable results along this line can be found in [4-9].
An information theoretic formulation of the PIR problem was considered in [10], where L is allowed to increase while (N, K) are kept fixed. Important properties of the tradeoff between the storage rate and the retrieval rate were identified in [10], and a linear code construction was proposed. In this formulation, even without any storage overhead constraint, characterizing the minimum retrieval rate of a PIR system is nontrivial, and this capacity problem was settled in [11]. Tajeddine et al. [12] considered the capacity problem when the message is coded across the databases with a maximum-distance separable (MDS) code, which was later solved by Banawan and Ulukus [13]. Capacity-achieving code designs with optimal message sizes were given in [14,15]. Systems where servers can collude were considered in [16]. There have been various extensions and generalizations, and the recent survey article [17] provides a comprehensive overview of efforts following this information theoretic formulation.
In many existing works, the storage component and the PIR component are largely designed separately, usually by placing certain structural constraints on one of them, e.g., the MDS coding requirement for the storage component [13], or requiring the storage to be uncoded [18]; moreover, the code constructions are almost all linear. The few exceptions we are aware of are [19-21]. In this work, we consider the information theoretic formulation of the PIR problem without placing any additional structural constraints on the two components, and explicitly investigate the storage-retrieval tradeoff region. We mostly focus on the case N = K = 2, since it provides the most important intuition; we refer to this as the (2, 2) PIR system. Our approach naturally allows the joint design of the two components using either linear or non-linear schemes.
The work in [19] is of significant relevance to ours: it considered the storage overhead in both single-round and multi-round PIR systems when the retrieval rate must be optimal. Although multi-round PIR has the same capacity as single-round PIR, it was shown that at the minimum retrieval rate, a multi-round, ε-error, non-linear code can indeed break the storage performance barrier of an optimal single-round, zero-error, linear code. Whether all three differences are essential to overcoming this barrier was left as an open question.
In this work, we show that a non-linear code can achieve better performance than the optimal linear code in the single-round, zero-error (2, 2) PIR system over a range of storage rates. This is accomplished by providing a Shannon-theoretic coding scheme based on random codebook generation and the binning technique. The proposed scheme at the minimum retrieval rate is conceptually simpler, and we present it as an explicit example. The general inner bound is then provided, and we show that an improved tradeoff can be achieved beyond space-sharing between the minimum-retrieval-rate code and the other optimal extreme point. By leveraging a method akin to the expurgation technique, we further show that one can extract a zero-error deterministic PIR code from the random ε-error PIR code. Outer bounds are also studied for both general codes and linear codes, which allow us to establish conclusively the superiority of non-linear codes over linear codes. Our work essentially answers the open question in [19], showing that in fact only non-linearity is essential to breaking the aforementioned barrier.
A preliminary version of this work was first presented in part in [22]. In this updated article, we provide a more general random coding scheme, which reveals a hidden connection to the multiple description source coding problem [23]. Intuitively, we can view the retrieved message as a certain partial reconstruction of the full set of messages, instead of a complete reconstruction of a single message. Therefore, the answers from the servers can be viewed as descriptions of the full set of messages, which are either stored directly at the servers or formed at the time of request, and the techniques seen in multiple description coding become natural in the PIR setting. Since the publication of the preliminary version [22], several subsequent efforts have been made in studying the storage-retrieval tradeoff in the PIR setting, which provided stronger and more general information theoretic outer bounds and several new linear code constructions [20,21,24]. However, the Shannon-theoretic random coding scheme given in [22] still gives the best known performance for the (2, 2) case, which motivates us to provide the general coding scheme in this work and to make the connection to multiple description source coding more explicit. It is our hope that this connection may bring existing coding techniques for the multiple description problem to the study of the PIR problem.

Preliminaries
The problem we consider is essentially the same as that in [11], with the additional consideration of the storage overhead constraint at the databases. We provide a formal problem definition in the more traditional Shannon-theoretic language, to facilitate the subsequent treatment. Some relevant results on this problem are also reviewed briefly in this section.

Problem Definition
There are two independent messages in this system, denoted as W_1 and W_2, each of which is generated uniformly at random in the finite field F_2^L, i.e., each message is an L-bit sequence. There are two databases to store the messages; the stored contents are produced by two encoding functions φ_1 and φ_2 operating on (W_1, W_2), where α_n is the number of storage symbols at database-n, n = 1, 2, which is a deterministic function of L, i.e., we are using fixed-length codes for storage. We write the stored content at database-n as S_n. When a user requests message-k, it generates two queries (Q_1^[k], Q_2^[k]) to be sent to the two databases, chosen randomly in the alphabet Q × Q. The joint distribution satisfies the condition that the messages and the queries are independent. The marginal distributions P_{W_1,W_2} and P_{Q_1^[k],Q_2^[k]}, k = 1, 2, thus fully specify the randomness in the system.
After receiving the queries, the databases produce the answers to the queries via a set of deterministic functions ϕ_n^(q), n = 1, 2; we write the answers as A_n^[k]. The user, with the retrieved information, wishes to reproduce the desired message through a set of decoding functions ψ_{k,q_1,q_2}. The outputs Ŵ_k = ψ_{k,Q_1^[k],Q_2^[k]}(A_1^[k], A_2^[k]) of these functions are essentially the retrieved messages. We require the system to retrieve the message correctly (zero-error), i.e., Ŵ_k = W_k for k = 1, 2.
Alternatively, we can require the system to have only a small error probability. Denote the average probability of coding error of a PIR code as P_e. An (L, α_1, α_2, β_1, β_2) ε-error PIR code is defined similarly to a (zero-error) PIR code, except that the correctness condition is replaced by the condition that the probability of error satisfies P_e ≤ ε. Finally, the privacy constraint stipulates that the identical distribution condition P_{Q_n^[1],A_n^[1],S_n} = P_{Q_n^[2],A_n^[2],S_n}, n = 1, 2, must be satisfied. Note that one obvious consequence is that P_{Q_n^[1]} = P_{Q_n^[2]}, for n = 1, 2. We refer to the code, which is specified by the two probability distributions P_{Q_1^[k],Q_2^[k]}, k = 1, 2, and a valid set of coding functions {φ_n, ϕ_n^(q), ψ_{k,q_1,q_2}} that satisfy both the correctness and privacy constraints, as an (L, α_1, α_2, β_1, β_2) PIR code, where β_n is the number of symbols in the answer from database-n. Definition 1. A normalized storage-retrieval rate pair (ᾱ, β̄) is achievable if, for any ε > 0 and sufficiently large L, there exists an (L, α_1, α_2, β_1, β_2) PIR code such that (α_1 + α_2)/(2L) ≤ ᾱ + ε and (β_1 + β_2)/(2L) ≤ β̄ + ε. The collection of the achievable normalized storage-retrieval rate pairs (ᾱ, β̄) is the achievable storage-retrieval rate region, denoted as R.
Unless explicitly stated otherwise, the rate region R is used for the zero-error PIR setting. In the definition above, we have used the average rates (ᾱ, β̄) across the databases instead of the individual rate vectors (α_1, α_2, β_1, β_2). This can be justified using the following lemma.
Lemma 1. If an (L, α_1, α_2, β_1, β_2) PIR code exists, then a (2L, α, α, β, β) PIR code exists, where α = α_1 + α_2 and β = β_1 + β_2. This lemma can essentially be proved by a space-sharing argument, the details of which can be found in [19]. The following lemma is also immediate using a conventional space-sharing argument.
Lemma 2. The region R is convex.

Some Relevant Known Results
The capacity of a general PIR system with K messages and N databases was identified in [11] as C = (1 − 1/N)/(1 − 1/N^K), which in our definition corresponds to the case when β̄ is minimized, and the proposed linear code achieves (ᾱ, β̄) = (K, (1 − 1/N^K)/(N − 1)). The capacity of MDS-coded PIR systems was established in [13]. In the context of the storage-retrieval tradeoff, this result can be viewed as providing the achievable tradeoff pairs (ᾱ, β̄) = (K/t, (1 − (t/N)^K)/(N − t)), t = 1, 2, …, N − 1. However, when specialized to the (2, 2) PIR problem, this does not provide any improvement over the space-sharing strategy between the trivial retrieve-everything code and the code in [11]. By specializing the code in [11], it was shown in [19] that for the (2, 2) PIR problem, at the minimal retrieval rate β̄ = 0.75, the storage rate ᾱ_l = 1.5 is achievable using a single-round, zero-error linear code, and in fact, it is the optimal storage rate that any single-round, zero-error linear code can achieve. One of the key observations in [19] is that a special coding structure, illustrated in Fig. 1, appears to capture the main difficulty in the (2, 2) PIR setting. Here message W_1 can be recovered from either (X_1, Y_1) or (X_2, Y_2), and message W_2 can be recovered from either (X_1, Y_2) or (X_2, Y_1); (X_1, X_2) is essentially S_1 and is stored at database-1, and (Y_1, Y_2) is essentially S_2 and is stored at database-2. It is clear that we can use the following strategy to satisfy the privacy constraint: when message W_1 is requested, the user queries for either (X_1, Y_1) or (X_2, Y_2), each with probability 1/2; when message W_2 is requested, the user queries for either (X_1, Y_2) or (X_2, Y_1), each with probability 1/2. More precisely, the following probability distribution can be used:

P((Q_1^[1], Q_2^[1]) = (1, 1)) = P((Q_1^[1], Q_2^[1]) = (2, 2)) = 0.5,
P((Q_1^[2], Q_2^[2]) = (1, 2)) = P((Q_1^[2], Q_2^[2]) = (2, 1)) = 0.5.
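The privacy of this query strategy can be verified mechanically: each database, in isolation, sees a query that is uniformly distributed over {1, 2} regardless of which message is requested. The following minimal Python sketch checks this (the dictionary encoding of the strategy is ours, for illustration only):

```python
from fractions import Fraction

# Query strategy described above: for message 1, request (X1, Y1) or (X2, Y2);
# for message 2, request (X1, Y2) or (X2, Y1); each option with probability 1/2.
strategy = {
    1: {(1, 1): Fraction(1, 2), (2, 2): Fraction(1, 2)},
    2: {(1, 2): Fraction(1, 2), (2, 1): Fraction(1, 2)},
}

def marginal(k, database):
    """Distribution of the query seen by one database when message k is requested."""
    dist = {}
    for (q1, q2), p in strategy[k].items():
        q = q1 if database == 1 else q2
        dist[q] = dist.get(q, Fraction(0)) + p
    return dist

# Privacy: the marginal query distribution at each database is identical
# whether k = 1 or k = 2 (here, uniform over {1, 2}).
for n in (1, 2):
    assert marginal(1, n) == marginal(2, n) == {1: Fraction(1, 2), 2: Fraction(1, 2)}
```

The full privacy constraint also involves the answers and the stored contents, but since the answers are deterministic functions of the storage and the queries, and the queries are independent of the messages, the identical query marginals are the key requirement in this example.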

Multiple Description Source Coding
The multiple description source coding problem [23] considers compressing a memoryless source S into a total of M descriptions, i.e., M compressed bit sequences, such that any subset of these descriptions can be used to reconstruct the source S to meet certain quality requirements. The motivation of this problem is mainly to address the case when packets can be dropped randomly on a communication network. Denote the coding rate of description i as R_i, i = 1, 2, …, M. A coding scheme was proposed in [25], which leads to the following rate region. Let U_1, U_2, …, U_M be M random variables jointly distributed with S; then the rates (R_1, R_2, …, R_M) and distortions (D_A, A ⊆ {1, 2, …, M}) are achievable if, for every subset A ⊆ {1, 2, …, M},

Σ_{i∈A} R_i ≥ Σ_{i∈A} H(U_i) − H({U_i : i ∈ A} | S),  D_A ≥ E[d(S, f_A({U_i : i ∈ A}))].  (12)

Here f_A is a reconstruction mapping from the random variables {U_i, i ∈ A} to the reconstruction domain, d(·, ·) is a distortion metric that is used to measure the distortion, and D_A is the distortion achievable using the descriptions in the set A. Roughly speaking, the coding scheme generates approximately 2^{nR_i} length-n codewords in an i.i.d. manner using the marginal distribution of U_i for each i = 1, 2, …, M, and the rate constraints ensure that when n is sufficiently large, with overwhelming probability there is a tuple of M codewords (u_1^n, u_2^n, …, u_M^n), one in each codebook constructed earlier, that is jointly typical with the source vector S^n. In this coding scheme, the descriptions are simply the indices of these codewords in their codebooks. For a given joint distribution (S, U_1, U_2, …, U_M), we refer to the rate region in (12) as the MD rate region R_MD(S, U_1, U_2, …, U_M), and to the corresponding random code construction as the MD codebooks associated with (S, U_1, U_2, …, U_M).
The binning technique [26] can be applied in the multiple description problem to provide further performance improvements, particularly when not all combinations of the descriptions need to satisfy performance constraints, but only a subset of them do; this technique has previously been used in [27] and [28] for this purpose. Assume that only the subsets of descriptions A_1, A_2, …, A_T ⊆ {1, 2, …, M} have distortion requirements associated with the reconstructions using these descriptions, denoted as D_{A_i}, i = 1, 2, …, T. Consider the MD codebooks associated with (S, U_1, U_2, …, U_M) at rates (R_1, R_2, …, R_M) ∈ R_MD(S, U_1, U_2, …, U_M), and assign the codewords in the i-th codebook uniformly at random into 2^{nR'_i} bins, with 0 ≤ R'_i ≤ R_i. The coding rates and distortions that satisfy the corresponding binning constraints simultaneously for all A_i, i = 1, 2, …, T, are achievable: the bin rates must be large enough that, within the indicated bins, the codewords for each A_i can be jointly identified correctly with overwhelming probability. We denote the collection of such rate vectors (R'_1, R'_2, …, R'_M) as R_MD*(S, U_1, U_2, …, U_M; A_1, A_2, …, A_T), and refer to the corresponding codebooks as the MD* codebooks associated with the random variables (S, U_1, U_2, …, U_M) and the reconstruction sets (A_1, A_2, …, A_T).
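The binning step can be illustrated with a toy numerical sketch (our own simplified stand-in, not the MD* construction itself): codewords are hashed uniformly into bins, and a decoder holding a short list of jointly typical candidates succeeds whenever the true codeword is the only list member in the transmitted bin.

```python
import random

rng = random.Random(0)

NUM_CODEWORDS = 2 ** 10   # stands in for 2^{nR_i}
NUM_BINS = 2 ** 6         # stands in for 2^{nR'_i}, with R'_i < R_i

# Random binning: each codeword index is assigned a bin uniformly at random.
bins = [rng.randrange(NUM_BINS) for _ in range(NUM_CODEWORDS)]

def bin_decode(bin_index, candidates):
    """Return the candidates landing in the transmitted bin; decoding
    succeeds when only the true codeword survives."""
    return [c for c in candidates if bins[c] == bin_index]

# Estimate the success probability with a candidate list of size 4
# (the true codeword plus 3 other jointly typical codewords).
trials, successes = 2000, 0
for _ in range(trials):
    true_cw = rng.randrange(NUM_CODEWORDS)
    others = rng.sample([c for c in range(NUM_CODEWORDS) if c != true_cw], 3)
    if bin_decode(bins[true_cw], [true_cw] + others) == [true_cw]:
        successes += 1
print(successes / trials)  # should be close to (1 - 1/NUM_BINS)**3 ≈ 0.95
```

The point of the sketch is the rate saving: only the bin index (log2 of NUM_BINS bits) is transmitted rather than the full codeword index, at the price of a small ambiguity probability that vanishes as the bin rate approaches the constraint.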

A Special Case: Slepian-Wolf Coding for Minimum Retrieval Rate
In this section, we consider the minimum-retrieval-rate case, and show that non-linear, Shannon-theoretic codes are beneficial. We will be rather cavalier here and ignore some details, in the hope of better conveying the intuition. In particular, we ignore the asymptotically vanishing probability of error that is usually associated with a random coding argument; this will be addressed more carefully in Section 4.
Let us rewrite the L-bit messages as W_k = (V_{k,1}, V_{k,2}, …, V_{k,L}), k = 1, 2. The messages can then be viewed as being produced from a discrete memoryless source (V_1, V_2), where V_1 and V_2 are independent, uniformly distributed Bernoulli random variables. Consider the auxiliary random variables (X_1, X_2, Y_1, Y_2) defined in (17), where ¬ is the binary negation and ∧ is the binary "and" operation. This particular distribution satisfies the coding structure depicted in Fig. 1, with (V_1, V_2) taking the role of (W_1, W_2), and the relation is non-linear. The same distribution was used in [19] to construct a multi-round PIR code. This non-linear mapping appears to allow the resultant code to be more efficient than linear codes. We wish to store (X_1^L, X_2^L) at the first database in a lossless manner, but store only the information regarding Y_1^L and Y_2^L that is necessary to facilitate the recovery of W_1 or W_2. For this purpose, we encode the messages as follows: • At database-1, compress and store (X_1^L, X_2^L) losslessly; • At database-2, encode Y_1^L using a Slepian-Wolf code (or more precisely, Sgarro's code with uncertain side information [29]), with either X_1^L or X_2^L at the decoder; Y_2^L is encoded in the same manner. It is clear that for database-1, we need roughly ᾱ_1 = H(X_1, X_2). At database-2, in order to guarantee successful decoding of the Slepian-Wolf code for Y_1^L, a rate of roughly max(H(Y_1|X_1), H(Y_1|X_2)) = H(Y_1|X_1) suffices, where the equality is due to the symmetry in the probability distribution; the same holds for Y_2^L. The retrieval strategy is immediate from the coding structure in Fig. 1, and thus indeed the privacy constraint is satisfied. Computing the resulting retrieval rates shows that the minimum retrieval rate is maintained. Thus at the optimal retrieval rate β̄ = 0.75, we have ᾱ_l = 1.5 vs.

ᾱ_nl ≈ 1.4387,  (22)

and clearly the proposed non-linear Shannon-theoretic code is able to perform better than the optimal linear code. We note that it was shown in [19] that by using a multi-round approach, the storage rate ᾱ can be further reduced; however, this issue is beyond the scope of this work. In the rest of the paper, we build on the intuition in this special case to generalize and strengthen the coding scheme.
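The rate expressions above reduce to routine entropy computations. The sketch below implements them in Python and evaluates H(Y_1 | X_1) for a stand-in non-linear relation Y_1 = X_1 ∧ B, with B an independent fair bit (a hypothetical choice for illustration only; the actual distribution is the one given in (17)):

```python
from math import log2

def entropy(pmf):
    """Shannon entropy in bits of a pmf given as a dict value -> probability."""
    return -sum(p * log2(p) for p in pmf.values() if p > 0)

def conditional_entropy(joint):
    """H(Y|X) from a joint pmf given as a dict (x, y) -> probability."""
    px = {}
    for (x, _y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
    return -sum(p * log2(p / px[x]) for (x, _y), p in joint.items() if p > 0)

# Sanity check: a pair of independent fair bits stored losslessly needs
# H(X1, X2) = 2 bits.
uniform_pair = {(a, b): 0.25 for a in (0, 1) for b in (0, 1)}
assert abs(entropy(uniform_pair) - 2.0) < 1e-12

# Hypothetical non-linear relation Y1 = X1 AND B, with X1 and B fair bits:
# the Slepian-Wolf storage needed for Y1 with X1 at the decoder is H(Y1 | X1).
joint_xy = {}
for x1 in (0, 1):
    for b in (0, 1):
        y1 = x1 & b
        joint_xy[(x1, y1)] = joint_xy.get((x1, y1), 0.0) + 0.25
print(conditional_entropy(joint_xy))  # prints 0.5: given X1 = 1, Y1 is a fair bit; given X1 = 0, Y1 = 0
```

With such helpers, the storage rates of any candidate joint distribution on (X_1, X_2, Y_1, Y_2) can be evaluated directly and compared against the linear benchmark ᾱ_l = 1.5.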
The key observation in establishing this theorem is that there are five descriptions in this setting; however, the retrieval and storage requirements place different constraints on different combinations of descriptions, and some descriptions can in fact be stored, recompressed, and then retrieved. Such compression and recompression may lead to storage savings. The description based on X_0 can be viewed as common information of X_1 and X_2, which allows us to trade off the storage and retrieval rates.
Proof of Theorem 2. Codebook generation: Codebooks are built using the MD codebooks based on the given distribution. Storage codes: The bin indices of the codebooks are stored at the two servers: those of X_0, X_1, and X_2 are stored at server-1 at rates α_1^(0), α_1^(1), and α_1^(2); those of Y_1 and Y_2 are stored at server-2 at rates α_2^(1) and α_2^(2). Note that at such rates, the codewords for X_0, X_1, and X_2 can be recovered jointly with overwhelming probability, and those for Y_1 and Y_2 can also be recovered jointly with overwhelming probability.
Retrieval codes: A different set of bin indices of the codebooks is retrieved during the retrieval process, again based on the MD* codebooks: those of X_0, X_1, and X_2 are retrieved from server-1 at rates β_1^(0), β_1^(1), and β_1^(2); those of Y_1 and Y_2 are retrieved from server-2 at rates β_2^(1) and β_2^(2). Note that at such rates, the codewords of X_0, X_1, and Y_1 can be jointly recovered such that, using the three corresponding codewords, the required V_1 source vector can be recovered with overwhelming probability. Similarly, the three remaining retrieval patterns (X_0, X_1, Y_2) → V_2, (X_0, X_2, Y_1) → V_2, and (X_0, X_2, Y_2) → V_1 will succeed with overwhelming probability.

Storage and retrieval rates:
The rates can be computed straightforwardly, after normalization by the parameter t.
Next we use it to prove Theorem 1.
We can use any explicit distribution (X_0, X_1, X_2, Y_1, Y_2) to obtain an explicit inner bound to R(t)_in, and the next corollary provides one such non-trivial bound. For convenience, we write the entropy function of a probability mass vector (p_1, …, p_t) as H(p_1, …, p_t). Proof. These tradeoff pairs are obtained by applying Corollary 1, choosing t = 1, setting (X_1, X_2, Y_1, Y_2) as given in (17), and letting X_0 be defined as in Table 1. Note that the joint distribution indeed satisfies the required Markov structure, and in this case α_2^(1) = α_2^(2).

Conclusion
We consider the problem of private information retrieval using a Shannon-theoretic approach. A new coding scheme based on random coding and binning is proposed, which reveals a hidden connection to the multiple description source coding problem. It is shown that for the (2, 2) PIR setting, this non-linear coding scheme provides the best known tradeoff between the retrieval rate and the storage rate, which is strictly better than that achievable using linear codes. We further investigate the relation between zero-error PIR codes and ε-error PIR codes in this setting, and show that the distinction does not cause any essential difference in this problem setting. We hope that the hidden connection to multiple description coding can provide a new avenue for designing more efficient PIR codes.