0% found this document useful (0 votes)
42 views26 pages

Papr 7

Uploaded by

ahsanbser67
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
42 views26 pages

Papr 7

Uploaded by

ahsanbser67
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 26

Capacity-Achieving Private Information Retrieval Codes from

MDS-Coded Databases with Minimum Message Size


Ruida Zhou, Chao Tian, Hua Sun, and Tie Liu

July 20, 2024


arXiv:1903.08229v2 [cs.IT] 22 Jan 2020

Abstract
We consider constructing capacity-achieving linear codes with minimum message size for
private information retrieval (PIR) from N non-colluding databases, where each message is coded
using maximum distance separable (MDS) codes, such that it can be recovered from accessing
the contents of any T databases. It is shown that the minimum message size (sometimes also
referred to as the sub-packetization factor) is significantly, in fact exponentially, lower than
previously believed. More precisely, when K > T / gcd(N, T ) where K is the total number
of messages in the system and gcd(·, ·) means the greatest common divisor, we establish, by
providing both novel code constructions and a matching converse, the minimum message size
as lcm(N − T, T ), where lcm(·, ·) means the least common multiple. On the other hand, when
K is small, we show that it is in fact possible to design codes with a message size even smaller
than lcm(N − T, T ).

1 Introduction
The problem of private information retrieval (PIR), since its introduction [1], has attracted signifi-
cant attention from researchers in the fields of theoretical computer science, cryptography, informa-
tion theory, and coding theory. In the classical PIR model, a user wishes to retrieve one of the K
available messages, from N non-colluding databases, each of which has a copy of these K messages.
User privacy needs to be preserved during message retrieval, which requires that the identity of the
desired message not be revealed to any single database. To accomplish the task efficiently, good
codes should be designed to download the least amount of data per-bit of desired message, the
inverse of which is referred to as the capacity of the PIR system. This capacity problem in the
classical setting was settled recently [2].
In practical systems, the databases may suffer from failures, and are also constrained on the
storage space. Erasure codes can be used to improve both storage efficiency and failure resistance.
This consideration motivated the investigation of PIR from MDS-coded databases [3–6], with coding
parameter (N, T ), i.e., the messages can be recovered by accessing any T databases. The capacity
of PIR from MDS-coded databases (MDS-PIR) was characterized [4] as
 K−1 !−1
T T
C = 1+ + ··· + . (1)
N N

In a given code, the smallest required number of symbols in each message is called the message
size L (sometimes also referred to as the sub-packetization factor), which is an important factor
impacting the practicality and efficiency of the code. A large message size implies that the message
(or the data file in practice systems) needs to be large for such code to be applicable, which

1
significantly restricts the possible usage scenarios. Moreover, a large message size also usually
implies that the encoding and the decoding functions are more complex, which not only requires
more engineering efforts to implement but also hinders the efficiency of the system operation. From
a theoretical point of view, a code with a smaller message size usually implies a more transparent
coding structure, which can be valuable for related problems; see, e.g., [7] for such an example.
Thus codes with a smaller message size are highly desirable in both theory and practice.
The capacity-achieving code given in [4] requires L = T N K , which can be extremely large
for a system with even a moderate number of messages. The problem of reducing the message
size of capacity-achieving codes was recently considered by Xu and Zhang [6], and it was shown
that under the assumption that all answers are of the same length, the message size must satisfy
L ≥ T (N/ gcd(N, T ))K−1 . These existing results may have left the impression that capacity-
achieving codes would necessitate a message size exponential in the number of messages.
In this work, we show that the minimum message size for capacity-achieving PIR codes can in
fact be significantly smaller than previously believed, by providing capacity-achieving linear codes
with message size L = lcm(N −T, T ). Two linear code constructions, referred to as Construction-A
and Construction-B, respectively, are given. The two constructions have the same download cost
and message size, however Construction-B has a better upload cost (i.e., a lower communication
cost for the user to send the queries), at the expense of being slightly more sophisticated than
Construction-A. The key difference between the two proposed constructions and existing codes in
the literature is that the proposed codes reduce the reliance on the so-called variety symmetry
[8], which should be distinguished from the asymmetry discussed in [9], and the answers may
be of different lengths1 . We further show that this is in fact the minimum message size when
K > T / gcd(N, T ), the proof of which requires a careful analysis of the converse proof of the
information-theoretic MDS-PIR capacity. Finally, we show that, when K is small, it is in fact
possible to design codes with a message size even smaller than lcm(N − T, T ).
The code constructions and converse proof reflect a reverse engineering approach which further
extends [8, 11]. Particularly, in [8], a similar approach was used to tackle the canonical PIR setting
with replicated databases and a capacity-achieving PIR code with the minimum message size and
upload cost was discovered, and in the current work the databases are instead MDS-coded. The
analysis technique and the code construction in the current work, however, are considerably more
involved due to the additional coding requirements and the several integer constraints.
The rest of the paper is organized as follows. In Section 2, a formal problem definition is given.
Construction-A and Construction-B are then given in Section 3 and Section 4, respectively, where
the correctness and performance are also proved and analyzed. The optimality of message size is
established by first identifying several critical properties of capacity-achieving codes in Section 5.1,
then lower-bounding the minimum message size when K > T / gcd(N, T ) in Section 5.2. A special
code is given in Section 5.3 to show that when K ≤ T / gcd(N, T ), the message size can be even
lower than lcm(N − T, T ). Finally, Section 6 concludes the paper. Several technical proofs are
relegated to the Appendices.
1
The download cost is measured in this work as the expected number of downloaded symbols (over all random
queries), which is in line with the prevailing approach in the literature when PIR capacity is concerned [2, 4], where
the download cost is viewed as being equivalent to certain entropy term. However, if we instead measure the download
cost by the maximum number of downloaded symbols (among all possible queries), which was the alternative and
more stringent approach used in [10] and [6], then the optimal minimum message sizes will need to be much larger.
In a sense, having the more stringent requirement that the maximum download cost needs to match the PIR capacity
forces certain symmetrization to be built in the code, which necessitates a significant increase in the message size.

2
2 System Model
There are a total of K mutually independent messages W 0 , W 1 , . . . , W K−1 in the system. Each
message is uniformly distributed over X L , i.e., the set of length-L sequences in the finite alphabet
X . All the messages can be collected and written as a single length-LK row vector W 0:K−1 . Each
message is MDS-coded and then distributed to N databases, such that from any T databases, the
messages can be fully recovered. Since the messages are (N, T ) MDS-coded, it is without loss of
generality to assume that L = M · T for some integer M .
∗ [k∗ ] [k∗ ] [k∗ ]
When a user wishes to retrieve a particular message W k , N queries Q0:N −1 = (Q0 , . . . , QN −1 )
[k∗ ]
are sent to the databases, where Qn is the query for database-n. The retrieval needs to be
information theoretically private, i.e., any database is not able to infer any knowledge as to which
message is being requested. For this purpose, a random key F in the set F is used together with
[k∗ ] [k∗ ]
the desired message index k ∗ to generate the set of queries Q0:N −1 . Each query Qn belongs to
[k∗ ]
the set of allowed queries for database-n, denoted as Qn . After receiving query Qn , database-n
[k∗ ]
responds with an answer An . Each symbol in the answers also belongs to the finite field X , and
[k∗ ]
the answers may have multiple (and different numbers of) symbols. Using the answers A0:N −1 from

all N databases, together with F and k ∗ , the user then reconstructs Ŵ k .
A more rigorous definition of the linear information retrieval process we consider in this work
can be specified by a set of coding matrices and functions as follows. For notational simplicity, we
denote the cardinality of a set A as |A|.
Definition 1. A linear private information retrieval code from linearly MDS-coded databases (a
linear MDS-PIR code) consists of the following coding components:
1. A set of MDS encoding matrices:
G̃n := diag(G̃0n , G̃1n , . . . , G̃K−1
n ),
n ∈ {0, 1, . . . , N − 1}, (2)

where G̃kn , k ∈ {0, 1, . . . , K − 1} is an L × M matrix in X for encoding message W k , i.e., each


message is not mixed with other messages during storage, and each G̃n encodes the messages
into the information to be stored at database-n, denoted as Vn = W 0:K−1 · G̃n ;
2. A set of MDS decoding recovery functions:
ΨT : X LK → X LK , (3)
for each T ⊆ {0, 1, . . . , N − 1} such that |T | = T , whose outputs are denoted as W̃T0:K−1 =
ΨT ({Vn : n ∈ T });
3. A query function
φn : {0, 1, . . . , K − 1} × F → Qn ,
n ∈ {0, 1, . . . , N − 1},
∗ [k∗ ]
i.e., for retrieving message W k , the user sends the query Qn = φn (k ∗ , F) to database-n;
4. An answer length function
`n : Qn → {0, 1, . . .}, n ∈ {0, 1, . . . , N − 1}, (4)
i.e., the length of the answer from each database, a non-negative integer, is a deterministic
function of the query, but not the particular realization of the messages;

3
5. An answer generating matrix

Ĝ(q
n
n)
∈ X M K×`n , qn ∈ Qn , n ∈ {0, 1, . . . , N − 1}, (5)
[k∗ ] (q ) (q ) [k∗ ]
i.e., the answer An = An n := Vn · Ĝn n , when qn = Qn is the query received by database-n;

6. A reconstruction function
N
Y −1
ψ: X `n × {0, 1, . . . , K − 1} × F → X L , (6)
n=0

∗ [k∗ ]
i.e., after receiving the answers, the user reconstructs the message as Ŵ k = ψ(A0:N −1 , k ∗ , F).

These functions satisfy the following three requirements:

1. MDS recoverable: For any T ⊆ {0, 1, . . . , N − 1} such that |T | = T , we have W̃T0:K−1 =


W 0:K−1 .
∗ ∗
2. Retrieval correctness: For any k ∗ ∈ {0, 1, . . . , K − 1}, we have Ŵ k = W k .

3. Privacy: For every k, k 0 ∈ {0, 1, . . . , K − 1}, n ∈ {0, 1, . . . , N − 1} and q ∈ Qn ,


0
Pr(Q[k] [k ]
n = q) = Pr(Qn = q). (7)

[k∗ ]
Note that Qn is in fact a random variable, since F is the random key. It follows that even
[k∗ ]
when the messages are viewed as deterministic, An is still not deterministic. In contrast, for

[k ] (q )
any specific query realization Qn = qn , the corresponding answer An n is deterministic when the

[k ] (q )
messages are viewed as deterministic. The distinction between An and An n is indicated by the
bracket [·] and the parenthesis (·).
In order to measure the performance of an MDS-PIR code, we consider the following two metrics,
with the focus on minimizing the latter while keeping the former optimal:

1. The retrieval rate, which is defined as


L
R := PN −1 . (8)
n=0 E(`n )

This is the number of bits of desired message information that can be privately retrieved per
bit of downloaded data. It was shown [4] that the maximum retrieval rate, i.e., the capacity
of such MDS-PIR systems, is as given in (1).

2. The message size L, which is the number of symbols to represent each individual message.
This quantity should be minimized, because in practical applications, a smaller message size
implies a more versatile code.

A third metric, the upload cost, is also of interest in practical systems (also particularly in
computer science literature, e.g., [1]), although it is not our main focus in this work. The upload
cost can be defined as
N
X −1
log2 |Qn |, (9)
n=0

4
which is roughly the total number of bits that the user needs to send to the servers during the
query phase.
We will need several more parameters before proceeding. Define p := gcd(N, T ), then

N − T = p · r, T = p · s, (10)

for some positive integers r and s, which are co-prime.

3 New MDS-PIR Code: Construction-A


In this section, we provide the first MDS-PIR code construction with message length L = lcm(N −
T, T ), which we refer to as Construction-A.

3.1 The Coding Components of Construction-A


Each message W k can be divided into M sub-messages, denoted as W k = (W k,0 , W k,1 , . . . , W k,M −1 ),
and each sub-message contains T symbols in the alphabet X . The construction relies on two novel
ingredients: a new indexing on the key (query) and the introduction of pseudo code symbols. The
two elements were not present in other constructions in the literature such as [2, 4, 6]. A simpler
version of these two ingredients were first used in [8] for replicated databases. The generalized
version used in the current work requires a more complex translation between indexing and the
answer, as well as the introduction of more than one pseudo code symbol.
The first novel ingredient in the construction, which is different from previous ones in the liter-
ature, is the random key F = (F0 , F1 , . . . , FK−1 ), which is a length-K vector uniformly distributed
in the set

F := (f0 , . . . , fK−1 ) ∈ {0, . . . , r + s − 1}K
K−1
! 
X
fk =0 , (11)
k=0 r+s

where (·)r+s indicates modulo (r + s). In this code construction, we need to first choose a (in
fact, any) linear (N, T )-MDS code C, in the alphabet X as our base code. There are many known
techniques to construct such codes, such as Reed-Solomon codes and Cauchy matrix based con-
structions; see [12]. The coding functions can then be given as follows:

1. Each sub-message W k,m , m = 0, 1, . . . , r − 1 and k = 0, 1, . . . , K − 1, is encoded by C into


k,m k,m
N coded symbols V0:N −1 = (V0 , V1k,m , . . . , VNk,m k,m
−1 ), with Vn = W k,m · G̃∗n ∈ X placed at
database-n, where G̃∗n is the n-th column of the T × N generator matrix of code C operated
on each sub-message, which produces the stored information at database-n.

2. The MDS decoding function is obvious which is naturally induced by that of C.

3. For any n ∈ {0, 1, . . . , N − 1}, the query generating function produces a length-K column
vector as

φn (k ∗ , F) = Q[k
n
]
=
(F0 , . . . , Fk∗ −1 , (Fk∗ + n)r+s , Fk∗ +1 , . . . , FK−1 )T . (12)

5
1,"
.D"1," .D#1," .D*&#

K
... 7∗

s s s
DB 0 DB 1 DB N-1

Figure 1: The queries to different databases are illustrated. The parts of the queries related to the
interference signals are of the same pattern. As a consequence, the induced interference signals in
the answers will have the same pattern, and T of them can be isolated to remove the interference
signals in all the answers.

4. Database-n first produces a K × s query matrix Q̃n


 ∗ 
Q̃n = Q[kn
]
· 1T
s + 1 K · [0, 1, . . . , s − 1] , (13)
r+s

where 1t is the all-one column vector of length t, and T indicates matrix transpose; the
element of Q̃n on the k-th row and i-th column is denoted as Q̃k,i
n . The query length function
is then defined as:
s−1 
X 
`n = 1 min Q̃k,i
n <r , (14)
k=0,1,...,K−1
i=0

where 1(·) is the indicator function, i.e., `n is the number of columns in Q̃n which has an
element less than r.

5. The second novel ingredient, which is different from previous ones in the literature, is the
introduction of pseudo code symbols and pseudo message symbols in the sub-messages: Vnk,i =
[k∗ ]
W k,i = 0 for i ≥ r. For n ∈ {0, 1, . . . , N − 1}, an intermediate answer vector Ãn of length-s
is formed as
K−1 K−1

M k,0 M k,1
Ã[k
n
]
:= Vnk,Q̃n , Vnk,Q̃n ,
k=0 k=0
K−1
!
k,Q̃k,s−1
M
..., Vn n
, (15)
k=0

each component of which is the finite field addition of some components of the vector Vn that
[k∗ ]
are indicated by the corresponding column of Q̃n . The eventual answer An of length `n is
[k∗ ]
formed by concatenating the components of Ãn which are not constantly zero, i.e., those
corresponding to the positions indicated in (14).
∗ ,i
6. For any i ∈ {0, 1, . . . , s − 1}, define the interference database set Ti := {n| Q̃nk ≥ r}. The

6
[k∗ ]
i-th component of Ãn , n ∈ Ti , can be written as
K−1 K−1
k,i M k,i
M 
Vnk,Q̃n = W k,Q̃n · G̃∗n
k=0 k=0
K−1
!
k,Q̃k,i ∗ ],i
M
= W n
· G̃∗n = W̄ [k · G̃∗n , n ∈ Ti ,
k=0

∗ ],i
where the length-T row vector W̄ [k is defined as
∗ −1
kM K−1
!
[k∗ ],i k,Q̃k,i k,Q̃k,i
M
W̄ := W n
⊕ W n
.
k=0 k=k∗ +1


Note that W̄ [k ],i is not a function of n, since Q̃k,i k,i ∗
n = Q̃n0 unless k = k . Thus as long as
[k ∗ ],i
|Ti | ≥ T , the vector W̄ can be fully recovered by the MDS property of the code C; see Fig.
[k∗ ]
1 for an illustration. Further note that the i-th component of Ãn for n ∈ {0, 1, . . . , N −1}\Ti
can be written as
  ∗

k∗ ,Q̃kn ,i

[k∗ ],i ∗ ∗
W̄ · G̃n ⊕ W · G̃n , (16)

∗ ],i
from which, since W̄ [k is known, we can recover
 ∗

k∗ ,Q̃kn ,i ∗
W · G̃n , n ∈ {0, 1, . . . , N − 1} \ Ti . (17)


n o ∗
Denote Nm := n Q̃kn ,i = m . As long as |Nm | ≥ T , we can recover the vector W k ,m by
again invoking the property of the MDS code C.

3.2 An Example for Construction-A


Let us first consider an example (N, T, K) = (3, 2, 3), which induces (p, r, s, L) = (1, 1, 2, 2) in the
code. We omit the index i since here r = 1. The possible queries Q0 , Q1 , and Q2 are listed in the
corresponding columns in Table 1. With a given query Qn , the expanded query Q̃n is given to its
right, the second column of which is by adding 1 to each component and then taking modulo-3, as
specified in Step-4 of the protocol. The answer An is then simply constructed by taking each column
of Q̃n , and forming the addition of the corresponding V symbols, by however, taking advantage of
the fact that Vnk = 0 whenever k ≥ 1.
Consider the case to retrieve message k ∗ = 1, and the key is F = (0, 1, 2)T . Then the queries
are

Q0 = (0, 1, 2)T , Q1 = (0, 2, 2)T , Q2 = (0, 0, 2)T . (18)

The corresponding queries (and query matrices such induced) and answers are marked bold in
Table 1. In these Q̃ matrices, each column has at least one element being 0, and thus the total
number of transmission symbols is 6. It is seen that from (V00 , V10 ), the symbol V20 can be recovered
by the MDS property, and thus V21 . Similarly, we can recover V11 . Using both (V11 , V21 ), we can
then recover the original message W1 by decoding the MDS code C.

7
Table 1: Queries and answers for (N, T, K) = (3, 2, 3).
database-0 database-1 database-2
Q Q̃ A0 Q Q̃ A1 Q Q̃ A2
 0  0  1  1  2  2
0 01 0 01 0 01
            
0 01 V 0 ⊕ V 1 ⊕ V 2 , ∅ 0 01 0 1
V1 ⊕ V1 , ∅ 0 01 V0 ⊕ V1 , V2
   0 0 0       2 2 2
0 01 1 12 2 20
        
0 01   0 01 0 01
            
1 12 V 0 , V2 1 12 V 0 ⊕ V 2, ∅ 1 12 V 0, ∅
   0 0    1 1    2
2 20 0 01 1 12
        
0 01   0 01 0 01
           
2 20 V 0, V 1 2 20 V0 , V1 ⊕ V2 2 20 V 0 ⊕ V 2 , V 1
   0 0    1 1 1    2 2 2
1 12 2 20 0 01
.. .. .. .. .. .. .. .. ..
. . . . . . . . .
        
2 20 2 20 2 20
             
2 20 ∅, V 0 ⊕ V 1 ⊕ V 2 2 20 V 2 , V 0 ⊕ V 1 2 20 0
∅, V2 ⊕ V2 1
   0 0 0    1 1 1   
2 20 0 01 1 12

3.3 Correctness, Privacy, and Download Cost


According to the last coding component function (the reconstruction function) in Construction-A,
the correctness of the proposed code hinges on two conditions: |Ti | ≥ T for all i = 0, 1, . . . , s − 1
and |Nm | ≥ T for all m = 0, 1, . . . , r − 1. We establish these two conditions in the following lemma,
whose proof can be found in Appendix A.

Lemma 1. In Construction-A, for any request of message-k ∗ and any random key F,

1. |Ti | = T for any i ∈ {0, 1, . . . , s − 1};

2. |Nm | = T for any m ∈ {0, 1, . . . , r − 1}.

We have the following main theorem for Construction-A.

Theorem 1. Codes obtained by Construction-A are both private and capacity-achieving.


[k∗ ]
Proof. The fact that the code is private is immediate, by observing that Qn is uniformly dis-
tributed on the set

Qn = (f0 , . . . , fK−1 )T ∈ {0, . . . , r + s − 1}K
K−1
! 
X
fk − n =0 , (19)
k=0 r+s

regardless of the value of k ∗ .

8
The expected lengths of the answers is
N
X −1 N
X −1 X
s−1  
E(`n ) = Pr min Q̃k,i
n <r
k=0,1,...,K−1
n=0 n=0 i=0
X NX
s−1 −1  
= Pr min Q̃k,i
n < r , (20)
k=0,1,...,K−1
i=0 n=0

assuming an arbitrary message k ∗ is being requested. The probabilities involved in the summand
i = i∗ depend on
∗ 0:K−1,i∗ 0:K−1,i∗
 
Q̃0:K−1,i
0 , Q̃1 , . . . , Q̃N −1 . (21)

By the definition of Q̃k,i


n , it is clear that if any item in

(F0 + i∗ , ..., Fk∗ −1 + i∗ , Fk∗ +1 + i∗ , . . . , FK−1 + i∗ )r+s



is less than r, then mink=0,1,...,K−1 Q̃k,i
n < r for all n = 0, 1, . . . , N − 1, which will induce N trans-
mitted symbols in the retrieval from all databases for i = i∗ ; this event E occurs with probability
1 − (s/(r + s))K−1 . On the other hand, when the event E does not occur, in the vector

(Fk∗ + i∗ + 0, Fk∗ + i∗ + 1, . . . , Fk∗ + i∗ + N − 1)r+s

the number of elements that are less than r is N − T , which induces (N − T ) symbols being
transmitted. Therefore
N
X −1
E(`n ) = s [Pr(E)N + (1 − Pr(E)) (N − T )]
n=0
 K−1 "  K #
T T
= sN − sT = sN 1 − , (22)
N N

from which it follows that the code is indeed capacity achieving, by taking into account (10).

The following lemma is also immediate, and we state it as a lemma below.

Lemma 2. The upload cost of Construction-A is N (K − 1) log [N/ gcd(N, T )].

Proof. Consider any k ∗ . By (12), we see that |Qn | = |F| = (N/ gcd(N, T ))K−1 , and it follows that
PN −1
the upload cost is n=0 log(|Qn |) = N (K − 1) log [N/ gcd(N, T )].

4 New MDS-PIR Code: Construction-B


In this section, we provide an alternative code construction, namely Construction-B. This construc-
tion requires a lower upload cost than Construction-A, however, it relies on two different coding
strategies for the two cases of high rate codes T ≥ N − T and low rate codes T ≤ N − T . The high
rate code construction is essentially built on a product code, while the low rate codes bear more
similarity to Construction-A.

9
4.1 Construction-B for T ≥ N − T
In this construction, the same random key F = (F0 , F1 , . . . , FK−1 ) as in Construction-A is used, and
the MDS encoding matrices and decoding functions are also exactly the same as in Construction-A.
We need a second generic (s, r)-MDS code Cc in the alphabet X in this construction. Construction-
B essentially utilizes a product code with row code C and column code Cc [12]. In this context, it
is more convenient to view the message W k as being represented as an r × T matrix, denoted as
W̆ k
W k,0
 
 W k,1 
W̆ k =  .  . (23)
 
 .. 
W k,r−1

Next we provide the coding components (3 − 6) in Construction-B.

3. The query generating function at server-n produces the following K × 1 query vector
∗ [k∗ ] [k∗ ] [k∗ ]
φn (k ∗ , F) = Q[k
n
]
= (Q0,n , Q1,n , . . . , QK−1,n )T
= d(F0 , F1 , . . . , Fk∗ −1 , (Fk∗ + n)s+r ,
Fk∗ +1 , . . . , FK−1 )T es , (24)

where dxes := min(x, s), and it operates on a vector by operating on each component indi-
vidually.

4. Define an s × (s + 1) query pattern matrix P as


 
1 1 ··· 1 0 0 ··· 0 0
1
 0 1 ··· 1 1 0 ··· 0 0 
1
P :=  . . , (25)
 
..
.. .. .. ..
 .. .. .. . . . 0 
1 1 ··· 1 0 0 ··· 0 1 0

where the first row has the first r elements being 1’s and the rest (s − r + 1) being 0’s, and
the remaining rows are obtained by cyclically shifting the first s elements in the first row but
keeping the last 0 in place. The query length function is then defined as
s−1 K−1
!
X X
`n = 1 Pi,Q[k∗ ] > 0 , (26)
k,n
i=0 k=0

[k∗ ]
i.e., it is the number of columns in the matrix P selected by the vector Qn that have non-zero
elements.

5. Recall that coded message W k at database-n is a length-r vector Vnk

Vnk,0 W k,0
   
 Vnk,1   W k,1
Vnk =  =  · G̃∗n . (27)
   
.. ..
 .    .
k,r−1 W k,r−1
Vn

10
In order to generate the answer, each Vnk vector is encoded by Cc into a length-s intermediate
code vector
Ṽnk = (Ṽn,0
k k
, Ṽn,1 k
, . . . , Ṽn,s−1 ) = (Ĝ∗ )T · Vnk , (28)

where Ĝ∗ is the generator matrix of the code Cc . An intermediate answer vector is then
produced
K−1 K−1

M M
Ã[k
n
]
:= Ṽnk,0 · P0,Q[k∗ ] , Ṽnk,1 · P1,Q[k∗ ] ,
k,n k.n
k=0 k=0
K−1
!T
M
··· , Ṽnk,s−1 · Ps−1,Q[k∗ ] . (29)
k,n
k=0
[k∗ ] [k∗ ]
The eventual answer An
of length `n is formed by concatenating the components of Ãn
which are not constantly zero, i.e., those indicated by (26).
6. For any i ∈ {0, 1, . . . , s − 1}, define the interference database set T̃i := {n| Pi,Q[k∗ ] = 0}. For
k∗ ,n

n ∈ T̃i , the i-th symbol in the intermediate answer is


K−1
[k∗ ]
M
Ãn,i = Ṽnk,i · Pi,Q[k∗ ]
k,n
k=0
K−1
M
= (Ĝ∗i )T · Vnk · Pi,Q[k∗ ]
k,n
k=0
K−1
M
= (Ĝ∗i )T · W̆ k · G̃∗n · Pi,Q[k∗ ]
k,n
k=0
K−1
M  
= (Ĝ∗i )T k
· W̆ · Pi,Q[k∗ ] · G̃∗n
k,n
k=0
K−1
!
M
= (Ĝ∗i )T · k
W̆ · Pi,Q[k∗ ] · G̃∗n
k,n
k=0
[k∗ ]
= (Ĝ∗i )T · W̄i · G̃∗n ,
[k∗ ]
where the r × T matrix W̄i is defined as
∗ −1
kM
!
[k∗ ]
W̄i := W̆ k · Pi,Q[k∗ ]
k,n
k=0
K−1
!
M
k
⊕ W̆ · Pi,Q[k∗ ] .
k,n
k=k∗ +1
[k∗ ] [k∗ ] [k∗ ]
Note that W̄i is not a function of n, since Qk,n = Qk,n0 unless k = k ∗ . Thus as long as
[k∗ ]
|T̃i | ≥ T , the vector (Ĝ∗i )T · W̄i can be fully recovered by the MDS property of the code C.
[k∗ ]
Further note that Ãn,i for n ∈ {0, 1, . . . , N − 1} \ T̃i can be written as
[k∗ ]
   ∗

(Ĝ∗i )T · W̄i · G̃∗n ⊕ (Ĝ∗i )T · W̆ k · G̃∗n , (30)

11
[k∗ ]
from which, since (Ĝ∗i )T · W̄i is known, we can recover

(Ĝ∗i )T · W̆ k · G̃∗n , n ∈ {0, 1, . . . , N − 1} \ T̃i . (31)
 
Denote Sn := i Pi,Q[k∗ ] = 1 and N := {n | |Sn | ≥ r}, the latter of which is the set of
k∗ ,n
databases that provide at least r symbols of the requested messages in the form (31). For

any n ∈ N , we can recover W̆ k · G̃∗n by invoking the property of MDS code Cc . Then as long

as |N | ≥ T , W̆ k can be fully recovered by invoking the MDS property of code C.

4.1.1 An Uncompressed Description of Construction-B


The description of the coding components above is in a compressed form, and offers little intuition.
The following equivalent description, on the other hand, can provide better intuition at the expense
of more redundant items. Let the extended pattern matrix P̄ of size s × (s + r) be defined as
P̄ = [P |0s×(r−1) ], (32)
i.e., expanding the original pattern matrix P by appending an all-0 matrix of size s × (r − 1). The
same query answer can now be equivalently produced at each server by using the following auxiliary
query

Q̄[k
n
]
= (F0 , F1 , . . . , Fk∗ −1 ,
(Fk∗ + n)s+r , Fk∗ +1 , . . . , FK−1 )T , (33)
i.e., without using the d·es function mapping, and then following the same manner in answer
generating, using the extended patter matrix P̃ . The stored contents of message W k across all the
databases can be visualized as follows
 k,0
Ṽ1k,0 · · · ṼNk,0

Ṽ0 −1
 Ṽ k,1 Ṽ1k,1 · · · ṼNk,1 
 0 −1 
 . .. .. .. 
 .. . . . 
 
 k,r−1 k,r−1 k,r−1 
Ṽ0 Ṽ1 · · · ṼN −1  , (34)
 k,r k,r k,r 
 Ṽ0 Ṽ1 · · · ṼN −1  
 .. .. .. .. 

 . . . . 
Ṽ0k,s−1 Ṽ1k,s−1 · · · ṼNk,s−1
−1

where each column corresponds to a database, and the contents below the horizontal line are not
stored physically, but can be generated as part of the answer computation.
[k∗ ]
It is straightforward to verify that with the auxiliary query Q̄n as the queries, the extended
pattern matrix P̄ in the answer generation, and the uncompressed stored contents Ṽ0k,0 as the
stored content, the answer is precisely the same as the uncompressed version described above.

4.1.2 An Example of Construction-B


Consider an example N = 5, T = 3, K = 4, which induces the parameters (p, r, s, L) = (1, 2, 3, 6).
The pattern matrix P and the extended pattern matrix P̄ are
   
1 1 0 0 1 1 0 0 0
P = 0 1 1 0 , P̄ = 0 1 1 0 0 . (35)
1 0 1 0 1 0 1 0 0

12
Let the messages be W 0:3 = (A, B, C, D). Consider the case when message W 0 = A is being
requested, and the key is F = (4, 1, 2). Then the auxiliary queries are
 
3 4 0 1 2
4 4 4 4 4
[Q̄0 , Q̄1 , Q̄2 , Q̄3 , Q̄4 ] = 
1 1 1 1 1 ,
 (36)
2 2 2 2 2

and the compressed queries are


 
3 3 0 1 2
3 3 3 3 3
[Q0 , Q1 , Q2 , Q3 , Q4 ] = 
1
. (37)
1 1 1 1
2 2 2 2 2

The answers from the five databases are then


C00 C10 A02 + C20 A03 + C30 C40
 
 C01 + D01 C11 + D11 C21 + D21 A13 + C31 + D31 A14 + C41 + D41  , (38)
D02 D12 A22 + D22 D32 A24 + D41
0,m
where we have used Am 0
n to denote Ṽn , as the corresponding coded message W = A, and similarly
for B, C, D. Observer that in the first row (C00 , C10 , C40 ) can be used to recover Cn0 , n = 0, 1, . . . , 4,
and thus to obtain (A02 , A03 ); similarly, in the second row and third row, we can recover information
on the A message. The information on A we can recover is thus as given in the following message
matrix where each column corresponds to a database

∗ ∗ A02 A03 ∗
 
 ∗ ∗ ∗ A13 A14  . (39)
2
∗ ∗ A2 ∗ A4 2

It is now straightforward to see that through the product code based on the (5, 3) MDS code C
and the (3, 2) MDS code Cc , the message A can be fully recovered.

4.1.3 Correctness, Privacy, and Communication Costs


Similar to Construction-A, the correctness of Construction-B relies on the following two facts
established as Lemma 3, whose proof can be found in Appendix A.

Lemma 3. In Construction-B, for any request of message-k ∗ and any random key F,

1. |T̃i | = T for any i ∈ {0, 1, . . . , s − 1};

2. |N | = T .

We also have the follow main theorem for Construction-B.

Theorem 2. The codes obtained by Construction-B for T ≥ N − T are both private and capacity-
achieving.

13
Proof. To see that the code is private, observe that for database-n, the auxiliary query vector
[k∗ ]
Q̄n follows a uniform distribution in the set defined in (19) for any requested message k ∗ ∈
[k∗ ]
{0, 1, . . . , K − 1}. The query Qn is obtained through an additional mapping d·es regardless of k ∗ ,
and thus the query at database-n follows the same distribution for all k ∗ .
The expected lengths of the answers can be written as
N −1 N −1 X
s−1 K−1
!
X X X
E(`n ) = Pr Pi,Q[k∗ ] > 0
k,n
n=0 n=0 i=0 k=0
X NX
s−1 −1 K−1
!
X
= Pr Pi,Q[k∗ ] > 0
k,n
i=0 n=0 k=0
X NX
s−1 −1 K−1
!
X
= Pr P̄i,Q̄[k∗ ] > 0 , (40)
k,n
i=0 n=0 k=0

assuming an arbitrary message k ∗ is being requested. The probabilities involved in the summand
for i = i∗ depend on the vector
 ∗
[k∗ ] [k∗ ]

[k ]
Q̄0:K−1,0 , Q̄0:K−1,1 , . . . , Q̄0:K−1,N −1 . (41)

[k∗ ]
By the definition of Q̄0:K−1,n , it is clear that if P̄i∗ ,Fk = 1 for any k ∈ {0, 1, . . . , k ∗ − 1, k ∗ +
1, . . . , K − 1}, then K−1
P
k=0 Pi,Q[k ] > 0 for all n = 0, 1, . . . , N − 1. This will induce N transmissions

k,n
in the retrieval from all databases for i = i∗ . This event, denoted as E, occurs with probability
[k∗ ]
1 − (s/(r + s))K−1 , since the i∗ -th row of the matrix P̄ has r entries of value 1, and Q̄k,n = Fk ,
k = 0, 1, . . . , k ∗ − 1, k ∗ + 1, . . . , K − 1, are mutually independent, identically distributed, and each
follows a uniform distribution on {0, 1, . . . , r + s − 1}.
On the other hand, when the event E does not occur, the vector

(Fk∗ + 0, Fk∗ + 1, . . . , Fk∗ + N − 1)r+s

is a permutation of the p-replicated vector of (0, 1, . . . , s + r − 1), and thus the number of elements
that satisfying P̄i∗ ,(Fk∗ +n)r+s = 1, n = 0, 1, . . . , N − 1, is exactly N − T , implying that (N − T )
symbols will be transmitted for i = i∗ . Therefore, we have
N
X −1
E(`n ) = s [Pr(E)N + (1 − Pr(E)) (N − T )]
n=0
 K−1 "  K #
T T
= sN − sT = sN 1 − , (42)
N N

from which it follows that the code is indeed capacity achieving.

Lemma 4. The upload cost of Construction-B for T ≥ N − T is upper-bounded by

min[N (K − 1) log(s + r), N K log(s + 1)].


[k∗ ]
Proof. For datebase-n, the query Qn has K symbols, and each symbol is from the alphabet
{0, 1, . . . , s}, and thus the the upload cost is clearly upper bounded by N log(s + 1)K . However,
observe that the query Qn is calculated by mapping the random key F with log |F| = N (K −

14
1) log(s+r) through an surjective function d·es , whose image size is upper bounded by log(s+r)K−1 .
[k∗ ]
If (s + 1)K > (s + r)K−1 , then we can simply use the auxiliary query Q̄n . Thus the upload cost
is at most the less of the two terms as given above.

If r > 1, the quantity N K log(s + 1) is clearly the less of the two when K is large, and thus
may lead to significant savings in terms of the upload cost.

4.2 Construction-B for T ≤ N − T


Here the same random key F = (F0 , F1 , . . . , FK−1 ) as in Construction-A is again used, and the
MDS encoding matrices and decoding functions are also exactly the same as in Construction-A.
The other components of the codes are as follows.
3. For any n ∈ {0, 1, . . . , N −1}, the query generating function produces a query with K symbols
∗ [k∗ ] [k∗ ] [k∗ ]
φn (k ∗ , F) = Q[k
n
]
= (Q0,n , Q1,n , . . . , QK−1,n )T =
d(F0 , . . . , Fk∗ −1 , (Fk∗ + n)s+r , Fk∗ +1 , . . . , FK−1 )T er . (43)

4. The query length function is then defined as


 
[k∗ ]
`n = s · 1 min Qk,n <r . (44)
k=0,...,K−1

5. Let Vnk,r = W k,r = 0. Database-n first produces a K ×s query matrix Q̃n for i = 0, 1, . . . , s−1

 r [k∗ ]
if Qk,n = r
Q̃k,i
n =
 ∗
[k ]
 . (45)
 Qk,n + i otherwise
r

[k∗ ]
For n ∈ {0, 1, . . . , N − 1}, an intermediate answer vector Ãn of length-s is formed (similar
to Construction-A) as
K−1 K−1
∗]
M k,0 M k,1
Ã[k
n := Vnk,Q̃n , Vnk,Q̃n ,
k=0 k=0
K−1
!T
k,Q̃k,s−1
M
..., Vn n
. (46)
k=0

[k∗ ] [k∗ ]
The eventual answer An of length `n is formed by concatenating the components of Ãn
which are not constantly zero, as indicated by (44).

6. The reconstruction function is the same as that of Construction-A, and the desired message
can be correctly reconstructed as long as |Ti | ≥ T and |Nm | ≥ T .
For better visualization, we can again consider the (uncompressed) auxiliary query

Q̄[k
n
]
= (F0 , F1 , . . . , Fk∗ −1 , (Fk∗ + n)s+r ,
Fk∗ +1 , . . . , FK−1 )T . (47)
[k∗ ] [k∗ ]
The query vector Qn is a compressed version of the auxiliary query Q̄n .

15
4.2.1 An Example for Construction-B
Consider an example N = 5, T = 2, K = 4, which induces the parameters (p, r, s, L) = (1, 3, 2, 6).
Let the messages be W 0:3 = (A, B, C, D). Consider the case when message W0 = A is being
requested, and the key is F = (4, 1, 2). Then the auxiliary queries and the queries are as given in
(36) and (37), respectively. The intermediate query matrix at all the databases are

[Q̃0 , Q̃1 , Q̃2 , Q̃3 , Q̃4 ]


 
3 3 3 3 0 1 1 2 2 0
 3 3 3 3 3 3 3 3 3 3 
=  1 2 1 2
. (48)
1 2 1 2 1 2 
2 0 2 0 2 0 2 0 2 0

The answers from the five databases are then


 1
C0 + D02 C11 + D12 A02 + C21 + D22 A13 + C31 + D32 A24 + C41 + D42

. (49)
C02 + D00 C12 + D10 A12 + C22 + D20 A23 + C32 + D30 A04 + C42 + D40

In the first row, (C01 + D02 , C11 + D12 ) can be used to recover (Cn1 + Dn2 ) for any n = 0, 1, . . . , 4,
using the MDS property of code C. Similarly, Cn2 + Dn0 can be recovered. Therefore, the following
information on the requested message A can be obtained

∗ ∗ A02 A13 A24


 
, (50)
∗ ∗ A12 A23 A04

from which message A can clearly be reconstructed.

4.2.2 Correctness, Privacy, and Communication Costs


The following lemma establishes the correctness of Construction-B when T ≤ N − T , the proof of
which can be found in Appendix A.
Lemma 5. In the construction above, for any request of message-k ∗ and any random key F,
1. |Ti | = T for any i ∈ {0, 1, . . . , s − 1};

2. |Nm | = T for any m ∈ {0, 1, . . . , r − 1}.


Theorem 3. Construction-B is both private and capacity-achieving for T ≤ N − T .
Proof. The fact that the code is private is immediate for the same reason for the case T ≥ N − T .
The expected length of the answers is
N −1 N −1  
[k∗ ]
X X
E(`n ) = s Pr min Qk,n <r ,
k=0,1,...,K−1
n=0 n=0

[k∗ ]
assuming an arbitrary message k ∗ is being requested. By the definition of Qn , if any item in

d(F0 , ..., Fk∗ −1 , Fk∗ +1 , . . . , FK−1 )r+s er


[k∗ ]
is less than r, then mink=0,1,...,K−1 Qk,n < r for all n = 0, 1, . . . , N − 1, which will induce sN
transmitted symbols in the retrieval from all databases; this event E occurs with probability 1 −
(s/(r + s))K−1 .

16
On the other hand, when the event E does not occur, in the vector

d(Fk∗ + 0, Fk∗ + 1, . . . , Fk∗ + N − 1)r+s er

the number of elements that are less than r is exactly N − T , which induces s(N − T ) symbols
being transmitted. Therefore
N
X −1
E(`n ) = Pr(E)sN + (1 − Pr(E)) s(N − T )
n=0
 K−1 "  K #
T T
= sN − sT = sN 1 − , (51)
N N

from which it follows that Construction-B is indeed capacity achieving.

Lemma 6. The upload cost of Construction-B for T ≤ N − T is upper-bounded by

min[N (K − 1) log(s + r), N K log(r + 1)].

The proof follows the same argument as that of Lemma 4, and it is omitted here for brevity.

5 Minimum Message Size for Capacity-Achieving Linear Codes


In this section, we establish the minimum message size as lcm(N − T, T ) when K is above a
threshold, then shows that it is in fact possible to use an even smaller message size when K is
below this threshold.

5.1 Properties of Capacity-Achieving Linear MDS-PIR Codes


In this section, we provide two key properties of capacity-achieving linear MDS-PIR codes, which
play an instrumental role in our study of the minimum message size.

Lemma 7. Any linear MDS-PIR code must have:


[k]
P0 For any T ⊆ {0, 1, . . . , N − 1} satisfying |T | = T , {An }n∈T are mutually independent, given
any subset of messages W 0:K−1 .

Lemma 8. Let π : {0, 1, . . . , K − 1} → {0, 1, . . . , K − 1} be a permutation function. We have for


any k = 0, 1, . . . , K − 2,
"N −1 #
X
N H(A[π(k)]
n | W π(0:k−1) , F) − L log |X |
n=0
N
X −1
≥T H(A[π(k+1)]
n | W π(0:k) , F). (52)
n=0

Moreover, for any linear MDS-PIR code for which the equality holds for any k and π(·) in (52),

let q0:N −1 be a combination of queries such that Pr(q0:N −1 ) > 0 for the retrieval of W k , then the
code must have:

17
P1 For any T ⊆ {0, 1, . . . , N − 1} such that |T | = T , and J ⊆ {0, 1, . . . , K − 1} satisfying k ∗ ∈ J

 
(q 0 )
H An0n , n0 ∈ T̄ | W J , An(qn ) , n ∈ T = 0,

where T̄ is the complement of T .

Property P0 is a direct consequence of the linear MDS-PIR code definition. Property P1 states
that the interference signals from the answers of any T databases in a capacity-achieving code can
fully determine all interference signals in other answers. The inequalities in Lemma 8 are the key
steps in deriving the capacity of MDS-PIR codes; the proofs of these properties are can be found in
Appendix B. Conversely, for any capacity-achieving linear MDS-PIR code, these inequalities must
hold with equality, implying the following theorem.

Theorem 4. Any capacity-achieving linear MDS-PIR code must have the properties P0 and P1.

Proof. Let π : {0, 1, ..., K − 1} → {0, 1, ..., K − 1} be a permutation. By applying Lemma 8


recursively, we can write
N −1
L log |X | X
≥ H(A[π(0)]
n | F) (53)
R
n=0
N −1
T X
≥ L log |X | + H(A[π(1)]
n | W π(0) , F) (54)
N
n=0
≥ ··· (55)
 K−1 !
T T
≥ L log |X | 1 + + ··· + , (56)
N N

and it follows that R ≥ C. For any capacity-achieving linear MDS-PIR code, all the inequalities
in Lemma 8 must be equality. Therefore, any capacity-achieving linear MDS-PIR code must have
properties P0 and P1.

A similar set of properties for capacity-achieving PIR codes on replicated databases was derived
in [8], which holds for a more general code class and in a more explicit form. For MDS coded
databases, firstly it is more meaningful to consider only linear codes, and secondly, it is not clear
whether the properties stated in Lemma 7 and 8 can be extended to the more general class of codes
considered in [8].

5.2 Bounding the Minimum Message Size


The main result of this section is the following theorem.

Theorem 5. When K > T / gcd(N, T ), the message size of any capacity-achieving linear MDS-PIR
code satisfies L ≥ lcm(N − T, T ).

From Theorem 5, we can conclude that codes obtained by Construction-A and Construction-B
indeed have the minimum message size when K > T / gcd(N, T ). The proof of Theorem 5 relies on
the delicate relation among a set of auxiliary quantities Hnk ’s and Ink ’s which we define next. For

18
any given capacity-achieving linear MDS-PIR code, let (k̆, f˘) be the maximizer for the following
optimization problem for n = 0:

max max H(A[k] k


n | W , F = f ). (57)
k=0,1,...,K−1 f ∈F

Define for k = 0, 1, . . . , K − 1 and n = 0, 1, . . . , N − 1,


[k̆] [k̆]
H(An | W k , F = f˘) k I(An ; W k | F = f˘)
Hnk := , In := .
log |X | log |X |

The following lemma implies that the optimization problem in (57) has the same maximizer for
all n ∈ {0, 1, . . . , N − 1}.

Lemma 9. For any capacity-achieving linear MDS-PIR code, ∀n0 6= n00 where n0 , n00 ∈ {0, 1, . . . , N −
1}, any k ∗ ∈ {0, 1, . . . , K − 1}, any f ∈ F,
[k∗ ] ∗ [k∗ ] ∗
H(An0 | W k , F = f ) = H(An00 | W k , F = f ).

This lemma also implies that we can define H k̆ := H0k̆ = . . . = HN k̆


−1 . The next two lemmas
establish a critical property of, and relevant relations between, Hn ’s and Ink ’s.
k

Lemma 10.
N
X −1
L − (N − T )H k̆ = Ink̆ . (58)
n=0

Lemma 11. For any k ∈ {0, 1, . . . , K − 1} and n ∈ {0, 1, . . . , N − 1}, Hnk and Ink are integers;
moreover
X
Hnk̆ ≥ Ink , (59)
k6=k̆

and when K > s,

H k̆ ≥ s (60)

The proofs of Lemma 9-11 are given in Appendix B. We are now ready to prove Theorem 5.

Proof of Theorem 5. When K > s, by (60) in Lemma 11 and Lemma 10, we have
N
X −1

L − (N − T )s ≥ L − (N − T )H = Ink̆ ≥ 0. (61)
n=0

Substituting L = M ps and (N − T ) = pr into the left hand side leads to the conclusion M ≥ r,
implying that L ≥ M T ≥ rT = lcm(N − T, T ).

19
5.3 Message Size Reduction for Small K
The following theorem confirms that for small K, it is in fact possible to construct a capacity-
achieving code with an even smaller message size.

Theorem 6. When K = 2 and T ≥ N − T , the minimum message size of capacity-achieving codes


is T .

Proof of Theorem 6. Since L = M T > 0, it is trivial to see that L ≥ T , and thus it only remains
to provide a construction of a capacity-achieving linear MDS-PIR code with such a message size.
Let database-n store two symbols Vn0 , Vn1 ∈ X , which are MDS-coded symbols of messages W 0

and W 1 , respectively. When the user wishes to retrieve message W k where k ∗ ∈ {0, 1}, two query
strategies are used.
T
• With probability N , randomly partition N databases into 3 disjoint sets G (0) , G (1) and G (2) ,
with |G | = N − T , |G (1) | = 2T − N and |G (2) | = N − T . The user requests Vn0 ⊕ Vn1 from
(0)

databases in G (0) , (Vn0 , Vn1 ) from those in G (1) , and Vn1−k in G (2) .
−T
• With probability NN , randomly partition N databases into 2 disjoint sets G (3) and G (4) ,

with |G (3) | = T and |G (4) | = N − T . The user requests Vnk from databases in G (3) , but no
information from those in G (4) .

It is straightforward to verify that the code is indeed correct, private, and capacity-achieving.

Theorem 6 provides a code construction with a small message size for the special case of K = 2
and T ≥ N − T , however, we suspect codes with small message sizes also exist for other parameters
when K is below the threshold, but they may require certain more sophisticated combinatorial
structure. A systematic approach to construct such codes and a converse result appear rather
difficult to find.

6 Conclusion
We proposed two code constructions for private information retrieval from MDS-coded databases
with message size lcm(N − T, T ), and show that this is the minimum message size for linear codes
when K is above a threshold. For smaller K it is in fact possible to design PIR codes with an even
smaller message size, which we show by a special example for K = 2. This work generalizes our
previous result on private information retrieval from replicated databases [8] in a highly non-trivial
manner. We expect the code constructions and the converse proof approach to be also applicable
and fruitful in other privacy-preserving primitives.
Independent of this work and inspired by our previous result [8], Zhu et al. [13] discovered
a different code construction, and also derived a lower bound on the message size similar to the
one reported here. All three code constructions have the same message size, however, the two
constructions we provided have a lower upload cost due to the queries being better compressed. It
is worth noting that in our previous work [8], the proposed code was also shown to be optimal in
terms of the upload cost, however, proving the proposed codes in the current work to be optimal
appears more difficult due to the more complex dependence stipulated by the MDS code.

20
A Proofs of Lemma 1, Lemma 3, and Lemma 5
∗ ,i
Proof of Lemma 1. Fix a particular k ∗ . It is convenient to represent the Q̃kn as a matrix given
below
 ∗ ∗ ∗ ,0 
Q̃k0 ,0 Q̃k1 ,0 . . . Q̃kN −1
 k∗ ,1 ∗ ∗ ,1 
 Q̃0 Q̃k1 ,1 . . . Q̃kN −1 
. (62)
 . .. .. .. 
 .
 . . . . 
∗ ∗ ∗ ,s−1
Q̃0k ,s−1 Q̃k1 ,s−1 . . . Q̃kN −1

Since Q̃nk ,i = (Fk∗ + i + n)s+r , N = p · (s + r), and T = p · s, for any i and any given realization

of key F, there are exactly T elements in Q̃k0:N,i−1 , which is a row of (62), that are greater than or
equal to r. This proves the first statement that |Ti | ≥ T . For the second statement, consider a
fixed m ∈ {0, 1, . . . , r + s − 1}. It is clear that each row in the matrix (62) has exactly p positions
being m. Therefore, there are a total of p · s = T elements in the matrix being m. Since for any
∗ ∗ 0 ∗
i 6= i0 where 0 ≤ i, i0 < s, we have Q̃kn ,i 6= Q̃kn ,i , due to Q̃kn ,i = (Fk∗ + i + n)s+r , these T positions
are all in different columns, implying |Nm | = T .

Proof of Lemma 3. For each fixed i ∈ {0, 1, 2, . . . , s − 1}, we have


   
T̃i = n Pi,Q[k∗ ] = 0 = n P̄i,Q̄[k∗ ] = 0 , (63)
k∗ ,n k∗ ,n

however, the vector


[k∗ ] [k∗ ] [k∗ ]
(Q̄k∗ ,0 , Q̄k∗ ,1 , . . . , Q̄k∗ ,n−1 ) (64)

is simply a permutation (in fact, a cyclic shift) of p-replicated vector of (0, 1, . . . , r + s − 1), and
thus
 
P̄i,Q̄[k ] , P̄i,Q̄[k ] , . . . , P̄i,Q̄[k ]
∗ ∗ ∗ (65)
k∗ ,0 k∗ ,1 k∗ ,N −1

is in fact is a permutation of [P̄i , P̄i , . . . , P̄i ], i.e., a p-replicated version of the i-th row of P̄ . It is
now clear that |T̃i | = T , because each row of P̄ has exactly s zeros, and the replication gives a total
of ps = T zeros in the vector in (65).
To see that |N | = T , let us focus on any single database-n such that |Sn | > 0, i.e., P̄i,Q̄[k∗ ] = 1
k∗ ,n
[k∗ ]
for some i ∈ {0, 1, . . . , s − 1}. This condition is equivalent to Q̄k∗ ,n < s, and moreover, it also
implies |Sn | = r because the vector
 T
P̄0,Q̄[k ] , P̄1,Q̄[k ] , . . . , P̄s−1,Q̄[k ]
∗ ∗ ∗ , (66)
k∗ ,n k∗ ,n k∗ ,n

[k∗ ]
is the j = Q̄k∗ ,n -th column of the extended pattern matrix P̄ , and for any j < s, such a column
has exact r positions being 1. It is also immediately clear that among the N databases, there are
[k∗ ]
a total of sp = T of them with Q̄k∗ ,n < s, again because the vector (64) is a permutation of the
p-replicated vector of (0, 1, . . . , r + s − 1).

21
[k∗ ]
n o
Proof of Lemma 5. Define T := n|Qk∗ ,n = r , and notice that in Construction-B, Ti = T for all
i = 0, 1, . . . , s − 1. Moreover
[k∗ ] [k∗ ]
n o n o
T = n Qk∗ ,n = r = n Q̄k∗ ,n ≥ r . (67)
Notice that the vector
[k∗ ] [k∗ ] [k∗ ]
 
Q̄k∗ ,0 , Q̄k∗ ,1 , . . . , Q̄k∗ ,N −1 (68)

is a permutation of p-replicated vector (0, 1, . . . , r + s − 1), and thus it has exactly ps = T items
that are greater or equal to r. This directly implies |T | = T .

By definition, for any Q̃kn ,i = m where m < r, to hold, we must have (Fk∗ + n)s+r < r. It is

also clear that Q̃kn ,i = ((Fk∗ + n)s+r + i)r . For any m ∈ {0, 1, . . . , r − 1}, there are in fact exactly

ps = T pairs of (n, i)’s such that Q̃kn ,i = m. This can be seen as follows: the vector
(Fk∗ + 0, Fk∗ + 1, . . . , Fk∗ + N − 1)s+r (69)
has rp = (N − T ) items less than r, and these items are a permutation of p-replicated vector
(0, 1, . . . , r − 1). The symmetry of these values indeed implies that there are (N − T )s/r = T such
(n, i) pairs for each m ∈ {0, 1, . . . , r − 1}. We next show that such pairs do not have any common
n. To see this, suppose there are two distinct pairs (n, i) and (n, i0 ), which satisfy
((Fk∗ + n)s+r + i)r = ((Fk∗ + n)s+r + i0 )r = m, (70)
for certain m ∈ {0, 1, . . . , r−1}. However, notice that 0 ≤ i, i0 < s ≤ r, we therefore must have i = i0 ,
which contradicts our supposition. It follows that indeed |Nm | = T for any m ∈ {0, 1, . . . , r−1}.

B Proofs of Lemmas 8-11


Proof of Lemma 8.
"N −1 #
X
N H(A[π(k)]
n | W π(0:k−1) , F) − L log |X | (71)
n=0
h i
[π(k)]
≥ N H(A0:N −1 | W π(0:k−1) , F) − L log |X | (72)
h
[π(k)]
= N H(A0:N −1 | W π(0:k−1) , F)
i
[π(k)]
− I(W π(k) ; A0:N −1 | W π(0:k−1) , F) (73)
[π(k)]
= N H(A0:N −1 | W π(0:k) , F) (74)
(b) N −1
[π(k)]
X
≥ H(Aρ((n:n+T −1)N ) | W π(0:k) , F) (75)
n=0
N −1 T −1
(c) X X [π(k)]
= H(A(n+s)N |W π(0:k) , F) (76)
n=0 s=0
"N −1 #
X
=T H(A[π(k)]
n |W π(0:k)
, F) (77)
n=0
"N −1 #
X
=T H(A[π(k+1)]
n | W π(0:k) , F) , (78)
n=0

22
where ρ : {0, 1, . . . , N −1} → {0, 1, . . . , N −1} in inequality (75) is a permutation over {0, 1, . . . , N −
1}. Equality (78) is due to the privacy constraint, which is

H(A[π(k)]
n | W π(0:k) , F) (79)
X
= Pr(F = f )H(A[π(k)]n | W π(0:k) , F = f ) (80)
f ∈F
X
= Pr(Qn = qn )H(A(q
n
n)
| W π(0:k) ) (81)
qn ∈Qn

= H(A[π(k+1)]
n | W π(0:k) , F) (82)

For the equality (b) to hold for an MDS-PIR code, i.e.,


[π(k)] [π(k)]
H(A0:N −1 | W π(0:k) , F) = H(Aρ((n:n+T −1))N | W π(0:k) , F).

the equality must hold for each F = f , which concludes the proof of property P1.

Proof of Lemma 9. Let T 0 , T 00 ⊆ {0, 1, . . . , N − 1} such that |T 0 | = |T 00 | = T , and n0 ∈ T 0 ,


T 00 = T 0 \ {n0 } ∪ {n00 }. By P1, any capacity-achieving linear code must have
[k∗ ] [k∗ ] ∗
H(AT 0 | AT 00 , W k , F = f )
[k∗ ] [k∗ ] ∗
= H(AT 00 | AT 0 , W k , F = f ) = 0, (83)

which implies
[k∗ ] ∗ [k∗ ] ∗
H(AT 0 | W k , F = f ) = H(AT 00 | W k , F = f ). (84)

Invoking P0 leads to
∗ ∗ ∗ ∗
X X
H(A[k
n
]
| Wk , F = f) = H(A[k
n
]
| W k , F = f ),
n∈T 0 n∈T 00

which further implies that


[k∗ ] ∗ [k∗ ] ∗
H(An0 | W k , F = f ) = H(An00 | W k , F = f ).

This completes the proof.

Proof of Lemma 10. For any capacity-achieving code, (71) must equal to (76). The equality should
also hold when F = f˘, k = 0 and π(0) = k̆. Substituting the definition of Hnk̆ and Ink̆ directly, we
have
"N −1 # N −1 T −1
X  XX
k̆ k̆
N H + In − L = H k̆ ,
n=0 n=0 s=0

which simplifies to the desired equality.

Proof of Lemma 11. Let qn = φn (k̆, f˘). The linearity of the code implies that
[k̆] (q )
H(An | W k , F = f˘) H(An n |W k )
Hnk = = (85)
log |X | log |X |

23
is an integer. Notice that Ink + Hnk is also an integer by the same argument, from which we conclude
that Ink is also an integer.
To see the inequality in (59), we write

H(A(qn) (qn )
n ) ≥ I(An ; W
0:K−1
)
K−1
X
= I(A(qn) k
n ; W |W
0:k−1
)
k=0
(a) K−1
X K−1
X
≥ I(An(qn ) ; W k ) = log |X | Ink , (86)
k=0 k=0

where (a) is because

I(A(qn) k
n ; W |W
0:k−1
)
= H(W k |W 0:k−1 ) − H(W k |W 0:k−1 , A(qn)
n )
≥ H(W k ) − H(W k |A(qn)
n )
= I(A(qn) k
n ; W ). (87)

It follows that

Hnk̆ log |X | = H(A(q n) k̆


n ) − In log |X |
X
≥ log |X | Ink . (88)
k6=k̆

Next we prove the inequality (60). First define the following query support set at database-n

Q∗n = {qn |qn = φn (k = 0, f ) for some f ∈ F}, (89)

and since the code is privacy preserving, it is also the query support set for all other k = 1, . . . , K −1.
We can then write

H k̆ log |X | = max max H(A[k] k


n | W , F = f)
k=0,1,...,K−1 f ∈F

= max max H(A(q


n
n)
| W k)
k=0,1,...,K−1 qn ∈Q∗n

≥ max∗ H(A(q
n
n)
| W k ) ≥ H(A[nk̆] | W k , F = f˘)
qn ∈Qn

= Hnk log |X | (90)

for any n = 0, 1, . . . , N − 1 and k = 0, 1, . . . , K − 1.


Plugging T = ps, N − T = pr, and L = M T into (58) gives
N
X −1
Ink̆ = L − (N − T )H k̆ = p[M s − rH k̆ ]. (91)
n=0

Since Ink̆ ≥ 0, the left hand side is strictly positive unless (N − T ) is a factor of L. If the left hand
side is zero, then L ≥ lcm(T, N − T ), implying H k̆ ≥ s. If (N − T ) is not a factor of L, implying

24
that the left hand side is indeed strictly positive, then there exists at least one n∗ ∈ {0, 1, . . . , N −1}
such that Ink̆∗ ≥ 1. For any k = 0, 1, . . . , K − 1,

[k̆]
Ink∗ + Hnk∗ = H(An∗ | F = f˘)/ log |X |, (92)

however, the right hand side of (92) is independent of k, and thus Ink∗ + Hnk∗ is in fact a constant.
Furthermore, H k̆ ≥ Hnk∗ by (90), it follows that Ink∗ ≥ Ink̆∗ for all k = 0, 1, . . . , K − 1. By (59), we
can write
X
Hnk̆∗ ≥ Ink∗ ≥ (K − 1)Ink̆∗ ≥ (K − 1) ≥ s, (93)
k6=k∗

when K > s.

References
[1] B. Chor, E. Kushilevitz, O. Goldreich, and M. Sudan, “Private information retrieval,” Journal
of the ACM (JACM), vol. 45, no. 6, pp. 965–981, Nov. 1998.

[2] H. Sun and S. A. Jafar, “The capacity of private information retrieval,” IEEE Transactions
on Information Theory, vol. 63, no. 7, pp. 4075–4088, Jul. 2017.

[3] N. B. Shah, K. Rashmi, and K. Ramchandran, “One extra bit of download ensures perfectly
private information retrieval,” in Information Theory (ISIT), 2014 IEEE International Sym-
posium on, 2014, pp. 856–860.

[4] K. Banawan and S. Ulukus, “The capacity of private information retrieval from coded
databases,” IEEE Transactions on Information Theory, vol. 64, no. 3, pp. 1945–1956, Mar.
2018.

[5] R. Tajeddine, O. W. Gnilke, and S. El Rouayheb, “Private information retrieval from MDS
coded data in distributed storage systems,” IEEE Transactions on Information Theory, vol. 64,
no. 11, pp. 7081–7093, Nov. 2018.

[6] J. Xu and Z. Zhang, “On sub-packetization and access number of capacity-achieving PIR
schemes for MDS coded non-colluding databases,” SCIENCE CHINA Information Sciences,
vol. 61, no. 7, pp. 100 306:1–100 306:16, 2018.

[7] H.-Y. Lin, S. Kumar, E. Rosnes, A. G. i Amat, and E. Yaakobi, “Weakly-private informa-
tion retrieval,” in 2019 Proceedings of IEEE International Symposium on Information Theory
(ISIT), 2019, pp. 1257–1261.

[8] C. Tian, H. Sun, and J. Chen, “Capacity-achieving private information retrieval codes with
optimal message size and upload cost,” IEEE Transactions on Information Theory, vol. 65,
no. 11, pp. 7613–7627, Nov. 2019.

[9] K. Banawan and S. Ulukus, “Asymmetry hurts: Private information retrieval under asym-
metric traffic constraints,” IEEE Transactions on Information Theory, vol. 65, no. 11, pp.
7628–7645, Nov. 2019.

25
[10] H. Sun and S. A. Jafar, “Optimal download cost of private information retrieval for arbitrary
message length,” IEEE Transactions on Information Forensics and Security, vol. 12, no. 12,
pp. 2920–2932, Dec. 2017.

[11] C. Tian, “Symmetry, outer bounds, and code constructions: A computer-aided investigation
on the fundamental limits of caching,” MDPI Entropy, vol. 20, no. 8, pp. 603.1–43, Aug. 2018.

[12] S. Lin and D. J. Costello, Error control coding, 2nd ed. Prentice Hall, 2004.

[13] J. Zhu, Q. Yan, C. Qi, and X. Tang, “A new capacity-achieving private information retrieval
scheme with (almost) optimal file length for coded servers,” arXiv preprint arXiv:1903.06921,
2019.

26

You might also like