Paper 7
Abstract
We consider constructing capacity-achieving linear codes with minimum message size for
private information retrieval (PIR) from N non-colluding databases, where each message is coded
using maximum distance separable (MDS) codes, such that it can be recovered from accessing
the contents of any T databases. It is shown that the minimum message size (sometimes also
referred to as the sub-packetization factor) is significantly, in fact exponentially, lower than
previously believed. More precisely, when K > T / gcd(N, T ) where K is the total number
of messages in the system and gcd(·, ·) means the greatest common divisor, we establish, by
providing both novel code constructions and a matching converse, the minimum message size
as lcm(N − T, T ), where lcm(·, ·) means the least common multiple. On the other hand, when
K is small, we show that it is in fact possible to design codes with a message size even smaller
than lcm(N − T, T ).
1 Introduction
The problem of private information retrieval (PIR), since its introduction [1], has attracted signifi-
cant attention from researchers in the fields of theoretical computer science, cryptography, informa-
tion theory, and coding theory. In the classical PIR model, a user wishes to retrieve one of the K
available messages, from N non-colluding databases, each of which has a copy of these K messages.
User privacy needs to be preserved during message retrieval, which requires that the identity of the
desired message not be revealed to any single database. To accomplish the task efficiently, good
codes should be designed to download the least amount of data per-bit of desired message, the
inverse of which is referred to as the capacity of the PIR system. This capacity problem in the
classical setting was settled recently [2].
In practical systems, the databases may suffer from failures, and are also constrained on the
storage space. Erasure codes can be used to improve both storage efficiency and failure resistance.
This consideration motivated the investigation of PIR from MDS-coded databases [3–6], with coding
parameter (N, T ), i.e., the messages can be recovered by accessing any T databases. The capacity
of PIR from MDS-coded databases (MDS-PIR) was characterized [4] as
C = ( 1 + T/N + · · · + (T/N)^{K−1} )^{−1}. (1)
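For concreteness, the capacity expression in (1) can be evaluated with exact rational arithmetic; the following small sketch (ours, not part of the constructions in this paper) implements it directly.

```python
from fractions import Fraction

def mds_pir_capacity(N: int, T: int, K: int) -> Fraction:
    """MDS-PIR capacity (1): C = (1 + T/N + ... + (T/N)^(K-1))^(-1)."""
    ratio = Fraction(T, N)
    return 1 / sum(ratio ** k for k in range(K))
```

For instance, N = 3, T = 2, K = 3 gives C = (1 + 2/3 + 4/9)^{−1} = 9/19.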
In a given code, the smallest required number of symbols in each message is called the message
size L (sometimes also referred to as the sub-packetization factor), which is an important factor
impacting the practicality and efficiency of the code. A large message size implies that the message
(or the data file in practical systems) needs to be large for such a code to be applicable, which
significantly restricts the possible usage scenarios. Moreover, a large message size also usually
implies that the encoding and the decoding functions are more complex, which not only requires
more engineering efforts to implement but also hinders the efficiency of the system operation. From
a theoretical point of view, a code with a smaller message size usually implies a more transparent
coding structure, which can be valuable for related problems; see, e.g., [7] for such an example.
Thus codes with a smaller message size are highly desirable in both theory and practice.
The capacity-achieving code given in [4] requires L = T · N^K, which can be extremely large
for a system with even a moderate number of messages. The problem of reducing the message
size of capacity-achieving codes was recently considered by Xu and Zhang [6], and it was shown
that under the assumption that all answers are of the same length, the message size must satisfy
L ≥ T (N/gcd(N, T))^{K−1}. These existing results may have left the impression that capacity-
achieving codes would necessitate a message size exponential in the number of messages.
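To see the scale of the gap, the three quantities can be compared numerically; the following sketch (ours, with function names of our own choosing) encodes the formulas quoted above.

```python
from math import gcd, lcm

def size_prior_scheme(N: int, T: int, K: int) -> int:
    """Message size of the capacity-achieving code in [4]: T * N^K."""
    return T * N ** K

def size_equal_length_bound(N: int, T: int, K: int) -> int:
    """Xu-Zhang lower bound under equal-length answers:
    T * (N / gcd(N, T))^(K-1)."""
    return T * (N // gcd(N, T)) ** (K - 1)

def size_this_work(N: int, T: int) -> int:
    """Minimum message size established in this work: lcm(N - T, T)."""
    return lcm(N - T, T)
```

For N = 3, T = 2, K = 3 these evaluate to 54, 18, and 2, respectively; the first two grow exponentially in K while the last does not depend on K at all.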
In this work, we show that the minimum message size for capacity-achieving PIR codes can in
fact be significantly smaller than previously believed, by providing capacity-achieving linear codes
with message size L = lcm(N −T, T ). Two linear code constructions, referred to as Construction-A
and Construction-B, respectively, are given. The two constructions have the same download cost
and message size, however Construction-B has a better upload cost (i.e., a lower communication
cost for the user to send the queries), at the expense of being slightly more sophisticated than
Construction-A. The key difference between the two proposed constructions and existing codes in
the literature is that the proposed codes reduce the reliance on the so-called variety symmetry
[8], which should be distinguished from the asymmetry discussed in [9], and the answers may
be of different lengths1 . We further show that this is in fact the minimum message size when
K > T / gcd(N, T ), the proof of which requires a careful analysis of the converse proof of the
information-theoretic MDS-PIR capacity. Finally, we show that, when K is small, it is in fact
possible to design codes with a message size even smaller than lcm(N − T, T ).
The code constructions and converse proof reflect a reverse engineering approach which further
extends [8, 11]. Particularly, in [8], a similar approach was used to tackle the canonical PIR setting
with replicated databases and a capacity-achieving PIR code with the minimum message size and
upload cost was discovered, and in the current work the databases are instead MDS-coded. The
analysis technique and the code construction in the current work, however, are considerably more
involved due to the additional coding requirements and the several integer constraints.
The rest of the paper is organized as follows. In Section 2, a formal problem definition is given.
Construction-A and Construction-B are then given in Section 3 and Section 4, respectively, where
the correctness and performance are also proved and analyzed. The optimality of message size is
established by first identifying several critical properties of capacity-achieving codes in Section 5.1,
then lower-bounding the minimum message size when K > T / gcd(N, T ) in Section 5.2. A special
code is given in Section 5.3 to show that when K ≤ T / gcd(N, T ), the message size can be even
lower than lcm(N − T, T ). Finally, Section 6 concludes the paper. Several technical proofs are
relegated to the Appendices.
¹ The download cost is measured in this work as the expected number of downloaded symbols (over all random
queries), which is in line with the prevailing approach in the literature when PIR capacity is concerned [2, 4], where
the download cost is viewed as being equivalent to certain entropy term. However, if we instead measure the download
cost by the maximum number of downloaded symbols (among all possible queries), which was the alternative and
more stringent approach used in [10] and [6], then the optimal minimum message sizes will need to be much larger.
In a sense, having the more stringent requirement that the maximum download cost needs to match the PIR capacity
forces certain symmetrization to be built in the code, which necessitates a significant increase in the message size.
2 System Model
There are a total of K mutually independent messages W^0, W^1, . . . , W^{K−1} in the system. Each
message is uniformly distributed over X^L, i.e., the set of length-L sequences in the finite alphabet
X. All the messages can be collected and written as a single length-LK row vector W^{0:K−1}. Each
message is MDS-coded and then distributed to N databases, such that from any T databases, the
messages can be fully recovered. Since the messages are (N, T) MDS-coded, it is without loss of
generality to assume that L = M · T for some integer M.
When a user wishes to retrieve a particular message W^{k*}, N queries Q_{0:N−1}^{[k*]} = (Q_0^{[k*]}, . . . , Q_{N−1}^{[k*]})
are sent to the databases, where Q_n^{[k*]} is the query for database-n. The retrieval needs to be
information-theoretically private, i.e., no database is able to infer any knowledge as to which
message is being requested. For this purpose, a random key F in the set F is used together with
the desired message index k* to generate the set of queries Q_{0:N−1}^{[k*]}. Each query Q_n^{[k*]} belongs to
the set of allowed queries for database-n, denoted as Q_n. After receiving query Q_n^{[k*]}, database-n
responds with an answer A_n^{[k*]}. Each symbol in the answers also belongs to the finite field X, and
the answers may have multiple (and different numbers of) symbols. Using the answers A_{0:N−1}^{[k*]} from
all N databases, together with F and k*, the user then reconstructs Ŵ^{k*}.
A more rigorous definition of the linear information retrieval process we consider in this work
can be specified by a set of coding matrices and functions as follows. For notational simplicity, we
denote the cardinality of a set A as |A|.
Definition 1. A linear private information retrieval code from linearly MDS-coded databases (a
linear MDS-PIR code) consists of the following coding components:
1. A set of MDS encoding matrices:
G̃_n := diag(G̃_n^0, G̃_n^1, . . . , G̃_n^{K−1}), n ∈ {0, 1, . . . , N − 1}, (2)
5. An answer generating matrix

Ĝ_n^{(q_n)} ∈ X^{MK×ℓ_n}, q_n ∈ Q_n, n ∈ {0, 1, . . . , N − 1}, (5)

i.e., the answer A_n^{[k*]} = A_n^{(q_n)} := V_n · Ĝ_n^{(q_n)}, when q_n = Q_n^{[k*]} is the query received by database-n;
6. A reconstruction function

ψ : ∏_{n=0}^{N−1} X^{ℓ_n} × {0, 1, . . . , K − 1} × F → X^L, (6)

i.e., after receiving the answers, the user reconstructs the message as Ŵ^{k*} = ψ(A_{0:N−1}^{[k*]}, k*, F).
Note that Q_n^{[k*]} is in fact a random variable, since F is the random key. It follows that even
when the messages are viewed as deterministic, A_n^{[k*]} is still not deterministic. In contrast, for
any specific query realization Q_n^{[k*]} = q_n, the corresponding answer A_n^{(q_n)} is deterministic when the
messages are viewed as deterministic. The distinction between A_n^{[k*]} and A_n^{(q_n)} is indicated by the
bracket [·] and the parenthesis (·).
In order to measure the performance of an MDS-PIR code, we consider the following two metrics,
with the focus on minimizing the latter while keeping the former optimal:

1. The retrieval rate, which is the number of bits of desired message information that can be
privately retrieved per bit of downloaded data. It was shown [4] that the maximum retrieval
rate, i.e., the capacity of such MDS-PIR systems, is as given in (1).
2. The message size L, which is the number of symbols to represent each individual message.
This quantity should be minimized, because in practical applications, a smaller message size
implies a more versatile code.
A third metric, the upload cost, is also of interest in practical systems (also particularly in
computer science literature, e.g., [1]), although it is not our main focus in this work. The upload
cost can be defined as
∑_{n=0}^{N−1} log₂ |Q_n|, (9)
n=0
which is roughly the total number of bits that the user needs to send to the servers during the
query phase.
We will need several more parameters before proceeding. Define p := gcd(N, T), then

N − T = p · r, T = p · s, (10)

where (·)_{r+s} indicates modulo (r + s). In this code construction, we need to first choose a (in
fact, any) linear (N, T)-MDS code C in the alphabet X as our base code. There are many known
techniques to construct such codes, such as Reed-Solomon codes and Cauchy-matrix-based
constructions; see [12]. The coding functions can then be given as follows:
3. For any n ∈ {0, 1, . . . , N − 1}, the query generating function produces a length-K column
vector as

φ_n(k*, F) = Q_n^{[k*]} = (F_0, . . . , F_{k*−1}, (F_{k*} + n)_{r+s}, F_{k*+1}, . . . , F_{K−1})^T. (12)
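The parameter relations in (10) and the query rule (12) can be sketched in a few lines of code (an illustrative sketch of ours; the key F is taken to have K coordinates, consistent with (12)).

```python
from math import gcd, lcm

def construction_a_parameters(N: int, T: int):
    """Parameters of (10): p = gcd(N, T), N - T = p*r, T = p*s,
    and message size L = lcm(N - T, T) = p*r*s (so M = L/T = r)."""
    p = gcd(N, T)
    r, s = (N - T) // p, T // p
    L = lcm(N - T, T)
    return p, r, s, L

def construction_a_queries(k_star, F, N, r, s):
    """Query rule (12): database-n receives the key F with coordinate
    k* replaced by (F[k*] + n) mod (r + s)."""
    queries = []
    for n in range(N):
        q = list(F)
        q[k_star] = (F[k_star] + n) % (r + s)
        queries.append(q)
    return queries
```

For example, (N, T) = (5, 2) yields (p, r, s, L) = (1, 3, 2, 6), matching the parameters of the example in Section 4.2.1.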
Figure 1: The queries to different databases are illustrated. The parts of the queries related to the
interference signals are of the same pattern. As a consequence, the induced interference signals in
the answers will have the same pattern, and T of them can be isolated to remove the interference
signals in all the answers.
where 1_t is the all-one column vector of length t, and T indicates matrix transpose; the
element of Q̃_n on the k-th row and i-th column is denoted as Q̃_n^{k,i}. The query length function
is then defined as

ℓ_n = ∑_{i=0}^{s−1} 1( min_{k=0,1,...,K−1} Q̃_n^{k,i} < r ), (14)

where 1(·) is the indicator function, i.e., ℓ_n is the number of columns in Q̃_n which have an
element less than r.
5. The second novel ingredient, which is different from previous ones in the literature, is the
introduction of pseudo code symbols and pseudo message symbols in the sub-messages: V_n^{k,i} =
W^{k,i} = 0 for i ≥ r. For n ∈ {0, 1, . . . , N − 1}, an intermediate answer vector Ã_n^{[k*]} of length s
is formed as

Ã_n^{[k*]} := ( ⊕_{k=0}^{K−1} V_n^{k,Q̃_n^{k,0}}, ⊕_{k=0}^{K−1} V_n^{k,Q̃_n^{k,1}}, . . . , ⊕_{k=0}^{K−1} V_n^{k,Q̃_n^{k,s−1}} ), (15)

each component of which is the finite field addition of some components of the vector V_n that
are indicated by the corresponding column of Q̃_n. The eventual answer A_n^{[k*]} of length ℓ_n is
formed by concatenating the components of Ã_n^{[k*]} which are not constantly zero, i.e., those
corresponding to the positions indicated in (14).
6. For any i ∈ {0, 1, . . . , s − 1}, define the interference database set T_i := {n | Q̃_n^{k*,i} ≥ r}. The
i-th component of Ã_n^{[k*]}, n ∈ T_i, can be written as

⊕_{k=0}^{K−1} V_n^{k,Q̃_n^{k,i}} = ⊕_{k=0}^{K−1} W^{k,Q̃_n^{k,i}} · G̃_n^* = ( ⊕_{k=0}^{K−1} W^{k,Q̃_n^{k,i}} ) · G̃_n^* = W̄^{[k*],i} · G̃_n^*, n ∈ T_i,

where the length-T row vector W̄^{[k*],i} is defined as

W̄^{[k*],i} := ( ⊕_{k=0}^{k*−1} W^{k,Q̃_n^{k,i}} ) ⊕ ( ⊕_{k=k*+1}^{K−1} W^{k,Q̃_n^{k,i}} ).

Note that W̄^{[k*],i} is not a function of n, since Q̃_n^{k,i} = Q̃_{n'}^{k,i} unless k = k*. Thus as long as
|T_i| ≥ T, the vector W̄^{[k*],i} can be fully recovered by the MDS property of the code C; see Fig.
1 for an illustration. Further note that the i-th component of Ã_n^{[k*]} for n ∈ {0, 1, . . . , N − 1} \ T_i
can be written as

W̄^{[k*],i} · G̃_n^* ⊕ W^{k*,Q̃_n^{k*,i}} · G̃_n^*, (16)

from which, since W̄^{[k*],i} is known, we can recover

W^{k*,Q̃_n^{k*,i}} · G̃_n^*, n ∈ {0, 1, . . . , N − 1} \ T_i. (17)

Denote N_m := {n | Q̃_n^{k*,i} = m}. As long as |N_m| ≥ T, we can recover the vector W^{k*,m} by
again invoking the property of the MDS code C.
The corresponding queries (and the query matrices thus induced) and answers are marked in bold in
Table 1. In these Q̃ matrices, each column has at least one element equal to 0, and thus the total
number of transmitted symbols is 6. It is seen that from (V_0^0, V_1^0), the symbol V_2^0 can be recovered
by the MDS property, and thus also V_2^1. Similarly, we can recover V_1^1. Using both (V_1^1, V_2^1), we can
then recover the original message W^1 by decoding the MDS code C.
Table 1: Queries and answers for (N, T, K) = (3, 2, 3).
database-0 | database-1 | database-2
Q, Q̃, A_0 | Q, Q̃, A_1 | Q, Q̃, A_2
[Table body: for each database-n, the table lists every possible query vector Q, the induced query
matrix Q̃, and the corresponding answer A_n; for example, under Q = (0, 0, 0)^T at database-0, Q̃
has rows (0, 1), (0, 1), (0, 1) and the answer is (V_0^0 ⊕ V_0^1 ⊕ V_0^2, ∅).]
Lemma 1. In Construction-A, for any request of message-k ∗ and any random key F,
The expected length of the answers is

∑_{n=0}^{N−1} E(ℓ_n) = ∑_{n=0}^{N−1} ∑_{i=0}^{s−1} Pr( min_{k=0,1,...,K−1} Q̃_n^{k,i} < r )
= ∑_{i=0}^{s−1} ∑_{n=0}^{N−1} Pr( min_{k=0,1,...,K−1} Q̃_n^{k,i} < r ), (20)

assuming an arbitrary message k* is being requested. The probabilities involved in the summand
i = i* depend on

( Q̃_0^{0:K−1,i*}, Q̃_1^{0:K−1,i*}, . . . , Q̃_{N−1}^{0:K−1,i*} ). (21)
the number of elements that are less than r is N − T, which induces (N − T) symbols being
transmitted. Therefore

∑_{n=0}^{N−1} E(ℓ_n) = s [Pr(E)N + (1 − Pr(E)) (N − T)] = sN − sT (T/N)^{K−1} = sN [1 − (T/N)^K], (22)

from which it follows that the code is indeed capacity achieving, by taking into account (10).
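The counting in (20)-(22) can be double-checked numerically; the following sketch (ours, using exact rational arithmetic) confirms that the message size lcm(N − T, T) divided by the expected download in (22) equals the capacity in (1).

```python
from fractions import Fraction
from math import gcd, lcm

def expected_download(N: int, T: int, K: int) -> Fraction:
    """Right-hand side of (22): s*N*(1 - (T/N)^K) downloaded symbols."""
    s = T // gcd(N, T)
    return s * N * (1 - Fraction(T, N) ** K)

def achieved_rate(N: int, T: int, K: int) -> Fraction:
    """Rate of Construction-A: message size lcm(N-T, T) over (22)."""
    return lcm(N - T, T) / expected_download(N, T, K)

def capacity(N: int, T: int, K: int) -> Fraction:
    """Capacity (1) in closed form: (1 - T/N) / (1 - (T/N)^K)."""
    t = Fraction(T, N)
    return (1 - t) / (1 - t ** K)
```

For example, (N, T, K) = (3, 2, 3) gives an achieved rate of 9/19, exactly the capacity.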
Proof. Consider any k*. By (12), we see that |Q_n| = |F| = (N/gcd(N, T))^{K−1}, and it follows that
the upload cost is ∑_{n=0}^{N−1} log(|Q_n|) = N(K − 1) log [N/gcd(N, T)].
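The resulting upload cost can be computed directly (a small sketch of ours):

```python
from math import gcd, log2

def upload_cost_construction_a(N: int, T: int, K: int) -> float:
    """Upload cost of Construction-A in bits: each of the N queries is
    drawn from a set of size (N/gcd(N,T))^(K-1), giving
    N*(K-1)*log2(N/gcd(N,T)) bits in total."""
    return N * (K - 1) * log2(N // gcd(N, T))
```

E.g., for (N, T, K) = (4, 2, 5) this evaluates to 16 bits per retrieval.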
4.1 Construction-B for T ≥ N − T
In this construction, the same random key F = (F0 , F1 , . . . , FK−1 ) as in Construction-A is used, and
the MDS encoding matrices and decoding functions are also exactly the same as in Construction-A.
We need a second generic (s, r)-MDS code C_c in the alphabet X in this construction. Construction-B
essentially utilizes a product code with row code C and column code C_c [12]. In this context, it
is more convenient to view the message W^k as being represented as an r × T matrix, denoted as
W̆^k:

W̆^k = [ W^{k,0} ; W^{k,1} ; . . . ; W^{k,r−1} ]. (23)
3. The query generating function at server-n produces the following K × 1 query vector

φ_n(k*, F) = Q_n^{[k*]} = (Q_{0,n}^{[k*]}, Q_{1,n}^{[k*]}, . . . , Q_{K−1,n}^{[k*]})^T
= ⌈(F_0, F_1, . . . , F_{k*−1}, (F_{k*} + n)_{s+r}, F_{k*+1}, . . . , F_{K−1})^T⌉_s, (24)

where ⌈x⌉_s := min(x, s), and it operates on a vector by operating on each component individually.
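The componentwise map ⌈·⌉_s = min(·, s) in (24) compresses the auxiliary query; a minimal sketch (ours):

```python
def clamp_query(aux_query, s):
    """Construction-B query (24): apply x -> min(x, s) to every
    coordinate of the auxiliary query vector."""
    return [min(x, s) for x in aux_query]
```

Since each compressed coordinate lies in {0, . . . , s}, a query can be described with at most K log₂(s + 1) bits, which is the source of the upload-cost saving discussed later.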
where the first row has the first r elements being 1's and the rest (s − r + 1) being 0's, and
the remaining rows are obtained by cyclically shifting the first s elements in the first row while
keeping the last 0 in place. The query length function is then defined as

ℓ_n = ∑_{i=0}^{s−1} 1( ∑_{k=0}^{K−1} P_{i,Q_{k,n}^{[k*]}} > 0 ), (26)

i.e., it is the number of columns in the matrix P selected by the vector Q_n^{[k*]} that have non-zero
elements.
V_n^k = [ V_n^{k,0} ; V_n^{k,1} ; . . . ; V_n^{k,r−1} ] = [ W^{k,0} ; W^{k,1} ; . . . ; W^{k,r−1} ] · G̃_n^*. (27)
In order to generate the answer, each V_n^k vector is encoded by C_c into a length-s intermediate
code vector

Ṽ_n^k = (Ṽ_{n,0}^k, Ṽ_{n,1}^k, . . . , Ṽ_{n,s−1}^k) = (Ĝ*)^T · V_n^k, (28)

where Ĝ* is the generator matrix of the code C_c. An intermediate answer vector is then
produced

Ã_n^{[k*]} := ( ⊕_{k=0}^{K−1} Ṽ_n^{k,0} · P_{0,Q_{k,n}^{[k*]}}, ⊕_{k=0}^{K−1} Ṽ_n^{k,1} · P_{1,Q_{k,n}^{[k*]}},
· · · , ⊕_{k=0}^{K−1} Ṽ_n^{k,s−1} · P_{s−1,Q_{k,n}^{[k*]}} )^T. (29)

The eventual answer A_n^{[k*]} of length ℓ_n is formed by concatenating the components of Ã_n^{[k*]}
which are not constantly zero, i.e., those indicated by (26).
6. For any i ∈ {0, 1, . . . , s − 1}, define the interference database set T̃_i := {n | P_{i,Q_{k*,n}^{[k*]}} = 0}. For
[k∗ ]
from which, since (Ĝ∗i )T · W̄i is known, we can recover
∗
(Ĝ∗i )T · W̆ k · G̃∗n , n ∈ {0, 1, . . . , N − 1} \ T̃i . (31)
Denote Sn := i Pi,Q[k∗ ] = 1 and N := {n | |Sn | ≥ r}, the latter of which is the set of
k∗ ,n
databases that provide at least r symbols of the requested messages in the form (31). For
∗
any n ∈ N , we can recover W̆ k · G̃∗n by invoking the property of MDS code Cc . Then as long
∗
as |N | ≥ T , W̆ k can be fully recovered by invoking the MDS property of code C.
where each column corresponds to a database, and the contents below the horizontal line are not
stored physically, but can be generated as part of the answer computation.

It is straightforward to verify that with the auxiliary query Q̄_n^{[k*]} as the query, the extended
pattern matrix P̄ in the answer generation, and the uncompressed stored contents Ṽ_n^k as the
stored content, the answer is precisely the same as in the construction described above.
Let the messages be W^{0:3} = (A, B, C, D). Consider the case when message W^0 = A is being
requested, and the key is F = (4, 1, 2). Then the auxiliary queries are

[Q̄_0, Q̄_1, Q̄_2, Q̄_3, Q̄_4] =
[ 3 4 0 1 2 ;
  4 4 4 4 4 ;
  1 1 1 1 1 ;
  2 2 2 2 2 ], (36)
[ * * A_2^0 A_3^0 *   ;
  * * *     A_3^1 A_4^1 ;
  * * A_2^2 *     A_4^2 ]. (39)

It is now straightforward to see that through the product code based on the (5, 3) MDS code C
and the (3, 2) MDS code C_c, the message A can be fully recovered.
Lemma 3. In Construction-B, for any request of message-k ∗ and any random key F,
2. |N | = T .
Theorem 2. The codes obtained by Construction-B for T ≥ N − T are both private and capacity-
achieving.
Proof. To see that the code is private, observe that for database-n, the auxiliary query vector
Q̄_n^{[k*]} follows a uniform distribution in the set defined in (19) for any requested message k* ∈
{0, 1, . . . , K − 1}. The query Q_n^{[k*]} is obtained through an additional mapping ⌈·⌉_s regardless of k*,
and thus the query at database-n follows the same distribution for all k*.
The expected lengths of the answers can be written as

∑_{n=0}^{N−1} E(ℓ_n) = ∑_{n=0}^{N−1} ∑_{i=0}^{s−1} Pr( ∑_{k=0}^{K−1} P_{i,Q_{k,n}^{[k*]}} > 0 )
= ∑_{i=0}^{s−1} ∑_{n=0}^{N−1} Pr( ∑_{k=0}^{K−1} P_{i,Q_{k,n}^{[k*]}} > 0 )
= ∑_{i=0}^{s−1} ∑_{n=0}^{N−1} Pr( ∑_{k=0}^{K−1} P̄_{i,Q̄_{k,n}^{[k*]}} > 0 ), (40)

assuming an arbitrary message k* is being requested. The probabilities involved in the summand
for i = i* depend on the vector

( Q̄_{0:K−1,0}^{[k*]}, Q̄_{0:K−1,1}^{[k*]}, . . . , Q̄_{0:K−1,N−1}^{[k*]} ). (41)
By the definition of Q̄_{0:K−1,n}^{[k*]}, it is clear that if P̄_{i*,F_k} = 1 for any k ∈ {0, 1, . . . , k* − 1, k* +
1, . . . , K − 1}, then ∑_{k=0}^{K−1} P_{i*,Q_{k,n}^{[k*]}} > 0 for all n = 0, 1, . . . , N − 1. This will induce N transmissions
in the retrieval from all databases for i = i*. This event, denoted as E, occurs with probability
1 − (s/(r + s))^{K−1}, since the i*-th row of the matrix P̄ has r entries of value 1, and Q̄_{k,n}^{[k*]} = F_k,
k = 0, 1, . . . , k* − 1, k* + 1, . . . , K − 1, are mutually independent, identically distributed, and each
follows a uniform distribution on {0, 1, . . . , r + s − 1}.
On the other hand, when the event E does not occur, the vector

( Q̄_{k*,0}^{[k*]}, Q̄_{k*,1}^{[k*]}, . . . , Q̄_{k*,N−1}^{[k*]} )

is a permutation of the p-replicated vector of (0, 1, . . . , s + r − 1), and thus the number of elements
satisfying P̄_{i*,(F_{k*}+n)_{r+s}} = 1, n = 0, 1, . . . , N − 1, is exactly N − T, implying that (N − T)
symbols will be transmitted for i = i*. Therefore, we have

∑_{n=0}^{N−1} E(ℓ_n) = s [Pr(E)N + (1 − Pr(E)) (N − T)] = sN − sT (T/N)^{K−1} = sN [1 − (T/N)^K], (42)
(K − 1) log(s + r) through a surjective function ⌈·⌉_s, whose image size is upper bounded by (s + 1)^K.
If (s + 1)^K > (s + r)^{K−1}, then we can simply use the auxiliary query Q̄_n^{[k*]}. Thus the upload cost
is at most the lesser of the two terms given above.

If r > 1, the quantity NK log(s + 1) is clearly the lesser of the two when K is large, and thus
may lead to significant savings in terms of the upload cost.
5. Let V_n^{k,r} = W^{k,r} = 0. Database-n first produces a K × s query matrix Q̃_n, for i = 0, 1, . . . , s − 1:

Q̃_n^{k,i} = { r, if Q_{k,n}^{[k*]} = r; (Q_{k,n}^{[k*]} + i)_r, otherwise }. (45)
For n ∈ {0, 1, . . . , N − 1}, an intermediate answer vector Ã_n^{[k*]} of length s is formed (similar
to Construction-A) as

Ã_n^{[k*]} := ( ⊕_{k=0}^{K−1} V_n^{k,Q̃_n^{k,0}}, ⊕_{k=0}^{K−1} V_n^{k,Q̃_n^{k,1}}, . . . , ⊕_{k=0}^{K−1} V_n^{k,Q̃_n^{k,s−1}} )^T. (46)

The eventual answer A_n^{[k*]} of length ℓ_n is formed by concatenating the components of Ã_n^{[k*]}
which are not constantly zero, as indicated by (44).
6. The reconstruction function is the same as that of Construction-A, and the desired message
can be correctly reconstructed as long as |Ti | ≥ T and |Nm | ≥ T .
For better visualization, we can again consider the (uncompressed) auxiliary query

Q̄_n^{[k*]} = (F_0, F_1, . . . , F_{k*−1}, (F_{k*} + n)_{s+r}, F_{k*+1}, . . . , F_{K−1})^T. (47)

The query vector Q_n^{[k*]} is a compressed version of the auxiliary query Q̄_n^{[k*]}.
4.2.1 An Example for Construction-B
Consider an example N = 5, T = 2, K = 4, which induces the parameters (p, r, s, L) = (1, 3, 2, 6).
Let the messages be W^{0:3} = (A, B, C, D). Consider the case when message W^0 = A is being
requested, and the key is F = (4, 1, 2). Then the auxiliary queries and the queries are as given in
(36) and (37), respectively. The intermediate query matrices at all the databases are

In the first row, (C_0^1 + D_0^2, C_1^1 + D_1^2) can be used to recover (C_n^1 + D_n^2) for any n = 0, 1, . . . , 4,
using the MDS property of code C. Similarly, C_n^2 + D_n^0 can be recovered. Therefore, the following
information on the requested message A can be obtained
assuming an arbitrary message k* is being requested. By the definition of Q_n^{[k*]}, if any item in
On the other hand, when the event E does not occur, in the vector

the number of elements that are less than r is exactly N − T, which induces s(N − T) symbols
being transmitted. Therefore

∑_{n=0}^{N−1} E(ℓ_n) = Pr(E)sN + (1 − Pr(E)) s(N − T) = sN − sT (T/N)^{K−1} = sN [1 − (T/N)^K], (51)
The proof follows the same argument as that of Lemma 4, and it is omitted here for brevity.
Moreover, for any linear MDS-PIR code for which the equality holds for any k and π(·) in (52),
let q_{0:N−1} be a combination of queries such that Pr(q_{0:N−1}) > 0 for the retrieval of W^{k*}; then the
code must have:
P1 For any T ⊆ {0, 1, . . . , N − 1} such that |T| = T, and J ⊆ {0, 1, . . . , K − 1} satisfying k* ∈ J,

H( A_{n'}^{(q_{n'})}, n' ∈ T̄ | W^J, A_n^{(q_n)}, n ∈ T ) = 0,
Property P0 is a direct consequence of the linear MDS-PIR code definition. Property P1 states
that in a capacity-achieving code, the interference signals from the answers of any T databases
fully determine all interference signals in the other answers. The inequalities in Lemma 8 are the key
steps in deriving the capacity of MDS-PIR codes; the proofs of these properties can be found in
Appendix B. Conversely, for any capacity-achieving linear MDS-PIR code, these inequalities must
hold with equality, implying the following theorem.
Theorem 4. Any capacity-achieving linear MDS-PIR code must have the properties P0 and P1.
and it follows that R ≤ C. For any capacity-achieving linear MDS-PIR code, all the inequalities
in Lemma 8 must hold with equality. Therefore, any capacity-achieving linear MDS-PIR code must have
properties P0 and P1.
A similar set of properties for capacity-achieving PIR codes on replicated databases was derived
in [8], which holds for a more general code class and in a more explicit form. For MDS-coded
databases, firstly, it is more meaningful to consider only linear codes, and secondly, it is not clear
whether the properties stated in Lemmas 7 and 8 can be extended to the more general class of codes
considered in [8].
Theorem 5. When K > T / gcd(N, T ), the message size of any capacity-achieving linear MDS-PIR
code satisfies L ≥ lcm(N − T, T ).
From Theorem 5, we can conclude that codes obtained by Construction-A and Construction-B
indeed have the minimum message size when K > T / gcd(N, T ). The proof of Theorem 5 relies on
the delicate relation among a set of auxiliary quantities H_n^k's and I_n^k's, which we define next. For
any given capacity-achieving linear MDS-PIR code, let (k̆, f̆) be the maximizer of the following
optimization problem for n = 0:
The following lemma implies that the optimization problem in (57) has the same maximizer for
all n ∈ {0, 1, . . . , N − 1}.
Lemma 9. For any capacity-achieving linear MDS-PIR code, ∀n' ≠ n'' where n', n'' ∈ {0, 1, . . . , N −
1}, any k* ∈ {0, 1, . . . , K − 1}, and any f ∈ F,

H(A_{n'}^{[k*]} | W^{k*}, F = f) = H(A_{n''}^{[k*]} | W^{k*}, F = f).
Lemma 10.

L − (N − T) H^{k̆} = ∑_{n=0}^{N−1} I_n^{k̆}. (58)
Lemma 11. For any k ∈ {0, 1, . . . , K − 1} and n ∈ {0, 1, . . . , N − 1}, H_n^k and I_n^k are integers;
moreover,

H_n^{k̆} ≥ ∑_{k≠k̆} I_n^k, (59)

H^{k̆} ≥ s. (60)
The proofs of Lemmas 9-11 are given in Appendix B. We are now ready to prove Theorem 5.
Proof of Theorem 5. When K > s, by (60) in Lemma 11 and Lemma 10, we have

L − (N − T)s ≥ L − (N − T)H^{k̆} = ∑_{n=0}^{N−1} I_n^{k̆} ≥ 0. (61)

Substituting L = M ps and (N − T) = pr into the left-hand side leads to the conclusion M ≥ r,
implying that L = M T ≥ rT = lcm(N − T, T).
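The last step uses rT = lcm(N − T, T), which follows from gcd(N − T, T) = gcd(N, T) = p; an exhaustive sanity check (ours):

```python
from math import gcd, lcm

# For all 1 <= T < N, with p = gcd(N, T), r = (N-T)/p, s = T/p,
# verify that r*T = lcm(N - T, T) = p*r*s.
for N in range(2, 40):
    for T in range(1, N):
        p = gcd(N, T)
        r, s = (N - T) // p, T // p
        assert r * T == lcm(N - T, T) == p * r * s
```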
5.3 Message Size Reduction for Small K
The following theorem confirms that for small K, it is in fact possible to construct a capacity-
achieving code with an even smaller message size.
Proof of Theorem 6. Since L = M T > 0, it is trivial to see that L ≥ T , and thus it only remains
to provide a construction of a capacity-achieving linear MDS-PIR code with such a message size.
Let database-n store two symbols V_n^0, V_n^1 ∈ X, which are MDS-coded symbols of messages W^0
and W^1, respectively. When the user wishes to retrieve message W^{k*} where k* ∈ {0, 1}, two query
strategies are used.

• With probability T/N, randomly partition the N databases into 3 disjoint sets G^(0), G^(1), and G^(2),
with |G^(0)| = N − T, |G^(1)| = 2T − N, and |G^(2)| = N − T. The user requests V_n^0 ⊕ V_n^1 from
databases in G^(0), (V_n^0, V_n^1) from those in G^(1), and V_n^{1−k*} from those in G^(2).

• With probability (N − T)/N, randomly partition the N databases into 2 disjoint sets G^(3) and G^(4),
with |G^(3)| = T and |G^(4)| = N − T. The user requests V_n^{k*} from databases in G^(3), but no
information from those in G^(4).
It is straightforward to verify that the code is indeed correct, private, and capacity-achieving.
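The capacity claim can be double-checked by direct computation (a sketch of ours; with message size L = T here, the achieved rate should equal the K = 2 capacity N/(N + T)):

```python
from fractions import Fraction

def k2_expected_download(N: int, T: int) -> Fraction:
    """Expected download of the two-strategy scheme: with prob. T/N the
    first strategy downloads (N-T) + 2(2T-N) + (N-T) = 2T symbols;
    with prob. (N-T)/N the second downloads T symbols."""
    return Fraction(T, N) * 2 * T + Fraction(N - T, N) * T

def k2_rate(N: int, T: int) -> Fraction:
    """Rate with message size L = T; compare with N/(N + T)."""
    return T / k2_expected_download(N, T)
```

E.g., (N, T) = (3, 2) gives an expected download of 10/3 symbols and rate 3/5 = N/(N + T).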
Theorem 6 provides a code construction with a small message size for the special case of K = 2
and T ≥ N − T. We suspect that codes with small message sizes also exist for other parameters
when K is below the threshold, but they may require a more sophisticated combinatorial structure.
A systematic approach to constructing such codes, as well as a matching converse, appears rather
difficult to find.
6 Conclusion
We proposed two code constructions for private information retrieval from MDS-coded databases
with message size lcm(N − T, T), and showed that this is the minimum message size for linear codes
when K is above a threshold. For smaller K, it is in fact possible to design PIR codes with an even
smaller message size, which we showed through a special example for K = 2. This work generalizes our
previous result on private information retrieval from replicated databases [8] in a highly non-trivial
manner. We expect the code constructions and the converse proof approach to be also applicable
and fruitful in other privacy-preserving primitives.
Independent of this work and inspired by our previous result [8], Zhu et al. [13] discovered
a different code construction, and also derived a lower bound on the message size similar to the
one reported here. All three code constructions have the same message size, however, the two
constructions we provided have a lower upload cost due to the queries being better compressed. It
is worth noting that in our previous work [8], the proposed code was also shown to be optimal in
terms of the upload cost, however, proving the proposed codes in the current work to be optimal
appears more difficult due to the more complex dependence stipulated by the MDS code.
A Proofs of Lemma 1, Lemma 3, and Lemma 5

Proof of Lemma 1. Fix a particular k*. It is convenient to represent the Q̃_n^{k*,i} as a matrix given
below:

[ Q̃_0^{k*,0}   Q̃_1^{k*,0}   . . . Q̃_{N−1}^{k*,0}   ;
  Q̃_0^{k*,1}   Q̃_1^{k*,1}   . . . Q̃_{N−1}^{k*,1}   ;
  . . .                                              ;
  Q̃_0^{k*,s−1} Q̃_1^{k*,s−1} . . . Q̃_{N−1}^{k*,s−1} ]. (62)

Since Q̃_n^{k*,i} = (F_{k*} + i + n)_{s+r}, N = p · (s + r), and T = p · s, for any i and any given realization
of key F, there are exactly T elements in Q̃_{0:N−1}^{k*,i}, which is a row of (62), that are greater than or
equal to r. This proves the first statement that |T_i| ≥ T. For the second statement, consider a
fixed m ∈ {0, 1, . . . , r + s − 1}. It is clear that each row in the matrix (62) has exactly p positions
being m. Therefore, there are a total of p · s = T elements in the matrix being m. Since for any
i ≠ i' where 0 ≤ i, i' < s, we have Q̃_n^{k*,i} ≠ Q̃_n^{k*,i'}, due to Q̃_n^{k*,i} = (F_{k*} + i + n)_{s+r}, these T positions
are all in different columns, implying |N_m| = T.
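The counting argument above can also be verified directly (our sketch): each residue modulo (s + r) appears exactly p times among (F_{k*} + i + n) mod (s + r), n = 0, . . . , N − 1, so exactly ps = T entries are ≥ r.

```python
from math import gcd

def interference_count(N: int, T: int, f: int, i: int) -> int:
    """|T_i|: number of databases n with (f + i + n) mod (s+r) >= r,
    where f stands for the key coordinate F_{k*}."""
    p = gcd(N, T)
    r, s = (N - T) // p, T // p
    return sum(1 for n in range(N) if (f + i + n) % (s + r) >= r)
```

For every choice of f and i, this count equals T, as the lemma asserts.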
is simply a permutation (in fact, a cyclic shift) of the p-replicated vector of (0, 1, . . . , r + s − 1), and
thus

( P̄_{i,Q̄_{k*,0}^{[k*]}}, P̄_{i,Q̄_{k*,1}^{[k*]}}, . . . , P̄_{i,Q̄_{k*,N−1}^{[k*]}} ) (65)

is in fact a permutation of [P̄_i, P̄_i, . . . , P̄_i], i.e., a p-replicated version of the i-th row of P̄. It is
now clear that |T̃_i| = T, because each row of P̄ has exactly s zeros, and the replication gives a total
of ps = T zeros in the vector in (65).

To see that |N| = T, let us focus on any single database-n such that |S_n| > 0, i.e., P̄_{i,Q̄_{k*,n}^{[k*]}} = 1
for some i ∈ {0, 1, . . . , s − 1}. This condition is equivalent to Q̄_{k*,n}^{[k*]} < s, and moreover, it also
implies |S_n| = r because the vector

( P̄_{0,Q̄_{k*,n}^{[k*]}}, P̄_{1,Q̄_{k*,n}^{[k*]}}, . . . , P̄_{s−1,Q̄_{k*,n}^{[k*]}} )^T, (66)

is the j = Q̄_{k*,n}^{[k*]}-th column of the extended pattern matrix P̄, and for any j < s, such a column
has exactly r positions being 1. It is also immediately clear that among the N databases, there are
a total of sp = T of them with Q̄_{k*,n}^{[k*]} < s, again because the vector (64) is a permutation of the
p-replicated vector of (0, 1, . . . , r + s − 1).
[k∗ ]
n o
Proof of Lemma 5. Define T := n|Qk∗ ,n = r , and notice that in Construction-B, Ti = T for all
i = 0, 1, . . . , s − 1. Moreover
[k∗ ] [k∗ ]
n o n o
T = n Qk∗ ,n = r = n Q̄k∗ ,n ≥ r . (67)
Notice that the vector
[k∗ ] [k∗ ] [k∗ ]
Q̄k∗ ,0 , Q̄k∗ ,1 , . . . , Q̄k∗ ,N −1 (68)
is a permutation of p-replicated vector (0, 1, . . . , r + s − 1), and thus it has exactly ps = T items
that are greater or equal to r. This directly implies |T | = T .
By definition, for $\tilde{Q}_n^{k^*,i} = m$ with $m < r$ to hold, we must have $(F_{k^*} + n)_{s+r} < r$. It is
also clear that $\tilde{Q}_n^{k^*,i} = ((F_{k^*} + n)_{s+r} + i)_r$. For any $m \in \{0, 1, \ldots, r-1\}$, there are in fact exactly
$ps = T$ pairs $(n, i)$ such that $\tilde{Q}_n^{k^*,i} = m$. This can be seen as follows: the vector
$$(F_{k^*} + 0,\ F_{k^*} + 1,\ \ldots,\ F_{k^*} + N - 1)_{s+r} \qquad (69)$$
has $rp = N - T$ items less than $r$, and these items are a permutation of the $p$-replicated vector
$(0, 1, \ldots, r-1)$. The symmetry of these values then implies that there are $(N-T)s/r = T$ such
$(n, i)$ pairs for each $m \in \{0, 1, \ldots, r-1\}$. We next show that such pairs do not share any common
$n$. To see this, suppose there are two distinct pairs $(n, i)$ and $(n, i')$ which satisfy
$$((F_{k^*} + n)_{s+r} + i)_r = ((F_{k^*} + n)_{s+r} + i')_r = m \qquad (70)$$
for a certain $m \in \{0, 1, \ldots, r-1\}$. However, since $0 \le i, i' < s \le r$, we must then have $i = i'$,
which contradicts our supposition. It follows that indeed $|\mathcal{N}_m| = T$ for any $m \in \{0, 1, \ldots, r-1\}$.
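This pair-counting step can also be checked numerically. The sketch below (again with arbitrary illustrative parameters, chosen so that $s \le r$ as the proof requires) enumerates the relevant $(n, i)$ pairs and confirms both the count and the distinctness of the database indices:

```python
# Sanity check of the pair-counting argument in the proof of Lemma 5.
# p, s, r and the key F are arbitrary illustrative choices with s <= r;
# N = p*(s+r), T = p*s, so N - T = p*r.
p, s, r = 2, 2, 3
N, T = p * (s + r), p * s
F = 7  # an arbitrary realization of the key F_{k*}

# Only databases n with (F + n) mod (s+r) < r contribute, and for those
# Q~_n^{k*,i} = ((F + n) mod (s+r) + i) mod r for i = 0, ..., s-1.
pairs = {}
for n in range(N):
    v = (F + n) % (s + r)
    if v < r:
        for i in range(s):
            pairs.setdefault((v + i) % r, []).append(n)

# For each m < r there are exactly (N-T)*s/r = T pairs (n, i), and no
# two of them share the same database index n, hence |N_m| = T.
for m in range(r):
    assert len(pairs[m]) == T
    assert len(set(pairs[m])) == T

print("Lemma 5 pair counting verified")
```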
where $\rho: \{0, 1, \ldots, N-1\} \to \{0, 1, \ldots, N-1\}$ in inequality (75) is a permutation over $\{0, 1, \ldots, N-1\}$. Equality (78) is due to the privacy constraint, which gives
$$\begin{aligned}
H(A_n^{[\pi(k)]} \mid W^{\pi(0:k)}, F) & \qquad (79)\\
= \sum_{f \in \mathcal{F}} \Pr(F = f)\, H(A_n^{[\pi(k)]} \mid W^{\pi(0:k)}, F = f) & \qquad (80)\\
= \sum_{q_n \in \mathcal{Q}_n} \Pr(Q_n = q_n)\, H(A_n^{(q_n)} \mid W^{\pi(0:k)}) & \qquad (81)\\
= H(A_n^{[\pi(k+1)]} \mid W^{\pi(0:k)}, F). & \qquad (82)
\end{aligned}$$
The equality must hold for each $F = f$, which concludes the proof of property P1.
which implies
$$H(A_{\mathcal{T}'}^{[k^*]} \mid W^{k^*}, F = f) = H(A_{\mathcal{T}''}^{[k^*]} \mid W^{k^*}, F = f). \qquad (84)$$
Invoking P0 leads to
$$\sum_{n \in \mathcal{T}'} H(A_n^{[k^*]} \mid W^{k^*}, F = f) = \sum_{n \in \mathcal{T}''} H(A_n^{[k^*]} \mid W^{k^*}, F = f),$$
Proof of Lemma 10. For any capacity-achieving code, (71) must be equal to (76). The equality should
also hold when $F = \breve{f}$, $k = 0$, and $\pi(0) = \breve{k}$. Substituting the definitions of $H_n^{\breve{k}}$ and $I_n^{\breve{k}}$ directly, we
have
$$N \left[ H^{\breve{k}} + \sum_{n=0}^{N-1} I_n^{\breve{k}} \right] - L = \sum_{n=0}^{N-1} \sum_{s=0}^{T-1} H^{\breve{k}},$$
Proof of Lemma 11. Let $q_n = \phi_n(\breve{k}, \breve{f})$. The linearity of the code implies that
$$H_n^k = \frac{H(A_n^{[\breve{k}]} \mid W^k, F = \breve{f})}{\log |X|} = \frac{H(A_n^{(q_n)} \mid W^k)}{\log |X|} \qquad (85)$$
is an integer. Notice that $I_n^k + H_n^k$ is also an integer by the same argument, from which we conclude
that $I_n^k$ is also an integer.
To see the inequality in (59), we write
$$H(A_n^{(q_n)}) \ge I(A_n^{(q_n)}; W^{0:K-1}) = \sum_{k=0}^{K-1} I(A_n^{(q_n)}; W^k \mid W^{0:k-1}) \overset{(a)}{\ge} \sum_{k=0}^{K-1} I(A_n^{(q_n)}; W^k) = \log |X| \sum_{k=0}^{K-1} I_n^k, \qquad (86)$$
where inequality (a) holds because
$$I(A_n^{(q_n)}; W^k \mid W^{0:k-1}) = H(W^k \mid W^{0:k-1}) - H(W^k \mid W^{0:k-1}, A_n^{(q_n)}) \ge H(W^k) - H(W^k \mid A_n^{(q_n)}) = I(A_n^{(q_n)}; W^k), \qquad (87)$$
due to the independence of the messages and since conditioning cannot increase entropy.
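Step (a) can be sanity-checked on a toy example: with two independent uniform message bits and a hypothetical answer $A = W^0 \oplus W^1$ (a choice made purely for illustration, not a construction from this paper), conditioning on $W^0$ strictly increases the mutual information between $A$ and $W^1$:

```python
from itertools import product
from math import log2

# Toy check of step (a): for independent messages, conditioning on
# earlier messages cannot decrease I(A; W^k). The answer A = W0 XOR W1
# is a hypothetical example chosen only to exercise the inequality.
def H(pmf):
    """Shannon entropy of a pmf given as {outcome: probability}."""
    return -sum(p * log2(p) for p in pmf.values() if p > 0)

# Joint distribution of (W0, W1, A) for independent uniform bits.
joint = {(w0, w1, w0 ^ w1): 0.25 for w0, w1 in product([0, 1], repeat=2)}

def marginal(idx):
    """Marginal pmf over the coordinates in idx (0=W0, 1=W1, 2=A)."""
    pmf = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in idx)
        pmf[key] = pmf.get(key, 0.0) + p
    return pmf

# I(A; W1) = H(A) + H(W1) - H(A, W1)
i_unconditional = H(marginal([2])) + H(marginal([1])) - H(marginal([1, 2]))
# I(A; W1 | W0) = H(A, W0) + H(W1, W0) - H(W0, W1, A) - H(W0)
i_conditional = (H(marginal([0, 2])) + H(marginal([0, 1]))
                 - H(marginal([0, 1, 2])) - H(marginal([0])))

assert i_conditional >= i_unconditional
print(round(i_unconditional, 6), round(i_conditional, 6))  # prints: 0.0 1.0
```

Here $I(A; W^1) = 0$ while $I(A; W^1 \mid W^0) = 1$, consistent with the direction of inequality (a).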
It follows that the inequality in (59) holds.
Next we prove the inequality (60). First define the following query support set at database-$n$,
and since the code is privacy preserving, it is also the query support set for all other $k = 1, \ldots, K-1$.
We can then write
$$\ge \max_{q_n \in \mathcal{Q}_n^*} H(A_n^{(q_n)} \mid W^k) \ge H(A_n^{[\breve{k}]} \mid W^k, F = \breve{f}).$$
Since $I_n^{\breve{k}} \ge 0$, the left hand side is strictly positive unless $(N - T)$ is a factor of $L$. If the left hand
side is zero, then $L \ge \operatorname{lcm}(T, N - T)$, implying $H^{\breve{k}} \ge s$. If $(N - T)$ is not a factor of $L$, implying
that the left hand side is indeed strictly positive, then there exists at least one $n^* \in \{0, 1, \ldots, N-1\}$
such that $I_{n^*}^{\breve{k}} \ge 1$. For any $k = 0, 1, \ldots, K - 1$,
$$I_{n^*}^k + H_{n^*}^k = H(A_{n^*}^{[\breve{k}]} \mid F = \breve{f}) / \log |X|; \qquad (92)$$
however, the right hand side of (92) is independent of $k$, and thus $I_{n^*}^k + H_{n^*}^k$ is in fact a constant.
Furthermore, since $H^{\breve{k}} \ge H_{n^*}^k$ by (90), it follows that $I_{n^*}^k \ge I_{n^*}^{\breve{k}}$ for all $k = 0, 1, \ldots, K - 1$. By (59), we
can write
$$H_{n^*}^{\breve{k}} \ge \sum_{k \ne \breve{k}} I_{n^*}^k \ge (K - 1)\, I_{n^*}^{\breve{k}} \ge K - 1 \ge s, \qquad (93)$$
when $K > s$.
References
[1] B. Chor, E. Kushilevitz, O. Goldreich, and M. Sudan, “Private information retrieval,” Journal
of the ACM (JACM), vol. 45, no. 6, pp. 965–981, Nov. 1998.
[2] H. Sun and S. A. Jafar, “The capacity of private information retrieval,” IEEE Transactions
on Information Theory, vol. 63, no. 7, pp. 4075–4088, Jul. 2017.
[3] N. B. Shah, K. Rashmi, and K. Ramchandran, “One extra bit of download ensures perfectly
private information retrieval,” in Proceedings of the 2014 IEEE International Symposium on
Information Theory (ISIT), 2014, pp. 856–860.
[4] K. Banawan and S. Ulukus, “The capacity of private information retrieval from coded
databases,” IEEE Transactions on Information Theory, vol. 64, no. 3, pp. 1945–1956, Mar.
2018.
[5] R. Tajeddine, O. W. Gnilke, and S. El Rouayheb, “Private information retrieval from MDS
coded data in distributed storage systems,” IEEE Transactions on Information Theory, vol. 64,
no. 11, pp. 7081–7093, Nov. 2018.
[6] J. Xu and Z. Zhang, “On sub-packetization and access number of capacity-achieving PIR
schemes for MDS coded non-colluding databases,” SCIENCE CHINA Information Sciences,
vol. 61, no. 7, pp. 100 306:1–100 306:16, 2018.
[7] H.-Y. Lin, S. Kumar, E. Rosnes, A. G. i Amat, and E. Yaakobi, “Weakly-private informa-
tion retrieval,” in Proceedings of the 2019 IEEE International Symposium on Information
Theory (ISIT), 2019, pp. 1257–1261.
[8] C. Tian, H. Sun, and J. Chen, “Capacity-achieving private information retrieval codes with
optimal message size and upload cost,” IEEE Transactions on Information Theory, vol. 65,
no. 11, pp. 7613–7627, Nov. 2019.
[9] K. Banawan and S. Ulukus, “Asymmetry hurts: Private information retrieval under asym-
metric traffic constraints,” IEEE Transactions on Information Theory, vol. 65, no. 11, pp.
7628–7645, Nov. 2019.
[10] H. Sun and S. A. Jafar, “Optimal download cost of private information retrieval for arbitrary
message length,” IEEE Transactions on Information Forensics and Security, vol. 12, no. 12,
pp. 2920–2932, Dec. 2017.
[11] C. Tian, “Symmetry, outer bounds, and code constructions: A computer-aided investigation
on the fundamental limits of caching,” MDPI Entropy, vol. 20, no. 8, pp. 603.1–43, Aug. 2018.
[12] S. Lin and D. J. Costello, Error control coding, 2nd ed. Prentice Hall, 2004.
[13] J. Zhu, Q. Yan, C. Qi, and X. Tang, “A new capacity-achieving private information retrieval
scheme with (almost) optimal file length for coded servers,” arXiv preprint arXiv:1903.06921,
2019.