Problem Description
There is currently no efficient method to list all document names in a very large collection when it's required to also include "phantom" documents (i.e., documents that do not actually exist but do have subcollections).
The goal is to perform a full scan of a collection's document IDs, similar to a keys-only query, but with the ability to find phantom documents, and to do so in parallel to make it practical for large collections (e.g., 500k to 1M+ documents).
Currently, there are two primary APIs for this, but they are mutually exclusive in their capabilities:
`v1.listDocuments` with `showMissing: true`:
- What it does well: It correctly returns all document names, including phantoms.
- The limitation: It is an inherently sequential, paginated API. It does not support range queries (e.g., `__name__ > 'a' AND __name__ < 'b'`) or partition cursors. This forces a single, long-running sequential scan, which is unacceptably slow for large collections. A scan of ~420,000 documents can take over two hours.
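
To illustrate why this scan cannot be parallelized, here is a minimal sketch of the pagination loop. `FetchPage` and `NamePage` are hypothetical stand-ins for a single `listDocuments` call made with `showMissing: true` and its response; they are not the client library's real types.

```typescript
// Hypothetical shape of one page of results; the real response type differs.
interface NamePage {
  names: string[];
  nextPageToken?: string;
}

// Hypothetical wrapper around one listDocuments call with showMissing: true.
// It is injected so that the sketch stays self-contained.
type FetchPage = (pageToken?: string) => Promise<NamePage>;

// Collects every document name by following page tokens one at a time.
async function listAllNames(fetchPage: FetchPage): Promise<string[]> {
  const names: string[] = [];
  let pageToken: string | undefined = undefined;
  do {
    // The next request cannot start until this one has returned its token.
    const page: NamePage = await fetchPage(pageToken);
    names.push(...page.names);
    pageToken = page.nextPageToken;
  } while (pageToken);
  return names;
}
```

Because every request needs the token from the previous response, total time grows linearly with the number of pages, no matter how many workers are available.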
`v1.partitionQuery` + `v1.runQuery`:
- What it does well: This is the recommended and highly effective method for parallelizing reads across a large collection. It is very fast.
- The limitation: The `runQuery` RPC does not support a `showMissing` flag. As a result, it only returns existing documents and silently skips over phantom documents, making it unsuitable for use cases that require a complete list of all document paths.
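
The parallel pattern can be sketched roughly as follows. This assumes the split points have already been obtained (for example from `partitionQuery`) as sorted document-name strings; `KeyRange`, `ScanRange`, `cursorsToRanges`, and `scanInParallel` are hypothetical names used for illustration only.

```typescript
// A half-open key range [start, end); undefined means "unbounded" on that side.
interface KeyRange {
  start: string | undefined;
  end: string | undefined;
}

// Turns N sorted split points into N + 1 contiguous, non-overlapping ranges.
function cursorsToRanges(splitPoints: string[]): KeyRange[] {
  const ranges: KeyRange[] = [];
  let previous: string | undefined = undefined;
  for (const point of splitPoints) {
    ranges.push({ start: previous, end: point });
    previous = point;
  }
  ranges.push({ start: previous, end: undefined });
  return ranges;
}

// Hypothetical per-range reader, e.g. a keys-only query bounded by the range.
type ScanRange = (range: KeyRange) => Promise<string[]>;

// Scans all ranges concurrently and concatenates the results in range order.
async function scanInParallel(
  splitPoints: string[],
  scanRange: ScanRange
): Promise<string[]> {
  const ranges = cursorsToRanges(splitPoints);
  const chunks = await Promise.all(ranges.map((range) => scanRange(range)));
  return ([] as string[]).concat(...chunks);
}
```

With `runQuery` as the per-range reader, each range returns only existing documents, so phantom documents are missing from the combined result regardless of how the ranges are chosen.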
This leaves users with a difficult choice: either a correct but impractically slow scan, or a fast but incomplete one.
Use Case & Business Impact
This limitation is a significant bottleneck for critical back-office operations, data migrations, and large-scale data integrity checks. For example, when migrating data, it's essential to have a complete map of the existing structure, including documents that serve as parents for subcollections.
The inability to parallelize this task means that scripts which should take minutes can take many hours, making them difficult to run, monitor, and recover from in a production environment. This increases operational risk and engineering cost.
Proposed Solutions
To address this, we propose adding capabilities to the Firestore API that would bridge this gap. Any of the following solutions would be a massive improvement:
- Add `showMissing: true` support to the `runQuery` RPC.
  - This is perhaps the most direct solution. If `runQuery` supported `showMissing`, we could continue using the existing `partitionQuery` API to generate cursors and then execute parallel, keys-only queries that also find phantom documents. This seems like a natural extension of the existing parallel-scan pattern.
- Enhance `listDocuments` to support partitioning.
  - Add support for `startAt` and `endAt` cursors (like those from `partitionQuery`) to the `IListDocumentsRequest`. This would allow us to partition the keyspace and then make parallel `listDocuments` calls, each with `showMissing: true`.
  - Alternative: Allow `where` filters on the `__name__` property within a `listDocuments` request. This would enable manual partitioning of the keyspace (e.g., by character ranges: `a*`, `b*`, etc.) and allow for parallel execution. The client library documentation for `listDocuments` states that `showMissing` may not be used with `where` or `orderBy`, so this restriction would need to be lifted to allow for manual range scans. (From docs: "Requests with `show_missing` may not specify `where` or `order_by`".)
- Introduce a new, dedicated API for parallel keyspace enumeration.
  - Create a new RPC specifically designed for this task. It could be a "partitionable list" or a "keyspace scan" API that takes a parent path and returns a stream of all document names (existing and missing) within that path, with built-in support for parallel execution. This would provide a purpose-built tool for a common and important administrative task.
- Enhance `IListenRequest` to notify on phantom document changes.
  - Extend the `listen` RPC to send notifications when phantom documents are implicitly created (because a subcollection document is added) or removed (because their last subcollection document is removed). This would allow for real-time tracking of the complete keyspace, which is crucial for maintaining live caches or indexes of a collection's structure.
- Introduce a new API for traversing and watching parameterized paths.
  - Provide a new API that accepts a fully parameterized path pattern (e.g., `users/{userId}/posts/{postId}/comments/{commentId}`). This API would stream all matching documents, including phantom documents along the path, and return the extracted, typed parameters for each document (e.g., `{ userId: 'user-123', postId: 'post-abc', ... }`). It should also support watching for new documents that match the pattern. This is conceptually similar to how Firebase Functions v2 triggers can be defined with wildcards (e.g., `onDocumentWritten("users/{userId}/posts/{postId}")`), which has proven to be a very powerful and intuitive pattern. This would massively simplify complex data traversal and synchronization logic.
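
To make the parameter extraction in the last proposal concrete, here is a minimal client-side sketch of matching a document path against a pattern. `matchPath` and `PathParams` are hypothetical names for illustration; this is not an existing Firestore API, and a server-side implementation would also have to enumerate the matching paths.

```typescript
// Maps parameter names from the pattern to the matching path segments.
type PathParams = Record<string, string>;

// Returns the extracted parameters when `path` matches `pattern`, else null.
// Segments written as {name} match any non-empty segment; all other segments
// must match literally.
function matchPath(pattern: string, path: string): PathParams | null {
  const patternParts = pattern.split("/");
  const pathParts = path.split("/");
  if (patternParts.length !== pathParts.length) return null;

  const params: PathParams = {};
  for (let i = 0; i < patternParts.length; i++) {
    const expected = patternParts[i] ?? "";
    const actual = pathParts[i] ?? "";
    // Capture the parameter name if this pattern segment is a {wildcard}.
    const name = /^\{(\w+)\}$/.exec(expected)?.[1];
    if (name !== undefined) {
      if (actual === "") return null;
      params[name] = actual;
    } else if (expected !== actual) {
      return null;
    }
  }
  return params;
}
```

For example, matching `users/user-123/posts/post-abc` against `users/{userId}/posts/{postId}` yields `{ userId: 'user-123', postId: 'post-abc' }`, while a path with a different number of segments or a different literal segment yields `null`.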
For us, an efficient, parallel, and complete scan of a collection's keyspace would be a very useful tool for managing Firestore data at scale. We understand this requires changes to the backend and would be grateful if you could consider this proposal and forward it to the appropriate team.