[FR] Enable fast, parallel listing of all document names, including missing/phantom documents

## Problem Description
There is currently no efficient way to list all document names in a very large collection when the listing must also include "phantom" documents (i.e., documents that do not exist themselves but do have subcollections).
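
To make the term concrete, here is a minimal sketch that derives the phantom documents of one collection from a flat list of existing document paths. `findPhantoms` and the sample paths are hypothetical and only illustrate the definition; this is not how Firestore computes it internally.

```typescript
// Hypothetical helper: given the paths of documents that actually exist,
// return the documents directly under `collectionPath` that do NOT exist
// themselves but are parents of at least one subcollection document.
function findPhantoms(existingDocPaths: string[], collectionPath: string): string[] {
  const existing = new Set(existingDocPaths);
  const prefix = collectionPath + "/";
  const phantoms = new Set<string>();
  for (const path of existingDocPaths) {
    if (!path.startsWith(prefix)) continue;
    // rest = [docId, subcollection, subDocId, ...]; more than one segment
    // means the path lies beneath `collectionPath/docId`.
    const rest = path.slice(prefix.length).split("/");
    if (rest.length > 1) {
      const parent = prefix + rest[0];
      if (!existing.has(parent)) phantoms.add(parent);
    }
  }
  return Array.from(phantoms).sort();
}

const samplePaths = [
  "users/alice",
  "users/alice/posts/p1",
  "users/bob/posts/p2", // "users/bob" itself was never written
];
const phantomUsers = findPhantoms(samplePaths, "users");
```

Here `users/bob` is the only phantom: a query over `users` would return `users/alice` alone, while a listing with `showMissing: true` would return both names.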

The goal is to perform a full scan of a collection's document IDs, similar to a keys-only query, but with the ability to find phantom documents, and to do so in parallel to make it practical for large collections (e.g., 500k to 1M+ documents).

Currently, two APIs come close, but each lacks what the other provides:
1. `v1.listDocuments` with `showMissing: true`:
    - *What it does well*: It correctly returns all document names, including phantoms.
    - *The limitation*: It is an inherently sequential, paginated API. It does not support range queries (e.g., `__name__ > 'a' AND __name__ < 'b'`) or partition cursors. This forces a single, long-running sequential scan, which is unacceptably slow for large collections. A scan of ~420,000 documents can take over two hours.
2. `v1.partitionQuery` + `v1.runQuery`:
    - *What it does well*: This is the recommended and highly effective method for parallelizing reads across a large collection. It is very fast.
    - *The limitation*: The `runQuery` RPC does not support a `showMissing` flag. As a result, it only returns existing documents and silently skips over phantom documents, making it unsuitable for use cases that require a complete list of all document paths.

This leaves users with a difficult choice: either a correct but impractically slow scan, or a fast but incomplete one.
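
The sequential nature of the first option can be sketched as follows. This is a simplified in-memory model, not the real RPC: the `Page` shape, `makeFakeLister`, and `scanSequentially` are assumptions made for illustration. The point is that each request needs the token returned by the previous one, so requests cannot overlap.

```typescript
// Hypothetical, simplified shape of one page of a paginated listing.
interface Page {
  names: string[];
  nextPageToken?: string;
}

// In-memory stand-in for a paginated list call (an assumption, not the real API).
// The page token is simply the offset of the next page, encoded as a string.
function makeFakeLister(allNames: string[], pageSize: number): (token?: string) => Page {
  const sorted = allNames.slice().sort();
  return (token?: string): Page => {
    const start = token === undefined ? 0 : Number(token);
    const names = sorted.slice(start, start + pageSize);
    const next = start + pageSize;
    return next < sorted.length ? { names, nextPageToken: String(next) } : { names };
  };
}

// Walk every page in order and count the round trips that were needed.
function scanSequentially(listPage: (token?: string) => Page): { names: string[]; roundTrips: number } {
  const names: string[] = [];
  let roundTrips = 0;
  let token: string | undefined;
  do {
    // This call cannot be issued until the previous page has returned its token.
    const page = listPage(token);
    roundTrips++;
    names.push(...page.names);
    token = page.nextPageToken;
  } while (token !== undefined);
  return { names, roundTrips };
}

const demoNames = Array.from({ length: 10 }, (_, i) => "doc" + i);
const demoScan = scanSequentially(makeFakeLister(demoNames, 3));
```

With 10 names and a page size of 3, the loop makes 4 dependent round trips. The same chain at production scale is what makes the full scan slow, since no second worker can start until it has a token or range of its own.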

## Use Case & Business Impact
This limitation is a significant bottleneck for critical back-office operations, data migrations, and large-scale data integrity checks. For example, when migrating data, it's essential to have a complete map of the existing structure, including documents that serve as parents for subcollections.

The inability to parallelize this task means that scripts which should take minutes can take many hours, making them difficult to run, monitor, and recover from in a production environment. This increases operational risk and engineering cost.

## Proposed Solutions
To address this, we propose adding capabilities to the Firestore API that would bridge this gap. Any of the following solutions would be a massive improvement:

1. Add `showMissing: true` support to the `runQuery` RPC.
    - This is perhaps the most direct solution. If `runQuery` supported `showMissing`, we could continue using the existing `partitionQuery` API to generate cursors and then execute parallel, keys-only queries that also find phantom documents. This seems like a natural extension of the existing parallel-scan pattern.
2. Enhance `listDocuments` to support partitioning.
    - Add support for `startAt` and `endAt` cursors (like those from `partitionQuery`) to the `IListDocumentsRequest`. This would allow us to partition the keyspace and then make parallel `listDocuments` calls, each with `showMissing: true`.
    - *Alternative*: Allow `where` filters on the `__name__` property within a `listDocuments` request. This would enable manual partitioning of the keyspace (e.g., by character ranges: `a*`, `b*`, etc.) and allow for parallel execution. The client library documentation for `listDocuments` states that `showMissing` may not be combined with `where` or `orderBy`, so this restriction would need to be lifted to allow for manual range scans. (From docs: "Requests with `show_missing` may not specify `where` or `order_by`".)
3. Introduce a new, dedicated API for parallel keyspace enumeration.
    - Create a new RPC specifically designed for this task. It could be a "partitionable list" or a "keyspace scan" API that takes a parent path and returns a stream of all document names (existing and missing) within that path, with built-in support for parallel execution. This would provide a purpose-built tool for a common and important administrative task.
4. Enhance `IListenRequest` to notify on phantom document changes.
    - Extend the `listen` RPC to send notifications when phantom documents are implicitly created (because a subcollection document is added) or removed (because their last subcollection document is removed). This would allow for real-time tracking of the complete keyspace, which is crucial for maintaining live caches or indexes of a collection's structure.
5. Introduce a new API for traversing and watching parameterized paths.
    - Provide a new API that accepts a fully parameterized path pattern (e.g., `users/{userId}/posts/{postId}/comments/{commentId}`). This API would stream all matching documents, including phantom documents along the path, and return the extracted, typed parameters for each document (e.g., `{ userId: 'user-123', postId: 'post-abc', ... }`). It should also support watching for new documents that match the pattern. This is conceptually similar to how Firebase Functions v2 triggers can be defined with wildcards (e.g., `onDocumentWritten("users/{userId}/posts/{postId}")`), which has proven to be a very powerful and intuitive pattern. This would massively simplify complex data traversal and synchronization logic.
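
As an illustration of the parameter extraction described in proposal 5, here is a minimal client-side sketch. `matchPath` is a hypothetical helper, not an existing Firestore or Firebase Functions API, and it assumes that each `{name}` wildcard matches exactly one path segment.

```typescript
// Hypothetical helper: match a document path against a parameterized pattern.
// Returns the extracted parameters, or null when the path does not match.
function matchPath(pattern: string, path: string): Record<string, string> | null {
  const patternParts = pattern.split("/");
  const pathParts = path.split("/");
  if (patternParts.length !== pathParts.length) return null;

  const params: Record<string, string> = {};
  for (let i = 0; i < patternParts.length; i++) {
    const wildcard = /^\{(\w+)\}$/.exec(patternParts[i]);
    if (wildcard) {
      // A `{name}` segment captures the corresponding path segment.
      params[wildcard[1]] = pathParts[i];
    } else if (patternParts[i] !== pathParts[i]) {
      // A literal segment must match exactly.
      return null;
    }
  }
  return params;
}

const extracted = matchPath(
  "users/{userId}/posts/{postId}",
  "users/user-123/posts/post-abc"
);
```

A server-side version of this matching, combined with phantom-aware enumeration, is what the proposed API would provide. Today the client has to walk each level of the hierarchy itself.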

For us, an efficient, parallel, and complete scan of a collection's keyspace would be a valuable tool for managing Firestore data at scale. We understand this requires changes to the backend and would be grateful if you could consider this proposal and forward it to the appropriate team.
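
As a rough illustration of the scan that proposals 1 and 2 would make possible, the sketch below splits a keyspace into contiguous name ranges and scans each range independently. Everything here is hypothetical: `NameRange`, `toRanges`, and `scanRange` model a ranged, phantom-aware list call that does not exist today, and the split points stand in for cursors such as those returned by `partitionQuery`.

```typescript
// A half-open name range [start, end); undefined means unbounded on that side.
interface NameRange {
  start: string | undefined;
  end: string | undefined;
}

// Turn sorted split points into contiguous ranges that cover the whole keyspace.
function toRanges(splitPoints: string[]): NameRange[] {
  const ranges: NameRange[] = [];
  let prev: string | undefined;
  for (const point of splitPoints) {
    ranges.push({ start: prev, end: point });
    prev = point;
  }
  ranges.push({ start: prev, end: undefined });
  return ranges;
}

// Stand-in for one worker's ranged list call (hypothetical; modeled in memory).
function scanRange(allNames: string[], range: NameRange): string[] {
  return allNames.filter(
    (name) =>
      (range.start === undefined || name >= range.start) &&
      (range.end === undefined || name < range.end)
  );
}

// Sample keyspace; in the proposed API it would include phantom documents too.
const keyspace = ["alice", "bob", "carol", "dave", "erin", "frank"];
const nameRanges = toRanges(["c", "e"]);

// In a real client each range would be dispatched concurrently; here the
// per-range results are simply concatenated to show that nothing is lost.
const mergedNames = nameRanges
  .map((range) => scanRange(keyspace, range))
  .reduce((all, part) => all.concat(part), [] as string[]);
```

Because the ranges are half-open and contiguous, every name falls into exactly one range, so the merged result equals a full sequential scan while each range can be processed by a separate worker.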

