[FR] Enable fast, parallel listing of all document names, including missing/phantom documents #2434

@JrSchild

Problem Description

There is currently no efficient method to list all document names in a very large collection when it's required to also include "phantom" documents (i.e., documents that do not actually exist but do have subcollections).
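
To make "phantom" concrete, here is a minimal sketch that derives phantom documents from a list of existing document paths. It assumes slash-separated paths that alternate collection and document segments; `findPhantomParents` is a hypothetical helper written for illustration and is not part of any Firestore library.

```typescript
// Hypothetical helper: given the paths of documents that actually exist,
// return every ancestor document path that is implied by a subcollection
// but has no document of its own (a "phantom" document).
function findPhantomParents(existingPaths: string[]): string[] {
  const existing: { [path: string]: boolean } = {};
  for (const p of existingPaths) existing[p] = true;

  const found: { [path: string]: boolean } = {};
  for (const p of existingPaths) {
    const segments = p.split('/');
    // Document paths have an even number of segments:
    // collection/doc/collection/doc/...
    // Walk upward two segments at a time to visit each ancestor document.
    for (let end = segments.length - 2; end >= 2; end -= 2) {
      const ancestor = segments.slice(0, end).join('/');
      if (!existing[ancestor]) found[ancestor] = true;
    }
  }
  return Object.keys(found).sort();
}

const phantomParents = findPhantomParents([
  'users/alice',                      // exists, no subcollections
  'users/bob/posts/p1',               // users/bob was never written
  'users/carol/posts/p2/comments/c1', // neither users/carol nor posts/p2 exist
]);
console.log(phantomParents);
// [ 'users/bob', 'users/carol', 'users/carol/posts/p2' ]
```

A keys-only query over `users` would return only `users/alice` here; `users/bob` and `users/carol` are the entries this request is about.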

The goal is to perform a full scan of a collection's document IDs, similar to a keys-only query, but with the ability to find phantom documents, and to do so in parallel to make it practical for large collections (e.g., 500k to 1M+ documents).

Two APIs come close today, but each provides only one of the two required capabilities (completeness or parallelism):

  1. v1.listDocuments with showMissing: true:
    • What it does well: It correctly returns all document names, including phantoms.
    • The limitation: It is an inherently sequential, paginated API. It does not support range queries (e.g., __name__ > 'a' AND __name__ < 'b') or partition cursors. This forces a single, long-running sequential scan, which is unacceptably slow for large collections. A scan of ~420,000 documents can take over two hours.
  2. v1.partitionQuery + v1.runQuery:
    • What it does well: This is the recommended and highly effective method for parallelizing reads across a large collection. It is very fast.
    • The limitation: The runQuery RPC does not support a showMissing flag. As a result, it only returns existing documents and silently skips over phantom documents, making it unsuitable for use cases that require a complete list of all document paths.

This leaves users with a difficult choice: either a correct but impractically slow scan, or a fast but incomplete one.
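
The two code paths can be sketched side by side. This is an illustration under stated assumptions, not the library's API: `ListPage` and `RunKeysOnlyQuery` are hypothetical adapter types that a caller would implement on top of `v1.listDocuments` and `v1.partitionQuery` + `v1.runQuery`. The request field names (`parent`, `collectionId`, `showMissing`, `pageSize`, `pageToken`) mirror the ListDocuments request, and the page size of 500 is an arbitrary choice.

```typescript
// Hypothetical adapter around one listDocuments page fetch.
interface ListPageRequest {
  parent: string;
  collectionId: string;
  showMissing: boolean;
  pageSize: number;
  pageToken?: string;
}
interface ListPageResponse {
  documents: Array<{ name: string }>;
  nextPageToken?: string;
}
type ListPage = (req: ListPageRequest) => Promise<ListPageResponse>;

// Approach 1: complete but sequential. Each request needs the page token
// returned by the previous one, so the pages cannot be fetched concurrently.
async function scanSequential(
  listPage: ListPage,
  parent: string,
  collectionId: string,
  pageSize: number = 500
): Promise<string[]> {
  const names: string[] = [];
  let pageToken: string | undefined = undefined;
  do {
    const page: ListPageResponse = await listPage({
      parent, collectionId, showMissing: true, pageSize, pageToken,
    });
    for (const doc of page.documents) names.push(doc.name);
    // Treat an empty token the same as a missing one: no more pages.
    pageToken = page.nextPageToken || undefined;
  } while (pageToken);
  return names;
}

// Hypothetical adapter: run a keys-only query over one partition and
// return the names of the documents it yields.
interface Partition { startAt?: string; endBefore?: string }
type RunKeysOnlyQuery = (partition: Partition) => Promise<string[]>;

// Approach 2: parallel but incomplete. Partitions are independent and can
// run concurrently, but a query only returns documents that exist.
async function scanParallel(
  partitions: Partition[],
  runKeysOnlyQuery: RunKeysOnlyQuery
): Promise<string[]> {
  const perPartition = await Promise.all(
    partitions.map((p) => runKeysOnlyQuery(p))
  );
  const names: string[] = [];
  for (const chunk of perPartition) {
    for (const name of chunk) names.push(name);
  }
  return names;
}
```

The `do`/`while` loop in `scanSequential` is the bottleneck described above: its latency grows linearly with the number of pages, while `scanParallel` is bounded by its slowest partition but never sees phantom documents.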

Use Case & Business Impact

This limitation is a significant bottleneck for critical back-office operations, data migrations, and large-scale data integrity checks. For example, when migrating data, it's essential to have a complete map of the existing structure, including documents that serve as parents for subcollections.

The inability to parallelize this task means that scripts which should take minutes can take many hours, making them difficult to run, monitor, and recover from in a production environment. This increases operational risk and engineering cost.

Proposed Solutions

To address this, we propose adding capabilities to the Firestore API that would bridge this gap. Any of the following solutions would be a massive improvement:

  1. Add showMissing: true support to the runQuery RPC.
    • This is perhaps the most direct solution. If runQuery supported showMissing, we could continue using the existing partitionQuery API to generate cursors and then execute parallel, keys-only queries that also find phantom documents. This seems like a natural extension of the existing parallel-scan pattern.
  2. Enhance listDocuments to support partitioning.
    • Add support for startAt and endAt cursors (like those from partitionQuery) to the IListDocumentsRequest. This would allow us to partition the keyspace and then make parallel listDocuments calls, each with showMissing: true.
    • Alternative: Allow where filters on the __name__ property within a listDocuments request. This would enable manual partitioning of the keyspace (e.g., by character ranges: a*, b*, etc.) and allow for parallel execution. The client library documentation for listDocuments states that showMissing may not be combined with where or orderBy, so this restriction would need to be lifted to allow manual range scans. (From the docs: "Requests with show_missing may not specify where or order_by".)
  3. Introduce a new, dedicated API for parallel keyspace enumeration.
    • Create a new RPC specifically designed for this task. It could be a "partitionable list" or a "keyspace scan" API that takes a parent path and returns a stream of all document names (existing and missing) within that path, with built-in support for parallel execution. This would provide a purpose-built tool for a common and important administrative task.
  4. Enhance IListenRequest to notify on phantom document changes.
    • Extend the listen RPC to send notifications when phantom documents are implicitly created (because a subcollection document is added) or removed (because their last subcollection document is removed). This would allow for real-time tracking of the complete keyspace, which is crucial for maintaining live caches or indexes of a collection's structure.
  5. Introduce a new API for traversing and watching parameterized paths.
    • Provide a new API that accepts a fully parameterized path pattern (e.g., users/{userId}/posts/{postId}/comments/{commentId}). This API would stream all matching documents, including phantom documents along the path, and return the extracted, typed parameters for each document (e.g., { userId: 'user-123', postId: 'post-abc', ... }). It should also support watching for new documents that match the pattern. This is conceptually similar to how Firebase Functions v2 triggers can be defined with wildcards (e.g., onDocumentWritten("users/{userId}/posts/{postId}")), which has proven to be a very powerful and intuitive pattern. This would massively simplify complex data traversal and synchronization logic.
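
For the manual-partitioning alternative in proposal 2, the client-side half could look roughly like the sketch below. It only shows how a set of boundary strings would be turned into contiguous, non-overlapping ID ranges. `IdRange`, `rangesFromBoundaries`, and `inRange` are hypothetical names, and the comparison assumes ASCII document IDs, where JavaScript string ordering and Firestore's ordering should agree. No range-restricted listDocuments call exists today; that is the part being requested.

```typescript
// A half-open range of document IDs: startAt <= id < endBefore.
// A missing bound means "unbounded" on that side.
interface IdRange { startAt?: string; endBefore?: string }

// Turn boundary strings into contiguous ranges that cover the whole keyspace.
// n distinct boundaries produce n + 1 ranges.
function rangesFromBoundaries(boundaries: string[]): IdRange[] {
  const sorted = boundaries.slice().sort();
  const ranges: IdRange[] = [];
  let previous: string | undefined = undefined;
  for (const boundary of sorted) {
    if (boundary === previous) continue; // skip duplicate boundaries
    ranges.push({ startAt: previous, endBefore: boundary });
    previous = boundary;
  }
  ranges.push({ startAt: previous }); // final, open-ended range
  return ranges;
}

// True when the ID falls inside the half-open range.
function inRange(id: string, range: IdRange): boolean {
  return (range.startAt === undefined || id >= range.startAt) &&
         (range.endBefore === undefined || id < range.endBefore);
}

// Every ID should land in exactly one range, so parallel workers
// would neither overlap nor leave gaps.
const ranges = rangesFromBoundaries(['h', 'p']);
const ids = ['alice', 'henry', 'zoe', 'Bob', '42'];
const matchesPerId = ids.map(
  (id) => ranges.filter((r) => inRange(id, r)).length
);
console.log(ranges.length, matchesPerId);
```

With such ranges in hand, each one would be passed to its own listDocuments call with showMissing: true, and the calls would run concurrently. Evenly sized ranges depend on how IDs are distributed, which is why server-generated cursors from partitionQuery (proposal 1) are likely the more robust option.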

For us, an efficient, parallel, and complete scan of a collection's keyspace would be a very useful tool for managing Firestore data at scale. We understand this requires changes to the backend and would be grateful if you could consider this proposal and forward it to the appropriate team.

Labels

  • api: firestore (Issues related to the googleapis/nodejs-firestore API.)
  • priority: p3 (Desirable enhancement or fix. May not be included in next release.)
  • type: feature request ("Nice-to-have" improvement, new feature or different behavior or design.)
