[Bug]: Legacy /chunks parse & stop use requester tenant for docstore index — stale chunks on team-shared datasets #15960

@kiannidev

Self Checks

  • I have searched for existing issues, including closed ones.
  • I confirm that I am using English to submit this report (Language Policy).
  • Non-English title submissions will be closed directly (Language Policy).
  • Please do not modify this template :) and fill in all the required fields.

RAGFlow workspace code commit ID

0d836af

RAGFlow image version

0d836af

Other environment information

Actual behavior

Legacy parse and stop_parsing build the docstore index name from the requester tenant_id injected by @add_tenant_id_to_kwargs, not from the dataset owner's tenant (kb.tenant_id).

# chunk_api.py — parse & stop_parsing (current main)
index_name = search.index_name(tenant_id)  # requester user id
if settings.docStoreConn.index_exist(index_name, dataset_id):
    settings.docStoreConn.delete({"doc_id": ...}, index_name, dataset_id)

Chunks for a dataset are indexed under the owner tenant index (ragflow_{kb.tenant_id}), which is what every other handler in the same file already uses via _get_dataset_tenant_id():

# chunk_api.py — list_chunks, add_chunk, rm_chunk, etc.
dataset_tenant_id = _get_dataset_tenant_id(dataset_id)
search.index_name(dataset_tenant_id)

When team member B re-parses or stops parsing on a dataset owned by A:

  1. KnowledgebaseService.accessible passes (team permission).
  2. index_exist(ragflow_B, dataset_id) is false (real data lives in ragflow_A).
  3. Delete is skipped (logged as "index does not exist").
  4. Stale chunks remain in the docstore; re-parse can duplicate or serve outdated chunks.
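
The four steps above can be reproduced in isolation with a minimal sketch. `FakeDocStore` and `delete_before_parse` are hypothetical stand-ins written for illustration, not RAGFlow's real connector or handler; only the `ragflow_{tenant_id}` naming and the guarded-delete shape are taken from this report.

```python
class FakeDocStore:
    """Hypothetical in-memory docstore keyed by (index_name, kb_id)."""

    def __init__(self):
        self.indices = {}  # (index_name, kb_id) -> {doc_id: [chunks]}

    def index_exist(self, index_name, kb_id):
        return (index_name, kb_id) in self.indices

    def insert(self, doc_id, chunks, index_name, kb_id):
        docs = self.indices.setdefault((index_name, kb_id), {})
        docs.setdefault(doc_id, []).extend(chunks)

    def delete(self, condition, index_name, kb_id):
        # Returns the number of chunks removed for the given doc_id.
        removed = self.indices[(index_name, kb_id)].pop(condition["doc_id"], [])
        return len(removed)


def index_name(tenant_id):
    # Naming scheme as described in this report: ragflow_{tenant_id}
    return f"ragflow_{tenant_id}"


def delete_before_parse(store, tenant_id, kb_id, doc_id):
    """Guarded delete: returns None when the index is missing (delete skipped)."""
    name = index_name(tenant_id)
    if not store.index_exist(name, kb_id):
        return None
    return store.delete({"doc_id": doc_id}, name, kb_id)


store = FakeDocStore()
# Owner A's dataset DS holds two chunks for DOC under ragflow_A.
store.insert("DOC", ["chunk-1", "chunk-2"], index_name("A"), "DS")

# Requester tenant B: ragflow_B does not exist, so the delete is skipped.
skipped = delete_before_parse(store, "B", "DS", "DOC")
remaining_after_b = len(store.indices[(index_name("A"), "DS")]["DOC"])

# Owner tenant A: the delete hits the index that actually holds the chunks.
deleted = delete_before_parse(store, "A", "DS", "DOC")
```

With the requester tenant the old chunks survive untouched; only the owner-tenant lookup removes them.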

Expected behavior

Resolve index_name from the dataset owner (_get_dataset_tenant_id(dataset_id) or doc[0].kb_id owner tenant) and pass doc[0].kb_id as the kb shard id — matching list_chunks, document_api.py guarded deletes, and the newer POST /documents/parse flow.

Steps to reproduce

1. Deploy RAGFlow `main` @ `0d836afd3`.
2. **User A** creates dataset `DS` with `permission = team` and uploads/parses document `DOC`.
3. Invite **User B** to the same tenant (team member, not owner of `DS`).
4. As **User B**, start parsing `DOC` again via legacy route:

   curl -sS -X POST "http://<HOST>/api/v1/datasets/DS/chunks" \
     -H "Authorization: Bearer KEY_B" \
     -H "Content-Type: application/json" \
     -d '{"document_ids": ["DOC"]}'

5. Observe logs: `Skipping chunk delete during parse for doc DOC: index ragflow_<B>/DS does not exist`
6. Query chunks (as either user) — old chunks from the prior parse are still present alongside new ones.

**Control:** Same operation as dataset owner **User A** deletes from `ragflow_<A>` and does not leave stale chunks.

Additional information

Suggested fix:

dataset_tenant_id = _get_dataset_tenant_id(dataset_id)
index_name = search.index_name(dataset_tenant_id)
if settings.docStoreConn.index_exist(index_name, doc[0].kb_id):
    settings.docStoreConn.delete({"doc_id": id}, index_name, doc[0].kb_id)

Add unit tests where accessible passes for a non-owner team member but get_by_id returns kb.tenant_id = owner-tenant, asserting delete uses search.index_name(owner-tenant).
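
A rough sketch of what such a test could look like, using `unittest.mock`. The helper `delete_doc_chunks` is a hypothetical extraction of the suggested fix so it can be exercised without the web framework; the real handlers in chunk_api.py, their fixtures, and the way `_get_dataset_tenant_id` is patched will differ.

```python
from unittest.mock import MagicMock


def index_name(tenant_id):
    # Naming scheme as described in this report: ragflow_{tenant_id}
    return f"ragflow_{tenant_id}"


def delete_doc_chunks(doc_store, get_dataset_tenant_id, dataset_id, doc_id):
    """Hypothetical helper: resolve the index from the dataset owner, then delete."""
    owner_index = index_name(get_dataset_tenant_id(dataset_id))
    if doc_store.index_exist(owner_index, dataset_id):
        doc_store.delete({"doc_id": doc_id}, owner_index, dataset_id)


def test_delete_uses_owner_index():
    doc_store = MagicMock()
    doc_store.index_exist.return_value = True
    # Stand-in for _get_dataset_tenant_id: the dataset belongs to "owner-tenant".
    get_owner = MagicMock(return_value="owner-tenant")

    # The requester is a non-owner team member; their tenant id is never consulted.
    delete_doc_chunks(doc_store, get_owner, "DS", "DOC")

    get_owner.assert_called_once_with("DS")
    doc_store.index_exist.assert_called_once_with("ragflow_owner-tenant", "DS")
    doc_store.delete.assert_called_once_with(
        {"doc_id": "DOC"}, "ragflow_owner-tenant", "DS"
    )


test_delete_uses_owner_index()
```

The same assertions should fail against the current behavior, where the index name is built from the requester's tenant id.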

Labels: 🐞 bug
