feat(kb): add Markdown-aware chunker for structured documents #8151
fausalors wants to merge 4 commits into
Conversation
Hey - I've found 2 issues, and left some high level feedback:
- The Markdown heading detection uses a simple regex over the whole text and will also match `#` lines inside fenced code blocks or inline examples; consider skipping fenced code blocks (e.g., track ```/~~~ sections) before applying the heading regex to avoid spurious section splits.
- In `upload_document`, a new `MarkdownChunker` is instantiated for every Markdown upload instead of reusing or deriving from the existing `self.chunker` configuration; consider reusing a shared instance or a small factory to avoid per-upload allocation and to keep configuration (like `include_heading_context`/`max_heading_depth`) consistently customizable.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The Markdown heading detection uses a simple regex over the whole text and will also match `#` lines inside fenced code blocks or inline examples; consider skipping fenced code blocks (e.g., track ```/~~~ sections) before applying the heading regex to avoid spurious section splits.
- In `upload_document`, a new `MarkdownChunker` is instantiated for every Markdown upload instead of reusing or deriving from the existing `self.chunker` configuration; consider reusing a shared instance or a small factory to avoid per-upload allocation and to keep configuration (like `include_heading_context`/`max_heading_depth`) consistently customizable.
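A minimal sketch of the fence-skipping idea from the first comment (hypothetical helper name, not the PR's actual code): scan line by line, toggle an open-fence flag on ```/~~~ markers, and only run the heading regex outside fences:

```python
import re


def find_headings_outside_fences(text: str) -> list[str]:
    """Return ATX heading lines, skipping any that sit inside ``` / ~~~ fences."""
    headings: list[str] = []
    fence_char = None  # character of the currently open fence, or None
    for line in text.splitlines():
        stripped = line.lstrip()
        m = re.match(r"(`{3,}|~{3,})", stripped)
        if m:
            if fence_char is None:
                fence_char = m.group(1)[0]  # opening fence
            elif stripped[0] == fence_char:
                fence_char = None  # matching closing fence
            continue
        if fence_char is None and re.match(r"#{1,6}\s", stripped):
            headings.append(line.rstrip())
    return headings
```

This is a simplification (it ignores fence length matching and indented code blocks), but it shows the shape: spurious `#` lines inside fences never reach the heading regex.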
## Individual Comments
### Comment 1
<location path="astrbot/core/knowledge_base/chunking/markdown.py" line_range="110-119" />
<code_context>
+ chunk_text = sub_chunk
+ raw_chunks.append((chunk_text.strip(), True))
+
+ # Merge chunks that lack substantial body text into the next chunk that has one
+ merged = []
+ pending = ""
+ for chunk_text, has_body in raw_chunks:
+ if not chunk_text:
+ continue
+ if not has_body:
+ # Heading-only section: buffer it to merge into the next chunk with content
+ pending += chunk_text + "\n\n"
+ else:
+ if pending:
+ combined = pending + chunk_text
+ # Merge if the combined text does not exceed chunk_size
+ if len(combined) <= chunk_size:
+ merged.append(combined.strip())
+ else:
+ # Too long after merging; keep them as separate chunks
+ merged.append(pending.strip())
+ merged.append(chunk_text.strip())
+ pending = ""
+ else:
</code_context>
<issue_to_address>
**issue (bug_risk):** Pending header-only chunks can exceed `chunk_size` without being further split
Because `pending` is only size-checked when combined with a `has_body=True` chunk, a run of many or very long header-only chunks can make `pending` exceed `chunk_size`. In that case, the `else` path appends `pending`/`pending.strip()` directly to `merged` (and similarly in the tail handling), producing chunks larger than `chunk_size`. Consider either enforcing a size cap as `pending` grows or passing `pending` through the fallback chunker before appending, so all emitted chunks stay within `chunk_size`.
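One possible shape of that cap (hypothetical free function for illustration; the real fix would live inside the merge loop): flush `pending` to `merged` before it outgrows the budget:

```python
def append_heading_only(pending: str, title_block: str,
                        merged: list[str], chunk_size: int) -> str:
    """Grow `pending` with a heading-only block, flushing first if it would overflow."""
    candidate = pending + title_block + "\n\n"
    if pending and len(candidate) > chunk_size:
        # Emit the buffered headings before they exceed chunk_size
        merged.append(pending.strip())
        candidate = title_block + "\n\n"
    return candidate
```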
</issue_to_address>
### Comment 2
<location path="astrbot/core/knowledge_base/chunking/markdown.py" line_range="13" />
<code_context>
+from .recursive import RecursiveCharacterChunker
+
+
+class MarkdownChunker(BaseChunker):
+ """Markdown-aware chunker
+
</code_context>
<issue_to_address>
**issue (complexity):** Consider refactoring `MarkdownChunker` to introduce small typed helpers (e.g., dataclasses and merge functions) so `chunk()` becomes a clearer high-level orchestration rather than a large stateful method.
You can keep all current behavior but reduce complexity by introducing small typed objects and extracting the “phases” of `chunk()` into helpers. That will make the control flow much easier to follow and maintain.
### 1. Use a `Section` dataclass instead of generic dicts
Encoding the structure of `_parse_sections` in a dataclass removes a lot of implicit knowledge from `chunk()`:
```python
from dataclasses import dataclass
from typing import List
@dataclass
class Section:
heading_path: List[str]
text: str
has_body: bool
```
Update `_parse_sections` to return `list[Section]`:
```python
def _parse_sections(self, text: str) -> list[Section]:
...
sections: list[Section] = []
preamble = text[: headings[0]["start"]].strip()
if preamble:
sections.append(Section(heading_path=[], text=preamble, has_body=True))
heading_stack: list[dict] = []
for i, heading in enumerate(headings):
...
section_text = heading_line
if body:
section_text += "\n" + body
heading_path = [h["title"] for h in heading_stack[:-1]]
sections.append(
Section(
heading_path=heading_path,
text=section_text,
has_body=bool(body),
)
)
return sections
```
This lets `chunk()` work with `Section` objects directly, instead of re‑deriving meanings from dict keys.
### 2. Extract heading-context / continuation formatting
Move the continuation-label formatting out of the main loop so `chunk()` only decides *when* to create continuation chunks:
```python
def _apply_heading_context(
self, heading_path: list[str], content: str, is_continuation: bool
) -> str:
if not self.include_heading_context or not heading_path:
return content.strip()
title = " > ".join(heading_path)
if is_continuation:
return f"[续] {title}\n\n{content}".strip()
else:
return f"{title}\n\n{content}".strip()
```
Use it inside your fallback chunking:
```python
sub_chunks = await self._fallback_chunker.chunk(
section_text, chunk_size=chunk_size, chunk_overlap=chunk_overlap
)
for i, sub_chunk in enumerate(sub_chunks):
raw_chunks.append(
(
self._apply_heading_context(
heading_path, sub_chunk, is_continuation=(i > 0)
),
True,
)
)
```
This removes a nested conditional from the main orchestration logic.
### 3. Factor out the two merge passes into helpers
You currently have two separate passes with different ad‑hoc state (`pending`, `buf`). You can keep behavior but make it clearer by using small helpers over a simple chunk type.
Introduce a light `Chunk` type:
```python
@dataclass
class Chunk:
text: str
has_body: bool # for heading-only merge; ignored by size merge
```
Then:
```python
def _merge_heading_only_chunks(
self, chunks: list[Chunk], chunk_size: int
) -> list[str]:
merged: list[str] = []
pending = ""
for c in chunks:
if not c.text:
continue
if not c.has_body:
pending += c.text + "\n\n"
continue
if pending:
combined = pending + c.text
if len(combined) <= chunk_size:
merged.append(combined.strip())
else:
merged.append(pending.strip())
merged.append(c.text.strip())
pending = ""
else:
merged.append(c.text.strip())
if pending:
if merged and len(merged[-1] + "\n\n" + pending.strip()) <= chunk_size:
merged[-1] = merged[-1] + "\n\n" + pending.strip()
else:
merged.append(pending.strip())
return [m for m in merged if m.strip()]
```
And a separate helper that only cares about size, not headings:
```python
def _merge_short_chunks_by_size(
self, chunks: list[str], chunk_size: int, min_chunk_size: int
) -> list[str]:
if min_chunk_size <= 0 or len(chunks) <= 1:
return chunks
final: list[str] = []
buf = ""
for c in chunks:
if buf:
combined = buf + "\n\n" + c
if len(combined) <= chunk_size:
buf = combined
else:
final.append(buf)
buf = c if len(c) < min_chunk_size else ""
if len(c) >= min_chunk_size:
final.append(c)
elif len(c) < min_chunk_size:
buf = c
else:
final.append(c)
if buf:
if final and len(final[-1] + "\n\n" + buf) <= chunk_size:
final[-1] = final[-1] + "\n\n" + buf
else:
final.append(buf)
return final
```
Then `chunk()` becomes a higher‑level orchestration:
```python
sections = self._parse_sections(text)
...
raw_chunks: list[Chunk] = self._sections_to_chunks(
sections, chunk_size, chunk_overlap
)
merged = self._merge_heading_only_chunks(raw_chunks, chunk_size)
merged = self._merge_short_chunks_by_size(
merged, chunk_size, self.min_chunk_size
)
return merged
```
Where `_sections_to_chunks` contains only the “section → raw chunk” logic (heading context + recursion), independent from merging rules.
This keeps all current behavior but flattens the nested state machines into small, named helpers that are easier to reason about and test independently.
</issue_to_address>
Code Review
This pull request introduces a MarkdownChunker to provide structure-aware document splitting for Markdown files, preserving heading context across chunks and integrating it into the document upload process. Feedback focuses on improving the robustness of the chunker, specifically by clamping max_heading_depth to prevent regex errors, adjusting sub-chunk sizes to account for prefix lengths, and refining header detection to ignore matches within code blocks. Additionally, it is recommended to avoid hardcoded language-specific strings like "[续]" to support multi-language environments.
- Clamp max_heading_depth to 1-6 to prevent regex errors
- Deduct prefix length when splitting oversized sections
- Replace hardcoded "[续]" with configurable continuation_prefix
- Skip fenced code blocks when detecting headings
- Cap pending size to prevent chunks exceeding chunk_size
- Refactor into dataclass + helper methods
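The depth clamp from the first fix could be as small as this (a sketch, not the PR's exact code): clamp the configured value into the valid ATX range before building the pattern:

```python
import re


def build_heading_pattern(max_heading_depth: int) -> re.Pattern:
    # Clamp to the valid ATX range 1-6 so out-of-range config values
    # (0, 9, ...) cannot produce a nonsensical regex quantifier
    depth = min(6, max(1, max_heading_depth))
    return re.compile(rf"^(#{{1,{depth}}})\s+(.+)$", re.MULTILINE)
```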
@sourcery-ai review
Hey - I've found 2 issues, and left some high level feedback:
- Consider allowing the MarkdownChunker configuration (e.g., include_heading_context, max_heading_depth, min_chunk_size) to be injected from existing KB or app settings instead of hardcoding defaults in kb_helper so users can tune behavior without code changes.
- In MarkdownChunker.chunk, the heavy regex parsing and section processing are done synchronously inside an async method; if this is used on large documents or high throughput, consider offloading parsing to a thread pool or making chunk a sync method for clarity about its blocking behavior.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Consider allowing the MarkdownChunker configuration (e.g., include_heading_context, max_heading_depth, min_chunk_size) to be injected from existing KB or app settings instead of hardcoding defaults in kb_helper so users can tune behavior without code changes.
- In MarkdownChunker.chunk, the heavy regex parsing and section processing are done synchronously inside an async method; if this is used on large documents or high throughput, consider offloading parsing to a thread pool or making chunk a sync method for clarity about its blocking behavior.
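If the offloading route were taken, the change could be as simple as wrapping the existing synchronous body (a sketch with a placeholder parser; assumes Python 3.9+ for `asyncio.to_thread`):

```python
import asyncio


class MarkdownChunkerSketch:
    def _chunk_sync(self, text: str, chunk_size: int) -> list[str]:
        # Stand-in for the real regex parsing / section processing
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    async def chunk(self, text: str, chunk_size: int = 1024) -> list[str]:
        # Run the CPU-bound work in a worker thread so the event loop
        # stays responsive during large-document uploads
        return await asyncio.to_thread(self._chunk_sync, text, chunk_size)
```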
## Individual Comments
### Comment 1
<location path="astrbot/core/knowledge_base/chunking/markdown.py" line_range="131-132" />
<code_context>
+ # Section is too long; split it internally with the recursive chunker
+ # Deduct the prefix length so the text stays within chunk_size once the prefix is added
+ prefix_len = self._estimate_prefix_length(heading_path)
+ effective_chunk_size = max(
+ chunk_size - prefix_len, chunk_size // 2
+ )
+
</code_context>
<issue_to_address>
**issue (bug_risk):** The `effective_chunk_size` calculation can produce chunks that exceed the intended `chunk_size` once the heading prefix is added.
Because of the `chunk_size // 2` fallback, large `prefix_len` values (e.g. `chunk_size=1024`, `prefix_len=900`) can produce `effective_chunk_size=512`, so `len(prefix) + len(content)` can exceed `chunk_size` (~1412 in this example), breaking the size guarantee.
To preserve the invariant, derive `effective_chunk_size` only from the remaining budget after the prefix, e.g.:
```python
base = chunk_size - prefix_len
effective_chunk_size = max(1, base)
```
This keeps `len(prefix) + len(content) <= chunk_size` even for very long headings.
</issue_to_address>
### Comment 2
<location path="astrbot/core/knowledge_base/chunking/markdown.py" line_range="319-320" />
<code_context>
+ def _find_fenced_code_ranges(text: str) -> list[tuple[int, int]]:
+ """Find the (start, end) ranges of all fenced code blocks"""
+ ranges: list[tuple[int, int]] = []
+ fence_pattern = re.compile(r"^(`{3,}|~{3,})", re.MULTILINE)
+ matches = list(fence_pattern.finditer(text))
+
+ i = 0
</code_context>
<issue_to_address>
**issue:** Single unmatched fenced code blocks are not treated as code, so `#` inside them may be misclassified as headings.
Because the loop stops at `while i < len(matches) - 1:`, a single unmatched fence (e.g. an unterminated ``` block at EOF) is never handled, and the `else` branch that should mark the rest of the text as code is never reached. This allows `#` inside that trailing fence to be parsed as real headings.
To treat an unmatched opening fence as extending to EOF (like common Markdown renderers), iterate over all matches and special‑case the last unmatched one, for example:
```python
i = 0
while i < len(matches):
open_match = matches[i]
open_fence = open_match.group(1)
fence_char = open_fence[0]
fence_len = len(open_fence)
for j in range(i + 1, len(matches)):
close_match = matches[j]
close_fence = close_match.group(1)
if close_fence[0] == fence_char and len(close_fence) >= fence_len:
ranges.append((open_match.start(), close_match.end()))
i = j + 1
break
else:
ranges.append((open_match.start(), len(text)))
break
```
</issue_to_address>
This will be handled in a follow-up PR once the maintainers confirm how the KB settings system should be integrated.
This follows the same pattern as the existing RecursiveCharacterChunker; changing it would require a unified refactor of the whole chunking module, which is out of scope for this PR.
@sourcery-ai review
Hey - I've found 2 issues, and left some high level feedback:
- When switching to `MarkdownChunker` in `upload_document`, you always construct a new instance with only `chunk_size` and `chunk_overlap`; if callers customize other chunker options (e.g. include heading context, min chunk size), consider exposing those via configuration or reusing an injected Markdown-aware chunker to avoid silently ignoring settings.
- Markdown detection in `upload_document` is based on a lowercase suffix check; if filenames can contain uppercase or mixed-case extensions, ensure `file_ext` is normalized once (e.g. at upload) or use a case-insensitive comparison to avoid missing valid Markdown files.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- When switching to `MarkdownChunker` in `upload_document`, you always construct a new instance with only `chunk_size` and `chunk_overlap`; if callers customize other chunker options (e.g. include heading context, min chunk size), consider exposing those via configuration or reusing an injected Markdown-aware chunker to avoid silently ignoring settings.
- Markdown detection in `upload_document` is based on a lowercase suffix check; if filenames can contain uppercase or mixed-case extensions, ensure `file_ext` is normalized once (e.g. at upload) or use a case-insensitive comparison to avoid missing valid Markdown files.
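A sketch of the suggested normalization (hypothetical helper; the PR's actual code checks `file_ext` directly):

```python
from pathlib import Path

MARKDOWN_EXTS = {".md", ".markdown", ".mkd", ".mdx"}


def is_markdown_file(filename: str) -> bool:
    # Normalize once so "README.MD" or "notes.Markdown" are still detected
    return Path(filename).suffix.lower() in MARKDOWN_EXTS
```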
## Individual Comments
### Comment 1
<location path="astrbot/core/knowledge_base/chunking/markdown.py" line_range="129-131" />
<code_context>
+ raw_chunks.append((full_text.strip(), has_body))
+ else:
+ # Section is too long; split it internally with the recursive chunker
+ # Deduct the prefix length so the text stays within chunk_size once the prefix is added
+ prefix_len = self._estimate_prefix_length(heading_path)
+ effective_chunk_size = max(
+ chunk_size // 4, chunk_size - prefix_len
+ )
</code_context>
<issue_to_address>
**issue (bug_risk):** Effective sub-chunk size can exceed the configured chunk_size when heading prefixes are very long.
When `prefix_len > chunk_size`, `effective_chunk_size` becomes `chunk_size // 4` and is used to split `section_text`, but `_apply_heading_context` later prepends the full prefix without re-checking the size. This can yield chunks where `len(prefix) + len(sub_chunk) > chunk_size`, breaking the size guarantee.
To preserve the `chunk_size` invariant, you could either:
- Compute `effective_chunk_size` as `max(1, chunk_size - prefix_len)` (and explicitly handle very small values), or
- Enforce `chunk_size` after prefixing by trimming or further splitting so `len(chunk_text) <= chunk_size`.
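Both options fit in a few lines (hypothetical helpers illustrating the invariant, not the PR's code):

```python
def split_budget(chunk_size: int, prefix_len: int) -> int:
    # Option 1: derive the sub-chunk budget only from what remains after
    # the prefix, so len(prefix) + len(sub_chunk) <= chunk_size always holds
    return max(1, chunk_size - prefix_len)


def enforce_budget(prefix: str, sub_chunk: str, chunk_size: int) -> str:
    # Option 2: post-check after prefixing, trimming pathological overflow
    text = prefix + sub_chunk
    return text if len(text) <= chunk_size else text[:chunk_size]
```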
</issue_to_address>
### Comment 2
<location path="astrbot/core/knowledge_base/chunking/markdown.py" line_range="15" />
<code_context>
+
+
+@dataclass
+class _Section:
"""A parsed Markdown section"""
+
</code_context>
<issue_to_address>
**issue (complexity):** Consider refactoring heading-only section handling and short-chunk merging into simpler, single-pass flows to reduce state and make the markdown chunker easier to reason about.
You can reduce complexity in two focused spots without changing behavior: heading-only handling and short‑chunk merging.
---
### 1. Remove `has_body` + `_merge_heading_only_chunks` two‑phase handling
Right now:
- `_parse_sections` packs heading line + body into `Section.text` and sets `has_body`.
- `_sections_to_chunks` propagates `has_body`.
- `_merge_heading_only_chunks` then reinterprets that flag.
You can instead keep heading‑only sections out of the chunk list in the first place and prepend them directly to the next section’s text. That lets you:
- Drop `has_body` from `_Section`
- Make `_sections_to_chunks` return plain `list[str]`
- Inline the “heading‑only merging” logic where chunks are produced
Example of a narrower refactor:
```python
@dataclass
class _Section:
heading_path: list[str]
heading_title: str # current heading title
body: str # body WITHOUT heading line
async def _sections_to_chunks(
self, sections: list[_Section], chunk_size: int, chunk_overlap: int
) -> list[str]:
chunks: list[str] = []
pending_heading_only: list[str] = []
for section in sections:
# heading-only section: accumulate and continue
if not section.body.strip():
pending_heading_only.append(section.heading_title)
continue
# build effective heading_path including any pending heading-only titles
heading_path = section.heading_path + pending_heading_only + [section.heading_title]
pending_heading_only = []
context_prefix = self._build_context_prefix(heading_path)
full_text = context_prefix + section.body
if len(full_text) <= chunk_size:
chunks.append(full_text.strip())
else:
prefix_len = self._estimate_prefix_length(heading_path)
effective_chunk_size = max(chunk_size // 4, chunk_size - prefix_len)
sub_chunks = await self._fallback_chunker.chunk(
section.body,
chunk_size=effective_chunk_size,
chunk_overlap=chunk_overlap,
)
for i, sub_chunk in enumerate(sub_chunks):
chunk_text = self._apply_heading_context(
heading_path, sub_chunk, is_continuation=(i > 0)
)
chunks.append(chunk_text.strip())
# optional: if trailing heading-only sections exist, attach to last chunk
if pending_heading_only and chunks:
tail = "\n\n" + "\n\n".join(pending_heading_only)
if len(chunks[-1] + tail) <= chunk_size:
chunks[-1] = (chunks[-1] + tail).strip()
else:
chunks.extend(pending_heading_only)
return chunks
```
Then you can simplify `chunk()` to skip `_merge_heading_only_chunks` entirely:
```python
raw_chunks = await self._sections_to_chunks(sections, chunk_size, chunk_overlap)
merged = self._merge_short_chunks(raw_chunks, chunk_size)
```
And `_Section` no longer needs `has_body`.
This keeps all existing behaviors (headings merged into nearest body) but removes the boolean flag, the two‑phase pipeline, and `_merge_heading_only_chunks`.
---
### 2. Simplify `_merge_short_chunks` control flow
The current version has a `buf` with branchy updates and conditional `final.append` calls that are hard to reason about. You can keep semantics but use a single “current chunk” accumulator.
Pattern:
- Maintain `current_chunk` (string or `None`).
- For each `c`:
- Try to merge into `current_chunk` if under `chunk_size`.
- Ensure that anything shorter than `min_chunk_size` is merged forward if possible.
A simpler outline:
```python
def _merge_short_chunks(self, chunks: list[str], chunk_size: int) -> list[str]:
if self.min_chunk_size <= 0 or len(chunks) <= 1:
return chunks
result: list[str] = []
current: str | None = None
for c in chunks:
if current is None:
current = c
continue
combined = current + "\n\n" + c
# prefer merging if either side is too small and we stay within limit
if (len(current) < self.min_chunk_size or len(c) < self.min_chunk_size) and len(combined) <= chunk_size:
current = combined
else:
result.append(current)
current = c
if current is not None:
# optional last merge if trailing small chunk fits into previous
if result and len(current) < self.min_chunk_size:
candidate = result[-1] + "\n\n" + current
if len(candidate) <= chunk_size:
result[-1] = candidate
return result
result.append(current)
return result
```
This keeps all logic in one place:
- “Small” is defined once (`len(c) < min_chunk_size`).
- Merge decision is explicit (`len(combined) <= chunk_size`).
- No dual role of `buf` vs `final[-1]`.
You can adjust the merge heuristics to exactly mirror current behavior, but the structure will remain much easier to read and test.
</issue_to_address>
@sourcery-ai On the first bug-risk: the previous commit already changed the fallback from chunk_size // 2 to chunk_size // 4. The extreme case (prefix_len > chunk_size) is unreasonable input that will not occur in real documents. A post-check truncation can be added later if needed; no over-defensive handling for now.
The existing RecursiveCharacterChunker splits Markdown documents by raw character count, which breaks the heading hierarchy: a chunk may contain the tail of one section and the head of the next, producing semantically incoherent chunks and degrading knowledge-base retrieval quality. This PR adds MarkdownChunker, which splits documents along the heading hierarchy so each section stays semantically intact. It is enabled automatically when a .md file is uploaded and has no effect on non-Markdown files.
Modifications
- Add astrbot/core/knowledge_base/chunking/markdown.py: a Markdown-aware chunker that splits by heading hierarchy, merges short chunks, and falls back to recursive splitting for oversized sections
- Modify astrbot/core/knowledge_base/chunking/__init__.py: export MarkdownChunker
- Modify astrbot/core/knowledge_base/kb_helper.py: detect the .md/.markdown/.mkd/.mdx suffixes at upload time and switch to MarkdownChunker automatically
This is NOT a breaking change.
Screenshots or Test Results
Benchmark: 3 documents (distributed systems design + ML engineering practice + CCF large-model certification syllabus; 11,669 characters total), 20 queries, chunk_size=256.
bge-m3 (Pro/BAAI/bge-m3, 1024d)
Recall@1 +40%, Precision@1 reaches 1.0.
mxbai-embed-large (local, 1024d)
Recall@3 +55%。
Checklist
😊 If there are new features added in the PR, I have discussed it with the authors through issues/emails, etc.
👀 My changes have been well-tested, and "Verification Steps" and "Screenshots" have been provided above.
🤓 I have ensured that no new dependencies are introduced, OR if new dependencies are introduced, they have been added to the appropriate locations in requirements.txt and pyproject.toml.
😮 My changes do not introduce malicious code.
Summary by Sourcery
Add a Markdown-aware chunker and integrate it into knowledge base document upload to preserve section structure for Markdown files.
New Features:
Enhancements: