Tags: Encamina/enmarcha
Merge pull request #177 from Encamina/@hramos/smart-chunking

# Summary

This is an improvement to PDF processing for use in RAG.

## Technical details

- PDF -> Markdown conversion using the Mistral Document AI 2505 model.
- Refinement of the resulting Markdown with GPT-4.1.
- Document segmentation (chunking) with the following rules:
  - Token limit per chunk: 1024.
  - A header hierarchy is respected (H1, H2, ... and bold text).
  - Header inheritance system: chunks keep the context of higher-level headers to preserve content coherence.
  - Filtering of very small chunks (e.g., < 30 tokens) to avoid noise in the index.

## Included files

- `src\Encamina.Enmarcha.SemanticKernel.Connectors.Document\Connectors\MistralAIDocumentConnector.cs`
  - Orchestrates extraction from PDFs, calls to MistralAI (HTTP endpoint) and subsequent refinement with a chat model (GPT-4.1).
  - Manages PDF splitting, the HTTP request to Mistral, and the logic to send parts for LLM refinement.
  - Configurable via `MistralAIDocumentConnectorOptions` (Endpoint, ApiKey, ModelName, SplitPageNumber, LLMPostProcessing).
- `src\Encamina.Enmarcha.SemanticKernel.Connectors.Document\Utils\MistralAIHelper.cs`
  - Utilities for:
    - Splitting PDFs by pages (`SplitPdfByPagesAsync`) using PdfPig.
    - Building a base64 data URL of the PDF to send to the service (`BuildPdfDataUrlAsync`).
    - Extracting and combining Markdown from Mistral's JSON response (`ExtractAndCombineMarkdown`), replacing image references with filenames.
    - Splitting Markdown into manageable parts for LLM refinement (`SplitMarkdownForRefinement`).
    - Normalizing and extracting embedded images (`ExtractImageDataFromPage` / `ReplaceImagesInMarkdown`).
  - Implements merging and cleanup during page extraction.
- `src\Encamina.Enmarcha.AI\TextSplitters\EnrichedMarkdownCharacterSplitter.cs`
  - Splitter that:
    - Respects header hierarchies (#, ##, ###, ...) and treats H1 as main sections.
    - Performs recursive splitting by header levels and by delimiters when necessary.
    - Extracts metadata (H1...H6 and Bold) and maintains inherited context across chunks.
    - Avoids very small chunks and prioritizes keeping paragraphs/semantic blocks together.
- `src\Encamina.Enmarcha.AI\OpenAI\Abstractions\ModelInfo.cs`
  - Add GPT-5 models:
    - GPT-5
    - GPT-5-mini

## Rules and transformations applied

- Preserve all textual content from the PDF (do not remove text); only correct/structure it into Markdown.
- Merge tables split across pages when they share identical headers or are direct continuations.
- Fix malformed tables, lists, and markdown; remove repeated footers/headers and HTML pagination comments.
- Correct common OCR errors (hyphenated/split words, extra spaces, stray characters).
- Do not generate automatic links or HTML entities; do not add new content that changes the original text.
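
## Illustrative sketch of the chunking rules

The chunking rules listed under "Technical details" (token limit per chunk, header inheritance, filtering of very small chunks) can be illustrated with a minimal, language-neutral sketch. This is not the code of `EnrichedMarkdownCharacterSplitter`; it is written in Python for brevity, and every name in it (`Chunk`, `count_tokens`, `chunk_markdown`) is hypothetical. It assumes whitespace-separated words as a stand-in for real model tokens, handles only `#`-style headers (bold-text headers and recursive splitting by delimiters are left out), and never splits inside a single line, so one very long line can still exceed the limit.

```python
import re
from dataclasses import dataclass

# Matches Markdown ATX headers such as "# Title" or "### Subsection".
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")


@dataclass
class Chunk:
    headers: dict[int, str]  # inherited header context: level -> title
    text: str


def count_tokens(text: str) -> int:
    # Assumption: word count approximates the tokenizer used in practice.
    return len(text.split())


def chunk_markdown(markdown: str, max_tokens: int = 1024,
                   min_tokens: int = 30) -> list[Chunk]:
    chunks: list[Chunk] = []
    context: dict[int, str] = {}  # header path currently in effect
    lines: list[str] = []         # body lines collected for the next chunk

    def flush() -> None:
        text = "\n".join(lines).strip()
        lines.clear()
        # Drop very small chunks so they do not add noise to the index.
        if count_tokens(text) >= min_tokens:
            chunks.append(Chunk(dict(context), text))

    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()  # close the chunk that belongs to the previous header path
            level = len(match.group(1))
            # A new header replaces same-level and deeper headers,
            # while higher-level (ancestor) headers are inherited.
            for stale in [lvl for lvl in context if lvl >= level]:
                del context[stale]
            context[level] = match.group(2).strip()
        else:
            # Start a new chunk when adding this line would pass the limit.
            if lines and count_tokens("\n".join(lines + [line])) > max_tokens:
                flush()
            lines.append(line)
    flush()
    return chunks
```

Each resulting chunk carries the titles of its ancestor headers in `headers`, which is the kind of inherited context the description refers to; a retrieval pipeline could prepend those titles to the chunk text before embedding it.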