Tags: Encamina/enmarcha

v10.0.2

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
Merge pull request #183 from Encamina/@mramos/fix_OpenAPI

Update Swashbuckle packages to 8.1.4

v10.0.1

Merge pull request #182 from Encamina/@rliberoff/update-semantic-kernel

Update Semantic Kernel dependencies to 1.68.0, bump version

v10.0.0

Merge pull request #181 from Encamina/upgrade-to-NET10

Upgrade to net10

v10.0.0-preview-09

Merge pull request #180 from Encamina/@ddiaz/migrating-middlewares

Add SemanticKernel rate limit middleware

v10.0.0-preview-08

Merge pull request #179 from Encamina/@ddiaz/new-version-update

Update version suffix to preview-08

v10.0.0-preview-07

Merge pull request #177 from Encamina/@hramos/smart-chunking

# Summary
Improves PDF processing for use in retrieval-augmented generation (RAG).
## Technical details
- PDF -> Markdown conversion using the Mistral Document AI 2505 model.
- Refinement of the resulting Markdown with GPT-4.1.
- Document segmentation (chunking) with the following rules:
  - Token limit per chunk: 1024.
  - A header hierarchy is respected (H1, H2, ... and bold text).
  - Header inheritance system: chunks keep the context of higher-level headers to preserve content coherence.
  - Filtering of very small chunks (e.g., < 30 tokens) to avoid noise in the index.
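The segmentation rules above can be sketched roughly as follows. This is a minimal, language-neutral illustration in Python, not the repository's C# implementation: the function and class names are hypothetical, token counting is approximated by whitespace-separated words (a real pipeline would use a model tokenizer), and bold-text headers are left out for brevity.

```python
import re
from dataclasses import dataclass

MAX_TOKENS = 1024  # per-chunk limit stated in the commit message
MIN_TOKENS = 30    # example threshold for dropping very small chunks

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def count_tokens(text: str) -> int:
    # Assumption: words stand in for tokens; swap in a real tokenizer if needed.
    return len(text.split())


@dataclass
class Chunk:
    headers: list[str]  # inherited header path, outermost first
    body: str

    @property
    def text(self) -> str:
        # The chunk text is prefixed with the headers it inherits.
        return "\n".join(self.headers + [self.body])


def chunk_markdown(markdown, max_tokens=MAX_TOKENS, min_tokens=MIN_TOKENS):
    chunks = []
    path = {}    # header level -> header line currently in scope
    buffer = []  # body lines of the section being read

    def flush():
        body = "\n".join(buffer).strip()
        buffer.clear()
        if not body:
            return
        headers = [path[level] for level in sorted(path)]
        # Split an oversized section on paragraph boundaries. A single
        # paragraph longer than max_tokens is kept whole in this sketch,
        # and the limit is applied to the body only, not the headers.
        pieces, current = [], []
        for para in re.split(r"\n\s*\n", body):
            if current and count_tokens("\n\n".join(current + [para])) > max_tokens:
                pieces.append("\n\n".join(current))
                current = []
            current.append(para)
        pieces.append("\n\n".join(current))
        for piece in pieces:
            if count_tokens(piece) >= min_tokens:  # filter out noise
                chunks.append(Chunk(headers=list(headers), body=piece))

    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # A new header replaces any header at the same or a deeper level.
            for stale in [lvl for lvl in path if lvl >= level]:
                del path[stale]
            path[level] = line.strip()
        else:
            buffer.append(line)
    flush()
    return chunks
```

Each chunk carries the header path that was in scope when its text was read, so a fragment under `## Install` still records that it belongs to `# Guide`.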

## Included files
- `src\Encamina.Enmarcha.SemanticKernel.Connectors.Document\Connectors\MistralAIDocumentConnector.cs`
  - Orchestrates extraction from PDFs, calls to MistralAI (HTTP endpoint) and subsequent refinement with a chat model (GPT-4.1).
  - Manages PDF splitting, the HTTP request to Mistral, and the logic to send parts for LLM refinement.
  - Configurable via `MistralAIDocumentConnectorOptions` (Endpoint, ApiKey, ModelName, SplitPageNumber, LLMPostProcessing).

- `src\Encamina.Enmarcha.SemanticKernel.Connectors.Document\Utils\MistralAIHelper.cs`
  - Utilities for:
    - Splitting PDFs by pages (`SplitPdfByPagesAsync`) using PdfPig.
    - Building a base64 data URL of the PDF to send to the service (`BuildPdfDataUrlAsync`).
    - Extracting and combining Markdown from Mistral's JSON response (`ExtractAndCombineMarkdown`), replacing image references with filenames.
    - Splitting Markdown into manageable parts for LLM refinement (`SplitMarkdownForRefinement`).
    - Normalizing and extracting embedded images (`ExtractImageDataFromPage` / `ReplaceImagesInMarkdown`).
  - Implements merging and cleanup during page extraction.

- `src\Encamina.Enmarcha.AI\TextSplitters\EnrichedMarkdownCharacterSplitter.cs`
  - Splitter that:
    - Respects header hierarchies (#, ##, ###, ...) and treats H1 as main sections.
    - Performs recursive splitting by header levels and by delimiters when necessary.
    - Extracts metadata (H1...H6 and Bold) and maintains inherited context across chunks.
    - Avoids very small chunks and prioritizes keeping paragraphs/semantic blocks together.
    
- `src\Encamina.Enmarcha.AI\OpenAI\Abstractions\ModelInfo.cs`
  - Adds GPT-5 models:
    - GPT-5
    - GPT-5-mini
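Two of the helper responsibilities listed above, building a base64 data URL from a PDF and splitting a document into page groups, can be illustrated with a short sketch. It is written in Python for brevity and is not the C# code in `MistralAIHelper.cs`; the function names are hypothetical, and reading `SplitPageNumber` as "maximum pages per part" is an assumption, not something the commit message states.

```python
import base64


def build_pdf_data_url(pdf_bytes: bytes) -> str:
    # Standard data URL layout: data:<media type>;base64,<payload>
    encoded = base64.b64encode(pdf_bytes).decode("ascii")
    return f"data:application/pdf;base64,{encoded}"


def page_ranges(total_pages: int, pages_per_part: int) -> list[tuple[int, int]]:
    # Assumption: pages_per_part plays the role of SplitPageNumber.
    # Returns inclusive, 1-based (first_page, last_page) pairs.
    if pages_per_part < 1:
        raise ValueError("pages_per_part must be at least 1")
    return [
        (start, min(start + pages_per_part - 1, total_pages))
        for start in range(1, total_pages + 1, pages_per_part)
    ]
```

For a 10-page document split into parts of at most 4 pages, `page_ranges(10, 4)` yields `[(1, 4), (5, 8), (9, 10)]`; each part could then be encoded with `build_pdf_data_url` and sent in its own request.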
 
## Rules and transformations applied
- Preserve all textual content from the PDF (do not remove text); only correct/structure it into Markdown.
- Merge tables split across pages when they share identical headers or are direct continuations.
- Fix malformed tables, lists, and markdown; remove repeated footers/headers and HTML pagination comments.
- Correct common OCR errors (hyphenated/split words, extra spaces, stray characters).
- Do not generate automatic links or HTML entities; do not add new content that changes the original text.
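To make the table-merging rule concrete, here is a small deterministic sketch in Python. The commit message does not say how the rule is enforced (the refinement step is described as using a chat model), so this only illustrates the intended outcome for the "identical headers" case; the function names are hypothetical, and each input is assumed to be a well-formed Markdown table with a header row and a separator row.

```python
def parse_table(block: str):
    # Returns (header row, separator row, data rows) for a Markdown table.
    rows = [line.strip() for line in block.strip().splitlines() if line.strip()]
    return rows[0], rows[1], rows[2:]


def header_cells(row: str) -> list[str]:
    # Compare headers cell by cell, ignoring padding around each cell.
    return [cell.strip() for cell in row.strip().strip("|").split("|")]


def merge_if_same_header(first: str, second: str):
    # Merge two table fragments only when their header rows match;
    # return None otherwise so the caller can leave them separate.
    head1, sep1, rows1 = parse_table(first)
    head2, _, rows2 = parse_table(second)
    if header_cells(head1) != header_cells(head2):
        return None
    return "\n".join([head1, sep1, *rows1, *rows2])
```

The repeated header and separator of the second fragment are dropped, and its data rows are appended to the first table. Tables that are "direct continuations" without a repeated header are not covered by this sketch.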

v10.0.0-preview-06

Merge pull request #176 from Encamina/@mramos/update_Microsoft.Azure.…

…Cosmos

Update Microsoft.Azure.Cosmos to `3.49.0`

v10.0.0-preview-05

Merge pull request #174 from Encamina/@mramos/traceparent_agentsSdk

Implement telemetry correlation in M365 Agents SDK

v10.0.0-preview-04

Merge pull request #172 from Encamina/@mramos/activity_propagation_logs

Update `TelemetryInitializerMiddleware` and `TelemetryAgentIdInitializer` to work with Agents 365 SDK

v10.0.0-preview-03

Merge pull request #171 from LuisM000/@lmarcos/fix_conversation_state…

…_logger_middleware

Update conversation state access in middleware