Changelog
All notable changes to LongParser are documented here.
This project follows Semantic Versioning and Keep a Changelog.
[0.1.5] - 2026-05-05
Added
- Semantic chunking: all-MiniLM-L6-v2 embedding-based boundary detection in HybridChunker (optional via use_semantic_chunking).
- Cross-reference resolution: linear-time, $O(N)$ resolution of explicit ("Figure 3") and implicit ("the table below") references via spatial proximity.
- Summary chunks: asynchronous ARQ background worker (enrich_summaries_job) that auto-generates LLM section summaries for hierarchical RAG retrieval.
- Chunk quality scorer: zero-ML, heuristic-based chunk scoring using block token confidences, dictionary word coverage (/usr/share/dict/words), and fastText Lang-ID validation.
- PII redaction: hybrid approach using fast regex + Luhn checks (emails, phone numbers, SSNs, credit card numbers, IP addresses) and optional spaCy NER (en_core_web_sm) for names, organizations, and locations. Original values are preserved in secure block metadata for HITL review.
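The regex + Luhn step of PII redaction can be illustrated with a minimal sketch for credit card numbers. This is not LongParser's implementation: the names CARD_PATTERN, luhn_valid, and redact_cards, the placeholder string, and the exact pattern are all assumptions made for illustration. The idea is that a regex finds digit runs that look like card numbers, and the Luhn checksum filters out runs that cannot be real cards.

```python
import re

# Hypothetical pattern (an assumption, not LongParser's): 13 to 19 digits,
# optionally separated by single spaces or hyphens.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 13:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def redact_cards(text: str, placeholder: str = "[REDACTED_CC]") -> tuple[str, list[str]]:
    """Replace Luhn-valid card candidates and return the originals separately."""
    found: list[str] = []

    def _replace(match: re.Match) -> str:
        candidate = match.group(0)
        if luhn_valid(candidate):
            found.append(candidate)
            return placeholder
        # Digit runs that fail the checksum are left untouched.
        return candidate

    return CARD_PATTERN.sub(_replace, text), found
```

Returning the matched values alongside the redacted text mirrors the entry's note that originals are kept for HITL review. Where and how they are stored securely is outside this sketch.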
Changed
- Bumped marker-pdf version support in dependencies.
- Added ner optional dependency group (spacy>=3.7.0) in pyproject.toml.
- Expanded ChunkingConfig and ProcessingConfig with new semantic, summary, and PII toggle options.
- Marked Phase 1 as officially complete in Roadmap.
[0.1.3] - 2026-04-13
Fixed
- Source code: Added DocumentPipeline as a public alias for PipelineOrchestrator; docs, quickstart, and all examples now use this name consistently
- Documentation: Fixed wrong coverage path long_parser → longparser in CONTRIBUTING.md
- Documentation: Replaced stale cleanrag-api reference in Docker deployment docs
- Documentation: Standardized Gemini API key env var to GOOGLE_API_KEY across all docs
- Source code: Updated default LLM model fallback from gpt-4o to gpt-5.3 in schemas.py, llm_chain.py, and engine.py
- Source code: Renamed stale "cleanrag:" Redis key prefix to "longparser:" in embeddings
Changed
- Python 3.13 added to CI matrix, badges, and installation docs
- SECURITY.md updated with Redis rate-limiting and CORS threat mitigations
[0.1.2] - 2026-04-05
Changed
- Project logo added to documentation site, README, and PyPI page
- Documentation site header updated — logo replaces text title
- Installation guide restructured for clarity
[0.1.1] - 2026-04-04
Added
- CPU / GPU install separation: dedicated [cpu] and [gpu] meta-extras for clean one-command installs
- faiss-gpu extra (faiss-gpu>=1.7) as a distinct option from faiss-cpu
- Granular torch-based extras: embeddings-cpu, embeddings-gpu, latex-ocr-cpu, latex-ocr-gpu for fine-grained dependency control
Fixed
- Package metadata: license field updated to SPDX expression format per PEP 639
- Documentation site build reliability improvements
Changed
- [gpu] is now the recommended default install: one command, works on both GPU and CPU machines
- [cpu] documented as the advanced path for size-constrained environments (Docker, edge, CI)
- [all] now resolves to [cpu] as a safe, dependency-minimal default
[0.1.0] - 2026-04-04
🎉 Initial Public Release
LongParser is the open-source document intelligence engine built by ENDEVSOLS for production RAG pipelines.
Added
- 5-stage extraction pipeline: Extract → Validate → HITL Review → Chunk → Embed → Index
- Multi-format extraction: PDF, DOCX, PPTX, XLSX, CSV via Docling
- HybridChunker: token-aware, heading-hierarchy-aware, table-aware chunking
- Human-in-the-Loop (HITL) review: approve / edit / reject blocks and chunks via LangGraph interrupt() before embedding
- 3-layer memory chat: short-term turns + rolling summary + long-term facts, powered by LCEL chains
- Multi-provider LLM support: OpenAI (gpt-4o), Gemini (gemini-2.0-flash), Groq (llama-3.3-70b-versatile), OpenRouter
- Multi-backend vector stores: Chroma, FAISS, Qdrant
- Async-first REST API: FastAPI + Motor (MongoDB) + ARQ (Redis job queue)
- LongParserRetriever: drop-in LangChain BaseRetriever adapter
- LongParserLoader: LangChain document loader integration
- LongParserReader: LlamaIndex BaseReader integration
- LongParserCallbackHandler: observability callbacks for LangChain chains
- Built-in citation validation: chunk IDs verified against retrieved set before any answer is returned
- Privacy-first: all processing runs locally; no data leaves your infrastructure
- py.typed marker: full PEP 561 typing support
- Unit test suite: test_schemas.py (22 passing), test_llm_chain.py, test_chat_utils.py
- GitHub Actions CI: lint (ruff), tests across Python 3.10 / 3.11 / 3.12, coverage reporting
- GitHub Actions publish: PyPI trusted publishing triggered on GitHub releases
- pyproject.toml with server, langchain, llamaindex, embeddings, chroma, faiss, qdrant optional extras
- Dockerfile and docker-compose.yml for one-command local deployment
- CONTRIBUTING.md, SECURITY.md, .env.example: full OSS scaffolding
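The citation validation item in the list above describes checking cited chunk IDs against the retrieved set before an answer is returned. A minimal sketch of that check, assuming chunk IDs are plain strings, could look like the following. The function name validate_citations and its return shape are hypothetical and are not taken from the LongParser API.

```python
from typing import Iterable


def validate_citations(
    cited_ids: Iterable[str], retrieved_ids: Iterable[str]
) -> tuple[bool, list[str]]:
    """Check that every cited chunk ID was actually retrieved.

    Returns (is_valid, unknown_ids), where unknown_ids lists cited IDs
    that are missing from the retrieved set, in citation order.
    """
    retrieved = set(retrieved_ids)
    unknown = [chunk_id for chunk_id in cited_ids if chunk_id not in retrieved]
    return (not unknown, unknown)


# Usage: an answer citing a chunk that was never retrieved fails validation.
ok, missing = validate_citations(["c1", "c9"], ["c1", "c2", "c3"])
```

A caller could refuse to return the answer, or strip the unsupported citations, whenever the first element is False. Which policy LongParser applies is not stated in this changelog.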