V-Legal is being repointed toward a full local-first markdown port of Vietnamese law.
The target product is:
- a filesystem-backed legal corpus where every law document becomes structured markdown
- a rebuildable SQLite index for search, passages, citations, relations, and vectorless RAG
- official-source provenance from `vbpl.vn`, `vanban.chinhphu.vn`, and `phapdien.moj.gov.vn`
- a document-first reading experience inspired by the strengths of `thuvienphapluat.vn`
This direction is closer to the legalize-es model than to the previous focused-corpus prototype.
This repo currently contains a working FastAPI + SQLite prototype, but the architecture is in transition.
- the old focused economy/finance corpus direction is now legacy
- Appwrite is no longer the target data architecture
- the next major implementation step is a canonical markdown corpus under `data/corpus/`
- the Hugging Face dataset remains useful for bootstrap and backfill, but not as the final source of truth
As of the latest local build (data/corpus/manifests/corpus_summary.json):
- 153,926 canonical markdown documents under `data/corpus/markdown/` (153,620 issued, 306 Phap dien)
- coverage years 1925–2026
- derived SQLite index at `data/corpus/sqlite/vlegal.sqlite`
Recent search/reader correctness and performance work is tracked in docs/IMPROVEMENT_PLAN.md, including a one-time document_number_norm backfill migration (scripts/migrate_add_search_indexes.py).
```mermaid
flowchart LR
    U[Browser] --> A[FastAPI + Jinja App]
    A --> S[(SQLite Derived Index)]
    S --> M[Local Markdown Corpus]
    V[VBPL] --> R[Raw Source Snapshots]
    C[vanban.chinhphu.vn] --> R
    P[Phap dien] --> R
    H[HF Bootstrap Dataset] --> R
    R --> M
```
- markdown is the source of truth
- SQLite is derived and rebuildable
- raw source snapshots are preserved for provenance
- official sources outrank third-party snapshots
- deterministic parsing comes before AI-assisted repair
- vectorless RAG should work directly from markdown-derived indexes
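
The "markdown is the source of truth, SQLite is derived and rebuildable" principle can be illustrated with a minimal sketch. The frontmatter keys (`canonical_id`, `title`), the single `documents` table, and the helper names are assumptions for illustration only; they are not the schema in `docs/MARKDOWN_SCHEMA.md` or the logic of `rebuild_sqlite_from_markdown.py`.

```python
import sqlite3
from pathlib import Path


def parse_frontmatter(text: str) -> tuple[dict[str, str], str]:
    """Split a markdown document into flat `key: value` frontmatter and body."""
    if not text.startswith("---\n"):
        return {}, text
    header, sep, body = text[4:].partition("\n---\n")
    if not sep:
        return {}, text
    meta: dict[str, str] = {}
    for line in header.splitlines():
        key, colon, value = line.partition(":")
        if colon:
            meta[key.strip()] = value.strip().strip('"')
    return meta, body


def rebuild_index(corpus_root: Path, db_path: str) -> int:
    """Drop and recreate the derived table from every *.md file under corpus_root."""
    conn = sqlite3.connect(db_path)
    count = 0
    with conn:  # commit on success, roll back on error
        conn.execute("DROP TABLE IF EXISTS documents")
        conn.execute(
            "CREATE TABLE documents ("
            "canonical_id TEXT PRIMARY KEY, title TEXT, body TEXT)"
        )
        for path in sorted(corpus_root.rglob("*.md")):
            meta, body = parse_frontmatter(path.read_text(encoding="utf-8"))
            # Assumption: fall back to the file stem when no canonical_id is set.
            conn.execute(
                "INSERT OR REPLACE INTO documents VALUES (?, ?, ?)",
                (meta.get("canonical_id", path.stem), meta.get("title", ""), body.strip()),
            )
            count += 1
    conn.close()
    return count
```

Because the table is dropped and rebuilt on every run, deleting the SQLite file loses nothing that the markdown corpus does not already hold.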
- `vbpl.vn`: primary source for promulgated-document metadata, legal status, relationships, and HTML text when available
- `vanban.chinhphu.vn`: secondary freshness source for newly issued government texts and official PDFs
- `phapdien.moj.gov.vn`: primary source for codified topic and effective-law structure
- `th1nhng0/vietnamese-legal-documents`: bootstrap and backfill source only
- `VN-Law-Advisor`: reference implementation for crawler ideas only
- `thuvienphapluat.vn`: UI and workflow benchmark only, not a source dependency
| Document | Purpose |
|---|---|
| `docs/FULL_MARKDOWN_PORT_PLAN.md` | phase-by-phase plan for the full markdown port |
| `docs/CORPUS_ARCHITECTURE.md` | canonical local-first corpus architecture |
| `docs/MARKDOWN_SCHEMA.md` | frontmatter, path, and body rules for canonical markdown |
| `docs/SOURCE_UPDATE_WORKFLOW.md` | official-source update and reconciliation workflow |
| `docs/AI_PARSING_POLICY.md` | when AI parsing helps and when it should stay out of the critical path |
| `docs/PDF_OCR_STRATEGY.md` | recommended OCR stack and PDF-to-markdown structure mapping |
| `docs/LEGAL_READER_IMPLEMENTATION_PLAN.md` | reader and interaction plan |
| `docs/JOURNAL.md` | development log |
| `DESIGN.md` | visual design system |
| File | Responsibility |
|---|---|
| `app.py` | FastAPI routes, page assembly, startup initialization |
| `settings.py` | environment config via pydantic-settings |
| `db.py` | SQLite schema, FTS tables, and connection helpers |
| `search.py` | FTS search, document retrieval, and filtering |
| `structure.py` | legal text structure, display parsing, inline reference linking |
| `citations.py` | citation extraction and graph queries |
| `relations.py` | lifecycle and relationship graph |
| `compare.py` | side-by-side document comparison |
| `answering.py` | conservative grounded brief generation |
| `vectorless.py` | lexical retrieval profile for vectorless RAG |
| `provenance.py` | VBPL and VNCP provenance profiles |
| `tracking.py` | same-subject update alerts |
| `appwrite_client.py` | legacy integration path, no longer the target architecture |
These scripts reflect the current prototype state and will be superseded gradually by markdown-first exporters and rebuilders.
| Script | Current Purpose |
|---|---|
| `bootstrap_hf_dataset.py` | import corpus from Hugging Face into SQLite |
| `bootstrap_hf_focused_corpus.py` | legacy focused-corpus rebuild |
| `update_hf_monthly.py` | incremental update from Hugging Face snapshot |
| `export_hf_markdown_corpus.py` | export the HF bootstrap corpus into canonical markdown files |
| `import_phapdien_corpus.py` | import the official Phap dien ebook bundle into codified markdown files |
| `backfill_vbpl_references.py` | harvest official VBPL links from Phap dien and backfill issued markdown provenance |
| `import_vncp_listing.py` | import Government Portal legal-document listing rows into provisional issued markdown |
| `build_pdf_ocr_queue.py` | build an OCR work queue for PDF-backed corpus records |
| `preview_pdf_ocr_mapping.py` | preview how OCR text maps into the legal markdown structure |
| `apply_pdf_ocr_text.py` | apply OCR text back into a canonical markdown document |
| `process_pdf_attachments.py` | download official PDF attachments, extract text, and map them back into canonical markdown |
| `apply_minimax_ocr_correction.py` | run MiniMax text-only post-correction on OCR text for low-confidence scanned laws |
| `rebuild_sqlite_from_markdown.py` | rebuild SQLite indexes from canonical markdown |
| `migrate_add_search_indexes.py` | backfill `document_number_norm` and search indexes on an existing SQLite DB |
| `migrate_add_validity_columns.py` | backfill `status`/`effective_date`/`expiry_date` from markdown frontmatter on an existing SQLite DB |
| `snapshot_official_source.py` | fetch and store raw snapshots from VBPL, vanban.chinhphu, or Phap dien |
| `bootstrap_phapdien_taxonomy.py` | seed taxonomy subjects |
| `bootstrap_relationship_graph.py` | build lifecycle relations |
| `bootstrap_citation_index.py` | build citation links |
| `prepare_demo_bundle.py` | prepare preview bundle |
| `repair_document_dates.py` | repair inconsistent dates |
- Python 3.12+
- `uv`
```bash
uv sync
uv run uvicorn vlegal_prototype.app:app --reload --app-dir src
```

Open http://127.0.0.1:8000.

```bash
uv run python -m compileall src scripts
curl http://127.0.0.1:8000/health
```

The following commands cover both the new markdown-first workflow and the older SQLite-first prototype.
```bash
# Export canonical markdown files under data/corpus/
uv run python scripts/export_hf_markdown_corpus.py --limit 500 --reset

# Rebuild SQLite from the canonical markdown corpus
# (`set` is Windows cmd syntax; in a POSIX shell use `export VLEGAL_DATABASE_PATH=...`)
set VLEGAL_DATABASE_PATH=data/corpus/sqlite/vlegal.sqlite
uv run python scripts/rebuild_sqlite_from_markdown.py --reset
```

```bash
uv run python scripts/snapshot_official_source.py \
  --source-system vbpl \
  --source-document-id 213327 \
  --url https://vbpl.vn/pages/portal.aspx
```

```bash
# Import one de muc (topic) for testing
uv run python scripts/import_phapdien_corpus.py \
  --limit 1 \
  --topic-id 0cf69ad9-6f29-4ee4-8e16-2c3aa65c3a52

# Import the full ebook bundle into data/corpus/markdown/phapdien/
uv run python scripts/import_phapdien_corpus.py
```

```bash
uv run python scripts/backfill_vbpl_references.py
```

This scans imported Phap dien topics for embedded official `vbpl.vn` links, builds a local VBPL reference index, and writes `official_source_*` fields into matching issued markdown documents when the document number matches exactly.
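
As a rough illustration of matching by document number, the sketch below builds a lookup keyed on a normalized number and resolves a document against it. The normalization rules, the `document_number` field name, and both helper functions are assumptions made for this example (loosely in the spirit of the `document_number_norm` column); the actual script may compare numbers differently.

```python
import unicodedata
from typing import Optional


def normalize_document_number(raw: str) -> str:
    """Uppercase, strip diacritics, and remove whitespace from a document number."""
    # 'Đ' has no Unicode decomposition, so it is mapped to 'D' explicitly.
    text = raw.strip().upper().replace("Đ", "D")
    decomposed = unicodedata.normalize("NFD", text)
    no_marks = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return "".join(no_marks.split())


def build_reference_index(references: list[dict[str, str]]) -> dict[str, dict[str, str]]:
    """Key each harvested reference by its normalized document number."""
    return {normalize_document_number(ref["document_number"]): ref for ref in references}


def match_reference(
    document_number: str, index: dict[str, dict[str, str]]
) -> Optional[dict[str, str]]:
    """Return the reference whose normalized number equals the document's, if any."""
    return index.get(normalize_document_number(document_number))
```

A lookup such as `match_reference("13/2023/nđ-cp", index)` then finds an entry harvested as `13/2023/NĐ-CP`, while a number that differs in any digit or segment returns `None`.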
```bash
# Import the first law/phap lenh row from the Government Portal listing
uv run python scripts/import_vncp_listing.py --typegroupid 3 --limit 1

# Import the first 10 Government Portal decrees
uv run python scripts/import_vncp_listing.py --typegroupid 4 --limit 10
```

This uses the server-rendered `he-thong-van-ban` listing pages, fetches each detail page, emits provisional markdown records, and backfills `official_source_*` onto matching issued markdown files when no better source is already attached.
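
The "no better source is already attached" rule can be pictured as a simple precedence check. This is a sketch under stated assumptions: the source keys (`vbpl`, `vncp`, `hf`), their numeric ranks, and the `should_attach` helper are hypothetical and only mirror the source roles described in this README.

```python
from typing import Optional

# Lower number = more authoritative. Keys and ranks are illustrative assumptions.
SOURCE_RANK = {"vbpl": 0, "vncp": 1, "hf": 2}


def should_attach(current: Optional[str], candidate: str) -> bool:
    """Return True if `candidate` should replace the currently attached source."""
    if candidate not in SOURCE_RANK:
        raise ValueError(f"unknown source system: {candidate}")
    if current is None:
        return True  # nothing attached yet
    # Unknown existing sources are treated as lowest priority.
    current_rank = SOURCE_RANK.get(current, len(SOURCE_RANK))
    return SOURCE_RANK[candidate] < current_rank
```

Under these assumed ranks, a Government Portal record would fill an empty slot or replace a Hugging Face origin, but would leave an existing VBPL attachment untouched.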
```bash
# Discover which canonical documents still need PDF OCR
uv run python scripts/build_pdf_ocr_queue.py

# Preview how OCR text maps into the markdown schema
uv run python scripts/preview_pdf_ocr_mapping.py --text-file data/sample_ocr.txt --title "Luật mẫu"

# Apply OCR text back into a canonical markdown document
uv run python scripts/apply_pdf_ocr_text.py \
  --canonical-id vn-issued-vpcp-216529 \
  --text-file data/ocr_output.txt \
  --backend paddleocr_ppstructure_plus_vietocr

# Run the hybrid PDF processor directly
uv run python scripts/process_pdf_attachments.py --canonical-id vn-issued-vpcp-216529

# Run MiniMax text-only OCR post-correction after PaddleOCR
uv run python scripts/apply_minimax_ocr_correction.py --canonical-id vn-issued-vpcp-216529
```

The main app environment stays on Python 3.14, but the PaddleOCR execution path runs in a dedicated Python 3.13 environment at `.venv-ocr/` because `paddlepaddle` does not currently ship wheels for cp314 on Windows.
MiniMax does not need image support in this architecture. Images and PDFs are handled by PaddleOCR first, then MiniMax is used only on extracted text for post-correction when the OCR sidecar flags a document as optional or recommended for LLM help.
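
The gating described above (OCR first, LLM post-correction only when the sidecar asks for it) might look roughly like the following. The `OcrSidecar` shape, its `llm_help` field, and the value `"none"` are assumptions for illustration; only the `optional` and `recommended` flags come from the description above.

```python
from dataclasses import dataclass


@dataclass
class OcrSidecar:
    """Hypothetical sidecar record; field names are assumptions."""

    canonical_id: str
    llm_help: str  # assumed values: "none", "optional", "recommended"
    text: str      # text already extracted by the OCR step


def needs_llm_correction(sidecar: OcrSidecar, include_optional: bool = True) -> bool:
    """Decide whether extracted text should go to text-only post-correction."""
    if sidecar.llm_help == "recommended":
        return True
    return include_optional and sidecar.llm_help == "optional"


def select_for_correction(
    sidecars: list[OcrSidecar], include_optional: bool = True
) -> list[str]:
    """Return canonical ids with non-empty OCR text that are flagged for LLM help."""
    return [
        s.canonical_id
        for s in sidecars
        if s.text.strip() and needs_llm_correction(s, include_optional)
    ]
```

Only text reaches the correction step in this sketch, which matches the point that the LLM never needs to see images or PDFs.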
```bash
uv run python scripts/bootstrap_hf_dataset.py --limit 500 --reset
uv run python scripts/bootstrap_phapdien_taxonomy.py --seed-only
```

```bash
# (`set` is Windows cmd syntax; in a POSIX shell use `export VLEGAL_DATABASE_PATH=...`)
set VLEGAL_DATABASE_PATH=data/full_hf.sqlite
uv run python scripts/bootstrap_hf_dataset.py
uv run python scripts/bootstrap_relationship_graph.py
uv run python scripts/bootstrap_citation_index.py
```

```bash
uv run python scripts/update_hf_monthly.py --mode full
```

These commands are transitional. The target pipeline is markdown-first and is documented in `docs/FULL_MARKDOWN_PORT_PLAN.md`.
| Variable | Default | Description |
|---|---|---|
| `VLEGAL_DATABASE_PATH` | `data/vlegal.sqlite` | SQLite index location |
| `VLEGAL_CORPUS_ROOT` | `data/corpus` | local markdown corpus root |
| `VLEGAL_PHAPDIEN_EBOOK_URL` | official Phap dien zip URL | official ebook snapshot URL |
| `VLEGAL_CORS_ALLOWED_ORIGINS` | `http://localhost:3000,http://127.0.0.1:3000` | CORS origins |
| `VLEGAL_SEARCH_PAGE_SIZE` | `12` | results per page |
| `VLEGAL_ANSWER_PASSAGE_LIMIT` | `6` | passages used for grounded briefs |
| `VLEGAL_ENABLE_AI_SCHEMA_SIDECAR` | `false` | enable optional AI schema reinforcement sidecars |
| `VLEGAL_AI_SCHEMA_PROVIDER` | `off` | AI sidecar provider, currently `off`, `openai_compatible`, or `minimax` |
| `VLEGAL_AI_SCHEMA_MODEL` | - | model name for optional AI sidecar generation |
| `VLEGAL_AI_SCHEMA_ENDPOINT` | - | chat completions endpoint for optional AI sidecar generation |
| `VLEGAL_AI_SCHEMA_API_KEY` | - | API key for optional AI sidecar generation |
Legacy transitional settings still exist in code for Appwrite, but Appwrite is not part of the target architecture.
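
To make the variable-to-default mapping concrete, here is a standard-library stand-in for how such settings could be read. The project itself uses pydantic-settings in `settings.py`; the `Settings` dataclass, its field names, the comma-splitting of CORS origins, and the accepted boolean spellings below are assumptions for illustration, covering only a subset of the variables.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Illustrative subset of the VLEGAL_* settings with their documented defaults."""

    database_path: str = "data/vlegal.sqlite"
    corpus_root: str = "data/corpus"
    cors_allowed_origins: tuple[str, ...] = (
        "http://localhost:3000",
        "http://127.0.0.1:3000",
    )
    search_page_size: int = 12
    answer_passage_limit: int = 6
    enable_ai_schema_sidecar: bool = False


def _as_bool(value: str) -> bool:
    # Assumed truthy spellings; the real parser may accept a different set.
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from an environment mapping, falling back to defaults."""
    env = os.environ if env is None else env
    d = Settings()
    origins = env.get("VLEGAL_CORS_ALLOWED_ORIGINS")
    sidecar = env.get("VLEGAL_ENABLE_AI_SCHEMA_SIDECAR")
    return Settings(
        database_path=env.get("VLEGAL_DATABASE_PATH", d.database_path),
        corpus_root=env.get("VLEGAL_CORPUS_ROOT", d.corpus_root),
        cors_allowed_origins=(
            tuple(o.strip() for o in origins.split(",") if o.strip())
            if origins is not None
            else d.cors_allowed_origins
        ),
        search_page_size=int(env.get("VLEGAL_SEARCH_PAGE_SIZE", d.search_page_size)),
        answer_passage_limit=int(
            env.get("VLEGAL_ANSWER_PASSAGE_LIMIT", d.answer_passage_limit)
        ),
        enable_ai_schema_sidecar=(
            _as_bool(sidecar) if sidecar is not None else d.enable_ai_schema_sidecar
        ),
    )
```

Passing an explicit mapping, for example `load_settings({"VLEGAL_SEARCH_PAGE_SIZE": "20"})`, keeps the sketch easy to exercise without touching the real process environment.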
- Never commit secrets. Real credentials (API keys, tokens, SSH connection strings, passwords) must live only in a local `.env` file, which is gitignored. Copy `.env.example` to `.env` and fill in real values locally.
- `.env.example` must contain placeholders only (e.g. `your_api_key`); it is the only env file tracked in git.
- All secrets are loaded from environment variables via `src/vlegal_prototype/settings.py`. Do not hardcode keys in source, scripts, or docs.
- If a secret is ever committed, treat it as compromised: rotate it immediately at the provider (Appwrite, Netlify, MiniMax, OCI, etc.) and then purge it from git history (`git filter-repo` / BFG) before force-pushing.
- Before committing, scan staged changes for accidental secrets (e.g. `git diff --cached` plus a grep for `api_key|token|secret|password`).
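
The pre-commit scan can also be scripted. The following is a minimal sketch, not an existing project tool: `find_suspicious_lines` is a hypothetical helper that applies the same keyword pattern to the added lines of a unified diff.

```python
import re

# Same keywords as the grep suggestion above, matched case-insensitively.
SECRET_PATTERN = re.compile(r"api_key|token|secret|password", re.IGNORECASE)


def find_suspicious_lines(diff_text: str) -> list[str]:
    """Return added lines of a unified diff that mention a secret-like keyword."""
    hits: list[str] = []
    for line in diff_text.splitlines():
        # Only inspect added lines; skip the '+++ b/<file>' header.
        if line.startswith("+") and not line.startswith("+++"):
            if SECRET_PATTERN.search(line):
                hits.append(line[1:].strip())
    return hits
```

The diff text could come from `subprocess.run(["git", "diff", "--cached"], capture_output=True, text=True).stdout`. This is a keyword match only, so placeholders such as `your_api_key` are reported too and every hit still needs a human look.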
- `/`
- `/tracking`
- `/documents/{id}`
- `/compare/{left_id}/{right_id}`
- `/health`
This is still a prototype and is not an authoritative legal-status engine.
Current limits:
- the current import path is still SQLite-first
- Hugging Face remains a bootstrap source, not final source of truth
- citation and lifecycle graphs depend on local corpus quality
- official-source reconciliation is not fully implemented yet
- grounded briefs remain retrieval-based summaries that require official verification
- `docs/FOCUSED_CORPUS_SCOPE.md` is now a legacy direction note
- `docs/DEPLOYMENT_NETLIFY_APPWRITE.md` is an archived deployment path
- remaining Appwrite code should be treated as transitional, not strategic
- Design direction: `DESIGN.md`
- Agent guidance: `AGENTS.md`
- Development journal: `docs/JOURNAL.md`
- Memory bank: `.agents/memory-bank/`