MVP / Architectural proposal for exposing website content in an LLM-friendly way.
Think robots.txt for LLMs: clean .html.md mirrors of your pages + a simple llms.txt manifest.
Starts with Flask, designed to extend to FastAPI, Django, Node.js, etc.
This repository started as an individual MVP proposal. The idea is to transfer it to an IBM OSS org if it proves useful to the community.
LLM agents (ChatGPT, Claude, etc.) increasingly “read” the web. But raw HTML is noisy (ads, navs, interactivity). Sites need a standard, opt-in surface that’s clean, stable, and easy to crawl.
This project provides:
- Markdown mirrors of your HTML pages at /[.llms]/<route>.html.md
- A /llms.txt manifest to advertise what’s available (following the llmstxt.org standard)
- A simple middleware so you don’t touch your existing routes/templates
- A generator pipeline (HTML → JSON → Jinja → Markdown) you can customize with generic Jinja templates for any kind of view
- open_llms_txt.parsers.html.parse_html_to_json: Minimal, robust HTML → JSON extraction (title, h1, headings, paragraphs, links).
- open_llms_txt.generators.html_to_md.HtmlToMdGenerator: Jinja-based Markdown renderer (override templates as needed).
- open_llms_txt.middleware.flask
  - @html2md(...): exposes a Markdown mirror for a decorated route
  - @llmstxt(...): serves /llms.txt based on the decorated page + allow-list
- open_llms_txt.scrapers.local_scraper / web_scraper: Reference scrapers used in the CLI.
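To make the HTML → JSON → Markdown flow concrete, here is a minimal sketch that uses only the Python standard library. It is not the project's implementation: the real parser is built on BeautifulSoup and the real renderer on Jinja templates. The names PageExtractor, render_markdown and html_to_markdown are hypothetical, and the output layout is an assumption chosen for illustration.

```python
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Collect title, headings, paragraphs and links into a plain dict."""

    TEXT_TAGS = {"title", "h1", "h2", "h3", "p"}

    def __init__(self):
        super().__init__()
        self.page = {"title": "", "headings": [], "paragraphs": [], "links": []}
        self._tag = None    # text-bearing tag currently open, if any
        self._buffer = []   # text chunks collected for that tag

    def handle_starttag(self, tag, attrs):
        if tag in self.TEXT_TAGS:
            self._tag, self._buffer = tag, []
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.page["links"].append(href)

    def handle_data(self, data):
        # Text outside the tracked tags (nav, footer, scripts) is dropped.
        if self._tag:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        text = " ".join("".join(self._buffer).split())  # collapse whitespace
        if tag == "title":
            self.page["title"] = text
        elif tag == "p":
            self.page["paragraphs"].append(text)
        else:  # h1, h2 or h3
            self.page["headings"].append({"level": int(tag[1]), "text": text})
        self._tag = None


def render_markdown(page):
    """Stand-in for the Jinja step: title, then headings, paragraphs, links."""
    lines = [f"# {page['title']}", ""]
    for heading in page["headings"]:
        # Shift headings down one level so the page title stays the only H1.
        lines += [f"{'#' * (heading['level'] + 1)} {heading['text']}", ""]
    for paragraph in page["paragraphs"]:
        lines += [paragraph, ""]
    lines += [f"- {href}" for href in page["links"]]
    return "\n".join(lines).rstrip() + "\n"


def html_to_markdown(html):
    extractor = PageExtractor()
    extractor.feed(html)
    extractor.close()
    return render_markdown(extractor.page)
```

Like the normalized JSON described above, this sketch stores headings and paragraphs in separate lists, so the original interleaving of the page is not preserved.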
# requires Python 3.13
mise trust
mise install
mise run setup

Add the middleware:
# app.py
from flask import Flask, render_template
from open_llms_txt.middleware.flask import html2md, llmstxt
app = Flask(__name__)
@app.get("/")
@llmstxt(app, template_name="llms.txt.jinja")
def home():
    return render_template("home.html")  # Regular HTML page
@app.get("/docs")
@html2md(app, template_name="html_to_md.jinja")
def docs():
    return render_template("docs.html")

Run your app, then:
# LLM-friendly Markdown mirrors
curl http://localhost:5000/docs.html.md
# Manifest of available mirrors
curl http://localhost:5000/llms.txt

By default, templates live in src/open_llms_txt/templates:
- html_to_md.jinja → how a single page is rendered to Markdown
- llms.txt.jinja → the manifest body
You can override with your own template directory:
@html2md(app, template_dir="path/to/templates", template_name="html_to_md.jinja")
@llmstxt(app, template_dir="path/to/templates", template_name="llms.txt.jinja")

# Run the included demo app
mise run setup
uv run python examples/flask_site/complex_app.py
curl -s http://127.0.0.1:8000/pricing.html.md | sed -n '1,20p'
curl -s http://127.0.0.1:8000/llms.txt | sed -n '1,20p'

flowchart LR
A[Middleware route returns HTML] --> B[BeautifulSoup parse_html_to_json]
B --> C["Normalized JSON: title, h1, headings, paragraphs, links"]
C --> D[Jinja HtmlToMdGenerator]
D --> E[Markdown output]
E -->|served as| F["/[.llms]/<route>.html.md"]
G[Allow-list] --> H["/llms.txt"]
H -->|lists| F
- @html2md: exposes /[.llms]/<route>.html.md
- @llmstxt: exposes /llms.txt (based on a decorated source page + allow-list)
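As a rough illustration of how an allow-list of routes could turn into mirror URLs and a manifest body, the sketch below follows the llmstxt.org layout (H1 title, blockquote summary, H2 section with a link list). The functions mirror_path and render_manifest are hypothetical and not part of the package, and mapping the root route to index.html.md is an assumption.

```python
def mirror_path(route, prefix=""):
    """Map '/docs' to '/docs.html.md', optionally under a prefix such as '/.llms'."""
    name = route.strip("/") or "index"  # assumption: the root page mirrors as 'index'
    return f"{prefix.rstrip('/')}/{name}.html.md"


def render_manifest(title, summary, pages, base_url, prefix=""):
    """Build an llms.txt body; `pages` maps allow-listed routes to link labels."""
    lines = [f"# {title}", "", f"> {summary}", "", "## Pages", ""]
    for route, label in pages.items():
        url = f"{base_url.rstrip('/')}{mirror_path(route, prefix)}"
        lines.append(f"- [{label}]({url})")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_manifest(
        "Demo site",
        "LLM-friendly mirrors of the demo pages.",
        {"/docs": "Docs", "/pricing": "Pricing"},
        "http://localhost:5000",
    ))
```

In the actual middleware the manifest body comes from llms.txt.jinja, so treat the string building here as a placeholder for that template.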
- Only same-domain links are mirrored by the web scraper logic.
- Anchors and non-.html files are ignored for mirrors.
- Middleware uses module-level state; one Flask app per process recommended.
- Errors from source pages propagate as Markdown # 4xx/5xx bodies.
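The first two rules above can be sketched as a small link filter. This is an illustration, not the scraper's actual code: mirrorable_links is a hypothetical name, and reading "anchors" as links that point only at a fragment (such as #top) is an assumption.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def mirrorable_links(page_url, hrefs):
    """Return absolute same-domain .html links found on `page_url`, deduplicated."""
    base_host = urlparse(page_url).netloc
    seen, kept = set(), []
    for href in hrefs:
        if href.startswith("#"):
            continue  # pure in-page anchor (assumed meaning of "anchors")
        # Resolve relative links and drop any trailing '#fragment'.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parsed = urlparse(absolute)
        if parsed.netloc != base_host:
            continue  # other domain, or a scheme without a host (mailto:, etc.)
        if not parsed.path.endswith(".html"):
            continue  # images, stylesheets, extensionless routes
        if absolute not in seen:
            seen.add(absolute)
            kept.append(absolute)
    return kept
```

For example, on https://example.com/docs/index.html the hrefs `guide.html` and `/pricing.html#plans` are kept, while `#top`, `/logo.png` and `https://other.org/x.html` are skipped.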
This repo uses mise (tasks/tooling) and uv (package manager/builds).
# Install tools, deps (including dev) and git hooks
mise run setup
# Lint & format (Ruff), type-check (mypy)
mise run check
# Run tests
mise run tests

Git hooks run the same tasks (pre-commit → check, pre-push → tests).
CI uses the same tasks for 1:1 parity.
- Async tests via pytest-asyncio (asyncio_mode = "auto")
- Optional parallelism via pytest-xdist (use -n auto)
- Network blocked by default in tests (see tests/conftest.py)
This is an architecture, not just a Flask helper. Adapters can reuse the same core:
- Parsers (parse_html_to_json)
- Generators (HtmlToMdGenerator + your Jinja templates)
- A small framework adapter to:
  - expose Markdown mirrors (e.g., /[.llms]/<route>.html.md)
  - render /llms.txt from an index page
  - enforce an allow-list of mirrored routes
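To show what "a small framework adapter" might look like outside Flask, here is a hedged sketch written as plain WSGI middleware. MarkdownMirror is a hypothetical class, not something this package ships. The `convert` callable stands in for the parser + generator pair, and the sketch ignores the legacy WSGI `write()` callable and streaming responses.

```python
class MarkdownMirror:
    """WSGI wrapper: serve '<route>.html.md' by converting the HTML of '<route>'."""

    SUFFIX = ".html.md"
    MARKDOWN = ("Content-Type", "text/markdown; charset=utf-8")

    def __init__(self, app, convert, allowed_routes):
        self.app = app
        self.convert = convert              # callable: HTML str -> Markdown str
        self.allowed = set(allowed_routes)  # allow-list of mirrored routes

    def __call__(self, environ, start_response):
        path = environ.get("PATH_INFO", "")
        if not path.endswith(self.SUFFIX):
            return self.app(environ, start_response)  # regular request

        route = path[: -len(self.SUFFIX)]
        if route not in self.allowed:
            start_response("404 Not Found", [self.MARKDOWN])
            return [b"# 404 Not Found\n"]

        captured = {}

        def capture(status, headers, exc_info=None):
            captured["status"] = status  # remember the source page's status

        # Re-run the wrapped app against the source route and collect its HTML.
        inner_environ = dict(environ, PATH_INFO=route)
        html = b"".join(self.app(inner_environ, capture)).decode("utf-8")

        body = self.convert(html).encode("utf-8")
        start_response(
            captured["status"],
            [self.MARKDOWN, ("Content-Length", str(len(body)))],
        )
        return [body]
```

The source page's status line is passed through unchanged, so an error from the wrapped app still reaches the client as an error. A FastAPI or Django adapter would follow the same three responsibilities with that framework's own routing hooks.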
- FastAPI & Django adapters (decorators / routers / middlewares)
- Template presets for blogs, docs sites, API docs
- Node.js port (same contract, renderers in Nunjucks/MDX)
Contributions welcome! This is an MVP to shape a future llms.txt ecosystem.
- Fork & branch
- mise run setup
- mise run check && mise run tests
- PR 🚀
Please read CONTRIBUTING.md for details.
Apache-2.0. See LICENSE.
Is there a CLI? A CLI entry point exists but is experimental. For now, prefer the middleware.
Why Markdown, not JSON? Markdown is LLM-native and preserves readable structure. You can still add JSON endpoints using a custom JSON generator later if needed.
What about SEO? These mirrors live under a prefix (e.g., /.llms) to avoid confusing users and crawlers. You can additionally disallow them in robots.txt if desired.
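If you do want to keep general crawlers away from the mirrors, a rule like the one below can be checked with Python's standard urllib.robotparser before you deploy it. The /.llms/ prefix and the example.com URLs are assumptions for illustration; adjust them to the prefix your app actually uses.

```python
from urllib.robotparser import RobotFileParser

# Candidate robots.txt content: block every crawler from the mirror prefix only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /.llms/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

mirror_allowed = parser.can_fetch("*", "https://example.com/.llms/docs.html.md")
page_allowed = parser.can_fetch("*", "https://example.com/docs")

print(f"mirror crawlable: {mirror_allowed}, regular page crawlable: {page_allowed}")
```

Note that a blanket Disallow also applies to LLM crawlers that honor robots.txt, so this only makes sense if the mirrors are meant to be reached through /llms.txt by agents you explicitly point at it.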