rvcomplete

In-browser autocomplete for the Respect/Validation fluent PHP API (for example, v::numericVal()->positive()->between(1, 255)), powered by a small fine-tuned fill-in-the-middle (FIM) model that runs entirely client-side via transformers.js + ONNX Runtime Web (WebGPU, with a WASM fallback). This repo is the data → training → export pipeline that produces it.

Live demo: https://alganet.github.io/rvcomplete/ — runs entirely in your browser (first load downloads the ~424 MB model from the Hugging Face Hub).

Goal

A ~0.5B FIM code model specialized on one library's surface — its ~150 validators, their argument shapes, and chaining idioms — that completes the next chain node from the cursor. Not a general assistant: a fast, always-valid completer for v:: chains, suitable for docs and live demos.

Design

The decisions that shape everything:

  • Ranking backbone, not free generation. After v:: or -> the only legal next identifier is a real validator, enumerated in a canonical vocab (data/symbols/suggest_vocab.json). So the model never invents a name; its job is to rank the valid candidates. One forward pass scores them by first-token logit (scripts/constrain/constrain.py, ported to JS in demo/index.html). This removes hallucinated names by construction and makes inference a single forward pass instead of an autoregressive loop.

  • Base: Qwen2.5-Coder-0.5B. FIM-native, decent PHP from pretraining, small enough to QLoRA-train on a 6 GB GPU and ship to a browser. Train and infer both use the Qwen sentinel layout: <|fim_prefix|>{before}<|fim_suffix|>{after}<|fim_middle|>{completion}.

  • Execution-verified, programmatic data (no LLM). A sandboxed harness (php -l + symbol check + execution against the pinned 3.1 library) is the gate for every synthetic chain. Sources:

    • real — chains extracted from the library's docs and tests (the anchor).
    • idiomatic — developer-knowledge schemas (field → validator, nullable, collections); the realistic "ranking" half of the corpus.
    • combinatorial — enumerate the valid surface, keep a chain only if it constructs and discriminates (≥1 probe input passes, ≥1 fails), so type-match emerges from runtime behavior rather than hardcoded rules.
    • provider miner — real labeled inputs mined from the test suite, for checksum/format rules (CPF/IBAN/date) that need genuinely valid values.
  • Training: QLoRA on a cheap consumer GPU. 4-bit NF4 (works on CUDA compute capability 6.1; LLM.int8() does not, so fp16 LoRA is the only fallback), seq-len 512, batch 1 + gradient accumulation, gradient checkpointing, paged_adamw_8bit. Slow, but it fits. Eval holds out real chains only, so it measures generalization rather than memorization of the generators' style. A GTX 1060 (6 GB, Pascal) was used for testing.

  • Export & serving. Merge LoRA → ONNX via Optimum (fp16 on GPU) → int4 MatMulNBits (q4f16). A size pass (scripts/export/shrink_embedding.py) unties the lm_head and quantizes it to int4, and quantizes the input embedding to int8 (~424 MB total). Served by transformers.js with WebGPU as the primary backend and WASM as the fallback; details are in demo/README.md.
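
The ranking idea from the first bullet can be sketched in a few lines. This is only an illustration, not the code in scripts/constrain/constrain.py: the function name rank_candidates, the toy logits, and the token ids are hypothetical, and a real run would take the logits from the model's single forward pass at the position right after <|fim_middle|>.

```python
from typing import Dict, List, Sequence, Tuple


def rank_candidates(
    next_token_logits: Sequence[float],
    first_token_of: Dict[str, int],
) -> List[Tuple[str, float]]:
    """Rank legal validator names by the logit of their first token.

    next_token_logits: logits over the tokenizer vocabulary for the position
        right after the FIM prompt (one forward pass, no generation loop).
    first_token_of: hypothetical map from each legal validator name to the
        id of the first token of that name.
    """
    scored = [(name, next_token_logits[tok]) for name, tok in first_token_of.items()]
    # Highest logit first; ties are broken alphabetically (a placeholder rule).
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Toy example: a 6-token vocabulary and three legal candidates.
toy_logits = [0.1, 2.5, -1.0, 0.7, 3.2, 0.0]
toy_vocab = {"between": 1, "positive": 4, "email": 3}
ranking = rank_candidates(toy_logits, toy_vocab)
print([name for name, _ in ranking])  # ['positive', 'between', 'email']
```

Only names present in the vocab can appear in the output, which is what rules out invented validators. Candidates that share a first token would tie under this scheme; how the project resolves such ties is not covered here, and the alphabetical tie-break above is an assumption of the sketch.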

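The Qwen sentinel layout from the second bullet can be shown with a small helper. This is a minimal sketch, assuming one FIM example is made by cutting a span out of a chain; the function build_fim_example and the "prompt"/"completion" keys are hypothetical, and how scripts/fim/build_fim.php actually chooses its split points is not described here.

```python
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"


def build_fim_example(source: str, start: int, end: int) -> dict:
    """Cut source[start:end] out as the completion and wrap the rest in sentinels."""
    if not 0 <= start <= end <= len(source):
        raise ValueError("span must lie inside the source")
    before, completion, after = source[:start], source[start:end], source[end:]
    prompt = f"{FIM_PREFIX}{before}{FIM_SUFFIX}{after}{FIM_MIDDLE}"
    return {"prompt": prompt, "completion": completion}


chain = "v::numericVal()->positive()->between(1, 255)"
start = chain.index("positive")
example = build_fim_example(chain, start, start + len("positive"))
print(example["prompt"])
# <|fim_prefix|>v::numericVal()-><|fim_suffix|>()->between(1, 255)<|fim_middle|>
print(example["completion"])
# positive
```

At inference time the same prompt is built from the text before and after the cursor, and the model is asked what follows <|fim_middle|>.
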
Pipeline (reproducible)

The data side is PHP 8.5 (independent of the Python/ML env); the ML side uses the .venv. Run from the repo root.

# 0. Pinned library — ground truth for the symbol table AND the execution gate
git clone --branch 3.1 --depth 1 https://github.com/Respect/Validation.git data/raw/Validation
(cd data/raw/Validation && composer install)

# 1. Symbol table + canonical decode (ranking) vocab
php scripts/extract/extract_symbols.php
php scripts/constrain/build_vocab.php

# 2. Real chains → FIM examples
php scripts/extract/extract_real.php
php scripts/fim/build_fim.php --in data/raw/chains.jsonl --out data/fim/real.jsonl --origin real

# 3. Programmatic generation (NO LLM) → verify → FIM
php scripts/generate/idiomatic_chains.php
php scripts/fim/build_fim.php --in data/generated/idiomatic.jsonl --out data/fim/idiomatic.jsonl --origin synthetic
php scripts/generate/combine_chains.php
php scripts/fim/build_fim.php --in data/generated/combined.jsonl --out data/fim/combined.jsonl --origin synthetic
php scripts/extract/extract_providers.php
php scripts/harness/verify.php --in data/generated/providers_raw.jsonl \
    --out data/generated/providers_verified.jsonl --rejected data/generated/providers_rejected.jsonl
php scripts/fim/build_fim.php --in data/generated/providers_verified.jsonl --out data/fim/provider.jsonl --origin synthetic

# 4. Coverage report + assemble (dedup, real-only eval hold-out)
php scripts/coverage/track_coverage.php --floor=40
.venv/bin/python scripts/train/assemble_dataset.py

# 5. Train → export → shrink
.venv/bin/python scripts/train/train_qlora.py
.venv/bin/python scripts/export/export_onnx.py --adapter runs/<run>/best --out demo/model
.venv/bin/python scripts/export/shrink_embedding.py \
    --src demo/model/onnx/model_q4f16.onnx --out demo/model/onnx/model_q4f16.onnx

# 6. Evaluate (curated developer-moment scenarios, ranking on)
.venv/bin/python scripts/eval/eval_curated.py
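
The "constructs and discriminates" filter applied to combinatorial chains in step 3 can be illustrated as follows. The project runs this gate in PHP against the pinned library; the Python below is only a language-neutral sketch, and the function discriminates, the stand-in validators, and the probe inputs are all hypothetical.

```python
from typing import Any, Callable, Iterable


def discriminates(validate: Callable[[Any], bool], probes: Iterable[Any]) -> bool:
    """Return True if at least one probe passes and at least one fails."""
    saw_pass = False
    saw_fail = False
    for value in probes:
        try:
            ok = bool(validate(value))
        except Exception:
            # Assumption of this sketch: a probe that makes the chain
            # throw is counted as a failing input.
            ok = False
        saw_pass = saw_pass or ok
        saw_fail = saw_fail or not ok
        if saw_pass and saw_fail:
            return True  # both outcomes observed, no need to keep probing
    return False


def is_positive_int(value: Any) -> bool:
    """Stand-in for a chain that accepts only positive integers."""
    return isinstance(value, int) and value > 0


def accepts_anything(value: Any) -> bool:
    """Stand-in for a chain that never rejects, so it carries no signal."""
    return True


probes = [5, -3, "abc", None]
keep_a = discriminates(is_positive_int, probes)    # True: 5 passes, -3 fails
keep_b = discriminates(accepts_anything, probes)   # False: nothing fails
```

A chain that accepts or rejects every probe is dropped, which is how type mismatches get filtered out without hardcoded rules.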

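The "dedup, real-only eval hold-out" part of step 4 can be sketched as below. This is not the logic of scripts/train/assemble_dataset.py, which is not shown here; the record fields "text" and "origin", the eval_fraction default, and the rule of preferring the real copy of a duplicate are assumptions made for illustration.

```python
import random
from typing import Dict, List, Tuple

Record = Dict[str, str]


def assemble(records: List[Record], eval_fraction: float = 0.2,
             seed: int = 0) -> Tuple[List[Record], List[Record]]:
    """Deduplicate by text, then hold out a share of the real examples for eval."""
    by_text: Dict[str, Record] = {}
    for rec in records:
        kept = by_text.get(rec["text"])
        # Keep one record per text, preferring a real one over a synthetic one.
        if kept is None or (kept["origin"] != "real" and rec["origin"] == "real"):
            by_text[rec["text"]] = rec
    unique = list(by_text.values())

    real = [r for r in unique if r["origin"] == "real"]
    random.Random(seed).shuffle(real)  # seeded, so the split is reproducible
    n_eval = max(1, int(len(real) * eval_fraction)) if real else 0
    eval_set = real[:n_eval]

    # Nothing held out for eval may remain in the training set.
    eval_texts = {r["text"] for r in eval_set}
    train_set = [r for r in unique if r["text"] not in eval_texts]
    return train_set, eval_set


records = [
    {"text": "a", "origin": "real"},
    {"text": "b", "origin": "synthetic"},
    {"text": "a", "origin": "synthetic"},  # duplicate of the real "a"
    {"text": "c", "origin": "real"},
    {"text": "d", "origin": "real"},
]
train, held_out = assemble(records, eval_fraction=0.34)
```

Because the eval set is drawn only from real chains, a good score cannot come from memorizing the synthetic generators' output.
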
Publishing

The demo is a single static page (demo/index.html); the model weights live on the Hugging Face Hub, not in this repo (the 424 MB bundle is gitignored and exceeds GitHub's file limit). On localhost the page loads the local demo/model5/; published, it loads from the Hub.

# Upload (or re-upload) the exported bundle to the Hub, preserving the onnx/ layout.
.venv/bin/hf auth login                                  # once
.venv/bin/hf upload alganet/rvcomplete demo/model5 . --repo-type model

GitHub Pages serves only the page, via .github/workflows/pages.yml: every push to main that touches demo/ uploads demo/ as the Pages artifact (weights excluded by gitignore) and deploys it. One-time setup: Settings → Pages → Source = GitHub Actions. After a re-export, just hf upload again — the live site picks up the new weights on next load; no redeploy needed unless the page itself changed.
