A local diagnostic tool that turns a journal article PDF/DOCX into a Research Nexus completeness scorecard plus a Crossref-ready DOI submission XML. It runs the document through layout analysis, deterministic extraction, free identifier-registry lookups, and opt-in AI enrichment, and shows you in one screen exactly which integrity-relevant fields are present, which can be auto-filled, and which need editorial input — with running cost in USD for every paid LLM call.
A follow-up to the Crossref Research Nexus Score visualised at nexus-score.vercel.app. Where the score tells you what's missing, this tool helps you fix it.
```
┌────────────────────────────────────────────────────────────┐
│  42  RESEARCH NEXUS                                        │
│      Mandatory: 9/9 · Depositable                          │
│                                                            │
│  25%  Provenance     █░░░░░░░  13%                         │
│  20%  People         ████░░░░  60%                         │
│  20%  Funding        ░░░░░░░░   0%                         │
│  20%  Access         ███████░  80%                         │
│  15%  Organizations  █████░░░  70%                         │
│                                                            │
│  [Run automated extraction]  [Run AI enrichment ~$0.025]   │
└────────────────────────────────────────────────────────────┘
```
The whole paper on one screen. The big number on the left is the Research Nexus score (weighted across Provenance · People · Funding · Access · Organizations); the banner next to it tracks the Mandatory deposit gate — "9/9 fields · Depositable" here, "5/9 fields · Not yet depositable" when fields are missing — which must be satisfied before the deposit XML can be generated. Each dimension bar shows its current score and field-completion ratio.
Below the hero, the six sections — the Mandatory gate plus the five dimensions — collapse by default; editors drill into one at a time. Each header shows the weight badge (GATE/25%/20%/...), the dimension's score, the N/M fields counter, and a "· X need attention" hint when there's open work, so the collapsed view is enough to plan a session without expanding anything.
The hero buttons are the publisher-controlled spend: Run automated
extraction (free pass — regex / Docling / free APIs), Run AI
enrichment with the upfront cost preview, and Generate Crossref
XML which is gated on the Mandatory banner.
Expanding a dimension reveals the field cards and the locate-in-document workflow that powers most editorial actions. Pictured: the Provenance dimension after the editor clicked Identify on document on References. The PDF renders inline at the right resolution for box-level selection — drag a lasso to multi-select boxes, shift-drag to deselect, and switch pages without losing the selection on the previous page (multi-page reference lists Just Work). The selected text is processed lookup-first: inline DOI regex → Crossref bibliographic search → OpenAlex match. AI is reserved for the ambiguous leftovers, opt-in.
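The lookup-first chain is cheap to sketch. A minimal illustration of the three tiers — function names, the score threshold, and response handling are assumptions, not the project's actual code; only the Crossref and OpenAlex endpoints are real:

```python
import re
import httpx

DOI_RE = re.compile(r"10\.\d{4,9}/[^\s\"<>]+", re.IGNORECASE)

def resolve_reference(ref_text: str, contact_email: str) -> str | None:
    # Tier 1: inline DOI regex — free, instant.
    if m := DOI_RE.search(ref_text):
        return m.group(0).rstrip(".,;")
    headers = {"User-Agent": f"metadata_gapfixer (mailto:{contact_email})"}
    # Tier 2: Crossref bibliographic search — free REST API.
    r = httpx.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": ref_text, "rows": 1},
        headers=headers,
    )
    items = r.json()["message"]["items"]
    if items and items[0].get("score", 0) > 60:  # threshold is illustrative
        return items[0]["DOI"]
    # Tier 3: OpenAlex text match — free, different coverage.
    r = httpx.get(
        "https://api.openalex.org/works",
        params={"search": ref_text, "per-page": 1, "mailto": contact_email},
        headers=headers,
    )
    results = r.json().get("results", [])
    if results and results[0].get("doi"):
        return results[0]["doi"].removeprefix("https://doi.org/")
    return None  # ambiguous leftover → candidate for opt-in AI
```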
Notice the layout: the dimension headers keep showing their score and counter even while one is open, so editors can monitor overall progress without scrolling back to the hero. Confirmed fields render as collapsible cards under "Confirmed"; fields that still need attention appear under "Needs attention" with the next-best-action button (Run lookup · free / Identify on document / Confirm / Reject and re-identify) — never more than one primary CTA per state.
The interpretation banner adapts to the score: "Depositable, but the record is thin. It will be hard for indexers to use." — Mandatory is satisfied (so the deposit XML can be generated), but the weighted Research Nexus score is low because Funding and Access are 0% on this paper. The hero score and banner are deliberately separate from the Mandatory gate: the deposit can go through, but the record won't enable discovery, citation graphs, or compliance reporting until the other dimensions are filled.
Pictured: People and Organizations both expanded.
In People, the dimension-owned pillar reads 8/8 authors with ORCID — that's the actual entity count, not a rubric flag. Inside,
each field card expands to show the underlying values:
*ORCID for corresponding author* and *ORCID for every author* show each author with their resolved ORCID chip linking out to the public profile. `lookup` badges advertise that these were free API resolutions — no LLM cost. *CRediT contributor roles* is the AI structurer's output: per author, the 14-role CRediT taxonomy assignment, each role accompanied by an evidence quote pulled from the PDF and a confidence percentage. Editors can scan the evidence to verify the LLM's mapping rather than trust it blindly.
In Organizations, every author's affiliation is linked to a ROR
chip cross-referenced against the Research Organization Registry. The
list deduplicates so three authors at the same institution count as
one resolved affiliation, not three (a common scoring trap fixed
earlier in the project's history). Each row shows the affiliation
string, its ROR ID, and the link out — the same shape the deposit
XML's `<affiliation><institution-id type="ror">` element wants.
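For reference, a hedged sketch of what that fragment looks like in Crossref deposit markup — note that schema 5.3.1 spells the element `institution_id`; the names and the ROR ID below are placeholders, and the authoritative shape is the project's `templates/journal_article.xml.j2`:

```xml
<person_name sequence="first" contributor_role="author">
  <given_name>Asha</given_name>
  <surname>Rao</surname>
  <affiliations>
    <institution>
      <institution_name>Indian Institute of Technology Delhi</institution_name>
      <!-- placeholder ROR ID, not a real record -->
      <institution_id type="ror">https://ror.org/0xxxxxxx0</institution_id>
    </institution>
  </affiliations>
</person_name>
```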
This is the "show the data, not just the verdict" rule: every dimension score is decomposable down to the actual extracted values, with their provenance, so the editor can audit before depositing.
Aligned with the Crossref Research Nexus framing and the weighting used by nexus-score.vercel.app.
| Bucket | Weight | What it covers |
|---|---|---|
| Mandatory | gate | DOI, title, journal, ISSN, year, ≥1 author, full pub date, vol/issue/pages, copyright. The Crossref deposit minimum — must be satisfied before deposit. |
| Provenance | 25% | References, refs-with-DOI, preprint→VoR link, Crossmark, conflict-of-interest, data/code availability. |
| People | 20% | Full author names, ORCID for corresponding author, ORCID for every author, CRediT contributor roles. |
| Funding | 20% | Funder Registry DOI, award/grant numbers. |
| Access | 20% | Abstract (plain + JATS), license, OA indicator, plain-language summary. |
| Organizations | 15% | Affiliations extracted, ROR for every affiliation. |
The hero Research Nexus score is the weight-averaged percentage across the five dimensions. The Mandatory gate decides whether the record is depositable at all.
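Concretely, the arithmetic is a plain weighted mean. A minimal sketch with the weights from the table above (the real implementation lives in `services/scoring.py`; this is illustrative):

```python
WEIGHTS = {
    "provenance": 0.25,
    "people": 0.20,
    "funding": 0.20,
    "access": 0.20,
    "organizations": 0.15,
}

def nexus_score(dimension_scores: dict[str, float]) -> int:
    # dimension_scores maps each dimension to its 0-100 completeness score.
    return round(sum(WEIGHTS[d] * dimension_scores[d] for d in WEIGHTS))

# The hero mockup above: 0.25*13 + 0.20*60 + 0.20*0 + 0.20*80 + 0.15*70 ≈ 42
nexus_score({"provenance": 13, "people": 60, "funding": 0,
             "access": 80, "organizations": 70})  # -> 42
```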
Five layers, each building on the one below:
| Layer | What | Cost | Trigger |
|---|---|---|---|
| L0 · Docling layout | PDF/DOCX → structured JSON + per-element bboxes + page renders | $0 | Always (parse stage) |
| L1 · Deterministic factsheet | Regex sweep (DOI / ORCID / ROR / ISSN / arXiv / license / grant patterns / preprint DOIs) + PDF /Info + header parser (authors + affiliation marker map) + boilerplate anchor matching (funding / CoI / data availability / ethics) | $0 | Always (parse stage) |
| L2 · Free enricher APIs | ORCID public API · ROR v2 · OpenAlex · Crossref REST. Now with affiliation normalisation (Solr-style alternatives, "Indian Institute of Technology, Delhi" → "Indian Institute of Technology Delhi"), name-swap fallback for Indian/Telugu profiles, and ROR clear-winner auto-accept (top score ≥ 0.95 with a 0.10+ margin). | $0 | Auto-fix (publisher click) |
| L3 · LLM picker | When the editor explicitly opts in, runs ONE structured-output call per ambiguous field that picks among the API-returned candidates with reasoning + confidence. | ~$0.0002/call | Per-field "Adjudicate with AI" |
| L4 · LLM structurer | Higher-leverage: takes a raw content region (or editor-located text via `text_override`) and returns a clean structured record. Five named tasks: `structure_authors`, `structure_references`, `structure_funding`, `structure_credit`, `verify_authors`. The verifier now also receives the paper's title + abstract excerpt + OpenAlex concepts so it can reject candidates whose research domain doesn't match. | ~$0.0003–$0.013/call | Per-field "Identify on document" with AI cost meta-pill |
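The L2 clear-winner rule is the kind of deterministic guard that keeps the free tier trustworthy. A minimal sketch — the candidate shape is assumed; the real logic lives in the ROR enricher:

```python
def ror_clear_winner(candidates: list[dict]) -> dict | None:
    """Auto-accept a ROR match only when it is unambiguous:
    top score >= 0.95 AND at least 0.10 ahead of the runner-up."""
    ranked = sorted(candidates, key=lambda c: c["score"], reverse=True)
    if not ranked or ranked[0]["score"] < 0.95:
        return None
    if len(ranked) > 1 and ranked[0]["score"] - ranked[1]["score"] < 0.10:
        return None  # too close — leave for editor pick / AI adjudication
    return ranked[0]
```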
Cost rule: the LLM is never called automatically. Every paid call is the publisher's deliberate choice and is recorded in a per-submission USD ledger.
Side path · GLiNER2 NER. A local zero-shot named-entity recogniser
(fastino/gliner2-large-v1 by default, fastino/gliner2-base-v1 for
smaller hosts) is mounted at POST /submissions/{id}/ner for ad-hoc entity
extraction over a chosen text region. Per-zone label presets (e.g. funding
zone → award_id + funder_name; CoI zone → person + organization +
relationship; affiliation zone → organization + country + gpe) live
in services/ner.py. GLiNER runs entirely on-device, no network calls, no
LLM cost — it's the cheap layer for "find ORCIDs in this paragraph" or
"pull award IDs from this funding statement" when neither regex nor a
structured LLM call is the right tool. Weights download once into
./data/hf-cache/ (~1.4 GB) on first call.
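Calling the side path looks roughly like this — a hedged sketch, since the request body shape ("text" + "labels") and the backend port are assumptions; check `services/ner.py` and the route handler for the actual schema:

```python
import httpx

# Backend base URL assumed — check docker-compose.yml for the real port.
resp = httpx.post(
    "http://localhost:8000/submissions/42/ner",
    json={
        "text": "This work was supported by DST grant CRG/2021/001234.",  # illustrative sentence
        "labels": ["award_id", "funder_name"],  # the funding-zone preset
    },
)
print(resp.json())  # expected: entities with label + span + confidence
```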
For a typical paper, full premium processing tops out around $0.025. Standard auto-fix (no LLM) is $0.
- Upload PDF/DOCX → `POST /submissions`. Status flows `uploaded → parsing → parsed`.
- Parse runs once per upload, in the background:
  - `docling-serve /v1/convert/file` → structured JSON + markdown.
  - PyMuPDF renders each page to PNG at 150 DPI; bboxes are mapped from PDF points → image pixels for the layout overlay.
  - `factsheet.py` runs the deterministic L1 extraction.
- Score is computed on demand from the factsheet + saved metadata. Returned by `GET /submissions/{id}/score` — includes per-dimension scores, the Research Nexus weighted score, the Mandatory gate state, and the entity-count pillars (e.g. `7/9 authors with ORCID`, `3/8 affiliations with ROR`).
- Auto-fix (free, deterministic): `POST /submissions/{id}/autofix/all` runs every high-impact fixer in one click (the hero CTA).
- Identify on document (the unified per-field action when something is missing): the editor selects boxes containing the value; the backend transparently routes to either field-aware regex (for deterministic fields) or the LLM structurer with `text_override` (for AI-leverage fields), with an upfront cost-confirm dialog.
- Confirm / Reject — `POST /submissions/{id}/confirm` flips `provenance.confirmed=true`; `POST .../reject` flips back to `needs_locate` to prompt re-identification.
- Generate XML — `POST /submissions/{id}/xml` builds, `GET .../xml` downloads.
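The whole flow fits in a short script. A hedged end-to-end sketch — the response field names (`id`, `status`), the multipart field name, and the backend base URL are assumptions:

```python
import time
import httpx

BASE = "http://localhost:8000"  # backend base URL — adjust to your compose setup

with httpx.Client(base_url=BASE, timeout=120) as api:
    # 1. Upload → parse kicks off in the background.
    with open("paper.pdf", "rb") as f:
        sub = api.post("/submissions", files={"file": f}).json()
    sid = sub["id"]

    # 2. Poll until parsed (uploaded → parsing → parsed).
    while api.get(f"/submissions/{sid}").json()["status"] != "parsed":
        time.sleep(2)

    # 3. Free deterministic pass, then read the scorecard.
    api.post(f"/submissions/{sid}/autofix/all")
    print(api.get(f"/submissions/{sid}/score").json())

    # 4. Build + download the deposit XML (only succeeds once the
    #    Mandatory gate is satisfied).
    api.post(f"/submissions/{sid}/xml")
    xml = api.get(f"/submissions/{sid}/xml").text
```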
| Method | Path | Purpose |
|---|---|---|
| POST | `/submissions` | Multipart upload, kicks off parse |
| GET | `/submissions` | List submissions + status |
| GET | `/submissions/{id}` | Status of one submission |
| DELETE | `/submissions/{id}` | Remove submission + all generated files |
| GET | `/submissions/{id}/factsheet` | Deterministic L1 output |
| GET | `/submissions/{id}/score` | Dimension rubric + Research Nexus score + entity pillars |
| GET | `/submissions/{id}/sections` | Layout-derived sections |
| GET | `/submissions/{id}/sections/{n}` | One section with full text |
| GET | `/submissions/{id}/markdown` | Full Docling markdown |
| GET | `/submissions/{id}/pages` | Per-page render dimensions |
| GET | `/submissions/{id}/pages/{n}/image` | Page PNG |
| GET | `/submissions/{id}/pages/{n}/boxes` | Per-page Docling bboxes + text |
| GET | `/submissions/{id}/references_layout` | Three-tier references detection result |
| GET | `/submissions/{id}/provenance` | Per-field source / confidence / reasoning |
| GET | `/submissions/{id}/cost` | LLM cost ledger for this submission |
| GET | `/submissions/{id}/metadata` | Saved metadata JSON |
| PUT | `/submissions/{id}/metadata` | Save edited metadata |
| POST | `/submissions/{id}/autofix` | One deterministic fixer (free) |
| POST | `/submissions/{id}/autofix/all` | Run every high-impact fixer (free) |
| POST | `/submissions/{id}/pick` | Manual editor pick from candidate list (free) |
| POST | `/submissions/{id}/disambiguate/estimate` | Preview LLM cost (free) |
| POST | `/submissions/{id}/disambiguate` | Opt-in LLM picker (~$0.0002/call) |
| POST | `/submissions/{id}/structure/{task}/estimate` | Per-task cost preview |
| POST | `/submissions/{id}/structure/{task}` | Opt-in LLM structurer; optional body `{"text_override": "..."}` for editor-located source text |
| POST | `/submissions/{id}/enrich/all` | Run all premium structurers in sequence |
| POST | `/submissions/{id}/confirm` | Editor confirms a field |
| POST | `/submissions/{id}/reject` | Editor rejects a field → `needs_locate` |
| POST | `/submissions/{id}/locate` | Editor pointed to box(es); regex extraction for deterministic fields |
| POST | `/submissions/{id}/ner` | On-device GLiNER2 zero-shot NER over a text region |
| GET | `/submissions/presets/labels` | NER label presets per content zone |
| POST | `/submissions/{id}/xml` | Build Crossref XML |
| GET | `/submissions/{id}/xml` | Download XML |
Vite + React + react-router. Restrained scholarly palette (parchment / inkwell / muted-stone / onyx-orange accent), shadcn-style icons (lucide), Lato body + JetBrains Mono code, 4/8px radii, compact density. Light is the canonical theme; dark is a derived inverse.
Two pages:
- Submissions (`/upload`) — drag-and-drop dropzone, status pills, per-row delete.
- Review (`/review/:id`) — top-to-bottom flow:
  - Hero — Research Nexus score (weighted), Mandatory gate banner ("9/9 fields · Depositable" or "Not yet depositable"), five dimension bars with weight labels, action buttons (`Run automated extraction` · `Run AI enrichment ~$0.0NN` · `Generate Crossref XML`).
  - Sticky dimension nav — chip per dimension with current score, jumps to that section.
  - Per-dimension sections — Mandatory, then Provenance / People / Funding / Access / Organizations. Each has a strong header (weight badge, title, description, score number, progress bar), a dimension-owned entity-progress strip when applicable (e.g. People shows "7/9 authors with ORCID"), and field cards split into two sub-buckets:
    - Needs attention — everything not confirmed
    - Confirmed — green-edged cards collapsed beneath
  - Field cards — each card is collapsed by default; click to expand the detail panel showing the structured data inline:
    - Author-related fields (full names, ORCIDs, affiliations, RORs) expand to show a per-author list with ORCID / ROR / affiliation chips and AI evidence chains.
    - CRediT contributor roles expand to show per-author roles with evidence quotes from the contribution paragraph and confidence %.
- Step-by-step CTAs: at most one primary action per state. Missing fields show a single `Identify on document` button (with cost meta-pill if AI-leverage). Pending fields show `Confirm` / `Reject`. Confirmed fields show only `Reject and re-identify`. The hero owns the global `Run automated extraction` (free pass) and `Run AI enrichment` (priced pass) buttons.
```bash
git clone https://github.com/aadivar/metadata_gapfixer
cd metadata_gapfixer
cp .env.example .env   # edit OPENAI_API_KEY + CONTACT_EMAIL
docker compose up -d --build
```

Open http://localhost:3000.
First boot downloads:

- `ghcr.io/docling-project/docling-serve:latest` (~2 GB)
- GLiNER2 (`fastino/gliner2-large-v1`) into `./data/hf-cache/` (~1.4 GB)
Tail logs with `docker compose logs -f backend`.
Set `CONTACT_EMAIL` in `.env` — it's used in the User-Agent for ORCID, ROR, OpenAlex, and Crossref polite-pool routing. Identified clients get much higher rate limits.
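What "identified" looks like on the wire — a hedged sketch; the exact User-Agent string the project sends is built in the enricher base class, but the `mailto:` shape below is what Crossref and OpenAlex document for polite-pool routing:

```python
import os
import httpx

CONTACT_EMAIL = os.environ.get("CONTACT_EMAIL", "anonymous@example.org")

client = httpx.Client(
    headers={"User-Agent": f"metadata_gapfixer (mailto:{CONTACT_EMAIL})"}
)
# OpenAlex also accepts the address as a query parameter:
# client.get("https://api.openalex.org/works", params={"mailto": CONTACT_EMAIL})
```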
All knobs live in .env. See .env.example for working snippets per
provider (OpenAI, OpenRouter, Anthropic-compat, Groq, Ollama, LiteLLM).
| Var | Default | Notes |
|---|---|---|
| `OPENAI_BASE_URL` | `https://api.openai.com/v1` | Any OpenAI-compatible endpoint |
| `OPENAI_API_KEY` | (required) | Provider API key |
| `OPENAI_MODEL` | `gpt-4o-mini` | Must support `response_format=json_schema` |
| `GLINER_MODEL` | `fastino/gliner2-large-v1` | Use `fastino/gliner2-base-v1` (205M) on small hosts |
| `CONTACT_EMAIL` | `anonymous@example.org` | Sent in User-Agent for ORCID/ROR/OpenAlex/Crossref polite pools — change this |
| `HF_TOKEN` | (unset) | Optional — higher rate limits on HuggingFace downloads |
Per-task model routing lives in `backend/app/services/llm_router.py`'s `TASK_CONFIG` dict.
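A hypothetical shape for that dict — the field names here are illustrative only; see `llm_router.py` for the real structure:

```python
TASK_CONFIG = {
    "disambiguate":         {"model": "gpt-4o-mini", "max_output_tokens": 512},
    "structure_authors":    {"model": "gpt-4o-mini", "max_output_tokens": 2048},
    "structure_references": {"model": "gpt-4o-mini", "max_output_tokens": 4096},
    "structure_funding":    {"model": "gpt-4o-mini", "max_output_tokens": 1024},
    "structure_credit":     {"model": "gpt-4o-mini", "max_output_tokens": 2048},
    "verify_authors":       {"model": "gpt-4o-mini", "max_output_tokens": 1024},
}
```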
For now, everything is on disk under ./data/:
```
data/
├── uploads/                     # raw uploaded PDFs/DOCXs
├── outputs/
│   ├── {id}_docling.json        # full DoclingDocument
│   ├── {id}_layout.json         # rendered pages + bboxes (PDFs only)
│   ├── {id}_pages/page_NNN.png  # 150 DPI page renders
│   ├── {id}_factsheet.json      # deterministic L1 extraction
│   ├── {id}_metadata.json       # editor-facing metadata + provenance
│   ├── {id}_cost.json           # per-call LLM cost ledger
│   └── {id}_crossref.xml        # generated deposit XML
├── cache/http/                  # diskcache for ORCID/ROR/OpenAlex/Crossref
├── hf-cache/                    # huggingface model weights
└── mgf.db                       # SQLite — submissions table only
```
To swap storage (S3, GCS, Azure Blob, NFS, etc.): bind-mount
./data/uploads and ./data/outputs to your network storage (works for
any POSIX-mountable backend), and replace the SQLite engine in
backend/app/db.py with a Postgres / MySQL URL — SQLModel speaks both.
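The engine swap itself is a one-liner — a hedged sketch (variable names assumed; `create_engine` is re-exported by SQLModel from SQLAlchemy):

```python
from sqlmodel import create_engine

# engine = create_engine("sqlite:///data/mgf.db")        # default
engine = create_engine("postgresql+psycopg://mgf:secret@db:5432/mgf")
```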
We deliberately did not embed an S3 / cloud-specific client so the project stays portable. Pick what fits your infrastructure.
metadata_gapfixer/
├── docker-compose.yml
├── .env.example
├── README.md · DESIGN.md
├── LICENSE # AGPL-v3
├── backend/
│ ├── Dockerfile · requirements.txt
│ ├── templates/journal_article.xml.j2
│ └── app/
│ ├── main.py · config.py · db.py · models.py · pipeline.py
│ ├── routes/
│ │ ├── health.py
│ │ └── submissions.py
│ └── services/
│ ├── docling_client.py · page_render.py · sections.py
│ ├── factsheet.py · ner.py
│ ├── scoring.py # rubric + dimensions + Research Nexus score
│ ├── autofix.py # free deterministic fixers + needs_pick
│ ├── llm_router.py # per-task models + cost ledger
│ ├── structurers.py # five LLM structurers + paper-context aware verifier
│ ├── crossref_xml.py
│ └── enrichers/
│ ├── _base.py · orcid.py · ror.py · openalex.py · crossref.py
│ └── (ORCID name-swap, ROR comma-delete normalisation, etc.)
└── frontend/
├── Dockerfile · package.json
└── src/
├── App.tsx · api.ts · theme.ts · icons.tsx · styles.css
└── pages/
├── Upload.tsx
└── Review.tsx # dimension-bucketed scorecard + step-by-step CTAs
- Journal articles only. No book chapters, conference papers, datasets, preprints (as deposit type). Preprint relations on a journal article are supported via the `preprint_doi` field.
- No fabrication. The LLM is constrained by structured outputs and explicit candidate lists. ORCIDs, DOIs, RORs, ISSNs come from APIs or PDF text — never made up.
- Topic-aware verification. The `verify_authors` structurer now receives the paper's title, abstract excerpt, and OpenAlex concepts, and rejects candidates whose top concepts have zero overlap with the paper — even when the institution matches.
- XML is well-formedness checked, not XSD-validated. Run Crossref's XML validator (or `xmllint --schema crossref5.3.1.xsd ...`) before deposit. The template targets schema 5.3.1.
- LLM provider must support OpenAI structured outputs (`response_format` with `json_schema`).
Ordered roughly by impact-per-week-of-work for the editorial-tool use case.
- PDF layout memory. Cache layout signatures (page geometry, text-block fingerprints, header bbox, references-section detector results) per publisher / per template. On the second paper from the same journal, the parser short-circuits to the known regions instead of re-running detection end-to-end. Today's three-tier references detector (`services/references_layout.py`) is the first taste; generalise to `authors_layout.py`, `funding_layout.py`, `crediting_layout.py`.
- Batch processing of multiple files. Drop a directory or a ZIP of PDFs at the dropzone (or `POST /batch`) and process N submissions in parallel, with a single combined progress view and a single Crossref XML bundle download. Same parse pipeline as the single-file flow, shared HTTP cache, throughput-bounded by Docling-serve concurrency.
- CLI for backfill of older records. A `mgf-cli ingest <path>` that runs the parse + factsheet + free auto-fix end-to-end without the GUI. Useful for re-processing a publisher's archive once layout-aware detection improves, or for retrofitting metadata into legacy DOIs that were deposited with thin records. Output: a JSON report per file + the generated XML, written next to each PDF.
- Per-editor / per-journal preference storage. Capture and replay publisher policy fields that don't live in the PDF: Crossmark policy URL, copyright holder, license-of-record default, depositor email, preferred CRediT taxonomy version, archived-versions URL pattern, funding-text → funder mapping rules. First time an editor sets these for a journal (keyed by ISSN or DOI prefix), they're remembered for every subsequent paper in that journal.
- `needs_pick` inline UI. When ORCID or ROR returns N>1 candidates, the data is already in `provenance` — render the candidate list inline on the field card with [Pick] buttons (free) and an "Adjudicate with AI · ~$0.0002" button. CLAUDE.md's next-slice item.
- Inter-paper consistency checks. Across a single issue or volume, flag inconsistencies the editor would otherwise miss — author name variants for the same person across papers, mismatched affiliations, inconsistent license, duplicate DOIs, drifted funder labels.
- Audit log per submission. Append-only ledger of every editor action (confirm, reject, pick, locate, AI call, manual edit) with timestamps. Both for accountability and to surface per-publisher patterns that should become learned rules.
- Bump XML template to Crossref schema 5.4.0 (currently 5.3.1). One-file change in `templates/journal_article.xml.j2`.
- XSD validation of the generated XML, not just well-formedness. Run schema 5.4.0 inline before download / deposit.
- Direct deposit to Crossref end-to-end. Sign in once with Crossref credentials and `POST /submit` straight to their deposit API instead of downloading XML and uploading by hand.
POST /submitstraight to their deposit API instead of downloading XML and uploading by hand. - Crossref REST diff. Before depositing an update, fetch the
existing record from
api.crossref.org/works/{doi}and show a diff view so the editor sees exactly what they're about to overwrite.
- Local ROR / ORCID daily snapshot. Mirror the public dumps (ROR: ~50 MB/day, ORCID public summary: bulk export) so air-gapped or rate-limited publishers can resolve identifiers without internet egress. Falls back to live API if the snapshot misses.
- GROBID fallback for scanned / OCR-only PDFs. Docling struggles with poor scans; GROBID's TEI output covers a different failure mode for legacy archives.
- Multi-language abstract support. Crossref schema 5.4.0 lets you deposit abstracts in multiple languages — surface and structure them separately when present.
- XMP write-back into the PDF. Embed the corrected metadata into the PDF's XMP packet so downstream tools (institutional repositories, preservation systems) read the same metadata that was deposited.
- Per-publisher rubric weights. Let the publisher tune the five Research Nexus dimension weights to match their own integrity priorities (e.g. an OA-only publisher may weight Access higher).
- Webhook from typesetting pipeline. Trigger `POST /submissions` from a publisher's CI (Editorial Manager, OJS, Manuscript Manager) the moment a final PDF is generated, so the metadata report is ready before the editor opens it.
- Multi-tenant auth. If hosted, per-publisher login + scoped data (uploads, profiles, cost ledger) so multiple journals can share one instance.
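For the Crossref REST diff item above, the fetch half is already a one-liner against the public API. A hedged sketch — the field selection and the flat local-metadata shape are illustrative:

```python
import httpx

def live_record(doi: str, contact_email: str) -> dict:
    """Fetch the currently-registered Crossref record for a DOI."""
    r = httpx.get(
        f"https://api.crossref.org/works/{doi}",
        headers={"User-Agent": f"metadata_gapfixer (mailto:{contact_email})"},
    )
    r.raise_for_status()
    return r.json()["message"]

def changed_fields(local: dict, remote: dict, fields: tuple[str, ...]) -> dict:
    # Crossref returns e.g. "title" as a list — a real implementation would
    # normalise values before comparing; this sketch compares raw values.
    return {f: {"local": local.get(f), "remote": remote.get(f)}
            for f in fields if local.get(f) != remote.get(f)}
```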
If you'd like to pick one up or have other priorities, please open an issue at https://github.com/aadivar/metadata_gapfixer/issues.
Built by the team behind nexus-score.vercel.app. Source on GitHub: https://github.com/aadivar/metadata_gapfixer.
AGPL-v3. If you run a modified version of this software on a network-accessible service, you must offer that modified source to the service's users.