
Metadata Generator

A local diagnostic tool that turns a journal article PDF/DOCX into a Research Nexus completeness scorecard plus a Crossref-ready DOI submission XML. It runs the document through layout analysis, deterministic extraction, free identifier-registry lookups, and opt-in AI enrichment, and shows you in one screen exactly which integrity-relevant fields are present, which can be auto-filled, and which need editorial input — with running cost in USD for every paid LLM call.

A follow-up to the Crossref Research Nexus Score visualised at nexus-score.vercel.app. Where the score tells you what's missing, this tool helps you fix it.

┌────────────────────────────────────────────────────────────────────┐
│  47   RESEARCH NEXUS                                               │
│       Mandatory: 9/9 · Depositable                                 │
│                                                                    │
│   25%  Provenance       █░░░░░░░   13%                             │
│   20%  People           ████░░░░   60%                             │
│   20%  Funding          ░░░░░░░░    0%                             │
│   20%  Access           ███████░   80%                             │
│   15%  Organizations    █████░░░   70%                             │
│                                                                    │
│   [Run automated extraction]  [Run AI enrichment ~$0.025]          │
└────────────────────────────────────────────────────────────────────┘

Screenshots

Scorecard overview

Review page — scorecard overview

The whole paper on one screen. The big number on the left is the Research Nexus score (weighted across Provenance · People · Funding · Access · Organizations); the banner next to it tracks the Mandatory deposit gate ("5/9 fields · Not yet depositable"), which must be satisfied before the deposit XML can be generated. Each dimension bar shows its current score and field-completion ratio.

Below the hero, the six sections (the Mandatory gate plus the five dimensions) are collapsed by default — editors drill into one section at a time. Each header shows the weight badge (GATE/25%/20%/...), the dimension's score, the N/M fields counter, and a · X need attention hint when there's open work, so the collapsed view is enough to plan a session without expanding anything. The hero buttons are the publisher-controlled spend: Run automated extraction (free pass — regex / Docling / free APIs), Run AI enrichment with the upfront cost preview, and Generate Crossref XML, which is gated on the Mandatory banner.

Locate flow with paginated references

Provenance section expanded — locate flow

Expanding a dimension reveals the field cards and the locate-in-document workflow that powers most editorial actions. Pictured: the Provenance dimension after the editor clicked Identify on document on References. The PDF renders inline at the right resolution for box-level selection — drag a lasso to multi-select boxes, shift-drag to deselect, and switch pages without losing the selection on the previous page (multi-page reference lists Just Work). The selected text is processed lookup-first: inline DOI regex → Crossref bibliographic search → OpenAlex match. AI is reserved for the ambiguous leftovers, opt-in.

Notice the layout: the dimension headers keep showing their score and counter even while one is open, so editors can monitor overall progress without scrolling back to the hero. Confirmed fields render as collapsible cards under "Confirmed"; fields that still need attention appear under "Needs attention" with the next-best-action button (Run lookup · free / Identify on document / Confirm / Reject and re-identify) — never more than one primary CTA per state.

Per-author entity views (ORCID, CRediT, ROR)

People + Organizations expanded — entity-level views

The interpretation banner adapts to the score: "Depositable, but the record is thin. It will be hard for indexers to use." — Mandatory is satisfied (so the deposit XML can be generated), but the weighted Research Nexus score is low because Funding and Access are 0% on this paper. The hero score and banner are deliberately separate from the Mandatory gate: the deposit can go through, but the record won't enable discovery, citation graphs, or compliance reporting until the other dimensions are filled.

Pictured: People and Organizations both expanded.

In People, the dimension-owned pillar reads 8/8 authors with ORCID — that's the actual entity count, not a rubric flag. Inside, each field card expands to show the underlying values:

  • ORCID for corresponding author and ORCID for every author show each author with their resolved ORCID chip linking out to the public profile. lookup badges advertise that these were free API resolutions — no LLM cost.
  • CRediT contributor roles is the AI structurer's output: per author, the 14-role CRediT taxonomy assignment, each role accompanied by an evidence quote pulled from the PDF and a confidence percentage. Editors can scan the evidence to verify the LLM's mapping rather than trust it blindly.

In Organizations, every author's affiliation is linked to a ROR chip cross-referenced against the Research Organization Registry. The list deduplicates so three authors at the same institution count as one resolved affiliation, not three (a common scoring trap fixed earlier in the project's history). Each row shows the affiliation string, its ROR ID, and the link out — the same shape the deposit XML's <affiliation><institution-id type="ror"> element wants.
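
The dedup rule is small enough to state in a few lines. A minimal sketch, assuming each resolved affiliation carries a ror_id field (the record shape and the ROR IDs below are illustrative, not the tool's internal model):

```python
# Illustrative only: record shape and ROR IDs are placeholders.
affiliations = [
    {"author": "A. Rao",   "raw": "Dept. of CS, IIT Delhi", "ror_id": "https://ror.org/0000000a"},
    {"author": "B. Singh", "raw": "IIT Delhi",              "ror_id": "https://ror.org/0000000a"},
    {"author": "C. Kumar", "raw": "IISc Bangalore",         "ror_id": "https://ror.org/0000000b"},
]

# Three authors at one institution count as ONE resolved affiliation.
resolved = {a["ror_id"] for a in affiliations if a.get("ror_id")}
print(f"{len(resolved)} unique affiliations resolved")  # 2, not 3
```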

This is the "show the data, not just the verdict" rule: every dimension score is decomposable down to the actual extracted values, with their provenance, so the editor can audit before depositing.


Categorisation: one Mandatory gate + five Research Nexus dimensions

Aligned with the Crossref Research Nexus framing and the weighting used by nexus-score.vercel.app.

| Bucket | Weight | What it covers |
| --- | --- | --- |
| Mandatory gate | Gate | DOI, title, journal, ISSN, year, ≥1 author, full pub date, vol/issue/pages, copyright. The Crossref deposit minimum — must be satisfied before deposit. |
| Provenance | 25% | References, refs-with-DOI, preprint→VoR link, Crossmark, conflict-of-interest, data/code availability. |
| People | 20% | Full author names, ORCID for corresponding author, ORCID for every author, CRediT contributor roles. |
| Funding | 20% | Funder Registry DOI, award/grant numbers. |
| Access | 20% | Abstract (plain + JATS), license, OA indicator, plain-language summary. |
| Organizations | 15% | Affiliations extracted, ROR for every affiliation. |

The hero Research Nexus score is the weight-averaged percentage across the five dimensions. The Mandatory gate decides whether the record is depositable at all.
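
As a back-of-the-envelope check, the weighting reduces to a few lines. A minimal sketch, assuming each dimension reports a 0–100 completeness percentage; the real scorer in services/scoring.py may differ in sub-field weighting and rounding:

```python
# Dimension weights from the table above; input values are illustrative.
WEIGHTS = {
    "provenance": 0.25,
    "people": 0.20,
    "funding": 0.20,
    "access": 0.20,
    "organizations": 0.15,
}

def nexus_score(dims: dict[str, float]) -> int:
    """Weighted average of the five dimension percentages (0-100)."""
    return round(sum(WEIGHTS[name] * pct for name, pct in dims.items()))

print(nexus_score({"provenance": 40, "people": 75, "funding": 0,
                   "access": 80, "organizations": 70}))  # -> 52
# The Mandatory gate is tracked separately and never enters this average.
```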


Architecture

Five layers, each building on the one below:

| Layer | What | Cost | Trigger |
| --- | --- | --- | --- |
| L0 · Docling layout | PDF/DOCX → structured JSON + per-element bboxes + page renders | $0 | Always (parse stage) |
| L1 · Deterministic factsheet | Regex sweep (DOI / ORCID / ROR / ISSN / arXiv / license / grant patterns / preprint DOIs) + PDF /Info + header parser (authors + affiliation marker map) + boilerplate anchor matching (funding / CoI / data availability / ethics) | $0 | Always (parse stage) |
| L2 · Free enricher APIs | ORCID public API · ROR v2 · OpenAlex · Crossref REST. Now with affiliation normalisation (Solr-style alternatives, Indian Institute of Technology, Delhi → Indian Institute of Technology Delhi), name-swap fallback for Indian/Telugu profiles, and ROR clear-winner auto-accept (top score ≥ 0.95 with a 0.10+ margin). | $0 | Auto-fix (publisher click) |
| L3 · LLM picker | When the editor explicitly opts in, runs ONE structured-output call per ambiguous field that picks among the API-returned candidates with reasoning + confidence. | ~$0.0002/call | Per-field "Adjudicate with AI" |
| L4 · LLM structurer | Higher-leverage: takes a raw content region (or editor-located text via text_override) and returns a clean structured record. Five named tasks: structure_authors, structure_references, structure_funding, structure_credit, verify_authors. The verifier now also receives the paper's title + abstract excerpt + OpenAlex concepts so it can reject candidates whose research domain doesn't match. | ~$0.0003–$0.013/call | Per-field "Identify on document" with AI cost meta-pill |

Cost rule: the LLM is never called automatically. Every paid call is the publisher's deliberate choice and is recorded in a per-submission USD ledger.
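
A minimal sketch of what a ledger append can look like; the field names here are assumptions, not the actual {id}_cost.json schema:

```python
import json
import time
from pathlib import Path

def record_llm_call(submission_id: str, task: str, usd: float,
                    outputs: Path = Path("data/outputs")) -> float:
    """Append one paid call to the per-submission ledger; return the running total."""
    ledger_path = outputs / f"{submission_id}_cost.json"
    ledger = json.loads(ledger_path.read_text()) if ledger_path.exists() else []
    ledger.append({"task": task, "usd": usd, "ts": time.time()})
    ledger_path.write_text(json.dumps(ledger, indent=2))
    return sum(entry["usd"] for entry in ledger)
```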

Side path · GLiNER2 NER. A local zero-shot named-entity recogniser (fastino/gliner2-large-v1 by default, fastino/gliner2-base-v1 for smaller hosts) is mounted at POST /submissions/{id}/ner for ad-hoc entity extraction over a chosen text region. Per-zone label presets (e.g. funding zone → award_id + funder_name; CoI zone → person + organization + relationship; affiliation zone → organization + country + gpe) live in services/ner.py. GLiNER runs entirely on-device, no network calls, no LLM cost — it's the cheap layer for "find ORCIDs in this paragraph" or "pull award IDs from this funding statement" when neither regex nor a structured LLM call is the right tool. Weights download once into ./data/hf-cache/ (~1.4 GB) on first call.
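
A hedged example of hitting that endpoint from a script. The backend port and the request-body shape (text plus a label list) are assumptions; check the route in routes/submissions.py for the actual contract:

```python
import requests

resp = requests.post(
    "http://localhost:8000/submissions/abc123/ner",   # backend port assumed
    json={
        "text": "This work was supported by DST grant SR/2021/001234.",  # sample text
        "labels": ["award_id", "funder_name"],        # funding-zone preset
    },
    timeout=300,  # first call downloads ~1.4 GB of model weights
)
print(resp.json())
```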

For a typical paper, full premium processing tops out around $0.025. Standard auto-fix (no LLM) is $0.


Pipeline

  1. Upload PDF/DOCX → POST /submissions. Status flows uploaded → parsing → parsed.
  2. Parse runs once per upload, in the background:
    • docling-serve /v1/convert/file → structured JSON + markdown.
    • PyMuPDF renders each page to PNG at 150 DPI; bboxes are mapped from PDF points → image pixels for the layout overlay.
    • factsheet.py runs the deterministic L1 extraction.
  3. Score is computed on demand from the factsheet + saved metadata. Returned by GET /submissions/{id}/score — includes per-dimension scores, the Research Nexus weighted score, the Mandatory gate state, and the entity-count pillars (e.g. 7/9 authors with ORCID, 3/8 affiliations with ROR).
  4. Auto-fix (free, deterministic):
    • POST /submissions/{id}/autofix/all runs every high-impact fixer in one click (the hero CTA).
  5. Identify on document (the unified per-field action when something is missing): editor selects boxes containing the value; the backend transparently routes to either field-aware regex (for deterministic fields) or the LLM structurer with text_override (for AI-leverage fields), with an upfront cost-confirm dialog.
  6. Confirm / Reject — POST /submissions/{id}/confirm flips provenance.confirmed=true; POST .../reject flips back to needs_locate to prompt re-identification.
  7. Generate XML — POST /submissions/{id}/xml builds, GET .../xml downloads.
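
The same flow can be scripted over HTTP. A minimal sketch against the endpoints listed below; the backend port and the file/id field names are assumptions:

```python
import time
import requests

BASE = "http://localhost:8000"  # backend port assumed

with open("paper.pdf", "rb") as fh:                       # multipart field name assumed
    sub = requests.post(f"{BASE}/submissions", files={"file": fh}).json()
sid = sub["id"]                                           # response field name assumed

while requests.get(f"{BASE}/submissions/{sid}").json()["status"] != "parsed":
    time.sleep(2)                                         # uploaded -> parsing -> parsed

requests.post(f"{BASE}/submissions/{sid}/autofix/all")    # free deterministic pass
print(requests.get(f"{BASE}/submissions/{sid}/score").json())

# Build + download the deposit XML (only succeeds once the Mandatory gate is met).
requests.post(f"{BASE}/submissions/{sid}/xml")
open("deposit.xml", "wb").write(requests.get(f"{BASE}/submissions/{sid}/xml").content)
```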

API surface

| Method | Path | Purpose |
| --- | --- | --- |
| POST | /submissions | Multipart upload, kicks off parse |
| GET | /submissions | List submissions + status |
| GET | /submissions/{id} | Status of one submission |
| DELETE | /submissions/{id} | Remove submission + all generated files |
| GET | /submissions/{id}/factsheet | Deterministic L1 output |
| GET | /submissions/{id}/score | Dimension rubric + Research Nexus score + entity pillars |
| GET | /submissions/{id}/sections | Layout-derived sections |
| GET | /submissions/{id}/sections/{n} | One section with full text |
| GET | /submissions/{id}/markdown | Full Docling markdown |
| GET | /submissions/{id}/pages | Per-page render dimensions |
| GET | /submissions/{id}/pages/{n}/image | Page PNG |
| GET | /submissions/{id}/pages/{n}/boxes | Per-page Docling bboxes + text |
| GET | /submissions/{id}/references_layout | Three-tier references detection result |
| GET | /submissions/{id}/provenance | Per-field source / confidence / reasoning |
| GET | /submissions/{id}/cost | LLM cost ledger for this submission |
| GET | /submissions/{id}/metadata | Saved metadata JSON |
| PUT | /submissions/{id}/metadata | Save edited metadata |
| POST | /submissions/{id}/autofix | One deterministic fixer (free) |
| POST | /submissions/{id}/autofix/all | Run every high-impact fixer (free) |
| POST | /submissions/{id}/pick | Manual editor pick from candidate list (free) |
| POST | /submissions/{id}/disambiguate/estimate | Preview LLM cost (free) |
| POST | /submissions/{id}/disambiguate | Opt-in LLM picker (~$0.0002/call) |
| POST | /submissions/{id}/structure/{task}/estimate | Per-task cost preview |
| POST | /submissions/{id}/structure/{task} | Opt-in LLM structurer; optional body {"text_override": "..."} for editor-located source text |
| POST | /submissions/{id}/enrich/all | Run all premium structurers in sequence |
| POST | /submissions/{id}/confirm | Editor confirms a field |
| POST | /submissions/{id}/reject | Editor rejects a field → needs_locate |
| POST | /submissions/{id}/locate | Editor pointed to box(es); regex extraction for deterministic fields |
| POST | /submissions/{id}/ner | On-device GLiNER2 zero-shot NER over a text region |
| GET | /submissions/presets/labels | NER label presets per content zone |
| POST | /submissions/{id}/xml | Build Crossref XML |
| GET | /submissions/{id}/xml | Download XML |
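
The estimate-then-run pattern for the paid endpoints looks like this in practice. A sketch only; the estimate response shape is an assumption beyond the documented cost preview:

```python
import requests

BASE = "http://localhost:8000"  # backend port assumed
SID = "abc123"

# Free preview first: this is the number shown on the cost meta-pill.
est = requests.post(f"{BASE}/submissions/{SID}/structure/structure_funding/estimate").json()
print("quoted cost:", est)

# Opt-in paid call, with editor-located text passed as text_override.
requests.post(
    f"{BASE}/submissions/{SID}/structure/structure_funding",
    json={"text_override": "This work was funded by the XYZ Foundation, grant 12345."},
)
```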

GUI

Vite + React + react-router. Restrained scholarly palette (parchment / inkwell / muted-stone / onyx-orange accent), shadcn-style icons (lucide), Lato body + JetBrains Mono code, 4/8px radii, compact density. Light is the canonical theme; dark is a derived inverse.

Two pages:

  • Submissions (/upload) — drag-and-drop dropzone, status pills, per-row delete.
  • Review (/review/:id) — top-to-bottom flow:
    1. Hero — Research Nexus score (weighted), Mandatory gate banner ("9/9 fields · Depositable" or "Not yet depositable"), five dimension bars with weight labels, action buttons (Run automated extraction · Run AI enrichment ~$0.0NN · Generate Crossref XML).
    2. Sticky dimension nav — chip per dimension with current score, jumps to that section.
    3. Per-dimension sections — Mandatory, then Provenance / People / Funding / Access / Organizations. Each has a strong header (weight badge, title, description, score number, progress bar), a dimension-owned entity-progress strip when applicable (e.g. People shows "7/9 authors with ORCID"), and field cards split into two sub-buckets:
      • Needs attention — everything not confirmed
      • Confirmed — green-edged cards collapsed beneath
    4. Field cards — each card is collapsed by default; click to expand the detail panel showing the structured data inline:
      • Author-related fields (full names, ORCIDs, affiliations, RORs) expand to show a per-author list with ORCID / ROR / affiliation chips and AI evidence chains.
      • CRediT contributor roles expand to show per-author roles with evidence quotes from the contribution paragraph and confidence %.
    5. Step-by-step CTAs — at most one primary action per state. Missing fields show a single Identify on document button (with a cost meta-pill if AI-leverage). Pending fields show Confirm / Reject. Confirmed fields show only Reject and re-identify. The hero owns the global Run automated extraction (free pass) and Run AI enrichment (priced pass) buttons.

Quick start

git clone https://github.com/aadivar/metadata_gapfixer
cd metadata_gapfixer
cp .env.example .env                # edit OPENAI_API_KEY + CONTACT_EMAIL
docker compose up -d --build

Open http://localhost:3000.

First boot downloads:

  • ghcr.io/docling-project/docling-serve:latest (~2 GB)
  • GLiNER2 (fastino/gliner2-large-v1) into ./data/hf-cache/ (~1.4 GB)

Tail logs with docker compose logs -f backend.

Set CONTACT_EMAIL in .env — it's used in the User-Agent for ORCID, ROR, OpenAlex, and Crossref polite-pool routing. Identified clients get much higher rate limits.
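
Polite-pool identification is just a User-Agent with a reachable mailto. A sketch of the convention (the exact format string is this project's choice; the DOI below is a placeholder):

```python
import os
import requests

headers = {
    "User-Agent": f"metadata_gapfixer/1.0 (mailto:{os.environ['CONTACT_EMAIL']})"
}
# Placeholder DOI; identified clients are routed to Crossref's polite pool.
requests.get("https://api.crossref.org/works/10.1234/placeholder", headers=headers)
```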


Configuration

All knobs live in .env. See .env.example for working snippets per provider (OpenAI, OpenRouter, Anthropic-compat, Groq, Ollama, LiteLLM).

| Var | Default | Notes |
| --- | --- | --- |
| OPENAI_BASE_URL | https://api.openai.com/v1 | Any OpenAI-compatible endpoint |
| OPENAI_API_KEY | (required) | Provider API key |
| OPENAI_MODEL | gpt-4o-mini | Must support response_format=json_schema |
| GLINER_MODEL | fastino/gliner2-large-v1 | Use fastino/gliner2-base-v1 (205M) on small hosts |
| CONTACT_EMAIL | anonymous@example.org | Sent in User-Agent for ORCID/ROR/OpenAlex/Crossref polite pools — change this |
| HF_TOKEN | (unset) | Optional — higher rate limits on Hugging Face downloads |

Per-task model routing lives in backend/app/services/llm_router.py's TASK_CONFIG dict.
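
An illustrative shape for that dict; the key and field names here are assumptions, not the real TASK_CONFIG:

```python
# Hypothetical shape only; the actual dict in llm_router.py may differ.
TASK_CONFIG = {
    "disambiguate":         {"model": "gpt-4o-mini", "max_output_tokens": 512},
    "structure_authors":    {"model": "gpt-4o-mini", "max_output_tokens": 2048},
    "structure_references": {"model": "gpt-4o-mini", "max_output_tokens": 8192},
}
```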


Storage

For now, everything is on disk under ./data/:

data/
├── uploads/                      # raw uploaded PDFs/DOCXs
├── outputs/
│   ├── {id}_docling.json         # full DoclingDocument
│   ├── {id}_layout.json          # rendered pages + bboxes (PDFs only)
│   ├── {id}_pages/page_NNN.png   # 150 DPI page renders
│   ├── {id}_factsheet.json       # deterministic L1 extraction
│   ├── {id}_metadata.json        # editor-facing metadata + provenance
│   ├── {id}_cost.json            # per-call LLM cost ledger
│   └── {id}_crossref.xml         # generated deposit XML
├── cache/http/                   # diskcache for ORCID/ROR/OpenAlex/Crossref
├── hf-cache/                     # huggingface model weights
└── mgf.db                        # SQLite — submissions table only

To swap storage (S3, GCS, Azure Blob, NFS, etc.): bind-mount ./data/uploads and ./data/outputs to your network storage (works for any POSIX-mountable backend), and replace the SQLite engine in backend/app/db.py with a Postgres / MySQL URL — SQLModel speaks both.

We deliberately did not embed an S3 / cloud-specific client so the project stays portable. Pick what fits your infrastructure.
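
The engine swap itself is a one-liner, since SQLModel re-exports SQLAlchemy's create_engine. A sketch with a hypothetical DSN:

```python
from sqlmodel import create_engine

# Before (default): engine = create_engine("sqlite:///data/mgf.db")
engine = create_engine("postgresql+psycopg://mgf:secret@db:5432/mgf")  # hypothetical DSN
```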


Layout

metadata_gapfixer/
├── docker-compose.yml
├── .env.example
├── README.md  · DESIGN.md
├── LICENSE                                   # AGPL-v3
├── backend/
│   ├── Dockerfile · requirements.txt
│   ├── templates/journal_article.xml.j2
│   └── app/
│       ├── main.py · config.py · db.py · models.py · pipeline.py
│       ├── routes/
│       │   ├── health.py
│       │   └── submissions.py
│       └── services/
│           ├── docling_client.py · page_render.py · sections.py
│           ├── factsheet.py · ner.py
│           ├── scoring.py                    # rubric + dimensions + Research Nexus score
│           ├── autofix.py                    # free deterministic fixers + needs_pick
│           ├── llm_router.py                 # per-task models + cost ledger
│           ├── structurers.py                # five LLM structurers + paper-context aware verifier
│           ├── crossref_xml.py
│           └── enrichers/
│               ├── _base.py · orcid.py · ror.py · openalex.py · crossref.py
│               └── (ORCID name-swap, ROR comma-delete normalisation, etc.)
└── frontend/
    ├── Dockerfile · package.json
    └── src/
        ├── App.tsx · api.ts · theme.ts · icons.tsx · styles.css
        └── pages/
            ├── Upload.tsx
            └── Review.tsx                    # dimension-bucketed scorecard + step-by-step CTAs

Notes / caveats

  • Journal articles only. No book chapters, conference papers, datasets, preprints (as deposit type). Preprint relations on a journal article are supported via the preprint_doi field.
  • No fabrication. The LLM is constrained by structured outputs and explicit candidate lists. ORCIDs, DOIs, RORs, ISSNs come from APIs or PDF text — never made up.
  • Topic-aware verification. The verify_authors structurer now receives the paper's title, abstract excerpt, and OpenAlex concepts, and rejects candidates whose top concepts have zero overlap with the paper — even when the institution matches.
  • XML is well-formedness checked, not XSD-validated. Run Crossref's XML validator (or xmllint --schema crossref5.3.1.xsd ...) before deposit. The template targets schema 5.3.1.
  • LLM provider must support OpenAI structured outputs (response_format with json_schema); a minimal call shape is sketched after this list.
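
For reference, the call shape that requirement implies, with an illustrative schema (the tool's actual prompts and schemas live in structurers.py):

```python
from openai import OpenAI

client = OpenAI()  # honours OPENAI_API_KEY / OPENAI_BASE_URL from .env
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Extract the funder from: ..."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "funding",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {"funder_name": {"type": "string"}},
                "required": ["funder_name"],
                "additionalProperties": False,
            },
        },
    },
)
print(resp.choices[0].message.content)  # JSON string conforming to the schema
```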

Roadmap

Ordered roughly by impact-per-week-of-work for the editorial-tool use case.

Speed & throughput

  • PDF layout memory. Cache layout signatures (page geometry, text-block fingerprints, header bbox, references-section detector results) per publisher / per template. On the second paper from the same journal, the parser short-circuits to the known regions instead of re-running detection end-to-end. Today's three-tier references detector (services/references_layout.py) is the first taste; generalise to authors_layout.py, funding_layout.py, crediting_layout.py.
  • Batch processing of multiple files. Drop a directory or a ZIP of PDFs at the dropzone (or POST /batch) and process N submissions in parallel, with a single combined progress view and a single Crossref XML bundle download. Same parse pipeline as the single-file flow, shared HTTP cache, throughput-bounded by Docling-serve concurrency.
  • CLI for backfill of older records. A mgf-cli ingest <path> that runs the parse + factsheet + free auto-fix end-to-end without the GUI. Useful for re-processing a publisher's archive once layout-aware detection improves, or for retrofitting metadata into legacy DOIs that were deposited with thin records. Output: a JSON report per file + the generated XML, written next to each PDF.

Editorial workflow

  • Per-editor / per-journal preference storage. Capture and replay publisher policy fields that don't live in the PDF: Crossmark policy URL, copyright holder, license-of-record default, depositor email, preferred CRediT taxonomy version, archived-versions URL pattern, funding-text → funder mapping rules. First time an editor sets these for a journal (keyed by ISSN or DOI prefix), they're remembered for every subsequent paper in that journal.
  • needs_pick inline UI. When ORCID or ROR returns N>1 candidates, the data is already in provenance — render the candidate list inline on the field card with [Pick] buttons (free) and an "Adjudicate with AI · ~$0.0002" button. CLAUDE.md's next-slice item.
  • Inter-paper consistency checks. Across a single issue or volume, flag inconsistencies the editor would otherwise miss — author name variants for the same person across papers, mismatched affiliations, inconsistent license, duplicate DOIs, drifted funder labels.
  • Audit log per submission. Append-only ledger of every editor action (confirm, reject, pick, locate, AI call, manual edit) with timestamps. Both for accountability and to surface per-publisher patterns that should become learned rules.

Crossref-side completeness

  • Bump XML template to Crossref schema 5.4.0 (currently 5.3.1). One-file change in templates/journal_article.xml.j2.
  • XSD validation of the generated XML, not just well-formedness. Run schema 5.4.0 inline before download / deposit.
  • Direct deposit to Crossref end-to-end. Sign in once with Crossref credentials and POST /submit straight to their deposit API instead of downloading XML and uploading by hand.
  • Crossref REST diff. Before depositing an update, fetch the existing record from api.crossref.org/works/{doi} and show a diff view so the editor sees exactly what they're about to overwrite.

Robustness & richer signals

  • Local ROR / ORCID daily snapshot. Mirror the public dumps (ROR: ~50 MB/day, ORCID public summary: bulk export) so air-gapped or rate-limited publishers can resolve identifiers without internet egress. Falls back to live API if the snapshot misses.
  • GROBID fallback for scanned / OCR-only PDFs. Docling struggles with poor scans; GROBID's TEI output covers a different failure mode for legacy archives.
  • Multi-language abstract support. Crossref schema 5.4.0 lets you deposit abstracts in multiple languages — surface and structure them separately when present.
  • XMP write-back into the PDF. Embed the corrected metadata into the PDF's XMP packet so downstream tools (institutional repositories, preservation systems) read the same metadata that was deposited.
  • Per-publisher rubric weights. Let the publisher tune the five Research Nexus dimension weights to match their own integrity priorities (e.g. an OA-only publisher may weight Access higher).

Integration

  • Webhook from typesetting pipeline. Trigger POST /submissions from a publisher's CI (Editorial Manager, OJS, Manuscript Manager) the moment a final PDF is generated, so the metadata report is ready before the editor opens it.
  • Multi-tenant auth. If hosted, per-publisher login + scoped data (uploads, profiles, cost ledger) so multiple journals can share one instance.

If you'd like to pick one up or have other priorities, please open an issue at https://github.com/aadivar/metadata_gapfixer/issues.


Credits

Built by the team behind nexus-score.vercel.app. Source on GitHub: https://github.com/aadivar/metadata_gapfixer.

License

AGPL-v3. If you run a modified version of this software on a network-accessible service, you must offer that modified source to the service's users.
