
Metadata Generator

A local diagnostic tool that turns a journal article PDF/DOCX into a Research Nexus completeness scorecard plus a Crossref-ready DOI submission XML. It runs the document through layout analysis, deterministic extraction, free identifier-registry lookups, and opt-in AI enrichment, and shows you in one screen exactly which integrity-relevant fields are present, which can be auto-filled, and which need editorial input — with running cost in USD for every paid LLM call.

A follow-up to the Crossref Research Nexus Score visualised at nexus-score.vercel.app. Where the score tells you what's missing, this tool helps you fix it.

┌────────────────────────────────────────────────────────────────────┐
│  47   RESEARCH NEXUS                                               │
│       Mandatory: 9/9 · Depositable                                 │
│                                                                    │
│   25%  Provenance       █░░░░░░░   13%                             │
│   20%  People           ████░░░░   60%                             │
│   20%  Funding          ░░░░░░░░    0%                             │
│   20%  Access           ███████░   80%                             │
│   15%  Organizations    █████░░░   70%                             │
│                                                                    │
│   [Run automated extraction]  [Run AI enrichment ~$0.025]          │
└────────────────────────────────────────────────────────────────────┘

Screenshots

Scorecard overview

Review page — scorecard overview

The whole paper on one screen. The big number on the left is the Research Nexus score (weighted across Provenance · People · Funding · Access · Organizations); the banner next to it tracks the Mandatory deposit gate ("5/9 fields · Not yet depositable"), which must be satisfied before the deposit XML can be generated. Each dimension bar shows its current score and field-completion ratio.

Below the hero, the six sections (the Mandatory gate plus the five dimensions) are collapsed by default — editors drill into one section at a time. Each header shows the weight badge (GATE/25%/20%/...), the dimension's score, the N/M fields counter, and a · X need attention hint when there's open work, so the collapsed view is enough to plan a session without expanding anything. The hero buttons are the publisher-controlled spend: Run automated extraction (free pass — regex / Docling / free APIs), Run AI enrichment with the upfront cost preview, and Generate Crossref XML, which is gated on the Mandatory banner.

Locate flow with paginated references

Provenance section expanded — locate flow

Expanding a dimension reveals the field cards and the locate-in-document workflow that powers most editorial actions. Pictured: the Provenance dimension after the editor clicked Identify on document on References. The PDF renders inline at the right resolution for box-level selection — drag a lasso to multi-select boxes, shift-drag to deselect, and switch pages without losing the selection on the previous page (multi-page reference lists Just Work). The selected text is processed lookup-first: inline DOI regex → Crossref bibliographic search → OpenAlex match. AI is reserved for the ambiguous leftovers, opt-in.

Notice the layout: the dimension headers keep showing their score and counter even while one is open, so editors can monitor overall progress without scrolling back to the hero. Confirmed fields render as collapsible cards under "Confirmed"; fields that still need attention appear under "Needs attention" with the next-best-action button (Run lookup · free / Identify on document / Confirm / Reject and re-identify) — never more than one primary CTA per state.

Per-author entity views (ORCID, CRediT, ROR)

People + Organizations expanded — entity-level views

The interpretation banner adapts to the score: "Depositable, but the record is thin. It will be hard for indexers to use." — Mandatory is satisfied (so the deposit XML can be generated), but the weighted Research Nexus score is low because Funding and Access are 0% on this paper. The hero score and banner are deliberately separate from the Mandatory gate: the deposit can go through, but the record won't enable discovery, citation graphs, or compliance reporting until the other dimensions are filled.

Pictured: People and Organizations both expanded.

In People, the dimension-owned pillar reads 8/8 authors with ORCID — that's the actual entity count, not a rubric flag. Inside, each field card expands to show the underlying values:

  • ORCID for corresponding author and ORCID for every author show each author with their resolved ORCID chip linking out to the public profile. lookup badges advertise that these were free API resolutions — no LLM cost.
  • CRediT contributor roles is the AI structurer's output: per author, the 14-role CRediT taxonomy assignment, each role accompanied by an evidence quote pulled from the PDF and a confidence percentage. Editors can scan the evidence to verify the LLM's mapping rather than trust it blindly.

In Organizations, every author's affiliation is linked to a ROR chip cross-referenced against the Research Organization Registry. The list deduplicates so three authors at the same institution count as one resolved affiliation, not three (a common scoring trap fixed earlier in the project's history). Each row shows the affiliation string, its ROR ID, and the link out — the same shape the deposit XML's <affiliation><institution-id type="ror"> element wants.
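
The dedup rule is small enough to state in a few lines. A minimal sketch, assuming each resolved affiliation carries a ror_id field (the record shape and the ROR IDs below are illustrative, not the tool's internal model):

```python
# Illustrative only: record shape and ROR IDs are placeholders.
affiliations = [
    {"author": "A. Rao",   "raw": "Dept. of CS, IIT Delhi", "ror_id": "https://ror.org/0000000a"},
    {"author": "B. Singh", "raw": "IIT Delhi",              "ror_id": "https://ror.org/0000000a"},
    {"author": "C. Kumar", "raw": "IISc Bangalore",         "ror_id": "https://ror.org/0000000b"},
]

# Three authors at one institution count as ONE resolved affiliation.
resolved = {a["ror_id"] for a in affiliations if a.get("ror_id")}
print(f"{len(resolved)} unique affiliations resolved")  # 2, not 3
```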

This is the "show the data, not just the verdict" rule: every dimension score is decomposable down to the actual extracted values, with their provenance, so the editor can audit before depositing.


Categorisation: one Mandatory gate + five Research Nexus dimensions

Aligned with the Crossref Research Nexus framing and the weighting used by nexus-score.vercel.app.

| Bucket | Weight | What it covers |
| --- | --- | --- |
| Mandatory gate | Gate | DOI, title, journal, ISSN, year, ≥1 author, full pub date, vol/issue/pages, copyright. The Crossref deposit minimum — must be satisfied before deposit. |
| Provenance | 25% | References, refs-with-DOI, preprint→VoR link, Crossmark, conflict-of-interest, data/code availability. |
| People | 20% | Full author names, ORCID for corresponding author, ORCID for every author, CRediT contributor roles. |
| Funding | 20% | Funder Registry DOI, award/grant numbers. |
| Access | 20% | Abstract (plain + JATS), license, OA indicator, plain-language summary. |
| Organizations | 15% | Affiliations extracted, ROR for every affiliation. |

The hero Research Nexus score is the weight-averaged percentage across the five dimensions. The Mandatory gate decides whether the record is depositable at all.
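
As a back-of-the-envelope check, the weighting reduces to a few lines. A minimal sketch, assuming each dimension reports a 0–100 completeness percentage; the real scorer in services/scoring.py may differ in sub-field weighting and rounding:

```python
# Dimension weights from the table above; input values are illustrative.
WEIGHTS = {
    "provenance": 0.25,
    "people": 0.20,
    "funding": 0.20,
    "access": 0.20,
    "organizations": 0.15,
}

def nexus_score(dims: dict[str, float]) -> int:
    """Weighted average of the five dimension percentages (0-100)."""
    return round(sum(WEIGHTS[name] * pct for name, pct in dims.items()))

print(nexus_score({"provenance": 40, "people": 75, "funding": 0,
                   "access": 80, "organizations": 70}))  # -> 52
# The Mandatory gate is tracked separately and never enters this average.
```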


Architecture

Five layers, each building on the one below:

| Layer | What | Cost | Trigger |
| --- | --- | --- | --- |
| L0 · Docling layout | PDF/DOCX → structured JSON + per-element bboxes + page renders | $0 | Always (parse stage) |
| L1 · Deterministic factsheet | Regex sweep (DOI / ORCID / ROR / ISSN / arXiv / license / grant patterns / preprint DOIs) + PDF /Info + header parser (authors + affiliation marker map) + boilerplate anchor matching (funding / CoI / data availability / ethics) | $0 | Always (parse stage) |
| L2 · Free enricher APIs | ORCID public API · ROR v2 · OpenAlex · Crossref REST. Now with affiliation normalisation (Solr-style alternatives, Indian Institute of Technology, Delhi → Indian Institute of Technology Delhi), name-swap fallback for Indian/Telugu profiles, and ROR clear-winner auto-accept (top score ≥ 0.95 with a 0.10+ margin). | $0 | Auto-fix (publisher click) |
| L3 · LLM picker | When the editor explicitly opts in, runs ONE structured-output call per ambiguous field that picks among the API-returned candidates with reasoning + confidence. | ~$0.0002/call | Per-field "Adjudicate with AI" |
| L4 · LLM structurer | Higher-leverage: takes a raw content region (or editor-located text via text_override) and returns a clean structured record. Five named tasks: structure_authors, structure_references, structure_funding, structure_credit, verify_authors. The verifier now also receives the paper's title + abstract excerpt + OpenAlex concepts so it can reject candidates whose research domain doesn't match. | ~$0.0003–$0.013/call | Per-field "Identify on document" with AI cost meta-pill |

Cost rule: the LLM is never called automatically. Every paid call is the publisher's deliberate choice and is recorded in a per-submission USD ledger.
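
A minimal sketch of what a ledger append can look like; the field names here are assumptions, not the actual {id}_cost.json schema:

```python
import json
import time
from pathlib import Path

def record_llm_call(submission_id: str, task: str, usd: float,
                    outputs: Path = Path("data/outputs")) -> float:
    """Append one paid call to the per-submission ledger; return the running total."""
    ledger_path = outputs / f"{submission_id}_cost.json"
    ledger = json.loads(ledger_path.read_text()) if ledger_path.exists() else []
    ledger.append({"task": task, "usd": usd, "ts": time.time()})
    ledger_path.write_text(json.dumps(ledger, indent=2))
    return sum(entry["usd"] for entry in ledger)
```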

Side path · GLiNER2 NER. A local zero-shot named-entity recogniser (fastino/gliner2-large-v1 by default, fastino/gliner2-base-v1 for smaller hosts) is mounted at POST /submissions/{id}/ner for ad-hoc entity extraction over a chosen text region. Per-zone label presets (e.g. funding zone → award_id + funder_name; CoI zone → person + organization + relationship; affiliation zone → organization + country + gpe) live in services/ner.py. GLiNER runs entirely on-device, no network calls, no LLM cost — it's the cheap layer for "find ORCIDs in this paragraph" or "pull award IDs from this funding statement" when neither regex nor a structured LLM call is the right tool. Weights download once into ./data/hf-cache/ (~1.4 GB) on first call.
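
A hedged example of hitting that endpoint from a script. The backend port and the request-body shape (text plus a label list) are assumptions; check the route in routes/submissions.py for the actual contract:

```python
import requests

resp = requests.post(
    "http://localhost:8000/submissions/abc123/ner",   # backend port assumed
    json={
        "text": "This work was supported by DST grant SR/2021/001234.",  # sample text
        "labels": ["award_id", "funder_name"],        # funding-zone preset
    },
    timeout=300,  # first call downloads ~1.4 GB of model weights
)
print(resp.json())
```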

For a typical paper, full premium processing tops out around $0.025. Standard auto-fix (no LLM) is $0.


Pipeline

  1. Upload PDF/DOCX → POST /submissions. Status flows uploaded → parsing → parsed.
  2. Parse runs once per upload, in the background:
    • docling-serve /v1/convert/file → structured JSON + markdown.
    • PyMuPDF renders each page to PNG at 150 DPI; bboxes are mapped from PDF points → image pixels for the layout overlay.
    • factsheet.py runs the deterministic L1 extraction.
  3. Score is computed on demand from the factsheet + saved metadata. Returned by GET /submissions/{id}/score — includes per-dimension scores, the Research Nexus weighted score, the Mandatory gate state, and the entity-count pillars (e.g. 7/9 authors with ORCID, 3/8 affiliations with ROR).
  4. Auto-fix (free, deterministic):
    • POST /submissions/{id}/autofix/all runs every high-impact fixer in one click (the hero CTA).
  5. Identify on document (the unified per-field action when something is missing): editor selects boxes containing the value; the backend transparently routes to either field-aware regex (for deterministic fields) or the LLM structurer with text_override (for AI-leverage fields), with an upfront cost-confirm dialog.
  6. Confirm / Reject — POST /submissions/{id}/confirm flips provenance.confirmed=true; POST .../reject flips back to needs_locate to prompt re-identification.
  7. Generate XML — POST /submissions/{id}/xml builds, GET .../xml downloads.
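
The same flow can be scripted over HTTP. A minimal sketch against the endpoints listed below; the backend port and the file/id field names are assumptions:

```python
import time
import requests

BASE = "http://localhost:8000"  # backend port assumed

with open("paper.pdf", "rb") as fh:                       # multipart field name assumed
    sub = requests.post(f"{BASE}/submissions", files={"file": fh}).json()
sid = sub["id"]                                           # response field name assumed

while requests.get(f"{BASE}/submissions/{sid}").json()["status"] != "parsed":
    time.sleep(2)                                         # uploaded -> parsing -> parsed

requests.post(f"{BASE}/submissions/{sid}/autofix/all")    # free deterministic pass
print(requests.get(f"{BASE}/submissions/{sid}/score").json())

# Build + download the deposit XML (only succeeds once the Mandatory gate is met).
requests.post(f"{BASE}/submissions/{sid}/xml")
open("deposit.xml", "wb").write(requests.get(f"{BASE}/submissions/{sid}/xml").content)
```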

API surface

| Method | Path | Purpose |
| --- | --- | --- |
| POST | /submissions | Multipart upload, kicks off parse |
| GET | /submissions | List submissions + status |
| GET | /submissions/{id} | Status of one submission |
| DELETE | /submissions/{id} | Remove submission + all generated files |
| GET | /submissions/{id}/factsheet | Deterministic L1 output |
| GET | /submissions/{id}/score | Dimension rubric + Research Nexus score + entity pillars |
| GET | /submissions/{id}/sections | Layout-derived sections |
| GET | /submissions/{id}/sections/{n} | One section with full text |
| GET | /submissions/{id}/markdown | Full Docling markdown |
| GET | /submissions/{id}/pages | Per-page render dimensions |
| GET | /submissions/{id}/pages/{n}/image | Page PNG |
| GET | /submissions/{id}/pages/{n}/boxes | Per-page Docling bboxes + text |
| GET | /submissions/{id}/references_layout | Three-tier references detection result |
| GET | /submissions/{id}/provenance | Per-field source / confidence / reasoning |
| GET | /submissions/{id}/cost | LLM cost ledger for this submission |
| GET | /submissions/{id}/metadata | Saved metadata JSON |
| PUT | /submissions/{id}/metadata | Save edited metadata |
| POST | /submissions/{id}/autofix | One deterministic fixer (free) |
| POST | /submissions/{id}/autofix/all | Run every high-impact fixer (free) |
| POST | /submissions/{id}/pick | Manual editor pick from candidate list (free) |
| POST | /submissions/{id}/disambiguate/estimate | Preview LLM cost (free) |
| POST | /submissions/{id}/disambiguate | Opt-in LLM picker (~$0.0002/call) |
| POST | /submissions/{id}/structure/{task}/estimate | Per-task cost preview |
| POST | /submissions/{id}/structure/{task} | Opt-in LLM structurer; optional body {"text_override": "..."} for editor-located source text |
| POST | /submissions/{id}/enrich/all | Run all premium structurers in sequence |
| POST | /submissions/{id}/confirm | Editor confirms a field |
| POST | /submissions/{id}/reject | Editor rejects a field → needs_locate |
| POST | /submissions/{id}/locate | Editor pointed to box(es); regex extraction for deterministic fields |
| POST | /submissions/{id}/ner | On-device GLiNER2 zero-shot NER over a text region |
| GET | /submissions/presets/labels | NER label presets per content zone |
| POST | /submissions/{id}/xml | Build Crossref XML |
| GET | /submissions/{id}/xml | Download XML |
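
The estimate-then-run pattern for the paid endpoints looks like this in practice. A sketch only; the estimate response shape is an assumption beyond the documented cost preview:

```python
import requests

BASE = "http://localhost:8000"  # backend port assumed
SID = "abc123"

# Free preview first: this is the number shown on the cost meta-pill.
est = requests.post(f"{BASE}/submissions/{SID}/structure/structure_funding/estimate").json()
print("quoted cost:", est)

# Opt-in paid call, with editor-located text passed as text_override.
requests.post(
    f"{BASE}/submissions/{SID}/structure/structure_funding",
    json={"text_override": "This work was funded by the XYZ Foundation, grant 12345."},
)
```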

GUI

Vite + React + react-router. Restrained scholarly palette (parchment / inkwell / muted-stone / onyx-orange accent), shadcn-style icons (lucide), Lato body + JetBrains Mono code, 4/8px radii, compact density. Light is the canonical theme; dark is a derived inverse.

Two pages:

  • Submissions (/upload) — drag-and-drop dropzone, status pills, per-row delete.
  • Review (/review/:id) — top-to-bottom flow:
    1. Hero — Research Nexus score (weighted), Mandatory gate banner ("9/9 fields · Depositable" or "Not yet depositable"), five dimension bars with weight labels, action buttons (Run automated extraction · Run AI enrichment ~$0.0NN · Generate Crossref XML).
    2. Sticky dimension nav — chip per dimension with current score, jumps to that section.
    3. Per-dimension sections — Mandatory, then Provenance / People / Funding / Access / Organizations. Each has a strong header (weight badge, title, description, score number, progress bar), a dimension-owned entity-progress strip when applicable (e.g. People shows "7/9 authors with ORCID"), and field cards split into two sub-buckets:
      • Needs attention — everything not confirmed
      • Confirmed — green-edged cards collapsed beneath
    4. Field cards — each card is collapsed by default; click to expand the detail panel showing the structured data inline:
      • Author-related fields (full names, ORCIDs, affiliations, RORs) expand to show a per-author list with ORCID / ROR / affiliation chips and AI evidence chains.
      • CRediT contributor roles expand to show per-author roles with evidence quotes from the contribution paragraph and confidence %.
    5. Step-by-step CTAs — at most one primary action per state. Missing fields show a single Identify on document button (with a cost meta-pill if AI-leverage). Pending fields show Confirm / Reject. Confirmed fields show only Reject and re-identify. The hero owns the global Run automated extraction (free pass) and Run AI enrichment (priced pass) buttons.

Quick start

git clone https://github.com/aadivar/metadata_gapfixer
cd metadata_gapfixer
cp .env.example .env                # edit OPENAI_API_KEY + CONTACT_EMAIL
docker compose up -d --build

Open http://localhost:3000.

First boot downloads:

  • ghcr.io/docling-project/docling-serve:latest (~2 GB)
  • GLiNER2 (fastino/gliner2-large-v1) into ./data/hf-cache/ (~1.4 GB)

Tail logs with docker compose logs -f backend.

Set CONTACT_EMAIL in .env — it's used in the User-Agent for ORCID, ROR, OpenAlex, and Crossref polite-pool routing. Identified clients get much higher rate limits.
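
Polite-pool identification is just a User-Agent with a reachable mailto. A sketch of the convention (the exact format string is this project's choice; the DOI below is a placeholder):

```python
import os
import requests

headers = {
    "User-Agent": f"metadata_gapfixer/1.0 (mailto:{os.environ['CONTACT_EMAIL']})"
}
# Placeholder DOI; identified clients are routed to Crossref's polite pool.
requests.get("https://api.crossref.org/works/10.1234/placeholder", headers=headers)
```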


Configuration

All knobs live in .env. See .env.example for working snippets per provider (OpenAI, OpenRouter, Anthropic-compat, Groq, Ollama, LiteLLM).

| Var | Default | Notes |
| --- | --- | --- |
| OPENAI_BASE_URL | https://api.openai.com/v1 | Any OpenAI-compatible endpoint |
| OPENAI_API_KEY | (required) | Provider API key |
| OPENAI_MODEL | gpt-4o-mini | Must support response_format=json_schema |
| GLINER_MODEL | fastino/gliner2-large-v1 | Use fastino/gliner2-base-v1 (205M) on small hosts |
| CONTACT_EMAIL | anonymous@example.org | Sent in User-Agent for ORCID/ROR/OpenAlex/Crossref polite pools — change this |
| HF_TOKEN | (unset) | Optional — higher rate limits on Hugging Face downloads |

Per-task model routing lives in backend/app/services/llm_router.py's TASK_CONFIG dict.
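
An illustrative shape for that dict; the key and field names here are assumptions, not the real TASK_CONFIG:

```python
# Hypothetical shape only; the actual dict in llm_router.py may differ.
TASK_CONFIG = {
    "disambiguate":         {"model": "gpt-4o-mini", "max_output_tokens": 512},
    "structure_authors":    {"model": "gpt-4o-mini", "max_output_tokens": 2048},
    "structure_references": {"model": "gpt-4o-mini", "max_output_tokens": 8192},
}
```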


Storage

For now, everything is on disk under ./data/:

data/
├── uploads/                      # raw uploaded PDFs/DOCXs
├── outputs/
│   ├── {id}_docling.json         # full DoclingDocument
│   ├── {id}_layout.json          # rendered pages + bboxes (PDFs only)
│   ├── {id}_pages/page_NNN.png   # 150 DPI page renders
│   ├── {id}_factsheet.json       # deterministic L1 extraction
│   ├── {id}_metadata.json        # editor-facing metadata + provenance
│   ├── {id}_cost.json            # per-call LLM cost ledger
│   └── {id}_crossref.xml         # generated deposit XML
├── cache/http/                   # diskcache for ORCID/ROR/OpenAlex/Crossref
├── hf-cache/                     # huggingface model weights
└── mgf.db                        # SQLite — submissions table only

To swap storage (S3, GCS, Azure Blob, NFS, etc.): bind-mount ./data/uploads and ./data/outputs to your network storage (works for any POSIX-mountable backend), and replace the SQLite engine in backend/app/db.py with a Postgres / MySQL URL — SQLModel speaks both.

We deliberately did not embed an S3 / cloud-specific client so the project stays portable. Pick what fits your infrastructure.
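
The engine swap itself is a one-liner, since SQLModel re-exports SQLAlchemy's create_engine. A sketch with a hypothetical DSN:

```python
from sqlmodel import create_engine

# Before (default): engine = create_engine("sqlite:///data/mgf.db")
engine = create_engine("postgresql+psycopg://mgf:secret@db:5432/mgf")  # hypothetical DSN
```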


Layout

metadata_gapfixer/
├── docker-compose.yml
├── .env.example
├── README.md  · DESIGN.md
├── LICENSE                                   # AGPL-v3
├── backend/
│   ├── Dockerfile · requirements.txt
│   ├── templates/journal_article.xml.j2
│   └── app/
│       ├── main.py · config.py · db.py · models.py · pipeline.py
│       ├── routes/
│       │   ├── health.py
│       │   └── submissions.py
│       └── services/
│           ├── docling_client.py · page_render.py · sections.py
│           ├── factsheet.py · ner.py
│           ├── scoring.py                    # rubric + dimensions + Research Nexus score
│           ├── autofix.py                    # free deterministic fixers + needs_pick
│           ├── llm_router.py                 # per-task models + cost ledger
│           ├── structurers.py                # five LLM structurers + paper-context aware verifier
│           ├── crossref_xml.py
│           └── enrichers/
│               ├── _base.py · orcid.py · ror.py · openalex.py · crossref.py
│               └── (ORCID name-swap, ROR comma-delete normalisation, etc.)
└── frontend/
    ├── Dockerfile · package.json
    └── src/
        ├── App.tsx · api.ts · theme.ts · icons.tsx · styles.css
        └── pages/
            ├── Upload.tsx
            └── Review.tsx                    # dimension-bucketed scorecard + step-by-step CTAs

Notes / caveats

  • Journal articles only. No book chapters, conference papers, datasets, preprints (as deposit type). Preprint relations on a journal article are supported via the preprint_doi field.
  • No fabrication. The LLM is constrained by structured outputs and explicit candidate lists. ORCIDs, DOIs, RORs, ISSNs come from APIs or PDF text — never made up.
  • Topic-aware verification. The verify_authors structurer now receives the paper's title, abstract excerpt, and OpenAlex concepts, and rejects candidates whose top concepts have zero overlap with the paper — even when the institution matches.
  • XML is well-formedness checked, not XSD-validated. Run Crossref's XML validator (or xmllint --schema crossref5.3.1.xsd ...) before deposit. The template targets schema 5.3.1.
  • LLM provider must support OpenAI structured outputs (response_format with json_schema); a minimal call shape is sketched after this list.
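
For reference, the call shape that requirement implies, with an illustrative schema (the tool's actual prompts and schemas live in structurers.py):

```python
from openai import OpenAI

client = OpenAI()  # honours OPENAI_API_KEY / OPENAI_BASE_URL from .env
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Extract the funder from: ..."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "funding",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {"funder_name": {"type": "string"}},
                "required": ["funder_name"],
                "additionalProperties": False,
            },
        },
    },
)
print(resp.choices[0].message.content)  # JSON string conforming to the schema
```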

Roadmap

Ordered roughly by impact-per-week-of-work for the editorial-tool use case.

Speed & throughput

  • PDF layout memory. Cache layout signatures (page geometry, text-block fingerprints, header bbox, references-section detector results) per publisher / per template. On the second paper from the same journal, the parser short-circuits to the known regions instead of re-running detection end-to-end. Today's three-tier references detector (services/references_layout.py) is the first taste; generalise to authors_layout.py, funding_layout.py, crediting_layout.py.
  • Batch processing of multiple files. Drop a directory or a ZIP of PDFs at the dropzone (or POST /batch) and process N submissions in parallel, with a single combined progress view and a single Crossref XML bundle download. Same parse pipeline as the single-file flow, shared HTTP cache, throughput-bounded by Docling-serve concurrency.
  • CLI for backfill of older records. A mgf-cli ingest <path> that runs the parse + factsheet + free auto-fix end-to-end without the GUI. Useful for re-processing a publisher's archive once layout-aware detection improves, or for retrofitting metadata into legacy DOIs that were deposited with thin records. Output: a JSON report per file + the generated XML, written next to each PDF.

Editorial workflow

  • Per-editor / per-journal preference storage. Capture and replay publisher policy fields that don't live in the PDF: Crossmark policy URL, copyright holder, license-of-record default, depositor email, preferred CRediT taxonomy version, archived-versions URL pattern, funding-text → funder mapping rules. First time an editor sets these for a journal (keyed by ISSN or DOI prefix), they're remembered for every subsequent paper in that journal.
  • needs_pick inline UI. When ORCID or ROR returns N>1 candidates, the data is already in provenance — render the candidate list inline on the field card with [Pick] buttons (free) and an "Adjudicate with AI · ~$0.0002" button. CLAUDE.md's next-slice item.
  • Inter-paper consistency checks. Across a single issue or volume, flag inconsistencies the editor would otherwise miss — author name variants for the same person across papers, mismatched affiliations, inconsistent license, duplicate DOIs, drifted funder labels.
  • Audit log per submission. Append-only ledger of every editor action (confirm, reject, pick, locate, AI call, manual edit) with timestamps. Both for accountability and to surface per-publisher patterns that should become learned rules.

Crossref-side completeness

  • Bump XML template to Crossref schema 5.4.0 (currently 5.3.1). One-file change in templates/journal_article.xml.j2.
  • XSD validation of the generated XML, not just well-formedness. Run schema 5.4.0 inline before download / deposit.
  • Direct deposit to Crossref end-to-end. Sign in once with Crossref credentials and POST /submit straight to their deposit API instead of downloading XML and uploading by hand.
  • Crossref REST diff. Before depositing an update, fetch the existing record from api.crossref.org/works/{doi} and show a diff view so the editor sees exactly what they're about to overwrite.

Robustness & richer signals

  • Local ROR / ORCID daily snapshot. Mirror the public dumps (ROR: ~50 MB/day, ORCID public summary: bulk export) so air-gapped or rate-limited publishers can resolve identifiers without internet egress. Falls back to live API if the snapshot misses.
  • GROBID fallback for scanned / OCR-only PDFs. Docling struggles with poor scans; GROBID's TEI output covers a different failure mode for legacy archives.
  • Multi-language abstract support. Crossref schema 5.4.0 lets you deposit abstracts in multiple languages — surface and structure them separately when present.
  • XMP write-back into the PDF. Embed the corrected metadata into the PDF's XMP packet so downstream tools (institutional repositories, preservation systems) read the same metadata that was deposited.
  • Per-publisher rubric weights. Let the publisher tune the five Research Nexus dimension weights to match their own integrity priorities (e.g. an OA-only publisher may weight Access higher).

Integration

  • Webhook from typesetting pipeline. Trigger POST /submissions from a publisher's CI (Editorial Manager, OJS, Manuscript Manager) the moment a final PDF is generated, so the metadata report is ready before the editor opens it.
  • Multi-tenant auth. If hosted, per-publisher login + scoped data (uploads, profiles, cost ledger) so multiple journals can share one instance.

If you'd like to pick one up or have other priorities, please open an issue at https://github.com/aadivar/metadata_gapfixer/issues.


Credits

Built by the team behind nexus-score.vercel.app. Source on GitHub: https://github.com/aadivar/metadata_gapfixer.

License

AGPL-v3. If you run a modified version of this software on a network-accessible service, you must offer that modified source to the service's users.
