The default model in this repository is SAVER-SIS (main model), with an optional SAVER-RL (RES) extension.
SAVER follows three core principles:
- Use vision only when the current entity (MNER) or marked entity pair (MRE) is likely to be visually groundable.
- When vision is activated, acquire only a small and complementary multi-image evidence set.
- Use a unified scoring head to combine text and optional visual evidence.
The full pipeline has four stages:
- Encoding: text encoder produces token/span representations; vision encoder first produces global image vectors.
- CGG (Conformal Groundability Gate): unit-level routing decision for whether visual evidence is needed.
- Evidence Constructor (SIS or RES) + Set Transformer: after activation, select up to K images and aggregate region evidence.
- Energy-Inspired Joint Scoring: unified scoring for MNER (span/type) or MRE (relation).
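
The control flow of the four stages can be sketched as below. This is only an illustration of the routing logic, not the repository's implementation: `saver_forward` and all five callables (`encode_text`, `encode_images`, `gate`, `select`, `score`) are hypothetical names, and the Set Transformer aggregation is assumed to live inside `score`.

```python
def saver_forward(unit, images, encode_text, encode_images, gate, select, score, k=3):
    """Hypothetical end-to-end pass for one unit (candidate span or entity pair)."""
    text_repr = encode_text(unit)            # stage 1: text representation
    global_vecs = encode_images(images)      # stage 1: global image vectors
    if gate(text_repr, global_vecs):         # stage 2: CGG routing decision
        chosen = select(text_repr, global_vecs, k)   # stage 3: at most k images
        evidence = [images[i] for i in chosen[:k]]
    else:
        evidence = []                        # gate closed: text-only scoring
    return score(text_repr, evidence)        # stage 4: joint scoring
```

When the gate is closed, no evidence is constructed and the scoring head receives an empty evidence list.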
- Input: text + image set.
- Output: entity spans and entity types.
- SAVER performs CGG and evidence selection at candidate-span granularity.
- Input: text with a marked head/tail entity pair + image set.
- Output: relation label of the marked pair.
- SAVER performs pair-level routing and evidence selection (pair gate is derived from entity gates by default).
- Computes a groundability score `g(u)` using global image vectors and text-image similarity statistics.
- Uses threshold-based hard routing at inference: `γ(u) = 1[g(u) ≥ τ]`.
- Chooses `τ` on a calibration split via a Clopper-Pearson upper-bound constraint under target risk `α`.
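
A minimal sketch of how such a threshold could be calibrated, using only the standard library. It rests on several assumptions that are not from this repository: `errors[i]` flags whether unit `i` would be an error if activated, `delta` is a hypothetical confidence level for the one-sided bound, and the function names are illustrative. Scanning many thresholds without a multiplicity correction is a simplification.

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed in log space for stability."""
    total = 0.0
    for i in range(k + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_pmf)
    return min(total, 1.0)

def cp_upper(k, n, delta=0.05):
    """One-sided Clopper-Pearson upper bound on an error rate, from k errors in n trials."""
    if n == 0 or k >= n:
        return 1.0
    lo, hi = k / n, 1.0
    for _ in range(60):                 # bisection on p: the CDF decreases in p
        mid = (lo + hi) / 2
        if binom_cdf(k, n, mid) > delta:
            lo = mid
        else:
            hi = mid
    return hi

def calibrate_threshold(scores, errors, alpha=0.10, delta=0.05):
    """Return the lowest tau whose activated set {g(u) >= tau} has CP bound <= alpha."""
    pairs = sorted(zip(scores, errors), key=lambda p: p[0], reverse=True)
    best_tau, n, k, i = None, 0, 0, 0
    while i < len(pairs):
        tau = pairs[i][0]
        while i < len(pairs) and pairs[i][0] == tau:   # add all units tied at tau
            n += 1
            k += int(pairs[i][1])
            i += 1
        if cp_upper(k, n, delta) <= alpha:
            best_tau = tau
    return best_tau                      # None means no threshold meets the target

def route(g_u, tau):
    """Hard routing at inference: gamma(u) = 1[g(u) >= tau]."""
    return int(tau is not None and g_u >= tau)
```

Returning the lowest admissible threshold maximizes activation coverage subject to the risk constraint; if no threshold qualifies, the gate stays closed for every unit.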
- When gate is active, selects at most K images from N candidates.
- Objective balances relevance and coverage/diversity.
- Uses a greedy approximation (`1-1/e`) for efficient selection.
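
The selection step can be illustrated with a small greedy routine. The objective here is an assumption, not the repository's exact formulation: a weighted sum of per-image relevance and a facility-location coverage term, which is monotone submodular for non-negative inputs and therefore compatible with the `1-1/e` greedy guarantee. `greedy_select`, `relevance`, `similarity`, and `lam` are hypothetical names.

```python
def greedy_select(relevance, similarity, k, lam=0.5):
    """Greedily pick up to k of N images, trading off relevance and coverage.

    relevance[i]     : non-negative relevance of image i to the current unit
    similarity[j][i] : non-negative similarity between images j and i
    """
    n = len(relevance)
    selected = []
    covered = [0.0] * n      # best similarity of each candidate to the selected set
    for _ in range(min(k, n)):
        best_gain, best_i = 0.0, None
        for i in range(n):
            if i in selected:
                continue
            # Marginal coverage gain: how much image i improves each candidate's coverage.
            cov_gain = sum(max(similarity[j][i] - covered[j], 0.0) for j in range(n))
            gain = lam * relevance[i] + (1 - lam) * cov_gain
            if gain > best_gain:
                best_gain, best_i = gain, i
        if best_i is None:   # no image adds positive value
            break
        selected.append(best_i)
        covered = [max(covered[j], similarity[j][best_i]) for j in range(n)]
    return selected
```

With two near-duplicate relevant images and one dissimilar image, the routine keeps one of the duplicates and then prefers the dissimilar image, which is the complementary behavior described above.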
- Formulates evidence acquisition as sequential decision making with a STOP action.
- Uses cost-aware rewards with CGG-based action masking.
- Reported as an extension in ablations; SIS remains the default model in main tables.
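
To make the decision process concrete, here is a toy episode loop with a STOP action, a per-image cost, and gate-based action masking. Everything in it is an assumption for illustration: the reward shape (`gain - cost`), the names `STOP`, `valid_actions`, and `run_episode`, and especially the greedy rule that stands in for the learned policy.

```python
STOP = -1  # hypothetical id for the STOP action

def valid_actions(num_images, selected, gate_active, max_k):
    """Action mask: only STOP is allowed when the gate is closed or the budget is used."""
    if not gate_active or len(selected) >= max_k:
        return [STOP]
    return [STOP] + [i for i in range(num_images) if i not in selected]

def run_episode(gains, gate_active, max_k=3, cost=0.1):
    """Roll out one episode with a greedy stand-in policy.

    gains[i] is an assumed per-image evidence gain; picking image i yields
    reward gains[i] - cost, and STOP yields reward 0.
    """
    def reward(action):
        return 0.0 if action == STOP else gains[action] - cost

    selected, total = [], 0.0
    while True:
        actions = valid_actions(len(gains), selected, gate_active, max_k)
        action = max(actions, key=reward)   # ties resolve to STOP (listed first)
        if action == STOP:
            break
        selected.append(action)
        total += reward(action)
    return selected, total
```

A trained policy would replace the `max(...)` line; the masking and the cost term are what keep the evidence set small.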
- Uses standard cross-entropy training (energy notation is used for unified formulation).
- Combines task score, text-vision consistency term, and gate sparsity term.
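
One way to read the two bullets above is as cross-entropy over negated energies plus two weighted regularizers. The sketch below assumes that reading; the weights `lam_c` and `lam_s`, the mean-gate sparsity penalty, and the precomputed `consistency` value are hypothetical and not taken from the repository.

```python
import math

def joint_loss(energies, gold, consistency, gate_probs, lam_c=0.1, lam_s=0.01):
    """Cross-entropy on p(y) = softmax(-E(y)) plus consistency and sparsity terms.

    energies    : one energy per label (lower energy = more likely label)
    gold        : index of the gold label
    consistency : precomputed text-vision consistency penalty (assumed scalar)
    gate_probs  : soft gate activations for the units in the example
    """
    logits = [-e for e in energies]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))  # stable log-sum-exp
    cross_entropy = log_z - logits[gold]
    sparsity = sum(gate_probs) / len(gate_probs)                # mean activation
    return cross_entropy + lam_c * consistency + lam_s * sparsity
```

Because the energies enter only through a softmax, this is ordinary cross-entropy training written in energy notation.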
- MRE: MNRE, MRE-MI
- MNER: Twitter-2015, Twitter-2017, MNER-MI, MNER-MI-Plus
Core dataset scales used in the main text:
| Dataset | Task | Train / Dev / Test | Avg. images |
|---|---|---|---|
| MNRE (v2) | RE | 12,247 / 1,624 / 1,614 | 1.00 |
| MRE-MI | RE | 13,504 / 4,500 / 4,500 | 2.80 |
| Twitter-2017 | MNER | 3,373 / 723 / 723 | 1.00 |
| MNER-MI-Plus | MNER | 10,229 / 1,583 / 1,583 | 2.15 |
- MRE: micro-F1 / Precision / Recall
- MNER: strict entity-level F1 (boundary + type)
- Selectivity: Risk-Activation-Coverage curves, AURC, ActCov@0.10
- Efficiency: FLOPs/sample, end-to-end P90 latency
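
For reference, strict entity-level scoring counts a prediction as correct only when both boundaries and the type match a gold entity. The sketch below assumes entities are given as `(start, end, type)` tuples pooled over the evaluation set; the function name is illustrative and this is not the repository's evaluation script.

```python
def strict_prf(gold, pred):
    """Precision, recall, and F1 under exact (start, end, type) matching."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                       # exact boundary + type matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, a prediction with the right span but the wrong type counts as both a false positive and a false negative. The same counting applied to relation labels pooled across instances gives micro-averaged scores.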

Comparison of baselines and SAVER ablations (↑ higher is better, ↓ lower is better):

| Method | P↑ | R↑ | F1↑ | AURC↓ | ActCov@0.10↑ | FLOPs (G/sample)↓ | P90 (ms)↓ |
|---|---|---|---|---|---|---|---|
| ModernBERT-only | 82.37 | 79.84 | 81.09 | 0.147 | 0.68 | 13 | 17 |
| DeBERTa-v3-only | 81.53 | 79.48 | 80.49 | 0.153 | 0.66 | 18 | 27 |
| HVPNeT | 73.87 | 76.82 | 75.32 | 0.168 | 0.63 | 66 | 99 |
| RSRNeT | 84.78 | 83.06 | 83.89 | 0.129 | 0.74 | 60 | 90 |
| All-Images Attn. | 83.47 | 82.18 | 82.82 | 0.142 | 0.72 | 62 | 93 |
| Top-K by relevance | 85.31 | 83.62 | 84.45 | 0.119 | 0.77 | 51 | 77 |
| GLRA | 85.23 | 83.81 | 84.51 | 0.117 | 0.78 | 56 | 84 |
| Retrieval-Aug. | 84.27 | 82.86 | 83.56 | 0.124 | 0.75 | 55 | 82 |
| SAVER (full) | 85.93 | 84.57 | 85.24 | 0.104 | 0.82 | 36 | 54 |
| SAVER w/o CGG | 84.46 | 83.18 | 83.81 | 0.124 | 0.76 | 51 | 77 |
| SAVER w/o SIS | 84.13 | 82.74 | 83.43 | 0.136 | 0.74 | 62 | 93 |
| SAVER w/o J.Score | 85.28 | 84.12 | 84.70 | 0.111 | 0.80 | 37 | 56 |
| SAVER (CGG+SIS) | 85.14 | 84.33 | 84.73 | 0.107 | 0.81 | 35 | 53 |

SAVER versus the strongest baseline on each dataset:

| Dataset | Strong baseline | SAVER | Gain |
|---|---|---|---|
| MNRE | 83.9 (RSRNeT) | 84.7 | +0.8 |
| Twitter-2015 | 76.5 (RSRNeT) | 77.0 | +0.5 |
| Twitter-2017 | 87.9 (RSRNeT) | 88.0 | +0.1 |
| MNER-MI | 76.9 (GLRA-adapt) | 77.3 | +0.4 |
| MNER-MI-Plus | 83.1 (GLRA-adapt) | 83.7 | +0.6 |
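
Gate calibration per dataset: activation coverage, empirical error, and Clopper-Pearson (CP) upper bound:

| Dataset | Act. Coverage | Emp. Error | CP Upper |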
|---|---|---|---|
| MRE-MI | 0.82 | 0.087 | 0.097 |
| MNRE | 0.82 | 0.083 | 0.098 |
| Twitter-2017 | 0.80 | 0.077 | 0.099 |
| MNER-MI-Plus | 0.81 | 0.084 | 0.099 |

Install dependencies:

    pip install -r requirements.txt

Place data under `data/NER_data`, or update the data paths in `run.py`.
Use links in the project documentation or prepare data with the expected directory structure.
Note: SAVER supports both single-image and multi-image inputs. In multi-image settings, SIS/RES builds compact evidence subsets only when gate activation occurs.

Run the MNER scripts for Twitter-2015 and Twitter-2017:

    bash run_twitter15.sh
    bash run_twitter17.sh

Run the MRE script:

    bash run_re_task.sh

Evaluate a saved MRE checkpoint (`--only_test`):

    python -u run.py \
        --dataset_name="MRE" \
        --bert_name="bert-base-uncased" \
        --seed=1234 \
        --only_test \
        --max_seq=80 \
        --use_prompt \
        --prompt_len=4 \
        --sample_ratio=1.0 \
        --load_path='your_re_ckpt_path'

- This README now presents SAVER as the primary method and main result set.
- If updated SAVER architecture/risk-coverage figures are available, replace placeholders in this README accordingly.
- Twitter15/Twitter17 data processing follows UMT.
- MNRE data sourcing follows MEGA.
- Early implementation inspirations include HVPNeT.