RGate-MNER: RL-Calibrated Relation Gating for Multimodal NER

This repository contains the implementation of RGate-MNER: RL-Calibrated Relation Gating for Multimodal NER.

Requirements

Datasets

Download the multimodal NER dataset Twitter-15 (Zhang et al., 2018) from here and place it in resources/datasets/twitter2015.
Download the multimodal NER dataset Twitter-17 (Lu et al., 2018) and place it in resources/datasets/twitter2017.
Download the text-image relationship dataset (Vempala et al., 2019) from here and place it in resources/datasets/relationship.

Run python data/loader.py to verify that the dataset statistics match the reports in (Zhang et al., 2018) and (Lu et al., 2018).

Twitter-15	NUM	PER	LOC	ORG	MISC
Training	4000	2217	2091	928	940
Development	1000	552	522	247	225
Testing	3257	1816	1697	839	726

Twitter-17	NUM	TOKEN
Training	4290	68655
Development	1432	22872
Testing	1459	23051

Models

Download the pre-trained ResNet-101 weights from here and place them in resources/models/cnn/resnet101.pth.
Download the pre-trained BERT-base-uncased weights from here and place them in resources/models/transformers/bert-base-uncased.
Download the pre-trained word embeddings from here and place them in resources/models/embeddings.

Libraries

tqdm
Pillow
numpy
torch
torchvision
transformers
flair
pytorch-crf

Method

Overview and Two-Stage Gating

Social-media MNER is a budgeted decision problem under asymmetric observability: before full visual encoding the model only sees cheap but coarse evidence, while after full encoding it has richer cross-modal evidence but has already paid the dominant visual cost. Because different post-image pairs exhibit different relevance and uncertainty profiles, visual usage should be decided adaptively for each example. RGate-MNER therefore decomposes visual usage into two decisions with different roles and observability. An early gate makes the visual acquisition decision, while a relation gate makes the post-encoding trust decision and determines the strength of visual weighting.

Given text $x_t$ and image $x_i$, the text encoder produces word-level features $H_t\in\mathbb{R}^{T\times D}$ and pooled vector $c_t\in\mathbb{R}^{D}$, and a lightweight preview branch produces $c_i^g\in\mathbb{R}^{D}$. The early gate predicts

$$ p_0=\sigma\!\big(f_{\mathrm{early}}([c_t;c_i^g;u_t])\big), $$

where $u_t$ is a text-only uncertainty statistic. If $p_0<\theta_0$, inference exits to the text-only path and skips the full visual encoder. Otherwise, the full visual backbone yields visual tokens $V\in\mathbb{R}^{M\times D}$ and pooled vector $c_i$, from which we form

$$ c_{ti}=f_{\mathrm{pair}}\!\big([c_t;c_i;c_t\odot c_i;|c_t-c_i|]\big) $$

and compute the relation gate

$$ p_1=\sigma\!\big(f_{\mathrm{gate}}([c_{ti};u_t])\big). $$

At inference time, the selected representation is

$$ \hat H_t= \begin{cases} H_t, & p_0<\theta_0,\\ H_t, & p_0\ge\theta_0\ \text{and}\ p_1<\theta_1,\\ F_{\mathrm{fuse}}(H_t,\,p_1V), & p_0\ge\theta_0\ \text{and}\ p_1\ge\theta_1. \end{cases} $$

In practice, $p_1$ is evaluated only for activated examples with $p_0\ge\theta_0$. Hence, $p_0$ controls the hard acquisition decision and the dominant image-encoding cost. Within activated cases, $p_1$ serves two roles: a hard back-off decision through $\mathbb{I}[p_1\ge\theta_1]$ and a soft weighting factor through $p_1V$. During the multimodal training stages (Stages II/III), full visual features are computed for all samples to construct pseudo negatives, usefulness targets, and RL rewards; conditional execution is applied only at inference.
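
The case analysis above can be sketched as a small routing function. This is only an illustrative sketch, not the repository's code: the names `Route`, `route_example`, and the `relation_gate` callable are hypothetical, and the lazy callable is an assumed way to model the fact that $p_1$ is computed only after the full visual encoder has run.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Route:
    path: str              # "text_only", "backoff", or "fuse"
    ran_full_encoder: bool  # whether the dominant visual cost was paid
    visual_weight: float    # multiplier applied to V (0.0 when nothing is injected)


def route_example(p0: float, theta0: float,
                  relation_gate: Callable[[], float], theta1: float) -> Route:
    """Choose the inference path for one example.

    `relation_gate` is a hypothetical zero-argument callable that runs the
    full visual encoder and returns p1; it is invoked only if the early
    gate fires (p0 >= theta0).
    """
    if p0 < theta0:
        # Early exit: skip the full visual encoder entirely.
        return Route("text_only", False, 0.0)
    p1 = relation_gate()
    if p1 < theta1:
        # Hard back-off: the image was encoded but is not trusted.
        return Route("backoff", True, 0.0)
    # Soft weighting: visual tokens are injected scaled by p1.
    return Route("fuse", True, p1)
```

Both the `text_only` and `backoff` routes use $H_t$ unchanged; they differ only in whether the visual-encoding cost was already paid.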

Figure: RGate-MNER. A lightweight preview encoder supports an early gate $p_0$ for image acquisition; at inference, only activated examples run the full visual encoder. A relation gate $p_1$ then decides whether encoded visual tokens are injected, using both binary back-off and continuous weighting. Green dashed lines denote training-only stochastic gating.

Encoders and Fusion

We encode the post with BERT-base-uncased (Devlin et al., 2019). Given the word sequence $x_t=(w_1,\dots,w_T)$ and subwords $z_{1:L}=\mathrm{WP}(x_t)$, the encoder returns subword states and the pooled [CLS] vector, and we keep the first subword of each word to form word-level features:

$$ \tilde H_t,\,c_t = E_t(z_{1:L}),\qquad H_t=\mathrm{FirstSubword}(\tilde H_t)\in\mathbb{R}^{T\times D}. $$

We use a lightweight text-only diagnostic head to produce per-token label distributions $\pi_j$ from $H_t$, and define the average token entropy

$$ u_t=\frac{1}{T}\sum_{j=1}^{T}\mathrm{Ent}(\pi_j), $$

which serves as a task-aware signal for both gates.
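
As a minimal sketch of the uncertainty statistic $u_t$, the function below averages Shannon entropy over per-token label distributions. The function names are hypothetical, and the natural logarithm is an assumption, since the text does not fix the base.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats, an assumed base) of one label distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def text_uncertainty(token_dists):
    """Average token entropy u_t over the T label distributions pi_1..pi_T."""
    if not token_dists:
        raise ValueError("need at least one token distribution")
    return sum(token_entropy(d) for d in token_dists) / len(token_dists)


# One maximally uncertain token (uniform over 4 labels) and one certain token.
u_t = text_uncertainty([[0.25, 0.25, 0.25, 0.25], [1.0, 0.0, 0.0, 0.0]])
```

Here `u_t` equals half the entropy of the uniform distribution, because the one-hot token contributes zero.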

For images, a low-resolution preview encoder produces the cheap embedding

$$ c_i^g=\mathrm{Pool}\!\big(\mathrm{Proj}_{g}(E_v^g(\mathrm{Resize}(x_i)))\big)\in\mathbb{R}^{D}, $$

and the full backbone $E_v$ (ResNet-101 (He et al., 2016)) produces visual features that are flattened into $M$ tokens and projected into the shared $D$-dimensional space:

$$ V=\mathrm{Proj}(E_v(x_i)),\qquad c_i=\mathrm{Pool}(V)\in\mathbb{R}^{D}. $$

We then add 2D positional and modality-type embeddings before fusion. When visual injection is activated, we fuse $H_t$ with $p_1V$ using $F_{\mathrm{fuse}}(\cdot)$. Concretely, $F_{\mathrm{fuse}}$ is implemented as a lightweight cross-attention module in which textual representations attend to visual features as auxiliary context. The same fusion architecture is also adopted in the always-on Naive-MM baseline, ensuring that improvements come from adaptive visual invocation rather than greater multimodal modeling capacity.
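
To illustrate how scaling by $p_1$ interacts with cross-attention, here is a deliberately small pure-Python sketch. It is not the repository's module: the single attention head, the projection matrices `W_q`, `W_k`, `W_v`, the residual connection, and the omission of positional and modality-type embeddings are all simplifying assumptions. A real implementation would operate on batched tensors.

```python
import math


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_fuse(H_t, V, p1, W_q, W_k, W_v):
    """Text rows of H_t attend to the gated visual tokens p1 * V.

    Assumed form: single head, scaled dot-product attention, residual add.
    """
    gated_V = [[p1 * v for v in row] for row in V]
    Q = matmul(H_t, W_q)          # (T, D) queries from text
    K = matmul(gated_V, W_k)      # (M, D) keys from visual tokens
    Val = matmul(gated_V, W_v)    # (M, D) values from visual tokens
    scale = math.sqrt(len(Q[0]))
    fused = []
    for q, h in zip(Q, H_t):
        attn = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in K])
        ctx = [sum(w * v[d] for w, v in zip(attn, Val)) for d in range(len(h))]
        fused.append([h_d + c_d for h_d, c_d in zip(h, ctx)])
    return fused
```

Under these assumptions, $p_1=0$ zeroes the attended values, so the output reduces to $H_t$; larger $p_1$ lets more visual context through.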

Training Objectives

Within each mini-batch, we construct pseudo negatives by shuffling image indices with a permutation $\pi$, so $(x_t^b,x_i^{\pi(b)})$ forms a mismatched pair. For pseudo-ITM, each sample uses its matched image and one shuffled mismatch; for InfoNCE, all in-batch non-matching images serve as negatives. The early gate is trained as a usefulness predictor using the task gain from a deterministic soft-fusion pass:

$$ \Delta \mathrm{NLL}^b = \mathrm{NLL}_{\mathrm{text}}^b-\mathrm{NLL}_{\mathrm{mm}}^{\mathrm{soft},b}, $$

$$ y_{\mathrm{use}}^b = \mathbb{I}[\Delta \mathrm{NLL}^b>\delta],\qquad \mathcal{L}_{\mathrm{use}}=\tfrac{1}{B}\sum_{b=1}^{B}\mathrm{BCE}(p_0^b,y_{\mathrm{use}}^b). $$
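
The usefulness target and its loss can be sketched as follows. The function names are hypothetical, and the clamping constant `eps` is an assumed numerical safeguard that is not part of the formula.

```python
import math


def usefulness_targets(nll_text, nll_mm_soft, delta):
    """y_use = 1 when soft fusion lowers the NLL by more than delta."""
    return [1.0 if (t - m) > delta else 0.0
            for t, m in zip(nll_text, nll_mm_soft)]


def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one probability/label pair (eps is assumed)."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


def usefulness_loss(p0, targets):
    """Mean BCE between early-gate probabilities p0 and targets y_use."""
    return sum(bce(p, y) for p, y in zip(p0, targets)) / len(p0)


# Gains are 1.0, -0.2 and about 0.05; only the first exceeds delta = 0.1.
y_use = usefulness_targets([2.0, 1.0, 1.5], [1.0, 1.2, 1.45], delta=0.1)
```

Note that a sample whose image helps only marginally (gain below $\delta$) is labeled as not worth acquiring.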

To calibrate relation-aware pair features, we form pair representations

$$ c_{ti}^{b,b'} = f_{\mathrm{pair}}\!\big([c_t^b;c_i^{b'};c_t^b\odot c_i^{b'};|c_t^b-c_i^{b'}|]\big), $$

and predict matched and mismatched ITM scores by

$$ s_b^{\mathrm{pos}} = f_{\mathrm{itm}}([c_{ti}^{b,b};u_t^b]),\qquad s_b^{\mathrm{neg}} = f_{\mathrm{itm}}([c_{ti}^{b,\pi(b)};u_t^b]). $$
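
One practical detail when drawing the mismatch index $\pi(b)$ is that a uniformly random permutation can map a sample to itself, which would turn a "negative" into a matched pair. The sketch below resamples until there are no fixed points. This rejection step is an assumption, since the text does not say how such cases are handled, and the function name is hypothetical.

```python
import random


def mismatch_permutation(batch_size, rng=None):
    """Return a permutation pi of range(batch_size) with pi[b] != b for all b.

    Rejection sampling is an assumed strategy; about 1 in e random
    permutations has no fixed point, so few retries are needed.
    """
    if batch_size < 2:
        raise ValueError("need at least two samples to form a mismatch")
    rng = rng or random.Random()
    indices = list(range(batch_size))
    while True:
        perm = indices[:]
        rng.shuffle(perm)
        if all(perm[b] != b for b in indices):
            return perm


perm = mismatch_permutation(8, random.Random(0))
# Sample b is paired with image perm[b] to form the pseudo-negative.
```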

We further apply bidirectional InfoNCE on pair representations, with

$$ s_{b,b'}=t\cdot f_{\mathrm{nce}}(c_{ti}^{b,b'}), \qquad t=\exp(\kappa)>0. $$

The resulting objectives are

$$ \mathcal{L}_{\mathrm{itm}} = \tfrac{1}{2B}\sum_{b=1}^{B}\!\Big[ \mathrm{BCE}(\sigma(s_b^{\mathrm{pos}}),1)+\mathrm{BCE}(\sigma(s_b^{\mathrm{neg}}),0)\Big], $$

$$ \mathcal{L}_{\mathrm{nce}} = -\frac{1}{2B}\sum_{b=1}^{B}\!\Big[ \log \frac{\exp(s_{b,b})}{\sum_{b'}\exp(s_{b,b'})} +\log \frac{\exp(s_{b,b})}{\sum_{b'}\exp(s_{b',b})} \Big]. $$
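
Both objectives can be evaluated directly from raw scores. The sketch below is a scalar illustration with hypothetical function names. It uses the identities $\mathrm{BCE}(\sigma(s),1)=\mathrm{softplus}(-s)$ and $\mathrm{BCE}(\sigma(s),0)=\mathrm{softplus}(s)$ plus a log-sum-exp for stability, which are implementation choices rather than anything stated in the text.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def itm_loss(s_pos, s_neg):
    """L_itm from matched logits s_pos and mismatched logits s_neg."""
    B = len(s_pos)
    return sum(softplus(-sp) + softplus(sn)
               for sp, sn in zip(s_pos, s_neg)) / (2 * B)


def bidirectional_info_nce(scores):
    """L_nce from a B x B matrix with scores[b][b2] = s_{b,b'}.

    Row b normalizes over images for text b; column b normalizes over
    texts for image b. The diagonal holds the matched pairs.
    """
    B = len(scores)
    total = 0.0
    for b in range(B):
        row = scores[b]
        col = [scores[k][b] for k in range(B)]
        total += (scores[b][b] - logsumexp(row)) + (scores[b][b] - logsumexp(col))
    return -total / (2 * B)
```

With uninformative scores (all zeros) and $B=2$, both losses equal $\log 2$; a strongly diagonal score matrix drives the InfoNCE term toward zero.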

The relation-gate probability used by fusion is computed on the matched pair:

$$ p_1^b=\sigma\!\big(f_{\mathrm{gate}}([c_{ti}^{b,b};u_t^b])\big). $$

The same $p_1^b$ is also used to form $\mathrm{NLL}_{\mathrm{mm}}^{\mathrm{soft},b}$ for the early-gate usefulness target via deterministic soft fusion $F_{\mathrm{fuse}}(H_t^b,\,p_1^bV^b)$.

For sequence labeling, Stage II uses deterministic soft fusion $F_{\mathrm{fuse}}(H_t^b,\,p_1^bV^b)$. In Stage III, we sample

$$ r^b \sim \pi_\phi(\cdot \mid [c_{ti}^{b,b};u_t^b])=\mathrm{Bern}(p_1^b), $$

and the selected representation is

$$ \hat H_t^b(r)= \begin{cases} H_t^b, & r=0,\\ F_{\mathrm{fuse}}(H_t^b,\,p_1^bV^b), & r=1. \end{cases} $$

The decoder then yields the NER loss $\mathcal{L}_{\mathrm{ner}}$ on the selected representation.

We regularize the expected activation rate of the early gate by

$$ \mathcal{L}_{\mathrm{budget}}=(\bar p_0-\tau)^2,\qquad \bar p_0=\tfrac{1}{B}\sum_{b=1}^{B}p_0^b. $$
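
A short sketch of the budget term follows; the function name is hypothetical. The point it illustrates is that the penalty acts on the batch-mean activation $\bar p_0$, not on each example, so individual gates remain free to be confident as long as the average matches the target rate $\tau$.

```python
def budget_loss(p0, tau):
    """Squared gap between the mean early-gate probability and target tau."""
    mean_p0 = sum(p0) / len(p0)
    return (mean_p0 - tau) ** 2


# Confident per-example gates can still satisfy the budget on average.
on_budget = budget_loss([0.0, 0.0, 1.0, 1.0], tau=0.5)
over_budget = budget_loss([1.0, 1.0, 1.0, 1.0], tau=0.5)
```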

Because the relation gate is applied only after the early gate has already activated the multimodal branch, its reward penalizes only the residual fusion/injection cost that remains controllable at this stage rather than the full visual-encoding cost. For sample $b$, we define

$$ R^b=\big[\mathrm{NLL}_{\mathrm{text}}^b-\mathrm{NLL}_{\mathrm{sel}}^b(r^b)\big]-\lambda_c\,r^b\cdot C_{\mathrm{fuse}}, $$

where

$$ \mathrm{NLL}_{\mathrm{sel}}^b(0)=\mathrm{NLL}_{\mathrm{text}}^b,\qquad \mathrm{NLL}_{\mathrm{sel}}^b(1)=\mathrm{NLL}_{\mathrm{mm}}^{\mathrm{gate},b}, $$

and $\mathrm{NLL}_{\mathrm{mm}}^{\mathrm{gate},b}$ is obtained from gated fusion $F_{\mathrm{fuse}}(H_t^b,\,p_1^bV^b)$.
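
The reward definition can be written out as a small function (the name and argument order are hypothetical). It makes explicit that choosing $r^b=0$ always yields zero reward, so the gate is rewarded for injection only when the NLL gain exceeds the fusion cost $\lambda_c C_{\mathrm{fuse}}$.

```python
def gate_reward(nll_text, nll_mm_gate, r, lambda_c, c_fuse):
    """Per-sample reward R for relation-gate action r in {0, 1}."""
    if r not in (0, 1):
        raise ValueError("r must be 0 or 1")
    # NLL_sel(0) is the text-only NLL; NLL_sel(1) is the gated-fusion NLL.
    nll_sel = nll_mm_gate if r == 1 else nll_text
    return (nll_text - nll_sel) - lambda_c * r * c_fuse


skip = gate_reward(2.0, 1.5, r=0, lambda_c=0.1, c_fuse=1.0)     # no gain, no cost
inject = gate_reward(2.0, 1.5, r=1, lambda_c=0.1, c_fuse=1.0)   # gain 0.5 minus cost 0.1
harmful = gate_reward(2.0, 2.3, r=1, lambda_c=0.1, c_fuse=1.0)  # image hurts and costs
```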

Let

$$ q^b=\sigma\!\big(f_{\mathrm{gate}}^{\mathrm{EMA}}([c_{ti}^{b,b};u_t^b])\big) $$

be an EMA anchor of the gate, where gradients are stopped through $q^b$. The policy-gradient loss is

$$ \begin{aligned} \mathcal{L}_{\mathrm{rl}} =& -\eta\,\frac{1}{B}\sum_{b=1}^{B}\Big[(\mathrm{stopgrad}(R^b)-b_{\mathrm{ma}}) \log \pi_\phi(r^b\,|\,[c_{ti}^{b,b};u_t^b])\Big] \\ &+ \beta_{\mathrm{KL}}\,\frac{1}{B}\sum_{b=1}^{B} \mathrm{KL}\!\Big(\mathrm{Bern}(p_1^b)\,\|\,\mathrm{Bern}(q^b)\Big), \end{aligned} $$

and the full objective in Stage III is

$$ \mathcal{L}_{\mathrm{total}}=\mathcal{L}_{\mathrm{ner}}+\alpha\,\mathcal{L}_{\mathrm{itm}}+\beta\,\mathcal{L}_{\mathrm{nce}}+\mu\,\mathcal{L}_{\mathrm{use}}+\lambda\,\mathcal{L}_{\mathrm{budget}}+\mathcal{L}_{\mathrm{rl}}. $$
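
The value of the policy-gradient term $\mathcal{L}_{\mathrm{rl}}$ can be computed as in the sketch below. It evaluates the scalar only; in a real training loop an autodiff framework would differentiate through $\log\pi_\phi$ and the KL term while treating rewards and $q^b$ as constants. The function names and the clamping constant `eps` are assumptions, and the baseline $b_{\mathrm{ma}}$ is simply passed in, since the text does not specify how it is maintained.

```python
import math


def _clamp(p, eps=1e-7):
    return min(max(p, eps), 1.0 - eps)


def bern_log_prob(r, p):
    """log Bern(r; p) for a sampled action r in {0, 1}."""
    p = _clamp(p)
    return math.log(p) if r == 1 else math.log(1.0 - p)


def bern_kl(p, q):
    """KL( Bern(p) || Bern(q) )."""
    p, q = _clamp(p), _clamp(q)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def rl_loss(rewards, actions, p1, q, baseline, eta, beta_kl):
    """Scalar value of L_rl: REINFORCE term plus KL to the EMA anchor q."""
    B = len(rewards)
    pg = sum((R - baseline) * bern_log_prob(r, p)
             for R, r, p in zip(rewards, actions, p1)) / B
    kl = sum(bern_kl(p, a) for p, a in zip(p1, q)) / B
    return -eta * pg + beta_kl * kl
```

The KL term vanishes when the current gate agrees with its anchor, and the policy-gradient term vanishes when every reward equals the baseline.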

Training uses three stages: Stage I is a text-only warm-up that optimizes only $\mathcal{L}_{\mathrm{ner}}$; Stage II enables $\mathcal{L}_{\mathrm{itm}}$, $\mathcal{L}_{\mathrm{nce}}$, $\mathcal{L}_{\mathrm{use}}$, and $\mathcal{L}_{\mathrm{budget}}$ under deterministic soft fusion and sets $\mathcal{L}_{\mathrm{rl}}=0$; Stage III turns on stochastic relation gating and optimizes the full objective above. We do not route examples with $p_0$ during training because early skipping would remove the multimodal signals needed to construct pseudo negatives, usefulness targets, and RL rewards.
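
The stage schedule amounts to switching terms of the total objective on and off. A minimal sketch, assuming the individual losses have already been computed and are passed in as a dictionary (the function name and dictionary keys are hypothetical; the weights correspond to $\alpha$, $\beta$, $\mu$, and $\lambda$):

```python
def stage_objective(stage, losses, weights):
    """Combine per-term losses according to the three-stage schedule.

    losses:  dict with keys "ner", "itm", "nce", "use", "budget", "rl"
    weights: dict with keys "itm" (alpha), "nce" (beta), "use" (mu),
             "budget" (lambda)
    """
    if stage not in (1, 2, 3):
        raise ValueError("stage must be 1, 2, or 3")
    total = losses["ner"]                      # Stage I: text-only warm-up
    if stage >= 2:                             # Stage II adds auxiliary terms
        for name in ("itm", "nce", "use", "budget"):
            total += weights[name] * losses[name]
    if stage == 3:                             # Stage III adds the RL term
        total += losses["rl"]
    return total
```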

Usage

# BERT-BiLSTM-CRF
python main.py --stacked --rnn --crf --dataset [dataset_id] --cuda [gpu_id]

# RGate-MNER-BiLSTM-CRF
python main.py --stacked --rnn --crf --encoder_v resnet101 --aux --gate --dataset twitter2017 --cuda [gpu_id]

# Save the best model to ./ckpt and also save checkpoints every 3 epochs
python main.py --encoder_v resnet101 --gate --save_interval 3

# Directly load an existing model and evaluate on the test set
python main.py --load_model ckpt/best_model.pt
