Download the pre-trained ResNet-101 weights from here and place them in resources/models/cnn/resnet101.pth.
Download the pre-trained BERT-base-uncased weights from here and place them in resources/models/transformers/bert-base-uncased.
Download the pre-trained word embeddings from here and place them in resources/models/embeddings.
Libraries
tqdm
Pillow
numpy
torch
torchvision
transformers
flair
pytorch-crf
Method
Overview and Two-Stage Gating
Social-media MNER is a budgeted decision problem under asymmetric observability: before full visual encoding the model only sees cheap but coarse evidence, while after full encoding it has richer cross-modal evidence but has already paid the dominant visual cost. Because different post-image pairs exhibit different relevance and uncertainty profiles, visual usage should be decided adaptively for each example. RGate-MNER therefore decomposes visual usage into two decisions with different roles and observability. An early gate makes the visual acquisition decision, while a relation gate makes the post-encoding trust decision and determines the strength of visual weighting.
Given text $x_t$ and image $x_i$, the text encoder produces word-level features $H_t\in\mathbb{R}^{T\times D}$ and a pooled vector $c_t\in\mathbb{R}^{D}$, and a lightweight preview branch produces $c_i^g\in\mathbb{R}^{D}$. The early gate predicts a gate score $p_0$ from these cheap signals together with $u_t$, a text-only uncertainty statistic.
If $p_0<\theta_0$, inference exits to the text-only path and skips the full visual encoder. Otherwise, the full visual backbone yields visual tokens $V\in\mathbb{R}^{M\times D}$ and a pooled vector $c_i$, from which we form the relation-gate score $p_1$.
In practice, $p_1$ is evaluated only for activated examples with $p_0\ge\theta_0$. Hence, $p_0$ controls the hard acquisition decision and the dominant image-encoding cost. Within activated cases, $p_1$ serves two roles: a hard back-off decision through $\mathbb{I}[p_1\ge\theta_1]$ and a soft weighting factor through $p_1V$. During the multimodal training stages (Stages II/III), full visual features are computed for all samples to construct pseudo negatives, usefulness targets, and RL rewards; conditional execution is applied only at inference.
Figure: RGate-MNER. A lightweight preview encoder supports an early gate $p_0$ for image acquisition; at inference, only activated examples run the full visual encoder. A relation gate $p_1$ then decides whether encoded visual tokens are injected, using both binary back-off and continuous weighting. Green dashed lines denote training-only stochastic gating.
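The inference-time control flow described above can be sketched in plain Python. This is a minimal illustration, not the repository's implementation: the callables `text_path`, `preview_gate`, `full_visual`, `relation_gate`, and `fuse_path`, the `GateTrace` record, and the default thresholds of 0.5 are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GateTrace:
    p0: float            # early-gate score
    p1: Optional[float]  # relation-gate score (None if the image was never encoded)
    path: str            # "text_only", "backoff", or "fused"


def gated_inference(text, image, *, text_path, preview_gate, full_visual,
                    relation_gate, fuse_path, theta0=0.5, theta1=0.5):
    """Two-stage gating at inference (all callables are hypothetical stand-ins)."""
    # Stage 1: cheap acquisition decision; the full visual encoder is not run yet.
    p0 = preview_gate(text, image)
    if p0 < theta0:
        return text_path(text), GateTrace(p0, None, "text_only")

    # Stage 2: the dominant visual cost is paid only for activated examples.
    visual_tokens = full_visual(image)
    p1 = relation_gate(text, visual_tokens)
    if p1 < theta1:
        # Hard back-off: the visual tokens were encoded but are not injected.
        return text_path(text), GateTrace(p0, p1, "backoff")

    # Soft weighting: visual tokens are scaled by p1 before fusion.
    weighted = [[p1 * v for v in token] for token in visual_tokens]
    return fuse_path(text, weighted), GateTrace(p0, p1, "fused")
```

Returning a trace alongside the prediction makes it easy to count how often each path is taken, which is the quantity that determines the realized visual-encoding budget.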
Encoders and Fusion
We encode the post with BERT-base-uncased (Devlin et al., 2019). Given the word sequence $x_t=(w_1,\dots,w_T)$ and subwords $z_{1:L}=\mathrm{WP}(x_t)$, the encoder returns subword states and the pooled [CLS] vector, and we keep the first subword of each word to form the word-level features $H_t$.
The full backbone $E_v$ (ResNet-152 (He et al., 2016)) produces visual features that are flattened into $M$ tokens and projected into the shared $D$-dimensional space to give $V$.
We then add 2D positional and modality-type embeddings before fusion. When visual injection is activated, we fuse $H_t$ with $p_1V$ using $F_{\mathrm{fuse}}(\cdot)$. Concretely, $F_{\mathrm{fuse}}$ is implemented as a lightweight cross-attention module in which textual representations attend to visual features as auxiliary context. The same fusion architecture is also adopted in the always-on Naive-MM baseline, ensuring that improvements come from adaptive visual invocation rather than greater multimodal modeling capacity.
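To make the role of $p_1$ in fusion concrete, here is a small dependency-free sketch of text-to-image cross-attention over $p_1V$. It is an assumption-laden simplification: it uses a single head, no learned query/key/value projections, and a residual connection, none of which are specified in the text above, and the function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_fuse(H_t, V, p1):
    """Fuse word features H_t (T x D) with gate-weighted visual tokens p1 * V (M x D).

    Text rows act as queries; the scaled visual tokens act as keys and values.
    The attended visual context is added to each word vector (assumed residual).
    """
    D = len(H_t[0])
    scaled_V = [[p1 * v for v in token] for token in V]
    fused = []
    for h in H_t:
        # Scaled dot-product scores between one word and every visual token.
        scores = [sum(a * b for a, b in zip(h, tok)) / math.sqrt(D) for tok in scaled_V]
        weights = softmax(scores)
        # Attention-weighted sum of the visual tokens.
        context = [sum(w * tok[d] for w, tok in zip(weights, scaled_V)) for d in range(D)]
        fused.append([hd + cd for hd, cd in zip(h, context)])
    return fused
```

With `p1 = 0` the visual context vanishes and the output equals `H_t`, so under these assumptions the soft weight interpolates between the text-only representation and full visual injection.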
Training Objectives
Within each mini-batch, we construct pseudo negatives by shuffling image indices with a permutation $\pi$, so $(x_t^b,x_i^{\pi(b)})$ forms a mismatched pair. For pseudo-ITM, each sample uses its matched image and one shuffled mismatch; for InfoNCE, all in-batch non-matching images serve as negatives. The early gate is trained as a usefulness predictor using the task gain from a deterministic soft-fusion pass.
The same $p_1^b$ is also used to form $\mathrm{NLL}_{\mathrm{mm}}^{\mathrm{soft},b}$ for the early-gate usefulness target via deterministic soft fusion $F_{\mathrm{fuse}}(H_t^b,\,p_1^bV^b)$.
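The in-batch negative construction can be illustrated with a short sketch. The choices below are assumptions made for the example and are not taken from the repository: the permutation is drawn as a cyclic shift so that no sample is paired with its own image, the InfoNCE loss is computed in the text-to-image direction only on dot-product similarities, and the temperature of 0.07 is a placeholder.

```python
import math
import random


def mismatch_permutation(batch_size, rng):
    """Return pi such that pi[b] != b for every b (cyclic shift by a random offset)."""
    if batch_size < 2:
        raise ValueError("need at least two samples to form a mismatched pair")
    shift = rng.randrange(1, batch_size)
    return [(b + shift) % batch_size for b in range(batch_size)]


def info_nce(text_vecs, image_vecs, temperature=0.07):
    """Text-to-image InfoNCE: image b is the positive for text b, all others are negatives."""
    batch_size = len(text_vecs)
    total = 0.0
    for b in range(batch_size):
        logits = [
            sum(t * i for t, i in zip(text_vecs[b], image_vecs[j])) / temperature
            for j in range(batch_size)
        ]
        # log-sum-exp with max subtraction for numerical stability
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[b]
    return total / batch_size
```

For pseudo-ITM, the pairs `(b, b)` would be labeled as matched and `(b, pi[b])` as mismatched; the InfoNCE loss instead contrasts each text against every image in the batch.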
For sequence labeling, Stage II uses deterministic soft fusion $F_{\mathrm{fuse}}(H_t^b,\,p_1^bV^b)$. In Stage III, the relation-gate decision is instead sampled stochastically.
Because the relation gate is applied only after the early gate has already activated the multimodal branch, its reward penalizes only the residual fusion/injection cost that remains controllable at this stage rather than the full visual-encoding cost. The reward is defined separately for each sample $b$.
Training uses three stages: Stage I is a text-only warm-up that optimizes only $\mathcal{L}_{\mathrm{ner}}$; Stage II enables $\mathcal{L}_{\mathrm{itm}}$, $\mathcal{L}_{\mathrm{nce}}$, $\mathcal{L}_{\mathrm{use}}$, and $\mathcal{L}_{\mathrm{budget}}$ under deterministic soft fusion and sets $\mathcal{L}_{\mathrm{rl}}=0$; Stage III turns on stochastic relation gating and optimizes the full objective above. We do not route examples with $p_0$ during training because early skipping would remove the multimodal signals needed to construct pseudo negatives, usefulness targets, and RL rewards.
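The three-stage schedule can be summarized as a small lookup that selects which loss terms are active. This is a sketch under stated assumptions: the dictionary keys, the function names, and the default weight of 1.0 per term are hypothetical, since the loss weights are not given here.

```python
# Loss terms active in each training stage, following the description above.
STAGE_LOSSES = {
    1: ("ner",),                                         # text-only warm-up
    2: ("ner", "itm", "nce", "use", "budget"),           # deterministic soft fusion, L_rl = 0
    3: ("ner", "itm", "nce", "use", "budget", "rl"),     # stochastic relation gating
}

# How the relation gate is applied in each stage.
GATING_MODE = {1: "text_only", 2: "soft", 3: "stochastic"}


def total_loss(stage, losses, weights=None):
    """Sum the active loss terms for a stage (weights default to 1.0, an assumption)."""
    weights = weights or {}
    return sum(weights.get(name, 1.0) * losses[name] for name in STAGE_LOSSES[stage])


def needs_full_visual(stage):
    """Stages II/III encode every image; conditional skipping is inference-only."""
    return stage >= 2
```

For example, with every term present in `losses`, `total_loss(2, losses)` ignores the `"rl"` entry, matching the statement that Stage II sets $\mathcal{L}_{\mathrm{rl}}=0$.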
Usage
# BERT-BiLSTM-CRF
python main.py --stacked --rnn --crf --dataset [dataset_id] --cuda [gpu_id]
# RGate-MNER-BiLSTM-CRF
python main.py --stacked --rnn --crf --encoder_v resnet101 --aux --gate --dataset twitter2017 --cuda [gpu_id]
# Save the best model to ./ckpt and also save checkpoints every 3 epochs
python main.py --encoder_v resnet101 --gate --save_interval 3
# Directly load an existing model and evaluate on the test set
python main.py --load_model ckpt/best_model.pt