Nonsense Helps: Prompt Space Perturbation Broadens Reasoning Exploration
Langlin Huang, Chengsong Huang, Jinyuan Li, Donghong Cai, Yuyi Yang, Jiaxin Huang
HINT Lab, Washington University in St. Louis
Paper (arXiv) · Code
LoPE is a simple yet effective resampling strategy for GRPO-style reinforcement learning that breaks through the zero-advantage problem: the situation where all sampled rollouts for a hard question fail, the relative advantage collapses to zero, and the training signal is wasted.
Instead of just throwing more compute at logit-space exploration (e.g., higher temperature), LoPE prepends a randomly generated Lorem Ipsum sequence to the prompt before resampling. This semantically neutral, prompt-space perturbation shifts the model's output distribution just enough to unlock orthogonal reasoning trajectories without distorting its understanding of the question.
Figure 1. During the standard rollout phase, if all G responses fail, LoPE prepends a Lorem Ipsum sequence to the prompt and resamples G′ responses. Successful responses are regrouped with the original failed ones to form a mixed batch of size G for policy update.
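To make the zero-advantage problem concrete, the sketch below computes group-relative advantages in the way GRPO-style methods commonly do (reward minus group mean, divided by group standard deviation). The function name `group_advantages` and the `eps` stabilizer are illustrative assumptions, not taken from the LoPE codebase.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantage: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# All G = 8 rollouts fail: every advantage is zero, so the update carries no signal.
all_failed = [0.0] * 8
print(group_advantages(all_failed))

# A single success in the group is enough to make the advantages non-zero.
mixed = [0.0] * 7 + [1.0]
print(group_advantages(mixed))
```

With binary rewards, a group in which every rollout fails (or every rollout succeeds) has zero variance, which is why recovering even one correct response matters.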
- Zero-Advantage Recovery. When all initial rollouts fail, LoPE-perturbed resampling recovers correct trajectories that naive resampling and high-temperature sampling rarely reach.
- Orthogonal Exploration. On a hard 352-question subset, Lorem-perturbed prompts solve more questions that the other methods miss (see Figure 2).
- Controlled Perplexity is Key. Among all tested prompt-space perturbations, the three top performers use the lowest-perplexity perturbations (closest to natural language). Their perturbation intensity is sufficient to drive exploration while avoiding the detrimental effects of excessive noise.
- Consistent Gains. Average improvement over GRPO of +2.79 on Qwen3-1.7B-Base, +4.62 on Qwen3-4B-Base, and +6.20 on Qwen2.5-Math-7B across five math benchmarks.
Figure 2. Venn diagrams of questions successfully resolved (Pass@8) by naive prompting, high-temperature sampling, and Lorem perturbation. LoPE unlocks reasoning paths that pure logit-space methods cannot reach.
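The regions of such a Venn diagram are plain set operations over the question IDs each method solves. The sketch below uses made-up placeholder IDs, not the paper's data, and the helper name `unique_solves` is hypothetical.

```python
def unique_solves(solved_by, method):
    """Questions solved by `method` and by no other method."""
    others = set().union(
        *(ids for name, ids in solved_by.items() if name != method)
    )
    return solved_by[method] - others

# Placeholder question IDs solved under Pass@8 by each strategy (illustrative only).
solved_by = {
    "naive": {1, 2, 3},
    "high_temp": {2, 3, 4},
    "lorem": {3, 5, 6},
}

for name in solved_by:
    print(name, sorted(unique_solves(solved_by, name)))
```

A method explores "orthogonally" to the others when its unique region is large relative to the shared intersection.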
We need a perturbation that is structurally similar to natural language but semantically empty, so that it neither leaks hints nor distorts the question. Lorem Ipsum fits perfectly, and our analysis reveals a clear pattern across 9 perturbation types:
| Perturbation Type | Mean Perplexity | Regime |
|---|---|---|
| Question Text (reference) | 4.82 | Natural language (English) |
| Lorem Ipsum | 25.12 | Near-natural |
| Latin Natural Language | 46.09 | Near-natural |
| Latin Unigram Model | 51.32 | Near-natural |
| English Unigram Model | 85.30 | Near-natural |
| Latin 3-Gram Model | 91.45 | Near-natural |
| Random ASCII | 492.93 | Moderately OOD |
| Random Fake English | 2,429.9 | Moderately OOD |
| Random Tokens | 4.6 × 10⁵ | Severely OOD |
The key insight: moderate, near-natural perturbation increases response entropy and promotes exploration without harming the input representation. Excessively high-perplexity perturbations (e.g., Random Tokens) corrupt the model's understanding of the question itself, as confirmed by both token-level entropy analysis and t-SNE visualization of question representations.
Beyond the mean, Lorem Ipsum also exhibits the lowest standard deviation (2.84) among all synthetic perturbations, ensuring a consistent, controlled distributional shift across samples. Other low-mean methods lack this property (e.g., Latin Natural Language, with a standard deviation of 42.63).
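For reference, perplexity is the exponential of the average negative log-likelihood per token, and the mean and standard deviation above are taken across perturbation samples. The sketch below assumes per-token log-probabilities are already available (in practice they would come from scoring each perturbation with the policy model); the numbers are placeholders, and the choice of population standard deviation is an assumption.

```python
import math
from statistics import mean, pstdev

def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood over the tokens of one sample."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Placeholder natural-log probabilities for two perturbation samples.
samples = [
    [-3.1, -3.3, -3.2],
    [-3.0, -3.4, -3.5],
]
ppls = [perplexity(s) for s in samples]

# Population std is used here; the paper may use the sample std instead.
print("mean perplexity:", mean(ppls))
print("std of perplexity:", pstdev(ppls))
```

A low standard deviation means every sampled perturbation shifts the input distribution by a similar amount, which is the property highlighted for Lorem Ipsum.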
Results on five math reasoning benchmarks (MATH-500, GSM8K, AMC, AIME24, AIME25):
| Model & Method | MATH-500 | GSM8K | AMC | AIME24 | AIME25 | Avg. |
|---|---|---|---|---|---|---|
| Qwen3-1.7B-Base | 63.40 | 76.92 | 26.87 | 5.33 | 2.00 | 34.90 |
| + GRPO | 64.20 | 82.71 | 27.61 | 6.15 | 4.47 | 37.03 |
| + Resample w/ Naive Prompt | 67.00 | 82.18 | 28.36 | 8.70 | 4.58 | 38.16 |
| + Resample w/ LoPE | 68.80 | 82.94 | 32.84 | 8.80 | 5.73 | 39.82 |
| Qwen3-4B-Base | 65.80 | 82.71 | 32.84 | 9.38 | 7.24 | 39.59 |
| + GRPO | 77.80 | 91.74 | 47.76 | 16.41 | 13.12 | 49.37 |
| + Resample w/ Naive Prompt | 79.80 | 92.87 | 45.52 | 14.90 | 11.67 | 48.95 |
| + Resample w/ LoPE | 82.60 | 92.95 | 58.21 | 19.90 | 16.27 | 53.99 |
| Qwen2.5-Math-7B | 52.80 | 65.50 | 35.40 | 12.90 | 7.90 | 34.90 |
| + GRPO | 78.00 | 85.06 | 47.76 | 17.66 | 9.90 | 47.68 |
| + Resample w/ Naive Prompt | 78.20 | 83.02 | 50.00 | 17.19 | 9.17 | 47.52 |
| + Resample w/ LoPE | 81.80 | 90.30 | 61.19 | 19.58 | 16.51 | 53.88 |
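The headline gains (+2.79, +4.62, +6.20) are the differences between the LoPE and GRPO rows in the Avg. column. The short check below re-derives them from the table values; the dictionary layout is only for illustration.

```python
# Avg. column values copied from the results table above.
avg = {
    "Qwen3-1.7B-Base": {"GRPO": 37.03, "LoPE": 39.82},
    "Qwen3-4B-Base": {"GRPO": 49.37, "LoPE": 53.99},
    "Qwen2.5-Math-7B": {"GRPO": 47.68, "LoPE": 53.88},
}

# Gain of LoPE resampling over plain GRPO, rounded to two decimals.
gains = {model: round(v["LoPE"] - v["GRPO"], 2) for model, v in avg.items()}

for model, gain in gains.items():
    print(f"{model}: +{gain:.2f}")
```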
All perturbation methods below use Training Signal Shaping. The three top performers (LoPE, Latin Natural Language, Latin Unigram Model) share the lowest perplexity values among all evaluated perturbations.
| Method | MATH-500 | GSM8K | AMC | AIME24 | AIME25 | Avg. |
|---|---|---|---|---|---|---|
| GRPO | 64.20 | 82.71 | 27.61 | 6.15 | 4.47 | 37.03 |
| Resample w/o perturbation | | | | | | |
| w/ Naive Prompt | 67.00 | 82.18 | 28.36 | 8.70 | 4.58 | 38.16 |
| w/ Naive Prompt (Temp=1.2) | 64.40 | 82.87 | 31.34 | 8.65 | 4.48 | 38.35 |
| Resample w/ perturbation | | | | | | |
| w/ LoPE | 68.80 | 82.94 | 32.84 | 8.80 | 5.73 | 39.82 |
| w/ Latin Natural Language | 68.80 | 82.71 | 32.84 | 9.32 | 5.57 | 39.85 |
| w/ Latin Unigram Model | 69.40 | 83.32 | 32.09 | 7.19 | 6.35 | 39.67 |
| w/ Latin 3-Gram Model | 68.80 | 81.88 | 29.85 | 7.92 | 5.93 | 38.88 |
| w/ English Unigram Model | 67.00 | 83.32 | 28.36 | 8.49 | 5.42 | 38.52 |
| w/ Random Fake English | 65.80 | 81.96 | 32.09 | 7.50 | 5.42 | 38.55 |
| w/ Random ASCII | 66.20 | 82.94 | 28.36 | 8.12 | 5.32 | 38.19 |
| w/ Random Token | 64.20 | 81.50 | 29.85 | 8.08 | 4.63 | 37.65 |
Takeaway: The most effective prompt-space perturbations share two characteristics: (i) they are composed of Latin words, and (ii) they have relatively low perplexity. English-based perturbations (e.g., English Unigram Model) tend to interfere with the model's original English reasoning context, while extremely high-perplexity perturbations (e.g., Random Token) corrupt the model's understanding of the input.
LoPE follows the standard GRPO training loop, with the following modifications applied when all initial rollouts fail:
- Rollout with Perturbation. Prepend a random Lorem Ipsum sequence δ (100–300 tokens) to the original prompt p, then sample G′ additional responses from $\pi_{\theta_{old}}(o' \mid \delta \oplus p, q)$.
- Regroup Responses. Replace failed rollouts with successful resampled ones, keeping the group size at G and at least one incorrect response so advantages remain non-zero.
- Advantage Estimation with Importance Correction. Convert resampled responses into pseudo rollouts paired with the naive prompt, and correct the resulting distribution shift with an importance ratio.
- Training Signal Shaping. Reshape the importance ratio to $\rho' = \rho / (\rho + 0.1)$ to amplify low-probability tokens corresponding to critical reasoning steps, and compute the group advantage over all G + G′ rollouts so that rare successes regain a larger training weight. Refer to Appendix C for details.
- No KL Regularization. KL constraints counteract the broader exploration LoPE aims to promote.
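The regrouping and signal-shaping steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's implementation: the function names are hypothetical, rewards are assumed binary, and advantages are assumed to be normalized by the standard deviation as in standard GRPO.

```python
from statistics import mean, pstdev

def shape_ratio(rho, c=0.1):
    """Training Signal Shaping: rho' = rho / (rho + c), with c = 0.1 as in the text."""
    return rho / (rho + c)

def regroup(failed, successes, group_size):
    """Swap failed rollouts for resampled successes, keeping at least one failure."""
    n_success = min(len(successes), group_size - 1)
    return successes[:n_success] + failed[: group_size - n_success]

def advantages_over_all(rewards_all, rewards_group, eps=1e-6):
    """Score the regrouped rollouts with statistics of all G + G' rollouts."""
    mu, sigma = mean(rewards_all), pstdev(rewards_all)
    return [(r - mu) / (sigma + eps) for r in rewards_group]

G, G_prime = 8, 24
failed = [("fail", 0.0)] * G  # every initial rollout failed
resampled = [("ok", 1.0)] * 2 + [("fail", 0.0)] * (G_prime - 2)
successes = [r for r in resampled if r[1] == 1.0]

group = regroup(failed, successes, G)
rewards_all = [reward for _, reward in failed + resampled]
adv = advantages_over_all(rewards_all, [reward for _, reward in group])

print(group)
print(adv)
print(shape_ratio(0.01), shape_ratio(1.0))
```

Because only 2 of the 32 rollouts succeed in this toy setup, normalizing over all G + G′ rollouts gives the successes a larger advantage than normalizing within the regrouped batch of 8, where they make up a quarter of the group.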
# Clone the repository
git clone https://github.com/shrango/LoPE.git
cd LoPE
# Install dependencies
pip install -r requirements.txt

Our implementation is built on top of EasyR1.
python3 -m verl.trainer.main \
config=examples/config.yaml \
data.max_response_length=8192 \
data.max_prompt_length=2048 \
worker.rollout.max_num_batched_tokens=10240 \
data.train_files=$OPENR1DATA \
data.format_prompt=examples/format_prompt/math.jinja \
data.val_files=$MATHTEST \
worker.actor.model.model_path=Qwen/Qwen3-1.7B-Base \
trainer.save_checkpoint_path=$SAVE_DIR \
worker.rollout.n=8 \
algorithm.use_kl_loss=false \
algorithm.disable_kl=true \
algorithm.kl_coef=0.0 \
data.use_lorem=true \
data.lorem_word_min=100 \
data.lorem_word_max=300 \
data.rollout_batch_size=128 \
data.val_batch_size=1024 \
worker.actor.global_batch_size=128 \
trainer.val_before_train=true \
trainer.n_gpus_per_node=4 \
worker.rollout.gpu_memory_utilization=0.8

We use EvalScope with sampling temperature 0.6 and top-p 0.95. We report Acc@1 for MATH-500, GSM8K, and AMC, and Mean@32 for AIME24 and AIME25.
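Mean@k is commonly computed as the fraction of correct answers among k samples per question, averaged over questions, with Acc@1 being the k = 1 case. The sketch below assumes that definition; EvalScope's exact implementation may differ, and `mean_at_k` is a hypothetical helper name.

```python
def mean_at_k(correct):
    """correct[i][j] is True when sample j for question i is right.

    Returns the per-question accuracy averaged over questions, in percent.
    """
    per_question = [sum(row) / len(row) for row in correct]
    return 100.0 * sum(per_question) / len(per_question)

# Two questions with k = 4 samples each (placeholder outcomes).
outcomes = [
    [True, False, False, False],
    [True, True, False, False],
]
print(mean_at_k(outcomes))
```

Averaging over many samples reduces the variance of the score on small benchmarks such as AIME, where each question carries a large share of the total.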
| Parameter | Value |
|---|---|
| Group size G | 8 |
| Resample size G′ | 24 |
| Rollout temperature | 1.0 |
| Eval temperature | 0.6 |
| Eval top-p | 0.95 |
| Lorem sequence length | 100–300 tokens |
| Max response length | 8,192 tokens |
| Max input length | 2,048 tokens |
| KL coefficient | 0 (removed) |
A short boundary instruction, "\nPlease reason step by step, and put your final answer within \boxed{}.", is appended after the Lorem Ipsum sequence to keep the model from generating corrupted outputs.
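A perturbed prompt could be assembled roughly as below. This is a sketch under assumptions: the layout (perturbation, then boundary instruction, then the original prompt on a new line) is inferred from the description above, the small word list is a stand-in for a real Lorem Ipsum generator such as python-lorem, and the helper names are hypothetical. The 100 and 300 defaults mirror `data.lorem_word_min` and `data.lorem_word_max` in the training command.

```python
import random

# Stand-in vocabulary; a real run would draw from a full Lorem Ipsum generator.
LOREM_WORDS = (
    "lorem ipsum dolor sit amet consectetur adipiscing elit sed do "
    "eiusmod tempor incididunt ut labore et dolore magna aliqua"
).split()

BOUNDARY = "\nPlease reason step by step, and put your final answer within \\boxed{}."

def lorem_sequence(rng, word_min=100, word_max=300):
    """Sample a random-length sequence of Lorem Ipsum words."""
    n_words = rng.randint(word_min, word_max)
    return " ".join(rng.choice(LOREM_WORDS) for _ in range(n_words))

def perturb_prompt(prompt, rng):
    """Assumed layout: perturbation, boundary instruction, original prompt."""
    return lorem_sequence(rng) + BOUNDARY + "\n" + prompt

rng = random.Random(0)  # fixed seed so the example is reproducible
perturbed = perturb_prompt("What is 17 * 24?", rng)
print(perturbed[:80])
```

Passing an explicit `random.Random` instance keeps each resampling round reproducible while still drawing a fresh perturbation per call.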
If you find LoPE useful in your research, please cite:
@article{huang2026lope,
title = {Nonsense Helps: Prompt Space Perturbation Broadens Reasoning Exploration},
author = {Huang, Langlin and Huang, Chengsong and Li, Jinyuan and Cai, Donghong and Yang, Yuyi and Huang, Jiaxin},
journal = {arXiv preprint},
year = {2026}
}

This research was supported in part by the NVIDIA Academic Grant Program and WashU Ignite Interdisciplinary Grants.
Our code is built upon EasyR1. We thank the authors of Qwen, OpenR1-Math, and python-lorem for their open-source contributions.