Skip to content

fix: resolve memory leak in VPI callback cleanup#5547

Open
Dragon-Git wants to merge 9 commits into
cocotb:masterfrom
Dragon-Git:fix/5545-vpi-timer-memory-leak
Open

fix: resolve memory leak in VPI callback cleanup#5547
Dragon-Git wants to merge 9 commits into
cocotb:masterfrom
Dragon-Git:fix/5545-vpi-timer-memory-leak

Conversation

@Dragon-Git
Copy link
Copy Markdown

Fixes #5545

This PR resolves the linear RSS memory leak when repeatedly using Timer triggers on Synopsys VCS and Cadence Xcelium.

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 24, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 80.94%. Comparing base (af0150d) to head (9f0ec6a).
⚠️ Report is 11 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5547      +/-   ##
==========================================
+ Coverage   80.86%   80.94%   +0.08%     
==========================================
  Files          65       65              
  Lines        9422     9425       +3     
  Branches     2468     2463       -5     
==========================================
+ Hits         7619     7629      +10     
+ Misses       1312     1304       -8     
- Partials      491      492       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Dragon-Git Dragon-Git marked this pull request as ready for review April 25, 2026 03:13
@Dragon-Git Dragon-Git force-pushed the fix/5545-vpi-timer-memory-leak branch from d4a5a41 to 49ce17e Compare April 25, 2026 04:45
@ktbarrett
Copy link
Copy Markdown
Member

Only thing missing is a newsfragment. See docs/source/newsfragments/README.rst for some details.

@ktbarrett
Copy link
Copy Markdown
Member

These memory leak tests take quite a long time. Can they still do a decent job if the number of iterations is cut to 200k or 500k?

@ktbarrett
Copy link
Copy Markdown
Member

The tests are so long they are blowing up the coverage tool.

@Dragon-Git
Copy link
Copy Markdown
Author

The testing time is indeed too long. I reduced the number of iterations to 100K.

@Dragon-Git
Copy link
Copy Markdown
Author

I observed a failure in the last CI run. Specifically, the NextTimeStep test failed when executed with Verilator 5.046 on the Ubuntu. Memory consumption reached 128KB. I tested it on the Verilator 5.046 macOS platform, and the diff is 0. I don't have an Ubuntu environment. Do we need to survey the environment, or do we need to feed this back upstream?

@Dragon-Git Dragon-Git force-pushed the fix/5545-vpi-timer-memory-leak branch from 8cc487f to cbba87f Compare April 26, 2026 04:44
@ktbarrett
Copy link
Copy Markdown
Member

It seems like these tests are a bit flakey now that they are smaller. Is that expected? The Riviera tests weren't failing before and now they all are. Even Xcelium with the fix is failing in this run.

I'm wondering if we keep this test at a larger size, like 200k or 500k, but run it as part of a different regression set. That isn't something you would have to do, just throwing it out there to see what you think.

@Dragon-Git
Copy link
Copy Markdown
Author

It seems that the noise is higher than I expected, but I think 100K tests are sufficient. We need to raise the threshold. Assuming that each leak is the size of the callback structure, that is, on a 32-bit system, the minimum occupancy is 28 Bytes, and 100k calls, the leak will reach 2.67 MB. It is reasonable for us to set the threshold to 2 MB.

typedef struct t_cb_data
{
  PLI_INT32    reason;                        /* callback reason */
  PLI_INT32    (*cb_rtn)(struct t_cb_data *); /* call routine */
  vpiHandle    obj;                           /* trigger object */
  p_vpi_time   time;                          /* callback time */
  p_vpi_value  value;                         /* trigger object value */
  PLI_INT32    index;                         /* index of the memory word or
                                                 var select that changed */
  PLI_BYTE8   *user_data;
} s_cb_data, *p_cb_data;

@Dragon-Git
Copy link
Copy Markdown
Author

Could you please share the Questa and Riviera test logs for this PR?
I’d like to check the exact RSS diff values to verify whether there is a genuine memory leak or just normal memory fluctuation.

@ktbarrett
Copy link
Copy Markdown
Member

riviera.log

The questa failure has to do with it not consistently supporting NextTimeStep. It's kind of a finicky trigger. The issue is that the Clock and the Timer you have "finish" at the same time and so there are no active events and that causes this trigger to segfault. Playing with the Timer to not fall on the same time as the clock may fix it. Otherwise feel free to skip it.

@Dragon-Git
Copy link
Copy Markdown
Author

I skipped the test of next_step_time on Questa. Riviera seems that there is a memory leak. I tried to add a vpi_free_object to it.

@Dragon-Git Dragon-Git force-pushed the fix/5545-vpi-timer-memory-leak branch from a109270 to ca102e8 Compare April 28, 2026 17:18
Comment thread tests/test_cases/test_triger_memory_leak/test_triger_memory_leak.py Outdated
@ktbarrett
Copy link
Copy Markdown
Member

Looks like Riviera also need to modify VpiValueCbHdl::run.

@Dragon-Git
Copy link
Copy Markdown
Author

vpi_remove_cb is now called in the VpiValueCbHdl::run, There should be no more memory leaks. Based on the last log, we should also consider raising the threshold for Riviera.

@Dragon-Git
Copy link
Copy Markdown
Author

Could you please share the Riviera test logs for this CI?

@ktbarrett
Copy link
Copy Markdown
Member

ktbarrett commented Apr 30, 2026

log looks like Questa has the same leak of the callback, the define is MODELSIM.

Also, I noticed the test name is test_triger_memory_leak, there's a missing 'g' there.

@Dragon-Git
Copy link
Copy Markdown
Author

I noticed that for Riviera and ModelSim, both the rising edge callback and the falling edge callback consume more memory than the value change callback. I guess for these two emulators, we need to manually remove the callback or release the handle on each value change callback.

@Dragon-Git Dragon-Git force-pushed the fix/5545-vpi-timer-memory-leak branch from cee2ae1 to 4df88d3 Compare April 30, 2026 14:21
@ktbarrett
Copy link
Copy Markdown
Member

ktbarrett commented Apr 30, 2026

Riviera segfaults on the rising_edge test.

Questa:

# 802999.00ns WARNING  cocotb.regression                  test_trigger_memory_leak.test_raise_edge_leak failed
#                                                         Traceback (most recent call last):
#                                                           File "cocotb/tests/test_cases/test_trigger_memory_leak/test_trigger_memory_leak.py", line 117, in test_raise_edge_leak
#                                                             assert diff <= MEMORY_LEAK_TH, "Memory leak"
#                                                         AssertionError: Memory leak
#                                                         assert 11403264 <= 4194304
# 802999.00ns INFO     cocotb.regression                  running test_trigger_memory_leak.test_falling_edge_leak (5/8)
# 1002998.50ns WARNING  cocotb.regression                  test_trigger_memory_leak.test_falling_edge_leak failed
#                                                         Traceback (most recent call last):
#                                                           File "cocotb/tests/test_cases/test_trigger_memory_leak/test_trigger_memory_leak.py", line 132, in test_falling_edge_leak
#                                                             assert diff <= MEMORY_LEAK_TH, "Memory leak"
#                                                         AssertionError: Memory leak
#                                                         assert 12189696 <= 4194304
# 1002998.50ns INFO     cocotb.regression                  running test_trigger_memory_leak.test_value_change_leak (6/8)
# 1152998.00ns WARNING  cocotb.regression                  test_trigger_memory_leak.test_value_change_leak failed
#                                                         Traceback (most recent call last):
#                                                           File "cocotb/tests/test_cases/test_trigger_memory_leak/test_trigger_memory_leak.py", line 147, in test_value_change_leak
#                                                             assert diff <= MEMORY_LEAK_TH, "Memory leak"
#                                                         AssertionError: Memory leak
#                                                         assert 4718592 <= 4194304

@Dragon-Git Dragon-Git force-pushed the fix/5545-vpi-timer-memory-leak branch from 4df88d3 to b6373a7 Compare May 2, 2026 06:41
@Dragon-Git
Copy link
Copy Markdown
Author

The attempt failed, but the above phenomenon still deserves attention. I reverted to the previous version, and skipped the value change related tests on questa. Do you have any ideas about this phenomenon?

@ktbarrett
Copy link
Copy Markdown
Member

About rising edge and falling edge using more? Yeah that's because the VPI only uses "value change" and we check to see if the value after change is 1 (rising edge) or 0 (falling edge) and if the current value change is to a different value, we keep going. If you are measuring clock cycles you'll get two edges per actual rising edge where the other is an ignored falling edge.

@Dragon-Git
Copy link
Copy Markdown
Author

Yes, I've test in Verilator, and confirmed that as long as we don't remove the callback, the callback's vpiHandle remains the same, so no memory leak occurs. However, for Questa and Riviera, I suspect that a different vpiHandle is generated each time, which causes memory leaks for half of the callbacks. So how should we handle this? Should we add dedicated handling for these two simulators when pass == false, by remove the callback and re-registering a new one? This may impact performance, or should we report this issue upstream instead?

@Dragon-Git Dragon-Git force-pushed the fix/5545-vpi-timer-memory-leak branch from b6373a7 to 369d4e0 Compare May 2, 2026 14:04
@Dragon-Git
Copy link
Copy Markdown
Author

I have tested the scheme of unregistering and re-registering callbacks in Verilator. The performance impact is negligible.

In the test with 10M rising-edge trigger, both approaches finish in around 30 seconds.
For the 20M rising-edge trigger, the baseline scheme without unregistering/re-registering takes approximately 58 seconds, while the scheme with callback unregistration followed by re-registration takes about 60 seconds.

If the tests in Questa and Riviera pass, we can add this change, but we need to provide feedback to the upstream.

@Dragon-Git
Copy link
Copy Markdown
Author

Is the diff still consistent with before?

@ktbarrett ktbarrett added this to the 2.1 milestone May 13, 2026
@Dragon-Git Dragon-Git force-pushed the fix/5545-vpi-timer-memory-leak branch from 369d4e0 to 9f0ec6a Compare May 14, 2026 13:55
@ktbarrett
Copy link
Copy Markdown
Member

Sorry, been very distracted the past couple weeks. Let's try to get this merged this week. I appreciate the effort.

@Dragon-Git
Copy link
Copy Markdown
Author

Dragon-Git commented May 14, 2026

Thanks for the feedback! Regarding the memory issue with Questa/Riviera-PRO during rising/falling edge tests, I've temporarily skipped the related test on Questa due to limited local debugging environment and reliance on CI for iteration. To prioritize landing this PR this week, should we:

Option A: Merge this and address the skipped test in a follow-up? (Ideal if the issue is isolated)

Option B: Debug deeper before merging? (If critical to resolve upfront)

Please advise your preference—I'm ready to adjust accordingly. Appreciate your guidance!

@ktbarrett
Copy link
Copy Markdown
Member

I want a deeper debug. I have access to the machine with the commercial tools so I can runs diagnostics more quickly.

Just to summarize where we are:

Questa is failing on these tests and we are uncertain if the issue is that is also needs the vpi_free_object or if it's creating a new callback on every iteration meaning we should be using vpi_remove_cb?

@ktbarrett ktbarrett changed the title fix: resolve memory leak in VPI callback cleanup for VCS/Xcelium fix: resolve memory leak in VPI callback cleanup May 14, 2026
@Dragon-Git
Copy link
Copy Markdown
Author

Dragon-Git commented May 14, 2026

Yes, I have tested both of these ideas in previous CIs, but they have failed. We may need to use valgrind to track memory. Riviera and Questa both have a similar phenomenon, but Questa use more memory.

@ktbarrett
Copy link
Copy Markdown
Member

It's possible it's a leak in Questa, or maybe removing a cb and adding it back in the same cycle triggers it. I wonder if it's repeating like Verilator and if so if we can just cache it and disable/re-enable it. cocotb had a system like that previously but it was broken over the years partially because it wasn't easy to understand the way it was written and totally uncommented.

@ktbarrett
Copy link
Copy Markdown
Member

ktbarrett commented May 15, 2026

Yeah this is just due to Questa not supporting vpi_remove_cb while the callback is executing. This will require a hefty refactor. Let's just get what we have in.

@ktbarrett ktbarrett force-pushed the fix/5545-vpi-timer-memory-leak branch from e07a7bd to 539108f Compare May 15, 2026 02:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Memory leak with repeated Timer on VPI‑based simulators

2 participants