-
-
Notifications
You must be signed in to change notification settings - Fork 229
Fix/rapl mmio #950
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Fix/rapl mmio #950
Conversation
e07e2e6 to
b7ecb41
Compare
|
Another pass with Copilot and experiment on a laptop with modern CPU Intel(R) Core(TM) Ultra 7 265H: Improvements to fix double-counting issues in CodeCarbon's Intel RAPL power monitoring: 🎯 Improvement #1: Use ONLY psys When Available (Best Solution!)
The issue: psys already includes package, core, and uncore. Summing them all causes massive over-counting! Your System Example:
Solution: When psys domain is detected, CodeCarbon now uses ONLY psys and ignores all other domains. This is the most accurate approach for modern Intel systems (Skylake and newer). 🎯 Improvement #2: Deduplicate MSR vs MMIO Domains
These measure the SAME physical CPU package but CodeCarbon was counting both! Solution:
📊 Impact on Your System Before (with all domains): After (psys-only mode): Fallback (if no psys, with deduplication): 🧪 Test Coverage Added 2 comprehensive tests:
All existing tests updated and passing ✅ 💡 Key Benefits
|
|
With this PR, measurement can change for our users. For example we get rid of the double-counting on AMD Threadripper. Mesurement done with a smartplug and an "AMD Ryzen Threadripper 1950X 16-Core Processor with a TDP of 180.0 W":
So we publish it as a minor version instead of a patch ? |
| logger.error(e, exc_info=True) | ||
|
|
||
| def live_out(self, total: EmissionsData, delta: EmissionsData): | ||
| self.out(total, delta) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done this to fix TestCarbonTrackerFlush but I don't understand why it was failing.
Maybe it was only on my local machine and someone else as to test this ?
What is the impact of removing this ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't remove this! It is possible that when you run it in local it is reading your codecarbon config? Normally the CI runs this in a clean environment.
Maybe I am wrong and this is documenting twice, can you share the error?
The goal of this was to emit the logs live and not wait for the out/flush to be called
|
Finaly, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice thanks @benoit-cty , left some comments
| max_micro_joules = float(f.read()) | ||
| try: | ||
| self.last_energy = self._get_value() | ||
| except Exception as e: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not needed since _get_value does not raise any error (it has the try-catch)
| @@ -0,0 +1,148 @@ | |||
| # RAPL Measurement Fix Summary | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe do not commit this file
| logger.error(e, exc_info=True) | ||
|
|
||
| def live_out(self, total: EmissionsData, delta: EmissionsData): | ||
| self.out(total, delta) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't remove this! It is possible that when you run it in local it is reading your codecarbon config? Normally the CI runs this in a clean environment.
Maybe I am wrong and this is documenting twice, can you share the error?
The goal of this was to emit the logs live and not wait for the out/flush to be called
A parameter has been added to allow users to use psys if they wanted. As CodeCarbon does not use it previously, it is set to False by default. But in V4 we could set it to True as it seems reliable on modern hardware. |
Description
AI Disclaimer : code created with Codex-CLI and GPT5-mini, then Copilot+Claude Sonnet 4.5
Problem
CodeCarbon was not properly handling systems with multiple RAPL providers (e.g.,
intel-raplandintel-rapl-mmioin/sys/devices/virtual/powercap/). Permission errors on one provider (like intel-rapl-mmio) would cause the entire tracker to fail, even when another provider had readable domains.Solution
Updated the RAPL scanning logic to:
Scan all common RAPL locations (when using the default path):
/sys/class/powercap/intel-rapl/subsystem/sys/class/powercap/intel-rapl(parent)/sys/class/powercap/sys/devices/virtual/powercap← Now includes intel-rapl-mmioGracefully handle permission errors:
Smart path selection for testing:
Files Modified
codecarbon/core/cpu.pyis_rapl_available(): Updated to scan all RAPL providers, distinguish between default and custom pathsIntelRAPL._fetch_rapl_files(): Updated to scan all providers, handle permission errors gracefully, track availabilityTest Files
tests/test_cpu.py: UpdatedTestIntelRAPL.setUp()to create proper RAPL hierarchy (rapl_dir/intel-rapl/intel-rapl:N/)tests/test_rapl_permissions.py: Updated tests to use proper RAPL provider structuretests/test_rapl_mmio_scanning.py: New comprehensive tests for multi-provider scenariosBehavior
Production (Default Path)
Testing (Custom Path)
Example Real System Structure
Key Features
✅ Discovers all RAPL providers (intel-rapl, intel-rapl-mmio, etc.)
✅ Handles permission errors gracefully (warns and continues)
✅ Only fails if no readable main domain is found
✅ Test isolation (custom paths don't scan system files)
✅ Backward compatible (existing code continues to work)
Test Coverage
All tests passing:
tests/test_cpu.py::TestIntelRAPL(2 tests)tests/test_rapl_permissions.py(2 tests)tests/test_rapl_mmio_scanning.py(2 tests - NEW)Example Output
When intel-rapl-mmio has permission issues:
Migration Notes
No changes required for existing code. The tracker will automatically:
To grant permissions for all RAPL files:
Related Issue
Will close #915
How Has This Been Tested?
Types of changes
What types of changes does your code introduce? Put an
xin all the boxes that apply:Checklist:
Go over all the following points, and put an
xin all the boxes that apply.