Fix/rapl mmio #950

benoit-cty · 2025-10-12T07:55:38Z

Description

AI disclaimer: code created with Codex-CLI and GPT5-mini, then Copilot + Claude Sonnet 4.5.

Problem

CodeCarbon was not properly handling systems with multiple RAPL providers (e.g., intel-rapl and intel-rapl-mmio in /sys/devices/virtual/powercap/). Permission errors on one provider (like intel-rapl-mmio) would cause the entire tracker to fail, even when another provider had readable domains.

Solution

Updated the RAPL scanning logic to:

1. Scan all common RAPL locations (when using the default path):
   - /sys/class/powercap/intel-rapl/subsystem
   - /sys/class/powercap/intel-rapl (parent)
   - /sys/class/powercap
   - /sys/devices/virtual/powercap ← Now includes intel-rapl-mmio
2. Gracefully handle permission errors:
   - Permission errors on individual domains are logged as warnings
   - Tracker continues to work with readable domains
   - Only fails if NO readable main/package domain is found
3. Smart path selection for testing:
   - Production (default path): Scans all system locations
   - Testing (custom path): Only scans the provided directory to avoid system interference
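
The scanning and permission handling listed above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `find_readable_energy_files` and the logger name are hypothetical, and for brevity the sketch fails only when no `energy_uj` file at all is readable, rather than checking specifically for a main/package domain.

```python
import logging
import os

logger = logging.getLogger("rapl_sketch")  # hypothetical logger name


def find_readable_energy_files(base_dirs):
    """Return the energy_uj files that can be read; warn about and skip the rest.

    Hypothetical helper for illustration only, not CodeCarbon's API.
    """
    readable = []
    seen = set()
    for base in base_dirs:
        if not os.path.isdir(base):
            continue
        # os.walk does not follow symlinks by default, which avoids the
        # symlink loops found in sysfs (e.g. the "subsystem" links).
        for root, _dirs, files in os.walk(base):
            if "energy_uj" not in files:
                continue
            path = os.path.realpath(os.path.join(root, "energy_uj"))
            if path in seen:  # same file reached through another base dir
                continue
            seen.add(path)
            try:
                with open(path) as f:
                    float(f.read())
            except PermissionError:
                logger.warning("Permission denied reading %s; skipping.", path)
                continue
            except (OSError, ValueError) as exc:
                logger.warning("Cannot use %s (%s); skipping.", path, exc)
                continue
            readable.append(path)
    if not readable:
        raise PermissionError("No readable RAPL energy_uj file found")
    return readable
```

Because the sketch does not follow symlinks, it would find domains when given `/sys/devices/virtual/powercap` directly; a real implementation has to decide how to handle the symlinked entries under `/sys/class/powercap`.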

Files Modified

`codecarbon/core/cpu.py`

is_rapl_available(): Updated to scan all RAPL providers, distinguish between default and custom paths
IntelRAPL._fetch_rapl_files(): Updated to scan all providers, handle permission errors gracefully, track availability

Test Files

tests/test_cpu.py: Updated TestIntelRAPL.setUp() to create proper RAPL hierarchy (rapl_dir/intel-rapl/intel-rapl:N/)
tests/test_rapl_permissions.py: Updated tests to use proper RAPL provider structure
tests/test_rapl_mmio_scanning.py: New comprehensive tests for multi-provider scenarios

Behavior

Production (Default Path)

rapl = IntelRAPL()  # Uses default path
# Scans:
# - /sys/class/powercap/intel-rapl/subsystem
# - /sys/class/powercap/intel-rapl
# - /sys/class/powercap
# - /sys/devices/virtual/powercap  ← Finds intel-rapl-mmio here

Testing (Custom Path)

rapl = IntelRAPL(rapl_dir="/tmp/test/rapl")
# Only scans:
# - /tmp/test/rapl
# - /tmp/test  (parent)
# Avoids interference with system /sys files
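
The default-versus-custom path selection shown in the two cases above could look like the sketch below. The function name `candidate_rapl_dirs` is hypothetical, and the value of `DEFAULT_RAPL_DIR` is an assumption inferred from the first entry of the scan list in this description; the real default in `codecarbon/core/cpu.py` may differ.

```python
import os

# Assumed default; inferred from the scan list in this PR description.
DEFAULT_RAPL_DIR = "/sys/class/powercap/intel-rapl/subsystem"


def candidate_rapl_dirs(rapl_dir=DEFAULT_RAPL_DIR):
    """Return the directories to scan for RAPL providers (hypothetical helper)."""
    if rapl_dir == DEFAULT_RAPL_DIR:
        # Production: look in every common system location.
        return [
            "/sys/class/powercap/intel-rapl/subsystem",
            "/sys/class/powercap/intel-rapl",
            "/sys/class/powercap",
            "/sys/devices/virtual/powercap",
        ]
    # Testing: stay inside the provided directory and its parent only.
    return [rapl_dir, os.path.dirname(rapl_dir.rstrip("/"))]
```

Keying the behaviour on whether the caller passed the default path keeps test fixtures from accidentally picking up the host's real `/sys` files.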

Example Real System Structure

/sys/devices/virtual/powercap/
├── intel-rapl/
│   ├── intel-rapl:0/  (package-0)
│   │   ├── energy_uj
│   │   ├── intel-rapl:0:0/  (core)
│   │   ├── intel-rapl:0:1/  (uncore)
│   │   └── intel-rapl:0:2/  (dram)
│   └── intel-rapl:1/  (psys)
│       └── energy_uj
└── intel-rapl-mmio/
    └── intel-rapl-mmio:0/  (package-0)
        ├── energy_uj  ← May have permission errors
        └── intel-rapl-mmio:0:0/  (core)
            └── energy_uj

Key Features

✅ Discovers all RAPL providers (intel-rapl, intel-rapl-mmio, etc.)
✅ Handles permission errors gracefully (warns and continues)
✅ Only fails if no readable main domain is found
✅ Test isolation (custom paths don't scan system files)
✅ Backward compatible (existing code continues to work)

Test Coverage

All tests passing:

tests/test_cpu.py::TestIntelRAPL (2 tests)
tests/test_rapl_permissions.py (2 tests)
tests/test_rapl_mmio_scanning.py (2 tests - NEW)

Example Output

When intel-rapl-mmio has permission issues:

[codecarbon WARNING] Permission denied reading RAPL file /sys/devices/virtual/powercap/intel-rapl-mmio/intel-rapl-mmio:0/energy_uj. You can grant read permission with: sudo chmod -R a+r /sys/class/powercap/*; skipping.
[codecarbon INFO] Tracking Intel CPU via RAPL interface
✓ Using readable domains from intel-rapl provider

Migration Notes

No changes required for existing code. The tracker will automatically:

Discover and use all available RAPL providers
Skip unreadable domains with warnings
Continue working with any readable main/package domain

To grant permissions for all RAPL files:

sudo chmod -R a+r /sys/class/powercap/*
sudo chmod -R a+r /sys/devices/virtual/powercap/*

Related Issue

Will close #915

How Has This Been Tested?

tests/test_rapl_permissions.py:
- test_main_rapl_permission_error: ensures initialization raises when the main intel-rapl:0/energy_uj is unreadable.
- test_non_main_rapl_permission_warning_and_skip: ensures unreadable non-main domains are skipped and a warning is logged.
Tests use tmp_path and are Linux-only (they simulate sysfs trees and change file permissions).
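
For readers who want to reproduce this kind of test, here is a hedged sketch of a fixture that simulates a sysfs tree with one unreadable domain. `make_fake_rapl_tree` and the test name are hypothetical and are not the helpers used in this PR. The sketch checks the file mode rather than actually attempting a read, because a process running as root can still read a file whose mode is 0.

```python
import stat
from pathlib import Path


def make_fake_rapl_tree(root, unreadable=()):
    """Build a fake powercap tree under root; return {relative dir: energy file}."""
    layout = {
        "intel-rapl/intel-rapl:0": "package-0",
        "intel-rapl/intel-rapl:0/intel-rapl:0:0": "core",
        "intel-rapl-mmio/intel-rapl-mmio:0": "package-0",
    }
    files = {}
    for rel_dir, name in layout.items():
        domain_dir = Path(root) / rel_dir
        domain_dir.mkdir(parents=True, exist_ok=True)
        (domain_dir / "name").write_text(name + "\n")
        energy = domain_dir / "energy_uj"
        energy.write_text("1000000\n")
        if rel_dir in unreadable:
            energy.chmod(0)  # simulates a permission error for non-root users
        files[rel_dir] = energy
    return files


def test_fake_tree_marks_mmio_unreadable(tmp_path):
    mmio_dir = "intel-rapl-mmio/intel-rapl-mmio:0"
    files = make_fake_rapl_tree(tmp_path, unreadable={mmio_dir})
    # The MMIO energy file has no permission bits set...
    assert stat.S_IMODE(files[mmio_dir].stat().st_mode) == 0
    # ...while the MSR package domain stays readable.
    assert files["intel-rapl/intel-rapl:0"].read_text().strip() == "1000000"
```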

Types of changes

What types of changes does your code introduce? Put an x in all the boxes that apply:

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

Go over all the following points, and put an x in all the boxes that apply.

My code follows the code style of this project.
My change requires a change to the documentation.
I have updated the documentation accordingly.
I have read the CONTRIBUTING.md document.
I have added tests to cover my changes.
All new and existing tests passed.

benoit-cty · 2025-10-26T14:40:12Z

Another pass with Copilot, plus an experiment on a laptop with a modern Intel(R) Core(TM) Ultra 7 265H CPU:

Improvements to fix double-counting issues in CodeCarbon's Intel RAPL power monitoring:

🎯 Improvement #1: Use ONLY psys When Available (Best Solution!)
Problem: Modern Intel CPUs expose multiple overlapping RAPL domains:

psys (platform/system) = Total platform power
package-0 = CPU package power
core = CPU cores only
uncore = Memory controller, cache, iGPU

The issue: psys already includes package, core, and uncore. Summing them all causes massive over-counting!

Your System Example:

Old behavior: 9.61W (psys) + 3.78W (package) + 0.84W (core) + 0.21W (uncore) = 14.44W ❌
New behavior: 9.61W (psys only) ✅

Solution: When psys domain is detected, CodeCarbon now uses ONLY psys and ignores all other domains. This is the most accurate approach for modern Intel systems (Skylake and newer).

🎯 Improvement #2: Deduplicate MSR vs MMIO Domains
Problem: Same physical domains appear through two interfaces:

intel-rapl:0/package-0 (MSR-based, older interface) = 3.93W
intel-rapl-mmio:0/package-0 (MMIO-based, newer interface) = 3.78W

These measure the SAME physical CPU package but CodeCarbon was counting both!

Solution:

Detects duplicate domains by name
Prefers MMIO over MSR (newer, recommended interface)
Only deduplicates after checking readability (graceful fallback if MMIO is unreadable)
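
The deduplication rule described above (match by name, prefer MMIO, but only among readable domains) can be sketched as follows. The `Domain` class and `deduplicate` function are hypothetical names used for illustration; the actual implementation in this PR may be structured differently.

```python
from dataclasses import dataclass

MMIO_PROVIDER = "intel-rapl-mmio"


@dataclass(frozen=True)
class Domain:
    provider: str  # "intel-rapl" (MSR) or "intel-rapl-mmio" (MMIO)
    name: str      # e.g. "package-0", "core", "uncore"
    readable: bool


def deduplicate(domains):
    """Keep one readable domain per name, preferring the MMIO provider."""
    chosen = {}
    for domain in domains:
        if not domain.readable:
            # Readability is checked first, so an unreadable MMIO domain
            # never displaces a readable MSR one.
            continue
        current = chosen.get(domain.name)
        if current is None:
            chosen[domain.name] = domain
        elif domain.provider == MMIO_PROVIDER and current.provider != MMIO_PROVIDER:
            chosen[domain.name] = domain
    return list(chosen.values())
```

With a readable `package-0` from both providers, only the MMIO entry survives; if the MMIO file is unreadable, the MSR entry is kept instead.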

📊 Impact on Your System

Before (with all domains):

psys: 9.61W
core: 0.84W
package-0 (MSR): 3.93W     } Same physical package!
package-0 (MMIO): 3.78W    }
uncore: 0.21W
─────────────────────────
Total: 18.37W ❌ WRONG - Triple counting!

After (psys-only mode):

psys: 9.61W
─────────────────────────
Total: 9.61W ✅ CORRECT!

Fallback (if no psys, with deduplication):

package-0 (MMIO only): 3.78W
core: 0.84W
uncore: 0.21W
─────────────────────────
Total: 4.83W ✅ CORRECT (no double-counting)
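
The three totals above can be re-derived with a few lines of arithmetic. The readings are the ones quoted in this comment; the variable names are only illustrative.

```python
# Power readings in watts, as quoted in this comment.
readings_w = {
    "psys": 9.61,
    "core": 0.84,
    "package-0 (MSR)": 3.93,
    "package-0 (MMIO)": 3.78,
    "uncore": 0.21,
}

# Before: every domain is summed, including both package-0 views.
total_all = round(sum(readings_w.values()), 2)

# After: psys alone.
total_psys = readings_w["psys"]

# Fallback without psys: one package-0 (MMIO) plus core and uncore.
total_fallback = round(
    readings_w["package-0 (MMIO)"] + readings_w["core"] + readings_w["uncore"], 2
)

print(total_all, total_psys, total_fallback)  # 18.37 9.61 4.83
```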

🧪 Test Coverage

Added 2 comprehensive tests:

test_psys_only_when_available() - Verifies psys-only behavior
test_rapl_deduplication_prefers_mmio() - Verifies MMIO preference when deduplicating

All existing tests updated and passing ✅

💡 Key Benefits

Accuracy: Eliminates all double/triple counting issues
Simplicity: One measurement (psys) on modern systems
Robustness: Smart fallback when psys unavailable
Future-proof: Handles both MSR and MMIO interfaces
Clear logging: INFO messages explain what's being measured

benoit-cty · 2025-11-01T09:02:44Z

With this PR, measurements can change for our users. For example, we get rid of the double-counting on AMD Threadripper.

Measurement taken with a smart plug on an "AMD Ryzen Threadripper 1950X 16-Core Processor with a TDP of 180.0 W":

Idle: 100W for the whole computer on the smart plug (~20W reported for the CPU by CodeCarbon)
Full load: 280W for the whole computer on the smart plug (~160W by CodeCarbon)

So should we publish it as a minor version instead of a patch?


benoit-cty · 2025-11-01T14:34:13Z

codecarbon/output_methods/logger.py

            logger.error(e, exc_info=True)

    def live_out(self, total: EmissionsData, delta: EmissionsData):
-        self.out(total, delta)


I did this to fix TestCarbonTrackerFlush, but I don't understand why it was failing.
Maybe it only failed on my local machine and someone else has to test this?

What is the impact of removing this?


benoit-cty · 2025-11-01T14:49:53Z

Finally, psys (platform/system) was not accurate on an old laptop with an Intel CPU, so I switched back to package. Testing is welcome!

inimaz

Nice, thanks @benoit-cty, left some comments.

inimaz · 2025-11-02T11:01:53Z

codecarbon/core/rapl.py

-            max_micro_joules = float(f.read())
+        try:
+            self.last_energy = self._get_value()
+        except Exception as e:


This is not needed, since _get_value does not raise any error (it already has its own try/except).
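
To illustrate the reviewer's point, here is a hedged sketch of a reader whose `_get_value` handles errors internally, so the caller needs no extra `try`/`except`. The class `EnergyReader`, the logger name, and the choice to return `0.0` on failure are assumptions made for this example; they are not taken from `codecarbon/core/rapl.py`.

```python
import logging

logger = logging.getLogger("rapl_sketch")  # hypothetical logger name


class EnergyReader:
    """Hypothetical reader for a single RAPL energy_uj file."""

    def __init__(self, path):
        self.path = path
        # No outer try/except here: _get_value already handles failures.
        self.last_energy = self._get_value()

    def _get_value(self):
        """Return the counter value in microjoules, or 0.0 if it cannot be read."""
        try:
            with open(self.path) as f:
                return float(f.read())
        except (OSError, ValueError) as exc:
            logger.warning("Unable to read %s: %s", self.path, exc)
            return 0.0
```

Wrapping the call in a second handler would only catch exceptions that the inner handler has already absorbed.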

inimaz · 2025-11-02T11:05:33Z

examples/rapl/RAPL_FIX_SUMMARY.md

@@ -0,0 +1,148 @@
+# RAPL Measurement Fix Summary


Maybe do not commit this file

inimaz · 2025-11-02T11:30:34Z

codecarbon/output_methods/logger.py

            logger.error(e, exc_info=True)

    def live_out(self, total: EmissionsData, delta: EmissionsData):
-        self.out(total, delta)


Don't remove this! Is it possible that, when you run it locally, it is reading your codecarbon config? Normally the CI runs this in a clean environment.

Maybe I am wrong and this is documenting twice; can you share the error?
The goal of this was to emit the logs live instead of waiting for out/flush to be called.

benoit-cty · 2025-11-02T20:27:22Z

Finally, psys (platform/system) was not accurate on an old laptop with an Intel CPU, so I switched back to package. Testing is welcome!

A parameter has been added to let users opt in to psys if they want. As CodeCarbon did not use psys previously, it is set to False by default. In V4 we could set it to True, as it seems reliable on modern hardware.
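
The opt-in behaviour described in this comment could be sketched as below. The function `select_domains` and the parameter name `prefer_psys` are hypothetical and chosen for illustration; check the CodeCarbon documentation for the actual parameter name.

```python
def select_domains(domain_names, prefer_psys=False):
    """Pick which RAPL domains to measure (hypothetical helper).

    By default only package domains are used; psys is used alone when the
    caller opts in and the domain exists.
    """
    if prefer_psys and "psys" in domain_names:
        return ["psys"]
    return [name for name in domain_names if name.startswith("package")]


available = ["psys", "package-0", "core", "uncore"]
print(select_domains(available))                    # ['package-0']
print(select_domains(available, prefer_psys=True))  # ['psys']
```

Falling back to the package domains when psys is requested but absent keeps the opt-in safe on hardware that does not expose psys.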

benoit-cty mentioned this pull request Oct 12, 2025

cpu not recognised on linux although rapl files are accessible #915

Open

benoit-cty and others added 5 commits October 26, 2025 13:50

Better RAPL

5be43fd

unit tests

0713998

Fix test

9d58e6d

Fix test

1ca38f6

Better RAPL handling

b7ecb41

benoit-cty force-pushed the fix/rapl-mmio branch from e07e2e6 to b7ecb41 Compare October 26, 2025 14:40

benoit-cty requested a review from a team as a code owner October 26, 2025 14:40

benoit-cty and others added 14 commits November 1, 2025 10:42

doc

300f667

Warn user if RAPL file not readable

480d6bd

Better message

052e5a5

Supress divide for Threadripper

da06e9b

Reduce warnings

4538493

contrib doc

9b252a0

Change RAPL read

2c6b91c

RAPL debug script

ad58f25

Fix test_intel_rapl and test_carbon_tracker_offline_constant_force_cp…

f911145

…u_power

fix test_carbon_tracker_offline_load_force_cpu_power

1b90644

Fix test_carbon_tracker_offline_region_error

62d4812

Fix test_carbon_tracker_offline_load_force_cpu_power

a3d2bda

fix tests/test_rapl_mmio_scanning.py

9012ef9

Fix TestCarbonTrackerFlush

2ff2a02

benoit-cty commented Nov 1, 2025


Fix test_intel_rapl after CI fail

9719e31

inimaz reviewed Nov 2, 2025


benoit-cty added 2 commits November 2, 2025 12:55

Clean full_cpu example

4c0a5fd

Add 265H CPU

edf4b7a

benoit-cty and others added 7 commits November 2, 2025 12:56

265H tests

0aa372f

Test on i7-7600U

11122be

Test on i7-7600U

1ee094e

Add OCCT

9d96e63

doc Ultra 7 265H

1246cf9

doc

10ba942

Add rapl psys and dram parameters

32bd7cd

doc

040497b
