@benoit-cty
Contributor

@benoit-cty benoit-cty commented Oct 12, 2025

Description

AI Disclaimer : code created with Codex-CLI and GPT5-mini, then Copilot+Claude Sonnet 4.5

Problem

CodeCarbon was not properly handling systems with multiple RAPL providers (e.g., intel-rapl and intel-rapl-mmio in /sys/devices/virtual/powercap/). Permission errors on one provider (like intel-rapl-mmio) would cause the entire tracker to fail, even when another provider had readable domains.

Solution

Updated the RAPL scanning logic to:

  1. Scan all common RAPL locations (when using the default path):

    • /sys/class/powercap/intel-rapl/subsystem
    • /sys/class/powercap/intel-rapl (parent)
    • /sys/class/powercap
    • /sys/devices/virtual/powercap (now includes intel-rapl-mmio)
  2. Gracefully handle permission errors:

    • Permission errors on individual domains are logged as warnings
    • Tracker continues to work with readable domains
    • Only fails if NO readable main/package domain is found
  3. Smart path selection for testing:

    • Production (default path): Scans all system locations
    • Testing (custom path): Only scans the provided directory to avoid system interference
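
The path selection described above can be sketched roughly as follows. This is a minimal illustration, not the code in codecarbon/core/cpu.py: the function name `candidate_rapl_dirs` and the constant `DEFAULT_RAPL_DIR` are hypothetical, and the default value is assumed to be the first location in the list above.

```python
from __future__ import annotations

from pathlib import PurePosixPath

# Assumption: the default path is the first location listed in the description.
DEFAULT_RAPL_DIR = "/sys/class/powercap/intel-rapl/subsystem"


def candidate_rapl_dirs(rapl_dir: str = DEFAULT_RAPL_DIR) -> list[str]:
    """Return the directories to scan, in order, without duplicates."""
    if rapl_dir == DEFAULT_RAPL_DIR:
        # Production: look in every common system location.
        candidates = [
            "/sys/class/powercap/intel-rapl/subsystem",
            "/sys/class/powercap/intel-rapl",
            "/sys/class/powercap",
            "/sys/devices/virtual/powercap",
        ]
    else:
        # Testing: stay inside the caller's directory and its parent only.
        path = PurePosixPath(rapl_dir)
        candidates = [str(path), str(path.parent)]
    # dict.fromkeys keeps the first occurrence of each entry and preserves order.
    return list(dict.fromkeys(candidates))
```

With a custom directory such as "/tmp/test/rapl", the sketch returns only that directory and its parent, so nothing under /sys is touched.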

Files Modified

codecarbon/core/cpu.py

  • is_rapl_available(): Updated to scan all RAPL providers, distinguish between default and custom paths
  • IntelRAPL._fetch_rapl_files(): Updated to scan all providers, handle permission errors gracefully, track availability

Test Files

  • tests/test_cpu.py: Updated TestIntelRAPL.setUp() to create proper RAPL hierarchy (rapl_dir/intel-rapl/intel-rapl:N/)
  • tests/test_rapl_permissions.py: Updated tests to use proper RAPL provider structure
  • tests/test_rapl_mmio_scanning.py: New comprehensive tests for multi-provider scenarios

Behavior

Production (Default Path)

rapl = IntelRAPL()  # Uses default path
# Scans:
# - /sys/class/powercap/intel-rapl/subsystem
# - /sys/class/powercap/intel-rapl
# - /sys/class/powercap
# - /sys/devices/virtual/powercap  ← Finds intel-rapl-mmio here

Testing (Custom Path)

rapl = IntelRAPL(rapl_dir="/tmp/test/rapl")
# Only scans:
# - /tmp/test/rapl
# - /tmp/test  (parent)
# Avoids interference with system /sys files

Example Real System Structure

/sys/devices/virtual/powercap/
├── intel-rapl/
│   ├── intel-rapl:0/  (package-0)
│   │   ├── energy_uj
│   │   ├── intel-rapl:0:0/  (core)
│   │   ├── intel-rapl:0:1/  (uncore)
│   │   └── intel-rapl:0:2/  (dram)
│   └── intel-rapl:1/  (psys)
│       └── energy_uj
└── intel-rapl-mmio/
    └── intel-rapl-mmio:0/  (package-0)
        ├── energy_uj  ← May have permission errors
        └── intel-rapl-mmio:0:0/  (core)
            └── energy_uj
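
To make the layout above concrete, here is a small sketch that walks such a tree and groups domains by provider. It assumes a plain directory tree (for example a test fixture) in which each domain directory holds an `energy_uj` file and a `name` file with its label; `find_rapl_domains` and `make_fake_tree` are illustrative names, not CodeCarbon functions.

```python
import os
from pathlib import Path


def find_rapl_domains(base_dir):
    """Map provider name -> {domain directory name: label read from `name`}."""
    providers = {}
    for root, _dirs, files in os.walk(base_dir):
        if "energy_uj" not in files:
            continue  # not a RAPL domain directory
        domain = Path(root)
        # "intel-rapl-mmio:0:0" -> provider "intel-rapl-mmio"
        provider = domain.name.split(":")[0]
        name_file = domain / "name"
        label = name_file.read_text().strip() if name_file.exists() else "unknown"
        providers.setdefault(provider, {})[domain.name] = label
    return providers


def make_fake_tree(base):
    """Create a tiny fixture mimicking part of the structure shown above."""
    layout = {
        "intel-rapl/intel-rapl:0": "package-0",
        "intel-rapl/intel-rapl:0/intel-rapl:0:0": "core",
        "intel-rapl/intel-rapl:1": "psys",
        "intel-rapl-mmio/intel-rapl-mmio:0": "package-0",
    }
    for rel, label in layout.items():
        domain = Path(base) / rel
        domain.mkdir(parents=True, exist_ok=True)
        (domain / "name").write_text(label + "\n")
        (domain / "energy_uj").write_text("0\n")
```

Note that `os.walk` does not follow symbolic links by default, which is why the sketch is aimed at a real directory tree rather than at the symlinks under /sys/class/powercap.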

Key Features

  • Discovers all RAPL providers (intel-rapl, intel-rapl-mmio, etc.)
  • Handles permission errors gracefully (warns and continues)
  • Only fails if no readable main domain is found
  • Test isolation (custom paths don't scan system files)
  • Backward compatible (existing code continues to work)

Test Coverage

All tests passing:

  • tests/test_cpu.py::TestIntelRAPL (2 tests)
  • tests/test_rapl_permissions.py (2 tests)
  • tests/test_rapl_mmio_scanning.py (2 tests - NEW)

Example Output

When intel-rapl-mmio has permission issues:

[codecarbon WARNING] Permission denied reading RAPL file /sys/devices/virtual/powercap/intel-rapl-mmio/intel-rapl-mmio:0/energy_uj. You can grant read permission with: sudo chmod -R a+r /sys/class/powercap/*; skipping.
[codecarbon INFO] Tracking Intel CPU via RAPL interface
✓ Using readable domains from intel-rapl provider

Migration Notes

No changes required for existing code. The tracker will automatically:

  1. Discover and use all available RAPL providers
  2. Skip unreadable domains with warnings
  3. Continue working with any readable main/package domain

To grant permissions for all RAPL files:

sudo chmod -R a+r /sys/class/powercap/*
sudo chmod -R a+r /sys/devices/virtual/powercap/*
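
Before or after changing permissions, it can help to see which energy files the current user actually cannot read. The following is a small standalone sketch using only the standard library; `unreadable_energy_files` is a hypothetical helper, not part of CodeCarbon. It scans /sys/devices/virtual/powercap, since the entries under /sys/class/powercap are typically symlinks into that tree.

```python
import os
from pathlib import Path


def unreadable_energy_files(base_dirs):
    """Return the energy_uj files under base_dirs that the current user cannot read."""
    blocked = []
    for base in base_dirs:
        base_path = Path(base)
        if not base_path.is_dir():
            continue  # location absent on this machine
        for path in sorted(base_path.glob("**/energy_uj")):
            if not os.access(path, os.R_OK):
                blocked.append(str(path))
    return blocked


if __name__ == "__main__":
    for path in unreadable_energy_files(["/sys/devices/virtual/powercap"]):
        print("not readable:", path)
```

An empty result means every energy_uj file that was found is readable by the current user. When run as root the list is always empty, because root passes the read check.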

Related Issue

Will close #915

How Has This Been Tested?

  • tests/test_rapl_permissions.py:
    • test_main_rapl_permission_error: ensures initialization raises when the main intel-rapl:0/energy_uj is unreadable.
    • test_non_main_rapl_permission_warning_and_skip: ensures unreadable non-main domains are skipped and a warning is logged.
  • Tests use tmp_path and are Linux-only (they simulate sysfs trees and change file permissions).
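
The two behaviours these tests check (warn and skip on an unreadable domain, raise when no main domain is readable) can be sketched without touching the file system. This is an illustration under stated assumptions, not the PR's implementation: `read_domains` and `fake_reader` are hypothetical names, and the reader is injected so that a PermissionError can be simulated even when running as root.

```python
import logging

logger = logging.getLogger("rapl_sketch")


def read_domains(domains, read_energy):
    """Read every domain, skipping unreadable ones.

    domains: list of (path, is_main) tuples.
    read_energy: callable returning the energy for a path, may raise PermissionError.
    Raises PermissionError if no main domain could be read.
    """
    readings = {}
    main_readable = False
    for path, is_main in domains:
        try:
            readings[path] = read_energy(path)
        except PermissionError:
            logger.warning("Permission denied reading RAPL file %s; skipping.", path)
            continue
        main_readable = main_readable or is_main
    if not main_readable:
        raise PermissionError("No readable main RAPL domain found")
    return readings


def fake_reader(blocked):
    """Build a reader that raises PermissionError for the paths in `blocked`."""
    def read(path):
        if path in blocked:
            raise PermissionError(path)
        return 1.0
    return read
```

Injecting the reader keeps the check independent of real file permissions, which matters because chmod-based tests behave differently for the root user.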

Types of changes

What types of changes does your code introduce? Put an x in all the boxes that apply:

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

Go over all the following points, and put an x in all the boxes that apply.

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING.md document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@benoit-cty benoit-cty requested a review from a team as a code owner October 26, 2025 14:40
@benoit-cty
Contributor Author

Another pass with Copilot, plus an experiment on a laptop with a modern CPU, an Intel(R) Core(TM) Ultra 7 265H:

Improvements to fix double-counting issues in CodeCarbon's Intel RAPL power monitoring:

🎯 Improvement #1: Use ONLY psys When Available (Best Solution!)
Problem: Modern Intel CPUs expose multiple overlapping RAPL domains:

  • psys (platform/system) = Total platform power
  • package-0 = CPU package power
  • core = CPU cores only
  • uncore = Memory controller, cache, iGPU

The issue: psys already includes package, core, and uncore. Summing them all causes massive over-counting!

Your System Example:

  • Old behavior: 9.61W (psys) + 3.78W (package) + 0.84W (core) + 0.21W (uncore) = 14.44W ❌
  • New behavior: 9.61W (psys only) ✅

Solution: When psys domain is detected, CodeCarbon now uses ONLY psys and ignores all other domains. This is the most accurate approach for modern Intel systems (Skylake and newer).
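
The psys-only rule stated here can be sketched as a simple selection step. The function name `select_domains` is hypothetical, and the wattages are the example figures quoted above.

```python
def select_domains(readings):
    """Return the subset of readings to sum.

    readings: dict mapping a domain label to its power in watts.
    If a psys reading is present, it is used alone; otherwise all readings are kept.
    """
    if "psys" in readings:
        return {"psys": readings["psys"]}
    return dict(readings)


readings = {"psys": 9.61, "package-0": 3.78, "core": 0.84, "uncore": 0.21}
old_total = round(sum(readings.values()), 2)                  # all domains summed
new_total = round(sum(select_domains(readings).values()), 2)  # psys only
print(old_total, new_total)
```

With the figures above, summing everything gives 14.44 W while the psys-only selection gives 9.61 W.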

🎯 Improvement #2: Deduplicate MSR vs MMIO Domains
Problem: Same physical domains appear through two interfaces:

  • intel-rapl:0/package-0 (MSR-based, older interface) = 3.93W
  • intel-rapl-mmio:0/package-0 (MMIO-based, newer interface) = 3.78W

These measure the SAME physical CPU package but CodeCarbon was counting both!

Solution:

  • Detects duplicate domains by name
  • Prefers MMIO over MSR (newer, recommended interface)
  • Only deduplicates after checking readability (graceful fallback if MMIO is unreadable)
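
A rough sketch of this deduplication, assuming each domain is described by its name, its provider, a power reading, and a readability flag. The `Domain` class and the `deduplicate` function are illustrative only and do not mirror the PR's data structures.

```python
from dataclasses import dataclass


@dataclass
class Domain:
    name: str        # e.g. "package-0"
    provider: str    # "intel-rapl" (MSR) or "intel-rapl-mmio" (MMIO)
    watts: float
    readable: bool = True


def deduplicate(domains):
    """Keep one readable domain per name, preferring the MMIO provider."""
    chosen = {}
    for domain in domains:
        if not domain.readable:
            continue  # readability is checked first, so MSR remains as a fallback
        current = chosen.get(domain.name)
        is_mmio = domain.provider == "intel-rapl-mmio"
        if current is None or (is_mmio and current.provider != "intel-rapl-mmio"):
            chosen[domain.name] = domain
    return list(chosen.values())


domains = [
    Domain("package-0", "intel-rapl", 3.93),
    Domain("package-0", "intel-rapl-mmio", 3.78),
    Domain("core", "intel-rapl", 0.84),
    Domain("uncore", "intel-rapl", 0.21),
]
kept = deduplicate(domains)
```

Because unreadable entries are dropped before duplicates are compared, an unreadable MMIO package leaves the MSR package in place instead of removing the domain altogether.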

📊 Impact on Your System

Before (with all domains):

psys: 9.61W
core: 0.84W
package-0 (MSR): 3.93W     } Same physical package!
package-0 (MMIO): 3.78W    }
uncore: 0.21W
─────────────────────────
Total: 18.37W ❌ WRONG - Triple counting!

After (psys-only mode):

psys: 9.61W
─────────────────────────
Total: 9.61W ✅ CORRECT!

Fallback (if no psys, with deduplication):

package-0 (MMIO only): 3.78W
core: 0.84W
uncore: 0.21W
─────────────────────────
Total: 4.83W ✅ CORRECT (no double-counting)

🧪 Test Coverage

Added 2 comprehensive tests:

  • test_psys_only_when_available() - Verifies psys-only behavior
  • test_rapl_deduplication_prefers_mmio() - Verifies MMIO preference when deduplicating

All existing tests updated and passing ✅

💡 Key Benefits

  • Accuracy: Eliminates all double/triple counting issues
  • Simplicity: One measurement (psys) on modern systems
  • Robustness: Smart fallback when psys unavailable
  • Future-proof: Handles both MSR and MMIO interfaces
  • Clear logging: INFO messages explain what's being measured

@benoit-cty
Contributor Author

With this PR, measurement can change for our users. For example we get rid of the double-counting on AMD Threadripper.

Measurement done with a smartplug and an "AMD Ryzen Threadripper 1950X 16-Core Processor with a TDP of 180.0 W":

  • Idle : 100W for whole computer on smartplug (~ 20W reported for CPU by CodeCarbon)
  • Full load : 280W for whole computer on smartplug (~ 160W by CodeCarbon)

So should we publish it as a minor version instead of a patch?

            logger.error(e, exc_info=True)

    def live_out(self, total: EmissionsData, delta: EmissionsData):
        self.out(total, delta)
Contributor Author

I did this to fix TestCarbonTrackerFlush, but I don't understand why it was failing.
Maybe it only fails on my local machine and someone else has to test this?

What is the impact of removing this ?

Collaborator

Don't remove this! Is it possible that, when you run it locally, it is reading your codecarbon config? Normally the CI runs this in a clean environment.

Maybe I am wrong and this is documenting twice, can you share the error?
The goal of this was to emit the logs live and not wait for the out/flush to be called

@benoit-cty
Contributor Author

Finally, psys (platform/system) was not accurate on an old laptop with an Intel CPU, so I switched back to package. Testing is welcome!

Collaborator

@inimaz inimaz left a comment

Nice thanks @benoit-cty , left some comments

max_micro_joules = float(f.read())
try:
    self.last_energy = self._get_value()
except Exception as e:
Collaborator

This is not needed, since _get_value does not raise any error (it already has its own try/except).

@@ -0,0 +1,148 @@
# RAPL Measurement Fix Summary
Collaborator

Maybe do not commit this file

@benoit-cty
Contributor Author

A parameter has been added to let users opt in to psys. Since CodeCarbon did not use psys previously, it is set to False by default. In V4 we could set it to True, as it seems reliable on modern hardware.

Development

Successfully merging this pull request may close these issues.

cpu not recognised on linux although rapl files are accessible

3 participants