fix: add disk cleanup to integration-test-io-credentialed job #5610
The integration-test-io-credentialed job was failing due to disk space issues in the CI environment. The MinIO storage backend was running out of space, causing tests to fail with XMinioStorageFull errors. This adds the "Free Disk Space" step to the integration-test-io-credentialed job, matching the disk cleanup already present in other integration test jobs like integration-test-io, integration-test-catalogs, and unit-test.
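As a rough illustration of the failure mode described above, the sketch below reports how much free space a runner has before disk-heavy services such as MinIO start. It is not part of this PR: the function names `free_gib` and `has_headroom` and the 5 GiB threshold are hypothetical, chosen only for the example, and it relies solely on the standard-library `shutil.disk_usage` call.

```python
import shutil


def free_gib(path: str = "/") -> float:
    """Return the free space, in GiB, on the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / 1024**3


def has_headroom(path: str = "/", min_free_gib: float = 5.0) -> bool:
    """Return True if at least `min_free_gib` GiB is free.

    The 5 GiB default is an arbitrary illustrative threshold (an assumption),
    not a value taken from the workflow.
    """
    return free_gib(path) >= min_free_gib


if __name__ == "__main__":
    # Print a one-line summary that could be logged at the start of a CI job.
    print(f"free: {free_gib('/'):.1f} GiB, headroom ok: {has_headroom('/')}")
```

A check like this could be logged before and after the "Free Disk Space" step to see how much space the cleanup actually reclaims on a given runner image.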
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## main #5610 +/- ##
==========================================
- Coverage 70.37% 70.35% -0.03%
==========================================
Files 1012 1012
Lines 130614 130612 -2
==========================================
- Hits 91918 91887 -31
- Misses 38696 38725 +29
Greptile Summary
Confidence Score: 5/5
Important Files Changed
Sequence Diagram
sequenceDiagram
participant GHA as "GitHub Actions"
participant Runner as "Ubuntu Runner"
participant Cleanup as "Free Disk Space"
participant Docker as "Docker Compose"
participant MinIO as "MinIO Service"
participant Tests as "Integration Tests"
GHA->>Runner: Start integration-test-io-credentialed job
Runner->>Cleanup: Execute jlumbroso/free-disk-space action
Cleanup->>Runner: Remove Android, Haskell, swap storage
Runner->>Runner: Checkout code and setup environment
Runner->>Docker: Spin up IO services
Docker->>MinIO: Start MinIO container with /data volume
Runner->>Tests: Run pytest integration tests
Tests->>MinIO: Write test data during execution
Tests->>Runner: Complete test execution
Runner->>Docker: Tear down services
1 file reviewed, no comments
samstokes
left a comment
LGTM!
## Changes Made

Add disk cleanup step to `integration-test-ai` job in the PR test suite workflow to prevent disk space failures when installing heavy AI dependencies.

## Related Issues

Fixes the "no space left on device" errors causing `integration-test-ai` failures starting December 2:

- Error: `No space left on device: '/home/runner/actions-runner/cached/_diag/Worker_...'`

Related to the broader disk space issue affecting CI jobs:

- #5589
- #5609
- #5610
- #5711

## Root Cause

GitHub Actions runners have limited disk space. The combination of:

1. Large Python dependencies (torch, tensorflow, vllm ~2GB+)
2. Rust compilation artifacts

...exceeds available disk space, causing the runner worker process to crash.

Disk space became borderline after GitHub updated the `ubuntu-latest` runner image from `ubuntu24/20251102.99` to `ubuntu24/20251112.124` in November (see #5711).

The `integration-test-ai` job failures are intermittent due to borderline disk space - sometimes there's just enough, sometimes not.

## Timeline Evidence

| Date | Run | Runner Image | Branch | Commit | integration-test-ai Result |
|------|-----|--------------|--------|--------|---------------------------|
| Dec 2, 04:14 | [19846881482](https://github.com/Eventual-Inc/Daft/actions/runs/19846881482) | `20251112.124.1` | main | `37623e1` | ✅ Success |
| Dec 2, 04:58 | [19847729052](https://github.com/Eventual-Inc/Daft/actions/runs/19847729052) | `20251112.124.1` | main | `699e213` | ✅ Success |
| Dec 2, 07:43 | [19851013864](https://github.com/Eventual-Inc/Daft/actions/runs/19851013864) | `20251112.124.1` | dataframe-hashing-function | `7b01545` | ✅ Success |
| Dec 2, 12:50 | [19859197086](https://github.com/Eventual-Inc/Daft/actions/runs/19859197086) | `20251112.124.1` | zhenchao-lance-stats | `e96d6cd` | ✅ Success |
| Dec 2, 17:26 | [19867603566](https://github.com/Eventual-Inc/Daft/actions/runs/19867603566) | `20251112.124.1` | slade/daft-arrow-crate | `00d2eba` | ✅ Success |
| **Dec 2, 18:33** | [19869482423](https://github.com/Eventual-Inc/Daft/actions/runs/19869482423) | `20251112.124.1` | main | `15adbc3` | ❌ **First Failure** |
| **Dec 2, 18:42** | [19869705631](https://github.com/Eventual-Inc/Daft/actions/runs/19869705631) | `20251112.124.1` | fix/macos-timeout (PR #5731) | `eb69f98` | ❌ Failure |

## Test Plan

- [ ] Verify `integration-test-ai` jobs pass on this PR

## Internal

Closes EVE-1300
## Changes Made

Add disk cleanup step to `integration-test-io` job in the nightly workflow to prevent disk space failures when pulling Docker images (especially the large `google/cloud-sdk` image for the bigtable emulator).

## Related Issues

Fixes the "no space left on device" errors causing intermittent nightly integration test failures since November 14:

- Error: `failed to register layer: write /usr/lib/google-cloud-sdk/...: no space left on device`

Related to the broader disk space issue affecting CI jobs:

- #5589 - Fixed disk space in docgen workflow
- #5609 - Fixed disk space in doctests job
- #5610 - Fixed disk space in integration-test-io-credentialed job

## Root Cause

GitHub Actions runners have limited disk space. The combination of:

1. Large Python dependencies (torch, tensorflow, vllm ~2GB+)
2. The `google/cloud-sdk` Docker image for bigtable emulator (~1.3GB+)

...exceeds available disk space, causing Docker layer extraction to fail.

This issue began when GitHub updated the `ubuntu-latest` runner image from `ubuntu24/20251102.99` to `ubuntu24/20251112.124` on November 12, which added .NET Core SDK 10.0.100 (~2-3GB), reducing available disk space.

## Timeline Evidence

| Date | Nightly Run | Runner Image | integration-test-io Result |
|------|-------------|--------------|---------------------------|
| Nov 11 | [19255638323](https://github.com/Eventual-Inc/Daft/actions/runs/19255638323) | `20251102.99` (old) | ✅ Success |
| Nov 12 | [19287063045](https://github.com/Eventual-Inc/Daft/actions/runs/19287063045) | `20251102.99` (old) | ✅ Success |
| Nov 13 | [19321208767](https://github.com/Eventual-Inc/Daft/actions/runs/19321208767) | `20251102.99` (old) | ✅ Success |
| **Nov 14** | [19354956886](https://github.com/Eventual-Inc/Daft/actions/runs/19354956886) | `20251112.124` (new) | ❌ **First Failure** |
| Nov 15 | [19384849493](https://github.com/Eventual-Inc/Daft/actions/runs/19384849493) | `20251112.124` (new) | ❌ Failure |
| Nov 16 | [19400729617](https://github.com/Eventual-Inc/Daft/actions/runs/19400729617) | `20251112.124` (new) | ✅ Success |
| Nov 17 | [19419062335](https://github.com/Eventual-Inc/Daft/actions/runs/19419062335) | `20251112.124` (new) | ✅ Success |
| Nov 18 | [19454879276](https://github.com/Eventual-Inc/Daft/actions/runs/19454879276) | `20251112.124` (new) | ❌ Failure |
| Nov 19 | [19490527917](https://github.com/Eventual-Inc/Daft/actions/runs/19490527917) | `20251112.124` (new) | ❌ Failure |
| Nov 20 | [19526319067](https://github.com/Eventual-Inc/Daft/actions/runs/19526319067) | `20251112.124` (new) | ❌ Failure |
| Nov 22 | [19590724755](https://github.com/Eventual-Inc/Daft/actions/runs/19590724755) | `20251112.124` (new) | ✅ Success |
| Nov 23 | [19606289653](https://github.com/Eventual-Inc/Daft/actions/runs/19606289653) | `20251112.124` (new) | ✅ Success |
| Nov 24 | [19624009710](https://github.com/Eventual-Inc/Daft/actions/runs/19624009710) | `20251112.124` (new) | ✅ Success |
| Nov 25 | [19658955974](https://github.com/Eventual-Inc/Daft/actions/runs/19658955974) | `20251112.124` (new) | ❌ Failure |
| Nov 26 | [19693197927](https://github.com/Eventual-Inc/Daft/actions/runs/19693197927) | `20251112.124` (new) | ❌ Failure |
| Nov 28 | [19754690128](https://github.com/Eventual-Inc/Daft/actions/runs/19754690128) | `20251112.124` (new) | ❌ Failure |
| Nov 29 | [19779342609](https://github.com/Eventual-Inc/Daft/actions/runs/19779342609) | `20251112.124` (new) | ❌ Failure |
| Dec 1 | [19812067599](https://github.com/Eventual-Inc/Daft/actions/runs/19812067599) | `20251112.124` (new) | ❌ Failure |

**Note:** The failures are intermittent because disk space is borderline - sometimes there's just enough, sometimes not.

## Testing

Verified by running the nightly workflow on two branches:

| Branch | PR | Disk Cleanup | integration-test-io Result |
|--------|-----|--------------|---------------------------|
| `fix/nightly-io-disk-cleanup` | #5711 (this PR) | ✅ Yes | ✅ [Passed](https://github.com/Eventual-Inc/Daft/actions/runs/19834011344) |
| `verify-io-disk-failure` | #5720 | ❌ No | ❌ [Failed](https://github.com/Eventual-Inc/Daft/actions/runs/19837720310) (65 MB free, disk space error) |

This confirms the disk cleanup step resolves the issue.

Closes #5720
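To make the intermittency in the nightly Timeline Evidence table above concrete, the sketch below tallies passes and failures per runner image. The dates, image versions, and outcomes are copied from that table; the names `RUNS` and `tally` are hypothetical helpers introduced only for this illustration.

```python
# (date, runner image, passed) rows copied from the nightly Timeline Evidence table.
RUNS = [
    ("Nov 11", "20251102.99", True),
    ("Nov 12", "20251102.99", True),
    ("Nov 13", "20251102.99", True),
    ("Nov 14", "20251112.124", False),
    ("Nov 15", "20251112.124", False),
    ("Nov 16", "20251112.124", True),
    ("Nov 17", "20251112.124", True),
    ("Nov 18", "20251112.124", False),
    ("Nov 19", "20251112.124", False),
    ("Nov 20", "20251112.124", False),
    ("Nov 22", "20251112.124", True),
    ("Nov 23", "20251112.124", True),
    ("Nov 24", "20251112.124", True),
    ("Nov 25", "20251112.124", False),
    ("Nov 26", "20251112.124", False),
    ("Nov 28", "20251112.124", False),
    ("Nov 29", "20251112.124", False),
    ("Dec 1", "20251112.124", False),
]


def tally(runs):
    """Return {image: (passed_count, failed_count)} for the given runs."""
    counts = {}
    for _date, image, passed in runs:
        ok, fail = counts.get(image, (0, 0))
        counts[image] = (ok + 1, fail) if passed else (ok, fail + 1)
    return counts


for image, (ok, fail) in tally(RUNS).items():
    rate = fail / (ok + fail)
    print(f"{image}: {ok} passed, {fail} failed ({rate:.0%} failure rate)")
```

On the listed runs this gives 3 passes and 0 failures on the old image, against 5 passes and 10 failures on the new one, which is consistent with the borderline-disk-space explanation rather than a deterministic breakage.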
## Changes Made

Add disk cleanup step to `integration-test-io-credentialed` job to prevent disk space failures.

## Related Issues

Fixes the MinIO storage full errors causing integration test failures on main (commits cce8b5a, def3836, and 88ea033).

Related to the broader disk space issue affecting CI jobs:

## Test Plan

CI will run with the disk cleanup step, preventing the `XMinioStorageFull` errors that were occurring.

Verified in test PR #5611 where both `integration-test-io-credentialed (3.10, native)` and `integration-test-io-credentialed (3.10, ray)` passed successfully.