Copilot AI commented Dec 14, 2025

Proposed changes

When a node crashes and restarts, jobs with concurrency: "forbid" incorrectly allow concurrent execution. The issue occurs because the forbid check relies on an in-memory activeExecutions map that is cleared on restart, missing jobs still running on other nodes.

Changes:

  • Storage layer: Added GetRunningExecutions() method to query persistent storage for executions with StartedAt but no FinishedAt
  • Concurrency check: Modified isRunnable() to check persistent storage first, then fall back to in-memory state
  • Tests: Added comprehensive tests covering node restart scenarios and multi-node execution tracking

The fix uses BuntDB (already persisted) as the source of truth for running executions, making the forbid check resilient to node failures and restarts.

Types of changes

What types of changes does your code introduce?
Put an x in the boxes that apply

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation Update (if none of the other choices apply)

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • 127.0.0.10
    • Triggering command: /tmp/go-build2632617950/b001/dkron.test /tmp/go-build2632617950/b001/dkron.test -test.testlogfile=/tmp/go-build2632617950/b001/testlog.txt -test.paniconexit0 -test.v=true -test.timeout=3m0s HcS8Y-jUi .cfg ux_amd64/vet -p g/protobuf/ptype-test.testlogfile=/tmp/go-build1927090654/b001/testlog.txt -lang=go1.23 ux_amd64/vet -o om/cenkalti/back-test.run=TestStore_GetRunningExecutions om/cenkalti/backoff/v5@v5.0.3/er-ifaceassert ux_amd64/vet points.go /tools/clientcmd-V=full -lang=go1.21 ux_amd64/vet (packet block)
  • 127.0.0.11
    • Triggering command: /tmp/go-build2632617950/b001/dkron.test /tmp/go-build2632617950/b001/dkron.test -test.testlogfile=/tmp/go-build2632617950/b001/testlog.txt -test.paniconexit0 -test.v=true -test.timeout=3m0s HcS8Y-jUi .cfg ux_amd64/vet -p g/protobuf/ptype-test.testlogfile=/tmp/go-build1927090654/b001/testlog.txt -lang=go1.23 ux_amd64/vet -o om/cenkalti/back-test.run=TestStore_GetRunningExecutions om/cenkalti/backoff/v5@v5.0.3/er-ifaceassert ux_amd64/vet points.go /tools/clientcmd-V=full -lang=go1.21 ux_amd64/vet (packet block)
  • 127.0.0.13
    • Triggering command: /tmp/go-build2632617950/b001/dkron.test /tmp/go-build2632617950/b001/dkron.test -test.testlogfile=/tmp/go-build2632617950/b001/testlog.txt -test.paniconexit0 -test.v=true -test.timeout=3m0s HcS8Y-jUi .cfg ux_amd64/vet -p g/protobuf/ptype-test.testlogfile=/tmp/go-build1927090654/b001/testlog.txt -lang=go1.23 ux_amd64/vet -o om/cenkalti/back-test.run=TestStore_GetRunningExecutions om/cenkalti/backoff/v5@v5.0.3/er-ifaceassert ux_amd64/vet points.go /tools/clientcmd-V=full -lang=go1.21 ux_amd64/vet (packet block)
  • 127.0.0.14
    • Triggering command: /tmp/go-build2632617950/b001/dkron.test /tmp/go-build2632617950/b001/dkron.test -test.testlogfile=/tmp/go-build2632617950/b001/testlog.txt -test.paniconexit0 -test.v=true -test.timeout=3m0s HcS8Y-jUi .cfg ux_amd64/vet -p g/protobuf/ptype-test.testlogfile=/tmp/go-build1927090654/b001/testlog.txt -lang=go1.23 ux_amd64/vet -o om/cenkalti/back-test.run=TestStore_GetRunningExecutions om/cenkalti/backoff/v5@v5.0.3/er-ifaceassert ux_amd64/vet points.go /tools/clientcmd-V=full -lang=go1.21 ux_amd64/vet (packet block)

Original prompt

This section details the original issue you should resolve

<issue_title>The "forbid" concurrent option fails to verify if jobs are already running and allows another job to start</issue_title>
<issue_description>Describe the bug
The "forbid" concurrency option is not working as expected. I have jobs set to the "forbid" option. The issue occurred when a node in our cluster crashed from overload and the dkron service restarted. Once the node was back up, it started running jobs that were still running on other agent servers.

When all nodes are running without crashing, the "forbid" option works as expected.

To Reproduce

  1. On an agent called dkron-marketplace-5, the dkron log shows it receiving a signal from the cluster to run a job:
Aug 28 00:55:03 dkron-marketplace-5 dkron[250326]: time="2024-08-28T00:55:03+12:00" level=info msg="agent: Calling AgentRun" job_name=marketplace-import-orders-16-amazon node="172.30.3.25:6868"
  2. The job was then selected to run on another agent, dkron-marketplace-9:
Aug 28 00:55:03 dkron-marketplace-9 dkron[55533]: time="2024-08-28T00:55:03+12:00" level=info msg="grpc_agent: Starting job" job=marketplace-import-orders-16-amazon node=dkron-marketplace-9
  3. dkron-marketplace-5 became overloaded, and the dkron service restarted at 01:01:34:
Aug 28 01:00:02 dkron-marketplace-5 dkron[250326]: time="2024-08-28T01:00:02+12:00" level=error msg="grpc: error dialing." error="failed to build resolver: passthrough: received empty target in Build()" method=ExecutionDone node=dkron-marketplace-5 server_addr=
Aug 28 01:00:02 dkron-marketplace-5 dkron[250326]: time="2024-08-28T01:00:02+12:00" level=fatal msg="agent: error applying SetExecutionType" error="node is not the leader" node=dkron-marketplace-5
Aug 28 01:01:34 dkron-marketplace-5 dkron[504292]: time="2024-08-28T01:01:34+12:00" level=info msg="agent: Dkron agent starting" node=dkron-marketplace-5
Aug 28 01:01:34 dkron-marketplace-5 dkron[504292]: time="2024-08-28T01:01:34+12:00" level=info msg="agent: Retry join LAN is supported for: aliyun aws azure digitalocean gce k8s linode mdns os packet scaleway softlayer tencentcloud triton vsphere" node=dkron-marketplace-5
Aug 28 01:01:34 dkron-marketplace-5 dkron[504292]: time="2024-08-28T01:01:34+12:00" level=info msg="agent: Joining cluster..." cluster=LAN node=dkron-marketplace-5
  4. Once dkron had started and rejoined the cluster, it started the job, which was still running on dkron-marketplace-9:
Aug 28 01:05:00 dkron-marketplace-5 dkron[504292]: time="2024-08-28T01:05:00+12:00" level=info msg="grpc_agent: Starting job" job=marketplace-import-orders-16-amazon node=dkron-marketplace-5

Expected behavior
The jobs should not be allowed to run concurrently like this, because they were set to the "forbid" option, which allows a job to run on only one node at a time.

Specifications:

  • OS: Ubuntu 20.04
  • Version: 3.2.6

</issue_description>

Comments on the Issue (you are @copilot in this section)

@vcastellm Thanks for reporting; we'll investigate the issue.

Copilot AI and others added 3 commits December 14, 2025 19:34
… for running executions

Co-authored-by: vcastellm <47026+vcastellm@users.noreply.github.com>
Co-authored-by: vcastellm <47026+vcastellm@users.noreply.github.com>
Co-authored-by: vcastellm <47026+vcastellm@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix forbid concurrent option to verify running jobs" to "Fix "forbid" concurrency check to survive node restarts" Dec 14, 2025
Copilot AI requested a review from vcastellm December 14, 2025 19:44
@vcastellm vcastellm marked this pull request as ready for review December 14, 2025 19:48
codecov bot commented Dec 14, 2025

Codecov Report

❌ Patch coverage is 53.57143% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 31.76%. Comparing base (51317b1) to head (aa2bd08).
⚠️ Report is 1 commits behind head on main.

Files with missing lines    Patch %    Lines
dkron/job.go                20.00%     10 Missing and 2 partials ⚠️
dkron/store.go              92.30%     1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1870      +/-   ##
==========================================
+ Coverage   31.68%   31.76%   +0.08%     
==========================================
  Files          76       76              
  Lines        6937     6963      +26     
==========================================
+ Hits         2198     2212      +14     
- Misses       4436     4446      +10     
- Partials      303      305       +2     

@vcastellm vcastellm merged commit d207f79 into main Dec 14, 2025
4 checks passed
@vcastellm vcastellm deleted the copilot/fix-forbid-concurrent-option branch December 14, 2025 19:58

Development

Successfully merging this pull request may close these issues.

The "forbid" concurrent option fails to verify if jobs are already running and allows another job to start