K8s: refactor local connection server usage#927

Open
wanyunSu wants to merge 6 commits into develop from wanyunSu/k8s-lcs

@wanyunSu wanyunSu commented May 19, 2026

Contributor

Description

Fixes issue #825

Changes:

  1. A _LcsSessionState dataclass replaces the old scalar attributes so that multiple sessions are supported. Sessions without a local connection server (LCS) have no entry.
  2. _create_nodeport_service now takes explicit port / node_port arguments, which stops the root-controller branch from polluting the LCS dict.
  3. _lcs_state.pop(session) is called on namespace deletion, and lcs.is_booted is set to False in notify_termination when the LCS pod itself terminates.
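
The three changes above can be sketched roughly as follows. This is a minimal illustration rather than the code in this PR: `LcsSessionState`, `LcsRegistry`, and all method and field names other than `is_booted` and `node_port` are hypothetical stand-ins, and the defensive `pop(session, None)` is an assumption (the description only mentions `_lcs_state.pop(session)`).

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class LcsSessionState:
    """Hypothetical per-session LCS record; the real field set may differ."""

    port: int
    node_port: int
    is_booted: bool = False


class LcsRegistry:
    """Hypothetical holder for the per-session dict described above."""

    def __init__(self) -> None:
        # One entry per session that owns an LCS; other sessions have no entry.
        self._lcs_state: Dict[str, LcsSessionState] = {}

    def register(self, session: str, port: int, node_port: int) -> LcsSessionState:
        # Ports are passed in explicitly instead of being read from shared scalars.
        state = LcsSessionState(port=port, node_port=node_port)
        self._lcs_state[session] = state
        return state

    def state_for(self, session: str) -> Optional[LcsSessionState]:
        # Returns None for sessions without an LCS.
        return self._lcs_state.get(session)

    def notify_lcs_terminated(self, session: str) -> None:
        # The LCS pod itself went away: keep the entry but mark it as not booted.
        lcs = self._lcs_state.get(session)
        if lcs is not None:
            lcs.is_booted = False

    def on_namespace_deleted(self, session: str) -> None:
        # Drop the entry entirely; the default avoids a KeyError (an assumption here).
        self._lcs_state.pop(session, None)
```

Keying the state by session is what keeps two concurrent sessions from overwriting each other's node port, which a single pair of scalar attributes cannot do.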

Type of change

  • New feature / enhancement
  • Optimization
  • Bug fix
  • Breaking change
  • Documentation

List of required branches from other repositories

WHAT PRs NEED TO BE INCLUDED TO MAKE THE CHANGE.

Change log

see above

Suggested manual testing checklist

drunc-unified-shell k8s config/daqsystemtest/example-configs.data.xml local-1x1-config claudia-test

Developer checklist

Prior to marking this as "Ready for Review"

Tests ran on: np04-019 from NFD_DEV_260518_A9

Unit tests - some tests can't be run on the CI. This is documented. If this PR checks a feature that can't be tested with CI, this has been marked appropriately.

Integration tests - the daqsystemtest_integtest_bundle requires a lot of resources, and connections to the EHN1 infrastructure. Check the cross referenced list if you can't run these. The developer needs to run at least the .

  • Unit tests (pytest --marker) passed
    • With relevant marker
    • Without marker
  • Integration tests passed
    • Only daqsystemtest_integtest_bundle.sh -k minimal_system_quick_test.py
    • Full daqsystemtest_integtest_bundle.sh
  • Testing skipped as there are no core code changes in this PR, this only relates to documentation/CI workflows
  • Drunc integration tests pass (./scripts/drunc_integtest_bundle.sh)

Final checklist prior to marking this as "Ready for Review"

  • Code is clearly commented.
  • New unit tests have been added, or is documented in # ISSUE NUMBER
  • A suitable reviewer has been chosen from this list.

Reviewer checklist

  • This branch has been rebased with develop prior to testing.
  • Suggested manual tests show changes.
  • CI workflow failures documented (if present)
  • Integration tests passed (on either np0x or IC HEP clusters)
    • Use the following guidelines to determine which of the integration tests you need to run
      • You do not need to run any integration tests if
        • Code changes are not associated with src/
        • PR changes only affect docstrings
        • In this case, be sure to validate any suggested manual testing.
      • Run only the minimum integration test as daqsystemtest_integtest_bundle.sh -k minimal_system_quick_test.py if
        • PR changes only affect a few log entries
        • PR changes are small, and do not have a large impact on the workflow (use carefully)
      • Otherwise run the full integration test bundle as daqsystemtest_integtest_bundle.sh
    • What to do if the integration tests fail?
      • Only concern yourself if failures related to drunc are in the log files
      • If non-drunc failure appears:
        • Validate failure in fresh working area
        • Contact Pawel if unsure
  • If you have run the full integration test bundle, leave a comment on the PR stating
    • Which host the integration tests were run on
    • [Optional] A copy of the test summary
  • Drunc integration tests pass (scripts/drunc_integtest_bundle.sh)

Once the above boxes are checked, the PR(s) can be merged following the steps below.

Prior to merging

Choose one of the following and complete all substeps
  • Changes only affect the Run Control, are in a single repository, and do not affect the end user.
    • Changes are documented in docstrings and code comments
    • Wiki has been updated if architectural or endpoint changes
  • Otherwise
    • Workflow changes demonstrated in the Change Log (if necessary)
    • Wiki has been updated (if necessary)
    • #dunedaq-integration Slack channel notified (see below)

Once completed, the reviewer can merge the PR.

Notification message for a Slack channel

Note - this should be to #dunedaq-integration for general workflow that isn't during a release candidate period, and to #daq-release-prep otherwise.

For a single merge that changes the user workflow

The CCM WG has an isolated PR ready to merge that affects user workflows. The PR is:

_URL_

I will leave time for any comments, otherwise will merge these at the end of the work day _Insert your time zone_.

For co-ordinated merge

The CCM WG has a set of co-ordinated merges ready to merge. The PRs are:

_URL_

_URL_


I will leave time for any comments, otherwise will merge these at the end of the day.

@wanyunSu wanyunSu self-assigned this Jun 1, 2026
@PawelPlesniak PawelPlesniak self-requested a review June 4, 2026 15:24
@@ -2160,7 +2210,7 @@ def _wait_for_lcs_readiness(self, podname: str, session: str) -> None:

self._wait_for_nodeport_http_ready(url, remaining_time)

Collaborator

Question here - why is it that you run the check for nodeport readiness after you call _wait_for_pod_api_ready?

Contributor Author

_wait_for_pod_api_ready returns node_name, which is required to construct the NodePort URL. Also, the NodePort can't be reachable before the pod is running.
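
The ordering described in this reply can be sketched as a two-stage wait. This is a minimal illustration under stated assumptions, not the implementation in this PR: `wait_for_http_ready`, `wait_for_lcs_ready`, the `wait_for_pod` callable (standing in for `_wait_for_pod_api_ready`), and the `probe` parameter are all hypothetical names.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def wait_for_http_ready(url: str, timeout: float, interval: float = 0.5) -> bool:
    """Poll url until any HTTP response arrives or timeout seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval):
                return True
        except urllib.error.HTTPError:
            # The server answered, even with an error status, so the port is reachable.
            return True
        except (urllib.error.URLError, OSError):
            # Connection refused or timed out: wait and retry.
            time.sleep(interval)
    return False


def wait_for_lcs_ready(
    wait_for_pod: Callable[[float], str],
    node_port: int,
    total_timeout: float,
    probe: Callable[[str, float], bool] = wait_for_http_ready,
) -> bool:
    """Stage 1 yields the node name; stage 2 probes the NodePort on that node."""
    start = time.monotonic()
    # Stage 1: block until the pod is running and learn which node hosts it.
    node_name = wait_for_pod(total_timeout)
    # Stage 2 only gets whatever is left of the overall time budget.
    remaining = total_timeout - (time.monotonic() - start)
    if remaining <= 0:
        return False
    url = f"http://{node_name}:{node_port}"
    return probe(url, remaining)
```

The URL cannot be built before stage 1 finishes because the node name is an output of that stage, which is why the NodePort check has to come second.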

  # --- STAGE 2: Wait for NodePort to be externally reachable (using HTTP urllib) ---
- url = f"http://{node_name}:{self.connection_server_node_port}"
+ lcs = self._lcs_state_for(session)
+ url = f"http://{node_name}:{lcs.node_port}"

Collaborator

This would be cleaner if we had the url as an attribute of the _LcsSessionState

Contributor Author

There's no need to store url in _LcsSessionState, as it is only used as a readiness probe inside _wait_for_lcs_readiness and never read again after boot completes.

@PawelPlesniak PawelPlesniak left a comment

Collaborator

Some comments
