Skip to content

Add multi host and multi user support to the process manager#941

Open
emmuhamm wants to merge 23 commits into
PawelPlesniakEmuhammad/SplitShellFixingfrom
emmuhamm/fix-lcs
Open

Add multi host and multi user support to the process manager#941
emmuhamm wants to merge 23 commits into
PawelPlesniakEmuhammad/SplitShellFixingfrom
emmuhamm/fix-lcs

Conversation

@emmuhamm

@emmuhamm emmuhamm commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Description

Fixes #926
Fixes #937

  • Blocks boot behaviour in unified shell if another session (with the same name) is already booted

  • Asks for confirmation when booting in the process manager shell if user tries to spawn a session with the same name as an existing session

  • Multi host support works by mapping localhost instances to an actual host

    • better handling with via localhost remapping in the session_dal
    • Also a remapping in the controller code
  • Multi user support works

  • extra connectivity service is ready check in the controller

  • new touch_and_chmod behaviour to enable correct permissions for read and write

    • This is done for the info.json that drunc usually spits out
  • deletes the logfile in case there was a previous iteration of it at boot now

  • checks for working directory permissions before running

  • BIG NOTE PERMISSIONS SET TO RWX FOR EVERYONE (777)!!! THIS MUST BE ACCEPTABLE

  • all this works for localhost, ehn1, and the localhost-pmaas config (where logs go to /logs)

Note about multi user support

It works!

However it has to run at a very specific level of permissions. To recap:

  • the user running the Process Manager server is deemed the 'superuser'
    • any new processes spawned by the process manager will be made in the name of the superuser
    • so this superuser will own the relevant processes
  • Any user can connect to the process manager via process-manager-shell or unified-shell

Because of the above,

  • Process Manager Server must be run in a folder that the user has rw(x?) permissions for
    • example: emmuhamm in his own home directory
  • Unified Shell / Process Manager shell must be run in a folder that both the superuser and the user running the US/PMS has rw(x?) permissios for
    • eg in an /nfs/sw/dunedaq.. folder that has 777 permissions on it

Type of change

  • Optimization
  • Bug fix
  • New feature

List of required branches from other repositories

DUNE-DAQ/daqsystemtest#322

Suggested manual testing checklist

built off 260603 nightly

Some relevant commands include

drunc-process-manager ssh-standalone 50000

drunc-unified-shell -o grpc://np04-srv-028:50000 config/daqsystemtest/example-configs.data.xml ehn1-local-1x1-config emir-test-ehn1 start-run --run-number 1 wait 20 
drunc-unified-shell -o grpc://np04-srv-028:50000 config/daqsystemtest/example-configs.data.xml local-1x1-config emir-test-local  start-run --run-number 1 wait 20
drunc-unified-shell -o grpc://np04-srv-028:50000 config/daqsystemtest/example-configs.data.xml local-1x1-config-pmaas emir-test-pmaas  start-run --run-number 1 wait 20


boot config/daqsystemtest/example-configs.data.xml ehn1-local-1x1-config emir-test-ehn3
boot config/daqsystemtest/example-configs.data.xml local-1x1-config emir-test-local-2 
boot config/daqsystemtest/example-configs.data.xml local-1x1-config-pmaas emir-test-pmaas-1

The list of steps (vague, sorry) are

  • clone the relevant branches on the relevant nightly
  • make sure you have super big permissions
  • using one user, start up the process manager
  • do the same cloning and permissions with the other user
  • using the other user, run
    • the unified shell with the three different configs
    • the pm shell and boot the three different things
  • check if having more sessions work with the port mapping
  • if you dont have the right permissions, check if the error message is clear enough and explained what the problem is

Developer checklist

Prior to marking this as "Ready for Review"

Tests ran on: NP04 - 028 from release 03 June 2026 nightly

Unit tests - some tests can't be ran on the CI. This is documented. If this PR checks a feature that can't be tested with CI, this has been marked appropriately.

Integration tests - the daqsystemtest_integtest_bundle requires a lot of resources, and connections to the EHN1 infrastructure. Check the cross referenced list if you can't run these. The developer needs to run at least the .

  • Unit tests (pytest --marker) passed
    • With relevant marker
    • Without marker
  • Integration tests passed
    • Only daqsystemtest_integtest_bundle.sh -k minimal_system_quick_test.py
    • Full daqsystemtest_integtest_bundle.sh
  • Testing skipped as there are no core code changes in this PR, this only relates to documentation/CI workflows
  • Drunc integration tests pass (./scripts/drunc_integtest_bundle.sh)

Final checklist prior to marking this as "Ready for Review"

  • Code is clearly commented.
  • New unit tests have been added, or is documented in # ISSUE NUMBER
  • A suitable reviewer has been chosen from this list.

Reviewer checklist

  • This branch has been rebased with develop prior to testing.
  • Suggested manual tests show changes.
  • CI workflows fails documented (if present)
  • Integration tests passed (on either np0x or IC HEP clusters)
    • Use the following guidelines to determine which of the integration tests you need to run
      • You do not need to run any integration tests if
        • Code changes are not associated with src/
        • PR changes only affect docstrings
        • In this case, be sure to validate any suggested manual testing.
      • Run only the minimum integration test as daqsystemtest_integtest_bundle.sh -k minimal_system_quick_test.py if
        • PR changes only affect a few log entries
        • PR changes are small, and do not have a large impact on the workflow (use carefully)
      • Otherwise run the full integration test bundle as daqsystemtest_integtest_bundle.sh
    • What to do if the integration tests fail?
      • Only concern yourself if failures related to drunc are in the log files
      • If non-drunc failure appears:
        • Validate failure in fresh working area
        • Contact Pawel if unsure
  • If you have ran the full integration test bundle, leave a comment on the PR stating
    • Which host the integration tests have ran on
    • [Optional] A copy of the test summary
  • Drunc integration tests pass (scripts/drunc_integtest_bundle.sh)

Once the above boxes are checked, the PR(s) can be merged following the steps below.

Prior to merging

Choose one of the following an complete all substeps
  • Changes only affect the Run Control, are in a single repository, and do not affect the end user.
    • Changes are documented in docstrings and code comments
    • Wiki has been updated if architectural or endpoint changes
  • Otherwise
    • Workflow changes demonstrated in the Change Log (if necessary)
    • Wiki has been updated (if necessary)
    • #dunedaq-integration Slack channel notified (see below)

Once completed, the reviewer can merge the PR.

Notification message for a Slack channel

Note - this should be to #dunedaq-integration for general workflow that isn't during a release candidate period, and to #daq-release-prep otherwise.

For an single merge that changes the user workflow

The CCM WG has an isolated PR ready to merge that affects user workflows. The PR is:

_URL_

I will leave time for any comments, otherwise will merge these at the end of the work day _Insert your time zone_.

For co-ordinated merge

The CCM WG has a set of co-ordinated merges ready to merge. The PRs are:

_URL_

_URL_


I will leave time for any comments, otherwise will merge these at the end of the day.

@emmuhamm emmuhamm marked this pull request as draft June 10, 2026 08:24
@emmuhamm emmuhamm self-assigned this Jun 10, 2026
@emmuhamm emmuhamm changed the base branch from develop to emmuhamm/fix-minor-split-shell-issues June 10, 2026 08:28
@emmuhamm emmuhamm force-pushed the emmuhamm/fix-lcs branch 3 times, most recently from e4cfb16 to d452d58 Compare June 12, 2026 14:17
@emmuhamm

emmuhamm commented Jun 12, 2026

Copy link
Copy Markdown
Contributor Author

so most things work. things to do now is to ensure that the -pmaas config actually works independently in the daqsystemtest repo

and then later we can investigate if everything works for both ehn1 and local, using /log and not /log

well the pmaas thing worked a treat

@emmuhamm emmuhamm force-pushed the emmuhamm/fix-lcs branch 3 times, most recently from c1bc926 to 3c980ac Compare June 15, 2026 14:51
return

def update_connectivity_port_dal(
def update_connectivity_port_dal( # why is th9is hanging?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dont forget to tackle this (remove?)

@emmuhamm emmuhamm changed the base branch from emmuhamm/fix-minor-split-shell-issues to PawelPlesniakEmuhammad/SplitShellFixing June 17, 2026 13:00
@emmuhamm emmuhamm mentioned this pull request Jun 17, 2026
22 tasks
@emmuhamm emmuhamm added this to the fddaq-v5.7.0 milestone Jun 17, 2026
@emmuhamm emmuhamm marked this pull request as ready for review June 17, 2026 13:23
@emmuhamm emmuhamm changed the title fix lcs wip Add multi host and multi user support to the process manager Jun 17, 2026
@emmuhamm emmuhamm requested a review from PawelPlesniak June 17, 2026 13:29
@emmuhamm

Copy link
Copy Markdown
Contributor Author

Hello hello @PawelPlesniak this is the big boi now. Ready for review to get the multi host multi user support in!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Local Connectivity Service doesn't work for local-x-config Minor issues from PMaaS testing (20260515)

1 participant