Handling exceptions when applications do not report errors but time out in stateful transitions#868

Draft
PawelPlesniak wants to merge 55 commits into develop from PawelPlesniak/IncompleteStatefulCommandTransition

Conversation

@PawelPlesniak

@PawelPlesniak PawelPlesniak commented Mar 31, 2026

Collaborator

Description

Fixes issue #803
If a segment does not reach the target state in time, it is marked as in error, and the timeout is logged in the relevant server.
This PR also defines a set of configurations constructed to fail, and a set of unit tests that demonstrate this behaviour.
Error recovery with the supervisor will address what happens if an application completes the transition outside of the designated window. This is defined in #840.
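
The behaviour described above can be pictured as a polling loop with a deadline. The following is a minimal sketch only, not the drunc implementation: `wait_for_state`, `TransitionResult` and the `get_state` callable are hypothetical names introduced for illustration, and the 60-second default simply mirrors the timeout quoted in the error message later in this PR.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class TransitionResult:
    reached: bool   # True if the target state was observed before the deadline
    in_error: bool  # True if the deadline expired first
    elapsed: float  # seconds spent waiting


def wait_for_state(
    get_state: Callable[[], str],
    target: str,
    timeout: float = 60.0,
    poll_interval: float = 0.5,
) -> TransitionResult:
    """Poll get_state() until it returns target or the timeout expires."""
    start = time.monotonic()
    while True:
        if get_state() == target:
            return TransitionResult(True, False, time.monotonic() - start)
        if time.monotonic() - start >= timeout:
            # Assumed policy: a segment that misses the deadline is flagged as in error.
            return TransitionResult(False, True, time.monotonic() - start)
        time.sleep(poll_interval)
```

In this sketch an application that never reports a failure but also never reaches the target state is still flagged, which is the case this PR is concerned with.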

Type of change

  • New feature / enhancement
  • Optimization
  • Bug fix
  • Breaking change
  • Documentation

List of required branches from other repositories

Requires DUNE-DAQ/druncschema#87

Change log

Defines a set of intentionally failing configurations in config/tests/failure-mode-testing.data.xml, each containing a set of fake-daq-apps that fail at pre-defined points. The configurations are listed below. (Note: the checkboxes are for final testing of whether the behaviour is as intended, and will be removed prior to marking as ready for review.)

  • ft-reference - this is a reference configuration without the small changes required to simulate failures.
  • ft-death-on-boot-nest-app - this kills a nested application (2+ segments deep) on boot.
  • ft-death-on-boot-top-app - this kills the top application on boot.
  • ft-death-post-boot-nest-app - this kills a nested application (2+ segments deep) after boot, before applications are marked as ready.
  • ft-death-post-boot-top-app - this kills the top application after boot, before applications are marked as ready.
  • ft-fsm-cmd-timeout-nest-app - this times out an FSM transition on a nested application.
  • ft-fsm-cmd-timeout-top-app - this times out an FSM transition on the top application.
  • ft-fsm-cmd-death-nest-app - this kills a nested fake-daq-app during an FSM transition.
  • ft-fsm-cmd-death-top-app - this kills the top fake-daq-app during an FSM transition.

These tests have been integrated into the unit test framework.
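
The failing configuration names above follow a regular `ft-<failure mode>-<target>` pattern. As a rough sketch (the helper names `failure_config_ids` and `shell_command` are hypothetical and are not part of drunc), the IDs and the matching shell invocations from the manual checklist can be generated like this:

```python
from itertools import product

# Path as given in the change log above.
CONFIG_DB = "config/tests/failure-mode-testing.data.xml"

FAILURE_MODES = ["death-on-boot", "death-post-boot", "fsm-cmd-timeout", "fsm-cmd-death"]
TARGETS = ["nest-app", "top-app"]


def failure_config_ids() -> list:
    """Return the eight intentionally failing configuration IDs."""
    return [f"ft-{mode}-{target}" for mode, target in product(FAILURE_MODES, TARGETS)]


def shell_command(config_id: str, session: str, *commands: str) -> list:
    """Build the argument list for one manual test run."""
    return ["drunc-unified-shell", "ssh-standalone", CONFIG_DB, config_id, session, *commands]


if __name__ == "__main__":
    for config_id in failure_config_ids():
        print(" ".join(shell_command(config_id, "pr868", "boot")))
```

A list like this could feed a parametrised test, so that adding a new failure mode only requires one new entry.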

Suggested manual testing checklist

Run each of the commands and validate that the behaviour is as expected. Before running any of these configurations, run the following script:

<DRUNC_ROOT>/config/setup_drunc_config_path.sh

The following commands run each test manually. The checkboxes are left for the reviewer to keep track of their testing progress.

  • Reference test
drunc-unified-shell ssh-standalone config/tests/failure-mode-testing.data.xml ft-reference pr868 start-run --run-number 1 wait 10 shutdown

This run should complete without any error conditions.

  • Nest app death on boot
drunc-unified-shell ssh-standalone config/tests/failure-mode-testing.data.xml ft-death-on-boot-nest-app pr868 boot

bottom-segment-2-application should die on boot, and its logs should contain

Simulating death of bottom-segment-2-application on boot

The top-segment-controller should be in error, and the following line should be in the tty:

ERROR      commands.py:119                          drunc.unified_shell.boot                           Booted, but the top controller is in error
  • Top app death on boot
drunc-unified-shell ssh-standalone config/tests/failure-mode-testing.data.xml ft-death-on-boot-top-app pr868 boot

nested-segment-application should die on boot, and its logs should contain

Simulating death of nested-segment-application on boot

The top-segment-controller should be in error, and the following line should be in the tty:

ERROR      commands.py:119                          drunc.unified_shell.boot                           Booted, but the top controller is in error
  • Nest app death post boot
drunc-unified-shell ssh-standalone config/tests/failure-mode-testing.data.xml ft-death-post-boot-nest-app pr868 boot

bottom-segment-2-application should die at the end of boot, and its logs should contain

Simulating death of bottom-segment-2-application post boot

The top-segment-controller should be in error, and the following line should be in the tty:

ERROR      commands.py:119                          drunc.unified_shell.boot                           Booted, but the top controller is in error
  • Top app death post boot
drunc-unified-shell ssh-standalone config/tests/failure-mode-testing.data.xml ft-death-post-boot-top-app pr868 boot

nested-segment-application should die at the end of boot, and its logs should contain

Simulating death of nested-segment-application post boot

The top-segment-controller should be in error, and the following line should be in the tty:

ERROR      commands.py:119                          drunc.unified_shell.boot                           Booted, but the top controller is in error
  • Nested app timeout on FSM transition
drunc-unified-shell ssh-standalone config/tests/failure-mode-testing.data.xml ft-fsm-cmd-timeout-nest-app pr868 boot

bottom-segment-2-application should time out on conf, and its logs should contain

Delaying execution of bottom-segment-2-application by 100 seconds

The top-segment-controller should be in error, and the following lines should be in the tty:

ERROR      shell_utils.py:640                       drunc.controller.iface.shell_utils                 The command timed out, unfortunately this means the server is in undefined state, and your best option at this stage is to terminate and boot.
ERROR      shell_utils.py:657                       drunc.controller.iface.shell_utils                 The session did not complete the stateful transition in the specified time of 60 seconds. To investigate the cause, please check the controller and application logs with the 'logs' command.
ERROR      commands.py:119                          drunc.unified_shell.boot                           Booted, but the top controller is in error
  • Top app timeout on FSM transition
drunc-unified-shell ssh-standalone config/tests/failure-mode-testing.data.xml ft-fsm-cmd-timeout-top-app pr868 boot

nested-segment-application should time out on conf, and its logs should contain

Delaying execution of nested-segment-application by 100 seconds

The top-segment-controller should be in error, and the following lines should be in the tty:

ERROR      shell_utils.py:640                       drunc.controller.iface.shell_utils                 The command timed out, unfortunately this means the server is in undefined state, and your best option at this stage is to terminate and boot.
ERROR      shell_utils.py:657                       drunc.controller.iface.shell_utils                 The session did not complete the stateful transition in the specified time of 60 seconds. To investigate the cause, please check the controller and application logs with the 'logs' command.
ERROR      commands.py:119                          drunc.unified_shell.boot                           Booted, but the top controller is in error
  • Nested application death during FSM transition
drunc-unified-shell ssh-standalone config/tests/failure-mode-testing.data.xml ft-fsm-cmd-death-nest-app pr868 boot conf

bottom-segment-2-application should die on conf, and its logs should contain

Simulating death of bottom-segment-2-application during FSM cmd execution

TTY TBC

  • Top application death during FSM transition
drunc-unified-shell ssh-standalone config/tests/failure-mode-testing.data.xml ft-fsm-cmd-death-top-app pr868 boot conf

nested-segment-application should die on conf, and its logs should contain

Simulating death of nested-segment-application during FSM cmd execution

TTY TBC
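
The expected tty lines in the checklist above share one layout: level, `file:line`, logger name, message, optionally preceded by a bracketed timestamp. A small sketch of how such lines could be checked automatically follows. The names `parse_log_line` and `tty_contains` are hypothetical, and the regular expression is inferred only from the lines quoted in this PR. The check deliberately ignores the source line number, since that can shift between commits.

```python
import re
from typing import NamedTuple, Optional

LOG_LINE = re.compile(
    r"^(?:\[(?P<timestamp>[^\]]+)\]\s+)?"  # optional "[2026/06/17 16:59:15 UTC]"
    r"(?P<level>[A-Z]+)\s+"                # ERROR, INFO, WARNING, CRITICAL
    r"(?P<file>[\w.]+):(?P<line>\d+)\s+"   # commands.py:119
    r"(?P<logger>\S+)\s+"                  # drunc.unified_shell.boot
    r"(?P<message>.*)$"
)


class LogRecord(NamedTuple):
    level: str
    file: str
    logger: str
    message: str


def parse_log_line(text: str) -> Optional[LogRecord]:
    """Return the parsed record, or None if the line is not a log line."""
    match = LOG_LINE.match(text.strip())
    if match is None:
        return None
    return LogRecord(match["level"], match["file"], match["logger"], match["message"])


def tty_contains(tty_output: str, level: str, logger: str, fragment: str) -> bool:
    """True if any log line has this level and logger and contains fragment."""
    for raw in tty_output.splitlines():
        record = parse_log_line(raw)
        if record and record.level == level and record.logger == logger and fragment in record.message:
            return True
    return False
```

For example, `tty_contains(output, "ERROR", "drunc.unified_shell.boot", "top controller is in error")` would cover the boot checks listed above.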

Developer checklist

Prior to marking this as "Ready for Review"

Tests run on: WHAT HOSTNAME from release RELEASE_NAME

Unit tests - some tests can't be run on the CI. This is documented. If this PR checks a feature that can't be tested with CI, this has been marked appropriately.

Integration tests - the daqsystemtest_integtest_bundle requires a lot of resources, and connections to the EHN1 infrastructure. Check the cross referenced list if you can't run these. The developer needs to run at least the .

  • Unit tests (pytest --marker) passed
    • With relevant marker
    • Without marker
  • Integration tests passed
    • Only daqsystemtest_integtest_bundle.sh -k minimal_system_quick_test.py
    • Full daqsystemtest_integtest_bundle.sh
  • Testing skipped as there are no core code changes in this PR, this only relates to documentation/CI workflows

Final checklist prior to marking this as "Ready for Review"

  • Code is clearly commented.
  • New unit tests have been added, or is documented in # ISSUE NUMBER
  • A suitable reviewer has been chosen from this list.

Reviewer checklist

  • This branch has been rebased with develop prior to testing.
  • Suggested manual tests show changes.
  • CI workflow failures documented (if present)
  • Integration tests passed
    • Only concern yourself if failures related to drunc are in the log files
    • If non-drunc failure appears:
      • Validate failure in fresh working area
      • Contact Pawel if unsure

Once the features are validated and both the unit and integration tests pass, the PR is ready to be merged.

Prior to merging

Choose one of the following and complete all substeps
  • Changes only affect the Run Control, are in a single repository, and do not affect the end user.
    • Changes are documented in docstrings and code comments
    • Wiki has been updated if architectural or endpoint changes
  • Otherwise
    • Workflow changes demonstrated in the Change Log (if necessary)
    • Wiki has been updated (if necessary)
    • #daq-sw-librarians Slack channel notified (see below)

Once completed, the reviewer can merge the PR.

Notification message for a Slack channel

Note - this should be sent to #dunedaq-integration for general workflow changes outside a release candidate period, and to #daq-release-prep otherwise.

For a single merge that changes the user workflow

The CCM WG has an isolated PR ready to merge that affects user workflows. The PR is:

_URL_

I will leave time for any comments, otherwise will merge these at the end of the work day _Insert your time zone_.

For co-ordinated merge

The CCM WG has a set of co-ordinated merges ready to merge. The PRs are:

_URL_

_URL_


I will leave time for any comments, otherwise will merge these at the end of the day.

@PawelPlesniak

Collaborator Author
[image]

In the case where a second application also fails to complete a transition in time, the same error is thrown. This is likely caused by the nested structure and the multiple layers in this configuration. A robust solution to this problem will take longer to achieve, but I will continue working on it.

@PawelPlesniak PawelPlesniak changed the title Generating an environment for which the issue can be recreated Handling exceptions when applications do not report errors but time out in stateful transitions Mar 31, 2026
@PawelPlesniak PawelPlesniak changed the base branch from prep-release/fddaq-v5.6.0 to develop June 4, 2026 16:05
@PawelPlesniak

PawelPlesniak commented Jun 5, 2026

Collaborator Author
  • Log files do not contain weird characters from redirecting rich ASCII output to a file with colors
  • Top app config failure not working?
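
On the first item above: one common way to keep colour codes out of a log file is to strip ANSI CSI escape sequences before writing. The sketch below shows that general technique only. It is not necessarily how this PR handles the problem, and `strip_ansi` is a hypothetical helper name.

```python
import re

# CSI sequence: ESC [ + parameter bytes (0x30-0x3F) + intermediate bytes (0x20-0x2F)
# + one final byte (0x40-0x7E). This covers colour/style codes such as "\x1b[1;31m".
ANSI_CSI = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")


def strip_ansi(text: str) -> str:
    """Remove ANSI CSI escape sequences from text."""
    return ANSI_CSI.sub("", text)


if __name__ == "__main__":
    coloured = "\x1b[1;31mERROR\x1b[0m commands.py:119"
    print(strip_ansi(coloured))
```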

@PawelPlesniak

PawelPlesniak commented Jun 8, 2026

Collaborator Author
  • Stream handler has no logging time zone, file name, line number, etc?
  • Duplicate logs of apps failing
  • Remove old logs used for debugging
  • Add druncschema dependency to this PR log
  • Integrate the failure mode testing into the unit tests

@PawelPlesniak

PawelPlesniak commented Jun 15, 2026

Collaborator Author
  • Update config paths to conf_dict.predefined_config_db = "config/drunc/failure-mode-testing.data.xml"

@PawelPlesniak

PawelPlesniak commented Jun 17, 2026

Collaborator Author
  • Use the --no-stop-error-batch-mode option to also check the ps and status tables
  • Investigate why the ps table does not contain the fake_daq_application in a dead state

@PawelPlesniak

PawelPlesniak commented Jun 17, 2026

Collaborator Author
  • Add LCS logfile checks
  • Add unified shell logs command that can target either the process manager or controller codes
  • Add post-failure ps and status requests, check those tables once the relevant PR in daqsystemtest has gone in
  • Fix pytest
  • Fix docs

@PawelPlesniak

PawelPlesniak commented Jun 17, 2026

Collaborator Author

At this time, once you set up the test running on NFD_DEV_260615_A9 with

source scripts/setup_drunc_config_path.sh

and run the test with

source scripts/drunc_integtest_bundle.sh -k fsm_cmd_timeout  --verbosity 5

the apps register themselves incorrectly in the controller hierarchy tree, causing the test to fail at a completely different stage. The cause is currently unknown. If one runs the intended config manually as

drunc-unified-shell ssh-standalone config/drunc/failure-testing.data.xml ft-fsm-cmd-timeout-nest-app ft-pr boot conf

the conf command does not complete in time, and the relevant error is reported. If one then runs the consolidated integration test config manually as

drunc-unified-shell ssh-standalone /tmp/pytest-of-pplesnia/pytest-161/config0/integtest-session-resolved.data.xml ft-pr boot conf

this fails in the same way as running the test with drunc_integtest_bundle.sh.

@PawelPlesniak

Collaborator Author

Example running the consolidated config

drunc-unified-shell ssh-standalone /tmp/pytest-of-pplesnia/pytest-161/config0/integtest-session-resolved.data.xml ft-fsm-cmd-timeout-nest-app ft-pr boot conf
[2026/06/17 16:55:41 UTC] INFO       shell.py:203                             drunc.unified_shell                                Setting up to use the process manager with configuration ssh-standalone and configuration id "ft-fsm-cmd-timeout-nest-app" from oksconflibs:/tmp/pytest-of-pplesnia/pytest-161/config0/integtest-session-resolved.data.xml
[2026/06/17 16:55:41 UTC] INFO       shell.py:225                             drunc.unified_shell                                Starting process manager
[2026/06/17 16:55:41 UTC] INFO       process_manager.py:109                   drunc.process_manager                              process_manager communicating through address 10.73.136.71:41593
[2026/06/17 16:55:41 UTC] INFO       shell.py:568                             drunc.unified_shell                                unified_shell ready with process_manager and controller commands
[2026/06/17 16:55:41 UTC] INFO       process_manager_driver.py:102            drunc.process_manager_driver                       Booting session ft-pr
[2026/06/17 16:55:41 UTC] INFO       process_manager_driver.py:479            drunc.process_manager_driver                       Configuration did not require modifications.
[2026/06/17 16:55:41 UTC] INFO       ssh_process_manager.py:385               drunc.process_manager.SSH_SHELL_process_manager    Booted 'local-connection-server' from session 'ft-pr' with UUID 9e1321a2-ee0a-4965-b665-8141de98761e
[2026/06/17 16:55:42 UTC] INFO       ssh_process_manager.py:385               drunc.process_manager.SSH_SHELL_process_manager    Booted 'ft-root-controller' from session 'ft-pr' with UUID 36852d96-f9d0-4207-bdc5-e9be1d37d8af
[2026/06/17 16:55:42 UTC] INFO       ssh_process_manager.py:385               drunc.process_manager.SSH_SHELL_process_manager    Booted 'ft-top-segment-controller' from session 'ft-pr' with UUID 64bf7aa9-50a4-4b8d-a86d-17b4f65d1d2d
[2026/06/17 16:55:42 UTC] INFO       ssh_process_manager.py:385               drunc.process_manager.SSH_SHELL_process_manager    Booted 'ft-nested-segment-1-controller' from session 'ft-pr' with UUID f4bcc236-2cf2-421f-9422-2f83c5aa628a
[2026/06/17 16:55:42 UTC] INFO       ssh_process_manager.py:385               drunc.process_manager.SSH_SHELL_process_manager    Booted 'ft-nested-segment-1-application' from session 'ft-pr' with UUID 4f828720-68f1-4d03-929f-3ed3863ae6a8
[2026/06/17 16:55:42 UTC] INFO       ssh_process_manager.py:385               drunc.process_manager.SSH_SHELL_process_manager    Booted 'ft-nested-segment-2-controller' from session 'ft-pr' with UUID d1bf6be8-3d96-4000-963e-8a9b70a6f3e9
[2026/06/17 16:55:42 UTC] INFO       ssh_process_manager.py:385               drunc.process_manager.SSH_SHELL_process_manager    Booted 'ft-nested-segment-2-application' from session 'ft-pr' with UUID 5b996be2-7dcd-4724-8a3d-d943578ca9c8
[2026/06/17 16:55:42 UTC] INFO       ssh_process_manager.py:385               drunc.process_manager.SSH_SHELL_process_manager    Booted 'ft-nested-segment-2.1-application' from session 'ft-pr' with UUID 5097c37a-79c7-4b30-9f84-523c7b450c8a
[2026/06/17 16:55:42 UTC] INFO       process_manager_driver.py:557            drunc.process_manager_driver                       Looking for top controller 'ft-root-controller' in the connectivity service at np04-srv-029:32174
  Looking for ft-root-controller on the connectivity service... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0:00:00 0:00:01
⠋ Trying to talk to the root controller... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ -:--:-- 0:00:00
                                                        ft-pr status                                                        
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Name                               ┃ Info ┃ State   ┃ Substate ┃ In error ┃ Included ┃ Endpoint                          ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ ft-root-controller                 │      │ initial │ initial  │ Yes      │ Yes      │ grpc://np04-srv-029.cern.ch:30547 │
│   ft-top-segment-controller        │      │ initial │ initial  │ Yes      │ Yes      │ grpc://np04-srv-029.cern.ch:30006 │
│     ft-nested-segment-1-controller │      │         │          │ No       │ No       │                                   │
└────────────────────────────────────┴──────┴─────────┴──────────┴──────────┴──────────┴───────────────────────────────────┘
Waiting on tree initialisation... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸ 100% 0:00:01
Controller did not initialise in time
[2026/06/17 16:56:55 UTC] ERROR      commands.py:124                          drunc.unified_shell.boot                           Booted, but the number of processes found in the connectivity service (8) does not match the number of processes found in the process manager (6). Use the ps command to determine which applications did not correctly register themselves on the connectivity service by comparing against
the status table, and the logs command to find out more about this failure.
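
The error above suggests comparing the booted processes against the status table. A minimal sketch of that comparison, using the names from this log, is given below. `unregistered` is a hypothetical helper, and ignoring `local-connection-server` is an assumption based on it not appearing in the status table of the successful manual run. The sketch does not attempt to reproduce the counts (8 and 6) in the error message.

```python
# Names taken from the "Booted '...'" lines in the log above.
BOOTED = [
    "local-connection-server",
    "ft-root-controller",
    "ft-top-segment-controller",
    "ft-nested-segment-1-controller",
    "ft-nested-segment-1-application",
    "ft-nested-segment-2-controller",
    "ft-nested-segment-2-application",
    "ft-nested-segment-2.1-application",
]

# Endpoint column of the status table above; an empty string means no endpoint shown.
STATUS_ENDPOINTS = {
    "ft-root-controller": "grpc://np04-srv-029.cern.ch:30547",
    "ft-top-segment-controller": "grpc://np04-srv-029.cern.ch:30006",
    "ft-nested-segment-1-controller": "",
}


def unregistered(booted, endpoints, ignore=()):
    """Return booted names that have no endpoint in the status table."""
    skipped = set(ignore)
    return [name for name in booted if name not in skipped and not endpoints.get(name)]


MISSING = unregistered(BOOTED, STATUS_ENDPOINTS, ignore=["local-connection-server"])

if __name__ == "__main__":
    for name in MISSING:
        print(name)
```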

@PawelPlesniak

Collaborator Author

Example running the same config manually

drunc-unified-shell ssh-standalone config/drunc/failure-testing.data.xml ft-fsm-cmd-timeout-nest-app ft-pr boot conf
[2026/06/17 16:58:09 UTC] INFO       shell.py:203                             drunc.unified_shell                                Setting up to use the process manager with configuration ssh-standalone and configuration id "ft-fsm-cmd-timeout-nest-app" from oksconflibs:config/drunc/failure-testing.data.xml
[2026/06/17 16:58:09 UTC] INFO       shell.py:225                             drunc.unified_shell                                Starting process manager
[2026/06/17 16:58:09 UTC] INFO       process_manager.py:109                   drunc.process_manager                              process_manager communicating through address 10.73.136.71:45341
[2026/06/17 16:58:09 UTC] INFO       shell.py:568                             drunc.unified_shell                                unified_shell ready with process_manager and controller commands
[2026/06/17 16:58:09 UTC] INFO       process_manager_driver.py:102            drunc.process_manager_driver                       Booting session ft-pr
[2026/06/17 16:58:09 UTC] INFO       process_manager_driver.py:479            drunc.process_manager_driver                       Configuration did not require modifications.
[2026/06/17 16:58:09 UTC] INFO       ssh_process_manager.py:385               drunc.process_manager.SSH_SHELL_process_manager    Booted 'local-connection-server' from session 'ft-pr' with UUID e2e8ae7d-b37a-4cd3-a2fb-0fa5c0c38254
[2026/06/17 16:58:10 UTC] INFO       ssh_process_manager.py:385               drunc.process_manager.SSH_SHELL_process_manager    Booted 'ft-root-controller' from session 'ft-pr' with UUID 1207a289-9bb9-480e-ad31-d4e1710579ee
[2026/06/17 16:58:10 UTC] INFO       ssh_process_manager.py:385               drunc.process_manager.SSH_SHELL_process_manager    Booted 'ft-top-segment-controller' from session 'ft-pr' with UUID bf0b0c86-c95d-4be3-b4be-4c09e16119df
[2026/06/17 16:58:10 UTC] INFO       ssh_process_manager.py:385               drunc.process_manager.SSH_SHELL_process_manager    Booted 'ft-nested-segment-1-controller' from session 'ft-pr' with UUID a0421983-1beb-47cf-8eb9-ac8598674989
[2026/06/17 16:58:10 UTC] INFO       ssh_process_manager.py:385               drunc.process_manager.SSH_SHELL_process_manager    Booted 'ft-nested-segment-1-application' from session 'ft-pr' with UUID 4443bcda-46c1-46f5-986d-144e67f7b7e7
[2026/06/17 16:58:10 UTC] INFO       ssh_process_manager.py:385               drunc.process_manager.SSH_SHELL_process_manager    Booted 'ft-nested-segment-2-controller' from session 'ft-pr' with UUID aa1a985c-5ed2-4bd3-9642-f9cec948e455
[2026/06/17 16:58:10 UTC] INFO       ssh_process_manager.py:385               drunc.process_manager.SSH_SHELL_process_manager    Booted 'ft-nested-segment-2-application' from session 'ft-pr' with UUID a8058b82-100b-4dd5-a4cf-8b610843fc9b
[2026/06/17 16:58:10 UTC] INFO       ssh_process_manager.py:385               drunc.process_manager.SSH_SHELL_process_manager    Booted 'ft-nested-segment-2.1-application' from session 'ft-pr' with UUID 322095f7-89ec-48a3-8be8-552c9de429c9
[2026/06/17 16:58:10 UTC] INFO       process_manager_driver.py:557            drunc.process_manager_driver                       Looking for top controller 'ft-root-controller' in the connectivity service at np04-srv-029:30005
  Looking for ft-root-controller on the connectivity service... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0:00:00 0:00:01
⠋ Trying to talk to the root controller... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ -:--:-- 0:00:00
                                                          ft-pr status                                                           
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Name                                    ┃ Info ┃ State   ┃ Substate ┃ In error ┃ Included ┃ Endpoint                          ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ ft-root-controller                      │      │ initial │ initial  │ No       │ Yes      │ grpc://np04-srv-029.cern.ch:42001 │
│   ft-top-segment-controller             │      │ initial │ initial  │ No       │ Yes      │ grpc://np04-srv-029.cern.ch:30006 │
│     ft-nested-segment-1-controller      │      │ initial │ initial  │ No       │ Yes      │ grpc://np04-srv-029.cern.ch:42717 │
│       ft-nested-segment-1-application   │      │ initial │ idle     │ No       │ Yes      │ rest://np04-srv-029.cern.ch:52919 │
│     ft-nested-segment-2-controller      │      │ initial │ initial  │ No       │ Yes      │ grpc://np04-srv-029.cern.ch:37355 │
│       ft-nested-segment-2-application   │      │ initial │ idle     │ No       │ Yes      │ rest://np04-srv-029.cern.ch:39987 │
│       ft-nested-segment-2.1-application │      │ initial │ idle     │ No       │ Yes      │ rest://np04-srv-029.cern.ch:45185 │
└─────────────────────────────────────────┴──────┴─────────┴──────────┴──────────┴──────────┴───────────────────────────────────┘
Waiting on tree initialisation... ━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   4% 0:01:08
[2026/06/17 16:58:15 UTC] WARNING    commands.py:137                          drunc.unified_shell.boot                           Getting the session states
[2026/06/17 16:58:15 UTC] INFO       commands.py:166                          drunc.unified_shell.boot                           Booted successfully
[2026/06/17 16:58:15 UTC] INFO       shell_utils.py:531                       drunc.controller.iface.shell_utils                 Running transition 'conf' on controller 'ft-root-controller', targeting: 'ft-root-controller'
                                                                ft-pr status                                                                
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Name                                    ┃ Info ┃ State      ┃ Substate         ┃ In error ┃ Included ┃ Endpoint                          ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ ft-root-controller                      │      │ initial    │ propagating-conf │ No       │ Yes      │ grpc://np04-srv-029.cern.ch:42001 │
│   ft-top-segment-controller             │      │ initial    │ propagating-conf │ No       │ Yes      │ grpc://np04-srv-029.cern.ch:30006 │
│     ft-nested-segment-1-controller      │      │ configured │ configured       │ No       │ Yes      │ grpc://np04-srv-029.cern.ch:42717 │
│       ft-nested-segment-1-application   │      │ configured │ idle             │ No       │ Yes      │ rest://np04-srv-029.cern.ch:52919 │
│     ft-nested-segment-2-controller      │      │ initial    │ propagating-conf │ No       │ Yes      │ grpc://np04-srv-029.cern.ch:37355 │
│       ft-nested-segment-2-application   │      │ initial    │ executing_cmd    │ No       │ Yes      │ rest://np04-srv-029.cern.ch:39987 │
│       ft-nested-segment-2.1-application │      │ configured │ idle             │ No       │ Yes      │ rest://np04-srv-029.cern.ch:45185 │
└─────────────────────────────────────────┴──────┴────────────┴──────────────────┴──────────┴──────────┴───────────────────────────────────┘
Waiting for conf to complete... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸  99% 0:00:01
[2026/06/17 16:59:15 UTC] ERROR      shell_utils.py:641                       drunc.controller.iface.shell_utils                 The command timed out, unfortunately this means the server is in undefined state, and your best option at this stage is to terminate and boot.
[2026/06/17 16:59:15 UTC] ERROR      shell_utils.py:658                       drunc.controller.iface.shell_utils                 The session did not complete the stateful transition in the specified time of 60 seconds. To investigate the cause, please check the controller and application logs with the logs command.
[2026/06/17 16:59:15 UTC] CRITICAL   controller_driver.py:346                 drunc.controller.core.ControllerDriver             TEST: Starting to_error on target ''
[2026/06/17 16:59:15 UTC] CRITICAL   controller_driver.py:346                 drunc.controller.core.ControllerDriver             TEST: Starting to_error on target ''
[2026/06/17 16:59:15 UTC] CRITICAL   controller_driver.py:355                 drunc.controller.core.ControllerDriver             TEST: Sending to_error request to server for target ''
[2026/06/17 16:59:15 UTC] CRITICAL   controller_driver.py:355                 drunc.controller.core.ControllerDriver             TEST: Sending to_error request to server for target ''
[2026/06/17 16:59:15 UTC] CRITICAL   controller_driver.py:359                 drunc.controller.core.ControllerDriver             TEST: Received to_error response from server for target ''
[2026/06/17 16:59:15 UTC] CRITICAL   controller_driver.py:359                 drunc.controller.core.ControllerDriver             TEST: Received to_error response from server for target ''
                                                                ft-pr status                                                                
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Name                                    ┃ Info ┃ State      ┃ Substate         ┃ In error ┃ Included ┃ Endpoint                          ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ ft-root-controller                      │      │ initial    │ propagating-conf │ Yes      │ Yes      │ grpc://np04-srv-029.cern.ch:42001 │
│   ft-top-segment-controller             │      │ initial    │ propagating-conf │ No       │ Yes      │ grpc://np04-srv-029.cern.ch:30006 │
│     ft-nested-segment-1-controller      │      │ configured │ configured       │ No       │ Yes      │ grpc://np04-srv-029.cern.ch:42717 │
│       ft-nested-segment-1-application   │      │ configured │ idle             │ No       │ Yes      │ rest://np04-srv-029.cern.ch:52919 │
│     ft-nested-segment-2-controller      │      │ initial    │ propagating-conf │ No       │ Yes      │ grpc://np04-srv-029.cern.ch:37355 │
│       ft-nested-segment-2-application   │      │ initial    │ executing_cmd    │ No       │ Yes      │ rest://np04-srv-029.cern.ch:39987 │
│       ft-nested-segment-2.1-application │      │ configured │ idle             │ No       │ Yes      │ rest://np04-srv-029.cern.ch:45185 │
└─────────────────────────────────────────┴──────┴────────────┴──────────────────┴──────────┴──────────┴───────────────────────────────────┘
[2026/06/17 16:59:16 UTC] ERROR      shell_utils.py:354                       drunc.utils.ShellContext                            FSM is in error (state: "initial"
sub_state: "propagating-conf"
in_error: true
included: true
), not currently accepting new commands.
[2026/06/17 16:59:16 UTC] INFO       shell.py:445                             drunc.unified_shell                                Shutting down the unified_shell
[2026/06/17 16:59:16 UTC] WARNING    shell.py:452                             drunc.unified_shell                                Controller is in error, cannot gracefully shutdown
[2026/06/17 16:59:16 UTC] INFO       shell_utils.py:316                       drunc.utils.ShellContext                           You will not be able to issue commands to the controller anymore.
[2026/06/17 16:59:16 UTC] INFO       shell_utils.py:318                       drunc.utils.ShellContext                           Controller driver has been deleted.
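
To make the final status table above easier to check in a test, its body rows can be parsed into records. This is a sketch under the assumption that the table keeps the seven columns shown in this log. `parse_status_rows` and `names_where` are hypothetical helpers, and the sample rows are abbreviated copies of rows from the last table.

```python
COLUMNS = ["name", "info", "state", "substate", "in_error", "included", "endpoint"]


def parse_status_rows(table_text: str) -> list:
    """Parse body rows (those starting with '│') into dicts keyed by COLUMNS."""
    rows = []
    for line in table_text.splitlines():
        line = line.strip()
        if not line.startswith("│"):
            continue  # skips the title, borders and the '┃' header row
        cells = [cell.strip() for cell in line.strip("│").split("│")]
        if len(cells) != len(COLUMNS):
            continue  # unexpected layout, ignore rather than guess
        rows.append(dict(zip(COLUMNS, cells)))
    return rows


def names_where(rows: list, column: str, value: str, negate: bool = False) -> list:
    """Names of rows whose column equals value (or differs, if negate is set)."""
    return [row["name"] for row in rows if (row[column] == value) != negate]


SAMPLE = """
│ ft-root-controller                      │      │ initial    │ propagating-conf │ Yes      │ Yes      │ grpc://np04-srv-029.cern.ch:42001 │
│     ft-nested-segment-1-controller      │      │ configured │ configured       │ No       │ Yes      │ grpc://np04-srv-029.cern.ch:42717 │
│       ft-nested-segment-2-application   │      │ initial    │ executing_cmd    │ No       │ Yes      │ rest://np04-srv-029.cern.ch:39987 │
"""

if __name__ == "__main__":
    rows = parse_status_rows(SAMPLE)
    print("not configured:", names_where(rows, "state", "configured", negate=True))
    print("in error:", names_where(rows, "in_error", "Yes"))
```

On the sample rows this separates the nodes that did not reach `configured` from those flagged as in error, which in the log above are not the same set.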
