
Implement ERS into drunc with small scale examples#660

Merged
PawelPlesniak merged 12 commits into develop from emmuhamm/introduce-ers
Feb 16, 2026

Conversation

@emmuhamm
Contributor

@emmuhamm emmuhamm commented Nov 5, 2025

Description

Fixes #732

Depends on DUNE-DAQ/daqpytools#46

This PR introduces a sample implementation of ERS into drunc, based on the major developments in daqpytools to include this functionality.

IMPORTANT: At the moment, ERS messages are published to session_tester. This is not yet configurable.

Controller level

For the controllers, new logic was added to identify transitions between error states. LogHandlerConf was also added; as the controllers have access to the ERS environment variables by default, it can be initialised straight away.

A subtle but important change is that at the controller level, the logger is now initialised as controller.core.{name}_ctrl. This matters because the new handler can inherit the stream handler from controller.core, while each controller obtains its own instance of an ERS Kafka handler, preventing any mixing. It also means that the application field of the error message shows controller.core.{name}_ctrl, making the message more traceable.
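The inheritance described above falls out of Python's standard dotted logger hierarchy. A minimal sketch (the logger names here are illustrative, following the controller.core.{name}_ctrl pattern; drunc's actual setup may differ):

```python
import logging

# Hypothetical names for illustration, following the pattern described above.
parent = logging.getLogger("controller.core")
child = logging.getLogger("controller.core.root_ctrl")

# Records logged on the child propagate up to handlers attached to
# controller.core (the shared stream handler), while a Kafka handler
# attached only to the child stays private to that controller.
assert child.parent is parent
assert child.propagate  # records flow up to the parent's handlers
```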

The controller now supports two things.

Log error when a controller goes into error

Whenever a to-error command is run on a controller, it triggers a log message stating that the controller is now in an error state. This is also sent to ERS. There is also logic for sending a message when the error has been resolved and the controller is no longer in an error state; however, this functionality has not been tested.
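The error/recovery logging above can be sketched as follows; this is a hedged illustration, not drunc's actual controller class, and the logger name is hypothetical:

```python
import logging

log = logging.getLogger("controller.core.demo_ctrl")  # hypothetical name

class ErrorStateTracker:
    """Minimal sketch of the to-error / recovery logging described above."""

    def __init__(self, name: str):
        self.name = name
        self.in_error = False

    def to_error(self) -> None:
        # Triggered by the to-error command; the message would also go to ERS.
        if not self.in_error:
            self.in_error = True
            log.error("Controller %s is now in an error state", self.name)

    def resolve(self) -> None:
        # The (so far untested) recovery path mentioned above.
        if self.in_error:
            self.in_error = False
            log.info("Controller %s is no longer in an error state", self.name)
```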

Screenshot 2026-02-11 at 10 24 07

Log error when a controller fails an FSM transition due to it or its children being in error

As a consequence, if someone then runs an FSM transition command (e.g. conf), it will not work because the controllers are in error.
The log message "Command 'conf' not executed: node is in error." already exists; it is now also sent to ERS.

Screenshot 2026-02-11 at 10 25 39

Process manager level

There are a few changes in the process manager. Importantly, the configuration has changed such that the ERS variables are copied over from the OKS configuration and hardcoded here. This lets the process manager inject them directly into its running instance, so they can be accessed by LogHandlerConf to initialise the ERS streams properly.
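The injection step can be sketched like this. The variable names and stream values below are placeholders, not the actual keys from the OKS configuration:

```python
import os

# Hypothetical key/value pairs; the real names and streams come from the
# OKS configuration that this PR copies into the process manager config.
ERS_ENV = {
    "DUNEDAQ_ERS_INFO": "lstdout",
    "DUNEDAQ_ERS_ERROR": "lstderr",
}

def inject_ers_env(env=None):
    """Inject the ERS stream variables into the given environment mapping
    (os.environ by default) so a log-handler configuration can read them
    when it initialises the ERS streams. Existing values are preserved."""
    target = os.environ if env is None else env
    for key, value in ERS_ENV.items():
        target.setdefault(key, value)
    return target
```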

Log error when an application unexpectedly dies

Within the publish method of the PM, which is mainly used to continuously report to OpMon, new logic checks whether the number of dead processes has risen since the last check. If it has, the PM identifies which process died and sends a log message to ERS.

Screenshot 2026-02-11 at 10 30 27
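The dead-process check above can be sketched as follows; the class and method names are illustrative, not drunc's actual API, and the message format mirrors the log lines shown later in this thread:

```python
class DeadProcessWatcher:
    """Sketch of the dead-process check run from the PM's publish loop."""

    def __init__(self):
        self._reported = set()  # UUIDs already reported as dead

    def check(self, processes):
        """processes: mapping of uuid -> (name, return_code or None if alive).
        Returns one message per newly dead process."""
        messages = []
        for uuid, (name, code) in processes.items():
            if code is not None and uuid not in self._reported:
                self._reported.add(uuid)
                messages.append(
                    f"Process {name} with UUID {uuid} has died with a return code {code}"
                )
        return messages
```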

Testing changes

Given the nature of these changes, testing is a bit involved.

Setting up

  1. Check out this PR and the dependencies above on the latest nightly (developed in NFD_DEV_260205_A9, should work in others)

  2. Run drunc-unified-shell ssh-CERN-kafka config/daqsystemtest/example-configs.data.xml ehn1-local-1x1-config session_tester

  3. Boot

  4. Go to the monitoring dashboard and select the session session_tester. You should see an empty set of messages in message reporting (unless someone else has tested things :p )

Testing

  1. We are now testing the PM. Kill one of the running applications. I usually kill the MLT

    1. To do so, in a new and clean shell, run ps -u [username] -U [username] -f
    2. Identify the relevant PID. It's usually in a process of the form daq_application -s [session name] -k [config name] -n [application name]
    3. Kill it with kill -9 [pid]
  2. Within about 10 s, the drunc unified shell should give an error message saying that process [name] with UUID [uuid] has died with code [code].

  3. You should see this message pop up in the ERS feed on the monitoring dashboard as well

  4. We are now going to test the controller. Choose your (least?) favorite controller and run to-error --target [controller-name]. When testing I used trg-controller

  5. If you run status, you should see that the controller is now in the error state

  6. You should also see on the dashboard an ERS message stating that the controller is now in an error state. Parent controllers will likely go into the error state as well, so this should be reflected in the messages too

  7. Lastly, we are going to test an FSM transition. Run conf

  8. This should throw an error message in ERS that says something along the lines of 'command conf not executed: node is in error'.

  9. Done!

Follow up

  • Check what severity level we should give these messages
  • See which ERS variables we want in the PM configs (e.g. in the local one, do we want protobuf?)
  • Check the wording of the log messages. Do we want to include any additional info? Is it too verbose? Not verbose enough?

Type of change

  • Documentation (non-breaking change that adds or improves the documentation)
  • New feature (non-breaking change which adds functionality)
  • Optimization (non-breaking, back-end change that speeds up the code)
  • Bug fix (non-breaking change which fixes an issue)
  • Breaking change (whatever its nature)

Key checklist

  • All tests pass (eg. python -m pytest)
  • Pre-commit hooks run successfully (eg. pre-commit run --all-files)

Further checks

  • Code is commented, particularly in hard-to-understand areas
  • Tests added or an issue has been opened to tackle that in the future.
    (Indicate issue here: # (issue))

@emmuhamm emmuhamm force-pushed the emmuhamm/introduce-ers branch from 8a6451d to 65d7060 Compare November 5, 2025 13:45
Base automatically changed from emmuhamm/use-daqpytools-logger to develop November 9, 2025 23:19
@emmuhamm emmuhamm force-pushed the emmuhamm/introduce-ers branch from 65d7060 to 9687278 Compare November 18, 2025 09:33
@emmuhamm emmuhamm self-assigned this Dec 4, 2025
@emmuhamm emmuhamm force-pushed the emmuhamm/introduce-ers branch from 3e75a6b to 2e6bdf5 Compare December 8, 2025 10:02
@emmuhamm emmuhamm force-pushed the emmuhamm/introduce-ers branch 3 times, most recently from 3ab8aeb to 9ed6042 Compare February 11, 2026 11:55
@emmuhamm emmuhamm force-pushed the emmuhamm/introduce-ers branch from 9ed6042 to 5173ee8 Compare February 11, 2026 11:57
@emmuhamm emmuhamm changed the title Introduce ERS to drunc Implement ERS into drunc with small scale examples Feb 11, 2026
@emmuhamm emmuhamm force-pushed the emmuhamm/introduce-ers branch from afe67cf to f4fc20a Compare February 11, 2026 14:25
Contributor Author

@emmuhamm emmuhamm left a comment


Hi @PawelPlesniak, this PR is now ready for review!!

As usual, I have comments which I'd like your feedback on, but they're quite narrow and simple.

The testing instructions are quite long, as expected given the changes, but I hope they make sense.

Note that the CI tests are failing. This is fine, mainly because they fail with this:

ERROR tests/process_manager/test_process_manager_endpoints.py::test_kill_endpoint - TypeError: LogHandlerConf.__init__() got an unexpected keyword argument 'init_ers'

This functionality is a feature of the dependency. Note that the tests run locally on my machine and pass just fine

Let me know if you have any comments

@emmuhamm emmuhamm marked this pull request as ready for review February 11, 2026 14:36
@PawelPlesniak
Collaborator

Reviewing

@PawelPlesniak
Collaborator

PawelPlesniak commented Feb 13, 2026

Notes

controller.core.{name}_ctrl

Another benefit: this can reduce the number of Kafka publishers again, as all can go through the drunc.controller logger

There is also logic for sending a message when the error has recovered and is no longer in an error state, however this functionality has not been tested.

This is good, but in a separate PR we will want to change how this is handled. The high-level view is that there will be a notification service that will go to the Supervisor (essentially an automated error recovery service) that will perform an automated action. For now ERS is fine, as a lot of the other messages that currently go through ERS (e.g. start of run messages) will need to go through this separate notification topic. Just for your information more than anything else

IMPORTANT: At the moment when using ERS, it is published to session_tester. This is currently not configurable yet.

Once your current open PRs that we have commented on are addressed, this is the next focus point

From this PR

Some changes

  • The application name listed in ERS is drunc.process_manager.SSH_SHELL_process_manager, which is the name of the log handler. This should be corrected to the true application name, in the case of the above this should be process_manager only, without the remainder (low priority, will list and link issue). Will come up with an implementation that we can discuss.
  • When restarting an application, there should not be an error as this is expected.
drunc-unified-shell ssh-standalone config/daqsystemtest/example-configs.data.xml local-1x1-config pawel
start-run --run-number 1
restart -n mlt
[2026/02/13 14:17:43 UTC] INFO       ssh_process_manager.py:462               drunc.process_manager.SSH_SHELL_process_manager    process_manager restarting ['mlt'] in session pplesnia
[2026/02/13 14:17:43 UTC] INFO       ssh_process_manager.py:277               drunc.process_manager.SSH_SHELL_process_manager    Process 'mlt' (session: 'session_tester', user: 'pplesnia') process exited with exit 
code 255
[2026/02/13 14:17:45 UTC] CRITICAL   process_manager.py:218                   drunc.process_manager.SSH_SHELL_process_manager    Process mlt with UUID 81be44e6-b30d-4f61-b86c-f10487d1c5e8 has died with a return code
255
[2026/02/13 14:17:53 UTC] INFO       ssh_process_lifetime_manager_shell.py:38 drunc.process_manager.SSH_SHELL_process_manager    Process 81be44e6-b30d-4f61-b86c-f10487d1c5e8 terminated
[2026/02/13 14:17:53 UTC] INFO       ssh_process_manager.py:340               drunc.process_manager.SSH_SHELL_process_manager    Booted 'mlt' from session 'session_tester' with UUID 
81be44e6-b30d-4f61-b86c-f10487d1c5e8

I will attempt to address this now. I have run the integration tests, but there is a chance that they get updated to use restart, and the integration tests I am defining definitely look for this. If there is a report of CRITICAL in the logs, the tests fail, and this is entirely dependent on when the OpMon publishing happens in the interval and how long the app takes to recover.

Going forward

Manual testing to see the errors on the dashboards has passed successfully.
Integration tests have passed successfully.
Unit tests with pytest have passed successfully.
I will rerun the minimal system quick test after this change, I do not expect anything to throw with restart. Nice work!

Changes made to this PR

Updated the configurations to use the relevant parameters.

@PawelPlesniak
Collaborator

@emmuhamm please review the last two changes; it is a simple check to ensure that when we expect a process to die (e.g. on restart), we do not publish any errors. I have run the MSQT on these small changes and they pass; the remainder does what you intend
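The expected-death guard described above can be sketched like this; the class and method names are illustrative, not the actual drunc implementation:

```python
class DeathReporter:
    """Sketch of the expected-death guard: suppress the CRITICAL report when
    a process was intentionally killed, e.g. as part of a restart."""

    def __init__(self):
        self._expected = set()  # UUIDs we expect to die

    def expect_death(self, uuid):
        # Called before an intentional kill, e.g. when handling `restart`.
        self._expected.add(uuid)

    def report(self, uuid, name, code):
        """Return a CRITICAL-style message only for unexpected deaths."""
        if uuid in self._expected:
            self._expected.discard(uuid)
            return None
        return f"Process {name} with UUID {uuid} has died with a return code {code}"
```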

@PawelPlesniak
Collaborator

Final round of pre-merge testing has shown an error: with terminate, the applications are not registered as expected to die

[2026/02/16 14:27:50 UTC] CRITICAL   process_manager.py:227                   drunc.process_manager.SSH_SHELL_process_manager    Process dfo-01 with UUID e2963bc3-34cc-42d5-a32f-132a6fe9b5ac has died with a return code 0
[2026/02/16 14:27:50 UTC] CRITICAL   process_manager.py:227                   drunc.process_manager.SSH_SHELL_process_manager    Process mlt with UUID fd57f618-9b63-44bb-9741-f2ecf047b10d has died with a return code 0
[2026/02/16 14:27:50 UTC] CRITICAL   process_manager.py:227                   drunc.process_manager.SSH_SHELL_process_manager    Process ru-det-conn-0 with UUID fc04d067-5337-4eb4-98bb-b0ed7ec0e746 has died with a return code 0

This will be addressed, and then merged

Collaborator

@PawelPlesniak PawelPlesniak left a comment


Tested, working as intended

@PawelPlesniak PawelPlesniak merged commit cb195b7 into develop Feb 16, 2026
2 of 4 checks passed
@PawelPlesniak PawelPlesniak deleted the emmuhamm/introduce-ers branch February 16, 2026 14:42


Development

Successfully merging this pull request may close these issues.

[Bug]: When an application goes into error, this should be reported as an error

2 participants