
Implement ERS into drunc with small scale examples#660

Merged
PawelPlesniak merged 12 commits into develop from emmuhamm/introduce-ers
Feb 16, 2026

Conversation

@emmuhamm
Contributor

@emmuhamm emmuhamm commented Nov 5, 2025

Description

Fixes #732

Depends on DUNE-DAQ/daqpytools#46

This PR introduces a sample implementation of ERS into drunc, based on the major developments in daqpytools to include this functionality.

IMPORTANT: At the moment, ERS messages are published to session_tester. This is not yet configurable.

Controller level

For the controllers, new logic was added to identify transitions between error states. LogHandlerConf was also added; as the controllers have access to the ERS environment variables by default, it can be initialised straight away.

A subtle but important change is that at the controller level, the logger is now initialised as controller.core.{name}_ctrl. This matters because the new handler can inherit the stream handler from controller.core, while each controller obtains its own instance of an ERS Kafka handler, preventing any mixing. It also means that the application field of the error message shows controller.core.{name}_ctrl, making the message more traceable.
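The inheritance described above falls out of Python's standard dotted logger hierarchy. A minimal sketch (the logger names here are illustrative, following the controller.core.{name}_ctrl pattern; drunc's actual setup may differ):

```python
import logging

# Hypothetical names for illustration, following the pattern described above.
parent = logging.getLogger("controller.core")
child = logging.getLogger("controller.core.root_ctrl")

# Records logged on the child propagate up to handlers attached to
# controller.core (the shared stream handler), while a Kafka handler
# attached only to the child stays private to that controller.
assert child.parent is parent
assert child.propagate  # records flow up to the parent's handlers
```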

The controller now supports two things.

Log error when a controller goes into error

Whenever a to-error command is run on a controller, it triggers a log message stating that the controller is now in an error state. This is also sent to ERS. There is also logic for sending a message when the error has been resolved and the controller is no longer in an error state; however, this functionality has not been tested.
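The error/recovery logging above can be sketched as follows; this is a hedged illustration, not drunc's actual controller class, and the logger name is hypothetical:

```python
import logging

log = logging.getLogger("controller.core.demo_ctrl")  # hypothetical name

class ErrorStateTracker:
    """Minimal sketch of the to-error / recovery logging described above."""

    def __init__(self, name: str):
        self.name = name
        self.in_error = False

    def to_error(self) -> None:
        # Triggered by the to-error command; the message would also go to ERS.
        if not self.in_error:
            self.in_error = True
            log.error("Controller %s is now in an error state", self.name)

    def resolve(self) -> None:
        # The (so far untested) recovery path mentioned above.
        if self.in_error:
            self.in_error = False
            log.info("Controller %s is no longer in an error state", self.name)
```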

Screenshot 2026-02-11 at 10 24 07

Log error when a controller fails an FSM transition due to it or its children being in error

As a consequence, if someone then runs an FSM transition command (e.g. conf), it will not work because the controllers are in error.
The log message "Command 'conf' not executed: node is in error." already exists; it is now also sent to ERS.

Screenshot 2026-02-11 at 10 25 39

Process manager level

There are a few changes in the process manager. Importantly, the configuration has changed such that the ERS variables are copied over from the OKS configuration and hardcoded here. This lets the process manager inject them directly into its running instance, so they can be accessed by LogHandlerConf to initialise the ERS streams properly.
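The injection step can be sketched like this. The variable names and stream values below are placeholders, not the actual keys from the OKS configuration:

```python
import os

# Hypothetical key/value pairs; the real names and streams come from the
# OKS configuration that this PR copies into the process manager config.
ERS_ENV = {
    "DUNEDAQ_ERS_INFO": "lstdout",
    "DUNEDAQ_ERS_ERROR": "lstderr",
}

def inject_ers_env(env=None):
    """Inject the ERS stream variables into the given environment mapping
    (os.environ by default) so a log-handler configuration can read them
    when it initialises the ERS streams. Existing values are preserved."""
    target = os.environ if env is None else env
    for key, value in ERS_ENV.items():
        target.setdefault(key, value)
    return target
```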

Log error when an application unexpectedly dies

Within the publish method of the PM, which is mainly used to continuously report to OpMon, new logic checks whether the number of dead processes has risen since the last check. If it has, the PM identifies which process died and sends a log message to ERS.

Screenshot 2026-02-11 at 10 30 27
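The dead-process check above can be sketched as follows; the class and method names are illustrative, not drunc's actual API, and the message format mirrors the log lines shown later in this thread:

```python
class DeadProcessWatcher:
    """Sketch of the dead-process check run from the PM's publish loop."""

    def __init__(self):
        self._reported = set()  # UUIDs already reported as dead

    def check(self, processes):
        """processes: mapping of uuid -> (name, return_code or None if alive).
        Returns one message per newly dead process."""
        messages = []
        for uuid, (name, code) in processes.items():
            if code is not None and uuid not in self._reported:
                self._reported.add(uuid)
                messages.append(
                    f"Process {name} with UUID {uuid} has died with a return code {code}"
                )
        return messages
```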

Testing changes

Given the nature of these changes, testing is a bit involved.

Setting up

  1. Check out this PR and the dependencies above on the latest nightly (developed in NFD_DEV_260205_A9, should work in others)

  2. Run drunc-unified-shell ssh-CERN-kafka config/daqsystemtest/example-configs.data.xml ehn1-local-1x1-config session_tester

  3. Boot

  4. Go to the monitoring dashboard and select the session session_tester. You should see an empty set of messages in message reporting (unless someone else has tested things :p )

Testing

  1. We are now testing the PM. Kill one of the running applications. I usually kill the MLT

    1. To do so, in a new and clean shell, run ps -u [username] -U [username] -f
    2. Identify the relevant PID. It's usually in a process of the form daq_application -s [session name] -k [config name] -n [application name]
    3. Kill it with kill -9 [pid]
  2. Within about 10 s, the drunc unified shell should give an error message saying that process [name] with UUID [uuid] has died with code [code].

  3. You should see this message pop up in the ERS feed on the monitoring dashboard as well

  4. We are now going to test the controller. Choose your (least?) favorite controller and run to-error --target [controller-name]. When testing I used trg-controller

  5. If you run status, you should see that the controller is now in the error state

  6. You should also see on the dashboard an ERS message stating that the controller is now in an error state. Parent controllers will likely go into the error state as well, so this should be reflected in the messages too

  7. Lastly, we are going to test an FSM transition. Run conf

  8. This should throw an error message in ERS that says something along the lines of 'command conf not executed: node is in error'.

  9. Done!

Follow up

  • Check what severity level we should give these messages
  • See which ERS variables we want in the PM configs (e.g. in the local one, do we want protobuf?)
  • Check the wording of the log messages. Do we want to include any additional info? Is it too verbose? Not verbose enough?

Type of change

  • Documentation (non-breaking change that adds or improves the documentation)
  • New feature (non-breaking change which adds functionality)
  • Optimization (non-breaking, back-end change that speeds up the code)
  • Bug fix (non-breaking change which fixes an issue)
  • Breaking change (whatever its nature)

Key checklist

  • All tests pass (eg. python -m pytest)
  • Pre-commit hooks run successfully (eg. pre-commit run --all-files)

Further checks

  • Code is commented, particularly in hard-to-understand areas
  • Tests added or an issue has been opened to tackle that in the future.
    (Indicate issue here: # (issue))

@emmuhamm emmuhamm force-pushed the emmuhamm/introduce-ers branch from 8a6451d to 65d7060 Compare November 5, 2025 13:45
Base automatically changed from emmuhamm/use-daqpytools-logger to develop November 9, 2025 23:19
@emmuhamm emmuhamm force-pushed the emmuhamm/introduce-ers branch from 65d7060 to 9687278 Compare November 18, 2025 09:33
@emmuhamm emmuhamm self-assigned this Dec 4, 2025
@emmuhamm emmuhamm force-pushed the emmuhamm/introduce-ers branch from 3e75a6b to 2e6bdf5 Compare December 8, 2025 10:02
@emmuhamm emmuhamm force-pushed the emmuhamm/introduce-ers branch 3 times, most recently from 3ab8aeb to 9ed6042 Compare February 11, 2026 11:55
@emmuhamm emmuhamm force-pushed the emmuhamm/introduce-ers branch from 9ed6042 to 5173ee8 Compare February 11, 2026 11:57
@emmuhamm emmuhamm changed the title Introduce ERS to drunc Implement ERS into drunc with small scale examples Feb 11, 2026
@emmuhamm emmuhamm force-pushed the emmuhamm/introduce-ers branch from afe67cf to f4fc20a Compare February 11, 2026 14:25
Contributor Author

@emmuhamm emmuhamm left a comment


Hi @PawelPlesniak, this PR is now ready for review!!

As usual, I have comments which I'd like your feedback on, but they're quite narrow and simple.

The testing instructions are quite long, as expected given the changes, but I hope they make sense.

Note that the CI tests are failing. This is fine, mainly because they fail with this:

ERROR tests/process_manager/test_process_manager_endpoints.py::test_kill_endpoint - TypeError: LogHandlerConf.__init__() got an unexpected keyword argument 'init_ers'

This functionality is a feature of the dependency. Note that the tests run locally on my machine and pass just fine

Let me know if you have any comments

@emmuhamm emmuhamm marked this pull request as ready for review February 11, 2026 14:36
@PawelPlesniak
Collaborator

Reviewing

@PawelPlesniak
Collaborator

PawelPlesniak commented Feb 13, 2026

Notes

controller.core.{name}_ctrl

Another benefit: this can reduce the number of Kafka publishers again, as all can go through the drunc.controller logger

There is also logic for sending a message when the error has recovered and is no longer in an error state, however this functionality has not been tested.

This is good, but in a separate PR we will want to change how this is handled. The high-level view is that there will be a notification service that will go to the Supervisor (essentially an automated error recovery service) that will perform an automated action. For now ERS is fine, as a lot of the other messages that currently go through ERS (e.g. start of run messages) will need to go through this separate notification topic. Just for your information more than anything else

IMPORTANT: At the moment when using ERS, it is published to session_tester. This is currently not configurable yet.

Once your current open PRs that we have commented on are addressed, this is the next focus point

From this PR

Some changes

  • The application name listed in ERS is drunc.process_manager.SSH_SHELL_process_manager, which is the name of the log handler. This should be corrected to the true application name, in the case of the above this should be process_manager only, without the remainder (low priority, will list and link issue). Will come up with an implementation that we can discuss.
  • When restarting an application, there should not be an error as this is expected.
drunc-unified-shell ssh-standalone config/daqsystemtest/example-configs.data.xml local-1x1-config pawel
start-run --run-number 1
restart -n mlt
[2026/02/13 14:17:43 UTC] INFO       ssh_process_manager.py:462               drunc.process_manager.SSH_SHELL_process_manager    process_manager restarting ['mlt'] in session pplesnia
[2026/02/13 14:17:43 UTC] INFO       ssh_process_manager.py:277               drunc.process_manager.SSH_SHELL_process_manager    Process 'mlt' (session: 'session_tester', user: 'pplesnia') process exited with exit 
code 255
[2026/02/13 14:17:45 UTC] CRITICAL   process_manager.py:218                   drunc.process_manager.SSH_SHELL_process_manager    Process mlt with UUID 81be44e6-b30d-4f61-b86c-f10487d1c5e8 has died with a return code
255
[2026/02/13 14:17:53 UTC] INFO       ssh_process_lifetime_manager_shell.py:38 drunc.process_manager.SSH_SHELL_process_manager    Process 81be44e6-b30d-4f61-b86c-f10487d1c5e8 terminated
[2026/02/13 14:17:53 UTC] INFO       ssh_process_manager.py:340               drunc.process_manager.SSH_SHELL_process_manager    Booted 'mlt' from session 'session_tester' with UUID 
81be44e6-b30d-4f61-b86c-f10487d1c5e8

I will attempt to address this now. I have run the integration tests, but there is a chance that they get updated to use restart, and the integration tests I am defining definitely look for this. If there is a report of CRITICAL in the logs, the tests fail, and this is entirely dependent on when the OpMon publishing happens in the interval and how long the app takes to recover.

Going forward

Manual testing to see the errors on the dashboards has passed successfully.
Integration tests have passed successfully.
Unit tests with pytest have passed successfully.
I will rerun the minimal system quick test after this change, I do not expect anything to throw with restart. Nice work!

Changes made to this PR

Updated the configurations to use the relevant parameters.

@PawelPlesniak
Collaborator

@emmuhamm please review the last two changes; it is a simple check to ensure that when we expect a process to die (e.g. on restart), we do not publish any errors. I have run the MSQT on these small changes and they pass; the remainder does what you intend
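The expected-death guard described above can be sketched like this; the class and method names are illustrative, not the actual drunc implementation:

```python
class DeathReporter:
    """Sketch of the expected-death guard: suppress the CRITICAL report when
    a process was intentionally killed, e.g. as part of a restart."""

    def __init__(self):
        self._expected = set()  # UUIDs we expect to die

    def expect_death(self, uuid):
        # Called before an intentional kill, e.g. when handling `restart`.
        self._expected.add(uuid)

    def report(self, uuid, name, code):
        """Return a CRITICAL-style message only for unexpected deaths."""
        if uuid in self._expected:
            self._expected.discard(uuid)
            return None
        return f"Process {name} with UUID {uuid} has died with a return code {code}"
```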

@PawelPlesniak
Collaborator

Final round of pre-merge testing has shown an error: with terminate, the applications are not registered as expected to die

[2026/02/16 14:27:50 UTC] CRITICAL   process_manager.py:227                   drunc.process_manager.SSH_SHELL_process_manager    Process dfo-01 with UUID e2963bc3-34cc-42d5-a32f-132a6fe9b5ac has died with a return code 0
[2026/02/16 14:27:50 UTC] CRITICAL   process_manager.py:227                   drunc.process_manager.SSH_SHELL_process_manager    Process mlt with UUID fd57f618-9b63-44bb-9741-f2ecf047b10d has died with a return code 0
[2026/02/16 14:27:50 UTC] CRITICAL   process_manager.py:227                   drunc.process_manager.SSH_SHELL_process_manager    Process ru-det-conn-0 with UUID fc04d067-5337-4eb4-98bb-b0ed7ec0e746 has died with a return code 0

This will be addressed, and then merged

Collaborator

@PawelPlesniak PawelPlesniak left a comment


Tested, working as intended

@PawelPlesniak PawelPlesniak merged commit cb195b7 into develop Feb 16, 2026
2 of 4 checks passed
@PawelPlesniak PawelPlesniak deleted the emmuhamm/introduce-ers branch February 16, 2026 14:42


Development

Successfully merging this pull request may close these issues.

[Bug]: When an application goes into error, this should be reported as an error

2 participants