Implement ERS into drunc with small scale examples#660
Implement ERS into drunc with small scale examples#660PawelPlesniak merged 12 commits intodevelopfrom
Conversation
8a6451d to
65d7060
Compare
65d7060 to
9687278
Compare
3e75a6b to
2e6bdf5
Compare
3ab8aeb to
9ed6042
Compare
9ed6042 to
5173ee8
Compare
afe67cf to
f4fc20a
Compare
There was a problem hiding this comment.
Hi @PawelPlesniak, this PR is now ready for review!!
As usual, I have comments which I'd like your feedback on, but they're quite narrow and simple.
Testing instructions are quite a lot as expected cuz of the changes, but I hope it makes sense.
Note that CI tests are failing. This is fine, mainly because its failing with this:
ERROR tests/process_manager/test_process_manager_endpoints.py::test_kill_endpoint - TypeError: LogHandlerConf.__init__() got an unexpected keyword argument 'init_ers'
This functionality is a feature of the dependency. Note that it runs locally on my machine and passes just fine
Let me know if you have any comments
|
Reviewing |
Notes
Another benefit - can reduce the number of kafka publishers again - all can go through the
This is good, but in a separate PR we will want to change how this is handled. The high-level view is that there will be a notification service that will go to the Supervisor (essentially an automated error recovery service) that will perform an automated action. For now ERS is fine, as a lot of the other messages that currently go through ERS (e.g. start of run messages) will need to go through this separate notification topic. Just for your information more than anything else
Once your current open PRs that we have commented on are addressed, this is the next focus point From this PRSome changes
I will attempt to address this now - I have ran the integration tests, but there is a chance that they get updated to use restart, and the integration tests I am defining definitely look for this. If there is a report of Going forwardManual testing to see the errors on the dashboards has passed successfully. Changes made to this PRUpdated the configurations to use the relevant parameters. |
|
@emmuhamm please review the last two changes, it is a simple check to ensure that when we expect a process to die (e.g. on |
|
Final round of pre-merge testing has shown an error - with This will be addressed, and then merged |
PawelPlesniak
left a comment
There was a problem hiding this comment.
Tested, working as intended
Description
Fixes #732
Depends on DUNE-DAQ/daqpytools#46
This PR introduces a sample implementation of ERS into drunc, based on the major developments in daqpytools to include this functionality.
IMPORTANT: At the moment when using ERS, it is published to
session_tester. This is currently not configurable yet.Controller level
For the controllers, a bunch of new logic was added to identify the transition between error states. LogHandlerConf was also added; as the controllers by default have access to the ERS env variables, they can be initialised straight away.
A subtle but important change is that in the controller level, it is now initialised as
controller.core.{name}_ctrl. This is important so this new handler can inheret the stream handler fromcontroller.core, but also that each controller can obtain its own instance of an ers kafka handler, preventing any mixing. Importantly as well, in the error message it will show in the application field ascontroller.core.{name}_ctrl, making it more traceable.The controller now supports two things.
Log error when a controller goes into error
Whenever a
to-errorcommand is run on a controller, it will trigger a log message that says that the controller is now in an error state. This is also sent to ERS. There is also logic for sending a message when the error has recovered and is no longer in an error state, however this functionality has not been tested.Log error when a controller fails an FSM transition due to it or its children being in error
As a fallout for this, if someone then runs an FSM transition command (eg
conf), this is not going to work as the controllers are in error.The log message 'Command 'conf' not executed: node is in error.' already exists; this is supercharged with a message sent to ERS now
Process manager level
There are a few changes in the process manager. Importantly, the configuration has changed such that the OKS configuration for the ERS variables are copied over from the OKS and hardcoded in here. This lets the process manager inject it directly into its running instance, so it can be accessed by LogHandlerConf to initialise the ERS streams properly.
Log error when an application unexpectedly dies
Within the publish method of the PM, which is mainly used to continuously report to OpMon, some new bit of logic is injected to check if the number of dead processes has risen compared to before. If it has gone up, it will identify which process has died and send a log message to ERS.
Testing changes
Given the nature of these changes, testing is a bit involved.
Setting up
Check out this PR and the dependencies above on the latest nightly (developed in
NFD_DEV_260205_A9, should work in others)Run
drunc-unified-shell ssh-CERN-kafka config/daqsystemtest/example-configs.data.xml ehn1-local-1x1-config session_testerBoot
Go to the monitoring dashboard and select the session
session_tester. You should see an empty set of messages in message reporting (unless someone else has tested things :p )Testing
We are now testing the PM. Kill one of the running applications. I usually kill the MLT
ps -u [username] -U [username] -fdaq_application -s [session name] -k [configname] -n [application name]kill -9 [pid]You should see in < 10 s the drunc unified shell giving an error message saying a process [name] with UUID [uuid] has died with code [code].
You should also see that this message has also popped up in the ERS feed in the monitoring dashboard
We are now going to test the controller. Choose your (least?) favorite controller and run
to-error --target [controller-name]. When testing I usedtrg-controllerIf you run
statusyou should see that controller is now in the error stateYou should also see in the dashboard in ers a message stating that the controller is now in an error state. Parent controllers will likely also go into error state as well, so this should be reflected in the message as well
Lastly, we are going to test an FSM transition. Run
confThis should throw an error message in ERS that says something along the lines of 'command
confnot executed: node is in error'.Done!
Follow up
Type of change
Key checklist
python -m pytest)pre-commit run --all-files)Further checks
(Indicate issue here: # (issue))