Error handling: Report YAML parse errors and continue if error setting #353

Open
harriscr wants to merge 1 commit into ceph:main from harriscr:ch_wip_error_reporting

@harriscr
Contributor

This PR addresses 2 weaknesses in CBT error handling:

  1. When a CBT YAML file is malformed, CBT exits without any helpful message indicating where in the file the error is.
  2. There is no configuration setting to change the value of continue-if-error when calling the underlying PDSH methods.

For 1, the YAML parsing exception is caught and reported to the user, which lets them see exactly where the error in their file is and fix it before trying again.
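As a rough illustration of this kind of handling (not the code in this PR), the sketch below wraps `yaml.safe_load` and turns a PyYAML parse failure into a short message with a 1-based line and column. `load_config` and `ConfigError` are hypothetical names invented for the example; only `yaml.safe_load`, `yaml.YAMLError` and the `problem_mark` attribute come from PyYAML.

```python
import sys

import yaml  # PyYAML


class ConfigError(Exception):
    """Hypothetical exception type used only in this sketch."""


def load_config(stream):
    """Parse YAML from a string or open file, reporting where parsing failed."""
    try:
        return yaml.safe_load(stream)
    except yaml.YAMLError as exc:
        # Scanner and parser errors carry a 0-based position in problem_mark;
        # other YAML errors may not, so fall back gracefully.
        mark = getattr(exc, "problem_mark", None)
        if mark is not None:
            where = f"line {mark.line + 1}, column {mark.column + 1}"
        else:
            where = "unknown location"
        problem = getattr(exc, "problem", None) or str(exc)
        raise ConfigError(f"malformed YAML at {where}: {problem}") from exc


if __name__ == "__main__":
    try:
        # Two mapping values on one line is invalid YAML.
        load_config("common:\n  a: b: c\n")
    except ConfigError as err:
        print(f"ERROR: {err}", file=sys.stderr)
```

Chaining with `from exc` keeps the original PyYAML exception available for anyone who still wants the full traceback.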

The problem for 2 was found when the FIO process running during a test was crashing or never starting. The only way we could figure out what went wrong was by noticing oddities in the response curves from the benchmark run and then monitoring the individual processes on the system while the benchmark was running via CBT. It would be better to have a setting to stop on error, allowing quicker and more efficient debugging of issues like this. The pdsh helper methods already support this, but there was no way to change the behaviour. A setting has been added to the common section of the YAML, and 2 of the example files have been updated to show this new setting. There is potentially more that can be done in this area in the future, but this addresses our current pain point.
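To make the idea concrete, here is a minimal sketch of how a setting read from the common section could be threaded through to a command helper. It is not CBT's implementation: the key name `continue_if_error`, its default of `True`, and the helper `run_checked` are all assumptions made for illustration, and plain `subprocess` stands in for pdsh.

```python
import subprocess
import sys


def run_checked(argv, continue_if_error=True):
    """Run a command; on a non-zero exit either warn and continue, or raise.

    Hypothetical helper for this sketch, standing in for a pdsh wrapper.
    """
    result = subprocess.run(argv, capture_output=True, text=True)
    if result.returncode != 0:
        message = f"command {argv!r} exited with status {result.returncode}"
        if not continue_if_error:
            # Stop immediately so the failure is visible at its source.
            raise RuntimeError(message)
        print(f"WARNING: {message}", file=sys.stderr)
    return result


# A parsed config, as it might look after loading the YAML file.
# The key name and the default value are assumptions.
config = {"common": {"continue_if_error": False}}
continue_if_error = config.get("common", {}).get("continue_if_error", True)

if __name__ == "__main__":
    # A stand-in for a benchmark process that crashes on startup.
    failing = [sys.executable, "-c", "import sys; sys.exit(3)"]
    try:
        run_checked(failing, continue_if_error=continue_if_error)
    except RuntimeError as err:
        print(f"ERROR: {err}", file=sys.stderr)
```

With the setting disabled, a crashing process stops the run at the failing command instead of surfacing later as an odd response curve.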

Example output for a malformed YAML file:

Traceback (most recent call last):
  File "cbt/./cbt.py", line 100, in <module>
    exit(main(sys.argv))
         ~~~~^^^^^^^^^^
  File "cbt/./cbt.py", line 45, in main
    settings.initialize(ctx)
    ~~~~~~~~~~~~~~~~~~~^^^^^
  File "cbt/settings.py", line 37, in initialize
    config = yaml.safe_load(f)
  File "/usr/lib64/python3.14/site-packages/yaml/__init__.py", line 125, in safe_load
    return load(stream, SafeLoader)
  File "/usr/lib64/python3.14/site-packages/yaml/__init__.py", line 81, in load
    return loader.get_single_data()
           ~~~~~~~~~~~~~~~~~~~~~~^^
  File "/usr/lib64/python3.14/site-packages/yaml/constructor.py", line 49, in get_single_data
    node = self.get_single_node()
  File "/usr/lib64/python3.14/site-packages/yaml/composer.py", line 39, in get_single_node
    if not self.check_event(StreamEndEvent):
           ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.14/site-packages/yaml/parser.py", line 98, in check_event
    self.current_event = self.state()
                         ~~~~~~~~~~^^
  File "/usr/lib64/python3.14/site-packages/yaml/parser.py", line 171, in parse_document_start
    raise ParserError(None, None,
    ...<2 lines>...
            self.peek_token().start_mark)
yaml.parser.ParserError: expected '<document start>', but found '<block mapping start>'
  in "malformed.yaml", line 31, column 1

Signed-off-by: Chris Harris <harriscr@uk.ibm.com>
@harriscr harriscr self-assigned this May 11, 2026