Skip to content

proposal for regression testing#490

Open
HansVRP wants to merge 4 commits into
mainfrom
hv_regresion_benchmark
Open

proposal for regression testing#490
HansVRP wants to merge 4 commits into
mainfrom
hv_regresion_benchmark

Conversation

@HansVRP

@HansVRP HansVRP commented Apr 30, 2026

Copy link
Copy Markdown
Contributor

Idea for starting to include regression benchmarks.

@JanssenBrm I would also need info on how to best expose it such that we can keep a log on the service catalogue

@algorithm-services-catalogue

algorithm-services-catalogue Bot commented Apr 30, 2026

Copy link
Copy Markdown

🔍 Catalogue's Preview Site Deployed

Your changes have been deployed to the preview site:

🔗 Preview URL: https://esa-apex.github.io/apex-algorithms-catalogue-web/pr-preview/pr-490/

This preview will be updated automatically when you push new changes to your PR.

@VictorVerhaert VictorVerhaert self-assigned this May 4, 2026
@HansVRP HansVRP force-pushed the hv_regresion_benchmark branch 3 times, most recently from 42a8863 to be6e8ed Compare May 13, 2026 07:17
@HansVRP

HansVRP commented May 13, 2026

Copy link
Copy Markdown
Contributor Author

@JanssenBrm @VictorVerhaert ready to check. I have opted for a more adaptive benchmark where we look at the average and the std. Depending on the nr of successful runs the benchmark becomes more determinantal

@HansVRP HansVRP requested a review from JeroenVerstraelen May 13, 2026 09:35
@HansVRP

HansVRP commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

@JanssenBrm @JeroenVerstraelen @VictorVerhaert all feedback is welcome

@VictorVerhaert VictorVerhaert left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two small optional comment aimed at trying to prevent false fails. One more question: could you try and run it using github actions and see how it behaves in practice?
Otherwise the pr looks clean

Comment thread qa/tools/apex_algorithm_qa_tools/benchmarks.py Outdated
Comment thread qa/tools/apex_algorithm_qa_tools/benchmark_trends.py Outdated
@HansVRP

HansVRP commented Jun 12, 2026

Copy link
Copy Markdown
Contributor Author

@HansVRP HansVRP requested a review from VictorVerhaert June 12, 2026 15:37
@HansVRP

HansVRP commented Jun 12, 2026

Copy link
Copy Markdown
Contributor Author

ended up making some changes seeing the behavior on jenkins.

So I did add explicit time limitations on data to use and I made cost the sole gating value which can determine a failure. All other values were to volatile, especially for these small benchmarks which hover around 4 credits.

@HansVRP

HansVRP commented Jun 23, 2026

Copy link
Copy Markdown
Contributor Author

(blocked by #547)

@HansVRP HansVRP requested review from soxofaan and removed request for JeroenVerstraelen June 23, 2026 14:12
@HansVRP

HansVRP commented Jun 23, 2026

Copy link
Copy Markdown
Contributor Author

@JanssenBrm @soxofaan Could the two of you also take a look at this proposal of regression testing?

In general I look into the merged parquet for the last X months and per usage metric/cost get all valid metrics for succesful runs.

Based on that I calculate the median and the MAD and convert it into a standard deviation.

For now the test will fail if the measured cost > median +3.5 x MAD which seemed sensible for some tests I have ran.

The other metrics fluctuate quite a lot, so for those I only log warnings in case of a failure

@soxofaan

soxofaan commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

short on time here, so I could only give this a quick look
I'm a bit worried about the design here: if I understand correctly, you are adding hundreds of lines of logic in the main test_run_benchmark code path, involving (S3) io, pandas manipulations, which will make the main code path (already known to be brittle and trigger happy), even more brittle and hard to debug.
Can't the performance regression be added as a separate post-processing tool run, working on test suite output?

@HansVRP

HansVRP commented Jun 24, 2026

Copy link
Copy Markdown
Contributor Author

@soxofaan most of the code indeed relates to getting a good statistic out of this merger parquet file which can be used to determine there occurred regression or not. Ideally it would be cached somewhere locally such that we would not need to calculate the baseline on runtime each time.

am I correct that this is what you are proposing as well?

@HansVRP

HansVRP commented Jun 24, 2026

Copy link
Copy Markdown
Contributor Author

what I can do is move the baseline calculation to a separate github workflow and trigger it weekly to recompute the baseline.

in the actual test phase we would read in said file from S3 and check for regression?

@soxofaan

soxofaan commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

no I'm not talking about caching (I'm not sure there is even something useful to do with caching in a test_run_benchmark context),
What I mean is to keep the pytest ... test_run_benchmark run/workflow to the bare essentials (for stability and to-the-point results)
and do all the other analysis, trend watching, ... in separate tools or workflows (to allow experimentation, iteration, interactivity)

@HansVRP

HansVRP commented Jun 24, 2026

Copy link
Copy Markdown
Contributor Author

makes sense; I'll split both workflows up, let the regression trigger weekly and create separate github issues!

@HansVRP HansVRP force-pushed the hv_regresion_benchmark branch 2 times, most recently from f008197 to f0005a4 Compare June 25, 2026 12:26
@HansVRP HansVRP force-pushed the hv_regresion_benchmark branch from f0005a4 to b2a10e6 Compare June 25, 2026 12:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants