proposal for regression testing #490
🔍 Catalogue's Preview Site Deployed. Your changes have been deployed to the preview site: 🔗 Preview URL: https://esa-apex.github.io/apex-algorithms-catalogue-web/pr-preview/pr-490/ This preview will be updated automatically when you push new changes to your PR. |
42a8863 to
be6e8ed
|
@JanssenBrm @VictorVerhaert ready to check. I have opted for a more adaptive benchmark where we look at the average and the standard deviation. Depending on the number of successful runs, the benchmark becomes more deterministic. |
|
@JanssenBrm @JeroenVerstraelen @VictorVerhaert all feedback is welcome |
VictorVerhaert
left a comment
Two small optional comments aimed at preventing false failures. One more question: could you try running it with GitHub Actions and see how it behaves in practice?
Otherwise the PR looks clean.
|
seems to be working well: https://jenkins.vgt.vito.be/job/openEO/job/openeo-apex-benchmarks-handpicked-run/61/ |
|
I ended up making some changes after seeing the behavior on Jenkins. I added explicit time limits on the data to use, and I made cost the sole gating value that can determine a failure. All other values were too volatile, especially for these small benchmarks, which hover around 4 credits. |
|
(blocked by #547) |
|
@JanssenBrm @soxofaan Could the two of you also take a look at this regression testing proposal? In general, I look at the merged Parquet file for the last X months and, per usage metric/cost, collect all valid values from successful runs. Based on that, I calculate the median and the MAD and convert the MAD into a standard deviation. For now, the test fails if the measured cost > median + 3.5 x MAD, which seemed sensible in the tests I have run. The other metrics fluctuate quite a lot, so for those I only log a warning when the threshold is exceeded. |
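For illustration, the check described in this comment could be sketched roughly as below. This is a minimal sketch under stated assumptions, not the code in this PR: the function names (`robust_baseline`, `check_run`), the dict-based input format, and the metric names are hypothetical, and the history is assumed to be already filtered to successful runs. The threshold uses the raw MAD, as written in the comment; the 1.4826 factor is the usual scaling that turns a MAD into a standard-deviation estimate for roughly normal data, and it is only reported here, not used for gating.

```python
from statistics import median

# Usual consistency constant: MAD * 1.4826 estimates the standard deviation
# for normally distributed data. Only reported here, not used for gating.
MAD_TO_STD = 1.4826


def robust_baseline(values):
    """Return (median, MAD) for a list of historical values."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    return med, mad


def check_run(history, measured, gating=("cost",), k=3.5, min_runs=2):
    """Compare measured metrics against median + k * MAD of their history.

    history:  dict mapping metric name -> list of past values (successful runs)
    measured: dict mapping metric name -> value of the current run
    Returns (failures, warnings) as lists of (metric, value, limit) tuples.
    Only metrics listed in `gating` can produce a failure (assumption: cost).
    """
    failures, warnings = [], []
    for metric, value in measured.items():
        past = history.get(metric, [])
        if len(past) < min_runs:
            continue  # not enough history to judge this metric
        med, mad = robust_baseline(past)
        limit = med + k * mad
        if value > limit:
            target = failures if metric in gating else warnings
            target.append((metric, value, limit))
    return failures, warnings


if __name__ == "__main__":
    # Hypothetical history, for illustration only.
    history = {
        "cost": [4.0, 4.1, 3.9, 4.2, 4.0],
        "duration": [100.0, 110.0, 90.0, 105.0, 95.0],
    }
    failures, warnings = check_run(history, {"cost": 5.0, "duration": 300.0})
    print("failures:", failures)
    print("warnings:", warnings)
    med, mad = robust_baseline(history["cost"])
    print("robust std estimate for cost:", MAD_TO_STD * mad)
```

Note that when all historical values are identical the MAD is zero, so any increase over the median would exceed the limit; a real check would probably want a minimum tolerance for that case.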
|
short on time here, so I could only give this a quick look |
|
@soxofaan Most of the code indeed relates to getting a good statistic out of the merged Parquet file, which can be used to determine whether a regression occurred. Ideally it would be cached somewhere locally so that we would not need to calculate the baseline at runtime each time. Am I correct that this is what you are proposing as well? |
|
What I can do is move the baseline calculation to a separate GitHub workflow and trigger it weekly to recompute the baseline. In the actual test phase we would then read that file from S3 and check for regression? |
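A rough sketch of that split, assuming the baseline is stored as a small JSON file: one step computes and writes the baseline, the other only reads it and compares. Everything here is hypothetical (function names, the JSON layout, the `limit` field), and a local file path stands in for the S3 object so the example stays self-contained.

```python
import json
from pathlib import Path
from statistics import median


def compute_baseline(history, k=3.5):
    """Periodic step: reduce per-metric history to median, MAD and upper limit."""
    baseline = {}
    for metric, values in history.items():
        med = median(values)
        mad = median(abs(v - med) for v in values)
        baseline[metric] = {
            "median": med,
            "mad": mad,
            "limit": med + k * mad,
            "n": len(values),  # number of runs behind the baseline
        }
    return baseline


def save_baseline(baseline, path):
    """Write the baseline as JSON (a local file stands in for S3 here)."""
    Path(path).write_text(json.dumps(baseline, indent=2))


def load_baseline(path):
    """Test step: read the precomputed baseline back."""
    return json.loads(Path(path).read_text())


def exceeds_baseline(baseline, metric, value):
    """True if the measured value is above the stored limit for this metric."""
    return value > baseline[metric]["limit"]


if __name__ == "__main__":
    # Hypothetical history, for illustration only.
    baseline = compute_baseline({"cost": [4.0, 4.1, 3.9, 4.2, 4.0]})
    save_baseline(baseline, "baseline.json")
    loaded = load_baseline("baseline.json")
    print(exceeds_baseline(loaded, "cost", 5.0))
```

With this shape the test phase needs no access to the full run history, only to the precomputed limits.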
|
no I'm not talking about caching (I'm not sure there is even something useful to do with caching in a |
|
makes sense; I'll split both workflows up, let the regression trigger weekly and create separate github issues! |
f008197 to
f0005a4
…eek to compare recent benchmark runs
f0005a4 to
b2a10e6
Idea for starting to include regression benchmarks.
@JanssenBrm I would also need info on how to best expose it such that we can keep a log on the service catalogue