Do not assume determinism by mzuenni · Pull Request #575 · Kattis/problem-package-format

mzuenni · 2026-01-29T16:10:53Z

The assumption of determinism was added to the new spec in af7ac80, but I think this it is a mistake. It breaks backwards compatibility and also breaks the assumptions of all kinds of user groups.

First of: non deterministic submissins do exists and are also sometimes required. The assumption of determinism clearly breaks such randomized problems in one way or another.

The fact whether or not results are reused can heavily influence the acceptance probability for submissions in various ways:
- a problem setter may assume that dublicate test cases can be used to reduce the acceptance probability
- a participant who knows that result are reused might get an advantage because he knows that his submission only fails on test cases
- a participant who does not know this would resubmit his time/hadware seeded submission expecting it to be rejudged
- No user group can actually be sure whether or not results are reused...
This assumption did not exist in legacy
- This will confuse problem setters
- This is an issue when upgrading problems
This makes problems with randomized inputs impossible
- https://open.kattis.com/problems/evolutionaryexcerpt

RagnarGrootKoerkamp · 2026-01-29T16:23:03Z

Some thoughts:

It seems that was added in Define test cases #153 which seems to be purely for defining things, not for practical reasons, so maybe this is an oversight?
What benefit would we get anyway from identifying identical cases? For test groups, validator flags will be different, and without groups, identical cases usually shouldn't be used anyway (and both problemtools and bapctools warn for it I think?)
The implication of 'once we run a submission on a testcase, we fix the result forever, even for resubmissions' is clearly not intended right.
Within one submission, this could be worked around by adding a counter to the input that the input validator ignores. But that's of course not backwards-compatible.
There might be situations where you want to have 100x the same input, but without the need for an input validator, because the only known solutions are truely random anyway? (This is a bit of a stretch though.)

So anyway, I'd say 'testcase' should never merge files with different basenames as long as we don't have a very good use case where this is helpful.

jsannemo · 2026-01-29T16:28:51Z

What benefit would we get anyway from identifying identical cases? For test groups, validator flags will be different,

Input validator flags possibly, output validator flags usually not.

The big benefit is reducing judging effort by allowing e.g test cases in a group with N <= 100 to be reused in a group with N <= 100,000, where all inputs to the process of judging the test case are identical.

This can make a big difference (>3x wall clock time for some problems) in judging time for e.g. IOI problems.

jsannemo · 2026-01-29T16:30:01Z

The implication of 'once we run a submission on a testcase, we fix the result forever, even for resubmissions' is clearly not intended right.

Allowing an assumption to be made does not imply forcing the assumption to be made.

jsannemo · 2026-01-29T16:31:17Z

There might be situations where you want to have 100x the same input,

An ugly workaround is a dummy output validator flag per case.

mzuenni · 2026-01-29T16:31:56Z

Allowing an assumption to be made does not imply forcing the assumption to be made.

This makes it even worse, because neither the problem setter nor the participant what choice the judging system made... but it impacts them.

jsannemo · 2026-01-29T16:32:36Z

But that's of course not backwards-compatible.

AFAIK Kattis already performs deduplication of (input, output validator flags)

jsannemo · 2026-01-29T16:34:52Z

Allowing an assumption to be made does not imply forcing the assumption to be made.

This makes it even worse, because neither the problem setter nor the participant what choice the judging system made... but it impacts them.

Well, you can certainly force a non-deduplication by the suggested workarounds, but you can't allow the optimization without letting judging systems perform it.

mzuenni · 2026-01-29T16:51:36Z

But that's of course not backwards-compatible.

AFAIK Kattis already performs deduplication of (input, output validator flags)

I would say:

this is not allowed by the legacy spec (even though this is mit explicit)
this obviously violates the guarantees given in https://open.kattis.com/problems/evolutionaryexcerpt which is a problem on kattis

With the current spec there is NO way to fix https://open.kattis.com/problems/evolutionaryexcerpt

mzuenni · 2026-01-29T16:53:18Z

but you can't allow the optimization without letting judging systems perform it.

then we need a different solution for that. But this is not it. It breaks older problems and makes certain types of problems that existed in the past impossible.

RagnarGrootKoerkamp · 2026-01-29T18:05:58Z

The big benefit is reducing judging effort by allowing e.g test cases in a group with N <= 100 to be reused in a group with N <= 100,000, where all inputs to the process of judging the test case are identical.

Yeah ok, in this case I do get that you want to reuse results.
In practice we use symlinks for this. Can that not be the goto way to prevent rerunning? (I guess windows is a problem ...)
But at least in our generators, we are very explicit about this.

Or did we already add some metadata somewhere also in the spec that groups depend on other groups, and are only allowed to run if those dependent groups/cases pass?

An ugly workaround is a dummy output validator flag per case.

Yes that'd be super annoying.

AFAIK Kattis already performs deduplication of (input, output validator flags)

(I'm assuming you meant (input, input validator flags) right?)
In that case: for interactive/multipass problems, output validator flags can affect what is send to the program, so assuming that output validator flags do not influence the result is tricky, right?

RagnarGrootKoerkamp · 2026-01-29T18:16:00Z

Or did we already add some metadata somewhere also in the spec that groups depend on other groups.

We have require_pass for this. Doesn't that make the re-running of testcases across groups completely irrelevant?
And so basically you should never have identical testcases for this purpose?

jsannemo · 2026-01-29T18:46:33Z

Or did we already add some metadata somewhere also in the spec that groups depend on other groups.

We have require_pass for this. Doesn't that make the re-running of testcases across groups completely irrelevant?
And so basically you should never have identical testcases for this purpose?

Ah, I haven't kept up with the development recently.

Do note that it's very common for cases to be reused in ways that are not strict subsets.
E.g:
Group 1: N <= 100, B <= 10
Group 2: N <= 1000
Group 3: B <= 1

Where the second two groups reuse just subsets of the first.

jsannemo · 2026-01-29T18:47:57Z

The big benefit is reducing judging effort by allowing e.g test cases in a group with N <= 100 to be reused in a group with N <= 100,000, where all inputs to the process of judging the test case are identical.

Yeah ok, in this case I do get that you want to reuse results.
In practice we use symlinks for this. Can that not be the goto way to prevent rerunning? (I guess windows is a problem ...)
But at least in our generators, we are very explicit about this.

Or did we already add some metadata somewhere also in the spec that groups depend on other groups, and are only allowed to run if those dependent groups/cases pass?

An ugly workaround is a dummy output validator flag per case.

Yes that'd be super annoying.

I'm not sure I agree with that quantifier: I can count on one hand the number of problems I've written where I want to enforce this behavior.

AFAIK Kattis already performs deduplication of (input, output validator flags)

(I'm assuming you meant (input, input validator flags) right?)
In that case: for interactive/multipass problems, output validator flags can affect what is send to the program, so assuming that output validator flags do not influence the result is tricky, right?

No, output validator flags are what affect judging, input validator flags are irrelevant after install, no?

RagnarGrootKoerkamp · 2026-01-29T18:59:52Z

Where the second two groups reuse just subsets of the first.

Hmm yeah that's currently not supported, although I think it easily could be?

I'm not sure I agree with that quantifier: I can count on one hand the number of problems I've written where I want to enforce this behavior.

Yeah I haven't needed it much, but making a list of one output_validator_flags per test case would be annoying. Then it's easier to just add some dummy integer to the input files.

No, output validator flags are what affect judging, input validator flags are irrelevant after install, no?

oh right indeed, my bad. I thought you were suggesting that kattis only runs once ignoring output validator flags, and then only checks the team output multiple times against different output_validator_flags. But then is not the case then.

jsannemo · 2026-01-29T20:58:37Z

But that's of course not backwards-compatible.

AFAIK Kattis already performs deduplication of (input, output validator flags)

I would say:

this is not allowed by the legacy spec (even though this is mit explicit)

this obviously violates the guarantees given in https://open.kattis.com/problems/evolutionaryexcerpt which is a problem on kattis

With the current spec there is NO way to fix https://open.kattis.com/problems/evolutionaryexcerpt

Can't you add dummy output validator flaga per test case? That's how I would assume it's implemented now, since the current spec formulation is how Kattis does it today.

I'd argue the legacy spec allows it specifically because it's not explicit. :)

then we need a different solution for that. But this is not it. It breaks older problems and makes certain types of problems that existed in the past impossible.

I.e, this is clearly not true since Kattis today works under this assumption (unless I'm mistaken, @niemela )

mzuenni · 2026-01-29T21:29:11Z

But that's of course not backwards-compatible.

AFAIK Kattis already performs deduplication of (input, output validator flags)

I would say:

this is not allowed by the legacy spec (even though this is mit explicit)

this obviously violates the guarantees given in https://open.kattis.com/problems/evolutionaryexcerpt which is a problem on kattis

With the current spec there is NO way to fix https://open.kattis.com/problems/evolutionaryexcerpt

Can't you add dummy output validator flaga per test case? That's how I would assume it's implemented now, since the current spec formulation is how Kattis does it today.

No. Even the exact same submission should be rejudged since the input is truly random every time. And with the current formulation a judging system could decide that if the submission is identical it would not need to be rerun.

hen we need a different solution for that. But this is not it. It breaks older problems and makes certain types of problems that existed in the past impossible.

I.e, this is clearly not true since Kattis today works under this assumption

no thats a false conclusion. This means that Kattis today has a bug since it clearly violoates whats written in the satement of that problem that is hosted on Kattis. (In other words either the statement or the implementation is wrong but both is Kattis responsibility here?)

I'd argue the legacy spec allows it specifically because it's not explicit. :)

I would argue that a judging is not allowed to change the judging process in any way that is observable for a user and this is very well observable.

mzuenni · 2026-01-29T21:45:28Z

Also as @RagnarGrootKoerkamp pointed out, the following sentence has very weird consequences.

Judge systems may assume that the result of running a program on a test case is deterministic.

It allows a judging system to run the same submission on the same input multiple times and pick the "worst" verdict? For a deterministic submission this makes no difference and therefore should be fine?

In generell we should not assume things about submission that are are not necessarily true. And we should not allow a judging system to make decission that influence the verdict of a submission/are observable for participants.

The arguments in favor of this (that I have read so far) also show that you do not actually want to reuse a testcase but rather the outcome (verdict) of a submission on a testcase. If you want this you should do it directly and not do this in some hacky way with reusing input files.

jsannemo · 2026-01-29T22:06:50Z

I see what you're saying, but I just don't think it's a very big problem, and in particular not enough to outweigh the benefit of having the sameness of test cases result in the same verdict implicitly.

Specifically for randomized solutions, it's in most languages, and especially all languages at e.g. the ICPC and IOI, trivial to derandomize: you fix a seed, and if you assume your solution passes e.g 95% of submission attempts, you submit with two different seeds. In fact, repeating your test case is something I as a jury member would strongly discourage - it's easily defeated by selecting as seed a hash of the input.

I can kind of buy the point about knowledge of this behavior benefiting those who know of the behaviour. At e.g. the IOI, the rules were very clear in that it's your responsibility to make your solution deterministic. However, I would argue this is always the case, and that the implications of not assuming determinism are worse.

At any contest, your solution may be rejudged at the discretion of the judges for a number of reasons. As such, any nondeterminstic solution could change its verdict on a rejudgment. For example, a discovered hardware problem, invalid test data, etc etc.

Making problems that explicitly count on the non-determinism of solutions rather than requiring determinism - and, informing contestants of this as e.g the IOI does in its rule - suddenly makes it such that your verdict might suddenly change on a rejudge. I think it's deeply problematic for this to be the case for how the problem is expected to be solved.

jsannemo · 2026-01-29T22:12:14Z

But that's of course not backwards-compatible.

AFAIK Kattis already performs deduplication of (input, output validator flags)

I would say:

this is not allowed by the legacy spec (even though this is mit explicit)

this obviously violates the guarantees given in https://open.kattis.com/problems/evolutionaryexcerpt which is a problem on kattis

With the current spec there is NO way to fix https://open.kattis.com/problems/evolutionaryexcerpt

Can't you add dummy output validator flaga per test case? That's how I would assume it's implemented now, since the current spec formulation is how Kattis does it today.

No. Even the exact same submission should be rejudged since the input is truly random every time. And with the current formulation a judging system could decide that if the submission is identical it would not need to be rerun.

hen we need a different solution for that. But this is not it. It breaks older problems and makes certain types of problems that existed in the past impossible.

I.e, this is clearly not true since Kattis today works under this assumption

no thats a false conclusion. This means that Kattis today has a bug since it clearly violoates whats written in the satement of that problem that is hosted on Kattis. (In other words either the statement or the implementation is wrong but both is Kattis responsibility here?)

Of course it's not a false conclusion? It doesn't "break older problems" - at most, older problems are today relying on unspecified behavior that make them broken. Defining undefined behavior is not breaking backwards compatibility.

I'd argue the legacy spec allows it specifically because it's not explicit. :)

I would argue that a judging is not allowed to change the judging process in any way that is observable for a user and this is very well observable.

I don't think anything in the old spec guarantees any sources of randomness being random: a sandbox may for example always return a fixed timestamp, always return the same /dev/random output etc.

mzuenni · 2026-01-30T00:27:13Z

I just don't think it's a very big problem

I disagree :)

outweigh the benefit of having the sameness of test cases result in the same verdict implicitly

I don't see any benefit of this. As mentioned before, it seems like you want to express something like "test group A relies on the verdict/score of test case X", but the current "solution" for this just does something entirely different.

Specifically for randomized solutions, it's in most languages, and especially all languages at e.g. the ICPC and IOI, trivial to derandomize

~~That is true (at least if the person who wrote the code did not make it intentionally hard...)~~ but also irrelevant. A derandomized solution is a different solution and can get a different verdict.

At e.g. the IOI, the rules were very clear in that it's your responsibility to make your solution deterministic.

That is fine for them. IOI can add whaterver rules they would like. We on the other hand should not add any rules or any unneccessary restriction.

However, I would argue this is always the case, and that the implications of not assuming determinism are worse.

This is not true for ICPC?

At any contest, your solution may be rejudged at the discretion of the judges for a number of reasons. As such, any nondeterminstic solution could change its verdict on a rejudgment. For example, a discovered hardware problem, invalid test data, etc etc.

Yes, but what are you arguing here? You do the rejudging because you expect a deterministic solution to get a new verdict... Obviously any solution could get a different judging here?!
And I want to add the rejudging should probably only happen to cases where this could happpen... but whoever does the rejudging has the right to do so. That is just irrelevant for this discussion.

Making problems that explicitly count on the non-determinism of solutions rather than requiring determinism - and, informing contestants of this as e.g the IOI does in its rule - suddenly makes it such that your verdict might suddenly change on a rejudge. I think it's deeply problematic for this to be the case for how the problem is expected to be solved.

This is besides the point. If the competition has such a rule than the judging system can do the assumption. But again, we should not add such assumptions.

RagnarGrootKoerkamp · 2026-01-30T01:08:04Z

non-determinism of solutions

So actually in our randomized-input problems, we use the non-determinism of the interactor, and very much rely on this.
We want to insist that re-submissions of the same code will get a new random input.

So also here, rejudging will be broken, and if we can't even require our own code to be deterministic, it probably doesn't add much to require that from submissions.

Also, there is stuff like PYTHONHASHSEED, which influences the order in which thing are iterated over in a set, and supposedly cannot be changed from inside the program.
Similarly, Rust also randomizes the hash function for all HashSet instances. (Not sure if the random state is global or per instance, but it definitely changes across independent runs of the executable.)

So if you're requiring deterministic output, you're basically forcing submissions to avoid a bunch of common language features, which seems completely basides the point.
Sure, IOI may want to enfore this, but the spec absolutely should not (since it's also used in uni courses and such).

Regarding rejudging: generally if something is accepted once it should remain accepted, and there's not much one can do about it anyway. The other case is when rejudging a WA submission becomes AC only due to randomness. But in that case you still have the option to manually run it a few more times and/or to just not apply the rejudging.

mzuenni · 2026-01-30T02:24:35Z

Judge systems may assume that the result of running a program on a test case is deterministic.
[...]
The assumption of determinism means that a judge system could choose to reuse the result of a previous run, or to re-run the equivalent test case.

Yeah so actually even if we would accept the former the later is still wrong since there is no word about the valdiator beeing deterministic...

jsannemo · 2026-01-30T03:20:01Z

And with the current formulation a judging system could decide that if

the submission is identical it would not need to be rerun. Yes, and the way you should rerun that is to change your random seed. I do not think that problems with test data generated randomly for each new submission are common enough to be what should dictate this. And to be honest I'm not sure I'm completely sold on the idea of fully random test data either. If I made such a problem, I'd request the solution to give me a seed instead (that I e.g. xor with a per-test case seed if I wanted multiple random test cases). That gives you both the behaviour you want and allows assuming determinism.

So if you're requiring deterministic output

That's not, at least according to me, the point, nor what the text is doing. It's about allowing the judging system to *assume* deterministic output. Clearly it can never *require* this. As you say, language features or bugs can introduce unintended non-determinism. The reason that IOI has in its rules that solutions must be deterministic is not to forbid randomized solutions: it's to make a fact of the judging process clear, which is that if your submission have *unintended non-determinism*, your verdict is not guaranteed. I argue that unintended non-determinism is a bug, and that you should not be guaranteed any verdict in that case.

Regarding rejudging: generally if something is accepted once it should

remain accepted I mean that's an opinion just as valid as mine that if you're non-deterministic you might not always be accepted. :-)

You do the rejudging because you expect a deterministic solution to get a

new verdict... [snip] And I want to add the rejudging should probably only happen to cases where this could happpen... You mean that the judge should skip rejudging of test cases that didn't change because the submissions can be assumed to be deterministic on the other cases? ;)

That is true (at least if the person who wrote the code did not make it

intentionally hard...) but also irrelevant. A derandomized solution is a different solution and can get a different verdict. I think it's totally relevant since it clearly shows that you as a problem author you don't gain anything by e.g. repeating a test case in the hopes of having it run multiple times, since it's trivial to make your randomized solution be random only over different test cases rather than each of your instance of the same test case, which really is the argument that was used most for why a problem might want non-determinism (in addition to the random-testdata one which I think is bad practice and should be used by the validator and submission together seeding a generator).

…

On Fri, 30 Jan 2026, 03:24 mzuenni, ***@***.***> wrote: *mzuenni* left a comment (Kattis/problem-package-format#575) <#575 (comment)> Judge systems may assume that the result of running a program on a test case is deterministic. [...] The assumption of determinism means that a judge system could choose to reuse the result of a previous run, or to re-run the equivalent test case. Yeah so actually even if we would accept the former the later is still wrong since there is no word about the valdiator beeing deterministic... — Reply to this email directly, view it on GitHub <#575 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAHIMXNRHGNHOQVXED2CMR34JK6HTAVCNFSM6AAAAACTKRIPJSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZTQMRRGQYTKMRRGM> . You are receiving this because you commented.Message ID: ***@***.***>

mzuenni · 2026-01-30T12:19:39Z

And with the current formulation a judging system could decide that if

the submission is identical it would not need to be rerun.

First, it would not need to but with the current formulation we would allow this wich feels very wrong.
Second, as already pointed out the verdict does not only depend on the submission. The output validator could also be non deterministic and should be rerun.

You mean that the judge should skip rejudging of test cases that didn't change because the submissions can be assumed to be deterministic on the other cases? ;)

Very much no. For me the judging process is used to generate a verdict according to the "verdict distribution" of the (randomized) submission. That means for each test case the submission is sampled once and that is aggregated.
If there are two identical test cases the clear interpretation is that that test case should be used twice to sample the distribution. (And I would define that a judging process is ok if its final verdict obeys the same distribution. This allows stuff like lazy judging but clearly not caching).
Now resampling must happen if a sample was generated the wrong way (like invalid test data, hard ware whatever) and by default nothing else should be resampled since that would again low stuff like "rerun this X times and take the worst verdict" which changes the observably verdict distribution.

Obviously a judge can still rejudge whatever he wants. That the right of a judge. But we only describe a problem package format here. A data format. IMO we should not write these assumptions about the submissions at all?

niemela

I'm strongly against removing this. We had a long discussion about this back when...

...maybe we should schedule a call?

jsannemo · 2026-01-30T16:33:22Z

I think we're not going to find agreement on this @mzuenni , but I want to point out that my view on

Obviously a judge can rejudge whatever he wants. That the right of a judge. But we only describe a problem package format here. A data format. We should not write assumptions about the submissions at all?

Is different, but I'm not sure you agree with yourself on this either. There are three options:

specify that when evaluating a submission this assumption can be made
specify that it mustn't be made
leave any assumption on this up to the judge itself.

I think most of your arguments for why this assumption should not be allowed very much means we must explicitly forbid it. It is not out of place for the PPF to dictate this: if parts of the judging process dictate how you get a verdict, we must specify it for a problem to mean the same thing across judges. And in fact I will now in some sense disagree with myself in saying that perhaps judges mustn't be allowed to either make or not make the assumption? (Although I still would prefer the defined behavior to be that judges should evaluate a single input only once, and as I've explained, no, you really can't and in my opinion shouldn't use multiple identical inputs to sample an output distribution: it's trivial to work around in submissions).

Anyways, I think @niemela idea of moving this to a live discussion is the right call.

mzuenni · 2026-01-30T17:08:13Z

you really can't and in my opinion shouldn't use multiple identical inputs to sample an output distribution: it's trivial to work around in submissions).

I am also not in favor of identical test cases (and we BAPCtools warns for identical inputs unless silenced) but this happened in the past.
But I would prefer the much more natural definition of "one input file means one run" instead of needing to first define what identical test cases mean, then assuming that submissions and output validators are deterministic, just for some judging systems to be able to cache them...

Also the core issue is much larger. Not only is the assumption currently in the spec not strong enough to actually allow caching, but also right now it would allow a judging system to cache stuff across submissions? (For example if the submitted files are identical). IMO this is not intended and should never be allowed.

Anyways, I think @niemela idea of moving this to a live discussion is the right call.

Yea i am happy with that

niemela

I'm strongly against this change.

This was not an oversight, we did this on purpose, after a long discussion.

mzuenni · 2026-02-27T14:41:58Z

I think this at least has to be reconsidered/rediscussed/rephrased. The current formulation makes is technically unsuitable for ICPC contest/forbids any kind of problem where non deterministic submissions are expected (the reasons are technical but still it shows that something is definitely wrong here)

RagnarGrootKoerkamp · 2026-02-27T17:08:09Z

Allright, so @mzuenni and me had some more discussion.

There's a few things going on:

Whether submissions are deterministic.
Why does this assumption exist?
How this assumption is used in practice?
What exactly is reused/cached?
When things are reused/cached (what is a 'test case' and when are they equal)?

1. Deterministic submissions

The spec currently writes:

judge systems may assume that the results of running a program on a test case is deterministic.

What is 'a program' here? The team submission only? What about the output validator (which surely can also be non-deterministic)?
In practice, team submissions are not deterministic (in ICPC, IOI, and elsewhere): for example many languages have randomized hashmap implementations. This is a fact.
Furthermore, interactive and multipass output validators/interactors are not deterministic either in practice.
Nothing is said about what the judge system can/may/will do in case this assumption is false. Such undefined behaviour seems very bad. Is it allowed to just crash? (Clearly not.)
Regardless of whether code is deterministic in practice, of course, the spec may allow judge systems to reuse results either way.
Outside of ICPC, one can easily imagine algorithms courses where students should implement randomized algorithms. There is no reason the spec could not support those.

2. Why this assumption

There seem to be a few reasons for wanting such an assumption:

Reduce time of judging a submission on 'identical' (for some definition) testcases.
Ensure consistency of results between testgroups, for example if group A is failed and A a subset of B, then B should also fail.
To save time and ensure consistency between resubmissions
To ensure consistency (and reduce time?) between rejudging runs
To ensure consistency between judge systems (main and shadow)

I assume points 3, 4, and 5 are currently not 'exploited' by any implementation, as results are not cached across submissions anyway, and non-deterministic solutions could already get a new verdict across resubmissions (which surely happens in practice), across rejudgings, and across judge systems.

Point 2 can now be handled, at least for groups, by require_pass. This can and probably should be extended to test cases as well.

Point 1 seems reasonable, but requires very careful definition. There seems to be some confusion what reusing "the result of a previous run" means: it can either be the verdict, or the output of the team submission (but only for 'standard' input-output problems).

3. How this assumption is used in practice

DOMjudge does not use it.
BAPCtools does not use it.
What exactly is currently implemented in Kattis? @mzuenni and me don't know.

4. What is reused?

For interactive and multipass problems, only the final verdict/score can be reused. output_validator_flags are a part of the "input" of the test case.
For standard input-output problems, the output of the team submission could be reused as well.

The hamilton path problem mentioned in the other thread wants to reuse the output, while the default assumption to me would be to only reuse the verdict/score.

What does Kattis do here?

5. When are things reused?

For interactive/multi-pass problems, output_validator_flags are part of the "input", and thus, it makes sense to define "test case equality" to also require equal output_validator_flags. That then implies that only the verdict/score can be reused, and not the team output. If different test groups have different output_validator_flags, that means everything should be rerun. (But require_pass still works of course, because those testcases are not part of the current group.)

Tagl · 2026-02-27T17:13:55Z

For interactive/multi-pass problems, output_validator_flags are part of the "input", and thus, it makes sense to define "test case equality" to also require equal output_validator_flags.

This is true for non-interactive output validators too.

niemela · 2026-02-27T18:27:15Z

I think framing this as "a permission granted to judge systems to cache results" has muddied the discussion. The assumption is better understood as a well-formedness requirement on the problem package, in the same category as "sample inputs must pass", or "all test data must be accepted by the input validator." A package that depends on non-deterministic behavior from submissions or validators is a malformed package. The caching optimization is a secondary benefit, not the primary motivation.

On evolutionaryexcerpt: fix it by putting a seed in the .in file and having the interactor initialize its RNG from that. Fully deterministic, reproducible, and in my opinion better designed.

On reusing output vs. verdict: the Hamilton path example (revalidate cached output with different output_validator_flags) only works if reuse is required, not merely allowed. The current "may assume" framing doesn't support that use case, so it's not a valid argument for this PR either way. That's a separate, larger discussion.

On test case equality: belongs in (and is discussed in) #567.

On wording gaps: agreed the text should be extended to cover the output validator, and clarified to apply within a single submission's evaluation. Happy to see a tightening PR. Not this one.

I still don't want to merge this. Should we try getting on a call?

RagnarGrootKoerkamp · 2026-02-27T19:00:25Z

I get the feeling we'll have to settle on agree to disagree, but either way:

A package that depends on non-deterministic behavior from submissions or validators is a malformed package.

Sure, we could say that. But why? We don't need it, and it's often not true in practice.
I understand that this is nice, but it just seems unworkable.

evolutionaryexcerpt

The entire point of the test data there is to not be deterministic. If I make fixed random input, there might be a deterministic submission that works 99% of the time but fails on one specific instance I created. I want teams to be able to just resubmit their code again because it works and just happened to hit a bad input.

if reuse is required

It seems that this is what most of the others want of this feature/discussion though, and if we don't say anything about reuse, we might as well not say anything about determinism?

Either way, the spec should specify the required cache/reuse behavior, and this should not be optional, since this actually gives different results between different judge systems.

niemela · 2026-02-27T19:24:38Z

I understand that this is nice, but it just seems unworkable.

Why would it be unworkable? I really don't get it (but maybe this is the "agree to disagree" part?)

evolutionaryexcerpt

The entire point of the test data there is to not be deterministic. If I make fixed random input, there might be a deterministic submission that works 99% of the time but fails on one specific instance I created. I want teams to be able to just resubmit their code again because it works and just happened to hit a bad input.

But such a problem is completely broken in any kind of setting (such as a contest, which is a extremely common setting for the problem format) where rejudging could happen. Having the expectation that re-submission changes the result just seems fundamentally broken to me.

Wouldn't a much better way to define this problem (if you don't want a single bad input to fail a submission) be to specify n instances and pass the submission if it passes k of them?

Either way, the spec should specify the required cache/reuse behavior, and this should not be optional, since this actually gives different results between different judge systems.

No, not for deterministic submissions.

mzuenni · 2026-03-01T13:36:11Z

Having the expectation that re-submission changes the result just seems fundamentally broken to me.

I would say that is a very reasonable assumption:

if you use something like std::random_device, or time, or pythons hash or...
if the problem statements says that this is the case

No, not for deterministic submissions.

But people do submit non deterministic submissions. The issue does exists and the choice of caching or not just increases the likelihood of mismatches between judging systems in those cases.

RagnarGrootKoerkamp · 2026-03-01T14:09:56Z

What if we have a problem where the intended solution is deterministic, but there may be randomized approaches that we want to ensure either WA or TLE. Then we definitely want to add those randomized/non-deterministic submissions to submissions/{WA,TLE} to ensure that they do indeed not pass (with high probability). Or maybe some randomized solution does pass all cases from time to time and we'd add it to submissions/mixed or submissions/bruteforce.

And fixing a seed is explicitly counterproductive here: we might want to rerun the randomized submission many times to estimate the probability with which it fails.

Sure, I get that having randomized solutions in accepted/ that only are AC 50% of the time is bad. But what if the failure probability is 2^-100? (each cases fails with 50%) or 2^-1000000 (an input fails with 2^-n, eg nwerc 2022 goingincircles)? Surely these can be safely put in submissions/accepted?

Do others (@jsannemo @Tagl @thorehusfeldt ?) agree with @niemela here that all jury submissions must be fully deterministic?

Tagl · 2026-03-01T16:04:39Z

I agree with caching of testcase results being the default desired behavior, but there should be an opt-out for that.

I do not agree that non-determinism should be forbidden.
However, there should be a very low chance of incorrect results from jury submissions and an even lower chance for the output validator if that requires non-determinism.

niemela · 2026-03-01T21:38:48Z

What if we have a problem where the intended solution is deterministic, but there may be randomized approaches that we want to ensure either WA or TLE. Then we definitely want to add those randomized/non-deterministic

Please note that "randomized" and "non-deterministic" are not the same thing here.

submissions to submissions/{WA,TLE} to ensure that they do indeed not pass (with high probability). Or maybe some randomized solution does pass all cases from time to time and we'd add it to submissions/mixed or submissions/bruteforce.

Sure. Set a seed and do that. Perfectly fine and useful.

And fixing a seed is explicitly counterproductive here: we might want to rerun the randomized submission many times to estimate the probability with which it fails.

I strongly disagree. Removing a seed is (arguably) one of the few things that are even easier than adding a seed.

Sure, I get that having randomized solutions in accepted/ that only are AC 50% of the time is bad.

Yes, that would be very bad, and obviously broken. Failing 10% or even 1% of the time would also be bad. Maybe we wouldn't call it obviously broken, but we would definitely call it flaky, and therefore slightly broken.

But what if the failure probability is 2^-100? (each cases fails with 50%) or 2^-1000000 (an input fails with 2^-n, eg nwerc 2022 goingincircles)? Surely these can be safely put in submissions/accepted?

Yes, sure, that is obviously acceptable in practice (although I would still claim that it's technically broken)

Do others (@jsannemo @Tagl @thorehusfeldt ?) agree with @niemela here that all jury submissions must be fully deterministic?

That is not exactly what I'm saying... but it is the spirit of what I'm saying.

Matistjati · 2026-03-01T22:10:32Z

Assuming determinism breaks this problem... https://aprilfools25.kattis.com/contests/qs8u7g/problems/gambling (score=last 2 digits of submission ID)

RagnarGrootKoerkamp · 2026-03-02T00:02:34Z

"randomized" and "non-deterministic" are not the same thing here

ok fair point. But in practice, I want my randomized submissions to be non-deterministic.

All jury submissions must be deterministic

As long as we do not specify the allowed behavior in case submissions are non-deterministic, "the judge may assume that submissions are determinitic" means "submissions must be deterministic" in practice, since otherwise it would be undefined behavior and bad things could happen.

that [2^-1000 failure prob] is obviously acceptable in practice

Ok, but that means that non-deterministic solutions are acceptable in practice. Then can the spec just say that?

Idea for a multi-pass problem where no deterministic solution passes, and this is honestly not very far-fetched:

Input: n=1000
passes: 2
Output: In each pass, print one number up to n. The numbers printed in each pass must be distinct.

Do not assume determinism

5b13d89

niemela requested changes Jan 30, 2026

View reviewed changes

niemela requested changes Feb 27, 2026

View reviewed changes

Conversation

mzuenni commented Jan 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

RagnarGrootKoerkamp commented Jan 29, 2026

Uh oh!

jsannemo commented Jan 29, 2026

Uh oh!

jsannemo commented Jan 29, 2026

Uh oh!

jsannemo commented Jan 29, 2026

Uh oh!

mzuenni commented Jan 29, 2026

Uh oh!

jsannemo commented Jan 29, 2026

Uh oh!

jsannemo commented Jan 29, 2026

Uh oh!

mzuenni commented Jan 29, 2026

Uh oh!

mzuenni commented Jan 29, 2026

Uh oh!

RagnarGrootKoerkamp commented Jan 29, 2026

Uh oh!

RagnarGrootKoerkamp commented Jan 29, 2026

Uh oh!

jsannemo commented Jan 29, 2026

Uh oh!

jsannemo commented Jan 29, 2026

Uh oh!

RagnarGrootKoerkamp commented Jan 29, 2026

Uh oh!

jsannemo commented Jan 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mzuenni commented Jan 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mzuenni commented Jan 29, 2026

Uh oh!

jsannemo commented Jan 29, 2026

Uh oh!

jsannemo commented Jan 29, 2026

Uh oh!

mzuenni commented Jan 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

RagnarGrootKoerkamp commented Jan 30, 2026

Uh oh!

mzuenni commented Jan 30, 2026

Uh oh!

jsannemo commented Jan 30, 2026 via email

Uh oh!

mzuenni commented Jan 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

niemela left a comment

Choose a reason for hiding this comment

Uh oh!

jsannemo commented Jan 30, 2026

Uh oh!

mzuenni commented Jan 30, 2026

Uh oh!

niemela left a comment

Choose a reason for hiding this comment

Uh oh!

mzuenni commented Feb 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

RagnarGrootKoerkamp commented Feb 27, 2026

1. Deterministic submissions

2. Why this assumption

3. How this assumption is used in practice

4. What is reused?

5. When are things reused?

Uh oh!

Tagl commented Feb 27, 2026

Uh oh!

niemela commented Feb 27, 2026

Uh oh!

mzuenni commented Jan 29, 2026 •

edited

Loading

jsannemo commented Jan 29, 2026 •

edited

Loading

mzuenni commented Jan 29, 2026 •

edited

Loading

mzuenni commented Jan 30, 2026 •

edited

Loading

mzuenni commented Jan 30, 2026 •

edited

Loading

mzuenni commented Feb 27, 2026 •

edited

Loading

RagnarGrootKoerkamp commented Mar 1, 2026 •

edited

Loading

RagnarGrootKoerkamp commented Mar 2, 2026 •

edited

Loading