Conversation
|
Some thoughts:
So anyway, I'd say 'testcase' should never merge files with different basenames as long as we don't have a very good use case where this is helpful. |
Input validator flags possibly, output validator flags usually not. The big benefit is reducing judging effort by allowing e.g. test cases in a group with N <= 100 to be reused in a group with N <= 100,000, where all inputs to the process of judging the test case are identical. This can make a big difference (>3x wall clock time for some problems) in judging time for e.g. IOI problems. |
Allowing an assumption to be made does not imply forcing the assumption to be made. |
An ugly workaround is a dummy output validator flag per case. |
This makes it even worse, because neither the problem setter nor the participant knows what choice the judging system made... but it impacts them. |
AFAIK Kattis already performs deduplication of (input, output validator flags) |
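Such a deduplication could be sketched roughly as follows (a hypothetical illustration only, not Kattis's actual implementation; all names are invented): cache the verdict keyed on the pair (hash of input contents, output validator flags), so a byte-identical test case judged under identical flags is evaluated only once within a submission.

```python
import hashlib

# Hypothetical sketch of (input, output validator flags) deduplication.
# Not Kattis's actual code; names are illustrative.
_verdict_cache = {}

def judge_case(input_data: bytes, output_validator_flags: tuple, run):
    """Judge one test case, reusing the verdict for an identical
    (input, flags) pair within the same submission."""
    key = (hashlib.sha256(input_data).hexdigest(), output_validator_flags)
    if key not in _verdict_cache:
        _verdict_cache[key] = run(input_data, output_validator_flags)
    return _verdict_cache[key]
```

Under this sketch, two cases with the same input but different output validator flags still get separate runs, which matches the deduplication key described above.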
Well, you can certainly force a non-deduplication by the suggested workarounds, but you can't allow the optimization without letting judging systems perform it. |
I would say:
With the current spec there is NO way to fix https://open.kattis.com/problems/evolutionaryexcerpt |
then we need a different solution for that. But this is not it. It breaks older problems and makes certain types of problems that existed in the past impossible. |
Yeah ok, in this case I do get that you want to reuse results. Or did we already add some metadata to the spec saying that groups depend on other groups, and are only allowed to run if the groups/cases they depend on pass?
Yes that'd be super annoying.
(I'm assuming you meant |
We have |
Ah, I haven't kept up with the development recently. Do note that it's very common for cases to be reused in ways that are not strict subsets. Where the second two groups reuse just subsets of the first. |
I'm not sure I agree with that quantifier: I can count on one hand the number of problems I've written where I want to enforce this behavior.
No, output validator flags are what affect judging, input validator flags are irrelevant after install, no? |
Hmm yeah that's currently not supported, although I think it easily could be?
Yeah I haven't needed it much, but making a list of one
oh right indeed, my bad. I thought you were suggesting that Kattis only runs once ignoring output validator flags, and then only checks the team output multiple times against different output_validator_flags. But that is not the case then. |
Can't you add dummy output validator flags per test case? That's how I would assume it's implemented now, since the current spec formulation is how Kattis does it today. I'd argue the legacy spec allows it specifically because it's not explicit. :)
I.e., this is clearly not true, since Kattis today works under this assumption (unless I'm mistaken, @niemela ) |
No. Even the exact same submission should be rejudged since the input is truly random every time. And with the current formulation a judging system could decide that if the submission is identical it would not need to be rerun.
no, that's a false conclusion. This means that Kattis today has a bug, since it clearly violates what's written in the statement of that problem as hosted on Kattis. (In other words, either the statement or the implementation is wrong, but both are Kattis' responsibility here?)
I would argue that a judging system is not allowed to change the judging process in any way that is observable for a user, and this is very well observable. |
|
Also as @RagnarGrootKoerkamp pointed out, the following sentence has very weird consequences.
It allows a judging system to run the same submission on the same input multiple times and pick the "worst" verdict? For a deterministic submission this makes no difference and therefore should be fine? In general we should not assume things about submissions that are not necessarily true. And we should not allow a judging system to make decisions that influence the verdict of a submission / are observable for participants. The arguments in favor of this (that I have read so far) also show that you do not actually want to reuse a test case, but rather the outcome (verdict) of a submission on a test case. If you want this, you should do it directly, and not in some hacky way by reusing input files. |
|
I see what you're saying, but I just don't think it's a very big problem, and in particular not enough to outweigh the benefit of the sameness of test cases implicitly resulting in the same verdict. Specifically for randomized solutions, it's trivial in most languages, and especially in all languages at e.g. the ICPC and the IOI, to derandomize: you fix a seed, and if you assume your solution passes e.g. 95% of submission attempts, you submit with two different seeds. In fact, repeating your test case is something I as a jury member would strongly discourage: it's easily defeated by selecting as seed a hash of the input.
I can kind of buy the point about knowledge of this behavior benefiting those who know of it. At e.g. the IOI, the rules were very clear that it's your responsibility to make your solution deterministic. However, I would argue this is always the case, and that the implications of not assuming determinism are worse. At any contest, your solution may be rejudged at the discretion of the judges for a number of reasons: a discovered hardware problem, invalid test data, etc. As such, any nondeterministic solution could change its verdict on a rejudgment. Making problems that explicitly count on the non-determinism of solutions, rather than requiring determinism and informing contestants of this as e.g. the IOI does in its rules, suddenly makes it such that your verdict might change on a rejudge. I think it's deeply problematic for this to be the case for how the problem is expected to be solved. |
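The "seed = hash of the input" trick mentioned above can be made concrete: a randomized submission that derives its seed from the input bytes behaves identically on byte-identical inputs, so repeating a test case gains the jury nothing (an illustrative Python sketch; the actual algorithm is a stand-in):

```python
import hashlib
import random

def solve(input_data: str) -> int:
    # Seed the RNG from the input itself: two runs on byte-identical
    # inputs then make exactly the same "random" choices, so repeating
    # a test case cannot sample a different outcome.
    seed = int.from_bytes(hashlib.sha256(input_data.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    # A real submission would run its randomized algorithm with `rng`;
    # here a single draw stands in for that.
    return rng.randrange(10**9)
```

Different inputs still get effectively independent randomness; only exact repeats of an input are pinned down.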
Of course it's not a false conclusion? It doesn't "break older problems" - at most, older problems are today relying on unspecified behavior that makes them broken. Defining undefined behavior is not breaking backwards compatibility.
I don't think anything in the old spec guarantees any sources of randomness being random: a sandbox may for example always return a fixed timestamp, always return the same /dev/random output etc. |
I disagree :)
I don't see any benefit of this. As mentioned before, it seems like you want to express something like "test group
That is fine for them. The IOI can add whatever rules they would like. We, on the other hand, should not add any rules or any unnecessary restrictions.
This is not true for ICPC?
Yes, but what are you arguing here? You do the rejudging because you expect a deterministic solution to get a new verdict... Obviously any solution could get a different verdict here?!
This is beside the point. If the competition has such a rule, then the judging system can make the assumption. But again, we should not add such assumptions. |
So actually in our randomized-input problems, we use the non-determinism of the interactor, and very much rely on this. So also here, rejudging will be broken, and if we can't even require our own code to be deterministic, it probably doesn't add much to require that from submissions. Also, there is stuff like
So if you're requiring deterministic output, you're basically forcing submissions to avoid a bunch of common language features, which seems completely beside the point.
Regarding rejudging: generally, if something is accepted once it should remain accepted, and there's not much one can do about it anyway. The other case is when a rejudged WA submission becomes AC only due to randomness. But in that case you still have the option to manually run it a few more times and/or to just not apply the rejudging. |
Yeah so actually, even if we would accept the former, the latter is still wrong, since there is no word about the validator being deterministic... |
|
And with the current formulation a judging system could decide that if
the submission is identical it would not need to be rerun.
Yes, and the way you should rerun that is to change your random seed. I do not think that problems with test data generated randomly for each new submission are common enough to be what should dictate this. And to be honest, I'm not sure I'm completely sold on the idea of fully random test data either. If I made such a problem, I'd request the solution to give me a seed instead (that I e.g. xor with a per-test-case seed if I wanted multiple random test cases). That gives you both the behaviour you want and allows assuming determinism.
So if you're requiring deterministic output
That's not, at least according to me, the point, nor what the text is doing. It's about allowing the judging system to *assume* deterministic output. Clearly it can never *require* this. As you say, language features or bugs can introduce unintended non-determinism. The reason that the IOI has in its rules that solutions must be deterministic is not to forbid randomized solutions: it's to make a fact of the judging process clear, which is that if your submission has *unintended non-determinism*, your verdict is not guaranteed. I argue that unintended non-determinism is a bug, and that you should not be guaranteed any verdict in that case.
Regarding rejudging: generally if something is accepted once it should remain accepted
I mean, that's an opinion just as valid as mine that if you're non-deterministic you might not always be accepted. :-)
You do the rejudging because you expect a deterministic solution to get a new verdict... [snip] And I want to add the rejudging should probably only happen to cases where this could happen...
You mean that the judge should skip rejudging of test cases that didn't change, because the submissions can be assumed to be deterministic on the other cases? ;)
That is true (at least if the person who wrote the code did not make it intentionally hard...) but also irrelevant. A derandomized solution is a different solution and can get a different verdict.
I think it's totally relevant, since it clearly shows that you as a problem author don't gain anything by e.g. repeating a test case in the hopes of having it run multiple times: it's trivial to make your randomized solution be random only over different test cases rather than over each instance of the same test case, which really is the argument that was used most for why a problem might want non-determinism (in addition to the random-testdata one, which I think is bad practice and should instead be handled by the validator and submission together seeding a generator).
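The seed-exchange scheme described above might look roughly like this (a hypothetical sketch; the function and parameter names are invented for illustration): the submission supplies a seed, the jury xors it with a fixed per-test-case seed, and the combined value drives the generator, giving data that is random over test cases yet reproducible for a given (submission seed, test case) pair.

```python
import random

def make_instance(submission_seed: int, per_case_seed: int, n: int) -> list[int]:
    """Generate the random test instance from the combined seed.

    XOR-ing the two seeds means the submission controls its own source
    of randomness, while identical reruns of the same submission on the
    same test case see identical data, so determinism can still be
    assumed by the judging system.
    """
    rng = random.Random(submission_seed ^ per_case_seed)
    return [rng.randrange(n) for _ in range(n)]
```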
…On Fri, 30 Jan 2026, 03:24, mzuenni left a comment (Kattis/problem-package-format#575):
Judge systems may assume that the result of running a program on a test case is deterministic.
[...]
The assumption of determinism means that a judge system could choose to reuse the result of a previous run, or to re-run the equivalent test case.
Yeah so actually, even if we would accept the former, the latter is still wrong, since there is no word about the validator being deterministic...
|
First, it would not need to, but with the current formulation we would allow this, which feels very wrong.
Very much no. For me, the judging process is used to generate a verdict according to the "verdict distribution" of the (randomized) submission. That means for each test case the submission is sampled once, and that is aggregated. Obviously a judge can still rejudge whatever they want; that is the right of a judge. But we only describe a problem package format here. A data format. IMO we should not write these assumptions about the submissions at all? |
niemela
left a comment
I'm strongly against removing this. We had a long discussion about this back when...
...maybe we should schedule a call?
|
I think we're not going to find agreement on this @mzuenni , but I want to point out that my view on
Is different, but I'm not sure you agree with yourself on this either. There are three options:
I think most of your arguments for why this assumption should not be allowed very much mean we must explicitly forbid it. It is not out of place for the PPF to dictate this: if parts of the judging process dictate how you get a verdict, we must specify them for a problem to mean the same thing across judges. And in fact I will now in some sense disagree with myself in saying that perhaps judges shouldn't be free to either make or not make the assumption? (Although I still would prefer the defined behavior to be that judges should evaluate a single input only once, and as I've explained, no, you really can't and in my opinion shouldn't use multiple identical inputs to sample an output distribution: it's trivial to work around in submissions.) Anyways, I think @niemela's idea of moving this to a live discussion is the right call. |
I am also not in favor of identical test cases (and BAPCtools warns for identical inputs unless silenced), but this happened in the past. Also, the core issue is much larger. Not only is the assumption currently in the spec not strong enough to actually allow caching, but right now it would also allow a judging system to cache stuff across submissions? (For example, if the submitted files are identical.) IMO this is not intended and should never be allowed.
Yeah, I am happy with that |
niemela
left a comment
I'm strongly against this change.
This was not an oversight, we did this on purpose, after a long discussion.
|
I think this at least has to be reconsidered/rediscussed/rephrased. The current formulation makes it technically unsuitable for ICPC contests / forbids any kind of problem where non-deterministic submissions are expected (the reasons are technical, but it still shows that something is definitely wrong here)
|
Alright, so @mzuenni and I had some more discussion. There's a few things going on:
1. Deterministic submissions
The spec currently writes:
2. Why this assumption
There seem to be a few reasons for wanting such an assumption:
I assume points 3, 4, and 5 are currently not 'exploited' by any implementation, as results are not cached across submissions anyway, and non-deterministic solutions could already get a new verdict across resubmissions (which surely happens in practice), across rejudgings, and across judge systems. Point 2 can now be handled, at least for groups, by
Point 1 seems reasonable, but requires very careful definition. There seems to be some confusion about what reusing "the result of a previous run" means: it can either be the verdict, or the output of the team submission (but only for 'standard' input-output problems).
3. How this assumption is used in practice
4. What is reused?
The Hamilton path problem mentioned in the other thread wants to reuse the output, while the default assumption to me would be to only reuse the verdict/score. What does Kattis do here?
5. When are things reused?
|
This is true for non-interactive output validators too. |
|
I think framing this as "a permission granted to judge systems to cache results" has muddied the discussion. The assumption is better understood as a well-formedness requirement on the problem package, in the same category as "sample inputs must pass" or "all test data must be accepted by the input validator." A package that depends on non-deterministic behavior from submissions or validators is a malformed package. The caching optimization is a secondary benefit, not the primary motivation.
On evolutionaryexcerpt: fix it by putting a seed in the
On reusing output vs. verdict: the Hamilton path example (revalidate cached output with different
On test case equality: belongs in (and is discussed in) #567.
On wording gaps: agreed the text should be extended to cover the output validator, and clarified to apply within a single submission's evaluation. Happy to see a tightening PR. Not this one. I still don't want to merge this. Should we try getting on a call? |
|
I get the feeling we'll have to settle on agree to disagree, but either way:
Sure, we could say that. But why? We don't need it, and it's often not true in practice.
The entire point of the test data there is to not be deterministic. If I make fixed random input, there might be a deterministic submission that works 99% of the time but fails on one specific instance I created. I want teams to be able to just resubmit their code again because it works and just happened to hit a bad input.
It seems that this is what most of the others want of this feature/discussion though, and if we don't say anything about reuse, we might as well not say anything about determinism? Either way, the spec should specify the required cache/reuse behavior, and this should not be optional, since this actually gives different results between different judge systems. |
Why would it be unworkable? I really don't get it (but maybe this is the "agree to disagree" part?)
But such a problem is completely broken in any kind of setting (such as a contest, which is an extremely common setting for the problem format) where rejudging could happen. Having the expectation that re-submission changes the result just seems fundamentally broken to me. Wouldn't a much better way to define this problem (if you don't want a single bad input to fail a submission) be to specify
No, not for deterministic submissions. |
I would say that is a very reasonable assumption:
But people do submit non-deterministic submissions. The issue does exist, and the choice of caching or not just increases the likelihood of mismatches between judging systems in those cases. |
|
What if we have a problem where the intended solution is deterministic, but there may be randomized approaches that we want to ensure either WA or TLE. Then we definitely want to add those randomized/non-deterministic submissions to
And fixing a seed is explicitly counterproductive here: we might want to rerun the randomized submission many times to estimate the probability with which it fails.
Sure, I get that having randomized solutions in
Do others (@jsannemo @Tagl @thorehusfeldt ?) agree with @niemela here that all jury submissions must be fully deterministic? |
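Estimating the failure probability of such a randomized rejected submission by rerunning it many times, as described above, is straightforward to sketch (a hypothetical helper, for illustration only):

```python
def estimate_failure_rate(run_once, trials: int = 200) -> float:
    """Run a non-deterministic submission `trials` times on a test case
    and return the observed fraction of failing runs. `run_once` should
    return True when the run succeeds (e.g. AC) and False when it fails."""
    failures = sum(1 for _ in range(trials) if not run_once())
    return failures / trials
```

This is exactly the workflow that caching the first verdict per test case would make impossible.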
|
I agree with caching of testcase results being the default desired behavior, but there should be an opt-out for that. I do not agree that non-determinism should be forbidden. |
Please note that "randomized" and "non-deterministic" are not the same thing here.
Sure. Set a seed and do that. Perfectly fine and useful.
I strongly disagree. Removing a seed is (arguably) one of the few things that are even easier than adding a seed.
Yes, that would be very bad, and obviously broken. Failing 10% or even 1% of the time would also be bad. Maybe we wouldn't call it obviously broken, but we would definitely call it flaky, and therefore slightly broken.
Yes, sure, that is obviously acceptable in practice (although I would still claim that it's technically broken)
That is not exactly what I'm saying... but it is the spirit of what I'm saying. |
|
Assuming determinism breaks this problem... https://aprilfools25.kattis.com/contests/qs8u7g/problems/gambling (score=last 2 digits of submission ID) |
ok fair point. But in practice, I want my randomized submissions to be non-deterministic.
As long as we do not specify the allowed behavior in case submissions are non-deterministic, "the judge may assume that submissions are deterministic" means "submissions must be deterministic" in practice, since otherwise it would be undefined behavior and bad things could happen.
Ok, but that means that non-deterministic solutions are acceptable in practice. Then can the spec just say that? Idea for a multi-pass problem where no deterministic solution passes, and this is honestly not very far-fetched: Input: |
The assumption of determinism was added to the new spec in af7ac80, but I think it is a mistake. It breaks backwards compatibility and also breaks the assumptions of all kinds of user groups.
First off: non-deterministic submissions do exist and are sometimes required. The assumption of determinism clearly breaks such randomized problems in one way or another.