Skip to content

Conversation

@fernando-villalba
Copy link
Collaborator

The API for the Multigres Operator is ready for review, here are some point of interest to take into consideration:

  • Verify that configuration structure of the manifest is true to the way users would use and understand Multigres.
  • What fields should be defaulted if the user was not providing templates or inline configuration?
  • Is the proposed shard/cell/tablegroup structure sufficient for a first iteration, or is it overly complex?
  • Decide whether the operator should reconcile labels and use them to keep track of its resources or we should add spec fields that do this.
  • Potentially split TopoServer definition and its template usage as its deployment model as well as image definitions are quite different from other Multigres components; this may be something we can tackle together with canary friendly deployment control. (Ref)

fernando-villalba and others added 30 commits October 27, 2025 11:18
Signed-off-by: Fernando Villalba <[email protected]>
Signed-off-by: Fernando Villalba <[email protected]>
Comment on lines 629 to 631
# Images passed down from global configuration
images:
multigateway: "multigres/multigres:latest"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cell can have topo server as well, so there could be another image definition here if we want to manage at cell level. However, we did discuss topo server would have its own image definition (covered as "Option 3" just below), and in that case, it makes this images stanza strange to land at the top level. I think it would be much clearer if it went under multigateway stanza considering topoServer has got its own there.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see where you're coming from regarding symmetry with topoServer, but there are three reasons to keep images at the top level here:

  1. Shared Types: multigateway uses the shared StatelessSpec struct (same as MultiOrch). Adding image inside it would require modifying that shared struct (which shouldn't carry specific image tags in templates) or creating a custom struct just for Gateway, which reduces code reuse.
  2. Consistency: This matches the pattern in Shard and TableGroup CRs, which also separate the image versions from the configuration specs.
  3. Controller Logic: Since this is a Child CR written by the Parent Controller, it's cleaner to have the parent push all resolved images into a dedicated images block rather than injecting them into various sub-structs."

Copy link
Member

@rytswd rytswd Dec 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I disagree with al of the points.

Re: Shared Types

Code reuse should not be the reason for a design decision like this. It makes code management easy when you can have a common logic handled by a shared struct, but when the fundamental requirements are different, we should allow diverging from the common setup.

Re: Consistency

Consistency is already lost because of topo server not following the same patter. I think the current image definitions of TableGroup and Shard make sense (although a bit unusual for CNCF projects), because they handle many resources at once (i.e. N number of TableGroups and M number of Shards).

Re: Controller Logic

The whole point of Child CR introduction is so that there is a "realised resource representation" in the cluster as a form of CR, rather than direct StatefulSet / Deployment. It doesn't matter whether that adds more controller code or not, making a simple operator isn't the goal. Replicating the image definition out to each of child CR would be simple anyways, and from the operation perspective, easier to find when each component has lengthy definitions.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the comparison to TopoServer to all other integral Multigres components as a departure from the norm is not entirely accurate, because TopoServer manages an external component, it is an optional, detachated component from the multigres cluster.

TopoServer nests the image only because it's a polymorphic type (Managed vs. External). MultiGateway is not; it is always a container. We shouldn't force standard components to adopt the awkward structure of edge-case components.

In this architecture (at least on this iteration), versioning is a global constraint pushed down from the Parent. Structurally separating images (the version context) from multigateway (the runtime config) accurately reflects that the Controller injects the version, while the Template dictates the config.

If we change this, Cell becomes the only Child CR where the primary application image is buried inside a struct, while Shard and TableGroup have them at the top. If you move the image inside spec.multigateway, you structurally imply that the Gateway could have a different version than the rest of the system if one were to edit that field. By keeping images grouped at the top level of every Child CR (Cell, Shard, TableGroup), you enforce the mental model that Versioning is global.

For all these reasons, I wouldn't change it. But if based on all of the above you still think we should, then I can disagree and commit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regardless of whether we have image definition at top level or per spec.multigateway, the image details would flow down from MultigresCluster CR, which defines the images at the top level.

Because of the tight control we want to add for image versions used, I can understand the top level image list for MultigresCluster CR. However, child CRs are not meant to be manually edited, but still extremely useful for operation teams to understand the deployment setup. Especially if we were to provide canary and more complex version strategies, the top level definitions won't be sufficient. Also, other than Vitess, I personally don't know of any CNCF projects that define the image at the top level only. Those two are the main reasons I'm not comfortable with the current design.

You mention Cell being the only CR, but because TableGroup and Shard are structured exactly the same way, you can argue it's just TableGroup (which also has Shards inside) manages their images differently. This makes sense to me, because each TableGroup and Shard may not be related to each other, but as a part of MultigresCluster, there is a need of managing all of the versions together.

Cell CR is much less dynamic than TableGroup + Shards.


Considering the future version management needs, I think we should have it per resource. But at the same time, I am absolutely fine with having the top level definition similar to MultigresCluster, TableGroup, and Shard, provided that it doesn't say images. The fact that it says images tells me we are trying to generalise too much, where there is actually only a single image that can be configured at the top level.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rytswd I don't have a new technical reason to give you beyond what I laid out in the previous comments (specifically about Shared Types and Global Versioning Consistency across Shards/TableGroups and how other operators use this pattern too).

I understand your point about the Pod-like syntax, but I still believe that keeping the 'Global Version vs. Local Config' separation is the better architectural choice for the Operator as a whole, rather than optimizing Cell in isolation.

I feel we are just trading preferences at this point. Since I don't have any new arguments to add, I’ll leave the final call to you: we can either merge as-is (my preference) or I can implement the change to unblock the PR. Let me know which way you want to go.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I understand where you are coming from for this point, and although ultimately I disagree with this approach as I think Kubernetes is more about giving full control with idiomatic syntax, I'm fine to adapt to this approach.

I just got one last question, and if I can understand your point better, I'm OK merging as is:

I still believe that keeping the 'Global Version vs. Local Config' separation is the better architectural choice for the Operator as a whole, rather than optimizing Cell in isolation.

Could you clarify how this should apply to child CRs, as I thought this was a benefit for top level definition for "better management" rather about "ease of review"? If it is only for consistency and no other clear benefit, that's fine, too.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for being flexible on this. To answer your question—yes, consistency is the primary driver, but there are two specific technical benefit for Child CRs as well:

  1. Since multigateway uses the shared StatelessSpec (same as MultiOrch), embedding the image inside it would force us to create a unique CellGatewaySpec struct just for the Cell CR (to avoid polluting StatelessSpec used elsewhere). Keeping the image at the top level allows us to reuse the generic struct cleanly. I know you mentioned you did not see this as a benefit, and I agree on its own is not a reason to do this, but I also see simpler code as a win too.
  2. For SREs looking at Child CRs (the realized state), keeping the image separate from the multigateway config visually reinforces that the version is a constraint injected by the Parent Controller, while the config (replicas/resources) is derived from the Template. In essence, the design spec structure conveys the intent.

So while consistency is the main motivation, it also keeps the Go types simpler and the 'Source of Truth' logic clearer in the realized resources.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification. I can understand reusing the same common spec for simplicity, though I would be worried about focusing on simplification too much, and end up making it more difficult to manage. For this one, the minimal spec is something we can iterate over from, and I can understand your point.

I don't quite understand your point about the second one, though:

keeping the image separate from the multigateway config visually reinforces that the version is a constraint injected by the Parent Controller

It is just a separate field, and you wouldn't be able to deduce where the input is coming from by looking at the structure. The declarative nature of YAML definitions are always intent driven, and having extra field just so that it "stands out" from other fields feels very strange and non-idiomatic.


I still disagree with this approach, because there is a common image definition spec.component.image we can simply adopt. The only downside of having spec.multigateway.image would be diversion from other CRs, and we could argue that those CRs may need to be updated to be idiomatic as well.

As we discussed long while, and it is true that this can be updated later on (though such update would be API breaking changes and we would need to bump the version), I'm leaving this open and still going ahead with the approval.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@teaglebuilt
It's debatable whether we would want to support all the fields or not. We will try to support most common fields, and as we haven't delved too much into it, there could be many that are still missing implementation. We will iterate over the spec and impl to support what's necessary for common use cases.

However, there are some specific fields that are only necessary in very special cases, and if you require such control, you may actually want to configure your Multigres cluster without an external solution like an operator. So our idea is to start with minimal spec, and expand as necessary. It may come down to open up all the Pod related control, but at the moment, that's not going to be the case for initial implementation.

Copy link
Member

@rytswd rytswd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went through all the lines once again. Very minor points only for this round. There are 2 more discussion points left to be confirmed.

@teaglebuilt
Copy link

Would we support shards across different clusters since we have plans to support multi clustering?

@rytswd
Copy link
Member

rytswd commented Dec 15, 2025

@teaglebuilt
Could you clarify what "clusters" you mean in this context -- Multigres cluster, or Kubernetes cluster? I am assuming that you are talking about "multicluster Kubernetes", and for that, we will need further investigation on what makes most sense in terms of deployment strategy. We could technically consider "multicluster Kubernetes" as having more cells (i.e. zones / regions), but that really depends on the networking model deployed with multicluster Kubernetes.

We will need to go much more into the design specifically for this -- I see you commented on #39, and that probably needs to evolve into more thorough investigation and final design on multicluster support.

Copy link
Member

@rytswd rytswd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned in this comment, I'm approving this spec. There is one part I do not agree, but we can leave the current state as final for now.

Thanks for the thorough work here! This sets a golden standard on how thorough documentation can be (and sometimes needs to be), and I can attest that it's been already helpful with controller logic implementation!

@fernando-villalba fernando-villalba merged commit 2589a91 into main Dec 15, 2025
2 checks passed
@fernando-villalba fernando-villalba deleted the multigres-cluster-cr-design branch December 15, 2025 08:55
@teaglebuilt
Copy link

@teaglebuilt Could you clarify what "clusters" you mean in this context -- Multigres cluster, or Kubernetes cluster? I am assuming that you are talking about "multicluster Kubernetes", and for that, we will need further investigation on what makes most sense in terms of deployment strategy. We could technically consider "multicluster Kubernetes" as having more cells (i.e. zones / regions), but that really depends on the networking model deployed with multicluster Kubernetes.

We will need to go much more into the design specifically for this -- I see you commented on #39, and that probably needs to evolve into more thorough investigation and final design on multicluster support.

Yeah that was kind of a an estranged thought i had after looking through the future specs and learning to understand the multigres design.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants