Skip to content

fix: RecordPreparer incorrectly injected __typename into embedded concrete-typed fields#1109

Open
marcdaniels-toast wants to merge 2 commits intoblock:mainfrom
marcdaniels-toast:fix/record-preparer-typename-injection
Open

fix: RecordPreparer incorrectly injected __typename into embedded concrete-typed fields#1109
marcdaniels-toast wants to merge 2 commits intoblock:mainfrom
marcdaniels-toast:fix/record-preparer-typename-injection

Conversation

@marcdaniels-toast
Copy link
Copy Markdown
Contributor

Summary

RecordPreparer was incorrectly injecting __typename into embedded fields typed as a concrete object type. The bug arose because @types_requiring_typename — a set that includes concrete types whose shared index mapping contains __typename (e.g. Person indexed alongside Company in a union index) — was being used both at the root level (correct) and during embedded field processing (incorrect).

When a type like Person is indexed in a shared union index, it legitimately needs __typename at the root document level so the datastore can distinguish subtypes. However, when Person appears as the type of an embedded field (e.g. Manufacturer.ceo), it is a concrete type and __typename is neither needed nor valid there. Injecting it caused a KeyError in CountAccumulator when traversing the prepared document, since __typename is not present in the embedded object's index mapping properties.

Fix: Split @types_requiring_typename into two sets:

  • @types_with_typename_required_in_json_schema — types where the JSON schema marks __typename as required (i.e. abstract types that can hold multiple subtypes). Used when deciding whether to preserve __typename on embedded fields during prepare_value_for_indexing.
  • @types_requiring_typename — a superset that also includes concrete types inheriting a shared index. Used only at the root level in prepare_for_index to inject __typename when absent.

The test schema is also updated to give Person and Company their own dedicated indexes (people and companies), making them types that are both directly indexed and embedded in another indexed type (Manufacturer.ceo). This exercises the previously untested case Myron noted in the issue. Once index inheritance is fully implemented, the index can be moved up to the NamedInventor interface, at which point the acceptance test will also directly exercise the bug fix path.

Test Plan

  • Unit test (record_preparer_spec.rb): A new example directly demonstrates the bug and fix. It builds a schema where Person shares a union index (requiring __typename at root) and is also embedded as Manufacturer.ceo (where __typename must be absent). The test verifies that preparing a root Person record injects __typename, while preparing a Manufacturer record with an embedded Person ceo does not.

  • Acceptance test (search_spec.rb): The existing "supports fetching interface fields" test is extended to index a Manufacturer with a ceo: Person and query it end-to-end via named_entities. This confirms the full round-trip works correctly for a type that is both a root indexed type and embedded in another indexed type.

Fixes #1097

…crete-typed fields

When a concrete type (e.g. Person) was indexed in a shared union index, it was added
to @types_requiring_typename. The old prepare_value_for_indexing used that set for both
root and embedded contexts, causing __typename to be injected into embedded fields like
Manufacturer.ceo even though no __typename is needed or expected there.

Fix by splitting the set into @types_with_typename_required_in_json_schema (used for
embedded processing) and @types_requiring_typename (superset, used only at root level
in prepare_for_index).

Also adds Person and Company as directly indexed types in the test schema, exercising
the case where a type is both a root indexed type and embedded in another indexed type.

Generated with Claude Code
…company

Now that Person and Company have their own indexes, the validation in
common_project_specs picks them up and tries to validate them as indexed
records -- which they aren't. They're embedded-only factories used for
union fields (Widget.inventor) and concrete embedded fields (Manufacturer.ceo).

Rename them to :embedded_person/:embedded_company to make the intent clear,
and add them to ignored_factories in common_project_specs_spec.rb.

Generated with Claude Code
# have at most 1 widget, so we use `components.shift` here to ensure that a component isn't re-assigned.
widget_components = Array.new(rand(3)) { components.shift }.compact
FactoryBot.build(:widget, components: widget_components)
FactoryBot.build(:widget, components: widget_components, inventor: inventors.sample)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
FactoryBot.build(:widget, components: widget_components, inventor: inventors.sample)
FactoryBot.build(:widget, components: widget_components, inventor: Faker::Base.sample(inventors))

Our faker usage in tests is designed to be deterministic:

# Make faker usage deterministic, based on the `full_description` string of the
# RSpec example. `String#sum` is being used purely for speed, as we do not
# need a particularly even distribution in random number seeds.
Faker::Config.random = Random.new(ex.full_description.sum)

...but using Array#sample (which uses the standard RNG) rather than Faker::Base.sample (which uses the RNG we assign based on the test description) defeats that.

# Superset of the above: also includes concrete types that inherit an index from an abstract
# supertype (e.g. `Person` indexed in the `inventors` index). Used only at the root level
# in `prepare_for_index` to inject `__typename` into root documents when needed.
@types_requiring_typename = @types_with_typename_required_in_json_schema | types_with_typename_in_index_mapping
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your bug report (and this PR) has helped me realize something: it's fundamentally unsound to model "is __typename required?" at the type level. As you've demonstrated, whether or not __typename is required is not just a property of the type--it's also contextual. A type that is both embedded within another indexed type and indexed on its own may have require __typename in one context but not another. So I'd like to move away from modeling it this way.

I think a better source of truth is the index mapping: if we're at a position that has a __typename field in the mapping, we should populate it. If we're at a position that has no __typename field in the mapping, then we should omit it.

Claude helped me validate this hypthesis. Here's the diff that it came up with:

diff --git a/elasticgraph-indexer/lib/elastic_graph/indexer/operation/update.rb b/elasticgraph-indexer/lib/elastic_graph/indexer/operation/update.rb
index 56bcf116..a2503756 100644
--- a/elasticgraph-indexer/lib/elastic_graph/indexer/operation/update.rb
+++ b/elasticgraph-indexer/lib/elastic_graph/indexer/operation/update.rb
@@ -27,7 +27,11 @@ module ElasticGraph
           update_target:,
           destination_index_mapping:
         )
-          prepared_record = record_preparer.prepare_for_index(event["type"], event["record"] || {"id" => event["id"]})
+          prepared_record = record_preparer.prepare_for_index(
+            event["type"],
+            event["record"] || {"id" => event["id"]},
+            mapping_properties: destination_index_mapping.dig("properties")
+          )
 
           Support::HashUtil
             .fetch_leaf_values_at_path(prepared_record, update_target.id_source.split("."))
diff --git a/elasticgraph-indexer/lib/elastic_graph/indexer/record_preparer.rb b/elasticgraph-indexer/lib/elastic_graph/indexer/record_preparer.rb
index 347b4fc2..749a903f 100644
--- a/elasticgraph-indexer/lib/elastic_graph/indexer/record_preparer.rb
+++ b/elasticgraph-indexer/lib/elastic_graph/indexer/record_preparer.rb
@@ -21,18 +21,10 @@ module ElasticGraph
             hash[type_name] = scalar_types_by_name[type_name]&.load_indexing_preparer&.extension_class
           end # : ::Hash[::String, SchemaArtifacts::RuntimeMetadata::extensionClass?]
 
-          # Derive the set of types that have __typename in their index mappings to supplement the set of types that
-          # explicitly require __typename in the JSON schema
-          mappings = schema_artifacts.index_mappings_by_index_def_name
-          types_with_typename_in_index_mapping = schema_artifacts.runtime_metadata.object_types_by_name.filter_map do |type_name, meta|
-            type_name if meta.index_definition_names.any? { |idx| mappings.dig(idx, "properties", "__typename") }
-          end.to_set
-
           @preparers_by_json_schema_version = ::Hash.new do |hash, version|
             hash[version] = RecordPreparer.new(
               indexing_preparer_by_scalar_type_name,
-              build_type_metas_from(@schema_artifacts.json_schemas_for(version)),
-              types_with_typename_in_index_mapping
+              build_type_metas_from(@schema_artifacts.json_schemas_for(version))
             )
           end
         end
@@ -58,10 +50,6 @@ module ElasticGraph
               {} # : ::Hash[::String, untyped]
             end # : ::Hash[::String, untyped]
 
-            required_fields = type_def.fetch("required") do
-              [] # : ::Array[::String]
-            end # : ::Array[::String]
-
             eg_meta_by_field_name = properties.filter_map do |prop_name, prop|
               eg_meta = prop["ElasticGraph"]
               [prop_name, eg_meta] if eg_meta
@@ -69,7 +57,6 @@ module ElasticGraph
 
             TypeMetadata.new(
               name: type,
-              requires_typename: required_fields.include?("__typename"),
               eg_meta_by_field_name: eg_meta_by_field_name
             )
           end
@@ -84,27 +71,16 @@ module ElasticGraph
       # still build operations for it, and the operations require a `RecordPreparer`, but we do
       # not send them to the datastore.
       module Identity
-        def self.prepare_for_index(type_name, record)
+        def self.prepare_for_index(type_name, record, mapping_properties: nil)
           record
         end
       end
 
-      def initialize(indexing_preparer_by_scalar_type_name, type_metas, types_with_typename_in_index_mapping)
+      def initialize(indexing_preparer_by_scalar_type_name, type_metas)
         @indexing_preparer_by_scalar_type_name = indexing_preparer_by_scalar_type_name
         @eg_meta_by_field_name_by_concrete_type = type_metas.to_h do |meta|
           [meta.name, meta.eg_meta_by_field_name]
         end
-
-        # Types where the JSON schema has `__typename` in `required` -- meaning the field can hold
-        # multiple subtypes and `__typename` is needed to identify which one.
-        @types_with_typename_required_in_json_schema = type_metas.filter_map do |meta|
-          meta.name if meta.requires_typename
-        end.to_set
-
-        # Superset of the above: also includes concrete types that inherit an index from an abstract
-        # supertype (e.g. `Person` indexed in the `inventors` index). Used only at the root level
-        # in `prepare_for_index` to inject `__typename` into root documents when needed.
-        @types_requiring_typename = @types_with_typename_required_in_json_schema | types_with_typename_in_index_mapping
       end
 
       # Prepares the given payload for being indexed into the named index.
@@ -121,18 +97,13 @@ module ElasticGraph
       #
       # Note: this method does not mutate the given `record`. Instead it returns a
       # copy with any updates applied to it.
-      def prepare_for_index(type_name, record)
-        result = prepare_value_for_indexing(record, type_name) # : ::Hash[::String, untyped]
-        # Add __typename if required but absent (e.g. for a mixed-type index).
-        if result.is_a?(::Hash) && @types_requiring_typename.include?(type_name) && !result.key?("__typename")
-          result["__typename"] = type_name
-        end
-        result
+      def prepare_for_index(type_name, record, mapping_properties: nil)
+        prepare_value_for_indexing(record, type_name, mapping_properties) # : ::Hash[::String, untyped]
       end
 
       private
 
-      def prepare_value_for_indexing(value, type_name)
+      def prepare_value_for_indexing(value, type_name, mapping_properties)
         type_name = type_name.delete_suffix("!")
 
         return nil if value.nil?
@@ -144,7 +115,7 @@ module ElasticGraph
         case value
         when ::Array
           element_type_name = type_name.delete_prefix("[").delete_suffix("]")
-          value.map { |v| prepare_value_for_indexing(v, element_type_name) }
+          value.map { |v| prepare_value_for_indexing(v, element_type_name, mapping_properties) }
         when ::Hash
           # `@eg_meta_by_field_name_by_concrete_type` does not have abstract types in it (e.g. type unions).
           # Instead, it'll have each concrete subtype in it.
@@ -153,15 +124,27 @@ module ElasticGraph
           # what the concrete subtype is. `__typename` is required on abstract types and indicates that.
           eg_meta_by_field_name = @eg_meta_by_field_name_by_concrete_type.fetch(value["__typename"] || type_name)
 
-          value.filter_map do |field_name, field_value|
+          # Determine whether __typename belongs at this position by checking the index mapping.
+          typename_in_mapping = mapping_properties&.key?("__typename")
+
+          prepared_fields = value.filter_map do |field_name, field_value|
             if field_name == "__typename"
-              # We only want to include __typename if we're dealing with a type that requires it
-              # (i.e. a type that can represent multiple subtypes, so __typename can differentiate between them).
-              [field_name, field_value] if @types_with_typename_required_in_json_schema.include?(type_name)
+              # Only include __typename if the index mapping has it at this position.
+              [field_name, field_value] if typename_in_mapping
             elsif (eg_meta = eg_meta_by_field_name[field_name])
-              [eg_meta.fetch("nameInIndex"), prepare_value_for_indexing(field_value, eg_meta.fetch("type"))]
+              name_in_index = eg_meta.fetch("nameInIndex")
+              nested_mapping_properties = mapping_properties&.dig(name_in_index, "properties")
+              [name_in_index, prepare_value_for_indexing(field_value, eg_meta.fetch("type"), nested_mapping_properties)]
             end
           end.to_h
+
+          # Inject __typename if the mapping requires it but it's absent from the record
+          # (e.g. for a concrete type indexed in a mixed-type index).
+          if typename_in_mapping && !value.key?("__typename")
+            prepared_fields["__typename"] = type_name
+          end
+
+          prepared_fields
         else
           # We won't have a registered preparer for enum types, since those aren't dumped in
           # runtime metadata `scalar_types_by_name`, and we can just return the value as-is in
@@ -173,8 +156,6 @@ module ElasticGraph
       TypeMetadata = ::Data.define(
         # The name of the type this metadata object is for.
         :name,
-        # Indicates if this type requires a `__typename` field.
-        :requires_typename,
         # The per-field ElasticGraph metadata, keyed by field name.
         :eg_meta_by_field_name
       )

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following up on my comment above about it being better to include __typename based on the index mapping, I suspect there's a scenario which would fail with your implementation and pass with an implementation that consults the index mapping. Can you come up with a test that covers that?

end

factory :person, parent: :hash_base do
factory :embedded_person, parent: :hash_base do
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer not to have duplicate factory definitions of :person and :company, and force callers to specify :embedded_xyz vs :indexed_xyz.

Instead, I think we can make the :person and :company factories use :indexed_type as their parent, and then we can offer an .as_embedded refinement:

diff --git a/spec_support/lib/elastic_graph/spec_support/factories.rb b/spec_support/lib/elastic_graph/spec_support/factories.rb
index 371b0c55..798fae7a 100644
--- a/spec_support/lib/elastic_graph/spec_support/factories.rb
+++ b/spec_support/lib/elastic_graph/spec_support/factories.rb
@@ -17,6 +17,20 @@ require "faker"
 require "elastic_graph/indexer/test_support/converters"
 require "elastic_graph/support/hash_util"
 
+module ElasticGraph
+  module SpecSupport
+    module HashAsEmbedded
+      refine Hash do
+        # Strips indexed-type-only keys, converting an indexed factory hash
+        # to one suitable for embedding in another record.
+        def as_embedded
+          except(:__version, :__typename, :__json_schema_version)
+        end
+      end
+    end
+  end
+end
+
 # A counter that we increment for the `__version` value on each new factory-generated record.
 version_counter = 0
 
diff --git a/spec_support/lib/elastic_graph/spec_support/factories/widgets.rb b/spec_support/lib/elastic_graph/spec_support/factories/widgets.rb
index a692d477..91c0af50 100644
--- a/spec_support/lib/elastic_graph/spec_support/factories/widgets.rb
+++ b/spec_support/lib/elastic_graph/spec_support/factories/widgets.rb
@@ -8,6 +8,8 @@
 
 require "date"
 
+using ElasticGraph::SpecSupport::HashAsEmbedded
+
 # Note: it is *essential* that all factories defined here generate records
 # deterministically, in order for the request bodies to (and responses from)
 # the datastore to not change between VCR cassettes being recorded and replayed.
@@ -38,27 +40,13 @@ FactoryBot.define do
     is_draft { false }
   end
 
-  factory :embedded_person, parent: :hash_base do
-    __typename { "Person" }
-    id { Faker::Alphanumeric.alpha(number: 20) }
-    name { Faker::Name.name }
-    nationality { Faker::Nation.nationality }
-  end
-
-  factory :indexed_person, parent: :indexed_type do
+  factory :person, parent: :indexed_type do
     __typename { "Person" }
     name { Faker::Name.name }
     nationality { Faker::Nation.nationality }
   end
 
-  factory :embedded_company, parent: :hash_base do
-    __typename { "Company" }
-    id { Faker::Alphanumeric.alpha(number: 20) }
-    name { Faker::Company.name }
-    stock_ticker { name[0..3].upcase }
-  end
-
-  factory :indexed_company, parent: :indexed_type do
+  factory :company, parent: :indexed_type do
     __typename { "Company" }
     name { Faker::Company.name }
     stock_ticker { name[0..3].upcase }
@@ -119,7 +107,7 @@ FactoryBot.define do
       components.map { |c| c.fetch(:id) }
     end
 
-    inventor { build Faker::Base.sample([:embedded_person, :embedded_company]) }
+    inventor { build(Faker::Base.sample([:person, :company])).as_embedded }
     named_inventor { inventor }
     weight_in_ng { Faker::Number.between(from: 2**51, to: (2**53) - 1) }
     weight_in_ng_str { Faker::Number.between(from: 2**60, to: 2**61) }
@@ -154,7 +142,7 @@ FactoryBot.define do
     __typename { "Manufacturer" }
     name { Faker::Company.name }
     created_at { Faker::Time.between(from: recent_date - 30, to: recent_date).utc.iso8601 }
-    ceo { build(:embedded_person) }
+    ceo { build(:person).as_embedded }
   end
 
   factory :address, parent: :indexed_type do

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

RecordPreparer Bug: Context-Unaware __typename Addition

2 participants