Feature Description:
I am currently encountering the following error when querying a subset of columns from an Elasticsearch index that contains conflicting mappings for the same field:
org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Incompatible types found in multi-mapping: Field [field1] has conflicting types of [OBJECT] and [TEXT].
I understand that type coercion support has been introduced (see: #1074), and that coercion from complex types (e.g., OBJECT) to simple types (e.g., STRING) is not allowed.
However, in my query I am not referencing field1 at all. I’m only selecting a small subset of fields, which is correctly handled during column pruning in org/elasticsearch/spark/sql/DefaultSource.scala.
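For illustration, a minimal reproduction might look like the following (index and field names are placeholders, and `spark` is assumed to be an existing SparkSession):

```scala
// Hypothetical reproduction: field1 has conflicting mappings across the
// indices matched by the pattern, but the query never references it.
val df = spark.read
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", "localhost:9200")
  .load("my-index-*")
  .select("field2", "field3") // fails with the multi-mapping error even though field1 is unused
```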
Despite this, at line 300, we instantiate ScalaEsRowRDD using the full inferred schema:
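The construction in question looks roughly like this (paraphrased fragment of DefaultSource.scala, not an exact quote):

```scala
// lazySchema is discovered from the full index mapping and then handed to
// ScalaEsRowRDD, regardless of which columns the query actually needs.
@transient lazy val lazySchema = SchemaUtils.discoverMapping(cfg)

def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
  // ... pruned columns and pushed-down filters only adjust the scan parameters ...
  new ScalaEsRowRDD(sqlContext.sparkContext, paramWithScan, lazySchema)
}
```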
This triggers a call to the _mapping API, which attempts to retrieve the entire mapping for the index — including problematic fields that aren't actually required by the query.
Proposal:
I suggest the following improvements:
Use the user-specified schema from df.read.schema(...).options(...).load directly to construct ScalaEsRowRDD, bypassing the need to fetch or validate unused fields. The mapping required to deserialize the data returned from Elasticsearch can also be derived by converting the user-defined schema. To avoid a breaking change, we could add an option that specifies when this user-provided mapping should be applied instead of the mapping fetched from the Elasticsearch _mapping API (see the sketch after this list).
Modify the lazySchema logic to resolve and validate only the projected subset of columns that are actually used in the query.
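As a sketch of how the first improvement might look from the user's side (the option name below is purely illustrative, not an existing setting):

```scala
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Hypothetical usage: the caller supplies the schema up front, so the
// connector could derive the deserialization mapping from it instead of
// calling the _mapping API for the whole index.
val userSchema = StructType(Seq(
  StructField("field2", StringType),
  StructField("field3", LongType)
))

val df = spark.read
  .format("org.elasticsearch.spark.sql")
  .schema(userSchema)
  .option("es.nodes", "localhost:9200")
  // Illustrative option name only; gating the new behavior behind a flag
  // would avoid breaking callers that rely on mapping discovery today.
  .option("es.read.schema.from.user", "true")
  .load("my-index-*")
```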
This change would prevent failures caused by irrelevant mapping conflicts and would align with how Spark typically prunes columns during logical planning.
Willing to Contribute:
I'm happy to submit a PR to explore this improvement if the approach is acceptable.