Restructure Spark Project Cross Compilation #1423

Closed
jbaiera opened this issue Feb 7, 2020 · 14 comments

Comments

@jbaiera
Member

jbaiera commented Feb 7, 2020

The way that ES-Hadoop currently handles cross compiling the Spark integration for various versions of Scala has reached a point where it needs to be reconsidered. Since code compiled with one major version of Scala is not binary compatible with a Scala runtime on a different major version, most Scala-based projects must recompile and release separate artifacts for each supported version of Scala.

This process of cross compilation is supported natively in SBT, but since this project is first and foremost a Java project and makes extensive use of the Elasticsearch project's testing facilities by way of Gradle plugins, converting to SBT in order to fix the problems we are seeing is not an option.

The current process for cross compiling the project's Scala libraries uses an in-house Gradle plugin that recursively launches the Gradle build with a different version of Scala specified. The child build performs the variant assembly, taking care to rename artifacts and the like as needed.
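
For illustration, the recursive approach looks roughly like the following Gradle Kotlin DSL sketch. The task names and the `scala.variant` property are made up here for the example; they are not the actual plugin's API.

```kotlin
// build.gradle.kts -- illustrative sketch of the recursive cross-build, not
// the real in-house plugin. Each registered task re-runs this same build with
// a different Scala version passed in as a project property.
val crossScalaVersions = listOf("2.10.7", "2.11.12")

crossScalaVersions.forEach { scalaVersion ->
    tasks.register<GradleBuild>("crossCompile_${scalaVersion.replace('.', '_')}") {
        // Recurse into the same project and assemble it again.
        dir = project.projectDir
        tasks = listOf("assemble")
        // "scala.variant" is a hypothetical property the child build would read
        // to select the Scala version and rename the produced artifacts.
        startParameter.projectProperties = mapOf("scala.variant" to scalaVersion)
    }
}
```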

We've run into a number of problems with this process, though:

  1. With a recursive build, Gradle cannot apply its usual build optimizations.
  2. As core Gradle logic around project configuration has changed, our cross-compile plugin logic has broken, causing maintenance issues and delaying upgrades.
  3. Spark and Scala release new supported variants on very different schedules, so the versions ES-Hadoop supports diverge. For example, Spark 2.x no longer supports Scala 2.10, and Spark 1.6 does not support Scala 2.12. The direction we want to take the project's structure would make our cross-compile process incompatible with supporting new versions of Spark whose Scala support diverges outside of major releases.
  4. Testing the different variants requires a complicated array of CI configurations.

A potential solution that we are actively investigating is using Gradle's officially supported variant artifacts to organize the build logic (a rough sketch follows the list below). By using Gradle's variant system we address the following problems:

  1. All variants are built with one execution of the Gradle command, which allows more build optimizations to be applied.
  2. Since the variant configuration is officially supported (unlike nested and recursive builds), we are better insulated from breakages when upgrading.
  3. We can correctly model the project as it always should have been while still supporting new Spark and Scala versions that diverge from the earlier supported versions of both.
  4. We can test all the variants in one build.
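
Roughly, the variant idea could be modeled like this in plain Gradle Kotlin DSL. The attribute key, configuration names, and Scala versions below are illustrative assumptions, not the final build layout.

```kotlin
// build.gradle.kts -- minimal sketch of per-Scala-version variants in one
// build, using a custom attribute. Illustrative only; the actual
// restructuring may model variants differently.
plugins {
    id("scala")
}

repositories {
    mavenCentral()
}

// Hypothetical attribute used to tell the Scala variants apart.
val scalaVariant = Attribute.of("org.example.scala.variant", String::class.java)

listOf("2.11", "2.12").forEach { scalaMajor ->
    val suffix = scalaMajor.replace(".", "")

    // One jar per Scala version; a real build would also compile
    // variant-specific sources rather than reuse main's output.
    val variantJar = tasks.register<Jar>("jarScala$suffix") {
        archiveClassifier.set("scala_$scalaMajor")
        from(sourceSets["main"].output)
    }

    // One consumable configuration per Scala version, carrying the attribute
    // so consumers can select the variant they need.
    configurations.create("scala${suffix}Elements") {
        isCanBeConsumed = true
        isCanBeResolved = false
        attributes.attribute(scalaVariant, scalaMajor)
        outgoing.artifact(variantJar)
    }
}
```

Because every variant lives in a single invocation, Gradle can share configuration and cache or parallelize the variant assemblies, which is what enables points 1 and 4 above.
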
@nfx

nfx commented Feb 10, 2020

I just wonder - would it make sense to move spark integration out to separate project?

@jbaiera
Member Author

jbaiera commented Feb 10, 2020

I'm assuming you are suggesting that we use SBT for the Spark integration portion of the project. I'm -1 on this idea for the time being. Moving the integration to another project/build tool invites a different set of build challenges, mostly around testing, license checking, and other functionality we've come to depend on from Gradle and the Elasticsearch build tools. I think it unwise to increase the project's build tool footprint. The benefit of SBT's native cross compilation just wouldn't offset the cost of adopting the tool (at this time). I'd be more amenable to the idea if the Gradle variant approach does not end up working correctly.

@mmigdiso

mmigdiso commented Jul 6, 2020

Hey @jbaiera, I see your point, and as a newbie to the es-hadoop code I feel your pain. But on the other hand, don't you think that es-hadoop should somehow support the recent releases of Spark? Spark 3.0 is now out and many companies are already evaluating migration plans for it. The ES-Hadoop package is pretty critical for Spark users and will be a blocker for the Spark 3.0 migration for many.

@jbaiera
Member Author

jbaiera commented Jul 14, 2020

@mmigdiso I completely understand the push for supporting the newer versions of Spark, but this work is very much needed in order to get there while still satisfying our compatibility requirements. Gradle's support for Scala is lacking in the cross compilation area, and getting where we need to be requires some complicated changes. A decent number of PRs that deal with this issue have already been merged; I'll go through and link them here for the sake of visibility. The goal is a build process that can handle changes to the Spark and Scala ecosystems going forward with minimal turnaround time, starting with Spark 2.4.3 and 3.0 as well as Scala 2.12 and above.

@Tagar

Tagar commented Sep 23, 2020

@jbaiera

Gradle's support for Scala is lacking in the cross compilation area

Would it be easier if we only had two options -

  • Spark 2.x with Scala 2.11
  • Spark 3.x with Scala 2.12.

Scala 2.12 support in Spark 2.4.x was experimental and not that many customers are using it.

It might be a bit more radical, but alternatively, to make things even easier, you could potentially deprecate support for Spark 2.x, since Spark 2 only accepts bug fixes at this point. If somebody is still using Spark 2, they can just pull an older version of elasticsearch-hadoop for Scala 2.11.

If you could build a separate release for just Scala 2.12 and Spark 3.x, it would unlock a lot of customers migrating over to Spark 3. Scala 2.12 support is mandatory to migrate to Spark 3 (Scala 2.11 is not supported in Spark 3.x).

Thank you!!

@jbaiera
Member Author

jbaiera commented Sep 24, 2020

Would it be easier if we only had two options -

Spark 2.x with Scala 2.11
Spark 3.x with Scala 2.12.

We considered this course of action, but it became a problem when trying to juggle changes to the core shared library across Scala versions. The changes in #1521 allow us to build the core library for all versions of Scala/Spark, and each downstream SQL version can pick the appropriate one to work with.
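
To make "pick the appropriate version" concrete, a downstream module could request the matching core variant through a dependency-level attribute, something like the sketch below. The `:core` project path and the attribute key are illustrative, not the actual names used by #1521.

```kotlin
// build.gradle.kts of a hypothetical downstream module (e.g. a Spark SQL
// integration). Assumes the core project publishes variants tagged with the
// same custom attribute shown in the earlier sketch.
plugins {
    id("scala")
}

val scalaVariant = Attribute.of("org.example.scala.variant", String::class.java)

dependencies {
    // Ask Gradle for the core variant that was compiled against Scala 2.12.
    implementation(project(":core")) {
        attributes {
            attribute(scalaVariant, "2.12")
        }
    }
}
```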

@mallman

mallman commented Dec 10, 2020

@jbaiera It looks like the last time #1412 was acknowledged was in early February, and it seems there's a lot of frustration there. But how is this project going? Is it anywhere near where it needs to be to support Spark 3?

Honestly, when I came to investigate Spark 3 support, I thought it might just be a few compilation errors or refactorings that I could contribute. But the build system is the blocker. Perhaps just put Spark support in its own project and use a build system that solves this problem nicely, like Maven or SBT?

Cheers.

@LannyRipple

How does one use this? I see that Allow distribute scala 2.12 and update to spark 2.4.3 #1308 got closed, but there are no tasks mentioning Scala 2.12, so it doesn't seem we are any better off than we were.

Since Spark 2.4.5 (the latest is now 2.4.7, which the latest EMR uses), Scala 2.12 has been the default, and Spark 3.x explicitly removes support for compiling with Scala 2.11.

@mallman

mallman commented Jan 21, 2021

How does one use this? I see that Allow distribute scala 2.12 and update to spark 2.4.3 #1308 got closed, but there are no tasks mentioning Scala 2.12, so it doesn't seem we are any better off than we were.

Yup. I think a lot of people are frustrated by the lack of communication. My humble request to anyone reading: if you have a commercial support agreement with Elastic, please tell your support representative that this is important to you. If you are unsatisfied with their response, call your sales representative and tell them this is crucial to keeping your business with them.

@jbaiera
Member Author

jbaiera commented Jan 25, 2021

Hey folks, I understand your frustration. This change has been a long time coming. The PR for supporting Scala 2.12 on Spark 2.4.x is up at #1589 and should be available in the next minor release.

@jbaiera
Member Author

jbaiera commented Apr 6, 2021

This work is complete as of 7.12.
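
For reference, consuming the new Spark 3 artifact should be an ordinary dependency declaration along these lines; the coordinates below reflect my understanding of the published naming scheme, so double-check the exact artifact name and latest version on Maven Central.

```kotlin
// build.gradle.kts (consumer side) -- assumed coordinates; verify the
// artifact name and version on Maven Central before relying on them.
plugins {
    id("scala")
}

repositories {
    mavenCentral()
}

dependencies {
    // Spark 3.x / Scala 2.12 artifact made available around the 7.12 release.
    implementation("org.elasticsearch:elasticsearch-spark-30_2.12:7.12.0")
}
```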

jbaiera closed this as completed Apr 6, 2021
@lucaskjaero

Thanks for this! Do you know the right place to submit docs updates? The installation guide page doesn't seem to have been updated to show the new support.

@jbaiera
Member Author

jbaiera commented Apr 6, 2021

The installation guide page doesn't seem to have been updated to show the new support.

@lucaskjaero That's a great point. The docs are hosted in this project under the docs directory. I'll put up a quick PR to clarify this.

@jbaiera
Member Author

jbaiera commented Apr 6, 2021

Opened #1638 for the docs update
