Restructure Spark Project Cross Compilation #1423

Closed
jbaiera opened this issue Feb 7, 2020 · 14 comments

Comments

@jbaiera
Member

jbaiera commented Feb 7, 2020

The way that ES-Hadoop currently handles cross compiling the Spark integration for various versions of Scala has reached a point where it needs to be reconsidered. Since code compiled with one major version of Scala is not binary compatible with a Scala runtime on a different major version, most Scala-based projects must recompile and release separate artifacts for each supported version of Scala.

This process of cross compilation is supported natively in SBT, but since this project is first and foremost a Java project and makes extensive use of the Elasticsearch project's testing facilities by way of Gradle plugins, converting to SBT in order to fix the problems we are seeing is not an option.

The current process for cross compiling the project's Scala libraries uses an in-house Gradle plugin that recursively launches the Gradle build with a different version of Scala specified. The child build performs the variant assembly, taking care to rename artifacts and the like as needed.
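
For illustration, the recursive approach looks roughly like the following Gradle Kotlin DSL sketch. The task names and the `scala.variant` property are made up here for the example; they are not the actual plugin's API.

```kotlin
// build.gradle.kts -- illustrative sketch of the recursive cross-build, not
// the real in-house plugin. Each registered task re-runs this same build with
// a different Scala version passed in as a project property.
val crossScalaVersions = listOf("2.10.7", "2.11.12")

crossScalaVersions.forEach { scalaVersion ->
    tasks.register<GradleBuild>("crossCompile_${scalaVersion.replace('.', '_')}") {
        // Recurse into the same project and assemble it again.
        dir = project.projectDir
        tasks = listOf("assemble")
        // "scala.variant" is a hypothetical property the child build would read
        // to select the Scala version and rename the produced artifacts.
        startParameter.projectProperties = mapOf("scala.variant" to scalaVersion)
    }
}
```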

We've run into a number of problems with this process, though:

  1. With a recursive build, Gradle cannot apply its usual build optimizations.
  2. As core Gradle logic around project configuration has changed, our cross-compile plugin logic has broken, causing maintenance issues and delaying upgrades.
  3. Spark and Scala release new supported variants on very different schedules, so the versions ES-Hadoop supports diverge. For example, Spark 2.x no longer supports Scala 2.10, and Spark 1.6 does not support Scala 2.12. The direction we want to take the project's structure would make our cross-compile process incompatible with supporting new versions of Spark whose Scala support diverges outside of major releases.
  4. Testing the different variants requires a complicated array of CI configurations.

A potential solution that we are actively investigating is using Gradle's officially supported variant artifacts to organize the build logic (a rough sketch follows the list below). By using Gradle's variant system we address the following problems:

  1. All variants are built with one execution of the Gradle command, which allows more build optimizations to be applied.
  2. Since the variant configuration is officially supported (unlike nested and recursive builds), we are better insulated from breakages when upgrading.
  3. We can correctly model the project as it always should have been while still supporting new Spark and Scala versions that diverge from the earlier supported versions of both.
  4. We can test all the variants in one build.
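
Roughly, the variant idea could be modeled like this in plain Gradle Kotlin DSL. The attribute key, configuration names, and Scala versions below are illustrative assumptions, not the final build layout.

```kotlin
// build.gradle.kts -- minimal sketch of per-Scala-version variants in one
// build, using a custom attribute. Illustrative only; the actual
// restructuring may model variants differently.
plugins {
    id("scala")
}

repositories {
    mavenCentral()
}

// Hypothetical attribute used to tell the Scala variants apart.
val scalaVariant = Attribute.of("org.example.scala.variant", String::class.java)

listOf("2.11", "2.12").forEach { scalaMajor ->
    val suffix = scalaMajor.replace(".", "")

    // One jar per Scala version; a real build would also compile
    // variant-specific sources rather than reuse main's output.
    val variantJar = tasks.register<Jar>("jarScala$suffix") {
        archiveClassifier.set("scala_$scalaMajor")
        from(sourceSets["main"].output)
    }

    // One consumable configuration per Scala version, carrying the attribute
    // so consumers can select the variant they need.
    configurations.create("scala${suffix}Elements") {
        isCanBeConsumed = true
        isCanBeResolved = false
        attributes.attribute(scalaVariant, scalaMajor)
        outgoing.artifact(variantJar)
    }
}
```

Because every variant lives in a single invocation, Gradle can share configuration and cache or parallelize the variant assemblies, which is what enables points 1 and 4 above.
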
@nfx

nfx commented Feb 10, 2020

I just wonder - would it make sense to move spark integration out to separate project?

@jbaiera
Member Author

jbaiera commented Feb 10, 2020

I'm assuming you are suggesting that we use SBT for the Spark integration portion of the project. I'm -1 on this idea for the time being. Moving the integration to another project/build tool invites a different set of build challenges, mostly around testing, license checking, and other functionality we've come to depend on from Gradle and the Elasticsearch build tools. I think it unwise to increase the project's build tool footprint. The benefit of SBT's native cross compilation just wouldn't offset the cost of adopting the tool (at this time). I'd be more amenable to the idea if the Gradle variant approach does not end up working correctly.

@mmigdiso

mmigdiso commented Jul 6, 2020

Hey @jbaiera, I see your point, and as a newbie to the es-hadoop code I feel your pain. But on the other hand, don't you think that es-hadoop should somehow support the recent releases of Spark? Spark 3.0 is now out and many companies are already evaluating migration plans for it. The ES-Hadoop package is pretty critical for Spark users and will be a blocker for the Spark 3.0 migration for many.

@jbaiera
Member Author

jbaiera commented Jul 14, 2020

@mmigdiso I completely understand the push for supporting the newer versions of Spark, but this work is very much needed in order to get there while still satisfying our compatibility requirements. Gradle's support for Scala is lacking in the cross compilation area, and getting where we need to be requires some complicated changes. A decent number of PRs that deal with this issue have already been merged; I'll go through and link them here for the sake of visibility. The goal is a build process that can handle changes to the Spark and Scala ecosystems going forward with minimal turnaround time, starting with Spark 2.4.3 and 3.0 as well as Scala 2.12 and above.

@Tagar

Tagar commented Sep 23, 2020

@jbaiera

Gradle's support for Scala is lacking in the cross compilation area

Would it be easier if we only had two options -

  • Spark 2.x with Scala 2.11
  • Spark 3.x with Scala 2.12.

Scala 2.12 support in Spark 2.4.x was experimental and not that many customers are using it.

It might be a bit more radical, but alternatively, to make things even easier, you could potentially deprecate support for Spark 2.x, since Spark 2 only accepts bug fixes at this point. If somebody is still using Spark 2, they can just pull an older version of elasticsearch-hadoop for Scala 2.11.

If you could build a separate release for just Scala 2.12 and Spark 3.x, it would unlock a lot of customers migrating over to Spark 3. Scala 2.12 support is mandatory to migrate to Spark 3 (Scala 2.11 is not supported in Spark 3.x).

Thank you!!

@jbaiera
Member Author

jbaiera commented Sep 24, 2020

Would it be easier if we only had two options -

Spark 2.x with Scala 2.11
Spark 3.x with Scala 2.12.

We considered this course of action, but it became a problem when trying to juggle changes to the core shared library across Scala versions. The changes in #1521 allow us to build the core library for all versions of Scala/Spark, and each downstream SQL version can pick the appropriate one to work with.
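
To make "pick the appropriate version" concrete, a downstream module could request the matching core variant through a dependency-level attribute, something like the sketch below. The `:core` project path and the attribute key are illustrative, not the actual names used by #1521.

```kotlin
// build.gradle.kts of a hypothetical downstream module (e.g. a Spark SQL
// integration). Assumes the core project publishes variants tagged with the
// same custom attribute shown in the earlier sketch.
plugins {
    id("scala")
}

val scalaVariant = Attribute.of("org.example.scala.variant", String::class.java)

dependencies {
    // Ask Gradle for the core variant that was compiled against Scala 2.12.
    implementation(project(":core")) {
        attributes {
            attribute(scalaVariant, "2.12")
        }
    }
}
```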

@mallman

mallman commented Dec 10, 2020

@jbaiera It looks like the last time #1412 was acknowledged was in early February, and it seems there's a lot of frustration there. But how is this project going? Is it anywhere near where it needs to be to support Spark 3?

Honestly, when I came to investigate Spark 3 support, I thought it might just be a few compilation errors or refactorings that I could contribute. But the build system is the blocker. Perhaps just put Spark support in its own project and use a build system that solves this problem nicely, like Maven or SBT?

Cheers.

@LannyRipple

How does one use this? I see that Allow distribute scala 2.12 and update to spark 2.4.3 #1308 got closed, but there are no tasks mentioning Scala 2.12, so it doesn't seem we are any better off than we were.

Since Spark 2.4.5 (the latest is now 2.4.7, which the latest EMR uses), Scala 2.12 has been the default, and Spark 3.x explicitly removes support for compiling with Scala 2.11.

@mallman

mallman commented Jan 21, 2021

How does one use this? I see that Allow distribute scala 2.12 and update to spark 2.4.3 #1308 got closed, but there are no tasks mentioning Scala 2.12, so it doesn't seem we are any better off than we were.

Yup. I think a lot of people are frustrated by the lack of communication. My humble request to anyone reading: if you have a commercial support agreement with Elastic, please tell your support representative that this is important to you. If you are unsatisfied with their response, call your sales representative and tell them this is crucial to keeping your business with them.

@jbaiera
Member Author

jbaiera commented Jan 25, 2021

Hey folks, I understand your frustration. This change has been a long time coming. The PR for supporting Scala 2.12 on Spark 2.4.x is up at #1589 and should be available in the next minor release.

@jbaiera
Member Author

jbaiera commented Apr 6, 2021

This work is complete as of 7.12.
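
For reference, consuming the new Spark 3 artifact should be an ordinary dependency declaration along these lines; the coordinates below reflect my understanding of the published naming scheme, so double-check the exact artifact name and latest version on Maven Central.

```kotlin
// build.gradle.kts (consumer side) -- assumed coordinates; verify the
// artifact name and version on Maven Central before relying on them.
plugins {
    id("scala")
}

repositories {
    mavenCentral()
}

dependencies {
    // Spark 3.x / Scala 2.12 artifact made available around the 7.12 release.
    implementation("org.elasticsearch:elasticsearch-spark-30_2.12:7.12.0")
}
```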

jbaiera closed this as completed Apr 6, 2021
@lucaskjaero

Thanks for this! Do you know the right place to submit docs updates? The installation guide page doesn't seem to have been updated to show the new support.

@jbaiera
Member Author

jbaiera commented Apr 6, 2021

The installation guide page doesn't seem to have been updated to show the new support.

@lucaskjaero That's a great point. The docs are hosted in this project under the docs directory. I'll put up a quick PR to clarify this.

@jbaiera
Member Author

jbaiera commented Apr 6, 2021

Opened #1638 for the docs update
