Restructure Spark Project Cross Compilation #1423
Comments
I just wonder - would it make sense to move the Spark integration out to a separate project?
I'm assuming you are suggesting that we use SBT for the Spark integration portion of the project. I'm -1 on this idea for the time being. Moving the integration to another project/build tool invites a different set of build challenges, mostly around the process of testing, license checking, and other functionality we've come to depend on with Gradle and the Elasticsearch build tools. I think it unwise to increase the build tool footprint of the project. The benefit of SBT's native cross compilation would just not offset the costs of adopting the tool (at this time). I'd be more amenable to the idea if the Gradle variant approach does not end up working correctly.
Hey @jbaiera, I see your point, and as a newbie in the es-hadoop code I feel your pain. But on the other hand, don't you think that es-hadoop should somehow support the recent releases of Spark? Spark 3.0 is now out and many companies are already evaluating their migration plans for it. The ES-Hadoop package is pretty critical for Spark users and will be a blocker for the Spark 3.0 migration for many.
@mmigdiso I completely understand the push for supporting the newer versions of Spark, but this work is very much needed in order to get there while still satisfying our compatibility requirements. Gradle's support for Scala is lacking in the cross compilation area, and in order to get where we need to be there are some complicated changes that need to be made. A decent number of PRs that deal with this issue have already been merged. I'll go through and link them up to it for the sake of visibility. The goal here is to have a build process that can handle changes to the Spark and Scala ecosystem going forward with minimal turnaround time, starting with Spark 2.4.3 and 3.0 as well as Scala 2.12 and above.
Would it be easier if we only had two options? Scala 2.12 support in Spark 2.4.x was experimental and not that many customers are using it. A more radical alternative that would make things even easier: you could potentially deprecate support for Spark 2.x, as Spark 2 only accepts bug fixes at this point. If somebody is still using Spark 2, they can just pull an older version of elasticsearch-hadoop for Scala 2.11. If you could build a separate release for just Scala 2.12 and Spark 3.x, that would unlock a lot of customers migrating over to Spark 3. Scala 2.12 support is mandatory to migrate to Spark 3 (Scala 2.11 is not supported for Spark 3.x). Thank you!!
We considered this course of action but it became a problem when trying to juggle changes to the core shared library across Scala versions. The PR changes in #1521 allow us to build the core library for all versions of Scala/Spark, and each downstream SQL version can pick the appropriate version to work with. |
@jbaiera It looks like the last time #1412 was acknowledged was in early Feb., and it seems there's a lot of frustration there. But how's this project going? Anywhere near where it needs to be to support Spark 3? Honestly, when I came to investigate Spark 3 support I thought maybe it was just a few compilation errors or refactorings that I could contribute. But the build system is the blocker. Perhaps just put Spark support in its own project and use a build system that solves this problem nicely, like Maven or SBT? Cheers.
How does one use this? I see that since Spark 2.4.5 (with the latest now 2.4.7, which the latest EMR uses), Scala 2.12 is the default, and Spark 3.x explicitly removes compiling with Scala 2.11.
Yup. I think a lot of people are frustrated by the lack of communication. My humble request to anyone reading—if you have a commercial support agreement with Elastic, please tell your support representative that this is important to you. If you are unsatisfied with their response, call your sales representative and tell them this is crucial to keeping your business with them. |
Hey folks, I understand your frustration. This change has been a long time coming. The PR for supporting Scala 2.12 on Spark 2.4.x is up at #1589 and should be available in the next minor release. |
This work is completed as of 7.12 |
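For anyone asking how to actually pull this in, here is a minimal Gradle Kotlin DSL sketch. The coordinates are assumed to follow the `_<scala major version>` suffix convention used by the published 7.12 artifacts, and the snippet assumes a project with the Java or Scala plugin already applied; check the installation docs for the authoritative coordinates.

```kotlin
// build.gradle.kts -- assumed coordinates for the Spark 3.x / Scala 2.12 artifact
// published with the 7.12 release; adjust the suffix to match your Scala version.
dependencies {
    implementation("org.elasticsearch:elasticsearch-spark-30_2.12:7.12.0")
}
```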
Thanks for this! Do you know the right place to submit docs updates? The installation guide page doesn't seem to have been updated to show the new support.
@lucaskjaero That's a great point. The docs are hosted on this project under the docs directory. I'll put up a quick PR to clarify this. |
Opened #1638 for the docs update |
The way that ES-Hadoop currently handles cross compiling the Spark integration for various versions of Scala has reached a point where it needs to be reconsidered. Since code compiled with one major version of Scala is not binary compatible with a Scala runtime on a different major version, most Scala-based projects must re-compile and release separate artifacts for each supported version of Scala.
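As an illustration of that convention (using Spark's own published artifacts, not this project's), the Scala major version is encoded in the artifact name, and a consumer has to pick the artifact matching its Scala runtime:

```kotlin
// build.gradle.kts -- the same library published once per Scala major version;
// a build pulls the suffix that matches its Scala runtime.
dependencies {
    implementation("org.apache.spark:spark-sql_2.11:2.4.7") // Scala 2.11 runtime
    // implementation("org.apache.spark:spark-sql_2.12:2.4.7") // Scala 2.12 runtime
}
```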
This process of cross compilation is supported natively in SBT, but since this project is based first and foremost in Java and makes extensive use of the Elasticsearch project's testing facilities by way of Gradle plugins, converting to SBT in order to fix the problems we are seeing is not an option.
The current process for cross compiling our Scala libraries in the project is to use an in-house Gradle plugin that recursively launches the Gradle build with a different version of Scala specified. The child build process performs the variant assembly, taking care to rename artifacts and the like as needed.
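For illustration, a rough sketch of what such a recursive launch can look like in a Gradle Kotlin DSL build; the task name and the `scala.variant` property are assumptions for this sketch, not the project's actual plugin:

```kotlin
// build.gradle.kts -- re-run the same build as a nested Gradle invocation,
// overriding the Scala version it compiles against via a project property.
tasks.register<GradleBuild>("crossCompileScala212") {
    // Tasks to run in the nested build.
    this.tasks = listOf("assemble")
    // Property name is assumed; the child build reads it to select Scala 2.12.
    startParameter.projectProperties = mapOf("scala.variant" to "2.12")
}
```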
We've run into a number of problems with this process, though.
A potential solution to this issue that we are actively investigating is to use Gradle's officially supported variant artifacts to organize the build logic. Using Gradle's variant system lets us solve a number of these problems.
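As a rough idea of what the variant-based direction looks like, here is a minimal Gradle Kotlin DSL sketch using feature variants; the source set, feature, and dependency versions are assumptions for illustration and are not the project's actual build logic:

```kotlin
// build.gradle.kts -- model a Scala 2.12 flavour of the code as a Gradle feature
// variant so downstream consumers can select the matching artifact.
plugins {
    scala
    `java-library`
}

// A dedicated source set for the Scala 2.12 flavour (name assumed).
val scala212 by sourceSets.creating

java {
    // Expose the source set as a consumable variant with its own capability.
    registerFeature("scala212Support") {
        usingSourceSet(scala212)
    }
}

dependencies {
    // The variant compiles against its own Scala and Spark versions (versions assumed).
    "scala212Implementation"("org.scala-lang:scala-library:2.12.13")
    "scala212Implementation"("org.apache.spark:spark-sql_2.12:3.0.1")
}
```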