[Feature] Spark3.0 support #1412

Closed
jonathanyuechun opened this issue Jan 9, 2020 · 62 comments · Fixed by #1592

@jonathanyuechun

jonathanyuechun commented Jan 9, 2020

Feature description

Spark 3 is currently in RC. Will there be support for Spark 3 in the next release (v8), or will we have to wait for v9?
More precisely, do you plan to start supporting Spark 3 only once the official Spark 3 release is made?

  • Spark 3 now supports only Scala 2.12
@jbaiera
Member

jbaiera commented Jan 28, 2020

A considerable issue with our current Spark support is how stretched the project has become across the versions of Spark we support, made worse by how Gradle manages building and maintaining Scala project code. One thing we're looking at is removing support for older versions, but unfortunately this will not address all of the issues we've run into.

We've been working internally to make the process better for a while now. It's been an unfortunate process of peeling back layers and finding more issues that need fixing, which is why there hasn't been much in the way of public issues on it.

I'll put up an issue soon that should detail the issues that we're running into (they're mostly to do with cross compiling Scala versions in Gradle) and what solutions we're exploring to fix them. As always, I appreciate the community's patience on this matter.

@jbaiera
Member

jbaiera commented Feb 7, 2020

I've logged an issue detailing our Scala pains in the build.

@AceHack

AceHack commented Jul 29, 2020

Is there a timeline on when this will be supported?

@dejanmiljkovic

Same here. Any news on when Spark 3.0 is going to be supported?

@dlubomski

+1

@iercan

iercan commented Sep 17, 2020

We need that too.

@HarborZeng

+1

8 similar comments
@Ananya-96

+1

@akoshevoy

+1

@axiangcoding

+1

@bgmarsh

bgmarsh commented Oct 2, 2020

+1

@pedrosk

pedrosk commented Oct 5, 2020

+1


@apostaremczak

+1

@wudj496

wudj496 commented Oct 15, 2020

+1


@mgolinelli

+1

4 similar comments
@bry00

bry00 commented Oct 18, 2020

+1

@genged

genged commented Oct 19, 2020

+1

@sasincj

sasincj commented Oct 20, 2020

+1

@bry00

bry00 commented Oct 27, 2020

+1

@yingmei

yingmei commented Oct 28, 2020

We are also waiting for Spark 3 support. Please let us know the timeline.
Thanks.

@Walker555

+1

@axiangcoding1

Any progress on this feature?

@sunpe

sunpe commented Jan 21, 2021

+1

@AceHack

AceHack commented Jan 21, 2021

This is good news; I would love to see an announcement of the expected official release date.

@lyogev

lyogev commented Jan 26, 2021

For people waiting, you can build this library locally, and it works for Spark 3.0/Scala 2.12 -- The blocker looks like tests.

If you can't wait for official support, you can build this version where the tests have been ripped out. Just clone lucaskjaero/elasticsearch-hadoop, and run ./gradlew -DskipTests=true build.

This builds a LOT on the work of @avnerl for actually upgrading the build -- I just ripped the broken parts out.

@lucaskjaero I'm unable to build, getting:

A problem occurred evaluating root project 'elasticsearch-hadoop'.
> Failed to apply plugin [class 'org.elasticsearch.gradle.info.GlobalBuildInfoPlugin']
   > Could not create plugin of type 'GlobalBuildInfoPlugin'.
      > Could not generate a decorated class for type GlobalBuildInfoPlugin.
         > org/gradle/jvm/toolchain/internal/InstallationLocation

Which JDK are you using to build?

@lucaskjaero

@lyogev I'm using openjdk 15.0.2 2021-01-19. I'm not familiar enough with the codebase to troubleshoot. It looks like #1224 will officially address Scala 2.12, so it might be worth waiting until the next minor release.

@pablo0910

For people waiting, you can build this library locally, and it works for Spark 3.0/Scala 2.12 -- The blocker looks like tests.
If you can't wait for official support, you can build this version where the tests have been ripped out. Just clone lucaskjaero/elasticsearch-hadoop, and run ./gradlew -DskipTests=true build.
This builds a LOT on the work of @avnerl for actually upgrading the build -- I just ripped the broken parts out.

@lucaskjaero I'm unable to build, getting:

A problem occurred evaluating root project 'elasticsearch-hadoop'.
> Failed to apply plugin [class 'org.elasticsearch.gradle.info.GlobalBuildInfoPlugin']
   > Could not create plugin of type 'GlobalBuildInfoPlugin'.
      > Could not generate a decorated class for type GlobalBuildInfoPlugin.
         > org/gradle/jvm/toolchain/internal/InstallationLocation

Which JDK are you using to build?

I had the same issue. For me, installing Gradle directly instead of using the gradlew wrapper solved it.

@rcongiu

rcongiu commented Jan 29, 2021

+1

@jbaiera
Member

jbaiera commented Jan 29, 2021

A new PR for supporting Spark 3.0 is open now, which should be free of any build system issues. If people are available to check out the project and build the artifact to test with their system, we'd be happy to accept feedback on any upgrade related issues you've run into on the PR itself.

To build the new version, make sure your current JAVA_HOME points to a supported JDK 11 (OpenJDK or Oracle). Additionally, you will need to create an environment variable named JAVA8_HOME that points to a JDK 8 distribution (needed for tests, fixtures, etc., and can't be skipped). The following command will then build the artifacts:

$> ./gradlew elasticsearch-spark-30:distribution --console=plain

The resulting artifacts will be available in spark/sql-30/build/distributions/

$> ls spark/sql-30/build/distributions/
elasticsearch-spark-30_2.12-8.0.0-SNAPSHOT-javadoc.jar
elasticsearch-spark-30_2.12-8.0.0-SNAPSHOT-sources.jar
elasticsearch-spark-30_2.12-8.0.0-SNAPSHOT.jar
elasticsearch-spark-30_2.12-8.0.0-SNAPSHOT.pom

edit: Updated the need for ENV properties in the build.
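Putting the steps above into a single session, a sketch of the full build might look like the following. The JDK install paths are illustrative assumptions, not part of the original instructions; point them at wherever your JDK 11 and JDK 8 actually live.

```shell
# Illustrative JDK locations -- adjust to your own installs
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk    # JDK 11 runs the build
export JAVA8_HOME=/usr/lib/jvm/java-8-openjdk    # JDK 8 is needed for tests/fixtures

git clone https://github.com/elastic/elasticsearch-hadoop.git
cd elasticsearch-hadoop
./gradlew elasticsearch-spark-30:distribution --console=plain

# The built artifacts land here:
ls spark/sql-30/build/distributions/
```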

@Mpal7

Mpal7 commented Feb 9, 2021

Thank you very much! Any idea when it will be available on Maven?

@jbaiera
Member

jbaiera commented Feb 16, 2021

PR is merged. My hope is that the artifact will appear in our nightly snapshots in the next day or two (depending on when changes to snapshot deliveries take hold). Thanks again to the community for your patience on this update!

@jbaiera
Member

jbaiera commented Feb 20, 2021

The Spark 3.0 support is available on Maven now as a snapshot of the 7.12 release branch.

Just a quick note - this is a snapshot, and the official tag line for this kind of code is "don't use it in production please" as it's not officially supported. That said, please feel free to pull it down and give it a test and let us know if we got something wrong before the official release!

@iercan

iercan commented Feb 24, 2021

@jbaiera I've tested it with R via sparklyr. It looks fine. Thanks for the improvement

@iercan

iercan commented Feb 26, 2021

@jbaiera In further testing I got the exception below. Do you have any idea why?

Caused by: java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: scala.None$ is not a valid external type for schema of string
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 0, key_ipCountry), StringType), true, false) AS key_ipCountry#3652

@jbaiera
Member

jbaiera commented Mar 3, 2021

@iercan Can you open a new issue with the full stack trace and a simple reproduction?

@sosipatra

You can read from ES but you can't write to it... any update on this topic?

@jbaiera
Member

jbaiera commented Mar 14, 2021

If you are experiencing problems with the new integration version, please open a bug issue with steps to reproduce the problem. Thanks!

@sosipatra

If you are experiencing problems with the new integration version, please open a bug issue with steps to reproduce the problem. Thanks!

Do you have a timeline for when this will be available in a production environment on Databricks? I can't use something that isn't official in my production env. :)

@jbaiera
Member

jbaiera commented Mar 15, 2021

I unfortunately don't have an official concrete date for the 7.12 GA release. Sorry! Hopefully it will land soon!

@sosipatra

I unfortunately don't have an official concrete date for the 7.12 GA release. Sorry! Hopefully it will land soon!

Many thanks for the fast reply... in the meantime I found a workaround that actually works for my project :D

@thispejo

+1

@dlubomski

I unfortunately don't have an official concrete date for the 7.12 GA release. Sorry! Hopefully it will land soon!

Many thanks for the fast reply... in the meantime I found a workaround that actually works for my project :D

Could you share the workaround?

@sosipatra

sosipatra commented May 5, 2021

I unfortunately don't have an official concrete date for the 7.12 GA release. Sorry! Hopefully it will land soon!

Many thanks for the fast reply... in the meantime I found a workaround that actually works for my project :D

Could you share the workaround?

Cluster: 6.4 Extended Support (includes Apache Spark 2.4.5, Scala 2.11)
Installed the library: (JAR file) elasticsearch-spark-20_2.11-7.12.1 (https://www.elastic.co/guide/en/elasticsearch/hadoop/current/float.html)
Reading data from Elasticsearch (https://docs.databricks.com/data/data-sources/elasticsearch.html)

Use Data Frame API to access Elasticsearch index

@jbaiera
Member

jbaiera commented May 10, 2021

Just to follow up here, 7.12 is generally available now so the standard installation routes can be used instead of any workarounds.
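For anyone wiring this into a build, the GA connector should be resolvable from Maven Central with coordinates along these lines. This is a sketch: the exact 7.12.x patch version shown is an assumption, so pick whichever release you need.

```xml
<!-- Spark 3.x / Scala 2.12 connector; version is illustrative -->
<dependency>
  <groupId>org.elasticsearch</groupId>
  <artifactId>elasticsearch-spark-30_2.12</artifactId>
  <version>7.12.0</version>
</dependency>
```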

@psterk1

psterk1 commented May 19, 2021

For people waiting, you can build this library locally, and it works for Spark 3.0/Scala 2.12 -- The blocker looks like tests.
If you can't wait for official support, you can build this version where the tests have been ripped out. Just clone lucaskjaero/elasticsearch-hadoop, and run ./gradlew -DskipTests=true build.
This builds a LOT on the work of @avnerl for actually upgrading the build -- I just ripped the broken parts out.

@lucaskjaero I'm unable to build, getting:

A problem occurred evaluating root project 'elasticsearch-hadoop'.
> Failed to apply plugin [class 'org.elasticsearch.gradle.info.GlobalBuildInfoPlugin']
   > Could not create plugin of type 'GlobalBuildInfoPlugin'.
      > Could not generate a decorated class for type GlobalBuildInfoPlugin.
         > org/gradle/jvm/toolchain/internal/InstallationLocation

Which JDK are you using to build?

I had the same issue. For me, installing gradle on my computer instead of using gradlew solved it.

I installed Gradle 6.8.3, which includes the org.gradle.jvm.toolchain package.
