Fix spark config properties syntax in Pyspark notebook Docker image #1256


Merged
merged 1 commit into jupyter:master from fix-spark-properties on Mar 21, 2021

Conversation

tbluejeans
Contributor

Hello,

Looking at the example properties in the Spark default configuration (spark-defaults.conf), I believe the syntax is incorrect and might lead to these properties being ignored.

The syntax problem only became apparent to me when I tried to add additional properties following the quoted (inverted commas) syntax. I am not sure whether my suggested changes actually fix anything, as I don't know the reason for these properties; even so, it may still be worth merging to prevent others from running into problems when adding their own properties.
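For what it's worth, one way to check whether a property from spark-defaults.conf was actually picked up is to read it back from the running session. A minimal sketch (the key below is only an example, not necessarily one touched by this PR):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local').getOrCreate()

# Read back what Spark actually parsed from spark-defaults.conf; a quoted
# value would typically show up here with the quotes kept as part of the value.
print(spark.sparkContext.getConf().get("spark.driver.extraJavaOptions", "<not set>"))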

Hope it helps!

@romainx romainx self-requested a review March 20, 2021 19:17
@romainx
Collaborator

romainx commented Mar 21, 2021

Hello @tbluejeans ,

Thanks, good remark; however, I'm a bit lost. Here is some context.

Previous fix

Issue #1168 was fixed in PR #1198.
The behavior of the following code was not the same before and after that fix.

from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

# Spark session & context
spark = SparkSession.builder.master('local').config("spark.sql.execution.arrow.pyspark.enabled", "true").getOrCreate()
df = spark.createDataFrame([(1, 21), (2, 30)], ("id", "age"))
def filter_func(iterator):
    for pdf in iterator:
        yield pdf[pdf.id == 1]

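# mapInPandas exchanges data with the Python workers through Arrow, which is
# why this snippet exercises the properties discussed in this thread.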
df.mapInPandas(filter_func, df.schema).show()

If we run this code before and after the fix, the behavior is as expected:

  • It does not work before the fix
  • It works after the fix

# Before the fix -> does not work
docker run -it --rm -p 8888:8888 -e JUPYTER_ENABLE_LAB=yes jupyter/pyspark-notebook:a0a544e6dc6e

# After the fix -> works
docker run -it --rm -p 8888:8888 -e JUPYTER_ENABLE_LAB=yes jupyter/pyspark-notebook:d113a601dbb8

So the fix in spark-defaults.conf, even with the weird syntax, was working.
The corresponding test was added to the unit test suite.

@pytest.mark.parametrize(
    "test_file",
    # TODO: add local_sparklyr
    ["local_pyspark", "local_spylon", "local_sparkR", "issue_1168"],
)

Now

Now, even with no spark-defaults.conf, it works! I'm guessing it's related to the bump of the Spark, Arrow, or Java versions:

  • Spark: 3.0.1 -> 3.1.1
  • Arrow: 2.0.0 -> 3.0.0
  • Java: 11.0.9.1 -> 11.0.10

However, I was unable to find a reference in a changelog or release note. Consequently, I'm in favor of finding this information before changing anything and, if the fix made in spark-defaults.conf is no longer necessary, of simply removing it. So if you have any information, please do not hesitate to share it with us; we will greatly appreciate it.
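For reference, the versions actually shipped in a given image tag can be checked quickly from a notebook. A trivial sketch, nothing specific to this repository:

import subprocess

import pyarrow
import pyspark

print("Spark:", pyspark.__version__)
print("Arrow (pyarrow):", pyarrow.__version__)
# `java -version` reports the JVM version (printed on stderr)
subprocess.run(["java", "-version"])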

References

@romainx romainx added the status:On Hold Paused until other work completes label Mar 21, 2021
@romainx romainx removed their request for review March 21, 2021 08:46
@tbluejeans
Contributor Author

tbluejeans commented Mar 21, 2021

Hi @romainx ,

Thank you for your detailed response. I spent some time today trying to understand this myself and dug into some release notes of the referenced libraries. I can basically confirm all your points above. It seems your config is read despite the syntax, and the lines added were necessary at the time to get it running with Arrow.

Concerning the question of which package update might have made these lines obsolete, I tried testing the different package versions in isolation (Arrow, Spark, Java). The Spark version seems to be the relevant change: when using Spark 3.0.2, the exception is present without the additional properties being set. With 3.1.1, I can also confirm that the lines added to the default properties are no longer required for this test case. The question remains why:

What I can see is that Netty was updated from 4.1.47.Final to 4.1.51.Final, but I cannot see anything in its release notes that explicitly references this issue. In the Spark 3.1.1 release notes, I can see that the Arrow version was updated, but still no explicit reference to the problem.

Summary
As the Spark documentation still mentions setting these properties with Java 11 for Arrow support, and you might want to stay backward-compatible with earlier Spark versions (<3.1.1), it is probably best to keep the additional properties until Spark removes them from the documentation.
Interestingly, I was having problems with the syntax of the default configs used here: I had copied those lines to add my own properties in a new container image and ran into problems, which may be specific to the properties I added. Therefore, I would prefer the syntax in this PR, but as it seems to have no functional relevance, we can also just close it. I also tested the proposed syntax with the old Spark version, and it works.
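For completeness, and assuming the properties in question are the -Dio.netty.tryReflectionSetAccessible=true JVM flags that the Spark documentation recommends for Arrow support on Java 11, the same options can also be set per session instead of in spark-defaults.conf. A rough sketch:

from pyspark.sql import SparkSession

# Note: spark.driver.extraJavaOptions only takes effect if it is set before
# the driver JVM starts; in a plain Python kernel the first SparkSession
# launches the JVM, so passing it through the builder works here.
spark = (
    SparkSession.builder
    .master("local")
    .config("spark.driver.extraJavaOptions", "-Dio.netty.tryReflectionSetAccessible=true")
    .config("spark.executor.extraJavaOptions", "-Dio.netty.tryReflectionSetAccessible=true")
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")
    .getOrCreate()
)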

A more general topic: I am working on a container image which includes the Delta Lake packages (https://delta.io/) by default; it only requires a few lines of Spark config. Does it make sense to integrate this into this container image, or is it too specific a use case?
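To illustrate the kind of configuration I mean (a sketch only; the package coordinate and version are examples and would have to match the image's Spark and Scala versions):

from pyspark.sql import SparkSession

# Illustrative Delta Lake setup; "io.delta:delta-core_2.12:1.0.0" is an
# example coordinate, not something pinned by this PR.
spark = (
    SparkSession.builder
    .master("local")
    .config("spark.jars.packages", "io.delta:delta-core_2.12:1.0.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)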

@romainx
Collaborator

romainx commented Mar 21, 2021

@tbluejeans, very clear investigation, thanks for your time.

I think it is a reasonable choice.

As the Spark documentation still mentions setting these properties with Java 11 for Arrow support, and you might want to stay backward-compatible with earlier Spark versions (<3.1.1), it is probably best to keep the additional properties until Spark removes them from the documentation.

I also agree that your proposal is better, because it has a cleaner syntax.
So thanks again, I will merge it 🚀.

For the second point:

A more general topic: I am working on a container image which includes the Delta Lake packages (https://delta.io/) by default; it only requires a few lines of Spark config. Does it make sense to integrate this into this container image, or is it too specific a use case?

The answer is that it is too specific (we try to keep the images as small and maintainable as possible); however, it could be very interesting to describe it in a recipe.

Many thanks

@romainx romainx self-requested a review March 21, 2021 16:09
Collaborator

@romainx romainx left a comment


See discussion.

@romainx romainx merged commit 3395de4 into jupyter:master Mar 21, 2021
@tbluejeans tbluejeans deleted the fix-spark-properties branch May 25, 2021 09:20