Fix spark config properties syntax in Pyspark notebook Docker image #1256
Conversation
Hello @tbluejeans, thanks, good remark, however I'm a bit lost. Here is some context.

Previous fix

The issue #1168 was fixed in the PR #1198.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

# Spark session & context
spark = SparkSession.builder.master('local').config("spark.sql.execution.arrow.pyspark.enabled", "true").getOrCreate()

df = spark.createDataFrame([(1, 21), (2, 30)], ("id", "age"))

def filter_func(iterator):
    for pdf in iterator:
        yield pdf[pdf.id == 1]

df.mapInPandas(filter_func, df.schema).show()
```

If we try to run this code before and after the fix, it behaves as expected:
```bash
# Before the fix -> does not work
docker run -it --rm -p 8888:8888 -e JUPYTER_ENABLE_LAB=yes jupyter/pyspark-notebook:a0a544e6dc6e
# After the fix -> works
docker run -it --rm -p 8888:8888 -e JUPYTER_ENABLE_LAB=yes jupyter/pyspark-notebook:d113a601dbb8
```

So the fix, even with the weird syntax, seems to be taken into account, see docker-stacks/all-spark-notebook/test/test_spark_notebooks.py, lines 12 to 15 in 9fe5186.
Now

Now, even with no additional properties set in spark-defaults.conf, the example above seems to work.
However, I was unable to find a reference to this change in a change log or release note. In consequence, I'm in favor of finding this information before changing anything and, if it's not necessary anymore, of simply removing the fix.
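A quick way to check whether this property (rather than something else in the image) is what makes the example pass might be to set it at session creation instead of in spark-defaults.conf. This is only a sketch: per the Spark documentation, spark.driver.extraJavaOptions must be set before the driver JVM starts, so it has to run in a fresh kernel with no SparkContext already created.

```python
from pyspark.sql import SparkSession

# Sketch: pass the JVM flag programmatically instead of via spark-defaults.conf
# to isolate its effect. Run in a fresh kernel, otherwise the already-running
# driver JVM will not pick up spark.driver.extraJavaOptions.
spark = (
    SparkSession.builder.master("local")
    .config("spark.driver.extraJavaOptions", "-Dio.netty.tryReflectionSetAccessible=true")
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")
    .getOrCreate()
)
```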
Hi @romainx, Thank you for your detailed response. I spent some time today trying to understand this myself and dug into some release notes of the referenced libraries. I can basically confirm all your points above. It seems your config is read despite the syntax, and the lines added were necessary at the time to get it running with Arrow.

Concerning the question of which package update might have made these lines obsolete, I tried testing different package versions in isolation (Arrow, Spark, Java). The Spark version seems to be the relevant change. When using Spark version 3.0.2, the exception is present without the additional properties being set. With version 3.1.1, I can also confirm that the lines added to the default properties are no longer required for this test case. The question remains why: what I can see is that netty has been updated between these Spark versions.

Summary

A more general topic: I am working on a container image which includes the Delta Lake packages (https://delta.io/) by default. Only a few lines of Spark config are required. Does it make sense to integrate this into this container image, or is it too specific a use case?
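For anyone reproducing this comparison, a small check like the following (assuming SPARK_HOME is set, as it is in the pyspark-notebook image) shows which Spark and netty versions an image actually ships:

```python
import glob
import os

import pyspark

# Print the PySpark version and the bundled netty jars; the jar file names
# encode the netty version, so this makes the before/after comparison concrete.
print(pyspark.__version__)
spark_home = os.environ["SPARK_HOME"]
for jar in sorted(glob.glob(os.path.join(spark_home, "jars", "*netty*.jar"))):
    print(os.path.basename(jar))
```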
@tbluejeans very clear investigation, thanks for your time. I think it is a reasonable choice.
I also agree that your proposal is better, because it has a cleaner syntax.

For the second point, the answer is that it is too specific (we try to keep the images as small and maintainable as possible); however, it could be very interesting to describe it in a recipe. Many thanks!
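As an illustration of what such a recipe might contain (not part of this PR), the Delta Lake quickstart suggests roughly the following configuration; the delta-core package coordinates and version below are assumptions and would need to match the Spark version shipped in the image:

```python
from pyspark.sql import SparkSession

# Illustrative sketch of the "few lines of Spark config" mentioned above,
# based on the Delta Lake quickstart; the delta-core version is an assumption.
spark = (
    SparkSession.builder.master("local")
    .config("spark.jars.packages", "io.delta:delta-core_2.12:1.0.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
    .getOrCreate()
)

# Round-trip a tiny table to confirm the Delta format is usable.
spark.range(5).write.format("delta").mode("overwrite").save("/tmp/delta-table")
spark.read.format("delta").load("/tmp/delta-table").show()
```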
See discussion.
Hello,
Following the example properties in the Spark default configuration file, I believe the syntax is incorrect and might lead to these properties being ignored.
The syntax problem only occurred to me when trying to add additional properties following the syntax with inverted commas. I am not sure if my suggested changes actually fix anything, as I don't know the reason for these properties. Even so, it may still be worth merging to prevent others from having problems when adding properties.
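One way to check whether properties from spark-defaults.conf are actually picked up, rather than silently ignored, is to list what the running context reports. A minimal sketch, filtering on the Java-options keys discussed here:

```python
from pyspark.sql import SparkSession

# spark-defaults.conf expects whitespace-separated key/value pairs, e.g.:
#   spark.driver.extraJavaOptions -Dio.netty.tryReflectionSetAccessible=true
# Listing the effective configuration shows whether a line was parsed as
# intended (a malformed line shows up with an unexpected key or value).
spark = SparkSession.builder.master("local").getOrCreate()
for key, value in sorted(spark.sparkContext.getConf().getAll()):
    if "extraJavaOptions" in key:
        print(key, "=", value)
```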
Hope it helps!