Fix spark config properties syntax in Pyspark notebook Docker image #1256


Merged
merged 1 commit into jupyter:master from fix-spark-properties on Mar 21, 2021

Conversation

tbluejeans
Contributor

Hello,

Looking at the example properties in the Spark default configuration (spark-defaults.conf), I believe the syntax is incorrect and might lead to these properties being ignored.

The syntax problem only became apparent to me when I tried to add additional properties following the quoted (inverted commas) syntax. I am not sure whether my suggested changes actually fix anything, as I don't know the reason for these properties; even so, it may still be worth merging to prevent others from running into problems when adding their own properties.
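For what it's worth, one way to check whether a property from spark-defaults.conf was actually picked up is to read it back from the running session. A minimal sketch (the key below is only an example, not necessarily one touched by this PR):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local').getOrCreate()

# Read back what Spark actually parsed from spark-defaults.conf; a quoted
# value would typically show up here with the quotes kept as part of the value.
print(spark.sparkContext.getConf().get("spark.driver.extraJavaOptions", "<not set>"))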

Hope it helps!

@romainx romainx self-requested a review March 20, 2021 19:17
@romainx
Collaborator

romainx commented Mar 21, 2021

Hello @tbluejeans ,

Thanks, good remark; however, I'm a bit lost. Here is some context.

Previous fix

Issue #1168 was fixed in PR #1198.
The behavior of the following code was not the same before and after that fix.

from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

# Spark session & context
spark = SparkSession.builder.master('local').config("spark.sql.execution.arrow.pyspark.enabled", "true").getOrCreate()
df = spark.createDataFrame([(1, 21), (2, 30)], ("id", "age"))
def filter_func(iterator):
    for pdf in iterator:
        yield pdf[pdf.id == 1]

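# mapInPandas exchanges data with the Python workers through Arrow, which is
# why this snippet exercises the properties discussed in this thread.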
df.mapInPandas(filter_func, df.schema).show()

If we run this code before and after the fix, the behavior is as expected:

  • It does not work before the fix
  • It works after the fix

# Before the fix -> does not work
docker run -it --rm -p 8888:8888 -e JUPYTER_ENABLE_LAB=yes jupyter/pyspark-notebook:a0a544e6dc6e

# After the fix -> works
docker run -it --rm -p 8888:8888 -e JUPYTER_ENABLE_LAB=yes jupyter/pyspark-notebook:d113a601dbb8

So the fix in spark-defaults.conf, even with the weird syntax, was working.
The corresponding test was added to the unit test suite.

@pytest.mark.parametrize(
    "test_file",
    # TODO: add local_sparklyr
    ["local_pyspark", "local_spylon", "local_sparkR", "issue_1168"],
)

Now

Now, even with no spark-defaults.conf, it works! I'm guessing it's related to the bump of the Spark, Arrow, or Java versions:

  • Spark: 3.0.1 -> 3.1.1
  • Arrow: 2.0.0 -> 3.0.0
  • Java: 11.0.9.1 -> 11.0.10

However, I was unable to find a reference in a changelog or release note. Consequently, I'm in favor of finding this information before changing anything and, if the fix made in spark-defaults.conf is no longer necessary, of simply removing it. So if you have any information, please do not hesitate to share it with us; we will greatly appreciate it.
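For reference, the versions actually shipped in a given image tag can be checked quickly from a notebook. A trivial sketch, nothing specific to this repository:

import subprocess

import pyarrow
import pyspark

print("Spark:", pyspark.__version__)
print("Arrow (pyarrow):", pyarrow.__version__)
# `java -version` reports the JVM version (printed on stderr)
subprocess.run(["java", "-version"])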

References

@romainx romainx added the status:On Hold Paused until other work completes label Mar 21, 2021
@romainx romainx removed their request for review March 21, 2021 08:46
@tbluejeans
Contributor Author

tbluejeans commented Mar 21, 2021

Hi @romainx ,

Thank you for your detailed response. I spent some time today trying to understand this myself and dug into some release notes of the referenced libraries. I can basically confirm all your points above. It seems your config is read despite the syntax, and the lines added were necessary at the time to get it running with Arrow.

Concerning the question of which package update might have made these lines obsolete, I tried testing the different package versions in isolation (Arrow, Spark, Java). The Spark version seems to be the relevant change: when using Spark 3.0.2, the exception is present without the additional properties being set. With 3.1.1, I can also confirm that the lines added to the default properties are no longer required for this test case. The question remains why:

What I can see is that Netty was updated from 4.1.47.Final to 4.1.51.Final, but I cannot see anything in its release notes that explicitly references this issue. In the Spark 3.1.1 release notes, I can see that the Arrow version was updated, but still no explicit reference to the problem.

Summary
As the Spark documentation still mentions setting these properties with Java 11 for Arrow support, and you might want to stay backward-compatible with earlier Spark versions (<3.1.1), it is probably best to keep the additional properties until Spark removes them from the documentation.
Interestingly, I was having problems with the syntax of the default configs used here: I had copied those lines to add my own properties in a new container image and ran into problems, which may be specific to the properties I added. Therefore, I would prefer the syntax in this PR, but as it seems to have no functional relevance, we can also just close it. I also tested the proposed syntax with the old Spark version, and it works.
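For completeness, and assuming the properties in question are the -Dio.netty.tryReflectionSetAccessible=true JVM flags that the Spark documentation recommends for Arrow support on Java 11, the same options can also be set per session instead of in spark-defaults.conf. A rough sketch:

from pyspark.sql import SparkSession

# Note: spark.driver.extraJavaOptions only takes effect if it is set before
# the driver JVM starts; in a plain Python kernel the first SparkSession
# launches the JVM, so passing it through the builder works here.
spark = (
    SparkSession.builder
    .master("local")
    .config("spark.driver.extraJavaOptions", "-Dio.netty.tryReflectionSetAccessible=true")
    .config("spark.executor.extraJavaOptions", "-Dio.netty.tryReflectionSetAccessible=true")
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")
    .getOrCreate()
)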

A more general topic: I am working on a container image which includes the Delta Lake packages (https://delta.io/) by default; it only requires a few lines of Spark config. Does it make sense to integrate this into this container image, or is it too specific a use case?
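To illustrate the kind of configuration I mean (a sketch only; the package coordinate and version are examples and would have to match the image's Spark and Scala versions):

from pyspark.sql import SparkSession

# Illustrative Delta Lake setup; "io.delta:delta-core_2.12:1.0.0" is an
# example coordinate, not something pinned by this PR.
spark = (
    SparkSession.builder
    .master("local")
    .config("spark.jars.packages", "io.delta:delta-core_2.12:1.0.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)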

@romainx
Collaborator

romainx commented Mar 21, 2021

@tbluejeans, very clear investigation, thanks for your time.

I think it is a reasonable choice.

As the Spark documentation still mentions setting these properties with Java 11 for Arrow support, and you might want to stay backward-compatible with earlier Spark versions (<3.1.1), it is probably best to keep the additional properties until Spark removes them from the documentation.

I also agree that your proposal is better, because it has a cleaner syntax.
So thanks again, I will merge it 🚀.

For the second point:

A more general topic: I am working on a container image which includes the Delta Lake packages (https://delta.io/) by default; it only requires a few lines of Spark config. Does it make sense to integrate this into this container image, or is it too specific a use case?

The answer is that it is too specific (we try to keep the images as small and maintainable as possible); however, it could be very interesting to describe it in a recipe.

Many thanks

@romainx romainx self-requested a review March 21, 2021 16:09
Collaborator

@romainx romainx left a comment


See discussion.

@romainx romainx merged commit 3395de4 into jupyter:master Mar 21, 2021
@tbluejeans tbluejeans deleted the fix-spark-properties branch May 25, 2021 09:20