Unable to Connect multiple Azure blob configured using same spark session #796

nikhilindikuzha · 2025-03-13T04:03:38Z

HI Team,

We have use case where we have created Apache iceberg table and its storage location is based on 2 azure blob storage One for hot layer and other for cold layer. We have used 2 s3 proxy end point for connecting to azure blob.

Problem: We are able to query the iceberg table if we have one s3 proxy which inturn connects to the azure blob. But we are not able pass 2 s3 points to the same spark session . Is there a way , we can achieve the same ?

gaul · 2025-03-13T06:43:58Z

Can you be more specific about what you want to do? Do you want one S3Proxy endpoint to serve two different storage backends, mapping one bucket to the first backend and the second bucket to the second backend?

nikhilindikuzha · 2025-03-13T09:01:29Z

We have created an Iceberg table that initially contains four records. After a few days, due to retention policies, we need to move the first two records (i.e., the corresponding Parquet data files) to another storage layer (cold storage) in Azure Blob Storage. Once the data files are moved, the Iceberg metadata is updated accordingly. As a result, the table now references data files located in two different storage layers.

When executing a SELECT * FROM table, Spark should be able to access and query both storage layers seamlessly.

To facilitate this, we have set up two S3 proxy endpoints, each corresponding to a different storage system. However, in Spark, we can only set a single S3 endpoint per session using:

spark.conf.set("fs.s3a.endpoint", "")
spark.conf.set("fs.s3a.access.key", "")
spark.conf.set("fs.s3a.secret.key", "")
spark.conf.set("fs.s3a.aws.credential.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCrednetialProvider")
spark.conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

This limitation prevents us from querying both storage locations within the same Spark session.

Question:
How can we configure Spark to support multiple S3 proxy endpoints simultaneously, allowing seamless querying of data from both storage layers?

gaul · 2025-03-13T15:47:01Z

Does https://github.com/gaul/s3proxy/wiki/Middleware-bucket-locator do what you want? @timuralp

chiragjn · 2025-03-29T04:22:39Z

This might also help
https://hadoop.apache.org/docs/r3.4.1/hadoop-aws/tools/hadoop-aws/connecting.html#Using_Per-Bucket_Configuration_to_access_data_round_the_world

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unable to Connect multiple Azure blob configured using same spark session #796

Unable to Connect multiple Azure blob configured using same spark session #796

nikhilindikuzha commented Mar 13, 2025

gaul commented Mar 13, 2025

nikhilindikuzha commented Mar 13, 2025 •

edited

Loading

gaul commented Mar 13, 2025

chiragjn commented Mar 29, 2025

Unable to Connect multiple Azure blob configured using same spark session #796

Unable to Connect multiple Azure blob configured using same spark session #796

Comments

nikhilindikuzha commented Mar 13, 2025

gaul commented Mar 13, 2025

nikhilindikuzha commented Mar 13, 2025 • edited Loading

gaul commented Mar 13, 2025

chiragjn commented Mar 29, 2025

nikhilindikuzha commented Mar 13, 2025 •

edited

Loading