Commit 5656879

pjbull and jayqi authored
Initial ADLS gen2 support (#453)
* minimal ADLS gen2 support
* add rigs back
* Make mocked tests work with adls
* add rigs back; make explicit no dirs
* Update testing and hns key
* format
* update mocked tests
* windows agnostic
* set gen2 var in CI
* new adls fucntionality; better tests and instantiation
* Code review comments
* Tweak HISTORY.md
* TEMP: debug test code
* don't close non-existent file
* Revert "TEMP: debug test code" (this reverts commit bb36a52)

Co-authored-by: Jay Qi <[email protected]>
1 parent f3605a6 commit 5656879

19 files changed (+635 additions, -134 deletions)

.env.example

Lines changed: 4 additions & 0 deletions
@@ -5,6 +5,10 @@ AWS_SECRET_ACCESS_KEY=your_secret_access_key
 
 AZURE_STORAGE_CONNECTION_STRING=DefaultEndpointsProtocol=https;AccountName=your_account_name;AccountKey=your_account_key;EndpointSuffix=core.windows.net
 
+# if testing with ADLS Gen2 storage, set credentials for that account here
+AZURE_STORAGE_GEN2_CONNECTION_STRING=DefaultEndpointsProtocol=https;AccountName=your_account_name;AccountKey=your_account_key;EndpointSuffix=core.windows.net
+
+
 GOOGLE_APPLICATION_CREDENTIALS=.gscreds.json
 # or
 GCP_PROJECT_ID=your_project_id

.github/workflows/tests.yml

Lines changed: 1 addition & 0 deletions
@@ -102,6 +102,7 @@ jobs:
     env:
       LIVE_AZURE_CONTAINER: ${{ secrets.LIVE_AZURE_CONTAINER }}
       AZURE_STORAGE_CONNECTION_STRING: ${{ secrets.AZURE_STORAGE_CONNECTION_STRING }}
+      AZURE_STORAGE_GEN2_CONNECTION_STRING: ${{ secrets.AZURE_STORAGE_GEN2_CONNECTION_STRING }}
       LIVE_GS_BUCKET: ${{ secrets.LIVE_GS_BUCKET }}
       LIVE_S3_BUCKET: ${{ secrets.LIVE_S3_BUCKET }}
       AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}

.gitignore

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@ docs/docs/changelog.md
 docs/docs/contributing.md
 
 # perf output
-perf-results.csv
+perf-*.csv
 
 ## GitHub Python .gitignore ##
 # https://github.com/github/gitignore/blob/master/Python.gitignore

CONTRIBUTING.md

Lines changed: 9 additions & 0 deletions
@@ -81,6 +81,15 @@ Finally, you may want to run your tests against live servers to ensure that the
 make test-live-cloud
 ```
 
+#### Azure live backend tests
+
+For Azure, you can test both against Azure Blob Storage backends and Azure Data Lake Storage Gen2 backends. To run these tests, you need to set connection strings for both of the backends by setting the following environment variables (in your `.env` file for local development). If `AZURE_STORAGE_GEN2_CONNECTION_STRING` is not set, only the blob storage backend will be tested. To set up a storage account with ADLS Gen2, go through the normal creation flow for a storage account in the Azure portal and select "Enable Hierarchical Namespace" in the "Advanced" tab of the settings when configuring the account.
+
+```bash
+AZURE_STORAGE_CONNECTION_STRING=your_connection_string
+AZURE_STORAGE_GEN2_CONNECTION_STRING=your_connection_string
+```
+
 You can copy `.env.example` to `.env` and fill in the credentials and bucket/container names for the providers you want to test against. **Note that the live tests will create and delete files on the cloud provider.**
 
 You can also skip providers you do not have accounts for by commenting them out in the `rig` and `s3_like_rig` variables defined at the end of `tests/conftest.py`.

HISTORY.md

Lines changed: 1 addition & 0 deletions
@@ -15,6 +15,7 @@
 - Changed `LocalClient` so that client instances using the default storage access the default local storage directory through the `get_default_storage_dir` rather than having an explicit reference to the path set at instantiation. This means that calling `get_default_storage_dir` will reset the local storage for all clients using the default local storage, whether the client has already been instantiated or is instantiated after resetting. This fixes unintuitive behavior where `reset_local_storage` did not reset local storage when using the default client. (Issue [#414](https://github.com/drivendataorg/cloudpathlib/issues/414))
 - Added a new `local_storage_dir` property to `LocalClient`. This will return the current local storage directory used by that client instance.
 by reference through the `get_default_ rather than with an explicit.
+- Added Azure Data Lake Storage Gen2 support (Issue [#161](https://github.com/drivendataorg/cloudpathlib/issues/161), PR [#450](https://github.com/drivendataorg/cloudpathlib/pull/450)), thanks to [@M0dEx](https://github.com/M0dEx) for PR [#447](https://github.com/drivendataorg/cloudpathlib/pull/447) and PR [#449](https://github.com/drivendataorg/cloudpathlib/pull/449)
 
 ## v0.18.1 (2024-02-26)
 

cloudpathlib/azure/azblobclient.py

Lines changed: 179 additions & 44 deletions
Large diffs are not rendered by default.
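Although the client diff is collapsed here, other files in this commit reference the helpers it adds: `AzureBlobClient._check_hns`, `_mkdir`, `_move_file`, `_hns_rmtree`, and a `data_lake_client` attribute. As a rough sketch only (not the actual implementation), hierarchical-namespace detection based on the `get_account_information()` call used in `tests/conftest.py` below might look like this:

```python
from azure.storage.blob import BlobServiceClient


def check_hns(service_client: BlobServiceClient) -> bool:
    """Sketch of HNS detection; the real AzureBlobClient._check_hns may differ,
    for example by caching the result on the client instance."""
    account_info = service_client.get_account_information()
    # "is_hns_enabled" is the key used in the live-test cleanup logic in conftest.py
    return account_info.get("is_hns_enabled", False)
```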

cloudpathlib/azure/azblobpath.py

Lines changed: 14 additions & 2 deletions
@@ -3,6 +3,8 @@
 from tempfile import TemporaryDirectory
 from typing import TYPE_CHECKING
 
+from cloudpathlib.exceptions import CloudPathIsADirectoryError
+
 try:
     from azure.core.exceptions import ResourceNotFoundError
 except ImportError:
@@ -44,8 +46,7 @@ def is_file(self) -> bool:
         return self.client._is_file_or_dir(self) == "file"
 
     def mkdir(self, parents=False, exist_ok=False):
-        # not possible to make empty directory on blob storage
-        pass
+        self.client._mkdir(self, parents=parents, exist_ok=exist_ok)
 
     def touch(self, exist_ok: bool = True):
         if self.exists():
@@ -84,6 +85,17 @@ def stat(self):
             )
         )
 
+    def replace(self, target: "AzureBlobPath") -> "AzureBlobPath":
+        try:
+            return super().replace(target)
+
+        # we can rename directories on ADLS Gen2
+        except CloudPathIsADirectoryError:
+            if self.client._check_hns():
+                return self.client._move_file(self, target)
+            else:
+                raise
+
     @property
     def container(self) -> str:
         return self._no_prefix.split("/", 1)[0]
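To illustrate the new `replace` behavior above: renaming a path that is a directory now succeeds when the account has hierarchical namespace enabled, and still raises otherwise. A usage sketch, where the container and directory names are made up for illustration:

```python
from cloudpathlib import AzureBlobPath
from cloudpathlib.exceptions import CloudPathIsADirectoryError

# hypothetical paths for illustration only; assumes Azure credentials are configured
src = AzureBlobPath("az://my-container/experiments/run-1")
dst = AzureBlobPath("az://my-container/experiments/run-1-archived")

try:
    # on an ADLS Gen2 (HNS) account this renames the whole directory via the
    # Data Lake API; on a flat Blob Storage account the directory case re-raises
    src.replace(dst)
except CloudPathIsADirectoryError:
    print("directory rename requires an ADLS Gen2 (hierarchical namespace) account")
```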

cloudpathlib/cloudpath.py

Lines changed: 1 addition & 1 deletion
@@ -251,7 +251,7 @@ def client(self):
 
     def __del__(self) -> None:
         # make sure that file handle to local path is closed
-        if self._handle is not None:
+        if self._handle is not None and self._local.exists():
             self._handle.close()
 
         # ensure file removed from cache when cloudpath object deleted

docs/docs/authentication.md

Lines changed: 7 additions & 0 deletions
@@ -211,6 +211,13 @@ client.set_as_default_client()
 cp3 = CloudPath("s3://cloudpathlib-test-bucket/")
 ```
 
+## Accessing Azure DataLake Storage Gen2 (ADLS Gen2) storage with hierarchical namespace enabled
+
+Some Azure storage accounts are configured with "hierarchical namespace" enabled. This means that the storage account is backed by the Azure DataLake Storage Gen2 product rather than Azure Blob Storage. For many operations, the two are the same and one can use the Azure Blob Storage API. However, for some operations, a developer will need to use the Azure DataLake Storage API. The `AzureBlobClient` class implemented in cloudpathlib is designed to detect if hierarchical namespace is enabled and use the Azure DataLake Storage API in the places where it is necessary or it provides a performance improvement. Usually, a user of cloudpathlib will not need to know if hierarchical namespace is enabled and the storage account is backed by Azure DataLake Storage Gen2 or Azure Blob Storage.
+
+If needed, the Azure SDK provided `DataLakeServiceClient` object can be accessed via the `AzureBlobClient.data_lake_client`. The Azure SDK provided `BlobServiceClient` object can be accessed via `AzureBlobClient.service_client`.
+
+
 ## Pickling `CloudPath` objects
 
 You can pickle and unpickle `CloudPath` objects normally, for example:
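As an aside on the new ADLS Gen2 docs section added above (the pickling example referenced by the trailing context line is outside this hunk): a short sketch of reaching the underlying SDK clients through the attributes the docs name, using a placeholder connection string:

```python
from cloudpathlib import AzureBlobClient

# placeholder credentials for illustration only
client = AzureBlobClient(connection_string="<your-connection-string>")

blob_service = client.service_client    # azure.storage.blob.BlobServiceClient
adls_service = client.data_lake_client  # azure.storage.filedatalake.DataLakeServiceClient

# assumption: the data lake client is only useful on accounts with
# hierarchical namespace enabled, per the docs text above
```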

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -33,7 +33,7 @@ dependencies = [
 ]
 
 [project.optional-dependencies]
-azure = ["azure-storage-blob>=12"]
+azure = ["azure-storage-blob>=12", "azure-storage-file-datalake>=12"]
 gs = ["google-cloud-storage"]
 s3 = ["boto3>=1.34.0"]
 all = ["cloudpathlib[azure]", "cloudpathlib[gs]", "cloudpathlib[s3]"]

tests/conftest.py

Lines changed: 73 additions & 14 deletions
@@ -1,9 +1,13 @@
 import os
 from pathlib import Path, PurePosixPath
 import shutil
+from tempfile import TemporaryDirectory
 from typing import Dict, Optional
 
 from azure.storage.blob import BlobServiceClient
+from azure.storage.filedatalake import (
+    DataLakeServiceClient,
+)
 import boto3
 import botocore
 from dotenv import find_dotenv, load_dotenv
@@ -26,8 +30,10 @@
     LocalS3Path,
 )
 import cloudpathlib.azure.azblobclient
+from cloudpathlib.azure.azblobclient import _hns_rmtree
 import cloudpathlib.s3.s3client
-from .mock_clients.mock_azureblob import mocked_client_class_factory, DEFAULT_CONTAINER_NAME
+from .mock_clients.mock_azureblob import MockBlobServiceClient, DEFAULT_CONTAINER_NAME
+from .mock_clients.mock_adls_gen2 import MockedDataLakeServiceClient
 from .mock_clients.mock_gs import (
     mocked_client_class_factory as mocked_gsclient_class_factory,
     DEFAULT_GS_BUCKET_NAME,
@@ -109,17 +115,20 @@ def create_test_dir_name(request) -> str:
     return test_dir
 
 
-@fixture()
-def azure_rig(request, monkeypatch, assets_dir):
+def _azure_fixture(conn_str_env_var, adls_gen2, request, monkeypatch, assets_dir):
     drive = os.getenv("LIVE_AZURE_CONTAINER", DEFAULT_CONTAINER_NAME)
     test_dir = create_test_dir_name(request)
 
     live_server = os.getenv("USE_LIVE_CLOUD") == "1"
 
+    connection_kwargs = dict()
+    tmpdir = TemporaryDirectory()
+
     if live_server:
         # Set up test assets
-        blob_service_client = BlobServiceClient.from_connection_string(
-            os.getenv("AZURE_STORAGE_CONNECTION_STRING")
+        blob_service_client = BlobServiceClient.from_connection_string(os.getenv(conn_str_env_var))
+        data_lake_service_client = DataLakeServiceClient.from_connection_string(
+            os.getenv(conn_str_env_var)
         )
         test_files = [
             f for f in assets_dir.glob("**/*") if f.is_file() and f.name not in UPLOAD_IGNORE_LIST
@@ -130,13 +139,25 @@ def azure_rig(request, monkeypatch, assets_dir):
                 blob=str(f"{test_dir}/{PurePosixPath(test_file.relative_to(assets_dir))}"),
             )
             blob_client.upload_blob(test_file.read_bytes(), overwrite=True)
+
+        connection_kwargs["connection_string"] = os.getenv(conn_str_env_var)
     else:
-        monkeypatch.setenv("AZURE_STORAGE_CONNECTION_STRING", "")
-        # Mock cloud SDK
+        # pass key mocked params to clients via connection string
+        monkeypatch.setenv(
+            "AZURE_STORAGE_CONNECTION_STRING", f"{Path(tmpdir.name) / test_dir};{adls_gen2}"
+        )
+        monkeypatch.setenv("AZURE_STORAGE_GEN2_CONNECTION_STRING", "")
+
         monkeypatch.setattr(
            cloudpathlib.azure.azblobclient,
            "BlobServiceClient",
-            mocked_client_class_factory(test_dir),
+            MockBlobServiceClient,
+        )
+
+        monkeypatch.setattr(
+            cloudpathlib.azure.azblobclient,
+            "DataLakeServiceClient",
+            MockedDataLakeServiceClient,
         )
 
     rig = CloudProviderTestRig(
@@ -145,19 +166,47 @@ def azure_rig(request, monkeypatch, assets_dir):
         drive=drive,
         test_dir=test_dir,
         live_server=live_server,
+        required_client_kwargs=connection_kwargs,
     )
 
-    rig.client_class().set_as_default_client()  # set default client
+    rig.client_class(**connection_kwargs).set_as_default_client()  # set default client
+
+    # add flag for adls gen2 rig to skip some tests
+    rig.is_adls_gen2 = adls_gen2
+    rig.connection_string = os.getenv(conn_str_env_var)  # used for client instantiation tests
 
     yield rig
 
     rig.client_class._default_client = None  # reset default client
 
     if live_server:
-        # Clean up test dir
-        container_client = blob_service_client.get_container_client(drive)
-        to_delete = container_client.list_blobs(name_starts_with=test_dir)
-        container_client.delete_blobs(*to_delete)
+        if blob_service_client.get_account_information().get("is_hns_enabled", False):
+            _hns_rmtree(data_lake_service_client, drive, test_dir)
+
+        else:
+            # Clean up test dir
+            container_client = blob_service_client.get_container_client(drive)
+            to_delete = container_client.list_blobs(name_starts_with=test_dir)
+            to_delete = sorted(to_delete, key=lambda b: len(b.name.split("/")), reverse=True)
+
+            container_client.delete_blobs(*to_delete)
+
+    else:
+        tmpdir.cleanup()
+
+
+@fixture()
+def azure_rig(request, monkeypatch, assets_dir):
+    yield from _azure_fixture(
+        "AZURE_STORAGE_CONNECTION_STRING", False, request, monkeypatch, assets_dir
+    )
+
+
+@fixture()
+def azure_gen2_rig(request, monkeypatch, assets_dir):
+    yield from _azure_fixture(
+        "AZURE_STORAGE_GEN2_CONNECTION_STRING", True, request, monkeypatch, assets_dir
+    )
 
 
 @fixture()
@@ -420,10 +469,20 @@ def local_s3_rig(request, monkeypatch, assets_dir):
     rig.client_class.reset_default_storage_dir()  # reset local storage directory
 
 
+# create azure fixtures for both blob and gen2 storage
+azure_rigs = fixture_union(
+    "azure_rigs",
+    [
+        azure_rig,  # azure_rig0
+        azure_gen2_rig,  # azure_rig1
+    ],
+)
+
 rig = fixture_union(
     "rig",
     [
-        azure_rig,
+        azure_rig,  # azure_rig0
+        azure_gen2_rig,  # azure_rig1
         gs_rig,
         s3_rig,
         custom_s3_rig,
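For context on the `fixture_union` pattern above (from pytest-cases): any test that takes the `rig` fixture now also runs against the new `azure_gen2_rig`, and an Azure-only test can take the narrower `azure_rigs` union instead. An illustrative sketch of such a test follows; the asserted behavior is a guess based on the commit description ("make explicit no dirs"), not code copied from the commit:

```python
def test_mkdir_on_both_azure_backends(azure_rigs):
    # runs twice: once with the Blob Storage rig, once with the ADLS Gen2 rig
    p = azure_rigs.create_cloud_path("brand_new_dir")
    p.mkdir()

    # guess at expected behavior: empty directories only materialize on
    # hierarchical-namespace (Gen2) accounts, signaled by the new rig flag
    assert p.exists() == azure_rigs.is_adls_gen2
```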

tests/mock_clients/mock_adls_gen2.py

Lines changed: 110 additions & 0 deletions
@@ -0,0 +1,110 @@
+from datetime import datetime
+from pathlib import Path, PurePosixPath
+from shutil import rmtree
+from azure.core.exceptions import ResourceNotFoundError
+from azure.storage.filedatalake import FileProperties
+
+from tests.mock_clients.mock_azureblob import _JsonCache, DEFAULT_CONTAINER_NAME
+
+
+class MockedDataLakeServiceClient:
+    def __init__(self, test_dir, adls):
+        # root is parent of the test specific directort
+        self.root = test_dir.parent
+        self.test_dir = test_dir
+        self.adls = adls
+        self.metadata_cache = _JsonCache(self.root / ".metadata")
+
+    @classmethod
+    def from_connection_string(cls, conn_str, credential):
+        # configured in conftest.py
+        test_dir, adls = conn_str.split(";")
+        adls = adls == "True"
+        test_dir = Path(test_dir)
+        return cls(test_dir, adls)
+
+    def get_file_system_client(self, file_system):
+        return MockedFileSystemClient(self.root, self.metadata_cache)
+
+
+class MockedFileSystemClient:
+    def __init__(self, root, metadata_cache):
+        self.root = root
+        self.metadata_cache = metadata_cache
+
+    def get_file_client(self, key):
+        return MockedFileClient(key, self.root, self.metadata_cache)
+
+    def get_directory_client(self, key):
+        return MockedDirClient(key, self.root)
+
+    def get_paths(self, path, recursive=False):
+        yield from (
+            MockedFileClient(
+                PurePosixPath(f.relative_to(self.root)), self.root, self.metadata_cache
+            ).get_file_properties()
+            for f in (self.root / path).glob("**/*" if recursive else "*")
+        )
+
+
+class MockedFileClient:
+    def __init__(self, key, root, metadata_cache) -> None:
+        self.key = key
+        self.root = root
+        self.metadata_cache = metadata_cache
+
+    def get_file_properties(self):
+        path = self.root / self.key
+
+        if path.exists() and path.is_dir():
+            fp = FileProperties(
+                **{
+                    "name": self.key,
+                    "size": 0,
+                    "ETag": "etag",
+                    "Last-Modified": datetime.fromtimestamp(path.stat().st_mtime),
+                    "metadata": {"hdi_isfolder": True},
+                }
+            )
+            fp["is_directory"] = True  # not part of object def, but still in API responses...
+            return fp
+
+        elif path.exists():
+            fp = FileProperties(
+                **{
+                    "name": self.key,
+                    "size": path.stat().st_size,
+                    "ETag": "etag",
+                    "Last-Modified": datetime.fromtimestamp(path.stat().st_mtime),
+                    "metadata": {"hdi_isfolder": False},
+                    "Content-Type": self.metadata_cache.get(self.root / self.key, None),
+                }
+            )
+
+            fp["is_directory"] = False
+            return fp
+        else:
+            raise ResourceNotFoundError
+
+    def rename_file(self, new_name):
+        new_path = self.root / new_name[len(DEFAULT_CONTAINER_NAME + "/") :]
+        (self.root / self.key).rename(new_path)
+
+
+class MockedDirClient:
+    def __init__(self, key, root) -> None:
+        self.key = key
+        self.root = root
+
+    def delete_directory(self):
+        rmtree(self.root / self.key)
+
+    def exists(self):
+        return (self.root / self.key).exists()
+
+    def create_directory(self):
+        (self.root / self.key).mkdir(parents=True, exist_ok=True)
+
+    def rename_directory(self, new_name):
+        new_path = self.root / new_name[len(DEFAULT_CONTAINER_NAME + "/") :]
+        (self.root / self.key).rename(new_path)
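To show how this mock plugs in: `tests/conftest.py` above patches `DataLakeServiceClient` in `cloudpathlib.azure.azblobclient` with this class and packs the mock's parameters into the connection string as `<local test dir>;<adls flag>`. A standalone sketch of that handshake follows, assuming the module can be imported outside the pytest fixtures and that `_JsonCache` needs no extra setup:

```python
from pathlib import Path
from tempfile import TemporaryDirectory

from tests.mock_clients.mock_adls_gen2 import MockedDataLakeServiceClient

with TemporaryDirectory() as tmp:
    test_dir = Path(tmp) / "test_run"
    test_dir.mkdir()

    # same "<test dir>;<adls flag>" convention that conftest.py writes into
    # AZURE_STORAGE_CONNECTION_STRING for mocked runs
    client = MockedDataLakeServiceClient.from_connection_string(
        f"{test_dir};True", credential=None
    )

    fs = client.get_file_system_client("container")
    dir_client = fs.get_directory_client("test_run/new_dir")
    dir_client.create_directory()
    assert dir_client.exists()
```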
