Description
Bug description
When using CSVLogger to save an experiment to remote file storage, the core files (the checkpoints, metrics.csv, and hyp.yaml) are saved successfully, but in the middle of the first epoch it fails and raises the error "The blob type is invalid for this operation."
Note that this is not the case with version 2.0.0, which worked without any issues.
The issue appears to be caused by the fact that some remote file systems do not support the append ("a") mode:
https://github.com/Lightning-AI/lightning/blob/874825857ffc09923407ada36814e11adb66c352/src/lightning/fabric/loggers/csv_logs.py#L231
Changing the above to always use "w" resolves the issue, but of course then the whole metrics history has to be sent on every write.
One way to avoid keeping that information in memory might be to download the file, append to it, and upload it again, only for remote file storage.
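The download, append, and upload idea could be sketched roughly as follows. This is only an illustration, not the CSVLogger implementation: `append_rows_via_rewrite` and `LocalFS` are hypothetical names, `fs` is assumed to be an fsspec-style filesystem object exposing `exists()` and `open()`, and `fieldnames` is assumed to cover every column already present in the file.

```python
import csv
import io
import os
import tempfile


class LocalFS:
    """Hypothetical stand-in for an fsspec-style filesystem (demo only)."""

    def exists(self, path):
        return os.path.exists(path)

    def open(self, path, mode="r", **kwargs):
        return open(path, mode, **kwargs)


def append_rows_via_rewrite(fs, path, fieldnames, rows):
    """Emulate append on a store that only supports whole-object writes.

    Reads the existing CSV (if any), adds the new rows, and rewrites the
    file with mode "w" instead of opening it with mode "a".
    """
    existing = []
    if fs.exists(path):
        # "Download": read back the rows written so far.
        with fs.open(path, "r", newline="") as f:
            existing = list(csv.DictReader(f))

    # Build the full content in a buffer so it is sent in a single write.
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(existing + rows)

    # "Upload": overwrite the remote object with the combined content.
    with fs.open(path, "w", newline="") as f:
        f.write(buffer.getvalue())


# Demo against a temporary local directory instead of a real remote store.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "metrics.csv")
    fs = LocalFS()
    append_rows_via_rewrite(fs, path, ["step", "loss"], [{"step": 0, "loss": 0.9}])
    append_rows_via_rewrite(fs, path, ["step", "loss"], [{"step": 1, "loss": 0.7}])
    with fs.open(path, "r", newline="") as f:
        result = list(csv.DictReader(f))
```

With this approach only the new rows need to be held in memory between flushes, at the cost of re-reading and re-sending the whole file on each write. A logger could restrict this path to remote storage and keep using "a" for local files.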
What version are you seeing the problem on?
v2.1
How to reproduce the bug
# main.py
"""
Execute:
>>> export FSSPEC_ABFS='{"anon": false}'
>>> pip install pytorch_lightning adlfs
>>> python main.py
"""
import pytorch_lightning as pl
from pytorch_lightning.demos import boring_classes

# Remote output location on Azure Blob Storage (fill in the placeholders).
OUTPUT_DIR = "az://<container-name>@<name>.blob.core.windows.net/tmp/"


class TestModel(boring_classes.BoringModel):
    def training_step(self, batch, batch_idx):
        loss = self.step(batch)
        # Logging a metric makes CSVLogger write to metrics.csv on the remote store.
        self.log("train/StepLoss", loss, prog_bar=True)
        return {"loss": loss}


model = TestModel()
trainer = pl.Trainer(
    max_epochs=10,
    logger=[
        pl.loggers.CSVLogger(save_dir=OUTPUT_DIR),
    ],
)
trainer.fit(model)
Error messages and logs
azure.core.exceptions.ResourceExistsError: The blob type is invalid for this operation.
RequestId:91549e35-001e-0027-3224-077714000000
Time:2023-10-25T09:21:49.3754217Z
ErrorCode:InvalidBlobType
Content: <?xml version="1.0" encoding="utf-8"?><Error><Code>InvalidBlobType</Code><Message>The blob type is invalid for this operation.
RequestId:91549e35-001e-0027-3224-077714000000
Time:2023-10-25T09:21:49.3754217Z</Message></Error>
Environment
Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):
More info
No response
cc @Borda