CSVLogger fails on remote FS on version 2.1.0 #18861

Open
@ioangatop

Bug description

When CSVLogger saves an experiment to remote file storage, it initially manages to write the core files (the checkpoints, metrics.csv and hparams.yaml), but partway through the first epoch it fails and raises the error "The blob type is invalid for this operation."

Note that this is not the case with version 2.0.0, which worked without any issues.

The issue appears to be caused by the fact that some remote filesystems do not support the append ("a") mode used here:
https://github.com/Lightning-AI/lightning/blob/874825857ffc09923407ada36814e11adb66c352/src/lightning/fabric/loggers/csv_logs.py#L231

Changing the mode above to "w" resolves the issue, but of course the whole file content then has to be sent on every write.
One alternative that avoids keeping all the information in memory would be to download the file, append to it, and upload it again, only for remote file storage.
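To illustrate the download, append, and upload idea, here is a minimal sketch. It is not Lightning's implementation. `append_rows_by_rewrite` and `LocalFS` are hypothetical names, and `fs` is assumed to be any object with fsspec-style `exists(path)` and `open(path, mode)` methods (for a real remote path it could presumably come from `fsspec.core.url_to_fs`). The local stand-in is only there so the example runs without Azure credentials.

```python
import csv
import io
import os
import tempfile


def append_rows_by_rewrite(fs, path, fieldnames, new_rows):
    """Emulate append on a filesystem that only supports whole-object writes.

    Hypothetical helper: reads the current file (if any), adds the new rows,
    and rewrites the whole file with mode "w" instead of using mode "a".
    """
    existing = ""
    if fs.exists(path):
        with fs.open(path, "r") as f:
            existing = f.read()

    buffer = io.StringIO()
    buffer.write(existing)
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    if not existing:
        # First write: the file does not exist yet, so emit the header row.
        writer.writeheader()
    writer.writerows(new_rows)

    # Single full upload; no append operation is requested from the backend.
    with fs.open(path, "w") as f:
        f.write(buffer.getvalue())


class LocalFS:
    """Stand-in exposing the two fsspec-style methods used above (assumption)."""

    def exists(self, path):
        return os.path.exists(path)

    def open(self, path, mode="r"):
        return open(path, mode, newline="")


fs = LocalFS()
path = os.path.join(tempfile.mkdtemp(), "metrics.csv")
append_rows_by_rewrite(fs, path, ["step", "loss"], [{"step": 0, "loss": 0.5}])
append_rows_by_rewrite(fs, path, ["step", "loss"], [{"step": 1, "loss": 0.4}])
with fs.open(path, "r") as f:
    result = f.read()
print(result)
```

The trade-off is that every flush re-reads and re-uploads the whole file, so this would only make sense as a fallback for filesystems where append is unavailable.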

What version are you seeing the problem on?

v2.1

How to reproduce the bug

# main.py
"""
Execute:
>>> export FSSPEC_ABFS='{"anon": false}'
>>> pip install pytorch_lightning adlfs
>>> python main.py
"""
import pytorch_lightning as pl
from pytorch_lightning.demos import boring_classes

OUTPUT_DIR = "az://<container-name>@<name>.blob.core.windows.net/tmp/"

class TestModel(boring_classes.BoringModel):
    def training_step(self, batch, batch_idx):
        loss = self.step(batch)
        self.log("train/StepLoss", loss, prog_bar=True)
        return {"loss": loss}

model = TestModel()
trainer = pl.Trainer(
    max_epochs=10,
    logger=[
        pl.loggers.CSVLogger(save_dir=OUTPUT_DIR),
    ],
)
trainer.fit(model)

Error messages and logs

azure.core.exceptions.ResourceExistsError: The blob type is invalid for this operation.
RequestId:91549e35-001e-0027-3224-077714000000
Time:2023-10-25T09:21:49.3754217Z
ErrorCode:InvalidBlobType
Content: <?xml version="1.0" encoding="utf-8"?><Error><Code>InvalidBlobType</Code><Message>The blob type is invalid for this operation.
RequestId:91549e35-001e-0027-3224-077714000000
Time:2023-10-25T09:21:49.3754217Z</Message></Error>

Environment

Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):

More info

No response

cc @Borda
