Add synchronous parameter to MLflowLogger #19639

Merged
merged 2 commits into Lightning-AI:master from clumsy:feat/async_mlflow_logger on Apr 3, 2024

Conversation

clumsy
Contributor

@clumsy clumsy commented Mar 15, 2024

What does this PR do?

Adds an asynchronous version of MLflowLogger. It performs the I/O in a separate thread, so it does not block the caller (e.g. the main training thread).
Note that the values and timestamps of the metrics, params and tags are captured synchronously, prior to the asynchronous I/O, thus preserving the original content.

Fixes #19710

Before submitting
  • Was this discussed/agreed via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

Reviewer checklist
  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

📚 Documentation preview 📚: https://pytorch-lightning--19639.org.readthedocs.build/en/19639/

@github-actions github-actions bot added the pl label Mar 15, 2024
@awaelchli
Contributor

awaelchli commented Mar 16, 2024

@clumsy Thanks for the PR. I'm not sure we should do this. I couldn't believe that MLFlowLogger is blocking, so I did a quick search and found this: mlflow/mlflow#1550 (comment)
Maybe it's too old and things have changed, but the maintainer says there that the mlflow client is not thread-safe. Furthermore, they suggest using log_batch, which we already do in the regular MLFlowLogger. There is a synchronous=True|False flag; maybe we should use that?
https://mlflow.org/docs/latest/python_api/mlflow.client.html?highlight=log_batch#mlflow.client.MlflowClient.log_batch
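
For reference, a minimal sketch (not from this PR, and assuming the default experiment "0" exists) of what passing MLflow's own flag to log_batch looks like with mlflow>=2.8.0:

```python
import time

from mlflow.entities import Metric
from mlflow.tracking import MlflowClient

client = MlflowClient()
run = client.create_run(experiment_id="0")  # illustrative: logs into the default experiment

# Timestamps are captured by the caller, in milliseconds since the epoch.
metrics = [Metric(key="loss", value=0.42, timestamp=int(time.time() * 1000), step=1)]

# With synchronous=False (mlflow>=2.8.0) the batch is queued and the call returns
# immediately instead of blocking on the tracking-server round trip.
client.log_batch(run.info.run_id, metrics=metrics, synchronous=False)
```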

@awaelchli awaelchli added the discussion and logger: mlflow labels Mar 16, 2024
@clumsy
Contributor Author

clumsy commented Mar 18, 2024

@awaelchli The issue you linked suggests the workaround provided in this PR - i.e. don't make asynchronous calls through the fluent API like log_metrics, but use log_batch instead. And it's not even that much related to thread-safety: if you check the latest implementation, the new synchronous flag is simply passed through to the log_batch call here. The problem is that data like the timestamp is captured within log_metrics itself here, hence it may not match the time of the original invocation.

The call to log_batch (at least with synchronous=True, which is the default) is blocking: it sends the requests and awaits the responses, dispatching to the downstream implementation from here and eventually making the blocking calls here, for the REST implementation for example.

Bumping the lower bound of the mlflow dependency also works, but I think it is unnecessary.
Here's why, in my opinion:

  • data is safely published: a defensive copy is created before calling log_batch() in log_metrics()
  • the GIL provides a happens-before edge to make sure the data we captured in the calling thread will be the same in the dispatching thread
  • the current log() API does not return anything, so we don't need to block eventually or even return a Future
  • the new synchronous flag is still experimental
  • this implementation is basically the same as AsyncCheckpointIO, but for metrics/tags instead of the checkpoint
  • this implementation can be marked experimental and remain opt-in

The difference this tweak brings is very significant in our experiments: it basically makes the calls to log() that publish metrics to MLflow free from the perspective of the training thread, which is no longer blocked by this Logger's I/O.
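
For illustration, a minimal sketch of the thread-pool approach described above (the class and method names are made up for this example, not the PR's actual code): values and timestamps are captured on the calling thread, and only the log_batch I/O is handed off to a worker.

```python
import time
from concurrent.futures import ThreadPoolExecutor

from mlflow.entities import Metric
from mlflow.tracking import MlflowClient


class _AsyncBatchLogger:
    """Offloads MlflowClient.log_batch calls to a background thread."""

    def __init__(self, client: MlflowClient, run_id: str) -> None:
        self._client = client
        self._run_id = run_id
        # A single worker keeps batches ordered while freeing the training thread.
        self._executor = ThreadPoolExecutor(max_workers=1)

    def log_metrics(self, metrics: dict, step: int) -> None:
        # Defensive copy with timestamps taken now, before handing off to the worker.
        timestamp = int(time.time() * 1000)
        batch = [
            Metric(key=name, value=float(value), timestamp=timestamp, step=step)
            for name, value in metrics.items()
        ]
        self._executor.submit(self._client.log_batch, self._run_id, metrics=batch)

    def finalize(self) -> None:
        # Flush any pending batches before the run ends.
        self._executor.shutdown(wait=True)
```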

@awaelchli
Contributor

@clumsy I just vote against exposing a feature flag as an entire subclass of that logger. That's just not the Lightning way. Let's investigate adding a flag to the existing logger and passing the synchronous flag down to the log_batch method. It's fine if it's experimental; we wouldn't enable it by default.

@clumsy
Contributor Author

clumsy commented Mar 27, 2024

A small correction, @awaelchli - this PR is not using the experimental feature flag from the recent MLflow versions.
In fact, it does not use MLflow's asynchronous implementation at all, and it makes asynchronous metric, tag and parameter publishing available for any version of MLflow.

@awaelchli
Contributor

Sure, I'm fine with this being an implementation detail. I still don't see enough justification to make this an entirely new class. I stand by my recommendation: MLFlowLogger(async=True|False) with False being the default.

@clumsy
Contributor Author

clumsy commented Mar 28, 2024

How about this then, @awaelchli:

  1. We add an async flag to MLFlowLogger
  2. We check if the user passed async=True
  3. If yes, we check the installed mlflow version
  4. If >=2.8.0, we use the new synchronous=False
  5. If <2.8.0, we use a thread pool approach (this PR)

@dannyfriar

How about this then, @awaelchli:

  1. We add an async flag to MLFlowLogger
  2. We check if the user passed async=True
  3. If yes, we check the installed mlflow version
  4. If >=2.8.0, we use the new synchronous=False
  5. If <2.8.0, we use a thread pool approach (this PR)

Personally I think that adds a lot of code complexity to support it in two different ways. I'd vote for exposing a flag and then just logging a warning in the case that the mlflow version is below 2.8.0.

I'd also vote for calling the argument synchronous to match the mlflow argument.

@clumsy
Contributor Author

clumsy commented Mar 28, 2024

As for code complexity - isn't it an implementation detail? The user never sees this.

Logging a warning, or silently continuing with the synchronous workflow when the user indicated otherwise, violates the principle of least astonishment, I'm afraid. I'd rather we assert that the flag (or its value False) is not used for mlflow<2.8.0.

I like the naming suggestion to keep it consistent with mlflow.

@awaelchli
Contributor

I'm fine with either approach. I would only add the manual implementation if upgrading mlflow adds barriers for users. But no strong opinion on this.

@dannyfriar

It is an implementation detail, but it still adds more code to the project to maintain for relatively little benefit IMO (supporting async in old mlflow versions). The addition of the synchronous flag alone is a tiny change.

I think raising an exception in that case rather than logging a warning makes sense.
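
A minimal sketch of such a guard, assuming a RequirementCache-style version check (the helper name below is illustrative, not necessarily the merged code):

```python
from lightning_utilities.core.imports import RequirementCache

# True only when the installed mlflow client supports the `synchronous` kwarg.
_MLFLOW_SYNCHRONOUS_AVAILABLE = RequirementCache("mlflow>=2.8.0")


def _validate_synchronous(synchronous):
    # None means "defer to mlflow's default" and is always allowed.
    if synchronous is not None and not _MLFLOW_SYNCHRONOUS_AVAILABLE:
        raise ModuleNotFoundError(
            "`synchronous` requires mlflow>=2.8.0. Upgrade the mlflow client or drop the argument."
        )
```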

@clumsy clumsy force-pushed the feat/async_mlflow_logger branch from a9fea7e to 4232c83 on March 29, 2024 14:08
@clumsy clumsy changed the title from "feat: add AsyncMLflowLogger" to "feat: add synchronous parameter to MLflowLogger" Mar 29, 2024
@clumsy
Contributor Author

clumsy commented Mar 29, 2024

Please check the new version of this PR, @dannyfriar, @awaelchli.
Added an optional synchronous flag to MLFlowLogger that defaults to None, which leaves the decision to mlflow's discretion, e.g. if they decide to change the default behavior in the future.
When the flag is provided we pass it via kwargs where applicable (for us it's currently just log_batch), but we verify that the current mlflow version supports it.
Unfortunately this means that users of older versions of mlflow (e.g. because their MLflow endpoint is <2.8.0 and they cannot use newer clients) won't get the luxury of an asynchronous MLflow logger.
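
A rough sketch of the flag forwarding described above (simplified, with illustrative names rather than the exact merged code): when synchronous is None nothing extra is passed, so mlflow's own default applies; otherwise the value is forwarded to log_batch.

```python
from typing import List, Optional

from mlflow.entities import Metric
from mlflow.tracking import MlflowClient


class MLFlowLoggerSketch:
    def __init__(self, synchronous: Optional[bool] = None) -> None:
        # Only forward the kwarg when the user set it explicitly; assumes the
        # mlflow>=2.8.0 check has already passed in that case.
        self._log_batch_kwargs = {} if synchronous is None else {"synchronous": synchronous}
        self._client = MlflowClient()

    def log_metrics_batch(self, run_id: str, metrics: List[Metric]) -> None:
        # `metrics` is a list of mlflow.entities.Metric built by the caller.
        self._client.log_batch(run_id, metrics=metrics, **self._log_batch_kwargs)
```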

@dannyfriar

Nice - that LGTM!

@clumsy clumsy force-pushed the feat/async_mlflow_logger branch 2 times, most recently from f57862f to 5e5ccd1 on March 29, 2024 15:14
@awaelchli awaelchli added the feature and community labels and removed the discussion label Mar 29, 2024
@awaelchli awaelchli added this to the 2.3 milestone Mar 29, 2024
@awaelchli awaelchli changed the title from "feat: add synchronous parameter to MLflowLogger" to "Add synchronous parameter to MLflowLogger" Mar 29, 2024
@clumsy clumsy force-pushed the feat/async_mlflow_logger branch from 5e5ccd1 to 17a1f7e on March 29, 2024 16:20
@mergify mergify bot added the ready label Apr 3, 2024
@awaelchli awaelchli merged commit ce88483 into Lightning-AI:master Apr 3, 2024
@awaelchli
Contributor

Thank you for the feature, @clumsy, and @dannyfriar for the discussion!

@clumsy clumsy deleted the feat/async_mlflow_logger branch April 4, 2024 21:15
Labels
  • community - This PR is from the community
  • feature - Is an improvement or enhancement
  • logger: mlflow
  • pl - Generic label for PyTorch Lightning package
  • ready - PRs ready to be merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Expose client argument in MLFlowLogger
5 participants