Add synchronous parameter to MLflowLogger #19639

Merged
merged 2 commits into Lightning-AI:master from clumsy:feat/async_mlflow_logger on Apr 3, 2024

Conversation

clumsy
Contributor

@clumsy clumsy commented Mar 15, 2024

What does this PR do?

Adds an asynchronous version of MLflowLogger. It performs the I/O in a separate thread, so it does not block the caller (e.g. the main training thread).
Note that the values and timestamps of the metrics, params and tags are captured synchronously, prior to the asynchronous I/O, thus preserving the original content.

Fixes #19710

Before submitting
  • Was this discussed/agreed via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

Reviewer checklist
  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

📚 Documentation preview 📚: https://pytorch-lightning--19639.org.readthedocs.build/en/19639/

@github-actions github-actions bot added the pl label Mar 15, 2024
@awaelchli
Contributor

awaelchli commented Mar 16, 2024

@clumsy Thanks for the PR. I'm not sure we should do this. I couldn't believe that MLFlowLogger is blocking, so I did a quick search and found this: mlflow/mlflow#1550 (comment)
Maybe it's too old and things have changed, but the maintainer says there that the mlflow client is not thread-safe. Furthermore, they suggest using log_batch, which we already do in the regular MLFlowLogger. There is a synchronous=True|False flag; maybe we should use that?
https://mlflow.org/docs/latest/python_api/mlflow.client.html?highlight=log_batch#mlflow.client.MlflowClient.log_batch
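
For reference, a minimal sketch (not from this PR, and assuming the default experiment "0" exists) of what passing MLflow's own flag to log_batch looks like with mlflow>=2.8.0:

```python
import time

from mlflow.entities import Metric
from mlflow.tracking import MlflowClient

client = MlflowClient()
run = client.create_run(experiment_id="0")  # illustrative: logs into the default experiment

# Timestamps are captured by the caller, in milliseconds since the epoch.
metrics = [Metric(key="loss", value=0.42, timestamp=int(time.time() * 1000), step=1)]

# With synchronous=False (mlflow>=2.8.0) the batch is queued and the call returns
# immediately instead of blocking on the tracking-server round trip.
client.log_batch(run.info.run_id, metrics=metrics, synchronous=False)
```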

@awaelchli awaelchli added the discussion and logger: mlflow labels Mar 16, 2024
@clumsy
Contributor Author

clumsy commented Mar 18, 2024

@awaelchli The issue you linked suggests the workaround provided in this PR - i.e. don't make asynchronous calls through the fluent API like log_metrics, but use log_batch instead. And it's not even that much related to thread-safety: if you check the latest implementation, the new synchronous flag is simply passed through to the log_batch call here. The problem is that data like the timestamp is captured within log_metrics itself here, hence it may not match the time of the original invocation.

The call to log_batch (at least with synchronous=True, which is the default) is blocking: it sends the requests and awaits the responses, dispatching to the downstream implementation from here and eventually making the blocking calls here, for the REST implementation for example.

Bumping the lower bound of the mlflow dependency also works, but I think it is unnecessary.
Here's why, in my opinion:

  • data is safely published: a defensive copy is created before calling log_batch() in log_metrics()
  • the GIL provides a happens-before edge to make sure the data we captured in the calling thread will be the same in the dispatching thread
  • the current log() API does not return anything, so we don't need to block eventually or even return a Future
  • the new synchronous flag is still experimental
  • this implementation is basically the same as AsyncCheckpointIO, but for metrics/tags instead of the checkpoint
  • this implementation can be marked experimental and remain opt-in

The difference this tweak brings is very significant in our experiments: it basically makes the calls to log() that publish metrics to MLflow free from the perspective of the training thread, which is no longer blocked by this Logger's I/O.
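
For illustration, a minimal sketch of the thread-pool approach described above (the class and method names are made up for this example, not the PR's actual code): values and timestamps are captured on the calling thread, and only the log_batch I/O is handed off to a worker.

```python
import time
from concurrent.futures import ThreadPoolExecutor

from mlflow.entities import Metric
from mlflow.tracking import MlflowClient


class _AsyncBatchLogger:
    """Offloads MlflowClient.log_batch calls to a background thread."""

    def __init__(self, client: MlflowClient, run_id: str) -> None:
        self._client = client
        self._run_id = run_id
        # A single worker keeps batches ordered while freeing the training thread.
        self._executor = ThreadPoolExecutor(max_workers=1)

    def log_metrics(self, metrics: dict, step: int) -> None:
        # Defensive copy with timestamps taken now, before handing off to the worker.
        timestamp = int(time.time() * 1000)
        batch = [
            Metric(key=name, value=float(value), timestamp=timestamp, step=step)
            for name, value in metrics.items()
        ]
        self._executor.submit(self._client.log_batch, self._run_id, metrics=batch)

    def finalize(self) -> None:
        # Flush any pending batches before the run ends.
        self._executor.shutdown(wait=True)
```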

@awaelchli
Contributor

@clumsy I just vote against exposing a feature flag as an entire subclass of that logger. That's just not the Lightning way. Let's investigate adding a flag to the existing logger and passing the synchronous flag down to the log_batch method. It's fine if it's experimental; we wouldn't enable it by default.

@clumsy
Contributor Author

clumsy commented Mar 27, 2024

A small correction, @awaelchli - this PR is not using the experimental feature flag from the recent MLflow versions.
In fact, it does not use MLflow's asynchronous implementation at all, and it makes asynchronous metric, tag and parameter publishing available for any version of MLflow.

@awaelchli
Contributor

Sure, I'm fine with this being an implementation detail. I still don't see enough justification to make this an entirely new class. I stand by my recommendation: MLFlowLogger(async=True|False) with False being the default.

@clumsy
Contributor Author

clumsy commented Mar 28, 2024

How about this then, @awaelchli:

  1. We add an async flag to MLFlowLogger
  2. We check if the user passed async=True
  3. If yes, we check the installed mlflow version
  4. If >=2.8.0, we use the new synchronous=False
  5. If <2.8.0, we use a thread pool approach (this PR)

@dannyfriar

How about this then, @awaelchli:

  1. We add an async flag to MLFlowLogger
  2. We check if the user passed async=True
  3. If yes, we check the installed mlflow version
  4. If >=2.8.0, we use the new synchronous=False
  5. If <2.8.0, we use a thread pool approach (this PR)

Personally I think that adds a lot of code complexity to support it in two different ways. I'd vote for exposing a flag and then just logging a warning in the case that the mlflow version is below 2.8.0.

I'd also vote for calling the argument synchronous to match the mlflow argument.

@clumsy
Contributor Author

clumsy commented Mar 28, 2024

As for code complexity - isn't it an implementation detail? The user never sees this.

Logging a warning, or silently continuing with the synchronous workflow when the user indicated otherwise, violates the principle of least astonishment, I'm afraid. I'd rather we assert that the flag (or its value False) is not used for mlflow<2.8.0.

I like the naming suggestion to keep it consistent with mlflow.

@awaelchli
Contributor

I'm fine with either approach. I would only add the manual implementation if upgrading mlflow adds barriers for users. But no strong opinion on this.

@dannyfriar

It is an implementation detail, but it still adds more code to the project to maintain for relatively little benefit IMO (supporting async in old mlflow versions). The addition of the synchronous flag alone is a tiny change.

I think raising an exception in that case rather than logging a warning makes sense.
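
A minimal sketch of such a guard, assuming a RequirementCache-style version check (the helper name below is illustrative, not necessarily the merged code):

```python
from lightning_utilities.core.imports import RequirementCache

# True only when the installed mlflow client supports the `synchronous` kwarg.
_MLFLOW_SYNCHRONOUS_AVAILABLE = RequirementCache("mlflow>=2.8.0")


def _validate_synchronous(synchronous):
    # None means "defer to mlflow's default" and is always allowed.
    if synchronous is not None and not _MLFLOW_SYNCHRONOUS_AVAILABLE:
        raise ModuleNotFoundError(
            "`synchronous` requires mlflow>=2.8.0. Upgrade the mlflow client or drop the argument."
        )
```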

@clumsy clumsy force-pushed the feat/async_mlflow_logger branch from a9fea7e to 4232c83 on March 29, 2024 14:08
@clumsy clumsy changed the title from "feat: add AsyncMLflowLogger" to "feat: add synchronous parameter to MLflowLogger" Mar 29, 2024
@clumsy
Contributor Author

clumsy commented Mar 29, 2024

Please check the new version of this PR, @dannyfriar, @awaelchli.
Added an optional synchronous flag to MLFlowLogger that defaults to None, which leaves the decision to mlflow's discretion, e.g. if they decide to change the default behavior in the future.
When the flag is provided we pass it via kwargs where applicable (for us it's currently just log_batch), but we verify that the current mlflow version supports it.
Unfortunately this means that users of older versions of mlflow (e.g. because their MLflow endpoint is <2.8.0 and they cannot use newer clients) won't get the luxury of an asynchronous MLflow logger.
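
A rough sketch of the flag forwarding described above (simplified, with illustrative names rather than the exact merged code): when synchronous is None nothing extra is passed, so mlflow's own default applies; otherwise the value is forwarded to log_batch.

```python
from typing import List, Optional

from mlflow.entities import Metric
from mlflow.tracking import MlflowClient


class MLFlowLoggerSketch:
    def __init__(self, synchronous: Optional[bool] = None) -> None:
        # Only forward the kwarg when the user set it explicitly; assumes the
        # mlflow>=2.8.0 check has already passed in that case.
        self._log_batch_kwargs = {} if synchronous is None else {"synchronous": synchronous}
        self._client = MlflowClient()

    def log_metrics_batch(self, run_id: str, metrics: List[Metric]) -> None:
        # `metrics` is a list of mlflow.entities.Metric built by the caller.
        self._client.log_batch(run_id, metrics=metrics, **self._log_batch_kwargs)
```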

@dannyfriar

Nice - that LGTM!

@clumsy clumsy force-pushed the feat/async_mlflow_logger branch 2 times, most recently from f57862f to 5e5ccd1 on March 29, 2024 15:14
@awaelchli awaelchli added the feature and community labels and removed the discussion label Mar 29, 2024
@awaelchli awaelchli added this to the 2.3 milestone Mar 29, 2024
@awaelchli awaelchli changed the title from "feat: add synchronous parameter to MLflowLogger" to "Add synchronous parameter to MLflowLogger" Mar 29, 2024
@clumsy clumsy force-pushed the feat/async_mlflow_logger branch from 5e5ccd1 to 17a1f7e on March 29, 2024 16:20
@mergify mergify bot added the ready label Apr 3, 2024
@awaelchli awaelchli merged commit ce88483 into Lightning-AI:master Apr 3, 2024
@awaelchli
Contributor

Thank you for the feature, @clumsy, and @dannyfriar for the discussion!

@clumsy clumsy deleted the feat/async_mlflow_logger branch April 4, 2024 21:15
Labels
  • community - This PR is from the community
  • feature - Is an improvement or enhancement
  • logger: mlflow
  • pl - Generic label for PyTorch Lightning package
  • ready - PRs ready to be merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Expose client argument in MLFlowLogger
5 participants