Skip to content

Add parameter sftp_prefetch to SFTPToGCSOperator #33274

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Aug 19, 2023
Merged

Conversation

functicons
Copy link
Contributor

@functicons functicons commented Aug 9, 2023

The current SFTPHook.retrieve_file(...) has SFTP prefetch enabled which makes it fail to download SFTP files in some cases. This change allows the user to set prefetch and sets the default to True to keep backwards compatibility.


@boring-cyborg boring-cyborg bot added area:providers provider:google Google (including GCP) related issues labels Aug 9, 2023
@boring-cyborg
Copy link

boring-cyborg bot commented Aug 9, 2023

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: [email protected]
    Slack: https://s.apache.org/airflow-slack

Copy link
Contributor

@ahidalgob ahidalgob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi!

  • Do you know in what situations (and how common they are) does prefetch=True make the download fail? I think it's a great idea to add this option but not sure if we want to change the current default behavior, as this would decrease performance by default.
  • The prefetch argument in SFTP.Client.get() was added in paramiko 2.8, and right now the sftp provider package constraints paramiko>=2.6 (through the ssh provider), so this could result in a broken installation of this package (unless there's some implicit dependency somewhere). We would need to add this explicit constraint in the provider dependencies.

@@ -151,7 +154,7 @@ def _copy_single_object(
)

with NamedTemporaryFile("w") as tmp:
sftp_hook.retrieve_file(source_path, tmp.name)
sftp_hook.get_conn().get(source_path, tmp.name, prefetch=self.sftp_prefetch)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the SFTPHook provides a nice interface so users don't interact with SFTPClient directly, so I would choose to extend SFTPHook.retrieve_file instead.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree

@functicons
Copy link
Contributor Author

Hi!

  • Do you know in what situations (and how common they are) does prefetch=True make the download fail? I think it's a great idea to add this option but not sure if we want to change the current default behavior, as this would decrease performance by default.
  • The prefetch argument in SFTP.Client.get() was added in paramiko 2.8, and right now the sftp provider package constraints paramiko>=2.6 (through the ssh provider), so this could result in a broken installation of this package (unless there's some implicit dependency somewhere). We would need to add this explicit constraint in the provider dependencies.

Should I add paramiko>=2.8 in the sftp provider or update the one in ssh provider? The former makes more sense to me, but I'm not sure if that's valid in Airflow.

@functicons
Copy link
Contributor Author

AFAIK, it happens for some files, the exact cause is not clear, but disabling prefetch solves the problem. I reverted the default to True to keep backwards compatibility.

  • Do you know in what situations (and how common they are) does prefetch=True make the download fail? I think it's a great idea to add this option but not sure if we want to change the current default behavior, as this would decrease performance by default.

@potiuk
Copy link
Member

potiuk commented Aug 10, 2023

Should I add paramiko>=2.8 in the sftp provider or update the one in ssh provider? The former makes more sense to me, but I'm not sure if that's valid in Airflow.

Yes. please update it in provider.yaml for the provider - also pre-commit should update it automatically in generated/provider_dependencies.json after you do (make sure you have pre-commit installed).

@functicons functicons requested review from ahidalgob and potiuk August 11, 2023 16:32
@potiuk
Copy link
Member

potiuk commented Aug 13, 2023

I suggest to install pre-commit (It will fix static checks for you) - also docs neeed fixing it seems.

@functicons
Copy link
Contributor Author

Check failures have been fixed.

@functicons
Copy link
Contributor Author

@potiuk anything else needed from my side? Or could you help merge the PR?

@functicons
Copy link
Contributor Author

functicons commented Aug 18, 2023

Can anyone help merge this PR? We need it to unblock some users ASAP. Thanks!

@potiuk
Copy link
Member

potiuk commented Aug 19, 2023

While this is cool that you created a PR and contributed, it's not cool to say:

We need it to unblock some users ASAP

This is your problem with ASAP. Please be patient and empathetic towards maintainers. Everyone here have a lot on our plate (and in case you missed it - last week most of the maintainers were focused on 2.7.0 release) and to be honest one new feature in one of 5000 or so operators is not a reason to speed up things. Making such statements might at most get people annoyed that you are trying to put a pressure on all-volunteer people who do stuff mostly in their free time so that you can use the software for free.

Merging will not mean that your users will be unblocked. It will need also to get released and you will have to upgrade the provider. Are you going to expect that in order to solve your problem also release managers should drop everything and run release now and ASAP just to release that one thing ?

And you are absolutely not blocked @functicons. You can easily copy the whole new SFTPToGCSOperator and use your own copy until it is released, so by trying to exercise pressure, you basically expect people to work for free to drop everything else they are doing and do stuff faster so that you can avoid a little annoyance on your side.

I recommend a little more understanding on how things work here. Your "pressure" might work backwards. You might get people reluctant to look positively at your PR if you do so.

@functicons
Copy link
Contributor Author

functicons commented Aug 19, 2023

While this is cool that you created a PR and contributed, it's not cool to say:

We need it to unblock some users ASAP

This is your problem with ASAP. Please be patient and empathetic towards maintainers. Everyone here have a lot on our plate (and in case you missed it - last week most of the maintainers were focused on 2.7.0 release) and to be honest one new feature in one of 5000 or so operators is not a reason to speed up things. Making such statements might at most get people annoyed that you are trying to put a pressure on all-volunteer people who do stuff mostly in their free time so that you can use the software for free.

Merging will not mean that your users will be unblocked. It will need also to get released and you will have to upgrade the provider. Are you going to expect that in order to solve your problem also release managers should drop everything and run release now and ASAP just to release that one thing ?

And you are absolutely not blocked @functicons. You can easily copy the whole new SFTPToGCSOperator and use your own copy until it is released, so by trying to exercise pressure, you basically expect people to work for free to drop everything else they are doing and do stuff faster so that you can avoid a little annoyance on your side.

I recommend a little more understanding on how things work here. Your "pressure" might work backwards. You might get people reluctant to look positively at your PR if you do so.

Sorry, I am new to the Airflow community. Take your time. Sorry again and thank you!

@functicons
Copy link
Contributor Author

functicons commented Aug 19, 2023

I recommend a little more understanding on how things work here.

My question is mainly about why the PR has been approved but not merged after several days, what's the common reasons for the delay between approval and merge? Are there concerns or other checks need to be done? (not putting pressure on or blaming anybody, just trying to understand how things work in this community, other contributors might run into the same situation and have the same question).

Please be patient and empathetic towards maintainers.

I believe helping contributors understand the reasons will make them more patient and empathetic. In my case, I just don't know the reason. (again not putting pressure on or blaming anybody, just trying to make things work better here).

@potiuk
Copy link
Member

potiuk commented Aug 19, 2023

My question is mainly about why the PR has been approved but not merged after several days, what's the common reasons for the delay between approval and merge? (not putting pressure or blaming anybody, just trying to understand how things work in this community, other contributors might have the same question).

I recommend a little more understanding on how things work here.

My question is mainly about why the PR has been approved but not merged after several days, what's the common reasons for the delay between approval and merge? (not putting pressure or blaming anybody, just trying to understand how things work in this community, other contributors might have the same question).

Please be patient and empathetic towards maintainers.

I believe helping contributors understand the reasons will make them more patient and empathetic. In my case, I just don't know the reason. (again not putting pressure or blaming anybody, just trying to make things work better here).

I believe everything I wrote was precisely about that. Explaining why it happens and why you should be patient.

It is also explained very shortly in the helpful list of suggestions when you submitted this PR (just scroll up a few pages and you will see this)

Be patient and persistent. It might take some time to get a review or get the final approval from Committers.

It's also explained in more detail in the CONTRIBUTING documentation (also linked in the helpful "first commit" message above.

But also if you need more explanation I recommend you to watch my talk https://airflowsummit.org/sessions/2022/hey-maintainer-exercise-your-empathy/ from Airflow Summit where I explain how important is empathy in those relations.

I encourage you to go through those and read/watch them. We go to a great lenght (including writing documentation, giving talks and even responding to messages like this explaining that empathy is needed and why) so I hope the message is passed. If you have other ways how we can do it, I am all ears. Any concrete suggestions how we can inform people like you more that this is how it works, are welcome.

No hard feeling and don't take it personal. I just wanted to make you aware that it might look very much like trying to pressure people here by saying "my customers are blocked, this is needed ASAP". They are not blocked (as I explained) and ASAP is not a word that you use in open-source when you have volunteers working on the project. It might work in a customer world but here, cooperation, empathy goes much longer way than ASAP (and related) words.

@potiuk potiuk merged commit 533afb5 into apache:main Aug 19, 2023
@boring-cyborg
Copy link

boring-cyborg bot commented Aug 19, 2023

Awesome work, congrats on your first merged pull request! You are invited to check our Issue Tracker for additional contributions.

@functicons
Copy link
Contributor Author

functicons commented Aug 21, 2023

If you have other ways how we can do it, I am all ears. Any concrete suggestions how we can inform people like you more that this is how it works, are welcome.

@potiuk I think that it's mainly about setting the right expectation. It would be helpful if there is an auto-generated message telling people the expected/average delay (e.g., 7 days) in a certain state, and how long is considered too long (e.g., >14 days), they can have the right expectation, and act accordingly.

@potiuk
Copy link
Member

potiuk commented Aug 21, 2023

If you have other ways how we can do it, I am all ears. Any concrete suggestions how we can inform people like you more that this is how it works, are welcome.

@potiuk I think that it's mainly about setting the right expectation. It would be helpful if there is an auto-generated message telling people the expected/average delay (e.g., 7 days) in a certain state, and how long is considered too long (e.g., >14 days), they can have the right expectation, and act accordingly.

Can you pleas emake a PR with proposal to https://github.com/apache/airflow/blob/main/.github/boring-cyborg.yml#L645 on how it might look like?

@potiuk
Copy link
Member

potiuk commented Aug 21, 2023

We can iterate (as usual) on the proposal. But as everything in the OSS, it starts with a PR.

@potiuk
Copy link
Member

potiuk commented Aug 21, 2023

Just to explain - you are probably one of the best people in the world now who could propose something like that, since you felt frustrated by not knowing that and having gone through the process recently, you probably know best what words there woudl have made it clear and what kind of messaging would be right. We can work out some technicalities of how long etc. but having a message constructed in the way yoy think is good for people like you, without the assumptions maintainers have in their heads is probably best start.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:providers provider:google Google (including GCP) related issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants