Remove VCF support #1264

Merged: 9 commits into sgkit-dev:main on Oct 10, 2024

@tomwhite tomwhite commented Sep 24, 2024

  • User visible changes (including notable bug fixes) are documented in changelog.rst

This removes the VCF reading and writing functionality from sgkit, since better implementations are available (and being actively developed) in the bio2zarr and vcztools projects.

I had hoped to remove only the VCF reading functions in this PR, but the writing side is quite entwined (e.g. due to testing), so it would be much easier to remove them both at once. We haven't deprecated the VCF write functions yet (#1245), since that is waiting on releasing vcztools.

So I've marked this as a draft as it's not ready to be merged yet. It would be good to get people's feedback though as this is quite a big change.

@tomwhite tomwhite added the documentation and IO labels Sep 24, 2024

@jeromekelleher jeromekelleher left a comment

+1 !!

It mainly looks like such a big diff because of all the tests, and data, but that's all ported to bio2zarr, pretty much.

tomwhite commented Oct 2, 2024

We discussed this in the developers meeting and the consensus there was it would be good to merge this as it removes a lot of unsupported code.

All tests are now passing except for the ones that use NumPy 2, which will need a bio2zarr release (see sgkit-dev/bio2zarr#256).

@tomwhite tomwhite merged commit f64dddb into sgkit-dev:main Oct 10, 2024
9 checks passed
This was referenced Oct 10, 2024
@tomwhite tomwhite mentioned this pull request Apr 7, 2025

pmoris commented Apr 30, 2025

Could anyone please point me to the suggested replacement for the deprecated functions?

I'm trying to figure out how to update a Python package that relied on the vcf_to_zarr functionality offered in sgkit.io.vcf. The bio2zarr documentation implies that the internal APIs are not yet ready for general use, and vcztools seems to be more of a CLI tool than a Python package. I've considered pinning an older version of sgkit, but it appears that the underlying dependency cyvcf2 requires python<=3.12, and I'd rather avoid having to manage (and enforce) too many pinned dependencies...

Any advice would be much appreciated!

@jeromekelleher

Sorry for the breakage @pmoris, we really don't want to do that!

We don't have a stable API in the python bio2zarr docs yet, but hopefully we'll get there soon. The reason for this is that we're focusing on large-scale stuff, which often doesn't make sense to run in the context of a single Python session.

If the dataset is fairly small, then something that should be future proof is to run python -m bio2zarr vcf2zarr convert in a subprocess. How big is your dataset? Perhaps you could open an issue on bio2zarr to track your use-case?
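A minimal sketch of the subprocess approach suggested above, in Python. Only the `python -m bio2zarr vcf2zarr convert` command comes from this thread; the helper names `build_convert_command` and `convert_vcf` are hypothetical, and the argument order (input VCF first, output Zarr path last) is an assumption that should be checked against `python -m bio2zarr vcf2zarr convert --help`.

```python
import subprocess
import sys
from pathlib import Path


def build_convert_command(vcf_path, zarr_path):
    """Return the argv list for a one-shot vcf2zarr conversion."""
    return [
        sys.executable,  # reuse the interpreter that has bio2zarr installed
        "-m", "bio2zarr", "vcf2zarr", "convert",
        str(vcf_path),   # assumption: the input VCF comes first
        str(zarr_path),  # assumption: the output store path comes last
    ]


def convert_vcf(vcf_path, zarr_path):
    """Run the conversion; raises CalledProcessError if the CLI exits non-zero."""
    subprocess.run(build_convert_command(vcf_path, zarr_path), check=True)
    return Path(zarr_path)
```

Because the conversion writes a Zarr store to disk rather than returning an object, one option is to place `zarr_path` inside a `tempfile.TemporaryDirectory()` and open the store from there before the directory is cleaned up.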


pmoris commented May 1, 2025

Thanks for the quick response, @jeromekelleher !

Our use case is this tool for DAPC analysis in Python, where we made use of your vcf reader to create sparse CSRs (https://gitlab.com/uhasselt-bioinfo/dapcy/-/blob/main/dapcy/geno2csr.py?ref_type=heads#L22).

Our main developer has just updated a few things to address some reviewer comments and while testing them out, I noticed I could no longer use our package after installing it in a fresh environment (since we hadn't explicitly pinned any packages and it now imports non-existing modules).

In any case, dataset size will depend on whatever data our users bring.

I can give the subprocess approach a shot though!

  1. Are bigger files not suitable for vcf2zarr convert? What would be the limit more or less?
  2. Can I store the output of that process in a python object, or should I be writing out a temporary file instead?
  3. Any other suggestions to replace the lines of code I linked above would of course be welcome as well. All we really need is to generate this sparse matrix representation.

@jeromekelleher

As you're assuming that the genotype matrix fits into memory, I think the VCFs are going to be quite small. In this case, I think running vcf2zarr convert in a subprocess will work well enough for now, until we have a stable Python API.

xref sgkit-dev/bio2zarr#364
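For the in-memory case discussed above, the sparse matrix could be built from the converted genotypes along these lines. This is a sketch under stated assumptions, not what dapcy or sgkit actually did: the function name `genotypes_to_csr` is hypothetical, the input is assumed to be an integer array shaped (variants, samples, ploidy) with negative values marking missing calls (as for the `call_genotype` array in the VCF Zarr layout, which would first need to be loaded from the converted store), and the encoding (non-reference allele counts, samples as rows) is chosen only for illustration.

```python
import numpy as np
from scipy import sparse


def genotypes_to_csr(call_genotype):
    """Build a (samples x variants) CSR matrix of non-reference allele counts.

    call_genotype: integer array shaped (variants, samples, ploidy).
    Negative entries are treated as missing and contribute nothing.
    """
    gt = np.asarray(call_genotype)
    # Count alleles with index > 0 per call; 0 is the reference allele.
    alt_counts = (gt > 0).sum(axis=2)       # shape: (variants, samples)
    # Transpose so that each row is a sample (an illustrative choice).
    return sparse.csr_matrix(alt_counts.T)  # shape: (samples, variants)
```

Calls that are homozygous reference or missing produce zeros, so they take no space in the CSR representation.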
