Remove VCF support #1264

tomwhite · 2024-09-24T14:22:33Z

User visible changes (including notable bug fixes) are documented in changelog.rst

This removes the VCF reading and writing functionality from sgkit, since better implementations are available (and being actively developed) in the bio2zarr and vcztools projects.

I had hoped to remove only the VCF reading functions in this PR, but the writing side is quite entwined (e.g. due to testing), so it would be much easier to remove them both at once. We haven't deprecated the VCF write functions yet (#1245), since that is waiting on releasing vcztools.

So I've marked this as a draft as it's not ready to be merged yet. It would be good to get people's feedback though as this is quite a big change.

jeromekelleher

+1 !!

It mainly looks like such a big diff because of all the tests, and data, but that's all ported to bio2zarr, pretty much.

tomwhite · 2024-10-02T14:43:22Z

We discussed this in the developers meeting and the consensus there was it would be good to merge this as it removes a lot of unsupported code.

All tests are now passing except for the ones that use NumPy 2, which will need a bio2zarr release (see sgkit-dev/bio2zarr#256).

pmoris · 2025-04-30T15:14:12Z

Could anyone please point me to the suggested replacement for the deprecated functions?

I'm trying to figure out how to update a Python package that relied on the vcf_to_zarr functionality offered in sgkit.io.vcf. The bio2zarr documentation implies that the internal APIs are not yet ready for general use, and vcztools seems to be more of a CLI tool than a Python package. I've considered pinning an older version of sgkit, but it appears that the underlying dependency cyvcf2 requires python<=3.12, and I'd rather avoid having to manage (and enforce) too many pinned dependencies...

Any advice would be much appreciated!

jeromekelleher · 2025-04-30T15:33:18Z

Sorry for the breakage @pmoris, we really don't want to do that!

We don't have a stable API in the python bio2zarr docs yet, but hopefully we'll get there soon. The reason for this is that we're focusing on large-scale stuff, which often doesn't make sense to run in the context of a single Python session.

If the dataset is fairly small, then something that should be future proof is to run python -m bio2zarr vcf2zarr convert in a subprocess. How big is your dataset? Perhaps you could open an issue on bio2zarr to track your use-case?

pmoris · 2025-05-01T07:25:57Z

Thanks for the quick response, @jeromekelleher !

Our use case is this tool for DAPC analysis in Python, where we made use of your vcf reader to create sparse CSRs (https://gitlab.com/uhasselt-bioinfo/dapcy/-/blob/main/dapcy/geno2csr.py?ref_type=heads#L22).

Our main developer has just updated a few things to address some reviewer comments and while testing them out, I noticed I could no longer use our package after installing it in a fresh environment (since we hadn't explicitly pinned any packages and it now imports non-existing modules).

In any case, dataset size will depend on what our users supply.

I can give the subprocess approach a shot though!

Are bigger files not suitable for vcf2zarr convert? What would be the limit more or less?
Can I store the output of that process in a python object, or should I be writing out a temporary file instead?
Any other suggestions to replace the lines of code I linked above would be of course welcome as well. All we really need is to generate this sparse matrix representation.

jeromekelleher · 2025-05-01T08:41:33Z

As you're assuming that the genotype matrix fits into memory, I think the VCFs are going to be quite small. In this case, I think running vcf2zarr convert in a subprocess will work well enough for now, until we have a stable Python API.

xref sgkit-dev/bio2zarr#364
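
For anyone landing here with the same question, the subprocess approach suggested above could be sketched roughly as follows. This is a minimal sketch, not an official bio2zarr API: the helper names `build_convert_command` and `convert_vcf_to_zarr` are made up for illustration, and the argument order (input VCF first, output Zarr path last) is an assumption that should be checked against `python -m bio2zarr vcf2zarr convert --help` for the installed version.

```python
import subprocess
import sys
from pathlib import Path


def build_convert_command(vcf_path, zarr_path):
    """Build the argv list for `python -m bio2zarr vcf2zarr convert`.

    Hypothetical helper. Assumes the CLI takes the input VCF followed by
    the output Zarr path; verify with `--help` before relying on it.
    """
    return [
        sys.executable,  # same interpreter as the calling process
        "-m",
        "bio2zarr",
        "vcf2zarr",
        "convert",
        str(vcf_path),
        str(zarr_path),
    ]


def convert_vcf_to_zarr(vcf_path, zarr_path):
    """Run the conversion in a subprocess and return the output path.

    Hypothetical helper. Raises subprocess.CalledProcessError if the
    command exits with a non-zero status.
    """
    cmd = build_convert_command(vcf_path, zarr_path)
    subprocess.run(cmd, check=True)
    return Path(zarr_path)
```

The conversion writes a Zarr store to disk rather than returning an in-memory object, so a temporary directory (for example from `tempfile.TemporaryDirectory`) is a reasonable place for the output when the store is only an intermediate step. Building the command as a list and passing it without `shell=True` avoids quoting problems with paths that contain spaces.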

tomwhite added the documentation and IO labels Sep 24, 2024

jeromekelleher approved these changes Sep 24, 2024


tomwhite force-pushed the remove-vcf branch from eeef90b to e80624c Compare October 2, 2024 13:57

tomwhite marked this pull request as ready for review October 2, 2024 13:59

tomwhite force-pushed the remove-vcf branch from e80624c to 107d5ff Compare October 2, 2024 14:17

tomwhite mentioned this pull request Oct 2, 2024

Run tests against numpy 2 sgkit-dev/bio2zarr#256

Closed

tomwhite added 8 commits October 2, 2024 15:30

Add a test to check sgkit can read vcf2zarr datasets

8221440

Remove sgkit/tests/io/vcf

155494d

Remove sgkit/io/vcf

fe2b024

Remove vcf_to_zarr usage

9188b74

Remove sgkit/tests/test_vcfzarr_reader.py

f26e5ab

Remove read_scikit_allel_vcfzarr

baa3895

Remove VCF docs and refer to bio2zarr and vcztools

b64aa42

Update changelog

7f59bdb

tomwhite force-pushed the remove-vcf branch from 107d5ff to 7f59bdb Compare October 2, 2024 14:30

Use dev version of bio2zarr for testing NumPy 2

9ebc74c

tomwhite merged commit f64dddb into sgkit-dev:main Oct 10, 2024
9 checks passed

This was referenced Oct 10, 2024

Fix windows wheels test #1269

Merged

Deprecate VCF write functions #1245

Closed

tomwhite mentioned this pull request Apr 7, 2025

Release 0.10.0 #1304

Closed

pmoris mentioned this pull request May 22, 2025

Windows support sgkit-dev/bio2zarr#174

Open
