Remove VCF support #1264
+1 !!
It mainly looks like such a big diff because of all the tests and data, but that's pretty much all been ported to bio2zarr.
We discussed this in the developers meeting, and the consensus was that it would be good to merge this, as it removes a lot of unsupported code. All tests are now passing except for the ones that use NumPy 2, which will need a bio2zarr release (see sgkit-dev/bio2zarr#256).
Could anyone please point me to the suggested replacement for the deprecated functions? I'm trying to figure out how to update a Python package that relied on the Any advice would be much appreciated!
Sorry for the breakage @pmoris, we really don't want to do that! We don't have a stable Python API in the bio2zarr docs yet, but hopefully we'll get there soon. The reason for this is that we're focusing on large-scale workloads, which often don't make sense to run in the context of a single Python session. If the dataset is fairly small, then something that should be future proof is to run
Thanks for the quick response, @jeromekelleher! Our use case is this tool for DAPC analysis in Python, where we made use of your VCF reader to create sparse CSR matrices (https://gitlab.com/uhasselt-bioinfo/dapcy/-/blob/main/dapcy/geno2csr.py?ref_type=heads#L22). Our main developer has just updated a few things to address some reviewer comments, and while testing them out I noticed I could no longer use our package after installing it in a fresh environment (we hadn't explicitly pinned any packages, so it now imports modules that no longer exist). In any case, dataset size will depend on whatever data our users bring. I can give the subprocess approach a shot though!
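For anyone else landing here, a minimal sketch of what a subprocess-based approach might look like. The exact command suggested above did not survive in this thread, so the default below assumes bio2zarr's `vcf2zarr convert` command-line tool; the helper name `convert_vcf` and its `command` parameter are hypothetical and only for illustration.

```python
import subprocess
import sys
from pathlib import Path


def convert_vcf(vcf_path, out_path, command=("vcf2zarr", "convert")):
    """Run an external VCF-to-Zarr converter in a child process.

    The default `command` is an assumption (bio2zarr's `vcf2zarr convert`
    CLI); it is not taken from this thread. Pass a different tuple to use
    another converter.
    """
    args = [*command, str(vcf_path), str(out_path)]
    # check=True raises subprocess.CalledProcessError on a non-zero exit
    # status, so a failed conversion is not silently ignored.
    subprocess.run(args, check=True, capture_output=True, text=True)
    return Path(out_path)
```

Assuming the conversion succeeds, the returned path could then be opened from Python, for example with `sgkit.load_dataset`, before building the sparse matrix as before.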
As you're assuming that the genotype matrix fits into memory, I think the VCFs are going to be quite small. In this case, I think using the
This removes the VCF reading and writing functionality from sgkit, since better implementations are available (and being actively developed) in the bio2zarr and vcztools projects.
I had hoped to remove only the VCF reading functions in this PR, but the writing side is quite entwined (e.g. due to testing), so it would be much easier to remove them both at once. We haven't deprecated the VCF write functions yet (#1245), since that is waiting on releasing vcztools.
So I've marked this as a draft as it's not ready to be merged yet. It would be good to get people's feedback though as this is quite a big change.