Skip to content

Wire up CRAM 3.1 codecs for reading. #1736

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
May 9, 2025
Merged

Wire up CRAM 3.1 codecs for reading. #1736

merged 8 commits into from
May 9, 2025

Conversation

cmnbroad
Copy link
Collaborator

@cmnbroad cmnbroad commented Jan 22, 2025

A few notes:

  • The carrot tests (.carrot folder and CarrotTest.java) are still a WIP.
  • The (beta API) FASTACodecV1_0 changes are to fix an issue that came up when I added the (beta api) CRAM 3.1 codec and a corresponding test; the codec did not properly handle gzipped fasta files.
  • The NameTokenisationEncoder and EncodeToken class changes are to fix an issue discovered on the separate CRAM 3.1 writer branch.
  • The CRAMContextModel class (unused for now) is an accumulator class that is threaded throughout the CRAM write code for use by write profiles that depend on having access to context.

@cmnbroad cmnbroad added the cram label Jan 22, 2025
@cmnbroad cmnbroad marked this pull request as ready for review February 3, 2025 13:52
@cmnbroad cmnbroad marked this pull request as draft February 3, 2025 13:52
@cmnbroad cmnbroad force-pushed the cn_cram_3_1_wiring branch 2 times, most recently from fd4e834 to 68c3b8a Compare February 4, 2025 13:55
@cmnbroad cmnbroad force-pushed the cn_cram_3_1_wiring branch 2 times, most recently from f012878 to d89abba Compare February 11, 2025 21:02
@cmnbroad cmnbroad force-pushed the cn_cram_3_1_wiring branch 3 times, most recently from 1a5dc3a to 79792f2 Compare February 24, 2025 19:48
@cmnbroad cmnbroad force-pushed the cn_cram_3_1_wiring branch 3 times, most recently from 840fc9d to c8ba753 Compare March 24, 2025 14:59
@cmnbroad cmnbroad marked this pull request as ready for review March 24, 2025 18:44
Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cmnbroad Some comments. One potential bug and a request to rename the new build commands so it's clear they're for a special test build.

Dockerfile Outdated
@@ -0,0 +1,13 @@
# This docker build is here to enable Carrot tests
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if there's a way to have carrot build a non-default docker image, but if that's possible I would call this something like CarrotDockerFile so people don't naively try to use it for non-carrot things.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit removed, to be revived on the write branch.

build.gradle Outdated
@@ -79,6 +79,21 @@ jar {
}
}

// A shadow jar that includes all dependencies PLUS test code, for use with carrot tests
shadowJar {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets make separate carrotShadowJar task instead of using the default. Again, people might just run shadowjar and assume it's the standard htsjdk thing.

* the newest version, since it has no write implementation yet.
*/
@Override
protected List<ReadsCodec> filterByVersion(final List<ReadsCodec> candidateCodecs, final HtsVersion htsVersion) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not for this PR, but I wonder if we should make this something that exists generically.

return buffer.array();
}

final byte[] bytes = new byte[buffer.remaining()];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I probably am misunderstanding, but doesn't remaining return the length between the current position and the end? Isn't what you want the length from arrayOffset() to limit()?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah, very good catch. Fixed.

@@ -57,6 +64,9 @@ public ExternalCompressor getCompressorForMethod(
final int compressorSpecificArg) {
switch (compressionMethod) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be a switch expression which lets you drop the default condition. It's fine though.

sharedRANSNx16Decode = new RANSNx16Decode();
}
compressorCache.put(
new Tuple(BlockCompressionMethod.RANSNx16, ransArg),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: Tuple<>

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be the sort of place where a private CompressorConfig record type could clarify things instead of the awkward tuples.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -56,6 +55,8 @@
* on the corresponding reference contig.
*/
public class Slice {
private final CRAMVersion cramVersion;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooh, can we have crams that vary version by slice?!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let us hope not. Its just for convenience - there have always been cases where slice needs to discriminate by version when reading a slice, but for some reason I thought I needed it when writing now too - although so far there are no references to it.

// reads file 1 (bam or cram)
// reads file 2 (bam or cram)
// reference
public static void main(final String[] args) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels awkwardly placed but doing it differently seems like more of a hassle. I think you've come up with a pretty good solution.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm actually reverting the carrot changes in this branch (reverting the one commit) since its not complete/working anyway. I'll revive it for the write branch though.

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cmnbroad Some comments. One potential bug and a request to rename the new build commands so it's clear they're for a special test build.

@jmarshall
Copy link
Member

The release notes for the recent 4.2.0 release note that it includes CRAM 3.1 codecs due to #1714. However it appears that, without this PR's addition to CramVersions.java in particular, the recent release remains unable to read CRAM v3.1 files.

@lbergelson
Copy link
Member

@jmarshall Ah, sorry for the confusion. It's literally just the codecs in that PR. The wiring isn't in yet. It's coming soon.

@jmarshall
Copy link
Member

No confusion 😄 I just wanted to make sure that this PR hadn't been inadvertently omitted from the release or anything similar. Particularly in the light of samtools/htslib#1900 and the related release notes for the (imminent?) upcoming htslib/samtools releases — cf samtools/htslib#1898 first paragraph of the addition.

@cmnbroad cmnbroad force-pushed the cn_cram_3_1_wiring branch from c8ba753 to 3362a10 Compare April 22, 2025 19:35
@cmnbroad cmnbroad force-pushed the cn_cram_3_1_wiring branch from 3362a10 to 144f5f8 Compare April 22, 2025 20:53
Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. I'm merging!

@lbergelson lbergelson merged commit c6729bb into master May 9, 2025
2 of 4 checks passed
@lbergelson lbergelson deleted the cn_cram_3_1_wiring branch May 9, 2025 20:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants