Skip to content

feat: Add sanity check to indexed FASTA file #1745

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

yfarjoun
Copy link
Contributor

@yfarjoun yfarjoun commented May 9, 2025

The last position in the fasta file (according ot the index) must be close to the size of the fasta file itself. A mismatch could indicate a corrupt fasta file, or an in correct index.

(tests included)

Description

Please explain the changes you made here.
Explain the motivation for making this change. What existing problem does the pull request solve?

Things to think about before submitting:

  • Make sure your changes compile and new tests pass locally.
  • Add new tests or update existing ones:
    • A bug fix should include a test that previously would have failed and passes now.
    • New features should come with new tests that exercise and validate the new functionality.
  • Extended the README / documentation, if necessary
  • Check your code style.
  • Write a clear commit title and message
    • The commit message should describe what changed and is targeted at htsjdk developers
    • Breaking changes should be mentioned in the commit message.

…e (according ot the index) must be close to the size of the fasta file itself. A mismatch could indicate a corrupt fasta file, or an in correct index.

(tests included)
@yfarjoun yfarjoun requested a review from tfenne May 9, 2025 01:48
Comment on lines 162 to 166
final Iterator<FastaSequenceIndexEntry> iterator = FastaSequenceIndex.iterator();
FastaSequenceIndexEntry fastaSequenceIndex = null;
while (iterator.hasNext()) {
fastaSequenceIndex = iterator.next();
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It kinda sucks to have to do this iteration. I wonder if it would make sense to expand the PR a little bit, and modify FastaSequenceIndex to either store index entries in a pair of (Array/ArrayList, HashMap (instead of LinkedHashMap))? Or just to hold an extra pointer to the final entry in the index?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are several operations that explicitly take advantage of the linked-list aspect...come to think of it, I my sanity-check only works on a "freshly make" object, since operations such as "rename" remove and add entries, which moved the renamed entry to the "end" of the linked-list.... I only need it to work upon initialization, but I need to add some protections. upshot is that a list would be overkill, I'll go with the reference to the last element.

while (iterator.hasNext()) {
fastaSequenceIndex = iterator.next();
}
assert fastaSequenceIndex != null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Asserts can be disabled at runtime - are you ok with that, or do you want to always check this?

* @param fastaFile Used for error reporting only.
* @param index index file to check against the dictionary.
*/
public static void sanityCheckFastaAgainstIndex(final String fastaFile,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why String and not Path or File?

throw new IllegalArgumentException("The fasta file is shorter (%d) than its index claims (%d). Please reindex the fasta.".formatted(fastaLength, lastSequenceEnd));
}
// not sure why need to add 1 here.
if (lastSequenceEnd + fastaSequenceIndex.getTerminatorLength() + 1 < fastaLength) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly I wonder if we should allow for more than 1? What if there are a handful of blank lines at the end? Maybe we could try a nested solution, that if the file is longer than you have here ... then we read the content after the last base and ensure it's all whitespace?

final long lastSequenceStart = fastaSequenceIndex.getLocation();
final long lastSequenceEnd = lastSequenceStart + fastaSequenceIndex.getOffset(lastSequenceLength);

final long fastaLength = new File(fastaFile).length();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this safe to do? Does AbstractIndexedFastaSequenceFile support or have sub-classes that support URL based access (like http:// or ftp://)? If so ... is there a method to call to get the file/object length?

final int basesPerLine = this.getBasesPerLine();
final int bytesPerLine = this.getBytesPerLine();

return ((pos - 1) / basesPerLine) * bytesPerLine + (pos - 1) % basesPerLine;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return ((pos - 1) / basesPerLine) * bytesPerLine + (pos - 1) % basesPerLine;
return ((pos - 1) / basesPerLine) * bytesPerLine + ((pos - 1) % basesPerLine);

Just for clarity.

@yfarjoun
Copy link
Contributor Author

yfarjoun commented May 9, 2025

Thanks @tfenne. I responded to all your comments (I think!) please re-review.

BTW, the test failures were due to the FTP site not responding.

Copy link
Member

@tfenne tfenne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - should anyone else review?

final long lastSequenceEnd = lastSequenceStart + lastSequenceIndex.getOffset(lastSequenceLength);

final long fastaLength = Files.size(fastaFile);
//Question: should we worry about files with lots of whitespace in their end?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't worry about "lots of whitespace" at the end. I think that is still technically valid fasta.

if (!Character.isWhitespace((char) b)) {
throw new IllegalArgumentException(
("The fasta file %s is too long (relative to the index). In particular has a non-whitespace " +
"character (%c) as a position too great (%d) given the claims of its index (%d)." +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"character (%c) as a position too great (%d) given the claims of its index (%d)." +
"character (%c) at a position (%d) beyond the last base according to its index (%d)." +

("The fasta file %s is too long (relative to the index). In particular has a non-whitespace " +
"character (%c) as a position too great (%d) given the claims of its index (%d)." +
" Please reindex the fasta.")
.formatted(fastaFile.getFileName(), (char) b, posOfInterest + i, lastSequenceEnd));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would emit the absolute path here

@yfarjoun
Copy link
Contributor Author

@lbergelson is @tfenne 's review enough here? who else needs to review?

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks sane to me I think. I need to convince myself that it works for fasta.gz though. If it doesn't it should be failing tests here so I'm assuming it works out somehow. I have a few comments. I started it running against GATK's tests to see if it hits any weirrd problems I didn't think of. broadinstitute/gatk#9179

final long fastaLength = Files.size(fastaFile);

if (lastSequenceEnd > fastaLength) {
throw new IllegalArgumentException("The fasta file (%s) is shorter (%d) than its index (%s) claims (%d). Please reindex the fasta.".formatted(fastaFile.toUri().toString(),fastaLength, index.toString(), lastSequenceEnd));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IllegalArgument seems like the wrong exception here. I might declare a new IncompatibleIndexException or something like that, it would make it easier for calling code to deal with this. If for example someone wanted to automatically reindex on a failure.

.formatted(fastaFile.toUri().toString(), (char) b, posOfInterest + i, lastSequenceEnd,index.toString()));
}
}
posOfInterest += channelBuffer.limit();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there might be minor bug here. When you clear the buffer it should set limit = capacity, but I don't think readFromPosition guarantees that the as many bytes as possible are read, it's possible for it to return fewer than requested. I think this might have the effect of skipping parts of the fasta file on incomplete reads.

Setting it with position or capturing the return value from readFromPosition would both do the right thing I think.

Unlikely to be a real issue since the bug would be to sometimes fail to detect a mismatch if there are characters embedded partway through a block of white space.

* @param fastaFile Path to fasta file
* @param fastaSequenceIndexes index file to check against the fasta file.
*
* @throws IOException in case of io-error when reading fastaFile
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should say what it throws if it detects and error.

* @param pos the (1-based) position in the contig that is requested
* @return the offset (0-based) from 'location' where pos is located in the file.
*/
public long getOffset(long pos) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

* @return The name of the last entry that was added when parsing the index file. Only guarranteed to be correct just
* after initialization. Protected for access from AbstractIndexedFastaSequenceFile.
*/
protected String getLastSequence() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love this with the caveats but it's fine if it's clearly labelled.

*/
protected FastaSequenceIndex() {}
protected FastaSequenceIndex() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this is really only used for unit testing.

@lbergelson
Copy link
Member

Hmn, I see a test failure that's not the broken FTP ones.


Gradle suite > Gradle test > htsjdk.samtools.util.SequenceUtilTest > testCalculateNmTag FAILED
    java.lang.IllegalArgumentException: The fasta file file:///home/runner/work/htsjdk/htsjdk/src/test/resources/htsjdk/samtools/SequenceUtil/reference_with_lower_and_uppercase.fasta is too long (relative to the index). In particular has a non-whitespace character (a) at a position (44) beyond the last base (44), according to its index (htsjdk.samtools.reference.FastaSequenceIndex@1f). Please reindex the fasta.
        at htsjdk.samtools.reference.AbstractIndexedFastaSequenceFile.sanityCheckFastaAgainstIndex(AbstractIndexedFastaSequenceFile.java:190)
        at htsjdk.samtools.reference.IndexedFastaSequenceFile.sanityCheckFastaAgainstIndex(IndexedFastaSequenceFile.java:47)
        at htsjdk.samtools.reference.IndexedFastaSequenceFile.<init>(IndexedFastaSequenceFile.java:84)
        at htsjdk.samtools.reference.IndexedFastaSequenceFile.<init>(IndexedFastaSequenceFile.java:116)
        at htsjdk.samtools.reference.ReferenceSequenceFileFactory.getReferenceSequenceFile(ReferenceSequenceFileFactory.java:144)
        at htsjdk.samtools.reference.ReferenceSequenceFileFactory.getReferenceSequenceFile(ReferenceSequenceFileFactory.java:175)
        at htsjdk.samtools.reference.ReferenceSequenceFileFactory.getReferenceSequenceFile(ReferenceSequenceFileFactory.java:105)
        at htsjdk.samtools.reference.ReferenceSequenceFileFactory.getReferenceSequenceFile(ReferenceSequenceFileFactory.java:93)
        at htsjdk.samtools.reference.ReferenceSequenceFileFactory.getReferenceSequenceFile(ReferenceSequenceFileFactory.java:82)
        at htsjdk.samtools.util.SequenceUtilTest.testCalculateNmTag(SequenceUtilTest.java:496)

@lbergelson
Copy link
Member

It looks like FastaSequenceIndexCreator uses the no args constructor so now it causes explosions. Example

I think the lastSequence cache needs to be kept up to date or just dynamically checked. I assume there was a performance reason for caching it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants