-
Notifications
You must be signed in to change notification settings - Fork 242
feat: Add sanity check to indexed FASTA file #1745
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
…e (according ot the index) must be close to the size of the fasta file itself. A mismatch could indicate a corrupt fasta file, or an in correct index. (tests included)
final Iterator<FastaSequenceIndexEntry> iterator = FastaSequenceIndex.iterator(); | ||
FastaSequenceIndexEntry fastaSequenceIndex = null; | ||
while (iterator.hasNext()) { | ||
fastaSequenceIndex = iterator.next(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It kinda sucks to have to do this iteration. I wonder if it would make sense to expand the PR a little bit, and modify FastaSequenceIndex
to either store index entries in a pair of (Array/ArrayList, HashMap (instead of LinkedHashMap))? Or just to hold an extra pointer to the final entry in the index?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are several operations that explicitly take advantage of the linked-list aspect...come to think of it, I my sanity-check only works on a "freshly make" object, since operations such as "rename" remove and add entries, which moved the renamed entry to the "end" of the linked-list.... I only need it to work upon initialization, but I need to add some protections. upshot is that a list would be overkill, I'll go with the reference to the last element.
while (iterator.hasNext()) { | ||
fastaSequenceIndex = iterator.next(); | ||
} | ||
assert fastaSequenceIndex != null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Asserts can be disabled at runtime - are you ok with that, or do you want to always check this?
* @param fastaFile Used for error reporting only. | ||
* @param index index file to check against the dictionary. | ||
*/ | ||
public static void sanityCheckFastaAgainstIndex(final String fastaFile, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why String and not Path or File?
throw new IllegalArgumentException("The fasta file is shorter (%d) than its index claims (%d). Please reindex the fasta.".formatted(fastaLength, lastSequenceEnd)); | ||
} | ||
// not sure why need to add 1 here. | ||
if (lastSequenceEnd + fastaSequenceIndex.getTerminatorLength() + 1 < fastaLength) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly I wonder if we should allow for more than 1? What if there are a handful of blank lines at the end? Maybe we could try a nested solution, that if the file is longer than you have here ... then we read the content after the last base and ensure it's all whitespace?
final long lastSequenceStart = fastaSequenceIndex.getLocation(); | ||
final long lastSequenceEnd = lastSequenceStart + fastaSequenceIndex.getOffset(lastSequenceLength); | ||
|
||
final long fastaLength = new File(fastaFile).length(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this safe to do? Does AbstractIndexedFastaSequenceFile
support or have sub-classes that support URL based access (like http:// or ftp://)? If so ... is there a method to call to get the file/object length?
final int basesPerLine = this.getBasesPerLine(); | ||
final int bytesPerLine = this.getBytesPerLine(); | ||
|
||
return ((pos - 1) / basesPerLine) * bytesPerLine + (pos - 1) % basesPerLine; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return ((pos - 1) / basesPerLine) * bytesPerLine + (pos - 1) % basesPerLine; | |
return ((pos - 1) / basesPerLine) * bytesPerLine + ((pos - 1) % basesPerLine); |
Just for clarity.
Thanks @tfenne. I responded to all your comments (I think!) please re-review. BTW, the test failures were due to the FTP site not responding. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - should anyone else review?
final long lastSequenceEnd = lastSequenceStart + lastSequenceIndex.getOffset(lastSequenceLength); | ||
|
||
final long fastaLength = Files.size(fastaFile); | ||
//Question: should we worry about files with lots of whitespace in their end? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't worry about "lots of whitespace" at the end. I think that is still technically valid fasta.
if (!Character.isWhitespace((char) b)) { | ||
throw new IllegalArgumentException( | ||
("The fasta file %s is too long (relative to the index). In particular has a non-whitespace " + | ||
"character (%c) as a position too great (%d) given the claims of its index (%d)." + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"character (%c) as a position too great (%d) given the claims of its index (%d)." + | |
"character (%c) at a position (%d) beyond the last base according to its index (%d)." + |
("The fasta file %s is too long (relative to the index). In particular has a non-whitespace " + | ||
"character (%c) as a position too great (%d) given the claims of its index (%d)." + | ||
" Please reindex the fasta.") | ||
.formatted(fastaFile.getFileName(), (char) b, posOfInterest + i, lastSequenceEnd)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would emit the absolute path here
@lbergelson is @tfenne 's review enough here? who else needs to review? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks sane to me I think. I need to convince myself that it works for fasta.gz though. If it doesn't it should be failing tests here so I'm assuming it works out somehow. I have a few comments. I started it running against GATK's tests to see if it hits any weirrd problems I didn't think of. broadinstitute/gatk#9179
final long fastaLength = Files.size(fastaFile); | ||
|
||
if (lastSequenceEnd > fastaLength) { | ||
throw new IllegalArgumentException("The fasta file (%s) is shorter (%d) than its index (%s) claims (%d). Please reindex the fasta.".formatted(fastaFile.toUri().toString(),fastaLength, index.toString(), lastSequenceEnd)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IllegalArgument seems like the wrong exception here. I might declare a new IncompatibleIndexException
or something like that, it would make it easier for calling code to deal with this. If for example someone wanted to automatically reindex on a failure.
.formatted(fastaFile.toUri().toString(), (char) b, posOfInterest + i, lastSequenceEnd,index.toString())); | ||
} | ||
} | ||
posOfInterest += channelBuffer.limit(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there might be minor bug here. When you clear the buffer it should set limit = capacity, but I don't think readFromPosition guarantees that the as many bytes as possible are read, it's possible for it to return fewer than requested. I think this might have the effect of skipping parts of the fasta file on incomplete reads.
Setting it with position or capturing the return value from readFromPosition would both do the right thing I think.
Unlikely to be a real issue since the bug would be to sometimes fail to detect a mismatch if there are characters embedded partway through a block of white space.
* @param fastaFile Path to fasta file | ||
* @param fastaSequenceIndexes index file to check against the fasta file. | ||
* | ||
* @throws IOException in case of io-error when reading fastaFile |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should say what it throws if it detects and error.
* @param pos the (1-based) position in the contig that is requested | ||
* @return the offset (0-based) from 'location' where pos is located in the file. | ||
*/ | ||
public long getOffset(long pos) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
* @return The name of the last entry that was added when parsing the index file. Only guarranteed to be correct just | ||
* after initialization. Protected for access from AbstractIndexedFastaSequenceFile. | ||
*/ | ||
protected String getLastSequence() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't love this with the caveats but it's fine if it's clearly labelled.
*/ | ||
protected FastaSequenceIndex() {} | ||
protected FastaSequenceIndex() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this is really only used for unit testing.
Hmn, I see a test failure that's not the broken FTP ones.
|
It looks like I think the lastSequence cache needs to be kept up to date or just dynamically checked. I assume there was a performance reason for caching it? |
The last position in the fasta file (according ot the index) must be close to the size of the fasta file itself. A mismatch could indicate a corrupt fasta file, or an in correct index.
(tests included)
Description
Please explain the changes you made here.
Explain the motivation for making this change. What existing problem does the pull request solve?
Things to think about before submitting: