Skip to content

Resolves ambiguous text parsing at EOF #320

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Oct 1, 2021
Merged

Conversation

zslayton
Copy link
Contributor

This PR modifies the text reader to be able to unambiguously
parse data found at the end of the stream. Because the
parser always operates on a fixed buffer, it could not easily
distinguish between the end of the buffer (more data to come)
and the end of the stream being loaded into the buffer (EOF).

When EOF is detected, the reader will now append a sentinel
value to the end of the stream and re-attempt parsing. If the
sentinel value is found, EndOfStream is reported. If a different
value is found, that value is returned instead and the sentinel
is discarded.

Fixes #318. See that issue for more details.

This PR is based on the read-text-annotations branch, used in PR #319. The diff below is with that branch. Once #319 is merged, I'll rebase this with main.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

* `parse_symbol` can now match text symbol IDs (e.g. `$23`)
* Symbols read from the stream are now represented as
  `OwnedSymbolToken`s instead of `String`s to allow for the case
  in which a given symbol's ID was found and the text has not yet
  been looked up.
* The reader's `next()` method now returns an (annotations, item)
  tuple instead of just an item. If there are no annotations on
  an item, the annotations Vec is empty.
This PR modifies the text reader to be able to unambiguously
parse data found at the end of the stream. Because the
parser always operates on a fixed buffer, it could not easily
distinguish between the end of the buffer (more data to come)
and the end of the stream being loaded into the buffer (EOF).

When EOF is detected, the reader will now append a sentinel
value to the end of the stream and re-attempt parsing. If the
sentinel value is found, EndOfStream is reported. If a different
value is found, that value is returned instead and the sentinel
is discarded.

Fixes #318. See that issue for more details.
@codecov
Copy link

codecov bot commented Sep 30, 2021

Codecov Report

Merging #320 (793b2fd) into main (30cabc2) will decrease coverage by 0.02%.
The diff coverage is 81.39%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #320      +/-   ##
==========================================
- Coverage   91.53%   91.50%   -0.03%     
==========================================
  Files          62       62              
  Lines        9282     9314      +32     
==========================================
+ Hits         8496     8523      +27     
- Misses        786      791       +5     
Impacted Files Coverage Δ
src/text/parsers/clob.rs 96.20% <ø> (-0.05%) ⬇️
src/text/reader.rs 93.01% <78.94%> (-2.42%) ⬇️
src/text/mod.rs 86.20% <100.00%> (+3.44%) ⬆️
src/text/parsers/string.rs 100.00% <100.00%> (ø)
src/text/text_buffer.rs 99.25% <100.00%> (+0.01%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 30cabc2...793b2fd. Read the comment docs.

Base automatically changed from read-text-annotations to main September 30, 2021 21:41
Copy link
Contributor

@jobarr-amzn jobarr-amzn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@jobarr-amzn
Copy link
Contributor

It occurs to me to ask though- #318 proposes a text IVM as the sentinel value but this PR uses 0. What's the reasoning behind the selection of the sentinel value? Any value will work so it might as well be shorter than the IVM?

@zslayton
Copy link
Contributor Author

zslayton commented Oct 1, 2021

It occurs to me to ask though- #318 proposes a text IVM as the sentinel value but this PR uses 0. What's the reasoning behind the selection of the sentinel value? Any value will work so it might as well be shorter than the IVM?

Good eye! Two reasons:

  • I realized that if I'm discarding the sentinel value anyway, I might has well make it the cheapest possible thing to parse. Reading a 1-digit integer is cheaper than reading an 8-character $ion_1_0.
  • I haven't written the parser rule for reading an IVM yet. 😛

@zslayton zslayton merged commit 2a02a57 into main Oct 1, 2021
@zslayton zslayton deleted the resolve-eof-ambiguities branch October 1, 2021 01:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Handle incomplete values at the end of an input stream
2 participants