Resolves ambiguous text parsing at EOF #320

zslayton · 2021-09-30T15:46:24Z

This PR modifies the text reader to be able to unambiguously
parse data found at the end of the stream. Because the
parser always operates on a fixed buffer, it could not easily
distinguish between the end of the buffer (more data to come)
and the end of the stream being loaded into the buffer (EOF).

When EOF is detected, the reader will now append a sentinel
value to the end of the stream and re-attempt parsing. If the
sentinel value is found, EndOfStream is reported. If a different
value is found, that value is returned instead and the sentinel
is discarded.
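
The sentinel approach described above can be sketched roughly as follows. This is a toy model rather than the code in this PR: `Parsed`, `Item`, `Reader`, `parse_int`, and `next_item` are hypothetical names, the "format" is just whitespace-delimited unsigned integers, and the `"\n0\n"` framing around the sentinel is an assumption made so that the sentinel is always delimited.

```rust
/// Outcome of trying to parse one value from the front of the buffer.
#[derive(Debug, PartialEq)]
enum Parsed {
    /// A complete value and the number of bytes consumed to read it.
    Value(u64, usize),
    /// The buffer ended before the value could be delimited.
    Incomplete,
}

/// Toy parser (hypothetical): reads one whitespace-delimited unsigned integer.
/// Assumes the input holds only ASCII digits and whitespace.
fn parse_int(buf: &str) -> Parsed {
    let start = buf.len() - buf.trim_start().len();
    let rest = &buf[start..];
    let digits = rest.bytes().take_while(|b| b.is_ascii_digit()).count();
    // No digits yet, or digits run to the end of the buffer: more may be coming.
    if digits == 0 || digits == rest.len() {
        return Parsed::Incomplete;
    }
    let value = rest[..digits].parse().expect("toy input fits in u64");
    Parsed::Value(value, start + digits)
}

#[derive(Debug, PartialEq)]
enum Item {
    Value(u64),
    EndOfStream,
}

struct Reader {
    buffer: String,
    chunks: std::vec::IntoIter<&'static str>, // input not yet loaded
    sentinel_appended: bool,
}

impl Reader {
    fn new(chunks: Vec<&'static str>) -> Self {
        Reader { buffer: String::new(), chunks: chunks.into_iter(), sentinel_appended: false }
    }

    fn next_item(&mut self) -> Item {
        loop {
            if let Parsed::Value(value, consumed) = parse_int(&self.buffer) {
                self.buffer.drain(..consumed);
                if self.sentinel_appended {
                    // Only whitespace left means we just parsed the sentinel itself.
                    let was_sentinel = self.buffer.trim().is_empty();
                    // Otherwise the unread sentinel is still buffered; discard it.
                    self.buffer.clear();
                    if was_sentinel {
                        return Item::EndOfStream;
                    }
                }
                return Item::Value(value);
            }
            match self.chunks.next() {
                // End of buffer, not end of stream: load more and retry.
                Some(chunk) => self.buffer.push_str(chunk),
                // EOF seen for the first time: append the sentinel and retry.
                None if !self.sentinel_appended => {
                    self.buffer.push_str("\n0\n");
                    self.sentinel_appended = true;
                }
                None => return Item::EndOfStream,
            }
        }
    }
}
```

In this sketch, a trailing `4` at the end of the input is ambiguous until the sentinel is appended, because more digits could still arrive; once `"\n0\n"` follows it, the `4` is delimited and can be returned as a real value.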

Fixes #318. See that issue for more details.

This PR is based on the read-text-annotations branch, used in PR #319. The diff below is against that branch. Once #319 is merged, I'll rebase this onto main.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

* `parse_symbol` can now match text symbol IDs (e.g. `$23`)
* Symbols read from the stream are now represented as `OwnedSymbolToken`s instead of `String`s to allow for the case in which a given symbol's ID was found and the text has not yet been looked up.
* The reader's `next()` method now returns an (annotations, item) tuple instead of just an item. If there are no annotations on an item, the annotations Vec is empty.
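
As a rough illustration of the points above, the sketch below models a symbol that may be known only by ID, and a read that returns an (annotations, item) pair. It is not the crate's API: `SymbolSketch` and `read_annotated` are hypothetical names, the `parse_symbol` here is a simplified stand-in for the real parser rule, and splitting on `::` ignores quoted symbols entirely.

```rust
/// Hypothetical stand-in for a symbol token: either its text is known,
/// or only its symbol ID has been read and the text is not yet resolved.
#[derive(Debug, PartialEq)]
enum SymbolSketch {
    Text(String),
    Id(usize),
}

/// Simplified stand-in: treats `$<digits>` as a symbol ID and
/// anything else as symbol text.
fn parse_symbol(input: &str) -> SymbolSketch {
    if let Some(digits) = input.strip_prefix('$') {
        // Check digits explicitly; `str::parse` alone would accept a leading '+'.
        if !digits.is_empty() && digits.bytes().all(|b| b.is_ascii_digit()) {
            if let Ok(id) = digits.parse() {
                return SymbolSketch::Id(id);
            }
        }
    }
    SymbolSketch::Text(input.to_string())
}

/// Naive sketch: splits `a::b::item` into (annotations, item).
/// The annotations Vec is empty when the item has none.
fn read_annotated(input: &str) -> (Vec<SymbolSketch>, SymbolSketch) {
    let mut parts: Vec<&str> = input.split("::").collect();
    let item = parts.pop().expect("split always yields at least one piece");
    let annotations = parts.into_iter().map(parse_symbol).collect();
    (annotations, parse_symbol(item))
}
```

Keeping the ID and text cases distinct lets the caller defer the symbol table lookup until the text is actually needed.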

codecov · 2021-09-30T15:51:46Z

Codecov Report

Merging #320 (793b2fd) into main (30cabc2) will decrease coverage by 0.02%.
The diff coverage is 81.39%.

@@            Coverage Diff             @@
##             main     #320      +/-   ##
==========================================
- Coverage   91.53%   91.50%   -0.03%     
==========================================
  Files          62       62              
  Lines        9282     9314      +32     
==========================================
+ Hits         8496     8523      +27     
- Misses        786      791       +5

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/text/parsers/clob.rs | `96.20% <ø> (-0.05%)` | ⬇️ |
| src/text/reader.rs | `93.01% <78.94%> (-2.42%)` | ⬇️ |
| src/text/mod.rs | `86.20% <100.00%> (+3.44%)` | ⬆️ |
| src/text/parsers/string.rs | `100.00% <100.00%> (ø)` | |
| src/text/text_buffer.rs | `99.25% <100.00%> (+0.01%)` | ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

jobarr-amzn

Looks good!

jobarr-amzn · 2021-10-01T01:15:49Z

It occurs to me to ask, though: #318 proposes a text IVM as the sentinel value, but this PR uses 0. What's the reasoning behind the selection of the sentinel value? Any value will work, so it might as well be shorter than the IVM?

zslayton · 2021-10-01T01:30:51Z

> It occurs to me to ask, though: #318 proposes a text IVM as the sentinel value, but this PR uses 0. What's the reasoning behind the selection of the sentinel value? Any value will work, so it might as well be shorter than the IVM?

Good eye! Two reasons:

1. I realized that if I'm discarding the sentinel value anyway, I might as well make it the cheapest possible thing to parse. Reading a 1-digit integer is cheaper than reading an 8-character `$ion_1_0`.
2. I haven't written the parser rule for reading an IVM yet. 😛

zslayton added 3 commits September 29, 2021 16:28

cargo fmt

6ce7714

zslayton requested review from desaikd and jobarr-amzn September 30, 2021 15:46

Removed unused import from unit test

c67a3ab

zslayton added 3 commits September 30, 2021 12:28

cargo fmt again

66698a9

More unit test coverage, removed old debug println

c78d3e0

cargo fmt again again

852ea10

Base automatically changed from read-text-annotations to main September 30, 2021 21:41

zslayton and others added 2 commits September 30, 2021 17:52

Merge branch 'main' into resolve-eof-ambiguities

630e521

cargo fmt following merge from upstream

793b2fd

jobarr-amzn approved these changes Oct 1, 2021

zslayton merged commit 2a02a57 into main Oct 1, 2021

zslayton deleted the resolve-eof-ambiguities branch October 1, 2021 01:32
