Skip to content

Add annotated YAML loading #6

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Jul 2, 2024
Merged

Add annotated YAML loading #6

merged 6 commits into from
Jul 2, 2024

Conversation

Ethiraric
Copy link
Contributor

Add support for adding a Marker to every Yaml node through MarkedYaml. The process was as follows:

  • Add a new MarkedYaml structure that owns a Marker and data (a Yaml-like).
  • The Yaml-like field must have an Array(Vec<MarkedYaml>) variant and not an Array(Vec<Yaml>) one as this would mean that child nodes are not annotated. A YamlData<Node> enum was added, which mimics Yaml but uses Node instead of Yaml for the Array and Hash variants.
  • Change the YamlLoader so that it is generic over the type of node that we parse.
  • Add a LoadableNode trait that gathers methods that a Node must support (conversion from a basic Yaml object, access to the hash and array fields, ...).
  • Implement LoadableNode for both Yaml and MarkedYaml.
  • Add Marker-specific methods to LoadableNode to populate the marker field of MarkedYaml and use a dummy function for Yaml.
  • Change YamlLoader<Node> so that it calls that function
  • Add load_from_* functions to MarkedYaml.

I also moved load_from_* methods for Yaml from YamlLoader to Yaml. This makes it clearer to call (Yaml::load_from_str as opposed to YamlLoader::load_from_str) and also mimics the API of MarkedYaml.

This PR makes changes in almost all of the codebase. I apologize to those who we working and on whose code I have wreaked havoc.

I have released saphyr-parser v0.0.2. This was necessary to expose Marker and have it be #[derive(Default)].

However, writing tests for this feature is kind of useless at this time. I wrote a basic one and it turns out the Markers returned by the parser are not always correct. When loading foo: bar, it reports a DocumentStart at offset 3 instead of 0. foo is correctly marked at offset 0, though. Calling MarkedYaml::load_from_* correctly populates MarkedYaml with the incorrect Markers the parser returns. I'll have to fix that and make a new release of saphyr-parser prior to writing tests for MarkedYaml.

This fixes Ethiraric/yaml-rust2#27 that, to this day, I still cannot transfer to this repository.

cc @felipesere @yannham

 * Make `YamlLoader` generic on the type of the `Node`. This is required
   because deeper node need to have annotations too.
 * Add a `LoadableYamlNode` trait, required for YAML node types to be
   loaded by `YamlLoader`. It contains methods required by `YamlLoader`
   during loading.
 * Implement `LoadableYamlNode` for `Yaml`.
 * Take `load_from_str` out of `YamlLoader` for parsing non-annotated
   nodes. This avoids every user to specify the generics in
   `YamlLoader::<Yaml>::load_from_str`.
A few changes have had to be made to `LoadableYamlNode`:
  * The `From<Yaml>` requirement has been removed as it can be
    error-prone. It was not a direct conversion as it is unable to
    handle `Yaml::Hash` or `Yaml::Array` with a non-empty array/map.
  * Instead, `from_bare_yaml` was added, which does essentially the same
    as `From` but does not leak for users of the library.
  * `with_marker` has been added to populate the marker for the `Node`.
    The function is empty for `Yaml`.

`load_from_*` methods have been added to `MarkedYaml` for convenience.
They load YAML using the markers.

The markers returned from `saphyr-parser` are not all correct, meaning
that tests are kind of useless for now as they will fail due to bugs
outside of the scope of this library.
This would make more sense in user code:
```rs
Yaml::load_from_str("foo"); // Explicit that we're parsing YAML
load_from_str("foo"); // Too implicit, too generic, may be from another
                         lib
```

Plus, this mirrors `MarkedYaml`'s behavior.
@Ethiraric Ethiraric requested a review from davvid June 13, 2024 20:54
@felipesere
Copy link

Can we also have the end markers? 😅

@Ethiraric
Copy link
Contributor Author

I would have liked to have end markers, but as you may have noticed when implementing it on your side, it is rather complicated.

The parser can hardly provide that information. We don't build a full AST and functions that scan tokens sometimes need to look some characters ahead and can't provide accurate end markers. We could change scanning functions to store potential markers and update them when context is known, but I know not if it's worth the price.

Alternatively (and as you implemented), we can "guess" the end of marker from the node's contents and the starting position. But it doesn't help that YAML allows false to be written as !!bool\n0. Even plain scalars can have newlines in the source document that are not part of the content. We'd also have to know the indentation at the time of scanning which also affects the end marker. Escape sequences also are an issue with strings (\ua1b2 is 6 characters in the source, 1 in the data).

If we were to add a SpannedYaml now, it would be very approximate. I've striven for this crate to be working for even the smallest details and I don't think that now is a good time to implement that. I'm not saying it will never be, but I think we'll have to rework the parser a bit before we proceed to that.

Now the good news is that LoadableNode is part of the public API. Since Markers are added to the Node after the contents of the node, you may create your own implementation that uses the contents to guess the end marker. Alternatively, you can "post-process" a parsed MarkedYaml and derive the end marker of a node based on the start marker of the next node.

self.data.is_hash()
}

fn is_badvalue(&self) -> bool {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rename this as bad_value (even the counterpart is BadValue or something along the line).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I'm gonna rename this to bad_value. Null can sometimes be a "bad value" and that underscore would suggest that it means "is the object a [general concept of] bad value". Without underscore, badvalue more directly refers to the BadValue variant, imo.

Copy link

@yannham yannham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not very familiar with the parsing bit, so I can't say much here, but in general LGTM and I would totally make use of this. Thanks for doing it!

Comment on lines +36 to +37
/// Notable differences with [`Yaml`]:
/// * Indexing cannot return `BadValue` and will panic instead.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I doesn't have to be in this PR specifically - I'm happy to get this version for now, and leave this for future improvement. But I wonder if you couldn't just add a trait for that? trait BadValue { fn bad_value() -> Self } and add this constraint to Node. If it's important, that is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turns out this is more difficult to implement than I thought since I'd need to initialize the Marker in MarkedYaml, but the fields are private.

I'll leave it as-is, but other than that, the function can be added to LoadableYamlNode.

}

// I don't know if it's okay to implement that, but we need it for the hashmap.
impl Eq for MarkedYaml {}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the time, the offending types for Eq are types that can contain floats transitively (because this is the typical primitive value that isn't Eq). Which isn't the case here, so I believe it's entirely fine.

src/loader.rs Outdated
Comment on lines 147 to 150
if self.doc_stack.is_empty() {
self.doc_stack.push(node);
} else {
let parent = self.doc_stack.last_mut().unwrap();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: you can avoid the unwrap by reorganizing the if-then-else:

if let Some(parent) = self.doc_stack.last_mut() {
  //...code of the current else...
}
else {
    self.doc_stack.push(node);
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like nitpicks like these. I rely too much on unwrap sometimes. Good catch!

///
/// [`Array`]: `Yaml::Array`
/// [`Hash`]: `Yaml::Hash`
fn from_bare_yaml(yaml: Yaml) -> Self;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if it's worth it, but interestingly with the parametrized Yaml representation, you can define a type that is exactly "a shallow YAML tree of 1 level that doesn't contain any more content", which is YamlData<()>, so that from_bare_yaml could return a YamlData<()>, which would better reflect what it actually represents

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really think that is worth it, as it would require more constraints on the Node generic at the time of loading. We would need to add Node: From<YamlData<()>>, which I don't need that we want.

@Ethiraric Ethiraric merged commit 17b4ae4 into master Jul 2, 2024
3 checks passed
@yannham
Copy link

yannham commented Jul 3, 2024

@Ethiraric very excited about it! I don't know any solid and active serde-compatible library for YAML/JSON that handle position. Do you plan to make a release with this PR?

@Ethiraric
Copy link
Contributor Author

I can make a release. Beware however, it seems like markers are out-of-sync. The markers we receive from the parser do not match the input.

It needs a release in the parser, but I'm hoping to get my refactor prior to the next release. I've changed the way the input is handled so that I can optimize loading for in-memory strings, where we know in advance the bounds of the buffer.

@yannham
Copy link

yannham commented Jul 4, 2024

Beware however, it seems like markers are out-of-sync. The markers we receive from the parser do not match the input.

Sorry, I'm not sure to understand this part.

@Ethiraric
Copy link
Contributor Author

The parser fails to properly keep track of the markers. I tried writing tests for this feature, including matching the markers, and it failed.

For instance

foo: bar

May return a marker {index: 0, line: 1, col: 1} for foo, which would be correct, but {index: 3, line: 1, col: 3} for bar, which is incorrect.

We receive that data from saphyr-parser. It's the parser that needs fixing, and nothing is actionable in this crate directly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Exposing Marker in Yaml
5 participants