Skip to content

GeoJSON to/from custom struct using serde #199

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Aug 29, 2022
Merged

GeoJSON to/from custom struct using serde #199

merged 10 commits into from
Aug 29, 2022

Conversation

michaelkirk
Copy link
Member

@michaelkirk michaelkirk commented Aug 2, 2022

  • I agree to follow the project's code of conduct.
  • I added an entry to CHANGES.md if knowledge of this change could be valuable to users.

This is an attempt to improve the ergonomics of parsing GeoJson using serde (FIXES #184) .

This PR is a draft because there are a lot of error paths related to invalid parsing that I'd like to add tests for, but I first wanted to check in on overall direction of the API. What do people think? I think this is ready for review!

Examples

Given some geojson like this:

let feature_collection_string = r#"{
    "type": "FeatureCollection",
    "features": [
        {
           "type": "Feature",
           "geometry": {
             "type": "Point",
             "coordinates": [125.6, 10.1]
           },
           "properties": {
             "name": "Dinagat Islands",
             "age": 123
           }
        },
        {
           "type": "Feature",
           "geometry": {
             "type": "Point",
             "coordinates": [2.3, 4.5]
           },
           "properties": {
             "name": "Neverland",
             "age": 456
           }
         }
   ]
}"#
.as_bytes();

let io_reader = std::io::BufReader::new(feature_collection_string);

Before

Deserialization

You used to parse it like this:

use geojson:: FeatureIterator;
let feature_reader = FeatureIterator::new(io_reader);
for feature in feature_reader.features() {
    let feature = feature.expect("valid geojson feature");

    let name = feature.property("name").unwrap().as_str().unwrap();
    let age = feature.property("age").unwrap().as_u64().unwrap();
    let geometry = feature.geometry.value.try_into().unwrap();

    if name == "Dinagat Islands" {
        assert_eq!(123, age);
        assert_matches!(geometry, geo_types::Point::new(125.6, 10.1).into());
    } else if name == "Neverland" {
        assert_eq!(456, age);
        assert_matches!(geometry, geo_types::Point::new(2.3, 4.5).into());
    } else {
        panic!("unexpected name: {}", name);
    }
}

Serialization

Then, to write it back to geojson, you'd have to either do all your processing strictly with the geojson types, or somehow convert your entities from and back to one of the GeoJson entities:

Something like:

// The implementation of this method is potentially a little messy / boilerplatey
let feature_collection: geojson::FeatureCollection = some_custom_method_to_convert_to_geojson(&my_structs);

// This part is easy enough though
serde_json::to_string(&geojson);

After

But now you also have the option of parsing it into your own declarative struct using serde like this:

Declaration

use geojson::{ser::serialize_geometry, de::deserialize_geometry};
use serde::{Serialize, Deserialize};

#[derive(Serialize, Deserialize)]
struct MyStruct {
    // You can parse directly to geo_types via these helpers, otherwise this field will need to be a `geojson::Geometry`
    #[serde(serialize_with = "serialize_geometry", deserialize_with = "deserialize_geometry")]
    geometry: geo_types::Point<f64>,
    name: String,
    age: u64,
}

Deserialization

for feature in geojson::de::deserialize_feature_collection::<MyStruct>(io_reader).unwrap() {
    let my_struct = feature.expect("valid geojson feature");

    if my_struct.name == "Dinagat Islands" {
        assert_eq!(my_struct.age, 123);
        assert_eq!(my_struct.geometry, geo_types::Point::new(125.6, 10.1));
    } else if my_struct.name == "Neverland" {
        assert_eq!(my_struct.age, 456);
        assert_eq!(my_struct.geometry, geo_types::Point::new(2.3, 4.5));
    } else {
        panic!("unexpected name: {}", my_struct.name);
    }
}

Serialization

let my_structs: Vec<MyStruct> = get_my_structs();
geojson::ser::to_feature_collection_writer(writer, &my_structs).expect("valid serialization");

Caveats

Performance

Performance currently isn't great. There's a couple of things which seem ridiculous in the code that I've marked with PERF: that I don't have an immediate solution for. This is my first time really diving into the internals of serde and it's kind of a lot! My hope is that performance improvements would be possible with no or little changes to the API.

Some specific numbers (from some admittedly crude benchmarks):

Old Deserialization:

FeatureReader::features (countries.geojson)                                                                                                                                  
                        time:   [5.8497 ms 5.8607 ms 5.8728 ms]                                                                                                              

New Deserialization:

FeatureReader::deserialize (countries.geojson)                                                                                                                               
                        time:   [7.1702 ms 7.1865 ms 7.2035 ms]                                                                                                              

Old serialization:

serialize geojson::FeatureCollection struct (countries.geojson)                                                                             
                        time:   [3.1471 ms 3.1552 ms 3.1637 ms]

New serialization:

serialize custom struct (countries.geojson)                                                                             
                        time:   [3.8076 ms 3.8144 ms 3.8219 ms]

So the new "ergonomic" serialization/deserialization takes about 1.2x the time as the old way. Though it's actually probably a bit better than that because with the new form you have all your data ready to use. With the old way, you still need to go through this dance before you can start your analysis:

let name = feature.property("name").unwrap().as_str().unwrap();
let age = feature.property("age").unwrap().as_u64().unwrap();
let geometry = feature.geometry.value.try_into().unwrap();

Anyway, I think this kind of speed difference is well worth the improved usability for my use cases.

Foreign Members

This doesn't support anything besides geometry and properties - e.g. foreign members are dropped. I'm hopeful that this is useful even with that limitation, but if not, maybe we can think of a way to accommodate most people's needs.

@mapoulos
Copy link

I like this a lot! The only possible feature I can think of at the moment is the ability to have a concrete geo type (eg a polygon) instead of the Geometry enum, which would save some conversions if you new what the type was ahead of time (I could see that being tricky to implement).

I would have definitely liked something like this when I was doing geospatial work last summer.

@michaelkirk
Copy link
Member Author

have a concrete geo type (eg a polygon) instead of the Geometry enum, which would save some conversions if you new what the type was ahead of time

That's a great idea. I think I can add that pretty easily actually. Thanks for looking!

@michaelkirk
Copy link
Member Author

have a concrete geo type (eg a polygon) instead of the Geometry enum, which would save some conversions if you new what the type was ahead of time

I've pushed up another commit that does just this.

@b4l
Copy link
Member

b4l commented Aug 10, 2022

I like the idea a lot. Do you have in mind to add serialization capabilities as well? I was wondering how a roundtrip would look like in regards to foreign members. Could they be preserved with the current approach to potentially achieve a lossless roundtrip?

@michaelkirk
Copy link
Member Author

michaelkirk commented Aug 10, 2022

Thanks for looking @b4l!

Could [foreign members] be preserved with the current approach to potentially achieve a lossless roundtrip?

Not with the current approach. An example of the current behavior to highlight this:

{ 
   "type": "Feature",
    "foo": "bar",   // <-- foreign member "foo"
    properties: {
        "foo": 123  // <-- property "foo"
    }
}

The approach in this PR flattens properties into the top level of the struct, and simply ignores any foreign members:

MyStruct {
    geometry: Geometry,
    foo: u64
}

I can see why this behavior might not work for some use cases. The use case driving this PR is to have processing code that is agnostic (or requires only minimal changes) between formats. e.g. I'd like to have code that can read it's input from either GeoJSON or CSV (and some day maybe other formats as well) while only changing a couple lines of code.

Having multiple sources of inputs requires a sort of "lowest common denominator" of what can be accounted for, and foreign members fall outside of that.

If you did want to keep all the structural information, like foreign members, you can still do that same as always. You have to deserialize to a geojson::Feature or maybe even just a json::Object depending on your use cases.

It's also possible that if we threw enough proc_macro at the problem that we could come up with something more powerful, but I don't plan to work on that just yet.

@michaelkirk
Copy link
Member Author

Do you have in mind to add serialization capabilities as well?

Oh sorry I forgot to answer this in my original reply. If this seems like a good overall direction, I'm happy to add serialization.

@b4l
Copy link
Member

b4l commented Aug 10, 2022

Thanks for the explanations @michaelkirk !

My two potential use cases I have in mind are enforcing a certain schema with a concrete type/struct and support for https://github.com/opengeospatial/ogc-feat-geo-json.

This PR gives us the ability to enforce a concrete geometry type and a certain properties struct. I don't really know how important my use cases are, just wanted to mention them to raise awareness.

@metasim
Copy link
Contributor

metasim commented Aug 18, 2022

This feature would be brilliant!

bors bot added a commit that referenced this pull request Aug 18, 2022
200: Flexible field ordering for FeatureIterator r=frewsxcv a=michaelkirk

- [x] I agree to follow the project's [code of conduct](https://github.com/georust/geo/blob/master/CODE_OF_CONDUCT.md).
- [x] I added an entry to `CHANGES.md` if knowledge of this change could be valuable to users.
---

I encountered this while working on #199, but I thought it made sense to extract the fix and address it separately.

Previously, `FeatureIterator` could read feature collections written like this:

`{ "type": "FeatureCollection", "features": [ ... ] }`

But would error when encountering a feature collection written like this:

`{"features": [ ... ],  "type": "FeatureCollection" }`

Now we handle both orderings.

Co-authored-by: Michael Kirk <[email protected]>
@rmanoka
Copy link
Contributor

rmanoka commented Aug 22, 2022

This is indeed a fantastic idea: really flexible and ergonomic conversion to/from geojson! Thanks @michaelkirk

@michaelkirk michaelkirk force-pushed the mkirk/serde branch 2 times, most recently from 228a61b to 489be8a Compare August 25, 2022 03:57
@michaelkirk michaelkirk changed the title Deserialize geojson into custom struct using serde GeoJSON to/from custom struct using serde Aug 25, 2022
@michaelkirk
Copy link
Member Author

Thanks for the feedback everybody - I think this is ready for review!

Do you have in mind to add serialization capabilities as well?

I've gone and done this and made some other substantial changes, error handling, and docs. It's a big diff, but a lot of it is tests and docs.

I've updated the description with some caveats. Please review them and let me know what you think.

Especially interesting to me would be if you're willing to integrate this into an actual use case. I'm curious to see what kind of rough edges surface. I'm happy to help with the integration if you want to ping me.

Also, I'd super appreciate if anyone has time to build and review the new docs.

@michaelkirk michaelkirk marked this pull request as ready for review August 25, 2022 05:07
@rmanoka
Copy link
Contributor

rmanoka commented Aug 25, 2022

The ser. support is awesome! It should address #176 too, right?

closes #176

@michaelkirk
Copy link
Member Author

The ser. support is awesome! It should address #176 too, right?

Unfortunately, it's probably not much of a performance improvement at this point. In fact, it might be worse in some ways.

For one, there's this really unfortunate PERF note here: https://github.com/georust/geojson/pull/199/files#diff-02292e8d776b8c5c924b5e32f227e772514cb68b37f3f7b384f02cdb6717a181R319

...where, in order to "move" a structs fields into the output's "properties" field, we roundtrip each of your structs to a serde_json::Object.

We do this round trip with only one feature at a time, not the entire feature collection at once, so its memory impact should be limited, but still, it's pretty gross. With the time I had, I wasn't able to come up with a better solution.

But the bigger problem might be that (currently anyway) serialization requires a &[T: Deserialize], so you already need your entire slice of structs to be in memory.

@rmanoka
Copy link
Contributor

rmanoka commented Aug 25, 2022

The conversion to JsonObject at Feature level is still okay. Currently, our serialize impl.:

impl Serialize for FeatureCollection {
    fn serialize<S>(&self, serializer: S) -> std::result::Result<S::Ok, S::Error>
    where
        S: Serializer,
    {
        JsonObject::from(self).serialize(serializer)
    }
}

converts the whole feature collection into a much larger representation before writing. Typically the user is already storing the representation in memory, and the conversion takes a lot (~10x) more memory. This PR nicely side-steps this issue, right?

Copy link
Member

@urschrei urschrei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, and the docs in ser and de are particularly helpful and clear, which leads me to my only question: should we call out this functionality early on in the "main" docs?


use std::io::Read;

pub struct FeatureReader<R> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we get a short doc string here so it's visible in the docs?
Screenshot 2022-08-25 at 14 05 27

Otherwise their description sounds too similar. I think for people who
want it, ultimately it'd be best if FeatureIterator were hidden opaquely
behind FeatureReader::features().

This is similar to `CSVReader::string_records()` - which deserializes to
the libraries lowest-common-denominator built in type (geojson::Feature in our case)

And then we have `FeatureReader::deserialize` which, just like `CSVReader::deserialize`,
allows the user to deserialize directly to their custom struct.
@michaelkirk
Copy link
Member Author

Could we get a short doc string here so it's visible in the docs?

I've done this in 493463e

And then I realized that the short doc string was basically indistinguishable from FeatureIterator. To avoid the confusion to users of having similar sounding things, I decided to deprecate the public FeatureIterator in 269e29d, in favor of accessing it opaquely (as an impl Iterator) on FeatureReader::features.

The thought is that we wouldn't delete it, but eventually make it private once the deprecation cycle has run its course.

So to recap the new suggested usage:

FeatureReader::features() gives you an iterator over Features, just like FeatureIterator::new(io_stream) does today.

And then:

FeatureReader::.deserialize() gives you an iterator over your custom serde struct.

This is somewhat inspired by the approach taken by csv::Reader::records() vs. csv::Reader::deserialize()

WDYT?

@michaelkirk
Copy link
Member Author

should we call out this functionality early on in the "main" docs?

Done in 413b959

@urschrei
Copy link
Member

Could we get a short doc string here so it's visible in the docs?

I've done this in 493463e

And then I realized that the short doc string was basically indistinguishable from FeatureIterator. To avoid the confusion to users of having similar sounding things, I decided to deprecate the public FeatureIterator in 269e29d, in favor of accessing it opaquely (as an impl Iterator) on FeatureReader::features.

The thought is that we wouldn't delete it, but eventually make it private once the deprecation cycle has run its course.

So to recap the new suggested usage:

FeatureReader::features() gives you an iterator over Features, just like FeatureIterator::new(io_stream) does today.

And then:

FeatureReader::.deserialize() gives you an iterator over your custom serde struct.

This is somewhat inspired by the approach taken by csv::Reader::records() vs. csv::Reader::deserialize()

WDYT?

Yep, this seems clear and sensible to me.

Copy link
Contributor

@rmanoka rmanoka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

src/ser.rs Outdated
let mut json_object: JsonObject = {
// PERF: this is an extra round-trip just to juggle some fields around.
// How can we skip this?
let bytes = serde_json::to_vec(self.feature)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

serde-json provides a to_value method that we can use (and unwrap to the map enum variant). That would avoid the round-trip.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had no idea - this is just what I needed. Thanks @rmanoka!

With this 35% improvement, we're now a bit faster (22%) than the legacy bench.

serializing the old way

serialize geojson::FeatureCollection struct (countries.geojson)
                      time:   [3.1352 ms 3.1433 ms 3.1520 ms]
                       change: [-0.3321% -0.0199% +0.3402%] (p = 0.91 > 0.05)

serializing the new way

serialize custom struct (countries.geojson)
                      time:   [2.4598 ms 2.4647 ms 2.4697 ms]
                       change: [-35.490% -35.313% -35.150%] (p = 0.00 < 0.05)
                       Performance has improved.

With the 35% improvement, we're now a bit faster (22%) than the legacy bench.

    Running benches/serialize.rs (target/release/deps/serialize-0873325e9a7982ee)
    serialize geojson::FeatureCollection struct (countries.geojson)
			    time:   [3.1352 ms 3.1433 ms 3.1520 ms]
			    change: [-0.3321% -0.0199% +0.3402%] (p = 0.91 > 0.05)
			    No change in performance detected.
    Found 4 outliers among 100 measurements (4.00%)
      4 (4.00%) high mild

    serialize custom struct (countries.geojson)
			    time:   [2.4598 ms 2.4647 ms 2.4697 ms]
			    change: [-35.490% -35.313% -35.150%] (p = 0.00 < 0.05)
			    Performance has improved.
    Found 1 outliers among 100 measurements (1.00%)
      1 (1.00%) high mild
@michaelkirk
Copy link
Member Author

michaelkirk commented Aug 26, 2022

I pushed up some examples/* as documentation and to facilitate memory profiling. I've gotten a couple approvals and only pushed up some non-controversial(I think) followups, so I'll plan to merge in a couple days unless I hear otherwise.

Thanks for the reviews!

Some followup commentary:

It should address #176 too, right?

before:
Screen Shot 2022-08-26 at 11 50 56 AM

The left hump is the deserialization, the right hump is the serialization.

after:
Screen Shot 2022-08-26 at 11 50 20 AM

Note that there is still a big hump for deserialization. Likely due to my other PERF note: https://github.com/georust/geojson/pull/199/files#diff-a9463680bdf3fa7278b52b437bfbe9072e20023a015621ed23bcb589f6ccd4b5R155

It's not any worse than before, so I'm inclined to merge as-is.

One follow-up solution might be to do something like we do in FeatureIterator, where we don't parse the FeatureCollection, rather we "fast forward" until we hit the "features" property and start parsing there. The current FeatureIterator lacks some robustness (see #200 (comment)) , so I'd rather address that separately than try to fit it into this PR.

@michaelkirk
Copy link
Member Author

bors r=urschrei,rmanoka

@bors
Copy link
Contributor

bors bot commented Aug 29, 2022

Build succeeded:

@bors bors bot merged commit 956aa80 into master Aug 29, 2022
bors bot added a commit that referenced this pull request Sep 2, 2022
205: Add FeatureWriter to stream writes r=urschrei a=michaelkirk

- [x] I agree to follow the project's [code of conduct](https://github.com/georust/geo/blob/master/CODE_OF_CONDUCT.md).
- [x] I added an entry to `CHANGES.md` if knowledge of this change could be valuable to users.
---

This allows the user to incrementally serialize a FeatureCollection, rather than requiring them to do it one-shot. This is a pretty logical counterpart to the FeatureReader introduced in #199.

Co-authored-by: Michael Kirk <[email protected]>
@urschrei urschrei deleted the mkirk/serde branch February 5, 2025 19:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improved ergonomics with ToGeoJson and TryFromGeoJson traits
6 participants