Skip to content

Rewrite using JSON3(read, Type) #63

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 15 commits into from
Mar 19, 2023
Merged

Rewrite using JSON3(read, Type) #63

merged 15 commits into from
Mar 19, 2023

Conversation

evetion
Copy link
Member

@evetion evetion commented Jan 17, 2023

Will be a severe breaking release as almost everything changed, but should also offer great performance.

  • Fix Feature(Collection) api/tests
  • Needs convert/construction logic from Tables/GeoInterface other PRs.
  • Figure id, crs, etc.
  • Precompile most calls by using SnoopCompile on test samples

Our previous code was much more flexible in allowing anything to be parsed as GeoJSON, but we will miss out on that, but gain performance. Setting Feature(Collection) to mutable does allow some flexibility, such as not writing the bbox or other attributes if they're nothing. It does double the compilation time, but is almost on par once compiled. I can imagine we add other nullable attributes such as id and crs to be slightly backwards compatible.

Added a dep on ComputedFieldTypes to calculate the type of the bounding box, but it might be overkill.

@evetion
Copy link
Member Author

evetion commented Jan 17, 2023

  • Do we want to subtype AbstractArray for Geometry?

@rafaqz
Copy link
Member

rafaqz commented Jan 17, 2023

Would have been nice to keep the same files so we can see the changes ;)

@evetion
Copy link
Member Author

evetion commented Jan 17, 2023

Would have been nice to keep the same files so we can see the changes ;)

I tried, but failed. Essentially the following happened:

  • json.jl => io.jl
  • geometries.jl, features.jl => geojson_types.jl

@evetion
Copy link
Member Author

evetion commented Jan 23, 2023

Ok, so this requires the unreleased GeoInterface for getting the extent (using trait instead of geomtrait, so it also works on FeatureCollections).

The remaining error is from Aqua, which complains about Unbound type parameters, in all the constructors for Geometry, like:

GeoJSON.Polygon(bbox::Union{Nothing, Vector{Float32}}, coordinates::Union{Nothing, Array{Array{Tuple{Vararg{Float32, D}}, 1}, 1}}) where D @ GeoJSON ~/code/GeoJSON.jl/src/geojson_types.jl:46

I'm not sure what the unbound problem is, so help would be appreciated.

I've removed the ComputedFieldTypes (so bbox is not D*2 NTuple, just a Vector), as it added a Type Parameter that we don't control, for little to no gain.

I've changed the coordinate number to Float32, as it should be enough for WGS84 precision.

The remaining thing is to improve the shadowing of fields vs keys in the properties of Feature(Collections), as we also need to add id and crs to those exclusions.

I also still need to write a changelog with all the breaking changes in this release.
As of now I have:

  • The keys of the properties of a Feature(Collections) are now Symbols.
  • All Type signatures changed, everything is now typed, so no more JSON3.Array and JSON3.Object
  • Custom attributes (apart from id, bbox and crs) are now unsupported and ignored.
  • Geometries are not AbstractVectors anymore.
  • Property values that are numbers are not forced to Float64 anymore.
  • Geometry coordinate type is now Float32.

@rafaqz
Copy link
Member

rafaqz commented Jan 23, 2023

The unbound problem is with the Union - there is no D if the type is Nothing. Maybe we have to remove the possibility to have nothing and use an empty vector instead.

Or we need to add explicit inner constructors where the D always has to be supplied manually, even when the type is Nothing. You will also need to remove the @kw_def and do that manually.

Something like:

struct Polygon{D} <: AbstractGeometry{D}
    bbox::Union{Nothing,Vector{Float32}}
    coordinates::Union{Nothing,Vector{Vector{NTuple{D,Float32}}}}
    Polygon{D}(bbox, coordinates) = new{D}(bbox, coordinates)
    Polygon{D}(; bbox=nothing, coordinates=nothing)  =   Polygon{D}(bbox, coordinates)
end

I think there is no way around providing D manually.

@evetion
Copy link
Member Author

evetion commented Jan 23, 2023

I think there is no way around providing D manually.

Well you can do

Point(; bbox=nothing, coordinates::NTuple{D,T}) where {D,T<:Real} = Point{D}(bbox, coordinates)

But then you have unbound T. Otherwise you have to hardcode the Type to Float32, or change the signature from {D}, to {D,T} for everything. What do you think?

@rafaqz
Copy link
Member

rafaqz commented Jan 23, 2023

Don't we know D anyway? Why not use it? Maybe I need to read over the code

@evetion
Copy link
Member Author

evetion commented Jan 23, 2023

This is mainly for end users, to easily construct a thing with kw arguments. And D can be inferred when coordinates is passed.

@evetion
Copy link
Member Author

evetion commented Jan 23, 2023

Ok, I've now set {D,T} as the type parameters. Sadly, I still get the unbound errors. Another thing, precompile doesn't seem to do much at all, even with SnoopPrecompile. Reading in the 24MB always takes 2.5s at the first time, afterwards it drops to 0.04s. Maybe we need to place a bit of JSON3 reads just in the module itself to warm things up a bit. Or make a PR over at JSON3 to precompile much more types?

Anyway, I'd appreciate a deeper review, as this is close to complete.

@rafaqz
Copy link
Member

rafaqz commented Jan 23, 2023

Ok well I didn't recommend adding T ;)

D and nothing is the problem... If there is no D we'll have to make it up (e.g. 2) and it has to be in a separate method to the one where D is known to avoid the unbound types

But personally I would force everyone to use D, or like GeometryBasics.jl, add Polygon2 as a shortcut to that. They have the same problem.

I'll try review tomorrow or wednesday

@evetion
Copy link
Member Author

evetion commented Jan 24, 2023

After some experiments with precompile/snoopprecompile, I just added a lot of precompile statements from --trace-compile. This is the only reliable method that got initial times down for parsing the 24MB featurecollection from GeoMakie.

Also fixed the unbound parameters test by defaulting to 2d for Point(coordinates=(1,2)) construction. Nice thing about adding T is that we essentially added back numbertype support back to GeoJSON and people can choose between Int, Float16 etc.

Copy link
Member

@rafaqz rafaqz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly great, some comments on types and minor improvments.

@evetion evetion marked this pull request as ready for review February 3, 2023 14:21
@rafaqz
Copy link
Member

rafaqz commented Feb 9, 2023

Seems that write here no longer accepts other geometries? It needs to write anything not just GeoJSON objects

Edit: we just have to remove the types on write and it works fine hmm no doesn't save correctly

This fixes it:

write(io, obj) = JSON3.write(io, _lower(obj))
write(obj) = JSON3.write(_lower(obj))

@rafaqz
Copy link
Member

rafaqz commented Mar 3, 2023

I've been using this branch for day to day work, but hitting some bugs.

For example his file was created and loads properly with main, but fails on this branch.
https://gist.github.com/rafaqz/3ba5a1d29df15d2e2a61bb577482a470

@evetion
Copy link
Member Author

evetion commented Mar 3, 2023

Maybe its properties being null ?

Yep, that's it. Should be {Nothing, NamedTuple}, now it's just NamedTuple.

@rafaqz
Copy link
Member

rafaqz commented Mar 3, 2023

Sweet easy fix

@evetion
Copy link
Member Author

evetion commented Mar 3, 2023

I've been using this branch for day to day work, but hitting some bugs.

Can you describe the other bugs? Otherwise, especially if you can use this for your day to day work, let's merge this to main, revive #59 and make a breaking release.

@rafaqz
Copy link
Member

rafaqz commented Mar 3, 2023

One was it's really slow using mixed geometry type features from overpass turbo. I'll have to put an MWE together.

Something else didn't work but it might have been null again.

@rafaqz
Copy link
Member

rafaqz commented Mar 6, 2023

Ok MWE of slow loads. I chose a workload that would take a second or so so we can profile it. My use case needs the whole island, which takes a lot longer.

The file is here: https://drive.google.com/file/d/1B-3ziH0oWivLye6V0wKk40Nxr3PTsAb0/view?usp=sharing

using GeoJSON
using ProfileView
@time GeoJSON.read(read("slow.geojson"))

Gives:

julia> @time GeoJSON.read(read("slow.geojson"))
  1.430720 seconds (1.13 M allocations: 54.966 MiB)
FeatureCollection with 5789 Features

And the profile:

@profview GeoJSON.read(read("slow.geojson"))

Mostly type instability in StructTypes.jl
2023-03-06-104518_1049x844

It may be better if the properties are a Dict? Still wont be fast, but better. The more I look at it the more I hate GeoJSON as a format.

@evetion
Copy link
Member Author

evetion commented Mar 7, 2023

Yeah, the format isn't that great. Anyway, slow.geojson isn't that slow for me. Oddly, when profiling, I seem to arrive at the invalid method, while no error is thrown:
image

Not sure if that's an error in my installation. To be fair, changing the code back to Dict doesn't seem to do much in terms of time, and only a little bit in terms of allocations.

@rafaqz
Copy link
Member

rafaqz commented Mar 7, 2023

How slow is "isn't that slow" ?

Are you on julia 1.9? That's a very different profile it would be good to know whats happening here.

@rafaqz
Copy link
Member

rafaqz commented Mar 7, 2023

I think the other half of my profile was that "invalid" method. So you just have that part, without the StructTypes instabilities. Interesting.

@evetion
Copy link
Member Author

evetion commented Mar 7, 2023

How slow is "isn't that slow" ?

Properties made into Dict again

1.9

julia> @time GeoJSON.read(data)
  1.686686 seconds (3.22 M allocations: 157.501 MiB, 4.47% gc time, 64.26% compilation time)
FeatureCollection with 5789 Features

julia> @time GeoJSON.read(data)
  0.600645 seconds (983.32 k allocations: 39.599 MiB, 1.28% gc time)
FeatureCollection with 5789 Features

1.8

julia> @time GeoJSON.read(data)
  2.148807 seconds (3.77 M allocations: 174.611 MiB, 3.27% gc time, 69.70% compilation time)
FeatureCollection with 5789 Features

julia> @time GeoJSON.read(data)
  0.646021 seconds (936.02 k allocations: 38.156 MiB, 5.79% gc time)
FeatureCollection with 5789 Features

this PR (still with NamedTuple)

1.9

julia> @time GeoJSON.read(data)
  7.736202 seconds (10.49 M allocations: 623.268 MiB, 1.06% gc time, 87.94% compilation time)
FeatureCollection with 5789 Features

julia> @time GeoJSON.read(data)
  0.806888 seconds (1.15 M allocations: 46.902 MiB, 1.44% gc time)
FeatureCollection with 5789 Features

1.8

julia> @time GeoJSON.read(data)
  8.658965 seconds (14.97 M allocations: 787.488 MiB, 1.99% gc time, 88.66% compilation time)
FeatureCollection with 5789 Features

julia> @time GeoJSON.read(data)
  0.821887 seconds (1.10 M allocations: 44.540 MiB, 0.69% gc time)
FeatureCollection with 5789 Features

So it seems that while the NamedTuple allocates slightly more, it is mostly much harder on type inference, the first call is much much slower. These timings are done on an M1 macbook.

@evetion
Copy link
Member Author

evetion commented Mar 7, 2023

I think the other half of my profile was that "invalid" method. So you just have that part, without the StructTypes instabilities. Interesting.

Also, I don't understand why it would hit invalid, it's basicly everything in my profiles (whether Dict or NamedTuple) and this is the code:

@noinline invalid(error, buf, pos, T) = throw(ArgumentError("""
invalid JSON at byte position $pos while parsing type $T: $error
$(String(buf[max(1, pos-25):min(end, pos+25)]))
"""))

@rafaqz
Copy link
Member

rafaqz commented Mar 7, 2023

Nice, so Dict it is!

But that invalid is super weird, how does it take all time but not throw visibly??

Is it getting caught by a try/catch block somewhere and ignored?

@rafaqz
Copy link
Member

rafaqz commented Mar 7, 2023

I guess its this try - catch:

https://github.com/quinnj/JSON3.jl/blob/main/src/structs.jl#L82-L86

We could remove the try catch and just run the first read and see what the error is.

@evetion
Copy link
Member Author

evetion commented Mar 7, 2023

I guess its this try - catch:

quinnj/JSON3.jl@main/src/structs.jl#L82-L86

We could remove the try catch and just run the first read and see what the error is.

Yes and it's caused by the id of the Feature, having an unholy Union.

Changing from id::Union{Nothing,Int,String} to id::Union{Nothing,String} you get:

julia> @time GeoJSON.read(data)
  1.174992 seconds (3.05 M allocations: 148.982 MiB, 2.30% gc time, 91.93% compilation time)
FeatureCollection with 5789 Features

julia> @time GeoJSON.read(data)
  0.086503 seconds (890.70 k allocations: 35.536 MiB)
FeatureCollection with 5789 Features

Spec however says:

o If a Feature has a commonly used identifier, that identifier
SHOULD be included as a member of the Feature object with the name
"id", and the value of this member is either a JSON string or
number.

@rafaqz
Copy link
Member

rafaqz commented Mar 7, 2023

I would burn the spec for that speedup, especially if it just means ids are always strings.

But can we hack around it some how by defining a custom read?

Where we check isnumeric?

@evetion
Copy link
Member Author

evetion commented Mar 7, 2023

I would burn the spec for that speedup, especially if it just means ids are always strings.

But can we hack around it some how by defining a custom read?

Where we check isnumeric?

Or setting it to Any? Ugly, but just as fast.

@rafaqz
Copy link
Member

rafaqz commented Mar 7, 2023

Oh yeah let's do that. But does it actually parse the numbers? Or just make them all strings anyway.

@evetion
Copy link
Member Author

evetion commented Mar 12, 2023

Oh yeah let's do that. But does it actually parse the numbers? Or just make them all strings anyway.

julia> d = """{
           "type": "Feature",
           "id": 1,
           "bbox": [-180.0, -90.0, 180.0, 90.0],
           "geometry": {"type": "MultiLineString", "coordinates": [[[3.75, 9.25], [-130.95, 1.52]], [[23.15, -34.25], [-1.35, -4.65], [3.45, 77.95]]]},
           "properties": {"title": "Dict 1"}
       }""";

julia> f = GeoJSON.read(d)
Feature with 2D MultiLineString geometry and 2 properties: (:geometry, :title)

julia> f.id
1

julia> typeof(f.id)
Int64

String also works. I did note that accessing the properties is a bit type unstable, due to us overloading getproperty, but it still seems pretty fast to me.

Only some culling of the precompile is needed. While the large examples (yours, the original Makie one), now take ~1s to parse for the first time and ~0.05s for subsequent calls, precompiling now takes ~15s.

@rafaqz
Copy link
Member

rafaqz commented Mar 18, 2023

I'm fine with shifting that time to precompile for that performance gain. SnoopCompile @snoopi_deep might show some obvious fixable problems too.

But we should just get this merged, I've already switched to it for practical work, the precompile is less than it was taking to load single json files.

@evetion
Copy link
Member Author

evetion commented Mar 18, 2023

Feel free to merge

Copy link
Member

@rafaqz rafaqz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm.

Im wondering how many of those precompile statements actually work in practice, just running the functions tends to be better if there is any type instability. But we can fix that later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants