Skip to content

Propertynames of feature collection falsely taken from the first element #45

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jaakkor2 opened this issue Jul 11, 2022 · 7 comments · Fixed by #50
Closed

Propertynames of feature collection falsely taken from the first element #45

jaakkor2 opened this issue Jul 11, 2022 · 7 comments · Fixed by #50

Comments

@jaakkor2
Copy link

I do not know if all GeoJSON features should share the same properties. But if not, the following is a bit of a nuisance.

With GeoJSON 0.6.0

a = """{"type": "FeatureCollection", "features": [
    {"type": "Feature", "geometry": {"type": "Polygon", "coordinates": [[[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]]}, "properties": {"foo": "pippo", "bar": "pluto"}},
    {"type": "Feature", "geometry": {"type": "Polygon", "coordinates": [[[0.0, 0.0], [-1.0, 0.0], [0.0, -1.0], [0.0, 0.0]]]}, "properties": {"foo": "fulano"}}
]}"""
b = GeoJSON.read(a)

now propertynames(b) gives (:geometry, :bar, :foo).

This is same what I get when tab expanding

julia> b.<TAB>
bar       foo       geometry

There is b.foo

julia> b.foo
2-element Vector{String}:
 "pippo"
 "fulano"

but there is no b.bar

julia> b.bar
ERROR: KeyError: key :bar not found
Stacktrace:
  [1] getindex(h::Dict{Symbol, Int64}, key::Symbol)
    @ Base .\dict.jl:498
  [2] get(obj::JSON3.Object{Base.CodeUnits{UInt8, String}, SubArray{UInt64, 1, Vector{UInt64}, Tuple{UnitRange{Int64}}, true}}, key::Symbol)
    @ JSON3 C:\Users\jaakkor2\.julia\packages\JSON3\GoF7x\src\JSON3.jl:86
  [3] getproperty(obj::JSON3.Object{Base.CodeUnits{UInt8, String}, SubArray{UInt64, 1, Vector{UInt64}, Tuple{UnitRange{Int64}}, true}}, prop::Symbol)
    @ JSON3 C:\Users\jaakkor2\.julia\packages\JSON3\GoF7x\src\JSON3.jl:126
  [4] getproperty(f::GeoJSON.Feature{JSON3.Object{Base.CodeUnits{UInt8, String}, SubArray{UInt64, 1, Vector{UInt64}, Tuple{UnitRange{Int64}}, true}}}, nm::Symbol)
    @ GeoJSON C:\Users\jaakkor2\.julia\packages\GeoJSON\XejMl\src\features.jl:46
  [5] _broadcast_getindex_evalf
    @ .\broadcast.jl:670 [inlined]
  [6] _broadcast_getindex
    @ .\broadcast.jl:643 [inlined]
  [7] getindex
    @ .\broadcast.jl:597 [inlined]
  [8] copyto_nonleaf!(dest::Vector{String}, bc::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1}, Tuple{Base.OneTo{Int64}}, typeof(getproperty), Tuple{Base.Broadcast.Extruded{GeoJSON.FeatureCollection{GeoJSON.Feature{JSON3.Object{Base.CodeUnits{UInt8, String}, SubArray{UInt64, 1, Vector{UInt64}, Tuple{UnitRange{Int64}}, true}}}, JSON3.Object{Base.CodeUnits{UInt8, String}, Vector{UInt64}}, JSON3.Array{JSON3.Object, Base.CodeUnits{UInt8, String}, SubArray{UInt64, 1, Vector{UInt64}, Tuple{UnitRange{Int64}}, true}}}, Tuple{Bool}, Tuple{Int64}}, Base.RefValue{Symbol}}}, iter::Base.OneTo{Int64}, state::Int64, count::Int64)
    @ Base.Broadcast .\broadcast.jl:1055
  [9] copy
    @ .\broadcast.jl:907 [inlined]
 [10] materialize
    @ .\broadcast.jl:860 [inlined]
 [11] getproperty(fc::GeoJSON.FeatureCollection{GeoJSON.Feature{JSON3.Object{Base.CodeUnits{UInt8, String}, SubArray{UInt64, 1, Vector{UInt64}, Tuple{UnitRange{Int64}}, true}}}, JSON3.Object{Base.CodeUnits{UInt8, String}, Vector{UInt64}}, JSON3.Array{JSON3.Object, Base.CodeUnits{UInt8, String}, SubArray{UInt64, 1, Vector{UInt64}, Tuple{UnitRange{Int64}}, true}}}, nm::Symbol)
    @ GeoJSON C:\Users\jaakkor2\.julia\packages\GeoJSON\XejMl\src\features.jl:103
 [12] top-level scope
    @ REPL[96]:1
@rafaqz
Copy link
Member

rafaqz commented Jul 11, 2022

It seems the spec doesn't say either way. So mixed names are probably acceptable.

Your problem comes from our maybe overly simplistic implementation of the Tables.jl interface. We can instead use a custom function to get column names and leave propertynames to just show features.

@visr
Copy link
Member

visr commented Jul 11, 2022

Indeed that is a valid GeoJSON file.

Currently the undocumented assumption for the Tables interface is that the GeoJSON file must be tabular, i.e. the propertynames of each Feature must be equal. So the simple answer is currently that the Tables interface doesn't apply since this file is not tabular. But it's good to have this issue to discuss this.

Note that propertynames does work per feature. So it's still possible to work with this file, just not through the Tables interface.

julia> propertynames(b)
(:geometry, :bar, :foo)

julia> propertynames.(b)
2-element Vector{Tuple{Symbol, Symbol, Vararg{Symbol}}}:
 (:geometry, :bar, :foo)
 (:geometry, :foo)

Some tests related to this are here: https://github.com/JuliaGeo/GeoJSON.jl/blob/v0.6.0/test/runtests.jl#L228-L248

The current behavior, selecting the first row, was simple to implement and fast. And my assumption was that non tabular GeoJSON files are rare (e.g. GDAL doesn't export them like that, it includes null values for missing properties). If we can come up with another solution that isn't much slower, that would be great. Looping over all features to get the union of all propertynames sounds like it could be expensive, but we'd have to measure, and try out ways to address this.

@rafaqz
Copy link
Member

rafaqz commented Jul 11, 2022

We can at least fix this error, we should check the feature property exists before calling getproperty on line 46 of features.jl. And return missing if not. Currently that check happens afterwards on line 48, which is too late.

We could also make a keyword argument to specify the column names explicitly in the case they are not those of the first feature, or to force looping and detecting them.

@visr
Copy link
Member

visr commented Jul 11, 2022

I'm somewhat hesitant about that getproperty change on Feature. The nothing check is at the end to support JSON null, not keys that are not there. I wrote in the tests:

We could support it by having getproperty(f::Feature, :not_present) return missing if needed, but then you always get missing instead of KeyError.

If e.g. featurecollection.population and feature.population both return missing when the population property never appears anywhere, that could also be confusing to users. They might think it's there but always null.

or to force looping and detecting them

If indeed benchmarking shows it to be too expensive to do by default, we should probably do this.

@rafaqz
Copy link
Member

rafaqz commented Jul 11, 2022

But if we have an argument to force looping, in that case we will also need to make a change somewhat as I suggest so that the missing keys in individual features show as missing instead of throwing an error (for a mixed property table to load at all).

To get the best of both we could put the detected FeatureCollection keys as a type parameter of all its child Features (like a NamedTuple) so we have localised keys to check against instead of always returning missing for any missing key.

@visr
Copy link
Member

visr commented Jul 11, 2022

This is an implementation of the complete looping function that you can use in the meanwhile:

function properlynames(fc::GeoJSON.FeatureCollection)
    f, features = Iterators.peel(fc)
    set = Set(propertynames(f))
    for f in features
        for name in keys(GeoJSON.properties(f))
            push!(set, name)
        end
    end
    return set
end

For a huge GeoJSON with 16k rows and 300 cols, this takes 4 seconds. For comparison, getproperty (fetch a column) on that FeatureCollection is around 2 seconds, and DataFrame(fc) is currently 500 seconds. In this example each feature has the same properties, like it will for most files. Would be nice if we could somehow improve performance for that case.

If we could for instance cache the propertynames in the FeatureCollection on first computation, that is not a bad penalty. The DataFrame(fc) method also could use a look.

To get the best of both we could put the detected FeatureCollection keys as a type parameter of all its child Features (like a NamedTuple) so we have localised keys to check against instead of always returning missing for any missing key.

How about having getproperty(::FeatureCollection) not call getproperty(::Feature)? To me getproperty(::Feature, :bar) throwing a KeyError when bar is not in that feature seems correct. For getproperty(::FeatureCollection) it should only throw a KeyError if bar is in none of the underlying features.

@rafaqz
Copy link
Member

rafaqz commented Jul 11, 2022

That's possible as well, although the problem is table rows usually have all the same properties as the table they come from. So it will break the Tables.jl rows interface.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants