Skip to content
This repository was archived by the owner on Nov 28, 2017. It is now read-only.

reboot - Map and key insertion order #8

Closed
mdedetrich opened this issue Jul 26, 2015 · 11 comments
Closed

reboot - Map and key insertion order #8

mdedetrich opened this issue Jul 26, 2015 · 11 comments

Comments

@mdedetrich
Copy link
Contributor

Stackoverflow question is here http://stackoverflow.com/questions/31635076/scala-immutable-map-like-datastructure-that-has-constant-effective-constant-look

@eed3si9n raised a potential issue, and that is the fact that JObject uses a Map, which means that there is no specified ordering on keys. The JSON specification says that you shouldn't rely on the object having any specific order, however this does have practical implications.

Quoting what @eed3si9n said in gitter

suppose I want to construct a JSON object { "a": 1, "b": 2 }
and this represents some serialized value
for caching I might just take the SHA1 of it
if I use your Map version of JObject to implement it, the SHA1 could change depending on the weather
so the behavior of JObject becomes non-idempotent (I am not sure if this is the right term)
this is not a what-if scenario as sbt currently uses json4s to serialize case classes via Scala Pickling - https://github.com/sbt/serialization

The problem is, that Scala stdlib does not have any immutable map like data structure which preserves key ordering that also has c or eC lookup time. There is LinkedMap, which is a Map backed by a LinkedList, however that has Linear lookup time. There is LinkedHashMap which has a constant lookup time, however its mutable. There are some solutions

  1. Roll our own datastructure (which inherits from Scala's collection library, so it can be treated to be exactly the same as a normal Map) which has all of the properties we need
  2. Internally maintain a Vector as well as a Map so we can have some concept of ordering
  3. Just use a SortedMap. We still can't guarantee that that the key order will be the same according to insertion, however we can guarantee that the keys will be Sorted, which means that if you SHA-1 two JObject's that have the same contents, they will be equal. It also means that when you serialize, we can guarantee that it will always be ordered. EDIT: Default implementation of SortedMap is a TreeMap, which has log lookup time.
  4. Converted the Map to a SortedMap by some Ordering, which means that if you do something like what @eed3si9n above, you would convert your JObject's Map to a SortedMap before doing a SHA-1. The nicest way to implement this is by having some sorted method on a JValue, which will replace all internal JObjects to something like JObjectSorted and return a JValueSorted. You can then serialize a JValueSorted if you need this sought of functionality.

Taking into account all theoretical/practical/time considerations, I think 4 is most realistic. It fixes @eed3si9n issue indirectly, while still maintaining immutability and c/eC lookup time.

EDIT: The implementation for 3 (which uses a SortedMap), has only one implementation in Scala's stdlib, which is a TreeMap. TreeMap has a log lookup time, which is definitely is not as bad as n, but its not a c/eC. Looks like we may need to roll our own data structure, if possible.

@eed3si9n
Copy link

Maybe the term I was looking for is referential transparency. Random ordering is a side effect.

@rossabaker
Copy link
Contributor

See Argonaut's JsonObject for some prior art on this issue. This is solution 2.

@eed3si9n
Copy link

@rossabaker yea. I didn't know about their implementation, but it's what I had in mind last night.

private[argonaut] case class JsonObjectInstance(
  fieldsMap: Map[JsonField, Json] = Map.empty,
  orderedFields: Vector[JsonField] = Vector.empty
)

@rossabaker
Copy link
Contributor

Are you depending on this to get a round trip from String => JValue => String? Even if JObject field order is preserved, whitespace is not.

It seems that one could devise a hash that operates on JSON structure rather than a hashing a serialized form. I'm not up on my pickling, but would that work better for you than depending on order?

@sirthias
Copy link

For the record this is the discussion we've so far had on this topic in spray-json:
spray/spray-json#119
Basically the finding in the end was that being able to keep the parsed order was not as much of an issue as originally thought.

Still, for me one of the core design goals of a common AST is perf potential.
Basic AST construction from a parser should be a fast as possible.
For JSON numbers this could mean parsing into strings and deciding lazily (i.e. on access) whether to treat the number as an Int, Float, Double or BigDecimal.
For JSON objects we could do the same thing and simply parse into an (internal) array and only decide lazily on how to access the members.
This would mean that the user could decide which hashing/sorting/etc. costs she wants to pay when accessing object members.

@eed3si9n
Copy link

I want def foo(a: Int, b: Int): String to be referentially transparent.

It seems that one could devise a hash that operates on JSON structure rather than a hashing a serialized form. I'm not up on my pickling, but would that work better for you than depending on order?

We could implement special hashing and equality, etc but eventually the random ordering becomes observable via String (or File or whatever), which is not desirable.

@eed3si9n
Copy link

For JSON objects we could do the same thing and simply parse into an (internal) array and only decide lazily on how to access the members.

I like that idea.

@sirthias
Copy link

I agree that maintaining member order has all sorts of practical advantages.
Also, simply constructing a Vector or even Array is strictly faster than mandatorily sorting members at the same time. Therefore I'd vote for a fast unordered storage underneath and relegating "by-member-name" access to a secondary "view" level.

@mdedetrich
Copy link
Contributor Author

Ok looks we will go for 2, although ideally we would have to provide our own Map interface so the fact that the internal structure is a Vector is hidden to a user. It will probably provide an interface that is the exact same as LinkedMap

One of the reasons for using a Map like structure in the first place is so that querying the JSON AST is not painful like it is when using a list like structure

Will have a look at argonaut tonight

@mdedetrich
Copy link
Contributor Author

@eed3si9n The obvious side effect of storing another internal structure is memory usage. I am not sure how significant this is, but I still have a nagging thought in the back of my head regarding this. A really large JObject, which is not uncommon, would take a lot more memory than just a single data structure.

It also seems that from the comments mentioned in spray/spray-json#119, most people argued to just use a Map, and that any sought of ordering should be delegated (which is point 4). Still kind of torn halfway inbetween

EDIT: My comment in gitter

I am still kind of torn on what to do in regards of ordering for JObject. Had a proper look at the spray discussion with @sirthias linked, and most people argued that Ordering should not be something that JObject should consider (and using a Map for a JObject fixes all of our performance characteristics which is another big problem)

I am more leaning towards making a view to an ordered JSON, although I know that doing such a thing can have potential performance impacts (it should be noted, that there is no free lunch here. Although we can maintain ordering, we either have to use a TreeMap, which has O(log(n)) performance, or we need to maintain 2 datastructures, which greatly increases memory usage for large JSON data, this being an even bigger problem for us since we are on the JVM). There is also the final point about the JSON spec not specifying any ordering, and it can be argued that having a datastructure implementation of SortedMap gives a false impression to users that JSON Objects should be sorted (when in fact, users should not be taking into account any sought of ordering)

@mdedetrich
Copy link
Contributor Author

There is now an implementation (org.json4s.basic.ast), which represents JObject as a JArray[(String,JValue)]. This has an implied insertion order.

If this is still a problem (i.e. in regards to using org.json4s.ast and not org.json4s.basic.ast) then reopen the ticket, however note that the "use org.json4s.ast and provide your own DSL/library/code ontop" argument may be more applicable.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Development

No branches or pull requests

4 participants