What would a "FlatBuffers2" binary format look like? #5875

Open
@aardappel

FlatBuffers' binary format has been set in stone for 6.5 years now, because we value binary forwards/backwards compatibility highly, and because we have a large investment in 15 or so language implementations, parsers, etc. that would not be easy to redo.

So "V2" in the sense of a new format that breaks backwards compatibility may never happen. But there is definitely a list of issues with the existing format that would be nice to address if a new format were ever to happen. I realized I never made such a list. It would be nice to at least fantasize what such a format could look like :)

Please comment what you would like to see. Note that this list is purely for things that would change the binary encoding, or larger additions to the binary encoding. Anything that can be solved with code / new APIs outside of the binary format does not belong on this list.

  1. Remove all padding. Modern CPUs can access unaligned data at normal speed. This would shrink the format somewhat, and encourage other variable size things. If anyone ever needs padding to be compatible with a C struct, explicit padding can always be added, or it can be an opt-in feature.
  2. Make unions into a single field (and vectors of them into a single value). Also make the type part 16-bit while we're at it, so a union is always a 6-byte struct.
  3. Remove the 2nd field of the vtable. This field stores the table size, but it is never used by any implementation. It was intended for a streaming API that never happened.
  4. Allow different size vtable offsets. Currently they're always 16-bit, but for small tables 8-bit would be feasible. Since we code-gen vtable access, this would come at no cost. Use with care of course, because once you choose this smaller size you can't undo it when your table grows.
  5. Allow inline vtables when we determined they're unlikely to be shared. Saves an offset.
  6. Allow inline strings, vectors (and maybe scalars), meaning a vtable offset would refer directly to the string, rather than to the string offset. Saves the offset. Of course this puts more pressure on the vtable offset size, so use with care. Similarly, could even do inline scalars of all small scalar types. Of course this makes it more likely that vtables are unequal, so this is a tradeoff. It would work well with inline vtables.
  7. Remove 0-termination of strings. Only C/C++ care for this, and C++ has been moving toward string_view recently, and both have been using size_t arguments for a long time rather than relying on strlen. Other languages don't use it. For passing to super-old C APIs that expect 0-termination, either swap the terminating byte temporarily while passing that string, or copy.
  8. Allow 8 and 16 bit size fields on strings and vectors, currently they're always 32. Good for small strings. Combine all the string optimisations above together, and the string "a" goes from 12 bytes (2 vtable + 4 offset + 4 size + 1 string + 1 terminator) to 3 bytes (1 vtable + 1 size + 1 string). Of course this is very inflexible and special purpose, but gives users more options for compact data storage. Again, like all format variation above, this comes at no runtime cost, just some codegen complexity.
  9. Construct the buffer forwards (rather than backwards, as all implementations currently do). This simplifies a lot of code and buffer management. Unsigned child offsets would now always point downwards in memory. Downside: must now detect table fields pointing to the table itself.
  10. Always have a file_identifier, and make it the first thing in the buffer? Always have a length field as well?
  11. Support 64-bit offsets from the start. They would be optional for vectors and certain other things, allowing buffers >2GB. See https://github.com/google/flatbuffers/projects/10#card-14545298
  12. For a buffer that has entirely un-shared vtables (see 5), it now becomes more feasible to allow in-place mutation of more complex things. This is definitely a complex/contentious feature, but I think if we ever re-booted the format this should be designed in from the start if possible.
  13. Deeply integrated FlexBuffers, basically allowing any field to cheaply be a FlexBuffers value such that it effectively becomes FlatBuffers' "dynamic type". Strings would be shared across such values rather than each value being an isolated nested buffer.
  14. Nested vectors. Not strictly a breaking change, but a new format would probably want to have them from the start.
  15. Built-in LEBs (variable sized integers) as an optional varint type for fields. They could be added to the existing format but make a lot more sense in a system with no alignment.
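
The byte counts in item 8 can be checked with a small sketch. This is only back-of-the-envelope accounting, not an encoder: the function names are hypothetical, padding is ignored as in the issue's own arithmetic, and the "compact" layout simply assumes items 4, 6, 7 and 8 are all applied together.

```python
import struct

def current_string_cost(s: str) -> int:
    """Bytes attributable to one string field in today's format (padding ignored)."""
    data = s.encode("utf-8")
    vtable_entry = struct.calcsize("<H")   # 16-bit vtable offset
    field_offset = struct.calcsize("<I")   # 32-bit offset stored in the table
    size_field = struct.calcsize("<I")     # 32-bit string length
    terminator = 1                         # trailing 0 byte
    return vtable_entry + field_offset + size_field + len(data) + terminator

def compact_string_cost(s: str) -> int:
    """Bytes for the same field under the hypothetical compact layout."""
    data = s.encode("utf-8")
    if len(data) > 0xFF:
        raise ValueError("an 8-bit size field cannot describe more than 255 bytes")
    vtable_entry = 1                       # 8-bit vtable offset (item 4)
    size_field = 1                         # 8-bit string length (item 8)
    # Inline string (item 6): no separate offset. No terminator (item 7).
    return vtable_entry + size_field + len(data)

print(current_string_cost("a"), compact_string_cost("a"))
```

The `ValueError` branch shows the inflexibility mentioned above: once a field is declared with an 8-bit size, longer strings no longer fit.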

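For item 15, a minimal sketch of unsigned LEB128, the usual variable-sized integer encoding, may help show why it suits a format without alignment: values take 1 byte per 7 bits of payload, so field sizes are never multiples of a fixed width. The function names are hypothetical, and how such a type would actually be wired into a schema is not specified here.

```python
def encode_uleb128(value: int) -> bytes:
    """Encode a non-negative integer, 7 bits per byte, least significant group first."""
    if value < 0:
        raise ValueError("unsigned LEB128 cannot encode negative values")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)   # high bit set: more bytes follow
        else:
            out.append(byte)          # high bit clear: last byte
            return bytes(out)

def decode_uleb128(buf: bytes, pos: int = 0):
    """Decode one value starting at pos; return (value, position after it)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return result, pos
        shift += 7

encoded = encode_uleb128(300)
print(encoded.hex(), decode_uleb128(encoded))
```

Values below 128 fit in a single byte, which is where the savings over fixed 32-bit fields would come from.
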
@rw @mikkelfj @vglavnyy @mzaks @mustiikhalil @dnfield @dbaileychess @lu-wang-g @stewartmiles @alexames @paulovap @AustinSchuh @maxburke @svenk177 @jean-airoldie @krojew @iceb0y @evanw @KageKirin @llchan @schoetbi @evolutional
