Tracking: layered store 2022 Q1 #1753

Closed
@tomjridge

This is an issue to track the new layered store implementation.

The current branch: https://github.com/tomjridge/irmin/tree/2022-04-22_layers_rebased_on_3.2.0

Older branches:

A recent tezos branch, with additional code to trigger gc every so often, is here: https://github.com/tomjridge/tezos/tree/2022-03-14_layers

Victor's branch, to integrate layers into Tezos properly, is here: https://gitlab.com/nomadic-labs/tezos/-/tree/vicall@tomjridge@layered_store

Todo (additional entries to be added when discovered):

  • Add clear documentation for the IO.Unix interface used by pack_store.ml, so that its semantics can be worked out
  • Implement external sorting and other external routines via mmaps
    • sorting
    • extent calculation
  • Port/rework prototype code from https://github.com/tomjridge/sparse-file/tree/master/src into a subdirectory under irmin-pack
  • Change the store pack file to use a control+objstore+suffix ("layers") rather than a plain file
    • X Identify the exact interface used by the pack_store
    • X Determine how to implement this interface on top of the layers
    • Implement a replacement IO, suitable for layers
  • Implement the missing part of the worker: the calculation of reachable objects from a commit
  • Implement a simple mechanism to trigger GC from a given commit
  • Proper integration with irmin APIs
    • X how to trigger GC
    • how to properly compute reachability from a commit (still needs looking at - want to avoid use of create_reach.exe)
  • Test, for example, by replaying some existing trace and periodically performing GC on a recent commit
    • X Get trace replay with GC every n commits working; this is working
    • X Get tezos node bootstrapping with GC working
    • X Get baking node working, with RO irmin instances
    • X Test restart behaviour when a process is killed in the middle of bootstrapping (for instance); TJR: I tested this quite a bit and things seemed OK; there are likely still errors if a process is killed at an inopportune time; could do with more testing
  • Bug fixing (at 2022-04-21)
    • X RO implementation needs finishing
    • Unbounded memory usage when using layers, compared to main; TJR: after finishing RO impl, cannot reproduce this error
    • After stopping a node, a restart attempts to read from the gap; this is likely caused by some startup behaviour of tezos-node, e.g. it attempts to access an "old" commit, or the parent of the current GC commit; TJR: after finishing RO impl, cannot reproduce this error
  • Benchmarking; perhaps refinement of the code (e.g. calculation of reachable objects)
  • Proper testing and performance measurement for the Tezos use case - they want to GC every cycle, but only keep the last 6 cycles; how does this affect timings for Repo.iter? What is the impact on IO? Also, what is the space overhead? (presumably we need an extra 6 cycles' worth of storage if we are GC'ing from 6 cycles ago - this will be copied to the next suffix file; and on top of this we have the sparse file overhead for live objects from the commit, 3GB currently)
  • "Hardening" pass, where all the FIXMEs are addressed, corner cases fixed, etc.
  • Merging into main irmin repo
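
For readers new to the design, the todo items on reachability, extent calculation and the control+objstore+suffix layout can be illustrated with a small in-memory sketch. This is not the irmin-pack code, and it is written in Python rather than OCaml purely for illustration. Everything named here (Graph, reachable_offsets, extents, Layers, GapError, gc) is hypothetical. It also assumes that objects only reference objects at lower offsets and that the suffix starts at the GC commit; neither is confirmed by this issue.

```python
from bisect import bisect_right
from dataclasses import dataclass

# Hypothetical object graph: offset -> (length in bytes, offsets of referenced objects).
Graph = dict[int, tuple[int, list[int]]]


class GapError(Exception):
    """Read fell in the region discarded by GC (hypothetical error type)."""


def reachable_offsets(graph: Graph, commit_off: int) -> set[int]:
    """Offsets of all objects reachable from the commit (iterative DFS)."""
    seen: set[int] = set()
    stack = [commit_off]
    while stack:
        off = stack.pop()
        if off not in seen:
            seen.add(off)
            stack.extend(graph[off][1])
    return seen


def extents(graph: Graph, offs: set[int]) -> list[tuple[int, int]]:
    """Sort live objects by offset and merge adjacent ones into (offset, length)."""
    merged: list[tuple[int, int]] = []
    for off in sorted(offs):
        length = graph[off][0]
        if merged and merged[-1][0] + merged[-1][1] == off:
            merged[-1] = (merged[-1][0], merged[-1][1] + length)
        else:
            merged.append((off, length))
    return merged


@dataclass
class Layers:
    """Sparse object store before the GC point, plain suffix from it onwards."""
    sparse_map: list[tuple[int, int, int]]  # (virtual offset, length, offset in sparse_data)
    sparse_data: bytes
    suffix_offset: int  # assumed to be what the control file records
    suffix: bytes

    def read(self, off: int, length: int) -> bytes:
        if off >= self.suffix_offset:
            start = off - self.suffix_offset
            return self.suffix[start:start + length]
        # Find the last extent starting at or before `off`.
        starts = [entry[0] for entry in self.sparse_map]
        i = bisect_right(starts, off) - 1
        if i < 0:
            raise GapError(off)
        voff, vlen, soff = self.sparse_map[i]
        if off + length > voff + vlen:
            raise GapError(off)
        start = soff + (off - voff)
        return self.sparse_data[start:start + length]


def gc(pack: bytes, graph: Graph, commit_off: int) -> Layers:
    """Copy live extents into a sparse store; keep everything from the commit as suffix."""
    live = {o for o in reachable_offsets(graph, commit_off) if o < commit_off}
    sparse_map, chunks, pos = [], [], 0
    for off, length in extents(graph, live):
        sparse_map.append((off, length, pos))
        chunks.append(pack[off:off + length])
        pos += length
    return Layers(sparse_map, b"".join(chunks), commit_off, pack[commit_off:])


# Toy pack file: the object at offset 4 ("BB") is not reachable from the commit at 11.
PACK = b"AAAABBCCCDDK1EEE"
GRAPH: Graph = {0: (4, []), 4: (2, []), 6: (3, [0]), 9: (2, []), 11: (2, [6, 9]), 13: (3, [])}
layers = gc(PACK, GRAPH, commit_off=11)
print(layers.read(6, 3))  # b'CCC'
```

In this toy model, a read of the dead object at offset 4 raises GapError, which is the kind of "read from gap" failure mentioned in the bug-fixing entries. A real implementation would work on files (possibly via mmap) rather than in-memory bytes.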
