Document IO_intf.ml #1758

tomjridge · 2022-02-09T10:52:06Z

No description provided.

craigfe · 2022-02-09T10:54:23Z

src/irmin-pack/IO.ml

  }

  let name t = t.file
  let header_size = (* offset + version *) Int63.of_int 16

+  (* FIXME what is unsafe? *)


Normally worth checking the safe variant to see what guards it's putting in place, as they vary by context. In this case, unsafe_flush doesn't check that the instance is read-write, but flush does.

But what is "unsafe"?

It's unsafe in the sense that unsafe_flush readonly_instance will silently return () rather than raising the appropriate RO_not_allowed exception.

craigfe · 2022-02-09T10:57:06Z

src/irmin-pack/IO.ml

@@ -208,6 +230,8 @@ module Unix : S = struct

  let close t = Raw.close t.raw
  let exists file = Sys.file_exists file
+  (** FIXME why include this in the interface? *)


IO.exists is called in checks.ml and traverse_pack_file.ml. In both cases, this is because v ~fresh:false raises some undefined exception if the file already exists but the call-site wants something else to happen.

But why not just use Sys.file_exists?

This code was written with a view to Mirage compatibility and potentially having an in-memory IO implementation (for a compact but slow in-memory instantiation of irmin-pack).

Almost all usages of IO (including the ones of IO.exists IIRC) are agnostic to whether the string passed to IO.v is a filepath or not: they just need it to be some identifier of a potentially-shared resource. Being consistent in keeping IO as an abstraction over file descriptors simplifies changing this in future, and makes the caller assumptions about how the existing IO implementation behaves explicit in its interface.

craigfe · 2022-02-09T11:16:19Z

src/irmin-pack/IO.ml

+       occurred. FIXME it is not clear what this is supposed to do for a read/write
+       instance. Probably this should only be used when [t] is readonly. *) 


It seems fine to me for force_offset to return t.offset for read-write instances, in which case the semantics of this function become e.g. "Returns the next free data offset in the underlying file, after synchronising with the file system if necessary". Raising an exception also seems fine.

But currently force_offset does not return t.offset for read-write instances. Are you suggesting the code should be changed?

Yes, I think changing the code to do something consistent with the assumptions made elsewhere makes sense.

Ioana: Remove this comment, stick to comment in mli

craigfe · 2022-02-09T11:24:11Z

src/irmin-pack/IO_intf.ml

+  (** [force_offset t] returns the last offset for data that was reliably synced to disk
+      from the internal buffer, and as a side effect it updates the [t.offset] field to


It's a little unclear to me what "offset" means here.

I the docs for set and read above I read "offset" to mean "real offset in the underlying file" (after adjusting for the header sizes), but offset and force_offset both return offsets in the contiguous data suffix built by append. I somewhat prefer the latter approach, as the IO abstraction exists partly to hide the details of metadata management & expose functions for interacting with the data itself, but either is fine so long as we are consistent.

Are you saying we should change the code? I'm just aiming to document the code that is already there.

I'm saying that the API needs to make it clear whether an "offset" is an offset into the underlying file or an offset into the data segment. The functions offset and force_offset return the latter, but the new doc-comments for read and set describe reading and writing at offset off + header_size. Perhaps those comments just need to say that they read/write from off-many bytes into the data segment?

I'm still mystified about force_offset. Let's assume it is for readonly instances. Then what is it supposed to do? As I understand it, unlike Index, there is no need to "keep up with" data being written to a file. @craigfe do you have any idea what this function is supposed to do?

When a read-only instance explicitly syncs with the writer, it first checks if any additional data has been written to the pack file since the offset was last read (here). If so, then the dictionary and index are sync-ed too. force_offset exists in order to support this explicit sync process in the three places it's called (atomic_write.ml, dict.ml, and pack_store.ml).

Ah. So it is used to trigger syncs for the dictionary and index in the readonly instance. OK. I understood this at one point but then forgot when focused on pack store itself. So, readonly instances maintain "IO.offset" as a way to track when the underlying data changes, and "force_offset" is used to read when offset has been changed by the RW instance? So IO.offset doesn't just return the length of the underlying file for RO instances, it returns the value of the underlying offset field, which may lag the real offset.

I guess a follow up question would be: what if the file data changes but the offset doesn't (eg via set)? If RO instances use increases in offset to trigger updates, then updating via set seems risky in that it wouldn't trigger an update in the RO instances (but I would expect that maybe it would, depending on what the desired semantics should be).

src/irmin-pack/IO.ml

craigfe · 2022-02-09T11:32:26Z

src/irmin-pack/IO_intf.ml

+  (** [truncate t] sets the internal field [t.offset] to 0 and [t.flushed] to point just
+      after the header data. The underlying file is not truncated - the data is still
+      there. A future flush will persist [t.offset] in the underlying file header.
+
+      [t.flushed] is an internal field that is used to trigger flushing from the internal
+      buffer *)


I think I preferred the original comment here w.r.t. getting an understanding of the semantics. Perhaps we could extend that one to say that the purging is delayed until the next internal flush? The field names and implementation details seem likely to become outdated and make it difficult to check that the function actually behaves as required by call-sites.

I agree that it is not nice to expose the internal implementation details. But it wasn't clear to me what this function is actually supposed to do. It sets some internal fields which has various consequences. But (for example) the raw offset is not updated till some time later. This means that, until a flush occurs, there is garbage data that will be read as real data if a crash occurs. Basically, I am not sure what this function is supposed to do, without a flush of the underlying raw offset.

The "bad" scenario is: start with huge file; "truncate"; write some data, but not enough to trigger a flush; crash; restart; now the offset points to the end of the huge file, but we thought we were writing new data at the beginning of the file.

I agree that this is a weird function, and we should consider separately whether the callers can actually tolerate the flush being lost due to laziness (I suspect they cannot, and so this bad case is indeed a bug).

w.r.t. to documenting the current behaviour though, it seems fine to me to say that the actual truncation is delayed until the next flush.

You mentioned that RO instances use the offset to trigger resyncs of dictionary and index.

irmin/src/irmin-pack/pack_store.ml

Lines 556 to 558 in 1d83f08

let former_offset = IO.offset t.pack.block in

let offset = IO.force_offset t.pack.block in

if offset > former_offset then (

The code says something like: if offset > former_offset then .... How does this work with truncate, which presumably sets offset to 0... why not if offset <> former_offset then... ?

Or, indeed, how does this work with set, which mutates the file data but doesn't alter offset at all.

tomjridge · 2022-02-24T11:19:37Z

src/irmin-pack/IO.ml

    mutable flushed : int63;
+    (** [flushed] is the "maximum last flushed offset"; it is updated in {!unsafe_flush}


Probably change to "last flushed offset"?

tomjridge · 2022-02-24T11:26:51Z

src/irmin-pack/IO.ml

@@ -122,6 +137,9 @@ module Unix : S = struct
  let protect f x = try f x with e -> protect_unix_exn e
  let safe f x = try f x with e -> ignore_enoent e

+  (** [mkdir pth] creates the directory [pth]; if there is already a directory, then


change to dirname not pth

tomjridge · 2022-02-24T11:30:54Z

src/irmin-pack/IO.ml

@@ -148,6 +166,7 @@ module Unix : S = struct
    Buffer.clear t.buf

  let v ~version ~fresh ~readonly file =
+    (* check the version is [Some _]; fail if not *)


Ioana: remove this comment

tomjridge · 2022-02-24T11:36:50Z

src/irmin-pack/IO.ml

@@ -199,6 +220,7 @@ module Unix : S = struct
            | Some v -> v
            | None -> Version.invalid_arg v_string
          in
+          (* f. raises if the actual version is greater than the [version] supplied *)


change f. to following

tomjridge · 2022-02-24T11:38:27Z

src/irmin-pack/IO_intf.ml

  val read_buffer : t -> off:int63 -> buf:bytes -> len:int -> int
+  (** [read_buffer t ~off ~buf ~len] tries to read [len] bytes from [t] at offset
+      [off+header_size] *)
+
  val offset : t -> int63


Ioana: clarify what type of offset (real vs offset-in-data); don't talk about append

samoht · 2022-07-07T07:52:50Z

Is this PR still relevant now that we have the new IO module?

tomjridge · 2022-07-07T08:25:52Z

Probably not relevant. Closing.

Document IO_intf.ml

fe8feda

tomjridge added the no-changelog-needed No changelog is needed here label Feb 9, 2022

tomjridge requested review from Ngoguey42 and craigfe February 9, 2022 10:52

maiste added area/backend About the backend type/doc Change documentation labels Feb 9, 2022

tomjridge mentioned this pull request Feb 9, 2022

Tracking: layered store 2022 Q1 #1753

Closed

15 tasks

craigfe reviewed Feb 9, 2022

View reviewed changes

Tweak docs

783a53f

tomjridge commented Feb 24, 2022

View reviewed changes

tomjridge mentioned this pull request Mar 8, 2022

irmin-pack: Strengthen crash and SWMR consistencies #1690

Closed

tomjridge closed this Jul 7, 2022

		occurred. FIXME it is not clear what this is supposed to do for a read/write
		instance. Probably this should only be used when [t] is readonly. *)

		(** [force_offset t] returns the last offset for data that was reliably synced to disk
		from the internal buffer, and as a side effect it updates the [t.offset] field to

	let former_offset = IO.offset t.pack.block in
	let offset = IO.force_offset t.pack.block in
	if offset > former_offset then (

		mutable flushed : int63;
		(** [flushed] is the "maximum last flushed offset"; it is updated in {!unsafe_flush}

Document IO_intf.ml #1758

Document IO_intf.ml #1758

Uh oh!

Conversation

tomjridge commented Feb 9, 2022

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tomjridge Feb 9, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

samoht commented Jul 7, 2022

Uh oh!

tomjridge commented Jul 7, 2022

Uh oh!

Uh oh!

tomjridge Feb 9, 2022 •

edited

Loading