Skip to content

Document IO_intf.ml #1758

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from
Closed

Conversation

tomjridge
Copy link
Contributor

No description provided.

@tomjridge tomjridge added the no-changelog-needed No changelog is needed here label Feb 9, 2022
@maiste maiste added area/backend About the backend type/doc Change documentation labels Feb 9, 2022
@tomjridge tomjridge mentioned this pull request Feb 9, 2022
15 tasks
}

let name t = t.file
let header_size = (* offset + version *) Int63.of_int 16

(* FIXME what is unsafe? *)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Normally worth checking the safe variant to see what guards it's putting in place, as they vary by context. In this case, unsafe_flush doesn't check that the instance is read-write, but flush does.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But what is "unsafe"?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's unsafe in the sense that unsafe_flush readonly_instance will silently return () rather than raising the appropriate RO_not_allowed exception.

@@ -208,6 +230,8 @@ module Unix : S = struct

let close t = Raw.close t.raw
let exists file = Sys.file_exists file
(** FIXME why include this in the interface? *)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IO.exists is called in checks.ml and traverse_pack_file.ml. In both cases, this is because v ~fresh:false raises some undefined exception if the file already exists but the call-site wants something else to happen.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But why not just use Sys.file_exists?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code was written with a view to Mirage compatibility and potentially having an in-memory IO implementation (for a compact but slow in-memory instantiation of irmin-pack).

Almost all usages of IO (including the ones of IO.exists IIRC) are agnostic to whether the string passed to IO.v is a filepath or not: they just need it to be some identifier of a potentially-shared resource. Being consistent in keeping IO as an abstraction over file descriptors simplifies changing this in future, and makes the caller assumptions about how the existing IO implementation behaves explicit in its interface.

Comment on lines +110 to +111
occurred. FIXME it is not clear what this is supposed to do for a read/write
instance. Probably this should only be used when [t] is readonly. *)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems fine to me for force_offset to return t.offset for read-write instances, in which case the semantics of this function become e.g. "Returns the next free data offset in the underlying file, after synchronising with the file system if necessary". Raising an exception also seems fine.

Copy link
Contributor Author

@tomjridge tomjridge Feb 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But currently force_offset does not return t.offset for read-write instances. Are you suggesting the code should be changed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think changing the code to do something consistent with the assumptions made elsewhere makes sense.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ioana: Remove this comment, stick to comment in mli

Comment on lines +72 to +73
(** [force_offset t] returns the last offset for data that was reliably synced to disk
from the internal buffer, and as a side effect it updates the [t.offset] field to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a little unclear to me what "offset" means here.

I the docs for set and read above I read "offset" to mean "real offset in the underlying file" (after adjusting for the header sizes), but offset and force_offset both return offsets in the contiguous data suffix built by append. I somewhat prefer the latter approach, as the IO abstraction exists partly to hide the details of metadata management & expose functions for interacting with the data itself, but either is fine so long as we are consistent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you saying we should change the code? I'm just aiming to document the code that is already there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm saying that the API needs to make it clear whether an "offset" is an offset into the underlying file or an offset into the data segment. The functions offset and force_offset return the latter, but the new doc-comments for read and set describe reading and writing at offset off + header_size. Perhaps those comments just need to say that they read/write from off-many bytes into the data segment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still mystified about force_offset. Let's assume it is for readonly instances. Then what is it supposed to do? As I understand it, unlike Index, there is no need to "keep up with" data being written to a file. @craigfe do you have any idea what this function is supposed to do?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When a read-only instance explicitly syncs with the writer, it first checks if any additional data has been written to the pack file since the offset was last read (here). If so, then the dictionary and index are sync-ed too. force_offset exists in order to support this explicit sync process in the three places it's called (atomic_write.ml, dict.ml, and pack_store.ml).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah. So it is used to trigger syncs for the dictionary and index in the readonly instance. OK. I understood this at one point but then forgot when focused on pack store itself. So, readonly instances maintain "IO.offset" as a way to track when the underlying data changes, and "force_offset" is used to read when offset has been changed by the RW instance? So IO.offset doesn't just return the length of the underlying file for RO instances, it returns the value of the underlying offset field, which may lag the real offset.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess a follow up question would be: what if the file data changes but the offset doesn't (eg via set)? If RO instances use increases in offset to trigger updates, then updating via set seems risky in that it wouldn't trigger an update in the RO instances (but I would expect that maybe it would, depending on what the desired semantics should be).

Comment on lines +92 to +97
(** [truncate t] sets the internal field [t.offset] to 0 and [t.flushed] to point just
after the header data. The underlying file is not truncated - the data is still
there. A future flush will persist [t.offset] in the underlying file header.

[t.flushed] is an internal field that is used to trigger flushing from the internal
buffer *)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I preferred the original comment here w.r.t. getting an understanding of the semantics. Perhaps we could extend that one to say that the purging is delayed until the next internal flush? The field names and implementation details seem likely to become outdated and make it difficult to check that the function actually behaves as required by call-sites.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it is not nice to expose the internal implementation details. But it wasn't clear to me what this function is actually supposed to do. It sets some internal fields which has various consequences. But (for example) the raw offset is not updated till some time later. This means that, until a flush occurs, there is garbage data that will be read as real data if a crash occurs. Basically, I am not sure what this function is supposed to do, without a flush of the underlying raw offset.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "bad" scenario is: start with huge file; "truncate"; write some data, but not enough to trigger a flush; crash; restart; now the offset points to the end of the huge file, but we thought we were writing new data at the beginning of the file.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that this is a weird function, and we should consider separately whether the callers can actually tolerate the flush being lost due to laziness (I suspect they cannot, and so this bad case is indeed a bug).

w.r.t. to documenting the current behaviour though, it seems fine to me to say that the actual truncation is delayed until the next flush.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mentioned that RO instances use the offset to trigger resyncs of dictionary and index.

let former_offset = IO.offset t.pack.block in
let offset = IO.force_offset t.pack.block in
if offset > former_offset then (

The code says something like: if offset > former_offset then .... How does this work with truncate, which presumably sets offset to 0... why not if offset <> former_offset then... ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or, indeed, how does this work with set, which mutates the file data but doesn't alter offset at all.

mutable flushed : int63;
(** [flushed] is the "maximum last flushed offset"; it is updated in {!unsafe_flush}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably change to "last flushed offset"?

@@ -122,6 +137,9 @@ module Unix : S = struct
let protect f x = try f x with e -> protect_unix_exn e
let safe f x = try f x with e -> ignore_enoent e

(** [mkdir pth] creates the directory [pth]; if there is already a directory, then
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change to dirname not pth

@@ -148,6 +166,7 @@ module Unix : S = struct
Buffer.clear t.buf

let v ~version ~fresh ~readonly file =
(* check the version is [Some _]; fail if not *)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ioana: remove this comment

@@ -199,6 +220,7 @@ module Unix : S = struct
| Some v -> v
| None -> Version.invalid_arg v_string
in
(* f. raises if the actual version is greater than the [version] supplied *)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change f. to following

val read_buffer : t -> off:int63 -> buf:bytes -> len:int -> int
(** [read_buffer t ~off ~buf ~len] tries to read [len] bytes from [t] at offset
[off+header_size] *)

val offset : t -> int63
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ioana: clarify what type of offset (real vs offset-in-data); don't talk about append

@samoht
Copy link
Member

samoht commented Jul 7, 2022

Is this PR still relevant now that we have the new IO module?

@tomjridge
Copy link
Contributor Author

Probably not relevant. Closing.

@tomjridge tomjridge closed this Jul 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/backend About the backend no-changelog-needed No changelog is needed here type/doc Change documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants