-
Notifications
You must be signed in to change notification settings - Fork 161
Document IO_intf.ml #1758
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Document IO_intf.ml #1758
Conversation
} | ||
|
||
let name t = t.file | ||
let header_size = (* offset + version *) Int63.of_int 16 | ||
|
||
(* FIXME what is unsafe? *) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Normally worth checking the safe variant to see what guards it's putting in place, as they vary by context. In this case, unsafe_flush
doesn't check that the instance is read-write, but flush
does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But what is "unsafe"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's unsafe in the sense that unsafe_flush readonly_instance
will silently return ()
rather than raising the appropriate RO_not_allowed
exception.
@@ -208,6 +230,8 @@ module Unix : S = struct | |||
|
|||
let close t = Raw.close t.raw | |||
let exists file = Sys.file_exists file | |||
(** FIXME why include this in the interface? *) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IO.exists
is called in checks.ml
and traverse_pack_file.ml
. In both cases, this is because v ~fresh:false
raises some undefined exception if the file already exists but the call-site wants something else to happen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But why not just use Sys.file_exists?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code was written with a view to Mirage compatibility and potentially having an in-memory IO implementation (for a compact but slow in-memory instantiation of irmin-pack
).
Almost all usages of IO
(including the ones of IO.exists
IIRC) are agnostic to whether the string passed to IO.v
is a filepath or not: they just need it to be some identifier of a potentially-shared resource. Being consistent in keeping IO
as an abstraction over file descriptors simplifies changing this in future, and makes the caller assumptions about how the existing IO implementation behaves explicit in its interface.
occurred. FIXME it is not clear what this is supposed to do for a read/write | ||
instance. Probably this should only be used when [t] is readonly. *) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems fine to me for force_offset
to return t.offset
for read-write instances, in which case the semantics of this function become e.g. "Returns the next free data offset in the underlying file, after synchronising with the file system if necessary". Raising an exception also seems fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But currently force_offset
does not return t.offset
for read-write instances. Are you suggesting the code should be changed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think changing the code to do something consistent with the assumptions made elsewhere makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ioana: Remove this comment, stick to comment in mli
(** [force_offset t] returns the last offset for data that was reliably synced to disk | ||
from the internal buffer, and as a side effect it updates the [t.offset] field to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a little unclear to me what "offset" means here.
I the docs for set
and read
above I read "offset" to mean "real offset in the underlying file" (after adjusting for the header sizes), but offset
and force_offset
both return offsets in the contiguous data suffix built by append
. I somewhat prefer the latter approach, as the IO
abstraction exists partly to hide the details of metadata management & expose functions for interacting with the data itself, but either is fine so long as we are consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you saying we should change the code? I'm just aiming to document the code that is already there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm saying that the API needs to make it clear whether an "offset" is an offset into the underlying file or an offset into the data segment. The functions offset
and force_offset
return the latter, but the new doc-comments for read
and set
describe reading and writing at offset off + header_size
. Perhaps those comments just need to say that they read/write from off
-many bytes into the data segment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still mystified about force_offset
. Let's assume it is for readonly instances. Then what is it supposed to do? As I understand it, unlike Index, there is no need to "keep up with" data being written to a file. @craigfe do you have any idea what this function is supposed to do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When a read-only instance explicitly sync
s with the writer, it first checks if any additional data has been written to the pack file since the offset was last read (here). If so, then the dictionary and index are sync-ed too. force_offset
exists in order to support this explicit sync process in the three places it's called (atomic_write.ml
, dict.ml
, and pack_store.ml
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah. So it is used to trigger syncs for the dictionary and index in the readonly instance. OK. I understood this at one point but then forgot when focused on pack store itself. So, readonly instances maintain "IO.offset" as a way to track when the underlying data changes, and "force_offset" is used to read when offset has been changed by the RW instance? So IO.offset doesn't just return the length of the underlying file for RO instances, it returns the value of the underlying offset field, which may lag the real offset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess a follow up question would be: what if the file data changes but the offset doesn't (eg via set
)? If RO instances use increases in offset to trigger updates, then updating via set seems risky in that it wouldn't trigger an update in the RO instances (but I would expect that maybe it would, depending on what the desired semantics should be).
(** [truncate t] sets the internal field [t.offset] to 0 and [t.flushed] to point just | ||
after the header data. The underlying file is not truncated - the data is still | ||
there. A future flush will persist [t.offset] in the underlying file header. | ||
|
||
[t.flushed] is an internal field that is used to trigger flushing from the internal | ||
buffer *) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I preferred the original comment here w.r.t. getting an understanding of the semantics. Perhaps we could extend that one to say that the purging is delayed until the next internal flush? The field names and implementation details seem likely to become outdated and make it difficult to check that the function actually behaves as required by call-sites.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that it is not nice to expose the internal implementation details. But it wasn't clear to me what this function is actually supposed to do. It sets some internal fields which has various consequences. But (for example) the raw offset is not updated till some time later. This means that, until a flush occurs, there is garbage data that will be read as real data if a crash occurs. Basically, I am not sure what this function is supposed to do, without a flush of the underlying raw offset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The "bad" scenario is: start with huge file; "truncate"; write some data, but not enough to trigger a flush; crash; restart; now the offset points to the end of the huge file, but we thought we were writing new data at the beginning of the file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that this is a weird function, and we should consider separately whether the callers can actually tolerate the flush being lost due to laziness (I suspect they cannot, and so this bad case is indeed a bug).
w.r.t. to documenting the current behaviour though, it seems fine to me to say that the actual truncation is delayed until the next flush.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mentioned that RO instances use the offset to trigger resyncs of dictionary and index.
irmin/src/irmin-pack/pack_store.ml
Lines 556 to 558 in 1d83f08
let former_offset = IO.offset t.pack.block in | |
let offset = IO.force_offset t.pack.block in | |
if offset > former_offset then ( |
The code says something like: if offset > former_offset then ...
. How does this work with truncate, which presumably sets offset to 0... why not if offset <> former_offset then...
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or, indeed, how does this work with set
, which mutates the file data but doesn't alter offset at all.
mutable flushed : int63; | ||
(** [flushed] is the "maximum last flushed offset"; it is updated in {!unsafe_flush} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably change to "last flushed offset"?
@@ -122,6 +137,9 @@ module Unix : S = struct | |||
let protect f x = try f x with e -> protect_unix_exn e | |||
let safe f x = try f x with e -> ignore_enoent e | |||
|
|||
(** [mkdir pth] creates the directory [pth]; if there is already a directory, then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change to dirname not pth
@@ -148,6 +166,7 @@ module Unix : S = struct | |||
Buffer.clear t.buf | |||
|
|||
let v ~version ~fresh ~readonly file = | |||
(* check the version is [Some _]; fail if not *) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ioana: remove this comment
@@ -199,6 +220,7 @@ module Unix : S = struct | |||
| Some v -> v | |||
| None -> Version.invalid_arg v_string | |||
in | |||
(* f. raises if the actual version is greater than the [version] supplied *) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change f. to following
val read_buffer : t -> off:int63 -> buf:bytes -> len:int -> int | ||
(** [read_buffer t ~off ~buf ~len] tries to read [len] bytes from [t] at offset | ||
[off+header_size] *) | ||
|
||
val offset : t -> int63 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ioana: clarify what type of offset (real vs offset-in-data); don't talk about append
Is this PR still relevant now that we have the new IO module? |
Probably not relevant. Closing. |
No description provided.