push tags to store even when no need to write object #860
Thank you for the feedback! 🙌 You are probably aware that you can force the pin to write, even if the content hasn't changed, via

Have you taken a look at the vignette on writing with custom metadata? It might also help to look at the folder structure for how pins are stored, to get a little more clarity on what you want to do in terms of versions. Here are two pins, with several versions for the second pin:
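The folder listing itself was lost here, but the general shape of a local pins board can be sketched (a hypothetical layout; the pin name and version IDs are borrowed from the reprex later in this thread, and `data.txt` is where pins keeps a version's metadata):

```r
## Hypothetical local board layout, e.g. as shown by fs::dir_tree(board$path):
## <board path>/
## └── nice-numbers/
##     ├── 20250317T214320Z-21f72/
##     │   ├── data.txt            # pins-managed metadata for this version
##     │   └── nice-numbers.json   # the pinned data itself
##     └── 20250317T214320Z-e97c1/
##         ├── data.txt
##         └── nice-numbers.json
```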
You are correct. That would be nice for more voluminous datasets, as it would save time on frequent writes. (Though I admit it might not concern that many people.) Also, it would require metadata handling on my side in order to preserve previous tags (see below).
I would imagine that the exact same file (as determined by its hash) would not be considered a new version, hence the idea of only updating the tags.
I had, and I have now looked again. So your suggested workaround is: I take the previous object's tags, add the new tag, save the new object with the new tag list, then delete the previous version? That does save disk space. It does not save write IO. But it's a start :)
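That read-merge-rewrite-delete loop can be sketched with existing pins functions (`pin_meta()`, `pin_write(tags = )`, and `pin_version_delete()` are real pins API; the overall flow, and the names `board` and `"cars"`, are assumptions for illustration):

```r
library(pins)

board <- board_temp(versioned = TRUE)
pin_write(board, mtcars, "cars", tags = "scenario-A")

## later, pinning the *same* object under a new tag:
old_meta <- pin_meta(board, "cars")
old_version <- pin_versions(board, "cars")$version[1]

## re-write with the union of old and new tags ...
pin_write(board, mtcars, "cars",
          tags = union(old_meta$tags, "scenario-B"))

## ... then drop the superseded version to reclaim the disk space
pin_version_delete(board, "cars", old_version)
```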
My pipeline evolves over time. Say I have two versions v0 and v1. Most scenarios share the same data-cleaning strategy, but once in a while there is a specific (heavy) treatment that you only want to perform in that case. I would then save the
... where I would use the hash for identity and would completely disregard the timestamp.
Also, in your workaround, how do I identify the object+version with the same hash as mine?
It looks like the hash is computed post hoc from the file on disk. (I tried, and I do not get the same result as the in-memory hash of the same object.) So the workaround now looks like:
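A minimal illustration of that mismatch, assuming the digest package (pins' actual hashing internals may differ):

```r
library(digest)

x <- list(a = 1:10)

## hash of the in-memory R object (digest serializes it itself)
obj_hash <- digest(x)

## hash of the bytes of a file holding the same object
path <- tempfile(fileext = ".rds")
saveRDS(x, path)
file_hash <- digest(path, file = TRUE)

## the two hash different byte streams, so they will not match,
## which makes "find the stored version with my object's hash" hard
identical(obj_hash, file_hash)
```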
With this solution I have to write my (possibly) heavy objects twice, whereas conceptually I don't need any writing at all (except the tag updates). The alternative, as you mentioned, is to continuously add new copies of the objects (simpler, probably faster, though I did not check, but it uses considerably more disk space), but this defeats the point of the
Any reason why you compute the hash from disk?
You mean as opposed to a hash of the R object? Since what the pins package does at its core is read/write files, we prioritized recording the hash of the file itself. Think of a CSV or JSON file: there are lots of ways to get an R object from CSV or JSON, so recording the hash of the R object isn't as meaningful as the hash of the file itself, i.e. what the pin is storing.

You are totally right that your use case isn't a perfect fit for the way the pins package considers a dataset's metadata as very strongly linked to a given version of the data; we don't currently support updating metadata without updating the pin as a whole. If I were in your position, a workaround I would consider is to add my own little file to sit beside the official metadata in the version directory:

```r
library(pins)

## you write several versions:
board <- board_temp(versioned = TRUE)
pin_write(board, sample(1:100, 10), "nice-numbers", type = "json")
#> Creating new version '20250317T214320Z-21f72'
#> Writing to pin 'nice-numbers'
pin_write(board, sample(1:100, 10), "nice-numbers", type = "json")
#> Creating new version '20250317T214320Z-e97c1'
#> Writing to pin 'nice-numbers'

## identify the version you want to add new tags to:
version <- pin_versions(board, "nice-numbers")$version[1]
version_dir <- fs::path(board$path, "nice-numbers", version)

## write some new tags in that directory:
yaml::write_yaml(
  list(scenario = c("a", "b"), pipeline = "v1"),
  fs::path(version_dir, "tags.yaml")
)
```

Created on 2025-03-17 with reprex v2.1.1

You could write a wrapper function around this, doing the checking appropriate for your use case to find the right version and to read/write your own separate metadata file. We can keep this issue open to track interest in possible changes or extensions to this package along these lines.
Thanks for following up. I am going to try your suggested workaround and report back. My confusion is that when you say something like:
... "version" means "as identified by hash" to me, whereas it means "as identified by [some mix of timestamp and hash]" to the
I feel like I see where this comes from historically (timestamps probably came first), and I am not necessarily suggesting that it change, but I believe I am not the only one confused by the behaviour; see for instance #826 or #827. As for computing the hash from the file: my question comes up because, if I understand the code correctly, you first write the file to a temporary location, then compute the hash, then move the file to its final position. That is fairly cheap for small objects on a local disk, but may be more expensive for larger objects and/or remote stores. Also, it requires writing the file to the store before even knowing whether it actually needs to be written.
Imagine I have a pipeline consisting of multiple (costly) processing steps, run across multiple scenarios A, B, C, D, etc. I pin most objects of the pipeline, each with the tag "scenario X".
Many of the steps are the same, so that when I pin an object it often gets rejected as not having changed, and the tag is rejected as well.
However, when I want to load all objects from scenario X, many of them go missing and I end up tracking the scenario <-> objects relationship on the side.
Would love to be able to update the tags with the union of the existing and the incoming ones.
Or am I missing something? Like an obvious way around the problem?
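The requested behaviour amounts to, in base-R terms (a sketch of the semantics only, not the pins API):

```r
## Hypothetical semantics: when pin_write() finds the content unchanged,
## keep the existing version but merge the incoming tags into it.
merge_tags_on_identical <- function(stored_tags, incoming_tags) {
  union(stored_tags, incoming_tags)
}

merge_tags_on_identical(c("scenario-A", "scenario-B"),
                        c("scenario-B", "scenario-C"))
#> [1] "scenario-A" "scenario-B" "scenario-C"
```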
Not sure I would recommend this behaviour as the default, though, as I am afraid there are contexts where it would absolutely not make sense. Maybe an always_add_tags argument?