
push tags to store even when no need to write object #860


Open

katossky opened this issue Mar 10, 2025 · 7 comments
Labels
feature a feature request or enhancement

Comments

@katossky

Imagine I have a pipeline consisting of multiple (costly) processing steps, run across multiple scenarios A, B, C, D, etc. I pin most objects of the pipeline, each with the tag "scenario X".

Many of the steps are the same across scenarios, so when I pin an object it often gets rejected as unchanged, and the tag is rejected along with it.

However, when I then want to load all objects from scenario X, many of them are missing, and I end up tracking the scenario <-> objects relationship on the side.

Would love to be able to update the tags with the union of the existing and the incoming ones.

Or maybe I am missing something? Like an obvious way to work around the problem?

Not sure I would recommend this behaviour as the default though, as I am afraid of missing some contexts where it would absolutely not make sense. Maybe an always_add_tags argument?

@juliasilge
Member

Thank you for the feedback! 🙌

You are probably aware that you can force the pin to write, even if the content hasn't changed, via force_identical_write:
https://pins.rstudio.com/reference/pin_read.html#arg-force-identical-write
I am guessing you don't want to force the write to update the metadata, but rather to update the metadata without updating the pin contents file? How are you thinking about this: would it create a new version, or not?
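
For reference, here is a minimal sketch of the forced write on a temporary board (the scenario tags are just placeholders):

library(pins)

board <- board_temp(versioned = TRUE)
pin_write(board, mtcars, "mtcars", type = "rds", tags = "scenario-A")

## by default, an identical re-write is skipped, and the new tag with it:
pin_write(board, mtcars, "mtcars", type = "rds", tags = "scenario-B")

## force_identical_write = TRUE creates a whole new version anyway:
pin_write(
  board, mtcars, "mtcars", type = "rds",
  tags = "scenario-B", force_identical_write = TRUE
)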

Have you taken a look at the vignette on writing with custom metadata?
https://pins.rstudio.com/articles/customize-pins-metadata.html
If you had an existing version with certain metadata, you could write a new version that made a union of old and new metadata.
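
Continuing the sketch above, for tags specifically that union could look something like this (the merging is up to you):

## read the tags recorded on the current version, then write a new
## version carrying the union of old and new tags:
old_tags <- pin_meta(board, "mtcars")$tags
pin_write(
  board, mtcars, "mtcars", type = "rds",
  tags = union(old_tags, "scenario-C"),
  force_identical_write = TRUE
)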

It might be helpful to take a look at the folder structure for how pins are stored, to get a little more clarity on what you are wanting to do in terms of versions. Here are two pins, with several versions for the second pin:

.
├── mtcars
│   └── 20250311T213658Z-e5d8a
│       ├── data.txt
│       └── mtcars.rds
└── really-nice-numbers
    ├── 20250311T213700Z-d2ae2
    │   ├── data.txt
    │   └── really-nice-numbers.json
    ├── 20250311T213701Z-5cc23
    │   ├── data.txt
    │   └── really-nice-numbers.json
    ├── 20250311T213701Z-a742f
    │   ├── data.txt
    │   └── really-nice-numbers.json
    ├── 20250311T213702Z-3a641
    │   ├── data.txt
    │   └── really-nice-numbers.json
    └── 20250311T213702Z-ed389
        ├── data.txt
        └── really-nice-numbers.json

@katossky
Author

I am guessing you don't want to force the write to update the metadata

You are correct. That would be nice for voluminous datasets, as it would save time on frequent writes. (Though I admit this might not concern that many people.) Also, it would require metadata handling on my side in order to preserve previous tags (see below).

How are you thinking about this: would it create a new version, or not?

I would imagine that the exact same file (as determined by hash) would not be considered a new version, hence the idea to only update the tags.

Have you taken a look at the vignette on writing with custom metadata?

I had, and I now have again. So your suggested workaround is: take the previous object's tags, add the new tag, save the new object with the new tag list, then delete the previous version? That does save disk space. It does not save write IO. But that's a start :)

more clarity on what you are wanting to do in terms of versions

My pipeline evolves over time. Say I have two versions v0 and v1. Most scenarios share the same data-cleaning strategy, but once in a while a scenario needs a specific (heavy) treatment that you only want to perform in that case. I would then save the cleanedData object, which would most often be the same. Say I introduced this alternative handling for scenario A in v1. In that case I would have the following tree:

.
└── cleanedData
    ├── 20250311T213702Z-3a641
    │   ├── data.txt
    │   ├── scenario_A:pipeline_v1
    │   └── cleanedData.parquet
    └── 20250311T213702Z-ed389
        ├── data.txt
        ├── scenario_A:pipeline_v0 scenario_B:pipeline_v0 scenario_B:pipeline_v1
        └── cleanedData.parquet

... where I would use the hash for identity and would disregard completely the time stamp.

@katossky
Author

Also, in your workaround, how do I identify the object+version with the same hash as mine?

@katossky
Author

Looks like the hash is computed post hoc from the file on disk. (I tried, and I do not get the same result as the in-memory hash of the same object.) So the workaround now looks like this (sketched in code after the list):

  1. force-save the current file (this creates a new version)
  2. check whether the hash is used twice
  3. if it is, then pick up tags from the first pin's metadata and merge with the new tag
  4. delete both pins
  5. save the file as a new pin with the consolidated tags
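
In code, I imagine something like this (an untested sketch; I am assuming the hash column of pin_versions() and pin_version_delete() behave the way the docs describe):

library(pins)

pin_write_merge_tags <- function(board, x, name, new_tag, type = "rds") {
  ## 1. force-save the current object; this creates a new version
  pin_write(board, x, name, type = type, tags = new_tag,
            force_identical_write = TRUE)

  ## 2. look for other versions carrying the same (short) hash
  versions <- pin_versions(board, name)
  latest   <- versions[which.max(versions$created), ]
  dupes    <- versions[versions$hash == latest$hash, ]

  if (nrow(dupes) > 1) {
    ## 3. merge the tags of every version sharing that hash
    tags <- unique(unlist(lapply(
      dupes$version,
      function(v) pin_meta(board, name, version = v)$tags
    )))
    ## 4. delete the duplicated versions ...
    for (v in dupes$version) pin_version_delete(board, name, v)
    ## 5. ... and save once more, with the consolidated tags
    pin_write(board, x, name, type = type, tags = tags,
              force_identical_write = TRUE)
  }
}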

With this solution I have to write my (possibly heavy) objects twice, whereas conceptually I don't need any writing at all (except the tag updates). The alternative, as you mentioned, is to continuously add new copies of the objects (simpler, probably faster, though I did not check, but it uses considerably more disk space), which rather defeats the point of the pins package.

@katossky
Author

Any reason why you compute the hash from disk?

@juliasilge
Member

Any reason why you compute the hash from disk?

You mean as opposed to a hash of the R object? Since what the pins package does at its core is read/write files, we prioritized recording the hash of the file itself. If you think of a CSV or JSON file, there are lots of ways to get an R object from it, so the hash of the R object isn't as meaningful as the hash of the file itself, i.e. what the pin is actually storing.
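
To make that concrete, here is a little illustration with the digest package (just to show the idea; not necessarily the exact hashing pins does internally):

df <- data.frame(x = 1:3)

csv  <- tempfile(fileext = ".csv")
json <- tempfile(fileext = ".json")
write.csv(df, csv, row.names = FALSE)
jsonlite::write_json(df, json)

## one R object, two different files, two different file hashes:
digest::digest(csv, file = TRUE)
digest::digest(json, file = TRUE)

## while the in-memory object has its own, third hash:
digest::digest(df)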

You are totally right that your use case isn't a perfect fit for the way the pins package considers a dataset's metadata as very strongly linked to a given version of the data; we don't currently support updating metadata without updating the pin as a whole. If I were in your position, I think a workaround I would consider is to add my own little file to sit beside the official metadata that lives in data.txt, something like this:

library(pins)

## you write several versions:
board <- board_temp(versioned = TRUE)
pin_write(board, sample(1:100, 10), "nice-numbers", type = "json")
#> Creating new version '20250317T214320Z-21f72'
#> Writing to pin 'nice-numbers'
pin_write(board, sample(1:100, 10), "nice-numbers", type = "json")
#> Creating new version '20250317T214320Z-e97c1'
#> Writing to pin 'nice-numbers'

## identify the version you want to add new tags to:
version <- pin_versions(board, "nice-numbers")$version[1]
version_dir <- fs::path(board$path, "nice-numbers", version)

## write some new tags in that directory:
yaml::write_yaml(
  list(scenario = c("a", "b"), pipeline = "v1"),
  fs::path(version_dir, "tags.yaml")
)

Created on 2025-03-17 with reprex v2.1.1

You could write a wrapper function for this, to do the checking appropriate for your use case to find the right version and to read/write your own separate metadata file.
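
Such a wrapper might look something like this (again a sketch, assuming a folder-based board where board$path is available, and the tags.yaml convention from above):

## hypothetical helper: merge new tags into a version's tags.yaml
pin_add_tags <- function(board, name, version, tags) {
  tags_path <- fs::path(board$path, name, version, "tags.yaml")
  old <- if (fs::file_exists(tags_path)) yaml::read_yaml(tags_path) else list()
  yaml::write_yaml(modifyList(old, tags), tags_path)
}

pin_add_tags(board, "nice-numbers", version, list(pipeline = "v2"))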

We can keep this issue open to track interest in possible changes or extensions to this package along these lines.

@juliasilge added the feature label on Mar 17, 2025
@katossky
Author

Thanks for following up. I am going to try your suggested workaround and report back.

My confusion is that when you say something like:

the pins package considers a dataset's metadata as very strongly linked to a given version of the data

... "version" does mean "as identified by hash" to me, whereas it does mean "as identified by [some mix of timestamp and hash]" to the pins package. So to me it makes sense that metadata should be added to the same version of the object (= object with same hash irrespective of time stamp). I completely understand that time-aware versioning be also needed for some users, but to me its the metadata that should bear the timestamp, not the object. Something like :

.
└── cleanedData
    ├── 3a641
    │   ├── 20250311T213702Z
    │   │   └── tags: initialTag, anotherInitialTagToBeRemoved
    │   ├── 20250315T213702Z
    │   │   └── tags: initialTag, aNewTagJustAdded
    │   └── cleanedData.parquet
    └── ed389
        ├── 20250315T213702Z
        │   └── tags: initialTag
        └── cleanedData.parquet

I feel like I see where this comes from historically (probably timestamps came first), and I am not necessarily suggesting that it change, but I believe I am not the only one confused by this behaviour; see for instance #826 or #827.


As for computing the hash from the file: my question comes from the fact that, if I understand the code correctly, you first write the object to a temporary file, then compute the hash, then move the file to its final location. That is nearly free for small objects on a local disk, but may be more expensive for massive objects and/or remote stores. It also requires writing the file to the store before even knowing whether you actually need to write it at all.
