
push tags to store even when no need to write object #860


Open

katossky opened this issue Mar 10, 2025 · 7 comments
Labels
feature a feature request or enhancement

Comments

@katossky

Imagine I have a pipeline consisting of multiple (costly) processing steps, run across multiple scenarios A, B, C, D, etc. I pin most objects of the pipeline, each with the tag "scenario X".

Many of the steps are the same across scenarios, so when I pin an object it often gets rejected as unchanged, and the tag is rejected along with it.

However, when I then want to load all objects from scenario X, many of them are missing, and I end up tracking the scenario <-> objects relationship on the side.

Would love to be able to update the tags with the union of the existing and the incoming ones.

Or maybe I am missing something? Like an obvious way to work around the problem?

Not sure I would recommend this behaviour as the default though, as I am afraid of missing some contexts where it would absolutely not make sense. Maybe an always_add_tags argument?

@juliasilge
Member

Thank you for the feedback! 🙌

You are probably aware that you can force the pin to write, even if the content hasn't changed, via force_identical_write:
https://pins.rstudio.com/reference/pin_read.html#arg-force-identical-write
I am guessing you don't want to force the write to update the metadata, but rather to update the metadata without updating the pin contents file? How are you thinking about this: would it create a new version, or not?
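
For reference, here is a minimal sketch of the forced write on a temporary board (the scenario tags are just placeholders):

library(pins)

board <- board_temp(versioned = TRUE)
pin_write(board, mtcars, "mtcars", type = "rds", tags = "scenario-A")

## by default, an identical re-write is skipped, and the new tag with it:
pin_write(board, mtcars, "mtcars", type = "rds", tags = "scenario-B")

## force_identical_write = TRUE creates a whole new version anyway:
pin_write(
  board, mtcars, "mtcars", type = "rds",
  tags = "scenario-B", force_identical_write = TRUE
)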

Have you taken a look at the vignette on writing with custom metadata?
https://pins.rstudio.com/articles/customize-pins-metadata.html
If you had an existing version with certain metadata, you could write a new version that made a union of old and new metadata.
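
Continuing the sketch above, for tags specifically that union could look something like this (the merging is up to you):

## read the tags recorded on the current version, then write a new
## version carrying the union of old and new tags:
old_tags <- pin_meta(board, "mtcars")$tags
pin_write(
  board, mtcars, "mtcars", type = "rds",
  tags = union(old_tags, "scenario-C"),
  force_identical_write = TRUE
)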

It might be helpful to take a look at the folder structure for how pins are stored, to get a little more clarity on what you are wanting to do in terms of versions. Here are two pins, with several versions for the second pin:

.
├── mtcars
│   └── 20250311T213658Z-e5d8a
│       ├── data.txt
│       └── mtcars.rds
└── really-nice-numbers
    ├── 20250311T213700Z-d2ae2
    │   ├── data.txt
    │   └── really-nice-numbers.json
    ├── 20250311T213701Z-5cc23
    │   ├── data.txt
    │   └── really-nice-numbers.json
    ├── 20250311T213701Z-a742f
    │   ├── data.txt
    │   └── really-nice-numbers.json
    ├── 20250311T213702Z-3a641
    │   ├── data.txt
    │   └── really-nice-numbers.json
    └── 20250311T213702Z-ed389
        ├── data.txt
        └── really-nice-numbers.json

@katossky
Author

I am guessing you don't want to force the write to update the metadata

You are correct. That would be nice for voluminous datasets, as it would save time on frequent writes. (Though I admit this might not concern that many people.) Also, it would require metadata handling on my side in order to preserve previous tags (see below).

How are you thinking about this: would it create a new version, or not?

I would imagine that the exact same file (as determined by hash) would not be considered a new version, hence the idea to only update the tags.

Have you taken a look at the vignette on writing with custom metadata?

I had, and I now have again. So your suggested workaround is: take the previous object's tags, add the new tag, save the new object with the new tag list, then delete the previous version? That does save disk space. It does not save write IO. But that's a start :)

more clarity on what you are wanting to do in terms of versions

My pipeline evolves over time. Say I have two versions v0 and v1. Most scenarios share the same data-cleaning strategy, but once in a while a scenario needs a specific (heavy) treatment that you only want to perform in that case. I would then save the cleanedData object, which would most often be the same. Say I introduced this alternative handling for scenario A in v1. In that case I would have the following tree:

.
└── cleanedData
    ├── 20250311T213702Z-3a641
    │   ├── data.txt
    │   ├── scenario_A:pipeline_v1
    │   └── cleanedData.parquet
    └── 20250311T213702Z-ed389
        ├── data.txt
        ├── scenario_A:pipeline_v0 scenario_B:pipeline_v0 scenario_B:pipeline_v1
        └── cleanedData.parquet

... where I would use the hash for identity and would disregard completely the time stamp.

@katossky
Author

Also, in your workaround, how do I identify the object+version with the same hash as mine?

@katossky
Author

Looks like the hash is computed post hoc from the file on disk. (I tried, and I do not get the same result as the in-memory hash of the same object.) So the workaround now looks like this (sketched in code after the list):

  1. force-save the current file (this creates a new version)
  2. check whether the hash is used twice
  3. if it is, then pick up tags from the first pin's metadata and merge with the new tag
  4. delete both pins
  5. save the file as a new pin with the consolidated tags
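
In code, I imagine something like this (an untested sketch; I am assuming the hash column of pin_versions() and pin_version_delete() behave the way the docs describe):

library(pins)

pin_write_merge_tags <- function(board, x, name, new_tag, type = "rds") {
  ## 1. force-save the current object; this creates a new version
  pin_write(board, x, name, type = type, tags = new_tag,
            force_identical_write = TRUE)

  ## 2. look for other versions carrying the same (short) hash
  versions <- pin_versions(board, name)
  latest   <- versions[which.max(versions$created), ]
  dupes    <- versions[versions$hash == latest$hash, ]

  if (nrow(dupes) > 1) {
    ## 3. merge the tags of every version sharing that hash
    tags <- unique(unlist(lapply(
      dupes$version,
      function(v) pin_meta(board, name, version = v)$tags
    )))
    ## 4. delete the duplicated versions ...
    for (v in dupes$version) pin_version_delete(board, name, v)
    ## 5. ... and save once more, with the consolidated tags
    pin_write(board, x, name, type = type, tags = tags,
              force_identical_write = TRUE)
  }
}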

With this solution I have to write my (possibly heavy) objects twice, whereas conceptually I don't need any writing at all (except the tag updates). The alternative, as you mentioned, is to continuously add new copies of the objects (simpler, probably faster, though I did not check, but it uses considerably more disk space), which rather defeats the point of the pins package.

@katossky
Author

Any reason why you compute the hash from disk?

@juliasilge
Member

Any reason why you compute the hash from disk?

You mean as opposed to a hash of the R object? Since what the pins package does at its core is read/write files, we prioritized recording the hash of the file itself. If you think of a CSV or JSON file, there are lots of ways to get an R object from it, so the hash of the R object isn't as meaningful as the hash of the file itself, i.e. what the pin is actually storing.
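
To make that concrete, here is a little illustration with the digest package (just to show the idea; not necessarily the exact hashing pins does internally):

df <- data.frame(x = 1:3)

csv  <- tempfile(fileext = ".csv")
json <- tempfile(fileext = ".json")
write.csv(df, csv, row.names = FALSE)
jsonlite::write_json(df, json)

## one R object, two different files, two different file hashes:
digest::digest(csv, file = TRUE)
digest::digest(json, file = TRUE)

## while the in-memory object has its own, third hash:
digest::digest(df)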

You are totally right that your use case isn't a perfect fit for the way the pins package considers a dataset's metadata as very strongly linked to a given version of the data; we don't currently support updating metadata without updating the pin as a whole. If I were in your position, I think a workaround I would consider is to add my own little file to sit beside the official metadata that lives in data.txt, something like this:

library(pins)

## you write several versions:
board <- board_temp(versioned = TRUE)
pin_write(board, sample(1:100, 10), "nice-numbers", type = "json")
#> Creating new version '20250317T214320Z-21f72'
#> Writing to pin 'nice-numbers'
pin_write(board, sample(1:100, 10), "nice-numbers", type = "json")
#> Creating new version '20250317T214320Z-e97c1'
#> Writing to pin 'nice-numbers'

## identify the version you want to add new tags to:
version <- pin_versions(board, "nice-numbers")$version[1]
version_dir <- fs::path(board$path, "nice-numbers", version)

## write some new tags in that directory:
yaml::write_yaml(
  list(scenario = c("a", "b"), pipeline = "v1"),
  fs::path(version_dir, "tags.yaml")
)

Created on 2025-03-17 with reprex v2.1.1

You could write a wrapper function for this, to do the checking appropriate for your use case to find the right version and to read/write your own separate metadata file.
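
Such a wrapper might look something like this (again a sketch, assuming a folder-based board where board$path is available, and the tags.yaml convention from above):

## hypothetical helper: merge new tags into a version's tags.yaml
pin_add_tags <- function(board, name, version, tags) {
  tags_path <- fs::path(board$path, name, version, "tags.yaml")
  old <- if (fs::file_exists(tags_path)) yaml::read_yaml(tags_path) else list()
  yaml::write_yaml(modifyList(old, tags), tags_path)
}

pin_add_tags(board, "nice-numbers", version, list(pipeline = "v2"))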

We can keep this issue open to track interest in possible changes or extensions to this package along these lines.

@juliasilge added the feature label on Mar 17, 2025
@katossky
Author

Thanks for following up. I am going to try your suggested workaround and report back.

My confusion is that when you say something like:

the pins package considers a dataset's metadata as very strongly linked to a given version of the data

... "version" does mean "as identified by hash" to me, whereas it does mean "as identified by [some mix of timestamp and hash]" to the pins package. So to me it makes sense that metadata should be added to the same version of the object (= object with same hash irrespective of time stamp). I completely understand that time-aware versioning be also needed for some users, but to me its the metadata that should bear the timestamp, not the object. Something like :

.
└── cleanedData
    ├── 3a641
    │   ├── 20250311T213702Z
    │   │   └── tags: initialTag, anotherInitialTagToBeRemoved
    │   ├── 20250315T213702Z
    │   │   └── tags: initialTag, aNewTagJustAdded
    │   └── cleanedData.parquet
    └── ed389
        ├── 20250315T213702Z
        │   └── tags: initialTag
        └── cleanedData.parquet

I feel like I see where this comes from historically (probably timestamps came first), and I am not necessarily suggesting that it change, but I believe I am not the only one confused by this behaviour; see for instance #826 or #827.


As for computing the hash from the file: my question comes from the fact that, if I understand the code correctly, you first write the object to a temporary file, then compute the hash, then move the file to its final location. That is nearly free for small objects on a local disk, but may be more expensive for massive objects and/or remote stores. It also requires writing the file to the store before even knowing whether you actually need to write it at all.
