
Added format validation to retrieved data manifest #349


Closed

Conversation

afshinmhCLG

Issue: currently, retrieval of the manifest is considered successful when the HTTP call does not return an error response. However, there are no checks on the implied format of the manifest (e.g. data.txt).

Proposed solution: rather than relying only on the HTTP call’s success, an additional verification of the “implied” format of the manifest is added (i.e. checking whether “path” is an element of the retrieved manifest).

Use case: this is being proposed in support of making a custom board for LabKey, an open-source data platform used in clinical research settings (see https://www.labkey.org). When data.txt does not exist, the LabKey API, rather than returning an error response, returns a success response and encodes the missingness of data.txt in the body of the response. As it is, casting the custom board_labkey as datatxt fails because the call to retrieve the manifest returns successfully, yet the body of the message lacks the anticipated structure of data.txt. By making the implicit assumption about the structure of the manifest explicit, board_datatxt remains functional (e.g. for the S3 board, which I have tested) and also adds utility for LabKey (and possibly other similar APIs) that encode missing files in the body of the response.

I also intend to submit a PR for board_labkey.

Thank you!

@hadley
Member

hadley commented Mar 9, 2021

Hi @afshinmhCLG — I'm taking over maintenance of pins and, since I'm still coming up to speed with it, I'd really appreciate it if you could give me a bit more context. When does this cause a problem? Can you generate a reprex, even if it's something that I can't run locally?

@afshinmhCLG
Author

afshinmhCLG commented Mar 9, 2021

Hi @hadley ,
Thank you for the follow-up! Let me first provide some additional context; I will also get back to you with a reprex (I am struggling a bit with how to provide one given the API dependency). Let's say we are starting a board of type datatxt, such as an S3 board. In this case, the existing workflow proceeds as follows:

  1. Retrieve any existing data manifest metadata, data.txt, from the S3 location.
  2. Since the file doesn't exist yet, the S3 API returns an error.
  3. The error is caught in the response and is taken as an indication that data.txt does not exist.
  4. It then proceeds to create data.txt, and the subsequent steps in the process complete successfully.

The above works perfectly. Now, if you are dealing with a different API (e.g. LabKey) where "file doesn't exist" is not encoded as an error but rather the missingness is encoded in the body of the message, step 3 above passes (i.e. no error is caught). The workflow then wrongly assumes successful retrieval of a data.txt and proceeds to parse the content according to the anticipated structure of data.txt. Since the content is not that of a data.txt file (it is just text indicating missingness), we encounter an error.
The proposed solution adds a check to step 3 such that, if no error is observed, it additionally verifies that the anticipated contents (e.g. a path) can be retrieved; if they can't, it does not accept the retrieved content as a valid data.txt and proceeds to behave as if an error had been caught in step 3.
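To make this concrete, here is a minimal sketch of the kind of check being proposed. The helper name is hypothetical and this is not the actual pins internals, just an illustration assuming the manifest body is YAML text:

```r
# Hypothetical helper, not pins internals: only accept the retrieved content
# as a data.txt manifest if it parses to a list that exposes a "path" element.
library(yaml)

looks_like_datatxt <- function(body) {
  dt <- tryCatch(yaml::yaml.load(body), error = function(e) NULL)
  is.list(dt) && "path" %in% names(dt)
}

# Anything else (e.g. a LabKey "file not found" message in the body) would
# then be treated exactly like the error branch in step 3.
```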

@hadley
Member

hadley commented Mar 9, 2021

Why not make the wrapper around LabKey throw an error? I don't think it's pins' responsibility to figure out whether the body of an HTTP response represents an error or not (indeed, I have no idea how you could do that, because errors could be represented in any number of ways).
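For illustration only, such a wrapper might look like the sketch below; labkey_get_file() and the response field encoding missingness are hypothetical stand-ins, since the actual LabKey API is not part of this thread:

```r
# Hypothetical sketch of a LabKey-side wrapper: the LabKey-specific code, not
# pins, decides that a "successful" response actually means the file is
# missing, and signals an ordinary R error in that case.
read_labkey_datatxt <- function(conn, path) {
  resp <- labkey_get_file(conn, path)  # hypothetical LabKey API call
  if (isTRUE(resp$file_missing)) {     # hypothetical field encoding missingness
    stop("data.txt not found on LabKey board", call. = FALSE)
  }
  resp$body
}
```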

@afshinmhCLG
Author

Thanks @hadley!
Absolutely, a wrapper is a possibility, and thinking about the scope of the pins package makes a lot of sense as well.
What I want to highlight is that the suggested change makes explicit what is already implicit in the code, and I am wondering if it is a good change for that reason alone. Right now, the workflow does something like:

  1. get (from the board API) data.txt -> dt
  2. if error from API, assume no data.txt exists on board, create one and post it to the board
  3. else access dt$path

The PR simply augments step 2 to make sure that dt is indeed a list containing the element "path" before calling dt$path in step 3 (see the sketch below). No other checking is done, as you are absolutely right: "errors could be represented in any number of ways".
In other words, I see this as a general improvement to the code, but one that also makes pins easier to extend out of the box to APIs similar to LabKey (by the way, this extensibility is one of the best features of pins).
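As a sketch of what that augmented branch might look like (the board_get_datatxt() and board_create_datatxt() names are hypothetical, not the actual pins code):

```r
# Hypothetical sketch of steps 1-3 with the augmented check in step 2:
# both an API error and a "success" body lacking the expected "path"
# element are treated as "no data.txt exists yet on the board".
dt <- tryCatch(board_get_datatxt(board), error = function(e) NULL)  # step 1 (hypothetical getter)

if (is.null(dt) || !is.list(dt) || !("path" %in% names(dt))) {
  dt <- board_create_datatxt(board)  # step 2 (hypothetical): create and post a fresh data.txt
}

paths <- dt$path                     # step 3: now safe to access
```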
Thanks again!

@hadley
Member

hadley commented May 18, 2021

Now that I understand the datatxt format, I have unfortunately determined that it is not a good fit for pins' needs. The problem is maintaining a single central list of pins — if two pins are written at the same time, there is a substantial chance of a race condition so that only one pin will be recorded in the site metadata. This means that I don't think it's worthwhile to patch up smaller bugs in (what is now) legacy_datatxt(). I will hopefully be able to start writing a new board_s3() in the near future, and that is something you could use to extend pins if needed. (That said, I don't think extending pins right now is a good idea because the interface is very new.)

@hadley hadley closed this May 18, 2021
@afshinmhCLG
Author

Thanks @hadley! Is support for extending pins part of the vision for pins going forward? IMHO, it is the single most valuable feature of pins. Picture large organizations with lots of historical places to store data objects, from well-established cloud-based solutions like S3 to lesser-known data platforms like LabKey. Try to move your data workflow from one to another and you are left dealing with API differences; here comes pins with a uniform API, so you can move between all of them seamlessly and expect the same behavior (as far as versioning, etc.).

@hadley
Member

hadley commented May 24, 2021

Yes, definitely, I just want to make sure the extension interface is solid before encouraging others to use it.

@github-actions

This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Aug 28, 2022