
Added format validation to retrieved data manifest #349


Closed

Conversation

afshinmhCLG

Issue: currently, retrieval of the manifest is considered successful when the HTTP call does not return an error response. However, there are no checks on the implied format of the manifest (e.g. data.txt).

Proposed solution: rather than relying only on the HTTP call’s success, an additional verification of the “implied” format of the manifest is added (i.e. checking whether “path” is an element of the retrieved manifest).

Use case: this is being proposed in support of making a custom board for LabKey, an open-source data platform used in clinical research settings (see https://www.labkey.org). When data.txt does not exist, the LabKey API, rather than returning an error response, returns a success response and encodes the missingness of data.txt in the body of the response. As it is, casting the custom board_labkey as datatxt fails because the call to retrieve the manifest returns successfully, yet the body of the message lacks the anticipated structure of data.txt. By making the implicit assumption about the structure of the manifest explicit, board_datatxt remains functional (e.g. for the S3 board, which I have tested) and also adds utility for LabKey (and possibly other similar APIs) that encode missing files in the body of the response.

I also intend to submit a PR for board_labkey.

Thank you!

@hadley
Member

hadley commented Mar 9, 2021

Hi @afshinmhCLG — I'm taking over maintenance of pins and, since I'm still coming up to speed with it, I'd really appreciate it if you could give me a bit more context. When does this cause a problem? Can you generate a reprex, even if it's something that I can't run locally?

@afshinmhCLG
Author

afshinmhCLG commented Mar 9, 2021

Hi @hadley ,
Thank you for the follow-up! Let me first provide some additional context; I will also get back to you with a reprex (I am struggling a bit with how to provide one given the API dependency). Let's say we are starting a board of type datatxt, such as an S3 board. In this case, the existing workflow proceeds as follows:

  1. Retrieve any existing data manifest metadata, data.txt, from the S3 location.
  2. Since the file doesn't exist yet, the S3 API returns an error.
  3. The error is caught in the response and is taken as an indication that data.txt does not exist.
  4. It then proceeds to create data.txt, and the subsequent steps in the process complete successfully.

The above works perfectly. Now, if you are dealing with a different API (e.g. LabKey) where "file doesn't exist" is not encoded as an error but rather the missingness is encoded in the body of the message, step 3 above passes (i.e. no error is caught). The workflow then wrongly assumes successful retrieval of a data.txt and proceeds to parse the content according to the anticipated structure of data.txt. Since the content is not that of a data.txt file (it is just text indicating missingness), we encounter an error.
The proposed solution adds a check to step 3 such that, if no error is observed, it additionally verifies that the anticipated contents (e.g. a path) can be retrieved; if they can't, it does not accept the retrieved content as a valid data.txt and proceeds to behave as if an error had been caught in step 3.
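To make this concrete, here is a minimal sketch of the kind of check being proposed. The helper name is hypothetical and this is not the actual pins internals, just an illustration assuming the manifest body is YAML text:

```r
# Hypothetical helper, not pins internals: only accept the retrieved content
# as a data.txt manifest if it parses to a list that exposes a "path" element.
library(yaml)

looks_like_datatxt <- function(body) {
  dt <- tryCatch(yaml::yaml.load(body), error = function(e) NULL)
  is.list(dt) && "path" %in% names(dt)
}

# Anything else (e.g. a LabKey "file not found" message in the body) would
# then be treated exactly like the error branch in step 3.
```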

@hadley
Member

hadley commented Mar 9, 2021

Why not make the wrapper around LabKey throw an error? I don't think it's pins' responsibility to figure out whether the body of an HTTP response represents an error or not (indeed, I have no idea how you could do that, because errors could be represented in any number of ways).
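For illustration only, such a wrapper might look like the sketch below; labkey_get_file() and the response field encoding missingness are hypothetical stand-ins, since the actual LabKey API is not part of this thread:

```r
# Hypothetical sketch of a LabKey-side wrapper: the LabKey-specific code, not
# pins, decides that a "successful" response actually means the file is
# missing, and signals an ordinary R error in that case.
read_labkey_datatxt <- function(conn, path) {
  resp <- labkey_get_file(conn, path)  # hypothetical LabKey API call
  if (isTRUE(resp$file_missing)) {     # hypothetical field encoding missingness
    stop("data.txt not found on LabKey board", call. = FALSE)
  }
  resp$body
}
```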

@afshinmhCLG
Author

Thanks @hadley!
Absolutely, a wrapper is a possibility, and thinking about the scope of the pins package makes a lot of sense as well.
What I want to highlight is that the suggested change makes explicit what is already implicit in the code, and I am wondering if it is a good change for that reason alone. Right now, the workflow does something like:

  1. get (from the board API) data.txt -> dt
  2. if error from API, assume no data.txt exists on board, create one and post it to the board
  3. else access dt$path

The PR simply augments step 2 to make sure that dt is indeed a list containing the element "path" before calling dt$path in step 3 (see the sketch below). No other checking is done, as you are absolutely right: "errors could be represented in any number of ways".
In other words, I see this as a general improvement to the code, but one that also makes pins easier to extend out of the box to APIs similar to LabKey (by the way, this extensibility is one of the best features of pins).
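As a sketch of what that augmented branch might look like (the board_get_datatxt() and board_create_datatxt() names are hypothetical, not the actual pins code):

```r
# Hypothetical sketch of steps 1-3 with the augmented check in step 2:
# both an API error and a "success" body lacking the expected "path"
# element are treated as "no data.txt exists yet on the board".
dt <- tryCatch(board_get_datatxt(board), error = function(e) NULL)  # step 1 (hypothetical getter)

if (is.null(dt) || !is.list(dt) || !("path" %in% names(dt))) {
  dt <- board_create_datatxt(board)  # step 2 (hypothetical): create and post a fresh data.txt
}

paths <- dt$path                     # step 3: now safe to access
```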
Thanks again!

@hadley
Member

hadley commented May 18, 2021

Now that I understand the datatxt format, I have unfortunately determined that it is not a good fit for pins' needs. The problem is maintaining a single central list of pins — if two pins are written at the same time, there is a substantial chance of a race condition so that only one pin will be recorded in the site metadata. This means that I don't think it's worthwhile to patch up smaller bugs in (what is now) legacy_datatxt(). I will hopefully be able to start writing a new board_s3() in the near future, and that is something you could use to extend pins if needed. (That said, I don't think extending pins right now is a good idea because the interface is very new.)

@hadley hadley closed this May 18, 2021
@afshinmhCLG
Author

Thanks @hadley! Is support for extending pins part of the vision for pins going forward? IMHO, it is the single most valuable feature of pins. Picture large organizations with lots of historical places to store data objects, from well-established cloud-based solutions like S3 to lesser-known data platforms like LabKey. Try to move your data workflow from one to another and you are left dealing with API differences; here comes pins with a uniform API, so you can move between all of them seamlessly and expect the same behavior (as far as versioning, etc.).

@hadley
Member

hadley commented May 24, 2021

Yes, definitely, I just want to make sure the extension interface is solid before encouraging others to use it.

@github-actions

This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Aug 28, 2022