Add ZIP explorer to import individual files from local or remote ZIP archives #20054

Merged
mvdbeek merged 102 commits into galaxyproject:dev from explore_rocrate_import on May 9, 2025

Conversation

davelopez
Contributor

@davelopez davelopez commented Apr 21, 2025

This pull request introduces a new feature for previewing the content of ZIP archives and importing individual files from them. This is useful when you only need a particular file or set of small files from the ZIP archive and don't want to upload the entire archive. It also works with remote ZIP archives, meaning you don't need to download the ZIP to your computer or upload it to Galaxy to be able to select and extract individual files and send them to Galaxy.

Features

Works with local and remote ZIP archives (with some limitations)

  • For local ZIP archives, the main limitation is that of your browser's local storage, as individual files you select from the ZIP archive are extracted and stored in the browser's local storage before being sent to Galaxy. This means that if you select a large file, it may not fit in your browser's local storage, and you will probably be better off uploading the entire ZIP archive.
  • For remote ZIP archives, the main limitation is that the remote ZIP archive must be publicly accessible. This means that if the remote ZIP archive is behind a login or requires authentication, you will not be able to preview or extract files from it. The remote ZIP archive must also be accessible via HTTPS and the host must accept byte-range requests. This is a requirement for the browser to be able to preview the ZIP archive and extract files from it without downloading the entire archive.
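The remote requirements above (HTTPS plus byte-range support) can be sketched as a small pre-flight check. This is only an illustration, not the PR's implementation: the function names are hypothetical, and it assumes the caller has already obtained the response headers from a HEAD request. Note that some servers honor range requests without advertising `Accept-Ranges`, so a real check might need to be more lenient.

```python
from urllib.parse import urlparse


def is_https_url(url: str) -> bool:
    """Return True when the URL uses HTTPS and names a host."""
    parsed = urlparse(url)
    return parsed.scheme == "https" and bool(parsed.netloc)


def accepts_byte_ranges(headers: dict) -> bool:
    """Return True when the response headers advertise byte-range support."""
    # Header names are case-insensitive, so normalize them before the lookup.
    normalized = {name.lower(): value.strip().lower() for name, value in headers.items()}
    return normalized.get("accept-ranges") == "bytes"


def can_explore_remotely(url: str, head_headers: dict) -> bool:
    """Hypothetical helper combining both requirements from the description."""
    return is_https_url(url) and accepts_byte_ranges(head_headers)
```

For example, `can_explore_remotely("https://example.org/a.zip", {"Accept-Ranges": "bytes"})` passes, while an `http://` URL or a host answering `Accept-Ranges: none` does not.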

Preview ZIP archives

  • The preview feature allows you to browse the contents of the ZIP archive before selecting files to import.
  • It recognizes some known archive structures like RO-Crate and Galaxy export archives, and will display additional information about the archive if it recognizes it.

Import Galaxy workflow files

  • The ZIP explorer will recognize Galaxy workflow files and will display them in a separate section in the preview. This allows you to easily identify and import workflow files from the ZIP archive.

How to use it

  1. Go to the upload dialog in Galaxy.
    image

  2. Click the Explore ZIP button.

  3. The ZIP Explorer wizard will be shown.

  4. Select a ZIP archive from your local file system or enter the URL of a remote ZIP archive.
    image

  5. Click the Next button to preview the contents of the ZIP archive.
    image

  6. Once the ZIP preview is loaded, you can decide which files you want to import into Galaxy in the next step.

  7. Click the Next button to select the files you want to import.
    image

  8. Before clicking Import, you will have the chance to review what elements will be imported.
    image

  9. Click Import and the selected files will be extracted from the ZIP archive and sent to Galaxy.

  10. After the upload, the files will be available in your current history, and any workflow files will be available in your workflow list, ready to run or edit.
    image

Technical details

New Remote ZIP archive extractor file source

The Remote ZIP archive extractor is a new file source that allows you to extract files from a remote ZIP archive without downloading the entire archive.

It handles URIs like:

zip://extract?source={remote_zip_url}&header_offset={header_offset}&compress_size={compress_size}&compression_method={compression_method}

Where:

  • remote_zip_url is the URL of the remote ZIP archive.
  • header_offset is the offset of the file header in the ZIP archive.
  • compress_size is the size of the compressed file in the ZIP archive.
  • compression_method is the compression method used for the file in the ZIP archive; only 0 (no compression) and 8 (deflate) are supported.
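As a rough sketch of how such a URI could be assembled and read back, the following uses only the Python standard library. The helper names are hypothetical, and percent-encoding the `source` value is an assumption made here so that a remote URL with its own query string survives the round trip; the actual file source may encode it differently.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

SUPPORTED_METHODS = (0, 8)  # 0 = stored (no compression), 8 = deflate


def build_extract_uri(remote_zip_url, header_offset, compress_size, compression_method):
    """Build a zip://extract URI (hypothetical helper, not the PR's code)."""
    if compression_method not in SUPPORTED_METHODS:
        raise ValueError(f"Unsupported compression method: {compression_method}")
    query = urlencode(
        {
            "source": remote_zip_url,  # assumption: percent-encoded by urlencode
            "header_offset": header_offset,
            "compress_size": compress_size,
            "compression_method": compression_method,
        }
    )
    return f"zip://extract?{query}"


def parse_extract_uri(uri):
    """Parse a zip://extract URI back into typed parameters."""
    parts = urlsplit(uri)
    if parts.scheme != "zip" or parts.netloc != "extract":
        raise ValueError(f"Not a zip://extract URI: {uri}")
    params = parse_qs(parts.query)
    return {
        "source": params["source"][0],
        "header_offset": int(params["header_offset"][0]),
        "compress_size": int(params["compress_size"][0]),
        "compression_method": int(params["compression_method"][0]),
    }
```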

These parameters must be known in advance, so the ZIP explorer will extract them from the ZIP archive when you preview it. The Remote ZIP archive extractor file source will then use these parameters to extract the files from the remote ZIP archive without downloading the entire archive.
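To illustrate why these three values are enough, here is a minimal sketch of extracting a single entry with two ranged reads. It is not the code from this PR: `read_range` is a hypothetical callback standing in for an HTTP `Range` request, and the remote archive is simulated with an in-memory ZIP. The layout of the 30-byte local file header comes from the ZIP format itself.

```python
import io
import struct
import zipfile
import zlib

LOCAL_HEADER_SIZE = 30
LOCAL_HEADER_SIGNATURE = b"PK\x03\x04"


def extract_entry(read_range, header_offset, compress_size, compression_method):
    """Extract one entry; read_range(start, length) must return those bytes."""
    header = read_range(header_offset, LOCAL_HEADER_SIZE)
    if header[:4] != LOCAL_HEADER_SIGNATURE:
        raise ValueError("No local file header at the given offset")
    # The last two 16-bit fields hold the file name and extra field lengths.
    name_length, extra_length = struct.unpack("<HH", header[26:30])
    data_start = header_offset + LOCAL_HEADER_SIZE + name_length + extra_length
    compressed = read_range(data_start, compress_size)
    if compression_method == 0:
        return compressed  # stored without compression
    if compression_method == 8:
        # ZIP entries are raw deflate streams, hence the negative window size.
        return zlib.decompress(compressed, -15)
    raise ValueError(f"Unsupported compression method: {compression_method}")


# Usage: simulate the remote archive with an in-memory ZIP.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("hello.txt", "hello galaxy\n" * 100)
raw = buffer.getvalue()
info = zipfile.ZipFile(io.BytesIO(raw)).getinfo("hello.txt")
content = extract_entry(
    lambda start, length: raw[start:start + length],
    info.header_offset,
    info.compress_size,
    info.compress_type,
)
```

Only the header and the compressed bytes of the selected entry are read, which is the property that makes partial downloads possible when the host supports byte ranges.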

New API endpoint for proxying requests

The new /api/proxy API endpoint allows HEAD and GET requests to be proxied to remote ZIP archives to avoid CORS issues.

New client dependency ro-crate-zip-explorer

Most of the code for the ZIP explorer is in a new TypeScript library, ro-crate-zip-explorer, that handles the preview and extraction of files from local and remote ZIP archives.

How to test the changes?

  • I've included appropriate automated tests.
  • This is a refactoring of components with existing test coverage.
  • Instructions for manual testing are as follows:
    1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

@davelopez davelopez added the area/UI-UX, kind/feature, area/API, highlight (included in user-facing release notes at the top), and area/backend labels Apr 21, 2025
@jmchilton
Member

Really beautiful - I'm in awe!

@davelopez davelopez force-pushed the explore_rocrate_import branch from d8351ae to e2468c2 Compare April 23, 2025 09:31
@nuwang
Member

nuwang commented Apr 23, 2025

This is really awesome, nice work! I have an alternative suggestion for the UI. Rather than having "Explore Zip" as a top-level option, would it be feasible to have it as a contextual option which appears as you drill down? E.g., you browse into a remote file source, and if it's a zip file, you get an option to drill-down into the zip file and select nested files. Uploading a local file would work in the same way - if the file is detected to be a zip - there's a contextual option to drill-down into the file. I don't know how workflow imports would be handled, but presumably in a similar way?

This would reduce the number of top-level options which may not be relevant to the user unless they are specifically browsing a zip file. Not sure whether this is feasible, but just putting it out there for discussion.

@davelopez
Contributor Author

Thanks for the suggestions @nuwang! I appreciate it!

if the file is detected to be a zip

Unfortunately, in my testing, this is really hard to tell. From most URLs you cannot know whether the file at the end is a ZIP file: most responses don't specify the correct MIME type and just use "application/octet-stream", or they go through redirects until you reach the actual file, etc. And inspecting the last 65 KB of every URL to search for the ZIP EOCD would be a bit too much 😞
On the other hand, this feature was intended to make use of the metadata in "special" kinds of ZIP files, RO-Crates, and Galaxy exports. The main idea is to present to the user this rich information about a particular analysis exported using these structured ZIPs; the fact that it can be used with any type of ZIP is just an added advantage. I know RO-Crate exports in Galaxy don't have much useful metadata yet, but this may change over time, and other external publications packaged in RO-Crate have this rich information available already.

I see the value in tapping into this functionality for file sources for drilling down on ZIPs, but to me, that would be an "additional" feature, not a replacement for this UI.

I also understand the concern about increasing the number of top-level options, and I already simplified the first approach, which took a whole section in the Uploader, down to just a single button 😅 Anyway, I'm happy to look for alternatives to reduce the number of options not relevant to the user, but I don't know how to make this discoverable for the user and explicit (since we cannot rely on auto-detection of ZIPs) at the same time 🤔
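For readers unfamiliar with the EOCD search mentioned in this comment, a minimal sketch follows. The End of Central Directory record is at least 22 bytes and may be followed by a comment of up to 65,535 bytes, which is where the "last 65 KB" figure comes from. The function names are hypothetical, ZIP64 archives are not handled, and the tail is taken from an in-memory ZIP rather than a real ranged request.

```python
import io
import struct
import zipfile

EOCD_SIGNATURE = b"PK\x05\x06"
EOCD_MIN_SIZE = 22
MAX_COMMENT_SIZE = 0xFFFF


def tail_window(total_size):
    """Return (start, length) of the byte range that can contain the EOCD."""
    length = min(total_size, EOCD_MIN_SIZE + MAX_COMMENT_SIZE)
    return total_size - length, length


def find_eocd(tail):
    """Return (entry_count, central_dir_size, central_dir_offset) or None."""
    position = tail.rfind(EOCD_SIGNATURE)
    while position != -1:
        record = tail[position:position + EOCD_MIN_SIZE]
        if len(record) == EOCD_MIN_SIZE:
            fields = struct.unpack("<4sHHHHIIH", record)
            total_entries, cd_size, cd_offset, comment_length = fields[4:]
            # A genuine record ends exactly at the end of the file.
            if position + EOCD_MIN_SIZE + comment_length == len(tail):
                return total_entries, cd_size, cd_offset
        # Otherwise the signature was a coincidence; keep searching backwards.
        position = tail.rfind(EOCD_SIGNATURE, 0, position)
    return None


# Usage: build a small archive and inspect only its tail.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("a.txt", "first")
    archive.writestr("b.txt", "second")
raw = buffer.getvalue()
start, length = tail_window(len(raw))
eocd = find_eocd(raw[start:start + length])
```

The cost is one extra ranged request per candidate URL, which is the overhead the comment is weighing.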

@nuwang
Member

nuwang commented Apr 24, 2025

@davelopez Thanks for clarifying. If the main purpose is to explore rich archive files, what do you think of "Explore Archive" as an alternative name? (Assuming future archive formats might include tar archives or formats altogether different to zip).

@yvanlebras
Contributor

Youhouhou ! So amazing!!! Really amazing and important need!!!

@Marie59
Contributor

Marie59 commented Apr 25, 2025

Waouh this is super useful !! Can't wait to use it !

@davelopez davelopez force-pushed the explore_rocrate_import branch 2 times, most recently from 2d4cd0c to d84f4fe Compare April 25, 2025 16:05
@davelopez davelopez force-pushed the explore_rocrate_import branch from d84f4fe to 99c9ef0 Compare April 27, 2025 15:56
@davelopez davelopez marked this pull request as ready for review April 28, 2025 15:39
@github-actions github-actions bot added this to the 25.0 milestone Apr 28, 2025
@davelopez
Contributor Author

This is finally ready for review. I added a few API and Selenium tests.

I've changed the terminology in some places to refer to "compressed archives" instead of ZIP in case more compressed formats are supported in the future, and updated the screenshot in the PR description.

@davelopez
Contributor Author

Test failures are unrelated

@mvdbeek
Member

mvdbeek commented Apr 29, 2025

On the other hand, this feature was intended to make use of the metadata in "special" kinds of ZIP files, RO-Crates, and Galaxy exports. The main idea is to present to the user this rich information about a particular analysis exported using these structured ZIPs; the fact that it can be used with any type of ZIP is just an added advantage

Wouldn't it be more useful to add this functionality right into those places (i.e. history and invocation imports ?).

  • This means that if you select a large file, it may not fit in your browser's local storage, and you will probably be better off uploading the entire ZIP archive.

This seems bad, is there no way to make this a lazy process so we only send chunks ? If this is not possible I think I'd rather we don't add the local zip functionality (or limit it to something like 100MB ?) in favor of not killing the user's tab (or is it even the whole browser ?). From a UX perspective I think the remote file exploration is a much bigger win.

  • For remote ZIP archives, the main limitation is that the remote ZIP archive must be publicly accessible. This means that if the remote ZIP archive is behind a login or requires authentication, you will not be able to preview or extract files from it. The remote ZIP archive must also be accessible via HTTPS and the host must accept byte-range requests.

Could we hook remote file sources into this ? That would be one way to handle non-public data, with a possible future extension making it possible to prompt users for authentication ?

@davelopez
Contributor Author

Wouldn't it be more useful to add this functionality right into those places (i.e. history and invocation imports ?).

I thought about this briefly, but then I thought that users who want to use that import functionality are likely to want to import the whole thing all the time rather than just a single file 🤔
In any case, I plan to add the preview display feature to the import Invocation and History pages in a follow-up if that seems useful (I'll create another issue).

This seems bad, is there no way to make this a lazy process so we only send chunks ? If this is not possible I think I'd rather we don't add the local zip functionality (or limit it to something like 100MB ?) in favor of not killing the user's tab (or is it even the whole browser ?). From a UX perspective I think the remote file exploration is a much bigger win.

I can try to limit the file size to 100MB and then also add some kind of warning or make files above this size non-selectable. Not sure about the chunking... I'll need to investigate more.
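The size limit discussed here could look roughly like the sketch below. It is written in Python for illustration only, since the real check would live in the client; the helper name is hypothetical, and it assumes that "100MB" means 100 MiB measured on the uncompressed size, because that is what has to fit in the browser.

```python
import io
import zipfile

MAX_LOCAL_FILE_SIZE = 100 * 1024 * 1024  # assumption: 100 MiB, uncompressed


def split_by_size(archive, limit=MAX_LOCAL_FILE_SIZE):
    """Return (selectable, too_large) entry names for an open ZipFile."""
    selectable, too_large = [], []
    for info in archive.infolist():
        if info.is_dir():
            continue  # directories carry no data to extract
        if info.file_size <= limit:
            selectable.append(info.filename)
        else:
            too_large.append(info.filename)
    return selectable, too_large


# Usage with a tiny limit so the example stays small.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("small.txt", "x" * 10)
    archive.writestr("big.txt", "x" * 1000)
with zipfile.ZipFile(io.BytesIO(buffer.getvalue())) as archive:
    selectable, too_large = split_by_size(archive, limit=100)
```

Entries in `too_large` could then be shown with a warning or rendered as non-selectable, as suggested in the comment.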

Could we hook remote file sources into this ? That would be one way to handle non-public data, with a possible future extension making it possible to prompt users for authentication ?

We could try to do that, but I'm not sure how right now. If so, we will only allow exploring archives that are exclusively in those file sources and not arbitrarily hosted somewhere?

@mvdbeek
Member

mvdbeek commented Apr 29, 2025

I thought about this briefly, but then I thought that users who want to use that import functionality are likely to want to import the whole thing all the time rather than just a single file

makes sense

On the other hand, this feature was intended to make use of the metadata in "special" kinds of ZIP files, RO-Crates, and Galaxy exports.

That's not part of the PR now though, right ? It would certainly be pretty cool if we can select which datasets from a history/invocation we materialize.

Is @nuwang's suggestion off the table ? What would be the problem with attempting to fetch the last bytes of pasted URLs ? Aren't you doing that in the component anyway ?

@davelopez
Contributor Author

Sorry for the verbosity, I'll try to explain my reasoning 😅

That's not part of the PR now though, right ? It would certainly be pretty cool if we can select which datasets from a history/invocation we materialize.

I might be using the word "metadata" too loosely 😅

The Galaxy "metadata" coming from the datasets_attrs.txt is being read and used to import the files, but only "partially", as only the "name" and "description" are passed to the fetch call for each dataset. I decided to pass only the "name" and "description" because I was trying to find a "common set of metadata" that may exist in other structured formats (like non-Galaxy RO-Crates). But we could add as much metadata as the fetch API allows in the case of Galaxy exports, if that is more useful.

This is all happening in the client code (in client/src/composables/zipExplorer.ts), not in this "RemoteZipFilesSource".
In particular, the dataset metadata is collected in the GalaxyZipExplorer class, and the file source is only used to construct the URL that we pass to the fetch endpoint to perform the upload of the actual remote file.

Is @nuwang's suggestion off the table ? What would be the problem with attempting to fetch the last bytes of pasted URLs ? Aren't you doing that in the component anyway ?

Maybe I did not completely understand Nuwang's suggestion. What I understood is more like the other kind of file source we discussed earlier. From an existing file source (S3, posix, Google Drive, Dropbox, etc.) we could add an extension to detect a ZIP or compressed archive, display it as a folder, and drill down from there, selecting individual files. That is completely on the table, but probably not in this PR, since it serves a different use case and can be a project on its own.

The use case I used for designing this is: "The user has a URL to a remote ZIP file containing an RO-Crate and wants to import some files (maybe workflows) from it without having to download or upload the entire archive. It would also be helpful to show a preview with the RO-Crate manifest information before importing."

From the suggestion, the part that I did not understand well is how I can handle the use case of a "random URL pointing to a ZIP" using the File Source approach. In the case of a file inside another file source, then yes, we could simply check that the file name ends with ".zip", change the icon to a folder, and then attempt to read the ZIP EOCD on drill-down and so on. Still, I don't know how to handle or preserve the state in the file source while navigating the ZIP without having to request the EOCD each time, but hopefully we could figure out some caching mechanism; otherwise it sounds a bit too inefficient 🤔

@mvdbeek
Member

mvdbeek commented Apr 30, 2025

I understood @nuwang's comment as making this available everywhere (no problem if file sources is out of scope currently) instead of requiring the top level button. So if you paste a url we could just check to see if it is a zip file and if so offer the wizard ?

@davelopez
Contributor Author

Oh! I see now... so just hide the button and on URL paste make the button visible again or something similar?
Ok, I'll see what I can do 👍

@davelopez davelopez marked this pull request as draft May 4, 2025 21:08
@davelopez davelopez force-pushed the explore_rocrate_import branch from 726c818 to 79e2dec Compare May 5, 2025 22:51
davelopez and others added 21 commits May 8, 2025 23:57
…omponents

So other formats might be supported in the future.
From "zip archive" to "compressed archive" for consistency
Co-authored-by: Marius van den Beek <[email protected]>
Move isValidUrl and getProxiedUrl to module
Co-authored-by: Marius van den Beek <[email protected]>
Do not reopen if the same zip is "explored" again and already open
@davelopez davelopez force-pushed the explore_rocrate_import branch from 848156a to eb36e4b Compare May 8, 2025 22:00
@davelopez davelopez force-pushed the explore_rocrate_import branch from eb36e4b to 8c4fa1c Compare May 8, 2025 22:22
@davelopez
Contributor Author

Rebased, conflict resolved, and "explore" button style updated according to the poll result:

image

@mvdbeek mvdbeek merged commit 9fb0028 into galaxyproject:dev May 9, 2025
51 of 58 checks passed
@mvdbeek
Member

mvdbeek commented May 9, 2025

Thank you @davelopez!

@davelopez davelopez deleted the explore_rocrate_import branch May 9, 2025 10:15
@jmchilton jmchilton mentioned this pull request May 14, 2025
2 tasks