## Description

This issue is used to track my GSoC progress.
Original proposal:
## 1. Introducing headless via Rod

### Current Status
Currently, Zeno does not have headless archiving capabilities. Two years ago, in [PR #55](#55), there was an attempt to use [Rod](https://go-rod.github.io/) to add headless/headful capabilities to Zeno.

Rod is a high-level driver for the DevTools Protocol. It is widely used for web automation and scraping, and can automate most things in the browser that can be done manually.

However, #55 was put on hold due to concerns that the Chrome DevTools Protocol (CDP) could internally manipulate network data (e.g., modifying HTTP headers, transparently decompressing payloads).
### Preliminary Research
After a cursory look at Rod's request-hijacking code, it appears that Rod's hijacking functionality works outside of CDP. As a double-check, I asked the developers of Rod a few weeks ago and got confirmation that this is indeed the case, and that it is possible for an external `http.Client` (from our WARC lib) to take full control of the headless browser's network requests.
Hijacking requests and responding with `ctx.LoadResponse()`:

* As documented here: <https://go-rod.github.io/#/network/README?id=hijack-requests>
* The `http.Client` passed to `ctx.LoadResponse()` operates outside of CDP.
* This means the `http.Client` has complete control over the network request/response, allowing access to the original, unprocessed data.
* The flow looks like this: `browser --request-> rod ---> server ---> rod --response-> browser`
Thus, Rod can be safely integrated into Zeno.
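A minimal sketch of this hijack flow, assuming Rod's documented `HijackRequests`/`LoadResponse` API (`warcHTTPClient` is a hypothetical stand-in for the client our WARC lib would provide; this fragment needs a local Chromium and is illustrative only, not runnable on its own):

```
browser := rod.New().MustConnect()
router := browser.HijackRequests()
router.MustAdd("*", func(ctx *rod.Hijack) {
	// warcHTTPClient performs the request itself, entirely outside
	// of CDP, and the response is then fed back to the browser.
	_ = ctx.LoadResponse(warcHTTPClient, true)
})
go router.Run()
```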
### Plan

Implement the headless feature in Zeno v2.

- In headless mode, asset post-processing is disabled while outlink extraction is maintained.
- It will be configurable which URL/domain patterns use headless mode and which don't.
Known limitations:

- CDP limitations may result in incomplete HTTP header information for requests. When requests are initiated via JS, CDP provides incomplete header information, so the request information (e.g., HTTP headers) that Rod gets from CDP is already missing some fields. This could affect the archiving of sites that use JavaScript to set custom headers and rely on those headers for correct backend responses.
- WebSocket requests cannot be hijacked by CDP/Rod yet.
## 2. Better CSS extractor

### Current Status
- **Inline CSS**: Currently Zeno parses inline CSS inside HTML and tries to extract the values of the `url()` and `<string>` tokens in the CSS. For example:

  ```css
  background: url(http://example.com/background.jpg);
  background: url('http://example.com/background.jpg');
  @import "http://example.com/style.css";
  ```

  But currently Zeno only uses two simple regexes to do the extraction:

  ```go
  urlRegex = regexp.MustCompile(`(?m)url\((.*?)\)`)
  backgroundImageRegex = regexp.MustCompile(`(?:\(['"]?)(.*?)(?:['"]?\))`)
  ```

  This causes Zeno to often parse out a lot of non-existent relative paths from the inline CSS of a web page. For example, `12, 34, 56` in the parentheses of the CSS color `rgb(12, 34, 56)` will be matched by `backgroundImageRegex`, which leads to Zeno crawling a bunch of URL assets with annoying 404s.
- **Separate CSS files are not supported**: Zeno doesn't support parsing individual CSS files.
### Preliminary Research
- `url()`, `src()`, and `@import <string>` in CSS can generate web requests ([CSS Values 4 §Security](https://www.w3.org/TR/css-values-4/#security)). `src()` is not yet implemented by any browser and can be ignored at this stage ([cssdb](https://cssdb.org/#src-function)). So the only tokens through which CSS can generate web requests (without JS) are `url()` and `@import <string>`.
- CSS allows custom properties. Would anyone realistically put the URL of a page resource in a CSS property and then fetch the value with JS to use it? I searched GitHub Code Search for `/url =. *getPropertyValue\(/ AND (language:JavaScript OR language:TypeScript OR language:HTML)`. It turns out that there is indeed a bunch of real-world code that puts custom `<string>` URLs into CSS and then fetches the value in JS. This weird coding paradigm is widespread, so I think Zeno also needs to extract URLs from plain CSS `<string>` values that start with `https`, `http`, or `//`.
- There are two types of `url()`, unquoted and quoted, which are parsed differently and have their own escaping rules ([CSS Values 4 §url()](https://www.w3.org/TR/css-values-4/#funcdef-url)).
- The `@import <string>` should be parsed like a quoted `url("")` ([CSS Cascade 5 §@import](https://www.w3.org/TR/css-cascade-5/#at-ruledef-import)).
After reading the CSS standard, I started looking for existing open-source Go CSS parsing libraries, and found that none of them currently do fine-grained extraction of the actual value of `url()`/`<string>` tokens; they are all lexer/tokenizer slicing libraries and are not very usable for this. For example, https://github.com/tdewolff/parse/ can only extract the whole token like `url( "http://a\"b.c" )`; it can't extract the `http://a"b.c` value.
### Plan

- Develop a dedicated parser to handle `url()` and `<string>` token values, focusing on escapes, newlines, and spaces. Integrate this parser with an existing open-source CSS parser.
- Add support for extracting URLs from separate CSS files. (The CSS standard specifies that in a separate CSS file, relative URLs should be resolved against the location of the CSS file itself, not the location of the HTML document.)
## 3. Check Content-Length before writing WARC

### Current Status
When a download has already started and disk space runs low, workers in the middle of a download will continue until it finishes, although diskWatcher will signal workers to pause processing new items.

If the total size of the resources being downloaded by workers at the moment diskWatcher signals a pause is greater than the `--min-space-required` threshold, we will likely run out of storage, which can have unintended consequences.
### Plan

- Implement a `--max-content-length` parameter to check the `Content-Length` header before downloading. If the header exceeds the limit, the download will be skipped. For streaming content of unknown size, the download will be cancelled once the downloaded size exceeds `--max-content-length` (and a seencheck will be added to prevent duplicate downloads).
- Implement a `--min-space-urgent` parameter to cancel in-progress downloads when disk space becomes critically low.
## 4. Creating a Dummy Test Site for Zeno

### Current Status
Zeno is a complex concurrent system with many components working together. Due to the lack of a proper test suite (we certainly can't use a real website for testing), there are currently only unit tests for a few components, and no integration or E2E tests.

In a local development environment, it is difficult to trigger or reproduce problems (such as memory leaks) that only occur under production conditions such as high concurrency, long running times, high packet loss, and complex web pages.

The idea is to develop a dummy test-site backend for testing Zeno, so that we can run "free" long, high-bandwidth stress tests of Zeno locally, as well as bring in real-world web sites and file formats for integration or end-to-end testing. The concept is similar to https://httpbin.org/.
### Plan

For example:

- `/random/file.bin?size=2G&stream=true&chunked=1024` would return a streamed file of unknown size for Zeno.
- `/html/seed-0/depth-3-size-2` generates 3 layers of outlinked HTML to test Zeno's HTML outlink extraction and hop capabilities:
  - `/html/seed-00/depth-2-size-2`
    - `/html/seed-001/depth-1-size-2`
    - `/html/seed-002/depth-1-size-2`
  - `/html/seed-01/depth-2-size-2`
    - `/html/seed-010/depth-1-size-2`
    - `/html/seed-011/depth-1-size-2`
- `/extractor/xml/rss.xml?articles=5&podcast=true` will generate an RSS feed with 5 audio items.
The real web is complex, and as Zeno supports more sites and file types, the test suite will need to add test cases accordingly, which will be a long-term project.
## 5. Enable HQ Control of Zeno Clients via WebSocket

### Current Status
There is WebSocket communication between Zeno and HQ, but it is currently only used for Zeno to send one-way heartbeat packets to HQ.
### Solution
Make it possible for Zeno to listen for control messages sent by HQ over the WebSocket, so that HQ can control the Zeno nodes connected to it, e.g. pause/start/stop a Zeno node.