## Description

This issue is used to track my GSoC progress.
Original proposal:
## 1. Introducing headless via Rod

### Current Status
Currently, Zeno does not have headless archiving capabilities. Two years ago, in [PR #55](#55), there was an attempt to use [Rod](https://go-rod.github.io/) to add headless/headful capabilities to Zeno.

Rod is a high-level driver for the DevTools Protocol. It is widely used for web automation and scraping, and can automate most things in the browser that can be done manually.

However, #55 was put on hold due to concerns that the Chrome DevTools Protocol (CDP) could internally manipulate network data (e.g., modifying HTTP headers, transparently decompressing payloads).
### Preliminary Research
After a cursory look at Rod's request-hijacking code, it appears that Rod's hijacking functionality works outside of CDP. As a double-check, I asked the developers of Rod a few weeks ago and got confirmation that this is indeed the case, and that it is possible for an external `http.Client` (from our WARC lib) to take full control of the headless browser's network requests.
Hijacking requests and responding with `ctx.LoadResponse()`:

* As documented here: <https://go-rod.github.io/#/network/README?id=hijack-requests>
* The `http.Client` passed to `ctx.LoadResponse()` operates outside of CDP.
* This means the `http.Client` has complete control over the network request/response, allowing access to the original, unprocessed data.
* The flow looks like this: `browser --request-> rod ---> server ---> rod --response-> browser`
Thus, Rod can be safely integrated into Zeno.
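A minimal sketch of this hijack flow, assuming Rod's documented `HijackRequests`/`LoadResponse` API (`warcHTTPClient` is a hypothetical stand-in for the client our WARC lib would provide; this fragment needs a local Chromium and is illustrative only, not runnable on its own):

```
browser := rod.New().MustConnect()
router := browser.HijackRequests()
router.MustAdd("*", func(ctx *rod.Hijack) {
	// warcHTTPClient performs the request itself, entirely outside
	// of CDP, and the response is then fed back to the browser.
	_ = ctx.LoadResponse(warcHTTPClient, true)
})
go router.Run()
```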
### Plan

Implement the headless feature in Zeno v2.

- In headless mode, asset post-processing is disabled while outlink extraction is maintained.
- It will be configurable which URL/domain patterns use headless mode and which don't.
Known limitations:

- CDP limitations may result in incomplete HTTP header information for requests. When requests are initiated via JS, CDP provides incomplete header information, so the request information (e.g., HTTP headers) that Rod gets from CDP is already missing some fields. This could affect the archiving of sites that use JavaScript to set custom headers and rely on those headers for correct backend responses.
- WebSocket requests cannot be hijacked by CDP/Rod yet.
## 2. Better CSS extractor

### Current Status
- **Inline CSS**: Currently Zeno parses inline CSS inside HTML and tries to extract the values of the `url()` and `<string>` tokens in the CSS. For example:

  ```css
  background: url(http://example.com/background.jpg);
  background: url('http://example.com/background.jpg');
  @import "http://example.com/style.css";
  ```

  But currently Zeno only uses two simple regexes to do the extraction:

  ```go
  urlRegex = regexp.MustCompile(`(?m)url\((.*?)\)`)
  backgroundImageRegex = regexp.MustCompile(`(?:\(['"]?)(.*?)(?:['"]?\))`)
  ```

  This causes Zeno to often parse out a lot of non-existent relative paths from the inline CSS of a web page. For example, `12, 34, 56` in the parentheses of the CSS color `rgb(12, 34, 56)` will be matched by `backgroundImageRegex`, which leads to Zeno crawling a bunch of URL assets with annoying 404s.
- **Separate CSS files are not supported**: Zeno doesn't support parsing individual CSS files.
### Preliminary Research
- `url()`, `src()`, and `@import <string>` in CSS can generate web requests ([CSS Values 4 §Security](https://www.w3.org/TR/css-values-4/#security)). `src()` is not yet implemented by any browser and can be ignored at this stage ([cssdb](https://cssdb.org/#src-function)). So the only tokens through which CSS can generate web requests (without JS) are `url()` and `@import <string>`.
- CSS allows custom properties. Would anyone realistically put the URL of a page resource in a CSS property and then fetch the value with JS to use it? I searched GitHub Code Search for `/url =. *getPropertyValue\(/ AND (language:JavaScript OR language:TypeScript OR language:HTML)`. It turns out that there is indeed a bunch of real-world code that puts custom `<string>` URLs into CSS and then fetches the value in JS. This weird coding paradigm is widespread, so I think Zeno also needs to extract URLs from plain CSS `<string>` values that start with `https`, `http`, or `//`.
- There are two types of `url()`, unquoted and quoted, which are parsed differently and have their own escaping rules ([CSS Values 4 §url()](https://www.w3.org/TR/css-values-4/#funcdef-url)).
- The `@import <string>` should be parsed like a quoted `url("")` ([CSS Cascade 5 §@import](https://www.w3.org/TR/css-cascade-5/#at-ruledef-import)).
After reading the CSS standard, I started looking for existing open-source Go CSS parsing libraries, and found that none of them currently do fine-grained extraction of the actual value of `url()`/`<string>` tokens; they are all lexer/tokenizer slicing libraries and are not very usable for this. For example, https://github.com/tdewolff/parse/ can only extract the whole token like `url( "http://a\"b.c" )`; it can't extract the `http://a"b.c` value.
### Plan

- Develop a dedicated parser to handle `url()` and `<string>` token values, focusing on escapes, newlines, and spaces. Integrate this parser with an existing open-source CSS parser.
- Add support for extracting URLs from separate CSS files. (The CSS standard specifies that in a separate CSS file, relative URLs should be resolved against the location of the CSS file itself, not the location of the HTML document.)
## 3. Check Content-Length before writing WARC

### Current Status
When a download has already started and disk space runs low, workers in the middle of a download will continue until it finishes, although diskWatcher will signal workers to pause processing new items.

If the total size of the resources being downloaded by workers at the moment diskWatcher signals a pause is greater than the `--min-space-required` threshold, we will likely run out of storage, which can have unintended consequences.
### Plan

- Implement a `--max-content-length` parameter to check the `Content-Length` header before downloading. If the header exceeds the limit, the download will be skipped. For streaming content of unknown size, the download will be cancelled once the downloaded size exceeds `--max-content-length` (and a seencheck will be added to prevent duplicate downloads).
- Implement a `--min-space-urgent` parameter to cancel in-progress downloads when disk space becomes critically low.
## 4. Creating a Dummy Test Site for Zeno

### Current Status
Zeno is a complex concurrent system with many components working together. Due to the lack of a proper test suite (we certainly can't use a real website for testing), there are currently only unit tests for a few components, and no integration or E2E tests.

In a local development environment, it is difficult to trigger or reproduce problems (such as memory leaks) that only occur under production conditions such as high concurrency, long running times, high packet loss, and complex web pages.

The idea is to develop a dummy test-site backend for testing Zeno, so that we can run "free" long, high-bandwidth stress tests of Zeno locally, as well as bring in real-world web sites and file formats for integration or end-to-end testing. The concept is similar to https://httpbin.org/.
### Plan

For example:

- `/random/file.bin?size=2G&stream=true&chunked=1024` would return a streamed file of unknown size for Zeno.
- `/html/seed-0/depth-3-size-2` generates 3 layers of outlinked HTML to test Zeno's HTML outlink extraction and hop capabilities:
  - `/html/seed-00/depth-2-size-2`
    - `/html/seed-001/depth-1-size-2`
    - `/html/seed-002/depth-1-size-2`
  - `/html/seed-01/depth-2-size-2`
    - `/html/seed-010/depth-1-size-2`
    - `/html/seed-011/depth-1-size-2`
- `/extractor/xml/rss.xml?articles=5&podcast=true` will generate an RSS feed with 5 audio items.
The real web is complex, and as Zeno supports more sites and file types, the test suite will need to add test cases accordingly, which will be a long-term project.
## 5. Enable HQ Control of Zeno Clients via WebSocket

### Current Status
There is WebSocket communication between Zeno and HQ, but it is currently only used for Zeno to send one-way heartbeat packets to HQ.
### Solution
Make it possible for Zeno to listen for control messages sent by HQ over the WebSocket, so that HQ can control the Zeno nodes connected to it, e.g. pause/start/stop a Zeno node.