Commit 1d18399

Merge pull request #222 from bellingcat/feat/yt-dlp-pots
yt-dlp proposed extractor_args and PO Token client.
2 parents 25f1f5d + c510c04 commit 1d18399

7 files changed, +304 -92 lines changed

docs/source/how_to/authentication_how_to.md

Lines changed: 113 additions & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -106,5 +106,117 @@ Finally,Some important things to remember:
106106

107107
## Authenticating on XXXX site with username/password
108108

109-
```{note} This section is still under construction 🚧
109+
```{note}
110+
This section is still under construction 🚧
110111
```
112+
113+
114+
# Proof of Origin Tokens
115+
116+
YouTube uses **Proof of Origin Tokens (POT)** as part of its bot detection system to verify that requests originate from valid clients. If a token is missing or invalid, some videos may return errors like "Sign in to confirm you're not a bot."
117+
118+
yt-dlp provides [a detailed guide to POTs](https://github.com/yt-dlp/yt-dlp/wiki/PO-Token-Guide).
119+
120+
### How Auto Archiver Uses POT
121+
This feature is enabled for the Generic Archiver via two yt-dlp plugins:
122+
123+
- **Client-side plugin**: [yt-dlp-get-pot](https://github.com/coletdjnz/yt-dlp-get-pot)
124+
Detects when a token is required and requests one from a provider.
125+
126+
- **Provider plugin**: [bgutil-ytdlp-pot-provider](https://github.com/Brainicism/bgutil-ytdlp-pot-provider)
127+
Includes both a Python plugin and a **Node.js server or script** to generate the token.
128+
129+
These are installed in our Poetry environment.
130+
131+
### Integration Methods
132+
133+
**Docker (Recommended)**:
134+
135+
When running the Auto Archiver using the Docker image, we use the [Node.js token generation script](https://github.com/Brainicism/bgutil-ytdlp-pot-provider/tree/master/server).
136+
This avoids managing a separate server process; the script is run automatically inside the Docker container when needed.
137+
138+
The script is already included in the Docker image. If you need to disable it, set the config option `bguils_po_token_method` under the `generic_extractor` section of your `orchestration.yaml` config file to "disabled":
139+
```yaml
140+
generic_extractor:
141+
  bguils_po_token_method: "disabled"
142+
```
143+
144+
**PyPI / Local**:
145+
146+
When using the Auto Archiver PyPI package, or running locally, you will need to install additional system requirements to run the token generation script: either Docker, or Node.js and Yarn.
147+
148+
See the [bgutil-ytdlp-pot-provider](https://github.com/Brainicism/bgutil-ytdlp-pot-provider?tab=readme-ov-file#a-http-server-option) documentation for more details.
149+
150+
⚠️WARNING⚠️: Enabling the script method will add the bgutil-ytdlp-pot-provider server scripts to the home directory on the machine where the Auto Archiver is running.
151+
152+
- You can set the config option `bguils_po_token_method` under the `generic_extractor` section of your `orchestration.yaml` config file to "script" to enable the token generation script process locally.
153+
- Alternatively you can run the bgutil-ytdlp-pot-provider server separately using their Docker image or Node.js server.
154+
155+
### Notes
156+
157+
- The token generation script is only triggered when needed by yt-dlp, so it should have no effect unless YouTube requests a POT.
158+
- If you're running the Auto Archiver in Docker, this is set up automatically.
159+
- If you're running locally, you'll need to run the setup script manually or enable the feature in your config.
160+
- You can set up both the server and the script; the plugin will fall back from one to the other if needed. This is recommended for robustness!
161+
162+
163+
164+
### Configuration Summary
165+
166+
| Option | Behavior | Docker Default? |
167+
|------------| ------------------------------------------------------------------------------------------------------------------------------------------ | --------------- |
168+
| `auto` | Docker: Automatically downloads and uses the token generation script. Local: Does nothing; assumes a separate server is running externally. | ✅ Yes |
169+
| `script` | Explicitly downloads and uses the token generation script, even locally. | ❌ No |
170+
| `disabled` | Disables token generation completely. | ❌ No |
171+
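The table above can be read as a small decision function. The sketch below only illustrates the documented behavior and is not the Auto Archiver's actual code: `resolve_pot_method` and its `in_docker` flag are hypothetical names, and how the archiver really detects that it is running in Docker is not covered here.

```python
def resolve_pot_method(method: str, in_docker: bool) -> str:
    """Map a `bguils_po_token_method` value to the action described in the table.

    Returns "script" when the token generation script should be set up,
    or "none" when the archiver sets nothing up itself.
    Hypothetical helper for illustration only.
    """
    if method not in ("auto", "script", "disabled"):
        raise ValueError(f"unknown bguils_po_token_method: {method!r}")
    if method == "script":
        return "script"  # explicit: use the script, even locally
    if method == "auto" and in_docker:
        return "script"  # Docker default: the script is set up automatically
    # "disabled", or "auto" outside Docker (a separate server is assumed)
    return "none"


if __name__ == "__main__":
    for method in ("auto", "script", "disabled"):
        for in_docker in (True, False):
            print(method, in_docker, "->", resolve_pot_method(method, in_docker))
```

Note that "none" does not mean yt-dlp cannot obtain a token: with `auto` outside Docker, the plugin may still reach a separately running bgutil-ytdlp-pot-provider server.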
172+
Example configuration:
173+
174+
175+
```yaml
176+
generic_extractor:
177+
  # ...
178+
  bguils_po_token_method: "script"
179+
  # For debugging add the verbose flag here:
180+
  ytdlp_args: "--no-abort-on-error --abort-on-error --verbose"
181+
182+
```
183+
184+
**Advanced Configuration:**
185+
186+
If you change the default port of the bgutil-ytdlp-pot-provider server, you can pass the updated values using our `extractor_args` option for the generic extractor.
187+
188+
```yaml
189+
generic_extractor:
190+
  ytdlp_args: "--no-abort-on-error --abort-on-error --verbose"
191+
  ytdlp_update_interval: 5
192+
  bguils_po_token_method: "script"
193+
  extractor_args:
194+
    youtube:
195+
      getpot_bgutil_baseurl: "http://127.0.0.1:8080"
196+
      player_client: web,tv
197+
```
198+
For more details on these bgutil options, see the [bgutil-ytdlp-pot-provider usage documentation](https://github.com/Brainicism/bgutil-ytdlp-pot-provider?tab=readme-ov-file#usage).
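To make the shape of `extractor_args` more concrete, here is a minimal sketch of how such a mapping relates to yt-dlp's two input forms: the Python API expects `{extractor: {key: [values]}}`, and the CLI expects `--extractor-args "extractor:key=v1,v2;key2=v"`. The helper names `normalize_extractor_args` and `to_cli_flags` are hypothetical, and whether the Auto Archiver performs this conversion in the same way is an assumption.

```python
def normalize_extractor_args(raw: dict) -> dict:
    """Convert {"youtube": {"player_client": "web,tv"}} into the list-valued
    form used by yt-dlp's Python API: {"youtube": {"player_client": ["web", "tv"]}}.
    Hypothetical helper for illustration only.
    """
    normalized = {}
    for extractor, args in raw.items():
        normalized[extractor] = {}
        for key, value in args.items():
            if isinstance(value, (list, tuple)):
                values = [str(v) for v in value]
            else:
                # Comma-separated strings become one list entry per value.
                values = [part.strip() for part in str(value).split(",")]
            normalized[extractor][key] = values
    return normalized


def to_cli_flags(raw: dict) -> list[str]:
    """Render the same mapping as yt-dlp command-line arguments."""
    flags = []
    for extractor, args in normalize_extractor_args(raw).items():
        joined = ";".join(f"{key}={','.join(values)}" for key, values in args.items())
        flags += ["--extractor-args", f"{extractor}:{joined}"]
    return flags


if __name__ == "__main__":
    config = {
        "youtube": {
            "getpot_bgutil_baseurl": "http://127.0.0.1:8080",
            "player_client": "web,tv",
        }
    }
    print(normalize_extractor_args(config))
    print(to_cli_flags(config))
```

The CLI form is handy for reproducing a problem with plain yt-dlp outside the Auto Archiver.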
199+
200+
### Checking the logs
201+
202+
To verify that the POT process is working, look for the following lines in your log after adding the config option:
203+
204+
```shell
205+
[GetPOT] BgUtilScript: Generating POT via script: /Users/you/bgutil-ytdlp-pot-provider/server/build/generate_once.js
206+
[debug] [GetPOT] BgUtilScript: Executing command to get POT via script: /Users/you/.nvm/versions/node/v20.18.0/bin/node /Users/you/bgutil-ytdlp-pot-provider/server/build/generate_once.js -v ymCMy8OflKM
207+
[debug] [GetPOT] BgUtilScript: stdout:
208+
{"poToken":"MlMxojNFhEJvUzGeHEkVRSK_luXtwcDnwSNIOgaUutqB7t99nmlNvtWgYayboopG6ZopZgmQ-6PJCWEMHv89MIiFGGlJRY25Fkwzxmia_8uYgf5AWf==","generatedAt":"2025-03-26T10:45:26.156Z","visitIdentifier":"ymCMy8OflKM"}
209+
[debug] [GetPOT] Fetching gvs PO Token for tv client
210+
```
211+
212+
If the plugin cannot find the script or reach the server, you'll see something like this:
213+
```shell
214+
[debug] [GetPOT] Fetching player PO Token for tv client
215+
WARNING: [GetPOT] BgUtilScript: Script path doesn't exist: /Users/you/bgutil-ytdlp-pot-provider/server/build/generate_once.js. Please make sure the script has been transpiled correctly.
216+
WARNING: [GetPOT] BgUtilHTTP: Error reaching GET http://127.0.0.1:4416/ping (caused by TransportError). Please make sure that the server is reachable at http://127.0.0.1:4416.
217+
[debug] [GetPOT] No player PO Token provider available for tv client
218+
```
219+
220+
In this case, check that the script has been transpiled correctly and is available at the path specified in the log,
221+
or that the server is running and reachable.
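If you have long logs, a small script can do this check for you. The sketch below is a rough illustration, not part of the Auto Archiver: `summarize_pot_log` is a hypothetical helper, and the marker strings are copied from the sample output above, so they may differ between plugin versions.

```python
# Marker strings taken from the sample log output in this section.
# Other plugin versions may word these messages differently (assumption).
FAILURE_MARKERS = (
    "Script path doesn't exist",
    "Error reaching",
    "No player PO Token provider available",
)


def summarize_pot_log(lines):
    """Report whether a PO Token was generated and collect GetPOT problems."""
    token_generated = False
    problems = []
    for line in lines:
        if '"poToken"' in line:
            # The script prints a JSON object containing the token on stdout.
            token_generated = True
        elif "[GetPOT]" in line and any(m in line for m in FAILURE_MARKERS):
            problems.append(line.strip())
    return {"token_generated": token_generated, "problems": problems}


if __name__ == "__main__":
    import sys

    # Usage: python check_pot_log.py < archiver.log
    result = summarize_pot_log(sys.stdin)
    print("token generated:", result["token_generated"])
    for problem in result["problems"]:
        print("problem:", problem)
```

An empty `problems` list with `token_generated` still false most likely means yt-dlp never needed a token for the URLs in that run.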
222+

poetry.lock

Lines changed: 28 additions & 56 deletions

pyproject.toml

Lines changed: 1 addition & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -56,6 +56,7 @@ dependencies = [
5656
"rfc3161-client (>=1.0.1,<2.0.0)",
5757
"cryptography (>44.0.1,<45.0.0)",
5858
"opentimestamps (>=0.4.5,<0.5.0)",
59+
"bgutil-ytdlp-pot-provider (>=0.7.3,<0.8.0)",
5960
]
6061

6162
[tool.poetry.group.dev.dependencies]

src/auto_archiver/modules/generic_extractor/__manifest__.py

Lines changed: 5 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -74,6 +74,11 @@
7474
"default": "inf",
7575
"help": "Use to limit the number of videos to download when a channel or long page is being extracted. 'inf' means no limit.",
7676
},
77+
"bguils_po_token_method": {
78+
"default": "auto",
79+
"help": "Set up a Proof of origin token provider. This process has additional requirements. See [authentication](https://auto-archiver.readthedocs.io/en/latest/how_to/authentication_how_to.html) for more information.",
80+
"choices": ["auto", "script", "disabled"],
81+
},
7782
"extractor_args": {
7883
"default": {},
7984
"help": "Additional arguments to pass to the yt-dlp extractor. See https://github.com/yt-dlp/yt-dlp/blob/master/README.md#extractor-arguments.",
