Commit 5a67570

fix(parsing.tgraph): no longer proxies images from *.wp.com
Signed-off-by: Rongrong <[email protected]>
1 parent d59f9b6 commit 5a67570

3 files changed: +40 -6 lines changed

docs/CHANGELOG.md (+4)

@@ -2,6 +2,10 @@
 
 ## Unreleased
 
+### Enhancements
+
+- **No longer proxies images from `*.wp.com` when generating Telegraph posts**: `*.wp.com` is in the blocklist of `wsrv.nl` (environment variable `IMAGES_WESERV_NL`). Thus, these images are no longer proxied when generating Telegraph posts. All images from `*.wp.com` can be accessed with any referer header, so they are now kept as is.
+
 ### Bug fixes
 
 - **Canonical `DATABASE_URL` not recognized**: Since v2.9.0, `DATABASE_URL` is canonicalized before connecting to the corresponding database. However, a canonical URL pointing to a local path cannot be recognized when checking the validity of the scheme (database type). Both canonical (`scheme:/path/to/file.db`) and traditional (`scheme:///path/to/file.db`) forms of such URLs are recognized correctly now.

docs/CHANGELOG.zh.md (+4)

@@ -2,6 +2,10 @@
 
 ## 未发布
 
+### 增强
+
+- **生成 Telegraph 文章时,不再代理来自 `*.wp.com` 的图像**: `*.wp.com` 位于 `wsrv.nl` (环境变量 `IMAGES_WESERV_NL`) 的阻断列表中。因此,在生成 Telegraph 文章时,这些图像不再被代理。来自 `*.wp.com` 的所有图片都可以用任何 refer 头访问,因此它们现在保持原样。
+
 ### Bug 修复
 
 - **无法识别规范的 `DATABASE_URL`**: 自 v2.9.0 起, 在连接到相应的数据库之前,`DATABASE_URL` 被规范化。然而,在检查 scheme (数据库类型) 的合法性时,无法识别指向本地路径的规范 URL。现在,此类 URL 的规范 (`scheme:/path/to/file.db`) 和传统 (`scheme:///path/to/file.db`) 形式都被正确识别。

src/parsing/tgraph.py (+32 -6)

@@ -20,8 +20,9 @@
 from collections.abc import Awaitable
 
 import asyncio
-import time
 import aiographfix as aiograph
+import re
+import time
 from io import BytesIO
 from bs4 import BeautifulSoup
 from contextlib import suppress
@@ -40,6 +41,29 @@
 else:
     convert_table_to_png = None
 
+DOMAIN_PATTERN_TEMPLATE: Final[str] = r'^https?://(?:[^./]+\.)?(?:{domains})\.?(?:/|:|$)'
+BLOCKED_BY_WESERV_DOMAIN: Final[set[str]] = {
+    'sinaimg.cn',
+    'wp.com',
+}
+BLOCKED_BY_WESERV_RE: Final[re.Pattern] = re.compile(
+    DOMAIN_PATTERN_TEMPLATE.format(
+        domains='|'.join(map(re.escape, BLOCKED_BY_WESERV_DOMAIN)),
+    ),
+    re.I,
+)
+ALLOW_REFERER_DOMAIN: Final[set[str]] = set(filter(None, {
+    'wp.com',
+    env.IMG_RELAY_SERVER.partition('://')[2].partition('/')[0].strip('.'),
+    env.IMAGES_WESERV_NL.partition('://')[2].partition('/')[0].strip('.'),
+}))
+ALLOW_REFERER_RE: Final[re.Pattern] = re.compile(
+    DOMAIN_PATTERN_TEMPLATE.format(
+        domains='|'.join(map(re.escape, ALLOW_REFERER_DOMAIN)),
+    ),
+    re.I,
+)
+
 logger = log.getLogger('RSStT.tgraph')
 
 apis: Optional[APIs] = None
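
Review note: to see what `DOMAIN_PATTERN_TEMPLATE` accepts and rejects, the pattern from the hunk above can be exercised in isolation. The following is a minimal sketch, not part of the commit. The helper name `compile_domain_re` and all sample URLs are made up for illustration, and `sorted()` is added here only to make the generated alternation deterministic.

```python
import re

# Template string copied from the hunk above.
DOMAIN_PATTERN_TEMPLATE = r'^https?://(?:[^./]+\.)?(?:{domains})\.?(?:/|:|$)'


def compile_domain_re(domains):
    """Hypothetical helper: build a case-insensitive matcher for a set of domains."""
    alternation = '|'.join(map(re.escape, sorted(domains)))
    return re.compile(DOMAIN_PATTERN_TEMPLATE.format(domains=alternation), re.I)


blocked_re = compile_domain_re({'sinaimg.cn', 'wp.com'})

# Illustrative URLs only; none of them come from the commit.
samples = [
    'https://i0.wp.com/example.com/a.png',  # one subdomain label
    'https://wp.com',                       # bare domain, end of string
    'http://WP.COM:8080/x',                 # upper case, explicit port
    'https://wp.com./x',                    # trailing root dot
    'https://ws1.sinaimg.cn/large/x.jpg',   # second listed domain
    'https://a.b.wp.com/x',                 # two subdomain labels
    'https://notwp.com/x',                  # suffix collision
    'https://wp.com.evil.example/x',        # listed domain used as a prefix
]

for url in samples:
    print(url, bool(blocked_re.match(url)))
```

As written, the optional group `(?:[^./]+\.)?` consumes at most one subdomain label, so a host such as `a.b.wp.com` would not match. Whether that matters depends on which hosts actually serve the images.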
@@ -309,13 +333,15 @@ async def generate_page(self):
                     if not isAbsoluteHttpLink(attr_content):
                         tag.replaceWithChildren()
                         continue
-                    if not attr_content.startswith(env.IMG_RELAY_SERVER):
+                    if not ALLOW_REFERER_RE.match(attr_content):
                         if tag.name == 'video':
                             attr_content = env.IMG_RELAY_SERVER + attr_content
-                        if tag.name == 'img' and not attr_content.startswith(env.IMAGES_WESERV_NL):
-                            if attr_content.split('.', 1)[1].split('/', 1)[0] == 'sinaimg.cn':
-                                attr_content = env.IMG_RELAY_SERVER + attr_content
-                            attr_content = construct_weserv_url(attr_content)
+                        elif tag.name == 'img':
+                            attr_content = (
+                                env.IMG_RELAY_SERVER + attr_content
+                                if BLOCKED_BY_WESERV_RE.match(attr_content)
+                                else construct_weserv_url(attr_content)
+                            )
                     tag.attrs = {attr_name: attr_content}
 
         if self.feed_title:
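
Review note: the branch order in the hunk above can be sketched as a standalone function. This is an illustration under stated assumptions, not the project's code. The values of `IMG_RELAY_SERVER` and `IMAGES_WESERV_NL` are placeholders, `construct_weserv_url_stub` is a hypothetical stand-in for the real `construct_weserv_url` (whose implementation is not shown in this commit), and `host_of`, `domain_re`, and `rewrite_media_url` are names invented here.

```python
import re
from urllib.parse import quote

DOMAIN_PATTERN_TEMPLATE = r'^https?://(?:[^./]+\.)?(?:{domains})\.?(?:/|:|$)'

# Placeholder values; the real ones come from the bot's environment.
IMG_RELAY_SERVER = 'https://relay.example/'
IMAGES_WESERV_NL = 'https://wsrv.nl/'


def host_of(base_url):
    """Extract the host with the same partition chain used in the diff."""
    return base_url.partition('://')[2].partition('/')[0].strip('.')


def domain_re(domains):
    """Hypothetical helper: compile the template for a set of domains."""
    alternation = '|'.join(map(re.escape, sorted(domains)))
    return re.compile(DOMAIN_PATTERN_TEMPLATE.format(domains=alternation), re.I)


BLOCKED_BY_WESERV_RE = domain_re({'sinaimg.cn', 'wp.com'})
ALLOW_REFERER_RE = domain_re(set(filter(None, {
    'wp.com',
    host_of(IMG_RELAY_SERVER),
    host_of(IMAGES_WESERV_NL),
})))


def construct_weserv_url_stub(url):
    """Hypothetical stand-in: the real function may build the URL differently."""
    return f'{IMAGES_WESERV_NL}?url={quote(url, safe="")}'


def rewrite_media_url(tag_name, url):
    """Mirror the new branch order: allow-list first, then video, then img."""
    if ALLOW_REFERER_RE.match(url):
        return url  # kept as is (e.g. *.wp.com, or already proxied)
    if tag_name == 'video':
        return IMG_RELAY_SERVER + url
    if tag_name == 'img':
        if BLOCKED_BY_WESERV_RE.match(url):
            return IMG_RELAY_SERVER + url  # relay only, weserv is skipped
        return construct_weserv_url_stub(url)
    return url
```

Under these assumptions, a `*.wp.com` image never reaches the `BLOCKED_BY_WESERV_RE` check because the allow-list matches first, while a `sinaimg.cn` image is sent through the relay alone rather than through the relay and then weserv as in the removed lines.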
