Commit 7a80519

perf: monkey-patch feedparser for performant URI resolution
Relative URIs are now resolved using lxml and yarl, cutting down the overhead of feed parsing by more than 50%.

The resolver will not pass all test cases in feedparser due to the difference between lxml and sgmllib when handling malformed HTML. This is mostly harmless, so let's monkey-patch feedparser to benefit from lxml's performant HTML tree parsing.

Signed-off-by: Rongrong <[email protected]>
1 parent 91dd86c commit 7a80519

6 files changed: +272 -1 lines changed

docs/CHANGELOG.md

Lines changed: 6 additions & 0 deletions
@@ -1,5 +1,11 @@
 # Changelog
 
+## Unreleased
+
+### Enhancements
+
+- **Improve feed parsing performance**: Relative URIs are now resolved using `lxml` and `yarl`, cutting down the overhead of feed parsing by more than 50%.
+
 ## v2.10.0: Container health check, chat-specific #hashtags, and more
 
 ### Highlights

docs/CHANGELOG.zh.md

Lines changed: 6 additions & 0 deletions
@@ -1,5 +1,11 @@
 # 更新日志
 
+## 未发布
+
+### 增强
+
+- **提高 feed 解析性能**:现在使用 `lxml` 和 `yarl` 解析相对 URI,将 feed 解析的开销减少了超过 50%。
+
 ## v2.10.0: 容器健康检查、特定于聊天的 #hashtag 和更多
 
 ### 亮点

src/compat/__init__.py

Lines changed: 8 additions & 1 deletion
@@ -21,8 +21,11 @@
 if sys.version_info < (3, 9):
     raise RuntimeError("This bot requires Python 3.9 or later")
 
+import feedparser.mixin
+import feedparser.urls
 import listparser.opml
 
+from .lxml_uri_resolver import resolve_relative_uris
 from .listparser_opml_mixin import OpmlMixin
 from .utils import (
     INT64_T_MAX,
@@ -44,6 +47,10 @@
     "bozo_exception_removal_wrapper",
 ]
 
-# Monkey-patching `listparser.opml.OpmlMixin` to support `text` and `title_orig`
+# Monkey-patch `feedparser` to use lxml for performant URI resolution.
+feedparser.urls.resolve_relative_uris = resolve_relative_uris
+feedparser.mixin.resolve_relative_uris = resolve_relative_uris
+
+# Monkey-patch `listparser.opml.OpmlMixin` to support `text` and `title_orig`.
 # https://github.com/kurtmckee/listparser/issues/71
 listparser.opml.OpmlMixin.start_opml_outline = OpmlMixin.start_opml_outline
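The patch above assigns the replacement to two module namespaces. That is presumably because `feedparser.mixin` holds its own reference to `resolve_relative_uris`, obtained through a `from ... import ...` statement, so rebinding the name in `feedparser.urls` alone would not reach it. A minimal sketch of that effect follows; `urls_demo`, `mixin_demo`, and both resolver functions are hypothetical stand-ins, not feedparser code.

```python
import types

# Hypothetical stand-ins for a library's two modules (not feedparser itself).
urls_demo = types.ModuleType("urls_demo")
mixin_demo = types.ModuleType("mixin_demo")


def slow_resolver(html: str) -> str:
    return "slow:" + html


def fast_resolver(html: str) -> str:
    return "fast:" + html


urls_demo.resolve = slow_resolver
# Equivalent of `from urls_demo import resolve` executed inside mixin_demo:
# the importing module keeps its own binding to the function object.
mixin_demo.resolve = urls_demo.resolve

# Patching only the defining module leaves the copied binding untouched.
urls_demo.resolve = fast_resolver
only_urls_patched = mixin_demo.resolve("x")

# Patching the importing module as well redirects its calls too.
mixin_demo.resolve = fast_resolver
both_patched = mixin_demo.resolve("x")
```

Patching every namespace that holds the name is the usual price of monkey-patching a function that other modules import directly.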
Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
+#  RSS to Telegram Bot
+#  Copyright (C) 2025  Rongrong <[email protected]>
+#
+#  This program is free software: you can redistribute it and/or modify
+#  it under the terms of the GNU Affero General Public License as
+#  published by the Free Software Foundation, either version 3 of the
+#  License, or (at your option) any later version.
+#
+#  This program is distributed in the hope that it will be useful,
+#  but WITHOUT ANY WARRANTY; without even the implied warranty of
+#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#  GNU Affero General Public License for more details.
+#
+#  You should have received a copy of the GNU Affero General Public License
+#  along with this program.  If not, see <https://www.gnu.org/licenses/>.
+
+from .uri_resolver import resolve_relative_uris
Lines changed: 104 additions & 0 deletions
@@ -0,0 +1,104 @@
+#  RSS to Telegram Bot
+#  Copyright (C) 2025  Rongrong <[email protected]>
+#
+#  This program is free software: you can redistribute it and/or modify
+#  it under the terms of the GNU Affero General Public License as
+#  published by the Free Software Foundation, either version 3 of the
+#  License, or (at your option) any later version.
+#
+#  This program is distributed in the hope that it will be useful,
+#  but WITHOUT ANY WARRANTY; without even the implied warranty of
+#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#  GNU Affero General Public License for more details.
+#
+#  You should have received a copy of the GNU Affero General Public License
+#  along with this program.  If not, see <https://www.gnu.org/licenses/>.
+
+# Collected from:
+# - https://github.com/kurtmckee/feedparser/blob/6cdc20849a66c29e2d08b0334fceb22f210bdb26/feedparser/urls.py#L39-L47
+# - https://docs.python.org/3.12/library/urllib.parse.html
+ACCEPTABLE_URI_SCHEMES: set[str] = {
+    'acap',
+    'aim',
+    'callto',
+    'cvs',
+    'facetime',
+    'feed',
+    'file',
+    'ftp',
+    'git',
+    'gopher',
+    'gtalk',
+    'h323',
+    'hdl',
+    'http',
+    'https',
+    'icap',
+    'imap',
+    'irc',
+    'irc6',
+    'ircs',
+    'itms',
+    'magnet',
+    'mailto',
+    'mms',
+    'msnim',
+    'mtqp',
+    'news',
+    'nntp',
+    'prospero',
+    'rsync',
+    'rtsp',
+    'rtsps',
+    'rtspu',
+    'sftp',
+    'shttp',
+    'sip',
+    'sips',
+    'skype',
+    'smb',
+    'snews',
+    'ssh',
+    'svn',
+    'svn+ssh',
+    'telnet',
+    'wais',
+    'ws',
+    'wss',
+    'ymsg',
+}
+
+# Collected from:
+# - https://github.com/kurtmckee/feedparser/blob/6cdc20849a66c29e2d08b0334fceb22f210bdb26/feedparser/urls.py#L107-L137
+TAG_ATTR_MAP: dict[str, set[str]] = {
+    'a': {'href'},
+    'applet': {'codebase'},
+    'area': {'href'},
+    'audio': {'src'},
+    'blockquote': {'cite'},
+    'body': {'background'},
+    'del': {'cite'},
+    'form': {'action'},
+    'frame': {'longdesc', 'src'},
+    'head': {'profile'},
+    'iframe': {'longdesc', 'src'},
+    'img': {'longdesc', 'src', 'usemap'},
+    'input': {'src', 'usemap'},
+    'ins': {'cite'},
+    'link': {'href'},
+    'object': {'classid', 'codebase', 'data', 'usemap'},
+    'q': {'cite'},
+    'script': {'src'},
+    'source': {'src'},
+    'video': {'poster', 'src'},
+}
+
+TAG_ATTR_MAP_RSSTT: dict[str, set[str]] = {
+    'a': {'href'},
+    'audio': {'src'},
+    'iframe': {'src'},
+    'img': {'src'},
+    'q': {'cite'},
+    'source': {'src'},
+    'video': {'poster', 'src'},
+}
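One detail worth checking when a scheme set like this is joined into a regular-expression alternation: `svn+ssh` contains `+`, which is a regex quantifier unless escaped. The sketch below uses a three-entry subset and a hypothetical helper, `build_scheme_matcher`, to compare an unescaped alternation with one built through `re.escape`; it illustrates the pitfall and is not the commit's code.

```python
import re

# A small subset of the schemes, for illustration only.
schemes = {"http", "https", "svn+ssh"}


def build_scheme_matcher(allowed, escape):
    """Hypothetical helper: return a matcher for '<scheme>:' prefixes."""
    parts = (re.escape(s) if escape else s for s in sorted(allowed))
    return re.compile(f'^({"|".join(parts)}):', re.IGNORECASE).match


unescaped = build_scheme_matcher(schemes, escape=False)
escaped = build_scheme_matcher(schemes, escape=True)

# Without escaping, 'n+' means "one or more n", so the literal '+' never matches.
print(unescaped("svn+ssh://host/repo"))          # None
print(bool(escaped("svn+ssh://host/repo")))      # True
print(bool(escaped("HTTPS://example.com/feed"))) # True (case-insensitive)
```

Escaping each scheme before joining keeps the matcher correct for any scheme name that contains `+`, `.`, or `-`.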
Lines changed: 131 additions & 0 deletions
@@ -0,0 +1,131 @@
+#  RSS to Telegram Bot
+#  Copyright (C) 2025  Rongrong <[email protected]>
+#
+#  This program is free software: you can redistribute it and/or modify
+#  it under the terms of the GNU Affero General Public License as
+#  published by the Free Software Foundation, either version 3 of the
+#  License, or (at your option) any later version.
+#
+#  This program is distributed in the hope that it will be useful,
+#  but WITHOUT ANY WARRANTY; without even the implied warranty of
+#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#  GNU Affero General Public License for more details.
+#
+#  You should have received a copy of the GNU Affero General Public License
+#  along with this program.  If not, see <https://www.gnu.org/licenses/>.
+
+from typing import Optional, AbstractSet, Callable, Any
+
+import lxml.html
+import re
+from yarl import URL
+
+from .presets import ACCEPTABLE_URI_SCHEMES, TAG_ATTR_MAP_RSSTT
+
+
+def _always_true(_: str) -> bool:
+    return True
+
+
+class UriResolver:
+    def __init__(
+            self,
+            allowed_schemes: Optional[AbstractSet[str]] = ...,
+            tag_attr_map: Optional[dict[str, AbstractSet[str]]] = ...,
+    ):
+        self._allowed_schemes: AbstractSet[str] = (
+            ACCEPTABLE_URI_SCHEMES
+            if allowed_schemes is ...
+            else allowed_schemes or set()
+        )
+        self._tag_attr_map: dict[str, AbstractSet[str]] = (
+            TAG_ATTR_MAP_RSSTT
+            if tag_attr_map is ...
+            else tag_attr_map or {}
+        )
+        self._scheme_matcher: Callable[[str], Optional[Any]] = (
+            re.compile(
+                f'^({"|".join(map(re.escape, self._allowed_schemes))}):',
+                re.IGNORECASE,
+            ).match
+            if self._allowed_schemes
+            else _always_true
+        )
+        self._xpath: str = '|'.join((
+            '//{tag_name}[{attrs}]'.format(
+                tag_name=tag_name,
+                attrs=' or '.join((
+                    f'@{attr_name}'
+                    for attr_name in attr_names
+                ))
+            )
+            for tag_name, attr_names in self._tag_attr_map.items()
+        ))
+
+    def resolve(self, html: str, base: str, type_: str) -> str:
+        if not base:
+            return html
+
+        if '<' not in html:
+            # Not HTML.
+            return html
+
+        xpath = self._xpath
+        if not xpath:
+            # Nothing to resolve.
+            return html
+
+        scheme_matcher = self._scheme_matcher
+        if not scheme_matcher(base):
+            # The base is relative or without an allowed scheme.
+            return html
+
+        tag_attr_map = self._tag_attr_map
+
+        base_url = URL(base)
+
+        html_tree = lxml.html.fragment_fromstring(html, create_parent='URI_RESOLVER')
+
+        allowed_schemes = self._allowed_schemes
+        element: lxml.html.HtmlElement
+        for element in html_tree.xpath(xpath):
+            for attr_name in tag_attr_map[element.tag]:
+                relative = element.attrib.get(attr_name)
+                if relative is None:
+                    continue
+
+                relative = relative.strip()
+
+                if not relative:
+                    element.attrib[attr_name] = base
+                    continue
+
+                if scheme_matcher(relative):
+                    # Absolute URL with an allowed scheme, happy path.
+                    continue
+
+                relative_url = URL(relative)
+                if relative_url.absolute:
+                    # Absolute URL without an allowed scheme, erase it.
+                    element.attrib[attr_name] = ''
+                    continue
+
+                absolute_url = base_url.join(relative_url)
+                element.attrib[attr_name] = (
+                    absolute_url.human_repr()
+                    if absolute_url.scheme in allowed_schemes
+                    else ''
+                )
+
+        return lxml.html.tostring(
+            html_tree,
+            encoding='unicode',
+            method='xml' if type_ == 'application/xhtml+xml' else 'html',
+        ).partition('<URI_RESOLVER>')[2].rpartition('</URI_RESOLVER>')[0]
+
+
+uri_resolver = UriResolver()
+
+
+def resolve_relative_uris(html_source, base_uri, encoding, type_):
+    return uri_resolver.resolve(html_source, base_uri, type_)
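The resolver selects every candidate element with a single XPath query built once from the tag-to-attribute map, rather than walking the tree tag by tag. The standalone sketch below reproduces only that string construction, so it needs neither lxml nor yarl. `build_xpath` and `demo_map` are hypothetical names, and the attributes are sorted here purely to make the output deterministic.

```python
def build_xpath(tag_attr_map):
    """Hypothetical helper: one union XPath covering all tag/attribute pairs."""
    return "|".join(
        "//{tag}[{attrs}]".format(
            tag=tag,
            # An element qualifies if it carries at least one listed attribute.
            attrs=" or ".join(f"@{attr}" for attr in sorted(attrs)),
        )
        for tag, attrs in tag_attr_map.items()
    )


demo_map = {"a": {"href"}, "video": {"poster", "src"}}
xpath = build_xpath(demo_map)
print(xpath)  # //a[@href]|//video[@poster or @src]

# An empty map yields an empty expression, which a caller can treat as
# "nothing to resolve" and return the input unchanged.
empty_xpath = build_xpath({})
```

Because the expression depends only on the map, it can be computed once at construction time and reused for every feed entry.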
