refactor: otaproxy: implement resource limit for requests handling and cache r/w, refactor cache_streaming with r/w thread pools #575

Open · wants to merge 12 commits into base: main
Conversation

@Bodong-Yang Bodong-Yang commented Jun 22, 2025

Introduction

This PR refactors the caching read/write implementation (cache_streaming.py), offloading IO operations to worker threads and limiting resource and memory use. A limit on concurrent request handling is also applied at the request handling entry point.

Major changes:

  1. server_app: limit the number of concurrently handled requests.
  2. cache_streaming: introduce read/write worker thread pools for cache r/w operations; refine cache writing, cache streaming, and cache reading.

Other changes:

  1. db & lru_cache_helper: update in accordance with the cache_streaming implementation changes.
  2. cache_control_header: refine the implementation.

Ticket

https://tier4.atlassian.net/browse/RT4-18003

@Bodong-Yang Bodong-Yang added refactor Rewrite/remove related code instead of patching them need_backport: v3.9 labels Jun 22, 2025
@Bodong-Yang
Member Author

Bodong-Yang commented Jun 23, 2025

otaclient v3.7.1, single client:

● otaproxy.service - OTA Client
     Loaded: loaded (/etc/systemd/system/otaproxy.service; disabled; vendor preset: enabled)
     Active: active (running) since Wed 2025-06-18 12:11:21 JST; 1min 28s ago
   Main PID: 14816 (python3)
      Tasks: 9 (limit: 9457)
     Memory: 1.3G
        CPU: 48.543s
     CGroup: /system.slice/otaproxy.service
             └─14816 /opt/ota/client/venv/bin/python3 -m otaclient.ota_proxy --host 0.0.0.0 --port 8888 --enable-cache --enable-https
root@autoware:/opt/ota/client# cat /sys/fs/cgroup/system.slice/otaproxy.service/memory.stat 
anon 39096320
file 1678356480
...

With PR, 5 clients:

● otaproxy.service - OTA proxy
     Loaded: loaded (/etc/systemd/system/otaproxy.service; static)
     Active: active (running) since Mon 2025-06-23 09:32:42 JST; 5min ago
   Main PID: 10431 (python3)
      Tasks: 25 (limit: 2288)
     Memory: 252.9M
        CPU: 4min 56.051s
     CGroup: /system.slice/otaproxy.service
             └─10431 /opt/ota/client2/venv/bin/python3 -m ota_proxy --host 0.0.0.0 --port 8888 --enable-cache --enable-https
root@autoware:/home/autoware# cat /sys/fs/cgroup/system.slice/otaproxy.service/memory.stat 
anon 77803520 # 77MB
file 57675776 # 58MB
...

The memory directly used by otaproxy stays very stable during the multi-client download test, and reclaiming of the file page cache can also be observed.

@@ -472,7 +470,7 @@ def resources_count(self) -> int:
)

try:
_query = _orm.orm_execute(_sql_stmt)
_query = _orm.orm_execute(_sql_stmt, row_factory=sqlite3.Row)
Member Author

By default, the ORM applies a custom row_factory that tries to convert the raw result into a cache db entry, but here we execute SELECT count(*), so the result is not a db entry.
Although the custom row_factory detects whether the raw result is actually an entry and, if not, returns it as-is, it is better to use sqlite3.Row as the row_factory when we know we are not selecting db entries in the first place.
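The point can be illustrated standalone with an in-memory database; the table name and schema below are hypothetical, not the PR's actual cache schema:

```python
import sqlite3

# In-memory db standing in for the cache db; schema is illustrative only.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE cache_meta (url TEXT, size INTEGER)")
con.executemany(
    "INSERT INTO cache_meta VALUES (?, ?)",
    [("a", 1), ("b", 2), ("c", 3)],
)

# sqlite3.Row gives index- and name-based access without trying to
# map the aggregate result onto a table-spec entry.
con.row_factory = sqlite3.Row
row = con.execute("SELECT count(*) AS cnt FROM cache_meta").fetchone()
print(row["cnt"])  # → 3
con.close()
```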

anyio.run(
    _server.serve,
    backend="asyncio",
    backend_options={"loop_factory": uvloop.new_event_loop},
)
Member Author

The anyio-recommended way to set up uvloop.

Comment on lines +77 to +79
# suppress logging from third-party deps
logging.basicConfig(level=logging.CRITICAL)
logger.setLevel(logging.INFO)
Member Author

When starting otaproxy standalone, configure logging: filter out third-party deps' logging and set the ota_proxy logger level to INFO.

Member Author

Cleanup and refactor of cache-control header parsing/exporting. test_cache_control_headers.py ensures that the new implementation behaves identically to the previous one.

Member Author

The major changes introduced by this PR.

A new cache handling model is implemented:

  1. A cache write worker thread pool and cache read worker threads are introduced; all IO operations are dispatched to these pools.
  2. Each worker thread pool has a pending-task limit; incoming cache requests that exceed it are dropped to avoid unbounded memory usage.

The following behaviors are still kept in the new implementation:

  1. For cache teeing during remote resource downloading, we still ensure the data streams back to the client first, even if cache writing fails.
  2. Cache writing and cache db entry commits are separated from streaming data back to the client, i.e., they are handled in worker threads and do not interfere with client request handling.
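The pending-task limit described above can be sketched with stdlib pieces; the class name, worker counts, and drop-on-full policy below are illustrative, not the PR's actual implementation:

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Optional


class BoundedPool:
    """Thread pool that rejects submissions once too many tasks are pending."""

    def __init__(self, workers: int, max_pending: int):
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._slots = threading.BoundedSemaphore(max_pending)

    def try_submit(self, fn, *args) -> Optional[Future]:
        # Non-blocking acquire: if all slots are taken, drop the request
        # instead of queueing it, keeping memory usage bounded.
        if not self._slots.acquire(blocking=False):
            return None
        fut = self._pool.submit(fn, *args)
        fut.add_done_callback(lambda _: self._slots.release())
        return fut


pool = BoundedPool(workers=2, max_pending=4)
results = [pool.try_submit(lambda x=i: x * 2) for i in range(8)]
accepted = [f for f in results if f is not None]
print(len(accepted))  # between 4 and 8, depending on how fast tasks finish
```

The key design point mirrored here is that rejection happens at submission time, so a burst of requests can never grow the pending queue without bound.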

Comment on lines +67 to 79
_configured_con_factory = partial(_con_factory, db_f, thread_wait_timeout)
self._async_db = AsyncCacheMetaORM(
table_name=table_name,
con_factory=_con_factory,
number_of_cons=thread_nums,
con_factory=_configured_con_factory,
number_of_cons=read_db_thread_nums,
row_factory="table_spec_no_validation",
)
self._db = CacheMetaORMPool(
table_name=table_name,
con_factory=_configured_con_factory,
number_of_cons=write_db_thread_nums,
row_factory="table_spec_no_validation",
)
Member Author

Here we use two ORMs: one for committing cache db entries (the sync ORM) and one for looking up cache entries (the async ORM).

The sync ORM works with the thread pool, while the async ORM works in the main event loop.

Comment on lines -206 to +221
@asynccontextmanager
async def _error_handling_for_cache_retrieving(self, url: str, send):
_is_succeeded = asyncio.Event()
_common_err_msg = f"request for {url=} failed"
async def _error_handling_for_cache_retrieving(
self, exc: Exception, url: str, send
) -> None:
Member Author

The exception handler is no longer implemented as a context manager; it is now a plain function.

Comment on lines 144 to 150
max_concurrent_requests: int = cfg.MAX_CONCURRENT_REQUESTS,
):
self._lock = asyncio.Lock()
self._closed = True
self._ota_cache = ota_cache

self._se = asyncio.Semaphore(max_concurrent_requests)
Member Author

On the server app side, also introduce a semaphore to limit the number of concurrently handled requests.

Comment on lines +353 to +358
if self._se.locked():
burst_suppressed_logger.warning(
f"exceed max pending requests: {self.max_concurrent_requests}, respond with 429"
)
await self._respond_with_error(HTTPStatus.TOO_MANY_REQUESTS, "", send)
return
Member Author

For requests exceeding the limit, return 429.
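The locked()-then-429 pattern above can be shown in a self-contained form; the handler, limit, and sleep below are illustrative stand-ins for real request handling:

```python
import asyncio
from http import HTTPStatus

MAX_CONCURRENT = 2
_se = asyncio.Semaphore(MAX_CONCURRENT)


async def handle(request_id: int) -> int:
    # Fast-fail while the semaphore is exhausted instead of queueing,
    # so a burst of requests cannot pile up unbounded.
    if _se.locked():
        return HTTPStatus.TOO_MANY_REQUESTS  # 429
    async with _se:
        await asyncio.sleep(0.05)  # stand-in for real request handling
        return HTTPStatus.OK


async def main() -> list:
    return await asyncio.gather(*(handle(i) for i in range(5)))


statuses = asyncio.run(main())
print(statuses)  # first two requests get 200, the rest get 429
```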

Comment on lines +140 to +151
def __init__(
self,
ota_cache: OTACache,
*,
max_concurrent_requests: int = cfg.MAX_CONCURRENT_REQUESTS,
):
self._lock = asyncio.Lock()
self._closed = True
self._ota_cache = ota_cache

self.max_concurrent_requests = max_concurrent_requests
self._se = asyncio.Semaphore(max_concurrent_requests)
Member Author

On the server_app side, also introduce a semaphore to restrict the number of requests being handled at once.

"""The cache blob storage is located at <cache_mnt_point>/data."""

# ------ task management ------ #
MAX_CONCURRENT_REQUESTS = 1024
Member Author

For requests exceeding the limit, HTTP error 429 will be returned.

# helper methods


def parse_raw_headers(raw_headers: List[Tuple[bytes, bytes]]) -> Dict[str, str]:
def parse_raw_headers(raw_headers: list[tuple[bytes, bytes]]) -> CIMultiDict[str]:
Member Author

Use a case-insensitive dict from the very beginning of header parsing to prevent accidentally dropping any headers.
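To illustrate why case-insensitivity matters at parse time, here is a minimal stdlib stand-in for the behavior (the actual code uses CIMultiDict from the multidict library; the header name below is illustrative):

```python
class CIDict(dict):
    """Minimal case-insensitive dict, standing in for multidict.CIMultiDict."""

    def __setitem__(self, key: str, value: str) -> None:
        super().__setitem__(key.lower(), value)

    def __getitem__(self, key: str) -> str:
        return super().__getitem__(key.lower())


def parse_raw_headers(raw_headers: list) -> CIDict:
    # Decode ASGI-style (bytes, bytes) header pairs into a
    # case-insensitive mapping right away, so later lookups cannot
    # miss a header just because a client sent unusual casing.
    parsed = CIDict()
    for k, v in raw_headers:
        parsed[k.decode("utf-8")] = v.decode("utf-8")
    return parsed


headers = parse_raw_headers([(b"Ota-File-Cache-Control", b"use_cache")])
print(headers["ota-file-cache-control"])  # → use_cache
print(headers["OTA-FILE-CACHE-CONTROL"])  # → use_cache
```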

Comment on lines +225 to +230
if isinstance(exc, (ReaderPoolBusy, CacheProviderNotReady)):
_err_msg = f"{_common_err_msg} due to otaproxy is busy: {exc!r}"
burst_suppressed_logger.error(_err_msg)
await self._respond_with_error(
HTTPStatus.SERVICE_UNAVAILABLE, "otaproxy internal busy", send
)
Member Author

Return 503 (Service Unavailable) when the otacache internal r/w thread pools are busy.

Comment on lines -12 to +35
async def read_file(fpath: PathLike) -> AsyncIterator[bytes]:
async def read_file(
fpath: PathLike, chunk_size: int = cfg.LOCAL_READ_SIZE
) -> AsyncGenerator[bytes]:
"""Open and read a file asynchronously."""
async with await open_file(fpath, "rb") as f:
while data := await f.read(cfg.CHUNK_SIZE):
fd = f.wrapped.fileno()
os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
while data := await f.read(chunk_size):
yield data
os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)


def read_file_once(fpath: StrOrPath | anyio.Path) -> bytes:
with open(fpath, "rb") as f:
fd = f.fileno()
os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
data = f.read()
os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
return data
Member Author

Use os.posix_fadvise to prevent the kernel from holding the page caches of opened/written files after the files are closed.
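The same page-cache hint can be applied on the write side; this is a hedged sketch rather than the PR's exact code, and since os.posix_fadvise is POSIX/Linux-only, the call is guarded for portability:

```python
import os
import tempfile


def write_file_once(fpath: str, data: bytes) -> None:
    with open(fpath, "wb") as f:
        f.write(data)
        f.flush()
        fd = f.fileno()
        if hasattr(os, "posix_fadvise"):
            # DONTNEED only drops clean pages, so sync dirty pages first,
            # then ask the kernel to evict this file's page cache.
            os.fsync(fd)
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)


tmp = os.path.join(tempfile.gettempdir(), "fadvise_demo.bin")
write_file_once(tmp, b"cached blob")
print(os.path.getsize(tmp))  # → 11
```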

@Bodong-Yang Bodong-Yang changed the title refactor: otaproxy: implement resource limit for requests handling, introduce read/write thread pools for IO operations refactor: otaproxy: implement resource limit for requests handling and cache r/w, refactor cache_streaming with r/w thread pools Jun 25, 2025
@Bodong-Yang Bodong-Yang marked this pull request as ready for review June 25, 2025 06:24
@Bodong-Yang Bodong-Yang requested a review from a team as a code owner June 25, 2025 06:24
Labels
need_backport: v3.9 refactor Rewrite/remove related code instead of patching them