Synapse can fail to move large media uploads from temporary storage to disk in time before the client disconnects #13009
Description
This issue is a distilled version of #12937.
When a file is uploaded to Synapse's media store, Twisted buffers it into a temporary file and gives Synapse an IO
handle to it once it's been fully received.
The issue is that, with sufficiently large files and slow storage, it's possible that it will take more than 30 seconds to copy the bytes out of that temporary file into the correct location in the media store.
Clients such as Element Web will time out 30 seconds after having sent the last byte. (It also wouldn't seem unusual for a reverse proxy to do something similar.)
When this happens, the disconnection causes Twisted to close the temporary file and so Synapse's copy is interrupted.
The error produced in the logs looks as follows:
synapse.http.site - 362 - INFO - POST-8959 - Connection from client lost before response was sent
synapse.http.server - 183 - ERROR - POST-8959 - Failed handle request via 'UploadResource': <XForwardedForRequest at 0x7f5bada94c40 method='POST' uri='/_matrix/media/r0/upload?filename=2.0gb' clientproto='HTTP/1.1' site='8008'>
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/synapse/http/server.py", line 366, in _async_render_wrapper
callback_return = await self._async_render(request)
File "/usr/local/lib/python3.9/site-packages/synapse/http/server.py", line 396, in _async_render
callback_return = await raw_callback_return
File "/usr/local/lib/python3.9/site-packages/synapse/rest/media/v1/upload_resource.py", line 96, in _async_render_POST
content_uri = await self.media_repo.create_content(
File "/usr/local/lib/python3.9/site-packages/synapse/rest/media/v1/media_repository.py", line 178, in create_content
fname = await self.media_storage.store_file(content, file_info)
File "/usr/local/lib/python3.9/site-packages/synapse/rest/media/v1/media_storage.py", line 94, in store_file
await self.write_to_file(source, f)
File "/usr/local/lib/python3.9/site-packages/synapse/rest/media/v1/media_storage.py", line 101, in write_to_file
await defer_to_thread(self.reactor, _write_file_synchronously, source, output)
File "/usr/local/lib/python3.9/site-packages/twisted/python/threadpool.py", line 244, in inContext
result = inContext.theWork() # type: ignore[attr-defined]
File "/usr/local/lib/python3.9/site-packages/twisted/python/threadpool.py", line 260, in <lambda>
inContext.theWork = lambda: context.call( # type: ignore[attr-defined]
File "/usr/local/lib/python3.9/site-packages/twisted/python/context.py", line 117, in callWithContext
return self.currentContext().callWithContext(ctx, func, *args, **kw)
File "/usr/local/lib/python3.9/site-packages/twisted/python/context.py", line 82, in callWithContext
return func(*args, **kw)
File "/usr/local/lib/python3.9/site-packages/synapse/logging/context.py", line 970, in g
return f(*args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/synapse/rest/media/v1/media_storage.py", line 312, in _write_file_synchronously
shutil.copyfileobj(source, dest)
File "/usr/local/lib/python3.9/shutil.py", line 205, in copyfileobj
buf = fsrc_read(length)
ValueError: I/O operation on closed file
synapse.http.server - 795 - WARNING - POST-8959 - Not sending response to request <XForwardedForRequest at 0x7f5bada94c40 method='POST' uri='/_matrix/media/r0/upload?filename=2.0gb' clientproto='HTTP/1.1' site='8008'>, already disconnected.
We would benefit overall by writing the bytes straight into a file on the correct filesystem as they're received from the client (either to the correct location, or to a temporary location with a rename operation upon completion). This would also mean that the client doesn't have to 'wait around' for the file copy to happen, since it's been happening during the upload (and in most cases, disk I/O is probably faster than the client's upload speed anyway!). In short: it will reduce upload latency a bit for large files.
(An uglier workaround would be to prevent the file copy from blocking the response, returning the MXC URL before it's even ready and finishing the copy in the background. I don't think it's great to give an MXC URL that's not ready — it could lead to races.)
Either solution probably involves some amount of spelunking into how Twisted manages these temporary files.