feat: avoid double fetches on ibis.duckdb.connect().read_csv("https://slow_url").cache()
#10845
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Can you break down the timing of:

`ibis.duckdb.connect().read_csv("https://....").cache()`

and

`ibis.duckdb.connect().read_csv("https://slow_url").cache()`

? It's not clear that there's a "double fetch" (as in the two fetches are equivalently slow) so much as a small fetch followed by a large fetch.
Yeah, good idea. I'll do that and report back.
By running this several times:

```python
import time

import duckdb

url_388mb = "https://github.com/NickCrews/apoc-data/releases/download/20250514-111142/income.csv"
url_4mb = "https://github.com/NickCrews/apoc-data/releases/download/20250514-111142/campaign_form.csv"


def timeit(sql):
    con = duckdb.connect()
    con.execute("SET enable_progress_bar = false;")
    start = time.time()
    con.sql(sql)
    end = time.time()
    print(f"Execution time: {end - start:.4f} seconds")


def view_then_table(url):
    return (
        f"CREATE VIEW v AS FROM read_csv('{url}'); CREATE TABLE t AS SELECT * FROM v;"
    )


def table_immediate(url):
    return f"CREATE TABLE t AS FROM read_csv('{url}');"


timeit(view_then_table(url_388mb))
timeit(table_immediate(url_388mb))
timeit(view_then_table(url_4mb))
timeit(table_immediate(url_4mb))
```

Here are a few results, on a 460 Mbps download connection, per https://speed.measurementlab.net/#/
It appears that duckdb takes about 1-2 seconds to do the initial fetch, which grabs just enough rows to infer the schema of the table. For "large" (~400 MB) files, this isn't that big of a deal. But for "small" (~4 MB) files, it can double the total time taken.
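One way to sidestep the schema-sniffing fetch entirely is to hand DuckDB the schema up front via `read_csv`'s `columns` parameter, which takes an explicit name-to-type map. This is only a sketch: the helper, the table name `t`, and the column names are hypothetical, not anything from ibis.

```python
def read_csv_with_schema(url: str, columns: dict[str, str]) -> str:
    # Build a DuckDB statement with an explicit `columns` map, so read_csv
    # can skip the sampling pass it would otherwise do to infer types.
    cols = ", ".join(f"'{name}': '{typ}'" for name, typ in columns.items())
    return f"CREATE TABLE t AS FROM read_csv('{url}', columns = {{{cols}}});"


print(read_csv_with_schema("data.csv", {"name": "VARCHAR", "amount": "DOUBLE"}))
```

This trades the convenience of sniffing for a single fetch, which may be worthwhile on a slow connection.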
I'm sympathetic to the problem here, but can we perhaps try to get this performance difference fixed upstream in DuckDB so that we don't have to alter the Ibis API?
Is your feature request related to a problem?
When you call `.read_csv()` on the duckdb backend, duckdb actually goes and fetches [some] of the data in order to sniff the schema. Then, when you call `.cache()` on the created view, it goes and fetches the full data. This is related to #9931.
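As a rough sketch of the sequence described above (the SQL shapes are assumed, not ibis's exact statements, and the view/table names are hypothetical):

```python
def read_csv_then_cache_statements(url: str) -> list[str]:
    # Assumed shapes of the two statements behind read_csv(...).cache()
    # on the duckdb backend.
    return [
        # read_csv(): creates a VIEW; duckdb samples the file here to
        # sniff the schema (the first, small fetch)
        f"CREATE VIEW v AS FROM read_csv('{url}')",
        # cache(): materializes the view; the full file is fetched again
        "CREATE TABLE t AS SELECT * FROM v",
    ]


for stmt in read_csv_then_cache_statements("https://slow_url/data.csv"):
    print(stmt)
```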
What is the motivation behind your request?
I am working on relatively large tables on a slow internet connection. Each fetch takes about 30 seconds. I would like to avoid this double fetch.
Describe the solution you'd like
Since the result of `.read_csv()` needs to be a Table with a known schema, fetching some data during that function call is unavoidable. So, I think we need to add an optional argument to the function, or create an entirely new function. I would vote for adding params if we can come up with something sane. Maybe `cache: bool`?

What version of ibis are you running?
main
What backend(s) are you using, if any?
duckdb
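The proposal above, with a `cache` flag choosing between a lazy view and an immediately materialized table, could be sketched as follows. This is not the actual ibis API; the function and table names are hypothetical.

```python
def read_csv_sql(url: str, table_name: str, cache: bool = False) -> str:
    # cache=True: materialize a TABLE in one statement, so the file is
    # fetched only once. cache=False: current behavior, a lazy VIEW that
    # re-fetches the file on each query (including a later .cache()).
    kind = "TABLE" if cache else "VIEW"
    return f"CREATE {kind} {table_name} AS FROM read_csv('{url}')"


print(read_csv_sql("https://slow_url/data.csv", "t", cache=True))
print(read_csv_sql("https://slow_url/data.csv", "t"))
```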