feat: avoid double fetches on ibis.duckdb.connect().read_csv("https://slow_url").cache()
#10845
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Can you break down the timing of:

`ibis.duckdb.connect().read_csv("https://....").cache()`

and

`ibis.duckdb.connect().read_csv("https://slow_url").cache()`

? It's not clear that there's a "double fetch" (as in the two fetches are equivalently slow) so much as a small fetch followed by a large fetch.
Yeah, good idea. I'll do that and report back.
By running this several times:

```python
import time

import duckdb

url_388mb = "https://github.com/NickCrews/apoc-data/releases/download/20250514-111142/income.csv"
url_4mb = "https://github.com/NickCrews/apoc-data/releases/download/20250514-111142/campaign_form.csv"


def timeit(sql):
    con = duckdb.connect()
    con.execute("SET enable_progress_bar = false;")
    start = time.time()
    con.sql(sql)
    end = time.time()
    print(f"Execution time: {end - start:.4f} seconds")


def view_then_table(url):
    return (
        f"CREATE VIEW v AS FROM read_csv('{url}'); CREATE TABLE t AS SELECT * FROM v;"
    )


def table_immediate(url):
    return f"CREATE TABLE t AS FROM read_csv('{url}');"


timeit(view_then_table(url_388mb))
timeit(table_immediate(url_388mb))
timeit(view_then_table(url_4mb))
timeit(table_immediate(url_4mb))
```

Here are a few results, on a 460 Mbps download connection, per https://speed.measurementlab.net/#/
It appears that duckdb takes about 1-2 seconds to do the initial fetch, which grabs just enough rows to infer the schema of the table. For "large" (~400 MB) files, this isn't that big of a deal. But for "small" (~4 MB) files, it can double the total time taken.
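One way to sidestep the schema-sniffing fetch entirely is to hand DuckDB the schema up front via `read_csv`'s `columns` parameter, which takes an explicit name-to-type map. This is only a sketch: the helper, the table name `t`, and the column names are hypothetical, not anything from ibis.

```python
def read_csv_with_schema(url: str, columns: dict[str, str]) -> str:
    # Build a DuckDB statement with an explicit `columns` map, so read_csv
    # can skip the sampling pass it would otherwise do to infer types.
    cols = ", ".join(f"'{name}': '{typ}'" for name, typ in columns.items())
    return f"CREATE TABLE t AS FROM read_csv('{url}', columns = {{{cols}}});"


print(read_csv_with_schema("data.csv", {"name": "VARCHAR", "amount": "DOUBLE"}))
```

This trades the convenience of sniffing for a single fetch, which may be worthwhile on a slow connection.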
I'm sympathetic to the problem here, but can we perhaps try to get this performance difference fixed upstream in DuckDB so that we don't have to alter the Ibis API?
Is your feature request related to a problem?
When you call `.read_csv()` on the duckdb backend, duckdb actually goes and fetches [some] of the data in order to sniff the schema. Then, when you call `.cache()` on the created view, it goes and fetches the full data. This is related to #9931.
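As a rough sketch of the sequence described above (the SQL shapes are assumed, not ibis's exact statements, and the view/table names are hypothetical):

```python
def read_csv_then_cache_statements(url: str) -> list[str]:
    # Assumed shapes of the two statements behind read_csv(...).cache()
    # on the duckdb backend.
    return [
        # read_csv(): creates a VIEW; duckdb samples the file here to
        # sniff the schema (the first, small fetch)
        f"CREATE VIEW v AS FROM read_csv('{url}')",
        # cache(): materializes the view; the full file is fetched again
        "CREATE TABLE t AS SELECT * FROM v",
    ]


for stmt in read_csv_then_cache_statements("https://slow_url/data.csv"):
    print(stmt)
```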
What is the motivation behind your request?
I am working on relatively large tables on a slow internet connection. Each fetch takes about 30 seconds. I would like to avoid this double fetch.
Describe the solution you'd like
Since the result of `.read_csv()` needs to be a Table with a known schema, fetching some data during that function call is unavoidable. So, I think we need to add an optional argument to the function, or create an entirely new function. I would vote for adding params if we can come up with something sane. Maybe `cache: bool`?

What version of ibis are you running?
main
What backend(s) are you using, if any?
duckdb
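The proposal above, with a `cache` flag choosing between a lazy view and an immediately materialized table, could be sketched as follows. This is not the actual ibis API; the function and table names are hypothetical.

```python
def read_csv_sql(url: str, table_name: str, cache: bool = False) -> str:
    # cache=True: materialize a TABLE in one statement, so the file is
    # fetched only once. cache=False: current behavior, a lazy VIEW that
    # re-fetches the file on each query (including a later .cache()).
    kind = "TABLE" if cache else "VIEW"
    return f"CREATE {kind} {table_name} AS FROM read_csv('{url}')"


print(read_csv_sql("https://slow_url/data.csv", "t", cache=True))
print(read_csv_sql("https://slow_url/data.csv", "t"))
```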