duckdb's integration with polars is really nice. I'd really like to be able to use polars' pythonic notation with duckdb's superior performance on large, out-of-RAM data (including remote data, like partitioned parquet files on S3 buckets). Based on the current documentation, though, it looks like the `.pl()` method creates an in-memory dataframe, not a lazy dataframe. Is it possible to generate a lazy dataframe instead?
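For context, a minimal sketch of the behavior described above: `.pl()` executes the query and returns an eager, in-memory polars DataFrame (the parquet glob here is a placeholder):

```python
import duckdb

# .pl() materializes the full result as an in-memory polars DataFrame
df = duckdb.sql("SELECT * FROM 'part-*.parquet'").pl()
```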
Replies: 7 comments 6 replies

---

Hmm, interesting! What are the use cases you want to enable with that? At some point, DuckDB will need to execute and materialize the results in Arrow format, since we don't use Arrow internally. Or would an Arrow recordset/scanner work? We can read those, but I don't think we write them...
---

Thanks for the reply; I should have included more context, and maybe my question doesn't make sense in the context of polars. My basic use case is for students to write the more pythonic 'piped' syntax we see in …
---

Hi, it would be useful to use a LazyFrame without loading all the data into memory. For instance:
`duckdb.sql("SELECT * FROM large_data").lazy().filter(pl.col("A") == 4).select(pl.col("A") + 10).limit(10).collect()`
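Until something like that exists, a similar piped-but-deferred pipeline can be written with DuckDB's own relational API, which also defers execution until results are fetched. A sketch, reusing the hypothetical `large_data` table from the comment above:

```python
import duckdb

rel = duckdb.sql("SELECT * FROM large_data")  # lazy relation; nothing executes yet

result = (
    rel.filter("A = 4")            # still lazy
       .project("A + 10 AS A")     # still lazy
       .limit(10)                  # still lazy
       .pl()                       # executes and materializes only these 10 rows
)
```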
---

Looking for the same thing! This would open the door to a really strong duckdb+delta integration, because polars has a …
---

Following up here: with the announcement of DuckLake, there's an even stronger case for this integration. Needing polars to add a read_ducklake before adopting would be a bit unfortunate, but if duckdb supported a …
---

The attention here is misplaced; this is not a DuckDB limitation. We can create a record batch reader that lazily scans a duckdb relation just fine, but Polars does not expose any way to lazily construct a LazyFrame: all of its ingestion methods eagerly materialize the input. Please raise awareness on the Polars side if you want this feature; we'd be happy to implement it when they support it.
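A sketch of the DuckDB side this comment describes, assuming the documented `fetch_record_batch` API (`large_data` is again a placeholder): the returned `pyarrow.RecordBatchReader` pulls batches incrementally instead of materializing the whole result.

```python
import duckdb

con = duckdb.connect()
con.execute("SELECT * FROM large_data")

# pyarrow.RecordBatchReader: batches are produced as they are consumed
reader = con.fetch_record_batch(100_000)

for batch in reader:   # each item is a pyarrow.RecordBatch
    print(batch.num_rows)
```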
---

This has been implemented in #17947 and will land in v1.4.0.
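Assuming the shape of the change in that PR, usage might look like the sketch below; the exact parameter name (guessed here as `lazy=True`) and any batching options should be checked against the PR itself.

```python
import duckdb
import polars as pl

rel = duckdb.sql("SELECT * FROM large_data")

# assumed API from the PR: a polars LazyFrame backed by the relation,
# so rows are only pulled from DuckDB when .collect() runs
lf = rel.pl(lazy=True)

out = lf.filter(pl.col("A") == 4).select(pl.col("A") + 10).limit(10).collect()
```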