
Add native Polars DataFrame, Series, and LazyFrame support for all meta-learners #901

Open
aman-coder03 wants to merge 4 commits into uber:master from aman-coder03:feature/polars-support

Conversation

@aman-coder03

Contributor

Proposed changes

Closes #855
Currently, CausalML accepts only pandas DataFrames and NumPy arrays as inputs. This PR extends support to pl.DataFrame, pl.Series, and pl.LazyFrame across all meta-learners (T, S, X, R, DR) without requiring changes to the individual learner files.

What changed

  • causalml/inference/meta/utils.py: extended convert_pd_to_np() to detect Polars objects and convert them to NumPy at the entry boundary. Added a _polars_to_numpy() helper that handles LazyFrames (via an implicit .collect()) and squeezes single-column DataFrames to 1-D to match pandas Series behaviour. Updated check_p_conditions() to also accept pl.Series.
  • causalml/propensity.py: added convert_pd_to_np() calls inside PropensityModel.fit(), PropensityModel.predict(), and compute_propensity_score() to convert Polars inputs before they reach sklearn.
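
The boundary conversion described above could look roughly like the sketch below. This is not the PR's code: to_numpy_at_boundary is a hypothetical stand-in for convert_pd_to_np() / _polars_to_numpy(), and it assumes only that Polars is importable optionally.

```python
import numpy as np

try:  # Polars is optional; everything below still works without it.
    import polars as pl
except ImportError:
    pl = None


def to_numpy_at_boundary(obj):
    """Hypothetical stand-in for convert_pd_to_np() / _polars_to_numpy()."""
    if pl is not None:
        if isinstance(obj, pl.LazyFrame):
            obj = obj.collect()  # materialize the lazy plan once
        if isinstance(obj, pl.DataFrame):
            arr = obj.to_numpy()
            # Squeeze a single-column frame to 1-D, mirroring a pandas Series.
            return arr[:, 0] if arr.shape[1] == 1 else arr
        if isinstance(obj, pl.Series):
            return obj.to_numpy()
    # pandas objects expose .to_numpy(); arrays and lists go through asarray.
    if hasattr(obj, "to_numpy"):
        return obj.to_numpy()
    return np.asarray(obj)


features = to_numpy_at_boundary([[1.0, 2.0], [3.0, 4.0]])
treatment = to_numpy_at_boundary(np.array([1, 0]))
```

Because the Polars branch is guarded by the optional import, callers without Polars installed follow exactly the pre-existing pandas/NumPy path.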

Design decisions

  • Polars is an optional dependency: if it is not installed, everything behaves exactly as before.
  • LazyFrame support: inputs are implicitly collected at the boundary via .collect(). This keeps the scope simple while still supporting lazy pipelines.
  • Return types unchanged: all methods still return NumPy arrays, preserving the existing API contract.

Testing

Added tests/test_polars_support.py with coverage across all five learner types (T, S, X, R, DR). The tests verify that Polars inputs produce identical results to NumPy/pandas inputs and cover edge cases such as mixed inputs and fit-on-numpy/predict-on-polars.

Types of changes

What types of changes does your code introduce to CausalML?
Put an x in the boxes that apply

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation Update (if none of the other choices apply)

Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your code.

  • I have read the CONTRIBUTING doc
  • I have signed the CLA
  • Lint and unit tests pass locally with my changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)
  • Any dependent changes have been merged and published in downstream modules

Further comments

If this is a relatively large or complex change, kick off the discussion by explaining why you chose the solution you did and what alternatives you considered, etc. This PR template is adopted from appium.

@jeongyoonlee

Collaborator

Thanks for looking into this, @aman-coder03.

As I commented on #855, converting the dataframe into numpy is not ideal for Polars, as it doesn't benefit from its performant features. Instead, adding support for native Pandas and Polars DataFrames is recommended. It will require careful inspection of the current indexing to make it comparable to the indexing on DataFrames.

@aman-coder03

Contributor Author

Thanks for the feedback, @jeongyoonlee.

I understand the concern: converting to NumPy at the boundary means Polars users lose the performance benefits (lazy evaluation, zero-copy operations, columnar efficiency) that they came for in the first place.

For the native approach, my plan would be:

  • Add a thin abstraction layer in utils.py with helpers like filter_by_mask(), concat_cols(), and get_values() that dispatch to the correct pandas/Polars/NumPy operation based on the input type.
  • Audit every indexing pattern across all five learner files (tlearner.py, slearner.py, xlearner.py, rlearner.py, drlearner.py), then replace them with these helpers.
  • The key patterns to handle are boolean mask filtering (X[mask]), column concatenation (np.hstack), and in-place mutation (Polars is immutable).
  • Keep convert_pd_to_np() only at the final NumPy-only boundaries (sklearn's cross_val_predict in the R-learner), since sklearn doesn't accept Polars natively.
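
A minimal sketch of what such dispatch helpers might look like follows. The helper names filter_by_mask and concat_cols come from the plan above, but the signatures and bodies are assumptions made for illustration, not the PR's implementation; pandas and Polars are both treated as optional.

```python
import numpy as np

try:
    import pandas as pd
except ImportError:
    pd = None
try:
    import polars as pl
except ImportError:
    pl = None


def filter_by_mask(X, mask):
    """Keep the rows of X where mask is True, preserving the container type."""
    mask = np.asarray(mask, dtype=bool)
    if pl is not None and isinstance(X, pl.DataFrame):
        return X.filter(pl.Series(mask))
    if pd is not None and isinstance(X, pd.DataFrame):
        return X.loc[mask]
    return np.asarray(X)[mask]


def concat_cols(X, column, name="treatment"):
    """Append a 1-D column to X (the np.hstack pattern), preserving the type."""
    column = np.asarray(column)
    if pl is not None and isinstance(X, pl.DataFrame):
        return X.with_columns(pl.Series(name, column))
    if pd is not None and isinstance(X, pd.DataFrame):
        return X.assign(**{name: column})
    return np.hstack([np.asarray(X), column.reshape(-1, 1)])


X = np.arange(6).reshape(3, 2)
treated_rows = filter_by_mask(X, [True, False, True])
with_treatment = concat_cols(X, [1, 0, 1])
```

Normalizing the mask and the appended column to NumPy inside the helper keeps the type dispatch limited to X itself.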

Before I go ahead and rewrite, a couple of questions to make sure I am heading in the right direction:

  1. Should the output of predict() remain a NumPy array, or would you like an option to return a Polars DataFrame when the input was Polars?
  2. For sklearn calls like cross_val_predict (R-learner) and propensity model fitting, converting to NumPy at that specific boundary seems unavoidable. Does that approach work for you, or do you have a preferred alternative?

Happy to update the PR once we are aligned.

@jeongyoonlee

Collaborator

Thanks for the detailed plan, @aman-coder03. One important correction before you start: converting to NumPy at the sklearn boundary isn't necessary.

  • scikit-learn ≥ 1.4 accepts Polars (and pandas) DataFrames natively via the DataFrame Interchange Protocol (release notes); causalml pins scikit-learn>=1.6.0.
  • XGBoost ≥ 3.1 accepts pl.DataFrame and pl.LazyFrame directly (docs).
  • LightGBM nominally accepts pl.DataFrame but has a known sklearn-API bug (issue #6849) — document as a caveat.

So the rewrite scope collapses to causalml's own indexing/concat/mutation — not the model calls.

Answers to your questions:

  1. predict() output type. Keep returning NumPy. The entire underlying stack does the same: scikit-learn's .predict() returns NumPy regardless of input (only transformers honor set_output("polars"), not predictors), XGBoost's .predict() returns NumPy (or cupy.ndarray on GPU) even when fed a pl.DataFrame via its zero-copy Arrow path, and LightGBM's .predict() returns NumPy as documented. Matching that convention keeps causalml's API consistent with sklearn/xgboost/lightgbm and preserves backward compatibility. A return_type="polars" opt-in can be added later if users ask.

  2. NumPy conversion at sklearn boundaries. Skip it — pass Polars/pandas straight through. Zero materialization overhead, and dtype/feature-name plumbing is handled by the ML library.

Suggested phasing:

  • Phase 1: Remove convert_pd_to_np() in favor of native pandas DataFrame support across all meta-learners. No Polars yet — this isolates the indexing-polymorphism work and gives a clean baseline.
  • Phase 2: Add Polars support to one estimator (e.g. BaseTLearner) as a reference implementation with equivalence tests (NumPy / pandas / Polars produce identical te).
  • Phase 3+: Extend Polars to the remaining learners (S/X/R/DR), one PR each.

A few practical notes for the audit:

  • Enumerate the actual patterns first. Grep all five learners for indexing/concat/mutation (X[mask], np.hstack, .iloc, in-place assignment) and design the minimal helper set from real usage, not guesses.
  • Polars is immutable — anywhere current code does X[mask] = value or accumulates by index assignment, you'll need an explicit rewrite (with_columns, when/then/otherwise), not just a dispatch helper.
  • Add polars to test deps in Phase 2. The current PR's tests don't run in CI because Polars isn't installed there — pytest.importorskip("polars") skips the whole file silently.
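
As an illustration of the immutability point, the sketch below replaces an X[mask, col] = value style assignment with a function that returns a new object. The name set_where is hypothetical, and the Polars branch assumes the when/then/otherwise expression API with the mask passed as a literal Series.

```python
import numpy as np

try:
    import polars as pl
except ImportError:
    pl = None


def set_where(X, mask, column, value):
    """Return a copy of X with `value` written into `column` where mask is True.

    `column` is a column name for a pl.DataFrame and an integer index for NumPy.
    """
    mask = np.asarray(mask, dtype=bool)
    if pl is not None and isinstance(X, pl.DataFrame):
        # No in-place assignment in Polars: build the new column as an expression.
        new_col = (
            pl.when(pl.lit(pl.Series("mask", mask)))
            .then(pl.lit(value))
            .otherwise(pl.col(column))
            .alias(column)
        )
        return X.with_columns(new_col)
    out = np.array(X, copy=True)  # copy so the caller's array is untouched
    out[mask, column] = value
    return out


original = np.zeros((3, 2))
updated = set_where(original, [True, False, True], 1, 5.0)
```

Returning a new object on the NumPy path as well gives both branches the same non-mutating semantics, which makes equivalence tests simpler to write.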

Happy to review Phase 1 once it's ready.

@aman-coder03

Contributor Author

Thanks for the detailed feedback, @jeongyoonlee.
I understand the phasing suggestion, but I've intentionally tried to complete the full Polars support in this single PR, covering all five learners (T/S/X/R/DR) together. My reasoning was that the indexing patterns are largely the same across learners, so the helper set (filter_mask, filter_index, prepend_column, etc.) naturally fell out of auditing all of them at once rather than discovering gaps PR by PR.

I'm happy to reorganize if you'd prefer the phased approach: I can split this into separate PRs for Phase 1 (pandas cleanup) and Phase 2/3 (Polars per learner). But if the current implementation looks sound to you, I'd love to land it as is to avoid rebasing overhead.

@jeongyoonlee left a comment

Collaborator

Thanks for pushing this through to a full implementation, @aman-coder03 — the helper-based
dispatch (filter_mask / filter_index / prepend_column / concat_treatment_col) is the
right abstraction, and the T-learner is a good reference (docstrings kept + updated, explicit
LazyFrame.collect() guard). The direction is solid.

One process note before the specifics: this landed all of Phase 1–3 plus a large cosmetic
refactor in a single ~1k-line PR, which makes it hard to review and bisect. I'm not going to
ask you to re-split it now, but the four items below are blocking regardless of how it's
packaged.

Blocking

1. Circular import — import causalml.propensity crashes on a cold import

propensity.py adds a top-level from causalml.inference.meta.utils import convert_pd_to_np.
Importing that submodule runs meta/__init__.py → slearner.py → base.py (from causalml.propensity import compute_propensity_score) → back into the half-initialized
propensity module. Reproduced by applying just that one line:

File ".../causalml/propensity.py", line 11, in <module>
    from causalml.inference.meta.utils import convert_pd_to_np
File ".../causalml/inference/meta/__init__.py", line 1, in <module>
    from .slearner import LRSRegressor, BaseSLearner, BaseSRegressor, BaseSClassifier
File ".../causalml/inference/meta/slearner.py", line 9, in <module>
    from causalml.inference.meta.base import BaseLearner
File ".../causalml/inference/meta/base.py", line 8, in <module>
    from causalml.propensity import compute_propensity_score
ImportError: cannot import name 'compute_propensity_score' from partially initialized
module 'causalml.propensity' (most likely due to a circular import)

Both import causalml.propensity and from causalml.propensity import ElasticNetPropensityModel fail cold. It passes locally only because the test process imports
causalml.inference.meta first, which caches it before propensity runs its new import — so
the cycle never re-triggers under pytest, but it breaks normal user entry points.

Fix: make the import function-local in the three methods that use it, or drop it — sklearn
≥1.6 accepts pandas/Polars natively, so PropensityModel.fit/predict may not need the
conversion at all.
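
The failure mode and the function-local fix can be reproduced in miniature with throwaway packages. In the sketch below, demo_top, demo_local, compute_score and convert are hypothetical stand-ins for causalml.propensity, compute_propensity_score and convert_pd_to_np; only the import shape mirrors the traceback above.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Variant 1: top-level import, which re-enters the half-initialized module.
TOP_LEVEL = (
    "from {pkg}.meta.utils import convert\n"
    "def compute_score(x):\n"
    "    return convert(x)\n"
)
# Variant 2: function-local import, deferred until the first call.
FUNCTION_LOCAL = (
    "def compute_score(x):\n"
    "    from {pkg}.meta.utils import convert\n"
    "    return convert(x)\n"
)


def cold_import(pkg, propensity_src):
    """Write a tiny package to a temp dir and import <pkg>.propensity first."""
    root = Path(tempfile.mkdtemp())
    meta = root / pkg / "meta"
    meta.mkdir(parents=True)
    (root / pkg / "__init__.py").write_text("")
    (root / pkg / "propensity.py").write_text(propensity_src.format(pkg=pkg))
    # meta/__init__.py imports back from propensity, closing the cycle.
    (meta / "__init__.py").write_text(f"from {pkg}.propensity import compute_score\n")
    (meta / "utils.py").write_text("def convert(x):\n    return list(x)\n")
    sys.path.insert(0, str(root))
    importlib.invalidate_caches()
    return importlib.import_module(f"{pkg}.propensity")


def fails_cold(pkg, propensity_src):
    try:
        cold_import(pkg, propensity_src)
    except ImportError:
        return True
    return False


top_level_fails = fails_cold("demo_top", TOP_LEVEL)
fixed = cold_import("demo_local", FUNCTION_LOCAL)
```

With the deferred import, the propensity module finishes initializing before meta/__init__.py asks it for compute_score, so the cycle resolves at call time instead of at import time.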

2. polars is not declared as a dependency → the test file is skipped in CI

pyproject.toml is untouched, and tests/test_polars_support.py starts with
pytest.importorskip("polars"), so the entire feature is silently skipped in CI (same point
from the 2026-05-28 round). The code is already optional-ready (try/except in utils.py,
guarded import polars in tlearner.py), so keep polars optional — don't add it to core
[project.dependencies]. Add an optional extra and wire it into test so CI (which installs
-e ".[test]") actually runs the suite:

[project.optional-dependencies]
polars = ["polars>=1.0.0"]      # users: pip install causalml[polars]
test = [
    "pytest>=4.6",
    "pytest-cov>=4.0",
    "causalml[polars]",          # self-referencing extra so .[test] pulls polars in CI
]

A dev without polars still gets a clean importorskip skip; no --runpolars flag needed.

3. Docstrings deleted across S/X/R/DR

Every class / __init__ / fit / predict / fit_predict / estimate_ate docstring in
slearner.py, xlearner.py, rlearner.py, drlearner.py was removed, including the paper
references (Kennedy 2020, Nie & Wager 2019, Künzel et al. 2018). These render on readthedocs.
The T-learner correctly kept and updated its docstrings to mention pl.DataFrame — please
follow that pattern for the other four rather than deleting.

4. LazyFrame handling is inconsistent — and it points at the core design rule

Only the T-learner guards predict with if isinstance(X, pl.LazyFrame): X = X.collect().
In S/X/R/DR, predict passes X to prepend_column / model.predict directly — and
prepend_column/concat_treatment_col have no LazyFrame branch and call len(X), which
raises on a pl.LazyFrame. It's untested because the only test_lazyframe_input lives in
TestTLearnerPolars.

We want to keep X native end-to-end (that's the whole point: down-converting X to numpy
would throw away the Polars benefit, which matters in particular for the xgboost zero-copy path
and column-name-aware Pipeline/ColumnTransformer learners). The fix is to apply the
contract consistently:

X stays native end-to-end; treatment / y / p / sample_weight normalize to numpy at entry.
Those are 1-D vectors that masking/np.unique/.astype need, and they're unrelated to the wide-frame promise.

Concretely:

  • Collect LazyFrame once at the top of each public method into a pl.DataFrame (not
    numpy). You have to collect to row-mask anyway, so this is the natural single point. After
    that, the helpers only need to handle pl.DataFrame/pl.Series, and S/X/R/DR get LazyFrame
    support for free.
  • Never to_numpy(X) just to read a row count: drlearner.bootstrap does
    to_numpy(X).shape[0], and the te = np.zeros((X.shape[0], ...)) allocations should read
    the count natively (X.shape[0] / len(X) work for numpy/pandas/polars).
  • Let bootstrap resample on the native X via filter_index (callers currently pre-convert
    to X_np before calling it).
  • Tests to lock the contract in: numpy == pandas == polars equivalence for every learner
    (regressor and classifier); a fake learner asserting isinstance(X, pl.DataFrame) inside
    fit/predict so a reintroduced to_numpy(X) fails loudly; a no-feature-name-warning
    assertion on the fit-DataFrame/predict-DataFrame path; and a by-name Pipeline learner.
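
The contract and the "fake learner" test idea above might be sketched as follows. normalize_inputs and RecordingLearner are hypothetical names, not part of causalml; the sketch assumes only that Polars is optional and that X exposes .shape.

```python
import numpy as np

try:
    import polars as pl
except ImportError:
    pl = None


def normalize_inputs(X, treatment, y):
    """X stays native; treatment and y are normalized to 1-D NumPy vectors."""
    if pl is not None and isinstance(X, pl.LazyFrame):
        X = X.collect()  # collect once, into a pl.DataFrame rather than NumPy

    def as_vector(v):
        v = v.to_numpy() if hasattr(v, "to_numpy") else np.asarray(v)
        return np.asarray(v).reshape(-1)

    return X, as_vector(treatment), as_vector(y)


class RecordingLearner:
    """Fake base learner that records the container type it is handed."""

    def __init__(self):
        self.seen = []

    def fit(self, X, y):
        self.seen.append(type(X))
        return self

    def predict(self, X):
        self.seen.append(type(X))
        return np.zeros(X.shape[0])  # row count read natively, no conversion


X, w, y = normalize_inputs(np.ones((4, 2)), [0, 1, 0, 1], [[1.0], [2.0], [3.0], [4.0]])
learner = RecordingLearner().fit(X, y)
preds = learner.predict(X)
```

In a real test the assertion would compare learner.seen against pl.DataFrame, so that a reintroduced to_numpy(X) anywhere in the call path shows up as a type mismatch.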

Note: lightgbm still has the sklearn-API Polars bug (lightgbm-org/LightGBM#6849), so native will
break it — document as a caveat or convert only at the lightgbm boundary.


Non-blocking polish (smaller correctness + test-coverage items) is in a follow-up note so it
doesn't clutter the merge gate. Happy to re-review quickly once the four above are addressed.


Development

Successfully merging this pull request may close these issues.

Polars support on CausalML

2 participants