Add native Polars DataFrame, Series, and LazyFrame support for all meta-learners #901
aman-coder03 wants to merge 4 commits
Thanks for looking into this, @aman-coder03. As I commented on #855, converting the dataframe into numpy is not ideal for Polars, as it doesn't benefit from its performant features. Instead, adding support for native Pandas and Polars DataFrames is recommended. It will require careful inspection of the current indexing to make it comparable to the indexing on DataFrames.
Thanks for the feedback, @jeongyoonlee. I understand the concern: converting to NumPy at the boundary means Polars users lose the performance benefits (lazy evaluation, zero-copy operations, columnar efficiency) that they came for in the first place. For the native approach, my plan would be...
Before I go ahead and rewrite, a couple of questions to make sure I am heading in the right direction...
Happy to update the PR once we are aligned.
Thanks for the detailed plan, @aman-coder03. One important correction before you start: converting to NumPy at the sklearn boundary isn't necessary.
So the rewrite scope collapses to causalml's own indexing/concat/mutation — not the model calls. Answers to your questions:
Suggested phasing:
A few practical notes for the audit:
Happy to review Phase 1 once it's ready.
Thanks for the detailed feedback, @jeongyoonlee. I'm happy to reorganize if you'd prefer the phased approach: I can split this into separate PRs for Phase 1 (pandas cleanup) and Phase 2/3 (Polars per learner). But if the current implementation looks sound to you, I'd love to land it as is to avoid rebasing overhead.
jeongyoonlee left a comment
Thanks for pushing this through to a full implementation, @aman-coder03 — the helper-based
dispatch (filter_mask / filter_index / prepend_column / concat_treatment_col) is the
right abstraction, and the T-learner is a good reference (docstrings kept + updated, explicit
LazyFrame.collect() guard). The direction is solid.
One process note before the specifics: this landed all of Phase 1–3 plus a large cosmetic
refactor in a single ~1k-line PR, which makes it hard to review and bisect. I'm not going to
ask you to re-split it now, but the four items below are blocking regardless of how it's
packaged.
Blocking
1. Circular import — import causalml.propensity crashes on a cold import
propensity.py adds a top-level from causalml.inference.meta.utils import convert_pd_to_np.
Importing that submodule runs meta/__init__.py → slearner → base.py (from causalml.propensity import compute_propensity_score) back into the half-initialized
propensity module. Reproduced by applying just that one line:
    File ".../causalml/propensity.py", line 11, in <module>
      from causalml.inference.meta.utils import convert_pd_to_np
    File ".../causalml/inference/meta/__init__.py", line 1, in <module>
      from .slearner import LRSRegressor, BaseSLearner, BaseSRegressor, BaseSClassifier
    File ".../causalml/inference/meta/slearner.py", line 9, in <module>
      from causalml.inference.meta.base import BaseLearner
    File ".../causalml/inference/meta/base.py", line 8, in <module>
      from causalml.propensity import compute_propensity_score
    ImportError: cannot import name 'compute_propensity_score' from partially initialized module 'causalml.propensity' (most likely due to a circular import)
Both import causalml.propensity and from causalml.propensity import ElasticNetPropensityModel fail cold. It passes locally only because the test process imports
causalml.inference.meta first, which caches it before propensity runs its new import — so
the cycle never re-triggers under pytest, but it breaks normal user entry points.
Fix: make the import function-local in the three methods that use it, or drop it — sklearn
≥1.6 accepts pandas/Polars natively, so PropensityModel.fit/predict may not need the
conversion at all.
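The failure mode and the function-local fix can be reproduced outside causalml. The sketch below is only an illustration: it writes two throwaway modules to a temp directory, where the hypothetical `*_prop` module stands in for `causalml.propensity` and `*_meta` for `causalml.inference.meta`. None of these names exist in the library.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical stand-ins: *_prop plays causalml.propensity,
# *_meta plays causalml.inference.meta (which imports back from propensity).
BROKEN_PROP = (
    "from broken_meta import convert\n"      # top-level import closes the cycle
    "def compute_score(x):\n"
    "    return convert(x)\n"
)
FIXED_PROP = (
    "def compute_score(x):\n"
    "    from fixed_meta import convert\n"   # deferred until the first call
    "    return convert(x)\n"
)
META_TEMPLATE = (
    "from {prop} import compute_score\n"
    "def convert(x):\n"
    "    return list(x)\n"
)


def cold_import(prefix, prop_source):
    """Write both modules to a fresh directory and import <prefix>_prop first."""
    root = Path(tempfile.mkdtemp())
    (root / f"{prefix}_prop.py").write_text(prop_source)
    (root / f"{prefix}_meta.py").write_text(META_TEMPLATE.format(prop=f"{prefix}_prop"))
    sys.path.insert(0, str(root))  # left on sys.path so the deferred import still resolves
    importlib.invalidate_caches()
    return importlib.import_module(f"{prefix}_prop")


# Top-level import: the cold import of the "propensity" module fails.
try:
    cold_import("broken", BROKEN_PROP)
    broken_error = None
except ImportError as exc:
    broken_error = exc

# Function-local import: the cold import succeeds and the call works.
fixed = cold_import("fixed", FIXED_PROP)
result = fixed.compute_score((1, 2))
```

Importing the "meta" stand-in first would hide the problem in the broken variant too, which matches the observation that the cycle never triggers under pytest.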
2. polars is not declared as a dependency → the test file is skipped in CI
pyproject.toml is untouched, and tests/test_polars_support.py starts with
pytest.importorskip("polars"), so the entire feature is silently skipped in CI (same point
from the 2026-05-28 round). The code is already optional-ready (try/except in utils.py,
guarded import polars in tlearner.py), so keep polars optional — don't add it to core
[project.dependencies]. Add an optional extra and wire it into test so CI (which installs
-e ".[test]") actually runs the suite:
    [project.optional-dependencies]
    polars = ["polars>=1.0.0"]  # users: pip install causalml[polars]
    test = [
        "pytest>=4.6",
        "pytest-cov>=4.0",
        "causalml[polars]",  # self-referencing extra so .[test] pulls polars in CI
    ]

A dev without polars still gets a clean importorskip skip; no --runpolars flag needed.
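For reference, the optional-import pattern mentioned above (try/except around the polars import) might look roughly like this. `HAS_POLARS` and `is_polars_object` are hypothetical names chosen for illustration, not the identifiers used in utils.py.

```python
# Sketch of an optional dependency guard; all names here are illustrative.
try:
    import polars as pl
    HAS_POLARS = True
except ImportError:
    pl = None
    HAS_POLARS = False


def is_polars_object(obj):
    """Return True only when polars is installed and obj is one of its containers."""
    if not HAS_POLARS:
        return False
    return isinstance(obj, (pl.DataFrame, pl.Series, pl.LazyFrame))
```

With a guard like this, core code paths never touch `pl` unless polars is actually installed, so the package stays importable without the extra.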
3. Docstrings deleted across S/X/R/DR
Every class / __init__ / fit / predict / fit_predict / estimate_ate docstring in
slearner.py, xlearner.py, rlearner.py, drlearner.py was removed, including the paper
references (Kennedy 2020, Nie & Wager 2019, Künzel et al. 2018). These render on readthedocs.
The T-learner correctly kept and updated its docstrings to mention pl.DataFrame — please
follow that pattern for the other four rather than deleting.
4. LazyFrame handling is inconsistent — and it points at the core design rule
Only the T-learner guards predict with if isinstance(X, pl.LazyFrame): X = X.collect().
In S/X/R/DR, predict passes X to prepend_column / model.predict directly — and
prepend_column/concat_treatment_col have no LazyFrame branch and call len(X), which
raises on a pl.LazyFrame. It's untested because the only test_lazyframe_input lives in
TestTLearnerPolars.
We want to keep X native end-to-end (that's the whole point: down-converting X to numpy
would throw away the Polars benefit, and matters in particular for the xgboost zero-copy path
and column-name-aware Pipeline/ColumnTransformer learners). The fix is to apply the
contract consistently:
X stays native end-to-end; treatment/y/p/sample_weight normalize to numpy at
entry. Those are 1-D vectors that masking/np.unique/.astype need, and they're unrelated
to the wide-frame promise.
Concretely:
- Collect LazyFrame once at the top of each public method into a pl.DataFrame (not
  numpy). You have to collect to row-mask anyway, so this is the natural single point. After
  that, the helpers only need to handle pl.DataFrame/pl.Series, and S/X/R/DR get LazyFrame
  support for free.
- Never to_numpy(X) just to read a row count: drlearner.bootstrap does
  to_numpy(X).shape[0], and the te = np.zeros((X.shape[0], ...)) allocations should read
  the count natively (X.shape[0]/len(X) work for numpy/pandas/polars).
- Let bootstrap resample on the native X via filter_index (callers currently pre-convert
  to X_np before calling it).
- Tests to lock the contract in: numpy == pandas == polars equivalence for every learner
  (regressor and classifier); a fake learner asserting isinstance(X, pl.DataFrame) inside
  fit/predict so a reintroduced to_numpy(X) fails loudly; a no-feature-name-warning
  assertion on the fit-DataFrame/predict-DataFrame path; and a by-name Pipeline learner.
Note: lightgbm still has the sklearn-API Polars bug (lightgbm-org/LightGBM#6849), so native will
break it — document as a caveat or convert only at the lightgbm boundary.
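If the "convert only at the lightgbm boundary" option is chosen, one possible shape is a thin wrapper around the affected model. This is a sketch only: `NumpyBoundary` and `RecordingModel` are hypothetical names, the wrapper is not a full scikit-learn estimator (no get_params/set_params), and `np.asarray` is assumed to be an adequate conversion for the frames involved.

```python
import numpy as np


class NumpyBoundary:
    """Hypothetical wrapper: converts X to NumPy only for the wrapped model."""

    def __init__(self, model):
        self.model = model

    def fit(self, X, y, **fit_params):
        # Conversion happens here, at the model boundary, not at learner entry.
        self.model.fit(np.asarray(X), y, **fit_params)
        return self

    def predict(self, X):
        return self.model.predict(np.asarray(X))


class RecordingModel:
    """Stand-in for a model that cannot take native frames; records what it received."""

    def fit(self, X, y):
        self.seen_type = type(X)
        return self

    def predict(self, X):
        return np.zeros(len(X))


wrapped = NumpyBoundary(RecordingModel()).fit([[1.0, 2.0], [3.0, 4.0]], [0, 1])
predictions = wrapped.predict([[5.0, 6.0]])
```

The rest of the pipeline keeps X native; only the wrapped model sees an array, so the conversion cost is limited to the learner that needs it.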
Non-blocking polish (smaller correctness + test-coverage items) is in a follow-up note so it
doesn't clutter the merge gate. Happy to re-review quickly once the four above are addressed.
…tion across all meta-learners
Proposed changes
Closes #855
Currently, CausalML only accepts pandas DataFrames and NumPy arrays as inputs. This PR extends support to pl.DataFrame, pl.Series, and pl.LazyFrame across all meta-learners (T, S, X, R, DR) without requiring any changes to the individual learner files.
What changed
- causalml/inference/meta/utils.py: extended convert_pd_to_np() to detect and convert Polars objects to NumPy at the entry boundary. Added _polars_to_numpy() helper that handles LazyFrames (via implicit .collect()) and squeezes single-column DataFrames to 1-D to match pandas Series behaviour. Updated check_p_conditions() to also accept pl.Series.
- causalml/propensity.py: added convert_pd_to_np() calls inside PropensityModel.fit(), PropensityModel.predict(), and compute_propensity_score() to handle Polars inputs before they reach sklearn.
Design decisions
.collect(). Keeps scope simple while still supporting lazy pipelines
Testing
Added tests/test_polars_support.py with coverage across all five learner types (T, S, X, R, DR), verifying that Polars inputs produce identical results to NumPy/pandas inputs, and covering edge cases like mixed inputs and fit-on-numpy/predict-on-polars.
Types of changes
What types of changes does your code introduce to CausalML? Put an x in the boxes that apply.
Checklist
Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your code.
Further comments
If this is a relatively large or complex change, kick off the discussion by explaining why you chose the solution you did and what alternatives you considered, etc. This PR template is adopted from appium.