THIS PROJECT IS IN ALPHA
I'm just playing with some ideas. Maybe it'll become something?
Surfer slang:
extremely good, extremely great
A fresh approach to tabular feature engineering for machine learning pipelines, based on polars expressions.
Construct your (potentially stateful) polars expressions using bodacious functions, then use them in plain polars contexts. You could even export them as plain JSON, load them back in later, and apply them independently of bodacious (they are just plain old polars expressions).
What if I told you about a library that:
- was expressive and powerful in its API
- was parallelised by design, with all the power of Rust to avoid making unnecessary copies
- could be run on GPU if you needed it
- could produce a feature computation DAG for lineage, with query optimisation by default
- could serialise to JSON and read back in losslessly
- had a stable definition of missing values and types
- was extremely memory efficient when using lazy frames
- implemented the cross-language arrow spec, so it isn't just charging off to create a new standard
You'd say "what is this manna from heaven!?". Polars is the future of tabular data analysis in Python. However, it can become awkward when you want to change one table
based on state derived from another table, which is common in machine learning pipelines. Leaking data between the two is an easy mistake to make. bodacious is here to help, providing helper functions
that are fitted on your training data and then applied separately to new data at prediction time, without reference to anything other than a pure polars object: the expression.
import bodacious.imputers as bd
from palmerpenguins import load_penguins
import polars as pl
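# Load the Palmer Penguins dataset into a polars DataFrame
penguins_pl = pl.DataFrame(load_penguins())
# Build one expression per imputed column; the training means are captured inside each expression
mean_imputation_exprs = bd.mean_impute_nulls(train_df=penguins_pl, columns=None)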
print("State from 'train_df' is captured as part of the polars expression:")
print(mean_imputation_exprs)
print("Unmodified frame:")
print(penguins_pl)
print("Null values imputed for only numeric columns:")
print(penguins_pl.with_columns(mean_imputation_exprs))

State from 'train_df' is captured as part of the polars expression:
[<Expr ['col("bill_length_mm").fill_nul…'] at 0x7766D40C20D0>, <Expr ['col("bill_depth_mm").fill_null…'] at 0x7766D40C2110>, <Expr ['col("flipper_length_mm").fill_…'] at 0x7766D40C2150>, <Expr ['col("body_mass_g").fill_null([…'] at 0x7766D40C2190>, <Expr ['col("year").fill_null([Series[…'] at 0x7766D40C21D0>]
Unmodified frame:
shape: (344, 8)
┌───────────┬───────────┬──────────────┬──────────────┬──────────────┬─────────────┬────────┬──────┐
│ species ┆ island ┆ bill_length_ ┆ bill_depth_m ┆ flipper_leng ┆ body_mass_g ┆ sex ┆ year │
│ --- ┆ --- ┆ mm ┆ m ┆ th_mm ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ --- ┆ --- ┆ --- ┆ f64 ┆ str ┆ i64 │
│ ┆ ┆ f64 ┆ f64 ┆ f64 ┆ ┆ ┆ │
╞═══════════╪═══════════╪══════════════╪══════════════╪══════════════╪═════════════╪════════╪══════╡
│ Adelie ┆ Torgersen ┆ 39.1 ┆ 18.7 ┆ 181.0 ┆ 3750.0 ┆ male ┆ 2007 │
│ Adelie ┆ Torgersen ┆ 39.5 ┆ 17.4 ┆ 186.0 ┆ 3800.0 ┆ female ┆ 2007 │
│ Adelie ┆ Torgersen ┆ 40.3 ┆ 18.0 ┆ 195.0 ┆ 3250.0 ┆ female ┆ 2007 │
│ Adelie ┆ Torgersen ┆ null ┆ null ┆ null ┆ null ┆ null ┆ 2007 │
│ Adelie ┆ Torgersen ┆ 36.7 ┆ 19.3 ┆ 193.0 ┆ 3450.0 ┆ female ┆ 2007 │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ Chinstrap ┆ Dream ┆ 55.8 ┆ 19.8 ┆ 207.0 ┆ 4000.0 ┆ male ┆ 2009 │
│ Chinstrap ┆ Dream ┆ 43.5 ┆ 18.1 ┆ 202.0 ┆ 3400.0 ┆ female ┆ 2009 │
│ Chinstrap ┆ Dream ┆ 49.6 ┆ 18.2 ┆ 193.0 ┆ 3775.0 ┆ male ┆ 2009 │
│ Chinstrap ┆ Dream ┆ 50.8 ┆ 19.0 ┆ 210.0 ┆ 4100.0 ┆ male ┆ 2009 │
│ Chinstrap ┆ Dream ┆ 50.2 ┆ 18.7 ┆ 198.0 ┆ 3775.0 ┆ female ┆ 2009 │
└───────────┴───────────┴──────────────┴──────────────┴──────────────┴─────────────┴────────┴──────┘
Null values imputed for only numeric columns:
shape: (344, 8)
┌───────────┬───────────┬──────────────┬─────────────┬─────────────┬─────────────┬────────┬────────┐
│ species ┆ island ┆ bill_length_ ┆ bill_depth_ ┆ flipper_len ┆ body_mass_g ┆ sex ┆ year │
│ --- ┆ --- ┆ mm ┆ mm ┆ gth_mm ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ --- ┆ --- ┆ --- ┆ f64 ┆ str ┆ f64 │
│ ┆ ┆ f64 ┆ f64 ┆ f64 ┆ ┆ ┆ │
╞═══════════╪═══════════╪══════════════╪═════════════╪═════════════╪═════════════╪════════╪════════╡
│ Adelie ┆ Torgersen ┆ 39.1 ┆ 18.7 ┆ 181.0 ┆ 3750.0 ┆ male ┆ 2007.0 │
│ Adelie ┆ Torgersen ┆ 39.5 ┆ 17.4 ┆ 186.0 ┆ 3800.0 ┆ female ┆ 2007.0 │
│ Adelie ┆ Torgersen ┆ 40.3 ┆ 18.0 ┆ 195.0 ┆ 3250.0 ┆ female ┆ 2007.0 │
│ Adelie ┆ Torgersen ┆ 43.92193 ┆ 17.15117 ┆ 200.915205 ┆ 4201.754386 ┆ null ┆ 2007.0 │
│ Adelie ┆ Torgersen ┆ 36.7 ┆ 19.3 ┆ 193.0 ┆ 3450.0 ┆ female ┆ 2007.0 │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ Chinstrap ┆ Dream ┆ 55.8 ┆ 19.8 ┆ 207.0 ┆ 4000.0 ┆ male ┆ 2009.0 │
│ Chinstrap ┆ Dream ┆ 43.5 ┆ 18.1 ┆ 202.0 ┆ 3400.0 ┆ female ┆ 2009.0 │
│ Chinstrap ┆ Dream ┆ 49.6 ┆ 18.2 ┆ 193.0 ┆ 3775.0 ┆ male ┆ 2009.0 │
│ Chinstrap ┆ Dream ┆ 50.8 ┆ 19.0 ┆ 210.0 ┆ 4100.0 ┆ male ┆ 2009.0 │
│ Chinstrap ┆ Dream ┆ 50.2 ┆ 18.7 ┆ 198.0 ┆ 3775.0 ┆ female ┆ 2009.0 │
└───────────┴───────────┴──────────────┴─────────────┴─────────────┴─────────────┴────────┴────────┘
- We Have Expressions at Home: hammers home that expressions define 'how' to perform an operation, separately from the frame in question.