Arnio is a compiled C++ data preparation engine for messy CSV and pandas workflows.
It parses, infers types, strips whitespace, deduplicates, validates, and profiles data, then hands clean results back to the tools you already use.
Use Arnio before and alongside pandas, NumPy, scikit-learn, DuckDB, and Arrow.
```bash
pip install arnio
```

Quickstart · Integrations · Why Arnio · Architecture · Benchmarks · Community · Contribute
Three calls. That's the entire workflow.
```python
import arnio as ar

# Load CSV directly through C++ - no Python parsing overhead
frame = ar.read_csv("messy_sales_data.csv")

# Declare what clean data looks like - arnio handles the rest
clean = ar.pipeline(frame, [
    ("strip_whitespace",),
    ("normalize_case", {"case_type": "lower"}),
    ("fill_nulls", {"value": 0.0, "subset": ["revenue"]}),
    ("drop_nulls",),
    ("drop_duplicates",),
])

# Out comes a standard pandas DataFrame - use it like you always have
df = ar.to_pandas(clean)
```

Already have a pandas DataFrame? Use Arnio in place, inside your existing pandas workflow:
```python
import pandas as pd
import arnio as ar

df = pd.read_csv("messy_sales_data.csv")
clean_df = df.arnio.clean([
    ("strip_whitespace",),
    ("normalize_case", {"case_type": "lower"}),
    ("drop_duplicates",),
])
report = clean_df.arnio.profile()
```

Use `select_columns()` to create a new ArFrame with only the required columns before converting to pandas.
```python
selected = frame.select_columns(["name", "revenue"])
print(selected.columns)
# ['name', 'revenue']
```

Every step above executes in C++. Your Python code is configuration – not the execution engine.
### Peek at a 100 GB file without loading it
`scan_csv` reads only the header plus a sample of rows to infer the schema. No column data is loaded.
```python
schema = ar.scan_csv("100GB_file.csv")
# {'id': 'int64', 'name': 'string', 'is_active': 'bool', 'revenue': 'float64'}
```

Useful for exploring datasets before committing memory to them.
### Preview rows without pandas conversion or full-column Python list materialization
`preview()` reads only the first *n* rows directly from the C++ frame – no pandas conversion is triggered.
```python
frame = ar.read_csv("huge_file.csv")
print(frame.preview())      # first 5 rows (default)
print(frame.preview(n=10))  # first 10 rows
```

`preview()` raises `ValueError` for invalid `n` (zero, negative, or non-integer).
### Add custom steps without touching C++
Register any Python function as a pipeline step. It receives a DataFrame and returns a DataFrame.
```python
def remove_outliers(df, column="revenue", threshold=100_000):
    return df[df[column] <= threshold]

ar.register_step("remove_outliers", remove_outliers)

# Now use it in any pipeline alongside native C++ steps
clean = ar.pipeline(frame, [
    ("strip_whitespace",),
    ("remove_outliers", {"column": "revenue", "threshold": 50000}),
    ("drop_duplicates",),
])
```

Custom steps run through a pandas↔ArFrame conversion bridge. Prototype in Python, then optionally migrate hot paths to C++ for full speed.
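The registry pattern behind `register_step` can be sketched in a few lines of pure Python (illustrative only – Arnio's real registry also dispatches to C++ function pointers, and the "frame" here is just a list of dicts):

```python
# Minimal step registry: names map to callables, pipelines replay (name, kwargs) pairs.
_STEPS = {}

def register_step(name, fn):
    _STEPS[name] = fn

def run_pipeline(data, steps):
    for step in steps:
        name = step[0]
        kwargs = step[1] if len(step) > 1 else {}
        data = _STEPS[name](data, **kwargs)
    return data

# Two toy steps operating on a list of row dicts.
register_step("strip_whitespace", lambda rows: [
    {k: v.strip() if isinstance(v, str) else v for k, v in r.items()} for r in rows
])
register_step("drop_duplicates", lambda rows: list(
    {tuple(sorted(r.items())): r for r in rows}.values()
))

rows = [{"name": " Ada "}, {"name": "Ada"}]
clean = run_pipeline(rows, [("strip_whitespace",), ("drop_duplicates",)])
print(clean)  # [{'name': 'Ada'}]
```

Because steps are looked up by name at run time, Python prototypes and compiled primitives can share one pipeline syntax.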
Arnio is designed to make the rest of the Python data stack more productive, not to replace it.
| Workflow | How Arnio helps |
|---|---|
| pandas | Clean, validate, and profile messy DataFrames through df.arnio. |
| NumPy | Prepare typed numeric data before array/modeling workflows. |
| scikit-learn | Use Arnio cleaning as a preprocessing layer before model training. |
| DuckDB / Arrow | Validate and prepare data before analytics and columnar exchange. |
| notebooks | Inspect quality issues and cleaning suggestions before analysis. |
```python
df = pd.read_csv("raw_customers.csv")
clean_df = df.arnio.clean(drop_duplicates=True)
quality = clean_df.arnio.profile()
validation = clean_df.arnio.validate({
    "email": ar.Email(nullable=False),
    "age": ar.Int64(nullable=True, min=0),
})
```

This keeps pandas as the analysis tool while Arnio handles the preparation, quality, and validation layer.
Product direction: PROJECT_DIRECTION.md
Every data project starts the same way:
```python
df = pd.read_csv("data.csv")         # RAM spike - entire file as raw strings
df.columns = df.columns.str.strip()  # Why is this not automatic?
df["name"] = df["name"].str.strip()  # Python loop over every cell
df["name"] = df["name"].str.lower()  # Another Python loop
df = df.dropna()                     # Another pass
df = df.drop_duplicates()            # Another pass
```

Six lines. Four full-data passes. All in interpreted Python. This is fine for a Jupyter demo – but it doesn't scale, it doesn't compose, and it definitely doesn't belong in production.
Arnio intercepts this entire pattern. It moves the preparation layer into a predictable pipeline, accelerates supported operations in C++, and gives you clean data for pandas, NumPy, scikit-learn, DuckDB, or notebooks.
**Before – pure pandas:**

```python
df = pd.read_csv(path)
df.columns = df.columns.str.strip()
for col in str_cols:
    df[col] = df[col].str.strip()
    df[col] = df[col].str.lower()
df = df.dropna(subset=["revenue"])
df = df.drop_duplicates()
# 6+ lines, multiple passes, pure Python
```

**After – with Arnio:**

```python
frame = ar.read_csv(path)
df = ar.to_pandas(ar.pipeline(frame, [
    ("strip_whitespace",),
    ("normalize_case", {"case_type": "lower"}),
    ("drop_nulls", {"subset": ["revenue"]}),
    ("drop_duplicates",),
]))
# Declarative. Single pipeline. C++ execution.
```
Arnio is not a pandas wrapper. It's a separate runtime with its own data model.
```text
┌──────────────────────────────────────────────────────────────┐
│  Your Python Code                                            │
│    frame = ar.read_csv("data.csv")                           │
│    clean = ar.pipeline(frame, [...])                         │
│    df    = ar.to_pandas(clean)                               │
└─────────────────────────┬────────────────────────────────────┘
                          │  pybind11 boundary
┌─────────────────────────┴────────────────────────────────────┐
│  C++ Runtime (_arnio_cpp)                                    │
│                                                              │
│  ┌─────────────┐  ┌────────────────┐  ┌─────────────────┐    │
│  │ CsvReader   │  │ Frame/Column   │  │ Cleaning Engine │    │
│  │ • RFC 4180  │  │ • Columnar     │  │ • drop_nulls    │    │
│  │ • BOM strip │  │ • std::variant │  │ • fill_nulls    │    │
│  │ • Type      │  │ • Bool null    │  │ • drop_dupes    │    │
│  │   inference │  │   masks        │  │ • strip_ws      │    │
│  │ • Quoted    │  │ • O(1) column  │  │ • normalize     │    │
│  │   fields    │  │   lookup       │  │ • rename/cast   │    │
│  └─────────────┘  └────────────────┘  └─────────────────┘    │
│                                                              │
│  to_pandas() ──► zero-copy NumPy buffer (numerics/bools)     │
└──────────────────────────────────────────────────────────────┘
```
| Decision | What it means |
|---|---|
| Columnar storage | Data lives in typed std::vectors – `vector<int64_t>`, `vector<double>`, `vector<string>` – not rows of variants. Cache-friendly and SIMD-ready. |
| Boolean null masks | Nulls are tracked in a separate vector<bool>, keeping data vectors dense. No sentinel values, no NaN tricks. |
| Two-pass CSV read | Pass 1 infers types across all rows. Pass 2 parses values directly into the correct typed column. No string→object→cast overhead. |
| Zero-copy bridge | to_pandas() exposes C++ memory directly via NumPy's buffer protocol. Numeric and boolean columns cross the boundary without copying. |
| Step registry | Pipeline steps map to C++ function pointers. Adding a new cleaning primitive is a single function + one registry entry. |
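The columnar-plus-null-mask layout can be illustrated in Python (Arnio's real columns are typed `std::vector`s in C++; this sketch only shows why a separate mask keeps the value buffer dense, with `NumericColumn` as a made-up class name):

```python
from array import array

class NumericColumn:
    """Dense typed values plus a parallel boolean null mask - no NaN sentinels."""
    def __init__(self, values, nulls):
        self.values = array("d", values)  # dense float64 buffer, like vector<double>
        self.nulls = list(nulls)          # like vector<bool>

    def fill_nulls(self, fill):
        # Nulls live only in the mask, so filling never scans for sentinel values.
        for i, is_null in enumerate(self.nulls):
            if is_null:
                self.values[i] = fill
                self.nulls[i] = False
        return self

col = NumericColumn([10.0, 0.0, 3.5], nulls=[False, True, False])
col.fill_nulls(0.0)
print(list(col.values))  # [10.0, 0.0, 3.5]
print(col.nulls)         # [False, False, False]
```

Note that the value `0.0` at index 1 was already physically present before `fill_nulls`; only the mask decided whether it counted as null.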
Full architecture documentation: ARCHITECTURE.md
Reference environment: Ubuntu, Python 3.12, synthetic messy CSV inputs.
Reproduce: `make benchmark` – generates deterministic tall and wide datasets and runs both engines.
To reproduce the published numbers from a fresh checkout:

```bash
python -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
python -m pip install -e .
python benchmarks/generate_data.py
python benchmarks/benchmark_vs_pandas.py
```

`benchmarks/generate_data.py` uses deterministic NumPy seeds, so every run creates the same `benchmarks/benchmark_1m.csv` tall input and `benchmarks/benchmark_wide.csv` wide input. The benchmark then executes three pandas runs and three arnio runs for each case, printing the average wall-clock time from `time.perf_counter()` and the peak Python allocation from `tracemalloc`. For cleaner comparisons, close other memory-heavy processes and run the script from the repository root after installing the same Python, pandas, NumPy, compiler, and arnio commit you want to compare.
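The measurement loop described above – averaged `time.perf_counter()` wall time plus `tracemalloc` peak – can be sketched as a small harness (illustrative, not the repository's actual script; note `tracemalloc` only sees Python-side allocations):

```python
import time
import tracemalloc

def measure(fn, runs=3):
    """Return (average seconds over `runs`, peak Python-allocated bytes)."""
    times = []
    tracemalloc.start()
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return sum(times) / len(times), peak

avg_s, peak_bytes = measure(lambda: sum(i * i for i in range(100_000)))
print(f"avg {avg_s:.4f}s, peak {peak_bytes / 1e6:.2f}MB")
```

Averaging several runs smooths out warm-up and scheduler noise, which matters when the two engines are within a few percent of each other.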
Expected output format:

```text
Tall CSV (1,000,000 rows x 12 columns)
Metric             pandas    arnio
────────────────────────────────────────────
Exec Time (avg)    4.73s     5.75s
Peak RAM           211MB     212MB
Speed: 0.8x | RAM: -1% reduction

Wide CSV (5,000 rows x 256 columns)
Metric             pandas    arnio
────────────────────────────────────────────
Exec Time (avg)    ...s      ...s
Peak RAM           ...MB     ...MB
Speed: ...x | RAM: ...% reduction
```
Small differences are expected across CPUs, operating systems, compilers, Python builds, and pandas/NumPy versions. If you share benchmark results in an issue or PR, include your OS, Python version, CPU model, pandas/NumPy versions, arnio commit, and the full command output so maintainers can compare like for like.
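A small helper like the following (hypothetical, not part of Arnio) can collect most of that environment information for an issue report:

```python
import platform
import sys

def env_report():
    """Collect the environment details maintainers need to compare benchmark runs."""
    info = {
        "os": platform.platform(),
        "python": sys.version.split()[0],
        "cpu": platform.processor() or platform.machine(),
    }
    # Library versions are best-effort: skip anything not installed.
    for lib in ("pandas", "numpy", "arnio"):
        try:
            module = __import__(lib)
            info[lib] = getattr(module, "__version__", "unknown")
        except ImportError:
            info[lib] = "not installed"
    return info

for key, value in env_report().items():
    print(f"{key}: {value}")
```

Paste the output alongside the full benchmark command output so maintainers can compare like for like.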
Arnio is near memory parity in the reference benchmark while replacing ad-hoc Python string loops with a compiled, declarative pipeline. Validate memory and speed on your own workload. The execution-time gap is a known, active optimization target: the current `drop_duplicates` and `strip_whitespace` implementations use unoptimized row-key serialization.
| ✅ What's already won | 🎯 What's being optimized |
|---|---|
| Near memory parity with pandas | `drop_duplicates` row-key serialization |
| Declarative, compiled cleaning pipeline | `strip_whitespace` copy behavior |
Most operations below run natively in C++. The current filter_rows step uses the Python pipeline backend and may be optimized in C++ later.
| Primitive | What it does | Example |
|---|---|---|
| `drop_nulls` | Remove rows with null/empty values | `ar.drop_nulls(frame, subset=["age"])` |
| `validate_columns_exist` | Fail early when required columns are missing | `ar.validate_columns_exist(frame, ["age"])` |
| `filter_rows` | Filter rows using comparison operators | `ar.filter_rows(frame, column="age", op=">", value=18)` |
| `fill_nulls` | Replace nulls with a scalar | `ar.fill_nulls(frame, 0, subset=["revenue"])` |
| `drop_duplicates` | Deduplicate rows (first/last/none) | `ar.drop_duplicates(frame, keep="first")` |
| `drop_constant_columns` | Remove columns with only one unique value | `ar.drop_constant_columns(frame)` |
| `clip_numeric` | Clip numeric values to lower and/or upper bounds | `ar.clip_numeric(frame, lower=0, upper=100)` |
| `strip_whitespace` | Trim leading/trailing spaces from strings | `ar.strip_whitespace(frame)` |
| `normalize_case` | Force lower/upper/title case | `ar.normalize_case(frame, case_type="title")` |
| `rename_columns` | Rename columns via mapping | `ar.rename_columns(frame, {"old": "new"})` |
| `cast_types` | Cast column types | `ar.cast_types(frame, {"age": "int64"})` |
| `round_numeric_columns` | Round numeric columns (non-numeric columns in the subset are safely ignored) | `ar.round_numeric_columns(frame, decimals=2)` |
| `clean` | Convenience shorthand | `ar.clean(frame, drop_nulls=True)` |
| `safe_divide_columns` | Divide one column by another, handling zero/null denominators | `ar.safe_divide_columns(frame, numerator="revenue", denominator="cost", output_column="ratio")` |
Or compose them all into a pipeline:
```python
clean = ar.pipeline(frame, [
    ("validate_columns_exist", {"columns": ["name", "city", "revenue"]}),
    ("strip_whitespace",),
    ("normalize_case", {"case_type": "lower"}),
    ("fill_nulls", {"value": "unknown", "subset": ["city"]}),
    ("drop_duplicates", {"keep": "first"}),
])
```

Use `filter_rows` to keep only rows matching a condition.
```python
clean = ar.pipeline(frame, [
    ("filter_rows", {
        "column": "revenue",
        "op": ">=",
        "value": 1000
    }),
])
```

Supported operators: `>`, `<`, `>=`, `<=`, `==`, `!=`
Works with:
- integers
- floats
- strings
- booleans
### Safe column division
Divide one column by another while handling division by zero and null denominators explicitly:
```python
result = ar.safe_divide_columns(
    frame,
    numerator="revenue",
    denominator="cost",
    output_column="ratio",
    fill_value=0.0,  # used when the denominator is zero or null
)
```

When the denominator is zero or null, the result is replaced with `fill_value` (default `0.0`) instead of raising an error or producing `NaN`/`Inf`.
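The same semantics can be expressed in NumPy, which is a useful cross-check when validating output on your own data (a sketch under the assumption that nulls arrive as NaN; `safe_divide` is an illustrative name, not Arnio's API):

```python
import numpy as np

def safe_divide(numerator, denominator, fill_value=0.0):
    num = np.asarray(numerator, dtype=float)
    den = np.asarray(denominator, dtype=float)
    # Zero or NaN denominators take fill_value instead of raising or producing inf/NaN.
    bad = (den == 0) | np.isnan(den)
    out = np.full_like(num, fill_value)
    np.divide(num, den, out=out, where=~bad)
    return out

print(safe_divide([10.0, 5.0, 2.0], [2.0, 0.0, np.nan]))  # [5. 0. 0.]
```

The `where=` mask keeps the division from ever being evaluated at the bad positions, so no warning is emitted either.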
The table below shows which pandas dtypes and workflows are fully supported, partially supported, unsupported, or planned.
Partially supported dtypes may need conversion before processing. Unsupported dtypes should raise clear errors where applicable.
| Pandas Dtype | Support Status | Notes |
|---|---|---|
| `int64` | ✅ Supported | Fully supported with native C++ columnar storage |
| `float64` | ✅ Supported | Fully supported with zero-copy conversion where possible |
| `bool` | ✅ Supported | Natively supported boolean type |
| `string` | ✅ Supported | Recommended over `object` dtype for text workflows |
| `datetime64[ns]` | ❌ Unsupported | No native datetime parsing or conversion support yet |
| `category` | ⚠️ Partial | Converted to string/object during processing |
| `object` (mixed columns) | ⚠️ Partial | Mixed `object` columns may coerce to string and reduce type inference reliability |
| Nullable pandas dtypes (`Int64`, `boolean`) | ✅ Supported | Supported through pandas extension dtypes with null-mask handling |
| `timedelta64[ns]` | ❌ Unsupported | Not currently supported |
- Numeric and boolean columns are optimized for zero-copy conversion between C++ and pandas.
- String columns require Python string object creation during `to_pandas()` conversion.
- Mixed `object` columns may reduce type inference accuracy and may require preprocessing.
- Unsupported dtypes should raise clear user-facing errors instead of failing silently.
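Until datetime and timedelta support lands, a small pandas-side shim can downcast unsupported dtypes before conversion (a sketch; `prepare_for_arnio` is a hypothetical helper, not part of Arnio's API):

```python
import pandas as pd

def prepare_for_arnio(df):
    """Convert dtypes Arnio does not handle yet into supported ones."""
    out = df.copy()
    for col in out.columns:
        dtype = out[col].dtype
        if pd.api.types.is_datetime64_any_dtype(dtype) or pd.api.types.is_timedelta64_dtype(dtype):
            # Timestamps become text; good enough for cleaning, cast back afterwards.
            out[col] = out[col].astype("string")
        elif isinstance(dtype, pd.CategoricalDtype):
            out[col] = out[col].astype("string")
    return out

df = pd.DataFrame({
    "ts": pd.to_datetime(["2024-01-01", "2024-01-02"]),
    "tier": pd.Categorical(["gold", "silver"]),
    "revenue": [10.0, 20.0],
})
print(prepare_for_arnio(df).dtypes)
```

Supported dtypes such as `float64` pass through untouched, so the shim is safe to apply unconditionally.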
Arnio now includes built-in dataset understanding before you analyze in pandas.
```python
report = ar.profile(frame)
print(report.summary())

suggestions = ar.suggest_cleaning(frame)
clean = ar.pipeline(frame, suggestions)
```

For production data contracts:
```python
schema = ar.Schema({
    "id": ar.Int64(nullable=False, unique=True),
    "email": ar.Email(nullable=False),
    "revenue": ar.Float64(nullable=True, min=0),
})

result = ar.validate(frame, schema)
if not result.passed:
    print(result.to_pandas())
    print(result.to_markdown(max_issues=10))
```

`ValidationResult.to_markdown()` is useful in CI logs, GitHub comments, and data quality reports: it renders a compact validation summary plus a GitHub-friendly issue table.

For low-risk automatic cleanup:

```python
clean, report = ar.auto_clean(frame, mode="strict", return_report=True)
```

This is the layer pandas does not try to own: profiling, data contracts, row-level validation issues, and safe cleaning suggestions for messy incoming datasets.
Use this workflow when you receive a small messy dataset and want to inspect what Arnio will change before applying it.
```python
import arnio as ar
import pandas as pd

raw = pd.DataFrame(
    {
        "order_id": [1001, 1002, 1002, 1003, 1004],
        "customer": [" Ishan ", " Prasoon ", " Prasoon ", " Pranay ", " Dhruv "],
        "city": [" Paris ", "London", "London", " New York ", " Tokyo "],
    }
)

frame = ar.from_pandas(raw)

report = ar.profile(frame)
summary = report.summary()
print(summary)

suggestions = ar.suggest_cleaning(frame)
print(suggestions)
# [('strip_whitespace', {'subset': ['customer', 'city']}), ('drop_duplicates', {'keep': 'first'})]

safe = ar.auto_clean(frame)
strict = ar.auto_clean(frame, mode="strict")
```

Messy input:
| order_id | customer | city |
|---|---|---|
| 1001 | ` Ishan ` | ` Paris ` |
| 1002 | ` Prasoon ` | London |
| 1002 | ` Prasoon ` | London |
| 1003 | ` Pranay ` | ` New York ` |
| 1004 | ` Dhruv ` | ` Tokyo ` |
Expected cleaned output with mode="strict":
| order_id | customer | city |
|---|---|---|
| 1001 | Ishan | Paris |
| 1002 | Prasoon | London |
| 1003 | Pranay | New York |
| 1004 | Dhruv | Tokyo |
`mode="safe"` only trims whitespace. Use `mode="strict"` when you also want deterministic built-in cleanup such as exact duplicate removal.
See examples/auto_clean_tutorial.py for a runnable version of this walkthrough.
Arnio provides detailed profiling for datasets via `ar.profile()`. The report shown in these examples was generated with:

```python
import arnio as ar
import pandas as pd

# Sample dataset used for these examples
data = {
    "user_id": [101, 102, 103, 104],
    "email": ["test@arnio.ai", "invalid-email", None, "test@arnio.ai"],
    "score": [85.5, 90.0, None, 88.2],
}
df = ar.from_pandas(pd.DataFrame(data))
report = ar.profile(df)
```

A simplified view of the report object's standard string representation:
```text
DataQualityReport(
    row_count=4,
    column_count=3,
    memory_usage=733,
    duplicate_rows=0,
    columns={
        'user_id': ColumnProfile(dtype='int64', semantic_type='identifier', unique_count=4),
        'email': ColumnProfile(dtype='string', semantic_type='categorical', null_count=1, unique_ratio=0.666667),
        'score': ColumnProfile(dtype='float64', semantic_type='numeric', mean=87.9, min=85.5, max=90.0)
    }
)
```
Key fields from the structured JSON export, for integration with APIs or dashboards:

```json
{
  "row_count": 4,
  "column_count": 3,
  "memory_usage": 733,
  "duplicate_rows": 0,
  "duplicate_ratio": 0.0,
  "columns": {
    "user_id": {
      "dtype": "int64",
      "semantic_type": "identifier",
      "null_count": 0,
      "unique_ratio": 1.0
    },
    "email": {
      "dtype": "string",
      "semantic_type": "categorical",
      "null_count": 1,
      "unique_ratio": 0.666667,
      "warnings": ["contains_nulls"]
    },
    "score": {
      "dtype": "float64",
      "semantic_type": "numeric",
      "null_count": 1,
      "mean": 87.9,
      "min": 85.5,
      "max": 90.0,
      "warnings": ["contains_nulls"]
    }
  }
}
```

A manually formatted Markdown table of the core metrics:
| Metric | Value |
|---|---|
| Row Count | 4 |
| Column Count | 3 |
| Memory Usage | 733 bytes |
| Duplicates | 0 (0.0%) |
| Version | Focus | Status |
|---|---|---|
| v1.0 | Stable release · cross-platform wheels · CI/CD · PyPI publishing · Google Colab support | ✅ Shipped |
| v1.1 | Production readiness · release hardening · docs/tooling | ✅ Shipped |
| v1.2 | C++ pipeline optimization · speed parity with pandas · hash-based deduplication | 🚧 Active |
| v1.3 | Chunked / streaming processing · Parquet & JSON readers | 📋 Planned |
| v1.4 | Parallel column processing · SIMD string operations | 🔭 Exploring |
Join the Arnio Discord Community for quick setup help, contributor onboarding, GSSoC 2026 coordination, feature discussion, and community updates.
Discord is for fast conversation and support. GitHub remains the source of truth for issue assignment, PR reviews, bugs, roadmap decisions, and releases.
Arnio is a GSSoC 2026 project with a structured contributor backlog across beginner, intermediate, and advanced tracks.
Most new features are pure Python pipeline steps:
```python
# 1. Write a function that takes a DataFrame and returns a DataFrame
def remove_special_chars(df, columns=None):
    cols = columns or df.select_dtypes("object").columns
    for col in cols:
        df[col] = df[col].str.replace(r"[^a-zA-Z0-9\s]", "", regex=True)
    return df

# 2. Register it
ar.register_step("remove_special_chars", remove_special_chars)

# 3. Write tests, open a PR. That's it.
```

The biggest performance wins are in:

- `drop_duplicates` – replacing `std::ostringstream` row serialization with proper hash-based comparisons
- `strip_whitespace` – converting from copy-on-write to in-place mutation
- Parallel column processing – `std::thread` across independent columns
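The dedup change can be illustrated in Python: instead of building a string key per row (the spirit of the current `std::ostringstream` approach), hash the row's values directly. This is a sketch of the idea, not the C++ code:

```python
def dedup_by_string_key(rows):
    """Current approach: serialize each row to a string key - allocates per row."""
    seen, out = set(), []
    for row in rows:
        key = "|".join(map(repr, row))  # analogous to ostringstream serialization
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

def dedup_by_hash(rows):
    """Target approach: hash the typed values directly, no serialization pass."""
    seen, out = set(), []
    for row in rows:
        key = tuple(row)  # hashable composite key, no string building
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

rows = [(1, "Ada"), (2, "Bob"), (1, "Ada")]
print(dedup_by_hash(rows))  # [(1, 'Ada'), (2, 'Bob')]
```

Both functions keep the first occurrence; the hash-based variant simply skips the per-row string formatting, which is where the C++ version spends its time today.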
```bash
# macOS / Linux
git clone https://github.com/im-anishraj/arnio.git && cd arnio
make install   # pip install -e ".[dev]" + pre-commit
make test      # pytest with coverage
make lint      # ruff + black

# Windows
pip install -e ".[dev]"
pre-commit install
pytest tests/ -v
```

PR titles must follow Conventional Commits (`feat:`, `fix:`, `docs:`, `chore:`). Our release pipeline auto-generates changelogs from these titles.
For GSSoC contributors, please read GSSOC_GUIDE.md before asking to be assigned. It explains issue claiming, contribution levels, review expectations, and what maintainers look for in a strong PR. If you want a quick onboarding refresher, see the GSSoC FAQ. If you are new to Arnio terms, see the contributor glossary.
Full Contributing Guide · GSSoC Guide · Open Issues · Discussions · Discord
Arnio releases are automated through Release Please and GitHub Actions.
1. Merge user-facing changes with Conventional Commit PR titles (`feat:`, `fix:`, `docs:`, or `chore:`) so Release Please can choose the version bump and changelog entries.
2. Review and merge the Release Please PR on `main`; this updates release metadata and creates the GitHub release and tag.
3. Confirm the `Build & Publish Wheels` workflow succeeds for the release tag. It builds the sdist and wheels, then publishes to PyPI through Trusted Publishing.
4. Smoke test the published package in a clean environment:

   ```bash
   python -m venv /tmp/arnio-smoke
   source /tmp/arnio-smoke/bin/activate
   python -m pip install -U pip
   python -m pip install arnio
   printf 'name,revenue\n Ada,10\n' > /tmp/arnio-smoke.csv
   python - <<'PY'
   import arnio as ar
   print(ar.__version__)
   print(ar.scan_csv("/tmp/arnio-smoke.csv"))
   PY
   ```

5. Verify the GitHub release, PyPI project page, and install command all show the expected version before announcing the release.
If any publish or smoke-test step fails, leave the failed tag and GitHub release in place until maintainers agree on the recovery plan.
```text
arnio/
├── cpp/
│   ├── include/arnio/   # C++ headers - types, column, frame, csv_reader, cleaning
│   └── src/             # C++ implementations (~30 KB of compiled logic)
├── bindings/
│   └── bind_arnio.cpp   # pybind11 module - the Python↔C++ bridge
├── arnio/
│   ├── __init__.py      # Public API surface
│   ├── io.py            # read_csv, scan_csv
│   ├── cleaning.py      # Python wrappers for C++ cleaning functions
│   ├── pipeline.py      # Step registry + pipeline executor
│   ├── convert.py       # to_pandas (zero-copy), from_pandas
│   ├── frame.py         # ArFrame - lightweight C++ Frame wrapper
│   └── exceptions.py    # ArnioError, UnknownStepError, CsvReadError, TypeCastError
├── tests/               # pytest suite - CSV, cleaning, pipeline, conversions
├── benchmarks/          # Reproducible arnio vs pandas benchmark
├── examples/            # basic_usage.py, auto_clean_tutorial.py, custom_step.py
└── website/             # Project website - arnio.vercel.app
```
Stop writing cleaning scripts. Declare clean data.
Built with C++ and pybind11 · Licensed under MIT · Maintained by @im-anishraj