A high-performance DataFrame library for Rust, providing a pandas-like API with advanced features including SIMD optimization, parallel processing, and distributed computing capabilities.
Version 0.4.0 - June 2026: ML/stats correctness release — real PCA (Jacobi eigendecomposition), DBSCAN, AgglomerativeClustering, LogisticRegression (IRLS), IsolationForest, LOF, OneClassSVM, RocAuc, chi²/MI scores, RobustScaler, QuantileTransformer, PowerTransformer, real GridSearch/RandomizedSearch CV, learning/validation curves, real χ²-distributed p-values for Ljung-Box/Friedman/KW/Box-Pierce. 1893 tests passing across feature sets.
- Comprehensive Testing: 1893 tests passing (nextest) + 118 doc tests with extensive coverage
- Active Development: Ongoing improvements to error handling and code quality (629 Rust files across src/, tests/, examples/, and benches/; 248,775 lines of code)
- Production-Ready Error Handling: Established error handling patterns with descriptive messages
PandRS is a comprehensive data manipulation library that brings the power and familiarity of pandas to the Rust ecosystem. Built with performance, safety, and ease of use in mind, it provides:
- Type-safe operations leveraging Rust's ownership system
- High-performance computing through SIMD vectorization and parallel processing
- Memory-efficient design with columnar storage and string pooling
- Comprehensive functionality matching pandas' core features
- Seamless interoperability with Python, Arrow, and various data formats
use pandrs::{DataFrame, Series};
use pandrs::dataframe::{AggFunc, GroupByExt, NamedAgg};
// Create a DataFrame.
let mut df = DataFrame::new();
df.add_column(
"name".to_string(),
Series::new(
vec!["Alice".to_string(), "Bob".to_string(), "Carol".to_string()],
Some("name".to_string()),
)?,
)?;
df.add_column(
"age".to_string(),
Series::new(vec![30i64, 25, 35], Some("age".to_string()))?,
)?;
df.add_column(
"department".to_string(),
Series::new(
vec![
"Engineering".to_string(),
"Engineering".to_string(),
"Sales".to_string(),
],
Some("department".to_string()),
)?,
)?;
df.add_column(
"salary".to_string(),
Series::new(vec![75_000i64, 65_000, 85_000], Some("salary".to_string()))?,
)?;
// Column-level numeric summary.
let mean_salary = df.mean("salary")?;
// GroupBy + named aggregations. Use the explicit trait path because
// `DataFrame` also exposes an inherent `groupby(&str)` from the pivot
// module which shadows the extension method.
let grouped = GroupByExt::groupby(&df, &["department"])?.agg(vec![
NamedAgg::new("salary".to_string(), AggFunc::Mean, "salary_mean".to_string()),
NamedAgg::new("salary".to_string(), AggFunc::Sum, "salary_sum".to_string()),
NamedAgg::new("age".to_string(), AggFunc::Max, "age_max".to_string()),
])?;
- Series: One-dimensional labeled array capable of holding any data type
- DataFrame: Two-dimensional, size-mutable, heterogeneous tabular data structure
- MultiIndex: Hierarchical indexing for advanced data organization
- Categorical: Memory-efficient representation for string data with limited cardinality
- Numeric: i32, i64, f32, f64, u32, u64
- String: UTF-8 encoded with automatic string pooling
- Boolean: Native boolean support
- DateTime: Timezone-aware datetime with nanosecond precision
- Categorical: Efficient storage for repeated string values
- Missing Values: First-class NA support across all types
- Column addition, removal, and renaming
- Row and column selection with boolean indexing
- Sorting by single or multiple columns
- Duplicate detection and removal
- Data type conversion and casting
- GroupBy operations with multiple aggregation functions
- Window functions (rolling, expanding, exponentially weighted)
- Pivot tables and cross-tabulation
- Custom aggregation functions
- Inner, left, right, and outer joins
- Merge on single or multiple keys
- Concat operations with axis control
- Append with automatic index alignment
- DateTime indexing and slicing
- Resampling and frequency conversion
- Time zone handling and conversion
- Date range generation
- Business day calculations
- Automatic SIMD optimization for numerical operations
- Hand-tuned implementations for common operations
- Support for AVX2 and AVX-512 instruction sets
- Multi-threaded execution for large datasets
- Configurable thread pool sizing
- Parallel aggregations and transformations
- Load-balanced work distribution
- Columnar storage format
- String interning with global string pool
- Copy-on-write semantics
- Memory-mapped file support
- Lazy evaluation for chain operations
- CSV: Fast parallel CSV reader/writer
- Parquet: Apache Parquet with compression support
- JSON: Both records and columnar JSON formats
- Excel: XLSX/XLS read/write with multi-sheet support
- Arrow: Zero-copy Arrow integration
- AWS S3
- Google Cloud Storage
- Azure Blob Storage
- HTTP/HTTPS endpoints
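The string interning mentioned in the memory-efficiency items above can be illustrated with a small, standalone sketch. It uses only the standard library, and `StringPool`, `intern`, `resolve`, and `encode_column` are hypothetical names chosen for illustration; this is not PandRS's actual pool, only one plausible way such a pool can work: each distinct string is stored once and columns hold compact integer codes.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// A toy string pool (hypothetical, for illustration only).
/// Each distinct string is stored once and referred to by a small integer id.
#[derive(Default)]
pub struct StringPool {
    /// Maps string content to its id.
    ids: HashMap<Arc<str>, u32>,
    /// Maps an id back to the string content.
    strings: Vec<Arc<str>>,
}

impl StringPool {
    pub fn new() -> Self {
        Self::default()
    }

    /// Returns the id for `s`, inserting the string the first time it is seen.
    pub fn intern(&mut self, s: &str) -> u32 {
        if let Some(&id) = self.ids.get(s) {
            return id;
        }
        let id = self.strings.len() as u32;
        let shared: Arc<str> = Arc::from(s);
        self.strings.push(Arc::clone(&shared));
        self.ids.insert(shared, id);
        id
    }

    /// Looks up the string behind an id, if the id is known.
    pub fn resolve(&self, id: u32) -> Option<&str> {
        self.strings.get(id as usize).map(|s| &**s)
    }

    /// Number of distinct strings stored in the pool.
    pub fn unique_count(&self) -> usize {
        self.strings.len()
    }
}

/// Encodes a column of strings as pool ids, so repeated values share storage.
pub fn encode_column(pool: &mut StringPool, values: &[&str]) -> Vec<u32> {
    values.iter().map(|v| pool.intern(v)).collect()
}
```

With this sketch, encoding the column `["Engineering", "Engineering", "Sales"]` yields the codes `[0, 0, 1]` while the pool holds only two strings. The same idea underlies categorical storage for low-cardinality string data.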
Enterprise-grade security features for data protection and access control:
- JWT (JSON Web Tokens): Stateless authentication with token validation
- OAuth 2.0: Industry-standard authorization framework
- API Key Management: Secure API key generation and validation
- Session Management: User session tracking and lifecycle management
- Role-Based Access Control (RBAC): Fine-grained permission management
- Multi-tenancy Support: Isolated data access per tenant
- Resource-level Permissions: Control access to specific datasets and operations
- Audit Logging: Comprehensive tracking of data access and modifications
- Security Events: Real-time monitoring of authentication and authorization events
- Compliance Support: Features designed to meet security compliance requirements
See examples/security_jwt_oauth_example.rs and examples/security_rbac_example.rs for implementation details.
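To make the role-based access control idea above concrete, here is a minimal standalone sketch using only the standard library. The `Rbac` type, the `Permission` enum, and the method names are assumptions made for illustration and do not reflect PandRS's security API; the shipped examples listed above show the real surface.

```rust
use std::collections::{HashMap, HashSet};

/// Permissions a role may carry (hypothetical set, for illustration).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Permission {
    Read,
    Write,
    Admin,
}

/// A toy RBAC store: users map to roles, roles map to permissions.
#[derive(Default)]
pub struct Rbac {
    role_permissions: HashMap<String, HashSet<Permission>>,
    user_roles: HashMap<String, HashSet<String>>,
}

impl Rbac {
    pub fn new() -> Self {
        Self::default()
    }

    /// Grants `permission` to every holder of `role`.
    pub fn grant(&mut self, role: &str, permission: Permission) {
        self.role_permissions
            .entry(role.to_string())
            .or_default()
            .insert(permission);
    }

    /// Assigns `role` to `user`.
    pub fn assign(&mut self, user: &str, role: &str) {
        self.user_roles
            .entry(user.to_string())
            .or_default()
            .insert(role.to_string());
    }

    /// A user is allowed if any of their roles carries the permission.
    /// Unknown users and unknown roles are denied by default.
    pub fn is_allowed(&self, user: &str, permission: Permission) -> bool {
        self.user_roles.get(user).map_or(false, |roles| {
            roles.iter().any(|role| {
                self.role_permissions
                    .get(role)
                    .map_or(false, |perms| perms.contains(&permission))
            })
        })
    }
}
```

The deny-by-default check in `is_allowed` is the important design choice here: access is granted only through an explicit user-to-role and role-to-permission chain.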
Built-in analytics engine for monitoring and performance tracking:
- Counters: Track cumulative values and event counts
- Gauges: Monitor current values and resource levels
- Histograms: Measure distribution of values over time
- Timers: Track operation durations and performance
- DataFrame Operations: Monitor query execution and data transformations
- Resource Monitoring: Track memory usage, CPU utilization, and I/O operations
- Performance Profiling: Identify bottlenecks and optimization opportunities
- Threshold-based Alerts: Trigger notifications when metrics exceed limits
- Custom Alert Rules: Define complex alerting conditions
- Alert History: Track and analyze past alerts
- Real-time Dashboards: Monitor system health and performance metrics
- Metric Aggregation: Combine and analyze metrics across dimensions
- Export Capabilities: Export metrics to external monitoring systems
See examples/analytics_dashboard_example.rs for comprehensive usage examples.
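The four metric kinds and the threshold alerts described above can be sketched in a few lines of standard-library Rust. All type and function names below (`Counter`, `Gauge`, `Histogram`, `time_into`, `exceeds_threshold`) are hypothetical and chosen for illustration; they are not taken from PandRS's analytics module.

```rust
use std::time::Instant;

/// Counter: a cumulative value that only goes up.
#[derive(Debug, Default)]
pub struct Counter {
    value: u64,
}

impl Counter {
    pub fn inc_by(&mut self, n: u64) {
        self.value += n;
    }
    pub fn get(&self) -> u64 {
        self.value
    }
}

/// Gauge: the current level of something, free to move up or down.
#[derive(Debug, Default)]
pub struct Gauge {
    value: f64,
}

impl Gauge {
    pub fn set(&mut self, v: f64) {
        self.value = v;
    }
    pub fn get(&self) -> f64 {
        self.value
    }
}

/// Histogram: counts observations per bucket.
/// `bounds` are ascending inclusive upper bounds; one extra bucket
/// at the end catches everything above the last bound.
#[derive(Debug)]
pub struct Histogram {
    bounds: Vec<f64>,
    counts: Vec<u64>,
}

impl Histogram {
    pub fn new(bounds: Vec<f64>) -> Self {
        let counts = vec![0; bounds.len() + 1];
        Self { bounds, counts }
    }

    pub fn observe(&mut self, v: f64) {
        let idx = self
            .bounds
            .iter()
            .position(|&b| v <= b)
            .unwrap_or(self.bounds.len());
        self.counts[idx] += 1;
    }

    pub fn counts(&self) -> &[u64] {
        &self.counts
    }
}

/// Timer: runs `f`, records its duration in seconds into `hist`,
/// and passes the closure's result through.
pub fn time_into<T>(hist: &mut Histogram, f: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let out = f();
    hist.observe(start.elapsed().as_secs_f64());
    out
}

/// Threshold-based alert: fires when the gauge is strictly above `limit`.
pub fn exceeds_threshold(gauge: &Gauge, limit: f64) -> bool {
    gauge.get() > limit
}
```

A timer in this sketch is just a histogram fed with elapsed durations, which is a common way to structure it; a production system would also need thread-safe updates, which this example deliberately leaves out.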
Advanced machine learning capabilities integrated with DataFrame operations:
- Decision Trees: Classification and regression with interpretable models
- Random Forests: Ensemble methods for improved accuracy
- Gradient Boosting: High-performance boosting algorithms
- Neural Networks: Deep learning with configurable architectures
- ARIMA Models: AutoRegressive Integrated Moving Average
- Exponential Smoothing: Trend and seasonality modeling
- Prophet Integration: Facebook's forecasting library support
- Feature Engineering: Automatic lag features and date components
- Feature Preprocessing: Scaling, normalization, and encoding
- Model Training: Unified API for training various algorithms
- Cross-validation: K-fold and time series cross-validation
- Hyperparameter Tuning: Grid search and random search optimization
See examples/ml_neural_network_example.rs, examples/ml_decision_tree_example.rs,
examples/ml_random_forest_example.rs, examples/ml_gradient_boosting_example.rs,
and examples/time_series_forecasting_example.rs for detailed examples.
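As a rough illustration of the K-fold cross-validation mentioned above, the following standalone sketch splits row indices into train and test sets. The function name `k_fold_indices` is hypothetical and the code does not use or mirror the PandRS API; it assumes contiguous, unshuffled folds, whereas a real implementation would typically offer shuffling and stratification as well.

```rust
/// Splits `n_samples` row indices into `k` contiguous folds and returns
/// one `(train, test)` pair of index vectors per fold.
/// When `n_samples` is not divisible by `k`, the first `n_samples % k`
/// folds receive one extra row.
pub fn k_fold_indices(n_samples: usize, k: usize) -> Vec<(Vec<usize>, Vec<usize>)> {
    assert!(k >= 2 && k <= n_samples, "need 2 <= k <= n_samples");

    let base = n_samples / k;
    let extra = n_samples % k;

    let mut folds = Vec::with_capacity(k);
    let mut start = 0;
    for fold in 0..k {
        let size = base + usize::from(fold < extra);
        let end = start + size;

        // The current fold is held out for testing...
        let test: Vec<usize> = (start..end).collect();
        // ...and every other row is used for training.
        let train: Vec<usize> = (0..start).chain(end..n_samples).collect();

        folds.push((train, test));
        start = end;
    }
    folds
}
```

For 5 samples and 2 folds this produces test sets `[0, 1, 2]` and `[3, 4]`, each paired with the remaining indices as the training set. Every row appears in exactly one test set, which is the property that makes the averaged score a fair estimate.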
Add to your Cargo.toml:
[dependencies]
pandrs = "0.4.0"
Enable additional functionality with feature flags:
[dependencies]
pandrs = { version = "0.4.0", features = ["optimized"] }
Available features:
- Core features:
  - optimized: Performance optimizations and SIMD
  - backward_compat: Backward compatibility support
- Data formats:
  - parquet: Parquet file support
  - excel: Excel file support
- Advanced features:
  - distributed: Distributed computing with DataFusion
  - visualization: Plotting capabilities
  - streaming: Real-time data processing
  - serving: Model serving and deployment
  - scirs2: SciRS2 scientific computing integration
- Experimental:
  - cuda: GPU acceleration (requires CUDA toolkit)
  - wasm: WebAssembly compilation support
  - jit: Just-in-time compilation
Performance comparison with pandas (Python) and Polars (Rust):
| Operation | PandRS | Pandas | Polars | Speedup vs Pandas |
|---|---|---|---|---|
| CSV Read (1M rows) | 0.18s | 0.92s | 0.15s | 5.1x |
| GroupBy Sum | 0.09s | 0.31s | 0.08s | 3.4x |
| Join Operations | 0.21s | 0.87s | 0.19s | 4.1x |
| String Operations | 0.14s | 1.23s | 0.16s | 8.8x |
| Rolling Window | 0.11s | 0.43s | 0.12s | 3.9x |
Benchmarks performed on AMD Ryzen 9 5950X, 64GB RAM, NVMe SSD
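The "Speedup vs Pandas" column appears to be the pandas time divided by the PandRS time, rounded to one decimal place. The short sketch below reproduces that arithmetic for the rows of the table; the helper names `speedup` and `format_speedup` are hypothetical and exist only for this illustration.

```rust
/// Speedup of a candidate over a baseline: how many times faster it ran.
pub fn speedup(baseline_secs: f64, candidate_secs: f64) -> f64 {
    baseline_secs / candidate_secs
}

/// Formats a speedup the way the table does, for example "5.1x".
pub fn format_speedup(baseline_secs: f64, candidate_secs: f64) -> String {
    format!("{:.1}x", speedup(baseline_secs, candidate_secs))
}
```

For the CSV read row, `format_speedup(0.92, 0.18)` gives `5.1x`, matching the table.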
The examples/ directory contains comprehensive examples demonstrating all major features:
- Basic Operations: groupby_example.rs, transform_example.rs, pivot_example.rs
- Time Series: time_series_example.rs, time_series_forecasting_example.rs, datetime_accessor_example.rs
- Window Operations: window_operations_example.rs, comprehensive_window_example.rs, dataframe_window_example.rs
- Multi-Index: multi_index_example.rs, hierarchical_groupby_example.rs, nested_group_operations_example.rs
- Categorical Data: categorical_example.rs, categorical_na_example.rs
- Neural Networks: ml_neural_network_example.rs
- Decision Trees: ml_decision_tree_example.rs
- Random Forests: ml_random_forest_example.rs
- Gradient Boosting: ml_gradient_boosting_example.rs
- ML Pipelines: optimized_ml_pipeline_example.rs, optimized_ml_feature_engineering_example.rs
- Specialized ML: optimized_ml_clustering_example.rs, optimized_ml_anomaly_detection_example.rs, optimized_ml_dimension_reduction_example.rs
- JWT & OAuth 2.0: security_jwt_oauth_example.rs
- Role-Based Access Control: security_rbac_example.rs
- Analytics Dashboard: analytics_dashboard_example.rs
- CSV: Examples integrated into basic operations
- Parquet: parquet_example.rs, parquet_advanced_example.rs, parquet_advanced_features_example.rs
- Excel: excel_multisheet_example.rs, excel_advanced_features_example.rs
- SIMD & Parallel: parallel_example.rs, optimized_dataframe_example.rs, optimized_large_dataset_example.rs
- GPU Acceleration: gpu_dataframe_example.rs, gpu_ml_example.rs, gpu_benchmark_example.rs
- Distributed Computing: distributed_example.rs, distributed_window_example.rs, distributed_fault_tolerance_example.rs
- JIT Compilation: jit_parallel_example.rs, jit_window_operations_example.rs
- Streaming: streaming_example.rs
- Plotters Integration: visualization_plotters_example.rs, plotters_visualization_example.rs, enhanced_visualization_example.rs
use pandrs::{DataFrame, Series};
use pandrs::dataframe::{AggFunc, GroupByExt, NamedAgg};
// Build a DataFrame inline. (Use `pandrs::io::read_csv(path, has_header)?`
// for CSV ingestion; `DataFrame::read_csv` is reserved for future API work.)
let mut df = DataFrame::new();
df.add_column(
"city".to_string(),
Series::new(
vec!["Tallinn".to_string(), "Tallinn".to_string(), "Tartu".to_string()],
Some("city".to_string()),
)?,
)?;
df.add_column(
"occupation".to_string(),
Series::new(
vec!["Engineer".to_string(), "Engineer".to_string(), "Analyst".to_string()],
Some("occupation".to_string()),
)?,
)?;
df.add_column(
"age".to_string(),
Series::new(vec![21i64, 34, 40], Some("age".to_string()))?,
)?;
df.add_column(
"income".to_string(),
Series::new(vec![55_000i64, 72_000, 81_000], Some("income".to_string()))?,
)?;
// Grouped aggregation with explicit named aggregations.
let result = GroupByExt::groupby(&df, &["city", "occupation"])?.agg(vec![
NamedAgg::new("income".to_string(), AggFunc::Mean, "income_mean".to_string()),
NamedAgg::new("income".to_string(), AggFunc::Median, "income_median".to_string()),
NamedAgg::new("income".to_string(), AggFunc::Std, "income_std".to_string()),
NamedAgg::new("age".to_string(), AggFunc::Mean, "age_mean".to_string()),
])?;
Note: the Time Series Analysis and Machine Learning Pipeline snippets below are illustrative of the target pandas-like API and still reference helpers (fillna, resample, ewm, get_dummies, apply_columns, DataFrame::read_parquet) that are not yet wired up on the stable DataFrame. They are being aligned with the real surface in a follow-up; see examples/ for snippets that build and run today.
// NOTE: This snippet shows the target API. Some helpers (resample, ewm,
// DataFrame::read_csv on the base DataFrame) are not yet wired up on the
// stable DataFrame. See examples/time_series_example.rs for runnable code.
use pandrs::prelude::*;
use chrono::{Duration, Utc};
use std::collections::HashMap; // needed for the HashMap::from call below
let mut df = DataFrame::read_csv("timeseries.csv", CsvReadOptions::default())?;
df.set_index("timestamp")?;
// Resample to daily frequency
let daily = df.resample("D")?.mean()?;
// Calculate rolling statistics
let rolling_stats = daily
.rolling(RollingOptions {
window: 7,
min_periods: Some(1),
center: false,
})?
.agg(HashMap::from([
("value".to_string(), vec!["mean", "std"]),
]))?;
// Exponentially weighted moving average
let ewm = daily.ewm(EwmOptions {
span: Some(10.0),
..Default::default()
})?;
// NOTE: This snippet shows the target API. Some helpers (read_parquet on the
// base DataFrame, fillna, get_dummies, apply_columns) are not yet wired up
// on the stable DataFrame. See examples/optimized_ml_pipeline_example.rs for
// runnable code.
use pandrs::prelude::*;
// Load and preprocess data
let df = DataFrame::read_parquet("features.parquet")?;
// Handle missing values
let df_filled = df.fillna(FillNaOptions::Forward)?;
// Encode categorical variables
let df_encoded = df_filled.get_dummies(vec!["category1", "category2"], None)?;
// Normalize numerical features
let features = vec!["feature1", "feature2", "feature3"];
let df_normalized = df_encoded.apply_columns(&features, |series| {
let mean = series.mean()?;
let std = series.std(1)?;
series.sub_scalar(mean)?.div_scalar(std)
})?;
// Split features and target
let X = df_normalized.drop(vec!["target"])?;
let y = df_normalized.column("target")?;
We welcome contributions! Please see our Contributing Guide for details.
# Clone the repository
git clone https://github.com/cool-japan/pandrs
cd pandrs
# Install development dependencies
cargo install cargo-nextest cargo-criterion
# Run tests
cargo nextest run
# Run benchmarks
cargo criterion
# Check code quality
cargo clippy -- -D warnings
cargo fmt -- --check
PandRS is developed and maintained by COOLJAPAN OU (Team Kitasan).
If you find PandRS useful, please consider sponsoring the project to support continued development of the Pure Rust ecosystem.
https://github.com/sponsors/cool-japan
Your sponsorship helps us:
- Maintain and improve the COOLJAPAN ecosystem
- Keep the entire ecosystem (OxiBLAS, OxiFFT, SciRS2, etc.) 100% Pure Rust
- Provide long-term support and security updates
Licensed under the Apache License, Version 2.0 (LICENSE or http://www.apache.org/licenses/LICENSE-2.0).
PandRS is inspired by the excellent pandas library and incorporates ideas from:
- Pandas - API design and functionality
- Polars - Performance optimizations
- Apache Arrow - Columnar format
- DataFusion - Query engine
PandRS is a COOLJAPAN project, bringing high-performance data analysis to the Rust ecosystem.