fmus-big is a Python library for big data processing that provides a unified, intuitive API across different data processing frameworks.
- Unified API: Work with pandas, Polars, Dask, or PySpark through one consistent API
- Smart Execution: Automatic switching between eager and lazy evaluation based on data size
- Intuitive Interface: Operations such as select, filter, and group_by are named the way you would describe them
- Optimized Performance: Built-in query optimization and execution planning
- Comprehensive I/O: Read and write CSV, Parquet, and JSON files as well as SQL databases
- Powerful Visualization: Create static and interactive visualizations with automatic type detection
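As a rough illustration of the Smart Execution idea, the sketch below picks an execution mode from the size of the input file. The function name `choose_execution` and the 100 MB threshold are assumptions made for this example and are not part of the fmus-big API; the library's actual heuristics may differ.

```python
import os

# Hypothetical threshold: files larger than this are processed lazily.
LAZY_THRESHOLD_BYTES = 100 * 1024 * 1024


def choose_execution(path, threshold=LAZY_THRESHOLD_BYTES):
    """Return "lazy" for files larger than the threshold, else "eager"."""
    size = os.path.getsize(path)
    return "lazy" if size > threshold else "eager"
```

The returned string could then be passed as the `execution` argument of a reader, assuming `"eager"` is accepted there alongside `"lazy"`.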
# Basic installation with pandas support
pip install fmus-big
# With Polars support
pip install fmus-big[polars]
# With Dask support
pip install fmus-big[dask]
# With PySpark support
pip install fmus-big[spark]
# Full installation with all features
pip install fmus-big[all]

import fmus_big as fb
# Read data from various sources
df = fb.read("data.csv") # Auto-detects file format
df = fb.read_csv("large_data.csv", execution="lazy") # Explicit lazy execution
df = fb.read_parquet("s3://bucket/data.parquet") # Cloud storage
# Perform operations with a consistent API
result = (
    df.select("name", "age", "city")
    .filter("age > 25")
    .group_by("city")
    .aggregate({"age": ["mean", "max", "count"]})
    .order_by("age_mean", ascending=False)
)
# Write results to various formats
result.to_csv("results.csv")
result.to_parquet("results.parquet", partition_cols=["city"])
result.to_json("results.json", orient="records")
# Generic write function
fb.write(result, "results.parquet", compression="snappy")
# Compute and view results
print(result.head())

fmus-big provides extensive I/O capabilities:
- Format Support: Read and write CSV, Parquet, and JSON files, plus SQL databases
- Auto-detection: Automatic format detection based on file extension or path
- Partitioned Data: Support for reading and writing partitioned Parquet datasets
- SQL Integration: Query and write to SQL databases with a simple API
- Backend Optimizations: Format-specific optimizations for each backend
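To make the auto-detection idea concrete, here is a minimal sketch of how a format could be inferred from a path. The `detect_format` function, the extension table, and the rule that a trailing slash means a partitioned directory are all assumptions for illustration; they do not describe fmus-big's internal logic.

```python
from pathlib import Path

# Hypothetical mapping from file extension to format name.
EXTENSION_FORMATS = {
    ".csv": "csv",
    ".parquet": "parquet",
    ".json": "json",
}


def detect_format(path):
    """Guess the data format from a path string."""
    # Assumption: a trailing slash marks a partitioned dataset directory.
    if path.endswith("/"):
        return "partitioned"
    suffix = Path(path).suffix.lower()
    try:
        return EXTENSION_FORMATS[suffix]
    except KeyError:
        raise ValueError(f"Cannot detect format for {path!r}") from None
```

Raising an error for unknown extensions, rather than guessing, keeps a misnamed file from being parsed with the wrong reader.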
# Read examples
df = fb.read("data.csv") # CSV
df = fb.read("data.parquet") # Parquet
df = fb.read("data/") # Partitioned directory
df = fb.read_sql_query("SELECT * FROM my_table", conn) # SQL; conn is an existing database connection
# Write examples
df.to_csv("output.csv", delimiter="|")
df.to_parquet("output.parquet", compression="snappy")
df.to_sql("table_name", conn, if_exists="replace")

fmus-big provides rich visualization capabilities:
- Automatic Plot Type: Suggests the best visualization based on your data
- Multiple Backends: Supports Matplotlib, Plotly, and Bokeh
- Interactive Plots: Create interactive visualizations with tooltips, zooming, and panning
- Dashboards: Build interactive dashboards with multiple plots and controls
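One way automatic plot-type suggestion could work is to classify each column and map the pair of kinds to a chart. The sketch below is a simplified heuristic written for this README; `column_kind`, `suggest_plot`, and the rules they apply are hypothetical and are not taken from the fmus-big implementation.

```python
from datetime import date, datetime
from numbers import Number


def column_kind(values):
    """Classify a column as "temporal", "numeric", or "categorical"."""
    if all(isinstance(v, (date, datetime)) for v in values):
        return "temporal"
    # bool is a Number subclass, so exclude it explicitly.
    if all(isinstance(v, Number) and not isinstance(v, bool) for v in values):
        return "numeric"
    return "categorical"


def suggest_plot(x_values, y_values):
    """Suggest a plot type from the kinds of the x and y columns."""
    kinds = (column_kind(x_values), column_kind(y_values))
    if kinds == ("temporal", "numeric"):
        return "line"
    if kinds == ("numeric", "numeric"):
        return "scatter"
    if kinds == ("categorical", "numeric"):
        return "bar"
    return "table"  # Fallback when no rule matches.
```

The three rules correspond to the line, scatter, and bar plots listed under the basic visualizations; a real implementation would likely inspect backend dtypes instead of Python value types.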
# Basic visualizations
df.viz.line(x='date', y='value')
df.viz.scatter(x='x', y='y', color='category')
df.viz.bar(x='category', y='value')
df.viz.correlation_matrix()
# Interactive visualizations
df.interactive_viz().scatter_3d(x='x', y='y', z='z', color='category')
df.interactive_viz().bubble(x='x', y='y', size='size', color='category')
df.interactive_viz().geo_map(lat='latitude', lon='longitude', color='value')
# Build a dashboard
dashboard = df.dashboard(title='Sales Dashboard')
dashboard.add_plot('line', x='date', y='sales')
dashboard.add_plot('bar', x='region', y='revenue')
dashboard.add_control('dropdown', label='Region', options=['North', 'South', 'East', 'West'])
dashboard.create().show()

- One API to Learn: Master a single API that works across different backends
- Scale Seamlessly: Start with small data on your laptop, scale to clusters without changing code
- Optimized Performance: Automatic query optimization for each backend
- Developer Experience: Clear error messages, sensible defaults, and comprehensive documentation
Contributions are welcome! Check out the contributing guidelines to get started.
This project is licensed under the MIT License.