mexyusef/fmus-big

fmus-big

A Python library for big data processing with a unified, intuitive API across different data processing frameworks.

Features

  • Unified API: Use one consistent API across Pandas, Polars, Dask, and PySpark
  • Smart Execution: Automatic switching between eager and lazy evaluation based on data size
  • Intuitive Interface: Method names that mirror how you describe data operations (select, filter, group_by, order_by)
  • Optimized Performance: Built-in query optimization and execution planning
  • Comprehensive I/O: Read and write CSV, Parquet, and JSON files, as well as SQL databases
  • Powerful Visualization: Create static and interactive visualizations with automatic type detection
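
The README does not say how "Smart Execution" decides between eager and lazy evaluation. As a rough sketch of the idea only, a size-based switch could look like the code below. The function name `choose_execution`, the 100 MB threshold, and the `"eager"` return value are assumptions made for illustration; only `execution="lazy"` appears in the Quick Start, and none of this is taken from fmus-big's implementation.

```python
import os
import tempfile

# Hypothetical cutoff: files larger than this would be processed lazily.
# The value is arbitrary and chosen only for illustration.
LAZY_THRESHOLD_BYTES = 100 * 1024 * 1024  # 100 MB


def choose_execution(path, threshold_bytes=LAZY_THRESHOLD_BYTES):
    """Return "lazy" for files above the threshold, otherwise "eager"."""
    size = os.path.getsize(path)
    return "lazy" if size > threshold_bytes else "eager"


if __name__ == "__main__":
    # Write a tiny throwaway CSV so the example runs on its own.
    with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
        tmp.write(b"name,age\nAda,36\n")
    try:
        print(choose_execution(tmp.name))                     # small file
        print(choose_execution(tmp.name, threshold_bytes=4))  # forced low cutoff
    finally:
        os.remove(tmp.name)
```

A real implementation would likely also consider available memory and the chosen backend, not file size alone.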

Installation

# Basic installation with pandas support
pip install fmus-big

# With Polars support (quotes keep shells such as zsh from expanding the brackets)
pip install "fmus-big[polars]"

# With Dask support
pip install "fmus-big[dask]"

# With PySpark support
pip install "fmus-big[spark]"

# Full installation with all features
pip install "fmus-big[all]"

Quick Start

import fmus_big as fb

# Read data from various sources
df = fb.read("data.csv")  # Auto-detects file format
df = fb.read_csv("large_data.csv", execution="lazy")  # Explicit lazy execution
df = fb.read_parquet("s3://bucket/data.parquet")  # Cloud storage

# Perform operations with a consistent API
result = (
    df.select("name", "age", "city")
    .filter("age > 25")
    .group_by("city")
    .aggregate({"age": ["mean", "max", "count"]})
    .order_by("age_mean", ascending=False)
)

# Write results to various formats
result.to_csv("results.csv")
result.to_parquet("results.parquet", partition_cols=["city"])
result.to_json("results.json", orient="records")

# Generic write function
fb.write(result, "results.parquet", compression="snappy")

# Compute and view results
print(result.head())
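
To make the chained query above concrete, here is a plain-Python sketch of what the same steps compute, using only the standard library. The sample rows are invented, and the `age_mean`, `age_max`, and `age_count` column names are an assumption inferred from the `order_by("age_mean")` call; this is not how fmus-big executes the query.

```python
from collections import defaultdict
from statistics import mean

# Made-up sample rows, for illustration only.
rows = [
    {"name": "Ana", "age": 31, "city": "Lima", "email": "ana@example.com"},
    {"name": "Ben", "age": 22, "city": "Lima", "email": "ben@example.com"},
    {"name": "Caro", "age": 45, "city": "Oslo", "email": "caro@example.com"},
    {"name": "Dev", "age": 39, "city": "Oslo", "email": "dev@example.com"},
    {"name": "Eli", "age": 27, "city": "Lima", "email": "eli@example.com"},
]


def summarize(rows):
    # select("name", "age", "city"): keep only the named columns
    selected = [{k: r[k] for k in ("name", "age", "city")} for r in rows]
    # filter("age > 25"): drop rows that fail the predicate
    kept = [r for r in selected if r["age"] > 25]
    # group_by("city"): collect ages per city
    ages_by_city = defaultdict(list)
    for r in kept:
        ages_by_city[r["city"]].append(r["age"])
    # aggregate({"age": ["mean", "max", "count"]}): one output row per group
    summary = [
        {
            "city": city,
            "age_mean": mean(ages),
            "age_max": max(ages),
            "age_count": len(ages),
        }
        for city, ages in ages_by_city.items()
    ]
    # order_by("age_mean", ascending=False): highest mean first
    return sorted(summary, key=lambda r: r["age_mean"], reverse=True)


if __name__ == "__main__":
    for row in summarize(rows):
        print(row)
```

With these sample rows, Ben is filtered out, and the Oslo group sorts ahead of the Lima group because its mean age is higher.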

Data I/O Capabilities

fmus-big provides extensive I/O capabilities:

  • Format Support: Read and write CSV, Parquet, and JSON files, as well as SQL databases
  • Auto-detection: Automatic format detection based on file extension or path
  • Partitioned Data: Support for reading and writing partitioned Parquet datasets
  • SQL Integration: Query and write to SQL databases with a simple API
  • Backend Optimizations: Format-specific optimizations for each backend

# Read examples
df = fb.read("data.csv")                           # CSV
df = fb.read("data.parquet")                       # Parquet
df = fb.read("data/")                              # Partitioned directory
df = fb.read_sql_query("SELECT * FROM my_table", conn)  # SQL (conn: an open database connection)

# Write examples
df.to_csv("output.csv", delimiter="|")
df.to_parquet("output.parquet", compression="snappy")
df.to_sql("table_name", conn, if_exists="replace")
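
The auto-detection described above ("based on file extension or path") can be pictured with a small lookup. The following is a minimal sketch under stated assumptions: the `detect_format` helper and the `EXTENSION_FORMATS` table are hypothetical names, and treating a path that ends in a separator as a partitioned Parquet dataset is a guess based on the `fb.read("data/")` example, not documented behavior.

```python
from pathlib import PurePosixPath

# Hypothetical extension-to-format table covering the file formats
# named in this README.
EXTENSION_FORMATS = {
    ".csv": "csv",
    ".parquet": "parquet",
    ".json": "json",
}


def detect_format(path):
    """Guess a format name from a path string, or raise ValueError."""
    text = str(path)
    # Assumption: a trailing separator means a partitioned Parquet directory.
    if text.endswith(("/", "\\")):
        return "parquet"
    # suffix is the final extension, e.g. ".parquet"; compare case-insensitively.
    suffix = PurePosixPath(text).suffix.lower()
    if suffix not in EXTENSION_FORMATS:
        raise ValueError(f"Cannot detect format for {text!r}")
    return EXTENSION_FORMATS[suffix]


if __name__ == "__main__":
    for candidate in ("data.csv", "s3://bucket/data.parquet", "data/"):
        print(candidate, "->", detect_format(candidate))
```

Raising an error for unknown extensions, rather than silently falling back to CSV, keeps a mistyped path from being parsed as the wrong format.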

Visualization Capabilities

fmus-big provides rich visualization capabilities:

  • Automatic Plot Type: Suggests the best visualization based on your data
  • Multiple Backends: Supports Matplotlib, Plotly, and Bokeh
  • Interactive Plots: Create interactive visualizations with tooltips, zooming, and panning
  • Dashboards: Build interactive dashboards with multiple plots and controls

# Basic visualizations
df.viz.line(x='date', y='value')
df.viz.scatter(x='x', y='y', color='category')
df.viz.bar(x='category', y='value')
df.viz.correlation_matrix()

# Interactive visualizations
df.interactive_viz().scatter_3d(x='x', y='y', z='z', color='category')
df.interactive_viz().bubble(x='x', y='y', size='size', color='category')
df.interactive_viz().geo_map(lat='latitude', lon='longitude', color='value')

# Build a dashboard
dashboard = df.dashboard(title='Sales Dashboard')
dashboard.add_plot('line', x='date', y='sales')
dashboard.add_plot('bar', x='region', y='revenue')
dashboard.add_control('dropdown', label='Region', options=['North', 'South', 'East', 'West'])
dashboard.create().show()
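
The README does not describe how the "Automatic Plot Type" suggestion works. One plausible heuristic, sketched here purely as an illustration, classifies the x and y columns and maps the pair to a chart type that matches the basic examples above (dates to a line chart, two numeric columns to a scatter plot, categories to a bar chart). The `column_kind` and `suggest_plot` helpers are hypothetical and not part of fmus-big.

```python
from datetime import date, datetime
from numbers import Number


def column_kind(values):
    """Classify a column as "temporal", "numeric", or "categorical"."""
    if not values:
        return "categorical"  # nothing to inspect, so make no strong claim
    if all(isinstance(v, (date, datetime)) for v in values):
        return "temporal"
    # bool is a Number subclass in Python, so exclude it explicitly.
    if all(isinstance(v, Number) and not isinstance(v, bool) for v in values):
        return "numeric"
    return "categorical"


def suggest_plot(x_values, y_values):
    """Suggest a plot type name from the kinds of the x and y columns."""
    kinds = (column_kind(x_values), column_kind(y_values))
    if kinds == ("temporal", "numeric"):
        return "line"
    if kinds == ("numeric", "numeric"):
        return "scatter"
    if kinds == ("categorical", "numeric"):
        return "bar"
    return "table"  # fallback when no chart is an obvious fit


if __name__ == "__main__":
    print(suggest_plot([date(2024, 1, 1), date(2024, 1, 2)], [1.0, 2.5]))
    print(suggest_plot([1, 2, 3], [2.0, 4.1, 5.9]))
    print(suggest_plot(["north", "south"], [10, 12]))
```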

Why fmus-big?

  • One API to Learn: Master a single API that works across different backends
  • Scale Seamlessly: Start with small data on your laptop, scale to clusters without changing code
  • Optimized Performance: Automatic query optimization for each backend
  • Developer Experience: Clear error messages, sensible defaults, and comprehensive documentation

Contributing

Contributions are welcome! Check out the contributing guidelines to get started.

License

This project is licensed under the MIT License.
