8 releases (4 breaking)

0.6.2	Jun 24, 2025
0.6.1	Jun 24, 2025
0.5.0	Jun 24, 2025
0.4.0	Jun 24, 2025
0.2.1	Jun 24, 2025

#2419 in Database interfaces

177 downloads per month

Apache-2.0

1MB
19K SLoC

Arrow-Graph

A high-performance, Arrow-native graph analytics engine with SQL interface for modern data processing workflows.

Overview

Arrow-Graph brings graph analytics to the Apache Arrow ecosystem, providing:

Arrow-native storage - Zero-copy graph operations using columnar data
SQL-first interface - Query graphs using familiar SQL syntax
Modern algorithms - SIMD-optimized implementations of graph algorithms
DataFusion integration - Seamless integration with existing analytical workflows
Streaming support - Process large graphs with bounded memory
ML-ready - Built for graph neural networks and feature extraction

Graph Model

Arrow-Graph represents graphs as Arrow RecordBatches, enabling zero-copy operations:

Edges Table (RecordBatch):
┌───────┬────────┬────────┐
│ src   │ dst    │ weight │
├───────┼────────┼────────┤
│ "A"   │ "B"    │ 1.0    │
│ "B"   │ "C"    │ 2.0    │
│ "C"   │ "D"    │ 1.5    │
│ "A"   │ "D"    │ 3.0    │
└───────┴────────┴────────┘

Nodes: Automatically indexed from unique src/dst values
Internal ID mapping: "A"→0, "B"→1, "C"→2, "D"→3

Quick Start

Installation

Add to your Cargo.toml:

[dependencies]
arrow-graph = "0.1"

Basic Usage

use arrow_graph::prelude::*;
use arrow::array::{StringArray, Float64Array};
use arrow::record_batch::RecordBatch;

// Create a graph from Arrow data
let edges = RecordBatch::try_new(
    schema,
    vec![
        Arc::new(StringArray::from(vec!["A", "B", "C"])),  // source
        Arc::new(StringArray::from(vec!["B", "C", "D"])),  // target  
        Arc::new(Float64Array::from(vec![1.0, 2.0, 1.5])), // weight
    ],
)?;

let graph = ArrowGraph::from_edges(edges)?;

// Basic graph operations
println!("Nodes: {}", graph.node_count());
println!("Edges: {}", graph.edge_count()); 
println!("Density: {:.3}", graph.density());

// Navigate the graph
let neighbors = graph.neighbors("A").unwrap();
println!("A connects to: {:?}", neighbors);

SQL Interface (Coming in v0.3.0)

Built as DataFusion User-Defined Functions (UDFs) for seamless integration:

-- Find shortest paths
SELECT shortest_path('A', 'D', 'edges_table') as path;

-- Calculate PageRank
SELECT node_id, pagerank() OVER (PARTITION BY graph_id) as rank
FROM nodes_table;

-- Detect communities  
SELECT node_id, community_detection('leiden', 'edges_table') as cluster
FROM nodes_table;

Current prototype (basic graph metrics available):

// Register UDF with DataFusion
ctx.register_udf(create_udf(
    "graph_density",
    vec![DataType::Utf8], // table name
    Arc::new(DataType::Float64),
    Volatility::Stable,
    Arc::new(|args| {
        let edges = get_edges_table(args[0].as_ref())?;
        Ok(ColumnarValue::Scalar(ScalarValue::Float64(
            Some(calculate_density(&edges)?)
        )))
    }),
));

Features

Current (v0.1.0)

✅ Arrow RecordBatch graph construction
✅ Efficient adjacency list indexing
✅ Basic graph metrics (density, node/edge counts)
✅ Graph navigation (neighbors, predecessors)
✅ CLI tool foundation

Coming Soon (v0.2.0)

🚧 Vectorized shortest path algorithms
🚧 PageRank with early termination
🚧 Connected components analysis
🚧 Community detection (Leiden algorithm)
🚧 Triangle counting and clustering

Roadmap

v0.3.0: SQL integration with DataFusion
v0.4.0: Streaming graph processing
v0.5.0: ML/AI integration (GNN support)
v0.6.0: Cloud-native distributed processing

Performance

Built for modern data scales:

Target: 10-100x faster than NetworkX (based on initial benchmarks with 1M edge datasets)
Scale: Handle 100M+ edges on single machine
Memory: Efficient columnar storage with zero-copy operations
SIMD: Vectorized algorithms using Arrow compute kernels

Comprehensive benchmarks coming in v0.2.0 release

Use Cases

Arrow-Graph is designed for modern graph analytics workflows:

Friend recommendations: Find mutual connections and suggest new relationships
Influence measurement: Calculate centrality metrics for key opinion leaders
Community detection: Identify clusters and groups within social networks

Fraud Detection

Transaction networks: Analyze payment flows to detect suspicious patterns
Account linking: Find connected accounts through shared attributes
Risk scoring: Calculate graph-based features for ML fraud models

Knowledge Graphs & Recommendation Systems

Product recommendations: Graph-based collaborative filtering
Content discovery: Find related articles, papers, or media through citation/reference networks
Semantic search: Navigate knowledge graphs for enhanced search results

ML/AI Feature Engineering

GNN preprocessing: Prepare graph data for PyTorch Geometric or DGL
Graph embeddings: Calculate node2vec, GraphSAGE features at scale
Pipeline integration: Seamless integration with existing ML workflows via Arrow

Architecture

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   SQL Layer     │    │   Algorithms     │    │   Storage       │
│                 │    │                  │    │                 │
│ DataFusion UDFs │───▶│ SIMD Optimized   │───▶│ Arrow Columnar  │
│ Graph Functions │    │ Vectorized Ops   │    │ Zero-Copy       │
│ Pattern Matching│    │ Streaming Algos  │    │ Memory Mapped   │
└─────────────────┘    └──────────────────┘    └─────────────────┘

Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

License

Licensed under the Apache License, Version 2.0. See LICENSE for details.

Acknowledgments

Inspired by Apache Flink Gelly, built for the modern Arrow ecosystem.

Dependencies

~104MB
~2M SLoC