SDG-LOOM

Overview

SDG-LOOM (Scalable Data Generator LOOM) is a framework for generating synthetic datasets for LLMs (Large Language Models) and for running large-scale data analysis with AI agents. It targets use cases that run many AI agents in parallel and need high-speed batch processing, and it aims to improve processing capacity and flexibility over traditional methods.

Pipelines are written in MABEL (Model And Blocks Expansion Language) v2.0, which allows expressive and flexible structured agent programs. Several different LLM models can run at the same time, which simplifies load balancing and performance tuning. Typical tasks include LLM-based large-scale data analysis, data augmentation, real-time inference, and synthetic data generation.

Built-in adaptive batching and error handling keep operation stable when request volume fluctuates. The framework is optimized for high-frequency, large-scale inference workloads such as Natural Language Processing (NLP), generative AI applications, and AI agent-based automation systems.

In short, SDG-LOOM is intended for users who need to scale up advanced LLM tasks while keeping agent execution fast and stable.

Features

MABEL v2.0 Support
- Turing-complete expression language (MEX)
- Advanced control structures (while, recurse, reduce, call, let)
- Inline Python functions
- Global variable support

MABEL v1.x Backward Compatibility
- Automatic version detection

Advanced Concurrent Processing
- Adaptive concurrency control inspired by TCP congestion control (Vegas/Reno/BBR)
  - Two-phase control: Slow Start (exponential increase) and Congestion Avoidance (linear increase)
  - Noise reduction and trend detection using EMA (Exponential Moving Average)
  - Vegas-style proactive congestion detection
  - Graduated decrease logic (ignores mild congestion, responds immediately to severe congestion)
- Real-time metrics collection from vLLM/SGLang backends
- Dynamic request batching for optimal throughput
- Automatic latency-based optimization

Multi-Model Support
- Define and operate multiple LLM models simultaneously

Flexible I/O Support
- JSONL and CSV format support in streaming and batch modes
- Direct loading of Hugging Face Datasets
- Key mapping feature for improved dataset compatibility

Robust Error Handling
- Flexible error handling with retry mechanisms

Performance Optimization
- Shared HTTP transport for connection pooling
- HTTP/2 support for improved throughput
- Asynchronous buffered I/O for efficient file operations
- Phase 2: Hierarchical task scheduling and memory optimization (see Phase 2 Optimization Guide)

Post-Generation Profiling
- Language distribution analysis
- Output length distribution statistics
- Duplicate detection and deduplication rate
- Parse/validation failure rate tracking
- LLM token usage statistics per model
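
To give a feel for what the post-generation profiling metrics measure, here is a minimal standalone sketch of two of them: output length statistics and duplicate rate. It is not SDG-LOOM's implementation. `profile_outputs` is a hypothetical helper, and treating texts as duplicates when they match after whitespace and case normalization is an assumption made for illustration.

```python
from collections import Counter
from statistics import mean, median


def profile_outputs(texts: list[str]) -> dict:
    """Compute simple length and duplicate statistics for generated texts.

    Hypothetical helper for illustration; not part of the SDG-LOOM API.
    """
    if not texts:
        return {"count": 0, "duplicate_rate": 0.0}

    # Normalize whitespace and case so trivially different copies
    # are counted as duplicates (an illustrative choice).
    normalized = [" ".join(t.split()).lower() for t in texts]
    counts = Counter(normalized)

    # Every occurrence beyond the first of a given text is a duplicate.
    duplicates = sum(n - 1 for n in counts.values())
    lengths = [len(t) for t in texts]

    return {
        "count": len(texts),
        "unique": len(counts),
        "duplicate_rate": duplicates / len(texts),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "mean_length": mean(lengths),
        "median_length": median(lengths),
    }


if __name__ == "__main__":
    sample = ["Hello world", "hello   world", "A longer, different output."]
    print(profile_outputs(sample))
```

Lengths here are counted in characters. Token-based lengths would need the tokenizer of the model in question.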

Requirements

- Python >= 3.10
- PyYAML >= 6.0.1
- openai >= 1.40.0
- tqdm >= 4.66.0

Installation

The examples below show installation with several environment managers. Run each `pip install -e .` from the repository root.

Standard pip Installation

pip install -e .

Installation with pyenv

# Python version management
pyenv install 3.12.0
pyenv local 3.12.0

# Set up venv
python -m venv venv
source venv/bin/activate

# Install dependencies
pip install -e .

Installation with conda

# Create and activate environment
conda create -n sdg python=3.12
conda activate sdg

# Install
pip install -e .

Fast Installation with uv (Recommended)

uv is a fast Python package manager.

# Install uv (if not already installed)
pip install uv

# Create virtual environment and install dependencies
uv venv
source .venv/bin/activate

uv pip install -e .

Quick Start

Minimal configuration example:

mabel:
  version: "2.0"

models:
  - name: gpt4
    api_model: gpt-4o-mini
    api_key: ${ENV.OPENAI_API_KEY}

blocks:
  - type: ai
    exec: 1
    model: gpt4
    prompts:
      - "Summarize: {UserInput}"
    outputs:
      - name: Summary
        select: full
  
  - type: end
    exec: 2
    final:
      - name: answer
        value: "{Summary}"

For detailed specifications, please refer to:

MABEL v2 Specification - Detailed feature descriptions, samples, and specifications
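
The Quick Start prompt references `{UserInput}`. Assuming placeholders are filled from the keys of each input record (an assumption; the MABEL v2 Specification is the authority here), an input JSONL file could be prepared as sketched below. `write_jsonl` is a hypothetical helper, not part of SDG-LOOM, and the demo writes to a temporary directory only to stay side-effect free.

```python
import json
import os
import tempfile
from pathlib import Path


def write_jsonl(records: list[dict], path: str) -> int:
    """Write one JSON object per line and return the number of records written."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("w", encoding="utf-8") as f:
        for record in records:
            # ensure_ascii=False keeps non-ASCII text (e.g. Japanese) readable.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(records)


if __name__ == "__main__":
    # Assumption: the "UserInput" key feeds the {UserInput} placeholder.
    records = [
        {"UserInput": "SDG-LOOM generates synthetic datasets for LLMs."},
        {"UserInput": "MABEL v2.0 describes agent pipelines in YAML."},
    ]
    # Demo target; point this at your real input file instead.
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "input.jsonl")
        print(f"wrote {write_jsonl(records, target)} records to {target}")

    # The sample config reads the API key from the environment.
    if not os.environ.get("OPENAI_API_KEY"):
        print("warning: OPENAI_API_KEY is not set")
```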

Usage

Command Line (CLI) Execution

Basic JSONL processing:

sdg run \
  --yaml examples/sdg_demo_v2.yaml \
  --input examples/data/input.jsonl \
  --output output/result.jsonl

Quick test with a single data item:

sdg test-run \
  --yaml examples/sdg_demo_v2.yaml \
  --input examples/data/input.jsonl

With verbose logging (detailed debug output):

sdg run \
  --yaml examples/sdg_demo_v2.yaml \
  --input data.jsonl \
  --output result.jsonl \
  --verbose

With Japanese UI (default is English):

sdg run \
  --yaml examples/sdg_demo_v2.yaml \
  --input data.jsonl \
  --output result.jsonl \
  --ui-locale ja

Execution with adaptive concurrency and custom batch settings:

sdg run \
  --yaml examples/sdg_demo_v2.yaml \
  --input data.jsonl \
  --output result.jsonl \
  --adaptive \
  --max-batch 16 \
  --min-batch 2 \
  --target-latency-ms 2000
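
The Features list describes adaptive mode as inspired by TCP congestion control: slow start, congestion avoidance, EMA smoothing, and a graduated decrease. The toy controller below sketches that general idea so the flags above are easier to reason about. It is not SDG-LOOM's implementation. The class name `AdaptiveLimit`, the 1.2x and 2x thresholds, and the smoothing factor are all illustrative assumptions, and Vegas-style baseline latency estimation is left out for brevity.

```python
from typing import Optional


class AdaptiveLimit:
    """Toy concurrency limit with slow start, linear growth, and graduated decrease.

    Illustrative only: thresholds and smoothing factor are assumptions,
    not values taken from SDG-LOOM.
    """

    def __init__(self, min_limit: int = 2, max_limit: int = 16,
                 target_latency_ms: float = 2000.0, alpha: float = 0.3) -> None:
        self.min_limit = min_limit
        self.max_limit = max_limit
        self.target = target_latency_ms
        self.alpha = alpha                      # EMA smoothing factor
        self.limit = min_limit
        self.ema_ms: Optional[float] = None
        self.slow_start = True

    def observe(self, latency_ms: float) -> int:
        """Feed one latency sample and return the updated limit."""
        # The EMA damps noise so one slow request does not dominate the signal.
        if self.ema_ms is None:
            self.ema_ms = latency_ms
        else:
            self.ema_ms = self.alpha * latency_ms + (1 - self.alpha) * self.ema_ms

        ratio = self.ema_ms / self.target
        if ratio >= 2.0:
            # Severe congestion: halve immediately and leave slow start.
            self.limit //= 2
            self.slow_start = False
        elif ratio >= 1.2:
            # Moderate congestion: back off by one step.
            self.limit -= 1
            self.slow_start = False
        elif ratio > 1.0:
            # Mild congestion: hold the current limit.
            self.slow_start = False
        elif self.slow_start:
            self.limit *= 2                     # slow start: exponential increase
        else:
            self.limit += 1                     # congestion avoidance: linear increase

        # Keep the limit inside the configured bounds.
        self.limit = max(self.min_limit, min(self.max_limit, self.limit))
        return self.limit


if __name__ == "__main__":
    ctl = AdaptiveLimit(min_limit=2, max_limit=16, target_latency_ms=2000.0)
    for sample in [500, 600, 700, 5000, 5000, 900, 800]:
        print(sample, "->", ctl.observe(sample))
```

In this sketch the constructor arguments mirror `--min-batch`, `--max-batch`, and `--target-latency-ms`. How the real flags feed into SDG-LOOM's controller is described in the Usage Guide.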

Using Python API

Simple streaming execution (recommended):

from sdg.runner import run_streaming

run_streaming(
    yaml_path="pipeline.yaml",
    input_path="data/input.jsonl",
    output_path="output/result.jsonl",
    max_concurrent=8,
)

Full control with PipelineEngine:

from sdg.config import load_config
from sdg.runner import PipelineEngine, RunConfig, ConcurrencyConfig

cfg = load_config("pipeline.yaml")
run_config = RunConfig(
    concurrency=ConcurrencyConfig(max_concurrent=8),
)
engine = PipelineEngine(cfg, run_config)
engine.run("output/result.jsonl")
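
After a run, the result file can be inspected with the standard library alone. The sketch below assumes the output is JSONL with one object per line and that each record carries the fields declared under `final` in the pipeline (for example `answer` in the Quick Start config). Both points are assumptions to verify against your own output. `parse_jsonl` is a hypothetical helper, not part of the `sdg` package.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator


def parse_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per non-empty line."""
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            # Report the offending line so broken records are easy to find.
            raise ValueError(f"line {line_no}: invalid JSON ({exc.msg})") from exc


if __name__ == "__main__":
    result = Path("output/result.jsonl")
    if result.exists():
        with result.open(encoding="utf-8") as f:
            for record in parse_jsonl(f):
                # Assumption: "answer" is the field declared under `final`.
                print(record.get("answer"))
    else:
        print(f"{result} not found; run the pipeline first")
```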

Detailed Documentation 📖

- Usage Guide : Detailed usage of CLI and Python API
- MABEL v2 Complete Specification : MABEL grammar and feature details

MABEL Editor 🎨

For visual editing of MABEL files, we provide a dedicated GUI tool:

SDG UI - A graphical user interface for creating and editing MABEL configuration files

This tool provides an intuitive way to design and manage MABEL pipelines without manually editing YAML files.

Examples

Sample code and data are provided in the following directory.

examples/
- sdg_demo.yaml : Basic usage example
- sdg_demo_v2.yaml : Advanced MABEL v2 sample
- sdg_comprehensive_v2.yaml : Comprehensive v2 feature sample
- helpers.py : External Python function usage example
- data/ : Sample input/output datasets

License 📝

This project is provided under the MIT License. See the LICENSE file for details.

Contributing 🤝

Contributions to SDG-LOOM are welcome! When submitting pull requests, please ensure:

- MABEL v1 compatibility is maintained
- MABEL v2 features comply with the latest specifications
- All existing samples pass tests
- Appropriate documentation is provided

Support 🛠️

For bug reports and feature requests, please use GitHub Issues.
