SDG-LOOM (Scalable Data Generator LOOM) is a framework for efficiently generating synthetic datasets for LLMs (Large Language Models) and for performing large-scale data analysis with AI agents. It targets use cases that run many AI agents in parallel and need high-speed batch processing, and it offers greater processing capacity and flexibility than traditional methods.
It adopts MABEL (Model And Blocks Expansion Language) v2.0, which supports expressive, flexible, structured agent programs. It can also run several different LLM models at the same time, which simplifies load balancing and performance tuning. This makes it well suited to large-scale LLM-based data analysis, data augmentation, real-time inference, and synthetic data generation.
Built-in adaptive batch processing and error handling keep operation stable even when request volume fluctuates. The framework is optimized for workloads with high-frequency, large-scale inference, such as Natural Language Processing (NLP), generative AI applications, and AI agent-based automation systems.
The design focuses on large-scale, fast, and stable use of AI agents, for users who need to scale up advanced LLM tasks efficiently.
- MABEL v2.0 Support
  - Turing-complete expression language (MEX)
  - Advanced control structures (while, recurse, reduce, call, let)
  - Inline Python functions
  - Global variable support
- MABEL v1.x Backward Compatibility
  - Automatic version detection
- Advanced Concurrent Processing
  - Adaptive concurrency control inspired by TCP congestion control (Vegas/Reno/BBR)
    - Two-phase control: Slow Start (exponential increase) and Congestion Avoidance (linear increase)
    - Noise reduction and trend detection using EMA (Exponential Moving Average)
    - Vegas-style proactive congestion detection
    - Graduated decrease logic (ignores mild congestion, responds immediately to severe congestion)
  - Real-time metrics collection from vLLM/SGLang backends
  - Dynamic request batching for optimal throughput
  - Automatic latency-based optimization
- Multi-Model Support
  - Define and operate multiple LLM models simultaneously
- Flexible I/O Support
  - JSONL and CSV format support in streaming and batch modes
  - Direct loading of Hugging Face Datasets
  - Key mapping feature for improved dataset compatibility
- Robust Error Handling
  - Flexible error handling with retry mechanisms
- Performance Optimization
  - Shared HTTP transport for connection pooling
  - HTTP/2 support for improved throughput
  - Asynchronous buffered I/O for efficient file operations
  - Phase 2: Hierarchical task scheduling and memory optimization (see Phase 2 Optimization Guide)
- Post-Generation Profiling
  - Language distribution analysis
  - Output length distribution statistics
  - Duplicate detection and deduplication rate
  - Parse/validation failure rate tracking
  - LLM token usage statistics per model
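
The adaptive concurrency items in the list above (slow start, congestion avoidance, EMA smoothing, Vegas-style detection, graduated decrease) can be illustrated with a small toy controller. This is a minimal sketch under stated assumptions, not SDG-LOOM's actual implementation: the class name `AdaptiveLimiter`, its fields, and every threshold value are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AdaptiveLimiter:
    """Toy concurrency limiter. All names and thresholds are illustrative assumptions."""

    limit: float = 1.0        # current concurrency limit
    min_limit: float = 1.0
    max_limit: float = 64.0
    ssthresh: float = 16.0    # slow start ends once the limit reaches this value
    alpha: float = 0.2        # EMA smoothing factor
    mild: float = 1.2         # latency ratio below this counts as "no congestion"
    severe: float = 2.0       # latency ratio at or above this triggers a cut
    base_latency: Optional[float] = None  # lowest latency observed so far
    ema_latency: Optional[float] = None   # smoothed recent latency

    def observe(self, latency_ms: float) -> int:
        """Feed one latency sample and return the new integer limit."""
        # Vegas-style signal: compare smoothed latency with the best latency seen.
        if self.base_latency is None or latency_ms < self.base_latency:
            self.base_latency = latency_ms
        if self.ema_latency is None:
            self.ema_latency = latency_ms
        else:
            self.ema_latency = (
                self.alpha * latency_ms + (1 - self.alpha) * self.ema_latency
            )
        ratio = self.ema_latency / max(self.base_latency, 1e-9)

        if ratio >= self.severe:
            # Severe congestion: halve the limit immediately.
            self.ssthresh = max(self.min_limit, self.limit / 2)
            self.limit = self.ssthresh
        elif ratio >= self.mild:
            # Mild congestion: ignore it and hold the current limit.
            pass
        elif self.limit < self.ssthresh:
            self.limit *= 2   # slow start: exponential increase
        else:
            self.limit += 1   # congestion avoidance: linear increase

        self.limit = min(self.max_limit, max(self.min_limit, self.limit))
        return int(self.limit)
```

Feeding the limiter a steady latency makes the limit double until it reaches `ssthresh` and then grow by one per sample; a sudden latency spike pushes the smoothed ratio past `severe` and halves the limit.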
- Python >= 3.10
- PyYAML >= 6.0.1
- openai >= 1.40.0
- tqdm >= 4.66.0
Installation examples are provided for several environment management methods.

Using pip:

    pip install -e .

Using pyenv and venv:

    # Python version management
    pyenv install 3.12.0
    pyenv local 3.12.0
    # Set up venv
    python -m venv venv
    source venv/bin/activate
    # Install dependencies
    pip install -e .

Using conda:

    # Create and activate environment
    conda create -n sdg python=3.12
    conda activate sdg
    # Install
    pip install -e .

uv is a fast Python package manager.

    # Install uv (if not already installed)
    pip install uv
    # Create virtual environment and install dependencies
    uv venv
    source .venv/bin/activate
    uv pip install -e .

Minimal configuration example:

    mabel:
      version: "2.0"
    models:
      - name: gpt4
        api_model: gpt-4o-mini
        api_key: ${ENV.OPENAI_API_KEY}
    blocks:
      - type: ai
        exec: 1
        model: gpt4
        prompts:
          - "Summarize: {UserInput}"
        outputs:
          - name: Summary
            select: full
      - type: end
        exec: 2
        final:
          - name: answer
            value: "{Summary}"

For detailed specifications, please refer to:
- MABEL v2 Specification - Detailed feature descriptions, samples, and specifications
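
The minimal configuration uses two kinds of placeholders: `${ENV.OPENAI_API_KEY}` and names in braces such as `{UserInput}` and `{Summary}`. The sketch below shows one way such substitution could work, assuming that `${ENV.NAME}` reads an environment variable, that `{UserInput}` is filled from a field of the input record, and that `{Summary}` is filled from the output of the `ai` block. The helpers `expand_env` and `render` are hypothetical and are not part of SDG-LOOM; the MABEL v2 Specification remains the authority on the real rules.

```python
import os
import re

# Hypothetical patterns for the two placeholder styles seen in the config.
ENV_REF = re.compile(r"\$\{ENV\.([A-Za-z_][A-Za-z0-9_]*)\}")
FIELD_REF = re.compile(r"\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(text, env=os.environ):
    """Replace ${ENV.NAME} with env["NAME"]; unknown names are left untouched."""
    return ENV_REF.sub(lambda m: env.get(m.group(1), m.group(0)), text)


def render(template, context):
    """Replace {Name} with context["Name"]; unknown names are left untouched."""
    return FIELD_REF.sub(
        lambda m: str(context.get(m.group(1), m.group(0))), template
    )


# One input record, assumed to carry the field that the prompt refers to.
context = {"UserInput": "SDG-LOOM runs many LLM calls in parallel."}
prompt = render("Summarize: {UserInput}", context)

# An `ai` block would send `prompt` to the model; the reply is faked here.
context["Summary"] = "Parallel LLM pipeline."
answer = render("{Summary}", context)
```

Leaving unknown placeholders untouched is a choice made for this sketch only; a real pipeline might prefer to raise an error instead.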

Basic JSONL processing:

    sdg run \
      --yaml examples/sdg_demo_v2.yaml \
      --input examples/data/input.jsonl \
      --output output/result.jsonl

Quick test with a single data item:

    sdg test-run \
      --yaml examples/sdg_demo_v2.yaml \
      --input examples/data/input.jsonl

With verbose logging (detailed debug output):

    sdg run \
      --yaml examples/sdg_demo_v2.yaml \
      --input data.jsonl \
      --output result.jsonl \
      --verbose

With Japanese UI (default is English):

    sdg run \
      --yaml examples/sdg_demo_v2.yaml \
      --input data.jsonl \
      --output result.jsonl \
      --ui-locale ja

Execution with adaptive concurrency and custom batch settings:

    sdg run \
      --yaml examples/sdg_demo_v2.yaml \
      --input data.jsonl \
      --output result.jsonl \
      --adaptive \
      --max-batch 16 \
      --min-batch 2 \
      --target-latency-ms 2000

Simple streaming execution (recommended):

    from sdg.runner import run_streaming

    run_streaming(
        yaml_path="pipeline.yaml",
        input_path="data/input.jsonl",
        output_path="output/result.jsonl",
        max_concurrent=8,
    )

Full control with PipelineEngine:

    from sdg.config import load_config
    from sdg.runner import PipelineEngine, RunConfig, ConcurrencyConfig

    # Load the MABEL pipeline definition
    cfg = load_config("pipeline.yaml")
    run_config = RunConfig(
        concurrency=ConcurrencyConfig(max_concurrent=8),
    )
    engine = PipelineEngine(cfg, run_config)
    engine.run("output/result.jsonl")

- Usage Guide - Detailed usage of CLI and Python API
- MABEL v2 Complete Specification - MABEL grammar and feature details
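
The CLI and Python examples above read a JSONL input file. The sketch below writes a small one and reads it back, using only the standard library. It assumes that each record carries a `UserInput` field matching the `{UserInput}` placeholder in the minimal configuration; the helper names `write_jsonl` and `read_jsonl` are hypothetical and not part of SDG-LOOM.

```python
import json
from pathlib import Path


def write_jsonl(path, records):
    """Write one JSON object per line, creating parent directories as needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Field name assumed from the "Summarize: {UserInput}" prompt.
records = [
    {"UserInput": "Large language models can generate synthetic training data."},
    {"UserInput": "Adaptive batching keeps throughput stable under load."},
]
write_jsonl("data/input.jsonl", records)
```

The same `read_jsonl` helper can be pointed at the output file after a run to inspect the generated records.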
For visual editing of MABEL files, we provide a dedicated GUI tool:
- SDG UI - A graphical user interface for creating and editing MABEL configuration files
This tool provides an intuitive way to design and manage MABEL pipelines without manually editing YAML files.
Sample code and data are provided in the following directory.
- examples/
  - sdg_demo.yaml: Basic usage example
  - sdg_demo_v2.yaml: Advanced MABEL v2 sample
  - sdg_comprehensive_v2.yaml: Comprehensive v2 feature sample
  - helpers.py: External Python function usage example
  - data/: Sample input/output datasets
This project is provided under the MIT License. See the LICENSE file for details.
Contributions to SDG-LOOM are welcome! When submitting pull requests, please ensure:
- MABEL v1 compatibility is maintained
- MABEL v2 features comply with the latest specifications
- All existing samples pass tests
- Appropriate documentation is provided
For bug reports and feature requests, please use GitHub Issues.