SynthGenAI is a package for generating synthetic datasets. The idea is to have a simple tool that can generate datasets on different topics by utilizing LLMs from different API providers. The package is designed to be modular and can easily be extended with new API providers and features.
Important
The package is still in the early stages of development and some features may not be fully implemented or tested. If you find any issues or have any suggestions, feel free to open an issue or create a pull request.
Interest in synthetic data generation has surged recently, driven by the growing recognition of data as a critical asset in AI development. As Ilya Sutskever, one of the most important figures in AI, says: 'Data is the fossil fuel of AI.' The more quality data we have, the better our models can perform. However, access to data is often restricted due to privacy concerns, or it may be prohibitively expensive to collect. Additionally, the vast amount of high-quality data on the internet has already been extensively mined. Synthetic data generation addresses these challenges by allowing us to create diverse and useful datasets using current pre-trained Large Language Models (LLMs). Beyond LLMs, synthetic data also holds immense potential for pre-training and post-training of Small Language Models (SLMs), which are gaining popularity due to their efficiency and suitability for specific, resource-constrained applications. By leveraging synthetic data for both LLMs and SLMs, we can enhance performance across a wide range of use cases while balancing resource efficiency and model effectiveness. This approach enables us to harness the strengths of both synthetic and authentic datasets to achieve optimal outcomes.
The package is built using Python and the following libraries:
- uv, an extremely fast Python package and project manager, written in Rust.
- LiteLLM, a Python SDK for accessing LLMs from different API providers using the standardized OpenAI format.
- Langfuse, an LLMOps platform for observability, traceability, and monitoring of LLMs.
- Pydantic, data validation and settings management using Python type annotations.
- Hugging Face Hub & Datasets, Python libraries for saving generated datasets on the Hugging Face Hub.
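Because all providers are accessed through LiteLLM's standardized OpenAI format, a prompt is just a list of role/content messages regardless of the backend. A minimal sketch of that format (the prompt content is hypothetical, not from the package):

```python
# The OpenAI-style chat format that LiteLLM standardizes across providers.
# With litellm installed, such a list is passed straight to
# litellm.completion(model="openai/gpt-4o-mini", messages=messages).
messages = [
    {"role": "system", "content": "You are a dataset generator."},
    {"role": "user", "content": "Generate one instruction example about Python."},
]

def validate_messages(msgs: list[dict]) -> bool:
    """Check that every message has a known role and a content field."""
    allowed_roles = {"system", "user", "assistant", "tool"}
    return all(
        set(m) >= {"role", "content"} and m["role"] in allowed_roles
        for m in msgs
    )

print(validate_messages(messages))  # True
```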
To install the package, you can use the following command:
```shell
pip install synthgenai
```

or, if you use the uv package manager:

```shell
uv add synthgenai
```

or you can install the package directly from the source code using the following commands:

```shell
git clone https://github.com/Shekswess/synthgenai.git
cd synthgenai
uv build
pip install ./dist/synthgenai-{version}-py3-none-any.whl
```

To use the package, you need to have the following requirements installed:
- Python 3.10+
- uv for building the package directly from the source code
- Ollama running on your local machine if you want to use Ollama as an API provider (optional)
- Langfuse running on your local machine or in the cloud if you want to use Langfuse for traceability (optional)
- A Hugging Face Hub account with a generated token if you want to save the generated datasets on the Hugging Face Hub (optional)
- Gradio for using the SynthGenAI UI (optional)
After installation, get started quickly by using the CLI:
```shell
# 1. See what environment variables you need
synthgenai env-setup

# 2. Set up your API keys (example for OpenAI)
export OPENAI_API_KEY="your-api-key-here"

# 3. List available dataset types
synthgenai list-types

# 4. Generate your first dataset
synthgenai generate instruction \
  --model "openai/gpt-5" \
  --topic "Python Programming" \
  --domain "Software Development" \
  --entries 100

# 5. See more examples
synthgenai examples
```

Available CLI commands:

- synthgenai generate - Generate synthetic datasets
- synthgenai list-types - Show all available dataset types
- synthgenai examples - Display example commands
- synthgenai providers - List supported LLM providers
- synthgenai env-setup - Show environment setup guide
- synthgenai --help - Show help information
- Groq - more info about the Groq models that can be used can be found here
- Mistral AI - more info about the Mistral AI models that can be used can be found here
- Gemini - more info about the Gemini models that can be used can be found here
- Bedrock - more info about the Bedrock models that can be used can be found here
- Anthropic - more info about the Anthropic models that can be used can be found here
- OpenAI - more info about the OpenAI models that can be used can be found here
- Hugging Face - more info about the Hugging Face models that can be used can be found here
- Ollama - more info about the Ollama models that can be used can be found here
- vLLM - more info about the vLLM models that can be used can be found here
- SageMaker - more info about the SageMaker models that can be used can be found here
- Azure - more info about the Azure and Azure AI models that can be used can be found here & here
- Vertex AI - more info about the Vertex AI models that can be used can be found here
- DeepSeek - more info about the DeepSeek models that can be used can be found here
- xAI - more info about the xAI models that can be used can be found here
- OpenRouter - more info about the OpenRouter models that can be used can be found here
For detailed information about setting up environment variables for different API providers, observability tools, and dataset management, please refer to the Installation Guide.
You can control the logging verbosity using the SYNTHGENAI_DETAILED_MODE environment variable:
```shell
# For detailed logging (shows all debug information)
export SYNTHGENAI_DETAILED_MODE="false"

# For NO logging (default)
export SYNTHGENAI_DETAILED_MODE="true"
```

Note
By default, SYNTHGENAI_DETAILED_MODE is set to "true", which provides NO logging output. Set it to "false" to enable detailed debugging information during dataset generation.
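The convention above can be sketched in Python (illustrative only; the actual logging setup lives inside the package):

```python
import logging
import os

def configure_logging() -> int:
    """Map SYNTHGENAI_DETAILED_MODE to a logging level, as described above:
    "true" (the default) silences output, "false" enables debug logging."""
    silenced = os.environ.get("SYNTHGENAI_DETAILED_MODE", "true").lower() == "true"
    level = logging.CRITICAL if silenced else logging.DEBUG
    logging.getLogger("synthgenai-demo").setLevel(level)
    return level

os.environ["SYNTHGENAI_DETAILED_MODE"] = "false"
print(configure_logging() == logging.DEBUG)  # True
```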
For observing the generated datasets, you can use Langfuse for traceability and monitoring of the LLMs.
For handling the datasets and saving them on Hugging Face Hub, you can use the Hugging Face Datasets library.
Currently there are six types of datasets that can be generated using SynthGenAI:
- Raw Datasets
- Instruction Datasets
- Preference Datasets
- Sentiment Analysis Datasets
- Summarization Datasets
- Text Classification Datasets
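To make the dataset types concrete, here are hypothetical entry shapes for two of them; the package's actual schemas live in synthgenai/schemas/datasets.py and may use different field names:

```python
# Hypothetical instruction entry: a keyword plus an OpenAI-style conversation.
instruction_entry = {
    "keyword": "list comprehensions",
    "messages": [
        {"role": "system", "content": "You are a helpful Python tutor."},
        {"role": "user", "content": "Explain list comprehensions."},
        {"role": "assistant", "content": "A list comprehension builds a list in one expression..."},
    ],
}

# Hypothetical preference entry: one prompt with a chosen and a rejected answer,
# the shape typically used for RLHF/DPO-style training.
preference_entry = {
    "keyword": "list comprehensions",
    "prompt": "Explain list comprehensions.",
    "chosen": "A clear, correct explanation with an example.",
    "rejected": "A vague answer that confuses lists with generators.",
}

print(sorted(preference_entry))  # ['chosen', 'keyword', 'prompt', 'rejected']
```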
The datasets can be generated:
- Synchronously - each dataset entry is generated one by one
- Asynchronously - a batch of dataset entries is generated at once
Note
Asynchronous generation is faster than synchronous generation, but some LLM providers have limits on the number of tokens that can be generated at once.
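The difference can be sketched with plain asyncio (a stub stands in for the LLM call; this is illustrative, not the package's internals):

```python
import asyncio
import time

async def generate_entry(keyword: str) -> dict:
    """Stub for one LLM call; real calls are network-bound, which is
    exactly where concurrent generation pays off."""
    await asyncio.sleep(0.1)  # simulated API latency
    return {"keyword": keyword, "text": f"entry about {keyword}"}

async def generate_batch(keywords: list[str]) -> list[dict]:
    # Asynchronous mode: all entries are requested at once.
    return await asyncio.gather(*(generate_entry(k) for k in keywords))

keywords = ["python", "rust", "go", "java", "c"]
start = time.perf_counter()
entries = asyncio.run(generate_batch(keywords))
elapsed = time.perf_counter() - start
# Five entries complete in roughly one call's latency (~0.1 s),
# instead of ~0.5 s when generated one by one.
print(len(entries), elapsed < 0.4)  # 5 True
```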
More examples with different combinations of LLM API providers and dataset configurations can be found in the examples directory.
Important
Sometimes the generation of the keywords or the dataset entries can fail because the LLM cannot reliably produce a JSON object as output (these failures are handled by the package). It is therefore recommended to use models that are capable of generating structured output (JSON objects). A list of models that can generate JSON objects can be found here.
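One common recovery strategy for such failures (illustrative; not necessarily what the package does internally) is to extract the first JSON object from a reply that wraps it in surrounding prose:

```python
import json
import re

def extract_json_object(reply: str) -> dict:
    """Pull the outermost {...} span out of a model reply and parse it."""
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    return json.loads(match.group(0))

# A reply that is not pure JSON, as weaker models often produce.
reply = "Sure! Here is the result:\n{\"keywords\": [\"python\", \"testing\"]}\nLet me know!"
print(extract_json_object(reply))  # {'keywords': ['python', 'testing']}
```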
Examples of generated synthetic datasets can be found on the SynthGenAI Datasets Collection on Hugging Face Hub.
If you want to contribute to this project and make it better, your help is very welcome. Create a pull request with your changes and I will review it. If you have any questions, open an issue.
This project is licensed under the MIT License - see the LICENCE.txt file for details.
```
.
├── .github/ # GitHub configuration files and workflows
│   ├── workflows/ # GitHub Actions workflows
│   │   ├── build_n_publish.yaml # Build and publish workflow
│   │   ├── docs.yaml # Documentation deployment workflow
│   │   └── uv-ci.yaml # UV package manager CI workflow
│   └── depandabot.yml # Dependabot configuration for automatic dependency updates
├── docs # MkDocs documentation source files
│   ├── assets # Static assets for documentation
│   │   ├── favicon.png # Website favicon
│   │   ├── logo_header.png # Header logo image
│   │   └── logo.svg # SVG logo for the project
│   ├── configurations # Configuration documentation
│   │   ├── dataset_configuration.md # Dataset configuration guide
│   │   ├── dataset_generator_configuration.md # Dataset generator configuration guide
│   │   ├── index.md # Configuration section index
│   │   └── llm_configuration.md # LLM configuration guide
│   ├── contributing # Contribution guidelines
│   │   └── index.md # How to contribute to the project
│   ├── datasets # Dataset type documentation
│   │   ├── index.md # Dataset types overview
│   │   ├── instruction_datasets.md # Instruction datasets documentation
│   │   ├── preference_datasets.md # Preference datasets documentation
│   │   ├── raw_datasets.md # Raw datasets documentation
│   │   ├── sentiment_analysis_datasets.md # Sentiment analysis datasets documentation
│   │   ├── summarization_datasets.md # Summarization datasets documentation
│   │   └── text_classification_datasets.md # Text classification datasets documentation
│   ├── examples # Examples documentation
│   │   └── index.md # Code examples and usage patterns
│   ├── index.md # Main documentation homepage
│   ├── installation # Installation documentation
│   │   └── index.md # Installation guide and requirements
│   ├── llm_providers # LLM provider documentation
│   │   └── index.md # Supported LLM providers guide
│   ├── quick_start # Quick start guide
│   │   └── index.md # Getting started tutorial
│   └── stylesheets # Custom CSS styles for documentation
├── examples # Python example scripts demonstrating usage
│   ├── anthropic_instruction_dataset_example.py # Anthropic API instruction dataset example
│   ├── azure_ai_preference_dataset_example.py # Azure AI preference dataset example
│   ├── azure_summarization_dataset_example.py # Azure summarization dataset example
│   ├── bedrock_raw_dataset_example.py # AWS Bedrock raw dataset example
│   ├── deepseek_instruction_dataset_example.py # DeepSeek instruction dataset example
│   ├── gemini_langfuse_raw_dataset_example.py # Gemini with Langfuse raw dataset example
│   ├── groq_preference_dataset_example.py # Groq preference dataset example
│   ├── huggingface_instruction_dataset_example.py # Hugging Face instruction dataset example
│   ├── mistral_preference_dataset_example.py # Mistral AI preference dataset example
│   ├── ollama_preference_dataset_example.py # Ollama preference dataset example
│   ├── openai_raw_dataset_example.py # OpenAI raw dataset example
│   ├── openrouter_raw_dataset_example.py # OpenRouter raw dataset example
│   ├── sagemaker_summarization_dataset_example.py # AWS SageMaker summarization dataset example
│   ├── vertex_ai_text_classification_dataset_example.py # Google Vertex AI text classification example
│   ├── vllm_sentiment_analysis_dataset_example.py # vLLM sentiment analysis dataset example
│   └── xai_raw_dataset_example.py # xAI raw dataset example
├── synthgenai # Main package source code
│   ├── dataset # Dataset handling modules
│   │   ├── __init__.py # Dataset package initializer
│   │   ├── base_dataset.py # Base dataset class and common functionality
│   │   └── dataset.py # Main dataset implementation
│   ├── dataset_genetors # Dataset generation modules
│   │   ├── __init__.py # Dataset generators package initializer
│   │   ├── classification_dataset_generator.py # Text classification dataset generator
│   │   ├── dataset_generator.py # Base dataset generator class
│   │   ├── instruction_dataset_generator.py # Instruction-following dataset generator
│   │   ├── preference_dataset_generator.py # Preference dataset generator (RLHF)
│   │   ├── raw_dataset_generator.py # Raw text dataset generator
│   │   ├── sentiment_dataset_generator.py # Sentiment analysis dataset generator
│   │   └── summarization_dataset_generator.py # Text summarization dataset generator
│   ├── llm # LLM interaction modules
│   │   ├── __init__.py # LLM package initializer
│   │   ├── base_llm.py # Base LLM class and common functionality
│   │   └── llm.py # Main LLM implementation with LiteLLM integration
│   ├── prompts # Prompt templates for different dataset types
│   │   ├── description_system_prompt # System prompt for generating descriptions
│   │   ├── description_user_prompt # User prompt template for descriptions
│   │   ├── entry_classification_system_prompt # System prompt for classification entries
│   │   ├── entry_instruction_system_prompt # System prompt for instruction entries
│   │   ├── entry_preference_system_prompt # System prompt for preference entries
│   │   ├── entry_raw_system_prompt # System prompt for raw text entries
│   │   ├── entry_sentiment_system_prompt # System prompt for sentiment entries
│   │   ├── entry_summarization_system_prompt # System prompt for summarization entries
│   │   ├── entry_user_prompt # User prompt template for dataset entries
│   │   ├── keyword_system_prompt # System prompt for keyword generation
│   │   ├── keyword_user_prompt # User prompt template for keywords
│   │   ├── labels_system_prompt # System prompt for label generation
│   │   └── labels_user_prompt # User prompt template for labels
│   ├── schemas # Pydantic data models and validation schemas
│   │   ├── __init__.py # Schemas package initializer
│   │   ├── config.py # Configuration data models
│   │   ├── datasets.py # Dataset-related data models
│   │   ├── enums.py # Enumeration definitions
│   │   └── messages.py # Message and response data models
│   ├── utils # Utility functions and helpers
│   │   ├── __init__.py # Utils package initializer
│   │   ├── file_utils.py # File I/O operations and utilities
│   │   ├── json_utils.py # JSON processing utilities
│   │   ├── progress_utils.py # Progress tracking and display utilities
│   │   ├── prompt_utils.py # Prompt processing and formatting utilities
│   │   ├── text_utils.py # Text manipulation and processing utilities
│   │   └── yaml_utils.py # YAML processing utilities
│   ├── __init__.py # Main package initializer and version info
│   └── cli.py # Command-line interface implementation
├── tests # Test suite for the package
│   ├── __init__.py # Tests package initializer
│   ├── conftest.py # pytest configuration and fixtures
│   ├── test_dataset_generator.py # Tests for dataset generators
│   ├── test_dataset.py # Tests for dataset functionality
│   └── test_llm.py # Tests for LLM integration
├── .gitignore # Git ignore rules for excluded files
├── .pre-commit-config.yaml # Pre-commit hooks configuration
├── .python-version # Python version specification for pyenv
├── LICENCE.txt # MIT License file
├── mkdocs.yml # MkDocs documentation configuration
├── pyproject.toml # Python project metadata and dependencies (PEP 518)
├── README.md # Main project documentation and overview
└── uv.lock # UV lockfile for reproducible dependency resolution
```