Dia is a 1.6B parameter text-to-speech model developed by Nari Labs, designed to generate highly realistic dialogues directly from transcripts. It supports emotional conditioning and non-verbal cues such as laughter and coughing.
🧪 Pretrained model checkpoints and inference code are publicly available to accelerate research and experimentation.
💬 English only (for now).
🔬 Demo available on Hugging Face Spaces.
This is a community-enhanced fork of the original nari-labs/dia repository. It focuses on performance optimizations and accessibility for personal hardware.
- 🧠 Quantization (4-bit & INT8): reduces VRAM usage by nearly 50%, so the model runs on GPUs with just 8 GB of VRAM, consuming less than 6 GB in practice (a VRAM-check sketch follows this list).
- ⚡ Flash Attention integration: experimental support for Flash Attention to speed up inference and reduce the memory footprint.
- 🔄 Continued development: ongoing efforts to improve inference speed, memory efficiency, and model accessibility.
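To verify the memory savings on your own GPU, here is a minimal sketch using standard PyTorch memory counters and the fork's loader flags shown in the usage example further below (the exact numbers will depend on your hardware and driver):

```python
import torch

from dia.model import Dia

# Track peak GPU memory while loading the quantized model and generating a clip.
torch.cuda.reset_peak_memory_stats()

model = Dia.from_pretrained(
    "nari-labs/Dia-1.6B",
    compute_dtype="float16",
    quantize_4bit=True,  # fork-specific flag, see the usage example below
)
output = model.generate("[S1] Quick memory check. (laughs)", verbose=True)

peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak VRAM (load + generation): {peak_gb:.1f} GB")
```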
- 🔈 Generate expressive dialogue using `[S1]` and `[S2]` speaker tags.
- 🤖 Realistic non-verbal sounds: `(laughs)`, `(sighs)`, `(coughs)`, etc.
- 🧬 Optional voice cloning via reference audio (see the voice-cloning sketch after this list).
- 🪄 Supports speaker conditioning and output diversity.
- 💡 Simple Python API for generating and saving audio.
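For voice cloning, the upstream Dia examples pass a reference clip together with its transcript to `generate`. The sketch below assumes the argument is named `audio_prompt`; the exact name may differ between versions, so check the repository's voice-cloning example before relying on it:

```python
from dia.model import Dia

model = Dia.from_pretrained("nari-labs/Dia-1.6B", compute_dtype="float16")

# Transcript of the reference clip, followed by the new line to speak in that voice.
clone_transcript = "[S1] This is the transcript of my reference recording."
new_text = "[S1] And this sentence will be spoken in the cloned voice."

# `audio_prompt` is an assumption here; confirm the argument name in your
# installed version of the fork.
output = model.generate(
    clone_transcript + " " + new_text,
    audio_prompt="reference.mp3",
    verbose=True,
)
model.save_audio("cloned.mp3", output)
```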
Install directly with pip:

```bash
pip install git+https://github.com/rzafiamy/dia-q4-bit.git
```

Or clone the repository and run the demo app with uv:

```bash
git clone https://github.com/rzafiamy/dia-q4-bit.git
cd dia-q4-bit
uv run app.py
```

Or using a Python virtual environment:

```bash
git clone https://github.com/rzafiamy/dia-q4-bit.git
cd dia-q4-bit
python -m venv .venv
source .venv/bin/activate
pip install -e .
python app.py
```

ℹ️ Voices vary between runs unless you condition the output with an audio prompt or set a fixed seed (see the seeding sketch below).
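If you want repeatable output without an audio prompt, one generic approach is to fix the relevant random seeds before generating. This is plain PyTorch/NumPy seeding, not a fork-specific API; the project may expose its own seeding utility:

```python
import random

import numpy as np
import torch


def set_seed(seed: int) -> None:
    """Fix Python, NumPy, and PyTorch RNGs for more repeatable sampling."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)


set_seed(42)  # call this before model.generate(...)
```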
```python
from dia.model import Dia


def get_model(quantize=False, quantize_4bit=False):
    print("[INFO] Loading Dia model...")
    return Dia.from_pretrained(
        "nari-labs/Dia-1.6B",
        compute_dtype="float16",
        quantize=quantize,
        quantize_4bit=quantize_4bit,
    )


model = get_model(quantize_4bit=True)  # or quantize=True for INT8

text = "[S1] Dia is fast and memory-efficient! [S2] And it sounds great. (laughs)"
output = model.generate(text, use_torch_compile=True, verbose=True)

model.save_audio("simple.mp3", output)
```

- `quantize=True`: use INT8 quantization.
- `quantize_4bit=True`: use 4-bit quantization.
- Both reduce VRAM usage significantly (under 6 GB), perfect for 8 GB consumer GPUs.
- Only one of these options should be enabled at a time (a loader guard sketch follows this list).
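Since the two flags are mutually exclusive, a small wrapper around the loader avoids passing both by mistake. This is a suggested pattern, not part of the fork's API; `load_dia` is a hypothetical helper name:

```python
from typing import Optional

from dia.model import Dia


def load_dia(mode: Optional[str] = None) -> Dia:
    """Load Dia with at most one quantization mode: None, "int8", or "4bit"."""
    if mode not in (None, "int8", "4bit"):
        raise ValueError(f"Unknown quantization mode: {mode!r}")
    return Dia.from_pretrained(
        "nari-labs/Dia-1.6B",
        compute_dtype="float16",
        quantize=(mode == "int8"),
        quantize_4bit=(mode == "4bit"),
    )


model = load_dia("4bit")
```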
📢 CLI and PyPI package coming soon.
Tested on PyTorch 2.0+ and CUDA 12.6. CPU support coming soon.
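A quick way to confirm your environment matches these requirements, using standard PyTorch calls (nothing fork-specific):

```python
import torch

print("PyTorch version:", torch.__version__)       # expect 2.0 or newer
print("CUDA available:", torch.cuda.is_available())
print("CUDA build:", torch.version.cuda)            # e.g. "12.6"
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```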
| Precision | Real-time factor (w/ compile) | Real-time factor (w/o compile) | VRAM Usage |
|---|---|---|---|
| `float16` | x2.2 | x1.3 | ~10 GB |
| `bfloat16` | x2.1 | x1.5 | ~10 GB |
| `float32` | x1.0 | x0.9 | ~13 GB |
| `int8` / 4-bit | ✅ Efficient (forked) | ✅ Efficient (forked) | < 6 GB |
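The real-time factor compares the duration of the generated audio to the wall-clock time spent generating it. A rough way to reproduce the measurement, assuming `generate` returns a waveform array at Dia's 44.1 kHz sample rate (your results will vary with hardware and precision):

```python
import time

from dia.model import Dia

model = Dia.from_pretrained("nari-labs/Dia-1.6B", compute_dtype="float16")
text = "[S1] Benchmarking the real-time factor. [S2] Let's see how fast it is."

start = time.perf_counter()
output = model.generate(text, use_torch_compile=True, verbose=True)
elapsed = time.perf_counter() - start

# Assumes `output` is a 1-D waveform sampled at 44.1 kHz (the Descript Audio Codec rate).
audio_seconds = len(output) / 44100
print(f"Real-time factor: x{audio_seconds / elapsed:.2f}")
```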
Apache License 2.0 – see the LICENSE file for full details.
This model is for research and educational use only. By using it, you agree not to:
- Mimic real identities without consent.
- Generate misleading or harmful content.
- Use it for illegal or malicious purposes.
- 4-bit / INT8 quantization support.
- Flash Attention integration.
- CPU inference support.
- Docker support (incl. ARM/macOS).
- Public PyPI release & CLI.
- Larger model versions.
This fork is maintained by the open-source community.
PRs are welcome! Join us on Discord to collaborate or share ideas.
- Nari Labs for the original Dia model.
- Google TPU Research Cloud for computing resources.
- Research inspiration from: SoundStorm, Parakeet, and Descript Audio Codec.
- Hugging Face for hosting weights and demo space.