LTXV Model

LTXV: The first generation of the LTX video model

LTXV introduced core creative control and conditioning capabilities. LTX-2 is the current model, built for higher fidelity and production workflows.

Introducing LTXV

LTXV is the foundational model in the LTX video generation family, combining fast inference, strong temporal consistency, and high visual fidelity for text-to-video and image-to-video workflows.

Real-time performance with cinematic quality

Efficient spatiotemporal modeling enables fast generation without sacrificing motion coherence or detail.

Creative control & customization

Precise control over motion and depth, with support for LoRA and IC-LoRA to apply custom styles, brand identity, and motion behavior.

Open-source and scalable by design

A highly optimized, open architecture built to run efficiently on both high-end and consumer GPUs.

Model Evolution

The LTX video models build on each other, offering increasingly advanced capabilities for creative exploration and production-ready workflows.

LTXV

  • First-generation LTX video model built for speed, realism, and creative control
  • Creative exploration and flexible workflows
  • Video-only generation (no integrated audio)
  • Creative control through LoRA and editing tools
  • Legacy open-source model, available in LTX Studio

LTX-2

  • Advanced LTX model delivering synchronized audio–video generation and high-fidelity output
  • Production-ready video generation with extended capabilities
  • Native, synchronized audio and video generation
  • LoRA support with enhanced style features and camera control
  • Current model, available as open source and through API access, the Playground, and LTX Studio

LTXV vs LTX-2: Model Comparison

Compare the foundational creative capabilities of LTXV with the advanced, production-ready features of LTX-2, the model available and supported today.

|  | LTXV | LTX-2 |
|---|---|---|
| Modalities | Video only | Video + Audio (jointly generated) |
| Primary Task | Text-to-Video, Image-to-Video | Text-to-Audio+Video (T2AV) |
| Architecture | Single-stream diffusion transformer integrated with Video-VAE | Asymmetric dual-stream diffusion transformer (video stream + audio stream) with separate VAEs for audio and video |
| Transformer Streams | One unified stream | Two streams (video and audio) |
| Dev Model |  |  |
| 8-step Distilled model |  |  |
| 8 bit quantization | FP8 | FP8 |
|  | NVFP4 | NVFP4 |
| Latent Space Design | Deeply compressed spatiotemporal latent space | Decoupled latent spaces (separate VAEs for video and audio); deeply compressed video latent |
| Video VAE | Custom Video-VAE with integrated patchifying; decoder performs both latent-to-pixel conversion and final denoising in pixel space | Spatiotemporal causal Video-VAE |
| Audio VAE | Not applicable | Causal Audio-VAE operating on mel spectrograms |
| Compression / Tokenization | 1:192 compression, 32×32×8 pixels per token | Audio tokens ≈ 1/25 s per token, 128-dim latent vectors (video compression not restated) |
| Denoising Strategy | Transformer denoises in latent space; VAE decoder performs final denoising in pixel space | Dual-stream DiT jointly denoises audio and video latents |
| Cross-Modal Interaction | Not applicable | Bidirectional audio-video cross-attention throughout the model |
| Positional Encoding | 3D RoPE for video | 3D RoPE for video, 1D temporal RoPE for audio; temporal RoPE used for cross-modal attention |
| Text Conditioning | T5-XXL standard text encoding | Multilingual text encoder (Gemma 3-12B) with multi-layer feature extraction and thinking tokens |
| Classifier-Free Guidance (CFG) | Standard | Modality-aware CFG with separate text and cross-modal guidance scales |
| Supported Outputs | Silent video | Video with synchronized speech, ambient audio, and foley |
| Inference Capabilities | Faster-than-real-time video generation; supports text-to-video and image-to-video (trained simultaneously) | Multi-scale, multi-tile inference up to 4K audiovisual output |
| Maximum Duration | 5 seconds base generation (up to 60 seconds with Temporal Expansion) | Up to 20 seconds of synchronized audiovisual content |
| Open Source | Yes | Yes |
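
The compression and tokenization figures in the comparison can be sanity-checked with a little arithmetic. The sketch below is illustrative only and is not taken from the LTX codebase: the helper names are hypothetical, it assumes 3-channel RGB input, and it reads "32×32×8 pixels per token" as a 32 × 32 pixel patch spanning 8 frames. It also ignores details such as how a causal VAE handles the first frame.

```python
import math

# Figures quoted in the comparison.
PATCH_W, PATCH_H, PATCH_T = 32, 32, 8  # pixels (and frames) covered by one LTXV video token
COMPRESSION_RATIO = 192                # 1:192 overall compression
AUDIO_TOKENS_PER_SECOND = 25           # about 1/25 s of audio per LTX-2 audio token

RGB_CHANNELS = 3  # assumption: 3-channel RGB input


def latent_values_per_token() -> int:
    """Values kept per video token, as implied by the 1:192 ratio."""
    raw_values = PATCH_W * PATCH_H * PATCH_T * RGB_CHANNELS
    return raw_values // COMPRESSION_RATIO


def video_token_count(width: int, height: int, frames: int) -> int:
    """Approximate number of video tokens, rounding partial patches up."""
    return (
        math.ceil(width / PATCH_W)
        * math.ceil(height / PATCH_H)
        * math.ceil(frames / PATCH_T)
    )


def audio_token_count(seconds: float) -> int:
    """Approximate number of audio tokens for a clip of the given length."""
    return round(seconds * AUDIO_TOKENS_PER_SECOND)


if __name__ == "__main__":
    print(latent_values_per_token())          # values per video token
    print(video_token_count(768, 512, 120))   # tokens for a 768x512, 120-frame clip
    print(audio_token_count(20))              # tokens for 20 s of audio
```

Under these assumptions, one token stands in for 32 × 32 × 8 × 3 = 24,576 raw values, so a 1:192 ratio leaves 128 values per token. A 768×512 clip of 120 frames maps to 24 × 16 × 15 = 5,760 video tokens, and 20 seconds of audio maps to roughly 500 audio tokens.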

Generation Flows in LTX-2

Two flows, optimized for different production needs

Fast

Built for speed and tight feedback loops. Choose Fast Flow when rapid iteration matters more than maximum visual detail.

Technical characteristics:

  • Resolutions: 1080p, 1440p, 4K
  • Duration: up to 20 seconds
  • Lower compute load and faster render times

Pro

High-fidelity generation for stable, detailed results. Choose Pro Flow when visual quality and consistency are more important than render speed.

Technical characteristics:

  • Resolutions: 1080p, 1440p, 4K
  • FPS: 25 / 50
  • Duration: up to 20 seconds
  • Enhanced detail and stability across extended sequences
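
The limits listed for the two flows can be captured in a small pre-flight check. This is a minimal sketch, not the LTX API: `GenerationRequest`, `validate`, and `frame_count` are hypothetical names, and because FPS options are listed only for Pro, the Fast flow's frame rate is left unchecked here.

```python
from dataclasses import dataclass

# Limits copied from the flow descriptions above. All names in this block are
# hypothetical and do not correspond to any official LTX API.
RESOLUTIONS = {"1080p", "1440p", "4K"}
MAX_DURATION_S = 20
PRO_FPS = {25, 50}


@dataclass(frozen=True)
class GenerationRequest:
    flow: str           # "fast" or "pro"
    resolution: str     # one of RESOLUTIONS
    duration_s: float   # clip length in seconds
    fps: int = 25       # assumption: FPS options are documented only for Pro


def validate(req: GenerationRequest) -> list[str]:
    """Return a list of problems; an empty list means the request fits the limits."""
    problems = []
    if req.flow not in {"fast", "pro"}:
        problems.append(f"unknown flow: {req.flow!r}")
    if req.resolution not in RESOLUTIONS:
        problems.append(f"unsupported resolution: {req.resolution!r}")
    if not 0 < req.duration_s <= MAX_DURATION_S:
        problems.append(f"duration must be in (0, {MAX_DURATION_S}] seconds")
    if req.flow == "pro" and req.fps not in PRO_FPS:
        problems.append(f"Pro flow supports {sorted(PRO_FPS)} fps")
    return problems


def frame_count(req: GenerationRequest) -> int:
    """Nominal frame count (duration x fps); actual output may differ."""
    return round(req.duration_s * req.fps)


if __name__ == "__main__":
    draft = GenerationRequest(flow="fast", resolution="1080p", duration_s=8)
    final = GenerationRequest(flow="pro", resolution="4K", duration_s=20, fps=50)
    for req in (draft, final):
        print(req.flow, validate(req), frame_count(req))
```

A typical pattern would be to iterate on prompts with a short Fast request, then re-run the chosen prompt as a Pro request at the target resolution and frame rate. At the 20-second limit, a Pro clip nominally contains 500 frames at 25 fps or 1,000 frames at 50 fps.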