Lang2Motion is a framework for language-guided point trajectory generation that aligns motion manifolds with a joint embedding space. Unlike prior work focused on human motion or video synthesis, it generates explicit trajectories for arbitrary objects, using motion extracted from real-world videos via point tracking.
Lang2Motion learns trajectory representations through dual supervision: textual motion descriptions and rendered trajectory visualizations, both mapped through CLIP's frozen encoders. Our transformer-based auto-encoder supports multiple decoder architectures including autoregressive and MLP variants.
- Text-to-Trajectory Retrieval: 34.2% Recall@1, outperforming video-based methods by 12.5 points
- Motion Accuracy: 33-52% improvement (12.4 ADE vs 18.3-25.3) compared to video generation baselines
- Action Recognition: 88.3% Top-1 accuracy on human actions despite training on diverse object motions
- Applications: Style transfer, semantic interpolation, and latent-space editing through CLIP-aligned representations
- Encoder: Transformer-based motion encoder with point trajectory inputs
- Decoder Options:
  - Transformer Autoregressive: Sequential generation with causal attention
  - MLP: Direct mapping from latent to trajectories
- CLIP Integration: Dual supervision through text and trajectory visualizations
- Loss Functions: Reconstruction, velocity consistency, and cosine-similarity alignment (see the sketch below)
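A compact way to see how these pieces fit together is to write the training objective directly. The snippet below is a minimal sketch, assuming hypothetical `encoder` and `decoder` modules with a 512-dimensional latent (to match CLIP ViT-B/32 embeddings) and OpenAI's `clip` package; it is not the repository's training code, and loss weights are omitted.

```python
# Minimal sketch of the dual-supervision objective (not the repo's actual code).
# Assumes: `encoder` maps (B, T, N, 2) trajectories to a 512-d latent so it can be
# compared with CLIP ViT-B/32 embeddings; `decoder` maps the latent back to (B, T, N, 2).
import torch
import torch.nn.functional as F
import clip

clip_model, _ = clip.load("ViT-B/32")
clip_model.eval()
for p in clip_model.parameters():        # CLIP encoders stay frozen
    p.requires_grad_(False)

def lang2motion_loss(traj, text_tokens, traj_render, encoder, decoder):
    """traj: (B, T, N, 2) point trajectories; text_tokens: clip.tokenize output;
    traj_render: (B, 3, 224, 224) rendered trajectory visualizations."""
    z = encoder(traj)                    # (B, 512) latent motion code
    recon = decoder(z)                   # reconstructed trajectories

    # Reconstruction and velocity-consistency terms
    l_rec = F.mse_loss(recon, traj)
    l_vel = F.mse_loss(recon[:, 1:] - recon[:, :-1], traj[:, 1:] - traj[:, :-1])

    # Cosine-similarity alignment with CLIP's joint embedding space
    with torch.no_grad():
        e_txt = F.normalize(clip_model.encode_text(text_tokens).float(), dim=-1)
        e_img = F.normalize(clip_model.encode_image(traj_render).float(), dim=-1)
    z_n = F.normalize(z, dim=-1)
    l_align = (1 - (z_n * e_txt).sum(-1)).mean() + (1 - (z_n * e_img).sum(-1)).mean()

    return l_rec + l_vel + l_align       # loss weights omitted for brevity
```

The autoregressive decoder variant would replace the single `decoder(z)` call with a step-by-step rollout under causal attention; the alignment terms are unchanged.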
Trajectory Generation from Initial Grid and Latent Space Interpolation
Given identical text descriptions, Lang2Motion generates different motion interpretations based on initial grid placement and demonstrates semantic interpolation in CLIP's joint embedding space.
Left: From a dancing-pose grid, the model emphasizes panda appearance ("dancing panda").
Right: From a panda's grid, the model emphasizes dancing motion ("dancing panda").
Key Insights:
- Given identical text "dancing panda", Lang2Motion generates different motion interpretations based on initial grid placement
- Initial grids use automatically retrieved masks; initial video frames shown for visualization only
- Demonstrates semantic interpolation in CLIP's joint embedding space
- Smooth transitions between different motion styles while maintaining semantic coherence (see the interpolation sketch below)
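To make the interpolation behavior concrete, the sketch below interpolates spherically between two latent motion codes; `encode_text_to_latent` and `decode` are hypothetical placeholders standing in for the repository's API, which may differ.

```python
# Spherical interpolation (slerp) between two latent motion codes.
# `encode_text_to_latent` and `decode` are hypothetical placeholders.
import torch

def slerp(z0, z1, t):
    """Interpolate along the great circle between two 1-D latent vectors."""
    z0n, z1n = z0 / z0.norm(), z1 / z1.norm()
    omega = torch.acos((z0n * z1n).sum().clamp(-1 + 1e-7, 1 - 1e-7))
    return (torch.sin((1 - t) * omega) * z0 + torch.sin(t * omega) * z1) / torch.sin(omega)

# z_a = encode_text_to_latent("a person walking forward")
# z_b = encode_text_to_latent("dancing panda")
# for t in torch.linspace(0.0, 1.0, 8):
#     traj = decode(slerp(z_a, z_b, t))  # trajectories morph smoothly between the prompts
```

Because the latents are aligned with CLIP's joint embedding space, intermediate points are intended to stay semantically meaningful rather than degenerating into noise.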
```bash
# Clone the repository
git clone git@bitbucket.org:aclabneu/lang2motion.git
cd lang2motion

# Install dependencies
conda env create -f environment.yml
conda activate lang2motion
```

```bash
# Train on MeViS dataset
python train_pointclip.py --dataset MeViS --batch_size 32 --epochs 200
```

```bash
# Generate motion from text
python generate.py --text "a person walking forward" --output output.npy
```

Lang2Motion uses point trajectories extracted from real-world videos:
- Source: Diverse video datasets with object and human motion
- Tracking: Point trajectories extracted via CoTracker3 (see the sketch after this list)
- Supervision: Text descriptions and rendered trajectory visualizations
- Scope: Arbitrary objects, not limited to human motion
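For reference, the tracking step can be reproduced with CoTracker3's public torch.hub entry point. The snippet below is a minimal sketch: the random tensor stands in for a real video clip, and the grid size is an illustrative choice, not the setting used to build the dataset; check the facebookresearch/co-tracker README for the exact hub entry and options.

```python
# Extract point trajectories with CoTracker3 (minimal sketch).
import torch

video = torch.randn(1, 48, 3, 384, 512)   # (B, T, C, H, W) stand-in for a real clip
cotracker = torch.hub.load("facebookresearch/co-tracker", "cotracker3_offline")

with torch.no_grad():
    # Track a regular grid of query points across the whole clip
    pred_tracks, pred_visibility = cotracker(video, grid_size=20)

# pred_tracks: (B, T, N, 2) point trajectories -> Lang2Motion encoder input
# pred_visibility: per-point visibility flags
```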
- Text-to-Trajectory Retrieval: 34.2% Recall@1
- Motion Accuracy: 12.4 ADE (vs 18.3-25.3 for video baselines)
- Action Recognition: 88.3% Top-1 accuracy (cross-domain transfer)
- Applications: Style transfer, semantic interpolation, latent-space editing
```bibtex
@article{lang2motion2025,
  title={Lang2Motion: Language-Guided Point Trajectory Generation},
  author={Bishoy Galoaa and Xiangyu Bai and Sarah Ostadabbas},
  journal={arXiv preprint},
  year={2025}
}
```

MIT License - see LICENSE file for details.