
Lang2Motion

Lang2Motion is a framework for language-guided point trajectory generation by aligning motion manifolds with joint embedding spaces. Unlike prior work focusing on human motion or video synthesis, we generate explicit trajectories for arbitrary objects using motion extracted from real-world videos via point tracking.

Overview

Lang2Motion learns trajectory representations through dual supervision: textual motion descriptions and rendered trajectory visualizations, both mapped through CLIP's frozen encoders. Our transformer-based auto-encoder supports multiple decoder architectures including autoregressive and MLP variants.

Key Results

  • Text-to-Trajectory Retrieval: 34.2% Recall@1, outperforming video-based methods by 12.5 points
  • Motion Accuracy: 33-52% error reduction over video generation baselines (12.4 ADE vs. 18.3-25.3; lower is better)
  • Action Recognition: 88.3% Top-1 accuracy on human actions despite training on diverse object motions
  • Applications: Style transfer, semantic interpolation, and latent-space editing through CLIP-aligned representations
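ADE (average displacement error) above is the mean Euclidean distance between predicted and ground-truth point positions, averaged over all points and timesteps. A minimal sketch, assuming trajectories are stored as `(num_points, num_timesteps, 2)` arrays (the repository's exact layout may differ):

```python
import numpy as np

def average_displacement_error(pred, gt):
    """ADE: mean Euclidean distance between predicted and ground-truth
    point positions, averaged over all points and timesteps.

    pred, gt: arrays of shape (num_points, num_timesteps, 2).
    """
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

# Toy example: a prediction offset by 3 pixels in x everywhere
gt = np.zeros((4, 10, 2))
pred = gt + np.array([3.0, 0.0])
print(average_displacement_error(pred, gt))  # 3.0
```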

Architecture

  • Encoder: Transformer-based motion encoder with point trajectory inputs
  • Decoder Options:
    • Transformer Autoregressive: Sequential generation with causal attention
    • MLP: Direct mapping from latent to trajectories
  • CLIP Integration: Dual supervision through text and trajectory visualizations
  • Loss Functions: Reconstruction, velocity consistency, and cosine similarity alignment
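The three loss terms above could be combined as in the following sketch. The weights `w_vel` and `w_align` are hypothetical placeholders, not values from the paper, and the real training code operates on learned embeddings rather than raw arrays:

```python
import numpy as np

def lang2motion_loss(recon, target, motion_emb, clip_emb,
                     w_vel=0.5, w_align=1.0):
    """Illustrative combination of the three loss terms: reconstruction,
    velocity consistency, and cosine similarity alignment.

    recon, target: (num_points, num_timesteps, 2) trajectories.
    motion_emb, clip_emb: (dim,) embedding vectors.
    """
    # Reconstruction: mean squared error on point positions
    l_rec = np.mean((recon - target) ** 2)
    # Velocity consistency: match frame-to-frame displacements
    l_vel = np.mean((np.diff(recon, axis=1) - np.diff(target, axis=1)) ** 2)
    # Alignment: 1 - cosine similarity with the CLIP embedding
    cos = motion_emb @ clip_emb / (
        np.linalg.norm(motion_emb) * np.linalg.norm(clip_emb) + 1e-8)
    l_align = 1.0 - cos
    return l_rec + w_vel * l_vel + w_align * l_align
```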

🎨 Visualizations

Framework Overview

[Figure: Lang2Motion concept]

Method Architecture

[Figure: Lang2Motion method architecture]

Teaser Results

[Figure: Lang2Motion teaser results]

Motion Interpolation and Trajectory Generation

[Figure: Trajectory generation from an initial grid, and latent-space interpolation]

Given identical text descriptions, Lang2Motion generates different motion interpretations based on initial grid placement and demonstrates semantic interpolation in CLIP's joint embedding space.


Left: From a dancing pose grid, the model emphasizes the panda's appearance.

Right: From a panda grid, the model emphasizes the dancing motion.

Key Insights:

  • Given identical text "dancing panda", Lang2Motion generates different motion interpretations based on initial grid placement
  • Initial grids use automatically retrieved masks; initial video frames shown for visualization only
  • Demonstrates semantic interpolation in CLIP's joint embedding space
  • Smooth transition between different motion styles while maintaining semantic coherence
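The repository does not specify how latent codes are interpolated; spherical interpolation (slerp) is one common choice for CLIP-like embedding spaces, since it keeps intermediate codes on the unit hypersphere. A sketch under that assumption:

```python
import numpy as np

def slerp(z0, z1, t):
    """Spherical interpolation between two latent codes, keeping
    intermediate points on the unit hypersphere."""
    z0 = z0 / np.linalg.norm(z0)
    z1 = z1 / np.linalg.norm(z1)
    omega = np.arccos(np.clip(z0 @ z1, -1.0, 1.0))
    if omega < 1e-6:  # nearly identical codes: fall back to lerp
        return (1 - t) * z0 + t * z1
    return (np.sin((1 - t) * omega) * z0
            + np.sin(t * omega) * z1) / np.sin(omega)

# Interpolate between two hypothetical motion latents
z_a, z_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
mid = slerp(z_a, z_b, 0.5)  # stays on the unit circle
```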

🚀 Quick Start

Installation

# Clone the repository
git clone git@bitbucket.org:aclabneu/lang2motion.git
cd lang2motion

# Install dependencies
conda env create -f environment.yml
conda activate lang2motion

Training

# Train on MeViS dataset
python train_pointclip.py --dataset MeViS --batch_size 32 --epochs 200

Generation

# Generate motion from text
python generate.py --text "a person walking forward" --output output.npy
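The layout of the saved `output.npy` is not documented here; assuming the common `(num_points, num_frames, 2)` convention for point trajectories, the generated file can be inspected with a sketch like this (`summarize_trajectories` is a hypothetical helper, not part of the repo):

```python
import numpy as np

def summarize_trajectories(path):
    """Load generated trajectories and report simple motion statistics.

    Layout assumed: (num_points, num_frames, 2) pixel coordinates.
    """
    traj = np.load(path)
    num_points, num_frames, _ = traj.shape
    # Frame-to-frame displacement of every point, summed into path lengths
    step_lengths = np.linalg.norm(np.diff(traj, axis=1), axis=-1)
    return {
        "points": num_points,
        "frames": num_frames,
        "mean_path_length": float(step_lengths.sum(axis=1).mean()),
    }
```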

Dataset

Lang2Motion uses point trajectories extracted from real-world videos:

  • Source: Diverse video datasets with object and human motion
  • Tracking: Point trajectories extracted via CoTracker3
  • Supervision: Text descriptions and rendered trajectory visualizations
  • Scope: Arbitrary objects, not limited to human motion
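One way the "rendered trajectory visualizations" could be produced is by rasterizing point tracks into an image that CLIP's image encoder can consume. A minimal sketch with an assumed rendering scheme (brightness encoding time); the actual repository may render colored polylines instead:

```python
import numpy as np

def render_trajectories(tracks, size=224):
    """Rasterize point tracks into a (size, size) image by marking each
    visited pixel, with later timesteps drawn brighter.

    tracks: (num_points, num_timesteps, 2) coordinates in [0, 1].
    """
    img = np.zeros((size, size), dtype=np.float32)
    num_t = tracks.shape[1]
    for t in range(num_t):
        xy = np.clip((tracks[:, t] * (size - 1)).astype(int), 0, size - 1)
        img[xy[:, 1], xy[:, 0]] = (t + 1) / num_t  # brightness encodes time
    return img
```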

Results

  • Text-to-Trajectory Retrieval: 34.2% Recall@1
  • Motion Accuracy: 12.4 ADE (vs 18.3-25.3 for video baselines)
  • Action Recognition: 88.3% Top-1 accuracy (cross-domain transfer)
  • Applications: Style transfer, semantic interpolation, latent-space editing
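Recall@1 here measures the fraction of text queries whose top-ranked trajectory (by cosine similarity in the joint embedding space) is the ground-truth match. An illustrative sketch, assuming paired rows of text and trajectory embeddings:

```python
import numpy as np

def recall_at_1(text_emb, traj_emb):
    """Fraction of text queries whose nearest trajectory embedding
    (by cosine similarity) is the ground-truth match at the same index.

    text_emb, traj_emb: (n, dim) arrays; row i of each is a matched pair.
    """
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    m = traj_emb / np.linalg.norm(traj_emb, axis=1, keepdims=True)
    sims = t @ m.T  # (n, n) text-to-trajectory similarity matrix
    return float((sims.argmax(axis=1) == np.arange(len(sims))).mean())
```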

Citation

@article{lang2motion2025,
  title={Lang2Motion: Language-Guided Point Trajectory Generation},
  author={Galoaa, Bishoy and Bai, Xiangyu and Ostadabbas, Sarah},
  journal={arXiv preprint},
  year={2025}
}

License

MIT License - see LICENSE file for details.
