Stanford
- Stanford, CA
- sayands.github.io
- @debsarkar_sayan
- @sayandsarkar.bsky.social
Stars
Some useful tools for my own research (e.g., visualization of SLAM trajectories, code to generate sliding-bar comparison videos)
[CVPR 2026] WildPose: A Unified Framework for Robust Pose Estimation in the Wild
[Preprint] Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale
[CVPR 2025 Best Paper Award] VGGT: Visual Geometry Grounded Transformer
Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models
Native and Compact Structured Latents for 3D Generation
📰 Must-read papers and blogs on LLM-based Long Context Modeling 🔥
[ECCV 2024] Improving 2D Feature Representations by 3D-Aware Fine-Tuning
Volume Transformer: Revisiting Vanilla Transformers for 3D Scene Understanding
Inference, evaluation and analysis code for STEVO-Bench
State-of-the-art Image & Video CLIP, Multimodal Large Language Models, and More!
Official Implementation of Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction
Code repository for "DUNE: Distilling a Universal Encoder from Heterogeneous 2D and 3D Teachers"
Strategic research thinking agents for Claude Code — idea evaluation, project triage, and structured brainstorming. Helps you decide which papers to write, not just how to write them.
Official implementation of "Repurposing Geometric Foundation Models for Multi-view Diffusion"
[ICLR '26 Oral] Official repository of the paper "AnyUp: Universal Feature Upsampling".
[CVPR 2024] Probing the 3D Awareness of Visual Foundation Models
[ICLR 2026] Official implementation of "Splat and Distill: Augmenting Teachers with Feed-Forward 3D Reconstruction For 3D-Aware Distillation"
TIPSv2 (CVPR'26) and TIPS (ICLR'25)
🪨 why use many token when few token do trick — Claude Code skill that cuts 65% of tokens by talking like caveman
[ICLR'26] This repository is the implementation of "3D Aware Region Prompted Vision Language Model"
[CVPR 2025] Code for the paper "Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding"
[ECCV 2024 - Oral] ACE0 is a learning-based structure-from-motion approach that estimates camera parameters for sets of images by learning a multi-view consistent, implicit scene representation.
Code and data for UniEgoMotion (ICCV 2025)
A simple, unified multimodal models training engine. Lean, flexible, and built for hacking at scale.