TU-Darmstadt, Darmstadt
http://akshitac8.github.io/ - @akshitac8
Stars
[CVPR'25] Official implementation of the paper "Contextual AD Narration with Interleaved Multimodal Sequence"
[ECCV 2024] Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs
The official code of "Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning"
PyTorch implementation of Audio Flamingo: Series of Advanced Audio Understanding Language Models
[ICCV 2025, Oral] Token Activation Map to Visually Explain Multimodal LLMs
A minimal codebase for finetuning large multimodal models, supporting llava-1.5/1.6, llava-interleave, llava-next-video, llava-onevision, llama-3.2-vision, qwen-vl, qwen2-vl, phi3-v etc.
🔥🔥 First-ever hour-scale video understanding models
This repository contains the official implementation of "FastVLM: Efficient Vision Encoding for Vision Language Models" - CVPR 2025
Qwen3-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud.
[ICML 2025] Official PyTorch implementation of LongVU
Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis
Story-Based Retrieval with Contextual Embeddings. The largest freely available movie video dataset. [ACCV'20]
Code for "AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling"
SpeechGPT Series: Speech Large Language Models
LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve speech capabilities at the GPT-4o level.
Official PyTorch implementation of the paper "Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs"
[ACCV 2024] Official Implementation of "AutoAD-Zero: A Training-Free Framework for Zero-Shot Audio Description". Junyu Xie, Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, Andrew Zisserman
🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
[ICCV 2025] Official Implementation of "Shot-by-Shot: Film-Grammar-Aware Training-Free Audio Description Generation". Junyu Xie, Tengda Han, Max Bain, Arsha Nagrani, Eshika Khandelwal, Gül Varol, W…
[ICLR 2024, Spotlight] Sentence-level Prompts Benefit Composed Image Retrieval
OpenTAD is an open-source temporal action detection (TAD) toolbox based on PyTorch.
A suite of methods for modeling video with Mamba
Official Implementation of LADS (Latent Augmentation using Domain descriptionS)
This repo contains the projects 'Virtual Normal', 'DiverseDepth', and '3D Scene Shape', which address monocular depth estimation and 3D scene reconstruction from a single image.
Official PyTorch implementation of StyleGAN3
Official Code for DragGAN (SIGGRAPH 2023)
Code for the paper "Temporal Action Localization with Enhanced Instant Discriminability"