[CVPR'25] Official implementation for paper - Contextual AD Narration with Interleaved Multimodal Sequence

Python 8 Updated May 1, 2025

[ECCV 2024] Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs

Python 163 10 Updated Nov 6, 2024

The official code of "Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning"

Python 81 1 Updated Oct 15, 2025
Python 805 48 Updated Jul 8, 2024

PyTorch implementation of Audio Flamingo: Series of Advanced Audio Understanding Language Models

997 84 Updated Dec 15, 2025

[ICCV25 Oral] Token Activation Map to Visually Explain Multimodal LLMs

Python 176 5 Updated Dec 14, 2025

A minimal codebase for finetuning large multimodal models, supporting llava-1.5/1.6, llava-interleave, llava-next-video, llava-onevision, llama-3.2-vision, qwen-vl, qwen2-vl, phi3-v etc.

Python 365 43 Updated Dec 11, 2025

🔥🔥First-ever hour-scale video understanding models

Python 611 41 Updated Jul 14, 2025

This repository contains the official implementation of "FastVLM: Efficient Vision Encoding for Vision Language Models" - CVPR 2025

Python 7,203 540 Updated May 5, 2025

Qwen3-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud.

Jupyter Notebook 18,302 1,596 Updated Jan 30, 2026

[ICML 2025] Official PyTorch implementation of LongVU

Python 422 35 Updated May 8, 2025

Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis

HTML 15 3 Updated May 28, 2025

Story-Based Retrieval with Contextual Embeddings. Largest freely available movie video dataset. [ACCV'20]

Python 194 29 Updated Sep 21, 2022

Code for "AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling"

Python 870 73 Updated Aug 27, 2024

SpeechGPT Series: Speech Large Language Models

Python 1,403 95 Updated Jul 22, 2024

LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve speech capabilities at the GPT-4o level.

Python 3,123 222 Updated May 19, 2025

Official PyTorch implementation of the paper "Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs"

Python 89 15 Updated Jun 6, 2025

[ACCV 2024] Official Implementation of "AutoAD-Zero: A Training-Free Framework for Zero-Shot Audio Description". Junyu Xie, Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, Andrew Zisserman

Python 28 1 Updated Jan 28, 2025

🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production

Python 44,543 5,962 Updated Aug 16, 2024

[ICCV 2025] Official Implementation of "Shot-by-Shot: Film-Grammar-Aware Training-Free Audio Description Generation". Junyu Xie, Tengda Han, Max Bain, Arsha Nagrani, Eshika Khandelwal, Gül Varol, W…

Python 20 3 Updated Jul 26, 2025

[ICLR 2024, Spotlight] Sentence-level Prompts Benefit Composed Image Retrieval

Python 92 8 Updated Apr 16, 2024

OpenTAD is an open-source temporal action detection (TAD) toolbox based on PyTorch.

Python 317 25 Updated Apr 29, 2025

The suite of modeling video with Mamba

Python 289 30 Updated May 14, 2024

Official Implementation of LADS (Latent Augmentation using Domain descriptionS)

Python 51 9 Updated Apr 18, 2023

This repo contains the projects 'Virtual Normal', 'DiverseDepth', and '3D Scene Shape'. They address monocular depth estimation and 3D scene reconstruction from a single image.

Python 1,110 150 Updated Nov 10, 2023

Official PyTorch implementation of StyleGAN3

Python 6,892 1,235 Updated Sep 12, 2023

3D generation on ImageNet [ICLR 2023]

Python 214 8 Updated May 23, 2023

Official Code for DragGAN (SIGGRAPH 2023)

Python 35,970 3,441 Updated May 18, 2024
Python 95 7 Updated Sep 23, 2023

Code for the paper, Temporal Action Localization with Enhanced Instant Discriminability

Python 28 1 Updated Mar 25, 2024