Stars
[EMNLP 2024🔥] Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Code for the AVLnet (Interspeech 2021) and Cascaded Multilingual (Interspeech 2021) papers.
Comparatively fine-tuning pretrained BERT models on downstream text classification tasks with different architectural configurations in PyTorch.
PySlowFast: video understanding codebase from FAIR for reproducing state-of-the-art video models.
Code for our CVPR 2018 paper "Learning Latent Super-Events to Detect Multiple Activities in Videos"
We ranked 1st in the DSTC8 Audio-Visual Scene-Aware Dialog competition. This is the source code for our IEEE/ACM TASLP (AAAI2020-DSTC8-AVSD) paper "Bridging Text and Video: A Universal Multimodal Tra…
A repository of common methods, datasets, and tasks for video research
Official PyTorch implementation of "OmniNet: A unified architecture for multi-modal multi-task learning" | Authors: Subhojeet Pramanik, Priyanka Agrawal, Aman Hussain
Code for the paper "VisualBERT: A Simple and Performant Baseline for Vision and Language"
Implementation for "Large-scale Pretraining for Visual Dialog" https://arxiv.org/abs/1912.02379
🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.
Starter code in PyTorch for the Visual Dialog challenge
Activity Recognition Algorithms for the Charades Dataset