GitHub - BASHLab/RAVEN

RAVEN: Query-Guided Representation Alignment for Question Answering over Audio, Video, Embedded Sensors, and Natural Language

Project Page: https://bashlab.github.io/raven_project/

🚀 Main Results

Comparison of RAVEN and prior MLLMs on exocentric open-ended video QA (MSVD-QA, MSRVTT-QA, ActivityNet-QA) and audio-visual QA (AVSD, MUSIC-QA) benchmarks. Best and second-best scores are in $\textbf{Bold}$ and $\underline{\text{underline}}$. $^*$ indicates scores reproduced by us.

Comparison of RAVEN with MLLMs on the EgoThink (Reasoning) and AVS-QA benchmarks. RAVEN outperforms the compared MLLMs across metrics and excels in reasoning. $\textbf{Bold}$ and $\underline{\text{underline}}$ indicate the best and second-best scores.

🛠️ Requirements and Installation

Basic Dependencies:

Python >= 3.8
PyTorch >= 2.2.0
CUDA Version >= 11.8
transformers == 4.40.0 (for reproducing paper results)
tokenizers == 0.19.1

cd RAVEN
pip install -r requirements.txt
pip install flash-attn==2.5.8 --no-build-isolation
pip install opencv-python==4.5.5.64
apt-get update && apt-get install -y ffmpeg libsm6 libxext6
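
After installation, a quick sanity check of the environment can save a failed run later. The script below is a minimal sketch, not part of the RAVEN codebase: the names `check_environment`, `parse_version`, and `installed` are hypothetical helpers, and the distribution names (`torch`, `flash-attn`, `opencv-python`) are assumed to match what pip installed. It compares installed versions against the ones listed above and checks that `ffmpeg` is on the PATH. It does not check the CUDA version.

```python
import shutil
import sys
from importlib import metadata

# Versions copied from the dependency list and install commands above.
MINIMUMS = {"torch": "2.2.0"}
PINNED = {
    "transformers": "4.40.0",
    "tokenizers": "0.19.1",
    "flash-attn": "2.5.8",
    "opencv-python": "4.5.5.64",
}


def parse_version(text):
    """Turn '2.2.0+cu118' into (2, 2, 0, 0); stop at the first non-numeric part."""
    parts = []
    for piece in text.split("+")[0].split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    while len(parts) < 4:  # pad so '2.2' and '2.2.0' compare as equal
        parts.append(0)
    return tuple(parts)


def installed(dist):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist)
    except metadata.PackageNotFoundError:
        return None


def check_environment():
    """Collect human-readable problems; an empty list means every check passed."""
    problems = []
    if sys.version_info < (3, 8):
        problems.append("Python >= 3.8 is required")
    for dist, minimum in MINIMUMS.items():
        found = installed(dist)
        if found is None or parse_version(found) < parse_version(minimum):
            problems.append(f"{dist} >= {minimum} required, found {found}")
    for dist, pinned in PINNED.items():
        found = installed(dist)
        if found is None or parse_version(found) != parse_version(pinned):
            problems.append(f"{dist} == {pinned} expected, found {found}")
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH")
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["environment looks consistent"]:
        print(line)
```

The pinned checks are deliberately strict because the README states that transformers == 4.40.0 is needed for reproducing the paper results; relax them if you only want to run inference.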

📁 AVS-QA Dataset

The train and test splits of AVS-QA are provided here.
More details are available here.

🗝️ Training & Evaluation

Coming Soon!

🍀 Model Zoo

Model Name	Modal Type
RAVEN-7B-AV	AV
RAVEN-7B-AVS	AVS

🤖 Inference

STEP 1: Download $\texttt{siglip-so400m-patch14-384}$ from Hugging Face: google/siglip-so400m-patch14-384
STEP 2: Download a RAVEN checkpoint (see Model Zoo above)

CUDA_VISIBLE_DEVICES=0 python inference.py --model-path=<MODEL PATH> --modal-type=<MODAL TYPE>
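
The steps above can also be scripted. The following is a rough sketch under stated assumptions: `download_siglip`, `build_inference_command`, and `run_inference` are hypothetical helper names, the download uses `snapshot_download` from the `huggingface_hub` package (installed separately), and the modal-type strings `AV` and `AVS` are taken from the Model Zoo table; whether `inference.py` expects exactly these spellings, and where it looks for the SigLIP weights, is not stated here and should be checked against the script.

```python
import os
import subprocess
import sys

SIGLIP_REPO = "google/siglip-so400m-patch14-384"  # repo id from STEP 1

# Model name -> modal type, copied from the Model Zoo table.
# Assumption: inference.py accepts these strings for --modal-type.
MODAL_TYPES = {"RAVEN-7B-AV": "AV", "RAVEN-7B-AVS": "AVS"}


def download_siglip(local_dir):
    """STEP 1: fetch the SigLIP weights (needs huggingface_hub and network access)."""
    from huggingface_hub import snapshot_download  # imported lazily on purpose

    return snapshot_download(repo_id=SIGLIP_REPO, local_dir=local_dir)


def build_inference_command(model_path, model_name):
    """Assemble the same command line as shown above for a given checkpoint."""
    modal_type = MODAL_TYPES[model_name]  # KeyError for names not in the table
    return [
        sys.executable,
        "inference.py",
        f"--model-path={model_path}",
        f"--modal-type={modal_type}",
    ]


def run_inference(model_path, model_name, gpu="0"):
    """Run inference.py on one GPU; must be called from the RAVEN directory."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu)
    command = build_inference_command(model_path, model_name)
    return subprocess.run(command, env=env, check=True)
```

For example, `run_inference("<MODEL PATH>", "RAVEN-7B-AVS")` would launch the AVS model on GPU 0 once the checkpoint from STEP 2 is in place.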

👍 Acknowledgement

The codebase of RAVEN is adapted from VideoLLaMA2. We are grateful to its authors for their contribution.
