Multi-speaker diarization from video using SyncNet’s cross-modal embedding space to match multiple face tracks to corresponding audio tracks.
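A minimal sketch of the matching step such a pipeline needs, assuming face-track and audio-track embeddings have already been extracted into SyncNet's shared space; the `match_tracks` helper and its shapes are illustrative, not this repository's actual API:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_tracks(face_emb, audio_emb):
    """Assign each face track to its most likely audio track.

    face_emb:  (F, D) array of face-track embeddings (hypothetical input)
    audio_emb: (A, D) array of audio-track embeddings (hypothetical input)
    Returns (face_idx, audio_idx) index arrays for the best one-to-one match.
    """
    # Cosine similarity between every face/audio pair.
    f = face_emb / np.linalg.norm(face_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    sim = f @ a.T
    # Hungarian algorithm maximises total similarity (minimise the negation).
    return linear_sum_assignment(-sim)
```

A per-track greedy argmax would also work, but one-to-one assignment avoids mapping two faces to the same voice.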
Multi-Modal Representational Learning for Social Media Popularity Prediction
This repository implements temporal reasoning capabilities for vision-language models in simulated embodied environments, addressing the critical limitation of frame-by-frame processing in current multimodal AI systems.
Deep learning utilities for multimodal research
Code and Models for Binding Text, Images, Graphs, and Audio for Music Representation Learning
Learning a common representation space from speech and text for cross-modal retrieval given textual queries and speech files.
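A common way to train such a joint space is a symmetric InfoNCE (CLIP-style) contrastive loss over paired batches; the sketch below is a generic version of that objective and not necessarily the exact loss this repository uses:

```python
import torch
import torch.nn.functional as F

def clip_style_loss(speech_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired (speech, text) embeddings."""
    s = F.normalize(speech_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = s @ t.T / temperature                  # (B, B) similarity matrix
    labels = torch.arange(len(s), device=s.device)  # true pairs on the diagonal
    return (F.cross_entropy(logits, labels)
            + F.cross_entropy(logits.T, labels)) / 2
```

At retrieval time the same normalized dot product ranks speech files against a text query.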
A unified multimodal generative AI system designed to learn and adapt across multiple modalities (text, audio, vision, robotics) with minimal data and long-term autonomy through reinforcement learning.
Repository for context-based emotion recognition
[IKT 2024] A Multi-Task Framework Using Mamba for Identity, Age, and Gender Classification from Hand Images
A deep learning system for real-time emotion recognition from both text and images using transformers.
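Systems like this often use late fusion: each modality goes through its own pretrained transformer encoder, and the pooled features are concatenated for classification. A minimal fusion head, where the dimensions and the 7-class emotion set are assumptions rather than this project's configuration:

```python
import torch
import torch.nn as nn

class LateFusionEmotionHead(nn.Module):
    """Concatenate pooled text and image features, then classify emotion."""
    def __init__(self, text_dim=768, image_dim=768, n_emotions=7):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(text_dim + image_dim, 256),
            nn.ReLU(),
            nn.Linear(256, n_emotions),
        )

    def forward(self, text_feat, image_feat):
        # text_feat / image_feat: pooled encoder outputs, e.g. CLS embeddings.
        return self.classifier(torch.cat([text_feat, image_feat], dim=-1))
```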
A PyTorch implementation of multimodal VRNN and VAE.
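The core multimodal-VAE idea is a single latent variable that reconstructs every modality. A toy two-modality sketch, with naive posterior-parameter averaging standing in for whatever fusion (e.g., a product of experts) the repository actually implements:

```python
import torch
import torch.nn as nn

class MultimodalVAE(nn.Module):
    """Toy two-modality VAE: encode each modality to a Gaussian posterior,
    fuse the posteriors, and decode both modalities from one latent."""

    def __init__(self, dim_a=128, dim_b=64, latent=32):
        super().__init__()
        self.enc_a = nn.Linear(dim_a, 2 * latent)  # outputs (mu, logvar)
        self.enc_b = nn.Linear(dim_b, 2 * latent)
        self.dec_a = nn.Linear(latent, dim_a)
        self.dec_b = nn.Linear(latent, dim_b)

    def forward(self, xa, xb):
        mu_a, lv_a = self.enc_a(xa).chunk(2, dim=-1)
        mu_b, lv_b = self.enc_b(xb).chunk(2, dim=-1)
        # Naive fusion: average the two posteriors' parameters (assumption).
        mu, lv = (mu_a + mu_b) / 2, (lv_a + lv_b) / 2
        z = mu + torch.randn_like(mu) * (0.5 * lv).exp()  # reparameterisation
        return self.dec_a(z), self.dec_b(z), mu, lv
```

Training would add per-modality reconstruction losses plus the usual KL term on (mu, lv).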
Project to transform a natural language description into an image using Generative Adversarial Networks.
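Text conditioning in such GANs is commonly injected by concatenating a sentence embedding with the noise vector before generation. This fully-connected sketch stands in for the convolutional generator a real project would use; all dimensions are assumptions:

```python
import torch
import torch.nn as nn

class CondGenerator(nn.Module):
    """Conditional GAN generator: noise + text embedding -> 64x64 RGB image."""
    def __init__(self, noise_dim=100, text_dim=256, img_pixels=64 * 64 * 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim + text_dim, 512),
            nn.ReLU(),
            nn.Linear(512, img_pixels),
            nn.Tanh(),  # pixel values in [-1, 1]
        )

    def forward(self, z, text_emb):
        x = self.net(torch.cat([z, text_emb], dim=-1))
        return x.view(-1, 3, 64, 64)
```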
🚀 SynthAVSR is a research framework for training and evaluating audiovisual speech recognition (AVSR) models using synthetic data — with a focus on low-resource languages like Spanish and Catalan.
Semi-Supervised Learning (SSL)
Using a 3D Nearby Self-Attention Transformer to leverage the spatiotemporal nature of video for representation learning.
Accepted at The Web Conference 2024.
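A toy version of the "nearby" attention idea restricts full self-attention to local 3D windows of the video tensor; the window size, non-overlapping partition, and divisibility assumption below are illustrative, not the paper's exact scheme:

```python
import torch
import torch.nn as nn

class NearbyAttention3D(nn.Module):
    """Self-attention inside non-overlapping (t, h, w) windows of a video.

    Assumes T, H, W are divisible by the window size.
    """
    def __init__(self, dim=64, heads=4, window=(2, 4, 4)):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):  # x: (B, T, H, W, C)
        B, T, H, W, C = x.shape
        wt, wh, ww = self.window
        # Partition into (B * num_windows, tokens_per_window, C).
        x = x.view(B, T // wt, wt, H // wh, wh, W // ww, ww, C)
        x = x.permute(0, 1, 3, 5, 2, 4, 6, 7).reshape(-1, wt * wh * ww, C)
        out, _ = self.attn(x, x, x)  # full attention within each local window
        # Restore the original (B, T, H, W, C) layout.
        out = out.view(B, T // wt, H // wh, W // ww, wt, wh, ww, C)
        return out.permute(0, 1, 4, 2, 5, 3, 6, 7).reshape(B, T, H, W, C)
```

Restricting attention to local windows keeps cost linear in the number of windows rather than quadratic in the full token count.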
Improvements to a CVPR detection-and-suggestion model: a multimodal network that combines a text knowledge base with image classification for disease detection, adding automated weight updates and residual connections that preserve information from up to 4 layers earlier.
This code is part of the paper: "A Deep Dive Into Neural Synchrony Evaluation for Audio-visual Translation" published at ACM ICMI 2022.
Implementation for 3D bounding box prediction of objects using images, point clouds, and segmentation masks.
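One plausible shape for such a model pools per-point features PointNet-style, fuses them with image and mask features, and regresses a 7-parameter box (centre x/y/z, size w/h/l, yaw); the names and dimensions below are illustrative, not this implementation's:

```python
import torch
import torch.nn as nn

class BoxPredictor(nn.Module):
    """Toy multimodal 3D-box head over pre-extracted per-object features."""
    def __init__(self, img_dim=256, msk_dim=64, pts_dim=128):
        super().__init__()
        self.point_mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                                       nn.Linear(64, pts_dim))
        self.head = nn.Sequential(
            nn.Linear(img_dim + msk_dim + pts_dim, 256), nn.ReLU(),
            nn.Linear(256, 7),  # (x, y, z, w, h, l, yaw)
        )

    def forward(self, img_feat, msk_feat, points):
        # points: (B, N, 3) -> order-invariant feature via max pooling.
        pts_feat = self.point_mlp(points).max(dim=1).values
        return self.head(torch.cat([img_feat, msk_feat, pts_feat], dim=-1))
```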