Multi-speaker diarization from video using SyncNet’s cross-modal embedding space to match multiple face tracks to corresponding audio tracks.
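A minimal sketch of the matching step such a pipeline needs, assuming face-track and audio-track embeddings have already been extracted into SyncNet's shared space; the `match_tracks` helper and its shapes are illustrative, not this repository's actual API:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_tracks(face_emb, audio_emb):
    """Assign each face track to its most likely audio track.

    face_emb:  (F, D) array of face-track embeddings (hypothetical input)
    audio_emb: (A, D) array of audio-track embeddings (hypothetical input)
    Returns (face_idx, audio_idx) index arrays for the best one-to-one match.
    """
    # Cosine similarity between every face/audio pair.
    f = face_emb / np.linalg.norm(face_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    sim = f @ a.T
    # Hungarian algorithm maximises total similarity (minimise the negation).
    return linear_sum_assignment(-sim)
```

A per-track greedy argmax would also work, but one-to-one assignment avoids mapping two faces to the same voice.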
Multi-Modal Representational Learning for Social Media Popularity Prediction
This repository implements temporal reasoning capabilities for vision-language models in simulated embodied environments, addressing the critical limitation of frame-by-frame processing in current multimodal AI systems.
Deep learning utilities for multimodal research
Code and Models for Binding Text, Images, Graphs, and Audio for Music Representation Learning
Learning a common representation space from speech and text for cross-modal retrieval given textual queries and speech files.
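A common way to train such a joint space is a symmetric InfoNCE (CLIP-style) contrastive loss over paired batches; the sketch below is a generic version of that objective and not necessarily the exact loss this repository uses:

```python
import torch
import torch.nn.functional as F

def clip_style_loss(speech_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired (speech, text) embeddings."""
    s = F.normalize(speech_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = s @ t.T / temperature                  # (B, B) similarity matrix
    labels = torch.arange(len(s), device=s.device)  # true pairs on the diagonal
    return (F.cross_entropy(logits, labels)
            + F.cross_entropy(logits.T, labels)) / 2
```

At retrieval time the same normalized dot product ranks speech files against a text query.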
A unified multimodal generative AI system designed to learn and adapt across multiple modalities (text, audio, vision, robotics) with minimal data and long-term autonomy through reinforcement learning.
Repository for context-based emotion recognition
[IKT 2024] A Multi-Task Framework Using Mamba for Identity, Age, and Gender Classification from Hand Images
A deep learning system for real-time emotion recognition from both text and images using transformers.
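Systems like this often use late fusion: each modality goes through its own pretrained transformer encoder, and the pooled features are concatenated for classification. A minimal fusion head, where the dimensions and the 7-class emotion set are assumptions rather than this project's configuration:

```python
import torch
import torch.nn as nn

class LateFusionEmotionHead(nn.Module):
    """Concatenate pooled text and image features, then classify emotion."""
    def __init__(self, text_dim=768, image_dim=768, n_emotions=7):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(text_dim + image_dim, 256),
            nn.ReLU(),
            nn.Linear(256, n_emotions),
        )

    def forward(self, text_feat, image_feat):
        # text_feat / image_feat: pooled encoder outputs, e.g. CLS embeddings.
        return self.classifier(torch.cat([text_feat, image_feat], dim=-1))
```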
A PyTorch implementation of multimodal VRNN and VAE.
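The core multimodal-VAE idea is a single latent variable that reconstructs every modality. A toy two-modality sketch, with naive posterior-parameter averaging standing in for whatever fusion (e.g., a product of experts) the repository actually implements:

```python
import torch
import torch.nn as nn

class MultimodalVAE(nn.Module):
    """Toy two-modality VAE: encode each modality to a Gaussian posterior,
    fuse the posteriors, and decode both modalities from one latent."""

    def __init__(self, dim_a=128, dim_b=64, latent=32):
        super().__init__()
        self.enc_a = nn.Linear(dim_a, 2 * latent)  # outputs (mu, logvar)
        self.enc_b = nn.Linear(dim_b, 2 * latent)
        self.dec_a = nn.Linear(latent, dim_a)
        self.dec_b = nn.Linear(latent, dim_b)

    def forward(self, xa, xb):
        mu_a, lv_a = self.enc_a(xa).chunk(2, dim=-1)
        mu_b, lv_b = self.enc_b(xb).chunk(2, dim=-1)
        # Naive fusion: average the two posteriors' parameters (assumption).
        mu, lv = (mu_a + mu_b) / 2, (lv_a + lv_b) / 2
        z = mu + torch.randn_like(mu) * (0.5 * lv).exp()  # reparameterisation
        return self.dec_a(z), self.dec_b(z), mu, lv
```

Training would add per-modality reconstruction losses plus the usual KL term on (mu, lv).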
Project to transform a natural language description into an image using Generative Adversarial Networks.
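Text conditioning in such GANs is commonly injected by concatenating a sentence embedding with the noise vector before generation. This fully-connected sketch stands in for the convolutional generator a real project would use; all dimensions are assumptions:

```python
import torch
import torch.nn as nn

class CondGenerator(nn.Module):
    """Conditional GAN generator: noise + text embedding -> 64x64 RGB image."""
    def __init__(self, noise_dim=100, text_dim=256, img_pixels=64 * 64 * 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim + text_dim, 512),
            nn.ReLU(),
            nn.Linear(512, img_pixels),
            nn.Tanh(),  # pixel values in [-1, 1]
        )

    def forward(self, z, text_emb):
        x = self.net(torch.cat([z, text_emb], dim=-1))
        return x.view(-1, 3, 64, 64)
```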
🚀 SynthAVSR is a research framework for training and evaluating audiovisual speech recognition (AVSR) models using synthetic data — with a focus on low-resource languages like Spanish and Catalan.
Semi-Supervised Learning (SSL)
Using a 3D Nearby Self-Attention Transformer to leverage the spatiotemporal nature of video for representation learning.
Accepted at The Web Conference 2024.
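A toy version of the "nearby" attention idea restricts full self-attention to local 3D windows of the video tensor; the window size, non-overlapping partition, and divisibility assumption below are illustrative, not the paper's exact scheme:

```python
import torch
import torch.nn as nn

class NearbyAttention3D(nn.Module):
    """Self-attention inside non-overlapping (t, h, w) windows of a video.

    Assumes T, H, W are divisible by the window size.
    """
    def __init__(self, dim=64, heads=4, window=(2, 4, 4)):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):  # x: (B, T, H, W, C)
        B, T, H, W, C = x.shape
        wt, wh, ww = self.window
        # Partition into (B * num_windows, tokens_per_window, C).
        x = x.view(B, T // wt, wt, H // wh, wh, W // ww, ww, C)
        x = x.permute(0, 1, 3, 5, 2, 4, 6, 7).reshape(-1, wt * wh * ww, C)
        out, _ = self.attn(x, x, x)  # full attention within each local window
        # Restore the original (B, T, H, W, C) layout.
        out = out.view(B, T // wt, H // wh, W // ww, wt, wh, ww, C)
        return out.permute(0, 1, 4, 2, 5, 3, 6, 7).reshape(B, T, H, W, C)
```

Restricting attention to local windows keeps cost linear in the number of windows rather than quadratic in the full token count.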
Improvements to a CVPR detection-and-suggestion model: a multimodal network that combines a text knowledge base with image classification for disease detection, adding automated weight updates and residual connections that preserve information from up to 4 layers earlier.
This code is part of the paper: "A Deep Dive Into Neural Synchrony Evaluation for Audio-visual Translation" published at ACM ICMI 2022.
Implementation for 3D bounding box prediction of objects using images, point clouds, and segmentation masks.
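One plausible shape for such a model pools per-point features PointNet-style, fuses them with image and mask features, and regresses a 7-parameter box (centre x/y/z, size w/h/l, yaw); the names and dimensions below are illustrative, not this implementation's:

```python
import torch
import torch.nn as nn

class BoxPredictor(nn.Module):
    """Toy multimodal 3D-box head over pre-extracted per-object features."""
    def __init__(self, img_dim=256, msk_dim=64, pts_dim=128):
        super().__init__()
        self.point_mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                                       nn.Linear(64, pts_dim))
        self.head = nn.Sequential(
            nn.Linear(img_dim + msk_dim + pts_dim, 256), nn.ReLU(),
            nn.Linear(256, 7),  # (x, y, z, w, h, l, yaw)
        )

    def forward(self, img_feat, msk_feat, points):
        # points: (B, N, 3) -> order-invariant feature via max pooling.
        pts_feat = self.point_mlp(points).max(dim=1).values
        return self.head(torch.cat([img_feat, msk_feat, pts_feat], dim=-1))
```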