Analyzing Hateful Memes (Resources: Hateful Memes Challenge)
Multi-speaker diarization from video using SyncNet’s cross-modal embedding space to match multiple face tracks to corresponding audio tracks.
Deep learning utilities for multimodal research
Image classification and text assignment using convolutional neural networks and multimodal transformers
Code and Models for Binding Text, Images, Graphs, and Audio for Music Representation Learning
Example of a multimodal (end-to-end) deep learning model with a transformer architecture
Fine-tuning BLIP for pathological visual question answering.
Build a reliable and interpretable model that classifies extreme weather from images – enhancing early detection, situational awareness, and decision-making.
A Survey on Patent Analysis: From NLP to Multimodal AI (ACL 2025)
This repository implements temporal reasoning capabilities for vision-language models in simulated embodied environments, addressing the critical limitation of frame-by-frame processing in current multimodal AI systems.
A repository of Video Language papers, code and datasets.
My implementation of research papers
Learning a common representation space from speech and text for cross-modal retrieval given textual queries and speech files.
Facial landmark detection and emotion classification using PyTorch, MediaPipe, and ResNet18 with attention mechanisms on the FER2013 dataset
A unified multimodal generative AI system designed to learn and adapt across multiple modalities (text, audio, vision, robotics) with minimal data and long-term autonomy through reinforcement learning.
Multimodal deep learning model for fake news classification.
The purpose of this project is to build an NLP model that makes medical abstracts easier to read.
Build AI Superpowers with One SDK: Agents, Models, Context, Orchestration. Anywhere. Framework for Python & JavaScript/TypeScript
This repository is cloned from https://github.com/HLR/LatentAlignmentProcedural. It is a potential baseline explored for the textual_cloze task on the RecipeQA dataset (https://hucvl.github.io/recipeqa/).
MCD-UNet: A Multi-modal Conditional Diffusion UNet for 3D Medical Image Segmentation