Analyzing Hateful Memes (Resources: Hateful Memes Challenge)
Updated Feb 18, 2024 - Jupyter Notebook
Deep learning utils for multimodal research
Image classification and text assignment using convolutional neural networks and multimodal transformers
Code and Models for Binding Text, Images, Graphs, and Audio for Music Representation Learning
Example of a multimodal (end-to-end) deep learning model with transformers architecture
A repository of Video Language papers, code and datasets.
My implementation of research papers
Learning a common representation space from speech and text for cross-modal retrieval given textual queries and speech files.
A unified multimodal generative AI system designed to learn and adapt across multiple modalities (text, audio, vision, robotics) with minimal data and long-term autonomy through reinforcement learning.
Facial landmark detection and emotion classification using PyTorch, MediaPipe, and ResNet18 with attention mechanisms on the FER2013 dataset
Build AI Superpowers with One SDK: Agents, Models, Context, Orchestration. Anywhere. Framework for Python & JavaScript/TypeScript
The purpose of this project is to build an NLP model that makes medical abstracts easier to read.
A generalized research workbench for fusing visual (CNN) and textual (RNN/LSTM) features. It provides a clean, modular Keras framework, optimized for efficient training on Google Colab by leveraging local I/O and persistent asset saving to Google Drive. (A minimal sketch of this fusion pattern appears after the list.)
This repository implements temporal reasoning capabilities for vision-language models in simulated embodied environments, addressing the critical limitation of frame-by-frame processing in current multimodal AI systems.
Fine-tuning BLIP for pathological visual question answering.
Build a reliable and interpretable model that classifies extreme weather from images – enhancing early detection, situational awareness, and decision-making.
A novel multimodal approach for emotion recognition deploying early fusion based on graph-captured embeddings
This repository accompanies the Thesis "Automated Segmentation and Analysis of High-Speed Video Phase-Detection Data for Boiling Heat Transfer Characterization Using U-Net Convolutional Neural Networks and Uncertainty Quantification" published by MIT Libraries.
This repository focuses on the cutting-edge features of Llama 3.2, including multimodal capabilities, advanced tokenization, and tool calling for building next-gen AI applications. It highlights Llama's enhanced image reasoning, multilingual support, and the Llama Stack API for seamless customization and orchestration.
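Several of the repositories above follow the same basic pattern: encode each modality separately, then fuse the feature vectors for a downstream task. As a rough illustration of the CNN + RNN/LSTM feature fusion named in the Keras workbench entry, here is a minimal sketch; the input shapes, vocabulary size, and layer widths are illustrative assumptions, not taken from any listed repository.

```python
# Minimal sketch of CNN + LSTM feature fusion in Keras.
# Assumptions: 224x224 RGB images, integer-encoded text of length 50 with a
# 10,000-word vocabulary, and a binary label. All sizes are illustrative.
import tensorflow as tf
from tensorflow.keras import layers, Model

# Visual branch: a small CNN that reduces the image to a feature vector.
image_in = layers.Input(shape=(224, 224, 3), name="image")
x = layers.Conv2D(32, 3, activation="relu")(image_in)
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(64, 3, activation="relu")(x)
x = layers.GlobalAveragePooling2D()(x)
visual_feat = layers.Dense(128, activation="relu")(x)

# Textual branch: embedding + LSTM over token ids (zeros are treated as padding).
text_in = layers.Input(shape=(50,), dtype="int32", name="text")
t = layers.Embedding(input_dim=10_000, output_dim=128, mask_zero=True)(text_in)
text_feat = layers.LSTM(128)(t)

# Fusion: concatenate the two feature vectors and classify.
fused = layers.Concatenate()([visual_feat, text_feat])
fused = layers.Dense(64, activation="relu")(fused)
out = layers.Dense(1, activation="sigmoid", name="label")(fused)

model = Model(inputs=[image_in, text_in], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```

With this layout, model.fit can be called with a dictionary of inputs keyed by "image" and "text", which keeps the two modalities cleanly separated in the data pipeline.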