A real-time image captioning and visual question answering (VQA) system. This project uses computer vision and NLP to generate descriptive captions for images and answer user questions about them.
This repository hosts the code for Jan Hadl's Master Thesis at TU Wien: GS-VQA, a zero-shot VQA pipeline that uses vision-language models (VLMs) for visual perception and Answer Set Programming (ASP) for symbolic reasoning.
A benchmark for measuring whether multimodal assistants update to current context instead of staying anchored to prior context. 50 scenarios, a three-channel design (audio, camera, ground truth), and a cross-family LLM-as-judge by default.
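To make the three-channel design concrete, here is a minimal sketch of what one scenario record and a judge prompt might look like. All field names, the `Scenario` class, and the `judge_prompt` helper are illustrative assumptions, not the repository's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Scenario:
    """One benchmark scenario; every field name here is hypothetical."""
    scenario_id: str
    audio_transcript: str        # current spoken context (audio channel)
    camera_frames: list = field(default_factory=list)  # visual context (camera channel)
    ground_truth: str = ""       # reference answer (ground-truth channel)
    prior_context: str = ""      # stale context the assistant should not anchor to

def judge_prompt(s: Scenario, assistant_answer: str) -> str:
    """Format a prompt for a cross-family LLM judge comparing the answer to ground truth."""
    return (
        f"Ground truth: {s.ground_truth}\n"
        f"Assistant answer: {assistant_answer}\n"
        "Reply PASS if the answer reflects the current context, "
        "FAIL if it is anchored to the prior context."
    )

s = Scenario("s1", "the meeting moved to 3pm",
             ground_truth="3pm", prior_context="meeting at 1pm")
p = judge_prompt(s, "It's at 3pm.")
```

The point of the sketch is the separation of channels: the judge only sees the ground truth and the answer, so anchoring to `prior_context` is scored as a failure.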
Vision-Language Model for Automated Radiology Report Generation — ViT encoder + GPT decoder with cross-attention, self-critical sequence training (SCST) for reward optimization, and hallucination detection
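The encoder-decoder coupling in this design runs through cross-attention: decoder (report) tokens form the queries, while ViT patch embeddings supply the keys and values. A minimal NumPy sketch of that mechanism, with all dimensions and weight matrices chosen arbitrarily for illustration (not taken from the repository):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(dec_states, enc_states, w_q, w_k, w_v):
    """Decoder tokens (queries) attend over encoder patch features (keys/values)."""
    q = dec_states @ w_q                      # (T, d) queries from report tokens
    k = enc_states @ w_k                      # (P, d) keys from image patches
    v = enc_states @ w_v                      # (P, d) values from image patches
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (T, P) scaled dot products
    weights = softmax(scores, axis=-1)        # each row is a distribution over patches
    return weights @ v, weights               # (T, d) context, (T, P) attention map

# Hypothetical sizes: 196 ViT patches, 12 report tokens, model dim 64.
P, T, d = 196, 12, 64
enc = rng.standard_normal((P, d))
dec = rng.standard_normal((T, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
ctx, attn = cross_attention(dec, enc, w_q, w_k, w_v)
```

Each decoder token thus receives a context vector that is a patch-weighted mixture of image features; the attention map `attn` is also what hallucination-detection methods commonly inspect to check whether generated findings are grounded in image regions.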