Mini CLIP-style CXR↔report retrieval with prompt ablation, token-length check, and interpretability
Fine-tuned BLIP model on Flickr8k for multimodal image captioning (vision + language).
Multi-modal remote sensing image restoration and fusion foundation model with language prompting.
A real-time image captioning and visual question answering (VQA) system. This project uses computer vision and NLP to generate descriptive captions for images and answer user questions about them.
Vietnamese image captioning pipeline: BLIP + CLIP + NLLB. Gradio demo with BLEU/METEOR evaluation.
Resource-aware X-CLIP baseline on Cholec80: FP16 + grad-accum training and evaluation for surgical video–text localization
This repository hosts the code for Jan Hadl's Master Thesis at TU Wien: GS-VQA, a zero-shot VQA pipeline that uses VLMs for visual perception and ASP for symbolic reasoning.
[TIP 2022] Official code of paper “Video Question Answering with Prior Knowledge and Object-sensitive Learning”
Cross-lingual evaluation of CLIP on Japanese vs English memes — revealing a 7.4% performance gap and sarcasm detection failure
PyTorch code for the Findings of NAACL 2022 paper "Probing the Role of Positional Information in Vision-Language Models".
Real-time AI that sees, understands, and talks about what it sees - like a visual brain.
[ICLR 2026] - Spectral Concept Selection and Cross-modal Representation Learning for Generalized Category Discovery
Mamba for Vision, Perception and Action
[ICLR 2025] Data-Augmented Phrase-Level Alignment for Mitigating Object Hallucination
[ECCV2024] Reflective Instruction Tuning: Mitigating Hallucinations in Large Vision-Language Models
Official code of the paper ORacle: Large Vision-Language Models for Knowledge-Guided Holistic OR Domain Modeling accepted at MICCAI 2024.
Benchmark for evaluating MLLMs as judges of vision-task outputs across intrinsic and tool-mediated settings
MB-ORES: A Multi-Branch Object Reasoner for Visual Grounding in Remote Sensing
Streamlit App Combining Vision, Language, and Audio AI Models
TrackGPT: Track What You Need in Videos via Text Prompts
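Many of the projects above share the same CLIP-style contrastive backbone: images and texts are encoded into a joint embedding space, and retrieval reduces to ranking by similarity. As a minimal sketch of that pattern (illustrative only, not code from any repository listed here), the snippet below uses the Hugging Face transformers CLIP API with the public openai/clip-vit-base-patch32 checkpoint; the image path and candidate captions are hypothetical placeholders:

```python
# Minimal CLIP-style image<->text retrieval sketch.
# Requires: pip install torch transformers pillow
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

image = Image.open("example.jpg")  # hypothetical local image
captions = [  # hypothetical candidate texts to rank against the image
    "a chest X-ray with no acute cardiopulmonary findings",
    "a dog running on a beach",
    "a satellite view of farmland",
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# logits_per_image has shape (num_images, num_texts); softmax over
# texts turns the similarity scores into a ranking for this image.
probs = out.logits_per_image.softmax(dim=-1)
best = probs.argmax(dim=-1).item()
print(f"best caption: {captions[best]!r} (p={probs[0, best]:.3f})")
```

Using out.logits_per_text instead scores the text→image direction, i.e. the report→CXR half of a retrieval setup like the first entry above.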