A real-time image captioning and visual question answering (VQA) system. This project uses computer vision and NLP to generate descriptive captions for images and answer user questions about them.
This repository hosts the code for Jan Hadl's Master Thesis at TU Wien: GS-VQA, a zero-shot VQA pipeline that uses vision-language models (VLMs) for visual perception and Answer Set Programming (ASP) for symbolic reasoning.
A benchmark for measuring whether multimodal assistants update to current context instead of staying anchored to prior context. 50 scenarios, a three-channel design (audio, camera, ground truth), and a cross-family LLM-as-judge by default.
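To make the three-channel design concrete, here is a minimal sketch of what one scenario record and a judge prompt might look like. All field names, the `Scenario` class, and the `judge_prompt` helper are illustrative assumptions, not the repository's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Scenario:
    """One benchmark scenario; every field name here is hypothetical."""
    scenario_id: str
    audio_transcript: str        # current spoken context (audio channel)
    camera_frames: list = field(default_factory=list)  # visual context (camera channel)
    ground_truth: str = ""       # reference answer (ground-truth channel)
    prior_context: str = ""      # stale context the assistant should not anchor to

def judge_prompt(s: Scenario, assistant_answer: str) -> str:
    """Format a prompt for a cross-family LLM judge comparing the answer to ground truth."""
    return (
        f"Ground truth: {s.ground_truth}\n"
        f"Assistant answer: {assistant_answer}\n"
        "Reply PASS if the answer reflects the current context, "
        "FAIL if it is anchored to the prior context."
    )

s = Scenario("s1", "the meeting moved to 3pm",
             ground_truth="3pm", prior_context="meeting at 1pm")
p = judge_prompt(s, "It's at 3pm.")
```

The point of the sketch is the separation of channels: the judge only sees the ground truth and the answer, so anchoring to `prior_context` is scored as a failure.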
Vision-Language Model for Automated Radiology Report Generation — ViT encoder + GPT decoder with cross-attention, self-critical sequence training (SCST) for reward optimization, and hallucination detection
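The encoder-decoder coupling in this design runs through cross-attention: decoder (report) tokens form the queries, while ViT patch embeddings supply the keys and values. A minimal NumPy sketch of that mechanism, with all dimensions and weight matrices chosen arbitrarily for illustration (not taken from the repository):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(dec_states, enc_states, w_q, w_k, w_v):
    """Decoder tokens (queries) attend over encoder patch features (keys/values)."""
    q = dec_states @ w_q                      # (T, d) queries from report tokens
    k = enc_states @ w_k                      # (P, d) keys from image patches
    v = enc_states @ w_v                      # (P, d) values from image patches
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (T, P) scaled dot products
    weights = softmax(scores, axis=-1)        # each row is a distribution over patches
    return weights @ v, weights               # (T, d) context, (T, P) attention map

# Hypothetical sizes: 196 ViT patches, 12 report tokens, model dim 64.
P, T, d = 196, 12, 64
enc = rng.standard_normal((P, d))
dec = rng.standard_normal((T, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
ctx, attn = cross_attention(dec, enc, w_q, w_k, w_v)
```

Each decoder token thus receives a context vector that is a patch-weighted mixture of image features; the attention map `attn` is also what hallucination-detection methods commonly inspect to check whether generated findings are grounded in image regions.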