🚀 Train any LLM with BLIPren, a flexible architecture that adapts to your needs, streamlining language-image pre-training effortlessly.
🌟 Build a PyTorch implementation of Google's PaliGemma model for advanced vision-language tasks, including object detection and segmentation.
Fine-tuned SigLIP for skin lesion classification
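A minimal sketch of what fine-tuning SigLIP for image classification can look like, assuming the Hugging Face `SiglipVisionModel` backbone; the checkpoint name and the 7-class setup (e.g. HAM10000-style lesion labels) are illustrative assumptions, not this repo's actual configuration.

```python
import torch
from torch import nn
from transformers import SiglipVisionModel

# Assumed checkpoint; the repo may use a different SigLIP variant.
backbone = SiglipVisionModel.from_pretrained("google/siglip-base-patch16-224")

class LesionClassifier(nn.Module):
    def __init__(self, backbone, num_labels=7):  # 7 classes is an assumption (e.g. HAM10000)
        super().__init__()
        self.backbone = backbone
        self.head = nn.Linear(backbone.config.hidden_size, num_labels)

    def forward(self, pixel_values):
        # pooler_output is SigLIP's attention-pooled image representation
        feats = self.backbone(pixel_values=pixel_values).pooler_output
        return self.head(feats)

model = LesionClassifier(backbone)
logits = model(torch.randn(2, 3, 224, 224))  # stand-in for a batch of preprocessed images
```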
FPGA-based acceleration of TinyLLaVA-Phi-2-SigLIP-3.1B inference on AMD Alveo U280 using Vitis HLS.
Code for Post-hoc Probabilistic Vision-Language Models
An AI-powered video search engine built with Google's SigLIP and FAISS. It lets you search for objects or key moments in videos using natural language.
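The core of such a pipeline fits in a few lines: embed sampled video frames with SigLIP, index the embeddings in FAISS, then search with a text query. The sketch below is a hedged illustration under assumed names (checkpoint, query text, random tensors standing in for real frames), not this repo's code.

```python
import faiss
import torch
from transformers import SiglipModel, SiglipProcessor

model = SiglipModel.from_pretrained("google/siglip-base-patch16-224")  # assumed checkpoint
processor = SiglipProcessor.from_pretrained("google/siglip-base-patch16-224")

# Stand-in for preprocessed video frames: (num_frames, 3, 224, 224).
frames = torch.randn(16, 3, 224, 224)
with torch.no_grad():
    frame_emb = model.get_image_features(pixel_values=frames)
frame_emb = torch.nn.functional.normalize(frame_emb, dim=-1).numpy()

# Inner product over L2-normalized vectors equals cosine similarity.
index = faiss.IndexFlatIP(frame_emb.shape[1])
index.add(frame_emb)

# Natural-language query -> top-5 matching frames.
inputs = processor(text=["a dog catching a frisbee"],
                   padding="max_length", return_tensors="pt")
with torch.no_grad():
    text_emb = model.get_text_features(**inputs)
text_emb = torch.nn.functional.normalize(text_emb, dim=-1).numpy()
scores, frame_ids = index.search(text_emb, 5)
```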
BLIP-2 implementation for training vision-language models. Q-Former + frozen encoders + any LLM. Colab-ready notebooks with MoE variant.
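For context, the BLIP-2 stack this repo trains (Q-Former bridging a frozen image encoder and a frozen LLM) can be exercised through the Hugging Face reference implementation. This is a generic inference sketch, not the repo's training code; the checkpoint and image path are placeholders.

```python
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

image = Image.open("example.jpg")  # placeholder path: any RGB image
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)  # unconditional captioning
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```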
Generalized Referring Expression Segmentation on Aerial Photos with Aerial-D, a 37,288-image dataset with 1.52M referring expressions covering instances, groups, and semantic regions across 21 categories.
A novel multimodal architecture for detecting multimodal misinformation by explicitly modeling the consistency between visual content, textual claims, and external factual knowledge.
[ICLR 2025] - Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion
Contrastive Olfaction-Language-Image Pre-training Model. The first-ever series of embedding models for olfaction-vision-language applications in robotics and embodied AI - an extension of CLIP with olfaction.
This project is my PyTorch reproduction of PaliGemma, a compact 3B vision–language model that integrates SigLIP vision features with a Gemma decoder. I implemented the full multimodal pipeline from vision encoding to autoregressive text generation to study modern VLM architectures from a research perspective.
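For comparison with such a reproduction, the reference pipeline (SigLIP vision tower feeding a Gemma decoder) is also exposed through transformers. A minimal sketch, assuming the gated `google/paligemma-3b-pt-224` checkpoint and a placeholder image; "caption en" is one of PaliGemma's documented task prefixes.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-pt-224"  # gated checkpoint; requires HF access
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("example.jpg")  # placeholder path
inputs = processor(text="caption en", images=image, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=20)
# Strip the prompt tokens, keep only the generated caption.
print(processor.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```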
Visual Embedding Reduction and Space Exploration — Clustering-guided Insights for Training Data Enhancement in V-rDu
PyTorch implementation of Google's PaliGemma vision-language model with VQ-VAE decoder for processing referring expression segmentation outputs. Supports detection, segmentation, VQA, and captioning.
[CVPR'25-Demo] Official repository of "TryOffDiff: Virtual-Try-Off via High-Fidelity Garment Reconstruction using Diffusion Models".
Research on Outfit Recommendation Model Based on CNN-Transformer Cross-Modal Fusion
Your all-local photo organizer and photo search tool
Identify mountain peaks in your photos using AI—zero-shot retrieval, landmark re-ranking, and geospatial priors.
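The zero-shot retrieval step can be illustrated with plain SigLIP image-text scoring; unlike CLIP's softmax over labels, SigLIP scores each pair independently with a sigmoid, so candidate probabilities need not sum to 1. The checkpoint, photo path, and peak names below are made-up examples.

```python
import torch
from PIL import Image
from transformers import SiglipModel, SiglipProcessor

model = SiglipModel.from_pretrained("google/siglip-base-patch16-224")  # assumed checkpoint
processor = SiglipProcessor.from_pretrained("google/siglip-base-patch16-224")

image = Image.open("photo.jpg")  # placeholder path
candidates = ["a photo of the Matterhorn", "a photo of Mont Blanc", "a photo of K2"]
inputs = processor(text=candidates, images=image,
                   padding="max_length", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # shape: (1, num_candidates)
probs = torch.sigmoid(logits)  # independent per-pair match probabilities
```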