
RATHOD-SHUBHAM/CLIP-Classifier


CLIP

  • What is CLIP?

    • Contrastive Language-Image Pre-training (CLIP for short) is a vision-language model introduced by OpenAI in early 2021.
    • CLIP is a neural network trained on about 400 million (image, text) pairs collected from the internet.
    • Training uses a contrastive objective. Matching image and text embeddings are pulled together in a shared vector space and mismatched ones are pushed apart, so tasks such as image classification can be performed by measuring image-text similarity.
  • CLIP Architecture:

    • An image encoder and a text encoder are trained jointly to predict which (image, text) pairs in a training batch actually belong together.

      • The text encoder is a Transformer. The base size has 63 million parameters, 12 layers, a width of 512, and 8 attention heads.
      • The image encoder produces the feature representation of the image. Its backbone is either a ResNet (e.g. ResNet-50) or a Vision Transformer (ViT); OpenAI trained separate model variants with each backbone.
  • Run Code:

Screenshot
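The contrastive objective described under "What is CLIP?" can be sketched in plain Python. This is a minimal illustration, not this repository's code or OpenAI's implementation: it assumes the two encoders have already produced one embedding per image and per caption, and it uses a fixed temperature of 0.07 (in CLIP the temperature is a learned parameter and 0.07 is only its initial value). The function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def l2_normalize(v: Vector) -> Vector:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy(logits: Vector, target: int) -> float:
    """Softmax cross-entropy for one row, computed with the log-sum-exp trick."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def clip_contrastive_loss(image_embs: List[Vector], text_embs: List[Vector],
                          temperature: float = 0.07) -> float:
    """Symmetric contrastive loss over a batch of N (image, text) pairs.

    Row i of image_embs and row i of text_embs are assumed to be a matching
    pair; every other combination in the batch is treated as a negative.
    The fixed temperature is an assumption (CLIP learns it during training).
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = cosine similarity of image i and text j, divided by temperature
    logits = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Image-to-text: each image should pick its own caption (the diagonal entry).
    loss_i2t = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # Text-to-image: each caption should pick its own image (columns of logits).
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = sum(cross_entropy(columns[j], j) for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2
```

Each row of the similarity matrix is a classification problem over the N captions in the batch, so a larger batch supplies more negatives for every pair.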

Image Search

SentenceTransformers

SentenceTransformers provides models that embed images and text into the same vector space.
This makes it possible to find similar images and to implement text-to-image search.

clip-ViT-B-32

This is the image-and-text CLIP model, which maps text and images to a shared vector space.
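Text-to-image search with clip-ViT-B-32 comes down to ranking stored image embeddings by cosine similarity to a query embedding. The sketch below is a hedged illustration, not this repository's implementation: search and embed_with_clip are hypothetical names, the ranking is written in plain Python for clarity, and embed_with_clip assumes the sentence-transformers and Pillow packages are installed (it downloads the model on first use). The sentence-transformers library also ships util.semantic_search, which performs the same top-k ranking.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_emb: Vector, image_embs: Sequence[Vector],
           top_k: int = 3) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the top_k images most similar to the query."""
    scores = [(i, cosine_similarity(query_emb, emb)) for i, emb in enumerate(image_embs)]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]


def embed_with_clip(image_paths: List[str], queries: List[str]):
    """Hypothetical helper: encode images and text queries with clip-ViT-B-32.

    Requires sentence-transformers and Pillow; imports are deferred so the
    ranking code above runs without them.
    """
    from PIL import Image
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("clip-ViT-B-32")
    image_embs = model.encode([Image.open(p) for p in image_paths])
    query_embs = model.encode(queries)
    return image_embs, query_embs
```

Because images and text share one vector space, passing an image embedding as the query instead of a text embedding returns visually similar images.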

Usage

  1. Clone the repository with git clone.
  2. Change into the ImageSearch directory: cd ImageSearch
  3. Install the dependencies: pip install -r requirements.txt

Docker Image



About

CLIP is an open-source, multimodal model that works zero-shot: without being fine-tuned for a specific task, it takes an image and a set of candidate text descriptions and predicts which description best matches the image.
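Zero-shot prediction can be sketched as a softmax over scaled cosine similarities between one image embedding and the embeddings of the candidate descriptions. This is a minimal illustration under stated assumptions: the embeddings are assumed to come from CLIP's image and text encoders, the prompt template "a photo of a {label}" follows the style used in the CLIP paper, and the logit scale is fixed at 100 (CLIP learns this scale during training and caps it at 100). zero_shot_classify and build_prompts are hypothetical names.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def build_prompts(labels: List[str]) -> Dict[str, str]:
    """Hypothetical helper: turn class names into text descriptions to encode."""
    return {label: f"a photo of a {label}" for label in labels}


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_emb: Vector, label_embs: Dict[str, Vector],
                       logit_scale: float = 100.0) -> Dict[str, float]:
    """Return a probability for each label given precomputed CLIP embeddings.

    label_embs maps each label to the text embedding of its prompt.
    The fixed logit_scale is an assumption; CLIP learns this value.
    """
    labels = list(label_embs)
    logits = [logit_scale * cosine_similarity(image_emb, label_embs[label])
              for label in labels]
    # Softmax, shifted by the max logit for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(labels, exps)}
```

The predicted description is the label with the highest probability, e.g. max(probs, key=probs.get); adding a class only requires encoding one more prompt, with no retraining.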
