Implement Jina CLIP v2 and NewBie dual CLIP #11415
Merged
+306
−4
I've implemented Jina CLIP v2, which is used by the NewBie image model. Before this PR, ComfyUI already supported NewBie's DiT (see #11172) and the gemma-3-4b-it text encoder (in Lumina2). With this PR, NewBie can run with full functionality in native ComfyUI.
The implementation of Jina CLIP v2 lives in a single py file, similar to the existing BERT and Llama/Gemma implementations. The weights and the tokenizer are also packaged in a single safetensors file. I've tested that it produces the same `clip_text_pooled` as the official Jina CLIP v2 (within some floating point error).

Here is an image generated using this PR, with a simple workflow in it:
- `NewBie-Image-Exp0.1.safetensors` is downloaded from https://huggingface.co/NewBie-AI/NewBie-image-Exp0.1/blob/main/transformer/diffusion_pytorch_model.safetensors
- `gemma_3_4b_it_bf16.safetensors` is downloaded from https://huggingface.co/woctordho/comfyui-gemma-3-4b-it/blob/main/gemma_3_4b_it_bf16.safetensors
- `jina_clip_v2_bf16.safetensors` is downloaded from https://huggingface.co/woctordho/comfyui-jina-clip-v2/blob/main/jina_clip_v2_bf16.safetensors

If I understand correctly, both Gemma and Jina need the system prompt written in the `CLIPTextEncode` node, otherwise the generated image will be garbage.
After this PR is merged, I can make another PR to support the checkpoint loader (all-in-one model).
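
For anyone who wants to repeat the `clip_text_pooled` comparison against the official Jina CLIP v2, here is a minimal sketch of the kind of check that could be used. It is not the test code from this PR: the helper names (`max_abs_diff`, `cosine_similarity`, `embeddings_match`), the tolerance values, and the sample vectors are all hypothetical, and the pooled outputs are assumed to have already been converted to flat lists of floats.

```python
import math


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return max(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def embeddings_match(ours, reference, atol=1e-2, min_cos=0.999):
    """Return True if two pooled embeddings agree within tolerance.

    atol and min_cos are assumed values, chosen loosely because bf16
    weights carry only a few significant digits; tune them as needed.
    """
    return (
        max_abs_diff(ours, reference) <= atol
        and cosine_similarity(ours, reference) >= min_cos
    )


if __name__ == "__main__":
    # Placeholder vectors, not real model outputs.
    ours = [0.1, 0.2, 0.3]
    reference = [0.1001, 0.1999, 0.3002]
    print(embeddings_match(ours, reference))
```

Checking both the maximum absolute difference and the cosine similarity catches two different failure modes: a single badly wrong element, and a small systematic drift spread across the whole vector.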