
Cohere adds vision to its RAG search capabilities

A person's hands type on a laptop displaying a futuristic search interface with red and blue holographic elements.
Image Credit: VentureBeat made with OpenAI DALL-E 3 via ChatGPT


Cohere has added multimodal embeddings to its search model, allowing users to bring images into RAG-style enterprise search.

Embed 3, which launched last year, is an embedding model that transforms data into numerical representations. Embeddings have become crucial in retrieval-augmented generation (RAG) because enterprises can create embeddings of their documents, which the system can then compare against the embedding of a prompt to retrieve the information requested.
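To make the comparison step concrete, here is a minimal sketch of embedding-based retrieval. It does not call Cohere's API: the three-dimensional vectors, the document names and the `retrieve` helper are all hypothetical stand-ins for what a real embedding model and vector store would provide, and cosine similarity is assumed as the comparison metric.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, doc_vecs, top_k=2):
    """Return the ids of the top_k documents most similar to the query."""
    scored = [
        (cosine_similarity(query_vec, vec), doc_id)
        for doc_id, vec in doc_vecs.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:top_k]]

# Hypothetical toy embeddings; a real model would emit vectors with
# hundreds or thousands of dimensions.
doc_vecs = {
    "q3_sales_report": [0.9, 0.1, 0.0],
    "hr_policy": [0.0, 0.2, 0.9],
    "product_catalog": [0.6, 0.7, 0.1],
}
query_vec = [1.0, 0.2, 0.0]  # stand-in for the embedded prompt

top = retrieve(query_vec, doc_vecs)
print(top)
```

In a RAG pipeline, the documents returned by a step like this would be passed to the generative model as context for answering the prompt.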

The new multimodal version can generate embeddings for both images and text. Cohere claims Embed 3 is “now the most generally capable multimodal embedding model on the market.” Aidan Gomez, Cohere co-founder and CEO, posted a graph on X showing performance improvements in image search with Embed 3.

“This advancement enables enterprises to unlock real value from their vast amount of data stored in images,” Cohere said in a blog post. “Businesses can now build systems that accurately and quickly search important multimodal assets such as complex reports, product catalogs and design files to boost workforce productivity.”

Cohere said a stronger multimodal focus expands the volume of data enterprises can access through a RAG search. Many organizations limit RAG searches to structured and unstructured text despite having multiple file formats in their data libraries. Customers can now bring in more charts, graphs, product images and design templates.

Performance improvements

Cohere said encoders in Embed 3 “share a unified latent space,” allowing users to include both images and text in a single database. Some image embedding methods require maintaining separate databases for images and text. The company said the unified approach leads to better mixed-modality searches.

According to the company, “Other models tend to cluster text and image data into separate areas, which leads to weak search results that are biased toward text-only data. Embed 3, on the other hand, prioritizes the meaning behind the data without biasing towards a specific modality.”
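As a rough illustration of what a shared latent space makes possible, the sketch below keeps text and image entries in one index and ranks them together against a single query. Everything in it is assumed for illustration: the `Item` record, the item names and the toy vectors are hypothetical, and cosine similarity stands in for whatever metric a production system would use.

```python
import math
from dataclasses import dataclass

@dataclass
class Item:
    item_id: str
    modality: str  # "text" or "image"
    vector: list   # embedding in the shared space

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def search(index, query_vec, top_k=2):
    """Rank every item, regardless of modality, by similarity to the query."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item.vector), reverse=True)
    return ranked[:top_k]

# One index holds both modalities because, by assumption, their
# embeddings live in the same vector space.
index = [
    Item("annual_report_text", "text", [0.8, 0.2, 0.1]),
    Item("revenue_chart_png", "image", [0.9, 0.1, 0.0]),
    Item("office_photo_jpg", "image", [0.0, 0.1, 0.9]),
    Item("travel_policy_text", "text", [0.1, 0.9, 0.2]),
]
query = [1.0, 0.1, 0.0]  # stand-in for an embedded text query

results = search(index, query)
for item in results:
    print(item.item_id, item.modality)
```

With separate text and image databases, the same query would need two searches whose scores are not directly comparable. A single ranked list avoids that merging step, which is the benefit the company describes.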

Embed 3 is available in more than 100 languages. 

Cohere said multimodal Embed 3 is now available on its platform and Amazon SageMaker. 

Playing catch up

Many consumers are fast becoming familiar with multimodal search, thanks to the introduction of image-based search in platforms like Google and chat interfaces like ChatGPT. As individual users get used to looking for information from pictures, it makes sense that they would want to get the same experience in their working life. 

Enterprises have begun seeing this benefit, too, as other companies that offer embedding models provide some multimodal options. Some model developers, like Google and OpenAI, offer some type of multimodal embedding. Open-source models can also facilitate embeddings for images and other modalities. The fight is now over which multimodal embedding model can perform at the speed, accuracy and security enterprises demand.

Cohere, which was founded by some of the researchers responsible for the Transformer model (Gomez is one of the authors of the famous “Attention Is All You Need” paper), has struggled to be top of mind for many in the enterprise space. It updated its APIs in September to allow customers to switch from competitor models to Cohere models easily. At the time, Cohere said the move was meant to align it with industry standards, as customers often toggle between models.