December 3, 2025
Optimizing GGUFs for Decoder-Only Embedding Models
Two weeks ago, we released GGUF formats of jina-embeddings-v4 - a universal embedding model for multimodal multilingual retrieval - with various quantized versions. Our motivation was simple: as a 3.75B parameter model, the vanilla transformer version of jina-embeddings-v4 doesn't scale well on our GCP G2 (L4 GPU) API instances, so we wanted to speed up inference with these smaller, faster GGUF versions. Along the way, we ran into some interesting findings while converting and running GGUF embedding models. Since most of the llama.cpp community focuses on LLMs, we thought it'd be valuable to share them from an embedding provider's perspective.
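
To make "converting and running GGUF embedding models" concrete, here is a minimal sketch of loading one of the quantized GGUF files with llama-cpp-python and computing a single embedding. The file name, context size, and pooling behavior are illustrative assumptions, not the exact setup used in production.

```python
# Minimal sketch: compute an embedding from a quantized GGUF file with
# llama-cpp-python. The model path below is a placeholder; point it at
# whichever quantized variant you downloaded.
from llama_cpp import Llama

llm = Llama(
    model_path="jina-embeddings-v4-text-retrieval-Q4_K_M.gguf",  # placeholder path
    embedding=True,  # run llama.cpp in embedding mode
    n_ctx=2048,      # context window; raise it for longer documents
)

# embed() returns the pooled embedding vector for the input text
vector = llm.embed("How should I quantize a decoder-only embedding model?")
print(len(vector))  # embedding dimensionality
```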

TL;DR
- We released GGUF formats of jina-embeddings-v4 in several quantized versions.
- The goal was to speed up inference for the 3.75B parameter model on our GCP G2 (L4 GPU) API instances.
- Converting and running GGUF embedding models surfaced several interesting findings.
- We share them from an embedding provider's perspective, since most of the llama.cpp community focuses on LLMs.