Google Research shows TurboQuant can shrink AI memory use 6x

In short: Google Research described TurboQuant, a method that makes large language models use much less memory with no reported loss of accuracy.

What happened

Google Research shared details of TurboQuant, a compression method for large language models, which are the systems behind many chatbots. It focuses on the model’s “KV cache,” which is like a notepad the model uses to remember what it has already read so it can answer based on a long conversation.

TurboQuant squeezes those stored values down to 3 bits each, instead of the common 32 bits. Google says this cuts KV cache memory use by at least 6x. In tests on Nvidia H100 chips, it also sped up a major step called “attention” (how the model decides what to focus on) by up to 8x.
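To get a feel for why bits per value matter, here is a rough back-of-the-envelope sketch in Python. The model shape and context length are hypothetical numbers picked for illustration, not figures from Google, and the calculation ignores any extra metadata a real compression scheme would store.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bits_per_value):
    """Raw KV cache size in bytes, ignoring any quantization metadata."""
    # Each token stores one key vector and one value vector per layer and KV head.
    values_per_token = 2 * num_layers * num_kv_heads * head_dim
    total_bits = values_per_token * num_tokens * bits_per_value
    return total_bits / 8


# Hypothetical model shape and context length, chosen only for illustration.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128, num_tokens=100_000)

for bits in (32, 16, 3):
    gib = kv_cache_bytes(bits_per_value=bits, **shape) / 2**30
    print(f"{bits:>2} bits per value: {gib:.2f} GiB")
```

The raw ratio of bit widths is only an upper bound on the savings, so this arithmetic should not be read as reproducing Google's reported figure.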

The method has two parts. First, PolarQuant rearranges the numbers in a way that makes them easier to compress with less error (like turning a messy pile into neat stacks before packing). Then Quantized Johnson-Lindenstrauss, or QJL, adds a tiny 1-bit correction step to reduce leftover errors while keeping the “inner product” comparisons the model needs (a simple score for how similar two sets of numbers are).
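The 1-bit idea can be illustrated with a small Python sketch. This is not Google's code: it is a minimal sign-bit random-projection estimator, applied directly to a whole vector rather than to leftover error, and every function name and size here is an assumption made for the example. It keeps only one sign bit per random projection of a stored vector, plus that vector's length, and still recovers an approximate inner product.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def make_projection(m, d, rng):
    """m random Gaussian rows of dimension d (shared by encoder and decoder)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]


def encode_key(S, k):
    """Keep one sign bit per projection row, plus the vector's norm."""
    signs = [1 if dot(row, k) >= 0 else -1 for row in S]
    return signs, math.sqrt(dot(k, k))


def estimate_inner_product(S, q, signs, k_norm):
    """Estimate <q, k> from the sign bits; q itself is left uncompressed."""
    acc = sum(dot(row, q) * s for row, s in zip(S, signs))
    # For Gaussian rows g, E[(g.q) * sign(g.k)] = sqrt(2/pi) * <q, k> / |k|,
    # so rescaling by sqrt(pi/2) * |k| makes the average unbiased.
    return math.sqrt(math.pi / 2) * k_norm * acc / len(S)


rng = random.Random(0)  # fixed seed so runs are repeatable
d, m = 32, 2048         # illustrative sizes only
S = make_projection(m, d, rng)
q = [rng.gauss(0.0, 1.0) for _ in range(d)]
k = [rng.gauss(0.0, 1.0) for _ in range(d)]

signs, k_norm = encode_key(S, k)
approx = estimate_inner_product(S, q, signs, k_norm)
exact = dot(q, k)
print(f"exact={exact:.3f} approx={approx:.3f}")
```

The estimate gets tighter as the number of projections grows. The sketch uses far more projections than a practical system would, simply to make the effect easy to see.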

Google reports “zero accuracy loss” compared to an uncompressed 32-bit baseline, and says it does not require fine-tuning, which is extra training to recover quality. Benchmarks on models including Gemma, Mistral, and Llama-3.1-8B-Instruct show results matching or slightly exceeding the baseline across long-context tests such as LongBench and Needle-In-A-Haystack.

Why it matters

If these results hold up broadly, more AI services could run longer chats and handle longer documents without needing as much expensive memory, which can lower operating costs and reduce hardware pressure.

Source: Ars Technica
