TurboQuant compresses a key part of large language models so they use far less memory and run faster, while matching the original output quality.
In short: Google Research described TurboQuant, a method that makes large language models use much less memory while keeping the same quality.
Google Research has shared details of TurboQuant, a compression method for large language models, the systems behind many chatbots. It targets the model’s key-value cache, or “KV cache,” which works like a notepad where the model records what it has already read so it can answer based on a long conversation.
TurboQuant squeezes those stored values down to 3 bits each, instead of the common 32 bits. Google says this cuts KV cache memory use by at least 6x. In tests on Nvidia H100 chips, it also sped up a major step called “attention” (how the model decides what to focus on) by up to 8x.
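To make the memory figures concrete, here is a rough back-of-the-envelope calculation in Python. The model shape (32 layers, 8 key-value heads, 128 numbers per head) and the 8,192-token context are illustrative assumptions, not figures from Google's report, and the helper name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bits_per_value):
    """Approximate KV cache size in bytes for a single sequence."""
    # Factor of 2: one key vector and one value vector
    # per token, per layer, per KV head.
    n_values = 2 * n_layers * n_kv_heads * head_dim * seq_len
    return n_values * bits_per_value / 8


# Assumed, illustrative model shape and context length.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)
seq_len = 8192

full = kv_cache_bytes(seq_len, bits_per_value=32, **shape)
packed = kv_cache_bytes(seq_len, bits_per_value=3, **shape)

print(f"32-bit cache: {full / 2**20:.0f} MiB")
print(f" 3-bit cache: {packed / 2**20:.0f} MiB")
print(f"ideal ratio:  {full / packed:.2f}x")
```

Under these assumptions the cache shrinks from 2,048 MiB to 192 MiB. That idealized ratio (32 / 3, about 10.7x) counts only the stored values and ignores any extra bookkeeping a real implementation may keep, so it is an upper bound for this setup and not a reproduction of the reported "at least 6x" figure.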
The method has two parts. First, PolarQuant rearranges the numbers in a way that makes them easier to compress with less error (like turning a messy pile into neat stacks before packing). Then Quantized Johnson-Lindenstrauss, or QJL, adds a tiny 1-bit correction step to reduce leftover errors while keeping the “inner product” comparisons the model needs (a simple score for how similar two sets of numbers are).
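The idea behind the 1-bit step can be sketched in a few lines of Python. This is a simplified illustration of estimating an inner product from sign bits alone, assuming a random Gaussian projection; it is not Google's implementation, it shows the sign-bit estimate in isolation instead of as a correction on top of PolarQuant, and the function names `qjl_encode` and `qjl_inner_product` are hypothetical.

```python
import math
import random


def dot(a, b):
    """Inner product: a single score for how similar two vectors are."""
    return sum(x * y for x, y in zip(a, b))


def qjl_encode(key, proj):
    """Keep one sign bit per random projection, plus the key's length."""
    signs = [1 if dot(row, key) >= 0 else -1 for row in proj]
    return signs, math.sqrt(dot(key, key))


def qjl_inner_product(query, signs, key_norm, proj):
    """Estimate dot(query, key) from the sign bits; the query stays unquantized."""
    total = sum(dot(row, query) * s for row, s in zip(proj, signs))
    # sqrt(pi / 2) undoes the shrinkage caused by keeping only signs:
    # for a standard normal g, the average of |g| is sqrt(2 / pi).
    return math.sqrt(math.pi / 2) * key_norm * total / len(proj)


rng = random.Random(0)
d, m = 32, 8000  # vector size and number of projections (both assumed)
proj = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(m)]
query = [rng.gauss(0, 1) for _ in range(d)]
key = [q + 0.5 * rng.gauss(0, 1) for q in query]  # a key similar to the query

signs, key_norm = qjl_encode(key, proj)
estimate = qjl_inner_product(query, signs, key_norm, proj)
exact = dot(query, key)
print(f"exact={exact:.3f}  estimate={estimate:.3f}")
```

The number of projections `m` is deliberately large here so the estimate lands visibly close to the exact score; it is not a realistic bit budget. The point is only that sign bits plus one stored length are enough to recover the similarity score on average.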
Google reports “zero accuracy loss” compared to an uncompressed 32-bit baseline, and says it does not require fine-tuning, which is extra training to recover quality. Benchmarks on models including Gemma, Mistral, and Llama-3.1-8B-Instruct show results matching or slightly exceeding the baseline across long-context tests such as LongBench and Needle-In-A-Haystack.
If these results hold up broadly, more AI services could run longer chats and handle longer documents without needing as much expensive memory, which can lower operating costs and reduce hardware pressure.
Source: Ars Technica