Google released Multi-Token Prediction drafters for Gemma 4, which can make the open models run up to 3x faster on some devices without lowering output quality.
In short: Google released experimental add-ons for its Gemma 4 open models that can make them generate text up to three times faster on some hardware.
Google’s Gemma 4 models are designed to run locally, meaning on your own computer or phone instead of on a company’s servers. This can help with privacy because your prompts and files do not have to leave your device.
This week, Google released what it calls Multi-Token Prediction, or MTP, “drafters” for Gemma 4. These are smaller helper models that try to guess several upcoming words at once. Normally, AI systems like this write text one small piece at a time, called a token (think of a token as a short chunk of text, such as part of a word or a whole word).
MTP uses a method called speculative decoding. A simple way to picture it is a fast assistant writing a rough draft, while the main model acts like an editor. The main model quickly checks the draft, accepts the parts it agrees with, and then continues. Google says this can reduce waiting time because the device spends less time doing slow, repetitive steps.
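To make the draft-and-verify idea concrete, here is a minimal Python sketch. Everything in it is a toy assumption: `main_model` and `draft_model` are simple placeholder functions standing in for real neural networks, and the verification step checks draft tokens one at a time purely for readability, whereas a real system checks the whole draft in a single parallel pass, which is where the time saving comes from. It illustrates speculative decoding in general, not Google's actual MTP implementation.

```python
# Toy sketch: one-token-at-a-time generation versus draft-and-verify decoding.
# `main_model` and `draft_model` are placeholder functions, not real neural networks.

CANNED = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]

def main_model(context):
    """Stand-in for the big, slow model: returns the one next token it wants."""
    return CANNED[len(context) % len(CANNED)]

def generate_plain(prompt, n=9):
    """Normal decoding: one slow call to the main model per token."""
    context = list(prompt)
    for _ in range(n):
        context.append(main_model(context))
    return context

def draft_model(context, k=4):
    """Stand-in for the cheap drafter: guesses k tokens ahead, sometimes badly."""
    guesses = []
    for i in range(k):
        tok = main_model(context + guesses)
        if (len(context) + i) % 5 == 0:
            tok = "umm"                      # a deliberately wrong guess now and then
        guesses.append(tok)
    return guesses

def generate_speculative(prompt, n=9, k=4):
    """Draft-and-verify decoding: accept the drafter's tokens until they disagree."""
    context = list(prompt)
    produced = 0
    while produced < n:
        draft = draft_model(context, k)      # fast rough draft of k tokens
        accepted = []
        for tok in draft:                    # the "editor" checks the draft
            expected = main_model(context + accepted)
            if tok == expected:
                accepted.append(tok)         # agreement: keep the drafted token
            else:
                accepted.append(expected)    # disagreement: keep the editor's word, stop
                break
        context.extend(accepted)
        produced += len(accepted)
    return context[:len(prompt) + n]

print(" ".join(generate_plain(["start:"])))        # 9 slow steps
print(" ".join(generate_speculative(["start:"])))  # same text, fewer verification rounds
```

In this toy, both loops produce exactly the same text; the speedup in a real system comes from the main model verifying a whole batch of drafted tokens per pass instead of producing a single token per pass.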
Google reports speedups of about 2.8x and 3.1x on Pixel phones for smaller Gemma models, and about 2.5x on Apple’s M4 chip for a larger Gemma model. Google also says there is “zero quality degradation” because the main Gemma model still verifies the draft tokens.
For regular people, faster on-device AI can mean more responsive apps and, on phones, potentially better battery life. “Up to 3x faster” is a best case, though, and results depend on your device. This also does not make the AI more accurate. It mainly makes it quicker at producing the same kind of answers.
Source: Ars Technica