Gemma 4 QAT models enhance device efficiency through model compression

The Gemma 4 series now includes checkpoints trained with Quantization-Aware Training (QAT), which lower memory usage and improve inference speed on devices such as phones and laptops.

Since Gemma 4’s launch two months ago, development has continued. First, Multi-Token Prediction (MTP) was added to speed up inference; more recently, a 12B model was released to fill the gap between the E4B and 26B MoE models.

Today marks the release of new checkpoints that use QAT to make Gemma 4 more efficient, so the models are easier to run on common edge devices and consumer GPUs.

QAT simulates quantization during training, which minimizes quality degradation when the model is compressed. This release includes QAT checkpoints for the widely used Q4_0 quantization format, as well as a new quantization format designed specifically for mobile applications. The mobile format reduces the memory usage of the Gemma 4 E2B model to just 1 GB while maintaining the capabilities and quality expected of Gemma 4.
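To make "simulating quantization" concrete, here is a minimal sketch of a quantize-then-dequantize round trip ("fake quantization") in plain Python. It assumes a simplified symmetric 4-bit scheme with one scale per block of 32 weights, loosely in the spirit of block-wise formats such as Q4_0. It is not the exact Q4_0 encoding, and it is not the Gemma 4 training code; the function names and the random test weights are illustrative only.

```python
import random

BLOCK_SIZE = 32      # weights that share one scale (assumption, Q4_0-like)
QMIN, QMAX = -8, 7   # range of a signed 4-bit integer


def fake_quantize_block(block):
    """Round one block to 4-bit integers, then map it back to floats."""
    max_abs = max(abs(w) for w in block)
    if max_abs == 0.0:
        return list(block)
    scale = max_abs / 7  # simplification: largest magnitude maps to +/-7
    out = []
    for w in block:
        q = round(w / scale)
        q = max(QMIN, min(QMAX, q))  # clamp into the 4-bit range
        out.append(q * scale)
    return out


def fake_quantize(weights, block_size=BLOCK_SIZE):
    """Apply the block-wise round trip to a flat list of weights."""
    out = []
    for start in range(0, len(weights), block_size):
        out.extend(fake_quantize_block(weights[start:start + block_size]))
    return out


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(64)]  # made-up weights
dequantized = fake_quantize(weights)
max_error = max(abs(a - b) for a, b in zip(weights, dequantized))
print(f"largest round-trip error: {max_error:.6f}")
```

The round-trip error printed at the end is the kind of perturbation the model sees in its forward pass during QAT, which is what lets training adapt to it.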

Impact of Quantization

Quantization is vital for running models on consumer hardware because it decreases memory demands and increases decode speed. However, standard Post-Training Quantization (PTQ) often degrades model quality. QAT differs by incorporating quantization into training itself, which yields higher-quality results than typical PTQ methods.
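One common way to fold quantization into training is the straight-through estimator (STE): the forward pass uses quantized weights, while the gradient is applied to full-precision "latent" weights as if rounding were the identity. The sketch below shows a single such update for a toy linear model. The article does not describe the actual Gemma 4 QAT recipe, so the fixed step size, the squared-error loss, and every name here are assumptions chosen for illustration.

```python
STEP = 0.25  # hypothetical fixed quantization step, for illustration only


def fake_quant(w, step=STEP):
    """Snap a weight to the nearest grid point (quantize, then dequantize)."""
    return round(w / step) * step


def qat_step(latent, xs, target, lr=0.05):
    """One SGD step on squared error using a straight-through estimator."""
    # Forward pass: the model only ever sees the quantized weights.
    quantized = [fake_quant(w) for w in latent]
    pred = sum(wq * x for wq, x in zip(quantized, xs))
    err = pred - target
    # Backward pass: treat d(fake_quant)/dw as 1, so the gradient with
    # respect to the quantized weight is applied to the latent weight.
    grads = [2.0 * err * x for x in xs]
    new_latent = [w - lr * g for w, g in zip(latent, grads)]
    return new_latent, err * err


latent = [0.6, -0.1]  # made-up starting weights
latent, loss = qat_step(latent, xs=[2.0, 1.0], target=2.0)
print(latent, loss)
```

PTQ, by contrast, would train on the unquantized weights and call something like `fake_quant` once at the end, so the loss never gets a chance to react to the rounding.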

We applied QAT to the Q4_0 format to improve the quality of the quantized models. For edge models such as E2B and E4B, we developed a separate quantization strategy tailored to mobile devices.

Memory Optimization

  • Standard compression formats can be challenging for mobile processors.
  • A specialized mobile-quantization format ensures smooth performance on mobile hardware.
  • Deploying only the modalities you need, for example a text-only model, can reduce memory usage further, to below 1 GB.
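The memory savings above come down to simple arithmetic: weight storage is roughly parameter count times bits per weight. The sketch below works through that calculation. The 4.5 bits per weight figure assumes the usual Q4_0 block layout (32 four-bit values plus one 16-bit scale). The per-modality parameter counts are hypothetical placeholders, not published Gemma 4 figures, and the estimate ignores activations, the KV cache, and runtime overhead.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# 32 weights at 4 bits each plus one 16-bit scale per block.
Q4_0_BITS_PER_WEIGHT = (32 * 4 + 16) / 32

# Hypothetical parameter counts per component, for illustration only.
HYPOTHETICAL_PARAMS = {
    "text": 2_000_000_000,
    "vision": 300_000_000,
    "audio": 300_000_000,
}


def footprint_gb(modalities, bits_per_weight):
    """Weight memory when only the listed components are deployed."""
    total = sum(HYPOTHETICAL_PARAMS[m] for m in modalities)
    return weight_memory_gb(total, bits_per_weight)


for label, bits in (("16-bit", 16), ("Q4_0-style", Q4_0_BITS_PER_WEIGHT)):
    full = footprint_gb(["text", "vision", "audio"], bits)
    text_only = footprint_gb(["text"], bits)
    print(f"{label}: all modalities {full:.2f} GB, text only {text_only:.2f} GB")
```

Swapping in real parameter counts and the bit width of the target format gives a quick first estimate of whether a model fits on a given device.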

So that these models fit into existing workflows, we have collaborated with popular development tools to support the Gemma 4 QAT checkpoints starting today.

We are eager to see the innovations you will create with Gemma 4 now available for local use!