Why AI Energy Efficiency Efforts Must Focus on Inference

  • Energy, not cost, should be the central parameter for AI efficiency
  • Green AI focuses on reducing the environmental impact of AI systems across their full lifecycle

As artificial intelligence becomes increasingly embedded in business and society, the conversation around sustainability and efficiency is evolving. Vincent Caldeira, APAC CTO at Red Hat and an influential voice in the Green Software Foundation, emphasises that the true sustainability challenge for AI lies not in model training but in inference, the phase in which models deliver predictions and value at scale. “The biggest problem is on the inference side, because inference consumes all the time. It scales with utilisation. The more people use it, the more the ratio is going to be worse,” he said on the sidelines of the Red Hat Summit in Boston.

Energy Optimisation: The Key Metric for AI

Caldeira is unequivocal: energy, not cost, should be the central parameter for AI efficiency. While traditional IT operations have focused on optimising for cost, this approach is no longer sufficient in the era of large-scale AI deployment.

“Energy has become in my view the biggest parameter of actual efficiency management… energy is the only proxy of efficiency as a whole at the system level. There is no other way to measure it.”

He argues that optimising for cost alone can lead to decisions that look efficient financially but are wasteful in resource and energy terms. Instead, organisations should measure and optimise for energy consumption across the entire AI lifecycle, from infrastructure choices to operational deployment.

The Challenges of Optimising AI Inference

Optimising inference is a complex, multi-layered problem. Unlike training, which is typically a bounded, one-off effort, inference runs continuously and varies with user demand, workload location, and infrastructure specifics. Caldeira outlines several key challenges:

  • Workload Placement and Infrastructure Diversity: Deciding where to run inference workloads—on-premises, in the cloud, or at the edge—requires careful consideration of data gravity, energy availability, and hardware capabilities.
  • Resource Right-Sizing: Matching the right GPU or accelerator to the workload is non-trivial, especially as hardware and model requirements evolve.
  • Operational Complexity: Efficient inference demands real-time decisions about routing, caching, and memory allocation, all of which impact energy use.
  • Lack of Standardised Metrics: Different cloud providers report efficiency metrics in inconsistent ways, making it hard for enterprises to compare and optimise across environments.

“Inference is a huge, really difficult optimisation problem. You need to decide based on your infrastructure what’s the best place to run the workload… you have a need to do right sizing… to optimise the allocation, then you’ve got to do routing. And this routing ideally should be cognisant of the use case.”
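
To make the placement-and-routing challenge concrete, here is a minimal, hypothetical sketch of an energy-aware routing decision. The site names, energy figures, and transfer penalty are illustrative assumptions, not a Red Hat implementation; a real router would also weigh latency targets, accelerator availability, and cache locality.

```python
from dataclasses import dataclass

@dataclass
class Site:
    name: str
    joules_per_request: float         # measured or estimated energy per inference request
    grid_carbon_gco2_per_kwh: float   # current grid carbon intensity at the site
    has_data_locally: bool            # proxy for "data gravity": moving data costs energy too

def score(site: Site, data_transfer_penalty_j: float = 50.0) -> float:
    """Estimate grams of CO2 per request at a site (lower is better)."""
    energy_j = site.joules_per_request + (0.0 if site.has_data_locally else data_transfer_penalty_j)
    energy_kwh = energy_j / 3_600_000  # 1 kWh = 3.6 MJ
    return energy_kwh * site.grid_carbon_gco2_per_kwh

# Illustrative candidate locations for the same inference workload.
sites = [
    Site("on-prem-gpu",    joules_per_request=120, grid_carbon_gco2_per_kwh=450, has_data_locally=True),
    Site("cloud-region-a", joules_per_request=80,  grid_carbon_gco2_per_kwh=300, has_data_locally=False),
    Site("edge-node",      joules_per_request=200, grid_carbon_gco2_per_kwh=100, has_data_locally=True),
]

best = min(sites, key=score)
print(f"Route to: {best.name} (~{score(best):.4f} gCO2 per request)")
```

The core trade-off the sketch captures (energy per request versus the carbon intensity of where and when it runs) is the one Caldeira describes as being “cognisant of the use case.”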

Caldeira also points out the environmental impact of hardware over-provisioning: “You have a hugely efficient GPU cloud… and at the same time, the GPU utilisation of the data center is 30%. That’s like a huge problem… you produce three times too many GPUs because you are bad at sharing them.”
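
As a back-of-the-envelope check on that figure: if a fleet averages 30% utilisation, serving the same demand takes roughly 1 / 0.30 ≈ 3.3 times as many accelerators as a fully shared fleet would need, which is the “three times too many GPUs” Caldeira describes.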

Red Hat’s Approach to Inference Optimisation

To address these challenges, Red Hat employs a comprehensive, multi-layered approach to AI inference optimisation, combining advanced software, hardware-aware strategies, and system-level innovations:

  • vLLM-Powered Continuous Batching: Red Hat’s AI Inference Server leverages the open-source vLLM engine for dynamic batching, paged attention, and multi-GPU scaling, delivering up to four times faster token generation and maximising GPU utilisation (a minimal usage sketch follows below).
  • Model Compression via Neural Magic: Techniques such as SparseGPT pruning and quantisation, inherited from Neural Magic, reduce model size by up to 70% with minimal accuracy loss and enable efficient 8-bit inference on standard CPUs and GPUs.
  • Pre-Optimised Model Repository: Red Hat maintains a curated repository of compressed, validated model variants and deployment blueprints, ensuring consistent, high-performance inference across diverse environments.
  • System-Level Innovations: Unified APIs, carbon-aware scheduling, and hybrid cloud routing help direct workloads to the most energy-efficient locations and time windows, further reducing the overall energy footprint.

Collectively, these techniques enable enterprises to reduce inference energy consumption by 50–75% while maintaining or even improving performance.
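
To show what the vLLM layer looks like in practice, here is a minimal offline-inference sketch using vLLM’s Python API. The model name is a placeholder, and Red Hat’s AI Inference Server wraps this engine in a serving layer rather than being called this way directly, so treat it as an illustration of continuous batching, not the product’s own code.

```python
# Minimal vLLM sketch: prompts submitted together are batched dynamically
# ("continuous batching"), and PagedAttention manages KV-cache memory so the
# GPU stays busy instead of idling between requests.
from vllm import LLM, SamplingParams

prompts = [
    "Summarise the benefits of energy-aware scheduling.",
    "Explain GPU right-sizing in one sentence.",
    "What is carbon intensity?",
]

sampling = SamplingParams(temperature=0.7, max_tokens=128)

# Placeholder model id; in practice this would be a compressed / quantised
# checkpoint such as those in Red Hat's validated model repository.
llm = LLM(model="your-org/your-quantised-model")

for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```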

Toward Solutions: Full-Stack and Community-Driven Optimisation

The way forward, Caldeira suggests, is holistic and collaborative. No single technology or optimisation layer suffices; instead, the industry must integrate solutions across data, model, and system levels, and standardise metrics for energy efficiency. “The most value is not one technology… you have to take those technologies and make them work together… to really do full stack optimization.”

He highlights the importance of open standards and community-driven initiatives, such as the Cloud Native Computing Foundation’s work on energy observability (e.g., Kepler) and the Green Software Foundation’s efforts to develop a “software carbon intensity” rating for AI systems.
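
The Green Software Foundation’s Software Carbon Intensity (SCI) specification scores software as operational plus embodied emissions per functional unit: SCI = ((E × I) + M) / R. Below is a minimal sketch of that calculation for an inference service, assuming the energy figure E is pulled from an observability stack such as Kepler (the metric plumbing is omitted, and the numbers are purely illustrative).

```python
def sci(energy_kwh: float, grid_intensity_gco2_per_kwh: float,
        embodied_gco2: float, functional_units: int) -> float:
    """Software Carbon Intensity: ((E * I) + M) per R.

    E = energy consumed (kWh), I = location-based grid carbon intensity,
    M = embodied hardware emissions amortised to this workload,
    R = functional unit (here: number of inference requests served).
    """
    return (energy_kwh * grid_intensity_gco2_per_kwh + embodied_gco2) / functional_units

# Illustrative figures: 12 kWh measured for the serving pods over a window
# (e.g. via Kepler/Prometheus), grid at 400 gCO2/kWh, 500 g of amortised
# embodied emissions, and 1,000,000 requests served in that window.
print(f"{sci(12.0, 400.0, 500.0, 1_000_000):.4f} gCO2 per request")
```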

“We have actually an extremely energy-focused approach… we really look at how we optimize the system for energy, not for cost. Cost is just a byproduct.”

Conclusion

As AI adoption accelerates, the sustainability imperative is clear: inference must be the primary focus for efficiency efforts, with energy optimisation as the guiding metric. The challenges are significant, spanning technical, operational, and organisational domains.
