Overview

An “inference service” exposes a single model for inference. The service is hosted on a cluster of servers or cloud instances, potentially with reserved capacity and the ability to auto-scale with request volume.

Model:

Cluster data:

  • Cluster details
  • Cluster location(s) - if hosted in a cloud, which region(s)
  • Cluster throughput (per server/instance if autoscaling)
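
As an illustration, the cluster data could be captured in a record along these lines (a minimal sketch; the field names, and the inclusion of the PUE and WUE values used in the calculations below, are assumptions rather than a required schema):

    from dataclasses import dataclass

    @dataclass
    class ClusterData:
        """Static description of the base cluster hosting the inference service."""
        name: str                       # cluster identifier
        cloud_regions: list[str]        # cloud region(s), if hosted in a cloud
        instance_type: str              # server / instance details
        throughput_per_instance: float  # requests per hour per server/instance
        pue: float                      # datacenter power usage effectiveness
        wue: float                      # cluster water usage effectiveness (L/kWh)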

Activity data

The activity of the inference service should be provided on an hourly basis so it can be mapped to hourly grid carbon intensity, which varies with the share of renewables (see the sketch after this list):

  • Date and time, aggregated by hour
  • Cluster size (as multiple of base cluster)
  • Average CPU and GPU utilization
  • Use case
  • For key parameters (see below by use case)
    • Number of requests
    • Request latency (or other proxy for computation)
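
For illustration, an hourly activity record could look like the following (a minimal sketch; the field names are assumptions):

    from dataclasses import dataclass, field

    @dataclass
    class HourlyActivity:
        """One hour of activity for the inference service."""
        hour_start: str               # date and time, aggregated by hour, e.g. "2025-01-01T13:00Z"
        cluster_size: float           # as a multiple of the base cluster
        avg_cpu_utilization: float    # 0.0 - 1.0
        avg_gpu_utilization: float    # 0.0 - 1.0
        use_case: str                 # e.g. "image" or "text-generation"
        request_count: int            # number of requests
        avg_request_latency_s: float  # request latency (or another proxy for computation)
        key_parameters: dict = field(default_factory=dict)  # use-case specific, see below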

Image parameters

  • Input and output size
  • Output quality
  • LoRA models used

LoRA model training can be modeled using the same methodology as full model training.

Text generation / chat parameters

  • Input tokens
  • Output tokens
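
The use-case parameters above could then populate the key_parameters of each hourly record, for example (a sketch; the names are assumptions):

    from typing import TypedDict

    class ImageParameters(TypedDict):
        """Key parameters for image generation requests."""
        input_size: str         # e.g. "1024x1024"
        output_size: str
        output_quality: str     # e.g. a quality preset or step count
        lora_models: list[str]  # LoRA models used

    class TextGenerationParameters(TypedDict):
        """Key parameters for text generation / chat requests."""
        input_tokens: int
        output_tokens: int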

Calculating emissions and water usage

For a given hour:

usage emissions = (cluster size) x E(gpu%, cpu%) x (datacenter PUE) x (average grid intensity)

embodied emissions = (cluster size) x EmbEm(1)

inference emissions = (usage emissions) + (embodied emissions)

water usage = (cluster size) x E(gpu%, cpu%) x (cluster WUE)

embodied water usage = (cluster size) x EmbH2O(1)

inference water = (water usage) + (embodied water usage)
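
A minimal sketch of the hourly calculation follows, assuming E(gpu%, cpu%) is the energy in kWh drawn by one base cluster over the hour at the given average utilization, and EmbEm(1) / EmbH2O(1) are the hourly amortized embodied emissions and water for one base cluster (these interpretations, and the function and parameter names, are assumptions):

    def hourly_inference_footprint(
        cluster_size: float,        # multiple of the base cluster
        base_energy_kwh: float,     # E(gpu%, cpu%) for one base cluster this hour
        pue: float,                 # datacenter power usage effectiveness
        grid_intensity: float,      # average grid intensity, gCO2e/kWh
        wue: float,                 # cluster water usage effectiveness, L/kWh
        emb_em_per_cluster: float,  # EmbEm(1): amortized embodied emissions, gCO2e/hour
        emb_h2o_per_cluster: float, # EmbH2O(1): amortized embodied water, L/hour
    ) -> tuple[float, float]:
        """Return (inference emissions in gCO2e, inference water in L) for one hour."""
        usage_emissions = cluster_size * base_energy_kwh * pue * grid_intensity
        embodied_emissions = cluster_size * emb_em_per_cluster
        inference_emissions = usage_emissions + embodied_emissions

        water_usage = cluster_size * base_energy_kwh * wue
        embodied_water = cluster_size * emb_h2o_per_cluster
        inference_water = water_usage + embodied_water

        return inference_emissions, inference_water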

The emissions per request include the amortized training emissions for the base model and any add-ons (such as LoRA), the amortized fine-tuning emissions, and the inference emissions:

Em(inference) = sum(amortized training emissions) + (amortized fine-tuning emissions) + (inference emissions)

H2O(inference) = sum(amortized training water) + (amortized fine-tuning water) + (inference water)
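
A sketch of combining the terms, assuming the amortized training and fine-tuning figures are expressed over the same accounting window as the inference totals (the function and parameter names are illustrative):

    def total_inference_emissions(
        amortized_training_emissions: list[float],  # base model plus add-ons such as LoRA
        amortized_fine_tuning_emissions: float,
        inference_emissions: float,                 # from hourly_inference_footprint
    ) -> float:
        """Em(inference): training + fine-tuning + inference emissions."""
        return (
            sum(amortized_training_emissions)
            + amortized_fine_tuning_emissions
            + inference_emissions
        )

    def total_inference_water(
        amortized_training_water: list[float],
        amortized_fine_tuning_water: float,
        inference_water: float,
    ) -> float:
        """H2O(inference): training + fine-tuning + inference water."""
        return (
            sum(amortized_training_water)
            + amortized_fine_tuning_water
            + inference_water
        )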

Various research findings

Modeling inference requires understanding the emissions per inference for a model on a particular host, given certain parameters. Many factors determine how a model is executed, including batching strategies, paged attention, and optimizations such as quantization. The intention of this methodology is to provide a framework that both calculates and predicts the emissions cost of inference, with the flexibility to incorporate future optimizations as they appear.

Inference speed may be limited by the framework or by compute. Maximum performance is achieved when models are not limited by framework overhead. In an optimal scenario, inference can use the full capacity and power of a GPU, as described in Towards Pareto Optimal Throughput in Small Language Model Serving.

Per Wilkins, Keshav, and Mortier (2024), Offline Energy-Optimal LLM Serving: “The number of input tokens and number of output tokens both individually have a substantial impact on energy consumption and runtime, with output tokens having a larger effect size as indicated by the higher F statistic. Also, the interaction term shows that the input and output tokens depend on each other while impacting energy consumption and runtime.”

The authors therefore propose a model to describe the total energy consumption for a model K as a function of input and output tokens, τ_in and τ_out, respectively:

e_K(τ_in, τ_out) = α_{K,0} τ_in + α_{K,1} τ_out + α_{K,2} τ_in τ_out

This model has high explainability for the effect of input and output tokens on energy and runtime for inference across the LLMs evaluated in that work.
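
Once the coefficients α_{K,0}, α_{K,1}, and α_{K,2} have been fitted for a given model K, the per-request energy estimate is a direct evaluation of that expression. A sketch (the coefficient values in the usage comment are placeholders, not measured results):

    def request_energy(
        input_tokens: int,
        output_tokens: int,
        alpha_0: float,  # coefficient on input tokens for model K
        alpha_1: float,  # coefficient on output tokens for model K
        alpha_2: float,  # coefficient on the input x output interaction term
    ) -> float:
        """Evaluate e_K(t_in, t_out) = alpha_0*t_in + alpha_1*t_out + alpha_2*t_in*t_out."""
        return (
            alpha_0 * input_tokens
            + alpha_1 * output_tokens
            + alpha_2 * input_tokens * output_tokens
        )

    # Placeholder coefficients for illustration only (not measured values):
    # energy = request_energy(input_tokens=512, output_tokens=128,
    #                         alpha_0=0.01, alpha_1=0.1, alpha_2=1e-5)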