Overview

Inference is the process of an AI model responding to a request.

A common example is an AI chatbot. The request and the response consist of text, which is encoded in “tokens” (words or sub-words). Other common AI use cases include image generation, image modification, and video generation.

Typically a model is hosted on a cluster of servers or cloud instances, with the goal of maximizing throughput (tokens or responses per second) for a given latency (time to respond) constraint.

A consumer of AI inference can host a model locally, host a model on a cloud service (e.g. AWS, Together.ai), or use a managed service provider endpoint (e.g. OpenAI, Google Vertex AI).

Scope3’s methodology for inference takes in a model, input and output parameters, and information about your inference service. It then estimates the energy required and environmental impact for that inference.

Inference duration prediction

The key driver of energy usage and environmental impact is the duration of the inference request - how long the computation takes. Many factors influence this duration; in particular, the model used (including its precision, in bytes), the size of the input and output, and the hardware configuration all have a large influence.

There are also many additional factors that determine how a model is executed including batching strategies, paged attention, and optimizations like quantization. The intention of this methodology is to provide a framework that both calculates and predicts the emissions cost of inference with the flexibility to include future optimizations as they appear.

Text-to-Text inference duration prediction

Baseten summarizes LLM inference duration (for a deeper dive, see kipply’s post on inference arithmetic); it can be broken down into two parts:

  1. Input token (“prefill”) duration - the time required to process all of the input tokens
  2. Output token (“decode”) duration - the time required to generate all of the output tokens

Prefill duration is compute bound whereas decode duration is memory bound:

t_{in} = \frac{2 * P_m * n_{in}}{A_{f}(b)}

t_{out} = \frac{b * P_m * n_{out}}{A_{bw}}

where:

  • P_m is the number of parameters in the model
  • n_{in} and n_{out} are the number of input and output tokens
  • b is the precision of the model parameters, in bytes (e.g. 2 for FP16)
  • A_{f}(b) is the computational capacity of the GPU, in FLOPS, at the given precision
  • A_{bw} is the memory bandwidth of the GPU

The total duration is the sum:

t_{total} = t_{in} + t_{out}
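
As a concrete illustration, the sketch below evaluates these expressions in Python for a hypothetical 70B-parameter model served at FP16. The hardware figures (989 TFLOPS of FP16 compute and 3.35 TB/s of memory bandwidth, roughly an NVIDIA H100 SXM) and the token counts are assumptions chosen for the example, not values prescribed by the methodology.

# Theoretical prefill and decode durations for a single request.
# Hardware figures approximate an NVIDIA H100 SXM and are assumptions for this example.
P_M = 70e9             # P_m: model parameters (assumed)
PRECISION_BYTES = 2    # b: bytes per parameter (FP16)
A_F = 989e12           # A_f(b): compute capacity at FP16, FLOPS (assumed)
A_BW = 3.35e12         # A_bw: memory bandwidth, bytes/s (assumed)

def prefill_seconds(n_in: int) -> float:
    """Compute-bound prefill: t_in = 2 * P_m * n_in / A_f(b)."""
    return 2 * P_M * n_in / A_F

def decode_seconds(n_out: int) -> float:
    """Memory-bound decode: t_out = b * P_m * n_out / A_bw."""
    return PRECISION_BYTES * P_M * n_out / A_BW

n_in, n_out = 512, 256
t_in, t_out = prefill_seconds(n_in), decode_seconds(n_out)
print(f"t_in={t_in:.3f}s  t_out={t_out:.2f}s  t_total={t_in + t_out:.2f}s")

For a single unbatched request, decode dominates: every output token requires streaming all b * P_m bytes of weights from memory, which is why serving stacks batch requests to amortize that memory traffic.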

The expressions above are theoretical, so we fit a linear regression to benchmarked inference speed data to predict real-world inference duration:

\hat{t_{total}} = \beta_{in} * t_{in} + \beta_{out} * t_{out}

where:

  • \beta_{in} and \beta_{out} are the learned coefficients of the linear regression
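
A minimal sketch of this calibration step, assuming NumPy and a small table of illustrative (not real) benchmark measurements; in practice the regression is fit on benchmarked inference-speed data for the target hardware and serving stack.

import numpy as np

# Each row: (theoretical t_in, theoretical t_out, measured total seconds).
# Values are illustrative placeholders, not real benchmark data.
benchmarks = np.array([
    [0.05, 2.1, 2.9],
    [0.10, 4.3, 5.6],
    [0.02, 1.0, 1.4],
    [0.08, 3.2, 4.3],
])

X = benchmarks[:, :2]   # theoretical duration components
y = benchmarks[:, 2]    # measured wall-clock durations
betas, *_ = np.linalg.lstsq(X, y, rcond=None)   # no intercept, matching the formula above
beta_in, beta_out = betas

def predicted_total_seconds(t_in: float, t_out: float) -> float:
    """Predicted real-world duration: beta_in * t_in + beta_out * t_out."""
    return beta_in * t_in + beta_out * t_out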

Energy calculation

Given the predicted inference duration, we can calculate the energy use per inference:

E = \hat{t_{total}} * \left[ P_{idle} + \left(N_{gpu} * P_{gpu}\right) + \left(N_{cpu} * P_{cpu}\right) \right]

where:

  • P_{idle}, P_{gpu}, and P_{cpu} are the power consumption of the idle server, each GPU, and each CPU, respectively
  • N_{gpu} and N_{cpu} are the number of GPUs and CPUs on the server
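
A minimal sketch of the energy calculation, with assumed power figures for a hypothetical 8-GPU server; the methodology takes these values from the actual hardware configuration.

def inference_energy_joules(t_total_s: float,
                            p_idle_w: float = 1000.0,   # P_idle: idle server draw, W (assumed)
                            n_gpu: int = 8, p_gpu_w: float = 700.0,    # W per GPU (assumed)
                            n_cpu: int = 2, p_cpu_w: float = 300.0) -> float:  # W per CPU (assumed)
    """E = t_total_hat * [P_idle + N_gpu * P_gpu + N_cpu * P_cpu], in joules (watt-seconds)."""
    return t_total_s * (p_idle_w + n_gpu * p_gpu_w + n_cpu * p_cpu_w)

usage_energy_per_inference_kwh = inference_energy_joules(3.5) / 3.6e6   # joules -> kWh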

Cluster energy use

The cluster is defined by:

  • Cluster details
  • Cluster location(s) - if hosted in a cloud, which region(s)

Calculating emissions and water usage for an inference request

Calculate embodied emissions and water usage using the predicted inference duration (see Server cluster for more information about embodied emissions):

embodied_emissions_per_second = EmbEm(1) / 3600
embodied_emissions_per_inference = predicted_inference_duration x embodied_emissions_per_second

embodied_h2o_per_second = EmbH2O(1) / 3600
embodied_h2o_per_inference = predicted_inference_duration x embodied_h2o_per_second
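
For example, if EmbEm(1) and EmbH2O(1) return the embodied emissions and water attributable to one hour of server time (see Server cluster), the per-inference attribution can be sketched as below; the units (gCO2e and liters per server-hour) are assumptions for the example.

def embodied_impact_per_inference(predicted_inference_duration_s: float,
                                  embodied_emissions_per_hour: float,  # EmbEm(1), e.g. gCO2e per server-hour
                                  embodied_h2o_per_hour: float):       # EmbH2O(1), e.g. liters per server-hour
    """Amortize hourly embodied impact over the seconds the request occupies the server."""
    embodied_emissions_per_second = embodied_emissions_per_hour / 3600
    embodied_h2o_per_second = embodied_h2o_per_hour / 3600
    return (predicted_inference_duration_s * embodied_emissions_per_second,
            predicted_inference_duration_s * embodied_h2o_per_second)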

Calculate the combined pre-training and fine-tuning impact using:

training_emissions_per_inference = pre_training_emissions_per_inference + fine_tuning_emissions_per_inference
training_h2o_per_inference = pre_training_h2o_per_inference + fine_tuning_h2o_per_inference

Sum all emissions and water use to produce total impact for an inference request:

inference_emissions = usage_energy_per_inference x average_grid_intensity +
                    embodied_emissions_per_inference +
                    training_emissions_per_inference

inference_h2o_impact = usage_energy_per_inference x cluster_WUE +
                    embodied_h2o_per_inference +
                    training_h2o_per_inference
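
Putting the pieces together, a sketch of the final aggregation; the units (kWh for energy, gCO2e/kWh for grid intensity, L/kWh for WUE) and the pre_training_*/fine_tuning_* inputs follow the pseudocode above and are otherwise assumptions for the example.

def total_inference_impact(usage_energy_per_inference_kwh: float,
                           average_grid_intensity: float,   # gCO2e per kWh (assumed unit)
                           cluster_wue: float,              # liters per kWh (assumed unit)
                           embodied_emissions_per_inference: float,
                           embodied_h2o_per_inference: float,
                           pre_training_emissions_per_inference: float,
                           fine_tuning_emissions_per_inference: float,
                           pre_training_h2o_per_inference: float,
                           fine_tuning_h2o_per_inference: float):
    """Sum usage, embodied, and amortized training impact for one inference request."""
    training_emissions = pre_training_emissions_per_inference + fine_tuning_emissions_per_inference
    training_h2o = pre_training_h2o_per_inference + fine_tuning_h2o_per_inference

    inference_emissions = (usage_energy_per_inference_kwh * average_grid_intensity
                           + embodied_emissions_per_inference
                           + training_emissions)
    inference_h2o_impact = (usage_energy_per_inference_kwh * cluster_wue
                            + embodied_h2o_per_inference
                            + training_h2o)
    return inference_emissions, inference_h2o_impact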

Closed Model Parameter Estimation

Many providers do not disclose the number of model parameters. For LLMs with unknown parameter counts, we use a linear regression model to produce a best estimate. It is well documented that the price a provider charges for inference is a good predictor of model size. For each popular provider, we fit a linear regression that predicts the number of parameters from the cost, on a log-log scale:

\hat{\ln(P_m^p)} = \beta_{0}^p + \beta_{1}^p * \ln(c_m^p)

where:

  • P_m^p is the number of parameters in model m from provider p
  • c_m^p is the cost of inference for model m from provider p
  • \beta_{0}^p and \beta_{1}^p are the learned coefficients of the linear regression
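
A minimal sketch of one per-provider fit, assuming NumPy; the prices and parameter counts below are illustrative placeholders, not real provider data.

import numpy as np

# Models with disclosed parameter counts and their inference prices at one provider.
# Values are illustrative placeholders, not real data.
known_params = np.array([7e9, 13e9, 70e9, 180e9])
cost = np.array([0.20, 0.40, 1.80, 5.00])   # e.g. $ per million tokens (assumed unit)

X = np.column_stack([np.ones(len(cost)), np.log(cost)])   # intercept + ln(cost)
y = np.log(known_params)
(beta_0, beta_1), *_ = np.linalg.lstsq(X, y, rcond=None)

def estimate_parameters(cost_for_model: float) -> float:
    """Point estimate of parameter count: exp(beta_0 + beta_1 * ln(cost))."""
    return float(np.exp(beta_0 + beta_1 * np.log(cost_for_model)))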

Once we have a prediction for each model-provider pair, we average the predictions, using the inverse squared standard error of each prediction as its weight, to get a final prediction per model:

\hat{P_m} = \frac{\sum_p{w_m^p * \hat{P_m^p}}}{\sum_p{w_m^p}}

w_m^p = \frac{1}{{SE_m^p}^2}
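
A minimal sketch of the weighted average, assuming NumPy; the estimates and standard errors would come from the per-provider regressions above.

import numpy as np

def combine_provider_estimates(estimates, standard_errors) -> float:
    """Inverse-variance weighted average across providers: w_m^p = 1 / (SE_m^p)^2."""
    weights = 1.0 / np.asarray(standard_errors, dtype=float) ** 2
    return float(np.sum(weights * np.asarray(estimates, dtype=float)) / np.sum(weights))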

Conservative Estimates

Rather than the point estimate, we use the upper bound of the 95% confidence interval to produce a conservative estimate of the number of model parameters. The confidence interval is computed from the weighted standard error of the combined prediction, which aggregates the standard errors of the individual per-provider predictions above.
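
One way this can be realized is sketched below, assuming NumPy. The exact form of the weighted standard error is not spelled out here, so the sketch uses the standard error of an inverse-variance weighted mean and a z value of 1.96 for the 95% interval; both are assumptions of the example.

import numpy as np

def conservative_parameter_estimate(estimates, standard_errors, z: float = 1.96) -> float:
    """Upper bound of an approximate 95% confidence interval on the weighted estimate."""
    weights = 1.0 / np.asarray(standard_errors, dtype=float) ** 2
    p_hat = np.sum(weights * np.asarray(estimates, dtype=float)) / np.sum(weights)
    weighted_se = np.sqrt(1.0 / np.sum(weights))   # SE of the inverse-variance weighted mean (assumed form)
    return float(p_hat + z * weighted_se)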

Various Additional Research Findings

Inference speed may be limited by the serving framework or by the available compute. Maximum performance is achieved when models are not limited by framework overhead; in that optimal scenario, inference can use the full capacity and power of the GPU, as described in Towards Pareto Optimal Throughput in Small Language Model Serving.