Methodology for calculating the impact of running AI inference
Inference is the process of an AI model responding to a request.
A common example is an AI chatbot. The request and the response consist of text, which is encoded in “tokens” (words or sub-words). Other common AI use cases include image generation, image modification, and video generation.
Typically a model is hosted on a cluster of servers or cloud instances, with the goal of maximizing throughput (tokens or responses per second) for a given latency (time to respond) constraint.
A consumer of AI inference can host a model locally, host one on a cloud service (e.g. AWS, Together.ai), or use a managed service provider’s endpoint (e.g. OpenAI, Google Vertex AI).
Scope3’s methodology for inference takes a model, input and output parameters, and information about your inference service, then estimates the energy required and the environmental impact of that inference.
The key driver of energy usage and environmental impact is the duration of the inference request: how long the computation takes. Many factors influence this duration, most notably the model used (including the model’s precision, in bytes), the size of the input and output, and the hardware configuration.
There are also many additional factors that determine how a model is executed, including batching strategies, paged attention, and optimizations such as quantization. The intention of this methodology is to provide a framework that both calculates and predicts the emissions cost of inference, with the flexibility to incorporate future optimizations as they appear.
Baseten summarizes LLM inference duration (for a deeper dive, see kipply’s post on inference arithmetic). Inference duration can be broken down into two parts: the prefill phase, which processes the input (prompt) tokens, and the decode phase, which generates the output tokens one at a time.
Prefill duration is compute bound whereas decode duration is memory bound:
where:
The total duration is the sum:
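To make these relationships concrete, here is a minimal sketch of the theoretical duration estimates, assuming the common roofline-style approximations (roughly 2 FLOPs per parameter per token for prefill, and one full pass over the model weights per output token for decode). The hardware figures are illustrative assumptions, not part of the methodology.

```python
def prefill_duration_s(n_params: float, n_input_tokens: int,
                       peak_flops_per_s: float) -> float:
    # Compute bound: roughly 2 FLOPs per parameter per input token,
    # processed in parallel across the prompt.
    return 2 * n_params * n_input_tokens / peak_flops_per_s


def decode_duration_s(n_params: float, bytes_per_param: float,
                      n_output_tokens: int, mem_bw_bytes_per_s: float) -> float:
    # Memory bound: each output token requires streaming the full model
    # weights (n_params * bytes_per_param bytes) from GPU memory.
    return n_params * bytes_per_param * n_output_tokens / mem_bw_bytes_per_s


# Example: an 8B-parameter model in fp16 on a single H100-class GPU
# (~1e15 FLOP/s, ~3.35e12 B/s HBM bandwidth -- assumed figures).
total_s = (prefill_duration_s(8e9, 1000, 1e15)
           + decode_duration_s(8e9, 2, 200, 3.35e12))
print(f"{total_s:.2f} s")  # ~0.97 s
```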
These expressions are theoretical. Therefore we fit a linear regression to benchmarked inference speed data to predict real-world inference duration:
where:
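As an illustration of this fitting step, a minimal sketch assuming a simple feature set of input and output token counts and made-up benchmark rows; the actual regressors and benchmark data are defined by the methodology, not here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical benchmark rows: (input_tokens, output_tokens, measured_duration_s)
benchmarks = np.array([
    [128,   64, 0.9],
    [512,  128, 1.8],
    [1024, 256, 3.5],
    [2048, 512, 7.1],
])

X, y = benchmarks[:, :2], benchmarks[:, 2]
regression = LinearRegression().fit(X, y)

def predict_duration_s(input_tokens: int, output_tokens: int) -> float:
    # Predict real-world duration for an unseen request size.
    return float(regression.predict([[input_tokens, output_tokens]])[0])

print(f"{predict_duration_s(1500, 300):.2f} s")
```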
Given the predicted inference duration, we can calculate the energy use per inference:
where:
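A hedged sketch of this energy calculation, assuming energy is the product of average hardware power, data-center PUE, and the predicted duration; the power draw, utilization, overhead, and PUE figures below are illustrative assumptions.

```python
def inference_energy_kwh(duration_s: float,
                         gpu_count: int,
                         gpu_power_w: float,
                         utilization: float = 1.0,
                         non_gpu_overhead_w: float = 0.0,
                         pue: float = 1.2) -> float:
    # Average power drawn by the serving hardware during the request,
    # scaled by data-center PUE for cooling and power distribution.
    it_power_w = gpu_count * gpu_power_w * utilization + non_gpu_overhead_w
    return it_power_w * pue * duration_s / 3_600_000  # W*s (J) -> kWh

# Example: 2 GPUs at 700 W each, 60% utilization, serving a 3.5 s request.
print(inference_energy_kwh(3.5, 2, 700.0, utilization=0.6))  # ~0.00098 kWh
```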
The cluster is defined by:
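For illustration only, a hypothetical stand-in for a cluster description; the real field list is specified in the Server cluster documentation, so every field below is an assumed placeholder.

```python
from dataclasses import dataclass

@dataclass
class Cluster:
    gpu_count: int                        # accelerators serving the model
    gpu_power_w: float                    # per-GPU power draw under load
    pue: float                            # data-center power usage effectiveness
    wue_l_per_kwh: float                  # data-center water usage effectiveness
    grid_intensity_kgco2e_per_kwh: float  # carbon intensity of the local grid
    embodied_emissions_kgco2e: float      # manufacturing emissions of the hardware
    hardware_lifetime_s: float            # expected service life
```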
Calculate embodied emissions and water usage using the predicted inference duration (see Server cluster for more information about embodied emissions):
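A minimal sketch of one plausible amortization approach, assuming embodied emissions and water are allocated in proportion to the request’s share of the hardware’s service life.

```python
def embodied_impact_per_inference(duration_s: float,
                                  embodied_emissions_kgco2e: float,
                                  embodied_water_l: float,
                                  hardware_lifetime_s: float) -> tuple[float, float]:
    # Allocate embodied impacts by the fraction of hardware lifetime used.
    share = duration_s / hardware_lifetime_s
    return embodied_emissions_kgco2e * share, embodied_water_l * share

# Example: a 3.5 s request on hardware with 3,000 kgCO2e and 10,000 L of
# embodied impact over a 5-year (~1.58e8 s) service life.
print(embodied_impact_per_inference(3.5, 3_000.0, 10_000.0, 5 * 365 * 24 * 3600))
```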
Calculate the training and fine-tuning emissions using:
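A hedged sketch assuming training and fine-tuning emissions are amortized evenly across the inferences the model is expected to serve; the actual allocation basis (per request, per token, and so on) is defined by the methodology.

```python
def training_emissions_per_inference(training_kgco2e: float,
                                     fine_tuning_kgco2e: float,
                                     expected_total_inferences: float) -> float:
    # Spread one-time training costs across the model's deployment life.
    return (training_kgco2e + fine_tuning_kgco2e) / expected_total_inferences

# Example: 500 tCO2e of training plus 5 tCO2e of fine-tuning amortized over
# 10 billion requests.
print(training_emissions_per_inference(500_000.0, 5_000.0, 10e9))  # ~5.05e-05 kgCO2e
```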
Sum all emissions and water use to produce the total impact for an inference request:
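Putting the pieces together, an illustrative summation of the per-request components: operational emissions from energy and grid intensity, water from WUE, plus the embodied and training shares. The grid intensity and WUE figures are assumed, not Scope3 values.

```python
def total_impact(energy_kwh: float,
                 grid_intensity_kgco2e_per_kwh: float,
                 wue_l_per_kwh: float,
                 embodied_kgco2e: float,
                 embodied_water_l: float,
                 training_kgco2e: float) -> tuple[float, float]:
    # Operational emissions and water scale with energy use; embodied and
    # amortized training shares are added on top.
    emissions_kgco2e = (energy_kwh * grid_intensity_kgco2e_per_kwh
                        + embodied_kgco2e + training_kgco2e)
    water_l = energy_kwh * wue_l_per_kwh + embodied_water_l
    return emissions_kgco2e, water_l

# Example using the illustrative figures from the sketches above
# (assumed 0.4 kgCO2e/kWh grid intensity and 1.8 L/kWh WUE).
print(total_impact(0.00098, 0.4, 1.8, 6.7e-5, 2.2e-4, 5.05e-5))
```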
Many providers do not disclose the number of model parameters. For LLMs with unknown parameter counts, we use a linear regression model to produce a best estimate. It is well documented that the cost a provider charges for inference is a good predictor of model size. For each popular provider, we fit a linear regression model to predict the number of parameters from that cost, on a log-log scale:
where:
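An illustrative log-log fit for a single provider, predicting log(parameters) from log(price); the prices and parameter counts below are made up, and the real fit uses the provider’s actual price list and models with known parameter counts.

```python
import numpy as np

# Hypothetical rows: (price_usd_per_million_tokens, known_parameter_count)
known = np.array([
    [0.2,   8e9],
    [0.9,  70e9],
    [3.0, 400e9],
])

# Fit log(parameters) = intercept + slope * log(price)
slope, intercept = np.polyfit(np.log(known[:, 0]), np.log(known[:, 1]), 1)

def predict_params(price_usd_per_million_tokens: float) -> float:
    return float(np.exp(intercept + slope * np.log(price_usd_per_million_tokens)))

print(f"{predict_params(1.5):.3e}")  # estimated parameter count at this price
```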
Once we have a prediction for each model-provider pair, we average the predictions, weighting each by the inverse of its squared standard error, to get a final prediction per model:
Instead of the point estimate, we use the 95% confidence interval to produce a conservative estimate of the number of model parameters. This interval is computed from the weighted standard error of the pooled prediction, which combines the standard errors of the individual predictions described above.
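A sketch of the inverse-variance pooling and the conservative upper bound, assuming the averaging is done in log-parameter space and that 1.96 pooled standard errors approximate the 95% interval.

```python
import numpy as np

def pooled_estimate(log_predictions: np.ndarray, std_errors: np.ndarray):
    weights = 1.0 / std_errors**2                  # inverse squared standard error
    mean = np.sum(weights * log_predictions) / np.sum(weights)
    pooled_se = np.sqrt(1.0 / np.sum(weights))     # SE of the weighted mean
    return mean, pooled_se

# Hypothetical per-provider predictions of log(parameter count) and their SEs.
log_preds = np.log(np.array([65e9, 80e9, 72e9]))
std_errs = np.array([0.30, 0.15, 0.20])

mean, se = pooled_estimate(log_preds, std_errs)
conservative_params = np.exp(mean + 1.96 * se)     # upper end of the 95% CI
print(f"{conservative_params:.3e}")
```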
Inference speed may be limited by the serving framework or by the compute itself. Maximum performance is achieved when models are not limited by framework overhead; in that optimal scenario, inference can use the full capacity and power of a GPU, as described in Towards Pareto Optimal Throughput in Small Language Model Serving.